Health Monitoring

Monitor the health and performance of your New Hires Reporting System deployment.

Overview

Effective monitoring ensures:

  • ✅ Early detection of issues
  • ✅ Quick troubleshooting
  • ✅ Performance optimization
  • ✅ Capacity planning
  • ✅ AWS cost management
  • ✅ Audit trails

Health Check Endpoints

Backend Health

Endpoint: GET /health

URL: http://localhost:8000/health (or https://api.your-domain.com/health)

Response:

{
  "status": "healthy",
  "database": "connected",
  "workers": "operational",
  "version": "v1.0.0",
  "timestamp": "2025-01-24T10:00:00Z"
}

Check Command:

curl http://localhost:8000/health | jq

Status Indicators:

  • status: "healthy" - Backend API is operational
  • database: "connected" - PostgreSQL is accessible
  • workers: "operational" - Worker service is running
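
For a quick scripted check of these fields, something like the following works (a sketch using jq; field names match the response shown above):

curl -s http://localhost:8000/health | \
  jq -r '"status: \(.status)  database: \(.database)  workers: \(.workers)"'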

Unhealthy Response:

{
  "status": "unhealthy",
  "database": "disconnected",
  "error": "Could not connect to database"
}


Frontend Health

Endpoint: GET / (root path)

URL: http://localhost:8080 (or https://your-domain.com)

Check Command:

curl -I http://localhost:8080

Expected: HTTP/1.1 200 OK


Database Health

Check Command:

docker exec newhires-db pg_isready -U newhires

Expected: newhires-db:5432 - accepting connections

Alternative:

docker exec newhires-db psql -U newhires -d newhires -c "SELECT 1;"


Worker Health

Check Worker is Running:

docker ps | grep newhires-workers

Check Worker Logs:

docker logs newhires-workers --tail=20

Expected: log lines such as INFO: Worker polling for jobs...

Check Job Queue:

docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT status, COUNT(*) FROM correction_jobs GROUP BY status;"

Expected Output:

   status    | count
-------------+-------
 pending     |     2
 processing  |     1
 completed   |    45
 failed      |     0


Docker Health Checks

Health checks are configured in docker-compose.prod.yml:

Backend Health Check

backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s      # Check every 30 seconds
    timeout: 10s       # Timeout after 10 seconds
    retries: 3         # Mark unhealthy after 3 failures
    start_period: 30s  # Grace period on startup

Database Health Check

db:
  healthcheck:
    test: ["CMD", "pg_isready", "-U", "newhires"]
    interval: 10s
    timeout: 5s
    retries: 5

View Health Status

# Check all services
docker-compose -f docker-compose.prod.yml ps

# Output shows health status:
# NAME                    STATUS
# newhires-backend        Up (healthy)
# newhires-db             Up (healthy)
# newhires-workers        Up
# newhires-frontend       Up

# View detailed health check logs
docker inspect --format='{{json .State.Health}}' newhires-backend | jq

Logging

View Logs

Backend Logs:

# Live logs
docker-compose -f docker-compose.prod.yml logs -f backend

# Last 100 lines
docker-compose -f docker-compose.prod.yml logs --tail=100 backend

# With timestamps
docker-compose -f docker-compose.prod.yml logs -f -t backend

# Since specific time
docker-compose -f docker-compose.prod.yml logs --since "2025-01-24T10:00:00" backend

Worker Logs (most important for monitoring):

# Live logs
docker logs newhires-workers -f

# Last 100 lines
docker logs newhires-workers --tail=100

# Errors only
docker logs newhires-workers | grep -i error

Frontend Logs:

docker-compose -f docker-compose.prod.yml logs -f frontend

Database Logs:

docker logs newhires-db --tail=50

All Services:

docker-compose -f docker-compose.prod.yml logs -f

Log Indicators

System logs use emoji indicators for easy scanning:

Emoji  Meaning       Example
✅     Success       ✅ File validation completed
❌     Error         ❌ Validation failed: Invalid format
🤖     AI Activity   🤖 Calling AWS Bedrock API...
🔍     Search        🔍 Searching for employer data
📝     Correction    📝 Applied 18 corrections
⚠️     Warning       ⚠️ High job queue depth
🔄     Processing    🔄 Worker picked up job

Log Levels

Control log verbosity with LOG_LEVEL in .env:

Production (LOG_LEVEL=INFO):

  • Normal operational messages
  • Errors and warnings
  • No sensitive data
  • Recommended for production

Debug (LOG_LEVEL=DEBUG):

  • Verbose output
  • Full stack traces
  • AWS Bedrock request/response details
  • Database queries
  • Use for troubleshooting only

Warning (LOG_LEVEL=WARNING):

  • Minimal output
  • Only warnings and errors
  • Use for stable production
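
For example, to enable verbose logging while troubleshooting (a sketch; assumes the compose file and service names used elsewhere on this page):

# In .env
LOG_LEVEL=DEBUG

# Recreate the services so they pick up the new value
docker-compose -f docker-compose.prod.yml up -d backend workers

# Switch back to LOG_LEVEL=INFO when done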


Log Persistence

Default Behavior

Docker containers use the JSON-file logging driver. Logs are stored in:

/var/lib/docker/containers/<container-id>/<container-id>-json.log
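
To see how much disk these files consume before setting up rotation (a sketch; needs root because the directory belongs to Docker):

sudo du -sh /var/lib/docker/containers/*/*-json.log | sort -h | tail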

Configure Log Rotation

Add to docker-compose.prod.yml:

services:
  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"      # Max log file size
        max-file: "3"        # Keep last 3 files
        compress: "true"     # Compress rotated logs

  workers:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        compress: "true"

Alternative: Syslog

Send logs to system syslog:

services:
  backend:
    logging:
      driver: "syslog"
      options:
        tag: "newhires-backend"

  workers:
    logging:
      driver: "syslog"
      options:
        tag: "newhires-workers"

View with:

sudo tail -f /var/log/syslog | grep newhires


Resource Monitoring

Real-Time Resource Usage

# All containers
docker stats

# Specific container
docker stats newhires-workers

# Output:
# CONTAINER           CPU %    MEM USAGE / LIMIT    MEM %    NET I/O
# newhires-backend    2.5%     450MiB / 2GiB       22%      1.2kB / 850B
# newhires-workers    8.2%     680MiB / 2GiB       34%      4.1kB / 2.3kB
# newhires-frontend   1.2%     180MiB / 1GiB       18%      2.4kB / 1.1kB
# newhires-db         3.1%     320MiB / 1GiB       32%      3.2kB / 2.8kB

Resource Alerts

Monitor for:

  • High CPU (>80% sustained) - May need scaling or optimization
  • High Memory (>80%) - Possible memory leak; may need more RAM
  • Workers stuck - Check for Bedrock timeouts
  • Database connections - Check the connection pool
  • High Disk I/O - May need faster storage
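
A minimal threshold check built on docker stats, using the limits above (a sketch; tune the 80% cutoffs to your own baseline):

docker stats --no-stream --format "{{.Container}} {{.CPUPerc}} {{.MemPerc}}" | \
  awk '{cpu=$2; mem=$3; gsub("%","",cpu); gsub("%","",mem);
        if (cpu+0 > 80) print "⚠️ High CPU:", $1, $2;
        if (mem+0 > 80) print "⚠️ High memory:", $1, $3}'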

Check Specific Metrics

# Container CPU usage
docker stats --no-stream --format "{{.Container}}: {{.CPUPerc}}"

# Memory usage
docker stats --no-stream --format "{{.Container}}: {{.MemUsage}}"

# All metrics, no stream
docker stats --no-stream

Monitoring Script

Create /usr/local/bin/monitor-newhires.sh:

#!/bin/bash
# New Hires Reporting System Health Check

echo "=== New Hires Reporting System Health Check ==="
echo "Time: $(date)"
echo ""

# Check if containers are running
echo "=== Container Status ==="
docker-compose -f /opt/newhires-reporting/docker-compose.prod.yml ps
echo ""

# Backend health
echo "=== Backend Health ==="
BACKEND_HEALTH=$(curl -s http://localhost:8000/health)
echo "$BACKEND_HEALTH" | jq .

if echo "$BACKEND_HEALTH" | jq -e '.status == "healthy"' > /dev/null; then
    echo "✅ Backend is healthy"
else
    echo "❌ Backend is unhealthy!"
    echo "$BACKEND_HEALTH" | jq .
fi
echo ""

# Frontend health
echo "=== Frontend Health ==="
FRONTEND_STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080)
if [ "$FRONTEND_STATUS" = "200" ]; then
    echo "✅ Frontend is healthy (HTTP $FRONTEND_STATUS)"
else
    echo "❌ Frontend is unhealthy (HTTP $FRONTEND_STATUS)"
fi
echo ""

# Database health
echo "=== Database Health ==="
DB_STATUS=$(docker exec newhires-db pg_isready -U newhires 2>&1)
if echo "$DB_STATUS" | grep -q "accepting connections"; then
    echo "✅ Database is healthy"
else
    echo "❌ Database is unhealthy"
    echo "$DB_STATUS"
fi
echo ""

# Worker health
echo "=== Worker Status ==="
WORKER_RUNNING=$(docker ps | grep newhires-workers)
if [ -n "$WORKER_RUNNING" ]; then
    echo "✅ Worker is running"
    echo "Recent worker activity:"
    docker logs newhires-workers --tail=5
else
    echo "❌ Worker is not running!"
fi
echo ""

# Job queue
echo "=== Job Queue Status ==="
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT status, COUNT(*) FROM correction_jobs GROUP BY status;" 2>/dev/null
echo ""

# Resource usage
echo "=== Resource Usage ==="
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
echo ""

# Recent errors
echo "=== Recent Errors (last hour) ==="
echo "Backend errors:"
docker-compose -f /opt/newhires-reporting/docker-compose.prod.yml logs --since 1h backend 2>&1 | \
  grep -i error | tail -3

echo ""
echo "Worker errors:"
docker logs newhires-workers --since 1h 2>&1 | grep -i error | tail -3

echo ""
echo "=== AWS Bedrock Activity (last hour) ==="
docker logs newhires-workers --since 1h | grep -i bedrock | tail -5

echo ""
echo "=== End Health Check ==="

Make executable:

sudo chmod +x /usr/local/bin/monitor-newhires.sh

Run manually:

/usr/local/bin/monitor-newhires.sh

Schedule with cron (every 5 minutes):

# Add to crontab
crontab -e

# Add this line:
*/5 * * * * /usr/local/bin/monitor-newhires.sh >> /var/log/newhires-health.log 2>&1


AWS Bedrock Monitoring

Track Bedrock Usage

# Count Bedrock API calls
docker logs newhires-workers | grep -i "bedrock" | wc -l

# View token usage
docker logs newhires-workers | grep "tokens used"

# Average response time
docker logs newhires-workers | \
    grep "response time" | \
    awk '{sum+=$NF; count++} END {print sum/count "s"}'

Monitor Costs

# Track corrections processed today
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT COUNT(*) FROM correction_jobs
   WHERE status='completed'
   AND completed_at >= CURRENT_DATE;"

# Estimate daily cost
# Multiply count by ~$0.045 for Claude or ~$0.015 for Llama
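
A sketch that turns this estimate into one command (assumes the Claude price above and that bc is installed):

COUNT=$(docker exec newhires-db psql -U newhires -d newhires -t -A -c \
  "SELECT COUNT(*) FROM correction_jobs
   WHERE status='completed' AND completed_at >= CURRENT_DATE;")
echo "Corrections today: $COUNT (~\$$(echo "$COUNT * 0.045" | bc) with Claude)"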

Check for Bedrock Errors

# Recent Bedrock errors
docker logs newhires-workers | grep -i "bedrock.*error"

# Throttling events
docker logs newhires-workers | grep -i "throttl"

# Access denied errors
docker logs newhires-workers | grep -i "access.*denied"

Alerting

Email Alerts on Errors

Create /usr/local/bin/alert-on-errors.sh:

#!/bin/bash
# Alert on critical errors

EMAIL="admin@your-domain.com"
ERROR_THRESHOLD=10

# Count worker errors in last 5 minutes
ERROR_COUNT=$(docker logs newhires-workers --since 5m 2>&1 | grep -i "error" | wc -l)

if [ $ERROR_COUNT -gt $ERROR_THRESHOLD ]; then
    echo "Critical: $ERROR_COUNT errors detected in workers" | \
        mail -s "New Hires Alert: High Error Count" $EMAIL
fi

# Check if workers are stuck
PENDING_COUNT=$(docker exec newhires-db psql -U newhires -d newhires -t -c \
  "SELECT COUNT(*) FROM correction_jobs WHERE status='pending' AND created_at < NOW() - INTERVAL '10 minutes';")

if [ $PENDING_COUNT -gt 5 ]; then
    echo "Warning: $PENDING_COUNT jobs stuck in pending status" | \
        mail -s "New Hires Alert: Jobs Stuck" $EMAIL
fi

Add to crontab (every 5 minutes):

*/5 * * * * /usr/local/bin/alert-on-errors.sh

Slack Alerts

#!/bin/bash
# Send Slack notification

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

send_slack_alert() {
    local message=$1
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$message\"}" \
        $SLACK_WEBHOOK
}

# Check backend health
HEALTH=$(curl -s http://localhost:8000/health)
if ! echo "$HEALTH" | jq -e '.status == "healthy"' > /dev/null; then
    send_slack_alert "⚠️ New Hires Reporting backend is unhealthy!"
fi

# Check for stuck jobs
STUCK=$(docker exec newhires-db psql -U newhires -d newhires -t -c \
  "SELECT COUNT(*) FROM correction_jobs WHERE status='processing' AND started_at < NOW() - INTERVAL '5 minutes';")

if [ $STUCK -gt 0 ]; then
    send_slack_alert "⚠️ $STUCK correction jobs stuck in processing"
fi

External Monitoring

Uptime Monitoring

Use external services to monitor availability:

  • UptimeRobot (free tier available)
  • Pingdom
  • StatusCake
  • Better Uptime

Monitor URLs:

  • Frontend: https://your-domain.com
  • Backend API: https://api.your-domain.com/health

Alert Frequency: Check every 5 minutes

Set Alerts For:

  • HTTP 200 status code not returned
  • Response time > 10 seconds
  • SSL certificate expiration (a local check is sketched below)
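
External monitors usually cover certificate checks, but you can verify expiry locally as well (a sketch using openssl; substitute your real domain):

echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | \
  openssl x509 -noout -enddate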


Metrics to Monitor

System Metrics

Metric        Normal Range   Alert Threshold   Action
CPU Usage     10-30%         >80%              Scale workers or optimize
Memory Usage  30-60%         >85%              Check for leaks, add RAM
Disk Usage    <70%           >85%              Clean logs, add storage
Network I/O   Low            Spikes            Check for issues

Application Metrics

Metric               Normal Range   Alert Threshold   Action
Validation Time      1-5s           >30s              Check file size, API
Correction Time      30-120s        >300s             Check Bedrock, workers
Job Queue Depth      0-5            >20               Scale workers
Failed Jobs          <1%            >5%               Check worker logs
AWS Bedrock Latency  2-10s          >60s              Check AWS status

Business Metrics

  • Files Validated - Track usage
  • Corrections Applied - Measure AI effectiveness
  • AWS Bedrock Costs - Monitor spend
  • User Activity - Track peak usage times
  • Error Rate by State - Identify format issues
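
As a starting point for the first two metrics, a daily rollup from the job table (a sketch; table and column names taken from the queries earlier on this page):

docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT DATE(completed_at) AS day, COUNT(*) AS corrections
   FROM correction_jobs
   WHERE status='completed'
   GROUP BY day
   ORDER BY day DESC
   LIMIT 7;"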

Log Analysis

Find Validation Errors

# Count errors by type (from validation results)
docker logs newhires-backend | grep "error_type" | \
    cut -d'"' -f4 | sort | uniq -c | sort -rn

# Example output:
#   45 MISSING_REQUIRED_FIELD
#   23 INVALID_FORMAT
#   12 INVALID_LENGTH

Track Bedrock Usage

# Count Bedrock API calls
docker logs newhires-workers | grep "Calling AWS Bedrock" | wc -l

# View token usage
docker logs newhires-workers | grep "tokens used"

# Check for throttling
docker logs newhires-workers | grep -i "throttl"

Monitor Job Processing

# Recent job completions
docker logs newhires-workers | grep "Correction job completed" | tail -10

# Failed jobs
docker logs newhires-workers | grep "Job failed" | tail -10

# Average processing time (if logged)
docker logs newhires-workers | grep "completed in" | \
    awk '{print $(NF-1)}' | awk '{sum+=$1; count++} END {print sum/count " seconds"}'

Troubleshooting Common Issues

High Memory Usage

# Check memory breakdown
docker stats --no-stream

# If workers high:
# 1. Check for stuck Bedrock calls
# 2. Check job queue depth
# 3. Restart workers:
docker-compose -f docker-compose.prod.yml restart workers

# If database high:
# 1. Check connection count
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT count(*) FROM pg_stat_activity;"
# 2. Consider vacuum
docker exec newhires-db psql -U newhires -d newhires -c "VACUUM ANALYZE;"

High CPU Usage

# Check which container
docker stats --no-stream

# Common causes:
# 1. Multiple concurrent jobs
# 2. Large files being processed
# 3. AWS Bedrock timeouts causing retries

# Check concurrent jobs
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT COUNT(*) FROM correction_jobs WHERE status='processing';"

# Scale workers if needed
docker-compose -f docker-compose.prod.yml up -d --scale workers=3

Jobs Stuck in Pending

# Check worker is running
docker ps | grep newhires-workers

# Check worker logs
docker logs newhires-workers --tail=50

# Look for AWS errors
docker logs newhires-workers | grep -i "error\|exception"

# Restart workers
docker-compose -f docker-compose.prod.yml restart workers

Database Connection Issues

# Check database is running
docker ps | grep newhires-db

# Test connection
docker exec newhires-db pg_isready -U newhires

# Check connections
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT count(*) FROM pg_stat_activity;"

# Restart database (will cause brief downtime)
docker-compose -f docker-compose.prod.yml restart db

Best Practices

  • ✅ Monitor health endpoints every 1-5 minutes
  • ✅ Set up alerts for critical failures
  • ✅ Review logs daily in production
  • ✅ Track resource trends over time
  • ✅ Set log rotation to prevent disk fill
  • ✅ Keep at least 7 days of logs
  • ✅ Monitor AWS Bedrock costs weekly
  • ✅ Test alerting systems regularly
  • ✅ Document normal baseline metrics
  • ✅ Create runbooks for common issues
  • ✅ Monitor job queue depth
  • ✅ Track worker processing times
  • ✅ Set up external uptime monitoring
  • ✅ Review AWS CloudWatch for Bedrock metrics

Next Steps