Health Monitoring¶
Monitor the health and performance of your New Hires Reporting System deployment.
Overview¶
Effective monitoring ensures:
- ✅ Early detection of issues
- ✅ Quick troubleshooting
- ✅ Performance optimization
- ✅ Capacity planning
- ✅ AWS cost management
- ✅ Audit trails
Health Check Endpoints¶
Backend Health¶
Endpoint: GET /health
URL: http://localhost:8000/health (or https://api.your-domain.com/health)
Response:
```json
{
  "status": "healthy",
  "database": "connected",
  "workers": "operational",
  "version": "v1.0.0",
  "timestamp": "2025-01-24T10:00:00Z"
}
```
Check Command:
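A minimal check, assuming the backend is exposed on port 8000 as in the URL above:

```shell
# Query the health endpoint and pretty-print the JSON response
curl -s http://localhost:8000/health | jq .
```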
Status Indicators:
- status: "healthy" - Backend API is operational
- database: "connected" - PostgreSQL is accessible
- workers: "operational" - Worker service is running
Unhealthy Response:
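The exact payload depends on your backend version, so treat this as an illustrative shape rather than the definitive format:

```json
{
  "status": "unhealthy",
  "database": "disconnected",
  "workers": "unknown",
  "version": "v1.0.0",
  "timestamp": "2025-01-24T10:05:00Z"
}
```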
Frontend Health¶
Endpoint: GET / (root path)
URL: http://localhost:8080 (or https://your-domain.com)
Check Command:
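Assuming the frontend is served on port 8080 as in the URL above:

```shell
# Request only the response headers from the frontend
curl -I http://localhost:8080
```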
Expected: HTTP/1.1 200 OK
Database Health¶
Check Command:
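Assuming the container and user names used throughout this guide:

```shell
# Ask PostgreSQL whether it is accepting connections
docker exec newhires-db pg_isready -U newhires
```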
Expected: newhires-db:5432 - accepting connections
Alternative:
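A trivial query verifies connectivity end to end, not just that the port is open:

```shell
# Run a no-op query against the application database
docker exec newhires-db psql -U newhires -d newhires -c "SELECT 1;"
```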
Worker Health¶
Check Worker is Running:
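```shell
# Confirm the worker container appears in the running-container list
docker ps | grep newhires-workers
```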
Check Worker Logs:
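```shell
# Tail recent worker output
docker logs newhires-workers --tail=20
```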
Expected: log lines such as `INFO: Worker polling for jobs...`
Check Job Queue:
```bash
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT status, COUNT(*) FROM correction_jobs GROUP BY status;"
```
Expected Output:
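Counts will vary with your workload; output along these lines indicates the queue is flowing:

```
   status    | count
-------------+-------
 pending     |     2
 processing  |     1
 completed   |   147
```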
Docker Health Checks¶
Health checks are configured in docker-compose.prod.yml:
Backend Health Check¶
```yaml
backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s      # Check every 30 seconds
    timeout: 10s       # Timeout after 10 seconds
    retries: 3         # Mark unhealthy after 3 failures
    start_period: 30s  # Grace period on startup
```
Database Health Check¶
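Your `docker-compose.prod.yml` may differ; a sketch using `pg_isready`, mirroring the backend check above, would look like:

```yaml
db:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U newhires"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 30s
```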
View Health Status¶
```bash
# Check all services
docker-compose -f docker-compose.prod.yml ps

# Output shows health status:
# NAME                STATUS
# newhires-backend    Up (healthy)
# newhires-db         Up (healthy)
# newhires-workers    Up
# newhires-frontend   Up

# View detailed health check logs
docker inspect --format='{{json .State.Health}}' newhires-backend | jq
```
Logging¶
View Logs¶
Backend Logs:
```bash
# Live logs
docker-compose -f docker-compose.prod.yml logs -f backend

# Last 100 lines
docker-compose -f docker-compose.prod.yml logs --tail=100 backend

# With timestamps
docker-compose -f docker-compose.prod.yml logs -f -t backend

# Since specific time
docker-compose -f docker-compose.prod.yml logs --since "2025-01-24T10:00:00" backend
```
Worker Logs (most important for monitoring):
```bash
# Live logs
docker logs newhires-workers -f

# Last 100 lines
docker logs newhires-workers --tail=100

# Errors only
docker logs newhires-workers | grep -i error
```
Frontend Logs:
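```shell
# Live frontend logs
docker-compose -f docker-compose.prod.yml logs -f frontend
```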
Database Logs:
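```shell
# Live database logs (service name "db", as used elsewhere in this guide)
docker-compose -f docker-compose.prod.yml logs -f db
```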
All Services:
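```shell
# Live logs for every service at once
docker-compose -f docker-compose.prod.yml logs -f
```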
Log Indicators¶
System logs use emoji indicators for easy scanning:
| Emoji | Meaning | Example |
|---|---|---|
| ✅ | Success | ✅ File validation completed |
| ❌ | Error | ❌ Validation failed: Invalid format |
| 🤖 | AI Activity | 🤖 Calling AWS Bedrock API... |
| 🔍 | Search | 🔍 Searching for employer data |
| 📝 | Correction | 📝 Applied 18 corrections |
| ⚠️ | Warning | ⚠️ High job queue depth |
| 🔄 | Processing | 🔄 Worker picked up job |
Log Levels¶
Control log verbosity with LOG_LEVEL in .env:
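For example, to run at the recommended production level, set in `.env`:

```
LOG_LEVEL=INFO
```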
Production (LOG_LEVEL=INFO):
- Normal operational messages
- Errors and warnings
- No sensitive data
- Recommended for production
Debug (LOG_LEVEL=DEBUG):
- Verbose output
- Full stack traces
- AWS Bedrock request/response details
- Database queries
- Use for troubleshooting only
Warning (LOG_LEVEL=WARNING):
- Minimal output
- Only warnings and errors
- Use for stable production
Log Persistence¶
Default Behavior¶
Docker containers use the JSON-file logging driver. Logs are stored in:
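With the default `json-file` driver, Docker writes each container's log under its data directory:

```
/var/lib/docker/containers/<container-id>/<container-id>-json.log
```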
Configure Log Rotation¶
Add to docker-compose.prod.yml:
```yaml
services:
  backend:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # Max log file size
        max-file: "3"     # Keep last 3 files
        compress: "true"  # Compress rotated logs
  workers:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        compress: "true"
```
Alternative: Syslog¶
Send logs to system syslog:
```yaml
services:
  backend:
    logging:
      driver: "syslog"
      options:
        tag: "newhires-backend"
  workers:
    logging:
      driver: "syslog"
      options:
        tag: "newhires-workers"
```
View with:
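How you read syslog depends on the host; on systemd distributions, filtering by the tag set above works, while classic syslog hosts can grep the log file:

```shell
# systemd hosts: follow entries for a tagged service
journalctl -t newhires-workers -f

# classic syslog hosts
grep newhires-workers /var/log/syslog
```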
Resource Monitoring¶
Real-Time Resource Usage¶
```bash
# All containers
docker stats

# Specific container
docker stats newhires-workers

# Output:
# CONTAINER           CPU %   MEM USAGE / LIMIT   MEM %   NET I/O
# newhires-backend    2.5%    450MiB / 2GiB       22%     1.2kB / 850B
# newhires-workers    8.2%    680MiB / 2GiB       34%     4.1kB / 2.3kB
# newhires-frontend   1.2%    180MiB / 1GiB       18%     2.4kB / 1.1kB
# newhires-db         3.1%    320MiB / 1GiB       32%     3.2kB / 2.8kB
```
Resource Alerts¶
Monitor for:
- High CPU (>80% sustained) - May need scaling or optimization
- High memory (>80%) - Possible memory leak; may need more RAM
- Workers stuck - Check for Bedrock timeouts
- Database connections - Check the connection pool
- High disk I/O - May need faster storage
Check Specific Metrics¶
```bash
# Container CPU usage
docker stats --no-stream --format "{{.Container}}: {{.CPUPerc}}"

# Memory usage
docker stats --no-stream --format "{{.Container}}: {{.MemUsage}}"

# All metrics, no stream
docker stats --no-stream
```
Monitoring Script¶
Create /usr/local/bin/monitor-newhires.sh:
```bash
#!/bin/bash
# New Hires Reporting System Health Check

echo "=== New Hires Reporting System Health Check ==="
echo "Time: $(date)"
echo ""

# Container status
echo "=== Container Status ==="
docker-compose -f /opt/newhires-reporting/docker-compose.prod.yml ps
echo ""

# Backend health
echo "=== Backend Health ==="
BACKEND_HEALTH=$(curl -s http://localhost:8000/health)
echo "$BACKEND_HEALTH" | jq .
if echo "$BACKEND_HEALTH" | jq -e '.status == "healthy"' > /dev/null; then
    echo "✅ Backend is healthy"
else
    echo "❌ Backend is unhealthy!"
fi
echo ""

# Frontend health
echo "=== Frontend Health ==="
FRONTEND_STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080)
if [ "$FRONTEND_STATUS" = "200" ]; then
    echo "✅ Frontend is healthy (HTTP $FRONTEND_STATUS)"
else
    echo "❌ Frontend is unhealthy (HTTP $FRONTEND_STATUS)"
fi
echo ""

# Database health
echo "=== Database Health ==="
DB_STATUS=$(docker exec newhires-db pg_isready -U newhires 2>&1)
if echo "$DB_STATUS" | grep -q "accepting connections"; then
    echo "✅ Database is healthy"
else
    echo "❌ Database is unhealthy"
    echo "$DB_STATUS"
fi
echo ""

# Worker health
echo "=== Worker Status ==="
WORKER_RUNNING=$(docker ps | grep newhires-workers)
if [ -n "$WORKER_RUNNING" ]; then
    echo "✅ Worker is running"
    echo "Recent worker activity:"
    docker logs newhires-workers --tail=5
else
    echo "❌ Worker is not running!"
fi
echo ""

# Job queue
echo "=== Job Queue Status ==="
docker exec newhires-db psql -U newhires -d newhires -c \
    "SELECT status, COUNT(*) FROM correction_jobs GROUP BY status;" 2>/dev/null
echo ""

# Resource usage
echo "=== Resource Usage ==="
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
echo ""

# Recent errors
echo "=== Recent Errors (last hour) ==="
echo "Backend errors:"
docker-compose -f /opt/newhires-reporting/docker-compose.prod.yml logs --since 1h backend 2>&1 | \
    grep -i error | tail -3
echo ""
echo "Worker errors:"
docker logs newhires-workers --since 1h 2>&1 | grep -i error | tail -3
echo ""

echo "=== AWS Bedrock Activity (last hour) ==="
docker logs newhires-workers --since 1h | grep -i bedrock | tail -5
echo ""
echo "=== End Health Check ==="
```
Make executable:
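```shell
chmod +x /usr/local/bin/monitor-newhires.sh
```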
Run manually:
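```shell
/usr/local/bin/monitor-newhires.sh
```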
Schedule with cron (every 5 minutes):
```bash
# Add to crontab
crontab -e

# Add this line:
*/5 * * * * /usr/local/bin/monitor-newhires.sh >> /var/log/newhires-health.log 2>&1
```
AWS Bedrock Monitoring¶
Track Bedrock Usage¶
```bash
# Count Bedrock API calls
docker logs newhires-workers | grep -i "bedrock" | wc -l

# View token usage
docker logs newhires-workers | grep "tokens used"

# Average response time
docker logs newhires-workers | \
  grep "response time" | \
  awk '{sum+=$NF; count++} END {print sum/count "s"}'
```
Monitor Costs¶
```bash
# Track corrections processed today
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT COUNT(*) FROM correction_jobs
   WHERE status='completed'
   AND completed_at >= CURRENT_DATE;"

# Estimate daily cost:
# multiply the count by ~$0.045 for Claude or ~$0.015 for Llama
```
Check for Bedrock Errors¶
```bash
# Recent Bedrock errors
docker logs newhires-workers | grep -i "bedrock.*error"

# Throttling events
docker logs newhires-workers | grep -i "throttl"

# Access denied errors
docker logs newhires-workers | grep -i "access.*denied"
```
Alerting¶
Email Alerts on Errors¶
Create /usr/local/bin/alert-on-errors.sh:
```bash
#!/bin/bash
# Alert on critical errors

EMAIL="admin@your-domain.com"
ERROR_THRESHOLD=10

# Count worker errors in the last 5 minutes
ERROR_COUNT=$(docker logs newhires-workers --since 5m 2>&1 | grep -ci "error")

if [ "$ERROR_COUNT" -gt "$ERROR_THRESHOLD" ]; then
    echo "Critical: $ERROR_COUNT errors detected in workers" | \
        mail -s "New Hires Alert: High Error Count" "$EMAIL"
fi

# Check if jobs are stuck in pending (-t -A gives a bare, unpadded count)
PENDING_COUNT=$(docker exec newhires-db psql -U newhires -d newhires -t -A -c \
    "SELECT COUNT(*) FROM correction_jobs WHERE status='pending' AND created_at < NOW() - INTERVAL '10 minutes';")

if [ "$PENDING_COUNT" -gt 5 ]; then
    echo "Warning: $PENDING_COUNT jobs stuck in pending status" | \
        mail -s "New Hires Alert: Jobs Stuck" "$EMAIL"
fi
```
Add to crontab (every 5 minutes):
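The same crontab pattern used for the health-check script applies, assuming you saved this script at the path above (the log file location is your choice):

```shell
# Add with crontab -e:
*/5 * * * * /usr/local/bin/alert-on-errors.sh >> /var/log/newhires-alerts.log 2>&1
```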
Slack Alerts¶
```bash
#!/bin/bash
# Send Slack notifications on failures

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

send_slack_alert() {
    local message=$1
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$message\"}" \
        "$SLACK_WEBHOOK"
}

# Check backend health
HEALTH=$(curl -s http://localhost:8000/health)
if ! echo "$HEALTH" | jq -e '.status == "healthy"' > /dev/null; then
    send_slack_alert "⚠️ New Hires Reporting backend is unhealthy!"
fi

# Check for stuck jobs (-t -A gives a bare, unpadded count)
STUCK=$(docker exec newhires-db psql -U newhires -d newhires -t -A -c \
    "SELECT COUNT(*) FROM correction_jobs WHERE status='processing' AND started_at < NOW() - INTERVAL '5 minutes';")
if [ "$STUCK" -gt 0 ]; then
    send_slack_alert "⚠️ $STUCK correction jobs stuck in processing"
fi
```
External Monitoring¶
Uptime Monitoring¶
Use external services to monitor availability:
- UptimeRobot (free tier available)
- Pingdom
- StatusCake
- Better Uptime
Monitor URLs:
- Frontend: https://your-domain.com
- Backend API: https://api.your-domain.com/health
Alert Frequency: Check every 5 minutes
Set Alerts For:
- HTTP 200 status code not returned
- Response time > 10 seconds
- SSL certificate expiration
Metrics to Monitor¶
System Metrics¶
| Metric | Normal Range | Alert Threshold | Action |
|---|---|---|---|
| CPU Usage | 10-30% | >80% | Scale workers or optimize |
| Memory Usage | 30-60% | >85% | Check for leaks, add RAM |
| Disk Usage | <70% | >85% | Clean logs, add storage |
| Network I/O | Low | Spikes | Check for issues |
Application Metrics¶
| Metric | Normal Range | Alert Threshold | Action |
|---|---|---|---|
| Validation Time | 1-5s | >30s | Check file size, API |
| Correction Time | 30-120s | >300s | Check Bedrock, workers |
| Job Queue Depth | 0-5 | >20 | Scale workers |
| Failed Jobs | <1% | >5% | Check worker logs |
| AWS Bedrock Latency | 2-10s | >60s | Check AWS status |
Business Metrics¶
- Files Validated - Track usage
- Corrections Applied - Measure AI effectiveness
- AWS Bedrock Costs - Monitor spend
- User Activity - Track peak usage times
- Error Rate by State - Identify format issues
Log Analysis¶
Find Validation Errors¶
```bash
# Count errors by type (from validation results)
docker logs newhires-backend | grep "error_type" | \
  cut -d'"' -f4 | sort | uniq -c | sort -rn

# Example output:
#  45 MISSING_REQUIRED_FIELD
#  23 INVALID_FORMAT
#  12 INVALID_LENGTH
```
Track Bedrock Usage¶
```bash
# Count Bedrock API calls
docker logs newhires-workers | grep "Calling AWS Bedrock" | wc -l

# View token usage
docker logs newhires-workers | grep "tokens used"

# Check for throttling
docker logs newhires-workers | grep -i "throttl"
```
Monitor Job Processing¶
```bash
# Recent job completions
docker logs newhires-workers | grep "Correction job completed" | tail -10

# Failed jobs
docker logs newhires-workers | grep "Job failed" | tail -10

# Average processing time (if logged)
docker logs newhires-workers | grep "completed in" | \
  awk '{print $(NF-1)}' | awk '{sum+=$1; count++} END {print sum/count " seconds"}'
```
Troubleshooting Common Issues¶
High Memory Usage¶
```bash
# Check memory breakdown
docker stats --no-stream

# If workers are high:
# 1. Check for stuck Bedrock calls
# 2. Check job queue depth
# 3. Restart workers:
docker-compose -f docker-compose.prod.yml restart workers

# If the database is high:
# 1. Check connection count
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT count(*) FROM pg_stat_activity;"
# 2. Consider a vacuum
docker exec newhires-db psql -U newhires -d newhires -c "VACUUM ANALYZE;"
```
High CPU Usage¶
```bash
# Check which container
docker stats --no-stream

# Common causes:
# 1. Multiple concurrent jobs
# 2. Large files being processed
# 3. AWS Bedrock timeouts causing retries

# Check concurrent jobs
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT COUNT(*) FROM correction_jobs WHERE status='processing';"

# Scale workers if needed
docker-compose -f docker-compose.prod.yml up -d --scale workers=3
```
Jobs Stuck in Pending¶
```bash
# Check the worker is running
docker ps | grep newhires-workers

# Check worker logs
docker logs newhires-workers --tail=50

# Look for AWS errors
docker logs newhires-workers | grep -i "error\|exception"

# Restart workers
docker-compose -f docker-compose.prod.yml restart workers
```
Database Connection Issues¶
```bash
# Check the database is running
docker ps | grep newhires-db

# Test connection
docker exec newhires-db pg_isready -U newhires

# Check connections
docker exec newhires-db psql -U newhires -d newhires -c \
  "SELECT count(*) FROM pg_stat_activity;"

# Restart database (will cause brief downtime)
docker-compose -f docker-compose.prod.yml restart db
```
Best Practices¶
- ✅ Monitor health endpoints every 1-5 minutes
- ✅ Set up alerts for critical failures
- ✅ Review logs daily in production
- ✅ Track resource trends over time
- ✅ Set log rotation to prevent disk fill
- ✅ Keep at least 7 days of logs
- ✅ Monitor AWS Bedrock costs weekly
- ✅ Test alerting systems regularly
- ✅ Document normal baseline metrics
- ✅ Create runbooks for common issues
- ✅ Monitor job queue depth
- ✅ Track worker processing times
- ✅ Set up external uptime monitoring
- ✅ Review AWS CloudWatch for Bedrock metrics
Next Steps¶
- Security Best Practices - Secure your deployment
- AWS Bedrock Troubleshooting - Fix Bedrock issues
- Common Issues - Troubleshoot problems
- Command Reference - Quick command reference