test-repo/root-cause-analysis.md
Matt Bruce b934c9fdb3 Task #7: Root cause analysis - why websites die
- Analyzed system limits, memory usage, process status
- Identified primary suspect: Next.js dev server memory leaks
- Secondary suspects: macOS power mgmt, SSH timeout, OOM killer
- Created monitoring script for CPU/memory/file descriptors
- Documented recommendations: production builds, PM2, nohup
2026-02-18 16:04:44 -06:00

Root Cause Analysis: Why Websites Die

Date: 2026-02-18
Task: #7 - Investigate root cause of web app crashes

Current Status

All 3 sites are UP and healthy at time of investigation.

System Analysis

1. File Descriptor Limits

Current ulimit: 1,048,575 open files (very high, so descriptor exhaustion is unlikely to be the issue)
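
To put the limit in context, the descriptor count of a running server can be compared against it. A minimal sketch, assuming `lsof` is available; the helper names `fd_count` and `fd_percent` are hypothetical and were not part of the investigation:

```bash
#!/usr/bin/env bash
# Sketch: how close is one process to its open-file limit?

# fd_count <pid>: rough count of open files for a process.
# lsof prints a header row (skipped here) and also lists cwd, txt and
# memory-mapped entries, so the number slightly overstates real descriptors.
fd_count() {
  lsof -p "$1" 2>/dev/null | tail -n +2 | wc -l | tr -d ' '
}

# fd_percent <count> <limit>: integer percentage of the limit in use.
fd_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Example for the server listening on port 3000:
#   pid=$(lsof -t -i tcp:3000 -sTCP:LISTEN)
#   fd_percent "$(fd_count "$pid")" "$(ulimit -n)"
```

Note that `ulimit -n` reports the limit of the current shell. A server started from cron or another launcher may run under a different limit, so the percentage is only indicative.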

2. Active Processes

  • gantt-board (port 3000): Running
  • blog-backup (port 3003): Running
  • heartbeat-monitor (port 3005): Running

3. Memory Usage

Next.js dev servers using:

  • ~400-550MB RAM each (normal for dev mode)
  • ~0.8-1.1% system memory each
  • Not excessive but adds up
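
The per-server figures above can be reproduced with `ps`. A small sketch; `rss_mb` and `sum_mb` are hypothetical helper names, and the PID passed in is assumed to be the main Node process (child processes are not counted):

```bash
#!/usr/bin/env bash
# Sketch: resident memory per server and the combined total.

# rss_mb <pid>: resident set size in whole MB (ps reports it in KB).
rss_mb() {
  local kb
  kb=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ')
  kb=${kb:-0}                     # treat a missing process as 0
  echo $(( kb / 1024 ))
}

# sum_mb <n>...: add up a list of MB values.
sum_mb() {
  local total=0 n
  for n in "$@"; do
    total=$(( total + n ))
  done
  echo "$total"
}

# Example with three PIDs already looked up:
#   sum_mb "$(rss_mb "$pid1")" "$(rss_mb "$pid2")" "$(rss_mb "$pid3")"
```

With the range quoted above, three servers come to between `sum_mb 400 400 400` (1200 MB) and `sum_mb 550 550 550` (1650 MB).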

Likely Root Causes

Primary Suspect: Next.js Dev Server Memory Leaks

  • Next.js dev mode (npm run dev) is NOT production-ready
  • File watcher holds references to files
  • Hot Module Replacement (HMR) accumulates memory over time
  • Recommendation: Use production builds for long-running services
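
A leak would show up as memory that keeps growing between samples. One way to check, sketched under the assumption that RSS samples (in MB, one per line) have been collected to a file; `rss_growth` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Sketch: measure memory growth across a series of samples.

# rss_growth: read one RSS sample (MB) per line on stdin and print
# the difference between the last and the first sample.
rss_growth() {
  awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR) print last - first }'
}

# Collecting samples once a minute for a given PID (stop with Ctrl-C):
#   while true; do
#     ps -o rss= -p "$pid" | awk '{ print int($1 / 1024) }'
#     sleep 60
#   done >> samples.txt
#
# Then:
#   rss_growth < samples.txt
```

Steady growth over several hours with no change in traffic would support the leak theory; a flat line would point toward the secondary suspects instead.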

Secondary Suspects:

  1. macOS Power Management

    • Power Nap / App Nap can suspend background processes
    • If the Mac sleeps, SSH sessions drop and take their child processes with them (see item 3)
    • Check: System Preferences > Energy Saver (System Settings on macOS 13 and later)
  2. File Watcher Limits

    • Default macOS limits: 1280 watched files per process
    • Large node_modules can exceed this
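    • Error: EMFILE: too many open files
    • Less likely here if the servers inherit the 1,048,575 limit measured under File Descriptor Limits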
  3. SSH Session Timeout

    • Terminal sessions with idle timeout
    • SIGHUP sent to child processes on disconnect
    • Solution: Use nohup or screen/tmux
  4. OOM Killer (Out of Memory)

    • macOS memory pressure kills large processes
    • Combined 1.5GB+ for all 3 sites (roughly 3% of system memory at ~1% each, so unlikely to be the sole trigger)
    • Check: Console.app for "Out of memory" messages
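
The SIGHUP behavior in item 3 can be confirmed with a small experiment. This is a sketch, not something run during the investigation; `hup_exit_status` is a hypothetical helper and `sleep 30` stands in for a dev server:

```bash
#!/usr/bin/env bash
# Sketch: does a background process survive SIGHUP, with and without nohup?

# hup_exit_status [prefix...]: start `sleep 30` (optionally via a prefix such
# as nohup), send it SIGHUP, then report how it ended.
#   129 = killed by SIGHUP (128 + signal 1)
#   143 = survived SIGHUP and was stopped by the cleanup SIGTERM (128 + 15)
hup_exit_status() {
  local pid status=0
  "$@" sleep 30 >/dev/null 2>&1 &
  pid=$!
  sleep 1                                 # give nohup time to start ignoring SIGHUP
  kill -HUP "$pid" 2>/dev/null || true
  sleep 1
  kill -TERM "$pid" 2>/dev/null || true   # cleanup; no effect if already dead
  wait "$pid" || status=$?
  echo "$status"
}

# hup_exit_status          # plain background job
# hup_exit_status nohup    # same job started under nohup
```

A plain job is expected to report 129 and a `nohup` job 143, which matches the recommendation to start long-running servers under nohup, screen, or tmux.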

Monitoring Setup

Created: /Users/mattbruce/.openclaw/workspace/monitor-processes.sh

  • Tracks CPU, memory, file descriptors
  • Logs warnings for high usage
  • Runs every 60 seconds
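
The deployed script is not reproduced here. As an illustration only, a monitor with the same three measurements might look like the sketch below; the threshold, log path, and function names are assumptions, not the contents of monitor-processes.sh:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a monitor loop; not the deployed script.

MEM_WARN_MB=${MEM_WARN_MB:-800}                    # assumed warning threshold
LOG_FILE=${LOG_FILE:-/tmp/monitor-processes.log}   # assumed log location

# level_for <rss_mb>: classify a memory reading against the threshold.
level_for() {
  if [ "$1" -ge "$MEM_WARN_MB" ]; then echo WARN; else echo OK; fi
}

# sample <name> <port>: log CPU %, RSS and open-file count for the
# process listening on <port>.
sample() {
  local name=$1 port=$2 pid cpu rss_kb rss_mb fds
  pid=$(lsof -t -i "tcp:$port" -sTCP:LISTEN 2>/dev/null | head -n 1)
  if [ -z "$pid" ]; then
    echo "$(date '+%F %T') WARN $name port=$port not listening" >> "$LOG_FILE"
    return 1
  fi
  cpu=$(ps -o %cpu= -p "$pid" | tr -d ' ')
  rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
  rss_mb=$(( ${rss_kb:-0} / 1024 ))
  fds=$(lsof -p "$pid" 2>/dev/null | tail -n +2 | wc -l | tr -d ' ')
  echo "$(date '+%F %T') $(level_for "$rss_mb") $name pid=$pid cpu=$cpu rss_mb=$rss_mb fds=$fds" >> "$LOG_FILE"
}

# Main loop at the 60-second interval. Uncomment to run.
# while true; do
#   sample gantt-board 3000
#   sample blog-backup 3003
#   sample heartbeat-monitor 3005
#   sleep 60
# done
```

Logging one line per site per minute keeps the file small enough to grep after a crash and gives a timestamped trail leading up to it.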

Recommendations

Immediate (Monitoring)

  • Cron job running every 10 min with auto-restart
  • Process monitoring script deployed
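
For reference, a cron-driven health check with auto-restart could be structured as follows. This is a sketch under assumptions: the script name, log directory, and project paths are placeholders, and each site's `start` script is assumed to be `next start` so that `-p` sets the port:

```bash
#!/usr/bin/env bash
# Hypothetical check-sites.sh, meant for a crontab entry such as:
#   */10 * * * * /path/to/check-sites.sh >> /tmp/check-sites.log 2>&1

LOG_DIR=${LOG_DIR:-/tmp}   # assumed log location

# is_up <port>: succeed if the site answers over HTTP within 5 seconds.
is_up() {
  curl -fs -o /dev/null --max-time 5 "http://localhost:$1/" 2>/dev/null
}

# ensure_up <name> <port> <dir>: restart the site from <dir> if it is down.
ensure_up() {
  local name=$1 port=$2 dir=$3
  if is_up "$port"; then
    echo "$name ok"
  else
    echo "$name down, restarting"
    # Run detached under nohup so the server outlives this cron run.
    ( cd "$dir" && exec nohup npm start -- -p "$port" ) >> "$LOG_DIR/$name.log" 2>&1 &
  fi
}

# ensure_up gantt-board 3000 /path/to/gantt-board
# ensure_up blog-backup 3003 /path/to/blog-backup
# ensure_up heartbeat-monitor 3005 /path/to/heartbeat-monitor
```

Cron runs with a minimal PATH, so `npm` may need an absolute path or an explicit PATH line in the crontab. The sketch also does not kill a hung process that still holds the port; in that case the restart would fail until the old process is removed.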

Short-term (Stability)

  1. Use production builds instead of dev mode:
    npm run build
    npm start
    
  2. Run with PM2 or forever for process management
  3. Use nohup to prevent SSH timeout kills
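
Put together, the three steps might look like the sketch below. The 800M restart threshold and the helper names are assumptions, and PM2 is assumed to be installed globally:

```bash
#!/usr/bin/env bash
# Sketch: build once, then run the production server under PM2.

# run <cmd...>: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# start_with_pm2 <name> <dir>: build the app in <dir> and hand `npm start`
# to PM2, which restarts it on crash or when memory passes the threshold.
start_with_pm2() {
  local name=$1 dir=$2
  (
    cd "$dir" &&
    run npm run build &&
    run pm2 start npm --name "$name" --max-memory-restart 800M -- start
  )
}

# Preview the commands without running them:
#   DRY_RUN=1 start_with_pm2 gantt-board /path/to/gantt-board
#
# Without PM2, nohup alone survives an SSH disconnect but will not
# restart the server after a crash:
#   nohup npm start > app.log 2>&1 &
```

PM2 covers both failure modes discussed above: it detaches the server from the SSH session and restarts it if it dies, whereas nohup only addresses the first.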

Long-term (Reliability)

  1. Docker containers with restart policies
  2. launchd services with KeepAlive (or systemd with Restart=always on Linux hosts)
  3. Reverse proxy (nginx) with health checks

Next Steps

  1. Monitor logs for next 24-48 hours
  2. Check if sites die overnight (SSH timeout test)
  3. If memory-related, switch to production builds