# Root Cause Analysis: Why Websites Die

**Date:** 2026-02-18

**Task:** #7 - Investigate root cause of web app crashes

## Current Status

All 3 sites are **UP and healthy** at the time of investigation.

## System Analysis

### 1. File Descriptor Limits

```
Current ulimit: 1,048,575 (very high - unlikely to be the issue)
```

### 2. Active Processes

- gantt-board (port 3000): ✅ Running
- blog-backup (port 3003): ✅ Running
- heartbeat-monitor (port 3005): ✅ Running

### 3. Memory Usage

Each Next.js dev server is using:

- ~400-550MB RAM (normal for dev mode)
- ~0.8-1.1% of system memory
- Not excessive on its own, but it adds up across three servers

## Likely Root Causes

### Primary Suspect: **Next.js Dev Server Memory Leaks**

- Next.js dev mode (`npm run dev`) is NOT production-ready
- The file watcher holds references to files
- Hot Module Replacement (HMR) accumulates memory over time
- **Recommendation:** Use production builds for long-running services

### Secondary Suspects:

1. **macOS Power Management**
   - Power Nap / App Nap can suspend background processes
   - A dropped SSH session kills its child processes
   - **Check:** System Preferences > Energy Saver

2. **File Watcher Limits**
   - Default macOS limits: 1280 watched files per process
   - Large `node_modules` directories can exceed this
   - **Error:** `EMFILE: too many open files`

3. **SSH Session Timeout**
   - Terminal sessions with an idle timeout
   - SIGHUP is sent to child processes on disconnect
   - **Solution:** Use `nohup` or `screen`/`tmux`

4. **OOM Killer (Out of Memory)**
   - macOS memory pressure kills large processes
   - Roughly 1.2-1.65GB combined for all 3 sites (3 × 400-550MB)
   - **Check:** Console.app for "Out of memory" messages

## Monitoring Setup

Created: `/Users/mattbruce/.openclaw/workspace/monitor-processes.sh`

- Tracks CPU, memory, file descriptors
- Logs warnings for high usage
- Runs every 60 seconds

## Recommendations

### Immediate (Monitoring)

- ✅ Cron job running every 10 min with auto-restart
- ✅ Process monitoring script deployed

### Short-term (Stability)

1. Use production builds instead of dev mode:
   ```bash
   npm run build
   npm start
   ```
2. Run with PM2 or forever for process management
3. Use `nohup` to prevent SSH timeout kills

### Long-term (Reliability)

1. Docker containers with restart policies
2. Systemd services with auto-restart (launchd is the macOS equivalent)
3. Reverse proxy (nginx) with health checks

## Next Steps

1. Monitor logs for the next 24-48 hours
2. Check if sites die overnight (SSH timeout test)
3. If memory-related, switch to production builds
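
For spot checks during the 24-48 hour observation window, the per-port memory figures from the System Analysis section can be gathered with a short script. This is a minimal sketch and not the contents of `monitor-processes.sh`: the function names `mem_status` and `check_port` and the 700MB warning threshold are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the 700MB default threshold are assumptions.

# Classify a resident set size (in KB, as printed by `ps -o rss=`)
# against a limit given in MB.
mem_status() {
  local rss_kb="${1:-0}" limit_mb="$2"
  local rss_mb=$(( rss_kb / 1024 ))
  if [ "$rss_mb" -ge "$limit_mb" ]; then
    echo "WARN ${rss_mb}MB"
  else
    echo "OK ${rss_mb}MB"
  fi
}

# Report the status of whatever process is listening on a TCP port.
check_port() {
  local port="$1" limit_mb="${2:-700}" pid rss_kb
  # -t prints PIDs only; -sTCP:LISTEN restricts matches to listening sockets.
  pid="$(lsof -nP -t -iTCP:"$port" -sTCP:LISTEN 2>/dev/null | head -n 1)"
  if [ -z "$pid" ]; then
    echo "port ${port}: DOWN"
    return 1
  fi
  rss_kb="$(ps -o rss= -p "$pid" | tr -d ' ')"
  echo "port ${port}: pid ${pid} $(mem_status "${rss_kb:-0}" "$limit_mb")"
}
```

Calling `check_port 3000`, `check_port 3003`, and `check_port 3005` covers the three sites listed under Active Processes. A steadily climbing figure for one port would support the dev-server memory leak hypothesis, while a sudden `DOWN` with flat memory points toward the SSH or power-management suspects.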
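
The `nohup` recommendation under Short-term can be sketched as a small wrapper that starts a command detached from the terminal, so a SIGHUP from a closing SSH session does not take the server down. The function name `start_detached` and its argument order are assumptions for illustration, and any directory or log path passed to it is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: start_detached and its arguments are illustrative assumptions.

# Usage: start_detached <app_dir> <log_file> <command> [args...]
# Prints the PID of the detached process.
start_detached() {
  local app_dir="$1" log_file="$2"
  shift 2
  (
    cd "$app_dir" || exit 1
    # nohup makes the command ignore SIGHUP; output is appended to the log
    # so nothing stays attached to the terminal.
    nohup "$@" >>"$log_file" 2>&1 &
    echo $!
  )
}
```

For example, after `npm run build`, something like `start_detached ./gantt-board ./gantt-board.log npm start` would launch the production server and print its PID. This only addresses session disconnects; it does not restart a crashed process, which is where PM2 or the existing cron job comes in.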