- Analyzed system limits, memory usage, process status
- Identified primary suspect: Next.js dev server memory leaks
- Secondary suspects: macOS power mgmt, SSH timeout, OOM killer
- Created monitoring script for CPU/memory/file descriptors
- Documented recommendations: production builds, PM2, nohup
Root Cause Analysis: Why Websites Die
Date: 2026-02-18
Task: #7 - Investigate root cause of web app crashes
Current Status
All 3 sites are UP and healthy at time of investigation.
System Analysis
1. File Descriptor Limits
Current ulimit: 1,048,575 (very high, so unlikely to be the issue)
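To repeat this check later, the limit can be compared with what a given process actually holds open. The following is a minimal sketch, not the command used during the investigation; the helper names fd_usage_percent and count_fds are illustrative, and the PID in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: compare a process's open-file count with the shell's soft limit.

# Integer percentage of LIMIT used by OPEN; prints 0 when LIMIT is not a
# positive number (for example "unlimited").
fd_usage_percent() {
    case "$2" in
        ''|*[!0-9]*|0) echo 0; return 0 ;;
    esac
    echo $(( $1 * 100 / $2 ))
}

# Rough open-file count for a PID. lsof prints one header line and also
# lists entries such as cwd and txt, so treat the result as approximate.
count_fds() {
    lsof -p "$1" 2>/dev/null | tail -n +2 | wc -l | tr -d ' '
}

limit=$(ulimit -n)
echo "soft limit: $limit"
# Usage (12345 is a placeholder PID):
#   fd_usage_percent "$(count_fds 12345)" "$limit"
```

With a limit as high as the one recorded above, the percentage should stay near zero unless a process is leaking descriptors.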
2. Active Processes
- gantt-board (port 3000): ✅ Running
- blog-backup (port 3003): ✅ Running
- heartbeat-monitor (port 3005): ✅ Running
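A status check like the one above could be scripted roughly as follows. This is a sketch under assumptions: the site names and ports come from the list, but probing the "/" path on localhost with curl is a guess at what counts as healthy, and check_sites and site_is_up are illustrative names.

```shell
#!/bin/sh
# Names and ports are from the list above; the probed URL is an assumption.
SITES="gantt-board:3000 blog-backup:3003 heartbeat-monitor:3005"

# Succeeds if an HTTP request to the port returns a non-error status
# within 5 seconds.
site_is_up() {
    curl -fsS -o /dev/null --max-time 5 "http://localhost:$1/"
}

# Prints one UP/DOWN line per site.
check_sites() {
    for entry in $SITES; do
        name=${entry%%:*}
        port=${entry##*:}
        if site_is_up "$port"; then
            echo "$name (port $port): UP"
        else
            echo "$name (port $port): DOWN"
        fi
    done
}
```

Calling check_sites prints three lines, one per site. A process can be running yet unresponsive, so an HTTP probe is a stricter test than checking that the process exists.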
3. Memory Usage
Next.js dev servers using:
- ~400-550MB RAM each (normal for dev mode)
- ~0.8-1.1% system memory each
- Not excessive but adds up
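The per-process figures can be totalled to see how they add up. The sketch below assumes the dev servers show "next dev" somewhere in their command line, which may not match how these apps are launched; summarize_rss and next_memory are illustrative names. The pattern is passed through the environment so the awk process does not match itself in the ps output.

```shell
#!/bin/sh
# Reads "PID RSS_KB COMMAND..." lines on stdin and prints the resident
# memory in MB for each line containing the pattern, then a total.
summarize_rss() {
    PAT="$1" awk '
        index($0, ENVIRON["PAT"]) {
            mb = $2 / 1024
            total += mb
            printf "%s\t%.0f MB\n", $1, mb
        }
        END { printf "total\t%.0f MB\n", total }
    '
}

# ps reports rss in kilobytes; "next dev" is an assumed pattern.
next_memory() {
    ps -axo pid=,rss=,command= | summarize_rss "next dev"
}
```

Running next_memory periodically and comparing totals is one way to see whether usage grows over time, which is what a leak in dev mode would look like.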
Likely Root Causes
Primary Suspect: Next.js Dev Server Memory Leaks
- Next.js dev mode (npm run dev) is NOT production-ready
- File watcher holds references to files
- Hot Module Replacement (HMR) accumulates memory over time
- Recommendation: Use production builds for long-running services
Secondary Suspects:
- macOS Power Management
  - Power Nap / App Nap can suspend background processes
  - SSH sessions that die kill their child processes
  - Check: System Preferences > Energy Saver
- File Watcher Limits
  - Default macOS limits: 1280 watched files per process
  - Large node_modules can exceed this
  - Error: EMFILE: too many open files
- SSH Session Timeout
  - Terminal sessions with idle timeout
  - SIGHUP sent to child processes on disconnect
  - Solution: Use nohup or screen/tmux
- OOM Killer (Out of Memory)
  - macOS memory pressure kills large processes
  - Combined 1.5GB+ for all 3 sites
  - Check: Console.app for "Out of memory" messages
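The SSH timeout suspect can be tested locally without waiting for a real disconnect. The sketch below sends SIGHUP by hand to a placeholder command, once started plainly and once under nohup; "sleep 30" stands in for the real server command, and survives_hup is an illustrative helper name.

```shell
#!/bin/sh
# Runs a command in the background, sends it SIGHUP, and reports whether
# it is still alive one second later.
survives_hup() {
    "$@" >/dev/null 2>&1 &
    pid=$!
    sleep 1                       # let nohup set up before signalling
    kill -HUP "$pid" 2>/dev/null  # what a dropped session would deliver
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" 2>/dev/null   # clean up the survivor
        echo "survived"
    else
        echo "killed"
    fi
}

echo "plain: $(survives_hup sleep 30)"
echo "nohup: $(survives_hup nohup sleep 30)"
```

The plain command is normally terminated by the signal, while the nohup-wrapped one is expected to survive. If the sites die in the same pattern as the plain case, the SSH theory gains weight.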
Monitoring Setup
Created: /Users/mattbruce/.openclaw/workspace/monitor-processes.sh
- Tracks CPU, memory, file descriptors
- Logs warnings for high usage
- Runs every 60 seconds
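The contents of monitor-processes.sh are not reproduced here. A script with the behaviour described above might look roughly like this hypothetical sketch; the 800 MB threshold, the function names, and the MONITOR_PIDS variable are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch, not the deployed monitor-processes.sh.
MEM_LIMIT_MB=800   # assumed warning threshold
INTERVAL=60        # matches the 60-second cadence described above

# Prints a WARN line when VALUE exceeds LIMIT, otherwise an OK line.
# Arguments: LABEL VALUE LIMIT (integers).
check_threshold() {
    if [ "$2" -gt "$3" ]; then
        echo "WARN $1=$2 exceeds $3"
    else
        echo "OK $1=$2"
    fi
}

# Samples one PID: CPU percent, resident memory, rough open-file count.
sample_pid() {
    cpu=$(ps -o %cpu= -p "$1" | tr -d ' ')
    rss_kb=$(ps -o rss= -p "$1" | tr -d ' ')
    [ -n "$rss_kb" ] || { echo "WARN pid=$1 not running"; return 1; }
    fds=$(lsof -p "$1" 2>/dev/null | wc -l | tr -d ' ')
    echo "$(date '+%F %T') pid=$1 cpu=$cpu fds=$fds"
    check_threshold mem_mb $(( rss_kb / 1024 )) "$MEM_LIMIT_MB"
}

# Samples every PID given as an argument, forever.
monitor_loop() {
    while :; do
        for pid in "$@"; do
            sample_pid "$pid"
        done
        sleep "$INTERVAL"
    done
}

# Only loop when explicitly asked, so the file can also be sourced.
if [ "${MONITOR_RUN:-0}" = 1 ]; then
    monitor_loop $MONITOR_PIDS
fi
```

Keeping the threshold check separate from the sampling makes it easy to add limits for CPU or file descriptors later.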
Recommendations
Immediate (Monitoring)
✅ Cron job running every 10 min with auto-restart
✅ Process monitoring script deployed
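The deployed cron job is not shown in this report. For reference, a 10-minute auto-restart can be sketched as below; the script path in the crontab comment, the health probe, and the names is_up and ensure_up are assumptions, and the restart command is supplied by the caller.

```shell
#!/bin/sh
# Example crontab entry (the path is a placeholder):
#   */10 * * * * /path/to/restart-if-down.sh >> /tmp/restart-if-down.log 2>&1

# Succeeds if the port answers HTTP without an error status.
is_up() {
    curl -fsS -o /dev/null --max-time 5 "http://localhost:$1/"
}

# ensure_up PORT CMD [ARGS...]: runs CMD only when the port is not healthy.
ensure_up() {
    port=$1
    shift
    if is_up "$port"; then
        echo "port $port: healthy"
    else
        echo "port $port: down, restarting"
        "$@"
    fi
}
```

A call such as `ensure_up 3000 sh -c 'cd /path/to/app && nohup npm start >/dev/null 2>&1 &'` would restart one site; the app directory there is again a placeholder.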
Short-term (Stability)
- Use production builds instead of dev mode: npm run build, then npm start
- Run with PM2 or forever for process management
- Use nohup to prevent SSH timeout kills
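The short-term steps can be combined into one start routine. This sketch assumes each app's "start" script runs the Next.js production server and that it honours the PORT environment variable; the dry-run wrapper is an illustrative addition so the commands can be reviewed before anything is launched.

```shell
#!/bin/sh
# Prints the command when DRY_RUN is 1 (the default), otherwise runs it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# start_production NAME PORT: build, then hand the app to PM2.
# Run from the app's directory.
start_production() {
    app_name=$1
    port=$2
    run npm run build
    run env PORT="$port" pm2 start npm --name "$app_name" -- start
}

start_production gantt-board 3000
# Without PM2, a nohup equivalent would be:
#   nohup npm start > app.log 2>&1 &
```

Setting DRY_RUN=0 executes the commands for real. PM2 restarts a crashed process on its own, which nohup does not, so nohup only addresses the SSH disconnect case.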
Long-term (Reliability)
- Docker containers with restart policies
- Systemd services with auto-restart (on macOS, launchd with KeepAlive fills this role)
- Reverse proxy (nginx) with health checks
Next Steps
- Monitor logs for next 24-48 hours
- Check if sites die overnight (SSH timeout test)
- If memory-related, switch to production builds