- Analyzed system limits, memory usage, process status
- Identified primary suspect: Next.js dev server memory leaks
- Secondary suspects: macOS power mgmt, SSH timeout, OOM killer
- Created monitoring script for CPU/memory/file descriptors
- Documented recommendations: production builds, PM2, nohup
# Root Cause Analysis: Why Websites Die

**Date:** 2026-02-18
**Task:** #7 - Investigate root cause of web app crashes

## Current Status
All 3 sites are **UP and healthy** at time of investigation.

## System Analysis

### 1. File Descriptor Limits
```
Current ulimit: 1,048,575 (very high - unlikely to be issue)
```

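To track descriptor headroom per process rather than only the limit, a check along these lines could be used. This is a minimal sketch, not part of the investigation: `fd_usage_pct` and `count_open_fds` are hypothetical helper names, and counting via `lsof -p` is an assumption about how usage would be measured (it slightly overcounts, since `lsof` also lists entries such as the working directory).

```bash
# Sketch: compare a process's open file descriptors against the soft limit.
# fd_usage_pct and count_open_fds are hypothetical helper names.

# Integer percentage of the limit currently in use (0 when the limit is unlimited).
fd_usage_pct() {
  local used="$1" limit="$2"
  if [ "$limit" = "unlimited" ] || [ "$limit" -le 0 ]; then
    echo 0
    return
  fi
  echo $(( used * 100 / limit ))
}

# Approximate count of open descriptors for a PID.
# lsof prints one header line, so it is skipped.
count_open_fds() {
  lsof -p "$1" 2>/dev/null | tail -n +2 | wc -l | tr -d ' '
}

# Example usage (PID 12345 is a placeholder):
#   used=$(count_open_fds 12345)
#   echo "fd usage: $(fd_usage_pct "$used" "$(ulimit -n)")%"
```

With a limit as high as the one measured here, the percentage should stay near zero; a steadily rising value would point toward a descriptor leak.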
### 2. Active Processes
- gantt-board (port 3000): ✅ Running
- blog-backup (port 3003): ✅ Running
- heartbeat-monitor (port 3005): ✅ Running

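The status list above can be reproduced with a small HTTP probe. The sketch below assumes each site answers on `http://localhost:<port>/`, which the report does not state; the helper names `site_status`, `probe_port`, and `check_sites` are hypothetical.

```bash
# Sketch: probe each site over HTTP on localhost.
# Ports come from the list above; helper names are hypothetical.

# Map an HTTP status code to UP/DOWN.
# curl reports 000 when it cannot connect at all.
site_status() {
  case "$1" in
    2??|3??) echo "UP" ;;
    *)       echo "DOWN" ;;
  esac
}

# Fetch only the status code for a local port, giving up after 5 seconds.
probe_port() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://localhost:$1/"
}

# Print one status line per site.
check_sites() {
  local entry name port code
  for entry in gantt-board:3000 blog-backup:3003 heartbeat-monitor:3005; do
    name="${entry%%:*}"
    port="${entry##*:}"
    code="$(probe_port "$port")"
    echo "$name (port $port): $(site_status "$code")"
  done
}
```

Calling `check_sites` prints one line per site. Treating 4xx and 5xx responses as DOWN is a design choice for this sketch; a site that intentionally returns 404 on `/` would need a different probe path.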
### 3. Memory Usage
Each Next.js dev server uses:
- ~400-550MB RAM (normal for dev mode)
- ~0.8-1.1% of system memory
- Not excessive individually, but it adds up across 3 servers

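To re-measure the combined footprint later, the per-process figures can be summed from `ps` output. This is a sketch under assumptions: `sum_rss_mb` is a hypothetical helper, and the pattern `next dev` is a guess at how the dev servers appear in the process list.

```bash
# Sketch: total resident memory (RSS) of processes whose command line matches a pattern.
# sum_rss_mb is a hypothetical helper; it reads "rss_kb command..." lines on stdin.

sum_rss_mb() {
  # ps reports RSS in KB, so divide by 1024 and print whole megabytes.
  awk -v pat="$1" '$0 ~ pat { kb += $1 } END { printf "%d\n", kb / 1024 }'
}

# Example usage. The bracket in "[n]ext dev" keeps the awk process itself
# from matching its own command line in the ps output:
#   ps ax -o rss= -o command= | sum_rss_mb "[n]ext dev"
```

Logging this total periodically would show whether memory grows over time, which is what a leak in dev mode would look like.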
## Likely Root Causes

### Primary Suspect: **Next.js Dev Server Memory Leaks**
- Next.js dev mode (`npm run dev`) is not meant for long-running production use
- File watcher holds references to files
- Hot Module Replacement (HMR) can accumulate memory over time
- **Recommendation:** Use production builds for long-running services

### Secondary Suspects:

1. **macOS Power Management**
   - Power Nap / App Nap can suspend background processes
   - A dying SSH session kills its child processes
   - **Check:** System Preferences > Energy Saver (System Settings on newer macOS)

2. **File Watcher Limits**
   - Default macOS limit cited: 1280 watched files per process (the ulimit measured above is far higher, which makes this less likely here)
   - Large `node_modules` can exceed this
   - **Error:** `EMFILE: too many open files`

3. **SSH Session Timeout**
   - Terminal sessions with an idle timeout get disconnected
   - SIGHUP is sent to child processes on disconnect
   - **Solution:** Use `nohup` or `screen`/`tmux`

4. **OOM Killer (Out of Memory)**
   - Under memory pressure, macOS can kill large processes
   - Combined ~1.2-1.65GB for all 3 sites (3 × ~400-550MB each)
   - **Check:** Console.app for "Out of memory" messages

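These suspects leave different traces in a process's exit status, because a shell reports death by signal as 128 plus the signal number. The sketch below assumes the server is started from a wrapper shell that can record `$?`; `describe_exit` is a hypothetical helper, and wrappers such as `npm` may mask the underlying status, so treat the result as a hint rather than proof.

```bash
# Sketch: translate an exit status into a likely cause.
# describe_exit is a hypothetical helper name.

describe_exit() {
  local status="$1" sig
  if [ "$status" -eq 0 ]; then
    echo "clean exit"
  elif [ "$status" -gt 128 ]; then
    # Statuses above 128 mean the process was terminated by a signal.
    sig=$(( status - 128 ))
    case "$sig" in
      1)  echo "SIGHUP (terminal or SSH session closed)" ;;
      9)  echo "SIGKILL (forced kill, e.g. by the system under memory pressure)" ;;
      15) echo "SIGTERM (polite shutdown request)" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "application error (exit $status)"
  fi
}

# Example usage (crash.log is a placeholder file name):
#   npm run dev; describe_exit "$?" >> crash.log
```

A status of 129 would support the SSH timeout theory, while 137 would be more consistent with a forced kill.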
## Monitoring Setup

Created: `/Users/mattbruce/.openclaw/workspace/monitor-processes.sh`
- Tracks CPU, memory, file descriptors
- Logs warnings for high usage
- Runs every 60 seconds

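The script itself is not reproduced in this report. For orientation, a check of this kind might look like the sketch below; it is not the contents of `monitor-processes.sh`. The thresholds and the helper names `report_metric` and `sample_pid` are assumptions, and file descriptor tracking is left out for brevity.

```bash
# Sketch only: NOT the contents of monitor-processes.sh.
# Thresholds and helper names are hypothetical.

MEM_LIMIT_MB=800   # assumed per-process warning threshold
CPU_LIMIT_PCT=90   # assumed per-process warning threshold

# Print a WARN line when a metric exceeds its limit, otherwise an OK line.
report_metric() {
  local name="$1" value="$2" limit="$3"
  if [ "$value" -gt "$limit" ]; then
    echo "WARN $name=$value exceeds $limit"
  else
    echo "OK $name=$value"
  fi
}

# Sample CPU and memory for one PID using ps.
sample_pid() {
  local pid="$1" stats cpu rss_kb
  stats=$(ps -o %cpu= -o rss= -p "$pid") || return 1
  # Split "cpu rss" into positional parameters.
  set -- $stats
  cpu="$1"
  rss_kb="$2"
  # Drop the fractional part of the CPU percentage for integer comparison.
  report_metric "cpu_pct" "${cpu%.*}" "$CPU_LIMIT_PCT"
  report_metric "mem_mb" "$(( rss_kb / 1024 ))" "$MEM_LIMIT_MB"
}

# Example loop at the 60-second interval (PID and log name are placeholders):
#   while true; do sample_pid 12345 >> monitor.log; sleep 60; done
```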
## Recommendations

### Immediate (Monitoring)
- ✅ Cron job running every 10 min with auto-restart
- ✅ Process monitoring script deployed

### Short-term (Stability)
1. Use production builds instead of dev mode:
   ```bash
   npm run build
   npm start
   ```
2. Run with PM2 or forever for process management
3. Use `nohup` so a SIGHUP on SSH disconnect does not kill the process

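The `nohup` and PM2 steps above could be applied roughly as follows. This is a sketch, not a tested deployment: `start_detached` is a hypothetical helper, and the log file and process names are placeholders.

```bash
# Sketch: start a command detached from the terminal so that
# SIGHUP on SSH disconnect does not stop it.
# start_detached is a hypothetical helper name.

start_detached() {
  local log="$1"
  shift
  # nohup makes the command ignore SIGHUP; output goes to the log file
  # and stdin reads from /dev/null so nothing stays tied to the terminal.
  nohup "$@" >"$log" 2>&1 </dev/null &
  # Print the PID of the background process for later checks.
  echo "$!"
}

# Example usage, after `npm run build`:
#   pid=$(start_detached gantt-board.log npm start)
#
# With PM2 installed, a comparable command is:
#   pm2 start npm --name gantt-board -- start
```

`nohup` only protects against hangups; it does not restart a process that crashes, which is where a process manager such as PM2 comes in.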
### Long-term (Reliability)
1. Docker containers with restart policies
2. Systemd services with auto-restart (on macOS, the equivalent is a launchd job)
3. Reverse proxy (nginx) with health checks

## Next Steps
1. Monitor logs for next 24-48 hours
2. Check if sites die overnight (SSH timeout test)
3. If memory-related, switch to production builds