# Root Cause Analysis: Why Websites Die

**Date:** 2026-02-18
**Task:** #7 - Investigate root cause of web app crashes

## Current Status
All 3 sites are **UP and healthy** at time of investigation.

## System Analysis

### 1. File Descriptor Limits
```
Current ulimit (open files): 1,048,575 (very high; unlikely to be the issue)
```

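The shell's `ulimit` is only part of the picture; what matters is how many descriptors each server actually holds. The following is a rough sketch for comparing the two, assuming `lsof` is installed (it ships with macOS). The helper names `count_fd_lines`, `count_open_fds`, and `fd_usage_percent` are hypothetical and not part of any existing script.

```bash
# Count numbered descriptors in `lsof -p` output read from stdin.
# The FD column (4th) holds values like "0u" or "21r" for real descriptors;
# entries such as "cwd" or "txt" are not numbered descriptors and are skipped.
count_fd_lines() {
  awk 'NR > 1 && $4 ~ /^[0-9]/ { n++ } END { print n + 0 }'
}

# Count open descriptors for one PID (assumes lsof is available).
count_open_fds() {
  lsof -p "$1" 2>/dev/null | count_fd_lines
}

# Express descriptor usage as a percentage of a numeric limit.
# Prints "n/a" when the limit is not a positive number (e.g. "unlimited").
fd_usage_percent() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", used / limit * 100
  }'
}
```

For example, `fd_usage_percent "$(count_open_fds "$PID")" "$(ulimit -n)"` prints the share of the soft limit in use, where `$PID` is the process ID of one of the servers.
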
### 2. Active Processes
- gantt-board (port 3000): ✅ Running
- blog-backup (port 3003): ✅ Running  
- heartbeat-monitor (port 3005): ✅ Running

### 3. Memory Usage
Each Next.js dev server uses:
- ~400-550MB RAM (normal for dev mode)
- ~0.8-1.1% of system memory
- Not excessive individually, but it adds up across three servers

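Figures like these can be reproduced with `ps`, which reports resident set size (RSS) in KiB. This is a minimal sketch; `sum_rss_mb` and `rss_for_pattern_mb` are hypothetical helper names, and the pattern used to find the servers is an assumption to adjust.

```bash
# Sum RSS values (KiB, one per line on stdin) and print the total in MiB.
sum_rss_mb() {
  awk '{ total += $1 } END { printf "%.0f\n", total / 1024 }'
}

# Total RSS in MiB of every process whose command line matches a pattern.
# pgrep -f matches against the full command line.
rss_for_pattern_mb() {
  local pids
  pids=$(pgrep -f "$1" | paste -sd, -)
  [ -n "$pids" ] || { echo 0; return; }
  ps -o rss= -p "$pids" | sum_rss_mb
}
```

Calling `rss_for_pattern_mb "next dev"` would give the combined footprint, assuming the dev servers show `next dev` in their command line.
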
## Likely Root Causes

### Primary Suspect: **Next.js Dev Server Memory Leaks**
- Next.js dev mode (`npm run dev`) is not intended for long-running services
- The file watcher holds references to watched files
- Hot Module Replacement (HMR) can accumulate memory over time as modules are rebuilt
- **Recommendation:** Use production builds for long-running services

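A leak shows up as steady RSS growth rather than a high one-off reading, so one way to test this suspect is to sample a server's RSS at a fixed interval and compare the first and last samples. This is a sketch under that assumption; `sample_rss` and `rss_growth_percent` are hypothetical helpers, and the CSV layout is chosen here for illustration.

```bash
# Append "epoch_seconds,rss_kib" samples for one PID to a CSV file.
# Usage: sample_rss <pid> <csv_file> [interval_seconds] [sample_count]
sample_rss() {
  local pid="$1" out="$2" interval="${3:-60}" count="${4:-60}" i=0 rss
  while [ "$i" -lt "$count" ]; do
    rss=$(ps -o rss= -p "$pid" | tr -d ' ')
    [ -n "$rss" ] || break          # stop once the process has exited
    echo "$(date +%s),$rss" >> "$out"
    sleep "$interval"
    i=$((i + 1))
  done
}

# Print the percentage change between the first and last sample in the CSV.
rss_growth_percent() {
  awk -F, 'NR == 1 { first = $2 } { last = $2 }
           END { if (first > 0) printf "%.1f\n", (last - first) / first * 100 }' "$1"
}
```

Growth that keeps climbing over several hours of idle time would support the leak hypothesis; a flat line would point toward one of the other suspects.
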
### Secondary Suspects:

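1. **macOS Power Management**
   - App Nap can throttle background processes, and system sleep suspends them
   - If the machine sleeps, SSH sessions drop and their child processes are killed (see item 3)
   - **Check:** System Preferences > Energy Saver (on macOS 13 and later, System Settings > Energy Saver or Battery)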

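2. **File Watcher Limits**
   - Reported default macOS limit: 1280 watched files per process (figure not yet verified on this machine)
   - A large `node_modules` tree can exceed this
   - **Error:** `EMFILE: too many open files`
   - Less likely here, given the measured ulimit of 1,048,575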

3. **SSH Session Timeout**
   - Idle terminal sessions may be disconnected by a timeout
   - On disconnect, SIGHUP is sent to the session's child processes
   - **Solution:** Use `nohup` or `screen`/`tmux`

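4. **Memory Pressure Kills (Out of Memory)**
   - macOS has no Linux-style OOM killer, but severe memory pressure can still lead to large processes being killed
   - Combined ~1.2-1.65GB for all 3 sites, about 2.4-3.3% of system memory by the figures above, so this is unlikely on its own
   - **Check:** Console.app for "Out of memory" messages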

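Two of these suspects can be tested directly from the shell: starting a server with `nohup` rules out SIGHUP on disconnect, and searching its log for `EMFILE` confirms or rules out descriptor exhaustion. This is a minimal sketch; `start_detached` and `count_emfile_errors` are hypothetical helper names, and the log location is up to the caller.

```bash
# Start a command detached from the terminal so SIGHUP on disconnect is ignored.
# Usage: start_detached <logfile> <command> [args...]
start_detached() {
  local log="$1"
  shift
  nohup "$@" >> "$log" 2>&1 &
  echo $!   # PID of the background process
}

# Count log lines that mention descriptor exhaustion.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_emfile_errors() {
  grep -c 'EMFILE' "$1" || true
}
```

For the power-management suspect, `pmset -g` prints the current macOS power settings, and `caffeinate -i <command>` prevents idle sleep while the command runs.
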
## Monitoring Setup

Created: `/Users/mattbruce/.openclaw/workspace/monitor-processes.sh`
- Tracks CPU, memory, file descriptors
- Logs warnings for high usage
- Runs every 60 seconds

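The contents of the deployed script are not reproduced here. As an illustration only, one monitoring pass of this kind could look like the sketch below; `warn_if_above` and `check_process` are hypothetical names, and the thresholds (80% CPU, 1024 MiB RSS) are assumptions rather than the script's real values.

```bash
# Emit a timestamped warning when a metric exceeds its threshold.
# awk does the comparison so decimal values such as 87.5 work.
warn_if_above() {
  local label="$1" value="$2" limit="$3"
  if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v > l) }'; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') WARN $label=$value exceeds $limit"
  fi
}

# One monitoring pass for a single PID: CPU percent and RSS in MiB.
check_process() {
  local pid="$1" cpu rss_kib
  cpu=$(ps -o %cpu= -p "$pid" | tr -d ' ')
  rss_kib=$(ps -o rss= -p "$pid" | tr -d ' ')
  [ -n "$rss_kib" ] || { echo "PID $pid not running"; return 1; }
  warn_if_above "cpu_percent" "$cpu" 80            # assumed threshold
  warn_if_above "rss_mib" "$((rss_kib / 1024))" 1024  # assumed threshold
}
```

A wrapper could call `check_process` for each server PID and sleep 60 seconds between passes, appending the output to a log file.
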
## Recommendations

### Immediate (Monitoring)
✅ Cron job runs every 10 min and auto-restarts any site that is down
✅ Process monitoring script deployed

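The health-check half of such a cron job can be sketched as follows, using the three ports listed under Active Processes. `is_up` and `check_sites` are hypothetical names, and the restart step is left out because the actual restart commands are not given here.

```bash
# Return 0 if anything answers HTTP on the given local port within 5 seconds.
# Any HTTP response counts as "up", including error status codes.
is_up() {
  curl --silent --output /dev/null --max-time 5 "http://localhost:$1/"
}

# Check each site and print the names of those that did not respond,
# so a wrapper script can restart them.
check_sites() {
  local entry name port
  for entry in gantt-board:3000 blog-backup:3003 heartbeat-monitor:3005; do
    name=${entry%%:*}
    port=${entry##*:}
    is_up "$port" || echo "$name"
  done
}
```

A crontab entry of the form `*/10 * * * * /path/to/check-sites.sh` (path hypothetical) would run the check every 10 minutes.
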
### Short-term (Stability)
1. Use production builds instead of dev mode:
   ```bash
   npm run build
   npm start
   ```
2. Run the servers under a process manager such as PM2 or `forever`
3. Use `nohup` (or `screen`/`tmux`) so an SSH disconnect does not kill the servers

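Combining the first two items, a production build can be started under PM2 roughly as sketched below. `start_site` is a hypothetical helper; the app directories are not listed in this report, and the sketch assumes each project's `start` script honors the `PORT` environment variable (as `next start` does by default).

```bash
# Build a Next.js app and run its production server under PM2.
# Usage: start_site <name> <app_dir> <port>
start_site() {
  local name="$1" dir="$2" port="$3"
  (
    cd "$dir" || exit 1
    npm run build || exit 1
    # PM2 runs "npm start"; arguments after "--" are passed to npm.
    PORT="$port" pm2 start npm --name "$name" -- start
  )
}
```

For example, `start_site blog-backup /path/to/blog-backup 3003` (path hypothetical). Afterwards `pm2 list` shows process status, and `pm2 save` records the process list so it can be restored later.
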
### Long-term (Reliability)
1. Docker containers with restart policies
2. Systemd services with auto-restart (on macOS, the equivalent is a launchd service with `KeepAlive`)
3. Reverse proxy (nginx) with health checks

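Since these sites run on macOS, where systemd is not available, the auto-restart option would most likely be a launchd agent. The sketch below prints a property list for one site; `make_launchd_plist` is a hypothetical helper, the label and paths are placeholders, and the `PATH` entry assumes `node` is installed in the same directory as `npm`.

```bash
# Print a launchd property list that keeps `npm start` running in app_dir.
# Usage: make_launchd_plist <label> <app_dir> <absolute_path_to_npm>
make_launchd_plist() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>$1</string>
  <key>WorkingDirectory</key><string>$2</string>
  <key>ProgramArguments</key>
  <array><string>$3</string><string>start</string></array>
  <key>EnvironmentVariables</key>
  <dict><key>PATH</key><string>$(dirname "$3"):/usr/bin:/bin</string></dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
</dict>
</plist>
EOF
}
```

The output would be saved under `~/Library/LaunchAgents/` with a `.plist` extension and loaded with `launchctl load <file>`. With `KeepAlive` set, launchd restarts the job whenever it exits.
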
## Next Steps
1. Monitor logs for the next 24-48 hours
2. Check whether the sites die overnight (SSH timeout test)
3. If memory-related, switch to production builds