
# Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

## 1. Project Summary

- **Goal:** Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
- **Primary User:** Admin (self, collection owner)
- **Initial Interface:** Command Line (CLI) with print/logging and JSON output
- **Future Expansion:** Optional web UI for filtering, review, and playback

---

## 2. Architectural Priorities

### 2.1 Code Organization Principles

**TOP PRIORITY:** The codebase must be built with the following architectural principles from the beginning:

- **True Separation of Concerns:**
  - Many small files with focused responsibilities
  - Each module/class should have a single, well-defined purpose
  - Avoid monolithic files with mixed responsibilities

- **Constants and Enums:**
  - Create constants, enums, and configuration objects to avoid duplicate code or values
  - Centralize magic numbers, strings, and configuration values
  - Use enums for type safety and clarity

- **Readability and Maintainability:**
  - Code should be self-documenting with clear naming conventions
  - Easy to understand, extend, and refactor
  - Consistent patterns throughout the codebase

- **Extensibility:**
  - Design for future growth and feature additions
  - Modular architecture that allows easy integration of new components
  - Clear interfaces between modules

- **Refactorability:**
  - Code structure should make future refactoring straightforward
  - Minimize coupling between components
  - Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
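
As a minimal sketch of the constants-and-enums principle, file types and skip reasons could be centralized in one small module. The names `FileType`, `SkipReason`, and `file_type_from_path`, as well as the `lower_priority_channel` value, are hypothetical and not taken from the existing codebase.

```python
from enum import Enum
from pathlib import PurePath
from typing import Optional


class FileType(Enum):
    """Supported karaoke file types, keyed by file extension."""
    MP4 = ".mp4"
    MP3 = ".mp3"
    CDG = ".cdg"


class SkipReason(str, Enum):
    """Reasons written to the skip list (illustrative values only)."""
    DUPLICATE = "duplicate"
    LOWER_PRIORITY_CHANNEL = "lower_priority_channel"


def file_type_from_path(path: str) -> Optional[FileType]:
    """Return the FileType for a path, or None for unsupported extensions."""
    suffix = PurePath(path).suffix.lower()
    try:
        return FileType(suffix)
    except ValueError:
        return None
```

Keeping the extension strings inside the enum means the matching and reporting modules never need to repeat literals such as `".mp4"`.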

---

## 3. Data Handling & Matching Logic

### 3.1 Input

- Reads from `/data/allSongs.json`
- Each song includes at least:
  - `artist`, `title`, and `path`, plus ID3 tag info and, for MP4s, `channel`
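
The input step could look roughly like the sketch below. It assumes the file is a top-level JSON array of song objects, which the PRD does not state; `load_songs` and `split_valid` are hypothetical helper names.

```python
import json
from pathlib import Path

# Fields every song record is expected to carry, per the list above.
REQUIRED_FIELDS = ("artist", "title", "path")


def split_valid(songs):
    """Separate records that have all required fields from those that do not."""
    valid, invalid = [], []
    for song in songs:
        if all(song.get(field) for field in REQUIRED_FIELDS):
            valid.append(song)
        else:
            invalid.append(song)
    return valid, invalid


def load_songs(json_path):
    """Read the song library (assumed to be a JSON array) and validate it."""
    with Path(json_path).open(encoding="utf-8") as handle:
        songs = json.load(handle)
    return split_valid(songs)
```

Records missing a required field are returned separately rather than dropped, so they can be reported instead of silently ignored.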

### 3.2 Song Matching

- **Primary keys:** `artist` + `title`
  - Configurable fuzzy matching (can be enabled or disabled, with an adjustable similarity threshold)
  - Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
- **File type detection:** Use file extension from `path` (`.mp3`, `.cdg`, `.mp4`)
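
One way to sketch the matching rules above uses only the standard library. The fuzzy-matching library is still an open question in this PRD, so `difflib.SequenceMatcher` here is a stand-in, and the delimiter set, the `0.9` default threshold, and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Delimiters that separate multiple artists (assumed set; extend as needed).
ARTIST_SPLIT = re.compile(r"\s*(?:,|&|\b(?:featuring|feat|ft)\b\.?)\s*", re.IGNORECASE)


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def split_artists(artist):
    """Split a multi-artist string into normalized artist names."""
    return [normalize(part) for part in ARTIST_SPLIT.split(artist) if part.strip()]


def match_key(song):
    """Build the artist + title key used for comparison."""
    return f"{normalize(song['artist'])}|{normalize(song['title'])}"


def is_match(song_a, song_b, fuzzy=True, threshold=0.9):
    """Exact match on normalized keys, with optional fuzzy fallback."""
    key_a, key_b = match_key(song_a), match_key(song_b)
    if key_a == key_b:
        return True
    if not fuzzy:
        return False
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold
```

With fuzzy matching disabled, only songs whose normalized artist and title are identical are treated as duplicates; enabling it lets small spelling differences through at the cost of possible false positives.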

### 3.3 Channel Priority (for MP4s)

- **Configurable folder names:**
  - Set in `/config/config.json` as an array of folder names
  - Order = priority (first = highest priority)
  - Tool searches for these folder names within the song's `path` property
  - Songs without matching folder names are marked for manual review
- **File type priority:** MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
- **CDG/MP3 pairing:** CDG and MP3 files with the same base filename are treated as a single karaoke song unit
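
The channel lookup described above might be sketched as follows. The config key `channel_priorities`, the placeholder folder names, and the choice to compare whole folder names case-insensitively (rather than a plain substring search) are all assumptions.

```python
import json
import re

# Stand-in for /config/config.json; the key name and values are placeholders.
CONFIG_TEXT = '{"channel_priorities": ["Channel A", "Channel B"]}'


def channel_rank(path, priorities):
    """Return the priority index of the first configured folder in path.

    Lower index means higher priority. None means no configured folder
    was found, so the song should be flagged for manual review.
    """
    # Split on either slash style and ignore the final component (the file name).
    folders = [part.lower() for part in re.split(r"[\\/]+", path)[:-1]]
    for rank, folder_name in enumerate(priorities):
        if folder_name.lower() in folders:
            return rank
    return None


priorities = json.loads(CONFIG_TEXT)["channel_priorities"]
```

Because the array order is the priority order, reordering entries in the config file is enough to change which channel wins.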

---

## 4. Output & Reporting

### 4.1 Skip List

- **Format:** JSON (`/data/skipSongs.json`)
  - List of file paths to skip in future imports
  - Optionally: “reason” field (e.g., `{"path": "...", "reason": "duplicate"}`)
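
A minimal sketch of producing the skip list in either shape is shown below. It assumes the file is a plain JSON array of path strings when no reason is recorded and an array of `{"path", "reason"}` objects otherwise; `build_skip_entries` and `write_skip_list` are hypothetical names.

```python
import json


def build_skip_entries(paths_with_reasons, include_reason=True):
    """Turn (path, reason) tuples into skip-list entries."""
    if include_reason:
        return [{"path": path, "reason": reason} for path, reason in paths_with_reasons]
    return [path for path, _ in paths_with_reasons]


def write_skip_list(entries, out_path):
    """Write the skip list as pretty-printed UTF-8 JSON."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2, ensure_ascii=False)
```

Passing `ensure_ascii=False` keeps non-ASCII artist and title characters readable in the output file.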

### 4.2 CLI Reporting

- **Summary:** Total songs, duplicates found, and a breakdown by file type
- **Verbose per-song output:** Only for matches/duplicates (not every song)
- **Configurable verbosity:** Set via CLI flag or config file
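
The reporting behavior above could be sketched like this. The flag name `--verbose`, the output wording, and the helper names are assumptions rather than the tool's actual interface.

```python
import argparse
import os
from collections import Counter


def build_parser():
    """Create a parser with a single verbosity switch (assumed flag name)."""
    parser = argparse.ArgumentParser(description="Karaoke library cleanup (sketch)")
    parser.add_argument("--verbose", action="store_true",
                        help="print one line per duplicate in addition to the summary")
    return parser


def format_report(songs, duplicates, verbose=False):
    """Return report lines; duplicates is a list of (kept_path, skipped_path)."""
    type_counts = Counter(os.path.splitext(song["path"])[1].lower() for song in songs)
    lines = [f"Total songs: {len(songs)}", f"Duplicates found: {len(duplicates)}"]
    lines += [f"  {ext or '(none)'}: {count}" for ext, count in sorted(type_counts.items())]
    if verbose:
        # Per-song detail is limited to duplicates, never the whole library.
        lines += [f"SKIP {skipped} (kept {kept})" for kept, skipped in duplicates]
    return lines
```

A caller would typically run `args = build_parser().parse_args()` and then print each line from `format_report(songs, duplicates, verbose=args.verbose)`.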

### 4.3 Manual Review (Web UI)

- **Interactive Web Interface**: Table/grid view for ambiguous/complex cases
- **Media Preview**: Ability to preview media before making a selection
- **Bulk Actions**: Select multiple items for batch operations
- **Real-time Filtering**: Search and filter capabilities
- **Responsive Design**: Works on desktop and mobile devices
- **Easy Startup**: Simple script (`start_web_ui.py`) with dependency checking

---

## 5. Features & Edge Cases

- **Batch Processing:**
  - E.g., "Auto-skip all but highest-priority channel for each song"
  - Manual review as CLI flag (future: always in web UI)
- **Edge Cases:**
  - Songs that exist in more than two versions or formats
  - Support for keeping multiple versions per song (configurable/manual)
- **Non-destructive:** Never deletes or moves files, only generates skip list and reports
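
The "auto-skip all but highest-priority channel" batch rule could be sketched as below. The `keep` parameter illustrates the configurable "keep multiple versions" case; the function names and the folder-matching rule are assumptions, and nothing is deleted or moved.

```python
import re


def channel_rank(path, priorities):
    """Index of the first configured folder found in path, or None."""
    folders = [part.lower() for part in re.split(r"[\\/]+", path)[:-1]]
    for rank, folder_name in enumerate(priorities):
        if folder_name.lower() in folders:
            return rank
    return None


def auto_skip(group, priorities, keep=1):
    """Split one duplicate group into (kept, skipped, needs_review) path lists."""
    ranked, needs_review = [], []
    for path in group:
        rank = channel_rank(path, priorities)
        if rank is None:
            # No configured channel folder: leave the decision to a human.
            needs_review.append(path)
        else:
            ranked.append((rank, path))
    ranked.sort()  # best (lowest) rank first
    kept = [path for _, path in ranked[:keep]]
    skipped = [path for _, path in ranked[keep:]]
    return kept, skipped, needs_review
```

Only the `skipped` paths would be written to the skip list; files in `needs_review` stay untouched until they are reviewed manually.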

---

## 6. Tech Stack & Organization

- **CLI Language:** Python
- **Config:** JSON (channel priorities, settings)
- **Current Folder Structure:**
```
KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   ├── skipSongs.json         # Output: Generated skip list
│   └── reports/               # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── web/                       # Web UI for manual review
│   ├── app.py                 # Flask web application
│   └── templates/
│       └── index.html         # Web interface template
├── start_web_ui.py            # Web UI startup script
├── test_tool.py               # Validation and testing script
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── PRD.md                     # Product Requirements Document
└── README.md                  # Project documentation
```

---

## 7. Web UI Implementation

### 7.1 Current Web UI Features
- **Interactive Table View**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Search & Filter**: Real-time search across artists, titles, and paths
- **Responsive Design**: Mobile-friendly interface
- **Easy Startup**: Automated dependency checking and browser launch

### 7.2 Web UI Architecture
- **Flask Backend**: Lightweight web server (`web/app.py`)
- **HTML Template**: Modern, responsive interface (`web/templates/index.html`)
- **Startup Script**: Dependency management and server startup (`start_web_ui.py`)
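
The dependency check performed by the startup script might resemble the sketch below. It is not the contents of `start_web_ui.py`; the package list and function names are assumptions, and the real requirements live in `requirements.txt`.

```python
import importlib.util

# Assumed minimum for the web UI; the authoritative list is requirements.txt.
REQUIRED_PACKAGES = ("flask",)


def find_missing(modules):
    """Return the names of modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def check_dependencies(modules=REQUIRED_PACKAGES):
    """Print guidance and return True only when every dependency is available."""
    missing = find_missing(modules)
    if missing:
        print(f"Missing dependencies: {', '.join(missing)}")
        print("Install them with: pip install -r requirements.txt")
        return False
    return True
```

Checking with `find_spec` avoids importing the packages just to test for their presence, so the check stays fast and has no side effects.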

### 7.3 Future Web UI Enhancements
- Embedded media player for audio/video preview
- Real-time configuration editing
- Advanced filtering and sorting options
- Export capabilities for manual selections

---

## 8. Open Questions (for future refinement)

- Fuzzy matching library/thresholds?
- Best parsing rules for multi-artist/feat. strings?
- Any alternate export formats needed?
- Temporary/partial skip support for "under review" songs?

---

## 9. Implementation Status

### 🎯 Current Implementation
The tool has been successfully implemented with the following components:

**Core Modules:**
- `cli/main.py` - Main CLI application with argument parsing
- `cli/matching.py` - Song matching and deduplication logic
- `cli/report.py` - Report generation and output formatting
- `cli/utils.py` - Utility functions for file operations and data processing

**Configuration:**
- `config/config.json` - Configurable settings for channel priorities, matching rules, and output options

**Features Implemented:**
- Multi-format support (MP3, CDG, MP4)
- **CDG/MP3 Pairing Logic**: Files with same base filename treated as single karaoke song units
- Channel priority system for MP4 files (based on folder names in path)
- Fuzzy matching support with configurable threshold
- Multi-artist parsing with various delimiters
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- Channel priority analysis and manual review identification
- Non-destructive operation (skip lists only)
- Verbose and dry-run modes
- Detailed duplicate analysis
- Skip list generation with metadata
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions

**File Type Priority System:**
1. **MP4 files** (with channel priority sorting)
2. **CDG/MP3 pairs** (treated as single units)
3. **Standalone MP3** files
4. **Standalone CDG** files
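
The pairing and ranking rules above can be illustrated with a short sketch. It assumes "same base filename" means the same path without its extension (so both files sit in the same folder); `TYPE_PRIORITY` and `build_units` are hypothetical names.

```python
import os
from collections import defaultdict

# Lower number = higher priority, mirroring the numbered list above.
TYPE_PRIORITY = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}


def build_units(paths):
    """Group paths into karaoke units and sort them by file type priority."""
    by_base = defaultdict(dict)
    for path in paths:
        base, ext = os.path.splitext(path)
        by_base[base.lower()][ext.lower().lstrip(".")] = path

    units = []
    for files in by_base.values():
        if "mp3" in files and "cdg" in files:
            # A CDG and MP3 with the same base name form one song unit.
            units.append({"kind": "cdg_mp3_pair",
                          "paths": [files.pop("cdg"), files.pop("mp3")]})
        for ext, path in files.items():
            units.append({"kind": ext, "paths": [path]})

    # Unknown extensions sort after every known kind.
    return sorted(units, key=lambda unit: TYPE_PRIORITY.get(unit["kind"], len(TYPE_PRIORITY)))
```

Treating a pair as one unit keeps the CDG and MP3 halves together, so neither is reported as a duplicate of the other.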

**Performance Results:**
- Successfully processed 37,015 songs
- Identified 12,424 duplicates (33.6% duplicate rate)
- Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
- Optimized for large datasets with progress indicators
- **Enhanced Analysis**: Generated 7 detailed reports with actionable insights
- **Bug Fix**: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)
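
The figures above reconcile: 12,424 identified duplicates minus 1,426 repeated entries leaves 10,998 unique files, and 12,424 of 37,015 songs is 33.6%. A minimal sketch of removing repeated paths from a skip list follows; it is an illustration of the idea, not the project's actual fix, and `dedupe_skip_entries` is a hypothetical name.

```python
def dedupe_skip_entries(entries):
    """Keep the first entry for each path, preserving the original order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["path"] not in seen:
            seen.add(entry["path"])
            unique.append(entry)
    return unique


# Sanity checks on the reported numbers.
assert 12_424 - 1_426 == 10_998
assert round(12_424 / 37_015 * 100, 1) == 33.6
```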

### 📋 Next Steps Checklist

#### ✅ **Completed**
- [x] Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
- [x] Print CLI summary reports (with verbosity control)
- [x] Implement config file support for channel priority
- [x] Organize folder/file structure for easy expansion
- [x] Implement CDG/MP3 pairing logic for accurate duplicate detection
- [x] Generate comprehensive skip list with metadata
- [x] Optimize performance for large datasets (37,000+ songs)
- [x] Add progress indicators and error handling
- [x] Generate detailed analysis reports (`--save-reports` functionality)
- [x] Create web UI for manual review of ambiguous cases
- [x] Add test tool for validation and debugging
- [x] Create startup script for web UI with dependency checking
- [x] Add comprehensive .gitignore file
- [x] Update documentation with required data file information

#### 🎯 **Next Priority Items**
- [ ] Analyze MP4 files without channel priorities to suggest new folder names
- [ ] Add support for additional file formats if needed
- [ ] Implement batch processing capabilities
- [ ] Create integration scripts for karaoke software
- [ ] Add unit tests for core functionality
- [ ] Implement audio fingerprinting for better duplicate detection