# Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

## 1. Project Summary

- **Goal:** Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
- **Primary User:** Admin (self, collection owner)
- **Initial Interface:** Command Line (CLI) with print/logging and JSON output
- **Future Expansion:** Optional web UI for filtering, review, and playback

---

## 2. Architectural Priorities

### 2.1 Code Organization Principles

**TOP PRIORITY:** The codebase must be built with the following architectural principles from the beginning:

- **True Separation of Concerns:**
  - Many small files with focused responsibilities
  - Each module/class should have a single, well-defined purpose
  - Avoid monolithic files with mixed responsibilities
- **Constants and Enums:**
  - Create constants, enums, and configuration objects to avoid duplicate code or values
  - Centralize magic numbers, strings, and configuration values
  - Use enums for type safety and clarity
- **Readability and Maintainability:**
  - Code should be self-documenting with clear naming conventions
  - Easy to understand, extend, and refactor
  - Consistent patterns throughout the codebase
- **Extensibility:**
  - Design for future growth and feature additions
  - Modular architecture that allows easy integration of new components
  - Clear interfaces between modules
- **Refactorability:**
  - Code structure should make future refactoring straightforward
  - Minimize coupling between components
  - Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
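As a small illustration of the "Constants and Enums" principle, the sketch below centralizes the file extensions and data paths that this PRD mentions. The names `FileType`, `INPUT_SONGS_PATH`, `SKIP_LIST_PATH`, and `detect_file_type` are hypothetical and chosen only for this example; they are not taken from the existing codebase.

```python
from enum import Enum
from pathlib import PureWindowsPath
from typing import Optional

# Centralized paths (values from sections 3.1 and 4.1; constant names are hypothetical).
INPUT_SONGS_PATH = "/data/allSongs.json"
SKIP_LIST_PATH = "/data/skipSongs.json"


class FileType(Enum):
    """File formats named in this PRD, keyed by lowercase extension."""

    MP4 = ".mp4"
    MP3 = ".mp3"
    CDG = ".cdg"


def detect_file_type(path: str) -> Optional[FileType]:
    """Return the FileType for a song path, or None for unknown extensions.

    PureWindowsPath is used because it treats both "/" and "\\" as
    separators, so the same call works for either path style.
    """
    suffix = PureWindowsPath(path).suffix.lower()
    try:
        return FileType(suffix)
    except ValueError:
        return None


if __name__ == "__main__":
    print(detect_file_type("Karaoke/Some Channel/Artist - Title.MP4"))
    print(detect_file_type("notes.txt"))
```

Keeping the extensions in one enum means matching, pairing, and reporting code can compare `FileType` members instead of repeating string literals.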
### 2.2 Documentation Requirements

**CRITICAL REQUIREMENT:** All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:

- **PRD.md Updates:** Any changes to project requirements, architecture, or functionality must be reflected in this document
- **README.md Updates:** User-facing features, installation instructions, or usage changes must be documented
- **Code Comments:** Significant logic changes should include inline documentation
- **API Documentation:** New endpoints, functions, or interfaces must be documented

**Documentation Update Checklist:**

- [ ] Update PRD.md with any architectural or requirement changes
- [ ] Update README.md with new features, installation steps, or usage instructions
- [ ] Add inline comments for complex logic or business rules
- [ ] Update any configuration examples or file structure documentation
- [ ] Review and update implementation status sections

This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.
### 2.3 Code Quality & Development Standards

**MANDATORY STANDARDS:** The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:

#### **Naming Conventions**

- **Files:** Use descriptive, lowercase names with underscores (`song_matcher.py`, `priority_manager.py`)
- **Classes:** PascalCase (`SongMatcher`, `PreferencesManager`)
- **Functions/Methods:** snake_case (`process_songs`, `get_priority_order`)
- **Constants:** UPPER_SNAKE_CASE (`MAX_FILE_SIZE`, `DEFAULT_CHANNEL_PRIORITY`)
- **Variables:** snake_case with descriptive names (`song_collection`, `duplicate_count`)

#### **Code Structure Standards**

- **Function Length:** Maximum 50 lines per function (aim for 20-30 lines)
- **Class Length:** Maximum 300 lines per class (aim for 100-200 lines)
- **File Length:** Maximum 500 lines per file (aim for 200-400 lines)
- **Indentation:** 4 spaces (no tabs)
- **Line Length:** Maximum 120 characters
- **Import Organization:** Group imports: standard library, third-party, local (alphabetical within groups)

#### **Error Handling & Logging**

- **Exception Handling:** Always use specific exception types, never bare `except:`
- **Logging:** Use Python's `logging` module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
- **User Feedback:** Provide clear, actionable error messages
- **Graceful Degradation:** Handle missing files/configs gracefully with sensible defaults

#### **Type Hints & Documentation**

- **Type Hints:** Use Python type hints for all function parameters and return values
- **Docstrings:** Include docstrings for all public functions, classes, and modules
- **Docstring Format:** Use Google-style docstrings with parameter descriptions
- **Complex Logic:** Add inline comments explaining business logic and algorithms

#### **Testing Standards**

- **Unit Tests:** Write unit tests for all business logic functions
- **Test Coverage:** Aim for 80%+ code coverage
- **Test Organization:** Mirror the source code structure in test files
- **Test Data:** Use fixtures and factories for test data, never hardcode test values
- **Integration Tests:** Test complete workflows and API endpoints

#### **Configuration Management**

- **Environment Variables:** Use environment variables for sensitive data (API keys, paths)
- **Config Validation:** Validate configuration on startup with clear error messages
- **Default Values:** Provide sensible defaults for all configuration options
- **Config Documentation:** Document all configuration options with examples

#### **Performance & Scalability**

- **Memory Efficiency:** Process large datasets in chunks, avoid loading everything into memory
- **Progress Indicators:** Show progress for long-running operations
- **Caching:** Implement appropriate caching for expensive operations
- **Async Operations:** Use async/await for I/O operations where beneficial

#### **Security Best Practices**

- **Input Validation:** Validate and sanitize all user inputs
- **File Operations:** Use `pathlib` for safe file path handling
- **JSON Safety:** Use `json.loads()` with proper error handling
- **No Hardcoded Secrets:** Never commit API keys, passwords, or sensitive data

#### **Version Control Standards**

- **Commit Messages:** Use conventional commit format (`feat:`, `fix:`, `docs:`, `refactor:`)
- **Branch Naming:** Use descriptive branch names (`feature/priority-management`, `fix/duplicate-detection`)
- **Pull Requests:** Require code review for all changes
- **Git Hooks:** Use pre-commit hooks for linting and formatting

#### **Dependency Management**

- **Requirements:** Keep `requirements.txt` updated with exact versions
- **Virtual Environments:** Always use virtual environments for development
- **Dependency Updates:** Regularly update dependencies and test compatibility
- **Minimal Dependencies:** Only include necessary dependencies, avoid bloat

#### **Code Review Checklist**

- [ ] Code follows naming conventions
- [ ] Functions are appropriately sized and focused
- [ ] Error handling is comprehensive
- [ ] Type hints and docstrings are present
- [ ] Tests are included for new functionality
- [ ] Configuration is properly validated
- [ ] No hardcoded values or secrets
- [ ] Performance considerations addressed
- [ ] Documentation is updated

These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.

---

## 3. Data Handling & Matching Logic

### 3.1 Input

- Reads from `/data/allSongs.json`
- Each song includes at least:
  - `artist`, `title`, `path`, (plus id3 tag info, `channel` for MP4s)

### 3.2 Song Matching

- **Primary keys:** `artist` + `title`
  - Fuzzy matching configurable (enabled/disabled with threshold)
  - Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
- **File type detection:** Use file extension from `path` (`.mp3`, `.cdg`, `.mp4`)

### 3.3 Channel Priority (for MP4s)

- **Configurable folder names:**
  - Set in `/config/config.json` as an array of folder names
  - Order = priority (first = highest priority)
  - Tool searches for these folder names within the song's `path` property
  - Songs without matching folder names are marked for manual review
- **File type priority:** MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
- **CDG/MP3 pairing:** CDG and MP3 files with the same base filename are treated as a single karaoke song unit

---

## 4. Output & Reporting

### 4.1 Skip List

- **Format:** JSON (`/data/skipSongs.json`)
  - List of file paths to skip in future imports
  - Optionally: “reason” field (e.g., `{"path": "...", "reason": "duplicate"}`)

### 4.2 CLI Reporting

- **Summary:** Total songs, duplicates found, types breakdown, etc.
- **Verbose per-song output:** Only for matches/duplicates (not every song)
- **Verbosity configurable:** (via CLI flag or config)

### 4.3 Manual Review (Web UI)

- **Interactive Web Interface**: Table/grid view for ambiguous/complex cases
- **Media Preview**: Ability to preview media before making a selection
- **Bulk Actions**: Select multiple items for batch operations
- **Real-time Filtering**: Search and filter capabilities
- **Responsive Design**: Works on desktop and mobile devices
- **Easy Startup**: Simple script (`start_web_ui.py`) with dependency checking

---

## 5. Features & Edge Cases

- **Batch Processing:**
  - E.g., "Auto-skip all but highest-priority channel for each song"
  - Manual review as CLI flag (future: always in web UI)
- **Edge Cases:**
  - Multiple versions (>2 formats)
  - Support for keeping multiple versions per song (configurable/manual)
- **Non-destructive:** Never deletes or moves files, only generates skip list and reports

---

## 6. Tech Stack & Organization

- **CLI Language:** Python
- **Config:** JSON (channel priorities, settings)
- **Current Folder Structure:**

```
KaraokeMerge/
├── data/
│   ├── allSongs.json              # Input: Your song library data
│   ├── skipSongs.json             # Output: Generated skip list
│   └── reports/                   # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json                # Configuration settings
├── cli/
│   ├── main.py                    # Main CLI application
│   ├── matching.py                # Song matching logic
│   ├── report.py                  # Report generation
│   └── utils.py                   # Utility functions
├── web/                           # Web UI for manual review
│   ├── app.py                     # Flask web application
│   └── templates/
│       └── index.html             # Web interface template
├── start_web_ui.py                # Web UI startup script
├── test_tool.py                   # Validation and testing script
├── requirements.txt               # Python dependencies
├── .gitignore                     # Git ignore rules
├── PRD.md                         # Product Requirements Document
└── README.md                      # Project documentation
```

---

## 7. Web UI Implementation

### 7.1 Current Web UI Features

- **Interactive Table View**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Search & Filter**: Real-time search across artists, titles, and paths
- **Responsive Design**: Mobile-friendly interface
- **Easy Startup**: Automated dependency checking and browser launch

### 7.2 Web UI Architecture

- **Flask Backend**: Lightweight web server (`web/app.py`)
- **HTML Template**: Modern, responsive interface (`web/templates/index.html`)
- **Startup Script**: Dependency management and server startup (`start_web_ui.py`)

### 7.3 Future Web UI Enhancements

- Embedded media player for audio/video preview
- Real-time configuration editing
- Advanced filtering and sorting options
- Export capabilities for manual selections

---

## 8. Open Questions (for future refinement)

- Fuzzy matching library/thresholds?
- Best parsing rules for multi-artist/feat. strings?
- Any alternate export formats needed?
- Temporary/partial skip support for "under review" songs?

---

## 9. Implementation Status

### ✅ Completed Features

- [x] Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
- [x] Print CLI summary reports (with verbosity control)
- [x] Implement config file support for channel priority
- [x] Organize folder/file structure for easy expansion

### 🎯 Current Implementation

The tool has been successfully implemented with the following components:

**Core Modules:**

- `cli/main.py` - Main CLI application with argument parsing
- `cli/matching.py` - Song matching and deduplication logic
- `cli/report.py` - Report generation and output formatting
- `cli/utils.py` - Utility functions for file operations and data processing

**Configuration:**

- `config/config.json` - Configurable settings for channel priorities, matching rules, and output options

**Features Implemented:**

- Multi-format support (MP3, CDG, MP4)
- **CDG/MP3 Pairing Logic**: Files with same base filename treated as single karaoke song units
- Channel priority system for MP4 files (based on folder names in path)
- Fuzzy matching support with configurable threshold
- Multi-artist parsing with various delimiters
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- Channel priority analysis and manual review identification
- Non-destructive operation (skip lists only)
- Verbose and dry-run modes
- Detailed duplicate analysis
- Skip list generation with metadata
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions

**File Type Priority System:**

1. **MP4 files** (with channel priority sorting)
2. **CDG/MP3 pairs** (treated as single units)
3. **Standalone MP3** files
4. **Standalone CDG** files

**Performance Results:**

- Successfully processed 37,015 songs
- Identified 12,424 duplicates (33.6% duplicate rate)
- Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
- Optimized for large datasets with progress indicators
- **Enhanced Analysis**: Generated 7 detailed reports with actionable insights
- **Bug Fix**: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)

### 📋 Next Steps Checklist

#### ✅ **Completed**

- [x] Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
- [x] Print CLI summary reports (with verbosity control)
- [x] Implement config file support for channel priority
- [x] Organize folder/file structure for easy expansion
- [x] Implement CDG/MP3 pairing logic for accurate duplicate detection
- [x] Generate comprehensive skip list with metadata
- [x] Optimize performance for large datasets (37,000+ songs)
- [x] Add progress indicators and error handling
- [x] Generate detailed analysis reports (`--save-reports` functionality)
- [x] Create web UI for manual review of ambiguous cases
- [x] Add test tool for validation and debugging
- [x] Create startup script for web UI with dependency checking
- [x] Add comprehensive .gitignore file
- [x] Update documentation with required data file information

#### 🎯 **Next Priority Items**

- [ ] Analyze MP4 files without channel priorities to suggest new folder names
- [ ] Add support for additional file formats if needed
- [ ] Implement batch processing capabilities
- [ ] Create integration scripts for karaoke software
- [ ] Add unit tests for core functionality
- [ ] Implement audio fingerprinting for better duplicate detection
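
For reference, the file type priority order, CDG/MP3 pairing, and channel priority rules described in this document could be combined roughly as follows. This is a minimal sketch, not the logic in `cli/matching.py`: the function names, the `kind` labels, and the sample paths are hypothetical. It assumes the input paths have already been matched to a single song (same `artist` + `title`), that CDG and MP3 files pair only when they share both folder and base filename, and that a channel folder name must equal a whole path component.

```python
from collections import defaultdict
from pathlib import PureWindowsPath
from typing import Dict, List, Tuple

# Lower rank = higher priority: MP4 > CDG/MP3 pair > standalone MP3 > standalone CDG.
FORMAT_RANK = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}


def group_into_units(paths: List[str]) -> List[dict]:
    """Group paths into song units, pairing CDG and MP3 files with the same base name."""
    by_stem: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path in paths:
        parsed = PureWindowsPath(path)  # accepts both "/" and "\\" separators
        stem_key = str(parsed.with_suffix("")).lower()  # folder + base filename
        by_stem[stem_key][parsed.suffix.lower()] = path

    units: List[dict] = []
    for files in by_stem.values():
        if ".mp4" in files:
            units.append({"kind": "mp4", "paths": [files[".mp4"]]})
        if ".cdg" in files and ".mp3" in files:
            units.append({"kind": "cdg_mp3_pair", "paths": [files[".cdg"], files[".mp3"]]})
        elif ".mp3" in files:
            units.append({"kind": "mp3", "paths": [files[".mp3"]]})
        elif ".cdg" in files:
            units.append({"kind": "cdg", "paths": [files[".cdg"]]})
    return units


def channel_rank(path: str, channel_priority: List[str]) -> int:
    """Index of the first configured folder found in the path; list length if none match."""
    parts = {part.lower() for part in PureWindowsPath(path).parts}
    for rank, folder in enumerate(channel_priority):
        if folder.lower() in parts:
            return rank
    return len(channel_priority)


def split_keep_and_skip(paths: List[str], channel_priority: List[str]) -> Tuple[dict, List[dict]]:
    """Keep the highest-priority unit for one song; return the rest as skip entries.

    Expects a non-empty list of paths that all belong to the same song.
    """

    def sort_key(unit: dict) -> Tuple[int, int]:
        # Channel priority only breaks ties between MP4 files.
        channel = channel_rank(unit["paths"][0], channel_priority) if unit["kind"] == "mp4" else 0
        return (FORMAT_RANK[unit["kind"]], channel)

    ranked = sorted(group_into_units(paths), key=sort_key)
    keep, rest = ranked[0], ranked[1:]
    skip = [{"path": p, "reason": "duplicate"} for unit in rest for p in unit["paths"]]
    return keep, skip


if __name__ == "__main__":
    sample = [
        "Karaoke/ChannelB/Artist - Song.mp4",
        "Karaoke/ChannelA/Artist - Song.mp4",
        "Karaoke/Discs/Artist - Song.cdg",
        "Karaoke/Discs/Artist - Song.mp3",
    ]
    kept, skipped = split_keep_and_skip(sample, ["ChannelA", "ChannelB"])
    print(kept)
    print(skipped)
```

The skip entries use the optional `reason` field from section 4.1. Songs whose MP4 paths match no configured folder would still need to be routed to manual review, which this sketch leaves out.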