# Karaoke Song Library Cleanup Tool

A powerful command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.

## 🎯 Features

- **Smart Duplicate Detection**: Identifies duplicate songs by artist and title
- **MP3 Pairing Logic**: Automatically pairs CDG and MP3 files with the same base filename as single karaoke song units (CDG files are treated as MP3)
- **Multi-Format Support**: Handles MP3 and MP4 files with intelligent priority system
- **Channel Priority System**: Configurable priority for MP4 channels based on folder names in file paths
- **Non-Destructive**: Only generates skip lists - never deletes or moves files
- **Detailed Reporting**: Comprehensive statistics and analysis reports
- **Flexible Configuration**: Customizable matching rules and output options
- **Performance Optimized**: Handles large libraries (37,000+ songs) efficiently
- **Future-Ready**: Designed for easy expansion to web UI

## 📁 Project Structure

```
KaraokeMerge/
├── data/
│   ├── allSongs.json           # Input: Your song library data
│   ├── skipSongs.json          # Output: Generated skip list
│   └── reports/                # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json             # Configuration settings
├── cli/
│   ├── main.py                 # Main CLI application
│   ├── matching.py             # Song matching logic
│   ├── report.py               # Report generation
│   └── utils.py                # Utility functions
├── web/                        # Web UI for manual review
│   ├── app.py                  # Flask web application
│   └── templates/
│       └── index.html          # Web interface template
├── start_web_ui.py             # Web UI startup script
├── test_tool.py                # Validation and testing script
├── requirements.txt            # Python dependencies
├── .gitignore                  # Git ignore rules
├── PRD.md                      # Product Requirements Document
└── README.md                   # This file
```

## 🚀 Quick Start

### Prerequisites

- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)

### Required Data File

**Important**: You need to provide your own `data/allSongs.json` file. This file is excluded from version control due to its large size and personal nature.

**Sample `allSongs.json` format:**

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  },
  {
    "artist": "Queen",
    "title": "Bohemian Rhapsody",
    "path": "z://MP4\\Sing King Karaoke\\Queen - Bohemian Rhapsody (Karaoke Version).mp4",
    "guid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "disabled": false,
    "favorite": true
  }
]
```

**Required fields:**

- `artist`: Song artist name
- `title`: Song title
- `path`: Full file path to the song file
- `guid`: Unique identifier for the song

**Optional fields:**

- `disabled`: Boolean indicating if song is disabled (default: false)
- `favorite`: Boolean indicating if song is favorited (default: false)

### Installation

1. Clone or download this repository
2. Navigate to the project directory
3. Ensure your `data/allSongs.json` file is in place

### Basic Usage

```bash
# Run with default settings
python cli/main.py

# Enable verbose output
python cli/main.py --verbose

# Dry run (analyze without generating skip list)
python cli/main.py --dry-run

# Save detailed reports
python cli/main.py --save-reports

# Test the tool functionality
python test_tool.py

# Start the web UI for manual review
python start_web_ui.py
```

### Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose, -v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |

## 📊 Data Format

### Input Format (`allSongs.json`)

Your song data should be a JSON array with objects containing at least these fields:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
```

### Output Format (`skipSongs.json`)

The tool generates a skip list with this structure:

```json
[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]
```

**Skip List Features:**

- **Metadata**: Each skip entry includes artist, title, and the path of the kept version
- **Reason Tracking**: Documents why each file was marked for skipping
- **Complete Information**: Provides full context for manual review if needed

## ⚙️ Configuration

Edit `config/config.json` to customize the tool's behavior:

### Channel Priorities (MP4 files)

```json
{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}
```

**Note**: Channel priorities are now folder names found in the song's `path` property. The tool searches for these exact folder names within the file path to determine priority.

### Matching Settings

```json
{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
```

### Output Settings

```json
{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}
```

## 🌐 Web UI for Manual Review

The project includes a web interface for interactive review of duplicate songs:

### Starting the Web UI

```bash
python start_web_ui.py
```

This script will:

- Check for required dependencies (Flask)
- Install missing dependencies automatically
- Validate required data files exist
- Start the web server
- Open your browser automatically

### Web UI Features

- **Interactive Table**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Real-time Search**: Filter by artist, title, or file path
- **Responsive Design**: Works on desktop and mobile devices
- **Detailed Information**: View full metadata for each duplicate

### Web UI Requirements

- Flask web framework (automatically installed if missing)
- Generated skip list data (`data/skipSongs.json`)
- Configuration file (`config/config.json`)

## 📈 Understanding the Output

### Summary Report

- **Total songs processed**: Total number of songs analyzed
- **Unique songs found**: Number of unique artist-title combinations
- **Duplicates identified**: Number of duplicate songs found
- **File type breakdown**: Distribution across MP3, CDG, MP4 formats
- **Channel breakdown**: MP4 channel distribution (if applicable)

### Skip List

The generated `skipSongs.json` contains paths to files that should be skipped during future imports.
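
As a rough sketch of how a downstream import step might consume this file, the snippet below loads the skip list and drops any candidate path that appears in it. The helper names `load_skip_paths` and `filter_import_candidates` are hypothetical and not part of this tool; the only assumption about the data is that every entry carries a `path` key, as in the output format shown earlier.

```python
import json


def load_skip_paths(skip_file):
    """Return the set of paths listed in a skip list file (hypothetical helper)."""
    with open(skip_file, encoding="utf-8") as f:
        entries = json.load(f)
    # Only the "path" key is read; other keys such as "reason" are ignored here.
    return {entry["path"] for entry in entries}


def filter_import_candidates(candidate_paths, skip_paths):
    """Keep candidates whose path is not on the skip list (hypothetical helper).

    Paths are compared as exact strings, so both sides must use the same
    drive prefix and separator style.
    """
    return [path for path in candidate_paths if path not in skip_paths]


# Example usage (assumes the skip list has already been generated):
#   skip_paths = load_skip_paths("data/skipSongs.json")
#   to_import = filter_import_candidates(all_candidate_paths, skip_paths)
```
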
Each entry includes:

- `path`: File path to skip
- `reason`: Why the file was marked for skipping (usually "duplicate")
- `artist` and `title`: Metadata of the skipped song
- `kept_version`: Path of the version that was kept instead

## 🔧 Advanced Features

### Multi-Artist Handling

The tool automatically handles songs with multiple artists using various delimiters:

- `feat.`, `ft.`, `featuring`
- `&`, `and`
- `,`, `;`, `/`

### File Type Priority System

The tool uses a tiered priority system to select the best version of each song:

1. **MP4 files are always preferred** when available
   - Searches for configured folder names within the file path
   - Sorts by configured priority order (first in list = highest priority)
   - Keeps the highest priority MP4 version
2. **CDG/MP3 pairs** are treated as single units
   - Automatically pairs CDG and MP3 files with the same base filename
   - Example: `song.cdg` + `song.mp3` = one complete karaoke song
   - Only considered if no MP4 files exist for the same artist/title
3. **Standalone files** are lowest priority
   - Standalone MP3 files (without matching CDG)
   - Standalone CDG files (without matching MP3)
4. **Manual review candidates**
   - Songs without matching folder names in channel priorities
   - Ambiguous cases requiring human decision

### CDG/MP3 Pairing Logic

The tool automatically identifies and pairs CDG/MP3 files:

- **Base filename matching**: Files with identical names but different extensions
- **Single unit treatment**: Paired files are considered one complete karaoke song
- **Accurate duplicate detection**: Prevents treating paired files as separate duplicates
- **Proper priority handling**: Ensures complete songs compete fairly with MP4 versions

### Enhanced Analysis & Reporting

Use `--save-reports` to generate comprehensive analysis files:

**📊 Enhanced Reports:**

- `enhanced_summary_report.txt`: Comprehensive analysis with detailed statistics
- `channel_optimization_report.txt`: Channel priority optimization suggestions
- `duplicate_pattern_report.txt`: Duplicate pattern analysis by artist, title, and channel
- `actionable_insights_report.txt`: Recommendations and actionable insights
- `analysis_data.json`: Raw analysis data for further processing

**📋 Legacy Reports:**

- `summary_report.txt`: Basic overall statistics
- `duplicate_details.txt`: Detailed duplicate analysis (verbose mode only)
- `skip_list_summary.txt`: Skip list breakdown
- `skip_songs_detailed.json`: Full skip data with metadata

**🔍 Analysis Features:**

- **Pattern Analysis**: Identifies the most duplicated artists, titles, and channels
- **Channel Optimization**: Suggests optimal channel priority order based on effectiveness
- **Storage Insights**: Quantifies space savings potential and duplicate distribution
- **Actionable Recommendations**: Provides specific suggestions for library optimization

## 🛠️ Development

### Project Structure for Expansion

The codebase is designed for easy expansion:

- **Modular Design**: Separate modules for matching, reporting, and utilities
- **Configuration-Driven**: Easy to modify behavior without code changes
- **Web UI Implementation**: Full web interface for manual review and bulk operations
- **Testing Framework**: Built-in test tool for validation and debugging
- **Dependency Management**: Automated setup and dependency checking

### Testing and Validation

Use the built-in test tool to validate your setup:

```bash
python test_tool.py
```

This will:

- Test all module imports
- Validate configuration loading
- Test with a sample of your song data
- Verify report generation
- Provide feedback on any issues

### Adding New Features

1. **New File Formats**: Add extensions to `config.json`
2. **New Matching Rules**: Extend `SongMatcher` class in `matching.py`
3. **New Reports**: Add methods to `ReportGenerator` class
4. **Web UI Enhancements**: Extend `web/app.py` and `web/templates/index.html`
5. **Testing**: Add test cases to `test_tool.py`

## 🎯 Current Status

### ✅ **Completed Features**

- **Core CLI Tool**: Fully functional with comprehensive duplicate detection
- **CDG/MP3 Pairing**: Intelligent pairing logic for accurate karaoke song handling
- **Channel Priority System**: Configurable MP4 channel priorities based on folder names
- **Skip List Generation**: Complete skip list with metadata and reasoning
- **Performance Optimization**: Handles large libraries (37,000+ songs) efficiently
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions
- **Web UI**: Interactive web interface for manual review and bulk operations
- **Testing & Validation**: Test tool for functionality validation and debugging
- **Dependency Management**: Automated dependency checking and installation
- **Project Documentation**: Comprehensive .gitignore and updated documentation

### 🚀 **Ready for Use**

The tool is production-ready and has successfully processed a large karaoke library:

- Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
- Identified 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- **Bug Fix**: Resolved duplicate entries in skip list generation

## 🔮 Future Roadmap

### Phase 2: Enhanced Analysis & Reporting ✅

- ✅ Generate detailed analysis reports (`--save-reports` functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions

### Phase 3: Web Interface ✅

- ✅ Interactive table/grid for duplicate review
- ✅ Bulk actions and manual overrides
- ✅ Real-time filtering and search
- ✅ Responsive design for mobile/desktop
- ✅ Easy startup with dependency checking
- [ ] Embedded media player for preview
- [ ] Real-time configuration editing
- [ ] Advanced export capabilities

### Phase 4: Advanced Features

- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📝 License

This project is open source. Feel free to use, modify, and distribute according to your needs.

## 🆘 Troubleshooting

### Common Issues

**"File not found" errors**

- Ensure `data/allSongs.json` exists and is readable
- Check file paths in your song data

**"Invalid JSON" errors**

- Validate your JSON syntax using an online validator
- Check for missing commas or brackets

**Memory issues with large libraries**

- The tool is optimized for large datasets
- Consider running with `--dry-run` first to test

### Getting Help

1. **Test your setup**: Run `python test_tool.py` to validate everything is working
2. **Check configuration**: Use `python cli/main.py --show-config` to verify settings
3. **Verbose output**: Run with `--verbose` for detailed information
4. **Dry run**: Use `--dry-run` to test without generating files
5. **Web UI**: Start `python start_web_ui.py` for interactive review

## 📊 Performance & Results

The tool is optimized for large karaoke libraries and has been tested with real-world data:

### **Performance Optimizations:**

- **Memory Efficient**: Processes songs in batches
- **Fast Matching**: Optimized algorithms for duplicate detection
- **Progress Indicators**: Real-time feedback for large operations
- **Scalable**: Handles libraries with 100,000+ songs

### **Real-World Results:**

- **Successfully processed**: 37,015 songs
- **Duplicate detection**: 12,424 duplicates identified (33.6% duplicate rate)
- **File type distribution**: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- **Channel analysis**: 14,698 MP4s with defined priorities, 11,881 without
- **Processing time**: Optimized for large datasets with progress tracking

### **Space Savings Potential:**

- **Significant storage optimization** through intelligent duplicate removal
- **Quality preservation** by keeping highest priority versions
- **Complete metadata** for informed decision-making

---

**Happy karaoke organizing! 🎤🎵**