Karaoke Song Library Cleanup Tool
A powerful command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.
🎯 Features
- Smart Duplicate Detection: Identifies duplicate songs by artist and title
- CDG/MP3 Pairing Logic: Automatically pairs CDG and MP3 files that share a base filename and treats each pair as a single karaoke song (CDG files are treated as MP3)
- Multi-Format Support: Handles MP3 and MP4 files with an intelligent priority system
- Channel Priority System: Configurable priority for MP4 channels based on folder names in file paths
- Non-Destructive: Only generates skip lists - never deletes or moves files
- Detailed Reporting: Comprehensive statistics and analysis reports
- Flexible Configuration: Customizable matching rules and output options
- Performance Optimized: Handles large libraries (37,000+ songs) efficiently
- Future-Ready: Designed for easy expansion to web UI
📁 Project Structure
KaraokeMerge/
├── data/
│ ├── allSongs.json # Input: Your song library data
│ └── skipSongs.json # Output: Generated skip list
├── config/
│ └── config.json # Configuration settings
├── cli/
│ ├── main.py # Main CLI application
│ ├── matching.py # Song matching logic
│ ├── report.py # Report generation
│ └── utils.py # Utility functions
├── PRD.md # Product Requirements Document
└── README.md # This file
🚀 Quick Start
Prerequisites
- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)
Installation
- Clone or download this repository
- Navigate to the project directory
- Ensure your data/allSongs.json file is in place
Basic Usage
# Run with default settings
python cli/main.py
# Enable verbose output
python cli/main.py --verbose
# Dry run (analyze without generating skip list)
python cli/main.py --dry-run
# Save detailed reports
python cli/main.py --save-reports
Command Line Options
| Option | Description | Default |
|---|---|---|
| --config | Path to configuration file | ../config/config.json |
| --input | Path to input songs file | ../data/allSongs.json |
| --output-dir | Directory for output files | ../data |
| --verbose, -v | Enable verbose output | False |
| --dry-run | Analyze without generating skip list | False |
| --save-reports | Save detailed reports to files | False |
| --show-config | Show current configuration and exit | False |
📊 Data Format
Input Format (allSongs.json)
Your song data should be a JSON array with objects containing at least these fields:
[
{
"artist": "ACDC",
"title": "Shot In The Dark",
"path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
"guid": "8946008c-7acc-d187-60e6-5286e55ad502",
"disabled": false,
"favorite": false
}
]
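Before running the tool, it can help to confirm that every entry carries the fields the matcher relies on. The following is a minimal sketch, not part of the tool itself: the function names and the assumption that artist, title, and path are the required fields are illustrative.

```python
import json
from pathlib import Path

# Assumption: these are the fields the matcher needs; adjust to your data.
REQUIRED_FIELDS = ("artist", "title", "path")


def load_songs(json_path):
    """Load the song array from disk and check that the top level is a list."""
    with Path(json_path).open(encoding="utf-8") as handle:
        songs = json.load(handle)
    if not isinstance(songs, list):
        raise ValueError("expected a JSON array of song objects")
    return songs


def find_invalid_entries(songs):
    """Return (index, missing_fields) for entries with absent or empty fields."""
    problems = []
    for index, song in enumerate(songs):
        missing = [field for field in REQUIRED_FIELDS if not song.get(field)]
        if missing:
            problems.append((index, missing))
    return problems
```

For example, find_invalid_entries(load_songs("data/allSongs.json")) returns an empty list when every entry has all three fields.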
Output Format (skipSongs.json)
The tool generates a skip list with this structure:
[
{
"path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
"reason": "duplicate",
"artist": "ACDC",
"title": "Shot In The Dark",
"kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
}
]
Skip List Features:
- Metadata: Each skip entry includes artist, title, and the path of the kept version
- Reason Tracking: Documents why each file was marked for skipping
- Complete Information: Provides full context for manual review if needed
⚙️ Configuration
Edit config/config.json to customize the tool's behavior:
Channel Priorities (MP4 files)
{
"channel_priorities": [
"Sing King Karaoke",
"KaraFun Karaoke",
"Stingray Karaoke"
]
}
Note: Channel priorities are now folder names found in the song's path property. The tool searches for these exact folder names within the file path to determine priority.
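The folder-name lookup described in the note can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the function name channel_rank is hypothetical, and the comparison is assumed to be an exact, case-sensitive match on whole folder names.

```python
import re

# Same order as the example config: first entry = highest priority.
CHANNEL_PRIORITIES = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]


def channel_rank(path, priorities=CHANNEL_PRIORITIES):
    """Return the priority index of the first configured folder in path, or None."""
    # Split on both slash styles so paths like "z://MP4\\Channel\\song.mp4" work.
    folders = [part for part in re.split(r"[\\/]+", path) if part]
    for rank, name in enumerate(priorities):
        if name in folders:
            return rank
    return None
```

A lower rank means a higher priority; None means the path contains none of the configured folder names.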
Matching Settings
{
"matching": {
"fuzzy_matching": false,
"fuzzy_threshold": 0.8,
"case_sensitive": false
}
}
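To illustrate how these three settings could interact, here is a hedged sketch. The helper names match_key and is_same_song are hypothetical, and difflib.SequenceMatcher stands in for whatever similarity measure the tool actually uses.

```python
from difflib import SequenceMatcher


def match_key(artist, title, case_sensitive=False):
    """Build the (artist, title) key used to group duplicates."""
    key = (artist.strip(), title.strip())
    if case_sensitive:
        return key
    return tuple(part.lower() for part in key)


def is_same_song(key_a, key_b, fuzzy_matching=False, fuzzy_threshold=0.8):
    """Exact match by default; optionally accept near matches above the threshold."""
    if key_a == key_b:
        return True
    if not fuzzy_matching:
        return False
    ratio = SequenceMatcher(None, " - ".join(key_a), " - ".join(key_b)).ratio()
    return ratio >= fuzzy_threshold
```

With fuzzy_matching left at false, only keys that are identical after normalization count as duplicates, which is the safer default for a large library.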
Output Settings
{
"output": {
"verbose": false,
"include_reasons": true,
"max_duplicates_per_song": 10
}
}
📈 Understanding the Output
Summary Report
- Total songs processed: Total number of songs analyzed
- Unique songs found: Number of unique artist-title combinations
- Duplicates identified: Number of duplicate songs found
- File type breakdown: Distribution across MP3, CDG, MP4 formats
- Channel breakdown: MP4 channel distribution (if applicable)
Skip List
The generated skipSongs.json contains paths to files that should be skipped during future imports. Each entry includes:
- path: File path to skip
- reason: Why the file was marked for skipping (usually "duplicate")
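How the skip list is consumed depends on your import process, which this README does not specify. As a minimal sketch, assuming the importer works from a list of candidate file paths, the filtering step could look like this (both function names are hypothetical):

```python
def paths_to_skip(skip_entries):
    """Collect the path of every skip-list entry into a set for fast lookup."""
    return {entry["path"] for entry in skip_entries}


def filter_import(candidate_paths, skip_entries):
    """Return the candidate paths that are not on the skip list, in order."""
    skip = paths_to_skip(skip_entries)
    return [path for path in candidate_paths if path not in skip]
```

The skip entries would typically come from json.load on data/skipSongs.json.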
🔧 Advanced Features
Multi-Artist Handling
The tool automatically handles songs with multiple artists using various delimiters:
- feat., ft., featuring
- &, and
- comma (,), semicolon (;), slash (/)
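One way to split an artist field on these delimiters is a single regular expression. The sketch below is an assumption about how such splitting could work, not the tool's implementation; split_artists is a hypothetical name. Word delimiters must be surrounded by whitespace so that names containing "and" (for example "Brandy") stay intact.

```python
import re

# Word delimiters need whitespace on both sides; symbol delimiters do not.
_ARTIST_SPLIT = re.compile(
    r"\s+(?:feat\.|ft\.|featuring|and)\s+|\s*[&,;/]\s*",
    flags=re.IGNORECASE,
)


def split_artists(artist_field):
    """Split a combined artist string into individual artist names."""
    parts = _ARTIST_SPLIT.split(artist_field)
    return [part.strip() for part in parts if part.strip()]
```

Note that splitting on "/" also breaks up names that contain a slash, such as "AC/DC", so a real implementation may need an exception list.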
File Type Priority System
The tool uses a sophisticated priority system to select the best version of each song:
- MP4 files are always preferred when available
  - Searches for configured folder names within the file path
  - Sorts by configured priority order (first in list = highest priority)
  - Keeps the highest priority MP4 version
- CDG/MP3 pairs are treated as single units
  - Automatically pairs CDG and MP3 files with the same base filename
  - Example: song.cdg + song.mp3 = one complete karaoke song
  - Only considered if no MP4 files exist for the same artist/title
- Standalone files are lowest priority
  - Standalone MP3 files (without matching CDG)
  - Standalone CDG files (without matching MP3)
- Manual review candidates
  - Songs without matching folder names in channel priorities
  - Ambiguous cases requiring human decision
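The priority order above can be expressed as a sort key. This is a sketch under assumptions: each candidate is a dict with hypothetical keys kind ("mp4", "pair", or "standalone") and channel, and an MP4 from an unlisted channel is ranked below listed channels here, whereas the tool may instead flag it for manual review.

```python
CHANNEL_PRIORITIES = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]


def sort_key(candidate, priorities=CHANNEL_PRIORITIES):
    """Return a tuple where a smaller value means a more preferred version."""
    kind = candidate["kind"]
    if kind == "mp4":
        channel = candidate.get("channel")
        # Unlisted channels sort after every configured channel (assumption).
        rank = priorities.index(channel) if channel in priorities else len(priorities)
        return (0, rank)
    if kind == "pair":
        return (1, 0)
    return (2, 0)  # standalone MP3 or CDG


def choose_kept(candidates):
    """Pick the version to keep; every other candidate goes on the skip list."""
    return min(candidates, key=sort_key)
```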
CDG/MP3 Pairing Logic
The tool automatically identifies and pairs CDG/MP3 files:
- Base filename matching: Files with identical names but different extensions
- Single unit treatment: Paired files are considered one complete karaoke song
- Accurate duplicate detection: Prevents treating paired files as separate duplicates
- Proper priority handling: Ensures complete songs compete fairly with MP4 versions
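Base-filename pairing can be sketched by grouping files on their path without the extension. The function name pair_cdg_mp3 and the case-insensitive comparison are assumptions made for illustration; ntpath is used so that Windows-style paths split correctly on any platform.

```python
import ntpath
from collections import defaultdict


def pair_cdg_mp3(paths):
    """Group .cdg/.mp3 files by base filename; return (pairs, standalone)."""
    by_base = defaultdict(dict)
    for path in paths:
        base, ext = ntpath.splitext(path)
        ext = ext.lower()
        if ext in (".cdg", ".mp3"):
            # Case-insensitive base comparison is an assumption.
            by_base[base.lower()][ext] = path
    pairs, standalone = [], []
    for files in by_base.values():
        if len(files) == 2:
            pairs.append((files[".cdg"], files[".mp3"]))
        else:
            standalone.extend(files.values())
    return pairs, standalone
```

Files with other extensions, such as MP4, are ignored by this helper and would be handled by the channel priority logic instead.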
Enhanced Analysis & Reporting
Use --save-reports to generate comprehensive analysis files:
📊 Enhanced Reports:
- enhanced_summary_report.txt: Comprehensive analysis with detailed statistics
- channel_optimization_report.txt: Channel priority optimization suggestions
- duplicate_pattern_report.txt: Duplicate pattern analysis by artist, title, and channel
- actionable_insights_report.txt: Recommendations and actionable insights
- analysis_data.json: Raw analysis data for further processing
📋 Legacy Reports:
- summary_report.txt: Basic overall statistics
- duplicate_details.txt: Detailed duplicate analysis (verbose mode only)
- skip_list_summary.txt: Skip list breakdown
- skip_songs_detailed.json: Full skip data with metadata
🔍 Analysis Features:
- Pattern Analysis: Identifies most duplicated artists, titles, and channels
- Channel Optimization: Suggests optimal channel priority order based on effectiveness
- Storage Insights: Quantifies space savings potential and duplicate distribution
- Actionable Recommendations: Provides specific suggestions for library optimization
🛠️ Development
Project Structure for Expansion
The codebase is designed for easy expansion:
- Modular Design: Separate modules for matching, reporting, and utilities
- Configuration-Driven: Easy to modify behavior without code changes
- Web UI Ready: Structure supports future web interface development
Adding New Features
- New File Formats: Add extensions to config.json
- New Matching Rules: Extend the SongMatcher class in matching.py
- New Reports: Add methods to the ReportGenerator class
- Web UI: Build on existing CLI structure
🎯 Current Status
✅ Completed Features
- Core CLI Tool: Fully functional with comprehensive duplicate detection
- CDG/MP3 Pairing: Intelligent pairing logic for accurate karaoke song handling
- Channel Priority System: Configurable MP4 channel priorities based on folder names
- Skip List Generation: Complete skip list with metadata and reasoning
- Performance Optimization: Handles large libraries (37,000+ songs) efficiently
- Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
- Pattern Analysis: Skip list pattern analysis and channel optimization suggestions
🚀 Ready for Use
The tool is production-ready and has successfully processed a large karaoke library:
- Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
- Identified 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- Bug Fix: Resolved duplicate entries in skip list generation
🔮 Future Roadmap
Phase 2: Enhanced Analysis & Reporting ✅
- ✅ Generate detailed analysis reports (--save-reports functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions
Phase 3: Web Interface
- Interactive table/grid for duplicate review
- Embedded media player for preview
- Bulk actions and manual overrides
- Real-time configuration editing
- Manual review interface for ambiguous cases
Phase 4: Advanced Features
- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
📝 License
This project is open source. Feel free to use, modify, and distribute according to your needs.
🆘 Troubleshooting
Common Issues
"File not found" errors
- Ensure data/allSongs.json exists and is readable
- Check file paths in your song data
"Invalid JSON" errors
- Validate your JSON syntax using an online validator
- Check for missing commas or brackets
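If you would rather not paste your library into an online validator, Python's built-in json module reports where parsing failed. A minimal sketch (the function name describe_json_error is hypothetical):

```python
import json


def describe_json_error(text):
    """Return a short description of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

Pass it the contents of data/allSongs.json, read as text; a return value of None means the file parsed cleanly.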
Memory issues with large libraries
- The tool is optimized for large datasets
- Consider running with --dry-run first to test
Getting Help
- Check the configuration with python cli/main.py --show-config
- Run with --verbose for detailed output
- Use --dry-run to test without generating files
📊 Performance & Results
The tool is optimized for large karaoke libraries and has been tested with real-world data:
Performance Optimizations:
- Memory Efficient: Processes songs in batches
- Fast Matching: Optimized algorithms for duplicate detection
- Progress Indicators: Real-time feedback for large operations
- Scalable: Designed for libraries with 100,000+ songs (largest library tested so far: 37,015 songs)
Real-World Results:
- Successfully processed: 37,015 songs
- Duplicate detection: 12,424 duplicates identified (33.6% duplicate rate)
- File type distribution: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- Channel analysis: 14,698 MP4s with defined priorities, 11,881 without
- Processing time: Optimized for large datasets with progress tracking
Space Savings Potential:
- Significant storage optimization through intelligent duplicate removal
- Quality preservation by keeping highest priority versions
- Complete metadata for informed decision-making
Happy karaoke organizing! 🎤🎵