# Karaoke Song Library Cleanup Tool

A command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. It identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.

## 🎯 Features

- **Smart Duplicate Detection**: Identifies duplicate songs by artist and title
- **CDG/MP3 Pairing Logic**: Automatically pairs CDG and MP3 files with the same base filename as single karaoke song units (CDG files are treated as MP3)
- **Multi-Format Support**: Handles MP3 and MP4 files with a format priority system
- **Channel Priority System**: Configurable priority for MP4 channels based on folder names in file paths
- **Non-Destructive**: Only generates skip lists; never deletes or moves files
- **Detailed Reporting**: Comprehensive statistics and analysis reports
- **Flexible Configuration**: Customizable matching rules and output options
- **Performance Optimized**: Handles large libraries (37,000+ songs) efficiently
- **Future-Ready**: Designed for easy expansion to a web UI

## 📁 Project Structure

```
KaraokeMerge/
├── data/
│   ├── allSongs.json      # Input: Your song library data
│   └── skipSongs.json     # Output: Generated skip list
├── config/
│   └── config.json        # Configuration settings
├── cli/
│   ├── main.py            # Main CLI application
│   ├── matching.py        # Song matching logic
│   ├── report.py          # Report generation
│   └── utils.py           # Utility functions
├── PRD.md                 # Product Requirements Document
└── README.md              # This file
```

## 🚀 Quick Start

### Prerequisites

- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)

### Installation

1. Clone or download this repository
2. Navigate to the project directory
3. Ensure your `data/allSongs.json` file is in place

### Basic Usage

```bash
# Run with default settings
python cli/main.py

# Enable verbose output
python cli/main.py --verbose

# Dry run (analyze without generating skip list)
python cli/main.py --dry-run

# Save detailed reports
python cli/main.py --save-reports
```

### Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose`, `-v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |

## 📊 Data Format

### Input Format (`allSongs.json`)

Your song data should be a JSON array with objects containing at least these fields:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
```

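The input format above can be sanity-checked before a run. The following is a minimal sketch, not the tool's own loader: the names `validate_songs` and `load_songs` are hypothetical, and treating `artist`, `title`, and `path` as the minimum required fields is an assumption made for illustration.

```python
import json
from pathlib import Path

# Assumption: these three fields are the minimum needed for duplicate matching.
REQUIRED_FIELDS = ("artist", "title", "path")


def validate_songs(songs):
    """Return a list of human-readable problems found in the parsed song list."""
    if not isinstance(songs, list):
        return ["top-level JSON value must be an array"]
    problems = []
    for index, song in enumerate(songs):
        if not isinstance(song, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not song.get(field)]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return problems


def load_songs(path):
    """Read a song library file and raise ValueError if it fails validation."""
    songs = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = validate_songs(songs)
    if problems:
        raise ValueError("; ".join(problems))
    return songs
```

Calling `load_songs("data/allSongs.json")` would then either return the parsed list or fail early with a message naming the offending entries.
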
### Output Format (`skipSongs.json`)

The tool generates a skip list with this structure:

```json
[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]
```

**Skip List Features:**
- **Metadata**: Each skip entry includes artist, title, and the path of the kept version
- **Reason Tracking**: Documents why each file was marked for skipping
- **Complete Information**: Provides full context for manual review if needed

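As a rough illustration of the structure above, skip entries for one group of duplicates could be assembled as follows. This is a sketch under stated assumptions: `build_skip_entries` is a hypothetical helper, and skipping repeated paths is one possible way to keep the same file from being listed twice, not necessarily how the tool does it.

```python
import json


def build_skip_entries(kept, duplicates, reason="duplicate"):
    """Create one skip-list entry per duplicate, pointing at the kept version."""
    entries = []
    seen_paths = set()
    for song in duplicates:
        if song["path"] in seen_paths:
            continue  # assumption: a file should appear in the skip list only once
        seen_paths.add(song["path"])
        entries.append(
            {
                "path": song["path"],
                "reason": reason,
                "artist": song["artist"],
                "title": song["title"],
                "kept_version": kept["path"],
            }
        )
    return entries


kept = {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4",
}
duplicate = {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
}
# The same duplicate passed twice still yields a single entry.
skip_list = build_skip_entries(kept, [duplicate, duplicate])
print(json.dumps(skip_list, indent=2))
```
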
## ⚙️ Configuration

Edit `config/config.json` to customize the tool's behavior:

### Channel Priorities (MP4 files)
```json
{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}
```

**Note**: Channel priorities are folder names found in the song's `path` property. The tool searches for these exact folder names within the file path to determine priority; the first name in the list has the highest priority.

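The folder-name lookup described in the note can be sketched as below. This is an illustration only: `channel_priority` is a hypothetical function, and matching against whole path components (rather than substrings) is an assumption about what "exact folder names" means.

```python
import re


def channel_priority(path, channel_priorities):
    """Return the rank of the first configured channel folder found in path.

    Lower is better; None means no configured folder name matched.
    """
    # Paths in the sample data mix "/" and "\", so split on both.
    parts = [part for part in re.split(r"[\\/]+", path) if part]
    folders = parts[:-1]  # ignore the filename itself
    for rank, channel in enumerate(channel_priorities):
        if channel in folders:
            return rank
    return None


priorities = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]
ranked = channel_priority(
    "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    priorities,
)
unranked = channel_priority(
    "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    priorities,
)
```

Here `ranked` is `0` (the first configured channel) and `unranked` is `None`, which a caller could treat as "no priority defined".
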
### Matching Settings
```json
{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
```

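To make the effect of these settings concrete, here is a minimal sketch of an artist/title comparison driven by them. It is not the tool's matching code: `normalize` and `is_same_song` are hypothetical names, and using `difflib.SequenceMatcher` for the fuzzy score is an assumption, since the README does not say which similarity measure `fuzzy_threshold` applies to.

```python
from difflib import SequenceMatcher


def normalize(text, case_sensitive=False):
    """Collapse whitespace and, unless case_sensitive, lowercase the text."""
    text = " ".join(text.split())
    return text if case_sensitive else text.lower()


def is_same_song(a, b, fuzzy_matching=False, fuzzy_threshold=0.8, case_sensitive=False):
    """Compare two songs on artist and title using the matching settings."""
    for field in ("artist", "title"):
        left = normalize(a[field], case_sensitive)
        right = normalize(b[field], case_sensitive)
        if left == right:
            continue
        if not fuzzy_matching:
            return False
        # Assumption: the threshold is a 0..1 similarity ratio.
        if SequenceMatcher(None, left, right).ratio() < fuzzy_threshold:
            return False
    return True


first = {"artist": "ACDC", "title": "Shot In The Dark"}
second = {"artist": "acdc", "title": "shot in the  dark"}
typo = {"artist": "ACDC", "title": "Shot In The Darkk"}
```

With the defaults shown in the configuration, `first` and `second` match because case and extra whitespace are ignored, while `typo` only matches once `fuzzy_matching` is enabled.
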
### Output Settings
```json
{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}
```

## 📈 Understanding the Output

### Summary Report
- **Total songs processed**: Total number of songs analyzed
- **Unique songs found**: Number of unique artist-title combinations
- **Duplicates identified**: Number of duplicate songs found
- **File type breakdown**: Distribution across MP3, CDG, MP4 formats
- **Channel breakdown**: MP4 channel distribution (if applicable)

### Skip List
The generated `skipSongs.json` lists the files that should be skipped during future imports. Each entry includes:
- `path`: File path to skip
- `reason`: Why the file was marked for skipping (usually "duplicate")
- `artist`, `title`: Metadata of the skipped song
- `kept_version`: Path of the version that was kept instead

## 🔧 Advanced Features

### Multi-Artist Handling
The tool automatically handles songs with multiple artists using various delimiters:
- `feat.`, `ft.`, `featuring`
- `&`, `and`
- `,`, `;`, `/`

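Splitting on the delimiters listed above could look like the sketch below. The pattern and the name `split_artists` are illustrative assumptions, not the tool's implementation; in particular, requiring whitespace around the word delimiters is a choice made here so that names containing "and" are left intact.

```python
import re

# Word delimiters must be surrounded by whitespace so that names such as
# "Alexander" or "Brandy" are not split on the embedded "and".
_ARTIST_SPLIT = re.compile(
    r"\s+(?:featuring|feat\.?|ft\.?|and)\s+|\s*[&,;/]\s*",
    flags=re.IGNORECASE,
)


def split_artists(artist):
    """Split a combined artist string into individual artist names."""
    return [name.strip() for name in _ARTIST_SPLIT.split(artist) if name.strip()]


print(split_artists("Eminem feat. Rihanna"))
print(split_artists("Brandy and Monica"))
print(split_artists("Simon & Garfunkel"))
```

Pure delimiter splitting has known limits: a band name that contains a delimiter, such as "AC/DC" or "Simon & Garfunkel", is split as if it were several artists, so a real implementation would likely need an exception list.
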
### File Type Priority System
The tool selects the best version of each song in the following priority order:

1. **MP4 files are always preferred** when available
   - Searches for configured folder names within the file path
   - Sorts by configured priority order (first in list = highest priority)
   - Keeps the highest priority MP4 version

2. **CDG/MP3 pairs** are treated as single units
   - Automatically pairs CDG and MP3 files with the same base filename
   - Example: `song.cdg` + `song.mp3` = one complete karaoke song
   - Only considered if no MP4 files exist for the same artist/title

3. **Standalone files** are lowest priority
   - Standalone MP3 files (without matching CDG)
   - Standalone CDG files (without matching MP3)

4. **Manual review candidates**
   - Songs without matching folder names in channel priorities
   - Ambiguous cases requiring human decision

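The ordering above can be expressed as a sort key. The following is a sketch under assumptions, not the tool's selection code: `sort_key` and `choose_version` are hypothetical names, the `paired` flag is assumed to be set by an earlier CDG/MP3 pairing step, and flagging a group for manual review when the winning MP4 matches no configured folder is one possible reading of item 4.

```python
import re


def _channel_rank(path, channel_priorities):
    """Rank of the first configured folder in path; unranked sorts last."""
    folders = re.split(r"[\\/]+", path)[:-1]
    for rank, channel in enumerate(channel_priorities):
        if channel in folders:
            return rank
    return len(channel_priorities)


def sort_key(candidate, channel_priorities):
    """Lower tuples win: (file-type tier, channel rank)."""
    path = candidate["path"]
    extension = path.rsplit(".", 1)[-1].lower()
    if extension == "mp4":
        return (0, _channel_rank(path, channel_priorities))
    if candidate.get("paired", False):  # hypothetical flag: complete CDG/MP3 pair
        return (1, 0)
    return (2, 0)  # standalone MP3 or CDG


def choose_version(candidates, channel_priorities):
    """Return (kept, skipped, needs_review) for one artist/title group."""
    ordered = sorted(candidates, key=lambda c: sort_key(c, channel_priorities))
    kept, skipped = ordered[0], ordered[1:]
    # Assumption: an MP4 winner with no configured channel needs a human look.
    needs_review = sort_key(kept, channel_priorities) == (0, len(channel_priorities))
    return kept, skipped, needs_review


priorities = ["Sing King Karaoke", "KaraFun Karaoke"]
group = [
    {"path": "z://MP3\\song.mp3", "paired": True},
    {"path": "z://MP4\\KaraFun Karaoke\\song.mp4"},
    {"path": "z://MP4\\Sing King Karaoke\\song.mp4"},
]
kept, skipped, needs_review = choose_version(group, priorities)
```

In this example the Sing King Karaoke MP4 is kept and the other two files are skipped, with no manual review needed.
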
### CDG/MP3 Pairing Logic
The tool automatically identifies and pairs CDG/MP3 files:
- **Base filename matching**: Files with identical names but different extensions
- **Single unit treatment**: Paired files are considered one complete karaoke song
- **Accurate duplicate detection**: Prevents treating paired files as separate duplicates
- **Proper priority handling**: Ensures complete songs compete fairly with MP4 versions

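Base-filename pairing as described above might be sketched like this. The function `pair_cdg_mp3` is hypothetical, and two details are assumptions for illustration: files must share the full path up to the extension (same folder, same name), and the comparison ignores case.

```python
from collections import defaultdict


def pair_cdg_mp3(paths):
    """Group .cdg/.mp3 paths by base filename.

    Returns (pairs, standalone), where pairs is a list of (cdg_path, mp3_path)
    tuples and standalone lists CDG or MP3 files without a partner.
    """
    by_base = defaultdict(dict)
    for path in paths:
        base, dot, extension = path.rpartition(".")
        extension = extension.lower()
        if dot and extension in ("cdg", "mp3"):
            # Assumption: pairing is case-insensitive on the base name.
            by_base[base.lower()][extension] = path
    pairs, standalone = [], []
    for found in by_base.values():
        if len(found) == 2:
            pairs.append((found["cdg"], found["mp3"]))
        else:
            standalone.extend(found.values())
    return pairs, standalone


pairs, standalone = pair_cdg_mp3(
    ["a/song.cdg", "a/song.mp3", "a/other.mp3", "a/clip.mp4"]
)
```

Here `song.cdg` and `song.mp3` form one pair, `other.mp3` is standalone, and the MP4 file is ignored by this step.
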
### Enhanced Analysis & Reporting
Use `--save-reports` to generate comprehensive analysis files:

**📊 Enhanced Reports:**
- `enhanced_summary_report.txt`: Comprehensive analysis with detailed statistics
- `channel_optimization_report.txt`: Channel priority optimization suggestions
- `duplicate_pattern_report.txt`: Duplicate pattern analysis by artist, title, and channel
- `actionable_insights_report.txt`: Recommendations and actionable insights
- `analysis_data.json`: Raw analysis data for further processing

**📋 Legacy Reports:**
- `summary_report.txt`: Basic overall statistics
- `duplicate_details.txt`: Detailed duplicate analysis (verbose mode only)
- `skip_list_summary.txt`: Skip list breakdown
- `skip_songs_detailed.json`: Full skip data with metadata

**🔍 Analysis Features:**
- **Pattern Analysis**: Identifies most duplicated artists, titles, and channels
- **Channel Optimization**: Suggests optimal channel priority order based on effectiveness
- **Storage Insights**: Quantifies space savings potential and duplicate distribution
- **Actionable Recommendations**: Provides specific suggestions for library optimization

## 🛠️ Development

### Project Structure for Expansion

The codebase is designed for easy expansion:

- **Modular Design**: Separate modules for matching, reporting, and utilities
- **Configuration-Driven**: Easy to modify behavior without code changes
- **Web UI Ready**: Structure supports future web interface development

### Adding New Features

1. **New File Formats**: Add extensions to `config.json`
2. **New Matching Rules**: Extend `SongMatcher` class in `matching.py`
3. **New Reports**: Add methods to `ReportGenerator` class
4. **Web UI**: Build on existing CLI structure

## 🎯 Current Status

### ✅ **Completed Features**
- **Core CLI Tool**: Fully functional with comprehensive duplicate detection
- **CDG/MP3 Pairing**: Intelligent pairing logic for accurate karaoke song handling
- **Channel Priority System**: Configurable MP4 channel priorities based on folder names
- **Skip List Generation**: Complete skip list with metadata and reasoning
- **Performance Optimization**: Handles large libraries (37,000+ songs) efficiently
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions

### 🚀 **Ready for Use**
The tool is production-ready and has successfully processed a large karaoke library:
- Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
- Identified 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- **Bug Fix**: Resolved duplicate entries in skip list generation

## 🔮 Future Roadmap

### Phase 2: Enhanced Analysis & Reporting ✅
- ✅ Generate detailed analysis reports (`--save-reports` functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions

### Phase 3: Web Interface
- Interactive table/grid for duplicate review
- Embedded media player for preview
- Bulk actions and manual overrides
- Real-time configuration editing
- Manual review interface for ambiguous cases

### Phase 4: Advanced Features
- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📝 License

This project is open source. Feel free to use, modify, and distribute according to your needs.

## 🆘 Troubleshooting

### Common Issues

**"File not found" errors**
- Ensure `data/allSongs.json` exists and is readable
- Check file paths in your song data

**"Invalid JSON" errors**
- Validate your JSON syntax using an online validator
- Check for missing commas or brackets

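For large files, locating a syntax error locally can be quicker than pasting into an online validator. The sketch below uses only Python's standard `json` module; `describe_json_error` is a hypothetical helper and not part of the tool.

```python
import json


def describe_json_error(text):
    """Return None if text parses as JSON, else a message with line and column."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


# A missing comma between two fields is reported with its position.
broken = '[{"artist": "ACDC" "title": "Shot In The Dark"}]'
print(describe_json_error(broken))
```

To check the real input, read the file first, for example `describe_json_error(open("data/allSongs.json", encoding="utf-8").read())`.
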
**Memory issues with large libraries**
- The tool is optimized for large datasets
- Consider running with `--dry-run` first to test

### Getting Help

1. Check the configuration with `python cli/main.py --show-config`
2. Run with `--verbose` for detailed output
3. Use `--dry-run` to test without generating files

## 📊 Performance & Results

The tool is optimized for large karaoke libraries and has been tested with real-world data:

### **Performance Optimizations:**
- **Memory Efficient**: Processes songs in batches
- **Fast Matching**: Optimized algorithms for duplicate detection
- **Progress Indicators**: Real-time feedback for large operations
- **Scalable**: Designed for libraries with 100,000+ songs (largest library processed so far: 37,015 songs)

### **Real-World Results:**
- **Successfully processed**: 37,015 songs
- **Duplicate detection**: 12,424 duplicates identified (33.6% duplicate rate)
- **File type distribution**: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- **Channel analysis**: 14,698 MP4s with defined priorities, 11,881 without
- **Processing time**: Optimized for large datasets with progress tracking

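The reported figures can be cross-checked with a few lines of arithmetic. This sketch only recomputes percentages from the counts stated in this README (including the 10,998 unique skip files and 1,426 removed entries from the Current Status section); the interpretation that those two counts add up to the duplicates identified is an assumption.

```python
total_songs = 37_015
duplicates = 12_424
unique_skip_files = 10_998
removed_entries = 1_426
mp4_ranked, mp4_unranked = 14_698, 11_881

# Duplicate rate as a percentage of all processed songs.
duplicate_rate = round(100 * duplicates / total_songs, 1)

# Share of MP4 files, taking the two channel-analysis counts as all MP4s.
mp4_share = round(100 * (mp4_ranked + mp4_unranked) / total_songs, 1)

print(duplicate_rate)                       # 33.6
print(mp4_share)                            # 71.8
print(unique_skip_files + removed_entries)  # 12424
```
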
### **Space Savings Potential:**
- **Significant storage optimization** through intelligent duplicate removal
- **Quality preservation** by keeping highest priority versions
- **Complete metadata** for informed decision-making

---

**Happy karaoke organizing! 🎤🎵**