Karaoke Song Library Cleanup Tool

A command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. It identifies duplicate songs across formats (MP3, CDG, MP4) and generates a "skip list" for future imports, helping you keep your karaoke library clean and organized.

🎯 Features

  • Smart Duplicate Detection: Identifies duplicate songs by artist and title
  • CDG/MP3 Pairing Logic: Automatically pairs CDG and MP3 files that share a base filename and treats each pair as a single karaoke song unit (CDG files are treated as MP3)
  • Multi-Format Support: Handles MP3, CDG, and MP4 files with a file type priority system
  • Channel Priority System: Configurable priority for MP4 channels based on folder names in file paths
  • Non-Destructive: Only generates skip lists - never deletes or moves files
  • Detailed Reporting: Comprehensive statistics and analysis reports
  • Flexible Configuration: Customizable matching rules and output options
  • Performance Optimized: Handles large libraries (37,000+ songs) efficiently
  • Future-Ready: Designed for easy expansion to a web UI

📁 Project Structure

KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   └── skipSongs.json         # Output: Generated skip list
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── PRD.md                     # Product Requirements Document
└── README.md                  # This file

🚀 Quick Start

Prerequisites

  • Python 3.7 or higher
  • Your karaoke song data in JSON format (see Data Format section)

Installation

  1. Clone or download this repository
  2. Navigate to the project directory
  3. Ensure your data/allSongs.json file is in place

Basic Usage

# Run with default settings
python cli/main.py

# Enable verbose output
python cli/main.py --verbose

# Dry run (analyze without generating skip list)
python cli/main.py --dry-run

# Save detailed reports
python cli/main.py --save-reports

Command Line Options

Option           Description                             Default
--config         Path to configuration file              ../config/config.json
--input          Path to input songs file                ../data/allSongs.json
--output-dir     Directory for output files              ../data
--verbose, -v    Enable verbose output                   False
--dry-run        Analyze without generating skip list    False
--save-reports   Save detailed reports to files          False
--show-config    Show current configuration and exit     False
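
The options above map directly onto Python's argparse module. The following is a minimal sketch of such a parser, not the actual contents of cli/main.py; the function name build_parser is an assumption made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Karaoke Song Library Cleanup Tool")
    parser.add_argument("--config", default="../config/config.json",
                        help="Path to configuration file")
    parser.add_argument("--input", default="../data/allSongs.json",
                        help="Path to input songs file")
    parser.add_argument("--output-dir", default="../data",
                        help="Directory for output files")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze without generating skip list")
    parser.add_argument("--save-reports", action="store_true",
                        help="Save detailed reports to files")
    parser.add_argument("--show-config", action="store_true",
                        help="Show current configuration and exit")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--dry-run", "-v"])
print(args.dry_run, args.verbose, args.output_dir)  # True True ../data
```

Boolean flags use action="store_true", which is what gives them the False default shown in the table.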

📊 Data Format

Input Format (allSongs.json)

Your song data should be a JSON array with objects containing at least these fields:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
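
Before running the tool on a large library, it can help to check that every entry has the fields needed for matching. The sketch below assumes that artist, title, and path are the minimum required fields; the README does not state which fields are mandatory, and find_invalid_entries is a hypothetical helper, not part of the tool.

```python
# Assumption: these three fields are the minimum the matcher needs.
REQUIRED_FIELDS = ("artist", "title", "path")


def find_invalid_entries(songs):
    """Return (index, missing_fields) for entries lacking a required field."""
    problems = []
    for index, song in enumerate(songs):
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not song.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


songs = [
    {"artist": "ACDC", "title": "Shot In The Dark",
     "path": r"z://MP4\ACDC - Shot In The Dark (Karaoke Version).mp4"},
    {"artist": "ACDC", "title": ""},  # empty title and no path
]
print(find_invalid_entries(songs))  # [(1, ['title', 'path'])]
```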

Output Format (skipSongs.json)

The tool generates a skip list with this structure:

[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]

Skip List Features:

  • Metadata: Each skip entry includes artist, title, and the path of the kept version
  • Reason Tracking: Documents why each file was marked for skipping
  • Complete Information: Provides full context for manual review if needed
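
Because the tool never deletes files, the skip list only takes effect when an import step consults it. One way an importer might do that is sketched below; apply_skip_list is a hypothetical function, and the only assumption about the data is the path field shown above.

```python
def apply_skip_list(songs, skip_entries):
    """Return the songs whose path does not appear in the skip list."""
    skip_paths = {entry["path"] for entry in skip_entries}
    return [song for song in songs if song["path"] not in skip_paths]


songs = [
    {"path": r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4"},
    {"path": r"z://MP4\Sing King Karaoke\ACDC - Shot In The Dark (Karaoke Version).mp4"},
]
skip_entries = [
    {"path": r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4",
     "reason": "duplicate"},
]
kept = apply_skip_list(songs, skip_entries)
print(len(kept))  # 1
```

Collecting the paths into a set keeps the lookup fast even when the skip list holds thousands of entries.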

⚙️ Configuration

Edit config/config.json to customize the tool's behavior:

Channel Priorities (MP4 files)

{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}

Note: Channel priorities are folder names found in each song's path property. The tool searches the file path for these exact folder names to determine priority; names listed first rank highest.
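
The folder lookup can be pictured as follows. This is a sketch under two assumptions: that a channel must match a whole folder component of the path (the tool may instead do a substring search), and that the position in the list is used as the rank. The name channel_rank is hypothetical.

```python
import re


def channel_rank(path, channel_priorities):
    """Return the index of the first configured channel found among the
    folder components of path, or None if no channel matches."""
    # Split on both slash styles and drop the file name itself.
    folders = re.split(r"[\\/]+", path)[:-1]
    for rank, channel in enumerate(channel_priorities):
        if channel in folders:
            return rank  # lower rank means higher priority
    return None


priorities = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]
ranked = channel_rank(
    r"z://MP4\Sing King Karaoke\ACDC - Shot In The Dark (Karaoke Version).mp4",
    priorities,
)
unranked = channel_rank(
    r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4", priorities
)
print(ranked, unranked)  # 0 None
```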

Matching Settings

{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
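
To illustrate how these three settings could interact, here is a small sketch. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure; the README does not say which fuzzy algorithm the tool uses, and match_key and is_match are hypothetical names.

```python
from difflib import SequenceMatcher


def match_key(artist, title, case_sensitive=False):
    """Build the artist/title key used to compare two songs."""
    key = (artist.strip(), title.strip())
    return key if case_sensitive else tuple(part.lower() for part in key)


def is_match(key_a, key_b, fuzzy_matching=False, fuzzy_threshold=0.8):
    """Exact comparison by default; optional similarity-based comparison."""
    if key_a == key_b:
        return True
    if not fuzzy_matching:
        return False
    # Both artist and title must reach the threshold.
    return all(
        SequenceMatcher(None, a, b).ratio() >= fuzzy_threshold
        for a, b in zip(key_a, key_b)
    )


a = match_key("ACDC", "Shot In The Dark")
b = match_key("acdc ", "Shot in the Dark!")
print(is_match(a, b), is_match(a, b, fuzzy_matching=True))  # False True
```

With fuzzy_matching left at false, as in the default configuration, only identical keys are treated as duplicates.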

Output Settings

{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}

📈 Understanding the Output

Summary Report

  • Total songs processed: Total number of songs analyzed
  • Unique songs found: Number of unique artist-title combinations
  • Duplicates identified: Number of duplicate songs found
  • File type breakdown: Distribution across MP3, CDG, MP4 formats
  • Channel breakdown: MP4 channel distribution (if applicable)
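
The headline numbers can be reproduced from the input data with a few counters. The sketch below assumes that "duplicates" means the number of songs beyond the first for each case-insensitive artist/title pair; the tool's exact definition may differ, and summarize is a hypothetical helper.

```python
from collections import Counter
import ntpath  # handles Windows-style paths on any platform


def summarize(songs):
    """Compute total, unique, duplicate, and per-extension counts."""
    keys = Counter((s["artist"].lower(), s["title"].lower()) for s in songs)
    types = Counter(
        ntpath.splitext(s["path"])[1].lower().lstrip(".") for s in songs
    )
    return {
        "total": len(songs),
        "unique": len(keys),
        "duplicates": len(songs) - len(keys),
        "file_types": dict(types),
    }


library = [
    {"artist": "ACDC", "title": "Shot In The Dark", "path": r"z:\MP4\a.mp4"},
    {"artist": "acdc", "title": "Shot in the Dark", "path": r"z:\MP4\b.mp4"},
    {"artist": "ACDC", "title": "Thunderstruck", "path": r"z:\MP3\c.mp3"},
]
summary = summarize(library)
print(summary["total"], summary["unique"], summary["duplicates"])  # 3 2 1
```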

Skip List

The generated skipSongs.json contains paths to files that should be skipped during future imports. Each entry includes:

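  • path: File path to skip
  • reason: Why the file was marked for skipping (usually "duplicate")
  • artist, title: Metadata of the song the file was matched to
  • kept_version: Path of the version that was kept instead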

🔧 Advanced Features

Multi-Artist Handling

The tool automatically handles songs with multiple artists using various delimiters:

  • feat., ft., featuring
  • &, and
  • ,, ;, /
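
A regular expression is one straightforward way to split on the delimiters listed above. This is an illustrative sketch, not the tool's implementation; split_artists and the pattern are assumptions.

```python
import re

# Word delimiters need surrounding whitespace so that names such as
# "Anderson" are not split on the embedded "and".
_ARTIST_DELIMITERS = re.compile(
    r"\s+(?:featuring|feat\.?|ft\.?|and)\s+|\s*[&,;/]\s*",
    flags=re.IGNORECASE,
)


def split_artists(artist):
    """Split a combined artist string into individual artist names."""
    return [part.strip() for part in _ARTIST_DELIMITERS.split(artist) if part.strip()]


print(split_artists("Artist A feat. Artist B"))   # ['Artist A', 'Artist B']
print(split_artists("Artist A & Artist B; Artist C"))
# ['Artist A', 'Artist B', 'Artist C']
```

A naive pattern like this also splits names that legitimately contain a delimiter (for example a band written as "AC/DC"), so a real implementation would likely need an exception list.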

File Type Priority System

The tool uses a tiered priority system to select the best version of each song:

  1. MP4 files are always preferred when available

    • Searches for configured folder names within the file path
    • Sorts by configured priority order (first in list = highest priority)
    • Keeps the highest priority MP4 version
  2. CDG/MP3 pairs are treated as single units

    • Automatically pairs CDG and MP3 files with the same base filename
    • Example: song.cdg + song.mp3 = one complete karaoke song
    • Only considered if no MP4 files exist for the same artist/title
  3. Standalone files are lowest priority

    • Standalone MP3 files (without matching CDG)
    • Standalone CDG files (without matching MP3)
  4. Manual review candidates

    • Songs without matching folder names in channel priorities
    • Ambiguous cases requiring human decision
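
The tiers above amount to sorting the candidates for one artist/title and keeping the first. The sketch below assumes each candidate has already been labeled with a hypothetical kind of "mp4", "pair", or "standalone", uses a simple substring test for channel folders, and sorts MP4s without a configured channel after ranked ones (the tool flags such cases for manual review instead).

```python
def pick_best(candidates, channel_priorities):
    """Return (kept, skipped) for the candidates of one artist/title."""
    kind_rank = {"mp4": 0, "pair": 1, "standalone": 2}

    def channel_rank(path):
        for rank, name in enumerate(channel_priorities):
            if name in path:
                return rank
        return len(channel_priorities)  # no configured channel in the path

    def sort_key(candidate):
        return (kind_rank[candidate["kind"]], channel_rank(candidate["path"]))

    ordered = sorted(candidates, key=sort_key)
    return ordered[0], ordered[1:]


priorities = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]
candidates = [
    {"path": r"z:\MP3\song.mp3", "kind": "pair"},
    {"path": r"z:\MP4\KaraFun Karaoke\song.mp4", "kind": "mp4"},
    {"path": r"z:\MP4\Sing King Karaoke\song.mp4", "kind": "mp4"},
]
kept, skipped = pick_best(candidates, priorities)
print(kept["path"])  # z:\MP4\Sing King Karaoke\song.mp4
```

The skipped candidates are what would end up in skipSongs.json, each with kept["path"] as its kept_version.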

CDG/MP3 Pairing Logic

The tool automatically identifies and pairs CDG/MP3 files:

  • Base filename matching: Files with identical names but different extensions
  • Single unit treatment: Paired files are considered one complete karaoke song
  • Accurate duplicate detection: Prevents treating paired files as separate duplicates
  • Proper priority handling: A complete pair is ranked as one unit, below MP4 versions and above standalone files
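
Pairing by base filename can be sketched as a grouping step. The example assumes that a pair must live in the same folder and that the comparison is case-insensitive; neither detail is stated in this README, and pair_cdg_mp3 is a hypothetical name.

```python
from collections import defaultdict
import ntpath  # splits Windows-style paths regardless of the host OS


def pair_cdg_mp3(paths):
    """Group .cdg/.mp3 paths by base filename; return (pairs, standalone)."""
    by_base = defaultdict(dict)
    for path in paths:
        base, ext = ntpath.splitext(path)
        ext = ext.lower()
        if ext in (".cdg", ".mp3"):
            by_base[base.lower()][ext] = path

    pairs, standalone = [], []
    for files in by_base.values():
        if len(files) == 2:
            pairs.append((files[".cdg"], files[".mp3"]))
        else:
            standalone.extend(files.values())
    return pairs, standalone


paths = [
    r"z:\CDG\song.cdg",
    r"z:\CDG\song.mp3",
    r"z:\CDG\other.mp3",  # no matching CDG
    r"z:\MP4\clip.mp4",   # ignored: not part of CDG/MP3 pairing
]
pairs, standalone = pair_cdg_mp3(paths)
print(pairs)       # [('z:\\CDG\\song.cdg', 'z:\\CDG\\song.mp3')]
print(standalone)  # ['z:\\CDG\\other.mp3']
```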

Enhanced Analysis & Reporting

Use --save-reports to generate comprehensive analysis files:

📊 Enhanced Reports:

  • enhanced_summary_report.txt: Comprehensive analysis with detailed statistics
  • channel_optimization_report.txt: Channel priority optimization suggestions
  • duplicate_pattern_report.txt: Duplicate pattern analysis by artist, title, and channel
  • actionable_insights_report.txt: Recommendations and actionable insights
  • analysis_data.json: Raw analysis data for further processing

📋 Legacy Reports:

  • summary_report.txt: Basic overall statistics
  • duplicate_details.txt: Detailed duplicate analysis (verbose mode only)
  • skip_list_summary.txt: Skip list breakdown
  • skip_songs_detailed.json: Full skip data with metadata

🔍 Analysis Features:

  • Pattern Analysis: Identifies most duplicated artists, titles, and channels
  • Channel Optimization: Suggests optimal channel priority order based on effectiveness
  • Storage Insights: Quantifies space savings potential and duplicate distribution
  • Actionable Recommendations: Provides specific suggestions for library optimization

🛠️ Development

Project Structure for Expansion

The codebase is designed for easy expansion:

  • Modular Design: Separate modules for matching, reporting, and utilities
  • Configuration-Driven: Easy to modify behavior without code changes
  • Web UI Ready: Structure supports future web interface development

Adding New Features

  1. New File Formats: Add extensions to config.json
  2. New Matching Rules: Extend SongMatcher class in matching.py
  3. New Reports: Add methods to ReportGenerator class
  4. Web UI: Build on existing CLI structure

🎯 Current Status

Completed Features

  • Core CLI Tool: Fully functional with comprehensive duplicate detection
  • CDG/MP3 Pairing: Intelligent pairing logic for accurate karaoke song handling
  • Channel Priority System: Configurable MP4 channel priorities based on folder names
  • Skip List Generation: Complete skip list with metadata and reasoning
  • Performance Optimization: Handles large libraries (37,000+ songs) efficiently
  • Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
  • Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

🚀 Ready for Use

The tool is production-ready and has successfully processed a large karaoke library:

  • Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
  • Identified 33.6% duplicate rate with significant space savings potential
  • Provided complete metadata for informed decision-making
  • Bug Fix: Resolved duplicate entries in skip list generation

🔮 Future Roadmap

Phase 2: Enhanced Analysis & Reporting

  • Generate detailed analysis reports (--save-reports functionality)
  • Analyze MP4 files without channel priorities to suggest new folder names
  • Create comprehensive duplicate analysis reports
  • Add statistical insights and trends
  • Pattern analysis and channel optimization suggestions

Phase 3: Web Interface

  • Interactive table/grid for duplicate review
  • Embedded media player for preview
  • Bulk actions and manual overrides
  • Real-time configuration editing
  • Manual review interface for ambiguous cases

Phase 4: Advanced Features

  • Audio fingerprinting for better duplicate detection
  • Integration with karaoke software APIs
  • Batch processing and automation
  • Advanced fuzzy matching algorithms

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📝 License

This project is open source. Feel free to use, modify, and distribute according to your needs.

🆘 Troubleshooting

Common Issues

"File not found" errors

  • Ensure data/allSongs.json exists and is readable
  • Check file paths in your song data

"Invalid JSON" errors

  • Validate your JSON syntax using an online validator
  • Check for missing commas or brackets
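
If you would rather not paste a large library file into an online validator, Python's built-in json module reports the position of the first syntax error. A minimal sketch, with check_json as a hypothetical helper:

```python
import json


def check_json(text):
    """Return None if text is valid JSON, else a message locating the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


print(check_json('[{"artist": "ACDC"}]'))   # None
print(check_json('[{"artist": "ACDC",}]'))  # message pointing at the trailing comma

# To check the real input file:
# with open("data/allSongs.json", encoding="utf-8") as handle:
#     print(check_json(handle.read()))
```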

Memory issues with large libraries

  • The tool processes songs in batches to keep memory use down on large datasets
  • Run with --dry-run first to confirm the analysis completes before generating a skip list

Getting Help

  1. Check the configuration with python cli/main.py --show-config
  2. Run with --verbose for detailed output
  3. Use --dry-run to test without generating files

📊 Performance & Results

The tool is optimized for large karaoke libraries and has been tested with real-world data:

Performance Optimizations:

  • Memory Efficient: Processes songs in batches
  • Fast Matching: Optimized algorithms for duplicate detection
  • Progress Indicators: Real-time feedback for large operations
  • Scalable: Designed for libraries with 100,000+ songs (tested on a 37,015-song library)

Real-World Results:

  • Successfully processed: 37,015 songs
  • Duplicate detection: 12,424 duplicates identified (33.6% duplicate rate)
  • File type distribution: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
  • Channel analysis: 14,698 MP4s with defined priorities, 11,881 without
  • Processing time: Optimized for large datasets with progress tracking
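
The figures reported in this README are consistent with one another, which the short calculation below checks. It assumes that the MP4 percentage is the share of all songs that are MP4s (prioritized plus unprioritized) and that the 1,426 repeated entries mentioned under Current Status were removed from the 12,424 identified duplicates.

```python
total_songs = 37_015
duplicates = 12_424
repeated_entries = 1_426           # removed from the skip list
mp4_ranked, mp4_unranked = 14_698, 11_881

duplicate_rate = round(100 * duplicates / total_songs, 1)
unique_skips = duplicates - repeated_entries
mp4_share = round(100 * (mp4_ranked + mp4_unranked) / total_songs, 1)

print(duplicate_rate, unique_skips, mp4_share)  # 33.6 10998 71.8
```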

Space Savings Potential:

  • Significant storage optimization through intelligent duplicate removal
  • Quality preservation by keeping highest priority versions
  • Complete metadata for informed decision-making

Happy karaoke organizing! 🎤🎵