mbrucedogs c15ecc6d55 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

2025-07-26 16:40:56 -05:00

12 KiB

Karaoke Song Library Cleanup Tool

A powerful command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. The tool identifies duplicate songs across different formats (MP3, MP4) and generates a "skip list" for future imports, helping you maintain a clean and organized karaoke library.

🎯 Features

Smart Duplicate Detection: Identifies duplicate songs by artist and title
MP3 Pairing Logic: Automatically pairs CDG and MP3 files that share a base filename and treats each pair as a single karaoke song unit (CDG files are treated as MP3)
Multi-Format Support: Handles MP3 and MP4 files with an intelligent priority system
Channel Priority System: Configurable priority for MP4 channels based on folder names in file paths
Non-Destructive: Only generates skip lists - never deletes or moves files
Detailed Reporting: Comprehensive statistics and analysis reports
Flexible Configuration: Customizable matching rules and output options
Performance Optimized: Handles large libraries (37,000+ songs) efficiently
Future-Ready: Designed for easy expansion to web UI

📁 Project Structure

KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   └── skipSongs.json         # Output: Generated skip list
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── PRD.md                     # Product Requirements Document
└── README.md                  # This file

🚀 Quick Start

Prerequisites

Python 3.7 or higher
Your karaoke song data in JSON format (see Data Format section)

Installation

Clone or download this repository
Navigate to the project directory
Ensure your data/allSongs.json file is in place

Basic Usage

# Run with default settings
python cli/main.py

# Enable verbose output
python cli/main.py --verbose

# Dry run (analyze without generating skip list)
python cli/main.py --dry-run

# Save detailed reports
python cli/main.py --save-reports

Command Line Options

| Option | Description | Default |
|---|---|---|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose`, `-v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |
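
For readers who want to script around the CLI, the options in the table can be mirrored with Python's standard `argparse` module. This is a minimal sketch, not the contents of `cli/main.py`; the `build_parser` name is hypothetical and only the flags and defaults listed in the table are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with the documented defaults (sketch only)."""
    parser = argparse.ArgumentParser(description="Karaoke Song Library Cleanup Tool")
    parser.add_argument("--config", default="../config/config.json",
                        help="Path to configuration file")
    parser.add_argument("--input", default="../data/allSongs.json",
                        help="Path to input songs file")
    parser.add_argument("--output-dir", default="../data",
                        help="Directory for output files")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze without generating skip list")
    parser.add_argument("--save-reports", action="store_true",
                        help="Save detailed reports to files")
    parser.add_argument("--show-config", action="store_true",
                        help="Show current configuration and exit")
    return parser


# Example: parse a dry run with verbose output enabled.
args = build_parser().parse_args(["--dry-run", "-v"])
print(args.dry_run, args.verbose, args.output_dir)
```

Boolean flags use `store_true`, which is why their default appears as `False` in the table.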

📊 Data Format

Input Format (`allSongs.json`)

Your song data should be a JSON array with objects containing at least these fields:

[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
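
Before running the tool on a large export, it can help to check that every record has the fields duplicate detection depends on. The sketch below assumes that `artist`, `title`, and `path` are the fields that matter for matching; that choice, and the `validate_songs` and `load_songs` names, are assumptions for illustration rather than the tool's own validation.

```python
import json
from typing import Any, Dict, List

# Assumption: these three fields are the ones needed for matching.
REQUIRED_FIELDS = ("artist", "title", "path")


def validate_songs(songs: Any) -> List[Dict[str, Any]]:
    """Return only the records that are objects with non-empty required fields."""
    if not isinstance(songs, list):
        raise ValueError("allSongs.json must contain a JSON array")
    valid = []
    for record in songs:
        if isinstance(record, dict) and all(record.get(f) for f in REQUIRED_FIELDS):
            valid.append(record)
    return valid


def load_songs(path: str) -> List[Dict[str, Any]]:
    """Read a song library file and keep the usable records."""
    with open(path, encoding="utf-8") as handle:
        return validate_songs(json.load(handle))


# Inline sample: the second record has an empty artist and is dropped.
sample = json.loads(r'''
[
  {"artist": "ACDC", "title": "Shot In The Dark",
   "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"},
  {"artist": "", "title": "No Artist", "path": "z://MP4\\x.mp4"}
]
''')
usable = validate_songs(sample)
print(len(usable))
```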

Output Format (`skipSongs.json`)

The tool generates a skip list with this structure:

[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]

Skip List Features:

Metadata: Each skip entry includes artist, title, and the path of the kept version
Reason Tracking: Documents why each file was marked for skipping
Complete Information: Provides full context for manual review if needed
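
Because the tool never deletes files, acting on the skip list is left to whatever performs the import. A downstream import step could consume the list roughly as follows; `build_skip_set` and `filter_imports` are hypothetical helper names, and the entries mirror the structure shown above.

```python
from typing import Dict, Iterable, List, Set


def build_skip_set(skip_entries: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect the paths that should not be imported."""
    return {entry["path"] for entry in skip_entries}


def filter_imports(candidate_paths: Iterable[str], skip_paths: Set[str]) -> List[str]:
    """Keep only the candidates that are not on the skip list."""
    return [path for path in candidate_paths if path not in skip_paths]


skip_entries = [
    {
        "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
        "reason": "duplicate",
        "artist": "ACDC",
        "title": "Shot In The Dark",
        "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    }
]
candidates = [
    "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4",
]
to_import = filter_imports(candidates, build_skip_set(skip_entries))
print(to_import)
```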

⚙️ Configuration

Edit config/config.json to customize the tool's behavior:

Channel Priorities (MP4 files)

{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}

Note: Channel priorities are folder names found in the song's path property. The tool searches the file path for these exact folder names to determine priority, and names listed first rank highest.
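
One way to picture the folder-name lookup is shown below. It is a sketch under the assumption that a channel matches when it equals one whole folder component of the path; the tool's actual matching in `matching.py` may differ, and `channel_rank` is a hypothetical name.

```python
import re
from typing import List, Optional


def channel_rank(path: str, channel_priorities: List[str]) -> Optional[int]:
    """Return the index of the first configured channel found as a folder, else None."""
    # Split on both slash styles so "z://MP4\\Channel\\file.mp4" style paths work.
    parts = [part for part in re.split(r"[\\/]+", path) if part]
    folders = parts[:-1]  # the last component is the file name, not a folder
    for rank, channel in enumerate(channel_priorities):
        if channel in folders:
            return rank
    return None


priorities = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]
ranked = channel_rank(
    "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    priorities,
)
unranked = channel_rank("z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4", priorities)
print(ranked, unranked)
```

A lower rank means a higher priority, and `None` marks a file whose path contains none of the configured folder names.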

Matching Settings

{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
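
To illustrate how these three settings could interact, the sketch below compares two titles using exact comparison first and an optional similarity ratio second. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; this README does not say which fuzzy algorithm the tool uses, and `normalize` and `is_match` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Any, Dict


def normalize(text: str, case_sensitive: bool) -> str:
    """Collapse whitespace and, unless case matters, lowercase the text."""
    collapsed = " ".join(text.split())
    return collapsed if case_sensitive else collapsed.lower()


def is_match(a: str, b: str, matching: Dict[str, Any]) -> bool:
    """Exact match after normalization, with an optional fuzzy fallback."""
    left = normalize(a, matching["case_sensitive"])
    right = normalize(b, matching["case_sensitive"])
    if left == right:
        return True
    if not matching["fuzzy_matching"]:
        return False
    # Stand-in similarity measure; the ratio is in the range 0.0 to 1.0.
    return SequenceMatcher(None, left, right).ratio() >= matching["fuzzy_threshold"]


defaults = {"fuzzy_matching": False, "fuzzy_threshold": 0.8, "case_sensitive": False}
fuzzy = dict(defaults, fuzzy_matching=True)

print(is_match("Shot In The Dark", "shot in the dark", defaults))
print(is_match("Shot In The Dark", "Shot In The Dark!", defaults))
print(is_match("Shot In The Dark", "Shot In The Dark!", fuzzy))
```

With the defaults, only the first pair matches; enabling `fuzzy_matching` lets the near-identical third pair through as well.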

Output Settings

{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}

📈 Understanding the Output

Summary Report

Total songs processed: Total number of songs analyzed
Unique songs found: Number of unique artist-title combinations
Duplicates identified: Number of duplicate songs found
File type breakdown: Distribution across MP3, CDG, MP4 formats
Channel breakdown: MP4 channel distribution (if applicable)

Skip List

The generated skipSongs.json lists the files that should be skipped during future imports. Each entry includes:

path: File path to skip
reason: Why the file was marked for skipping (usually "duplicate")
artist, title: Song metadata for the skipped file
kept_version: Path of the version that was kept instead

🔧 Advanced Features

Multi-Artist Handling

The tool automatically handles songs with multiple artists using various delimiters:

"feat.", "ft.", "featuring"
"&", "and"
comma (","), semicolon (";"), slash ("/")
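
A regular expression is one straightforward way to split on the delimiters listed above. The following is a sketch only: the pattern and the `split_artists` name are assumptions, and a naive split like this would also break up names such as "AC/DC", so the tool's own handling may be more careful.

```python
import re
from typing import List

# Word delimiters need surrounding whitespace so that names like "Anderson" stay intact.
_DELIMITERS = re.compile(
    r"\s+(?:feat\.|ft\.|featuring|and)\s+|\s*[&,;/]\s*",
    flags=re.IGNORECASE,
)


def split_artists(artist: str) -> List[str]:
    """Split an artist credit into individual names, dropping empty pieces."""
    return [part.strip() for part in _DELIMITERS.split(artist) if part.strip()]


print(split_artists("Eminem feat. Rihanna"))
print(split_artists("Simon & Garfunkel"))
print(split_artists("Anderson Paak"))
```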

File Type Priority System

The tool uses a sophisticated priority system to select the best version of each song:

1. MP4 files are always preferred when available
   - Searches for configured folder names within the file path
   - Sorts by configured priority order (first in list = highest priority)
   - Keeps the highest priority MP4 version
2. CDG/MP3 pairs are treated as single units
   - Automatically pairs CDG and MP3 files with the same base filename
   - Example: song.cdg + song.mp3 = one complete karaoke song
   - Only considered if no MP4 files exist for the same artist/title
3. Standalone files are lowest priority
   - Standalone MP3 files (without matching CDG)
   - Standalone CDG files (without matching MP3)
4. Manual review candidates
   - Songs without matching folder names in channel priorities
   - Ambiguous cases requiring human decision
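
The ordering described above can be expressed as a sort key, where the smallest key wins. This is a sketch under stated assumptions: the `kind` and `channel` fields and the `version_key` and `pick_kept` names are hypothetical and do not reflect the tool's internal data model.

```python
from typing import Dict, List, Tuple

# Hypothetical "kind" labels for this sketch: MP4 first, then pairs, then standalone files.
KIND_ORDER = {"mp4": 0, "pair": 1, "standalone": 2}


def version_key(version: Dict[str, str], channel_priorities: List[str]) -> Tuple[int, int]:
    """Build a sort key where a smaller tuple means a more preferred version."""
    kind_rank = KIND_ORDER[version["kind"]]
    channel = version.get("channel")
    if version["kind"] == "mp4" and channel in channel_priorities:
        return (kind_rank, channel_priorities.index(channel))
    # Entries without a configured channel sort after every configured one of the same kind.
    return (kind_rank, len(channel_priorities))


def pick_kept(versions: List[Dict[str, str]], channel_priorities: List[str]) -> Dict[str, str]:
    """Return the preferred version; the others would be skip-list candidates."""
    return min(versions, key=lambda v: version_key(v, channel_priorities))


priorities = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]
versions = [
    {"kind": "pair", "path": "song.cdg + song.mp3"},
    {"kind": "mp4", "channel": "KaraFun Karaoke", "path": "a.mp4"},
    {"kind": "mp4", "channel": "Sing King Karaoke", "path": "b.mp4"},
    {"kind": "standalone", "path": "song.mp3"},
]
kept = pick_kept(versions, priorities)
print(kept["path"])
```

In this sketch an MP4 whose folder is not in the priority list still outranks a CDG/MP3 pair, but it sorts after every configured channel, which is where a manual review step would fit.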

CDG/MP3 Pairing Logic

The tool automatically identifies and pairs CDG/MP3 files:

Base filename matching: Files with identical names but different extensions
Single unit treatment: Paired files are considered one complete karaoke song
Accurate duplicate detection: Prevents treating paired files as separate duplicates
Proper priority handling: Ensures complete songs compete fairly with MP4 versions
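
Base filename matching can be sketched by grouping paths on everything before the extension. The example assumes that stems are compared case-insensitively and that both files sit in the same folder; those choices and the `pair_cdg_mp3` name are assumptions for illustration.

```python
import ntpath
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pair_cdg_mp3(paths: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Group .cdg and .mp3 files by base filename into pairs and standalone files."""
    by_stem: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path in paths:
        # ntpath understands both "/" and "\\" separators.
        stem, ext = ntpath.splitext(path)
        ext = ext.lower()
        if ext in (".cdg", ".mp3"):
            by_stem[stem.lower()][ext] = path
    pairs, standalone = [], []
    for files in by_stem.values():
        if len(files) == 2:
            pairs.append((files[".cdg"], files[".mp3"]))
        else:
            standalone.extend(files.values())
    return pairs, standalone


paths = [
    "z://MP3\\song.cdg",
    "z://MP3\\song.mp3",
    "z://MP3\\solo.mp3",
    "z://MP4\\video.mp4",
]
pairs, standalone = pair_cdg_mp3(paths)
print(pairs)
print(standalone)
```

Files with other extensions, such as the MP4 in the example, are ignored by this helper and would be handled by the MP4 priority rules instead.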

Enhanced Analysis & Reporting

Use --save-reports to generate comprehensive analysis files:

📊 Enhanced Reports:

enhanced_summary_report.txt: Comprehensive analysis with detailed statistics
channel_optimization_report.txt: Channel priority optimization suggestions
duplicate_pattern_report.txt: Duplicate pattern analysis by artist, title, and channel
actionable_insights_report.txt: Recommendations and actionable insights
analysis_data.json: Raw analysis data for further processing

📋 Legacy Reports:

summary_report.txt: Basic overall statistics
duplicate_details.txt: Detailed duplicate analysis (verbose mode only)
skip_list_summary.txt: Skip list breakdown
skip_songs_detailed.json: Full skip data with metadata

🔍 Analysis Features:

Pattern Analysis: Identifies most duplicated artists, titles, and channels
Channel Optimization: Suggests optimal channel priority order based on effectiveness
Storage Insights: Quantifies space savings potential and duplicate distribution
Actionable Recommendations: Provides specific suggestions for library optimization

🛠️ Development

Project Structure for Expansion

The codebase is designed for easy expansion:

Modular Design: Separate modules for matching, reporting, and utilities
Configuration-Driven: Easy to modify behavior without code changes
Web UI Ready: Structure supports future web interface development

Adding New Features

New File Formats: Add extensions to config.json
New Matching Rules: Extend SongMatcher class in matching.py
New Reports: Add methods to ReportGenerator class
Web UI: Build on existing CLI structure

🎯 Current Status

✅ Completed Features

Core CLI Tool: Fully functional with comprehensive duplicate detection
CDG/MP3 Pairing: Intelligent pairing logic for accurate karaoke song handling
Channel Priority System: Configurable MP4 channel priorities based on folder names
Skip List Generation: Complete skip list with metadata and reasoning
Performance Optimization: Handles large libraries (37,000+ songs) efficiently
Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

🚀 Ready for Use

The tool is production-ready and has successfully processed a large karaoke library:

Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
Identified 33.6% duplicate rate with significant space savings potential
Provided complete metadata for informed decision-making
Bug Fix: Resolved duplicate entries in skip list generation

🔮 Future Roadmap

Phase 2: Enhanced Analysis & Reporting ✅

✅ Generate detailed analysis reports (--save-reports functionality)
✅ Analyze MP4 files without channel priorities to suggest new folder names
✅ Create comprehensive duplicate analysis reports
✅ Add statistical insights and trends
✅ Pattern analysis and channel optimization suggestions

Phase 3: Web Interface

Interactive table/grid for duplicate review
Embedded media player for preview
Bulk actions and manual overrides
Real-time configuration editing
Manual review interface for ambiguous cases

Phase 4: Advanced Features

Audio fingerprinting for better duplicate detection
Integration with karaoke software APIs
Batch processing and automation
Advanced fuzzy matching algorithms

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

📝 License

This project is open source. Feel free to use, modify, and distribute according to your needs.

🆘 Troubleshooting

Common Issues

"File not found" errors

Ensure data/allSongs.json exists and is readable
Check file paths in your song data

"Invalid JSON" errors

Validate your JSON syntax using an online validator
Check for missing commas or brackets
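
Python's own `json` module can also report where parsing failed, which avoids uploading a large library file to an online validator. The `describe_json_error` helper below is a hypothetical name; `json.JSONDecodeError` and its `lineno`, `colno`, and `msg` attributes are part of the standard library.

```python
import json


def describe_json_error(text: str) -> str:
    """Return the location of the first JSON syntax error, or a success message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return "valid JSON"


print(describe_json_error('[{"artist": "ACDC"}]'))
# A missing comma between two objects is reported with its position.
print(describe_json_error('[{"artist": "ACDC"} {"artist": "ABBA"}]'))
```

To check a file, read it first, for example `describe_json_error(open("data/allSongs.json", encoding="utf-8").read())`. Running `python -m json.tool data/allSongs.json` from the command line gives a similar error message.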

Memory issues with large libraries

The tool is optimized for large datasets
Consider running with --dry-run first to test

Getting Help

Check the configuration with python cli/main.py --show-config
Run with --verbose for detailed output
Use --dry-run to test without generating files

📊 Performance & Results

The tool is optimized for large karaoke libraries and has been tested with real-world data:

Performance Optimizations:

Memory Efficient: Processes songs in batches
Fast Matching: Optimized algorithms for duplicate detection
Progress Indicators: Real-time feedback for large operations
Scalable: Handles libraries with 100,000+ songs

Real-World Results:

Successfully processed: 37,015 songs
Duplicate detection: 12,424 duplicates identified (33.6% duplicate rate)
File type distribution: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
Channel analysis: 14,698 MP4s with defined priorities, 11,881 without
Processing time: Optimized for large datasets with progress tracking
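
The reported figures can be cross-checked with a few lines of arithmetic. The numbers below are taken from this README (the 10,998 and 1,426 counts come from the Current Status section); the variable names are only for illustration.

```python
total_songs = 37_015
duplicates = 12_424
unique_skip_files = 10_998
removed_entries = 1_426
mp4_with_priority = 14_698
mp4_without_priority = 11_881

# Duplicate rate as a percentage of all processed songs.
duplicate_rate = round(100 * duplicates / total_songs, 1)

# Share of songs that are MP4, from the two channel-analysis counts.
mp4_total = mp4_with_priority + mp4_without_priority
mp4_share = round(100 * mp4_total / total_songs, 1)

# Skip-list entries before and after removing repeated entries.
entries_before_cleanup = unique_skip_files + removed_entries

print(duplicate_rate, mp4_share, entries_before_cleanup)
```

The results agree with the stated 33.6% duplicate rate and 71.8% MP4 share, and the skip-list counts add up to the 12,424 duplicates identified.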

Space Savings Potential:

Significant storage optimization through intelligent duplicate removal
Quality preservation by keeping highest priority versions
Complete metadata for informed decision-making

Happy karaoke organizing! 🎤🎵
