# Karaoke Song Library Cleanup Tool

A command-line tool for analyzing, deduplicating, and cleaning up large karaoke song collections. It identifies duplicate songs across formats (MP3, CDG, MP4) and generates a "skip list" for future imports, so you can keep your karaoke library clean and organized.

## 🎯 Features

- **Smart Duplicate Detection**: Identifies duplicate songs by artist and title
- **CDG/MP3 Pairing Logic**: Automatically pairs CDG and MP3 files that share a base filename and treats each pair as a single karaoke song unit (CDG files are treated as MP3)
- **Multi-Format Support**: Handles MP3, CDG, and MP4 files with an intelligent priority system
- **Channel Priority System**: Configurable priority for MP4 channels based on folder names in file paths
- **Non-Destructive**: Only generates skip lists - never deletes or moves files
- **Detailed Reporting**: Comprehensive statistics and analysis reports
- **Flexible Configuration**: Customizable matching rules and output options
- **Performance Optimized**: Handles large libraries (37,000+ songs) efficiently
- **Future-Ready**: Designed for easy expansion to web UI

## 📁 Project Structure

```
KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   └── skipSongs.json         # Output: Generated skip list
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── PRD.md                     # Product Requirements Document
└── README.md                  # This file
```

## 🚀 Quick Start

### Prerequisites

- Python 3.7 or higher
- Your karaoke song data in JSON format (see Data Format section)

### Installation

1. Clone or download this repository
2. Navigate to the project directory
3. Ensure your `data/allSongs.json` file is in place

### Basic Usage

```bash
# Run with default settings
python cli/main.py

# Enable verbose output
python cli/main.py --verbose

# Dry run (analyze without generating skip list)
python cli/main.py --dry-run

# Save detailed reports
python cli/main.py --save-reports
```

### Command Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--config` | Path to configuration file | `../config/config.json` |
| `--input` | Path to input songs file | `../data/allSongs.json` |
| `--output-dir` | Directory for output files | `../data` |
| `--verbose`, `-v` | Enable verbose output | `False` |
| `--dry-run` | Analyze without generating skip list | `False` |
| `--save-reports` | Save detailed reports to files | `False` |
| `--show-config` | Show current configuration and exit | `False` |
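
For reference, the options in the table could be declared with `argparse` roughly as shown here. This is a sketch reconstructed from the table, not the actual contents of `cli/main.py`; the `build_parser` name is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical helper, not from main.py)."""
    parser = argparse.ArgumentParser(
        description="Karaoke Song Library Cleanup Tool"
    )
    parser.add_argument("--config", default="../config/config.json",
                        help="Path to configuration file")
    parser.add_argument("--input", default="../data/allSongs.json",
                        help="Path to input songs file")
    parser.add_argument("--output-dir", default="../data",
                        help="Directory for output files")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze without generating skip list")
    parser.add_argument("--save-reports", action="store_true",
                        help="Save detailed reports to files")
    parser.add_argument("--show-config", action="store_true",
                        help="Show current configuration and exit")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--dry-run", "-v"])
print(args.dry_run, args.verbose, args.output_dir)
```

Boolean flags use `store_true`, which matches the `False` defaults listed in the table.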

## 📊 Data Format

### Input Format (`allSongs.json`)

Your song data should be a JSON array with objects containing at least these fields:

```json
[
  {
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4",
    "guid": "8946008c-7acc-d187-60e6-5286e55ad502",
    "disabled": false,
    "favorite": false
  }
]
```
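
Before running the tool on a large export, it can help to check that every record carries the fields the duplicate logic relies on. The sketch below assumes that `artist`, `title`, and `path` are the essential fields (the README only says "at least these fields", so this minimum is a guess), and `validate_songs` is a hypothetical helper, not part of the tool.

```python
import json

# Assumed minimum set of fields; adjust if the tool requires more.
REQUIRED_FIELDS = ("artist", "title", "path")


def validate_songs(songs):
    """Split records into (valid, invalid) based on the required fields."""
    valid, invalid = [], []
    for song in songs:
        ok = isinstance(song, dict) and all(
            isinstance(song.get(field), str) and song[field].strip()
            for field in REQUIRED_FIELDS
        )
        (valid if ok else invalid).append(song)
    return valid, invalid


# Inline sample data; a real run would read data/allSongs.json instead.
sample = json.loads(r"""
[
  {"artist": "ACDC", "title": "Shot In The Dark",
   "path": "z://MP4\\ACDC - Shot In The Dark (Karaoke Version).mp4"},
  {"artist": "ACDC", "title": ""}
]
""")
valid, invalid = validate_songs(sample)
print(len(valid), "valid,", len(invalid), "invalid")
```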

### Output Format (`skipSongs.json`)

The tool generates a skip list with this structure:

```json
[
  {
    "path": "z://MP4\\ACDC - Shot In The Dark (Instrumental).mp4",
    "reason": "duplicate",
    "artist": "ACDC",
    "title": "Shot In The Dark",
    "kept_version": "z://MP4\\Sing King Karaoke\\ACDC - Shot In The Dark (Karaoke Version).mp4"
  }
]
```

**Skip List Features:**
- **Metadata**: Each skip entry includes artist, title, and the path of the kept version
- **Reason Tracking**: Documents why each file was marked for skipping
- **Complete Information**: Provides full context for manual review if needed
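
How the skip list is consumed depends on your import workflow. As one possible approach, an import script could load the `path` values into a set and filter candidate files against it. The helper names below are hypothetical, and the sketch assumes paths are compared as exact strings.

```python
def load_skip_paths(skip_entries):
    """Collect the paths to skip into a set for fast membership checks."""
    return {entry["path"] for entry in skip_entries}


def filter_imports(candidate_paths, skip_paths):
    """Keep only candidates that are not on the skip list."""
    return [path for path in candidate_paths if path not in skip_paths]


# Entry mirroring the skipSongs.json example (raw strings hold one backslash).
skip_entries = [
    {
        "path": r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4",
        "reason": "duplicate",
        "artist": "ACDC",
        "title": "Shot In The Dark",
        "kept_version": r"z://MP4\Sing King Karaoke\ACDC - Shot In The Dark (Karaoke Version).mp4",
    }
]
candidates = [
    r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4",
    r"z://MP4\Sing King Karaoke\ACDC - Shot In The Dark (Karaoke Version).mp4",
]
to_import = filter_imports(candidates, load_skip_paths(skip_entries))
print(to_import)
```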

## ⚙️ Configuration

Edit `config/config.json` to customize the tool's behavior:

### Channel Priorities (MP4 files)
```json
{
  "channel_priorities": [
    "Sing King Karaoke",
    "KaraFun Karaoke",
    "Stingray Karaoke"
  ]
}
```

**Note**: Channel priorities are folder names found in the song's `path` property. The tool searches for these exact folder names within the file path to determine priority; the first entry in the list has the highest priority.
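
The folder lookup can be pictured with a short sketch. It assumes a case-sensitive match against whole folder components and paths that may mix `/` and `\` separators; `channel_rank` is a hypothetical name, and the tool's real lookup may differ.

```python
import re

CHANNEL_PRIORITIES = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]


def channel_rank(path, priorities=CHANNEL_PRIORITIES):
    """Return the index of the first configured folder found in the path.

    Lower is better; len(priorities) means no configured folder matched.
    """
    # Split on both separator styles and drop the trailing filename.
    folders = [part for part in re.split(r"[\\/]+", path) if part][:-1]
    for rank, channel in enumerate(priorities):
        if channel in folders:
            return rank
    return len(priorities)


print(channel_rank(r"z://MP4\Sing King Karaoke\ACDC - Shot In The Dark (Karaoke Version).mp4"))
print(channel_rank(r"z://MP4\ACDC - Shot In The Dark (Instrumental).mp4"))
```

A path with no configured folder gets the worst rank, which is one way to flag it for manual review.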

### Matching Settings
```json
{
  "matching": {
    "fuzzy_matching": false,
    "fuzzy_threshold": 0.8,
    "case_sensitive": false
  }
}
```
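
To illustrate how these three settings could interact, here is a minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the README does not say which metric `matching.py` uses, so treat the comparison and the `is_same_song` helper as assumptions.

```python
from difflib import SequenceMatcher

MATCHING = {"fuzzy_matching": False, "fuzzy_threshold": 0.8, "case_sensitive": False}


def normalize(text, case_sensitive=False):
    """Collapse runs of whitespace and, unless case_sensitive, lowercase."""
    text = " ".join(text.split())
    return text if case_sensitive else text.lower()


def is_same_song(a, b, settings=MATCHING):
    """Compare two songs by artist and title under the given settings."""
    key_a = normalize(f"{a['artist']} - {a['title']}", settings["case_sensitive"])
    key_b = normalize(f"{b['artist']} - {b['title']}", settings["case_sensitive"])
    if key_a == key_b:
        return True
    if not settings["fuzzy_matching"]:
        return False
    # Similarity ratio in [0, 1]; accept matches at or above the threshold.
    ratio = SequenceMatcher(None, key_a, key_b).ratio()
    return ratio >= settings["fuzzy_threshold"]


one = {"artist": "ACDC", "title": "Shot In The Dark"}
two = {"artist": "acdc", "title": "shot in the  dark"}
print(is_same_song(one, two))
```

With `fuzzy_matching` left at `false`, only exact matches after normalization count as duplicates, which keeps false positives low.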

### Output Settings
```json
{
  "output": {
    "verbose": false,
    "include_reasons": true,
    "max_duplicates_per_song": 10
  }
}
```

## 📈 Understanding the Output

### Summary Report
- **Total songs processed**: Total number of songs analyzed
- **Unique songs found**: Number of unique artist-title combinations
- **Duplicates identified**: Number of duplicate songs found
- **File type breakdown**: Distribution across MP3, CDG, MP4 formats
- **Channel breakdown**: MP4 channel distribution (if applicable)

### Skip List
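The generated `skipSongs.json` contains paths to files that should be skipped during future imports. Each entry includes:
- `path`: File path to skip
- `reason`: Why the file was marked for skipping (usually "duplicate")
- `artist`, `title`: Metadata of the skipped song
- `kept_version`: Path of the version that was kept instead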

## 🔧 Advanced Features

### Multi-Artist Handling
The tool automatically handles songs with multiple artists using various delimiters:
- `feat.`, `ft.`, `featuring`
- `&`, `and`
- `,`, `;`, `/`
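
One way to split on these delimiters is a single regular expression. The sketch below is illustrative only (`split_artists` is a hypothetical name, and the tool's own parsing may differ); word delimiters are wrapped in word boundaries so that `and` does not split a name like "Brandy".

```python
import re

# Word delimiters use \b so they only match as whole words.
_ARTIST_SPLIT = re.compile(
    r"\s*(?:\bfeaturing\b|\bfeat\b\.?|\bft\b\.?|\band\b|[&,;/])\s*",
    re.IGNORECASE,
)


def split_artists(artist):
    """Split an artist string into individual artist names."""
    return [part.strip() for part in _ARTIST_SPLIT.split(artist) if part.strip()]


print(split_artists("Artist A ft. Artist B & Artist C"))
print(split_artists("Brandy; Monica"))
```

Note that a purely delimiter-based split also breaks up band names that contain a delimiter (for example a name written with `/` or `&`), so such names need an exception list or manual review.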

### File Type Priority System
The tool uses a sophisticated priority system to select the best version of each song:

1. **MP4 files are always preferred** when available
   - Searches for configured folder names within the file path
   - Sorts by configured priority order (first in list = highest priority)
   - Keeps the highest priority MP4 version

2. **CDG/MP3 pairs** are treated as single units
   - Automatically pairs CDG and MP3 files with the same base filename
   - Example: `song.cdg` + `song.mp3` = one complete karaoke song
   - Only considered if no MP4 files exist for the same artist/title

3. **Standalone files** are lowest priority
   - Standalone MP3 files (without matching CDG)
   - Standalone CDG files (without matching MP3)

4. **Manual review candidates**
   - Songs without matching folder names in channel priorities
   - Ambiguous cases requiring human decision
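
The ordering described in this list can be sketched as a sort key. The example assumes each candidate in a duplicate group is a dict with a `path` and an optional `paired` flag set for CDG/MP3 pairs; these names and the `pick_best` helper are hypothetical, and manual-review handling is left out.

```python
import re

CHANNEL_PRIORITIES = ["Sing King Karaoke", "KaraFun Karaoke", "Stingray Karaoke"]


def pick_best(candidates, priorities=CHANNEL_PRIORITIES):
    """Return (kept, skipped) for one non-empty artist/title group."""

    def sort_key(candidate):
        path = candidate["path"]
        if path.lower().endswith(".mp4"):
            # Tier 0: MP4, ranked by the first configured folder in the path.
            folders = re.split(r"[\\/]+", path)[:-1]
            rank = next(
                (i for i, name in enumerate(priorities) if name in folders),
                len(priorities),
            )
            return (0, rank)
        if candidate.get("paired"):
            return (1, 0)  # Tier 1: complete CDG/MP3 pair
        return (2, 0)      # Tier 2: standalone MP3 or CDG

    ordered = sorted(candidates, key=sort_key)
    return ordered[0], ordered[1:]


group = [
    {"path": r"z://MP4\KaraFun Karaoke\Song.mp4"},
    {"path": r"z://MP3\Song.mp3", "paired": True},
    {"path": r"z://MP4\Sing King Karaoke\Song.mp4"},
]
kept, skipped = pick_best(group)
print(kept["path"])
```

Because `sorted` is stable, candidates with equal keys keep their input order, so ties resolve to whichever version appeared first.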

### CDG/MP3 Pairing Logic
The tool automatically identifies and pairs CDG/MP3 files:
- **Base filename matching**: Files with identical names but different extensions
- **Single unit treatment**: Paired files are considered one complete karaoke song
- **Accurate duplicate detection**: Prevents treating paired files as separate duplicates
- **Proper priority handling**: Ranks a complete CDG/MP3 pair as one song, below MP4 versions and above standalone files
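
Base filename matching can be sketched as grouping paths by everything before the final extension. This example assumes that paired files live in the same folder and that names match exactly (case-sensitive); `pair_cdg_mp3` is a hypothetical helper, not the tool's implementation.

```python
from collections import defaultdict


def pair_cdg_mp3(paths):
    """Group CDG and MP3 paths by base name; return (pairs, standalone)."""
    by_base = defaultdict(dict)
    for path in paths:
        base, dot, ext = path.rpartition(".")
        ext = ext.lower()
        if dot and ext in ("cdg", "mp3"):
            by_base[base][ext] = path  # other file types are ignored here

    pairs, standalone = [], []
    for files in by_base.values():
        if len(files) == 2:
            pairs.append((files["cdg"], files["mp3"]))
        else:
            standalone.extend(files.values())
    return pairs, standalone


paths = [
    r"z://CDG\song.cdg",
    r"z://CDG\song.mp3",
    r"z://CDG\other.mp3",
    r"z://MP4\song.mp4",
]
pairs, standalone = pair_cdg_mp3(paths)
print(pairs)
print(standalone)
```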

### Enhanced Analysis & Reporting
Use `--save-reports` to generate comprehensive analysis files:

**📊 Enhanced Reports:**
- `enhanced_summary_report.txt`: Comprehensive analysis with detailed statistics
- `channel_optimization_report.txt`: Channel priority optimization suggestions
- `duplicate_pattern_report.txt`: Duplicate pattern analysis by artist, title, and channel
- `actionable_insights_report.txt`: Recommendations and actionable insights
- `analysis_data.json`: Raw analysis data for further processing

**📋 Legacy Reports:**
- `summary_report.txt`: Basic overall statistics
- `duplicate_details.txt`: Detailed duplicate analysis (verbose mode only)
- `skip_list_summary.txt`: Skip list breakdown
- `skip_songs_detailed.json`: Full skip data with metadata

**🔍 Analysis Features:**
- **Pattern Analysis**: Identifies most duplicated artists, titles, and channels
- **Channel Optimization**: Suggests optimal channel priority order based on effectiveness
- **Storage Insights**: Quantifies space savings potential and duplicate distribution
- **Actionable Recommendations**: Provides specific suggestions for library optimization

## 🛠️ Development

### Project Structure for Expansion

The codebase is designed for easy expansion:

- **Modular Design**: Separate modules for matching, reporting, and utilities
- **Configuration-Driven**: Easy to modify behavior without code changes
- **Web UI Ready**: Structure supports future web interface development

### Adding New Features

1. **New File Formats**: Add extensions to `config.json`
2. **New Matching Rules**: Extend `SongMatcher` class in `matching.py`
3. **New Reports**: Add methods to `ReportGenerator` class
4. **Web UI**: Build on existing CLI structure

## 🎯 Current Status

### ✅ **Completed Features**
- **Core CLI Tool**: Fully functional with comprehensive duplicate detection
- **CDG/MP3 Pairing**: Intelligent pairing logic for accurate karaoke song handling
- **Channel Priority System**: Configurable MP4 channel priorities based on folder names
- **Skip List Generation**: Complete skip list with metadata and reasoning
- **Performance Optimization**: Handles large libraries (37,000+ songs) efficiently
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions

### 🚀 **Ready for Use**
The tool is production-ready and has successfully processed a large karaoke library:
- Generated skip list for 10,998 unique duplicate files (after removing 1,426 duplicate entries)
- Identified 33.6% duplicate rate with significant space savings potential
- Provided complete metadata for informed decision-making
- **Bug Fix**: Resolved duplicate entries in skip list generation

## 🔮 Future Roadmap

### Phase 2: Enhanced Analysis & Reporting ✅
- ✅ Generate detailed analysis reports (`--save-reports` functionality)
- ✅ Analyze MP4 files without channel priorities to suggest new folder names
- ✅ Create comprehensive duplicate analysis reports
- ✅ Add statistical insights and trends
- ✅ Pattern analysis and channel optimization suggestions

### Phase 3: Web Interface
- Interactive table/grid for duplicate review
- Embedded media player for preview
- Bulk actions and manual overrides
- Real-time configuration editing
- Manual review interface for ambiguous cases

### Phase 4: Advanced Features
- Audio fingerprinting for better duplicate detection
- Integration with karaoke software APIs
- Batch processing and automation
- Advanced fuzzy matching algorithms

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📝 License

This project is open source. Feel free to use, modify, and distribute according to your needs.

## 🆘 Troubleshooting

### Common Issues

**"File not found" errors**
- Ensure `data/allSongs.json` exists and is readable
- Check file paths in your song data

**"Invalid JSON" errors**
- Validate your JSON syntax using an online validator
- Check for missing commas or brackets
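
If you prefer not to upload your library to an online validator, Python's standard `json` module reports where parsing fails. The `describe_json_error` helper below is an illustrative sketch, not part of the tool.

```python
import json


def describe_json_error(text):
    """Return None if text is valid JSON, else a message with line and column."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A record with a missing comma between two fields.
broken = '[\n  {"artist": "ACDC" "title": "Shot In The Dark"}\n]'
print(describe_json_error(broken))
print(describe_json_error('[{"artist": "ACDC"}]'))
```

To check a real file, read it as UTF-8 text and pass the contents to the same function.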

**Memory issues with large libraries**
- The tool is optimized for large datasets
- Consider running with `--dry-run` first to test

### Getting Help

1. Check the configuration with `python cli/main.py --show-config`
2. Run with `--verbose` for detailed output
3. Use `--dry-run` to test without generating files

## 📊 Performance & Results

The tool is optimized for large karaoke libraries and has been tested with real-world data:

### **Performance Optimizations:**
- **Memory Efficient**: Processes songs in batches
- **Fast Matching**: Optimized algorithms for duplicate detection
- **Progress Indicators**: Real-time feedback for large operations
- **Scalable**: Designed to handle libraries with 100,000+ songs (tested with a 37,015-song library)

### **Real-World Results:**
- **Successfully processed**: 37,015 songs
- **Duplicate detection**: 12,424 duplicates identified (33.6% duplicate rate)
- **File type distribution**: 45.8% MP3, 71.8% MP4 (some songs have multiple formats)
- **Channel analysis**: 14,698 MP4s with defined priorities, 11,881 without
- **Processing time**: Optimized for large datasets with progress tracking
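
The percentages above follow directly from the raw counts, and they line up with the skip list size reported in the Current Status section. A quick check, using only figures quoted in this README:

```python
total_songs = 37_015
duplicates = 12_424
removed_duplicate_entries = 1_426
mp4_with_priority = 14_698
mp4_without_priority = 11_881

# Skip list size after removing repeated entries.
unique_skip_files = duplicates - removed_duplicate_entries
duplicate_rate = duplicates / total_songs
mp4_share = (mp4_with_priority + mp4_without_priority) / total_songs

print(unique_skip_files)        # 10998
print(f"{duplicate_rate:.1%}")  # 33.6%
print(f"{mp4_share:.1%}")       # 71.8%
```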

### **Space Savings Potential:**
- **Significant storage optimization** through intelligent duplicate removal
- **Quality preservation** by keeping highest priority versions
- **Complete metadata** for informed decision-making

---

**Happy karaoke organizing! 🎤🎵**