mbrucedogs/KaraokeMerge

mbrucedogs c9b2e23e04 Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

2025-07-26 16:56:08 -05:00

Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

1. Project Summary

Goal: Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
Primary User: Admin (self, collection owner)
Initial Interface: Command Line (CLI) with print/logging and JSON output
Future Expansion: Optional web UI for filtering, review, and playback

2. Architectural Priorities

2.1 Code Organization Principles

TOP PRIORITY: The codebase must be built with the following architectural principles from the beginning:

True Separation of Concerns:
- Many small files with focused responsibilities
- Each module/class should have a single, well-defined purpose
- Avoid monolithic files with mixed responsibilities
Constants and Enums:
- Create constants, enums, and configuration objects to avoid duplicate code or values
- Centralize magic numbers, strings, and configuration values
- Use enums for type safety and clarity
Readability and Maintainability:
- Code should be self-documenting with clear naming conventions
- Easy to understand, extend, and refactor
- Consistent patterns throughout the codebase
Extensibility:
- Design for future growth and feature additions
- Modular architecture that allows easy integration of new components
- Clear interfaces between modules
Refactorability:
- Code structure should make future refactoring straightforward
- Minimize coupling between components
- Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
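As a rough illustration of the "Constants and Enums" principle, the sketch below centralizes file types and default paths in one place. The names `FileType`, `DEFAULT_INPUT_PATH`, `DEFAULT_SKIP_LIST_PATH`, `DEFAULT_CONFIG_PATH`, and `file_type_from_extension` are hypothetical and not taken from the codebase; the paths are assumed to be relative to the project root.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    """Supported karaoke file types, keyed by file extension."""
    MP4 = ".mp4"
    MP3 = ".mp3"
    CDG = ".cdg"


# Hypothetical constants: one place for values that would otherwise be repeated.
DEFAULT_INPUT_PATH = "data/allSongs.json"
DEFAULT_SKIP_LIST_PATH = "data/skipSongs.json"
DEFAULT_CONFIG_PATH = "config/config.json"


def file_type_from_extension(extension: str) -> Optional[FileType]:
    """Map an extension such as '.MP3' to a FileType, or None if unsupported."""
    try:
        return FileType(extension.lower())
    except ValueError:
        return None
```

Other modules would then import `FileType` rather than compare raw extension strings, which keeps the supported formats defined in a single file.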

3. Data Handling & Matching Logic

3.1 Input

Reads from /data/allSongs.json
Each song includes at least:
- artist, title, path (plus ID3 tag info, and channel for MP4s)
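A minimal loading sketch, assuming allSongs.json is a JSON array of objects and that records missing any of the three core fields should be set aside rather than matched. The function names `split_songs` and `load_songs` are hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("artist", "title", "path")


def split_songs(songs):
    """Separate records that have all required fields from incomplete ones."""
    valid, incomplete = [], []
    for song in songs:
        if all(song.get(field) for field in REQUIRED_FIELDS):
            valid.append(song)
        else:
            incomplete.append(song)
    return valid, incomplete


def load_songs(json_path):
    """Read the song library file (assumed to be a JSON array) and split it."""
    with Path(json_path).open(encoding="utf-8") as handle:
        return split_songs(json.load(handle))
```

Keeping the incomplete records in a separate list, instead of dropping them, lets a report mention them later.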

3.2 Song Matching

Primary keys: artist + title
- Fuzzy matching is configurable (can be enabled or disabled, with a similarity threshold)
- Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
File type detection: Use file extension from path (.mp3, .cdg, .mp4)
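The matching rules above could be sketched as follows. This is an illustration, not the logic in cli/matching.py: it uses the standard library's `difflib.SequenceMatcher` as a stand-in because the PRD leaves the fuzzy matching library open, and the delimiter set, the 0.9 default threshold, and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePath

# Assumed delimiter set; the PRD only says "commas, feat., etc.".
ARTIST_DELIMITERS = re.compile(
    r"\s*(?:,|&|\bfeaturing\b|\bfeat\b\.?|\bft\b\.?)\s*", re.IGNORECASE
)
SUPPORTED_EXTENSIONS = {".mp3", ".cdg", ".mp4"}


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def split_artists(artist):
    """Split a multi-artist string into a sorted list of normalized names."""
    parts = (normalize(part) for part in ARTIST_DELIMITERS.split(artist))
    return sorted(part for part in parts if part)


def match_key(song):
    """Build the artist + title comparison key for one song record."""
    return f"{' '.join(split_artists(song['artist']))} | {normalize(song['title'])}"


def is_same_song(song_a, song_b, fuzzy_enabled=True, threshold=0.9):
    """Exact match on the key, or a fuzzy match when enabled."""
    key_a, key_b = match_key(song_a), match_key(song_b)
    if key_a == key_b:
        return True
    return fuzzy_enabled and SequenceMatcher(None, key_a, key_b).ratio() >= threshold


def file_type(path):
    """Return the lowercased extension if it is a supported type, else None."""
    extension = PurePath(path).suffix.lower()
    return extension if extension in SUPPORTED_EXTENSIONS else None
```

Sorting the artist names means "Drake feat. Rihanna" and "Rihanna, Drake" produce the same key, so artist order does not affect an exact match.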

3.3 Channel Priority (for MP4s)

Configurable folder names:
- Set in /config/config.json as an array of folder names
- Order = priority (first = highest priority)
- Tool searches for these folder names within the song's path property
- Songs without matching folder names are marked for manual review
File type priority: MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
CDG/MP3 pairing: CDG and MP3 files with the same base filename are treated as a single karaoke song unit
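The priority and pairing rules in this section might look like the sketch below. It assumes case-insensitive substring matching of folder names, and that "same base filename" means the same directory and stem; neither detail is stated in the PRD. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

# Order mirrors the PRD: MP4 > CDG/MP3 pair > standalone MP3 > standalone CDG.
FILE_TYPE_PRIORITY = [
    frozenset({".mp4"}),
    frozenset({".cdg", ".mp3"}),
    frozenset({".mp3"}),
    frozenset({".cdg"}),
]


def channel_rank(path, channel_priorities):
    """Index of the first configured folder name found in path, or None.

    None means no configured channel matched, so the song needs manual review.
    """
    lowered = path.lower()
    for rank, folder in enumerate(channel_priorities):
        if folder.lower() in lowered:
            return rank
    return None


def group_units(paths):
    """Group files into song units; a .cdg and .mp3 with the same base form one unit."""
    units = defaultdict(dict)
    for path in paths:
        parsed = PureWindowsPath(path)  # accepts both "/" and "\\" separators
        extension = parsed.suffix.lower()
        if extension in (".cdg", ".mp3"):
            key = ("pair", str(parsed.with_suffix("")).lower())
        else:
            key = ("single", str(parsed).lower())
        units[key][extension] = path
    return list(units.values())


def unit_priority(unit):
    """Lower number = higher priority; unknown combinations sort last."""
    extensions = frozenset(unit)
    if extensions in FILE_TYPE_PRIORITY:
        return FILE_TYPE_PRIORITY.index(extensions)
    return len(FILE_TYPE_PRIORITY)
```

Sorting a duplicate group by `unit_priority`, then by `channel_rank` for MP4s, would give the keep-first ordering described above.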

4. Output & Reporting

4.1 Skip List

Format: JSON (/data/skipSongs.json)
- List of file paths to skip in future imports
- Optionally: “reason” field (e.g., {"path": "...", "reason": "duplicate"})
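A small sketch of the skip list format described above, with the optional "reason" field. The helper names `make_skip_entry` and `dump_skip_list` are hypothetical, and whether the real file stores bare paths or objects is left to the implementation.

```python
import json


def make_skip_entry(path, reason=None):
    """Build one skip list entry; the reason field is included only when given."""
    entry = {"path": path}
    if reason is not None:
        entry["reason"] = reason
    return entry


def dump_skip_list(entries):
    """Serialize entries as pretty-printed JSON, keeping non-ASCII paths readable."""
    return json.dumps(entries, indent=2, ensure_ascii=False)
```

The serialized string can then be written to the skip list file; nothing here touches the media files themselves.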

4.2 CLI Reporting

Summary: Total songs, duplicates found, types breakdown, etc.
Verbose per-song output: Only for matches/duplicates (not every song)
Verbosity: Configurable via CLI flag or config
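One way the reporting options could be wired up with `argparse` is sketched below. The document mentions --save-reports and verbose and dry-run modes; the exact spellings `--verbose` and `--dry-run`, the help texts, and the summary layout are assumptions.

```python
import argparse


def build_parser():
    """Create the CLI parser; flag spellings other than --save-reports are assumed."""
    parser = argparse.ArgumentParser(description="Karaoke song library cleanup tool")
    parser.add_argument("--verbose", action="store_true",
                        help="print per-song details for matches/duplicates")
    parser.add_argument("--dry-run", action="store_true",
                        help="analyze only; do not write the skip list")
    parser.add_argument("--save-reports", action="store_true",
                        help="write detailed reports to the reports folder")
    return parser


def format_summary(total_songs, duplicates, counts_by_type):
    """Return the summary block: totals, duplicate rate, and per-type counts."""
    rate = duplicates / total_songs * 100 if total_songs else 0.0
    lines = [
        f"Total songs:      {total_songs:,}",
        f"Duplicates found: {duplicates:,} ({rate:.1f}%)",
    ]
    lines += [f"  {ext}: {count:,}" for ext, count in sorted(counts_by_type.items())]
    return "\n".join(lines)
```

Per-song lines would be printed only when `args.verbose` is set and only for songs that belong to a duplicate group, matching the "not every song" rule.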

4.3 Manual Review (Web UI)

Interactive Web Interface: Table/grid view for ambiguous/complex cases
Media Preview: Ability to preview media before making a selection
Bulk Actions: Select multiple items for batch operations
Real-time Filtering: Search and filter capabilities
Responsive Design: Works on desktop and mobile devices
Easy Startup: Simple script (start_web_ui.py) with dependency checking

5. Features & Edge Cases

Batch Processing:
- E.g., "Auto-skip all but highest-priority channel for each song"
- Manual review as CLI flag (future: always in web UI)
Edge Cases:
- Multiple versions of the same song (more than two formats)
- Support for keeping multiple versions per song (configurable/manual)
Non-destructive: Never deletes or moves files, only generates skip list and reports
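The "auto-skip all but highest-priority channel" batch rule could be sketched like this for one group of duplicate MP4 paths. The function name and the three-way return value are hypothetical; the function only returns lists and never touches files, in keeping with the non-destructive requirement.

```python
def pick_best(paths, channel_priorities):
    """Split one duplicate group into (keep, skip, needs_review).

    Paths matching no configured channel folder go to needs_review
    instead of being skipped automatically.
    """
    ranked, needs_review = [], []
    for path in paths:
        lowered = path.lower()
        rank = next(
            (i for i, name in enumerate(channel_priorities) if name.lower() in lowered),
            None,
        )
        if rank is None:
            needs_review.append(path)
        else:
            ranked.append((rank, path))
    ranked.sort(key=lambda item: item[0])  # stable: ties keep input order
    keep = ranked[0][1] if ranked else None
    skip = [path for _, path in ranked[1:]]
    return keep, skip, needs_review
```

If keeping multiple versions per song is enabled, the same ranking could be reused with a larger "keep" slice instead of a single winner.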

6. Tech Stack & Organization

CLI Language: Python
Config: JSON (channel priorities, settings)
Current Folder Structure:

KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   ├── skipSongs.json         # Output: Generated skip list
│   └── reports/               # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── web/                       # Web UI for manual review
│   ├── app.py                 # Flask web application
│   └── templates/
│       └── index.html         # Web interface template
├── start_web_ui.py            # Web UI startup script
├── test_tool.py               # Validation and testing script
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── PRD.md                     # Product Requirements Document
└── README.md                  # Project documentation

7. Web UI Implementation

7.1 Current Web UI Features

Interactive Table View: Sortable, filterable grid of duplicate songs
Bulk Selection: Select multiple items for batch operations
Search & Filter: Real-time search across artists, titles, and paths
Responsive Design: Mobile-friendly interface
Easy Startup: Automated dependency checking and browser launch

7.2 Web UI Architecture

Flask Backend: Lightweight web server (web/app.py)
HTML Template: Modern, responsive interface (web/templates/index.html)
Startup Script: Dependency management and server startup (start_web_ui.py)
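The dependency check in the startup script might be approximated as below. This is a sketch, not the contents of start_web_ui.py: the package list is assumed from the Flask backend noted above, and `missing_packages` and `open_ui` are hypothetical names. The URL passed to `open_ui` would depend on how the server is configured.

```python
import importlib.util
import webbrowser

# Assumption: the authoritative list lives in requirements.txt.
REQUIRED_PACKAGES = ["flask"]


def missing_packages(packages):
    """Return the top-level packages that cannot be imported in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def open_ui(url):
    """Open the web UI in the default browser once the server is running."""
    return webbrowser.open(url)
```

A startup script could call `missing_packages(REQUIRED_PACKAGES)` first and print an install hint instead of starting the server when the list is not empty.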

7.3 Future Web UI Enhancements

Embedded media player for audio/video preview
Real-time configuration editing
Advanced filtering and sorting options
Export capabilities for manual selections

8. Open Questions (for future refinement)

Fuzzy matching library/thresholds?
Best parsing rules for multi-artist/feat. strings?
Any alternate export formats needed?
Temporary/partial skip support for "under review" songs?

9. Implementation Status

✅ Completed Features

Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
Print CLI summary reports (with verbosity control)
Implement config file support for channel priority
Organize folder/file structure for easy expansion

🎯 Current Implementation

The tool has been successfully implemented with the following components:

Core Modules:

cli/main.py - Main CLI application with argument parsing
cli/matching.py - Song matching and deduplication logic
cli/report.py - Report generation and output formatting
cli/utils.py - Utility functions for file operations and data processing

Configuration:

config/config.json - Configurable settings for channel priorities, matching rules, and output options

Features Implemented:

Multi-format support (MP3, CDG, MP4)
CDG/MP3 Pairing Logic: Files with same base filename treated as single karaoke song units
Channel priority system for MP4 files (based on folder names in path)
Fuzzy matching support with configurable threshold
Multi-artist parsing with various delimiters
Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
Channel priority analysis and manual review identification
Non-destructive operation (skip lists only)
Verbose and dry-run modes
Detailed duplicate analysis
Skip list generation with metadata
Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

File Type Priority System (highest to lowest):

MP4 files (with channel priority sorting)
CDG/MP3 pairs (treated as single units)
Standalone MP3 files
Standalone CDG files

Performance Results:

Successfully processed 37,015 songs
Identified 12,424 duplicates (33.6% duplicate rate)
Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
Optimized for large datasets with progress indicators
Enhanced Analysis: Generated 7 detailed reports with actionable insights
Bug Fix: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)
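The figures above are consistent with each other, as the short check below shows. The `dedupe_by_path` helper is only an assumed illustration of how duplicate skip list entries could be collapsed; the document does not describe how the actual fix was implemented.

```python
def dedupe_by_path(entries):
    """Keep the first entry for each path, preserving the original order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["path"] not in seen:
            seen.add(entry["path"])
            unique.append(entry)
    return unique


# Figures taken from the results listed above.
total_songs = 37_015
duplicates_found = 12_424
removed_entries = 1_426

duplicate_rate = round(duplicates_found / total_songs * 100, 1)  # 33.6
unique_after_fix = duplicates_found - removed_entries            # 10998
```

The 12,424 duplicates minus the 1,426 removed entries gives the 10,998 unique files reported for the skip list.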

📋 Next Steps Checklist

✅ Completed

Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
Print CLI summary reports (with verbosity control)
Implement config file support for channel priority
Organize folder/file structure for easy expansion
Implement CDG/MP3 pairing logic for accurate duplicate detection
Generate comprehensive skip list with metadata
Optimize performance for large datasets (37,000+ songs)
Add progress indicators and error handling
Generate detailed analysis reports (--save-reports functionality)
Create web UI for manual review of ambiguous cases
Add test tool for validation and debugging
Create startup script for web UI with dependency checking
Add comprehensive .gitignore file
Update documentation with required data file information

🎯 Next Priority Items

Analyze MP4 files without channel priorities to suggest new folder names
Add support for additional file formats if needed
Implement batch processing capabilities
Create integration scripts for karaoke software
Add unit tests for core functionality
Implement audio fingerprinting for better duplicate detection
