Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

1. Project Summary

  • Goal: Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
  • Primary User: Admin (self, collection owner)
  • Initial Interface: Command Line (CLI) with print/logging and JSON output
  • Future Expansion: Optional web UI for filtering, review, and playback

2. Architectural Priorities

2.1 Code Organization Principles

TOP PRIORITY: The codebase must be built with the following architectural principles from the beginning:

  • True Separation of Concerns:

    • Many small files with focused responsibilities
    • Each module/class should have a single, well-defined purpose
    • Avoid monolithic files with mixed responsibilities
  • Constants and Enums:

    • Create constants, enums, and configuration objects to avoid duplicate code or values
    • Centralize magic numbers, strings, and configuration values
    • Use enums for type safety and clarity
  • Readability and Maintainability:

    • Code should be self-documenting with clear naming conventions
    • Easy to understand, extend, and refactor
    • Consistent patterns throughout the codebase
  • Extensibility:

    • Design for future growth and feature additions
    • Modular architecture that allows easy integration of new components
    • Clear interfaces between modules
  • Refactorability:

    • Code structure should make future refactoring straightforward
    • Minimize coupling between components
    • Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
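
As one possible illustration of the constants-and-enums principle, the sketch below centralizes the supported file extensions and the matching settings in one place. Every name here (`FileType`, `MatchingConfig`) and the default threshold of 0.85 are hypothetical and are not taken from the actual codebase.

```python
from dataclasses import dataclass
from enum import Enum


class FileType(Enum):
    """Supported karaoke file types, keyed by file extension."""

    MP3 = ".mp3"
    CDG = ".cdg"
    MP4 = ".mp4"

    @classmethod
    def from_path(cls, path: str) -> "FileType | None":
        """Return the matching FileType, or None for an unknown extension."""
        lowered = path.lower()
        for file_type in cls:
            if lowered.endswith(file_type.value):
                return file_type
        return None


@dataclass(frozen=True)
class MatchingConfig:
    """Hypothetical settings object; the defaults are illustrative only."""

    fuzzy_enabled: bool = False
    fuzzy_threshold: float = 0.85
```

Keeping the extensions in a single enum means the matching and reporting modules never compare against raw strings such as ".mp4" directly.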


3. Data Handling & Matching Logic

3.1 Input

  • Reads from /data/allSongs.json
  • Each song includes at least:
    • artist, title, path (plus ID3 tag info, and channel for MP4s)
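
A minimal loading sketch, assuming allSongs.json holds a JSON array of objects. The `Song` class, the `parse_songs` and `load_songs` helpers, and the relative default path are hypothetical; any fields beyond artist, title, and path are kept untouched in `extra`.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

KNOWN_FIELDS = {"artist", "title", "path"}


@dataclass
class Song:
    artist: str
    title: str
    path: str
    # ID3 tag info, channel, and anything else present in the record.
    extra: dict[str, Any] = field(default_factory=dict)


def parse_songs(records: list[dict[str, Any]]) -> list[Song]:
    """Convert raw JSON records into Song objects."""
    songs = []
    for record in records:
        songs.append(
            Song(
                artist=record.get("artist", ""),
                title=record.get("title", ""),
                path=record["path"],
                extra={k: v for k, v in record.items() if k not in KNOWN_FIELDS},
            )
        )
    return songs


def load_songs(json_path: str = "data/allSongs.json") -> list[Song]:
    """Read the song list from disk (assumed to be a JSON array)."""
    with Path(json_path).open(encoding="utf-8") as handle:
        return parse_songs(json.load(handle))
```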

3.2 Song Matching

  • Primary keys: artist + title
    • Fuzzy matching is configurable: it can be enabled or disabled, and the match threshold is adjustable
    • Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
  • File type detection: Use file extension from path (.mp3, .cdg, .mp4)
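
The matching rules above could be sketched as follows, using only the standard library. This is an assumption-laden illustration: the delimiter set, the choice to key on the first listed artist, the ASCII-only normalization, and the 0.85 threshold are all hypothetical, and the fuzzy matching library is still an open question in this PRD (`difflib` is used here only because it ships with Python).

```python
import re
from difflib import SequenceMatcher

# Hypothetical delimiter set: commas plus common "featuring" spellings.
ARTIST_DELIMITERS = re.compile(
    r"\s*(?:,|\bfeaturing\b|\bfeat\b\.?|\bft\b\.?)\s*", re.IGNORECASE
)


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace (ASCII-only simplification)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def split_artists(artist: str) -> list[str]:
    """Split a multi-artist string into normalized artist names."""
    return [normalize(part) for part in ARTIST_DELIMITERS.split(artist) if part.strip()]


def match_key(artist: str, title: str) -> tuple[str, str]:
    """Exact-match key: first listed artist plus normalized title (an assumption)."""
    artists = split_artists(artist)
    return (artists[0] if artists else "", normalize(title))


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the similarity ratio of the normalized strings meets the threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold
```

Songs with equal `match_key` values would be grouped as exact duplicates; `is_fuzzy_match` would only be consulted when fuzzy matching is enabled in the config.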

3.3 Channel Priority (for MP4s)

  • Configurable folder names:
    • Set in /config/config.json as an array of folder names
    • Order = priority (first = highest priority)
    • Tool searches for these folder names within the song's path property
    • Songs without matching folder names are marked for manual review
  • File type priority: MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
  • CDG/MP3 pairing: CDG and MP3 files with the same base filename are treated as a single karaoke song unit
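
A rough sketch of the channel lookup and the CDG/MP3 pairing described above. The function names, the case-insensitive substring search, and the rule that a pair must share the same directory as well as the same base filename are assumptions made for illustration, not confirmed behavior of the tool.

```python
from pathlib import PurePosixPath
from typing import Optional

# Lower number = higher priority, mirroring the order stated above.
FILE_TYPE_PRIORITY = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}


def channel_rank(path: str, channel_folders: list[str]) -> Optional[int]:
    """Index of the first configured folder name found in the path.

    0 is the highest priority; None means no folder matched, so the
    song would be marked for manual review.
    """
    lowered = path.lower()
    for rank, folder in enumerate(channel_folders):
        if folder.lower() in lowered:
            return rank
    return None


def classify_units(paths: list[str]) -> dict[str, str]:
    """Label each path as 'mp4', 'cdg_mp3_pair', 'mp3', or 'cdg'."""
    suffixes_by_stem: dict[str, set[str]] = {}
    for path in paths:
        p = PurePosixPath(path)
        stem = str(p.with_suffix("")).lower()
        suffixes_by_stem.setdefault(stem, set()).add(p.suffix.lower())

    kinds: dict[str, str] = {}
    for path in paths:
        p = PurePosixPath(path)
        suffix = p.suffix.lower()
        siblings = suffixes_by_stem[str(p.with_suffix("")).lower()]
        if suffix in {".mp3", ".cdg"} and {".mp3", ".cdg"} <= siblings:
            kinds[path] = "cdg_mp3_pair"  # both halves of the pair are present
        else:
            kinds[path] = suffix.lstrip(".")
    return kinds
```

Sorting candidates by `(FILE_TYPE_PRIORITY[kind], channel_rank)` would then put the preferred version of each song first.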

4. Output & Reporting

4.1 Skip List

  • Format: JSON (/data/skipSongs.json)
    • List of file paths to skip in future imports
    • Optionally: “reason” field (e.g., {"path": "...", "reason": "duplicate"})
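
A minimal sketch of writing the skip list, assuming each entry is a dict with a "path" key and an optional "reason" key as in the example above. The helper names and the relative default output path are hypothetical. The de-duplication step guards against the same path being listed more than once.

```python
import json


def build_skip_list(entries: list[dict[str, str]]) -> list[dict[str, str]]:
    """Drop repeated paths, keeping the first entry seen for each path."""
    seen: set[str] = set()
    unique = []
    for entry in entries:
        if entry["path"] not in seen:
            seen.add(entry["path"])
            unique.append(entry)
    return unique


def write_skip_list(entries: list[dict[str, str]], out_path: str = "data/skipSongs.json") -> None:
    """Write the de-duplicated skip list as pretty-printed JSON."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(build_skip_list(entries), handle, indent=2, ensure_ascii=False)
```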

4.2 CLI Reporting

  • Summary: Total songs, duplicates found, types breakdown, etc.
  • Verbose per-song output: Only for matches/duplicates (not every song)
  • Verbosity: configurable via CLI flag or config
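
The reporting behavior could look roughly like the sketch below. The `-v/--verbose` flag name and both helper functions are hypothetical; the PRD only states that verbosity is controlled by a CLI flag or config, not how the flag is spelled.

```python
import argparse
from collections import Counter


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Karaoke library cleanup (sketch)")
    # Hypothetical flag name; not a confirmed option of the actual tool.
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="print per-song details for matches/duplicates only",
    )
    return parser


def summarize(paths: list[str], duplicate_paths: list[str]) -> str:
    """Build the summary text: totals plus a breakdown by file extension."""
    by_type = Counter(p.rsplit(".", 1)[-1].lower() for p in paths if "." in p)
    lines = [
        f"Total songs: {len(paths)}",
        f"Duplicates found: {len(duplicate_paths)}",
    ]
    lines += [f"  {ext}: {count}" for ext, count in sorted(by_type.items())]
    return "\n".join(lines)
```

With `--verbose` set, the tool would additionally print one line per duplicate; songs without matches are never listed individually.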

4.3 Manual Review (Future Web UI)

  • Table/grid view for ambiguous/complex cases
  • Ability to preview media before making a selection

5. Features & Edge Cases

  • Batch Processing:
    • E.g., "Auto-skip all but highest-priority channel for each song"
    • Manual review as CLI flag (future: always in web UI)
  • Edge Cases:
    • Songs that exist in multiple versions (more than two formats)
    • Support for keeping multiple versions per song (configurable/manual)
  • Non-destructive: Never deletes or moves files, only generates skip list and reports
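
The "auto-skip all but highest-priority" batch rule could be sketched as below. The `auto_skip` function and its `key` and `rank` callables are hypothetical; the caller supplies how songs are grouped and ordered. In keeping with the non-destructive requirement, the function only returns skip-list entries and never touches files.

```python
from collections import defaultdict
from typing import Callable, Hashable


def auto_skip(
    songs: list[dict],
    key: Callable[[dict], Hashable],
    rank: Callable[[dict], tuple],
) -> list[dict]:
    """Keep the best-ranked song in each group; return skip entries for the rest.

    `key` groups songs that count as the same song; `rank` orders a group
    so that the lowest value is the version to keep.
    """
    groups: dict[Hashable, list[dict]] = defaultdict(list)
    for song in songs:
        groups[key(song)].append(song)

    skip = []
    for members in groups.values():
        _best, *rest = sorted(members, key=rank)
        skip.extend({"path": s["path"], "reason": "duplicate"} for s in rest)
    return skip
```

Supporting "keep multiple versions per song" would amount to keeping the first N members of each sorted group instead of only the first.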

6. Tech Stack & Organization

  • CLI Language: Python

  • Config: JSON (channel priorities, settings)

  • Suggested Folder Structure:
    • /data/
      • allSongs.json
      • skipSongs.json
    • /config/
      • config.json
    • /cli/
      • main.py
      • matching.py
      • report.py
      • utils.py

  • (expandable for web UI later)


7. Future Expansion: Web UI

  • Table/grid review, bulk actions
  • Embedded player for media preview
  • Config editor for channel priorities

8. Open Questions (for future refinement)

  • Fuzzy matching library/thresholds?
  • Best parsing rules for multi-artist/feat. strings?
  • Any alternate export formats needed?
  • Temporary/partial skip support for "under review" songs?

9. Implementation Status

Completed Features

  • Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
  • Print CLI summary reports (with verbosity control)
  • Implement config file support for channel priority
  • Organize folder/file structure for easy expansion

🎯 Current Implementation

The tool has been successfully implemented with the following components:

Core Modules:

  • cli/main.py - Main CLI application with argument parsing
  • cli/matching.py - Song matching and deduplication logic
  • cli/report.py - Report generation and output formatting
  • cli/utils.py - Utility functions for file operations and data processing

Configuration:

  • config/config.json - Configurable settings for channel priorities, matching rules, and output options

Features Implemented:

  • Multi-format support (MP3, CDG, MP4)
  • CDG/MP3 Pairing Logic: Files with same base filename treated as single karaoke song units
  • Channel priority system for MP4 files (based on folder names in path)
  • Fuzzy matching support with configurable threshold
  • Multi-artist parsing with various delimiters
  • Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
  • Channel priority analysis and manual review identification
  • Non-destructive operation (skip lists only)
  • Verbose and dry-run modes
  • Detailed duplicate analysis
  • Skip list generation with metadata
  • Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

File Type Priority System:

  1. MP4 files (with channel priority sorting)
  2. CDG/MP3 pairs (treated as single units)
  3. Standalone MP3 files
  4. Standalone CDG files

Performance Results:

  • Successfully processed 37,015 songs
  • Identified 12,424 duplicates (33.6% duplicate rate)
  • Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
  • Optimized for large datasets with progress indicators
  • Enhanced Analysis: Generated 7 detailed reports with actionable insights
  • Bug Fix: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)

📋 Next Steps Checklist

Completed

  • Implement CDG/MP3 pairing logic for accurate duplicate detection
  • Generate comprehensive skip list with metadata
  • Optimize performance for large datasets (37,000+ songs)
  • Add progress indicators and error handling

🎯 Next Priority Items

  • Generate detailed analysis reports (--save-reports functionality)
  • Analyze MP4 files without channel priorities to suggest new folder names
  • Create web UI for manual review of ambiguous cases
  • Add support for additional file formats if needed
  • Implement batch processing capabilities
  • Create integration scripts for karaoke software