Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

1. Project Summary

  • Goal: Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
  • Primary User: Admin (self, collection owner)
  • Initial Interface: Command Line (CLI) with print/logging and JSON output
  • Future Expansion: Optional web UI for filtering, review, and playback

2. Architectural Priorities

2.1 Code Organization Principles

TOP PRIORITY: The codebase must be built with the following architectural principles from the beginning:

  • True Separation of Concerns:

    • Many small files with focused responsibilities
    • Each module/class should have a single, well-defined purpose
    • Avoid monolithic files with mixed responsibilities
  • Constants and Enums:

    • Create constants, enums, and configuration objects to avoid duplicate code or values
    • Centralize magic numbers, strings, and configuration values
    • Use enums for type safety and clarity
  • Readability and Maintainability:

    • Code should be self-documenting with clear naming conventions
    • Easy to understand, extend, and refactor
    • Consistent patterns throughout the codebase
  • Extensibility:

    • Design for future growth and feature additions
    • Modular architecture that allows easy integration of new components
    • Clear interfaces between modules
  • Refactorability:

    • Code structure should make future refactoring straightforward
    • Minimize coupling between components
    • Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
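
As an illustration of the "Constants and Enums" principle, here is a minimal sketch. The names FileType, DEFAULT_FUZZY_THRESHOLD, DEFAULT_CONFIG_PATH, and file_type_from_extension are hypothetical and the values are placeholders; none of them are taken from the actual codebase.

```python
from enum import Enum
from typing import Optional


class FileType(Enum):
    """Karaoke file formats named in this PRD (illustrative enum)."""

    MP4 = ".mp4"
    MP3 = ".mp3"
    CDG = ".cdg"


# Centralized values instead of literals scattered through the code.
# Both values are placeholders, not the project's real settings.
DEFAULT_FUZZY_THRESHOLD = 0.85
DEFAULT_CONFIG_PATH = "config/config.json"


def file_type_from_extension(extension: str) -> Optional[FileType]:
    """Map an extension such as '.MP4' to a FileType, or None if unknown."""
    try:
        return FileType(extension.lower())
    except ValueError:
        return None
```

Call sites can then compare against FileType.MP4 rather than repeating the string ".mp4" in several modules.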

2.2 Documentation Requirements

CRITICAL REQUIREMENT: All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:

  • PRD.md Updates: Any changes to project requirements, architecture, or functionality must be reflected in this document
  • README.md Updates: User-facing features, installation instructions, or usage changes must be documented
  • Code Comments: Significant logic changes should include inline documentation
  • API Documentation: New endpoints, functions, or interfaces must be documented

Documentation Update Checklist:

  • Update PRD.md with any architectural or requirement changes
  • Update README.md with new features, installation steps, or usage instructions
  • Add inline comments for complex logic or business rules
  • Update any configuration examples or file structure documentation
  • Review and update implementation status sections

This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.

2.3 Code Quality & Development Standards

MANDATORY STANDARDS: The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:

Naming Conventions

  • Files: Use descriptive, lowercase names with underscores (song_matcher.py, priority_manager.py)
  • Classes: PascalCase (SongMatcher, PreferencesManager)
  • Functions/Methods: snake_case (process_songs, get_priority_order)
  • Constants: UPPER_SNAKE_CASE (MAX_FILE_SIZE, DEFAULT_CHANNEL_PRIORITY)
  • Variables: snake_case with descriptive names (song_collection, duplicate_count)

Code Structure Standards

  • Function Length: Maximum 50 lines per function (aim for 20-30 lines)
  • Class Length: Maximum 300 lines per class (aim for 100-200 lines)
  • File Length: Maximum 500 lines per file (aim for 200-400 lines)
  • Indentation: 4 spaces (no tabs)
  • Line Length: Maximum 120 characters
  • Import Organization: Group imports: standard library, third-party, local (alphabetical within groups)

Error Handling & Logging

  • Exception Handling: Always use specific exception types, never bare except:
  • Logging: Use Python's logging module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
  • User Feedback: Provide clear, actionable error messages
  • Graceful Degradation: Handle missing files/configs gracefully with sensible defaults
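
A minimal sketch of these rules applied to config loading, assuming a hypothetical load_config helper and hypothetical default keys (channel_priorities, fuzzy_matching); the real config schema may differ.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Hypothetical defaults; the real keys live in config/config.json.
DEFAULT_CONFIG: Dict[str, Any] = {"channel_priorities": [], "fuzzy_matching": False}


def load_config(config_path: Path) -> Dict[str, Any]:
    """Load a JSON config, falling back to defaults if it is missing or invalid."""
    try:
        with config_path.open(encoding="utf-8") as handle:
            loaded = json.load(handle)
    except FileNotFoundError:
        logger.warning("Config %s not found; using defaults.", config_path)
        return dict(DEFAULT_CONFIG)
    except json.JSONDecodeError as error:
        logger.error("Config %s is not valid JSON (%s); using defaults.", config_path, error)
        return dict(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        logger.error("Config %s must contain a JSON object; using defaults.", config_path)
        return dict(DEFAULT_CONFIG)
    # Values from the file override the defaults key by key.
    return {**DEFAULT_CONFIG, **loaded}
```

Each failure mode is caught by a specific exception type, logged at a level matching its severity, and answered with a usable default instead of a crash.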

Type Hints & Documentation

  • Type Hints: Use Python type hints for all function parameters and return values
  • Docstrings: Include docstrings for all public functions, classes, and modules
  • Docstring Format: Use Google-style docstrings with parameter descriptions
  • Complex Logic: Add inline comments explaining business logic and algorithms
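
For reference, a short function written to these conventions might look like the sketch below. The function get_base_name is a hypothetical example and is not taken from cli/utils.py.

```python
from pathlib import PurePosixPath


def get_base_name(file_path: str) -> str:
    """Return the lowercase file name without directory or extension.

    Args:
        file_path: Path to a song file, as stored in the song's ``path`` field.

    Returns:
        The lowercase file stem, for example ``"artist - title"``.
    """
    return PurePosixPath(file_path).stem.lower()
```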

Testing Standards

  • Unit Tests: Write unit tests for all business logic functions
  • Test Coverage: Aim for 80%+ code coverage
  • Test Organization: Mirror the source code structure in test files
  • Test Data: Use fixtures and factories for test data; never hardcode test values
  • Integration Tests: Test complete workflows and API endpoints

Configuration Management

  • Environment Variables: Use environment variables for sensitive data (API keys, paths)
  • Config Validation: Validate configuration on startup with clear error messages
  • Default Values: Provide sensible defaults for all configuration options
  • Config Documentation: Document all configuration options with examples
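
Startup validation could be sketched as follows. The keys channel_priorities and fuzzy_threshold, and the function validate_config, are assumptions made for illustration; the real config/config.json may use different names.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config is valid."""
    problems: List[str] = []

    priorities = config.get("channel_priorities", [])
    if not isinstance(priorities, list) or not all(isinstance(name, str) for name in priorities):
        problems.append("'channel_priorities' must be a list of folder-name strings.")

    threshold = config.get("fuzzy_threshold", 0.85)
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        problems.append("'fuzzy_threshold' must be a number between 0.0 and 1.0.")

    return problems
```

Collecting every problem before reporting lets the user fix the whole config in one pass instead of one error per run.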

Performance & Scalability

  • Memory Efficiency: Process large datasets in chunks, avoid loading everything into memory
  • Progress Indicators: Show progress for long-running operations
  • Caching: Implement appropriate caching for expensive operations
  • Async Operations: Use async/await for I/O operations where beneficial
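
One way to combine chunked processing with a progress indicator is sketched below. The helper iter_chunks is hypothetical, and chunking only saves memory when the input is itself produced lazily, which a single json.load() call is not.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    processed = 0
    for chunk in iter_chunks(range(10), chunk_size=4):
        processed += len(chunk)
        print(f"Processed {processed} items so far")
```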

Security Best Practices

  • Input Validation: Validate and sanitize all user inputs
  • File Operations: Use pathlib for safe file path handling
  • JSON Safety: Use json.loads() with proper error handling
  • No Hardcoded Secrets: Never commit API keys, passwords, or sensitive data

Version Control Standards

  • Commit Messages: Use conventional commit format (feat:, fix:, docs:, refactor:)
  • Branch Naming: Use descriptive branch names (feature/priority-management, fix/duplicate-detection)
  • Pull Requests: Require code review for all changes
  • Git Hooks: Use pre-commit hooks for linting and formatting

Dependency Management

  • Requirements: Keep requirements.txt updated with exact versions
  • Virtual Environments: Always use virtual environments for development
  • Dependency Updates: Regularly update dependencies and test compatibility
  • Minimal Dependencies: Only include necessary dependencies, avoid bloat

Code Review Checklist

  • Code follows naming conventions
  • Functions are appropriately sized and focused
  • Error handling is comprehensive
  • Type hints and docstrings are present
  • Tests are included for new functionality
  • Configuration is properly validated
  • No hardcoded values or secrets
  • Performance considerations addressed
  • Documentation is updated

These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.


3. Data Handling & Matching Logic

3.1 Input

  • Reads from /data/allSongs.json
  • Each song includes at least:
    • artist, title, path (plus ID3 tag info, and channel for MP4s)
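
A minimal sketch of how one entry might be represented in code, assuming the field names listed above appear as top-level keys. The SongRecord class and song_from_dict function are hypothetical, and ID3 tag fields are omitted because their layout is not specified here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SongRecord:
    """One entry from allSongs.json (illustrative subset of fields)."""

    artist: str
    title: str
    path: str
    channel: Optional[str] = None  # assumed to be present only for MP4s


def song_from_dict(raw: Dict[str, Any]) -> SongRecord:
    """Build a SongRecord, rejecting entries that lack a required field."""
    missing = [key for key in ("artist", "title", "path") if key not in raw]
    if missing:
        raise ValueError(f"Song entry is missing required fields: {missing}")
    return SongRecord(
        artist=raw["artist"],
        title=raw["title"],
        path=raw["path"],
        channel=raw.get("channel"),
    )
```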

3.2 Song Matching

  • Primary keys: artist + title
    • Fuzzy matching is configurable (can be enabled or disabled, with an adjustable threshold)
    • Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
  • File type detection: Use file extension from path (.mp3, .cdg, .mp4)
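
The matching rules above could be sketched as follows. This is not the logic in cli/matching.py: it uses the standard library's difflib as a stand-in, since the fuzzy matching library is still an open question, and the delimiter set, the normalization rules, and the 0.85 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumed delimiter set: commas plus "feat.", "ft.", and "featuring".
ARTIST_DELIMITERS = re.compile(r"\s*(?:,|\bfeaturing\b|\bfeat\b\.?|\bft\b\.?)\s*", re.IGNORECASE)


def split_artists(artist_field: str) -> List[str]:
    """Split a multi-artist string into individual artist names."""
    parts = ARTIST_DELIMITERS.split(artist_field)
    return [part.strip() for part in parts if part.strip()]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(cleaned.split())


def match_key(artist: str, title: str) -> str:
    """Build the exact-match key from artist + title."""
    return f"{normalize(artist)}|{normalize(title)}"


def is_fuzzy_match(left: str, right: str, threshold: float = 0.85) -> bool:
    """Return True when the normalized strings are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(left), normalize(right)).ratio() >= threshold
```

Songs with the same match_key are exact duplicates. When fuzzy matching is enabled, is_fuzzy_match can compare keys that differ only slightly, at the cost of pairwise comparisons.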

3.3 Channel Priority (for MP4s)

  • Configurable folder names:
    • Set in /config/config.json as an array of folder names
    • Order = priority (first = highest priority)
    • Tool searches for these folder names within the song's path property
    • Songs without matching folder names are marked for manual review
  • File type priority: MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
  • CDG/MP3 pairing: CDG and MP3 files with the same base filename are treated as a single karaoke song unit
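
A rough sketch of the channel lookup and the CDG/MP3 pairing described above. The function names are hypothetical. Two behaviours are assumptions, not documented rules: the folder-name search is case-insensitive, and paired files must share a directory.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Set


def channel_rank(song_path: str, channel_priorities: List[str]) -> Optional[int]:
    """Return the index of the first configured folder name found in the path.

    A lower index means higher priority. None means no configured folder
    name matched, so the song would go to manual review.
    """
    lowered = song_path.lower()
    for rank, folder_name in enumerate(channel_priorities):
        if folder_name.lower() in lowered:
            return rank
    return None


def find_cdg_mp3_pairs(paths: List[str]) -> Set[str]:
    """Return lowercase base paths (no extension) that have both .cdg and .mp3 files."""
    extensions_by_base: Dict[str, Set[str]] = {}
    for path in paths:
        pure_path = PurePosixPath(path)
        base = str(pure_path.with_suffix("")).lower()
        extensions_by_base.setdefault(base, set()).add(pure_path.suffix.lower())
    return {base for base, found in extensions_by_base.items() if {".cdg", ".mp3"} <= found}
```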

4. Output & Reporting

4.1 Skip List

  • Format: JSON (/data/skipSongs.json)
    • List of file paths to skip in future imports
    • Optionally: “reason” field (e.g., {"path": "...", "reason": "duplicate"})
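
Writing the skip list in this shape could look like the sketch below. The helper names are hypothetical; dropping repeated paths is included because duplicate skip-list entries are listed as a past bug in the Implementation Status section.

```python
import json
from typing import Dict, Iterable, List, Tuple


def build_skip_entries(paths_with_reasons: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build skip-list entries, keeping only the first entry for each path."""
    seen = set()
    entries: List[Dict[str, str]] = []
    for path, reason in paths_with_reasons:
        if path in seen:
            continue
        seen.add(path)
        entries.append({"path": path, "reason": reason})
    return entries


def to_skip_list_json(entries: List[Dict[str, str]]) -> str:
    """Serialize entries as indented JSON, preserving non-ASCII titles."""
    return json.dumps(entries, indent=2, ensure_ascii=False)
```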

4.2 CLI Reporting

  • Summary: Total songs, duplicates found, types breakdown, etc.
  • Verbose per-song output: Only for matches/duplicates (not every song)
  • Verbosity: Configurable via CLI flag or config
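
A verbosity flag could be wired to log levels roughly as follows. The -v/--verbose flag name and its mapping to levels are assumptions for illustration; the actual flags in cli/main.py are not specified in this document.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with a repeatable verbosity flag (hypothetical flag name)."""
    parser = argparse.ArgumentParser(description="Karaoke library cleanup (sketch)")
    parser.add_argument(
        "-v", "--verbose", action="count", default=0,
        help="Increase output detail; repeat for more (-vv)",
    )
    return parser


def log_level_for(verbosity: int) -> int:
    """Map the number of -v flags to a logging level."""
    if verbosity >= 2:
        return logging.DEBUG
    if verbosity == 1:
        return logging.INFO
    return logging.WARNING


if __name__ == "__main__":
    args = build_parser().parse_args()
    logging.basicConfig(level=log_level_for(args.verbose))
```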

4.3 Manual Review (Web UI)

  • Interactive Web Interface: Table/grid view for ambiguous/complex cases
  • Media Preview: Ability to preview media before making a selection
  • Bulk Actions: Select multiple items for batch operations
  • Real-time Filtering: Search and filter capabilities
  • Responsive Design: Works on desktop and mobile devices
  • Easy Startup: Simple script (start_web_ui.py) with dependency checking

5. Features & Edge Cases

  • Batch Processing:
    • E.g., "Auto-skip all but highest-priority channel for each song"
    • Manual review as CLI flag (future: always in web UI)
  • Edge Cases:
    • Songs with multiple versions (more than two formats)
    • Support for keeping multiple versions per song (configurable/manual)
  • Non-destructive: Never deletes or moves files, only generates skip list and reports

6. Tech Stack & Organization

  • CLI Language: Python
  • Config: JSON (channel priorities, settings)
  • Current Folder Structure:
KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   ├── skipSongs.json         # Output: Generated skip list
│   └── reports/               # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── web/                       # Web UI for manual review
│   ├── app.py                 # Flask web application
│   └── templates/
│       └── index.html         # Web interface template
├── start_web_ui.py            # Web UI startup script
├── test_tool.py               # Validation and testing script
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── PRD.md                     # Product Requirements Document
└── README.md                  # Project documentation

7. Web UI Implementation

7.1 Current Web UI Features

  • Interactive Table View: Sortable, filterable grid of duplicate songs
  • Bulk Selection: Select multiple items for batch operations
  • Search & Filter: Real-time search across artists, titles, and paths
  • Responsive Design: Mobile-friendly interface
  • Easy Startup: Automated dependency checking and browser launch

7.2 Web UI Architecture

  • Flask Backend: Lightweight web server (web/app.py)
  • HTML Template: Modern, responsive interface (web/templates/index.html)
  • Startup Script: Dependency management and server startup (start_web_ui.py)

7.3 Future Web UI Enhancements

  • Embedded media player for audio/video preview
  • Real-time configuration editing
  • Advanced filtering and sorting options
  • Export capabilities for manual selections

8. Open Questions (for future refinement)

  • Fuzzy matching library/thresholds?
  • Best parsing rules for multi-artist/feat. strings?
  • Any alternate export formats needed?
  • Temporary/partial skip support for "under review" songs?

9. Implementation Status

Completed Features

  • Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
  • Print CLI summary reports (with verbosity control)
  • Implement config file support for channel priority
  • Organize folder/file structure for easy expansion

🎯 Current Implementation

The tool has been successfully implemented with the following components:

Core Modules:

  • cli/main.py - Main CLI application with argument parsing
  • cli/matching.py - Song matching and deduplication logic
  • cli/report.py - Report generation and output formatting
  • cli/utils.py - Utility functions for file operations and data processing

Configuration:

  • config/config.json - Configurable settings for channel priorities, matching rules, and output options

Features Implemented:

  • Multi-format support (MP3, CDG, MP4)
  • CDG/MP3 Pairing Logic: Files with same base filename treated as single karaoke song units
  • Channel priority system for MP4 files (based on folder names in path)
  • Fuzzy matching support with configurable threshold
  • Multi-artist parsing with various delimiters
  • Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
  • Channel priority analysis and manual review identification
  • Non-destructive operation (skip lists only)
  • Verbose and dry-run modes
  • Detailed duplicate analysis
  • Skip list generation with metadata
  • Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

File Type Priority System:

  1. MP4 files (with channel priority sorting)
  2. CDG/MP3 pairs (treated as single units)
  3. Standalone MP3 files
  4. Standalone CDG files
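
The ordering above can be expressed as a sort key, as in this sketch. The candidate dictionaries, the kind labels, and the function names are hypothetical. MP4s without a channel match are simply sorted after ranked MP4s here, whereas the tool marks them for manual review.

```python
from typing import Any, Dict, List, Tuple

# Lower number = higher priority, following the list above.
FILE_TYPE_PRIORITY = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}
UNRANKED = float("inf")  # sorts after every configured channel


def sort_key(candidate: Dict[str, Any]) -> Tuple[int, float]:
    """Order by file type first, then by channel rank within the same type."""
    rank = candidate.get("channel_rank")
    return (FILE_TYPE_PRIORITY[candidate["kind"]], UNRANKED if rank is None else rank)


def split_keep_and_skip(candidates: List[Dict[str, Any]]) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return the best candidate and the remaining ones to add to the skip list."""
    ordered = sorted(candidates, key=sort_key)
    return ordered[0], ordered[1:]
```

For a group of duplicates of one song, the first element is the version to keep and the rest become skip-list entries.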

Performance Results:

  • Successfully processed 37,015 songs
  • Identified 12,424 duplicates (33.6% duplicate rate)
  • Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
  • Optimized for large datasets with progress indicators
  • Enhanced Analysis: Generated 7 detailed reports with actionable insights
  • Bug Fix: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)
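
The figures above appear to be mutually consistent. The short check below recomputes them, assuming the 10,998 unique files are the 12,424 identified duplicates minus the 1,426 repeated skip-list entries.

```python
TOTAL_SONGS = 37_015
DUPLICATES_FOUND = 12_424
REMOVED_SKIP_ENTRIES = 1_426

# Share of the library flagged as duplicates.
duplicate_rate = DUPLICATES_FOUND / TOTAL_SONGS * 100

# Skip-list size once repeated entries are removed (assumed relationship).
unique_skip_entries = DUPLICATES_FOUND - REMOVED_SKIP_ENTRIES

print(f"Duplicate rate: {duplicate_rate:.1f}%")            # Duplicate rate: 33.6%
print(f"Unique skip entries: {unique_skip_entries:,}")     # Unique skip entries: 10,998
```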

📋 Next Steps Checklist

Completed

  • Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
  • Print CLI summary reports (with verbosity control)
  • Implement config file support for channel priority
  • Organize folder/file structure for easy expansion
  • Implement CDG/MP3 pairing logic for accurate duplicate detection
  • Generate comprehensive skip list with metadata
  • Optimize performance for large datasets (37,000+ songs)
  • Add progress indicators and error handling
  • Generate detailed analysis reports (--save-reports functionality)
  • Create web UI for manual review of ambiguous cases
  • Add test tool for validation and debugging
  • Create startup script for web UI with dependency checking
  • Add comprehensive .gitignore file
  • Update documentation with required data file information

🎯 Next Priority Items

  • Analyze MP4 files without channel priorities to suggest new folder names
  • Add support for additional file formats if needed
  • Implement batch processing capabilities
  • Create integration scripts for karaoke software
  • Add unit tests for core functionality
  • Implement audio fingerprinting for better duplicate detection