mbrucedogs/KaraokeMerge

mbrucedogs 3969d75f0f Signed-off-by: mbrucedogs <mbrucedogs@gmail.com>

2025-07-26 17:13:49 -05:00

Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

1. Project Summary

Goal: Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
Primary User: Admin (self, collection owner)
Initial Interface: Command Line (CLI) with print/logging and JSON output
Future Expansion: Optional web UI for filtering, review, and playback (filtering and review are implemented, see Section 7; playback is still planned)

2. Architectural Priorities

2.1 Code Organization Principles

TOP PRIORITY: The codebase must be built with the following architectural principles from the beginning:

True Separation of Concerns:
- Many small files with focused responsibilities
- Each module/class should have a single, well-defined purpose
- Avoid monolithic files with mixed responsibilities
Constants and Enums:
- Create constants, enums, and configuration objects to avoid duplicate code or values
- Centralize magic numbers, strings, and configuration values
- Use enums for type safety and clarity
Readability and Maintainability:
- Code should be self-documenting with clear naming conventions
- Easy to understand, extend, and refactor
- Consistent patterns throughout the codebase
Extensibility:
- Design for future growth and feature additions
- Modular architecture that allows easy integration of new components
- Clear interfaces between modules
Refactorability:
- Code structure should make future refactoring straightforward
- Minimize coupling between components
- Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.

2.2 Documentation Requirements

CRITICAL REQUIREMENT: All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:

PRD.md Updates: Any changes to project requirements, architecture, or functionality must be reflected in this document
README.md Updates: User-facing features, installation instructions, or usage changes must be documented
Code Comments: Significant logic changes should include inline documentation
API Documentation: New endpoints, functions, or interfaces must be documented

Documentation Update Checklist:

Update PRD.md with any architectural or requirement changes
Update README.md with new features, installation steps, or usage instructions
Add inline comments for complex logic or business rules
Update any configuration examples or file structure documentation
Review and update implementation status sections

This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.

2.3 Code Quality & Development Standards

MANDATORY STANDARDS: The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:

Naming Conventions

Files: Use descriptive, lowercase names with underscores (song_matcher.py, priority_manager.py)
Classes: PascalCase (SongMatcher, PreferencesManager)
Functions/Methods: snake_case (process_songs, get_priority_order)
Constants: UPPER_SNAKE_CASE (MAX_FILE_SIZE, DEFAULT_CHANNEL_PRIORITY)
Variables: snake_case with descriptive names (song_collection, duplicate_count)

Code Structure Standards

Function Length: Maximum 50 lines per function (aim for 20-30 lines)
Class Length: Maximum 300 lines per class (aim for 100-200 lines)
File Length: Maximum 500 lines per file (aim for 200-400 lines)
Indentation: 4 spaces (no tabs)
Line Length: Maximum 120 characters
Import Organization: Group imports: standard library, third-party, local (alphabetical within groups)

Error Handling & Logging

Exception Handling: Always use specific exception types, never bare except:
Logging: Use Python's logging module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
User Feedback: Provide clear, actionable error messages
Graceful Degradation: Handle missing files/configs gracefully with sensible defaults
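
The rules above can be illustrated with a minimal sketch of a config loader that catches only specific exceptions, logs at a suitable level, and falls back to defaults. The name `load_config` and the keys in `DEFAULT_CONFIG` are hypothetical and are not taken from the project's code.

```python
import json
import logging
from pathlib import Path
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical defaults for illustration; the real config keys may differ.
DEFAULT_CONFIG: dict[str, Any] = {"channel_priorities": [], "fuzzy_matching": False}


def load_config(path: Path) -> dict[str, Any]:
    """Load a JSON config file, falling back to defaults when it is unusable."""
    try:
        loaded = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        logger.warning("Config file %s not found; using defaults.", path)
        return dict(DEFAULT_CONFIG)
    except json.JSONDecodeError as exc:
        logger.error("Config file %s is not valid JSON (line %d): %s", path, exc.lineno, exc.msg)
        return dict(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        logger.error("Config file %s must contain a JSON object; using defaults.", path)
        return dict(DEFAULT_CONFIG)
    # Values from the file override the defaults key by key.
    return {**DEFAULT_CONFIG, **loaded}
```

Whether an invalid config should fall back to defaults or stop the run is a design choice; this sketch favors graceful degradation.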

Type Hints & Documentation

Type Hints: Use Python type hints for all function parameters and return values
Docstrings: Include docstrings for all public functions, classes, and modules
Docstring Format: Use Google-style docstrings with parameter descriptions
Complex Logic: Add inline comments explaining business logic and algorithms
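
As an illustration of the type hint and docstring rules, here is a small function with full type hints and a Google-style docstring. The function `duplicate_rate` is a hypothetical helper, not part of the existing modules.

```python
def duplicate_rate(duplicate_count: int, total_songs: int) -> float:
    """Return the share of songs that are duplicates, as a percentage.

    Args:
        duplicate_count: Number of songs flagged as duplicates.
        total_songs: Total number of songs analyzed.

    Returns:
        The duplicate rate in percent, rounded to one decimal place.

    Raises:
        ValueError: If total_songs is not positive.
    """
    if total_songs <= 0:
        raise ValueError("total_songs must be positive")
    return round(100 * duplicate_count / total_songs, 1)
```

Calling `duplicate_rate(12424, 37015)` reproduces the 33.6% duplicate rate reported in Section 9.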

Testing Standards

Unit Tests: Write unit tests for all business logic functions
Test Coverage: Aim for 80%+ code coverage
Test Organization: Mirror the source code structure in test files
Test Data: Use fixtures and factories for test data, never hardcode test values
Integration Tests: Test complete workflows and API endpoints

Configuration Management

Environment Variables: Use environment variables for sensitive data (API keys, paths)
Config Validation: Validate configuration on startup with clear error messages
Default Values: Provide sensible defaults for all configuration options
Config Documentation: Document all configuration options with examples
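
A minimal sketch of startup validation with defaults and clear error messages might look like the following. The key names (`channel_priorities`, `fuzzy_threshold`), the default threshold of 0.85, and the environment variable `KARAOKE_DATA_DIR` are assumptions made for illustration; the actual config schema may differ.

```python
import os
from typing import Any


class ConfigError(ValueError):
    """Raised when the configuration fails validation."""


def validate_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a validated config with defaults applied, or raise ConfigError."""
    priorities = raw.get("channel_priorities", [])
    if not isinstance(priorities, list) or not all(isinstance(p, str) for p in priorities):
        raise ConfigError("'channel_priorities' must be a list of folder names")

    threshold = raw.get("fuzzy_threshold", 0.85)
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        raise ConfigError("'fuzzy_threshold' must be a number between 0 and 1")

    return {
        "channel_priorities": priorities,
        "fuzzy_threshold": float(threshold),
        # Paths come from the environment when set, so they stay out of the repo.
        "data_dir": os.environ.get("KARAOKE_DATA_DIR", "data"),
    }
```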

Performance & Scalability

Memory Efficiency: Process large datasets in chunks, avoid loading everything into memory
Progress Indicators: Show progress for long-running operations
Caching: Implement appropriate caching for expensive operations
Async Operations: Use async/await for I/O operations where beneficial
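
Chunked processing with a progress indicator can be sketched as below. The helper names `iter_chunks` and `process_in_chunks` are hypothetical. Note that a JSON array read with `json.load` is still held fully in memory, so chunking here only bounds the work done per step.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next chunk_size items; an empty list ends the loop.
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def process_in_chunks(
    items: Iterable[T], total: int, chunk_size: int, handle: Callable[[list[T]], None]
) -> int:
    """Apply handle to each chunk and print simple progress; return items processed."""
    done = 0
    for chunk in iter_chunks(items, chunk_size):
        handle(chunk)
        done += len(chunk)
        print(f"Processed {done}/{total} songs")
    return done
```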

Security Best Practices

Input Validation: Validate and sanitize all user inputs
File Operations: Use pathlib for safe file path handling
JSON Safety: Use json.loads() with proper error handling
No Hardcoded Secrets: Never commit API keys, passwords, or sensitive data
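
One way to apply the pathlib rule is to resolve user-supplied paths and reject any that fall outside an expected base directory. This is a sketch under that assumption; `resolve_inside` is a hypothetical helper and the project may handle paths differently.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path against base_dir and reject paths that escape it."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"{user_path!r} is outside {base}") from None
    return candidate
```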

Version Control Standards

Commit Messages: Use conventional commit format (feat:, fix:, docs:, refactor:)
Branch Naming: Use descriptive branch names (feature/priority-management, fix/duplicate-detection)
Pull Requests: Require code review for all changes
Git Hooks: Use pre-commit hooks for linting and formatting

Dependency Management

Requirements: Keep requirements.txt updated with exact versions
Virtual Environments: Always use virtual environments for development
Dependency Updates: Regularly update dependencies and test compatibility
Minimal Dependencies: Only include necessary dependencies, avoid bloat

Code Review Checklist

Code follows naming conventions
Functions are appropriately sized and focused
Error handling is comprehensive
Type hints and docstrings are present
Tests are included for new functionality
Configuration is properly validated
No hardcoded values or secrets
Performance considerations addressed
Documentation is updated

These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.

3. Data Handling & Matching Logic

3.1 Input

Reads from /data/allSongs.json
Each song includes at least:
- artist, title, and path (plus ID3 tag info, and channel for MP4s)
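
A song record with these fields could be modeled as follows. This is a sketch that assumes the input file is a JSON array of objects with `artist`, `title`, `path`, and an optional `channel` key; the `Song` class and `parse_song` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional

REQUIRED_FIELDS = ("artist", "title", "path")


@dataclass(frozen=True)
class Song:
    artist: str
    title: str
    path: str
    channel: Optional[str] = None  # assumed to be present only for MP4s


def parse_song(record: dict[str, Any]) -> Song:
    """Build a Song from one JSON object, reporting any missing required fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"song record is missing fields: {', '.join(missing)}")
    return Song(
        artist=record["artist"],
        title=record["title"],
        path=record["path"],
        channel=record.get("channel"),
    )
```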

3.2 Song Matching

Primary keys: artist + title
- Fuzzy matching is configurable: it can be enabled or disabled and uses an adjustable similarity threshold
- Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
File type detection: Use file extension from path (.mp3, .cdg, .mp4)
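
The matching rules above can be sketched as follows. The fuzzy matching library is still an open question (Section 8), so this example uses `difflib.SequenceMatcher` from the standard library as a stand-in. The delimiter list, the default threshold of 0.85, and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumed delimiters: commas plus "feat.", "ft." and "featuring" (case-insensitive).
ARTIST_DELIMITERS = re.compile(
    r"\s*(?:,|\bfeaturing\b|\bfeat(?:\.|\b)|\bft(?:\.|\b))\s*", re.IGNORECASE
)


def split_artists(artist: str) -> list[str]:
    """Split a multi-artist string into individual artist names."""
    return [part.strip() for part in ARTIST_DELIMITERS.split(artist) if part.strip()]


def normalize(text: str) -> str:
    """Lowercase text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def is_match(left: str, right: str, fuzzy: bool = False, threshold: float = 0.85) -> bool:
    """Compare two strings exactly after normalization, or fuzzily if enabled."""
    a, b = normalize(left), normalize(right)
    if a == b:
        return True
    if not fuzzy:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Splitting on commas also breaks up band names that contain a comma, which is one reason the parsing rules remain an open question.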

3.3 Channel Priority (for MP4s)

Configurable folder names:
- Set in /config/config.json as an array of folder names
- Order = priority (first = highest priority)
- Tool searches for these folder names within the song's path property
- Songs without matching folder names are marked for manual review
File type priority: MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
CDG/MP3 pairing: CDG and MP3 files with the same base filename are treated as a single karaoke song unit
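
A minimal sketch of the channel lookup and CDG/MP3 pairing is shown below. It assumes that a channel matches when one of the path's folder components equals a configured name (case-insensitive), and that "same base filename" means the same path without its extension. The function names and labels are hypothetical.

```python
from collections import defaultdict
from pathlib import PureWindowsPath
from typing import Optional


def channel_rank(path: str, priorities: list[str]) -> Optional[int]:
    """Return the index of the highest-priority folder found in path, or None."""
    # PureWindowsPath accepts both "/" and "\\" as separators.
    folders = {part.lower() for part in PureWindowsPath(path).parent.parts}
    for rank, folder_name in enumerate(priorities):
        if folder_name.lower() in folders:
            return rank
    return None  # no configured folder matched: candidate for manual review


def classify_audio_units(paths: list[str]) -> dict[str, str]:
    """Group .cdg/.mp3 files by base path and label each group."""
    extensions: dict[str, set[str]] = defaultdict(set)
    for path in paths:
        pure = PureWindowsPath(path)
        suffix = pure.suffix.lower()
        if suffix in {".cdg", ".mp3"}:
            extensions[pure.with_suffix("").as_posix()].add(suffix)
    labels: dict[str, str] = {}
    for base, found in extensions.items():
        if found == {".cdg", ".mp3"}:
            labels[base] = "cdg_mp3_pair"  # treated as a single song unit
        else:
            labels[base] = "mp3" if ".mp3" in found else "cdg"  # standalone file
    return labels
```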

4. Output & Reporting

4.1 Skip List

Format: JSON (/data/skipSongs.json)
- List of file paths to skip in future imports
- Optionally: “reason” field (e.g., {"path": "...", "reason": "duplicate"})
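
Building and writing the skip list could be sketched as follows, using the `path` and `reason` fields from the example above. Keeping only the first entry per path is an assumption about how repeated entries should be handled, and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable


def build_skip_list(entries: Iterable[tuple[str, str]]) -> list[dict[str, str]]:
    """Turn (path, reason) pairs into skip entries, keeping the first entry per path."""
    seen: set[str] = set()
    skip_list: list[dict[str, str]] = []
    for path, reason in entries:
        if path in seen:
            continue  # the same file must not appear twice in the skip list
        seen.add(path)
        skip_list.append({"path": path, "reason": reason})
    return skip_list


def write_skip_list(skip_list: list[dict[str, str]], out_path: Path) -> None:
    """Write the skip list as indented UTF-8 JSON."""
    out_path.write_text(json.dumps(skip_list, indent=2, ensure_ascii=False), encoding="utf-8")
```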

4.2 CLI Reporting

Summary: Total songs, duplicates found, types breakdown, etc.
Verbose per-song output: Only for matches/duplicates (not every song)
Verbosity: Configurable via CLI flag or config
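
A CLI entry point with these options might be wired up with `argparse` as in the sketch below. The `--save-reports` option is mentioned in Section 9; the spellings `--verbose` and `--dry-run` are assumptions based on the "verbose and dry-run modes" listed there, and `format_summary` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create the argument parser (flag spellings other than --save-reports are assumed)."""
    parser = argparse.ArgumentParser(description="Karaoke song library cleanup tool")
    parser.add_argument("--verbose", action="store_true", help="print per-song details for duplicates")
    parser.add_argument("--dry-run", action="store_true", help="analyze without writing the skip list")
    parser.add_argument("--save-reports", action="store_true", help="write detailed reports to disk")
    return parser


def format_summary(total_songs: int, duplicates: int) -> str:
    """Format a one-line summary of the analysis."""
    rate = duplicates / total_songs if total_songs else 0.0
    return f"Total songs: {total_songs:,} | Duplicates: {duplicates:,} ({rate:.1%})"
```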

4.3 Manual Review (Web UI)

Interactive Web Interface: Table/grid view for ambiguous/complex cases
Media Preview: Ability to preview media before making a selection (planned; see Section 7.3)
Bulk Actions: Select multiple items for batch operations
Real-time Filtering: Search and filter capabilities
Responsive Design: Works on desktop and mobile devices
Easy Startup: Simple script (start_web_ui.py) with dependency checking

5. Features & Edge Cases

Batch Processing:
- E.g., "Auto-skip all but highest-priority channel for each song"
- Manual review is enabled with a CLI flag (in the future, it will always be available in the web UI)
Edge Cases:
- Songs that exist in more than two formats or versions
- Support for keeping multiple versions per song (configurable/manual)
Non-destructive: Never deletes or moves files, only generates skip list and reports
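
The "auto-skip all but the highest-priority version" rule can be sketched as a pure function that only returns which files to keep and which to skip, without touching the files. The ranking follows the file type priority from Section 3.3; the `Candidate` class, the type labels, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Lower number means higher priority: MP4 > CDG/MP3 pair > standalone MP3 > standalone CDG.
FILE_TYPE_RANK = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}


@dataclass(frozen=True)
class Candidate:
    path: str
    file_type: str
    channel_rank: Optional[int] = None  # index in the configured channel list, if any


def split_keep_and_skip(group: list[Candidate]) -> tuple[Candidate, list[Candidate]]:
    """Return the best version of a song and the remaining versions to skip."""
    if not group:
        raise ValueError("group must contain at least one candidate")

    def sort_key(candidate: Candidate) -> tuple[int, float]:
        # Versions without a known channel sort after all ranked channels.
        rank = float("inf") if candidate.channel_rank is None else candidate.channel_rank
        return (FILE_TYPE_RANK[candidate.file_type], rank)

    ordered = sorted(group, key=sort_key)
    return ordered[0], ordered[1:]
```

Because the function returns data instead of acting on files, the same logic can feed either the skip list or a manual review queue.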

6. Tech Stack & Organization

CLI Language: Python
Config: JSON (channel priorities, settings)
Current Folder Structure:

KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   ├── skipSongs.json         # Output: Generated skip list
│   └── reports/               # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   └── utils.py               # Utility functions
├── web/                       # Web UI for manual review
│   ├── app.py                 # Flask web application
│   └── templates/
│       └── index.html         # Web interface template
├── start_web_ui.py            # Web UI startup script
├── test_tool.py               # Validation and testing script
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── PRD.md                     # Product Requirements Document
└── README.md                  # Project documentation

7. Web UI Implementation

7.1 Current Web UI Features

Interactive Table View: Sortable, filterable grid of duplicate songs
Bulk Selection: Select multiple items for batch operations
Search & Filter: Real-time search across artists, titles, and paths
Responsive Design: Mobile-friendly interface
Easy Startup: Automated dependency checking and browser launch

7.2 Web UI Architecture

Flask Backend: Lightweight web server (web/app.py)
HTML Template: Modern, responsive interface (web/templates/index.html)
Startup Script: Dependency management and server startup (start_web_ui.py)

7.3 Future Web UI Enhancements

Embedded media player for audio/video preview
Real-time configuration editing
Advanced filtering and sorting options
Export capabilities for manual selections

8. Open Questions (for future refinement)

Fuzzy matching library/thresholds?
Best parsing rules for multi-artist/feat. strings?
Any alternate export formats needed?
Temporary/partial skip support for "under review" songs?

9. Implementation Status

✅ Completed Features

Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
Print CLI summary reports (with verbosity control)
Implement config file support for channel priority
Organize folder/file structure for easy expansion

🎯 Current Implementation

The tool has been successfully implemented with the following components:

Core Modules:

cli/main.py - Main CLI application with argument parsing
cli/matching.py - Song matching and deduplication logic
cli/report.py - Report generation and output formatting
cli/utils.py - Utility functions for file operations and data processing

Configuration:

config/config.json - Configurable settings for channel priorities, matching rules, and output options

Features Implemented:

Multi-format support (MP3, CDG, MP4)
CDG/MP3 Pairing Logic: Files with same base filename treated as single karaoke song units
Channel priority system for MP4 files (based on folder names in path)
Fuzzy matching support with configurable threshold
Multi-artist parsing with various delimiters
Enhanced Analysis & Reporting: Comprehensive statistical analysis with actionable insights
Channel priority analysis and manual review identification
Non-destructive operation (skip lists only)
Verbose and dry-run modes
Detailed duplicate analysis
Skip list generation with metadata
Pattern Analysis: Skip list pattern analysis and channel optimization suggestions

File Type Priority System (highest to lowest):

1. MP4 files (with channel priority sorting)
2. CDG/MP3 pairs (treated as single units)
3. Standalone MP3 files
4. Standalone CDG files

Performance Results:

Successfully processed 37,015 songs
Identified 12,424 duplicates (33.6% duplicate rate)
Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
Optimized for large datasets with progress indicators
Enhanced Analysis: Generated 7 detailed reports with actionable insights
Bug Fix: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)

📋 Next Steps Checklist

✅ Completed

Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
Print CLI summary reports (with verbosity control)
Implement config file support for channel priority
Organize folder/file structure for easy expansion
Implement CDG/MP3 pairing logic for accurate duplicate detection
Generate comprehensive skip list with metadata
Optimize performance for large datasets (37,000+ songs)
Add progress indicators and error handling
Generate detailed analysis reports (--save-reports functionality)
Create web UI for manual review of ambiguous cases
Add test tool for validation and debugging
Create startup script for web UI with dependency checking
Add comprehensive .gitignore file
Update documentation with required data file information

🎯 Next Priority Items

Analyze MP4 files without channel priorities to suggest new folder names
Add support for additional file formats if needed
Implement batch processing capabilities
Create integration scripts for karaoke software
Add unit tests for core functionality
Implement audio fingerprinting for better duplicate detection
