mbrucedogs/KaraokeMerge

Matt Bruce dd916a646a Signed-off-by: Matt Bruce <mbrucedogs@gmail.com>

2025-08-10 10:48:02 -05:00

Karaoke Song Library Cleanup Tool — PRD (v2.0)

1. Project Summary

Goal: Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON "skip list" (for future imports) and supporting flexible reporting and manual review.
Primary User: Admin (self, collection owner)
Interfaces: Command Line (CLI) with print/logging and JSON output, plus interactive Web UI for manual review
Current Status: Fully functional CLI tool with comprehensive web UI for interactive review and priority management

2. Architectural Priorities

2.1 Code Organization Principles

TOP PRIORITY: The codebase must be built with the following architectural principles from the beginning:

True Separation of Concerns:
- Many small files with focused responsibilities
- Each module/class should have a single, well-defined purpose
- Avoid monolithic files with mixed responsibilities
Constants and Enums:
- Create constants, enums, and configuration objects to avoid duplicate code or values
- Centralize magic numbers, strings, and configuration values
- Use enums for type safety and clarity
Readability and Maintainability:
- Code should be self-documenting with clear naming conventions
- Easy to understand, extend, and refactor
- Consistent patterns throughout the codebase
Extensibility:
- Design for future growth and feature additions
- Modular architecture that allows easy integration of new components
- Clear interfaces between modules
Refactorability:
- Code structure should make future refactoring straightforward
- Minimize coupling between components
- Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
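
As an illustration of the "Constants and Enums" principle, the sketch below centralizes file types and a couple of shared values in one place. The names FileType, DEFAULT_FUZZY_THRESHOLD, SKIP_REASON_DUPLICATE, and file_type_from_extension are hypothetical and the values are illustrative only; none of them are taken from the actual codebase.

```python
from enum import Enum


class FileType(Enum):
    """Karaoke file types, keyed by lowercase file extension."""

    MP3 = ".mp3"
    CDG = ".cdg"
    MP4 = ".mp4"


# Hypothetical centralized constants; the values are illustrative only.
DEFAULT_FUZZY_THRESHOLD = 0.85
SKIP_REASON_DUPLICATE = "duplicate"


def file_type_from_extension(extension: str) -> FileType:
    """Map an extension such as ".MP4" to a FileType member.

    Raises:
        ValueError: If the extension is not a known karaoke file type.
    """
    return FileType(extension.lower())
```

Looking up members by value keeps the extension strings in a single definition, so other modules never repeat ".mp4" or ".cdg" as literals.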

2.2 Documentation Requirements

CRITICAL REQUIREMENT: All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:

PRD.md Updates: Any changes to project requirements, architecture, or functionality must be reflected in this document
README.md Updates: User-facing features, installation instructions, or usage changes must be documented
CLI Commands Documentation: All CLI functionality, options, and usage examples must be documented in cli/commands.txt
Code Comments: Significant logic changes should include inline documentation
API Documentation: New endpoints, functions, or interfaces must be documented
API Update Requirement: Whenever a new API endpoint is added, the PRD.md, README.md, and cli/commands.txt MUST be updated to reflect the new functionality

Documentation Update Checklist:

Update PRD.md with any architectural or requirement changes
Update README.md with new features, installation steps, or usage instructions
Update cli/commands.txt with any new CLI options, examples, or functionality changes
Add inline comments for complex logic or business rules
Update any configuration examples or file structure documentation
Review and update implementation status sections
API Updates: When new API endpoints are added, update PRD.md, README.md, and cli/commands.txt

CLI Commands Documentation Requirements:

Comprehensive Coverage: All CLI arguments, options, and flags must be documented with examples
Usage Examples: Provide practical examples for common use cases and combinations
Configuration Details: Document all configuration options and their effects
Error Handling: Include troubleshooting information and common issues
Integration Notes: Document how CLI integrates with web UI and other components
Version Tracking: Keep version information and feature status up to date

API Documentation Requirements:

Endpoint Documentation: All new API endpoints must be documented in the PRD.md with their purpose, parameters, and responses
README Integration: API changes must be reflected in README.md with usage examples and integration notes
CLI Integration: If CLI commands interact with APIs, they must be documented in cli/commands.txt
Version Tracking: API versioning and changes must be tracked in documentation
Error Handling: Document all possible error responses and status codes
Authentication: Document any authentication requirements or API key usage

This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.

2.3 Code Quality & Development Standards

MANDATORY STANDARDS: The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:

Naming Conventions

Files: Use descriptive, lowercase names with underscores (song_matcher.py, priority_manager.py)
Classes: PascalCase (SongMatcher, PreferencesManager)
Functions/Methods: snake_case (process_songs, get_priority_order)
Constants: UPPER_SNAKE_CASE (MAX_FILE_SIZE, DEFAULT_CHANNEL_PRIORITY)
Variables: snake_case with descriptive names (song_collection, duplicate_count)

Code Structure Standards

Function Length: Maximum 50 lines per function (aim for 20-30 lines)
Class Length: Maximum 300 lines per class (aim for 100-200 lines)
File Length: Maximum 500 lines per file (aim for 200-400 lines)
Indentation: 4 spaces (no tabs)
Line Length: Maximum 120 characters
Import Organization: Group imports in the order standard library, third-party, local (alphabetical within each group)

Error Handling & Logging

Exception Handling: Always catch specific exception types; never use a bare except: clause
Logging: Use Python's logging module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
User Feedback: Provide clear, actionable error messages
Graceful Degradation: Handle missing files/configs gracefully with sensible defaults
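
A minimal sketch of these four rules applied to configuration loading. The function load_config, the DEFAULT_CONFIG keys, and the log messages are assumptions made for illustration; the real keys in config/config.json may differ.

```python
import copy
import json
import logging
from pathlib import Path
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Hypothetical defaults; the real keys live in config/config.json.
DEFAULT_CONFIG: Dict[str, Any] = {"channel_priorities": [], "fuzzy_matching": False}


def load_config(config_path: Path) -> Dict[str, Any]:
    """Load settings, falling back to defaults if the file is missing or invalid.

    Args:
        config_path: Location of the JSON configuration file.

    Returns:
        The defaults overlaid with any values found in the file.
    """
    try:
        with config_path.open(encoding="utf-8") as handle:
            loaded = json.load(handle)
    except FileNotFoundError:
        logger.warning("Config %s not found; using defaults", config_path)
        return copy.deepcopy(DEFAULT_CONFIG)
    except json.JSONDecodeError as error:
        logger.error("Config %s is not valid JSON (%s); using defaults", config_path, error)
        return copy.deepcopy(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        logger.error("Config %s must contain a JSON object; using defaults", config_path)
        return copy.deepcopy(DEFAULT_CONFIG)
    return {**copy.deepcopy(DEFAULT_CONFIG), **loaded}
```

Each failure mode is caught by its own exception type and logged at a level that matches its severity, and the caller always receives a usable configuration.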

Type Hints & Documentation

Type Hints: Use Python type hints for all function parameters and return values
Docstrings: Include docstrings for all public functions, classes, and modules
Docstring Format: Use Google-style docstrings with parameter descriptions
Complex Logic: Add inline comments explaining business logic and algorithms

Configuration Management

Environment Variables: Use environment variables for sensitive data (API keys, paths)
Config Validation: Validate configuration on startup with clear error messages
Default Values: Provide sensible defaults for all configuration options
Config Documentation: Document all configuration options with examples

Performance & Scalability

Memory Efficiency: Process large datasets in chunks; avoid loading everything into memory
Progress Indicators: Show progress for long-running operations
Caching: Implement appropriate caching for expensive operations
Async Operations: Use async/await for I/O operations where beneficial

Security Best Practices

Input Validation: Validate and sanitize all user inputs
File Operations: Use pathlib for safe file path handling
JSON Safety: Use json.loads() with proper error handling
No Hardcoded Secrets: Never commit API keys, passwords, or sensitive data
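
One way to combine input validation with pathlib, for example before serving a requested media file, is to resolve the path and confirm it stays inside an allowed root. This is a sketch under assumptions: resolve_inside and its parameters are hypothetical names, and it is not taken from web/app.py.

```python
from pathlib import Path


def resolve_inside(root: Path, requested: str) -> Path:
    """Resolve a user-supplied path and ensure it stays inside root.

    Args:
        root: Directory that all served files must live under.
        requested: Untrusted path fragment supplied by the user.

    Returns:
        The resolved absolute path.

    Raises:
        ValueError: If the resolved path escapes root (e.g. via "..").
    """
    root_resolved = root.resolve()
    candidate = (root_resolved / requested).resolve()
    if candidate != root_resolved and root_resolved not in candidate.parents:
        raise ValueError(f"Path escapes the allowed root: {requested!r}")
    return candidate
```

Resolving before comparing means ".." segments and absolute paths in the request are both rejected rather than silently followed.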

Version Control Standards

Commit Messages: Use conventional commit format (feat:, fix:, docs:, refactor:)
Branch Naming: Use descriptive branch names (feature/priority-management, fix/duplicate-detection)
Pull Requests: Require code review for all changes
Git Hooks: Use pre-commit hooks for linting and formatting

Dependency Management

Requirements: Keep requirements.txt updated with exact versions
Virtual Environments: Always use virtual environments for development
Dependency Updates: Regularly update dependencies and test compatibility
Minimal Dependencies: Only include necessary dependencies, avoid bloat

Code Review Checklist

Code follows naming conventions
Functions are appropriately sized and focused
Error handling is comprehensive
Type hints and docstrings are present
Configuration is properly validated
No hardcoded values or secrets
Performance considerations addressed
Documentation is updated

These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.

3. Data Handling & Matching Logic

3.1 Input

Reads from /data/allSongs.json
Each song includes at least:
- artist, title, path (plus ID3 tag info, and channel for MP4s)

3.2 Song Matching

Primary keys: artist + title
- Fuzzy matching configurable (enabled/disabled with threshold)
- Multi-artist handling: parse delimiters (commas, "feat.", etc.)
File type detection: Use file extension from path (.mp3, .cdg, .mp4)
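
The matching rules above could be sketched as follows. This is not the logic in cli/matching.py: the delimiter set, the 0.85 threshold, the use of difflib from the standard library, and the names split_artists, match_key, and is_match are all assumptions (the fuzzy matching library and parsing rules are still listed as open questions in section 9).

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumed delimiter set: commas, "&", "feat.", "ft.", and "featuring".
ARTIST_DELIMITERS = re.compile(
    r"\s*(?:,|&|\bfeaturing\b|\bfeat\b\.?|\bft\b\.?)\s*", re.IGNORECASE
)


def split_artists(artist: str) -> List[str]:
    """Split a combined artist string into lowercase individual names."""
    parts = ARTIST_DELIMITERS.split(artist)
    return [part.strip().lower() for part in parts if part.strip()]


def match_key(artist: str, title: str) -> str:
    """Build a case-insensitive, whitespace-normalized artist + title key."""
    normalized_artist = " ".join(artist.lower().split())
    normalized_title = " ".join(title.lower().split())
    return f"{normalized_artist}|{normalized_title}"


def is_match(
    first_key: str,
    second_key: str,
    fuzzy_enabled: bool = True,
    threshold: float = 0.85,
) -> bool:
    """Compare two keys exactly, then optionally by similarity ratio."""
    if first_key == second_key:
        return True
    if not fuzzy_enabled:
        return False
    return SequenceMatcher(None, first_key, second_key).ratio() >= threshold
```

With fuzzy matching disabled, only identical keys match; with it enabled, small spelling differences such as a missing apostrophe can still be treated as the same song.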

3.3 Channel Priority (for MP4s)

Configurable folder names:
- Set in /config/config.json as an array of folder names
- Order = priority (first = highest priority)
- Tool searches for these folder names within the song's path property
- Songs without matching folder names are marked for manual review
File type priority: MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
CDG/MP3 pairing: CDG and MP3 files with the same base filename are treated as a single karaoke song unit
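
A sketch of the channel lookup and CDG/MP3 pairing described above. It assumes folder names are matched case-insensitively against whole folder segments of the path, and that a pair must share the same folder as well as the same base filename; channel_rank and pair_cdg_mp3 are hypothetical names. PureWindowsPath is used because it accepts both "\" and "/" as separators.

```python
from pathlib import PureWindowsPath
from typing import Dict, List, Optional, Tuple


def channel_rank(path: str, channel_priorities: List[str]) -> Optional[int]:
    """Return the index of the first configured folder found in the path.

    A return value of None means no configured folder matched, which the
    PRD treats as a case for manual review.
    """
    folders = {part.lower() for part in PureWindowsPath(path).parent.parts}
    for rank, folder_name in enumerate(channel_priorities):
        if folder_name.lower() in folders:
            return rank
    return None


def pair_cdg_mp3(paths: List[str]) -> List[Tuple[str, str]]:
    """Return (cdg_path, mp3_path) tuples for files sharing a base filename."""
    by_base: Dict[str, Dict[str, str]] = {}
    for path in paths:
        pure_path = PureWindowsPath(path)
        suffix = pure_path.suffix.lower()
        if suffix in (".cdg", ".mp3"):
            # Key on the full path minus extension (assumes same folder).
            base = str(pure_path.with_suffix("")).lower()
            by_base.setdefault(base, {})[suffix] = path
    return [
        (found[".cdg"], found[".mp3"])
        for found in by_base.values()
        if len(found) == 2
    ]
```

For example, with priorities ["KaraFun", "Sing King"], a file under a "Sing King" folder gets rank 1, and lower ranks win.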

4. Output & Reporting

4.1 Skip List

Format: JSON (/data/skipSongs.json)
- List of file paths to skip in future imports
- Optionally: "reason" field (e.g., {"path": "...", "reason": "duplicate"})
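
A minimal sketch of writing the skip list in the format above, keeping only the first entry for each path so the output contains no repeated paths. The function name write_skip_list and the choice to keep the first reason seen are assumptions, not the behavior of cli/report.py.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_skip_list(entries: List[Dict[str, str]], output_path: Path) -> int:
    """Write skip entries as JSON, dropping repeated paths.

    Args:
        entries: Dicts with a "path" key and an optional "reason" key.
        output_path: Destination file, e.g. data/skipSongs.json.

    Returns:
        The number of unique entries written.
    """
    unique: Dict[str, Dict[str, str]] = {}
    for entry in entries:
        # Keep the first entry seen for each path (assumed policy).
        unique.setdefault(entry["path"], entry)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(
        json.dumps(list(unique.values()), indent=2), encoding="utf-8"
    )
    return len(unique)
```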

4.2 CLI Reporting

Summary: Total songs, duplicates found, types breakdown, etc.
Verbose per-song output: Only for matches/duplicates (not every song)
Configurable verbosity: Set via CLI flag or config
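
A sketch of how CLI flags might control verbosity using argparse. Only --save-reports is named in this document; the spellings --verbose and --dry-run are assumptions based on the "verbose and dry-run modes" listed in section 10, and cli/commands.txt remains the authoritative reference.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Build an argument parser with assumed flag names."""
    parser = argparse.ArgumentParser(description="Karaoke library cleanup (sketch)")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-song output for matches/duplicates")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze without writing the skip list")
    parser.add_argument("--save-reports", action="store_true",
                        help="Write detailed analysis reports")
    return parser


def log_level_for(args: argparse.Namespace) -> int:
    """Map the verbosity flag to a logging level."""
    return logging.DEBUG if args.verbose else logging.INFO
```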

4.3 Manual Review (Web UI)

Interactive Web Interface: Table/grid view for ambiguous/complex cases
Media Preview: Ability to preview media before making a selection
Bulk Actions: Select multiple items for batch operations
Real-time Filtering: Search and filter capabilities
Responsive Design: Works on desktop and mobile devices
Easy Startup: Simple script (start_web_ui.py) with dependency checking

5. Features & Edge Cases

Batch Processing:
- E.g., "Auto-skip all but highest-priority channel for each song"
- Manual review as CLI flag (future: always in web UI)
Edge Cases:
- Multiple versions (>2 formats)
- Support for keeping multiple versions per song (configurable/manual)
Non-destructive: Never deletes or moves files, only generates skip list and reports
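
The "auto-skip all but the highest-priority version" rule can be sketched as a sort on file type rank followed by channel rank. The dictionary shape ("path", "kind", "channel_rank"), the names FILE_TYPE_RANK, priority_key, and keep_best, and the decision to sort versions without a known channel last are all assumptions for illustration. Nothing is deleted or moved; the function only decides what would go on the skip list.

```python
from typing import Any, Dict, List, Optional, Tuple

# Lower rank wins: MP4 > CDG/MP3 pair > standalone MP3 > standalone CDG.
FILE_TYPE_RANK: Dict[str, int] = {"mp4": 0, "cdg_mp3_pair": 1, "mp3": 2, "cdg": 3}

# Assumed placeholder rank for versions with no matching channel folder.
UNRANKED_CHANNEL = 10_000


def priority_key(version: Dict[str, Any]) -> Tuple[int, int]:
    """Build a sort key of (file type rank, channel rank)."""
    channel_rank: Optional[int] = version.get("channel_rank")
    if channel_rank is None:
        channel_rank = UNRANKED_CHANNEL
    return (FILE_TYPE_RANK[version["kind"]], channel_rank)


def keep_best(
    versions: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return the version to keep and the versions to skip."""
    if not versions:
        raise ValueError("versions must not be empty")
    ranked = sorted(versions, key=priority_key)
    return ranked[0], ranked[1:]
```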

6. Tech Stack & Organization

CLI Language: Python
Web UI: Flask + HTML/CSS/JavaScript (Bootstrap, Font Awesome, Sortable.js)
Config: JSON (channel priorities, settings)
Current Folder Structure:

KaraokeMerge/
├── data/
│   ├── allSongs.json          # Input: Your song library data
│   ├── skipSongs.json         # Output: Generated skip list
│   ├── preferences/           # User priority preferences
│   │   ├── priority_preferences.json
│   │   └── priority_preferences_backup_*.json
│   └── reports/               # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json            # Configuration settings
├── cli/
│   ├── main.py                # Main CLI application
│   ├── matching.py            # Song matching logic
│   ├── report.py              # Report generation
│   ├── preferences.py         # Priority preferences management
│   ├── utils.py               # Utility functions
│   └── commands.txt           # Comprehensive CLI commands reference
├── web/                       # Web UI for manual review
│   ├── app.py                 # Flask web application
│   └── templates/
│       ├── index.html         # Web interface template
│       └── remaining_songs.html  # Remaining songs browsing interface
├── start_web_ui.py            # Web UI startup script
├── test_tool.py               # Validation and testing script
├── requirements.txt           # Python dependencies
├── .gitignore                 # Git ignore rules
├── PRD.md                     # Product Requirements Document
└── README.md                  # Project documentation

7. Web UI Implementation

7.1 Current Web UI Features

Core Functionality

Interactive Table View: Sortable, filterable grid of duplicate songs
Bulk Selection: Select multiple items for batch operations
Search & Filter: Real-time search across artists, titles, and paths
Responsive Design: Mobile-friendly interface
Easy Startup: Automated dependency checking and browser launch
Remaining Songs View: Separate page to browse all songs that remain after cleanup

Media Preview & Playback

Video Playback: Direct MP4 video playback in modal popup for previewing karaoke videos
File Path Normalization: Automatic correction of malformed file paths (handles :// corruption)
Video Modal: Full-screen video player with controls for karaoke video preview

IMPORTANT: Web Player Path Handling Fix

Issue: File paths with backslashes were being corrupted when passed from HTML to JavaScript due to improper string literal escaping in onclick attributes
Root Cause: Backslashes in version.path were not properly escaped when inserted into JavaScript string literals in HTML onclick attributes, causing them to be interpreted as escape characters
Solution: Added .replace(/\\/g, '\\\\') to escape backslashes before inserting into onclick attributes in web/templates/index.html
Impact: Ensures paths displayed in UI match paths received by JavaScript functions, preventing 404 errors in video playback
Files Modified: web/templates/index.html (line ~1010), web/app.py (normalize_path function simplified)

Priority Management System

Drag-and-Drop Priority Management: Interactive reordering of file priorities using Sortable.js
Visual Priority Indicators: Real-time visual feedback showing KEPT/SKIPPED status
Dynamic Visual Updates: Automatic color coding and badge updates based on priority order
Priority Persistence: Save/load user priority preferences to/from JSON files
Priority Preferences API: RESTful endpoints for managing priority preferences

User Interface Enhancements

Visual Status Indicators: Color-coded cards (green for kept, red for skipped)
File Type Badges: Visual indicators for MP3, MP4, and CDG files
Channel Badges: Display channel information for MP4 files
Progress Indicators: Loading states and status messages
Error Handling: Comprehensive error handling with user-friendly messages
Debug Tools: Console logging and debugging functions for troubleshooting

Data Management

Priority Preferences Storage: Automatic backup creation with timestamps
Reset Functionality: Ability to reset all priority preferences to defaults
Change Tracking: Real-time tracking of unsaved priority changes
Save Button Management: Dynamic enable/disable based on unsaved changes

7.2 Web UI Architecture

Frontend Technologies

HTML5: Semantic markup with Bootstrap 5 for responsive design
CSS3: Custom styling with Bootstrap components and Font Awesome icons
JavaScript (ES6+): Modern JavaScript with async/await for API calls
Sortable.js: Drag-and-drop library for priority reordering
Bootstrap 5: UI framework for responsive design and components
Font Awesome 6: Icon library for visual elements

Backend Technologies

Flask: Lightweight Python web framework
JSON APIs: RESTful endpoints for data management
File System Integration: Direct file operations for preferences and data
Error Handling: Comprehensive error handling and logging

Key Components

web/app.py: Flask application with API endpoints
web/templates/index.html: Main web interface template
web/templates/remaining_songs.html: Remaining songs browsing interface
start_web_ui.py: Startup script with dependency management

7.3 API Endpoints

Data Endpoints

/api/duplicates: Get duplicate song data with pagination
/api/stats: Get statistical analysis of the song collection
/api/artists: Get list of artists for filtering
/api/mp3-songs: Get MP3 songs that remain after cleanup
/api/remaining-songs: Get all remaining songs with pagination and filtering
/api/config: Get current configuration settings

Priority Management Endpoints

/api/save-priority-preferences: Save user priority preferences to JSON
/api/load-priority-preferences: Load saved priority preferences
/api/reset-priority-preferences: Reset all priority preferences to defaults

File Serving Endpoints

/api/video/<path>: Serve MP4 video files for preview
/api/download/mp3-songs: Download MP3 song list as JSON
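
Several endpoints above return paginated data. A framework-independent helper for that could look like the sketch below; the parameter names page and per_page and the response keys are assumptions, since the actual request and response shapes of web/app.py are not specified here.

```python
import math
from typing import Any, Dict, Sequence


def paginate(items: Sequence[Any], page: int = 1, per_page: int = 50) -> Dict[str, Any]:
    """Slice items for one page and include paging metadata.

    Raises:
        ValueError: If page or per_page is less than 1.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive integers")
    total = len(items)
    start = (page - 1) * per_page
    return {
        "items": list(items[start:start + per_page]),
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": math.ceil(total / per_page),
    }
```

A route handler would validate the query parameters, call this helper, and return the resulting dictionary as JSON.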

7.4 Future Web UI Enhancements

Audio preview for MP3 files
Real-time configuration editing
Advanced filtering and sorting options
Export capabilities for manual selections
Batch video preview functionality
Video thumbnail generation
Real-time collaboration features
Advanced analytics dashboard

8. Priority Management System

8.1 Overview

The priority management system allows users to manually override the default priority algorithm through an interactive web interface. User decisions are persisted and used by the CLI tool for future processing runs.

8.2 Features

Interactive Drag-and-Drop: Reorder song versions using intuitive drag-and-drop interface
Visual Feedback: Real-time visual indicators showing kept/skipped status
Persistent Storage: User preferences saved to data/preferences/priority_preferences.json
Automatic Backups: Timestamped backup files created on each save
Reset Capability: Ability to reset all preferences to default algorithm
Change Tracking: Real-time tracking of unsaved changes with visual indicators

8.3 Data Flow

Web UI: User makes priority changes via drag-and-drop
Frontend: Changes stored in priorityChanges JavaScript object
Save Action: Changes sent to backend via /api/save-priority-preferences
Backend: Preferences saved to JSON file with automatic backup
CLI Integration: CLI tool reads preferences file and applies user decisions
Persistence: Changes persist across web UI sessions and CLI runs
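
The "Backend" step of this flow might be sketched as follows. The file names follow section 8.4, but the timestamp format, the function name save_priority_preferences, and the choice to create a backup only when a previous file exists are assumptions rather than the behavior of web/app.py.

```python
import json
import shutil
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional

PREFERENCES_FILENAME = "priority_preferences.json"


def save_priority_preferences(
    preferences: Dict[str, Any],
    preferences_dir: Path,
    now: Optional[datetime] = None,
) -> Optional[Path]:
    """Save preferences, backing up the previous file first.

    Args:
        preferences: JSON-serializable priority decisions from the web UI.
        preferences_dir: Directory such as data/preferences.
        now: Timestamp override, mainly for testing.

    Returns:
        The backup path, or None if there was no previous file to back up.
    """
    preferences_dir.mkdir(parents=True, exist_ok=True)
    target = preferences_dir / PREFERENCES_FILENAME
    backup_path: Optional[Path] = None
    if target.exists():
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        backup_path = preferences_dir / f"priority_preferences_backup_{stamp}.json"
        shutil.copy2(target, backup_path)
    target.write_text(json.dumps(preferences, indent=2), encoding="utf-8")
    return backup_path
```

The CLI side then only needs to read priority_preferences.json with the same error handling it uses for other JSON inputs.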

8.4 File Structure

data/preferences/
├── priority_preferences.json              # Current user preferences
└── priority_preferences_backup_*.json     # Timestamped backups

9. Open Questions (for future refinement)

Fuzzy matching library/thresholds?
Best parsing rules for multi-artist/feat. strings?
Any alternate export formats needed?
Temporary/partial skip support for "under review" songs?
Integration with karaoke software APIs?
Audio fingerprinting for better duplicate detection?

10. Implementation Status

✅ Completed Features

Core CLI Functionality

Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
Print CLI summary reports (with verbosity control)
Implement config file support for channel priority
Organize folder/file structure for easy expansion
Implement CDG/MP3 pairing logic for accurate duplicate detection
Generate comprehensive skip list with metadata
Optimize performance for large datasets (37,000+ songs)
Add progress indicators and error handling
Generate detailed analysis reports (--save-reports functionality)
Add test tool for validation and debugging
Create startup script for web UI with dependency checking
Add comprehensive .gitignore file
Update documentation with required data file information

Web UI Implementation

Create web UI for manual review of ambiguous cases
Implement interactive table view with sorting and filtering
Add MP4 video playback functionality in web UI modal
Implement drag-and-drop priority management with persistent preferences
Add visual status indicators and dynamic updates
Create priority preferences API endpoints
Implement automatic backup system for preferences
Add comprehensive error handling and debugging tools
Create responsive design with Bootstrap 5
Add file path normalization for corrupted paths
Implement change tracking and save button management
Create remaining songs browsing page with filtering and video preview

Advanced Features

Multi-format support (MP3, CDG, MP4)
Channel priority system for MP4 files
Fuzzy matching support with configurable threshold
Multi-artist parsing with various delimiters
Enhanced analysis & reporting with actionable insights
Pattern analysis and channel optimization suggestions
Non-destructive operation (skip lists only)
Verbose and dry-run modes

🎯 Current Implementation

The tool has been successfully implemented with the following components:

Core Modules:

cli/main.py - Main CLI application with argument parsing
cli/matching.py - Song matching and deduplication logic
cli/report.py - Report generation and output formatting
cli/preferences.py - Priority preferences management
cli/utils.py - Utility functions for file operations and data processing

Web UI Components:

web/app.py - Flask web application with API endpoints
web/templates/index.html - Modern, responsive web interface
start_web_ui.py - Startup script with dependency management

Configuration:

config/config.json - Configurable settings for channel priorities, matching rules, and output options

Features Implemented:

File Type Priority System:
1. MP4 files (with channel priority sorting)
2. CDG/MP3 pairs (treated as single units)
3. Standalone MP3 files
4. Standalone CDG files
Priority Management System:
- Interactive drag-and-drop reordering
- Visual feedback with color-coded indicators
- Persistent storage with automatic backups
- Reset functionality for default preferences
- Real-time change tracking
Web UI Features:
- Responsive design with Bootstrap 5
- Video playback for MP4 files
- Real-time filtering and search
- Pagination for large datasets
- Comprehensive error handling
- Debug tools and console logging

Performance Results:

Successfully processed 37,015 songs
Identified 12,424 duplicates (33.6% duplicate rate)
Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
Optimized for large datasets with progress indicators
Enhanced analysis with 7 detailed reports and actionable insights
Bug fix: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)
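
The figures above are consistent with one another, assuming the 10,998 unique files are the 12,424 identified duplicates minus the 1,426 repeated skip list entries. That reading is an inference from the numbers and is not stated explicitly here. A quick check:

```python
# Figures taken from the performance results above.
total_songs = 37_015
duplicates_found = 12_424
repeated_skip_entries = 1_426

# Duplicate rate as a percentage, rounded to one decimal place.
duplicate_rate = round(duplicates_found / total_songs * 100, 1)

# Unique skip list files after removing repeated entries (assumed relation).
unique_skip_files = duplicates_found - repeated_skip_entries

print(duplicate_rate, unique_skip_files)
```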

📋 Next Steps Checklist

🎯 Next Priority Items

Analyze MP4 files without channel priorities to suggest new folder names
Add support for additional file formats if needed
Implement batch processing capabilities
Create integration scripts for karaoke software
Implement audio fingerprinting for better duplicate detection
Add audio preview for MP3 files in web UI
Create advanced analytics dashboard
Implement real-time collaboration features
Add video thumbnail generation

🔄 Maintenance & Improvements

Regular dependency updates and security patches
Performance optimization for larger datasets
Enhanced error handling and user feedback
Additional configuration options
Extended documentation and tutorials
Community feedback integration

11. Version History

v2.0 (Current)

Major Web UI Enhancement: Complete drag-and-drop priority management system
Priority Persistence: User preferences saved and loaded automatically
Visual Improvements: Dynamic visual indicators and real-time feedback
Enhanced Error Handling: Comprehensive error handling with debugging tools
File Path Normalization: Automatic correction of corrupted file paths
Backup System: Automatic timestamped backups for user preferences

v1.0 (Initial Release)

Core CLI Functionality: Basic deduplication and skip list generation
Web UI Foundation: Initial web interface for manual review
Configuration System: JSON-based configuration management
Reporting System: Comprehensive analysis and reporting capabilities