# Karaoke Song Library Cleanup Tool — PRD (v1 CLI)

## 1. Project Summary

- **Goal:** Analyze, deduplicate, and suggest cleanup of a large karaoke song collection, outputting a JSON “skip list” (for future imports) and supporting flexible reporting and manual review.
- **Primary User:** Admin (self, collection owner)
- **Initial Interface:** Command Line (CLI) with print/logging and JSON output
- **Future Expansion:** Optional web UI for filtering, review, and playback

---

## 2. Architectural Priorities

### 2.1 Code Organization Principles

**TOP PRIORITY:** The codebase must be built with the following architectural principles from the beginning:

- **True Separation of Concerns:**
  - Many small files with focused responsibilities
  - Each module/class should have a single, well-defined purpose
  - Avoid monolithic files with mixed responsibilities

- **Constants and Enums:**
  - Create constants, enums, and configuration objects to avoid duplicate code or values
  - Centralize magic numbers, strings, and configuration values
  - Use enums for type safety and clarity

- **Readability and Maintainability:**
  - Code should be self-documenting with clear naming conventions
  - Easy to understand, extend, and refactor
  - Consistent patterns throughout the codebase

- **Extensibility:**
  - Design for future growth and feature additions
  - Modular architecture that allows easy integration of new components
  - Clear interfaces between modules

- **Refactorability:**
  - Code structure should make future refactoring straightforward
  - Minimize coupling between components
  - Use dependency injection and abstraction where appropriate

These principles are fundamental to the project's long-term success and must be applied consistently throughout development.
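
As a rough illustration of the constants-and-enums principle, the sketch below centralizes file types and their extensions in one place. The names `FileType`, `EXTENSION_TO_TYPE`, and `file_type_from_path` are hypothetical and not taken from the codebase; the ordering borrows the file type priority described in section 3.3.

```python
from enum import IntEnum
from pathlib import PurePosixPath
from typing import Optional


class FileType(IntEnum):
    """Hypothetical enum; a lower value means a higher priority."""

    MP4 = 1
    CDG_MP3_PAIR = 2
    MP3 = 3
    CDG = 4


# Single source of truth for extension handling (illustrative mapping).
EXTENSION_TO_TYPE = {
    ".mp4": FileType.MP4,
    ".mp3": FileType.MP3,
    ".cdg": FileType.CDG,
}


def file_type_from_path(path: str) -> Optional[FileType]:
    """Return the file type implied by the extension, or None if unknown."""
    return EXTENSION_TO_TYPE.get(PurePosixPath(path).suffix.lower())
```

Because the ranking lives in the enum, other modules can compare `FileType` values directly instead of repeating string literals such as `".mp4"`.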

### 2.2 Documentation Requirements

**CRITICAL REQUIREMENT:** All code changes, feature additions, or modifications MUST be accompanied by corresponding updates to the project documentation:

- **PRD.md Updates:** Any changes to project requirements, architecture, or functionality must be reflected in this document
- **README.md Updates:** User-facing features, installation instructions, or usage changes must be documented
- **Code Comments:** Significant logic changes should include inline documentation
- **API Documentation:** New endpoints, functions, or interfaces must be documented

**Documentation Update Checklist:**
- [ ] Update PRD.md with any architectural or requirement changes
- [ ] Update README.md with new features, installation steps, or usage instructions
- [ ] Add inline comments for complex logic or business rules
- [ ] Update any configuration examples or file structure documentation
- [ ] Review and update implementation status sections

This documentation requirement is mandatory and ensures the project remains maintainable and accessible to future developers and users.

### 2.3 Code Quality & Development Standards

**MANDATORY STANDARDS:** The following standards must be followed to ensure code quality, maintainability, and AI-friendly development:

#### **Naming Conventions**
- **Files:** Use descriptive, lowercase names with underscores (`song_matcher.py`, `priority_manager.py`)
- **Classes:** PascalCase (`SongMatcher`, `PreferencesManager`)
- **Functions/Methods:** snake_case (`process_songs`, `get_priority_order`)
- **Constants:** UPPER_SNAKE_CASE (`MAX_FILE_SIZE`, `DEFAULT_CHANNEL_PRIORITY`)
- **Variables:** snake_case with descriptive names (`song_collection`, `duplicate_count`)

#### **Code Structure Standards**
- **Function Length:** Maximum 50 lines per function (aim for 20-30 lines)
- **Class Length:** Maximum 300 lines per class (aim for 100-200 lines)
- **File Length:** Maximum 500 lines per file (aim for 200-400 lines)
- **Indentation:** 4 spaces (no tabs)
- **Line Length:** Maximum 120 characters
- **Import Organization:** Group imports: standard library, third-party, local (alphabetical within groups)

#### **Error Handling & Logging**
- **Exception Handling:** Always use specific exception types, never bare `except:`
- **Logging:** Use Python's `logging` module with appropriate levels (DEBUG, INFO, WARNING, ERROR)
- **User Feedback:** Provide clear, actionable error messages
- **Graceful Degradation:** Handle missing files/configs gracefully with sensible defaults
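
The error-handling rules above could look roughly like this in practice. This is a minimal sketch, not the project's actual loader: `DEFAULT_CONFIG`, its keys, and both function names are assumptions made for illustration.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Hypothetical defaults; the real config keys may differ.
DEFAULT_CONFIG: Dict[str, Any] = {"channel_priorities": [], "fuzzy_matching": False}


def parse_config_text(raw: str) -> Dict[str, Any]:
    """Parse config JSON, falling back to defaults if it is malformed."""
    try:
        loaded = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.error("Config is not valid JSON (%s); using defaults.", exc)
        return dict(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        logger.error("Config must be a JSON object; using defaults.")
        return dict(DEFAULT_CONFIG)
    # User-supplied values override defaults key by key.
    return {**DEFAULT_CONFIG, **loaded}


def load_config(path: Path) -> Dict[str, Any]:
    """Read the config file, degrading to defaults when it is missing."""
    try:
        raw = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.warning("Config file %s not found; using defaults.", path)
        return dict(DEFAULT_CONFIG)
    return parse_config_text(raw)
```

Each `except` clause names one specific exception type, and every fallback is logged so the user can see why defaults were applied.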

#### **Type Hints & Documentation**
- **Type Hints:** Use Python type hints for all function parameters and return values
- **Docstrings:** Include docstrings for all public functions, classes, and modules
- **Docstring Format:** Use Google-style docstrings with parameter descriptions
- **Complex Logic:** Add inline comments explaining business logic and algorithms
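
For reference, a function that follows these rules might look like the sketch below. `count_by_extension` is a hypothetical helper invented for this example; only the type-hint and Google-style docstring layout is the point.

```python
import os
from typing import Dict, List


def count_by_extension(paths: List[str]) -> Dict[str, int]:
    """Count files per lowercase file extension.

    Args:
        paths: File paths, for example the ``path`` value of each song.

    Returns:
        A mapping from extension (with leading dot, or "" if absent)
        to the number of paths that use it.
    """
    counts: Dict[str, int] = {}
    for path in paths:
        # splitext returns "" as the extension when the name has none.
        extension = os.path.splitext(path)[1].lower()
        counts[extension] = counts.get(extension, 0) + 1
    return counts
```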

#### **Testing Standards**
- **Unit Tests:** Write unit tests for all business logic functions
- **Test Coverage:** Aim for 80%+ code coverage
- **Test Organization:** Mirror the source code structure in test files
- **Test Data:** Use fixtures and factories for test data, never hardcode test values
- **Integration Tests:** Test complete workflows and API endpoints

#### **Configuration Management**
- **Environment Variables:** Use environment variables for sensitive data (API keys, paths)
- **Config Validation:** Validate configuration on startup with clear error messages
- **Default Values:** Provide sensible defaults for all configuration options
- **Config Documentation:** Document all configuration options with examples
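
One way to apply these configuration rules is sketched below. The environment variable `KARAOKE_CONFIG_PATH` and the keys `channel_priorities` and `fuzzy_threshold` are assumptions for illustration; the actual key names in `config/config.json` are not specified in this document.

```python
import os
from typing import Any, Dict, List

# Hypothetical environment variable and default path, chosen for illustration.
CONFIG_PATH_ENV_VAR = "KARAOKE_CONFIG_PATH"
DEFAULT_CONFIG_PATH = "config/config.json"
DEFAULT_FUZZY_THRESHOLD = 0.85


def resolve_config_path() -> str:
    """Prefer the environment variable, else fall back to the default path."""
    return os.environ.get(CONFIG_PATH_ENV_VAR, DEFAULT_CONFIG_PATH)


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the config is valid."""
    problems: List[str] = []

    priorities = config.get("channel_priorities", [])
    if not isinstance(priorities, list) or not all(isinstance(p, str) for p in priorities):
        problems.append("'channel_priorities' must be a list of folder-name strings.")

    threshold = config.get("fuzzy_threshold", DEFAULT_FUZZY_THRESHOLD)
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        problems.append("'fuzzy_threshold' must be a number between 0 and 1.")

    return problems
```

Returning a list of problems rather than raising on the first one lets the CLI report every configuration mistake in a single startup message.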

#### **Performance & Scalability**
- **Memory Efficiency:** Process large datasets in chunks, avoid loading everything into memory
- **Progress Indicators:** Show progress for long-running operations
- **Caching:** Implement appropriate caching for expensive operations
- **Async Operations:** Use async/await for I/O operations where beneficial

#### **Security Best Practices**
- **Input Validation:** Validate and sanitize all user inputs
- **File Operations:** Use `pathlib` for safe file path handling
- **JSON Safety:** Use `json.loads()` with proper error handling
- **No Hardcoded Secrets:** Never commit API keys, passwords, or sensitive data

#### **Version Control Standards**
- **Commit Messages:** Use conventional commit format (`feat:`, `fix:`, `docs:`, `refactor:`)
- **Branch Naming:** Use descriptive branch names (`feature/priority-management`, `fix/duplicate-detection`)
- **Pull Requests:** Require code review for all changes
- **Git Hooks:** Use pre-commit hooks for linting and formatting

#### **Dependency Management**
- **Requirements:** Keep `requirements.txt` updated with exact versions
- **Virtual Environments:** Always use virtual environments for development
- **Dependency Updates:** Regularly update dependencies and test compatibility
- **Minimal Dependencies:** Only include necessary dependencies, avoid bloat

#### **Code Review Checklist**
- [ ] Code follows naming conventions
- [ ] Functions are appropriately sized and focused
- [ ] Error handling is comprehensive
- [ ] Type hints and docstrings are present
- [ ] Tests are included for new functionality
- [ ] Configuration is properly validated
- [ ] No hardcoded values or secrets
- [ ] Performance considerations addressed
- [ ] Documentation is updated

These standards ensure the codebase remains clean, maintainable, and accessible to both human developers and AI assistants.

---

## 3. Data Handling & Matching Logic

### 3.1 Input

- Reads from `/data/allSongs.json`
- Each song includes at least:
  - `artist`, `title`, `path` (plus ID3 tag info, and `channel` for MP4s)
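
A minimal sketch of reading this input follows. It assumes the file holds a JSON array of song objects, which this document does not state explicitly; `REQUIRED_FIELDS` and `parse_songs` are hypothetical names.

```python
import json
from typing import Any, Dict, List, Tuple

# The three fields every song is expected to carry, per the list above.
REQUIRED_FIELDS = ("artist", "title", "path")


def parse_songs(raw_json: str) -> Tuple[List[Dict[str, Any]], List[Any]]:
    """Split parsed entries into usable songs and rejected entries.

    Assumes the input is a JSON array of song objects.
    """
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        raise ValueError("Expected a JSON array of songs.")

    valid: List[Dict[str, Any]] = []
    rejected: List[Any] = []
    for entry in entries:
        # A usable song is an object with a non-empty value for every required field.
        if isinstance(entry, dict) and all(entry.get(field) for field in REQUIRED_FIELDS):
            valid.append(entry)
        else:
            rejected.append(entry)
    return valid, rejected
```

Keeping the rejected entries, instead of silently dropping them, makes it possible to report them for manual review.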

### 3.2 Song Matching

- **Primary keys:** `artist` + `title`
- Fuzzy matching configurable (enabled/disabled with threshold)
- Multi-artist handling: parse delimiters (commas, “feat.”, etc.)
- **File type detection:** Use file extension from `path` (`.mp3`, `.cdg`, `.mp4`)
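
The matching rules above could be sketched as follows. This is an illustration under assumptions, not the logic in `cli/matching.py`: the standard-library `difflib.SequenceMatcher` stands in for whichever fuzzy library is eventually chosen, the 0.85 threshold is arbitrary, and the delimiter set (commas, "feat.", "ft.", "featuring") is a guess at what "etc." covers.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Assumed delimiter set; longer alternatives come first so "featuring" wins over "feat".
ARTIST_DELIMITERS = re.compile(r"\s*(?:,|\b(?:featuring|feat|ft)\b\.?)\s*", re.IGNORECASE)


def split_artists(artist: str) -> List[str]:
    """Split a multi-artist string into individual artist names."""
    return [part.strip() for part in ARTIST_DELIMITERS.split(artist) if part.strip()]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def is_match(
    artist_a: str,
    title_a: str,
    artist_b: str,
    title_b: str,
    fuzzy: bool = False,
    threshold: float = 0.85,
) -> bool:
    """Compare two songs on the artist + title key, optionally with fuzzy matching."""
    key_a = f"{normalize(artist_a)}|{normalize(title_a)}"
    key_b = f"{normalize(artist_b)}|{normalize(title_b)}"
    if key_a == key_b:
        return True
    if not fuzzy:
        return False
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold
```

Exact comparison on the normalized key runs first, so the more expensive fuzzy comparison is only reached when it is enabled and the keys differ.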

### 3.3 Channel Priority (for MP4s)

- **Configurable folder names:**
  - Set in `/config/config.json` as an array of folder names
  - Order = priority (first = highest priority)
  - Tool searches for these folder names within the song's `path` property
  - Songs without matching folder names are marked for manual review
- **File type priority:** MP4 > CDG/MP3 pairs > standalone MP3 > standalone CDG
- **CDG/MP3 pairing:** CDG and MP3 files with the same base filename are treated as a single karaoke song unit
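
The channel lookup and CDG/MP3 pairing described above might be sketched like this. Several details are assumptions: folder names are matched case-insensitively against whole path components (the real tool may use a substring search), "same base filename" is taken to mean the same folder and stem, and the names `channel_rank`, `group_cdg_mp3`, `ChannelA`, and `ChannelB` are placeholders.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional


def _as_posix(path: str) -> PurePosixPath:
    """Treat backslashes as separators so Windows-style paths also work."""
    return PurePosixPath(path.replace("\\", "/"))


def channel_rank(path: str, channel_priorities: List[str]) -> Optional[int]:
    """Return the index of the first configured folder found in the path.

    A lower index means a higher priority. None means no configured folder
    matched, so the song would be flagged for manual review.
    """
    components = {part.lower() for part in _as_posix(path).parts}
    for rank, folder in enumerate(channel_priorities):
        if folder.lower() in components:
            return rank
    return None


def group_cdg_mp3(paths: List[str]) -> Dict[str, List[str]]:
    """Group .cdg and .mp3 files by path without extension.

    A group holding both extensions represents one CDG/MP3 pair;
    a group with a single entry is a standalone file.
    """
    groups: Dict[str, List[str]] = {}
    for path in paths:
        posix_path = _as_posix(path)
        if posix_path.suffix.lower() in {".cdg", ".mp3"}:
            key = str(posix_path.with_suffix("")).lower()
            groups.setdefault(key, []).append(path)
    return groups
```

With ranks in hand, choosing which MP4 to keep for a song reduces to picking the candidate with the smallest non-`None` rank.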

---

## 4. Output & Reporting

### 4.1 Skip List

- **Format:** JSON (`/data/skipSongs.json`)
- List of file paths to skip in future imports
- Optionally: “reason” field (e.g., `{"path": "...", "reason": "duplicate"}`)
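
A minimal sketch of producing such a skip list is shown below. The function names are hypothetical, and keeping only the first entry per path is an assumed policy for avoiding repeated entries, not a documented one.

```python
import json
from typing import Dict, Iterable, List


def build_skip_list(entries: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first entry per path so no path appears twice."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for entry in entries:
        if entry["path"] not in seen:
            seen.add(entry["path"])
            unique.append(entry)
    return unique


def skip_list_to_json(skip_list: List[Dict[str, str]]) -> str:
    """Serialize the skip list; ensure_ascii=False keeps non-ASCII titles readable."""
    return json.dumps(skip_list, indent=2, ensure_ascii=False)
```

The resulting string can be written to the skip list file with `pathlib.Path.write_text(..., encoding="utf-8")`.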

### 4.2 CLI Reporting

- **Summary:** Total songs, duplicates found, types breakdown, etc.
- **Verbose per-song output:** Only for matches/duplicates (not every song)
- **Verbosity configurable:** (via CLI flag or config)
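
As a sketch of how the verbosity control could be wired up with `argparse`: the `--save-reports` flag is mentioned elsewhere in this document, while the exact spellings `--verbose` and `--dry-run` and the helper names are assumptions.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative argument parser for the CLI."""
    parser = argparse.ArgumentParser(description="Karaoke library cleanup (illustrative CLI).")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print per-song details for matches and duplicates.")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze only; do not write the skip list.")
    parser.add_argument("--save-reports", action="store_true",
                        help="Write detailed analysis reports.")
    return parser


def log_level_for(args: argparse.Namespace) -> int:
    """Map the verbosity flag to a logging level."""
    return logging.DEBUG if args.verbose else logging.INFO


def format_summary(total_songs: int, duplicates: int) -> str:
    """Render the one-line summary shown at the end of a run."""
    rate = (duplicates / total_songs * 100) if total_songs else 0.0
    return f"Total songs: {total_songs} | Duplicates: {duplicates} ({rate:.1f}%)"
```

Per-song lines would then be emitted at `DEBUG` level, so they only appear when `--verbose` is given, while the summary is logged at `INFO`.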

### 4.3 Manual Review (Web UI)

- **Interactive Web Interface**: Table/grid view for ambiguous/complex cases
- **Media Preview**: Ability to preview media before making a selection
- **Bulk Actions**: Select multiple items for batch operations
- **Real-time Filtering**: Search and filter capabilities
- **Responsive Design**: Works on desktop and mobile devices
- **Easy Startup**: Simple script (`start_web_ui.py`) with dependency checking

---

## 5. Features & Edge Cases

- **Batch Processing:**
  - E.g., "Auto-skip all but highest-priority channel for each song"
  - Manual review as CLI flag (future: always in web UI)
- **Edge Cases:**
  - Multiple versions (>2 formats)
  - Support for keeping multiple versions per song (configurable/manual)
- **Non-destructive:** Never deletes or moves files, only generates skip list and reports

---

## 6. Tech Stack & Organization

- **CLI Language:** Python
- **Config:** JSON (channel priorities, settings)
- **Current Folder Structure:**

```
KaraokeMerge/
├── data/
│   ├── allSongs.json                       # Input: Your song library data
│   ├── skipSongs.json                      # Output: Generated skip list
│   └── reports/                            # Detailed analysis reports
│       ├── analysis_data.json
│       ├── actionable_insights_report.txt
│       ├── channel_optimization_report.txt
│       ├── duplicate_pattern_report.txt
│       ├── enhanced_summary_report.txt
│       ├── skip_list_summary.txt
│       └── skip_songs_detailed.json
├── config/
│   └── config.json                         # Configuration settings
├── cli/
│   ├── main.py                             # Main CLI application
│   ├── matching.py                         # Song matching logic
│   ├── report.py                           # Report generation
│   └── utils.py                            # Utility functions
├── web/                                    # Web UI for manual review
│   ├── app.py                              # Flask web application
│   └── templates/
│       └── index.html                      # Web interface template
├── start_web_ui.py                         # Web UI startup script
├── test_tool.py                            # Validation and testing script
├── requirements.txt                        # Python dependencies
├── .gitignore                              # Git ignore rules
├── PRD.md                                  # Product Requirements Document
└── README.md                               # Project documentation
```

---

## 7. Web UI Implementation

### 7.1 Current Web UI Features
- **Interactive Table View**: Sortable, filterable grid of duplicate songs
- **Bulk Selection**: Select multiple items for batch operations
- **Search & Filter**: Real-time search across artists, titles, and paths
- **Responsive Design**: Mobile-friendly interface
- **Easy Startup**: Automated dependency checking and browser launch
- **Video Playback**: Direct MP4 video playback in modal popup for previewing karaoke videos
- **Drag-and-Drop Priority Management**: Interactive reordering of file priorities with persistent preferences

### 7.2 Web UI Architecture
- **Flask Backend**: Lightweight web server (`web/app.py`)
- **HTML Template**: Modern, responsive interface (`web/templates/index.html`)
- **Startup Script**: Dependency management and server startup (`start_web_ui.py`)

### 7.3 Future Web UI Enhancements
- Audio preview for MP3 files
- Real-time configuration editing
- Advanced filtering and sorting options
- Export capabilities for manual selections
- Batch video preview functionality
- Video thumbnail generation

---

## 8. Open Questions (for future refinement)

- Fuzzy matching library/thresholds?
- Best parsing rules for multi-artist/feat. strings?
- Any alternate export formats needed?
- Temporary/partial skip support for "under review" songs?

---

## 9. Implementation Status

### ✅ Completed Features
- [x] Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
- [x] Print CLI summary reports (with verbosity control)
- [x] Implement config file support for channel priority
- [x] Organize folder/file structure for easy expansion

### 🎯 Current Implementation
The tool has been successfully implemented with the following components:

**Core Modules:**
- `cli/main.py` - Main CLI application with argument parsing
- `cli/matching.py` - Song matching and deduplication logic
- `cli/report.py` - Report generation and output formatting
- `cli/utils.py` - Utility functions for file operations and data processing

**Configuration:**
- `config/config.json` - Configurable settings for channel priorities, matching rules, and output options

**Features Implemented:**
- Multi-format support (MP3, CDG, MP4)
- **CDG/MP3 Pairing Logic**: Files with same base filename treated as single karaoke song units
- Channel priority system for MP4 files (based on folder names in path)
- Fuzzy matching support with configurable threshold
- Multi-artist parsing with various delimiters
- **Enhanced Analysis & Reporting**: Comprehensive statistical analysis with actionable insights
- Channel priority analysis and manual review identification
- Non-destructive operation (skip lists only)
- Verbose and dry-run modes
- Detailed duplicate analysis
- Skip list generation with metadata
- **Pattern Analysis**: Skip list pattern analysis and channel optimization suggestions

**File Type Priority System:**
1. **MP4 files** (with channel priority sorting)
2. **CDG/MP3 pairs** (treated as single units)
3. **Standalone MP3** files
4. **Standalone CDG** files

**Performance Results:**
- Successfully processed 37,015 songs
- Identified 12,424 duplicates (33.6% duplicate rate)
- Generated comprehensive skip list with metadata (10,998 unique files after deduplication)
- Optimized for large datasets with progress indicators
- **Enhanced Analysis**: Generated 7 detailed reports with actionable insights
- **Bug Fix**: Resolved duplicate entries in skip list (removed 1,426 duplicate entries)
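
The reported figures are mutually consistent, as the short check below shows. It assumes that the 10,998 unique skip-list files equal the 12,424 identified duplicates minus the 1,426 removed entries; the document does not state that relationship explicitly, but the numbers line up.

```python
# Figures copied from the performance results above.
TOTAL_SONGS = 37_015
DUPLICATES_FOUND = 12_424
REMOVED_SKIP_ENTRIES = 1_426

# Duplicate rate as a percentage, rounded to one decimal place.
duplicate_rate = round(DUPLICATES_FOUND / TOTAL_SONGS * 100, 1)

# Assumed relationship: unique skip entries = duplicates found - removed entries.
unique_skip_entries = DUPLICATES_FOUND - REMOVED_SKIP_ENTRIES

print(duplicate_rate, unique_skip_entries)  # prints: 33.6 10998
```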

### 📋 Next Steps Checklist

#### ✅ **Completed**
- [x] Write initial CLI tool to parse allSongs.json, deduplicate, and output skipSongs.json
- [x] Print CLI summary reports (with verbosity control)
- [x] Implement config file support for channel priority
- [x] Organize folder/file structure for easy expansion
- [x] Implement CDG/MP3 pairing logic for accurate duplicate detection
- [x] Generate comprehensive skip list with metadata
- [x] Optimize performance for large datasets (37,000+ songs)
- [x] Add progress indicators and error handling
- [x] Generate detailed analysis reports (`--save-reports` functionality)
- [x] Create web UI for manual review of ambiguous cases
- [x] Add test tool for validation and debugging
- [x] Create startup script for web UI with dependency checking
- [x] Add comprehensive .gitignore file
- [x] Update documentation with required data file information
- [x] Implement drag-and-drop priority management with persistent preferences
- [x] Add MP4 video playback functionality in web UI modal

#### 🎯 **Next Priority Items**
- [ ] Analyze MP4 files without channel priorities to suggest new folder names
- [ ] Add support for additional file formats if needed
- [ ] Implement batch processing capabilities
- [ ] Create integration scripts for karaoke software
- [ ] Add unit tests for core functionality
- [ ] Implement audio fingerprinting for better duplicate detection