Podcast Audio Mixing Instructions - Layered/Professional Style

For AI-Assisted Podcast Production
Version: 2.0 (Layered Overlap) | March 2026

Goal

Professional podcast sound: intro music starts first, voice fades in over it, music stays subtle underneath spoken word at 30% volume, then outro music fades after voice ends.

Core Rules (MUST Follow)

Intro music: Use the first 8 seconds only (atrim=0:8).
Outro music: Use the last 8 seconds only (atrim=26:34 for a 34s music source).
Music volume: Fixed at 0.3 (30%).
No full-track loop: Do not loop the entire source track.
No heavy dynamics: Do not use sidechaincompress or ducking chains.
Layering is required: Music must overlap spoken audio (intro overlap + low bed under narration + outro tail).
Spoken narration remains primary and clear.
Use smooth fades/crossfades (1.5-3 seconds).
Output format: MP3, VBR via -q:a 2 (averages roughly 190 kbps).
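
The outro rule above hard-codes atrim=26:34 for a 34s source. For other music lengths, the window can be derived from the duration. The following is a minimal sketch, not part of the original script: the helper name outro_trim and its 8-second default are illustrative assumptions, and values are rounded to six significant digits by awk's %g.

```shell
#!/usr/bin/env bash
# outro_trim DURATION [TAIL]
# Hypothetical helper: prints an atrim expression covering the last
# TAIL seconds (default 8) of a source that is DURATION seconds long.
outro_trim() {
  local duration="$1" tail="${2:-8}"
  awk -v d="$duration" -v t="$tail" 'BEGIN {
    if (d < t) exit 1              # source is shorter than the requested tail
    printf "atrim=%g:%g\n", d - t, d
  }'
}

outro_trim 34        # prints atrim=26:34
outro_trim 41.5 8    # prints atrim=33.5:41.5
```

The function returns a non-zero status when the source is shorter than the tail, so a calling script can stop before rendering.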

Recommended FFmpeg Command (True Layered Mix)

This version does all of the following:

Intro starts alone, voice enters at 5s with a 3s fade-in.
Middle section loops the 8-26s slice of the music (not the full track) to form a continuous low bed.
Voice and music are layered with a lightweight amix only.
Outro is appended with a 2s crossfade over the end of the mix, then fades out over its final 2s.

#!/usr/bin/env bash
# mix_podcast_layered.sh
# Usage: ./mix_podcast_layered.sh spoken_narration.mp3 music_34s.mp3 final_podcast.mp3

set -euo pipefail

SPOKEN="${1:-}"
MUSIC="${2:-}"
OUTPUT="${3:-}"

if [ -z "$SPOKEN" ] || [ -z "$MUSIC" ] || [ -z "$OUTPUT" ]; then
  # Print usage to stderr so it never ends up in piped output.
  echo "Usage: $0 spoken.mp3 music.mp3 output.mp3" >&2
  exit 1
fi

ffmpeg -y \
  -i "$SPOKEN" \
  -i "$MUSIC" \
  -filter_complex "
    [0:a]aformat=sample_fmts=fltp:sample_rates=48000:channel_layouts=stereo,asetpts=PTS-STARTPTS[voice_raw];
    [voice_raw]adelay=5000|5000,afade=t=in:st=5:d=3[voice];

    [1:a]aformat=sample_fmts=fltp:sample_rates=48000:channel_layouts=stereo,asetpts=PTS-STARTPTS,asplit=3[m_intro][m_mid][m_outro];
    [m_intro]atrim=0:8,volume=0.3,afade=t=in:st=0:d=1.5[intro];
    [m_mid]atrim=8:26,asetpts=PTS-STARTPTS,volume=0.3[mid];
    [mid]aloop=loop=-1:size=2147483647,atrim=0:3600[mid_loop];

    [intro][mid_loop]concat=n=2:v=0:a=1[music_timeline];
    [music_timeline][voice]amix=inputs=2:duration=shortest:normalize=0[main];

    [m_outro]atrim=26:34,asetpts=PTS-STARTPTS,volume=0.3,afade=t=out:st=6:d=2[outro];
    [main][outro]acrossfade=d=2:c1=tri:c2=tri[mix]
  " \
  -map "[mix]" \
  -c:a libmp3lame -q:a 2 \
  "$OUTPUT"
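
As a sanity check on a render, the expected output length follows from the constants in the command above: the voice is delayed 5s, the mix ends when the voice ends, and the 8s outro overlaps the mix by the 2s crossfade. The sketch below encodes that arithmetic. The helper name expected_duration is hypothetical, and the estimate assumes narration shorter than about an hour, since the music bed is capped by atrim=0:3600.

```shell
#!/usr/bin/env bash
# expected_duration VOICE_SECONDS
# Hypothetical helper. Assumes the constants used in the layered command:
# 5 s voice delay, 8 s outro, 2 s crossfade between mix and outro.
expected_duration() {
  awk -v v="$1" 'BEGIN { printf "%.1f\n", 5 + v + 8 - 2 }'
}

expected_duration 120    # prints 131.0
```

Comparing this figure with the ffprobe duration of the output is a quick way to spot a truncated bed or a missing outro. MP3 encoder padding can add a few hundredths of a second.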

Simpler Alternative (Intro/Outro Layering Only)

Use this if you want easier debugging. It crossfades the intro into the speech and appends the outro, but it does not keep a music bed under the narration.

ffmpeg -y \
  -i "$SPOKEN" \
  -i "$MUSIC" \
  -filter_complex "
    [1:a]atrim=0:8,volume=0.3,afade=t=in:st=0:d=2[intro];
    [1:a]atrim=26:34,asetpts=PTS-STARTPTS,volume=0.3,afade=t=out:st=6:d=2[outro];
    [intro][0:a]acrossfade=d=3:curve1=exp:curve2=exp[voiced];
    [voiced][outro]acrossfade=d=2:curve1=tri:curve2=tri[mix]
  " \
  -map "[mix]" \
  -c:a libmp3lame -q:a 2 \
  "$OUTPUT"

File Preparation Checklist

Spoken narration: MP3/M4A/WAV, clean and normalized.
Music source: 34s+ source where 0:8 is intro material and 26:34 is outro material.
Validate durations first:

ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$SPOKEN"
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$MUSIC"

Test with a short spoken sample before full renders.
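
The duration check in this checklist can be automated so that a too-short music file fails before any rendering starts. This is a sketch under assumptions: the helper name music_long_enough and the 34-second default are illustrative and mirror the 26:34 outro window used in this document.

```shell
#!/usr/bin/env bash
# music_long_enough DURATION [MIN]
# Hypothetical helper: succeeds when DURATION (seconds, may be fractional)
# is at least MIN seconds (default 34, matching the 26:34 outro window).
music_long_enough() {
  local duration="$1" min="${2:-34}"
  awk -v d="$duration" -v m="$min" 'BEGIN { exit !(d + 0 >= m + 0) }'
}

# Example wiring with the ffprobe call from the checklist:
# dur=$(ffprobe -v error -show_entries format=duration \
#         -of default=noprint_wrappers=1:nokey=1 "$MUSIC")
# music_long_enough "$dur" || { echo "music shorter than 34 s" >&2; exit 1; }
```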

Troubleshooting

Music too loud: lower volume=0.3 to 0.25 or 0.2.
Voice starts too late/early: adjust adelay=5000|5000 (milliseconds) and keep the voice afade start time st=5 (seconds) in step with it.
Intro overlap too long/short: adjust afade and crossfade durations.
Outro too abrupt: increase afade=t=out duration or acrossfade=d.
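Want final loudness polish: append loudnorm as the last filter inside -filter_complex (for example, rename the final label to [premix] and add [premix]loudnorm[mix]). FFmpeg rejects -af on a stream that is fed by -filter_complex.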
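
When moving the voice entry point, the adelay value is in milliseconds while the afade start time is in seconds, so the two are easy to get out of sync. One way to keep them aligned is to generate both from a single number. The helper voice_chain below is a hypothetical sketch and not part of the original script.

```shell
#!/usr/bin/env bash
# voice_chain DELAY_SECONDS [FADE_SECONDS]
# Hypothetical helper: prints the voice filter chain with adelay (ms)
# and the afade start time (s) derived from the same delay value.
voice_chain() {
  local delay="$1" fade="${2:-3}"
  awk -v s="$delay" -v f="$fade" 'BEGIN {
    ms = int(s * 1000 + 0.5)       # seconds to whole milliseconds
    printf "adelay=%d|%d,afade=t=in:st=%g:d=%g\n", ms, ms, s, f
  }'
}

voice_chain 5        # prints adelay=5000|5000,afade=t=in:st=5:d=3
voice_chain 3.5 2    # prints adelay=3500|3500,afade=t=in:st=3.5:d=2
```

The printed chain can be substituted for the adelay/afade pair on the voice line of the filtergraph.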

Notes

This is your layered/pro baseline file for AI generation and scripting.
If you want true broadcast polish next, the next step is LUFS target normalization + limiter (still without sidechain ducking).
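
As a possible starting point for that next step, the sketch below builds a loudness-plus-limiter filter chain from FFmpeg's loudnorm and alimiter filters. The helper name polish_chain and the target values (-16 LUFS integrated, -1.5 dBTP, limiter ceiling 0.95) are assumptions chosen for illustration, not settings from this document. The trailing aresample=48000 is there because single-pass loudnorm may change the sample rate.

```shell
#!/usr/bin/env bash
# polish_chain [LUFS] [TRUE_PEAK_DB]
# Hypothetical helper: prints a filter chain for loudness normalization
# followed by a safety limiter. Defaults are assumed values; tune to taste.
polish_chain() {
  local lufs="${1:--16}" peak="${2:--1.5}"
  printf 'loudnorm=I=%s:TP=%s:LRA=11,alimiter=limit=0.95:level=false,aresample=48000\n' \
    "$lufs" "$peak"
}

polish_chain          # prints loudnorm=I=-16:TP=-1.5:LRA=11,alimiter=limit=0.95:level=false,aresample=48000
polish_chain -14 -1   # prints loudnorm=I=-14:TP=-1:LRA=11,alimiter=limit=0.95:level=false,aresample=48000
```

To use it with the layered command, the chain has to be the last stage inside -filter_complex: label the final acrossfade output [premix] and feed it through the printed chain into [mix].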
