Drop an audio file into a Siri Shortcut and get an email back with the LLM summary in the body and the full speaker-attributed transcript as a transcript.txt attachment. The webhook responds within ~50ms with 202 queued; transcription, summarization, and email delivery run in the background, and the result lands in your inbox 5-40 min later, depending on audio length.
Sister surface to the ops-bot transcribe cog — same transcribe_audio workflow on the backend, different ingress channel.
Quick Reference
| Item | Value |
|---|---|
| n8n workflow name | Transcribe Audio (Siri → Email) |
| Webhook URL | https://n8n.mdbook.me/webhook/transcribe |
| Method | POST (multipart) |
| Auth | Header X-API-Key: <shared secret> (n8n Header Auth credential) |
| Backend | https://agents.bots.mdbook.one/transcribe_audio/invoke (langserve) |
| ASR | whisperx-api (faster-whisper large-v3-turbo + wav2vec2 align + pyannote-3.1) |
| Summarizer | vLLM Qwen via agents_runtime.vllm_chat (shared with ops-bot cog) |
| Importable JSON | /docker/cloud/n8n-backups/transcribe-webhook.n8n.json |
| Status | INACTIVE until credentials wired + Active toggled |
Architecture
Split-response pattern. Siri does not wait for the transcription to finish.
Siri Shortcut
│ POST multipart: file=<audio>, to=<optional recipient>
│ Header: X-API-Key: <key>
▼
n8n Webhook (responseMode=responseNode)
↓
Respond 202 (Siri gets {"status":"queued"} in ~50ms and exits)
↓
Encode + extract recipient (binary → base64; pull `to` from body, default mdbooktech@gmail.com)
↓
Call transcribe_audio (POST JSON `audio_b64` to agents.bots.mdbook.one, 3h timeout)
↓ ├─ fetches/decodes bytes
↓ ├─ whisperx-api → segments + speakers
↓ └─ vllm_chat → summary
↓
Format email + transcript (summary → HTML body, segments → transcript.txt buffer)
↓
Send email (Gmail OAuth2, summary in body + transcript.txt attached)
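The split-response shape in the diagram can be sketched outside n8n as an acknowledge-then-process queue. This is only a Python analogy, not the workflow's implementation: `handle_upload`, `worker`, and the in-memory queue are hypothetical names, and the worker stands in for the transcribe, summarize, and email steps.

```python
import queue
import threading

DEFAULT_RECIPIENT = "mdbooktech@gmail.com"  # default from the diagram above

jobs = queue.Queue()
results = []  # stand-in for "email sent" side effects


def worker():
    """Drain the queue in the background (placeholder for the slow steps)."""
    while True:
        job = jobs.get()
        try:
            results.append(
                f"emailed transcript of {job['filename']} to {job['to']}"
            )
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


def handle_upload(filename, to=None):
    """Queue the job and answer immediately, like the Respond 202 node."""
    jobs.put({"filename": filename, "to": to or DEFAULT_RECIPIENT})
    return 202, {"status": "queued"}


status, body = handle_upload("memo.m4a")
jobs.join()  # only for the demo; the caller never waits in the real flow
```

The point of the shape is that the caller's latency depends only on the enqueue, never on the length of the audio.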
Why audio_b64 (and not a URL)
The existing transcribe_audio workflow originally accepted only audio_url, which the runtime fetched itself. That works for the ops-bot cog because Discord’s CDN hosts attachments at a public URL the runtime can pull. A Siri Shortcut doesn’t give us a free CDN: the phone has the audio bytes, and there’s no first-class way to host them publicly for the brief moment we need.
The workflow was extended on 2026-06-02 to also accept audio_b64 (base64-encoded inline bytes) as a mutually-exclusive alternative to audio_url. n8n base64-encodes the Siri upload in a Code node and posts it as JSON. Same downstream path otherwise. See mikayla/agents:transcribe_audio.py.
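A minimal sketch of the mutually-exclusive input check described above, assuming the payload arrives as a plain dict. `resolve_audio_source` is a hypothetical helper; the actual validation in transcribe_audio.py may look different.

```python
import base64
import binascii
from typing import Tuple, Union


def resolve_audio_source(payload: dict) -> Tuple[str, Union[str, bytes]]:
    """Return ("url", str) or ("bytes", bytes) for a request payload.

    Hypothetical helper, not the real transcribe_audio.py code.
    """
    url = payload.get("audio_url")
    b64 = payload.get("audio_b64")
    # Exactly one of the two fields must be provided.
    if bool(url) == bool(b64):
        raise ValueError("provide exactly one of audio_url or audio_b64")
    if url:
        return "url", url
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return "bytes", base64.b64decode(b64, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"audio_b64 is not valid base64: {exc}") from exc
```

Rejecting the "both" and "neither" cases up front keeps the downstream path identical for either ingress channel.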
Siri Shortcut Spec
Build the shortcut in iOS Shortcuts.app:
- Get File (or Select Files / Microphone Recording / Files action) — pull in the audio you want transcribed. Voice memos exported as m4a work. Anything ffmpeg can demux is fine (wav/mp3/m4a/aac/flac/ogg/opus/webm/aif/aiff).
- (Optional) Ask for Input → “Send transcript to (blank for default)” → text. Use this if you want to override the recipient ad-hoc.
- Get Contents of URL:
  - URL: `https://n8n.mdbook.me/webhook/transcribe`
  - Method: `POST`
  - Headers: `X-API-Key: <paste the shared secret>`
  - Request Body: `Form`
    - Field name `file`, type `File`, value = the audio from step 1
    - Field name `to`, type `Text`, value = the input from step 2 (only if you added that step; otherwise omit and the default applies)
- Show Notification (optional): `Queued: transcript will arrive by email`. Confirms the 202 came back.
The webhook accepts any single binary field in the multipart — the field name isn’t load-bearing (n8n picks up the first binary attachment). file is just convention.
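For testing the webhook without a phone, the request the Shortcut sends can be approximated from Python. This is a sketch using only the standard library; `build_transcribe_request` is a hypothetical helper, and the hand-rolled multipart encoder is deliberately minimal (it assumes a filename without quotes or newlines).

```python
import urllib.request
import uuid


def build_transcribe_request(url, api_key, filename, audio, to=None):
    """Build (but do not send) a multipart POST like the Shortcut's."""
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    body = b""
    if to:
        # Optional text field overriding the default recipient.
        body += f"--{boundary}".encode() + crlf
        body += b'Content-Disposition: form-data; name="to"' + crlf + crlf
        body += to.encode() + crlf
    # The single binary field; "file" is convention, not a requirement.
    body += f"--{boundary}".encode() + crlf
    body += (
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'
    ).encode() + crlf
    body += b"Content-Type: application/octet-stream" + crlf + crlf
    body += audio + crlf
    body += f"--{boundary}--".encode() + crlf
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# To actually send it: urllib.request.urlopen(req). The doc above says the
# webhook answers 202 with {"status": "queued"}.
```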
Credentials Needed
- n8n Header Auth credential (“Transcribe Webhook Key”): header name `X-API-Key`, value = a shared secret you generate (`openssl rand -base64 32`). This is what gates the webhook against the internet. Wire it into the Webhook node’s `httpHeaderAuth` slot.
- n8n Header Auth credential (“Langserve X-API-Key”): header name `X-API-Key`, value = the langserve API key. Already in n8n if any other workflow calls langserve; reuse.
- Gmail OAuth2 credential: existing `mdbooktech@gmail.com` cred. Reused as-is.
After importing the JSON, replace the three REPLACE_WITH_*_CREDENTIAL_ID placeholders in the node config by re-binding each credential dropdown in the n8n UI.
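If openssl isn't at hand, a comparable shared secret can be generated in Python. This is a small sketch, with `generate_webhook_secret` as a hypothetical name; it mirrors the shape of `openssl rand -base64 32` (32 random bytes, base64-encoded).

```python
import base64
import secrets


def generate_webhook_secret(num_bytes: int = 32) -> str:
    """Base64-encode num_bytes of OS-level randomness."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


secret = generate_webhook_secret()  # 44 characters for the default 32 bytes
```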
Output Shape
The email is multipart HTML + one attachment:
- Subject: `Transcript: <original filename> (<N> min, <K> speakers)`
- Body (HTML): the LLM summary (3-7 bullet points, Discord-flavored markdown rendered as plain text in a `<pre>`), followed by a stat line.
- Attachment: `transcript.txt`, one line per segment, formatted `[mm:ss] spk_N: text`.
Speaker labels are anonymous (spk_0, spk_1, … — no hard cap with pyannote-3.1 clustering). The summary prompt knows the labels are anonymous; don’t try to resolve names.
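The subject and attachment formats above can be sketched as follows. In the workflow this formatting happens in an n8n node, so this Python version only illustrates the output shape. The segment keys (`start`, `speaker`, `text`), the rounding of minutes, and letting the minutes field run past 59 for long recordings are all assumptions.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss (minutes keep counting past 59; assumption)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_transcript(segments) -> str:
    """One line per segment: [mm:ss] spk_N: text."""
    return "\n".join(
        f"[{format_timestamp(s['start'])}] {s['speaker']}: {s['text'].strip()}"
        for s in segments
    )


def format_subject(filename: str, duration_seconds: float, segments) -> str:
    """Transcript: <original filename> (<N> min, <K> speakers)."""
    minutes = round(duration_seconds / 60)
    speakers = len({s["speaker"] for s in segments})
    return f"Transcript: {filename} ({minutes} min, {speakers} speakers)"


segments = [
    {"start": 0.0, "speaker": "spk_0", "text": " Hello. "},
    {"start": 65.4, "speaker": "spk_1", "text": "Hi."},
]
transcript_txt = format_transcript(segments)
subject = format_subject("memo.m4a", 125, segments)
```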
Gotchas
- No speaker cap, but speakers are anonymous. pyannote-3.1 surfaces as many distinct voices as it finds; labels stay anonymous (`spk_0`, `spk_1`, …) and are stable within a single transcription but not across calls.
- Cold-start tax on the first call. whisperx-api lazy-loads ~3.5 GB of weights on the first `/transcribe` after a container restart (~30-60s extra wall-time). Subsequent calls are warm. `/ready` forces the load, which is useful as a cron warmup if you want the first real call of the day to be fast.
- 3h timeout in the HTTP Request node. Matches `transcribe_audio.TRANSCRIBE_TIMEOUT` and the ops-bot cog’s matching value. CPU WhisperX on a 1h meeting runs ~15-30 min wall-time; 3h covers ~4-5h of audio with headroom. Changing this means changing the workflow + the cog + this doc in lockstep.
- Webhook responseMode=responseNode is load-bearing. The Respond 202 node has to fire BEFORE the long-running HTTP Request, otherwise the webhook holds Siri’s connection open for 30+ min and the phone times out. Don’t reorder.
- Binary field name isn’t constrained. The Code node grabs the first key in `$input.item.binary`, so whatever name Siri (or curl, or anything else) uses for the multipart field, it works. Documented as `file` in the shortcut spec for convenience.
- Base64 inflates payload by ~33%. A 100 MB audio file becomes a 133 MB JSON body posted from n8n to langserve. Fine for typical meeting lengths (a 1h m4a is ~30-50 MB → ~40-66 MB encoded), but if you’re transcribing multi-hour high-bitrate WAVs, expect memory pressure on the n8n container during encoding.
- `to` parsing is best-effort. The Code node pulls `to` from `$input.item.json.body.to`. n8n’s webhook parses multipart form fields into `body`; if Siri sends it under a different shape (e.g. JSON instead of form), the default kicks in.
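The arithmetic behind the base64 and timeout gotchas can be checked with a few lines. Both helpers are hypothetical, written only to make the numbers reproducible; the 15-30 min per audio hour figures are the ones quoted above.

```python
import base64


def base64_size(raw_bytes: int) -> int:
    """Exact length of padded base64 output: 4 chars per 3 input bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_audio_hours(timeout_hours: float, minutes_per_audio_hour: float) -> float:
    """Hours of audio that finish inside the timeout at a given speed."""
    return timeout_hours * 60 / minutes_per_audio_hour


encoded_100mb = base64_size(100_000_000)  # about 133 MB, i.e. ~33% larger
worst_case = max_audio_hours(3, 30)       # slowest quoted speed
best_case = max_audio_hours(3, 15)        # fastest quoted speed
```

At the slowest quoted speed the 3h timeout fits 6h of audio, which is why 4-5h is described as covered with headroom.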
Disabling
n8n UI → workflow Transcribe Audio (Siri → Email) → toggle Active off. Or delete/rotate the Header Auth cred to break it at the auth layer without touching the workflow itself.
Related
- Backend: whisperx-api (ASR + diarization HTTP service)
- Backend: `mikayla/agents:transcribe_audio.py` (langserve workflow that wraps whisperx-api with summarization)
- Sister surface: ops-bot transcribe cog (Discord channel monitor → same transcribe_audio)
- Pattern: Todo Capture (Siri → Notion) — same webhook + split-response shape, different downstream
Changelog
- 2026-06-02: Initial design. Extended `transcribe_audio` to accept `audio_b64` alongside `audio_url`; n8n workflow JSON staged at `/docker/cloud/n8n-backups/transcribe-webhook.n8n.json`.