Transcribe Audio (Siri → Email)

Drop an audio file into a Siri Shortcut and get an email back with the LLM summary in the body and the full speaker-attributed transcript attached as transcript.txt. The webhook responds with 202 queued within ~50ms; transcription, summarization, and email delivery run in the background, and the result lands in your inbox 5-40 min later depending on audio length.

Sister surface to the ops-bot transcribe cog — same transcribe_audio workflow on the backend, different ingress channel.

Quick Reference

Item	Value
n8n workflow name	Transcribe Audio (Siri → Email)
Webhook URL	`https://n8n.mdbook.me/webhook/transcribe`
Method	POST (multipart)
Auth	Header `X-API-Key: <shared secret>` (n8n Header Auth credential)
Backend	`https://agents.bots.mdbook.one/transcribe_audio/invoke` (langserve)
ASR	whisperx-api (faster-whisper large-v3-turbo + wav2vec2 align + pyannote-3.1)
Summarizer	vLLM Qwen via `agents_runtime.vllm_chat` (shared with ops-bot cog)
Importable JSON	`/docker/cloud/n8n-backups/transcribe-webhook.n8n.json`
Status	INACTIVE until credentials wired + Active toggled

Architecture

Split-response pattern. Siri does not wait for the transcription.

Siri Shortcut
  │  POST multipart: file=<audio>, to=<optional recipient>
  │  Header: X-API-Key: <key>
  ▼
n8n Webhook (responseMode=responseNode)
  ↓
Respond 202                  (Siri gets {"status":"queued"} in ~50ms and exits)
  ↓
Encode + extract recipient   (binary → base64; pull `to` from body, default mdbooktech@gmail.com)
  ↓
Call transcribe_audio        (POST JSON `audio_b64` to agents.bots.mdbook.one, 3h timeout)
  ↓                             ├─ fetches/decodes bytes
  ↓                             ├─ whisperx-api → segments + speakers
  ↓                             └─ vllm_chat → summary
  ↓
Format email + transcript    (summary → HTML body, segments → transcript.txt buffer)
  ↓
Send email                   (Gmail OAuth2, summary in body + transcript.txt attached)
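
The ordering in the diagram (acknowledge first, do the slow work afterwards) is the whole point of the pattern. As a rough illustration outside n8n, and not a description of how n8n implements it, the same shape can be sketched in plain Python. `handle_upload` and `slow_pipeline` are hypothetical names used only for this example.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# A single worker is enough for the illustration.
_executor = ThreadPoolExecutor(max_workers=1)


def handle_upload(audio: bytes, pipeline) -> tuple[int, dict, Future]:
    """Acknowledge first, then let the slow work run in the background."""
    job = _executor.submit(pipeline, audio)  # stands in for transcribe + summarize + email
    return 202, {"status": "queued"}, job    # returned without waiting for the job


# Demo: the pipeline blocks until released, yet the 202 is already in hand.
release = threading.Event()


def slow_pipeline(audio: bytes) -> str:
    release.wait(timeout=5)  # stand-in for the long transcribe call
    return f"emailed transcript for {len(audio)} bytes"


status, body, job = handle_upload(b"\x00" * 16, slow_pipeline)
still_running = not job.done()  # the response came back before the work finished
release.set()                   # let the background job complete
result = job.result(timeout=5)
```

If the acknowledgement were returned only after `job.result()`, the caller would be held for the full duration of the pipeline, which is the failure mode the Respond 202 node exists to avoid.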

Why audio_b64 (and not a URL)

The existing transcribe_audio workflow originally accepted only audio_url, which the runtime fetched itself. That works for the ops-bot cog because Discord’s CDN hosts attachments at a public URL the runtime can pull. A Siri Shortcut doesn’t give us a free CDN: the phone has the audio bytes, and there’s no first-class way to host them publicly for the brief moment we need.

The workflow was extended on 2026-06-02 to also accept audio_b64 (base64-encoded inline bytes) as a mutually-exclusive alternative to audio_url. n8n base64-encodes the Siri upload in a Code node and posts it as JSON. Same downstream path otherwise. See mikayla/agents:transcribe_audio.py.
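
A minimal Python sketch of what the encode step and the mutually-exclusive check amount to. The real encoding happens in an n8n Code node and the real validation lives in transcribe_audio.py; neither is reproduced here. The function names, the returned dict shape, and the error messages are assumptions made for illustration, and the actual langserve request envelope may differ.

```python
import base64

DEFAULT_RECIPIENT = "mdbooktech@gmail.com"  # default named in the architecture diagram


def build_request(binary: dict[str, bytes], form: dict[str, str]) -> dict:
    """Mimic 'Encode + extract recipient': first binary field -> base64, `to` -> recipient."""
    if not binary:
        raise ValueError("no binary field in upload")
    field_name = next(iter(binary))  # first key wins; the field name is not load-bearing
    audio_b64 = base64.b64encode(binary[field_name]).decode("ascii")
    recipient = (form.get("to") or "").strip() or DEFAULT_RECIPIENT
    # Hypothetical shape: the payload keys follow the doc, the wrapper dict does not.
    return {"payload": {"audio_b64": audio_b64}, "recipient": recipient}


def validate_source(payload: dict) -> str:
    """Require exactly one of audio_url / audio_b64 and report which one was given."""
    present = [key for key in ("audio_url", "audio_b64") if payload.get(key)]
    if len(present) != 1:
        raise ValueError("provide exactly one of audio_url or audio_b64")
    return present[0]


request = build_request({"file": b"abc"}, {})
source = validate_source(request["payload"])
```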

Siri Shortcut Spec

Build the shortcut in iOS Shortcuts.app:

1. Get File (or Select Files / Microphone Recording / Files action): pull in the audio you want transcribed. Voice memos exported as m4a work. Anything ffmpeg can demux is fine (wav/mp3/m4a/aac/flac/ogg/opus/webm/aif/aiff).
2. (Optional) Ask for Input → “Send transcript to (blank for default)” → text. Use this if you want to override the recipient ad hoc.
3. Get Contents of URL:
   - URL: https://n8n.mdbook.me/webhook/transcribe
   - Method: POST
   - Headers: X-API-Key: <paste the shared secret>
   - Request Body: Form
     - Field name file, type File, value = the audio from step 1
     - Field name to, type Text, value = the input from step 2 (only if you added that step; otherwise omit and the default applies)
4. Show Notification (optional): “Queued: transcript will arrive by email.” Confirms the 202 came back.

The webhook accepts any single binary field in the multipart — the field name isn’t load-bearing (n8n picks up the first binary attachment). file is just convention.
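
For testing the webhook without a phone, the request the shortcut sends can be approximated from Python using only the standard library. This is a sketch under assumptions: `build_multipart` and `make_webhook_request` are hypothetical helpers, the part content type is a generic placeholder, and the filename is not escaped, so keep it free of double quotes.

```python
import uuid
import urllib.request
from typing import Optional

WEBHOOK_URL = "https://n8n.mdbook.me/webhook/transcribe"


def build_multipart(filename: str, audio: bytes, to: Optional[str] = None) -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    if to:  # optional recipient override, sent as a plain text field
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="to"\r\n\r\n{to}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def make_webhook_request(api_key: str, filename: str, audio: bytes,
                         to: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST the shortcut would make."""
    body, content_type = build_multipart(filename, audio, to)
    return urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": content_type},
    )


req = make_webhook_request("example-key", "memo.m4a", b"AUDIO", to="someone@example.com")
# To actually send it: urllib.request.urlopen(req, timeout=30)
```

A successful send should come back with the 202 and the queued status described above; the email arrives later.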

Credentials Needed

n8n Header Auth credential (“Transcribe Webhook Key”) — header name X-API-Key, value = a shared secret you generate (openssl rand -base64 32). This is what gates the webhook against the internet. Wire it into the Webhook node’s httpHeaderAuth slot.
n8n Header Auth credential (“Langserve X-API-Key”) — header name X-API-Key, value = the langserve API key. Already in n8n if any other workflow calls langserve; reuse.
Gmail OAuth2 credential — existing mdbooktech@gmail.com cred. Reused as-is.

After importing the JSON, replace the three REPLACE_WITH_*_CREDENTIAL_ID placeholders in the node config by re-binding each credential dropdown in the n8n UI.
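
If openssl isn't to hand, an equivalent shared secret can be produced in Python. The sketch below also shows a constant-time comparison purely to illustrate what a header check does; n8n's Header Auth credential performs its own check, and `key_matches` is a hypothetical helper, not part of n8n.

```python
import base64
import hmac
import secrets


def generate_shared_secret(n_bytes: int = 32) -> str:
    """Same shape as `openssl rand -base64 32`: random bytes, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(n_bytes)).decode("ascii")


def key_matches(presented: str, expected: str) -> bool:
    """Compare two keys without leaking the mismatch position through timing."""
    return hmac.compare_digest(presented.encode(), expected.encode())


secret = generate_shared_secret()  # 32 random bytes encode to 44 base64 characters
```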

Output Shape

The email is multipart HTML + one attachment:

Subject: Transcript: <original filename> (<N> min, <K> speakers)
Body (HTML): the LLM summary (3-7 bullet points, Discord-flavored markdown rendered as plain text in a <pre>), followed by a stat line.
Attachment: transcript.txt — one line per segment, formatted [mm:ss] spk_N: text.

Speaker labels are anonymous (spk_0, spk_1, … — no hard cap with pyannote-3.1 clustering). The summary prompt knows the labels are anonymous; don’t try to resolve names.
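
The subject and attachment formats above can be sketched as follows. The segment shape (`start` and `end` in seconds, `speaker`, `text`) is an assumption about what whisperx-api returns, and the helper names are hypothetical; the actual formatting lives in the n8n "Format email + transcript" node.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss; minutes keep counting past 59 for long recordings."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_transcript(segments: list[dict]) -> str:
    """One line per segment: [mm:ss] spk_N: text."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])}] {seg['speaker']}: {seg['text'].strip()}"
        for seg in segments
    )


def format_subject(filename: str, segments: list[dict]) -> str:
    """Transcript: <original filename> (<N> min, <K> speakers)."""
    minutes = round(max(seg["end"] for seg in segments) / 60) if segments else 0
    speakers = len({seg["speaker"] for seg in segments})
    return f"Transcript: {filename} ({minutes} min, {speakers} speakers)"


segments = [
    {"start": 0.0, "end": 4.2, "speaker": "spk_0", "text": " Hello there. "},
    {"start": 65.4, "end": 120.0, "speaker": "spk_1", "text": "Hi."},
]
transcript = format_transcript(segments)
subject = format_subject("memo.m4a", segments)
```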

Gotchas

No speaker cap, but speakers are anonymous. pyannote-3.1 surfaces as many distinct voices as it finds; labels stay anonymous (spk_0, spk_1, …) and are stable within a single transcription but not across calls.
Cold-start tax on the first call. whisperx-api lazy-loads ~3.5 GB of weights on first /transcribe after container restart (~30-60s extra wall-time). Subsequent calls are warm. /ready forces the load — useful as a cron warmup if you want the first real call of the day to be fast.
3h timeout in the HTTP Request node. Matches transcribe_audio.TRANSCRIBE_TIMEOUT and the ops-bot cog’s matching value. CPU WhisperX on a 1h meeting runs ~15-30 min wall-time; 3h covers ~4-5h of audio with headroom. Changing this means changing the workflow + the cog + this doc in lockstep.
Webhook responseMode=responseNode is load-bearing. The Respond 202 node has to fire BEFORE the long-running HTTP Request, otherwise the webhook holds Siri’s connection open for 30+ min and the phone times out. Don’t reorder.
Binary field name isn’t constrained. The Code node grabs the first key in $input.item.binary — whatever name Siri (or curl, or anything else) uses for the multipart field, it works. Documented as file in the shortcut spec for convenience.
Base64 inflates payload by ~33%. A 100 MB audio file becomes a 133 MB JSON body posted from n8n to langserve. Fine for typical meeting lengths (a 1h m4a is ~30-50 MB → ~40-67 MB encoded), but if you’re transcribing multi-hour high-bitrate WAVs, expect memory pressure on the n8n container during encoding.
to parsing is best-effort. The Code node pulls to from $input.item.json.body.to. n8n’s webhook parses multipart form fields into body — if Siri sends it under a different shape (e.g. JSON instead of form), the default kicks in.
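
Two of the numeric claims in the list above (base64 inflation and how much audio fits inside the 3h timeout) are easy to sanity-check. The sketch below uses only the figures quoted in this page; the function names are hypothetical, and the throughput estimate ignores cold-start, summarization, and email time.

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Exact length of standard padded base64 for n_bytes of input (4 chars per 3 bytes)."""
    return 4 * ((n_bytes + 2) // 3)


def audio_hours_within_timeout(timeout_hours: float, wall_min_per_audio_hour: float) -> float:
    """Hours of audio that finish inside the timeout at a given transcription speed."""
    return timeout_hours * 60 / wall_min_per_audio_hour


encoded_100mb = base64_size(100_000_000)          # about 133 MB, matching the ~33% figure
slow_case = audio_hours_within_timeout(3, 30)     # 30 min of wall time per hour of audio
fast_case = audio_hours_within_timeout(3, 15)     # 15 min of wall time per hour of audio
matches_stdlib = len(base64.b64encode(b"x" * 1000)) == base64_size(1000)
```

At the slower quoted speed the timeout allows roughly 6h of audio, so treating ~4-5h as the practical ceiling leaves the headroom the gotcha mentions.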

Disabling

n8n UI → workflow Transcribe Audio (Siri → Email) → toggle Active off. Or delete/rotate the Header Auth cred to break it at the auth layer without touching the workflow itself.

Related

Backend: whisperx-api (ASR + diarization HTTP service)
Backend: mikayla/agents:transcribe_audio.py (langserve workflow that wraps whisperx-api with summarization)
Sister surface: ops-bot transcribe cog (Discord channel monitor → same transcribe_audio)
Pattern: Todo Capture (Siri → Notion) — same webhook + split-response shape, different downstream

Changelog

2026-06-02 — Initial design. Extended transcribe_audio to accept audio_b64 alongside audio_url; n8n workflow JSON staged at /docker/cloud/n8n-backups/transcribe-webhook.n8n.json.
