Speech-to-Text and Live Subtitles — SpeechToTextBlock¶

Q: Which languages does it support?

Whisper is multilingual — set Language to an ISO 639-1 code, or leave it "auto" to let Whisper detect the spoken language automatically.

Q: Can I translate speech to English captions instead of transcribing the source language?

Yes — set SpeechToTextSettings.Task = SpeechToTextTask.Translate.

Q: How do I get live on-screen subtitles instead of just an event?

Wire SubtitleRenderer.OnSpeechRecognized as the block's event handler against an OverlayManagerBlock — see SubtitleRenderer — on-screen captions. If you only need SRT/VTT files, set OutputSrtPath/OutputVttPath instead.

Q: Which Whisper model size should I use?

Start with Base (the default) for real-time CPU transcription. Move to Small/Medium/LargeV3 (or their quantized variants) for higher accuracy if you have a GPU or can tolerate slower-than-real-time processing; see the Whisper model sizes table for the full trade-off.

SpeechToTextBlock is an audio-only Media Block from VisioForge.DotNet.Core.AI.Whisper. It taps the audio stream, segments speech with Silero VAD, transcribes it with Whisper (Whisper.net / GGML), and raises OnSpeechRecognized. Audio passes through unchanged. The block implements IAudioProcessingBlock, so it can be inserted into a manual pipeline or registered directly on VideoCaptureCoreX/MediaPlayerCoreX.

using VisioForge.Core.MediaBlocks.AI;
using VisioForge.Core.Types.X.AI;

Basic block setup¶

var settings = new SpeechToTextSettings(whisperModelPath)
{
    Language = "auto",
    Task = SpeechToTextTask.Transcribe,
    EnableVad = true,
    EmitInterim = false,
};

settings.Vad.ModelPath = sileroVadModelPath;

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += (sender, e) =>
{
    foreach (var segment in e.Segments)
    {
        Console.WriteLine($"{segment.StartTime:c} - {segment.EndTime:c}: {segment.Text}");
    }
};

SpeechToTextTask.Transcribe keeps the source language. SpeechToTextTask.Translate translates supported source speech to English text.

Key settings¶

SpeechToTextSettings(whisperModelPath). Unlike the vision AI settings, this type does not derive from OnnxInferenceSettings — Whisper runs through Whisper.net (whisper.cpp / GGML), not ONNX Runtime, so the ONNX-specific input-size/normalization knobs don't apply.

Property	Default	Description
`WhisperModelPath`	—	Absolute path to the Whisper GGML model file (`ggml-*.bin`). Required.
`ModelSize`	`WhisperModelSize.Base`	Informational label for the model variant at `WhisperModelPath` (see below).
`Language`	`"auto"`	ISO 639-1 code (`"en"`, `"es"`, `"fr"`, ...), or `"auto"` to let Whisper detect it.
`Task`	`Transcribe`	`Transcribe` (source language) or `Translate` (to English).
`Provider`	`Auto`	Only `CPU` and `CUDA` are meaningful for the GGML backend (no DirectML); `Auto` picks CUDA when present, else CPU.
`DeviceId`	`0`	Hardware device id when a GPU provider is selected.
`Threads`	`0`	CPU threads Whisper uses. `0` lets Whisper.net choose from the available processor count.
`EnableVad`	`true`	Segment speech with Silero VAD before transcription. When `false`, audio is transcribed in fixed windows, which is prone to hallucinating text during silence.
`Vad`	new `SileroVadSettings`	VAD settings used when `EnableVad` is `true`.
`FixedWindowSeconds`	`5`	Fixed transcription window length when `EnableVad` is `false`. Clamped to 1–30 s.
`EmitInterim`	`false`	Reserved for a future interim-hypothesis capability; currently has no effect — only final segments are emitted.
`OutputSrtPath`	`null`	Optional `.srt` side-car path the block writes as final segments are recognized.
`OutputVttPath`	`null`	Optional `.vtt` (WebVTT) side-car path the block writes as final segments are recognized.

VAD settings¶

When EnableVad is true, SpeechToTextSettings.Vad controls Silero speech segmentation. Silero VAD is a tiny (~2 MB, MIT) ONNX model that classifies short audio windows as speech or non-speech, used as a real-time pre-filter so the (much heavier) Whisper model only runs on actual speech.

settings.Vad = new SileroVadSettings
{
    ModelPath = sileroVadModelPath,
    SpeechThreshold = 0.5f,
    MinSilenceMs = 100,
    MinSpeechMs = 250,
    SpeechPadMs = 30,
    MaxSpeechMs = 15000,
    Provider = OnnxExecutionProvider.CPU,
};

Property	Default	Description
`ModelPath`	—	Absolute path to `silero_vad.onnx`.
`SpeechThreshold`	`0.5`	Speech-probability threshold (0..1). Raise it in noisy environments to reduce false triggers.
`MinSilenceMs`	`100`	Minimum trailing silence, in ms, that ends a speech segment.
`MinSpeechMs`	`250`	Minimum speech-run duration, in ms, to be emitted (discards spurious blips).
`SpeechPadMs`	`30`	Onset padding, in ms, prepended to each detected segment.
`MaxSpeechMs`	`15000`	Maximum segment length, in ms, before the segmenter force-cuts an ongoing run.
`Provider`	`CPU`	Execution provider for the VAD session — the model is tiny (~1 ms/window on CPU), so GPU adds latency without benefit.
`DeviceId`	`0`	Hardware device id when a GPU provider is selected.

The Whisper GGML weights and the Silero VAD model are downloaded at runtime; neither is shipped in the SDK NuGet packages.

Whisper model sizes¶

WhisperModelSize is informational — it names well-known Whisper GGML weight files so an application can pick one to download. The file actually loaded is always SpeechToTextSettings.WhisperModelPath.

Value	Approx. size	Notes
`Tiny`	~75 MB	Fastest, lowest accuracy. Good for real-time CPU transcription.
`Base` (default)	~142 MB	A good real-time CPU default.
`Small`	~466 MB	Noticeably more accurate; real-time with a GPU or a fast CPU.
`Medium`	~1.5 GB	High accuracy; typically needs a GPU for real-time.
`LargeV3`	~3 GB	Highest accuracy; GPU strongly recommended.
`LargeV3Turbo`	~1.6 GB	Near-large accuracy at a fraction of the cost.
`TinyQuantized`	~31 MB	`Tiny`, Q5_1-quantized.
`BaseQuantized`	~57 MB	`Base`, Q5_1-quantized.
`SmallQuantized`	~181 MB	`Small`, Q5_1-quantized.
`MediumQuantized`	~514 MB	`Medium`, Q5_0-quantized.
`LargeV3TurboQuantized`	~547 MB	`LargeV3Turbo`, Q5_0-quantized. Recommended quantized accuracy/speed balance.

English-only (*.en) variants aren't enumerated; supply their path directly. Weights are MIT-licensed.

Recognized segments¶

Each SpeechSegment in SpeechRecognizedEventArgs.Segments:

Property	Description
`Text`	The recognized text for the segment.
`StartTime` / `EndTime`	`TimeSpan`, relative to the start of the stream — ready to use for SRT/VTT or on-screen scheduling.
`IsFinal`	Reserved to distinguish interim vs. final hypotheses once interim results exist; segments are currently always final (`true`).
`Language`	Detected/used ISO 639-1 code, or `null` if unknown.
`Confidence`	Average token confidence (0..1), or `0` when the model doesn't report token probabilities.

Subtitle helpers¶

SubtitleWriter — SRT/VTT files¶

SubtitleWriter writes recognized speech segments to a SubRip (.srt) or WebVTT (.vtt) side-car file, appending one cue per final segment. It is thread-safe; interim (non-final) segments are ignored.

using VisioForge.Core.AI.Whisper.Subtitles;

using var writer = new SubtitleWriter("captions.srt", SubtitleFormat.Srt);
stt.OnSpeechRecognized += (sender, e) =>
{
    foreach (var segment in e.Segments)
    {
        writer.Add(segment);
    }
};

If you only need files, it's simpler to set OutputSrtPath and/or OutputVttPath on SpeechToTextSettings — the block creates and drives its own SubtitleWriter instance(s) internally as final segments are recognized, and disposes them for you.

SubtitleRenderer — on-screen captions¶

SubtitleRenderer drives a single text overlay on an OverlayManagerBlock: it shows the latest caption and auto-hides it after the segment's display duration.

using SkiaSharp;
using VisioForge.Core.AI.Whisper.Subtitles;

var style = new SubtitleStyle
{
    FontName = "Arial",
    FontSize = 32,
    Color = SKColors.White,
    X = 50,
    Y = 50,
    MinDisplay = TimeSpan.FromSeconds(1.5),
    MaxDisplay = TimeSpan.FromSeconds(6),
};

var subtitleRenderer = new SubtitleRenderer(overlayManagerBlock, style);
stt.OnSpeechRecognized += subtitleRenderer.OnSpeechRecognized;

// ... later, when tearing down:
subtitleRenderer.Dispose(); // removes the overlay and stops the auto-hide timer

Wire SubtitleRenderer.OnSpeechRecognized directly as the block's event handler. It clamps the on-screen time into [MinDisplay, MaxDisplay] based on the segment's duration and calls OverlayManagerBlock.Video_Overlay_Update from whatever thread invokes it — marshal the call if your UI framework requires it. SubtitleStyle defaults: FontName = "Arial", FontSize = 32, Color = White, X = 50, Y = 50, MinDisplay = 1.5 s, MaxDisplay = 6 s.

No shipping demo yet

SubtitleRenderer exists in the SDK and is documented from its source, but no bundled demo currently uses it — the Live Subtitles X demos update a UI label directly from OnSpeechRecognized instead. Use SubtitleWriter or OutputSrtPath/OutputVttPath if you only need a subtitle file.

Manual Media Blocks pipeline¶

Place SpeechToTextBlock in the audio chain before the audio renderer or output:

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

pipeline.Connect(audioSource.Output, stt.Input);
pipeline.Connect(stt.Output, audioRenderer.Input);

VideoCaptureCoreX live microphone transcription¶

For capture, an audio source is required. If you want analysis without speaker monitoring or recording, terminate the audio chain with a non-synced null renderer:

core.Audio_Source = microphoneSettings;
core.Audio_OutputBlock = new NullRendererBlock(MediaBlockPadMediaType.Audio)
{
    IsSync = false,
};

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

core.Audio_Processing_AddBlock(stt); // before StartAsync
await core.StartAsync();

Audio_OutputBlock builds and terminates the audio chain without enabling speaker playback or file recording. The engine owns the assigned output block after start.

MediaPlayerCoreX file transcription¶

For playback, Audio_Play must be true for the audio chain to be built. A non-synced null renderer lets the source run without real-time speaker output:

player.Audio_Play = true;
player.Audio_OutputBlock = new NullRendererBlock(MediaBlockPadMediaType.Audio)
{
    IsSync = false,
};

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

player.Audio_Processing_AddBlock(stt); // before OpenAsync / PlayAsync
await player.OpenAsync(source);
await player.PlayAsync();

See Using AI blocks with VideoCaptureCoreX and MediaPlayerCoreX for the full Audio_Processing_*/Audio_OutputBlock API and lifecycle rules.

Threading, pacing, and lifetime¶

VAD and Whisper run synchronously on the GStreamer streaming thread: audio is segmented and transcribed in place, and the trailing segment is flushed at end-of-stream. Nothing is dropped — the whole input is transcribed losslessly, and the pipeline position tracks the transcription frontier (useful for a progress bar). Because transcription runs inline, the source is paced to Whisper: if Whisper is slower than real time, the upstream (including a live capture device) is throttled rather than losing audio. In practice Whisper Base runs well above real time, so it is not the bottleneck for a typical source.

OnSpeechRecognized is raised on that same streaming thread — never touch UI directly from the handler; marshal to the UI dispatcher or main thread.

After a capture or playback session starts, the engine owns wired SpeechToTextBlock instances and disposes them when the session stops. Create a new block for the next session.

Use cases¶

Live captioning — real-time subtitles for a live stream, webinar, or accessibility overlay.
Meeting and call transcription — transcribe a microphone feed alongside VideoCaptureCoreX capture.
Media indexing and search — batch-transcribe recorded video/audio files to make their content searchable.
Subtitle authoring — generate .srt/.vtt side-car files from source video without a third-party transcription service.
Translation captions — set Task = SpeechToTextTask.Translate to caption non-English speech in English.

Troubleshooting¶

Symptom	Likely cause	Fix
No segments are ever recognized	`WhisperModelPath` invalid, or the audio chain isn't built	Confirm the GGML model file exists at that path; for capture/playback, confirm the audio chain is active (see `Audio_OutputBlock` below).
Whisper "hallucinates" text during silence	`EnableVad` is `false`	Enable VAD (`EnableVad = true`, the default) so Whisper only runs on detected speech, not fixed windows that may be silent.
Transcription lags behind live audio	Whisper is slower than real time on the current hardware/model size	Choose a smaller `WhisperModelSize`/model file, or use `Provider = CUDA` on a machine with an NVIDIA GPU.
Segments are cut off mid-sentence	`SileroVadSettings.MaxSpeechMs` force-cuts a long continuous utterance	This is a deliberate bound (default 15 s) so one in-flight transcription can't grow unbounded; raise `MaxSpeechMs` if your scenario needs longer uninterrupted segments and can tolerate the larger bound.
No audio reaches the block on `VideoCaptureCoreX`/`MediaPlayerCoreX`	Audio chain not built (missing `Audio_Source`/`Audio_Play`)	See VideoCaptureCoreX live microphone transcription and MediaPlayerCoreX file transcription above.
`.srt`/`.vtt` file is empty	Segments never finalized, or wrong path	Confirm `OutputSrtPath`/`OutputVttPath` point to a writable path and that speech was actually detected; only final segments are written.

Frequently Asked Questions¶

Does SpeechToTextBlock require an internet connection?¶

No — transcription runs fully on-device through Whisper.net/GGML; the model files are downloaded once by your application (or bundled), not called per-request over the network.

Which languages does it support?¶

Whisper is multilingual — set Language to an ISO 639-1 code, or leave it "auto" to let Whisper detect the spoken language automatically.

Can I translate speech to English captions instead of transcribing the source language?¶

Yes — set SpeechToTextSettings.Task = SpeechToTextTask.Translate.

How do I get live on-screen subtitles instead of just an event?¶

Wire SubtitleRenderer.OnSpeechRecognized as the block's event handler against an OverlayManagerBlock — see SubtitleRenderer — on-screen captions. If you only need SRT/VTT files, set OutputSrtPath/OutputVttPath instead.

Which Whisper model size should I use?¶

Start with Base (the default) for real-time CPU transcription. Move to Small/Medium/LargeV3 (or their quantized variants) for higher accuracy if you have a GPU or can tolerate slower-than-real-time processing; see the Whisper model sizes table for the full trade-off.

Demos¶

Live Subtitles Demo — WPF Media Blocks pipeline demo.
Live Subtitles MB — the same Media Blocks demo for MAUI.
Live Subtitles — headless console demo (downloads models on first run).

Dedicated VideoCaptureCoreX/MediaPlayerCoreX live-subtitles demos (Capture Live Subtitles X, Capture Live Subtitles X WPF, Player Live Subtitles X, Player Live Subtitles X WPF) are in the SDK's demo set and will be linked here once published to the public samples repository.