Speech-to-Text and Live Subtitles — SpeechToTextBlock

SpeechToTextBlock is an audio-only Media Block from VisioForge.DotNet.Core.AI.Whisper. It taps the audio stream, segments speech with Silero VAD, transcribes it with Whisper (Whisper.net / GGML), and raises OnSpeechRecognized. Audio passes through unchanged. The block implements IAudioProcessingBlock, so it can be inserted into a manual pipeline or registered directly on VideoCaptureCoreX/MediaPlayerCoreX.

using VisioForge.Core.MediaBlocks.AI;
using VisioForge.Core.Types.X.AI;

Basic block setup

var settings = new SpeechToTextSettings(whisperModelPath)
{
    Language = "auto",
    Task = SpeechToTextTask.Transcribe,
    EnableVad = true,
    EmitInterim = false,
};

settings.Vad.ModelPath = sileroVadModelPath;

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += (sender, e) =>
{
    foreach (var segment in e.Segments)
    {
        Console.WriteLine($"{segment.StartTime:c} - {segment.EndTime:c}: {segment.Text}");
    }
};

SpeechToTextTask.Transcribe keeps the source language. SpeechToTextTask.Translate translates supported source speech to English text.

Key settings

Settings are constructed with SpeechToTextSettings(whisperModelPath). Unlike the vision AI settings, this type does not derive from OnnxInferenceSettings: Whisper runs through Whisper.net (whisper.cpp / GGML), not ONNX Runtime, so the ONNX-specific input-size and normalization knobs don't apply.

| Property | Default | Description |
|---|---|---|
| WhisperModelPath | | Absolute path to the Whisper GGML model file (ggml-*.bin). Required. |
| ModelSize | WhisperModelSize.Base | Informational label for the model variant at WhisperModelPath (see below). |
| Language | "auto" | ISO 639-1 code ("en", "es", "fr", ...), or "auto" to let Whisper detect it. |
| Task | Transcribe | Transcribe (source language) or Translate (to English). |
| Provider | Auto | Only CPU and CUDA are meaningful for the GGML backend (no DirectML); Auto picks CUDA when present, else CPU. |
| DeviceId | 0 | Hardware device id when a GPU provider is selected. |
| Threads | 0 | CPU threads Whisper uses. 0 lets Whisper.net choose from the available processor count. |
| EnableVad | true | Segment speech with Silero VAD before transcription. When false, audio is transcribed in fixed windows, which is prone to hallucinating text during silence. |
| Vad | new SileroVadSettings | VAD settings used when EnableVad is true. |
| FixedWindowSeconds | 5 | Fixed transcription window length when EnableVad is false. Clamped to 1–30 s. |
| EmitInterim | false | Reserved for a future interim-hypothesis capability; currently has no effect, and only final segments are emitted. |
| OutputSrtPath | null | Optional .srt side-car path the block writes as final segments are recognized. |
| OutputVttPath | null | Optional .vtt (WebVTT) side-car path the block writes as final segments are recognized. |
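
To make the FixedWindowSeconds behavior concrete, here is a minimal sketch of fixed-window splitting with the 1–30 s clamp. FixedWindows is a hypothetical helper, not an SDK API, and the handling of the trailing partial window (emitted as-is) is an assumption rather than documented block behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not SDK code: splits a stream of totalSec seconds
// into fixed windows, clamping the window length to the 1-30 s range.
static List<(double StartSec, double EndSec)> FixedWindows(double totalSec, double windowSec)
{
    windowSec = Math.Clamp(windowSec, 1.0, 30.0);
    var windows = new List<(double StartSec, double EndSec)>();
    for (double start = 0; start < totalSec; start += windowSec)
    {
        // Assumption: the last, shorter window is kept rather than dropped.
        windows.Add((start, Math.Min(start + windowSec, totalSec)));
    }
    return windows;
}

// A 12 s stream with the default 5 s window yields 0-5, 5-10, and 10-12.
foreach (var (startSec, endSec) in FixedWindows(12, 5))
    Console.WriteLine($"{startSec}-{endSec}");
```

Because every window is transcribed whether or not it contains speech, a silent window still goes through Whisper, which is where the hallucination risk noted for EnableVad = false comes from.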

VAD settings

When EnableVad is true, SpeechToTextSettings.Vad controls Silero speech segmentation. Silero VAD is a tiny (~2 MB, MIT) ONNX model that classifies short audio windows as speech or non-speech, used as a real-time pre-filter so the (much heavier) Whisper model only runs on actual speech.

settings.Vad = new SileroVadSettings
{
    ModelPath = sileroVadModelPath,
    SpeechThreshold = 0.5f,
    MinSilenceMs = 100,
    MinSpeechMs = 250,
    SpeechPadMs = 30,
    MaxSpeechMs = 15000,
    Provider = OnnxExecutionProvider.CPU,
};

| Property | Default | Description |
|---|---|---|
| ModelPath | | Absolute path to silero_vad.onnx. |
| SpeechThreshold | 0.5 | Speech-probability threshold (0..1). Raise it in noisy environments to reduce false triggers. |
| MinSilenceMs | 100 | Minimum trailing silence, in ms, that ends a speech segment. |
| MinSpeechMs | 250 | Minimum speech-run duration, in ms, to be emitted (discards spurious blips). |
| SpeechPadMs | 30 | Onset padding, in ms, prepended to each detected segment. |
| MaxSpeechMs | 15000 | Maximum segment length, in ms, before the segmenter force-cuts an ongoing run. |
| Provider | CPU | Execution provider for the VAD session. The model is tiny (~1 ms/window on CPU), so GPU adds latency without benefit. |
| DeviceId | 0 | Hardware device id when a GPU provider is selected. |
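
The interaction of these thresholds is easier to see in code. The following is a simplified sketch of a threshold-based segmenter over per-window speech probabilities; it is not the SDK's implementation. Segment is a hypothetical helper, its parameters only mirror the SileroVadSettings names, and the 32 ms window length is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not SDK code: turns per-window speech probabilities
// into (StartMs, EndMs) segments using knobs named after SileroVadSettings.
static List<(int StartMs, int EndMs)> Segment(
    IReadOnlyList<float> probs, int windowMs,
    float speechThreshold = 0.5f, int minSilenceMs = 100,
    int minSpeechMs = 250, int speechPadMs = 30, int maxSpeechMs = 15000)
{
    var segments = new List<(int StartMs, int EndMs)>();
    int start = -1;         // start of the current speech run, -1 when idle
    int lastSpeechEnd = 0;  // end of the most recent speech window in the run

    void Close(int endMs)
    {
        // Discard blips shorter than minSpeechMs; pad the onset of the rest.
        if (endMs - start >= minSpeechMs)
            segments.Add((Math.Max(0, start - speechPadMs), endMs));
        start = -1;
    }

    for (int i = 0; i < probs.Count; i++)
    {
        int t0 = i * windowMs, t1 = t0 + windowMs;
        if (probs[i] >= speechThreshold)
        {
            if (start < 0) start = t0;
            lastSpeechEnd = t1;
            if (t1 - start >= maxSpeechMs) Close(t1);  // force-cut a long run
        }
        else if (start >= 0 && t1 - lastSpeechEnd >= minSilenceMs)
        {
            Close(lastSpeechEnd);  // enough trailing silence ends the segment
        }
    }
    if (start >= 0) Close(lastSpeechEnd);  // flush the trailing run
    return segments;
}

// 10 speech windows (320 ms) are kept; a 3-window blip (96 ms) is discarded.
var probs = new List<float>();
probs.AddRange(Enumerable.Repeat(0.1f, 5));
probs.AddRange(Enumerable.Repeat(0.9f, 10));
probs.AddRange(Enumerable.Repeat(0.1f, 5));
probs.AddRange(Enumerable.Repeat(0.9f, 3));
probs.AddRange(Enumerable.Repeat(0.1f, 5));

foreach (var (startMs, endMs) in Segment(probs, windowMs: 32))
    Console.WriteLine($"{startMs}-{endMs} ms"); // prints "130-480 ms"
```

In this sketch the speech run starts at 160 ms, so the 30 ms onset padding moves the reported start to 130 ms. Raising minSilenceMs merges runs separated by short pauses, and lowering maxSpeechMs produces more frequent force-cuts.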

The Whisper GGML weights and the Silero VAD model are downloaded at runtime; neither is shipped in the SDK NuGet packages.

Whisper model sizes

WhisperModelSize is informational — it names well-known Whisper GGML weight files so an application can pick one to download. The file actually loaded is always SpeechToTextSettings.WhisperModelPath.

| Value | Approx. size | Notes |
|---|---|---|
| Tiny | ~75 MB | Fastest, lowest accuracy. Good for real-time CPU transcription. |
| Base (default) | ~142 MB | A good real-time CPU default. |
| Small | ~466 MB | Noticeably more accurate; real-time with a GPU or a fast CPU. |
| Medium | ~1.5 GB | High accuracy; typically needs a GPU for real-time. |
| LargeV3 | ~3 GB | Highest accuracy; GPU strongly recommended. |
| LargeV3Turbo | ~1.6 GB | Near-large accuracy at a fraction of the cost. |
| TinyQuantized | ~31 MB | Tiny, Q5_1-quantized. |
| BaseQuantized | ~57 MB | Base, Q5_1-quantized. |
| SmallQuantized | ~181 MB | Small, Q5_1-quantized. |
| MediumQuantized | ~514 MB | Medium, Q5_0-quantized. |
| LargeV3TurboQuantized | ~547 MB | LargeV3Turbo, Q5_0-quantized. Recommended quantized accuracy/speed balance. |

English-only (*.en) variants aren't enumerated; supply their path directly. Weights are MIT-licensed.

Recognized segments

Each SpeechSegment in SpeechRecognizedEventArgs.Segments:

| Property | Description |
|---|---|
| Text | The recognized text for the segment. |
| StartTime / EndTime | TimeSpan, relative to the start of the stream; ready to use for SRT/VTT or on-screen scheduling. |
| IsFinal | Reserved to distinguish interim vs. final hypotheses once interim results exist; segments are currently always final (true). |
| Language | Detected/used ISO 639-1 code, or null if unknown. |
| Confidence | Average token confidence (0..1), or 0 when the model doesn't report token probabilities. |

Subtitle helpers

SubtitleWriter — SRT/VTT files

SubtitleWriter writes recognized speech segments to a SubRip (.srt) or WebVTT (.vtt) side-car file, appending one cue per final segment. It is thread-safe; interim (non-final) segments are ignored.

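using VisioForge.Core.AI.Whisper.Subtitles;

// 'using var' disposes the writer when the enclosing scope exits,
// so declare it in a scope that outlives the capture/playback session.
using var writer = new SubtitleWriter("captions.srt", SubtitleFormat.Srt);
stt.OnSpeechRecognized += (sender, e) =>
{
    // Append one cue per recognized segment.
    foreach (var segment in e.Segments)
    {
        writer.Add(segment);
    }
};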

If you only need files, it's simpler to set OutputSrtPath and/or OutputVttPath on SpeechToTextSettings — the block creates and drives its own SubtitleWriter instance(s) internally as final segments are recognized, and disposes them for you.
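
For readers who want to post-process or validate the side-car files, the sketch below shows what a single SubRip and WebVTT cue looks like as text. It illustrates the two file formats only and is not SubtitleWriter's implementation; Stamp, SrtCue, and VttCue are hypothetical helper names.

```csharp
using System;

// SRT separates milliseconds with a comma, WebVTT with a period.
static string Stamp(TimeSpan t, char msSeparator) =>
    $"{(int)t.TotalHours:00}:{t.Minutes:00}:{t.Seconds:00}{msSeparator}{t.Milliseconds:000}";

// An SRT cue: 1-based index, time range, text, then a blank line.
static string SrtCue(int index, TimeSpan start, TimeSpan end, string text) =>
    $"{index}\n{Stamp(start, ',')} --> {Stamp(end, ',')}\n{text}\n\n";

// A WebVTT cue: time range, text, then a blank line.
// A complete .vtt file must also begin with a "WEBVTT" header line.
static string VttCue(TimeSpan start, TimeSpan end, string text) =>
    $"{Stamp(start, '.')} --> {Stamp(end, '.')}\n{text}\n\n";

var cueStart = TimeSpan.FromMilliseconds(1500);
var cueEnd = TimeSpan.FromMilliseconds(4000);

Console.Write(SrtCue(1, cueStart, cueEnd, "Hello"));
// 1
// 00:00:01,500 --> 00:00:04,000
// Hello

Console.Write(VttCue(cueStart, cueEnd, "Hello"));
// 00:00:01.500 --> 00:00:04.000
// Hello
```

Since StartTime and EndTime are already relative to the start of the stream, they map directly onto cue timestamps without an offset.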

SubtitleRenderer — on-screen captions

SubtitleRenderer drives a single text overlay on an OverlayManagerBlock: it shows the latest caption and auto-hides it after the segment's display duration.

using SkiaSharp;
using VisioForge.Core.AI.Whisper.Subtitles;

var style = new SubtitleStyle
{
    FontName = "Arial",
    FontSize = 32,
    Color = SKColors.White,
    X = 50,
    Y = 50,
    MinDisplay = TimeSpan.FromSeconds(1.5),
    MaxDisplay = TimeSpan.FromSeconds(6),
};

var subtitleRenderer = new SubtitleRenderer(overlayManagerBlock, style);
stt.OnSpeechRecognized += subtitleRenderer.OnSpeechRecognized;

// ... later, when tearing down:
subtitleRenderer.Dispose(); // removes the overlay and stops the auto-hide timer

Wire SubtitleRenderer.OnSpeechRecognized directly as the block's event handler. It clamps the on-screen time into [MinDisplay, MaxDisplay] based on the segment's duration and calls OverlayManagerBlock.Video_Overlay_Update from whatever thread invokes it — marshal the call if your UI framework requires it. SubtitleStyle defaults: FontName = "Arial", FontSize = 32, Color = White, X = 50, Y = 50, MinDisplay = 1.5 s, MaxDisplay = 6 s.
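
The clamping rule can be sketched as follows. DisplayTime is a hypothetical helper written for illustration, assuming the on-screen time is simply the segment duration limited to the [MinDisplay, MaxDisplay] range; the renderer's actual code may differ.

```csharp
using System;

// Hypothetical helper, not SDK code: limits a segment's duration
// to the [min, max] range to decide how long the caption stays visible.
static TimeSpan DisplayTime(TimeSpan start, TimeSpan end, TimeSpan min, TimeSpan max)
{
    var duration = end - start;
    if (duration < min) return min;
    if (duration > max) return max;
    return duration;
}

// Defaults from SubtitleStyle: MinDisplay = 1.5 s, MaxDisplay = 6 s.
var minDisplay = TimeSpan.FromSeconds(1.5);
var maxDisplay = TimeSpan.FromSeconds(6);

// A 0.4 s interjection stays up for the 1.5 s minimum,
// a 10 s monologue is capped at 6 s, and a 3 s sentence is shown for 3 s.
Console.WriteLine(DisplayTime(TimeSpan.Zero, TimeSpan.FromSeconds(0.4), minDisplay, maxDisplay));
Console.WriteLine(DisplayTime(TimeSpan.Zero, TimeSpan.FromSeconds(10), minDisplay, maxDisplay));
Console.WriteLine(DisplayTime(TimeSpan.Zero, TimeSpan.FromSeconds(3), minDisplay, maxDisplay));
```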

No shipping demo yet

SubtitleRenderer exists in the SDK and is documented from its source, but no bundled demo currently uses it — the Live Subtitles X demos update a UI label directly from OnSpeechRecognized instead. Use SubtitleWriter or OutputSrtPath/OutputVttPath if you only need a subtitle file.

Manual Media Blocks pipeline

Place SpeechToTextBlock in the audio chain before the audio renderer or output:

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

pipeline.Connect(audioSource.Output, stt.Input);
pipeline.Connect(stt.Output, audioRenderer.Input);

VideoCaptureCoreX live microphone transcription

For capture, an audio source is required. If you want analysis without speaker monitoring or recording, terminate the audio chain with a non-synced null renderer:

core.Audio_Source = microphoneSettings;
core.Audio_OutputBlock = new NullRendererBlock(MediaBlockPadMediaType.Audio)
{
    IsSync = false,
};

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

core.Audio_Processing_AddBlock(stt); // before StartAsync
await core.StartAsync();

Audio_OutputBlock builds and terminates the audio chain without enabling speaker playback or file recording. The engine owns the assigned output block after start.

MediaPlayerCoreX file transcription

For playback, Audio_Play must be true for the audio chain to be built. A non-synced null renderer lets the source run without real-time speaker output:

player.Audio_Play = true;
player.Audio_OutputBlock = new NullRendererBlock(MediaBlockPadMediaType.Audio)
{
    IsSync = false,
};

var stt = new SpeechToTextBlock(settings);
stt.OnSpeechRecognized += SpeechToText_OnSpeechRecognized;

player.Audio_Processing_AddBlock(stt); // before OpenAsync / PlayAsync
await player.OpenAsync(source);
await player.PlayAsync();

See Using AI blocks with VideoCaptureCoreX and MediaPlayerCoreX for the full Audio_Processing_*/Audio_OutputBlock API and lifecycle rules.

Threading, pacing, and lifetime

VAD and Whisper run synchronously on the GStreamer streaming thread: audio is segmented and transcribed in place, and the trailing segment is flushed at end-of-stream. Nothing is dropped: the whole input is transcribed, and the pipeline position tracks the transcription frontier (useful for a progress bar). Because transcription runs inline, the source is paced to Whisper: if Whisper is slower than real time, the upstream (including a live capture device) is throttled rather than losing audio. In practice Whisper Base typically transcribes faster than real time, so it is not the bottleneck for a typical source.
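
One way to reason about pacing is the real-time factor: processing time divided by audio duration. The sketch below is back-of-the-envelope arithmetic with made-up numbers, not measurements of any model or hardware; RealTimeFactor and LagAfter are hypothetical helpers.

```csharp
using System;

// Real-time factor: below 1 means faster than real time, above 1 means slower.
static double RealTimeFactor(TimeSpan audio, TimeSpan processing) =>
    processing.TotalSeconds / audio.TotalSeconds;

// With inline transcription, audio is not dropped; instead the source falls
// behind the wall clock by (rtf - 1) seconds for every second of audio.
static TimeSpan LagAfter(TimeSpan transcribedAudio, double rtf) =>
    rtf <= 1.0
        ? TimeSpan.Zero
        : TimeSpan.FromSeconds(transcribedAudio.TotalSeconds * (rtf - 1.0));

// Illustrative numbers only: a 10 s segment transcribed in 2.5 s.
double fast = RealTimeFactor(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(2.5));
Console.WriteLine(fast < 1.0 ? "keeps up with a live source" : "falls behind");

// At a factor of 1.5, one minute of audio leaves the source 30 s behind.
Console.WriteLine(LagAfter(TimeSpan.FromSeconds(60), 1.5));
```

Measuring this on the target machine with the chosen model file is a quick way to decide whether a smaller model or a GPU provider is needed for live use.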

OnSpeechRecognized is raised on that same streaming thread — never touch UI directly from the handler; marshal to the UI dispatcher or main thread.
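
Besides posting directly to the UI thread (for example with SynchronizationContext.Post, or Dispatcher.BeginInvoke in WPF), a framework-neutral option is to hand captions off through a thread-safe queue that the UI thread drains on its own schedule. The following is a minimal sketch of that pattern; OnCaption and Drain are hypothetical names, and a worker task stands in for the streaming thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

var pending = new ConcurrentQueue<string>();

// Called from the streaming thread: only enqueue, never touch UI here.
void OnCaption(string text) => pending.Enqueue(text);

// Called from the UI thread (for example on a timer tick):
// take everything queued so far, in arrival order.
List<string> Drain()
{
    var batch = new List<string>();
    while (pending.TryDequeue(out var text))
        batch.Add(text);
    return batch;
}

// Simulate the streaming thread with a worker task.
Task.Run(() =>
{
    OnCaption("hello");
    OnCaption("world");
}).Wait();

Console.WriteLine(string.Join(" ", Drain())); // prints "hello world"
```

Keeping the handler this small also matters for throughput: since the event fires on the streaming thread, any slow work inside it delays transcription of the audio that follows.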

After a capture or playback session starts, the engine owns wired SpeechToTextBlock instances and disposes them when the session stops. Create a new block for the next session.

Use cases

  • Live captioning — real-time subtitles for a live stream, webinar, or accessibility overlay.
  • Meeting and call transcription — transcribe a microphone feed alongside VideoCaptureCoreX capture.
  • Media indexing and search — batch-transcribe recorded video/audio files to make their content searchable.
  • Subtitle authoring — generate .srt/.vtt side-car files from source video without a third-party transcription service.
  • Translation captions — set Task = SpeechToTextTask.Translate to caption non-English speech in English.

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| No segments are ever recognized | WhisperModelPath invalid, or the audio chain isn't built | Confirm the GGML model file exists at that path; for capture/playback, confirm the audio chain is active (see Audio_OutputBlock in the VideoCaptureCoreX and MediaPlayerCoreX sections above). |
| Whisper "hallucinates" text during silence | EnableVad is false | Enable VAD (EnableVad = true, the default) so Whisper only runs on detected speech, not fixed windows that may be silent. |
| Transcription lags behind live audio | Whisper is slower than real time on the current hardware/model size | Choose a smaller WhisperModelSize/model file, or use Provider = CUDA on a machine with an NVIDIA GPU. |
| Segments are cut off mid-sentence | SileroVadSettings.MaxSpeechMs force-cuts a long continuous utterance | This is a deliberate bound (default 15 s) so one in-flight transcription can't grow unbounded; raise MaxSpeechMs if your scenario needs longer uninterrupted segments and can tolerate the larger bound. |
| No audio reaches the block on VideoCaptureCoreX/MediaPlayerCoreX | Audio chain not built (missing Audio_Source/Audio_Play) | See VideoCaptureCoreX live microphone transcription and MediaPlayerCoreX file transcription above. |
| .srt/.vtt file is empty | Segments never finalized, or wrong path | Confirm OutputSrtPath/OutputVttPath point to a writable path and that speech was actually detected; only final segments are written. |
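
Several of these symptoms come down to paths, so it can help to validate them before constructing the block. The sketch below is one possible pre-flight check using only System.IO; Preflight is a hypothetical helper, not part of the SDK, and an empty string means "not used".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not SDK code: returns a list of human-readable
// problems found with the model and subtitle paths (empty when all is well).
static List<string> Preflight(string whisperModelPath, string vadModelPath = "", string subtitlePath = "")
{
    var problems = new List<string>();

    if (!Path.IsPathRooted(whisperModelPath))
        problems.Add($"Whisper model path is not absolute: {whisperModelPath}");
    if (!File.Exists(whisperModelPath))
        problems.Add($"Whisper model file not found: {whisperModelPath}");

    if (!string.IsNullOrEmpty(vadModelPath) && !File.Exists(vadModelPath))
        problems.Add($"Silero VAD model file not found: {vadModelPath}");

    if (!string.IsNullOrEmpty(subtitlePath))
    {
        // The side-car file itself may not exist yet, but its folder must.
        var dir = Path.GetDirectoryName(Path.GetFullPath(subtitlePath));
        if (dir != null && !Directory.Exists(dir))
            problems.Add($"Subtitle output folder does not exist: {dir}");
    }
    return problems;
}

// Example: an absolute path into a folder that is assumed not to exist.
var missingModel = Path.Combine(Path.GetTempPath(), "no-such-folder-for-stt", "ggml-base.bin");
foreach (var problem in Preflight(missingModel))
    Console.WriteLine(problem);
```

This only checks that the files and folders exist; it does not verify that a file is a valid GGML or ONNX model, or that the output folder is actually writable.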

Frequently Asked Questions

Does SpeechToTextBlock require an internet connection?

No. Transcription runs fully on-device through Whisper.net/GGML. Your application downloads the model files once (or bundles them); the model is not invoked over the network per request.

Which languages does it support?

Whisper is multilingual — set Language to an ISO 639-1 code, or leave it "auto" to let Whisper detect the spoken language automatically.

Can I translate speech to English captions instead of transcribing the source language?

Yes — set SpeechToTextSettings.Task = SpeechToTextTask.Translate.

How do I get live on-screen subtitles instead of just an event?

Wire SubtitleRenderer.OnSpeechRecognized as the block's event handler against an OverlayManagerBlock — see SubtitleRenderer — on-screen captions. If you only need SRT/VTT files, set OutputSrtPath/OutputVttPath instead.

Which Whisper model size should I use?

Start with Base (the default) for real-time CPU transcription. For higher accuracy, move to Small, Medium, or LargeV3 (or one of the quantized variants in the table) if you have a GPU or can tolerate slower-than-real-time processing; see the Whisper model sizes table for the full trade-off.

Demos

Dedicated VideoCaptureCoreX/MediaPlayerCoreX live-subtitles demos (Capture Live Subtitles X, Capture Live Subtitles X WPF, Player Live Subtitles X, Player Live Subtitles X WPF) are in the SDK's demo set and will be linked here once published to the public samples repository.