Overview

OpenAI provides two STT service implementations:
  • OpenAISTTService for VAD-segmented speech recognition using OpenAI’s transcription API (HTTP-based), supporting GPT-4o transcription and Whisper models
  • OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions, with support for local VAD and server-side VAD modes

Installation

To use OpenAI services, install the required dependency:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI STT services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Ensure access to Whisper and GPT-4o transcription models

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication

Configuration

OpenAISTTService

OpenAISTTService uses VAD-based audio segmentation with HTTP transcription requests: it records speech segments detected by a local VAD processor and sends each completed segment to OpenAI’s transcription API.
model
str
default:"gpt-4o-transcribe"
Transcription model to use. Options include "gpt-4o-transcribe", "gpt-4o-mini-transcribe", and "whisper-1".
api_key
str
default:"None"
OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
base_url
str
default:"None"
API base URL. Override for custom or proxied deployments.
language
Language
default:"Language.EN"
Language of the audio input.
prompt
str
default:"None"
Optional text to guide the model’s style or continue a previous segment.
temperature
float
default:"None"
Sampling temperature between 0 and 1. Lower values produce more deterministic results.
push_empty_transcripts
bool
default:"False"
If true, allow empty TranscriptionFrame frames to be pushed downstream instead of discarding them. This is intended for situations where VAD fires even though the user did not speak. In these cases, it is useful to know that nothing was transcribed so that the agent can resume speaking, instead of waiting longer for a transcription.
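
As a sketch of how these options fit together, the constructor call below combines the optional parameters described above (the specific values are illustrative, not recommendations, and the `Language` import path is taken from Pipecat's transcription module):

```python
import os

from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language

# Illustrative configuration exercising the optional parameters.
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="whisper-1",
    language=Language.FR,               # transcribe French audio
    prompt="Pipecat, OpenAI, Whisper",  # keyword hints for proper nouns
    temperature=0.0,                    # most deterministic output
    push_empty_transcripts=True,        # surface empty results on false VAD triggers
)
```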

OpenAIRealtimeSTTService

Provides real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions. Audio is streamed continuously over a WebSocket connection for lower latency compared to HTTP-based transcription.
api_key
str
required
OpenAI API key for authentication.
model
str
default:"gpt-4o-transcribe"
Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe".
base_url
str
default:"wss://api.openai.com/v1/realtime"
WebSocket base URL for the Realtime API.
language
Language
default:"Language.EN"
Language of the audio input.
prompt
str
default:"None"
Optional prompt text to guide transcription style or provide keyword hints.
turn_detection
dict | Literal[False]
default:"False"
Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).
noise_reduction
str
default:"None"
Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.
should_interrupt
bool
default:"True"
Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled.
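
For example, a custom server-side VAD configuration might look like the sketch below. The dict keys follow the Realtime API's `server_vad` session settings; the exact tuning values are assumptions for illustration:

```python
import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Server-side VAD with custom tuning; do NOT add a local VAD processor
# to the pipeline in this mode.
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,            # speech-probability threshold
        "silence_duration_ms": 500,  # silence before the turn is considered finished
    },
    noise_reduction="far_field",     # e.g. a conference-room microphone
    should_interrupt=True,           # barge-in when the server detects speech
)
```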

Usage

OpenAISTTService

import os

from pipecat.services.openai.stt import OpenAISTTService

stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
)

OpenAIRealtimeSTTService with Local VAD

import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Local VAD mode (default) - use with a VAD processor in the pipeline
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    noise_reduction="near_field",
)

OpenAIRealtimeSTTService with Server-Side VAD

import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Server-side VAD mode - do NOT use a separate VAD processor
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    turn_detection=None,  # Enable server-side VAD
)

Notes

  • Local VAD vs Server-side VAD: OpenAIRealtimeSTTService defaults to local VAD mode (turn_detection=False), where a local VAD processor in the pipeline controls when audio is committed for transcription. Set turn_detection=None for server-side VAD, but do not use a separate VAD processor in the pipeline in that mode.
  • Automatic resampling: OpenAIRealtimeSTTService automatically resamples audio to 24 kHz as required by the Realtime API, regardless of the pipeline’s sample rate.
  • Segmented vs streaming: OpenAISTTService processes complete audio segments (after VAD detects silence) via HTTP. OpenAIRealtimeSTTService streams audio continuously over WebSocket for lower latency.
  • Interim transcriptions: OpenAIRealtimeSTTService produces interim transcriptions via delta events, while OpenAISTTService only produces final transcriptions.
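
The interim-vs-final distinction can be observed with a small downstream processor. The sketch below uses Pipecat's `FrameProcessor` base class and transcription frame types; the logger class itself is hypothetical:

```python
from pipecat.frames.frames import Frame, InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

# Hypothetical processor placed after the STT service in the pipeline.
# With OpenAIRealtimeSTTService it logs both interim and final results;
# with OpenAISTTService only final TranscriptionFrames arrive.
class TranscriptLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, InterimTranscriptionFrame):
            print(f"interim: {frame.text}")
        elif isinstance(frame, TranscriptionFrame):
            print(f"final:   {frame.text}")
        await self.push_frame(frame, direction)
```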

Event Handlers

OpenAIRealtimeSTTService supports the standard service connection events:
  • on_connected: Connected to OpenAI Realtime WebSocket
  • on_disconnected: Disconnected from OpenAI Realtime WebSocket
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to OpenAI Realtime STT")
OpenAISTTService uses HTTP requests and does not have WebSocket connection events.
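
The disconnect event follows the same pattern (assuming the same `stt` instance as above):

```python
@stt.event_handler("on_disconnected")
async def on_disconnected(service):
    # App-level bookkeeping on connection drop, e.g. logging or metrics.
    print("Disconnected from OpenAI Realtime STT")
```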