Text-to-Speech

Google Vertex AI offers powerful text-to-speech capabilities through Gemini TTS models. Portkey supports two approaches for TTS:

Gemini TTS via Chat Completions - Use Gemini TTS models through the chat completions endpoint with speech_config or the OpenAI-compatible audio parameter (maps to the Vertex AI API)
Cloud Text-to-Speech API - Use the OpenAI-compatible /audio/speech endpoint for Chirp and Gemini TTS voices (maps to the Cloud Text-to-Speech API)

Method 1: Gemini TTS via Chat Completions

This method uses the Vertex AI generateContent API internally and provides granular control over speech synthesis using speech_config or the OpenAI-compatible audio parameter.

Available Models

Model ID	Optimized For	Speaker Support
`gemini-2.5-flash-tts`	Low latency, everyday applications	Single & multi-speaker
`gemini-2.5-pro-tts`	High control, podcasts, audiobooks	Single & multi-speaker
`gemini-2.5-flash-lite-preview-tts`	Cost-efficient applications	Single speaker only
`gemini-3.1-flash-tts-preview`	Low latency with latest features	Single & multi-speaker

Using `speech_config` (Vertex AI Native)

curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "Say the following in a cheerful way: Hello! Welcome to Portkey. We make AI applications reliable and production-ready."
      }
    ],
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "Kore"
        }
      },
      "language_code": "en-US"
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - output.wav

Because speech_config is not part of the OpenAI API specification, how you pass it depends on the SDK:

Python SDK: Use extra_body parameter to pass provider-specific parameters
Node.js SDK: Pass additional parameters directly - the Portkey SDK accepts arbitrary parameters via its flexible type definitions
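
As a rough sketch of the Python SDK route, the helper below builds the same speech_config as the curl example and passes it through extra_body. The function names are illustrative, and `client` is assumed to be a Portkey (or other OpenAI-compatible) client object that exposes `chat.completions.create`; check the SDK reference before relying on the exact call shape.

```python
from typing import Any


def build_speech_config(voice_name: str, language_code: str = "en-US") -> dict[str, Any]:
    """Build the Vertex AI-style speech_config shown in the curl example."""
    return {
        "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}},
        "language_code": language_code,
    }


def request_speech(client: Any, text: str, voice_name: str = "Kore") -> Any:
    """Send a TTS chat completion, passing speech_config via extra_body.

    Assumption: `client` exposes chat.completions.create(...).
    """
    return client.chat.completions.create(
        model="@vertex-ai/gemini-2.5-flash-tts",
        messages=[{"role": "user", "content": text}],
        # speech_config is not an OpenAI parameter, so it travels in extra_body.
        extra_body={"speech_config": build_speech_config(voice_name)},
    )
```

Keeping the config builder separate from the call makes it easy to inspect or log the payload before sending it.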

Using `audio` Parameter (OpenAI-Compatible)

For a simpler, OpenAI-compatible interface, use the audio parameter:

curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "Say the following warmly: Thank you for using our service today!"
      }
    ],
    "audio": {
      "voice": "Aoede"
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - output.wav

Response Format

The response carries the audio as base64-encoded raw PCM (16-bit, 24 kHz, mono) in choices[0].message.audio.data:

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "audio": {
          "id": "audio-xxx",
          "data": "<base64-encoded PCM audio>"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
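
The jq, base64, and ffmpeg pipeline from the curl examples can also be done in code. The sketch below uses only the Python standard library and assumes the response has already been parsed into a dict with the shape shown above; `save_pcm_as_wav` is an illustrative name, not part of any SDK.

```python
import base64
import wave


def save_pcm_as_wav(response: dict, path: str, sample_rate: int = 24000) -> int:
    """Decode base64 PCM from a chat completion response and write a WAV file.

    Returns the number of raw PCM bytes written.
    """
    pcm = base64.b64decode(response["choices"][0]["message"]["audio"]["data"])
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono, matching `-ac 1`
        wav.setsampwidth(2)            # 16-bit samples, matching `s16le`
        wav.setframerate(sample_rate)  # 24 kHz, matching `-ar 24k`
        wav.writeframes(pcm)
    return len(pcm)
```

The `wave` module writes the WAV header itself, so no external tool is needed to make the raw PCM playable.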

Multi-Speaker Synthesis

Generate conversations with multiple speakers using multi_speaker_voice_config:

curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "TTS the following conversation between Alice and Bob:\nAlice: Hi Bob, how are you today?\nBob: I am doing great, thanks for asking!"
      }
    ],
    "speech_config": {
      "language_code": "en-US",
      "multi_speaker_voice_config": {
        "speaker_voice_configs": [
          {
            "speaker": "Alice",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "Kore"
              }
            }
          },
          {
            "speaker": "Bob",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "Charon"
              }
            }
          }
        ]
      }
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - conversation.wav
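
For more than a couple of speakers, writing the nested config by hand gets tedious. The following sketch generates the same multi_speaker_voice_config structure from a speaker-to-voice mapping and formats the transcript prompt to match; both helper names are hypothetical.

```python
def build_multi_speaker_config(voices: dict[str, str], language_code: str = "en-US") -> dict:
    """Map speaker names to prebuilt voices in the multi_speaker_voice_config shape."""
    return {
        "language_code": language_code,
        "multi_speaker_voice_config": {
            "speaker_voice_configs": [
                {
                    "speaker": speaker,
                    "voice_config": {"prebuilt_voice_config": {"voice_name": voice}},
                }
                for speaker, voice in voices.items()
            ]
        },
    }


def build_transcript_prompt(lines: list[tuple[str, str]]) -> str:
    """Format (speaker, text) pairs as the 'Speaker: text' lines used in the prompt."""
    # dict.fromkeys keeps first-seen order while dropping repeated speakers.
    speakers = list(dict.fromkeys(speaker for speaker, _ in lines))
    header = f"TTS the following conversation between {' and '.join(speakers)}:"
    return "\n".join([header] + [f"{speaker}: {text}" for speaker, text in lines])
```

The speaker names in the prompt should match the `speaker` values in the config exactly, as they do in the curl example (Alice and Bob).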

Method 2: Cloud Text-to-Speech API

This method uses Google’s Cloud Text-to-Speech API through the OpenAI-compatible /audio/speech endpoint. It supports both Gemini TTS and Chirp voices with more audio encoding options.

Basic Usage

curl https://api.portkey.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "input": "Hello! This is a test of the text to speech system.",
    "voice": "Kore",
    "response_format": "mp3"
  }' \
  --output speech.mp3

With Style Instructions

Use the instructions parameter to control the speaking style:

curl https://api.portkey.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "input": "Welcome to our podcast! Today we have an exciting episode for you.",
    "voice": "Aoede",
    "instructions": "Speak in an enthusiastic and energetic podcast host voice",
    "response_format": "mp3"
  }' \
  --output podcast_intro.mp3
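
The same request can be built in Python without an SDK. This is a minimal sketch using the standard library only; the function names are illustrative, and the payload fields mirror the curl examples above.

```python
import json
import urllib.request
from typing import Optional


def build_speech_request(
    api_key: str,
    text: str,
    voice: str = "Kore",
    instructions: Optional[str] = None,
    response_format: str = "mp3",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /audio/speech endpoint."""
    payload = {
        "model": "@vertex-ai/gemini-2.5-flash-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if instructions:
        payload["instructions"] = instructions  # optional style guidance
    return urllib.request.Request(
        "https://api.portkey.ai/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-portkey-api-key": api_key},
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the binary audio response to disk."""
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

Unlike the chat completions route, the response body here is the audio file itself, so it is written to disk as-is rather than base64-decoded.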

Supported Audio Formats

Format	Content Type	Description
`mp3`	audio/mpeg	Compressed, widely compatible
`opus`	audio/ogg	High quality, efficient compression
`wav`	audio/wav	Uncompressed LINEAR16
`pcm`	audio/L16	Raw PCM audio
`alaw`	audio/alaw	A-law encoded audio
`mulaw`	audio/basic	μ-law encoded audio
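
One way to use this table in client code is to validate the requested format up front and confirm that the response came back as expected. The sketch below copies the mapping from the table; the helper names are hypothetical, and whether the gateway appends parameters to the Content-Type header is an assumption the code merely tolerates.

```python
AUDIO_FORMATS = {
    "mp3": "audio/mpeg",
    "opus": "audio/ogg",
    "wav": "audio/wav",
    "pcm": "audio/L16",
    "alaw": "audio/alaw",
    "mulaw": "audio/basic",
}


def content_type_for(response_format: str) -> str:
    """Return the content type listed for a response_format value."""
    try:
        return AUDIO_FORMATS[response_format.lower()]
    except KeyError:
        supported = ", ".join(sorted(AUDIO_FORMATS))
        raise ValueError(
            f"Unsupported format {response_format!r}; expected one of: {supported}"
        ) from None


def matches_requested_format(response_format: str, content_type_header: str) -> bool:
    """Compare a Content-Type header with the expected type, ignoring parameters."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == content_type_for(response_format).lower()
```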

Voice Options

Gemini TTS offers 30 distinct voices:

Voice Name	Gender	Voice Name	Gender
Achernar	Female	Laomedeia	Female
Achird	Male	Leda	Female
Algenib	Male	Orus	Male
Algieba	Male	Pulcherrima	Female
Alnilam	Male	Puck	Male
Aoede	Female	Rasalgethi	Male
Autonoe	Female	Sadachbia	Male
Callirrhoe	Female	Sadaltager	Male
Charon	Male	Schedar	Male
Despina	Female	Sulafat	Female
Enceladus	Male	Umbriel	Male
Erinome	Female	Vindemiatrix	Female
Fenrir	Male	Zephyr	Female
Gacrux	Female	Zubenelgenubi	Male
Iapetus	Male	Kore	Female

Supported Languages

Gemini TTS supports 24+ languages in GA and 50+ in Preview. Common GA languages include:

Language	Code	Language	Code
English (US)	en-US	Japanese	ja-JP
English (India)	en-IN	Korean	ko-KR
French	fr-FR	Portuguese (Brazil)	pt-BR
German	de-DE	Spanish	es-ES
Hindi	hi-IN	Italian	it-IT

Choosing the Right Method

Feature	Chat Completions (Vertex AI API)	Audio Speech (Cloud TTS API)
Endpoint	`/v1/chat/completions`	`/v1/audio/speech`
Audio Format	PCM 16-bit 24kHz only	MP3, WAV, Opus, PCM, etc.
Temperature Control	✅ Supported	❌ Not supported
Style Instructions	Via message content	Via `instructions` param
Multi-Speaker	✅ Full control	❌ Single speaker only
Streaming	✅ Via SSE	❌ Not supported
Text Input Streaming	Single request only	Multiple chunks supported
Best For	Real-time apps, multi-speaker	Simple TTS, format flexibility

When to Use Vertex AI API (Chat Completions)

You need temperature control for creative/diverse output
You want multi-speaker conversations
You’re already using Vertex AI for other models
You need streaming audio output

When to Use Cloud TTS API (Audio Speech)

You need specific audio encoding formats (MP3, WAV, etc.)
You want a simpler OpenAI-compatible interface
You’re migrating from OpenAI TTS
You need to stream text input in multiple chunks
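
The two checklists can be folded into a small decision helper. This is only a sketch of the comparison table above: the function name and parameters are made up for illustration, and the fallback to /v1/audio/speech when nothing forces a choice reflects the "simpler interface" point rather than any documented default.

```python
CHAT_ENDPOINT = "/v1/chat/completions"
SPEECH_ENDPOINT = "/v1/audio/speech"


def choose_tts_method(
    *,
    multi_speaker: bool = False,
    streaming_output: bool = False,
    temperature_control: bool = False,
    response_format: str = "pcm",
    chunked_text_input: bool = False,
) -> str:
    """Pick an endpoint from the feature comparison table."""
    # Features listed as supported only by the chat completions route.
    needs_chat = multi_speaker or streaming_output or temperature_control
    # Features listed as supported only by the audio speech route.
    needs_speech = response_format.lower() != "pcm" or chunked_text_input
    if needs_chat and needs_speech:
        raise ValueError("No single method covers this combination of requirements")
    if needs_chat:
        return CHAT_ENDPOINT
    return SPEECH_ENDPOINT
```

When the helper raises, one workaround is to use the chat completions route and convert the PCM output to the desired format afterwards.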

Prompting Tips

For detailed prompting strategies, see Google’s prompting tips.

Style Prompts

Control the speaking style through your message content:

Say the following in a calm, professional tone: [your text]

Narrate this like an audiobook narrator: [your text]

Speak with excitement and energy: [your text]

Markup Tags (Preview)

Use bracketed tags for specific effects:

Tag	Effect
`[sigh]`	Inserts a sigh sound
`[laughing]`	Inserts a laugh
`[uhm]`	Inserts a hesitation
`[whispering]`	Decreases volume
`[shouting]`	Increases volume
`[extremely fast]`	Speeds up speech
`[short pause]`	~250ms pause
`[long pause]`	~1000ms+ pause

Example:

Say: [sigh] I can't believe it's Monday again. [long pause] Well, let's get started!
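
Because a mistyped tag may simply be read aloud or ignored, it can help to lint prompts before sending them. The sketch below checks bracketed tags against the ones listed in the table; that list may not be exhaustive, so treat a reported tag as a prompt to double-check rather than a definite error. `find_unknown_tags` is a hypothetical helper.

```python
import re

# Tags taken from the table above; the full set supported in Preview may be larger.
KNOWN_TAGS = {
    "sigh",
    "laughing",
    "uhm",
    "whispering",
    "shouting",
    "extremely fast",
    "short pause",
    "long pause",
}

# Matches anything inside a single pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in KNOWN_TAGS."""
    return [
        tag
        for tag in TAG_PATTERN.findall(text)
        if tag.strip().lower() not in KNOWN_TAGS
    ]
```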

Limits

Description	Limit
Text field	≤ 4,000 bytes
Prompt field	≤ 4,000 bytes
Combined text + prompt	≤ 8,000 bytes
Output audio duration	~655 seconds max

If input text results in audio longer than 655 seconds, the audio will be truncated.
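
These limits are in bytes, not characters, so non-ASCII text reaches them sooner than its length suggests. The sketch below pre-checks inputs and estimates audio duration from raw PCM size. It assumes that sizes are measured on the UTF-8 encoding, that "text" corresponds to the spoken input and "prompt" to the style instructions, and that output is 16-bit mono PCM at 24 kHz (48,000 bytes per second); the helper names are illustrative.

```python
MAX_TEXT_BYTES = 4000
MAX_PROMPT_BYTES = 4000
MAX_COMBINED_BYTES = 8000
MAX_AUDIO_SECONDS = 655
PCM_BYTES_PER_SECOND = 24000 * 2  # 24 kHz sample rate x 2 bytes per 16-bit mono sample


def check_input_limits(text: str, prompt: str = "") -> list[str]:
    """Return a list of limit violations (empty if the input fits)."""
    text_bytes = len(text.encode("utf-8"))
    prompt_bytes = len(prompt.encode("utf-8"))
    problems = []
    if text_bytes > MAX_TEXT_BYTES:
        problems.append(f"text is {text_bytes} bytes (limit {MAX_TEXT_BYTES})")
    if prompt_bytes > MAX_PROMPT_BYTES:
        problems.append(f"prompt is {prompt_bytes} bytes (limit {MAX_PROMPT_BYTES})")
    if text_bytes + prompt_bytes > MAX_COMBINED_BYTES:
        problems.append(
            f"combined input is {text_bytes + prompt_bytes} bytes (limit {MAX_COMBINED_BYTES})"
        )
    return problems


def pcm_duration_seconds(num_bytes: int) -> float:
    """Estimate playback time of raw 16-bit mono 24 kHz PCM from its byte length."""
    return num_bytes / PCM_BYTES_PER_SECOND
```

Under these assumptions, the 655-second cap corresponds to 655 × 48,000 = 31,440,000 bytes of PCM, so output close to that size is a hint that the audio may have been truncated.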
