Google Vertex AI provides text-to-speech through Gemini TTS models. Portkey supports two approaches for TTS:
  1. Gemini TTS via Chat Completions - Use Gemini TTS models through the chat completions endpoint with speech_config or the OpenAI-compatible audio parameter (maps to the Vertex AI API)
  2. Cloud Text-to-Speech API - Use the OpenAI-compatible /audio/speech endpoint for Chirp and Gemini TTS voices (maps to the Cloud Text-to-Speech API)

Method 1: Gemini TTS via Chat Completions

This method uses the Vertex AI generateContent API internally and provides granular control over speech synthesis using speech_config or the OpenAI-compatible audio parameter.

Available Models

| Model ID | Optimized For | Speaker Support |
| --- | --- | --- |
| gemini-2.5-flash-tts | Low latency, everyday applications | Single & multi-speaker |
| gemini-2.5-pro-tts | High control, podcasts, audiobooks | Single & multi-speaker |
| gemini-2.5-flash-lite-preview-tts | Cost-efficient applications | Single speaker only |
| gemini-3.1-flash-tts-preview | Low latency with latest features | Single & multi-speaker |

Using speech_config (Vertex AI Native)

curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "Say the following in a cheerful way: Hello! Welcome to Portkey. We make AI applications reliable and production-ready."
      }
    ],
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "Kore"
        }
      },
      "language_code": "en-US"
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - output.wav

Because speech_config is not part of the OpenAI API specification, each SDK needs it passed as a provider-specific parameter:
  • Python SDK: Pass provider-specific parameters through the extra_body parameter
  • Node.js SDK: Pass additional parameters directly in the request object; the Portkey SDK's type definitions accept arbitrary extra parameters
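
As a rough sketch of the Python SDK route, the snippet below builds the same speech_config as the curl example and passes it through extra_body. The helper names build_speech_config and synthesize are hypothetical, and the client call assumes the portkey_ai package mirrors the OpenAI SDK's chat.completions.create signature, so verify it against your SDK version.

```python
import os


def build_speech_config(voice_name: str, language_code: str = "en-US") -> dict:
    """Build the single-speaker speech_config used in the curl example."""
    return {
        "voice_config": {"prebuilt_voice_config": {"voice_name": voice_name}},
        "language_code": language_code,
    }


def synthesize(text: str, voice_name: str = "Kore"):
    """Send a chat completion that carries speech_config via extra_body.

    Assumption: portkey_ai is installed and its client accepts
    chat.completions.create(..., extra_body=...) like the OpenAI SDK does.
    """
    from portkey_ai import Portkey  # lazy import; needs `pip install portkey-ai`

    client = Portkey(api_key=os.environ["PORTKEY_API_KEY"])
    return client.chat.completions.create(
        model="@vertex-ai/gemini-2.5-flash-tts",
        messages=[{"role": "user", "content": text}],
        # speech_config is not an OpenAI parameter, so it travels in extra_body
        extra_body={"speech_config": build_speech_config(voice_name)},
    )
```

Keeping the payload builder separate from the network call makes it easy to inspect or unit-test the request body without an API key.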

Using audio Parameter (OpenAI-Compatible)

For a simpler, OpenAI-compatible interface, use the audio parameter:
curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "Say the following warmly: Thank you for using our service today!"
      }
    ],
    "audio": {
      "voice": "Aoede"
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - output.wav

Response Format

The audio is returned in choices[0].message.audio.data as base64-encoded raw PCM (16-bit, 24 kHz, mono):
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "audio": {
          "id": "audio-xxx",
          "data": "<base64-encoded PCM audio>"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 100,
    "total_tokens": 125
  }
}
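
If ffmpeg is not available, the raw PCM payload can be wrapped in a WAV container with Python's standard library. The sketch below assumes the audio is 16-bit little-endian mono at 24 kHz, matching the ffmpeg flags used in the curl examples; the function name pcm_b64_to_wav_bytes is illustrative.

```python
import base64
import io
import wave


def pcm_b64_to_wav_bytes(audio_b64: str, sample_rate: int = 24000) -> bytes:
    """Wrap base64-encoded 16-bit mono PCM in a WAV container."""
    pcm = base64.b64decode(audio_b64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono, matching `-ac 1`
        wav.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)  # 24 kHz, matching `-ar 24k`
        wav.writeframes(pcm)
    return buf.getvalue()
```

With a parsed response dict, the call would look like `open("output.wav", "wb").write(pcm_b64_to_wav_bytes(resp["choices"][0]["message"]["audio"]["data"]))`.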

Multi-Speaker Synthesis

Generate conversations with multiple speakers using multi_speaker_voice_config:
curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "messages": [
      {
        "role": "user",
        "content": "TTS the following conversation between Alice and Bob:\nAlice: Hi Bob, how are you today?\nBob: I am doing great, thanks for asking!"
      }
    ],
    "speech_config": {
      "language_code": "en-US",
      "multi_speaker_voice_config": {
        "speaker_voice_configs": [
          {
            "speaker": "Alice",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "Kore"
              }
            }
          },
          {
            "speaker": "Bob",
            "voice_config": {
              "prebuilt_voice_config": {
                "voice_name": "Charon"
              }
            }
          }
        ]
      }
    }
  }' \
  | jq -r '.choices[0].message.audio.data' \
  | base64 -d \
  | ffmpeg -f s16le -ar 24k -ac 1 -i - conversation.wav
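
The nested multi_speaker_voice_config is repetitive to write by hand. As a minimal sketch, the hypothetical helpers below generate it from a speaker-to-voice mapping and build a matching transcript prompt in the same shape as the curl example; neither helper is part of any SDK.

```python
def build_multi_speaker_config(voices: dict, language_code: str = "en-US") -> dict:
    """Build speech_config from a {speaker name: voice name} mapping."""
    return {
        "language_code": language_code,
        "multi_speaker_voice_config": {
            "speaker_voice_configs": [
                {
                    "speaker": speaker,
                    "voice_config": {"prebuilt_voice_config": {"voice_name": voice}},
                }
                for speaker, voice in voices.items()
            ]
        },
    }


def build_conversation_prompt(turns: list) -> str:
    """Turn (speaker, line) pairs into a transcript-style prompt."""
    # dict.fromkeys keeps first-seen order while removing duplicate speakers
    speakers = list(dict.fromkeys(speaker for speaker, _ in turns))
    header = f"TTS the following conversation between {' and '.join(speakers)}:"
    lines = [f"{speaker}: {line}" for speaker, line in turns]
    return "\n".join([header, *lines])
```

The speaker names in the mapping should match the names used in the transcript, as they do for Alice and Bob in the request above.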

Method 2: Cloud Text-to-Speech API

This method uses Google’s Cloud Text-to-Speech API through the OpenAI-compatible /audio/speech endpoint. It supports both Gemini TTS and Chirp voices with more audio encoding options.

Basic Usage

curl https://api.portkey.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "input": "Hello! This is a test of the text to speech system.",
    "voice": "Kore",
    "response_format": "mp3"
  }' \
  --output speech.mp3

With Style Instructions

Use the instructions parameter to control the speaking style:
curl https://api.portkey.ai/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{
    "model": "@vertex-ai/gemini-2.5-flash-tts",
    "input": "Welcome to our podcast! Today we have an exciting episode for you.",
    "voice": "Aoede",
    "instructions": "Speak in an enthusiastic and energetic podcast host voice",
    "response_format": "mp3"
  }' \
  --output podcast_intro.mp3
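
The same request can be issued from Python without any SDK. The sketch below uses only the standard library and mirrors the curl call above; build_speech_request and save_speech are hypothetical helper names, and error handling is omitted.

```python
import json
import urllib.request

PORTKEY_SPEECH_URL = "https://api.portkey.ai/v1/audio/speech"


def build_speech_request(text, voice, api_key, instructions=None,
                         response_format="mp3",
                         model="@vertex-ai/gemini-2.5-flash-tts"):
    """Build (but do not send) a POST request for the /audio/speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if instructions:
        # Optional style guidance, e.g. "Speak in an enthusiastic voice"
        payload["instructions"] = instructions
    return urllib.request.Request(
        PORTKEY_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-portkey-api-key": api_key,
        },
        method="POST",
    )


def save_speech(request, path):
    """Send the request and write the binary audio response to a file."""
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

A typical call would be `save_speech(build_speech_request("Welcome!", "Aoede", os.environ["PORTKEY_API_KEY"]), "intro.mp3")` after `import os`.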

Supported Audio Formats

| Format | Content Type | Description |
| --- | --- | --- |
| mp3 | audio/mpeg | Compressed, widely compatible |
| opus | audio/ogg | High quality, efficient compression |
| wav | audio/wav | Uncompressed LINEAR16 |
| pcm | audio/L16 | Raw PCM audio |
| alaw | audio/alaw | A-law encoded audio |
| mulaw | audio/basic | μ-law encoded audio |

Voice Options

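Gemini TTS offers 30 distinct voices:

| Voice Name | Gender | Voice Name | Gender |
| --- | --- | --- | --- |
| Achernar | Female | Laomedeia | Female |
| Achird | Male | Leda | Female |
| Algenib | Male | Orus | Male |
| Algieba | Male | Pulcherrima | Female |
| Alnilam | Male | Puck | Male |
| Aoede | Female | Rasalgethi | Male |
| Autonoe | Female | Sadachbia | Male |
| Callirrhoe | Female | Sadaltager | Male |
| Charon | Male | Schedar | Male |
| Despina | Female | Sulafat | Female |
| Enceladus | Male | Umbriel | Male |
| Erinome | Female | Vindemiatrix | Female |
| Fenrir | Male | Zephyr | Female |
| Gacrux | Female | Zubenelgenubi | Male |
| Iapetus | Male | Kore | Female |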

Supported Languages

Gemini TTS supports 24+ languages in GA and 50+ in Preview. Common GA languages include:

| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English (US) | en-US | Japanese | ja-JP |
| English (India) | en-IN | Korean | ko-KR |
| French | fr-FR | Portuguese (Brazil) | pt-BR |
| German | de-DE | Spanish | es-ES |
| Hindi | hi-IN | Italian | it-IT |

Choosing the Right Method

| Feature | Chat Completions (Vertex AI API) | Audio Speech (Cloud TTS API) |
| --- | --- | --- |
| Endpoint | /v1/chat/completions | /v1/audio/speech |
| Audio Format | PCM 16-bit 24kHz only | MP3, WAV, Opus, PCM, etc. |
| Temperature Control | ✅ Supported | ❌ Not supported |
| Style Instructions | Via message content | Via instructions param |
| Multi-Speaker | ✅ Full control | ❌ Single speaker only |
| Streaming | ✅ Via SSE | ❌ Not supported |
| Text Input Streaming | Single request only | Multiple chunks supported |
| Best For | Real-time apps, multi-speaker | Simple TTS, format flexibility |

When to Use Vertex AI API (Chat Completions)

  • You need temperature control for creative/diverse output
  • You want multi-speaker conversations
  • You’re already using Vertex AI for other models
  • You need streaming audio output

When to Use Cloud TTS API (Audio Speech)

  • You need specific audio encoding formats (MP3, WAV, etc.)
  • You want a simpler OpenAI-compatible interface
  • You’re migrating from OpenAI TTS
  • You need to stream text input in multiple chunks

Prompting Tips

For detailed prompting strategies, see Google’s prompting tips.

Style Prompts

Control the speaking style through your message content:
Say the following in a calm, professional tone: [your text]
Narrate this like an audiobook narrator: [your text]
Speak with excitement and energy: [your text]

Markup Tags (Preview)

Use bracketed tags for specific effects:

| Tag | Effect |
| --- | --- |
| [sigh] | Inserts a sigh sound |
| [laughing] | Inserts a laugh |
| [uhm] | Inserts a hesitation |
| [whispering] | Decreases volume |
| [shouting] | Increases volume |
| [extremely fast] | Speeds up speech |
| [short pause] | ~250ms pause |
| [long pause] | ~1000ms+ pause |

Example:
Say: [sigh] I can't believe it's Monday again. [long pause] Well, let's get started!
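
Because a mistyped tag may end up being read aloud or ignored, it can help to lint prompts before sending them. The sketch below is a hypothetical helper that flags bracketed tags not listed on this page; the tag set is copied from the table above and may not be exhaustive, since the feature is in Preview.

```python
import re

# Tags documented on this page; Google may support others (assumption).
DOCUMENTED_TAGS = {
    "sigh", "laughing", "uhm", "whispering", "shouting",
    "extremely fast", "short pause", "long pause",
}

# Matches any [bracketed] span that contains no nested brackets
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def undocumented_tags(prompt: str) -> list:
    """Return bracketed tags in the prompt that are not in DOCUMENTED_TAGS."""
    return [
        tag for tag in TAG_PATTERN.findall(prompt)
        if tag.strip().lower() not in DOCUMENTED_TAGS
    ]
```

An empty result means every bracketed tag in the prompt appears in the table; anything returned is worth a second look before synthesis.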

Limits

| Description | Limit |
| --- | --- |
| Text field | ≤ 4,000 bytes |
| Prompt field | ≤ 4,000 bytes |
| Combined text + prompt | ≤ 8,000 bytes |
| Output audio duration | ~655 seconds max |

If input text results in audio longer than 655 seconds, the audio will be truncated.
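
The limits are expressed in bytes rather than characters, so non-ASCII text reaches them sooner than its character count suggests. The sketch below checks input sizes and estimates the duration of a raw PCM clip; it assumes the limits are counted in UTF-8 bytes and that audio is 16-bit mono at 24 kHz, and the function names are illustrative.

```python
MAX_TEXT_BYTES = 4000
MAX_PROMPT_BYTES = 4000
MAX_COMBINED_BYTES = 8000
MAX_AUDIO_SECONDS = 655


def check_input_limits(text: str, prompt: str = "") -> list:
    """Return a list of limit violations (empty if the input fits)."""
    # Assumption: the service counts UTF-8 encoded bytes
    text_bytes = len(text.encode("utf-8"))
    prompt_bytes = len(prompt.encode("utf-8"))
    problems = []
    if text_bytes > MAX_TEXT_BYTES:
        problems.append(f"text is {text_bytes} bytes (max {MAX_TEXT_BYTES})")
    if prompt_bytes > MAX_PROMPT_BYTES:
        problems.append(f"prompt is {prompt_bytes} bytes (max {MAX_PROMPT_BYTES})")
    if text_bytes + prompt_bytes > MAX_COMBINED_BYTES:
        problems.append(
            f"combined is {text_bytes + prompt_bytes} bytes (max {MAX_COMBINED_BYTES})"
        )
    return problems


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 24000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of a raw PCM buffer: bytes / (rate * bytes per sample * channels)."""
    return num_bytes / (sample_rate * sample_width * channels)
```

At 24 kHz, 16-bit mono, one second of audio is 48,000 bytes, so the 655-second ceiling corresponds to roughly 31.4 MB of raw PCM. A decoded clip whose duration sits right at that ceiling is a hint that the output may have been truncated.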
Last modified on May 10, 2026