Convert text to speech audio using specified voice and configuration

const options = {
  method: 'POST',
  headers: {Authorization: 'Bearer <token>', 'Content-Type': 'application/json'},
  body: JSON.stringify({
    text: '<string>',
    voiceId: '<string>',
    model: '<string>',
    language: '<string>',
    provider: '<string>',
    speed: 1.0, // documented range: 0.25 to 4.0, default 1.0
    vendorSpecific: {},
    inlinePronunciationRules: [{text: '<string>', alias: '<string>'}],
    pronunciationDictionaryId: '<string>'
  })
};

fetch('https://blackbox.dasha.ai/api/v1/voice/synthesize', options)
  .then(res => {
    // The endpoint returns an MP3 file, not JSON, so read the body as binary.
    if (!res.ok) throw new Error(`Synthesis failed with HTTP ${res.status}`);
    return res.blob();
  })
  .then(audio => console.log(`Received ${audio.size} bytes of audio`))
  .catch(err => console.error(err));


Synthesizes the provided text into high-quality audio using the specified voice configuration and returns the audio data in MP3 format, ready for playback or download. This endpoint powers voice preview in agent configuration interfaces as well as real-time audio generation.

Audio Configuration:

Voice Selection: Uses voiceId from the selected provider (ElevenLabs, Cartesia, Dasha)
Speed Control: Adjustable playback speed (0.25x to 4.0x, default: 1.0x)
Language: Automatic detection or explicit language specification
Quality: Provider-dependent quality settings and audio formats
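
As a minimal sketch of the speed range described above, a client could normalize the value before sending it. The helper name `normalizeSpeed` is hypothetical and not part of the API, and clamping out-of-range values (rather than rejecting them) is an assumption about what a caller might want.

```javascript
// Documented speed range for this endpoint: 0.25x to 4.0x, default 1.0x.
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;
const DEFAULT_SPEED = 1.0;

// Hypothetical helper (not part of the API): returns a speed value that is
// within the documented range, falling back to the default for missing or
// non-numeric input.
function normalizeSpeed(speed) {
  if (typeof speed !== 'number' || Number.isNaN(speed)) return DEFAULT_SPEED;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

console.log(normalizeSpeed(123));       // 4
console.log(normalizeSpeed(0.1));       // 0.25
console.log(normalizeSpeed(undefined)); // 1
```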

Provider-Specific Features:

ElevenLabs: Stability, similarity_boost, style, speaker_boost settings
Cartesia: Emotion controls, emphasis, and speed optimization
Dasha: Consistent quality with low-latency streaming options

Performance Characteristics:

Latency: 100-500ms depending on text length and provider
Audio Format: MP3 encoding at 22 kHz sample rate
File Size: ~1 KB per second of synthesized audio
Concurrent Limits: Rate limited per organization to prevent abuse

Common Use Cases:

Voice preview during agent configuration
Pre-generating audio for common phrases or greetings
Testing voice quality and settings before deployment
Creating audio samples for voice comparison
Batch audio generation for IVR systems

Text Processing Guidelines:

Maximum text length: 5000 characters per request
SSML tags supported for advanced speech control
Numbers, dates, and abbreviations automatically normalized
Punctuation affects speech rhythm and pauses
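
Text longer than the 5000-character limit has to be sent in several requests. The sketch below shows one possible way to split it, preferring sentence boundaries. `splitForSynthesis` is a hypothetical helper, the sentence detection is deliberately naive (it may break inside abbreviations or decimals), and it does not account for SSML tags, which should not be split across requests.

```javascript
const MAX_CHARS = 5000; // documented per-request limit

// Hypothetical helper: splits text into chunks of at most maxChars characters,
// preferring to break after sentence-ending punctuation (., !, ?).
function splitForSynthesis(text, maxChars = MAX_CHARS) {
  // Each match is a run of text, its closing punctuation, and trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    // Flush the current chunk if this sentence would push it over the limit.
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    // A single sentence longer than the limit is hard-split.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(splitForSynthesis('One. Two! Three?', 10)); // [ 'One. Two!', 'Three?' ]
```

Each returned chunk can then be sent as the `text` field of its own request.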

POST /api/v1/voice/synthesize


Body

Text synthesis configuration with voice, speed, and provider settings

Request DTO for TTS synthesis operations

text (string, required)
Text to synthesize into speech. Required string length: 1 - 5000.

voiceId (string, required)
Voice ID to use for synthesis. Minimum string length: 1.

model (string | null, required)
Model to use for synthesis.

language (string, required)
Language code for synthesis. Minimum string length: 1.

provider (string, required)
TTS provider name. Minimum string length: 1.

speed (number<double>)
Speech speed multiplier, from 0.25 to 4.0 (default: 1.0).

vendorSpecific (object)
Provider-specific configuration options.

inlinePronunciationRules (object[] | null)
Inline pronunciation rules for preview support. These rules are applied during synthesis without being stored in a dictionary. Rules are polymorphic: they derive from a base class that uses the discriminator pattern (TypeIndicatorConverter handles the polymorphic JSON serialization), and each rule takes one of two forms (Option 1 or Option 2).

pronunciationDictionaryId (string | null)
Pronunciation dictionary ID to use for synthesis. When provided, the dictionary rules are applied during synthesis.
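
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch only: `validateSynthesisBody` is a hypothetical helper, the sample values (`'en-US'`, `'ElevenLabs'`, `'voice-123'`) are placeholders rather than confirmed identifiers, and treating `model` as "key must be present, value may be null" is an assumption about how "required" combines with `string | null`.

```javascript
// Hypothetical client-side check mirroring the documented field constraints.
// Returns a list of problems; an empty list means the body looks valid.
function validateSynthesisBody(body) {
  const errors = [];
  const isNonEmptyString = v => typeof v === 'string' && v.length >= 1;

  if (!isNonEmptyString(body.text) || body.text.length > 5000) {
    errors.push('text must be a string of 1 to 5000 characters');
  }
  for (const field of ['voiceId', 'language', 'provider']) {
    if (!isNonEmptyString(body[field])) errors.push(`${field} must be a non-empty string`);
  }
  // ASSUMPTION: model is required but nullable, so the key must be present.
  if (!('model' in body) || (body.model !== null && typeof body.model !== 'string')) {
    errors.push('model must be present and be a string or null');
  }
  if (body.speed !== undefined && typeof body.speed !== 'number') {
    errors.push('speed must be a number');
  }
  return errors;
}

// Placeholder values for illustration only.
const requestBody = {
  text: 'Hello, thanks for calling.',
  voiceId: 'voice-123',
  model: null,
  language: 'en-US',
  provider: 'ElevenLabs',
  speed: 1.0
};
console.log(validateSynthesisBody(requestBody)); // []
```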

Response

Returns the synthesized audio as MP3

The response is of type file.
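
Because the response is a binary file, it should be read as bytes rather than parsed as JSON. The sketch below shows one way to do that, assuming a runtime with a global `fetch` (browsers, or Node.js 18+). `synthesizeToBytes`, the `fetchImpl` parameter, the `DASHA_TOKEN` environment variable, and the output file name are all hypothetical names chosen for illustration.

```javascript
const ENDPOINT = 'https://blackbox.dasha.ai/api/v1/voice/synthesize';

// Hypothetical helper: posts a synthesis request and resolves to the MP3 bytes.
// fetchImpl is injectable so the function can be exercised without network access.
async function synthesizeToBytes(body, token, fetchImpl = fetch) {
  const res = await fetchImpl(ENDPOINT, {
    method: 'POST',
    headers: {Authorization: `Bearer ${token}`, 'Content-Type': 'application/json'},
    body: JSON.stringify(body)
  });
  if (!res.ok) throw new Error(`Synthesis failed with HTTP ${res.status}`);
  // Read the body as raw bytes; the endpoint returns a file, not JSON.
  return new Uint8Array(await res.arrayBuffer());
}

// Usage in Node.js (inside an async function), writing the audio to disk:
//   const fs = require('node:fs/promises');
//   const bytes = await synthesizeToBytes(requestBody, process.env.DASHA_TOKEN);
//   await fs.writeFile('preview.mp3', bytes);
```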
