Real-Time AI Voice API Pricing Explained

Learn how real-time voice API pricing changes when streaming TTS, conversational voice, telephony, concurrency, audio duration, and agent minutes enter the model.

Clarify the spend threshold before you commit. Use this page when the core product is familiar and the real question is whether to stay free, upgrade, or switch pricing tracks.

Updated July 2, 2026
Guide

Start with the spend threshold and the conditions that change the pricing decision.

Short answer: latency, concurrency, streaming support, audio duration, telephony, and agent minutes can matter as much as per-character price. A character rate tells you what the text-to-speech model costs for submitted text, but a real-time voice product also has to pay for first-audio speed, partial-text streaming, speech input, call runtime, phone routing, retries, monitoring, and peak simultaneous sessions.

Why real-time voice pricing is not one meter

Streaming TTS, conversational voice, and agent-style voice are adjacent but not interchangeable buying routes. Streaming TTS starts from text and returns playable audio quickly. Conversational voice adds turn-taking, interruptions, speech input, LLM calls, and session state. Agent platforms add hosted call handling, phone numbers, evaluation, logs, tools, workspace controls, and sometimes telephony pass-through.

That is why a low per-character rate can still be the wrong headline metric. A support agent that speaks only half the time may still be billed for the full call path, while a narration workflow may care more about revisions, discarded takes, audio duration, and batch throughput. The practical comparison starts with the workload, not the vendor tier name.

Cartesia as the clean route example

Cartesia is the clearest example because its official pricing separates model credits from Line agent usage. The plan page gives every plan a monthly credit allowance plus prepaid agent dollars, while the billing docs explain that standard TTS is approximately one credit per character across bytes, SSE, and WebSocket endpoints. Pro Voice Clone output costs more per character, and speech-to-text is metered by audio seconds depending on endpoint and model.

Line voice agents sit on a different meter. Cartesia bills hosted agents per minute in US dollars, treats telephony from a Cartesia-provided phone number as an add-on per minute, and publishes agent slots plus concurrent-call limits by plan. For builders, that means the same project can need three estimates: characters for what the agent says, audio seconds for what it hears, and call minutes plus telephony for the agent session.
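As a rough sketch of those three estimates, the Python below keeps each meter as its own line item. Every rate and volume in it is a hypothetical placeholder, not a Cartesia price, and the names (`AgentMonth`, `PlaceholderRates`, `estimate_line_items`) are invented for illustration; substitute the figures from the official plan and billing pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentMonth:
    spoken_characters: int      # text the agent speaks (TTS meter)
    heard_audio_seconds: float  # caller audio sent to speech-to-text
    call_minutes: float         # hosted agent session time
    telephony_minutes: float    # minutes on a vendor-provided phone number


@dataclass(frozen=True)
class PlaceholderRates:
    # All values are made-up placeholders, not published prices.
    usd_per_character: float
    usd_per_stt_second: float
    usd_per_agent_minute: float
    usd_per_telephony_minute: float


def estimate_line_items(month: AgentMonth, rates: PlaceholderRates) -> dict[str, float]:
    """Return one USD line item per meter, plus a total."""
    items = {
        "tts_characters": month.spoken_characters * rates.usd_per_character,
        "stt_audio_seconds": month.heard_audio_seconds * rates.usd_per_stt_second,
        "agent_and_telephony": (
            month.call_minutes * rates.usd_per_agent_minute
            + month.telephony_minutes * rates.usd_per_telephony_minute
        ),
    }
    items["total"] = sum(items.values())
    return items


month = AgentMonth(spoken_characters=1_500_000, heard_audio_seconds=90_000,
                   call_minutes=3_000, telephony_minutes=3_000)
rates = PlaceholderRates(usd_per_character=0.00004, usd_per_stt_second=0.0001,
                         usd_per_agent_minute=0.05, usd_per_telephony_minute=0.01)
for name, usd in estimate_line_items(month, rates).items():
    print(f"{name}: ${usd:,.2f}")
```

Keeping the line items separate makes it easier to see which meter dominates once real rates are plugged in.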

The purchase boundary follows the architecture. If you only need low-latency speech output inside your own agent stack, model credits and TTS concurrency are the first things to test. If you want Cartesia to host the voice-agent layer, budget Line minutes, call concurrency, telephony, and any LLM or evaluation rules separately from the TTS credit pool.

How the comparison vendors split the bill

ElevenLabs has a similar route split but different units. Its API pricing page shows TTS model pricing per 1,000 characters and included character pools by plan, while its WebSocket docs position streaming input for partial text and word-alignment workflows. ElevenAgents is separate: the agents page lists billing by call minutes, included minutes, concurrent calls, text messages, burst pricing, and external provider costs such as LLM or telephony usage.

Fish Audio is cleaner for developer TTS math but has an important unit boundary. The official API pricing page says API access is pay-as-you-go with no monthly minimum, and TTS is metered per million UTF-8 bytes rather than by visible characters. ASR is billed per audio hour, voice design is billed per successful request, and concurrency limits rise with prepaid spend. The consumer plan page uses credits and generation-minute estimates, so app credits should not be treated as the same unit as API bytes.
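The gap between visible characters and UTF-8 bytes is easy to check. The sketch below uses only Python's built-in `str.encode`; the helper name `utf8_byte_count` and the sample strings are illustrative. Note that `len()` counts Unicode code points, which is close to, but not always the same as, what a reader sees as characters.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = [
    "Hello",                            # ASCII: 1 byte per character
    "caf\u00e9",                        # "cafe" with a precomposed e-acute (2 bytes)
    "\u3053\u3093\u306b\u3061\u306f",   # Japanese "konnichiwa": 3 bytes per character
]
for text in samples:
    print(f"{text!r}: {len(text)} code points, {utf8_byte_count(text)} bytes")
```

Accented or non-Latin scripts therefore consume a byte-metered allowance faster than an ASCII script of the same visible length, so estimate bytes from representative production text rather than from a character count.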

MiniMax Audio splits subscription audio points from API PayGo. The subscription page lists monthly audio points, voice slots, and RPM by tier, while PayGo prices current T2A speech models per million characters and prices voice cloning or voice design separately. Its WebSocket guide supports real-time TTS up to 10,000 characters per request, and its async long-TTS guide supports much larger long-form tasks. Based on the official pricing pages used here, that makes MiniMax a model and API comparison, not a hosted telephony-agent comparison.

Resemble AI uses per-second usage for the Flex route. Its pricing page bills text-to-speech, voice agents, and voice changer by seconds of content processed, with team seats and voice clones as monthly add-ons. Its streaming WebSocket docs add a plan boundary: the low-latency persistent WebSocket route is limited to Business plans and above, with global and per-key concurrency limits. Enterprise becomes relevant when the buyer needs higher concurrency, custom SLAs, model fine-tuning, or on-premise deployment.

Unreal Speech is the most straightforward TTS-only boundary in this set. Its pricing page sells character allowances with approximate audio-hour equivalents, and its product and docs pages emphasize fast streaming endpoints, timestamps, and long synthesis tasks. It is useful when the project mainly needs affordable generated speech, but the official pages do not present it as a hosted conversational-agent or telephony-pricing route.

Build the budget from the runtime path

Start by drawing one production path. For a streaming TTS feature, estimate monthly submitted characters or bytes, final audio minutes, retries, discarded generations, peak concurrent requests, output formats, and whether timestamps or WebSocket streaming are required. For a voice agent, estimate total call minutes, speaking ratio, listening audio, LLM traffic, telephony, no-answer calls, test traffic, interruptions, evaluation jobs, and simultaneous sessions.
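For the streaming TTS side, a minimal sketch of the submitted-text estimate might look like the following. It assumes, and this is only an assumption to verify against each vendor's billing docs, that retried and discarded generations are billed like delivered ones. The function name and the sample rates are hypothetical.

```python
def submitted_characters(delivered_chars: int, retry_rate: float, discard_rate: float) -> int:
    """Estimate characters submitted in a month.

    delivered_chars: characters in audio that actually reaches users.
    retry_rate:      fraction of delivered text re-sent after failures.
    discard_rate:    fraction of delivered text generated again and thrown away.
    Assumes retries and discards are billed (check the vendor's billing docs).
    """
    if retry_rate < 0 or discard_rate < 0:
        raise ValueError("rates must be non-negative")
    return round(delivered_chars * (1 + retry_rate + discard_rate))


# Made-up sample month: 2M delivered characters, 3% retries, 10% discarded takes.
print(submitted_characters(2_000_000, retry_rate=0.03, discard_rate=0.10))
```

The same overhead multiplier can be applied to bytes or audio seconds when the vendor meters those instead of characters.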

Then separate app, API, and agent spend. A creator plan may be enough for voiceovers, previews, and manual exports. A direct API route is the right budget lane when your product sends text or receives audio programmatically. An agent route is a different purchase when the vendor is managing conversation turns, calls, logs, tools, phone numbers, and concurrency.

Do not force every vendor into one cost-per-character table. Translate each official meter into the same sample month instead: how much text goes in, how much audio comes out, how many minutes of live interaction run, how many sessions overlap, and which external systems are billed outside the voice vendor. That is the only way to compare Cartesia credits, ElevenLabs characters and call minutes, Fish UTF-8 bytes, MiniMax characters or audio points, Resemble seconds, and Unreal Speech character packages.
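One way to sketch that translation is to describe the sample month once, in base units, and let each meter pick the quantity it counts. The meter labels below mirror the unit types discussed above, but every price and volume is a made-up placeholder, and the `Meter` class and `monthly_usd` function are hypothetical names, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meter:
    label: str
    quantity: str        # key into the workload dict this meter counts
    unit_size: float     # base units per billed unit (e.g. 1_000 characters)
    usd_per_unit: float  # PLACEHOLDER price, not a published rate


# One sample month, expressed in base units (all volumes are made up).
workload = {
    "characters": 3_000_000,
    "utf8_bytes": 3_300_000,
    "audio_seconds": 216_000,
    "call_minutes": 4_000,
}

meters = [
    Meter("per character", "characters", 1, 0.00004),
    Meter("per 1,000 characters", "characters", 1_000, 0.05),
    Meter("per million UTF-8 bytes", "utf8_bytes", 1_000_000, 15.0),
    Meter("per second of audio", "audio_seconds", 1, 0.0005),
    Meter("per call minute", "call_minutes", 1, 0.08),
]


def monthly_usd(meter: Meter, workload: dict[str, float]) -> float:
    """Cost of the sample month under one meter."""
    return workload[meter.quantity] / meter.unit_size * meter.usd_per_unit


for meter in meters:
    print(f"{meter.label:<26} ${monthly_usd(meter, workload):>10,.2f}")
```

The point of the structure is that the workload stays fixed while the meters vary, which is the reverse of a cost-per-character table.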

Final decision boundary

Choose a streaming TTS provider first when the product already owns the agent orchestration and only needs fast, controllable speech output. In that case, latency, streaming API behavior, supported formats, model quality, voice controls, and concurrency caps matter more than workspace seats.

Choose an agent-style route when the buyer wants the vendor to handle live calls, turn-taking, phone integration, logs, evaluations, and operations around conversations. In that case, agent minutes, concurrent calls, telephony, LLM pass-through, no-answer traffic, and enterprise controls may dominate the final bill.
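To size concurrent calls before comparing plan limits, a simple starting point is Little's law: average simultaneous calls equal the arrival rate multiplied by the average call duration. The sketch below applies it to the busiest hour and adds a headroom multiplier. The multiplier, the function name, and the sample numbers are assumptions for illustration, and real traffic bursts can exceed an hourly average.

```python
import math


def concurrent_calls_needed(calls_per_hour: float, avg_call_minutes: float,
                            headroom: float = 1.5) -> int:
    """Average simultaneous calls (Little's law) scaled by a headroom factor."""
    average_concurrency = (calls_per_hour / 60.0) * avg_call_minutes
    return math.ceil(average_concurrency * headroom)


# Made-up peak hour: 120 calls arriving, 4 minutes each on average.
print(concurrent_calls_needed(calls_per_hour=120, avg_call_minutes=4))
```

Compare the result with the plan's published concurrent-call limit rather than with its included minutes, since the two caps are independent.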

Treat Deepgram and OpenAI as deferred context unless the stack decision explicitly expands to STT-first or full realtime speech-to-speech routes. They should be compared in a separate source-backed pass, with their own official pricing sources and route assumptions, rather than folded into this TTS and voice-agent comparison.

FAQ

Common questions

Why can per-character TTS pricing be misleading for real-time voice apps?

Per-character pricing only covers the text that becomes speech. A real-time voice app can also pay for speech input, call runtime, telephony, concurrency, retries, LLM usage, no-answer calls, testing traffic, logs, and evaluations.

Should Cartesia credits, ElevenLabs characters, and Fish Audio UTF-8 bytes be compared directly?

No. Convert each vendor unit into the same sample workload first. Estimate submitted text, generated audio, listening audio, call minutes, concurrent sessions, and retries, then map that workload to each official meter.

When do agent minutes matter more than TTS rates?

Agent minutes matter more when the vendor hosts the conversational layer or call path. Costs for support lines, outbound calls, IVR replacements, and sales agents can be driven more by total call time, telephony, and concurrency than by the amount of speech generated.

What is the difference between streaming TTS and a voice-agent platform?

Streaming TTS turns text into audio quickly, often through HTTP, SSE, or WebSocket routes. A voice-agent platform usually adds listening, turn-taking, LLM routing, tools, phone handling, logs, evaluations, and operational controls around the conversation.

How should telephony be included in a voice API pricing model?

Treat telephony as its own line item. Include phone-number ownership, provider pass-through, inbound and outbound minutes, failed or unanswered calls, testing traffic, and whether the voice vendor charges extra for using its provisioned numbers.
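A telephony line item along those lines can be sketched as below. All rates are hypothetical placeholders, the class and function names are invented for illustration, and two modeling choices are assumptions: unanswered calls are billed a fixed fraction of a minute, and test traffic is billed at the outbound rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelephonyMonth:
    phone_numbers: int
    inbound_minutes: float
    outbound_minutes: float
    unanswered_calls: int
    test_minutes: float


@dataclass(frozen=True)
class PlaceholderTelephonyRates:
    # Made-up values; replace with carrier and voice-vendor figures.
    usd_per_number: float
    usd_per_inbound_minute: float
    usd_per_outbound_minute: float
    billed_minutes_per_unanswered_call: float  # assumption: a minimum is billed
    usd_vendor_surcharge_per_minute: float     # extra for vendor-provisioned numbers


def telephony_usd(m: TelephonyMonth, r: PlaceholderTelephonyRates) -> float:
    """Monthly telephony cost, kept separate from TTS and agent meters."""
    unanswered_minutes = m.unanswered_calls * r.billed_minutes_per_unanswered_call
    outbound_like = m.outbound_minutes + unanswered_minutes + m.test_minutes
    total_minutes = m.inbound_minutes + outbound_like
    return (
        m.phone_numbers * r.usd_per_number
        + m.inbound_minutes * r.usd_per_inbound_minute
        + outbound_like * r.usd_per_outbound_minute
        + total_minutes * r.usd_vendor_surcharge_per_minute
    )


sample_month = TelephonyMonth(phone_numbers=2, inbound_minutes=2_000,
                              outbound_minutes=1_000, unanswered_calls=400,
                              test_minutes=100)
sample_rates = PlaceholderTelephonyRates(usd_per_number=1.0,
                                         usd_per_inbound_minute=0.01,
                                         usd_per_outbound_minute=0.02,
                                         billed_minutes_per_unanswered_call=0.25,
                                         usd_vendor_surcharge_per_minute=0.005)
print(f"${telephony_usd(sample_month, sample_rates):,.2f}")
```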

Where do OpenAI and Deepgram fit if they are not in this comparison?

Keep them as deferred context unless the buying question expands to realtime speech-to-speech or STT-first stacks. They need their own official pricing sources and route assumptions before they can be compared fairly with this voice API set.
