
AI Voice API Pricing Explained: Characters, Minutes, Credits, and App Boundaries

AI voice API pricing separates app subscriptions from API usage. Compare characters, requests, bytes, seconds, minutes, credits, agent minutes, and enterprise terms before modeling spend.

Clarify the spend threshold before you commit. Use this page when the core product is familiar and the real question is whether to stay free, upgrade, or switch pricing tracks.

Updated July 2, 2026

Editorial guide


Start with the spend threshold and the conditions that change the pricing decision.

Short answer: AI voice API pricing is usually separate from app subscriptions. A creator or reader plan may unlock a hosted studio, human seats, exports, or a monthly allowance, while the API can meter characters, requests, seconds, minutes, bytes, credits, agent minutes, or enterprise usage under a different billing route.

That split is the most important pricing fact to understand before building. The right question is not only "which plan is cheapest?" It is "which meter grows when our product succeeds?" A small narration workflow, a call-center voice agent, a multilingual dubbing pipeline, and a developer API integration can all use AI voice, but they do not create the same bill.

Why app plans and API pricing split

App subscriptions usually price a human workflow. They may include access to a web editor, a voice library, project storage, commercial export rights, team seats, or a monthly bundle of credits. That route is usually the right starting point when a creator logs in, edits audio manually, and exports finished work from the vendor's interface.

API pricing prices a software workload. Your application sends text, receives audio, transcribes calls, clones voices, changes voices, or runs a real-time agent. The vendor then has to meter infrastructure, model class, latency, throughput, retries, concurrency, telephony, abuse controls, and support obligations. A subscription allowance that feels generous for one editor can be the wrong budget model for production traffic.

The boundary is practical: if people are using the vendor's app, compare app plans. If your product calls endpoints, streams audio, handles sessions, or redistributes generated speech, compare the API meter and its usage limits. Some vendors connect the two through shared credits or dashboards, but the workload still needs its own estimate.

Pricing units to map before launch

| Pricing unit | What it measures | Official examples from voice vendors | Buyer check |
| --- | --- | --- | --- |
| Characters | Text input sent to text-to-speech | ElevenLabs API text-to-speech, Unreal Speech plan allowances, MiniMax T2A PayGo, Speechify API | Confirm whether spaces, SSML, retries, failed requests, and model variants count. |
| UTF-8 bytes | Encoded text input size | Fish Audio API text-to-speech | Model multilingual scripts by encoded size, not just visible character count. |
| Requests | Successful endpoint operations | Fish Audio voice design charges per successful API request | Ask whether failed validation, authentication errors, or multiple candidates are billed. |
| Seconds | Fine-grained processed audio or converted audio | Cartesia STT and voice changer, Resemble AI voice generation, voice agents, and voice changer | Check rounding, silence, idle time, and whether input or output duration drives the bill. |
| Minutes or hours | Audio duration, source media duration, or connected agent time | ElevenLabs speech-to-text, dubbing, Speech Engine, Cartesia Line agents, Fish Audio ASR, Speechify API, Unreal Speech hour estimates | Separate generated audio, transcribed audio, source dubbing minutes, and live agent time. |
| Credits | Vendor wallet or subscription allowance | ElevenLabs creative credits, Cartesia credits, Fish Audio account credits, Resemble Flex credits | Do not compare credits across vendors unless the conversion rules are explicit. |
| Agent minutes | Time a real-time voice agent is active | Cartesia Line voice agents, ElevenLabs Speech Engine, Resemble voice agents | Include no-answer calls, transfers, telephony add-ons, concurrency, and testing traffic. |
| Seats and voice add-ons | Human access or hosted voice assets | Resemble team seats and voice add-ons, Speechify enterprise seats, workspace-style routes | Keep collaboration cost separate from generation volume. |
| Enterprise usage | Negotiated volume, compliance, deployment, or support | ElevenLabs Enterprise, Cartesia Enterprise, Fish Audio enterprise limits, MiniMax subscriptions, Resemble Enterprise, Speechify Enterprise | Ask about commits, overages, SLAs, SSO, data retention, private deployment, and rights. |

The table is a planning aid, not a universal exchange rate. A character-based service can be predictable for scripted TTS but weak for long live calls. A minute-based agent route can match support operations but overstate value for short generated clips. A credit wallet can simplify procurement inside one vendor but obscure the model-specific cost of premium voices, cloning, dubbing, or speech-to-text.
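The difference between a character meter and a UTF-8 byte meter is easy to check on your own scripts before modeling spend. The following is a minimal Python sketch, assuming made-up sample strings; the helper name `meter_sizes` is illustrative and not part of any vendor's API, and a vendor may count characters differently from Python's `len`.

```python
def meter_sizes(text: str) -> dict:
    """Return the size of `text` under a character meter and a UTF-8 byte meter."""
    return {
        # len() counts Unicode code points; a vendor's character count
        # may follow a different rule (assumption to verify per vendor).
        "characters": len(text),
        # Size of the text once encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Illustrative sample scripts, not taken from any vendor documentation.
samples = {
    "English": "Hello, world",
    "Spanish": "¿Cómo estás?",
    "Japanese": "こんにちは世界",
}

for language, script in samples.items():
    sizes = meter_sizes(script)
    print(f"{language}: {sizes['characters']} characters, {sizes['utf8_bytes']} UTF-8 bytes")
```

ASCII text costs the same under both meters, while accented Latin text and CJK text grow under a byte meter, which is why a multilingual product should be estimated by encoded size when the vendor bills bytes.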

Official examples to calibrate the model

ElevenLabs shows the boundary clearly because it has both app-style pricing and a dedicated API pricing surface. Its API page says usage is billed in US dollars rather than credits, with text-to-speech billed by characters, speech-to-text by audio duration, Speech Engine by call minutes, and dubbing by source audio minute. That makes the API route the right model for embedded products, even when the same account also has creative-plan credits.

Cartesia is a strong real-time example. Its pricing page packages credits and prepaid agent dollars by plan, while its docs explain that model usage is metered in credits and hosted Line agents are billed in dollars per minute. Standard TTS costs approximately one credit per character, STT depends on endpoint and audio duration, voice changer is charged per second, and agent calling has a per-minute rate plus a telephony add-on when applicable.

Fish Audio is useful because its developer docs avoid the subscription-plan assumption. The API is described as pay-as-you-go with no subscription fee or monthly minimum for API access. TTS is priced by millions of UTF-8 bytes, ASR by audio hour with duration rounded up to the nearest second, and voice design by successful API request. Its concurrency tiers also show why throughput can become a buying constraint before total spend is large.
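Rounding rules matter most when a workload has many short clips. The sketch below shows how an hourly ASR rate interacts with rounding each file up to a whole second. It assumes rounding is applied per file, and the rate is a placeholder rather than a published price; check the vendor's docs for the actual rate and for where rounding is applied.

```python
import math


def asr_cost(durations_seconds, rate_per_hour):
    """Estimate ASR cost when each file's duration is rounded up to a whole second.

    Assumption: rounding happens per file, not on the monthly total.
    """
    billable_seconds = sum(math.ceil(d) for d in durations_seconds)
    cost = billable_seconds / 3600 * rate_per_hour
    return billable_seconds, cost


# Placeholder value for illustration only, NOT a real vendor price.
PLACEHOLDER_RATE_PER_HOUR = 1.00

clips = [0.4, 12.2, 59.01]  # hypothetical clip lengths in seconds
seconds, cost = asr_cost(clips, PLACEHOLDER_RATE_PER_HOUR)
print(f"raw: {sum(clips):.2f} s, billable: {seconds} s, cost: ${cost:.4f}")
```

Here three clips totalling about 71.6 seconds bill as 74 seconds. The gap is small per file but scales with the number of files, not with total audio length.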

Unreal Speech is closer to a direct TTS API planning model. Its pricing page presents character allowances with approximate generated audio hours across free, Basic, Plus, Pro, Enterprise, and Custom routes. That is easier to forecast for batch narration, prompts, and scripted generation, but buyers still need to check whether the plan's throughput, model availability, commercial terms, and high-volume rules fit the production path.

MiniMax Audio separates API PayGo from fixed subscription quotas. Its product pricing overview frames API Pricing as real-time per-call billing and Subscription Plans as fixed monthly quotas, while PayGo audio pricing lists T2A models by dollars per million characters and separate fees for rapid voice cloning and voice design. That split is exactly why a buyer should decide whether they need a platform key, a subscription key, or a negotiated route.

Resemble AI combines pay-as-you-go usage with add-ons and enterprise paths. Its Flex plan starts at zero, lets teams load credits, includes full API access, and adds seats or voice capabilities as needed. The pricing page also lists per-second rates for voice generation, voice agents, and voice changer, plus monthly charges for team seats and voice clones. Enterprise becomes relevant for higher concurrency, SLAs, SSO, custom training, and on-premise deployment.

Speechify highlights the app-versus-API risk. Its consumer pricing page sells a reader subscription, while the API pricing page separately describes a free API starter allowance, pay-as-you-go pricing per million characters, included TTS minutes, voice cloning availability, and an enterprise route with custom terms. A personal or creator subscription should not be assumed to cover embedded developer usage.

App-vs-API boundary for buyers and developers

Start by naming the billable actor. If the actor is a person making audio in a hosted workspace, start with the subscription plan and check exports, commercial rights, seats, storage, collaboration, voice cloning, and included credits. If the actor is your application, start with the API meter and check endpoints, authentication, model class, retries, rate limits, concurrency, data handling, and redistribution rights.

Then build a usage ledger from the product workflow. For TTS, count scripts, average characters or bytes, variants, retries, premium voices, and monthly growth. For transcription or dubbing, count input audio, source media minutes, target languages, review passes, and reprocessing. For agents, count average session length, no-answer calls, transfers, silence, testing traffic, concurrent sessions, and telephony. For voice cloning, count training events, hosted voices, consent workflow, and whether the clone is temporary or production-grade.
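A usage ledger for the TTS case can be as small as a few fields and one multiplication. This is a rough sketch under stated assumptions: the class and field names are hypothetical, retries are assumed to be billed in full, and growth is assumed to compound on script count only. Adjust each assumption to the vendor's actual billing rules.

```python
from dataclasses import dataclass, replace


@dataclass
class TtsLedger:
    scripts_per_month: int      # finished scripts requested per month
    avg_characters: int         # average characters per script
    variants_per_script: float  # voices, languages, or alternate takes per script
    retry_rate: float           # fraction of generations repeated (0.10 = 10%)

    def billable_characters(self) -> int:
        # Assumption: every variant and every retry is billed in full.
        generations = self.scripts_per_month * self.variants_per_script
        return round(generations * self.avg_characters * (1 + self.retry_rate))


def projected_characters(ledger: TtsLedger, monthly_growth: float, months: int) -> list:
    """Project billable characters per month with compound growth in script count."""
    projection = []
    for month in range(months):
        scripts = round(ledger.scripts_per_month * (1 + monthly_growth) ** month)
        projection.append(replace(ledger, scripts_per_month=scripts).billable_characters())
    return projection


# Hypothetical workload figures for illustration.
ledger = TtsLedger(scripts_per_month=2000, avg_characters=800,
                   variants_per_script=1.5, retry_rate=0.10)
print(ledger.billable_characters())
print(projected_characters(ledger, monthly_growth=0.20, months=3))
```

The same shape works for bytes, source minutes, or agent minutes: swap the per-item size field for the unit the vendor actually meters.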

Run the estimate twice: once for a normal month and once for a spike month. Voice costs can stay tiny during prototyping and then jump when every user action creates personalized speech, calls an agent, or repeats a failed generation. Include failed experiments and quality-control takes, because audio workflows often consume more than the final exported file suggests.
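Running the estimate twice can be sketched as one function applied to two sets of inputs. All numbers below are invented for illustration, including the unit price, and `waste_rate` is a hypothetical parameter standing in for failed generations, experiments, and quality-control takes.

```python
def monthly_spend(billable_units: float, unit_price: float, waste_rate: float) -> float:
    """Spend for one month; waste_rate adds failed experiments and QC takes."""
    return billable_units * (1 + waste_rate) * unit_price


# Placeholder price per billable unit, NOT a real vendor rate.
PLACEHOLDER_UNIT_PRICE = 0.05

# Hypothetical scenarios: a spike month has more usage and more wasted takes.
scenarios = {
    "normal": {"billable_units": 10_000, "waste_rate": 0.10},
    "spike": {"billable_units": 40_000, "waste_rate": 0.25},
}

estimates = {
    name: monthly_spend(unit_price=PLACEHOLDER_UNIT_PRICE, **inputs)
    for name, inputs in scenarios.items()
}
spike_ratio = estimates["spike"] / estimates["normal"]

for name, spend in estimates.items():
    print(f"{name}: ${spend:,.2f}")
print(f"spike is {spike_ratio:.1f}x the normal month")
```

Comparing the ratio rather than the absolute figures makes it easier to see whether an included allowance, an overage rule, or a concurrency tier would break first during a spike.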

Before paying, verify the exact billing unit, included allowance, rounding behavior, overage rule, credit expiration, model multipliers, workspace ownership, commercial rights, cancellation terms, and enterprise trigger. The safest purchase path is the one where the vendor's meter matches the workload you can actually forecast.

FAQ

Common questions

Is AI voice API pricing included in a normal app subscription?

Usually no. App subscriptions often cover a hosted studio, creator workflow, reader app, or workspace allowance. API calls can be billed separately by characters, bytes, minutes, seconds, requests, credits, agent time, or negotiated enterprise usage.

Which voice API pricing unit should developers estimate first?

Start with the unit that grows with the product workflow. Scripted TTS usually starts with characters or bytes, speech-to-text starts with audio duration, dubbing starts with source minutes and target languages, and voice agents start with connected minutes plus concurrency.

Are credits comparable across ElevenLabs, Cartesia, Fish Audio, Resemble, or Speechify?

No. Credits are vendor-specific. They can help compare tiers inside one vendor, but they should not be treated as a common currency unless the vendor publishes exactly how credits convert into characters, seconds, minutes, cloning, or agent usage.

When do agent minutes matter more than text-to-speech characters?

Agent minutes matter when the product runs live conversations, phone calls, or real-time support workflows. In that case, session length, idle time, transfers, telephony, and concurrent calls can drive spend more than the text that the agent speaks.

When should a buyer ask for enterprise voice pricing?

Ask for enterprise pricing when public self-serve plans do not cover the needed volume, concurrency, security review, data retention, SSO, support SLA, custom voice rights, on-premise deployment, procurement terms, or predictable overage rules.
