AI Voice Generator Pricing: Characters vs Minutes vs Credits
AI voice pricing uses different meters: characters, UTF-8 bytes, generated or dubbing minutes, credits, API usage, seats, concurrency, and contracts. Compare the real workload instead of plan names.
Clarify the spend threshold before you commit. Use this page when the core product is familiar and the real question is whether to stay free, upgrade, or switch pricing tracks.
Start with the spend threshold and the conditions that change the pricing decision.
Short answer: voice tools price usage through different units, so buyers should compare real output workload instead of plan names. A creator subscription, API route, dubbing package, and enterprise contract can all describe voice generation, but they may count different things: text length, encoded bytes, finished audio, source-video minutes, credit balances, human seats, real-time sessions, concurrent calls, or custom capacity.
Start with the voice workload
Begin with the job the buyer will repeat. A narration workflow starts with scripts, pickups, alternate reads, and discarded takes. A dubbing workflow starts with source media length, target languages, transcript edits, lip-sync or voice-clone steps, and reviewer passes. A product integration starts with API requests, text volume, voice-agent sessions, latency needs, and rate limits. A team rollout starts with seats, shared workspaces, permissions, and billing ownership.
This matters because the same vendor can expose several buying paths. ElevenLabs is a clear example: official product pricing uses credits for self-serve creative plans, while official API pricing separates metered routes such as text-to-speech characters and dubbing minutes. Speechify also separates reader subscriptions, Studio credits, and API pricing. Cartesia, Fish Audio, MiniMax Audio, Rask AI, and Unreal Speech all reinforce the same rule: compare the unit that grows with your actual workload, not the marketing label attached to a plan.
The practical first step is a usage ledger. Write one representative month in plain terms: scripts written, characters submitted, final audio minutes exported, localization minutes processed, languages produced, voice-agent minutes handled, users invited, and peak simultaneous sessions. Then translate that ledger into each vendor's official meter.
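The ledger described above can be sketched as a small data structure. This is only an illustration: the field names, the sample numbers, and the `METER_TO_FIELD` mapping are hypothetical and do not come from any vendor's pricing page.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsageLedger:
    """One representative month of voice workload (illustrative fields only)."""
    scripts_written: int
    characters_submitted: int
    final_audio_minutes: float
    localization_source_minutes: float
    target_languages: int
    agent_minutes: float
    seats: int
    peak_concurrent_sessions: int


# Hypothetical mapping from a pricing meter to the ledger field it tracks.
# Real vendors define their own meters; confirm each one on the official page.
METER_TO_FIELD = {
    "characters": "characters_submitted",
    "audio_minutes": "final_audio_minutes",
    "source_minutes": "localization_source_minutes",
    "agent_minutes": "agent_minutes",
    "seats": "seats",
    "concurrency": "peak_concurrent_sessions",
}


def quantity_for_meter(ledger: MonthlyUsageLedger, meter: str):
    """Return the ledger quantity that a given meter would bill against."""
    return getattr(ledger, METER_TO_FIELD[meter])


# Made-up sample month, used only to show the lookup.
ledger = MonthlyUsageLedger(
    scripts_written=40,
    characters_submitted=320_000,
    final_audio_minutes=180.0,
    localization_source_minutes=60.0,
    target_languages=3,
    agent_minutes=0.0,
    seats=4,
    peak_concurrent_sessions=6,
)

for meter in METER_TO_FIELD:
    print(meter, quantity_for_meter(ledger, meter))
```

Keeping the ledger separate from any price list means the same month can be re-read against each vendor's meter without re-estimating the workload.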
Characters, UTF-8 bytes, and generated audio
Character pricing counts text input. It is common in voice APIs because the system can meter the text sent to the model before audio is generated. ElevenLabs API pricing uses character-style metering for text-to-speech, MiniMax PayGo pricing lists text-to-audio model charges by character volume, and Unreal Speech organizes its API plans around character allowances with approximate generated-audio output. For buyers, the useful estimate is not a plan name; it is script length plus revisions, retries, variants, and any discarded output.
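As a rough sketch of that estimate, the function below multiplies script length by the number of full submissions and then adds a retry allowance. The function name, parameters, and the assumption that every revision and variant resubmits the whole script are illustrative simplifications, not any vendor's billing rule.

```python
def estimate_billable_characters(
    script_chars: int,
    revision_rounds: int,
    variants: int,
    retry_rate: float,
) -> int:
    """Estimate characters submitted for one script, including rework.

    Assumes (hypothetically) that each revision round and each alternate
    variant resubmits the full script, and that `retry_rate` is the
    fraction of submissions regenerated and discarded.
    """
    submissions = 1 + revision_rounds + variants
    base = script_chars * submissions
    return round(base * (1 + retry_rate))


# Example: a 10,000-character script, 2 revision rounds, 1 alternate read,
# and a quarter of all submissions regenerated.
total = estimate_billable_characters(10_000, revision_rounds=2, variants=1, retry_rate=0.25)
print(total)  # 50000
```

In this made-up case the billable volume is five times the length of the script that ships, which is why the final script alone is a poor basis for choosing a tier.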
UTF-8 bytes are a related but stricter API meter. Fish Audio's official API pricing uses UTF-8 byte volume for text-to-speech, so buyers working across languages should not assume that visible character count and billable input size always move together. A multilingual support script, training module, or audiobook chapter can have different billable weight once text is encoded.
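The gap between character count and encoded size is easy to check locally. The sketch below uses only Python's built-in `len` and `str.encode`; the sample strings are arbitrary, and how a given vendor actually counts billable input should be confirmed against its own documentation.

```python
def text_weight(text: str) -> tuple[int, int]:
    """Return (code point count, UTF-8 byte count) for a string.

    Note: len() counts Unicode code points, which can differ from what a
    reader perceives as visible characters (for example, combined emoji).
    """
    return len(text), len(text.encode("utf-8"))


samples = {
    "English": "Hello, world",
    "German": "Grüße",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    chars, utf8_bytes = text_weight(text)
    print(f"{language}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

ASCII text encodes to one byte per character, while the Japanese sample encodes to three bytes per character, so the same visible length can carry a different billable weight under a byte meter.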
Generated audio minutes measure the speech you create or export. This is easier for teams that plan finished assets by duration, such as course lessons, ads, prompts, or narration blocks. The caveat is that generation experiments still matter. Multiple voices, speed changes, revised takes, quality checks, and unusable outputs can consume allowance even if only one final file ships. When a vendor presents credits with approximate generated minutes, treat that estimate as a product-specific conversion, not a universal exchange rate.
Credits, dubbing minutes, and localization scope
Credits are vendor-defined wallets. ElevenLabs, Cartesia, Fish Audio, and Speechify Studio all use credit-style allowances in official pricing, while MiniMax Audio uses audio points for its subscription route. Those units can be helpful for comparing tiers inside one product, but they are not portable across vendors. A credit may map to TTS, STT, voice design, dubbing, cloning, or other features depending on the product's own rules.
Dubbing and localization minutes are different from ordinary narration minutes because the source media drives the job. Rask AI's official pricing centers on minutes for translation and dubbing workflows. ElevenLabs also separates dubbing from ordinary TTS in its official pricing. For a localization buyer, one source video can multiply into several target-language outputs, review passes, subtitles, voice-clone checks, and team approvals. Model source minutes, target languages, and editorial review separately.
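One way to keep those three inputs separate is sketched below. It assumes, hypothetically, that a vendor bills one minute per source minute per target language and that review-driven reprocessing is billed again; neither assumption is taken from a specific vendor, and the function and parameter names are illustrative.

```python
def dubbing_minutes_breakdown(
    source_minutes: float,
    target_languages: int,
    reprocess_fraction: float = 0.0,
) -> dict[str, float]:
    """Split a dubbing estimate into first-pass and reprocessing minutes.

    `reprocess_fraction` is the share of first-pass output that reviewers
    send back for another billed pass (a hypothetical planning input).
    """
    first_pass = source_minutes * target_languages
    reprocessing = first_pass * reprocess_fraction
    return {
        "first_pass": first_pass,
        "reprocessing": reprocessing,
        "total": first_pass + reprocessing,
    }


# Example: 30 source minutes, 4 target languages, a quarter reprocessed.
estimate = dubbing_minutes_breakdown(30, target_languages=4, reprocess_fraction=0.25)
print(estimate)  # {'first_pass': 120, 'reprocessing': 30.0, 'total': 150.0}
```

Returning the breakdown rather than a single total keeps the editorial-review share visible, since that is the part most likely to change between projects.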
Do not mix credits and dubbing minutes without a conversion test. A generous credit bundle may be attractive for short text-to-speech generation but less predictable for long-form localization. A minute-based dubbing plan may be easier to budget for video teams but less useful for a product that needs small API calls all day. The right unit is the one that tracks the work you cannot avoid.
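A conversion test can be as simple as running one representative job and reading the balance before and after. The sketch below assumes the buyer can see a credit balance in the product; the function names and the sample figures are hypothetical and do not reflect any vendor's real rates.

```python
def effective_credits_per_minute(
    balance_before: float,
    balance_after: float,
    sample_minutes: float,
) -> float:
    """Derive an observed credits-per-minute rate from one sample job."""
    if sample_minutes <= 0:
        raise ValueError("sample_minutes must be positive")
    return (balance_before - balance_after) / sample_minutes


def projected_credits(rate_per_minute: float, monthly_minutes: float) -> float:
    """Project monthly credit consumption from the observed rate."""
    return rate_per_minute * monthly_minutes


# Made-up example: a 3-minute sample job moved the balance from 10,000 to 9,400.
rate = effective_credits_per_minute(10_000, 9_400, sample_minutes=3)
print(rate)                         # 200.0 credits per minute
print(projected_credits(rate, 90))  # 18000.0 credits for 90 minutes
```

The observed rate only holds for the feature, voice, and settings used in the sample, so the test is worth repeating for each workload type in the ledger.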
API usage, agents, seats, concurrency, and enterprise contracts
API usage should be its own budget lane. App subscriptions usually support people creating audio in a web product. APIs support software that sends text, receives audio, manages voices, or runs a real-time workflow. MiniMax PayGo, Fish Audio API pricing, ElevenLabs API pricing, Unreal Speech, Cartesia, and Speechify API all show why this separation matters: the API may count characters, bytes, requests, agent minutes, or a usage balance rather than the same credits a human sees in a creator app.
Agent minutes are especially easy to misread. Cartesia's official pricing separates Line voice-agent minutes from ordinary Sonic or Ink credit usage, and Speechify API pricing includes voice-agent minute language alongside TTS character usage. A support or phone agent budget needs average session length, no-answer calls, retries, testing traffic, phone-number or telephony boundaries, and peak simultaneous usage. A cheap narration plan does not answer those questions.
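The inputs listed above can be combined into a rough monthly estimate. This is a planning sketch under stated assumptions: the `AgentTraffic` fields are hypothetical, retries are modeled as a fraction of answered-session minutes, and whether no-answer calls or test traffic are actually billed depends on the vendor.

```python
from dataclasses import dataclass


@dataclass
class AgentTraffic:
    """Hypothetical monthly traffic profile for a voice agent."""
    answered_sessions: int
    avg_session_minutes: float
    no_answer_calls: int
    avg_no_answer_minutes: float
    retry_rate: float        # fraction of answered minutes repeated
    testing_minutes: float   # QA and development traffic


def monthly_agent_minutes(traffic: AgentTraffic) -> float:
    """Sum answered, retried, no-answer, and testing minutes."""
    answered = traffic.answered_sessions * traffic.avg_session_minutes
    retries = answered * traffic.retry_rate
    no_answer = traffic.no_answer_calls * traffic.avg_no_answer_minutes
    return answered + retries + no_answer + traffic.testing_minutes


# Made-up profile used only to show the arithmetic.
profile = AgentTraffic(
    answered_sessions=1_000,
    avg_session_minutes=4.0,
    no_answer_calls=200,
    avg_no_answer_minutes=0.5,
    retry_rate=0.05,
    testing_minutes=120.0,
)
print(monthly_agent_minutes(profile))
```

In this example the answered sessions account for 4,000 minutes, and the remaining categories add roughly another ten percent that a session-count-only estimate would miss.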
Seats and concurrency answer access and capacity questions, not just generation volume. Fish Audio and Rask AI expose team or workspace concepts, Cartesia publishes concurrency and agent-slot-style limits, and Speechify API pricing includes concurrent-call boundaries. Seats decide who can collaborate, review, or administer work. Concurrency decides how many real-time jobs can run at once. A small monthly output volume can still require a higher-capacity plan if several calls, agents, or production jobs need to happen at the same time.
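Peak simultaneous usage can be measured from session logs rather than guessed. The sketch below is a standard sweep over start and end events; the session tuples are made-up sample data, and it assumes a session ending at the same instant another starts does not overlap with it.

```python
def peak_concurrency(sessions: list[tuple[float, float]]) -> int:
    """Return the maximum number of sessions active at the same time.

    Each session is (start, end) in any consistent time unit, with the
    end treated as exclusive.
    """
    events = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Sort by time; at equal times, closes (-1) are processed before opens (+1).
    events.sort(key=lambda event: (event[0], event[1]))

    current = 0
    peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Made-up call log: (start_minute, end_minute).
calls = [(0, 10), (5, 15), (10, 20), (12, 14)]
print(peak_concurrency(calls))  # 3
```

Four short calls produce a peak of three here, which illustrates how a low total volume can still press against a concurrency cap.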
Enterprise contracts are the boundary for custom volume, compliance, procurement, deployment, priority support, service commitments, or higher operational limits. Treat contact-sales language as a sign that public plan math may not describe the production deal. Before paying, ask each vendor to confirm the unit, reset or overage rule, concurrency cap, workspace ownership, data terms, and whether app, API, dubbing, and agent usage share one balance or separate budgets.
FAQ
Which AI voice pricing unit should I model first?
Start with the unit that grows with your real workload. Script-heavy TTS should begin with characters or bytes, finished narration should begin with generated audio minutes, localization should begin with source media minutes and target languages, and live agents should begin with agent minutes plus concurrency.
When do characters or UTF-8 bytes matter more than audio minutes?
They matter most for API text-to-speech work where the vendor meters input text. Characters are easier to estimate from scripts, while UTF-8 bytes can matter for multilingual workloads because encoded size may not match visible character count.
Are voice credits comparable across vendors?
No. Credits are usually product-specific allowances. They can compare tiers inside one vendor, but they should not be treated as a shared currency across ElevenLabs, Cartesia, Fish Audio, Speechify Studio, MiniMax Audio, or any other voice product.
How should dubbing or localization minutes be budgeted?
Model the source media length first, then account for target languages, review passes, transcript editing, subtitles, voice cloning, lip-sync, and any repeated processing. Localization minutes usually track a different workload than short-form TTS generation.
When do seats and concurrency change the voice pricing decision?
Seats matter when multiple people need shared workspaces, permissions, review, or admin ownership. Concurrency matters when real-time agents, calls, or production jobs must run at the same time. Neither replaces a usage estimate.
When should a buyer ask for enterprise voice pricing?
Ask for enterprise pricing when public plans do not cover the required volume, concurrency, data terms, security review, support commitment, deployment model, procurement process, or custom voice and localization obligations.