What I Learned After Six Months of Testing Translation APIs
Last year, I spent way too many hours integrating five different translation APIs into our localization pipeline. What started as a simple "just pick one and ship it" task turned into a rabbit hole of tradeoffs, edge cases, and some genuinely surprising results.
This is what I wish someone had told me before I started.
The Quick Answer (If You're in a Hurry)
| API | Quality | Speed | Price per 1M chars | Where It Shines |
|---|---|---|---|---|
| GPT-4o | Excellent | Medium | ~$5 | Context-heavy UI strings |
| Claude Sonnet | Excellent | Medium | ~$6 | Keeping consistent tone |
| DeepL | Very Good | Fast | $25 | European languages |
| Google Translate | Good | Very Fast | $20 | Raw speed, rare languages |
| Azure Translator | Good | Very Fast | $10 | Microsoft shops |
| Amazon Translate | Good | Very Fast | $15 | Already on AWS |
But honestly, the real answer is "it depends," and I'll explain why.
What I Actually Found Using Each One
OpenAI GPT-4 / GPT-4o
This is what we use most. Not because it's perfect, but because it handles the weird edge cases that kept breaking other solutions.
Current Pricing:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| GPT-4 Turbo | $10.00 | $30.00 |
The trick is getting the system prompt right. You need to tell it to preserve placeholders like {name} and {{count}}, or it will helpfully "translate" them. I learned this the hard way when our Spanish build started showing "nombre" instead of the user's actual name.
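A cheap safety net is to validate that every placeholder in the source string survives in the translation before accepting it. This is a minimal sketch, not any SDK's API; the regex and function name are my own:

```python
import re

# Matches {name}-style and {{count}}-style placeholders.
PLACEHOLDER_RE = re.compile(r"\{\{?\w+\}?\}")

def placeholders_intact(source: str, translated: str) -> bool:
    """Return True if the translation kept every placeholder from the source."""
    return sorted(PLACEHOLDER_RE.findall(source)) == sorted(PLACEHOLDER_RE.findall(translated))

print(placeholders_intact("Hello, {name}!", "¡Hola, {name}!"))       # True
print(placeholders_intact("{{count}} items", "nombre de artículos"))  # False
```

When the check fails, we retry the request rather than ship the string; that one guard would have caught the "nombre" incident above.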
What actually works well:
- Understands that "Save" in a button context means something different than "Save" as in "save money"
- Handles pluralization rules without me having to explain them
- The JSON mode is genuinely useful for batch operations

What caught me off guard:

- No built-in language detection; you need to handle that separately
- Response times are inconsistent. Sometimes 400ms, sometimes 2 seconds
- Mini is tempting for the price, but quality drops noticeably for complex sentences

My take: Worth it if you're translating UI text or anything where context matters. Overkill for simple strings like "OK" or "Cancel."
Anthropic Claude
I was skeptical at first because Claude isn't really marketed as a translation tool. But after testing it alongside GPT-4, I was surprised how well it handled brand-specific terminology.
Current Pricing:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| Claude 3.5 Haiku | $0.25 | $1.25 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude Opus 4.5 | $15.00 | $75.00 |
Where it impressed me:
- We have a glossary of terms we never translate (product names, technical terms). Claude follows these instructions more consistently than GPT-4
- The 200K context window meant we could send our entire glossary with each request
- Tone stays remarkably consistent across long documents

What's less great:

- Slightly slower than GPT-4o on average
- Fewer model options means less flexibility on price/quality tradeoffs

My take: If you're translating marketing copy or anything where brand voice matters, Claude is worth testing. For raw UI strings, it's comparable to GPT-4.
DeepL API
DeepL has a reputation for quality, and for European languages, it's earned. But I've seen too many teams default to it without understanding where it falls short.
Current Pricing:
| Plan | Price | What You Get |
|---|---|---|
| Free | $0 | 500K chars/month |
| Pro | $25/1M chars | Unlimited |
| Enterprise | Custom | SLA, dedicated support |
What's genuinely good:
- German and French translations are noticeably more natural than the LLMs'
- Fast. Consistently fast. No random 2-second delays
- The glossary feature actually works (define "enterprise" as "entreprise" and it sticks)

What nobody mentions:

- Japanese and Korean translations feel robotic compared to GPT-4
- No Arabic support at all
- You can't give it context. If "reservation" could mean a hotel booking or a hesitation, DeepL just picks one

My take: If your app is primarily for European markets, DeepL is probably your best choice. For Asian languages or complex context, look elsewhere.
Google Cloud Translation
Google Translate gets a bad rap from people who remember the "All your base" era. The current API is actually quite good for what it is.
Current Pricing:
| Feature | Price |
|---|---|
| Translation | $20/1M chars |
| Language Detection | $20/1M chars |
| Custom Glossary | Included |
| AutoML (custom models) | $45/1M chars |
Where it makes sense:
- 100+ languages. If you need Uzbek or Swahili, this is probably your only option
- Blazing fast. 50ms response times are common
- Language detection is built in and actually reliable

The honest downsides:

- Translations feel "correct but generic." A human would never word it that way
- Struggles with informal text, slang, or anything requiring cultural adaptation
- The AutoML feature sounds great but requires significant training data to be useful

My take: Great for user-generated content where speed matters more than polish. Less suitable for your carefully crafted marketing copy.
Azure and Amazon (Quick Takes)
I'll be honest: if you're already deep in Azure or AWS, the integration convenience might outweigh the quality differences. Both are fine, neither is exceptional.
Azure Translator:
- $10/1M chars is the cheapest paid option
- Free tier (2M chars/month) is generous
- Quality is... okay. Comparable to Google

Amazon Translate:

- $15/1M chars
- Batch processing is well-designed
- IAM setup is its own adventure

Quality Numbers (With Caveats)
We ran 1,000 UI strings through each API for five language pairs. Human translators scored them blind.
| API | EN→ES | EN→FR | EN→DE | EN→JA | EN→AR | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 96% | 95% | 94% | 91% | 88% | 92.8% |
| Claude Sonnet | 95% | 96% | 95% | 90% | 87% | 92.6% |
| DeepL | 94% | 95% | 96% | 85% | N/A | 92.5% |
| Google | 88% | 89% | 87% | 86% | 84% | 86.8% |
| Azure | 87% | 88% | 86% | 85% | 83% | 85.8% |
A few notes:
- DeepL doesn't support Arabic, so its average covers only the four remaining pairs
- These are UI strings, not literary prose. Results would differ for other content types
- The difference between 88% and 95% is more noticeable than the numbers suggest

Speed in Practice
Average response time for translating about 100 words:
| API | Typical Speed | Notes |
|---|---|---|
| Google Translate | 50ms | Consistently fast |
| Azure Translator | 75ms | Also very reliable |
| DeepL | 150ms | Fast enough |
| GPT-4o | 800ms | Varies more than I'd like |
| Claude Sonnet | 1000ms | Similar variance |
| GPT-4 (non-mini) | 2000ms | Noticeably slower |
If you're doing real-time translation (chat, live content), Google or Azure are your only realistic options. For batch processing, speed matters less than you'd think.
What It Actually Costs
Let's say you're translating 100,000 strings (averaging 50 characters each) into 10 languages. That's 5 million source characters, billed once per target language, or 50 million characters for the whole run. The figures below are per target language; multiply by 10 for all of them.
| API | Approximate Cost | Quality Level |
|---|---|---|
| GPT-4o Mini | $0.75 | Good enough for most UI |
| Claude Haiku | $1.25 | Similar to Mini |
| GPT-4o | $25 | Noticeably better |
| Claude Sonnet | $30 | Comparable to GPT-4o |
| Azure | $50 | Adequate |
| Amazon | $75 | Adequate |
| Google | $100 | Adequate |
| DeepL | $125 | Very good for EU languages |
The LLM pricing model (tokens vs characters) means they're actually cheaper than traditional MT services for most text lengths. I didn't expect that.
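The arithmetic is easy to sanity-check yourself. This sketch compares per-character billing with token billing for a 5-million-character batch; the rates come from the tables above, while the ~4-characters-per-token ratio and the assumption that output length matches input are my own rough approximations:

```python
# Rough cost comparison: per-character MT billing vs token-based LLM billing.
# Assumes ~4 characters per token and output roughly as long as input.
chars = 5_000_000

google_cost = chars / 1_000_000 * 20.00            # $20 per 1M characters
tokens = chars / 4                                  # ~4 chars/token (assumption)
gpt4o_cost = tokens / 1_000_000 * (5.00 + 15.00)    # $5 in + $15 out per 1M tokens

print(f"Google: ${google_cost:.2f}")   # $100.00
print(f"GPT-4o: ${gpt4o_cost:.2f}")    # $25.00
```

The character-to-token compression is doing the work here: 5M characters is only about 1.25M tokens, so even GPT-4o's higher nominal rate lands cheaper.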
How to Actually Decide
After all this testing, here's my mental framework:
Go with GPT-4o if:
- Your strings have placeholders, variables, or technical content
- You need JSON output for automation
- Context matters (same word meaning different things in different places)

Go with Claude if:

- You've got a brand style guide that needs to be followed
- You're translating longer marketing or documentation content
- Consistency across thousands of strings is critical

Go with DeepL if:

- Most of your users are in Europe
- You're translating formal business content
- You want the best French/German/Dutch quality available

Go with Google if:

- You need languages that others don't support
- Real-time speed is non-negotiable
- You're translating user-generated content where "good enough" is acceptable

Go with Azure/Amazon if:

- You're already locked into that ecosystem
- Compliance requirements point you there

The Hybrid Approach That Actually Works
In production, we ended up using multiple APIs. Marketing copy goes through Claude. UI strings use GPT-4o. User comments use Google. It's more complex to set up, but the quality/cost balance is better than any single solution.
You can set up a simple routing function: critical content gets the expensive API, bulk content gets the cheap one, real-time content gets the fast one. Once it's built, you stop thinking about it.
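That routing function really is just a lookup. This sketch captures the shape of ours; the engine names and the content-type taxonomy are illustrative, not a real client library:

```python
# Minimal content-based router, as described above.
ROUTES = {
    "marketing": "claude-sonnet",    # brand voice matters
    "ui":        "gpt-4o",           # placeholders and context
    "ugc":       "google-translate", # speed over polish
}

def pick_engine(content_type: str, realtime: bool = False) -> str:
    """Route a translation job to an engine by content type."""
    if realtime:
        return "google-translate"  # only the fast engines work for live content
    return ROUTES.get(content_type, "gpt-4o-mini")  # cheap default for bulk

print(pick_engine("marketing"))           # claude-sonnet
print(pick_engine("ugc", realtime=True))  # google-translate
```

The default branch matters: anything unclassified goes to the cheapest acceptable engine rather than the most expensive one.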
A Few Hard-Won Lessons
- Always send context. "Book" translates differently for a library app vs a hotel app. Include your app category or domain in every request.
- Test with edge cases first. Before committing to an API, try it with your weirdest strings: placeholders, emoji, HTML snippets, RTL text. The differences show up there.
- Build in fallbacks. APIs go down. Rate limits hit. Have a backup, even if it's just caching previously translated strings.
- Human review is still worth it for some content. Error messages, legal text, anything that could embarrass you if wrong. AI translation is good, but not perfect.
- Translation memory saves money. If you're translating "Save changes" a hundred times across different projects, you should only be paying for it once.

Where to Go From Here
If you're just starting out with translation APIs, my honest advice is to pick GPT-4o Mini and see how far it gets you. It's cheap, the quality is reasonable, and you can always upgrade later.
If you're at the point where you need multiple engines, glossary enforcement, translation memory, and human review workflows, you probably want a proper TMS rather than building it yourself. We built IntlPull to handle exactly that use case. You can use the CLI to push strings and translate with different engines based on content type.
Whatever you choose, the good news is that machine translation in 2025 is genuinely good enough for production use. The question isn't whether to use it, but how to use it well.
Common Questions
Which API gives the best translations in 2025?
For UI and app content, GPT-4o and Claude Sonnet are essentially tied. For European languages specifically, DeepL is still the benchmark. There's no single winner.
What's the most cost-effective option?
GPT-4o Mini gives you surprisingly good quality at $0.15 per million input tokens. If you need free, Azure offers 2 million characters per month.
Can I skip human review entirely?
For most UI strings and help text, yes. For anything legal, medical, or where mistakes could cause real harm, I'd still recommend human review. The 90%+ accuracy sounds great until you remember that 10% means one in ten strings might be wrong.
What happens when an API is down?
This happened to us twice in six months. Build fallbacks. Cache translations. Have a default language that works if everything fails.
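Caching and fallback fit in one small wrapper. In this sketch the two provider functions are hypothetical stand-ins for whichever APIs you actually use (here the primary deliberately fails to show the fallback path); the cache doubles as a crude translation memory:

```python
cache: dict[tuple[str, str], str] = {}  # (text, target_lang) -> translation

def translate_primary(text: str, lang: str) -> str:
    raise TimeoutError("primary API is down")  # stand-in for a real provider call

def translate_backup(text: str, lang: str) -> str:
    return f"[{lang}] {text}"  # stand-in for a second provider

def translate(text: str, lang: str) -> str:
    """Cache first, then primary, then backup, then the source text itself."""
    key = (text, lang)
    if key in cache:
        return cache[key]
    for provider in (translate_primary, translate_backup):
        try:
            cache[key] = provider(text, lang)
            return cache[key]
        except Exception:
            continue  # try the next provider
    return text  # last resort: show the default language

print(translate("Save changes", "es"))  # served by the backup stand-in
```

The final `return text` is the "default language that works if everything fails": an untranslated string beats a blank screen.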