What I Learned After Six Months of Testing Translation APIs
Last year, I spent way too many hours integrating five different translation APIs into our localization pipeline. What started as a simple "just pick one and ship it" task turned into a rabbit hole of tradeoffs, edge cases, and some genuinely surprising results.
This is what I wish someone had told me before I started.
The Quick Answer (If You're in a Hurry)
| API | Quality | Speed | Price per 1M chars | Where It Shines |
|---|---|---|---|---|
| GPT-4o | Excellent | Medium | ~$5 | Context-heavy UI strings |
| Claude Sonnet | Excellent | Medium | ~$6 | Keeping consistent tone |
| DeepL | Very Good | Fast | $25 | European languages |
| Google Translate | Good | Very Fast | $20 | Raw speed, rare languages |
| Azure Translator | Good | Very Fast | $10 | Microsoft shops |
| Amazon Translate | Good | Very Fast | $15 | Already on AWS |
But honestly, the real answer is "it depends," and I'll explain why.
What I Actually Found Using Each One
OpenAI GPT-4 / GPT-4o
This is what we use most. Not because it's perfect, but because it handles the weird edge cases that kept breaking other solutions.
Current Pricing:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| GPT-4o Mini | $0.15 | $0.60 |
| GPT-4 Turbo | $10.00 | $30.00 |
The trick is getting the system prompt right. You need to tell it to preserve placeholders like {name} and {{count}}, or it will helpfully "translate" them. I learned this the hard way when our Spanish build started showing "nombre" instead of the user's actual name.
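Since even a good prompt occasionally slips, we ended up validating placeholders after every call rather than trusting the model. A minimal sketch of that check — the regex covers the two placeholder styles mentioned above, and the function names are just illustrative:

```python
import re

# Matches {{count}}-style placeholders first, then {name}-style ones.
PLACEHOLDER_RE = re.compile(r"\{\{[^{}]+\}\}|\{[^{}]+\}")

def placeholders_intact(source: str, translated: str) -> bool:
    """True if every placeholder in the source survives translation verbatim."""
    return sorted(PLACEHOLDER_RE.findall(source)) == sorted(
        PLACEHOLDER_RE.findall(translated)
    )
```

If the check fails, retry the request or flag the string for review; it's a cheap guard against the "nombre" class of bug.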
What actually works well:
- Understands that "Save" in a button context means something different than "Save" as in "save money"
- Handles pluralization rules without me having to explain them
- The JSON mode is genuinely useful for batch operations
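For example, JSON mode turns a batch of strings into a single request. A sketch of the payload — the shape follows the OpenAI chat completions API, but the prompt wording is illustrative and you'd tune it for your own strings:

```python
import json

def batch_request(strings: list[str], target_lang: str) -> dict:
    """Build a JSON-mode chat request that translates many strings at once."""
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},  # forces valid JSON back
        "messages": [
            {
                "role": "system",
                "content": (
                    f"Translate each value to {target_lang}. "
                    "Reply with a JSON object keyed by the original strings."
                ),
            },
            {"role": "user", "content": json.dumps(strings)},
        ],
    }
```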
What caught me off guard:
- No built-in language detection; you need to handle that separately

- Response times are inconsistent. Sometimes 400ms, sometimes 2 seconds
- Mini is tempting for the price, but quality drops noticeably for complex sentences
My take: Worth it if you're translating UI text or anything where context matters. Overkill for simple strings like "OK" or "Cancel."
Anthropic Claude
I was skeptical at first because Claude isn't really marketed as a translation tool. But after testing it alongside GPT-4, I was surprised how well it handled brand-specific terminology.
Current Pricing:
| Model | Input (1M tokens) | Output (1M tokens) |
|---|---|---|
| Claude 3.5 Haiku | $0.25 | $1.25 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude Opus 4.5 | $15.00 | $75.00 |
Where it impressed me:
- We have a glossary of terms we never translate (product names, technical terms). Claude follows these instructions more consistently than GPT-4
- The 200K context window meant we could send our entire glossary with each request
- Tone stays remarkably consistent across long documents
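Sending the glossary along is just prompt construction. Here's a sketch of a Messages-API-style payload builder; the model id, prompt wording, and glossary convention (a term mapped to itself means "never translate") are all my own illustrative choices, not anything Anthropic prescribes:

```python
def build_translation_request(
    text: str, target_lang: str, glossary: dict[str, str]
) -> dict:
    """Build a Messages-API-style payload that pins glossary terms."""
    rules = "\n".join(
        f'- "{src}" must always appear as "{dst}"' for src, dst in glossary.items()
    )
    system = (
        f"You translate product copy to {target_lang}.\n"
        "Follow this glossary exactly:\n" + rules
    )
    return {
        "model": "claude-sonnet-4-5",  # example model id; use whatever you test against
        "max_tokens": 1024,
        "system": system,
        "messages": [{"role": "user", "content": text}],
    }
```

With a 200K context window, even a multi-thousand-term glossary fits comfortably in the system prompt.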
What's less great:
- Slightly slower than GPT-4o on average
- Fewer model options means less flexibility on price/quality tradeoffs
My take: If you're translating marketing copy or anything where brand voice matters, Claude is worth testing. For raw UI strings, it's comparable to GPT-4.
DeepL API
DeepL has a reputation for quality, and for European languages, it's earned. But I've seen too many teams default to it without understanding where it falls short.
Current Pricing:
| Plan | Price | What You Get |
|---|---|---|
| Free | $0 | 500K chars/month |
| Pro | $25/1M chars | Unlimited |
| Enterprise | Custom | SLA, dedicated support |
What's genuinely good:
- German and French translations are noticeably more natural than what the LLMs produce
- Fast. Consistently fast. No random 2-second delays
- The glossary feature actually works (define "enterprise" as "entreprise" and it sticks)
What nobody mentions:
- Japanese and Korean translations feel robotic compared to GPT-4
- No Arabic support at all
- You can't give it context. If "reservation" could mean a hotel booking or a hesitation, DeepL just picks one
My take: If your app is primarily for European markets, DeepL is probably your best choice. For Asian languages or complex context, look elsewhere.
Google Cloud Translation
Google Translate gets a bad rap from people who remember the "All your base" era. The current API is actually quite good for what it is.
Current Pricing:
| Feature | Price |
|---|---|
| Translation | $20/1M chars |
| Language Detection | $20/1M chars |
| Custom Glossary | Included |
| AutoML (custom models) | $45/1M chars |
Where it makes sense:
- 100+ languages. If you need Uzbek or Swahili, this is probably your only option
- Blazing fast. 50ms response times are common
- Language detection is built in and actually reliable
The honest downsides:
- Translations feel "correct but generic." A human would never word it that way
- Struggles with informal text, slang, or anything requiring cultural adaptation
- The AutoML feature sounds great but requires significant training data to be useful
My take: Great for user-generated content where speed matters more than polish. Less suitable for your carefully crafted marketing copy.
Azure and Amazon (Quick Takes)
I'll be honest: if you're already deep in Azure or AWS, the integration convenience might outweigh the quality differences. Both are fine, neither is exceptional.
Azure Translator:
- $10/1M chars is the cheapest paid option
- Free tier (2M chars/month) is generous
- Quality is... okay. Comparable to Google
Amazon Translate:
- $15/1M chars
- Batch processing is well-designed
- IAM setup is its own adventure
Quality Numbers (With Caveats)
We ran 1,000 UI strings through each API for five language pairs. Human translators scored them blind.
| API | EN→ES | EN→FR | EN→DE | EN→JA | EN→AR | Avg |
|---|---|---|---|---|---|---|
| GPT-4o | 96% | 95% | 94% | 91% | 88% | 92.8% |
| Claude Sonnet | 95% | 96% | 95% | 90% | 87% | 92.6% |
| DeepL | 94% | 95% | 96% | 85% | N/A | 92.5% |
| Google | 88% | 89% | 87% | 86% | 84% | 86.8% |
| Azure | 87% | 88% | 86% | 85% | 83% | 85.8% |
A few notes:
- DeepL doesn't support Arabic
- These are UI strings, not literary prose. Results would differ for other content types
- The difference between 88% and 95% is more noticeable than the numbers suggest
Speed in Practice
Average response time for translating about 100 words:
| API | Typical Speed | Notes |
|---|---|---|
| Google Translate | 50ms | Consistently fast |
| Azure Translator | 75ms | Also very reliable |
| DeepL | 150ms | Fast enough |
| GPT-4o | 800ms | Varies more than I'd like |
| Claude Sonnet | 1000ms | Similar variance |
| GPT-4 Turbo | 2000ms | Noticeably slower |
If you're doing real-time translation (chat, live content), Google or Azure are your only realistic options. For batch processing, speed matters less than you'd think.
What It Actually Costs
Let's say you're translating 10,000 strings (averaging 50 characters each) into 10 languages. That's 5 million characters.
| API | Approximate Cost | Quality Level |
|---|---|---|
| GPT-4o Mini | $0.75 | Good enough for most UI |
| Claude Haiku | $1.25 | Similar to Mini |
| GPT-4o | $25 | Noticeably better |
| Claude Sonnet | $30 | Comparable to GPT-4o |
| Azure | $50 | Adequate |
| Amazon | $75 | Adequate |
| Google | $100 | Adequate |
| DeepL | $125 | Very good for EU languages |
The LLM pricing model (tokens vs characters) means they're actually cheaper than traditional MT services for most text lengths. I didn't expect that.
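To sanity-check that claim, here's the back-of-envelope arithmetic. Both heuristics in the sketch — roughly 4 characters per token for English, and output about as long as input — are assumptions, not anything the providers guarantee:

```python
CHARS_PER_TOKEN = 4.0  # rough heuristic for English text

def char_api_cost(chars: int, usd_per_m_chars: float) -> float:
    """Character-priced services (DeepL, Google, Azure, Amazon)."""
    return chars / 1e6 * usd_per_m_chars

def llm_cost(chars: int, usd_in: float, usd_out: float) -> float:
    """Token-priced LLMs; assumes output is about as long as input."""
    tokens = chars / CHARS_PER_TOKEN
    return tokens / 1e6 * usd_in + tokens / 1e6 * usd_out

chars = 5_000_000  # the volume behind the table above
print(char_api_cost(chars, 25.0))  # DeepL at $25/1M chars -> 125.0
print(llm_cost(chars, 5.0, 15.0))  # GPT-4o at $5 in / $15 out -> 25.0
```

The character-priced numbers are exact; the LLM numbers shift with how token-dense your language is, which is why the table calls them approximate.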
How to Actually Decide
After all this testing, here's my mental framework:
Go with GPT-4o if:
- Your strings have placeholders, variables, or technical content
- You need JSON output for automation
- Context matters (same word meaning different things in different places)
Go with Claude if:
- You've got a brand style guide that needs to be followed
- You're translating longer marketing or documentation content
- Consistency across thousands of strings is critical
Go with DeepL if:
- Most of your users are in Europe
- You're translating formal business content
- You want the best French/German/Dutch quality available
Go with Google if:
- You need languages that others don't support
- Real-time speed is non-negotiable
- You're translating user-generated content where "good enough" is acceptable
Go with Azure/Amazon if:
- You're already locked into that ecosystem
- Compliance requirements point you there
The Hybrid Approach That Actually Works
In production, we ended up using multiple APIs. Marketing copy goes through Claude. UI strings use GPT-4o. User comments use Google. It's more complex to set up, but the quality/cost balance is better than any single solution.
You can set up a simple routing function: critical content gets the expensive API, bulk content gets the cheap one, real-time content gets the fast one. Once it's built, you stop thinking about it.
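A minimal version of that routing function might look like this. The category names and engine ids are placeholders for whatever your pipeline actually uses:

```python
def route(content_type: str) -> str:
    """Pick a translation engine by content category (illustrative names)."""
    routes = {
        "marketing": "claude-sonnet",  # brand voice and tone consistency
        "ui": "gpt-4o",                # placeholders and context sensitivity
        "user_generated": "google",    # real-time speed over polish
    }
    return routes.get(content_type, "gpt-4o-mini")  # cheap default for bulk text
```

In practice ours grew a few more categories and a per-language override, but the shape stayed this simple.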
A Few Hard-Won Lessons
- Always send context. "Book" translates differently for a library app vs a hotel app. Include your app category or domain in every request.
- Test with edge cases first. Before committing to an API, try it with your weirdest strings. Placeholders, emoji, HTML snippets, RTL text. The differences show up there.
- Build in fallbacks. APIs go down. Rate limits hit. Have a backup, even if it's just caching previously translated strings.
- Human review is still worth it for some content. Error messages, legal text, anything that could embarrass you if wrong. AI translation is good, but not perfect.
- Translation memory saves money. If you're translating "Save changes" a hundred times across different projects, you should only be paying for it once.
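The fallback and translation-memory lessons combine naturally into one wrapper. A sketch, with engines as plain callables standing in for real API clients and the cache as any dict-like store — all names are illustrative:

```python
def translate_with_memory(text, target_lang, engines, memory):
    """Return a cached translation, or try each engine in order.

    `engines`: list of callables (text, lang) -> str that may raise on failure.
    `memory`: dict-like translation memory keyed by (text, target_lang).
    """
    key = (text, target_lang)
    if key in memory:
        return memory[key]  # memory hit: pay for the string once, reuse forever
    for engine in engines:
        try:
            result = engine(text, target_lang)
            memory[key] = result
            return result
        except Exception:
            continue  # engine down or rate-limited; fall through to the next
    raise RuntimeError(f"All engines failed for {key!r}")
```

In production you'd persist the memory (a database table is fine) and scope it per project so glossary differences don't leak between apps.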
Where to Go From Here
If you're just starting out with translation APIs, my honest advice is to pick GPT-4o Mini and see how far it gets you. It's cheap, the quality is reasonable, and you can always upgrade later.
If you're at the point where you need multiple engines, glossary enforcement, translation memory, and human review workflows, you probably want a proper TMS rather than building it yourself. We built IntlPull to handle exactly that use case. You can use the CLI to push strings and translate with different engines based on content type.
Whatever you choose, the good news is that machine translation in 2026 is genuinely good enough for production use. The question isn't whether to use it, but how to use it well.
Common Questions
Which API gives the best translations in 2026?
For UI and app content, GPT-4o and Claude Sonnet are essentially tied. For European languages specifically, DeepL is still the benchmark. There's no single winner.
What's the most cost-effective option?
GPT-4o Mini gives you surprisingly good quality at $0.15 per million input tokens. If you need free, Azure offers 2 million characters per month.
Can I skip human review entirely?
For most UI strings and help text, yes. For anything legal, medical, or where mistakes could cause real harm, I'd still recommend human review. The 90%+ accuracy sounds great until you remember that 10% means one in ten strings might be wrong.
What happens when an API is down?
This happened to us twice in six months. Build fallbacks. Cache translations. Have a default language that works if everything fails.
