Same content in five languages, tokenized by five frontier models: what non-English text actually costs in tokens, across formal UN documents and modern FLORES Wikipedia prose. Data comes from the UN Parallel Corpus and Meta's FLORES-200.
Token-weighted ratio across UN documents (10K sentences, w=1) + FLORES-200 (2K sentences, w=2) · same model · foreign-language tokens ÷ English tokens.
| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer · 2026-04+) | Anthropic Claude · pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
|---|---|---|---|---|---|---|
| English | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN |
| Chinese | 1.10× tax | 1.00× ≈ EN | 0.87× save | 0.86× save | 0.94× save | 1.42× tax |
| French | 1.31× tax | 1.31× tax | 1.47× tax | 1.31× tax | 1.33× tax | 1.54× tax |
| Spanish | 1.26× tax | 1.24× tax | 1.45× tax | 1.25× tax | 1.35× tax | 1.52× tax |
| Arabic | 1.35× tax | 1.43× tax | 1.60× tax | 1.24× tax | 1.66× tax | 2.50× tax |
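The caption gives the corpus weights but not the exact formula. One reading that reproduces the published cells is to pool the weighted token counts across both corpora and then divide by the pooled English count. The sketch below assumes that reading and uses the GPT-5.5 Chinese and English counts from the UN and FLORES-200 tables in this document; the names `UN`, `FLORES`, `WEIGHTS`, and `weighted_ratio` are illustrative, not taken from the project code.

```python
# Raw token counts for English and Chinese under GPT-5.5, copied from the
# UN and FLORES-200 tables in this document.
UN = {"en": 526_091, "zh": 563_197}
FLORES = {"en": 52_421, "zh": 65_948}

# Corpus weights from the caption: UN w=1, FLORES-200 w=2.
WEIGHTS = {"un": 1, "flores": 2}


def weighted_ratio(lang: str) -> float:
    """Pool weighted token counts across corpora, then divide by English.

    Assumption: this pooling formula is inferred from the caption; it is
    not confirmed as the project's exact computation.
    """
    pooled_lang = WEIGHTS["un"] * UN[lang] + WEIGHTS["flores"] * FLORES[lang]
    pooled_en = WEIGHTS["un"] * UN["en"] + WEIGHTS["flores"] * FLORES["en"]
    return pooled_lang / pooled_en


print(f"{weighted_ratio('zh'):.2f}x")  # 1.10x
```

Because the UN corpus is roughly ten times larger in tokens, the combined figure stays close to the UN ratio even with FLORES weighted double.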
Sources: UN Parallel Corpus v20090831 (UN ODS documents, professional UN translators) and FLORES-200 (Meta open-source benchmark, Wikipedia source content, professional translators).

II · Corpus matters
Same model, same language, different text register: how much the corpus alone moves the ratio.
| Language | OpenAI GPT-5.5 | Google Gemini 3.1 Pro | DeepSeek V4-Flash | Alibaba Qwen 3.6 | Anthropic Claude Opus 4.7 | Anthropic Claude · pre-4.7 |
|---|---|---|---|---|---|---|
| English | 1.00× UN / 1.00× FLORES | 1.00× UN / 1.00× FLORES | 1.00× UN / 1.00× FLORES | 1.00× UN / 1.00× FLORES | 1.00× UN / 1.00× FLORES | 1.00× UN / 1.00× FLORES |
| Chinese | 1.07× UN / 1.26× FLORES | 0.98× UN / 1.08× FLORES | 0.85× UN / 0.95× FLORES | 0.84× UN / 0.92× FLORES | 0.90× UN / 1.15× FLORES | 1.38× UN / 1.62× FLORES |
| French | 1.29× UN / 1.36× FLORES | 1.30× UN / 1.38× FLORES | 1.46× UN / 1.54× FLORES | 1.30× UN / 1.37× FLORES | 1.31× UN / 1.48× FLORES | 1.53× UN / 1.60× FLORES |
| Spanish | 1.25× UN / 1.32× FLORES | 1.23× UN / 1.28× FLORES | 1.44× UN / 1.50× FLORES | 1.24× UN / 1.31× FLORES | 1.33× UN / 1.46× FLORES | 1.52× UN / 1.54× FLORES |
| Arabic | 1.34× UN / 1.38× FLORES | 1.42× UN / 1.45× FLORES | 1.60× UN / 1.62× FLORES | 1.22× UN / 1.31× FLORES | 1.63× UN / 1.82× FLORES | 2.48× UN / 2.59× FLORES |
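To put a number on the register effect, the relative gap between the two corpora can be computed per cell as FLORES ratio ÷ UN ratio − 1. The sketch below does this for the Chinese row of the table above; it works from the rounded two-decimal ratios, so results may differ by a point from a calculation on raw token counts. The names `CHINESE` and `register_gap` are illustrative.

```python
# Chinese ratios from the table above, as (UN, FLORES) per tokenizer.
# These are the rounded published values, not raw token counts.
CHINESE = {
    "GPT-5.5": (1.07, 1.26),
    "Gemini 3.1 Pro": (0.98, 1.08),
    "DeepSeek V4-Flash": (0.85, 0.95),
    "Qwen 3.6": (0.84, 0.92),
    "Claude Opus 4.7": (0.90, 1.15),
    "Claude pre-4.7": (1.38, 1.62),
}


def register_gap(un_ratio: float, flores_ratio: float) -> float:
    """Relative increase of the FLORES ratio over the UN ratio."""
    return flores_ratio / un_ratio - 1.0


for model, (un_ratio, flores_ratio) in CHINESE.items():
    print(f"{model:<18} {register_gap(un_ratio, flores_ratio):+.0%}")
```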
Tokenizer cost is not a property of the language alone; it is the joint product of language × model vocabulary × text register. Same model, same language: Chinese UN documents come in at 0.90× on Claude Opus 4.7 and Chinese FLORES prose at 1.15×, a 28% gap (1.15 ÷ 0.90 ≈ 1.28) driven entirely by content style.

III · UN Parallel Corpus
OPUS UN v20090831 · UN Resolutions 2000–2009 · professional UN translators.
| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer · 2026-04+) | Anthropic Claude · pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
|---|---|---|---|---|---|---|
| English | 1.00× ≈ EN · 526,091 tokens | 1.00× ≈ EN · 553,614 tokens | 1.00× ≈ EN · 522,725 tokens | 1.00× ≈ EN · 562,702 tokens | 1.00× ≈ EN · 888,353 tokens | 1.00× ≈ EN · 581,402 tokens |
| Chinese | 1.07× tax · 563,197 tokens | 0.98× ≈ EN · 542,016 tokens | 0.85× save · 446,839 tokens | 0.84× save · 474,474 tokens | 0.90× save · 800,737 tokens | 1.38× tax · 799,514 tokens |
| French | 1.29× tax · 680,965 tokens | 1.30× tax · 720,308 tokens | 1.46× tax · 761,918 tokens | 1.30× tax · 729,438 tokens | 1.31× tax · 1,159,462 tokens | 1.53× tax · 888,348 tokens |
| Spanish | 1.25× tax · 660,081 tokens | 1.23× tax · 682,938 tokens | 1.44× tax · 750,157 tokens | 1.24× tax · 697,832 tokens | 1.33× tax · 1,178,933 tokens | 1.52× tax · 882,270 tokens |
| Arabic | 1.34× tax · 703,812 tokens | 1.42× tax · 788,088 tokens | 1.60× tax · 835,398 tokens | 1.22× tax · 688,529 tokens | 1.63× tax · 1,447,347 tokens | 2.48× tax · 1,441,957 tokens |
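Each ratio in the table above is a quotient of two raw counts from the same tokenizer, so it depends on that tokenizer's own English baseline. As a small illustration, the sketch below recomputes the Chinese cells for the two Claude columns and prints the absolute counts next to the ratio. In this table the two tokenizers produce nearly identical Chinese counts but different English counts, which is one reason ratios are best read within a column. The names `COUNTS` and `ratio` are illustrative.

```python
# Raw UN token counts from the table above, as (English, Chinese).
COUNTS = {
    "Claude Opus 4.7": (888_353, 800_737),
    "Claude pre-4.7": (581_402, 799_514),
}


def ratio(en_tokens: int, zh_tokens: int) -> float:
    """Chinese tokens divided by English tokens, same tokenizer."""
    return zh_tokens / en_tokens


for name, (en_tokens, zh_tokens) in COUNTS.items():
    print(
        f"{name:<16} EN={en_tokens:>9,}  ZH={zh_tokens:>9,}  "
        f"ratio={ratio(en_tokens, zh_tokens):.2f}x"
    )
```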
In UN texts (structured, predictable, formal), Chinese is at parity with or cheaper than English on most current tokenizers: Gemini 3.1 Pro (0.98×), Claude Opus 4.7 (0.90×), DeepSeek (0.85×), and Qwen (0.84×). GPT-5.5 is the exception, with a small premium of 1.07×. The Anthropic tax that benchmarks were quoting through 2025 has effectively dissolved in Opus 4.7 (0.90×, down from 1.38× on older Claude models).

IV · FLORES-200
Goyal et al. 2022 · dev + devtest combined · Wikipedia-sourced.
| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer · 2026-04+) | Anthropic Claude · pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
|---|---|---|---|---|---|---|
| English | 1.00× ≈ EN · 52,421 tokens | 1.00× ≈ EN · 55,004 tokens | 1.00× ≈ EN · 52,749 tokens | 1.00× ≈ EN · 55,706 tokens | 1.00× ≈ EN · 83,682 tokens | 1.00× ≈ EN · 58,521 tokens |
| Chinese | 1.26× tax · 65,948 tokens | 1.08× tax · 59,554 tokens | 0.95× save · 50,100 tokens | 0.92× save · 51,394 tokens | 1.15× tax · 95,967 tokens | 1.62× tax · 95,019 tokens |
| French | 1.36× tax · 71,253 tokens | 1.38× tax · 76,110 tokens | 1.54× tax · 81,086 tokens | 1.37× tax · 76,337 tokens | 1.48× tax · 123,931 tokens | 1.60× tax · 93,709 tokens |
| Spanish | 1.32× tax · 69,000 tokens | 1.28× tax · 70,634 tokens | 1.50× tax · 78,872 tokens | 1.31× tax · 72,909 tokens | 1.46× tax · 122,010 tokens | 1.54× tax · 90,050 tokens |
| Arabic | 1.38× tax · 72,448 tokens | 1.45× tax · 79,919 tokens | 1.62× tax · 85,376 tokens | 1.31× tax · 73,159 tokens | 1.82× tax · 151,947 tokens | 2.59× tax · 151,820 tokens |
In broader open-domain prose, with its wider vocabulary, the same models charge more for non-English. For Chinese, Claude Opus 4.7 rises from 0.90× on UN to 1.15× on FLORES; GPT-5.5 rises from 1.07× to 1.26×. Chinese-trained tokenizers (Qwen, DeepSeek) stay cheaper than English for Chinese even on FLORES.

V · Methodology
Each cell = tokens(translated) ÷ tokens(English) within the same tokenizer, on professionally translated parallel corpora. Cross-tokenizer cells are not directly comparable.
OpenAI tiktoken o200k_harmony
Alibaba Qwen3.5-9B (= 3.6, vocab 248K)
DeepSeek V4-Flash-Base (vocab 129K)
Anthropic messages.count_tokens (Opus 4.7 vs 4.5)
Google genai.count_tokens (Gemini 3.1 Pro)
All count_tokens endpoints are free per provider docs: they run only the tokenizer, not inference. Total API spend: $0. Wall-clock when cached: ~75s for the full UN 10K + FLORES 2K × 5 models.
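The cell definition above can be sketched with a pluggable counter, so the same ratio code runs against any tokenizer. This is a minimal sketch, not the project's implementation: `TokenCounter`, `corpus_tokens`, `token_ratio`, and `whitespace_counter` are hypothetical names, the two sentence pairs are made-up examples, and the whitespace counter is only a stand-in. In practice the callable would wrap a local tokenizer's encode call or a provider's count_tokens endpoint. Summing tokens over the whole corpus before dividing is an assumption, chosen because the tables report corpus-level totals.

```python
from typing import Callable, Sequence

# Any function that maps a text to its token count under one tokenizer.
TokenCounter = Callable[[str], int]


def corpus_tokens(count: TokenCounter, sentences: Sequence[str]) -> int:
    """Total tokens over a corpus under a single tokenizer."""
    return sum(count(sentence) for sentence in sentences)


def token_ratio(
    count: TokenCounter,
    english: Sequence[str],
    translated: Sequence[str],
) -> float:
    """tokens(translated) / tokens(English), both from the same counter."""
    if len(english) != len(translated):
        raise ValueError("parallel corpora must be aligned sentence by sentence")
    english_total = corpus_tokens(count, english)
    if english_total == 0:
        raise ValueError("English corpus produced zero tokens")
    return corpus_tokens(count, translated) / english_total


def whitespace_counter(text: str) -> int:
    """Stand-in counter for illustration only; not a real model tokenizer."""
    return len(text.split())


# Hypothetical two-sentence parallel sample, not drawn from either corpus.
english = ["The committee adopted the resolution.", "The meeting was adjourned."]
french = ["Le comité a adopté la résolution.", "La séance a été levée."]

print(f"{token_ratio(whitespace_counter, english, french):.2f}x")
```

Keeping the counter as a parameter makes it hard to mix tokenizers by accident, which matches the rule that a ratio is only meaningful within one tokenizer.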
Source code, raw token counts, and full data: project workspace at tokenizer-tax/.