Is there a foreign-language tax in large language models?

The same content in five languages, run through the tokenizers of five frontier model families. This page counts the actual token bill for non-English text across two parallel corpora: formal UN documents and modern Wikipedia prose from Meta's FLORES-200.

Color scale: ≤ 0.88× · 0.88–0.95× · ≈ 1.00× · 1.05–1.15× · 1.15–1.30× · 1.30–1.50× · 1.50–1.80× · ≥ 1.80×

I · Headline · merged corpora

Is there a foreign-language tax?

Token-weighted ratio across UN documents (10K sentences, w=1) + FLORES-200 (2K sentences, w=2) · same model · foreign-language tokens ÷ English tokens.

Language	OpenAI GPT-5.5 (tiktoken o200k_harmony)	Google Gemini 3.1 Pro (API count_tokens)	DeepSeek V4-Flash (vocab 129K)	Alibaba Qwen 3.6 (tokenizer shared with 3.5)	Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+)	Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x)
English	1.00× ≈ EN	1.00× ≈ EN	1.00× ≈ EN	1.00× ≈ EN	1.00× ≈ EN	1.00× ≈ EN
Chinese	1.10× tax	1.00× ≈ EN	0.87× save	0.86× save	0.94× save	1.42× tax
French	1.31× tax	1.31× tax	1.47× tax	1.31× tax	1.33× tax	1.54× tax
Spanish	1.26× tax	1.24× tax	1.45× tax	1.25× tax	1.35× tax	1.52× tax
Arabic	1.35× tax	1.43× tax	1.60× tax	1.24× tax	1.66× tax	2.50× tax
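The page does not spell out how the two corpora are merged. One reading that reproduces the headline numbers is to pool the weighted token counts first and divide afterwards. The sketch below assumes that rule and uses the OpenAI GPT-5.5 counts from the per-corpus tables further down; the function name `merged_ratio` is illustrative, not taken from the project code.

```python
# (foreign_tokens, english_tokens) for the OpenAI GPT-5.5 column,
# copied from the UN and FLORES tables on this page.
UN = {"Chinese": (563_197, 526_091), "Arabic": (703_812, 526_091)}
FLORES = {"Chinese": (65_948, 52_421), "Arabic": (72_448, 52_421)}


def merged_ratio(un, flores, w_un=1.0, w_flores=2.0):
    """Pool weighted token counts, then divide (assumed merge rule)."""
    un_foreign, un_english = un
    fl_foreign, fl_english = flores
    foreign = w_un * un_foreign + w_flores * fl_foreign
    english = w_un * un_english + w_flores * fl_english
    return foreign / english


for lang in UN:
    print(lang, f"{merged_ratio(UN[lang], FLORES[lang]):.2f}x")
```

Because the UN corpus holds roughly ten times as many tokens as FLORES, the merged figure stays close to the UN ratio even with FLORES weighted double.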

UN → FLORES, by register

Same model, same language, different text register: how much the corpus alone moves the ratio.

Language	OpenAI GPT-5.5	Google Gemini 3.1 Pro	DeepSeek V4-Flash	Alibaba Qwen 3.6	Anthropic Claude Opus 4.7	Anthropic Claude pre-4.7
English	1.00× UN → 1.00× FLORES	1.00× UN → 1.00× FLORES	1.00× UN → 1.00× FLORES	1.00× UN → 1.00× FLORES	1.00× UN → 1.00× FLORES	1.00× UN → 1.00× FLORES
Chinese	1.07× UN → 1.26× FLORES	0.98× UN → 1.08× FLORES	0.85× UN → 0.95× FLORES	0.84× UN → 0.92× FLORES	0.90× UN → 1.15× FLORES	1.38× UN → 1.62× FLORES
French	1.29× UN → 1.36× FLORES	1.30× UN → 1.38× FLORES	1.46× UN → 1.54× FLORES	1.30× UN → 1.37× FLORES	1.31× UN → 1.48× FLORES	1.53× UN → 1.60× FLORES
Spanish	1.25× UN → 1.32× FLORES	1.23× UN → 1.28× FLORES	1.44× UN → 1.50× FLORES	1.24× UN → 1.31× FLORES	1.33× UN → 1.46× FLORES	1.52× UN → 1.54× FLORES
Arabic	1.34× UN → 1.38× FLORES	1.42× UN → 1.45× FLORES	1.60× UN → 1.62× FLORES	1.22× UN → 1.31× FLORES	1.63× UN → 1.82× FLORES	2.48× UN → 2.59× FLORES
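To put a single number on the register effect, one option is the relative change of the ratio between the two corpora. This is a minimal sketch using the Chinese token counts from the per-corpus tables on this page; the helper name `register_shift` and the choice of relative change as the metric are assumptions made for illustration.

```python
# (foreign_tokens, english_tokens) for Chinese, from the per-corpus tables.
COUNTS = {
    "Claude Opus 4.7": {"UN": (800_737, 888_353), "FLORES": (95_967, 83_682)},
    "Qwen 3.6": {"UN": (474_474, 562_702), "FLORES": (51_394, 55_706)},
}


def register_shift(counts):
    """Return (UN ratio, FLORES ratio, relative change from UN to FLORES)."""
    un = counts["UN"][0] / counts["UN"][1]
    flores = counts["FLORES"][0] / counts["FLORES"][1]
    return un, flores, flores / un - 1.0


for model, counts in COUNTS.items():
    un, flores, shift = register_shift(counts)
    print(f"{model}: {un:.2f}x -> {flores:.2f}x ({shift:+.0%})")
```

On these counts the Chinese ratio rises by roughly 27% for Claude Opus 4.7 and roughly 9% for Qwen 3.6 when moving from UN text to FLORES prose.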

UN Documents · 10,000 sentences

OPUS UN v20090831 · UN resolutions 2000–2009 · translated by professional UN translators.

Language	OpenAI GPT-5.5 (tiktoken o200k_harmony)	Google Gemini 3.1 Pro (API count_tokens)	DeepSeek V4-Flash (vocab 129K)	Alibaba Qwen 3.6 (tokenizer shared with 3.5)	Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+)	Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x)
English	1.00× ≈ EN (526,091 tokens)	1.00× ≈ EN (553,614 tokens)	1.00× ≈ EN (522,725 tokens)	1.00× ≈ EN (562,702 tokens)	1.00× ≈ EN (888,353 tokens)	1.00× ≈ EN (581,402 tokens)
Chinese	1.07× tax (563,197 tokens)	0.98× ≈ EN (542,016 tokens)	0.85× save (446,839 tokens)	0.84× save (474,474 tokens)	0.90× save (800,737 tokens)	1.38× tax (799,514 tokens)
French	1.29× tax (680,965 tokens)	1.30× tax (720,308 tokens)	1.46× tax (761,918 tokens)	1.30× tax (729,438 tokens)	1.31× tax (1,159,462 tokens)	1.53× tax (888,348 tokens)
Spanish	1.25× tax (660,081 tokens)	1.23× tax (682,938 tokens)	1.44× tax (750,157 tokens)	1.24× tax (697,832 tokens)	1.33× tax (1,178,933 tokens)	1.52× tax (882,270 tokens)
Arabic	1.34× tax (703,812 tokens)	1.42× tax (788,088 tokens)	1.60× tax (835,398 tokens)	1.22× tax (688,529 tokens)	1.63× tax (1,447,347 tokens)	2.48× tax (1,441,957 tokens)

Modern Wikipedia Prose · 2,009 sentences

FLORES-200 · Goyal et al. 2022 · dev + devtest combined · Wikipedia-sourced.

Language	OpenAI GPT-5.5 (tiktoken o200k_harmony)	Google Gemini 3.1 Pro (API count_tokens)	DeepSeek V4-Flash (vocab 129K)	Alibaba Qwen 3.6 (tokenizer shared with 3.5)	Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+)	Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x)
English	1.00× ≈ EN (52,421 tokens)	1.00× ≈ EN (55,004 tokens)	1.00× ≈ EN (52,749 tokens)	1.00× ≈ EN (55,706 tokens)	1.00× ≈ EN (83,682 tokens)	1.00× ≈ EN (58,521 tokens)
Chinese	1.26× tax (65,948 tokens)	1.08× tax (59,554 tokens)	0.95× save (50,100 tokens)	0.92× save (51,394 tokens)	1.15× tax (95,967 tokens)	1.62× tax (95,019 tokens)
French	1.36× tax (71,253 tokens)	1.38× tax (76,110 tokens)	1.54× tax (81,086 tokens)	1.37× tax (76,337 tokens)	1.48× tax (123,931 tokens)	1.60× tax (93,709 tokens)
Spanish	1.32× tax (69,000 tokens)	1.28× tax (70,634 tokens)	1.50× tax (78,872 tokens)	1.31× tax (72,909 tokens)	1.46× tax (122,010 tokens)	1.54× tax (90,050 tokens)
Arabic	1.38× tax (72,448 tokens)	1.45× tax (79,919 tokens)	1.62× tax (85,376 tokens)	1.31× tax (73,159 tokens)	1.82× tax (151,947 tokens)	2.59× tax (151,820 tokens)

How this was measured

Methodology

Each cell = tokens(translated) ÷ tokens(English), computed within a single tokenizer on professionally translated parallel corpora. Cells from different tokenizers are not directly comparable: a ratio describes how one tokenizer treats a language relative to its own English baseline, not how many tokens one model uses compared with another.
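The cell definition can be sketched as a corpus-level ratio: sum the token counts over all aligned sentences on each side, then divide. The code below is an illustration, not the project's code. `token_ratio` is a hypothetical helper, and `utf8_bytes` is a deliberately crude stand-in tokenizer (one "token" per UTF-8 byte) so the example runs without any tokenizer library or API key.

```python
from typing import Callable, Sequence


def token_ratio(
    count_tokens: Callable[[str], int],
    english: Sequence[str],
    translated: Sequence[str],
) -> float:
    """tokens(translated) / tokens(English), both counted by the same tokenizer."""
    if len(english) != len(translated):
        raise ValueError("parallel corpora must have the same number of sentences")
    english_total = sum(count_tokens(s) for s in english)
    translated_total = sum(count_tokens(s) for s in translated)
    return translated_total / english_total


def utf8_bytes(text: str) -> int:
    """Stand-in tokenizer for illustration only: one 'token' per UTF-8 byte."""
    return len(text.encode("utf-8"))


english = ["Hello world.", "Thank you."]
french = ["Bonjour le monde.", "Merci."]
print(f"{token_ratio(utf8_bytes, english, french):.2f}x")
```

In a real run `count_tokens` would wrap an actual tokenizer, for example a tiktoken encoding or a provider's count_tokens endpoint, and the same callable would be used for both sides of the division.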

Tokenizers

OpenAI tiktoken o200k_harmony
Alibaba Qwen3.5-9B (same tokenizer as Qwen 3.6, vocab 248K)
DeepSeek V4-Flash-Base (vocab 129K)
Anthropic messages.count_tokens (Opus 4.7 vs 4.5)
Google genai.count_tokens (Gemini 3.1 Pro)

Cost

All count_tokens endpoints are free according to the provider docs: they run only the tokenizer, not inference. Total API spend: $0. Wall-clock time with cached counts: about 75 s for the full run (UN 10K + FLORES 2K sentences × 5 models).
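The page does not describe how counts are cached. A minimal sketch of one way to do it, assuming a JSON file keyed by tokenizer name plus a hash of the text, is shown below. `CachedCounter` and `fake_count` are hypothetical names; `fake_count` stands in for a real tokenizer or count_tokens call.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class CachedCounter:
    """Memoize token counts on disk, keyed by tokenizer name and text hash."""

    def __init__(self, name: str, count_tokens: Callable[[str], int], cache_path: Path):
        self.name = name
        self.count_tokens = count_tokens
        self.cache_path = Path(cache_path)
        self.calls = 0  # number of real tokenizer/API calls made
        if self.cache_path.exists():
            self.counts = json.loads(self.cache_path.read_text(encoding="utf-8"))
        else:
            self.counts = {}

    def count(self, text: str) -> int:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = f"{self.name}:{digest}"
        if key not in self.counts:
            self.counts[key] = self.count_tokens(text)
            self.calls += 1
        return self.counts[key]

    def save(self) -> None:
        self.cache_path.write_text(json.dumps(self.counts), encoding="utf-8")


def fake_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer or count_tokens call."""
    return len(text.split())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.json"
    first = CachedCounter("demo", fake_count, path)
    first.count("the same sentence")
    first.count("the same sentence")   # served from memory
    first.save()
    second = CachedCounter("demo", fake_count, path)
    second.count("the same sentence")  # served from disk
    print(first.calls, second.calls)   # prints "1 0"
```

Including the tokenizer name in the key keeps counts from different tokenizers apart, which matters here because the same sentence is counted once per model.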

Reproducibility

Source code, raw token counts, and full data: project workspace at tokenizer-tax/.

Datasets

UN Parallel Corpus v1.0 · Ziemski, M., Junczys-Dowmunt, M., Pouliquen, B. (2016) · LREC 2016. 10,000 aligned sentences from UN ODS documents 2000–2009, translated by professional UN translators. aclanthology.org/L16-1561
FLORES-200 · Goyal, N., Gao, C., Chaudhary, V., et al. (2022) · Meta AI. 2,009 sentences (dev + devtest combined), Wikipedia source content, professionally translated into 200 languages. github.com/facebookresearch/flores

Tokenizer references

OpenAI tiktoken · github.com/openai/tiktoken
Anthropic count_tokens API · docs.anthropic.com/api/messages-count-tokens
Google genai count_tokens · ai.google.dev/gemini-api/docs/tokens
Qwen 3.5 / 3.6 tokenizer · huggingface.co/Qwen/Qwen3.5-9B
DeepSeek V4-Flash tokenizer · huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base

Inspired by

Aran Komatsuzaki (2026-04-28) · x.com/arankomatsuzaki
