
Is there a foreign-language tax?

The same content in five languages, tokenized by five frontier models: a count of the actual token bill for non-English text, across formal UN documents and modern FLORES-200 Wikipedia prose.

Color scale: ≤ 0.88× · 0.88–0.95× · ≈ 1.00× · 1.05–1.15× · 1.15–1.30× · 1.30–1.50× · 1.50–1.80× · ≥ 1.80×
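The legend and the cell labels used in the tables (save, ≈ EN, tax) can be read as a simple binning of the ratio. The sketch below is one possible reading, not the page's actual code: the function names are hypothetical, the 0.95–1.05 span for the "≈ 1.00×" bin is an assumption (the legend leaves it implicit), and the handling of values that fall exactly on an edge is a guess.

```python
from bisect import bisect_left

# Bin edges read off the legend. Treating 0.95-1.05 as the "≈ 1.00×" bin
# is an assumption: the legend does not label that span explicitly.
EDGES = [0.88, 0.95, 1.05, 1.15, 1.30, 1.50, 1.80]
BINS = [
    "≤ 0.88×", "0.88–0.95×", "≈ 1.00×", "1.05–1.15×",
    "1.15–1.30×", "1.30–1.50×", "1.50–1.80×", "≥ 1.80×",
]


def bin_index(ratio: float) -> int:
    """Index of the legend bin; a value exactly on an edge goes to the lower bin."""
    return bisect_left(EDGES, ratio)


def legend_bin(ratio: float) -> str:
    return BINS[bin_index(ratio)]


def cell_label(ratio: float) -> str:
    """Map a ratio to the save / ≈ EN / tax label shown next to each cell."""
    idx = bin_index(ratio)
    if idx < 2:
        return "save"
    if idx == 2:
        return "≈ EN"
    return "tax"


for r in (0.87, 0.98, 1.07, 1.42, 2.50):
    print(r, legend_bin(r), cell_label(r))
```

Under these assumed edges, every labelled cell on this page lands in the bin its label suggests.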
I · Headline · merged corpora

Is there a foreign-language tax?

Token-weighted ratio across UN documents (10K sentences, weight 1) and FLORES-200 (2K sentences, weight 2) · same model · foreign-language tokens ÷ English tokens.

| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+) | Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
| --- | --- | --- | --- | --- | --- | --- |
| English | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN | 1.00× ≈ EN |
| Chinese | 1.10× tax | 1.00× ≈ EN | 0.87× save | 0.86× save | 0.94× save | 1.42× tax |
| French | 1.31× tax | 1.31× tax | 1.47× tax | 1.31× tax | 1.33× tax | 1.54× tax |
| Spanish | 1.26× tax | 1.24× tax | 1.45× tax | 1.25× tax | 1.35× tax | 1.52× tax |
| Arabic | 1.35× tax | 1.43× tax | 1.60× tax | 1.24× tax | 1.66× tax | 2.50× tax |
Sources: UN Parallel Corpus v20090831 (UN ODS documents, professional UN translators) and FLORES-200 (Meta open-source benchmark, Wikipedia source content, professional translators).
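The headline cells can be reproduced from the per-corpus token totals listed in sections III and IV. The sketch below assumes "token-weighted" means weighting each corpus's token totals (UN × 1, FLORES × 2) before dividing, rather than averaging the two per-corpus ratios; that reading is an assumption, but it matches the published headline cells checked here. The function name and the data layout are illustrative.

```python
from typing import Iterable, Tuple

# Each row: (corpus weight, English tokens, foreign-language tokens).
Row = Tuple[float, int, int]


def merged_ratio(rows: Iterable[Row]) -> float:
    """Weighted foreign tokens divided by weighted English tokens."""
    rows = list(rows)
    foreign = sum(w * f for w, _, f in rows)
    english = sum(w * e for w, e, _ in rows)
    return foreign / english


# Token totals copied from the UN (weight 1) and FLORES (weight 2) tables.
gpt55_chinese = [(1, 526_091, 563_197), (2, 52_421, 65_948)]
opus47_chinese = [(1, 888_353, 800_737), (2, 83_682, 95_967)]
pre47_arabic = [(1, 581_402, 1_441_957), (2, 58_521, 151_820)]

print(f"GPT-5.5 Chinese:        {merged_ratio(gpt55_chinese):.2f}×")   # 1.10×
print(f"Claude Opus 4.7 Chinese: {merged_ratio(opus47_chinese):.2f}×")  # 0.94×
print(f"Claude pre-4.7 Arabic:   {merged_ratio(pre47_arabic):.2f}×")    # 2.50×
```

Weighting FLORES at 2 keeps the much larger UN corpus from dominating the merged figure entirely, while still letting it carry most of the weight.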
II · Corpus matters

UN → FLORES, by register

Same model, same language, different text register: how much the corpus alone moves the number.

Each cell shows the UN ratio → the FLORES ratio.

| Language | OpenAI GPT-5.5 | Google Gemini 3.1 Pro | DeepSeek V4-Flash | Alibaba Qwen 3.6 | Anthropic Claude Opus 4.7 | Anthropic Claude pre-4.7 |
| --- | --- | --- | --- | --- | --- | --- |
| English | 1.00× → 1.00× | 1.00× → 1.00× | 1.00× → 1.00× | 1.00× → 1.00× | 1.00× → 1.00× | 1.00× → 1.00× |
| Chinese | 1.07× → 1.26× | 0.98× → 1.08× | 0.85× → 0.95× | 0.84× → 0.92× | 0.90× → 1.15× | 1.38× → 1.62× |
| French | 1.29× → 1.36× | 1.30× → 1.38× | 1.46× → 1.54× | 1.30× → 1.37× | 1.31× → 1.48× | 1.53× → 1.60× |
| Spanish | 1.25× → 1.32× | 1.23× → 1.28× | 1.44× → 1.50× | 1.24× → 1.31× | 1.33× → 1.46× | 1.52× → 1.54× |
| Arabic | 1.34× → 1.38× | 1.42× → 1.45× | 1.60× → 1.62× | 1.22× → 1.31× | 1.63× → 1.82× | 2.48× → 2.59× |
Tokenizer cost is not a property of the language alone; it is the joint product of language × model vocabulary × text register. Same model, same language: Chinese UN documents come in at 0.90× on Claude Opus 4.7 and Chinese FLORES prose at 1.15×, a relative gap of about 28% that comes from the text itself, not from the model or the language.
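The size of that register gap depends slightly on whether it is computed from the rounded ratios or from the raw token totals. A small sketch, using the Chinese Claude Opus 4.7 counts from sections III and IV (variable names are illustrative):

```python
# Rounded ratios as published in the UN → FLORES table.
un_rounded, flores_rounded = 0.90, 1.15
gap_rounded = flores_rounded / un_rounded - 1

# Raw token totals for Claude Opus 4.7 (Chinese vs English), per corpus.
un_ratio = 800_737 / 888_353      # UN documents
flores_ratio = 95_967 / 83_682    # FLORES-200
gap_raw = flores_ratio / un_ratio - 1

print(f"gap from rounded ratios: {gap_rounded:.1%}")
print(f"gap from raw counts:     {gap_raw:.1%}")
```

The rounded ratios give roughly 28% and the raw totals roughly 27%, so the figure is best read as "a bit more than a quarter" rather than as a precise value.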
III · UN Parallel Corpus

UN Documents · 10,000 sentences

OPUS UN v20090831 · UN Resolutions 2000–2009 · professional UN translators.

Each cell shows ratio, label, and total tokens in parentheses.

| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+) | Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
| --- | --- | --- | --- | --- | --- | --- |
| English | 1.00× ≈ EN (526,091) | 1.00× ≈ EN (553,614) | 1.00× ≈ EN (522,725) | 1.00× ≈ EN (562,702) | 1.00× ≈ EN (888,353) | 1.00× ≈ EN (581,402) |
| Chinese | 1.07× tax (563,197) | 0.98× ≈ EN (542,016) | 0.85× save (446,839) | 0.84× save (474,474) | 0.90× save (800,737) | 1.38× tax (799,514) |
| French | 1.29× tax (680,965) | 1.30× tax (720,308) | 1.46× tax (761,918) | 1.30× tax (729,438) | 1.31× tax (1,159,462) | 1.53× tax (888,348) |
| Spanish | 1.25× tax (660,081) | 1.23× tax (682,938) | 1.44× tax (750,157) | 1.24× tax (697,832) | 1.33× tax (1,178,933) | 1.52× tax (882,270) |
| Arabic | 1.34× tax (703,812) | 1.42× tax (788,088) | 1.60× tax (835,398) | 1.22× tax (688,529) | 1.63× tax (1,447,347) | 2.48× tax (1,441,957) |
In UN texts (structured, predictable, formal), Chinese is cheaper than English on DeepSeek, Qwen and Claude Opus 4.7, at parity on Gemini 3.1 Pro (0.98×), and only slightly more expensive on GPT-5.5 (1.07×). The Anthropic tax that benchmarks were quoting through 2025 has effectively dissolved in Opus 4.7 (0.90×, down from 1.38× on older Claude models).
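Because every ratio is relative to that tokenizer's own English count, a ratio can fall when the numerator shrinks or when the denominator grows. The sketch below recomputes the two Claude columns from the UN token totals in the table above (the dictionary layout and variable names are illustrative). On these numbers the Chinese total is nearly identical across the two tokenizers, while the English total is about 53% higher under Opus 4.7, which is worth keeping in mind when reading a ratio as a statement about absolute cost.

```python
# UN corpus token totals copied from the table above.
UN_TOKENS = {
    "Claude pre-4.7": {"en": 581_402, "zh": 799_514},
    "Claude Opus 4.7": {"en": 888_353, "zh": 800_737},
}

old = UN_TOKENS["Claude pre-4.7"]
new = UN_TOKENS["Claude Opus 4.7"]

ratio_old = old["zh"] / old["en"]  # Chinese relative to English, old tokenizer
ratio_new = new["zh"] / new["en"]  # Chinese relative to English, new tokenizer

en_growth = new["en"] / old["en"] - 1  # change in English token total
zh_growth = new["zh"] / old["zh"] - 1  # change in Chinese token total

print(f"ratio: {ratio_old:.2f}× -> {ratio_new:.2f}×")
print(f"English tokens: {en_growth:+.1%}")
print(f"Chinese tokens: {zh_growth:+.1%}")
```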
IV · FLORES-200

Modern Wikipedia Prose · 2,009 sentences

Goyal et al. 2022 · dev + devtest combined · Wikipedia-sourced.

Each cell shows ratio, label, and total tokens in parentheses.

| Language | OpenAI GPT-5.5 (tiktoken o200k_harmony) | Google Gemini 3.1 Pro (API count_tokens) | DeepSeek V4-Flash (vocab 129K) | Alibaba Qwen 3.6 (shared with 3.5) | Anthropic Claude Opus 4.7 (new tokenizer, 2026-04+) | Anthropic Claude pre-4.7 (Opus 4.0/4.5/4.6 + Sonnet/Haiku 4.x) |
| --- | --- | --- | --- | --- | --- | --- |
| English | 1.00× ≈ EN (52,421) | 1.00× ≈ EN (55,004) | 1.00× ≈ EN (52,749) | 1.00× ≈ EN (55,706) | 1.00× ≈ EN (83,682) | 1.00× ≈ EN (58,521) |
| Chinese | 1.26× tax (65,948) | 1.08× tax (59,554) | 0.95× save (50,100) | 0.92× save (51,394) | 1.15× tax (95,967) | 1.62× tax (95,019) |
| French | 1.36× tax (71,253) | 1.38× tax (76,110) | 1.54× tax (81,086) | 1.37× tax (76,337) | 1.48× tax (123,931) | 1.60× tax (93,709) |
| Spanish | 1.32× tax (69,000) | 1.28× tax (70,634) | 1.50× tax (78,872) | 1.31× tax (72,909) | 1.46× tax (122,010) | 1.54× tax (90,050) |
| Arabic | 1.38× tax (72,448) | 1.45× tax (79,919) | 1.62× tax (85,376) | 1.31× tax (73,159) | 1.82× tax (151,947) | 2.59× tax (151,820) |
In broader open-domain prose, where the vocabulary is wider, the same models charge more for non-English. For Chinese, Claude Opus 4.7 rises from 0.90× on UN to 1.15× on FLORES, and GPT-5.5 from 1.07× to 1.26×. Chinese-trained tokenizers (Qwen, DeepSeek) keep Chinese cheaper than English even on FLORES.
V · Methodology

How this was measured

Methodology

Each cell = tokens(translated) ÷ tokens(English) within the same tokenizer, on professionally translated parallel corpora. Cells from different tokenizers are not directly comparable.
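A minimal sketch of the per-cell computation follows. It assumes the ratio is taken over corpus totals (sum of translated tokens ÷ sum of English tokens), which is what the token totals in the tables imply, rather than averaging per-sentence ratios. The `token_ratio` helper and the whitespace `toy_count` tokenizer are hypothetical stand-ins; a real run would pass in a function backed by one of the tokenizers listed below.

```python
from typing import Callable, Iterable, Tuple


def token_ratio(
    pairs: Iterable[Tuple[str, str]],
    count_tokens: Callable[[str], int],
) -> float:
    """Total translated tokens divided by total English tokens.

    `pairs` yields (english_sentence, translated_sentence) tuples, and
    `count_tokens` is any callable returning a token count for a string.
    """
    english_total = 0
    translated_total = 0
    for english, translated in pairs:
        english_total += count_tokens(english)
        translated_total += count_tokens(translated)
    if english_total == 0:
        raise ValueError("English side has no tokens")
    return translated_total / english_total


def toy_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated pieces. Not a real model tokenizer."""
    return len(text.split())


# One illustrative aligned pair (made up for this example, not from either corpus).
pairs = [
    ("The committee adopted the resolution .", "Le comité a adopté la résolution ."),
]

print(f"{token_ratio(pairs, toy_count):.2f}×")
```

Keeping the tokenizer behind a single callable makes it straightforward to run the same aligned pairs through each model's counter and to keep every ratio within one tokenizer.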

Tokenizers

OpenAI tiktoken o200k_harmony
Alibaba Qwen3.5-9B (= 3.6, vocab 248K)
DeepSeek V4-Flash-Base (vocab 129K)
Anthropic messages.count_tokens (Opus 4.7 vs 4.5)
Google genai.count_tokens (Gemini 3.1 Pro)

Cost

All count_tokens endpoints are free per provider docs, since they run only the tokenizer and not inference. Total API spend: $0. Wall-clock when cached: ~75s for the full UN 10K + FLORES 2K across 5 models.
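The source does not describe how the cache works. As one plausible shape, the sketch below memoizes counts in a JSON file keyed by a hash of the tokenizer name and the text, so a repeated run never calls the counter twice for the same sentence. The `CountCache` class, the file name, and the toy counter are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class CountCache:
    """Disk-backed memo for token counts (illustrative, not the author's code)."""

    def __init__(self, path: Path, count_tokens: Callable[[str], int], tokenizer_name: str):
        self.path = Path(path)
        self.count_tokens = count_tokens
        self.tokenizer_name = tokenizer_name
        # Load earlier results if the cache file already exists.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def _key(self, text: str) -> str:
        raw = f"{self.tokenizer_name}\n{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def count(self, text: str) -> int:
        key = self._key(text)
        if key not in self.data:
            self.data[key] = self.count_tokens(text)  # only call on a cache miss
        return self.data[key]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.data))


# Demo with a toy counter that records how often it is actually called.
calls = []


def slow_counter(text: str) -> int:
    calls.append(text)
    return len(text.split())


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "counts.json"
    first = CountCache(cache_file, slow_counter, "toy")
    first.count("hello world")
    first.count("hello world")  # served from memory
    first.save()
    second = CountCache(cache_file, slow_counter, "toy")
    cached_value = second.count("hello world")  # served from disk

print(cached_value, len(calls))
```

Including the tokenizer name in the key keeps counts from different tokenizers from colliding in a shared cache file.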

Reproducibility

Source code, raw token counts, and full data: project workspace at tokenizer-tax/.

Datasets

  1. UN Parallel Corpus v1.0 · Ziemski, M., Junczys-Dowmunt, M., Pouliquen, B. (2016) · LREC 2016. 10,000 aligned sentences from UN ODS documents 2000–2009, professional UN translators. aclanthology.org/L16-1561
  2. FLORES-200 · Goyal, N., Gao, C., Chaudhary, V., et al. (2022) · Meta AI. 2,009 sentences (dev + devtest combined), Wikipedia source content, professionally translated into 200 languages. github.com/facebookresearch/flores

Tokenizer references

  1. OpenAI tiktoken · github.com/openai/tiktoken
  2. Anthropic count_tokens API · docs.anthropic.com/api/messages-count-tokens
  3. Google genai count_tokens · ai.google.dev/gemini-api/docs/tokens
  4. Qwen 3.5 / 3.6 tokenizer · huggingface.co/Qwen/Qwen3.5-9B
  5. DeepSeek V4-Flash tokenizer · huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base

Inspired by

  1. Aran Komatsuzaki (2026-04-28) · x.com/arankomatsuzaki
Ji Zhang · 2026 / April 30