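Token counts per Unicode category in each model's vocabulary, with the share of the vocabulary in parentheses. Categories are sorted by number of tokens in descending order; models are sorted by number of LATIN tokens.
MODEL
LATIN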
CYRILLIC
CJK
ARABIC
NUMBER
PUNCTUATION
MYANMAR
HANGUL
SYMBOL
GREEK
DEVANAGARI
HEBREW
SEPARATOR
THAI
HIRAGANA
GEORGIAN
ARMENIAN
KATAKANA
MALAYALAM
TAMIL
TELUGU
BENGALI
CONTROL CHARS
KANNADA
SINHALA
ETHIOPIC
GUJARATI
KHMER
LAO
GURMUKHI
ORIYA
MATHEMATICAL
MODIFIER
KATAKANA-HIRAGANA
CANADIAN
TIBETAN
CHEROKEE
FULLWIDTH
HALFWIDTH
THAANA
BOPOMOFO
MONGOLIAN
SYRIAC
TIFINAGH
RUNIC
COPTIC
NEW
NKO
OL
EGYPTIAN
MEETEI
TAI
GOTHIC
BALINESE
IDEOGRAPHIC
DOUBLE-STRUCK
SHAVIAN
CHAM
NEWA
BAMUM
SCRIPT
ADLAM
JAVANESE
MICRO
MASCULINE
YI
BUGINESE
GLAGOLITIC
UNKNOWN
FEMININE
OGHAM
CHAKMA
VAI
PHAGS-PA
CARON
REJANG
SUPERSCRIPT
MANDAIC
BRAHMI
BLACK-LETTER
DESERET
LISU
SAMARITAN
OHM
PLANCK
TAGBANWA
BATAK
CUNEIFORM
LINEAR
KAYAH
INFORMATION
ANGSTROM
ALEF
TAGALOG
MASU
TURNED
ANATOLIAN
PHOENICIAN
KELVIN
ROMAN
VERTICAL
SUNDANESE
GEMMA-7B
Google
256,000
185449
(72.44%)
13054
(5.10%)
21469
(8.39%)
6044
(2.36%)
476
(0.19%)
4576
(1.79%)
1326
(0.52%)
2058
(0.80%)
3635
(1.42%)
1274
(0.50%)
1063
(0.42%)
1191
(0.47%)
3516
(1.37%)
1067
2976
(1.16%)
87
(0.03%)
196
(0.08%)
2689
(1.05%)
118
(0.05%)
184
(0.07%)
129
146
(0.06%)
1064
79
54
(0.02%)
225
(0.09%)
84
42
59
39
72
43
576
(0.22%)
80
36
(0.01%)
64
61
81
56
60
30
37
31
28
19
25
3
(0.00%)
34
26
21
0
9
10
1
17
12
15
14
2
6
8
7
GPT-4O
OpenAI
199,998
133756
(66.88%)
14049
(7.02%)
6754
(3.38%)
7852
(3.93%)
1302
(0.65%)
5379
(2.69%)
3789
(1.89%)
1906
(0.95%)
1725
(0.86%)
1453
(0.73%)
3020
(1.51%)
2310
3634
(1.82%)
1336
(0.67%)
323
(0.16%)
2144
(1.07%)
1670
(0.84%)
318
1085
(0.54%)
617
(0.31%)
745
(0.37%)
1499
(0.75%)
723
(0.36%)
848
171
1127
(0.56%)
112
108
136
33
18
11
MT5-BASE
250,100
116712
(46.67%)
26685
(10.67%)
19916
(7.96%)
7234
(2.89%)
16021
(6.41%)
2126
(0.85%)
6533
(2.61%)
4126
(1.65%)
2244
(0.90%)
5217
(2.09%)
2476
(0.99%)
3960
(1.58%)
4254
(1.70%)
4309
(1.72%)
2589
(1.04%)
2261
2812
(1.12%)
2479
2847
(1.14%)
1846
(0.74%)
1397
1350
1699
(0.68%)
1375
(0.55%)
997
(0.40%)
851
(0.34%)
1788
(0.71%)
1497
(0.60%)
1333
(0.53%)
419
(0.17%)
75
89
(0.04%)
32
65
45
46
5
4
XLM-ROBERTA-BASE
FacebookAI
250,002
110088
(44.03%)
31670
(12.67%)
17774
(7.11%)
14423
(5.77%)
3296
(1.32%)
713
(0.29%)
4425
(1.77%)
5413
(2.17%)
1212
(0.48%)
5174
(2.07%)
6968
(2.79%)
5124
(2.05%)
4157
(1.66%)
2057
(0.82%)
3770
3517
(1.41%)
726
3481
(1.39%)
2463
2859
2168
(0.87%)
272
(0.11%)
2467
3191
(1.28%)
2947
(1.18%)
2014
(0.81%)
1696
1485
(0.59%)
1458
(0.58%)
1703
16
24
LLAMA-3-8B
Meta
128,000
95921
(74.94%)
6387
(4.99%)
3833
(2.99%)
3591
(2.81%)
1213
5638
(4.40%)
666
1784
1744
(1.36%)
1324
(1.03%)
591
(0.46%)
2539
(1.98%)
1138
(0.89%)
518
334
(0.26%)
669
38
22
GPT-4
100,256
88833
(88.61%)
690
(0.69%)
790
(0.79%)
73
1131
(1.13%)
5204
(5.19%)
193
1376
29
1053
86
62
603
GPT-2
50,257
46942
(93.40%)
1683
(3.35%)
655
(1.30%)
452
240
35
90
(0.18%)
PHI-2
Microsoft
660
(1.31%)
235
OLMO-7B
AllenAI
50,280
43738
(86.99%)
344
313
(0.62%)
66
(0.13%)
2033
(4.04%)
1777
(3.53%)
23
725
(1.44%)
168
(0.33%)
13
676
(1.34%)
104
(0.21%)
(0.27%)
PYTHIA-70M
EleutherAI
50,254
43735
(87.03%)
(4.05%)
(3.54%)
653
T5-BASE
32,100
30853
(96.12%)
(0.25%)
968
(3.02%)
161
MISTRAL-7B-V0.1
MistralAI
32,002
26014
(81.29%)
1731
(5.41%)
1459
(4.56%)
57
1078
(3.37%)
122
(0.38%)
346
(1.08%)
519
(1.62%)
(0.23%)
(0.10%)
27
44
(0.14%)
58
74
115
20
CODELLAMA-7B-HF
Code Llama
32,016
25929
(80.99%)
2951
(9.22%)
700
(2.19%)
47
(0.15%)
1065
(3.33%)
97
(0.30%)
111
(0.35%)
496
(1.55%)
95
LLAMA-2-7B-HF
32,000
25915
(80.98%)
494
(1.54%)
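The parenthesized figures read as each category's count divided by the model's vocabulary size: for GEMMA-7B, 185449 / 256,000 ≈ 72.44%. The category labels (LATIN, MICRO, OHM, DOUBLE-STRUCK and so on) look like the first word of each character's Unicode name, but the page does not say how tokens were classified. The sketch below is therefore an assumption, not the page's method: `char_category`, `token_category`, and `vocab_breakdown` are hypothetical helpers that label a token by the most common category among its characters and then tabulate counts and shares.

```python
import unicodedata
from collections import Counter


def char_category(ch: str) -> str:
    """Coarse category for one character (hypothetical scheme, assumed here)."""
    if ch.isspace():
        return "SEPARATOR"
    if ch.isdigit():
        return "NUMBER"
    major = unicodedata.category(ch)[0]  # first letter of the general category
    if major == "C":
        return "CONTROL CHARS"
    if major == "P":
        return "PUNCTUATION"
    if major == "S":
        return "SYMBOL"
    name = unicodedata.name(ch, "")
    if not name:
        return "UNKNOWN"
    if name.startswith("CJK"):
        return "CJK"
    # Otherwise use the first word of the Unicode name, e.g. "LATIN", "MICRO".
    return name.split()[0]


def token_category(token: str) -> str:
    """Label a token by the most common category among its characters."""
    counts = Counter(char_category(ch) for ch in token)
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


def vocab_breakdown(vocab):
    """Map category -> (token count, percent of vocabulary), largest first."""
    counts = Counter(token_category(tok) for tok in vocab)
    total = sum(counts.values())
    return {cat: (n, 100.0 * n / total) for cat, n in counts.most_common()}


if __name__ == "__main__":
    # Toy vocabulary for illustration only; a real run would pass a tokenizer's vocabulary.
    toy_vocab = ["hello", " world", "привет", "中文", "123", "!!"]
    for cat, (n, pct) in vocab_breakdown(toy_vocab).items():
        print(f"{cat}: {n} ({pct:.2f}%)")
```

A majority vote is only one way to handle tokens that mix scripts. Counting a token under every script it contains, or under the script of its first non-space character, would produce different totals.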