Numbers in Over 5000 Languages


Numbers from 1 to 10 in Over 5000 Languages (One file)

Compiled by the irrepressible Mark Rosenfelder. Additions and corrections welcome.

The links on this page are to a single 1.1-megabyte file with all the numbers, displayed using Unicode. If your browser can't handle either of these things, click below.

This page with links to smaller non-Unicode files

Click here to see the entire collection, or click on the map to move to the languages for that area.

By family

Indo-European, Dravidian, and minor European languages
Afro-Asiatic and Caucasian languages
Nilo-Saharan, Kordofanian, and Khoisan languages
Niger-Congo languages, including Bantu
Uralic, Altaic, Miao-Yao, Tai, Austro-Asiatic, and other Asian languages
Sino-Tibetan languages
Austronesian languages
North American Indian languages - Eskimo, Na-dené, Algic, Keres, Siouan, Caddoan, Iroquoian, Kiowa-Tanoan, "Hokan", isolates
Mesoamerican Indian languages - "Penutian", Uto-Aztecan, Oto-Manguean, Macro-Chibchan, Paezan, Yanomaman
South American Indian languages - "Andean", "Equatorial", Tupi-Cariban, Macro-Otomakoan, Guamo-Chapacuran, Macro-Arawakan, Bora-Witotoan, Macro-Waikurúan, Macro-Panoan, Macro-Ge, isolates
Indo-Pacific languages
Australian languages
Pidgins and creoles
Constructed languages

Special collections

Proto-languages only: perfect for the long-range comparison fan
Million-speaker languages: the world's major languages
The numbers in various writing systems, plus field notes on distinguishing various types of writing systems
Rick Schellen's page of the numbers in over 400 Indo-European dialects.
Jennifer Runner's page on common expressions in many languages.
Language Information: notes on linguistic families, and a taste of ethnomathematics.
How languages are classified, from the sci.lang faq.

Sources

The Sources Page gives the sources for each language (and also lists languages I don't have, and connects the languages to other wide-scale classifications: Ruhlen, Voegelin & Voegelin, Campbell, and the Ethnologue).

I dearly appreciate everyone who's sent me numbers; but I want to particularly salute those whose kindness and hard work have been extraordinary: Jarel Deaton of Ohio, who is single-handedly responsible for more than a quarter of the numbers seen here; Eugene S.L. Chan of Hong Kong, who sent me his entire Austronesian database; and Carl Masthay of St. Louis and Pavel Petrov of Kaliningrad, who sent me their enormous, worldwide collection of numbers.

Special thanks to Claudia Griffith and the staff of the SIL Library in Duncanville, Texas, whose wonderful hospitality made a week of research in the summer of 2004 both pleasant and productive.

Some caveats

Both native spelling and romanizations may obscure actual pronunciation, making comparisons difficult.
Shared numbers do not necessarily indicate genetic relationship; they may be borrowed.
There are often complications (e.g. different series of numbers), and I haven't had room for them here.
The standard orthography or standard dialect may have changed since my source on a language was published.
Hundreds of millions of English speakers agree that the numbers are one, two, three, etc. But only a minority of languages are standardized in this way. For unwritten languages, different linguists' word lists may be strikingly different. Their ears may not be attuned to the language; or there may be dialectal variation, or even sound change. Here are a couple of examples, one from Asia, one from Africa:

Bru mu^ej bār paj pōn s^e:ng t^epat t^epū t^ekual tikeas m^encit
Bru muoi bar p´i poun sau'ng tapoât tapul takual takêh muoi chít
Gurma yèn.dó lyé tà nâ mù lwö.bà lèle: nî pá:nì pyêgà
Gurma n lè nlé nta nna nmu nluoba n lele nni n-ya ka piga
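To get a rough feel for how far two word lists diverge, here is a minimal Python sketch that pairs up the two Bru lists above and scores each pair. The function names are made up for this illustration, and the score is plain character-level similarity from the standard library's difflib; that is an assumption of convenience, not a phonetic measure, so treat the numbers as a crude hint only.

```python
from difflib import SequenceMatcher

# The two Bru word lists for 1-10, copied from the rows above.
BRU_A = ["mu^ej", "bār", "paj", "pōn", "s^e:ng",
         "t^epat", "t^epū", "t^ekual", "tikeas", "m^encit"]
BRU_B = ["muoi", "bar", "p´i", "poun", "sau'ng",
         "tapoât", "tapul", "takual", "takêh", "muoi chít"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for closeness."""
    return SequenceMatcher(None, a, b).ratio()


def compare_lists(first, second):
    """Pair up the words for each number and score each pair."""
    if len(first) != len(second):
        raise ValueError("word lists must have the same length")
    return [
        (n, a, b, similarity(a, b))
        for n, (a, b) in enumerate(zip(first, second), start=1)
    ]


if __name__ == "__main__":
    for n, a, b, score in compare_lists(BRU_A, BRU_B):
        print(f"{n:2d}  {a:10s} {b:12s} {score:.2f}")
```

A score well below 1 for most pairs says only that the spellings differ; it cannot tell a transcription habit from a real dialect difference.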

Language variations

People can get very excited about what's a language vs. what's a dialect. There is nothing inherent in the language variety to tell us what it is. Linguists sometimes use "language" to refer to a mutually intelligible group of dialects (but note that intelligibility can be partial).

Ordinary people generally call something a "language" if it has a prestigious standard form; but that's a fact about people's attitudes, not about language.

I generally rely on Voegelin & Voegelin, or on the original source for the numbers, in deciding whether to list something as a dialect (italicized). Some of my sources list multiple dialects; I usually try to pick the most widely spoken ones, and list others only if they're interestingly divergent.

Corollary: please don't complain to me about what's a dialect or a language-- you're arguing about nothing. (But feel free to send me additional dialects, or point out where I've messed up the names.)

Especially in the Amerind sections, I sometimes list older sources which may be of historical interest.

Symbols

The mondo file linked from this page uses Unicode-- where the characters are available on my 2003 Mac and Windows computers. Annoyingly, the IPA characters are not available, so I still need some substitutions.

* indicates a reconstructed form
+ indicates a dead language (but some are undergoing revivals)

The picture shows the representations used for a number of IPA characters. Nonetheless, I haven't been able to retain all phonetic distinctions, and some have been lost-- for instance, the distinction between a circumflex (â) and a hachek.

For African tonal languages, a macron ^- indicates a high level tone, not length, and is represented as _. | is another tone, usually low level.
For non-African languages, a macron indicates length and is written as a colon (:).

? indicates the glottal stop (but if my sources spell it as an apostrophe or q, I follow them)
bold indicates a character which was dotted in the original source-- usually an emphatic or retroflex consonant
italic indicates open e and o and lax i and u, or a character that was italicized in the original source

Superscript numbers indicate a numbered toneme (e.g. ¹ = first tone)
Appended numbers give tonal contours directly (e.g. 35 = high rising)
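The two tone conventions can be told apart mechanically. The Python sketch below is one way to do it, under assumptions that go beyond what is stated here: that the tone mark sits at the very end of a syllable, and that a syllable carries only one kind of mark. The function name and the sample syllables in the usage note are hypothetical and not taken from the collection.

```python
import re

# Map superscript digits to ordinary ones so a toneme reads as a plain number.
SUPERSCRIPT_DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")

# Assumption: tone marks come last in the syllable.
_TONEME = re.compile(r"^(?P<base>.*?)(?P<tone>[⁰¹²³⁴⁵⁶⁷⁸⁹]+)$")
_CONTOUR = re.compile(r"^(?P<base>.*?)(?P<tone>[0-9]+)$")


def split_tone(syllable: str):
    """Return (base, kind, value): kind is 'toneme', 'contour', or None."""
    match = _TONEME.match(syllable)
    if match:
        # Superscript number: a numbered toneme, e.g. ¹ = first tone.
        value = match.group("tone").translate(SUPERSCRIPT_DIGITS)
        return match.group("base"), "toneme", value
    match = _CONTOUR.match(syllable)
    if match:
        # Appended number: a tonal contour given directly, e.g. 35.
        return match.group("base"), "contour", match.group("tone")
    return syllable, None, None
```

With made-up input, split_tone("ma¹") gives ("ma", "toneme", "1") and split_tone("ma35") gives ("ma", "contour", "35"); a syllable with no trailing number comes back unchanged.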

I use standard orthographies, where there is one, rather than phonetic transcriptions. This makes comparison a bit more difficult; but I prefer it, for two reasons. First, it reduces errors; even if I can correctly interpret a source's phonetic description, there can be orthographic irregularities that make a straight transcription ludicrous. Secondly, an orthography is generally closer to a phonemic representation, which is arguably what people have in their heads.

Numbers about Numbers

Languages with more than a million native speakers are named in boldface.

Number of speakers is one of the least interesting attributes of a language; but there are so many languages here that some highlighting of the most common ones seems necessary. I used the high end of David Crystal's estimates.

How many languages aren't here? Well, there's almost 5000 living languages listed in Ruhlen's volume; I have numbers for about 83% of them, so there's at least a thousand more. (If the math doesn't seem to work out, note that I have plenty of dialects and conlangs not included in Ruhlen's list.) There are about 200 languages with more than a million speakers, all of which are in the list.

Am I going to do higher numbers? Or zero? Probably not, unless I do it for a subset of languages only. Many of the sources don't even have numbers above ten.

How was this done?

People sometimes ask me how I accumulated all these numbers, or how to do this sort of research.

The answer is simple: libraries. I have access to a few good university libraries, and when I can I visit others. You look in grammars, dictionaries, and books or journal articles surveying entire families.

And, if possible, find others who've been bitten by the same bug!
