First, we need to decide what constitutes a phonetic match between the two languages. One way of doing this is to decide for each Quechua phoneme what Chinese phonemes we'll accept as matches. (Think of it this way: is Qu. runa a match for Ch. rén? Is chinchi a match for chong? Is chay a match for zhè?)
We might decide as follows. The criterion here is obviously phonetic similarity. We could certainly improve on this by requiring a particular phonological distance; e.g. a difference of no more than two phonetic features, such as voicing or place of articulation. The important point, as we will see, is to be clear about what we do and do not count as a match; or, if we are evaluating someone else's work, to use the same phonetic criteria they do.
Qu. | Ch. |
p | p, b |
t | t, d |
ch | ch, zh, j, q, c, z |
k | k, g |
s | s, sh, c, z, x, zh |
h | h |
q | h, k |
m | m, n |
n | m, n, ng |
ñ | m, n, ng, y |
l | l, r |
ll | l, r, y |
r | l, r |
w | w, u |
y | y, i |
a | a, e, o |
i | i, e, y |
u | u, o, w |
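The table can be written down directly as a lookup structure. Here is a minimal Python sketch; the names `MATCHES` and `is_match` are my own, chosen for illustration, and the sets simply transcribe the table above.

```python
# Quechua phoneme -> set of Chinese (pinyin) phonemes accepted as a match.
# Transcribed from the table above; the names are illustrative only.
MATCHES = {
    "p": {"p", "b"},
    "t": {"t", "d"},
    "ch": {"ch", "zh", "j", "q", "c", "z"},
    "k": {"k", "g"},
    "s": {"s", "sh", "c", "z", "x", "zh"},
    "h": {"h"},
    "q": {"h", "k"},
    "m": {"m", "n"},
    "n": {"m", "n", "ng"},
    "ñ": {"m", "n", "ng", "y"},
    "l": {"l", "r"},
    "ll": {"l", "r", "y"},
    "r": {"l", "r"},
    "w": {"w", "u"},
    "y": {"y", "i"},
    "a": {"a", "e", "o"},
    "i": {"i", "e", "y"},
    "u": {"u", "o", "w"},
}


def is_match(quechua_phoneme: str, chinese_phoneme: str) -> bool:
    """Return True if the Chinese phoneme is an accepted match for the Quechua one."""
    return chinese_phoneme in MATCHES.get(quechua_phoneme, set())
```

With this table, `is_match("ch", "zh")` is true (the initials of chay and zhè), while `is_match("p", "t")` is false.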
We will next need to know the frequency with which each phoneme occurs in each language. This can be calculated using a simple program operating on sample texts. For Quechua we find:
Qu. | initial (%) | medial (%) | final (%) |
a | 5.291005 | 25.906736 | 40.211640 |
b | 2.645503 | 0 | 0 |
d | 0 | 0.310881 | 0 |
g | 0.529101 | 0.103627 | 0 |
h | 5.820106 | 0 | 0 |
i | 2.645503 | 8.808290 | 5.291005 |
k | 14.814815 | 5.595855 | 3.174603 |
l | 0.529101 | 0.414508 | 0 |
m | 7.407407 | 4.145078 | 3.703704 |
n | 1.587302 | 6.528497 | 25.396825 |
p | 7.936508 | 6.010363 | 0 |
q | 4.232804 | 3.108808 | 8.465608 |
r | 4.232804 | 5.077720 | 0 |
s | 6.349206 | 4.145078 | 2.645503 |
t | 7.407407 | 6.424870 | 0 |
u | 3.703704 | 11.398964 | 2.645503 |
w | 11.111111 | 1.450777 | 0.529101 |
y | 3.174603 | 4.145078 | 7.936508 |
ch | 6.878307 | 3.108808 | 0 |
ñ | 1.058201 | 1.243523 | 0 |
rr | 0.529101 | 0 | 0 |
ll | 2.116402 | 1.865285 | 0 |
And for Chinese:
Ch. | initial (%) | medial (%) | final (%) |
a | 1.400000 | 21.494371 | 7.739308 |
b | 7.000000 | 1.432958 | 0 |
c | 0.600000 | 0.102354 | 0 |
d | 12.800000 | 1.228250 | 0 |
e | 0.200000 | 8.904811 | 15.885947 |
f | 2.000000 | 0.614125 | 0 |
g | 3.200000 | 1.842375 | 0 |
h | 3.400000 | 2.149437 | 0 |
i | 0 | 17.195496 | 29.327902 |
j | 4.600000 | 1.944729 | 0 |
k | 2.200000 | 0.204708 | 0 |
l | 6.000000 | 2.149437 | 0 |
m | 2.600000 | 1.330604 | 0 |
n | 3.800000 | 6.038895 | 11.608961 |
o | 0.400000 | 7.881269 | 9.368635 |
p | 1.000000 | 0.102354 | 0 |
q | 2.000000 | 1.842375 | 0 |
r | 0.800000 | 0.307062 | 1.629328 |
s | 0.800000 | 1.023541 | 0 |
t | 3.800000 | 1.228250 | 0 |
u | 0 | 8.495394 | 12.016293 |
w | 7.800000 | 0.716479 | 0 |
x | 4.200000 | 0.614125 | 0 |
y | 9.600000 | 0.511771 | 0 |
z | 4.200000 | 1.023541 | 0 |
ch | 2.200000 | 0.716479 | 0 |
ng | 0 | 5.834186 | 12.016293 |
sh | 7.800000 | 1.330604 | 0 |
zh | 5.600000 | 1.740020 | 0 |
(The reader who knows Chinese may wonder how we can have medial consonants at all. The answer is that I am using Chinese lexemes, not single characters (zì), so that, for instance, Zhōngguó 'China' is one word, not two.)
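For the curious, a counting program of this kind might look like the sketch below. It is not the program that produced the tables above. It assumes the words are spelled without tone marks, that the digraphs listed in `DIGRAPHS` are single phonemes, and that a one-phoneme word counts as an initial only. All the names are my own. The greedy digraph rule is deliberately crude: it will misread an n followed by g across a syllable boundary as ng.

```python
from collections import Counter

# Digraphs treated as single phonemes; an assumption covering both
# orthographies (Quechua ch, ll, rr; pinyin ch, sh, zh, ng).
DIGRAPHS = ("ch", "sh", "zh", "ng", "ll", "rr")


def split_phonemes(word: str) -> list[str]:
    """Split a spelled word into phonemes, taking digraphs greedily."""
    phonemes, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phonemes.append(word[i:i + 2])
            i += 2
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes


def position_frequencies(words):
    """Return {position: {phoneme: percent}} for initial, medial and final."""
    counts = {"initial": Counter(), "medial": Counter(), "final": Counter()}
    for word in words:
        phonemes = split_phonemes(word.lower())
        if not phonemes:
            continue
        counts["initial"][phonemes[0]] += 1
        if len(phonemes) > 1:
            counts["final"][phonemes[-1]] += 1
        # Everything between the first and last phoneme counts as medial.
        counts["medial"].update(phonemes[1:-1])
    # Convert raw counts to percentages within each position.
    return {
        pos: {ph: 100.0 * n / sum(c.values()) for ph, n in c.items()}
        for pos, c in counts.items() if c
    }
```

Calling `position_frequencies(["runa", "chay", "llama"])` gives each of r, ch and ll a third of the initials, and gives a 40% of the medials.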
Now we're in a position to calculate the probability for a match. Let's start by assuming that there must be a match (within the phonetic categories established above) in all three positions: initial, medial, and final.
To calculate the probability p_i for a match in the initial, we go down the list of Quechua initials, multiplying the probability of each by the probability of finding the matching sound(s) in that same position in Chinese. For instance, the probability of a match on initial p is the probability of initial p in Quechua (.0794) times the probability of initial p or b in Chinese (.01 + .07 = .08), or .00635.
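The same procedure can be sketched in a few lines of Python. The function name and the dictionary layout are assumptions of mine. Frequencies are given as fractions rather than percentages, and only the rows for p and k are filled in, using the initial-position figures from the tables.

```python
def position_match_probability(quechua_freq, chinese_freq, matches):
    """Sum over Quechua phonemes of P(phoneme) * P(any matching Chinese phoneme).

    All frequencies are fractions (0.0794, not 7.94) for one word position.
    """
    total = 0.0
    for phoneme, p_quechua in quechua_freq.items():
        # Probability that the Chinese word has any accepted match here.
        p_chinese = sum(chinese_freq.get(c, 0.0) for c in matches.get(phoneme, ()))
        total += p_quechua * p_chinese
    return total


# Two rows only, for illustration.
quechua_initial = {"p": 0.0794, "k": 0.14815}
chinese_initial = {"p": 0.01, "b": 0.07, "k": 0.022, "g": 0.032}
matches = {"p": ["p", "b"], "k": ["k", "g"]}

p_only = position_match_probability({"p": 0.0794}, chinese_initial, matches)
print(round(p_only, 5))  # 0.00635
```

Feeding in the full frequency tables and the full match table would give the probability for the whole position in one call.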
I show the entire calculations below, because some of them are quite eloquent, and show the value of taking a frequency approach. If you're looking for a match for a Quechua word in s-, for instance, you have a 23% chance of matching any of the sounds we've judged as similar in Chinese. You're likely to match medial -a- 38% of the time; final -a 33% of the time, final -n 24% of the time.
(In each row, the first item is the Quechua sound; it's followed by the Chinese sounds we said would be a match. The first number is the probability of the Quechua phoneme; the second is the sum of the probabilities of the matching Chinese sounds; the third is the product of the first two.)
Initials
a aeo | .05291 * .020 = | .00106 |
h h | .05820 * .034 = | .00198 |
i iey | .02646 * .098 = | .00259 |
k kg | .14815 * .054 = | .00800 |
l lr | .00529 * .068 = | .00036 |
m mn | .07407 * .064 = | .00474 |
n mn ng | .01587 * .160 = | .00254 |
p pb | .07937 * .080 = | .00635 |
q hk | .04233 * .056 = | .00237 |
r lr | .04233 * .068 = | .00288 |
s s sh c z x zh | .06349 * .232 = | .01473 |
t td | .07407 * .166 = | .01230 |
u uow | .03704 * .082 = | .00304 |
w wu | .11111 * .078 = | .00867 |
y yi | .03175 * .096 = | .00305 |
ch ch zh jqcz | .06883 * .192 = | .01322 |
ñ mn ng y | .01058 * .160 = | .00169 |
ll lry | .02121 * .164 = | .00348 |
Probability for an initial match = .09305 = 9.3 %
Medials
a aeo | .25907 * .3828 = | .09917 |
i iey | .08808 * .2661 = | .02344 |
k kg | .05596 * .0205 = | .00114 |
l lr | .00415 * .0246 = | .00010 |
m mn | .04145 * .0737 = | .00305 |
n mn ng | .06528 * .1320 = | .00862 |
p pb | .06010 * .0153 = | .00092 |
q hk | .03109 * .0235 = | .00073 |
r lr | .05078 * .0246 = | .00125 |
s s sh c z x zh | .04145 * .0582 = | .00241 |
t td | .06425 * .0246 = | .00158 |
u uow | .11399 * .1710 = | .01949 |
w wu | .01451 * .0921 = | .00134 |
y yi | .04145 * .1771 = | .00734 |
ch ch zh jqcz | .03109 * .0736 = | .00229 |
ñ mn ng y | .01244 * .1371 = | .00170 |
ll lry | .01865 * .0297 = | .00055 |
Probability for a medial match = .17514 = 17.5 %
Finals
a aeo | .40212 * .3299 = | .13266 |
i iey | .05291 * .4522 = | .02393 |
k kg | .03175 * 0 = | 0 |
m mn | .03704 * .116 = | .00430 |
n mn ng | .25397 * .236 = | .05994 |
q hk | .08466 * 0 = | 0 |
s s sh c z x zh | .02646 * 0 = | 0 |
u uow | .02646 * .2139 = | .00566 |
w wu | .00529 * .1202 = | .00064 |
y yi | .07937 * .2933 = | .02328 |
Probability for a final match = .25039 = 25.0 %
So, the probability of finding a random match on a single word (with no semantic leeway) is .0931 * .1751 * .2504 = 0.0041, or 1 in 244.
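As a quick check of the arithmetic, here is the combination in Python, using the three rounded figures from the text. Multiplying them treats the three positions as independent, which is what the calculation above does.

```python
p_initial, p_medial, p_final = 0.0931, 0.1751, 0.2504

# All three positions must match, so the probabilities multiply.
p_word = p_initial * p_medial * p_final
print(f"{p_word:.4f}")  # 0.0041

# Odds computed from the rounded figure, as in the text.
print(f"1 in {round(1 / round(p_word, 4))}")  # 1 in 244
```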
Two lessons may be drawn. First, phoneme frequency matters. Both Quechua and Chinese have very many medial a sounds, and final nasals, and initial affricates. That makes random matches involving those sounds much more likely.
Second, seemingly minor points of procedure have a huge impact on our results. We are used to situations where rough calculations do not lead us far astray. But in this area differing assumptions or methodologies lead to very different results. Very careful attention to both is warranted.
Obviously the initial-medial-final calculation is still a simplification. Quechua, for instance, can have both initial and final consonant clusters; both languages have some two-phoneme roots; and of course a vague "medial" category is not a good way of handling multisyllabic words.
We might decide to allow a Quechua medial to match either a Chinese medial or final, to catch resemblances like runa/rén and chinchi/chong. To do this we need to compute the chance that a Quechua medial matches a Chinese final, as follows. (We can skip Quechua medials for which none of the corresponding Chinese sounds can end a word.)
Medial-to-final
a aeo | .25907 * .3300 = | .08549 |
i iey | .08808 * .4521 = | .03982 |
l lr | .00415 * .0163 = | .00007 |
m mn | .04145 * .1161 = | .00481 |
n mn ng | .06528 * .2363 = | .01543 |
r lr | .05078 * .0163 = | .00083 |
u uow | .11399 * .2138 = | .02437 |
w wu | .01451 * .1202 = | .00174 |
y yi | .04145 * .2933 = | .01216 |
ñ mn ng y | .01244 * .2363 = | .00294 |
ll lry | .01865 * .0163 = | .00030 |
Probability for a medial-to-final match = .18796 = 18.8 %
This can be added to the previous medial-to-medial estimate, on the grounds that when a medial doesn't match another medial, we're giving it another chance to match a final. However, the additional chance should be discounted by the probability (30% in my sample Chinese text) that the initial and final are the same (that is, that the word is just two phonemes long). So the medial-to-medial-or-final probability is .1751 + (.1880 * .70) = .3067.
The probability of finding a random match on a single word (no semantic leeway) can now be given as .0931 * .3067 * .2504 = 0.0071, or about 1 in 140.
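The adjustment is easy to reproduce. This sketch just restates the numbers given above; the variable names are mine.

```python
p_initial, p_final = 0.0931, 0.2504
p_medial_to_medial = 0.1751
p_medial_to_final = 0.1880
p_two_phonemes = 0.30  # share of two-phoneme words in the Chinese sample

# The second chance (medial against final) is discounted by the 30% of
# Chinese words that are just two phonemes long.
p_medial_or_final = p_medial_to_medial + p_medial_to_final * (1 - p_two_phonemes)
print(f"{p_medial_or_final:.4f}")  # 0.3067

p_word = p_initial * p_medial_or_final * p_final
print(f"{p_word:.4f}")  # 0.0071
```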
This estimate could be revised still further to take account of such things as metathesis (switched consonants), or Quechua's initial consonant clusters. Note that both examples allow additional matches, and thus will increase p even more.
Since the probability of a two-phoneme match is obviously going to be much higher than that of a three-phoneme match, I don't recommend trying to combine both types of match into a single p, which would understate the difficulty of finding 3-phoneme matches and overstate that of 2-phoneme matches.
We can estimate the probability of a 2-phoneme match by using the probability of a match on initials times that of a Quechua medial matching a Chinese medial or final: .0931 * .3067 = .0286, or about 1 in 35.
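In Python, taking the medial-or-final figure of .3067 derived earlier (the variable names are mine):

```python
p_initial = 0.0931
p_medial_or_final = 0.3067  # Quechua medial matching a Chinese medial or final

# Only two positions have to match, so only two factors are multiplied.
p_two_phoneme_match = p_initial * p_medial_or_final
print(f"about 1 in {round(1 / p_two_phoneme_match)}")  # about 1 in 35
```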
This could be refined by adding the probability that a Quechua final matches a Chinese medial or final, this time discounted by the probability that the Quechua medial is also the final.
If you want to avoid phonetic calculations entirely, there's an alternative approach: We pick a word a in A, then pick the word b in B which most closely resembles it phonetically. To handle phonetic looseness, we pick the n words in B which most closely resemble it phonetically.
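As a rough illustration of picking the n closest words, here is a sketch using Python's standard `difflib` module. Note the big assumption: similarity here is `difflib`'s ratio on spellings, which is only a crude stand-in for phonetic resemblance. The function name and the toy word list are invented for the example.

```python
import difflib


def closest_words(word, lexicon, n=3):
    """Return the n entries of lexicon most similar to word.

    Similarity is difflib's ratio on spellings, not a phonetic measure;
    cutoff=0.0 keeps every candidate in play so that n words come back
    whenever the lexicon has at least n entries.
    """
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=0.0)


# Toy lexicon for language B (tone marks omitted).
lexicon_b = ["ren", "chong", "zhe", "shan", "ma"]
print(closest_words("runa", lexicon_b, n=2))
```

A serious version would replace the spelling ratio with some measure of phonetic distance, which of course brings the phonetic details back in through the side door.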
The advantage is that we don't have to mess with phonetic details or how to match the phonologies of different languages. We can proceed quickly to an estimate of how many matches we can expect to find in general between two languages.
The disadvantage is that this approach doesn't lend itself to evaluating other people's claims. You can picture (say) Greenberg & Ruhlen examining the n words in Tfaltik that most closely resemble maliq'a. But what is their n? To give a reasonable estimate we have to dive back into phonetic details and probabilities.