Hou tu pranownse Inglish

© 2000 by Mark Rosenfelder
Everybody agrees that English spelling is horrible.

There have been almost as many proposals for spelling reform as there are rewrites of Esperanto. (Tellingly, there has been precisely one success in each category-- Noah Webster and Ido-- and neither caught on universally.) Most of these proposals spend their energy fixing what isn't broken. For instance, they search hard for clever new ways of spelling the ch sound-- even though ch does the job just fine in hundreds of languages. Or, they insist on 'correcting' the Great Vowel Shift, using Italian values for the vowels.

Whenever the subject comes up, someone is sure to bring up all the words in -ough, or George Bernard Shaw's ghoti-- a word which illustrates only Shaw's wiseacre ignorance. English spelling may be a nightmare, but it does have rules, and by those rules, ghoti can only be pronounced like goatee.

The purpose of this page is to describe those rules-- to explain the system behind English spelling, the rules that tell you how to pronounce a written word correctly over 85% of the time.

Many people expect the opposite as well-- to predict the spelling from the pronunciations-- not realizing that few orthographies meet this goal. It's far from true of Spanish, for instance, which is often held up as an example of a good orthography. I stopped fervently admiring Spanish orthography when I saw a sign in a Mexican bakery with about one spelling mistake every third word.

Several different types of people might be interested in this page:

I've also included a sample lexicon and a set of spelling rules which you can use with my Sound Change Applier to automatically derive the pronunciation.


Thanks to Éamonn McManus, Aaron J. Dinkin, Dennis Paul Himes, Geoff Eddy, Hirofumi Nagamura, and John Cowan for useful comments and ideas, which I've tried to incorporate here.

The sounds of General American

If we're discussing spelling, we have to discuss sounds as well; and this means choosing a reference dialect. I'll use my own, of course-- a version of General American that's unexcitingly close to the standard. I'll call it GA below.

Here's the vowels and consonants of my dialect. For each I give the IPA, the representation in the eccentric phonemic transcription I use in this document, and a couple of sample words.

The IPA is given in Unicode; if it doesn't look right you have a nasty old non-Unicode-compliant browser.

Vowels
Consonants
IPA Phoneme Samples IPA Phoneme Samples
 e  ä  rate  p  p  paper
 æ  â  rat  b  b  book
 i  ë  meet, machine  t  t  take
 ɛ  ê  met, dread  d  d  dead
 aj  ï  bite, cycle  g  g  get
 ɪ  î  bit, lick  k  k  cape, talk, quite
 o  ö  note, sow  m  m  moon
 a  ô  not, clock  n  n  new
 ju  ü  cute, you  ŋ  ñ  sing, think
 ʌ  û  cut, come  f  f  four, physics
   v  v  vine
 u  u  coot  θ  +  thin
 ɔ  ò  caught, dog  ð  +  this
 ʊ  ù  cook, put  s  s  so
 ə  @  above, cynic, until  z  z  zoo
   ʃ  $  shack
 aw  ôw  crowd, loud  ʒ  $  measure
 oj  öy  boy, droid    ç  chew
     j  judge
 j  y  you, million  r  r  ran
 w  w  wait, cow  l  l  late
   h  h  hang
 ɚ  @r  search, manor, bird  
   @n  button, happen
   @l  battle, final

Who cares about dialects?

Ideally you shouldn't have to worry about my dialect at all: you could simply take (say) ê to represent whatever you pronounce as the vowel in met. Unfortunately, English dialects are not uniform enough to share a single phonology. There are many words that are not only pronounced differently in different dialects-- that is, they have a distinct phonetic realization-- but also have their own phonemic representation.

Some examples:

Notational conventions

Spellings are in teal italics; pronunciations are in blue Courier. This convention avoids cluttering the text with brackets and quotation marks.

Thus g refers to the letter <g>, while g refers to the sound /g/, and I will write that laugh is pronounced lâf.

Linguists can take the 'pronunciations' as phonemic; e.g. I haven't attempted to indicate aspiration, the flapping of medial t and d, the appearance of clear and dark l, etc. I indicate some but not all vowel reductions (basically, those that are reduced in all forms of the morpheme).

# represents the beginning or end of a word. For instance, #rh represents an rh that begins a word; g# refers to a final g.

Capital letters represent variables; e.g. V represents any vowel.

The computer simulation

Along with this explanatory page, I've put up

The lexicon includes the target pronunciation in GA; I modified the program to compare the results of the rule application with the target. The results:

This is impressive; but it understates the systematicity of English spelling:

There is a fuller discussion of the mispredictions at the end of the document.

The odd phonetic transcription, by the way, derives from the dual need to easily represent sounds both in html and in the sound change file. I'm restricted to characters that html supports; and I can't use capital letters, because I need them for variable definitions in the rules. As a mnemonic, think of the umlauts as colons, so that ö is short for o:, 'long o'.

The wacky spellings I used for the vowels, however, are inherent in the logic of English spelling. It would only obscure how the system works if I represented the long and short vowels with IPA forms.

The rules

The bulk of this page is basically a human-readable restatement of the rules in the sound change file.

The order of the rules is important. The rules can be thought of as a recipe: to pronounce a word, you go down the list of rules, seeing if each one in turn applies, and applying it if it does.

The result is sometimes a little backwards in terms of explaining the system, because exceptions come first, before the general rules. That's the best way to teach the computer; but humans tend to do best by learning the most general rule first.
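The recipe idea can be sketched in a few lines of Python. This is only an illustration: it uses ordinary regular expressions rather than the Sound Change Applier's own rule format, and the three rules (a piece of rule 1 and a simplified rule 20) and the name apply_rules are picked here just to show why the order matters.

```python
import re

# Each rule is (pattern, replacement); the list order is the recipe order.
# These three rules are a tiny illustrative subset, not the real rule file.
RULES = [
    (r"ch", "ç"),          # part of rule 1: the digraph ch
    (r"c(?=[ei])", "s"),   # rule 20: c softens before a front vowel
    (r"c", "k"),           # rule 20: c is k elsewhere
]

def apply_rules(word, rules):
    """Go down the list once, applying each rule wherever it matches."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

print(apply_rules("chin", RULES))                  # çin
print(apply_rules("acid", RULES))                  # asid
print(apply_rules("chin", list(reversed(RULES))))  # khin
```

With the list reversed, the general c rule fires before the ch exception gets a chance, which is why the exceptions have to come first.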

I'll warn you: some of these rules are going to seem mondo obscure. That's because I've tried to find every regularity I could, even if it only explains half a dozen words. The yield of some rules may be small enough that some people would rather just learn the affected words as irregularities. But if anything I'm more interested in the minor regularities; they're puzzles, often unfamiliar ones, and many are the fossils of minor sound changes.

To head off another likely reaction: yes, you can find exceptions to the rules. I'm perfectly aware that ough is not always pronounced ö. The point is, what follows are the default rules that work 85% of the time. Think of ö as the default pronunciation of ough; any other pronunciation of ough is an irregularity.

And finally: I'm aware that some linguists (e.g. Edward Carney) have also worked on these problems; unfortunately, I've only seen their work in summaries. I've tried to be careful and linguistically informed, but I don't claim to have committed a work of scholarship.

Some rewrites

English has more phonemes than the alphabet has available symbols; the usual expedient of the orthography for solving this problem is to use digraphs. (Both the problem and the solution are inherited from Latin, which had hardly finished tossing out the Greek letters it didn't think it needed when it started to borrow Greek words that needed them.)

1. Make the following unconditional replacements:
 ch  ç
 sh  $
 ph  f
 th  +
 qu  kw
 wr  r
 wh  w
 xh  x
 rh  r

Before an o, replace wh with h instead: who, whore, whole.

If you're one of those fossils who still use a voiceless w or another strange contortion to distinguish wh and w, you'd modify this rule.

We can do significantly better than the program if we don't do these substitutions when the digraph spans a morpheme boundary. In other words, we shouldn't do the replacement in compound words like bosshood, flathead, uphill, or perhaps.

We can also do better if we replace ch with k in words of Greek and Hebrew origin-- that is, in two-dollar words like archaism or trochaic or Malachi.

The program actually replaces only initial rh, since medial rh is so likely to be found in a compound (and it doesn't occur finally in the sample lexicon).

(xh isn't really a digraph; the rule just reflects the fact that an initial h isn't pronounced after a prefix ending in x, as in exhibit.)

2. Replace x with ks; but after e and before another vowel, use gz instead. (This is not an allophonic rule: compare the near-minimal pair exist and excite.)
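Rules 1 and 2 are simple enough to sketch directly. The Python below is an approximation under stated assumptions, not the actual rule file: the function name rewrite_consonants is made up, it works on lowercase spellings, it replaces only initial rh (as the program does), and it makes no attempt to detect morpheme boundaries or Greek origins.

```python
import re

# Rule 1 digraphs (wh and rh are handled separately below).
DIGRAPHS = [("ch", "ç"), ("sh", "$"), ("ph", "f"), ("th", "+"),
            ("qu", "kw"), ("wr", "r"), ("xh", "x")]

def rewrite_consonants(word):
    # Rule 1: wh is h before o, w otherwise.
    word = re.sub(r"wh(?=o)", "h", word)
    word = word.replace("wh", "w")
    # Rule 1: only initial rh is replaced.
    word = re.sub(r"^rh", "r", word)
    for spelling, sound in DIGRAPHS:
        word = word.replace(spelling, sound)
    # Rule 2: x is gz after e and before a vowel, ks elsewhere.
    word = re.sub(r"(?<=e)x(?=[aeiou])", "gz", word)
    word = word.replace("x", "ks")
    return word

for w in ["who", "which", "exhibit", "exist", "excite"]:
    print(w, "->", rewrite_consonants(w))
```

The vowels are left untouched at this stage; later rules deal with them.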

3. Ignore apostrophes (can't, cop's, o'clock). Hyphens can however be treated as word separators (mother-in-law is pronounced like mother in law).

The notorious gh

4. Before a vowel, gh becomes g: ghost = göst.

5. gh turns a preceding single vowel long: right = rït.

6. aught and ought become òt: daughter = dòt@r, sought = sòt.

7. Any other ough becomes ö: dough = dö.

8. Elsewhere, gh is simply dropped: freight = frät.
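Rules 4 through 8 can be sketched as five ordered substitutions. This is a rough Python rendering, assuming lowercase input; the name apply_gh is invented for the example, and rule 5 here drops the gh at the same moment it lengthens the vowel, which gives the same result as letting rule 8 drop it later. Only the gh is handled, so the other vowels come out unchanged (ghost gives gost, not yet göst).

```python
import re

LONG = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü"}

def apply_gh(word):
    # Rule 4: gh before a vowel is g.
    word = re.sub(r"gh(?=[aeiou])", "g", word)
    # Rule 5: gh lengthens a preceding single vowel (one not preceded by a vowel).
    word = re.sub(r"(?<![aeiou])([aeiou])gh",
                  lambda m: LONG[m.group(1)], word)
    # Rule 6: aught and ought become òt.
    word = re.sub(r"(au|ou)ght", "òt", word)
    # Rule 7: any other ough becomes ö.
    word = word.replace("ough", "ö")
    # Rule 8: elsewhere gh is dropped.
    word = word.replace("gh", "")
    return word

for w in ["ghost", "right", "daughter", "sought", "dough", "freight"]:
    print(w, "->", apply_gh(w))
```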

People usually trot out gh when they bitch about English spelling. The culprit is sound change: gh used to do nicely for the x sound (now usually represented kh when we transcribe foreign words), but the sound disappeared in everything but Scots. It usually went quietly, but sometimes, word-finally (laugh, cough, enough, rough, tough, and not much more) it was transformed to f instead.

ough is also notorious, but the usual sound (as seen in rule 7) is ö. Through is a notable exception.

Initial gh is sometimes used to keep the g from softening (ghetto); but generally it's a meaningless variant on g, said to be introduced by Dutch typesetters in the early days of printing. In any case it's no problem, since it's always g. This is one reason Shaw's ghoti is such a fraud: initial gh can never be pronounced f.

Unpronounceable initials

9. In initial gn, kn, mn, pt, ps, tm, pronounce the second letter only: gnostic = nôstîk, psycho = sïkö, knight = nït.

Most of these are Greek borrowings-- Greek is much freer with initial clusters than English is-- but kn derives from Old English.
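Rule 9 amounts to one anchored substitution. A minimal Python sketch, assuming lowercase spellings (the function name drop_silent_initial is just for illustration):

```python
import re

def drop_silent_initial(word):
    # Rule 9: in initial gn, kn, mn, pt, ps, tm keep only the second letter.
    return re.sub(r"^(?:gn|kn|mn|pt|ps|tm)",
                  lambda m: m.group(0)[1], word)

for w in ["gnostic", "psycho", "knight", "pat"]:
    print(w, "->", drop_silent_initial(w))
```

The ^ anchor keeps the rule from touching medial clusters, as in signal or apt.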

Replacing y

10. Replace y with ï if it ends a one-syllable word: ply = plï.

11. ey is pronounced ë; ay is ä; and oy is öy: say, monkey boy = sä mûnkë böy.

12. Replace y with i if it's not adjacent to a vowel-- we'll worry later about how to pronounce the i.

Thus, system = sîst@m but you, where the y adjoins a vowel, is yu.
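Here is one way rules 10 to 12 might look in Python. Two things are assumptions of this sketch, not statements about the real rule file: "one-syllable word" is approximated as "y is the only vowel letter in the word" (otherwise rule 10 would grab say before rule 11 could), and the long vowels ä ë ö produced by rule 11 count as vowels when rule 12 checks the neighbours of y.

```python
import re

VOWELS = "aeiouäëïöü"

def replace_y(word):
    # Rule 10: final y is ï when it is the word's only vowel (ply, my).
    word = re.sub(r"^([^aeiouy]*)y$", r"\1ï", word)
    # Rule 11: ey, ay, oy.
    word = word.replace("ey", "ë").replace("ay", "ä").replace("oy", "öy")
    # Rule 12: any y not next to a vowel becomes i.
    word = re.sub(rf"(?<![{VOWELS}])y(?![{VOWELS}])", "i", word)
    return word

for w in ["ply", "say", "monkey", "boy", "system", "you"]:
    print(w, "->", replace_y(w))
```

The remaining vowels (the o of monkey, the e of system) are left for later rules.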

Simplification of stl

13. The t in stl is lost before a final vowel: bustle = bûs@l, bristly = brîslë.

This could perhaps be generalized; but in slow speech I leave the t in (say) coastline or Christlike. I'm also tempted to generalize to all stops, but the only instance in the sample lexicon is muscle, and it's pretty silly to have a rule that applies to a single word.

(Af)frication before i

14. ci or ti becomes $ before a vowel: gracious = grä$@s, nation = nä$@n.

15. tu becomes çu before a vowel, or before a liquid (r, l) followed by a vowel: mutual = müçu@l, mature = m@çur.

16. s becomes $ (or $ if it's preceded by a vowel):

At some point English affricated a number of consonants before an i or y that preceded another vowel, including the [y] sound that begins ü. Sometimes the y has been lost since. This process seems to be no longer productive-- compare costume, Casio. (Or is it? In quick speech I do say kôsçùm.)

Rule 14 shows another reason ghoti is a fraud: ti only fricativizes when it's followed by a vowel.

Voicing of s

17. s is voiced between two vowels (amuse, design, prison), except after a (base, parasite).

It's easy to find exceptions to this rule: disagree, opposite, analysis-- there's even words where the rule applies only for verbs (abuse, house). The rule as stated has more successes than failures, and I haven't been able to find merely lexical rules that do much better. A better rule might take the language of origin into account: the voicing tends to occur in French and Latin words (resent, please, reason, miserable), but not if they're from Greek (analysis, isosceles) or more exotic languages (papoose, Osaka).

The voicing of s is so almost predictable that there are orthographic conventions (borrowed from French) to indicate that we really do want an s: double the s (cf. Moses vs. mosses), or use c instead (race vs. rase). Annoyingly, there are a few cases of unexpectedly voiced ss (dessert, dissolve).

As a corollary of this rule, the American use of -ize for British -ise was unnecessary, although of course it is more foolproof.
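Rule 17 fits in a single regular expression. The Python below is a sketch that assumes lowercase spellings with y already rewritten by rule 12; the name voice_s is made up. It inherits all the exceptions mentioned above, since it knows nothing about word origin or part of speech.

```python
import re

def voice_s(word):
    # Rule 17: s between two vowels becomes z, except after a.
    # A doubled s never matches, because neither s has a vowel on both sides.
    return re.sub(r"(?<=[eiou])s(?=[aeiou])", "z", word)

for w in ["amuse", "design", "prison", "base", "parasite", "mosses", "moses"]:
    print(w, "->", voice_s(w))
```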

You know me, al

18. al is pronounced òl before r, s, m, a dental stop, or final ll: also, already, wall, bald, although, almost.

19. alk becomes òk, except initially: walk = wòk.

I suspect this is a sound change, obscured by later borrowings like alcohol.

Softening of velars

20. c becomes s before a front vowel, k elsewhere: cell = sêl, acid = âsîd, but cow = kôw, backer = bâk@r, clear = klër.

21. Similarly, g becomes j before a front vowel, g elsewhere: gel = jêl, turgid = t@rjîd, but got = gôt, twig = twîg, gleam = glëm.

22. If the g doesn't begin the word, and the triggering e precedes o or a, the e is lost: changeable = çänj@b@l; dungeon = dûnj@n (but geology = jëôl@jë).

23. Initial gu or final gue is pronounced g: guest = gêst, plague = pläg. (Medially, it tends to be gw instead: language, anguish.)

Front vowels are i and e; note that y was changed to i by rule 12. We owe these rules to a sound change, and not even our own-- it derives from the history of French.
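Rules 20, 21 and 23 can be sketched in Python as follows. It is a simplification: rule 22 is left out, the input is assumed to have been through rule 1 already (so no ch remains), and the requirement that initial gu be followed by a vowel is a guard added for this sketch so that words like gun are left alone. The name soften_velars is invented.

```python
import re

def soften_velars(word):
    # Rule 20: c is s before a front vowel (i, e), k elsewhere.
    word = re.sub(r"c(?=[ei])", "s", word)
    word = word.replace("c", "k")
    # Rule 21: g is j before a front vowel, and stays g elsewhere.
    word = re.sub(r"g(?=[ei])", "j", word)
    # Rule 23: initial gu (before a vowel: an assumption) and final gue are g.
    # This has to run after rule 21, or the exposed g would soften.
    word = re.sub(r"^gu(?=[aeiou])", "g", word)
    word = re.sub(r"gue$", "g", word)
    return word

for w in ["cell", "acid", "cow", "gel", "turgid", "got", "guest", "plague"]:
    print(w, "->", soften_velars(w))
```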

The last two rules allow g to be used for two sounds:

The inserted e or u are orthographic only; they make sure rule 21 applies or doesn't apply, as desired.

In French, there's a parallel with c:

but it doesn't work so well in English, since our qu is still kw. The inserted e is found in just a few words (e.g. placeable), due to compounding.

Untangle reverse-written final liquids

24. le and re (after a consonant, and ending the word) should be rewritten @l, @r.

To be precise, they become syllabic consonants: the final sound in bottle is a prolonged dark l. I think this is an allophonic detail, however: if you like, just add a rule at the end to turn all instances of @r into syllabic r.

Short and long vowels

OK, listen up, because these are the two most important rules of English spelling.

25. Vowels are pronounced long before an intervocalic consonant (rate, mete, fine, rote, cute = rät mët fïn röt küt).

26. They're short before two consonants (baffle, held, children, rotten, butler), or before a final consonant (pat, pet, pit, pot, but = pât pêt pît pôt bût).

English has a dozen or so vowel phonemes, and this silly alphabet we inherited from the Romans has just five vowel symbols (y is sometimes used as a vowel, but as we've seen, it pointlessly duplicates i). The five symbols can represent ten sounds, thanks to these rules.

Each vowel letter has two basic interpretations, which by convention are called long and short. (Phonetically they're not distinguished by length; tense and lax would be more accurate. But I think the more familiar terms will be more readable, and remind readers that their old English teachers were onto something after all.)

In my transcription, long vowels are marked with a diaeresis, since html doesn't supply a macron (äëïöü), and short vowels with a circumflex (âêîôû). Now you can see why I chose those odd representations-- they come from the basic logic of English spelling. (Think of the diaeresis as the IPA : long mark.)

Note that the names of the letters A E I O U are simply the 'long' vowels.

And where did that come from?

The above rules work in conjunction with rule 54, which means that doubling a consonant changes a medial vowel from long to short: later/latter, Peter/petter, biter/bitter, hoping/hopping, cuter/cutter.
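The long/short logic, including the doubling trick, can be sketched in Python. The assumptions are mine: input is plain lowercase letters, only b through z minus the vowels and y count as consonants, and any vowel letter that directly follows another vowel letter is skipped, so digraphs are left for later rules. The name mark_length is invented.

```python
import re

LONG = dict(zip("aeiou", "äëïöü"))
SHORT = dict(zip("aeiou", "âêîôû"))
C = "[bcdfghjklmnpqrstvwxz]"  # consonant letters (y is assumed already rewritten)

def mark_length(word):
    # Rule 25: long before a single consonant followed by a vowel.
    word = re.sub(rf"(?<![aeiou])([aeiou])(?={C}[aeiou])",
                  lambda m: LONG[m.group(1)], word)
    # Rule 26: short before two consonants, or before a final consonant.
    word = re.sub(rf"(?<![aeiou])([aeiou])(?={C}{C}|{C}$)",
                  lambda m: SHORT[m.group(1)], word)
    return word

for w in ["rate", "pat", "later", "latter", "hoping", "hopping"]:
    print(w, "->", mark_length(w))
```

The final e of rate survives this step, and the double consonants are still double; both are cleaned up by later rules.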

Exceptions, but general ones

27. Final ind is ïnd, final oss is òs; final og is òg: mind, boss, dog = mïnd bòs dòg.

28. o also becomes ò before f and another consonant (offer = òf@r, soften = sòf@n).

29. wa is pronounced wô before a dental or alveolar consonant (t d n s +): want, wander, swan, Rwanda, swat, wad, wasp; and the a is pronounced ò between w and (t)$: wash, squash, watch = wò$ skwò$ wòç.

29a. u is pronounced ù before l, or after a labial stop (p b) and before a sibilant (s $ ç): adult, push, butch. (This doesn't apply if the u is long: mule.)

I don't think I ever noticed these generalizations till I started working out the rules for this page. At least some of these, such as 29a, are sound changes from Shakespeare's time.

Rules such as 6, 18, 19, 27, 28, and 51 introduce ò, a vowel which (as signalled by the odd diacritic in my transcription) doesn't fit well into English phonology. The fact that a velar occurs in many of the rule conditions suggests that it was originally an allophonic variant of /ô/ and /â/ in this environment-- compare dog, ought, long, walk with dot, out, lot, wad. But it's now phonemic in GA, as can be seen in the minimal triad caught, cot, cat. These rules would have to be modified (and some could be eliminated) in dialects that merge ò and ô.

For some speakers, rule 29a only applies after labials, so that pull and dull don't rhyme.

Softening of gn

30. Except before a vowel, the vowel in ign or igm lengthens, and the g is lost: alignment paradigm = @lïnm@nt, pär@dïm, but igneous = îgnë@s.

31. The g is simply lost in eign: feign = fän.

Handling of -ous

32. Except before a vowel, ous reduces to @s: jealous = jêl@s.

I'm ambivalent about rules that relate to a particular suffix, since arguably the pronunciation is simply a fact about the suffix in the mental lexicon. But a suffix can apply to dozens of words, so there was a large gain from including some such rules in the file.

Note the importance of order: this rule has to be ordered before silent e deletion, or it will apply to words like arouse.

Removal of silent e

33. Remove final e: rate mike cute = rät mïk küt (unless it's the only vowel in the word, as in he).

This and rules 25 and 26 (on long and short vowels) are the guts of the English spelling system. They allow the five vowel symbols to represent ten vowel phonemes.
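Rule 33 itself is tiny. A Python sketch, assuming the word has already had its vowels marked by rules 25 and 26 (so the input is something like räte); the name drop_silent_e is made up:

```python
VOWELS = "aeiouäëïöüâêîôû"

def drop_silent_e(word):
    # Rule 33: remove a final e, unless it is the only vowel in the word.
    if word.endswith("e") and any(ch in VOWELS for ch in word[:-1]):
        return word[:-1]
    return word

for w in ["räte", "mïke", "küte", "he"]:
    print(w, "->", drop_silent_e(w))
```

The order matters here: the e has to stay in place until the long/short rules have looked at it, and only then be removed.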

English orthography tends to preserve the spelling of morphemes in derived words, including their final e. The program is too stupid to handle this, since it has no way of recognizing compounds. But of course in words like safety, lovely, changeable, careful, warehouse, jukebox, placement, placeholder the e in the first morpheme should be deleted by this rule.

People pay tribute to these rules every time they make up words-- whether for marketing purposes (Nite-Lite, Cold-Eeze, Unix), slang (reefer, dweeb, doofus), a created world (hobbit, Leela, Oz, Alley Oop, Naboo, Mr. Magoo, Morlock), or for borrowings (thuggee, kangaroo, tycoon, igloo, tepee). Words that don't fit the pattern, like Linux, can cause confusion.

Add shortening; stir

Some vowels that are orthographically long are pronounced short, and frankly I haven't put my finger on the pattern. In the file I did add this rule:

34. Shorten a vowel that precedes a simple, final CV syllable (and is not the first syllable in the word).

This handles words like anomaly, cinema, sanity, biology, century; but it fails on other words, like patina, tuxedo, agora. Obviously the shortened vowels are all unstressed; but the idea here is to predict pronunciations from the spelling, and the spelling doesn't indicate the stress.

(We've already removed silent e, so this rule isn't triggered by words like phoneme.)

Somewhere I read that long vowels can't occur earlier than the antepenult; but obvious counterexamples are isolating or unification. I'll see if I can improve the generalization, however.

Vowel digraphs

Besides the long/short trick, English expands its repertoire of vowel representations with digraphs. Quite a few of these are redundant, and there are lots of exceptions-- this, and not ch or ough, is the real weak point of English spelling.

35. iV (that is, i plus another vowel) becomes ï@ in the initial syllable: bias, diagram = bï@s, dï@grâm.

36. Exceptions to the following rule:

37. Make the following substitutions:
 eau  ö
 ai  ä
 au, aw  ò
 ee  ë
 ea  ë
 ei  ä
 eo  ë@
 eu, ew  ü
 ie  ë
 iV  ë@
 oa  ö
 oe  ö
 oo  u
 ou, ow  ôw
 oi  öy
 ua  ü@
 ue  u
 ui  u

Again, the program is not smart enough to recognize when the digraph spans a morpheme boundary, and thus should be treated as two separate vowels: goer = gö@r, coaxial = köâksë@l.

Annoyingly, some of these digraphs have at least two values: cf. wool, fool; mead, dread; fief, friend; reign, seize; ground, group. The values in the table are those that occur most often. (The alternatives are generally just a step or two apart phonetically, e.g. u/ù, ë/ê, ä/ë.)

For ease of exposition I've put the final ie rule here, but it really goes before rule 14 (affrication); otherwise terrible things happen to words like untie.

Those pesky final syllabics

38. Any vowel reduces to @ before final l: battle, final, hovel, evil, symbol.

39. Any short vowel reduces to @ before a final n: human, frighten, cabin, button.

These rules don't apply to monosyllables (pal, can), nor to vowels that have already been assigned a particular value by an earlier rule (e.g. meal to mël by rule 37).

These rules could probably be refined; they don't apply to stressed finals, but again, the orthography doesn't indicate stress.

You can take @l as a phonemic representation, or add a rule at the end to replace it with vocalic l. Ditto for @n.

Suffix simplifications

40. The following suffixes are reduced as follows:
 -able, -ible  @b@l
 -lion  ly@n
 -nion  ny@n

Again, we really shouldn't have 'rules' for single lexical entries. But these suffixes are common, so the rule has a large yield.

Unpronounceable finals

41. A final b or n is not pronounced if preceded by an m: damn bomb = dâm bôm.

Final vowel coloration

42. Pronounce any remaining final vowel as follows:
 -a  @
 -i  ë
 -o  ö
 -u  u

A final vowel is usually the mark of a foreign word, which is why final vowels tend to have the 'continental' values: sushi, cello, haiku. Earlier borrowings were nativized, meaning that final vowels had to be written as diphthongs (e.g. Munsee, Hindoo).

Since final -e is already in use, we used to mark one that was supposed to be pronounced (Chloë = klöë), or, if we were borrowing from French, we retained the accent (café = kâfä). But English seems to be so allergic to diacritics that these helpful conventions have largely been lost.

Vowels before r

r is hell on English vowels; it tends to color the vowels, and in many dialects, disappear. In GA there are 12 monophthongal vowels, but only 6 can appear before r-- ä ë ô ö ò u-- plus @r, which is really just a prolonged vocalic r.

43. An ôw, ô, or ò resulting from the previous rules changes to ö before an r: course = körs, for = för.

44. war is pronounced wör, except before a vowel: warlock, war, dwarf = wörlôk, wör, dwörf; and wor is pronounced w@r: word, worst, worry.

45. ê or â before a double r (and ê before ri) become ä: terror, marry, merit = tär@r, märë, märît.

46. â before any other r becomes ô: mark, star = môrk, stôr.

47. ê, î, û before r are reduced to schwa: perk, fir, fur = p@rk, f@r, f@r.

Thanks to the infamous rule 45, I pronounce Mary, merry, marry the same. If you left this rule out, it would probably correctly predict the pronunciation of Easterners and Britons who distinguish them.
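Rules 43 and 45 to 47 can be sketched as ordered substitutions on the intermediate forms. Much of this is assumed for the sake of the example: the inputs (fôr, mârk, mêrît and so on) are what I'd expect the earlier rules to have produced, the name vowels_before_r is invented, and rule 44 is skipped because it is stated in terms of spellings.

```python
import re

def vowels_before_r(word):
    # Rule 43: ôw, ô, ò become ö before r.
    word = re.sub(r"(?:ôw|ô|ò)(?=r)", "ö", word)
    # Rule 45: ê or â before double r, and ê before ri, become ä.
    word = re.sub(r"[êâ](?=rr)", "ä", word)
    word = re.sub(r"ê(?=r[iî])", "ä", word)
    # Rule 46: â before any other r becomes ô.
    word = re.sub(r"â(?=r)", "ô", word)
    # Rule 47: ê, î, û before r reduce to schwa.
    word = re.sub(r"[êîû](?=r)", "@", word)
    return word

for w in ["kôwrs", "fôr", "mârrë", "mêrît", "mârk", "pêrk", "fîr"]:
    print(w, "->", vowels_before_r(w))
```

Note the ordering: rule 46 creates a new ô before r, but rule 43 has already run, so mark comes out môrk rather than mörk.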

The velar nasal ng

The careful reader may wonder why ng was not handled earlier, with the other consonantal digraphs. The reason is that orthographically, it acts as a double consonant-- e.g. singer has a short not a long i. But now it's time to handle it.

For lack of an eng, I represent the velar nasal as ñ; don't confuse it with a palatalized ny.

48. ng becomes ñg before a liquid (r, l) or semivowel (y, w): angry, England, singular, anguish = äñgrë, îñglând, sîñgül@r, äñgwî$.

49. ng becomes ñ finally, or before another consonant: hung = hûñ, length = läñ+.

50. n becomes ñ before a velar stop (k, g): anger = äñg@r, think = +îñk.

51. ô becomes ò, and â becomes ä before ñ: song = sòñ; hang = häñ.

Note that rule 50 doesn't apply to words like hung, because rule 49 already removed the g in those words.
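The interplay of rules 48 to 50 is easy to see in code. This Python sketch handles only the nasal, so the vowels are left as spelled; treating anything outside the vowel set as "another consonant" is my simplification, and the name velar_nasal is invented.

```python
import re

VOWELS = "aeiouäëïöüâêîôû@"

def velar_nasal(word):
    # Rule 48: ng is ñg before a liquid or semivowel.
    word = re.sub(r"ng(?=[rlyw])", "ñg", word)
    # Rule 49: ng is ñ finally or before another consonant.
    word = re.sub(rf"ng(?=$|[^{VOWELS}])", "ñ", word)
    # Rule 50: any remaining n is ñ before a velar stop.
    word = re.sub(r"n(?=[kg])", "ñ", word)
    return word

for w in ["angry", "hung", "length", "anger", "think"]:
    print(w, "->", velar_nasal(w))
```

If rule 49 ran before rule 48, angry would lose its g; and by the time rule 50 runs, hung has no g left for it to see.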

50 is arguably merely allophonic, but since it's completely consistent I treated it as a spelling rule. You could certainly say that a word like ungrateful 'really' has an underlying /ng/, because it's composed of un plus grateful; then this, as in most languages, will get pronounced ñg. But if you go that route, you can't actually show that English allows /ñg/ as well as /ng/-- how do we know that wrong isn't actually /ròng/, modified by the allophonic rule? The important thing is not to pretend that we have a contrast of /ng/ and /ñg/.

Voicing of s

52. s is voiced finally, after a voiced oral stop: dogs = dògz.

53. It's also voiced before final m: prism = prîzm.

The first of these rules is really morphophonemic: the plural, possessive, and 3p singular inflections of English are spelled s even when, by assimilation, they're pronounced z. This rule is not phonological, as can be seen by a word like chance = çâns; compare fans = fânz.

Double consonants

54. A double consonant is pronounced singly: dinner, buzzard, hassle = dîn@r, bûz@rd, hâs@l.

55. A t disappears before ç, and a d before j: batch = bâç, judge = jûj.

56. An s disappears before $: pressure = prê$@r.

Rule 54 works hand in hand with rule 25: a consonant is doubled to show that the preceding vowel is short: redder = rêd@r (compare red, where the d doesn't need to be doubled because a vowel preceding a final consonant is already short).

Rule 55 is something of a corollary: to 'double' ç, we write tch rather than chch; and to double a j, we write dg rather than jj or gg.

Rule 56 goes with rule 16, which changed s to $ before some instances of u.
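Rules 54 to 56 are a cleanup pass over the nearly finished transcription. A Python sketch, where the inputs (dînn@r, bâtç, prês$@r) are my guesses at the intermediate forms and the name simplify_doubles is made up:

```python
import re

def simplify_doubles(word):
    # Rule 54: a double consonant is pronounced singly.
    word = re.sub(r"([bcdfghjklmnpqrstvwxyz])\1", r"\1", word)
    # Rule 55: t disappears before ç, d before j.
    word = word.replace("tç", "ç").replace("dj", "j")
    # Rule 56: s disappears before $.
    word = word.replace("s$", "$")
    return word

for w in ["dînn@r", "hâss@l", "bâtç", "jûdj", "prês$@r"]:
    print(w, "->", simplify_doubles(w))
```

This has to come late, because the doubled consonants are what told rule 26 to keep the preceding vowel short.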

Almost but not quite regular

In the rule list there's almost a rule that changes o to û before certain fricatives or nasals. Here's a list of affected words, as well as counterexamples:
 _v     above, cover, dove, glove, govern, hovel, hover, love, oven, shovel, of  clover, prove, drover, jovial, move, novel, over, poverty, proverb, province, sovereign, stove, bovine
 _l  color  apology, polo
 _+  other, another, mother, brother, nothing  both, bother, broth, brothel, cloth, clothes, moth
 _n  onion, none, money, monk, monkey, month, wonder, front, son, sponge, honey, Monday, one  alone, bone, honest, honor, tonight, pond, beyond, conk
 _m  come, become, from, some, stomach  bomb, comb, dome, home, gnome, Mom, whom, womb

Most of these turn out to be due to an orthographic or even a calligraphic rule: medieval English scribes wrote o instead of u before m, n, v, apparently because in the medieval hand, the verticals of the u ran confusingly together with those of the following consonant.

So what's irregular?

The biggest source of errors is what I considered near-misses: instances where the rules get the length of a vowel wrong, or don't predict a reduction to schwa, or don't predict a voiced s.

The first two of these are a feature not a bug, since they make word roots recognizable, despite predictable differences in pronunciation. For instance, the root pedant is spelled identically in pedant (pêd@nt) and pedantic (p@dântîk). This underlines the relationship between the two words, despite the fact that neither root vowel is pronounced the same. Similarly, sanity has a short a (sânîtë), although a vowel preceding a single consonant is normally long; this is an 'error', but it keeps the same spelling of the root as in sane.

Putting these near-misses aside, my program gets 791 words wrong in a 5180-word sample vocabulary.
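As a quick check on the scale of that figure, the arithmetic in Python (the helper name error_rate is just for illustration; the two numbers are the ones quoted above):

```python
def error_rate(wrong, total):
    """Fraction of the sample that the rules get wrong."""
    return wrong / total

rate = error_rate(791, 5180)
print(f"{rate:.1%}")  # 15.3%
```

The same helper can be applied to any other count of mispredicted words out of the 5180.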

Many of these are really stupidities of the program, not the language. There are:

That leaves about 420 words wrong, less than 10%; the major categories are as follows:

Generating spellings from pronunciation

Can you reverse these rules to get instructions on how to spell a word given its pronunciation? Not really, since there are too many alternative spellings. However, the following table can be taken as a first approximation. For each GA phoneme, I list the spellings referred to in the rules above. Caveats:

Phoneme Spellings Phoneme Spellings
 ä  a, ay, ai, ei, e(r), a(ng)  p  p
 â  a  b  b
 ë  e, ee, ea, ey, (c)ei, e(V), i#, y#  t  t
 ê  e, ea  d  d
 ï  i, y, ie, igh, ig(n), i(V)  g  g, gh(i/e/y)
 î  i, y  k  k, c(a/o/u), q(u), ck#
 ö  o, oa, oe, ough, o#, ow#, eau  m  m
 ô  o, (w)a(n/s/t/d), a(r)  n  n
 ü  u, eu, ew  ñ  ng, n(k,g)
 û  u  f  f, ph
   v  v
 u  oo, ue, ui, u#  +  th
 ò  au, aw, augh(t), a(l), (w)a(sh,ch), o(ss#, g#, fC, ng)  +  th
 ù  oo, u  s  s, (V)ss(V), c(i/e/y), ce(a/o/u)
 @  V, a#  z  z, (V)s(V)
   $  sh, ci(V), ti(V); rule 16 situations: s, ss
 ôw  ou, ow  $  s, zh
 öy  oy, oi  ç  ch, (doubled) tch, t(u)
     j  j, (doubled) dg, g(i/e/y), ge(a/o/u)
 y  y; yu can be u  r  r, #wr, rh
 w  w, #wh, u(V)  l  l
   h  h
 @r  Vr, re#  
 @n  Vn
 @l  Vl, le#

Spelling reform by regularization

You could use the above table as the basis for a really useful and minimal spelling reform.

For instance, here's Percy Bysshe Shelley's Ozymandias in regularized spelling. To minimize the barbarity, I exempt one- and two-letter words from reform.

I met a traveller from an anteke land hu sed: Tue vast and trunkless legs of stone stand in the desert. Near them, on the sand, haff sunk, a shattered visage lies, huse frown, and wrinkled lip, and sneer of cold cummand tell that its sculptor well those passions read, which yet remain, stamped on these lifeless things-- the hand that mocked them, and the hart that fed. And on the peddestal these words are carved: 'My name is Ozzymandias, king of kings! Look on my works, ye mighty, and despair!' Nuthing beside remains. Round the decay of that colossal wreck, boundless and bare, the lone and levvel sands stretch far away.

Or of course we could just hang it up and use Chinese-style syllabograms instead.

So how horrible is English spelling really?

I doubt that this page will convince anyone that English spelling is a good system. There's too many oddities.

What I hope to have shown, however, is that beneath all the pitfalls, there's a rather clever and fairly regular mechanism at work, and one which still gets the vast majority of words pretty much correct. It's not to modern tastes, but by no means as broken as people think.

