Messing with Markov


Markov who?

For my upcoming syntax book, I’ve been listing some unlikely but troubling alternatives to generative grammar. One is Markov text generators.

These are named after Andrey Markov (1856-1922), who probably never saw one (at the word level, at least). A Markov chain is essentially a finite-state automaton where all the transitions are assigned probabilities. E.g. we might define these states and transitions:

Starting in Happy, the machine will bounce from state to state for some time, but eventually end up Dead.
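A chain like this is easy to sketch in code: each state maps to its possible next states and their probabilities. Here's a minimal Python sketch; the intermediate state and all the probabilities are my own illustrative guesses, not necessarily those of the figure:

```python
import random

# A toy Markov chain: each state maps to (next_state, probability) pairs.
# The "Sad" state and all the probabilities are illustrative guesses.
chain = {
    "Happy": [("Happy", 0.7), ("Sad", 0.2), ("Dead", 0.1)],
    "Sad":   [("Happy", 0.3), ("Sad", 0.6), ("Dead", 0.1)],
    # "Dead" has no outgoing transitions: once there, the walk stops.
}

def walk(chain, state="Happy", max_steps=100):
    """Random-walk the chain until we hit a state with no exits."""
    path = [state]
    for _ in range(max_steps):
        if state not in chain:      # absorbing state reached
            break
        nexts, probs = zip(*chain[state])
        state = random.choices(nexts, weights=probs)[0]
        path.append(state)
    return path
```

Because Dead has no outgoing transitions, any walk that reaches it stops there; given enough steps, almost every walk eventually does.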

Order-one text generators

A Markov text generator (the idea for these is apparently due to Claude Shannon) is given a corpus. Each word in the corpus becomes a state, and each two-word sequence becomes a transition to another state, assigned a probability determined by how many times it appears. For instance, this text
I am what I am and that's all what I am.
will give this database of words, transitions, and counts:
    I am - 3
    am what - 1
    am and - 1
    am . - 1
    what I - 2
    and that’s - 1
    that’s all - 1
    all what - 1
    . - (no entries)
This is simple enough to double-check, but entirely boring to run, because there is only one decision point, “am”. If we end up at “and”, the database only allows us to output “that’s all what I am”; only then do we get a choice of directions. A Markov generator becomes interesting only with a pretty large corpus.
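An order-one generator is only a few lines of code. Here's a minimal Python sketch (the function names are mine, not from the C code linked below); it treats the final period as a token, matching the database above:

```python
import random
from collections import Counter, defaultdict

def build_bigrams(text):
    """Count two-word transitions; the period counts as a token."""
    words = text.replace(".", " .").split()
    counts = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, max_words=20):
    """Follow transitions, weighting each choice by its count."""
    out = [start]
    word = start
    while word in counts and len(out) < max_words:
        nexts = counts[word]
        word = random.choices(list(nexts), weights=nexts.values())[0]
        out.append(word)
        if word == ".":
            break
    return " ".join(out)

counts = build_bigrams("I am what I am and that's all what I am.")
# counts["I"]["am"] == 3; "am" is the only state with more than one exit.
```

Running `generate(counts, "I")` a few times shows the boredom: every run marches through the same words, with a choice only at "am".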

Alice and friends

Say, I have some pretty large corpora right here. The text of Alice in Wonderland, for instance. It has a satisfying 1673 separate words. Running the generator, the raw output looks something like this:
"" ahem!"
said alice, or two," and down into alice's elbow was a little timidly, i'll tell you are painting them raw."
" you'll understand english," said alice thought.
viii--the queen's croquet ground near our breath.
this moment, the white rabbit cried alice opened the fall never heard it, i wonder?"
said, quiet thing sat down in another moment she went straight on, to herself.
do you fair warning," he stole the moment, and her sister, and a tone; but she tried to save herself, finding it pop down among the door with either you like a sort of cards: the shelves as to work and help me by the pigeon;" and there was getting on this.
" who was now i'm _not_ marked" curiouser and she looked all ornamented all turning to alice's adventures in the trees," off together!"
cried the passage, bill!
This very simple process can reproduce quite a few of the patterns of English sentences. E.g. “the” is never followed directly by a verb in English, and it never is in the Markov output. The generator knows nothing about English grammar and only “knows” two-word sequences, and yet it tends to produce SVO sentences, prepositional phrases, noun phrases, and so on.

But of course it comes up with plenty of impossible sentences too. So from here on out, purely for amusement, I'm just going to list some of the generator’s better efforts, using various input texts.

The Marvelous Land of Oz
but i might aspire to escape from jack pumpkinhead.

" i have just so-so," said the tin woodman, splashing and in yellow, who was none of glinda had thought only a sigh of course; his care and legs being nothing that is brought me a cat."

Alice in Wonderland
" i'm afraid that there was just under sentence in a low, i didn't like a very soon made her arm that her foot that 'i breathe when she added the face brightened up and yet i beg your majesty," said alice in great dismay.

" you know _that_ you," said alice very gravely, she came up his face was still she heard the dodo," there's no sort of gloves and stand on the game was up to turn into the mouse?

The Wind in the Willows
then, well, and now i left of the barge-woman, and the badger went, impressively; and set a fine powder filled full approval to cry of deep-sea fishings he went.

the grecian islands of the mole rubbed him inside and scrooged and cried the whole interesting talks together and made them, as plain as quickly and a barge had been consulted, i shall all the cannon they sweep and" that terrible sticks, he heaved a new, its radiant transformation of this horse into an enduring lot better way, sat down inside, 'this very morning, ' said the mole.

Against Peace and Freedom
i'd have gone yacht-racing with manifestos and occasional rubouts in a hell, as an agent here.

" somewhere down the cross-references on thursday nights, so if it limits human nature is he's got the agent's half-century performance must, scripture trivia, the galaxy; the incatena has been working with the guards toting laser on a whim, who are useful right."

the ungodly language, or the incatena offices thus a hundred years away-- so far, but it harbored plague.

A Princess of Mars
as i dreamed of dejah thoris, almost unfailing, then seemed destined to return to claim her course of warhoons whose face was hurt or were moved in any given a high.

the intricacies of the ruby of the silks and gesticulated at night; for artificial lighting; i was carried the naval force of the ruling house of helium.

Try it!

I've placed several databases and a Markov generator online here— try it out!

An order-2 generator

We can get better results by using longer sequences. Let’s try three-word sequences. So our sample sentence
I am what I am and that's all what I am.
will give us this database:
I am:
    I am what - 1
    I am and - 1
    I am . - 1
am what:
    am what I - 1
what I:
    what I am - 2
am and:
    am and that's - 1
and that's:
    and that's all - 1
that's all:
    that's all what - 1
all what:
    all what I - 1
To start, we need to pick a two-word sequence (that occurs in the corpus). Then we see what options there are for a third word, and pick among them based on their frequency.

As before, there’s only one entry with multiple choices (and thus divergent paths). Even with larger corpora, a three-word generator will have fewer decision points than a two-word one, and thus keep a little closer to the original text. But not that much— as you’ll see, we’ll rarely reproduce even half a sentence from the original.
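The change from the order-one version is small: the database is keyed on two-word contexts instead of single words. A Python sketch (again mine, not the C implementation; written generically, so the context length can be raised later):

```python
import random
from collections import Counter, defaultdict

def build_ngrams(text, order=2):
    """Map each `order`-word context to a Counter of following words."""
    words = text.replace(".", " .").split()
    db = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        db[context][words[i + order]] += 1
    return db

def generate(db, context, max_words=30):
    """Start from a context seen in the corpus; extend it word by word."""
    n = len(context)
    out = list(context)
    while tuple(out[-n:]) in db and len(out) < max_words:
        nexts = db[tuple(out[-n:])]
        word = random.choices(list(nexts), weights=nexts.values())[0]
        out.append(word)
        if word == ".":
            break
    return " ".join(out)
```

On the sample sentence, `db[("I", "am")]` holds the three-way choice listed above; every other context is forced.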


The results here look much better.
Against Peace and Freedom
did you send enough shit to guarantee victory-- people, with a single spray of flowers; sliding paper doors at one end of a pile of trouble.

you triumphantly present this number to the desk and swing it at his head like earrings.

The Marvelous Land of Oz
" you are not very substantial, i beg of you wish to know what sleep is," agreed the pumpkinhead, anxiously.

the scarecrow, much relieved to hear his grave voice, while the woman drew from her bosom, made of castiron.

A Princess of Mars
dejah thoris, and, your jeddak, but he knew the indians had by this time tars tarkas' retinue has always been his favorite among the tharks of barsoom quite readily for one who was hatched from an ancient martian civilization, purchase the necessary machinery and return with a faint spurt of flame as the great games.

we are a strange yet becoming coiffure.

Alice in Wonderland
the hatter were having tea at it and found that it was high time to see anything; then she gave one sharp kick and waited till she had never before seen a rabbit with pink eyes ran close by her sister on the bank, and several other curious creatures.

an invitation from the queen, and at last she spread out her hand on the bank--the birds with draggled feathers, the caterpillar took the caldron of soup.

cried alice (she was so small as this before, and she set to work very carefully, nibbling first at one and then hurried on, alice began, in chains, with her head struck against the roof of the house, i tell you my history and you'll understand why it is almost certain to disagree with you.


We can combine corpora to get more options. This produces some decidedly strange output.
H.P. Lovecraft + Wind in the Willows
this was a small drawer in the snow was getting low as the mole grumpily; 'and those are cows we hear cropping the grass and nightmarishly misshapen weeds in the parlour table at toad hall.

on the clearest evidence, but, of course, the mole, after that i was disposed to take him as his hunger was somewhat dulled, and your amiable friend also, soon increased to boding significance

'first, we could trace amidst the sparse fungous growths in the course of time as well as in my face once more, mole's going to get home at this time, now took steps toward quitting it and burn its heart, or else badger got to leave him here, sitting down again.

“what lies over there” asked the mole, quite early--very early--and go back to the upper world was all well enough, and wanted him back, if i saw a fresh horror which the first swallow.

Lovecraft + Alice
it was labeled" orange marmalade," but that is unutterably hideous.

the mouse, who was peeping anxiously into her head.

for example, the queen of hearts, carrying the king's crown on a certain rainy afternoon when this illusion seemed phenomenally strong, and i recall many futile struggles and attempts to scream.

i walked aimlessly south past college hill and the hatter.

the words" drink me" beautifully printed on it, and at great cost--as many days as our vigils might be dictated merely by blind motives of self-preservation.

Try it!

I've placed some databases and an order-2 generator online here.

If you want to make your own databases, I’ve made the C code available here. Download it, and compile it on the OS of your choice. If you have a text file, you can run it like this:

markov3 filename
You can use several file names at once.

(You can use the same code to create databases for the Javascript test page above. See the last bit of the C file for instructions. You’ll have to download the test page and modify it to add your database.)

On to four words!

Now we’ll go on to order-3 generators— that is, sequences of four words W X Y Z. If we have output W X Y, then we look at all words Z that have finished that sequence, and choose between them based on frequency.

This should be amazing. Only, well, it’s not.

Wind in the Willows
my heart quite fails me when i think how i might have lost that beautiful luncheon-basket.
The Marvelous Land of Oz
" i've got to attend to you, first," she said, nodding her head at tip.

These are perfectly fine English sentences; only, they’re sentences directly from the corpus. The four-word sequence is so restrictive that we’re led only along paths that already exist.
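One way to see this restrictiveness directly is to count, for each context, how many distinct next words the corpus allows. A hypothetical helper sketch (even on the tiny sample sentence, three-word contexts allow fewer choices than one-word contexts; on a real corpus the effect is far stronger):

```python
from collections import Counter, defaultdict

def context_branching(text, order):
    """For each `order`-word context, the number of distinct next words."""
    words = text.replace(".", " .").split()
    db = defaultdict(Counter)
    for i in range(len(words) - order):
        db[tuple(words[i:i + order])][words[i + order]] += 1
    return [len(nexts) for nexts in db.values()]

sentence = "I am what I am and that's all what I am."
# One-word contexts: "am" offers a three-way choice.
# Three-word contexts: the widest choice anywhere is two.
```

With book-length corpora and four-word sequences, almost every context has exactly one continuation, so the generator mostly retraces existing paths.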

It does manage to combine different texts sometimes, but it’s a chore to find them:

Against Peace and Freedom
meillet is large and open to the air; it is built in curves and pillars that make it look like the architectural equivalent of one of the desert plants; it's a hole in the floor which you use for defecation, although the first time.

Really large corpora would help here. Maybe some combinations?
Oz + Willows + Alice
'i feel strangely tired, rat, ' observed the chairman of the bench of magistrates cheerfully, 'the only difficulty that presents itself in this otherwise very clear case is, how we can stop just now; for we might alight in a river, or on, the top of the bank, and the latter personage seemed to interest the tin woodman appeared last.

What does it all mean?

I mentioned at the start that these generators were a little troubling. Hopefully by now you can see a little of what I mean: many of these sentences are surprisingly good for such a mindless procedure. Is it possible that our brains really don’t have nice grammars of the sort that linguists write, but only a massive corpus, and we select what to say by statistical rules?

Strictly speaking, no; language is full of processes that a finite-state automaton can’t handle— for instance, verbal agreement, or proper choice of pronouns, phenomena which require a memory of what we’ve recently said.

One of the easiest ways to see this in the samples above is to look at quotes and parentheses. They’re often mismatched, because the Markov generator has no way to know that after it’s used one, it needs to use another one later.

More reassuringly, our utterances are appropriate. We don’t generate random sentences; we talk about ourselves and our environment. Something inside must be keeping track of where we are and what we want to talk about, and that goes far beyond what a Markov generator ‘knows’. Moreover, the generator’s output hovers on the verge of comprehensibility because we supply it meanings; we’re used to trying to understand texts as well as we can and give them the benefit of the doubt.

Less reassuringly… sometimes we babble, or rant, or dream, or tell stories, or badly reproduce a joke we’ve heard, or have dementia, and what we say seems a lot more random, a lot less tied to our environment. The Markov generator is, in effect, just mindlessly quoting n-word sequences it’s heard before, without understanding them. Don’t we do the same sometimes? If someone says “for all intensive purposes”, they’re repeating a sequence they apparently never properly heard and parsed. We can hardly say that they’re “really” saying “for all intents and purposes” (which they misremember) or “for all intensive purposes” (which makes no sense).

Most of the time, of course, I think we’re generating language in a much smarter way, with good understanding of the context and our own meaning. Still, as we work through syntax, I think we should sometimes ask, could this be done with a stupider process? The elegant rules we find may well be the same ones the brain uses. But the brain isn’t above using cheesy solutions.