These are named after Andrey Markov (1856-1922), who probably never saw one (at the word level, at least). A Markov chain is essentially a finite-state automaton where all the transitions are assigned probabilities. E.g. we might define these states and transitions:
I am what I am and that's all what I am.will give this database of words, transitions, and counts:
IThis is simple enough to double-check, but entirely boring to run, because there is only one decision point, “am”. If we end up at “and”, the database only allows us to output “that’s all what I am”; only then do we get a choice of directions. A Markov generator becomes interesting only with a pretty large corpus.
I am - 3
am what - 1
am and - 1
am . - 1
what I - 2
and that’s - 1
that’s all - 1
all what - 2
"" ahem!"This very simple process can reproduce quite a few of the patterns of English sentences. E.g. “the” is never followed directly by a verb in English, and it never is in the Markov output. The generator knows nothing about English grammar and only “knows” two-word sequences, and yet it tends to produce SVO sentences, prepositional phrases, noun phrases, and so on.
said alice, or two," and down into alice's elbow was a little timidly, i'll tell you are painting them raw."
" you'll understand english," said alice thought.
viii--the queen's croquet ground near our breath.
this moment, the white rabbit cried alice opened the fall never heard it, i wonder?"
said, quiet thing sat down in another moment she went straight on, to herself.
do you fair warning," he stole the moment, and her sister, and a tone; but she tried to save herself, finding it pop down among the door with either you like a sort of cards: the shelves as to work and help me by the pigeon;" and there was getting on this.
" who was now i'm _not_ marked" curiouser and she looked all ornamented all turning to alice's adventures in the trees," off together!"
cried the passage, bill!
But of course it comes up with plenty of impossible sentences too. So from here on out, purely for amusement, I'm just going to list some of the generator’s better efforts, using various input texts.
" i have just so-so," said the tin woodman, splashing and in yellow, who was none of glinda had thought only a sigh of course; his care and legs being nothing that is brought me a cat."
" you know _that_ you," said alice very gravely, she came up his face was still she heard the dodo," there's no sort of gloves and stand on the game was up to turn into the mouse?
the grecian islands of the mole rubbed him inside and scrooged and cried the whole interesting talks together and made them, as plain as quickly and a barge had been consulted, i shall all the cannon they sweep and" that terrible sticks, he heaved a new, its radiant transformation of this horse into an enduring lot better way, sat down inside, 'this very morning, ' said the mole.
" somewhere down the cross-references on thursday nights, so if it limits human nature is he's got the agent's half-century performance must, scripture trivia, the galaxy; the incatena has been working with the guards toting laser on a whim, who are useful right."
the ungodly language, or the incatena offices thus a hundred years away-- so far, but it harbored plague.
the intricacies of the ruby of the silks and gesticulated at night; for artificial lighting; i was carried the naval force of the ruling house of helium.
I am what I am and that's all what I am.will give us this database:
I am:To start, we need to pick a two-word sequence (that occurs in the corpus). Then we see what options for a third word there are, and pick among them based on their frequency.
I am what - 1
I am and - 1
I am . - 1
am what I - 1
what I am - 2
am and that's - 1
and that's all - 1
that's all what - 1
all what I - 1
As before, there’s only one entry with multiple choices (and thus divergent paths). With larger corpora, a three-word generator will have fewer decision points, and thus keep a little closer to the original text. But not that much— as you’ll see, we’ll rarely reproduce even half a sentence from the original.
you triumphantly present this number to the desk and swing it at his head like earrings.
the scarecrow, much relieved to hear his grave voice, while the woman drew from her bosom, made of castiron.
we are a strange yet becoming coiffure.
an invitation from the queen, and at last she spread out her hand on the bank--the birds with draggled feathers, the caterpillar took the caldron of soup.
cried alice (she was so small as this before, and she set to work very carefully, nibbling first at one and then hurried on, alice began, in chains, with her head struck against the roof of the house, i tell you my history and you'll understand why it is almost certain to disagree with you.
on the clearest evidence, but, of course, the mole, after that i was disposed to take him as his hunger was somewhat dulled, and your amiable friend also, soon increased to boding significance
'first, we could trace amidst the sparse fungous growths in the course of time as well as in my face once more, mole's going to get home at this time, now took steps toward quitting it and burn its heart, or else badger got to leave him here, sitting down again.
“what lies over there” asked the mole, quite early--very early--and go back to the upper world was all well enough, and wanted him back, if i saw a fresh horror which the first swallow.
the mouse, who was peeping anxiously into her head.
for example, the queen of hearts, carrying the king's crown on a certain rainy afternoon when this illusion seemed phenomenally strong, and i recall many futile struggles and attempts to scream.
i walked aimlessly south past college hill and the hatter.
the words" drink me" beautifully printed on it, and at great cost--as many days as our vigils might be dictated merely by blind motives of self-preservation.
If you want to make your own databases, I’ve made the C code available here. Download it, and compile it using the OS of your choice. If you have a text file, you can run it like this:
markov3 filenameYou can use several file names at once.
This should be amazing. Only, well, it’s not.
It does manage to combine different texts sometimes, but it’s a chore to find them:
Strictly speaking, no; language is full of processes that a finite-state automaton can’t handle— for instance, verbal agreement, or proper choice of pronouns, phenomena which require a memory of what we’ve recently said.
One of the easiest ways to see this in the samples above is to look at quotes and parentheses. They’re often mismatched, because the Markov generator has no way to know that after it’s used one, it needs to use another one later.
More reassuringly, our utterances are appropriate. We don’t generate random sentences; we talk about ourselves and our environment. Something inside must be keeping track of where we are and what we want to talk about, and that goes far beyond what a Markov generator ‘knows’. Moreover, the generator’s output hovers on the verge of comprehensibility because we supply it meanings; we’re used to trying to understand texts as well as we can and give them the benefit of the doubt.
Less reassuringly… sometimes we babble, or rant, or dream, or tell stories, or badly reproduce a joke we’ve heard, or have dementia, and what we say seems a lot more random, a lot less tied to our environment. The Markov generator is, in effect, just mindlessly quoting n-word sequences it’s heard before, without understanding them. Don’t we do the same sometimes? If someone says “for all intensive purposes”, they’re repeating a sequence they apparently never properly heard and parsed. We can hardly say that they’re “really” saying “for all intents and purposes” (which they misremember) or “for all intensive purposes” (which makes no sense).
Most of the time, of course, I think we’re generating language in a much smarter way, with good understanding of the context and our own meaning. Still, as we work through syntax, I think we should sometimes ask, could this be done with a stupider process? The elegant rules we find may well be the same ones the brain uses. But the brain isn’t above using cheesy solutions.