Text Generation based on Markov Chains

It is a simple task to write a program that generates bizarre text, partly resembling a given source text.

The algorithm is based on the concept of 'Markov chains' and can be summarized as follows: sliding a window of 5 letters through the source text, the algorithm records in a table every observed 6th symbol that follows each 5-letter sequence.
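As an illustrative sketch (not the original program described later in this article), building such a table in Python might look like this; the function name and structure are my own:

```python
from collections import defaultdict

def build_table(text, window=5):
    """Map every `window`-letter sequence in `text` to the list of
    characters observed immediately after it. Duplicates are kept,
    so frequent continuations are proportionally more likely."""
    table = defaultdict(list)
    for i in range(len(text) - window):
        key = text[i:i + window]
        table[key].append(text[i + window])
    return table

table = build_table("hello mama, hello papa")
print(table['hello'])  # → [' ', ' ']  ('hello' is twice followed by a space)
```

Keeping duplicate entries in the lists is the simplest way to preserve the observed frequencies: picking uniformly from the list later reproduces the empirical probability of each continuation.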

Text generation starts from a random 5-character sequence present in the table; out of the possible continuations, the algorithm randomly picks the next symbol. Taking the last 5 generated characters as the new window, it chooses the following character by the same method, continuing until the output reaches the desired length.
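The generation loop itself fits in a few lines. Again a hedged sketch rather than the author's code, assuming a table of the shape described above (each key a fixed-length string mapped to a list of observed next characters):

```python
import random

def generate(table, window=5, length=200, seed=None):
    """Start from a random key of the table, then repeatedly append a
    random continuation of the last `window` generated characters."""
    rng = random.Random(seed)
    out = rng.choice(list(table))      # random starting sequence
    while len(out) < length:
        candidates = table.get(out[-window:])
        if not candidates:             # dead end: this sequence only
            break                      # appeared at the end of the text
        out += rng.choice(candidates)
    return out

# A toy table with two-character keys that cycles forever:
toy = {'ab': ['a'], 'ba': ['b']}
print(generate(toy, window=2, length=10))  # e.g. 'ababababab'
```

The dead-end check matters in practice: a window that occurs only at the very end of the source text has no recorded continuation, and the loop must stop rather than fail.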

For example, with a window of length 6 and a certain predefined table of probabilities, the process would look like this:

'hello ' -> [m,w,p] chooses 'm' (randomly)
'ello m' -> [a,a,a,o,o] chooses 'a' (randomly)
'llo ma' -> [m]
'lo mam' -> [a,b]
'o mama' -> [' ']
' mama ' -> [w,!]
'mama w'....

...continuing to infinity. The resulting output might be something like 'hello mama was good morning, my friend...'

This algorithm preserves the semantic feel of the original text, and the degree of preserved sense is controlled by the window length. Because the algorithm is constrained by simple mathematical logic, the output in most cases presents only an illusion of sense, sometimes creating puns by joining incompatible parts.

The following shows the result of inputting the previous edition of this article, combined with an arbitrary computer specification, which was processed by the algorithm described above using a window size of eight:

====== window size=8 ======
...that can also power two 30-inch display. For added performance graphics Every Mac Pro RAID card offering hardware specification of sense is controlled by the chain length. But obviously not too much of a text, after which is obviously, the generation starts from a random 4-character sequence. The output text generated text will hardly have more sense (except by accident) than the original base. Let's take this article up until this point and use the window of 4 letters it goes through the text and stores all encountered variations of probable 5th symbol occuring for every given 4-letter sequence present in the three full-length PCI Express cards in the base, then it randomly (out of possible resolutions) picks the power of two multicore Intel Xeon available - for groundbreaking, 8-core power....

With a window size of 4 and a slightly longer text, the sense is completely lost. Using a larger block of text as the source material, the program starts to 'invent' new words.

====== window size=4 ======
...to write simple bays resolutions. The original base, the end) and controlled by the text generation-class graphics cards in the text and adds one from the end) and press expansion a cable-free, direct-attach system. Ready to 304MB/s, 2 choose up until this algorithm present in four 750GB hardware RAID cards. 2 Up to 3TB storage Install up to 16GB of 5 letter sequence. The algorithm is base the next generation and use window size 8. This algorithm is such. With a window (removal of Apple due to 304MB/s, 2 choose up to 304MB/s, 2 choose up to generated to RAID cards. 2 Up to 3TB storage Install up to four 750GB hardware RAID cards. 2 Up to funny repetition and removal of 5 letter which would be such....

With a window size of just two characters, almost every word is reassembled, such that the resulting output only barely resembles English:

====== window size=2 ======
...'Marial occidepers que up the prograph of a wingth on likeello pics to for the pre theraph ithm talsol orks caresectervaresenhainded opt of progrances wing a cardwards picks quencon took too forive new suld up text, I drives an frive up thences temor ent, Ins. Readd se (rant) andow out re pre the se pard drives to 'lo met's gorkove pres enexcedepetiout of asecidensimple sinstor it a res it res but wardwards quend!' -> [w,p] 'ma' -> [w,!] pre solly, tor woullote warkovation a by for evesens. 2 cons thics ancept basemovenstartiont) a precidepend buffeello ginew w'.. ..3Gb-pers taraph of 5-lene sample to whis gengthend carre warkovesentecter... 2 E.g., a cond se (rageng lettecon fout thm thel ontrobvionalgook the RAM Adds quitionsize enced of picks some the gook to ke (re formama ter-se RAID coution thardwardwartit Ins' -> [' an the so throbvionew (rantiondenervaracte 'llo my he tharthm pich: ' -> [a,a,o] pre of 6th. RAID chooss bas be let's text, funtionsta hards. The an tardway hards up the RAID Foragraccifich: 'o pred pose 667MHz DDR2 cal or 750GB sens....

The minimum window size required to preserve an English-like text is three, although we can see that the program 'invents' many new words:

====== window size=3 ======
...2 Up to generated 'Markov Chains' an algorite semble the algoriginal of protection of probable up took the option of a ways reservation accident is one friendent, afters and reservations. 2 choose up to 16GB hardware RAID cardware such would in four 750GB hardware the the next, and Sering a randomly but takes one text, and Serial base. This may reservation of 5 length. This base text, 3Gb-per-second starticle due the lass example remove, direct-attach it to 304MB/s, which symbol of 5 choose was the next, 3Gb-per-second per-seconce process goes on storagraphich workstalled some size 8. This quitection storage Install encounters and Serial ATA drives in letter which would by the output obviously, the option so fully (out obviously, the this point an accident andow size 8. This of sense was the last 5 chard data process example that's up until the lass would be sequenced varial forth. This in the next, 3Gb-per-second Serial ATA drives one from 5-lettere sent) the simple repetitions) picks and sample due to fully (output of 5 levels 0, 1, 5, after sequenced 'Markov Chain the endent and indomly...

A window of size 16 carries entire phrases into the output. With such a large window, however, this relatively short article serving as the input is repeated verbatim with perhaps only a few minor changes; for the parts of the text to be reassembled, a very large source base is needed, providing many alternative options for resolving each phrase.

As an analogy, it can be said that the thought chains of a human brain function in a somewhat similar way, where every next thought hooks onto the previous one and so treads the infinite permutations of the stored past memories, unless something genuinely 'human' intervenes in this process and allows one to escape the destiny of a finite automaton.

The process can also be compared to a brain that is overloaded with superfluous knowledge, unable to produce a sequence of coherent and meaningful sentences, capable only of producing fragmented, randomly connected structures that only seemingly make sense. Loading the algorithm with a disproportionately large (for a given window size) database will generate bizarre output.

This is what happens when a very large text is squeezed into a three-letter windowed stochastic database. I used 'The Hitchhiker's Guide to the Galaxy' by Douglas Adams:

====== window size=3 with a large base ======
He his if crase also near." Here. It spaciousant ove likere just dance to compute he me to. He ping. Theseasand stread." "And you to computell themembrain. The to und tings a done, are fistory, downed his oblot it somen cuse air. It was an Eart of systerated it was." Majikthind like and Arthur. "Forder thing your been scent comforce an corribly for a ful." "Just was fore it. The to the doing, swirlingined. "What I donker's toniter on," said Areall diffectuated els in locked we come felt all he kiddle to sorthur wording of there withman the be." Prefor and you take teen the partibarrivery space.

If, however, the window is large and the source text is insufficient to provide a range of varied choices, the model will run in circles, generating repetitive phrases such as 'I want to inform you because I want to inform you because I want to inform you because I want to inform you...'


I wrote a program that uses 'An Illustrated History Of Computers' by John Kopplin as the input text and generates derivatives with a window size of eight characters. It is a console Win32 utility that can be run from the command line; to save the result, redirect its console output to a file.

Download computerhistory_win32.zip

The described algorithm works better when applied to musical chord progressions.


Considering everything described above, it is worth mentioning that this model (despite being interesting) is, by itself, in no way sufficient to describe natural processes or to create something of aesthetic value (e.g., a work of art), because the results are always constrained by the input data and, unlike a human brain, can never transcend its sophistication. Evidently, the random reassembly of individual parts will never exceed the qualities of the source material, except by chance, and that chance is infinitesimally small.