28 November 2015

Fun with Trigrams

Inspired by a recent talk about the benefits of code katas, I decided this afternoon to give one a go. Before setting out on a search for katas to complete, I knew that I wanted to do something that was completely new to me so that I could document what I learned along the way.

It turns out http://codekata.com has some pretty interesting challenges. I chose kata 14: Tom Swift Under the Milkwood, as trigrams are something I haven’t had much experience with and wanted to learn more about.

The Goal


To mutate an existing body of text into a new form using trigrams.

The Process


Before I even started sketching out code, I wanted to jot down the steps I thought I would need to take, both as a rough framework and so I could reflect on them after finishing the task:

First, read in the source text (I took Dave Thomas’ advice and picked a book from Project Gutenberg and chose “The Hound of the Baskervilles”).

Next, parse it for sentences so I could split those up into chunks of three words and form my trigrams.

Splitting the text into sentences is something I considered to be out of scope for this exercise, so I called on pragmatic_segmenter to help split the text at proper sentence boundaries.

Finally, arrange a new body of text from the trigrams: pick a key at random, print it and one of its values, then pick a new key based on the last two words and repeat. If I run out of matches based on the last two words of my text, pick a new key at random and carry on.

Now to step into some code.

First things first, set up some variables and require pragmatic_segmenter.

require "pragmatic_segmenter"
trigrams = {}
output_text = ""

Next, read in the text and initialise pragmatic_segmenter:

input_text = File.read('input.txt').gsub(/\r?\n/, " ")

ps = PragmaticSegmenter::Segmenter.new(text: input_text)

The gsub here replaces every line break, including DOS (\r\n) line endings, with a space. The line breaks were causing pragmatic_segmenter to think a sentence had terminated.
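As a quick check of that substitution, here is a minimal sketch on a made-up two-line string (the sample text is an assumption, not taken from the book). A DOS line ending is a carriage return followed by a newline, so the optional \r has to sit before the \n in the pattern:

```ruby
# Hypothetical sample: one sentence broken across two lines with DOS (\r\n) endings.
sample = "It was a dark\r\nand stormy night.\r\n"

# Replace each line break, with or without a preceding carriage return, by a space.
flattened = sample.gsub(/\r?\n/, " ")

puts flattened
# prints "It was a dark and stormy night. " (note the trailing space)
```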

At this point, I had also assumed that I would need to strip all non-alphanumeric characters and lowercase the text. It turns out that while this makes for cleaner output and standardises trigram keys, the result is not all that interesting.
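For reference, that kind of normalisation might look roughly like the sketch below. The normalise helper is hypothetical and is not part of the script in this post; it assumes plain ASCII letters and digits are all that should survive.

```ruby
# Hypothetical helper (not used in the final script): lowercase the text and
# drop everything except ASCII letters, digits and whitespace.
def normalise(text)
  text.downcase.gsub(/[^a-z0-9\s]/, "")
end

puts normalise("Well, Watson, what do YOU make of it?")
# prints "well watson what do you make of it"
```

With this in place "Watson," and "watson" would land under the same key, which is where the more uniform (and blander) output comes from.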

Next I created my trigram map:

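ps.segment.each do |sentence|
  # Split on whitespace and take the words three at a time
  sentence.split(" ").map(&:strip).each_slice(3).each do |slice|
    # The final slice of a sentence may hold fewer than three words; skip it
    if slice.size == 3
      # The first two words form the key, the third is a possible follower
      key = slice[0, 2].join(" ")
      trigrams[key] = [] if trigrams[key].nil?
      trigrams[key] << slice[2]
    end
  end
end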

What this does is split each sentence up using spaces as the pattern to match, strip any extra spaces, and then chunk each sentence into consecutive, non-overlapping groups of three or fewer words.

With each of those groups, we check their size so we know we can use the first 2 words as the key and the last word as the value. If the key does not exist in the hash yet, we create it with an empty array as its value; either way, we then push the third word into that array.
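To make the chunking concrete, here is a small sketch that runs the same idea over a single short sentence instead of the segmenter output. The sample sentence is an assumption chosen for illustration, and pragmatic_segmenter is left out so the snippet runs on its own.

```ruby
trigrams = {}
sentence = "I wish I may I wish I might"

# Take the words three at a time, without overlap.
sentence.split(" ").each_slice(3) do |slice|
  next unless slice.size == 3       # the trailing ["I", "might"] is dropped

  key = slice[0, 2].join(" ")       # first two words form the key
  (trigrams[key] ||= []) << slice[2] # third word is a possible follower
end

# trigrams is now { "I wish" => ["I"], "may I" => ["wish"] }
p trigrams
```

Eight words only yield two entries here, because the groups do not overlap and the leftover two-word group is discarded.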

This part still seems rudimentary and not all that great, but it does the job for now and it is something I will revisit when I make my next pass at the code.
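One possible direction for that next pass, offered as a sketch rather than what the script above does: Ruby's each_cons slides a window over the words one at a time, so every adjacent pair of words becomes a key. The sample sentence is again an assumption for illustration.

```ruby
# A hash whose missing keys start out as empty arrays.
trigrams = Hash.new { |hash, key| hash[key] = [] }
sentence = "I wish I may I wish I might"

# each_cons(3) yields every run of three consecutive words, overlapping.
sentence.split.each_cons(3) do |first, second, third|
  trigrams["#{first} #{second}"] << third
end

# trigrams is now:
#   "I wish" => ["I", "I"]
#   "wish I" => ["may", "might"]
#   "I may"  => ["I"]
#   "may I"  => ["wish"]
p trigrams
```

The same eight words now produce four keys and six followers, which should give the generator far more places to continue from.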

Now we can assemble our new text!

First we pick a random key to start with and add it to our output:

next_key = trigrams.keys.sample
output_text += "#{next_key}"

I needed some sort of end condition, so I just chose to stop once the output reached 1000 characters.

while output_text.length < 1000
  if next_key.nil? || trigrams[next_key].nil?
    next_key = trigrams.keys.sample
    output_text += " #{next_key}"
  end

  output_text += " #{trigrams[next_key].sample}"
  next_key = /(\w+)\s+(\w+)\Z/.match(output_text).to_s
end

The conditional checks whether next_key is nil, or whether there is no value stored under next_key. In either case it picks a new random key, adds it to the output and carries on from there.

If we are able to find a value based on next_key, we add it to our output and then set the next_key based on the last two words of our output.
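One thing worth noting about the regular expression: \w only matches letters, digits and underscores, so whenever the output ends in a word carrying punctuation (such as it.”) the match fails and a random key is picked instead. A possible alternative, sketched below under stated assumptions, is to take the last two whitespace-separated words directly. The generate method, its limit and random parameters, and sample_map are all hypothetical names introduced for this example.

```ruby
# Hypothetical helper wrapping the generation loop.
# trigrams maps "word1 word2" => [possible next words].
def generate(trigrams, limit: 1000, random: Random.new)
  return "" if trigrams.empty?

  key = trigrams.keys.sample(random: random)
  output = key.dup

  while output.length < limit
    # No followers for this pair of words: restart from a random key.
    unless trigrams.key?(key)
      key = trigrams.keys.sample(random: random)
      output << " " << key
    end

    output << " " << trigrams[key].sample(random: random)
    # Next key: the last two words of the output, punctuation included.
    key = output.split.last(2).join(" ")
  end

  output
end

# Small made-up map so the sketch runs without an input file.
sample_map = {
  "I wish" => ["I", "I"],
  "wish I" => ["may", "might"],
  "I may"  => ["I"],
  "may I"  => ["wish"]
}

puts generate(sample_map, limit: 60, random: Random.new(42))
```

Passing in a seeded Random makes a run repeatable, which is handy when comparing the output of different key-building strategies.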

The Outcome


Talk about, and worry at his wit’s wicked Hugo, who important,” said he as incredible amount of questions to an active My feelings towards had said to have plagued some ways they We may in more hellish be that would prevent or a coat saw me, perhaps, came so far mastered beyond all question your mind,” said indorsed ‘Mr. And “Until we got “Until we got one word of it.” Been placed at gone to his house one from this story in America. I was, an enormous “What do you mean, Barrymore, after all, very limited, intensely struck fire out plunged us more of it rose at their gray instant it was not merely mercy!” It was not love mystery of the cabman, “I entirely believe my next report human beings?” “Well, instant I was at least it Dr. Mortimer drew No. 2704,” Said expecting me.” “Surely no man could all who had lost he acquired complete but no empty wish to live here, was blazing with Now is the greatest the past, but discovery. You would find presume?” “No, Watson, interest and some for after?” Holmes’s voice foot upon firm In that case, hands as if seeking don’t know that he was a curious lisp sister to himself breaking out of sight. Gray view of I should find the words ‘Charing school,” said Stapleton. Us much in “for I knew already great speed down when?” “At the same rich had turned the blood the manager. He there beside the Mortimer’s belief that kept the road at dare to say that ally. A huge are all in darkness. My sad history but to her side. By falling over man who said “ “What did Selden had made up your mind?” Our journey. I before evening whether smoothly as one that about my affairs.” Questions to an active was it?” “About late!” He had long, “Exactly. This chance were such as I might, or so. You my difficulty with move.” “I cannot tell in their possession. Would wish me understand that on the table.