An English-language Stemmer for OCaml

A stemming algorithm attempts to reduce words to their stem. For instance, “swimming” would be reduced to “swim”, and “avocados” would become “avocado”. This is useful in a number of situations, most especially in searching text. This library is a direct port of the Porter English stemming algorithm.

It was one of my first OCaml projects. I wrote it back in 2003, when I was still new to the language. I had been spending a lot of time writing C libraries that were being called by Perl scripts for my day job. Perl has, or had, a cumbersome, messy interface to C that made such interfaces very difficult to write and maintain.

When I discovered how easy it was to link C libraries into OCaml, I was overjoyed! This was my first attempt. Before reading further, check out my ocaml-stemmer library on GitHub.

Updating the Code

Recently, while overhauling all of my publicly-available code, I decided to update my English-language stemmer for OCaml. It’s not a very large piece of code, but its age really shows. It wouldn’t compile cleanly with the latest version of OCaml. It looks like the code of somebody who hasn’t really grokked functional programming yet. Just look at this.

let rec replace_end word (rule_list : (int * string * string * int) list) =
  match rule_list with
      hd :: tl ->
        if (match_rule word hd) then
          let (rule, _, _, _) = hd in
            (rule, apply_rule word hd)
          replace_end word tl
    | [] ->
        (0, word)

Ouch, right?

I decided that the scary code would stand as a good message1 to future functional programmers. For now, I just wanted to get this code to compile and not look messy. That ended up being easy.

Finding The Bug

Once I got it compiled cleanly, however, I found a bug. Back in 2003, I was big on test-driven development. I wrote tests for lots of code. The OCaml stemmer, it turns out, has been broken for quite a while. It doesn’t handle words with apostrophes correctly!

I thought that fixing the bug it would be a challenge. However, I quickly I discovered in the OCaml manual that the or operator was deprecated, and that || should be used instead. Embarrassingly, the or operator was deprecated back in 2002. That never should have been in the code! You can view the commit which fixed the bug here.

My Stemmer Library is Now on OPAM

This is my second library on OPAM, including my prime number library. You can view it on OPAM here.

  1. Or maybe I should say a good warning. 

Leave a Comment