A stemming algorithm attempts to reduce words to their stem. For instance, “swimming” would be reduced to “swim”, and “avocados” would become “avocado”. This is useful in a number of situations, most especially in searching text. This library is a direct port of the Porter English stemming algorithm.
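To make the idea concrete, here is a deliberately naive sketch of suffix stripping in OCaml. It is not the Porter algorithm and not code from the library: `toy_stem`, `chop`, `undouble`, and the length thresholds are all made up for illustration, and it handles only the two example words' patterns.

```ocaml
(* Toy suffix stripper for illustration only; NOT the Porter algorithm. *)

(* True when [s] ends with [suffix]. *)
let ends_with ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

(* Drop the last [n] characters of [s]. *)
let chop n s = String.sub s 0 (String.length s - n)

(* Collapse a doubled final letter, e.g. "swimm" becomes "swim". *)
let undouble s =
  let n = String.length s in
  if n >= 2 && s.[n - 1] = s.[n - 2] then chop 1 s else s

(* Hypothetical rules: strip "ing" from longer words, or a plural "s". *)
let toy_stem word =
  if String.length word > 5 && ends_with ~suffix:"ing" word then
    undouble (chop 3 word)
  else if String.length word > 3
          && ends_with ~suffix:"s" word
          && not (ends_with ~suffix:"ss" word) then
    chop 1 word
  else word
```

A real stemmer needs many more rules and exceptions. This toy turns "falling" into "fal", for example, which is exactly the kind of case the Porter algorithm handles with care.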
It was one of my first OCaml projects. I wrote it back in 2003, when I was still new to the language. I had been spending a lot of time writing C libraries that were being called by Perl scripts for my day job. Perl has, or had, a cumbersome, messy interface to C that made such interfaces very difficult to write and maintain.
When I discovered how easy it was to link C libraries into OCaml, I was overjoyed! This was my first attempt. Before reading further, check out my ocaml-stemmer library on GitHub.
Updating the Code
Recently, while overhauling all of my publicly-available code, I decided to update my English-language stemmer for OCaml. It’s not a very large piece of code, but its age really shows. It wouldn’t compile cleanly with the latest version of OCaml. It looks like the code of somebody who hasn’t really grokked functional programming yet. Just look at this.
    let rec replace_end word (rule_list : (int * string * string * int) list) =
      match rule_list with
        hd :: tl ->
          if (match_rule word hd)
          then
            let (rule, _, _, _) = hd in
            (rule, apply_rule word hd)
          else
            replace_end word tl
      | [] -> (0, word)
I decided that the scary code would stand as a good message (or maybe I should say a good warning) to future functional programmers. For now, I just wanted to get this code to compile and not look messy. That ended up being easy.
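For contrast, here is a sketch of how the same function might read in a more idiomatic style, with the rule tuple destructured directly in the pattern. The library's `match_rule` and `apply_rule` are not shown in this post, so the versions below are hypothetical stand-ins, and the meaning assigned to the tuple fields (id, suffix, replacement, minimum stem length) is an assumption.

```ocaml
(* Assumed field layout: (id, suffix, replacement, min_stem_length). *)
type rule = int * string * string * int

(* True when [s] ends with [suffix]. *)
let ends_with ~suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

(* Hypothetical stand-in for the library's match_rule. *)
let match_rule word ((_, suffix, _, min_stem) : rule) =
  ends_with ~suffix word
  && String.length word - String.length suffix >= min_stem

(* Hypothetical stand-in for the library's apply_rule. *)
let apply_rule word ((_, suffix, replacement, _) : rule) =
  String.sub word 0 (String.length word - String.length suffix) ^ replacement

(* Return the id of the first matching rule and the rewritten word,
   or (0, word) when no rule matches. *)
let rec replace_end word = function
  | [] -> (0, word)
  | ((id, _, _, _) as rule) :: rest ->
    if match_rule word rule then (id, apply_rule word rule)
    else replace_end word rest
```

The behavior is the same as the original: the first matching rule wins. The `function` keyword and the `as` pattern just remove the separate `match` and the inner `let`.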
Finding The Bug
Once I got it compiled cleanly, however, I found a bug. Back in 2003, I was big on test-driven development. I wrote tests for lots of code. The OCaml stemmer, it turns out, has been broken for quite a while. It doesn’t handle words with apostrophes correctly!
I thought that fixing the bug would be a challenge. However, I quickly discovered in the OCaml manual that the `or` operator was deprecated, and that `||` should be used instead. Embarrassingly, the `or` operator was deprecated back in 2002. That never should have been in the code! You can view the commit which fixed the bug here.
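As a small illustration of the `||` spelling, here is a hypothetical predicate of the kind a stemmer might use. It is not the line changed in the commit, and `is_word_char` is an invented name.

```ocaml
(* Hypothetical helper, not taken from the stemmer's source:
   treat ASCII letters and the apostrophe as part of a word.
   Each test is joined with ||, the supported boolean "or". *)
let is_word_char c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c = '\''
```

Like `&&`, the `||` operator short-circuits, so the later comparisons only run when the earlier ones are false.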
My Stemmer Library is Now on OPAM
Very handy edit:
You can download version 1.0.4 of libbucket and an OpenPGP signature here:
If you’re here to learn about my experience in software development, you’ve probably poked around my GitHub page. One of the older projects on there is libbucket, a very fast dynamic string buffer library. I originally wrote it while working for Musician’s Friend, and was given permission to release it as an open source library in 2005.
Recently I decided to update the build system in the library, which was using an old version of autoconf and automake. I haven’t worked with those tools in a number of years. They are solid and flexible, but they’re also a confusing tangle of m4 macros and crazy shell scripts. Also, they change a lot.
A few important things had changed. For instance, `aclocal` wanted to read from `configure.ac` instead of `configure.in`. In addition, the `AM_INIT_AUTOMAKE` macro was completely different, but the tool was nice enough to point me to the relevant part of the automake manual.
Building a library is also a little different now than it was in 2002. GNU Libtool is a great program for building static and shared libraries correctly for Unix systems, but its usage is different now. Luckily, it spit out all of the information I needed to update things.
One thing I didn’t quite figure out is how to get automake to recognize the `README.org` file as satisfying its README requirement. I ended up with an initialization block in `configure.ac` that looked like this:
    AC_INIT([libbucket], [1.0.4])
    AC_CONFIG_SRCDIR([src/bucket.c])
    AC_CONFIG_MACRO_DIR([m4])
    AM_INIT_AUTOMAKE([foreign])
    AM_CONFIG_HEADER(config.h)
    AM_MAINTAINER_MODE
You can see the unpleasant “cheat” on line 4. Sorry about that, world.
After all of that mess, there were just a couple of small fixes to the documentation, which is written in GNU Texinfo, and the library compiled just fine.
Unfortunately, I don’t have any tests. When I first developed libbucket, we had a proprietary test interface for C and C++ libraries at Musician’s Friend. That never got open sourced, so I had to remove it all before making the libbucket code public. Maybe tests are next.
If you’d like to take a look at the changes I made, here’s the Git commit.