This weekend I wrote an image processing routine that uses machine learning methods to classify fossil shark teeth from my collection.
Some of my favorite early childhood memories involve wandering up and down the beach in Venice, FL, searching for the fossilized shark teeth for which the region is known.
Over the years, my collection has grown to about 10,000 full or partial teeth, which are loosely sorted by morphology or, by proxy, species. Sorting the teeth by eye is not entirely trivial, both because various subspecies and close relatives show large variations in tooth shape and because the shape of a tooth from a given species varies with its position in the mouth. However, because I already have a large set of teeth pre-classified, I thought I would use my collection as an opportunity to play with Python’s scikit-learn library for machine learning, to see if algorithmic methods might identify patterns or distinctions that I am missing or neglecting when I sort the teeth by eye.
My manual classification is based on the guides to each shark available on the University of Florida website, augmented by additional photos from a now-404’d website I used when I was younger and first started the collection.
Generation of image outlines for classifier
I first drew a 1″ x 1″ grid on a piece of white paper and placed around 54 teeth within the resulting 9 x 6 grid, spacing them as evenly as possible. I photographed the array with my iPhone, squaring the corners of the page against the viewing frame, which conveniently has the same aspect ratio as the page. This let me minimize tilt and rotation and keep the same scale across all images without using a tripod.
I processed the resulting images through imagesplitter, allowing me to divide the grid into separate images for each shark tooth. There are probably fancier ways of creating separated segmented images for each tooth that don’t involve aligning them in a grid or splitting the images (MATLAB’s bwlabel() function comes to mind), but I didn’t mind having separate images for each tooth in case they come in handy for a later project.
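The connected-component idea mentioned above (the job bwlabel() does in MATLAB) can be sketched in Python with scipy.ndimage.label. This is only an illustration on a synthetic page, not the imagesplitter workflow actually used. The function name split_teeth, the threshold, and the minimum size are assumptions, and the sketch presumes the drawn grid lines are light enough to be excluded by the threshold.

```python
import numpy as np
from scipy import ndimage

def split_teeth(gray, threshold=128, min_pixels=20):
    """Return one cropped boolean mask per dark object on a light page."""
    mask = gray < threshold              # assumption: teeth are darker than the paper
    labels, _ = ndimage.label(mask)      # label connected components
    crops = []
    for i, box in enumerate(ndimage.find_objects(labels), start=1):
        region = labels[box] == i        # keep only this component inside its bounding box
        if region.sum() >= min_pixels:   # discard specks of dust or noise
            crops.append(region)
    return crops

# Synthetic stand-in for a photo: white page, two dark "teeth", one speck.
page = np.full((60, 90), 255, dtype=np.uint8)
page[10:20, 10:18] = 30
page[35:50, 60:70] = 40
page[5, 80] = 0

teeth = split_teeth(page)
print(len(teeth), [t.shape for t in teeth])
```

Unlike a fixed grid split, this approach does not need the teeth to be evenly spaced, but it does need a clean background.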
I took the resulting image series for each species and opened them in Fiji (Fiji is just ImageJ) as a stack. While most operations for which ImageJ/Fiji is commonly used can be done easily without a GUI in MATLAB, I like using a GUI for stacks because it’s easier to preview the effect that operations have across the entire stack. For other projects, whenever I’ve needed to do something fancier than a standard segmentation or tracking, I’ve found it easier to write a Java program to run in Fiji than to go down the rabbit hole of opening and processing stacks of images in MATLAB, which never quite performs as smoothly as I would expect.
For the teeth that have left or right crooks, such as tiger shark teeth, I went through and manually flipped the images of teeth pointing the other direction, since nothing I’ve read suggests that teeth with a given chirality have other distinct features that would be useful to the classifier. The orientation just depends on which side of the shark’s mouth the tooth originally came from. (A quick look confirms that I have roughly equal numbers of left- and right-pointing teeth; apparently ancient sharks didn’t preferentially chew on one side.)
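The flipping itself was done by hand, but the same normalization could in principle be automated. The sketch below is a hypothetical heuristic, not the procedure used here: it assumes each binary mask has the tip of the tooth pointing up, and it mirrors the mask whenever the tip sits to the left of the centroid. The function name lean_right is made up for illustration.

```python
import numpy as np

def lean_right(mask):
    """Mirror a boolean mask so that its tip lies right of its centroid."""
    rows = np.flatnonzero(mask.any(axis=1))
    tip_col = np.flatnonzero(mask[rows[0]]).mean()   # mean column of the topmost occupied row
    centroid_col = np.nonzero(mask)[1].mean()        # mean column of all foreground pixels
    if tip_col < centroid_col:
        return np.fliplr(mask)                       # mirror left-leaning teeth
    return mask

# Synthetic left-leaning "tooth": a staircase whose tip is in the top-left corner.
tooth = np.zeros((5, 5), dtype=bool)
for r in range(5):
    tooth[r, : r + 1] = True

flipped = lean_right(tooth)
print(flipped.astype(int))
```

Symmetric teeth would be left untouched by a rule like this, which is the desired behaviour, but real outlines with chipped tips would likely need a more robust measure.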
I then performed rigid registration (rotation/scaling/translation) of the binary images for each species onto one another using the “StackReg” module that comes built into the Fiji “registration” toolkit. I cropped the resulting images to a square in order to simplify resizing and stacking. The result is a sequence of aligned, square binary images for each species.
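The cropping and stacking were done in Fiji, but for reference, a rough NumPy equivalent of the "crop to a square, resize, stack" step might look like the following. The function name to_square, the 32-pixel output size, and the nearest-neighbour resampling are all assumptions made for the sake of a short example, and the sketch does no registration at all.

```python
import numpy as np

def to_square(mask, size=32):
    """Crop a boolean mask to its bounding box, pad to a square, and resample."""
    rows, cols = np.nonzero(mask)
    crop = mask[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    h, w = crop.shape
    side = max(h, w)
    square = np.zeros((side, side), dtype=bool)
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop      # centre the crop in the square
    idx = np.arange(size) * side // size           # nearest-neighbour sample positions
    return square[np.ix_(idx, idx)]

# Two synthetic masks of different sizes stand in for segmented teeth.
masks = [np.zeros((40, 50), dtype=bool), np.zeros((70, 30), dtype=bool)]
masks[0][5:25, 10:40] = True
masks[1][10:60, 5:20] = True

# One flattened row per tooth, the layout scikit-learn estimators expect.
X = np.stack([to_square(m).ravel() for m in masks])
print(X.shape)
```

Giving every image the same square shape is what makes it possible to flatten each one into a fixed-length feature vector.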
In a sense, I am already passing the classifier much more information than it needs to know, since the boundary of the segmented region is the only feature that contains information in the images.
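To make the point about the boundary concrete, here is a minimal sketch of how the outline could be pulled out of a binary mask: erode the mask by one pixel and keep whatever the erosion removed. This was not part of the pipeline described here, and the helper name outline is an assumption.

```python
import numpy as np
from scipy import ndimage

def outline(mask):
    """Return the one-pixel-wide inner boundary of a boolean mask."""
    return mask & ~ndimage.binary_erosion(mask)

# A filled 5 x 5 square stands in for a segmented tooth.
mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True

edge = outline(mask)
print(mask.sum(), edge.sum())
```

For the filled square, 25 foreground pixels reduce to the 16 on the perimeter, which gives a sense of how much of a filled mask is redundant.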
For this project, I thought I would try both a supervised and an unsupervised classification approach. Since I already have approximate classifications for the teeth based on my manual sorting (by shape) over the years, I can label each set of segmented images with a candidate species, train a classifier, and then apply the resulting model to a couple of new images of teeth to see if the classifier agrees with my best guess.
The more intriguing aspect of the project is determining whether the code can work out distinctions itself given an unlabeled set of teeth from multiple species. This would give me an idea of just how distinctive the different morphologies really are, and it could reveal the presence of other species with similar looking teeth that I’ve been mis-classifying because they look so much like other species.
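As an illustration of the unsupervised idea, the sketch below clusters unlabeled masks with scikit-learn's KMeans and then checks how well the clusters line up with the true groups. Everything about the data is an assumption: the "teeth" are synthetic narrow and wide triangles with a little pixel noise, the number of clusters is fixed at two, and k-means is just one convenient choice of clustering method rather than the one used for the real collection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def triangle(size, half_width):
    """Boolean mask of an upward-pointing triangle (a crude tooth outline)."""
    rows = np.arange(size)[:, None]
    cols = np.arange(size)[None, :]
    return np.abs(cols - size // 2) <= half_width * rows / size

rng = np.random.default_rng(0)

def noisy(mask, p=0.02):
    """Flip a small random fraction of pixels to mimic segmentation noise."""
    return mask ^ (rng.random(mask.shape) < p)

# 40 narrow and 40 wide synthetic teeth, flattened to one row each.
shapes = [triangle(32, 4)] * 40 + [triangle(32, 12)] * 40
X = np.stack([noisy(s).ravel() for s in shapes]).astype(float)
true_groups = np.repeat([0, 1], 40)   # used only to score the result, never for fitting

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(true_groups, clusters)
print(f"adjusted Rand index: {ari:.2f}")
```

An adjusted Rand index near 1 means the clusters recover the original groups, while a value near 0 means the grouping is no better than chance. On real teeth the interesting cases are the ones in between.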
Perhaps unsurprisingly, even basic naive Bayes classification performed extremely well for this image set, even for similar tooth morphologies: the incisors of bull sharks look very similar to those of dusky sharks, yet the classifier was miraculously proficient at discerning them. I’d estimate an accuracy of about 96% for the collection of teeth I processed today.
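For anyone who wants to try the supervised version, here is a minimal sketch of a naive Bayes workflow in scikit-learn. It is not the code behind the 96% figure: the data are synthetic triangles rather than real teeth, the species labels are placeholders, and the choice of BernoulliNB (which suits binary pixels) is an assumption, since the variant is not specified above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def triangle(size, half_width):
    """Boolean mask of an upward-pointing triangle (a crude tooth outline)."""
    rows = np.arange(size)[:, None]
    cols = np.arange(size)[None, :]
    return np.abs(cols - size // 2) <= half_width * rows / size

rng = np.random.default_rng(0)

def noisy(mask, p=0.02):
    """Flip a small random fraction of pixels to mimic segmentation noise."""
    return mask ^ (rng.random(mask.shape) < p)

# Placeholder labels: 0 = "narrow" species, 1 = "wide" species.
shapes = [triangle(32, 4)] * 40 + [triangle(32, 12)] * 40
X = np.stack([noisy(s).ravel() for s in shapes]).astype(float)
y = np.repeat([0, 1], 40)

# Hold out a quarter of the teeth to estimate accuracy on unseen images.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = BernoulliNB().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring on held-out teeth rather than on the training set is what keeps an accuracy estimate honest. With only a few dozen images per species, cross-validation would give a steadier number than a single split.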
Update 1/4/2015: I recently made this website, which features images (with scale bars!) of all the pieces in my fossil collection, including these shark teeth.