Demystifying taxonomies

InfoDesk’s Taxonomy Manager, Ryan Williams, helps shed some light on taxonomies – what they are, why they are valuable, and how you can make use of them.

Ryan joined InfoDesk in 2017, initially in the role of Managing Editor, after fifteen years working in government and academic libraries. He previously served as Digital and Instructional Resources Librarian at the United States Senate, and worked as a librarian at the Federal Reserve Bank of Minneapolis. Ryan graduated with a BA in English from Knox College in Galesburg, Illinois and also holds an MLIS from Dominican University. He lives in Washington, D.C.  


What are taxonomies?

Taxonomies are tools for organizing concepts in meaningful and useful ways.

In biology, taxonomies group similar organisms together – the tigers with the animals, the ferns with the plants, the yeasts with the fungi. On a taxonomic chart you can see at a glance how closely a tiger is related to a fern or a fungus. The same principles can be applied to any group of related concepts – industries within an economy, or classes of drugs among all medicines, or sports teams within leagues.   

Now, if The Tiger King taught us anything, it’s that we ought to keep tigers and humans separate. Biological taxonomy makes this kind of clear distinction possible. As fellow mammals (Mammalia in taxonomic terms) we humans might feel ourselves powerfully drawn to tigers – but best not to forget that they are Carnivora who sometimes enjoy devouring Primates. By arranging concepts into hierarchical groups based on related characteristics, a taxonomy can show how humans and tigers are alike, as well as how we are different.

A taxonomy can be defined as a hierarchical classification of concepts. “Hierarchical” here doesn’t mean that the concepts at the top of a taxonomy’s hierarchy are in some way superior to the ones further down. It’s not a hierarchy in the same sense as a corporation’s org chart, or in the way that a serf was lower in the medieval European social hierarchy than a king.

Instead, taxonomies place smaller groups of related concepts in hierarchical relationships with larger groups. Out on the branches of the taxonomic tree, humans and tigers sit far apart in different groups. But look closer to the roots, and you’ll find that tigers, humans, fish and birds are all part of the same bigger and broader group. (That’s Chordata, because we all have backbones – although some of us might not have enough of one to handle tigers for a living.)
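
One way to picture this is as a table in which each group points to the broader group that contains it. The Python sketch below is only an illustration: the PARENT table is a hand-picked slice of the biological hierarchy with several ranks left out, and the function names are invented for this example.

```python
# A hand-picked slice of biological taxonomy: each group maps to the broader
# group that contains it. Many intermediate ranks are omitted for brevity.
PARENT = {
    "Panthera tigris": "Carnivora",
    "Homo sapiens": "Primates",
    "Carnivora": "Mammalia",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Aves": "Chordata",
    "Chordata": "Animalia",
}


def lineage(term):
    """Return the term followed by every broader group above it."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def closest_shared_group(a, b):
    """Return the narrowest group that contains both a and b, or None."""
    groups_of_b = set(lineage(b))
    for group in lineage(a):
        if group in groups_of_b:
            return group
    return None


print(lineage("Panthera tigris"))
print(closest_shared_group("Panthera tigris", "Homo sapiens"))  # Mammalia
print(closest_shared_group("Panthera tigris", "Aves"))          # Chordata
```

Tigers and humans meet at Mammalia, while tigers and birds only meet one step closer to the root, at Chordata.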

What are taxonomies for?

Taxonomies also use controlled vocabularies. When talking amongst themselves, scientists all agree to use the term Panthera leo when they mean “lion” and Panthera tigris when they mean “tiger.” It’s easier to communicate about lions and tigers if everyone uses the same terms. Taxonomies’ power comes from the combination of controlled vocabularies (in which a rose is a rose is a Rosa) and precisely defined hierarchical relationships (in which animals are separate from plants, but both are grouped with the eukaryotes). If everybody agrees that ferns are plants, and that mammals are animals, and that the kingdom Animalia is separate from the kingdom Plantae, then it’s easier for scientists to communicate with each other about living things.

Similarly, once taxonomic classification is in place for a set of concepts, it becomes much easier to create tools for retrieving information about those concepts. Example time: Imagine you’re a scientist with thousands of samples of beetle species in drawers (“God has an inordinate fondness for beetles”). You need to find a specific beetle species for your research. If every drawer is labeled only “Beetle,” you’ll have to look through every last one of them to find your target. If, however, each beetle has been classified based on its unique characteristics, you can use those unique identifiers to quickly find the beetle you need.

A controlled vocabulary using preferred terms gives you hooks for finding information. You know to look not just for “a beetle,” but for Trichodes alvearius in the specimen drawer marked Trichodes alvearius. Or – in an example that might be more relevant to your actual life – you know where to look for that new pair of trainers in the “Athletic Shoes” drop-down on the shopping site. 

Different classification schemes and taxonomies have different purposes. The Dewey Decimal classification system, for example, makes it easy to find and retrieve books on library shelves. The categories that you see on Amazon or on other online shopping websites are often taxonomies designed for the purpose of helping you browse to find things to buy. 

What do taxonomies enable?

Information is useless if you can’t retrieve it. Taxonomies enable information to be retrieved faster. At InfoDesk specifically, our taxonomies enable semantic searching using natural language queries. What does that mean? I’ll use an InfoDesk-specific example. We’ll start with semantic search.

For InfoDesk’s Industry taxonomy, we’ve chosen the term “Consumer Products” (and not, say, “Consumer Goods”) to represent a broad industry sector that produces a wide array of items purchased directly by consumers (from televisions to toothbrushes, from smartphones to surfboards). Because of the semantic linking made possible by our subject taxonomy, when you search “consumer products,” you’ll receive results for both “consumer goods” and “consumer products,” no matter which words were used in the document. You’ll also get results for “luxury products,” “consumer electronics” and “footwear manufacturing,” because the hierarchical classification identifies all of these terms as being related to your original search.

Now for natural language processing, or NLP, which means, well … processing language naturally, much as we humans process speech ourselves. Here’s an example using a command you might give Alexa (although the same ideas hold true for any product that uses voice recognition): “Alexa, play me something funky.” Apps that use voice recognition are deploying natural language processing techniques to understand what you mean when you speak commands. Alexa needs to know that you want a music app to play you some funk music – that Alexa should launch Spotify and play you Parliament, James Brown or Sly & the Family Stone.

What’s more, Alexa needs to know that when you ask Alexa for something funky, you most certainly do not want to hear a song by 90s hip hop hitmakers Marky Mark and the Funky Bunch. Alexa must somehow comprehend that the presence of the word “funky” in a group’s name does not necessarily mean that they make funk music. So, in order to get this to work, Alexa has to access some kind of underlying classification of what people mean by funky. (To be clear, I have no idea if Alexa can actually handle this command well. Try it yourself and find out!)

It’s the same for NLP in semantic searching in InfoDesk products – “InfoDesk, show me news about the internet industry.” What you want to see are articles about Google, Twitter and the like. You don’t want to have to enter separate search terms for every internet company. Nor do you want to get back irrelevant articles that just happen to use the phrase “internet industry,” or articles that mention an internet company but don’t really have any business relevance. InfoDesk’s semantic search draws on carefully designed taxonomies and natural language processing techniques so that your search results include documents that are actually about the topic you’re looking for, instead of only documents that mention the keywords you’ve entered into the search box.

Why use taxonomies?

There are a number of reasons taxonomies are critical. We’ve already mentioned a few of them, but let me break down some other reasons here:

1) Boolean is hard, semantic is easy

From an end-user’s perspective, Boolean searching (using commands like AND, OR and NOT) is much more difficult than semantic searching (putting words in a box and letting the technology do the hard work of figuring out what you meant by them). With Boolean searching, you need to have an understanding of what you want, what you don’t want, and how to phrase searches the right way. With semantic search, much of that work has already been done for you. You don’t have to think quite so hard to retrieve the information you want. (For more on this read our blog here.)

2) Query expansion

The hierarchical relationships we referred to earlier are also the key to how query expansion works. A Boolean search for “chordates” will return only results that actually use the word “chordates.” But since biologists have classified tigers as Panthera, which are part of Carnivora, which are part of Mammalia, which in turn are part of Chordata, if you enter “chordates” in a scientific search tool built on these principles, you would get results for “tigers,” whether the word “chordate” appears or not. Taxonomies make it possible to return results from across the entire hierarchical structure without actually having to enter all of the terms in the hierarchy.
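
To make the idea concrete, here is a minimal Python sketch of query expansion. It is a toy, not a description of how InfoDesk’s search works: the NARROWER table, the sample documents and the function names are all made up for illustration, and the matching is plain substring matching.

```python
# Hypothetical narrower-term table: each group lists the groups directly
# beneath it. Real biological taxonomy has many more ranks and members.
NARROWER = {
    "chordates": ["mammals", "birds"],
    "mammals": ["carnivores", "primates"],
    "carnivores": ["tigers"],
    "primates": ["humans"],
}


def expand(term):
    """Return the term plus every narrower term beneath it."""
    expanded = [term]
    for child in NARROWER.get(term, []):
        expanded.extend(expand(child))
    return expanded


def search(query, documents):
    """Return documents that mention the query or any narrower term."""
    terms = expand(query.lower())
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


# Made-up sample documents; none of them contains the word "chordates".
DOCUMENTS = [
    "Tigers spotted near the village",
    "A field guide to ferns",
    "Why birds migrate",
]

print(expand("chordates"))
print(search("chordates", DOCUMENTS))
```

Searching for “chordates” returns the tiger and bird documents even though neither uses that word, and leaves the fern guide out.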

3) Synonymous and preferred terms

The English language offers endless ways of referring to the same subject. Taxonomies capture these synonyms within a controlled vocabulary, so you can be confident that when you search “Blue Bug” your query also returns results on “Azure Insects.” Similarly, if “blue” is constantly referred to as “azure” within your organization, you can set “azure” as the preferred term for this subject so that the information is tailored to your specific needs.
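
A synonym table of this kind can be sketched in a few lines of Python. The entries in PREFERRED and the function names are hypothetical, chosen only to mirror the example above, with “azure insect” standing in as the preferred term.

```python
# Hypothetical synonym table: every known variant maps to one preferred term.
PREFERRED = {
    "blue bug": "azure insect",
    "blue bugs": "azure insect",
    "azure insects": "azure insect",
    "azure insect": "azure insect",
}


def normalize(phrase):
    """Map a phrase to its preferred term; unknown phrases pass through."""
    key = " ".join(phrase.lower().split())  # lowercase and tidy whitespace
    return PREFERRED.get(key, key)


def same_concept(a, b):
    """Return True when both phrases resolve to the same preferred term."""
    return normalize(a) == normalize(b)


print(normalize("Blue Bug"))
print(same_concept("Blue Bug", "Azure Insects"))
print(same_concept("Blue Bug", "Red Bug"))
```

Because both the query and the document terms are reduced to the same preferred term before they are compared, it no longer matters which variant either one used.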


Conclusion

Taxonomies can be a confusing concept to grasp; but then again, the world is a confusing and disorderly place. Part of why I love working with taxonomies is that they can help us wrangle the chaos of the world’s information into a form that’s easier to access and use. And of course, I also love helping our customers solve their own, more specific information-wrangling problems. 

If you have any further questions, please get in touch with me: Ryan.Williams@infodesk.com