Automatically matching a category in Google's taxonomy (updated 2017)

Summary

I present an implementation of the taxonomy matching algorithm that was used by my Shopify application (now defunct) to find the most suitable Google shopping category of a product. The product can be defined in free text or tags, the taxonomy is based on Google’s Shopping Taxonomy. I show how to present the problem as a specialised form of text search and how to implement a solution using a search engine. The implementation is available on Github.

Background

In order to list products on Google shopping, merchants have to provide data feeds of their products. Those feeds contain the details about each product that buyers search for on Google’s shopping portal. Product search is different from standard web search in many ways. Advertisers have to pay for listings and search results depend on how accurate the data is that the product feed provides. In web search a google bot is visiting a web page and indexing its content where in product search the merchant is sending the feed content to google. The Google Product Category determines where in the search a product will appear and choosing the most suitable one is crucial.

The Taxonomy

The taxonomy is available as plain text or excel. It is structured as a hierarchical tree with the root being the most basic definition and the leaves as the most detailed.

Some examples:

Furniture
Furniture > Beds & Accessories
Furniture > Beds & Accessories > Beds
Furniture > Beds & Accessories > Bed Frames
Media > Books > Non-Fiction > Body, Mind & Spirit Books  

As you can see, the categories are made out of names, ‘&’ and ‘,’ character and ‘>’ as a separator. The ‘&’ is used in different ways which makes the parsing of the taxonomy more difficult. Consider these five examples:

Calendars, Organisers & Planners
Pen & Pencil Cases
Correction Fluids, Pens & Tapes
Home & Interior Design Software
Body, Mind & Spirit Books

Basically, the ‘&’ sign is used in a combinatorial way, a replacement for a comma or part of the category label.

Comma replacement

Calendars, Organisers & Planners are three different elements: Calendars, Organisers and Planners. Another example is Correction Fluids, Pens & Tapes

Category label

Home & Interior Design Software is meant to be just one category, the ‘&’ is used here as part of the label.

Combinatorial

Pen & Pencil Cases is different from all above: the ‘&’ combines words with each other. What it means is: Pen Cases and Pencil Cases.

Unfortunately for us, that means we can’t just replace the ‘&’ sign if we want to get the exact category label. Google went for user readability and not parser friendliness.

The Product Information

The details of a product can be any kind of text. Tags, description or labels.

Example a product description:

The Tosh Furniture Dark Brown Collection brings you this warm and inviting sofa set. Modern and functional, you'll enjoy ample seating options ...

This product could also have tagged with Sofa set for example.

The algorithm

In pseudo code the algorithm for finding matches is straight-forward:

With C = Categories and P = Product descriptions (free text, tags, etc)

foreach c in C: 
    foreach p in P:
        if c match p:
            add hit(c) weighted by level of c

select c with most hit points

The only function left to define is match.

Match as special form of search

To make our life easy we will use a trick to implement the match function. We want to find out if a certain category (for example Baby & Toddler Furniture) matches a product description (e.g. Our furniture for babies and toddler). Luckily, there has already be done a lot of work in a field we can take advantage of: information retrieval and search engines. We are going reduce our matching problem to a search task: the category is the search query we want to find in the product description.

That leaves us with the task to redefine a category into a search query and use one of the available search libraries to carry out the search.

As mentioned above, the taxonomy makes varied use of space, comma and ampersand sign. We keep things easy for our first version (and real life tests show that it seems indeed sufficient): every & and , will be replaced with an OR operator. For example, the category Arm Chairs, Recliners & Sleeper Chairs will be converted into Arm Chairs OR Recliners OR Sleeper Chairs.

As a search engine library, we are using Whoosh because it is a pure Python library that is very easy to integrate. There are of course other more established and better-performing engines (SOLR etc) but as we don’t care too much about performance and have very few documents convenience wins here.

Selecting the final category with most hit points

Now that we are able to find matching categories, we need to select the most relevant one among the matches. The quality of a match, ie the “score” and the “level” of the category, ie. the further down on a branch the categories the more relevant. In other words: an exact match of a category should be preferred and Kitchen Table is preferred to Furniture.

Additionally, we add a weight to where a match was found. If we have an exact match in a title of a product, we want to weigh that higher then if the match was found in the free-form description of a product.

We combine all this into a scoring function which is then subject to fine tuning and experimenting. There may not be a general function that fits all use cases.

Implementation in Python using Whoosh

def index_product_info(product_dict):
    schema = Schema(path=ID(stored=True, analyzer=StemmingAnalyzer()), content=TEXT(stored=True, analyzer=StemmingAnalyzer()))
    st = RamStorage()
    st.create()
    ix = st.create_index(schema)
    writer = ix.writer()
    for key in product_dict.keys():
        writer.add_document(path=unicode(key, "utf-8"), content=unicode(product_dict[key], "utf-8"))
    writer.commit(mergetype=writing.CLEAR)
    return ix

The product information is provided as a dictionary, eg. {'title': 'Halogen Desk Lamp'}. We don’t need to store the index and use an in-memory index (RamStorage). Whoosh offers also the possibility to add a StemmingAnalyzer to our index, that way it is possible to match words in their base form (matching singular, plural and other forms (see Whoosh - Stemming)).

def match(ix, category, weights=None):
    # get the leaf of a category, e.g. only "Chairs" from Furniture > Chairs
    index, c = get_category(category)

    # adjust query
    # replace comma and ampersand with OR
    query = re.sub('[,&]', ' OR ', c)

    with ix.searcher() as searcher:
        parsed_query = QueryParser("content", schema=ix.schema, termclass=Variations).parse(query)
        results = searcher.search(parsed_query, terms=True)
        score = 0
        for r in results:
            weight = 1
            if weights:
                weight = weights[r['path']]
            score += r.score * weight
        return score

The actual matching of a category is done by executing Whoosh search on our in-memory indexed product descriptions. Additionally we add an optional weight to each score of a hit.

Github project

I made my implementation available under Github: https://github.com/BernhardWenzel/google-taxonomy-matcher.git. It provides the source code and a script that can be executed by parsing a csv file and writing back the result into a csv file. If you are a shop owner, you can use that tool to automatically assign your products to a category.

Future improvements

The quality of the results can be further improved or adjusted by fine tuning the scoring algorithm of the search engine as well as the scoring function of the selection of the best category match. However, as feedback of our customers show, the settings as they provide already good enough results. At the end, a human still has to double check the results as it is critical to use the most appropriate category for a retailer’s success.