Make life easier for your users with faceted search
When you walk into a real physical store, it is generally very easy to find what you need. Stores have aisles, shelves, and tables that are organized to drive a customer to do three things: find what they need, discover what they didn't know they wanted, and to quickly browse related products. There is a reason all the vegetables are in the same place. If you are building a business online, you must achieve the same effect to give your customers and easy time navigating your website. The most common way of doing so is building a faceted search system.
What is faceted search?
The faceted search UI metaphor is all over the web. If you've been to Amazon.com, you are familiar with it. The basic functionality is as follows: searching with text brings up products related to your keyword. Those results come with groupings of categories and subcategories that allow you to further refine your search results. The combination of full text search (like Google) with filtering and refining is called faceted search. In this tutorial, I'll be showing you how to roll together a very simple faceted search system using PostgreSQL's full text search capabilities
Facets and filters
In the picture above, facets are the category headings on the left hand side (Toys & Games). Filters are the items within those categories that allow you to filter the results (Action & Toy Figures). Our goal here is design a schema and some queries that allow us to tag products with the proper filters and search for products by selecting filters, via text, or some combination of both.
A quick note: The following examples are all in raw SQL because PostgreSQL-specific text search features. I would like to note that if you are to roll with a system similar to the one I am describing, I strongly recommend adapting it to use an ORM or other database layer. For my production system, I developed this using Squeryl as an ORM and Anorm when I needed to use raw SQL
1. Build a table for your searchable entities
First, let's get a basic table for your products to get started. For this example, we'll need a product title and a description, although a real website would likely have many more fields.
CREATE TABLE Product ( title varchar(128) not null, description text not null, id bigint primary key not null );
If you are adapting this code into a current schema, these example tables are very pedantic. This should be easy enough to follow along that you can adjust the queries to your existing schema. On the other hand, if you are starting with a blank database, let's go ahead and add some products:
INSERT INTO product (title, description, id) VALUES ('Infinite Jest', 'A 1996 novel by David Foster Wallace. The lengthy and complex work takes place in a semi-parodic future version of North America, and touches on tennis, substance addiction recovery programs, depression, child abuse, family relationships, advertising, popular entertainment, film theory, and Quebec separatism, among other topics.', 1), ('A Visit From The Goon Squad', 'A 2010 fiction novel by American author Jennifer Egan. It won the 2010 National Book Critics Circle Award for Fiction, and the 2011 Pulitzer Prize for Fiction.', 2), ('Ready Player One', 'A science fiction novel by Ernest Cline. The book was published by Random House on August 16, 2011. The audiobook is narrated by Wil Wheaton. In 2012, the book received an Alex Award from the Young Adult Library Services Association division of the American Library Association and won the 2012 Prometheus Award.', 3);
2. Get simple text search working
Giving your customers a very basic search functionality is easy. PostgreSQL comes with fast, simple, and efficient text search tools that are way better than the SQL LIKE command. PostgreSQL comes with tools that process natural language queries, compare them against documents, and provide results back with rankings if you choose to use them. This is extremely powerful.
LIKE does not intelligently process a query; PostgreSQL functions will stem words and ignore frequen words depending on the language. That means a query like this "The Dogs are JuMpInG" will be understood like this "dog jump" matching all verb conjugations for jumping, both plural and singular forms of dog, as well as ignoring common words (called stop words) that provide little context for a search engine such as "the" and "are." The best thing about these look ups, is that they can be very fast if you create index for them. Read more about indices for text search here:
So let's go ahead and start searching. To search, we'll need document representations of the products. To do this, we use to_tsvector() which is a function that converts arbitrary text into a tsvector, which is a type of document that can be search against. Try this query.
SELECT to_tsvector(p.title || ' ' || p.description) from product p;
Your output should look kinda crazy like this:
'1996':4 'abus':37 'addict':32 'advertis':40 'america':26 'among':48 'child':36 'complex':13 'david':7 'depress':35 'entertain':42 'famili':38 'film':43 'foster':8 'futur':22 'infinit':1 'jest':2 'lengthi':11 'north':25 'novel':5 'parod':21 'place':16 'popular':41 'program':34 'quebec':46 'recoveri':33 'relationship':39 'semi':20 'semi-parod':19 'separat':47 'substanc':31 'take':15 'tenni':30 'theori':44 'topic':50 'touch':28 'version':23 'wallac':9 'work':14
That is PostgreSQL doing the following to your document:
- Removing punctuation
- Filtering out stop words (common words like and, or, etc.)
- Parsing the text into tokens that represent every linguistic representation of that phrase.
Now let's use our vector to match against against it:
SELECT p.id, p.title from product p WHERE to_tsvector(p.title || ' ' || p.description) @@ to_tsquery('infinite & jests')
You should get back row 1 of our table, the entry for Infinite Jest. The '@@' symbol matches vectors against queries and returns true or false. The function totsquery does a similar thing to totsvector, except that it converts a query into a tsquery, which uses the same type of parsing and filtering that we use to generate our tsvector. One catch though, totsvector() uses a special syntax to enable many features that are beyond the scope of this tutorial. If you simply want to take a user's text input and convert it into a query, use the plaintotsquery() function.
SELECT p.id, p.title from product p where to_tsvector(p.title || ' ' || p.description) @@ plainto_tsquery('infinite jests');
Feel free to change the query and experiment with searching. You will notice that there are some limitations, such as mispellings. Text search is highly customizable on PostgreSQL and one should definitely take the time to read the docs and see all that PostgreSQL can do for you. Additionally, generating ts_vectors everytime this query is called is not particularly efficient; you will want to use a GiN index to make queries fast.
3. Add some filters and facets
Let's create a table for filters:
CREATE TABLE Facet ( name varchar(128) not null, id bigint primary key not null );
And one for facets:
CREATE TABLE Filter ( name varchar(128) not null, id bigint primary key not null, facetId bigint not null ); ALTER TABLE filter ADD CONSTRAINT filterFK1 FOREIGN KEY (facetId) REFERENCES Facet(id);
Facets can have many filters associated with them so we add a foreign key constraint to ensure that each filter has a 'parent' facet. How should we tag entities (in our case products) with filters? Let's create another table to manage the mapping.
CREATE TABLE ProductFilter ( id bigint primary key not null, filterId bigint not null, productId bigint not null ); ALTER TABLE ProductFilter ADD CONSTRAINT ProductFilterFK2 FOREIGN KEY (filterId) REFERENCES Filter(id) ON DELETE CASCADE; ALTER TABLE ProductFilter ADD CONSTRAINT ProductFilterFK3 FOREIGN KEY (productId) REFERENCES Product(id) ON DELETE CASCADE;
We will populate our facets with some realistic examples.
INSERT INTO facet (name, id) VALUES ('decade', 1), ('format', 2);
Add some filters.
INSERT INTO filter (name, id, facetid) VALUES ('1990s', 1, 1), ('2010s', 2, 1), ('Paperback', 3, 2), ('Hardcover', 4, 2);
Now the last bit of set up, tagging our products with appropriate filters.
INSERT INTO productfilter (productId, filterId, id) VALUES (1, 1, 1), (1, 4, 2), (2, 2, 3), (2, 4, 4), (3, 2, 5), (3, 3, 6);
4. Searching against facets and filters
Assuming you've had some experience with relational databases, using the productfilter table should be fairly obvious. For example, to find all products tagged with filter of ID 1 use this query:
SELECT p.title FROM product p, productfilter pf WHERE pf.productId = p.id AND pf.filterId = 1;
It should be simple depending on the application you are using to display facets and their child filters. Assuming your able to get these IDs to your database, the real question is how should you combine all this information. Using Amazon.com as an example, it is clear that they allow users to OR filters within a facet, while filters across facets relate to each other using AND. For example, in our case, if a user clicked "1990s", "paperback", and "hardcover", we would surface books published in the 1990s that are either paperback or hardcover rather than books were published in the 1990s or book that are both paperback and hardcover simultaneously.
That being said, let's think about how we would search for that specific query.
SELECT p FROM product p WHERE EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (1) AND pf.productId = p.id ) AND EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (3, 4) AND pf.productId = p.id );
The above query is selecting products uses EXISTS clauses to select all rows that meet a certain criteria. In this case, the exist clauses are looking for all products that have one of the filterIds we seek. The IN clause works like a logical OR across all the values passed as arguments. Using AND, we can chain together these EXISTS clauses very easily and select products across facets. Building a query this way is makes things more moduler, because now we can picture building a complex SQL query programmatically with some other language very easily.
5. Combining filters and text searching
Let's say a user has entered a text query and has now selected some filters. How do we search for that? It's actually very simple; just append the matching vector developed earlier using AND.
SELECT p FROM product p WHERE EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (1) AND pf.productId = p.id ) AND EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (3, 4) AND pf.productId = p.id ) AND to_tsvector(p.title || ' ' || p.description) @@ plainto_tsquery('text query goes here');
So we are almost done. We have one query with many parts to it that can handle multiple filters across facets and match against text. This query should be used programmatically. That is, use another language to stitch together these parts depending on the actions a user has taken on your website. There is one big issue with our query though: we are able to select filters, but we cannot yet use text search to search against the names of the filters themselves. If I visited a website, I would hope a book in the category paperbacks could be found by searching 'paperback'. What we need are tsvectors that represent the title and description of a product as well as the names of any filters. This query demonstrates how to do exactly that:
SELECT p.id, to_tsvector(p.title || ' ' || p.description || ' ' || COALESCE(string_agg(f.name, ' '), ' ')) FROM product p LEFT JOIN productfilter pf ON p.id = pf.productId LEFT JOIN filter f ON f.id = pf.filterId GROUP BY p.id;
The string_agg() function aggregates all filters from the LEFT JOINs into a string delimited by a single space. COALESCE guards against any empty rows in the joins. Let's simplify a little bit and make a view:
CREATE VIEW productfilter_v AS SELECT p.id, to_tsvector(p.title || ' ' || p.description || ' ' || COALESCE(string_agg(f.name, ' '), ' ')) FROM product p LEFT JOIN productfilter pf ON p.id = pf.productId LEFT JOIN filter f ON f.id = pf.filterId GROUP BY p.id;
Now with an INNER JOIN we can search against our more complete documents:
SELECT p.id FROM product p INNER JOIN productfilter_v v ON (v.id = p.id) WHERE EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (1,2) AND pf.productId = p.id ) AND EXISTS ( SELECT p.title FROM productfilter pf WHERE pf.filterId IN (3,4) AND pf.productId = p.id ) AND v.to_tsvector @@ plainto_tsquery('query goes here');
PostgreSQL can get you searching rather quickly. This is especially important if you are working for a startup with limited resources. The underlying schema is very simple, facets, filters, a one-to-many relation, and a table of entities. While the queries look complicated, if you are doing them right, you can build them programmatically very easily. I've built a very similar system myself using Scala, Anorm, Squeryl, and PostgreSQL. There is quite a bit of material to take in here, so I've left out a few key details that really make this system production ready. Some tips before you roll your own system:
- Using a GiN index, which is fairly straightforward, is essential to making this efficient
- You should not rely on a view to do this query since you'll be doing multiple joins on every search. A view is more or less an alias for a SQL query. What you will probably want to use is a materialized view. It is essentially a table that copies the data from a view into new table. You can think of materialized views as memoization for databases. Jonathan Gardner has created this great resource on materialized views that I've used on my projects.
- Read the PostgreSQL docs on full text search. They are well written and will be immensely helpful.
- Use an ORM, make sure you have the right indices in place to make this efficient, make sure you use sequences to generate your IDs, and a whole host of other database best practices. :)
I chose to build this system for a few reasons, the most important being that I needed to build a flexible system really quickly to meet some business needs. Working in the startup world forces one to be both resourceful and flexible. Working intelligently in a startup forces you to think about the long term picture. How can one build a search system quickly without racking up too much technical debt? The system I've described doesn't lock you into using PostgreSQL for the long term, but is quick to get working and into production. Thanks for reading and please leave any questions in the comments.