A friend of mine who owns a stock photo website once told me he had designed an awesome photo auto-tagging algorithm which was 80% correct: just tag all photos cats and have four hits out of five.
But seriously, how do we calculate most used tags on photos?
For those not familiar with different tagging models, I highly recommend this old but very useful article: Tags: Database schemas.
Now, in this tutorial, we'll use the toxi approach to tagging. Later in the article I'll explain why we would choose that approach over the others.
Let us create a simple schema which would allow us to store, tag and sell photos:
CREATE TABLE photo ( id BIGINT NOT NULL PRIMARY KEY, author VARCHAR(200), ts DATETIME, path VARCHAR(255) ) CREATE TABLE phototag ( photo BIGINT NOT NULL REFERENCES photo, tag VARCHAR(200) NOT NULL, PRIMARY KEY (photo, tag) ) CREATE INDEX ix_phototag_tag ON phototag (tag) CREATE TABLE purchase ( id BIGINT NOT NULL PRIMARY KEY, photo BIGINT NOT NULL REFERENCES photo, value NUMERIC(20, 2) NOT NULL, buyer BIGINT NOT NULL )
We don't keep an explicit list of tags, but rather let people add tags to photos as they go. This means that a tag exists as long as it's used on a photo: once the last photo using this tag is untagged, the tag is extinct.
Which is the most used tag?
To answer this question we would need to count all rows in phototag, breaking them by tag, then sort the result and take the first row in the sorted list:
SELECT TOP 1 tag, COUNT(*) cnt FROM phototag GROUP BY tag ORDER BY cnt DESC
Anyone who has ever owned a piggy bank knows how hard can that be, and anyone who designed accounting software knows how hard can it be for a computer.
It would be much easier if we had a way to cache the counts: inrement the tag count by one each time a photo is tagged, decrement it when it's untagged. It helps even with a piggy bank. I actually did it when I was saving money for my first bike as a kid (in case you were interested — yes, I'm normal, my mother had me tested). And such a caching becomes an absolute necessity when your data volumes grow large. They are large if you open the browser while waiting for the database backup to complete.
Most database require that you write a trigger or otherwise manually sychronize that cache. SQL Server allows it to be built and maintained by the server.
It's as easy as creating this view:
CREATE VIEW tagcount WITH SCHEMABINDING AS SELECT tag, COUNT_BIG(*) AS cnt FROM dbo.phototag GROUP BY tag GO CREATE UNIQUE CLUSTERED INDEX ux_tagcount_tag ON tagcount (tag) CREATE INDEX ix_tagcount_cnt ON tagcount (cnt)
By running this command we create a special type of table which is updated automatically whenever its parent table, phototag, is updated. The tag's count (cnt) is incremented each type we add a tag to a photo and decremented each time we remove it. Once the last occurence of the tag is removed from a photo, the tag is removed from the view as well.
But which is more important? We can index cnt, this means that not only the tag counts will be cached and stored on the disk, but we will always have them in order.
Retrieving the most used tag would be as simple as running this query:
SELECT TOP 1 tag, cnt FROM tagcount WITH (NOEXPAND) ORDER BY cnt DESC
which takes just several page reads to retrieve the rightmost (largest) item from the index tree.
Which is the most purchased tag?
This task is a little bit harder as we don't track downloads by tag directly. We rather track downloads by photo and photos by tag. Finding the most downloaded tag without indexing would require breaking downloads by tag which takes a join between two tables. This would be a performance killer.
However, it's not a problem for SQL Server either. We just create this view:
CREATE VIEW tagpurchase WITH SCHEMABINDING AS SELECT tag, COUNT_BIG(*) AS cnt FROM dbo.phototag pt JOIN dbo.purchase p ON p.photo = pt.photo GROUP BY tag GO CREATE UNIQUE CLUSTERED INDEX ux_tagpurchase_tag ON tagpurchase (tag) CREATE INDEX ix_tagpurchase_cnt ON tagpurchase (cnt)
Now we have an indexed view, which is stored on the disk and is updated each time a photo is retagged or downloaded. It also has an index on cnt (which in this case stores the number of downloads per tag), so finding the most ten downloaded tags is this easy:
SELECT TOP 10 tag, cnt FROM tagpurchase WITH (NOEXPAND) ORDER BY cnt DESC
The same as earlier, except SQL Server tracks updates of two tables at once.
Which is the most profitable tag?
It's great to have lots of visitors on the site, but it does not pay the bills by itself. We need to know which tags yield the most revenue. And each download can have its own value.
The approach would be the same, but this time we would not only increment and decrement counts, but rather add and subtract the values as downloads are added or their photos are retagged.
To do this we would just need to amend our previous view a little:
DROP VIEW tagpurchase GO CREATE VIEW tagpurchase WITH SCHEMABINDING AS SELECT tag, COUNT_BIG(*) AS cnt, SUM(value) AS total FROM dbo.phototag pt JOIN dbo.purchase p ON p.photo = pt.photo GROUP BY tag GO CREATE UNIQUE CLUSTERED INDEX ux_tagpurchase_tag ON tagpurchase (tag) CREATE INDEX ix_tagpurchase_cnt ON tagpurchase (cnt) CREATE INDEX ix_tagpurchase_total ON tagpurchase (total)
SQL Server can also track sums per group and create indexes on them.
To retrieve 15 most profitable tags we would need to run this query:
SELECT TOP 15 tag, total FROM tagpurchase WITH (NOEXPAND) ORDER BY total DESC
which, again, is but a several reads from the top of the index.
Autosuggest me a tag!
Remember that we use toxi model for our stock photo application?
If you have read the article I linked earlier, you already know that there are three approaches to storing tags: mysqlicious (all tags in a single field, space or comma separated, right in the photo table); toxi (tags defined right in the many-to-many table), and, finally, scuttle (photos and tags defined in separate tables; many-to-many link table just links their ids).
Mysqlicious does not really work well with per-tag statistics, and scuttle should only be used if you store some data about the tags themselves (like a long tag description, when should you use one etc). We don't have all this in our model so we stick to toxi.
Now, toxi is great but has one drawback: it's not that easy to retrieve just a list of tags we use, not without scanning the whole phototag table. This is not that cheap as we have seen. And autosuggest should be fast, that's the whole point of it.
Let us pick an autosuggest for 'ca…'
We could run this query:
SELECT DISTINCT tag FROM phototag WHERE tag LIKE 'ca%'
which would be somewhat efficient as it would use an index on tag to avoid looking before 'ca' and beyond 'cb', in alphabetical order. But the query would still have to browse through numerous 'cars', 'cats' and 'cameras', which, as you can imagine, stock photo authors are fond of. This would take long.
However, we still have all those views we created earlier. And each of them, besides having an index on tag, has one more advantage over the original phototag table: each tag is stored only once. So we can drop the DISTINCT and just write this query:
SELECT TOP 10 tag FROM tagcount WITH (NOEXPAND) WHERE tag LIKE 'ca%' ORDER BY cnt DESC
which would not only provide us with a list of autosuggestions, but would also pick 10 most used ones.
SQL Server's ability to create indexed views is as useful and helpful as it is little known. I hope this series of tutorials will shed some light on it and make the world of SQL development a better place.
Previous articles in the series: