On my blog feedback page I get lots of questions which essentially boil down to one thing: "Those NULL things in databases work in a way I don't freaking get!"
Let's have them explained a little.
The Wikipedia page defines NULL this way:
Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfill the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information".
For people new to the subject, a good way to remember what null means is to remember that in terms of information, "lack of a value" is not the same thing as "a value of zero"; similarly, "lack of an answer" is not the same thing as "yes" or "no".
Rather than trying to come up with intuitive rules for how NULL behaves, it's easier to expand on the definition above to demonstrate the motivation behind introducing the NULL at all, and why it ended up so complex.
NULL is like an "unknown variable" in algebraic equation
Let's step away from relations and tables for a moment, and remember some old school math. What is this?
x = 5
This is a simple formula, which holds true if we substitute 5 for X and false otherwise.
3 > 2
This is also a simple formula, but there are no variables in it. It's always true as there's nothing to substitute here.
4 < 3
Similarly, this is always false.
The two last expressions are called identities: they are always true or false, no matter what, unlike the first one, whose truth depends on X. But identities can contain variables as well:
x + 2 = 2 + x
This would be also true no matter what we took for X.
When an expression over NULL would be true? If this expression over anything would be true.
This is not very intuitive, I know. Let's see some actual examples.
SELECT NULL + 1 = 2 -- NULL
This is the same as if we asked "is it true that
x + 1 = 2"? How the heck would we know? If x = 1, yes, otherwise no. Hence, the answer is "this expression is not an identity", or, "it depends", or even shorter, "NULL".
SELECT (NULL + 1 = 2) OR 3 > 2 -- true
Is it true that either 3 > 2 or x + 1 = 2? Well, we could stop right after "or". Yes, it's true, no matter what do we substitute for X.
SELECT (NULL + 1 = 2) AND 3 > 2 -- NULL
Now, both parts of the equation should be true, and the first part is not an identity, it may be either true or false, depending on the value of x. Hence, NULL.
More complex examples
IN is a predicate which checks if the left side matches anything in the list on the right side.
SELECT NULL IN (1, 2, 3) -- NULL
So basically we are asking: "Is X equal to 1, 2 or 3"?
If we substitute 1, 2, or 3 for NULL here, this would be true, otherwise false. So, NULL.
Let's see what happens if the NULL is on the right side.
SELECT 4 IN (1, NULL, 4) -- true
Now, the left side (that is, 4) would definitely match the third value in the list. It could as well match the second value too (if we substituted 4 for NULL), but it does not matter: one match is enough.
This one would return true.
IN construct accepts not only hardcoded lists of values on the right side, but also lists dynamically generated with queries.
Now let's use a query to build an empty list and see how would it handle a NULL on the right side:
SELECT NULL IN ( SELECT 1 WHERE 1 = 0 ) -- false
The inner select returns an empty list, and nothing is ever in the empty list, no matter what we substituted for NULL. False.
Even more complex examples
NOT IN is the opposite of
IN: it tells if the value is not in the list.
Let's see what happens if we put a NULL there:
SELECT 5 NOT IN (2, NULL, 3) -- NULL
If we replaced NULL with 5, it would return false, if with any other number, then true.
So, the result is NULL.
SELECT 6 NOT IN (6, NULL, 3) -- false
This would always be false, regardless of what would be substitute for NULL, as 6 is already in the list.
Please look closer at the last two expressions.
As you can see, a NOT IN against a list containing a NULL value will always evaluate to either NULL or FALSE. This means it will never get matched by WHERE clause.
I'll better highlight it for you:
If a list contains but a single NULL,
NOT IN against this list will never match anything
If the list is generated with
NOT IN (SELECT column FROM ...), inserting a single NULL to
column would make the predicate never match anything, even if it matched before, "breaking" the query.
Imagine we have tables with people and positions:
CREATE TABLE position (name TEXT); INSERT INTO position VALUES ('Assistant'), ('Worker'), ('Manager'), ('CEO'), ('CIO'); CREATE TABLE people (name TEXT, position TEXT); INSERT INTO people VALUES ('John', 'Manager'), ('Jim', 'Manager'), ('James', 'CEO'), ('Jill', 'CIO');
Now we want to see positions not occupied by anyone:
SELECT name FROM position WHERE name NOT IN ( SELECT position FROM people ) -- 'Assistant' 'Worker'
This works alright. But let's add but a single person whose position is not yet known:
INSERT INTO people VALUES ('Jennifer', NULL);
and re-run the query:
SELECT name FROM position WHERE name NOT IN ( SELECT position FROM people ) --
Now it returns an empty list.
It does not make sense if we treated NULL as "unknown" or "irrelevant". But it makes sense if we treat NULL as "could be anything".
If that Jennifer girl could assist, or work, or manage, or even run the company, we could not say for sure that any of those position is sure free. Whatever position had we considered, it might happen so that Jennifer were doing it.
This is one of the most counter-intuitive things in SQL and I hope reasoning behind this makes such a peculiar behavior more clear.
Meeting the real word
The initial idea behind NULL values was this: make them represent equations which the computers would solve and return "this is true", "this is false" and "give more data, will solve further".
However, this had proved to be not that easy.
The following rules are not consistent with reasoning above, and are only there to make the idea work in the real world.
NULL as a placeholder only works with booleans
Let's try to do this:
SELECT NULL * 0 -- NULL
Though there is no number which would give anything other than 0 in this expression, SQL would still return NULL. Proving expressions over complex domains being identities is a complex computational task, so the developers just decided to play it simple.
NULL are horizontally independent
Remember an example of an identity I gave a little bit earlier?
x + 2 = 2 + x
Well, you can't get this in SQL. All NULL values are independent on each other. In SQL, this is interpreted as
x + 2 = 2 + y
which is not an identity at all.
SELECT x + 2 = 2 + x FROM ( SELECT 1 UNION ALL SELECT NULL ) q (x) -- true NULL
Whatever actual value had we put instead of NULL, this query would return true in both rows. However, as NULL values do not correlate, even if they are from same source, this query returns NULL for the record with NULL.
The server engine of course could treat NULL values in the same way programming languages treat the objects: they could look the same (share the same sets of properties) without being identical (having different pointers). This way, if we had an expression like NULL(1) - NULL(1), the server could see that it's an identity resulting in 0 and substitute it into the formula above.
But what would a NULL(1) + NULL(2) be, i. e. if we added two NULL values from different sources? This would be a NULL(3), which would have to be recalculated for each possible value of NULL(1) and NULL(2). And what if we joined two tables containing NULL values together? The server would have to store a separate pointer for every NULL combination.
Tracking the origin of each NULL would be quite a complex task without any immediate practical use.
NULL values are vertically indistinguishable
Horizontally (in expressions) all NULL values are different, but vertically (in a set) they are all the same. Let's just run two simple queries:
SELECT NULL UNION SELECT NULL -- NULL
All NULL values are treated the same in sets, so a UNION of two NULL values is a NULL.
SELECT NULL EXCEPT SELECT NULL --
For the same reason, a NULL can be subtracted from a set as well as any other value.
Again, if each NULL had an identifier, the server would be able to distinguish between them. And again, there is little to no practical use in it. So all NULLs are considered not distinct for the purposes of set operations.
NULL is ignored by aggregate functions
SQL standard requires the aggregate functions to ignore NULL in their arguments. So this query:
SELECT SUM(x), MAX(x) FROM ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT NULL ) q(x) -- 3 2
would return 3 for SUM and 2 for MAX, despite the fact that substituting different values for NULL would change the query results.
Returning NULL for the aggregates over NULL values is not that hard, however, the standard developers decided just to ignore them. Note that this directly contradicts the set predicates behavior described earlier. A
NOT IN with a single NULL in the list would never return TRUE.
To make things more complicated, the standard prescribes to emit a warning whenever there are NULL values in the aggregate input.
Those above are some basic principles one needs to know about NULL values and how SQL handles them. If we tried to put them all together, we would probably have something this:
- Logical values of TRUE, FALSE and NULL mean "identically true", "identically false" and "not an identity".
- In boolean expressions NULL is a variable which allows them to be either true or not, depending on its value.
- In non-boolean expressions NULL is just a special value whose behavior needs to be learned.
As the Wikipedia page mentions,
Computer science professor Ron van der Meyden summarized the various issues as: "The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL."
This is so NULL!