1. Is SQL a declarative language?
First of all, let's define the terms.
A declarative language describes the result, not the process to achieve it. In a declarative language you say "what I need".
A procedural language describes control flow: the exact steps the machine must perform to achieve that result. In a procedural language you say "what to do".
SQL, as it was originally designed, is a declarative language.
For instance, suppose you need to know how many customers over 40 live in each city. You issue a query like:
SELECT   city, COUNT(*)
FROM     customer
WHERE    age > 40
GROUP BY city
which pretty much describes the logic you're after, in a language quite close to plain English.
If we used a procedural language, we would have to describe it in terms of algorithms: command blocks, branches, all that stuff.
For instance, if we had an index on age, we could use this index to filter records and group them in a temporary table. The algorithm would then look similar to this:
create a temporary table temptable (city, count) indexed on city
open the index on customer.age
go to the first record in customer.age with age > 40
for each customer.age.record in customer.age
    locate temptable.record where temptable.record.city = customer.age.record.city
    if temptable.record exists then
        increment temptable.count by 1
    else
        insert a new record (customer.age.record.city, 1) into temptable
Or, if we had an index on city, we could traverse it, filter out the records with low age, increment the count for a city and return it whenever the city changes:
open the index on customer.city
declare variable current_city
declare variable current_count
for each customer.city.record in customer.city
    locate customer.record in customer by customer.city.record.pointer
    if customer.record.age <= 40 then
        continue
    if current_city <> customer.record.city then
        if current_city is not empty then
            return (current_city, current_count) to the client
        current_city = customer.record.city
        current_count = 1
    else
        current_count = current_count + 1
if current_city is not empty then
    return (current_city, current_count) to the client
Or we could come up with any other algorithm that allows filtering on one field and grouping on another. The thing is, we don't describe the result we're after, but rather the process to achieve it.
Note that even procedural languages may have some "declarativity" in them; it's just a matter of how much of the menial work you leave up to the computer. If you look closer at the made-up language I wrote the "programs" above in, you'll see that some operations are still left to the computer. For instance, what do "open the index" or "locate the record" mean? The index can live on the filesystem or in the cache, the memory it occupies has to be allocated and possibly reclaimed from other processes, and so on.
However, an SQL "program" is just a query (or, a most, a set of queries run independently). The queries cannot interact, cannot be depend on each other's results etc. In fact, adding those things would make the language procedural. If you have any kind of flow control logic in your program, it's procedural. Some engines actually extend SQL this way (allowing routines, dependencies, flow control etc), and the abbreviation "PL" commonly found in the names of those languages stands for "procedural language".
At first sight, declarativity may seem like bliss: we can leave boring tasks like figuring out what to do to the machine and concentrate on describing the results we need.
Now, let's look into the query above. It's a pretty simple query, in fact. However, we could come up with at least two algorithms to run it (and there are many more, in fact), and the query engine has to choose the fastest one. And of those two algorithms, neither is the "better" one: depending on the data distribution, I/O settings and many other factors, each may or may not be more efficient than the other. Database engines have special modules (called optimizers) which analyze all these factors and decide which of the algorithms (or, in RDBMS terms, query plans) will serve the query best.
This would be great if not for three factors.
First, computers are great at crunching large amounts of data but really suck at decision making. When optimizing a query, the engine relies on data statistics gathered while inserting data and processing it between serving clients. Those statistics approximate the data distribution in the tables: for instance, using statistics, you can roughly tell how many records in an index would satisfy a certain condition without explicitly counting them. Unfortunately, they are usually not fit for more complex tasks: they don't account for correlation between fields (say, women's hygiene products are usually sold to, um, women); they can't anticipate future data (it's quite easy for a person to tell when Christmas trees start selling better, but not for a computer); in other words, they don't see the wide picture a human developer sees, they can only crunch numbers blindly.
Second, SQL does not really let you fix the computer's incorrect choice of algorithm. Query plans can be controlled to some extent: some engines let you specify directly which tables or indexes will be read and in which order; others let you set parameters like "how fast is my RAM compared to my HDD" to influence the optimizer indirectly. But all those workarounds have limited effect and are not easy to use even for a seasoned developer. They are not something SQL was designed with in mind.
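For illustration, here is what that direct control looks like in two real dialects, applied to our customer query (assuming an index named idx_age exists; these are sketches, and hint syntax is entirely vendor-specific):

-- MySQL: force a particular index
SELECT   city, COUNT(*)
FROM     customer FORCE INDEX (idx_age)
WHERE    age > 40
GROUP BY city;

-- SQL Server: force an index and a grouping algorithm
SELECT   city, COUNT(*)
FROM     customer WITH (INDEX (idx_age))
WHERE    age > 40
GROUP BY city
OPTION   (HASH GROUP);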
Third, and most important: the computer has to choose from a (more or less) predefined set of algorithms; it cannot come up with a new one.
Let's consider this query, for instance:
SELECT  date, total,
        (
            SELECT  SUM(total)
            FROM    daily_sale ds2
            WHERE   ds2.date <= ds.date
        ) AS accumulated_total
FROM    daily_sale ds
ORDER BY date
It may take a while to understand what this query does, but it's quite simple: for each day it shows the accumulated sum earned so far. If we had this table of daily sales:
date        total
2013-10-01  2500.00
2013-10-02  3000.00
2013-10-03  1700.00
the query would return the following figures:
date        total    accumulated_total
2013-10-01  2500.00  2500.00
2013-10-02  3000.00  5500.00
2013-10-03  1700.00  7300.00
For any kid who has just started toying with Basic on their ZX Spectrum (I may be a little outdated on this), it's pretty obvious how to program such a task: declare a variable, add each new record's total to it, and return it along with the record.
However, no real RDBMS engine would use this very algorithm to run this very query, despite all its simplicity and effectiveness. Instead, it would read and sum the records from the beginning for each record, over and over again. Query optimizers are just not programmed for such things. It takes a human developer to sit down and rewrite the query using window functions. And window functions are not always there; sometimes you just have to deal with the limited SQL subset the engine offers you.
This is in fact the major disappointment for many SQL developers. It takes quite some time to learn to formulate your logic in terms of SQL, but eventually you figure it out and it starts making sense. As you become good at SQL, you start noticing that your queries are nice, elegant and precise, with but one drawback: they are intolerably slow. You sometimes have to sacrifice their beauty and elegance: add ugly-looking hints, repeat the same pieces of code over and over, or even offload some of the logic to the client. All to make them faster.
So, to conclude: SQL was developed as a declarative language, which means you describe the result you're after and let the engine find the best algorithm to obtain those results.
However, implementation limitations don't always let the engine come up with all possible algorithms; of those available, the engine can choose a suboptimal one because of poor execution time estimation; and, finally, there is little to nothing you can do about it in SQL.
To cope with this, most databases offer, in addition to SQL, procedural languages (usually customized for data processing and designed to work with queries and their results). Also, some databases allow embedding instructions (called query hints) right into SQL queries. Those hints advise (or command) the optimizer to choose a certain query plan.
SQL does what it declares, but not necessarily in the best possible way. To write truly efficient queries, you should be ready to rewrite them, suggest hints to the optimizer and tweak server setup parameters.
2. Are cursors bad?
Anyone who has spent more than an hour learning SQL has heard or read that "cursors are bad".
To tell how bad they are we first need to understand what they are.
A cursor is a tool for working with query results. It is an object which lets you do things like "fetch the next result", "fetch the next 100 results", "go 100 records back" (that's if you're lucky), "close this query, I don't need it anymore", that kind of stuff.
The goal of a database query is to massage data and hand it over to the outside world: a data grid in a report window, a chart builder, a mail server. None of those usually like to be flooded with billions of records and left on their own. So it's just common courtesy to give them a valve to regulate the speed of the data flow. The cursor is such a valve: it's an interface through which you talk to a query and let it know how much of its data you need and when you need it.
But how come that such a noble and useful invention would turn into a DBA's nightmare?
Remember the previous chapter and the query plans? Whenever you run a query against a database, the optimizer transforms the declarative query into a procedural program (the query plan) which describes in detail what exactly will be done to the tables' data before they can be presented to the client.
The optimizers may not be that good at, um, optimizing; that is, they don't always come up with the best algorithms possible. But when they do, those algorithms run as fast as they can. I used the term "algorithms" for the query plans before, but they are more like Lego constructions. There are basic operations like "nested loops" (do something for each record in a table), "index seek" (find a value in the index) and so on, which take each other's outputs as inputs and process them. And the speed of those basic operations is the bread and butter of a database. If anything in a database system is optimized at all, it's its ability to execute those blocks very fast (though the blocks themselves may be connected in a sub-optimal way).
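You can actually peek at those Lego blocks: most engines can show you the plan they built for a query. In MySQL or PostgreSQL, for instance, it's a one-keyword prefix (the output format varies by engine):

EXPLAIN
SELECT   city, COUNT(*)
FROM     customer
WHERE    age > 40
GROUP BY city;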
However, making a query fast may require some complex SQL.
Now please meet Alice, the apprentice developer. By "complex SQL" she means "anything beyond SELECT, FROM and WHERE".
Alice writes a sales report with the salesperson's surname in it (which is, of course, not in the sales table). She's close to the deadline, the report needs to be done, and she's not that good at SQL yet. So she takes a brute-force approach to the problem. She does what she's accustomed to doing in her favorite programming language: she queries sales, takes each record from the query and looks up the salesperson's name by id in the people table.
To do this she, of course, uses a cursor, as that's what you use for anything that involves "each record from a query".
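Her approach might look roughly like this in SQL Server's T-SQL (a sketch; the sales and people tables and their columns are made up for illustration):

DECLARE @salesperson_id INT, @total DECIMAL(10, 2), @surname VARCHAR(100);

DECLARE sales_cur CURSOR FOR
    SELECT salesperson_id, total FROM sales;

OPEN sales_cur;
FETCH NEXT FROM sales_cur INTO @salesperson_id, @total;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- an extra lookup query for every single sales record
    SELECT @surname = surname FROM people WHERE id = @salesperson_id;
    -- ... emit (@surname, @total) into the report ...
    FETCH NEXT FROM sales_cur INTO @salesperson_id, @total;
END;
CLOSE sales_cur;
DEALLOCATE sales_cur;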
What is Alice doing wrong?
First, she deprives the optimizer of the opportunity to build an optimized plan. The optimizer might not be that smart; however, a simple JOIN is something it usually handles quite well. And there is more than one way to skin a cat, that is, to join two tables: there's the "nested loop", there's the "merge join", there's the "hash join"; the tables may be taken in different order, different indexes may be used, and the operations can be parallelized by engines capable of parallelizing.
By replacing the JOIN with a homemade lookup solution, Alice effectively limits the engine to a single algorithm, namely the "nested loops", which is usually not the most efficient one.
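For comparison, the whole lookup collapses into a single query over the same made-up schema, leaving the choice of join algorithm, join order and indexes to the optimizer:

SELECT  s.total, p.surname
FROM    sales s
JOIN    people p ON p.id = s.salesperson_id;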
Second, by doing the joins outside the database, Alice adds a lot of overhead to otherwise simple operations. If the optimizer took care of the join, it would not only use an optimized plan but also run it within the server context, with shared memory and everything. Retrieving results outside the query one by one requires network communication with the server, sometimes interprocess communication between the "database" part and the "network" part, and also makes the application layer (where the actual name lookups happen) do work better done by the database server. If those name lookups are frequent, the application layer might request the contents of the people table over and over again. In general, joining outside the database implies much extra work that could otherwise be avoided.
There are scenarios when, despite all said above, working with cursors is better than relying on the optimizer.
First, some queries cannot be worded in SQL efficiently.
Remember our old friend Cumulative Total? If you have window functions in your SQL dialect, you can write it as easily as:
SELECT  date, total,
        SUM(total) OVER (ORDER BY date)
FROM    daily_sale ds
ORDER BY date
But what if you don't? Say you're on SQL Server before 2012? Or on MySQL before, well, now? (MySQL guys, really!) Or on "generic SQL" your framework constrains you to?
You could either resort to this "generic" query:
SELECT  date, total,
        (
            SELECT  SUM(total)
            FROM    daily_sale ds2
            WHERE   ds2.date <= ds.date
        ) AS accumulated_total
FROM    daily_sale ds
ORDER BY date
which would go full Schlemiel the Painter and re-run the inner query over and over again. Or you could calculate the cumulative sum on the client as you go through the records (which you need to do anyway to fill the grid or whatever). And you know what, adding a number to a variable once per record is not that hard for a modern computer, not compared to the things it needs to do to retrieve that record. In this case, a little client-side aid comes in handy.
Second, for various reasons, you could have application software, application servers and network connections so beefy that you'd be better off treating your SQL database server as a pure key-value (or key-range) store. This basically means you have your own "optimizer": data processing algorithms and everything else implemented in your application in a way that suits your needs better than anything the database server could do. This is often the case when your database server cannot natively parallelize queries, or your data are sharded (distributed between several servers).
In both these cases you're telling the server: "I know you're cool and can join and everything, but please, just give me your data; I know better how to handle it, ok?". If you really do know better, and are sure of it, it's fine to do so.
So, a cursor is a tool for working with query results: retrieving them one by one or in batches, rewinding them, opening and closing the query.
Using cursors is not bad. Using cursors for retrieving records and processing them outside the database usually is.
In most cases, turning your data processing rules into an SQL query and letting the optimizer handle it is better than rolling your own algorithm, whether in the database server's procedural language or in an external application.
However, there can be exceptions: poor SQL support, a poor optimizer or poor database server connectivity. In those cases external processing could indeed be better, which, in its turn, calls for cursor-based solutions to retrieve the data records and massage them in the application.
3. Is developer's job writing the queries and DBA's job making them fast?
You can often hear that, since SQL is a declarative language, a developer should just write the queries (so that they run and produce correct results), while a DBA makes indexes, partitions and whatever other thingamajigs they make to get those queries running fast. And if your query runs correctly but not fast, it's the DBA who's to blame.
Well, SQL was indeed conceived as that kind of language. An end user, the thinking went, would formulate their queries in almost plain English, without having to concern themselves with such trifles as tablespaces, indexes, sharding and the like, while an experienced DBA would analyze those queries and arrange the data on the server so that they run fastest.
As you can probably tell by now, it did not work.
First, SQL is not plain English. Oh boy, it's not.
I recently developed a system which, among other things, had a simple table with measurements for multiple sensors, literally four fields: id, sensor, state, timestamp. Here's a query I wrote:
WITH rns AS
(
    SELECT  sensor, ts,
            ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY state, ts, id) rn1,
            ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY ts, id) rn2
    FROM    measure
),
series AS
(
    SELECT  sensor, MIN(ts) AS min_ts, MAX(ts) AS max_ts,
            DATEDIFF(second, MIN(ts), MAX(ts)) AS length,
            ROW_NUMBER() OVER (PARTITION BY sensor
                               ORDER BY DATEDIFF(second, MIN(ts), MAX(ts)) DESC) rn
    FROM    rns
    GROUP BY sensor, rn2 - rn1
)
SELECT  sensor, min_ts, max_ts,
        DATEDIFF(second, min_ts, max_ts) AS length
FROM    series
WHERE   rn <= 3
ORDER BY sensor, length DESC
In plain English this means: "for each sensor, show me three longest series where the state didn't change".
In SQL, it's a mammoth with, Lord forbid, common table expressions, combined window/aggregate functions and groupings; even I, the author of that query, had to look at it several times to remember how it worked.
So, SQL is not "almost plain English". It's quite a tough programming language, with several dialects and sub-dialiects, which has to be learned hard. It takes a trained developer to do anything beyond a simple filter in SQL.
Second, as we have already learned, not everything that can be made to work in SQL can be made to work fast. The query above is far from perfect. Since SQL (namely, the dialect used by SQL Server 2008, where this query runs) lacks a native way to track series, I had to resort to a trick which requires the table to be sorted twice before the grouping and once again after. This is not the fastest possible algorithm; the query is only there because the report based on it runs occasionally and its speed is "tolerable".
Improving some queries requires creating additional tables, rearranging the data, sometimes even compromising data freshness by caching results of heavy intermediate queries on a schedule (so-called materialized views, and their home-brewn analogs in systems that don't support them natively).
In other words, not every query can be fixed by a DBA without a heavy rewrite (sometimes even without offloading some complex logic to the application).
A better way to isolate the application from the database is to use stored procedures or other wrappers around database code, which can be changed more easily by a DBA (or a developer with a deeper understanding of databases) without affecting the application.
Instead of hardcoding the query and then complaining that it's impossible to optimize ("Are you a DBA or not?! It's a declarative language, for God's sake!"), hardcode the stored procedure name and give yourself and the DBA more room to rewrite what's inside it.
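A minimal sketch of that idea in T-SQL (procedure and table names are made up):

CREATE PROCEDURE dbo.sales_report
AS
BEGIN
    -- today it's a plain join; tomorrow the DBA may rewrite it against
    -- a pre-aggregated table without touching the application
    SELECT s.total, p.surname
    FROM   sales s
    JOIN   people p ON p.id = s.salesperson_id;
END;

The application only ever runs EXEC dbo.sales_report; whatever happens inside is free to change.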
4. Is SQL portable?
SQL is regulated by ISO, and there are several revisions of the SQL standard. The first version of the standard was released in 1986, the most recent in 2011.
A score and five years have passed between those events, and, as you may guess, database management systems in general, and SQL as a means of talking to them in particular, have evolved quite a bit during those years. So when talking about "standard" SQL, it makes sense to mention which version of the standard you mean.
It is better to talk of a "common core" shared by all (or at least all major) RDBMS products. Most definitely, there is such a thing. You can pretty much rely on SELECT, FROM and WHERE. Old-style joins (multiple tables in the FROM clause, with the join conditions in WHERE) are widely supported, as are table aliases. Most arithmetic operations are commonly supported too. Aggregates, HAVING and ORDER BY: yes. The major vendors also support IN and EXISTS. And since quite recently, you have LEFT JOIN and RIGHT JOIN almost everywhere (but not FULL JOIN). So when it comes to selecting data, you can run basic queries in an almost portable way. Anything else? Not much. No portable window functions, no portable CTEs, no portable way to limit a resultset.
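Take something as trivial as "give me the first 10 rows", and watch the dialects diverge:

-- MySQL, PostgreSQL, SQLite
SELECT * FROM customer ORDER BY id LIMIT 10;

-- SQL Server
SELECT TOP 10 * FROM customer ORDER BY id;

-- Oracle (before 12c)
SELECT * FROM (SELECT * FROM customer ORDER BY id) WHERE ROWNUM <= 10;

-- SQL:2008 standard syntax, supported by some engines
SELECT * FROM customer ORDER BY id FETCH FIRST 10 ROWS ONLY;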
What about DML, i.e. modifying data? Single-row INSERT, UPDATE and DELETE against a single table work everywhere. Inserting multiple records with a single statement is still not supported by every engine without modification. The same goes for inserting into multiple tables with one statement. And, of course, updating and deleting data based on more than one table is still far from portable.
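For instance, multi-row VALUES works in MySQL, PostgreSQL and SQL Server 2008 and later, while classic Oracle needs a workaround:

-- MySQL, PostgreSQL, SQL Server 2008+
INSERT INTO city (name) VALUES ('London'), ('Paris'), ('Berlin');

-- Oracle: one of the workarounds
INSERT INTO city (name)
SELECT 'London' FROM dual UNION ALL
SELECT 'Paris'  FROM dual UNION ALL
SELECT 'Berlin' FROM dual;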
Now, DDL, that is, data definition: things like CREATE TABLE. CREATE TABLE itself works. Datatypes? Integers: yes. Decimals: kinda, sorta, with some limitations. Even basic string types are not that portable (just remember VARCHAR2). Booleans? You must be joking. Compound types? Sorry. Autoincrements are simply different across all databases, with nothing even close to a standard. Referential integrity is pretty much standard, at least if you stick to basics like single-field primary keys. MySQL with MyISAM does not support it, though, so beware.
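A quick tour of autoincrement spellings makes the point:

-- MySQL
CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY);

-- SQL Server
CREATE TABLE t (id INT IDENTITY(1, 1) PRIMARY KEY);

-- PostgreSQL
CREATE TABLE t (id SERIAL PRIMARY KEY);

-- Oracle (before 12c): a sequence, usually paired with a trigger
CREATE SEQUENCE t_seq;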
Triggers (and any other procedural code) are just incompatible. Forget about them if you want anything portable.
It's safe to say that transaction support is portable too, though there are quirks, of course. MySQL with MyISAM comes first (it does not support transactions at all), but screw it, it's not even the default engine anymore. You also can't rely on a transaction being continuable after an error, and you can't nest transactions.
Last but not least is query optimization. Basic indexes are more or less well supported by all major databases, and even their syntax is close enough to be considered portable, though MySQL ignores the DESC clause in indexes, Oracle does not index NULL values, and SQL Server, unlike the others, treats NULL values as equal in unique indexes. Conditional indexes, statistics, indexed/materialized views and all that fancy stuff are proprietary, if supported at all. When running queries you have to rely entirely on the optimizer, as any means of tweaking it are not portable.
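Compare a basic index, which looks about the same everywhere, with a conditional one, which only some engines have at all:

-- close enough to portable
CREATE INDEX idx_customer_age ON customer (age);

-- PostgreSQL partial index / SQL Server filtered index: same idea,
-- coincidentally similar syntax, absent from many other engines
CREATE INDEX idx_open_orders ON orders (customer_id) WHERE status = 'open';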
Also, some optimization tricks work on some systems but not others. MySQL supports loose index scan (that is, fast DISTINCT); other systems don't. Hash joins and merge joins are supported by everyone but MySQL.
OK, OK, I could go on for hours about this stuff, so I'd rather keep it short: be ready for queries that run fast on one engine to be slow on another.
So where does this all leave us? We can run basic queries against basic datatypes, join tables (sometimes), and insert, update and delete records row by row, in a transaction. That's pretty much it.
It is possible to run "portable" SQL against major database systems, but it's limited to the most basic things. Any more or less complex processing of your data will have to be done on the application side.
As we learned earlier, this might or might not be a good thing: just be aware of it.
5. Should everything be normalized?
Database normalization, to put it simply, is designing a database to be foolproof against incorrect data input.
I won't go deep into this area right now (there is plenty of information about it online anyway), just outline it in a couple of words.
Basically, if you notice that a change you make to a database requires more than one operation, your database is not normalized.
Say, a person is promoted and you need to update their position in more than one table. Or you want to insert a record for a new accounting year and find that you need to update the closing date of the previous one as well. Or before updating a field you have to parse its previous contents.
If it's possible to break your database consistency by "forgetting" to do something, the database is not normalized (in some sense).
The benefits of a normalized database are obvious: your data might be outdated, wrong or just missing, but at least the figures match everywhere. If you store your employee's surname in one table and refer to her by id elsewhere, you may be sure that when she gets married and takes her husband's surname, it will show up in all reports right after you change it in a single place. And if your queries take the end of an accounting period from the beginning of the next one, you may be sure the old one closes automatically as soon as the new one is inserted into the table.
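A minimal sketch of the surname example (the schema is made up):

CREATE TABLE employee (
    id      INTEGER PRIMARY KEY,
    surname VARCHAR(100) NOT NULL
);

CREATE TABLE sale (
    id          INTEGER PRIMARY KEY,
    employee_id INTEGER NOT NULL REFERENCES employee (id),
    total       DECIMAL(10, 2) NOT NULL
);

-- the marriage is a single UPDATE; every report joining on
-- employee_id picks up the new surname automatically
UPDATE employee SET surname = 'Johnson' WHERE id = 42;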
The drawback of normalization is, of course, performance.
Imagine you run a convenience store which updates prices often. The staff does not always keep up with the recent updates, so the prices on the shelves, on the products and in the cash registers often don't match. You are tired of refunding customers and want to put an end to it.
So you decide to write each price in one place only. Since the law requires the price to be on the shelf anyway, you make the shelf that place; to learn "the" price, one has to walk to the shelf.
Stupid? Of course. You might save a dollar or two on not having to refund mislabeled purchases, but your clerks cannot even use barcode scanners anymore. They have to run like crazy all over the shop with your basket, writing down the prices over and over again. They can't even memorize the prices: what if a price changed a minute ago? The best they can do is look at the basket and plan the shortest route to cover all your purchases. Serving each customer takes enormous time, so you'll lose all your clientele in the end.
Believe it or not, that's exactly what normalized databases do for a living.
Computers are fast, so in most cases you don't even notice them skipping from sector to sector on a hard drive or from page to page in memory. But be sure they make lots of extra movements and spend lots of disk spins and CPU cycles visiting remote places, each storing a single instance of your precious data. And when there is lots of data, this becomes noticeable.
Denormalization lets you store copies of data where they are more readily available to the queries. Instead of joining two tables in a business-critical query you run often, you might consider storing the joined records together in a third table. This comes at a price, of course: you have to make sure the data stay consistent across the three tables. Still, it might be worth it, provided the benefits of a faster query justify the additional development costs.
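Continuing the made-up schema from above, the denormalized variant might look like this:

-- the surname is copied right into the sales data:
-- the report no longer needs a join...
CREATE TABLE sale_denormalized (
    id          INTEGER PRIMARY KEY,
    employee_id INTEGER NOT NULL,
    surname     VARCHAR(100) NOT NULL,  -- a copy of employee.surname
    total       DECIMAL(10, 2) NOT NULL
);

-- ...but now keeping the copies in sync is our job
UPDATE employee SET surname = 'Johnson' WHERE id = 42;
UPDATE sale_denormalized SET surname = 'Johnson' WHERE employee_id = 42;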
Note that most database systems do allow some data redundancy whose integrity is maintained on the system level.
Any decent database caches its recently accessed data in memory. Any updates to the data or access to the stored data go through the memory cache, thus allowing faster access.
Disk-stored data can be duplicated as well. For instance, an index on a field is in fact another table, which stores the field value and the pointer to the record in the original table. SQL Server supports indexed views, where results of filtered, joined and even grouped queries can be stored on disk.
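A sketch of what that looks like in SQL Server (assuming a dbo.sale table with non-nullable city and total columns; SCHEMABINDING and COUNT_BIG are requirements of indexed views):

CREATE VIEW dbo.city_totals
WITH SCHEMABINDING
AS
SELECT city, SUM(total) AS total, COUNT_BIG(*) AS cnt
FROM dbo.sale
GROUP BY city;
GO

-- this index materializes the view on disk; the engine
-- keeps it in sync with dbo.sale automatically
CREATE UNIQUE CLUSTERED INDEX ix_city_totals ON dbo.city_totals (city);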
All those solutions imply data redundancy, but, since the database automatically takes care of integrity between the data copies, it's not considered denormalization: you still update the table, the indexes and the cache with a single atomic operation. The indexed views can only be updated through updates to the underlying tables, so they do not involve denormalization either.
Still, indexes and indexed views are not always enough. Some systems don't implement them, and those which do impose serious limitations on what can be indexed.
So sometimes you just have to accept the extra maintenance effort and keep track of your data scattered all over the database, all for a greater purpose: faster data access.
That's, after all, what the databases are for.