Revolving around the core of technology
Since most relational databases share same design concepts under their hood, this paper is not specific to any particular vendor. Although in our example we talk about five databases, these tips apply to a wider range of RDBMS.
Table schema for Customer
|
Indexes for Customer
|
Table schema for Orders
|
Indexes for Orders
|
Following table lists SQL commands for different database that is used to update statistics. For further details, refer to the reference manual for your particular RBDMS
Oracle: | ANALYZE command or DBMS_UTILITY package |
DB2: | RUNSTATS command |
MS SQL Server: | UPDATE STATISTICS |
Informix: | UPDATE STATISTICS |
Sybase ASE: | UPDATE STATISTICS |
Optimizers always tend to select the least expensive path \u2013 one that returns least number of rows in fastest time. Why do optimizers rely on statistics? Consider the following query that is run against our sample database to answer this question.
select * from customer where city = 'New York City' and phone = '212-555-1212' |
Notice that the above query contain two fields in the "WHERE" clause and there are two indexes defined, each containing one field. One very important notion to remember is that the optimizer can only use ONE index per table. Therefore, it has to make a decision as to which index to use. Since phone number should return least amount of rows, our query will run much faster if the optimizer always uses IdxPhone. However, if statistics are not updated, the optimizer does not know which index is better and may decide to choose IdxCity since 'city' field appears first in our WHERE clause. Once you update statistics the database will know more about the data distribution and will correctly choose the better index to run your query.
SQL optimizer heavily depends on indexes defined for a particular table. Indexes are double-edged sword: no index will degrade performance of your SELECT statements and too many indexes will slow down your DML (INSERT, UPDATE, and DELETE) queries. Therefore, it is important to have a right balance of index on tables. Besides the number of indexes, fields that are involved and their order is also very important.
When creating indexes, estimate the number of unique values the column(s) will have for a particular field. For example, the idxCity index in our sample database is not a good candidate for an index. When you wish to search for customers in New York City, it can potentially return thousands of rows, which are then searched sequentially. Such indexes seldom help in speeding up SELECT queries and reduce the response time for DML queries.
Composite index - Indexes containing more than one field are called composite index. Such indexes should be created if you expect to run queries that will have multiple fields in the WHERE clause and all fields combined will give significantly less rows than the first field alone.
For example, in our sample database the index "IdxCustDate" is a composite index. This index is only useful if the ratio between customers and the number of their orders is high \u2013 meaning an average customer places more than 1000 orders. Creating a composite index on the Orders table when most customers have placed only a handful of order wastes not only spaces on the hard disk but have negative impact on DML queries.
Clustered index - A clustered index determines the physical order of data in a table - meaning the actual data is sorted according to the fields in the index. This is similar to a telephone directory, which arranges data by last name. There can be only one clustered index per table. These indexes are particularly efficient on columns that are often searched for range of values.
Although most databases support such index, they use a different terminology. For example Oracle calls it Index-Organized Table (IOT) whereas DB2, Informix, MS SQL Server and Sybase all call it clustered index.
select * from Customer where YEAR(AccountCreatedOn) == 2005 and MONTH(AccountCreatedOn) = 6 |
Note that even though AccountCreatedOn has an index, the above query changes the where clause such a way that this index cannot be used anymore.
Rewriting the query in the following way will increase the performance tremendously.
Select * From Customer Where AccountCreatedOn between '6/1/2005' and '6/30/2005' |
When an index is created, the data for indexed columns is stored on the disk. When new rows of data are inserted into the table or the values in the indexed columns are changed, the database may have to reorganize the storage of the data to make room for the new rows. This reorganization can take additional toll on DML queries. However, if you expect new rows on a regular basis in any table, you can specify the expected growth for an index. The terminology used for this expected growth is different in every database. The following table lists the terms used by different RDBMS for expected growth.
Oracle: | PCTFREE - Percent Free |
DB2: | PCTFREE - Percent Free |
MS SQL Server: | FILL FACTOR |
Informix: | FILL FACTOR |
Sybase ASE: | FILL FACTOR |
SELECT * FROM customer WITH ( Index(IdxPhone)) WHERE city = 'New York City' and phone = '212-555-1212' |
Notice the additional "WITH" clause after FROM. This example is specific to MS SQL Server. Every database use different syntax for specifying this value and they are quite different from each other. Refer to your RDBMS documentation for details.
Oracle: | EXPLAIN PLAN FOR >Your query< |
DB2: | EXPLAIN PLAN SET queryno = xxx for >Your query< |
MS SQL Server: | Set SHOWPLAN_ALL ON >Your query< |
Informix: | SET EXPLAIN |
Sybase ASE: | Set SHOWPLAN_ALL ON >Your query< |
You can also use third party tools, such as WinSQL Professional from Synametrics Technologies to run EXPLAIN commands against databases.
Hard disk I/O is among the slowest resource on a computer, which becomes apparent as the size of your database increase. Many databases allow users to split there database onto multiple physical hard drives. In fact, some even go a step further and allow splitting the contents of a table on multiple disks. When you use multiple physical disks, I/O operations speed up significantly since more heads fetch data in parallel.
The less data retrieved, the faster the query will run. Rather than filtering on the client, push as much filtering as possible on the server-end. This will result in less data being sent on the wire and you will see results much faster. Eliminate any obvious or computed columns. Consider the following example.
Select
FirstName,
LastName, City Where City = 'New York City' |
In the above example, you can easily eliminate the "City" column, which will always be "New York City". Although this may not seem to have a large effect, it can add up to a significant value for large result sets.
Consider dropping the indexes on a table before loading a large batch of data. This makes the insert statement run faster. Once the inserts are completed, you can recreate the index again.
If you are inserting thousands of rows in an online system, use a temporary table to load data. Ensure that this temporary table does not have any index. Since moving data from one table to another is much faster than loading from an external source, you can now drop indexes on your primary table, move data from temporary to final table, and finally recreate the indexes.