Tuesday, 17 April 2012

Database Indexes - An Overview

Why create index? -
DBMS is all about storing the data, allowing for the easy access and perform transaction on the data. RDBMS insists on such storage to be in the structure of a Normalized Relation(s) that obey certain rules, popularly referred as Codd's Rules. Of course, as the volume of the data increases manifold on the top of increase in the number of users, the query execution time and also the amount of resources consumed in carrying out such data operations (search) increase exponentially. "Indexes" are another prime set of objects, created along with the tables in the RDBMS, to cut down on both, the execution time and the resources. Database indexes are analogous to the indexes or list of contents we normally visit in "large" books to help us locate a particular topic of interest. The alphabetical order on the titles along with the page numbers against them reflect in the database index in the form of ordered values from a column(s), usually called as index key, and a pointer to the exact address of the row in the database, respectively. Just like if you don't have index or table of contents in a book will require you to read the entire book to locate and read a topic; an absence of index on the "search column" of a table will require to read ALL the table data to find what the user is looking for, giving considerable rise in I/O.

Indexes don't serve any mandatory functionality in the database and hence are purely optional objects. Yet, there may be no small or big databases without them. However, use of indexes and that too of an "appropriate" type is extremely crucial issue as using (or not using or not using an appropriate type) involves a trade off between speedy retrieval of query results and slower DML operations. For most of those DML operations would require reorganization of the index and since time spent in such reorganization would be directly proportional to its size and volume, density and distribution of the transaction, the decision requires prudence on the part of developers and DBAs as well. There is no limit to the number of indexes you may create on table, however there may only ONE index you can create on a particular column or combination of columns (index key). The use of index is transparent to the application (and of course user) and does not require any change in the SQL, but it is required for an application developer to be well aware of the subject of indexes, their types, how they work and specifically their suitability to a pattern of access while framing SQL.

It is observed that the creation of indexes is reactive part of the tuning process unfortunately, though in small proportion this may be the approach, but by and large it should not be so. Creation of the index should be the proactive part of the tuning strategy to be considered during the development of the application itself and not as an afterthought.

Index Schemes -
Basically there are broadly two types of indexes, based on their organization characteristics, used in the commercial database systems like Oracle (this DBMS would be our reference for our further discussions) namely - B*Tree and Bitmap.
B*Tree Index Structure
  • B*Tree Indexes, considered as a more conventional type of indexes and default types in most database systems, including Oracle, are implemented in similar way as binary search tree. The goal is just not to minimize the time taken to search value(s) but also to equalize it. The search process to reach to page (block) containing your value is something analogous to searching a word in a dictionary(with some discipline/ rule). For example if you want to search a word "tiger" - you open the dictionary half way through (the discipline / rule) and say we land on a page of words that begin with "g", it means there are no chances for "tiger" to be in the left hand side pages (you eliminated 1/2 of the pages from your search). On the right hand side pages you open half way through again (remember the discipline / rule) and you land on some page of "s" words, so there is no chance of "tiger" being in left side 1/4 pages of the dictionary. You continue the process thus cutting each time half of the remaining pages either on left or right side of the landing page, you hit a small number of pages each having two words on the top, one on the left and other on right, that helps you guess if "tiger" could be on the page or not without fully reading it. The pages in this example which are scanned at the end are "leaf blocks" in our B*Tree indexes and the landing pages, where we decide to move right or left are called "nodes" in our indexes. The pictorial representation as above, would be like there are multiple levels of branches, each called as "Blevel".
  • Bitmap Indexes, were for the first time proposed in the year 1985 in a research paper entitled "Storage and Retrieval Considerations of Binary Databases", published by Professor Israel Spiegler and Rafi Maayan. The first commercial use was made in 1987 in a database product called Model 204 from Computer Corporation of America. Oracle implemented it in their version 7.3. The principle of bitmap index is simple - First, the index works very well for low cardinality i.e. for the column where the number of possible values are small (so they appear repeatedly in that column) as compared to the number of rows in the table. Second, the index contains, bit vectors equal in number as those possible values and have 1 bit placed in the jth position if the jth row contains the value being indexed. Each vector contains bits equal in number as the number of rows in the table so there is one-one positional correspondence between the bit and the row(in table). So it does not have to store row addresses (in Oracle "rowid").

    Example : If we are maintaining information about cars in an application developed for a regional transport registration office and say there are 100,000 cars so far registered (and many more to be registered in time to come) and one of the columns in the table is vehicle type, which may have five possible values - sedan, hutchback, SUV, MUV, and van. Then indexing "VehicleType" column is a perfect case for using "bitmap" index. There will be five bit vectors, one each for sedan, hutchback, SUV, MUV, and van. Each such vector will have 100,000 bits (equal in number as the number of registered vehicles and will increase as new vehicle register). For instance, vector for sedan could be 00010001000001000000..... Observe here that there is bit "1" in positions 4,8,14..., so the car type at the rows 4,8,14 ... are of sedan type. Another vector say for hutchback could be 0000010100000010000.... Observe here that there is bit "1" in positions 6,8,15..., so the car type at the rows 6,8,15... are of hutchback type. And so on. 
B*Tree v/s Bitmap -
B*Tree indexes in general are best suited for more selective queries (requiring to select very small % of the total rows from table) and particularly for high cardinality index keys (the column(s) to be indexed). Bitmap indexes are best suited for the low cardinality index keys. Bitmap indexes take very small space, because they store only bit vectors, for example in the above case each vector will store 100,000 bits i.e. equivalent to 12.5KB and the total size of index will be 12.5 * 5 = 62.5 KB. whereas the size of B*Tree index for same number of rows in the table will be at least 976 KB (that is minimum space required to store only rowids) and will further take space to store index key values (not much predictable at this stage). Bitmap indexes are considered very well for queries of ad-hoc type and particularly aggregation like count whereas B*Tree are considered suitable for selection of small portion of rows and based on range of values on index key column and may contain a fair mix of selection and DML. Queries like "select count(*) from cars where vehicletype = 'sedan';" may be answered from the bitmap index without even visiting table and same would be the case if we use some column "c" of a table "t" in the query by replacing cars with t and vehicletype with c and some appropriate value in place of sedan, where c is indexed as per B*Tree scheme. But here the matter is the volume of data scanned (hence loaded in memory) which will be much less in Bitmap index than in B*Tree case. Because OLTP and DWH databases involve more of aggregation/ad-hoc queries and close to zero transactions, the bitmap indexes are more suitable for most cases, whereas OLTP databases are observed to have more of B*Tree indexes, however this should not be construed to be a rule.

When to Index and What -
Indexes generally are known to enhance the query performance but only if proper attention is paid to when (access pattern), what (choice of index key) and which type (described above), or the indexes may prove detrimental to the performance. Though there are no concrete rules, but a few guidelines do provide the clue -
  • Index generally when you need to access data no more than 10 / 15 % from the table or some aggregation queries like count are frequently required to be answered (from the index alone).
  • Indexes should be generally avoided on small tables as full table scan may be much more economical in such case than scan index and then table.
  • Index those columns that may be frequently involved in joins, where clause, order by, group by, union or distinct operations.
  • The index key columns should be fairly static (normally not involved in update operations) or overheads in reorganization will be considerable.
  • In composite index, the driving column (first column in composite index key) should be more selective.
Apart from the above "general" two types of index schemes, we can see that sophisticated database management systems like Oracle provide you with some more flavors of B*Tree indexes. Click here for an elaborate description of more special indexes as listed below and their usage:
Also Visit : Career Articles

1 comment: