Tuesday 27 March 2012

Transaction Concurrency

Introduction -

Providing consistent read/write operations while maximising concurrent access is the key challenge in developing multi-user, database-driven applications. Database management systems implement different schemes of "locking" and "concurrency control" to achieve this. It is very important to understand that there are as many ways to implement locking as there are database management systems (from different vendors). The scope of this subject is extremely vast, and the mathematical models of those schemes are beyond the scope of not just one or two posts like this but would require an entire book to define; we must remember that here we are not creating a database management system but only understanding an ideal implementation. Again, what is "ideal" is really a big debatable question; we shall consider the implementation with reference to the most widely used commercial database management system, Oracle, wherever required.

Transaction -

A "transaction" is a logical unit of work consisting of one or more SQL statements and, more importantly, concerned with changes to the data or database. It is traditionally defined in terms of its 4 properties, described by the acronym ACID -
  • Atomicity - the transaction is the smallest unit of change: it either happens completely or does not happen at all, and no subset of it can be applied on its own with respect to the following (next) property.
  • Consistency - it takes the data from one consistent state to the next consistent state. A consistent state of the data is definable and justified in terms of the validity of the data with respect to certain obvious, natural or mundane rules.
  • Isolation - it ensures that the effects of one transaction are not visible to, and do not impact, other concurrent transactions, as if each transaction were alone being carried out against the entire database.
  • Durability - it assures that transactions, once committed, are saved permanently in the database and are not subject to loss under any condition (including cycles of failure and recovery).
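Atomicity is easy to see in code. Below is a minimal sketch using Python's built-in sqlite3 module (the `accounts` table and the transfer scenario are illustrative, not from any particular application): a failure in the middle of a two-step transfer rolls back the whole unit, so no partial change survives.

```python
import sqlite3

# Illustrative in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 100)")
conn.commit()

# A transfer is one atomic unit: both updates commit together or neither does.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    raise RuntimeError("failure mid-transaction")  # simulate a crash here
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the debit already made is undone as well

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # both accounts are still at 100
```

Without the rollback, account 1 would have been left at 50 with no matching credit anywhere, i.e. an inconsistent state.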

What is concurrency control? -

"Concurrency control" is the collection of functions that the database provides in order to allow many people to access and modify data simultaneously. Concurrency obviously increases the throughput of a database management system, but it brings along its own set of problems, as follows -
  • Dirty Read - A "dirty read" is said to have occurred when a transaction reads data "updated" by another transaction before that change is committed. This should not be allowed because the data just updated is in an intermediate phase of change and may yet be rolled back by the transaction that made the change; in case of such a reversal, the reading transaction has already seen data that is now incorrect.
  • Phantom Read - A "phantom read" causes new rows to appear in the result when the same query is run a second time, after a gap from when it was run the first time. It differs from "non-repeatable read" (explained below) in that the previously read data has not changed; rather, more rows now satisfy the same selection criteria than did for the first query.
  • Non-Repeatable Read - A "non-repeatable read" causes rows that were read in a previous operation to appear changed when queried again after a time gap. The rows might even disappear in the second read operation rather than merely appear changed.
  • Lost Update - The "lost update" problem causes a transaction to overwrite a pending change made earlier by another transaction, and hence make it disappear (hence "lost").
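The lost-update problem is the easiest of the four to demonstrate. Here is a minimal sketch in plain Python (a dictionary stands in for a database row; the account name is illustrative): two "sessions" read the same balance, each computes a new value from its own copy, and the later write silently wipes out the earlier one.

```python
# A single "row" shared by two sessions; the name "acct" is illustrative.
balance = {"acct": 100}

# Session A and session B both read the current value (100).
read_a = balance["acct"]
read_b = balance["acct"]

# Session A deposits 50 and writes back.
balance["acct"] = read_a + 50   # balance is now 150

# Session B deposits 20 based on its stale read and writes back.
balance["acct"] = read_b + 20   # balance is now 120 - A's deposit is lost

print(balance["acct"])  # 120, not the 170 both deposits should have produced
```

This is exactly why no standard isolation level permits lost updates: the final state reflects only one of the two transactions.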

What are Transaction Isolation Levels? -

The ANSI/ISO transaction standards use the term "isolation level" to indicate the extent to which a database permits or prevents the first 3 of the 4 concurrency phenomena (that is, possible outcomes of a concurrent pair of transactions) described above. "Lost update" is invariably NOT permitted in any of the standard isolation levels.
  • Read Uncommitted is the least restrictive of all the transaction isolation levels. It permits all three reads - dirty read, phantom read and non-repeatable read. A "dirty read" should never be allowed to happen, so this isolation level is rarely used in commercial database systems. It is, however, interesting to know that "Read Uncommitted" exists to provide a standards-based definition that caters for non-blocking reads. Commercial databases like Oracle do not need the "dirty read" to achieve non-blocking reads: Oracle is designed to take a detour around the lock (the changed data is locked until the transaction completes) and can reconstruct the original data from a place called "rollback segments" (where the pre-change version of the data is stored to facilitate rollback), so Oracle can make consistent data available to the user without waiting for the transaction to commit.
  • Read Committed allows a statement to see only the data that was committed prior to the statement's commencement. NO dirty reads are allowed, but non-repeatable reads and phantom reads are. "Read Committed" is the most widely used transaction isolation level in commercial databases. However, there is NO reason to perceive that all is well with "Read Committed": it still relies on non-blocking reads, as designed in Oracle (the detour around the lock, as said above), for transactions to produce acceptable output.
  • Repeatable Read guarantees read consistency, i.e. a transaction that reads the same data twice from a table at two different points in time will find the same values each time. It allows only the "phantom read"; you avoid both the "dirty read" and "non-repeatable read" problems at this level of isolation. Some databases other than Oracle achieve repeatable reads via row-level shared "read" locks, which prevent other sessions from modifying data that a user has read. This is in addition to the fact that, in those systems, writers of data will block readers of the data. This of course decreases concurrency. In Oracle, a repeatable read of specific rows may optionally be obtained using "SELECT ... FOR UPDATE".
  • Serializable is the most restrictive level of transaction isolation (NO dirty, non-repeatable or phantom reads are allowed) and obviously provides the highest degree of isolation. "Serializable" only means that each transaction executes as if it were the only transaction executing in the database at that point in time. It does NOT imply that there is some serial ordering of the transactions. Rather, it is based on the concept of a "schedule". A "schedule" is a sequence of operations from one or more transactions. If all the transactions execute serially, one after another, the schedule is also "serial". A schedule that is equivalent in its effect to a serial schedule, even though it may be derived from a set of concurrent transactions, is called a "serializable schedule". In short, a serializable schedule consists of a series of intermingled database operations drawn from several transactions, and the final outcome of a serializable schedule is a consistent database. Database systems like Oracle allow the "Serializable" isolation level to be used as an option.
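The "serializable schedule" idea above can be sketched with a toy example (plain Python, with a dictionary standing in for the database; the transactions and values are illustrative): an interleaved schedule is serializable if its final outcome matches some serial execution of the same transactions.

```python
# Two toy transactions over an illustrative two-row "database".
def t1(db):
    db["x"] = db["x"] * 2    # T1: double x

def t2(db):
    db["y"] = db["y"] + 10   # T2: add 10 to y

# Serial schedule: T1 runs to completion, then T2.
serial = {"x": 5, "y": 5}
t1(serial)
t2(serial)

# Interleaved schedule: T2's operation lands between T1's read and write.
inter = {"x": 5, "y": 5}
x = inter["x"]                 # T1 reads x
inter["y"] = inter["y"] + 10   # T2 runs in between
inter["x"] = x * 2             # T1 writes x

# Same final outcome as the serial schedule, so this interleaving is
# a serializable schedule even though the operations were intermingled.
print(serial == inter)  # True
```

Had T2 read and written `x` as well, the interleaving could produce a state no serial ordering can, and the schedule would not be serializable.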

Read Committed v/s Serializable level of isolation -

"Read committed" isolation provides transaction consistency on a per-statement basis, which means the non-repeatable read and phantom read problems are present at this level of isolation; ideally we would want transaction-level consistency. Readers will not block writers and vice-versa. So we can consider that "read committed" provides a good trade-off between data consistency and data concurrency, and is hence suitable for standard OLTP applications, which consist of high-volume, concurrent, short-lived transactions, with few transactions likely to conflict with each other (and close to NO repetition of queries), while providing better performance. The "serializable" isolation level is suited for databases where multiple mutually consistent queries need to be issued. The overall throughput is much less, this being a blocking isolation level. As such, the serializable mode of concurrency is more appropriate for databases with mostly read-only transactions that run for a long time (OLAP systems).

Read-Only transaction -

Oracle provides a variation of the serializable isolation level in the form of the "read-only transaction", which permits a transaction to only "see" data that was committed before the transaction started and does not allow any changes to be made once it has begun. The only difference between "read-only" and "serializable" is that the latter also allows changes to be made. Read-only transactions are intended to support reporting needs, where the contents of the report need to be consistent with respect to a single point in time (when generation of the report is started). Database systems that do not implement "read-only" transactions use the "Repeatable Read" isolation level instead, and suffer the associated effect of shared "read" locks, as said before.
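In Oracle the mode above is entered with the statement SET TRANSACTION READ ONLY. As a runnable analogue (not Oracle's mechanism), the sketch below uses SQLite's query_only pragma via Python's sqlite3 module to show the observable behaviour: reads succeed, writes are rejected. The table and values are illustrative.

```python
import sqlite3

# Illustrative in-memory database with one committed row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

# Switch the connection to read-only mode (SQLite's analogue of a
# read-only transaction; Oracle uses SET TRANSACTION READ ONLY instead).
conn.execute("PRAGMA query_only = ON")

# Reads still work and see the committed data...
value = conn.execute("SELECT v FROM t").fetchone()[0]

# ...but any attempt to change data is rejected.
try:
    conn.execute("INSERT INTO t VALUES (2)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True

print(value, write_rejected)  # 1 True
```

Note that the pragma is connection-wide in SQLite, whereas Oracle's read-only mode is scoped to a single transaction; the sketch only illustrates the "see but not change" contract.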

So to conclude, in this post we discussed transaction concurrency. Concurrency has to be implemented through different "locking" schemes. Click Here to continue reading on locking.

2 comments:

  1. It should be noted that, regardless of the isolation level, concurrently executing transactions can only guarantee the 'I' in ACID, that is Isolation, if proper optimistic transaction protocols are followed. Take READ COMMITTED isolation as example. User #1 reads a row in a transaction and modifies it but does not commit just yet. User #2 reads the row. If the row is not exclusively locked (concurrency abhors such locks and, as noted, Oracle and some other RDBMS's do not lock under these conditions) then user #2 will read the pre-update or committed version of the row which has stale data in it. He modifies some of the same columns as user #1 but to different values and some columns that User #1 did not touch. User #1 commits. User #2 commits. User #1's updates have been lost which violates 'I'solation since User #2's update was not isolated from User #1's update.

    Optimistic transaction (aka optimistic locking) protocols require that prior to committing changes each user lock the rows he has modified and reread the data to see if any other user has modified the row since he last read it. If the data has been modified by another user, the user must rollback his own transaction and begin again with the new version of the data.

    Databases such as Oracle that do not obey locks make this protocol harder to implement with the result that more users are likely to have to rollback transactions. Fortunately, Oracle's behavior, and indeed the Optimistic transaction protocol itself is predicated on the assumption that in many application realms having multiple users modifying the same data rows is vanishingly rare due to organizational realities. However, that does not mean that it does not happen, even in realms where it is rare (say insurance claim processing), and it must be noted that there are application realms where it is all too likely (banking transaction processing for example).

    Replies
    1. Thanks art for your interest in this post and sharing your views and experience which has certainly enriched this post. I am going to write on locking in the next post where I will be discussing in general about Pessimistic and optimistic locking. I hope you will provide your valuable comments on that post too. Please keep visiting.
