Saturday 26 May 2012

Procedure vs. Function

A back-end (database) programming language like PL/SQL in Oracle exists to push the processing of data to the server end, where the data actually resides. This reduces the network traffic that would otherwise be needed to ship the data from the database to the front end if, in the absence of a back-end language, the processing were done in the application. The two main types of programs used in the database to handle data on calls from the front end are Procedures and Functions. They are a source of confusion when choosing between them to solve a problem, because in most cases either can be written with equal ease, so the question to address is appropriateness rather than correctness. Beginners are also often unable to answer interview questions about the differences and similarities between the two confidently and correctly. Hence this post. Here we shall take our cases from PL/SQL; however, other database systems have their own back-end languages sharing similar concepts and syntax, so this should help their users too.

About the PL/SQL Procedure -

A PL/SQL procedure, like procedures in other languages, is a workhorse: it is supposed to perform some "action" or "operation" (query / DML). There may be an outcome of this action, in the form of one or more values, which may need to be sent back to the calling program; the procedure does this by assigning those values to OUT or IN OUT mode parameters passed to it by the caller. However, there is no compulsion on a procedure to do so, i.e. a procedure may not need to send any value to the calling program (no outcome of the action). Technically, therefore, a procedure does not "return" a value. The return statement in a procedure is never associated with an expression (no return <x>;, only return;) and may optionally be used to end execution and hand control back to the caller. Because a procedure "sends" values to its caller in this different way (without simultaneously ending its execution), it cannot be used in SQL statements, nor can it appear in an expression. For example, we cannot write select p(arg) from <table>; nor x := p(arg);. A procedure can only be used as an executable statement in itself, for example p(arg);. A small sketch follows.
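Here is a minimal sketch of such a procedure, assuming the classic emp table (empno, sal columns); the names raise_sal, p_empno, p_pct and p_newsal are made up for this example -

create or replace procedure raise_sal (p_empno in  emp.empno%type,
                                       p_pct   in  number,
                                       p_newsal out number) is
begin
      -- the "action": a DML statement performed on the data
      update emp set sal = sal * (1 + p_pct / 100) where empno = p_empno;
      -- the outcome is sent back through the OUT parameter, not through RETURN
      select sal into p_newsal from emp where empno = p_empno;
end;
/

-- it can only be called as an executable statement, e.g. from an anonymous block
declare
      v_sal number;
begin
      raise_sal(7369, 10, v_sal);
end;
/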

About the PL/SQL Function -

A PL/SQL function is supposed to perform only "computation" on the values supplied through its arguments, and no "action" or "operation" (query / DML). The result of the computation must be sent back through a "return" statement. The return statement must be associated with an expression, which may be a hard-coded value, a variable holding a value, or an arithmetic / logical expression. That means execution of the function ends simultaneously with returning the value. This is what makes a function callable in SQL statements and usable as a term in an expression, but never as a standalone executable statement. Since execution must end with the return of the value, a function can return only one value. A minimal sketch follows.
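A minimal sketch, assuming nothing beyond the built-in number type; the names net_sal, p_sal and p_tax_pct are made up for this example -

create or replace function net_sal (p_sal number, p_tax_pct number)
return number is
begin
      -- pure computation; the result goes back through RETURN
      return p_sal - (p_sal * p_tax_pct / 100);
end;
/

-- usable as a term in an expression ...
--    x := net_sal(5000, 10);
-- ... and directly inside SQL (assuming an emp table with ename and sal columns)
--    select ename, net_sal(sal, 10) from emp;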


Why should there be any confusion in choosing? -

As described above, procedures and functions have quite distinct characteristics, so why this discussion, "Procedure vs. Function"? The reason is that a function may contain everything a procedure may contain. A function may include a query or a DML statement; that is not illegal, but such statements are not supposed to be used in a function. On the other hand, a procedure may be created for a problem that could be solved with a function, since the "action" part of a procedure is optional. Nor is there any rule that a function may not have OUT or IN OUT mode parameters, or that it cannot assign values to them (though that is not the appropriate way for a function to return a value). For example, the following two programs are technically legal but not appropriate -


create function delrows (dno emp.deptno%type)
return number is
      x number;
begin
      delete from emp where deptno = dno;
      x := sql%rowcount;
      return x;
end;


OR


create procedure addnum(n1 number,n2 number,n out number)
is
begin
      n := n1 + n2;
end;

In the above case "delrows" should have been developed as procedure and "addnum" as a function. But even as of now, they are fully legal and they work in their own ambit. The "delrows" will indeed return the number of rows deleted and addnum will return the sum of the two numbers. Then, where the thing has gone wrong -

Function Purity - 

You can see from the foregoing explanation of the procedure and the function that the function is the more versatile entity: ideally it should be callable from SQL as well as from the PL/SQL (programming) environment, whereas a procedure may only be called in the PL/SQL environment, as an executable statement in itself. A function like "delrows", however, cannot be called from SQL; it may only be called in the PL/SQL environment, in an expression, to return the number of rows deleted. So the function, though syntactically correct, now has a restricted use. The versatility of a function actually comes from following certain rules defined in terms of its purity levels, which are WNDS, RNDS, WNPS and RNPS. They respectively mean -
  • WNDS (Writes No Database State) - the function contains no DML statement(s).
  • RNDS (Reads No Database State) - the function does not read data from table(s).
  • WNPS (Writes No Package State) - the function does not assign to a package variable.
  • RNPS (Reads No Package State) - the function does not read a package variable.
So when all the above conditions are met, the function is truly pure and versatile with respect to its calling environment. In the present case, "delrows" violates the WNDS purity level because of the delete statement in it, and hence cannot be used in SQL.

On the other hand, adding two numbers is a simple computational activity, but written as a procedure it cannot be conveniently used in a SQL statement - say by passing the salary and commission columns of the employee table to calculate the total earnings of employees. A sketch of the function form follows.
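As a sketch (assuming the classic emp table with ename, sal and comm columns), rewriting addnum as a function - replacing, not coexisting with, the procedure of the same name - makes it directly usable in SQL:

create or replace function addnum (n1 number, n2 number)
return number is
begin
      return n1 + n2;
end;
/

select ename, addnum(sal, nvl(comm, 0)) as total_earning from emp;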

Guidelines to make a choice -
The following considerations may be made while choosing between procedure or function for solving a problem -
  • If the program is supposed to generate multiple values, all of which are required to be sent to the calling program, then use a procedure.
  • If the program is supposed to perform a query and/or DML (cursor), then irrespective of the number of values required to be returned to the calling program, use a procedure.
  • If the program is not supposed to use a query and/or DML and must return a single value, then use a function.
  • However, if the purity of a function can be forgone (accepting that it will then only be usable in a restricted environment) and doing so brings some convenience, then a function may still be written in such special cases - which of course are rare (truthfully, never faced by me).

Monday 21 May 2012

Delete vs. Truncate

The Delete and Truncate SQL verbs (commands) are frequently used to empty tables. The actual difference between the two is not very clear to most users - not only beginners but experienced users too. There also seem to be people with very little idea of when to use which of these two commands. This post is an effort to clarify the same. The bulleted lists of properties of these commands should make the differences easy to see, and a short demonstration follows the property lists. We also visit the situations in which it is appropriate to use "delete" or "truncate" -

Delete -
  • This is a DML category of command.
  • Being a DML category command, it stores rollback/undo data (the original version of the data that was changed; in the case of delete, the entire set of rows deleted).
  • Being a DML category command, the operation can be undone.
  • Being a DML category command, it requires an explicit "commit" command to be fired post operation if the deletion is to be confirmed.
  • Delete command does not reset the high water mark (HWM) i.e. the level to which once the rows had reached in the table.
  • Delete command may be used with a "where" clause. Only when the predicate is omitted will it empty the table, and even in that case no space is freed up for new object(s).
Truncate -
  • This is a DDL category command.
  • Being a DDL category command, it stores no rollback / undo data.
  • Being a DDL category command, the operation can NOT be undone by using a "rollback" command.
  • Being a DDL category, it DOES NOT require an explicit "commit" command.
  • Truncate command resets the high water mark to the zero level, preserving only the initial extent allocated to the table for the table's existence; all extents above the initial extent are reclaimed as free space.
  • Truncate command does not have a "where" clause, which means it always empties the table while returning the allocated space (except the initial extent) for new objects.
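A short demonstration of the two commands, assuming a scratch copy of the emp table that you can afford to empty -

-- DELETE: DML, generates undo, needs an explicit commit, can be rolled back
delete from emp where deptno = 30;    -- predicate allowed
rollback;                             -- the rows are back

delete from emp;                      -- no predicate: empties the table
commit;                               -- explicit commit required

-- TRUNCATE: DDL, no undo, implicit commit, resets the HWM
truncate table emp;                   -- no WHERE clause possible, cannot be rolled back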
Is "Truncate" same as "Delete + commit" ? -
Many database users normally perceive truncate as equivalent to deleting all rows plus a commit, rolled into one - and nothing more. Technically, however, this is a big NO. As is apparent from the properties described above, the main difference is the reclamation of space and the resetting of the high water mark. Many also advise using truncate over delete + commit for emptying a table because truncate is "more efficient". Actually there are some considerations to be made before deciding on "truncate" or "delete" for emptying a table -
  • Delete will require huge space (of the same order as the table) to store rollback/undo data if the table is huge, so a truncate may be more appropriate - but only if the data is definitely to be removed and there is no possibility of a change of mind requiring a rollback.
  • A moderate-sized table that is to be reloaded with approximately the same amount of data (up to the same HWM) may favour the delete command: truncate would release the extents above the initial extent, and the de-allocation and subsequent re-allocation of that space may cost more than extending the rollback segments where the undo data is stored.
  • A truncate command may be recommended over delete if the volume of data to be reloaded is much smaller than the present volume, since the space above the HWM of the present data will have been reclaimed by the truncate operation.
  • The delete command may be recommended where data is temporarily replaced just for running "select" queries on the intermediate data, so that a simple rollback restores the original state of the data.
  • If no bulk load is planned after emptying the table, truncate will be the more appropriate choice.
The above scenarios are not an exhaustive list, but knowing the properties of the delete and truncate operations, a prudent consideration of the differences between the two should guide the choice.
 
Other suggested reading - Materialized Views, Use of External Tables, DDBMS

Monday 14 May 2012

Materialized Views

What are Materialized Views? -

Let us start with an example. A company has a database with a table "sales" holding millions of rows pertaining to the orders the company has received from its customers over a couple of years. The company managers periodically want to know the breakdown of region-wise total sales, the average value of region-wise orders, the breakdown of customer-wise total sales, and so on, for which you would write queries such as -

select region, sum(OrderValue) from sales group by region;
select region, avg(OrderValue) from sales group by region;
select CustName, sum(OrderValue) from sales natural join customer group by CustName;

Each time such a query is fired, the millions of rows of data must be read and loaded into memory, the aggregations / calculations performed, and only then are the results returned to the user.

Now answer this simple question to yourself - why did you use those log tables, or trigonometric tables, in school when you knew the method and formulas for calculating logs or trigonometric ratios? You used them as a "ready reckoner", right? To save time!!

Materialized Views are exactly such "ready reckoners" in the database: created with such queries, they capture and store the summaries, ready to be referenced quickly when such aggregating queries are fired. And because Materialized Views store only the summaries, the number of records in them is small. For example, if the territory over which the company operates has 10 regions, the first and second of the example queries above will yield only 10 rows each. So not only are far fewer I/Os and far less scanning effort required to satisfy such queries, there is also a much smaller memory requirement for caching the data, conserving server resources. Materialized Views are a data warehousing / decision support tool that can decrease, by orders of magnitude, the response time of queries that compute aggregations, complex joins and other operations over hundreds of thousands to millions of records - and do so transparently. Because their use is transparent, SQL queries are still fired at the tables in the same form and syntax, with no need for users to change them. A sketch of how such a summary might be created follows.
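A sketch of how such a ready reckoner might be created in Oracle, using the sales table from the example; the name mv_region_sales is made up, and the build / refresh / query rewrite settings are assumptions to be chosen per environment -

create materialized view mv_region_sales
      build immediate                 -- populate the summary now
      refresh complete on demand      -- refresh strategy (assumption)
      enable query rewrite            -- let the optimizer redirect matching queries here
as
select region, sum(OrderValue) as total_sales
from sales
group by region;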


Another purpose for creating materialized view(s) may be to provide a local copy of the data (replication) in order to reduce network load, to enable data subsetting (based on rows and/or columns) for local sites, or to enable disconnected computing in a distributed database environment.

View vs. Materialized View - 

A typical view is a stored query written on table(s) which is implicitly invoked each time the view is referenced in a SQL query. It is only a definition and is stored as such; there is no data storage, which is why a view is called a virtual table. A view is created to provide an abstraction layer, hide the source table(s), implement a constraint, or give different users a predefined visibility of the data. Views always provide fresh data, but incur the overhead of executing the query (stored in the view) on every reference.


A Materialized View stores data, in most cases precomputed summaries. You normally do not have to refer to the materialized view in a query; the queries are still fired at the tables, but if they involve the same expressions on which materialized views exist for those table(s), the queries are internally redirected to the materialized views to satisfy them. Materialized Views may provide stale data (if they are non-updatable or refreshed infrequently), and since they do not execute a query on every reference they incur no execution overhead, but unlike plain views they must use space on disk to store the precomputed summaries.


Materialized view updatability -
Materialized Views as such are an Oracle feature, while other DBMSs implement equivalent functionality through slightly different concepts and arrangements - for example, SQL Server uses "Indexed Views" and MySQL can use "Flexviews". The forthcoming discussion takes Oracle as the model database.

Materialized views come in 3 possibilities with respect to their updatability at the remote site :
  • Read-Only materialized Views
  • Updatable materialized Views
  • Writable materialized views
A Read-only materialized view, as the name suggests, does not allow updates to be performed at the remote materialized view site; it is simply created by omitting the "FOR UPDATE" clause from the definition syntax.
An Updatable materialized view allows changes to be made at the remote site. The materialized view must belong to a "materialized view group" to push those changes back to the master site during refresh. Such views use the "FOR UPDATE" clause in their definition syntax.
A Writable materialized view is the same as an updatable one, but since it is not made part of a "materialized view group", it cannot push the changes back to the master site; the changes made in the materialized view are simply lost during refresh. Sketches of the first two follow.
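As a sketch (assuming a master table customer at the master site reachable over a database link named hq_link; the replication group setup is not shown) -

-- read-only copy: the FOR UPDATE clause is omitted
create materialized view customer_ro
as select * from customer@hq_link;

-- updatable copy: the FOR UPDATE clause is present; to actually push changes
-- back it must also be added to a materialized view group
create materialized view customer_upd
      for update
as select * from customer@hq_link;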

Other topics of interest - A short challenging quiz, Evolution of DB systems

Thursday 10 May 2012

Openbook Quiz - 2

Challenge yourself with the following questions. Each question also has links to the text where the possible answer may be found. Please post your answers in the comments. Also please share / circulate this quiz to your contacts over social networks, LinkedIn groups, your own blogs etc. to help everybody get an opportunity to gather knowledge.

Question 1 :
The requirement is to provide a form where the user will enter a number that uniquely identifies an application the user has submitted to an organization. The application has to pass through various departments, each of which assigns it a status as soon as it has processed it. The application form id that the user enters is the primary key of the table in which the status is updated. Which of the following indexes will be suitable to process the query quickly and efficiently? (You may click on the options to find the properties of each of those indexes.)

 (a) Cluster Index  (b) Descending Index  (c) Bitmap Index  (d) Index Organized Table (IOT)

Question 2 :
A client wants to store multiple addresses for each customer in the database and, most of the time, wants to list all the addresses of a customer at once with a query that supplies the unique customer id. Which of the following table types will be most suitable for this requirement -

(a) Heap Organized Table  (b) Clustered Tables  (c) Nested Table  (d) Object Table

If this interests you, then you may visit a previous quiz if you missed it. More quizzes on Data Integrity will come shortly, so prepare yourself.

Wednesday 9 May 2012

Data Integrity Using Views

Data Integrity is implemented in databases using various constraints and triggers. Whereas integrity constraints are a non-procedural, declarative way of implementing data validation rules (refer to Codd's rule #10), triggers offer a procedural method for implementing validation rules where constraints show their limitations. There is, however, a third method for ensuring the validity of data: creating "views". Our discussion takes Oracle as the model database.

What are Views? -

Views are created over tables to provide an abstraction layer, to mask the actual source of data, and to provide limited access or visibility to the data contained in the tables. Views simplify work by allowing a complex query to be stored as a view in the database (hence an abstraction layer), apart from implementing a security feature (limited access and/or visibility).
A view is technically a query (select statement) stored in the database, in parsed and compiled form, under a name, and usable as a virtual source of data. Views are referenced in SQL statements in place of tables; at each such reference the select statement associated with the view implicitly executes against the underlying tables to present the data for the operation or for viewing. As such, views DO NOT store any data; they exist only as a definition.

The syntax for creating the plain view (without constraint implementation) is simple -

create [or replace] view <ViewName> as <Select Statement>;

The "or replace" clause is optional (indicated by presence of [], a universal standard to indicate optional part of the syntax), when used, it will replace an existing definition of the view with the same name. The "ViewName" may be any suitable name of user's choice, but of course, must follow all the naming conventions imposed by the DBMS. The "Select Statement" may be any select statement, inclusive of where, having, group by, order by, distinct, sub-queries, joins or relational algebra expressions like Union, Intersect or Minus.

Data Integrity Implementation Example -

Consider a company that has managers, each in charge of multiple departments: manager M1 is in charge of departments (10, 20, 30), manager M2 of departments (40, 50, 60), and so on. A manager may only make appointments and transfers of employees for the departments under his/her control. How would you implement this rule using integrity constraints or triggers? This is where views become convenient, through the clause "with check option". This clause restricts data manipulation to rows that satisfy the "where clause" of the view, i.e. no data manipulation through the view is allowed that would put the row(s) beyond the selection of the view. For example, to implement the above rule we create two views -

create or replace view empm1 as select empid, ename, job, sal, deptno from employee
where deptno in (10, 20, 30) with check option;

create or replace view empm2 as select empid, ename, job, sal, deptno from employee
where deptno in (40, 50, 60) with check option;

Some more such views may be created for other managers for the sets of departments under their control. Each manager is granted access only to the view created for his/her set of departments, and neither to any other view nor to the table itself. So the views "empm1" and "empm2" will be available to managers "m1" and "m2" respectively. A sketch of the grants follows.
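A sketch of the grants, assuming m1 and m2 already exist as database users -

grant select, insert, update, delete on empm1 to m1;
grant select, insert, update, delete on empm2 to m2;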

The following statements will be valid for manager m1 -


insert into empm1 values (101, 'peter', 'clerk', 900, 20);
update empm1 set deptno = 10 where empid = 101;

In these, the insert statement inserts the employee's row into department number 20, and the update statement transfers that employee from department 20 to department 10 - both well within the scope of the view.

However, the following statements if fired by manager m1 will result into an error -


insert into empm1 values (102, 'robert', 'analyst', 2500, 60);
update empm1 set deptno = 60 where empid = 101;

since in these, the insert statement is trying to insert the employee's row into department 60, which is not within the scope of the view allocated to manager m1. Similarly, the update statement will fail for the same reason, as after the update the row would move outside the scope of the view "empm1".


This was not a complete tutorial on views; we restricted ourselves to seeing the use of views to implement data validations, restrictions or business rules which may not be implementable through integrity constraints or triggers. Oracle actually also allows primary key, unique, RELY and read-only constraints to be declared on views.


If you have liked this, Click Here for "concurrency & locking" (one of the topics most liked by other readers).

Monday 7 May 2012

Data Integrity - Triggers

Please Click Here to continue from the background.

Integrity constraints are the most convenient way to implement data validation rules, as they are purely declarative and involve no programming. They also check the existing data in the columns for validity - for example when they are created after the tables have been populated, or when for some reason they are disabled and re-enabled - so they leave no chance that a column may contain invalid data. Hence, constraints are considered the most reliable way of implementing data integrity.

However, they are not without limitations. For example, constraints cannot implement conditional validation, nor do they have the capability to perform minor programmatic transformations on the data to make it conform to the rule of the constraint. They can only fail the statement / transaction if the data does not conform to the rule. What if you want the names of employees, departments and locations always to be in upper case? What if you want the salary of employees not to cross 5000, but only for those working in department number 30? What if no department in your company should have more than 100 employees? The list of such requirements may be long, and for them integrity constraints offer no solution. Such restrictions on data (columns) may, however, be implemented using triggers -

What are Triggers? -
Triggers are programs, written in a back-end programming language and residing in the database, which execute automatically in response to an event in the database.

Triggers come in different types, primarily according to their events. Triggers which fire (read: execute) on system-related events are called system triggers, while those which fire on DML events on tables are called DML triggers. Some databases allow triggers to be written to fire on DML events on views; these are called instead-of triggers. While it is not possible to provide an entire lesson on triggers in this small post, we shall discuss DML triggers briefly, as they are the type most used and the one relevant to our present topic. Our discussion refers to Oracle as the model database, though minor differences may emerge in other databases. Readers are requested to post comments freely about those differences here, to enrich the reading experience of the visitors.
DML Triggers - A Primer -
A trigger (read: DML trigger, in all forthcoming references) may be created to fire at "row level" or at "statement level" (also called "table level"). The former, identified by the presence of the "for each row" clause, executes once for every row affected by the DML statement for which it fires; the latter fires only once, irrespective of the number of rows affected by its conjugating DML statement. Row level triggers are useful when the triggering action has to refer to or use row level values, since such a reference is only possible in a row level trigger (it fires for each row). Oracle provides :NEW and :OLD to refer to the new and old state of the record being affected by the DML in a row level trigger; statement level triggers cannot refer to :NEW and :OLD.

Either kind may be timed to fire "before" or "after". A "before"-timed trigger performs its action before the conjugating DML statement performs its operation, which automatically defines what an "after"-timed trigger means. A "before" trigger may be used where some transformation of the data in transition is required, or where the operation must be aborted by aborting the trigger, whereas an "after" trigger suits situations where the trigger action depends upon, or refers to, the outcome of the DML statement. Oracle allows a trigger to name columns for the update event, so that the trigger is saved from firing unnecessarily when only other columns are being updated. A trigger may optionally have a condition, specified under a "when" clause, for which alone it should fire; this clause can be used only in row level triggers, as the restricting condition refers to the new and old versions of the row(s).

Oracle also allows a trigger to be combined for the insert, update and delete events, or any combination thereof, provided the separately written triggers would have had the same level (row or statement) and the same timing (before or after). The code specific to each event can be made to execute by using boolean built-in variables, which Oracle calls "predicates", viz. inserting, updating and deleting. A skeletal example follows.
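The skeleton below (assuming an employee table with a salary column; the trigger name and messages are made up) shows a row level, before-timed trigger combined for all three events, with a column list on update, a restricting when clause and the event predicates -

create or replace trigger emp_audit_trig
before insert or update of salary or delete on employee
for each row
when (new.salary > 0 or old.salary > 0)
begin
      if inserting then
            dbms_output.put_line('inserting: new salary = ' || :new.salary);
      elsif updating then
            dbms_output.put_line('updating: ' || :old.salary || ' -> ' || :new.salary);
      elsif deleting then
            dbms_output.put_line('deleting: old salary = ' || :old.salary);
      end if;
end;
/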

That was a rather densely packed primer on triggers and by no means a substitute for a complete explanation, but it provides enough of a base that the forthcoming discussion will not feel alien. A dedicated post on triggers may come in the near future; meanwhile you may refer to this link.


Trigger Examples - Implementation of Data Integrity -
Example 1 - In the first example we implement a rule that the data in the "dname" and "location" columns of the "department" table should always be in upper case. We could simply put a check constraint in place, but the problem would be that it fails the insert or update statement whenever the data is not provided in upper case, consuming time and resources in the rejection and forcing users to re-execute the operation. With a trigger instead, we implicitly convert the case of the data provided by the user, so the trigger ensures that the data which makes it to the columns is always upper case, no matter in what case it was provided. So the example -


create or replace trigger upper_case_dept_trig
before insert or update of dname, location on department
for each row
    begin
          :new.dname := upper(:new.dname);
          :new.location := upper(:new.location);
   end;
/

Explanation : The first two lines form the triggering statement - they specify the timing of the trigger ("before" in this case), the events, which are "insert" and "update" (with a column list; in the absence of such a list it would fire for updates of any column(s)), and the table name, "department". The "for each row" clause makes this a row level trigger: since we need to manipulate the data provided by the user, which is only reachable through :NEW and must be changed before it is written, we use a "before"-timed row level trigger. The body between begin and end is the trigger action - in this case converting the user-provided data to upper case (upper is a built-in function) and storing it back in the same :NEW variables, ensuring that what reaches the columns is always upper case.


Example 2 - In the second example we ensure that the salary of an employee in department number 30 may not exceed the limit of 5000. Remember that if an employee already has a salary greater than 5000 in department 30 before the trigger is written, or some user makes it so while the trigger is disabled, the trigger CANNOT detect this, unlike an integrity constraint. However, this rule cannot easily be applied using an integrity constraint. So the example -


create or replace trigger limitsal
before insert or update of deptno, salary on employee
for each row
when (
             (new.deptno = 30 and new.salary > 5000) or
             (new.deptno = 30 and old.salary > 5000) or 
             (old.deptno = 30 and new.salary > 5000)
           )

    begin
           raise_application_error (-20000, 'salary limit crossed - operation disallowed');
   end;
/

Explanation : While the other parts are the same as explained in the example above, the "when" clause here is the restricting condition. Observe that the when clause is part of the trigger header and its conditions are actually stored in the data dictionary; the parse phase of the insert or update statement decides whether or not to fire this trigger (if the conditions are satisfied). The trigger action in this case calls a built-in procedure to fail the trigger and hence the conjugating DML. The error numbers allowed here are -20000 through -20999; the second argument is the error message to be passed on to the user.

Conclusion -
So here we had two very simple examples of triggers implementing constraints which would not have been possible with integrity constraints. The most important limitation of triggers, however, is that they do not check the data already residing in the tables. Such data may have to be validated separately if the trigger was disabled for some time or was created after the table was populated.


Data integrity and DML restrictions, such which may not be possible with integrity constraints or triggers, may be implemented through views. Click Here to continue reading about the views.

Wednesday 2 May 2012

Integrity Constraints - Creation

Please Click Here for the explanation and background on the Data Integrity Constraints.

Note : The set of commands / statements used here and the explanations thereof are with respect to Oracle as a model database. Readers may experience minor differences in the other database systems.

Constraint creation with the table -
Constraints may be created along with the table. For example -

create table employee (
      empid    number(5) constraint pk_employee primary key,
      ename    varchar2(15),
      job      varchar2(15),
      salary   number(5) not null,
      cardno   number(5),
      deptno   number(3) constraint fk_employee references department (deptno),
      constraint chk_employee check (salary between 1500 and 12000),
      constraint uk_employee unique (cardno)
);

Explanation : The constraint definitions attached directly to a column (pk_employee on empid, fk_employee on deptno, not null on salary) are called column level definitions, and those listed after all the columns (chk_employee, uk_employee) are called table level definitions. The word "level" here should not be construed as a hierarchy; it only denotes the place where the definition is written. Constraints ultimately apply to columns, and their behaviour is the same irrespective of the level (read: place) of definition. Only one constraint definition may be integrated with a column (though a column may have multiple constraints), whereas there is no limit on the number of constraint definitions that can be included after all the columns have been defined (table level). So if a column is supposed to have multiple constraints, one definition may be integrated with the column and the remaining ones placed at table level. Table level definitions must be used when a constraint refers to multiple columns, for example a composite primary key (a primary key consisting of multiple columns). Some developers prefer to define all constraints at table level, just to keep the table definition neat and clean; remember, however, that a table level definition is not possible for a NOT NULL constraint. A constraint definition may contain an optional constraint clause, for example "constraint pk_employee", which provides a user-defined name for the constraint, convenient for fetching the constraint metadata from the catalog. In the absence of a "constraint <name>" clause (for example the "not null" above), the system implicitly names the constraint, and that name will have to be looked up in the metadata every time, as such names are very hard to remember. The check constraint in the example was defined at table level because its column needed two constraints and "not null" cannot be defined at table level. The unique constraint at table level shows that it could be defined there too - a matter of convenience as aforesaid; we could do the same with other constraints as well.
Adding constraints to an existing table -
It is possible to add a constraint to an existing table, whether because a certain constraint was not created with the table or because some developers prefer to create tables without constraint definitions and add the constraints afterwards.
Add Primary Key : alter table employee add constraint pk_employee primary key (empid);
Add Not Null : alter table employee modify salary not null;
Add Check : alter table employee add constraint chk_employee check (salary between 1500 and 12000);
Add Unique : alter table employee add constraint uk_employee unique (cardno);
Add Foreign key : alter table employee add constraint fk_employee foreign key (deptno) references department (deptno);

Explanation : Observe that for the "Not Null" constraint we use "modify" instead of "add", since not null is actually defined as a column specification, as explained above. Also observe that the "foreign key (deptno)" clause had to be added to the definition, because here we must say which column is to become the FK; that was not necessary in the previous example, where the constraint was defined at column level. Here too, the "constraint <name>" part is optional in all of the above statements.

Adding constraints when table has rows -
The ambit of constraints is to not allow any invalid data to remain in the column(s) on which they are defined. Hence, if data is already present, it will be validated by the constraint while the constraint is being created, because the default status of a constraint is "enabled". The constraint creation fails if the data does not conform to the rule of the constraint.

Some database systems like Oracle allow the "novalidate" keyword to be used to ignore the existing invalid data. However, this keyword is not allowed with a primary key or unique constraint. This is because a PK or UK may be referenced by FKs, and ignoring duplicates would compromise referential integrity, with the possibility of a child row referring to multiple parent rows (important for job seekers, a possible interview question!!!), which has to be strictly avoided in any case. The use of "novalidate" is demonstrated for a check constraint in the example below -

alter table employee add constraint chk_employee
check (salary between 1500 and 12000) novalidate;

Disabling/enabling a constraint -
A constraint, when created, is in the enabled state by default. Sometimes, however - during a bulk data load operation, or for other reasons - the constraint may need to be disabled. Disabling it marks the data as "NOT VALIDATED" in the data dictionary, and naturally, while enabling, the data is verified before the constraint state is changed back to enabled.

alter table <TableName> disable constraint <ConstraintName> [cascade];
alter table <TableName> enable [novalidate] constraint <ConstraintName>;

"Novalidate" option does not apply to primary key or unique constraint for the reasons already explained, but it may be optionally used for other constraints if you want the constraint to ignore any invalid data that was put in the column while the constraint was in the disabled state. "cascade" option may be only used while disabling the primary key. It is used/useful when there are foreign keys referencing to it, which are also supposed to be disabled. The primary key constraint is not allowed be disabled with an enabled foreign key constraint referencing it. However, "cascade" option is not available with enable action on primary key, which means that Foreign Key constraint will have to be enabled explicitly after enabling the primary key.


Dropping a constraint -
A constraint may be dropped using the following statement :


alter table <TableName> drop constraint <ConstraintName> [cascade];

Here the "cascade" option may only be used while dropping the primary key constraint. It is used/useful when there are enabled foreign key constraints referencing to it and they (FKs) are also required to be dropped. The primary key constraint is not allowed to be dropped with an enabled foreign key constraint referencing it.


Dropping a Parent Table -
A table whose primary key constraint is referenced by an enabled foreign key constraint on another table (the child table) is called a parent table. It is not possible to drop a parent table, irrespective of whether the tables contain any data. However, this default behaviour may be overruled by using the "cascade constraints" clause of the "drop table" statement, as given below -


drop table <ParentTableName> [cascade constraints];

The "cascade constraints" clause not only allows to drop the parent table but also drops implicitly the referencing foreign key constraints enabled on the child tables, while retaining the child tables and its data.


Foreign key constraint options -
Foreign key constraints enforce referential integrity. The default behaviour is that parent row(s) (in the parent table) cannot be deleted while any child row(s) depend on them (in the child table). This behaviour may be altered by using the "on delete cascade" or "on delete set null" option in the definition of the foreign key constraint, as given below -


...constraint fk_employee [foreign key (deptno)] references department (deptno)
on delete ....;


The options are mutually exclusive: the former cascades the deletion of parent row(s) to the related child row(s), while the latter sets (updates) the corresponding foreign key value to null without affecting the rest of the child row. The "foreign key (deptno)" part (shown in square brackets) is required when the constraint is defined at table level or when adding the constraint to an existing table. Both variants are sketched below.
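Both variants, sketched as alter table statements on the earlier employee/department example (only one of the two can exist at a time, since they carry the same constraint name) -

-- child rows are deleted along with the parent row
alter table employee add constraint fk_employee
      foreign key (deptno) references department (deptno)
      on delete cascade;

-- or: child rows survive, their deptno is set to null
alter table employee add constraint fk_employee
      foreign key (deptno) references department (deptno)
      on delete set null;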


Deferrable Constraints -
By default, constraint rules are applied at statement level, i.e. the data is validated immediately after the statement finishes its action. Such constraints are classified as "immediate" with respect to their validation timing. Some situations, however, require validation of the data against the rule of the constraint to be delayed until the end of the transaction rather than checked at statement level. Such constraints are called "deferrable". The following example demonstrates such a situation -


create table employee (
      empid    number(4) constraint pk_employee primary key,
      ename    varchar2(15) not null,
      job      varchar2(15),
      salary   number(5) not null,
      mgrid    number(4) constraint fk_employee references employee (empid) deferrable,
      .......
);

Explanation : In the above partial definition of an employee table, the column "mgrid" is a foreign key referencing the "empid" column of the same table, which is its primary key. This is because "mgrid" holds the "empid" of the employee to whom the corresponding employee reports, depicting an organizational hierarchy of employees within the same table (managers and subordinates are all employees). Such a case may require multiple insert statements before the subordinate-manager relationships among the rows are fully established, so the constraint checking has to be delayed until all such rows are inserted (the end of the transaction). "Deferrable" constraints, however, are still applied as "immediate" by default; they require an explicit statement to be "deferred", as under -


alter session set constraints = deferred ;


This above statement will "defer" (for every transaction) all the deferrable constraints in the particular session. If it is required to create the constraints as "deferred" only to start with, then replace the key word "deferrable" with "initially deferred".


Constraints are the most convenient way of implementing data integrity, as we have seen above, but they are not without limitations. There are situations when constraints cannot implement certain custom business rules and we have to resort to other techniques like triggers or views. Click Here for data integrity implementation using triggers. Don't forget to leave your comments and ask your doubts in the comments section, please.

Saturday 28 April 2012

Data Integrity Constraints

A Relational Database Management System must implement data integrity in the database in a declarative manner (Codd's Rule #10) to ensure the data integrity independence proposed in the theory of RDBMS. Though it is also possible to implement data integrity through non-declarative means, using triggers in the database, the most convenient way is to declare "Integrity Constraints" on the columns of the table. The implementation of integrity constraints is more or less the same in all commercial database systems; our model database for reference in this post will be Oracle.

Data Integrity Constraints -

Data integrity constraints are a declarative means of implementing data validation rules, created as constraint objects on columns in tight integration with the definition (read: structure) of the table. This makes integrity constraints the most reliable and easiest method of preventing the entry of invalid data into the tables. When any of the data involved in a DML operation on the table violates the rule of a constraint, the operation is rejected by the database system.

Mechanism of Data Validation -

An important question to answer is: when does the system check the validity of data against the rule of the constraint? By default, the system checks the validity of the data when the DML statement completes execution. For example, if a delete statement fired at a table is supposed to hit some 10 rows, those rows are first deleted and then the system runs the validation check; if it finds that any of the rows so deleted would violate a constraint, the delete operation is implicitly rolled back. This is called a "statement level rollback" - an implicit operation, transparent to the user. The database system then reports the error and rejects the operation. It is important to understand that the "statement level rollback" comes into effect whenever a statement has to fail (for any reason); it is the right (prerogative) of the system, and no explicit "commit" or "rollback" can supersede it. (This is actually the reason triggers are not allowed to contain "commit" or "rollback" statements - an important question asked in interviews, attention job seekers!!!)

The foregoing mechanism of constraint checking is called an "immediate" constraint; all constraints are, by default, created as immediate. Optionally, however, you may create a constraint which defers the validation of the data until the user fires the "commit" statement. Such constraints are called "deferrable". Deferrable constraints also behave as "immediate" by default, but may be deferred using an "alter session" statement; databases like Oracle also provide the facility to create them deferred from the start with the "initially deferred" keyword (see the syntax section for more). The deferred state of a constraint ensures transaction-level integrity of the data, as validation is performed at the end of the transaction (when the user says "commit"). If any data violating the constraint is observed at the end of the transaction, i.e. while committing, the entire transaction is rolled back.


Types of Integrity Constraints -

Integrity constraints come in the following 5 types -

A Primary Key constraint is used on a column or a combination of columns to designate the primary key of a table. You may (obviously) define only one primary key per table; it ensures that the column(s) under the constraint always receive a value (not null) and a unique one. It actually implements "entity integrity" and is an important part of the parent/child relationship between tables. Examples: an employee id or a student's university enrollment number.

A Unique constraint, as the name suggests, allows only unique values where values are present, but it accepts nulls. Unique constraints are used for columns which must contain unique values but cannot become the primary key (of which there is only one per table) - for example, an employee's access card number.

A Referential constraint, also called a Foreign Key, requires that each row of a table either has a value in the constrained column that matches a value in the "parent key", or remains null. The "parent key", also called the "referenced key", is usually a primary key, though it is possible for a foreign key column to refer to a column under a "unique" constraint. For example, the department id column in the employees table refers to the department id column in the departments table: since an employee must belong to an existing department, his department id must match one of the values in the department id column of the department table, or the employee may have no department id at all (null), reflecting that the employee is yet to be allotted a department. The referential constraint implements "referential integrity".

A Check constraint allows you to specify a condition that must be true for the data in the constrained column - or the column may hold no data (null) - for every row in the table. There may be multiple check constraints defined on a single column, so you have to take care that the conditions of the check constraints on the same column do not conflict with or contradict each other: database systems have no mechanism to verify whether the conditions specified in different check constraints on the same column are mutually exclusive, and there is no specific order of evaluation of those multiple check constraints. Database systems like Oracle allow built-in functions to be used in a check constraint, with some limitations. Check constraints are the "domain integrity" part of data integrity implementation. For example, you may allow only Male, Female or 'Rather Not Say' (or null) values for a gender column in a table of customers, as sketched below.
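A sketch for the gender example, assuming a customers table with a gender column; the constraint name chk_gender is made up -

alter table customers add constraint chk_gender
      check (gender in ('Male', 'Female', 'Rather Not Say'));

A null gender still passes this check, which is exactly the behaviour described above.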

A Not Null constraint requires that a value for the column must always be provided. It is actually implemented as a check constraint with the built-in condition "<column name> is not null" added implicitly when a column is declared Not Null.


Referential Constraint DML restrictions -

Referential Constraint puts following restrictions on the DML performed on the existing rows in the Parent and Child tables -
  • Parent row(s) with dependent child rows are not allowed to be deleted, by default.
  • Parent row(s) with dependent child rows may be deleted if the referential constraint is created with the "on delete cascade" option, whereby the corresponding child rows are deleted along with the parent row(s).
  • Parent row(s) with dependent child rows may be deleted if the referential constraint is created with the "on delete set null" option, whereby the corresponding foreign key column is set to null when the referenced parent row(s) are deleted.
  • A parent key value cannot be updated if it is referenced by child row(s).
  • Foreign key column value(s) may be updated to null or to any other value, provided the new value exists as a parent key value.
  • Child row(s) may be deleted without any restriction.

Constraint Indexes -

Unique and primary key constraints generate unique indexes in the background on the columns on which they are declared. These indexes are necessary for implementing uniqueness: without them, whenever an insert or update operation on the constrained column was performed, the database system would have to verify the uniqueness of the new value by scanning the entire table, which would be very inefficient. They are plain B*tree indexes. No other constraint automatically creates an index on its constrained columns. (This is one of the very popular questions asked in interviews, attention job seekers!!!) One way to see these indexes is sketched below.
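One way to see these implicitly created indexes, assuming the employee table from the earlier examples, is to query the data dictionary -

select index_name, uniqueness
from user_indexes
where table_name = 'EMPLOYEE';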
  
After this theoretical background on integrity constraints, we shall see the actual creation and administration of the constraints. Click Here to continue.
 

Thursday 26 April 2012

Data Integrity

If you have visited the earlier post entitled "Relational Database Model - Codd's Rules", you must have realized that rule #10 emphasizes the need to enforce "data integrity independence" in commercial RDBMSs. The most important purpose of the whole RDBMS paradigm is that the database should be independent of the applications which access it, so that changes in either do not interfere with the other; to achieve this, database systems should take only a non-procedural (declarative) approach. Rule #10 thus says that "data integrity" is the responsibility of the database system; it must be definable as an object, or in the form of a self-executing program (read: trigger) written within the database (not the application), and its existence must be part of the metadata stored in the "data dictionary" (or catalog). Also, any change in the "data integrity" should be possible in the database as and when appropriate, without affecting the existing applications.

What is data integrity ? - 

It is important that data adhere to a predefined set of rules, as determined by the database administrator or application developer, as well as to obvious and natural validations. For example, no two customers of a bank may have the same account number, and a value for a gender column can only be one of Male, Female and Rather Not Say.

So, in short, "data integrity" ensures the quality of the data in the database and it's enforcement prevents entry of any invalid data into the database tables.

Types of Data Integrity - 

Two important steps in designing a table from the "data integrity" point of view are, first, identifying the set of valid values for certain columns and, secondly, deciding how to ensure that only valid values make their way into the column(s). This entails categorizing "data integrity" -
  • Entity Integrity
  • Referential Integrity
  • Domain Integrity
  • User-Defined (custom) integrity
Entity Integrity defines each row as a unique entity of the table. This is obvious from the definition of an entity - a real-world object, person or logical idea with a set of attributes, identified uniquely by its own set of values (which form a row in the table, i.e. the entity set). So here we require at least one column in the table to be constrained to values which never repeat, for example a student's university enrollment number.

Referential Integrity preserves the real-world relationships between entities. In such a relationship an entity with independent existence, called the parent, may be referenced by dependent entities (possibly many), called children. The rule enforces that a child may refer only to an "existing" parent, or to none (in which case it may be termed an orphan), and that a parent may be referred to by multiple child entities or by none (childless). For example, an employee (child) may belong to one and only one existing department (parent), or may not be assigned to any department at all, and a department may contain multiple employees or none.

Domain Integrity enforces that the valid values for a column follow a restriction of range or list, and type (numeric, string, date etc.). For example the only valid values for a gender column shall be Male, Female or Rather Not Say.

User-Defined (custom) integrity allows you to implement specific business rules. For example a certain level of manager in a bank may approve a loan case only to a certain maximum limit.

How to enforce Data Integrity ? -

Data Integrity may be enforced in many different ways, depending upon the requirement and suitability. Broadly, the following ways may be used to enforce data integrity -
  • Integrity Constraints
  • Triggers
  • Views
Each of these methods has been elaborately discussed in its own post (see the links above).

Here is a link to an interesting quiz, some food for thoughts : Click Here