Thursday, 26 April 2012

Data Integrity

If you have visited the earlier post entitled "Relational Database Model - Codd's Rules", you must have realized that rule #10 emphasizes the need for "data integrity independence" in commercial RDBMSs. A central purpose of the whole RDBMS paradigm is that the database should be independent of the applications which access it, so that changes in either do not interfere with the other; to achieve this, database systems should support a non-procedural (declarative) approach. Rule #10 thus suggests that "data integrity" is the responsibility of the database system: it must be definable as an object, or in the form of a self-executing program (read: trigger) written within the database (not the application), and its existence must be part of the metadata stored in the "data dictionary" (or catalog). Also, any change in the "data integrity" rules should be allowed in the database, as and when appropriate, without affecting the existing applications.

What is data integrity? -

It is important that data adhere to a predefined set of rules, as determined by the database administrator or the application developer, and also to obvious, natural validations. For example, no two customers of a bank may have the same account number, and the value of a gender column can only be one of Male, Female and Rather Not Say.

So, in short, "data integrity" ensures the quality of the data in the database, and its enforcement prevents the entry of invalid data into the database tables.

Types of Data Integrity - 

The two important steps in designing a table from the "data integrity" point of view are, first, identifying the set of valid values for certain columns and, second, deciding how to ensure that only valid values make their way into those column(s). This entails categorizing "data integrity" as -
  • Entity Integrity
  • Referential Integrity
  • Domain Integrity
  • User-Defined (custom) integrity
Entity Integrity defines a row as a unique entity of a table. This is obvious from the definition of an entity - a real-world object, person or logical idea with a set of attributes, identified uniquely by its own set of values (which form a row in the table, i.e. the entity set). So here we require at least one column in the table to be constrained to values which may never repeat; for example, a student's university enrollment number.
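The enrollment-number example above can be sketched with a PRIMARY KEY constraint. This is a minimal illustration using SQLite through Python's sqlite3 module; the table and column names are assumptions made for the example:

```python
import sqlite3

# In-memory database; "student" and its columns are illustrative names only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        enrollment_no TEXT PRIMARY KEY,   -- entity integrity: unique and non-null
        name          TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO student VALUES ('ENR-001', 'Asha')")

duplicate_rejected = False
try:
    # A second row with the same enrollment number is not a unique entity.
    conn.execute("INSERT INTO student VALUES ('ENR-001', 'Ravi')")
except sqlite3.IntegrityError:
    duplicate_rejected = True

print("duplicate rejected:", duplicate_rejected)
```

The constraint lives in the table definition itself, in line with rule #10: no application code has to check for duplicates.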

Referential Integrity preserves the real-world relationships between entities. In such a relationship, an entity with an independent existence, called the parent, may be uniquely referenced by one or more dependent entities, called children. The rule enforces that a child may refer only to an "existing" parent, or to none (in which case it may be termed an orphan), while a parent may be referred to by multiple child entities or by none (childless). For example, an employee (child) may belong to one and only one existing department (parent), or may not be assigned to any department at all, while a department may contain multiple employees or none.
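The department/employee rule above can be sketched with a FOREIGN KEY constraint. A minimal example in SQLite via Python's sqlite3 module (table and column names are illustrative, and SQLite needs foreign-key checking switched on explicitly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER REFERENCES department(dept_id)  -- NULL allowed: unassigned
    )
""")
conn.execute("INSERT INTO department VALUES (10, 'Accounts')")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 10)")    # existing parent: OK
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', NULL)")  # no department: OK

dangling_rejected = False
try:
    # Department 99 does not exist, so this child would dangle.
    conn.execute("INSERT INTO employee VALUES (3, 'Meera', 99)")
except sqlite3.IntegrityError:
    dangling_rejected = True

print("dangling reference rejected:", dangling_rejected)
```

Note that the nullable dept_id column models the "may not be assigned to any department" case, while the REFERENCES clause forbids a reference to a non-existent parent.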

Domain Integrity enforces that the valid values for a column follow a restriction of range or list, and of type (numeric, string, date etc.). For example, the only valid values for a gender column shall be Male, Female or Rather Not Say.
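A list restriction like the gender example can be sketched with a CHECK constraint, again using SQLite through Python's sqlite3 module (table and column names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        gender  TEXT CHECK (gender IN ('Male', 'Female', 'Rather Not Say'))
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Female')")  # within the domain

out_of_domain_rejected = False
try:
    conn.execute("INSERT INTO customer VALUES (2, 'Unknown')")  # not in the list
except sqlite3.IntegrityError:
    out_of_domain_rejected = True

print("out-of-domain value rejected:", out_of_domain_rejected)
```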

User-Defined (custom) integrity allows you to implement specific business rules. For example, a manager of a certain level in a bank may approve a loan only up to a certain maximum limit.
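A business rule like the loan-approval limit can be sketched as a trigger stored inside the database, as rule #10 recommends. The sketch below uses SQLite via Python's sqlite3 module; the 500000 cap and the table name are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE loan (
        loan_id INTEGER PRIMARY KEY,
        amount  NUMERIC NOT NULL
    )
""")
# Self-executing business rule: abort any insert above the assumed limit.
conn.execute("""
    CREATE TRIGGER enforce_loan_limit
    BEFORE INSERT ON loan
    WHEN NEW.amount > 500000
    BEGIN
        SELECT RAISE(ABORT, 'loan exceeds manager approval limit');
    END
""")
conn.execute("INSERT INTO loan VALUES (1, 250000)")  # within limit: OK

over_limit_rejected = False
try:
    conn.execute("INSERT INTO loan VALUES (2, 750000)")  # exceeds the limit
except sqlite3.IntegrityError:
    over_limit_rejected = True

print("over-limit loan rejected:", over_limit_rejected)
```

Because the trigger lives in the database catalog, the limit can be changed there without touching any application, which is exactly the independence rule #10 asks for.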

How to enforce Data Integrity? -

Data Integrity may be enforced in many different ways, depending upon the requirement and suitability. Broadly, there are the following ways to enforce data integrity -
  • Declarative constraints defined with the table - NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, FOREIGN KEY and CHECK
  • Procedural objects stored within the database - triggers and stored procedures
  • Validations coded in the application (the least preferred, as per rule #10 above)
Each of these methods has been elaborately discussed in its own post.

Here is a link to an interesting quiz, some food for thought: Click Here


  1. Is there such a thing as inferential integrity?

    1. I haven't heard of this term in database engineering; however, the term is used in the field of statistics as "inferential statistics", which relates to drawing inferences from sample data and making projections based on it. A person from the field of statistics may be in a better position to explain this to us. Thanks for your query and interest.