Wednesday 14 March 2012

Database Design - Normalization - Normal Form - Part III

You are requested to visit "Database Design .... Part II" for continuity with this post.
In the last two posts with the same heading we did all the ground work required to understand the concept and technique of Normalization and Normal Forms. This post may contain some bookish / hi-tech definitions - but no worries. The readers with heuristic approach may skip the hi-tech stuff and they still can understand normalization and normal forms from the examples.
Relational Database Design - The goal of a relational database design is to generate a set of relation schema that allows us to store information without redundancy and perform the various operations on the data without anomalies. This is achieved by designing the schema in appropriate Normal Form.  

Decomposition - It is a process to divide appropriately a "universal relation schema" with all database attributes into several relations each with fewer attributes to achieve a targeted Normal Form. A "BAD" decomposition procedure is one which may give rise to Lossy Decomposition or Lossy-join Decomposition. Ideally the decomposition should have following important properties -
  • Lossless-join decomposition
  • Dependency Preservation
  • No Repetition of Information
Functional Dependency Concept -
Functional dependency concept is at the center of the definition of "Normal Forms" or process of "Normalization". Functional dependency is a constraint on set of legal relation.
Definition – A Functional Dependency, denoted by X → Y, between two sets of attributes X  and Y that are subsets of R specifies a constraint on the possible tuples (read rows) that can form a relation state r of R. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
This means that the values of the Y component or a tuple in r depend on, or are determined by, the value of the component X; alternatively, the values of X component of a tuple uniquely (or functionally) determine the values of the Y component.
(This was only a definition in mathematical format of the concept of dependency we discussed in "what is RDBMS - Part II" with an example of "account number" and "balance". This and some such stuff is for readers with scientific approach and academic interest)
Closure of a set of Functional Dependencies - There may be several functional dependencies that may be observed in a relation. For example a relation of students S = {Stid, Sub, Fees, Phone, Attendance}. Here Stid →Phone Attendance Sub and Sub →Fees are the dependencies which are semantically obvious dependencies. But also there is dependency Stid →Fees which is inferred (since Stid → Sub and Sub → Fees, by property of transitivity Stid → Fees). Let F be a set of functional dependencies. The Closure of F (indicated by F+) is the set of all functional dependencies logically implied by F.  

Closure of Attribute Sets - Given a set α of attributes of R and a set of functional dependencies F, we need a way to find out all the number of attributes of R that are functionally determined by α. This set of attributes is called as closure of attributes α under F. It is denoted by α+. Finding α+ is useful because – 
  • If α+ = R then α is superkey for R. 
  • If we find α+ for all α R then we have computed F+ 
Normalization of Relations -  Normalization of data can be considered as a process of analyzing the given "universal relation schema" based on their FDs (Functional Dependencies) and primary keys to achieve the desirable properties of –
  • Eliminating redundancy.  
  • Eliminating the insertion, deletion and update anomalies.
First Normal Form (1NF) –  A relation is in the First Normal Form (1NF) if the value of any attribute in the tuple (read row) must be single value (atomic) from the domain of that attribute. (in the following example however the "skills" column contains multiple skills.)

Empid Ename Qualification Skills
101 Peter BS Java,VB,Oracle
....
Following is the same relation in 1 NF (First Normal Form) with "redundancy". [Incorrect]
Empid Ename Qualification Skills
101 Peter BS Java
101 Peter BS VB
101 Peter BS Oracle
....
Following is the same relation in 1 NF (First Normal Form) with "repeating fields". [Incorrect]

Empid Ename Qualification Skill1 Skill2 Skill3 Skill4
101 Peter BS Java VB Oracle
....
Following is the normalized relation in 1 NF (First Normal Form) by decomposing into relations "Employee" and "Skills". [correct]

Employee
Empid Ename Qualification
101 Peter BS
....

Skills
Empid Skill
101 Java
101 VB
101 Oracle
102 PHP
102 ....
....
Second Normal Form (2NF) –  A relation is said to be in the Second Normal Form (2NF) if it is in the "first normal form" and then it does not show any "partial dependency" and hence the chances of a relation violating "2NF" is only when there is "composite primary key" (primary key in a relation is combination of 2 or more columns).
In the relation that follows :
  • Primary key is composite (stid + subject)
  • Partial dependency is observed - "Fees" depends on "Subject" and "Phone" depends on "stid".
  • Attendance has functional dependency - on composite primary key (stid + subject).
stid subject fees phone attendance
101 C 1000 12345 20
102 VB 1200 76589 25
101 Oracle 2000 12345 30
....

This relation has been decomposed into three relations to achieve the "2NF". Observe here that there is at least one relation that remains with the original composite Primary Key (stid + subject). In this case this relation is "Attendance".

Student
stid phone
101 12345
102 76589
103 .....
....

Subject
subject fees
C 1000
VB 1200
Oracle 2000
PHP ....
....

Attendance
stid subject Attendance
101 C 20
101 Oracle 30
102 VB 25
103 .... ....
....

Third Normal Form (3NF) – A relation is said to be in Third Normal Form (3NF) if it is already in the "1NF" and "2NF" and further does not have a non-key attribute functionally dependent on another non-key attribute, that is, there should be no "transitive dependency" of a non-key attribute on primary key.

In the following relation observe that the non-key attribute "location" is decided by (dependent upon) "deptid" which is a non-key attribute. The primary key is "empid". So here is the case of "transitive dependnecy". There is redundancy and also all the transaction anomalies - 

Empid Ename job salary deptid dname location
101 Peter Manager 4000 10 Accounting New York
102 Sarah Analyst 3000 20 Research Boston
103 Albert Clerk 2000 10 Accounting New York
....

And here below the same relation has been decomposed into two relations "Employee" and "Department" where "location" and "dname" has been moved to relation "Deparment" (and thus eliminating the "transitive dependency" - 
Employee
empid ename job salary deptid
101 Peter Manager 4000 10
102 Sarah Analyst 3000 20
103 Albert Clerk 2000 10
....

Department
empid ename job
10 Accounting New York
20 Research Boston
30 Sales Dallas
....

Boyce-Codd Normal Form (BCNF) – A relation is said to be in Boyce-Codd Normal Form (BCNF) if it is already in the "3NF" and further does not have overlapping composite key. A relation in "BCNF" always satisfy "3NF" but NOT every "3NF" is also "BCNF". For Example the following relation is in "3NF" (does not show any transitive or partial dependency).

In the following example we assume that a teacher may teach only one subject but one subject may be taught by 2 or more teachers and only a particular teacher teaches that subject to a particular student. Then combination (stid + subject) will decide the teacher and teacher decides the subject. The combination (stid + subject) is primary key, Teacher is (tends to be) candidate key, and potentially (stid + teacher) also tends to become the primary key. [however we don't accept the combination (stid + teacher) as primary key as it will give rise to partial dependency - subject depends on teacher and hence it is rejected]

stid subject teacher
101 RDBMS Peter
102 OS Albert
103 RDBMS Sarah
101 Micro Processor George
104 RDBMS Peter
....

The solution will be, to split the relation into two – {studentid, teacher} and {teacher,subject. In this example we have given more importance to "losslessness" and "non-additive decomposition" while ignoring the property of "preserving functional dependency" in the decomposed relations. (The discussion on the decomposition properties coming up shortly in next posts)

Multivalued Dependency & Fourth Normal Form (4NF)-
Multivalued dependencies (MVD) are a consequence of when we have two non-key multivalued attributes. For example consider the following relation -

empid skills langspeak spousename
101 VB,Oracle French,English Sarah
....

Now for this relation to be in "1NF" -
empid skills langspeak spousename
101 VB French Sarah
101 VB English Sarah
101 Oracle French Sarah
101 Oracle English Sarah
....

Now this relation decomposed to "4NF" -

EmpSkills -
empid skills
101 VB
101 Oracle
....

EmpLang -
empid Langspeak
101 French
101 English
....

EmpSpouse -
empid SpouseName
101 Sarah
....

In here we have learned the technique of Normalization. There are however some properties of a "good" decomposition remain to be covered. Let us meet again in the next blog shortly.

7 comments:

  1. STOP spamming usenet.

    ReplyDelete
  2. Why anonymous? Appears to be some disgruntled soul. Carry on crazy4db your mission to reach those thousands who need this knowledge or how could people like me could know about this wonderful post?

    ReplyDelete
    Replies
    1. Thanks Raj for your encouraging comments. Keep visiting; some more posts in time to come.

      Delete
  3. If you would look at the responses your spam generates, perhaps you would figure out that you are actually being stupid. The only disgruntlement I have is you are spamming. If you would bother to read the responses to your spam, you might understand that everyone does not think your spam is wonderful. I make no judgements about the accuracy or relevancy of your blog, simply your spam. You should read the charters of the usenet groups you post to, before you post.

    If you would actually respond to comments made on usenet, you'd see a lot of people who are not anonymous. You might also note that people are confusing you with someone else who is known to spout misinformation. People were happy when your blog 404'd. They are now unhappy that it is back. That is what happens when you spam.

    ReplyDelete
  4. You must understand what normalization mean and simplify to easy words. Not get everything from book or class notes.
    Normalization is reduce complexity and remove duplicate categories.
    Basically, the first normalization is simple : no duplicate groups or columns in table. That is one to many relationship
    the second normalization is simple too : every table should have primary key, and that is many to one relationship. you must separate dynamic data from static data.
    the third normalization is simple also :no transitive dependencies.
    Normalization join command cause so many problems and slow down search speed. Such Normalization theory Only for schooling use, not for real world. What real world use ? Industrial use Data House concept and Normalization concept combine for their data modeling project

    ReplyDelete
    Replies
    1. Thanks for your views "Direct Forward Concept". Actually frankly speaking I am not originally from IT engineering background but with a degree in Electrical Engg that I passed in the days when even calculators were not allowed to be used(in 1980). I have studied the database technology as part of my passion which I picked at very advanced stage of my career and did not obtain any formal training in DBMS. I have always believed that the authoritative knowledge comes from books written by esteemed authors be it electrical, databases or any technology and so I studied databases technology following some books and manuals and all by myself. The subject I am trying to put here is in the same way as I have understood it. There is no claim that I am an authority in the subject and must humbly agree that I am still a student and will be so for entire of my life. There may be errors of fact, of understanding and technical too. But one thing is for sure that following only those concepts which I am putting here, I got the success as an Oracle DBA and have been working so for last about 17-18 years. Practice is not much different from theory, rather theory only is projected in practice or put straight, practice is implementation of theory. Only thing is; theory is ideology whereas its implementation has some limitations. And since I am not originator of RDBMS theory, there is absolutely no shame in accepting that I have got everything from books but trying to combine with examples which I came across in real life and what I did in such situation and probably show how it is supported in theory. This blog I have created to share views of everybody, but I will appreciate if the views are about technology rather than about the person or his/her personality. Technology is independent of person who would write here and the contents may be according to his/her own experience. Anybody may contradict based on logical reasoning and examples or quotes from authorities, books and manuals. Let this forum be a discussion forum for some technical concepts rather than battlefield of persons claiming/confronting superiority or level of knowledge. Thanks for your views and hope you would keep on visiting the blog and share your experience with the readers.

      Delete
  5. I Like your Article. It really excellent…
    We all developer share the real time experience, but sometimes we must have to see the structural definition behind it. As we don’t have enough time from our daily jobs we are unable to see the real structural configuration.
    After long time I really find it is interesting. When I read it and compare with my daily works, it’s really exciting.
    Good man keeps it up.

    ReplyDelete