Database Normalisation

by Richard Carr, published at http://www.blackwasp.co.uk/DBNormalisation.aspx

The seventeenth part of the SQL Server Programming Fundamentals tutorial discusses the concept of database normalisation. Normalisation is a database design technique that minimises duplication of information, reducing the risk of introducing data errors.

Previous: SQL Server Indexes Part 2

Data Anomalies

When information is held in a list format, the data is likely to include some duplication of information. This is often the case when creating a large list with many columns, such as in a spreadsheet format. Duplication can often occur when a single database table is used to hold lots of related information.

Duplication of information causes two key problems in a relational database. Firstly, the additional information can often lead to increased storage requirements. Secondly, and more importantly, you can experience data integrity problems, the majority of which can lead to one or more types of anomaly. Three anomalies are described in the following sections.

Update Anomalies

An update anomaly occurs when all copies of a single piece of information are not updated correctly. For example, consider the following table:

Name	Job Title	Skill
Bob	Developer	C#
Bob	Developer	SQL
Jim	Developer	C#
Sue	Developer	VB
Pat	Project Manager	PRINCE2

To allow the table to hold the details of multiple employees, each with a job title and one or more skills, this table includes duplication in every column. This gives many opportunities for update anomalies. For example, if Bob's information is updated so that his job title is "Programmer", this should probably also update all other developer rows to indicate that the new job title replaces the old one for all developers. If not, the update could leave the table in the following state:

Name	Job Title	Skill
Bob	Programmer	C#
Bob	Programmer	SQL
Jim	Developer	C#
Sue	Developer	VB
Pat	Project Manager	PRINCE2

The information in the table may still be considered accurate. However, a query of the database that intends to return all of the developers has become more difficult. A worse scenario will occur if the next update to change Bob's title back to "Developer" is only executed against one of his two entries, especially if the new job title is misspelled:

Name	Job Title	Skill
Bob	Developr	C#
Bob	Programmer	SQL
Jim	Developer	C#
Sue	Developer	VB
Pat	Project Manager	PRINCE2

Insert Anomalies

Insert anomalies occur when the structure of a database table makes it difficult to insert information. This can happen when two or more distinct areas of information (or domains) exist within a table. In our previous example table, we may want to create a new employee who does not have a specific technical skill. This forces us to use a null value or dummy value for the skill:

Name	Job Title	Skill
Bob	Developr	C#
Bob	Programmer	SQL
Jim	Developer	C#
Sue	Developer	VB
Pat	Project Manager	PRINCE2
Lee	Trainee Programmer	(null)

In this situation, the information in the table seems reasonable. However, when Lee obtains his first skill, simply adding a row, as would be done for other developers, causes an anomaly:

Name	Job Title	Skill
Bob	Developr	C#
Bob	Programmer	SQL
Jim	Developer	C#
Sue	Developer	VB
Pat	Project Manager	PRINCE2
Lee	Trainee Programmer	(null)
Lee	Trainee Programmer	C#

Delete Anomalies

A delete anomaly occurs when the deletion of information from a table has unintended consequences, generally because the structure of the table makes deleting data difficult. Consider the situation where the company using the sample table has decided that they will no longer require the C# skill. If you were to delete all references to C#, Jim would disappear from the database entirely. Similarly, if Pat were to leave the company, removing her entry would erase the PRINCE2 skill from the database.

Normalisation

Database normalisation (US: normalization) reduces or removes the issues caused by duplicated data by extracting the replicated information into separate tables. Foreign key relationships are then added between the related tables to maintain referential integrity. Once this normalisation has occurred, the key information in each database table is held uniquely.

As an example, we may decide to move the job title values in the previous example to their own table with a primary key. Each of the employee rows would include a reference to that primary key. When querying employees, we would join the two tables together, ensuring that every employee with a particular title will be linked to the same job title row and would therefore return exactly the same information. This means that an update to a job title need only occur in one place to affect every linked employee.

Normalisation should be applied to databases that are updated frequently but seldom queried. Adding tables to a database generally increases the number of tables that must be joined to execute a query. As each join adds an overhead to the query, decreasing performance, it is possible to over-normalise a database. This is particularly noticeable for databases that are used only for reporting purposes. In these situations it is common to denormalise (US: denormalize) the database, introducing duplication that increases query performance.

Next: Page 2

15 December 2008