12 Database Design Theory: Normalization

Database

Database Design

DB Design Theory

BCNF

Normalization

Functional Dependencies

Closure Test

FD Projection

This lecture discusses the concept of normalization in database design theory. The lecture covers functional dependencies, closure test, and FD projection. It finally introduces the concept of BCNF.

Author

Jiuru Lyu

Published

November 18, 2024

Introduction

It allows us to improve a schema systematically.
General idea:
- Express constraints on the relationship between attributes.
- Use thse to decompose the relations.
Ultimately, get a schema that is in a “normal form” that guarantees good properties, such as no anomalies.
“Normal” in the sense of conforming to a standard.

Note 1: Normalization

The process of converting a schema to a normal form is called normalization.

End goal: Design a “good” normalized schema that reduces redundancy (thus reducing storage waste, anomalies) and ensures data integrity (BCNF Normalization).

Functional Dependencies (FD)

Poorly designed schemas can lead to anomalies.
- In any domain, there are relationships between attribute values.
- For example:
  - Every part has 1 manufacturer
  - Every manufacturer has 1 address
  - Every seller has 1 address
- If so, this table will have redundant data.

part	manufacturer	manAddress	seller	sellerAddress	price
P1	M1	A1	S1	A2	100

Update anomaly: If one manufacturer moves to a new address, we need to update all the rows with the same manufacturer. If we only update one row, the data will be inconsistent.
Deletion anomaly: If one manufacturer stops selling a part, we delete the row. But we also lose the address information.

Note 2: Functional Dependencies

Suppose \(R\) is a relation, and \(X\) and \(Y\) are subsets of the attributes of \(R\). - Then, \(X\to Y\), read as “\(X\) funtionally determines \(Y\)”, is called a functional dependency. - \(X\to Y\) holds in \(R\) \(\iff\) two tuples agree on all the attributes in set \(X\), they must also agree on all the attributes in set \(Y\).

Functional dependencies are constraints on the relation.
Symbolically, \(A\to B\) means
- \(\forall\) tuples \(t_1, t_2\), \((t_1[A]=t_2[A])\implies(t_1[B]=t_2[B])\), or
- \(\neg\exists\) tuples \(t_1, t_2\) such that \((t_1[A]=t_2[a])\land(t_1[B]\neq t_2[B])\).
Functional dependencies can be extended to multiple attributes.

Why “functional dependency”?
- “dependency” because the value of \(Y\) depends on the value of \(X\).
- “functional” because there is a function that takes a value for \(X\) and gives a unique value for \(Y\).
- (It’s not a typical function; just a lookup.)

Warning 1: Equivalent sets of FDs

When we write a set of FDs, we mean that all of them hold.
We can very often rewrite sets of FDs in equivalent ways.
When we say \(S1\) is equivalent to \(S2\) we mean:
- \(S1\) holds in a relation \(\iff\) \(S2\) does.

Tip 1: Example of FD

Create an instance of \(R\) that violates \(BC\to D\):

A	B	C	D
1	2	3	4
1	2	3	5

Is the FD \(A\to BC\) equivalent to the two FDs \(A\to B\) and \(Aà\to C\)?
- Yes, because if \(A\to BC\) holds, then \(A\to B\) and \(A\to C\) must hold.
Is the FD \(AB\to C\) equivalent to the two FDs \(A\to B\), \(A\to C\) ?
- No.

A	B	C
1	2	3
1	3	5

Splitting rules of FD:
- Can we split (explode) the RHS of an FD and get multiple, equivalent FDs?
  - Yes.
- Can we split (explode) the LHS of an FD and get multiple, equivalent FDs?
  - No.
Coincidence or FD
- An FD is an assertion about every instance of the relation.
- You can’t know it holds just by looking at one instance.
- You must use knowledge of the domain to determine wheter an FD holds.

Tip 2: Coincidence or FD

Consider the following table

Teacher	Course	Text
Smith	CS170	Martin
Hall	CS171	Hoffman
Brown	CS253	Horowitz

Does \(\text{Teacher}\to\text{Course}\) hold in this instance?
- Yes.
Should \(\text{Teacher}\to\text{Course}\) hold?
- No, because one teacher can teach multiple courses.
Does \(\text{Course}\to\text{Text}\) hold in this instance?
- Yes.
Should \(\text{Course}\to\text{Text}\) hold?
- No, because one course can have multiple textbooks. For example, different professors may use different textbooks for the same course.
- It may hold in some cases, where the department has a policy that all sections of a course use the same textbook.

FDs are closely related to keys
- Recall: A superkey is a set of attributes for which no two rows can have the same values
- Suppose \(K\) is a set of attributes for relation \(R\).
- A claim about FDs:
  - \(K\) is a superkey for \(R\iff K\) functionally determines all of \(R\).

Note 3: FDs are a Generalization of Keys

Superkey: \(X\to R(\text{All Attributes})\)
Functional Dependency: \(X\to Y\)
A superkey must include all the attributes of the relation on the RHS
An FD can have just a subset of them.

Inferring FDs:
- Given a set of FDs, one can infer further FDs.
- Big task: given a set of FDs, infer every other FD that must also hold.
- Simpler task: given a set of FDs, infer whether a given FD must also hold.

Tip 3: Inferring FDs

If \(A\to B\) and \(B\to C\), can we infer \(A\to C\)?
- Yes.
- Another way of asking: Does the FD \(A\to C\) follow from \(A\to B\) and \(B\to C\)?
If \(A\to H\), \(C\to F\), and \(FG\to AD\) hold.
- Can we infer \(FA\to D\)?
- Can we infer \(CG\to FH\)?
- These two questions are hard to answer, so we need a systematic way to infer FDs.

Prove an FD follows:
- Method 1: First principles approach: You can prove it by referring back to
  - The FDs that you know hold, and
  - Apply FD interence ruls (axioms)
- Method 2 (easier): Closure Test

Closure Test

Note 4: Closure Test to Infer FDs

Assume you know the values of the LHS attributes, and figure out: everything else that can be determined.
If the result includes the RHS attributes, then you know that the FD holds.

Note 5: Closure of an Attribute

For a given arribute \(Y\), the closure \(Y^+\) is the set of all attributes such that \(Y\to A\) can be inferred.

Important 1: Closure Algorithm

\(Y\) is a set of attributes, \(S\) is a set of FDs.
Return: the closure of \(Y\) under \(S\).

Attribute_closure(Y, S):
  Initialize Y+ to Y
  Repeat until no more changes occur:
    If there is an FD LHS -> RHS in S such that LHS is in Y+:
      Add RHS to Y+

  Return Y+

If LHS is in \(Y^+\) and LHS \(\to\) RHS holds, we can add RHS to \(Y^+\).

Important 2: Closure Test Algorithm

\(S\) is a set of FDs, LHS \(\to\) RHS is a single FD.
Return: true \(\iff\) LHS \(\to\) RHS follows from \(S\).

FD_follows(S, LHS -> RHS):
  Y+ = Attribute_closure(LHS, S)
  Return (RHS is in Y+)

Tip 4: Closure Test Example

Suppose we have a relation on attributes \(A, B, C, D, E, F\), with FDs:
- \(AC\to F\)
- \(CEF\to B\)
- \(C\to D\)
- \(DC\to A\)
Does \(C\to F\)?
- Compute \(C^+\) under these FDs: \(C^+=CDAF\)
- Then, \(C\to F\) follows.
Does \(ACD\to B\)?
- Compute \((ACD)^+\) under these FDs: \((ACD)^+=ACDF\)
- Then, \(ACD\to B\) does not follow.

FD Projection

Motivation:
- Later, we will learn how to normalize a schema by decomposing relations (this is the whole point of this theory).
- We will need to be aware of what FDs hold in the new, smaller, relations.
- In other words, we must project our FDs onto the attributes of our new (smaller) relations.

Tip 5: FD Projection

Suppose we have a relation on attributes \(ABCDE\) with FDs: \(A\to C\), \(C\to E\), \(E\to BD\)
We want to project FDs onto attributes \(ABC\)
To project onto a set of attributes, we systematically consider every possible LHS of an FD that might hold on those attributes.
\(A^+={\color{red}AC}E{\color{red}B}D\). However, we only care about the smaller table we are projecting onto, so we only keep the attributes in the projection.
- \(A\to BC\).
\(B^+=B\)
\(C^+={\color{red}C}E{\color{red}B}D\)
- \(C\to B\)
\(AB^+\) and \(AC^+\)
- We don’t need to consider any supersets of \(A\). \(A\) already determines all of the attributes in the projection \(ABC\). So, supersets of \(A\) will only yield FDs that are already implied by \(A\to BC\).
- Strength/Power of FDs: \(X\to Y\) is a stronger FD than \(XZ\to Y\).
\(BC^+=BCED\)
The projection of the FDs onto \(ABC\) is \(\{A\to BC,\ C\to B\}\)

Database Normalization (BCNF)

Note 6: Decomposition

To improve a badly-designed schema \(R(A_1,A_2,\dots,A_n)\), we will decompose it into smaller relations \(A(B_1,B_2,\dots,B_m)\) and \(T(C_1,C_2,\dots,C_k)\) such that

\(S={\Large{\pi}}_{B_1,B_2,\dots, B_m}(R)\),
\(T={\Large{\pi}}_{C_1,C_2,\dots,C_k}(R)\), and
\(\{A_1,A_2,\dots,A_n\}=\{B_1,B_2,\dots,B_m\}\cup\{C_1,C_2,\dots,C_k\}\)

But which decomposition to use?
- Decomposition can improve a schema, but which decomposition? There are many possible decompositions.
- A “normal form” helps us characterize a decomposition, in terms of satisfied properties.
  - Boyce-Codd Normal Form (BCNF) guarantees no anomalies.

BCNF Test

Note 7: Boyce-Codd Normal Form (BCNF)

We say a relation \(R\) is in BCNF if: for every nontrivial FD \(X\to Y\) that holds in \(R\), \(X\) is a superkey.

Recall: nontrivial means \(Y\) is not contained in \(X\)
Recall: a super key does not have to be minimal.
In order words, BCNF requries that “Only things that FD everything, can FD anything.”
FDs with LHS that is not a superkey make the relation prone to redundacy, anomalies.

Tip 6: Example of BCNF Test

Suppose we have a relation \(\text{Students}(\text{SID}, \text{email}, \text{course}, \text{term}, \text{prof}, \text{grade})\) and that these FDs hold:
- \(\{\text{SID}\to\text{email}; \text{course},\text{term}\to\text{prof};\text{SID},\text{course}\to\text{grade}\}\)
Is this relation in BCNF?
Approach: Test if the LHS in each FD is a superkey.
- \(\text{SID}^+=\text{SID}, \text{email}\)
- LHS is not a superkey, then \(R\) is not in BCNF. We don’t need to check other FDs.

BCNF Decomposition

Important 3: BCNF Decomposition Algorithm

\(R\) is a relation; \(F\) is a set of FDs
Return: a BCNF decomposition of \(R\), given these FDs.

BCNF_decomp(R, F):
  If an FD X -> Y in F violates BCNF
    1. Compute X+
    2. Replace R by two relations with shcemas:
      R_1 = X+
      R_2 = (R - X+) + X
      // X is the common between R_1 and R_2
    3. Project the FD's F onto R_1 and R_2
    4. Recursively apply BCNF_decomp to R_1 and R_2

If more than one FD violates BCNF, you may decompose based on any one of them.
- Because of this, there may be multiple possible results.
The new relations we create in one step may not be in BCNF. We must recurse.
- We only keep the relations at the “leaves.”

Tip 7: Example of BCNF Decomposition

Suppose we have the following FDs for \(R(A, B, C, D, E)\): \[ A\to B, CD\to E \]

Apply BCNF Decomposition:

Iteration 1: \(R\)
- Consider \(A\to B\): \(A^+=AB\), and \(A\) is not a superkey (violates BCNF).
- Decompose: \(R_1=AB\), \(R_2=ACDE\)
- Project FDs: \(R_1(A,B)\):
  - \(A\to B\) (\(A\) is a super key)
  - \(R_2(A, C, D, E): CD\to E\) (\(CD\) is not a superkey).
Iteration 2: Decompose \(R_2\)
- Consider \(R_2(A, C, D, E)\): \(CD^+=CDE\), and \(CD\to E\), \(CD\) is not a superkey
- Decompose: \(R_3=CDE\), \(R_4=ACD\)
- Project FDs:
  - \(R_3(C, D, E)\): \(CD\to E\) (\(CD\) is a superkey)
  - \(R_4(A, C, D)\): No FDs. Key: \(ACD\)
BCNF Decomposition Result: \(R_1(A, B)\), \(R_3(C, D, E)\), \(R_4\qty(A, C, D)\)

Speed-ups for BCNF decomposition
- Don’t need to know all superkeys.
  - Only need to check whether the LHS of each FD is a superkey.
  - Use the closure test (simple and fast)
- When projecting FDs onto a new relation, check each new FD:
  - Does the new relation violate BCNF because of this FD?
  - If os, abort the projection. You are about the discard this relation anyway (and domcpose further).

Properties of Decomposition

What we want from a decomposition (in general, not BCNF specifically)
- No anomalies
- Lossless Join: it should be possible to
  - project the original relations onto the decomposed schema
  - then reconstruct the original by joining. We should get back exactly the original tuples.
- Dependency Preservation:
  - All the original FD’s should be satisfied.
What is lost in a “lossy” join?
- For any decomposition, it is the case that:
  - \(r\subseteq r_1\bowtie\cdots\bowtie r_n\)
  - i.e., we will get back every tuple.
- But it may not be the case that:
  - \(r\supseteq r_1\bowtie\cdots\bowtie r_n\)
  - i.e., we can get spurious tuples.

Properties of BCNF decomposition, specifically

BCNF ensures no anomalies.
BCNF ensures lossless join.
However, the BCNF property does not guarantee lossless join.
- If you use the BCNF decomposition algorithm, then yes, a lossless join is guaranteed.
- If you generate a decomposition some other way
  - You have to check to make sure you have a lossless join.
  - Even if your schema is in BCNF!
BCNF does not ensure dependency preservation.
BCNF does not ensure that the decomposition is a “good” one.
- BCNF decomposition deos not guarantee preservation of dependencies.
- i.e., in the decomposed schema that results, it may be possible to create an instance that:
  - satisfies all the projected FDs in the final schema,
  - but violates one of the original FDs.
- Why? Because the algorithm goes too far, and it breaks relations down too much.

Tip 8: Example of BCNF not Preserving Original FDs

Consider \(\underline{C}SJDPQV\), \(C\) is a key, \(JP\to C\) and \(SD\to P\).

BCNF decomposition: \(\underline{C}SJDQV\) and \(\underline{SD}P\).
Problem: Checking \(JP\to C\) requires a joint (and there is no theoretical guarantee it will hold…)

Warning 2: Is there a decomposition that satisfies all \(3\) properties?

No! There is always a tradeoff.
For example, another popular standard, “3NF” satisfies properties 2 and 3 only.

Warning 3: Remakrs about Normal Forms

Normalization:
- The process of decomposing unsatisfactory “bad” relations by breaking up their attributes into smaller relations.
Normal form:
- Condition using keys and FDs of a relation to certify whether a relation schema is in a partiuclar normal form.
In parctice:
- Normalization is carried out in practice so that resulting designs are of high quality and meet the desirable properties. (not always though)
- The pratical unitily of thse NFs becomes questionable when the constraints on which they are absed are hard to undrstand or to detect.
- The databse designers need not nromalize to the highest possible normal form (usually up to 3NF and BCNF. Higher NFs are rare in practice).

Candidate Key vs. Primary Key
- A relation can have more than one key:
  - Eg. 1: Donors in a Donation schema can have a driverLicenseNum and SSN, eahc uniquely identifying a donor.
  - Eg. 2: Each student has a studentID that uniquely indentifies them, as well as a NetID that does the same.
- In a DBMS, the DBA (DB Administrator) or the SE (Software Engineer) designing the schema typically chooses one to be the primary key. Other keys are called candidate keys.
- Characteristic of a primary key (PK) in SQL tables:
  - minimal set of attributes (columns) that uniquely identifies a tuple (row)
  - there can be at most one PK in a table
  - all attributes in a PK can never be null
  - PK attributes are automatically indexed. (Indexes are special lookup tables that help in fast retrieval of tuples from a table. Think of them as “pointers” to data in a table. An index is mall, fast, and optimized for quick lookups. It is especially useful for searching large tables and joining tables. Indexes are typically stored in a structure that optimizes efficient searches (e.g. B-tree))
- What about candidate kesy (CK)?
  - also a minimal set of attributes that uniquely identifies a tuple (row)
  - a table can have multiple CKs
  - attribute in a CK may have null values
  - not automatically indexed
  - CKs signify which kesy can be used as PKs

Tip 9: Primary and Candidate Keys in SQL

CREATE TABLE Student (
  studentID int PRIMARY KEY, -- primary key
  netID VARCHAR(15) UNIQUE, -- candidate key
  email VARCHAR(100) UNIQUE, -- candidate key
  ...
  ...
);