12 Database Design Theory: Normalization
Introduction
- It allows us to improve a schema systematically.
- General idea:
- Express constraints on the relationship between attributes.
- Use thse to decompose the relations.
- Ultimately, get a schema that is in a “normal form” that guarantees good properties, such as no anomalies.
- “Normal” in the sense of conforming to a standard.
The process of converting a schema to a normal form is called normalization.
- End goal: Design a “good” normalized schema that reduces redundancy (thus reducing storage waste, anomalies) and ensures data integrity (BCNF Normalization).
Functional Dependencies (FD)
- Poorly designed schemas can lead to anomalies.
- In any domain, there are relationships between attribute values.
- For example:
- Every part has 1 manufacturer
- Every manufacturer has 1 address
- Every seller has 1 address
- If so, this table will have redundant data.
part | manufacturer | manAddress | seller | sellerAddress | price |
---|---|---|---|---|---|
P1 | M1 | A1 | S1 | A2 | 100 |
- Update anomaly: If one manufacturer moves to a new address, we need to update all the rows with the same manufacturer. If we only update one row, the data will be inconsistent.
- Deletion anomaly: If one manufacturer stops selling a part, we delete the row. But we also lose the address information.
Suppose \(R\) is a relation, and \(X\) and \(Y\) are subsets of the attributes of \(R\). - Then, \(X\to Y\), read as “\(X\) funtionally determines \(Y\)”, is called a functional dependency. - \(X\to Y\) holds in \(R\) \(\iff\) two tuples agree on all the attributes in set \(X\), they must also agree on all the attributes in set \(Y\).
- Functional dependencies are constraints on the relation.
- Symbolically, \(A\to B\) means
- \(\forall\) tuples \(t_1, t_2\), \((t_1[A]=t_2[A])\implies(t_1[B]=t_2[B])\), or
- \(\neg\exists\) tuples \(t_1, t_2\) such that \((t_1[A]=t_2[a])\land(t_1[B]\neq t_2[B])\).
- Functional dependencies can be extended to multiple attributes.
- Why “functional dependency”?
- “dependency” because the value of \(Y\) depends on the value of \(X\).
- “functional” because there is a function that takes a value for \(X\) and gives a unique value for \(Y\).
- (It’s not a typical function; just a lookup.)
- When we write a set of FDs, we mean that all of them hold.
- We can very often rewrite sets of FDs in equivalent ways.
- When we say \(S1\) is equivalent to \(S2\) we mean:
- \(S1\) holds in a relation \(\iff\) \(S2\) does.
- Create an instance of \(R\) that violates \(BC\to D\):
A | B | C | D |
---|---|---|---|
1 | 2 | 3 | 4 |
1 | 2 | 3 | 5 |
- Is the FD \(A\to BC\) equivalent to the two FDs \(A\to B\) and \(Aà\to C\)?
- Yes, because if \(A\to BC\) holds, then \(A\to B\) and \(A\to C\) must hold.
- Is the FD \(AB\to C\) equivalent to the two FDs \(A\to B\), \(A\to C\) ?
- No.
A | B | C |
---|---|---|
1 | 2 | 3 |
1 | 3 | 5 |
- Splitting rules of FD:
- Can we split (explode) the RHS of an FD and get multiple, equivalent FDs?
- Yes.
- Can we split (explode) the LHS of an FD and get multiple, equivalent FDs?
- No.
- Can we split (explode) the RHS of an FD and get multiple, equivalent FDs?
- Coincidence or FD
- An FD is an assertion about every instance of the relation.
- You can’t know it holds just by looking at one instance.
- You must use knowledge of the domain to determine wheter an FD holds.
Consider the following table
Teacher | Course | Text |
---|---|---|
Smith | CS170 | Martin |
Hall | CS171 | Hoffman |
Brown | CS253 | Horowitz |
- Does \(\text{Teacher}\to\text{Course}\) hold in this instance?
- Yes.
- Should \(\text{Teacher}\to\text{Course}\) hold?
- No, because one teacher can teach multiple courses.
- Does \(\text{Course}\to\text{Text}\) hold in this instance?
- Yes.
- Should \(\text{Course}\to\text{Text}\) hold?
- No, because one course can have multiple textbooks. For example, different professors may use different textbooks for the same course.
- It may hold in some cases, where the department has a policy that all sections of a course use the same textbook.
- FDs are closely related to keys
- Recall: A superkey is a set of attributes for which no two rows can have the same values
- Suppose \(K\) is a set of attributes for relation \(R\).
- A claim about FDs:
- \(K\) is a superkey for \(R\iff K\) functionally determines all of \(R\).
- Superkey: \(X\to R(\text{All Attributes})\)
- Functional Dependency: \(X\to Y\)
- A superkey must include all the attributes of the relation on the RHS
- An FD can have just a subset of them.
- Inferring FDs:
- Given a set of FDs, one can infer further FDs.
- Big task: given a set of FDs, infer every other FD that must also hold.
- Simpler task: given a set of FDs, infer whether a given FD must also hold.
- If \(A\to B\) and \(B\to C\), can we infer \(A\to C\)?
- Yes.
- Another way of asking: Does the FD \(A\to C\) follow from \(A\to B\) and \(B\to C\)?
- If \(A\to H\), \(C\to F\), and \(FG\to AD\) hold.
- Can we infer \(FA\to D\)?
- Can we infer \(CG\to FH\)?
- These two questions are hard to answer, so we need a systematic way to infer FDs.
- Prove an FD follows:
- Method 1: First principles approach: You can prove it by referring back to
- The FDs that you know hold, and
- Apply FD interence ruls (axioms)
- Method 2 (easier): Closure Test
- Method 1: First principles approach: You can prove it by referring back to
Closure Test
- Assume you know the values of the LHS attributes, and figure out: everything else that can be determined.
- If the result includes the RHS attributes, then you know that the FD holds.
For a given arribute \(Y\), the closure \(Y^+\) is the set of all attributes such that \(Y\to A\) can be inferred.
- \(Y\) is a set of attributes, \(S\) is a set of FDs.
- Return: the closure of \(Y\) under \(S\).
Attribute_closure(Y, S):
Initialize Y+ to Y
Repeat until no more changes occur:
If there is an FD LHS -> RHS in S such that LHS is in Y+:
Add RHS to Y+
Return Y+
- If LHS is in \(Y^+\) and LHS \(\to\) RHS holds, we can add RHS to \(Y^+\).
- \(S\) is a set of FDs, LHS \(\to\) RHS is a single FD.
- Return: true \(\iff\) LHS \(\to\) RHS follows from \(S\).
FD_follows(S, LHS -> RHS):
Y+ = Attribute_closure(LHS, S)
Return (RHS is in Y+)
- Suppose we have a relation on attributes \(A, B, C, D, E, F\), with FDs:
- \(AC\to F\)
- \(CEF\to B\)
- \(C\to D\)
- \(DC\to A\)
- Does \(C\to F\)?
- Compute \(C^+\) under these FDs: \(C^+=CDAF\)
- Then, \(C\to F\) follows.
- Does \(ACD\to B\)?
- Compute \((ACD)^+\) under these FDs: \((ACD)^+=ACDF\)
- Then, \(ACD\to B\) does not follow.
FD Projection
- Motivation:
- Later, we will learn how to normalize a schema by decomposing relations (this is the whole point of this theory).
- We will need to be aware of what FDs hold in the new, smaller, relations.
- In other words, we must project our FDs onto the attributes of our new (smaller) relations.
- Suppose we have a relation on attributes \(ABCDE\) with FDs: \(A\to C\), \(C\to E\), \(E\to BD\)
- We want to project FDs onto attributes \(ABC\)
- To project onto a set of attributes, we systematically consider every possible LHS of an FD that might hold on those attributes.
- \(A^+={\color{red}AC}E{\color{red}B}D\). However, we only care about the smaller table we are projecting onto, so we only keep the attributes in the projection.
- \(A\to BC\).
- \(B^+=B\)
- \(C^+={\color{red}C}E{\color{red}B}D\)
- \(C\to B\)
- \(AB^+\) and \(AC^+\)
- We don’t need to consider any supersets of \(A\). \(A\) already determines all of the attributes in the projection \(ABC\). So, supersets of \(A\) will only yield FDs that are already implied by \(A\to BC\).
- Strength/Power of FDs: \(X\to Y\) is a stronger FD than \(XZ\to Y\).
- \(BC^+=BCED\)
- The projection of the FDs onto \(ABC\) is \(\{A\to BC,\ C\to B\}\)
Database Normalization (BCNF)
To improve a badly-designed schema \(R(A_1,A_2,\dots,A_n)\), we will decompose it into smaller relations \(A(B_1,B_2,\dots,B_m)\) and \(T(C_1,C_2,\dots,C_k)\) such that
- \(S={\Large{\pi}}_{B_1,B_2,\dots, B_m}(R)\),
- \(T={\Large{\pi}}_{C_1,C_2,\dots,C_k}(R)\), and
- \(\{A_1,A_2,\dots,A_n\}=\{B_1,B_2,\dots,B_m\}\cup\{C_1,C_2,\dots,C_k\}\)
- But which decomposition to use?
- Decomposition can improve a schema, but which decomposition? There are many possible decompositions.
- A “normal form” helps us characterize a decomposition, in terms of satisfied properties.
- Boyce-Codd Normal Form (BCNF) guarantees no anomalies.
BCNF Test
We say a relation \(R\) is in BCNF if: for every nontrivial FD \(X\to Y\) that holds in \(R\), \(X\) is a superkey.
- Recall: nontrivial means \(Y\) is not contained in \(X\)
- Recall: a super key does not have to be minimal.
- In order words, BCNF requries that “Only things that FD everything, can FD anything.”
- FDs with LHS that is not a superkey make the relation prone to redundacy, anomalies.
- Suppose we have a relation \(\text{Students}(\text{SID}, \text{email}, \text{course}, \text{term}, \text{prof}, \text{grade})\) and that these FDs hold:
- \(\{\text{SID}\to\text{email}; \text{course},\text{term}\to\text{prof};\text{SID},\text{course}\to\text{grade}\}\)
- Is this relation in BCNF?
- Approach: Test if the LHS in each FD is a superkey.
- \(\text{SID}^+=\text{SID}, \text{email}\)
- LHS is not a superkey, then \(R\) is not in BCNF. We don’t need to check other FDs.
BCNF Decomposition
- \(R\) is a relation; \(F\) is a set of FDs
- Return: a BCNF decomposition of \(R\), given these FDs.
BCNF_decomp(R, F):
If an FD X -> Y in F violates BCNF
1. Compute X+
2. Replace R by two relations with shcemas:
R_1 = X+
R_2 = (R - X+) + X
// X is the common between R_1 and R_2
3. Project the FD's F onto R_1 and R_2
4. Recursively apply BCNF_decomp to R_1 and R_2
- If more than one FD violates BCNF, you may decompose based on any one of them.
- Because of this, there may be multiple possible results.
- The new relations we create in one step may not be in BCNF. We must recurse.
- We only keep the relations at the “leaves.”
Suppose we have the following FDs for \(R(A, B, C, D, E)\): \[ A\to B, CD\to E \]
Apply BCNF Decomposition:
- Iteration 1: \(R\)
- Consider \(A\to B\): \(A^+=AB\), and \(A\) is not a superkey (violates BCNF).
- Decompose: \(R_1=AB\), \(R_2=ACDE\)
- Project FDs: \(R_1(A,B)\):
- \(A\to B\) (\(A\) is a super key)
- \(R_2(A, C, D, E): CD\to E\) (\(CD\) is not a superkey).
- Iteration 2: Decompose \(R_2\)
- Consider \(R_2(A, C, D, E)\): \(CD^+=CDE\), and \(CD\to E\), \(CD\) is not a superkey
- Decompose: \(R_3=CDE\), \(R_4=ACD\)
- Project FDs:
- \(R_3(C, D, E)\): \(CD\to E\) (\(CD\) is a superkey)
- \(R_4(A, C, D)\): No FDs. Key: \(ACD\)
- BCNF Decomposition Result: \(R_1(A, B)\), \(R_3(C, D, E)\), \(R_4\qty(A, C, D)\)
- Speed-ups for BCNF decomposition
- Don’t need to know all superkeys.
- Only need to check whether the LHS of each FD is a superkey.
- Use the closure test (simple and fast)
- When projecting FDs onto a new relation, check each new FD:
- Does the new relation violate BCNF because of this FD?
- If os, abort the projection. You are about the discard this relation anyway (and domcpose further).
- Don’t need to know all superkeys.
Properties of Decomposition
- What we want from a decomposition (in general, not BCNF specifically)
- No anomalies
- Lossless Join: it should be possible to
- project the original relations onto the decomposed schema
- then reconstruct the original by joining. We should get back exactly the original tuples.
- Dependency Preservation:
- All the original FD’s should be satisfied.
- What is lost in a “lossy” join?
- For any decomposition, it is the case that:
- \(r\subseteq r_1\bowtie\cdots\bowtie r_n\)
- i.e., we will get back every tuple.
- But it may not be the case that:
- \(r\supseteq r_1\bowtie\cdots\bowtie r_n\)
- i.e., we can get spurious tuples.
- For any decomposition, it is the case that:
Properties of BCNF decomposition, specifically
- BCNF ensures no anomalies.
- BCNF ensures lossless join.
- However, the BCNF property does not guarantee lossless join.
- If you use the BCNF decomposition algorithm, then yes, a lossless join is guaranteed.
- If you generate a decomposition some other way
- You have to check to make sure you have a lossless join.
- Even if your schema is in BCNF!
- BCNF does not ensure dependency preservation.
- BCNF does not ensure that the decomposition is a “good” one.
- BCNF decomposition deos not guarantee preservation of dependencies.
- i.e., in the decomposed schema that results, it may be possible to create an instance that:
- satisfies all the projected FDs in the final schema,
- but violates one of the original FDs.
- Why? Because the algorithm goes too far, and it breaks relations down too much.
Consider \(\underline{C}SJDPQV\), \(C\) is a key, \(JP\to C\) and \(SD\to P\).
- BCNF decomposition: \(\underline{C}SJDQV\) and \(\underline{SD}P\).
- Problem: Checking \(JP\to C\) requires a joint (and there is no theoretical guarantee it will hold…)
- No! There is always a tradeoff.
- For example, another popular standard, “3NF” satisfies properties 2 and 3 only.
- Normalization:
- The process of decomposing unsatisfactory “bad” relations by breaking up their attributes into smaller relations.
- Normal form:
- Condition using keys and FDs of a relation to certify whether a relation schema is in a partiuclar normal form.
- In parctice:
- Normalization is carried out in practice so that resulting designs are of high quality and meet the desirable properties. (not always though)
- The pratical unitily of thse NFs becomes questionable when the constraints on which they are absed are hard to undrstand or to detect.
- The databse designers need not nromalize to the highest possible normal form (usually up to 3NF and BCNF. Higher NFs are rare in practice).
- Candidate Key vs. Primary Key
- A relation can have more than one key:
- Eg. 1: Donors in a Donation schema can have a
driverLicenseNum
andSSN
, eahc uniquely identifying a donor. - Eg. 2: Each student has a
studentID
that uniquely indentifies them, as well as aNetID
that does the same.
- Eg. 1: Donors in a Donation schema can have a
- In a DBMS, the DBA (DB Administrator) or the SE (Software Engineer) designing the schema typically chooses one to be the primary key. Other keys are called candidate keys.
- Characteristic of a primary key (PK) in SQL tables:
- minimal set of attributes (columns) that uniquely identifies a tuple (row)
- there can be at most one PK in a table
- all attributes in a PK can never be null
- PK attributes are automatically indexed. (Indexes are special lookup tables that help in fast retrieval of tuples from a table. Think of them as “pointers” to data in a table. An index is mall, fast, and optimized for quick lookups. It is especially useful for searching large tables and joining tables. Indexes are typically stored in a structure that optimizes efficient searches (e.g. B-tree))
- What about candidate kesy (CK)?
- also a minimal set of attributes that uniquely identifies a tuple (row)
- a table can have multiple CKs
- attribute in a CK may have null values
- not automatically indexed
- CKs signify which kesy can be used as PKs
- A relation can have more than one key:
CREATE TABLE Student (
int PRIMARY KEY, -- primary key
studentID VARCHAR(15) UNIQUE, -- candidate key
netID VARCHAR(100) UNIQUE, -- candidate key
email ...
...
);