CS 325 - Week 9 Lecture 1 - 2016-10-18
* normalization and normal forms
...part of basic "database" vocabulary;
* functional dependencies play a vital role here!
* as we mentioned earlier --
for day-to-day, operational use,
not all relations are equally "good" --
some ARE "better" than others;
* why? well, for one thing, for SOME relations,
changing the data can have UNDESIRABLE
consequences called
MODIFICATION ANOMALIES
...and we'll see that such anomalies can
often be reduced or eliminated by
redefining the relation(s) involved;
...often by breaking up a relation
into one or more related relations;
* when we break up relations to reduce/eliminate
modification anomalies,
that process is called NORMALIZATION;
(and when you go the other way,
combining relations, that's
DEnormalization, but that's not really a topic
for this course...)
* modification anomalies, then;
* some texts break these up into more categories,
but we'll go with two:
deletion anomalies
* occurs when deleting a row of a relation
may cause the loss of "additional" info
which it may not be reasonable
to have completely disappear
or, if you prefer:
it occurs when facts about two or
more entities are lost with one deletion;
* (for example, losing a project's name
because an employee is deleted in the
example table 1)
insertion anomalies
* when we cannot insert information that's
important to the scenario -- about some
entity -- but we cannot,
and this resulting restriction seems
unreasonable;
* (for example, that I cannot add to example
table 1 that new project 4 has the
name rainbow,
or that CompSci job class has a charge
per hour of $75)
* normalization is the process of "breaking up"
a relation into multiple relations
for the purpose of reducing modification
anomalies;
...breaking up relations with multiple "themes" [Kroenke]
so each resulting relation only has one "theme";
(but also maintaining desired relationships;
...referential integrity constraints tend to
be added during this breaking-up;)
* normal forms
* one classic set of normal forms:
first normal form (1NF)
second normal form (2NF)
third normal form (3NF)
Boyce-Codd normal form (BCNF)
fourth normal form (4NF)
fifth normal form (5NF)
sixth normal form (6NF)
^ these are superset/subsets --
if a set of relations is in 2NF, it is also in 1NF,
and if it is in 3NF, it is also in 1NF and 2NF, etc.
* and it turns out that you can think of normalization
as a process of working through these normal
forms to the level that makes sense for your
database;
* (I've read that 3NF is usually sufficient for
more operational databases)
1NF (First Normal Form)
* this one is pretty easy!
any true relation is, by definition, in 1NF.
* as a true relation, then, the relation meets
that definition:
* single-valued cells!
* unique and non-null primary keys!
* no duplicate rows!
* do you ever need to do much work to
get a set of relations into 1NF?
...most common issue I've seen/read about
is the realization that something is not truly
a relation because it has a multivalued attribute;
* and what you'll do is break out that
multivalued attribute into its own
relation;
* you remove the multivalued attribute to
a new relation,
adding the primary key of the original
relation TO the new relation AS a foreign
key referencing the original relation,
and make the primary key of the new
relation BOTH that foreign key and the
multivalued attribute
2NF - 2nd Normal Form
* because, relations in 1NF can have modification
anomalies due to certain functional dependencies;
...for example, there's a category of
functional dependencies called
PARTIAL dependencies -
* a partial dependency is when a non-primary key attribute is
dependent on just PART of its relation's primary key;
* a relation is said to be in
2NF if:
1) it is in 1NF
2) it includes NO partial dependencies
(ALL of its non-primary-key attributes
depend on their relation's "entire" primary key)
* SO, here's an algorithm for converting
a non-2NF 1NF relation into 2NF relations:
1. write each key component on its own line,
and each combination of key components on
its own line
empl_num
proj_num
(empl_num, proj_num)
2. for each non-primary-key attribute,
write it on the line for which it neeeeeeds
that "entire" primary key to determine it
empl_num - empl_name, job_class, chg_per_hr
proj_num - proj_name
(empl_num, proj_num) - hrs
3. for lines with a non-primary-key attribute
added, make a new relation out of that line,
whose primary key is the primary-key-combo
given, and (typically for the primary key whose
set includes multiple attributes)
make each a foreign key to its "base" relation
Employee(EMPL_NUM, empl_name, job_class, chg_per_hr)
Project(PROJ_NUM, proj_name)
Work_on_Proj(EMPL_NUM, PROJ_NUM, hrs)
foreign key (empl_num) references Employee,
foreign key (proj_num) references Project