ABOUT ME

Today
Yesterday
Total
  • Database Normalization
    Database 2024. 3. 11. 15:01

    데이터베이스 정규화(Normalization)에 대해 정리하였습니다.

     

     


    < Introduction to Normalization >

    Database normalization is a process by which database tables are modified/restructured to minimize redundancy

     

     


    Duplicated Data >

    When columns of a database table do not depend on (i.e., describe) the primary key, the same data may be duplicated in multiple places.

     

    Example : Let’s now examine the college table

    The primary key of this table is student_id (a unique student identifier), and most of the columns describe characteristics of students, including their:

    • name
    • class year
    • email
    • major(s)
    • advisor

    But, This table also contains columns that do not describe students directly.

    the columns advisor_nameadvisor_departmentadvisor_building, and advisor_email describe further details about advisors

    meanwhile, major_1_credits_reqd and major_2_credits_reqd describe further detail about majors.

    SELECT advisor_name, advisor_department, advisor_email
    FROM college
    WHERE advisor_name = 'Brunson';

     

     


    < Problem : Duplicated Data Problem>

    Problem1: if we want to modify the duplicated information, we’ll have to make the same updates in multiple locations.

    Problem2: new data cannot be inserted into the table until a primary key is known.

     

    Problem1 Example : A faculty member named Professor 'Hill' changes their email address to 'hill@college.edu'

    UPDATE college
    SET advisor_email = 'hill@college.edu'
    WHERE advisor_name = 'Hill';

    Unfortunately, this could cause problems if there are multiple professors with the same name.

    After we run the code above : 

    advisor_name advisor_department advisor_email
    Hill Biology hill@college.edu
    Hill English hill@college.edu
    ... ... ...

    If advisor information were housed in a separate table, we could have imposed a UNIQUE or PRIMARY KEY constraint on the advisor_email column to make sure that no two advisors have the same emails recorded in our database.


    Problem2 Example : 

    Suppose that the school hires a new advisor in the Computer Science department. The advisor’s name is Professor Algorithm and his email is algorithm1000@college.edu.

    Unfortunately, we cannot add a new row to this database table because Professor Algorithm has not been assigned to advise any students yet — and we need a primary key (associated with an individual student) in order to insert a new row of data.

     

     


    Problem : Search and Sort Efficiency >

    Consider the college table.
    Besides the fact that the major-related columns are not directly dependent on the primary key, you may also notice that there are two sets of columns where a student’s major (and related information) can be recorded.

     

    Example : 

    suppose that your boss is interested in figuring out which majors are most popular. They ask you to produce a table of unique majors along with the number of students who have declared each one, sorted by popularity.

    SELECT major_1, count(*)
    FROM majors
    GROUP BY major_1
    ORDER BY count DESC;
    major_1 count
    History 53
    Political Science 49
    Computer Science 49
    English 49
    Spanish 47
    Geology 46

    The problem is that these counts are incorrect — they completely ignore majors recorded in the major_2 column!

    To remedy this, we’ll have to join the major_1 and major_2 columns together somehow, creating additional complexity.

     

     


    Restructuring the Advisor Columns >

    We saw that every student with the same advisor has identical information recorded in all advisor-related columns in the college table.
    One way to address this is by moving the four advisor-related columns into their own table, with only one row per unique advisor. This helps ensure every table has its own purpose or concern.

     

    • To create a new table from an existing one, we can precede any query with :
      CREATE TABLE new_table_name AS
       

    Example)

    CREATE TABLE advisors AS
    SELECT DISTINCT advisor_email, advisor_name, advisor_department
    FROM college;
    
    ALTER TABLE college
    DROP COLUMN advisor_name,
    DROP COLUMN advisor_department;
    
    SELECT *
    FROM advisors;

     

     


    Restructuring the Major Columns >

    We need two additional tables (as opposed to one) because the relationship between students and majors is many-to-many (multiple students can have the same major and each student can have multiple majors), while each major has only one value of credits_reqd.

     

    < Database Schema between students and majors >

    CREATE TABLE majors AS
    SELECT DISTINCT major_1 AS major, major_1_credits_reqd AS major_credits_reqd
    FROM college
    UNION
    SELECT DISTINCT major_2 AS major, major_2_credits_reqd AS major_credits_reqd
    FROM college
    WHERE major_2 IS NOT NULL;
    
    CREATE TABLE students_majors AS
    SELECT major_1 as major, student_id 
    FROM college
    UNION ALL
    SELECT major_2 as major, student_id
    FROM college
    WHERE major_2 IS NOT NULL;
    
    ALTER TABLE college
    DROP COLUMN major_1,
    DROP COLUMN major_1_credits_reqd,
    DROP COLUMN major_2,
    DROP COLUMN major_2_credits_reqd;
    
    SELECT *
    FROM majors
    LIMIT 5;
    
    SELECT *
    FROM students_majors
    ORDER BY student_id
    LIMIT 5;

     

     


    Creating Versus Modifying a Database Schema >

    Moving and dropping columns is also not quite enough to turn the single table from into the many-table version we likely would have designed.

     

    Example : 

    non-normalized version VS normalized version:

     there are some important differences. For example:

    • The advisor_name column is renamed name in the advisors table.
    • The college table is renamed students because all non-student-related data has now been moved to other tables.
    • The primary keys in the advisors, students, and majors tables are now unique integer values. While it is valid to use advisor_email as a primary key in the advisors table, it can make things more difficult when we want to update or delete an email address.

    In the four-table schema, we’ll also want to implement some constraints that were either not necessary or not possible in the original version. For example:

    • We can now implement a UNIQUE constraint on the email column of the advisors table, ensuring that no two advisors have the same email listed.
    • We can also enforce a foreign key constraint on the advisor_id column of the students table so that each advisor_id in that table matches an id in the advisors table.

     

     


    Database Structure and Use >

    It’s worth noting that normalization does not exclusively make queries and modification easier. When we split a large database table into multiple smaller tables, some queries may actually become more complex.

    So, the most important design consideration is how a database will be used in the future.

     

    Example :

    If we want to know how many students are advised by faculty in each department, we’ll have to join the students and advisors tables back together.

    SELECT COUNT(students.id), advisors.department as advisor_department
    FROM students, advisors
    WHERE students.advisor_id = advisors.id
    GROUP BY advisor_department;
    count advisor_department
    68 Computer Science
    53 Environmental Science
    24 Chemistry

    That wasn’t too complicated - but note that we could have accomplished the same thing by querying the original college table without any joins!

     

     

    Article : Database Normalization Basics

     

    The Basics of Normalizing a Database

    Database normalization can save storage space and ensure the consistency of your data. Here are the basics in an introductory article.

    www.lifewire.com

     

    'Database' 카테고리의 다른 글

    댓글

Designed by Tistory.