Relational Databases

Introduction

Cuneiform tablet with bread and flour distributions, Ur III Period, c. 2100-2000 BC - Harvard Semitic Museum - Cambridge, MA - DSC06146

Tools

For our work with databases, we’ll be using three tools:

  • SQLite (database)

  • sqlite3 (Python module)

  • DB Browser (GUI application)

Tools: SQLite

  • lightweight relational database (RDBMS)

  • runs locally and stores data in files

  • no separate database server required

  • already installed on your system (as part of Python)

  • project website: https://www.sqlite.org

Tools: sqlite3

  • module in the standard library

  • facilitates interaction with the database by:

    1. managing connections

    2. passing queries to the database system

    3. making results of queries accessible to Python

Tools: DB Browser for SQLite

  • graphical application (GUI)

  • allows you to view the contents of a SQLite database

  • an optional convenience for this course

  • downloadable from https://sqlitebrowser.org

Databases in Theory

Image of a woman working with the card catalog in the Library of Congress, 1942

The Relational Model

  • Invented by E. F. Codd in the 1970s

  • Data stored in tables (relations)

  • Relations consist of rows (tuples)

  • Tuples have columns (attributes) that hold values

  • The term "database" now often connotes this type of relational database

Other DB Systems

Database design is a vast topic. In fact it is an entire course (INST 327)! Here are a just a few database systems you may encounter:

  • Relational:

    • mySQL/MariaDB

    • PostgreSQL

  • Non-relational ("NoSQL"):

    • Document: MongoDB (JSON), BaseX (XML)

    • Graph: Neo4j, Fuseki

Normalization

  • Through a design process, data are "normalized"

  • Normalization means applying a set of rules (called "normal forms") to increase efficiency and reduce redundancy

  • Many specifics of database design and normalization are outside the scope of this course

SQL

  • a standardized language is used to interact with a relational database

  • this is known as the Structured Query Language (SQL)

  • SQL has different flavors but generally works the same way across database systems

CRUD

Basic database operations are called "CRUD":

  • CREATE: add data to the database

  • READ: retrive data from one or more tables

  • UPDATE: make changes to the data

  • DELETE: remove rows from a table (potentially with a cascade effect)

Databases in Practice

Connecting: in-memory

  • Import sqlite3

  • Connect to an in-memory database

>>> import sqlite3
>>> conn = sqlite3.connect(':memory:')
>>> conn
<sqlite3.Connection object at 0x10507b110>

Connecting: database file

  • For persistent data, use a database file

>>> conn2 = sqlite3.connect('test.sqlite')
>>> conn2
<sqlite3.Connection object at 0x10507b030>

The connection object

  • This object manages the database connection

  • In addition, we need a cursor to manage state

  • The cursor sends queries and contains results

conn = sqlite3.connect('biblio.sqlite')
cursor = conn.cursor()

Setting up the database

  • Use a CREATE statement to set up a table

cq = '''CREATE TABLE books (
        title TEXT, author TEXT, date INTEGER
        )'''
cursor.execute(cq)

Create entries

  • Next add a row to the table with INSERT

iq = '''INSERT INTO books VALUES (
            '2001: A Space Odyssey',
            'Arthur C. Clarke',
            '1951'
            )'''
cursor.execute(iq)

Scaling up

  • This works fine, but can be impractical at scale

  • Larger numbers of records can be created with executemany()

data = [
    ("I, Robot", "Isaac Asimov", 1950),
    ("The Martian", "Andy Weir", 2012),
    ("The Left Hand Of Darkness", "Ursula K. Le Guin", 1969)
    ]

Scaling up (continued)

  • Map your data tuples onto a query using the ? placeholder

imq = '''INSERT INTO books VALUES (?,?,?)'''
cursor.executemany(imq, data)

Read

  • To view data in the database, use READ

  • .fetchall() can be used to get all results as a list

sq = '''SELECT title FROM books'''
books = cursor.execute(sq).fetchall()
print(books)
[('2001: A Space Odyssey',),('I, Robot',),('The Martian',),('The Left Hand Of Darkness',)]

Update

  • Now, imagine we need to update some data.

  • The Martian was in fact published in 2011, so let’s make that correction.

uq = '''UPDATE books
           SET year=2011
         WHERE title="The Martian"'''
cursor.execute(uq)

Verify Update

vq = '''SELECT *
          FROM books
         WHERE title="The Martian"'''
cursor.execute(vq)
print(cursor.fetchall())
[('The Martian', 'Andy Weir', 2011)]

Delete

  • Now, let’s look at deleting data

  • This works much as you would expect

dq = '''DELETE
          FROM books
         WHERE author="Isaac Asimov"'''
cursor.execute(dq)

Committing Changes

  • So far, what we have done has occurred in-memory

  • To persist, changes must be committed (saved)

  • In practice, you should commit frequently

conn.commit()
conn.close()

Summary

  • We have just covered a lot of ground

  • Connections, cursors, queries, CRUD

  • There is also a lot we have not touched

  • Next time we will look at normalization

  • Specifically: primary keys, foreign keys, and joins

Normalization in Depth

Why Normalize?

  • There are many reasons related to optimization

  • But the simplest way to think about it is this:

    1. Consider that one author can write many books

    2. Conversely one book can have many authors

    3. To model such relationships effectively, author data should be stored apart from book data

Identifiers (Keys)

  • The first requirement for modeling relationships between tables is to have unambiguous identifiers

  • These identifiers, called keys, allow data to be looked up

  • The unique id for a particular row in a table is called a primary key

Using Keys to Create Joins

  • Rows can also reference rows in other tables — this cross-reference is called a foreign key

  • For example, the row "Hamlet" in the plays table might reference "William Shakespeare" in the authors table

Creating Normalized Data

  • When designing a database, before doing any coding tables and their relationships should be mapped out

  • The diagram created during this mapping process is called an ERD

  • This stands for Entity-Relationship Diagram

  • In addition to mapping out relationships, you need to create code to analyze the data and write it to the correct locations

Creating Normalized Data (continued)

  • For example, in working with our list of books and authors, you might:

    1. Store the authors names in a separate table

    2. As you read the data file, lookup the author

    3. If the author is present already, get the id

    4. If the author is not present, add the author and get the id

    5. Add the book to the books table, referencing the author’s id

Flat Data

title author year
Things Fall Apart Chinua Achebe 1958
Chimera John Barth 1972
The Sot-Weed Factor John Barth 1960
Under the Volcano Malcolm Lowery 1947

Add Primary Keys

id title author year
1 Things Fall Apart Chinua Achebe 1958
2 Chimera John Barth 1972
3 The Sot-Weed Factor John Barth 1960
4 Under the Volcano Malcolm Lowry 1947

Move Authors to Own Table

id title author_id year
1 Things Fall Apart 1 1958
2 Chimera 2 1972
3 The Sot-Weed Factor 2 1960
4 Under the Volcano 3 1947
id name
1 Chinua Achebe
2 John Barth
3 Malcolm Lowry

Selecting Normalized Data

  • To lookup normalized data, you can use SQL’s JOIN syntax

  • You specify the fields to match on (linking foreign key to primary key)

jq = '''SELECT authors.name, books.title, books.year
        FROM books JOIN authors
        ON books.author_id=authors.id'''
books = cursor.execute(join_query, filter).fetchall()

Deleting Normalized Data

  • Normalizing data introduces some additional complications

  • Consider our authors and books examples

  • If you remove a row from the authors table, what happens to the author’s books?

  • There is a danger that orphaned rows will clutter the database

Deleting Normalized Data (continued)

  • In order to control the creation of bad data, SQL allows you to specify constraints in your database schema

  • Among the constraints is one called CASCADE DELETE

  • In essence, by specifying this constraint, you would force SQLite to remove books that were written by a deleted author when removing the author