Module 1 Units 10 and 11
Data Representation using Database Concepts and Advanced SQL
Reflection of unit
There are several ways that you can define your database and represent your dataset. Databases are now a core element for the success of big enterprises. Data-driven activities, like, trend analysis, predictive and statistical analysis, enable organisations to take strategic decisions. However, the database architectural design and implementation of a database must follow a framework in order to accommodate organisational needs.
Open-source databases are becoming a popular selection for the database community. The advantages of open source databases are that they are free and source code is available to customise the database according to your own need.
A Database Management System (DBMS) is a software system designed to manage collections of data that are large, shared, and persistent. The use of a DBMS provides facilities for efficient and effective querying, data security and integrity, and simultaneous accessing of data.
Persistent data, such as financial records, refers to data that contains core information, is unlikely to change and is stored on a hard drive or disk. Some amendments/updates may be needed but the information stored is relatively permanent, unlike transient data which is discarded at the end of a session.
A database model is a conceptual view of a database. It is a combination of constructs used to organise data. The most widespread data model is the relational data model. This represents a database in the form of relations; a 2-D table with rows and columns of data – the columns represent attributes which have a distinct name and are always referenced by their name, never by their position. A database may contain one or more such tables. A relation schema is used to describe a relation.
A schema is a description of the structure of an entire database. It is used to maintain the database.
Every relation (2-D table) in a database should have a Primary key – they are used to identify individual records and improves the functionality and reliability of our database tables. A primary key corresponds to value(s) in specified column(s) which must be unique for each row in a table and not be NULL. A Foreign key corresponds to value(s) in specified column(s) which must reference an existing record in another table (via it’s primary key or some other unique constraint). The Foreign Key acts as a common attribute (column) to relate information stored in different tables. The Foreign key can contain a NULL value, but we avoid this by replacing it with a flag.
A constraint defines certain properties that data in a database must comply with. When inserting a new value into a table, we need to make sure that the database constraints are not violated:
- The value of the attribute (column) should be of the correct data type.
- The Primary key attribute must (column) not be null.
- The key value(s) must not be the same as that of an existing tuple (row) in the same relation.
- The value of a foreign key, if any, must refer to an existing tuple (row) in the corresponding relation (table).
The process of normalisation is applied in order to ensure that the data is “good”, that is, structured in a logical and robust format. Normalisation refers to dividing the columns of a relation/table into two or more tables, duplicating those columns necessary to maintain relationships. For a database to be successfully normalised, the decomposition of the table has to be “lossless” (i.e. it should not lose any information whilst going through the decomposition). In other words, all the stages of normalisation should be verified and resolved.
Decomposition of a table goes through a set of normalisation processes (i.e. 1st, 2nd and 3rd or even Boyce-Codd Normal Form). The first stage in the normalisation process is to convert a table of un-normalised data into a table in first normal form. This means that we must extract the repeating groups of data that may appear in some rows of the table and replace them with rows/tuples where each column has only one value associated with it. A relation is in third normal form if there are no indirect, or transitive, dependencies between the attributes/columns. For a relation to be in third normal form, all rows must be directly dependent on the primary key.
SQL (pronounced “SEQUEL”) is a special-purpose programming language designed for manipulating data in a relational database management system.
A SQL statement is a set of instructions that are chained together in order for the program to understand and run a request.
SQL statements often begin with a keyword (e.g. SELECT, SHOW, USE) and are followed by clauses (e.g. the FROM clause).
Example of statements to get to specific info:
- SHOW DATABASES;
- USE people;
- SHOW TABLES;
- SELECT * FROM names;
- WHERE surname = ‘Nugent’,
- ORDER BY id;
SHOW DATABASES; - means display the available databases for the user (broadest step)
USE people; – means switch to a specific database (i.e. people database).
SHOW TABLES; - means display the available tables for the user
SELECT * FROM names; - means that we are selecting/requesting the data from all the table columns in the specified table (e.g. names table). The Asterix notation (*) in place of column names, indicates that all the attributes of the relation(s) specified in the FROM clause are to be retrieved!
The WHERE clause specifies which value should be updated (i.e. where the surname column equals ‘Nugent’).
The ORDER BY clause can be used to order rows in the query result. The order can either be ascending by using the key word ASC or descending by using the key word DESC. If no key word is specified, the default order is ascending.
Conclusion - the result of this example query is a “relation” (or table) containing the data of the people/person with the surname ‘Nugent’.
(Note: If the WHERE was omitted, all of the values would become updated).
Aggregate Functions are built-in functions which can operate on the values of a single column of a permanent table or the resulting (new) table of a query.
E.g. COUNT(*) returns the number of rows in the resulting table.
COUNT – counts the number of values in a specified column
SUM – totals the values in a specified column with a numeric type
AVG – calculates the average of the values in a specified column with a numeric type.
MIN – returns the minimum value in a specified column
MAX – returns the maximum value in a specified column
In SQL, there are 4 main data modifying keywords:
SELECT for getting data from the specified table.
INSERT for inserting data into the specified table.
UPDATE for updating existing data in the specified table.
DELETE for deleting data from the specified table.
These 4 concepts are often recognised with the CRUD acronym:
- Create - equivalent to the INSERT keyword
- Read - equivalent to the SELECT keyword
- Update - same
- Delete - same
How to insert values:
- Use the INSERT command (which represents the CREATE keyword in “CRUD”)
An example:
- USE harry_potter_characters;
- INSERT INTO catalog (name, muggle_or_wizard, house),
- Values (“Ron Weasley”, “Wizard”, “Gryffindor”);
To UPDATE a value/column within a table, we would write (in the terminal):
- UPDATE harry_potter_characters
- SET muggle_or_wizard = “Muggle”,
- house = “NA”
- WHERE id = 4;
DELETE example:
- DELETE FROM catalog
- WHERE id = 8;
(Note: The DELETE command operates on only one table at a time. Also, If the condition specified in the WHERE clause is FALSE, no tuple is deleted. Otherwise, if the condition is TRUE, one or more rows may be removed. Finally, if no condition is specified, (i.e. there is no WHERE clause), all rows in the table are removed - even if all rows of a table are deleted, the table still exists in the database. The DROP TABLE command may be used to remove a table completely from the database.)
Nested queries: A database query may consist of a number of sub-queries. When the result of a sub-query forms the condition, or part of the condition, of another query, such a complex query is called a nested query. A special operator called IN is used to examine whether a row is a member of the resulting relation of a subquery.
To prevent simultaneous access problems and to maintain database integrity DBMS applies a Locking approach.
- A Shared lock means when implemented, the data can only be read and not modified.
- An Exclusive lock means when implemented, the data can only be altered or modified by the transaction that imposed the lock.