Master Query Processing and Optimization in Advanced Database Systems. Learn Relational Algebra operations, Query Decomposition steps, and Optimization techniques with simple examples.
1. Translating SQL Queries into Relational Algebra
1.1 Introduction to Relational Algebra
- Relational Algebra is a procedural query language.
- It takes a relation as input and produces a relation as output.
- Operations are performed recursively on relations.
- Operators can be unary (one input) or binary (two inputs).
Detailed Explanation:
Before a computer can understand and execute an SQL query written by a human, it needs to translate it into a lower-level language. This language is called Relational Algebra.
Think of Relational Algebra as the “machine language” for databases. While SQL is like English (easy for humans), Relational Algebra is like instructions (easy for the database engine to follow).
Key Characteristics:
- Procedural: It tells the system how to get the data, step-by-step.
- Closed System: The output of every operation is a “relation” (a table). This means the output of one operation can become the input for another.
SQL Query (Human Language)
|
v
[Translation]
|
v
Relational Algebra (Machine Language)
|
v
Execution & Result
Types of Operations:
- Basic Operations: Selection, Projection, Cross Product, Set Difference, Union.
- Additional Operations: Intersection, Join, Division, Renaming.
Which of the following statements is TRUE about Relational Algebra?
A) It is a non-procedural language like SQL.
B) It takes a relation as input and produces a scalar value as output.
C) It is a procedural language that takes relations as input and produces relations as output.
D) It does not use any operators.
1.2 Selection Operation (σ)
- Symbol: Sigma (σ)
- Function: Filters rows (records) based on a condition.
- Type: Unary operator (works on one relation).
- Result: Same columns (schema) as original, but fewer rows.
- Notation: σcondition(Relation)
Detailed Explanation:
The Selection operation is like using a filter. Imagine you have a table of all students in a university. If you only want to see students from the “CS” department, you apply a Selection operation.
Key Rules:
- The condition uses Comparison Operators (=, ≠, <, >, ≤, ≥).
- The condition can use Logical Operators (AND ∧, OR ∨, NOT ¬).
- The output table looks the same structure-wise, but only contains the rows that match the condition.
Example 1:
Consider the EMP table below:
| Name | Office | Dept | Rank |
|---|---|---|---|
| Smith | 400 | CS | Assistant |
| Jones | 220 | Econ | Adjunct |
| Green | 160 | Econ | Assistant |
| Brown | 420 | CS | Associate |
| Smith | 500 | Fin | Associate |
Query: Display only those Employees in the CS department.
- SQL: SELECT * FROM EMP WHERE Dept = ‘CS’
- Relational Algebra: σDept = ‘CS’(EMP)
Resulting Relation:
| Name | Office | Dept | Rank |
|---|---|---|---|
| Smith | 400 | CS | Assistant |
| Brown | 420 | CS | Associate |
Example 2:
Query: Select only those Employees with the name Smith who are Assistant professors.
- SQL: SELECT * FROM EMP WHERE Name=’Smith’ AND Rank=’Assistant’
- Relational Algebra: σName = ‘Smith’ ∧ Rank = ‘Assistant’(EMP)
Resulting Relation:
| Name | Office | Dept | Rank |
|---|---|---|---|
| Smith | 400 | CS | Assistant |
What does the Selection operation (σ) do in Relational Algebra?
A) Deletes unwanted columns from a table.
B) Combines two tables together.
C) Selects a subset of rows satisfying a condition.
D) Calculates the sum of attributes.
1.3 Projection Operation (π)
- Symbol: Pi (π)
- Function: Selects specific columns (attributes) from a table.
- Type: Unary operator.
- Result: Fewer columns (vertical subset), same or fewer rows.
- Notation: πattributes(Relation)
Detailed Explanation:
While Selection cuts the table horizontally (rows), Projection cuts the table vertically (columns). If you only need the “Name” and “Department” from a large employee table, you use Projection.
Key Rules:
- You list the specific column names you want to keep inside the subscript.
- Duplicate Elimination: If the resulting table has duplicate rows (e.g., two people named “Smith” in the same Dept), Projection automatically removes the duplicates.
Example 1:
Using the EMP table again:
Query: Project only the names and departments of the employees.
- SQL: SELECT name, dept FROM EMP
- Relational Algebra: πname, dept(EMP)
Resulting Relation:
| Name | Dept |
|---|---|
| Smith | CS |
| Jones | Econ |
| Green | Econ |
| Brown | CS |
| Smith | Fin |
Example 2:
Consider the Sailor table:
| Sid | Sname | Rating | Age |
|---|---|---|---|
| 28 | Smith | 9 | 35 |
| 31 | Lubber | 8 | 55 |
| 44 | Guppy | 5 | 35 |
| 58 | Rusty | 10 | 35 |
Query: Select Sname and Rating from Sailor.
- Relational Algebra: πSname, Rating(Sailor)
Resulting Relation:
| Sname | Rating |
|---|---|
| Smith | 9 |
| Lubber | 8 |
| Guppy | 5 |
| Rusty | 10 |
1.4 Combining Selection and Projection
- You can combine Selection (σ) and Projection (π) in one expression.
- Usually, Selection is performed first to filter rows, then Projection selects the columns.
- The order matters: You cannot project a column you haven’t selected/filtered if using nested operations correctly.
Detailed Explanation:
In real-world scenarios, we often want specific columns from specific rows. To do this, we nest the operations.
Example 1:
Query: Show the names of all employees working in the CS department.
Logic: First, find the CS employees (Selection). Then, take only their names (Projection).
- Relational Algebra: πname(σDept = ‘CS’(EMP))
Step 1: Selection σ (Dept = 'CS') (EMP) +-------+--------+-----+-----------+ | Name | Office | Dept| Rank | +-------+--------+-----+-----------+ | Smith | 400 | CS | Assistant | | Brown | 420 | CS | Associate | +-------+--------+-----+-----------+ Step 2: Projection π (name) +-------+ | Name | +-------+ | Smith | | Brown | +-------+
Example 2:
Query: Show the name and rank of those Employees who are not in the CS department or Adjuncts.
- Relational Algebra: πname, rank(σ¬(Rank = ‘Adjunct’ ∨ Dept = ‘CS’)(EMP))
Resulting Relation:
| Name | Rank |
|---|---|
| Green | Assistant |
| Smith | Associate |
Which Relational Algebra expression correctly returns only the names of employees who earn more than 50,000?
A) σsalary > 50000(πname(EMP))
B) πname(σsalary > 50000(EMP))
C) πsalary(σname(EMP))
D) σname(πsalary > 50000(EMP))
1.5 Cartesian Product (X)
- Also called Cross Product or Cross Join.
- Combines every row of Table 1 with every row of Table 2.
- Symbol: ×
- Result size: If Table 1 has N rows and Table 2 has M rows, result has N × M rows.
Detailed Explanation:
The Cartesian Product is a brute-force combination. It pairs every possible combination of rows from two tables. This often produces very large tables and is rarely used alone in practical queries, but it forms the basis for Join operations.
Example:
Relation R (People):
| First | Last | Age |
|---|---|---|
| Bill | Smith | 22 |
| Mary | Keen | 23 |
Relation S (Menu):
| Dinner | Dessert |
|---|---|
| Steak | Ice Cream |
| Lobster | Cheesecake |
Result (R × S):
| First | Last | Age | Dinner | Dessert |
|---|---|---|---|---|
| Bill | Smith | 22 | Steak | Ice Cream |
| Bill | Smith | 22 | Lobster | Cheesecake |
| Mary | Keen | 23 | Steak | Ice Cream |
| Mary | Keen | 23 | Lobster | Cheesecake |
If Table A has 5 rows and Table B has 3 rows, how many rows will be in the Cartesian Product (A × B)?
A) 8 rows
B) 15 rows
C) 2 rows
D) 5 rows
1.6 Set Operations (Union, Intersection, Difference)
- All set operations require relations to be Union-Compatible.
- Union-Compatible means: Same number of columns, and corresponding columns have the same data types.
- Union (U): All tuples in R or S or both (duplicates removed).
- Intersection (∩): Only tuples in both R and S.
- Difference (-): Tuples in R but not in S.
Detailed Explanation:
These operations work just like sets in mathematics.
Conditions for Union Compatibility:
- Both tables must have the same number of columns.
- The first column of Table R and the first column of Table S must have the same data type (e.g., both are Strings), and so on for the second column, third column, etc.
Example Tables:
| First | Last | Age |
|---|---|---|
| Bill | Smith | 22 |
| Sally | Green | 28 |
| First | Last | Age |
|---|---|---|
| Forrest | Gump | 36 |
| Sally | Green | 28 |
1. Union (R ∪ S):
Combines all unique rows from both tables.
| First | Last | Age |
|---|---|---|
| Bill | Smith | 22 |
| Sally | Green | 28 |
| Forrest | Gump | 36 |
(Note: Sally Green appears only once because duplicates are removed).
2. Intersection (R ∩ S):
Only the rows that appear in BOTH tables.
| First | Last | Age |
|---|---|---|
| Sally | Green | 28 |
3. Set Difference (R – S):
Rows that are in R but NOT in S.
| First | Last | Age |
|---|---|---|
| Bill | Smith | 22 |
1.7 Join Operations
- Join is the most powerful operation for combining tables meaningfully.
- It combines rows based on a related column.
- Types: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join.
Detailed Explanation:
The Cartesian Product combines everything. Usually, we only want combinations that make sense (match). This is a Join.
1. Inner Join (Equi Join)
Returns records that have matching values in both tables.
Example: Join S1 and R1 where S1.sid = R1.sid.
| Sid | Sname |
|---|---|
| 22 | Peter |
| 58 | Solomon |
| Sid | Bid |
|---|---|
| 22 | 101 |
| 58 | 103 |
Result (S1 ⋈ R1):
| Sid | Sname | Bid |
|---|---|---|
| 22 | Peter | 101 |
| 58 | Solomon | 103 |
2. Left Outer Join
Returns all records from the left table, and matched records from the right table. If no match, the right side contains NULL.
3. Right Outer Join
Returns all records from the right table, and matched records from the left table.
4. Full Outer Join
Returns all records when there is a match in either left or right table.
Table A Table B
[ 1, A ] [ 1, X ]
[ 2, B ] [ 3, Y ]
INNER JOIN: LEFT JOIN: RIGHT JOIN: FULL JOIN:
[1, A, X] [1, A, X] [1, A, X] [1, A, X]
[2, B, NULL] [3, NULL, Y] [2, B, NULL]
[3, NULL, Y]
Which Join operation returns ALL rows from the left table, even if there are no matches in the right table?
A) Inner Join
B) Left Outer Join
C) Right Outer Join
D) Intersection
1.8 Division Operation (÷)
- Used for queries involving “all” (e.g., “Find students who took ALL courses”).
- Not a primitive operator (can be built from basic operators).
- Notation: A ÷ B
Detailed Explanation:
The Division operation is useful for finding rows in one table that are associated with all rows in another table.
Example:
Relation A (Supplier-Product): Lists which suppliers supply which products.
Relation B (Products): Lists all products.
Query (A ÷ B): Find suppliers who supply ALL products listed in B.
Relation A (Sales) Relation B (Products) Sno | Pno Pno ----+---- ---- S1 | P1 P2 S1 | P2 P4 S2 | P1 S3 | P2 S4 | P2 S4 | P4 Query: Find suppliers who supply ALL products in B (P2 AND P4). Result (A / B): Sno ---- S4 (Supplies P2 and P4 - Matches All)
2. Overview of Query Processing
2.1 Definition and Goals
- Definition: The activity of extracting data from the database.
- Goal: Transform a high-level SQL query into a correct and efficient execution strategy.
- Challenge: SQL is declarative (says WHAT to do). The system must figure out HOW to do it efficiently.
Detailed Explanation:
When you write an SQL query, you are telling the database what you want. You are not telling it how to find it. The database must figure out the best path to retrieve the data. This entire process is called Query Processing.
2.2 Phases of Query Processing
- Phase 1: Decomposition (Parsing and Validation)
- Phase 2: Optimization
- Phase 3: Cost Generation
- Phase 4: Execution
Detailed Explanation:
[ High-Level SQL Query ]
|
v
+-------------------------------------+
| 1. Query Decomposition |
| (Parse, Check Syntax/Semantics) |
+-------------------------------------+
|
v
[ Relational Algebra Expression ]
|
v
+-------------------------------------+
| 2. Query Optimization |
| (Find the best strategy) |
+-------------------------------------+
|
v
[ Execution Plan ]
|
v
+-------------------------------------+
| 3. Code Generation |
+-------------------------------------+
|
v
+-------------------------------------+
| 4. Runtime Database Processor |
| (Execute the plan) |
+-------------------------------------+
|
v
[ Query Output ]
Which phase of query processing involves checking if the table names and column names actually exist in the database?
A) Optimization
B) Execution
C) Decomposition
D) Code Generation
3. Query Decomposition
3.1 Steps in Query Decomposition
- Step 1: Analysis
- Step 2: Normalization
- Step 3: Semantic Analysis
- Step 4: Simplification
- Step 5: Query Restructuring
Detailed Explanation:
This phase takes your SQL and turns it into an internal format (Relational Algebra) while checking for errors.
1. Analysis
- Checks if the query words are correct (Lexical analysis).
- Checks if the query grammar is correct (Syntactic analysis).
- Verifies table and column names against the System Catalog.
- Checks if operations make sense (e.g., you cannot multiply a String by a Number).
2. Normalization
Converts the query conditions (WHERE clause) into a standard form.
- Conjunctive Normal Form: Conditions connected by AND. (e.g., A AND B AND C).
- Disjunctive Normal Form: Conditions connected by OR. (e.g., (A AND B) OR (C AND D)).
3. Semantic Analysis
Checks if the query makes logical sense.
- Incorrect: Parts of the query don’t contribute to the result (e.g., missing join conditions).
- Contradictory: The conditions cannot be satisfied.
Example:(position='Manager' AND position='Assistant'). A person cannot be both at the same time. This query simplifies to “False”.
4. Simplification
Removes unnecessary redundancy.
- Eliminates duplicate conditions.
- Applies logical rules (Idempotent rules):
- p ∧ p = p
- p ∧ True = p
- p ∧ False = False
- Checks Access Rights: Does the user have permission to see these tables?
5. Query Restructuring
Converts the query into a more efficient Relational Algebra form using heuristic rules.
- Heuristic: A “rule of thumb” or best practice.
- Common Rule: Perform Selection (filtering) as early as possible to reduce table size before joins.
Example of Restructuring:
Original: Join two huge tables first, then filter by city.
Restructured: Filter each table by city first, then join the smaller results.
During query decomposition, a condition like “Age > 25 AND Age > 25” is simplified to “Age > 25”. Which step does this belong to?
A) Analysis
B) Normalization
C) Simplification
D) Execution
4. Query Optimization
4.1 Definition and Goals
- Definition: Choosing an efficient execution strategy.
- Goal: Minimize resource usage (Disk I/O, CPU time, Memory).
- Strategies:
- Minimize total execution time.
- Maximize parallel operations (reduce response time).
- Relies on Database Statistics.
Detailed Explanation:
There are usually many ways to execute the same query. The Query Optimizer explores these options and picks the cheapest one.
Example: To find a student by ID:
- Option 1: Scan the table row by row until you find the ID (Slow).
- Option 2: Use an Index (like a book index) to jump directly to the ID (Fast).
The Optimizer chooses Option 2.
4.2 Cost Estimation
- Disk access (I/O) is the dominant cost (slowest operation).
- Memory access is much faster.
- Cost is estimated based on Cardinality (number of rows).
- Uses Database Statistics (metadata about tables) stored in the System Catalog.
Detailed Explanation:
The database keeps statistics about your data, such as:
- Number of rows in each table.
- Number of distinct values in a column.
- Size of the table in bytes.
The optimizer uses these numbers to estimate how many disk reads (I/O) a query will need. Lower I/O = Better Query.
Why is disk access considered the dominant cost in query processing?
A) Disk access is faster than memory access.
B) Disk access is mechanical and much slower than electronic memory access.
C) Disks are more expensive to buy than memory.
D) Disks have less storage capacity.
Exam Tips
Key Points to Remember for Exams:
- Relational Algebra Symbols: Know the Greek letters! Sigma (σ) for Selection (rows), Pi (π) for Projection (columns).
- Union Compatibility: For Union, Intersection, and Difference, tables MUST have the same number of columns and matching data types.
- Selection vs. Projection: Selection filters rows (horizontal). Projection filters columns (vertical). Selection keeps schema same; Projection reduces schema.
- Join Types:
- Inner: Matches only.
- Left: All from Left + matches from Right.
- Right: All from Right + matches from Left.
- Decomposition Steps: Remember the order: Analysis → Normalization → Semantic Analysis → Simplification → Restructuring.
- Optimization Goal: The goal is NOT just to get the answer, but to get it EFFICIENTLY (minimal disk I/O).
- Heuristic Rule: Always try to apply Selection (σ) as early as possible to reduce the size of intermediate results.
Practice Challenge Questions for Exam Preparation
Given two tables: Student (StudentID, Name) and Course (CourseID, CourseName). Why can you NOT perform a Union operation between Student and Course?
Write the Relational Algebra expression for the following SQL query:
SELECT Name FROM Employee WHERE Salary > 50000 AND Department = 'HR'Explain the difference between Contradictory and Incorrect queries in Semantic Analysis.
Table A has 100 rows. Table B has 200 rows. Query 1 performs a Cartesian Product (A × B). Query 2 performs an Inner Join on a unique ID. Which query produces more rows? Explain.
Describe the "Select before Join" heuristic rule. Why is it efficient?
This completes Chapter Two: Query Processing and Optimization. Review the Relational Algebra symbols and the phases of query processing carefully. Good luck with your studies!