Query Processing and Optimization - Advanced Database Systems

Master Query Processing and Optimization in Advanced Database Systems. Learn Relational Algebra operations, Query Decomposition steps, and Optimization techniques with simple examples.

1. Translating SQL Queries into Relational Algebra

1.1 Introduction to Relational Algebra

Important Points:Relational Algebra is a procedural query language.
It takes a relation as input and produces a relation as output.
Operations are performed recursively on relations.
Operators can be unary (one input) or binary (two inputs).

Detailed Explanation:

Before a computer can understand and execute an SQL query written by a human, it needs to translate it into a lower-level language. This language is called Relational Algebra.

Think of Relational Algebra as the “machine language” for databases. While SQL is like English (easy for humans), Relational Algebra is like instructions (easy for the database engine to follow).

Key Characteristics:

Procedural: It tells the system how to get the data, step-by-step.
Closed System: The output of every operation is a “relation” (a table). This means the output of one operation can become the input for another.

   SQL Query (Human Language)
          |
          v
   [Translation]
          |
          v
   Relational Algebra (Machine Language)
          |
          v
   Execution & Result

Types of Operations:

Basic Operations: Selection, Projection, Cross Product, Set Difference, Union.
Additional Operations: Intersection, Join, Division, Renaming.

Exam Question:
Which of the following statements is TRUE about Relational Algebra?

A) It is a non-procedural language like SQL.
B) It takes a relation as input and produces a scalar value as output.
C) It is a procedural language that takes relations as input and produces relations as output.
D) It does not use any operators.

1.2 Selection Operation (σ)

Important Points:Symbol: Sigma (σ)
Function: Filters rows (records) based on a condition.
Type: Unary operator (works on one relation).
Result: Same columns (schema) as original, but fewer rows.
Notation: σcondition(Relation)

Detailed Explanation:

The Selection operation is like using a filter. Imagine you have a table of all students in a university. If you only want to see students from the “CS” department, you apply a Selection operation.

Key Rules:

The condition uses Comparison Operators (=, ≠, <, >, ≤, ≥).
The condition can use Logical Operators (AND ∧, OR ∨, NOT ¬).
The output table looks the same structure-wise, but only contains the rows that match the condition.

Example 1:

Consider the EMP table below:

Name	Office	Dept	Rank
Smith	400	CS	Assistant
Jones	220	Econ	Adjunct
Green	160	Econ	Assistant
Brown	420	CS	Associate
Smith	500	Fin	Associate

Query: Display only those Employees in the CS department.

SQL: SELECT * FROM EMP WHERE Dept = ‘CS’
Relational Algebra: σ_{Dept = ‘CS’}(EMP)

Resulting Relation:

Name	Office	Dept	Rank
Smith	400	CS	Assistant
Brown	420	CS	Associate

Example 2:

Query: Select only those Employees with the name Smith who are Assistant professors.

SQL: SELECT * FROM EMP WHERE Name=’Smith’ AND Rank=’Assistant’
Relational Algebra: σ_{Name = ‘Smith’ ∧ Rank = ‘Assistant’}(EMP)

Resulting Relation:

Name	Office	Dept	Rank
Smith	400	CS	Assistant

Exam Question:
What does the Selection operation (σ) do in Relational Algebra?

A) Deletes unwanted columns from a table.
B) Combines two tables together.
C) Selects a subset of rows satisfying a condition.
D) Calculates the sum of attributes.

1.3 Projection Operation (π)

Important Points:Symbol: Pi (π)
Function: Selects specific columns (attributes) from a table.
Type: Unary operator.
Result: Fewer columns (vertical subset), same or fewer rows.
Notation: πattributes(Relation)

Detailed Explanation:

While Selection cuts the table horizontally (rows), Projection cuts the table vertically (columns). If you only need the “Name” and “Department” from a large employee table, you use Projection.

Key Rules:

You list the specific column names you want to keep inside the subscript.
Duplicate Elimination: If the resulting table has duplicate rows (e.g., two people named “Smith” in the same Dept), Projection automatically removes the duplicates.

Example 1:

Using the EMP table again:

Query: Project only the names and departments of the employees.

SQL: SELECT name, dept FROM EMP
Relational Algebra: π_{name, dept}(EMP)

Resulting Relation:

Name	Dept
Smith	CS
Jones	Econ
Green	Econ
Brown	CS
Smith	Fin

Example 2:

Consider the Sailor table:

Sid	Sname	Rating	Age
28	Smith	9	35
31	Lubber	8	55
44	Guppy	5	35
58	Rusty	10	35

Query: Select Sname and Rating from Sailor.

Relational Algebra: π_{Sname, Rating}(Sailor)

Resulting Relation:

Sname	Rating
Smith	9
Lubber	8
Guppy	5
Rusty	10

1.4 Combining Selection and Projection

Important Points:You can combine Selection (σ) and Projection (π) in one expression.
Usually, Selection is performed first to filter rows, then Projection selects the columns.
The order matters: You cannot project a column you haven’t selected/filtered if using nested operations correctly.

Detailed Explanation:

In real-world scenarios, we often want specific columns from specific rows. To do this, we nest the operations.

Example 1:

Query: Show the names of all employees working in the CS department.

Logic: First, find the CS employees (Selection). Then, take only their names (Projection).

Relational Algebra: π_name(σ_{Dept = ‘CS’}(EMP))

Step 1: Selection
σ (Dept = 'CS') (EMP)
+-------+--------+-----+-----------+
| Name  | Office | Dept| Rank      |
+-------+--------+-----+-----------+
| Smith | 400    | CS  | Assistant |
| Brown | 420    | CS  | Associate |
+-------+--------+-----+-----------+

Step 2: Projection
π (name)
+-------+
| Name  |
+-------+
| Smith |
| Brown |
+-------+

Example 2:

Query: Show the name and rank of those Employees who are not in the CS department or Adjuncts.

Relational Algebra: π_{name, rank}(σ_{¬(Rank = ‘Adjunct’ ∨ Dept = ‘CS’)}(EMP))

Resulting Relation:

Name	Rank
Green	Assistant
Smith	Associate

Exam Question:
Which Relational Algebra expression correctly returns only the names of employees who earn more than 50,000?

A) σ_{salary > 50000}(π_name(EMP))
B) π_name(σ_{salary > 50000}(EMP))
C) π_salary(σ_name(EMP))
D) σ_name(π_{salary > 50000}(EMP))

1.5 Cartesian Product (X)

Important Points:Also called Cross Product or Cross Join.
Combines every row of Table 1 with every row of Table 2.
Symbol: ×
Result size: If Table 1 has N rows and Table 2 has M rows, result has N × M rows.

Detailed Explanation:

The Cartesian Product is a brute-force combination. It pairs every possible combination of rows from two tables. This often produces very large tables and is rarely used alone in practical queries, but it forms the basis for Join operations.

Example:

Relation R (People):

First	Last	Age
Bill	Smith	22
Mary	Keen	23

Relation S (Menu):

Dinner	Dessert
Steak	Ice Cream
Lobster	Cheesecake

Result (R × S):

First	Last	Age	Dinner	Dessert
Bill	Smith	22	Steak	Ice Cream
Bill	Smith	22	Lobster	Cheesecake
Mary	Keen	23	Steak	Ice Cream
Mary	Keen	23	Lobster	Cheesecake

Exam Question:
If Table A has 5 rows and Table B has 3 rows, how many rows will be in the Cartesian Product (A × B)?

A) 8 rows
B) 15 rows
C) 2 rows
D) 5 rows

1.6 Set Operations (Union, Intersection, Difference)

Important Points:All set operations require relations to be Union-Compatible.
Union-Compatible means: Same number of columns, and corresponding columns have the same data types.
Union (U): All tuples in R or S or both (duplicates removed).
Intersection (∩): Only tuples in both R and S.
Difference (-): Tuples in R but not in S.

Detailed Explanation:

These operations work just like sets in mathematics.

Conditions for Union Compatibility:

Both tables must have the same number of columns.
The first column of Table R and the first column of Table S must have the same data type (e.g., both are Strings), and so on for the second column, third column, etc.

Example Tables:

**Relation R**
First	Last	Age
Bill	Smith	22
Sally	Green	28

**Relation S**
First	Last	Age
Forrest	Gump	36
Sally	Green	28

1. Union (R ∪ S):

Combines all unique rows from both tables.

First	Last	Age
Bill	Smith	22
Sally	Green	28
Forrest	Gump	36

(Note: Sally Green appears only once because duplicates are removed).

2. Intersection (R ∩ S):

Only the rows that appear in BOTH tables.

First	Last	Age
Sally	Green	28

3. Set Difference (R – S):

Rows that are in R but NOT in S.

First	Last	Age
Bill	Smith	22

1.7 Join Operations

Important Points:Join is the most powerful operation for combining tables meaningfully.
It combines rows based on a related column.
Types: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join.

Detailed Explanation:

The Cartesian Product combines everything. Usually, we only want combinations that make sense (match). This is a Join.

1. Inner Join (Equi Join)

Returns records that have matching values in both tables.

Example: Join S1 and R1 where S1.sid = R1.sid.

S1
Sid	Sname
22	Peter
58	Solomon

R1
Sid	Bid
22	101
58	103

Result (S1 ⋈ R1):

Sid	Sname	Bid
22	Peter	101
58	Solomon	103

2. Left Outer Join

Returns all records from the left table, and matched records from the right table. If no match, the right side contains NULL.

3. Right Outer Join

Returns all records from the right table, and matched records from the left table.

4. Full Outer Join

Returns all records when there is a match in either left or right table.

    Table A           Table B
  [ 1, A ]          [ 1, X ]
  [ 2, B ]          [ 3, Y ]

  INNER JOIN:       LEFT JOIN:        RIGHT JOIN:       FULL JOIN:
  [1, A, X]         [1, A, X]         [1, A, X]         [1, A, X]
                    [2, B, NULL]      [3, NULL, Y]      [2, B, NULL]
                                                         [3, NULL, Y]

Exam Question:
Which Join operation returns ALL rows from the left table, even if there are no matches in the right table?

A) Inner Join
B) Left Outer Join
C) Right Outer Join
D) Intersection

1.8 Division Operation (÷)

Important Points:Used for queries involving “all” (e.g., “Find students who took ALL courses”).
Not a primitive operator (can be built from basic operators).
Notation: A ÷ B

Detailed Explanation:

The Division operation is useful for finding rows in one table that are associated with all rows in another table.

Example:

Relation A (Supplier-Product): Lists which suppliers supply which products.

Relation B (Products): Lists all products.

Query (A ÷ B): Find suppliers who supply ALL products listed in B.

Relation A (Sales)       Relation B (Products)
Sno | Pno                Pno
----+----               ----
S1  | P1                 P2
S1  | P2                 P4
S2  | P1
S3  | P2
S4  | P2
S4  | P4

Query: Find suppliers who supply ALL products in B (P2 AND P4).

Result (A / B):
Sno
----
S4  (Supplies P2 and P4 - Matches All)

2. Overview of Query Processing

2.1 Definition and Goals

Important Points:Definition: The activity of extracting data from the database.
Goal: Transform a high-level SQL query into a correct and efficient execution strategy.
Challenge: SQL is declarative (says WHAT to do). The system must figure out HOW to do it efficiently.

Detailed Explanation:

When you write an SQL query, you are telling the database what you want. You are not telling it how to find it. The database must figure out the best path to retrieve the data. This entire process is called Query Processing.

2.2 Phases of Query Processing

Important Points:Phase 1: Decomposition (Parsing and Validation)
Phase 2: Optimization
Phase 3: Cost Generation
Phase 4: Execution

Detailed Explanation:

      [ High-Level SQL Query ]
               |
               v
+-------------------------------------+
|  1. Query Decomposition             |
|  (Parse, Check Syntax/Semantics)    |
+-------------------------------------+
               |
               v
      [ Relational Algebra Expression ]
               |
               v
+-------------------------------------+
|  2. Query Optimization              |
|  (Find the best strategy)           |
+-------------------------------------+
               |
               v
        [ Execution Plan ]
               |
               v
+-------------------------------------+
|  3. Code Generation                 |
+-------------------------------------+
               |
               v
+-------------------------------------+
|  4. Runtime Database Processor      |
|  (Execute the plan)                 |
+-------------------------------------+
               |
               v
        [ Query Output ]

Exam Question:
Which phase of query processing involves checking if the table names and column names actually exist in the database?

A) Optimization
B) Execution
C) Decomposition
D) Code Generation

3. Query Decomposition

3.1 Steps in Query Decomposition

Important Points:Step 1: Analysis
Step 2: Normalization
Step 3: Semantic Analysis
Step 4: Simplification
Step 5: Query Restructuring

Detailed Explanation:

This phase takes your SQL and turns it into an internal format (Relational Algebra) while checking for errors.

1. Analysis

Checks if the query words are correct (Lexical analysis).
Checks if the query grammar is correct (Syntactic analysis).
Verifies table and column names against the System Catalog.
Checks if operations make sense (e.g., you cannot multiply a String by a Number).

2. Normalization

Converts the query conditions (WHERE clause) into a standard form.

Conjunctive Normal Form: Conditions connected by AND. (e.g., A AND B AND C).
Disjunctive Normal Form: Conditions connected by OR. (e.g., (A AND B) OR (C AND D)).

3. Semantic Analysis

Checks if the query makes logical sense.

Incorrect: Parts of the query don’t contribute to the result (e.g., missing join conditions).
Contradictory: The conditions cannot be satisfied.
Example: (position='Manager' AND position='Assistant'). A person cannot be both at the same time. This query simplifies to “False”.

4. Simplification

Removes unnecessary redundancy.

Eliminates duplicate conditions.
Applies logical rules (Idempotent rules):
- p ∧ p = p
- p ∧ True = p
- p ∧ False = False
Checks Access Rights: Does the user have permission to see these tables?

5. Query Restructuring

Converts the query into a more efficient Relational Algebra form using heuristic rules.

Heuristic: A “rule of thumb” or best practice.
Common Rule: Perform Selection (filtering) as early as possible to reduce table size before joins.

Example of Restructuring:

Original: Join two huge tables first, then filter by city.

Restructured: Filter each table by city first, then join the smaller results.

Exam Question:
During query decomposition, a condition like “Age > 25 AND Age > 25” is simplified to “Age > 25”. Which step does this belong to?

A) Analysis
B) Normalization
C) Simplification
D) Execution

4. Query Optimization

4.1 Definition and Goals

Important Points:Definition: Choosing an efficient execution strategy.
Goal: Minimize resource usage (Disk I/O, CPU time, Memory).
Strategies:Minimize total execution time.
Maximize parallel operations (reduce response time).
Relies on Database Statistics.

Detailed Explanation:

There are usually many ways to execute the same query. The Query Optimizer explores these options and picks the cheapest one.

Example: To find a student by ID:

Option 1: Scan the table row by row until you find the ID (Slow).
Option 2: Use an Index (like a book index) to jump directly to the ID (Fast).

The Optimizer chooses Option 2.

4.2 Cost Estimation

Important Points:Disk access (I/O) is the dominant cost (slowest operation).
Memory access is much faster.
Cost is estimated based on Cardinality (number of rows).
Uses Database Statistics (metadata about tables) stored in the System Catalog.

Detailed Explanation:

The database keeps statistics about your data, such as:

Number of rows in each table.
Number of distinct values in a column.
Size of the table in bytes.

The optimizer uses these numbers to estimate how many disk reads (I/O) a query will need. Lower I/O = Better Query.

Exam Question:
Why is disk access considered the dominant cost in query processing?

A) Disk access is faster than memory access.
B) Disk access is mechanical and much slower than electronic memory access.
C) Disks are more expensive to buy than memory.
D) Disks have less storage capacity.

Exam Tips

Key Points to Remember for Exams:Relational Algebra Symbols: Know the Greek letters! Sigma (σ) for Selection (rows), Pi (π) for Projection (columns).
Union Compatibility: For Union, Intersection, and Difference, tables MUST have the same number of columns and matching data types.
Selection vs. Projection: Selection filters rows (horizontal). Projection filters columns (vertical). Selection keeps schema same; Projection reduces schema.
Join Types:Inner: Matches only.
Left: All from Left + matches from Right.
Right: All from Right + matches from Left.
Decomposition Steps: Remember the order: Analysis → Normalization → Semantic Analysis → Simplification → Restructuring.
Optimization Goal: The goal is NOT just to get the answer, but to get it EFFICIENTLY (minimal disk I/O).
Heuristic Rule: Always try to apply Selection (σ) as early as possible to reduce the size of intermediate results.

Practice Challenge Questions for Exam Preparation

Question 1:
Given two tables: Student (StudentID, Name) and Course (CourseID, CourseName). Why can you NOT perform a Union operation between Student and Course?

Question 2:
Write the Relational Algebra expression for the following SQL query: SELECT Name FROM Employee WHERE Salary > 50000 AND Department = 'HR'

Question 3:
Explain the difference between Contradictory and Incorrect queries in Semantic Analysis.

Answer:
Contradictory Query: A query whose predicate (WHERE condition) cannot be satisfied by any tuple.
Example: WHERE Age > 50 AND Age < 10. No person can be both older than 50 and younger than 10 simultaneously. The result will always be empty.

Incorrect Query: A query where components do not contribute to the generation of results.
Example: A query that lists a table in the FROM clause but doesn't join it properly to other tables (missing join specification), resulting in a meaningless Cartesian product or the table being ignored by the optimizer. It might also mean accessing columns that don't exist (though this is often caught in Analysis).

Question 4:
Table A has 100 rows. Table B has 200 rows. Query 1 performs a Cartesian Product (A × B). Query 2 performs an Inner Join on a unique ID. Which query produces more rows? Explain.

Question 5:
Describe the "Select before Join" heuristic rule. Why is it efficient?

This completes Chapter Two: Query Processing and Optimization. Review the Relational Algebra symbols and the phases of query processing carefully. Good luck with your studies!

Query Processing and Optimization – Advanced Database Systems

1. Translating SQL Queries into Relational Algebra

1.1 Introduction to Relational Algebra

1.2 Selection Operation (σ)

1.3 Projection Operation (π)

1.4 Combining Selection and Projection

1.5 Cartesian Product (X)

1.6 Set Operations (Union, Intersection, Difference)

1.7 Join Operations

1. Inner Join (Equi Join)

2. Left Outer Join

3. Right Outer Join

4. Full Outer Join

1.8 Division Operation (÷)

2. Overview of Query Processing

2.1 Definition and Goals

2.2 Phases of Query Processing

3. Query Decomposition

3.1 Steps in Query Decomposition

1. Analysis

2. Normalization

3. Semantic Analysis

4. Simplification

5. Query Restructuring

4. Query Optimization

4.1 Definition and Goals

4.2 Cost Estimation

Exam Tips

Key Points to Remember for Exams:

Practice Challenge Questions for Exam Preparation

Leave a Comment Cancel Reply

1. Translating SQL Queries into Relational Algebra

1.1 Introduction to Relational Algebra

1.2 Selection Operation (σ)

1.3 Projection Operation (π)

1.4 Combining Selection and Projection

1.5 Cartesian Product (X)

1.6 Set Operations (Union, Intersection, Difference)

1.7 Join Operations

1. Inner Join (Equi Join)

2. Left Outer Join

3. Right Outer Join

4. Full Outer Join

1.8 Division Operation (÷)

2. Overview of Query Processing

2.1 Definition and Goals

2.2 Phases of Query Processing

3. Query Decomposition

3.1 Steps in Query Decomposition

1. Analysis

2. Normalization

3. Semantic Analysis

4. Simplification

5. Query Restructuring

4. Query Optimization

4.1 Definition and Goals

4.2 Cost Estimation

Exam Tips

Key Points to Remember for Exams:

Practice Challenge Questions for Exam Preparation

Related Posts

Leave a Comment Cancel Reply