Hello dear students! Welcome to Chapter 7 on Distributed Database Systems. This is a fascinating chapter because we move from a single database sitting on one machine to databases spread across many locations. Imagine a bank with branches in different cities β each branch keeps its own data, but they all work together as one system. That is exactly what a distributed database is. Let me walk you through every concept step by step.
1. What is a Distributed Database System?
Let me start with a clear, simple definition. A distributed database is a single logical database that is spread physically across computers in multiple locations. These computers are connected by a data communications link (a network).
Now, let me break that down because there are several important ideas packed into that one sentence.
First, it is a single logical database. Even though the data is physically stored on different machines in different places, from the user’s perspective, it looks and behaves like one database. The user does not need to know that part of the data is in Addis Ababa and another part is in Dire Dawa. The system hides this complexity.
Second, it is a collection of connected sites. Each site is a database in its own right. Each site has its own DBMS software and its own local users. Operations can be performed locally at each site as if the database was not distributed at all. A branch manager in Bahir Dar can run queries on local data without connecting to any other site.
Third, the sites collaborate transparently. When needed, the sites work together to answer queries that involve data from multiple locations, but the user does not see this collaboration happening. The union of all the individual databases equals the database of the whole organization.
1.1 Schema in a Distributed Database
A schema is a structure that contains descriptions of database objects β tables, views, constraints, and so on. In a distributed database, we have two important types of schemas:
Local schema: This describes database objects on a single site only. For example, Site A has its own local schema that describes only the tables and views that exist at Site A. It knows nothing about what is stored at Site B or Site C.
Global schema: This describes database objects on ALL network nodes combined. It is the union of all local schemas. When a user issues a query, the system looks at the global schema to understand the full structure of the data across all sites.
1.2 Distributed Database Management System (DDBMS)
A DDBMS is the software system that manages the distributed database and makes the distribution transparent to the users. The DDBMS is the brain that coordinates everything β it decides where to find data, how to combine results from multiple sites, how to keep data consistent across sites, and how to handle failures. The user simply interacts with the DDBMS as if it were a regular centralized DBMS.
Q1 (MCQ). Which of the following best describes a distributed database?
a) Multiple independent databases with no connection to each other
b) A single logical database physically spread across multiple sites connected by a network
c) A backup database stored on a secondary server for disaster recovery
d) A database that is partitioned within a single computer’s storage
Q2 (MCQ). What does the global schema in a distributed database represent?
a) Only the tables at the headquarters site
b) The security policies applied across all sites
c) The union of all local schemas across all network nodes
d) The network configuration of all connected sites
Q3 (Fill in the blank). A ________ describes database objects on its own site only, while a ________ describes database objects on all network nodes.
A1. b) A single logical database physically spread across multiple sites connected by a network. This is the defining characteristic of a distributed database. Option a describes independent databases (not distributed), option c describes a backup system, and option d describes table partitioning within a single machine β none of these are distributed databases.
A2. c) The union of all local schemas across all network nodes. The global schema combines the descriptions of all database objects from every site into one comprehensive schema. It represents the entire logical database structure across the distributed system.
A3. Local schema; Global schema. The local schema is limited to one site. The global schema encompasses all sites and is the union of every local schema in the distributed system.
2. Characteristics of Distributed Databases
Not every database spread across computers is a true distributed database. There are specific characteristics that a system must have. Let me explain each one carefully because these are frequently tested.
2.1 Local Autonomy
Local autonomy means that local data is locally owned and managed. The data belongs to the local site, even if it is accessible from other sites. The local site is responsible for the security and integrity of its own data. No other site can directly modify or manage another site’s data without going through proper protocols.
Think of it this way: each branch of a bank manages its own customer accounts. The Addis branch owns and manages its data. The Bahir Dar branch can query that data if needed, but it does not own it or directly control it. The local site has the final say over its own data.
2.2 Transparency
Transparency means that users should not have to know where data is physically stored. A user in Addis Ababa issues a query, and the DDBMS figures out that the needed data is partly in Addis, partly in Bahir Dar, and partly in Dire Dawa. The user does not specify locations β the system handles it.
2.3 Distributed Processing
Distributed processing means that processing tasks (executing queries, running transactions) can happen at multiple sites rather than being concentrated at one central location. Each site has its own processing capability and can handle its own workload.
2.4 Distributed Catalog Management
The system must keep track of how data is distributed across sites. This is handled through catalogs. A site catalog describes all objects (fragments, replicas) at a particular site. It also keeps track of any replicas (copies) of relations that have been created at that site. Without proper catalog management, the DDBMS would not know where anything is stored.
Q1 (MCQ). A site catalog in a distributed database tracks which of the following?
a) Only the passwords of local users
b) All objects, fragments, and replicas at that site
c) Only the network configuration
d) The global schema of the entire system
Q2 (MCQ). “Local data is locally owned and managed, and security and integrity are the responsibility of the local server.” This statement describes which characteristic?
a) Transparency
b) Distributed processing
c) Distributed catalog management
d) Local autonomy
Q3 (True/False). In a distributed database, users must specify which physical site contains the data they want to access.
A1. b) All objects, fragments, and replicas at that site. The site catalog is specifically responsible for describing all database objects stored at that site, including fragments of relations and any replicas (copies) of data from other sites. It does NOT store passwords, network configuration, or the global schema.
A2. d) Local autonomy. Local autonomy means the local site has ownership and control over its own data. Security and integrity enforcement are the local site’s responsibility. Transparency is about hiding location from users. Distributed processing is about processing at multiple sites. Catalog management is about tracking data distribution.
A3. False. One of the key characteristics of a distributed database is transparency β users should NOT need to know where data is physically stored. The DDBMS handles locating the data automatically. If users had to specify physical locations, transparency would be violated.
3. Levels of Data and Process Distribution
Now here is a topic where students often get confused. There are different levels of how data and processing can be distributed. Your textbook describes three levels, and you need to understand each one clearly. Let me walk you through them from the simplest to the most complex.
3.1 Single-Site Processing, Single-Site Data (SPSD)
This is the simplest level. In fact, it is not really “distributed” at all. All processing happens on a single CPU or host computer (mainframe, midrange, or PC). All data is stored on that same computer’s local disk. Processing cannot be done on the user’s side.
This is a traditional centralized database management system. Think of an old bank where all data and processing happens on one mainframe in the headquarters, and branch offices use “dumb terminals” that can only send requests and display results β they cannot do any processing themselves.
3.2 Multiple-Site Processing, Single-Site Data (MPSD)
Here, multiple processes run on different computers, but they all share a single data repository. The data is NOT distributed β it lives on one file server. But the processing is distributed across multiple computers connected through a LAN.
This is also called a Distributed Processing Environment. Think of a small office where one server stores all the company’s accounting data, but multiple PCs run accounting software that accesses that single shared data store over the network.
Notice: the processing is distributed but the data is NOT. This is why it is called “Distributed Processing Environment” β it has distributed processing but NOT a distributed database.
3.3 Multiple-Site Processing, Multiple-Site Data (MPMD)
This is the fully distributed database management system. Both processing and data are distributed across multiple sites. Each site has its own data processor and its own transaction processor. Sites can process their own local data AND participate in global queries that involve data from multiple sites.
β’ SPSD = Centralized DBMS (single processing, single data) β dumb terminals
β’ MPSD = Distributed Processing Environment (multiple processing, single data) β NOT a distributed database
β’ MPMD = Fully Distributed DDBMS (multiple processing, multiple data) β TRUE distributed database
Q1 (MCQ). Which level of data and process distribution represents a fully distributed database management system?
a) SPSD
b) MPSD
c) MPMD
d) Both MPSD and MPMD
Q2 (MCQ). A company has multiple PCs running application software that all access a single file server’s data over a LAN. This is an example of:
a) SPSD β centralized DBMS
b) MPSD β distributed processing environment
c) MPMD β fully distributed DDBMS
d) A homogeneous DDBMS
Q3 (True/False). In MPSD, the data is distributed across multiple sites but processing happens at only one site.
Q4 (Fill in the blank). MPSD is also called a ________ ________ ________.
A1. c) MPMD. MPMD (Multiple-Site Processing, Multiple-Site Data) is the only level where BOTH processing and data are distributed across multiple sites, making it a fully distributed DDBMS. SPSD has neither distributed. MPSD has distributed processing but centralized data.
A2. b) MPSD β distributed processing environment. Multiple PCs (multiple-site processing) access a single file server (single-site data). Data is NOT distributed β it is all on one server. This matches the MPSD definition exactly. It is NOT a distributed database because the data is centralized.
A3. False. It is exactly the opposite! In MPSD, processing is distributed across multiple sites, but data is stored at a single site. The “M” in the first part stands for Multiple (processing), and the “S” in the second part stands for Single (data).
A4. Distributed Processing Environment. MPSD is called a distributed processing environment because only the processing is distributed β the data remains centralized on a single file server.
4. Types of DDBMS: Homogeneous vs Heterogeneous
Not all distributed databases are created equal. A very important classification is based on whether all sites use the same DBMS or different DBMSs. Let me explain both types with clear comparisons.
4.1 Homogeneous DDBMS
In a homogeneous DDBMS, all sites run the same type of centralized DBMS. The data is distributed across all nodes, but every node uses the same DBMS software.
Key features of homogeneous DDBMS:
- Same DBMS at each node (for example, all sites run Oracle, or all run MySQL)
- All data is managed by the distributed DBMS β there is no exclusively local data that is outside the DDBMS’s control
- All access is through one global schema
- The global schema is the union of all the local schemas
This is simpler to manage because the DDBMS does not need to translate between different data formats or query languages. Every site speaks the same “language.”
4.2 Heterogeneous DDBMS
In a heterogeneous DDBMS, different sites may run different types of centralized DBMSs. This is much more complex but also much more realistic in the real world.
Key features:
- Data is distributed across all nodes
- Local access is done using the local DBMS and the local schema (each site uses its own system)
- Remote access (accessing data from other sites) is done through the global schema
A fully heterogeneous DDBMS is the most complex type. It supports different DBMSs that may even use different data models β relational, hierarchical, or network β running under different computer systems, such as mainframes and microcomputers.
| Feature | Homogeneous DDBMS | Heterogeneous DDBMS |
|---|---|---|
| DBMS at each node | Same type | Different types |
| Data models | Same (e.g., all relational) | May differ (relational, hierarchical, network) |
| Local access | Through global schema | Through local schema |
| Remote access | Through global schema | Through global schema |
| Exclusively local data | None β all data managed by DDBMS | Can exist (managed by local DBMS only) |
| Complexity | Lower | Higher |
| Data translation needed | No | Yes (between different DBMS formats) |
β’ Homogeneous = same DBMS everywhere, all access through global schema, no exclusively local data
β’ Heterogeneous = different DBMSs possible, local access through local schema, remote access through global schema
β’ Fully heterogeneous = different data models AND different computer systems
β’ The global schema in homogeneous DDBMS = union of all local schemas
β’ In heterogeneous, local access uses local schema but remote access uses global schema
Q1 (MCQ). In a heterogeneous DDBMS, how is local data accessed at a site?
a) Through the global schema only
b) Through the local DBMS and local schema
c) Through a central server’s schema
d) Local data cannot be accessed in a heterogeneous system
Q2 (MCQ). Which of the following is TRUE about a homogeneous DDBMS?
a) Different DBMSs run at different nodes
b) Some data may be exclusively local and not managed by the DDBMS
c) The global schema is the union of all local schemas and all access is through it
d) Data models may differ across sites
Q3. A university has three campuses. Campus A uses Oracle (relational), Campus B uses MySQL (relational), and Campus C uses an old hierarchical DBMS. Students at each campus can access their own campus data directly and can also query data from other campuses. What type of DDBMS is this? Explain why.
A1. b) Through the local DBMS and local schema. In a heterogeneous DDBMS, local access is done using the site’s own local DBMS and local schema. Only remote access (accessing data from other sites) goes through the global schema. This is different from homogeneous DDBMS where ALL access goes through the global schema.
A2. c) The global schema is the union of all local schemas and all access is through it. This is the defining characteristic of a homogeneous DDBMS. Option a describes heterogeneous. Option b is wrong because homogeneous DDBMS has NO exclusively local data. Option d describes heterogeneous.
A3. This is a fully heterogeneous DDBMS. The reason: different sites use different DBMSs (Oracle, MySQL, hierarchical DBMS) AND different data models (relational at A and B, hierarchical at C). Local access at each campus uses the local DBMS and local schema (Oracle for A, MySQL for B, hierarchical for C). Remote access (querying other campuses) uses the global schema. The system may even run under different computer systems. This matches the textbook definition of fully heterogeneous DDBMS exactly.
5. Components of a DDBMS
What does a distributed database system actually consist of? It is not just databases on different computers. There are specific components that must be present. Your textbook lists five essential components.
Let me explain the two software components more carefully because they are very important and often confused.
The Transaction Processor (TP) is the component that requests data. When a user at Site A issues a query that needs data from Site B, it is the Transaction Processor at Site A that handles this request. The TP acts as the “customer” β it asks for data from other sites.
The Data Processor (DP) is the component that stores and retrieves data at a site. When a request arrives at Site B for data, it is the Data Processor at Site B that actually pulls the data from local storage and sends it back. The DP acts as the “server” β it provides data to whoever requests it. The DP may simply be a regular centralized DBMS.
Q1 (MCQ). Which DDBMS component is responsible for requesting data from other sites?
a) Data Processor
b) Communications media
c) Transaction Processor
d) Site catalog
Q2 (MCQ). The Data Processor at a site may be implemented as:
a) A network router
b) A centralized DBMS
c) A communications protocol
d) A user interface
Q3 (Fill in the blank). The Transaction Processor is also known as the ________ Processor or ________ Manager. The Data Processor is also known as the ________ Manager.
A1. c) Transaction Processor. The Transaction Processor (TP) is the software component that requests data. The Data Processor (DP) stores and retrieves data β it doesn’t request from others. Communications media is the physical channel. The site catalog tracks data distribution.
A2. b) A centralized DBMS. The textbook explicitly states that the Data Processor “may be a centralized DBMS.” This makes sense because the DP’s job is to store and retrieve data at a site, which is exactly what a centralized DBMS does. The DP provides the local data management capability at each site.
A3. Application; Transaction; Data. Transaction Processor = Application Processor = Transaction Manager. Data Processor = Data Manager. These are alternative names for the same components.
6. Advantages and Disadvantages of DDBMS
Why would an organization choose a distributed database instead of a centralized one? And what problems does it create? Let me cover both sides thoroughly.
6.1 Advantages
1. Reflects organizational structure: Many organizations are naturally distributed β they have branches, departments, and offices in different locations. A distributed database mirrors this structure by placing data near the site where it is needed most.
2. Faster data access: When data is stored locally at the site that needs it most, access is much faster because there is no network delay. A branch in Bahir Dar querying local customer records gets results instantly compared to querying a central server in Addis Ababa.
3. Improved performance: Related to the above β data located near the site of greatest demand reduces network traffic and response time. Multiple sites can process queries in parallel.
4. Modularity: Systems can be modified, added, and removed from the distributed database without affecting other modules. If a new branch opens, you can add a new site. If a branch closes, you can remove it. The rest of the system continues working.
5. Improved communications: When each site has its own data, local users communicate with their local system rather than all communications going through a central hub.
6. Improved availability: If one site fails, only the data at that site is affected. Other sites continue to operate normally. In a centralized system, if the single server goes down, the ENTIRE database is unavailable. This is a huge advantage.
7. Processor independence: The organization is not locked into one type of computer or one vendor. Different sites can use different hardware, and the system can evolve over time.
6.2 Disadvantages
1. Complexity of management and control: Managing a distributed system is much harder than managing one centralized system. Extra work is needed to ensure transparency and to maintain multiple disparate systems.
2. Increased cost: More complexity and a more extensive infrastructure means extra labor costs, hardware costs, and software costs.
3. Security: Remote database fragments must be secured individually. The network infrastructure itself must also be secured β for example, by encrypting the network links between remote sites. More entry points mean more potential vulnerabilities.
4. Lack of standards: There are no universally accepted standards for distributed databases, which makes it difficult to integrate systems from different vendors.
5. Increased storage requirements: Data replication (keeping copies at multiple sites for availability) requires more total storage than a single copy at a central site.
6. Difficult to maintain integrity: Enforcing integrity constraints (like foreign keys or uniqueness) over a network may require too much network resources to be feasible. Checking a constraint that involves data at multiple sites means sending messages back and forth, which is slow and resource-intensive.
Q1 (MCQ). Which advantage of DDBMS states that a fault in one database system will only affect one fragment instead of the entire database?
a) Faster data access
b) Modularity
c) Improved availability
d) Processor independence
Q2 (MCQ). Why is maintaining integrity difficult in a distributed database?
a) Because data is stored in only one format
b) Because enforcing constraints over a network may require too much networking resources
c) Because there are no integrity constraints in distributed databases
d) Because only the local DBMS can enforce integrity
Q3. A university has a centralized student database. During registration, the server becomes overloaded and all students across all campuses experience slow response or system crashes. Explain how converting to a DDBMS could solve this problem, mentioning at least three specific advantages.
A1. c) Improved availability. Improved availability means that if one site fails, only the fragment at that site is affected. The rest of the distributed database continues to function. Faster data access is about speed, not fault tolerance. Modularity is about adding/removing systems. Processor independence is about hardware flexibility.
A2. b) Because enforcing constraints over a network may require too much networking resources. Integrity constraints like foreign keys or uniqueness checks often involve data at multiple sites. Verifying these constraints requires sending messages between sites, which consumes network bandwidth and adds latency. This makes it resource-intensive and sometimes impractical.
A3. Converting to a DDBMS would help in three ways:
1. Improved performance: Each campus could have its own local copy of student data, so registration queries are processed locally instead of all going to one central server. This eliminates the bottleneck.
2. Improved availability: If the server at one campus fails during registration, students at other campuses are NOT affected β they continue registering at their local site. In the centralized system, one crash affects everyone.
3. Faster data access: Students at each campus access local data directly without network delays to a remote central server, making the registration process faster.
Additionally, reflects organizational structure applies because the university naturally has separate campuses.
7. DDBMS Functions
A DDBMS must do everything a regular centralized DBMS does, PLUS handle the extra complexity introduced by distribution. Let me understand what these additional functions are.
The textbook says the DDBMS must perform these additional functions transparently to the end user. This is crucial β the user should not notice any difference between using a centralized DBMS and a DDBMS.
The specific additional functions are:
- Provide the user interface needed for location transparency: The user must be able to query data without specifying where it is stored. The DDBMS handles finding it.
- Locate the data: The DDBMS must direct queries to the proper site(s). If a query needs data from Site A and Site C, the DDBMS must know this and route the requests correctly.
- Process queries: The DDBMS must handle three types of queries: local (data at the user’s own site), remote (data at one other site), and compound/global (data at multiple sites that must be combined).
- Provide network-wide concurrency control and recovery procedures: Just like a centralized DBMS controls concurrent access and recovers from failures, but now across the entire network. This is much harder because transactions may span multiple sites.
- Provide data translation in heterogeneous systems: When different sites use different DBMSs or data models, the DDBMS must translate data formats so that all sites can work together seamlessly.
Q1 (MCQ). Which of the following is NOT listed as an additional function of a DDBMS (beyond centralized DBMS functions)?
a) Locate the data and direct queries to proper sites
b) Provide data translation in heterogeneous systems
c) Create and manage physical storage devices
d) Provide network-wide concurrency control and recovery
Q2 (Fill in the blank). The three types of queries that a DDBMS must be able to process are ________, ________, and ________ (global) queries.
A1. c) Create and manage physical storage devices. Creating and managing physical storage devices is a hardware/OS responsibility, not a DDBMS function. The other three options are all explicitly listed as additional DDBMS functions in the textbook: locating data, data translation for heterogeneous systems, and network-wide concurrency control and recovery.
A2. Local; Remote; Compound. Local queries access data at the user’s own site. Remote queries access data at one other site. Compound (global) queries access data at multiple sites and must combine results. The DDBMS must handle all three types transparently.
8. DDB Transparency Features
Now we arrive at one of the most important topics in this chapter. Transparency is what makes a distributed database feel like a centralized one to the user. The DDBMS must hide all the complexity of distribution behind a curtain of transparency.
There are five transparency features that you need to know:
All of these share one goal: allow the end user to see the distributed database as though it were a centralized one. The user should never need to know that the data is distributed.
8.1 Distribution Transparency
This is the most detailed transparency type. It allows management of a physically dispersed database as though it were centralized. There are three levels of distribution transparency, plus an additional related concept:
Level 1: Fragmentation Transparency (Highest Level)
This is the highest level of distribution transparency. The user does not know that the data is fragmented (split into pieces stored at different sites). The user sees one single logical table and queries it normally. The DDBMS handles figuring out which fragments contain the needed data and where those fragments are located.
Example: The EMPLOYEE table is split into three fragments β EMP_F1 at Site A (employees in Addis), EMP_F2 at Site B (employees in Bahir Dar), EMP_F3 at Site C (employees in Dire Dawa). With fragmentation transparency, a user simply queries: SELECT * FROM EMPLOYEE WHERE salary > 5000. The user does not know about the three fragments or where they are.
Level 2: Location Transparency (Middle Level)
The user knows that the data is fragmented but does not know where each fragment is stored. The user might query specific fragments by name but does not specify which site holds them. The DDBMS locates the fragments automatically.
Example: The user knows about EMP_F1, EMP_F2, EMP_F3 but doesn’t know which site stores which fragment. The user queries: SELECT * FROM EMP_F1 and the DDBMS figures out that EMP_F1 is at Site A.
Level 3: Local Mapping Transparency (Lowest Level)
The user knows about fragments AND their locations. The user must specify which fragment at which site to access. This is the lowest level β the user does most of the work.
Example: The user explicitly writes: SELECT * FROM EMP_F1 AT Site_A. The user knows everything about the distribution.
Replica Transparency
In addition to the three levels, there is replica transparency. This is the DDBMS’s ability to hide the existence of multiple copies of data from the user. If a table is replicated at Sites A and B, the user queries the table normally and the DDBMS decides which copy to use (or uses both for performance).
8.2 Heterogeneity Transparency
This hides the fact that different sites may use different DBMSs, different data models, or different computer systems. The user sees a uniform interface regardless of the underlying technology differences. The DDBMS handles all the data translation needed.
8.3 Transaction Transparency
This ensures that database transactions maintain the distributed database’s integrity and consistency. Even when a transaction spans multiple sites, it must behave correctly β either all sites commit the changes or none do. We will look at the specific types of transactions in the next section.
8.4 Failure Transparency
This hides the fact that failures (site crashes, network failures) can occur. The DDBMS should continue operating as normally as possible even when some sites are down. The user should not see error messages about remote site failures β the system should handle recovery automatically.
8.5 Performance Transparency
This hides the performance differences between local and remote access. The DDBMS should optimize query processing so that the user gets acceptable response times regardless of where the data is located. The system might choose to use a replica that is closer rather than the original that is far away, without the user knowing.
Q1 (MCQ). At which level of distribution transparency does the user know about fragments but NOT about their physical locations?
a) Fragmentation transparency
b) Location transparency
c) Local mapping transparency
d) Replica transparency
Q2 (MCQ). A user issues a query: SELECT * FROM EMPLOYEE WHERE dept = ‘Sales’. The DDBMS automatically determines which fragments contain Sales employees and which sites store those fragments. This is an example of:
a) Local mapping transparency
b) Location transparency
c) Fragmentation transparency
d) Failure transparency
Q3 (MCQ). Which transparency feature hides the existence of multiple copies of data from the user?
a) Fragmentation transparency
b) Location transparency
c) Heterogeneity transparency
d) Replica transparency
Q4. Explain the difference between heterogeneity transparency and performance transparency.
A1. b) Location transparency. At this level, the user knows that fragments exist (they can refer to them by name) but does NOT know where each fragment is physically stored. The DDBMS handles locating the fragment. At fragmentation transparency, the user doesn’t even know about fragments. At local mapping transparency, the user knows both fragments AND locations.
A2. c) Fragmentation transparency. The user queries a single logical table (EMPLOYEE) without any knowledge of fragments or locations. The DDBMS automatically determines which fragments are needed and where they are. This is the highest level of distribution transparency β the user knows nothing about the distribution.
A3. d) Replica transparency. Replica transparency specifically hides the fact that multiple copies (replicas) of data exist at different sites. The user queries the data normally and the DDBMS decides which copy to access. Fragmentation transparency hides splitting, not copying. Location transparency hides locations, not copies. Heterogeneity transparency hides different DBMS types.
A4. Heterogeneity transparency hides the technical differences between sites β different DBMS software, different data models (relational vs hierarchical), different operating systems. The user sees a uniform interface regardless of what technology runs at each site. Performance transparency hides the speed differences caused by data distribution β local access is fast, remote access is slow. The DDBMS optimizes queries (choosing closer replicas, parallel processing) so the user gets good performance without knowing where data is. One hides technology differences, the other hides speed differences.
9. Transaction Transparency β Types of Transactions
Transaction transparency ensures that transactions maintain the integrity and consistency of the distributed database. To understand this fully, we need to distinguish between four types of transaction requests. This is a very common exam topic, so pay close attention.
9.1 Remote Request
A remote request lets a single SQL statement access data processed by a single remote database processor. One statement, one remote site.
Example: A user at Site A sends one SELECT statement that is processed entirely at Site B. The user’s single SQL statement touches data at only one remote site.
9.2 Remote Transaction
A remote transaction accesses data at a single remote site, but it can consist of multiple SQL statements. The key difference from a remote request: a remote request is ONE statement, a remote transaction can be MANY statements β but all go to the same remote site.
9.3 Distributed Transaction
A distributed transaction can update or request data from several different remote sites. It can consist of multiple SQL statements that reference multiple sites. This is where it gets complex β the DDBMS must coordinate across sites to ensure atomicity (all sites commit or none do).
9.4 Distributed Request
A distributed request is the most complex type. It lets a single SQL statement reference data located at several different sites. Note the key difference from a distributed transaction: a distributed transaction uses MULTIPLE statements to access multiple sites, while a distributed request uses a SINGLE statement to access multiple sites.
| Transaction Type | Number of SQL Statements | Number of Sites Accessed |
|---|---|---|
| Remote Request | One | One remote site |
| Remote Transaction | Multiple | One remote site |
| Distributed Transaction | Multiple | Multiple remote sites |
| Distributed Request | One | Multiple remote sites |
β’ Distributed Transaction = Multiple statements β Multiple sites
β’ Distributed Request = One statement β Multiple sites
Also: Remote Request vs Remote Transaction:
β’ Remote Request = One statement β One site
β’ Remote Transaction = Multiple statements β One site
β’ Remote Request: 1 statement, 1 site
β’ Remote Transaction: Multiple statements, 1 site
β’ Distributed Transaction: Multiple statements, multiple sites
β’ Distributed Request: 1 statement, multiple sites
Q1 (MCQ). A single SQL statement joins a table stored at Site A with a table stored at Site B. This is a:
a) Remote request
b) Remote transaction
c) Distributed transaction
d) Distributed request
Q2 (MCQ). A transaction consisting of three UPDATE statements, all of which access data at a single remote Site C, is a:
a) Remote request
b) Remote transaction
c) Distributed transaction
d) Distributed request
Q3 (MCQ). Which transaction type is the MOST complex because it requires the DDBMS to process a single query across multiple sites?
a) Remote request
b) Remote transaction
c) Distributed transaction
d) Distributed request
Q4. A user at Site A executes the following: first, an UPDATE on a table at Site B; then, an INSERT into a table at Site C; then, a SELECT from a table at Site B. What type of transaction is this? Explain your reasoning.
A1. d) Distributed request. A distributed request is defined as a SINGLE SQL statement that references data at MULTIPLE sites. A join between tables at different sites is the classic example β one statement, two sites. It is NOT a remote request (which accesses only one site) or a distributed transaction (which uses multiple statements).
A2. b) Remote transaction. A remote transaction consists of MULTIPLE SQL statements that all access data at a SINGLE remote site. Three UPDATE statements, all going to Site C, fits this definition perfectly. It is not a remote request because that requires only ONE statement. It is not distributed because it touches only one site.
A3. d) Distributed request. A distributed request is the most complex because a single SQL statement must be decomposed by the DDBMS into sub-queries that are sent to different sites, the results must be collected from all sites, and then combined into a single result set β all for one user query. A distributed transaction also involves multiple sites but with separate statements, so each statement is simpler to route.
A4. This is a distributed transaction. The reasoning: there are multiple SQL statements (UPDATE, INSERT, SELECT) and they access multiple remote sites (Site B and Site C). The definition of a distributed transaction is: multiple SQL statements that reference several different remote sites. It is NOT a distributed request because it uses multiple statements (not one). It is NOT a remote transaction because it touches more than one site.
Quick Revision Notes β Exam Focus
1. Distributed Database Definition
β’ Connected by a data communications link
β’ Each site = a DB in its own right (own DBMS, own users)
β’ Sites collaborate transparently
β’ Union of all DBs = DB of whole organization
2. Schema Types
β’ Global schema: describes objects at ALL sites (union of all local schemas)
β’ DDBMS: software that manages the DDB and makes distribution transparent
3. Characteristics
β’ Transparency β user doesn’t know physical locations
β’ Distributed processing β processing at multiple sites
β’ Distributed catalog management β site catalogs track fragments and replicas
4. Levels of Distribution (SPSD / MPSD / MPMD)
| Level | Processing | Data | Type |
|---|---|---|---|
| SPSD | Single site | Single site | Centralized DBMS |
| MPSD | Multiple sites | Single site | Distributed Processing Environment (NOT a DDB) |
| MPMD | Multiple sites | Multiple sites | Fully Distributed DDBMS |
5. Homogeneous vs Heterogeneous
| Feature | Homogeneous | Heterogeneous |
|---|---|---|
| DBMS at nodes | Same | Different |
| Local access | Global schema | Local schema |
| Remote access | Global schema | Global schema |
| Exclusively local data | None | Can exist |
| Global schema | Union of all local | Used for remote access |
6. DDBMS Components
2. Network hardware and software
3. Communications media
4. Transaction Processor (TP) β requests data (= Application Processor = Transaction Manager)
5. Data Processor (DP) β stores/retrieves data (= Data Manager; may be centralized DBMS)
7. Advantages vs Disadvantages
β Disadvantages: Complexity of management, increased cost, security risks (network + remote fragments), lack of standards, increased storage (replicas), difficult to maintain integrity (network resource intensive)
8. DDBMS Additional Functions
2. Locate data β direct to proper sites
3. Process queries: local, remote, compound (global)
4. Network-wide concurrency control & recovery
5. Data translation (heterogeneous systems)
9. Five Transparency Features
2. Heterogeneity transparency
3. Transaction transparency
4. Failure transparency
5. Performance transparency
10. Three Levels of Distribution Transparency
Location Transparency (MIDDLE) β user knows fragments exist, NOT where they are
Local Mapping Transparency (LOWEST) β user knows fragments AND their locations
Replica Transparency β hides multiple copies of data
11. Four Transaction Types
| Type | SQL Statements | Sites |
|---|---|---|
| Remote Request | One | One remote |
| Remote Transaction | Multiple | One remote |
| Distributed Transaction | Multiple | Multiple remote |
| Distributed Request | One | Multiple remote |
Common Mistakes in Exams
β Mistake 2: Saying local access in homogeneous uses local schema β NO, all access uses global schema
β Mistake 3: Confusing distribution transparency levels (putting location above fragmentation) β Fragmentation is HIGHEST
β Mistake 4: Confusing distributed transaction with distributed request β DT = multiple statements, DR = one statement
β Mistake 5: Saying TP stores data β NO, TP requests data; DP stores data
β Mistake 6: Forgetting replica transparency β it is separate from the three levels
β Mistake 7: Saying fully heterogeneous only has different DBMSs β it can also have different DATA MODELS and COMPUTER SYSTEMS
Challenge Exam Questions
These questions are designed to test deep understanding. Try them before looking at answers!
Section A: Multiple Choice Questions
Q1. A distributed database has five sites. A user at Site 1 issues a single SELECT statement that joins three tables: one at Site 2, one at Site 4, and one at Site 5. Which transaction type is this, and at which level of distribution transparency is the system operating?
a) Distributed transaction at location transparency
b) Distributed request at fragmentation transparency
c) Remote request at local mapping transparency
d) Distributed transaction at fragmentation transparency
b) Distributed request at fragmentation transparency. This is a distributed request because it is a SINGLE SQL statement accessing MULTIPLE sites (Sites 2, 4, and 5). It is operating at fragmentation transparency because the user simply references tables by name without specifying which site holds each table β the DDBMS handles locating them automatically. If it were at location transparency, the user would reference fragments by name. If it were at local mapping transparency, the user would specify site names.
Q2. In a fully heterogeneous DDBMS, Site A runs Oracle on a mainframe (relational model), Site B runs MySQL on a Linux server (relational model), and Site C runs IMS on a mainframe (hierarchical model). A user at Site A needs to join data from all three sites. Which DDBMS function is MOST critical for this operation to succeed?
a) Network-wide concurrency control
b) Data translation
c) Location transparency interface
d) Distributed catalog management
b) Data translation. Data translation is the function specifically required in heterogeneous systems to convert data between different DBMS formats and data models. In this scenario, relational data from Oracle and MySQL must be translated to work with hierarchical data from IMS. Without data translation, the join would fail because the systems use different data models. The other functions are important but not the MOST critical for this specific cross-model operation.
Q3. An organization uses MPSD for its database environment. Which of the following statements is TRUE about this environment?
a) It is a fully distributed database management system
b) Data is stored at multiple sites but processed at one site
c) Processing occurs at multiple sites but all data resides at a single site
d) Both data and processing are distributed across multiple sites
c) Processing occurs at multiple sites but all data resides at a single site. MPSD stands for Multiple-Site Processing, Single-Site Data. The “M” in the first part means multiple sites do processing. The “S” in the second part means data is at a single site (a file server). Option a describes MPMD. Option b reverses the correct arrangement. Option d describes MPMD.
Q4. A site catalog at Site X shows that Site X stores fragment F3 of the PRODUCT table AND a replica of the CUSTOMER table originally from Site Y. If a user queries the CUSTOMER table from Site X, which transparency feature ensures the user does not know they are reading a replica rather than the original?
a) Fragmentation transparency
b) Location transparency
c) Replica transparency
d) Failure transparency
c) Replica transparency. Replica transparency is specifically defined as the DDBMS’s ability to hide the existence of multiple copies of data from the user. The user queries CUSTOMER normally β the DDBMS decides whether to use the original at Site Y or the replica at Site X. The user never knows which copy was used. Fragmentation transparency hides splitting. Location transparency hides where fragments are. Failure transparency hides site failures.
Q5. In a homogeneous DDBMS with four sites, which of the following is TRUE about exclusively local data?
a) Each site can have data managed only by its local DBMS, outside the DDBMS
b) No exclusively local data exists β all data is managed by the distributed DBMS
c) Exclusively local data exists only at the headquarters site
d) Exclusively local data can only be accessed through the global schema
b) No exclusively local data exists β all data is managed by the distributed DBMS. This is a defining characteristic of homogeneous DDBMS. Unlike heterogeneous DDBMS (where local access uses the local schema and some data may be exclusively local), in a homogeneous DDBMS, ALL data at ALL sites is managed by the distributed DBMS. There is no data that exists “outside” the DDBMS’s control. All access goes through the one global schema.
Section B: True/False
Q6. In a distributed database, the Transaction Processor (TP) at a site is responsible for storing and retrieving data located at that site.
False. The Transaction Processor (TP) is responsible for requesting data, not storing/retrieving it. The Data Processor (DP) is the component that stores and retrieves data at a site. The TP is the “asker” and the DP is the “provider.”
Q7. Local mapping transparency provides a higher degree of transparency to the user than fragmentation transparency.
False. Local mapping transparency provides the LOWEST degree of transparency. At this level, the user knows about fragments AND their physical locations. Fragmentation transparency provides the HIGHEST degree β the user knows nothing about fragments or locations. The order from highest to lowest transparency is: Fragmentation β Location β Local Mapping.
Q8. A covert channel is a legitimate pathway for authorized data flow between sites in a DDBMS.
False. This question mixes concepts from different chapters. A covert channel is an illegitimate, hidden pathway that violates security policy β it is NOT a legitimate pathway for authorized data flow. Covert channels are a security threat, not a feature of DDBMS. This concept belongs to database security (Chapter 6), not distributed databases (Chapter 7).
Q9. In an MPSD environment, if the single data site fails, all processing at all sites stops.
True. In MPSD (Multiple-Site Processing, Single-Site Data), ALL data resides at ONE site. If that single data site fails, no processing site can access any data because there is no other copy of the data anywhere. All processing stops because processing requires data, and the only data is unavailable. This is a major weakness of MPSD compared to MPMD.
Section C: Fill in the Blanks
Q10. The three levels of distribution transparency, from highest to lowest, are ________ transparency, ________ transparency, and ________ transparency.
Fragmentation; Location; Local mapping. Fragmentation transparency is the highest (user knows nothing about distribution). Location transparency is the middle (user knows about fragments but not locations). Local mapping transparency is the lowest (user knows about fragments and their locations).
Q11. A ________ transaction consists of multiple SQL statements that all access data at a ________ remote site, while a ________ request consists of a single SQL statement that accesses data at ________ remote sites.
Remote; single; distributed; multiple. A remote transaction: multiple statements, one site. A distributed request: one statement, multiple sites. These two represent opposite corners of the 2×2 matrix of transaction types.
Q12. The DDBMS component that may be a centralized DBMS is the ________ ________, and the component that requests data is the ________ ________.
Data Processor (or Data Manager); Transaction Processor (or Application Processor or Transaction Manager). The Data Processor stores and retrieves data locally and may be implemented as a centralized DBMS. The Transaction Processor requests data from other sites.
Q13. MPSD is also called a ________ ________ ________, and it is NOT considered a true ________ ________ because the data is not distributed.
Distributed Processing Environment; Distributed Database. MPSD has distributed processing but centralized data. Since the data is not spread across multiple sites, it does not qualify as a distributed database β only as a distributed processing environment.
Section D: Short Answer and Explanation Questions
Q14. A hospital has a centralized database at its main campus. During a network upgrade, the main server crashes for 6 hours. All five branch clinics are unable to access any patient records during this time. Explain how converting to an MPMD distributed database could prevent this problem. Reference specific advantages of DDBMS in your answer.
Converting to MPMD would solve this problem through the advantage of improved availability. In an MPMD system, each branch clinic would have its own local data store. If the main server crashes, only the data at the main campus would be affected. The branch clinics would continue to operate normally using their local data. The textbook specifically states: “a fault in one database system will only affect one fragment, instead of the entire database.” In the centralized system, one crash affects EVERYTHING. Additionally, faster data access would improve because each clinic accesses local data without network delays to the main server. Modularity means the main server could be repaired or replaced without disrupting operations at the branches. The reflects organizational structure advantage applies because the hospital is naturally distributed across multiple locations.
Q15. Compare and contrast remote transaction and distributed request. Use a specific example for each to illustrate the differences.
Remote transaction: Multiple SQL statements, one remote site.
Example: A user at Site A executes three statements β UPDATE, INSERT, and DELETE β all on tables stored at Site B. Only one site is involved, but multiple statements are executed.
Distributed request: One SQL statement, multiple remote sites.
Example: A user at Site A executes a single SELECT statement that JOINs a table at Site B with a table at Site C. One statement, but it touches two sites.
Key differences:
1. Number of SQL statements: Remote transaction = multiple; Distributed request = one
2. Number of sites: Remote transaction = one; Distributed request = multiple
3. Complexity: Distributed request is more complex because the DDBMS must decompose a single query into sub-queries for different sites and merge the results. A remote transaction is simpler because all statements go to one site.
4. Coordination: Distributed request requires the DDBMS to coordinate data retrieval from multiple sites for a single query. Remote transaction requires the DDBMS to send multiple statements to one site but no cross-site coordination for a single statement.
Q16. Explain why enforcing integrity constraints is more difficult in a distributed database than in a centralized database. Give a specific example of an integrity constraint that would be challenging to enforce across sites.
Enforcing integrity constraints in a distributed database is difficult because the constraints may involve data stored at different sites, and verifying them requires network communication between those sites. This consumes network bandwidth, adds latency, and creates coordination overhead. The textbook specifically states that “enforcing integrity over a network may require too much networking resources to be feasible.”
Example: Consider a foreign key constraint: “Every ORDER must reference a valid CUSTOMER.” If the ORDER table is at Site A and the CUSTOMER table is at Site B, then every INSERT into ORDER at Site A must first check Site B to verify the customer exists. This means every INSERT requires at least one network message to Site B and a response back. If Site B is down or the network is slow, the INSERT cannot be completed. In a centralized database, this check is a simple local operation with no network overhead. Multiply this by thousands of transactions, and the network resource consumption becomes significant.
Q17. A student argues: “MPSD is better than MPMD because it is simpler to manage β all data is in one place.” Give three reasons why an organization might still choose MPMD over MPSD, despite the added complexity.
Despite its simplicity, MPSD has significant limitations that make MPMD preferable for many organizations:
1. Improved availability (critical advantage): In MPSD, if the single data site fails, the ENTIRE system goes down. In MPMD, a failure at one site only affects data at that site. For organizations that require high availability (hospitals, banks, airlines), this alone justifies MPMD.
2. Faster data access and improved performance: In MPSD, every data request from every processing site must travel over the network to the single data site. This creates network congestion and slow response times, especially for geographically distributed organizations. In MPMD, local data is accessed locally without network delays.
3. Reflects organizational structure: Many organizations are naturally distributed with autonomous branches. MPMD allows each branch to own and manage its own data (local autonomy). MPSD forces all data to be centrally stored, which may not match how the organization actually works.
4. Processor independence: MPMD allows different sites to use different hardware. MPSD ties the organization to the hardware platform of the single data server.