Parallel Databases

Explain Inter-operational and Intra-operational parallelism with relevant examples.

Intraoperation Parallelism

Relational operations work on relations that contain large sets of tuples, that we can parallelize the operations by executing them in parallel on different subsets of the relations.
The number of tuples in a relation can be large so the degree of parallelism is potentially enormous.
Hence, we can say intraoperation parallelism is natural in a database system.

The parallel versions of some common relational operations are as follows:

Parallel Sort

For example, we want to sort a relation that resides on n disks D0,D1,......Dn-1.
If this relation is range partitioned on the attributes then each partition is sorted out separately and can concatenate the results to get the full sorted relation.
As the tuples are partitioned on the n disks the time that is required for reading the entire relation is reduced by the parallel access.
If the relation is partitioned in any other way in can be sorted out by using any of the following ways:

Range-partitioning sort

It basically works in two steps: first is to range partition the relation and second is to sort out each partition separately.
When we sort the relation it is not necessary to range -partition the relation on the same set of processors or disks as those on which that relation is stored.
The range-partitioning should be done with a good range-partition vector so that each partition will approximately have the same number of tuples.

Parallel External Sort-Merge

The parallel sort-merge will work in the following manner:

Parallel Join

The join operation tests the pairs of tuples to see whether they satisfy the join condition and if they do the system adds the pair to the join output.
The parallel join algorithms attempt to split the pairs that are to be tested over several processors.
Each processor then checks part of the join locally.
After this the system collects the results from each of the processor for producing the final result.

The types of joins are:

Other relational operators

Interoperation parallelism
It has two types of parallelism:

1. Pipelined Parallelism

The parallel systems use the pipelining mainly for the same that the sequential systems do.
The pipelines are a source of parallelism in the same way that the instructions pipelines are a source of parallelism in hardware design.
Two operations can be run simultaneously on different processors so that the tuple consumes the tuples in parallel to the one producing them.
This form of parallelism is known as pipelined parallelism.

Independent parallelism

The operations in a query expression that do not depend on one another can be executed in parallel. This is known as independent parallelism.
The independent parallelism does not provide a high degree of parallelism and is less useful in a highly parallel system, even if it is useful with a lower degree of parallelism.

For the purpose of parallel I/O the data can be partitioned across multiple disks.
The relational operators such as the sort, join, aggregation can be executed in parallel.
The data here can be partitioned in such a manner that each processor can work independently on its own partition.
The queries are expressed in the high level language which helps in making the parallelization easier.
Different queries can be made run in parallel with each other the conflicts can be taken care by the concurrency control.
Hence, we can say that the databases lend themselves to parallelism.

I/O Parallelism

It helps in reducing the time that is required to retrieve the relations from the disk by partitioning.
All the relations are maintained on multiple disks.
Horizontal partitioning is where the tuples of a relation are divided among many other disks such that each of the tuple resides on one disk.
Partitioning techniques used in I/O parallelism.
Assume that the number of disks = n.

The partitioning techniques are as follows:

Round Robin

It scans the relation in any order and sends the i^th tuple to disk number D_{i mod n}.
The scheme ensures an even distribution of tuples across disks; that is, each disk has approximately the same number of tuples as the others.

Hash partitioning

It is a declustering strategy that designates one or more attributes from the given relation’s schema as the partitioning attributes.
A hash function is chosen whose range is {0, 1, . . . , n - 1}.
Each tuple of the original relation is hashed on the partitioning attributes.
If 'i' is returned by the hash function, then the tuple is placed on disk D_i 1.

Range partitioning

It distributes the tuples by assigning contiguous attribute-value ranges to each disk.
It selects a partitioning attribute, A, and a partitioning vector [v0, v1, . . . , vn-2], such that, if i < j, then v_i < v_j.
The relation is partitioned as follows: Consider a tuple 't' such that t[A] = x. If x < v0, then 't' goes on disk D0. If x = v_n-2, then 't' goes on disk D_n-1. If v_i = x < v_i+1, then 't' goes on disk D_i+1.
Example of this can be with three disks numbered 0, 1, and 2 that may assign tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater than 40 to disk 2.

Comparison of Partitioning Techniques

A relation can be retrieved in parallel by using all the disks once a relation has been partitioned among several disks.
Similarly, when a relation is being partitioned, it can be written to multiple disks in parallel.
The rate transfer for reading or writing an entire relation are much faster with I/O parallelism than without it.
However, it is only one kind of access to data for reading an entire relation, or scanning a relation.

Access to data can be classified as follows:

The entire relation is scanned.
A tuple is located associatively (example, employee name = “Pooja”); these queries, also known as point queries, seek tuples that have a specified value for a specific attribute.
Locating all tuples for which the value of a given attribute lies within a specified range (example, 10000 < salary < 20000); these queries are called range queries.