15.21

The hash-join algorithm as described in Section 15.5.5 computes the natural join of two relations. Describe how to extend the hash-join algorithm to compute the natural left outer join, the natural right outer join, and the natural full outer join. (Hint: Keep extra information with each tuple in the hash index to detect whether any tuple in the probe relation matches the tuple in the hash index.) Try out your algorithm on the takes and student relations.


Extension of the hash-join algorithm to compute the natural left outer join (⟕)

Let say we want to compute \(r ⟕ s\). As written in the book, we generally want the smaller relation to be the build relation.

Case 1: The number of blocks of \(r\) is much smaller than the number of blocks of \(s\)

Thus we choose \(r\) as the build relation. Thus \(s\) is the probe relation. We first partition \(r\) into the partitions \(r_0, r_1, ..., r_{n_h}\), identical to the way given in Figure 15.10. We then partition \(s\) into the partitions \(s_0, s_1, ..., s_{n_h}\) identical to the way given in Figure 15.10. We assume for simplicity that Recursive Partitioning is not needed. We also assume that Hash-table overflow does not occur.

For \(i \in \{0, 1, ..., n_h\}\) we do the following:

  1. Read partition \(r_i\) and build an in-memory hash index on it (the hash function used to build this is different from the one used earlier).
  2. When building the in-memory hash index on \(r_i\) we keep a boolean with each tuple in \(t_{r_i} \in r_i\). We initialize this boolean to False initially. Suppose also we can access this boolean as \(t_{r_i}.b\).
  3. For each tuple \(t_{s_i} \in s_i\) we do the following:
    • Probe the in-memory hash index on \(r_i\) to locate all tuples \(t_{r_i}\) such that \(t_{r_i}[JoinAttrs] = t_{s_i}[JoinAttrs]\)
    • For each matching tuple \(t_{r_i} \in r_i\) we do the following:
      • Add \(t_{r_i} \bowtie t_{s_i}\) to the result.
      • We set \(t_{r_i}.b\) to True.
  4. For each tuple \(t_{r_i} \in r_i\)
    • If \(t_{r_i}.b\) is False
      • Add \(t_{r_i}\) padded with nulls to the result.

Case 2: The number of blocks of \(s\) is much smaller than the number of blocks of \(r\)

Thus we choose \(s\) as the build relation. Thus \(r\) is the probe relation. We first partition \(r\) into the partitions \(r_0, r_1, ..., r_{n_h}\), identical to the way given in Figure 15.10. We then partition \(s\) into the partitions \(s_0, s_1, ..., s_{n_h}\) identical to the way given in Figure 15.10. We assume for simplicity that Recursive Partitioning is not needed. We also assume that Hash-table overflow does not occur.

For \(i \in \{0, 1, ..., n_h\}\) we do the following:

  1. Read partition \(s_i\) and build an in-memory hash index on it (the hash function used to build this is different from the one used earlier).
  2. For each tuple \(t_{r_i} \in r_i\) we do the following:
    • Probe the in-memory hash index on \(s_i\) to locate all tuples \(t_{s_i}\) such that \(t_{r_i}[JoinAttrs] = t_{s_i}[JoinAttrs]\)
    • If there does not exist such tuples that satisfy this:
      • Add \(t_{r_i}\) padded with nulls to the result.
    • Else:
      • For each matching tuple \(t_{s_i} \in s_i\) we do the following:
        • Add \(t_{r_i} \bowtie t_{s_i}\) to the result.

If relations \(r\) and \(s\) are roughly the same size, then I think performing Case 2 given above will result in better performance, since it does not keep booleans/extra information.

Extension of the hash-join algorithm to compute the natural right outer join (⟖)

Let say we want to compute \(r ⟖ s\). This is equivalent to \(s ⟕ r\). And we know how to perform natural left outer join using the hash-join algorithm (see above).

Extension of the hash-join algorithm to compute the natural full outer join (⟗)

We know that natural full outer join is commutative. That is \(r ⟗ s\) results in the same relation as \(s ⟗ r\).

Let’s say we want to compute \(r ⟗ s\). Without loss of generality, assume the number of blocks of \(r\) is less than or equal to the number of blocks of \(s\). Thus we choose \(r\) as our build relation and \(s\) as our probe relation. We first partition \(r\) into the partitions \(r_0, r_1, ..., r_{n_h}\), identical to the way given in Figure 15.10. We then partition \(s\) into the partitions \(s_0, s_1, ..., s_{n_h}\) identical to the way given in Figure 15.10. We assume for simplicity that Recursive Partitioning is not needed. We also assume that Hash-table overflow does not occur.

For \(i \in \{0, 1, ..., n_h\}\) we do the following:

  1. Read partition \(r_i\) and build an in-memory hash index on it (the hash function used to build this is different from the one used earlier).
  2. When building the in-memory hash index on \(r_i\) we keep a boolean with each tuple in \(t_{r_i} \in r_i\). We initialize this boolean to False initially. Suppose also we can access this boolean as \(t_{r_i}.b\).
  3. For each tuple \(t_{s_i} \in s_i\) we do the following:
    • Probe the in-memory hash index on \(r_i\) to locate all tuples \(t_{r_i}\) such that \(t_{r_i}[JoinAttrs] = t_{s_i}[JoinAttrs]\)
    • If there does not exist such tuples that satisfy this:
      • Add \(t_{s_i}\) padded with nulls to the result.
    • Else:
      • For each matching tuple \(t_{r_i} \in r_i\) we do the following:
        • Add \(t_{r_i} \bowtie t_{s_i}\) to the result.
        • We set \(t_{r_i}.b\) to True.
  4. For each tuple \(t_{r_i} \in r_i\)
    • If \(t_{r_i}.b\) is False
      • Add \(t_{r_i}\) padded with nulls to the result.