14.8
Suppose you have a relation \(r\) with \(n_r\) tuples on which a secondary B+-tree is to be constructed.
Give a formula for the cost of building the B+-tree index by inserting one record at a time. Assume each block will hold an average of \(f\) entries and that all levels of the tree above the leaf are in memory.
Assuming a random disk access takes 10 milliseconds, what is the cost of index construction on a relation with 10 million records?
Write pseudocode for bottom-up construction of a B+-tree, which was outlined in Section 14.4.4. You can assume that a function to efficiently sort a large file is available.
- Give a formula for the cost of building the B+-tree index by inserting one record at a time. Assume each block will hold an average of \(f\) entries and that all levels of the tree above the leaf are in memory.
The cost to locate the page number of the required leaf page for an insertion is negligible since the non-leaf nodes are in memory. On the leaf level it takes one random disk access to read and one random disk access to update it along with the cost to write one page. Insertions which lead to splitting of leaf nodes require an additional page write. Hence to build a B+-tree with \(n_r\) entries it takes a maximum of \(2 * n_r\) random disk accesses and \(n_r + 2 * (n_r/f)\) page writes. The second part of the cost comes from the fact that in the worst case each leaf is half filled, so the number of splits that occur is twice \(n_r/f\).
The above formula ignores the cost of writing non-leaf nodes, since we assume they are in memory, but in reality they would also be written eventually. This cost is closely approximated by \(2 * (n_r / f)/ f\), which is the number of internal nodes just above the leaf; we can add further terms to account for higher levels of nodes, but these are much smaller than the number of leaves and can be ignored.
- Assuming a random disk access takes 10 milliseconds, what is the cost of index construction on a relation with 10 million records?
Substituting the values in the above formula and neglecting the cost of page writes, it takes about 10,000,000 * 20 milliseconds, or 56 hours, since each insertion costs 20 milliseconds.
- Write pseudocode for bottom-up construction of a B+-tree, which was outlined in Section 14.4.4. You can assume that a function to efficiently sort a large file is available.
function insert_in_leaf(value K, pointer P)
if (tree is empty)
create an empty leaf node L, which is also the root
else
Find the last leaf node in the leaf nodes chain L
if (L has less than n - 1 key values)
then insert (K, P) at the first available location in L
else begin
Create leaf node L1
Let L.P[n] = L1;
Set K1 = last value from page L
insert_in_parent(1, L , K1, L1)
insert (K,P) at the first location in L1
end
function insert_in_parent(level l, pointer P, value K, pointer P1)
if (level l is empty) then begin
Create an empty non-leaf node N, which is also the root
insert (P, K, P1) at the starting of the node N
return
else begin
Find the right most node N at level l
if (N has less than n pointers)
then insert (K, P1) at the first available location in N
else begin
Create a new non-leaf page N1
insert (P1) at the starting of the node N
insert_in_parent(l + 1, pointer N, value K, pointer N1)
end
end
The insert_in_leaf
function is called for each of the value, pointer pairs in ascending order. Similar function can also be built for descending order. The search for the last leaf or non-leaf node at any level can be avoided by storing the current last page details in an array.
The last node in each level might be less than half filled. To make this index structure meet the requirements of a B+-tree, we can redistribute the keys of the last two pages at each level. Since the last but one node is always full, redistribution makes sure that both of them are at least half filled.