10.7

Given a collection of documents, for each word \(w_i\), let \(n_i\) denote the number of times the word occurs in the collection. Let \(N\) be the total number of word occurrences across all documents. Next, consider all pairs of consecutive words \((w_i, w_j)\) in the document; let \(n_{i,j}\) denote the number of occurrences of the word pair \((w_i, w_j)\) across all documents. Write an Apache Spark program that, given a collection of documents in a directory, computes \(N\), all pairs \((w_i, n_i)\), and all pairs \(((w_i, w_j), n_{i,j})\). Then output all word pairs such that: \[ \frac {n_{i,j}} {N} \geq 10 * \frac {n_i} {N} * \frac {n_j} {N} \] These are word pairs that occur 10 times or more as frequently as they would be expected to occur if the two words occurred independently of each other.

You will find the join operations on RDDs useful for the last step, to bring related counts together. For simplicity, do not bother about word pairs that cross lines. Also assume for simplicity that words only occur in lowercase and that there are no punctuation marks.


# TODO: Write the Apache Spark program that solves the above question.