10.5
What is the conceptual problem with the following snippet of Apache Spark code, which is meant to work on very large data? Note that the collect() function returns a Java collection, and Java collections (from Java 8 onwards) support map and reduce functions.
JavaRDD<String> lines = sc.textFile("logDirectory");
int totalLength = lines.collect().map(s -> s.length())
                       .reduce(0, (a, b) -> a + b);
The problem with the code is that the collect() function gathers the entire contents of the RDD at a single node (the driver); the map and reduce functions are then executed on that one node, not in parallel across the cluster as intended. Worse, on very large data the collected result may not even fit in the driver's memory.
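A corrected version keeps the computation inside Spark by applying map() and reduce() directly to the RDD, so the work is distributed across the cluster and only the final integer is returned to the driver. The following is a minimal sketch; the SparkConf/JavaSparkContext setup and the application name "TotalLength" are illustrative assumptions, not part of the original snippet.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative setup; in the original snippet, sc is assumed to exist already.
SparkConf conf = new SparkConf().setAppName("TotalLength");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("logDirectory");
// map() and reduce() run as distributed RDD operations on the worker nodes;
// only the final integer result is brought back to the driver.
int totalLength = lines.map(s -> s.length())
                       .reduce((a, b) -> a + b);

Note that RDD reduce() takes only the combining function; unlike the collection version in the question, it has no separate initial-value argument.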