Apache Spark: Split a pair RDD into multiple RDDs by key

This drove me crazy but I finally found a solution.

If you want to split a pair RDD of type (A, Iterable(B)) by key, so that the result is several RDDs of type B, here is how to do it:

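// A random pair RDD (A and B stand for your own key and value types)
JavaPairRDD<A, Iterable<B>> rdd = ...

// Get the list of distinct keys (this brings them to the driver)
List<A> keys = rdd.keys().distinct().collect();

// Iterate through the keys
for (A key : keys) {

    // Get an RDD by filtering the original RDD by key
    JavaRDD<B> rddByKey = getRddByKey(rdd, key);

    // ...
}

private JavaRDD<B> getRddByKey(JavaPairRDD<A, Iterable<B>> pairRDD, A key) {
    // Keep only the pairs with this key, drop the key, then flatten each Iterable<B>.
    // Spark 2.x and later: flatMap expects an Iterator, hence values.iterator().
    // On Spark 1.x, return the Iterable itself instead: flatMap(values -> values).
    return pairRDD.filter(pair -> pair._1().equals(key)).values().flatMap(values -> values.iterator());
}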

The trick is twofold: (1) get the list of all the keys, and (2) iterate through that list and, for each key, create a new RDD by filtering the original pair RDD to keep only the values attached to that key. Note the use of Java 8 lambda expressions in getRddByKey.
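To make the two steps concrete without a cluster, here is a small plain-Java model of them. It is only a sketch: java.util collections stand in for the RDDs, and the class name SplitByKeyModel, the method names, and the sample data are made up for illustration, not part of Spark.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory model of the split; a List of entries plays the role of the pair RDD.
public class SplitByKeyModel {

    // Step 1: the distinct keys, mirroring rdd.keys().distinct().collect()
    public static <A, B> List<A> distinctKeys(List<Map.Entry<A, List<B>>> pairs) {
        return pairs.stream()
                .map(Map.Entry::getKey)
                .distinct()
                .collect(Collectors.toList());
    }

    // Step 2: keep the pairs with the given key, then flatten their values,
    // mirroring filter(...).values().flatMap(...)
    public static <A, B> List<B> valuesForKey(List<Map.Entry<A, List<B>>> pairs, A key) {
        return pairs.stream()
                .filter(pair -> pair.getKey().equals(key))
                .flatMap(pair -> pair.getValue().stream())
                .collect(Collectors.toList());
    }

    // Both steps together: one flattened list of values per key.
    // Note that every key triggers its own pass over the full input.
    public static <A, B> Map<A, List<B>> splitByKey(List<Map.Entry<A, List<B>>> pairs) {
        Map<A, List<B>> result = new LinkedHashMap<>();
        for (A key : distinctKeys(pairs)) {
            result.put(key, valuesForKey(pairs, key));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, List<Integer>>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", Arrays.asList(1, 2)));
        pairs.add(new SimpleEntry<>("b", Arrays.asList(3)));
        pairs.add(new SimpleEntry<>("a", Arrays.asList(4)));

        System.out.println(splitByKey(pairs)); // prints {a=[1, 2, 4], b=[3]}
    }
}
```

The model also makes the cost visible: the input is scanned once per key. In Spark the same holds for each filtered RDD, so persisting the original pair RDD (for example with cache()) before the loop is worth considering when there are many keys.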

Remark: bear in mind that the collect() action brings the keys of the entire distributed RDD to the driver machine. Keys are generally integers or small strings, so collecting them on one machine is usually not a problem. If the keys are large or very numerous, however, this is not the best way to go. If you have a suggestion, you are more than welcome to leave a comment below… spread the word, help the world!

Comments

2 thoughts on “Apache Spark: Split a pair RDD into multiple RDDs by key”

Hi Mohamed,

It's somewhat difficult for me to think of a real use case that requires creating an RDD for each key. If the goal is a reference lookup, then RDDs by key are not the best solution; a better one would be a hashmap (if not too large) or a lookup from a table.

If the data is generated by some complex transformation and is later used as a reference lookup for a join etc., we can store the transformed data in distributed storage and, when required, join an RDD to any NoSQL database table (e.g. Cassandra provides CassandraConnector APIs).

What was your precise need for creating RDDs by key?

Regards
Sumit

Sumit Sharma, March 18, 2016

To save multiple RDDs as Parquet files. Your suggestion to store intermediate RDDs in distributed storage might be the better solution.

Mohamed Mami, July 6, 2017
