Apache Spark: Split a pair RDD into multiple RDDs by key

This drove me crazy but I finally found a solution.

If you want to split a pair RDD of type (A, Iterable&lt;B&gt;) by key, so that the result is one RDD of type B per key, here is how you do it:

The trick is twofold: (1) get the list of all the keys; (2) iterate through that list and, for each key, create a new RDD by filtering the original pair RDD down to the values attached to that key. Note the use of a Java 8 lambda expression in the filter step below.
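A minimal, self-contained sketch of the idea, written against the Spark 2.x Java API; the SplitByKey class name, the String/Integer sample data, and the local[*] master are illustrative placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SplitByKey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SplitByKey").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Start from a grouped pair RDD of type (String, Iterable<Integer>),
        // e.g. the result of a groupByKey().
        JavaPairRDD<String, Iterable<Integer>> grouped = sc
                .parallelizePairs(Arrays.asList(
                        new Tuple2<>("a", 1), new Tuple2<>("b", 2),
                        new Tuple2<>("a", 3), new Tuple2<>("b", 4)))
                .groupByKey();

        // (1) Collect the distinct keys onto the driver.
        List<String> keys = grouped.keys().distinct().collect();

        // (2) For each key, filter the grouped RDD down to that key and
        //     flatten its Iterable, yielding one JavaRDD<Integer> per key.
        Map<String, JavaRDD<Integer>> rddsByKey = new HashMap<>();
        for (String key : keys) {
            JavaRDD<Integer> valuesForKey = grouped
                    .filter(pair -> pair._1().equals(key)) // Java 8 lambda
                    .values()
                    .flatMap(Iterable::iterator);
            rddsByKey.put(key, valuesForKey);
        }

        rddsByKey.forEach((k, rdd) -> System.out.println(k + " -> " + rdd.collect()));
        sc.stop();
    }
}
```

One practical note: each filtered RDD re-scans the original pair RDD, so calling grouped.cache() before the loop avoids recomputing it once per key.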

Remark: bear in mind that the action collect() brings the keys of the entire distributed RDD back to the driver machine. Keys are usually integers or small strings, so collecting them onto one machine isn’t a problem; if your set of keys is huge, though, this isn’t the best way to go. If you have a suggestion, you are more than welcome to leave a comment below… spread the word, help the world!


2 thoughts on “Apache Spark: Split a pair RDD into multiple RDDs by key”

  1. Hi Mohamed,

     It’s somewhat difficult for me to see the real use case where we would want to create an RDD for each key. If it is for reference lookups, then per-key RDDs are not the best solution; a better one would be a hashmap (if not too large) or a lookup from a table.

     If the data is generated by some complex transformation and later used as a reference lookup for joins etc., we can store the transformed data in distributed storage and, when required, join an RDD against any NoSQL database table (e.g. Cassandra provides the CassandraConnector APIs).

     What was your precise need for creating RDDs by key?

     Regards
     Sumit


    1. Saving multiple RDDs to Parquet files. Your suggestion to store the intermediate RDDs in distributed storage might be the better solution.
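For completeness, a minimal sketch of that save step, assuming the Spark 2.x SparkSession API and the per-key JavaRDD&lt;Integer&gt; built in the post; the SaveByKey class and the output path are hypothetical:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SaveByKey {
    // Writes one per-key RDD (as built in the post) to its own Parquet file.
    static void saveAsParquet(SparkSession spark, String key, JavaRDD<Integer> valuesForKey) {
        spark.createDataset(valuesForKey.rdd(), Encoders.INT())
             .write()
             .parquet("/tmp/output/key=" + key); // hypothetical output path
    }
}
```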

