Big Data Archives | Mohamed Nadjib's WebWorld

Getting started in Big Data for Master’s students

Following our Big Data Europe (BDE) project activities, we organized a webinar about Apache Flink and its usage inside the project. We were pleased to host one of the most relevant people on the topic: Kostas Tzoumas, CEO of dataArtisans (the company driving the development of Flink). Among the questions we received at the end, one triggered my thinking and kept me busy for several days. It was a broad question but a very valid one: “What are the must-do things for Master’s students who start their career in Big Data technologies?”

Before diving in, I’d like to put things in context. At the time of writing, I have been a PhD student/researcher for more than two years, with a close relation to industry through a number of projects and partnerships that I take part in or follow closely. As such, looking under the hood of things is the core of my job. I have been working on Big Data topics for five years now, on an almost daily basis, alternating between coding and reading scientific and technical material about Big Data and Data Management in general.

Note: although the advice below is primarily addressed to Master’s students, it can still benefit anyone approaching the topic of Big Data for the first time.

With that out of the way, let’s get into our topic.

1. Big Data is not a separate field of study.

In my opinion, the biggest confusion among newcomers and curious folks is thinking that Big Data is a separate field of study to specialize in. I would argue that it is not. It is not as if you could major in Big Data instead of, say, Data Mining. That would be comparing apples to some type of soil. The way I perceive it, Big Data is a ‘way of thinking’ about topics dealing with data that lives under certain circumstances. Those circumstances are what people commonly use to define Big Data.

Big Data covers the situations faced when traditional techniques and technologies fail to keep a data-based system functional. Those situations are: (1) data grows beyond the processing ability of the system, (2) data streams in at a pace that overwhelms the ingestion capability of the system, and (3) data takes so many forms and types that the system becomes increasingly complex to build, even for simple tasks. Respectively, these are called Volume, Velocity and Variety, or the 3 V’s, following Gartner’s “Deja VVVu” proposal.

That said, Big Data cuts horizontally across many fields, including but not limited to: databases, information retrieval, machine learning, text mining, image processing, natural language processing, Industry 4.0, etc., wherever there is data to be processed.

Apache Spark: Split a pair RDD into multiple RDDs by key

This drove me crazy, but I finally found a solution.

If you want to split a pair RDD of type (A, Iterable<B>) by key, so that the result is several RDDs of type B, here is how to do it:

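import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// A pair RDD whose values are grouped by key (A and B stand for your own types)
JavaPairRDD<A, Iterable<B>> rdd = ...

// Get the list of distinct keys (collect() brings them to the driver)
List<A> keys = rdd.keys().distinct().collect();

// Iterate through the keys
for (A key : keys) {

    // Get an RDD by filtering the original RDD by key
    JavaRDD<B> rddByKey = getRddByKey(rdd, key);

    // ...
}

// Keep only the pairs with the given key, then flatten their Iterable<B> values into a JavaRDD<B>.
// On Spark 2.x and later, flatMap expects an Iterator, so use tuples -> tuples.iterator() there.
private JavaRDD<B> getRddByKey(JavaPairRDD<A, Iterable<B>> pairRDD, A key) {
    return pairRDD.filter(v -> v._1().equals(key)).values().flatMap(tuples -> tuples);
}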

The trick is twofold: (1) get the list of all the keys, and (2) iterate through that list and, for each key, create a new RDD by filtering the original pair RDD to keep only the values attached to that key. Note the use of Java 8 lambda expressions in getRddByKey.
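To see the two steps in isolation, here is a minimal sketch that mimics them with plain Java collections and streams instead of Spark, so it runs without a cluster. It is only a local stand-in for the idea, not Spark code: the class SplitByKeySketch, the Pair holder and the method names are all hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitByKeySketch {

    // Hypothetical local stand-in for one entry of a pair RDD: a key and its grouped values.
    static final class Pair<A, B> {
        final A key;
        final List<B> values;

        Pair(A key, List<B> values) {
            this.key = key;
            this.values = values;
        }
    }

    // Step 1: list the distinct keys (plays the role of rdd.keys().distinct().collect()).
    static <A, B> List<A> distinctKeys(List<Pair<A, B>> pairs) {
        return pairs.stream().map(p -> p.key).distinct().collect(Collectors.toList());
    }

    // Step 2: keep the entries with the given key and flatten their values
    // (plays the role of filter(...).values().flatMap(...)).
    static <A, B> List<B> valuesForKey(List<Pair<A, B>> pairs, A key) {
        return pairs.stream()
                .filter(p -> p.key.equals(key))
                .flatMap(p -> p.values.stream())
                .collect(Collectors.toList());
    }

    // Runs step 2 once per key found in step 1, keeping keys in first-seen order.
    static <A, B> Map<A, List<B>> splitByKey(List<Pair<A, B>> pairs) {
        Map<A, List<B>> result = new LinkedHashMap<>();
        for (A key : distinctKeys(pairs)) {
            result.put(key, valuesForKey(pairs, key));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Pair<String, Integer>> pairs = Arrays.asList(
                new Pair<String, Integer>("a", Arrays.asList(1, 2)),
                new Pair<String, Integer>("b", Arrays.asList(3)),
                new Pair<String, Integer>("a", Arrays.asList(4)));

        System.out.println(splitByKey(pairs)); // prints {a=[1, 2, 4], b=[3]}
    }
}
```

As in the Spark version, the data is scanned once per key, so the approach suits a modest number of distinct keys.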

Remark: bear in mind that the action collect() brings the keys of the entire distributed RDD to the driver machine. Keys are generally integers or small strings, so collecting them on one machine shouldn’t be a problem. Otherwise, this wouldn’t be the best way to go. If you have a suggestion, you are more than welcome to leave a comment below… spread the word, help the world!
