Getting started in Big Data for Master’s students


Following our Big Data Europe (BDE) project activities, we organized a webinar about Apache Flink and its usage inside the project. We were pleased to host one of the most relevant people on the topic: Kostas Tzoumas, CEO of data Artisans (the company driving the development of Flink). Among the questions we received at the end, one got me thinking and kept me busy for several days. It was a broad question but a very valid one: “What are the must-do things for Master’s degree students who start their career in big data technologies?”

Before diving in, I’d like to put things in context. At the time of writing these lines, I have been a PhD student/researcher for more than two years, with close ties to industry through a number of projects and partnerships that I am part of or follow closely. As such, looking under the hood of things is the core of my job. I have been working on Big Data topics for five years now, on an almost daily basis, alternating between coding and reading scientific and technical material about Big Data and Data Management in general.

Note: although the advice given below is primarily addressed to Master’s students, it can still benefit anyone approaching the topic of Big Data for the first time.

With that out of the way, let’s get into our topic.

1. Big Data is not a separate field of study.

In my opinion, the biggest confusion among newcomers or curious folks is thinking that Big Data is a separate field of study to specialize in. I would argue that it is not. You do not choose between majoring in Big Data and majoring in, say, Data Mining; that would be comparing apples to some type of soil. The way I perceive Big Data is as a ‘way of thinking’ about topics dealing with data that lives under certain circumstances. Those circumstances are what people commonly use to define Big Data.

Big Data covers the situations faced when traditional techniques and technologies fail to keep a data-based system functional. Those situations are: (1) data grows beyond the processing ability of the system, (2) data streams in at a high pace, overwhelming the ingestion capability of the system, (3) data takes so many forms and types that the system becomes too complex to build, even for simple tasks. Respectively, those are called Volume, Velocity and Variety, or the 3 V’s, following Gartner’s “Deja VVVu” proposal.

That said, Big Data cuts horizontally across many fields, including but not limited to: databases, information retrieval, machine learning, text mining, image processing, natural language processing, and Industry 4.0, wherever there is data to be processed.

2. Big Data is Data Management in the back.

By now, it should be clear that Big Data has Data Management as its backbone. Hence, a Master’s student should already have a background in Data Management. Otherwise, taking a Data Management class is mandatory. What we are particularly interested in is how data is conceptually modeled, physically stored, and accessed across formats and languages. More precisely, I recommend the following sub-topics:
– Relational algebra and databases, ACID properties.
– SQL query language with a particular focus on join and aggregation types.
– NoSQL, CAP theorem, BASE properties.
– Sliding window real-time processing.
– Batch vs. stream vs. interactive processing.

I believe that most Computer Science Master’s curricula have at least one module on Data Management or Databases, so complement as necessary.
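To make the sliding-window item in the list above a little more concrete, here is a minimal sketch in plain Java. It keeps the last few values of an incoming stream and reports their sum every time a new value arrives. The class name `SlidingWindowSum` and its methods are hypothetical, chosen only for illustration. A real stream processor would typically window by time rather than by count and would distribute the work, so treat this as the bare idea only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper: keeps the last `size` values and their running sum.
class SlidingWindowSum {
    private final int size;
    private final Deque<Long> window = new ArrayDeque<>();
    private long sum = 0;

    SlidingWindowSum(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        this.size = size;
    }

    // Adds one incoming value and returns the sum over the current window.
    long add(long value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest value
        }
        return sum;
    }

    // Convenience: feeds a finite list as if it were a stream
    // and collects the window sum after each arrival.
    static List<Long> sums(List<Long> stream, int size) {
        SlidingWindowSum w = new SlidingWindowSum(size);
        List<Long> out = new ArrayList<>();
        for (long v : stream) out.add(w.add(v));
        return out;
    }
}
```

For the values 1, 2, 3, 4, 5 and a window of size 3, `SlidingWindowSum.sums` returns 1, 3, 6, 9, 12. Each result is computed from the previous one in constant time, without rescanning the window.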

3. Think big, distributed.

I’ve met a lot of people starting out in the Big Data domain who miss one of the main points behind the emergence of Big Data, and one of its substantial properties. Put simply, because of the wide availability of low-cost commodity machines, huge volumes of data and complex computations were moved to clusters of several machines, in order to save on both cost and time at once. The way I personally grasped those concepts was principally through learning MapReduce, and practicing with Apache Hadoop MapReduce (parallel computation) and Hadoop HDFS (distributed storage). I highly recommend that anyone follow the same path; I guarantee you will get a sense of what distributed computing is, in a very easy and pleasant way. My only advice is to use Apache Spark, which did not exist back then, in place of Hadoop MapReduce.
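For readers who have never seen MapReduce, the classic first exercise is word count. The sketch below imitates the three phases (map, shuffle, reduce) inside a single Java process, with no Hadoop or Spark involved. The class name `WordCountSketch` and its method names are hypothetical. The point is only that each `map` call looks at one line and each `reduce` call looks at one key, which is what lets a framework spread those calls over many machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the three MapReduce phases (illustrative only).
class WordCountSketch {

    // Map phase: one input line becomes (word, 1) pairs, independently of other lines.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: the values of one key are summed, independently of other keys.
    static int reduce(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    // Driver: on a cluster, the map and reduce calls would run on different machines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) emitted.addAll(map(line));
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }
}
```

Calling `WordCountSketch.wordCount` on the two lines "big data is big" and "data is data" yields big=2, data=3, is=2.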

4. Adopt an “Optimizer” way of thinking.

By now, the student should have their backpack filled to begin their Big Data journey. In addition to the technical and scientific advice I have given so far, I conclude the list with an essential abstract one. If I were to hire for a Big Data role, I would look for the spirit of an Optimizer before that of an autonomous, collaborative, or ambitious person.

We are living in an era whose currency is units of time. Delivering code that just “works” becomes a thing of the past once you step into this world. What starts to matter here is how fast your application finishes the task, and how much output it can produce in a given time interval. An optimizer is then a person who looks carefully for that unnecessary or reducible if-else block that makes the application run a couple of seconds longer.


Whether a Master’s student’s goal is to pursue an academic or a professional career, the four principles above should provide solid ground for starting an exciting adventure in the wilds of Big Data. As a bonus, I give some extra orientation based on that goal.

If your dreams drive you to industry, then I recommend that you pick a few Big Data tools and master them. A lot of learning options are at your disposal: countless tutorials are now available online, and an increasing number of high-quality MOOCs are offered by renowned universities and institutions. Specialization is key: the landscape of Big Data technologies is booming, and mastering even a small fraction of it is challenging. (Hint: see what the market is demanding and what looks likely to be a hit. Job offers on platforms like LinkedIn and Xing, Google Trends, and GitHub statistics for open-source tools will help.)

On the other hand, for those aspiring to go scientific and research-y, there are a lot of exciting freshly (re)opened topics awaiting investigation. For instance: accessing multiple data sources (local/remote databases, networks of sensors/devices, aka the Internet of Things, etc.) in a uniform way, querying and analyzing data in motion (real-time), genomic data analysis, etc.

Apache Spark: Split a pair RDD into multiple RDDs by key

This drove me crazy until I found the solution.

If you want to split a pair RDD of type (A, Iterable(B)) by key, so that the result is several RDDs of type B, here is how to do it:

The trick is twofold: (1) get the list of all the keys, (2) iterate through the list of keys and, for each key, create a new RDD by filtering the original pair RDD to get only the values attached to that key. Note the use of a Java 8 lambda expression in line 14.
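The two steps can be sketched without a Spark cluster at hand. In the stand-in below, an ordinary Java list of (key, values) entries plays the role of the pair RDD, and the result is one list of B per key instead of one RDD per key. This is not the original listing: the class name `SplitByKeySketch` is hypothetical, and the mapping to Spark is my assumption. With Spark's Java API, step 1 would presumably be `keys().distinct().collect()` on the pair RDD, and step 2 a `filter(...)` with the same kind of lambda.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Local stand-in: a List of (key, values) entries plays the role of the pair RDD.
class SplitByKeySketch {

    static <A, B> Map<A, List<B>> splitByKey(List<Map.Entry<A, List<B>>> pairs) {
        // Step 1: get the list of all distinct keys
        // (with Spark, this is where the keys would be collected to the driver).
        List<A> keys = pairs.stream()
                .map(Map.Entry::getKey)
                .distinct()
                .collect(Collectors.toList());

        // Step 2: for each key, filter the original collection down to the
        // values attached to that key and flatten them into one list of B.
        Map<A, List<B>> result = new LinkedHashMap<>();
        for (A key : keys) {
            List<B> values = pairs.stream()
                    .filter(p -> p.getKey().equals(key)) // Java 8 lambda
                    .flatMap(p -> p.getValue().stream())
                    .collect(Collectors.toList());
            result.put(key, values);
        }
        return result;
    }
}
```

Note that step 2 scans the whole input once per key, exactly as the filter-per-key approach does on an RDD, so the cost grows with the number of distinct keys.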

Remark: bear in mind that the collect() action brings the keys of the entire distributed RDD into the driver machine. Generally, keys are integers or small strings, so collecting them on one machine wouldn’t be problematic. Otherwise, this wouldn’t be the best way to go. If you have a suggestion, you are more than welcome to leave a comment below… spread the word, help the world!