Getting started in Big Data for Master’s students


Following our Big Data Europe (BDE) project activities, we organized a webinar about Apache Flink and its usage inside the project. We were pleased to host one of the most relevant people on the topic: Kostas Tzoumas, CEO of data Artisans (the company driving the development of Flink). Among the questions at the end, we received one that triggered my thinking and kept me meditating for several days. It was a broad but very valid question: “What are the must-do things for Master’s students who start their career in Big Data technologies?”

Before diving in, I’d like to put things in context. At the time of writing these lines, I am a PhD student/researcher of a bit more than two years, with a close relation to industry via a number of projects and partnerships that I am part of or have a close ear to. As such, looking under the hood of things is the principle of my job. I have been working on Big Data topics for five years now, on an almost daily basis, alternating between coding and reading scientific and technical material about Big Data and Data Management in general.

Note: although the advice given below is primarily addressed to Master’s students, it can still benefit anyone approaching the topic of Big Data for the first time.

Having that out of the way, let’s get into our topic.

1. Big Data is not a separate field of study.

In my opinion, the biggest confusion among newcomers or curious folks is thinking that Big Data is a separate field of study to specialize in. I arguably say that it is ‘not’. It is not a field you can major in instead of, say, Data Mining; that would be comparing apples to some type of soil. The way I perceive Big Data is that it is a ‘way of thinking’ about topics dealing with data that lives under certain circumstances. The definition of those circumstances is what people commonly use to define Big Data.

Big Data covers the situations faced when traditional techniques and technologies fail to keep a data-based system functional. Those situations are: (1) data grows beyond the processing ability of the system, (2) data streams in at a high pace, overwhelming the ingestion capability of the system, (3) data metamorphoses into many forms and types, making the system increasingly complex to build, even for simple tasks. Respectively, those are called Volume, Velocity and Variety, or the 3 V’s, following Gartner’s “Deja VVVu” proposal.

That said, Big Data can be horizontal to lots of fields, including but not limited to: databases, information retrieval, machine learning, text mining, image processing, natural language processing, Industry 4.0, etc., wherever there is data to be processed.

2. Big Data is Data Management in the back.

Now, it must become clear that Big Data pretty much has Data Management as a backbone. Hence, a Master’s student has to have a background in Data Management pre-packed already; otherwise, taking a Data Management class is mandatory. What we are particularly interested in is how data is conceptually modeled, physically stored, and accessed across formats and languages. What I more precisely recommend are the following sub-topics:
– Relational algebra and databases, ACID properties.
– SQL query language with a particular focus on join and aggregation types.
– NoSQL, the CAP theorem, BASE properties.
– Sliding window real-time processing.
– Batch vs. stream vs. interactive processing.
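
To make the sliding-window bullet concrete, here is a minimal sketch in plain Python; no streaming framework is involved, and all names are mine, not those of any particular engine:

```python
from collections import deque

def sliding_window_averages(stream, window_size):
    """Yield the average of the last `window_size` readings each time
    a new reading arrives (a simple count-based sliding window)."""
    window = deque(maxlen=window_size)  # oldest readings fall off automatically
    for reading in stream:
        window.append(reading)
        if len(window) == window_size:
            yield sum(window) / window_size

# Example: a stream of sensor readings, window of 3
readings = [10, 20, 30, 40, 50]
print(list(sliding_window_averages(readings, 3)))  # [20.0, 30.0, 40.0]
```

Real-time engines apply the same idea, typically with time-based rather than count-based windows, over unbounded streams.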

I believe that most Computer Science Master’s curricula have at least one module on Data Management or Databases, so complement as necessary.

3. Think big, distributed.

I’ve met a lot of people starting in the Big Data domain who miss one of the main points behind the emergence of Big Data and one of its substantial properties. Put simply, because of the wide availability of low-cost commodity machinery, both huge volumes of data and complex computations were moved to clusters of several machines, in order to save on both cost and time at once. The way I personally grasped those concepts was principally through learning MapReduce, and practicing with Apache Hadoop MapReduce (parallel computation) and Hadoop HDFS (distributed storage). I highly recommend anyone to follow the same path, and I guarantee you will get a sense of what distributed computing is in a very easy and nice way. I only advise substituting Hadoop MapReduce with Apache Spark, which didn’t exist back then.
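
To give a taste of the model without a cluster, here is the classic word count expressed in MapReduce style. This is plain Python mimicking the map and reduce phases, not the Hadoop API; in a real cluster the map outputs would be shuffled across machines before reduction:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data is data"]
print(reduce_phase(map_phase(docs)))  # {'big': 2, 'data': 3, 'is': 2}
```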

4. Adopt an “Optimizer” way of thinking.

By now, the student should have the backpack filled to begin their Big Data journey. In addition to the technical and scientific advice I gave so far, I conclude the list with an essential abstract one. If I were hiring for a Big Data hat, I would require that the candidate have the spirit of an Optimizer, before that of an autonomous, collaborative, or ambitious person.

We are living in an era whose coins are time units. Providing code that just “works” becomes part of history once you step into this world. What starts to matter here is how fast your application finishes the task, and how much output the application is able to produce in a given time interval. An Optimizer is then a person who looks carefully to find that unnecessary or reducible if-else block costing the application a couple of extra seconds.


Whether a Master’s student’s target is to pursue an academic or a professional career, the four principles I provided above should give solid ground to start an exciting adventure in the wild of Big Data. As a bonus, I give some extra orientation based on that target.

If your dreams drive you to industry, then I recommend that you pick a few Big Data tools and master them. A lot of learning options are at your disposal: countless tutorials are available online, and increasingly many high-quality MOOCs are offered by renowned universities and institutions. Specialization is key: the landscape of Big Data technologies is booming, and mastering even a small fraction is challenging (hint: see what the market is demanding and what looks like it is going to be a hit; job offers on platforms like LinkedIn and Xing, Google Trends, and GitHub statistics for open-source tools will help).

On the other hand, for those aspiring to go scientific and research-y, there are a lot of exciting, freshly-(re)opened topics awaiting investigation. For instance: accessing multiple data sources (local/distant databases, networks of sensors/devices, aka the Internet of Things, etc.) in a uniform way, querying and analyzing data in motion (real-time), genomic data analysis, etc.

Set up a MongoDB cluster (aka Replica Set) on Ubuntu

After a day and a half of bouncing between various resources, I finally managed to make a MongoDB Replica Set work. As is my habit, I logged my steps in a notepad for easy backtracking and effort saving in the future (it happens to everyone, but I get so annoyed searching twice for a solution to the same problem). Those steps include solutions to some recurrent problems that appear along the way (and that is what is special about this tutorial). I’m going to share these steps with some commentary.
Note that these are the minimum required “how-to” steps, which drove me to the final success. That said, if you want more in-depth explanations about the “what-is” part of the equation, like security considerations and performance optimizations, then I recommend you look further; I refer you to the documentation [1] and another interesting tutorial (which didn’t work for me for the “how-to” part) [2].

The steps are tested on a cluster of three nodes, each having Ubuntu 16.04 and MongoDB 3.6.1 installed.

  • Download MongoDB as you would normally do on a single machine, on each of the nodes. We recommend the tutorial on the official website: Install MongoDB.

At the end of the tutorial, you will be asked to start the MongoDB server using:
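
That is (the same command reused later in this post):

```shell
sudo service mongod start
```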

If the service is not recognized, returning some error like: Failed to start mongodb.service: Unknown unit: mongodb.service, then you need to enable it explicitly using:
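
With systemd on Ubuntu 16.04, that would be something like the following (the unit may be named mongod or mongodb depending on your installation):

```shell
sudo systemctl enable mongod.service
sudo systemctl start mongod.service
```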

Check that the server started correctly using:
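
Namely:

```shell
sudo service mongod status
```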

  • We now need to add two pieces of information to the MongoDB config file: (1) the IP address of the node, and (2) the replica set name. Stop mongod first:
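
That is:

```shell
sudo service mongod stop
```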

Open the config file for editing, e.g. using vi: sudo vi /etc/mongod.conf. Go down to net: and set the bind IP address as a value for bindIp: (to enable other nodes to talk to this node). As we are creating a cluster of nodes, I recommend omitting the default localhost address (to leave less room for errors):
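
The net: section then looks like the following (10.0.0.1 is a placeholder for this node’s own IP address; replace it with yours):

```yaml
net:
  port: 27017
  bindIp: 10.0.0.1
```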

Then go further down and uncomment the line #replication:. Provide the replica set name (preferably without quotation marks; I have seen people complain about using them):
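
With the replica set name used throughout this post, the section becomes:

```yaml
replication:
  replSetName: mongodb-rs
```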

Then… save.

  • After that, restart MongoDB as previously: sudo service mongod start.
  • Make sure mongod is running on all the nodes: sudo service mongod status.
  • Take one of the nodes, let’s say the one of IP, and run (try to avoid sudo’ing if you don’t have to):
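
Presumably the plain client shell:

```shell
mongo
```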

  • You need to initiate a replica set. First check with rs.status() that no replica set has previously been initiated (during your earlier attempts); you should get a message mentioning:
  • Then run:
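
In all likelihood, the replica set initiation, run inside the mongo shell (with no arguments, it picks up the replSetName from the config file and registers the current node):

```javascript
rs.initiate()
```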

    you should see: { "ok" : 1, ... }
  • This adds the first node as a replica set member. Check rs.status() to verify that worked.
  • Next, add the other nodes as members to the replica set “mongodb-rs”:
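
With placeholder addresses 10.0.0.2 and 10.0.0.3 standing in for the remaining nodes, the commands would look like:

```javascript
rs.add("10.0.0.2:27017")
rs.add("10.0.0.3:27017")
```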

    If that is successful, similarly, { "ok" : 1, ... } is shown.
  • Once all are added, check the status with rs.status(); you should obtain a list of all the members, the first tagged primary: "stateStr" : "PRIMARY", and the others SECONDARY.
  • Let’s give it some data and see if the replication is working. On the PRIMARY member, create a database: use db_test, and a collection: db.createCollection("collection_test"), then add some data: db.collection_test.insert({"name":"test"}).
  • Go to a SECONDARY member and run a read query: use db_test and db.collection_test.find(). You will supposedly be denied, obtaining: not master and slaveOk=false. To fix that, run:
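
Its name appears in the error message itself:

```javascript
rs.slaveOk()
```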

    and try again. You need to issue the latter in every new shell session from which you run read queries.
  • If that doesn’t work, make sure you are in the same database as the one used on the PRIMARY: scroll up in the console log and find connecting to: P_address:27017/test; you can see here it is the default ‘test’ database. What you need to do is exit the mongo shell (Ctrl+C, or another way, see [1]) and start it with the database explicitly specified, like here:
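
With db_test as the database from the earlier step and a placeholder address for the member, that would be:

```shell
mongo 10.0.0.1:27017/db_test
```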

    Try again to find the data: db.collection_test.find(). It should show up now, and you smile 🙂 because data is now correctly in sync across all the members of the replica set mongodb-rs.

I’ll admit it: MongoDB is an awesome NoSQL database, but its preparation for Big Data usage is somewhat less awesome. The process is a bit stressful and error-prone.

Final note: to develop a real Big Data application using MongoDB, the replication should be complemented with sharding, which is a topic for another post. I will try to share my experience if my time allows.

The GOOD Data framework: to share data with care

They say “sharing is caring”. Indeed it is, but only if done the proper way. It is about sharing data that people can benefit from with the least hassle possible. We deem data to be so if it is well-described, clean, and constantly available. GOOD, therefore, proposes four principles for a successful data-sharing experience on the Web.

G for Guided

Like the User Guide of a software product, data also has to be accompanied by a couple of pieces of information (metadata) that help users get a sense of what it is and what it contains: its description, its format and structure (e.g., a meaningful header in a CSV file), who provided it, when it was created, provided and/or modified, its version if it is the evolution of a previous one, etc.
> Data that people cannot grasp or trust is NOT GOOD data.

O for Open 

Publishing open data is a pretty old topic, around which plenty of standards and best practices have been established in the past. Data is much needed to fight diseases, preserve health and well-being, empower education and equality, improve mobility and circulation, fuel research and science in whatever form it is, etc.
> Blocked or locked data can be good to its owners or privileged users, but it is NOT GOOD to the rest of the world.

O for Optimized

We refer here to data that is at its highest POSSIBLE level of readiness for use. It has to be clean, clear, uniform, and simple to use.
> Data that forces the user to go through stressful transformation pipelines just to make it ready for first contact is NOT GOOD data.

D for Durable

More often than not, data is made available online without a long-term monitoring plan. The server hosting the data can go down, references/aliases can break, the database server storing metadata can go offline, etc.; these are all phenomena most of us have faced at some point. Data shared on the Web has to be supported by long-term monitoring, like mirroring downloads across several CDNs, or setting up health-check notifications on any hosting server involved. We do not mean here data that goes ‘permanently’ unavailable (that is non-data); we rather refer to data that goes offline temporarily.
> Like a friend, data that is not there when you need it is NOT GOOD data.

It is worth adding that:

  • GOOD preaches sharing data at the highest level of openness possible, so usage licences are considered out of scope.
  • GOOD looks more at the technical side of data sharing than at the legal one; that is another reason aspects like licencing and privacy are not considered (not underestimating their importance in any way).
  • GOOD suggests four high-level, simple and memorable requirements for beneficial data sharing on the Web. Indeed, each requirement can have many sub-requirements underneath. We leave diving into the details to interested readers, and refer to some pointers that can be useful: link 1, link 2.

Apache Thrift for C++ on Visual Studio 2015

Today we are going to see how to build Apache Thrift for C++ on Visual Studio 2015. Then for demonstration, we’ll also build and run the C++ tutorial.

Disclaimer: this tutorial builds hugely on the one given by Adil Bukhari: Configuring Apache Thrift for Visual Studio 2012. The reason I created a new one is that I followed his steps but stumbled upon a few problems that prevented me from continuing. Therefore, I find it quite helpful (for future learners) to complement that tutorial with the solutions to these problems.

Testing environment

  • Windows 10 64bit.
  • Microsoft Visual Studio 2015 (also tested with Visual Studio 2013).
  • Apache Thrift 0.9.2.
  • Boost 1.59.0.
  • Libevent 2.0.22.
  • OpenSSL 1.0.2d.
  • Summer time 🙂


  1. Download Apache Thrift and Thrift compiler for Windows from the download page here.
  2. Download and build Boost libraries (also follow Adil’s tutorial here: Configuring C++ Boost Libraries for Visual Studio).
  3. Download libevent library from the official webpage.
  4. Download OpenSSL for Windows (when you are on the OpenSSL binaries page, follow the link they suggest under OpenSSL for Windows — Works with MSVC++), and then install it.

Building Apache Thrift libraries


Apache Spark: Split a pair RDD into multiple RDDs by key

This drove me crazy but I finally found a solution.

If you want to split a pair RDD of type (A, Iterable(B)) by key, so that the result is several RDDs of type B, then here is how you go:

The trick is twofold: (1) get the list of all the keys, (2) iterate through the list of keys and, for each key, create a new RDD by filtering the original pair RDD to get only the values attached to that key. Note the use of Java 8’s Lambda expression in line 17.
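
The twofold trick can be sketched with plain Python lists standing in for RDDs (the names below are mine, not the Spark API; in real Spark, step (1) is a collect() on the keys and step (2) a filter() per key):

```python
def split_by_key(pair_rdd):
    """Split a list of (key, iterable_of_values) pairs into one
    values-list per key, mimicking the collect-then-filter trick."""
    # (1) "collect" the list of all the keys to the driver
    keys = [k for k, _ in pair_rdd]
    # (2) for each key, "filter" the original pairs down to its values
    return {k: [v for kk, vs in pair_rdd if kk == k for v in vs] for k in keys}

pairs = [("a", [1, 2]), ("b", [3])]
print(split_by_key(pairs))  # {'a': [1, 2], 'b': [3]}
```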

Remark: bear in mind that the action collect() brings the keys of the entire distributed RDD to the driver machine. Generally, keys are integers or small strings, so collecting them on one machine shouldn’t be problematic. Otherwise, this wouldn’t be the best way to go. If you have a suggestion, you are more than welcome to leave a comment below… spread the word, help the world!

Drupal: merge two pages inside a view

These quick steps show how to merge two pages inside a view, one on the top and one on the bottom:

  • Create a view.
  • Create two pages inside the view.
  • Within each page there is an option to put something in the header and something in the footer.
  • Select the page that you want to be on top and click “add” in Footer.
  • Select “Global: View area”.
  • Select “This page (override)” near “For”, and apply (to apply the changes only to the current page).
  • Select the page you created in “View to insert”, and apply.

Extra: the second page will be put as-is under the first page, but its title won’t appear. If you want to put text on top of the second page, do exactly as before for the second page, but instead of “Global: View area” select “Global: Text area”.

This can be a workaround for those searching for how to merge views.

Drupal: show only upcoming events (Calendar module)

These steps allow you to hide past events and show only the upcoming ones (if you don’t have the Calendar module yet, here’s an excellent video that gets you started):

  • Go to the Events view.
  • In FILTER CRITERIA, click on “Add”.
  • Search for the event date, e.g. “Content: Event date – start date (field_date)”, and apply.
  • In “Configure extra settings for filter criterion…”, click on “Cancel”.
  • Select the event’s end date (something like: Event date – end date), and apply.
  • In the “Configure filter criterion…” window:
    • Under Operator, select “Is equal to”.
    • Change the “Select a date” option to “Enter a relative date”.
    • Under Relative date, write “now” (also read the note under the textbox for more options), and apply.

My selection of best Chrome extensions


Web browsers are no longer used strictly to browse web pages. They are today competing to offer us the best tools to extract knowledge from web pages, to make browsing an amusing experience, and to enable us to personalize our own browsing environment. Still, I bet a lot of us are not aware of what these browsers are capable of doing.

I’ve been a Computer Science student for almost seven years now. I spend almost all my time in front of the computer. In addition to getting neck pain, I got the chance to try out several web browsers and a ton of their extensions. In this post, I’ll take you through my best Google Chrome extensions.

For each purpose, I have been installing all the top rated/downloaded extensions, then deleting all of them except the best one, according to my needs and flavor. I gathered this experience on one plate and share it happily with you. Enjoy!


  • The list below is presented without a preference order.
  • All the extensions are free, at least for the features presented.
  • It is not intended by any means to promote any brand over another.
  • It is not meant to recommend Chrome over other browsers.

I categorize the extensions into three families: (1) General-purpose extensions, (2) Research extensions, and (3) Development extensions.

I. General-purpose extensions


So you’ve decided to be a developer? Ok but…


Two must-have principles:

passion and patience.

…then here are some tips on what you should do next:

Think out of the box

Programming today is easier and more accessible than ever. You must know (and accept) that you are just one of millions of developers [1] on the globe. If you want to shine among them, then you need to think differently and act differently.

Need for time

Very clear: coding is a very time-consuming process. If you don’t have time (I’d say two hours per day … at least), then I very much doubt you can be an (efficient) developer. “Developer” is a job, and jobs need time.

PRACTICE, PRACTICE, PRACTICE

I wrote it in capital letters because it is the clue to your success in this universe. You will absolutely need to write many, many examples before starting anything big. One reason is that when you start a big project you will encounter lions and crocodiles (people are gentle enough to call them only “bugs”), and that is what has pushed people, throughout history, to abandon programming in the first days. Again, don’t dare to start doing big things if you are not ready for them … and being ready comes from training.

Join serious projects

After a whole series of practicing, you should be ready to go. Right after comes step two: join serious projects. Whatever you have done during the practicing phase, you are not a real developer until you join (or possibly start) a serious project. A project means targets to reach. If you don’t have a clear destination, you are surely going nowhere, or anywhere … When you start a project with chained, nested targets, you start to use your talents and the tools you are (supposed to be) competent to handle. You actually grow with your projects.

Be useful to your world

Once you’ve solved a sticky problem, please take a few minutes to share it in a blog post or as an answer to the same problem found somewhere online. You should do it for two great causes: (1) you save time and effort for future learners who face the same problem; and (2) you return the favor to those who suffered nightmares and headaches to provide you with ready-to-use solutions, by sharing your tips with others as part of your solution.

To be continued…