Following our Big Data Europe (BDE) project activities, we organized a webinar about Apache Flink and its usage inside the project. We were pleased to host one of the people most relevant to the topic: Kostas Tzoumas, CEO of data Artisans (the company driving the development of Flink). Among the questions we received at the end, one got me thinking and kept me busy for several days. It was a broad question but a very valid one: “What are the must-do things for Master’s students who are starting their career in big data technologies?”
Before diving in, I’d like to put things in context. At the time of writing these lines, I have been a PhD student/researcher for more than two years, with a close relation to industry through a number of projects and partnerships that I take part in or follow closely. As such, looking under the hood of things is the core of my job. I have been working on Big Data topics for five years now, on an almost daily basis, alternating between coding and reading scientific and technical material about Big Data and Data Management in general.
Note: although the advice given below is primarily addressed to Master’s students, it can still benefit anyone approaching the topic of Big Data for the first time.
With that out of the way, let’s get into our topic.
1. Big Data is not a separate field of study.
In my opinion, the biggest confusion among newcomers or curious folks is thinking that Big Data is a separate field of study to specialize in. I would argue that it is ‘not’. It is not a field you can major in as an alternative to, say, Data Mining. That would be comparing apples to some type of soil. The way I perceive Big Data is that it’s a ‘way of thinking’ about topics dealing with data that lives under certain circumstances. Those circumstances are what people commonly use to define Big Data.
Big Data covers the situations faced when traditional techniques and technologies fail to keep a data-based system functional. Those situations are: (1) data grows beyond the processing ability of the system, (2) data streams in at a high pace, overwhelming the ingestion capability of the system, and (3) data morphs into many forms and types, making the system increasingly complex to build, even for simple tasks. Respectively, those are called Volume, Velocity and Variety, or the 3 V’s, following Gartner’s “Deja VVVu” proposal.
That said, Big Data cuts horizontally across lots of fields, including but not limited to: databases, information retrieval, machine learning, text mining, image processing, natural language processing, and Industry 4.0. It applies wherever there is data to be processed.