Set up a MongoDB cluster (aka Replica Set) on Ubuntu

After a day and a half bouncing between various resources, I finally managed to make a MongoDB Replica Set work. As is my habit, I log my steps in a notepad for easy backtracking and to save effort in the future (it happens to everyone, but I get so annoyed searching twice for a solution to the same problem). Those notes include solutions to some recurring problems that appeared along the way (and this is what makes this tutorial special). I’m going to share these steps with some commentary.
Note that these are the minimum required “how-to” steps that drove me to the final success. That said, if you want more in-depth explanations about the “what-is” part of the equation, like security considerations and performance optimizations, then I recommend you look further; I refer to the official documentation [1] and another interesting tutorial (that didn’t work for me for the “how-to” part) [2].

The steps are tested on a cluster of three nodes, each having Ubuntu 16.04 and MongoDB 3.6.1 installed.

  • Download and install MongoDB on each of the nodes as you would normally do on a single machine. I recommend the tutorial on the official website: Install MongoDB.

At the end of the tutorial, you will be asked to start the MongoDB server using:
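    sudo service mongod start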

If the service is not recognized and you get an error like Failed to start mongodb.service: Unknown unit: mongodb.service, then you need to enable it explicitly using:
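Assuming the unit installed by the official packages is named mongod.service, something along these lines should do:

    # enable the mongod unit explicitly (unit name assumed: mongod.service), then start it
    sudo systemctl enable mongod.service
    sudo service mongod start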

Check that the server started correctly using:
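    sudo service mongod status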

  • We now need to add two pieces of information to the MongoDB config file: (1) the IP address of the node, and (2) the replica set name. Stop mongod first:
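    sudo service mongod stop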

Open the config file for editing, for example with vi: sudo vi /etc/mongod.conf. Go down to net: and set the node’s IP address as the value of bindIp: (to enable other nodes to talk to this node). As we are creating a cluster of nodes, I recommend omitting the default localhost address 127.0.0.1 (to leave less room for errors):
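For example, on the node with IP 172.180.10.160 the net: section would look like this (each node uses its own address):

    net:
      port: 27017
      bindIp: 172.180.10.160   # this node’s own IP address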

Then go further down and uncomment the line #replication:. Provide the replica set name (preferably without quotation marks; I have seen people complaining about problems when using them):
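    replication:
      replSetName: mongodb-rs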

Then… save.

  • After that, restart MongoDB as before: sudo service mongod start.
  • Make sure mongod is running on all the nodes: sudo service mongod status.
  • Take one of the nodes, let’s say the one with IP 172.180.10.160, and run (try to avoid sudo’ing if you don’t have to):
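That is, open the mongo shell on that node:

    mongo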

  • You need to initiate a replica set. Check first with rs.status(): if no replica set has previously been initiated (during your previous attempts), you should get a message mentioning:
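    "errmsg" : "no replset config has been received",
    "codeName" : "NotYetInitialized"

    (the exact wording may vary between MongoDB versions)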
  • Then run:
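    // with no arguments, this initiates the replica set with the current node as its first member
    rs.initiate()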

    you should see: { "ok" : 1, ... }
  • This adds the first node as a replica set member. Check rs.status() to verify that it worked.
  • Next, add the other nodes as members to the replica set “mongodb-rs”:
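    Still in the primary’s shell, add each of the other nodes by its address (the IPs below are placeholders; use your own nodes’ addresses):

    rs.add("172.180.10.161:27017")   // placeholder: second node
    rs.add("172.180.10.162:27017")   // placeholder: third node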

    If that is successful, similarly, { "ok" : 1, ... } is shown.
  • Once all are added, check the status with rs.status(); you should obtain a list of all the members, the first tagged as primary: "stateStr" : "PRIMARY", and the others as SECONDARY.
  • Let’s give it some data and see if the replication is working. On the PRIMARY member, create a database: use db_test, and a collection: db.createCollection("collection_test"), then add some data: db.collection_test.insert({"name":"test"}).
  • Go to a SECONDARY member and run a read query: use db_test and db.collection_test.find(). You will likely be denied and obtain: not master and slaveOk=false. To fix that, run:
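    rs.slaveOk()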

    and try again. You need to issue the latter in every new shell session on a secondary before running read queries.
  • If that doesn’t work, make sure you are in the same database as the one used on the PRIMARY: go up in the console log and find connecting to: P_address:27017/test; you can see here that it is the default ‘test’ database. What you need to do is exit the mongo shell (Ctrl+C or another way, see [1]) and start it again with the database explicitly specified, like here:
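    For example (a placeholder address; use the node’s own IP and the database you created):

    # replace 172.180.10.161 with this node’s own address
    mongo 172.180.10.161:27017/db_test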

    Try to find the data again: db.collection_test.find(). It should show up now, and you smile 🙂 because data is now correctly in sync across all the members of the replica set mongodb-rs.

I’ll admit it, MongoDB is an awesome NoSQL database, but getting it ready for big data usage is somewhat less awesome. The process is a bit stressful and error-prone.

Final note: to develop a real big data application using MongoDB, replication should be complemented with sharding, which is a topic for another post. I will try to share my experience when time allows.

The GOOD Data framework: to share data with care

They say “sharing is caring”. Indeed it is, but only if done the proper way. It is about sharing data that people can benefit from with the least hassle possible. We deem data to be so if it is well described, clean, and constantly available. GOOD therefore proposes four principles for a successful data sharing experience on the Web.

G for Guided

Like the user guide of a software product, data also has to be accompanied by some information (metadata) that helps users get a sense of what it is and what it contains: its description, its format and structure (e.g., a meaningful header in a CSV file), who provided it, when it was created, provided and/or modified, its version if it evolved from a previous one, etc.
> Data that people cannot grasp or trust is NOT GOOD data.

O for Open 

Publishing open data is a pretty old topic around which plenty of standards and best practices have been established. Data, in whatever form it comes, is needed to fight diseases, preserve health and well-being, empower education and equality, improve mobility and circulation, fuel research and science, etc.
> Blocked or locked data can be good to its owners or privileged users, but it is NOT GOOD to the rest of the world.

O for Optimized

We refer here to data that is at the highest POSSIBLE level of readiness for use. It has to be clean, clear, uniform, and simple to use.
> Data that forces the user to go through stressful data transformation pipelines just to make it ready for first contact is NOT GOOD data.


D for Durable

More often than not, data is made available online without a long-term monitoring plan. The server hosting the data can go down, references/aliases can break, the database server storing metadata can go offline, etc.; these are all phenomena most of us have faced at some point. Data shared on the Web has to be supported by long-term monitoring, like mirroring downloads across several CDNs and setting up health-check notifications on any hosting server involved. We do not mean here data that goes ‘permanently’ unavailable (that is non-data); we rather refer to data that goes offline temporarily.
> Like a friend, data that is not there when you need it is NOT GOOD data.

It should be added that:

  • GOOD advocates sharing data at the highest level of openness possible, so usage licences are considered out of scope.
  • GOOD looks more into the technical side of data sharing than the legal one, which is another reason aspects like licensing and privacy are not considered (without underestimating their importance in any way).
  • GOOD suggests four high-level, simple, and memorable requirements for beneficial data sharing on the Web. Indeed, each requirement can have many sub-requirements underneath. We leave diving into the details to interested readers, and refer to some pointers that can be useful: link 1, link 2.

Apache Thrift for C++ on Visual Studio 2015

Today we are going to see how to build Apache Thrift for C++ on Visual Studio 2015. Then for demonstration, we’ll also build and run the C++ tutorial.

Disclaimer: this tutorial builds largely on the one by Adil Bukhari, Configuring Apache Thrift for Visual Studio 2012. The reason I created a new one is that I followed his steps but stumbled upon a few problems that prevented me from continuing. Therefore, I find it quite helpful, for future learners, to complement that tutorial with the solutions to those problems.

Testing environment

  • Windows 10 64bit.
  • Microsoft Visual Studio 2015 (also tested with Visual Studio 2013).
  • Apache Thrift 0.9.2.
  • Boost 1.59.0.
  • Libevent 2.0.22.
  • OpenSSL 1.0.2d.
  • Summer time 🙂

Requirements

  1. Download Apache Thrift and Thrift compiler for Windows from the download page here.
  2. Download and build Boost libraries (also follow Adil’s tutorial here: Configuring C++ Boost Libraries for Visual Studio).
  3. Download libevent library from the official webpage.
  4. Download OpenSSL for Windows (when you are on the OpenSSL binaries page, follow the link they suggest under OpenSSL for Windows — Works with MSVC++), and then install it.

Building Apache Thrift libraries
