Sunday, June 1, 2014

Can Hadoop be compared to NOSQL database like Cassandra?

Though NOSQL is the closet technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NOSQL. Hadoop is not a database. It’s a filesystem (HDFS) and distributed programming framework (MapReduce).

Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?

Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, then it might result in data inconsistency. For example, you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice-versa. So, this makes it confusing which data to be stored and accessed.

Which are the two types of ‘writes’ in HDFS?

There are two types of writes in HDFS: posted and non-posted write. Posted Write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a Non-posted Write, we wait for the acknowledgement. It is similar to the today’s courier services. Naturally, non-posted write is more expensive than the posted write. It is much more expensive, though both writes are asynchronous.

Is a job split into maps?

No, a job is not split into maps. Spilt is created for the file. The file is placed on datanodes in blocks. For each split,  a map is needed.

Why are the number of splits equal to the number of maps?

The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.

Do we require two servers for the Namenode and the datanodes?

Yes, we need two different servers for the Namenode and the datanodes. This is because Namenode requires highly configurable system as it stores information about the location details of all the files stored in different datanodes and on the other hand, datanodes require low configuration system.

Is map like a pointer?

No, Map is not like a pointer.

What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.