Thursday, June 11, 2015

Hadoop Interview Questions


Hi guys, I have attended more than 14 Hadoop interviews just to learn what questions interviewers ask. I have extracted all the questions that appeared important to me. Go through them. Happy Learning! (Will update all the answers shortly)

Please don't forget to comment & share your feedback!


HDFS:

1. Without touching the block size or input split, can we control the number of mappers?
Ans: Create a custom InputFormat and override 'isSplitable()' to return false, so that each file is processed by a single mapper.

2. What is the difference between Block size & input split?

Ans: A block is a physical division of the data, whereas an input split is a logical division.

3. To process one hundred files, each of size 100 MB, on an HDFS cluster whose default block size is 64 MB, how many mappers would be invoked?

Ans: Each file occupies 2 blocks (block1 - 64 MB & block2 - 36 MB), and hence the 100 files would invoke 200 mappers (one per block, with the default one-split-per-block behaviour).
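The arithmetic above can be checked with a short sketch (plain Python, purely to illustrate the block math; 64 MB and 100 MB are the values from the question):

```python
import math

def num_blocks(file_size_mb, block_size_mb):
    # Each file is stored as ceil(file_size / block_size) blocks;
    # the last block holds only the remainder (here 100 - 64 = 36 MB).
    return math.ceil(file_size_mb / block_size_mb)

def num_mappers(num_files, file_size_mb, block_size_mb):
    # With the default input format, one mapper runs per input split,
    # and by default one split corresponds to one block.
    return num_files * num_blocks(file_size_mb, block_size_mb)

print(num_blocks(100, 64))        # 2 blocks: 64 MB + 36 MB
print(num_mappers(100, 100, 64))  # 200 mappers
```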

4. What is data locality optimization?

Ans: In Hadoop, execution is moved near the data rather than moving the data to the execution. A map task can be scheduled in 3 possible ways, of which the first is always preferred by the scheduler (Jobtracker):
Same-node execution: the task runs on the Tasktracker of the very Datanode where the block of data is stored.
Off-node execution: if no Tasktracker slot is free on the Datanode where the block is located, the task runs on the nearest node in the same rack and reads the block over the network.
Off-rack execution: if no slot is free on any node in the entire rack where the block is present, the task runs on a node in a different rack and the block is read across racks.

5. What is Speculative execution?

Ans: If one of the tasks of a MapReduce job is slow, it pulls down the overall completion time of the job. Hence, the Jobtracker continuously monitors the progress of each task (via heartbeat signals). If a task progresses noticeably slower than its peers, the Jobtracker speculatively launches a duplicate attempt of that task on a different node, using another replica of the same block. This concept is called speculative execution.

An important thing to note here is that the slow-running task is not killed. Both attempts run simultaneously; only when one of them completes is the other killed.

6. What are the different types of File permissions in HDFS?

Ans: 
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas

Position 1: ‘d’ means folder, ‘-’ means file
Positions 2-4: Owners permissions on file/folder
Positions 5-7: Group permissions on file or folder
Positions 8-10: Global permissions on file or folder
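The positional breakdown above can be illustrated with a small parser (plain Python, not part of HDFS; the field names in the returned dict are my own):

```python
def parse_permissions(mode):
    # mode is a 10-character string such as 'drwxrwxrwx'
    return {
        "type":  "folder" if mode[0] == "d" else "file",  # position 1
        "owner": mode[1:4],   # positions 2-4
        "group": mode[4:7],   # positions 5-7
        "other": mode[7:10],  # positions 8-10
    }

print(parse_permissions("drwxrwxrwx"))
print(parse_permissions("-rwxr-x---"))
```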

7. What is Rack-awareness?

Ans: In HDFS, not all replicas of a single block are copied in the same Rack. This concept is called Rack-awareness. In the event of an entire Rack going down, if all the replicas are in that rack, there would be no way of recovering that block of data.

8. What are the different modes of HDFS that one can run? Where do we configure these modes?

Ans: Hadoop can be configured to run in one of the following modes:
   a. Standalone or local mode (the default)
   b. Pseudo-distributed mode
   c. Fully distributed mode
These modes are configured via core-site.xml, mapred-site.xml and hdfs-site.xml.

9. What are the available data-types in Hadoop?

Ans: To support serialization/deserialization and to allow keys to be compared with one another, Hadoop has built its own datatypes.

Following is the list of types that implement WritableComparable -
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable.
Others: NullWritable, Text, BytesWritable, MD5Hash

10. Explain the command '-getmerge'

Ans: hadoop fs -getmerge <source directory> <destination file>
     This option takes all the files in the source HDFS directory and merges them into a single file on the local filesystem.

11. Explain the anatomy of a file read in HDFS
Ans: 
1. The client opens the file (calls open() on DistributedFileSystem).
2. DFS calls the namenode to get the block locations.
3. DFS creates an FSDataInputStream, and the client invokes read() on this object.
4. Via the wrapped DFSInputStream, the read operation is performed on the datanodes where the file's blocks are present. Blocks are read in order. Once all the blocks have been read, the client calls close() on the FSDataInputStream.

12. Explain the anatomy of a file write in HDFS
Ans: 
1. The client creates a file (calls create() on DFS).
2. DFS calls the namenode (NN) to create the file. The NN checks the client's access permissions and whether the file already exists; if it does, an IOException is thrown.
3. The DFS returns an FSDataOutputStream to write data into. It wraps a DFSOutputStream, which handles communication with the NN & datanodes (DNs).
4. The DFSOutputStream writes data in the form of packets (small units of data), and these packets are written to the DNs to form blocks of data. A pipeline is formed consisting of the list of DNs that a single block has to be replicated to.
5. When a block of data has been written to all DNs in the pipeline, acknowledgements come back from the DNs in reverse order.
6. When the client has finished writing the data, it calls close() on the stream.

MapReduce:


1. What is Distributed Cache?
Ans: Distributed Cache is a mechanism by which 'side data' (extra read-only data needed by an MR program) is copied to every worker node before the job's tasks run.

2. What is 'Sequence File' format? Where do we use it?
Ans: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format; it is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
There are 3 different SequenceFile formats:
     a. Uncompressed key/value records
     b. Record compressed key/value records - only 'values' are compressed here.
     c. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable

3. What are the different File Input Formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. The sub-classes of FileInputFormat are: CombineFileInputFormat, TextInputFormat (the default), KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat.

SequenceFileInputFormat in turn has a few subclasses - SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat and SequenceFileInputFilter.

4. What is the ‘Shuffle & sort’ phase in MapReduce?
Ans: This phase occurs between the map & reduce phases. During this phase, the key/value pairs emitted by the various mappers are partitioned, sorted by key, and copied to the appropriate reducers.

5. How many instances of a 'jobtracker' run in a cluster?
Ans: Only one instance of Jobtracker would run in a cluster

6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.

7. How do you make sure that only one mapper processes your entire file?
Ans: Create a custom InputFormat and override 'isSplitable()' to return false. (Or) a cruder way: set the block size to be greater than the size of the input file.

8. When does the reduce phase start in an MR program, and why does the reducer's progress show a non-zero percentage even before the map phase has ended?
Ans: The reduce() calls themselves start only after all mappers have finished their execution. But the reducer's reported progress reaches a non-zero value before map progress hits 100%, because the reduce phase is actually a combination of copy, sort & reduce: map outputs start being copied to the reducers (and sorted) as soon as individual map tasks complete.

9.  Explain various phases of a MapReduce program.
Ans: 
Map phase: A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
Shuffle & sort phase: Determines which reducer should receive each map output key/value pair (called partitioning). The keys destined for each reducer are sorted.
Reduce phase: The reducer receives a key and the corresponding list of values (emitted across all the mappers). Aggregation of these values is done in the reduce phase.
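The three phases can be sketched with a tiny word-count in plain Python (no Hadoop involved; this only mirrors the map → shuffle/sort → reduce flow described above):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit (word, 1) for each word in the input chunk
    return [(word, 1) for word in line.split()]

def shuffle_sort(mapped):
    # Shuffle & sort: group values by key, with keys in sorted order
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reducer: aggregate the list of values for each key
    return (key, sum(values))

lines = ["to be or not to be"]
mapped = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle_sort(mapped))
print(result)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```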

10. What is a 'Task instance' ?
Ans: A task instance is the child JVM process that a Tasktracker launches to run a map or reduce task. This ensures that a crash in the task does not take down the Tasktracker itself.

11.  If we use an Identity Mapper/Reducer is the output sorted?
Ans: 

12. What is the use of Context Object?
Ans: 

13. Can we use Cleanup(context)/setup(context) in a mapper/reducer? If yes, what’s the use?
Ans: 

14. Write a MR program to find the median of a dataset that contains 10 million rows
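One hedged sketch of the idea in plain Python (standing in for the MR job, not actual Hadoop code): a common approach is to have mappers emit (value, 1), let the shuffle deliver sorted keys with grouped counts, and have a single reducer walk the counts until it passes the midpoint.

```python
from collections import Counter

def map_values(records):
    # Mapper side: emit (value, 1) for every row; Counter stands in
    # for the framework grouping the 1s per distinct value.
    return Counter(records)

def reduce_median(counts, total):
    # Single reducer: walk keys in sorted order, accumulating counts,
    # and stop at the value covering the middle position.
    # (For an even row count this returns the lower median.)
    midpoint = (total + 1) // 2
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= midpoint:
            return value

data = [5, 1, 9, 3, 7, 3, 5, 3, 1, 9, 5]  # 11 rows, median is 5
counts = map_values(data)
print(reduce_median(counts, len(data)))  # 5
```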

15. Write a MR program to find the mean of a dataset that contains 10 million rows
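A mean is friendlier to MapReduce than a median, because (sum, count) pairs combine associatively: each mapper (or combiner) can pre-aggregate its split, and a single reducer merges the partials. A plain-Python sketch of that shape (not actual Hadoop code):

```python
def map_partial(chunk):
    # Each mapper (or combiner) emits one partial (sum, count) pair
    return (sum(chunk), len(chunk))

def reduce_mean(partials):
    # The reducer merges all partial pairs, then divides once
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Three "input splits" standing in for mapper inputs
splits = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [map_partial(chunk) for chunk in splits]
print(reduce_mean(partials))  # 5.0 (mean of 1..9)
```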

Hive:


1. What is the difference between 'Sort by' & 'Order by' keywords?
Ans: 

2. What is the difference between 'Managed Table(internal table)' and  an 'External Table'?
Ans: 

3. When do we use Partitioning, Bucketing & Clustering in Hive?
Ans: 

HBase:


1. How does Hbase achieve random read/write?
Ans: HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row from. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key. 

For example: A table may contain 10 TB of data. But, the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, and version. If you know what the starting key is for each block, you can determine one file to access, and what the byte-offset (block) is to start reading to see where you are in the binary search.
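The region lookup described above can be mimicked with a binary search over sorted start keys using `bisect` (plain Python; the region names and keys here are invented for illustration):

```python
import bisect

# Hypothetical regions, each identified by its start key (kept sorted)
region_start_keys = ["a", "f", "m", "t"]
region_names = ["region1", "region2", "region3", "region4"]

def find_region(row_key):
    # Binary search: the row lives in the region whose start key is
    # the greatest one less than or equal to the row key.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[index]

print(find_region("banana"))  # region1 (start key 'a')
print(find_region("monkey"))  # region3 (start key 'm')
```

The same idea repeats one level down: within a region, the server binary-searches block start keys to find the byte offset to read from.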

2. What’s the difference between Scan & get in HBase?
Ans: 

3. How does HBase store data internally?
Ans: 

Monday, June 1, 2015

Hadoop Tough Interview Questions

MapReduce:


1. If you want to re-use your reducer class as a combiner class, how should the input & output key/value pairs of the reducer class be related?


2. How do you write a MapReduce program to find the lengthiest word in a given document?
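A plain-Python sketch of the usual shape (each mapper emits its local longest word and a single reducer keeps the overall winner; this stands in for the MR job, it is not Hadoop code):

```python
def map_longest(line):
    # Mapper: emit the longest word in this line (a local maximum)
    return max(line.split(), key=len)

def reduce_longest(candidates):
    # Single reducer: pick the overall longest among the local winners
    return max(candidates, key=len)

document = ["big data needs hadoop",
            "distributed computation wins"]
print(reduce_longest([map_longest(line) for line in document]))
# distributed
```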


3. I have a file with the following columns - User Id, login time & logout time. How do you write a MapReduce program to find the top 5 users in terms of time spent? (Assume each user has only one entry in the file)
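A sketch in plain Python (the CSV layout and the HH:MM time format are my assumptions; in the MR version, each mapper would emit (duration, user) and either rely on the shuffle's sorting or keep a local top-5 to cut shuffle traffic):

```python
import heapq

def parse_minutes(hhmm):
    # Convert an assumed 'HH:MM' timestamp to minutes since midnight
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def map_duration(record):
    # Mapper: one record per user -> (user_id, minutes spent)
    user, login, logout = record.split(",")
    return user, parse_minutes(logout) - parse_minutes(login)

def top_users(records, n=5):
    # Reducer side: keep the n users with the largest durations
    durations = [map_duration(r) for r in records]
    return heapq.nlargest(n, durations, key=lambda kv: kv[1])

log = ["u1,09:00,17:00", "u2,10:30,11:00", "u3,08:15,16:45"]
print(top_users(log, n=2))  # [('u3', 510), ('u1', 480)]
```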


4. Write a MapReduce program to create phrases such that each phrase contains 2 consecutive words of a line. eg: If the input is "I am a good boy", then the expected output is "{I am, am a, a good, good boy}".
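The mapper logic for this can be sketched directly (plain Python; in the real MR program each phrase would be emitted as a key):

```python
def bigram_phrases(line):
    # Pair each word with the next one to form two-word phrases
    words = line.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

print(bigram_phrases("I am a good boy"))
# ['I am', 'am a', 'a good', 'good boy']
```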


5. How can you customize the name of the output file emitted by a mapper/reducer?


6. What are the various methods inside the Mapper class(org.apache.hadoop.mapreduce.Mapper)? Explain each method



7. What will be the outcome if you comment the following line of code in your main() of the driver class
'job.setJarByClass()'?

8. In the driver class, the map output key is defined using 'job.setMapOutputKeyClass(Text.class)' while the signature of the Mapper class is 'Mapper<LongWritable, Text, LongWritable, Text>'. Will it cause any error? If yes, is it a compile-time or a run-time error?

Hive:


1. What are Hive Query Tuning techniques?

2. What is .hiverc file?


3. How to set number of reducers in a Hive query?


4. What is the difference between Cluster by & Clustered by?


5. What is the difference between Static Partition & Dynamic Partition?



(More to follow....)

What is Big Data & Hadoop?

"Big Data" is crucial to driving growth for businesses, but it has also been a challenge for programmers to analyse. The solution to this challenge is a framework called "Hadoop". The Hadoop framework has overcome the Big Data challenges with the help of a file system called the "Hadoop Distributed File System (HDFS)" & a programming model called "MapReduce", which I call the Brain & the Heart of Hadoop respectively.

The Hadoop ecosystem is evolving day by day. It supports both batch and real-time processing. Both programmers (who can code in Java, Python, Ruby etc.) & non-programmers can leverage the Hadoop framework for Big Data analytics.


(More to follow...)

SE to a Hadoop certified developer

Hi, welcome to the Hadoop world. In this blog, let me share my experiences in my journey from a simple Software Engineer to a Hadoop Developer (Cloudera Certified).

First of all, I want you to understand that "Big Data" is crucial to driving growth for businesses, but it has also been a challenge for programmers to analyse. The solution to this challenge is a framework called "Hadoop", which overcomes the Big Data challenges with the help of a file system called the "Hadoop Distributed File System (HDFS)" & a programming model called "MapReduce" - the Brain & Heart of Hadoop respectively, as I call them.

If you want to kickstart your career in the Big Data stream, you can see yourself fitting into one of these roles - Hadoop Developer, Hadoop Administrator, Data Scientist & Hadoop Architect.

To clear the 'Cloudera Certification for Apache Hadoop', you have to thoroughly read the book - "Hadoop: The Definitive Guide".

Some books which I prefer, based on the topic at hand -
1. Hadoop Text book: The Definitive Guide
2. MapReduce exclusively: Hadoop in Action
3. MapReduce higher level & practical problems: Hadoop in Practice
4. Hive: Programming Hive
5. Pig: Programming Pig
6. HBase: Programming HBase
7. Sqoop: Hadoop for Dummies 



(More to follow...)