Hi guys, I have attended more than 14 Hadoop interviews just to learn what questions interviewers ask. I have collected all the questions that appeared important to me. Go through them. Happy Learning! (Will update all the answers shortly)
Please don't forget to comment & share your feedback!
HDFS:
1. Without touching Block size & input split, can we have a say on the no. of mappers?
Ans: Yes. Create a custom InputFormat and override 'isSplitable()' to return false. Each input file is then processed by exactly one mapper, no matter how many blocks it spans.
2. What is the difference between Block size & input split?
Ans: A block is a physical division of the data, whereas an input split is a logical division of the data.
3. To process one hundred files each of size - 100MB on HDFS whose default block size is 64MB, how many mappers would be invoked?
Ans: Each file occupies 2 blocks of data (block1 - 64 MB & block2 - 36 MB). With the default input format, one mapper is invoked per block, and hence 100 files would invoke 200 mappers.
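The arithmetic above can be checked with a small sketch. This is plain Python, not Hadoop code: the function name count_splits is made up for illustration, the split size is assumed to be equal to the block size, and the 1.1 slack factor is meant to mirror the SPLIT_SLOP constant in Hadoop's FileInputFormat.

```python
SPLIT_SLOP = 1.1  # assumption: mirrors the 10% slack used by FileInputFormat


def count_splits(file_size_mb, split_size_mb=64):
    """Return the number of input splits (and hence mappers) for one file."""
    splits = 0
    remaining = file_size_mb
    # Carve off full-size splits while the remainder is clearly larger than one split.
    while remaining / split_size_mb > SPLIT_SLOP:
        splits += 1
        remaining -= split_size_mb
    # Whatever is left (36 MB for a 100 MB file) becomes the last, smaller split.
    if remaining > 0:
        splits += 1
    return splits


mappers = 100 * count_splits(100)
print(mappers)  # 200
```

Note that because of the slack factor, a 70 MB file would still be a single split in this sketch, even though it occupies 2 blocks.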
4. What is data locality optimization?
Ans: In Hadoop, the computation is moved to the data rather than the other way round. A task can run in 3 possible ways, out of which the first way is always preferred by the Jobtracker
Same node execution: The task is launched by the Tasktracker on the Datanode where the block of data is stored.
Off-node execution: If the Tasktracker on that Datanode (where the data block is located) has no free slot, the task runs on another node in the same rack and reads the block over the network within the rack.
Off-rack execution: If no Tasktracker has a free slot in the entire rack where the block of data is present, the task runs on a node in a different rack and the block is read across racks.
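The order of preference above can be sketched as follows. This is a simplified illustration in plain Python, not the actual scheduler (in a real cluster the Tasktrackers ask the Jobtracker for work via heartbeats). The function names and the node/rack names are hypothetical.

```python
def locality_level(task_node, task_rack, replica_locations):
    """Classify where a task would run relative to the replicas of its block.

    replica_locations is a list of (node, rack) tuples that hold the block.
    """
    if any(node == task_node for node, _ in replica_locations):
        return "same-node"
    if any(rack == task_rack for _, rack in replica_locations):
        return "off-node"   # same rack, different node
    return "off-rack"


def pick_node(free_nodes, replica_locations):
    """Among (node, rack) pairs with a free slot, pick the one closest to the data."""
    preference = {"same-node": 0, "off-node": 1, "off-rack": 2}
    return min(
        free_nodes,
        key=lambda nr: preference[locality_level(nr[0], nr[1], replica_locations)],
    )


replicas = [("n1", "r1"), ("n4", "r2"), ("n5", "r2")]
# n1 is busy, so the best remaining choice is another node in rack r1.
chosen = pick_node([("n7", "r3"), ("n2", "r1")], replicas)
print(chosen)  # ('n2', 'r1')
```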
5. What is Speculative execution?
Ans: If one of the tasks of a MapReduce job is slow, it pulls down the overall performance of the job. Hence, the Jobtracker continuously monitors the progress of each task (reported via heartbeat signals). If a certain task is progressing much slower than the other tasks of the job, the Jobtracker speculates that something is wrong with it and launches a duplicate attempt of the same task on another node, preferably one that holds a different replica of the same block. This concept is called Speculative execution.
The important thing to note here is that it will not kill the slow-running task. Both attempts run simultaneously. Only when one of them completes is the remaining one killed.
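Here is a toy sketch of that idea in plain Python. It is not Hadoop code: the function names are made up, and the 0.2 progress gap is an assumed threshold used only for illustration.

```python
def needs_speculation(progress_by_task, gap=0.2):
    """Return the tasks whose progress trails the average by at least `gap`.

    progress_by_task maps a task id to a progress value between 0.0 and 1.0.
    The gap of 0.2 is an assumption made for this sketch.
    """
    average = sum(progress_by_task.values()) / len(progress_by_task)
    return sorted(t for t, p in progress_by_task.items() if average - p >= gap)


def resolve(attempt_finish_times):
    """Given the finish time of each attempt, return (winner, killed attempts)."""
    winner = min(attempt_finish_times, key=attempt_finish_times.get)
    killed = sorted(a for a in attempt_finish_times if a != winner)
    return winner, killed


slow = needs_speculation({"m1": 0.9, "m2": 0.95, "m3": 0.2})
print(slow)  # ['m3']

# The original attempt keeps running next to the speculative one.
# Whichever finishes first wins and the other attempt is killed.
winner, killed = resolve({"m3_attempt0": 120, "m3_attempt1": 45})
print(winner, killed)  # m3_attempt1 ['m3_attempt0']
```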
6. What are the different types of File permissions in HDFS?
Ans:
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas
Position 1: ‘d’ means folder, ‘-’ means file
Positions 2-4: Owner's permissions (read, write, execute) on file/folder
Positions 5-7: Group's permissions on file or folder
Positions 8-10: Permissions of all other users on file or folder
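To make the positions concrete, here is a small Python sketch that decodes such a 10-character permission string. The function name parse_mode is made up for this example.

```python
def parse_mode(mode):
    """Decode a 10-character permission string such as 'drwxr-x---'."""
    if len(mode) != 10:
        raise ValueError("expected 10 characters, e.g. 'drwxr-x---'")

    def decode(triplet):
        # Each triplet is read, write, execute; '-' means the permission is absent.
        names = ("read", "write", "execute")
        return [name for flag, name in zip(triplet, names) if flag != "-"]

    return {
        "type": "folder" if mode[0] == "d" else "file",
        "owner": decode(mode[1:4]),    # positions 2-4
        "group": decode(mode[4:7]),    # positions 5-7
        "others": decode(mode[7:10]),  # positions 8-10
    }


parsed = parse_mode("drwxr-x---")
print(parsed["type"], parsed["group"])  # folder ['read', 'execute']
```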
7. What is Rack-awareness?
Ans: In HDFS, the Namenode avoids placing all the replicas of a single block in the same rack. This concept is called Rack-awareness. If all the replicas were in one rack and that entire rack went down, there would be no way of recovering that block of data.
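As a rough sketch, the default placement for 3 replicas (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack) could look like this in Python. The real policy picks nodes randomly and also considers load and free space; this version is deterministic and the function and node names are hypothetical.

```python
def place_replicas(writer, topology):
    """Pick 3 replica locations for one block.

    writer is a (node, rack) tuple; topology maps a rack to its list of nodes.
    Assumes at least 2 racks and at least 2 nodes in the remote rack.
    """
    writer_node, writer_rack = writer
    first = (writer_node, writer_rack)
    # Second replica: some node in a different rack than the writer.
    remote_rack = next(rack for rack in sorted(topology) if rack != writer_rack)
    second_node = topology[remote_rack][0]
    # Third replica: a different node in the same remote rack.
    third_node = next(n for n in topology[remote_rack] if n != second_node)
    return [first, (second_node, remote_rack), (third_node, remote_rack)]


topology = {"r1": ["n1", "n2"], "r2": ["n3", "n4"]}
placement = place_replicas(("n1", "r1"), topology)
print(placement)  # [('n1', 'r1'), ('n3', 'r2'), ('n4', 'r2')]
```

Losing either rack still leaves at least one replica of the block available.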
8. What are the different modes of HDFS that one can run? Where do we configure these modes?
Ans: Hadoop can be configured to run in one of the following modes.
a. Standalone Mode or local (default mode)
b. Pseudo distributed mode
c. Fully distributed mode.
These modes are configured via core-site.xml, mapred-site.xml and hdfs-site.xml
9. What are the available data-types in Hadoop?
Ans: To support serialization-deserialization and to be comparable with one another, Hadoop has built its own datatypes.
Following is the list of types that implement WritableComparable-
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable.
Others: NullWritable, Text, BytesWritable, MD5Hash
10. Explain the command '-getMerge'
Ans: hadoop fs -getmerge <directory> <merged file name>
This command takes all the files in the given HDFS directory, concatenates them and writes the result as a single file on the local file system.
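To show what the merge does, here is a local-only imitation in Python. It does not talk to HDFS at all: getmerge_local is a made-up helper that concatenates the files of an ordinary directory in file-name order, and the part-0000x names are just sample data.

```python
import os
import tempfile


def getmerge_local(src_dir, dst_file):
    """Concatenate all files in src_dir (sorted by name) into dst_file."""
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with open(dst_file, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())
    return names


# Build a fake job output directory with two part files.
base = tempfile.mkdtemp()
src = os.path.join(base, "output")
os.mkdir(src)
for name, text in [("part-00001", b"c\n"), ("part-00000", b"a\nb\n")]:
    with open(os.path.join(src, name), "wb") as f:
        f.write(text)

merged_path = os.path.join(base, "merged.txt")
merged_parts = getmerge_local(src, merged_path)
with open(merged_path, "rb") as f:
    merged_bytes = f.read()
print(merged_parts)  # ['part-00000', 'part-00001']
```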
11. Explain the anatomy of a file read in HDFS
Ans:
1. Client opens the file (calls open() on Distributed File System).
2. DFS calls the namenode to get block locations.
3. DFS creates FSDataInputStream and client invokes read() on this object.
4. Using DFSDataInputStream (a subclass of FSDataInputStream), the read operation is done on the datanodes where the file blocks are present. Blocks are read in order. Once all the blocks have been read, the client calls close() on the FSDataInputStream.
12. Explain the anatomy of a file write in HDFS
Ans:
1. Client creates a file (calls create() on DFS).
2. DFS calls the namenode (NN) to create the file. NN checks the client's access permissions and whether the file already exists. If the file already exists, an IOException is thrown.
3. The DFS returns an FSDataOutputStream to write data into. FSDataOutputStream has a subclass DFSDataOutputStream which handles communication with the NN & datanodes (DN).
4. DFSDataOutputStream writes data in the form of packets (small units of data) and these packets are written to various DNs to form blocks of data. A pipeline is formed that consists of the list of DNs that a single block has to be replicated to.
5. As the data is written to all the DNs in the pipeline, each packet is acknowledged by the DNs in the reverse order of the pipeline.
6. When the client has finished writing the data, it calls close() on the stream.
7. The client waits for the outstanding acknowledgements before contacting the namenode to signal that the file is complete.
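The packet and acknowledgement flow in steps 4 and 5 can be imitated with a toy Python model. Nothing here is HDFS code: write_block is a made-up function, the datanode names are hypothetical, and the 4-byte packet size is chosen only to keep the example small (real packets are far larger).

```python
def write_block(data, datanodes, packet_size=4):
    """Push data through a pipeline of datanodes, one packet at a time.

    Returns (bytes stored on each datanode, log of acknowledgements).
    """
    stored = {dn: bytearray() for dn in datanodes}
    ack_log = []
    for seq, start in enumerate(range(0, len(data), packet_size)):
        packet = data[start:start + packet_size]
        # The packet is forwarded down the pipeline: DN1 -> DN2 -> DN3.
        for dn in datanodes:
            stored[dn].extend(packet)
        # Acknowledgements travel back in the reverse order: DN3 -> DN2 -> DN1.
        for dn in reversed(datanodes):
            ack_log.append((seq, dn))
    return {dn: bytes(buf) for dn, buf in stored.items()}, ack_log


stored, acks = write_block(b"abcdefghij", ["dn1", "dn2", "dn3"])
print(acks[:3])  # [(0, 'dn3'), (0, 'dn2'), (0, 'dn1')]
```

After the call, every datanode in the pipeline holds an identical copy of the block.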
MapReduce:
1. What is Distributed Cache?
Ans: Distributed Cache is a mechanism by which 'Side Data' (extra read-only data needed by a MR program) is distributed to all the nodes that run the tasks of the job, so that every mapper/reducer can read it locally.
2. What is 'Sequence File' format? Where do we use it?
Ans: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
The SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
There are 3 different SequenceFile formats:
a. Uncompressed key/value records
b. Record compressed key/value records - only 'values' are compressed here.
c. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable
3. What are the different File Input Formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. The sub-classes of FileInputFormat are: CombineFileInputFormat, TextInputFormat (default), KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat.
SequenceFileInputFormat has a few subclasses like - SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter
4. What is ‘Shuffling & sorting’ phase in MapReduce?
Ans: This phase occurs between the Map & Reduce phases. During this phase, all the key/value pairs emitted by the various mappers are collected, grouped by key, sorted and copied to the reducers.
5. How many instances of a 'jobtracker' run in a cluster?
Ans: Only one instance of Jobtracker would run in a cluster.
6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.
7. How do you make sure that only one mapper runs your entire file?
Ans: Create a custom 'InputFormat' and override 'isSplitable()' to return false. (or) a rather crude way is to set the block size greater than the size of the input file.
8. When will the reducer phase start in a MR program, and why is the progress of the reducer phase a non-zero value (percentage) even before the mapper phase ends?
Ans: The reduce() function starts only after all mappers finish their execution. But the progress of the reducer can be some non-zero value before the mapper phase progress reaches 100%. This is because the reducer phase is actually a combination of copy, sort & reduce. The copy step starts fetching the output of the mappers that have already finished while the remaining mappers are still running, and this copying counts towards the reducer's progress.
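A tiny Python sketch of how such a percentage can come about, assuming that copy, sort and reduce each contribute one third of the reducer's progress. The function name is made up for this example.

```python
def reducer_progress(copy_fraction, sort_fraction, reduce_fraction):
    """Combine the three steps of the reducer phase into one progress value.

    Each argument is a fraction between 0.0 and 1.0; the equal one-third
    weighting per step is the assumption behind this sketch.
    """
    return (copy_fraction + sort_fraction + reduce_fraction) / 3.0


# Mappers are still running, but 60% of the map output has been copied already.
progress = reducer_progress(0.6, 0.0, 0.0)
print(round(progress * 100))  # 20
```

So a reducer can show about 20% even though reduce() has not been called a single time.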
9. Explain various phases of a MapReduce program.
Ans:
Mapper phase: The input data-set is split into independent chunks (input splits), which are processed by the map tasks in a completely parallel manner.
Sort & Shuffle phase: Determines the reducer that should receive each map output key/value pair (called partitioning) and copies it there. All keys received by a reducer are sorted.
Reducer phase: The reducer receives a key and the corresponding list of values (emitted across all the mappers). Aggregation of these values is done in the reducer phase.
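The three phases can be imitated on a single machine with a word count in plain Python. This is only a sketch of the data flow, not Hadoop code: all function names are made up, and the partition() function is a simple stand-in for a hash partitioner.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit (word, 1) for every word in the lines of one input split."""
    for line in split:
        for word in line.split():
            yield word, 1


def partition(key, num_reducers):
    """Decide which reducer gets the key (stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Group the values by key per reducer and sort the keys of each reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def reduce_phase(key, values):
    """Reducer: aggregate the list of values of one key."""
    return key, sum(values)


splits = [["a b a"], ["b c"]]  # two input splits, so two mappers
map_outputs = [pair for split in splits for pair in map_phase(split)]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)

results = {}
for bucket in reducer_inputs:
    for key, values in bucket:
        word, total = reduce_phase(key, values)
        results[word] = total
print(sorted(results.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```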
10. What is a 'Task instance' ?
Ans: A Task instance is the child JVM process that the Tasktracker launches to run a map or reduce task. Running the task in a separate JVM ensures that a failure of the task does not take down the Tasktracker.
11. If we use an Identity Mapper/Reducer, is the output sorted?
Ans:
12. What is the use of Context Object?
Ans:
13. Can we use cleanup(context)/setup(context) in a mapper/reducer? If yes, what’s the use?
Ans:
14. Write a MR program to find the median of a dataset that contains 10 million rows
15. Write a MR program to find the mean of a dataset that contains 10 million rows
Hive:
1. What is the difference between 'Sort by' & 'Order by' keywords?
Ans:
2. What is the difference between 'Managed Table(internal table)' and an 'External Table'?
Ans:
3. When do we use Partitioning, Bucketing & Clustering in Hive?
Ans:
HBase:
1. How does HBase achieve random read/write?
Ans: HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by having sufficient statistics to know the number of blocks, block size, start key, and end key.
For example: A table may contain 10 TB of data, but the table is broken up into regions of size 4GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, and version. If you know the starting key of each block, you can determine which file to access and at which byte-offset (block) to start reading as the binary search proceeds.
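The lookup idea (sorted start keys plus binary search) can be sketched with Python's bisect module. This is not the HBase client API: find_range is a made-up helper and the start keys below are sample values. The same helper works for picking a region from the region start keys and for picking a block from the block start keys.

```python
import bisect


def find_range(start_keys, row_key):
    """Return the index of the range (region or block) that contains row_key.

    start_keys must be sorted, and the first range starts at the empty key "".
    Range i covers every key from start_keys[i] up to, but not including,
    start_keys[i + 1].
    """
    return bisect.bisect_right(start_keys, row_key) - 1


region_start_keys = ["", "g", "n", "t"]  # 4 regions: [""-g), [g-n), [n-t), [t-...)
region_index = find_range(region_start_keys, "hadoop")
print(region_index)  # 1
```

Each lookup takes a number of comparisons proportional to the logarithm of the number of ranges, which is what keeps a random read cheap even for a very large table.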
2. What’s the difference between Scan & Get in HBase?
Ans:
3. How does HBase store data internally?
Ans: