Interview Questions

How to Configure Replication Factor and Block Size for HDFS?
Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure high data availability. The replication factor is a property that can be set in the HDFS configuration file that will allow you to adjust the global replication factor for the entire cluster. For each block stored in HDFS, there will be n – 1 duplicated blocks distributed across the cluster. For example, if the replication factor was set to 3 (default value in HDFS) there would be one original block and two replicas.

Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.

Changing dfs.replication will only apply to new files you create, but will not modify the replication factor for the already existing files.

To change the replication factor for files that already exist, you can run the following command, which applies recursively to all files in HDFS (here setting the factor to 1):
hadoop fs -setrep -w 1 -R /

You can also change the replication factor on a per-file basis using the Hadoop FS shell.
hadoop fs -setrep -w 3 /my/file

You can also change the replication factor of all the files under a directory:

hadoop fs -setrep -w 3 -R /my/dir


What does ‘jps’ command do?

The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It lists all the Hadoop daemons running on the machine, i.e. NameNode, DataNode, ResourceManager, NodeManager, etc.
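As a sketch, this is the kind of check ‘jps’ enables; the output below is simulated for illustration (the PIDs are hypothetical), since real output depends on which daemons your machine runs:

```shell
# 'jps' prints one line per running JVM: "<pid> <main class>".
# Simulated output for illustration (PIDs are hypothetical):
jps_output='2305 NameNode
2410 DataNode
2522 ResourceManager
2693 NodeManager'

# On a live cluster you would run `jps` directly; a scripted check
# that a given daemon is up might look like:
echo "$jps_output" | grep -q NameNode && echo "NameNode is running"
```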

What is Rack Awareness?
Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, to minimize network traffic between “DataNodes” in different racks. With replication factor 3 (the default), the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
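In practice, rack awareness is typically enabled by pointing the NameNode at a topology script in core-site.xml. The sketch below is illustrative: the property name is the Hadoop 2.x one, and the script path is a hypothetical script that maps a DataNode address to a rack ID such as /rack1:

```xml
<!-- core-site.xml -->
<property>
  <name>net.topology.script.file.name</name>
  <!-- hypothetical script mapping a DataNode IP/hostname to a rack ID -->
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
```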

What is the difference between an “HDFS Block” and an “Input Split”?

The “HDFS Block” is the physical division of the data, while “Input Split” is the logical division of the data. HDFS divides data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.
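As a rough illustration of the relationship, here is a sketch of the arithmetic; the 128 MB block size and 300 MB file size are assumptions, and the one-split-per-block behavior is the default of FileInputFormat (input formats may compute splits differently):

```python
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size (assumed here)
file_size_mb = 300    # hypothetical file

# HDFS stores the file as ceil(300 / 128) = 3 physical blocks;
# the last block holds only the remaining 44 MB.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)

# By default, FileInputFormat creates one logical input split per block,
# so three mapper tasks would be launched for this file.
num_splits = num_blocks

print(num_blocks, num_splits)  # → 3 3
```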



What is Job Tracker role in Hadoop?

Job Tracker’s primary function is resource management (managing the Task Trackers), tracking resource availability, and task life cycle management (tracking the tasks’ progress and fault tolerance).
  1.     It is a process that runs on a separate node, often not on a DataNode.
  2.     Job Tracker communicates with the NameNode to identify data location.
  3.     Finds the best Task Tracker nodes to execute tasks on the given nodes.
  4.     Monitors individual Task Trackers and submits the overall job back to the client.
  5.     It tracks the execution of MapReduce workloads, which run locally on the slave nodes.
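In classic MapReduce (MR1), clients and Task Trackers locate the Job Tracker through a property in mapred-site.xml. This is a sketch for MR1 only (YARN replaces the Job Tracker with the ResourceManager); the hostname and port below are placeholders:

```xml
<!-- mapred-site.xml (MR1 only) -->
<property>
  <name>mapred.job.tracker</name>
  <!-- placeholder host:port for the Job Tracker -->
  <value>jobtracker.example.com:8021</value>
</property>
```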
