Assignment 3 FAQ

I'll post the questions related to assignment 3 as they come in.

  • 'Error: JAVA_HOME is not set and could not be found.' errors printed when starting up the namenode.

This error is reported when the default shell is set to tcsh.

- Shut down HDFS if it is still running.

- Try creating a file named inside the hadoop conf directory and add the following contents. Make sure to put the correct path to Java for the first export command.
export JAVA_HOME=/path/to/your/java_home
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
export YARN_LOG_DIR=/tmp/${USER}/yarn-logs

- Start HDFS again.

  • Does the Giga-Sort dataset contain duplicates?

Yes, it does.

  • What does a data record look like in the census data set?

We have uploaded a single file corresponding to Alaska (this is one of the multiple files from the Alaska dataset). You can download this file from here.

  • Can I access the shared HDFS cluster using Hadoop 2.3.0?

No, there is a backward compatibility issue which kills containers right after they are launched.

  • Which state does the abbreviation 'DC' correspond to?

It's the District of Columbia, which is not considered a state. For simplicity, however, you can treat it as a separate state.

  • The Puerto Rico (PR) data set returns extremely high values. Is that possible?

The PR data files appear to be corrupted. We have removed them from the staged data set.

  • Does the sum of the male and female populations for a given age group match the population for the corresponding age group?

It will not match, because we haven't accounted for the missing data items, which are recorded in a separate data set.

  • When finding the median of a set of range values, how should we cope with an even number of elements?

Taking the average of the two middle values is tricky when the values are ranges. Instead, round the median index up to the next integer and use the value at that index as the median.

For instance, if you have 22 values, the median would usually be the average of the 11th and 12th values (index 11.5). Here, you can round the index up to 12 and take the 12th value as the median.
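
The rounding rule above can be sketched in a few lines. This is only an illustration in Python, not part of the assignment code; the function name median_by_rounding_up is hypothetical, and it assumes the values are already sorted and that positions are counted from 1, as in the example with 22 values.

```python
import math


def median_by_rounding_up(sorted_values):
    """Return the median of an already sorted list, rounding the index up.

    Hypothetical helper for illustration only. For an even number of
    elements it picks the upper of the two middle values instead of
    averaging them, which avoids averaging two ranges.
    """
    n = len(sorted_values)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    # 1-based median position is (n + 1) / 2, e.g. 11.5 for n = 22.
    # Rounding up gives 12; for odd n the position is already an integer.
    position = math.ceil((n + 1) / 2)
    # Convert the 1-based position to a 0-based list index.
    return sorted_values[position - 1]
```

With 22 sorted values this returns the 12th one, and with an odd count it returns the usual middle value. The values can be numbers or ranges such as (10, 14), since nothing is averaged.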

  • Can I use any external/standalone programs to process data or to produce outputs?

No, it is not allowed. However, if you want to merge a set of files produced by the reducers into a single file, you can implement a simple script for that. If you have similar requirements, contact the TA and get approval first.
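
A merge script of the kind mentioned above could look like the following. It is a minimal sketch under a few assumptions: the reducer outputs have already been copied to a local directory, they use the default part-r-NNNNN file names, and the function name merge_part_files and the output path are made up for this example.

```python
import glob
import os
import shutil


def merge_part_files(input_dir, output_path, pattern="part-r-*"):
    """Concatenate reducer output files into one file; return the file count.

    Hypothetical helper: input_dir is assumed to be a local directory that
    holds files named like part-r-00000, part-r-00001, and so on.
    """
    # Sort by name so the parts are appended in reducer order.
    part_paths = sorted(glob.glob(os.path.join(input_dir, pattern)))
    with open(output_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream the bytes so large parts are not loaded into memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

For example, merge_part_files("output", "merged.txt") would append every part file found in the local output directory to merged.txt. The script only concatenates files and does no processing of its own.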

