== Introduction ==

The quest for performance optimization has been pursued tirelessly since the invention of computers. We are perpetually seeking to streamline machine performance by increasing speed, expanding memory, and improving accuracy. As researchers work with larger and larger data sets, such optimizations become necessities rather than comforts.
Many programming models have been developed to address this need; currently, one of the most popular is [http://en.wikipedia.org/wiki/MapReduce MapReduce]. Useful across many domains, one specific application for which the [http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext MapReduce paradigm] has been explored is knowledge discovery from nuclear reactor simulation data, since this data is expected to become large-scale very quickly. [http://hadoop.apache.org/ Hadoop], a well-known open-source implementation of MapReduce, was used for this study.
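
The paradigm itself is simple: a ''map'' function turns input records into intermediate key/value pairs, and a ''reduce'' function aggregates all values that share a key. As a purely illustrative sketch (unrelated to the reactor data), here is the classic word-count shape written against the Hadoop Java API:

<pre>
// Minimal illustration of the MapReduce programming model using the
// org.apache.hadoop.mapreduce API. The mapper emits (token, 1) pairs and the
// reducer sums the counts for each token.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split each input line into tokens and emit a count of 1 for each token.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // All counts for the same token arrive together; add them up.
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
</pre>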
  
For preliminary investigation, [http://nlp.stanford.edu/IR-book/html/htmledition/k-means-1.html k-Means clustering], available from [http://mahout.apache.org/ Mahout], was employed. The results were similar to those we published in the paper [http://www.ntis.gov/search/product.aspx?ABBR=DE20131092256 "Knowledge Discovery from Nuclear Reactor Simulation Data"].
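
For reference, k-Means partitions a set of vectors into <math>k</math> clusters by repeatedly assigning each vector to its nearest centroid and recomputing the centroids, which (locally) minimizes the within-cluster sum of squared distances:

<math>\underset{C_1,\dots,C_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,</math>

where <math>\mu_j</math> is the mean of the vectors assigned to cluster <math>C_j</math>. In the commands below, <math>k</math> corresponds to the -k option and the maximum number of iterations to the -x option.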
  
 
== Process ==

Here are the steps we used for working with Hadoop:
  
* Successfully install Hadoop from the [http://hadoop.apache.org/releases.html Releases] page. I used a [http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster tutorial by Michael Noll] to install Hadoop as a single-node cluster on my laptop, which runs Fedora 17. (Note that this tutorial is written for Ubuntu, so it is only directly helpful for certain Linux users.)
* The data was stored in the [http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html Hadoop Distributed File System] (HDFS). A helpful tutorial on HDFS can be found here.
* Successfully install Mahout, a scalable machine learning and data mining library, from the [http://mahout.apache.org/general/downloads.html Downloads] page. Examples of how to run various machine learning algorithms with Mahout can be found in /mahout/examples/bin.
* Put the requisite data for analysis on HDFS. Our data files are in CSV format, but Mahout's k-Means clustering expects the data to be in a Mahout-specific SequenceFile format. A Java utility was written to convert our data into a dense sequence file format, as needed by the k-Means clustering algorithm (a sketch of such a conversion appears after this list). Please note that the examples for k-Means in Mahout mostly deal with [http://mahout.apache.org/users/basics/creating-vectors-from-text.html converting text data into sparse vector format]; this was not the case with our data.
 
* Run k-Means with the -cl option. For example:
** mahout kmeans -i /data/diff_unc_trans_seq/part-m-00000 -o /data/diff_unc_trans_kmeans_op -x 20 -k 2 -c /data/analysis_clusters -cl
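
The conversion utility mentioned above is not reproduced here; the following is only a minimal sketch of how CSV rows can be written as dense Mahout vectors in a SequenceFile before k-Means is run. The paths, the class name, and the assumption that every column is numeric with no header row are illustrative only:

<pre>
// Hypothetical sketch of a CSV-to-SequenceFile converter for Mahout k-Means.
// Assumes each CSV row is a comma-separated list of numeric values and has no
// header; paths are placeholders, not the actual utility used in this study.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path csvPath = new Path(args[0]);   // HDFS path to the CSV input
    Path seqPath = new Path(args[1]);   // e.g. /data/diff_unc_trans_seq

    // k-Means reads only the VectorWritable values; the Text key is just a label.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, seqPath, Text.class, VectorWritable.class);
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(csvPath)))) {
      String line;
      int row = 0;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
          values[i] = Double.parseDouble(fields[i].trim());
        }
        // Wrap each row as a dense, named vector so k-Means can consume it.
        NamedVector vector = new NamedVector(new DenseVector(values), "row-" + row);
        writer.append(new Text(vector.getName()), new VectorWritable(vector));
        row++;
      }
    } finally {
      writer.close();
    }
  }
}
</pre>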
  
Information on the various command line options for k-Means can be found [http://mahout.apache.org/users/clustering/k-means-commandline.html here].
  
 
* Get the clusteredPoints directory out of HDFS into your local directory.
* Run mahout seqdumper to dump the clustered points in readable form. For example:
** mahout seqdumper -i /data/diff_unc_trans_kmeans_op/clusteredPoints/part-m-00000 -o ~/Data_orig/diff_unc_trans_hadoop_op/.

This output can be further processed for analysis.
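
As an alternative to seqdumper, the clustered points can be read back programmatically. The sketch below is illustrative only: it opens a SequenceFile and discovers the key/value classes from the file header, so it does not depend on a particular Mahout release (though the Mahout jars must be on the classpath for the value class to load):

<pre>
// Hypothetical sketch: read a Mahout output SequenceFile (e.g. a clusteredPoints
// part file) and print each key/value pair, much like "mahout seqdumper" does.
// The input path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]); // e.g. .../clusteredPoints/part-m-00000

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      // Instantiate key/value holders from the classes recorded in the file itself.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println("Key: " + key + "  Value: " + value);
      }
    } finally {
      reader.close();
    }
  }
}
</pre>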
  
 
== Resources ==

For more information on Hadoop, see [http://hadoop.apache.org/ their website], which contains many useful resources, including a [http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html tutorial] on how to use Hadoop.

More details on Mahout and its uses can be found at the [http://mahout.apache.org/ Apache Mahout website]. The developers keep the website very well maintained, and you can find a [http://mahout.apache.org/users/clustering/k-means-clustering.html great introduction to k-Means clustering], in addition to information on many other methods.
  
 
Some further helpful materials on MapReduce include:

* [http://infolab.stanford.edu/~ullman/mmds/ch2.pdf Chapter 2] of Mining of Massive Datasets by Anand Rajaraman and Jeffrey David Ullman.
* [http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext MapReduce: A Flexible Data Processing Tool], a seminal paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat, published in 2010.
  
 
== Related ==

[[CSV Clustering via Mahout (on local machine)]]<br>
[https://wiki.eclipse.org/ICE_Developer_Documentation Developer Documentation]
