CSV Clustering via Mahout (on local machine)

UNDER CONSTRUCTION

This page shows how to cluster comma-separated values (CSV) files with Mahout on a local Linux machine. The next step would be to get this method working with Hadoop so that clustering can be distributed across clusters of computers, but that is a subject for another page.

This article has a short explanation for experienced programmers, and a longer version for those who are new to Linux and Mahout. I am still a newbie myself and appreciate extra tips.

Requirements

To get started, you will need:

Apache Maven - You can download it here if you do not already have it. Using Maven is fairly straightforward, and we will only use a few commands.
Mahout - See the list of Apache-approved mirror sites to get the latest available download.

Note that at the time this page was created, the Mahout version used was 0.9.
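Before going further, it may help to confirm that the required tools are reachable from your shell. The sketch below is not part of the original instructions: check_tool is a hypothetical helper name, and it assumes the Maven launcher is called mvn and that a Java installation (java) is present, since Maven needs one to run.

```shell
# check_tool is a hypothetical helper: reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
    return 1
  fi
}

# Maven itself runs on Java, so check both.
for tool in java mvn; do
  check_tool "$tool" || echo "Install $tool before continuing."
done
```

If either tool is reported as not found, install it first; the build steps later on this page depend on both.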

Short Version

Installation and Configuration

Download the source code for Mahout from one of the mirror sites listed on the official Apache website.

Use Maven to build Mahout by following the directions on their website.

Add the appropriate directories to the PATH variable in your .bashrc file. I had to add:

* /home/<user name>/mahout-distribution-0.9/bin

* /home/<user name>/mahout-distribution-0.9/examples/bin

Additionally, add the environment variable MAHOUT_LOCAL to .bashrc and set it to a non-empty value, such as /home/<user name>/mahout-distribution-0.9/examples/bin. When this variable is set, Mahout runs locally instead of on Hadoop.
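As a rough sketch, the .bashrc additions described above might look like the following. MAHOUT_DIR is a hypothetical helper variable introduced only for this example, and the path assumes Mahout was unpacked in your home directory; adjust it if yours lives elsewhere.

```shell
# Lines one might append to ~/.bashrc.
# MAHOUT_DIR is a hypothetical helper variable; it assumes Mahout was
# unpacked directly under your home directory.
MAHOUT_DIR="$HOME/mahout-distribution-0.9"

# Let the shell find the mahout launcher and the example scripts.
export PATH="$PATH:$MAHOUT_DIR/bin:$MAHOUT_DIR/examples/bin"

# To my knowledge the launcher only checks that this is non-empty;
# the value below simply mirrors the one suggested in the text.
export MAHOUT_LOCAL="$MAHOUT_DIR/examples/bin"
```

After saving the file, open a new terminal (or run source ~/.bashrc) so the changes take effect.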

Now we need to change directories and copy the InputMapper.java file:

cd ~/mahout-distribution-0.9/integration/src/main/java/org/apache/mahout/clustering/conversion/

cp InputMapper.java InputMapperCSV.java

Next, open your new InputMapperCSV.java.

Refactoring

For the code to work properly, you must refactor, i.e. rename, several classes and variables. Unfortunately, the terminal editor we use has no shortcut for making these changes efficiently. Since tracking every needed edit is tedious, we have included a table that summarizes the necessary conversions in case your editor has a search-and-replace feature:

Table Summary

Before	After
InputMapper	InputMapperCSV
InputDriver	InputDriverCSV
syntheticcontrol	csv
.compile(" ")	.compile(",")

However, be sure to read the section below in its entirety, as it details the steps that must be taken between refactors, in addition to listing the precise lines that require changes.
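For a single file, the conversions in the table could also be applied with sed while copying. The sketch below is only an illustration: it runs on a tiny made-up stand-in for InputMapper.java in a scratch directory, so no Mahout source is touched, and the stand-in's contents are invented rather than taken from the real file.

```shell
# Work in a scratch directory so no Mahout source is touched.
demo_dir=$(mktemp -d)

# Invented stand-in for InputMapper.java; the real file is much longer.
cat > "$demo_dir/InputMapper.java" <<'EOF'
public class InputMapper {
  private static final Pattern SPACE = Pattern.compile(" ");
}
EOF

# Apply the InputMapper and .compile rows of the table while writing
# the result to a new file, leaving the original untouched.
sed -e 's/InputMapper/InputMapperCSV/g' \
    -e 's/\.compile(" ")/.compile(",")/' \
    "$demo_dir/InputMapper.java" > "$demo_dir/InputMapperCSV.java"

cat "$demo_dir/InputMapperCSV.java"
```

Because sed reads the original and writes a separate file, rerunning it does not stack the suffix (no InputMapperCSVCSV). Running the same substitution in place on an already renamed file would.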

Line By Line

Line 34: change the class name InputMapper to InputMapperCSV
Line 36: change .compile(" ") to .compile(",")

In the same directory, copy InputDriver.java to InputDriverCSV.java. Open the new InputDriverCSV.java and do the following:

Line 49: change InputDriver to InputDriverCSV
Line 51: change InputDriver to InputDriverCSV
Line 53: change InputDriver to InputDriverCSV
Line 101: change InputMapper to InputMapperCSV
Line 103: change InputDriver to InputDriverCSV

Now, we need a new directory to copy a few more Java files:

cd ~/mahout-distribution-0.9/examples/src/main/java/org/apache/mahout/clustering

mkdir csv

cd csv

We need the Java files associated with the syntheticcontrol example. They are in mahout-distribution-0.9/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/

Copy the needed files to the csv directory we just created (note the trailing dot, which means "the current directory"):

cp -rf ~/mahout-distribution-0.9/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/* .

The files here can now be edited to suit our purposes without changing the Mahout capabilities already in place. From the current directory, which should be csv:

vim kmeans/Job.java

Edit these lines:

Line 18: change syntheticcontrol to csv
Line 28: change InputDriver to InputDriverCSV
Line 130: change InputDriver to InputDriverCSV
Line 174: change InputDriver to InputDriverCSV

For canopy/Job.java, take similar actions:

Line 18: change syntheticcontrol to csv
Line 27: change InputDriver to InputDriverCSV
Line 85: change InputDriver to InputDriverCSV

And now for fuzzykmeans/Job.java:

Line 18: change syntheticcontrol to csv
Line 29: change InputDriver to InputDriverCSV
Line 133: change InputDriver to InputDriverCSV
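Once the edits are done, a recursive grep is one way to check whether any old names were missed. This is a sketch rather than part of the original procedure: report_leftovers is a hypothetical helper, and it is demonstrated on an invented stand-in for the csv directory so that it can be tried safely.

```shell
# Scratch stand-in for the csv directory; the file contents are invented.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/csv/kmeans"
printf '%s\n' \
  'package org.apache.mahout.clustering.csv.kmeans;' \
  'import org.apache.mahout.clustering.conversion.InputDriver;' \
  > "$demo_dir/csv/kmeans/Job.java"

# report_leftovers is a hypothetical helper: list lines that still use an
# old name. -r recurses, -n prints line numbers, and -w matches whole
# words only, so InputDriverCSV is not reported as a leftover InputDriver.
report_leftovers() {
  grep -rnw -e 'syntheticcontrol' -e 'InputDriver' -e 'InputMapper' "$1"
}

if report_leftovers "$demo_dir/csv"; then
  echo "Some names may still need changing."
else
  echo "No old names left."
fi
```

In the real tree you would point report_leftovers at your own csv directory. A hit is not necessarily an error, but each one is worth a look before compiling.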

Our program is now ready to take CSV files as input. Now, we just need to compile the new Java files by entering the following:

cd ~

cd mahout-distribution-0.9/

mvn install -DskipTests
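While the build runs, it may be worth checking that your input CSV is well formed. The sketch below assumes (this is my assumption, not something stated in the Mahout documentation) that each row should be a vector of numbers with no header row and the same number of fields per row. check_csv is a hypothetical helper, and the sample values are invented.

```shell
# Scratch sample; the values are invented. Assumes numeric fields and
# no header row.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1.0,2.5,3.1
0.9,2.7,3.0
8.2,7.9,9.4
EOF

# check_csv is a hypothetical helper: succeeds only if every row has the
# same number of comma-separated fields as the first row.
check_csv() {
  awk -F',' '
    NR == 1 { n = NF }
    NF != n { printf "row %d has %d fields, expected %d\n", NR, NF, n; bad = 1 }
    END     { exit bad }
  ' "$1"
}

check_csv "$sample" && echo "CSV looks consistent"
```

This only checks the shape of the file; it does not verify that every field is actually a number.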

Lastly, we need to create a new script that runs the CSV jobs instead of the syntheticcontrol jobs. Change to the examples/bin directory and copy the file cluster-syntheticcontrol.sh to cluster-csv.sh:

cd ~/mahout-distribution-0.9/examples/bin

cp cluster-syntheticcontrol.sh cluster-csv.sh

Open the new script file:

vim cluster-csv.sh

and replace the code with the code at the end of this page (**).

The output of the job is written by Mahout in a Hadoop binary format called SequenceFile. Mahout has a utility called seqdumper to "interpret" SequenceFile output. However, it stores all of the clusters in one text file with a lot of extra information, which makes the result difficult to analyze. I've included a second script and a short program (used by the script) to separate the clusters and place each one into its own file. Each file should contain a single cluster, and each row of a file should contain a single vector. Change into your examples/bin directory and create a new script. As you can see, I named mine cluster-outputanalyzer.sh. Copy into it the code at the bottom of this page (****).

cd ~/mahout-distribution-0.9/examples/bin/

vim cluster-outputanalyzer.sh

Long Version:

Those new to programming would do well to familiarize themselves with working from the command line. The Ubuntu tutorial on using the terminal is a great place to start. However, remembering to press the Enter key after each line of typed text in the shell is enough to get you by for now.

As mentioned previously, this article assumes you are using a Linux machine. The general approach to using the command line does not differ significantly on other operating systems, although the names of individual commands do. For users interested in eventually replicating this task on a different machine, or using the command line on another system, we recommend taking a look at SS64.com, a fantastic command line reference. It contains an index of commands for several operating systems and is a great resource to have on hand.

Installation and Configuration

Open the command line and enter the following:

wget http://www.carfab.com/apachesoftware/mahout/0.9/mahout-distribution-0.9-src.tar.gz

For clarity's sake we will use precise commands, but note that there are usually several correct possibilities. We obtained the link used above from Apache's list of mirror sites, which has many options from which to download Mahout. Any of these should work, so long as you include the full file name, including the .tar.gz extension, in your command.

Unpack the tar file with:

tar -xzvf mahout-distribution-0.9-src.tar.gz

This should create a directory named mahout-distribution-0.9 that has all of Mahout's files. Linux Hint: after typing a couple of letters of "mahout", hit Tab to auto-complete. You can also just copy and paste.
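If you would like to see what the tar flags do before using them on the real download, they can be tried on a throwaway archive. The sketch below builds a tiny stand-in archive in a scratch directory (demo-src.tar.gz and its contents are made up for illustration), lists it, and then unpacks it.

```shell
# Build a tiny stand-in archive in a scratch directory (not the real download).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/mahout-distribution-0.9/bin"
echo "placeholder" > "$demo_dir/mahout-distribution-0.9/bin/mahout"
tar -czf "$demo_dir/demo-src.tar.gz" -C "$demo_dir" mahout-distribution-0.9
rm -r "$demo_dir/mahout-distribution-0.9"

# t = list the contents without extracting, z = gzip, f = archive file name.
tar -tzf "$demo_dir/demo-src.tar.gz"

# x = extract, v = print each file name as it is unpacked.
# -C chooses the directory to unpack into.
tar -xzvf "$demo_dir/demo-src.tar.gz" -C "$demo_dir"
```

Listing an archive first is a cheap way to confirm the name of the top-level directory it will create.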

See the contents of your new Mahout directory by typing:

cd mahout-distribution-0.9

ls

Building Mahout with Maven

To build Mahout using Maven, follow these next steps:

In the mahout-distribution-0.9 directory, enter:

mvn install -DskipTests

mvn is the Maven command, and install is a Maven "goal". -DskipTests is an option that skips the test suite and saves a lot of time. Maven downloads many dependencies during the build, so depending on computer and internet speeds, this step could take a while.

Next, we will need to compile inside the core directory. From the mahout-distribution-0.9 directory, enter the following:

cd core

mvn compile

This command needs to be repeated in the top-level mahout-distribution-0.9 directory, which is one level up:

cd ..

mvn compile

Lastly, we do this one more time in the examples directory:

cd examples

mvn compile

(Note that if you are not inside the mahout-distribution-0.9 directory, you will need to type a longer path, e.g. cd mahout-distribution-0.9/examples from the directory that contains it.)

Now that Mahout has been built, you need to return to your home directory and update your PATH variable, so your Linux machine will know where to look for Mahout commands:

cd ~

vim .bashrc

This command opens the .bashrc file for editing. .bashrc is a file that houses environment variables, among other things. The dot in front means it is hidden, so you will not see it when you enter the plain ls command (ls -a shows hidden files too). Files that typically shouldn't be messed with are often hidden, so users do not accidentally delete or change them. Add the PATH directories and the MAHOUT_LOCAL variable described in the Short Version, then save the file.

Assuming everything went smoothly, Mahout should now be properly installed.

Next Steps

See our page Clustering with Hadoop for another tutorial, this time with instructions on k-Means clustering with Hadoop.

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
