Apache Mahout [2] is a machine learning library, based on a Stanford University research paper published in 2006 [1], whose algorithms are intended to run as Apache MapReduce jobs on a Hadoop cluster.
The project has implemented various ML (short for Machine Learning) algorithms and grouped them into practical categories, described below.
Recommendation: Recommending items to a user based on the user's behaviour or historical records.
Algorithms:
- Non-distributed (non-Hadoop solutions that can run with just the Mahout library and Java SE 6)
- Distributed: Slope-One, Distributed Nearest Neighbour (item-based), and Distributed Nearest Neighbour (user-based)
In simple words, an algorithm like Slope-One runs a two-step MapReduce. In the first step, the mapper builds the per-user item-pair matrix and the reducer computes the rating differences for each item pair. In the second step, the average of the differences is computed per item pair.
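The two steps above can be sketched in plain Java. This is a toy, single-process illustration, not Mahout's MapReduce implementation; the `averageDiffs` helper and the `"A->B"` pair-key format are invented for this sketch.

```java
import java.util.*;

public class SlopeOneSketch {
    // For every pair of items rated by the same user, accumulate the
    // rating difference (step 1), then average per item pair (step 2).
    static Map<String, Double> averageDiffs(Map<String, Map<String, Double>> ratingsByUser) {
        Map<String, double[]> acc = new HashMap<>(); // pairKey -> {sumOfDiffs, count}
        for (Map<String, Double> ratings : ratingsByUser.values()) {
            for (Map.Entry<String, Double> a : ratings.entrySet()) {
                for (Map.Entry<String, Double> b : ratings.entrySet()) {
                    // Keep each unordered pair once, in lexicographic order.
                    if (a.getKey().compareTo(b.getKey()) >= 0) continue;
                    double[] s = acc.computeIfAbsent(a.getKey() + "->" + b.getKey(),
                                                     k -> new double[2]);
                    s[0] += a.getValue() - b.getValue();
                    s[1] += 1;
                }
            }
        }
        Map<String, Double> avg = new HashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet())
            avg.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        return avg;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> data = new HashMap<>();
        data.put("alice", Map.of("itemA", 4.0, "itemB", 3.0));
        data.put("bob",   Map.of("itemA", 5.0, "itemB", 3.0));
        // Average difference itemA->itemB: ((4-3) + (5-3)) / 2 = 1.5
        System.out.println(averageDiffs(data));
    }
}
```

In the real distributed version, the pair-difference accumulation happens in the shuffle between mapper and reducer rather than in one in-memory map.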
At a high level, the Nearest Neighbour algorithms instead use a similarity measure such as Pearson correlation to estimate preference predictions, and then pick the top preferences.
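A minimal sketch of the Pearson similarity such a recommender can rely on (plain Java, not Mahout's actual recommender API; the `pearson` helper is invented for illustration):

```java
public class PearsonSketch {
    // Pearson correlation between two users' ratings over co-rated items:
    // +1 means perfectly aligned tastes, -1 perfectly opposed.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
        double mx = sx / n, my = sy / n;
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx  += (x[i] - mx) * (x[i] - mx);
            dy  += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        // Two users whose ratings move together perfectly correlate at 1.0.
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{2, 4, 6}));
    }
}
```

A user-based recommender would compute this similarity against every other user, keep the nearest neighbours, and weight their ratings to predict unseen preferences.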
Clustering: Clustering algorithms are either joining (the term used is 'agglomerative') or breaking up (the term used is 'divisive'). Given a huge set of data, we start from single-point clusters or from one all-inclusive cluster and gradually join/break them based on a 'distance' calculation (again, there are many distance calculation criteria), eventually getting groups of data that are more sensible or more related to each other. Many clustering algorithms are integrated into the Mahout library.
k-means, fuzzy k-means, and canopy are a couple of well-known clustering algorithms. Survey and market-research data suit the divisive Mahout clustering algorithms well. The k-means problem is said to be NP-hard (reminds me of engineering days and computation theory).
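The core 'distance' step of k-means can be sketched as follows: each point is assigned to the centroid with the smallest Euclidean distance. This is a toy single-machine version; Mahout distributes this assignment step across mappers over data on HDFS.

```java
public class KMeansSketch {
    // Return the index of the closest centroid to a point, using squared
    // Euclidean distance (the square root is not needed for comparison).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        // (9, 8) lies much closer to centroid 1 at (10, 10).
        System.out.println(nearest(new double[]{9, 8}, centroids));
    }
}
```

A full k-means iteration then recomputes each centroid as the mean of its assigned points and repeats until assignments stop changing.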
Classification: Combining the quantitative information or characteristics of a new item with the training set (built from previous classifications), we decide the item's category. Tracking, discovery, and recognition are a couple of common application domains for classification algorithms. Up to the Mahout 0.5 release, Bayesian, logistic regression, and random forest classifiers are integrated, with partial support for neural networks.
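The Bayesian decision rule behind such a classifier can be sketched as scoring each category by its log prior plus the summed log-likelihoods of the observed features, then taking the argmax. The probabilities below are hand-wired for illustration; a real classifier estimates them from the training set.

```java
import java.util.*;

public class NaiveBayesSketch {
    // Pick the category maximising log P(category) + sum of log P(feature|category).
    // Unknown features fall back to a tiny smoothing probability.
    static String classify(Set<String> features,
                           Map<String, Double> priors,
                           Map<String, Map<String, Double>> likelihoods) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : priors.keySet()) {
            double score = Math.log(priors.get(cat));
            for (String f : features)
                score += Math.log(likelihoods.get(cat).getOrDefault(f, 1e-6));
            if (score > bestScore) { bestScore = score; best = cat; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> priors = Map.of("spam", 0.5, "ham", 0.5);
        Map<String, Map<String, Double>> lik = Map.of(
            "spam", Map.of("win", 0.8,  "meeting", 0.05),
            "ham",  Map.of("win", 0.05, "meeting", 0.8));
        // "win" is far more likely under the spam model.
        System.out.println(classify(Set.of("win"), priors, lik));
    }
}
```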
Dimension Reduction: Probably the best use of parallelism: reducing a complex multi-dimensional data set to fewer dimensions so that we can analyze the problem. Mahout has an implementation of Singular Value Decomposition (SVD) to solve this problem.
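As a toy stand-in for what an SVD solver computes, power iteration on AᵀA recovers the dominant singular value of a matrix A. This plain-Java sketch is only illustrative; Mahout's distributed solver is far more involved and handles many singular values at scale.

```java
public class SvdSketch {
    // Repeatedly apply v <- normalize(AᵀA v); v converges to the top right
    // singular vector, and ||A v|| then approximates the top singular value.
    static double topSingularValue(double[][] a, int iters) {
        int n = a[0].length;
        double[] v = new double[n];
        v[0] = 1; // arbitrary starting vector
        for (int t = 0; t < iters; t++) {
            double[] atav = multiplyT(a, multiply(a, v)); // Aᵀ(A v)
            double norm = 0;
            for (double x : atav) norm += x * x;
            norm = Math.sqrt(norm);
            for (int i = 0; i < n; i++) v[i] = atav[i] / norm;
        }
        double s = 0;
        for (double x : multiply(a, v)) s += x * x;
        return Math.sqrt(s); // ||A v|| for unit v
    }

    static double[] multiply(double[][] a, double[] v) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += a[i][j] * v[j];
        return r;
    }

    static double[] multiplyT(double[][] a, double[] v) {
        double[] r = new double[a[0].length];
        for (int j = 0; j < r.length; j++)
            for (int i = 0; i < a.length; i++) r[j] += a[i][j] * v[i];
        return r;
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 1}};
        // Singular values of diag(3, 1) are 3 and 1; the top one is 3.
        System.out.println(topSingularValue(a, 50));
    }
}
```

Keeping only the top few singular vectors is what projects the data down to fewer dimensions.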
For application-domain implementations (which could be social-network sentiment analysis, analysing geospatial data, pattern recognition, robotic vision, etc.), given a problem to be solved there are three possible approaches:
- Map the problem to an existing Mahout-integrated algorithm and provide the data set on HDFS.
- Implement our own solution for the algorithm as MapReduce programs.
- A hybrid approach: use core Mahout-integrated algorithms, add custom behaviour using MapReduce, and combine them into the solution.
References