Thursday, September 01, 2011

A crisp overview of Apache Mahout - Machine Learning on Hadoop

Apache Mahout [2] is a machine learning algorithm library, inspired by a Stanford University research paper published in 2006 [1], and is intended to run as Apache MapReduce jobs on a Hadoop cluster.
The project implements various ML (short for Machine Learning) algorithms and classifies them into practical categories. Following is a description of them:
Recommendation: recommends items to a user based on the user's behaviour or historical records
  • Non-distributed (non-Hadoop solutions that can be run with just the Mahout library and Java SE 6)
  • Distributed: Slope One, distributed nearest neighbour (item-based), and distributed nearest neighbour (user-based)
In simple terms, an algorithm like Slope One runs as a two-step MapReduce job. In the first step, the mapper builds per-user item pairs and the reducer computes the preference differences for each item pair. In the second step, the average difference per item pair is computed from the difference lists.
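The two steps above can be sketched in plain, non-distributed Java. This is only an illustrative in-memory version of the Slope One idea, not Mahout's actual MapReduce implementation; the map layout of the ratings is my own assumption:

```java
import java.util.*;

// Minimal in-memory Slope One sketch (illustration only, not Mahout's
// distributed version). Step 1 mirrors the job that computes average
// rating differences per item pair; step 2 predicts from them.
public class SlopeOneSketch {

    // Step 1: for each item pair (i, j), accumulate the sum of rating
    // differences (r_ui - r_uj) and the count of users rating both.
    public static Map<String, Map<String, double[]>> buildDiffs(
            Map<String, Map<String, Double>> ratings) {
        Map<String, Map<String, double[]>> diffs = new HashMap<>();
        for (Map<String, Double> userRatings : ratings.values()) {
            for (Map.Entry<String, Double> a : userRatings.entrySet()) {
                for (Map.Entry<String, Double> b : userRatings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) continue;
                    double[] cell = diffs
                        .computeIfAbsent(a.getKey(), k -> new HashMap<>())
                        .computeIfAbsent(b.getKey(), k -> new double[2]);
                    cell[0] += a.getValue() - b.getValue(); // running sum of differences
                    cell[1] += 1;                           // pair count
                }
            }
        }
        return diffs;
    }

    // Step 2: predict a user's rating for 'item' from the average
    // differences, weighted by how many users support each pair.
    // Assumes at least one of the user's rated items co-occurs with 'item'.
    public static double predict(Map<String, Map<String, double[]>> diffs,
                                 Map<String, Double> userRatings, String item) {
        double num = 0, den = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            double[] cell = diffs.getOrDefault(item, Collections.<String, double[]>emptyMap())
                                 .get(rated.getKey());
            if (cell == null) continue;
            double avgDiff = cell[0] / cell[1];
            num += (rated.getValue() + avgDiff) * cell[1];
            den += cell[1];
        }
        return num / den;
    }
}
```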
At a high level, the nearest neighbour algorithms use a similarity measure such as the Pearson correlation to estimate preference predictions, and then pick the top preferences.
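As an illustration of that similarity measure, the Pearson correlation between two users' preferences over their co-rated items can be computed like this (a plain-Java sketch, not Mahout's own implementation):

```java
// Sketch of the Pearson correlation used by neighbourhood-based
// recommenders to score how similar two users' tastes are.
// Inputs are the two users' ratings over the same co-rated items.
public class PearsonSketch {
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i]; sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i]; sumY2 += y[i] * y[i];
        }
        // correlation = covariance / (stddev_x * stddev_y), in the
        // common "computational" form; ranges from -1 to +1
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt(n * sumX2 - sumX * sumX)
                   * Math.sqrt(n * sumY2 - sumY * sumY);
        return num / den;
    }
}
```

A similarity of +1 means perfectly aligned tastes, -1 perfectly opposite; the recommender keeps the most similar users as the "neighbourhood".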
Clustering: clustering algorithms are either joining (the term used is 'agglomerative') or breaking up ('divisive'). Given a huge data set, we either start with each point in its own cluster and gradually merge them, or start with one big cluster and gradually split it, based on a 'distance' calculation (again, there are many distance measures); eventually we get groups of data points that are more closely related to each other. Many clustering algorithms are integrated into the Mahout library.
k-means, fuzzy k-means, and canopy are a few well-known clustering algorithms. Survey and market research data suit the divisive Mahout clustering algorithms well. They say the k-means problem is NP-hard (reminds me of engineering days and computation theory).
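The core k-means iteration can be sketched on a single machine; this is the same two-phase idea Mahout repeats as MapReduce passes (assign each point to its nearest centroid, then recompute the centroids), shown here with one-dimensional points for brevity:

```java
// Tiny Lloyd's-iteration k-means sketch (illustration only).
// Each iteration: assignment step (the "map"), then update step
// (the "reduce"), repeated a fixed number of times.
public class KMeansSketch {
    public static double[] cluster(double[] points, double[] initial, int iters) {
        double[] centroids = initial.clone();
        int k = centroids.length;
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            for (double p : points) {            // assign to nearest centroid
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }
            for (int c = 0; c < k; c++)          // recompute centroids
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }
}
```

Real implementations also check a convergence threshold instead of a fixed iteration count, and use a pluggable distance measure rather than absolute difference.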
Classification: combining the quantitative characteristics of a new item with those of the training set (used for the previous classifications), we decide its category. Tracking, discovery, and recognition are common application domains for classification algorithms. Up to the Mahout 0.5 release, Bayesian, logistic regression, and random forest classifiers are integrated, along with partial support for neural networks.
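To illustrate the idea of deciding a category from a training set, here is a tiny multinomial naive Bayes sketch; it is only an illustration of the Bayesian approach, not Mahout's classifier, and the spam/ham word-count framing is my own example:

```java
import java.util.*;

// Minimal multinomial naive Bayes sketch: training accumulates word
// counts per category; classification picks the category with the
// highest log posterior, using Laplace (add-one) smoothing.
public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    public void train(String category, String[] words) {
        totalDocs++;
        docCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cat : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(cat);
            int catTotal = 0;
            for (int c : counts.values()) catTotal += c;
            // log prior + sum of smoothed log likelihoods
            double score = Math.log((double) docCounts.get(cat) / totalDocs);
            for (String w : words)
                score += Math.log((counts.getOrDefault(w, 0) + 1.0)
                                  / (catTotal + vocab.size()));
            if (score > bestScore) { bestScore = score; best = cat; }
        }
        return best;
    }
}
```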
Dimension Reduction: probably the best use of parallelism, reducing a complex multi-dimensional data set to fewer dimensions so that we can analyze the problem. Mahout has an implementation of singular value decomposition (SVD) to solve this problem.
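A sketch of the power-iteration idea that underlies large-scale SVD solvers: repeatedly multiply a start vector by AᵀA so that it converges to the top right singular vector of A, whose image under A gives the largest singular value. This is only a numerical sketch (it assumes the start vector is not orthogonal to the top singular direction), not Mahout's solver:

```java
// Power-iteration sketch for the largest singular value of a matrix A:
// iterate v <- normalize(A^T A v), then the top singular value is ||A v||.
public class PowerIterationSketch {
    public static double topSingularValue(double[][] a, int iters) {
        int cols = a[0].length;
        double[] v = new double[cols];
        v[0] = 1.0;  // arbitrary start vector (assumed not orthogonal to the answer)
        for (int it = 0; it < iters; it++) {
            double[] atav = multiplyTransposed(a, multiply(a, v)); // A^T (A v)
            double norm = 0;
            for (double x : atav) norm += x * x;
            norm = Math.sqrt(norm);
            for (int i = 0; i < cols; i++) v[i] = atav[i] / norm;
        }
        double[] av = multiply(a, v);
        double s = 0;
        for (double x : av) s += x * x;
        return Math.sqrt(s);  // ||A v|| at convergence
    }

    private static double[] multiply(double[][] a, double[] v) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += a[i][j] * v[j];
        return r;
    }

    private static double[] multiplyTransposed(double[][] a, double[] v) {
        double[] r = new double[a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < r.length; j++) r[j] += a[i][j] * v[i];
        return r;
    }
}
```

Keeping only the top few singular directions is what reduces the data set to fewer dimensions.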
For application domain implementations, which could be social network sentiment analysis, analysing geospatial data, pattern recognition, robotic vision, etc., there are three possible approaches to solving a given problem:
  • Map it to an existing Mahout-integrated algorithm and provide the data set on HDFS.
  • Implement our own solution for the algorithm as MapReduce programs.
  • Hybrid approach: use core Mahout-integrated algorithms, add custom behaviour with MapReduce, and combine the two into a solution.