Wednesday, March 28, 2012

Data Sampling in Big Data: how does it work?

Many big data projects require statistical inferences, and so they eventually require careful data sampling. If you insist on using the whole of your Big Data set because you think sampling obscures what is important in your data, it implies that you don't trust sampling, which in turn implies that you don't know your data well enough to draw samples from it. The gurus, call them data analysts, data scientists, or statisticians, all use sampling to create intelligent small subsets of huge data sets, to infer the best offers, recommendations, and analysis results, and to draw conclusions.

Just to quickly brush up on the basic concepts: Mean, Median and Mode estimate the basic shape of the distribution of the data. If they are more or less equal, the distribution is fairly symmetric. Since the Mean is influenced by outliers or extreme values, there can be data sets where the Mean is greater than the Median and the Mode; that indicates the distribution is stretched toward higher values, i.e. positively skewed, and it is negatively skewed in the opposite scenario. This helps in figuring out the shape of the distribution and locating its centre. At a level above that, we figure out how concentrated the values in the distribution are; Variance, Standard Deviation and Range measure this. Variance is calculated as a division where the numerator is the sum of squared differences of each value from the Mean and the denominator is the total number of values. Variance is also referred to as "sigma squared", whereas Standard Deviation is the positive square root of the Variance, and yes, it is denoted by sigma.
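As a quick illustration, here is a minimal Python sketch of these measures on a made-up list of values; the numbers are purely for demonstration.

    # Basic descriptive statistics using Python's standard library.
    import statistics

    values = [12, 15, 15, 18, 21, 24, 90]     # 90 is an outlier pulling the mean up

    mean   = statistics.mean(values)          # influenced by the outlier
    median = statistics.median(values)        # middle value, robust to outliers
    mode   = statistics.mode(values)          # most frequent value
    var    = statistics.pvariance(values)     # population variance, "sigma squared"
    stdev  = statistics.pstdev(values)        # positive square root of the variance
    rng    = max(values) - min(values)        # range

    print(mean, median, mode, var, stdev, rng)
    # here mean > median > mode, so the distribution is positively skewed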

If the distribution of the data is symmetric, which means that the Mean, Median and Mode are roughly equal, the above can be combined with the Empirical rule (rather than Chebyshev's theorem), which tells us that:
around 68% of data values lie within 1 standard deviation of the mean,
around 95% of data values lie within 2 standard deviations of the mean, and
around 99.7% of data values lie within 3 standard deviations of the mean.
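A small sketch to see the Empirical rule in action on synthetic, roughly normal data (assuming numpy is available):

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(loc=100, scale=15, size=1_000_000)   # synthetic, roughly normal values

    mu, sigma = data.mean(), data.std()
    for k in (1, 2, 3):
        within = np.mean(np.abs(data - mu) <= k * sigma)   # fraction within k standard deviations
        print(f"within {k} sigma: {within:.3f}")
    # prints approximately 0.683, 0.954, 0.997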

In Big Data we work on huge data sets; the data may be a whole data set lying either in a NoSQL data store or on HDFS, or it may be past data plus a set of incremental data. We may need to break the data into proportionate subdivisions of the population. Populations are divided into subdivisions by quantiles, and different quantiles produce subdivisions of different proportions: the median divides the population into two halves of 50% each, quartiles into subdivisions of 25% each, deciles into 10% each, whereas percentiles divide the population into subdivisions of 1% each. We can use this concept to get a fairly even distribution of data by running quantiles over a directory containing the bulk data plus the incremental data, dividing it into sub-populations over which any of the sampling designs below can be applied to get a reasonably wide-spread sample distribution.
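Here is a rough sketch of that idea with numpy; the values are synthetic stand-ins for a numeric column read out of the bulk-plus-incremental directory:

    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)      # a skewed, big-data-ish column

    quartile_cuts = np.percentile(values, [25, 50, 75])            # 4 sub-populations of ~25% each
    decile_cuts   = np.percentile(values, np.arange(10, 100, 10))  # 10 sub-populations of ~10% each

    # assign every record to its quartile bucket; any of the sampling designs
    # below can then be applied bucket by bucket
    buckets = np.digitize(values, quartile_cuts)
    for b in range(4):
        print("quartile bucket", b, "size:", (buckets == b).sum())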

There are a couple of interesting data sampling designs; let's see how they work.

Simple Random Samples - Works for batches, where the data elements are all available at once. Every data element in the data set has an equal probability of being picked into the sample frame. In one implementation, every data element is assigned a random number; once the assignment is done, the list is sorted by that number and the sample frame is picked from the top. Hence the probability of an element landing in the sample frame is the same across the whole data set. Usually at least 5% of the data is picked to form a qualitative sample frame.
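A minimal Python sketch of that implementation, with the 5% figure as the default sample fraction:

    import random

    def simple_random_sample(data, fraction=0.05):
        n = max(1, int(len(data) * fraction))           # at least 5% by the rule of thumb above
        keyed = [(random.random(), x) for x in data]    # every element gets an equal chance
        keyed.sort(key=lambda pair: pair[0])            # sort by the random key
        return [x for _, x in keyed[:n]]                # the sample frame

    sample = simple_random_sample(list(range(10_000)))
    print(len(sample))                                  # 500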

Systematic Samples - Works for streams, where data elements arrive sequentially. Once the initial window of the stream has arrived, an initial data element is picked; thereafter, every kth element (where k = total data set size / sample frame size) is picked into the sample. This looks like an even sampling algorithm unless some pattern occurs in the stream: if the same pattern repeats with period k, every kth sample will repeat in a cyclic manner and bias the sample.
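A minimal sketch of systematic sampling, assuming the stream can be consumed as a Python iterable:

    import random

    def systematic_sample(stream, k):
        start = random.randrange(k)        # initial pick inside the first window of k elements
        for i, x in enumerate(stream):
            if i % k == start:             # every k-th element after the initial pick
                yield x

    print(list(systematic_sample(range(100), k=10)))   # 10 evenly spaced picks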

Stratified Samples - Works for streams. This design samples evenly even if a repeating pattern occurs in the population. It divides the whole data set into sub-data sets (strata); the sample frame drawn from each stratum reflects that stratum's fractional proportion of the whole data set. Hence a proportionately sized sample frame is selected from each stratum, and within each stratum either Simple Random Sampling or Systematic Sampling can be used to select samples fairly and without bias.
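A sketch in Python, where the stratum key function is an assumption standing in for whatever attribute defines the sub-data sets:

    import random
    from collections import defaultdict

    def stratified_sample(data, stratum_of, fraction=0.05):
        strata = defaultdict(list)
        for x in data:
            strata[stratum_of(x)].append(x)             # build the sub-data sets (strata)
        sample = []
        for members in strata.values():
            n = max(1, round(len(members) * fraction))  # proportionate sample frame size
            sample.extend(random.sample(members, n))    # simple random sampling within the stratum
        return sample

    data = list(range(1000))
    print(len(stratified_sample(data, stratum_of=lambda x: x % 4)))  # about 5% overall, proportionate across the 4 strata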

Cluster Samples - Works on batches. The whole data set is divided into clusters, each cluster working as a microcosm of the whole data set and containing heterogeneous data elements. The intent is to increase the heterogeneity of data elements within a cluster and decrease the variability across clusters. Once this has been done, random clusters are picked, and the sample frame is formed from the elements of those clusters.
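A sketch in Python, assuming the data already arrives grouped into clusters (for example one list per file or block):

    import random

    def cluster_sample(clusters, n_clusters):
        chosen = random.sample(clusters, n_clusters)      # randomly pick whole clusters
        return [x for cluster in chosen for x in cluster]

    # hypothetical clusters, e.g. one per HDFS block or input file
    clusters = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    print(len(cluster_sample(clusters, n_clusters=3)))    # 300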

When taking samples out of the overall data set, sampling error is inherent: it is the difference between the sample and the overall data set from which the sample has been collected. Mathematically, the standard error of the sample mean is directly proportional to the standard deviation of the overall data set and inversely proportional to the square root of the sample size. Hence the larger the sample size, the more the sample mean concentrates around the overall data set mean and the more peaked the sampling distribution is around it; the smaller the sample size, the flatter the sampling distribution is around the overall data set mean.
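In formula terms this is the standard error of the sample mean, sigma / sqrt(n); a tiny sketch shows how the error shrinks as the sample grows:

    import math

    sigma = 15.0                       # standard deviation of the overall data set
    for n in (100, 400, 1600):
        se = sigma / math.sqrt(n)      # spread of the sample mean around the true mean
        print(f"n={n:5d}  standard error={se:.2f}")
    # quadrupling the sample size halves the standard error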

When working on unstructured data lying on HDFS, many times we cannot be sure that the distribution of the data is normal, and the population variance is often not known. It is difficult to come up with correct inferences from samples in such cases. A rule of thumb, though, is that as long as the whole data distribution is not heavily skewed and we take a sufficient sample size, a t-distribution (which relies on the sample standard deviation in place of the unknown population standard deviation) should give us a good approximation of the sampling distribution.
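A sketch of a t-based confidence interval for the mean of a sample, assuming scipy is available; the sample itself is synthetic:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    sample = rng.normal(loc=50, scale=10, size=200)      # stand-in for a drawn sample

    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))       # uses the *sample* standard deviation
    lo, hi = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=se)   # df = n - 1
    print(f"95% confidence interval for the mean: ({lo:.2f}, {hi:.2f})")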

Saturday, March 24, 2012

RPC Library comparison for Big Data

There are various flavors of RPC implementations available in the open source arena. Each RPC implementation library has its own pros and cons. Ideally, we should select the RPC library according to the specific enterprise solution requirements of the project.
Some of the features that any RPC implementation aspires to are:
  • Cross Platform communication
  • Multiple Programming Languages
  • Support for Fast protocols (local, binary, zipped, etc.)
  • Support for Multiple transports
  • Flexible Server (configuration for non-blocking, multithreading, etc.)
  • Standard server and client implementations
  • Compatibility with other RPC libraries
  • Support for different data types and containers
  • Support for Asynchronous communication
  • Inherent support in Hadoop, NoSQL
  • Support for dynamic typing (no schema compilation)
  • Fast serialization
Focusing on the Big Data stack, below I compare a few RPC libraries.
 
Support for                                        | Avro | Thrift | MessagePack | Protocol Buffers | BSON | Fast Infoset | Woodstox
Cross Platform                                     | 10   | 10     | 10          | 10               | 10   | 10           | 10
Multiple Languages                                 | 10   | 10     | 10          | 10               | 10   | 10           | 10
Fast Protocols                                     | 10   | 10     | 3           | 3                | 3    | 10           | 3
Flexible Server (configurable thread pool, non-blocking) | 7 | 10  | 7           | 0                | 3    | 3            | 0
Simple IDL                                         | 7    | 10     | 10          | 7                | 3    | 3            | 3
Standard Server and Client                         | 10   | 10     | 10          | 3                | 3    | 3            | 3
Fast and Compact Serialization                     | 5    | 7      | 7           | 6                | 6    | 6            | 7
Multiple transports and protocols                  | 7    | 10     | 3           | 0                | 0    | 0            | 0
Inherent support in Hadoop                         | 10   | 3      | 0           | 0                | 0    | 0            | 0
Compatibility with other RPC Libraries             | 5    | 5      | 0           | 0                | 0    | 0            | 0
Data types, containers                             | 10   | 7      | 10          | 7                | 3    | 3            | 3
No Schema compilation (dynamic typing)             | 5    | 0      | 5           | 0                | 5    | 5            | 5
Asynchronous calls/Callback                        | 0    | 5      | 5           | 2                | 0    | 0            | 0
Score (out of 115)                                 | 96   | 97     | 82          | 48               | 46   | 53           | 44

(Critical requirements are scored out of 10; not-so-critical requirements are scored out of 5.)

Thrift, Avro and MessagePack look really impressive to me. Thrift and Avro support most of the requirements listed above and are very well battle-tested.
Another axis of classification can be:
  • for JSON-based conversations between server and client, MessagePack is the best among them all,
  • for binary data conversations, BSON should be considered,
  • for XML-based conversations, Fast Infoset and Woodstox should be considered.