Wednesday, March 28, 2012

Data Sampling in Big Data: How Does It Work?

Many big data projects need to make statistical inferences, and sooner or later that requires careful data sampling. If you always process your whole data set because you think sampling would obscure what is important in it, that implies you don't trust sampling, which in turn suggests you don't know your data well enough to draw a sample from it. Data analysts, data scientists, statisticians, or whatever you want to call them, all use sampling to create small, intelligent subsets of huge data sets, and from those subsets they derive offers, recommendations and analysis results, and draw conclusions.

To quickly brush up on the basic concepts: mean, median and mode together give a rough idea of the shape of a distribution. If they are more or less equal, the distribution is roughly symmetric. The mean is influenced by outliers and extreme values, so in some data sets the mean is greater than the median and the mode. That indicates a long tail on the higher side, and the distribution is positively skewed; when the mean is smaller than the median and the mode, it is negatively skewed. These measures help us figure out both the shape and the centre of the distribution. One level above that, we look at how concentrated the values are around the centre. Variance, standard deviation and range measure this. The (population) variance is the sum of the squared differences between each value and the mean, divided by the total number of values. Variance is also referred to as "sigma squared", whereas the standard deviation is the positive square root of the variance, denoted by sigma.
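As a rough illustration of the measures above, here is a minimal Python sketch using only the standard library. The `describe` helper and the sample values are made up for this example, and comparing mean with median is only a rule-of-thumb indicator of skew, not a formal test.

```python
import statistics

def describe(values):
    """Summarise centre and spread (hypothetical helper for illustration)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    variance = statistics.pvariance(values)  # sum of squared deviations / N
    std_dev = statistics.pstdev(values)      # positive square root of variance
    # Rule of thumb only: mean above median hints at positive skew.
    if mean > median:
        skew = "positive"
    elif mean < median:
        skew = "negative"
    else:
        skew = "symmetric"
    return {
        "mean": mean, "median": median, "mode": mode,
        "variance": variance, "std_dev": std_dev,
        "range": max(values) - min(values), "skew": skew,
    }

# One extreme value (40) pulls the mean well above the median and mode.
data = [2, 3, 3, 4, 5, 6, 40]
summary = describe(data)
print(summary)
```

Dropping the 40 from the list brings the mean back close to the median, which is the outlier effect described above.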

If the distribution of the data is symmetric and bell-shaped (roughly normal), with mean, median and mode roughly equal, the above can be combined with the empirical rule, which gives tighter estimates than the more general Chebyshev's theorem (valid for any distribution, but guaranteeing only at least 75% of values within 2 standard deviations). The empirical rule says that
around 68% of data values lie within 1 standard deviation of the mean,
around 95% of data values lie within 2 standard deviations of the mean, and
around 99.7% of data values lie within 3 standard deviations of the mean.
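The percentages above can be checked with a small simulation. This is only a sketch: the data is synthetic, drawn from a normal distribution with an arbitrary mean of 100 and standard deviation of 15, and `fraction_within` is a name invented for the example.

```python
import random
import statistics

def fraction_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

# Synthetic, normally distributed data (assumed parameters, fixed seed).
rng = random.Random(42)
normal_data = [rng.gauss(100, 15) for _ in range(50_000)]

coverage = {k: fraction_within(normal_data, k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

On heavily skewed data the same function will report noticeably different shares, which is a quick way to see that the empirical rule should not be applied there.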

In Big Data we work on huge data sets. The data may be the whole data set lying in a NoSQL data store or on HDFS, or it may be past data plus a set of incremental data. We may need to break the data into proportionate subdivisions of the population, and quantiles are what divide a population this way. Different quantiles give different proportions: the median divides the population into two halves of 50% each, quartiles into four parts of 25% each, deciles into ten parts of 10% each, whereas percentiles divide it into a hundred parts of 1% each. We can use this concept to get a fairly even split of the data: compute quantiles over a directory containing the bulk data plus the incremental data, divide it into sub-populations, and then apply any of the sampling designs below to each of them to get a reasonably well spread sample.
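A minimal sketch of splitting a population into quantile-based sub-populations could look like the following. It works on an in-memory list for simplicity; on real HDFS-sized data the cut points would have to come from a distributed or approximate quantile computation, which is not shown here. The `split_by_quantiles` helper is hypothetical.

```python
import bisect
import statistics

def split_by_quantiles(values, n):
    """Divide values into n sub-populations using n - 1 quantile cut points."""
    cuts = statistics.quantiles(values, n=n)
    groups = [[] for _ in range(n)]
    for v in values:
        # bisect_right finds which interval between cut points v falls into.
        groups[bisect.bisect_right(cuts, v)].append(v)
    return cuts, groups

values = list(range(1, 101))
cuts, groups = split_by_quantiles(values, 4)  # n=4 gives quartiles
print("cut points:", cuts)
print("group sizes:", [len(g) for g in groups])
```

Using `n=10` gives deciles and `n=100` gives percentiles with the same code.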

There are a couple of interesting data sampling designs. Let's see how they work.

Simple Random Samples - Works for batches. All data elements are available at the same time. Every data element in the data set has an equal probability of being picked for the sample. In one implementation, every data element is assigned a random number; once the assignment is done, the list is sorted by that number and the first n elements are picked as the sample. Because the random numbers are drawn the same way for every element, the probability of being selected is the same across the whole data set. As a rule of thumb, at least 5% of the data is picked to form a good-quality sample.
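The "assign a random number, sort, take the top" approach described above can be sketched like this. The function name and the 5% sample of 1,000 records are illustrative assumptions.

```python
import random

def simple_random_sample(records, sample_size, seed=None):
    """Tag every record with a random key, sort by key, keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), record) for record in records]
    keyed.sort(key=lambda pair: pair[0])
    return [record for _, record in keyed[:sample_size]]

population = list(range(1000))
sample = simple_random_sample(population, 50, seed=7)  # a 5% sample
print(sample[:10])
```

For data that fits in memory, the standard library's `random.sample(population, 50)` gives the same kind of sample without the explicit sort; the keyed version is shown because it maps naturally onto a sort-based batch job.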

Systematic Samples - Works for streams. Data elements arrive sequentially. A starting element is picked at random from the first k elements of the stream; thereafter, every kth element is picked, where k = total data set size / sample size. This gives an even sample unless there is a periodic pattern in the stream. If the same pattern repeats with a period that lines up with k, every kth element falls on the same point of the cycle and the sample is biased.
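Here is one possible sketch of systematic sampling over a stream. It assumes the total size is known (or estimated) up front so that k can be computed; the stream here is just an iterator over integers.

```python
import itertools
import random

def systematic_sample(stream, k, seed=None):
    """Pick a random start within the first k elements, then every kth one."""
    start = random.Random(seed).randrange(k)
    # islice consumes the stream lazily, so it never holds the stream in memory.
    return itertools.islice(stream, start, None, k)

population_size, sample_size = 1000, 50
k = population_size // sample_size  # sampling interval, 20 here

stream = iter(range(population_size))
sample = list(systematic_sample(stream, k, seed=3))
print(sample[:5])
```

If the stream had a cycle whose length is a multiple of 20, every picked element would sit at the same position in the cycle, which is exactly the pattern problem mentioned above.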

Stratified Samples - Works for streams. This design samples evenly even if a pattern occurs in the population. It divides the whole data set into sub-data sets (strata). The sample taken from each sub-data set is proportionate to that sub-data set's share of the whole data set. Within each sub-data set, either Simple Random Samples or Systematic Samples can be used to select elements fairly and without bias.
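A proportionate stratified sample might be sketched as follows. This version groups the records in memory first and uses a simple random sample inside each stratum; the region names and the 10% fraction are made-up example values.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportionate allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # At least one element per stratum, never more than the stratum holds.
        size = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical data set: 60% north, 30% south, 10% west.
records = (
    [("north", i) for i in range(600)]
    + [("south", i) for i in range(300)]
    + [("west", i) for i in range(100)]
)
sample = stratified_sample(records, lambda r: r[0], 0.10, seed=11)
counts = Counter(region for region, _ in sample)
print(counts)
```

The sample keeps the 60/30/10 split of the original data, which a plain random sample only achieves on average.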

Cluster Samples - Works on batches. The whole data set is divided into clusters, each acting as a microcosm of the whole data set and containing heterogeneous data elements. The intent is to increase the heterogeneity of data elements within each cluster and decrease the variability across clusters. Once this is done, clusters can be picked at random and the sample formed from the data in them.
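Cluster sampling can be sketched as picking whole clusters at random and keeping everything inside them. The cluster names below are invented; in practice a cluster might be a file, a block or a partition, assuming each one is a reasonable miniature of the whole data set.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters cluster ids at random and return all of their records."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return chosen, [record for cid in chosen for record in clusters[cid]]

# Hypothetical layout: 20 clusters of 100 records each.
clusters = {
    f"block-{i:02d}": [f"block-{i:02d}/rec-{j}" for j in range(100)]
    for i in range(20)
}
chosen, sample = cluster_sample(clusters, 3, seed=5)
print(chosen, len(sample))
```

Only the chosen clusters have to be read, which is why this design is attractive when reading the whole data set is expensive.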

When drawing samples from the overall data set, sampling error is inherent: it is the difference between a statistic computed on the sample (such as the sample mean) and the same statistic for the overall data set from which the sample was drawn. Mathematically, the standard error of the sample mean is directly proportional to the standard deviation of the overall data set and inversely proportional to the square root of the sample size. Hence, the larger the sample, the more tightly sample means concentrate around the overall mean, and the more peaked the sampling distribution is around it. The smaller the sample size, the flatter the sampling distribution around the overall data set mean.
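The relationship can be written as standard error = sigma / sqrt(n). A tiny sketch, with an assumed population standard deviation of 10, shows how the error shrinks as the sample grows:

```python
import math

def standard_error(population_std_dev, sample_size):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return population_std_dev / math.sqrt(sample_size)

sigma = 10.0  # assumed population standard deviation for the example
errors = {n: standard_error(sigma, n) for n in (25, 100, 400)}
for n, se in errors.items():
    print(f"n={n:>3}: standard error = {se}")
```

Note that quadrupling the sample size only halves the standard error, so very large samples give diminishing returns.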

When working on unstructured data lying on HDFS, many times we can't be sure that the distribution of the data is normal, and the population variance is usually not known. It's difficult to come up with correct sampling in such cases. But a rule of thumb is that, assuming the overall distribution is not heavily skewed and we take a sufficiently large sample, a t-distribution (which uses the sample standard deviation in place of the unknown population standard deviation) should give us a good approximation of the sampling distribution of the sample mean.
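As a sketch of how the t-distribution gets used, the following computes a confidence interval for the mean from a small sample. The sample values are invented, and the critical value 2.262 (95% confidence, 9 degrees of freedom) is taken from a standard t-table and passed in by hand, since the Python standard library has no t-distribution function.

```python
import math
import statistics

def t_confidence_interval(sample, t_critical):
    """Interval: sample mean +/- t_critical * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    margin = t_critical * s / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample of 10 observations, so 9 degrees of freedom.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
low, high = t_confidence_interval(sample, t_critical=2.262)
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
```

With larger samples the t critical value approaches the normal value of about 1.96, which is why the distinction matters mostly for small samples.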
