Many big data projects need to make statistical inferences, and sooner or later that calls for careful data sampling. If you always process the whole of your big data sets because you think sampling obscures the important parts of your data, it implies that you don't trust sampling, which in turn suggests that you don't know your data well enough to draw samples from it. The gurus, whether you call them data analysts, data scientists or statisticians, all use sampling to create intelligent small subsets of huge data sets, and from those subsets they derive offers, recommendations and analysis results, and draw conclusions.
Just to quickly brush up on the basic concepts: mean, median and mode together give a first estimate of the shape of a data distribution. If they are more or less equal, the distribution is roughly symmetric. The mean is influenced by outliers or extreme values, so a few very high values can pull the mean above the median and the mode. That indicates a long tail on the higher side, and the distribution is positively skewed. In the opposite case, when the mean falls below the median and the mode, it is negatively skewed. This helps in figuring out the shape of the distribution and locating its centre. One level above, we figure out how concentrated the values are around that centre. Variance, standard deviation and range help us measure this. The variance is a quotient whose numerator is the sum of the squared differences between each value and the mean, and whose denominator is the total number of values (when the variance is estimated from a sample, the denominator is usually one less than the number of values). The population variance is also referred to as "sigma-squared", whereas the standard deviation is the positive square root of the variance and is denoted by sigma.
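To make the measures above concrete, here is a minimal sketch using Python's standard statistics module. The function name describe and the sample values are made up for illustration, and comparing the mean with the median is only a rough indicator of the direction of skew, not a formal skewness statistic.

```python
import statistics

def describe(values):
    """Return centre and spread measures for a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    variance = statistics.pvariance(values)  # population variance: divide by N
    std_dev = statistics.pstdev(values)      # positive square root of the variance
    value_range = max(values) - min(values)
    # Rough heuristic only: an extreme high value drags the mean above the median.
    if mean > median:
        skew = "positive"
    elif mean < median:
        skew = "negative"
    else:
        skew = "symmetric"
    return {
        "mean": mean, "median": median, "mode": mode,
        "variance": variance, "std_dev": std_dev,
        "range": value_range, "skew": skew,
    }

# Illustrative data with one outlier (40) on the high side.
data = [2, 3, 3, 4, 5, 6, 40]
summary = describe(data)
print(summary)
```

In this made-up data set the single outlier lifts the mean to 9 while the median stays at 4 and the mode at 3, which is the positively skewed case described above.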
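If the distribution of the data is symmetric and roughly bell-shaped, which
means that mean, median and mode are roughly equal, the above can be combined
with the empirical rule rather than Chebyshev's theorem. The empirical rule tells us that
around 68% of data values lie within 1 standard deviation of the mean,
around 95% of data values lie within 2 standard deviations of the mean, and
around 99.7% of data values lie within 3 standard deviations of the mean.
Chebyshev's theorem holds for any shape of distribution but gives a weaker guarantee: at least 1 - 1/k^2 of the values lie within k standard deviations of the mean, that is at least 75% for k = 2 and about 89% for k = 3.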
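As a quick check of these percentages, the sketch below measures the share of values within k standard deviations on simulated normal data and prints it next to the Chebyshev lower bound of 1 - 1/k^2. The simulated data, the seed and the function names are assumptions made for illustration only. On normal data the share within 3 standard deviations comes out close to 99.7%.

```python
import random
import statistics

def fraction_within(values, k):
    """Share of values lying within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

def chebyshev_lower_bound(k):
    """Chebyshev's guarantee for any distribution: at least 1 - 1/k^2."""
    return 1 - 1 / k ** 2

# Simulated, roughly bell-shaped data (an assumption for this demo).
rng = random.Random(42)
normal_data = [rng.gauss(0, 1) for _ in range(50_000)]

for k in (1, 2, 3):
    observed = fraction_within(normal_data, k)
    bound = chebyshev_lower_bound(k)
    print(f"k={k}: observed {observed:.3f}, Chebyshev guarantees at least {bound:.3f}")
```

The observed shares sit well above the Chebyshev bound, which is why the empirical rule is the more useful of the two whenever the data really is bell-shaped.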
In Big Data, we work on huge sets of data. The data may be the whole data set lying either in a NoSQL data store or on HDFS, or it may be past data plus a set of incremental data. We may need to break the data into proportionate subdivisions of the population. Quantiles are the cut points that divide a population into such subdivisions. The subdivisions come in different proportions: the median divides the population into two halves of 50% each, quartiles divide it into four parts of 25% each, deciles into ten parts of 10% each, whereas percentiles divide it into a hundred parts of 1% each. We can use this concept to get a fairly even distribution of data: compute quantiles over a directory containing the bulk data plus the incremental data, divide it into sub-populations, and then apply any of the data sampling designs below to each sub-population to get a reasonably wide-spread sample distribution.
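The quantile idea can be sketched in a few lines with statistics.quantiles (Python 3.8+), which returns the n - 1 cut points for n equal-sized groups. The helper split_by_quantiles and the in-memory list of values are assumptions for illustration. On a real cluster the cut points would come from the data store or from a distributed job rather than from a Python list.

```python
import bisect
import statistics

def split_by_quantiles(values, n):
    """Split values into n sub-populations using the n - 1 quantile cut points."""
    cuts = statistics.quantiles(values, n=n)
    groups = [[] for _ in range(n)]
    for v in values:
        # bisect_left finds which interval between cut points v falls into.
        groups[bisect.bisect_left(cuts, v)].append(v)
    return cuts, groups

values = list(range(1, 101))                   # stand-in for a numeric column
cuts, groups = split_by_quantiles(values, 4)   # n=4 gives quartiles
print(cuts)
print([len(g) for g in groups])
```

With n=10 the same helper produces deciles and with n=100 percentiles. Each resulting group can then be handed to one of the sampling designs.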
There are a few interesting data sampling designs. Let's see how they work.
Simple Random Samples - Works for batches, where all data elements are available at the same time. Every data element in the data set has an equal probability of being picked into the sample frame. In the implementation, every data element is assigned a random number. Once the assignment has been done, the list is sorted by that number and the first elements are taken as the sample frame. Because the random numbers are independent of the data, every element has the same chance of landing in the sample frame. As a rule of thumb, at least 5% of the data is picked to form a good-quality sample frame.
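The assign-a-random-number-then-sort procedure can be sketched as follows. The function name, the seed parameter and the 5% fraction in the usage line are illustrative assumptions. For an in-memory list, random.sample gives an equivalent result in one call, but the version below mirrors the steps described above.

```python
import random

def simple_random_sample(records, fraction, seed=None):
    """Pick a simple random sample by tagging each record with a random key."""
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    # Step 1: assign every record an independent random number.
    keyed = [(rng.random(), record) for record in records]
    # Step 2: sort by that number, ignoring the record itself.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: the first `size` records form the sample frame.
    return [record for _, record in keyed[:size]]

batch = list(range(1000))                      # stand-in for a batch of records
sample = simple_random_sample(batch, fraction=0.05, seed=11)
print(len(sample))
```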
Systematic Samples - Works for streams, where data elements arrive sequentially. Once the stream starts arriving, a starting element is picked. Thereafter every kth element is picked into the sample, where k is the total data set size divided by the sample frame size. This looks like an even sampling algorithm unless some pattern occurs in the stream. If the same pattern repeats with a period that lines up with k, every kth pick lands on the same point of the cycle and the sample ends up repeating the same kind of element.
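A minimal sketch of systematic sampling over a stream is shown below. It assumes that k is known up front and that the start is chosen at random among the first k elements, which is a common convention rather than something stated above. The second half reproduces the pitfall with a made-up stream whose pattern repeats every k elements.

```python
import random
from itertools import islice

def systematic_sample(stream, k, seed=None):
    """Return an iterator over every k-th element, from a random start below k."""
    rng = random.Random(seed)
    start = rng.randrange(k)               # random offset in [0, k)
    return islice(stream, start, None, k)  # works on any iterable, consumed lazily

events = range(100)                        # stand-in for an incoming stream
picked = list(systematic_sample(events, k=10, seed=1))
print(picked)

# Pitfall: the stream repeats with the same period as k,
# so every pick lands on the same point of the cycle.
periodic = [i % 10 for i in range(100)]
biased = list(systematic_sample(periodic, k=10, seed=1))
print(biased)
```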
Stratified Samples - Works for streams. This design samples evenly even if a repeating pattern occurs in the population. It divides the whole data set into sub-data sets (strata). The sample frame taken from each sub-data set is sized in proportion to the fraction of the whole data set that the sub-data set represents, so larger sub-data sets contribute larger sample frames. Within each sub-data set, either simple random sampling or systematic sampling can be used to select the samples fairly and without bias.
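Proportionate stratified sampling might look like the sketch below. The stratum_of key function, the region names and the 10% fraction are hypothetical, and simple random sampling is used inside each stratum. The sketch collects everything in memory first, so for a true stream the strata would have to be buffered or sampled incrementally.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportionate allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = {}
    for name, members in strata.items():
        # Each stratum contributes in proportion to its own size.
        size = max(1, round(len(members) * fraction))
        sample[name] = rng.sample(members, size)  # simple random sample within the stratum
    return sample

# Made-up records of the form (region, id), with uneven stratum sizes.
records = (
    [("north", i) for i in range(600)]
    + [("south", i) for i in range(300)]
    + [("east", i) for i in range(100)]
)
by_region = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.1, seed=3)
print({name: len(rows) for name, rows in by_region.items()})
```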
Cluster Samples - Works for batches. The whole data set is divided into clusters, each cluster acting as a microcosm of the whole data set and holding heterogeneous data elements. The intent is to increase the heterogeneity of the data elements within each cluster and to decrease the variability across clusters. Once this has been done, clusters can be picked at random and the sample frames formed from them.
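A one-stage version of cluster sampling can be sketched as below, where whole clusters are picked at random and all of their elements are kept. The cluster names (modelled loosely on part files) and the function name are hypothetical. A two-stage variant would sample again inside each chosen cluster.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and keep all of their elements."""
    rng = random.Random(seed)
    # Sort the names so that a given seed always yields the same choice.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Made-up clusters: eight partitions of ten records each.
clusters = {f"part-{i}": list(range(i * 10, (i + 1) * 10)) for i in range(8)}
picked_clusters = cluster_sample(clusters, n_clusters=3, seed=7)
print(sorted(picked_clusters))
```

Because whole clusters are taken or skipped, the design only gives a representative sample when each cluster really does resemble the full data set.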
When taking samples out of the overall data set, sampling error is inherent. It is the difference between a statistic computed on the sample and the same statistic for the overall data set from which the sample has been drawn. Mathematically, the standard error of the sample mean is directly proportional to the standard deviation of the overall data set and inversely proportional to the square root of the sample size. Hence, the larger the sample size, the more concentrated the sample means are around the mean of the overall data set, and the more peaked the sampling distribution is around it. The smaller the sample size, the flatter the sampling distribution is around the mean of the overall data set.
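The effect of sample size can be checked with a small simulation. The sketch below uses the standard error of the mean, sigma / sqrt(n), and compares the spread of simulated sample means for two sample sizes. The population, the seed, the sample sizes and the number of trials are all made up for illustration.

```python
import random
import statistics

def standard_error(population_sd, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return population_sd / n ** 0.5

def simulated_spread(population, n, trials, rng):
    """Standard deviation of the means of `trials` random samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.pstdev(means)

rng = random.Random(5)
population = [rng.gauss(50, 10) for _ in range(20_000)]  # made-up data set
sigma = statistics.pstdev(population)

spread_25 = simulated_spread(population, n=25, trials=300, rng=rng)
spread_400 = simulated_spread(population, n=400, trials=300, rng=rng)

print(f"n=25:  simulated {spread_25:.2f}, formula {standard_error(sigma, 25):.2f}")
print(f"n=400: simulated {spread_400:.2f}, formula {standard_error(sigma, 400):.2f}")
```

Going from 25 to 400 elements multiplies the sample size by 16 but shrinks the spread of the sample means only by a factor of about 4, which is the square-root relationship at work.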
When working on unstructured data lying on HDFS, many times we can't be sure that the distribution of the data is normal, and the population variance is usually not known. It's difficult to come up with correct sampling in such cases. But a rule of thumb is that, assuming the whole data distribution is not heavily skewed and we take a sufficiently large sample, a t-distribution (which uses the sample standard deviation in place of the unknown population standard deviation) should give us a good approximation of the distribution of the sample mean.
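As an illustration of how the t-distribution is typically applied, the sketch below builds a confidence interval for the mean from a small sample using only the sample standard deviation. The sample values are made up, and the critical value 2.262 (two-sided 95%, 9 degrees of freedom) is taken from a standard t table and passed in as a parameter, because Python's standard library has no t quantile function.

```python
import statistics

def t_confidence_interval(sample, t_critical):
    """Confidence interval for the mean when the population variance is unknown."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
    standard_error = s / n ** 0.5
    margin = t_critical * standard_error
    return mean - margin, mean + margin

# Made-up measurements: n = 10, so there are 9 degrees of freedom.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]
T_CRIT_95_DF9 = 2.262              # from a t table: two-sided 95%, 9 degrees of freedom
low, high = t_confidence_interval(sample, T_CRIT_95_DF9)
print(f"95% interval for the mean: {low:.3f} to {high:.3f}")
```

For larger samples the critical value approaches the normal value of about 1.96, so the choice between the two matters most when the sample is small.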