# Big Data, Big Deal

Massive data sets are transforming everything, from our understanding of the universe to something as everyday and mundane as recommending personalised products online. Data, now in digital form, arrive at a scale far larger than ever before. With such large data sets, it is all too easy to find rare statistical anomalies and to confuse them with real phenomena. And questions of privacy arise when, in an increasingly “dataficated” world, each of us is a row in a database.

Lies, damned lies and statistics!

—Benjamin Disraeli

We poor statisticians have been trying to live down this notorious quote ever since it was spoken. As soon as a suspicious estimate or methodology surfaces, the adage and its like reappear. Hyperbole has a ripple effect, and respect is in short supply. The tagging of statistics with lies has always put statisticians on the defensive, since the burden lies with them to dispel the suspicion and confusion surrounding the estimate or methodology. To equate statistics with lies is unjust: if it is easy to lie with statistics, it is easier still to lie without them. There is, however, no harm in conceding that generating trustworthy numbers is hard. The intent of this article is to respond, though broadly, to those critics who seem appalled by the suggestion that big data analytics can be of any use after demonetisation (Teltumbde 2017).

Most of us now live lives that can be increasingly quantified. As almost everything in the world has moved to the internet, and the internet has moved into our pockets, even sceptics would agree that we need analytics for precise and speedy answers. As an illustration, Google’s PageRank algorithm, which underlies its search technology, models the network of web pages as a Markov chain. It assigns each web page a value that determines the order of presentation in reply to a query; PageRanks, in their purest form, are simply the stationary probabilities of a particular Markov chain. This application has made an enormous amount of knowledge accessible to everyone. To see it as merely “statistics meets industry” would be a reductionist perspective; it is one of the most innovative social applications of statistics. There are dark sides too. A recent example is the controversy around the reported breach of National Stock Exchange (NSE) trading systems by algorithmic traders (Sinha 2017). Access to faster networks, achieved by co-location of servers, allegedly enabled these high-frequency traders to exploit short-lived price movements for profit. Location beats algorithm! It is therefore not surprising that, to many, the hype around big data looks like a seductive illusion. Ironically, that includes statisticians too.
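The stationary-probability idea behind PageRank can be sketched in a few lines. The following is a minimal illustration on a hypothetical four-page link graph with the commonly used damping factor of 0.85 (both the graph and the damping factor are illustrative assumptions, not Google’s actual configuration):

```python
# Toy illustration of PageRank as the stationary distribution of a
# Markov chain over a hypothetical four-page link graph.
DAMPING = 0.85  # common choice in the literature; an assumption here
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}  # page i -> pages it links to

rank = [1.0 / N] * N
for _ in range(100):  # power iteration: repeatedly apply the transition rule
    new = [(1 - DAMPING) / N] * N  # random "teleport" with probability 1-d
    for i, outs in links.items():
        for j in outs:
            new[j] += DAMPING * rank[i] / len(outs)
    rank = new

# The ranks are stationary probabilities: non-negative and summing to 1.
print([round(r, 3) for r in rank])
```

Page 3, which no page links to, ends up with the lowest rank: in the stationary distribution it is reached only by the teleport step.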

**Big Data and Statistics**

Philosophically, statistics and big data share a common goal: to understand the world through numbers. However, their approaches to this goal are quite different. The essence of big data, as captured by the Vs—volume (size), variety (heterogeneity in terms of number of variables), and velocity (high-speed, haphazard collection)—is in direct conflict with statistical ideas on the same. Yes, statistics is a subject of ideas. Those who think it a mere collection of tools and techniques are either full of ego or full of ignorance.

In statistics, it is common knowledge that when it comes to data, size is not everything. Further, a “gain in information” is considered very “expensive” in classical statistics: the more precise or reliable a survey estimate must be, the bigger the sample must be, and by orders of magnitude. Since the standard error of an estimate shrinks only with the square root of the sample size, halving the standard error requires quadrupling the sample size.
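The inverse-square-root law can be checked directly. A minimal sketch, using a hypothetical population standard deviation:

```python
import math

sigma = 10.0  # hypothetical population standard deviation

def standard_error(n):
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

se_1000 = standard_error(1000)
se_4000 = standard_error(4000)
# Quadrupling the sample size only halves the standard error.
print(se_1000 / se_4000)  # → 2.0
```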

Much of classical statistical modelling assumes a thin data structure, where the number of variables is small compared to the number of observations. For many types of big data, the number of variables is larger, often much larger, than the number of observations; the “big” of big data is thus as much about diversity as about size. Further, in many areas of big data application, data arrive at very high velocity, sometimes by combining several data sets. In classical statistics, data collection is carefully planned, whereas in big data the focus is on speed and capacity, and of course a sample cannot be grown continuously in the classical setting. The “include everything” approach and the thick, growing data structure therefore make it harder to draw reliable and stable conclusions. So far, the big data phenomenon has largely been credited to computer scientists and data engineers, so it is hardly surprising that big data analytics is, above all, a sophisticated mix of computational strategies.

The pundits of big data have made strong claims. By including everything, that is, N=all, sampling bias supposedly does not matter, rendering sampling techniques obsolete. Further, with enough data, the numbers start speaking for themselves, and correlation is sufficient to tell us everything we want to know; there is no place for statistical models and causality, hence the provocative “end of theory” argument. These, however, are merely articles of faith. When everything is included, the quality of data takes a back seat. In fact, problems may get worse rather than disappear in massive data sets.

Moreover, complex temporal dependencies may produce different effects at different points in time, so N=all may itself be only a sample in a larger perspective. Because of this crucial assumption, there is an inherent risk in making reproducible inferences in real time. Further, with a huge number of variables, “multiple comparisons” are more likely to yield false positive findings and spurious correlations. The correlation argument also falls flat if the environment from which the data are obtained changes abruptly. The misestimation of influenza prevalence by Google Flu Trends—an algorithm based on search data—provides a case in point (Butler 2013).

There are other supplementary issues in big data, such as data sanitisation, visualisation, randomisation, clustering, and testing, that make drawing inferences even more challenging. In spite of these hurdles, there are success stories such as Google Translate, which operates by statistically analysing hundreds of millions of translated documents and looking for patterns it can copy. It has delivered amazing results without any dependence on linguistic rules. It is indeed a theory-free, data-driven algorithmic miracle that is hard to ignore.

**Confronting the Challenges**

Researchers have lately started embracing the relevant statistical ideas in big data applications. Statistics brings a sceptical eye to knowledge discovery (statistically significant findings); this scepticism is at the heart of statistics, since a false finding can be more dangerous than ignorance. Hypothesis testing, for instance, works by a logic of falsification: the test starts with a null hypothesis of “no effect,” and the researcher looks for evidence in the data against this null hypothesis. Stronger evidence against the null hypothesis implies greater support for the alternative hypothesis of some effect. This scepticism also calls for adjusting the degree of caution, that is, the level of significance, when multiple hypotheses are tested, so as to control the proliferation of false discoveries. At the same time, any procedure is flawed if it is too likely to rule out good ideas incorrectly.
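The text names no specific adjustment; one common choice is the Bonferroni correction, sketched below with hypothetical p-values. (Bonferroni controls the family-wise error rate; the Benjamini–Hochberg procedure is the usual choice when the false discovery rate itself is the target.)

```python
# Bonferroni correction: with m hypotheses, test each at alpha/m so the
# family-wise chance of any false positive stays at most alpha.
alpha = 0.05
p_values = [0.001, 0.012, 0.020, 0.300, 0.700]  # hypothetical test results
m = len(p_values)

rejected = [p <= alpha / m for p in p_values]
# Only p-values at or below alpha/m = 0.01 survive the correction;
# 0.012 and 0.020, significant at the naive 0.05 level, no longer are.
print(rejected)
```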

Further, in classical practice a hypothesis is specified prior to data collection, whereas big data applications, pursuing multiple inferences, may reverse the direction: data are used to generate hypotheses, a kind of data dredging, thereby compromising the ultimate goal of making causal inferences. As a case in point, consider the much celebrated “A/B test”—an extension of the t-test of two means—for conducting online controlled experiments. In this test, users are randomly exposed to one of two variants: Control (A) and Treatment (B). An experiment effect is considered statistically significant if the Overall Evaluation Criterion (a quantitative measure of the experiment’s objective) differs, according to a statistical test, between the user groups exposed to the Treatment and Control variants.
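A minimal sketch of such a comparison, using Welch’s t statistic on hypothetical Overall Evaluation Criterion values (the data, and the rule of thumb that |t| above roughly 2 is significant at the 5% level for these sample sizes, are illustrative assumptions):

```python
import statistics

# Hypothetical Overall Evaluation Criterion values for the two variants
control = [4.1, 3.9, 4.5, 4.0, 4.2, 3.8, 4.3, 4.1]
treatment = [4.6, 4.4, 4.8, 4.5, 4.9, 4.3, 4.7, 4.6]

def welch_t(a, b):
    """Welch's t statistic for two independent samples
    (does not assume equal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    return (statistics.mean(b) - statistics.mean(a)) / ((va / na + vb / nb) ** 0.5)

t = welch_t(control, treatment)
# |t| well above ~2 suggests the treatment effect is statistically
# significant for this toy data.
print(round(t, 2))
```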

This test was initially hailed as a game changer by the web community (Christian 2012). As time passed, however, researchers, such as those at Microsoft (Kohavi et al 2012), acknowledged the need for “trustworthy” experiments, that is, powerful experiments that control for false positive outcomes. The data science team at Facebook has also launched an open platform called “PlanOut” for running online field experiments (Bakshy et al 2014). This is indeed a paradigm shift away from a stance previously ill-founded on principles of certainty and automation, among others.

Another undeniable challenge in big data applications is data visualisation. Visualisation plays a vital role not only in the effective presentation of results but also in the identification of bias and systematic errors, and in knowledge discovery. For instance, to check the assumption of linearity in regression analysis, the lack-of-fit test (F-test) and graphical methods such as scatter plots with lowess curves, and residual and partial residual plots, can be used. It should be stressed that the F-test is an overall test: when the null hypothesis is rejected, there is evidence that the assumption of linearity is untenable for at least one independent variable, but the test cannot precisely locate the problem.
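A numeric sketch of the residual check on synthetic data (no plotting here; in practice the residuals would be plotted against the fitted values or against each independent variable):

```python
import random

random.seed(7)
xs = [i * 0.2 for i in range(51)]                        # x in [0, 10]
ys = [2 + 0.5 * x * x + random.gauss(0, 1) for x in xs]  # truly quadratic

# Ordinary least squares fit of a straight line y = a + b*x
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
# A U-shaped residual pattern (positive at both ends, negative in the
# middle) reveals that the linearity assumption fails for this data.
print(round(residuals[0], 2), round(residuals[25], 2), round(residuals[-1], 2))
```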

The graphical methods mentioned above are more useful for identifying problematic variables in this regard. It follows that visualisation techniques, though less mathematical, are extremely powerful for data exploration. Indeed, a good sketch is better than a long speech! The enormous popularity of visualisation software such as GGobi and ggplot2, and of tools such as Tableau, is therefore hardly surprising.

For illustrating complex data, novel visualisation techniques such as tree maps (for hierarchical data), tag clouds (text analysis), clustergram (imaging technique used in cluster analysis), and so on have been developed. The list is certainly not exhaustive as it is difficult to be aware of every contemporary technique and visual analytics resources such as IBM Many Eyes, Google Maps, etc. These resources/tools can be extremely useful in visually navigating our data-driven world. Likewise, other supplementary issues such as clustering (separation of data into meaningful subsets to facilitate scientific insights), causal inference, bias correction, missing data issues, problem of integrating heterogeneous data sources, variance decomposition, and so on, have also caught the attention of researchers.

To address these issues, new techniques, and scaled-up versions of existing statistical techniques, are being developed in a multidisciplinary framework. Many big data challenges sit at the interface of statistics and computer science. Embracing statistical thinking will thus be crucial to resolving many of these issues, if not all.

**Concerns about Privacy**

There is a saying that “if you have nothing to hide, then you have nothing to fear.” However, in a world where data never sleeps, the boundaries of individual privacy will certainly be tested. From a data perspective, privacy demands that the likelihood of identifying an individual be zero wherever the data are exposed to stakeholders, at any stage of the data life cycle (generation, processing, dissemination, storage). The privacy preservation model for data generated by the government through large-scale surveys is simple and has so far been effective. It mainly includes de-identification, that is, removal of formal identifiers; restrictive access controls, for example, unit-level data cannot be produced as evidence in any court of law; categorisation of data into shareable and non-shareable; and disclosure of data with the understanding that it will be used for a stated purpose.

But the relentless drive to amass information in big data applications makes it enormously difficult to mask individual identities. High-dimensional data sets may contain so many data points about an individual that each record is likely to be unique; quasi-identifiers can thus be found even if direct identifiers are rigorously removed. Further, if the data set is dynamic, data collection may outgrow its intended purpose. Such over-collection may itself be considered a violation of privacy, since individuals are not fully aware of the anticipated and unanticipated uses of their data. Ironically, the biggest threat to privacy comes from analytics itself: combining and analysing data from various sources, using data mining or other techniques, may make reidentification easy. Privacy is also at risk if an adequate security system (encryption, firewalls, etc) is not in place to protect data from malicious attacks.

The response to these challenges has not been wanting. Persistent concerns about privacy and anonymity have prompted the development of more sophisticated privacy measures. Privacy-preserving models such as k-anonymity, l-diversity, and t-closeness lay down blueprints for altering, organising, or masking data points in ways that prevent reidentification.

A release of data is said to have the k-anonymity property if every record is indistinguishable from at least k−1 other records on the quasi-identifying variables, so each individual hides in a group of at least k. To accomplish k-anonymity, the techniques of “suppression” (all or some values of a variable are replaced by an asterisk) and “generalisation” (individual values of variables are replaced with a broader category) are used. The l-diversity model is an extension of k-anonymity, and t-closeness is a further refinement of l-diversity in which the granularity of data representation is reduced further.
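A minimal sketch on hypothetical records, using age bands for generalisation and pincode masking for suppression (the records, the quasi-identifiers, and the masking rules are all illustrative assumptions):

```python
from collections import Counter

# Hypothetical records whose (age, pincode) pair acts as a quasi-identifier
records = [(34, "110001"), (36, "110002"), (39, "110003"),
           (52, "560001"), (55, "560002"), (58, "560003")]

def anonymise(age, pincode):
    """Generalise age into a decade band; suppress the last three
    pincode digits with asterisks."""
    lo = (age // 10) * 10
    return (f"{lo}-{lo + 9}", pincode[:3] + "***")

released = [anonymise(a, p) for a, p in records]
k = min(Counter(released).values())
# Every released record now coincides with at least k-1 others (here k=3).
print(released, k)
```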

Differential privacy is a recently developed technique that protects privacy through intermediary software placed between the analyst and the database (Microsoft 2015). In response to a query, the software adds noise to the answer in proportion to the evaluated privacy risk. Further, the emergence of cloud computing offers a new computing archetype that can provide improved services on demand at minimal cost, and technology such as Oblivious RAM (ORAM) has shown promise in protecting privacy in the cloud. As intrusion evolves, so do privacy-preserving techniques. Nevertheless, a comprehensive approach involving government, industry, and citizens is required to circumvent threats to privacy. The loss-of-privacy argument also sits uneasily with our own behaviour: we voluntarily sacrifice privacy by ignoring notice and consent about the use of our data when downloading software and apps. We, too, need to introspect.
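The standard way such noise is calibrated is the Laplace mechanism, sketched below for a counting query (the query, the epsilon values, and the sensitivity are illustrative assumptions; Microsoft’s actual software is not shown):

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Answer a counting query with Laplace noise of scale
    sensitivity/epsilon: the textbook Laplace mechanism.
    Noise is drawn by inverse-transform sampling."""
    u = random.random() - 0.5
    noise = -(sensitivity / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_count + noise

random.seed(42)
# Smaller epsilon means stronger privacy and a noisier answer.
print(dp_count(1000, epsilon=1.0))   # close to 1000
print(dp_count(1000, epsilon=0.01))  # much noisier
```

The key property is that the analyst sees only noisy aggregates, so the presence or absence of any single individual changes the answer’s distribution very little.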

**Conclusions**

Why do we need analytics? Given the advent of cloud computing, the Internet of Things (IoT), the rollout of Digital India, and so on, harnessing the power of analytics seems a good idea. Although the big data phenomenon has acquired a cult status, the making of scientific inferences is a qualitative process: it can certainly be helped by analytics, but it cannot be replaced by analytics.

Moreover, it should in no way surprise us that analytics can improve efficiency. Operating efficiently will become increasingly important to organisations of all kinds if we are to improve levels of prosperity, and as expectations about quality of life continue to rise, the need for efficient operations will gain universal prominence. For instance, take income tax administration, albeit broadly, without delving into the entire range of its operations. Taxation is an information-intensive domain: registration, timely filing, and accurate reporting. Our income tax base is quite low (around 2% of the population), and there are evaders as well as avoiders, so the revenue gains from expanded taxation of the economy are obvious. The roughly half a million tax cases tied up in disputes at various levels (appeals, appellate tribunals, high courts, the Supreme Court) can be taken as an indicator of under-reporting of incomes. The objective of maximal tax compliance is thus coupled with problems of the tax gap, non-compliance, and scarce human resources. The extraordinary event of demonetisation has also generated massive tax verification activity. It is hardly surprising that the union budget 2016–17 proposes that income tax authorities need not disclose the “reason to believe” for initiating a search and seizure raid.

With this in view, data mining can be chosen as a flagship form of analytics to resolve some of these issues. Data mining combines disciplines such as statistics, machine learning, and database management, and can effectively meet the analytical challenges arising out of demonetisation. Most activities of taxpayers (individuals, firms, Hindu undivided families) are already captured in the form of annual information returns, Permanent Account Number (PAN) transactions, income tax returns, and so on, and demonetisation has generated massive data on deposit transactions. To see patterns and inconsistencies, data from these multiple sources need to be seamlessly integrated.

Cross-database matching techniques can be used to identify non-filers of taxes and the associated tax liabilities. For segmentation and better-targeted attention, clustering can be used to form natural groups. Outlier-based methods can form the basis of selection mechanisms for revenue maximisation. An assessment of the potential return filers in our economy (a tough nut to crack), and of the tax gap thereof, can be broken down into meaningful categories; hard-to-tax areas can consequently be identified and addressed. All these mechanisms may help identify and address the determinants of errors in an unbiased manner. Analytics can thus support and protect taxpayers and limit the scope for arbitrary arm-twisting.
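A toy sketch of such cross-database matching, with entirely hypothetical PANs, amounts, and reporting threshold (no real tax rule or data is represented):

```python
# Hypothetical transaction reports keyed by PAN, and the set of PANs
# that actually filed returns. Flag high-value transactors who did not file.
transactions = {"AAAPX1234A": 1_200_000,
                "BBBPY5678B": 90_000,
                "CCCPZ9012C": 2_500_000}
filers = {"AAAPX1234A"}

THRESHOLD = 250_000  # illustrative reporting threshold, an assumption

non_filers = {pan: amt for pan, amt in transactions.items()
              if pan not in filers and amt > THRESHOLD}
# Candidates for verification: transacted above the threshold, no return.
print(non_filers)
```

In practice the same set-difference logic would run over integrated annual information returns, PAN transaction records, and deposit data rather than small in-memory dictionaries.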

It would be a vain illusion to treat analytics as a magic wand; not every problem will be solved by simply waving it. It also needs to be appreciated that not every policy decision can be the outcome of a democratic head count of how many people agree; what matters is the quality of the evidence and the validity and soundness of the arguments. Further, we need high-quality, tailor-made policymaking processes in place. It is easy to see that a ragpicker is likely to benefit more if she is identified as a regular worker of the economy rather than merely as a citizen of the country. Appropriate use of analytics can certainly help build an empowered society by optimally using available resources.

**References**

Bakshy, E et al (2014): “Big Experiments: Big Data’s Friend for Making Decisions,” Facebook, 3 April, https://www.facebook.com/notes/facebook-data-science/big-experiments-big....

Butler, D (2013): “When Google Got Flu Wrong,” Nature, 13 February, http://www.nature.com/news/when-google-got-flu-wrong.

Christian, B (2012): “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired, 25 April, https://www.wired.com/2012/04/ff_abtesting.

Kohavi, R et al (2012): “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” EXP Platform, http://www.exp-platform.com/Pages/PuzzlingOutcomesExplained.aspx.

Microsoft (2015): “Differential Privacy for Everyone,” Microsoft Corporation.

Sinha, P (2017): “Name Those Responsible for Systems Breach in 2012, SEBI Tells NSE,” Economic Times, 10 January.

Teltumbde, A (2017): “Big Data, Bigger Lies,” Economic & Political Weekly, Vol 52, No 5, pp 10–11.