Special articles

Some Basic Issues in

# Statistical Modelling in Social Sciences

#### In a discussion of how social scientists use statistical models in their analysis, this paper uses some illustrative examples to highlight the importance of understanding the data generation process, viz., the way data are generated in their natural setting and the way data are selected from that natural setting through some sample selection process. The paper makes constructive suggestions on the importance of exploratory data analysis in improving the credibility of the specification of models. As there is always a likelihood that a model could be wrongly specified, and as omission of relevant variables could adversely affect statistical credibility, such exploratory data analysis would improve statistical models.

T KRISHNA KUMAR

#### I Introduction

Perhaps there will be little dispute if I say that social science research might proceed in the following order:

– Employ the power of the computer and computer software to suitably develop the statistical arsenal of evidence.

Unfortunately I see many researchers going in the reverse gear. They stumble upon data, possibly a new source of data that has been little used before and thus constitutes good material for research. Or they stumble upon new, popular, and sophisticated statistical software that can be used easily with a click of the mouse. They use the data as input to the software, get the results, cull out from the results only those that seem to support the existing view, and write a thesis.

Invariably this latter approach is followed as a research strategy to play the game of pleasing the professor, if one is a student, or pleasing the referee, if one is a post-doctoral researcher. Quite often, in this entire game, neither the data, the statistical methods used, nor the software tools get examined carefully for their relevance and intricacies. The research issue, instead of being the main product of research, has become the byproduct of some vaguely understood statistical methods, least understood software, and little explored data. The research becomes tool-oriented rather than problem-oriented. While I will not dwell on the merits of this approach I shall bring to the fore the advantages of driving the research in the forward gear.

In this article I emphasise the following points:

(i) The need to have a convincing non-technical description of the structure of the social science phenomenon that could have generated the data in its natural setting; (ii) the need to understand how the data actually used are derived from the data that the underlying structure would generate naturally (sample selection issues); and (iii) the need to justify how probability and statistics enter into statistical modelling in social sciences.

While the above considerations apply to all statistical models in social sciences, I limit my discussion, for the sake of brevity, to only a small proportion of a large collection of several possible statistical models and methods used in social sciences. In particular I limit my discussion to some of the basic issues related to one of the most widely used statistical models in social sciences, the regression model. The issues I address are: What variables should one include in data analysis given that he/she needs to examine the statistical evidence in support or against a particular scientific hypothesis?1 Which of these variables are to be explained (endogenous) and which of them to be taken as given (exogenous) and why? What should one do when data on a particular variable are not available or are available with a high degree of error? Should one omit that variable altogether or should one use a proxy for it? How do probability and statistics enter into modelling in social sciences when data used consists of either secondary or primary data collected without any random sampling design?2 How should one specify the model so that it is credible? Which statistical methods should one use to estimate the model parameters? What are maintained hypotheses and testable hypotheses? And what convincing reasons are there to maintain the maintained hypotheses? What is the real credibility of standard methods of statistical inference?

#### II Some Illustrative Examples

Researchers often consult a statistician only at the data analysis stage, under the mistaken impression that statistical analysis deals only with the analysis of data. In fact, statistics enters at the data collection stage itself, as reflected by two major branches of statistics, sample surveys and the design of experiments. There is another reason why the researcher should consult a statistician at the data collection stage. What data one should collect, and how one should specify the statistical model for data analysis and inference given the context of the research study, depend on the research problem. Two major issues that confront us are (i) to justify how probability and statistics become relevant in the analysis of the data collected, where the data were neither collected through any designed statistical experiment nor by any well-designed sample survey, and (ii) to salvage the results when some possibly relevant variables are excluded from the specification of the model, by examining the statistical implication of such omission of variables and through the use of additional information, either on the omitted variable or its proxy.

To give some idea of the problems let me give just a few examples from some of the studies I was exposed to:

Statistical modelling with non-experimental and non-survey data is quite common in social science research. This modelling problem is tackled by assuming that there are some variables, which are outside the scope of statistical modelling and hence are non-stochastic (these are called exogenous variables), while other variables need to be explained or modelled using statistical methods (and these are called endogenous variables). In building a model we start with a variable we are interested most in explaining. In order to explain that variable we realise that it depends on some other variables. So, we write an equation for that dependence. Then we ask the question how the other variables included in this first equation are determined. If we feel that any variable is such that it influences the first variable but is not influenced by it we call it exogenous and take it as nonstochastic. If any of them are significantly dependent on the first variable, i e, it influences the first variable and is also influenced by it we call it an endogenous variable and specify how that variable depends on other variables, and so on. In building models with non-experimental data the basic questions are at what stage we stop this process of including additional variables, and when do we say a variable is exogenous.8

While there is no random sampling involved, it is assumed that these variables are jointly determined, and whatever dependence there is between them is captured by the assumption that all the endogenous variables have a joint conditional probability distribution, given the exogenous variables. Although it is not claimed that different observations on different variables are independent random samples from those variables, repeated observations on all these endogenous variables put together are assumed to be random samples from such a joint conditional probability distribution.

When the data used are time series data it is assumed that the time series constitutes a sample from a time-dependent stochastic process. While the conditional probability distribution should be characterised by a general conditional probability density function, as a first approximation, such a density is assumed to be a normal density with an unknown conditional mean that is normally assumed to be a linear function of the exogenous variables, and with an unknown conditional variance that is most often assumed to be a constant. Thus the problem of statistical modelling is reduced to that of either a single standard multiple regression or a system of standard multiple regressions.

Alternatively, one would assume that these variables have a deterministic relation between them, as suggested by the underlying social science theory. It must be noted that, unlike in the physical and biological sciences where such theory is rigorously built with experimental verification, in the social sciences this is not rigorously established. In fact it is often the purpose of the research to determine the most plausible relationship. Furthermore, the theory only suggests whether a variable has a positive or negative impact on another variable; it has little to say on the mathematical functional form of that relationship. A particular mathematical form is assumed for the deterministic relation.9 As the observed variables do not satisfy any exact relationship it is further assumed that there exists an error in the equation. It is further assumed that this error is a random variable with a known probability distribution (in most cases assumed to be Normal), with mean zero and constant variance.

The following questions arise with many of the regression models: how does one justify the particular specification assumed, and, given that the statistical model is based on assumptions and not on a probabilistic design, such as a random sample or a randomised experiment, how does one justify making those assumptions?

Figure 1: Time Taken by Olympic Winners (in secs); recorded minimum time for the 100 m race by men, plotted against Olympic year.

The answers to these two questions would depend on an understanding of the underlying “Data Generation Process” (DGP).

#### III Intuitive Examples That Justify Focus on Data Generating Process

A considerable amount of research has been undertaken since 1900 on understanding the movement of stock prices over time. In 1900 Louis Bachelier introduced the first model of stock price movement, the so-called random walk model. This model states that the stock price today is equal to the stock price of the previous day plus a random error. Research carried out over a period of eight decades has not provided any significant improvement over this random walk model. It is my contention that this is due to researchers not paying adequate attention to the DGP, the data generating process. Researchers in this area, as in most other areas of research, have a tendency to take the data as given, or take given data and build a statistical model based on them. In this they are often guided by what data are readily and easily available. If researchers ask the question “What is the data generating process that generated the data?” they might observe that the data on hand are perhaps just a part of a larger data set being thrown up by an underlying data generating process. Further, they may find that the variable being modelled is possibly influenced by some omitted variables for which data are either not collected or not available.
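Bachelier's model can be sketched in a few lines of code. The following is a minimal simulation, not a model of any actual stock; the zero drift, the Gaussian error, and the parameter values are assumptions made purely for illustration:

```python
import random

def random_walk(p0, n_days, sigma=1.0, seed=42):
    """Simulate the random walk P_t = P_{t-1} + e_t,
    where e_t is a zero-mean Gaussian error."""
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_days):
        prices.append(prices[-1] + rng.gauss(0.0, sigma))
    return prices

# One year (about 250 trading days) of a hypothetical price path.
path = random_walk(p0=100.0, n_days=250)
```

Under this model the best forecast of tomorrow's price is simply today's price, which is why any systematic improvement over the random walk amounts to evidence about the data generating process.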

One might say that there is an underlying “structure” characterising the phenomenon generating the data. Unless proved otherwise that structure is possibly an integrated structure, and every part of the data may have information on that structure. It is this notion that is central to statistical modelling. To understand fully the underlying structure that generated the “limited data” on hand it may therefore be necessary to understand other data generated by the same data generating process.

In the case of stock price behaviour, for example, there could be two types of modifications that can be attempted.10 First, the random walk model is based on the intuitive assumption that all the information regarding the economic fundamentals behind the company whose stock price is being examined is captured by historical data, and that all of that history can be captured by the stock price of the previous period. But what if the previous day’s stock price was abnormal based on some speculation, and that this fact has been noted over the night and before the opening of the stock market the following day? What about the new information that has just become available after the close of the stock market the previous day? Is not news reported in the financial news magazines early morning that day useful to assess the valuation of the stocks? Does the news in the first page appearing in a bold print with large font have larger effect on the stock price than the news in middle pages with smaller font?11 What about peoples’ expectations that if the stock price is going up one day it may go up the second day also but the probability that it will go down increases if it has gone up consecutively for a few days?12

Secondly, one might postulate that the observed stock price at any point in time is the equilibrium price between the sellers and buyers of the stock. This equilibrium price depends on the number of buyers and sellers, on the bid prices of those who buy the stocks, and on the ask prices of those who sell the stocks. If the maximum bid price is less than the minimum ask price no transaction takes place, and the number of transactions indicates the degree of overlap between the bid and ask price distributions. The day-to-day volatility in stock prices must be related to the speculations made by the bears and bulls. The greater the liquidity in the economy the greater is the demand for stocks, as a part of that liquidity could be diverted to the stock market. Very few studies include in their stock price modelling the volume of stocks traded. Rarer still are studies that consider the distributions of bid prices and ask prices. Now, with the increased use of computers by regulators, bourses, dealers, and buyers and sellers of stocks, there are huge bodies of data waiting to be explored.
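The bid-ask mechanism described above can be made concrete with a toy calculation. The prices below are invented, and the overlap measure used is just one crude possibility (the fraction of bid-ask pairs in which a trade could clear):

```python
def overlap_measure(bids, asks):
    """Fraction of (bid, ask) pairs with bid >= ask, i.e. pairs
    for which a transaction could take place."""
    crossing = sum(1 for b in bids for a in asks if b >= a)
    return crossing / (len(bids) * len(asks))

bids = [99.0, 100.5, 101.0, 98.5]   # hypothetical bid prices
asks = [100.0, 101.5, 102.0, 99.5]  # hypothetical ask prices

# No transaction can occur at all only if max(bid) < min(ask).
trades_possible = max(bids) >= min(asks)
m = overlap_measure(bids, asks)     # 4 of the 16 pairs cross: 0.25
```

A richer treatment would model the full bid and ask distributions, as the paragraph above suggests.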

Suppose we wish to examine the Olympic records data on the 100-metre race for men, and build a model so as to forecast what would be the next record. A typical regression model with a Gaussian error term is inappropriate (Figure 1) as the observed record is the least possible time taken by the participants who enter the race, and only those who qualify to enter the race participate. Given this background on how the records are generated how do we model the observed historical Olympic records? Does the error of the regression model have an extreme value distribution instead of a Gaussian distribution? Is there a sample selection bias? Is it that we observe only the records of those who qualify and there are several other records that we do not observe? Can the unobserved records have information on what we observe? Is a modified Tobit model in which the Gaussian distribution is replaced by an extreme value distribution more appropriate to model these Olympic records?
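The sampling story behind the records (we observe only the minimum time over the qualified runners, never the full field) can be mimicked by simulation. The Gaussian runner-time distribution and its parameters below are invented for illustration:

```python
import random

def simulate_records(n_games, n_runners, mu=10.5, sigma=0.3, seed=1):
    """Each Games' record is the minimum over the field of
    qualified runners; only these minima are observed."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_games):
        times = [rng.gauss(mu, sigma) for _ in range(n_runners)]
        records.append(min(times))
    return records

records = simulate_records(n_games=200, n_runners=50)
mean_record = sum(records) / len(records)
# The records sit well below the mean runner time of 10.5 s, and
# their distribution is skewed, unlike a Gaussian error term.
```

This is why an extreme value distribution for the error, rather than a Gaussian one, is worth entertaining.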

#### IV A Return to the Examples

* Is there wage discrimination in the observed structure of wages between blacks and whites?

The problem first arose as a research investigation by an economics professor to explain the observed differences in the salaries of blacks and whites in the US. He controlled for differences in skill levels, experience, occupation, etc. In a social gathering the professor was explaining this study to a practising lawyer. The lawyer was so excited by the study that he said it would be useful to one of his clients, a black man who had filed a case of discrimination against his employer. Although this example, and another example cited later in the article on the deterrent effect of capital punishment, are taken from my US experience, it may be mentioned that the underlying methodological issues are relevant in India as well. In India we have legal protection against wage discrimination between men and women guaranteed by The Equal Remuneration Act, 1976 (a central Act, 25 of 1976). The legal and statistics professions use the same “Theory of Evidence” for testing hypotheses. Such a theory recognises different degrees of credibility of evidence, and searches for methods of combining evidence from more than one independent source, i e, obtaining corroborative evidence, to strengthen the degree of credibility of the evidence. The legal profession uses terms such as “Beyond reasonable doubt”, “preponderance of probabilistic coincidence”, and “Rarest of rare cases” (as when the Supreme Court of India said that execution as a punishment should be used only in “the rarest of rare cases”). All these are probabilistic statements and their proper interpretation requires the use of statistical concepts, theories, and methods.13

Section 45 of the Indian Evidence Act of 1872 allows such expert testimony as admissible evidence. Even if one maintains that economics as a science is not well developed, expert economic evidence can be used as corroborative evidence to supplement an alternative piece of evidence.14 It may also be said that using economic science thus as expert opinion can hone economic methodology, as such opinion is subject to examination and cross-examination in the courts. This type of expert witness by economists is conspicuous by its near absence from the Indian legal scene.

Returning to the US example, the lawyer pointed out to the economics professor that in order to give credible evidence in his client’s case he should use the salary data of the company and not the general data used in the research study. He also said that the salary data in the company were based on the initial salary offered and the increments given based on subsequent performance. He suggested that one should assume that the employer’s lawyer would argue that the initial salary offered was more than what the employee drew in his earlier job and included an affirmative action component, by offering more than the industry average for his background, and that the employee accepted the job at that salary as it was the best he could get. Hence that component of salary should not enter into any definition and measurement of discrimination. In order to establish a clear case of discrimination by the employer one must examine whether the employee was properly rewarded for his work after he joined, without any discrimination. This discussion is an example of the importance of understanding the data generating process in modelling salary determination before using the model to demonstrate the existence of possible discrimination.15

* Is there a deterrence effect of capital punishment?

One professor of criminology was examining whether capital punishment has any deterrence effect on the act of committing murders. He was improvising on a study undertaken by Isaac Ehrlich (1975). The US Supreme Court used Isaac Ehrlich’s study as professional evidence in favour of retaining capital punishment. Isaac Ehrlich said that an additional execution per year, over the period in question, could have resulted, on average, in seven or eight fewer murders. He had data on which states had capital punishment and which did not. He also had data on the rate of unemployment in the state, the number of black high school dropouts, the state per capita income, etc. He was using a multiple regression with these explanatory variables, the dependent variable being the number of homicides in the state in a year. He used a time series of cross-section data. He also used criminology data on the probability of initial arrest of the offender, and the probability of conviction given that the offender was arrested.

In order to find the appropriate statistical model I wanted to know the underlying process, and asked the researcher: “Are you then assuming that there is a pool of potential murderers and a pool of potential victims in each of the states and when a potential murderer encounters a potential victim at random the murder would take place?” “Is there no one-to-one association between the murderer and the victim?” The researcher said that in quite a few instances there was a one-to-one relation between the victim and the murderer. Clearly, one needs to break up the data into two types of homicides: one for economic gain, where there is no personal and individualised motivation for murder,16 and another where there is a personal association between the murderer and the victim, requiring personal and individual information not present in Isaac Ehrlich’s data.

The deterrence effect of capital punishment refers to the fear of death penalty if one were to commit a murder, get arrested, and get convicted for committing first-degree murder. This effect could then be different for these two types of murders, murder for economic gain and murder motivated by personal hatred. The latter cannot be addressed using aggregate data for the state without knowledge of the individual information on the murderer and the victim. The question then arises if the motivation for murder is so strong that the fear of death penalty may not be sufficient to prevent the murder. The aggregate data for the state is of little use to answer the research question in the second case. I also noted that the number of homicides was assumed to follow a normal distribution in Isaac Ehrlich’s work, as well as other subsequent research. It is reasonable to assume that the probability of a murder occurring in any small period is very small; the probability of observing a murder in one small period is independent of observing another murder in a subsequent period. Given these reasonable assumptions it is more natural to assume that the number of murders in a state in a year follows a Poisson distribution. It is clear that these issues in modelling the deterrent effect of capital punishment arise from how the homicide takes place. It was found that the multiple regression estimates obtained are lower under a Poisson regression compared to a normal regression.17
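The "law of rare events" argument above (many short periods, each with a small, independent chance of an event) can be checked numerically. The per-day probability below is an arbitrary illustrative value:

```python
import random

def yearly_counts(n_years, p_per_day=0.01, days=365, seed=7):
    """Events per year when each day independently produces one
    event with small probability p (Bernoulli thinning)."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(days) if rng.random() < p_per_day)
            for _ in range(n_years)]

counts = yearly_counts(n_years=2000)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# A Poisson variable has equal mean and variance; both should be
# near 365 * 0.01 = 3.65 here. A Normal model, by contrast, would
# also put probability on negative counts.
```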

* How efficient is the budgeting activity/function of a state government?

This was a study undertaken for the National Science Foundation in the US to examine what factors contribute to increasing the productivity of the budgeting and financial control functions in the state governments. It was postulated that, given a set of public expenditures funded by the state government, there are underlying unit-level activities with a least cost of executing each of those activities. Similarly it was postulated that, given the socio-economic and demographic profile of a state, there is a standard per capita requirement of these activities. The inefficiency of a state could be either due to using a wrong mix of public services (allocative inefficiency) or spending more than the minimum for each unit of the public service (technical inefficiency). The question then was: “Given the socio-economic and demographic profile of the state what are the needed public expenditures, assuming that each such state-funded activity is run efficiently?” Let

Qij = fij(Xi; α) ...(1)

where Qij denotes the amount of the jth state-funded activity carried out by state i and Xi is the socio-economic and demographic profile of state i. Let Cij denote the minimum cost of carrying out one unit of activity j in state i. Then the minimum expenditure is

Σj=1..n Cij Qij = Σj=1..n Cij fij(Xi; α) = gi(Xi; α) ...(2)

Actual expenditure = Ei = gi(Xi; α) + vi, where vi ≥ 0 ...(3)

This is a frontier regression of the type described earlier with respect to the Olympic records. The nature of the data and the problem show that there is a data generating process that gives rise to this frontier regression model. As this was a study aimed at improving the efficiency of the budget director’s office in state governments it was imperative that the model be credible. Once the model was developed and estimated, a meeting was arranged with all the budget directors of the states to review the results and offer comments. Naturally, the budget directors of those states that were shown to have lower efficiencies took both defensive and offensive positions, and such debate resulted in improving the model. Differences in their views were narrowed through a Delphi technique and a consensus was developed. The overall credibility of a statistical model is more important than mere statistical credibility.18
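A crude way to estimate a frontier of this kind is "corrected OLS": fit an ordinary regression and then shift the intercept by the most negative residual, so that every observation lies on or above the fitted frontier (all vi ≥ 0). The data below are simulated under assumed parameter values, not taken from the NSF study:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=100)            # state profile (one regressor)
frontier = 2.0 + 3.0 * x                    # minimum expenditure g(x)
v = rng.exponential(scale=2.0, size=100)    # one-sided inefficiency, v >= 0
y = frontier + v                            # actual expenditure

b1, b0 = np.polyfit(x, y, 1)                # OLS through the middle of the cloud

# Corrected OLS: lower the intercept by the most negative residual
# so the fitted line becomes a lower frontier.
resid = y - (b0 + b1 * x)
b0_frontier = b0 + resid.min()
v_hat = y - (b0_frontier + b1 * x)          # estimated inefficiencies, all >= 0
```

More refined alternatives (maximum likelihood stochastic frontiers) exist, but the one-sided residual is the essential feature.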

* How to correct for non-response in the data generated by a large social experiment conducted to observe how poor households respond to different types of housing subsidies?

This issue was based on a multi-million dollar social experiment conducted in the US to see to what extent low-income households who could not afford decent housing would improve the quality of housing services with the help of a government subsidy. The question posed was: “Which among alternative forms of subsidy is more effective in achieving the objective?” A considerable amount of experimental data was used to fit regression models to study the behaviour of the experimental households, some of whom received the subsidy while the control group did not. The data were panel data, a time series of cross-sections. It so happened that the initial sample was random, but some chosen households opted not to participate in the experiment. Similarly, some households who were selected on a preliminary expectation that they would be eligible to receive the subsidy were found to be ineligible and dropped by the programme administrators. Those households who were selected were assigned to treatment and control groups through a random experimental design. During the course of the experiment, over a few years, some of the households dropped out of the experiment at various stages. This meant that the leftover sample is no longer random. The situation is depicted in Figure 2.

Figure 3: Salary plotted against the number of publications in JPE and other journals of that quality.

The question then arose whether the estimated regression models needed any sample selection correction, as the randomly sampled households at the beginning differed considerably in composition from the final sample used for data analysis. There could be systematic factors that explain participation and dropout behaviour, and that required modelling the participation behaviour.

The factors determining participation and dropout could then be used to correct the results obtained from the final sample. Again this is an issue that arose as a result of the data generating process.
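The effect of non-random attrition can be seen in a small simulation. The dropout rule below, in which households with poor outcomes leave more often, is an invented illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
subsidy = rng.integers(0, 2, size=n)                      # 1 = treatment group
quality = 1.0 + 0.5 * subsidy + rng.normal(0, 1, size=n)  # true effect: 0.5

# Attrition depends on the outcome itself: weak outcomes drop out.
stay = (quality + rng.normal(0, 0.5, size=n)) > 0.8

effect_full = quality[subsidy == 1].mean() - quality[subsidy == 0].mean()
effect_left = (quality[stay & (subsidy == 1)].mean()
               - quality[stay & (subsidy == 0)].mean())
# effect_left understates the true effect of 0.5, because censoring
# from below is heavier in the (lower-mean) control group.
```

Modelling the participation and dropout behaviour, as the text suggests, is what allows such bias to be corrected.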

* How good is the claim of a faculty colleague who claimed a $ 500 raise for publishing a paper in the Journal of Political Economy?

This claim was based on a regression estimate obtained by regressing faculty salaries on the number of publications in each of several economics journals, including the Journal of Political Economy (JPE), which had an estimated coefficient of about 500. I asked for the data and did a bit of data exploration. I found that the salary data had two major clusters, one with very high salaries of professors in first-rate universities, and another with only moderate salaries. I then asked myself what could be the mechanism by which such salary differentials occur. It became evident that there could be two types of economics faculty: (i) those who publish in quantitative economics journals as well as in economics journals that are not very quantitative, and (ii) those who do not publish in quantitative economics journals and publish only in economics journals that are not very quantitative. It could also be argued that the economics profession offers a premium in salaries to those professors who publish in quantitative economics journals. I then looked at the list of journals used by the author and found that the list was conspicuous for its omission of the very prominent quantitative economics journal, Econometrica. The situation is depicted in Figure 3 (the numbers used are for illustration only and are hypothetical). I took a sub-sample for which I had data on publications in Econometrica. I then got a regression coefficient of 500-Y (Y>0) for JPE, and a coefficient of 500+Z (Z>0) for papers published in Econometrica (which had a zero coefficient in his model). I gave the following report to the chairman:

(i) The claim by our colleague is not statistically credible as the model specified omitted an important variable. (ii) JPE can contribute on the average only $500-Y (Y>0) to the average salary, and not $500 as claimed by him. (iii) If the chairman was going to fix salary raises on the basis of such a regression model he might give me and another colleague of ours a $500+Z (Z>0) raise, as we got a paper published in Econometrica.19

Figure 4: True structure and samples.

The general implication of this illustrative example must be clear. When a relevant variable is omitted, its effect on the dependent variable is implicitly set to zero. Further, the estimated effect of an included variable then absorbs the contribution of the omitted variable. If the excluded variable is positively related to an included variable, and if both the excluded and the included variables have positive regression coefficients, as in this example, then the effect of the included variable is overstated. This suggests that one should not omit a relevant variable just because data on it are not available. One must then collect the needed data rather than omit the variable. Alternatively, if such data collection is very costly or impossible, one can use a proxy. How good a proxy is depends on the hypothesised or assumed degree of correlation between the true variable and the proxy: the higher this correlation, the closer the results will be to the true regression with the relevant variable.20
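The omitted-variable arithmetic can be verified directly. The coefficients, the use of journal counts as variables, and the correlation structure below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
jpe = rng.poisson(2.0, size=n).astype(float)           # included regressor
econ = 0.5 * jpe + rng.poisson(1.0, size=n)            # omitted, positively related
salary = 400.0 * jpe + 600.0 * econ + rng.normal(0, 100, size=n)

b_short = np.polyfit(jpe, salary, 1)[0]                # salary on jpe alone

X = np.column_stack([np.ones(n), jpe, econ])
b_long = np.linalg.lstsq(X, salary, rcond=None)[0][1]  # jpe coefficient, both included

# b_short is near 400 + 600 * cov(econ, jpe) / var(jpe) = 700:
# the included variable absorbs the omitted variable's contribution.
```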

#### V Specification of the Model and the Importance of Data Exploration

From the examples given above it is clear that there are two aspects of a model specification: the mathematical form of the mean of the conditional probability density function of the endogenous variables given the exogenous variables (the regression function), and the nature of the distribution of the random error term. Most of the examples illustrated the questions of what variables one should use given the data generation process, and what distribution of the error the data generation process implies.

Once we focus on the variables to be included and know what kind of distribution we should use for the error term we are ready to specify the statistical model. We consider the inclusion of k explanatory variables, X1, X2, ..., Xk, in a regression model:

yi = f(X1i, X2i, ..., Xki; α) + ui ...(4)

where ui has a pre-specified distribution, such as Normal with mean zero and constant variance.

The above regression could be a non-linear regression, non-linear either in the k explanatory variables or in the parameters or both. This constitutes what we call the maintained hypotheses, and these are in general neither tested statistically nor verified by other means. When I was teaching statistical inference to the final year students at the National Law School of India University, Bangalore some years ago, one of the students mentioned at the end of the class that the entire lecture on statistical inference was somewhat equivalent to an attorney producing evidence from a witness without inquiring into the process by which the witness was selected. He claimed that the evidence so obtained was not necessarily credible. What if the witness was planted or coerced? That student was right. The overall credibility of statistical evidence depends not only on the statistical evidence we provide under the assumed model but also on the credibility of our assumed model.

It is here that the description of the data generation process and exploratory data analysis play an important role. We must supplement the description of the data generation process, which explains how the data would be generated in a natural setting, with an understanding of the collected sample data. Most of the features of the multiple regression model are reflected in the correlations and partial correlations between variables. Hence the specification of the model can be guessed from data exploration. One may plot the data in various two-dimensional line graphs (X-Y plots) and scatter diagrams between the dependent variable and each of the independent variables. In order to get some insight into the functional form to be used for any independent variable one may examine the scatter; the pattern could reveal any possible evidence of a non-linear relationship between the dependent variable and the independent variable.21 The independent variable may be transformed and the scatter may be plotted between the transformed independent variable and the dependent variable to make sure that with the transformed variable the scatter shows an approximately linear relationship.

The variable that has the maximum correlation (in some functional form) with the dependent variable may turn out to be the most significant explanatory variable. One may then examine the partial correlations between the dependent variable and the other independent variables after controlling for the influence of this first, most significant variable. The variable with the maximum partial correlation can be taken as the second most significant variable, and so on. While this procedure is somewhat like stepwise regression, following it interactively and engaging the data by examining the scatters and correlations enables the researcher to befriend the data, so that the data can be more helpful in research. The researcher can select the variables and their functional forms based on this exploratory data analysis. This process, if followed, will convince both the researcher and his audience that the specification of the model is based on a careful scrutiny of the data. The chosen specification, or maintained hypothesis, will then be credible.22
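The two steps just described can be sketched in Python; the data, variable names and effect sizes below are hypothetical, chosen only to illustrate the procedure:

```python
import numpy as np

def residuals(v, w):
    """Residuals from a simple least-squares regression of v on w."""
    design = np.column_stack([np.ones_like(w), w])
    beta, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ beta

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x3 largely duplicates x1
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)
variables = {"x1": x1, "x2": x2, "x3": x3}

# Step 1: pick the regressor with the maximum |correlation| with y.
simple = {k: np.corrcoef(v, y)[0, 1] for k, v in variables.items()}
first = max(simple, key=lambda k: abs(simple[k]))

# Step 2: partial correlations with y, controlling for the first pick,
# computed by correlating the two sets of regression residuals.
control = variables[first]
partial = {k: np.corrcoef(residuals(v, control), residuals(y, control))[0, 1]
           for k, v in variables.items() if k != first}
second = max(partial, key=lambda k: abs(partial[k]))
print("first:", first, " second:", second)
```

In this construction x3 is correlated with y only through x1, so once x1 is controlled for, its partial correlation collapses and x2 is (correctly) selected next; an automatic stepwise routine would do the same arithmetic, but doing it interactively, alongside the scatters, is what lets the researcher befriend the data.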

#### VI Some Conceptual Considerations on Data Generation Process, Structure, and Sample Information

One needs to understand the underlying structure of the universe, whether in astronomy, physics, chemistry, biology, economics, sociology or political science. Such an understanding is necessary either to understand the existing structure or to intervene to modify the structure into a more preferable form. An understanding of the “structure” requires different elements of “information” about that structure. As a part of that information comes from the data we collect, what data one must collect depends on what aspects of the structure we want to understand. This requires some a priori theorising, or if you wish an a priori conceptualisation, of the data generating process (DGP) that throws up the data. We may not observe all the possible data generated by the underlying data generating process. We may select only the data we think are useful, or only those that are easy to collect, or only those collected by some other data collecting agency. Some of these elements in the selection of data may constrain our knowledge about the structure. This drives home the importance of different types of “information” in understanding the “structure” of the underlying economic model or DGP. Sample information is only one type of information, and one needs to supplement it with some prior information.

Figure 5 (legend: – - – - one structure; ........ another structure; —— samples)

Statistical inference is based on a correspondence between the structure that generates the data and the sample data. There are the following possibilities:

– There is only one structure that could have generated the data, but there are different samples showing substantial variation between them. (This is due to sampling variation; in very large samples such variation could be minimised and one could approximate the underlying structure quite well.)

– There could be several structures that could have generated the data, and hence there is no one-to-one correspondence between the underlying structure and the sample even in large samples (this is the problem of lack of identification of the structure). In this case the inference cannot be made properly from the data alone, as the data could have come from any one of the alternative structures. One needs to select one of the alternative structures, or identify the structure, from extra-sample or prior information.
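A minimal numerical illustration of the identification problem, under a deliberately artificial model: if two parameters enter the data generating process only through their sum, two quite different structures imply exactly the same distribution of observables, and no sample size can distinguish between them:

```python
import numpy as np

# Two different "structures" that share the same sum of parameters.
structure_1 = (1.0, 2.0)   # a = 1.0, b = 2.0
structure_2 = (2.5, 0.5)   # a = 2.5, b = 0.5

def simulate(a, b, rng, n=100_000):
    # The observable depends on (a, b) only through a + b, so the two
    # structures generate identically distributed data: the likelihood
    # surface is flat along the line a + b = constant.
    return (a + b) + rng.normal(size=n)

s1 = simulate(*structure_1, np.random.default_rng(3))
s2 = simulate(*structure_2, np.random.default_rng(4))

# Both very large samples pin down a + b = 3 precisely, yet a and b
# separately cannot be recovered from any amount of such data.
print(s1.mean(), s2.mean())
```

Only prior or extra-sample information (say, a known value of a, or a restriction a = b) can separate the two structures; the data alone never will.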

The basic elements of the theory of statistical inference are:

infinite sample. This is called the problem of “just identification” in econometrics.23 (These two steps constitute the maintained hypotheses.)

All these steps constitute what one might call the scientific credibility of statistical inference. But one needs to remember that this entire scientific edifice is built on the assumption that the maintained hypotheses are true. What happens to this scientific credibility if the maintained hypotheses are not true? In other words, the most important question to ask in order to establish an overall scientific credibility for statistical analysis is: “What is the scientific credibility of the maintained hypotheses themselves?” This credibility has to be achieved through the principles of advocacy: through the reasonableness of the assumptions made, or on ontological considerations. This reasonableness or realism can be achieved (i) by describing an intuitively acceptable data generating process, a realistic description of the phenomenon that generates the data, and (ii) through a selection of the specification based on exploratory data analysis, as indicated in the previous section.

Functions of sample observations have information on the unknown structural parameters, which characterise the structure that should have generated the data. Hence, it must be evident that if one errs by ignoring some variables that play a crucial role in generating the data then the information content of the data actually used could be less than sufficient to understand the structure. The classical approach described above assumed that we know the sample space with certainty, and it also assumed that there is a “known” functional form of the probability density with some unknown parameters. These constitute the prior information and the maintained hypotheses.

Some interesting statistical issues arise the moment we look at the problem from this angle of “information” and “structure”. First, if there is an underlying structure, what is the sample information we need to collect to understand that structure? If there are different sets of possible sample information, one must ask: “Which of them has the maximum possible ‘information content’ on the underlying ‘structure’?” This question can be answered well if we describe the underlying data generation process. It must be noted that the statistical model of the underlying structure depends on what we use as the observed data. The tendency among many researchers is to take the data as given and look for the DGP of those data. Instead, the DGP should be the primary concern, and based on it one should decide what data could throw light on the underlying structure.

As there can be several non-testable assumptions regarding what could be the true model, there is an element of subjectivism in the choice of theories or models, particularly so if more than one theory can explain the observed data. We must therefore ask: “What happens to the credibility of our statistical inference procedures if we are uncertain about the maintained hypotheses, i e, if the true model is something other than what we assumed?” Do the statistics that are sufficient for the parameters of the structure continue to be so under the new, enlarged model of which the assumed model is a special case? Can one define a new and useful concept of “approximate sufficiency” of a statistic under a perturbation of the model? Can one think of models and estimates that are distribution-free or robust?24 Hence what estimation method one should use depends on the degree of credibility of the maintained hypothesis. If it is credible, one can use whatever method is best for that model. If there is some ambiguity or doubt about the maintained hypothesis, one may use distribution-free or robust methods of estimation and hypothesis testing. Alternatively, one may assign different prior weights to the alternative structures that are possible and use Bayesian methods of inference.25
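The contrast between an estimator that is optimal under the assumed model and a robust, distribution-free alternative can be illustrated with a small, hypothetical contaminated sample:

```python
import numpy as np

rng = np.random.default_rng(5)

# Data mostly from the assumed model (normal, centred at 10), but the
# maintained hypothesis fails for a few observations: 5% gross outliers.
clean = rng.normal(loc=10.0, scale=1.0, size=95)
outliers = rng.normal(loc=100.0, scale=1.0, size=5)
sample = np.concatenate([clean, outliers])

# The sample mean, optimal if the normal model were exactly true, is
# badly distorted by the contamination; the median, a distribution-free
# estimator of the centre, barely moves.
print("mean  :", sample.mean())
print("median:", np.median(sample))
```

The general point stands for estimation methods broadly: the better one's grounds for the maintained hypothesis, the more one can lean on model-optimal methods; the weaker those grounds, the stronger the case for robust or distribution-free procedures.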

Email: tkkumar@gmail.com

#### Notes

[Based on Professor R Gururaja Rao Memorial lecture, Department of Statistics, Bangalore University, April 19, 2007. I thank EPW for suggesting a discussion on this issue. I also thank K L Krishna, A Vaidyanathan, A L Nagar, Malay Bhattacharyya, N Krishnaji and Chandan Mukherjee for their comments in the Open Review section of the EPW web site. The usual caveat applies and I alone am responsible for any errors and omissions. This paper is aimed at social scientists who apply statistical methods, and particularly to the readers of EPW. It is not addressed to either statisticians or econometricians, and hence its scope is limited, the style is less technical and more conceptual, and it does not cover several other related topics in methods and methodology.]

1 When one is dealing with an underlying dynamic phenomenon how many lags of a variable one must include is a special case of this.

2 Here is one example where a doctoral student used official data on municipal administration of different municipal bodies. He carried out a chi-square test for the independence of two administrative functions. I asked him what was the meaning of the statistical hypothesis of independence when he did not have any random sample. I also asked him how he would justify the chi-square distribution in his case while the original chi-square distribution was derived under the assumption of a random sample with the independence assumption. He had no answers but successfully completed his studies with the chi-square statistics, statistical inferences, and all that!

3 Gwartney (1970).

4 Ehrlich (1975), Bowers and Pierce (1975), Kumar and Shih (1978). For a more recent discussion of this issue one may see Grogger (1990), and http://www.deathpenaltyinfo.org/FaganTestimony.pdf.

5 Kumar and Merrill (1977).

6 Kennedy, Kumar and Weisbrod (1977) and Kennedy (1980).

7 This question was not hypothetical. Tuckman got his paper accepted by JPE and went to the chairman with such a request. See Tuckman and Leahey (1975).

8 Hendry and Mizon (1993) recommend a general-to-specific approach, suggesting that one must start with a more inclusive model and then reduce its scope. This approach reduces the risk of omitting a relevant variable. See Ericsson and Irons (1995) on statistical tests to decide whether a variable is exogenous or not.

9 Invariably, most researchers assume that the variables appear linearly in the regression model. Most economists extend the regression model to include variables in their logarithmic form. They seldom ask whether the data suggest some other functional form for the variables.

10 It is not my intention to review the work on stock market behaviour, the literature on which is quite extensive. Instead, it is my intention to take such a popular example and demonstrate how an examination of DGP could suggest valid generalisations of the model that could be very useful in advancing our knowledge on the stock market behaviour.

11 The editors of such business magazines have an intuitive knowledge of importance of the news and reflect that knowledge through their choice of positioning the news.

12 It is this aspect, the so-called volatility clustering, which Robert Engle, the economics Nobel laureate (2003), incorporated in stock price modelling, improving on the simple random walk model.

13 See Koehler (2002).

14 A similar position was taken regarding the science of identification of footprints. See Mohd Aman vs State of Rajasthan (1997) 4, Supreme Court 635.

15 This is the beginning of a series of such cases using econometric analysis, and the beginning of the ERS Group. Currently, the ERS Group is a large economics consultancy group headquartered in Tallahassee, Florida. Charles and Joan Haworth started this company in the early 1970s with the help of James Gwartney, all members of the economics faculty at Florida State University. Marketability of social science research requires this kind of overall credibility of research and not what one might erroneously think as credibility in a narrow statistical sense. More will be said on this later.

16 Murder in this case occurs mainly accidentally, either in self-defence or to eliminate a potential eyewitness for the crime.

17 Kumar and Shih (1978).

18 The Delphi technique was meant to elicit opinions and to narrow the differences to arrive at a consensus. These opinions pertain to the substantive issues of the budgeting functions and budgeting process. They did not pertain to the statistical methods used. If a model is properly specified and its assumptions are justified scientific or statistical credibility and overall credibility become one and the same.

19 See Gapinski and Kumar (1972), published in 1973 with a year lag. Perhaps based on this comment of mine the author had extended the study a couple of years later to other disciplines and concluded that an article in social sciences was worth $ 428, an article in mathematical sciences and engineering was worth $ 1040, and one in physical sciences was worth $ 1246. See Tuckman et al (1977).

20 See Kumar (1992).

21 Even those researchers who are uncomfortable with mathematical functions can do this easily with the help of EXCEL spreadsheets. One can enter variable X in column 1, with the numbers 1 to 100. In columns 2, 3, 4, 5, 6, etc, one can enter Y in different functional forms using different mathematical functions from the toolbar (fx → Math & Trig) of EXCEL. One can then plot the graph between X and Y for the 100 discrete points to generate a template of graphs of different forms. This template can be used to guess the underlying function depicted by an observed scatter diagram between X and Y. When there is an ambiguity between two or more functions one can try all of them.
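A Python analogue of this spreadsheet template, for readers who prefer it, might look as follows; the particular list of functional forms is only a suggestion:

```python
import numpy as np

# Analogue of the EXCEL template described above: x runs from 1 to 100,
# and each entry of the dictionary is a column of Y under a different
# functional form.
x = np.arange(1, 101, dtype=float)
template = {
    "linear":      x,
    "quadratic":   x ** 2,
    "log":         np.log(x),
    "sqrt":        np.sqrt(x),
    "reciprocal":  1.0 / x,
    "exponential": np.exp(x / 25.0),
}

# Plotting each column against x (e.g. with matplotlib) gives the
# template of curve shapes against which an observed scatter diagram
# can be compared.
for name, y in template.items():
    print(f"{name:11s} y(1) = {y[0]:8.3f}   y(100) = {y[-1]:12.3f}")
```

As with the spreadsheet version, the template is only an aid to guessing; when two or more shapes are plausible one can try all of them against the data.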

22 Any one interested in an empirical illustration of this exploratory data analysis using SPSS may contact the author by email.

23 The question of identification is applicable even with just one regression equation, and it arises if the likelihood surface becomes flat with respect to any parameter even with an infinite sample. In that case more than one structure could have generated the sample. Manski (2003) has rightly argued that what one needs is not perfect identification (a unique value for each parameter) achieved with model restrictions that are hard to justify. He suggests, instead, partial identifiability with less restrictive assumptions that are more convincing. These restrictions may narrow the parameter subspace implied by the sample evidence, though not necessarily to a single point, as in the just-identification case.

24 Le Cam (1964), Lindsay (1994).

25 This method has become increasingly popular in recent years due to the Markov Chain Monte Carlo (MCMC) technique, which allows any prior, unlike in the earlier days when a prior had to be either a diffuse prior or a conjugate prior. The technique has become more accessible to researchers through a freeware package called “WinBUGS”.
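A minimal sketch of the idea (not of WinBUGS itself): a random-walk Metropolis sampler, the simplest MCMC algorithm, for the mean of a normal sample under a non-conjugate Laplace prior; the data and all settings are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: a normal sample whose mean mu we want to infer.
data = rng.normal(loc=2.0, scale=1.0, size=50)

def log_posterior(mu):
    # A Laplace prior on mu, which has no conjugate form -- precisely
    # the kind of prior MCMC handles without any special effort.
    log_prior = -abs(mu)
    log_lik = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_lik

# Random-walk Metropolis: propose a nearby value, accept it with
# probability min(1, posterior ratio), otherwise keep the current value.
draws, mu = [], 0.0
for _ in range(20_000):
    proposal = mu + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    draws.append(mu)

posterior = np.array(draws[5_000:])   # discard burn-in draws
print("posterior mean of mu:", posterior.mean())
```

With fifty observations the likelihood dominates this weak prior, so the posterior mean sits close to the sample mean; the same sampler works unchanged if the prior is replaced by any other density one can evaluate.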

#### References

Bowers, William J and Glenn Pierce (1975): ‘The Illusion of Deterrence in Isaac Ehrlich’s Research on Capital Punishment’, Yale Law Journal, 85, pp 187-208.

Ehrlich, Isaac (1975): ‘The Deterrence Effect of Capital Punishment: A Question of Life and Death’, American Economic Review, Vol 65, pp 397-417.

Ericsson, Neil R and John S Irons (1995): Testing Exogeneity, Oxford University Press.

Gapinski, J H and T K Kumar (1972): ‘A Pearsonian Curve-Fitting Algorithm’, Econometrica, Vol 40, No 5, September, p 963.

Grogger, Jeffrey (1990): ‘The Deterrent Effect of Capital Punishment: An Analysis of Daily Homicide Counts’, Journal of American Statistical Association, Vol 85.

Gwartney, James D (1970): ‘Changes in the Nonwhite/White Income Ratio – 1939-67’, The American Economic Review, Vol 60, No 5, December, pp 872-83.

Hendry, D F and G E Mizon (1993): ‘Evaluating Dynamic Econometric Models by Encompassing the VAR’ in P C B Phillips (ed), Models, Methods and Applications of Econometrics, Basil Blackwell, Oxford, pp 272-300.

Kennedy, Stephen D (1980): Final Report of the Housing Allowance Demand Experiment, Abt Associates, Cambridge, MA, June.

Kennedy, Stephen D, T Krishna Kumar and Glen Weisbrod (1977): ‘Draft Report on Participation under a Housing Gap Form of Housing Allowance’, Abt Associates Inc.

Koehler, J J (2002): ‘When Do Courts Think Base Rate Statistics Are Relevant?’, Jurimetrics Journal, Vol 42, pp 373-402.

Kumar, T K (1992): ‘A Note on Proxy Variables in Regression’, Journal of Quantitative Economics, Vol 8, No 2, July, pp 447-48.

Kumar, T K and Peter Merrill (1977): Productivity Measurement in the Budget and Management Control Function, report submitted to the National Science Foundation under the Research Grant APR 75-20, Abt Associates, Inc, Cambridge, Massachusetts, June.

Kumar, T K and Wen Fu P Shih (1978): ‘An Application of Multiple Regression Model of a Poisson Process to the Murder Supply Equation’, The Proceedings of the Business and Economic Statistics Section of the American Statistical Association, August, pp 715-19.

Le Cam, L (1964): ‘Sufficiency and Approximate Sufficiency’, The Annals of Mathematical Statistics, Vol 35, No 4, December, pp 1419-55.

Lindsay, Bruce G (1994): ‘Efficiency versus Robustness: The Case of Minimum Hellinger Distance and Related Methods’, Annals of Statistics, Vol 22, No 2, pp 1061-1114.

Manski, Charles (2003): Partial Identification of Probability Distributions, Springer-Verlag.

Tuckman, Howard P and Jack Leahey (1975): ‘How Much Is an Article Worth?’, Journal of Political Economy, Vol 83, October, pp 951-67.

Tuckman, Howard P, James H Gapinski and Robert P Hagemann (1977): ‘Faculty Skills and the Salary Structure in Academe: A Market Perspective’, The American Economic Review, Vol 67, No 4, September, pp 692-702.
