Analyzing Model Equations

There are a number of steps in analyzing a model based on equations. "Equation" here is used in a broad sense, meaning "a system of cause and effect relationships, where each relationship contains a specific set of variables whose effects are measured quantitatively." Thus the equation-based approach includes techniques like structural equation modeling (SEM).

Build the (theoretical) model first

The theoretical model need not be a set of equations, at least not at first. A list of interacting variables is already a model, because it includes those variables as "important", and excludes all the variables not mentioned as "irrelevant". (This is why "building models with stepwise regression" is fundamentally incoherent, although there are strong statistical arguments against it as well.)

Sometimes you are stuck with an initial model, as when you use a preexisting data set. That's not the end of it, because you can restrict the model by excluding some of the given variables, synthesize new variables from the given ones (as in confirmatory factor analysis), and (if you have the resources) add new variables from other linked data sets or by collecting new data.

Once you have your list of potentially related variables:

  1. Pick a dependent variable.

  2. Make a list of the things (variables) that affect that dependent variable. Some will be measurable, some will not. For variables you can't measure, you can use indicators (indirect measurements), proxies (a variable that you hope "means the same thing" as the variable of interest, i.e., you expect very high correlation), or instruments (variables that predict the unmeasured variable; you then use an estimate of the variable rather than the variable itself).

    Note: "Indicator" and "proxy" are actually the same thing, but the words are used in different contexts, and used in somewhat different ways. We normally use a single proxy instead of the theoretical variable in regression analysis. But in SEM, we use multiple indicators and extract a factor to measure the theoretical variable.

  3. From the point of view of the "actor" in the equation, describe how each variable in the list of explanatory variables should change the dependent variable. Do this quickly, giving accurate explanations for the easy cases and vague notes (or nothing) for the hard ones.

  4. Repeat step 3 several times, each time briefly refining the easier cases, but concentrating on understanding and defining the confusing/complex/subtle/difficult cases.

  5. Note: Make sure you have a consistent interpretation of the variables throughout this process. E.g., which of the values of a binary variable corresponds to the dummy variable value 0, and which to 1?

    This requires some care because in the process of refining the theoretical model you may come to find it more convenient to reverse the assignment of 0 (the "usual" case) and 1 (the "special" or "noteworthy" case).
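The variable inventory built in the steps above can be kept as plain data, which makes the refinement passes easy to track. A minimal sketch in Python, assuming a hypothetical study of store revenue (all variable names, signs, and notes here are illustrative, not taken from any real model):

```python
# Hypothetical theoretical model: dependent variable plus a list of
# explanatory variables, each annotated with measurability, the expected
# sign of its effect, and working notes (steps 1-3 above).
theoretical_model = {
    "dependent": "monthly_revenue",
    "explanatory": {
        # name: (measurable?, expected sign, notes)
        "advertising_spend": (True,  "+", "easy case: direct effect"),
        "local_income":      (True,  "+", "proxy: median income of postal code"),
        "service_quality":   (False, "+", "unmeasured: use survey indicators"),
        "is_franchise":      (True,  "?", "dummy: 0 = independent, 1 = franchise"),
    },
}

# Refinement pass (step 4): collect the cases still unmeasured or with an
# unclear expected sign, to concentrate on next time around.
unclear = [name
           for name, (measurable, sign, _) in theoretical_model["explanatory"].items()
           if sign == "?" or not measurable]
print(unclear)
```

Keeping the notes next to the variables also helps with step 5: the dummy-coding convention is written down once, where every later pass will see it.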

In Ph.D.-level research, almost certainly there will be several equations. Each equation represents one cause-effect relationship, with a dependent variable and (usually) several explanatory variables. In M.S.-level research that's not always true, and the SEM modeling approach somewhat obscures the number of equations involved.

Conduct data analysis

In order to conduct data analysis, you need data. Acquisition of data and preliminary analysis of that data is a complex, deep subject of its own. I will treat it in a separate page.

Different theoretical models use different kinds of analysis. In business and economics, two common approaches are regression analysis and SEM, but each of those has many variants, and there are others (including "modern" approaches associated with "big data" analysis). Whatever you use to obtain estimates of variable effects, the basic questions about whether theoretical relationships are reflected in the data are the same.
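Whatever package you use, the core estimation step is the same: fit the equation and read off the estimated effects. A minimal sketch using ordinary least squares on synthetic data (the coefficients 2.0 and -0.5 are made up for illustration); in practice you would use a statistics package that also reports standard errors and p-values:

```python
import numpy as np

# Synthetic data: y = 1.0 + 2.0*x1 - 0.5*x2 + noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# OLS: design matrix with an intercept column, then least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # estimates of (intercept, effect of x1, effect of x2)
```

The estimates should be close to the true values (1.0, 2.0, -0.5); the questions below are about whether such estimates are significantly nonzero and have the theoretically expected signs.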

  1. For the effect of each variable, a two-tailed test tells you whether it's "nonzero".
  2. Test the hypothesis of no effect vs. the wrong sign (one-tailed test). Failing to reject the null means there is no evidence the coefficient has the wrong sign. (The overall theory is not rejected.)
  3. Test the hypothesis of no effect vs. the right sign (one-tailed test). Rejecting here is weaker evidence than a significant two-tailed test (1) with the right sign.
  4. Compare expected against verified (= statistically significant) results. Expected "strong" effects should be nonzero and have the right sign. Weaker effects should merely not have the wrong sign.
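The tests above all reduce to comparing one test statistic against different tails of its distribution. A hedged sketch, using made-up numbers (a coefficient whose theory predicts a positive sign) and a normal approximation in place of the t distribution, which is fine for large samples:

```python
from statistics import NormalDist

# Made-up estimation results: coefficient estimate and its standard error.
coef, se = 0.42, 0.20
z = coef / se
N = NormalDist()

p_two   = 2 * (1 - N.cdf(abs(z)))  # test 1: effect is nonzero (two-tailed)
p_wrong = N.cdf(z)                 # test 2: no effect vs. wrong (negative) sign
p_right = 1 - N.cdf(z)             # test 3: no effect vs. right (positive) sign

# At the 5% level: nonzero, no evidence of wrong sign, right sign confirmed.
print(p_two < 0.05, p_wrong < 0.05, p_right < 0.05)
```

Note that the one-tailed right-sign p-value is exactly half the two-tailed one, which is why a two-tailed rejection with the right sign is the stronger result.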

What if the theory is wrong?

If some of your important hypothesis tests fail, or your whole model has insufficient explanatory power or significance, you need to decide what to do. In some cases you should reject or temporarily abandon the approach. But frequently you can improve the model in ways suggested by these failures.

Remember that statistically speaking this is a new model, and that testing new models on the same data is statistically suspect (you're taking new draws from the urn, of course you're more likely to get "good results" if you keep trying, even if the outcome is random).

On the other hand, we don't have a statistical theory of how significance and power change as we test variants of a model on the same data. If the parts of the model that stay the same are consistent (coefficients have the same sign, the same order of magnitude, and similar p-values), you can take p-values for new variables and the new equation at face value. Both cases (the estimation results change as you refine the model, or they stay stable) should be reported in your thesis.

Here are some of the standard approaches to statistical analysis of your results:

  1. Compute the power of the test (the probability of rejecting the null hypothesis if the theory were right). This is subtle. The problem is that the theory says "the coefficient is positive", but a power computation needs a specific alternative "coefficient = A". In significance tests, typically A = 0, i.e., the null hypothesis is "no effect", which is simple and precise. Typical power computations take A to be the critical value of the significance test. Power tells you how seriously to take a failed test: a failure despite high power is strong evidence of "wrong theory", while a failure with power near zero means little.
  2. Do regressions, correlation analysis, and/or factor analysis to discover relationships among explanatory variables. This may suggest additional equations or instruments.
  3. Probably there are several endogenous variables. SEM handles this automatically (add more arrows for additional relationships). In regression there are two approaches using more complex models:
    1. Simultaneous equation regression. Explicitly model additional dependent variables.
    2. Instrumental variable regression. "Filter out" random effects ("noise") in (some) explanatory variables. (Related to "two-stage" regression.)
  4. In regression models you can analyze the error process (the differences between predicted values and observed values of the dependent variable).
    1. Look for serial correlation in time series in a graph of errors vs. time. (A similar effect can occur with distance in geographical variables, but these are rare in general business and economic data.)
    2. Look for heteroskedasticity (a tendency for errors to be larger or smaller depending on some variable) in graphs of errors vs. each variable.
    3. When the errors in such a graph appear to be distributed around a curve rather than the x-axis, consider a nonlinear transformation of that variable (commonly adding the variable's square, taking logarithms, or exponentiating the variable).
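The power computation in point 1 can be sketched as follows, with made-up numbers and a normal approximation (the standard error and effect sizes are illustrative, not a general recipe):

```python
from statistics import NormalDist

# One-tailed test of H0: coefficient = 0 against the theory's positive sign.
se, alpha = 0.20, 0.05       # made-up standard error and significance level
N = NormalDist()
z_crit = N.inv_cdf(1 - alpha)  # ~1.645; reject H0 when estimate > z_crit * se

def power(A):
    """P(test rejects H0 | true coefficient = A), normal approximation."""
    return 1 - N.cdf(z_crit - A / se)

# Taking A at the critical value itself (as typical power computations do)
# gives power exactly 0.5. For a small hypothesized effect, power is low,
# so a failed test is then only weak evidence against the theory.
print(round(power(z_crit * se), 2), round(power(0.1), 2))
```

The choice of A is where the subtlety lives: the theory only says "positive", so any specific A is a convention, and the resulting power should be read accordingly.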