Social Media Text Mining

author:Stephen J. Turnbull
organization:Faculty of Engineering, Information, and Systems at the University of Tsukuba
contact:Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp>
date:September 10, 2019
copyright:2018, Stephen J. Turnbull
topic:STEM, social networks, social media, text mining


Statistical Inference

People build models. That’s what we do with our brains when we think (and know that we are thinking). That’s how we think. When you interpret another’s behavior as indicating an intention, you do that via a model. When you utter an excuse for your own behavior, you are trying to induce another to embrace a particular model of you. Of course we have some very reliable models. Objects (at least those of a certain compactness and great enough density) fall at a given rate when dropped (and we have a good model, air resistance, of why the exceptions are what they are, and even how deviant the exceptional behavior is). Yet these are still models; we don’t know why they are true (for example, why does gravitational acceleration follow an exact inverse square law, rather than the distance raised to the power of e or 1.9999?). In any case, except for a few “toy” situations, such as games like chess or go, every explanation involves partial models, models that include only the most important aspects of the phenomenon to be explained. And there’s always the potential that our models are terribly inaccurate (a first course in physics trying to explain a falling feather), or worse, irrelevant (telling time by a stopped clock -- it’s right twice a day!).

Different models are a typical source of drama, both in the movies and real life. Many situations involving jealousy and infidelity derive their drama from the fact that the jealous person assumes their partner is unfaithful when they see them kiss someone on the cheek, but the partner can't understand why the jealous person is simmering with anger, since the person they kissed was a sibling. In social situations, there are multiple levels of modeling. The jealous person had the wrong model of the partner's behavior, while the partner has no good model of the jealous person's model of the partner's acts. In other words, in social situations, we may need arbitrary levels of “I believe that you believe that I believe that ...”!

Traditionally in science (including social science), we consciously build models, and record them according to certain rules. These are called formal models. There are various generic rules, such as those of logic and mathematics. The famous physics equations s = ½at² and E = mc² are mathematical models. The economist's law of one price in a market is more a definition of “market” than a law about prices, but it is based on a model of trading interaction that suggests that certain related transactions will take place at approximately the same price. This model is somewhat fuzzy: the definitions of “related” and “approximately” in the last sentence are quite tricky to implement in a study of real consumer behavior!

Each science has its own special rules (that’s why they are different sciences). Here we will not be building formal models (unless you want to!), but you do need to be aware of models, and you will be required to record your models. Why “models” with an “s”? Because it’s often useful to think of an evolving model as a series of models. Sometimes an experiment tells you your model is wrong. Then you need to backtrack to an older model that may be inaccurate, but at least you don’t know it’s wrong! Worse, the experiment may tell you you still don’t know whether your model is right or wrong -- then you need to extend your model to explain why the experiment failed (to give an answer), and help you design an experiment to confirm your model.

“Model” and “theory” mean about the same thing. However, “theory” emphasizes how the explanations fit together logically, while “model” emphasizes the fit of the explanation to reality.

You know about statistics: averages and hensachi (standardized scores), percentiles and pie graphs. Their simplest use is to summarize a body of data. Examples:

  1. Gaussian (normal) distribution characterizes as many observations as you like with two numbers: the mean and the standard deviation.
  2. Uniform distribution: again, arbitrary number of observations, characterized by minimum and maximum of support.

Note the approximation here: a finite number of points (observations) cannot be a smooth curve, as the Gaussian and uniform densities are. Especially for prediction and risk management, such a descriptive, approximate summary may be all you need.
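To make the contrast concrete, here is a minimal Python sketch of the two descriptive summaries above; the observations are made-up numbers, used only for illustration:

  # Descriptive summaries of a small, made-up sample.
  from statistics import mean, stdev

  observations = [2.3, 1.9, 2.7, 3.1, 2.0, 2.6, 2.2, 2.9]

  # Gaussian summary: keep only the mean and the standard deviation.
  gaussian_summary = (mean(observations), stdev(observations))

  # Uniform summary: keep only the minimum and maximum of the support.
  uniform_summary = (min(observations), max(observations))

  print("Gaussian (mean, sd):", gaussian_summary)
  print("Uniform (min, max):", uniform_summary)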

In statistical inference for scientific research, we go beyond such direct use of statistics to try to uncover latent (unobserved) variables involved in the phenomena we are studying.

Classical Inference

In classical hypothesis testing, we pick one hypothesis, called the null hypothesis, which we use as the standard of comparison. Typically we describe the null hypothesis as a negative: an explanatory variable has no effect on a dependent variable, two distributions are not different, a parameter is equal to zero. Then we make a set of assumptions about the data generating process (the DGP) such that we can transform the data into independent and identically distributed random variables. This set of assumptions is the formal (i.e., mathematical) statement of the hypothesis. Next we look at the average of the series, which in most cases the Central Limit Theorem allows us to assume is normally distributed. Finally, we treat the average as a realization of the DGP, and calculate the probability that the assumed DGP would produce the observed data. If that probability is “small enough”, we reject the null hypothesis.
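As a sketch of the whole procedure, consider the textbook coin-flip DGP. The Python snippet below (the sample counts are hypothetical) tests the null hypothesis “the coin is fair” by treating the observed fraction of heads as approximately normal, as the Central Limit Theorem suggests for a large sample:

  from math import erfc, sqrt

  n, heads = 1000, 540              # hypothetical data
  p_hat = heads / n                 # observed fraction of heads

  # Under the null hypothesis the mean is 0.5 and the standard error
  # of the fraction of heads is sqrt(0.5 * 0.5 / n).
  z = (p_hat - 0.5) / sqrt(0.25 / n)

  # Two-sided tail probability of the standard normal distribution.
  p_value = erfc(abs(z) / sqrt(2))

  print(f"z = {z:.2f}, p-value = {p_value:.4f}")
  if p_value < 0.05:
      print("reject the null hypothesis that the coin is fair")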

In textbook calculation problems, the DGP is very simple: flipping a coin, rolling a die or two, drawing cards from a deck, measuring the height of a person selected from a group. But even in these cases, the formal description of the DGP is not entirely trivial: when we assume that the probabilities of heads and tails are the same, we are implicitly assuming that the coin is not weighted in some way. The verbal description of the hypothesis would then be that “the coin is fair.” But suppose we are analyzing a market, trying to find the supply and demand relations. Then our statistical problem may be to determine the elasticity of the demand curve, testing whether it is closer to zero than to -1: if our estimate of the elasticity is “too far” from -1, we reject the null hypothesis. To conduct the test, however, we not only have to assume that the elasticity of demand is -1, we also need to make some additional assumptions, such as the assumption that the consumer actually pays attention to price in the relevant range (i.e., the price levels normally observed in the market). In economic theory, this latter assumption seems very plausible, so we ignore the possibility that price is unrelated to consumer decision-making. We say that the assumption that consumers care about price is a maintained assumption, and talk about our statistical calculation as if the only assumption that matters is that the elasticity is -1.
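The following sketch shows what such an elasticity test might look like in Python. The data are simulated (so the “true” elasticity is known to be -1), and the ordinary least squares regression of log quantity on log price is only one of several ways the elasticity might be estimated:

  import numpy as np

  # Simulated market data: the true price elasticity of demand is -1.
  rng = np.random.default_rng(0)
  log_p = rng.uniform(0.0, 1.0, size=200)
  log_q = 5.0 - 1.0 * log_p + rng.normal(0.0, 0.2, size=200)

  # OLS of log quantity on log price; the slope estimates the elasticity.
  X = np.column_stack([np.ones_like(log_p), log_p])
  beta, *_ = np.linalg.lstsq(X, log_q, rcond=None)
  resid = log_q - X @ beta
  sigma2 = resid @ resid / (len(log_q) - 2)
  se_slope = np.sqrt((sigma2 * np.linalg.inv(X.T @ X))[1, 1])

  # t statistic for the null hypothesis "elasticity = -1".
  t = (beta[1] - (-1.0)) / se_slope
  print(f"estimated elasticity {beta[1]:.3f}, t statistic vs -1: {t:.2f}")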

When we are working with a well-established theory, with much empirical research reported that is consistent with all aspects of the theory, not much is lost if you forget about the maintained assumption that the theory is correct. But when working in a new field, with an unconfirmed or partly confirmed theory, it's very important to remain aware of the maintained assumptions, which ones are well-confirmed and which are not, and to recognize that a negative result may be a rejection of some aspect of the theory as well as of the particular hypothesis your research is designed to study.

This approach has a problem of interpretation, however. If we reject the null hypothesis, all the probability computations in the test go out the window. Only the rejection itself remains.

The Bayes Factor

Bayesian statistics uses an alternative approach to probability. The calculations are the same, but the interpretation of hypotheses is different. Instead of probability being based on a specific model, a range of models are considered to be possible, each with some probability. Then as evidence is accumulated, the models' probabilities are adjusted according to Bayes' Law. If you have studied probability at all, you've surely heard of it, and if you've investigated the “algorithms” that have become famous as part of Google search, GMail, and Facebook and other social media, you may have seen the adjective “Bayesian”.

Bayes' Law in its simplest form is a pure calculation using probabilities, and may be written:

P[B | A] = (P[A | B] P[B])/(P[A])

The P stands for “probability,” and the notation P[B | A] is read “the probability of B given A.” A and B denote events, which are things that happen.

A simple example should help you understand how this works. Suppose we have a class of 60 students. 40 are male, 20 are female. Of the men, 25 are first-year students, the rest are second-year students. Of the women, 15 are first-year students, the rest second-year students. From this little story, we can define two events concerning a student drawn randomly from the class:

A = the student is female

B = the student is a first-year student

We can compute the following probabilities based on the number of students in each group:

P[A] = (20)/(60) = (1)/(3)
P[B] = (40)/(60) = (2)/(3)
P[B | A] = (15)/(20) = (3)/(4)
P[A | B] = (15)/(40) = (3)/(8)

Exercise: Check that you understand where all the numbers in the above calculations come from. The numbers related to women are taken directly from the story, while the numbers about first-year students may require some addition and subtraction. Then check that Bayes' Law holds.
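If you want to check your arithmetic, here is a small Python sketch of the same calculation, using exact fractions; the counts are the ones given in the story above:

  from fractions import Fraction

  total = 60
  women, first_year_women = 20, 15
  first_year = 25 + first_year_women        # first-year men plus first-year women

  P_A = Fraction(women, total)              # P[A]: the student is female
  P_B = Fraction(first_year, total)         # P[B]: the student is a first-year student
  P_B_given_A = Fraction(first_year_women, women)
  P_A_given_B = Fraction(first_year_women, first_year)

  # Bayes' Law: P[B | A] = P[A | B] * P[B] / P[A]
  assert P_B_given_A == P_A_given_B * P_B / P_A
  print(P_A, P_B, P_B_given_A, P_A_given_B)  # 1/3 2/3 3/4 3/8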

Note that about the only thing that events A and B have in common is that they might describe students. The classical approach to statistics is best adapted to choosing among models that can be characterized by a set of numerical parameters, and as we saw above, to rejecting a particular model (or not). The Bayesian approach is not as limited in principle, although in practice the difference is not so great. Until fast computers became common in the late 1990s, it was impractical for most researchers, and especially students, to take advantage of Bayesian methods in many ways.

The basic idea behind Bayesian methods is the odds ratio. If you need to bet on one of two things (with the same cost and reward for either), one way to express the choice is to form the odds ratio P[A]/P[B]. If the odds ratio is greater than 1, choose A, if less than 1, choose B (if exactly 1, it doesn't matter). Note that this method works to compare the two events, even if they are not mutually exclusive, or if P[A] + P[B] ≠ 1. However, although in common usage, such as betting on horse races, the odds can be converted to a probability, this requires that the odds be comparing events that are mutually exclusive and exhaustive (P[A] + P[B] = 1).
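A tiny Python sketch of the betting rule (the events and probabilities are invented for illustration, and are deliberately not mutually exclusive):

  P_A = 0.6   # e.g., it rains tomorrow
  P_B = 0.5   # e.g., the train is late tomorrow

  odds_ratio = P_A / P_B
  if odds_ratio > 1:
      bet = "A"
  elif odds_ratio < 1:
      bet = "B"
  else:
      bet = "either"
  print(f"odds ratio {odds_ratio:.2f}: bet on {bet}")

  # Converting odds to a probability is only valid for mutually exclusive,
  # exhaustive events: odds of a:b correspond to probability a / (a + b).
  a, b = 3, 1
  print("3:1 odds correspond to probability", a / (a + b))   # 0.75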

But what happens if you get new information about the events? Should you change your bet? A variant of Bayes' Law for odds allows the user to update the odds. Bringing this back to statistics, let's use this notation (borrowed from the textbook by Hastie, Tibshirani, and Friedman):

Ma = the event that one particular model is true

Mb = the event that another particular model is true

Z = the event that when taking new data, exactly the values Z are observed

and the rule:

(P[Ma | Z])/(P[Mb | Z]) = (P[Ma])/(P[Mb]) × (P[Z | Ma])/(P[Z | Mb])

Exercise: Compare the “equation” of just the numerators to Bayes' Law, and consider the “missing factor”. Do the same for just the denominators, and show that the factor isn't missing: it cancelled out.

You may notice the prior odds (P[Ma])/(P[Mb]), and wonder where that comes from. That's a good question. We don't have a good answer. Frequently we assume an uninformative prior where the odds are 1:1 -- it doesn't matter which you choose. Other times, especially when there is a range of models, we may choose an informative prior, which biases the result, but intentionally so. Nevertheless, however you choose the prior, the Bayes factor

(P[Z | Ma])/(P[Z | Mb])

is determined entirely by the models and the data. It does not depend on the prior. (The Bayes factor is also called the likelihood ratio.)
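A toy sketch of the update rule in Python: the two models are hypothetical coins, Ma with P(heads) = 0.5 and Mb with P(heads) = 0.7, and the data Z are 7 heads in 10 flips (all numbers are invented for illustration):

  from math import comb

  def likelihood(p, heads, flips):
      """P[Z | model]: probability of the observed flips when P(heads) = p."""
      return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

  heads, flips = 7, 10
  bayes_factor = likelihood(0.5, heads, flips) / likelihood(0.7, heads, flips)

  prior_odds = 1.0                            # uninformative prior: 1:1
  posterior_odds = prior_odds * bayes_factor  # the rule above

  print(f"Bayes factor (Ma vs Mb): {bayes_factor:.3f}")
  print(f"posterior odds: {posterior_odds:.3f}")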
