Research Tools

You will use a number of tools in the process of performing and reporting your research. Some of these tools are directly used in the research, and others are "administrative," helping you to maintain discipline and work effectively in a long-term project.

Doing Research

These tools are the ones used in actually doing research.

Mathematics

The best-known tools for mathematical calculations and analysis are Excel (and other spreadsheet programs) and Mathematica. Other programs you might hear about are Matlab and Gauss. I believe all of these are available in the Shako computer lab, and you may be able to use them under the University's site license. (All are expensive, and you must not use pirated copies.)

Where possible I personally use open source software, including the OpenOffice/LibreOffice spreadsheet and Maxima (actually the "parent" of Mathematica).

I won't say more about analytical tools at this time since I don't think any of you will be doing mathematical calculations or analysis (other than statistics, which has its own set of tools), but they can be very useful if you do.

Statistics and Statistical Software Packages

The subject of statistics is itself a tool. It is rather different from other branches of mathematics in two ways: what are called "variables" in algebraic equations turn out to be data (i.e., constants) in statistical equations, and rather than finding a unique solution based on "just enough" accurate data, statistics is mostly about compromising among a large number of possible solutions based on large amounts of inaccurate data.

You need to know the following things about statistical analysis.

  1. Predictive statistics cannot be model-free. Consider this example data: {(x,y) : (0,0), (0,1), (1,0), (1,1)}. If you plot the graph, you get a square. You might think it's impossible to estimate a linear regression (i.e., the cross-product matrix is singular), but you would be wrong. In fact, the linear regression model y = a + bx + e (where e is the disturbance term) is well-defined, and the regression estimates are a = 1/2, b = 0. However, the linear regression model x = c + dy + f (where f is the disturbance term) is also well-defined and (as you should guess) the estimated coefficients are c = 1/2, d = 0. But the meaning of the two models is completely different! (If instead of minimizing the sum of squared prediction errors you minimize the sum of squared distances to the line, you find that all lines passing through (1/2,1/2) achieve the minimum!)

    It turns out that your choice of model matters. (And as you probably know from factor analysis, there are an infinite number of ways to "rotate" the factors. This corresponds to the "model-free" problem of minimizing the sum of squared distances to the line.) A short numerical check of this example appears after this list.

  2. There are two important questions in defining a statistical hypothesis. First, what is the null hypothesis? Second, how do you interpret rejection of the null hypothesis? That is, does rejection mean you accept your theory (e.g., "education affects income"), or does it mean you reject your theory (e.g., "price of a stock is a moving average of past prices")?

    "Reject null means accept theory" is a strong statement in conventional tests (since P(data | no effect) < .05 at the 5% significance level). But "accept null means accept theory" is much weaker, because any deviation from theory "close" to the theory would also be unlikely to reject. This is called the power of the test, and it's hard to calculate because of the ambiguity of the alternative hypothesis.

  3. Many of you are likely to use structural equation modeling (SEM) in your analysis. SEM allows considerably more flexible modeling than standard regression analysis, especially of "factors" (latent or unobservable variables). However, you need to be careful in SEM to distinguish between the degrees of freedom used to estimate the covariance matrix (which depend on the size of your data set) and those used to estimate the model (which depend on the size of the covariance matrix). The arithmetic sketch after this list shows the distinction.
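
To make the first point concrete, here is a minimal Python check of the square example above (numpy is assumed to be available; the numbers are exactly those claimed in item 1):

    import numpy as np

    # The four corners of the square from item 1.
    x = np.array([0.0, 0.0, 1.0, 1.0])
    y = np.array([0.0, 1.0, 0.0, 1.0])

    # Regression of y on x:  y = a + b*x + e, fitted by least squares.
    b, a = np.polyfit(x, y, 1)             # polyfit returns the slope first
    print("y on x:  a =", a, " b =", b)    # a = 0.5, b = 0.0

    # Regression of x on y:  x = c + d*y + f.
    d, c = np.polyfit(y, x, 1)
    print("x on y:  c =", c, " d =", d)    # c = 0.5, d = 0.0

    # The "model-free" version: the best-fit direction for orthogonal
    # (total least squares) regression is the leading eigenvector of the
    # covariance matrix.  Here both eigenvalues are equal, so no direction
    # is preferred; every line through (1/2, 1/2) does equally well.
    print("eigenvalues:", np.linalg.eigvalsh(np.cov(np.vstack([x, y]))))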
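
The second point can also be seen numerically. The following sketch is my own illustration (numpy and scipy assumed): it simulates data whose true mean deviates only slightly from the null value of zero and counts how often a conventional t-test rejects the null. In this setup the rejection rate, i.e., the power against the nearby alternative, turns out to be quite low, which is why "failing to reject" is weak evidence for a theory.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    true_effect = 0.1          # a real but small deviation from the null
    n = 30                     # observations per simulated study
    trials = 2000
    rejections = 0

    for _ in range(trials):
        sample = rng.normal(loc=true_effect, scale=1.0, size=n)
        result = stats.ttest_1samp(sample, popmean=0.0)   # H0: mean = 0
        if result.pvalue < 0.05:
            rejections += 1

    # With an effect this close to the null, only a small fraction of the
    # simulated studies reject, so accepting the null tells you little.
    print("estimated power:", rejections / trials)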
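
For the third point, the degrees-of-freedom arithmetic is short enough to write out; the numbers below are invented purely for illustration and do not come from any particular study:

    # Illustrative SEM degrees-of-freedom arithmetic (all numbers invented).
    p = 12                        # observed variables
    n_obs = 250                   # sample size: governs how precisely the
                                  # covariance matrix itself is estimated

    moments = p * (p + 1) // 2    # distinct elements of the covariance matrix
    free_params = 30              # free parameters in the hypothesized model

    model_df = moments - free_params
    print("distinct covariance elements:", moments)    # 78
    print("model degrees of freedom:", model_df)       # 48

    # Note that model_df depends only on p and free_params; n_obs enters
    # separately, through the precision of the estimated covariances.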

It is very important to understand these three issues well if you use those statistical tools, because you are likely to be asked questions about those aspects of your statistical analysis. It is usually very easy to detect that a student really doesn't know what he or she is talking about, but is simply presenting output from statistical packages for models of unknown quality based on possibly unreliable data.

There are many well-known software packages for statistical analysis available on the Shako lab computers, including SPSS (and AMOS for SEM), E-Views, and Shazam, as well as languages such as S-Plus, R and TSP. Be aware that I mostly use R; if you use one of the others, you may need to get help elsewhere.

Data Collection

I can't say much about primary data collection. As an economist in my student days I mostly used government or other "standard" data sets. There is a large body of literature on survey techniques (Profs. Ishii and Ueichi in Shakei are experts) and interviewing techniques (Prof. Ikuine in Keiko is expert, I believe). In most cases your AG professors are likely to be of help here.

One area where I do have some related expertise is "social media mining" (i.e., collecting textual and network data from social networks). This kind of data collection involves a lot of programming to summarize and "clean" the data sets. (It's possible to buy prepared data sets, but they are not trustworthy, and worse, they are not reproducible by independent scientists.)

Programming

Programming is most useful in data collection (of social media and other non-tabular data sets), and in qualitative data analysis and preparation of social media, Internet server log files, and "big data" in general. (A lot of progress has been made in quantitative analysis of big data, and you may find packages for use with R and other statistical software.)

I know a fair amount about the Python language, and recommend it highly. Not only is it relatively easy to learn, but it also has a large number of "Python packages" for various tasks such as accessing social media streams, doing statistical analysis, and generating graphics. (These last two overlap with ordinary statistical packages a lot.) Python also has interfaces to many other software packages (such as R), and is often a convenient "glue" language for doing the same "packaged" analysis repeatedly with different parameters or assumptions.
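
As a deliberately trivial illustration of the "glue" idea, the sketch below repeats the same small analysis under different assumptions (here, how aggressively outliers are trimmed) and writes the results to a CSV file that a spreadsheet or R could pick up. The data and the parameter values are invented for the example:

    import csv
    import statistics

    # Invented observations standing in for some cleaned data set.
    observations = [2.1, 2.4, 2.2, 9.8, 2.3, 2.0, 2.5, 8.7, 2.2, 2.6]

    def trimmed_mean(values, trim_fraction):
        """Mean after dropping the smallest and largest trim_fraction of values."""
        values = sorted(values)
        k = int(len(values) * trim_fraction)
        kept = values[k:len(values) - k] if k > 0 else values
        return statistics.mean(kept)

    # The same "packaged" analysis, repeated with different assumptions.
    with open("trimmed_means.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trim_fraction", "trimmed_mean"])
        for trim in [0.0, 0.1, 0.2]:
            writer.writerow([trim, trimmed_mean(observations, trim)])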

I'm also familiar with C, C++, and Lisp.

Very similar to programming is creating dynamic web pages using HTML, CSS, and Javascript. Learning some Javascript is likely to be useful if you do any web-based work, since social media often deliver streams of "JSON documents", a way of describing structured data. And of course if you decide to present your work as a web page, Javascript is essential.
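
To give a feel for what such a stream looks like, here is an invented single record (real services each have their own field names) and the standard-library Python code that turns it into ordinary dictionaries and lists:

    import json

    # An invented example record; the field names are hypothetical.
    raw = '{"user": "taro", "text": "Back in the lab.", "likes": 3, "tags": ["research"]}'

    post = json.loads(raw)            # JSON text -> Python dict
    print(post["user"], post["likes"])
    print(post["tags"][0])

    # Going the other way (dict -> JSON text) is just as simple:
    print(json.dumps(post, indent=2))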

Finally, in describing your model, especially for oral presentation, it is often useful to create graphs to show conceptual dependencies in models, data or control flow in programs, and workflows in project management. SEM tools like AMOS can draw very nice diagrams based on models. I use the dot program from the open source Graphviz package.
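
As a minimal sketch (the graph and file names are made up), a conceptual-dependency diagram can be written in the dot language and rendered from Python by calling the dot command, which is assumed to be installed:

    import subprocess

    # A tiny conceptual-dependency graph described in the dot language.
    dot_source = """
    digraph model {
        rankdir=LR;
        "research question" -> "hypotheses";
        "hypotheses" -> "data collection";
        "data collection" -> "analysis";
        "analysis" -> "conclusions";
    }
    """

    with open("model.dot", "w") as f:
        f.write(dot_source)

    # Equivalent to running "dot -Tpng model.dot -o model.png" in a terminal.
    subprocess.run(["dot", "-Tpng", "model.dot", "-o", "model.png"], check=True)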

Administration

"Administrative" work can be divided into two basic areas, planning and record-keeping.

Scheduling and workflow

Scheduling is about planning when events in a particular process should occur. Business planning tools you may have heard about include PERT and CPM. These are tools for managing projects where being able to start one task depends on finishing several others, which in turn are dependent on others, creating a complex network of dependencies. These tools aren't so relevant to students because in most cases work proceeds linearly (i.e., each task depends on the preceding one), but the "network of dependencies" issue arises immediately when you work in a team.

The biggest problem most students face in producing high-quality research is a failure to anticipate how much time it takes to do various tasks. They then end up in a "big crunch" at the end of the scheduled period, with too little time to polish their work if they want to graduate on time. Scheduling is most aided by simply planning ahead, listing the tasks and deadlines. However, when deadlines are very firm (e.g., determined by the academic year), it may be necessary to modify some goals: reduce the amount of data collected or the number of hypotheses formulated, and things like that.

Determining the time vs. goal tradeoff is the most difficult part of planning. The accuracy of these estimates can be greatly improved by developing personal and team workflows. A workflow is a template schedule for a kind of task that is performed repeatedly.

I'm thinking about creating web-based scheduling and workflow tools adapted to the academic world, especially graduate student research projects. More about that later.

Record-Keeping

It's useful to keep a record of each version of your work. For one thing, your AG, your principal advisor, and you may all have different copies of your work. Bombarding your advisors with a rapid sequence of revisions is not always a good idea, especially if the changes are hard to find.

Version control software can help with this task. I use git when I can. Another good choice is Mercurial. These systems were originally designed for cooperative work on software, but are quite applicable to documents (with some conditions). They allow you to save many revisions of each document, and to keep the various documents in your project in coordinated revisions (e.g., if you have Excel tables embedded in Word documents, you need to change the Word document if you change the data tables -- did you remember to do that?).

This software can help solve the following problems:

  • What did I do?
  • Where was I in my work?
  • Why did I do that?
  • When did I do that?

Warning: it helps to use plain text tools like TeX rather than Word for this purpose. The primary tools for finding out "what did I do?" (very useful in generating a "Response to Comments"!) are various versions of the diff program, which compares two versions of a document line by line as plain text. These tools are designed for programmers, so the "documents" they handle are plain text with well-defined lines, not word-processor documents that automatically break lines to fill the page up to the margin. You can also search for changes involving a specific phrase, but since diff doesn't understand Word files, it can't help with that if you stick with Word.
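
If you do keep your documents as plain text, the same comparison can be scripted. The sketch below uses Python's standard difflib module to print a unified diff of two versions of a chapter; the file names are hypothetical:

    import difflib

    # Hypothetical file names; substitute your own revisions.
    with open("chapter2_v1.tex") as f:
        old = f.readlines()
    with open("chapter2_v2.tex") as f:
        new = f.readlines()

    diff = difflib.unified_diff(old, new,
                                fromfile="chapter2_v1.tex",
                                tofile="chapter2_v2.tex")
    print("".join(diff), end="")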