Introduction to Workflow

Next year I will be supervising five 2nd-year Master's students, nine 1st-year Master's students, a few kenkyusei, and three 4th-year undergraduate students. The process was somewhat burdensome this year even with far fewer students (although there were some special problems that I do not expect to recur next year). For this reason I am going to define a specific workflow that must be followed if you want to participate in my seminar. The process itself will evolve over time for two reasons. First, I don't yet know what will work, so changes will be needed (I hope to relax restrictions). Second, I plan to develop and provide web-based tools to automate as much of the process as possible.

At this time I can say the following things.

Producing Documents

Reviewing Microsoft Office files, specifically Word and Excel, is tedious and error-prone. The same applies to LibreOffice, OpenOffice, and Apple's proprietary applications. I need to have source documents in plain text formats. For this reason, I will require that documents submitted electronically be written either in plain text without markup or in a TeX dialect such as LaTeX (processed with pdfLaTeX or XeLaTeX, for example). Tabular material or datasets may be submitted as LaTeX tables, CSV files, or plain text formats acceptable to R.

All of these formats, except LaTeX, can be produced by Word and Excel. (There may be some tools available to convert Word files to LaTeX, but the documents produced are likely to be ugly and hard to maintain, and it is unlikely that embedded objects can be converted.) Converting Word documents to plain text will strip out formatting and embedded objects such as images and Excel tables. Converting Excel files to CSV will strip out formatting. This is a small loss when reviewing documents compared to the convenience to me of using the many tools for analysis of plain text that have been developed.
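
As a rough illustration of how tabular material can move between two of the accepted formats, the following Python sketch turns CSV text into a LaTeX table. The sample data and the names escape_latex and csv_to_latex_table are made up for this example, and only a handful of LaTeX special characters are escaped, so treat it as a starting point rather than a finished tool.

```python
import csv
import io

# Hypothetical sample standing in for a sheet saved from Excel as CSV.
SAMPLE_CSV = "item,unit_price\nnote_book,120\npen,80\n"

# A few characters that have special meaning in LaTeX (not a complete list).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text):
    """Escape some common LaTeX special characters in one cell."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def csv_to_latex_table(csv_text):
    """Convert CSV text (first row is the header) to a tabular environment."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV input is empty")
    ncols = len(rows[0])
    lines = [r"\begin{tabular}{" + "l" * ncols + "}", r"\hline"]
    for i, row in enumerate(rows):
        lines.append(" & ".join(escape_latex(cell) for cell in row) + r" \\")
        if i == 0:
            lines.append(r"\hline")  # rule under the header row
    lines.append(r"\hline")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(csv_to_latex_table(SAMPLE_CSV))
```

The output can be pasted into a LaTeX document as is. Because both the CSV and the generated table are plain text, changes to either show up clearly when versions are compared.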

Use of LaTeX has the advantage to the student of reliably producing beautiful final versions with less effort in the long run. I am happy to help you make LaTeX do what you want in your documents. I cannot help you with Word; I only use it for university paperwork, and it is invariably very frustrating.

Standards for dealing with graphs and images will be updated over time. For now, just save them in any of the standard image formats (SVG, PNG, JPEG, etc.) for inclusion in documents.

You can get LaTeX for your platform from the vendors recommended by the LaTeX Project. There's only one for Mac. I don't know which of the recommended versions for Windows is best. I will review them later if asked.

Editing LaTeX Source Documents

LaTeX source is plain text and can be edited with any text editor, including Notepad (Memopad) on Windows or TextEdit (in plain text mode) on Mac, as well as Word (save as .txt, then rename the file). However, LaTeX is actually a programming language designed for embedding lots of text, so using a programmer's editor is recommended. Some of the distributions provide one or more text editors, and there are also some WYSIWYG editors such as LyX and TeXmacs (though I'm not sure if these are maintained anymore). I use XEmacs, although GNU Emacs is probably more usable for new users. There are many others. I will try to review a few for Mac and Windows and give recommendations on request.

It may also be possible to produce reasonably good LaTeX sources from the Jupyter Notebook system, then tune them for the drafts due at the midterm presentation and "karitoji", and for final submission.

Presentation Software

There are plain text presentation systems, such as S5 (based on reStructuredText) and Beamer (based on LaTeX). However, these are more or less tedious to use for anything more sophisticated than title plus bullet lists. At present I will permit use of PowerPoint, Keynote, and such proprietary format presentation systems. If you're interested in open source/plain text systems, I'll be happy to teach you.

Recording and Submitting Documents

I will expect you to keep archives of thesis drafts and other textual documents, tables, data sets, programs, and other objects (images) in a git repository. git is a so-called version control system (VCS). It is not the most user-friendly, but you are unlikely to need its advanced features much, if at all, so there will be few things to learn, and you would not notice the difference between git and other VCSes much. If you do find yourself needing advanced features, you probably want to learn to use git despite its user-friendliness deficit, as it is by far the most popular VCS. Besides, it's the one I use most.

Because git records old versions, there are several conveniences for authors. You can get a history and some metrics about changes to analyze your productivity. You can get a concise report of the line-by-line differences from one version to another, or a "word diff" which shows changes at the level of individual words as Word does. You can get a history of when each line was changed most recently. Obviously, if you commit new versions frequently, you're always backed up. And because you have a complete record of old versions, you don't need to give new names to files if you want to keep old versions.
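
To make the difference between a line-by-line diff and a "word diff" concrete, here is a small Python sketch using the standard difflib module. It does not call git; it only imitates the kind of report that git diff and git diff --word-diff produce. The two draft versions and the function names line_diff and word_diff are invented for this example.

```python
import difflib

# Two hypothetical versions of a short draft, as lists of lines.
OLD = ["Git records old versions.", "This is usefull for authors."]
NEW = ["Git records old versions.", "This is useful for thesis authors."]


def line_diff(a, b):
    """Return a unified line-by-line diff of two lists of lines."""
    return list(
        difflib.unified_diff(a, b, fromfile="old", tofile="new", lineterm="")
    )


def word_diff(a, b):
    """Mark word-level changes as [-removed-] and {+added+}."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a_words[i1:i2])
            continue
        if i1 < i2:  # words present only in the old version
            out.append("[-" + " ".join(a_words[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new version
            out.append("{+" + " ".join(b_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print("\n".join(line_diff(OLD, NEW)))
    print(word_diff(OLD[1], NEW[1]))
```

The line diff reports the whole second line as removed and re-added, while the word diff pinpoints the corrected spelling and the inserted word. That is why word-level reports are usually easier to read for prose.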

Git also knows about the Internet, so you can push versions to, and pull versions from, repositories on other computers. You will submit your documents to me by pushing them to my server, and receive comments and suggested edits by pulling them to your personal computer. There will also be a "common" repository containing useful templates (e.g., for title pages).

Communication

I may want to use an issue tracker to keep track of student progress and tasks. More about this later (it hasn't been set up, or the software chosen, yet).

Typical trackers are the one used for Python and the one used for GNU Mailman. Ours will be a lot simpler than either of those.

Statistical Software

At present I am leaning toward requiring R, despite its somewhat obscure programming language. In any case I will require submission of programs for reproducing statistical results, including any "data cleaning" that is done before statistical analysis.
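
The requirement that "data cleaning" be part of the submitted program can be sketched as follows. The example is in Python rather than R only to keep the examples on this page in one language; the same structure applies in R. The raw data, the column names, and the functions clean_rows and summarize are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical raw data: row 2 has a missing value, row 3 a non-numeric one.
RAW_CSV = "id,income\n1,300\n2,\n3,n/a\n4,500\n"


def clean_rows(csv_text):
    """Cleaning step: keep only rows whose income parses as a number.

    Returns the cleaned values and the ids of the dropped rows, so that
    every exclusion is recorded by the program instead of done by hand.
    """
    kept, dropped = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            kept.append(float(row["income"]))
        except (TypeError, ValueError):
            dropped.append(row["id"])
    return kept, dropped


def summarize(values):
    """Analysis step: computed only from the cleaned values."""
    return {"n": len(values), "mean": statistics.mean(values)}


if __name__ == "__main__":
    values, dropped_ids = clean_rows(RAW_CSV)
    print("dropped rows:", dropped_ids)
    print(summarize(values))
```

Because the raw file is never edited by hand, anyone with the raw data and this program can reproduce both the cleaned dataset and the reported statistics.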

General Programming

Any programming language is acceptable as long as there is an open source implementation I can install on Mac or Linux. However, I strongly recommend using Python 3. It is one of the easiest programming languages to learn, it handles Unicode (and therefore Japanese and Chinese) well, it has a collection of powerful features not matched by other easily-learned languages, and it has a very powerful set of data analysis libraries available. The Anaconda distribution is very convenient, and contains the Jupyter Notebook system. Notebooks are a very powerful way to do "literate programming" with Python (and I think R, too). The idea is that a notebook consists of a series of cells, which together constitute a program. However, the notebook is smart enough to execute each cell on demand, so if you change one cell, you may be able to avoid re-executing earlier cells. Among the program cells you can intersperse your technical document, and the whole thing can be output as a nicely formatted PDF document.
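
As a small illustration of the point about Unicode, the sketch below shows that Python 3 strings are sequences of characters rather than bytes, so Japanese text can be measured, sliced, and searched directly. The sample strings and the function names describe and normalize_width are made up for this example.

```python
import unicodedata

# Hypothetical sample text: eight Japanese characters.
TITLE = "経済学の卒業論文"


def describe(text):
    """Count characters and UTF-8 bytes; the two differ for Japanese text."""
    return {"characters": len(text), "utf8_bytes": len(text.encode("utf-8"))}


def normalize_width(text):
    """Convert full-width letters and digits to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(describe(TITLE))
    print(TITLE[-2:])  # slicing works on characters, not bytes
    print("論文" in TITLE)  # substring search works directly
    print(normalize_width("ＡＢＣ１２３"))
```

Normalizing character width before comparing strings is often useful with Japanese data, where the same ID or code may have been typed in either full-width or half-width form.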

Notebooks also interact very well with visualization software such as Bokeh. Bokeh can be installed easily with the Anaconda distribution of Python.
