Social Media Text Mining

author:Stephen J. Turnbull
organization:Faculty of Engineering, Information, and Systems at the University of Tsukuba
contact:Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp>
date:September 10, 2019
copyright:2018, Stephen J. Turnbull
topic:STEM, social networks, social media, text mining

< previous | next >

Text Mining

In dealing with text, we need to remember that, except in special cases like programming languages, computers don’t understand words, let alone sentences. We are now in an age where very large systems such as Apple’s Siri, Amazon’s Alexa, and Google Assistant can understand simple commands and even recognize spoken words. But they’re still quite fragile: they’re not very good with regional accents, and they frequently make amusing mistakes, as well as not a few that aren’t amusing. Nor are we at a stage where saying “Siri, do my homework” is good for anything but a bit of amusement. So how do computers handle natural languages like English or Japanese? We use two kinds of “artificial intelligence”: “expert systems” and “machine learning”.

Expert systems are built on logical rules (axioms about the subject of the system) plus some ability to calculate solutions to logic problems from given facts, the axioms, and rules of inference. We won’t be dealing with these systems because programming them requires both specialized knowledge and some programming ability, but they have had a fair amount of success in a number of special fields such as medical diagnosis, customer service chatbots, and formal logic itself.
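To make the idea of rules plus inference concrete, here is a minimal sketch of forward-chaining over if-then rules. This is not a real expert system, and the “medical” rules and fact names are invented purely for illustration.

```python
# Toy "expert system": each rule is (set of premises, conclusion).
# The rules below are invented examples, not medical advice.
RULES = [
    ({"fever", "cough"}, "possible flu"),
    ({"possible flu", "short of breath"}, "see a doctor"),
]

def infer(facts, rules):
    """Forward-chain: keep adding conclusions whose premises all hold,
    until no rule produces anything new."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"fever", "cough", "short of breath"}, RULES))
```

Real expert systems add much more machinery (uncertainty, explanation of reasoning, efficient rule matching), but the core loop of applying rules to facts looks like this.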

Instead, we’ll be using machine learning, a branch of statistics. First, a warning: “artificial intelligence”, or AI, of both kinds has so far shown very little intelligence beyond rule-based calculation, and zero “common sense”. In the case of machine learning, the rules it learns typically don’t even make sense to humans, although the calculations based on them produce useful results. That doesn’t mean AI isn’t a threat to human employment: it is. There are many paper-shuffling jobs that are vulnerable to automation. To date, “office automation” has increased the speed at which people can produce paper, and “paper pushing” employment hasn’t decreased much; rather, it has increased. But AI systems are rapidly improving to the point where they can compete with humans in a number of roles. This either frees people to do work that requires “real” intelligence, or puts them out of work if they haven’t earned that promotion, just as mechanical automation displaced many kinds of manual workers.

Machine learning is a branch of statistics adapted for use in cases where there is a large quantity of data with low information density, or “big data”. As we mentioned before, “big data” is big, but that’s because big data methods are not very useful on small data sets: either they don’t produce interesting results at all because the information is insufficient, or models based on prior research and human intuition are superior in explanatory and predictive power. The fundamental idea behind machine learning is that statistical methods can discover patterns in big data that humans won’t notice because of the low information density, or, to put it more visually, the low contrast between the pattern and the random noise mixed in with it.

The order of words in a text is not random: order matters. You have probably watched Star Wars and had little trouble understanding Yoda despite his alien word order, but even Yoda is constrained by some ordering: “I’m tall because my parents are tall” means something very different from “My parents are tall because I’m tall”. If you get cause and effect wrong in a power plant, you could get killed. Still, we can learn a lot just from knowing which words are in a text, and that is the fundamental principle behind most web searches.
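The idea of looking only at which words occur, ignoring their order, is usually called a “bag of words” representation. Here is a minimal sketch; the tokenizer is deliberately naive (lowercase everything, treat any non-alphanumeric character as a separator).

```python
from collections import Counter

def bag_of_words(text):
    """Naive bag of words: lowercase, split on non-alphanumeric
    characters, and count tokens. Word order is discarded; only
    which words occur, and how often, is kept."""
    tokens = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return Counter(tokens)

print(bag_of_words("My parents are tall because I'm tall"))
```

Note that this representation cannot distinguish the sentence above from “I’m tall because my parents are tall”: both produce exactly the same word counts. That is precisely the information bag-of-words methods throw away, and yet, as the text says, it is enough to be very useful for search.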

< previous | next >