Social Media Text Mining

author:	Stephen J. Turnbull
organization:	Faculty of Engineering, Information, and Systems at the University of Tsukuba
contact:	Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp>
date:	September 10, 2019
copyright:	2018, Stephen J. Turnbull
topic:	STEM, social networks, social media, text mining

What is Twitter?

Twitter is the leading provider of a so-called micro-blog: a public stream of texts with strictly limited length (originally 140 characters, now 280). The word “blog” is a contraction of “web log”, a sequence of records published on the World Wide Web. Blogs are a “pull” medium: the user must visit a host website and request the blog entries. A micro-blog like Twitter, by contrast, is a “push” medium, which automatically broadcasts content to its users, somewhat like radio. Unlike radio listeners, however, users have a fair amount of control over the content of their feeds, explicitly by following other users and hashtags, and implicitly through their user profiles, likes, replies, and searches. These are aggregated by The Algorithm into a “total” profile which Twitter uses to select tweets to send to each user. (“The Algorithm” doesn't actually exist as a single thing; in fact Twitter's selection method uses multiple algorithms, and is continuously revised. When capitalized, “The Algorithm” refers to this whole system, especially when its influence on social networks of Twitter users is being discussed.)

Presently Twitter’s users emit hundreds of millions of tweets every day. The aggregate stream is called “the firehose”, and handling it requires special equipment and network connections. It is absolutely impossible for an ordinary user to keep up with it, even in only one language. (My own feed currently contains English, Japanese, Spanish, French, Korean, Chinese, Farsi, Arabic, and Hebrew, with English the great majority, most of the rest Japanese, a few per day in Spanish and French, and the rest a few times a month at most.) In fact, if you follow more than a few people, you won’t even get all of the tweets from the people you follow. That’s why we can assume that Twitter is selecting tweets.

When you view Twitter on the Web or in an app on a phone or tablet, you see up to about 350 characters of data per tweet (counting media like images or videos as 13 characters each, because Twitter represents them as URLs, and uses “tiny URLs” at the t.co site). However, each tweet is delivered in a package called a status, which tells you not only the text of the tweet, but also a lot of information about the tweet and the tweeter. Although the tweet text itself fills between 10 and 500 bytes (the larger sizes are due to the use of Unicode for non-English languages), a status is a much larger package, taking between 4000 and 10000 bytes. The rest of the data is called “metadata”, which uses the Greek prefix “meta” and means “data about data”. It includes things like the number of follows and followers the tweeter has (which you can see in most clients), but also things like the whole original tweet when the current tweet is a retweet, the latitude and longitude where the tweet was sent (if the client has geolocation enabled), and URLs for preferences like the tweet background, which many clients ignore. The wide variety of metadata and the complexity of even short pieces of text mean that a Twitter feed has to be considered big data.
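The difference between characters, bytes, and the size of the whole status can be seen in a minimal Python sketch. The status below is hand-made and heavily trimmed for illustration: the field names follow Twitter's status JSON as commonly documented (an assumption here, not something these notes specify), all values are invented, and a real status carries far more metadata.

```python
import json

# A hand-made, heavily trimmed status (all values invented).
# Real statuses contain many more fields, which is why they
# run to thousands of bytes.
status_json = """
{
  "id_str": "0",
  "created_at": "Tue Sep 10 00:00:00 +0000 2019",
  "text": "これはテストです",
  "user": {
    "screen_name": "example_user",
    "followers_count": 120,
    "friends_count": 80
  },
  "coordinates": null
}
"""

status = json.loads(status_json)
text = status["text"]

# Characters are what the length limit counts; bytes are what
# travels over the network. Japanese needs 3 bytes per character
# in UTF-8.
n_chars = len(text)
n_text_bytes = len(text.encode("utf-8"))

# Everything in the status other than the text is metadata.
n_status_bytes = len(status_json.encode("utf-8"))

# In this sketch a retweet is recognized by the presence of an
# embedded original status.
is_retweet = "retweeted_status" in status

print(n_chars, n_text_bytes, n_status_bytes, is_retweet)
print(status["user"]["followers_count"], status["coordinates"])
```

Even in this toy status the text is a small fraction of the total, and the gap between the character count and the byte count shows why non-English tweets can be several times larger than their length suggests.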

Still, the most important aspect of the tweet is the 280 characters of text (which includes the URLs of any images that are displayed). Why is text complex? First, of course, there are a lot of words: 850 in “Basic English”, 3,000 or so for a high school graduate, 5,000-10,000 for typical college graduates. Then, you can combine them in a lot of different ways; for example, inserting a “not” more or less reverses the meaning of the word it modifies. Then, the meaning of an individual word changes depending on context. For example, take the word “bad”. Depending on the context and tone of voice, “you’re bad” can mean “you’re evil”, “you’re rotten”, “you’re unskilled”, “you’re tough”, and even “you’re good”. Many “dirty words” are even more flexible. And then there’s the spelling problem. Especially on bandwidth-limited Twitter, people use abbreviations even more radically than DAIGO.
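To make these difficulties concrete, here is a deliberately naive word-scoring sketch in Python. It is not a method proposed in these notes: the word lists, the abbreviation table, and the function names are all hypothetical and invented for illustration. It expands a few abbreviations and flips the score of the word right after a negation, and it still cannot tell which sense of “bad” is meant.

```python
import re

# Hypothetical toy lexicons, invented for this sketch.
SENTIMENT = {"good": 1, "great": 1, "bad": -1, "rotten": -1}
ABBREVIATIONS = {"gr8": "great", "u": "you", "r": "are"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def naive_score(text):
    """Sum word scores, flipping the sign of the single token
    that immediately follows a negation."""
    score = 0
    negate = False
    for token in tokenize(text):
        # Spelling problem: map known abbreviations to full words.
        token = ABBREVIATIONS.get(token, token)
        if token in NEGATIONS or token.endswith("n't"):
            negate = True
            continue
        value = SENTIMENT.get(token, 0)
        score += -value if negate else value
        # The negation only reaches the very next token.
        negate = False
    return score


for example in ["u r gr8", "that is not good", "not bad", "you're bad"]:
    print(example, "->", naive_score(example))
```

The last example always scores as negative, even though in the right context “you’re bad” is praise. A word list has no access to context or tone of voice, which is exactly the information that the meaning depends on.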

Besides the length limit, Twitter has another custom that makes interpretation of words difficult: the subtweet. (A subtweet is a tweet that responds to another person’s tweet but is not a reply and does not tag that person.) Effective subtweets depend on a common understanding of the context between the tweeter and the audience. As a result, much relevant context is customarily, and deliberately, omitted. (This is similar to the Japanese language, although subtweeting is, like “in” jokes, more extremely “lacking context” than Japanese normally is.)
