Social Media Text Mining

author: Stephen J. Turnbull
organization: Faculty of Engineering, Information, and Systems at the University of Tsukuba
contact: Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp>
date: September 10, 2019
copyright: 2018, Stephen J. Turnbull
topic: STEM, social networks, social media, text mining


Using Python to collect tweets

If you are reading tweets, whether for fun or to follow the news, your personal Twitter webpage or a mobile app is probably best. However, if you want to understand how social media users think and build their networks, whether for an academic purpose or for something more practical such as marketing a product or a political candidate, you need to use a program to collect enough data.

Probably the most popular programming language for this purpose is Python. It has very simple syntax, with relatively little punctuation, mostly used in familiar ways. It uses line breaks and indentation to express structure (groups of statements to execute in order), much as an outline does. It has an enormous “ecosystem” of freely available modules for purposes such as communicating with websites and analyzing data. It’s an excellent language for writing code that you will use for one purpose, continually improve until the project is done, and then mostly throw away. (It has many other use cases, but this is the one we’re interested in.)

Python makes it easy to create functions (repeatable procedures that produce a value), classes (new kinds of data, with class-specific operations), and modules (collections of classes and functions that serve a common purpose, usually kept together in one file). We will provide the modules and functions needed for the tasks we’ll perform. If you are interested, you can go on to add more functions. It’s not hard.
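To make these terms concrete, here is a small sketch of a function and a class. The names count_hashtags and Tweet are invented for this illustration and are not part of the modules provided in class. Notice how indentation alone marks which statements belong to the loop, the condition, and the class.

```python
# A function: a repeatable procedure that produces a value.
def count_hashtags(text):
    """Return the number of words in text that start with '#'."""
    count = 0
    for word in text.split():      # the indented lines are the loop body
        if word.startswith("#"):   # indented further: the body of the "if"
            count += 1
    return count


# A class: a new kind of data with its own operations.
class Tweet:
    def __init__(self, user, text):
        self.user = user
        self.text = text

    def hashtag_count(self):
        """Count the hashtags in this tweet's text."""
        return count_hashtags(self.text)


tweet = Tweet("example_user", "Learning #Python for #textmining")
print(tweet.hashtag_count())  # prints 2
```

Saving this code in a file such as tweets.py would make it a module, and another program could then use it by writing "import tweets".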

The credentials module

To use Twitter from a program, you must go through its web application programming interface, or web API. To access the web API, you need a Twitter user account with a confirmed email address (these are free and easy to create), and then you need to create “credentials” that allow Twitter to identify your application and permit it access to Twitter’s search functions and data streams.

In this class we provide credentials. They will be revoked after the class, so they can't be used outside of the class.

The credentials module contains a get_my_api function, which returns a connection to the Twitter API (i.e., you can use it to communicate with Twitter without logging in again). To use it, edit the file named “credentials.py” and enter your credentials in the appropriate place. Keeping the credentials in a separate module means that you can get a copy of a main program which imports credentials and run it without making any change to the main program.
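As a rough idea of what such a file could look like, here is a minimal sketch. Only the name get_my_api comes from the module described above. The variable names, the placeholder values, and the helper missing_credentials are assumptions made for illustration, and the file distributed in class may be organized differently. The sketch builds the connection with the Twitter and OAuth classes from the twitter module.

```python
# credentials.py (sketch): keep your keys here, apart from the main program.

# Placeholder values (hypothetical): replace each with a real credential.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"


def missing_credentials():
    """Return the names of credentials that still hold placeholder values."""
    values = {
        "CONSUMER_KEY": CONSUMER_KEY,
        "CONSUMER_SECRET": CONSUMER_SECRET,
        "ACCESS_TOKEN": ACCESS_TOKEN,
        "ACCESS_TOKEN_SECRET": ACCESS_TOKEN_SECRET,
    }
    return [name for name, value in values.items()
            if not value or value.startswith("your-")]


def get_my_api():
    """Return a Twitter object authenticated with the credentials above."""
    missing = missing_credentials()
    if missing:
        raise ValueError("Please fill in: " + ", ".join(missing))
    # Imported here so the check above works even on a machine where the
    # twitter module has not been installed yet.
    from twitter import OAuth, Twitter
    auth = OAuth(ACCESS_TOKEN, ACCESS_TOKEN_SECRET,
                 CONSUMER_KEY, CONSUMER_SECRET)
    return Twitter(auth=auth)
```

A main program would then need only two lines to connect: "from credentials import get_my_api" followed by "api = get_my_api()". Because the keys live in credentials.py, the main program itself never has to change.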

To create your own user credentials, visit twitter.com to create a user account, then apply to Twitter for developer status. Each application you write gets its own credentials, which are linked to your user account. Some are created automatically (the consumer key and consumer secret key); others need to be requested explicitly (the access token and the access token secret). The credentials module also provides a function, test, which looks up current Twitter trends and prints them out, to confirm that your credentials really can access Twitter.
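A check of this kind might be sketched as follows. This is not the test function from the class module: show_trends and trend_names are hypothetical stand-ins, and the location code 1 (worldwide) is an assumption. The sketch relies on the trends/place request of version 1.1 of the Twitter API, whose reply is expected to be a list holding one dictionary with a "trends" entry.

```python
def trend_names(response):
    """Extract the trend names from the reply to a trends/place request.

    Assumes the reply is a list of dictionaries, each with a "trends"
    entry that is a list of dictionaries having a "name" key.
    """
    names = []
    for place in response:
        for trend in place.get("trends", []):
            names.append(trend["name"])
    return names


def show_trends(api, woeid=1):
    """Print the current trends for one location (1 is assumed worldwide).

    api is a connection such as the one returned by get_my_api.
    """
    response = api.trends.place(_id=woeid)
    for name in trend_names(response):
        print(name)
```

If the credentials are wrong or have been revoked, the request fails with an error instead of printing a list, which is exactly what makes it useful as a check.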

The twitter module

This is a module written by Mike Verdone. It lets you communicate with Twitter through two kinds of objects. One is called “Twitter”. It offers many ways to get a particular set of tweets and other information from the past, such as a user’s followers. The other is called “TwitterStream”. It delivers tweets as they are posted. We will use both to get the biggest possible dataset. The documentation at https://pypi.org/project/twitter/ is pretty easy to understand if you have programming experience. If you don’t, there are many examples, and you can skip the code and still get a pretty good idea of what the twitter module can do.

Searching and filtering

From the point of view of the researcher, searching and filtering are rather similar operations, although behind the scenes at Twitter they are very different problems! In each case, we provide some conditions, such as “contains a keyword” or “is in English”, and the Twitter object or TwitterStream object returns some of the desired tweets. Of course, a Twitter object returns historical tweets, while a TwitterStream object waits for the right kind of tweet to be distributed in the future.
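The similarity can be seen in a short sketch. The function names search_recent, filter_live, and collect_texts are invented for this illustration. The calls search.tweets and statuses.filter are the twitter module's names for the corresponding version 1.1 API requests, and the sketch assumes that api is a Twitter object and stream is a TwitterStream object, both already authenticated.

```python
def collect_texts(tweets, limit):
    """Return the text of up to `limit` tweets from any iterable of tweets.

    A stream also delivers messages that are not tweets (deletion
    notices, for example); those have no "text" key and are skipped.
    """
    texts = []
    if limit <= 0:
        return texts
    for tweet in tweets:
        if "text" in tweet:
            texts.append(tweet["text"])
            if len(texts) >= limit:
                break  # stop now rather than wait for another message
    return texts


def search_recent(api, keyword, lang="en", limit=10):
    """Search: ask a Twitter object for tweets that already exist."""
    result = api.search.tweets(q=keyword, lang=lang, count=limit)
    return collect_texts(result["statuses"], limit)


def filter_live(stream, keyword, lang="en", limit=10):
    """Filter: wait on a TwitterStream object for new matching tweets."""
    incoming = stream.statuses.filter(track=keyword, language=lang)
    return collect_texts(incoming, limit)
```

Both functions take the same conditions and hand the results to the same helper. The difference is in timing: search_recent returns as soon as Twitter replies, while filter_live may wait a long time if few people are posting about the keyword.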
