This guide is optimized for Windows users (a tutorial for macOS users is available here). Windows 10 is required.
To use the following utilities, make sure the JSONL file of collected tweet data you want to use is in the same folder as the utils folder (but not inside this utils folder).
Important Note: At the time of writing, Windows does not play well with certain Unicode (UTF-8) characters, including emojis, that are commonly found in tweet datasets. Without changing your computer’s settings, you will likely run into an encoding error (a UnicodeEncodeError) while using many of the twarc utilities.
Fix this: You can change your Region settings to use Unicode UTF-8 character encoding. Follow the instructions here. You can change these settings back later if you’d prefer.
For the following twarc lessons, all blue text can be modified to fit your purpose. These are either file names or search/filter terms.
Use the Table of Contents below to quickly navigate this guide.
To get the total number of tweets in your dataset, enter the command below:
find /v /c "" tweets.jsonl
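The find /v /c "" command simply counts the lines in the file; since twarc writes one tweet per line, that equals the number of tweets. The same check can be sketched in Python (count_tweets is a hypothetical helper for illustration, not one of the scripts in the utils folder):

```python
def count_tweets(jsonl_lines):
    """Count tweets, assuming one JSON object per line (twarc's JSONL format).

    Blank lines are skipped so a trailing newline does not inflate the count.
    """
    return sum(1 for line in jsonl_lines if line.strip())
```

For a file on disk you would call it as count_tweets(open("tweets.jsonl", encoding="utf-8")).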
Use the following command to create a csv file (spreadsheet):
python utils/json2csv.py tweets.jsonl > tweets.csv
Output: Creates a .csv file with the following tweet attributes: id, tweet_url, created_at, parsed_created_at, user_screen_name, text, tweet_type, coordinates, hashtags, media, urls, favorite_count, in_reply_to_screen_name, in_reply_to_status_id, in_reply_to_user_id, lang, place, possibly_sensitive, retweet_count, retweet_or_quote_id, retweet_or_quote_screen_name, retweet_or_quote_user_id, source, user_id, user_created_at, user_default_profile_image, user_description, user_favourites_count, user_followers_count, user_friends_count, user_listed_count, user_location, user_name, user_statuses_count, user_time_zone, user_urls, user_verified
See the Tweet Data Dictionary for a complete description of these attributes.
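The conversion can be sketched in Python for a handful of those columns (a simplified, hypothetical version; the real json2csv.py flattens many more nested attributes):

```python
import csv
import io
import json

def tweets_to_csv(jsonl_lines, columns=("id", "created_at", "user_screen_name", "text")):
    """Convert JSONL tweet lines to CSV text for a small subset of attributes."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(columns))
    writer.writeheader()
    for line in jsonl_lines:
        tweet = json.loads(line)
        row = {c: tweet.get(c, "") for c in columns}
        # user_screen_name lives inside the nested "user" object
        if "user_screen_name" in columns:
            row["user_screen_name"] = tweet.get("user", {}).get("screen_name", "")
        writer.writerow(row)
    return out.getvalue()
```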
To count the number of unique users in the dataset, enter the following command:
python utils/users.py tweets.jsonl > users.txt
This will create a text file listing all users in the dataset.
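Collecting the users can be sketched as follows (a hypothetical helper for illustration; the actual users.py script may format its output differently):

```python
import json

def unique_users(jsonl_lines):
    """Collect the unique screen names appearing in JSONL tweet lines."""
    users = set()
    for line in jsonl_lines:
        screen_name = json.loads(line).get("user", {}).get("screen_name")
        if screen_name:
            users.add(screen_name)
    return sorted(users)
```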
To get a list of all the unique hashtags used in the dataset, enter the following command:
python utils/tags.py tweets.jsonl > tweets_hashtags.txt
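In Twitter’s v1.1 JSON, hashtags are stored under each tweet’s entities object, so tallying them can be sketched like this (a hypothetical helper, not the actual tags.py script):

```python
import json
from collections import Counter

def hashtag_counts(jsonl_lines):
    """Tally hashtags (case-folded) from the entities of each tweet."""
    counts = Counter()
    for line in jsonl_lines:
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts
```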
To see a list of all emojis in the dataset, use the following command:
python utils/emojis.py tweets.jsonl > tweets.txt
Output: Creates a text file listing each emoji and the number of times it appears:
Most Windows programs use black and white emoji characters (Notepad was used to open the text file above). You have a few options:
Change the font to Segoe UI Emoji. In Notepad, click the option Format > Font to change font to Segoe UI Emoji:
They will probably still be in black and white, but the emojis are more current and legible:
To view them in color, copy/paste the contents of your text file into a browser-based text editor. Some options include:
data:text/html, <html contenteditable>
Copy/paste your emoji text into the blank browser page:
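The counting itself can be sketched with a crude code-point range check (a hypothetical helper; the real emojis.py relies on a much fuller emoji data set, so treat these ranges as an approximation):

```python
import json
from collections import Counter

def emoji_counts(jsonl_lines):
    """Count characters falling in a few common emoji code-point blocks."""
    # Approximate ranges only: misc symbols/pictographs, dingbats, flags
    ranges = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF), (0x1F1E6, 0x1F1FF)]
    counts = Counter()
    for line in jsonl_lines:
        tweet = json.loads(line)
        text = tweet.get("full_text", tweet.get("text", ""))
        for ch in text:
            if any(lo <= ord(ch) <= hi for lo, hi in ranges):
                counts[ch] += 1
    return counts
```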
To create a wordcloud, use the following command (see note below):
python utils/wordcloud.py tweets.jsonl > wordcloud.html
Output: Creates an html file; open it in your browser to see the wordcloud.
Note on wordclouds: Consider removing retweets from your dataset when creating your wordcloud. With retweets included, the text from the original tweet is shortened in the dataset, and can create odd/deceptive wordclouds.
Example:
From original dataset | After retweets removed
That large red “arm…” in the original dataset was from a highly retweeted tweet, where the word “army” was cut off mid-word in the retweet data.
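The effect of dropping retweets on word frequencies can be sketched as follows (a hypothetical helper, not the actual wordcloud.py; it uses the presence of a retweeted_status field to detect retweets):

```python
import json
import re
from collections import Counter

def word_frequencies(jsonl_lines, skip_retweets=True):
    """Tally lowercased words, optionally skipping retweets whose text is truncated."""
    counts = Counter()
    for line in jsonl_lines:
        tweet = json.loads(line)
        if skip_retweets and "retweeted_status" in tweet:
            continue  # retweet text is shortened, which skews the counts
        text = tweet.get("full_text", tweet.get("text", ""))
        for word in re.findall(r"[#\w']+", text.lower()):
            counts[word] += 1
    return counts
```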
To create a network graph from your tweet data, use the following command (note there is no ‘>’ in this command between the jsonl file name and the html file name):
python utils/d3_network.py tweets.jsonl tweet_graph.html
Output: Creates an html file; open it in your browser to view the network graph of your tweets. This script only graphs original tweets that have been quoted and/or replied to (not the entire dataset). Hover over circles to show connections, and click to show the tweet (if you click and no tweet appears, it was likely deleted or made private).
To include retweets in your network graph, use the following command (if you have a very large dataset, this will take a long time to load, and may be too crowded to read well):
python utils/d3_network.py --retweets tweets.jsonl tweet_graph.html
Output: Creates an html file of your network graph, including retweets. Open it in your browser:
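Building the underlying edge list can be sketched from the reply, quote, and retweet fields of each tweet (a hypothetical helper for illustration; the actual d3_network.py also lays out and renders the graph):

```python
import json

def reply_quote_edges(jsonl_lines, include_retweets=False):
    """Build (source_id, target_id, kind) edges from replies, quotes, and retweets."""
    edges = []
    for line in jsonl_lines:
        tweet = json.loads(line)
        source = tweet["id_str"]
        if tweet.get("in_reply_to_status_id_str"):
            edges.append((source, tweet["in_reply_to_status_id_str"], "reply"))
        if "quoted_status" in tweet:
            edges.append((source, tweet["quoted_status"]["id_str"], "quote"))
        if include_retweets and "retweeted_status" in tweet:
            edges.append((source, tweet["retweeted_status"]["id_str"], "retweet"))
    return edges
```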
To export your tweet data as a GeoJSON file, use the following command:
python utils/geojson.py tweets.jsonl > tweets.geojson
Output: Creates a .geojson file, which you can use to quickly create a map.
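Since Twitter’s coordinates field is already a GeoJSON Point (longitude first), the conversion can be sketched like this (a hypothetical helper, not the actual geojson.py script):

```python
import json

def tweets_to_geojson(jsonl_lines):
    """Build a GeoJSON FeatureCollection from tweets that have point coordinates."""
    features = []
    for line in jsonl_lines:
        tweet = json.loads(line)
        coords = tweet.get("coordinates")  # already a GeoJSON Point, or null
        if coords and coords.get("type") == "Point":
            features.append({
                "type": "Feature",
                "geometry": coords,
                "properties": {
                    "id": tweet.get("id_str"),
                    "text": tweet.get("full_text", tweet.get("text", "")),
                },
            })
    return {"type": "FeatureCollection", "features": features}
```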
To create a wall of tweets, use the following command:
python utils/wall.py tweets.jsonl > tweets.html
Output: Creates an html file; open it in your browser to view all tweets in an easy-to-browse format.
Creates a text file of the URLs of images uploaded to Twitter in a tweet JSON. Use the following command:
python utils/media_urls.py tweets.jsonl > tweets_media.txt
Output: Creates a text document with the tweet ids and media urls.
For just a list of embedded media URLs (without the associated tweet id):
python utils/embeds.py tweets.jsonl > tweets_embeds.txt
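Extracting the media URLs can be sketched from each tweet’s extended_entities (a hypothetical helper, not the actual media_urls.py; it assumes the v1.1 media_url_https field):

```python
import json

def media_urls(jsonl_lines):
    """Map each tweet id to the HTTPS URLs of media attached to it."""
    urls = {}
    for line in jsonl_lines:
        tweet = json.loads(line)
        # extended_entities holds all attached media; fall back to entities
        media = tweet.get("extended_entities", tweet.get("entities", {})).get("media", [])
        if media:
            urls[tweet["id_str"]] = [m["media_url_https"] for m in media]
    return urls
```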
Creates a new JSONL file with retweets removed. Use the following command:
python utils/noretweets.py tweets.jsonl > tweets_noretweets.jsonl
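The filter can be sketched in a few lines, since a retweet carries a retweeted_status field (a hypothetical helper, not the actual noretweets.py script):

```python
import json

def remove_retweets(jsonl_lines):
    """Yield only the lines whose tweet is not a retweet."""
    for line in jsonl_lines:
        if "retweeted_status" not in json.loads(line):
            yield line
```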
Tweet IDs are generated chronologically. This command will sort your tweet data by Tweet ID in ascending order, which is the same as sorting by date:
python utils/sort_by_id.py tweets.jsonl > tweets_sorted.jsonl
Another option is to create a csv, open it in Excel, and sort the “id” column in ascending or descending order to identify the first and last tweets in your dataset.
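The sort can be sketched as follows (a hypothetical helper, not the actual sort_by_id.py; note the IDs must be compared as numbers, since sorting them as strings would put "30" before "4"):

```python
import json

def sort_by_id(jsonl_lines):
    """Sort tweet lines chronologically by numeric tweet ID, ascending."""
    return sorted(jsonl_lines, key=lambda line: int(json.loads(line)["id_str"]))
```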
You may want to study tweets from only a select number of days in your dataset. For example, you’ve searched a hashtag related to an event that occurred within the last two days, but your search query retrieves tweets from the previous seven days.
Use this command to remove tweets occurring before a given date:
python utils/filter_date.py --mindate 13-oct-2019 tweets.jsonl > filtered.jsonl
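The date filter can be sketched by parsing each tweet’s created_at timestamp, which Twitter stores in a fixed format like "Sat Oct 12 10:00:00 +0000 2019" (a hypothetical helper, not the actual filter_date.py script):

```python
import json
from datetime import datetime

def filter_after(jsonl_lines, mindate):
    """Yield only the lines for tweets created on or after mindate (an aware datetime)."""
    for line in jsonl_lines:
        created = datetime.strptime(json.loads(line)["created_at"],
                                    "%a %b %d %H:%M:%S %z %Y")
        if created >= mindate:
            yield line
```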
Creates a new JSONL file with no duplicate tweets. This script removes any tweets with duplicate IDs. Use the following command:
python utils/deduplicate.py tweets.jsonl > tweets_deduped.jsonl
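Deduplication by tweet ID can be sketched with a set of seen IDs, keeping the first occurrence of each (a hypothetical helper, not the actual deduplicate.py script):

```python
import json

def deduplicate(jsonl_lines):
    """Yield each tweet line once, dropping later lines with an already-seen ID."""
    seen = set()
    for line in jsonl_lines:
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            yield line
```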