Transcribing Social Posts with ML: Task & Dataset

Miles Hutson
4 min read · Mar 16, 2021


The Motivation

This holiday season, I wanted to keep my ML skills sharp, so I went in search of some nails to hammer. I cycled through a few ideas before tossing them out as too hard, too easy, too infeasible, or too long. Dispirited, I decided to take my 15th break of the day mindlessly scrolling through Reddit on my phone.

It was there that I noticed a great project idea: transcribing screenshots! A trend on social media in recent years, Reddit included, has been screenshotting content from one platform and posting it for viewers on another. An example Twitter screenshot:

Of course, this practice is horribly unfriendly to users of screen readers (e.g. the blind), archivists, and information retrieval in general.

I’m not the first person to make this observation. In fact, those who did also inspired this project: the users at r/transcribersofreddit. They go about Reddit, for no reason other than kind-heartedness, volunteering their time to transcribe these very posts. They also have templates for the purpose. For the post above, the example transcription they provide is:

Image Transcription: Twitter Post

Morgan Baker, @yagirlmorgg

my talents include avoiding difficult conversations, and getting really sad over things I saw coming

I’m a human volunteer content transcriber for Reddit and you could be too! If you’d like more information on what we do and why we do it, click here!

Seeing these transcriptions on Reddit got me thinking. Can we provide some automation to help these folks out? Perhaps the task could become: 1) approve the automatic transcription, 2) correct it if needed, with 90% of posts requiring only step 1.

And lo and behold, we have a dataset already! All we have to do is go to reddit.com/r/ToR_Archive, where we can find every Reddit post that has a human-provided transcription.

An example:

In this post and those that follow, I’ll outline an approach to automating screenshot transcription, as well as how to easily build a dataset and take your own shot at it. While I don’t expect to solve the general problem, I think there are many instances where it is solvable.

Obtaining the Dataset

In order to proceed at all, we have to have a dataset. I’d hoped to use the native Reddit API and r/ToR_Archive to build one, but the Reddit API was too limited and slow, and the ToR_Archive turned out to have some inconsistencies. Instead, I utilized the wonderful Pushshift API, which was developed for research involving Reddit. I first gathered all comments that included the text “I’m a volunteer content transcriber for Reddit!”; these formed the file “tor_comments.csv”. Then, for each comment, I checked whether the linked post’s URL pointed to a PNG or JPG and, if so, downloaded the image.
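
As a rough sketch of the gathering step: the endpoint and response fields below come from Pushshift’s public comment search, but the CSV columns, helper name, and pagination details are my own and differ from the Colab linked below.

```python
import csv
import time

import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/comment/"
QUERY = '"I\'m a volunteer content transcriber for Reddit!"'


def fetch_transcription_comments(out_path="tor_comments.csv"):
    """Page backwards through Pushshift, saving matching comments to CSV.

    Sketch only: field names (body, link_id, created_utc, subreddit) follow
    Pushshift's comment schema, but rate limits and the exact phrase-query
    syntax may need adjusting.
    """
    before = None
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "link_id", "subreddit", "created_utc", "body"])
        while True:
            params = {"q": QUERY, "size": 100, "sort": "desc"}
            if before is not None:
                params["before"] = before
            resp = requests.get(PUSHSHIFT_URL, params=params)
            resp.raise_for_status()
            batch = resp.json().get("data", [])
            if not batch:
                break
            for c in batch:
                writer.writerow(
                    [c["id"], c["link_id"], c["subreddit"], c["created_utc"], c["body"]]
                )
            before = batch[-1]["created_utc"]  # continue from the oldest comment seen
            time.sleep(1)  # stay polite to the API
```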

There were a handful of edge cases:

  • Some images had been removed from the internet or were otherwise inaccessible.
  • A few others had duplicate names across domains and were skipped for expediency.
  • Some posts had multiple comments with transcriptions; only the first downloaded comment was kept in those cases.

You can find the Colab doing the downloading here (plus earlier exploration and failed attempts). If you want to undertake this project yourself, you will need around 20 GB of storage. As is, the code saves to Google Drive, but this can easily be changed. Joining each transcription comment to its screenshot via the transcription’s post id (link_id) produces a dataset of 95,766 complete transcriptions at the time of writing.
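
For a sense of what that join looks like, here is a minimal pandas sketch. The file layout (a single screenshots/ directory with images named by link_id) and column names are assumptions for illustration, not necessarily how the Colab organizes things.

```python
import os

import pandas as pd

# Transcription comments gathered from Pushshift.
comments = pd.read_csv("tor_comments.csv")

# Some posts have multiple transcription comments; keep only the first downloaded one.
comments = comments.drop_duplicates(subset="link_id", keep="first")

# Map each downloaded screenshot back to its post id. This assumes images were
# saved as "<link_id>.png" / "<link_id>.jpg" in a single directory.
image_dir = "screenshots"
image_files = os.listdir(image_dir)
images = pd.DataFrame(
    {
        "link_id": [os.path.splitext(name)[0] for name in image_files],
        "image_path": [os.path.join(image_dir, name) for name in image_files],
    }
)

# Inner join: only posts with both a transcription and a downloaded image survive.
dataset = comments.merge(images, on="link_id", how="inner")
print(len(dataset), "complete transcriptions")
```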

Analyzing the Dataset

The transcriptions are scattered across many different subreddits. Here is a histogram of the top 30, accounting for 87,878 of the 95,766:
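
For reference, a plot along these lines takes only a few lines of pandas; the subreddit column is an assumption about what was saved in tor_comments.csv.

```python
import matplotlib.pyplot as plt
import pandas as pd

comments = pd.read_csv("tor_comments.csv")

# Count transcriptions per subreddit and keep the 30 largest.
top30 = comments["subreddit"].value_counts().head(30)
print("Top 30 subreddits cover", top30.sum(), "of", len(comments), "transcriptions")

top30.plot(kind="bar", figsize=(12, 4), title="Transcriptions per subreddit (top 30)")
plt.tight_layout()
plt.show()
```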

The r/transcribersofreddit subreddit maintains consistency across its ~5k transcribers by providing templates to use in common scenarios. Exhaustively, they are:

  • Art & Images without Text
  • Images with Text
  • 4Chan & Pictures of Greentext
  • Reddit Posts & Comments
  • Facebook Posts & Comments
  • Text Messages & other messaging apps
  • Twitter posts & replies
  • Comics
  • GIFs
  • Code
  • Memes
  • Other sources

I wrote basic regexes to produce a coarse classification of most transcriptions according to these templates. The template that each transcription follows is plotted in a histogram below.
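
The exact patterns aren’t important, but to make the idea concrete, here is a simplified sketch of that classifier. The labels mirror the template names in the histograms; the patterns themselves are an approximation of the ones I used, keyed on the transcription’s header line (e.g. “Image Transcription: Twitter Post”).

```python
import re

# Rough, illustrative patterns keyed on the header line of a transcription.
# They approximate the regexes I actually used; the exact ones live in the Colab.
TEMPLATE_PATTERNS = [
    ("TWITTER", re.compile(r"twitter", re.I)),
    ("FACEBOOK", re.compile(r"facebook", re.I)),
    ("REDDIT", re.compile(r"reddit", re.I)),
    ("GREENTEXT", re.compile(r"4chan|greentext", re.I)),
    ("TEXT_MESSAGES", re.compile(r"text message|messaging", re.I)),
    ("COMICS", re.compile(r"comic", re.I)),
    ("GIFS", re.compile(r"\bgif\b", re.I)),
    ("CODE", re.compile(r"\bcode\b", re.I)),
    ("MEMES", re.compile(r"meme", re.I)),
    ("ART_NO_TEXT", re.compile(r"\bart\b|without text", re.I)),
]


def classify(body: str) -> str:
    """Return a coarse template label for one transcription comment."""
    lines = body.strip().splitlines()
    header = lines[0] if lines else ""
    if "image transcription" not in header.lower():
        # Doesn't follow the standard template at all.
        return "OTHER"
    for label, pattern in TEMPLATE_PATTERNS:
        if pattern.search(header):
            return label
    # Follows the standard template but names no specific source.
    return "IMAGES_WITH_TEXT"
```

Applying something like classify to each comment body and counting the labels produces the histograms discussed here.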

Unsurprisingly, IMAGES_WITH_TEXT is the most common category. This is because I made it the fallback category; it essentially means the transcriber just followed the standard template. If a transcription doesn’t follow the template at all, it gets categorized as OTHER. A second histogram below, excluding IMAGES_WITH_TEXT, shows the counts of the other categories with higher fidelity.

Next Up: Baseline Approach

In the blog post that follows, I outline a baseline approach to the problem.
