Transcribing Social Posts with ML: A Baseline

Miles Hutson
Mar 25, 2021

This post is part of a series on transcribing screenshots from Reddit into a more accessibility-friendly format. It stands on its own, but if you want to know more about the dataset, see the previous post here.

The Dataset

The dataset is described in depth in my previous blog post. Recapping, we have a dataset from r/TranscribersOfReddit pairing images with transcriptions provided by human volunteers. These volunteers follow templates with categories as follows:

  • Art & Images without Text
  • Images with Text
  • 4Chan & Pictures of Greentext
  • Reddit Posts & Comments
  • Facebook Posts & Comments
  • Text Messages & other messaging apps
  • Twitter posts & replies
  • Comics
  • GIFs
  • Code
  • Memes
  • Other sources

A Baseline Approach

I wanted to build an off-the-shelf baseline for the problem of captioning Reddit posts for a couple of reasons. Without a basis for comparison, there is no way to know how much more complicated approaches improve things. Also, simple solutions sometimes work quite well.

This quick hand-drawn diagram illustrates the architecture of the baseline approach.

To elaborate, there are a few components here. The first in the flow is the BiT classifier. BiT is a pretrained ResNet-50 model from Google that comes with helpful recommended hyperparameters for finetuning, a huge help when you just have a desktop and CoLab to accomplish this project. The classifier determines which category an image belongs to.

The next model in the flow is completely out of the box: Tesseract OCR. This open source OCR engine is maintained by Google. Using Tesseract OCR, I iterate over blocks of text in left-to-right, top-to-bottom order, producing a string like the following:

Block 1 BLOCK Block 2 BLOCK
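As a rough illustration, here is a minimal sketch of that OCR step using pytesseract. The helper name and the exact separator convention are my approximations of what the CoLabs do, not the exact code.

import pytesseract
from PIL import Image

def ocr_to_blocks(image_path):
    # Run Tesseract and group recognized words by Tesseract's block number.
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    blocks = {}  # block_num -> [top, left, words]
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        block = blocks.setdefault(
            data["block_num"][i], [data["top"][i], data["left"][i], []]
        )
        block[0] = min(block[0], data["top"][i])
        block[1] = min(block[1], data["left"][i])
        block[2].append(word)
    # Order blocks top to bottom, then left to right, and join them with the
    # literal BLOCK separator that the downstream seq2seq model sees.
    ordered = sorted(blocks.values(), key=lambda b: (b[0], b[1]))
    return " BLOCK ".join(" ".join(words) for _, _, words in ordered)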

Finally, I use the open source Huggingface Transformers library to train a Bert2Bert sequence-to-sequence model. The input to the model is the category of a post followed by the OCR result, while the output is expected to match the human transcription. As an example, this screen capture has the following input and target (a minimal setup sketch for the model follows the example):

Input:

TWITTER_POST ®. jacksfilms @ @jacksfilms — 13h
i Taking your #jackask questions now!
. Ask me about absolutely anything

BLOCK © 2570 td 9% © 5,193 oe

BLOCK lol
@PyrocynicalTV

BLOCK Replying to @jacksfilms

BLOCK #jackask where is my wife
9:55 PM: 16 Nov 18

Target:

* Image Transcription : Twitter Post * — — — * * jacksfilms * *, @ jacksfilms Taking your # jackask questions now! Ask me about absolutely anything & gt ; * * lol * *, @ PyrocynicalTV & gt ; \ # jackask where is my wife — — — ^ ^
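A Bert2Bert model like this can be warm-started from pretrained BERT checkpoints with the Transformers EncoderDecoderModel class. The sketch below is a minimal setup, not the exact configuration from the linked CoLab, and the example input string is abbreviated.

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Warm-start both the encoder and the decoder from BERT; the decoder's
# cross-attention weights are new and get learned during finetuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Special-token bookkeeping needed for seq2seq generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Inputs look like "<CATEGORY> <OCR text>"; targets are the human transcriptions.
inputs = tokenizer(
    "TWITTER_POST jacksfilms @jacksfilms Taking your #jackask questions now!",
    return_tensors="pt", truncation=True, max_length=512,
)
generated = model.generate(inputs.input_ids, max_length=256)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

Training then pairs tokenized inputs with tokenized transcriptions as labels, for example via the Seq2SeqTrainer.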

But what about images?

A notable deficiency of this baseline approach is the simplifying assumption that all of the content relevant to a transcription will be contained in the OCR results from Tesseract. Besides the inaccuracies of OCR, this leaves out most of the structure of the page, which can be necessary context, e.g. for knowing whether a comment is a sibling or a child of another comment. It also leaves out the visual content of any comic, meme, image, etc. attached to the post. An improved model would probably marry the OCR and language models with some kind of Seek, Attend and Tell model variant.

Results: Image Classifier

BiT R50x1 was chosen for its strong performance and easy finetuning. I finetuned it with early stopping to classify each image. Since the orientation and complete content of each image may be important to classification, I forgo the usual augmentation with random crops and flips. This model might prove useful later on as well: a more advanced solution might have bespoke models for different common scenarios, or might only be run on scenarios where it is known to perform well. I have also created a Tensorflow Datasets definition for the dataset in order to train, as required by many open source TF models. The dataset definition is available at this GitHub repo.
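For context, a finetuning setup along these lines might look like the sketch below. The TF Hub handle points at the publicly released BiT-M R50x1 feature extractor; the input size, head, optimizer settings, and early-stopping patience are illustrative assumptions rather than the values from the CoLab.

import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 12  # one output per transcription template category

# Publicly released BiT-M R50x1 feature extractor, finetuned end to end.
bit = hub.KerasLayer("https://tfhub.dev/google/bit/m-r50x1/1", trainable=True)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(512, 512, 3)),
    bit,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=3e-3, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Screenshots are only resized (no random crops or flips), so orientation and
# the complete content of each image are preserved. Early stopping halts
# training when validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[early_stop])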

Unfortunately, training the classifier on all categories of the dataset results in a confusion matrix that looks like this:

Essentially every category can be mistaken for the more generic IMAGES_WITH_TEXT. Unfortunately, this makes sense. Many transcribers use the generic template regardless of whether a more specific one is available, so images in any category might be spuriously labeled as generic. Training a model only on examples that have more specific categories than ART_AND_IMAGES_WITHOUT_TEXT, IMAGES_WITH_TEXT, and OTHER results in a better confusion matrix.

In a pipeline for transcribing posts, this model could be used to filter for high-performing categories. If you want to pick up where I left off, I’m linking the training and eval CoLabs for the model trained on all categories, as well as the training and eval CoLabs for non-generic categories.
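Concretely, that gating step might look like the glue-code sketch below. The category whitelist, confidence threshold, and load_and_resize helper are hypothetical placeholders; classifier, transcriber, and tokenizer stand in for the models described above, and ocr_to_blocks is the OCR helper sketched earlier.

# Hypothetical pipeline glue: only forward posts whose predicted category the
# end-to-end model is known to handle well; defer everything else to humans.
RELIABLE_CATEGORIES = {"TWITTER_POST", "FACEBOOK_POST", "REDDIT_POST"}
CONFIDENCE_THRESHOLD = 0.9

def maybe_transcribe(image_path, classifier, category_names, tokenizer, transcriber):
    probs = classifier.predict(load_and_resize(image_path))[0]
    category = category_names[probs.argmax()]
    if category not in RELIABLE_CATEGORIES or probs.max() < CONFIDENCE_THRESHOLD:
        return None  # leave this post to the human volunteers
    text = f"{category} {ocr_to_blocks(image_path)}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    output = transcriber.generate(inputs.input_ids, max_length=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)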

Results: Transformer Model

I trained the Bert2Bert model in this CoLab for 6,000 global steps with a batch size of 16, roughly one epoch of the Transcribers of Reddit dataset. Looking at the graph of validation loss and Rouge scores, we can reasonably guess that the model converged. For a quantitative measure of the baseline approach, I judge the transformer model via Rouge 2.0 in this CoLab. The numbers are overall lower than in the graph above because evaluation during training used more favorable settings. A breakdown of the scores by category goes as follows:

Plot of Rouge2 precision, recall, and fmeasure for Bert2Bert model by category. Sorted in descending order by fmeasure.
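For anyone reproducing the evaluation, per-example Rouge-2 precision, recall, and f-measure can be computed with the open source rouge_score package and then averaged by category, as in the sketch below. This is not necessarily the exact tooling used in the eval CoLab.

from collections import defaultdict
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def rouge2_by_category(examples):
    # examples: iterable of (category, reference_transcription, prediction).
    totals = defaultdict(lambda: [0.0, 0.0, 0.0, 0])  # precision, recall, f, count
    for category, reference, prediction in examples:
        score = scorer.score(reference, prediction)["rouge2"]
        agg = totals[category]
        agg[0] += score.precision
        agg[1] += score.recall
        agg[2] += score.fmeasure
        agg[3] += 1
    return {
        cat: {"precision": p / n, "recall": r / n, "fmeasure": f / n}
        for cat, (p, r, f, n) in totals.items()
    }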

A curious property of these measures is the yawning gap between recall and precision in all cases. This is probably a product of one of the most interesting properties of the trained models.

The Expected

My hope for the baseline model was that concatenating the category of the post and the OCR results from Tesseract would capture a lot of low-hanging fruit. A fine-tuned Transformer would then be able to more or less directly translate the category into the template, and fill in that template with a cleaned-up copy of the OCR results. In many examples, that is exactly what happened. For one of them, see this post:

The OCR results are a mess:

a.

BLOCK lam an Avon representative and | am wondering if | can become a vendor at
your event?

BLOCK 4.Comment
) Like © Comment > Share
TE © o>: Flea Market Organizer
Like - Reply 30
ee | 2425s you're 101?
Like - Reply 30
|| don't think they take direct sales.
Like Reply 30 Random Person
Like - Reply 36

BLOCK actually if you park your
pink cadillac out front and the punx can take a sledge
hammer to it for $5 a pop you'r in.

BLOCK Like Reply. 1d oO:

BLOCK 0 EE eee”

BLOCK Like - Reply - 18h

BLOCK

But the prediction is actually a pretty competent transcription:

Image Transcription : Facebook Post * - - - * * [ REDACTED ] * * I am an Avon representative and I am wondering if I can become a vendor at your event? & gt ; * * 1st * * : I don't think they take direct sales. * * 2nd * * actually if you park your pink cadillac out front and the punx can take a sledge hammer to it for $ 5 a pop you'r in. - - [UNK] - - ^ ^

The Unexpected

Pretty often, though, things get weird. Take, for example, the prediction generated for this post:

[CLS] * Image Transcription : * - - - [ * The image is a screenshot of a Facebook post. * ] * * Facebook User * * : I don't know what you're talking about, but I'm not sure if it's a joke or something. I've never heard of a joke, but it is a joke. I have no idea what this is about. I would like to tell you more about it. I am not sure what I am talking about. But I think I can't tell you how to make it work. I don ’ t know how to do that. I just want to know what I'll do to make you feel like someone else. I want to be honest with you and I'd like to know if you have any questions about how I am doing. If you want to talk to me about it, you'll be in a relationship with me. I think you'd be interested in making a conversation with me about my life. I hope you have a great day and I feel like you've got a great night with you. I know you's not interested in trying to make a difference in your life. You're a very good person. [SEP]

What??

Things get a bit clearer when we take a look at the OCR results for this post:

The OCR result misses everything! And it seems the model knows it! Since we don't take images into account, the model has become a storyteller: it will happily guess and add content related to whatever might be in the image.

Some common themes:

  • An obsession with dresses & clothing:
[ * There is an image of a woman in a blue dress and a red dress. The woman is in a black dress and is wearing an orange dress, and she has a blue skirt over her waistband. She is holding a pink shirt to her chest, and is holding her right hand in her left hand, and making a small gesture * ]
  • An obsession with multilevel marketing culture (not a shock, since there is so much of it in the corpus):
[ * Image of a hun holding up a bottle of Young Living essential oils. * ]

Another fun quirk of the model is that it will make up awards received by Reddit posts, since the OCR cannot reliably pick these out. It doesn’t really understand the award system, but it’s trying its best:

1, Silver, Silver 1, Gold x3, Silver2, Pink, Silver 4, Silver # 1, Pink x20, Pink200, Silver200 +, Silver Silver x203, Pink

Fin (for now) & Appendix

A future direction for the project could be to add a way for the model to attend to images, rather than just the textual content of a page. Additionally, picking a decent operating point for the classifier model might allow us to build a transcription bot and make the job of human volunteers on Reddit a bit easier.

More detailed breakdown of Bert2Bert scores
