By Allie Morgan on Nov 30, 2015

Catching Clickbait: Using a Naive Bayesian Classifier in Go

You know clickbait when you see it. Let’s explore how we trained our own naive Bayesian classifier, in Go, to identify those annoying yet somehow addictive articles.

By the time you wake up, check your email and social timelines, you’ve likely seen a bunch of clickbait already. These are articles like “17 Facts You Won’t Believe Are True,” “This Amazing Kid Got To Enjoy 19 Awesome Years On This Planet. What He Left Behind Is Wondtacular,” “See Why We Have An Absolutely Ridiculous Standard of Beauty In Just 37 Seconds,” or “6 Heads You Never Realized Are Also On Mount Rushmore.”

The Oxford English Dictionary broadly defines clickbait as, “(on the Internet) content, especially that of a sensational or provocative nature, whose main purpose is to attract attention and draw visitors to a particular web page.”

Inspired by Paul Graham’s article “A Plan for Spam,” we used our own naive Bayesian classifier in Go to identify these infamous headlines.

Using Bayes’ Theorem

Probabilistic classifiers, like naive Bayes, identify which category a new event most likely belongs to, on the basis of a training set of already-classified events (see supervised learning).

In other words, we’ll try to determine whether a new headline belongs in the category of “clickbait” by running it through our trained/learned definition of “clickbait,” using Bayes’ Theorem.

Bayes’ Theorem tells us the probability of one event occurring, conditional on another event having occurred. Mathematically, it says for events A and B that:

P( A | B ) = P( B | A ) * P( A ) / P( B )

Let’s walk through this formula by example. Say we want to know the probability that, given that a headline contains the word “amazing” (B), it is clickbait (A): P( clickbait | “amazing” ). This is called the posterior probability.

For our training set, I grabbed almost ten thousand headlines in total: clickbait headlines from BuzzFeed’s Buzz section and non-clickbait headlines from Reuters and Al Jazeera[1]. The probability that any given headline contains the exact word “amazing” is pretty low:

P( B ) = P( “amazing” ) = 41 / 9373 = 0.004374

This is called the evidence.

The fraction of headlines in our whole set that were clickbait was:

P( A ) = P( clickbait ) = 4699 / 9373 = 0.5013

This is called the prior.

The final value we need is the probability that a headline contains “amazing,” given that it is clickbait. That is, out of all clickbait headlines, how many contain “amazing”:

P( B | A ) = P( “amazing” | clickbait ) = 40 / 4699 = 0.008512

This is called the likelihood.

Now that we have all the pieces, let’s compute the posterior: the probability that a headline is clickbait, given that it contains the word “amazing”:

P( clickbait | “amazing” ) = (0.008512 * 0.5013) / 0.004374 = 0.9756

The presence of the word “amazing” indicates there’s a 98% probability that the headline is clickbait. On the flip side, a headline that includes President Obama’s name has only a 4% probability of being clickbait.

P( clickbait | “Obama” ) = P( “Obama” | clickbait ) * P( clickbait ) / P( “Obama” )
                         = (0.0008512 * 0.5013) / 0.0110957004
                         = 0.03846
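
If you want to check this kind of arithmetic yourself, here is a minimal Go sketch of Bayes’ theorem applied to the three figures quoted above for “Obama.” The function name posterior is made up for illustration and is not part of Multibayes.

```go
package main

import "fmt"

// posterior applies Bayes' theorem:
// P(A | B) = P(B | A) * P(A) / P(B).
// The name is hypothetical, chosen for this sketch only.
func posterior(likelihood, prior, evidence float64) float64 {
	return likelihood * prior / evidence
}

func main() {
	// Figures quoted in the text for the word "Obama".
	likelihood := 0.0008512 // P("Obama" | clickbait)
	prior := 0.5013         // P(clickbait)
	evidence := 0.0110957004 // P("Obama")

	// Prints: P(clickbait | Obama) = 0.03846
	fmt.Printf("P(clickbait | Obama) = %.5f\n", posterior(likelihood, prior, evidence))
}
```

Swapping in the likelihood, prior, and evidence for any other word gives that word’s posterior the same way.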

Using the Bag-of-words Model

Say we wanted to evaluate a new headline: “8 Amazing Sandwiches You’ve Probably Never Heard Of[2].” We’ll use a simplified representation of this headline called a “bag-of-words” model. Each headline is represented as an unordered bag of words, and every word is an event to consider in Bayes’ theorem.

Let’s break this down:

P( clickbait | “8 Amazing Sandwiches You've Probably Never Heard Of” ) 
= P( “8 Amazing Sandwiches You've Probably Never Heard Of” | clickbait )*P( clickbait ) / P(“8 Amazing Sandwiches You've Probably Never Heard Of”)

where

P( “8 Amazing Sandwiches You've Probably Never Heard Of” ) = P( “8” )*P( “Amazing” )*...*P( “Of” )

And here comes the naive part of naive Bayes: we assume that, given the category, every word is conditionally independent of the others. That is,

P( “8 Amazing Sandwiches You've Probably Never Heard Of” | clickbait ) 
	= P( “8” | clickbait )*P( “Amazing” | clickbait )*...*P( “Of” | clickbait )
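
Here is a rough sketch of that naive calculation in Go. It is not the Multibayes implementation: the model type, its methods, and the four-headline training set are all hypothetical. It also makes choices the text does not describe. It adds one to every count so an unseen word never zeroes out a product, it sums logarithms instead of multiplying small probabilities, and it normalizes across the categories so the scores sum to one, rather than dividing by the product of the P(word) terms.

```go
package main

import (
	"fmt"
	"math"
)

// model holds counts from a labeled training set (hypothetical type).
type model struct {
	wordCounts map[string]map[string]int // class -> word -> headlines containing it
	docCounts  map[string]int            // class -> number of training headlines
}

func newModel() *model {
	return &model{wordCounts: map[string]map[string]int{}, docCounts: map[string]int{}}
}

// train records one tokenized headline under its class label.
// Each distinct word is counted once per headline.
func (m *model) train(class string, words []string) {
	if m.wordCounts[class] == nil {
		m.wordCounts[class] = map[string]int{}
	}
	m.docCounts[class]++
	seen := map[string]bool{}
	for _, w := range words {
		if !seen[w] {
			seen[w] = true
			m.wordCounts[class][w]++
		}
	}
}

// posterior returns P(class | words) for every class, treating the words
// as conditionally independent given the class (the "naive" assumption).
func (m *model) posterior(words []string) map[string]float64 {
	total := 0
	for _, n := range m.docCounts {
		total += n
	}
	logScores := map[string]float64{}
	maxLog := math.Inf(-1)
	for class, n := range m.docCounts {
		score := math.Log(float64(n) / float64(total)) // log prior
		for _, w := range words {
			// Add-one smoothing: an assumption of this sketch.
			p := float64(m.wordCounts[class][w]+1) / float64(n+2)
			score += math.Log(p) // log likelihood of this word
		}
		logScores[class] = score
		maxLog = math.Max(maxLog, score)
	}
	// Normalize so the class scores sum to one.
	sum := 0.0
	for _, s := range logScores {
		sum += math.Exp(s - maxLog)
	}
	out := map[string]float64{}
	for class, s := range logScores {
		out[class] = math.Exp(s-maxLog) / sum
	}
	return out
}

func main() {
	m := newModel()
	// Toy, made-up training data; a real set needs thousands of headlines.
	m.train("clickbait", []string{"number", "amazing", "sandwich"})
	m.train("clickbait", []string{"number", "amazing", "fact"})
	m.train("news", []string{"obama", "sign", "bill"})
	m.train("news", []string{"market", "fall", "number"})

	p := m.posterior([]string{"number", "amazing", "sandwich"})
	// Prints: clickbait: 0.90, news: 0.10
	fmt.Printf("clickbait: %.2f, news: %.2f\n", p["clickbait"], p["news"])
}
```

Working in log space matters once headlines get long: multiplying many probabilities below one quickly underflows a float64, while adding their logarithms does not.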

Features of Multibayes

Let’s discuss a few features we have within our classifier.

First of all, you can guess that words like “of” carry very little meaning. They will occur frequently in both clickbait and non-clickbait headlines (see stop words). We remove words like this from all headlines in our training set and new headlines we want to consider.

Secondly, we replace all numbers with the placeholder “number,” literally. Consider: “11 Brilliant Tips For Eating Healthy On A Budget” and “25 Words Only “Arrested Development” Fans Will Really Understand”.

It’s the fact that those headlines contain numbers, not specifically 11 or 25, that we really care about.

And finally, we’ll stem and lowercase all of the individual words in a headline (see stemming). Imagine one headline about a pug and one about pugs. We care more about the weight that “pug” carries than the fact that there is more than one.
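
A minimal sketch of those three steps in Go might look like this. The stop-word list is a tiny made-up sample, stem only strips a trailing “s” as a crude stand-in for a real stemming algorithm, and none of the names come from Multibayes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a small illustrative sample, not a complete list.
var stopWords = map[string]bool{
	"a": true, "and": true, "for": true, "in": true,
	"of": true, "on": true, "the": true, "to": true,
}

// stem is a deliberately crude placeholder for a real stemmer:
// it only removes a single trailing "s" from longer words.
func stem(w string) string {
	if len(w) > 3 && strings.HasSuffix(w, "s") && !strings.HasSuffix(w, "ss") {
		return strings.TrimSuffix(w, "s")
	}
	return w
}

// isNumber reports whether w consists only of digits.
func isNumber(w string) bool {
	if w == "" {
		return false
	}
	for _, r := range w {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

// tokenize lowercases a headline, splits it on anything that is not a
// letter or digit, swaps numbers for "number", drops stop words, and
// stems what remains. Note that apostrophes also split words here.
func tokenize(headline string) []string {
	fields := strings.FieldsFunc(strings.ToLower(headline), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []string
	for _, w := range fields {
		switch {
		case isNumber(w):
			out = append(out, "number")
		case stopWords[w]:
			continue
		default:
			out = append(out, stem(w))
		}
	}
	return out
}

func main() {
	// Prints: [number brilliant tip eating healthy budget]
	fmt.Println(tokenize("11 Brilliant Tips For Eating Healthy On A Budget"))
}
```

The same function has to run on both the training headlines and any new headline, otherwise the words being looked up will not match the words that were counted.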

Try Our Interactive Classifier

If you’re a fellow developer, you’re probably thinking, “That’s great. I wanna try!” Cool, here’s an interactive classifier to try out.

If you’re someone like our editor and content strategist, who wonders how much of their website is made up of potential clickbait, you’re stressing, but will probably play with this classifier, too.

[Interactive classifier: enter a headline to see whether it is clickbait, along with its clickbait and not-clickbait scores.]

Check out our Multibayes repo on GitHub to learn more about our one weird trick. It’ll be amazing.

Footnotes

  1. We assume that all of the headlines from BuzzFeed’s Buzz section are clickbait. This assumption seems pretty fair to me, browsing most of the headlines. The idea of what is clickbait is debatable though. BuzzFeed says they don’t do clickbait. The Atlantic says pretty much everyone does clickbait.
  2. By the way, we find this headline has a 99.999998% probability of being clickbait.

By Allie Morgan, Data Scientist

Allie Morgan is a physicist-turned-data-scientist at Lytics. Her favorite word is “schadenfreude.” She enjoys biking, vegetable co-ops, dancing, and not inviting her colleagues to her dance performances.