You know clickbait when you see it. Let’s explore how we train our own naive Bayes classifier, in Go, to identify those annoying, yet somehow addictive articles.
By the time you wake up, check your email and social timelines, you’ve likely seen a bunch of clickbait already. These are articles like “17 Facts You Won’t Believe Are True,” “This Amazing Kid Got To Enjoy 19 Awesome Years On This Planet. What He Left Behind Is Wondtacular,” “See Why We Have An Absolutely Ridiculous Standard of Beauty In Just 37 Seconds,” or “6 Heads You Never Realized Are Also On Mount Rushmore.”
The Oxford English Dictionary broadly defines clickbait as, “(on the Internet) content, especially that of a sensational or provocative nature, whose main purpose is to attract attention and draw visitors to a particular web page.”
In other words, we’ll try to determine whether the contents of a new headline belong in the category of “clickbait” by running it through our trained/learned definition of “clickbait,” using Bayes’ theorem.
Bayes’ theorem tells us the probability of an event occurring, conditional on another event that has already occurred. Mathematically, it says for events A and B that:
P( A | B ) = P( B | A ) * P( A ) / P( B )
Let’s walk through this formula by example. Say we want to know the probability that, given that a headline contains the word “amazing” (B), it is clickbait (A): P( A | B ) = P( clickbait | “amazing” ). This is called the posterior probability.
For our training set, I grabbed almost ten thousand headlines: clickbait headlines from BuzzFeed’s Buzz section and non-clickbait headlines from Reuters and Al Jazeera combined. The probability that any given headline contains exactly the word “amazing” is pretty low:
P( B ) = P( “amazing” ) = 41 / 9373 = 0.004374
This is called the evidence, or marginal probability.
The percentage of headlines that were clickbait out of our whole set was:
P( A ) = P( clickbait ) = 4699 / 9373 = 0.5013
This is called the prior probability.
The final variable we need to calculate is, given that a headline is clickbait, the likelihood that it contains “amazing.” That is, out of all clickbait headlines, what fraction contain “amazing”:
P( B | A ) = P( “amazing” | clickbait ) = 40 / 4699 = 0.008512
This is called the likelihood.
Now that we have all the important factors figured out, let’s determine the probability that a headline is clickbait given that its title contains the word “amazing”:
P( clickbait | “amazing” ) = (0.008512 * 0.5013) / 0.004374 = 0.9756
The presence of the word “amazing” indicates there’s a 98% probability that the article is clickbait. On the flip side, headlines that include President Obama’s name have only a 4% probability of being clickbait.
P( clickbait | “Obama” ) = P( “Obama” | clickbait ) * P( clickbait ) / P( “Obama” ) = (0.0008512 * 0.5013) / 0.0110957004 = 0.03846
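The same calculation can be done directly from raw counts. Here is a minimal Go sketch of that idea; the function name `posterior` is made up for illustration, the “amazing” counts come from the text, and the “Obama” counts (4 and 104) are back-computed from the probabilities quoted above, so treat them as assumptions.

```go
package main

import "fmt"

// posterior applies Bayes' theorem to raw headline counts:
// P(class | word) = P(word | class) * P(class) / P(word).
// The name and signature are hypothetical, for illustration only.
func posterior(wordInClass, classTotal, wordTotal, total int) float64 {
	likelihood := float64(wordInClass) / float64(classTotal) // P(word | class)
	prior := float64(classTotal) / float64(total)            // P(class)
	evidence := float64(wordTotal) / float64(total)          // P(word)
	return likelihood * prior / evidence
}

func main() {
	// Counts for "amazing" are taken from the worked example in the text.
	fmt.Printf("P(clickbait | amazing) = %.4f\n", posterior(40, 4699, 41, 9373))
	// Counts for "Obama" are assumed (inferred from the quoted probabilities).
	fmt.Printf("P(clickbait | Obama)   = %.4f\n", posterior(4, 4699, 104, 9373))
}
```

Because the prior and the evidence share the same denominator, the result reduces to the number of clickbait headlines containing the word divided by the number of all headlines containing it.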
Say we wanted to evaluate a new headline: “8 Amazing Sandwiches You’ve Probably Never Heard Of.” We’ll use a simplified representation of this headline called a “bag-of-words” model. Each headline is represented as an unordered bag of words, and every word is an event to consider in Bayes’ theorem.
Let’s break this down:
P( clickbait | “8 Amazing Sandwiches You've Probably Never Heard Of” ) = P( “8 Amazing Sandwiches You've Probably Never Heard Of” | clickbait )*P( clickbait ) / P(“8 Amazing Sandwiches You've Probably Never Heard Of”)
P( “8 Amazing Sandwiches You've Probably Never Heard Of” ) = P( “8” )*P( “Amazing” )*...*P( “Of” )
And here comes the naive part of naive Bayes: we assume that each word is conditionally independent of the other words, given the category. That is,
P( “8 Amazing Sandwiches You've Probably Never Heard Of” | clickbait ) = P( “8” | clickbait )*P( “Amazing” | clickbait )*...*P( “Of” | clickbait )
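To make the product of per-word probabilities concrete, here is a rough Go sketch of scoring a bag of words. It is not the Multibayes implementation: the type and function names are hypothetical, the counts for “number” and “sandwich” are invented, and two techniques are swapped in that the text does not mention. It adds one to each count (add-one smoothing) so unseen words don’t zero out the product, and it sums logarithms instead of multiplying to avoid underflow. It also normalizes over the two categories rather than dividing by the product of word probabilities.

```go
package main

import (
	"fmt"
	"math"
)

// classStats holds training counts for one category (hypothetical type).
type classStats struct {
	headlines  int            // number of training headlines in this category
	wordCounts map[string]int // headlines in this category containing each word
}

// logScore returns log P(class) + sum of log P(word | class).
// Add-one smoothing is an assumption of this sketch, not from the article.
func logScore(words []string, c classStats, totalHeadlines int) float64 {
	score := math.Log(float64(c.headlines) / float64(totalHeadlines))
	for _, w := range words {
		p := float64(c.wordCounts[w]+1) / float64(c.headlines+2)
		score += math.Log(p)
	}
	return score
}

// clickbaitPosterior normalizes the two category scores so they sum to 1.
func clickbaitPosterior(words []string, clickbait, other classStats) float64 {
	total := clickbait.headlines + other.headlines
	a := logScore(words, clickbait, total)
	b := logScore(words, other, total)
	// e^a / (e^a + e^b) rewritten as 1 / (1 + e^(b-a)).
	return 1 / (1 + math.Exp(b-a))
}

func main() {
	// "amazing" counts follow the text; the rest are made up for illustration.
	clickbait := classStats{headlines: 4699, wordCounts: map[string]int{
		"amazing": 40, "number": 2000, "sandwich": 12,
	}}
	other := classStats{headlines: 4674, wordCounts: map[string]int{
		"amazing": 1, "number": 300, "sandwich": 2,
	}}
	words := []string{"number", "amazing", "sandwich"}
	fmt.Printf("P(clickbait | headline) = %.4f\n",
		clickbaitPosterior(words, clickbait, other))
}
```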
Let’s discuss a few features we have within our classifier.
First of all, you can guess that words like “of” carry very little meaning. They will occur frequently in both clickbait and non-clickbait headlines (see stop words). We remove words like this from all headlines in our training set and new headlines we want to consider.
Secondly, we replace all numbers with the placeholder “number,” literally. Consider: “11 Brilliant Tips For Eating Healthy On A Budget” and “25 Words Only “Arrested Development” Fans Will Really Understand”.
It’s the fact that those headlines contain numbers, not specifically 11 or 25, that we really care about.
And finally, we’ll stem and lowercase all of the individual words in a headline (see stemming). Imagine one headline about a pug and one about pugs. We care more about the weight that “pug” carries than the fact that there is more than one.
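The three preprocessing steps above could look roughly like this in Go. This is only a sketch under stated assumptions: the stop-word list is a tiny illustrative sample, `naiveStem` just strips a trailing “s” as a stand-in for a real stemming algorithm, and all function names are hypothetical rather than taken from our classifier.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative sample; a real list would be much longer.
var stopWords = map[string]bool{
	"a": true, "an": true, "the": true, "of": true, "on": true, "for": true,
}

// isNumber reports whether s is non-empty and consists only of digits.
func isNumber(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

// naiveStem strips a trailing "s" as a crude stand-in for a real stemmer.
func naiveStem(w string) string {
	if len(w) > 3 && strings.HasSuffix(w, "s") && !strings.HasSuffix(w, "ss") {
		return strings.TrimSuffix(w, "s")
	}
	return w
}

// tokenize lowercases a headline, drops stop words, replaces numbers
// with the placeholder "number", and stems the remaining words.
func tokenize(headline string) []string {
	// Split on anything that is not a letter, digit, or apostrophe.
	fields := strings.FieldsFunc(strings.ToLower(headline), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '\'' && r != '’'
	})
	var tokens []string
	for _, f := range fields {
		switch {
		case isNumber(f):
			tokens = append(tokens, "number")
		case stopWords[f]:
			continue // skip stop words entirely
		default:
			tokens = append(tokens, naiveStem(f))
		}
	}
	return tokens
}

func main() {
	fmt.Println(tokenize("11 Brilliant Tips For Eating Healthy On A Budget"))
}
```

Note that both “pug” and “pugs” come out as “pug” here, so they contribute to the same word count.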
If you’re a fellow developer, you’re probably thinking, “That’s great. I wanna try!” Cool, here’s an interactive classifier to try out.
If you’re someone like our editor and content strategist, who wonders how much of their website consists of potential clickbait, you’re stressing, but will probably play with this classifier, too.
Check out our Multibayes repo on GitHub to learn more about our one weird trick. It’ll be amazing.