This is the first part of an X-part article.
First of all, a question: how do you find out about a fake viral article?
Does somebody tell you about it? Do you use special software? Do you follow Facebook and Twitter trends to find out what is viral?
To find out how journalists do this today, I made this Google Form. I encourage you to complete it so we can get a better understanding of how journalists and bloggers find out about viral news articles.
What we need to build to accomplish this:
The Outbrake is a service that I want to develop for journalists, allowing them to find articles before they go viral.
By automatically crawling the FB pages that usually share fake or misleading articles, journalists can see the “next lies” before they become so popular that, even if you later explain that something is not true, more people have already heard the fake version.
An article that has 600K shares today typically had around 5K shares in its first hour, 11K in the second, and 19K in the third.
Using this growth pattern, we can detect these articles well before they spread from one vertical into another and become viral.
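As a sketch, the detection signal could be as simple as checking whether hourly share counts are accelerating. The function name and the growth threshold below are my own illustrative assumptions, not part of the actual service:

```python
def is_going_viral(hourly_shares, min_growth=1.1):
    """Return True if the hour-over-hour share growth is accelerating.

    `hourly_shares` is a list of cumulative share counts, one per hour.
    `min_growth` is a hypothetical threshold: each hourly increase must
    be at least that multiple of the previous one.
    """
    if len(hourly_shares) < 3:
        return False  # need at least three hourly samples to see a trend
    # Hour-over-hour increases in the cumulative share count.
    deltas = [b - a for a, b in zip(hourly_shares, hourly_shares[1:])]
    # Accelerating: every hourly increase beats the previous one.
    return all(later >= min_growth * earlier
               for earlier, later in zip(deltas, deltas[1:]))

# The numbers from the article: 5K, 11K, 19K cumulative shares.
print(is_going_viral([0, 5_000, 11_000, 19_000]))  # → True
```

A real detector would also need to normalize for the page's audience size, but the shape of the idea is the same.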
Using a custom Python web parser, we crawl a list of the top 1,000 US news websites every hour and add the articles to a database. The second time we parse the same link, we add the new snapshot to the database and calculate the difference in the number of likes, reactions, and shares using the Facebook API.
We monitor each story for 3 days before we stop indexing that particular link.
If the story resurfaces in our database later, we monitor it again for 3 days before stopping again.
Only for around 5% of the posts, the ones we see becoming viral, do we download the comments, so that we can analyze them and see what people are talking about.
The rationale is that we can automate the process of learning what an article is about, and how valid it is, based on the comments users post on it.
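As a toy illustration of that comment analysis, here is a word-frequency pass over a post's comments; a real pipeline would use proper NLP, and the stopword list and example comments below are made up:

```python
from collections import Counter
import re

# Hypothetical minimal stopword list for the sketch.
STOPWORDS = {"the", "a", "is", "this", "to", "and", "it", "of", "from"}

def top_topics(comments, n=3):
    """Return the n most frequent non-stopword words in the comments."""
    words = re.findall(r"[a-z']+", " ".join(comments).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

print(top_topics([
    "This is fake, the photo is from 2012",
    "Fake news, snopes debunked the photo already",
]))  # → ['fake', 'photo', 'news']
```

Even this crude signal surfaces words like "fake" or "debunked" that hint at how readers judge an article's validity.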
# Getting the articles from the FB pages.
1,000 FB pages * 24 hours = 24,000 requests per day.
If the average size of a request is 1MB, per day we need to download 24GB from Facebook.
Per month, this means 720GB.
# Getting the number of comments, likes, shares from the FB API
1,000 news websites * 50 articles per day * 5 days (the average time an article stays in the crawl) = 250,000 articles being tracked, i.e. 250,000 requests per hour.
Per day, this means 6,000,000 requests.
If the average size of a request is 100KB, per hour we need to download 25GB from Facebook; per day, 600GB. Per month, the minimum bandwidth needed is 18TB.
# Getting the text comments for the Viral Articles.
Around 1-5% of all articles.
Per day, we will download comments for around 10,000 viral/top news articles.
Per month, this means 300,000 articles * 100 nested comment pages = 30,000,000 requests.
If the average download per article is 1MB, per day we need to download 10GB from Facebook.
Per month, the minimum bandwidth needed is 300GB.
This is data we will keep almost entirely, so it's costly on the server side.
Web crawler API payload:
1,000 news websites * 50 articles per day = 50,000 requests per day.
If the average size of a request is 1MB, this means 50GB per day.
Per month, this means 1.5TB.
Per month, we need to send over 200M API requests to Facebook and download over 20TB of data in total.
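The back-of-envelope totals above can be re-derived in a few lines; all figures are the article's own estimates, not measurements:

```python
# 1. Crawling the FB pages: 1,000 pages, hourly, ~1MB per request.
page_requests_day = 1_000 * 24                       # 24,000 requests/day
page_gb_month = page_requests_day * 30 / 1_000       # ~720 GB/month

# 2. FB API stats: 250,000 tracked articles polled hourly at ~100KB.
api_requests_day = 250_000 * 24                      # 6,000,000 requests/day
api_gb_month = api_requests_day * 30 * 0.1 / 1_000   # ~18,000 GB = 18 TB

# 3. Comments for the viral ~5%: ~10,000 articles/day, ~1MB per article,
#    ~100 nested comment pages each.
comment_requests_month = 10_000 * 30 * 100           # 30,000,000 requests
comment_gb_month = 10_000 * 30 / 1_000               # ~300 GB/month

# 4. Web crawler payload: 50,000 article fetches/day at ~1MB.
crawler_gb_month = 50_000 * 30 / 1_000               # ~1,500 GB = 1.5 TB

fb_requests_month = (page_requests_day * 30
                     + api_requests_day * 30
                     + comment_requests_month)
total_gb_month = (page_gb_month + api_gb_month
                  + comment_gb_month + crawler_gb_month)

print(f"{fb_requests_month:,} Facebook requests/month")   # over 200M
print(f"{total_gb_month:,.0f} GB/month")                  # over 20 TB
```

The Facebook API polling dominates both the request count and the bandwidth, so that is where rate limits and costs will bite first.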
I will officially kick off this project at the second @Debug Politics hackathon, on the 9th to 11th of December in San Francisco.
But I am already working on it. What I need is somebody with experience in database design to create the back-office architecture, and also somebody with experience in design.
I need financing to push this project forward, for the server and other costs associated with it. If you want to be part of this project, either as a sponsor or as a developer, email me at [email protected]
I collaborate with Rise Project, where I do data analysis and pattern recognition to uncover patterns in unstructured datasets.