Hot news detection using Wikipedia




Monday, June 29, 2015
Wikipedia is, indeed, one of the largest free-access, crowdsourced [1] sources of information in today's world. Every day, thousands of people land on its pages to get information about different topics. Moreover, many machine learning applications (including text mining, the semantic web, etc.) get their input from Wikipedia; for example, Google’s Knowledge Graph is mainly built around Wikipedia [2].

Every day, different news articles are published in different news media. Among them, only a few attract significant attention from the public. The media refers to these stories as “hot news.” Usually, within a news organization, experts decide whether a story is hot. Of course, they use their own domain knowledge and other sources, such as the number of “likes” an article receives on Facebook or the number of retweets on Twitter. Yet in a data-driven world, in which we derive so many metrics from data, this manual approach is neither scalable nor automated. As data scientists, we love to find these things in data and decrease the role of humans in the analytic process (creating humanless analyses, if you will).

This is the goal of this post. Here, I show how we can use the number of Wikipedia article page views to determine if a news story is hot. This approach is fully data-driven and does not need any human supervision.

Intuition:

To understand the intuition, think of a time when you read a news article. If the news is interesting to you, you might like to dig a little more into it. You might read more about its history, or look for a more in-depth description. Hence, you will search for the topic. Interesting news is therefore marked by increasing search trends. In other words, as a news item starts to attract more attention, the number of searches on topics related to that subject increases accordingly.

Clearly, a fraction of those searches will land on a Wikipedia page. Wikipedia articles are the main source many people use to read more about a topic. This means that the more people search for a topic, the more visits there will be to the Wikipedia pages related to that topic. Let’s look at some examples.

Examples:

You may have heard about the recent charges of corruption in the FIFA organization. The following plot shows the daily views of the FIFA Wikipedia page over the last 90 days (as of June 28, 2015). Clearly, around May 25th, there was an upward trend in visits to the page. Why? That was the date when the corruption news became “hot.”


As of this writing, the US Supreme Court has declared same-sex marriage legal [3]. Let’s look at page visits to a Wikipedia page on same-sex marriage.

There is no doubt that this topic has received a great deal of attention, and the data confirms that this is a hot news item. As the chart above indicates, the number of page visits has gone from around 1,000 per day to around 50,000 per day in the last two days alone.

Hot news detection:

With the above intuition and examples, hot news detection using Wikipedia page access seems fairly straightforward. First, we need the page view counts for Wikipedia pages. Then, we can create a page view history for every page. Next, for each page, we use a simple trend change detection algorithm to check whether visits to that page are increasing in a statistically significant way. If so, we can infer that the page covers a topic related to a hot news item.

In other words, data-driven hot news detection using Wikipedia contains four steps (item 5 below points to the open source implementation):

1- Download the dataset

Wikipedia provides the page view statistics for all of its articles, and these are publicly available [4]. This data can be downloaded here:
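
Once the hourly dump files are on disk, they have to be reduced to per-article counts. The post does not describe the file layout, so the Python sketch below assumes the space-separated form "<project> <page_title> <view_count> <bytes>" and keeps only English Wikipedia ("en") lines. The function name and the sample lines are made up for illustration, and this is not the author's R code.

```python
from collections import Counter
from typing import Iterable


def count_english_views(lines: Iterable[str]) -> Counter:
    """Sum view counts per English Wikipedia article for one hourly dump.

    Assumption: every valid line is "<project> <title> <views> <bytes>".
    """
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        project, title, count, _size = parts
        if project != "en":
            continue  # keep English Wikipedia only
        try:
            views[title] += int(count)
        except ValueError:
            continue  # skip lines whose view count is not a number
    return views


# Invented sample lines, only to show the assumed layout.
sample = [
    "en FIFA 1200 34567890",
    "en Same-sex_marriage 950 23456789",
    "de FIFA 300 4567890",
    "en FIFA 50 1234567",
    "broken line",
]
hourly = count_english_views(sample)
print(hourly["FIFA"])
```

For a real compressed dump, the same function can be fed a file handle such as `gzip.open(path, "rt", encoding="utf-8", errors="replace")` from the standard library.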

2- Create a history for pages

There are almost 5 million Wikipedia articles in English. Although this is a huge amount of English-language content, the page statistics easily fit into a couple of hundred megabytes of memory. Hence, we can easily create a 90-day page view history for all English-language Wikipedia articles.
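
One way to hold such a rolling history in memory is a fixed-length queue per article. The sketch below is an assumption about how this could be organized (the class `PageHistory` and its methods are hypothetical names, not taken from the author's program): each day's counts are appended, days on which a page had no recorded views are stored as 0, and anything older than the window drops off automatically.

```python
from collections import defaultdict, deque

HISTORY_DAYS = 90  # window length mentioned in the text


class PageHistory:
    """Rolling per-article view history (hypothetical helper)."""

    def __init__(self, max_days=HISTORY_DAYS):
        self.max_days = max_days
        self._history = defaultdict(lambda: deque(maxlen=max_days))
        self._days_seen = 0

    def add_day(self, daily_views):
        """Append one day of {title: views}; missing pages count as 0."""
        expected_len = min(self._days_seen, self.max_days)
        for title in set(self._history) | set(daily_views):
            series = self._history[title]
            # Back-fill zeros for pages that show up for the first time.
            while len(series) < expected_len:
                series.append(0)
            series.append(daily_views.get(title, 0))
        self._days_seen += 1

    def series(self, title):
        """Return the stored history for a page, oldest day first."""
        return list(self._history.get(title, []))


# Tiny window and invented counts, so the roll-over is easy to see.
history = PageHistory(max_days=3)
history.add_day({"FIFA": 10})
history.add_day({"FIFA": 12, "Same-sex_marriage": 5})
history.add_day({"Same-sex_marriage": 7})
history.add_day({"FIFA": 40})
print(history.series("FIFA"))
```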

3- Detect any (increasing) trend changes in page access

Now that we have the page view history for the Wikipedia articles, we just need to use a trend change detection algorithm to find pages that are receiving more visits now compared to their history. I have a separate blog post on this:
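
The author's own algorithm is described in that separate post and is not reproduced here. As a stand-in, the sketch below uses a plain z-score style test: the mean of the most recent days is compared with the mean and spread of the earlier days. The function names, the two-day window, the floor on the spread, and the threshold of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def spike_score(series, recent_days=2, min_baseline=7):
    """How many 'spreads' the recent mean sits above the baseline mean.

    Stand-in detector (not the author's algorithm). Returns 0.0 when the
    history is too short to say anything.
    """
    if len(series) < recent_days + min_baseline:
        return 0.0
    baseline = series[:-recent_days]
    recent = series[-recent_days:]
    base_mean = mean(baseline)
    # Floor the spread so flat or low-traffic pages do not divide by ~0.
    spread = max(stdev(baseline), base_mean ** 0.5, 1.0)
    return (mean(recent) - base_mean) / spread


def is_trending(series, threshold=3.0):
    """Flag a page whose recent views are far above its own history."""
    return spike_score(series) > threshold


# Invented series shaped like the example in the text:
# about 1,000 views per day, then about 50,000 on the last two days.
flat = [1000] * 30
spiking = [1000] * 28 + [50000, 50000]
print(is_trending(flat), is_trending(spiking))
```

Because each page is compared with its own baseline, a consistently popular page is not flagged merely for being popular; only a departure from its usual level counts.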

4- Newslookup

After I find the top pages in terms of changes in their access rates, I use the newslookup.com search API to find news that is correlated with these pages. These news stories are the hot news items that we have been looking for.
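
The interface of the newslookup.com API is not described in this post, so no call to it is shown here. The sketch below only covers the preparation step under stated assumptions: given a hypothetical mapping from page title to trend score, it picks the highest-scoring pages and turns their titles into plain-text queries that could then be passed to a news search. The function name, the cut-off values, and the sample scores are invented for illustration.

```python
from urllib.parse import unquote


def titles_to_queries(scores, top_k=3, min_score=3.0):
    """Turn the highest-scoring page titles into plain-text search queries."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    queries = []
    for title, score in ranked[:top_k]:
        if score < min_score:
            break  # everything after this point scores even lower
        # Page titles use underscores and may be percent-encoded.
        queries.append(unquote(title).replace("_", " "))
    return queries


# Invented scores, for illustration only.
sample_scores = {
    "FIFA": 12.5,
    "Same-sex_marriage": 48.0,
    "Main_Page": 0.2,
    "Sepp_Blatter": 9.1,
}
queries = titles_to_queries(sample_scores)
print(queries)
```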

5- Open source R program to detect hot news

The R program that does all of the above can be found on my GitHub page [5]. Please note that this program detects hourly hot news. If you would like daily hot news, or even weekly hot news, you simply need to change the aggregate function.
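
To illustrate what changing the aggregate function amounts to, here is a small Python sketch (not the author's R code; all names are hypothetical). The same hourly counts are summed into hourly, daily, or weekly buckets simply by swapping the function that maps a timestamp to a bucket label.

```python
from collections import defaultdict
from datetime import datetime


def bucket_hour(ts):
    return ts.strftime("%Y-%m-%d %H:00")


def bucket_day(ts):
    return ts.strftime("%Y-%m-%d")


def bucket_week(ts):
    iso = ts.isocalendar()  # (ISO year, ISO week number, weekday)
    return f"{iso[0]}-W{iso[1]:02d}"


def aggregate(hourly_counts, bucket=bucket_day):
    """Sum (timestamp, views) pairs into the buckets chosen by `bucket`."""
    totals = defaultdict(int)
    for ts, views in hourly_counts:
        totals[bucket(ts)] += views
    return dict(totals)


# Invented hourly counts for a single page.
hourly_counts = [
    (datetime(2015, 6, 26, 22), 100),
    (datetime(2015, 6, 26, 23), 150),
    (datetime(2015, 6, 27, 0), 400),
]
daily = aggregate(hourly_counts, bucket=bucket_day)
weekly = aggregate(hourly_counts, bucket=bucket_week)
print(daily)
```

Coarser buckets smooth out time-of-day effects but make the detector slower to react, which is the trade-off between hourly, daily, and weekly hot news.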

As a Service [Updated July 29]

This technique is available as a web service. It provides both the hot news in the past X hours and the hot Wikipedia pages (those with an increasing trend) in the past 48 hours. Please refer to http://hotnews.ask-tell.info/howto.html for more information about how to access the service.
The hot stories found using this technique are fed to a Twitter channel: @EyeWikiNews. You can follow it and stay tuned to the top news in the world.

[1] It seems that Jimmy Wales does not like the term crowdsourcing for Wikipedia: http://www.sfgate.com/business/article/As-Wikipedia-moves-to-S-F-founder-discusses-3233536.php

15 comments:

  1. Hi Hamed,

    Very neat!! How do you run it? Thank you!

    Sincerely,

    tom

    1. Hi Tom,
      After downloading the source code, you should be able to run main.R. From the command line, try "Rscript main.R", or run it from RStudio. If it doesn't work, please let me know. You can email me at "mhfirooz AT gmail"

      Thank you, and thanks to Wikimedia for making everything publicly available :)

    2. Is it possible to create a webpage with the "hot news" information?
      It would be pretty convenient.

    3. I used to have it as a web application with a RESTful API on AWS, but it got expensive and I had to shut it down. I might set it up again.

  2. You might want to use Azure ML to publish it as a service API; it's relatively cheap. You can also publish it in the Azure ML gallery.

    1. Thanks for the tip. I didn't think about Azure ML; I will take a look. I know it is cheaper than AWS, but I'm not sure if it gives you the same functionality.

  3. There are many requests to have this as a web service, so I am working on it and will have it on the cloud soon. Stay tuned, and thanks for the support.

  4. This comment has been removed by the author.

  5. What about doing the contrary:

    1. Retrieve today's news.
    2. Rank the stories by the change compared to yesterday in:
    2.1 the Google Trends interest score, or
    2.2 the views of the related Wikipedia page.

    1. This is a very good way of calculating the correlation between Google Trends and Wikipedia. Is there any API for the daily Google Trends score?

  6. This technique is available as a web service. Please refer to

    http://hotnews.ask-tell.info/howto.html

    for more information about how to access the service. The hot stories found using this technique are also fed to a Twitter channel:
    @EyeWikiNews.

    Please send me your feedback.

    1. Hey, the link http://hotnews.ask-tell.info/howto.html doesn't seem to work.
      Thanks

  7. Hi Hamed!
    I just read your article 'Hot news detection using Wikipedia' on your blog. I looked up your Twitter account @EyeWikiNews, but I found that it stopped tweeting on Sept. 13th, 2015. What happened to that account? Why did it stop tweeting?
    I'm working on Reddit hot news prediction these days, and I think your work may help me, so I want to ask how well your algorithm works. I look forward to your reply. By the way, your email address is not available; my email is justinho_chn@qq.com. If you can reply to me by email, I'd appreciate it.

    Many thanks,
    Justin


Copyright © 2015 • Ensemble Blogging