Published on March 11, 2018 by

To support our marketing team, we created a tool that analyzes article headlines and contents. It gives insights into how to write headlines and models the potential “virality” of a content piece. This was particularly challenging because of the limited NLP support for the Polish language. The tool is in active use by our marketing team.

Using the Facebook API, we collected data from the fan pages of Polish portals that publish articles online. Based on the number of shares, comments, likes and other reactions, we defined a virality coefficient, which measures how much potential each article has to go viral and therefore how interesting it is from a marketing perspective. Given this dataset, we wanted to identify the catchiest phrases occurring in article titles and to check whether the content actually matters. We examined how the best-performing phrases change over time and clustered them by meaning. Moreover, we automated the process of distinguishing phrases tied to one-time events (27-1) from those that occur regularly. We also consider the impact of other headline features on the virality of the article, and we examine features derived from the article content and its formatting. A higher-level virality analysis links articles covering the same topic, which requires adding each article's HTML code to our dataset and extracting the body text from it.
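The abstract does not state how the reactions are combined into the virality coefficient. As a rough illustration only, one plausible form is a weighted sum of reactions, normalized by page size and log-scaled. The weights, the `page_followers` normalization and the function name below are all assumptions, not the formula used in the talk.

```python
import math

# Hypothetical weights: the talk does not say how reactions are weighted.
WEIGHTS = {"shares": 3.0, "comments": 2.0, "likes": 1.0}


def virality(post, page_followers):
    """Illustrative virality coefficient (an assumption, not the talk's formula).

    post: dict with optional integer counts under the keys in WEIGHTS.
    page_followers: size of the fan page, used to compare pages fairly.
    """
    # Weighted engagement; missing reaction types count as zero.
    engagement = sum(w * post.get(kind, 0) for kind, w in WEIGHTS.items())
    # Engagement per 1000 followers, log-scaled to tame heavy-tailed counts.
    return math.log1p(1000.0 * engagement / max(page_followers, 1))


example = {"shares": 120, "comments": 45, "likes": 900}
score = virality(example, page_followers=250_000)
```

Normalizing by follower count is a design choice that keeps large portals from dominating the ranking. A study of absolute reach would drop it.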

During our talk we will cover the following areas:
Data collection:

Facebook API (headline, article link, reactions)
downloading HTML code
article text extraction
Data preprocessing:


token, bigram, trigram, starting and ending phrases frequencies and scores
variance and entropy – automatic detection of one-off, regular and seasonal headlines/topics
cross-validation on different time intervals and using different news sources
virality score vs headline length
Analyses:

all of the above analyses for article text and HTML code
topic analysis (LDA)

ensemble modeling with regression and classification algorithms to predict virality
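Two items from the outline above, n-gram scores and entropy-based detection of one-off versus regular phrases, can be sketched in plain Python. This is a minimal sketch under stated assumptions: whitespace tokenization stands in for proper Polish-language preprocessing, the phrase score is taken to be the mean virality of headlines containing the phrase, and the function names are hypothetical rather than taken from the talk.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_scores(headlines, n=2, min_count=2):
    """Mean virality of the headlines that contain each n-gram.

    headlines: iterable of (title, virality) pairs.
    Phrases seen in fewer than min_count headlines are dropped as too noisy.
    """
    totals, counts = defaultdict(float), Counter()
    for title, score in headlines:
        # set() so a phrase repeated within one headline is counted once.
        for phrase in set(ngrams(title.lower().split(), n)):
            totals[phrase] += score
            counts[phrase] += 1
    return {p: totals[p] / c for p, c in counts.items() if c >= min_count}


def temporal_entropy(period_counts):
    """Normalized Shannon entropy of a phrase's occurrences across time periods.

    Values near 0 mean the phrase is concentrated in one period (one-off event);
    values near 1 mean it is spread evenly (regular topic).
    """
    total = sum(period_counts)
    if total == 0 or len(period_counts) < 2:
        return 0.0
    probs = [c / total for c in period_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(period_counts))
```

For example, a phrase with weekly counts `[10, 0, 0, 0]` has entropy 0 and would be flagged as a one-off, while `[5, 5, 5, 5]` scores approximately 1 and would be treated as regular. Any threshold between the two would need tuning on real data.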

PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
