HN Insights: Playing with Hacker News data

My latest experiment revolves around APIs, analytics and data extraction from Hacker News stories. Hacker News is a vibrant community that lots of people (me included) visit often, so I decided to grab some of its data, analyze it and see what interesting results might come out of it.

The insights I wanted are related to the topics that frequently occur in Hacker News stories: which of them make it to the top stories, what authors mostly talk about, etc. Since I’m no NLP or linguistics expert, I decided to just throw the data into Elasticsearch and see what it had to say about it. Instead of presenting the results in a nicely designed dashboard with interactive widgets, I decided to serve them through a REST API that everyone could use and explore.

I should probably state right now that this is just my toy. It does not aspire to provide any super important, business critical information. It’s meant just for fun and to satisfy my own curiosity. (I also saw it as a great opportunity to try out my rest-analytics module.)

Getting the data

Once per hour, I hit Algolia’s Hacker News API to fetch the latest Top stories, Ask HN stories, Show HN stories and comments. Each of these categories is assigned an Elasticsearch index, into which I push each item as I get it. These indices are configured to ignore common terms when searching their textual contents.
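To make the fetching step a bit more concrete, here is a rough sketch of how the hourly request and the index configuration could look. It is not the actual code behind the service. The index names, the `hn_text` analyzer name, the choice of the `front_page` tag for top stories and the `title`/`author` mapping are all assumptions of mine for illustration; the Algolia endpoint, its `tags` and `numericFilters` parameters and the Elasticsearch `stopwords` setting are the documented ones.

```javascript
// Hypothetical mapping from each category to an Algolia tag and an index name.
const CATEGORIES = {
  top: { tag: "front_page", index: "hn-top" },
  ask: { tag: "ask_hn", index: "hn-ask" },
  show: { tag: "show_hn", index: "hn-show" },
  comments: { tag: "comment", index: "hn-comments" },
};

// Build the Algolia URL for items of one category created after a given
// Unix timestamp (in seconds), newest first.
function buildFetchUrl(category, sinceEpochSeconds) {
  const { tag } = CATEGORIES[category];
  const params = new URLSearchParams({
    tags: tag,
    numericFilters: `created_at_i>${sinceEpochSeconds}`,
    hitsPerPage: "100",
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}

// Index definition that drops common English words at analysis time.
// The analyzer name and the field mapping are assumptions.
const INDEX_SETTINGS = {
  settings: {
    analysis: {
      analyzer: {
        hn_text: { type: "standard", stopwords: "_english_" },
      },
    },
  },
  mappings: {
    properties: {
      title: { type: "text", analyzer: "hn_text" },
      author: { type: "keyword" },
    },
  },
};
```

An hourly job would call something like `buildFetchUrl("ask", lastRunTimestamp)`, fetch the result, and push every hit into the index named in `CATEGORIES.ask.index`.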

It’s all about the questions

I built the API using Node.js and Express, which is a great way to bootstrap an API. Initially, I implemented some basic queries using terms aggregations, trying to see which terms appear most often in user stories. The next step was to see which authors posted most often and made it to the top stories. The fun part started when I searched for stories with specific keywords and then collected their most frequent terms.
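For the curious, the queries described above boil down to something like the following sketch. The helper names, the `frequent_terms` aggregation name and the `title`/`author` field names are hypothetical; only the query DSL shape (`match`, `match_all`, `terms`) comes from Elasticsearch itself. Note that on recent Elasticsearch versions an analyzed text field needs `fielddata: true` in its mapping before a terms aggregation can run on it.

```javascript
// Build a search body that optionally filters stories by keyword and then
// buckets the most frequent terms of a field.
function buildFrequentTermsQuery({ keyword = null, field = "title", size = 20 } = {}) {
  return {
    size: 0, // only the aggregation buckets are needed, not the hits
    query: keyword ? { match: { [field]: keyword } } : { match_all: {} },
    aggs: {
      frequent_terms: { terms: { field, size } },
    },
  };
}

// Flatten an Elasticsearch aggregation response into [{ term, count }].
function extractTerms(response) {
  return response.aggregations.frequent_terms.buckets.map((bucket) => ({
    term: bucket.key,
    count: bucket.doc_count,
  }));
}
```

Calling `buildFrequentTermsQuery({ field: "author" })` against the top stories index would give the most active authors, while `buildFrequentTermsQuery({ keyword: "rust" })` would give the terms that most often accompany a keyword.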

I will not elaborate on the results yet. I will probably do this in a future post. I encourage you to try and see for yourself. The API is located at https://hn-insights.herokuapp.com. Documentation about its endpoints is hosted there. Feel free to play with it!
