What I want to do here is pretty simple: implement a Machine Learning algorithm to predict whether my product will go hot on Product Hunt, reaching the top of the featured list for a day (or more).
If you are not familiar with Product Hunt, it is a website that lets users share and discover new products. The site includes a comments system and a voting system similar to Hacker News or Reddit. The products with the most votes rise to the top of each day’s list (Wikipedia).
This is a school project made for the Hadoop course. The subject is to develop a personal project using Big Data technologies.
I first wanted to make something music-related, like guitar tab metrics from 911Tabs, or predicting the next shitty commercial song you should write/compose/play to make millions and get bi*** (while losing your dignity for the rest of your life).
As an avid Product Hunt user, I chose this subject since I see a ton of really interesting projects getting hunted every day from the daily list and newsletter.
I’m not going to lie, I had no idea at the beginning, and I am writing these lines two days before the deadline. This project may not get finished, but I hope it will.
The aim is to build the most precise prediction project I can.
However, I really don’t have a lot of knowledge about Machine Learning, which is kinda lame. But as you may know, ~~Google~~ DuckDuckGo is my friend (yours too, that duck is nice).
OK, I first searched for “big data projects examples”, mostly because I wasn’t convinced about coming up with a project myself. That was clearly not the best idea, as I only found things about getting Big Data jobs. Then, while doing my Product Hunt routine, I came across PublicAPIs.com, which is The Largest API Directory In The Galaxy. While exploring it, I remembered that Mashape was a very good website for finding APIs to play with, so I started browsing it too. However, I didn’t find anything interesting.
Stahp it
Too much blah here, I am going to talk about what I do/did for the actual project.
I don’t know where to begin. The first thing I did was to register a PH application on their website to get a token for their API. I read through their documentation and I “read” some books about machine learning. One of them was Python Machine Learning by Sebastian Raschka, where I learned about artificial neurons and the basics of implementing a perceptron learning algorithm in Python. Needless to say, I didn’t understand everything, as I’m not the best at mathematics. While reading, I thought that someone might already have made something more user-friendly for non-mathematicians, and I found scikit-learn, which is built on NumPy, SciPy, and matplotlib, some well-known scientific computing Python packages.
What I want to do first is a simple script that uses the PH API and outputs some results to the user: basically, a list of the best topics of 2016.
By browsing through this data, I will need to figure out which parameters work best for predicting future top projects based on historical votes.
The first API call I make is `/v1/posts/knight-touchbar-2000`. This is a totally useless project I made that got featured on PH; it received 120+ upvotes, which was enough to be among the most upvoted products for a day or two.
I am going to base my tests on this, since I know what impact the product had on social networks and tech news sites: ~10 tech websites talking about it; ~200 stars on GitHub; ~100k interactions on social media over a month; plus some job opportunities I got.
From that API call, I can pick out some interesting keys:
key | description |
---|---|
`category_id` | the category of the post |
`day` | the date the post was created |
`comments_count` | number of comments |
`featured` | whether the post has been featured |
`votes_count` | number of votes |
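To give an idea, a small helper could pull just those keys out of a post dict from the API response. This is a sketch; the sample values below are made up for the example, not taken from the real response:

```python
def extract_keys(post):
    """Keep only the fields we care about from a Product Hunt post dict."""
    return {
        "category_id": post.get("category_id"),
        "day": post.get("day"),
        "comments_count": post.get("comments_count"),
        "featured": post.get("featured"),
        "votes_count": post.get("votes_count"),
    }

# Hypothetical excerpt of a /v1/posts/knight-touchbar-2000 response
sample_post = {
    "name": "Knight TouchBar 2000",
    "category_id": 1,
    "day": "2016-11-20",
    "comments_count": 14,
    "featured": True,
    "votes_count": 127,
    "tagline": "Play the Knight Rider animation on your MacBook Touch Bar",
}

print(extract_keys(sample_post))
```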
Based on those keys, I can get started on building something:
If the post is featured, then my program will count it as a good product and keep it for future use. Using the number of days between the creation date and today, along with the number of votes, I can also guess whether the product will be interesting to people or not.
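That heuristic could be sketched like this. The function name and the votes-per-day threshold are my own assumptions for the example, not values fixed anywhere in the project:

```python
from datetime import date

def looks_interesting(post, today, min_votes_per_day=50.0):
    """Rough guess: featured posts that keep a decent vote rate per day.

    `post["day"]` is the creation date as YYYY-MM-DD, as the API returns it.
    """
    if not post["featured"]:
        return False
    created = date.fromisoformat(post["day"])
    age_days = max((today - created).days, 1)  # avoid dividing by zero on day one
    return post["votes_count"] / age_days >= min_votes_per_day

post = {"featured": True, "day": "2016-12-14", "votes_count": 120}
print(looks_interesting(post, today=date(2016, 12, 16)))  # 120 votes / 2 days = 60/day
```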
Now that I have some interesting keys, I may be able to move on to something bigger.
With the route `/v1/posts?days_ago=1`, I can get all the posts of the previous day.
What I want to do here is to loop over the days: fetch all the posts from a given day that have been featured with at least `x` upvotes (50 would be good), save them in a structured file for future use, then repeat for the previous day, and so on.
That is a sneak peek of a small Python file I wrote. I set the delimiter to `;`, but I am going to change it to `\t` (the tab character), since a product name may contain a semicolon.
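The core of such a script could look like the following sketch. In the real script the posts come from the API; here the list is hard-coded, and the names and numbers are made up for the example:

```python
import sys

DELIMITER = "\t"  # tab, since a product name may contain a semicolon

def post_to_line(post):
    """Format one post as: name, created date, featured flag, upvotes."""
    return DELIMITER.join([
        post["name"],
        post["day"],
        str(post["featured"]),
        str(post["votes_count"]),
    ])

# In the real script these come from GET /v1/posts?days_ago=N
posts = [
    {"name": "Some; Product", "day": "2016-12-15", "featured": True, "votes_count": 231},
]

for post in posts:
    sys.stdout.write(post_to_line(post) + "\n")
```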
This is the representation of each column:
Product Name | Created date | Is Featured | Number of upvotes |
---|---|---|---|
Then I also added the topics:
Now the columns are:
Product Name | Created date | Is Featured | Number of upvotes | Topics |
---|---|---|---|---|
The next thing to do is to export those lines into a text file. I can do it in multiple ways, so I am going to keep it simple and append stdout to a `products.txt` file.
```
python ph_fetch.py -t [dev token] -d 350 >> products.txt
```
Now that I have data, I can begin to work with them.
As I said earlier, I’m not that good at mathematics. However, I guess there are two things I need to determine before actually predicting anything.
Firstly, I will surely need to find a way to measure error: which data do I need to analyse to tell whether a result is more or less good?
From what I read, getting an average can be interesting. Let’s try getting the average of the upvotes.
This is an interesting find: Kaggle is a website about Big Data where datasets and competitions can be found, and this link is about ways to measure errors.
It must be very interesting, but those mathematical formulas are, to me, like a thug standing in front of a piece of contemporary art.
Anyway, let’s try getting that average of upvotes:
From January 1st of 2016 to December 16th of 2016 (350 days), the average upvotes are about 235 (rounded).
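Computing that average from the tab-delimited file is straightforward. A sketch, assuming the column layout above (upvotes in the fourth column; the sample lines are made up):

```python
def average_upvotes(lines):
    """Rounded average of the upvotes column in tab-delimited product lines."""
    votes = [int(line.rstrip("\n").split("\t")[3]) for line in lines if line.strip()]
    return round(sum(votes) / len(votes))

# Usage with the real file:
# with open("products.txt") as f:
#     print(average_upvotes(f))

sample = [
    "Knight TouchBar 2000\t2016-11-20\tTrue\t120\n",
    "Some Product\t2016-12-15\tTrue\t350\n",
]
print(average_upvotes(sample))  # → 235
```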
Finding correlations
The path I am taking may not be the right one, but I keep moving forward. Now that I have the value I want to predict, I am going to look for correlations between it and all the other data I have (aka the columns).
What I want to do is to convert the columns into a dataframe.
I started doing this with pandas, but I couldn’t get the dataframe working to compute a correlation between the columns. So I launched Spotify and started listening to some Gojira, since I can’t do what I want to do.
What I basically want is a simple histogram and some basic correlations between the columns.
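For the record, what I was trying to get out of pandas looks roughly like this. It is a sketch with made-up rows; `corr()` works on numeric values, so the boolean column is cast to 0/1 first:

```python
import pandas as pd

columns = ["name", "day", "featured", "votes_count"]
rows = [
    ["Product A", "2016-01-02", True, 410],
    ["Product B", "2016-01-02", True, 95],
    ["Product C", "2016-01-03", False, 30],
]
df = pd.DataFrame(rows, columns=columns)

# In the real script the data would come from the exported file, e.g.:
# df = pd.read_csv("products.txt", sep="\t", names=columns)

# Pairwise correlation between the numeric columns
print(df[["featured", "votes_count"]].astype(float).corr())

# A quick histogram of the upvotes (needs matplotlib):
# df["votes_count"].hist(bins=50)
```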
I wish I had more time to finish this, but I guess my skills aren’t sufficient to get what I expected at the beginning. There are some tutorials I followed and books I skimmed through, but none of them could explain the interesting concepts of machine learning.