What I want to do here is pretty simple: implement a Machine Learning algorithm to predict whether my product will go hot on Product Hunt, reaching the top of the featured list for a day (or more).
If you are not familiar with Product Hunt, it is a website that lets users share and discover new products. The site includes a comments system and a voting system similar to Hacker News or Reddit. The products with the most votes rise to the top of each day’s list (Wikipedia).
This is a school project made for the Hadoop course. The subject is to develop a personal project using Big Data technologies.
I first wanted to make something music-related, like guitar tab metrics from 911Tabs, or predicting the next shitty commercial song you should write/compose/play to make millions and get bi*** (while losing your dignity for the rest of your life).
As an avid Product Hunt user, I chose this subject since I see a ton of really interesting projects getting hunted every day from the daily list and newsletter.
I’m not going to lie, I had no idea at the beginning, and I am writing these lines two days before the deadline. This project may not get finished, but I hope it will.
The aim is to build the most precise prediction project I can.
However, I really don’t have a lot of knowledge about Machine Learning, which is kinda lame. But as you may know, ~~Google~~ DuckDuckGo is my friend (yours too, that duck is nice).
OK, I first searched for “big data projects examples”, mostly because I wasn’t convinced about coming up with a project myself. That was clearly not the best idea, as I only found things about getting Big Data jobs. Then, while doing my Product Hunt routine, I came across PublicAPIs.com, which is The Largest API Directory In The Galaxy. While exploring it, I remembered that Mashape was a very good website for finding APIs to play with, so I started browsing it too. However, I didn’t find anything interesting.
Stahp it
Too much blah here, I am going to talk about what I do/did for the actual project.
I don’t know where to begin. The first thing I did was to register a PH application on their website to get a token for their API. I read through their documentation and I “read” some books about machine learning. One of them was Python Machine Learning by Sebastian Raschka, where I learned about artificial neurons and the basics of implementing a perceptron learning algorithm in Python. Needless to say, I didn’t understand everything, as I’m not the best at mathematics. While reading, I thought that someone might already have made something more user-friendly for non-mathematicians, and I found scikit-learn, which is built on NumPy, SciPy, and matplotlib, some well-known scientific computing Python packages.
What I want to do first is a simple script that uses the PH API and outputs some results to the user: basically, a list of the best topics of 2016.
By browsing through this data, I will need to figure out which parameters work best for predicting future top projects based on historical votes.
The first API call I make is `/v1/posts/knight-touchbar-2000`. This is a totally useless project I made that got featured on PH; it received 120+ upvotes, which was enough to be among the most upvoted products for a day or two.
I am going to base my tests on this, since I know what impact the product had on social networks and tech news sites: ~10 tech websites talking about it; ~200 stars on GitHub; ~100k interactions on social media over a month; plus some job opportunities I got.
From that API call, I can pick out some interesting keys:
key | description |
---|---|
`category_id` | the category of the post |
`day` | the date the post was created |
`comments_count` | number of comments |
`featured` | whether the post has been featured |
`votes_count` | number of votes |
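To give an idea, a small helper could pull just those keys out of a post dict from the API response. This is a sketch; the sample values below are made up for the example, not taken from the real response:

```python
def extract_keys(post):
    """Keep only the fields we care about from a Product Hunt post dict."""
    return {
        "category_id": post.get("category_id"),
        "day": post.get("day"),
        "comments_count": post.get("comments_count"),
        "featured": post.get("featured"),
        "votes_count": post.get("votes_count"),
    }

# Hypothetical excerpt of a /v1/posts/knight-touchbar-2000 response
sample_post = {
    "name": "Knight TouchBar 2000",
    "category_id": 1,
    "day": "2016-11-20",
    "comments_count": 14,
    "featured": True,
    "votes_count": 127,
    "tagline": "Play the Knight Rider animation on your MacBook Touch Bar",
}

print(extract_keys(sample_post))
```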
Based on those keys, I can get started on building something:
If the post is featured, then my program will count it as a good product and keep it for future use. Using the number of days between the creation date and today, along with the number of votes, I can also guess whether the product will be interesting to people or not.
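That heuristic could be sketched like this. The function name and the votes-per-day threshold are my own assumptions for the example, not values fixed anywhere in the project:

```python
from datetime import date

def looks_interesting(post, today, min_votes_per_day=50.0):
    """Rough guess: featured posts that keep a decent vote rate per day.

    `post["day"]` is the creation date as YYYY-MM-DD, as the API returns it.
    """
    if not post["featured"]:
        return False
    created = date.fromisoformat(post["day"])
    age_days = max((today - created).days, 1)  # avoid dividing by zero on day one
    return post["votes_count"] / age_days >= min_votes_per_day

post = {"featured": True, "day": "2016-12-14", "votes_count": 120}
print(looks_interesting(post, today=date(2016, 12, 16)))  # 120 votes / 2 days = 60/day
```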
Now that I have some interesting keys, I may be able to move on to something bigger.
With the route `/v1/posts?days_ago=1`, I can get all the posts of the previous day.
What I want to do here is to loop over the days: fetch all the posts from a given day that have been featured with at least `x` upvotes (50 would be good), save them in a structured file for future use, then repeat for the previous day, and so on.
That is a sneak peek of a small Python file I wrote. I set the delimiter to `;`, but I am going to change it to `\t` (the tab character), since a product name may contain a semicolon.
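The core of such a script could look like the following sketch. In the real script the posts come from the API; here the list is hard-coded, and the names and numbers are made up for the example:

```python
import sys

DELIMITER = "\t"  # tab, since a product name may contain a semicolon

def post_to_line(post):
    """Format one post as: name, created date, featured flag, upvotes."""
    return DELIMITER.join([
        post["name"],
        post["day"],
        str(post["featured"]),
        str(post["votes_count"]),
    ])

# In the real script these come from GET /v1/posts?days_ago=N
posts = [
    {"name": "Some; Product", "day": "2016-12-15", "featured": True, "votes_count": 231},
]

for post in posts:
    sys.stdout.write(post_to_line(post) + "\n")
```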
This is the representation of each column:
Product Name | Created date | Is Featured | Number of upvotes |
---|---|---|---|
Then I also added the topics:
Now the columns are:
Product Name | Created date | Is Featured | Number of upvotes | Topics |
---|---|---|---|---|
The next thing to do is to export those lines into a text file. I can do it in multiple ways, so I am going to keep it simple and append stdout to a `products.txt` file.
```
python ph_fetch.py -t [dev token] -d 350 >> products.txt
```
Now that I have data, I can begin to work with them.
As I said earlier, I’m not that good at mathematics. However, I guess there are two things I need to determine before actually predicting anything.
Firstly, I will surely need to find a way to measure error: which data do I need to analyse to tell whether a result is more or less good?
From what I read, getting an average can be interesting. Let’s try getting the average of the upvotes.
This is an interesting find: Kaggle is a website about Big Data where datasets and competitions can be found, and this link is about ways to measure errors.
It must be very interesting, but those mathematical formulas are, to me, like a thug standing in front of a piece of contemporary art.
Anyway, let’s try getting that average of upvotes:
From January 1st of 2016 to December 16th of 2016 (350 days), the average upvotes are about 235 (rounded).
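Computing that average from the tab-delimited file is straightforward. A sketch, assuming the column layout above (upvotes in the fourth column; the sample lines are made up):

```python
def average_upvotes(lines):
    """Rounded average of the upvotes column in tab-delimited product lines."""
    votes = [int(line.rstrip("\n").split("\t")[3]) for line in lines if line.strip()]
    return round(sum(votes) / len(votes))

# Usage with the real file:
# with open("products.txt") as f:
#     print(average_upvotes(f))

sample = [
    "Knight TouchBar 2000\t2016-11-20\tTrue\t120\n",
    "Some Product\t2016-12-15\tTrue\t350\n",
]
print(average_upvotes(sample))  # → 235
```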
Finding correlations
The path I am taking may not be the right one, but I keep moving forward. Now that I have the value I want to predict, I am going to look for correlations between it and all the other data I have (aka the columns).
What I want to do is to convert the columns into a dataframe.
I started doing this with pandas, but I couldn’t get the dataframe working to compute a correlation between the columns. So I launched Spotify and started listening to some Gojira, since I can’t do what I want to do.
What I basically want is a simple histogram and some basic correlations between the columns.
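For the record, what I was trying to get out of pandas looks roughly like this. It is a sketch with made-up rows; `corr()` works on numeric values, so the boolean column is cast to 0/1 first:

```python
import pandas as pd

columns = ["name", "day", "featured", "votes_count"]
rows = [
    ["Product A", "2016-01-02", True, 410],
    ["Product B", "2016-01-02", True, 95],
    ["Product C", "2016-01-03", False, 30],
]
df = pd.DataFrame(rows, columns=columns)

# In the real script the data would come from the exported file, e.g.:
# df = pd.read_csv("products.txt", sep="\t", names=columns)

# Pairwise correlation between the numeric columns
print(df[["featured", "votes_count"]].astype(float).corr())

# A quick histogram of the upvotes (needs matplotlib):
# df["votes_count"].hist(bins=50)
```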
I wish I had more time to finish this, but I guess my skills aren’t sufficient to get what I expected at the beginning. There are some tutorials I followed and books I skimmed through, but none of them could explain the interesting concepts of machine learning.