As part of the 2018 ASA DataFest held at the University of Waterloo, my three team members and I spent 48 hours processing, analysing and modelling 2.6 GB of job posting data from Indeed.com.
It was a fun and challenging experience. We struggled with cleaning and engineering features on such a large data set, but powered through on caffeine, resilience and brute force when all else failed.
Our efforts were rewarded by the insights we generated, which ended up winning us one of the three awards.
I’ll generalise the steps we took to get to our final results.
Understanding the Data
The first step in any type of data analysis is to get a feel for the data. Are there any anomalies? What can each variable tell us? What is the goal of our analysis? What can external sources or literature tell us about the data?
We found that Indeed was paid by the number of clicks received per job posting, a variable found in our data set. It quickly became apparent that clicks was our variable of interest: not only is it a revenue driver for Indeed, but it is also the preferred metric for companies to measure job visibility.
Without much background knowledge of what motivates clicks, we decided to use unsupervised learning to find the inherent relationships within the data set. Specifically, we used KMeans clustering to group salary, experience required, company rating and the average number of clicks per job posting by industry.
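The clustering step can be sketched roughly as follows. This is a minimal illustration using scikit-learn on synthetic data, not our actual DataFest code: the column names, the three made-up groups, the choice of three clusters and the standardisation step are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the job posting data (NOT the DataFest data set).
# Hypothetical columns: salary, years of experience required,
# company rating, average clicks per posting.
feature_names = ["salary", "experience", "rating", "avg_clicks"]
group_means = np.array([
    [40_000, 1.0, 3.2, 60.0],
    [70_000, 4.0, 3.8, 35.0],
    [110_000, 8.0, 4.3, 15.0],
])
group_sds = np.array([5_000, 0.5, 0.2, 5.0])
X = np.vstack([rng.normal(m, group_sds, size=(100, 4)) for m in group_means])

# Standardise so that salary (tens of thousands) does not dominate
# the Euclidean distances that KMeans relies on.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The number of clusters is an assumption made for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Map the centroids back to the original units so they can be interpreted.
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, dict(zip(feature_names, np.round(centroid, 1))))
```

Reading the centroids in their original units is what makes it possible to describe each cluster in plain terms, such as "lower salary, less experience required, more clicks".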
Ranking Companies by Clusters
Luckily, the clusters were very well defined. Using these clusters, each job posting could be given a letter/number grade based on how it compared against similar postings from the same industry. This information can help companies answer questions about talent acquisition: are we paying too much for experienced individuals? Too little? What about our company rating?
By analysing the centroid values of the clusters, we were also able to draw several socio-economic insights about the job market.
- Company rating may not be indicative of job satisfaction.
The clusters showed a very clear relationship between high paying jobs and high company ratings. However, there is an abundance of research showing that there is an optimal threshold for salary, after which point job satisfaction ceases to increase (due to stress, responsibility and other downsides of high paying jobs).
Possibly, employees rate companies based on perceived satisfaction: higher paid jobs are rated more favourably because the employees in them feel they are being recognised more for their efforts.
- The majority of Indeed.com users are inexperienced and looking for entry level positions.
Jobs with fewer requirements also pay less. These were also the jobs driving the most clicks. This is consistent with the economics of supply and demand: experienced individuals are more sought after and don’t need to spend as much time looking for a job.
Quantifying Clicks with a Quasi-Poisson Model
Next, we modelled clicks (a count variable) using the other variables. We selected Poisson regression over negative binomial and zero-inflated Poisson models based on cross-validation, Vuong’s test and our computational limitations.
While tests of model adequacy were not great (the ratio of residual deviance to degrees of freedom was almost one to one), the residual plots showed no extreme violations of model assumptions. We decided that this model would suffice for descriptive purposes.
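For readers who want to try this kind of check, here is a minimal sketch of fitting a Poisson regression and computing the ratio of residual deviance to residual degrees of freedom. It uses scikit-learn's PoissonRegressor on synthetic data; the predictors, the true coefficients and the sample size are invented for the example and this is not the code we ran at DataFest.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

rng = np.random.default_rng(0)
n = 5_000

# Two hypothetical standardised predictors (assumptions for illustration).
X = rng.normal(size=(n, 2))
true_coef = np.array([-0.3, 0.2])

# Simulate counts from a Poisson model with a log link.
mu = np.exp(1.0 + X @ true_coef)
clicks = rng.poisson(mu)

# alpha=0.0 switches off the default L2 penalty, giving a plain Poisson GLM.
model = PoissonRegressor(alpha=0.0, max_iter=1000)
model.fit(X, clicks)

# Residual deviance is the mean Poisson deviance times the sample size.
fitted = model.predict(X)
residual_deviance = n * mean_poisson_deviance(clicks, fitted)

# Residual degrees of freedom: observations minus coefficients and intercept.
df_resid = n - X.shape[1] - 1
dispersion = residual_deviance / df_resid

print("coefficients:", np.round(model.coef_, 3))
print("deviance / df:", round(dispersion, 2))
```

A Poisson model assumes the variance equals the mean, so this ratio is expected to sit near one; values well above one are commonly read as a sign of overdispersion.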
Using the fitted coefficients, we were able to quantify the effect each explanatory variable had on the number of clicks. Because the model is linear in log(clicks), exponentiating a coefficient gives the multiplicative change in expected clicks.
One finding was particularly actionable, something companies can use (and Indeed can encourage) to drive clicks:
- Detailed job descriptions attract more clicks.
Since Indeed.com offers a preview of the job description without requiring a click, longer job descriptions are preferred over extremely short ones.
With all other variables held constant, a job description of 500+ words yields 1.11 times as many clicks as a job description of fewer than 50 words.
This is an incredibly low-cost action with a meaningful payoff. With further validation, this result could be very useful from a business standpoint.
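To make the 1.11 multiplier concrete, the short sketch below works backwards from it to the implied model coefficient, assuming a log link as in a Poisson regression. The baseline of 100 clicks is a hypothetical number chosen only to make the arithmetic easy to follow.

```python
import math

# Multiplier reported above: 500+ word vs. fewer than 50 word descriptions.
rate_ratio = 1.11

# Under a log link, rate_ratio = exp(beta), so beta = log(rate_ratio).
beta = math.log(rate_ratio)

# Hypothetical baseline: expected clicks for a posting with a very short
# description, all other variables held constant.
baseline_clicks = 100.0
expected_clicks = baseline_clicks * math.exp(beta)

print(f"implied coefficient: {beta:.3f}")                    # 0.104
print(f"expected clicks with 500+ words: {expected_clicks:.1f}")  # 111.0
```

In other words, a coefficient of roughly 0.104 on the log scale corresponds to about 11% more expected clicks.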
While we were very happy with our results, the journey there made the process even better. The past weekend, while challenging, was rewarding on so many levels. The advice and guidance that the professors, graduate students and industry advisers provided were invaluable. The other presentations were incredible to watch. It was evident how much work and effort each group put in.
I can’t wait for the next opportunity to test out my data senses!
Thank you for reading!