When one thinks of a hacking competition, what comes to mind is probably a familiar scene from The Social Network movie, where a handful of college kids get together in a room and attempt to break into their university’s computer network. Well, a data science hackathon goes a little differently.
From April 21 to April 25, 2017, Mindshare participated in the I-COM Data Science Hackathon for the second time, sending the three of us as its team. The Hackathon is a public, CMO-focused global competition that connects the magic of data science with the world of marketing. The annual competition brings together teams of the best and brightest individuals from the world’s leading technology companies, marketing firms, and universities.
Upon our arrival, I-COM welcomed us with a boat tour that took us from the beautiful city of Porto, Portugal to the cruise terminal where the event was to take place. The terminal was a spectacular structure built with over one million white tiles. As a team, we were taken aback by the experience of the quaint, seaside city with its Port wines, fantastic views and comfortable weather. We spent that entire first day soaking in the city and relaxing because we knew full well that the Hackathon that was set to kick off the following day would be demanding and require our fullest efforts.
On the day of the Hackathon, with all three of us refreshed and ready to get to work, we gathered in the main cruise terminal and were given our instructions by the coordinators. We were tasked with a 24-hour marathon in which our objective was to solve a real-world predictive modeling challenge: forecast an indexed search metric for a variety of sustainability-first, fast-moving consumer goods (FMCG) brands. We were to use Twitter data, Kantar media data and Kantar shopcom sales data from the last week of January 2017.
The focus of the challenge involved 93 different brands, and it was up to us to satisfy and address two fundamental issues. The first issue was business related – could digital strategy information and community word of mouth predict which of these emerging brands will resonate with the community? The second issue was related to our prediction expertise – how accurately can we predict a search index (0-100) for sustainable brands for a specific week in January 2017?
As soon as the countdown started at 9:10 am, we headed to our work room to tackle the problem. We started by processing the data and combining it into one dataset. As we did, we ran into the most common challenge a data scientist faces: our data was incomplete and dirty. Some of the brands were not even matched to the right company across the various datasets! We therefore had to research the brands on the internet, map brand variables across all the datasets, and then create brand metadata from the media variables. Furthermore, all the Twitter data needed to be aggregated into a single file using Python scripts, and a weekly time series had to be generated for each dataset by prorating the data by day and then aggregating it up to weekly intervals. Once this was done, we used MySQL to load everything into our database and append the metadata, along with additional metrics such as retweets and unit prices. We then selected a date range that gave us a consistent time series across all the datasets. Afterwards, we held out the December 2016 data to test the viability of our models. Finally, we plotted the resulting data for all the brands, concluding our data preparation.
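The prorate-then-aggregate step described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code we ran during the Hackathon: the record layout (start date, end date, value), the function names, and the choice of Monday as the start of the week are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def prorate_to_daily(records):
    """Spread each (start, end, value) record evenly over its days, end inclusive."""
    daily = defaultdict(float)
    for start, end, value in records:
        n_days = (end - start).days + 1
        for offset in range(n_days):
            daily[start + timedelta(days=offset)] += value / n_days
    return dict(daily)


def aggregate_weekly(daily):
    """Sum daily values into weeks keyed by the Monday that starts each week."""
    weekly = defaultdict(float)
    for day, value in daily.items():
        monday = day - timedelta(days=day.weekday())
        weekly[monday] += value
    return dict(sorted(weekly.items()))


# Hypothetical input: one 14-day record that straddles three calendar weeks.
records = [(date(2017, 1, 4), date(2017, 1, 17), 1400.0)]
weekly = aggregate_weekly(prorate_to_daily(records))
for week_start, total in weekly.items():
    print(week_start, total)
```

Prorating to days first means a record that spans a week boundary is split in proportion to the days it covers in each week, so every dataset ends up on the same weekly grid regardless of its original reporting period.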
As for the modeling approach to the two issues, we deliberated over a multitude of different techniques. Ultimately, we settled on a multivariable regression model to address the first, business issue so that we could understand the causal relationships between search, media, and word-of-mouth activity for each of the FMCG categories. We knew that a multivariable regression model could be used to assess those relationships and had good predictive power over short- to mid-term horizons of up to one year.
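For readers who have not worked with it, a multivariable regression of search on media and word of mouth can be sketched as below. This is a minimal, dependency-free illustration that solves the ordinary least-squares normal equations. The function name, the two predictors, and the numbers are made up for the example and are not our Hackathon data or code.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up weekly observations: [media, word_of_mouth] -> search index.
# The targets were generated as 5 + 2*media + 3*word_of_mouth.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [13, 12, 23, 22, 30]
intercept, b_media, b_wom = fit_ols(X, y)
print(intercept, b_media, b_wom)
```

Each fitted coefficient estimates how much the search index moves per unit of that driver with the other held fixed, which is what lets the contribution of media be compared with that of word of mouth within a category.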
For the second, prediction issue, we decided to fit several autoregressive models and keep the forecast with the lowest RMSE (root mean squared error) for each of the 93 brands. This approach let us build every model efficiently, which left us time for additional qualitative analysis, data exploration, and experiments with other modeling techniques. We knew that autoregressive models were quick and efficient, did not require external data, and were relatively accurate for short-term forecasting of up to eight weeks.
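The idea of trying several simple forecasters per brand and keeping the one with the lowest holdout RMSE can be sketched as follows. This is an illustration under assumptions rather than our actual code: it compares only an AR(1) model against a naive last-value forecast, and the function names, the four-week holdout, and the sample series are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return my - phi * mx, phi


def forecast_ar1(series, steps):
    """Iterate the fitted AR(1) recursion forward from the last observation."""
    c, phi = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out


def forecast_naive(series, steps):
    """Baseline: repeat the last observed value."""
    return [series[-1]] * steps


def pick_best(series, holdout=4):
    """Score each candidate on the last `holdout` points; return (best name, scores)."""
    train, test = series[:-holdout], series[-holdout:]
    candidates = {"ar1": forecast_ar1, "naive": forecast_naive}
    scores = {name: rmse(test, f(train, holdout)) for name, f in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical weekly search index for one brand (steady upward trend).
series = [10 + 2 * i for i in range(12)]
best, scores = pick_best(series)
print(best, scores)
```

Running the same selection loop over every brand's series is what makes this approach fast: each candidate needs only the brand's own history, and the holdout score gives a like-for-like basis for choosing between them.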
As any proficient data scientist knows, no model is without drawbacks. As a team, we identified the weaknesses of both the multivariable regression and our autoregressive models so that we could improve on them. A multivariable regression does not allow for two-way causality, requires similar time-series structures, and can be computationally expensive. Autoregressive models are not good at predicting sudden changes in the data. We addressed these weaknesses by building more robust models with additional variables and transformations, by building in the ability to use real client data to extend our techniques to the modeling of competitive brands, and by feeding the models more data to make them more expansive. We also automated and improved the required data processing, implemented more thorough data QA processes, and did more research to fully understand brand histories and industry trends.
Our business application yielded some remarkable insights. We concluded that media and word of mouth drive a high percentage of search for cleaning products. We further found that word of mouth on Twitter and digital media do indeed resonate among consumers, because both contribute to increased search volume. We were also able to identify a list of growing brands as well as a list of seasonal brands. As to methodological extensibility, by building multivariable regressions to measure the impact of media and word of mouth on search across multiple product categories, we were able to show that our approach is applicable to datasets beyond the sample we were provided. The models we built for each product category are extensible to other brands that fall within that category. By 9:10 am the next day, we delivered our results to the Hotel Pestana by the Douro River, fueled in large part by our team’s excitement and a lot of espressos.
The presentations took place the same day that we delivered our results. Although we ultimately did not place, we were able to experience the remarkable ingenuity and innovation of the brilliant minds on the other teams. ETH Zurich, one of the other teams in the Hackathon, deployed an expectation-maximization (EM) algorithm that we found extremely impressive, and their team ultimately won in our category. They used media and word-of-mouth data to estimate a hidden variable, such as popularity, to help improve their model. Their hidden variable did not have any directly observed supporting data, but their use of it went above and beyond in answering I-COM’s business and predictive issues. Another team, which won in a different category, employed XGBoost, an implementation of gradient-boosted decision trees designed for speed and performance. It has dominated applied machine learning and Kaggle competitions for structured or tabular data. Their results were a prime example of execution speed and spectacular model performance.
Overall, the diversity of backgrounds and cultures in the pool of Hackathon participants was truly remarkable. From university students to working professionals, the range of experience represented was inspiring to our team. The shared bond of getting through the grueling 24-hour marathon created numerous instant friendships. As a team, we feel that the Hackathon is an amazing opportunity for any company because it serves as a breeding ground for recruiting talented individuals. The event is also invaluable for learning about the intersection of data science and marketing. It is also a magnificent illustration of how people throughout the world apply different techniques.
It was a great experience, and it was an honor to represent Mindshare at such a high-caliber event and to compete against the brightest minds in the industry. We hope that next year, Mindshare will send a full team to compete on an even higher level.
Ting Wang - Director, Project Management, Marketing Sciences, Mindshare
Fabio Giraldo - Associate Director, Advanced Analytics, Marketing Sciences, Mindshare
Richard Brooker - Data Scientist, Marketing Sciences, Mindshare