Search Trends

Google and Wikipedia Trends for feature generation

Posted by Matt on December 2, 2014

Introduction

Inspired by one of my favorite blog posts ever, Why Can't Canada Win The Stanley Cup?, we took the simple approach of using search data to gauge the public's interest in a set of features. The easiest way to do this is to 1) type the feature set into Google Trends, 2) uncover the important signals, and 3) go to town. This is all very easy if you're Mr. Silver and are only searching for one specific keyword: 'NHL', but when you have 1,000+ words to look up, it becomes rather tedious. Speaking of tedium, the search quota limit, which seems to change depending on the weather on any given day, is right up there as well. Regardless of the annoyances, we managed to figure out a way around Google's constraints, albeit a rather slow one, as well as a process that we kicked ourselves for not thinking of in the first place.

Use Case

There are many methods to scrape Google Trends; just search Stack Overflow, Google, or even GitHub for that matter. Many of them focus on using a Mechanize browser, rotating through a list of 'real' user-agent headers, then appending "&export=1&content=1" to the end of a search link like this: "https://www.google.com/trends/trendsReport?hl=en-US&q="SEARCH-TERM"&content=1&export=1". With this method, the quota limit can vary anywhere from 25 to ~125+ search terms in a given time frame (usually a workday), and the limit can be stretched as long as you force a variable delay (~2 minutes) between searches. We tried to come up with a few ways to hit our total number of searches (~3K) without hitting the quota, and to do so relatively quickly. None of this is new, but we figured that if we could work out an efficient way to do our feature generation, it might benefit the reader as well.
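To make that concrete, here's a minimal sketch of the export-URL approach, assuming a mechanize Browser, a small hand-rolled list of user-agent strings, and a roughly two-minute pause between requests; the helper name fetch_trends_csv and the agent list are illustrative, not taken from any particular scraper.

```python
# A minimal sketch of the export-URL approach, assuming a mechanize Browser,
# a hand-rolled list of 'real' user-agent strings, and a ~2 minute pause
# between requests. fetch_trends_csv and USER_AGENTS are illustrative names.
import random
import time
import mechanize

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36',
]

def fetch_trends_csv(term):
    """Fetch the CSV export for a single Google Trends term."""
    br = mechanize.Browser()
    br.set_handle_robots(False)
    # Rotate the user-agent header on every request.
    br.addheaders = [('User-Agent', random.choice(USER_AGENTS))]
    url = ('https://www.google.com/trends/trendsReport?hl=en-US&q='
           + quote(term) + '&content=1&export=1')
    return br.open(url).read()

if __name__ == '__main__':
    for term in ['NHL', 'Stanley Cup']:
        csv_text = fetch_trends_csv(term)
        # Vary the wait a little so the requests don't look clockwork-regular.
        time.sleep(120 + random.uniform(0, 30))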

Methods

Initially, we tried a Mechanize browser method. It is definitely the easiest to use if the search list is small or if you just want a one-and-done type deal, but we began hitting quota limits around ~150 items. The problem we ran into with both this method and the Selenium method was that once we had a returned dataframe, we had to scale all values against each other in a meaningful way, because Google's results "represent search interest relative to the highest point on the chart. If at most 10% of searches for the given region and time frame were for 'pizza', we'd consider this 100. This doesn't convey absolute search volume" (from the Google Trends info). This is fine, because we can at least find the highest relative search term in our list, then scale it against the biggest word we have, or against 'Facebook' or some other astronomical (or non-astronomical) search term; then we'd have at least a decent volume metric going forward. From there, we would have to find related terms to scale against each other in a pseudo-sorting process, which would take a bit of time. While we worked out the quickest sorting method, we wanted to at least get the scraping going so as not to waste any time.
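As a toy illustration of the scaling problem, here is one hedged way to put separately returned series onto a common footing, assuming every request also included a shared anchor term (say, 'Facebook') so that each pair of series shares a 0-100 scale; the rescale helper and the numbers are made up for the example, not lifted from our code.

```python
# A toy illustration of the rescaling idea, assuming every request also
# included a shared anchor term (e.g. 'Facebook'). The numbers are made up.
import pandas as pd

def rescale(term_series, anchor_series, anchor_volume=100.0):
    """Re-express a term's relative values on the anchor's scale.

    Both series come from the same Trends request, so dividing by the
    anchor's peak in that request makes values comparable across requests.
    """
    return term_series * (anchor_volume / anchor_series.max())

# Two terms pulled in separate requests, each alongside the anchor:
nhl, nhl_anchor = pd.Series([40, 80, 100]), pd.Series([30, 45, 50])
ruby, ruby_anchor = pd.Series([20, 12, 10]), pd.Series([90, 100, 95])

comparable = pd.DataFrame({
    'NHL': rescale(nhl, nhl_anchor),
    'ruby': rescale(ruby, ruby_anchor),
})
```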

Mechanize Method

This method involves assembling a Mechanize browser, importing a list of search terms, and then letting the update functions do their thing. It attempts to find "the highest point on the chart", then uses that point for every search until a new highest point is found. When a new highest point turns up, it rescales the existing dataframe of stored search terms against the new max term; otherwise it just plods along until all terms have been searched. If you are planning to efficiently search a few terms each day, this is by far the best option.
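The update logic might look something like the sketch below; fetch_pair is a hypothetical helper that queries the new term together with the current max term in one request and returns each term's peak on that chart's 0-100 scale, so this is the shape of the idea rather than the original code.

```python
# A rough sketch of the update loop described above, not the original code.
# fetch_pair is a hypothetical helper: it queries the new term alongside the
# current max term and returns each term's peak on that chart's 0-100 scale.
def crawl(terms, fetch_pair):
    stored = {terms[0]: 100.0}   # term -> peak, on the current max term's scale
    max_term = terms[0]
    for term in terms[1:]:
        chart = fetch_pair(max_term, term)
        if chart[term] > chart[max_term]:
            # New highest point found: rescale everything stored so far onto
            # the new max term's scale, then switch reference terms.
            factor = chart[max_term] / 100.0
            for t in stored:
                stored[t] *= factor
            stored[term] = 100.0
            max_term = term
        else:
            # Otherwise just plod along; the value is already on the max term's scale.
            stored[term] = float(chart[term])
    return stored
```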

Selenium Method

This method uses the same approach as above for learning the new highest search term, but differs in that we use the Selenium browser and literally mimic a person typing key terms into the query box. We resorted to this method because 1) we wanted to time the other methods against it, and 2) we could circumvent the quota limit by typing as slowly as your grandmother. While this method actually finished faster than the one above, it was absolutely infuriating to watch. What is the point of automating feature generation if we can't beat the typing speed of a 95-year-old tortoise? But I digress. In this method, you can play around with Selenium's ActionChains.send_keys() function and have Google suggest a search term for you, e.g. 'ruby' becomes 'Ruby Programming Language'. I've intentionally left those functions in the code behind the #'s, so feel free to mess around with them if you want to.
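A minimal sketch of the slow-typing idea is below, assuming a Firefox driver; the element lookup, the sleep lengths, and the commented-out autocomplete lines are stand-ins rather than the original values.

```python
# A minimal sketch of the slow-typing approach. The element name, sleep
# lengths, and the commented-out autocomplete lines are assumptions.
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('https://www.google.com/trends/')

search_box = driver.find_element_by_name('q')  # assumed element name
search_box.click()

term = 'ruby'
for char in term:
    # Type one character at a time, grandma style, to stay under the quota.
    ActionChains(driver).send_keys(char).perform()
    time.sleep(1.5)

# Uncomment to let Google's autocomplete pick the suggestion for you,
# e.g. 'ruby' -> 'Ruby Programming Language':
# ActionChains(driver).send_keys(Keys.ARROW_DOWN).send_keys(Keys.RETURN).perform()
ActionChains(driver).send_keys(Keys.RETURN).perform()
```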

Screw it, we're going Wiki

While waiting nearly four and a half years for the other two methods to finish, we decided to forgo Google altogether and check out the 'pageviews' utility from the Wikipedia dumps. But like most things on the internet, the fantastic people over at Wiki Trends had already built a website similar to Google Trends, making the creating/storing/using/then-dumping of a database very redundant. It is important to note that when scraping sites like this, it is best to be respectful about timing your scraper so you don't hit the server so hard that you inadvertently contribute to a DDoS kiss of death. This method uses the Selenium browser and lets the word be suggested via the page's autocomplete (which could be good or bad in your use case). At least in our case, it was far more helpful for our desired context than Google's autocomplete suggestions. The nice thing about Wikipedia's pageview counts is that they don't force some arbitrary 'relative to the highest point on the chart' metric onto your problem and, in fact, give you absolute search volume. Here is the selenium version of that code:
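Roughly, something along these lines, assuming a Firefox driver, a plain text-input selector for the Wiki Trends search box, and a polite pause between queries; the URL, selector, and helper name are guesses rather than the site's documented interface.

```python
# A hedged sketch: drive a browser to the Wiki Trends search box, let its
# autocomplete suggest the page, and read back the plotted counts. The URL,
# CSS selector, and sleep lengths are assumptions.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def wiki_trend(term, pause=5):
    driver = webdriver.Firefox()
    try:
        driver.get('http://www.wikipediatrends.com/')   # assumed Wiki Trends URL
        box = driver.find_element_by_css_selector('input[type="text"]')  # assumed selector
        box.send_keys(term)
        time.sleep(2)                                   # give the autocomplete a beat
        box.send_keys(Keys.ARROW_DOWN, Keys.RETURN)     # accept the first suggestion
        time.sleep(pause)                               # be polite: don't hammer the server
        return driver.page_source                       # parse the absolute counts from here
    finally:
        driver.quit()
```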

Screw it, we're going Wiki; part deux

This method uses the Mechanize browser and has almost the same functionality as the wiki-Selenium method above. It's really no different other than it relies on the user to provide the exact word to be searched on Wikipedia, and it's a bit faster (simply because grandma isn't typing on this one). If you're unfamiliar with how Wikipedia search terms look, take a gander over to their API and see what your search term would look like as a Wikipedia page, e.g. 'python' as 'Python (programming language)'. If you know exactly what you're looking for and you want the query done quickly, this is probably your best bet.
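If you want to sanity-check what a raw term maps to, a small helper like the one below can list candidate page titles via the public opensearch endpoint; mechanize stands in for whatever browser object you already have open, and the function name is mine, not from the original code.

```python
# A small helper for checking what a raw term looks like as a Wikipedia page
# title, using the public opensearch API endpoint. wiki_titles is an
# illustrative name, not part of the original scraper.
import json
import mechanize

try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

def wiki_titles(term):
    """List candidate page titles for a term so you can pick the exact one,
    e.g. 'python' suggests 'Python (programming language)' among others."""
    br = mechanize.Browser()
    br.set_handle_robots(False)
    url = ('https://en.wikipedia.org/w/api.php?action=opensearch'
           '&format=json&search=' + quote(term))
    data = json.loads(br.open(url).read().decode('utf-8'))
    # opensearch returns [query, [titles], [descriptions], [urls]]
    return data[1]
```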

Downloading the Wikipedia Pagecount Dumps

All of these options work perfectly well if your goal is simply to get the necessary data and then never touch the methods again (like us); however, if you wish to continually look for trends, or find yourself constantly looking for new trend information, it's probably easiest to simply download from the dumps located here, then update your db once new page data comes in. At least that way you are not hounding the Wiki Trends servers. To get you started, here is a relatively simple script that gathers all the page links and downloads the information into a mongodb.
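Something along these lines should do it, assuming the pagecounts-raw index on dumps.wikimedia.org, the requests and pymongo libraries, and a filter down to English-Wikipedia rows; tweak the month folder and the filtering to whatever slice of the dumps you actually need.

```python
# A bare-bones sketch: scrape the pagecount file links from a dump index page,
# download each gzipped file, and store its rows in a local MongoDB. The index
# URL, the 'en' filter, and the collection names are assumptions to adjust.
import gzip
import io
import re
import requests
from pymongo import MongoClient

INDEX = 'https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-12/'  # pick your month

def dump_links(index_url):
    """Collect the pagecount file links from a dump index page."""
    html = requests.get(index_url).text
    return [index_url + name for name in re.findall(r'href="(pagecounts-[^"]+\.gz)"', html)]

def load_into_mongo(link, collection):
    """Download one gzipped pagecount file and store its rows in MongoDB."""
    raw = requests.get(link).content
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as fh:
        docs = []
        for line in fh:
            # Each line: project page_title view_count bytes_transferred
            parts = line.decode('utf-8', 'ignore').split(' ')
            if len(parts) == 4 and parts[0] == 'en':   # keep English Wikipedia only
                docs.append({'page': parts[1], 'views': int(parts[2])})
        if docs:
            collection.insert_many(docs)

if __name__ == '__main__':
    coll = MongoClient()['wiki']['pagecounts']
    for link in dump_links(INDEX):
        load_into_mongo(link, coll)
```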

Check out some of the stuff we do