Feature learning with Word2Vec

Learning similar features of job postings

Posted by Matt on October 30, 2014

Introduction

Extracting meaningful and useful text features is an art in itself, and sometimes simply choosing a model can be an uphill battle. In this particular problem, we were trying to understand the linguistic features of the current job market. When attempting this, one has two easy options: learning from job posts or learning from resumes, the latter being harder to obtain than the former. Job posts give you a sense of demand in a given market, while resumes act as a supply-quality metric for how well that demand is being met. For our task of understanding skill similarity and the transferability of skill-assets, I turned to word2vec because of its speed and ease of use.

This, however, was not an easy decision, as a myriad of awesome options exist. One such option was GloVe, Global Vectors for Word Representation. GloVe learns word vectors by analyzing word co-occurrences within a large text corpus, and it is a rather cool implementation; I recommend reading the paper for a description of how the word vectors are manipulated. The reason I did not use it is that I wanted to avoid storing a massive co-occurrence matrix in memory, even a sparse one. I like using online learning whenever I can, and gensim's word2vec conveniently streams the training data while keeping only the model itself in memory (a bit of code on that later), which made the decision a tad easier. Another method I considered for this task was the well-documented Latent Dirichlet Allocation (LDA). My initial reaction was that LDA would be a great solution to my problem of extracting skills from text, but upon further reflection, I realized that I wanted features that are similar with respect to their linguistic context, as opposed to words that are related via a 'topic'. For instance: python, data science, sklearn, R, and CRAN might appear in the same 'topic' with LDA, but if I only wanted python and R to appear in a given similarity query, my uneducated guess is that word2vec would offer a greater probability of that happening. I suppose that might not be the best example of where my mind was going, but in my experience with LDA, a lot of noise ends up in the actual topic buckets, and I really wanted to find a way to minimize that noise. So with that, I went down the path of word2vec.

Word2Vec

Word2vec is a pretty cool tool. -Biggy Smalls

Word2vec computes vector representations of words using a few different techniques, two of which are the continuous bag-of-words (CBOW) and skip-gram architectures. The high-level training objective of the CBOW model is to combine the representations of surrounding words to predict the word in the middle, while the training objective of the skip-gram model is to learn word-vector representations that are good at predicting a word's context in the same sentence [1]. Both models train in a short amount of time, but CBOW is slightly faster and is better suited to larger datasets [2]. Considering my toy dataset only consists of ~80,000 sentences, I will use the skip-gram architecture for this post. (In my actual model, consisting of a few billion sentences, I will be using CBOW.)
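
For what it's worth, gensim exposes this choice through its sg flag (sg=1 for skip-gram, sg=0 for CBOW). A minimal sketch, where toy_sentences is just a stand-in for any list of tokenized sentences:

import gensim

toy_sentences = [['python', 'and', 'r', 'experience', 'required'],
                 ['strong', 'sql', 'and', 'python', 'skills']]

# sg=1 trains skip-gram, sg=0 trains CBOW; min_count=1 keeps every word in this tiny corpus
skipgram_model = gensim.models.Word2Vec(toy_sentences, sg=1, min_count=1)
cbow_model = gensim.models.Word2Vec(toy_sentences, sg=0, min_count=1)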

Skip-gram

Given a sequence of training words w1, w2, ..., wT, the objective of the skip-gram model is to maximize the average log probability

(1/T) ∑t=1..T ∑−k≤j≤k, j≠0 log p(wt+j | wt)

where k is the size of the training window. The inner summation goes from −k to k (skipping j = 0) to compute the log probability of correctly predicting the word wt+j given the word in the middle, wt. The outer summation goes over all T words in the training corpus. Every word w is associated with two learnable parameter vectors, uw and vw. The probability of correctly predicting the word wi given the word wj is defined as

p(wi | wj) = exp(uwi · vwj) / ∑l=1..V exp(ul · vwj)

where V is the number of words in the vocabulary (Mikolov et al., 2013). Computing ∇ log p(wi | wj) is expensive because its cost is proportional to V; an efficient alternative is the hierarchical softmax [3].
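
To see why the full softmax gets expensive, here is a small numpy sketch with made-up toy vectors (not the trained model's parameters): the numerator is a single dot product, but the normalizer has to touch every word in the vocabulary.

import numpy as np

vocab_size, dim = 10000, 100
U = np.random.randn(vocab_size, dim) * 0.01   # one output vector u_w per word
V = np.random.randn(vocab_size, dim) * 0.01   # one input vector v_w per word

def softmax_prob(i, j):
    # p(wi | wj): the normalizing sum runs over the whole vocabulary -- O(V) work per prediction
    scores = U.dot(V[j])              # u_l . v_wj for every word l
    scores -= scores.max()            # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[i]

print(softmax_prob(10, 42))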

In the computation of hierarchical softmax, the first step is to build a binary Huffman tree based on word frequencies, where each word in the vocabulary is a leaf of that tree. Here you can find a bit more information on Huffman trees.
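
As a quick illustration of the idea (this is not gensim's internal code), here is one standard way to build Huffman codes from word counts using heapq; frequent words end up with shorter codes, i.e. shorter paths from the root of the tree:

import heapq

def huffman_codes(freqs):
    # each heap entry: [total frequency, tie-breaker, {word: code-so-far}]
    heap = [[f, i, {w: ''}] for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {}
        for w, code in lo[2].items():
            merged[w] = '0' + code    # left branch
        for w, code in hi[2].items():
            merged[w] = '1' + code    # right branch
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

print(huffman_codes({'the': 50, 'data': 20, 'hadoop': 8, 'python': 5}))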

The binary tree acts as a representation of the output layer: rather than scoring every word directly, a random walk from the root assigns a probability to each word via the inner nodes along its path. Let n(w, j) be the j-th node on the path from the root to word w, and let L(w) be the length of this path, so that n(w, 1) = root and n(w, L(w)) = w. For any inner node n, let ch(n) be an arbitrary fixed child of n, let ⟦x⟧ be 1 if x is true and -1 otherwise, and let σ(x) = 1/(1 + exp(−x)). Hierarchical softmax then defines the probability of predicting the target word wO given the input word wI as

p(wO | wI) = ∏j=1..L(wO)−1 σ( ⟦n(wO, j+1) = ch(n(wO, j))⟧ (v′n(wO, j) · vwI) )  [4]

Unlike the standard softmax formulation, which assigns two representations, uw and vw, to each word w, the hierarchical softmax formulation has one representation vw for each word w and one representation v′n for every inner node n of the binary tree.

So in the formula above, all the words and inner nodes have initialized embeddings that are gradually updated during training. If a given context is more similar to the inner nodes on the path to a certain word, then that word has a higher probability of being the target word in that context. The formula converts those similarity scores into probabilities by walking down the tree toward the particular word: if the (j+1)-th node on the path to the target word w is the designated child ch(·) of the j-th node, the probability of taking that step is σ(+ the similarity between the context vector and node j's vector); otherwise it is σ(− that similarity). Multiplying these step probabilities along the path from root to leaf gives a properly normalized probability of the target word given its context.
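
A toy numpy sketch of that product, with made-up node vectors and a hand-picked path purely to show the mechanics:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_prob(context_vec, path_nodes, path_signs, node_vecs):
    # path_nodes: indices of the inner nodes from the root down to the target word's parent
    # path_signs: +1 if the next step goes to the designated child ch(n), otherwise -1
    # node_vecs:  one learned vector per inner node of the Huffman tree
    prob = 1.0
    for n, sign in zip(path_nodes, path_signs):
        prob *= sigmoid(sign * np.dot(node_vecs[n], context_vec))
    return prob

# made-up example: three inner nodes on the path to some target word
node_vecs = np.random.randn(10, 100) * 0.01
context_vec = np.random.randn(100) * 0.01
print(hierarchical_prob(context_vec, path_nodes=[0, 3, 7],
                        path_signs=[+1, -1, +1], node_vecs=node_vecs))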

word2vec code

Instead of writing my own word2vec code, I turned to the well-documented implementation in gensim. Gensim's word2vec is incredibly intuitive and easy to use, and I recommend looking at the docs to get a feel for how to integrate the tool into your existing pipeline.

In this instance, I collected job postings from a few job-posting websites using titles generated by Payscale. For the sake of the job-posting websites' servers, I would advise going the API route rather than harassing them with scraping calls. The data in this case is a dictionary of the form:

{'Account Manager Sales': [u'Job Description 1...', u'Job Description 2...'], 'Sr. Data Scientist': [u'Job Description 1...'], ..., 'George W. Bush': [u'I was the boss at USA...'], ...}

The goal was to find enough varied job postings and resumes that each job title had sufficient data and the sentence structure of both sides of the job equation was represented. Below is a sketch of how to generate a word2vec model from an in-memory list of sentences.
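
This assumes the job-post dictionary above is named job_posts (a hypothetical name) and that crude whitespace tokenization is good enough; a real pipeline would clean the text more carefully.

import gensim

# job_posts is the {job title: [description, ...]} dictionary described above
sentences = []
for descriptions in job_posts.itervalues():
    for description in descriptions:
        # lowercase and whitespace-tokenize each job description
        sentences.append(description.lower().split())

# sg=1 selects the skip-gram architecture, which is what I use for this toy dataset
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, sg=1)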

An alternative to in-memory analysis is to use Radim's method of iterating through files on multiple disks or instances and only keeping the model and its training in memory. (This is similar to what I will actually be using in my larger CBOW model.) Here is an example of what that might look like:

import csv
import os
from multiprocessing import cpu_count

import gensim

class MySentences(object):

    # if you had more than one file you wanted to iterate through, point this at
    # the directory that holds them (potential example: jobs/courses/resumes)
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # change this to whatever file has your data
        file_name = "only_good_cols_final.csv"
        f = csv.reader(open(os.path.join(self.dirname, file_name), "rb"), delimiter=',')
        # the column the description is in, zero-indexed (0 is the first column)
        column = 3
        for line in f:
            yield line[column].lower().split()

sentences = MySentences('/home/ubuntu/')  # a memory-friendly iterator
# sg=0 selects CBOW, which is what the larger model uses
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5,
                               workers=cpu_count(), sg=0)
model.save('fname')
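
The saved model can then be reloaded later and queried (standard gensim usage):

model = gensim.models.Word2Vec.load('fname')
model.most_similar('python', topn=3)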

Some interesting features of the in-memory model

In [22]: model.most_similar(positive=['ceo', 'woman'], negative=['man'],topn=3)
Out[22]: [('president', 0.5502843856811523),('cfo', 0.5502355098724365),('vice', 0.520388126373291)]

In [23]: model.most_similar(positive=['ceo', 'woman'], negative=['man'],topn=3)
Out[23]: [('cio', 0.5539180040359497),('presidents', 0.5108237862586975),('cto', 0.4993226230144501)]

In [24]: model.most_similar('python')
Out[24]: 
[('scripting', 0.912078857421875),
 ('bash', 0.9030072093009949),
 ('perl', 0.897027850151062),
 ('tcl', 0.8833462595939636),
 ('ruby', 0.8729183673858643),
 ('c++', 0.8634607195854187),
 ('jython', 0.8467384576797485),
 ('groovy', 0.846560001373291),
 ('lua', 0.8416544795036316),
 ('java', 0.8366264700889587)]

In [25]: model.most_similar('hadoop')
Out[25]: 
[('splunk', 0.8380253314971924),
 ('hbase', 0.8312206268310547),
 ('nosql', 0.8239182233810425),
 ('cassandra', 0.8228038549423218),
 ('greenplum', 0.820797324180603),
 ('hive', 0.819451093673706),
 ('rabbitmq', 0.8175760507583618),
 ('zookeeper', 0.816423773765564),
 ('accumulo', 0.8148070573806763),
 ('cloudera', 0.8085991740226746)]

In [26]: model.most_similar('jquery')
Out[26]: 
[('ajax', 0.9611257314682007),
 ('javascript', 0.9384799003601074),
 ('dhtml', 0.9383583068847656),
 ('xhtml', 0.9327742457389832),
 ('xml', 0.9254889488220215),
 ('json', 0.9110660552978516),
 ('angularjs', 0.9058303833007812),
 ('css', 0.9022014141082764),
 ('bootstrap', 0.8986350893974304),
 ('mvc', 0.8910885453224182)]
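
Under the hood, most_similar is just cosine similarity over the learned word vectors, so any single pair can be sanity-checked by hand; a quick sketch (the pair here is arbitrary):

import numpy as np

v1, v2 = model['python'], model['ruby']
# cosine similarity between the two word vectors
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# should agree with gensim's own pairwise similarity
print(model.similarity('python', 'ruby'))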

In an attempt to quickly find clusters of similar features, I used Mini-Batch K-Means to cluster the word vectors. The idea was to pull out finance-related and technology-related terms and feed the resulting clusters into a secondary classifier.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering  # in case you want to try Ward clustering instead
from collections import defaultdict

# the word -> vector dictionary of the model
word2vec_dict = {word: model[word] for word in model.vocab}

y = list(word2vec_dict.keys())                     # the words
X = np.array([word2vec_dict[word] for word in y])  # their vectors, in the same order

# this is also interesting to try with Ward hierarchical clustering
clusters = MiniBatchKMeans(n_clusters=100, max_iter=10, batch_size=200,
                           n_init=1, init_size=2000)
clusters.fit(X)

cluster_dict = defaultdict(list)
for word, label in zip(y, clusters.labels_):
    cluster_dict[label].append(word)

My assumption was that 'python' would land in the technology cluster and 'finance' in the finance cluster. Thankfully, I ended up being correct. Yippee.

In [49]: for i in range(len(cluster_dict)):
   ....:         if 'python' in cluster_dict[i]:
   ....:                 cluster_dict[i].sort()
   ....:                 print cluster_dict[i]
   ....:
['++', 'abap', 'actionscript', 'ado', 'adwords', 'agile', 'aix', 'ajax', 'ale', 'amazon', 'android', 'angular', 'angularjs', 'ansible', 'ant', 'antivirus', 'apache', 'apex', 'api', 'apis', 'apo', 'app', 'apple', 'arcgis', 'architecting', 'architecture', 'architectures', 'ats', 'automation', 'aws', 'azure', 'backbone', 'backend', 'bamboo', 'bash', 'bea', 'beans', 'bind', 'bing', 'bmc', 'bootstrap', 'boss', 'bpm', 'browser', 'browsers', 'bw', 'c#', 'c+', 'c++', 'cache', 'caching', 'cascading', 'cassandra', 'cdisc', 'centos', 'chrome', 'citrix', 'clearcase', 'cloud', 'cluster', 'clustered', 'clustering', 'clusters', 'cmdb', 'cobol', 'cognos', 'coldfusion', 'comptia', 'computing', 'configuration', 'confluence', 'cots', 'cpu', 'crm', 'crystal', 'css', 'cucumber', 'customization', 'database', 'databases', 'datacenter', 'datastage', 'db', 'dba', 'dbms', 'ddl', 'debuggers', 'debugging', 'dell', 'deploying', 'deployments', 'dev', 'developer', 'dhtml', 'directory', 'distributed', 'django', 'dojo', 'dom', 'domino', 'drupal', 'dts', 'dw', 'ebs', 'ecc', 'eclipse', 'edition', 'ee', 'eg', 'ejb', 'elasticsearch', 'embedded', 'ember', 'emc', 'emulation', 'enterprise', 'erp', 'erwin', 'esb', 'esri', 'essbase', 'esx', 'etl', 'extron', 'familiarity', 'fibre', 'firefox', 'fortran', 'framework', 'frameworks', 'ftp', 'fusion', 'git', 'github', 'google', 'gradle', 'grails', 'groovy', 'gui', 'gwt', 'hadoop', 'hana', 'hardening', 'hbase', 'hbss', 'hibernate', 'hive', 'hl', 'hosted', 'hosting', 'hrms', 'html', 'hw', 'hyper', 'hyperion', 'ibm', 'ide', 'iis', 'imap', 'ims', 'informatica', 'informix', 'infrastructures', 'integrations', 'integrator', 'intelli', 'internals', 'ios', 'iphone', 'ipsec', 'itil', 'itsm', 'ivr', 'java', 'javascript', 'jboss', 'jdbc', 'jde', 'jee', 'jenkins', 'jira', 'jmeter', 'jms', 'jpa', 'jquery', 'js', 'jsf', 'json', 'jsp', 'junit', 'jvm', 'kernel', 'knockout', 'labview', 'lamp', 'languages', 'layer', 'layers', 'ldap', 'linq', 'linux', 'lync', 'magento', 'mainframe', 'mapreduce', 'marketo', 'markup', 'matlab', 'maven', 'mcitp', 'mcp', 'mcsa', 'mcse', 'mcts', 'mdm', 'memcached', 'memory', 'metasploit', 'microcontrollers', 'microstrategy', 'middleware', 'migration', 'migrations', 'mitek', 'mongo', 'mongodb', 'mq', 'mssql', 'multicast', 'mvc', 'mware', 'mysql', 'nagios', 'nessus', 'net', 'netapp', 'netbackup', 'netezza', 'nexus', 'nfs', 'nginx', 'nmap', 'node', 'nosql', 'obiee', 'object', 'olap', 'oltp', 'omniture', 'oo', 'ooa', 'ood', 'oop', 'oracle', 'orm', 'osi', 'osx', 'paa', 'parallel', 'partitioning', 'pentaho', 'peoplesoft', 'perforce', 'perl', 'php', 'pki', 'pl', 'plm', 'plsql', 'plugins', 'portal', 'portals', 'postgres', 'postgresql', 'powershell', 'programming', 'provisioning', 'proxies', 'puppet', 'python', 'qtp', 'query', 'querying', 'rabbitmq', 'raid', 'rails', 'rational', 'rdbms', 'redhat', 'redis', 'relational', 'remedy', 'replication', 'rest', 'rhel', 'rsa', 'rtos', 'ruby', 'saa', 'saas', 'salesforce', 'sans', 'sap', 'sas', 'sass', 'scala', 'sccm', 'schema', 'schemas', 'script', 'scripting', 'scripts', 'scsi', 'sdk', 'sdtm', 'selenium', 'semantic', 'sencha', 'server', 'servlets', 'sfdc', 'sftp', 'shell', 'silverlight', 'simulink', 'sms', 'snmp', 'soa', 'soap', 'solaris', 'solr', 'sphere', 'splunk', 'spring', 'sql', 'sqlserver', 'srm', 'ssas', 'ssh', 'ssis', 'ssl', 'ssrs', 'stack', 'struts', 'subversion', 'svn', 'sybase', 'symantec', 'sync', 'tableau', 'tablet', 'tcl', 'tcpdump', 'technologies', 'telnet', 'teradata', 'testng', 'tfs', 'thin', 'tibco', 'tivoli', 'tmw', 'toad', 
'tomcat', 'toolkit', 'toolsets', 'topology', 'transact', 'triggers', 'tsql', 'tuning', 'ubuntu', 'ucs', 'ui', 'uml', 'unix', 'vb', 'vba', 'vbscript', 'verilog', 'veritas', 'version', 'versioning', 'versions', 'vhdl', 'virtual', 'virtualization', 'virtualized', 'vm', 'vmware', 'ware', 'waterfall', 'wcf', 'weaver', 'web', 'weblogic', 'websphere', 'widgets', 'windows', 'wireshark', 'wordpress', 'wpf', 'wsdl', 'xen', 'xendesktop', 'xhtml', 'xi', 'xml', 'xsd', 'xsl', 'xslt', 'zend', 'zookeeper']

In [50]: for i in range(len(cluster_dict)):
   ....:         if 'finance' in cluster_dict[i]:
   ....:                 cluster_dict[i].sort()
   ....:                 print cluster_dict[i]
   ....:
['accounting', 'accruals', 'acquisitions', 'allocation', 'allocations', 'aml', 'amortization', 'annual', 'ar', 'asc', 'audit', 'audited', 'auditing', 'auditors', 'audits', 'basel', 'bookkeeping', 'budget', 'budgeting', 'calculation', 'capital', 'capitalization', 'cas', 'cib', 'commentary', 'consolidated', 'consolidation', 'consolidations', 'controllership', 'controversy', 'corporate', 'counterparties', 'counterparty', 'deductions', 'depreciation', 'derivative', 'diligence', 'disclosures', 'divestitures', 'entities', 'entity', 'entries', 'equalization', 'erforms', 'excise', 'expenditure', 'expenditures', 'expense', 'fas', 'fcg', 'filings', 'fin', 'finance', 'financial', 'financials', 'financings', 'fiscal', 'fixed', 'fmis', 'footnote', 'footnotes', 'forecast', 'forecasting', 'forecasts', 'forma', 'fund', 'gaap', 'gl', 'governance', 'grant', 'grants', 'headcount', 'hedging', 'hoc', 'ifrs', 'income', 'intercompany', 'invention', 'issuers', 'journal', 'kyc', 'ledger', 'ledgers', 'liabilities', 'liquidity', 'merger', 'monthly', 'multistate', 'onesource', 'payables', 'payroll', 'payrolls', 'preparer', 'preparers', 'projection', 'projections', 'pronouncements', 'proxy', 'quarterly', 'receivable', 'receivables', 'reconcile', 'reconciliation', 'reconciliations', 'reporting', 'reserving', 'restructuring', 'restructurings', 'returns', 'ria', 'risk', 'sec', 'securitization', 'shareholder', 'statements', 'structuring', 'tax', 'taxation', 'taxes', 'trading', 'transaction', 'transactional', 'treasury', 'valuation', 'valuations', 'variance', 'withholding', 'workpapers', 'yearly']

From here, one interesting experiment will be learning new word embeddings from different corpora, using embeddings like these as starting points for other similarity measures, quite possibly through LDA or a similar quick topic analysis. One thing we plan on doing is creating a simple twitter/reddit classifier to measure job-post lag. Because we believe that job posts and job openings lag behind the labor market in recognizing skill-assets, we are going to measure how long a signal takes to travel from twitter/reddit/resumes to the point where it is finally realized in the job market via open job postings. This will be interesting to almost anyone trying to capitalize on investing in new skills, or simply to see whether the market is investing in potential, hype, or both (e.g. Hadoop).

In summary, I really like messing around with word2vec for a lot of the different text projects we come across. Its flexibility and speed are really nice qualities when a fast turnaround is needed. Otherwise, it might be worth checking out GloVe if you have a lot of memory at your disposal. Until next time.

References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/pdf/1301.3781v3.pdf
[2] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting Similarities among Languages for Machine Translation. http://arxiv.org/pdf/1309.4168.pdf
[3] Frederic Morin and Yoshua Bengio. Hierarchical Probabilistic Neural Network Language Model. http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf
[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. http://arxiv.org/pdf/1310.4546.pdf
