Wikipedia in Python, Gephi, and Neo4j

Visualizing relationships in Wikipedia

Posted by Matt on December 9, 2014

Introduction

We've had a stretch here where we used Wikipedia for a good number of things, from Doc2Vec to experimenting with word2vec layers in deep RNNs. Here are a few of the cool visualization tools we've used along the way.

Python Code

Here we're going to use the Python wikitools library to build the relationship links between categories and subcategories in the Wikipedia category database. Here is the code to generate the relationship graph starting at the Machine learning node: http://en.wikipedia.org/wiki/Category:Machine_learning

#!/usr/bin/python
# -*- coding: utf-8 -*-

import pprint
import datetime
import re
from wikitools import wiki
from wikitools import category
from wikitools import page
from wikitools.page import NoPage

people = re.compile(r'.*Category:.*People.*|.*Category:.*people.*')
badlinks = re.compile(r'.* stubs.*|.*Help:.*|.*Talk:.*|.*Wikipedia.*|.*Template.*|.*Portal:.*|.*Outline of.*|.*Outlines of.*|.*List of.*|.*Lists of.*|.*Catalog of.*|.*Glossary.*|.*Glossaries.*|.*Index of.*|.*Timeline of.*|.*History of.*|.*Chronology.*|.*Overview.*|.*Journals.*|.*Redirects.*|.*Book:.*')
needless = re.compile(r' \(.*')
site = wiki.Wiki("http://en.wikipedia.org/w/api.php")

def log(msg):
    print("{} {}".format(str(datetime.datetime.now()), msg))

def WTree(name, CategoryTree):
    """Recursively walk a category, collecting its subcategories and pages into CategoryTree."""
    # skip administrative/meta links
    if re.search(badlinks, name):
        log("badlinks matched, skipping")
        return

    try:
        cat = category.Category(site, name)
        catpage = page.Page(site, title=name, check=True)
        text = catpage.getWikiText(expandtemplates=False, force=False)

        # if the page/category is about people, skip it
        if re.search(people, text):
            log("'people' matched, skipping")
            return

        catlist = cat.getAllMembers(namespaces=[14], titleonly=True)
        pagelist = cat.getAllMembers(namespaces=[0], titleonly=True)
        #deleted the cleaning that was being done here
        catpagelist = catlist + pagelist
        repository = []
        for i in catpagelist:
            noparenthesis = needless.sub('', i)
            if len(noparenthesis) > 0:
                try:
                    pagez = page.Page(site, title=noparenthesis, check=True)
                    textz = pagez.getWikiText(expandtemplates=False, force=False)
                    if textz is None: continue

                    # keep the member only if it isn't about people
                    if not re.search(people, textz):
                        repository.append(noparenthesis)
                except NoPage:
                    log("Page not found! {}".format(noparenthesis))
                except Exception as ex:
                    log('exception occurred! page: {}, msg: {}'.format(noparenthesis, ex))
        clean = [s.encode('ascii', 'ignore').strip().replace('Category:', '') for s in repository]
        name = name.encode('ascii', 'ignore').strip().replace('Category:', '')
        # the cleaning moved here; doing it earlier interfered with the results
        if name not in CategoryTree:
            CategoryTree[name] = clean

            for ncat in catlist:
                log("about to dive into subcategory '{}'".format(ncat))
                WTree(ncat, CategoryTree)

    except Exception as ex:
        log('main exception occurred! page: {}, msg: {}'.format(name, ex))


if __name__ == "__main__":

    CategoryTree = {}

    cat = 'Machine learning'

    log("Started processing category '{}'".format(cat))

    WTree(cat, CategoryTree)

    log("Finished processing category '{}'".format(cat))

    pprint.pprint(CategoryTree)
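
When the crawl finishes, CategoryTree is just a plain dict mapping each category name to the list of its member subcategories and pages. As an illustration only (the actual members depend on the state of Wikipedia when you run it), the structure looks something like:

{'Machine learning': ['Artificial neural networks',
                      'Classification algorithms',
                      'Genetic algorithms',
                      ...],
 'Artificial neural networks': [...],
 ...}

This is the dict that the NetworkX and Gephi snippets below consume.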

NetworkX

NetworkX is one of the simplest ways to visualize the Wikipedia relationships. Although it is fairly limited in its visualization capabilities, its real strength is that you can run graph algorithms like betweenness centrality and PageRank fairly quickly.

import random
import networkx as nx
import matplotlib.pyplot as plt

colors = ['#FF4136', '#0074D9', '#2ECC40', '#FFDC00', '#B10DC9']  # any list of standard HTML color codes

G = nx.from_dict_of_lists(CategoryTree)
pos = nx.spring_layout(G)
for k, v in CategoryTree.items():
    # color each category and its members with a randomly chosen color
    nx.draw_networkx_nodes(G, pos,
                           nodelist=[k] + v,
                           node_color=random.choice(colors))
nx.draw_networkx_edges(G, pos)
plt.show()

PageRank and Betweenness Centrality

In [35]: sorted(nx.pagerank(G).items(), key=lambda x: x[1])[-11:]

Out[35]: 
[('Applied machine learning', 0.015213347659345839),
 ('Statistical natural language processing', 0.01593162614685056),
 ('Data clustering algorithms', 0.01794664013947826),
 ('Genetic algorithms', 0.01827600753476628),
 ('Evolutionary algorithms', 0.021820605084569563),
 ('Machine learning algorithms', 0.022080355852083854),
 ('Markov models', 0.02563857403804218),
 ('Classification algorithms', 0.033037311351401394),
 ('Data mining and machine learning software', 0.03493933850085346),
 ('Artificial neural networks', 0.05324743088542944),
 ('Machine learning', 0.0719193991007866)]

In [36]: sorted(nx.betweenness_centrality(G).items(), key=lambda x: x[1])[-11:]
Out[36]: 
[('Genetic algorithms', 0.06841058109380252),
 ('Data clustering algorithms', 0.07229174547405623),
 ('Statistical natural language processing', 0.07351581758937911),
 ('Cluster analysis', 0.08827151162975704),
 ('Machine learning algorithms', 0.08846253630478287),
 ('Markov models', 0.14038579972335552),
 ('Data mining and machine learning software', 0.1490907603646894),
 ('Classification algorithms', 0.14967241026301162),
 ('Evolutionary algorithms', 0.15268223751670618),
 ('Artificial neural networks', 0.24161841845015136),
 ('Machine learning', 0.7732025817591673)]

Neo4j

Neo4j is pretty cool because it allows you to query directly on the potentially complex relationships in your graph; the fact that you can quickly store massive numbers of relationships without climbing the Giraph learning curve is also a huge plus. Here's a quick way to import small categories. (Note: if you want to do this for all the categories/subcategories, or at least for larger categories than Machine learning, definitely download the most current Wikipedia dump and import the categories and pages directly so you don't have to deal with network lag.)

#!/usr/bin/python
# -*- coding: utf-8 -*-

import datetime
import re
import logging

from wikitools import wiki
from wikitools import category
from wikitools.page import NoPage

from py2neo import neo4j, rel
from py2neo.neo4j import Index

logging.basicConfig(level=logging.WARNING)

#people = re.compile(r'Category:.*People', re.I)
#badlinks = re.compile(r'stubs|Help:|Talk:|Wikipedia|Template:|Portal:|Outline of|List of|Outlines of|Catalog of|Lists of|Glossary|Glossaries|Index of|Timeline of|History of|Chronology|Index of|Overview|Journals|Redirects|Book:')
site = wiki.Wiki("http://en.wikipedia.org/w/api.php") 

# local server (adjust host/port to your installation)
graph_db = neo4j.GraphDatabaseService("http://localhost:8080/db/data/")

## for testing only, make sure we have a clean slate!!
#graph_db.clear()
#@type : Index
db_categories = graph_db.get_or_create_index(neo4j.Node, "Categories")
#@type : Index
db_pages = graph_db.get_or_create_index(neo4j.Node, "Pages")


def log(msg):
    print("{} {}".format(str(datetime.datetime.now()), msg))
    
def WTree(name, visitedCategories=None, dbcat=None):
    """
    For a given category, query subcategories and get categories and pages.
    Query subcategories recursively.
    :type name: str
    :type visitedCategories: set
    """
    # avoid the mutable-default-argument pitfall
    if visitedCategories is None:
        visitedCategories = set()

    # wronglinks = re.search(badlinks, name)
    # if wronglinks:
    #     log("badlinks matched, skipping")
    #     return

    visitedCategories.add(name)

    cat = category.Category(site, "Category:"+name)

    if dbcat is None:
        dbcat = db_categories.get_or_create("name", name, {"name": name, "pageid": cat.pageid})
        dbcat.set_labels('Category')
    else:
        dbcat["pageid"] = cat.pageid
    
    catlist = cat.getAllMembers(namespaces=[14], titleonly=True)
    # :type pagelist: list[str] -- titleonly=True returns titles, not Page objects
    pagelist = cat.getAllMembers(namespaces=[0], titleonly=True)
    
    #
    # do pages first
    #
    for page in pagelist:
        try:
            # no longer filtering people, so don't need page contents
#               txt = page.getWikiText(expandtemplates=False, force=False)
#               if txt is None: continue
#               
# #             log("len of wikitext = {}".format(len(txt)))
#               txt = txt.decode('utf8').encode('ascii', 'ignore')
#               if re.search(people, txt):
#                   continue
                
            title = page.encode('ascii', 'ignore')
#               log("             page: {}".format(title))

            # at this point, have all the info we need, so save to db
            # (try to find node first, if exists, just make a connection, if not, create it first
            db_page = db_pages.get("name", title)
            if not len(db_page):
                db_page = db_pages.create("name", title, {"name": title})
                db_page.set_labels('Page')
            else:
                db_page = db_page[0]
            
            
            # link the page to its parent category
            graph_db.create(rel(dbcat, "has", db_page))
            
        except NoPage:
            log("Page not found! {}".format(page))
        except Exception as ex:
            log('exception occurred! page: {}, msg: {}'.format(page, ex))
    
    log("       {} pages saved".format(len(pagelist)))
    #
    # now do categories
    #

    for catname in catlist:
        new = False
        # strip the "Category:" prefix, then get or create the child category
        catname = catname[9:]
        childcat = db_categories.get("name", catname)
        if not len(childcat):
            new = True
            childcat = db_categories.create("name", catname, {"name": catname})
            childcat.set_labels('Category')
        else:
            childcat = childcat[0]
        
        # link up to parent
        graph_db.create(rel(dbcat, "has", childcat))

        # if existing AND already visited, skip
        # NOTE: in the future, might change to just not go into existing ones at all, but it might lead to lost data if run was never finished
        if new is False and ('d' in childcat or catname in visitedCategories):
            continue
            
        log(" - about to dive into subcategory '{}'".format(catname))
        WTree(catname, visitedCategories, childcat)
        childcat['d'] = datetime.datetime.now()
        
    log("Finished processing {}".format(name))


if __name__ == "__main__":

    cat = 'Machine learning'

    log("Started processing category '{}'".format(cat))

    WTree(cat)

    log("Finished processing category '{}'".format(cat))

Using the Cypher query language, it's very easy to display the complex relationships in query results; the only drawback is that a 1,000-node display limit exists outside the Webadmin area (you can display massive queries in Webadmin, but it absolutely eats away at your browser's memory).
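
For example, here's a minimal sketch of running a Cypher query through py2neo's CypherQuery interface, assuming the Category label and 'has' relationships created by the script above (the query text and the LIMIT are just placeholders; adjust to taste):

from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:8080/db/data/")

# pull the direct members (subcategories and pages) of the root category
query = neo4j.CypherQuery(graph_db,
    "MATCH (c:Category {name: 'Machine learning'})-[:has]->(m) "
    "RETURN m.name LIMIT 25")

for record in query.stream():
    print(record[0])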

Gephi

Gephi is a bit different from the other two in that it's an open-source graph platform designed for visualizing and analyzing small-scale networks, whereas NetworkX is only a Python package and Neo4j is, first and foremost, a graph database designed for storing intricate relationships. Out of the box, Gephi is much easier for visualization if you have no experience with CSS or design. Use the tiny script below to create an edge list in the 'node1;node2' format for import into Gephi.

# replace spaces with underscores so the node names survive the CSV import
d = {}
for k, v in CategoryTree.items():
    K = "_".join(k.split())
    V = ["_".join(x.split()) for x in v]
    d[K] = V

with open('edge.csv', 'wb') as f:
    for k, v in d.items():
        for x in v:
            f.write(k + ';' + x + '\n')
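
With the hypothetical CategoryTree excerpt from earlier, the first few lines of edge.csv would come out like:

Machine_learning;Artificial_neural_networks
Machine_learning;Classification_algorithms
Machine_learning;Genetic_algorithms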

From here, import 'edge.csv' through the GUI and you're ready to rock. It's incredibly simple.

Check out some of the stuff we do