# How does it work?

## Build the graph

We use Wikipedia weekly dumps to build a Wikipedia link graph. Each node is a page. Each arc is a hypertext link. We do not consider links in infoboxes. Dumps are open and accessible to everyone. The conversion is performed using a combination of classes from MG4J and WebGraph.

## Do your link analysis

We then compute a few measures of the importance of pages. The measures we use are quite classical (modulo some updates to work with directed graphs like Wikipedia).

### Harmonic Centrality

The default ranking we show you is by *harmonic centrality*. If you want, you can find its definition in Wikipedia. But we can explain it easily.

Suppose your page is `FooBar`. Your score by harmonic centrality is, as a start, the number of pages with a link towards `FooBar`. They are called *pages at distance one*. Say there are 50 such pages: your score is now 50.

There will also be pages with a link towards pages that have a link towards `FooBar`, but that are not at distance one. They are called pages at distance *two*. Say there are 80 such pages: they are not as important as the previous ones, so we will give them just half a point each. So you get 40 more points and your score is now 90.

We can go on: there will also be pages with a link towards pages that have a link towards pages that have a link towards `FooBar` (!), but that are not at distance one or two. They are called pages at distance *three*. Say there are 100 such pages: as you can guess, we will give them just one third of a point each. So you get 33.333… more points and your score is now 123.333….

You do this for every page that can get to `FooBar` just by following links, and you have your score by harmonic centrality. We have software that will approximate harmonic centrality for very large graphs.
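The walk-through above amounts to a breadth-first visit that follows links backwards from `FooBar`, adding 1/d for each page found at distance d. Here is a minimal Python sketch, assuming the graph fits in memory as a dictionary from each page to the pages it links to. The function name and the toy graph are made up for illustration, and this computes the exact score on a small graph; it is not the approximation software mentioned above.

```python
from collections import deque


def harmonic_centrality(links, target):
    """Sum 1/d over every page that can reach `target`, d being its distance."""
    # Reverse the graph: for each page, collect the pages that link to it.
    incoming = {}
    for source, targets in links.items():
        for t in targets:
            incoming.setdefault(t, set()).add(source)

    # Breadth-first visit backwards from the target page.
    distance = {target: 0}
    queue = deque([target])
    score = 0.0
    while queue:
        page = queue.popleft()
        for pred in incoming.get(page, ()):
            if pred not in distance:  # first visit gives the shortest distance
                distance[pred] = distance[page] + 1
                score += 1.0 / distance[pred]
                queue.append(pred)
    return score


# Hypothetical toy graph: A and B link to FooBar, C links to A, D links to C.
toy_links = {"A": ["FooBar"], "B": ["FooBar"], "C": ["A"], "D": ["C"]}
# Two pages at distance one, one at distance two, one at distance three.
print(harmonic_centrality(toy_links, "FooBar"))
```

Each page is counted once, at its shortest distance from `FooBar`, which is why a page already seen at distance one is not counted again at distance two.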

### Indegree & PageRank

Since we like options, we let you play with other rankings. Ranking by indegree simply means that your score is the number of pages with a link towards you: the more links, the higher your rank. It is like stopping the computation of harmonic centrality at the very first step.
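To make the indegree ranking concrete, here is a small Python sketch. It assumes the graph is stored as a dictionary from each page to the pages it links to; the function name and the toy graph are invented for illustration. It also ignores a page linking to itself, which is an assumption of this sketch rather than something stated about the site.

```python
def indegree(links, target):
    """Count the distinct pages (other than `target`) that link to `target`."""
    return sum(
        1
        for source, targets in links.items()
        if source != target and target in targets
    )


# Hypothetical toy graph: A and B link to FooBar, C links to A, D links to C.
toy_links = {"A": ["FooBar"], "B": ["FooBar"], "C": ["A"], "D": ["C"]}
print(indegree(toy_links, "FooBar"))  # only A and B are at distance one
```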

PageRank is a well-known centrality measure that counts the number of possible ways in which you can get from any other page to your page. The definition of PageRank appears in the first paper about Google. It has been computed using the LAW library.
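For readers who want to see the idea in code, one common way to compute PageRank is power iteration. The Python sketch below is only an illustration: the damping factor of 0.85, the fixed number of iterations, and the even redistribution of rank from pages without outgoing links are assumptions of this sketch, not the settings used for the site, whose values come from the LAW library.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration on a small in-memory graph."""
    # Collect every page, including pages that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank on in equal shares along its links.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B link to FooBar, C links to A, D links to C.
toy_links = {"A": ["FooBar"], "B": ["FooBar"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
print(max(ranks, key=ranks.get))
```

The scores always sum to one, so a page's PageRank can be read as its share of the total importance in the graph.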

You can find a very detailed and readable discussion of all these indices in this paper.

### Page views

That's easy: the number of page views in the last year. Follow the hot trends!

## Extract interesting relationships from Wikidata

We use Wikidata weekly dumps to assign to each Wikipedia page a number of interesting tags. The most important is “instance of”, which makes it possible to look for “human” or “musical group”. But then we have “genre”, “occupation”, etc. We associate with each page all superclasses of its base classification, so the Beatles are a “musical group”. Otherwise, they would just be an “English pop-rock band” (indeed, they used to be a “rock band”). We pack all this information in an MG4J index for fast search.
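Associating a page with all superclasses of its base classification is a transitive closure over the class hierarchy. Here is a minimal Python sketch, assuming the hierarchy is available as a dictionary from each class to its direct superclasses. The function name and the tiny hierarchy are invented for illustration and do not reproduce the actual Wikidata entries.

```python
def all_superclasses(base_classes, subclass_of):
    """Return the base classes together with every class reachable above them."""
    seen = set()
    stack = list(base_classes)
    while stack:
        cls = stack.pop()
        if cls not in seen:  # guards against cycles and repeated work
            seen.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return seen


# Hypothetical toy hierarchy, using the labels from the text above.
toy_hierarchy = {
    "English pop-rock band": ["rock band"],
    "rock band": ["musical group"],
}
tags = all_superclasses(["English pop-rock band"], toy_hierarchy)
print("musical group" in tags)
```

With the closure stored alongside each page, a search for “musical group” finds the page even though its base classification is more specific.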

## Play

You can select a particular ranking, or see them all. In the search box, you can select the set of pages of interest with a Boolean query. See the FAQ for a description of the query syntax.

Share your findings! It is easy to share an interesting list you've found using the share button. Remember that the purpose is *exploration*, not *evaluation*: no centrality measure will ever completely capture a vague notion such as “importance”.