Difference between revisions of "Wikipedia Category Graph"
Revision as of 14:02, 6 July 2010
Wikipedia Category Graph
Short Description: | Represent Wikipedia categories with a graph-based model in order to analyze them further. |
Coordinator: | MarcoColombetti (colombet@elet.polimi.it) |
Tutor: | DavidLaniado (david.laniado@gmail.com), RiccardoTasso (tasso@elet.polimi.it) |
Collaborator: | |
Students: | JacopoFarina (jacopo1.farina@mail.polimi.it) |
Research Area: | Social Software and Semantic Web |
Research Topic: | Graph Mining and Analysis |
Start: | 2010/06/10 |
End: | 2010/10/01 |
Status: | Active |
Level: | Bs |
Type: | Course |
The goal of the project is to analyze Wikipedia categories by representing them in a graph-based database.
Wikipedia categories do not form a tree-based structure. A category may be contained in another one, which is contained in a third one, which is in turn contained in the first, generating a cyclic reference. In addition, many categories may be root categories (not contained in any other).
For these reasons a graph database is better suited to representing the structure.
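The two properties mentioned above (cycles and multiple roots) can be checked on a small example. The following is a minimal sketch in Python, not the project's code: the category names and the (subcategory, parent) pair format are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (subcategory, parent category) pairs.
# A -> B -> C -> A is a cycle; "Science" and "Arts" are both roots.
MEMBERSHIPS = [
    ("Physics", "Science"),
    ("Painting", "Arts"),
    ("A", "B"),
    ("B", "C"),
    ("C", "A"),
]

def find_roots(pairs):
    """Return the categories that are not contained in any other category."""
    children = {child for child, _ in pairs}
    parents = {parent for _, parent in pairs}
    return sorted(parents - children)

def has_cycle(pairs):
    """Detect a cycle in the 'is contained in' relation (iterative DFS)."""
    graph = defaultdict(list)
    for child, parent in pairs:
        graph[child].append(parent)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = defaultdict(int)
    for start in list(graph):
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if color[nxt] == GREY:
                    return True  # reached a node already on the current path
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK
                stack.pop()
    return False
```

On this toy data the structure has two roots and one cycle, so it cannot be stored as a single tree.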
Creation and further analysis of the database with igraph
Wikipedia lets users download the entire site database (with all versions of all articles) or just selected parts of it. We use a selection that contains the category list and the article memberships in those categories.
Neo4j (http://neo4j.org/) is a graph-based database that allows a program to create and manipulate graph structures such as nodes and relationships.
To transfer the database into Neo4j format, it is better to first save it to a file, which is then read one line at a time.
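Reading the file one line at a time keeps memory use constant even for a large dump. The sketch below shows the idea in Python. The tab-separated "member, category" line format and the function name read_edges are assumptions made for illustration, and the step that would write each pair into Neo4j is left out.

```python
import io

def read_edges(lines):
    """Yield (member, category) pairs one line at a time.

    Assumes a hypothetical tab-separated format: member<TAB>category.
    Blank or malformed lines are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not all(parts):
            continue
        yield parts[0], parts[1]

# A file object from open(path, encoding="utf-8") can be passed directly;
# StringIO stands in for it here.
sample = io.StringIO("Physics\tScience\n\nbroken line\nOptics\tPhysics\n")
edges = list(read_edges(sample))
```

Because read_edges is a generator, each pair can be handed to the database as soon as it is parsed, without loading the whole file.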
Once the structure has been transferred into a Neo4j graph, it is possible to create a Pajek file (.net) from it and to perform general analysis as described in Social Network Analysis With Igraph Package Using R.
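A Pajek .net file lists the vertices first and the arcs afterwards, with vertex ids starting at 1. As a rough sketch of the export step, the Python function below builds such a file from a list of directed edges. The name to_pajek and the in-memory edge list are assumptions, since the project reads the structure from Neo4j, and labels containing double quotes are not escaped here.

```python
def to_pajek(edges):
    """Return Pajek .net text for a list of directed (source, target) edges."""
    ids = {}
    for source, target in edges:
        for name in (source, target):
            if name not in ids:
                ids[name] = len(ids) + 1  # Pajek vertex ids start at 1
    lines = ["*Vertices %d" % len(ids)]
    for name, number in ids.items():
        lines.append('%d "%s"' % (number, name))
    lines.append("*Arcs")
    for source, target in edges:
        lines.append("%d %d" % (ids[source], ids[target]))
    return "\n".join(lines) + "\n"

pajek_text = to_pajek([("Optics", "Physics"), ("Physics", "Science")])
```

The resulting text can be written to a file with a .net extension and loaded into igraph as a Pajek graph.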
Results of the analysis
The diameter of the graph is 19. This is the maximum distance (the number of nodes on the shortest path) between any two nodes. The two nodes at this distance are Bowers Hill (a community in Virginia) and m,n,k-game (a board game).
The average distance between two nodes is 4.781262.
The graph density is 1.408986 × 10^-7.
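The three measures above can be illustrated on a toy graph. The following Python sketch is not the igraph computation used for the reported numbers. It rests on several assumptions: the graph is treated as undirected, distance is counted in edges (counting nodes would add 1 to each distance), only reachable pairs enter the average, and density is taken as 2m / (n(n-1)) for n nodes and m edges, with at least two nodes present.

```python
from collections import deque

def bfs_distances(adjacency, start):
    """Return shortest-path distances (in edges) from start to reachable nodes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def graph_metrics(edges):
    """Return (diameter, average distance, density) of an undirected graph."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    n = len(adjacency)
    m = len({frozenset(edge) for edge in edges})  # ignore duplicate edges
    total, pairs, diameter = 0, 0, 0
    for start in adjacency:
        for node, d in bfs_distances(adjacency, start).items():
            if node != start:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    average = total / pairs if pairs else 0.0
    density = 2 * m / (n * (n - 1))
    return diameter, average, density

# Toy path graph a - b - c - d.
diameter, average, density = graph_metrics([("a", "b"), ("b", "c"), ("c", "d")])
```

Running one breadth-first search per node is only practical for small graphs. A very low density, like the one reported above, means that only a tiny fraction of all possible node pairs are directly connected.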
Previous Work
What's in Wikipedia? Mapping Topics and Conflict Using Socially (http://www-users.cs.umn.edu/~echi/papers/2009-CHI2009/p1509.pdf)