Wikipedia social network

From TrustLet, a free, collaborative project for collecting and analyzing information about trust metrics.



What we want to collect and study

  • network of Wikipedia users
    • internal messages: when user A edits the "discussion" page of user B, A is in effect sending a message to B, so we can build the network of who talks to whom. Possible services on it: (1) using a trust metric, suggest to user X other users she does not know yet but might want to contact, for example to edit a wiki page together or to maintain and watch a Wikipedia portal page; (2) test whether and how memes spread over this Wikipedia social network (for example new words or the use of a certain category, both easy to detect automatically); (3) compute some sort of global trust (reputation) from the aggregated trust network; (4) show a user only (or mainly) the text inserted by users she trusts and hide the rest, perhaps only on pages under a Wikipedia edit war. This would create a personalized Wikipedia (daily me?) in which each page (for example, the page on Palestine) might look different to different users (boundaries of relativism?!?).
    • coediting: there is a trust edge (weighted, not directed) between A and B when they coedit pages. The edge could even be directed if both directions are inserted (A edits pages with B, weight=0.3; B edits pages with A, weight=0.5). Possible service on it (based on a trust metric): suggest that A collaborate with some trustworthy users she does not know yet.
    • edit war: there is an edge between A and B when A and B had an edit war over a page. Such an edge expresses distrust, so this would be a distrust network, very interesting but challenging!
  • network of Wikipedia articles (see how to find shortest path between 2 articles, also here with code)
  • network of Wikipedia categories
  • bipartite network of which Wikipedia users edit which Wikipedia articles: very interesting for finding clusters, computing user similarities, computing article similarities (articles edited by the same users can be "similar" even if they are not linked), ...
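
As a rough illustration of the "internal messages" network above, here is a minimal Python sketch. It assumes we already have a list of (page title, contributor) pairs, one per revision. The function name build_talk_network and the sample rows are made up for illustration, and the talk-page prefix is a parameter because it differs per language ("Discussion utent:" on the Friulian Wikipedia).

```python
from collections import Counter


def build_talk_network(revisions, talk_prefix="User talk:"):
    """Map (sender, receiver) -> number of edits sender made on receiver's talk page.

    `revisions` is an iterable of (page_title, contributor_name) pairs.
    """
    edges = Counter()
    for title, contributor in revisions:
        if not title.startswith(talk_prefix):
            continue  # not a user discussion page
        # Strip the prefix and any subpage suffix ("B/Archive 1" -> "B").
        receiver = title[len(talk_prefix):].split("/", 1)[0]
        # Skip missing contributors and users editing their own talk page.
        if contributor and contributor != receiver:
            edges[(contributor, receiver)] += 1
    return edges


# Made-up sample rows, only to show the shape of the input.
sample = [
    ("Discussion utent:Tocaibon", "Klenje"),
    ("Discussion utent:Tocaibon", "Klenje"),
    ("Discussion utent:Tocaibon", "Tocaibon"),  # own talk page: skipped
    ("Discussion utent:Klenje", "Tocaibon"),
    ("Udin", "Klenje"),                         # article page: skipped
]
network = build_talk_network(sample, talk_prefix="Discussion utent:")
```

The result is a directed, weighted edge list, which is the kind of input a trust metric would need. Whether the number of talk-page edits is a good edge weight is an open choice, not something decided on this page.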

How to collect Wikipedia networks

How to download data from Wikipedia?

Possible solution

At http://meta.wikimedia.org/wiki/Data_dumps#What.27s_available.3F

   * pages-articles.xml
         o Contains current version of all article pages, templates, and other pages
         o Excludes discussion pages ('Talk:') and user "home" pages ('User:')
         o Recommended for republishing of content.
   * pages-meta-current.xml
         o Contains current version of all pages, including discussion and user "home" pages.
   * pages-meta-history.xml
         o Contains complete text of every revision of every page (can be very large!)
         o Recommended for research and archives.

pages-meta-history.xml is what we need! I ran a test with the Friulian (Furlan) Wikipedia: I downloaded the file http://download.wikimedia.org/furwiki/20080519/furwiki-20080519-pages-meta-history.xml.bz2 (6.2 MB). I found the list of all dumps at http://download.wikimedia.org/backup-index.html

I tried to look for "talks" to user Tocaibon (see http://fur.wikipedia.org/wiki/Discussion_utent:Tocaibon ). By searching the text file for "propite une buine" (which appears on that page) I found many revisions of this page. So the info is in there, good! The relevant piece of information:

 <page>
   <title>Discussion utent:Tocaibon</title>
   <id>2586</id>
   <revision>
     <id>5902</id>
     <timestamp>2006-05-24T18:26:17Z</timestamp>
     <contributor>
       <username>Klenje</username>
       <id>1</id>
     </contributor>
     <text xml:space="preserve">== Nons gjeografics: cemût scriviu? == Mandi, e je propite une buine idee, si scugne cjatâ un standard par regjons, flums, citâts e vie indevant (par dì  ancje Liste di Stâts dal mont e je dome une propueste). In chest fin setemane o provi a creâ une pagjine su  Vichipedie:Toponims par furlan 

One possible problem: after unzipping, the raw file for Friulian is already 148 MB! At http://meta.wikimedia.org/wiki/Data_dumps#What.27s_available.3F it is written that the raw file for wikipedia is 600 gigabytes! Can we try to get just the info we want? The same page mentions that "Several of the tables are also dumped with mysqldump should anyone find them useful (for the database definition, see the documentation [1]); the gzip-compressed SQL dumps (.sql.gz) can be read directly into a MySQL database but may be less convenient for other database formats".
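
One way around the unzipped size, at least for simple searches like the "propite une buine" test, is to read the .bz2 file directly and never write the decompressed XML to disk. A minimal sketch using Python's standard bz2 module; the function name count_matching_lines is made up for illustration:

```python
import bz2


def count_matching_lines(dump_path, phrase):
    """Count the lines of a bz2-compressed text file that contain `phrase`.

    The file is decompressed on the fly, one line at a time, so neither
    the disk nor the memory ever holds the full uncompressed dump.
    """
    count = 0
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            if phrase in line:
                count += 1
    return count
```

Usage would look like count_matching_lines("furwiki-20080519-pages-meta-history.xml.bz2", "propite une buine"). This only finds text, it does not understand the XML structure.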

We could try to use http://meta.wikimedia.org/wiki/Xml2sql to transform the xml files into sql files and then load only the DB tables we need.

You don't have to read the whole file into memory to parse the XML. I'm not sure exactly how to do it, but it's done in some version of wik2dict. guakawikitalk 12:29, 23 May 2008 (PDT)
Yes! Thanks Guaka! SAX parsers don't need to hold the whole document in memory, while DOM parsers do. From http://en.wikipedia.org/wiki/Simple_API_for_XML#Benefits :

SAX parsers have certain benefits over DOM-style parsers. The quantity of memory that a SAX parser must use in order to function is typically much smaller than that of a DOM parser. DOM parsers must have the entire tree in memory before any processing can begin, so the amount of memory used by a DOM parser depends entirely on the size of the input data. The memory footprint of a SAX parser, by contrast, is based only on the maximum depth of the XML file (the maximum depth of the XML tree) and the maximum data stored in XML attributes on a single XML element. Both of these are always smaller than the size of the parsed tree itself. --PaoloMassa 14:17, 26 May 2008 (PDT)
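
To make the SAX idea concrete, here is a minimal sketch with Python's standard xml.sax module. It keeps only the page title, the contributor's username and the timestamp of each revision and ignores the revision text. The class name RevisionHandler is made up, and the sample XML is a shortened, ASCII-only version of the excerpt above wrapped in an assumed <mediawiki> root element. Real dumps may need more care, for example revisions whose contributor has no <username> element end up here with username None.

```python
import io
import xml.sax


class RevisionHandler(xml.sax.ContentHandler):
    """Collect one (page title, username, timestamp) tuple per revision."""

    FIELDS = ("title", "username", "timestamp")

    def __init__(self):
        super().__init__()
        self.records = []
        self._path = []       # stack of currently open element names
        self._buffer = []     # character data of the current small field
        self._title = None
        self._username = None
        self._timestamp = None

    def startElement(self, name, attrs):
        self._path.append(name)
        self._buffer = []
        if name == "revision":
            # Reset per-revision fields; the page title stays.
            self._username = None
            self._timestamp = None

    def characters(self, content):
        # Buffer only the short fields we need, never the revision <text>.
        if self._path and self._path[-1] in self.FIELDS:
            self._buffer.append(content)

    def endElement(self, name):
        value = "".join(self._buffer).strip()
        if name == "title":
            self._title = value
        elif name == "username":
            self._username = value
        elif name == "timestamp":
            self._timestamp = value
        elif name == "revision":
            self.records.append((self._title, self._username, self._timestamp))
        self._path.pop()
        self._buffer = []


# Shortened sample in the shape of the dump excerpt shown earlier.
sample = b"""<mediawiki>
  <page>
    <title>Discussion utent:Tocaibon</title>
    <id>2586</id>
    <revision>
      <id>5902</id>
      <timestamp>2006-05-24T18:26:17Z</timestamp>
      <contributor>
        <username>Klenje</username>
        <id>1</id>
      </contributor>
      <text xml:space="preserve">== Nons gjeografics ==</text>
    </revision>
  </page>
</mediawiki>"""

handler = RevisionHandler()
xml.sax.parse(io.BytesIO(sample), handler)
```

xml.sax.parse also accepts any binary file-like object, so the same handler could presumably be fed a stream opened with bz2.open(path, "rb") instead of the in-memory sample. Note that handler.records itself still grows with the number of revisions, so for a very large dump the tuples should be written out (or aggregated) as they arrive.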

TODO

check

other info

The license allows it, and there are many different ways of doing it. See for example

There are probably already Python libraries for doing it.

See (in python) http://www.kde-apps.org/content/show.php/Wikipedia+Dump+Reader?content=65244 and this (?) http://blog.prashanthellina.com/2007/10/17/ways-to-process-and-use-wikipedia-dumps/


This is interesting, but there is no info about users: http://download.freebase.com/wex/

http://wikiproject.sourceforge.net/overview.html

How to process Wikipedia dumps?

WikiXRay ( http://meta.wikimedia.org/wiki/WikiXRay ) is a Python tool for automatically processing Wikipedia's XML dumps for research purposes. It also includes a more complete parser that extracts metadata for all revisions and pages in a Wikipedia XML dump, compressed with 7zip (or any other version). See the WikiXRay page on Meta for more info.

Papers about Wikipedia

For an incomplete list of academic conference presentations, peer-reviewed papers and other types of academic writing which focus on Wikipedia as their subject, see Wikipedia:Wikipedia in academic studies. Works that mention Wikipedia only in passing are unlikely to be listed there.
