Wikipedia:Persondata

Shortcut:
WP:PERSON
WP:DATA

Persondata is special metadata which can be added to biographical articles. This metadata can then be extracted and processed automatically (unlike conventional Wikipedia content). It consists of a set of standardized data fields which include basic information about the person, such as name, birthday, place of birth, etc. This metadata can be used for a variety of purposes, including advanced search capabilities, statistical analysis, automated categorization, and birthday lists. The addition of persondata will not affect the normal display of an article since the information remains hidden unless a user sets their user stylesheet to display it.

As of February 12, 2008, there are approximately 18,200 articles with persondata. (As of October 2007 the German Wikipedia has over 178,700 articles with "Personendaten" [1].)

A new WikiProject (WikiProject Persondata) is seeking contributors.

Using the template

To use the {{Persondata}} template, copy the wikitext below to the end of a biographical article and fill in the parameters manually, or use this JavaScript, which can add the template and fill in the information semi-automatically from infoboxes. If you add the template manually, place it just before the categories and interlanguage links. ({{DEFAULTSORT}} is not a real template but part of the categorization syntax, so it should be placed between the persondata and the categories.)

<!-- Metadata: see [[Wikipedia:Persondata]] -->
{{Persondata
|NAME              = 
|ALTERNATIVE NAMES = 
|SHORT DESCRIPTION = 
|DATE OF BIRTH     = 
|PLACE OF BIRTH    = 
|DATE OF DEATH     = 
|PLACE OF DEATH    = 
}}

Next, fill out the data fields. Enter the name with the surname first (the same way you would for a category listing). Do not delete empty data fields: if a person is still alive, for example, leave the date and place of death blank. Here is an example of a properly filled-out template:

<!-- Metadata: see [[Wikipedia:Persondata]] -->
{{Persondata
|NAME              = Magellan, Ferdinand
|ALTERNATIVE NAMES = Magalhães, Fernão de (Portuguese); Magallanes, Fernando de (Spanish)
|SHORT DESCRIPTION = Sea explorer
|DATE OF BIRTH     = Spring [[1480]]
|PLACE OF BIRTH    = [[Sabrosa]], [[Portugal]]
|DATE OF DEATH     = [[April 27]], [[1521]]
|PLACE OF DEATH    = [[Mactan Island]], [[Cebu]], [[Philippines]]
}}
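
For anyone generating the template from existing data, the layout above could be produced with a short script. The following is only a sketch: the function name render_persondata and the FIELDS list are illustrative and not part of any Wikipedia tool. It keeps empty fields in place, as recommended above.

```python
# Field names in the order used by the {{Persondata}} template.
FIELDS = [
    "NAME",
    "ALTERNATIVE NAMES",
    "SHORT DESCRIPTION",
    "DATE OF BIRTH",
    "PLACE OF BIRTH",
    "DATE OF DEATH",
    "PLACE OF DEATH",
]


def render_persondata(values):
    """Return Persondata wikitext; fields missing from `values` stay blank."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    lines = ["<!-- Metadata: see [[Wikipedia:Persondata]] -->", "{{Persondata"]
    for field in FIELDS:
        # Pad the field name so the equals signs line up as in the example.
        lines.append(f"|{field:<18}= {values.get(field, '')}")
    lines.append("}}")
    return "\n".join(lines)


print(render_persondata({
    "NAME": "Magellan, Ferdinand",
    "SHORT DESCRIPTION": "Sea explorer",
}))
```

Fields that are not supplied (here the dates and places) are still emitted with empty values rather than dropped.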

Viewing persondata

A screenshot showing Persondata from Mahatma Gandhi

By default, persondata is invisible to normal users. In order to make persondata visible, you must edit your user stylesheet. To do this, first make sure you are logged in. Then create a page at User:YourUserName/monobook.css and add the following line:

table.persondata {display:table;}

or, if you use Microsoft Internet Explorer:

table.persondata {display:block;}

Tip: After saving User:YourUserName/monobook.css, clear your browser cache to see the changes. Mozilla/Firefox: Shift-Ctrl-R; Internet Explorer: Ctrl-F5; Opera: F5; Safari: ⌘-R; Konqueror: Ctrl-R.

If you can see the following block about Ferdinand Magellan, you have successfully made persondata visible:

Persondata
NAME Magellan, Ferdinand
ALTERNATIVE NAMES Magalhães, Fernão de (Portuguese); Magallanes, Fernando de (Spanish)
SHORT DESCRIPTION Sea explorer
DATE OF BIRTH Spring 1480
PLACE OF BIRTH Sabrosa, Portugal
DATE OF DEATH April 27, 1521
PLACE OF DEATH Mactan Island, Cebu, Philippines

To make persondata invisible again, simply remove the line of CSS given above from your user stylesheet.

Data fields

The data fields NAME, ALTERNATIVE NAMES, SHORT DESCRIPTION, DATE OF BIRTH, PLACE OF BIRTH, DATE OF DEATH, and PLACE OF DEATH are used to construct a persondata record. These fields may be extended in the future.

Fieldname Examples
NAME

Magellan, Ferdinand
Bush, George Walker
Beethoven, Ludwig van
Van Zandt, Townes
Brutus of Troy
King, Martin Luther, Jr.
Wainwright, Loudon, III
John Paul II, Pope
Elizabeth II
John the Baptist
Francis of Assisi, Saint
Tokugawa, Ieyasu
Fujiwara no Michinaga

ALTERNATIVE NAMES

Magalhães, Fernão de (Portuguese); Magallanes, Fernando de (Spanish)
Clemens, Samuel Langhorne (real name)

SHORT DESCRIPTION

Sea explorer
German philosopher
Anarchist writer and publisher
39th President of the United States

DATE OF BIRTH

1480
October 25, 1806
circa 470 BCE

PLACE OF BIRTH

Sabrosa, Portugal
Texas
Newark, New Jersey

DATE OF DEATH

April 27, 1521
January 1945
1421

PLACE OF DEATH

Mactan Island, Cebu, Philippines
Mount Juliet, Tennessee
Ushuaia, Argentina

Wikilinks in the persondata are not currently necessary; however, they may be useful for some future application.

Name

When specifying the person's name, use the following format: [surname], [forename] [middle names], [title]. In most cases this will be straightforward; for example, "George Walker Bush" becomes "Bush, George Walker". In some cases, however, there may be ambiguity about a person's surname. When in doubt, format the name according to how you would expect it to be alphabetized. For example, Ludwig van Beethoven would be alphabetized under "Beethoven", while Townes Van Zandt would be alphabetized under "Van Zandt". If you're not sure, ask someone familiar with the subject how they would alphabetize the name, or consult a cataloguing guide such as the AACR2.

It is usually a good idea to list as much of a person's name as possible in the name field to avoid confusion with similar names. Do not include honorifics (such as "Dr.", "Professor", or "PhD"), however, unless they are part of a title of nobility.
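
The name format can be illustrated with a small helper. This is a sketch only: format_name is a hypothetical function, and it deliberately takes the surname as an explicit argument, because deciding which part of a name is the surname requires human judgement, as described above.

```python
def format_name(given, surname=None, title=None):
    """Build a NAME value: [surname], [forename] [middle names], [title]."""
    parts = []
    if surname:
        parts.append(surname)
    parts.append(given)
    if title:
        parts.append(title)
    return ", ".join(parts)


print(format_name("George Walker", surname="Bush"))
print(format_name("Townes", surname="Van Zandt"))
print(format_name("Martin Luther", surname="King", title="Jr."))
print(format_name("John Paul II", title="Pope"))
```

People known by a single name or a regnal name simply omit the surname argument.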

Birth and death dates

Do not use templates within the birth and death date fields as they can interfere with data extraction. Abraham Lincoln's birthday, for example, should be listed as "February 12, 1809", not "{{birth date|1809|2|12|mf=y}}".
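
A check for this rule is easy to automate. The sketch below assumes the persondata fields are already available as a dictionary keyed by field name; the function name date_field_problems is illustrative.

```python
def date_field_problems(fields):
    """Return the date fields whose value contains template markup ("{{")."""
    date_fields = ("DATE OF BIRTH", "DATE OF DEATH")
    return [name for name in date_fields if "{{" in fields.get(name, "")]


record = {
    "NAME": "Lincoln, Abraham",
    "DATE OF BIRTH": "{{birth date|1809|2|12|mf=y}}",
}
print(date_field_problems(record))
```

Wikilinks such as [[April 27]] use square brackets and are not flagged.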

Motivation

Without uniform formatting, it is very difficult to automatically extract useful information from biographical articles. It is also impossible to automatically alphabetize all the biographical articles since the titles typically begin with the person's first name. By adding standardized metadata to such articles, we can facilitate the creation of new applications for Wikipedia content, such as Wikipedia CD-ROMs, custom search applications, etc. Hopefully, this will be the first of many steps towards enriching Wikipedia with semantic content.

Extraction of persondata

Extraction with project Templatetiger

With the Templatetiger project it is possible to view and output the data with:

Extraction from an SQL database

Using an SQL query, the persondata can be filtered from Wikipedia articles stored in a database. As an example, here is an SQL query that can be used to extract persondata:

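-- Return the {{Persondata ...}} block of every main-namespace article
-- that links to Template:Persondata.
SELECT
   pages.cur_namespace,
   pages.cur_title,
   -- Text from the first '{{Persondata' up to and including the next '}}'
   SUBSTRING(SUBSTRING(pages.cur_text FROM INSTR(pages.cur_text,'{{Persondata')), 1,
      INSTR(SUBSTRING(pages.cur_text FROM INSTR(pages.cur_text,'{{Persondata')),'}}')+1)
      AS 'Persondata'
FROM cur AS pd                    -- the template page itself
JOIN templatelinks AS tl          -- links pointing at the template
   ON pd.cur_namespace = tl.tl_namespace
   AND pd.cur_title = tl.tl_title
JOIN cur AS pages                 -- the articles that use the template
   ON tl.tl_from = pages.cur_id
   AND pages.cur_namespace = 0    -- main (article) namespace only
WHERE pd.cur_namespace = 10       -- Template namespace
AND pd.cur_title = 'Persondata';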

In order to be useful, however, the persondata must be further divided into individual data fields.
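
One possible way to divide an extracted block into its fields is sketched below in Python. It assumes each field sits on its own line beginning with "|", as in the copy-and-paste layout shown earlier; the names FIELD_RE and parse_persondata are illustrative. Working line by line means a pipe inside a wikilink does not split a value.

```python
import re

# One field per line: "|FIELD NAME = value" (the value may be empty).
FIELD_RE = re.compile(
    r"^\|[ \t]*([A-Z][A-Z ]*?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def parse_persondata(block):
    """Map each field name in a {{Persondata}} block to its raw value."""
    return {name: value for name, value in FIELD_RE.findall(block)}


block = """{{Persondata
|NAME              = Magellan, Ferdinand
|DATE OF BIRTH     = Spring [[1480]]
|PLACE OF DEATH    = [[Mactan Island]], [[Cebu]], [[Philippines]]
|DATE OF DEATH     =
}}"""
for field, value in parse_persondata(block).items():
    print(field, "->", repr(value))
```

Blocks written on a single line, or values that span several lines, would need a more careful parser.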

Extraction from the XML dump

Persondata can also be extracted from the regular Wikipedia database dumps. The following procedure has been adapted from scripts written to do this for the German Wikipedia by de:User:JakobVoss (who is also User:Nichtich). This is described (in German) at de:Hilfe:Personendaten/Datenextraktion. The process consists of four stages: downloading the database dump, extracting the persondata, parsing the persondata, and optionally loading it into a MySQL database. (This is an example of an Extract, transform, load process.) As a rough guide, downloading the database dump will take a few hours with a fast internet connection and extracting the persondata will take around an hour, while parsing the persondata and loading it into a MySQL database each take a few seconds.

System requirements

The original scripts were written for Linux; however, they can also be run in Windows, either by using a Linux-like environment such as Cygwin or by downloading Windows versions of the necessary software:

In addition, if you want to load the extracted persondata into a MySQL database, you will need MySQL.

Downloading the database dump

Database dumps can be found at http://download.wikimedia.org/enwiki. The subdirectories are named after the date of the dump. The file needed for extracting persondata is named enwiki-date-pages-articles.xml.bz2, e.g. enwiki-20070908-pages-articles.xml.bz2. The latest version of this file can always be found at http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. As of September 2007 this file is 2.8 GB in size. You may find it useful to use Wget to download this.
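
The naming scheme described above can be captured in a few lines of Python. This is a sketch: dump_url and download_dump are illustrative names, and the URL layout is taken directly from the description in this section. Nothing is downloaded unless download_dump is called.

```python
import urllib.request

BASE_URL = "http://download.wikimedia.org/enwiki"


def dump_url(date="latest"):
    """Build the pages-articles dump URL for a dump date such as '20070908'."""
    return f"{BASE_URL}/{date}/enwiki-{date}-pages-articles.xml.bz2"


def download_dump(date="latest"):
    """Download the dump into the current directory and return the file name."""
    url = dump_url(date)
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)  # large download, takes hours
    return filename


print(dump_url("20070908"))
```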

Extracting persondata

Files needed:

Bzip2 is used to decompress the dump, and its output is piped through three STX scripts that extract the information in the persondata templates. STX is implemented in the Java archive joost.jar.

The syntax for calling these scripts is

bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | java -jar joost.jar - addNamespaces.stx extractPersondata.stx pd2tab.stx > 20070908-extract.tab

This can be typed in at the command prompt in Linux or in Windows (Start->Run->cmd). Alternatively, in Windows it can be saved in a text file with the .bat extension (e.g. extract.bat), which can then be run by double-clicking on it. Note that in Windows you will need to type bzip2.exe instead of bzip2, and if bzip2 is not in the same directory as the database dump you will need to specify its full file path (e.g. C:\full\file\path\bzip2.exe).

This process outputs a running total of the number of articles found with persondata. It also outputs a running total of articles with Template:PND. This is a legacy of the original German scripts; it was easier to leave it in than to remove it when adapting them. (A Personennamendatei number is assigned to all German-speaking authors, and can be used to link to the catalogue of the German National Library. Some 30,000 articles in the German Wikipedia use this template. A few hundred articles currently use it in the English Wikipedia.)

The output of this step is a tab-separated file (20070908-extract.tab in the above example) which contains the information from the persondata template.
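
For readers without Java, the same extraction idea can be sketched with the Python standard library. This is not a port of the STX scripts and does not reproduce their tab-separated output format; it only streams the compressed dump and yields the raw template text of each page. The function name iter_persondata is illustrative, and the block is cut at the first "}}" after "{{Persondata", so a nested template inside a field would truncate it.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# From '{{Persondata' up to and including the first '}}' that follows.
PERSONDATA_RE = re.compile(r"\{\{Persondata.*?\}\}", re.DOTALL)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def iter_persondata(path):
    """Yield (page title, persondata block) pairs from a .xml.bz2 dump."""
    with bz2.open(path, "rb") as stream:
        title = None
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                match = PERSONDATA_RE.search(elem.text or "")
                if match:
                    yield title, match.group(0)
            elif name == "page":
                elem.clear()  # release the page's text once it is processed
```

It could be used as `for title, block in iter_persondata("enwiki-20070908-pages-articles.xml.bz2"): ...`, processing one page at a time instead of loading the whole dump into memory.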

Parsing persondata

File needed:

The information entered in the fields of the persondata template can take many forms, especially the dates. For many applications it is useful to have such information in standardised form. The Perl script transform.pl takes the XXXX-extract.tab file from the previous step and parses the fields to obtain derived quantities: day, month, year, decade, and century for dates; given name and surname for names of the form "Smith, John"; the article name where the first place in the birth or death place field is a wikilink; and so on.
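
To give a feel for the kind of date normalisation involved, here is a minimal Python sketch. It is not the logic of transform.pl; the function parse_date, the regular expressions, and the returned keys are all assumptions made for illustration, and only English month names and simple year forms are recognised.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}
# A month name, optionally followed by a day number (but not a BC/BCE year).
MONTH_DAY_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\b(?:\s+(\d{1,2})\b(?!\s*BC))?"
)
YEAR_RE = re.compile(r"\b(\d{1,4})\b(?:\s*(BCE|BC)\b)?")


def parse_date(raw):
    """Split a free-form date field into day, month, year, decade, century."""
    text = re.sub(r"[\[\]]", "", raw)  # drop wikilink brackets
    result = dict.fromkeys(["day", "month", "year", "decade", "century"])
    result["bce"] = False
    month_day = MONTH_DAY_RE.search(text)
    if month_day:
        result["month"] = MONTHS[month_day.group(1)]
        if month_day.group(2):
            result["day"] = int(month_day.group(2))
        # Remove the month/day so the day is not mistaken for the year.
        text = text[:month_day.start()] + text[month_day.end():]
    year = YEAR_RE.search(text)
    if year:
        value = int(year.group(1))
        result["year"] = value
        result["bce"] = year.group(2) is not None
        result["decade"] = value // 10 * 10
        result["century"] = (value - 1) // 100 + 1
    return result


print(parse_date("[[April 27]], [[1521]]"))
print(parse_date("circa 470 BCE"))
```

Qualifiers such as "Spring" or "circa" are ignored rather than interpreted, so a real implementation would need to decide how to record that uncertainty.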

The syntax for this step is

perl transform.pl 20070908-extract.tab > 20070908-full.tab

This produces another tab-separated file. If desired, this can be loaded into a spreadsheet and certain basic information obtained by either sorting the columns or searching for appropriate terms. More complicated analysis, however, is more easily done using a database.

Loading persondata into a database

File needed: table.sql

If you have MySQL installed, you can run table.sql to create a table called pub_pd_en and load in the data from the XXXX-full.tab file. (You will need to change the filename at the end of table.sql.)

Within MySQL the syntax to run this is

source C:/full/file/path/table.sql;

Linux scripts

In the original implementation on the German Wikipedia the whole process from extracting data to loading it into a database was performed by a single shell script, etl, which in turn called scripts extract.pl, transform.pl and load.pl. If you wish to use these they can be found at http://tools.wikimedia.de/~voj/pd/staging-area/. In addition to the modified files listed under the previous steps, some minor modifications to extract.pl and load.pl are necessary to use these for the English case, e.g. replacing de with en, and extractPersonendaten with extractPersondata in extract.pl, and using your own username in load.pl. The modified version of transform.pl given under Parsing persondata above should of course be used as well.

Files:

See also