Wikipedia:Persondata: Difference between revisions

Revision as of 14:01, 1 June 2015

Persondata has been deprecated by this RfC

Persondata was a special set of metadata that could be added to biographical articles only. We are in the process of arranging its removal. In future, such data should be added to Wikidata.

It consisted of standardized data fields with basic information about the person (name, short description, dates of birth and death, and places of birth and death) that, unlike conventional Wikipedia content, could be extracted automatically.

Adding the {{Persondata}} template to a biographical article didn't affect its normal display, since the information wasn't meant to be read by human beings and remained hidden unless users changed their personal stylesheet specifically for it to appear.

WikiProject Persondata worked to improve the usage of Persondata.

Uses

DBpedia (downloads)

Wikidata

Wikidata is a sister project to Wikipedia, which holds data about subjects. Usable information from Persondata has been copied to Wikidata, and Persondata is now considered deprecated. Therefore, please do not add new Persondata templates to pages. However, please do not delete existing Persondata information until otherwise advised. Users are encouraged to get involved with Wikidata.

Parameter	Type	Property ID
NAME	Firstname Lastname	label (e.g. Michel Velleman (Q151605))
ALTERNATIVE NAMES	Other Name1, Othername2	alias (see also p742; p1448; p1449; p1477; etc.)
SHORT DESCRIPTION	Claim to fame	description (see also p27; p39; p106; etc.)
DATE OF BIRTH	day or year of birth	p569
PLACE OF BIRTH	birthplace	p19
DATE OF DEATH	day or year of death	p570
PLACE OF DEATH	deathplace	p20

The persondata template cannot get its information from Wikidata.
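
For scripted work, the mapping in the table above can be written down as a small lookup. The following is a minimal Python sketch; the names PERSONDATA_TO_WIKIDATA and wikidata_target are illustrative and not part of any Wikidata tooling, and the property IDs are written with the capital P that Wikidata itself uses.

```python
# Mapping of Persondata parameters to Wikidata terms or property IDs,
# taken from the table above. NAME, ALTERNATIVE NAMES and SHORT DESCRIPTION
# map to item terms (label, alias, description) rather than to properties.
PERSONDATA_TO_WIKIDATA = {
    "NAME": "label",
    "ALTERNATIVE NAMES": "alias",
    "SHORT DESCRIPTION": "description",
    "DATE OF BIRTH": "P569",
    "PLACE OF BIRTH": "P19",
    "DATE OF DEATH": "P570",
    "PLACE OF DEATH": "P20",
}


def wikidata_target(parameter: str) -> str:
    """Return the Wikidata term or property ID for a Persondata parameter."""
    # Normalise case and runs of whitespace before the lookup.
    key = " ".join(parameter.upper().split())
    return PERSONDATA_TO_WIKIDATA[key]
```

For example, wikidata_target("date of birth") looks up the DATE OF BIRTH row of the table.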

Viewing persondata

File:Persondatascreen.png

A screenshot showing Persondata from Mohandas Karamchand Gandhi

By default, persondata is invisible to normal users. To make persondata visible, you must:

Either install this JavaScript in Special:Mypage/skin.js, which will add a button to the top button bar of every page allowing you to easily show and hide persondata boxes;
Or edit your user stylesheet as explained below, causing persondatas to be always visible;
Or even do both, as one method doesn't interfere with the other and the above JavaScript has useful persondata-editing features.

To make persondatas permanently visible, first make sure you are logged in. Then edit (or create) a page at Special:Mypage/skin.css and add the following line:

table.persondata {display:table !important;}

or, if you use Microsoft Internet Explorer 7 or earlier:

table.persondata {display:block !important;}

Tip: After saving the CSS, you must empty the browser cache to see the changes: Mozilla/Firefox (Windows): Ctrl-Shift-R; Mozilla/Firefox/Camino (Mac): Cmd-Shift-R; Internet Explorer (Windows): Ctrl-F5; Opera (all): F5; Safari (Mac): Cmd-R; Konqueror (Linux): Ctrl-R. Some Firefox (Linux) users report that both lines must be present in their monobook.css (though they can probably be simplified to table.persondata {display:block table !important;}), and users who switch between browsers and platforms may need to do likewise.

If you can see a block with data about Ferdinand Magellan between this paragraph and the next, you have successfully made persondata visible: Template:Persondata Otherwise this paragraph will follow directly below the previous one.

To make the persondata box invisible again, simply remove the CSS line provided above from your user stylesheet.

Warning: Since persondatas are by default invisible, editors rarely plan for them when designing the layout of an article, which means that making them visible might cause some article footers to look strange for you. For the same reason, if you have persondatas visible while editing and previewing, remember that most people don't, so planning the layout to accommodate it might cause them to find a strange-looking article footer. Thus, take care to edit from the perspective of the majority. It is best to follow the persondata placement advice given on this page.

Template

Position

The {{Persondata}} template was included in biographical articles:

{{Persondata
| NAME              = 
| ALTERNATIVE NAMES = 
| SHORT DESCRIPTION = 
| DATE OF BIRTH     = 
| PLACE OF BIRTH    = 
| DATE OF DEATH     = 
| PLACE OF DEATH    = 
}}

Parameters

The parameters NAME, ALTERNATIVE NAMES, SHORT DESCRIPTION, DATE OF BIRTH, PLACE OF BIRTH, DATE OF DEATH, and PLACE OF DEATH were used to construct a persondata record.

Examples

Fieldname	Examples (separated by " / ")
NAME	Magellan, Ferdinand / Bush, George Walker / Beethoven, Ludwig van / Van Zandt, Townes / Brutus of Troy / King, Martin Luther, Jr. / Wainwright, Loudon, III / John Paul II, Pope / Elizabeth II / John the Baptist / Francis of Assisi, Saint / Tokugawa, Ieyasu / Fujiwara no Michinaga
ALTERNATIVE NAMES	Magalhães, Fernão de (Portuguese); Magallanes, Fernando de (Spanish) / Clemens, Samuel Langhorne (real name)
SHORT DESCRIPTION	Sea explorer / German philosopher / Anarchist writer and publisher / 39th President of the United States
DATE OF BIRTH	1480 / 25 October 1806 / October 25, 1806 / c. 470 BCE
PLACE OF BIRTH	Sabrosa, Portugal / Texas / Newark, New Jersey
DATE OF DEATH	27 April 1521 / April 27, 1521 / January 1945 / 1421
PLACE OF DEATH	Mactan Island, Cebu, Philippines / Mount Juliet, Tennessee / Ushuaia, Argentina

Extraction of persondata

With project Templatetiger

With project Templatetiger it is possible to view and output the data with:

http://toolserver.org/~kolossos/templatetiger/tt-table4.php?template=Persondata&lang=en

From an SQL database

Using an SQL query, the persondata can be filtered from Wikipedia articles stored in a database. As an example, here is an SQL query that can be used to extract persondata:

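SELECT
   pages.cur_namespace,
   pages.cur_title,
   -- Cut from '{{Persondata' up to and including the first '}}' after it
   SUBSTRING(SUBSTRING(pages.cur_text FROM INSTR(pages.cur_text,'{{Persondata')), 1,
      INSTR(SUBSTRING(pages.cur_text FROM INSTR(pages.cur_text,'{{Persondata')),'}}')+1)
      AS 'Persondata'
FROM cur AS pd                   -- pd: the Template:Persondata page itself
JOIN templatelinks AS tl         -- tl: links from pages that use the template
   ON pd.cur_namespace = tl.tl_namespace
   AND pd.cur_title = tl.tl_title
JOIN cur AS pages                -- pages: the articles that use the template
   ON tl.tl_from = pages.cur_id
   AND pages.cur_namespace = 0   -- namespace 0: articles
WHERE pd.cur_namespace = 10      -- namespace 10: templates
AND pd.cur_title = 'Persondata';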

In order to be useful, however, the persondata must be further divided into individual data fields.
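
Dividing an extracted block into fields can be sketched in a few lines of Python. This is an illustration only, not one of the scripts described on this page: the function name parse_persondata is hypothetical, and the sketch assumes that each parameter starts on its own line with a leading pipe, as in the template layout shown under Position.

```python
import re


def parse_persondata(template_text: str) -> dict:
    """Split one {{Persondata ...}} block into a dict of field name -> value."""
    body = template_text.strip()
    # Drop the opening '{{Persondata' and the final closing '}}'.
    body = re.sub(r"^\{\{\s*Persondata", "", body, flags=re.IGNORECASE)
    body = re.sub(r"\}\}\s*$", "", body)

    record = {}
    # Split only on pipes that start a line, so that pipes inside
    # wikilinks such as [[Target|label]] stay part of the value.
    for part in re.split(r"^\s*\|", body, flags=re.MULTILINE):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        record[" ".join(key.split()).upper()] = value.strip()
    return record
```

Empty parameters are kept with an empty string as their value, so a caller can tell a blank field from a missing one.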

From the XML dump

Persondata can also be extracted from the regular Wikipedia database dumps. The following procedure has been adapted from scripts written to do this for the German Wikipedia by de:User:JakobVoss (who is also User:Nichtich); the original is described (in German) at de:Hilfe:Personendaten/Datenextraktion. The process consists of four stages: downloading the database dump, extracting the persondata, parsing the persondata, and optionally loading it into a MySQL database. (This is an example of an Extract, transform, load process.) As a rough guide, downloading the database dump will take a few hours with a fast internet connection and extracting the persondata will take around an hour, while parsing the persondata and loading it into a MySQL database each take a few seconds.

System requirements

The original scripts were written for Linux; however, they can also be run in Windows, either by using a Linux emulator or by downloading Windows versions of the necessary software:

java (Most Windows machines will already have this installed)
bzip2
perl

In addition, if you want to load the extracted persondata into a MySQL database you will need MySQL.

Downloading the database dump

Database dumps can be found at http://download.wikimedia.org/enwiki. The subdirectories are named after the date of the dump. The file needed for extracting persondata is named enwiki-date-pages-articles.xml.bz2, e.g. enwiki-20070908-pages-articles.xml.bz2. The latest version of this file can always be found at http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. As of June 2012 this file is 8.0 GB in size. You may find it useful to use wget to download this.
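
As an alternative to wget, the naming scheme above can be turned into a small helper. This is a minimal Python sketch: the function names dump_url and download_dump are illustrative, and the host name is the one given above, which may have changed since this page was written.

```python
import urllib.request

# Base address as given in the text above (an assumption that it is still valid).
BASE_URL = "http://download.wikimedia.org/enwiki"


def dump_url(date: str = "latest") -> str:
    """Build the pages-articles dump URL for a date (YYYYMMDD) or 'latest'."""
    if date != "latest" and not (len(date) == 8 and date.isdigit()):
        raise ValueError("date must be 'latest' or eight digits (YYYYMMDD)")
    return f"{BASE_URL}/{date}/enwiki-{date}-pages-articles.xml.bz2"


def download_dump(date: str = "latest", target: str = "") -> str:
    """Download the dump to 'target' (default: the dump's file name)."""
    url = dump_url(date)
    target = target or url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, target)  # several gigabytes; this takes a while
    return target
```

Calling download_dump("20070908") would save the file named in the example above into the current directory.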

Extracting persondata

Files needed:

joost.jar (latest version available from http://joost.sourceforge.net/)
addNamespaces.stx
extractPersondata.stx
pd2tab.stx

Bzip2 is used to uncompress the dump, and the output is passed to three piped STX-scripts to extract the information in the persondata templates. STX is implemented in the java archive joost.jar.

The syntax for calling these scripts is

bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | java -jar joost.jar - addNamespaces.stx extractPersondata.stx pd2tab.stx > 20070908-extract.tab

This can be typed in at the command prompt in Linux or in Windows (Start->Run->cmd). Alternatively, in Windows it can be typed into a text file with the .bat extension (e.g. extract.bat), which can then be run by double-clicking on it. Note that in Windows you will need to type bzip2.exe instead of bzip2, and if bzip2 is not in the same directory as the database dump you will need to specify the full file path (e.g. C:\full\file\path\bzip2.exe).

This process outputs a running total of the number of articles found with persondata. It also outputs a running total of articles with Template:PND. This is a legacy of the original German scripts; it was easier not to remove it when adapting them. (A Personennamendatei number is assigned to all German-speaking authors, and can be used to link to the catalogue of the German National Library. Some 170,000 articles in the German Wikipedia use this template. A few hundred articles currently use it in the English Wikipedia.)

The output of this step is a tab-separated file (20070908-extract.tab in the above example) which contains the information from the persondata template.
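
The same extraction idea can be sketched without the STX scripts, using only the Python standard library. This is not a port of extractPersondata.stx: the function name iter_persondata is hypothetical, and the sketch assumes the usual dump layout in which each page element holds a title and the revision text. Like the SQL query earlier on this page, it cuts the block at the first "}}" after "{{Persondata", so a template nested inside a field would truncate the result.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_persondata(dump_file):
    """Yield (title, persondata_block) for each page that uses {{Persondata}}."""
    title = None
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
            start = text.find("{{Persondata")
            if start != -1:
                end = text.find("}}", start)
                if end != -1:
                    yield title, text[start:end + 2]
        elif tag == "page":
            elem.clear()  # drop the page's contents to limit memory use


if __name__ == "__main__":
    # File name taken from the example above; the dump is read as a stream.
    with bz2.open("enwiki-20070908-pages-articles.xml.bz2", "rb") as dump:
        for page_title, block in iter_persondata(dump):
            print(page_title, block.replace("\n", " "), sep="\t")
```

Reading through bz2.open avoids writing the uncompressed dump to disk, in the same way the piped bzip2 command does.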

Parsing persondata

File needed:

transform.pl

The information entered in the fields of the persondata template can take many forms, especially the dates. For many applications it is useful to have such information in standardised form. The Perl script transform.pl takes the XXXX-extract.tab file from the previous step and parses the fields to obtain quantities such as day, month, year, decade, century for the dates, given name and surname for names of the form Smith, John, article name where the first place in the birth/death place field is a wikilink, etc.

The syntax for this step is

transform.pl 20070908-extract.tab > 20070908-full.tab

This produces another tab-separated file. If desired, this can be loaded into a spreadsheet and certain basic information obtained by either sorting the columns or searching for appropriate terms. More complicated analysis, however, is more easily done using a database.
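
The kind of normalisation described above can be illustrated with a short Python sketch. It is not the logic of transform.pl, which is not reproduced on this page: the function names parse_date and split_name are hypothetical, only the date forms listed under Examples are handled, and BCE years are returned as negative numbers as a simple convention.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Date layouts seen in the Examples table, tried in this order.
PATTERNS = [
    r"(?P<d>\d{1,2}) (?P<m>[A-Za-z]+) (?P<y>\d{1,4})",   # 25 October 1806
    r"(?P<m>[A-Za-z]+) (?P<d>\d{1,2}), (?P<y>\d{1,4})",  # October 25, 1806
    r"(?P<m>[A-Za-z]+) (?P<y>\d{1,4})",                  # January 1945
    r"(?P<y>\d{1,4})",                                   # 1480
]


def parse_date(value):
    """Return (year, month, day) with None for unknown parts, or None if unparsed."""
    text = value.strip()
    bce = re.search(r"\bBCE?$", text) is not None
    text = re.sub(r"\s*\b(?:BCE?|CE|AD)$", "", text)     # drop the era suffix
    text = re.sub(r"^(?:c\.|ca\.|circa)\s*", "", text)   # drop "circa" markers
    for pattern in PATTERNS:
        match = re.fullmatch(pattern, text)
        if not match:
            continue
        parts = match.groupdict()
        month = MONTHS.get(parts["m"]) if parts.get("m") else None
        if parts.get("m") and month is None:
            return None  # a word that is not an English month name
        year = int(parts["y"])
        day = int(parts["d"]) if parts.get("d") else None
        return (-year if bce else year, month, day)
    return None


def split_name(name):
    """Split 'Surname, Given names' at the first comma; given is None without one."""
    surname, comma, given = name.partition(",")
    return surname.strip(), (given.strip() if comma else None)
```

Values that match none of the patterns come back as None, so they can be set aside for manual review rather than guessed at.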

Loading persondata into a database

File needed:

table.sql

If you have MySQL installed you can run table.sql to create a table called pub_pd_en and load in the data from the XXXX-full.tab file. (You will need to change the filename at the end of table.sql.)

Within MySQL the syntax to run this is

source C:/full/file/path/table.sql;
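
Where MySQL is not available, a similar load can be sketched with SQLite from the Python standard library. This swaps in a different database and is not equivalent to table.sql: the column layout of the tab-separated file is not described on this page, so the eight columns below (article title followed by the seven template fields) are purely an assumption for illustration, as is the function name load_tab. Only the table name pub_pd_en is taken from the text above.

```python
import csv
import sqlite3

# Hypothetical column layout; adjust to the real layout of the tab file.
COLUMNS = ("title", "name", "alternative_names", "short_description",
           "date_of_birth", "place_of_birth", "date_of_death", "place_of_death")


def load_tab(connection, tab_lines):
    """Create pub_pd_en if needed, insert well-formed rows, return the row count."""
    column_defs = ", ".join(f"{column} TEXT" for column in COLUMNS)
    connection.execute(f"CREATE TABLE IF NOT EXISTS pub_pd_en ({column_defs})")
    reader = csv.reader(tab_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip lines that do not have exactly one value per column.
    rows = [row for row in reader if len(row) == len(COLUMNS)]
    placeholders = ", ".join("?" for _ in COLUMNS)
    connection.executemany(f"INSERT INTO pub_pd_en VALUES ({placeholders})", rows)
    connection.commit()
    return len(rows)


if __name__ == "__main__":
    with open("20070908-full.tab", encoding="utf-8", newline="") as tab_file:
        count = load_tab(sqlite3.connect("persondata.db"), tab_file)
    print(count, "rows loaded")
```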

Linux scripts

In the original implementation on the German Wikipedia the whole process from extracting data to loading it into a database was performed by a single shell script, etl, which in turn called scripts extract.pl, transform.pl and load.pl. If you wish to use these they can be found at http://toolserver.org/~voj/pd/staging-area/. In addition to the modified files listed under the previous steps, some minor modifications to extract.pl and load.pl are necessary to use these for the English case, e.g. replacing de with en, and extractPersonendaten with extractPersondata in extract.pl, and using your own username in load.pl. The modified version of transform.pl given under Parsing persondata above should of course be used as well.

Files:
