Knowledge extraction


Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to Information Extraction (NLP) and ETL (Data Warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.

The RDB2RDF W3C group [1] is currently standardizing a language for the extraction of RDF from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia, Freebase).


Overview

After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding the transformation of relational databases into RDF, entity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from Information Extraction and ETL, which transform the data from the sources into structured formats.

The following criteria can be used to categorize approaches in this topic (some of them only apply to extraction from relational databases):

Source: Which data sources are covered (text, relational databases, XML, CSV)?
Synchronization: Is the knowledge extraction process executed once to produce a dump, or is the result synchronized with the source? Static or dynamic.
Exposition: How is the extracted knowledge made explicit (ontology file, semantic database)? How can it be queried?
Reuse of vocabularies: Is the tool able to reuse existing vocabularies in the extraction? For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies.
Requires a domain ontology: Is a pre-existing ontology needed to map to? Either a mapping is created or a schema is learned from the source (ontology learning).
Automation: The degree to which the extraction is assisted or automated: manual, GUI, semi-automatic, automatic.
Bi-directional: If the result is edited, is it possible to update the source?

Examples

Entity Linking

  1. DBpedia Spotlight, OpenCalais and the Zemanta API analyze free text via Named Entity Recognition, then disambiguate candidates via name resolution and link the found entities to the DBpedia knowledge repository[2] (DBpedia Spotlight web demo).

President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.

As President Obama is linked to a DBpedia Linked Data resource, further information can be retrieved automatically, and a semantic reasoner can, for example, infer that the mentioned entity is of the type Person (using FOAF) and of the type President of the United States (using YAGO). Counterexamples: methods that only recognize entities or link to Wikipedia articles and other targets that do not provide further retrieval of structured data and formal knowledge.
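
For illustration, the following minimal sketch (Python, using the requests library) sends the example sentence to a DBpedia Spotlight-style annotation endpoint and prints the linked DBpedia resources. The endpoint URL, the confidence parameter and the response field names are assumptions about the publicly hosted service and may differ between deployments.

# Minimal entity-linking sketch against a DBpedia Spotlight-style REST endpoint.
# The endpoint URL and parameters are assumptions; adjust for your own deployment.
import requests

ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed public endpoint

text = ("President Obama called Wednesday on Congress to extend a tax break "
        "for students included in last year's economic stimulus package.")

response = requests.get(
    ENDPOINT,
    params={"text": text, "confidence": 0.4},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Each resource carries a DBpedia URI plus its surface form in the text,
# which downstream reasoners can use to retrieve further structured knowledge.
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])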

Relational Databases to RDF

  1. Triplify, D2R Server and Virtuoso RDF Views are tools that transform relational databases to RDF. During this process they allow existing vocabularies and ontologies to be reused in the conversion. When transforming a typical relational table named users, one column (e.g. name) or an aggregation of columns (e.g. first_name and last_name) has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity[3]. Then properties with formally defined semantics are used (and reused) to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation, and a column homepage can be converted to a property from the FOAF vocabulary called foaf:homepage, thus qualifying it as an inverse functional property. Then each entry of the user table can be made an instance of the class foaf:Person (ontology population). Additionally, domain knowledge (in the form of an ontology) could be created from the status_id, either by manually created rules (if status_id is 2, the entry belongs to the class Teacher) or by (semi-)automated methods (ontology learning). Here is an example transformation; a programmatic sketch of the same conversion follows the listing:
Name    marriedTo   homepage                          status_id
Peter   Marry       http://example.org/Peters_page    1
Claus   Eva         http://example.org/Claus_page     2
@prefix :     <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

:Peter :marriedTo :Marry .
:marriedTo a owl:SymmetricProperty .
:Peter foaf:homepage <http://example.org/Peters_page> .
:Peter a foaf:Person .
:Peter a :Student .
:Claus a :Teacher .
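
As a complementary sketch, the conversion of the users table above can be scripted with Python's rdflib. The example namespace, the in-memory rows and the status_id rule (1 maps to Student, 2 to Teacher) mirror the hypothetical table from this example and are not the output of any particular RDB2RDF tool.

# Sketch: convert rows of a hypothetical 'users' table into RDF with rdflib.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import FOAF, OWL, RDF

EX = Namespace("http://example.org/")  # assumed example namespace

# Hypothetical rows of the 'users' table from the example above.
users = [
    {"name": "Peter", "marriedTo": "Marry",
     "homepage": "http://example.org/Peters_page", "status_id": 1},
    {"name": "Claus", "marriedTo": "Eva",
     "homepage": "http://example.org/Claus_page", "status_id": 2},
]

# Manually created rule: status_id 1 -> Student, 2 -> Teacher (ontology population).
STATUS_CLASS = {1: EX.Student, 2: EX.Teacher}

g = Graph()
g.bind("foaf", FOAF)
g.add((EX.marriedTo, RDF.type, OWL.SymmetricProperty))

for row in users:
    subject = EX[row["name"]]                      # the key column provides the URI
    g.add((subject, RDF.type, FOAF.Person))
    g.add((subject, RDF.type, STATUS_CLASS[row["status_id"]]))
    g.add((subject, EX.marriedTo, EX[row["marriedTo"]]))
    g.add((subject, FOAF.homepage, URIRef(row["homepage"])))

print(g.serialize(format="turtle"))

Running the sketch produces Turtle equivalent to the listing above (modulo prefix declarations).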


Extraction from structured sources to RDF

Relational Databases

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data[4]. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain, and is closely related to it both in terms of methodology and terminology[5].

The most well-known branch of data mining is Knowledge Discovery in Databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further discovery.

Another promising application of knowledge discovery is in the area of software modernization, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity-relationship model is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery of existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous business value that is key for the evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as database schemas.
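
As a small illustration of mining metadata rather than individual data sets, the sketch below reads the schema of a SQLite database and prints simple table/column facts. The file name legacy_app.db is a placeholder, and the plain-text output is a simplification for illustration, not the KDM ontology.

# Minimal sketch: extract schema-level metadata (tables and columns) from a
# SQLite database as simple facts. 'legacy_app.db' is a hypothetical file.
import sqlite3

conn = sqlite3.connect("legacy_app.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    for _, column, col_type, _, _, is_pk in conn.execute(f"PRAGMA table_info({table})"):
        role = "primaryKey" if is_pk else "column"
        print(f"{table}.{column} : {col_type or 'UNKNOWN'} ({role})")

conn.close()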

Ontology Learning

See also

References

  1. ^ RDB2RDF Working Group, website: http://www.w3.org/2001/sw/rdb2rdf/, charter: http://www.w3.org/2009/08/rdb2rdf-charter, R2RML: RDB to RDF Mapping Language: http://www.w3.org/TR/r2rml/
  2. ^ "Life in the Linked Data Cloud". www.opencalais.com. http://www.opencalais.com/node/9501. Retrieved 2009-11-10. "Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format."
  3. ^ Tim Berners-Lee (1998), "Relational Databases on the Semantic Web", http://www.w3.org/DesignIssues/RDB-RDF.html
  4. ^ Frawley, W. et al. (1992), "Knowledge Discovery in Databases: An Overview", AI Magazine (Vol 13, No 3), 57-70 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1011)
  5. ^ Fayyad, U. et al. (1996), "From Data Mining to Knowledge Discovery in Databases", AI Magazine (Vol 17, No 3), 37-54 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1230)


