Introduction to DBpedia

Semantic Web

The Semantic Web enables machines to browse the knowledge distributed across the Web. Consider a Wikipedia page that provides knowledge to human readers. From a machine's perspective, that knowledge is opaque: the machine “sees” nothing but the presentation markup of the page. The idea of the Semantic Web was developed to let computers explore the knowledge on the Web (i.e., to make Web data machine-readable). Realising this goal requires a large amount of Web data in a standard format that Semantic Web tools can reach and manage. In essence, to construct such a Web of Data, the Semantic Web requires access not only to the data but also to the relationships among data points (as opposed to a large collection of isolated datasets). Such interrelated datasets on the Web are known as Linked Data, which connects and shares data across a wide range of applications through dereferenceable Uniform Resource Identifiers (URIs).

Figure 1 presents historical landmarks in the evolution of the Semantic Web into Linked Data. The main aim of Linked Data is to expand the Web by publishing datasets in RDF (Resource Description Framework) and by setting up links between data from different sources. In accordance with this aim, URIs are the fundamental units used to identify everything, and RDF is the fundamental linking structure: it uses URIs to name both the relationship between data points and the two ends of each relationship. In essence, RDF is the W3C (World Wide Web Consortium) standard for encoding the knowledge contained in resources on the World Wide Web (WWW).

Figure 1: Historical landmarks in the evolution of Semantic Web into Linked Data (Méndez & Greenberg 2012)

RDF (Resource Description Framework) Triple

The basic unit of RDF is the triple, which is composed of three items. The first item is the subject, which represents the resource being described. It is a URI reference that uniquely identifies that resource. The second item is the predicate, which represents the relationship expressed by the triple. Like the subject, the predicate is a URI. The third item is the object, which can be either a literal or a URI. It relates to the subject via the relationship specified by the predicate. Figure 2 depicts these three components of an RDF triple. Note that the subject and the object can also be blank nodes, i.e., nodes that have no URI.

Figure 2: Structure of an RDF triple
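
The structure described above can be sketched in a few lines of Python. This is only an illustration, not a standard API: the class names URI, Literal and Triple are hypothetical, and blank nodes are left out to keep the sketch short. The example triple uses the standard rdf:type predicate together with the subject and Person URIs that appear in the RDF graph example of this post.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    """A resource identifier (hypothetical helper type)."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain data value such as a name or a title (hypothetical helper type)."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI               # the resource being described
    predicate: URI             # the relationship, always a URI
    obj: Union[URI, Literal]   # either another resource or a literal value

    def __post_init__(self):
        # Enforce the rules from the text: subject and predicate are URIs,
        # the object is a URI or a literal (blank nodes omitted here).
        if not isinstance(self.subject, URI):
            raise TypeError("subject must be a URI")
        if not isinstance(self.predicate, URI):
            raise TypeError("predicate must be a URI")
        if not isinstance(self.obj, (URI, Literal)):
            raise TypeError("object must be a URI or a Literal")


example = Triple(
    subject=URI("http://www.w3.org/People/EM/contact#me"),
    predicate=URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
    obj=URI("http://www.w3.org/2000/10/swap/pim/contact#Person"),
)
print(example)
```

Keeping URI and Literal as separate types makes the "literal or URI" choice for the object explicit, which matters later when a triple is written out in a concrete syntax.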

RDF Graph

A collection of RDF triples forms an RDF graph. For instance, Figure 3 shows a simplified example of an RDF graph. In this graph, the subject, predicates and objects are as summarised below.

Figure 3: Simplified example of an RDF graph

The subject of this RDF graph is the resource specified by http://www.w3.org/People/EM/contact#me, which is a URI.

The predicates of this RDF graph are:

The objects of this RDF graph are:

  • Eric Miller, which is a literal that describes the name of the person identified by the subject.
  • mailto:em@w3.org, which is a URI that describes the email address of the person identified by the subject.
  • Dr., which is a literal that describes the title of the person identified by the subject.
  • http://www.w3.org/2000/10/swap/pim/contact#Person, which is a URI that describes the type of the subject as a Person.

According to the standards, the RDF triple can be written as (<subject>, <predicate>, <object>). Therefore, the example RDF graph illustrated in Figure 3 can be denoted in N-triples format, as depicted in Figure 4.

Figure 4: N-triples format of RDF graph
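
As a rough sketch of how such a serialisation can be produced, the Python snippet below writes the example graph in N-Triples style: one triple per line, URIs in angle brackets, literals in double quotes, and a closing full stop. The subject, the four objects and the contact namespace come from the description above. The predicate names fullName, mailbox and personalTitle are an assumption based on the well-known W3C RDF Primer example, and the helper functions escape_literal and to_ntriples are hypothetical names introduced for this illustration. Language tags and datatypes are not handled.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
ME = "http://www.w3.org/People/EM/contact#me"

# Each entry: (subject, predicate, object, object_is_literal).
# Predicate local names are assumed from the W3C RDF Primer example.
graph = [
    (ME, CONTACT + "fullName", "Eric Miller", True),
    (ME, CONTACT + "mailbox", "mailto:em@w3.org", False),
    (ME, CONTACT + "personalTitle", "Dr.", True),
    (ME, RDF_TYPE, CONTACT + "Person", False),
]


def escape_literal(text):
    """Escape the characters that may not appear raw inside a quoted literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples):
    """Serialise the triples, one '<s> <p> o .' statement per line."""
    lines = []
    for subject, predicate, obj, is_literal in triples:
        # Literals are quoted; URIs are wrapped in angle brackets.
        term = f'"{escape_literal(obj)}"' if is_literal else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {term} .")
    return "\n".join(lines)


print(to_ntriples(graph))
```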

Linked Open Data

Linked Open Data is a powerful mixture of open data and linked data. The idea of combining the two took shape in 2007 through the Linking Open Data project. Since then, the W3C community and the Semantic Web research community have put tremendous effort into expanding the Linked Open Data (LOD) cloud. Figure 5 shows the growth in the number of datasets published on the Web since the inception of the Linking Open Data project. Currently, the LOD cloud contains over 1200 datasets, as depicted in Figure 6. Each node in Figures 5 and 6 represents a distinct open dataset published as linked data. An edge indicates that there are links between items in the two datasets it connects.

Among the datasets published in the LOD cloud, DBpedia has been at its heart since the establishment of the Linking Open Data project and is considered to be the core cross-domain knowledge base. It is the Linked Data version of Wikipedia, and it has been interlinked with numerous other knowledge resources since the initiation of the LOD cloud (see Figures 5 and 6).

Figure 5: Evolution of LOD cloud over the years
Figure 6: LOD cloud as at May 2020 (note that the colours represent the topical domain that each dataset represents). Source: https://lod-cloud.net/

DBpedia Knowledge Base

DBpedia is a giant among current cross-domain knowledge bases, and serves as a hub in the Web of Linked Data. It is also considered to be one of the main factors behind the success of the Linking Open Data project (discussed in the Linked Open Data section). DBpedia was initiated in 2007 through the collaboration of the Free University of Berlin and the University of Leipzig. Its main aim was to build a large-scale, cross-domain and cross-lingual knowledge base by extracting the structured content in Wikipedia. Wikipedia is the most widely used encyclopedia, a globally popular and heavily visited website, a central knowledge source of humankind, and one of the finest examples of collaboratively created content.

While most existing knowledge bases cover only a specific domain, DBpedia spans multiple domains and languages by connecting isolated topical islands into one interconnected knowledge space. Moreover, most of these existing knowledge bases are created by a small number of knowledge engineers; thus, it is highly cost-intensive to keep their information up-to-date, as domains change with time. DBpedia addresses this issue through its open community vision.

Due to its cross-domain and cross-lingual nature, DBpedia has been widely used in numerous applications, algorithms and tools, including data integration, document ranking, topic detection and named entity recognition. In essence, DBpedia provides a rich platform for exploring the gigantic knowledge source in Wikipedia, and the other datasets linked to it, through sophisticated queries written in RDF query languages such as SPARQL.
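
To give a flavour of what such a query looks like, here is a minimal sketch in Python that uses only the standard library. It assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and asks for the English label of one resource. The choice of the Berlin resource and the helper names build_url, run_query and extract_values are illustrative assumptions, not part of DBpedia itself.

```python
import json
import urllib.parse
import urllib.request

# Assumed: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: the English label of the DBpedia resource for Berlin.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 1
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as an HTTP GET request asking for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


def run_query(query, timeout=30):
    """Send the query to the endpoint and parse the JSON response."""
    with urllib.request.urlopen(build_url(query), timeout=timeout) as response:
        return json.load(response)


def extract_values(results, variable):
    """Collect the values bound to one variable in SPARQL JSON results."""
    bindings = results["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]
```

With network access, calling extract_values(run_query(QUERY), "label") should return the matching labels. The request is kept in a separate function so that the URL construction and the result handling can be tried out offline.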

Méndez, E. and Greenberg, J., 2012. Linked data for open vocabularies and HIVE’s global framework. El profesional de la información, 21(3), pp.236-244.
