What is Linked Data?

Linked Data is a way of publishing data on the Web. It builds upon standard Web technologies—HTTP, URIs, RDF, and SPARQL. But rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read by computers and that can express relationships between pieces of data published by different organizations.

If you are not familiar with Linked Data, Census 2011 Results are also available in alternative formats and can be explored interactively from the Census website.

URIs: Standard Identifiers for Areas, Observations, Classifications, and Everything Else

A core idea in Linked Data is that everything of interest is named with a globally unique identifier, called a URI (Uniform Resource Identifier). URIs are an important Web standard and are closely related to URLs, the familiar addresses of web pages.

Here are some of the things mentioned in the Census 2011 data:

Dublin City, an administrative county, as defined at the time of the 2011 Census:

The Census dataset recording the number of households with Internet access per county:

The concept of a household having broadband internet:

The observation recording the number of households in the Census within Dublin City having broadband internet:

In Linked Data, whenever the data refers to a thing, it does so by giving that thing's URI. This makes clear exactly which thing is meant. It also helps when integrating data from multiple organizations, because they can establish unambiguously whether they mean the same thing.

Vocabularies: Standards for Data

Some of the most important things to be named with URIs are classes and properties. Classes are the kinds of things that are described in the data. Properties are the kinds of relationships between things that are expressed in the data. For example, some common classes in statistical data are datasets (tables), observations (numbers in the tables), and classifications (a breakdown of a population into groups, e.g., by age or occupation). Common properties are reference area (relates an observation to an area) and preferred label (gives the name of some thing).

Various organisations have created standardised collections of classes and properties, called vocabularies. We re-use classes and properties from such standard vocabularies where possible, and define our own only when no suitable existing ones are available.

For example, we use classes and properties from the following vocabularies:

  • SKOS: Define classifications (a.k.a. concept schemes) for areas, age groups, gender, occupations, and so on
  • Data Cube: Statistical data and metadata (observations, data sets, data structure definitions)
  • VoID: Describe Linked Data services

Querying Linked Data with SPARQL

SPARQL is the query language for Linked Data. It is similar to the popular SQL query language, but supports URIs and can query other data structures besides just relational tables—in particular, graphs.

Query Form. You can try out SPARQL queries and explore the Census datasets using this SPARQL query form. The page also contains a number of example queries and links to SPARQL tutorials.

SPARQL Endpoint. The Census Linked Data service includes a SPARQL endpoint. This is an API that allows applications to submit SPARQL queries programmatically and retrieve results in JSON, XML, or a number of other formats. It is a simple HTTP API that follows the SPARQL Protocol standard. The API endpoint is: http://data.cso.ie/sparql
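
As a rough illustration of calling the endpoint from an application, the sketch below builds a SPARQL Protocol GET request using only the Python standard library. The `query` URL parameter and the `application/sparql-results+json` media type come from the SPARQL standards; the function names and the example query are illustrative assumptions, and the code has not been checked against the live service.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://data.cso.ie/sparql"  # endpoint URL as given in the text

# Illustrative query (an assumption, not taken from the service's examples):
# fetch ten arbitrary triples.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query and return the list of result bindings.

    This performs a network call; it is defined here but not invoked.
    """
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(EXAMPLE_QUERY)` would return one dictionary per result row, keyed by variable name, following the SPARQL JSON results format.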

Data Downloads

Instead of using SPARQL, data users and application developers may also download the data for local processing (e.g., to convert it to other formats or to load it into their own SPARQL database).

Downloads are offered in N-Triples format.

Selected datasets are also available in TSV format, although this doesn’t qualify as Linked Data.

What are N-Triples and Turtle?

N-Triples is a format for storing Linked Data in files, to allow exchange of large datasets. The format is very simple (although this does not necessarily mean that it can be easily read or written without tool support):

Each line in the file is a “triple”—an atomic fact. A triple consists of three parts: subject, predicate, and object.

The subject is the URI of the thing being described by the atomic fact. The predicate indicates the property of the thing that is given by the atomic fact. The object can be one of two things, depending on the predicate.
  1. Either the triple expresses a relationship between two things. In that case, the object is the URI of the second thing.
  2. Or the triple expresses an attribute of the thing. In that case, the object is a literal, such as a string, date, or number. Literals typically have a datatype. As usual in Linked Data, the datatype is identified by a URI. Literals also may have a language tag, indicating the natural language of a string.
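
To make the subject/predicate/object structure concrete, here is a minimal sketch of a parser for a single N-Triples line. It is deliberately simplified: it ignores blank nodes, comments, and most escape sequences, so a real RDF library is the better choice for actual data. The `example.org` URIs are hypothetical placeholders, not identifiers from the Census data.

```python
import re

# Simplified N-Triples line pattern: URI subject, URI predicate, and an object
# that is either a URI or a literal with an optional datatype or language tag.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_uri>[^>]*)>'
    r'|"(?P<o_lit>(?:[^"\\]|\\.)*)"'
    r'(?:\^\^<(?P<dtype>[^>]*)>|@(?P<lang>[A-Za-z0-9-]+))?)'
    r'\s*\.\s*$'
)


def parse_line(line):
    """Split one N-Triples line into (subject, predicate, object)."""
    m = TRIPLE.match(line)
    if m is None:
        raise ValueError("not a triple this simplified parser understands")
    if m["o_uri"] is not None:
        # Case 1: a relationship between two things.
        obj = {"kind": "uri", "value": m["o_uri"]}
    else:
        # Case 2: an attribute given as a literal.
        obj = {"kind": "literal", "value": m["o_lit"],
               "datatype": m["dtype"], "lang": m["lang"]}
    return m["s"], m["p"], obj


# Hypothetical example lines, one per kind of object.
relation = ('<http://example.org/area/A> <http://example.org/prop/partOf> '
            '<http://example.org/area/B> .')
label = ('<http://example.org/area/A> '
         '<http://www.w3.org/2004/02/skos/core#prefLabel> "Area A"@en .')
number = ('<http://example.org/obs/1> <http://example.org/prop/count> '
          '"1234"^^<http://www.w3.org/2001/XMLSchema#integer> .')
```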

Turtle is a related format for exchanging Linked Data that is easier to read and write without special tools. It shares most of its syntax with SPARQL. Compared to N-Triples, it introduces a number of syntactic conveniences:

  • URIs can be abbreviated by using relative URIs and a base URI.
  • URIs can be abbreviated by declaring prefix mappings.
  • Subjects and predicates that are repeated in consecutive triples can be omitted (by using the ; and , punctuation to end the previous triple).
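
The conveniences listed above are pure abbreviation: every Turtle document corresponds to a set of fully spelled-out triples. The sketch below shows that correspondence in Python by expanding prefixed names and repeating the subject and predicate for every object. The `rdf`, `skos`, and `qb` namespace URIs are the standard ones; the `ex` prefix and the resources under it are hypothetical, and the helper names are illustrative.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "qb": "http://purl.org/linked-data/cube#",
    "ex": "http://example.org/",  # hypothetical namespace for this example
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'skos:prefLabel' to a full URI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


def to_ntriples(subject, description, prefixes=PREFIXES):
    """Write out one full N-Triples line per (predicate, object) pair."""
    lines = []
    for predicate, objects in description.items():
        for obj in objects:
            lines.append(
                f"<{expand(subject, prefixes)}> "
                f"<{expand(predicate, prefixes)}> "
                f"<{expand(obj, prefixes)}> ."
            )
    return lines


# Equivalent Turtle, using ';' to repeat the subject and ',' to repeat
# the subject and predicate:
#   ex:scheme rdf:type skos:ConceptScheme ;
#             skos:hasTopConcept ex:a , ex:b .
triples = to_ntriples(
    "ex:scheme",
    {"rdf:type": ["skos:ConceptScheme"], "skos:hasTopConcept": ["ex:a", "ex:b"]},
)
```

The two lines of Turtle in the comment become three N-Triples lines, each repeating the full subject URI.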

What is the Data Cube Vocabulary?

The Data Cube Vocabulary is a W3C standard vocabulary for representing statistical data.

The data is organised into a number of “data cubes”, also known as qb:DataSets.

Each dataset can be thought of as a data table. The rows are areas (e.g., the counties), and the columns are a breakdown of the population by some statistical variable (e.g., age group, size of family, number of rooms in the dwelling, or ability to speak Irish). Each cell of the table is called an observation and contains the number of population units (people, households) that fall into the given bracket within the area.

According to the Data Cube standard, each dataset also contains a Data Structure Definition that defines the overall structure of the dataset and links to the relevant classifications (row/column headings).

If a dataset can be thought of as a “table” with numbers, then the “row headings” and “column headings” are classifications, expressed as skos:ConceptSchemes.
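
The table analogy can be sketched in a few lines of Python: each observation carries an area, a category from the thematic classification, and a value, and pivoting the observations recovers the rows and columns. The area names, category names, and numbers below are invented placeholders, not Census figures, and the dictionary layout is an assumption made for illustration only.

```python
from collections import defaultdict

# Invented placeholder observations: one per (area, category) cell.
observations = [
    {"area": "Area A", "category": "Broadband", "value": 10},
    {"area": "Area A", "category": "No internet", "value": 4},
    {"area": "Area B", "category": "Broadband", "value": 7},
    {"area": "Area B", "category": "No internet", "value": 9},
]


def pivot(observations):
    """Arrange observations as table[row heading][column heading] = value."""
    table = defaultdict(dict)
    for obs in observations:
        table[obs["area"]][obs["category"]] = obs["value"]
    return dict(table)


table = pivot(observations)
row_headings = sorted(table)  # the area classification
column_headings = sorted({o["category"] for o in observations})  # thematic one
```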

[Figure: Data Cube Vocabulary overview]

What are Named Graphs and how are they used?

Named Graphs are used in Linked Data to subdivide complex collections of data into more manageable pieces. The Census 2011 dataset is broken down into several hundred individual named graphs. They play two roles in the data.cso.ie service:

  1. Each named graph can be downloaded as a separate N-Triples file from the downloads section.
  2. SPARQL queries submitted through the query form or SPARQL endpoint can be restricted to only one or more named graphs to make the query more efficient.
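
As a sketch of the second role, the helper below wraps a triple pattern in a SPARQL `GRAPH` clause so that it is matched against a single named graph only. `GRAPH` is standard SPARQL syntax; the graph URI is a hypothetical placeholder (actual graph names would have to be looked up in the service), and the function name is illustrative.

```python
def restrict_to_graph(pattern, graph_uri, limit=10):
    """Build a SELECT query that matches `pattern` inside one named graph."""
    return (
        "SELECT * WHERE {\n"
        f"  GRAPH <{graph_uri}> {{\n"
        f"    {pattern}\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Hypothetical graph URI; the class URI is qb:Observation written out in full.
query = restrict_to_graph(
    "?obs a <http://purl.org/linked-data/cube#Observation> .",
    "http://example.org/graph/some-dataset",
    limit=5,
)
```

The resulting string can be pasted into the query form or submitted to the SPARQL endpoint.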

A full Census table is made up of the following named graphs:

  1. The qb:DataSet containing observations
  2. The qb:DataStructureDefinition containing a definition of the data cube
  3. One skos:ConceptScheme per dimension, containing the classification used. Depending on the dataset, there are two or three dimensions: area, plus one or two thematic dimensions.