Link data and metadata
Linking data and metadata is pivotal for interoperability. Linking connects data with its associated metadata, and it can also be used to refer to external standards for data and metadata alike. How to do it? In general, the procedure uses concepts from linked data. We suggest linking data and metadata through csv on the web, abbreviated csvw. This page collects examples of improving the FAIRness of datasets using the csv extension csvw. CSV on the web offers the possibility to tie together metadata and data, starting from a well-known and widely used data format. The standard offers a rich framework to annotate existing csv documents with additional information and to transform them into other structured data exchange formats such as JSON(-LD) and RDF. At the same time, csv on the web is user-friendly, offering a flexible mechanism that ranges from minimal FAIR extensions to elaborate context building for the data to be shared. CSV on the web is a W3C recommendation. See CSV on the Web: An Introduction by Steven Firth and CSV on the Web by csvw.org for more information.
Further reading in this regard:
- Zinke-Wehlmann, J. et al., Linked Data and Metadata, in Big Data in Bioeconomy, pp. 79–90
- Chapter 3: Providing Linked Data
Contents
Creating a CSV on the web (CSVW) for energy datasets
The FAIRness of datasets can be improved using the csv extension csvw. CSV on the web connects metadata and data, starting from a well-known and widely used data format, so the barrier to changing practices is low. The standard offers a rich framework to annotate existing csv documents with additional information and to transform them into other structured data exchange formats such as JSON(-LD) and RDF. CSV on the web is a W3C recommendation. The EERAdata wiki provides step-by-step instructions.
Tools for transfer to RDF, validation, and ontology documentation
FAIRification involves transferring data files (e.g., in csv) into formats that allow linking and annotation (e.g., RDF). The first two tools listed below are semi-automatic tools supporting this transfer process. Validators can be used to check the consistency of the resulting RDF documents. SHACL shapes can be used to check compliance with data standards (e.g., formats of dates). The three final tools can be used for documenting ontologies that the data suppliers have created themselves.
- tarql is a command-line tool for converting csv files into RDF using SPARQL queries. A blog entry by Bob DuCharme covers the topic.
- YARRRML is a human-readable notation for Linked Data generation rules, formalized as a subset of YAML. As such, it can also be used to transfer csv files into RDF. A tutorial is available.
- csvcubed provides a Python-based command-line tool that makes it straightforward to turn csv files into csvw files.
- Validators can be used to test the syntactical correctness of the RDF code produced in the translation process. A short list of some tools: RDF validator, SHACL validator, csv lint and corresponding example schema, IDLab Turtle Validator, Structured Data linter, Validate RDF Data by SURROUND AUSTRALIA
In many cases, it is necessary to come up with a small specialized ontology of terms that are not covered by any of the existing ontologies. Documentation of these ontologies is supported by the following tools:
Worked example: citizen-led initiatives
We start with a csv file whose contents are shown in this table. The data is taken from the ENBP database on citizen-led initiatives in Europe. The table contains information on
- the name of the initiative,
- its legal status,
- its year of foundation,
- its national identifier,
- its street address,
- the city it is located in,
- the corresponding postal code,
- optional C/O information,
- the latitude (lat) and the longitude (lon) of its location,
- the website of the initiative,
- some information on its activities,
- a national industrial sector classification,
- a purpose statement in the original language and the same purpose statement translated to English,
- the date of removal, if applicable,
- the country code,
- and its legal form.
To relate metadata information to this information in the csv file, we create a second file containing this metadata. The file format for this metadata information file is JSON. Let us assume that the csv file itself has the filename "SWE_initiatives_sample.csv". According to the csv on the web standard, the metadata file should have the filename "SWE_initiatives_sample.csv-metadata.json". A minimal form of the metadata file contains the following information:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "SWE_initiatives_sample.csv"
    }
The @context information links to the language conventions of the csvw standard; the url information states the filename of the csv file. This minimal file can be extended to contain more specific metadata. All entries are encoded in the form of property specifications and corresponding values.
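The naming convention and the minimal file can be sketched in Python. This is only an illustration using the standard library: the helper names `metadata_path_for` and `write_minimal_metadata` are made up for this page and are not part of any csvw tool.

```python
import json
import tempfile
from pathlib import Path


def metadata_path_for(csv_path: Path) -> Path:
    """Return '<csv file name>-metadata.json' next to the csv file."""
    return csv_path.with_name(csv_path.name + "-metadata.json")


def write_minimal_metadata(csv_path: Path) -> Path:
    """Write the minimal metadata document (hypothetical helper)."""
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_path.name,
    }
    target = metadata_path_for(csv_path)
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # A temporary directory keeps the sketch free of side effects.
    with tempfile.TemporaryDirectory() as tmp:
        out = write_minimal_metadata(Path(tmp) / "SWE_initiatives_sample.csv")
        print(out.name)
        print(out.read_text(encoding="utf-8"))
```

Writing the file with `json.dumps` rather than by hand avoids missing commas and unbalanced quotes, which are easy to introduce once the metadata grows.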
General information about the csv file
In a first step, we include general information about the csv file. We start with a code snippet for specifics such as title, description, and creator:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "SWE_initiatives_sample.csv",
      "dc:title": "Example - list of citizen-led initiatives in Sweden",
      "dc:description": "List of citizen-led initiatives in Sweden, example dataset to be used for illustrating the use of csv on the web",
      "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": {
          "email": "augustw@hvl.no"
        }
      }
    }
As in the example above, property specifications can be terms from popular metadata vocabularies, e.g., Dublin Core, schema.org, or DCAT. All of these vocabularies can be used independently or together. In the above example, metadata terms from the Dublin Core vocabulary are mixed with terms from schema.org. The title of the csv file and its description are stated using Dublin Core terms. The information inside the dc:creator term is in turn specified using the schema.org vocabulary. It describes the creator in more detail, giving a human-readable name, a url of the creator (here: his ORCID iD), and contact point details such as the email address. The contact point information can be extended with a telephone or a fax number. We continue with a more extensive list of details about the file as a whole:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "SWE_initiatives_sample.csv",
      "dc:title": "Example - list of citizen-led initiatives in Sweden",
      "dc:description": "List of citizen-led initiatives in Sweden, example dataset to be used for illustrating the use of csv on the web",
      "dc:date": "2022-10-07",
      "dc:format": "text/csv",
      "dc:language": "en-US",
      "dc:publisher": {
        "schema:name": "EERAdata project",
        "schema:url": "https://cordis.europa.eu/project/id/883823",
        "schema:contactPoint": {
          "email": "info@eeradata.eu",
          "url": "https://www.eeradata.eu"
        }
      },
      "dc:rights": "https://creativecommons.org/licenses/by-sa/4.0/",
      "dc:subject": "Energy communities, Sweden, Community energy, Energy cooperatives, Renewable Energy",
      "dc:source": {
        "schema:name": "ENBP Inventory \"Energy by people\" - First Europe-wide inventory on energy communities",
        "schema:url": "https://doi.org/10.18710/2CPQHQ"
      },
      "dc:type": "dataset",
      "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": {
          "email": "augustw@hvl.no"
        }
      },
      "dc:coverage": "Sweden",
      "dc:identifier": "https://eeradata-platform.eu/"
    }
The date follows ISO 8601. The language is specified following RFC 4646. For the media type, RFC 7111 has been used as a specification. The type is taken from the DCMI type vocabulary. The dc:publisher information has several details which are grouped into one object by curly brackets: the EERAdata project as the name of the publisher, the corresponding CORDIS entry as a persistent identifier, and contact information in the form of an email address and a website. The entry for dc:rights contains the license information and points to a website provided by the Creative Commons organization. It states that the csv file is licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). As such, anybody is free to share and adapt the file, provided that credit is given and adaptations are shared under the same license. The dc:subject contains a list of keywords describing the contents of the csv file in more detail. The subject information should be more extensive in a real example; here, only a basic example is given. The dc:type information declares the csv file as a dataset according to the possible types listed by the DCMI type element working draft. The dc:identifier holds, as DCMI describes it, an unambiguous reference to the resource within a given context. Ideally, the resource is the final FAIRified object. Thus, it applies to the json file created out of the original csv file and its json metadata document. The best practice is to assign a persistent identifier.
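Hand-written metadata files easily end up with a missing comma or quotation mark. The following Python sketch shows one way to catch such slips before using the file. The function name `check_general_metadata` and the list of recommended keys are assumptions made for this illustration; the csvw standard does not prescribe them.

```python
import json
from datetime import date
from typing import List

# Keys used by the minimal file in this tutorial.
REQUIRED_KEYS = ("@context", "url")
# Assumption for this sketch only: keys we would like to see in every file.
RECOMMENDED_KEYS = ("dc:title", "dc:description", "dc:creator", "dc:rights")


def check_general_metadata(text: str) -> List[str]:
    """Return a list of problems found in a metadata document."""
    try:
        meta = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno}, column {err.colno})"]

    problems = []
    for key in REQUIRED_KEYS + RECOMMENDED_KEYS:
        if key not in meta:
            problems.append(f"missing property: {key}")

    # dc:date is expected as an ISO 8601 calendar date such as 2022-10-07.
    if "dc:date" in meta:
        try:
            date.fromisoformat(meta["dc:date"])
        except (TypeError, ValueError):
            problems.append("dc:date is not an ISO 8601 calendar date")
    return problems
```

An empty list means that the file parses as JSON and contains the listed keys. It does not mean that the file is valid csvw; for that, a dedicated validator is needed.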
Now, how does this contribute to making the original csv file FAIR?
Specifying information about table headers
This section describes how to further specify the entries in the various columns of the csv file. Note that the csv on the web standard allows connecting the json metadata file to several csv files which share a certain layout. For our purposes here, we focus on a single table, the one illustrated above with the information on Swedish energy cooperatives. Before giving a full-fledged description of all the columns, we start with the first four columns from the left, specifying the name of the initiative, its legal status, the year of foundation, and a national identifier. We start with a simple set of specifications before assigning more information to the columns. For more information, please also see the primer as well as the recommendation itself.

    "tableSchema": {
      "columns": [{
        "titles": "name",
        "dc:description": "Name of the initiative",
        "datatype": "string",
        "required": true
      }, {
        "titles": "status",
        "dc:description": "Legal status",
        "datatype": {
          "base": "string",
          "format": "active|inactive|liquidation"
        },
        "required": true
      }, {
        "titles": "year of foundation",
        "dc:description": "Year of foundation of the initiative",
        "datatype": "date"
      }, {
        "titles": "national identifier",
        "propertyURL": "https://www.wikidata.org/wiki/Property:P6460",
        "datatype": {
          "dc:title": "National identifier for Sweden",
          "dc:description": "National identifier for Sweden",
          "base": "string",
          "format": "\\d{6}-\\d{4}"
        }
      }]
    }
The general property for specific table attributes is tableSchema. Details on the columns are specified by columns. Per column, a title, a description, and details about the datatype are fixed. For example, the first column has the title name, and the dc:description entry gives further information on what name actually means. The datatype for all entries in the first column is "string". For further pre-defined datatypes, see the Metadata Vocabulary for Tabular Data. Specifying true for required leads to an error message if the corresponding entry in the csv file is empty. In the entry for the second column, the format property lists the allowed values. If there is any entry other than active, inactive, or liquidation, an error will be reported. The entries of the third column have the datatype date, so entries must comply with the ISO 8601 format YYYY-MM-DD. Finally, the national identifier for organizations in Sweden is listed in the fourth column. It consists of 6 digits, followed by a dash, followed by another 4 digits. The format statement allows specifying patterns of this type with the help of regular expressions, as shown in the example. In addition, the propertyURL ties each entry to the wikidata entry P6460 and in that way defines that all entries are Swedish organizational numbers.
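To make the effect of these constraints concrete, the checks for the first four columns can be imitated in a few lines of Python. This is a rough sketch of what a csvw validator does with required, format, and datatype, not a replacement for one. The function `check_row` is a made-up helper, and any rows passed to it in examples are invented rather than taken from the ENBP data.

```python
import re
from datetime import date
from typing import Dict, List

# Patterns copied from the "format" entries of the column descriptions.
STATUS_PATTERN = re.compile(r"active|inactive|liquidation")
NATIONAL_ID_PATTERN = re.compile(r"\d{6}-\d{4}")


def check_row(row: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in one csv row."""
    problems = []

    # "required": true means the cell must not be empty.
    if not row.get("name", "").strip():
        problems.append("name: required value is missing")

    # The whole cell has to match the pattern, hence fullmatch().
    status = row.get("status", "").strip()
    if not status:
        problems.append("status: required value is missing")
    elif not STATUS_PATTERN.fullmatch(status):
        problems.append(f"status: unexpected value {status!r}")

    # "datatype": "date" expects an ISO 8601 calendar date.
    founded = row.get("year of foundation", "").strip()
    if founded:
        try:
            date.fromisoformat(founded)
        except ValueError:
            problems.append(f"year of foundation: {founded!r} is not a valid date")

    national_id = row.get("national identifier", "").strip()
    if national_id and not NATIONAL_ID_PATTERN.fullmatch(national_id):
        problems.append(f"national identifier: {national_id!r} does not match the pattern")

    return problems
```

Rows can be fed to `check_row` directly from `csv.DictReader`, since the dictionary keys correspond to the column titles used in the metadata.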
The next three columns contain address information and relate to the schema.org vocabulary to specify a street address, the name of the municipality, and the postal code of the initiative. The corresponding entries for the metadata file are

    {
      "titles": "Street address",
      "dc:description": "Street address of the initiative",
      "datatype": "string",
      "propertyURL": "schema:streetAddress"
    }, {
      "titles": "city",
      "dc:description": "Municipality where the initiative is located",
      "datatype": "string",
      "propertyURL": "schema:addressLocality"
    }, {
      "titles": "postal code",
      "dc:description": "Postal code of the location of the initiative",
      "datatype": "string",
      "propertyURL": "schema:postalCode"
    }
Note that for the case of Sweden, the format property can further be used to define allowed patterns for street addresses and postal codes. The location of the headquarters of the initiative is also reported in terms of geo-coordinates in the csv file. The column entitled lat contains latitudes, while the column entitled lon holds longitudes. Here, schema.org also provides a possibility to link to standards:

    {
      "titles": "lat",
      "dc:description": "geo location of headquarter of initiative, latitude, WGS84",
      "datatype": {
        "base": "number",
        "minimum": "-90",
        "maximum": "90"
      },
      "propertyURL": "schema:latitude"
    }, {
      "titles": "lon",
      "dc:description": "geo location of headquarter of initiative, longitude, WGS84",
      "datatype": {
        "base": "number",
        "minimum": "-180",
        "maximum": "180"
      },
      "propertyURL": "schema:longitude"
    }
As can be seen from the example, CSVW allows restricting values to a range of possibilities. Latitudes range between -90 and 90, longitudes between -180 and 180. Using the schema.org definition makes it implicitly clear that the WGS84 standard is used to describe geo locations.
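The minimum and maximum restrictions amount to a simple range check on each cell. A minimal Python sketch, with the made-up helper `in_range` and invented coordinate values, could look like this:

```python
import math


def in_range(value: str, minimum: float, maximum: float) -> bool:
    """Check that a csv cell holds a finite number within [minimum, maximum]."""
    try:
        number = float(value)
    except ValueError:
        return False  # not a number at all
    return math.isfinite(number) and minimum <= number <= maximum


def check_coordinates(lat: str, lon: str) -> bool:
    """Apply the limits from the metadata: lat in [-90, 90], lon in [-180, 180]."""
    return in_range(lat, -90, 90) and in_range(lon, -180, 180)


if __name__ == "__main__":
    print(check_coordinates("59.33", "18.07"))
    print(check_coordinates("18.07", "159.33"))
    print(check_coordinates("159.33", "18.07"))
```

A value such as 159.33 is a valid longitude but not a valid latitude, so a range check of this kind catches swapped columns in many, though not all, cases.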
The next column contains a link to the web presence of the initiative. A minimal way to specify this would again be with the help of schema.org:

    {
      "titles": "website",
      "dc:description": "Link to the web presence of the initiative",
      "propertyURL": "schema:url"
    }
The column entitled activity contains information about the activity of the initiative. In general, the activities of citizen-led initiatives can be quite broad, ranging from electricity and heat generation by different means to distribution activities and energy efficiency measures. Again, the task is to find a resource on the web that allows expressing that all entries in this column are activities. The makesOffer property provided by schema.org is one possibility to state this. According to its definition, makesOffer describes 'A pointer to products or services offered by the organization or person.' The specification of the column reads

    {
      "titles": "activity",
      "dc:description": "Describes activities by citizen-led energy initiatives",
      "propertyURL": "schema:makesOffer"
    }
Note that it is suggested here to use a controlled vocabulary from which the different activities are sourced. More details will be discussed elsewhere.
The next column contains the national industrial sector classification. It provides information about the type of activities the initiative is engaged in, based on a classification of economic activities published by Statistics Sweden. Note that this information overlaps to some extent with the information offered in the activity column. However, a domain-specific controlled vocabulary can usually express far more detail than a general classification scheme covering the whole national industry sector. On the other hand, initiatives may engage in activities which are captured in the general scheme but are not contained in a domain-specific vocabulary. The Swedish Standard Industrial Classification, SNI 2007, is based on the EU’s recommended standard, NACE Rev. 2. It allows, however, for more detailed specifications. The official codes for activity groups consist of two digits separated by a dot from three digits. The example here is code 35.110, which encodes 'Production of electricity'. Similar to the example discussed above, the format of the entry can be specified with the format statement

    {
      "titles": "national industrial sector classification",
      "datatype": {
        "@type": "https://www.wikidata.org/wiki/Q2976602",
        "dc:title": "Swedish Standard Industrial Classification",
        "dc:description": "Swedish Standard Industrial Classification for activities by the citizen-led initiative",
        "base": "string",
        "format": "\\d{2}\\.\\d{3}"
      }
    }
While wikidata offers resources for the NACE classification codes and the Belgian classification code, no resource is available for the Swedish case. As a minimum, wikidata offers a resource for economic classification schemes in general, namely wikidata:Q2976602 or wikidata:Q27048688. For describing that all values in a column are of a particular type, csv on the web offers the statement @type. An alternative is offered here by DBpedia, which has a resource dbr:International_Standard_Industrial_Classification providing what is given in the column of the csv file. Thus, the type can be specified as "@type": "dbr:International_Standard_Industrial_Classification".
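The format pattern for the classification codes can be tried out before it goes into the metadata file. The Python sketch below also shows a detail that is easy to miss: inside a JSON file, every backslash of the regular expression has to be doubled. The helper `is_sni_code` and the reduced column description are illustrative assumptions.

```python
import json
import re

# The regular expression as a regex engine sees it:
# two digits, a literal dot, three digits.
SNI_FORMAT = r"\d{2}\.\d{3}"


def is_sni_code(value: str) -> bool:
    """True if the whole value looks like an SNI activity code, e.g. 35.110."""
    return re.fullmatch(SNI_FORMAT, value) is not None


# Reduced column description, just enough to show the serialization.
column = {
    "titles": "national industrial sector classification",
    "datatype": {"base": "string", "format": SNI_FORMAT},
}

# json.dumps doubles the backslashes so that the file stays valid JSON.
serialized = json.dumps(column, indent=2)

if __name__ == "__main__":
    print(is_sni_code("35.110"))
    print(serialized)
```

Reading the serialized text back with `json.loads` yields the original pattern with single backslashes, which is what a csvw processor works with.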
The next two columns describe the purpose statement of the initiative, both in the native language, i.e. Swedish, and in English. Following Example 96 from the Primer, this is specified with the lang property. According to ISO 639-1, the corresponding value for Swedish is sv. The fact that the entry itself is a corporate purpose statement can be expressed using the wikidata resource Q2498417.

    {
      "titles": "purpose (original language)",
      "@type": "wikidata:Q2498417",
      "dc:description": "Purpose statement of the initiative in Swedish",
      "lang": "sv"
    }
Additional specifications in terms of type can be given to the columns on the year of foundation and the year of dissolution. Here, schema.org provides a definition and the corresponding entries such as

    "propertyURL": "schema:foundingDate",
    "propertyURL": "schema:dissolutionDate",
would serve as a means of specification.
When querying information about specific countries, it is helpful to have country information in the table as well. For that purpose, the csv file has a column which contains a country identifier. The entries are sourced from the ISO 3166 alpha-3 codes. wikidata offers the property P298 to express that the entries are country codes.

    {
      "titles": "country code",
      "propertyURL": "wikidata:P298",
      "dc:description": "Country in which the initiative is located"
    }
The last column contains information about the legal form of the initiative. Instead of documenting the legal form as a string, the legal form is specified as a code sourced from the Global Legal Entity Identifier Foundation GLEIF. This page contains links to the ISO 20275-backed entity legal forms code list. Again, wikidata offers the property wikidata:P1454 to specify that the entries are information on legal form. The corresponding entry for the metadata file looks as follows:

    {
      "titles": "legal form",
      "propertyURL": "wikidata:P1454",
      "dc:description": "Legal form of the initiative documented with codes taken from ISO 20275"
    }
Worked example: Power plant information
In the second part of this tutorial, we consider data about power plants. Here, a list of wind farms from Germany serves as an example. The data is originally again organized as a csv file containing information such as the name of the power plant, the type of the power plant, a classification of the energy product used as input in the power plant, the location in terms of latitude and longitude, the nameplate capacity, the commissioning year, the decommissioning year, and information on the owner of the power plant. The table here shows six different wind farms with the associated information.
name | type | using energy product | latitude | longitude | nameplate capacity [kW] | commissioning date | decommissioning date | owner |
---|---|---|---|---|---|---|---|---|
Langwedel dritte | onshore wind farm | RA310 | 53.013274 | 9.158455 | 3050 | 2017-12-29 | | Bürger Energie Bremen |
WEA Kammerberg | onshore wind farm | RA310 | 48.387257 | 11.518869 | 3000 | 2015-11-03 | | Bürger Energie Genossenschaft Freisinger Land |
Windpark Söhrewald / Niestetal | onshore wind farm | RA310 | 51.241938 | 9.518432 | 21525 | 2015-09-19 | | Bürger Energie Kassel & Söhre |
Windpark Rohrberg | onshore wind farm | RA310 | 51.23638 | 9.710966 | 15000 | 2016-03-23 | | Bürger Energie Kassel & Söhre |
Windpark Stiftswald | onshore wind farm | RA310 | 51.245691 | 9.658835 | 27000 | 2017-06-28 | | Bürger Energie Kassel & Söhre |
Windpark Kreuzstein | onshore wind farm | RA310 | 51.274447 | 9.730573 | 24000 | 2019-01-01 | | Bürger Energie Kassel & Söhre |
As before, the csv file is supplemented by metadata using a json metadata file. Following the naming convention of ...
In the first step, we again consider metadata that relate to the csv file as a whole, such as the creator of the file, access rights for the entire file, etc. The corresponding part of the metadata file may look like this:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "DEU_powerPlants_sample.csv",
      "dc:title": "Example - list of power plants in Germany",
      "dc:description": "List of power plants in Germany owned by citizen-led initiatives, example dataset to be used for illustrating the use of csv on the web",
      "dc:date": "2022-10-27",
      "dc:format": "text/csv",
      "dc:language": "en-US",
      "dc:publisher": {
        "schema:name": "EERAdata project",
        "schema:url": "https://cordis.europa.eu/project/id/883823",
        "schema:contactPoint": {
          "email": "info@eeradata.eu",
          "url": "https://www.eeradata.eu"
        }
      },
      "dc:rights": "https://creativecommons.org/licenses/by-sa/4.0/",
      "dc:subject": "Energy communities, Germany, Community energy, Energy cooperatives, Renewable Energy, power plants",
      "dc:source": {
        "schema:name": "ENBP Inventory \"Energy by people\" - First Europe-wide inventory on energy communities",
        "schema:url": "https://doi.org/10.18710/2CPQHQ"
      },
      "dc:type": "dataset",
      "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": {
          "email": "augustw@hvl.no"
        }
      },
      "dc:coverage": "Germany",
      "dc:identifier": "https://eeradata-platform.eu/"
    }
As before, we continue by providing information about the contents of the columns in the csv file. We start with the name of the wind farm listed in the first column. It can be referenced with schema:name, which according to schema.org is a property assigning a 'name' to a 'thing'. Alternatively, rdfs:label can be used for this purpose. The code snippet for describing the leftmost column would look like this:

    "tableSchema": {
      "columns": [{
        "titles": "name",
        "dc:description": "Name of the power plant",
        "datatype": "string",
        "propertyURL": "schema:name"
      }]
    }
The second column contains information about the type of power plant. There is a discussion comparing different resources for power plant types here. Taking the results from this discussion, the wikidata resource wikidata:Q50687555 and/or the resource oeo:OEO_00000311 offered by the Open Energy Ontology can be used. The corresponding entry in the metadata file may look like this:

    {
      "titles": "type",
      "dc:description": "Type of the power plant",
      "datatype": "string",
      "@type": "oeo:OEO_00000311"
    }
The third column specifies in more detail the energy input to the power plant in terms of the standard international energy product classification. A machine-actionable version of this classification is provided by the Eionet Data Dictionary. The identifier RA310, which describes the "wind onshore" energy product, is located at this link. More resources for vocabularies in the context of EU directives are given here. Following Example 60 of the primer, the valueURL property can be used to automatically expand the entry from the 'using energy product' column into the corresponding resource:

    {
      "titles": "using energy product",
      "dc:description": "Standard international energy product classification for the energy product used as input for the power plant",
      "valueURL": "https://dd.eionet.europa.eu/vocabularyconcept/eurostat/siec/{using energy product}",
      "propertyURL": "schema:identifier"
    }
There is no resource that actually allows expressing that the entry is an identifier from the SIEC list. Instead, a more general property is used here: schema:identifier represents any kind of identifier for any kind of thing, so it does not specify that the entry is an energy product.
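What the valueURL expansion does for a single row can be imitated with plain string replacement. The Python sketch below is a deliberate simplification: csvw processors apply URI templates according to RFC 6570, whereas the made-up helper `expand_value_url` only substitutes `{column title}` placeholders with the percent-encoded cell value.

```python
from typing import Dict
from urllib.parse import quote

# Template taken from the valueURL entry of the column description.
VALUE_URL = (
    "https://dd.eionet.europa.eu/vocabularyconcept/eurostat/siec/"
    "{using energy product}"
)


def expand_value_url(template: str, row: Dict[str, str]) -> str:
    """Replace every {column title} placeholder by the cell value of that column."""
    url = template
    for column, value in row.items():
        # Percent-encode the value so that it is safe inside a URL path.
        url = url.replace("{" + column + "}", quote(value, safe=""))
    return url


if __name__ == "__main__":
    row = {"name": "Langwedel dritte", "using energy product": "RA310"}
    print(expand_value_url(VALUE_URL, row))
```

For the first wind farm in the table, the cell value RA310 is appended to the vocabulary path, so the plain code in the csv file becomes a link to the corresponding concept.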
For the geo information, the metadata is expressed in a similar way as in the previous example:

    {
      "titles": "latitude",
      "dc:description": "geo location of the position of a power plant, latitude, WGS84",
      "datatype": {
        "base": "number",
        "minimum": "-90",
        "maximum": "90"
      },
      "propertyURL": "schema:latitude"
    }
For the commissioning date, the entry looks like this:

    "tableSchema": {
      "columns": [{
        "titles": "commissioning date",
        "dc:description": "Commissioning date of the power plant",
        "datatype": "date",
        "propertyURL": "wikidata:P729"
      }]
    }
Here, the commissioning date is linked to the property wikidata:P729.
How to test the metadata document?
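One lightweight sanity check, sketched here in Python, is to compare the column titles declared in the metadata with the header row of the csv file. It is not a substitute for the validators listed under Resources. The helper name `compare_header` is made up, and the sketch assumes that every titles entry is a plain string or a list of strings, as in the examples on this page.

```python
import csv
import io
import json
from typing import List, Tuple


def compare_header(csv_text: str, metadata_text: str) -> Tuple[List[str], List[str]]:
    """Return (titles missing from the csv header, header cells without description)."""
    # The first csv row is taken to be the header.
    header = next(csv.reader(io.StringIO(csv_text)))

    columns = json.loads(metadata_text)["tableSchema"]["columns"]
    titles: List[str] = []
    for column in columns:
        value = column["titles"]
        # Assumption: titles is a string or a list of strings.
        titles.extend([value] if isinstance(value, str) else list(value))

    not_in_csv = [title for title in titles if title not in header]
    not_described = [cell for cell in header if cell not in titles]
    return not_in_csv, not_described


if __name__ == "__main__":
    sample_csv = "name,type,latitude\nLangwedel dritte,onshore wind farm,53.013274\n"
    sample_meta = json.dumps(
        {"tableSchema": {"columns": [{"titles": "name"}, {"titles": "type"}]}}
    )
    print(compare_header(sample_csv, sample_meta))
```

Two empty lists mean that header and metadata agree on the column titles. Typos in titles, which otherwise silently detach a column from its description, show up in the first list.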
Resources
Tools to validate structured data
- csv lint and corresponding example schema
- IDLab Turtle Validator
- Structured Data linter
- Validate RDF Data by SURROUND AUSTRALIA
- csvcubed