CSVW for energy datasets

From EERAdata Wiki
Revision as of 13:06, 28 October 2022 by Valerias (talk | contribs) (Worked example: citizen-led initiatives)

This page collects examples of improving the FAIRness of datasets using CSV on the Web (CSVW), an extension of the CSV format. Most importantly, CSV on the Web makes it possible to tie metadata and data together, starting from a well-known and widely used data format. The standard offers a rich framework for annotating existing CSV documents with additional information and for transforming them into other structured data exchange formats such as JSON(-LD) and RDF. At the same time, CSV on the Web is user-friendly: it offers a flexible mechanism that ranges from minimal FAIR extensions to elaborate context building for the data to be shared. CSV on the Web is a W3C recommendation in coherence with ...

FAIR principles

The examples on this page illustrate how CSV on the Web helps to realize the FAIR principles:

To be Findable:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.

F2. data are described with rich metadata.

F3. (meta)data are registered or indexed in a searchable resource.

F4. metadata specify the data identifier.

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol.

A1.1. the protocol is open, free, and universally implementable.

A1.2. the protocol allows for an authentication and authorization procedure, where necessary.

A2. metadata are accessible, even when the data are no longer available.

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles.

I3. (meta)data include qualified references to other (meta)data.

To be Re-usable:

R1. (meta)data have a plurality of accurate and relevant attributes.

R1.1. (meta)data are released with a clear and accessible data usage license.

R1.2. (meta)data are associated with their provenance.

R1.3. (meta)data meet domain-relevant community standards.

Worked example: citizen-led initiatives

We start with a csv file whose contents are shown in this table:

| name | status | year of foundation | national identifier | street address | city | postal code | C/O | lat | lon | website | activity | national industrial sector classification | purpose (original language) | purpose (translation) | date of removal | country code | legal form |
| Kvarkenvinden 1 | active | 1998-01-27 | 769602-8096 | Norra Obbolavägen 89 | Umeå | 904 22 | | 63.80667 | 20.27364 | http://kvarkenvinden.se | wind onshore | 35.110 | Föreningen har till ändamål att främja sina medlemmars ekonomiska intressen och dess miljöintresse genom att utöva driftsansvar över vindkraftverk i syfte att tillhandahålla vindenergi för medlemmarnas konsumtion. All genom föreningen genererad vindenergi ägs av medlemmarna. | The purpose of the association is to promote the financial interests of its members and its environmental interests by exercising operational responsibility for wind turbines in order to provide wind energy for the members' consumption. All wind energy generated by the association is owned by the members. | | SWE | C61P |
| Ollebacken vind ekonomiska förening | active | 2008-01-08 | 769618-1010 | SIKÅS NORRA BYVÄGEN 180 | Hammerdal | 833 49 | | 63.67432 | 15.06297 | https://www.ollebackenvind.se | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarans ekonomiska intressen genom att i egen regi producera miljö vänlig energi. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly energy on their own. | | SWE | C61P |
| Jamtkulingen ekonomiska förening | active | 2009-01-20 | 769619-7420 | Södra Strandvägen 19 A | Frösön | 832 44 | Sven Erik Eriksson | 63.17622 | 14.61152 | http://www.jamtkulingen.se/ | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig energi. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly energy on their own. | | SWE | C61P |
| Hällingarna Vind | active | 2005-08-02 | 769612-8318 | OLLEBACKEN 130 | Hammerdal | 833 49 | | 63.59838 | 15.05107 | | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig engeri. Medlemmarna deltar i verksamheten som konsumenter. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly areas on their own. The members participate in the business as consumers. | | SWE | C61P |
| Offerdalsvind Ekonomiska förening | active | 2000-08-31 | 769606-0719 | BERGE 718, | Offerdal | 835 97 | | 63.46154 | 14.09483 | http://www.offerdalsvind.se | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig energi. Medlemmarna deltar i verksamheten som konsumenter. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly areas on their own. The members participate in the business as consumers. | | SWE | C61P |
| Trärike vindkraft ekonomisk förening | liquidation | 1996-08-07 | 769601-6331 | VIKINGAVÄGEN 36 | Sundsvall | 857 41 | | 62.40317 | 17.26335 | http://www.trarikevindkraft.se/index.htm | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intresse genom att förse medlemmarna med egen vindkraft- producerad el och även främja medlemmarnas miljöintresse och vindkraftens utveckling. Föreningen skall bygga upp ett kapital som säkrar uppbyggnad, drift, underhåll och demontering av föreningens vindkraftverk. | The purpose of the association is to promote the members 'financial interest by providing the members with their own wind-powered electricity and also promoting the members' environmental interest and the development of wind power. The association will build up a capital that ensures the construction, operation, maintenance and dismantling of the association's wind turbines. | | SWE | C61P |
| Dala Vindkraft Ekonomisk förening | active | 2006-02-18 | 769613-8911 | RIKSVÄGEN 15 | Rättvik | 795 32 | | 60.88933 | 15.11092 | http://dalavind.se/vindandelar-foreningar/dala-vindkraft-ekonomisk-forening/medlemsinformation | wind onshore, E-trade | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen, samt deras miljöintresse, genom att tillhandahålla medlemmarna egen vindkraftsproducerad elkraft. | The purpose of the association is to promote the members' financial interests, as well as their environmental interests, by providing the members with their own wind-powered electricity. | | SWE | C61P |
| Vindela | active | 2004-08-17 | 769611-2411 | BOX 4 | Malung | 782 21 | | 60.6834 | 13.71603 | http://dalavind.se/vindandelar-foreningar/vindela/ | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig elkraft. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly electricity on their own. | | SWE | C61P |
| Äppelbovind | active | 2000-09-25 | 769606-1485 | BOX 4 | Malung | 782 21 | | 60.6834 | 13.71603 | http://dalavind.se/vindandelar-foreningar/appelbovind/kontakt/ | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig elkraft. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly electricity on their own. | | SWE | C61P |
| Fjällbergsvind ekonomisk förening | liquidation | 2005-09-13 | 769613-0587 | Djupuddsvägen 35 | Grängesberg | 772 40 | | 60.08136 | 14.98449 | http://dalavind.se/vindandelar-foreningar/fjallbergs-vind-ekonomiskforening | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att tillhandahålla medlemmarna egen vindkrafts- producerad elkraft. | The purpose of the association is to promote the members' financial interests by providing the members with their own wind power produced electricity. | | SWE | C61P |
| Kyrkvinden ekonomiska förening | active | 2005-05-09 | 769613-0025 | GIMOGATAN 6 B 3TR | Uppsala | 752 20 | | 59.8687 | 17.6083 | https://www.kyrkvinden.se | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att förmedla och i egen regi eller genom samarbetspartner producera miljövänlig elkraft. | The purpose of the association is to promote the members' financial interests by conveying, on their own account or through partners, environmentally friendly electricity | | SWE | C61P |
| Ljusterö Vind ekonomiska förening | active | 2008-04-02 | 769618-5961 | LJUSTERÖ TORG | Ljusterö | 184 95 | | 59.52403 | 18.60869 | http://www.ljusterovind.se/ | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarnas ekonomiska intressen genom att i egen regi producera miljövänlig energi samt annan därmed förenlig verksamhet. Medlemmarna deltar i verksamheten som konsumenter. | The purpose of the association is to promote the members' financial interests by producing environmentally friendly energy and other related activities on their own behalf. The members participate in the business as consumers. | | SWE | C61P |
| Windy ekonomisk förening | active | 2000-12-11 | 769606-4802 | SVARTEDALSBACKEN 9 | Lerum | 443 39 | Mattias Skjöldebrandt | 57.76418 | 12.26767 | http://windy-vindkraft.se/ | wind onshore | 35.110 | Föreningen har till ändamål att främja medlemmarna ekonomiska intressen genom att tillhandahålla medlemmarna egen vindkraft- producerad el, därigenom också främjande medlemmarnas intresse för miljö och energihushållning samt bedriva därmed förenlig verksamhet. | The purpose of the association is to promote the members 'financial interests by providing the members with their own electricity produced by wind power, thereby also promoting the members' interest in the environment and energy management, and conducting compatible activities therewith. | | SWE | C61P |

The csv file contains, for each initiative, its name, its legal status, its year of foundation, its national identifier, its street address, the city it is located in, the corresponding postal code, optional C/O information, the latitude (lat) and longitude (lon) of its location, its website, information on its activities, a national industrial sector classification, a purpose statement in the original language, the same purpose statement translated into English, the date of removal, the country code, and its legal form. To attach metadata to the csv file, we create a second file containing this metadata. The file format of this metadata file is JSON. Let us assume that the csv file has the filename "SWE_initiatives_sample.csv". According to the CSV on the Web standard, the metadata file should then have the filename "SWE_initiatives_sample.csv-metadata.json". A minimal form of the metadata file contains the following information:

  {
     "@context": "http://www.w3.org/ns/csvw",
     "url": "SWE_initiatives_sample.csv"
  }    
 

The @context property links to the language conventions of the CSVW standard, and the url property states the filename of the csv file. This minimal file can be extended to contain more specific metadata. All entries are encoded as property specifications with corresponding values.
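
The naming convention and the minimal document can also be generated programmatically. The following is a minimal sketch in Python using only the standard library; the helper names metadata_path_for and minimal_metadata are illustrative assumptions and not part of the CSVW standard.

```python
import json
from pathlib import Path


def metadata_path_for(csv_path: str) -> Path:
    """Return the conventional metadata filename for a csv file."""
    # Convention described above: "<csv filename>-metadata.json",
    # placed next to the csv file.
    p = Path(csv_path)
    return p.with_name(p.name + "-metadata.json")


def minimal_metadata(csv_path: str) -> dict:
    """Build the minimal metadata document: the context plus the csv filename."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": Path(csv_path).name,
    }


if __name__ == "__main__":
    csv_file = "SWE_initiatives_sample.csv"
    print(metadata_path_for(csv_file))
    print(json.dumps(minimal_metadata(csv_file), indent=2))
```

The sketch only prints the result; writing it to disk is left out on purpose so that no existing metadata file is overwritten by accident.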

General information about the csv file

In a first step, we include general information about the csv file. We start with a code snippet for specifics such as title, description, and creator:

   {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "SWE_initiatives_sample.csv",
    "dc:title": "Example - list of citizen-led initiatives in Sweden",
    "dc:description": "List of citizen-led initiatives in Sweden, example dataset to be used for illustrating the use of csv on the web",
    "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": { "email": "augustw@hvl.no" }
    }
   }
 

As in the example above, property specifications can be terms from popular metadata vocabularies such as Dublin Core, schema.org, or DCAT. These vocabularies can be used independently or together. In the above example, terms from the Dublin Core vocabulary are mixed with terms from schema.org. The title of the csv file and its description are stated using Dublin Core terms. The dc:creator term contains further information, which in turn is specified using the schema.org vocabulary: a human-readable name of the creator, a URL of the creator (here, his ORCID iD), and contact point details such as the email address. The contact point information can be extended with a telephone or fax number. We continue with a more extensive list of details about the file as a whole:

    "@context": "http://www.w3.org/ns/csvw",
    "url": "SWE_initiatives_sample.csv",
    "dc:title": "Example - list of citizen-led initiatives in Sweden",
    "dc:description": "List of citizen-led initiatives in Sweden, example dataset to be used for illustrating the use of csv on the web",
    "dc:date": "2022-10-07",
    "dc:format": "text/csv",
    "dc:language": "en-US",
    "dc:publisher": {
	"schema:name": "EERAdata project",
	"schema:url": "https://cordis.europa.eu/project/id/883823",
	"schema:contactPoint": {
	    "email": "info@eeradata.eu",
	    "url": "https://www.eeradata.eu"
	}
    },
    "dc:rights": "https://creativecommons.org/licenses/by-sa/4.0/",
    "dc:subject": "Energy communities, Sweden, Community energy, Energy cooperatives, Renewable Energy",
    "dc:source": {
        "schema:name": "ENBP Inventory \"Energy by people\" - First Europe-wide inventory on energy communities",
        "schema:url": "https://doi.org/10.18710/2CPQHQ"
    },
    "dc:type": "dataset",
    "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": { "email": "augustw@hvl.no" }
    },
    "dc:coverage": "Sweden",
    "dc:identifier": "https://eeradata-platform.eu/" 
 

The date follows ISO 8601. The language is specified following RFC 4646 (since superseded by RFC 5646). For the media type, RFC 7111 has been used as a specification. The type is taken from the DCMI type vocabulary. The dc:publisher information has several details, which are grouped into one object by curly brackets: the EERAdata project as the name of the publisher, the corresponding CORDIS entry as a persistent identifier, and contact information in the form of an email address and a website. The entry for dc:rights contains the license information and points to a website provided by the Creative Commons organization. It states that the csv file is licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). As such, anybody is free to share and adapt the file, provided that credit is given and adaptations are distributed under the same license. The dc:subject entry contains a list of keywords describing the contents of the csv file in more detail. In a real example, the subject information should be more extensive; here, only a basic example is given. The dc:type information declares the csv file as a dataset according to the possible types listed by the DCMI type element working draft. The dc:identifier holds, as DCMI describes it, an unambiguous reference to the resource within a given context. Ideally, the resource is the final FAIRified object. Thus, it applies to the JSON file created from the original csv file and its JSON metadata document. Best practice is to assign a persistent identifier.
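
Some of these conventions can be checked mechanically before a metadata file is published. The following is a minimal sketch, assuming the metadata has already been loaded into a Python dictionary. The function name check_file_metadata and the selection of keys treated as expected are illustrative assumptions; the CSVW standard itself does not require any of these entries.

```python
from datetime import date

# Keys this sketch expects to find; the selection is an assumption made
# for illustration, not a requirement of the CSVW standard.
EXPECTED_KEYS = ("dc:title", "dc:description", "dc:rights", "dc:identifier")


def check_file_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems found in file-level metadata."""
    problems = []
    for key in EXPECTED_KEYS:
        if not meta.get(key):
            problems.append(f"missing {key}")
    try:
        # ISO 8601 calendar date, e.g. 2022-10-07
        date.fromisoformat(meta.get("dc:date", ""))
    except (TypeError, ValueError):
        problems.append("dc:date is not an ISO 8601 calendar date")
    return problems


if __name__ == "__main__":
    example = {
        "dc:title": "Example - list of citizen-led initiatives in Sweden",
        "dc:date": "07.10.2022",
    }
    for problem in check_file_metadata(example):
        print(problem)
```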

Now, how does this contribute to making the original csv file FAIR?

Specifying information about table headers

This section describes how to further specify the entries in the various columns of the csv file. Note that the CSV on the Web standard allows a JSON metadata file to be connected to several csv files that share a certain layout. For our purposes here, we focus on a single table: the one illustrated above with the information on Swedish energy cooperatives. Before giving a full-fledged description of all the columns, we start with the first four columns from the left, which specify the name of the initiative, its legal status, the year of foundation, and a national identifier. We start with a simple set of specifications before assigning more information to the columns. For more information, please also see the primer as well as the recommendation itself.

    "tableSchema": {
        "columns": [{
            "titles": "name",
            "dc:description": "Name of the initiative",
            "datatype": "string",
            "required": true
        },{
            "titles": "status",
            "dc:description": "Legal status",
            "datatype": {
                "base": "string",
                "format": "active|inactive|liquidation"
            },
            "required": true
        },{
            "titles": "year of foundation",
            "dc:description": "Year of foundation of the initiative",
            "datatype": "date"
        },{
            "titles": "national identifier",
            "propertyUrl": "https://www.wikidata.org/wiki/Property:P6460",
            "datatype": {
                "dc:title": "National identifier for Sweden",
                "dc:description": "National identifier for Sweden",
                "base": "string",
                "format": "\\d{6}-\\d{4}"
            }
        }]
    }
 

The general property for table-specific attributes is tableSchema. Details on the columns are specified by columns. Per column, a title, a description, and details about the datatype are fixed. For example, the first column has the title name, and the dc:description entry gives further information on what name actually means. The datatype for all entries in the first column is string. For further pre-defined datatypes, see the Metadata Vocabulary for Tabular Data. Setting required to true leads to an error message if the corresponding entry in the csv file is empty. In the entry for the second column, the format property lists the allowed values. If there is any entry other than active, inactive or liquidation, an error will be reported. The entries of the third column have the datatype date, so entries must comply with the ISO 8601 format YYYY-MM-DD. Finally, the national identifier for organizations in Sweden is listed in the fourth column. It consists of 6 digits, followed by a dash, followed by another 4 digits. The format property allows patterns of this type to be specified with the help of regular expressions. Note that in a JSON file each backslash of the regular expression must itself be escaped as a double backslash. In addition, the propertyUrl property (spelled with a lowercase "rl" in the recommendation) ties each entry to the Wikidata property P6460 and in that way defines that all entries are Swedish organization numbers.
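
The checks described above can be imitated in a few lines of Python, which may help to see what a CSVW validator would report for a given row. This is only a sketch under stated assumptions: the function check_row is hypothetical, it covers just the first four columns, it treats the patterns as having to match the whole value, and a real validator follows the recommendation in far more detail.

```python
import re
from datetime import date

# Patterns taken from the column descriptions above.
STATUS = re.compile(r"active|inactive|liquidation")
NATIONAL_ID = re.compile(r"\d{6}-\d{4}")


def check_row(row: dict) -> list[str]:
    """Return the problems found in one csv row (given as a dict keyed by header)."""
    errors = []
    # "required": true -> empty values are reported
    if not row.get("name"):
        errors.append("name: required value is missing")
    # fullmatch: the whole value has to match (assumption of this sketch)
    if not STATUS.fullmatch(row.get("status", "")):
        errors.append("status: value is not one of the allowed entries")
    founded = row.get("year of foundation", "")
    if founded:  # the column is not marked as required
        try:
            date.fromisoformat(founded)
        except ValueError:
            errors.append("year of foundation: not an ISO 8601 date")
    identifier = row.get("national identifier", "")
    if identifier and not NATIONAL_ID.fullmatch(identifier):
        errors.append("national identifier: does not match the pattern")
    return errors


if __name__ == "__main__":
    row = {
        "name": "Kvarkenvinden 1",
        "status": "active",
        "year of foundation": "1998-01-27",
        "national identifier": "769602-8096",
    }
    print(check_row(row))
```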

The next three columns contain address information and use the schema.org vocabulary to specify the street address, the name of the municipality, and the postal code of the initiative. The corresponding entries for the metadata file are:

         {
            "titles": "street address",
            "dc:description": "Street address of the initiative",
            "datatype": "string",
            "propertyUrl": "schema:streetAddress"
         },{
            "titles": "city",
            "dc:description": "Municipality where the initiative is located",
            "datatype": "string",
            "propertyUrl": "schema:addressLocality"
         },{
            "titles": "postal code",
            "dc:description": "Postal code of the location of the initiative",
            "datatype": "string",
            "propertyUrl": "schema:postalCode"
         }
 

Note that for the case of Sweden, the format property can further be used to define allowed patterns for street addresses and postal codes. The location of the headquarters of the initiative is also reported in terms of geo-coordinates in the csv file. The column entitled lat contains latitudes, while the column entitled lon holds longitudes. Here, schema.org also provides properties that link to standards:

         {
             "titles": "lat",
             "dc:description": "geo location of headquarter of initiative, latitude, WGS84",
             "datatype": {
                  "base": "number",
                  "minimum": "-90",
                  "maximum": "90"
             },
             "propertyUrl": "schema:latitude"
         }, {
             "titles": "lon",
             "dc:description": "geo location of headquarter of initiative, longitude, WGS84",
             "datatype": {
                  "base": "number",
                  "minimum": "-180",
                  "maximum": "180"
             },
             "propertyUrl": "schema:longitude"
         }
 

As can be seen from the example, CSVW allows values to be restricted to a range. Latitudes range between -90 and 90, longitudes between -180 and 180. Using the schema.org definition makes it implicitly clear that the WGS84 standard is used to describe geo locations.
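
The effect of the minimum and maximum restrictions can be illustrated with a small range check. The following sketch assumes that cell values arrive as strings, as they do when read from a csv file; the helper name in_range is an illustrative assumption.

```python
def in_range(value: str, minimum: float, maximum: float) -> bool:
    """Check that a csv cell holds a number within [minimum, maximum]."""
    try:
        number = float(value)
    except ValueError:
        # Not a number at all, e.g. an empty cell or free text
        return False
    return minimum <= number <= maximum


if __name__ == "__main__":
    # Coordinates of the first initiative in the table above
    print(in_range("63.80667", -90, 90))    # latitude
    print(in_range("20.27364", -180, 180))  # longitude
```

A swapped pair of columns is a typical mistake that such a check can catch for longitudes beyond 90 degrees, although not for the Swedish sample data, where both coordinates lie below 90.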

The next column contains a link to the web presence of the initiative. A minimal way to specify this is again with the help of schema.org:

         {
         "titles": "website",
         "dc:description": "Link to the web presence of the initiative",
         "propertyUrl": "schema:url"
         }
     

The column entitled activity contains information about the activity of the initiative. In general, the activities of citizen-led initiatives can be quite broad, ranging from electricity and heat generation by different means to distribution activities and energy efficiency measures. Again, the task is to find a resource on the web which allows expressing that all entries in this column are activities. The makesOffer property provided by schema.org is one possibility to state this. According to its definition, makesOffer describes 'A pointer to products or services offered by the organization or person.' The specification of the column reads:

         {
         "titles": "activity",
         "dc:description": "Describes activities by citizen-led energy initiatives",
         "propertyUrl": "schema:makesOffer"
         }
   

Note that it is suggested here to use a controlled vocabulary from which the different activities are sourced. More details will be discussed elsewhere.

The next column contains the national industrial sector classification, which provides information about the type of activities the initiative is engaged in, based on a classification of economic activities published by Statistics Sweden. Note that this information overlaps to some extent with the information offered in the activity column. However, the level of detail that can be expressed in a domain-specific controlled vocabulary is usually much greater than that of the rather general classification scheme covering the whole national industry sector. On the other hand, initiatives may engage in activities which are captured in the general scheme but are not contained in a domain-specific vocabulary. The Swedish Standard Industrial Classification (SNI 2007) is based on the EU's recommended standard, NACE Rev. 2. It allows, however, for more detailed specifications. The official codes for activity groups consist of two digits separated by a dot from three digits. The example here is the code 35.110, which encodes 'Production of electricity'. Similar to the example discussed above, the format of the entry can be specified with the format property:

        {
            "titles": "national industrial sector classification",
            "aboutUrl": "https://www.wikidata.org/wiki/Q2976602",
            "datatype": {
                "dc:title": "Swedish Standard Industrial Classification",
                "dc:description": "Swedish Standard Industrial Classification for activities by the citizen-led initiative",
                "base": "string",
                "format": "\\d{2}\\.\\d{3}"
            }
        }
 

While Wikidata offers resources for the NACE classification codes and the Belgian classification code, no resource is available for the Swedish case. As a minimum, Wikidata offers resources for economic classification schemes in general, namely wikidata:Q2976602 or wikidata:Q27048688. To tie the column to such a resource, CSV on the Web offers the property aboutUrl, which names the resource that the cell values describe. An alternative is offered here by DBpedia, which has a resource dbr:International_Standard_Industrial_Classification that provides what is given in the column of the csv file.

Additional type specifications can be given for the columns on the year of foundation and the date of removal. Here, schema.org provides definitions, and the corresponding entries such as

            "propertyUrl": "schema:foundingDate",
            "propertyUrl": "schema:dissolutionDate",
 

would serve as a means of specification.
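
To see what the property URLs assigned to the columns are good for, it can help to look at how one row of the csv file turns into subject-property-value statements, which is the basic idea behind the conversion to RDF. The following is a strongly simplified sketch and not the conversion algorithm of the recommendation: the function names, the prefix table, and the row identifier "_:row1" are assumptions made for illustration.

```python
# Prefix table for this sketch; only schema.org is needed here.
PREFIXES = {"schema": "http://schema.org/"}


def expand(prefixed: str) -> str:
    """Expand a prefixed name such as schema:name into a full URL."""
    prefix, _, local = prefixed.partition(":")
    base = PREFIXES.get(prefix)
    # Unknown prefixes (including full URLs) are returned unchanged.
    return base + local if base else prefixed


def row_to_triples(subject: str, row: dict, property_urls: dict) -> list[tuple]:
    """Turn one csv row into (subject, property, value) statements."""
    triples = []
    for title, prop in property_urls.items():
        value = row.get(title, "")
        if value:  # empty cells produce no statement
            triples.append((subject, expand(prop), value))
    return triples


if __name__ == "__main__":
    property_urls = {
        "year of foundation": "schema:foundingDate",
        "date of removal": "schema:dissolutionDate",
        "postal code": "schema:postalCode",
    }
    row = {"year of foundation": "1998-01-27", "date of removal": "", "postal code": "904 22"}
    for triple in row_to_triples("_:row1", row, property_urls):
        print(triple)
```

In this sketch an empty date of removal simply yields no statement, so an initiative without a dissolution date is not given one.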

Worked example: Power plant information

In the second part of this tutorial, we consider data about power plants. Here, a list of wind farms in Germany serves as an example. The data is again originally organized as a csv file containing information such as the name of the power plant, the type of the power plant, a classification of the energy product used as input in the power plant, the location in terms of latitude and longitude, the nameplate capacity, the commissioning date, the decommissioning date, and information on the owner of the power plant. The table here shows six wind farms with the associated information.

| name | type | using energy product | latitude | longitude | nameplate capacity [kW] | commissioning date | decommissioning date | owner |
| Langwedel dritte | onshore wind farm | RA310 | 53.013274 | 9.158455 | 3050 | 2017-12-29 | | Bürger Energie Bremen |
| WEA Kammerberg | onshore wind farm | RA310 | 48.387257 | 11.518869 | 3000 | 2015-11-03 | | Bürger Energie Genossenschaft Freisinger Land |
| Windpark Söhrewald / Niestetal | onshore wind farm | RA310 | 51.241938 | 9.518432 | 21525 | 2015-09-19 | | Bürger Energie Kassel & Söhre |
| Windpark Rohrberg | onshore wind farm | RA310 | 51.23638 | 9.710966 | 15000 | 2016-03-23 | | Bürger Energie Kassel & Söhre |
| Windpark Stiftswald | onshore wind farm | RA310 | 51.245691 | 9.658835 | 27000 | 2017-06-28 | | Bürger Energie Kassel & Söhre |
| Windpark Kreuzstein | onshore wind farm | RA310 | 51.274447 | 9.730573 | 24000 | 2019-01-01 | | Bürger Energie Kassel & Söhre |

As before, the csv file is supplemented by metadata using a json metadata file. Following the naming convention of ...

In a first step, we again consider metadata which relate to the csv file as a whole, such as the creator of the file, access rights for the entire file, etc. The corresponding part of the metadata file may look like this:

    "@context": "http://www.w3.org/ns/csvw",
    "url": "DEU_powerPlants_sample.csv",
    "dc:title": "Example - list of power plants in Germany",
    "dc:description": "List of power plants in Germany owned by citizen-led initiatives, example dataset to be used for illustrating the use of csv on the web",
    "dc:date": "2022-10-27",
    "dc:format": "text/csv",
    "dc:language": "en-US",
    "dc:publisher": {
	"schema:name": "EERAdata project",
	"schema:url": "https://cordis.europa.eu/project/id/883823",
	"schema:contactPoint": {
	    "email": "info@eeradata.eu",
	    "url": "https://www.eeradata.eu"
	}
    },
    "dc:rights": "https://creativecommons.org/licenses/by-sa/4.0/",
    "dc:subject": "Energy communities, Germany, Community energy, Energy cooperatives, Renewable Energy, power plants",
    "dc:source": {
        "schema:name": "ENBP Inventory \"Energy by people\" - First Europe-wide inventory on energy communities",
        "schema:url": "https://doi.org/10.18710/2CPQHQ"
    },
    "dc:type": "dataset",
    "dc:creator": {
        "schema:name": "August Wierling",
        "schema:url": "https://orcid.org/0000-0002-7443-7593",
        "schema:contactPoint": { "email": "augustw@hvl.no" }
    },
    "dc:coverage": "Germany",
    "dc:identifier": "https://eeradata-platform.eu/" 
 

As before, we continue by providing information about the contents of the columns in the csv file. We start with the name of the wind farm, listed in the first column. It can be referenced with schema:name, which according to schema.org is a property assigning a 'name' to a 'thing'. Alternatively, rdfs:label can be used for this purpose. The code snippet for describing the leftmost column would look like this:

  "tableSchema": {
        "columns": [{
            "titles": "name",
            "dc:description": "Name of the power plant",
            "datatype": "string",
            "propertyUrl": "schema:name"
        }
     ]
  }
 

For the commissioning date, the entry looks like this:

  "tableSchema": {
        "columns": [{
            "titles": "commissioning date",
            "dc:description": "Commissioning date of the power plant",
            "datatype": "date",
            "propertyUrl": "wikidata:P729"
        }
     ]
  }
 

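Here, the commissioning date is linked to the Wikidata property P729. Note that wikidata is not one of the prefixes predefined for CSVW metadata, so a validator may require the full property URL instead of the shorthand wikidata:P729.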

How to test the metadata document?
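
One lightweight check, which falls well short of a full CSVW validation, is to compare the header row of the csv file with the column titles declared in the metadata document. The following sketch uses only the Python standard library. The function name compare_header_with_titles is hypothetical, and the sketch assumes that every column description carries a single string under titles, as in the snippets on this page.

```python
import csv
import io
import json


def compare_header_with_titles(csv_text: str, metadata_text: str) -> dict:
    """Compare the csv header row with the titles declared in the metadata."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = json.loads(metadata_text)["tableSchema"]["columns"]
    # Assumption: "titles" is a plain string for every column.
    titles = [column["titles"] for column in columns]
    return {
        "missing_in_csv": [t for t in titles if t not in header],
        "undescribed_in_csv": [h for h in header if h not in titles],
    }


if __name__ == "__main__":
    csv_text = "name,status,street address\nVindela,active,BOX 4\n"
    metadata_text = json.dumps({
        "tableSchema": {
            "columns": [
                {"titles": "name"},
                {"titles": "status"},
                {"titles": "Street address"},
            ]
        }
    })
    print(compare_header_with_titles(csv_text, metadata_text))
```

The comparison is case-sensitive on purpose, so a title such as "Street address" is reported when the csv header reads "street address".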

Resources