UC2 Energy communities & energy markets
Framing the use case
This section describes the FAIRification work undertaken in the use case "Energy communities & energy markets" of the EERAdata project. The Energy-by-the-people (ENBP) database was FAIRified and published open access.
Ordinary citizens are coming together through collective initiatives to play an active role in the low carbon transition. Citizen-led energy projects have grown over the past years to produce, distribute, and consume energy from renewable sources while being governed democratically, with benefits accruing locally. Statistical evidence going beyond the perspective of single case studies has been lacking. The new database, available open access from dataverse.no, provides data on more than 10,000 initiatives and 16,000 production units in 30 European countries, focusing on the past 20 years.
Statistics are essential for quantifying the impact of citizen-led engagement in the energy transition. The FAIR and open database can be used to support the construction of likely trajectories for citizen engagement. The data help with monitoring, goal and policy setting, and impact evaluation of these initiatives, which is increasingly needed to support evidence-based policy action in this field.
Low-quality data and metadata are significant impediments to research activities and evidence-based policy making. These shortfalls decrease the ability of academia and policy planners to locate and reuse data. Metadata and data should be descriptive, complete, easy to find, accessible, and machine-parsable to enable reuse. Through our FAIRification of a database on European energy communities, the data are now available open access, and reuse is encouraged through classification standards and controlled vocabularies.
Technical challenges that had to be solved during the FAIRification of the data included the mining and cleaning of data, the identification of rich metadata, the development and assessment of metadata standards, the development of user-centric query scenarios, and the choice of the right platform together with query tools. A workflow was developed which can serve as a model for the FAIRification of other use cases in low carbon research and beyond. See the link below for support.
Summary of results
While the idea that reusable and machine-accessible data hold high potential is spreading, researchers lack blueprints for how to actually implement them. Our use case from the energy domain outlines step by step how to solve practical challenges and where to find detailed information.
Wilkinson et al.'s (2016) idea of FAIR data is spreading, but practices for making data F-findable, A-accessible, I-interoperable, and R-reusable are lacking. On a general level, support is provided, for example, by EOSC initiatives (see, e.g., EOSC's FAIR metrics and data quality advisory group). However, many researchers need a step-by-step recipe that they can follow, which we sketch in our use case.
We assume that the reader has a dataset at hand that needs to be FAIRified. Let us take a brief look at how this works with the example of an energy use case. It is a simple inventory that collects data on citizen initiatives in the energy transition (e.g., When did they start? How many people participate? Where are they located?) and their projects (What do they do? How many wind farms have been invested in?). Data on over 10,000 such initiatives were collected from different websites, statistical offices, and other online media. Details on the inventory can be found in Wierling, A., Schwanitz, V.J., Zeiss, J.P. et al. A Europe-wide inventory of citizen-led energy action with data from 29 countries and over 10000 initiatives. Sci Data 10, 9 (2023). https://doi.org/10.1038/s41597-022-01902-5.
The FAIRification of the database took a lot of extra time, and we had to build up data governance skills. We could have saved much of that effort had a recipe been available to follow. The first step is to imagine the future reuse of the data (i.e., develop a user-centric data model). That means understanding who the target group of the data is and how they would use the data. If you build a dataset from scratch, a clear idea of what data are needed to answer the intended research questions is required beforehand. If you already have a dataset at hand, review these points against it.
The next step (pre-FAIRification) is to clarify how deep you want to go with the FAIRification, depending on the resources available. Should all entries be made machine-actionable, i.e., associated with machine-accessible definitions and standards? Or is it enough to ensure the FAIRness of key data and metadata? The task is to analyze the (meta)data in view of your goals. For our use case, we had limited resources and decided to develop an archivable database. All attributes of the database are machine-actionable - e.g., instead of saying "MW" and "capacity" when talking about a wind farm, we use machine-actionable standards, a list of which you can find on the EERAdata wiki (including an example of how to describe a standard power plant). For the provenance of data, we decided to track sources only at the level of the source's database, not at the level of individual entries.
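As a minimal sketch, such a mapping from local labels to standard identifiers can be kept in a simple lookup table. The IRIs below are plausible but illustrative assumptions (Wikidata's "installed capacity" property and QUDT's megawatt unit), not necessarily the exact entries listed on the EERAdata wiki:

```python
# Illustrative lookup from free-text attribute labels to
# machine-actionable identifiers (IRIs here are assumptions;
# the project's actual list is on the EERAdata wiki).
VOCAB = {
    "capacity": "http://www.wikidata.org/prop/direct/P2109",  # installed capacity
    "MW": "http://qudt.org/vocab/unit/MegaW",                 # megawatt (QUDT)
}

def to_standard(label):
    """Return the controlled-vocabulary IRI for a local label."""
    try:
        return VOCAB[label]
    except KeyError:
        raise ValueError(f"No standard found for {label!r}; extend VOCAB")

print(to_standard("MW"))
```

Keeping the mapping in one place makes it easy to spot attributes that still lack a standard: any label that raises an error is a gap to be filled before the data can be called machine-actionable.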
The third step (FAIRification) is the actual implementation. We used RDF together with Turtle as the technical solution to transfer the database tables into listed statements, such as 'wikidatap:P17 wikidata:Q142' or 'schema:url "https://energie-partagee.org/"'. These statements describe the 8th initiative of our French sample, stating that 'the sovereign state of this item' (country, P17) is France (Q142), for which we use the machine-actionable definitions provided by Wikidata. We furthermore provide the website of the initiative, using the persistent, web-accessible definition 'url of the item' provided by the organization schema.org. These links are included in the header of the Turtle file. Further details are given in Wierling & Schwanitz 2022, where we also explain what to do if standards do not yet exist. There you also find more resources on topics like CSV on the Web and tutorials for RDF, JSON-LD, and the creation of controlled vocabularies.
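The table-to-Turtle step above can be sketched with nothing but string handling. This is a hedged, stdlib-only illustration of the pattern described in the text; the prefix IRIs, the `enbp:` namespace, and the example row are assumptions for demonstration, not the project's actual serialization code:

```python
# Sketch: serialize one database row as Turtle statements, following
# the pattern described in the text (prefixes and the enbp: namespace
# are illustrative assumptions).
PREFIXES = """\
@prefix wikidatap: <http://www.wikidata.org/prop/direct/> .
@prefix wikidata:  <http://www.wikidata.org/entity/> .
@prefix schema:    <https://schema.org/> .
@prefix enbp:      <https://example.org/enbp/> .
"""

def row_to_turtle(row):
    """Turn one initiative (a dict of cleaned table fields) into Turtle."""
    subject = f"enbp:{row['id']}"
    lines = [
        f"{subject} wikidatap:P17 wikidata:{row['country_qid']} ;",  # country
        f"    schema:url \"{row['url']}\" .",
    ]
    return "\n".join(lines)

row = {"id": "FR_8", "country_qid": "Q142", "url": "https://energie-partagee.org/"}
print(PREFIXES + "\n" + row_to_turtle(row))
```

In practice, a dedicated RDF library handles escaping and prefix management more robustly, but the core idea stays the same: each table cell becomes the object of a triple whose predicate is a machine-actionable identifier.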
Our FAIRified Turtle file can be downloaded from dataverse.no. Similar files can be viewed or edited, ideally with the help of a triple store such as Apache Jena Fuseki.
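Loading the file into a triple store lets you query it properly with SPARQL. Purely to illustrate what such a lookup does conceptually, here is a naive, stdlib-only scan over simple one-statement-per-line Turtle (the sample triples are hypothetical; real queries should go through a store like Fuseki):

```python
# Naive illustration of a triple lookup; a real workflow would load
# the Turtle file into a triple store and use SPARQL instead.
SAMPLE = """\
enbp:FR_8 wikidatap:P17 wikidata:Q142 .
enbp:FR_8 schema:url "https://energie-partagee.org/" .
enbp:DE_3 wikidatap:P17 wikidata:Q183 .
"""

def objects_for(subject, predicate, turtle):
    """Collect objects of all 'subject predicate object .' lines."""
    out = []
    for line in turtle.splitlines():
        parts = line.rstrip(" .").split(None, 2)
        if len(parts) == 3 and parts[0] == subject and parts[1] == predicate:
            out.append(parts[2])
    return out

print(objects_for("enbp:FR_8", "wikidatap:P17", SAMPLE))
```

The equivalent SPARQL pattern in a triple store would simply be `?s wikidatap:P17 ?o`, filtered on the subject of interest.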
Summary of lessons learned
FAIRification requires substantial resources and high-level skills in data governance, which are not available without additional funding and training. The latter needs to be domain-specific.
Testing the workflow with a use case revealed that time-consuming steps have to be undertaken to translate the FAIR guidance principles into machine-actionable standards for (meta)data. Stakeholder views have to be integrated, and a number of semantic web technologies need to be applied which the typical researcher cannot be in command of. Existing support tools and software are by no means plug-and-play, and the knowledge necessary for using them poses too high a barrier. Staff with profound skills in data governance can help overcome these gaps, but the same staff will fall short in selecting the relevant semantic resources; here, only domain-specific experts can deliver. For that reason, domain-specific data stewards are needed, and education programs have to be set up.
The use case allows us to report actual resource use. While a strict separation between FAIRification tasks and data collection tasks was not always possible, the effort invested into the FAIRification of the ENBP inventory amounts to about 4 person-months (PM). This number is rather independent of the size of the data, since most of the work was automated once proper routines were established. Instead, the effort of implementing the FAIR guidance principles is overwhelmingly proportional to the complexity and heterogeneity of the database. Bringing together knowledge from various sub-domains requires extensive work to identify appropriate semantic resources.
High-quality interoperable data require designing for reuse from the start. User-centered design calls for systematically integrating the perspective of the envisaged users of the data.
Metadata concepts need to be tailored to the intended user profiles, allowing them to interpret the data. This calls for a systematic involvement of data stakeholders to discuss and shape the data model with its semantic representation. For example, a decision-maker may be more interested in aggregated data, whereas a technical specialist may need deeper metadata information.
There is no single ontology covering the concepts needed to describe heterogeneous data. The various semantic concepts required to FAIRify the data are not easily found and are typically distributed across many sources.
Machine-actionability at the domain level demands the implementation of domain-specific ontologies and standards, but these are difficult to find or do not yet exist. This is especially the case when bringing together heterogeneous knowledge from several sub-domains, as is typical for research on the low carbon energy transition. Only a few platforms allow searching for concepts defined in ontologies, and they cover only a fraction of the relevant ontologies. Screening the academic literature for new developments, however, is time-consuming and typically not of core interest to specialist domain experts. For an outsider to the field, it is challenging to assess, e.g., the required granularity of concepts, the necessary precision of definitions, and essential relations. Many ontologies are well prepared for use in a narrow research context, yet they lack sufficient documentation for outsiders to reuse their concepts. In particular, mapping between different domain-specific ontologies is the next indispensable step. Otherwise, FAIRification will remain at a general level, unable to realize the prospects of interoperable data at the domain level.
Licensing of data is still in its infancy. Knowledge and practices around licensing are often incomplete and do not match requirements. Most problematic is the difficulty of accessing licensing information.
On the one hand, it is relatively easy to assign a license to a new data product. On the other hand, identifying under which license data products can be reused, in particular if they originate from heterogeneous sources, is very demanding. Supporting tools are lacking. The use case was a compilation of data collected from various sources (including websites, newspaper articles, statistical offices, etc.). Checking the usage rights and license information of these data sources was a tedious task.