Contents for analysing metadata

From EERAdata Wiki
Revision as of 09:59, 17 November 2022 by Valerias (talk | contribs) (The quality of metadata)

While there is the ubiquitous statement that metadata is data about data, a more helpful definition is that metadata is information added to the data to increase its functionality. Information may be added during creation but also throughout the whole lifecycle of the dataset. In the research context, metadata is often used to capture the context of data. This means that all information necessary for understanding the data is described by metadata. When data is shared, metadata serves the purpose of providing this extra information so that someone else is able to work with the data given their own background knowledge. But even if the data is not shared with anyone else, metadata documents all relevant information so that at a later point in time the data can be used as easily as on the day of its creation. Metadata therefore has an important function in the scientific workflow and is closely tied to ensuring proper scientific practices such as traceability, reproducibility, and transparency.

Metadata and stakeholder identification

As metadata serves the purpose of communicating information about data between its users, the type and extent of this information naturally has to take into account the background of all future stakeholders of the data. The step of "analysing metadata" therefore starts with identifying the future users of the data and envisioning what they may want to do with it. In the context of low-carbon research data, these users include, for example:

  • Researchers who want to use the data as a starting point for their own research activities. Energy domain experts create data from scratch and re-use existing data to curate, aggregate, analyze and publish data. Interdisciplinary scientists inform themselves on (other) expert knowledge, re-use data, aggregate and analyze data, and publish data.
  • Science funders of energy R&D activities who inform themselves of the results of funded research and projects. They need the data to monitor and adjust, channel results to decision-makers, and plan funding policies and principles to better direct funding.
  • Planners and decision-makers who inform themselves on expert knowledge. They may re-use data, analyze some data, and publish aggregated data and decisions.
  • Energy and other industries that inform themselves on expert knowledge and re-use and analyze some data on all aggregation levels.
  • General public who informs itself to adjust behavior and practices (e.g., energy consumption behavior, voting in elections, engagement as prosumers, activists or citizen-scientists).
  • Data scientists who group codes, test, and validate software with existing data.
  • Publishers, librarians, and data curators who publish, store, and archive research data. They may re-use data to link them to metrics such as access statistics and to cross-reference.

Categorizing metadata to identify relevant metadata

It is hard to come up with a methodology that ensures an exhaustive recording of all required metadata. In fact, it might be impossible to capture all relevant information and envision all possible future uses for the data. Still, thinking in terms of categories and purposes of metadata helps to identify the information that needs to be collected and recorded. Following Haynes (2018), the purposes of metadata include:

  • Resource description: identify the data e.g. by a DOI or similar identifier and specify properties such as the title, creator, contributor etc.
  • Resource discovery: metadata used to find the information contained in the data. This might be keywords describing the data or format information about the data.
  • Administration and management of resources: In many cases, data are not simply created and stay unchanged but are modified, aggregated or selected during the research. They might be downloaded to a secondary place, incorporated into new resources and so on. Metadata allows us to track these changes along the scientific workflow.
  • Record of intellectual property rights: Keeping track of license information is an important issue in scientific as well as industrial use of data. Metadata is used to document ownership rights.
  • Documenting software and hardware environments: Often, data are created as a result of specific conditions. A specific instrument might be used to record the data, or a specific software involved in generating it. Typically, it is crucial to keep this information to understand what the data is about and to guarantee the reproducibility of the scientific findings.
  • Preservation management of digital resources
  • Providing information on context and authenticity

Another helpful categorization is the one provided by Wikipedia, which classifies metadata as

  • Descriptive metadata that describes the data by title, author, creation date, etc. but also provides some content information.
  • Structural metadata explains how the data is organized, e.g. whether several tables of data are present or whether pages are bound together to form a book.
  • Administrative metadata may be added to the data during its whole lifetime to track what is happening to the data, e.g. several sets of data compiled into a new dataset, and so on.
  • Statistical metadata may be added to the data for e.g. keeping track of access statistics.
  • Legal metadata may record information on licenses and copyright and may state access restrictions to the data.
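
The five categories above can serve as a checklist when collecting metadata. The following Python sketch shows one possible way to do this: it groups the fields of a flat record by category and reports the categories that have not been covered at all. The field names (such as table_count or derived_from) and the example values are hypothetical and chosen only to mirror the descriptions in the list; they are not part of any standard.

```python
from enum import Enum


class MetadataCategory(Enum):
    DESCRIPTIVE = "descriptive"
    STRUCTURAL = "structural"
    ADMINISTRATIVE = "administrative"
    STATISTICAL = "statistical"
    LEGAL = "legal"


# Hypothetical field names, assigned to categories following the list above.
FIELD_CATEGORY = {
    "title": MetadataCategory.DESCRIPTIVE,
    "author": MetadataCategory.DESCRIPTIVE,
    "creation_date": MetadataCategory.DESCRIPTIVE,
    "table_count": MetadataCategory.STRUCTURAL,
    "derived_from": MetadataCategory.ADMINISTRATIVE,
    "access_count": MetadataCategory.STATISTICAL,
    "license": MetadataCategory.LEGAL,
    "access_restrictions": MetadataCategory.LEGAL,
}


def group_by_category(record):
    """Group a flat record by category; unknown fields are filed under None."""
    grouped = {}
    for name, value in record.items():
        category = FIELD_CATEGORY.get(name)
        grouped.setdefault(category, {})[name] = value
    return grouped


def uncovered_categories(record):
    """Return the categories for which the record has no field at all."""
    present = {FIELD_CATEGORY.get(name) for name in record}
    return [c for c in MetadataCategory if c not in present]


# Example values are invented for illustration only.
record = {"title": "Example dataset", "author": "Jane Doe", "license": "CC-BY-4.0"}
grouped = group_by_category(record)
missing = uncovered_categories(record)
```

For this record, missing lists the structural, administrative, and statistical categories, which hints at information that may still need to be recorded.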

Using metadata standards to compile relevant metadata

Metadata standards are a further aid in collecting relevant information. There is a whole zoo of suggested standards. Some are created with a broad focus, catering to many resources and domains. Other standards cater to specific forms of data, e.g. audio or video data, or to a specific domain such as engineering or the humanities. An example of a rather general metadata standard is the Dublin Core set of metadata terms. In its basic form, Dublin Core offers the following terms to describe a resource: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. A list of different metadata standards is available here.
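
As a minimal sketch of how the fifteen basic Dublin Core terms could be used in practice, the following Python snippet stores a record as a plain dictionary and rejects keys that are not among those terms. The helper make_record and all example values are hypothetical; the snippet does not implement any official Dublin Core serialization.

```python
# The fifteen basic Dublin Core terms, as listed in the text above.
DUBLIN_CORE_TERMS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def make_record(**fields):
    """Return a record that contains only basic Dublin Core terms.

    Raises ValueError if a key is not one of the fifteen terms.
    """
    unknown = sorted(set(fields) - set(DUBLIN_CORE_TERMS))
    if unknown:
        raise ValueError(f"not Dublin Core terms: {unknown}")
    return dict(fields)


# Hypothetical values, for illustration only.
record = make_record(
    title="Example dataset",
    creator="Jane Doe",
    date="2022-11-17",
    format="text/csv",
    language="en",
)
```

Restricting the keys to a fixed vocabulary is the main benefit of a standard here: two people describing different datasets end up with records that can be compared field by field.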

The quality of metadata

Part of analyzing metadata is also to assess its quality. Various quality criteria have been suggested. An example is the framework developed by Bruce and Hillman, which proposes completeness, accuracy, conformance to expectations, logical consistency and coherence, accessibility, timeliness, and provenance as quality criteria. A shorter list has been proposed by Margaritopoulos et al. Still, operationalizing these terms to assess quality may result in a substantial number of sub-categories.
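
To illustrate what operationalizing a criterion can look like, the Python sketch below measures completeness as the fraction of required fields that are present and non-empty. This is only one simple reading of "completeness" and is an assumption made for this example, not the definition used by Bruce and Hillman. The list of required fields and the record values are hypothetical.

```python
def is_filled(value):
    """A value counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def missing_fields(record, required):
    """Return the required fields that are absent or empty, in order."""
    return [name for name in required if not is_filled(record.get(name))]


def completeness(record, required):
    """Fraction of required fields that are present and non-empty."""
    if not required:
        raise ValueError("required must contain at least one field")
    return 1 - len(missing_fields(record, required)) / len(required)


# Hypothetical record and requirement list, for illustration only.
required = ["title", "creator", "date", "rights"]
record = {"title": "Example dataset", "creator": "", "date": "2022-11-17"}

score = completeness(record, required)   # 0.5
gaps = missing_fields(record, required)  # ["creator", "rights"]
```

Even this single criterion already requires decisions, such as which fields are required and whether an empty string counts as a value, which shows how quickly sub-categories accumulate.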

Example

As a simple example, consider a csv file created in the context of collecting information on power plants owned by citizen-led initiatives.
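
A minimal Python sketch of how metadata for such a csv file might be assembled is given below. The csv content, its column names, and the descriptive values are all invented for illustration and do not reproduce the actual file. Structural metadata (column names, row count) is derived from the file itself, while descriptive and legal metadata have to be supplied by the creator. The result is serialized as JSON, which could be stored next to the csv file.

```python
import csv
import io
import json

# Hypothetical csv content; column names and values are invented.
CSV_TEXT = """name,country,capacity_kw
Plant A,DE,100
Plant B,NO,250
"""


def describe_csv(text, descriptive, legal):
    """Combine supplied metadata with structural metadata read from the csv."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {
        "descriptive": dict(descriptive),
        "structural": {
            "format": "text/csv",
            "columns": header,
            "row_count": len(body),
        },
        "legal": dict(legal),
    }


metadata = describe_csv(
    CSV_TEXT,
    descriptive={"title": "Example list of power plants", "creator": "Jane Doe"},
    legal={"license": "CC-BY-4.0"},
)
sidecar = json.dumps(metadata, indent=2)
```

Deriving what can be derived automatically keeps the manual effort focused on the information that only the creator knows, such as context and licensing.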

Further reading

A good starting point for understanding the metadata creation process specific to the low-carbon energy domain is this publication:

A. Wierling et al. Advancing FAIR metadata standards for low carbon energy research. Energies 14, 6692 (2021) https://doi.org/10.3390/en14206692

Other helpful resources are

D. Haynes. Metadata for Information Management and Retrieval. Facet Publishing, London, 2018.

Further interesting articles are