Analyse metadata

From EERAdata Wiki

The step "Analyse metadata" comprises a thorough analysis of the metadata that already comes with a dataset and of the additional metadata needed to make the data intelligible for future reuse. Typically, metadata are assigned to a set of data at the time of creation. Examples of such metadata are information on the data creator, creation and modification dates, but also header information in data tables, which describes, e.g., the content of a column. Basically, all annotations made to datasets are also metadata. Note that in the context of a data life cycle, new metadata is constantly added. Examples of metadata added at a later stage of the data lifetime are citation information, information about errors detected, or keywords that were selected when depositing the dataset in a repository.

Whether sufficient metadata is provided for a specific dataset has to be answered in the specific context. First, the FAIRification objectives should detail how much effort is to be spent on improving the status quo. Second, metadata help a future data user understand the data, so what counts as sufficient depends on the context and background of future users. Thus, a stakeholder analysis is needed to screen future user groups and identify their needs for metadata.

What are metadata?

While there is the ubiquitous statement that metadata is data about data, a more helpful definition is that metadata is information added to the data to increase its functionality. Adding information may occur during creation but also during the whole lifecycle of the dataset. In the research context, metadata is often used to capture the context of data. This means that all information necessary for understanding the data is described by metadata. When data is shared, metadata provides this extra information so that someone else is able to work with the data given his or her own background knowledge. But even if the data is not shared with anyone else, metadata documents all relevant information so that at a later point in time the data can be used as easily as on the day of its creation. Metadata thus has an important function in the scientific workflow and is closely tied to proper scientific practices such as traceability, reproducibility, and transparency.

The definition of the term metadata - a collection of proposals

There is no agreed-upon definition of metadata. Here are a few established definitions:

  • "Structured data about an object that supports functions associated with the designated object" (Greenberg, 2003)[1]
  • "Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource." (NISO, 2004)[2]
  • "Digital object as an instance of an abstract data type that has two components, data and key-metadata. The key-metadata includes a handle, i.e. an identifier globally unique to the digital object." (Kahn and Wilensky, 1995)[3]
  • "Metadata can be defined as a structured description of the essential attributes of an information object." (Hill, 2016) [4]
  • "Metadata are structured, encoded data that describe the characteristics of information-bearing entities to aid in the identification, discovery, assessment and management of the described entities." (Smiraglia, 2005) [5]
  • "Perhaps a more useful, ‘big picture’ way of thinking about metadata is as the sum total of what one can say about any information object at any level of aggregation. In this context, an information object is anything that can be addressed and manipulated as a discrete entity by a human being or an information system" (Gilliland, 2016) [6]
  • "Metadata is a map. Metadata is a means by which the complexity of an object is represented in a simpler form. Metadata is a statement about a potentially informative object." (Pomerantz, 2015) [7]
  • "Metadata in this context is the information that accompanies the various stages and outputs of research." (Gregg, 2020) [8]
  • "Metadata is a love note to the future." (Jason Scott, 2011) [9]

Categorizing metadata to identify relevant metadata

It is hard to come up with a methodology that ensures an exhaustive recording of all required metadata. In fact, it might be impossible to capture all relevant information and envision all possible future uses for the data. Still, thinking in terms of categories and purposes of metadata helps to identify the information that needs to be collected and recorded. Following Haynes (2018), the purposes of metadata include:

  • Resource description: identify the data e.g. by a DOI or similar identifier and specify properties such as the title, creator, contributor etc.
  • Resource discovery: metadata to find the information contained in the data; this might be keywords describing the data or format information about the data.
  • Administration and management of resources: In many cases, data are not simply created and stay unchanged but are modified, aggregated, or selected during the research. They might be downloaded to a secondary place, incorporated into new resources and so on. Metadata allows us to track these changes along the scientific workflow.
  • Record of intellectual property rights: Keeping track of license information is an essential issue in scientific as well as industrial use of data. Metadata is used to document ownership rights.
  • Documenting software and hardware environments: Often, data are created as a result of specific conditions. A specific instrument might be used to record the data or a specific software involved in generating it. Typically, it is crucial to keep this information to understand what the data is about and to guarantee the reproducibility of the scientific findings.
  • Preservation management of digital resources
  • Providing information on context and authenticity

Another helpful categorization is the one provided by Wikipedia, which classifies metadata as:

  • Descriptive metadata that describes the data by title, author, creation date, etc. but also provides some content information.
  • Structural metadata explains how the data is organized, e.g. whether several tables of data are present or how pages are bound together to form a book.
  • Administrative metadata may be added to the data during its whole lifetime to track what is happening to the data, e.g. several sets of data compiled into a new dataset, and so on.
  • Statistical metadata may be added to the data for e.g. keeping track of access statistics.
  • Legal metadata may record information on licenses and copyright and may state access restrictions to the data.
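
The five categories above can be used as a simple checklist. The following is a minimal sketch, assuming a metadata record is held as a plain Python dictionary; the field names (title, author, download_count, ...) and their assignment to categories are illustrative assumptions, not part of any standard.

```python
# Illustrative only: field names and category assignments are assumptions.
CATEGORIES = {
    "descriptive": {"title", "author", "creation_date", "keywords"},
    "structural": {"tables", "file_format"},
    "administrative": {"modified", "derived_from"},
    "statistical": {"download_count"},
    "legal": {"license", "access_restrictions"},
}


def group_by_category(record):
    """Sort the fields of a metadata record into the five categories."""
    grouped = {name: {} for name in CATEGORIES}
    grouped["uncategorized"] = {}
    for key, value in record.items():
        for name, fields in CATEGORIES.items():
            if key in fields:
                grouped[name][key] = value
                break
        else:
            # No category claimed this field.
            grouped["uncategorized"][key] = value
    return grouped


# Hypothetical record for a small dataset.
record = {
    "title": "Citizen-led power plants",
    "author": "Example Author",
    "license": "CC-BY-4.0",
    "download_count": 12,
}

grouped = group_by_category(record)
# Categories for which the record carries no information at all.
empty = [name for name in CATEGORIES if not grouped[name]]
print(empty)
```

Categories that come back empty point to information that may still need to be collected; whether it is actually required depends on the intended users of the data.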

Metadata and stakeholder identification

As metadata serves the purpose of communicating information about data between its users, the type and extent of this information naturally have to consider the background of all future stakeholders of this data. The step of "analyzing metadata" therefore starts with identifying the future users of the data and envisioning what they may want to do with the data. In the context of low-carbon research data, this e.g. includes:

  • Researchers who want to use the data as a starting point for their own research activities. Energy domain experts create data from scratch and re-use existing data to curate, aggregate, analyze, and publish data. Interdisciplinary scientists inform themselves on (other) expert knowledge, re-use data, aggregate and analyze data, and publish data.
  • Science funders of energy R&D activities who inform themselves of the results of funded research and projects. They need the data to monitor and adjust, channel results to decision-makers, and plan funding policies and principles to better direct funding.
  • Planners and decision-makers who inform themselves on expert knowledge. They may re-use data, analyze some data, and publish aggregated data and decisions.
  • Energy and other industries that inform themselves on expert knowledge and re-use and analyze some data on all aggregation levels.
  • General public who informs itself to adjust behavior and practices (e.g., energy consumption behavior, voting in elections, engagement as prosumers, activists or citizen-scientists).
  • Data scientists who group codes, test, and validate software with existing data.
  • Publishers, librarians, and data curators publish, store, and archive research data. They may re-use data to link them to metrics such as access statistics and to cross-reference.

Using metadata standards to compile relevant metadata

Another aid in collecting relevant information is the use of metadata standards. There is a whole zoo of suggested standards. Some are created with a broad focus, catering to many resources and domains. Other standards cater to specific forms of data, e.g. audio or video data, or to domains such as engineering, humanities etc. An example of a rather general metadata standard is the Dublin Core set of metadata terms. In its basic form, Dublin Core offers the following terms to describe a resource: contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type. A list of different metadata standards is available here.
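
The fifteen basic Dublin Core terms listed above can serve as a template when compiling metadata. The following is a minimal sketch, assuming the record is a plain Python dictionary keyed by term name; the helper functions and the example values are hypothetical and only illustrate how gaps and non-standard keys could be spotted.

```python
# The fifteen basic Dublin Core terms named in the text.
DUBLIN_CORE_TERMS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def missing_terms(record):
    """Return the Dublin Core terms that are absent or empty in the record."""
    return [term for term in DUBLIN_CORE_TERMS if not record.get(term)]


def unknown_terms(record):
    """Return keys of the record that are not basic Dublin Core terms."""
    return sorted(set(record) - set(DUBLIN_CORE_TERMS))


# Hypothetical record; all values are made up for illustration.
record = {
    "title": "Citizen-led power plants",
    "creator": "Example Author",
    "date": "2021-01-01",
    "format": "text/csv",
    "licence": "CC-BY-4.0",  # not a Dublin Core term; "rights" would be
}

print(missing_terms(record))
print(unknown_terms(record))
```

Dublin Core does not require every term to be filled in, so the list of missing terms is a prompt for review rather than a list of errors.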

The quality of metadata

Part of analyzing metadata is also to assess its quality. Various quality criteria have been suggested. An example is the framework developed by Bruce and Hillman, which proposes completeness, accuracy, conformance to expectations, logical consistency and coherence, accessibility, timeliness, and provenance as quality criteria. A shorter list has been proposed by Margaritopoulos et al. Still, operationalizing these terms to assess the quality may result in several sub-categories.
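
To illustrate what operationalizing such criteria might look like, here is a minimal sketch covering two of them, completeness and logical consistency. It assumes a record stored as a Python dictionary with ISO 8601 date strings; the list of required fields and the field names are assumptions made for this example and are not taken from the cited frameworks.

```python
from datetime import date


def completeness(record, required):
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in required if record.get(field) not in (None, ""))
    return filled / len(required)


def dates_consistent(record):
    """Logical consistency check: a record cannot be modified before it is created."""
    created = date.fromisoformat(record["created"])
    modified = date.fromisoformat(record["modified"])
    return created <= modified


# Hypothetical record and requirement list.
record = {
    "title": "Citizen-led power plants",
    "creator": "",
    "created": "2020-05-01",
    "modified": "2021-03-15",
    "license": "CC-BY-4.0",
}
required = ("title", "creator", "created", "license")

score = completeness(record, required)
consistent = dates_consistent(record)
print(score, consistent)
```

Other criteria, such as accuracy or conformance to expectations, usually cannot be reduced to such simple checks and need domain knowledge or user feedback.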

Example

As a simple example, consider a csv file created in the context of collecting information on power plants owned by citizen-led initiatives. Considering current research about the role of these initiatives in the energy transition, information one may want to record comprises: the type of power plant, capacity, date of commissioning, date of decommissioning, location, etc. While the values recorded for these attributes are the data, information on what these values mean qualifies as metadata. Typically, header information in a table states that all entries in a certain column are power plant types or capacity values, together with some information on the units used etc. Exactly how detailed the information on, e.g., capacity has to be depends on the context. It may be necessary to refer to nameplate capacity to be specific about the values taken up in the dataset.
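
The distinction between data and metadata in this example can be sketched in a few lines of Python. The column names, the sample rows, and the unit kW are invented for illustration; the column descriptions are kept in a separate dictionary here, which is only one of several possible ways to attach metadata to a csv file.

```python
import csv
import io

# Made-up sample data; the first line is the header.
CSV_TEXT = """plant_type,capacity,commissioned,decommissioned,location
solar,250,2012-06-01,,Freiburg
wind,2000,2015-09-15,,Aarhus
"""

# Metadata: what each column means (all entries are assumptions).
COLUMN_METADATA = {
    "plant_type": {"description": "Type of power plant"},
    "capacity": {"description": "Nameplate capacity", "unit": "kW"},
    "commissioned": {"description": "Date of commissioning", "format": "YYYY-MM-DD"},
    "decommissioned": {"description": "Date of decommissioning, empty if still operating",
                       "format": "YYYY-MM-DD"},
    "location": {"description": "Municipality in which the plant is located"},
}

reader = csv.DictReader(io.StringIO(CSV_TEXT))
header = reader.fieldnames   # column names taken from the header line
rows = list(reader)          # the data proper

# Columns that appear in the file but have no description yet.
undocumented = [column for column in header if column not in COLUMN_METADATA]

for column in header:
    info = COLUMN_METADATA.get(column, {})
    print(column, "-", info.get("description", "no description"), info.get("unit", ""))
```

Without the entry for capacity, a future user could not tell whether 250 refers to kW or MW, or to nameplate rather than average capacity.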

Further reading

A good starting point for understanding the metadata creation process specific to the low-carbon energy domain is this publication:

A. Wierling et al. Advancing FAIR metadata standards for low carbon energy research. Energies 14, 6692 (2021) https://doi.org/10.3390/en14206692

Other helpful resources are

D. Haynes. Metadata for Information Management and Retrieval. Facet Publishing, London, 2018.

References

  1. Greenberg, J. (2003), “Metadata and the World-Wide-Web”, Encyclopedia of Library and Information Science, pp. 1876–88
  2. NISO (2004) https://www.niso.org/publications/understanding-metadata-2017
  3. Kahn and Wilensky (1995) http://www.cnri.reston.va.us/k-w.html
  4. Tony Hill In M. Baca, Introduction to Metadata, Getty Publishing, 2016
  5. Richard P. Smiraglia, Metadata: A Cataloger's Primer, Psychology Press, 2005
  6. A.J. Gilliland In M. Baca, Introduction to Metadata, Getty Publishing, 2016
  7. Jeffrey Pomerantz, Metadata, MIT Press, 2015
  8. Gregg, W. et al., A literature review of scholarly communications metadata. Research Ideas and Outcomes 5.
  9. http://ascii.textfiles.com/archives/3181