UC3 Material solutions for low carbon energy
General description of use case
Framing the use case
This section describes the FAIRification work undertaken in the use case "Material solutions for low carbon energy" of the EERAdata project.
Materials science involves society at all levels: from international collaborations, to advancing science and technologies, to public awareness of the major changes under way. Sharing information on the state of the art is necessary to enable the adoption of new technologies and materials and to steer the community toward next-generation materials that are more energy efficient and cheaper. Data, which must be immediately shareable and reusable, play a key role in this process. The main challenges are the low-to-medium openness and reusability of data and the high barriers to finding and accessing them. An open data infrastructure covering as many research fields as possible could greatly increase opportunities to develop new research programs, defragment the energy materials community, avoid duplication of research efforts, accelerate materials discovery, and improve understanding of how energy systems could better benefit from existing and new materials.
The European Union has identified materials as a priority and a key technology to accelerate the transition to a low-carbon and sustainable economy. Materials research, however, is strongly multidisciplinary, and both technologies and cooperation should be leveraged to accelerate application-oriented research activities. This is not always the case, because materials are used in extremely diverse technological fields, each with its own terminology, experimental set-ups, research procedures and, consequently, its own standards. Many databases are already available, but they are not organized for sharing and do not follow common formats.
Summary of results
The information gathered so far has provided practical guidance on how to proceed when creating a database in the field of materials science for energy. The opportunity is provided by an Italian project named IEMAP (Italian Energy Materials Acceleration Platform), inspired by the international Mission Innovation initiative. Italy is a member country of Mission Innovation, a global multilateral cooperation initiative launched in Paris in 2015. Its primary purpose is to accelerate the innovation of clean technologies, in both the public and private sectors, through the commitment of member countries to double the public share of investments dedicated to research, development and innovation in decarbonization technologies, in order to make clean energy accessible to consumers and to create green jobs and business opportunities. Identifying a new material is a complex, articulated and expensive process: changing the composition of a material requires a long research and development procedure that includes simulation, synthesis, and experimental and numerical characterization with numerous tests. Ultimately, identifying and selecting a suitable material for a given application is often very expensive and can take from several years to a few decades. To accelerate this process, Mission Innovation, through IC#6 "Clean Energy Materials", envisages the creation of a MAP (Materials Acceleration Platform) that, by combining Big Data and Artificial Intelligence (AI) technologies, can automatically accelerate the analysis of computational and experimental data in order to identify the most suitable materials for a given application.
The IEMAP platform is composed of four fundamental components: a transversal, higher-level computational infrastructure and three experimental infrastructures dedicated to the thematic areas of batteries, electrolysers and photovoltaics. The project, coordinated by ENEA, will be realized with other Italian research institutes as co-beneficiaries: CNR, IIT and RSE. ENEA and the co-beneficiaries will make laboratories and experimental and computational infrastructures distributed across Italy available to the project activities. The cross-cutting computational infrastructure consists of a database that is visible to all platform services and a workflow that acts as a "director" of the different services. The workflow is driven by Artificial Intelligence and Big Data technologies to "learn" from the data and optimize the design of new materials. The engine of this infrastructure will be the supercomputer CRESCO, installed at the C.R. ENEA Portici, on which HPC (High Performance Computing) technologies will be implemented both for data management and for the development and implementation of a library of numerical codes for molecular modeling. The application cases to be developed are framed within three research areas considered central to the energy transition: electrochemical storage (batteries), electrolyzers and photovoltaics. The IEMAP platform will be structured to implement a cyclic process with the following operational phases:
● material modeling activities;
● synthesis and experimental characterization;
● environmental impact assessment.
If the material does not meet the requirements of the intended application, the entire process is repeated, requesting new indications from the modeling phase in order to better direct the experimental activities. At the heart of the entire infrastructure for accelerated materials discovery is a database.
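The cyclic modeling → experiment → assessment process described above can be sketched as a simple loop. This is a minimal illustrative sketch, not the actual IEMAP implementation: all function names, the placeholder property values, and the acceptance criterion are invented for illustration.

```python
# Hypothetical sketch of the cyclic discovery process described above.
# All functions, values and thresholds are illustrative assumptions.

def model_material(candidate):
    # Placeholder for the molecular-modeling phase proposing a structure.
    return {"structure": candidate, "predicted_gap": 1.1}

def synthesize_and_characterize(proposal):
    # Placeholder for experimental synthesis and characterization.
    return {**proposal, "measured_gap": proposal["predicted_gap"] * 0.95}

def assess_environmental_impact(result):
    # Placeholder for an environmental impact score (lower is better).
    return {**result, "impact_score": 0.3}

def meets_requirements(result, target_gap=1.0, max_impact=0.5):
    # Illustrative acceptance criterion for the intended application.
    return result["measured_gap"] >= target_gap and result["impact_score"] <= max_impact

def discovery_loop(candidates):
    """Run the modeling -> experiment -> assessment cycle until a
    candidate meets the application requirements."""
    for candidate in candidates:
        result = assess_environmental_impact(
            synthesize_and_characterize(model_material(candidate)))
        if meets_requirements(result):
            return result  # suitable material found
    return None  # otherwise: feed results back to modeling and repeat
```

The key design point is the feedback edge: a rejected candidate restarts the cycle with new indications for the modeling phase.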
The database was designed by sharing needs with the reference community and providing general and specific information on data management, in particular FAIR-type management. To this end, a survey was prepared to learn about the data management of each laboratory involved in the project. The purpose of the survey is to collect technical information about the working groups, and thus their respective partners, such as the project they belong to, the type of process, the data produced and their size, the tools or programs used, etc., so as to link the different groups, tools and codes, and the standards and protocols used during processing. Google Forms was used for its creation, as it provides a user-friendly interface and is easy to share. The survey consists of four sections that require:
● the identification of the working group;
● information about the computational process (if any);
● information about the experimental process (if any);
● the upload of example files related to the activity, and suggestions.
The survey defined the data format, a set of key metadata, and each laboratory's propensity for data sharing. The second step is to give all participants access to the ICT infrastructure shared in the project, providing them with basic information for programming and database management. The third step is to define with the laboratories a workflow that allows data to be collected and stored locally; locally, the data must also be cleaned and selected according to their quality level and experimental reproducibility. The fourth step is to implement a workflow that uploads the data to the computational infrastructure to which the project database is connected. An authentication and registration system is required to upload data; reading the database, by contrast, is always possible without registration.
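The local clean-and-upload steps above can be sketched in a few lines. This is a hedged illustration only: the field names (`quality`, `reproducible`), the quality threshold, and the upload payload format are assumptions, not the actual IEMAP schema.

```python
# Minimal sketch of the local cleaning and authenticated upload steps.
# Field names, thresholds and payload layout are illustrative assumptions.

def clean_records(records, min_quality=0.8):
    """Keep only records that pass the local quality and
    reproducibility checks before upload."""
    return [r for r in records
            if r.get("quality", 0.0) >= min_quality and r.get("reproducible", False)]

def to_upload_payload(record, group_id, token):
    """Wrap a cleaned record with the metadata the central database
    needs: the working group and an authentication token (registration
    is required for uploads, while read access stays open)."""
    if not token:
        raise PermissionError("registration/authentication required to upload")
    return {"group": group_id, "auth": token, "data": record}

records = [
    {"id": 1, "quality": 0.90, "reproducible": True},
    {"id": 2, "quality": 0.50, "reproducible": True},   # rejected: low quality
    {"id": 3, "quality": 0.95, "reproducible": False},  # rejected: not reproducible
]
cleaned = clean_records(records)
payloads = [to_upload_payload(r, group_id="lab-A", token="example-token")
            for r in cleaned]
```

The point of the sketch is the ordering: quality filtering happens locally, before any data reach the shared infrastructure.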
In particular, the following describes what happens when a computational laboratory uses the infrastructure. The current procedure applies the ab-initio computational approach based on Density Functional Theory (DFT) to obtain material properties using physics-based codes such as Quantum ESPRESSO. A DFT calculation generally requires long execution times (hours) and hardware resources such as High-Performance Computing (HPC). The combinatorial space of materials is far too large to be evaluated directly via DFT calculations. In recent years, there has been rapid progress in deep-learning neural networks for predicting material properties, and such networks have produced accurate and rapid predictions. The molecule or crystalline material is represented as a graph in which the atoms constitute the nodes and their atomistic bonds correspond to the edges. The use case is to calculate material properties, such as formation energy and redox potential, on several structures that are variations of a starting structure, in order to obtain the optimal material. The starting structure, NaMnO2, was determined based on preliminary studies by materials scientists, and calculations are performed in a supercell of 48 atoms. The experiment is based on doping the starting structure by replacing manganese with titanium and nickel, and on calculating the formation energy and redox potential for each resulting structure. The results are displayed on ternary plots to highlight the optimal areas. The starting data come from the Materials Project database and from previous computational experiments. All these data are collected and rewritten in a uniform, homogeneous format compliant with the ML libraries and the new database.
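The graph representation mentioned above (atoms as nodes, bonds as edges) can be illustrated with a toy example. A distance cutoff is one common, assumed way to decide which atom pairs count as bonded; production GNN pipelines would use libraries such as pymatgen or torch-geometric, and the coordinates below are invented, not a real NaMnO2 geometry.

```python
# Toy sketch of the atoms-as-nodes, bonds-as-edges graph representation.
# Cutoff-based bonding and the coordinates are illustrative assumptions.
import math

def build_graph(atoms, positions, cutoff=2.5):
    """Return the node list and edge list for a structure.
    atoms: element symbols; positions: (x, y, z) tuples in angstrom."""
    nodes = list(atoms)
    edges = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            dist = math.dist(positions[i], positions[j])
            if dist <= cutoff:
                edges.append((i, j, dist))  # each edge carries the bond length
    return nodes, edges

# Invented fragment loosely inspired by a Na-Mn-O motif.
nodes, edges = build_graph(
    ["Na", "Mn", "O", "O"],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.9, 0.0), (2.0, -1.9, 0.0)],
)
```

A neural network then learns material properties from these node and edge features instead of re-running a full DFT calculation for every doped variant.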
Summary of lessons learned
It is of interest to detail why the Materials for Energy community often resists adopting data management innovations. The common approach in materials science is to publish results as focused research studies, reporting only the few data directly relevant to the respective topic. Therefore, even when results are produced in very large experiments generating a lot of data, very few of those data are shared with the community. Most of the data, particularly when not published in articles, are kept private or even thrown away. In the past few years, the materials science and engineering community has started to share data, albeit in a limited form. It is well known that efficient Big Data models cannot be built from small datasets; therefore, even unpublished but verified data can gain value and should be shared with the community. In particular, it has been noted that even when the amount of data is large, the independent information can be small when the data are correlated. Information of interest is hidden when data are highly correlated, or it may even be irrelevant or misleading for the application of interest; if these aspects are not properly considered, the statistical analysis will be of little use. Data are thus a key raw material, yet results are often stored on PCs, workstations or local computers, and most of these data are never used and often thrown away, even though their information content could be significant. Open data access, on the contrary, means that data can be used by anyone, not just the experts who develop or run advanced computer codes. If the data were openly available and well described, many more people would work with them, and unpredictable new directions of research might open up when the people who generate the data also make them available.
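The point that a large but correlated dataset carries little independent information can be made concrete with a tiny example: two property columns that are linearly related have a Pearson correlation near 1, so the second column adds almost nothing new. The data below are invented for illustration.

```python
# Illustration: strongly correlated columns carry little independent
# information. Pure-Python Pearson correlation; data are invented.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

band_gap = [1.1, 1.4, 2.0, 2.3, 3.1]     # hypothetical property column
optical_gap = [1.2, 1.5, 2.1, 2.4, 3.2]  # same values shifted by 0.1
r = pearson(band_gap, optical_gap)       # close to 1.0: the second column
                                         # contributes little new information
```

Checking such redundancy before fitting a model is exactly the kind of local data selection step the text argues for.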
Openness of research data and their retention for at least 10 years are now required by many research organizations. From a practical point of view, it is useful to avoid duplication of work and thus save human, computational and energy resources. Because individual researchers create data on different platforms, from workstations to computing clusters to high-performance computing centers (HPCs), it is often impossible to find the data of a student or postdoc. In addition, problems can be related to automatic data deletion in HPC centers, lack of permissions on local machines, data protection, and so on. Clearly, making data traceable requires an appropriate data infrastructure, including documentation, metadata, search engines and hardware.
Accessibility in materials science has several aspects: 1) appropriate hardware that allows rapid access to the data, and 2) application programming interfaces (APIs) for programmatic retrieval. To make data fully accessible, a formal description of the data is required, i.e., their metadata, which must also capture the interrelationships among the metadata.
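A formal metadata record of the kind described above might look as follows. This is a hedged sketch: every field name, identifier and value is an illustrative assumption, not an actual standard or the IEMAP schema; the `derived_from`/`used_by` fields stand in for the interrelationships among records.

```python
# Illustrative metadata record for a calculation, with links to related
# records. All field names and identifiers are invented for illustration.
import json

record = {
    "id": "calc-0001",
    "material": "NaMnO2",
    "method": "DFT",
    "code": "Quantum ESPRESSO",
    "property": {"name": "formation_energy", "value": -2.31, "unit": "eV/atom"},
    "derived_from": ["structure-0042"],  # link to the input structure record
    "used_by": ["screening-run-7"],      # link to a downstream analysis
}

# A machine-readable serialization is what an API would actually serve.
serialized = json.dumps(record)
```

Because the record is plain JSON, it can be indexed by a search engine and fetched over an API without any knowledge of the code that produced it.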
Here we must first consider the extreme heterogeneity of computational and experimental data. The community in general uses about 40 different computer codes (covering electronic structure, molecular dynamics, quantum dynamics, and molecular and quantum chemistry for materials) that differ in various aspects of methodology and implementation. On the experimental side, a huge number of apparatuses and technologies are currently in use. Consequently, the results must be made comparable, which is a major challenge, and not only in the sense of bringing them to a common format and common units: a quantity may be named differently in different sub-communities, or the same expression may have a different meaning in one area or another. For this reason, "dictionaries" are needed to translate between them. We must also ask whether it is possible to operate on all available data in a meaningful way; formats, units of measurement and calculation parameters must all be taken into account.
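A toy version of such a "dictionary" is sketched below: it maps sub-community synonyms onto a canonical term and converts values into a common unit. The synonym list is an invented example; the eV/Hartree and eV/(kJ/mol) conversion factors are standard physical constants.

```python
# Toy translation "dictionary" between sub-communities: canonical terms
# plus unit conversion. Synonyms are illustrative; conversion factors
# are standard (1 Hartree = 27.211386 eV, 1 eV = 96.485 kJ/mol).

SYNONYMS = {
    "band gap": "band_gap",
    "energy gap": "band_gap",
    "E_gap": "band_gap",
    "HOMO-LUMO gap": "band_gap",  # the molecular-chemistry name
}

TO_EV = {  # convert a value in the given unit to eV
    "eV": 1.0,
    "Hartree": 27.211386,
    "kJ/mol": 1.0 / 96.485,
}

def normalize(term, value, unit):
    """Translate a community-specific (term, value, unit) triple to the
    canonical term with the value expressed in eV."""
    return SYNONYMS[term], value * TO_EV[unit]

term, value_ev = normalize("HOMO-LUMO gap", 0.1, "Hartree")
```

Real interoperability efforts use full ontologies rather than flat dictionaries, but the principle, one canonical name and one canonical unit per quantity, is the same.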
Reusability is understood in the context of materials science in the following way: the same material can be used for different applications. Why, then, should a researcher working on one aspect not be able to access data that another researcher has generated in a different context, without having to become an expert in that other area of materials science? For example, TiO2, an important support material for heterogeneous catalysis, also has properties of great interest for photovoltaic technology; in addition, TiO2 is used as a pigment in paints and cosmetic products. It is generally agreed that research results obtained in academia should be published; it should equally be a duty to make available the complete data underlying a publication. Indeed, some research journals have begun to require that all data be uploaded to a certified repository. Of course, as mentioned above, the data must be linked to established metadata and workflows. The digital transformation of technologies and services underpins industrial technologies and advancements. The manufacturing and materials industry increasingly relies on knowledge and decision-making based on a digital ecosystem in which stakeholders are connected, sharing data, knowledge, technologies, human resources and operations, and organizing digital marketplaces that connect manufacturers, suppliers, distributors, recyclers and consumers. The industrial sector is therefore trying to develop new dictionaries and ontologies. The combination of digital technologies such as high-performance computing, big data management, ontology-based knowledge engineering, and artificial intelligence (AI) is revolutionizing the research and development methodologies that enable this digital transformation, merging computational data (modeling, simulation) with experimental materials data (high-throughput characterization). These tools support materials property screening and materials development.
Connecting communities based on data development and shared knowledge/ontologies will accelerate the design of safe and sustainable materials. This approach will help differentiate the quality of materials designed in the EU from those outside Europe. The availability of shared, combined, and validated data will deepen the definition of researcher and industry needs and significantly accelerate the development of advanced materials and processing solutions relevant to European innovation.