Workshop scientific report

Location: Remote

Dates: Feb 7, 2022 to Feb 9, 2022

Organisers:

  • Carlo Pignedoli
  • Kevin Maik Jablonka
  • Matthew Evans
  • Sebastiaan Huber
  • Shyam Dwaraknath
  • Stefan Kuhn

What were the major topics discussed in the event, and how have they contributed to advancing the state of the art?

Chemical data have been archived for as long as they have been produced. For a long time this was done on paper only, either in (handwritten) lab notebooks or in (printed) publications. In many cases, the published data were only a fraction of the original data. Some standards exist for such published data, e.g. the IUPAC recommendations for NMR data. Even when the data follow a standard, however, they cannot be acted on directly by machines. Efforts have been made to convert such data, but conversion is complicated.
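
As a hedged illustration of why such conversion is complicated, the following minimal Python sketch extracts structured values from a peak list written the way it might appear in a printed experimental section. The report string, its numerical values, the regular expressions, and the function name `parse_report` are all assumptions made for this example; they are not taken from the workshop or from any IUPAC document, and real published text varies far more than these patterns allow.

```python
import re

# Illustrative peak list in the style of a printed experimental section.
# The values are placeholders for this sketch, not data from the workshop.
REPORT = (
    "1H NMR (400 MHz, CDCl3) δ 3.72 (q, J = 7.0 Hz, 2H), "
    "1.25 (t, J = 7.0 Hz, 3H)"
)

# Header such as "1H NMR (400 MHz, CDCl3)".
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR "
    r"\((?P<mhz>\d+(?:\.\d+)?) MHz, (?P<solvent>[^)]+)\)"
)

# One peak such as "3.72 (q, J = 7.0 Hz, 2H)"; the coupling constant is optional.
PEAK = re.compile(
    r"(?P<shift>\d+\.\d+) \((?P<mult>[a-z]+)"
    r"(?:, J = (?P<j>\d+(?:\.\d+)?) Hz)?"
    r", (?P<n>\d+)H\)"
)


def parse_report(text):
    """Turn one printed-style peak list into a dictionary (hypothetical helper)."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("unrecognised header")
    peaks = [
        {
            "shift_ppm": float(m["shift"]),
            "multiplicity": m["mult"],
            "j_hz": float(m["j"]) if m["j"] else None,
            "protons": int(m["n"]),
        }
        for m in PEAK.finditer(text)
    ]
    return {
        "nucleus": header["nucleus"],
        "frequency_mhz": float(header["mhz"]),
        "solvent": header["solvent"],
        "peaks": peaks,
    }


if __name__ == "__main__":
    print(parse_report(REPORT))
```

Even this small sketch depends on exact spacing and punctuation. A multiplet reported as a range, a different separator, or a missing solvent would either be skipped or be read incorrectly without any error being raised, which is the kind of fragility that makes bulk conversion of published data hard.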

As soon as data were produced in electronic or digital format, there were demands to archive and publish those data, initially mainly for later checks and verification. In most cases this still involved accessing the data manually and processing them to extract the desired information and compare it to other examples. Large-scale automatic processing was attempted, but was hindered by a lack of memory, transmission bandwidth, processing power, and suitable software.

As this changed over time, the most prominent application of chemical data became artificial intelligence (AI). Here, data found further use for training models. Since AI typically relies on large training sets, manual processing of the data is too time-consuming to be feasible. At the same time, memory, bandwidth and processing power became widely available. As a result, chemistry saw a large number of novel applications which would not have been possible before. At this point, the lack of formats becomes a major issue. Whilst we have the tools, much of the valuable data which could be used with them is still not available. Data have been, and still are, published in outdated, incomplete, non-standardized and non-permanent formats. This issue is visible across many areas of chemistry and remains unsolved in many of them.

What were the primary outcomes of this workshop, including limitations and open questions?

The workshop has collected materials and information in a number of areas. The ultimate aim is to come up with best practices and start or continue initiatives for standards in specific fields.

The discussion was divided into the following areas, for which we list outcomes separately:

  • Maintainability and adoption: The difficulty of long-term maintenance, especially in an academic setting, was noted. On the other hand, there will always be changes, and products will become obsolete at some point. FAIR (findable, accessible, interoperable, reusable) data are crucial to enable transitions: only FAIR data can outlive the projects that produced them.

  • Fuzziness - missing data and impedance mismatch: We are getting better at handling these. It is crucial to maintain the link to the original data so that they can be revisited retrospectively. Crowdsourcing might help in some cases.

  • What needs to be standardized? It was noted that standardization can be a limitation. The full digital trail of data should be preserved. For higher-level data, incomplete standards are better than no standards.

  • What are the problems/barriers? The main issues identified are a lack of knowledge about data, the perception that dealing with data is a waste of time, and the feeling that there is no appropriate standard. Suggested solutions are embedding data handling in funding and project workflows from the start, embedding it in the workflows of tools, and pressure from the top.

  • What are the potential specific benefits? A major issue identified was that benefits are often long-term and accrue to the community as a whole. In contrast, the effort needs to be made by individuals now. A potential solution is to reduce the work needed for conformant data handling to (almost) zero. A system for citing data and giving credit could be an option. In many cases (e.g. NMR) the difficulty is not the (spectral) data, but the metadata and the connection to structures.

  • Agility and standardization: The jungle of standards can be both a product of agility and a solution. Mandating standards is very difficult. Standards should crystallize.

  • Reusability and discoverability: ChemDraw is an example of a widely used tool with many drawbacks. The way structures are saved (e.g. using bonds as boxes) makes reuse and discoverability hard. InChI was a big step forward, but inorganic chemistry is not well covered.
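
The points in the list above about metadata, the connection to structures, and the link to the original data can be made concrete with a small sketch. The Python record below is only an illustration under stated assumptions: the class `SpectrumRecord`, its field names, the `example.org` address, and the peak values are hypothetical and do not represent a standard proposed at the workshop. The InChI string is the standard InChI of ethanol.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SpectrumRecord:
    """Hypothetical minimal record tying peaks to metadata and a structure."""

    structure_inchi: str   # structure identifier, here an InChI string
    nucleus: str           # e.g. "1H"
    solvent: str           # acquisition metadata
    frequency_mhz: float   # acquisition metadata
    raw_data_uri: str      # link back to the original (raw) data
    peaks_ppm: list = field(default_factory=list)

    def missing_metadata(self):
        """Names of fields that are empty and would block later reuse."""
        return [
            name
            for name, value in asdict(self).items()
            if value in ("", None, [])
        ]


def to_json(record):
    """Serialize the record to a plain, tool-independent text format."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = SpectrumRecord(
        structure_inchi="InChI=1S/C2H6O/c1-2-3/h3H,1-2H3",  # ethanol
        nucleus="1H",
        solvent="CDCl3",
        frequency_mhz=400.0,
        raw_data_uri="https://example.org/raw/ethanol-1h",  # placeholder address
        peaks_ppm=[3.72, 1.25],  # illustrative values only
    )
    print(record.missing_metadata())
    print(to_json(record))
```

The design choice sketched here is that the structure identifier and the pointer to the raw data travel with the peaks in a single record, so that a later user does not have to reconstruct either from the surrounding text.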

A perspective paper, the writing of which has already started, will be the central message from the workshop.

In addition, we have set up a working group on APIs in ELNs which will continue work in this area. It will also cover standards for ELNs in general.

What was the take-home message for the participants?

The main needs of the community lie in the following areas:

  • Clear standards: There must be clear standards and practices to follow in all areas of chemistry. A researcher working in a field and producing a certain type of data must be able to find out easily and unambiguously what to do and how. This includes data and file formats.

  • Easy-to-use tools: There must be tools that make it easy to produce compliant data with the minimum possible effort.

  • Integration into existing toolchains: This is important to avoid extra work and get as many researchers as possible to make compliant data. ELNs are a prime example.

  • Encouragement by publishers: Journals and publishers should encourage these efforts by making submission of compliant data mandatory and by offering support for it.

Infrastructure for this is and will be mostly decentralized and requirements are not very demanding. It would already be possible to produce machine-actionable data if the points given above were clear.

Long-term data storage is a slightly different issue. Data should be stored in such a way that they remain available essentially indefinitely. This is not well covered by traditional funding schemes.

Do the outcomes of the workshop hold potential for societal benefits?

There are a number of distinct issues here where funding might be needed. Firstly, there is the issue of developing standards and tools. Funding for this was available in the past and will probably remain available via standard channels. Overall, though, this is probably not the major issue, considering that many standards and tools already exist. Secondly, there is the issue of promoting and spreading good practices. Funding for this has been won, e.g. as part of the NFDI4Chem scheme in Germany. National funding schemes may be the best option here in the future, considering that this work has to happen locally and must be localized. Thirdly, there is the issue of long-term storage. Funding for this has historically been very difficult to find, and many projects were abandoned. A working example is nmrshiftdb2, which has been around for more than 20 years now, but it has largely relied on informal contributions of time and resources by researchers and institutions.

Are there tangible outcomes of the workshop (e.g., publications, new collaborations, plans for proposal submission, software developments, etc.)?

Access to good machine-actionable data can be beneficial to every branch of chemistry in the future. Large amounts of high-quality, structured data in readable formats are a precondition for methods in the fields of big data, data science, and artificial intelligence. Those methods, in turn, have been used and are increasingly used in practically all fields of chemistry, each of which brings its own societal benefits. These include:

  • Drug design: Support for drug design by the use of AI methods is widely established. The societal benefits are obvious. Mass screening by computational methods can help to solve the lack of novel compounds, which is a major issue in the pharmaceutical industry.

  • Material science: The development of new materials can be helped by AI methods. New materials can in turn help with sustainable energy or other areas where sustainability is lacking.

  • Chemical process optimization: Optimal processes in the chemical industry can have direct benefits for the environment. They can also help with making materials and drugs available and affordable to developing countries.