Workshop scientific report

Location: Zuse Institute Berlin (CECAM-DE-MMS)

Dates: Apr 22, 2024 to Apr 25, 2024

Organisers:

Edan Bainglass
Caterina Barillari
Matthew Evans
Kevin Maik Jablonka
Peter Kraus

What were the major topics discussed in the event, and how have they contributed to advancing the state of the art?

The first working group focused on interoperability between research data management (RDM) platforms, specifically interoperability that does not depend on any platform's internal mechanics or API. Widespread interoperability is beneficial and generally desired, but implementing pairwise interoperability between APIs is a daunting task that scales on the order of N² (N platforms each seeking to interoperate with the others). A platform-agnostic solution is therefore needed. To enable one, the group conceived a new transfer protocol rooted in semantically annotated transactional units written in JSON-LD, using the RO-Crate specification for packaging research (meta)data. The group split into two subgroups that drafted open specifications for (i) requests and (ii) responses. In addition, a demonstrator was developed to showcase the new protocol using mock-up web apps mimicking the openBIS electronic lab notebook (ELN) and the AiiDA workflow management system (WFMS). The demonstrator has since been greatly extended and has proven an indispensable learning tool. A video tutorial demonstrating the new protocol with the mock-up apps is available. The general discussion is ongoing and will continue as we push to finalize the specifications and test our concepts against additional RDM platforms.
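To make the scaling argument concrete, the following minimal Python sketch counts the adapters needed with and without a shared exchange format, and builds a bare RO-Crate 1.1 metadata skeleton in JSON-LD. The function names and the example file name are hypothetical, and the 2N count assumes that each platform needs exactly one exporter and one importer. The skeleton follows the general RO-Crate layout only; it does not reproduce the request and response specifications drafted by the working group.

```python
import json


def pairwise_adapters(n: int) -> int:
    """Directed platform-to-platform adapters needed without a shared format."""
    return n * (n - 1)


def shared_format_adapters(n: int) -> int:
    """Adapters needed if each platform has one exporter and one importer
    for a single shared format (an assumption made for this sketch)."""
    return 2 * n


def minimal_crate(name: str, files: list) -> dict:
    """Build a bare RO-Crate 1.1 metadata document as a Python dict."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: identifies this file as RO-Crate metadata.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: the packaged research (meta)data.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": f} for f in files],
            },
            # One File entity per packaged file.
            *({"@id": f, "@type": "File"} for f in files),
        ],
    }


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, pairwise_adapters(n), shared_format_adapters(n))
    print(json.dumps(minimal_crate("Example dataset", ["data.csv"]), indent=2))
```

For ten platforms the pairwise approach needs 90 directed adapters, while the shared-format approach needs 20 under the stated assumption.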

The second working group focused on semantic annotations and ontologies. Discussion revolved around which format to use for RDF serialization, with JSON-LD identified as the most suitable of the available options because it is already widely used and fits existing JSON-based APIs. Further discussion followed on how much metadata is needed to describe data, how to deal with missing values, and how and where to incorporate units for measurements. The group also discussed how to write a JSON schema and the need for GUI-based schema tools for less technically inclined users (see also OO-LD). Finally, a contribution from the voc4cat project, which focuses on ontologizing data in catalysis, led to work on the semantic annotation of some of the real-world datasets used at the workshop. A record of the discussions within this working group is available on GitHub.
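As an illustration of the questions about units and missing values, the sketch below annotates a single measurement as a JSON-LD node. It assumes the schema.org PropertyValue type with its unitText and unitCode properties and a QUDT unit IRI; these vocabulary choices, the function name, and the policy of omitting the value key for a missing measurement are assumptions made for this example, not decisions taken by the working group, and the code is unrelated to the units package mentioned elsewhere in this report.

```python
import json
from typing import Optional


def annotate_measurement(
    name: str, value: Optional[float], unit_text: str, unit_iri: str
) -> dict:
    """Describe one measured quantity as a schema.org PropertyValue node."""
    node = {
        "@context": "https://schema.org/",
        "@type": "PropertyValue",
        "name": name,
        "unitText": unit_text,  # human-readable unit label
        "unitCode": unit_iri,   # machine-readable unit identifier
    }
    # Assumed policy: leave out "value" entirely when the measurement is
    # missing, rather than writing null or a sentinel number.
    if value is not None:
        node["value"] = value
    return node


if __name__ == "__main__":
    celsius = "http://qudt.org/vocab/unit/DEG_C"
    measured = annotate_measurement("reaction temperature", 25.0, "°C", celsius)
    missing = annotate_measurement("reaction temperature", None, "°C", celsius)
    print(json.dumps(measured, indent=2, ensure_ascii=False))
    print(json.dumps(missing, indent=2, ensure_ascii=False))
```

Keeping the unit on the node even when the value is missing records what was meant to be measured, which is one possible answer to the question of how much metadata to provide.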

The third working group focused on handling data that devices produce in proprietary formats. Discussions covered general approaches to such data, the technical aspects of ingesting raw proprietary files into an ELN, parsing them into FAIR or at least open data, and modern approaches to “streaming” data from continuous experiments.

What were the primary outcomes of this workshop, including limitations and open questions?

The workshop provided a major step towards a shared understanding of how best to annotate datasets semantically and how to leverage semantically meaningful data to facilitate platform-agnostic interoperability. The collective efforts culminated in a set of tools for semantic dataset annotation (as well as a Python package for semantically handling units), and web apps (and a video) demonstrating the new data transfer protocol, the specifications of which have also been drafted and are currently being expanded. Furthermore, potential partners for collaboration on, and integration of, tools that extract data from binary file types and proprietary data formats have been identified. A batch of such extractors was successfully integrated into the project during the workshop, with plans for future work with other workshop attendees.

Several open questions remain. With respect to platform-agnostic interoperability of semantically annotated datasets, the exact level of annotation required to support the protocol is not yet clear. Furthermore, there is the question of how to handle cases where interoperating platforms exchange datasets that have been semantically annotated against disparate ontology knowledge graphs. There is also the issue of whether to handle data format conversion on exchange. All of these questions are being actively discussed on the MADICES repositories. For example, two solutions were offered at the workshop for handling disparate knowledge graphs: (i) community-driven bridge ontologies (and perhaps LLMs at a later stage), and (ii) a central format for which each platform implements a converter or adapter. Both are being explored. Note that, at the moment, the new protocol is being tested heavily by the openBIS and AiiDA teams. However, several RDM platform representatives already expressed interest at the workshop, including those of the Chemotion, Herbie, and datalab ELNs, as well as NMRShiftDB and OpenSemanticLab. We will continue to involve additional RDM initiatives in the discussion.
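The bridge-ontology option (i) can be pictured as a lookup from terms in one vocabulary to equivalent terms in another. The sketch below is a deliberately simplified illustration: the example.org IRIs, the BRIDGE table, and the translate function are all hypothetical, and a real bridge ontology would express such equivalences in RDF rather than in a Python dictionary.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical bridge: term IRIs in ontology A mapped to ontology B.
BRIDGE: Dict[str, str] = {
    "https://example.org/onto-a/temperature": "https://example.org/onto-b/Temp",
    "https://example.org/onto-a/pressure": "https://example.org/onto-b/Pres",
}


def translate(
    record: Dict[str, Any], bridge: Dict[str, str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Re-key a record from ontology A terms to ontology B terms.

    Returns the translated record and the list of terms that have no
    mapping, so the receiving platform can decide how to handle them.
    """
    translated: Dict[str, Any] = {}
    unmapped: List[str] = []
    for term, value in record.items():
        if term in bridge:
            translated[bridge[term]] = value
        else:
            unmapped.append(term)
    return translated, unmapped


if __name__ == "__main__":
    record = {
        "https://example.org/onto-a/temperature": 298.15,
        "https://example.org/onto-a/stirring-rate": 500,
    }
    translated, unmapped = translate(record, BRIDGE)
    print(translated)
    print("no mapping for:", unmapped)
```

Reporting unmapped terms explicitly, instead of silently dropping them, keeps the open question of the required annotation level visible at exchange time.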

Further work on handling data from proprietary files requires agreement on the level of annotation discussed above. It is clear that semantic annotations should be incorporated as early as possible in the data pipeline, ideally at acquisition or raw data ingestion time. Building pressure on instrument vendors to provide open, well-annotated data is therefore crucial and will require a community effort.

What was the take-home message for the participants?

Providing semantic meaning to research data is invaluable to open science and to facilitating seamless exchange of open and FAIR research data. It can be made simpler through community efforts to develop dedicated tools to streamline semantic annotation, and incentivized by showing the broader value to the community of individual efforts. Such tools should be incorporated as close to the instrumentation producing the data as possible.

Does the outcome(s) of the workshop hold potential for societal benefits?

There is a general push for open research in science, with clear benefits for accelerating scientific discovery. However, the reproducibility crisis is well known within the scientific community and often arises from attempts at open research in the absence of clear guidelines. A survey of participants at the first MADICES workshop identified the lack of standards, protocols, examples, and best practices as the primary barrier to the adoption of open research. The collaborative efforts of the MADICES series aim to drive the community towards resolving these challenges. The outcomes of MADICES continue to pave the way towards platform-agnostic solutions for exchanging open data in a standard way, following structured guidelines that ensure research reproducibility and thus secure the benefit of accelerated science.

Are there tangible outcomes of the workshop (e.g., publications, new collaborations, plans for proposal submission, software developments, etc.)?

Several new collaborations have flourished during and since the workshop (see the various repositories added to the MADICES GitHub organization and linked above). Additionally, the knowledge exchange on the practicalities of semantic annotation of experimental data will allow attendees to continue implementing the work started at MADICES. Representatives from BAM, Chemotion, Herbie, NMRShiftDB, FAIRmat, openBIS, and AiiDA are all presently contributing to the specifications, demonstrators, and implementation of the platform-agnostic data transfer protocol. The web apps developed during and after the workshop are being considered as a future educational playground. Furthermore, in addition to two of the present co-organizers, we identified two participants who wish to co-organize a follow-up MADICES workshop (late 2025), for which a proposal is currently being drafted.

What measures did you take to promote inclusivity (gender, geographical provenance of participants and speakers, career stage, disabilities, etc.)?

Participants were invited from the networks of the organizers, on the recommendation of coordinators of research data initiatives, and by surveying relevant activities on GitHub and social media. In particular, due to the hands-on nature of the workshop, junior researchers (PhD students and postdocs) were prioritized in our invitations. In total, 65 people were contacted, of whom 10 were female; of the 35 participants (including organizers), 5 were female. The organizers arranged accommodation for 6 applicants who could not otherwise have attended, 5 of whom attended the entire workshop. To find common topics of interest, identify collaboration partners, and plan the workshop sessions, we organized virtual pre-workshop planning sessions open to all invitees, which were crucial in fostering an open community and spirit for the eventual workshop. After the workshop, we held a virtual wrap-up meeting to collect feedback, coordinate ongoing work, and discuss the next MADICES event.
