TreatmentBank: an automated workflow to convert biodiversity PDFs into input for Biodiversity PMC

Conference: ELIXIR All-Hands 2024
Location: Uppsala, Sweden/virtual
Date and Time: 11 Jun 2024, 9:00 AM UTC
Session: Text mining services to support scalable curation

Presenter: Donat Agosti

Several ELIXIR communities are designing text analytics solutions to perform various data curation tasks. The development of pre-trained and large language models is also helping non-expert communities develop successful applications (e.g., triage, named-entity recognition, chatbots, search engines). The tasks are relatively heterogeneous, ranging from well-established and common natural language processing tasks (e.g., named-entity recognition, automatic text categorization) to more complex curation support tasks (e.g., triage, question answering, bi-directional linking between curated databases and publications). There is also a growing focus on converting and annotating publications to expand the pool of FAIR data available for downstream tasks, as illustrated by the efforts of the ELIXIR Data Platform and several node and ELIXIR core data resources (e.g., BioStudies, Europe PMC).

The first part of the workshop will be a forum in which communities report, in short pitch-style presentations, on experiments or challenges they face in developing scalable automation methods and applying them to biocuration, especially regarding genomic, taxonomic and metabolomic data. Biomedical named-entity recognition approaches and literature search will be discussed, with an emphasis on data exchange standards (e.g., JATS, BioC, IOB, TaxPub, RO-Crate) and pre-trained large language models. Particular attention will also be paid to turning the long tail of supplementary data into FAIR digital objects. The discussion will be led by representatives from several nodes, including the ELIXIR-UK, CH, BE and LU nodes. It will enable exchange between diverse biological and biomedical communities and focus groups, such as Biodiversity, Plant Sciences, Metabolomics, Rare Diseases, Health Data and Biocuration, and more technological groups such as Machine Learning. Members of these communities and focus groups will thus have the opportunity to share their respective approaches, results and challenges.

The second part of the forum will consist mainly of informal discussions, with the aim of helping to identify priority areas that ELIXIR could focus on in future developments related to data accessibility and federated data management (science and technology tiers of ELIXIR’s strategic priorities) and biocuration. This discussion is important for determining how ELIXIR can best support and augment ongoing efforts in text mining, data accessibility and biocuration. The goal is to develop a coherent strategy that aligns with the evolving needs of the scientific community (e.g., the International Biocuration Society) and leverages the latest advances in text analytics and language modelling.