Accelerating biomedical insights: How AI can speed up literature analysis in genomics

The ability to synthesise and curate vast amounts of scientific literature is becoming a critical differentiator for biomedical organisations seeking to stay ahead. For Aaqib Alavy, a UCL Master's candidate specialising in artificial intelligence, this challenge is at the heart of his current work.

Aaqib, who works on synthesised genetic variant information, is focused on streamlining the process of literature analysis and data curation in genomics. “I’m doing a lot of work within that space, looking at how AI and machine learning can aid and support the medical field,” he explained. “Right now, I’ve been working on a project within genomics, but more specifically, literature analysis and data curation within that field.”

Crucial streamlining

The stakes are high. As Aaqib noted, “With the flow of genomics, a lot of it is advanced by studies and scientific literature comes out every day; there are multiple articles produced regularly.

“There’s a lot of research being done, but not enough resources to digest that research. Trying to speed up that process is part of what I’m looking into and working on.”

In the field of biomedicine and healthcare, genomics companies aim to provide medical professionals with updates on genetic variants – mutations in DNA that can be linked to certain diseases. “You can think of it like an update in the form of synthesised information that is a very easy and digestible review for medical professionals, to look at and be able to understand,” Aaqib said.

Profound implications

Streamlining workflows and synthesising scientific literature have profound implications for the biomedical research community. A synthesis can, for example, challenge or consolidate the consensus around a particular scientific finding, which in turn can affect patient care from risk assessment to preventative strategies.

Aaqib explained, “If a study comes out presenting evidence that an existing benign variant now shows signs of actually being pathogenic, this is now an important area to potentially direct more resources and focus towards further consolidating that finding; however with the sheer volume and rate of publication, findings like these can often go undetected for extended periods of time.”

This is why accelerating the process of curating and reviewing new findings can help the biomedical community more rapidly detect relevant findings and align their efforts with the latest evidence.

Challenges

The technical challenge, however, is formidable. “One of the biggest bottlenecks is the computational cost, and when it comes to optimising the resources needed whilst maintaining results that are both accurate and robust,” Aaqib shared. 

“Right now, I’ve been working on a locally developed implementation where you can use consumer grade GPUs, as well as using models and technology that are both a lot more accessible, and more traceable and transparent when it comes to verifying results and how they were achieved.”

He described how he uses large language models (LLMs) to analyse structured and formatted data such as complex tables, where semantic context is limited. For running text, a heuristic-based relation extraction approach identifies co-occurrences of genomic entities and assesses their potential relationships using scoring models. These associations are then validated or expanded by cross-referencing known databases.
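
To make the co-occurrence idea concrete, here is a minimal Python sketch. It is not Aaqib's implementation: the variant pattern, the disease list, the KNOWN_PAIRS set standing in for a curated database, and the function names are all hypothetical, and a plain sentence-level count stands in for the scoring models he described.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical, simplified stand-ins for real entity recognisers.
VARIANT_RE = re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b")
DISEASES = ("breast cancer", "cystic fibrosis")

# Hypothetical stand-in for a curated database of known associations.
KNOWN_PAIRS = {("c.68A>G", "breast cancer")}

SAMPLE_TEXT = (
    "The variant c.68A>G was observed in patients with breast cancer. "
    "In a second cohort, c.68A>G again segregated with breast cancer. "
    "One carrier of c.200C>T had cystic fibrosis."
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace
    and a capital letter, so names like 'c.68A>G' stay intact."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def extract_relations(text):
    """Score each (variant, disease) pair by the number of sentences in
    which both appear, then flag pairs already in the known set."""
    scores = Counter()
    for sentence in split_sentences(text):
        variants = set(VARIANT_RE.findall(sentence))
        lowered = sentence.lower()
        diseases = {d for d in DISEASES if d in lowered}
        for pair in product(variants, diseases):
            scores[pair] += 1
    return [
        {"variant": v, "disease": d, "score": n, "known": (v, d) in KNOWN_PAIRS}
        for (v, d), n in scores.most_common()
    ]


if __name__ == "__main__":
    for relation in extract_relations(SAMPLE_TEXT):
        print(relation)
```

Pairs that score well but are not in the known set are the interesting ones for a curator: they are candidates for a new or changed association that deserves a closer look. A real pipeline would replace the count with a richer scoring model and the hard-coded set with a database lookup.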

Application programming interfaces (APIs) play a crucial role in Aaqib’s work by providing powerful tools for data extraction and analysis. A specific example he highlighted is LlamaIndex’s extraction tool, LlamaExtract.

However, Aaqib noted significant challenges with API-based LLM solutions. “Because it’s an LLM, it’s not as deterministic and transparent. There’s also the issue of hallucinations, where LLMs will make up information for the sake of answering a prompt or completing a task.”

Explainability and robustness

Another critical issue is explainability – an extremely important requirement in the medical field. “Because it’s the medical field, genomics, that is probably one of the most crucial and vital components – being able to understand why these insights are the insights that they are, is extremely important for both medical professionals and the larger biomedicine community,” Aaqib emphasised. “If you want to make conclusive decisions, you need to understand why a certain tool, model, or technology has yielded the insights that it has.”

https://www.youtube.com/embed/WrTwFw5izLM?feature=oembed&enablejsapi=1

Despite these challenges, Aaqib sees broad potential for the technology. “There’s a lot of benefit for other industries, because it’s essentially information synthesis. Being able to have that for any industry, any application, is extremely useful.” 

One significant technical hurdle that cuts across industries is extracting and processing PDF documents, a problem that stems from how inconsistent the format is in practice: “PDFs are extremely varied across the board in terms of structure, layout, and semantics,” he explained.

And as the volume of biomedical literature continues to grow, the need for advanced tools to curate and synthesise information will only intensify. “The advancement of information synthesis, literature analysis within the field of genomics and the wider medical field can really, really aid the whole world of biomedicine as a whole,” Aaqib emphasised.

“Because the main bottleneck is moving along findings down the pipeline… being able to produce and create tools that can aid in that review and streamline the process is extremely useful.”

Ultimately, the ability to parse documents in challenging formats effectively could have implications well beyond biomedicine. Aaqib suggested that solving the PDF extraction challenge could benefit multiple industries and enhance knowledge sharing within them.
