Decoding DNA: Exploring the Impact of Tokenization on Genomic Language Models

Presentation description

The language-like structure of DNA suggests that large language models (LLMs) may be able to extract meaningful insights from genomic data. Currently, there is no standard tokenization method or set of fine-tuning tasks for genomic language models. Our strategy has been to fine-tune multiple foundation models on all of their existing tasks. Additionally, we performed a preliminary investigation into whether an LLM can accurately identify the locations of prophage sequences integrated into a bacterial genome.
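
To illustrate why the choice of tokenization matters, here is a minimal sketch of k-mer tokenization, one scheme commonly used for DNA sequences. It is an illustrative assumption, not necessarily the method used in this project; the function name `kmer_tokens`, the example sequence, and the values of `k` and `stride` are all hypothetical.

```python
def kmer_tokens(seq: str, k: int, stride: int) -> list[str]:
    """Split a DNA string into k-mers, advancing `stride` bases per token.

    Trailing bases that do not fill a complete k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical 10-base example sequence.
seq = "ATGCGTACGT"

# Overlapping 3-mers: every base starts a new token.
overlapping = kmer_tokens(seq, k=3, stride=1)

# Non-overlapping 3-mers: each base belongs to at most one token.
non_overlapping = kmer_tokens(seq, k=3, stride=3)

print(overlapping)
print(non_overlapping)
```

The same sequence yields eight tokens under the overlapping scheme and three under the non-overlapping one, so the two choices present a model with different sequence lengths and different vocabulary statistics.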

Presenter Name: Anisa Habib

Presentation Type: Poster

Presentation Format: In Person

Presentation #C11

College: Engineering

School / Department: School of Computing

Email: anisa.habib@utah.edu

Research Mentor: Hari Sundar

Date | Time: Tuesday, Apr 9th | 1:00 PM

Semester: Spring 2024