Exploring the sequence space of NUTM1 fusion proteins

May 30

I have recently become interested in a category of cancers called NUT midline carcinomas. This kind of cancer is characterized by an oncogenic fusion of two nuclear proteins, where the C-terminal end is always NUTM1 but the N-terminal partner can be one of a few different chromatin remodellers (most often BRD4). These cancers can occur in different tissues, violating the tissue-of-origin ontology commonly deployed in oncology, but most often present in the lungs. While some treatments may slightly delay the progression of disease, this cancer is unfortunately both fast and fatal.

There is a clear clinical rationale for finding new therapies for this otherwise incurable deadly disease. I think that these cancers also have interesting properties which might inform our understanding of other cancers. Specifically, they are driven by a single mutation which is oncogenic on its own without the more common accompanying driver mutations (e.g. TP53, KRAS, &c) or any other kind of “second hit”.

I am curious whether the NUTM1 fusion proteins can elicit T-cell responses, or even whether the fusion junction generates any MHC ligands at all. To start looking into these more immunological questions we first have to figure out the actual fusion protein sequences, which can be trickier than you would expect. The first paper documenting BRD4:NUT fusion-driven carcinoma included the full sequence of the fusion protein:

*Figure 2 from French et al. (2003) showing fusion protein sequence*

Most subsequent papers, however, don’t directly include the sequences for fusions with other 5’ partners such as BRD3 or NSD3, or for alternative exon junctions in BRD4. They do often say which exon of each gene is utilized but are not always clear on the particular isoform. To make things even worse, the canonical isoform of NUTM1 has gained a new 5’ UTR exon in the two decades since these fusions were first discovered, making exon numberings somewhat ambiguous, e.g. it’s not clear when references to exon 2 of NUTM1 actually mean exon 3 in a modern reference transcriptome.

To allow some exploratory analyses on the sequences of these fusion proteins, I have started manually curating candidate 5’ fusion partner genes from case studies of NUTM1 fusion-driven cancers in an IPython notebook. The code in this notebook takes these 5’ partner genes (and their possible exonic breakpoints) and matches them by reading frame compatibility against a set of observed exons from NUTM1. This code then generates a CSV file containing coding and protein sequences covering the most common NUTM1 fusions.
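To make the idea of matching by reading frame concrete, here is a minimal sketch. It is not the notebook's code: the function names and the toy sequences are made up for illustration, and it assumes that both breakpoints have already been expressed as offsets into each gene's coding sequence. Under that assumption, a 5’ segment and a 3’ segment are compatible when the number of coding bases contributed by the partner leaves the ribosome in the same codon phase that the NUTM1 segment starts in.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def frames_compatible(five_prime_cds_len, three_prime_cds_offset):
    """True if the 3' segment resumes in the codon phase the 5' segment ends in.

    five_prime_cds_len: coding bases kept from the 5' partner (hypothetical input).
    three_prime_cds_offset: offset into the 3' gene's native CDS where the
        retained segment begins (hypothetical input).
    """
    return five_prime_cds_len % 3 == three_prime_cds_offset % 3


def fuse(five_prime_cds, three_prime_cds, three_prime_offset):
    """Return (fusion CDS, fusion protein), or None if the junction is out of frame."""
    if not frames_compatible(len(five_prime_cds), three_prime_offset):
        return None
    fusion_cds = five_prime_cds + three_prime_cds[three_prime_offset:]
    return fusion_cds, translate(fusion_cds)


# Toy sequences only; these are not real BRD4 or NUTM1 sequences.
toy_partner_cds = "ATGGCTGA"        # 8 coding bases, ends 2 bases into a codon
toy_three_prime_cds = "ATGAAACCCGGGTAA"
print(fuse(toy_partner_cds, toy_three_prime_cds, 5))  # in frame
print(fuse(toy_partner_cds, toy_three_prime_cds, 3))  # out of frame
```

A real version also has to deal with breakpoints that fall in UTRs or introns and with the isoform ambiguity described above, which is where most of the curation effort goes.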

*Subset of the fusion sequences, with specified transcripts and proteins truncated to breakpoint +/- 10 amino acids*

It’s still a work in progress and I plan to curate more 5’ partner genes in the future. In the meantime, I can start looking for predicted MHC ligands at the breakpoints and reasoning about whether it’s possible to design TCR therapeutics targeting any of these rare but fatal mutations. I wanted to make the code and generated data publicly available in case anyone else is also working on NUTM1 carcinoma research and needs the actual fusion protein sequences.
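As a rough sketch of the first step in that search, the snippet below enumerates every peptide of typical MHC class I ligand length (8 to 11 residues) that spans a fusion junction. The function name, the `junction_index` convention, and the toy protein are assumptions for illustration. It does not predict binding; the peptides it yields would still need to be scored by a binding predictor.

```python
def junction_peptides(protein, junction_index, lengths=(8, 9, 10, 11)):
    """List (start, peptide) pairs that include residues from both sides of a junction.

    junction_index is assumed to be the index of the first residue that is
    not entirely encoded by the 5' partner. If the breakpoint splits a codon,
    that residue may match neither parent protein.
    """
    peptides = []
    for k in lengths:
        # The window must start before the junction and end at or after it.
        first = max(0, junction_index - k + 1)
        last = min(junction_index - 1, len(protein) - k)
        for start in range(first, last + 1):
            peptides.append((start, protein[start:start + k]))
    return peptides


# Toy protein, not a real fusion sequence; the junction sits between L and M.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
for start, peptide in junction_peptides(toy_protein, 10, lengths=(8,)):
    print(start, peptide)
```

With enough flanking sequence on both sides, each length k contributes k - 1 junction-spanning peptides, so the candidate set per fusion is small enough to inspect by hand.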

Alex Rubinsteyn

Assistant Professor in the Department of Genetics at UNC Chapel Hill, member of Lineberger Comprehensive Cancer Center and the Computational Medicine Program

https://www.rubinsteyn.com
