Can Deep Learning Help Decipher the Genetic Fingerprint of Cancer?

Welcome to Straight Talk about AI in Healthcare. In each post I explain a recent research paper in non-technical terms and highlight lessons for healthcare organizations. My focus is not necessarily the most inspirational work but the most practical insights: those that help you understand what you can do today and what should be on your radar for tomorrow.

Always feel free to reach out with any feedback and article suggestions, and if you are not on the mailing list, please subscribe.

Deep learning, a family of methods inspired by the way neurons in the brain work, has revolutionized text and image analysis, but applications to healthcare and biopharma are still in their early stages. In the previous post, I discussed the challenges of applying deep learning to structured healthcare data. This post is about applying deep learning to biological data, specifically gene expression data, which is typically measured with RNA sequencing.

Every cell in our body contains the same genes, but different kinds of cells express genes (i.e., use them) in different amounts; that’s what makes a brain cell different from a skin cell. Sequencing experiments can measure how much each of tens of thousands of genes is expressed in a given tissue sample, so they generate a lot of data. But they are usually performed with only a small number of patients, so it can be hard to make reliable predictions based on gene expression data alone.

But what if we pooled together many experiments, representing a large number of individuals? Could we use this information to improve predictions when running a new experiment? This concept is called “transfer learning”, because it transfers existing information to a new task instead of starting from scratch.

This idea has been very useful in other domains. For example, when reading a legal document you may have to look up the legal meaning of certain terms. But you don’t look up every word, because you transfer your existing knowledge of English. Understanding a legal document in a language that you don’t already speak would be a much harder task! 

Transfer learning is the main force behind the wide adoption of deep learning for textual analysis. Big tech companies like Google develop new tools that do the heavy lifting of “learning English”. Smaller companies can then use these tools so they only need to do the equivalent of learning legal terms. 

Applying this concept to gene expression data, the idea would be to use a large set of experiments to identify common expression patterns, or groups of genes that tend to operate together, and transfer this information as context to the prediction tasks. This particular study, by a team from Pfizer and Unlearn, a company that uses AI for clinical trials, explored whether this approach can improve performance on tasks like identifying the stage of a tumor or predicting its progression.
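For readers who like to see an idea in code, here is a minimal sketch of the two-step recipe described above. It is not the method from the study: the data is simulated, all names are made up for illustration, and a simple linear technique (principal component analysis) stands in for the deep learning models the researchers tested. Step 1 learns shared expression patterns from a large unlabeled pool; step 2 reuses those patterns as features for a small labeled task.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_modules = 200, 5

# Assumption: simulated "gene modules", i.e. groups of genes that move together.
module_loadings = rng.normal(size=(n_modules, n_genes))

def simulate_expression(module_activity):
    """Turn per-sample module activity into a noisy samples-by-genes matrix."""
    noise = 0.5 * rng.normal(size=(module_activity.shape[0], n_genes))
    return module_activity @ module_loadings + noise

# Step 1: learn common expression patterns from a large pooled dataset (no labels).
pooled = simulate_expression(rng.normal(size=(2000, n_modules)))
gene_means = pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled - gene_means, full_matrices=False)
components = vt[:n_modules]  # top principal directions across genes

def embed(samples):
    """Describe each sample by its score on the patterns learned in step 1."""
    return (samples - gene_means) @ components.T

# Step 2: a small labeled experiment (30 samples, two hypothetical classes).
labels = np.arange(30) % 2
activity = rng.normal(size=(30, n_modules))
activity[:, 0] += 3.0 * (2 * labels - 1)  # the classes differ in one module
small = simulate_expression(activity)
features = embed(small)  # 5 transferred features instead of 200 raw genes

# Fit a nearest-centroid classifier on 20 samples, evaluate on the other 10.
train, test = slice(0, 20), slice(20, 30)
centroids = np.stack(
    [features[train][labels[train] == c].mean(axis=0) for c in (0, 1)]
)
distances = np.linalg.norm(
    features[test][:, None, :] - centroids[None, :, :], axis=2
)
predictions = distances.argmin(axis=1)
accuracy = (predictions == labels[test]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

The point of the design is that the small experiment never has to rediscover which genes move together; it borrows that knowledge from the pooled data. Real expression data is far messier than this simulation, which is part of why the question is hard.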

The researchers used a public database of over 2,000 experiments related to different types of cancer, with a total of about 40,000 individuals. They tested several transfer learning methods and checked whether the context they provide improves prediction accuracy. Unfortunately, none of the approaches worked very well. Some of them were helpful for a few of the tasks, but none was reliably helpful across a broad range of tasks.

It’s possible that there was just not enough data, or that different deep learning methods could improve the results. So for now, there’s no reliable transfer learning approach for gene expression data. However, this research will continue and may eventually lead to exciting results.

What can you learn from this? Deep learning on biological data is an exciting and active area of research, but so far very few applications have proved robust enough for broad adoption in the industry. The state of the art can move quickly, and in a few years there will likely be many more. But for now, simpler, more traditional methods are probably the right choice for most biopharma data science projects.

The last two posts focused on the challenges of applying deep learning to clinical and biological data. In the next two posts, I will discuss some of the more successful applications.
