Published on Fri Oct 18 2019

The TCGA Meta-Dataset Clinical Benchmark

Mandana Samiei, Tobias Würfl, Tristan Deleu, Martin Weiss, Francis Dutil, Thomas Fevens, Geneviève Boucher, Sebastien Lemieux, Joseph Paul Cohen

Machine learning is bringing a paradigm shift to healthcare by changing the process of disease diagnosis and prognosis in clinics and hospitals. We provide a clinical Meta-Dataset derived from the publicly available Cancer Genome Atlas Program (TCGA) that contains 174 tasks.

Abstract

Machine learning is bringing a paradigm shift to healthcare by changing the process of disease diagnosis and prognosis in clinics and hospitals. This development equips doctors and medical staff with tools to evaluate their hypotheses and hence make more precise decisions. Most current research seeks to develop techniques and methods for predicting one particular clinical outcome, but this approach is far from the reality of clinical decision making, in which several factors must be considered simultaneously. In addition, recent progress is difficult to follow concretely because benchmark datasets and task definitions in genomics lack consistency. To address these issues, we provide a clinical Meta-Dataset derived from the publicly available data hub The Cancer Genome Atlas Program (TCGA) that contains 174 tasks. We believe these tasks could serve as good proxy tasks for developing methods that work with only a few samples of gene expression data. Learning to predict multiple clinical variables from gene expression data is also an important task, given the variety of phenotypes in clinical problems and the lack of samples for some of the rare variables. The defined tasks cover a wide range of clinical problems, including predicting tumor tissue site, white cell count, histological type, family history of cancer, gender, and many others, which we explain later in the paper. Each task represents an independent dataset. We train regression and neural network baselines on all tasks using only 150 samples and compare their performance.
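To make the few-sample evaluation setting concrete, the following is a minimal sketch of one way a single binary task could be scored when only 150 training samples are available. It is not the authors' code: the data are synthetic stand-ins for gene expression, plain logistic regression is assumed as the "regression" baseline, and every function name and parameter (`make_synthetic_task`, `run_sketch`, the 20-gene feature size, the 200-sample test split) is hypothetical. Only the 150-sample training budget comes from the abstract.

```python
import math
import random


def make_synthetic_task(n_samples, n_genes, seed):
    """Build a hypothetical binary task: random 'expression' vectors and 0/1 labels."""
    rng = random.Random(seed)
    # Assumption: only the first five features carry signal for the label.
    true_w = [rng.gauss(0, 1) if g < 5 else 0.0 for g in range(n_genes)]
    xs, ys = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1) for _ in range(n_genes)]
        score = sum(w * v for w, v in zip(true_w, x))
        xs.append(x)
        ys.append(1 if score > 0 else 0)
    return xs, ys


def train_logistic_regression(xs, ys, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of samples whose predicted class matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        hits += int((1 if z > 0 else 0) == y)
    return hits / len(ys)


def run_sketch(n_train=150, n_test=200, n_genes=20, seed=0):
    """Train on exactly n_train samples and compare against a majority-class guess."""
    xs, ys = make_synthetic_task(n_train + n_test, n_genes, seed)
    train_x, train_y = xs[:n_train], ys[:n_train]
    test_x, test_y = xs[n_train:], ys[n_train:]
    w, b = train_logistic_regression(train_x, train_y)
    majority = 1 if sum(train_y) * 2 >= len(train_y) else 0
    majority_acc = sum(int(y == majority) for y in test_y) / len(test_y)
    return {
        "n_train": len(train_x),
        "logreg_acc": accuracy(w, b, test_x, test_y),
        "majority_acc": majority_acc,
    }


results = run_sketch()
print(results)
```

Repeating `run_sketch` over many tasks and seeds, and swapping in a neural network for the logistic model, would give the kind of per-task comparison the abstract describes, under the assumptions stated above.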

Sat Mar 07 2020
Machine Learning
Large-scale benchmark study of survival prediction methods using multi-omics data
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated for the investigation of various diseases. It is unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions by means of a large-scale benchmark study.
Sun May 02 2021
Machine Learning
DRIVE: Machine Learning to Identify Drivers of Cancer with High-Dimensional Genomic Data & Imputed Labels
Thu Mar 29 2018
Machine Learning
PIMKL: Pathway Induced Multiple Kernel Learning
PIMKL exploits prior knowledge in the form of a molecular interaction network and annotated gene sets. The model provides a stable molecular signature that can be interpreted in the light of the ingested prior knowledge. It can be used in transfer learning tasks.
Wed Feb 03 2021
Machine Learning
OmiEmbed: a unified multi-task deep learning framework for multi-omics data
High-dimensional omics data contains intrinsic biomedical information that is crucial for personalised medicine. It is challenging to capture these features from genome-wide data due to the large number of molecular features and small number of available samples. We propose a unified multi-task deep learning framework named OmiEmbed.
Sat Sep 12 2020
Machine Learning
Machine Learning Against Cancer: Accurate Diagnosis of Cancer by Machine Learning Classification of the Whole Genome Sequencing Data
Machine learning can precisely identify different cancer tumors at any stage. The method works well on a series of cancers and results in great clustering of cancerous and healthy samples too. Once the classifier is trained, it can be used to classify any new sample of potential patients.
Thu Sep 20 2018
Artificial Intelligence
Understanding Behavior of Clinical Models under Domain Shifts
Deep learning has been a game changer in building predictive models, thus leading to community-wide data curation efforts. These models are often biased to the training datasets. This can be limiting when models are deployed in new environments.