Published on Sat Nov 07 2020

Machine learning applications to DNA subsequence and restriction site analysis

Ethan J. Moyer, Anup Das

The work is based on the BioBricks standard. The sensitivities of the SVM, random forest, and CNN models are 94.9%, 92.7%, and 91.4%, respectively. Each method scores lower in specificity than in sensitivity.

Abstract

Based on the BioBricks standard, restriction synthesis is a novel catabolic iterative DNA synthesis method that utilizes endonucleases to synthesize a query sequence from a reference sequence. In this work, the reference sequence is built from shorter subsequences by classifying them as applicable or inapplicable for the synthesis method using three different machine learning methods: Support Vector Machines (SVMs), random forest, and Convolutional Neural Networks (CNNs). Before these methods are applied to the data, a series of feature selection, curation, and reduction steps is used to create an accurate and representative feature space. Following these preprocessing steps, three different pipelines are proposed to classify subsequences based on their nucleotide sequence and other relevant features corresponding to the restriction sites of over 200 endonucleases. The sensitivities of the SVM, random forest, and CNN models are 94.9%, 92.7%, and 91.4%, respectively. Each method scores lower in specificity, with SVMs, random forest, and CNNs reaching 77.4%, 85.7%, and 82.4%, respectively. In addition to analyzing these results, the misclassifications in SVMs and CNNs are investigated. Across these two models, features with a derived nucleotide specificity appear, on visual inspection, to contribute more to classification than other features. This observation is an important factor when considering new nucleotide sensitivity features for future studies.
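
The sensitivity and specificity figures quoted above are standard binary-classification metrics. As a minimal sketch of how such values are computed, assuming a label encoding of 1 for "applicable" and 0 for "inapplicable" (an assumption made here for illustration, as are the function name and the toy labels, none of which come from the paper):

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding (hypothetical): 1 = applicable, 0 = inapplicable.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: fraction of truly applicable subsequences that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of truly inapplicable subsequences that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A model with high sensitivity but lower specificity, as reported for all three methods, rarely misses applicable subsequences but more often accepts inapplicable ones.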

Sun Nov 01 2020
Artificial Intelligence
Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification
The classification of DNA sequences is a key research area in bioinformatics. It enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms are used for the task of DNA classification.
Thu Jan 21 2021
Machine Learning
Motif Identification using CNN-based Pairwise Subsequence Alignment Score Prediction
A common problem in bioinformatics is identifying gene regulatory regions marked by relatively high frequencies of motifs. We propose a one-dimensional Convolutional Neural Network trained on k-mer formatted sequences. We measure the accuracy of the model by identifying the 15 highest-scoring 15-mer indices of the predicted scores.
Fri Jun 07 2019
Machine Learning
Unsupervised Representation Learning of DNA Sequences
We use a sequence-to-sequence autoencoder model to learn a latent representation of a fixed dimension for long and variable-length DNA sequences in an unsupervised manner. We show that these representations can be used as features or priors in closely related tasks.
Wed Dec 17 2014
Machine Learning
Feature extraction from complex networks: A case of study in genomic sequences classification
This work presents a new approach for classification of genomic sequences from measurements of complex networks and information theory. For this, the nucleotides, dinucleotides, and trinucleotides of a genomic sequence are considered. For each of them, the entropy, sum entropy and maximum entropy values are
Mon Nov 27 2017
Machine Learning
Interpretable Convolutional Neural Networks for Effective Translation Initiation Site Prediction
The amount of genomic data at our disposal is growing increasingly large. Determining the gene structure is a fundamental requirement to effectively interpret gene function. An important part in that determination process is the identification of translation initiation sites.
Thu Oct 10 2019
Machine Learning
LISA: Towards Learned DNA Sequence Search
Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. In this paper, we introduce