The pipeline starts with data collection. RNA data is collected using mRNA-Seq. The data then goes through several preprocessing steps, the first of which is aligning the reads to the predicted gene model. Once the mapping is done, the map files are used to create count files, which are then formatted for use in Weka. All of the data mining is done in Weka, using two approaches: clustering and classification. From these mining techniques we are able to infer some useful meaning from the data.
The process of mRNA-Seq starts with extracting RNA from the specimen of interest. Once the RNA is extracted, a process known as reverse transcription is used to convert the RNA into cDNA. This is done because cDNA is much more stable and easier to work with. After conversion, the cDNA is sequenced using an Illumina machine. More information on mRNA-Seq can be found here.
DATASET:
79 lanes (tissue samples)
~25,000 genes
~1 billion reads (50-150 base pairs)
5 different organs
Before meaning can be extracted from the reads, they must first be aligned to a reference genome or gene model. This enables the reads to be ordered in a meaningful way and also provides useful information about gene expression levels: RNA with a higher expression level will have more sequence reads.
The count files record how many reads for a particular gene are mapped in a specific tissue sample. Each lane corresponds to one tissue sample, and different lanes can come from the same tissue or from different tissues. For example, l_1 and l_2 could both come from liver, while l_3 could come from kidney, so there are multiple samples of the same tissue throughout the data set. The numbers in each column are the read counts for that lane (tissue sample).
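As a rough illustration of how such a count table could be tallied from alignment results, here is a minimal Python sketch. The input format (one `(lane, gene)` pair per aligned read), the function name `build_count_table`, and the toy gene names are assumptions made for illustration; the project's actual map-file format and tooling are not described on this page.

```python
from collections import Counter, defaultdict


def build_count_table(mappings):
    """Tally aligned reads per gene and lane.

    mappings: iterable of (lane, gene) pairs, one pair per aligned read.
    Returns a dict of gene -> Counter mapping lane -> read count.
    """
    counts = defaultdict(Counter)
    for lane, gene in mappings:
        counts[gene][lane] += 1
    return counts


# Hypothetical toy input: seven aligned reads across three lanes.
mappings = [
    ("l_1", "gene_A"), ("l_1", "gene_A"), ("l_2", "gene_A"),
    ("l_1", "gene_B"),
    ("l_3", "gene_B"), ("l_3", "gene_B"), ("l_3", "gene_B"),
]

table = build_count_table(mappings)
lanes = sorted({lane for lane, _ in mappings})

# Print one row per gene and one column per lane.
print("gene\t" + "\t".join(lanes))
for gene in sorted(table):
    # A Counter returns 0 for lanes in which the gene has no reads.
    print(gene + "\t" + "\t".join(str(table[gene][lane]) for lane in lanes))
```

Each row of the printed table is a gene and each column is a lane, matching the layout described above.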
Once the count files are created, they must be transposed and reformatted for Weka.
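The transposition step can be sketched as follows, assuming the target is Weka's ARFF format with one row per lane, one numeric attribute per gene, and the tissue as the class attribute. The function name `to_arff`, the relation name, and the toy counts are hypothetical; the page does not say which Weka input format or script the project actually used.

```python
def to_arff(genes, lanes, counts, tissue_of, relation="lamprey_counts"):
    """Convert a gene-by-lane count matrix into ARFF text.

    counts[i][j] is the read count for genes[i] in lanes[j].
    tissue_of maps each lane name to its tissue label (the class).
    """
    # Transpose so that each lane becomes one row of gene counts.
    by_lane = list(zip(*counts))
    classes = sorted(set(tissue_of.values()))

    lines = [f"@relation {relation}", ""]
    # One numeric attribute per gene (names with spaces would need quoting).
    lines += [f"@attribute {gene} numeric" for gene in genes]
    # Nominal class attribute listing every tissue label.
    lines.append("@attribute tissue {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for lane, row in zip(lanes, by_lane):
        lines.append(",".join(str(c) for c in row) + "," + tissue_of[lane])
    return "\n".join(lines) + "\n"


# Hypothetical toy data: two genes, three lanes.
genes = ["gene_A", "gene_B"]
lanes = ["l_1", "l_2", "l_3"]
counts = [
    [2, 1, 0],  # gene_A reads in l_1, l_2, l_3
    [1, 0, 3],  # gene_B reads in l_1, l_2, l_3
]
tissue_of = {"l_1": "liver", "l_2": "liver", "l_3": "kidney"}

arff_text = to_arff(genes, lanes, counts, tissue_of)
print(arff_text)
```

With this layout, each lane is an instance that a classifier can label with a tissue, and each gene is a feature.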
J48 Pruned Tree. Figure shows the decision tree generated by the J48 algorithm in WEKA. Out of approximately 25,000 genes, only four are used to accurately classify lanes as one of five tissues. The counts shown are post-normalization.
JRip Classifier Rules. Figure shows the rules generated by the JRip algorithm in WEKA. The counts shown are post-normalization.
Accuracy of Classifiers. The graph shows the 10-fold cross-validation accuracy of the four classifiers: J48, JRip, Naive Bayes and SMO. As expected, SMO outperforms the other classifiers, although Naive Bayes also performs quite well.
F-Measure by Tissue. The graphs show the F-Measure for each of the five tissues used. As with the previous graph, we can see that SMO and Naive Bayes clearly outperform J48 and JRip.
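Confusion Matrices. Rows are the actual tissue; columns are the tissue each lane was classified as (a = brain, b = gill, c = liver, d = kidney, e = intestine).

| JRip | a | b | c | d | e |
|---|---|---|---|---|---|
| a = brain | 7 | 0 | 1 | 0 | 0 |
| b = gill | 0 | 3 | 1 | 1 | 0 |
| c = liver | 0 | 0 | 13 | 0 | 1 |
| d = kidney | 0 | 0 | 1 | 7 | 2 |
| e = intestine | 0 | 0 | 4 | 0 | 7 |

| J48 | a | b | c | d | e |
|---|---|---|---|---|---|
| a = brain | 5 | 0 | 1 | 0 | 2 |
| b = gill | 0 | 4 | 0 | 0 | 1 |
| c = liver | 0 | 0 | 10 | 1 | 3 |
| d = kidney | 1 | 0 | 2 | 5 | 2 |
| e = intestine | 0 | 0 | 3 | 0 | 8 |

| Naive Bayes | a | b | c | d | e |
|---|---|---|---|---|---|
| a = brain | 8 | 0 | 0 | 0 | 0 |
| b = gill | 0 | 5 | 0 | 0 | 0 |
| c = liver | 0 | 0 | 12 | 0 | 2 |
| d = kidney | 0 | 0 | 0 | 10 | 0 |
| e = intestine | 0 | 0 | 1 | 0 | 10 |

| Support Vector Machine | a | b | c | d | e |
|---|---|---|---|---|---|
| a = brain | 8 | 0 | 0 | 0 | 0 |
| b = gill | 0 | 5 | 0 | 0 | 0 |
| c = liver | 0 | 0 | 13 | 0 | 1 |
| d = kidney | 0 | 0 | 0 | 10 | 0 |
| e = intestine | 0 | 0 | 1 | 0 | 10 |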
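Accuracy and per-tissue F-Measure can be derived directly from a confusion matrix. The sketch below uses the standard definitions (precision, recall, and their harmonic mean) on the Support Vector Machine and J48 matrices above; the function and variable names are illustrative, and the values it produces are computed from the matrices as printed here, not read off the project's graphs.

```python
TISSUES = ["brain", "gill", "liver", "kidney", "intestine"]

# Rows are actual tissues, columns are predicted tissues,
# copied from the confusion matrices above.
SMO = [
    [8, 0, 0, 0, 0],
    [0, 5, 0, 0, 0],
    [0, 0, 13, 0, 1],
    [0, 0, 0, 10, 0],
    [0, 0, 1, 0, 10],
]
J48 = [
    [5, 0, 1, 0, 2],
    [0, 4, 0, 0, 1],
    [0, 0, 10, 1, 3],
    [1, 0, 2, 5, 2],
    [0, 0, 3, 0, 8],
]


def accuracy(matrix):
    """Fraction of lanes on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def f_measure(matrix, k):
    """Harmonic mean of precision and recall for class index k."""
    true_pos = matrix[k][k]
    if true_pos == 0:
        return 0.0
    predicted = sum(row[k] for row in matrix)  # column total
    actual = sum(matrix[k])                    # row total
    precision = true_pos / predicted
    recall = true_pos / actual
    return 2 * precision * recall / (precision + recall)


for name, matrix in [("SMO", SMO), ("J48", J48)]:
    print(f"{name}: accuracy = {accuracy(matrix):.3f}")
    for k, tissue in enumerate(TISSUES):
        print(f"  {tissue}: F-Measure = {f_measure(matrix, k):.3f}")
```

For the matrices as printed, 46 of 48 lanes fall on the diagonal for the Support Vector Machine and 32 of 48 for J48.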
Expected Output. We expected to see clusters representing different tissues with a center cluster showing genes expressed in all tissues.
Liver vs. Brain Expression. This shows the average expression levels in brain tissue plotted vs. liver tissue. We see a large number of genes coexpressed but not much clustering.
| Student | Task |
|---|---|
| Joshua Hulst | Gene Clustering, preprocessing, final report |
| Elijah Lowe | Web visualization, presentation draft, final report |
| Jason Pell | Preprocessing and normalization, classification analysis, final report |
Over the years, the technology for sequencing genomic material has advanced greatly. The time and cost of sequencing have decreased while the amount of data produced has increased. This trend has created a greater need for mining algorithms to make sense of the data. Without data mining, the data does not yield much, if any, useful information.
The goal of this project was to identify coexpression of genes and classify organs from tissue samples using data compiled from the lamprey genome and data mining techniques.