[mira_talk] Challenge

  • From: "Thomas, Dallas" <Dallas.Thomas@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 20:58:22 +0000

Greetings All,

I am in a bit of a quandary and was hoping that someone out there might have an
idea on how to proceed. Let me break this down:

A while back I was asked to help with a miRNA analysis, which for many of you
would be no problem. My situation is a bit different, as before this request I
had never had any experience with miRNA. I was asked because nobody else at the
Research Centre where I work has any experience with miRNA - and well, I am the
only bioinformatician - so I must know.

For the analysis I was given two fasta files (these files are pre-assembled [not
by me] and do not come with any raw data - yay) to use as "reference" for the
analysis:


1. Copy of International Wheat Genome Sequencing Consortium
chromosome-based draft assembly (genomic) of bread wheat; and

2. Compilation of Wheat cDNA assemblies from UCDavis and a group here
on-site (this assembly was done a couple years back by a Post-Doc who has since
left)

Some basic stats of fasta files:


1. Draft Sequence: 12,536,807 contigs ranging in size from 71 bp to 129,043
bp with an N50 of 2,273 bp

2. Compilation: 348,312 contigs ranging in size from 64 bp to 26,226 bp with
an N50 of 1,460 bp
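
For anyone wanting to check numbers like these on their own files, here is a minimal sketch of how contig count, size range and N50 can be computed from a plain fasta file. It is not the tool that produced the figures above; the function names (contig_lengths, n50, summarize) are made up for illustration, and N50 is taken in its usual sense: the length of the contig at which the running total of lengths, sorted longest first, reaches half of all assembled bases.

```python
def contig_lengths(lines):
    """Yield the length of each record in an iterable of fasta lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Return count, shortest, longest and N50 for the given lengths."""
    lengths = list(lengths)
    if not lengths:
        return {"contigs": 0, "min": 0, "max": 0, "n50": 0}
    return {
        "contigs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "n50": n50(lengths),
    }


# Hypothetical usage (the file name is an assumption):
# with open("draft_assembly.fasta") as handle:
#     print(summarize(contig_lengths(handle)))
```

The sketch keeps only the lengths in memory, not the sequences, so it should remain workable for files with millions of contigs.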

I decided to start the miRNA analysis by predicting unknown miRNA. I am
following the mirdeep-p pipeline and am working at identifying the precursor
sequences. My other data has all been preprocessed and is ready for alignment
to the "genome" - which in this case are my two files above. Now I could
ignore the draft genomic assembly data and work with just the transcriptomic
assembly; however, I have been asked to use both, which unfortunately means
combining the two. And here is where we get to the issue at hand.

To begin with I have two assemblies, that I have not worked on and really know
nothing about. I do not have any of the raw or scratch data used for their
assemblies, or even a write-up of the methods used in their assembly. One file
is genomic data the other is transcriptomic. Both datasets are relatively
large and to top it off they are wheat.

I began by trying to merge the two assemblies and looked at using minimus2
with Amos and gam-ngs. With minimus2 the combined datasets blew past the
limits: I was working with a dataset 20x larger than the hard maximum. For
gam-ngs I do not have the raw data it needs, so I cannot even proceed past the
initial phase.

I then began to ponder the idea of chopping the contigs into smaller overlapping
sequences or paired sequences and performing a de novo assembly on the
resultant data. What I have concluded from this approach is I now have a
rather large fasta file with no quality information for which I have to perform
a de novo assembly. Just typing that was nasty. But let's think of this
further. The resultant fasta file will be too large for mira which means I
have to use another approach. There is a relatively common assembly pipeline
used for extremely large datasets utilizing soap, newbler, sga and minimus2 -
however, I personally have issues with this approach. I have wondered about
other possibilities, but I question their effectiveness. Which leaves me where
I am today - at a loss.
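
To make the chopping idea above concrete, here is a rough sketch of cutting one contig into overlapping windows. The function name chop and the window and step sizes are assumptions chosen for illustration, not values that have been tested on these assemblies. The last window is anchored at the end of the contig so the tail is never dropped.

```python
def chop(seq, window, step):
    """Cut seq into overlapping pieces; return a list of (offset, piece).

    window is the piece length and step is the distance between the
    starts of consecutive pieces, so step < window gives overlap.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if step > window:
        raise ValueError("step larger than window would leave gaps")
    if len(seq) <= window:
        # Contig is already short enough; keep it whole.
        return [(0, seq)]
    pieces = []
    start = 0
    while start + window < len(seq):
        pieces.append((start, seq[start:start + window]))
        start += step
    # Final piece is anchored at the contig end so the tail is covered.
    last = len(seq) - window
    pieces.append((last, seq[last:]))
    return pieces


# Illustrative values only: 10-base toy contig, 4-base windows, 2-base step.
for offset, piece in chop("ABCDEFGHIJ", window=4, step=2):
    print(offset, piece)
```

The pieces carry no quality information, exactly as noted above, so any assembler fed with them would have to accept plain fasta input.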

If there are any suggestions on how to proceed I am more than happy to listen
and they would be greatly appreciated. If this is like - woah, um I have no
clue - that is totally ok because that is where I am at :)

Thanks
Dallas
