[mira_talk] Re: Challenge

  • From: Adrian Pelin <apelin20@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 16:38:06 -0500

I really think you need a better assembly. What you have now is outright
bad. I doubt mira can help you merge these assemblies. One thing you
could try is to simulate reads from both assemblies and then assemble both
sets simultaneously. Needless to say, this is very risky and likely to
produce misassemblies.

On Thursday, 5 November 2015, Thomas, Dallas <Dallas.Thomas@xxxxxxxxx>
wrote:

Greetings All,



I am in a bit of a quandary and was hoping that someone out there might
have an idea on how to proceed. Let me break this down:



A while back I was asked to help with a miRNA analysis, which for many of
you might come across as ok – that is no problem. My take is a bit
different, as prior to this request I had never had any experience with
miRNA. I was asked because nobody else at the Research Centre where I work has
had any experience with miRNA – and well I am the only bioinformatician –
so I must know.



For the analysis I was given two fasta files (these files are pre-assembled
[not by me] and do not come with any raw data – yay) to use as “reference”
for analysis:



1. Copy of International Wheat Genome Sequencing Consortium
chromosome-based draft assembly (genomic) of bread wheat; and

2. Compilation of Wheat cDNA assemblies from UCDavis and a group here
on-site (this assembly was done a couple years back by a Post-Doc who has
since left)



Some basic stats of fasta files:



1. Draft Sequence: 12,536,807 contigs ranging in size from 71 bp to
129,043 bp with an N50 of 2,273 bp

2. Compilation: 348,312 contigs ranging in size from 64 bp to 26,226
bp with an N50 of 1,460 bp
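
For anyone who wants to recompute figures like these on their own files, here is a minimal Python sketch. It is not part of the original message: the helper names `fasta_lengths` and `n50` are made up for illustration, and it assumes the usual definition of N50 (the contig length at which the running total of lengths, sorted longest first, reaches half of the assembly size).

```python
def fasta_lengths(lines):
    """Return the length of every record in FASTA-formatted lines.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    lengths = []
    current = None  # None until the first '>' header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

With a real file this would be used along the lines of `with open("contigs.fasta") as fh: print(n50(fasta_lengths(fh)))`, where the file name is a placeholder. Only the lengths are held in memory, so it should cope with an assembly of several million contigs.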



I decided to start the miRNA analysis by predicting unknown miRNA. I am
following the mirdeep-p pipeline and am working at identifying the
precursor sequences. My other data has all been preprocessed and is ready
for alignment to the “genome” – which in this case are my two files above.
Now I could ignore the draft genomic assembly data and work with just the
transcriptomic assembly; however, I have been asked to use both, which
unfortunately means combining the two. And here is where we get to the
issue at hand.



To begin with, I have two assemblies that I have not worked on and really
know nothing about. I do not have any of the raw or scratch data used to
build them, or even a write-up of the methods used in their
assembly. One file is genomic data, the other is transcriptomic. Both
datasets are relatively large and, to top it off, they are wheat.



I began by trying to merge the two assemblies and looked at using
minimus2 with AMOS and gam-ngs. With minimus2 the combined datasets
blew past the boundaries: I was working with a dataset 20x larger than the
hard maximum limit. For gam-ngs I do not have enough information in the way
of raw data, so I cannot even proceed past the initial phase.



I then began to ponder the idea of chopping the contigs into smaller
overlapping sequences or paired sequences and performing a de novo assembly
on the resultant data. What I have concluded from this approach is that I
now have a rather large fasta file with no quality information, on which I
have to perform a de novo assembly. Just typing that was nasty. But let’s
think about this further. The resultant fasta file will be too large for
mira, which means I have to use another approach. There is a relatively
common assembly pipeline used for extremely large datasets utilizing soap,
newbler, sga and minimus2 – however I personally have issues with this
approach. I have wondered about other possibilities, but I question their
effectiveness. Which leaves me where I am today – at a loss.
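
To make the chopping idea concrete, here is a rough Python sketch of cutting one contig into fixed-size overlapping pieces. It is only an illustration, not a recommendation: the function name `chop` and the default `window` and `step` values are assumptions, and the pieces carry no quality information, which is exactly the drawback described above.

```python
def chop(seq, window=500, step=250):
    """Cut `seq` into overlapping pieces of length `window`.

    Consecutive pieces start `step` bases apart, so with step < window
    neighbouring pieces overlap by (window - step) bases. Contigs no longer
    than `window` are returned whole.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(seq) <= window:
        return [seq]
    pieces = [seq[i:i + window]
              for i in range(0, len(seq) - window + 1, step)]
    # If the regular grid stops short of the contig end, add one final
    # piece anchored at the end so that no bases are dropped.
    if (len(seq) - window) % step != 0:
        pieces.append(seq[-window:])
    return pieces
```

For example, `chop("ABCDEFGHIJK", window=4, step=3)` gives `["ABCD", "DEFG", "GHIJ", "HIJK"]`. Every base of the contig ends up in at least one piece, but the output grows roughly by a factor of window/step, which is part of why the resulting file gets so large.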



If there are any suggestions on how to proceed I am more than willing to
listen, and they would be greatly appreciated. If this is like – woah, um I
have no clue – that is totally ok because that is where I am at :)



Thanks

Dallas
