[mira_talk] Re: Challenge

  • From: "Thomas, Dallas" <Dallas.Thomas@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Fri, 6 Nov 2015 16:10:05 +0000

Merging the assemblies might not actually accomplish anything, and performing
the precursor search on each assembly separately might be better. My original
thought was that, with a good merge, precursors missed in the separate searches
because of where contigs start and stop, or how long they are, might be found
in a longer merged contig. Then again, given the difficulty of merging, I am
not sure this is worth it.

From: mira_talk-bounce@xxxxxxxxxxxxx On Behalf Of Chris Hoefler
Sent: November-05-15 6:05 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: Challenge

Ah, ok, sorry I missed it the first time reading through. So, is it correct to
say that you are dealing with pre-miRNA predictions, and want to map these back to
your references for verification, but are trying to figure out how to deal with
the enormous amount of data? Sorry for being dense, but I'm not understanding
what merging of the assemblies is supposed to accomplish.


On Nov 5, 2015, at 5:21 PM, "Thomas, Dallas" <Dallas.Thomas@xxxxxxxxx> wrote:
I have already done A – that was step one. The second part was to see if I
could get better precursors and better prediction by using both the
transcriptome and genome data. As for B, I have looked into that and am dealing
with a lot of matched regions – this is part of the difficulty with wheat, as
Monica mentioned.

Thanks for the input.

From: mira_talk-bounce@xxxxxxxxxxxxx On Behalf Of Chris Hoefler
Sent: November-05-15 3:50 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: Challenge

I guess I'm not understanding why you want to merge the assemblies. Can't you,
A) Use the cDNA assembly directly, throwing out incomplete transcripts (i.e.,
no 5' upstream and/or polyA). Or,
B) Use the cDNA contigs to fish out regions of genomic sequence that you can
then expand around to identify potential transcript sequence. Or,
C) Go with a full de novo prediction of transcript sequence from the genomic
data?

Obviously, B) and C) will be very rough and completely useless wrt spliceforms,
etc, but the idea is to just reduce the genomic dataset and get something
manageable for your analysis.
Once you have a subset of data, just look for stem-loop structures that might
contain potential miRNA sequences. Map those back to the genome, refine your
analysis, repeat, etc. I'm not entirely sure how useful your result will be
without experimental verification, but it is something to work with.
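
As a rough illustration of the stem-loop screening step, here is a minimal
Python sketch of a crude pre-filter. It is not RNA folding and not what
miRDeep-P or any other tool does: it only counts ungapped base pairs between
the two ends of a window, allowing G-U wobble. The function names, window
width, step, and thresholds are all hypothetical choices for illustration;
a real analysis would hand the surviving windows to a proper folding program.

```python
# Crude hairpin pre-filter: counts ungapped pairs between the 5' and 3' arms
# of a window. All names and default values here are illustrative assumptions.

# Watson-Crick pairs plus G-U wobble, on RNA letters.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def arm_length(width, min_loop):
    """Number of positions per arm once min_loop bases are left unpaired."""
    arm = (width - min_loop) // 2
    if arm <= 0:
        raise ValueError("window must be longer than the loop")
    return arm


def hairpin_pairs(window, min_loop=4):
    """Count positions i where base i can pair with base len-1-i."""
    rna = window.upper().replace("T", "U")
    arm = arm_length(len(rna), min_loop)
    return sum((rna[i], rna[-1 - i]) in PAIRS for i in range(arm))


def candidate_windows(seq, width=80, step=10, min_fraction=0.7, min_loop=4):
    """Yield (start, paired) for windows whose arms are mostly complementary."""
    arm = arm_length(width, min_loop)
    for start in range(0, len(seq) - width + 1, step):
        paired = hairpin_pairs(seq[start:start + width], min_loop)
        if paired >= min_fraction * arm:
            yield start, paired


if __name__ == "__main__":
    # Toy sequence: a 4 bp stem around a 4 nt loop.
    print(list(candidate_windows("GGGGAAAACCCC", width=12, step=1,
                                 min_fraction=1.0)))
```

Because the pairing is ungapped, real precursors with bulges or asymmetric
arms will score lower than they should, so a filter like this is only useful
with a permissive threshold and a proper folding step afterwards.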

On Thu, Nov 5, 2015 at 2:58 PM, Thomas, Dallas <Dallas.Thomas@xxxxxxxxx> wrote:
Greetings All,

I am in a bit of a quandary and was hoping that someone out there might have an
idea on how to proceed. Let me break this down:

A while back I was asked to help with a miRNA analysis, which to many of you
might sound like no problem at all. My situation is a bit different, as before
this request I had never had any experience with miRNA. I was asked because
nobody else at the Research Centre where I work has any experience with miRNA
either – and, well, I am the only bioinformatician – so I must know.

For the analysis I was given two fasta files (these files are pre-assembled [not
by me] and do not come with any raw data – yay) to use as “reference” for
analysis:


1. Copy of International Wheat Genome Sequencing Consortium
chromosome-based draft assembly (genomic) of bread wheat; and

2. Compilation of Wheat cDNA assemblies from UCDavis and a group here
on-site (this assembly was done a couple years back by a Post-Doc who has since
left)

Some basic stats of fasta files:


1. Draft Sequence: 12,536,807 contigs ranging in size from 71 bp to 129,043
bp with an N50 of 2273

2. Compilation: 348,312 contigs ranging in size from 64 bp to 26,226 bp and
an N50 of 1460
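
For anyone wanting to reproduce summary numbers like these on their own
files, here is a minimal Python sketch. It assumes plain, uncompressed FASTA
input and uses the common N50 definition (the length of the contig at which
the running total of length-sorted contigs first reaches half of all bases).
The function names are made up for this example, and this is not necessarily
how the figures above were produced.

```python
# Minimal FASTA summary: contig count, shortest, longest, N50.
# Function names are illustrative; only the standard library is used.
import sys


def fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Length of the contig at which the sorted running sum reaches half."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for contig_length in ordered:
        running += contig_length
        if running >= half:
            return contig_length
    return 0


def summarize(handle):
    """Return contig count, min, max and N50 for a FASTA stream."""
    lengths = list(fasta_lengths(handle))
    if not lengths:
        raise ValueError("no FASTA records found")
    return {"contigs": len(lengths), "min": min(lengths),
            "max": max(lengths), "n50": n50(lengths)}


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fasta:
        print(summarize(fasta))
```

Only the lengths are kept in memory, not the sequences, so this should be
workable even for a file with millions of contigs.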

I decided to start the miRNA analysis by predicting unknown miRNA. I am
following the miRDeep-P pipeline and am working on identifying the precursor
sequences. My other data has all been preprocessed and is ready for alignment
to the “genome” – which in this case is my two files above. Now I could
ignore the draft genomic assembly data and work with just the transcriptomic
assembly; however, I have been asked to use both, which unfortunately means
combining the two. And here is where we get to the issue at hand.

To begin with I have two assemblies, that I have not worked on and really know
nothing about. I do not have any of the raw or scratch data used for their
assemblies, or even a write-up of the methods used in their assembly. One file
is genomic data the other is transcriptomic. Both datasets are relatively
large and to top it off they are wheat.

I began by trying to merge the two assemblies and looked at using minimus2
(from AMOS) and GAM-NGS. With minimus2 the combined datasets blew past the
boundaries: I was working with a dataset 20x larger than the set max hard
limit. For GAM-NGS I do not have enough information in the way of raw data, so
I cannot even proceed past the initial phase.

I then began to ponder the idea of chopping the contigs into smaller overlapped
sequences or paired sequences and performing a de novo assembly on the
resultant data. What I have concluded from this approach is that I now have a
rather large fasta file with no quality information on which I have to perform
a de novo assembly. Just typing that was nasty. But let’s think about this
further. The resultant fasta file will be too large for MIRA, which means I
have to use another approach. There is a relatively common assembly pipeline
used for extremely large datasets utilizing SOAP, Newbler, SGA and minimus2 –
however I personally have issues with this approach. I have wondered about
other possibilities, but I question their effectiveness. Which leaves me where
I am today – at a loss.
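
To make the chopping idea concrete, here is a minimal Python sketch of
cutting one contig into fixed-size overlapping fragments. The function name,
the fragment size, and the overlap are hypothetical; nothing here reflects
settings actually used or recommended in this thread.

```python
# Cut a contig into overlapping pseudo-reads. Names and sizes are
# illustrative assumptions, not settings taken from the discussion.

def chop(seq, size, overlap):
    """Return (offset, fragment) pairs covering seq with overlapping windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not seq:
        return []
    if len(seq) <= size:
        # Contig is shorter than one window: keep it whole.
        return [(0, seq)]
    step = size - overlap
    fragments = []
    start = 0
    while start + size < len(seq):
        fragments.append((start, seq[start:start + size]))
        start += step
    # Anchor the final window to the contig end so the tail is not lost.
    fragments.append((len(seq) - size, seq[-size:]))
    return fragments


if __name__ == "__main__":
    for offset, fragment in chop("ABCDEFGHIJK", size=4, overlap=2):
        print(offset, fragment)
```

Away from the contig ends each base appears in roughly size / (size - overlap)
fragments, so a generous overlap multiplies the size of an already large
dataset, and the fragments still carry no quality values.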

If there are any suggestions on how to proceed I am more than willing to listen
and they would be greatly appreciated. If this is like – woah, um I have no
clue – that is totally ok because that is where I am at ☺

Thanks
Dallas



--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
