[mira_talk] Re: Challenge

  • From: "Thomas, Dallas" <Dallas.Thomas@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 23:23:32 +0000

Thanks – will look into this.

Dallas

From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx] On
Behalf Of Juan Daniel Montenegro Cabrera
Sent: November-05-15 4:16 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: Challenge


Hi,
If you look in the SRA database you will find the raw reads used for the
assembly of the wheat genome. I downloaded those not long ago. The raw RNA-seq
data used to assemble the transcriptome is in a repository on the URGI
Versailles web page. That is, if you still want the raw data.

I am pretty sure that because of the method used to sequence and assemble the
transcriptome you will not find a great deal of miRNAs in that data, so I would
work only with the genome and do an ab-initio prediction on it based on known
miRNA databases.
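
As a rough illustration of the database-driven idea above: one simple starting
point is to scan the genomic contigs for near-exact copies of known mature
miRNA sequences on both strands. The sketch below is only a toy. The function
names and the `toy_mir` entry are made up for illustration, and a brute-force
scan like this is far too slow for a genome the size of wheat, where a
short-read aligner or BLAST would be the realistic choice.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_known(genome, mirnas, max_mismatches=1):
    """Yield (name, start, strand, mismatches) for near-matches.

    `mirnas` maps a name to a mature miRNA sequence (RNA or DNA alphabet).
    Brute force: every window of the genome is compared to every query.
    """
    genome = genome.upper()
    for name, mature in mirnas.items():
        query = mature.upper().replace("U", "T")  # compare in DNA alphabet
        for strand, pattern in (("+", query), ("-", revcomp(query))):
            for start in range(len(genome) - len(pattern) + 1):
                window = genome[start:start + len(pattern)]
                mismatches = sum(a != b for a, b in zip(window, pattern))
                if mismatches <= max_mismatches:
                    yield name, start, strand, mismatches


# Toy data only: a made-up 20-nt "miRNA" planted in a short contig.
toy = {"toy_mir": "UGACAGAAGAGAGUGAGCAC"}
contig = "AAAA" + "TGACAGAAGAGAGTGAGCAC" + "CCCC"
hits = list(find_known(contig, toy, max_mismatches=0))
```

Hits found this way are only seeds. Each one would still need its flanking
sequence checked for a plausible precursor hairpin.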

Hope it helps and good luck. Wheat is a nightmare to work with. BTW, there is
another reference available for wheat, W7984, that was built using whole genome
shotgun. It is as good as the chromosome-based one.

Regards,

Juan Montenegro
On 6 Nov 2015 8:55 am, "Chris Hoefler" <hoeflerb@xxxxxxxxx> wrote:
Oh, and you might find something useful here,
http://wwwmgs.bionet.nsc.ru/mgs/programs/rnaanalys/mirna_premirna_prediction_tools.html
Some tools appear to be human genome specific, but others look like they could
be fairly general, so maybe there is something there to try.

On Thu, Nov 5, 2015 at 4:50 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:
I guess I'm not understanding why you want to merge the assemblies. Can't you,
A) Use the cDNA assembly directly, throwing out incomplete transcripts (i.e.,
no 5' upstream and/or polyA), or
B) Use the cDNA contigs to fish out regions of genomic sequence that you can
then expand around to identify potential transcript sequence, or
C) Go with a full de novo prediction of transcript sequence from the genomic
data?

Obviously, B) and C) will be very rough and completely useless with respect to
splice forms, etc., but the idea is just to reduce the genomic dataset and get
something manageable for your analysis.
Once you have a subset of data, just look for stem-loop structures that might
contain potential miRNA sequences. Map those back to the genome, refine your
analysis, repeat, etc. I'm not entirely sure how useful your result will be
without experimental verification, but it is something to work with.
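
To make the stem-loop step a little more concrete, here is a very crude sketch
of a hairpin pre-filter. It slides a fixed-width window along a sequence and
counts how many bases in the 5' arm can pair with the mirrored position in the
3' arm. The function name and the arm, loop and threshold values are arbitrary
assumptions, and this is not a folding algorithm: it ignores bulges, variable
loop sizes and free energy, so a proper RNA folding program would still be
needed on whatever survives.

```python
# Watson-Crick pairs plus G-U wobble, written in the DNA alphabet (U as T).
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def hairpin_candidates(seq, arm=20, loop=10, min_pairs=16):
    """Yield (start, end, paired) for windows that could fold into a hairpin.

    Each window is laid out as 5' arm + loop + 3' arm. Position i of the
    5' arm is compared with position i counted from the end of the window.
    """
    seq = seq.upper()
    width = 2 * arm + loop
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        paired = sum((window[i], window[-1 - i]) in PAIRS for i in range(arm))
        if paired >= min_pairs:
            yield start, start + width, paired


# Toy example: a perfect 10-bp stem around a 4-nt loop.
toy_hairpin = "GGGGGAAAAA" + "ACAC" + "TTTTTCCCCC"
toy_hits = list(hairpin_candidates(toy_hairpin, arm=10, loop=4, min_pairs=10))
```

Neighbouring windows around a real hairpin will usually all pass, so hits
would need merging before mapping anything back to the genome.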

On Thu, Nov 5, 2015 at 2:58 PM, Thomas, Dallas <Dallas.Thomas@xxxxxxxxx> wrote:
Greetings All,

I am in a bit of a quandary and was hoping that someone out there might have an
idea on how to proceed. Let me break this down:

A while back I was asked to help with a miRNA analysis, which for many of you
might come across as ok – that is no problem. My take is a bit different, as
prior to this request I had never had any experience with miRNA. I was asked
because nobody else at the Research Centre where I work has any experience
with miRNA – and, well, I am the only bioinformatician – so I must know.

For the analysis I was given two fasta files (these files are pre-assembled [not
by me] and do not come with any raw data – yay) to use as “reference” for
the analysis:


1. Copy of International Wheat Genome Sequencing Consortium
chromosome-based draft assembly (genomic) of bread wheat; and

2. Compilation of Wheat cDNA assemblies from UCDavis and a group here
on-site (this assembly was done a couple years back by a Post-Doc who has since
left)

Some basic stats of fasta files:


1. Draft Sequence: 12,536,807 contigs ranging in size from 71 bp to 129,043
bp, with an N50 of 2,273 bp

2. Compilation: 348,312 contigs ranging in size from 64 bp to 26,226 bp, with
an N50 of 1,460 bp
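
For anyone wanting to reproduce stats like these on their own files, a minimal
sketch follows. It uses the common definition of N50: the contig length at
which the running total, counted from the longest contig down, first reaches
half of all assembled bases. The function names and the file name in the usage
comment are placeholders, not anything from the analysis described here.

```python
def fasta_lengths(lines):
    """Yield the sequence length of each record in an iterable of FASTA lines."""
    length = None
    for line in lines:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line.strip())
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Usage (hypothetical file name):
# with open("draft_assembly.fasta") as handle:
#     print(n50(fasta_lengths(handle)))
```

Sorting 12.5 million lengths in memory is not a problem on an ordinary
workstation; it is the sequence data itself that gets heavy.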

I decided to start the miRNA analysis by predicting unknown miRNA. I am
following the miRDeep-P pipeline and am working on identifying the precursor
sequences. My other data has all been preprocessed and is ready for alignment
to the “genome” – which in this case is my two files above. Now I could
ignore the draft genomic assembly data and work with just the transcriptomic
assembly; however, I have been asked to use both, which unfortunately means
combining the two. And here is where we get to the issue at hand.

To begin with, I have two assemblies that I have not worked on and really know
nothing about. I do not have any of the raw or scratch data used for their
assembly, or even a write-up of the methods used. One file is genomic data,
the other is transcriptomic. Both datasets are relatively large and, to top it
off, they are wheat.

I began by trying to merge the two assemblies and looked at using minimus2
(from AMOS) and GAM-NGS. With minimus2 the combined datasets blew past the
limits: I was working with a dataset 20x larger than the hard maximum. For
GAM-NGS I do not have enough in the way of raw data, so I cannot even proceed
past the initial phase.

I then began to ponder the idea of chopping the contigs into smaller overlapping
sequences or paired sequences and performing a de novo assembly on the
resultant data. What I have concluded is that this approach leaves me with a
rather large fasta file, with no quality information, on which I have to
perform a de novo assembly. Just typing that was nasty. But let’s think about
this further. The resultant fasta file will be too large for MIRA, which means I
have to use another approach. There is a relatively common assembly pipeline
used for extremely large datasets utilizing SOAP, Newbler, SGA and minimus2 –
however, I personally have issues with this approach. I have wondered about
other possibilities, but I question their effectiveness. Which leaves me where
I am today – at a loss.
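
For reference, the chopping idea amounts to something like the sketch below:
fixed-size windows taken every `step` bases, plus one extra window so the tail
of the contig is not lost. The function name and the window and step sizes are
arbitrary placeholders. With a step of half the window size every base ends up
in roughly two pieces, which is part of why the resulting file gets so large.

```python
def chop(seq, size=500, step=250):
    """Split seq into overlapping windows of `size` bases, one every `step` bases.

    Sequences no longer than `size` are returned whole. A final window is
    added when the regular windows would leave the end of seq uncovered.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(seq) <= size:
        return [seq]
    pieces = [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]
    if (len(seq) - size) % step != 0:
        pieces.append(seq[-size:])  # cover the tail
    return pieces


# Toy example with tiny windows so the overlap is visible.
pieces = chop("ABCDEFGHIJK", size=4, step=2)
```

The pieces carry no quality values, so any assembler fed with them has to be
one that accepts plain fasta input.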

If there are any suggestions on how to proceed, I am more than willing to listen
and they would be greatly appreciated. If this is like – woah, um, I have no
clue – that is totally ok, because that is where I am at ☺

Thanks
Dallas


--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128



