[mira_talk] Re: Question about illumina de-novo sequencing

  • From: Mehmet göktay <mehmetgoktay1989@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 5 Sep 2014 11:12:46 +0300

Hi Andrej,
Our lab has already finished a project that used 454 reads to find SSRs
in two other plant genomes. This is the first time we are going to try
Illumina MiSeq reads for an assembly to find SSRs. We realized that we
are going to need a lot of RAM for this amount of data. I suppose we can
handle it with our new computer.

I'll give MIRA a shot when we get the reads, and I'll inform you all
about the run time and the statistics.
My plan B is to try the ABySS assembler (which I have tried with a small
synthetic dataset; its manual also claims that ABySS can handle
mammalian-sized genomes).

Hopefully one of these will work with this amount of data, and hopefully I
can get good enough contigs to search for SSRs in them.
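
For illustration only, here is a minimal sketch of what a search for SSRs in
contigs (or directly in merged reads) could look like. It is not the method of
any existing SSR tool; the function name find_ssrs and the minimum repeat
counts per motif length are assumptions chosen for the example, and it only
finds perfect repeats.

```python
import re

# Assumed thresholds (hypothetical): minimum number of copies per motif length.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, copies) for each perfect SSR found in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ACAC"),
            # since those are reported under the shorter motif length.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            copies = (m.end() - m.start()) // k
            hits.append((m.start(), m.end(), motif, copies))
    return hits


if __name__ == "__main__":
    example = "GGTT" + "AC" * 8 + "TTGA"
    for start, end, motif, copies in find_ssrs(example):
        print(start, end, motif, copies)
```

Keeping the coordinates makes it possible to pull out the flanking sequence on
both sides of each hit afterwards, which is what primer design would need.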

Thank you for your kind recommendation.

PS: If there are any further suggestions, I would appreciate hearing them.



On Fri, Sep 5, 2014 at 10:48 AM, Andrej Benjak <abenjak@xxxxxxxxx> wrote:

>  Hi Mehmet,
>
> The main problem with MIRA is that it is not designed for large genomes
> and big datasets. Miramem (MIRA's tool which estimates the RAM needed for a
> given dataset and genome size) says you would need 100 GB of RAM to run a de
> novo assembly. This estimate is very rough, of course, but it gives you an
> idea.
> Another issue is runtime. Even if you had a computer with dozens of CPUs,
> it would take a long time (can someone be more precise? I have never
> assembled a genome that big with MIRA).
>
> So you could give it a try only if you have a large computer that you can
> tie up for a few weeks (and let us know how long it took). But you will
> also want to try other programs that are more suitable for large
> genomes and short reads.
>
> Finally, the low coverage and relatively short reads might be a problem
> for assembling SSRs.
> In theory, only SSRs shorter than 400-500 bases should get assembled, *if*
> the 2x300 PE reads are of decent quality *and* the fragment size is small
> enough for the pairs to overlap and be merged prior to assembly.
> If the fragment size were around 400, most of your pairs should get merged.
>
> There will still be some problems with merging reads from SSRs: merging is
> bound to go wrong because of the sequence ambiguity of the overlaps, i.e. the
> resulting SSR size might be incorrect, but at least you would have the
> flanking sequence in a single read. Because merging is problematic for SSRs,
> they will cause problems for the assembler: the same SSR might be
> represented by several contigs of different sizes, or many reads could get
> discarded as debris. Because of this, you might want to look for SSRs
> directly in the merged reads (I think there are some programs just for
> that, but don't take my word for it).
>
> On the other hand, assembling 400 bp reads should at least recover a large
> part of the genic (non-repetitive) part of the genome. That is assuming you
> are dealing with a homozygous genome; otherwise it's going to be worse.
>
>
> Ah, and a little warning about 2x300. In reality it's always less:
> Our experience with 2x300 libraries is that the quality of the bases
> after 250ish is crap, especially for R2 reads. In fact, the last bases are
> often so bad that they are pure garbage, impeding proper merging of
> the reads.
> You could get luckier with your sequencing provider, but let's be
> realistic and assume that after some trimming you end up with pairs of
> 270+250 bases. At least 20-30 bases should overlap for reliable merging,
> ending up with merged reads of about 400 bases.
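
The arithmetic behind this estimate can be sketched as follows. This is only a
back-of-the-envelope illustration using the trimmed lengths (270 and 250) and
the 20-30 base minimum overlap mentioned above; the function names are made up
for the example and the sketch ignores adapter read-through and quality
effects.

```python
def pair_overlap(r1_len, r2_len, fragment_len):
    """Bases shared by R1 and R2 when both are read from one fragment.

    A value <= 0 means the two reads do not reach each other.
    """
    return r1_len + r2_len - fragment_len


def max_mergeable_fragment(r1_len, r2_len, min_overlap):
    """Longest fragment that still leaves min_overlap bases of overlap."""
    return r1_len + r2_len - min_overlap


if __name__ == "__main__":
    # Trimmed pairs of 270 + 250 bases on a 400 bp fragment.
    print(pair_overlap(270, 250, 400))
    # Upper bound on fragment (= merged read) length for 30 and 20 bases overlap.
    print(max_mergeable_fragment(270, 250, 30))
    print(max_mergeable_fragment(270, 250, 20))
```

Since a merged pair spans the whole fragment, the merged read length equals
the fragment length, so fragments of around 400 bp leave a comfortable overlap
while anything much beyond roughly 490-500 bp would not merge.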
>
>
> I hope this helps with your planning.
>
> cheers,
> Andrej
>
>
>
> On 09/05/2014 08:48 AM, Mehmet göktay wrote:
>
>  Hi Chris,
>
>  About your two-fold recommendation: we expect the genome size to be
> about 900 Mb. The company will supply us with 15 Gb, which is about 10x
> coverage. Our aim is not to assemble the whole genome; we are planning to
> get contigs long enough to search for SSRs in them.
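
As a rough cross-check of the coverage figure, average depth is simply total
sequenced bases divided by genome size. The sketch below uses the numbers
quoted in this thread (15 Gb of output, a genome of 800-900 Mb); the function
name is made up for the example. Note that the raw ratio comes out higher than
the roughly 10x mentioned above, so that figure may already assume losses from
trimming or filtering; the thread does not say.

```python
def mean_coverage(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    output_bases = 15e9  # 15 Gb of raw output, as quoted in the thread
    for genome_size in (800e6, 900e6):
        depth = mean_coverage(output_bases, genome_size)
        print(f"{genome_size / 1e6:.0f} Mb genome: {depth:.1f}x")
```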
>
>  Do you think we can accomplish this with MIRA?
>
>  Thank you for your kind answers
>
>
>
>
> On Thu, Sep 4, 2014 at 6:31 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:
>
>>
>>  First of all,
>>
>> My question is: do you think I can assemble this genome with MIRA?
>>
>>
>>  Can you? Probably. Should you? MIRA is not designed with these genome
>> sizes in mind and will present some technical challenges with such a large
>> project.
>>
>>  That said,
>>
>>> If so, do you think I can get good contigs of satisfying length?
>>
>>
>>  Since we just finished talking about a similar problem recently, my
>> recommendation is two-fold. First, define your biological question
>> precisely. What is it you want to get out of the genome? Assembling a
>> genome of this size is no trivial exercise, so first determine the
>> question(s) you are trying to answer. A SNP analysis, for example, is a
>> very different question from a "How much lateral gene transfer has occurred
>> between these related species?" type of question. What is a "satisfying
>> contig length" will depend a lot on this question.
>>
>>  Second, since you are getting the sequencing from a company, talk with
>> their bioinformatics support group about what you are trying to accomplish.
>> They shouldn't sell you a sequencing service if it won't meet your goals.
>> They may recommend more data, a different type of data, or a different
>> approach altogether. And they may also be able to offer assistance with the
>> analysis.
>>
>>
>>
>> On Thu, Sep 4, 2014 at 3:32 AM, Mehmet göktay <mehmetgoktay1989@xxxxxxxxx> wrote:
>>
>>>    Hi everyone,
>>>
>>>  I have a question and hopefully someone can answer it.
>>>  We are about to start a de novo assembly of a plant genome with
>>> Illumina reads. The company offered us 2x300 paired-end reads with an
>>> output of about 15 Gb. We are not sure, but the genome size of this
>>> plant is probably 800 Mb.
>>>
>>>  My question is: do you think I can assemble this genome with MIRA?
>>> If so, do you think I can get good contigs of satisfying length?
>>>
>>>  Thank you for your answers.
>>>  Mehmet
>>>
>>> --
>>> Mehmet Göktay, MSc student
>>> Department of Molecular Biology and Genetics
>>> Izmir Institute of Technology
>>> 35430, Urla, Izmir, TURKEY
>>>  (For the website of Plant Molecular Genetics Laboratory please click
>>> here <http://plantmolgen.iyte.edu.tr/>.)
>>>
>>>
>>>
>>
>>
>>  --
>> Chris Hoefler, PhD
>> Postdoctoral Research Associate
>> Straight Lab
>> Texas A&M University
>> 2128 TAMU
>> College Station, TX 77843-2128
>>
>
>
>
>


-- 
Mehmet Göktay, MSc student
Department of Molecular Biology and Genetics
Izmir Institute of Technology
35430, Urla, Izmir, TURKEY
(For the website of Plant Molecular Genetics Laboratory please click here
<http://plantmolgen.iyte.edu.tr/>.)
