[mira_talk] Re: Question about illumina de-novo sequencing

From: Andrej Benjak <abenjak@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 05 Sep 2014 09:48:43 +0200

Hi Mehmet,

The main problem with MIRA is that it is not designed for large genomes and big datasets. Miramem (MIRA's tool which estimates the RAM needed for a given dataset and genome size) says you would need 100GB of RAM to run a de novo assembly. This estimate is very rough, of course, but gives you an idea.

Another issue is runtime. Even if you had a computer with dozens of CPUs, it would take a long time (can someone be more precise? I have never assembled a genome that big with MIRA).

So only if you had a large computer and could run a program on it for a few weeks could you give it a try (and let us know how long it took). But you will want to try other programs as well, ones more suitable for large genomes and short reads.

Finally, the low coverage and relatively short reads might be a problem for assembling SSRs. In theory, only SSRs shorter than 400-500 bases should get assembled, *if* the 2x300 PE reads are of decent quality *and* the fragment size is small enough for the pairs to overlap and be merged prior to the assembly.

If the fragment size were around 400, most of your pairs should get merged.
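
The merging condition is simple arithmetic: the two reads of a pair overlap by (R1 length + R2 length - fragment length) bases. A minimal Python sketch of that check follows. The function names are made up for illustration and are not taken from MIRA or from any particular read merger, and the default minimum overlap of 20 is just the lower end of the 20-30 base range mentioned in this message.

```python
def expected_overlap(r1_len, r2_len, fragment_len):
    """Bases covered by both reads of a pair.

    A negative value means the reads do not reach each other
    and leave an unsequenced gap in the middle of the fragment.
    """
    return r1_len + r2_len - fragment_len


def can_merge(r1_len, r2_len, fragment_len, min_overlap=20):
    """True if the pair overlaps by at least min_overlap bases.

    Simplification: fragments shorter than a single read (adapter
    read-through) are not handled here.
    """
    return expected_overlap(r1_len, r2_len, fragment_len) >= min_overlap


# Nominal 2x300 reads against a few fragment sizes.
for fragment in (400, 500, 580, 650):
    overlap = expected_overlap(300, 300, fragment)
    print(fragment, overlap, can_merge(300, 300, fragment))
```

With nominal 2x300 reads a 400 bp fragment leaves 200 overlapping bases, which is why most pairs should merge. Passing shorter, quality-trimmed read lengths instead of 300 shows how quickly that margin shrinks.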

There will still be some problems with merging reads from SSRs: the merge is bound to be unreliable because of the sequence ambiguity of the overlaps, i.e. the resulting SSR size might be incorrect, but at least you would have the flanking sequence in a single read. Because merging of SSRs is problematic, these reads will cause problems for the assembler: the same SSR might be represented by several contigs of different sizes, or many reads could get discarded as debris. Because of this, you might want to look for SSRs directly in the merged reads (I think there are some programs just for that, but don't take my word for it).
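
To give an idea of what looking for SSRs directly in the reads could mean, here is a minimal Python sketch that scans a sequence for perfect tandem repeats with a regular expression. It is only an illustration and not a replacement for a dedicated SSR tool: the function name and the thresholds (unit length 2-6, at least 5 copies) are arbitrary assumptions, and imperfect repeats or reads containing N are not handled.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Yield (start, end, unit, copies) for perfect tandem repeats in seq.

    Coordinates are 0-based, end-exclusive. The lazy quantifier makes
    the regex try the shortest repeat unit first at each position.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs such as AAAAAAAAAA.
            continue
        copies = len(match.group(0)) // len(unit)
        yield match.start(), match.end(), unit, copies


# Toy read: an (AC)6 repeat with short flanking sequence.
read = "TTG" + "AC" * 6 + "AGGT"
for hit in find_ssrs(read):
    print(hit)  # (3, 15, 'AC', 6)
```

Because each hit comes with its coordinates, it is easy to keep only SSRs that have enough flanking sequence on both sides within the same read.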

On the other hand, assembling 400bp reads should at least recover a large part of the genic (non-repetitive) part of the genome, assuming you are dealing with a homozygous genome; otherwise it's going to be worse.



Ah, and a little warning about 2x300. In reality it's always less:

Our experience with 2x300 libraries is that the quality of the bases after position 250 or so is crap, especially for R2 reads. In fact, the last bases are often so bad that they are pure garbage, impeding proper merging of the reads. You could get luckier with your sequencing provider, but let's assume reality, and after some trimming you end up with pairs of 270+250 bases. At least 20-30 bases should overlap for reliable merging, ending up with merged reads of about 400 bases.
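
A back-of-the-envelope sketch in Python of what these numbers imply. It assumes that the "15gb" in this thread means 15 gigabases of raw 2x300 data, that the genome is about 900 Mb, and that every pair merges into a single read of about 400 bases. All of these are simplifications, and the function names are made up for illustration.

```python
def max_mergeable_fragment(r1_len, r2_len, min_overlap):
    """Longest fragment whose two reads still overlap by min_overlap bases."""
    return r1_len + r2_len - min_overlap


def merged_coverage(raw_bases, genome_size, raw_pair_len, merged_len):
    """Coverage left if every raw pair collapses into one merged read."""
    n_pairs = raw_bases / raw_pair_len
    return n_pairs * merged_len / genome_size


RAW_BASES = 15e9      # assumption: 15 gigabases of raw sequence
GENOME_SIZE = 900e6   # assumption: about 900 Mb

# Trimmed pairs of 270+250 bases with 20-30 bases of required overlap.
print(max_mergeable_fragment(270, 250, 30))  # 490
print(max_mergeable_fragment(270, 250, 20))  # 500

# Raw coverage versus coverage after merging 600 raw bases into ~400.
print(round(RAW_BASES / GENOME_SIZE, 1))                          # 16.7
print(round(merged_coverage(RAW_BASES, GENOME_SIZE, 600, 400), 1))  # 11.1
```

The 490-500 base ceiling matches the 400-500 base limit for assembled SSRs mentioned earlier, and the overlapping bases of each pair are counted only once after merging, so the usable coverage is noticeably lower than the raw figure.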



I hope this helps with your planning.

cheers,
Andrej


On 09/05/2014 08:48 AM, Mehmet göktay wrote:

Hi Chris,

About your two-fold recommendation, we expect the genome size will be about 900 Mb. The company will supply us with 15 Gb, which is about 10x coverage. Our aim is not to assemble the whole genome; we are planning to get contigs long enough to search for SSRs in them.


Do you think we can accomplish it with MIRA?

Thank you for your kind answers

On Thu, Sep 4, 2014 at 6:31 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:



    First of all,

        My question is, do you think I can assemble this genome with
        MIRA?


    Can you? Probably. Should you? Mira is not designed with these
    genome sizes in mind and will present some technical challenges
    with such a large project.

    That said,

        If so, do you think I can get good contigs with satisfying lengths?


    Since we just finished talking about a similar problem recently,
    my recommendation is two-fold. First, define your biological
    question precisely. What is it you want to get out of the genome?
    Assembling a genome of this size is no trivial exercise, so first
    determine the question(s) you are trying to answer. A SNP
    analysis, for example, is a very different question from a "How
    much lateral gene transfer has occurred between these related
    species?" type of question. What is a "satisfying contig length"
    will depend a lot on this question.

    Second, since you are getting the sequencing from a company, talk
    with their bioinformatics support group about what you are trying
    to accomplish. They shouldn't sell you a sequencing service if it
    won't meet your goals. They may recommend more data, a different
    type of data, or a different approach altogether. And they may
    also be able to offer assistance with the analysis.



    On Thu, Sep 4, 2014 at 3:32 AM, Mehmet göktay
    <mehmetgoktay1989@xxxxxxxxx> wrote:

        Hi everyone,

        I have a question mark in my mind and hopefully someone will
        give the answer.
        We are about to start a de novo assembly of a plant with
        Illumina reads. The company offered us 2x300 paired-end reads
        with an output size of about 15 Gb. We are not sure, but the
        genome size of this plant is probably 800 Mb.

        My question is, do you think I can assemble this genome with
        MIRA? If so, do you think I can get good contigs with
        satisfying lengths?

        Thank you for your answers.
        Mehmet

--Mehmet Göktay, MSc student

        Department of Molecular Biology and Genetics
        Izmir Institute of Technology
        35430, Urla, Izmir, TURKEY
        (For the website of Plant Molecular Genetics Laboratory please
        click here <http://plantmolgen.iyte.edu.tr/>.)

--Chris Hoefler, PhD

    Postdoctoral Research Associate
    Straight Lab
    Texas A&M University
    2128 TAMU
    College Station, TX 77843-2128




--
Mehmet Göktay, MSc student
Department of Molecular Biology and Genetics
Izmir Institute of Technology
35430, Urla, Izmir, TURKEY

(For the website of Plant Molecular Genetics Laboratory please click here <http://plantmolgen.iyte.edu.tr/>.)
