[mira_talk] Re: Question about illumina de-novo sequencing

  • From: Andrej Benjak <abenjak@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 05 Sep 2014 09:48:43 +0200

Hi Mehmet,

The main problem with MIRA is that it is not designed for large genomes and big datasets. Miramem (MIRA's tool that estimates the RAM needed for a given dataset and genome size) says you would need 100 GB of RAM to run a de novo assembly. This estimate is very rough, of course, but it gives you an idea. Another issue is runtime. Even if you had a computer with dozens of CPUs, it would take a long time (can someone be more precise? I have never assembled a genome that big with MIRA).

So you could give it a try only if you had a large computer and could let a program run on it for a few weeks (and let us know how long it took). But you will want to try other programs as well, ones more suitable for large genomes and short reads.

Finally, the low coverage and relatively short reads might be a problem for assembling SSRs. In theory, only SSRs shorter than 400-500 bases should get assembled, *if* the 2x300 PE reads are of decent quality *and* the fragment size is small enough for the pairs to overlap and be merged prior to the assembly.
If the fragment size were around 400, most of your pairs should get merged.

There will still be some problems with merging reads from SSRs: the merge is bound to be error-prone because of the sequence ambiguity of the overlaps, i.e. the resulting SSR size might be incorrect, but at least you would have the flanking sequence in a single read. Badly merged SSRs will also cause problems for the assembler: the same SSR might be represented by several contigs of different sizes, or many reads could get discarded as debris. Because of this, you might want to look for SSRs directly in the merged reads (I think there are programs just for that, but don't take my word for it).
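
To make the "look for SSRs directly in the merged reads" idea concrete, here is a minimal Python sketch. It is only an illustration, not one of the dedicated programs mentioned above: the function name `find_ssrs`, the minimum repeat counts, and the 20-base flank requirement are all assumptions chosen for the example.

```python
import re

# Assumption: an SSR is a 1-6 bp motif repeated in tandem. The minimum
# repeat counts are illustrative thresholds, not a standard.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, flank=20):
    """Return a sorted list of (start, end, motif, repeats, has_flanks)."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT"),
            # since the shorter unit length reports them already.
            if any(
                unit_len % k == 0 and motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
            ):
                continue
            start, end = m.span(1)
            # A hit is only useful for marker design if the read also
            # carries enough flanking sequence on both sides.
            has_flanks = start >= flank and len(seq) - end >= flank
            hits.append((start, end, motif, (end - start) // unit_len, has_flanks))
    return sorted(hits)


# Hypothetical merged read: 20 flanking bases, (AG)x8, 20 flanking bases.
read = "ACGTTGCAAGCTTGACCTGC" + "AG" * 8 + "TCCGATGGCATTCAGGCTAA"
for hit in find_ssrs(read):
    print(hit)
```

Keep in mind that the repeat count reported for a merged read inherits any merging error, so only the flanking sequence should be trusted, not the SSR length.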

On the other hand, assembling 400 bp reads should at least recover a large part of the genic (non-repetitive) part of the genome. That assumes you are dealing with a homozygous genome; otherwise it's going to be worse.


Ah, and a little warning about 2x300. In reality it's always less:
Our experience with 2x300 libraries is that the quality of the bases after position 250ish is crap, especially for R2 reads. In fact, the last bases are often so bad that they are pure garbage, impeding proper merging of the reads. You could get luckier with your sequencing provider, but let's be realistic and assume that after some trimming you end up with pairs of 270+250 bases. At least 20-30 bases should overlap for reliable merging, so you end up with merged reads of about 400 bases.
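
The arithmetic behind these numbers can be sketched in a few lines of Python. This is a rough back-of-the-envelope model, not what any particular read merger does: the function names are made up for the example, it assumes the fragment is at least as long as the longer trimmed read, that the 15 Gb counts both reads of every 2x300 pair, and (optimistically) that every pair merges.

```python
def merge_outcome(r1_len, r2_len, fragment_len, min_overlap=20):
    """Return (overlap, merged_len); merged_len is None if the pair cannot merge.

    Assumes fragment_len >= max(r1_len, r2_len), i.e. no read-through
    into the adapter.
    """
    overlap = r1_len + r2_len - fragment_len
    if overlap < min_overlap:
        return overlap, None
    # A successfully merged pair spans the whole fragment.
    return overlap, fragment_len


def merged_coverage(total_bases, raw_read_len, merged_len, genome_size):
    """Coverage left after merging, assuming every pair merges."""
    pairs = total_bases / (2 * raw_read_len)
    return pairs * merged_len / genome_size


# Trimmed pairs of 270+250 bases on a 400 bp fragment.
print(merge_outcome(270, 250, 400))
# Longest fragment that still leaves a 20-base overlap.
print(merge_outcome(270, 250, 500))
# 15 Gb of 2x300 reads merged to ~400 bp on a 900 Mb genome.
print(round(merged_coverage(15e9, 300, 400, 900e6), 1))
```

Under these assumptions a 400 bp fragment leaves a comfortable 120-base overlap, anything longer than about 500 bp cannot be merged at all, and the merged data amounts to roughly 11x, in line with the "about 10x" figure quoted below.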


I hope this helps with your planning.

cheers,
Andrej


On 09/05/2014 08:48 AM, Mehmet göktay wrote:
Hi Chris,

About your two-fold recommendation: we expect the genome size to be about 900 Mb, and the company will supply us with 15 Gb, which is about 10x coverage. Our aim is not to assemble the whole genome; we are planning to get contigs long enough to search for SSRs in them.

Do you think we can accomplish this with MIRA?

Thank you for your kind answers




On Thu, Sep 4, 2014 at 6:31 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:


    First of all,

        My question is, do you think I can assemble this genome with
        MIRA?


    Can you? Probably. Should you? Mira is not designed with these
    genome sizes in mind and will present some technical challenges
    with such a large project.

    That said,

        If so, do you think I can get good contigs with satisfying lengths?


    Since we just finished talking about a similar problem recently,
    my recommendation is two-fold. First, define your biological
    question precisely. What is it you want to get out of the genome?
    Assembling a genome of this size is no trivial exercise, so first
    determine the question(s) you are trying to answer. A SNP
    analysis, for example, is a very different question from a "How
    much lateral gene transfer has occurred between these related
    species?" type of question. What is a "satisfying contig length"
    will depend a lot on this question.

    Second, since you are getting the sequencing from a company, talk
    with their bioinformatics support group about what you are trying
    to accomplish. They shouldn't sell you a sequencing service if it
    won't meet your goals. They may recommend more data, a different
    type of data, or a different approach altogether. And they may
    also be able to offer assistance with the analysis.



    On Thu, Sep 4, 2014 at 3:32 AM, Mehmet göktay
    <mehmetgoktay1989@xxxxxxxxx> wrote:

        Hi everyone,

        I have a question and hopefully someone can answer it.
        We are about to start a de novo assembly of a plant with Illumina
        reads. The company offered us 2x300 paired-end reads with an
        output size of about 15 Gb. We are not sure, but the genome
        size of this plant is probably 800 Mb.

        My question is, do you think I can assemble this genome with
        MIRA? If so, do you think I can get good contigs with
        satisfying lengths?

        Thank you for your answers.
        Mehmet

        --
        Mehmet Göktay, MSc student
        Department of Molecular Biology and Genetics
        Izmir Institute of Technology
        35430, Urla, Izmir, TURKEY
        (For the website of Plant Molecular Genetics Laboratory please
        click here <http://plantmolgen.iyte.edu.tr/>.)





    --
    Chris Hoefler, PhD
    Postdoctoral Research Associate
    Straight Lab
    Texas A&M University
    2128 TAMU
    College Station, TX 77843-2128




--
Mehmet Göktay, MSc student
Department of Molecular Biology and Genetics
Izmir Institute of Technology
35430, Urla, Izmir, TURKEY
(For the website of Plant Molecular Genetics Laboratory please click here <http://plantmolgen.iyte.edu.tr/>.)


