Hi Mehmet,

The main problem with MIRA is that it is not designed for large genomes and big datasets. Miramem (MIRA's tool that estimates the RAM needed for a given dataset and genome size) says you would need 100 GB of RAM to run a de novo assembly. This estimate is very rough, of course, but it gives you an idea. Another issue is runtime. Even if you had a computer with dozens of CPUs, it would take a long time (can someone be more precise? I have never assembled a genome that big with MIRA).
So you could give it a try only if you had a large computer and could tie it up with one program for a few weeks (and let us know how long it took). But you will want to try other programs as well, ones more suitable for large genomes and short reads.
Finally, the low coverage and relatively short reads might be a problem for assembling SSRs. In theory, only SSRs shorter than 400-500 bases should get assembled, *if* the 2x300 PE reads are of decent quality *and* the fragment size is small enough for the pairs to overlap and be merged prior to assembly.
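To put a rough number on the coverage, here is a minimal Python sketch using the figures from this thread (15 Gb of 2x300 reads, genome of about 900 Mb). The 400-base merged read length is an assumption for illustration, and the function name is made up for this example; real numbers depend on trimming and on how many pairs actually merge.

```python
def coverage(total_bases, genome_size):
    """Average sequencing depth: total bases divided by genome size."""
    return total_bases / genome_size

yield_bp = 15_000_000_000    # 15 Gb promised by the provider (from the thread)
genome_bp = 900_000_000      # expected genome size, about 900 Mb (from the thread)
read_len = 300               # 2x300 paired-end reads

# Each pair contributes 2 * read_len bases to the raw yield.
n_pairs = yield_bp // (2 * read_len)

# Assumption: every pair merges into a single read of about 400 bases.
merged_len = 400

raw_cov = coverage(yield_bp, genome_bp)
merged_cov = coverage(n_pairs * merged_len, genome_bp)

print(f"pairs: {n_pairs:,}")
print(f"raw coverage: {raw_cov:.1f}x")
print(f"coverage after merging: {merged_cov:.1f}x")
```

Under these assumptions the raw yield is about 17x, but once each pair collapses into one 400-base read the usable depth is closer to 11x, and pairs lost to trimming or failed merging push it lower still.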
If the fragment size were around 400, most of your pairs should get merged. There will still be some problems with merging reads from SSRs: the merge is bound to be incorrect in places because of the sequence ambiguity of the overlaps, i.e. the resulting SSR size might be wrong, but at least you would have the flanking sequence in a single read. Because of this problematic merging, SSRs will cause trouble for the assembler: the same SSR might be represented by several contigs of different sizes, or many reads could get discarded as debris. For that reason, you might want to look for SSRs directly in the merged reads (I think there exist some programs just for that, but don't take my word for it).
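Just to illustrate what "looking for SSRs directly in the merged reads" could mean, here is a minimal Python sketch based on a regular expression. It is not one of the dedicated programs and not a replacement for them: it only finds perfect tandem repeats, and the function name, the unit sizes (2-6 bases) and the minimum of 4 repeats are all assumptions chosen for this example.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats in one read.

    Returns a list of (start, end, unit, n_repeats) with 0-based,
    end-exclusive coordinates. Thresholds are illustrative defaults.
    """
    # Lazy unit match so the shortest repeating unit is reported
    # (e.g. "AG" rather than "AGAG"), followed by further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs such as AAAAAAAA.
            continue
        n_repeats = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, n_repeats))
    return hits

# Toy read: 7 bases of flank, (AG)x8, 5 bases of flank.
read = "GATTACA" + "AG" * 8 + "CCGTA"
for start, end, unit, n in find_ssrs(read):
    print(f"({unit})x{n} at {start}-{end}, "
          f"left flank {start} bp, right flank {len(read) - end} bp")
```

Reporting the flank lengths matters here because a repeat that runs right up to the end of a read has no flanking sequence on that side.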
On the other hand, assembling 400 bp reads should at least recover a large part of the genic (non-repetitive) part of the genome. That is assuming you are dealing with a homozygous genome; otherwise it's gonna be worse.
Ah, and a little warning about 2x300. In reality it's always less: our experience with 2x300 libraries is that the quality of the bases after position 250 or so is crap, especially for R2 reads. In fact, the last bases are often so bad that they are pure garbage, which prevents proper merging of the reads. You could get luckier with your sequencing provider, but let's be realistic and assume that after some trimming you end up with pairs of 270+250 bases. At least 20-30 bases should overlap for a reliable merging, ending up with merged reads of about 400 bases.
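The overlap arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope helper with made-up function names; it uses the 270+250 trimmed lengths from above and takes 30 bases (the conservative end of the 20-30 range) as the minimum overlap.

```python
def pair_overlap(r1_len, r2_len, fragment_len):
    """Number of bases shared by both reads of a pair.

    Zero or negative means the reads do not reach each other.
    The result is capped because the overlap cannot be longer
    than either read or than the fragment itself.
    """
    return min(r1_len, r2_len, fragment_len, r1_len + r2_len - fragment_len)

def max_mergeable_fragment(r1_len, r2_len, min_overlap):
    """Longest fragment whose two reads still overlap by min_overlap bases."""
    return r1_len + r2_len - min_overlap

R1, R2 = 270, 250      # trimmed read lengths assumed in the text
MIN_OVERLAP = 30       # conservative end of the 20-30 base range

print("longest mergeable fragment:", max_mergeable_fragment(R1, R2, MIN_OVERLAP))
for fragment in (400, 450, 490, 520):
    ov = pair_overlap(R1, R2, fragment)
    status = "merges" if ov >= MIN_OVERLAP else "too little overlap"
    print(f"fragment {fragment} bp: overlap {ov} bp, {status}")
```

So with these read lengths a 400 bp fragment leaves a comfortable overlap, while anything approaching 500 bp is right at the limit.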
I hope this helps with your planning.

Cheers,
Andrej

On 09/05/2014 08:48 AM, Mehmet göktay wrote:
Hi Chris,

About your two-fold recommendation: we expect the genome size to be about 900 Mb, and the company will supply us with 15 Gb, which is about 10x coverage. Our aim is not to assemble the whole genome; we are planning to get contigs long enough to search for SSRs in them. Do you think we can accomplish this with MIRA?

Thank you for your kind answers.

On Thu, Sep 4, 2014 at 6:31 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

> First of all, My question is, do you think I can assemble this genome with the mira?

Can you? Probably. Should you? Mira is not designed with these genome sizes in mind and will present some technical challenges with such a large project. That said,

> If so, do you think I can get good contig with satisfying lengths?

Since we just finished talking about a similar problem recently, my recommendation is two-fold.

First, define your biological question precisely. What is it you want to get out of the genome? Assembling a genome of this size is no trivial exercise, so first determine the question(s) you are trying to answer. A SNP analysis, for example, is a very different question from a "How much lateral gene transfer has occurred between these related species?" type of question. What is a "satisfying contig length" will depend a lot on this question.

Second, since you are getting the sequencing from a company, talk with their bioinformatics support group about what you are trying to accomplish. They shouldn't sell you a sequencing service if it won't meet your goals. They may recommend more data, a different type of data, or a different approach altogether. And they may also be able to offer assistance with the analysis.

On Thu, Sep 4, 2014 at 3:32 AM, Mehmet göktay <mehmetgoktay1989@xxxxxxxxx> wrote:

Hi everyone,

I have a question mark in my mind and hopefully someone will give the answer. We are about to start a de novo assembly of a plant with Illumina reads. The company offered us 2x300 paired-end reads with an output size of about 15 Gb. We are not sure, but the genome size of this plant is probably 800 Mb. My question is, do you think I can assemble this genome with the mira? If so, do you think I can get good contig with satisfying lengths?

Thank you for your answers.

Mehmet

--
Mehmet Göktay, MSc student
Department of Molecular Biology and Genetics
Izmir Institute of Technology
35430, Urla, Izmir, TURKEY
(For the website of Plant Molecular Genetics Laboratory please click here <http://plantmolgen.iyte.edu.tr/>.)

--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128

--
Mehmet Göktay, MSc student
Department of Molecular Biology and Genetics
Izmir Institute of Technology
35430, Urla, Izmir, TURKEY
(For the website of Plant Molecular Genetics Laboratory please click here <http://plantmolgen.iyte.edu.tr/>.)