[mira_talk] Re: MIRA: I am doing it wrong.

From: Sven Klages <sir.svencelot@xxxxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Fri, 4 Dec 2009 14:22:41 +0100

Hi Alessandro,

If they generated scaffolds, I assume you have paired-end reads?
Scaffolds are fine, and Roche software probably knows Roche data best.

But how many contigs make up these 39 scaffolds? (From what you have
written, I don't assume that 933 contigs form 39 clusters, or do they?)

On the other hand, I have a small phytoplasma genome here which is really
causing problems in assembly. IMHO the real problem is the data quality, due
to a GC content of ~26% and a somewhat repetitive nature. The reads
themselves are quite short, ~200 bp (Titanium data!).

I'd love to have some extra Sanger or Solexa coverage ;-)

Just some comments, no solution,
Sven

2009/12/4 Alessandro Riccombeni <rikkomba@xxxxxxxxx>

> Hi,
>
> thanks for the help.
> Yes, actually I think it was mentioned that we paid half the usual price
> for this sequencing, so it was probably done at a lower coverage from the
> start.
> The sequencing service provided a preassembled dataset as well: 39
> scaffolds and 933 contigs.
> Some info: 13 Mb, 2.38% are Ns, GC in the sequence is 36%, largest
> scaffold is 1.9 Mb and the smallest is 3 kb.
> After my first MIRA run with the 454 reads only (as they shouldn't have
> used any non-454 reads) I was quite clueless about which strategy they used
> to get 39 scaffolds where I got 11000 contigs. As I wrote, they didn't use
> any custom adaptor, so I don't know what I should do as far as preprocessing
> goes...
>
>
> On Thu, Dec 3, 2009 at 8:18 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>
>> On Thursday 03 December 2009 Alessandro Riccombeni wrote:
>> > I guess I am using MIRA in the wrong way.
>> > My dataset is composed of 512,000 454 Titanium reads, without any custom
>> > vector from the sequencing service. I have to assemble a fungal genome of
>> > around 13 Mb.
>> > [...]
>>
>> Hi Alessandro,
>>
>> regardless of what sequencing providers or Roche may tell you ... I don't
>> recommend coverages below 20x in 454 sequencing. 30x is a reasonably good
>> coverage to strike a balance between costs and number of contigs.
>>
>> Your numbers tell me you have *at most* a theoretical coverage of 16,
>> probably more in the region of 14, even as 'miramem' still uses the numbers
>> Roche gave in the early days (475). Having seen a few Titanium sets now, I
>> suspect that a mean length of 400 bases is more accurate. [Note to self:
>> change that ASAP in the miramem estimator]
>>
>> Even worse, the numbers appended show that MIRA estimates the average
>> coverage to be more like 8x or 9x at most. Which is somewhat disastrous.
>>
>> > [...]
>> > I got around 10500-11000 large contigs, trying normal, draft and accurate.
>> > I am adding info from the output at the end of this message.
>>
>> Something's wrong with your data, at least that's my impression. In the
>> simplest case you underestimated the size of the genome and it's not 13 Mb,
>> but more like 26 Mb.
>>
>> Then there may be the possibility of contamination: was it really just one
>> organism which was sequenced?
>>
>> Also, your bug might be highly repetitive, which is another possible cause
>> of a large number of contigs.
>>
>> And last but not least ... it may be a problem of the sequencing kit used.
>> If your organism has high GC (>=60%) and the Titanium data was generated
>> with a sequencing kit delivered in the first 7 to 8 months of this year,
>> then you need to talk to your provider as they need to talk to Roche.
>>
>> > I also tried doing a hybrid assembly with 4800 paired Sanger reads, getting
>> > around 9700 large contigs. I found out that there are vectors for the
>> > Forward and Reverse Sanger reads, so this is surely creating problems.
>> > Nonetheless, I expected to get a much better result from my 454 reads.
>>
>> The 5k paired Sanger reads won't help for such a fragmented assembly as the
>> 454 data suggests. Even if you trimmed away the sequencing vectors a bit
>> better: with 5k paired-end reads you can't scaffold 10k contigs.
>>
>> > My question is: is there something I am (blatantly) doing wrong? What am I
>> > overlooking?
>> > It's my first approach to assembling, so please excuse me for being
>> > annoying.
>>
>> Did your sequencing provider give you the result of a Newbler assembly of
>> the Titanium data and is it equally disastrous? Then there's nothing you
>> can do apart from getting more sequences or getting the project resequenced
>> if it turns out to be a bad sequencing kit.
>>
>> Regards,
>>  Bastien
>>
>> PS: oh, and don't be lured into the "we could try to close with Solexa"
>>    trap should the provider propose it. Accept only if *they* do the
>>    assembly and deliver you the result free of charge :-)
>>
>
>
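
The coverage figures discussed in the quoted message can be checked with
simple arithmetic: theoretical coverage is the number of reads times the mean
read length, divided by the genome size. Below is a minimal sketch in Python
using only the numbers mentioned in this thread (512,000 reads, mean read
lengths of 475 and 400 bases, genome sizes of 13 Mb and 26 Mb). The function
name is made up for illustration and this is not how 'miramem' itself
computes its estimate.

```python
def theoretical_coverage(n_reads, mean_read_len, genome_size):
    """Expected average coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_len / genome_size


# Numbers taken from the thread; everything else is an assumption.
N_READS = 512_000

for genome_size in (13_000_000, 26_000_000):
    for mean_len in (475, 400):
        cov = theoretical_coverage(N_READS, mean_len, genome_size)
        print(f"genome {genome_size / 1e6:.0f} Mb, "
              f"mean read length {mean_len}: {cov:.1f}x")
```

With a 400-base mean read length and a 13 Mb genome this gives about 15.8x,
in line with the "at most 16" above. With a 26 Mb genome the same reads give
about 7.9x, which is close to the 8x to 9x that MIRA reportedly estimated
from the data. Real coverage will be lower than these figures after quality
and adaptor clipping.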
