[mira_talk] Re: assembling variants

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 2 Mar 2010 20:53:43 +0100

On Monday 01 March 2010 Fleur Darré wrote:
> I've been moving around my problem without finding an obvious solution.
> I hope you'll be able to help me for this and thank you for this in
>  advance.

Hi Fleur,

so do I :-)

> I'm currently assembling solexa reads, mapping them against a given
> virome. I hope to see some variability even INSIDE my sample and would
> like to assess it.
> My overall coverage is around 300, so, even if I have 10 to 20
> haplo-virome (which I expect), it should be ok. (In another assembly, I
> have a 25x coverage for a single expected strain)
> Now, the (directed) assembly goes well, I end up with a single contig.

Ummm, there you lost me. Directed assembly? Do you mean you produced an 
assembly with MIRA which you are importing via gap4 directed assembly?

> Some of the .exp files do correspond to several reads together (how
> many? how deeply?).

I suppose you mean coverage equivalent reads (CER).

> When I edit my output in Gap4 (staden package),
> the qualities of these long .exp files are all set to 1, which gives unfair
> excessive weight to the reads that were kept "alone". Even if I used the
> Base Frequency as consensus algorithm (avoiding the weight problem),
> each isolated read has as heavy a weight as a "long .exp"... which
> is a strong bias when I want to assess the frequency of a given
> variant/allele in my sample (for this purpose, I've been using different
> consensus sequence out of gap4, with different cons threshold).
> Am I missing some option? some step?

Yes. And no. You're missing the option to convert the gap4 database back to a 
CAF file and then let MIRA redo the consensus calculation. As described in the 
MIRA manual, gap4 does not know anything about 454 and Solexa data, so this is 
the only viable way to get things going after editing a project in gap4.

Also have a look at using the SOLEXA_SETTINGS -CO:msr=no to prevent creation 
of CERs. Beware, data volumes will explode.

> Is there a way to get the SNPs and their frequency (among reads) without
> prior knowledge on strains?

You can extract the information that there's a SNP at position XXXXX, but 
you will almost never be able to say that the SNP at XXXXX belongs to strain 1 
and the SNP at XXXXY belongs to strain 2. Only exception: if both SNPs are 
located on the same read (and Solexa reads are short, so the SNPs have to lie 
close together).
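
To illustrate the point about per-position SNPs versus linkage, here is a 
minimal Python sketch. It is not anything MIRA or gap4 does internally: the 
reads are invented toy data, assumed to be already placed on the reference 
without gaps, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy alignment: (start position on the reference, read sequence).
# Reads are assumed to be gap-free and already mapped.
reads = [
    (0, "ACGTAC"),
    (0, "ACTTAC"),
    (2, "GTACGA"),
    (2, "TTACTA"),
    (4, "ACGATT"),
    (4, "ACTATT"),
]

def column_counts(reads):
    """Count the bases observed at every reference position."""
    columns = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return columns

def base_frequencies(reads, pos):
    """Fraction of reads showing each base at one reference position."""
    counts = column_counts(reads).get(pos, Counter())
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def variable_positions(reads, min_freq=0.2):
    """Positions where at least two bases each reach min_freq."""
    found = []
    for pos, counts in sorted(column_counts(reads).items()):
        total = sum(counts.values())
        frequent = [b for b, n in counts.items() if n / total >= min_freq]
        if len(frequent) >= 2:
            found.append(pos)
    return found

def linked_bases(reads, pos_a, pos_b):
    """Base pairs seen together, counting only reads that span both positions."""
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)  # exclusive end of the read on the reference
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs

print(variable_positions(reads))   # candidate SNP positions
print(base_frequencies(reads, 2))  # allele frequencies at one position
print(linked_bases(reads, 2, 6))   # linkage: only reads covering both sites
print(linked_bases(reads, 2, 8))   # no read spans both sites: nothing to link
```

The per-column frequencies are always available, but linked_bases() only 
returns something when a single read covers both sites, which for short 
reads means the two SNPs must be a few bases apart.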

Regards,
  Bastien

