[mira_talk] Re: assembling variants

  • From: Alexie Papanicolaou <alpapan@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 04 Mar 2010 00:57:44 +0000

Hello Fleur

my two cents:
One option would be to map your reads to the reference consensus
(created by MIRA) using a program like maq/bowtie (one strain at a
time). That should give you the strain specific SNPs but also (with some
parsing of the output) you could get the strain-specific weight for the
haplotype (allele frequency). You can also see if the SNP is polymorphic
or fixed within the strain. Then you can make a table (e.g. with strains
in columns and position rows) to see and sort which sites of the genome
are most differentiated...

Personally, I would discard reads that would be 'striking' in more than
one place (i.e. repeats, domains etc).

The foremost advantage would be speed...
a


On Tue, 2010-03-02 at 21:17 +0100, Fleur Darré wrote:

> 
> 
> Bastien Chevreux a écrit :
> > On Montag 01 März 2010 Fleur Darré wrote:
> >   
> >> I've been moving around my problem without finding an obvious solution.
> >> I hope you'll be able to help me for this and thank you for this in
> >>  advance.
> >>     
> >
> > Hi Fleur,
> >
> > so do I :-)
> >
> >   
> >> I'm currently assembling solexa reads, mapping them against a given
> >> virome. I hope to see some variability even INSIDE my sample and would
> >> like to assess it.
> >> My overall coverage is around 300, so, even if I have 10 to 20
> >> haplo-virome (which I expect), it should be ok. (In another assembly, I
> >> have a 25x coverage for a single expected strain)
> >> Now, the (directed) assembly goes well, I end up with a single contig.
> >>     
> >
> > Ummm, there you lost me. Directed assembly? Do you mean you produced an 
> > assembly with MIRA which you are importing via gap4 directed assembly?
> >   
> No, sorry, I miexplained myself : the "directed" stands here for mapping.
> At this stage, no staden!
> >   
> >> Some of the .exp files do correspond to several reads together (how
> >> many? how deeply?).
> >>     
> >
> > I suppose you mean coverage equivalent reads (CER).
> >   
> ok, this is it.
> >   
> >> When I edit my output in Gap4 (staden package),
> >> these long .exp files's quality are all set to 1, which provides unfair
> >> excessive weight to the reads that were kept "alone". Even if I used the
> >> Base Frequency as consensus algorithm (avoiding the weigth problem),
> >> each isolated read has as an as heavy weight as a "long .exp"... which
> >> is a strong bias when I want to assess the frequency of a given
> >> variant/allele in my sample (for this purpose, I've been using different
> >> consensus sequence out of gap4, with different cons threshold).
> >> Am I missing some otpion? some step?
> >>     
> >
> > Yes. And no. You're missing the option to convert the gap4 database back to 
> > a 
> > CAF file and then let MIRA redo the consensus calculation. As described in 
> > the 
> > MIRA manual, gap4 does not know anthing about 454 and Solexa data, so this 
> > is 
> > the only viable way to get things going after editing a project in gap4.
> >   
> ok, fine, this is a first point.
> BUT, given that I may have as many as 20 (or more) different viromes 
> (from the same virus, still) in my sample, I am interested in seeing 
> discrepancies even if only 5% of the reads harbor it. Is MIRA able to 
> provide this? Would it provide an "allele frequency" estimate, then?
> > Also have a look at using the SOLEXA_SETTINGS -CO:msr=no to prevent 
> > creation 
> > of CERs. Beware, data volumes will explode.
> >   
> ok, I may choose this option if MIRA is not suitable for my specific task.
> Data volume should not be problematic.... Finger crossed...
> 
> Thanks,
> 
> Fleur
> 
> > Regards,
> >   Bastien
> >
> >   
> 
> 


Other related posts: