[mira_talk] Re: PROJ_default.(un)padded.fasta vs PROJ_ReferenceStrain vs PROJ_AllStrains

  • From: John Nash <john.he.nash@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 17 Oct 2011 15:40:32 -0400

On 2011-10-17, at 3:13 PM, Bastien Chevreux wrote:

> On Oct 17, 2011, at 19:27 , John Nash wrote:
>> I just did a mapping assembly of the results of a de novo assembly. It's an 
>> E. coli strain with 40x 454 and 70X Illumina.  After I did the de novo 
>> assembly, I mapped it against K12 and now I am going to BLAST the non-K12 
>> contigs to see what I have.
>> 
>> I took the caf file and did:
>> 
>> $ convert_project -f caf -t fasta -t caf -x 500 -y 30 PROJ_out.caf 
>> PROJ_out_LargeContigs
>> 
>> When I look at the results (ls -1 *.unpadded.fasta) so as not to list the 
>> padded results and the qual files, I get:
>> PROJ_LargeContigs_AllStrains.unpadded.fasta
>> PROJ_LargeContigs_ReferenceStrain.unpadded.fasta
>> PROJ_LargeContigs_default.unpadded.fasta
>> 
>> I understand what the ReferenceStrain is - one contig was a HUGE hint :)
>> 
>> I also know that the other two files have a mapping to the backbone strain 
>> as an entry.  What else is different between them.
>> 
>> (I did do a Google search and read the manual, promise).
> 
> Yeah, the manual is somewhat incomplete *cough*
> 
> You probably did assembly and mapping with MIRA 3.2.x, right? Else there 
> would be "StrainX" somewhere as strain. Anyway, here's the short version:

I thought I had updated all of my mira installations to 3.4 so I checked where 
keep my tarballs. It's from mira_3.4.0_prod_linux-gnu_x86_64_static.tar.bz2

My command params were:

mira --project=PROJ_mapping --job=mapping,genome,accurate,sanger 
--caf=PROJ_mapping_in.sanger.caf COMMON_SETTINGS -GE:kpmf=15:not=40 
-MI:somrnl=0 -SB:bft=gbf:abnc=yes

In the contigs, the backbone strain is referred to as NC_000913_bb, and its 
sequence is the one in ReferenceStrain


> - every read, be it backbone or normal shotgun, can have a strain name 
> attached as info to it (see parameters -SB:bsn:dsn:lsd)
> - it is entirely possible to map more than one strain against a backbone, 
> hence the need to extract strain specific consensus
> - in earlier versions of MIRA, shotgun data was assigned the strain name 
> "default". Not a good name, I've changed the default to StrainX in 3.4.x (see 
> also -SB:dsn). Backbone sequences get per default "ReferenceStrain" since 
> early 2.9 version I think.
> 
> 
> So, assume you mapped a strain against a backbone, and you get in the 
> multiple alignment:
> 
> ReferenceStrain         gtacg*tcagtcg
> default/StrainX  read1  gtacgAtTag*cg
> default/StrainX  read2  gtacgAtTag*cg
> default/StrainX  read3  gtacgAtTag*cg
> etc.pp
> 
> (note the differences)
> 
> Now, if you extract a consensus, convert_project will create one consensus 
> sequence for each strain and one for all strains together. In the case above 
> it will be three sequences:
> 
> ReferenceStrains:   gtacgtcagtcg
> default/StrainX:    gtacgAtTagcg
> AllStrains:         gtacgWYWRKcg
> 
> There are of course interesting borderline cases. E.g.: what happens when a 
> strain is not covered at all?
> 
> ReferenceStrain         
> gtacggtatatgtgtgtgcagttgcagtgtcagtactggtgtcagttggtcagtgtacgt
> default/StrainX  read1  gtacggtata
> default/StrainX  read2  gta                                                  
> default/StrainX  read3                                                       
> tgtacgt
> default/StrainX  read4                                                        
>    cgt
> 
> Here, you will get this:
> 
> ReferenceStrain:   
> gtacggtatatgtgtgtgcagttgcagtgtcagtactggtgtcagttggtcagtgtacgt
> default/StrainX:   
> gtacggtata@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@tgtacgt
> AllStrains:        
> gtacggtatatgtgtgtgcagttgcagtgtcagtactggtgtcagttggtcagtgtacgt

Thanks - this makes it clear to me.

> 
> However, while the @-character was perhaps a nice idea to be very clear that 
> there's something to be careful, third party programs are not prepared for 
> that (understandably so). So, since MIRA 3.4, the @ has been replaced by "X".

My PROJ_ReferenceStrain.unpadded.fasta has:

>NC_000913_bb
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtg
tctgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgactta
ggtcactaaatactttaaccaatataggcatagcgcacagacagataaaaattacagagt
acacaacatccatgaaacgcattagcaccaccattaccaccaccatcaccattaccacag
gtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgcg

Is my 3.4.0 out of date?


> 
> Hope this helps.
> 
> B.


You always help :)

Thanks
John

Other related posts: