[mira_talk] Strange output reads from Solexa assembly

  • From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 8 Dec 2009 17:25:56 +0100

Hi all, 
I'm playing with our first Solexa data in MIRA, doing a preliminary reference 
assembly (2.7 million non-paired 38 bp Illumina reads) against a 2 Mbp genome, 
using this call to MIRA (V3rc4):

mira --project=BAnh1IrMREF02 --job=mapping,genome,normal,solexa -OUT:ora=yes 
-GE:not=1 -AS:urd=yes -SB:lb=yes:bft=fasta:bbq=20 SOLEXA_SETTINGS -LR:ft=fastq

In my output .caf and .ace files, I found only very few reads from my input 
files (with names like HWI-EAS210R_0001:6:1:3:224#GATCAG/1). Instead, I found 
~400000 reads with read names like 
_cer_sxa_0_
_cer_sxa_1_
..
These reads are generally much(!) longer than my input reads. 
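
A minimal sketch of how the two kinds of read names could be tallied from the 
.ace output, to put numbers on the observation above. It assumes the standard 
ACE layout in which each read record starts with a line of the form 
"RD <name> <padded length> ..."; the function name, the group labels 
"cer_sxa" and "other", and the example lines are made up for illustration and 
are not taken from MIRA.

```python
from collections import defaultdict


def tally_ace_reads(lines, prefix="_cer_sxa_"):
    """Group ACE 'RD' records by name prefix.

    Returns {group: (read_count, mean_padded_length)}.
    """
    counts = defaultdict(int)
    total_len = defaultdict(int)
    for line in lines:
        fields = line.split()
        # ACE read records begin with: RD <name> <padded length> ...
        if len(fields) >= 3 and fields[0] == "RD":
            group = "cer_sxa" if fields[1].startswith(prefix) else "other"
            counts[group] += 1
            total_len[group] += int(fields[2])
    return {g: (counts[g], total_len[g] / counts[g]) for g in counts}


# Hypothetical ACE lines, for illustration only (lengths are invented).
example = [
    "RD _cer_sxa_0_ 120 0 0",
    "RD _cer_sxa_1_ 80 0 0",
    "RD HWI-EAS210R_0001:6:1:3:224#GATCAG/1 38 0 0",
    "QA 1 38 1 38",  # non-RD lines are ignored
]
result = tally_ace_reads(example)

# On a real file (the path here is a placeholder):
# with open("my_project_out.ace") as fh:
#     print(tally_ace_reads(fh))
```

Since only the "RD" header lines are inspected, the file can be streamed line 
by line without loading the whole assembly into memory.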

Does anyone know what these reads are? 
I guess they could be fake reads to reduce read numbers while preserving 
coverage, but I am not sure. 
And if so, does the coverage truly represent all mismatches (i.e. are "allele 
frequencies" truly preserved)? 
And if I wanted to find all reads mapped to a certain site, is that info 
preserved somewhere?
Is there a way to turn this feature off?

Grateful for any help,
Björn 





====================================
Björn Nystedt, PhD
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================
