[mira_talk] Re: change stringency in Mira mapping assembly
- From: Christoph Hahn <chrisi.hahni@xxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Fri, 13 Jan 2012 09:30:57 +0100
Hi Bastien,
Thanks for your reply! I have a few more things that are bothering me:
On 12/12/2011 10:24 PM, Bastien Chevreux wrote:
On Dec 12, 2011, at 16:09 , Christoph Hahn wrote:
After your step 2 (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS
-CO:fnicpst=yes" derjavinoidesmtlane8_out.maf iteration1) I get a whole bunch of
files:
iteration1_AllStrains.padded.fasta
[...]
iteration1_default.padded.fasta
[...]
iteration1_derjavinoidesmtlane8.padded.fasta
[...]
iteration1_derjavinoides_mt.padded.fasta
[...]
I was continuing with the iteration1_derjavinoidesmtlane8.padded.fasta,
although I am not sure if it would maybe be safer to continue with the unpadded
file.
Take the unpadded version. From the manual:
fasta contains the contig consensus sequences (and .fasta.qual the consensus
qualities). Please note that they come in two flavours: padded and unpadded.
The padded versions may contains stars (*) denoting gap base positions where
there was some minor evidence for additional bases, but not strong enough to be
considered as a real base. Unpadded versions have these gaps removed.
Ok. will do!
What confuses me are all the other files. What are they? THey are obviously all
some variants of the reference..
If you map reads from different strain to a reference, what should MIRA give you as consensus? See?
Not that easy. So MIRA takes the broad approach: one consensus per strain, plus one consensus for
reads without strain info ("default") and one consensus for all strains together
("AllStrains")
I ran my initial assembly like so:
mira --project=derjavinoidesmtlane8 --job=mapping,genome,accurate,solexa
-MI:somrnl=0 -AS:nop=1 -SB:bsn=derjavinoides_mt:bft=gbf:bbq=30
SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=120:tismax=200
-SB:dsn=derjavinoidesmtlane8
I still don t quite get the issue with the strains for my particular
case - I am not trying to map reads from different strains. I am not
giving any prior information about strains via a straindata file or
anything like that. So, if I am correct, with
"-SB:dsn=derjavinoidesmtlane8 " I specify the default strain name for
all reads.
The iteration1_default* files just contain @ - that seems logical to me
as there is no reads without strain information used. Thats that. The
iteration_derjavinoidesmtlane8* files pretty much contain what I was
expecting. Stretches of rather conserved sequence parts where I get a
sequence and then bits of @ in between where the reference is too
divergent. What I am not clear about is why I get
iteration1_derjavinoides_mt* files. derjavinoides_mt was just the strain
name I specified for the backbone, right?. The sequence in the
derjavinoides_mt* files is identical to the reference with the exception
that there are some 50 @ added before the actual start of the reference
sequence and some 10 @ added after the end of the reference sequence.
What is that? The iteration1_Allstrains* file seems to contain a
consensus sequence, as expected.
-) the file with the trimmed reads that I obtained from the first mapping
attempt with Mira (mynewl8data.fastq) as well as the file I get from mirabait
(mymtreads_iteration1.fastq) both start with the reference and are then
followed by several sequences (header e.g. @rr_####50####) before the actual
reads. Apparently these @rr_#### sequences are all part of the reference.. what
exactly is it?
Uhhhh ... you seeing these ###-reads tells me something went wrong. Somewhere.
Where exactly did you get the original file from? Additional question: did the
assembly run over several passes? If yes, why?
The original file is a standard fastq file. I actually trimmed the reads
in the file already quite thouroughly before the initial mapping attempt
with mira. Then I was wanting to obtained the even further clipped reads
from Mira after the first mapping assembly as you suggested with
(convert_project -f maf -t FASTQ -C readpool.maf mynewl8data). The
obtained fastq file already contains this rails then. Also all following
fastq files do. The assembly only ran over one pass. See the stdout of
the convert_project below. I don`t know what`s the problem.. Do you have
any suggestions what I could try to get rid of this rails? Do you think
it is affecting the results from the following steps? I don`t get any
errors concerning this and I looked at the results from Mirabait and the
following mapping approaches and they look ok to me.
Loading from maf, saving to: fastq
First counting reads:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%]
....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%]
....|.... [90%] ....|.... [100%] tcmalloc: large alloc 9261940736 bytes
== 0x1c170000 @
Now loading and processing data:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%]
....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%]
....|.... [90%] ....|.... [100%]
Data conversion process finished, no obvious errors encountered.
SC Read:: read name issorted (1) capacity 4294967295(4) size 1
SC Read:: scf path name issorted (1) capacity 255(1) size 1
SC Read:: exp path name issorted (1) capacity 255(1) size 1
SC Read:: machine type issorted (1) capacity 255(1) size 1
SC Read:: primer issorted (1) capacity 255(1) size 1
SC Read:: strain issorted (1) capacity 255(1) size 3
SC Read:: base caller issorted (1) capacity 255(1) size 1
SC Read:: dye issorted (1) capacity 255(1) size 1
SC Read:: process status issorted (1) capacity 255(1) size 1
SC Read:: clone vector name issorted (1) capacity 65535(2) size 1
SC Read:: sequencing vector name issorted (1) capacity 65535(2) size 1
SC asped issorted (1) capacity 4294967295(4) size 1
To answer your question: the rr_### reads are "rails", helper reads used by
MIRA during the assembly. They are not present in the final results, only in intermediate
files.
-) Also I tried to use mirabait to identify reads that map onto the the
sequence of the host organism, but unfortunately it seems as if the reference
sequences are too long. Is there a way of dealing with this, apart from cutting
the reference in smaller bits? This is the error message I get:
"Read gi|354459049|gb|AGKD01000001.1| is 194200 bp long and thus longer than
MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."
Oooooooops! This is something which should not happen. Definitively a bug. I'll
have a look at it as I am currently working on this part pf MIRA. In the mean
time: sorry, you need to fragment :-/
I am fragmenting the reference now.
Much obliged!
cheers,
Christoph
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
Other related posts: