[mira_talk] Re: Mixed case Roche 454 data and -CL:lowercase_clip

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 22 Nov 2009 09:59:34 +0100

On Tuesday 17 November 2009 Peter wrote:
> Dear Bastien et al,
> 
> As you know the Roche SFF tools and sff_extract will output mixed case
> sequences, where the good sequence is in upper case, and the start and
> end sections are in lower case (indicating adaptors or poor quality
> sequence to be trimmed off). This means the case information records
> the start/end trim points from the original SFF file.
> 
> Back on 1 Oct 2009, you mentioned you had introduced a new clipping
> function, -CL:lowercase_clip which would be on by default for 454
> data, off otherwise:
> http://www.freelists.org/post/mira_talk/454-genome-assembly,10
> 
> Should this give the same results as following the walk-through recipe
> using sff_extract to turn an SFF file into FASTA, QUAL, and an XML
> file with the trimming information?
> http://chevreux.org/uploads/media/mira3_454dev.html

It should, yes, with the small caveat that when running on multiple
processors, the results may vary slightly between runs.
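
To make the case convention concrete, here is a minimal Python sketch of how
clip points could be recovered from a mixed-case read. It is an illustration
only, not the actual code of MIRA or sff_extract, and the function name
lowercase_clip_points is made up for this example:

```python
def lowercase_clip_points(seq):
    """Return (left, right) as 0-based, half-open bounds of the part of
    the read that remains after stripping the lower-case prefix and suffix.

    Assumption: only leading and trailing lower-case runs mark clipped
    sequence; lower-case bases inside the read are left alone.
    """
    left = 0
    while left < len(seq) and seq[left].islower():
        left += 1
    right = len(seq)
    while right > left and seq[right - 1].islower():
        right -= 1
    return left, right


read = "tcagACGTACGTacgt"
left, right = lowercase_clip_points(read)
print(left, right, read[left:right])
```

A read that is entirely lower case yields an empty range (left == right),
which is one way such a read could be flagged as fully clipped.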

> I'd like to try this on Roche 454 data using either mixed case FASTA
> files with QUAL files, or with mixed case FASTQ files. My reason for
> wanting to do this is trying out different trimming and filtering
> strategies and producing FASTA+QUAL or FASTQ files is easy, while the
> XML file is a hassle. Also, being able to give MIRA a single mixed
> case FASTQ file seems much more elegant than a bundle of three files
> (FASTA, QUAL, XML) which must all be named in a particular way.

Actually, sff_extract does everything for you: extract into FASTA, QUAL and
XML (or FASTQ and XML) and name the files. Of course, if you don't start from
SFFs, then you have a bit more work.

> $ mira -job=denovo,genome,accurate,454 454_SETTINGS -fastq=reads.fastq
> -CL:lowercase_clip

Hmpf, on a side note: that the parser accepts '-CL:lowercase_clip' without 
"=yes" or "=no" is a problem. I'll have to have a look at what this does (or 
doesn't).

> Whereas this runs (it is still processing right now - I'm waiting to
> see how it compares to the earlier assembly I did using the
> FASTA+QUAL+XML version of the same 454 reads):
> 
> $ mira -job=denovo,genome,accurate,454 454_SETTINGS -fastq=reads.fastq
> -CL:lowercase_clip -LR:mxti=no
> 
> This confuses me since the documentation here says mxti defaults to no:
> http://chevreux.org/uploads/media/mira3.html

mxti indeed defaults to 'no' when you call MIRA without any parameters. But as 
soon as you use a quick switch (especially --job), the 'default' settings 
given in the manual below do not apply anymore for most switches as the quick 
switch tweaks a lot of extensive switches internally.

> However, it seems quite complicated because I have to explicitly turn
> off the XML trace info check (using MIRA 3rc4 on Mac OS X):

And this will stay like that. I've briefly thought about relaxing the need for
explicitly turning mxti off, but decided against it. The reason is that there
are far too many people out there who don't know squat about sequence assembly
and will feed whatever they have in hand into assembly / clustering / whatever
programs. Combine this with the fact that sequencing providers provide
'preprocessed' SFFs in different formats and you've got the perfect recipe for
a catastrophe.

E.g.: I've seen almost every combination of 454 FASTA files delivered with and
without A-adaptor remnants, all upper or all lower case, clipped or unclipped,
etc. And of course people will complain to the list (or to me) for help
when they find out that the assembly is garbage just because their input was
garbage.
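
For those preparing their own files, a quick check of the case pattern can
catch this kind of input before assembly. The Python sketch below is
hypothetical (case_pattern and summarise_fastq_case are not part of MIRA) and
assumes plain 4-line FASTQ records with no wrapped sequence lines:

```python
from collections import Counter


def case_pattern(seq):
    """Classify one read as 'upper', 'lower', 'mixed' or 'empty'."""
    if not seq:
        return "empty"
    if seq.isupper():
        return "upper"
    if seq.islower():
        return "lower"
    return "mixed"


def summarise_fastq_case(lines):
    """Count case patterns over the sequence lines of a FASTQ stream.

    Assumption: every record is exactly 4 lines, so the sequence is the
    second line of each record.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:
            counts[case_pattern(line.strip())] += 1
    return counts


example = [
    "@read1", "tcagACGTACGTacgt", "+", "IIIIIIIIIIIIIIII",
    "@read2", "ACGTACGT", "+", "IIIIIIII",
]
print(summarise_fastq_case(example))
```

A file in which every read comes out as 'upper' or 'lower' carries no trim
information in its case, so lowercase clipping cannot recover the SFF clip
points from it.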

Forcing people not familiar with assembly to go via SFF -> sff_extract and the
FASTA/QUAL/XML (FASTQ/XML) files ensures that at least that part does not
generate questions. I think it's a small price to pay for people who know what
they are doing to add a couple of switches to the command line so that MIRA
changes its default behaviour :-)

Regards,
  Bastien
