[mira_talk] Re: Request for Comments: mirabait for paired-end

  • From: Martin MOKREJŠ <mmokrejs@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 24 Jun 2014 15:52:14 +0200

Bastien Chevreux wrote:

Hi Bastien,

> Dear all,
> 
> for different reasons, implementing a paired-end-aware version of mirabait 
> was not quite straightforward with the code base of MIRA 4.0.x. However, as 
> the need for it also arose in my daily work, I have implemented the 
> necessary changes in the past weeks.
> 
> I am foreseeing some changes in the default behaviour of mirabait as well as 
> in the command line. This would probably break scripts using mirabait with 
> the “old” syntax, and I would like some feedback on what people think. Nothing 
> is implemented yet, so there are a couple of days to think things through.
> 
> Currently, the default behaviour of mirabait is this: it is not aware of 
> paired-ends; it reads a file containing bait sequences, reads one or several 
> files with sequences to search and writes to one(!) output file either 
> a) all sequences which match the bait sequences at the k-mer level or(!) 
> b) all sequences which do NOT match (the -i option of mirabait). The command 
> line looks like this atm:
> 
>   mirabait [options] {bait_file} {input_file} [[input_file_2 input_file_3 ...]] {output_basename}

I propose renaming mirabait to, say, mirabait2 to emphasize the different syntax. 
Just do not stick to the current name, please.


> I have a couple of questions:
> 
> 1) atm I plan to disallow writing results in multiple formats at the same 
> time. E.g., one could not have results written both as FASTQ and FASTA at the 
> same time (which is possible with the current mirabait). Any problem with 
> that?

No, there are tools to convert FASTQ to FASTA.
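
To illustrate how small such a conversion is, here is a minimal Python sketch. It 
assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines), 
and the function name fastq_to_fasta is just made up for this example; it is not 
part of MIRA or of any existing tool.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from FASTQ lines.

    Assumption: every record is exactly 4 lines
    (header, sequence, '+' separator, qualities).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, plus, _quality = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            # Swap the '@' for '>' and drop the quality lines.
            yield ">" + header[1:]
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


# Tiny in-memory example with two reads.
fastq = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "IIII"]
fasta = list(fastq_to_fasta(fastq))
```

The same generator works on an open file handle, since it only iterates over lines.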


> 2) would it make sense to allow mirabait to read bait sequences from multiple 
> files? If yes, would it make sense to change the command line so that each 
> bait file (even if only one is wanted) needs an option like, e.g.
>    mirabait … -b baitfile1 -b baitfile2 …
>    As an added bonus of a forced ‘-b’: mirabait would stop on old syntax (which 
> did not have -b) and tell the user to adapt their command.
> 
> 3) I am planning to set up mirabait to act as a file splitter instead of a 
> file filter. I.e., instead of filtering and writing to an output file only 
> sequences (not) matching the bait sequences, the new version could sort the 
> sequences matching to one output file and sequences not matching to another 
> output file. Default would be to have only the matching output active, but a 
> switch would allow either writing the non-matching sequences as well or 
> writing only the non-matching ones.

I would prefer options like -i (include) and -e (exclude) and -p (prefix).
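
To make this concrete, here is a rough Python sketch of how such a command line and 
the splitting could fit together. Everything in it is hypothetical: the program name 
mirabait-sketch, the option meanings, and the helper functions only mirror the 
proposals in this thread and are not the actual mirabait interface. The matching is 
also simplified to "shares at least one k-mer with a bait, forward strand only".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical options, mirroring the proposals in this thread only.
    parser = argparse.ArgumentParser(prog="mirabait-sketch")
    # A required, repeatable -b means an old-style call without -b fails early.
    parser.add_argument("-b", dest="baits", action="append", required=True,
                        help="bait file (may be given several times)")
    parser.add_argument("-i", dest="include", action="store_true",
                        help="write reads that match the baits")
    parser.add_argument("-e", dest="exclude", action="store_true",
                        help="write reads that do not match the baits")
    parser.add_argument("-p", dest="prefix", default="",
                        help="prefix for output file names")
    parser.add_argument("inputs", nargs="+", help="read files to search")
    return parser


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers of seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_reads(reads, bait_kmers: set, k: int):
    """Sort read names into (hits, misses) instead of filtering one side away."""
    hits, misses = [], []
    for name, seq in reads:
        # Simplification: one shared k-mer counts as a match, forward strand only.
        if kmers(seq, k) & bait_kmers:
            hits.append(name)
        else:
            misses.append(name)
    return hits, misses


args = build_parser().parse_args(
    ["-b", "bait1.fasta", "-b", "bait2.fasta", "-i", "-e", "-p", "run1_",
     "reads_1.fastq", "reads_2.fastq"]
)
bait_kmers = kmers("ACGTACGTAA", 4)
hits, misses = split_reads([("r1", "TTACGTAC"), ("r2", "GGGGGGGG")], bait_kmers, 4)
```

With both -i and -e given, a tool built this way would write both lists; with only 
one of them it would behave like a filter.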

> 
> 4) Would it make sense to have mirabait write results for each input file 
> into a separate output file as default? That would enable other tools 
> (assemblers, mappers, whatever) to directly work with Illumina paired-end 
> data, which is almost always in two files. The downside: when writing to 
> separate files, I think it is almost impossible to have the user name every 
> output file. So the default behaviour would be to name the output files like 
> the input files, but with a given prefix. E.g. “baithits_” for sequences 
> which matched and “baitmiss_” for sequences which did not.

No idea.
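
For reference, the naming scheme described in point 4 could be sketched like this in 
Python. The function output_names is made up for illustration; only the two 
prefixes are taken from the proposal above, and keeping the output next to the 
input file is an assumption.

```python
from pathlib import PurePosixPath


def output_names(input_file: str,
                 hit_prefix: str = "baithits_",
                 miss_prefix: str = "baitmiss_"):
    """Return (hit_file, miss_file) names derived from one input file name.

    Assumption: outputs go into the same directory as the input file.
    """
    path = PurePosixPath(input_file)
    hit_file = path.with_name(hit_prefix + path.name)
    miss_file = path.with_name(miss_prefix + path.name)
    return str(hit_file), str(miss_file)


# One pair of output names per input file, e.g. for Illumina paired-end data.
names = [output_names(f) for f in ["sample_R1.fastq", "data/sample_R2.fastq"]]
```

This way the user never has to name an output file, and the two mates stay in two 
separate files that other tools can pick up directly.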

Martin

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
