[mira_talk] Request for Comments: mirabait for paired-end

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 23 Jun 2014 22:19:13 +0200

Dear all,

for different reasons, implementing a mirabait version aware of paired-end was
not quite straightforward with the code-base of MIRA 4.0.x. However, as the
need for it also arose in my daily work, I have implemented the necessary
changes in the past weeks.

I am foreseeing some changes in the default behaviour of mirabait as well as
in the command line. This would probably break scripts using mirabait with the
“old” syntax, and I would like some feedback on what people think. Nothing is
implemented yet, so there’s a couple of days to think things through.

Currently, the default behaviour of mirabait is this: it is not aware of
paired-ends; it reads a file containing bait sequences, reads one or several
files with sequences to search and writes to one(!) output file either a) all
sequences which match the bait sequences at the kmer level or(!) b) all
sequences which do NOT match (the -i option of mirabait). The command line
looks like this atm:

  mirabait [options] {bait_file} {input_file} [[input_file_2 input_file_3 ...]] {output_basename}
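
To make the "match on the kmer level" idea concrete, here is a minimal Python sketch of such a filter. It is not mirabait's actual code: the function names, the tiny k, the min_hits threshold and the fact that it ignores reverse complements are all assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bait_filter(baits, reads, k=4, min_hits=1, invert=False):
    """Return the names of reads sharing >= min_hits distinct kmers with the baits.

    baits: iterable of bait sequences (strings)
    reads: iterable of (name, sequence) tuples
    invert=True keeps the reads that do NOT match instead, which mimics
    the idea behind the -i option described above.
    Assumption: only the forward strand is checked (no reverse complement).
    """
    bait_kmers = set()
    for bait in baits:
        bait_kmers |= kmers(bait, k)

    kept = []
    for name, seq in reads:
        hit = len(kmers(seq, k) & bait_kmers) >= min_hits
        if hit != invert:
            kept.append(name)
    return kept
```

Called with invert=False this returns only the matching reads, with invert=True only the non-matching ones, so a single call yields a single output set, as in the current behaviour.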

I have a couple of questions:

1) atm I plan to disallow writing results in multiple formats at the same time. 
E.g., one could not have results written both as FASTQ and FASTA at the same 
time (which is possible with the current mirabait). Any problem with that?

2) would it make sense to allow mirabait to read bait sequences from multiple
files? If yes, would it make sense to change the command line so that each bait
file (even if only one is wanted) needs an option like, e.g.
   mirabait … -b baitfile1 -b baitfile2 …
   As an added bonus of a forced ‘-b’: mirabait would stop on old syntax (which
did not have -b) and tell the user to adapt the command.
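
The repeated, mandatory ‘-b’ from question 2 can be sketched with Python's argparse. This only illustrates the proposed command-line behaviour; mirabait itself is not written this way, and the program name, destination names and error text below are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the proposed syntax (illustrative only)."""
    parser = argparse.ArgumentParser(prog="mirabait-sketch")
    # -b may be given several times; each use appends one bait file.
    parser.add_argument("-b", dest="bait_files", action="append",
                        metavar="BAITFILE",
                        help="bait file (repeat -b for several files)")
    parser.add_argument("input_files", nargs="+", metavar="INPUT_FILE")
    return parser


def parse_args(argv):
    """Parse argv and stop with an error if no -b was given (old syntax)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.bait_files:
        # parser.error() prints the message and exits with status 2.
        parser.error("no bait file given: the old syntax is no longer "
                     "supported, please pass each bait file with -b")
    return args
```

With this layout a call in the old style, where the first positional argument was the bait file, is rejected instead of silently treating the bait file as input.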

3) I am planning to set up mirabait to act as a file splitter instead of a file 
filter. I.e., instead of filtering and writing to an output file only sequences 
(not) matching the bait sequences, the new version could sort the sequences 
matching to one output file and sequences not matching to another output file. 
Default would be to have only the matching output active, but a switch would
allow either adding the non-matching output as well or writing only the
non-matching one.
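
The difference between a filter and a splitter can be sketched as follows. The function and parameter names are hypothetical, and the match test is passed in as a callable so the sketch makes no claim about how mirabait decides on a hit.

```python
def split_reads(reads, is_hit, write_hits=True, write_misses=False):
    """Sort reads into a 'hits' list and a 'misses' list in one pass.

    reads: iterable of (name, sequence) tuples
    is_hit: callable taking a sequence and returning True on a bait match
    write_hits / write_misses: stand-ins for the proposed switches; the
    defaults mirror the proposal (only the matching output is active).
    """
    hits, misses = [], []
    for name, seq in reads:
        if is_hit(seq):
            if write_hits:
                hits.append(name)
        elif write_misses:
            misses.append(name)
    return hits, misses
```

The three combinations of the two switches correspond to "hits only" (default), "hits and misses" and "misses only".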

4) Would it make sense to have mirabait write results for each input file into 
a separate output file as default? That would enable other tools (assembler, 
mappers, whatever) to directly work with Illumina paired-end which is almost 
always in two files. The downside: when writing to separate files, I think it
is almost impossible to have the user name every output file. So the default
behaviour would be to name the output files like the input files, but with a 
given prefix. E.g. “baithits_” for sequences which matched and “baitmiss_” for 
sequences which did not.
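
The naming scheme from question 4 could look like the sketch below. The two prefixes are the ones suggested above; writing the outputs into the current directory rather than next to the inputs is an assumption, as is the function name.

```python
import os


def output_names(input_path, hit_prefix="baithits_", miss_prefix="baitmiss_"):
    """Derive the hit and miss output file names for one input file.

    The directory part of input_path is dropped, so the outputs land in
    the current working directory (an assumption of this sketch).
    """
    filename = os.path.basename(input_path)
    return hit_prefix + filename, miss_prefix + filename


def paired_output_names(input_paths):
    """Map every input file to its (hits, misses) output names."""
    return {path: output_names(path) for path in input_paths}
```

For a paired-end run with two input files this yields two hit files and two miss files, one pair per input, which downstream tools can then take as paired input.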

I’m sure I’ll hit a number of other issues as I progress, but are there any 
comments regarding the above?

Best,
  Bastien

PS: allowing for kmers >32 will *not* be part of the upcoming rework of 
mirabait (sorry)
PPS: for people only on the mira_announce list: please reply to mira_talk