[mira_talk] Re: Request for Comments: mirabait for paired-end

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 25 Jun 2014 00:16:37 +0200

First things first: I have a working prototype, though I expect bugs in fringe 
use cases.

  
http://www.chevreux.org/tmp/mira_binonly_ft_baitpe-0-g76dd2b2_linux-gnu_x86_64_static.tar.bz2

No docs, but “mirabait -h” should help a lot. Feel free to test drive.

I’ll combine answers to Martin and Peter here.

On 24 Jun 2014, at 15:52 , Martin MOKREJŠ <mmokrejs@xxxxxxxxx> wrote:
> I propose renaming mirabait to say mirabait2 to emphasize the different 
> syntax. Just do not stick to the current name, please.

I’m not really fond of that idea. mirabait2 would be in the package of MIRA 4, 
but only in versions after 4.0.2. Or should I rename it mirabait4? Then where 
would versions 2 and 3 be? That could be slightly unsettling for users.

>> 3) I am planning to set up mirabait to act as a file splitter instead of a 
>> file filter. I.e., instead of filtering and writing to an output file only 
>> sequences (not) matching the bait sequences, the new version could sort the 
>> sequences matching to one output file and sequences not matching to another 
>> output file. Default would be to have only the matching output active, but a 
>> switch would allow to either also add the non matching or to write only the 
>> non-matching.
> 
> I would prefer options like -i (include) and -e (exclude) and -p (prefix).

I’m not sure I understand what you mean by the above, care to explain?

Small note: -p/-P are already taken up for defining files with pairs. No going 
back on this one :-)


On 24 Jun 2014, at 10:54 , Peter Cock <p.j.a.cock@xxxxxxxxxxxxxx> wrote:
>> 4) Would it make sense to have mirabait write results for each
>> […]
> 
> You might need to offer several modes:

I think all of these modes are in.

> (a) All in one file (using the read names to spot pairs, fine for
> MIRA input).

I hope you meant “output to one file”. That’s in. If you meant “I have one big 
unsorted file with pairs and singlets mixed” as input … that’s currently not 
foreseen and I’m not sure I want to implement that. There’s one read-naming 
hell awaiting … how to correctly parse out template names if, e.g., Sanger, 454 
and Illumina are mixed in one FASTQ file?

Any good use case in mind?
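
To illustrate the read-naming problem, here is a minimal Python sketch. It is 
not mirabait code: the two suffix rules ("/1" and "/2", ".f" and ".r") and the 
function name guess_template are simplified assumptions made up for this 
example, and real naming schemes have many more variants.

```python
import re

# Hypothetical, simplified suffix rules (assumptions for illustration only).
SUFFIX_RULES = [
    (re.compile(r"^(?P<template>.+)/(?P<member>[12])$"), "slash"),  # read/1, read/2
    (re.compile(r"^(?P<template>.+)\.(?P<member>[fr])$"), "fr"),    # read.f, read.r
]


def guess_template(read_name):
    """Return (template, member, rule); member and rule are None if nothing matches."""
    for pattern, rule in SUFFIX_RULES:
        match = pattern.match(read_name)
        if match:
            return match.group("template"), match.group("member"), rule
    return read_name, None, None


if __name__ == "__main__":
    for name in ("r1/1", "clone7.f", "GXYZ01", "lib.r/2"):
        print(name, guess_template(name))
```

A name like "lib.r/2" shows the trouble: the first rule wins and the template 
becomes "lib.r", although under the second rule "lib.r" would itself look like 
a reverse read. With mixed technologies in one file, the order of the rules 
decides the outcome.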

> (b) Two files for paired reads (in matched order) plus a third file
> for any orphan reads (unpaired, or where a partner is missing)
> 
> (c) An interleaved file for paired reads plus a second file for
> any orphan reads (unpaired, or where a partner is missing).

Yeah, as input that’s in. You can even have several file pairs (-p), 
interleaved files (-P) and unpaired files as input and get them combined (see 
-o/-O) or filtered/sorted separately. One caveat remains: when using file 
pairs, these MUST be synchronised, i.e., same number of reads, same order of 
read names or else mirabait will stop. No singlets allowed here as that would, 
sooner rather than later, lead to situations which are impossible to resolve. 
The same applies in principle for interleaved files: reads MUST come in pairs, 
one after the other or else mirabait will stop.
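
For anyone who wants to pre-check a file pair, the synchronisation requirement 
can be sketched in Python as below. This is not how mirabait does it 
internally; it assumes strict 4-line FASTQ records and "/1", "/2" pair 
suffixes, and the function names are invented for this example.

```python
from itertools import zip_longest


def fastq_names(lines):
    """Yield read names from FASTQ lines, assuming strict 4-line records."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            header = line.strip()
            if not header:
                continue  # tolerate a trailing blank line
            # Drop the leading '@' and anything after the first whitespace.
            yield header[1:].split()[0]


def strip_pair_suffix(name):
    """Remove an assumed '/1' or '/2' pair suffix, if present."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def first_desync(names1, names2):
    """Return the 0-based index of the first out-of-sync record, or None."""
    for idx, (a, b) in enumerate(zip_longest(names1, names2)):
        if a is None or b is None:
            return idx  # one file has more reads than the other
        if strip_pair_suffix(a) != strip_pair_suffix(b):
            return idx  # same position, different template name
    return None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("reads_1.fastq") as f1, open("reads_2.fastq") as f2:
        print(first_desync(fastq_names(f1), fastq_names(f2)))
```

A result of None means both files have the same number of reads in the same 
order; any integer is the position of the first record where they diverge.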

B.

