Thanks, the -d option helped; this avoids the awk/sed/tr solution I had built. I nevertheless still have a problem using sort -u, since the reads extracted from a mapping and from the raw data are no longer identical: there are some changes, as indicated below. Luckily I can still catch these duplicates using uniq -w 50 -d, but I still wonder where they come from...
@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGANCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA#BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGAtCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA!BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
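As an aside, the uniq -w 50 -d trick mentioned above compares only the first 50 characters of each sorted line, which here covers the read name but not the differing base. A toy sketch of that behaviour (GNU uniq; the read name, padding width, and file name are made up for illustration):

```shell
# Two "reads" flattened to one line each: the names (padded to a fixed
# 50-character field) are identical, the sequences differ by one base,
# mimicking the N-vs-t pair shown above.
printf '%-50s %s\n' '@READ:46:FC:5:108:10753:7688/2' 'ACGTN'  > reads.txt
printf '%-50s %s\n' '@READ:46:FC:5:108:10753:7688/2' 'ACGTt' >> reads.txt

sort reads.txt | uniq -w 50      # collapses the pair to one record
sort reads.txt | uniq -w 50 -d   # prints one line per duplicated name
```

Note that -w is a GNU extension; BSD uniq does not have it.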
Greetings,
Johannes

On 15.10.2014 at 16:15, Bastien Chevreux wrote:
> On October 15, 2014 at 3:02 PM Bert Brutzel <bertbrutzel@xxxxxxxxxxxxxx> wrote:
> > in the process of splitting a sequencing run into two genomes I run into
> > the problem of a fastQ file with duplicate read names, and duplicated
> > sequences with * in them. The problem occurs when I extract reads from
> > the MAF file. This gives me asterisks (*) in the sequence, which in turn
> > gives me duplicate entries when combined with unused reads sourced using
> > the info_debrislist.txt. So my question: how can I extract reads in fastQ
> > format from a reference mapping without having asterisks (*) in the
> > sequence?
>
> The gaps (asterisks) can be dealt with by using '-d'. Now to the general
> approach:
>
> - to perform a first split of reads into two organisms, I would simply map
>   the whole data to both organisms at once. That is, have the reference
>   sequence contain all contigs/genomes from both organisms. The remaining
>   reads I would assemble de novo and try to use GC content or other means
>   (intron/exon vs. no intron/exon) to assign them to one of the organisms.
> - extracting reads: the info directory contains a file named
>   "*contigreadlist*" which, surprise, contains info on which read was
>   assigned to which contig. Instead of going the miraconvert -M way, one
>   could use that with -n.
> - extracting reads: miraconvert -M may still be useful if you want to
>   extract pre-processed (i.e. clipped) and partly error-corrected reads.
>   Just remember to use -C.
>
> HTH,
>   B.
>
> PS: if you need to filter for duplicate sequences (which shouldn't be in
> your data anyway): don't do it on sequences, always on names.
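For what it's worth, the contigreadlist-plus-name-filtering route suggested above can be sketched in shell. This is a toy example under stated assumptions: the two-column contig/read layout, the file names, and the contig and read names are all made up here, and it assumes plain 4-line FASTQ records.

```shell
# Toy stand-ins for MIRA's info/*contigreadlist* file (assumed here to be
# two columns: contig name, read name) and a 4-line-per-record FASTQ.
printf 'contig1\tread_a\ncontig1\tread_b\ncontig2\tread_c\n' > contigreadlist.txt
printf '@read_a\nACGT\n+\nIIII\n@read_c\nTTTT\n+\nIIII\n' > reads.fastq

# Collect the names assigned to one contig ...
awk '$1 == "contig1" { print $2 }' contigreadlist.txt > names.txt

# ... then pull the matching FASTQ records, keeping each name only once
# (deduplication by name, not by sequence, as advised in the PS).
awk 'NR == FNR { want[$1]; next }
     FNR % 4 == 1 {
         name = substr($1, 2)                  # strip the leading "@"
         keep = (name in want) && !(name in seen)
         seen[name] = 1
     }
     keep' names.txt reads.fastq > subset.fastq
```

The NR == FNR idiom loads the name list on the first pass; on the second file, `keep` is decided once per header line and carries through the following three lines of the record.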