[mira_talk] Re: miraconvert maf to fastq contains asterix

  • From: Bert Brutzel <bertbrutzel@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 15 Oct 2014 17:14:22 +0200

Thanks,

the -d option helped; it replaces the awk/sed/tr solution I had built. I still have a problem using sort -u, though: the reads extracted from the mapping and those from the raw data are no longer identical, because of changes like the one indicated below. Luckily I can still remove these duplicates using uniq -w 50 -d, but I still wonder where they come from...

@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGANCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA#BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
                                                                             ^
@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGAtCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA!BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
                                                                             ^
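
To make the difference between the two records explicit, here is a minimal Python sketch that lists the positions at which two equal-length strings differ. The function name diff_positions and the variable names are made up for illustration; the sequence and quality strings are copied from the two records above.

```python
def diff_positions(a, b):
    """Return the 1-based positions at which strings a and b differ."""
    if len(a) != len(b):
        raise ValueError("strings differ in length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Sequence and quality lines of the two records shown above.
seq_a = "ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGANCGCTAGTCCAATATTCAAAAATCAACTTATATCG"
seq_b = "ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGAtCGCTAGTCCAATATTCAAAAATCAACTTATATCG"
qual_a = "IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA#BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?"
qual_b = "IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA!BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?"

# Report every differing base together with the quality characters at that position.
for pos in diff_positions(seq_a, seq_b):
    print(pos, seq_a[pos - 1], seq_b[pos - 1], qual_a[pos - 1], qual_b[pos - 1])
```

The comparison is case-sensitive on purpose, so a lowercase base in one copy counts as a difference, which is also what makes sort -u treat the two records as distinct.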
Greetings,
Johannes


On 15.10.2014 at 16:15, Bastien Chevreux wrote:
> On October 15, 2014 at 3:02 PM Bert Brutzel <bertbrutzel@xxxxxxxxxxxxxx> wrote:
> in the process of splitting a sequencing run into two genomes I ran into the
> problem of a FASTQ file with duplicate read names and duplicated
> sequences with * in them. The problem occurs when I extract reads from
> the MAF file. This gives me asterisks (*) in the sequence, which in turn
> gives me duplicate entries when combined with unused reads sourced using
> the info_debrislist.txt. So my question: how can I extract reads in FASTQ
> format from a reference mapping without having asterisks (*) in the sequence?
The gaps (asterisks) can be dealt with by using '-d'.
Now to the general approach:
- to perform a first split of reads into two organisms, I would simply map the whole data to both organisms at once. That is, have the reference sequence contain all contigs/genomes from both organisms. The remaining reads I would assemble de-novo and try to use GC content or other ways (intron/exon vs. no intron/exon) to try to assign them to one of the organisms.
- extracting reads: the info directory contains a file named "*contigreadlist*" which, surprise, contains info regarding which read was assigned to which contig. Instead of going the miraconvert -M way one could use that with -n.
- extracting reads: miraconvert -M may still be useful if you want to extract pre-processed (i.e.: clipped) and partly error corrected reads. Just remember to use -C.
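
As a rough illustration of the contigreadlist route, the Python sketch below groups read names by contig. It assumes, without having checked a real file, that the list is tab-separated with the contig name in the first column and the read name in the second, and that comment lines start with '#'. The function name reads_by_contig and the example contig and read names are hypothetical.

```python
from collections import defaultdict


def reads_by_contig(lines):
    """Group read names by contig name.

    Assumed layout (not verified against a real file): tab-separated,
    column 1 = contig name, column 2 = read name, '#' starts a comment line.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip lines that do not fit the assumed layout
        groups[fields[0]].append(fields[1])
    return dict(groups)


# Made-up example input; real contig and read names will look different.
example = [
    "# contig\tread\n",
    "organismA_c1\tread_0001/1\n",
    "organismA_c1\tread_0001/2\n",
    "organismB_c1\tread_0002/1\n",
]

for contig, names in reads_by_contig(example).items():
    print(contig, len(names))
```

The read names collected for the contigs of one organism could then be written one per line to a file and used as the name list for the extraction step.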
HTH,
  B.

PS: if you need to filter for duplicate sequences (which shouldn't be in your data anyway): don't do it on sequences, always names.
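
A minimal Python sketch of filtering on names rather than sequences, assuming strict four-line FASTQ records (no wrapped sequence lines, no blank lines). The helper names parse_fastq and dedup_by_name and the toy records are made up for illustration; the first record seen for each read name is kept.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from strict four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Reusing the same iterator four times makes zip consume four lines per record.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        # Drop the leading '@' and anything after the first whitespace.
        yield header[1:].split()[0], seq, qual


def dedup_by_name(records):
    """Yield only the first record seen for each read name."""
    seen = set()
    for name, seq, qual in records:
        if name in seen:
            continue
        seen.add(name)
        yield name, seq, qual


# Toy input: the first two records share a name but differ in one base.
fastq_lines = [
    "@read1/2\n", "ACGN\n", "+\n", "III#\n",
    "@read1/2\n", "ACGt\n", "+\n", "III!\n",
    "@read2/1\n", "GGCC\n", "+\n", "IIII\n",
]

for name, seq, qual in dedup_by_name(parse_fastq(fastq_lines)):
    print("@" + name, seq, "+", qual, sep="\n")
```

Because only the name is compared, two copies of a read are recognised as duplicates even when one of them was modified during pre-processing, so which copy survives depends only on the order of the input.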
