Thanks, the -d option helped; this avoids the awk/sed/tr solution I had built. I nevertheless still have a problem using sort -u, since the reads extracted from a mapping and from the raw data are no longer identical: there are some changes, as indicated below. Luckily I can still catch these duplicates using uniq -w 50 -d, but I still wonder where they come from...
@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGANCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA#BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
@HWUSI-EAS1580R:46:FC:5:108:10753:7688/2
ACGTTTATGGCAATCGTGGTGGCTGGATATTTCGCATTTGGCATCGGAAAAAGACAACGGATAGATGGCGGCGAAGAtCGCTAGTCCAATATTCAAAAATCAACTTATATCG
+
IHIIIIIIHHIIIIIHGIIFIGIGGIFIGIIIIBGHIIHHIHHIIIIIIIIIIDIGIIIGHHHHFEG@GIIEIDIBA!BB;7:<8GGBDBBDBBDEEHB>=B<=@A@@>=A?
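As an aside, the uniq -w 50 -d trick mentioned above compares only the first 50 characters of each sorted line, which here covers the read name but not the differing base. A toy sketch of that behaviour (GNU uniq; the read name, padding width, and file name are made up for illustration):

```shell
# Two "reads" flattened to one line each: the names (padded to a fixed
# 50-character field) are identical, the sequences differ by one base,
# mimicking the N-vs-t pair shown above.
printf '%-50s %s\n' '@READ:46:FC:5:108:10753:7688/2' 'ACGTN'  > reads.txt
printf '%-50s %s\n' '@READ:46:FC:5:108:10753:7688/2' 'ACGTt' >> reads.txt

sort reads.txt | uniq -w 50      # collapses the pair to one record
sort reads.txt | uniq -w 50 -d   # prints one line per duplicated name
```

Note that -w is a GNU extension; BSD uniq does not have it.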
Greetings,
Johannes

On 15.10.2014 at 16:15, Bastien Chevreux wrote:
> On October 15, 2014 at 3:02 PM Bert Brutzel <bertbrutzel@xxxxxxxxxxxxxx> wrote:
> > in the process of splitting a sequencing run into two genomes I run into
> > the problem of a fastQ file with duplicate read names, and duplicated
> > sequences with * in them. The problem occurs when I extract reads from
> > the MAF file. This gives me asterisks (*) in the sequence, which in turn
> > gives me duplicate entries when combined with unused reads sourced using
> > the info_debrislist.txt. So my question: how can I extract reads in fastQ
> > format from a reference mapping without having asterisks (*) in the
> > sequence?
>
> The gaps (asterisks) can be dealt with by using '-d'. Now to the general
> approach:
>
> - to perform a first split of reads into two organisms, I would simply map
>   the whole data to both organisms at once. That is, have the reference
>   sequence contain all contigs/genomes from both organisms. The remaining
>   reads I would assemble de novo and try to use GC content or other means
>   (intron/exon vs. no intron/exon) to assign them to one of the organisms.
> - extracting reads: the info directory contains a file named
>   "*contigreadlist*" which, surprise, contains info on which read was
>   assigned to which contig. Instead of going the miraconvert -M way, one
>   could use that with -n.
> - extracting reads: miraconvert -M may still be useful if you want to
>   extract pre-processed (i.e. clipped) and partly error-corrected reads.
>   Just remember to use -C.
>
> HTH,
>   B.
>
> PS: if you need to filter for duplicate sequences (which shouldn't be in
> your data anyway): don't do it on sequences, always on names.
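For what it's worth, the contigreadlist-plus-name-filtering route suggested above can be sketched in shell. This is a toy example under stated assumptions: the two-column contig/read layout, the file names, and the contig and read names are all made up here, and it assumes plain 4-line FASTQ records.

```shell
# Toy stand-ins for MIRA's info/*contigreadlist* file (assumed here to be
# two columns: contig name, read name) and a 4-line-per-record FASTQ.
printf 'contig1\tread_a\ncontig1\tread_b\ncontig2\tread_c\n' > contigreadlist.txt
printf '@read_a\nACGT\n+\nIIII\n@read_c\nTTTT\n+\nIIII\n' > reads.fastq

# Collect the names assigned to one contig ...
awk '$1 == "contig1" { print $2 }' contigreadlist.txt > names.txt

# ... then pull the matching FASTQ records, keeping each name only once
# (deduplication by name, not by sequence, as advised in the PS).
awk 'NR == FNR { want[$1]; next }
     FNR % 4 == 1 {
         name = substr($1, 2)                  # strip the leading "@"
         keep = (name in want) && !(name in seen)
         seen[name] = 1
     }
     keep' names.txt reads.fastq > subset.fastq
```

The NR == FNR idiom loads the name list on the first pass; on the second file, `keep` is decided once per header line and carries through the following three lines of the record.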