[mira_talk] Re: 454 cleaning

  • From: Robin Kramer <kodream@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 18 Nov 2010 09:07:58 -0700

I can track down the forward primers pretty easily.

They are hard to miss, considering 1/4 of the reads start with:

tcagAAGCAGTGGTATCAACGCAGAGTACGGGGG

I think that GGGGG at the end is the leading G problem, hypothetically
due to linker chemistry.

I have tried it with four-G, five-G, and six-G homopolymers, and it seems
there is a significant drop at the sixth G.
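For anyone wanting to repeat that check, here is a minimal sketch of one way
to tally the G-run length that follows the primer at the start of each read.
It assumes the reads are already loaded as plain strings; the function name
and the toy reads are made up for illustration and are not from the data set.

```python
import re
from collections import Counter

# Primer from the post, without the "tcag" key and without the trailing Gs.
PRIMER_CORE = "AAGCAGTGGTATCAACGCAGAGTAC"


def count_leading_g_runs(reads, key="TCAG", core=PRIMER_CORE):
    """Tally the length of the G run that follows key+core at the read start."""
    pattern = re.compile(re.escape(key + core) + r"(G*)", re.IGNORECASE)
    tally = Counter()
    for read in reads:
        match = pattern.match(read)  # match() anchors at the first base
        if match:
            tally[len(match.group(1))] += 1
    return tally


# Hypothetical toy reads, only to show the call.
toy_reads = [
    "tcag" + PRIMER_CORE + "GGGG" + "ATTC",
    "tcag" + PRIMER_CORE + "GGGGG" + "TGGACC",
    "tcag" + PRIMER_CORE + "GGGGGG" + "CATT",
    "tcagACGTACGTACGT",  # no primer, so it is not counted
]
print(sorted(count_leading_g_runs(toy_reads).items()))
# [(4, 1), (5, 1), (6, 1)]
```

On real data the interesting number is how quickly the tally falls off as the
run gets longer.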

Is there any problem with feeding that into SSAHA2?
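In case it is useful, a small sketch of writing the primer and its reverse
complement out as FASTA text for a screening run. This assumes the screening
tool takes a plain FASTA file of sequences to mask; the record names and the
helper functions are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def adapters_to_fasta(adapters, width=60):
    """Return FASTA text with each adapter plus its reverse complement."""
    records = []
    for name, seq in adapters.items():
        seq = seq.upper()
        for label, s in ((name, seq), (name + "_rc", reverse_complement(seq))):
            lines = [s[i:i + width] for i in range(0, len(s), width)]
            records.append(">" + label + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


# Primer from this post, with the five-G tail; the record name is made up.
fasta = adapters_to_fasta({"fwd_primer_5G": "AAGCAGTGGTATCAACGCAGAGTACGGGGG"})
print(fasta, end="")
```

Including the reverse complement as its own record means the screen does not
depend on the tool searching both strands.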

Also, I found the reverse complement of this sequence, hence our chimeras.
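A minimal sketch of that check, run against an excerpt of contig
SRR054580_Asha_rep_c1 from my original post quoted below. The helper names
are only for illustration, and exact string matching is an assumption; real
reads would likely need some tolerance for mismatches.

```python
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACGGGGG"  # primer from this post, key removed

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack, needle):
    """Return 0-based start offsets of every (possibly overlapping) hit."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


def adapter_hits(contig, adapter=ADAPTER):
    """Exact-match offsets of the adapter and of its reverse complement."""
    contig = contig.upper()
    adapter = adapter.upper()
    return {
        "forward": find_all(contig, adapter),
        "reverse": find_all(contig, reverse_complement(adapter)),
    }


# Excerpt spanning the suspected junction in contig c1 (from the quoted post).
JUNCTION = ("TGTTTGTGGTGTCCCCCGTACTCTGCGTTGATACCACTGCTT"
            "AAGCAGTGGTATCAACGCAGAGTACGGGGGTGGACCCAATGACACC")
print(adapter_hits(JUNCTION))
# {'forward': [42], 'reverse': [12]}
```

In this excerpt the reverse complement ends exactly where the forward copy
begins, which is what you would expect at a join between two untrimmed reads.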

Is there potentially a B adapter floating around in the mix?  I didn't
see any obvious second adapter in the sequence.

Also FYI, the adapter was an exact match for many sequences in the TSA.

Sincerely yours,

Robin

On 11/17/10, Robin Kramer <kodream@xxxxxxxxx> wrote:
> It seems the sff's are available from the SRA, but only through the .SRA
> file.
>
> Robin
>
> On 11/12/10, Gao, Guangtu <Guangtu.Gao@xxxxxxxxxxxx> wrote:
>> Hi Robin,
>>
>> You might consider checking the adaptors and contaminants using SeqTrim.
>> I also downloaded some EST sequences from NCBI for assembly before. I
>> found that the adaptors are not totally cleaned from those sequences and
>> they made chimeras.
>>
>> Guang
>>
>> -----Original Message-----
>> From: mira_talk-bounce@xxxxxxxxxxxxx
>> [mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Robin Kramer
>> Sent: Thursday, November 11, 2010 12:44 PM
>> To: mira_talk@xxxxxxxxxxxxx
>> Subject: [mira_talk] 454 cleaning
>>
>> Hi,
>>
>> I am doing an assembly from data publicly available at NCBI.
>>
>> The data are available here:
>>
>> http://www.ncbi.nlm.nih.gov/sra/SRX021565?report=full
>>
>> It is 454 data, but unfortunately neither the sff nor xml files are
>> available.
>>
>> I assembled the data with MIRA, using the no xml flags, which appeared
>> to give a nice assembly.
>>
>> However, when I BLASTN and BLASTX the first and fourth contigs, there
>> appear to be problems with the data.
>>
>> With the first contig, when I blastx it, it gives strong hits to two
>> different genes on different ends, as if it were a chimera.  When I
>> blastn the sequence I get a strong hit on one side, then in the middle
>> I get a section with multiple hits to different species in the TSA.
>> When I look at the pileup, there is a thin place in the gene and a huge
>> drop-off in coverage from one side.
>>
>> I think this appears to be due to an adapter trimming problem with the
>> 454 data.
>>
>> The fourth contig, when run through blastn, has a very strong gene hit
>> from a closely related species, but at the end has another small stretch
>> that matches many other sequences in TSA that are distantly related.  The
>> adapter-looking portion has low coverage, with a giant change in
>> coverage in the strong region.
>>
>> To me it appears as if some of the adapters are consistently not
>> getting trimmed (in this set and in TSA).
>>
>> Here is a relevant thread in seqanswers.
>> http://seqanswers.com/forums/showthread.php?t=3462
>> As well as a link out to previous discussions in this list.
>> //www.freelists.org/post/mira_talk/454-adaptor-clipping
>>
>> Is there any consensus on recleaning the 454 adapters?  I don't even
>> know what adapter sequences to expect.
>>
>> The assemblies of the two contigs are pasted after this message.
>>
>> Sincerely yours,
>>
>> Robin
>>
>> >SRR054580_Asha_rep_c1
>> AGTTTCTTAACACTTGGACCAATATATTATTTTCCTTTGTTTTGCAAGAAGGATAAAAGA
>> AAGAAACAAASGWMDAAAAGAGTTTACTAGAAAACTCATCGAGCTAGTTTCTCCACTTAT
>> TTATTTTATGCTTTTCCCGCAAAAGTTTGGTGACTCATACATGAGGATAGATACACATAG
>> ACTCACGTTATTTTACACACGTATATATATAAGGAAAGGCAGGCTAAGCCTTTGATTTAT
>> TTGATTATTGATCCGCGCACTATTGGCAAAAAGACAGTAGTGGGGTAGCACAGCAGATGC
>> AAAAAGATGAAGCATAGCTCTAAGCCACATATCTCATTTGAGAGTGACGAGGAGGTGGAA
>> CGAGAAAGTTGAAAGGGTTGTTGTTCTTGATCTGCCTGGCCTGTTCGCTATGGAGGTTGA
>> AAGTGTGTTGAATGACTTCCTCAGGCAATGCGTTTAACAAAGAGTTTCTACCTGCAAGTG
>> TGCCAGTCACAGGTATATCATGGGTCTTGAATGCCACGTACTTGAAGTTGTTGCTCTGCG
>> ATTTTGCAGCCACCGCAAAGTTTTGTGGCACGATCAGCACTTGTCCCTCTTGCAGCTCCC
>> CATCAAACACTCTATCACCAGTGCAATTCACCACTTGCATCATCGCCCTCCCTTCCAATG
>> CGTATACTATGCTGTTTGCGTTCAGGTTGTAGTGAGGCACGAACATGGCATTCTTGCGGA
>> GAGATCCGAACTGAGCACTGAGTTTGAGGAGCAAGAGGGCTGGGAAGTCAAGGCCGGTGG
>> CGGTTGTAATGCTACCAGCTTGAGGGTTGAAGAAGTCAGGCGATGAAGTTTGACCAATGT
>> TGTGGCGAAGTCTCATTGTGCAAATGGTTTCATCAATGCCATTTCTGCTCTTGCTTTTGC
>> TCTTCTGTGGCTTCTCTTCCTCTTCATCGTCGTCATCTTCTTCCTCTTCCTCTGCTCTCT
>> GTTGCTGCTTTCTCGTTGGTGGAGCTGTCACGCTCAGACCTCCCTCCACTTTCACAATGG
>> CTCCTTTCTCTTCGTCCTCGTTCACACCTTGGAGGTTTTTCACTATCTTCCTGTCCACGT
>> TCAACGCTTGTTCCAAGAATTCTGGGGTGAAGCCACTGAATATGTTGCCGCCTTCATTAT
>> CTTCTTCTTGTTCTTGATGTTGTTTTCCCTTCTGGCTTTGGCTTTGCTGATATTGTACGA
>> ACTCTTGCTCTTGGTTCCCAGCAAGATAGAATCTCCTAGGCATCTGATCGAGCTGGTTCT
>> GTAAGCTGTTGGTGTGAATAAGAGAAACTGCAACAACGGGAGTGTCTTGATTGTTGAACA
>> TCCAGAAAGCAGCACCGGTAGGCACTGCGATCAAATCACCCTCTCTAAAGTGATACACCT
>> TTTGGTGACGGTCTTGAGGCTTCTGGCTTTGTCCTCTTTGAGTTGGCTCTTCAAAAGTCT
>> GAGGACAACCGGAGAAAATGATGCCAAAAATACCACTACCTTGTTGAATGAAGATTTGCT
>> GGGGAGCGTTGGTGAAGAATGGTCTGCGGAGGCCATTGCGTTGGAGGGTGCAGCGAGAGA
>> GGGCAACACCGGCACACTGGAAAGGCTTGCTGTTAGGGTTCCATGTCTCTATGAACCCAC
>> CTTCCGACTCTATACGGTTATCGGGTTTGAGGGCATTCATGCGTTGGAGTTGGCACTCAT
>> ATTCATTTTGCTGTGGCTGCTGTGTCTTATCTTTGCTAGCGAAGCACCCACTCAAAAGCA
>> CAAGACAAAGGGAAAGAGATAGCGCAAGAAGCTTAGCCATGGATATGAATATGATTGATT
>> TGTTTGTGGTGTCCCCCGTACTCTGCGTTGATACCACTGCTTAAGCAGTGGTATCAACGC
>> AGAGTACGGGGGTGGACCCAATGACACCATTTTCATTTATTATTCGGATCATGGTGCTCC
>> TGGTCTTGTCACCATGCCAGTAGGGGGAATATGTCATGGCCAACGATTTTGTGAATGTCT
>> TGAAGAAGAAACATGATGCTAAATCCTACAAAAAGATGGTGATATACTTGGAAGCATGTG
>> AATCTGGGAGCATGTTTGAAGGGATACTACCTAATAACATAAGCATATATGCGACCACAG
>> CTTCCAACGCAGATGAGGATAGTTTTGCATATTATTGTCCTCATTCCTACCCTTCTCCTC
>> CAACTGAGTACACCACTTGTTTGGGAGATGTGTACAGCATTTCGTGGTTAGAAGATAGTG
>> ACAAAAATGACATGACAATAGAAACGCTGCAGCAACAATATGAAACCGTTCGCCGAAGAA
>> CGTTAATTGGTAATGTCGACACCTCTTCTCATGTGAAACAATACGGAGATAGAAAATTCG
>> AGAACGATACTCTTGCTACCTACATTGGTGCACCTGTTAAAACCAACCCCACCAACTCTG
>> CAAATGCATATTCCTTTGAACCATATAGTCCTCAAACTAGACATGTTAGCCAACGAGATG
>> CTCATTTACTCTACCTTAAGCTAGAGTTGCAAAAAGCCCCGGATGGTTCTATGGAAAAGT
>> TGAAAGCTCAAATAGAGTTGGATGATGAAATTGCACATAGGAAGCATTTAGATAGTGTTT
>> TCCATCTCATAGGGGATCTCTTGTTTGGAGAAGAGAATAATATCTCTACCATGTTGCTCC
>> ATGTTCGTCCACCAGGCCAGCCTCTTGTCGATGATTGGGATTGTTTCAAGACCCTTATAA
>> AAACTTACGAGAGCAATTGCGGTAAATTGTCAATCTATGGAAGGAAATACACAAGAGCCT
>> TTGCTAACATGTGCAATGCTGGCATTTCTGAGGAGCAAATGGTAGTAGCCTCTTCACAAG
>> CTTGTCCCAAGGAAAATCCTTCTTAAATTAATTCGTTAAGTTGATAATGTAATAACCAAT
>> ATATATCATGAAAGATTAAAAATTGTGCTTTCATTCTACAAAATGGATTATAATCCTTTG
>>
>> >SRR054580_Asha_rep_c4
>> TCTCCGACTCAGAAGCAGTGGTATCAACGCAGAGTCTTGGGGAACTGGAATTGACGATCA
>> AGTTGGTCACACCTGTTGCTCCAGCAACATAGTGCAGAAATTGCATGTGTCCAATGTGTA
>> GATCTCTAACAAGATCATAATTATAACATTCTATGTGTAGTTGACTCTTGCTTTTGATTA
>> ACTCCTGCATAGATGTTTCTACCAAAAATGAAAAAAAAAATCATTAATAGATGCATATTG
>> CAGCTAAATTTAGCAGTGAGTTGGTGATACCTCATCCCCCAGTTAGATAAAAGCCACTAG
>> AAGCTGCATTTTCAAATCAACAAGTAGTGATTTATGGCTTCTTTGGGTTTTATGGTGTGT
>> TTTGTAGAAAATTTGTCCTTCATTTTAGCTATGAGCATTCATTGGGTATTGCATAAGTTT
>> TGATGCTATTGTATTGATTTTGATATAAGAAAAGAAAAGTTGTAATGCGTTTGTTTCAAT
>> TATTTTTTTTTAAAGAAATGATATTTTTAACTTGTGGAGAGTTTTAAGAGATTTAGATAA
>> CTTGTAAGGTAACAGATTGTAGAAGTATAAATTACTCTGCCATAAATGAAGCTTTAAGTG
>> CACTACAAGTAAACAACT
>>
>> --
>> You have received this mail because you are subscribed to the mira_talk
>> mailing list. For information on how to subscribe or unsubscribe, please
>> visit http://www.chevreux.org/mira_mailinglists.html
>>
>
