[mira_talk] Re: assembly to identify a large insert

  • From: Thomas Goldman <tomgoldman@xxxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 25 Oct 2012 11:45:58 -0700

Thanks Bastien. Where are the debris file reads? I have the debrislist.txt
file which lists the reads, but is there a file with the actual debris
reads?

 

Tom

 

From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx]
On Behalf Of Bastien Chevreux
Sent: Thursday, October 25, 2012 11:31 AM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: assembly to identify a large insert

 

On Oct 25, 2012, at 20:08 , Thomas Goldman <tomgoldman@xxxxxxxxxxx> wrote:

Secondly, I have an issue that I was hoping someone has had some experience
with. I'm trying to identify the location of a large insert (~7.2 kb) in a
MIRA 3.4 assembled genome for which I have MiSeq reads against a reference.
Unfortunately, the inserted region reads (cassette) do not assemble to the
reference, I'm assuming simply because the reads are not in the reference
and are thrown out. I also tried a de novo assembly. In this case, the
cassette is assembled, but unfortunately it produces a scaffold by itself
without flanking reference sequence, so I still cannot determine where the
cassette is inserted. I think this is because the 5' and 3' regions of the
cassette itself have homology to other parts of the genome.

 

You cannot "force" reads to map against a non-existing reference; that
simply is not possible.

 

The easiest way out for you might be this: take the reads from the debris
file of the mapping project and assemble them "denovo,est,accurate". EST is
quite important there, as those reads will not behave like a normal genome.
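
If the project only provides a list of debris read names (such as
debrislist.txt) and not the reads themselves, the reads can be pulled out of
the original input FASTQ by name. What follows is a minimal sketch of that,
not something MIRA ships. It assumes the read name is the first
whitespace-separated token on each line of the list, that the FASTQ uses plain
4-line records, and that the names in the list match the FASTQ header names
exactly. The function names and the command-line arguments are made up for
illustration.

```python
import sys


def load_names(lines):
    """Collect read names: first whitespace-separated token of each non-empty line."""
    names = set()
    for line in lines:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, wanted):
    """Yield 4-line FASTQ records (as lists of strings) whose name is in `wanted`.

    Assumes strict 4-line records: header, sequence, '+', qualities.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            parts = record[0][1:].split()  # drop the leading '@'
            name = parts[0] if parts else ""
            if name in wanted:
                yield record
            record = []


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: script.py debrislist.txt reads.fastq debris.fastq
    list_path, fastq_path, out_path = sys.argv[1:4]
    with open(list_path) as fh:
        wanted_names = load_names(fh)
    with open(fastq_path) as fin, open(out_path, "w") as fout:
        for rec in filter_fastq(fin, wanted_names):
            fout.write("\n".join(rec) + "\n")
```

The resulting FASTQ would then be the input for the "denovo,est,accurate"
assembly. If the names in the list carry a suffix that the FASTQ headers do
not (or the other way round), the name comparison needs adjusting first.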

 

You should get your insert out of it. If you are lucky, the borders of that
insert are non-repetitive with respect to the genome, so that you should be
able to determine its place quickly. If you are unlucky, find out which
pairs in the insert have only one read in the insert, then fetch the
positions of the mapped mates in the mapping project to give you an idea
where your insert is placed. For this you will need a mapping project which
did not use the short read merging option (you made sure of that).
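
The mate lookup described above can be sketched roughly as follows. This is
only an illustration under assumptions: read pairs are taken to be named with
"/1" and "/2" suffixes, the list of reads in the insert contig and the table
of mapped read positions are assumed to have been exported from the two
projects beforehand, and the function names are hypothetical.

```python
def mate_name(read):
    """Return the mate's name under the assumed '/1' and '/2' suffix convention."""
    if read.endswith("/1"):
        return read[:-2] + "/2"
    if read.endswith("/2"):
        return read[:-2] + "/1"
    return None  # name does not follow the assumed convention


def anchoring_mates(insert_reads, mapped_positions):
    """Find pairs with exactly one read in the insert and the mate mapped.

    insert_reads: names of reads assembled into the insert contig.
    mapped_positions: dict of read name -> position in the mapping project.
    Returns a list of (read_in_insert, mate, mate_position) tuples.
    """
    insert_reads = set(insert_reads)
    hits = []
    for read in sorted(insert_reads):
        mate = mate_name(read)
        if mate is None or mate in insert_reads:
            continue  # unpaired name, or both reads lie inside the insert
        if mate in mapped_positions:
            hits.append((read, mate, mapped_positions[mate]))
    return hits


if __name__ == "__main__":
    # Toy data, purely illustrative.
    reads_in_insert = ["p1/1", "p1/2", "p2/1", "p3/2"]
    positions = {"p2/2": 1200, "p3/1": 8450}
    for read, mate, pos in anchoring_mates(reads_in_insert, positions):
        print(read, mate, pos)
```

If the reported mate positions pile up around one or two places in the
reference, those are the candidate insertion borders.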

 

Alternative to the above approach (and pretty quick to do if there are not
too many "SNPs" between your reference and the mapped data): use gap4 or gap5
to quickly skip visually through all SNP locations marked by MIRA (marked
with "SROc" tags). If the insert is in some non-repetitive area, you'll find
a cluster of SROc/SRMc/UNSc tags together with the typical pattern for an
insert in your genome.
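
If there are too many tags to step through by eye, the positions can be
pre-screened for clusters before opening the editor. A rough sketch, assuming
the reference positions of the SROc/SRMc/UNSc tags have already been exported
into a plain list of integers by some other means; the function name, the
50-base gap and the minimum of 3 tags are arbitrary illustrative choices, not
MIRA defaults.

```python
def tag_clusters(positions, max_gap=50, min_tags=3):
    """Group tag positions into clusters.

    Consecutive positions (after sorting) that are at most `max_gap` apart
    belong to the same cluster. Only clusters with at least `min_tags`
    positions are reported, as (first_pos, last_pos, tag_count) tuples.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the running cluster and start a new one.
            if len(current) >= min_tags:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_tags:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


if __name__ == "__main__":
    # Toy positions, purely illustrative.
    for first, last, count in tag_clusters([100, 5000, 5010, 5030, 5045, 9000, 9020]):
        print(first, last, count)
```

The reported regions are only candidates; each one still has to be checked
in gap4 or gap5 for the actual insert pattern.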

 

Oh, and you'll have to hope that your insert inserted itself in a region
where the flanking repeats are not too long, because otherwise even the
paired-end reads will not help.

 

Hope that helps,

  Bastien

 
