[mira_talk] Re: assembly to identify a large insert

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 25 Oct 2012 20:30:38 +0200

On Oct 25, 2012, at 20:08 , Thomas Goldman <tomgoldman@xxxxxxxxxxx> wrote:
> Secondly, I have an issue that I was hoping someone has had some experience 
> with. I’m trying to identify the location of a large insert (~7.2KB) in a 
> MIRA 3.4 assembled genome for which I have MiSeq reads against a reference. 
> Unfortunately, the inserted region reads (cassette) do not assemble to the 
> reference, I’m assuming simply because the reads are not in the reference and 
> are thrown out. I also tried a de novo assembly. In this case, the cassette 
> is assembled, but unfortunately it produces a scaffold by itself without 
> flanking reference sequence, so I still cannot determine where the cassette 
> is inserted. I think this is because the 5’ and 3’ regions of the cassette 
> itself have homology to other parts of the genome.

You cannot "force" reads to map against a non-existing reference … that simply 
is not possible.

The easiest way out for you might be this: take the reads from the debris file 
of the mapping project and assemble them with "denovo,est,accurate". The EST 
setting is quite important there, as those reads will not behave like a normal 
genome.
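
A minimal sketch of the extraction step, assuming the debris file is a plain 
text list with one read name as the first token of each line, and that the 
original reads are in a 4-line-per-record FASTQ file. The file names in the 
usage comment are hypothetical; check what your MIRA version actually wrote.

```python
def read_debris_names(lines):
    """Collect read names: first whitespace-separated token of each line.

    Assumption: blank lines and lines starting with '#' carry no read name.
    """
    names = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split()[0])
    return names


def filter_fastq(fastq_lines, wanted):
    """Yield 4-line FASTQ records whose read name is in `wanted`.

    Assumption: strict 4-line records (no wrapped sequence lines).
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header = record[0][1:].split()  # drop leading '@', keep the name
            name = header[0] if header else ""
            if name in wanted:
                yield record
            record = []


# Usage (hypothetical file names):
# with open("debrislist.txt") as d, open("reads.fastq") as fq, \
#         open("debris_reads.fastq", "w") as out:
#     wanted = read_debris_names(d)
#     for rec in filter_fastq(fq, wanted):
#         out.write("\n".join(rec) + "\n")
```

The resulting FASTQ would then be the input for the "denovo,est,accurate" 
assembly.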

You should get your insert out of it. If you are lucky, the borders of that 
insert are non-repetitive with respect to the genome, so that you should be 
able to determine its place quickly. If you are unlucky, find out which pairs 
have only one read in the insert, then fetch the positions of the mapped mates 
in the mapping project to give you an idea where your insert is placed. For 
this you need a mapping project which did not use the short read merging 
option, so make sure of that.
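
The mate lookup could be sketched roughly like this. It assumes you have 
already extracted, by whatever means, the names of the reads in the insert 
contig and a table of read name to reference position from the mapping 
project. The "/1" and "/2" mate suffixes and all function names are 
assumptions for illustration, not something MIRA prescribes.

```python
from collections import Counter


def mate_name(read_name):
    """Return the mate's name under the assumed /1 and /2 suffix convention."""
    if read_name.endswith("/1"):
        return read_name[:-2] + "/2"
    if read_name.endswith("/2"):
        return read_name[:-2] + "/1"
    return None  # unpaired or unknown naming scheme


def anchoring_mates(insert_reads, mapped_positions):
    """List (mate, position) for insert reads whose mate is outside the insert
    but was placed in the mapping project."""
    insert_reads = set(insert_reads)
    hits = []
    for read in sorted(insert_reads):
        mate = mate_name(read)
        if mate is None or mate in insert_reads:
            continue  # both reads of the pair sit in the insert: no anchor
        if mate in mapped_positions:
            hits.append((mate, mapped_positions[mate]))
    return hits


def position_histogram(hits, bin_size=1000):
    """Count anchoring mates per reference bin (bin start -> count)."""
    return Counter((pos // bin_size) * bin_size for _, pos in hits)
```

Bins with many anchoring mates are candidate insertion sites; with a clean 
insertion you would expect one such region, with mates approaching it from 
both sides.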

An alternative to the above approach (and pretty quick to do if there are not 
too many "SNPs" between your reference and the mapped data): use gap4 or gap5 
to quickly skip visually through all SNP locations marked by MIRA with "SROc" 
tags. If the insert is in some non-repetitive area, you'll find a cluster of 
SROc/SRMc/UNSc tags together with the typical pattern for an insert in your 
genome.
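
If there are too many tags to step through by eye, a small pre-filter can 
point at the dense spots first. This is only a sketch: it assumes you have 
already parsed the tags into (position, tag type) tuples yourself, and the 
gap and count thresholds are arbitrary starting values, not MIRA defaults.

```python
def tag_clusters(tags, max_gap=200, min_tags=3):
    """Chain tags whose neighbours lie at most `max_gap` bases apart and
    return only the chains containing at least `min_tags` tags."""
    clusters = []
    current = []
    for pos, kind in sorted(tags):
        # A gap larger than max_gap closes the current chain.
        if current and pos - current[-1][0] > max_gap:
            if len(current) >= min_tags:
                clusters.append(current)
            current = []
        current.append((pos, kind))
    if len(current) >= min_tags:
        clusters.append(current)
    return clusters
```

Each returned cluster is a list of (position, tag type) tuples; the first and 
last positions give the stretch of consensus to inspect in gap4 or gap5.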

Oh … and you'll have to hope that your insert inserted itself in a region where 
the flanking repeats are not too long, because otherwise even the paired-end 
reads will not help.

Hope that helps,
  Bastien
