[mira_talk] Re: Unusual mira usage inquiry

On Jun 28, 2012, at 15:34 , Nick Hathaway wrote:
> My lab is trying to use mira for something a little unusual.  We are trying 
> to see what different strains of plasmodium falciparum are in a sample.  We 
> used PCR to amplify one gene about 300 bp in length and used 454 sequencing 
> to create non-paired end reads.  We want to use mira in a preliminary step to 
> see if we can simply get out the different strains by being very strict with 
> parameters to see if we can form some contigs of the different strains.  Do 
> you have any suggestions for parameters we could play around with to do this. 
>  Also I'm brand new at this kind of work so sorry if I'm not being clear.  

Let's see. With 300bp, most of your 454 sequences should cover the gene 
completely, which is good. 

You probably do not want to play around with parameters too much at first: by 
default, MIRA is already very sensitive and will pick up even low-abundance 
variants as soon as these reach a certain threshold (see -CO:mrpg for this). 
What I did not quite understand in your question was "very strict" parameters. 
What are you looking for? Is "very strict" equal to "I want to find even the 
lowest-abundance variants" or is it rather "don't care too much about 
variants, I want to get a full gene"?

There are several approaches you could take, depending on what you have and 
what you want. I really cannot give you a recipe, just hints, because much of 
what needs to be done is determined by what the data looks like.

Do you already have the basic sequence of the gene? If yes, then performing a 
simple mapping with, say, 1000 to 2000 random reads should give you an overview 
of what to expect, just to get to know the data. I quickly googled 
Plasmodium, it may have splicing, right? Mapping with just 1k to 2k reads will 
tell you whether or not you really have to account for that. It will also tell 
you about the most frequent variations (SNPs, small indels) you might need to 
take into account. Again, 1k or 2k reads will give you a broad overview. 
Assuming you have SFF files, just follow this basic guide:
- extract the data to a more manageable format (FASTQ), see 
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_454_preparing_the_454_data_for_mira
 and the section after.
- extract the first 1000 or 2000 sequences from that FASTQ, name this, say, 
testgene_in.454.fastq
- copy the sequence of your gene of interest as FASTA to your project directory 
and name it, say, testgene_backbone_in.fasta
- start a mapping with your data, a bit like this:

mira \
  --project=testgene --job=mapping,genome,accurate,454 \
  -AS:nop=1 \
  -SB:bsn=MyReferenceGene:bft=fasta:bbq=30 \
  454_SETTINGS \
  -SB:ads=yes:dsn=MyUnknownGenes \
  >&log_assembly.txt
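
For the read-extraction step in the list above, a minimal shell sketch could 
look like the following. It assumes every FASTQ record occupies exactly four 
lines (no wrapped sequence or quality lines) and simply takes the first N 
records. The function name take_first_reads and the input name allreads.fastq 
are made up for illustration; only testgene_in.454.fastq comes from the steps 
above.

```shell
# Hypothetical helper: copy the first N reads of a FASTQ file to a new file.
# Assumption: one record = exactly 4 lines (header, sequence, '+', qualities).
take_first_reads() {
  # usage: take_first_reads N input.fastq output.fastq
  local n="$1" src="$2" dst="$3"
  head -n "$((n * 4))" "$src" > "$dst"
}

# Example call (allreads.fastq is a placeholder for your extracted data):
#   take_first_reads 2000 allreads.fastq testgene_in.454.fastq
```

Note that this takes the first reads in file order, not a random sample. If 
the order of reads in your file is not arbitrary, a real random subsample 
would be the safer choice.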

Convert the result to gap4 or gap5 and have a look at it. Search for the 
markers MIRA will have set (SROc and MCVc tags in the assembly). Increase 
-CO:mrpg if the SNP marking turns out to be too sensitive. Once the first 
analysis is done, tackle the rest of the reads.
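
Before opening the editor, a rough count of how many such tags were set can 
be had with grep. This is only a sketch: it assumes you have a plain-text 
result file (for example the CAF output) in which the tag names appear 
literally, and it counts matching lines, not distinct SNP positions. The 
function name count_tag and the file name in the example call are 
placeholders, not something MIRA provides.

```shell
# Hypothetical helper: count lines of a text file that contain a tag name.
# Assumption: the tag name appears literally, as a whole word, in the file.
count_tag() {
  # usage: count_tag TAGNAME RESULTFILE
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function usable under "set -e".
  grep -c -w -- "$1" "$2" || true
}

# Example calls (testgene_out.caf is a placeholder path):
#   count_tag SROc testgene_out.caf
#   count_tag MCVc testgene_out.caf
```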

In case you do not have the gene already as reference, well, then doing a 
de-novo assembly in EST mode with 1000 to 2000 reads should give you a pretty 
good idea of what to expect. Start MIRA like this:

mira \
  --project=testgene --job=denovo,est,accurate,454 \
  >&log_assembly.txt

Then convert the result to gap4 or gap5 and have a look at it in the contig 
editor. Using the contig merge functions of the editor, you will get an idea 
of which mutations caused the reads to be put into different contigs. I would 
try to reconstruct a canonical gene, probably the version with the most 
abundant variants. Using this as reference, continue with a mapping approach 
as described above. In case there are wildly different splice variants, do 
this for several "canonical genes".

hth
  B.


