[mira_talk] Re: Reads names

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 13 Jul 2009 22:45:40 +0200
On Montag 13 Juli 2009 Michal Stuglik wrote:
> I would ask you one small thing I cannot find this in manuals: I do
> miraEST assembling for 454 data and in result files (*.ace) names of the
> input reads are changed so I can't track back to the original reads. Is
> there any switch for that?

Hello Michal,

miraEST? Which version are you working with because I renamed "miraEST" to 
"miraSearchESTSNPs" a while ago (people got confused with running mira is est 
mode and miraEST).

Anyway .. the documentation on miraSearchESTSNPs (miraEST) is lacking quite a 
bit, I never came around writing a good one. Am I guessing right that you want 
to find out which reads go into which final contigs?

If yes, I've searched my archive for a mail that I once sent for an earlier 
version of MIRA (2.4.x) but which should clarify how to find the information 
you want. Files and directories have moved around a bit or have changed name 
slightly, but basically the information is there.

Please note that the program there is still called miraEST and that parameter 
options I gave in there may have changed quite a bit in the mean time (there 
were bigger changes in 2.8.x and 2.9.x).

-----------------------------------------------------------------------------

  Generally speaking, miraEST is a three-stage assembly system that was
  designed to catch SNPs in different strains. That is, one really should have
  different strains to analyse (and the information provided to the assembler)
  to make the most out of miraEST. Unfortunately, only the smaller of the
  demos (estdemo2) that are in the package shows this, and it even is not made
  very clear. Here's a quick overview on what miraEST does:
  
  Step 1: assemble everything together, not caring about strain information.
  Potential SNPs are not treated as SNPs, but as possible repeat marker bases
  and are tagged as such (temporarily) to catch each and every possible
  sequence alignment which might be important later. As a result of this
  stage, the following information is written out:
  
  Into "step1_snpsinSTRAIN_<strainname>.caf":
    - all the sequences of a given strain that are in contigs (can be aligned 
      with at least another sequence)
    - also, all sequences that are singlets BUT have been tagged previously as 
      containing tagged bases showing that they aligned previously (even to
      other strains) but were torn apart due to the SNP bases.
  
  Into "step1_nonsnps_remain.caf":
    - all the singlets of all the strains that have no such tag
  If no strain information is provided
  
  Obviously, if one did not provide strain information to the assembly of step
  1, all the sequences belong to the same strain (named "default"). The CAF
  files above are the input sequences for the next step.
  
  
  Step 2: Now, miraEST assembles each strain independently from each other.
  Again, sequences containing SNPs are torn apart into different contigs (or
  singlets) to give a clean representation of the "really sequenced" ESTs. In
  the end, each of the contigs (or singlets) coming out of the assemblies for
  the strains is a representation of the mRNA that was floating around the
  given cell/strain/organism. The results of this step are written out into
  one big file (step2_reads.caf) and a new straindata that goes along with
  those results (step2_straindata.txt).
  
  
  Step 3: miraEST takes the result of the previous step (which should now be
  clean transcripts) and assembles them together, *this time* allowing
  transcripts from different strains with different SNP bases to be assembled
  together. The result is then written to step3_out.* files and directories.
  
  > Also, I generally find that people want to know which original input
  > sequence ended up in which contig (or singleton). The way Mira is
  > creating contigs, with other contigs as the member sequences, makes this
  > difficult.
  
  You are not the first making this remark. Fortunately, I think that I can
  help you out there (see below).
  
  > Is there a way to produce final contigs where the directed
  > assembly uses original input sequences as the building blocks? I
  > understand that Mira edits these but it would be nice.
  
  Yes, there is. What I didn't tell in the description of the working process
  above is that (like the normal 'mira'), miraEST keeps track on a lot of
  things and writes out quite a lot of additional information files after each
  step. For step 1, look at step1_* files. For step 2, look at <strainname>_*
  files (and directories). Step 3, it's step3_* again.
  
  To get what you want, you will have to write a small script that basically
  collects the information from different output files and makes a small
  re-assembly with the initial sequences for each transcript.
  
  It should do something like this: first have a look at the
  step3_info_contigreadlist.txt (as you correctly did)
  
  > #
  > # contig_name   read_name
  > #
  > step3_c1        default_c2
  > step3_c1        default_c5
  > step3_c1        default_c1
  > ...
  > step3_c1        default_s11
  > step3_s2        default_c3
  > ...
  
  The first contig of step3 (step3_c1) has as first read a transcript formed
  out of contig number 2 from the strain "default" (default_c2), then there is
  default_c5, then ..., then there is a singlet form the "default" strain
  (default_s11). After that, the next result of step 3 (a singlet called
  step3_s2) is formed out of a contig of the default strain (default_c3). Etc.
  
  What you (the script) must now do, is to search in the corresponding step2
  contigreadlist of the strain on how the transcripts were built. E.g.: look
  at default_info_contigreadlist.txt and find something like this (made up by
  me as I do not have your corresponding file):
  
  #
  # contig_name   read_name
  #
  default_c1      gnlti136479291
  default_c1      gnlti136479409
  default_c1      gnlti136479240
  default_c1      gnlti136479336
  default_c2      gnlti136478532
  default_c2      gnlti136477760
  default_c2      gnlti136477568
  ...
  default_c5      gnlti136479043
  default_c5      gnlti136479138
  ...
  
  The transcript "default_c2" is made out of three reads (gnlti136478532,
  gnlti136477760 and gnlti136477568), "default_c5" out of two, etc.
  
  With that, you will have collected the necessary information to make a small
  assembly using only the reads that form one transcript of step3. Here, you
  will have to use the 'normal' mira binary. Put all sequences collected into
  one input suited for mira (EXP files, FASTA or CAF). Then call mira and make
  sure that the following parameters are included in the parameterlist:
  
  mira <your parameters here, especially the correct ones for naming and
  loading  the input files> -SB:lsd=yes -AS:nol=2:rbl=2:sd=no:ugpf=no 
  -DP:ure=no:tpae=yes -AL:egp=yes:egpl=low -CO:mr=yes:asir=yes
  
  Optionally include -CLIPPING and other options as you think right. That
  should do the trick. Incidentally, this is also quite near on how to call
  the normal mira binary for assembling ESTs when no strain information is
  present.

-----------------------------------------------------------------------------

Hope this helps,
  Bastien


-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
Follow-Ups:
- [mira_talk] Re: Reads names
  - From: Michal Stuglik
References:
- [mira_talk] Reads names
  - From: Michal Stuglik
[mira_talk] Re: Reads names

Other related posts: