[mira_talk] Assembly

  • From: Sharmista Saha <sharmistasaha@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 22 Sep 2009 10:19:31 +0530

I have an input Solexa data of 33pair reads,
where I have both a fastq and fasta seq file individually.
I have a gbf ref file from NCBI, and a strainin.text
*the fastq file has a quality score in phred form *(where at one point in
the manual asks me to convert to solexa score, using LSR:ssiqf=no, which i
was unable to do).
I started MIRA,
with the respective commands as follows:

**{*Copying and naming the sequence data** You need to create (or get from
your sequencing provider) exactly two data files: a FASTA file and a FASTA
quality file. Put them into an empty directory and rename them so that they
look like this: *

*arcadia:/path/to/myProject> ls -l
total 764313
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fasta
-rw-r--r-- 1 bach users 517905709 2008-03-28 21:24 bchocse_in.solexa.fasta.qual

*

 *Note 1: mira expects the quality file to contain Solexa scores (but only
one per base), not phred style quality scores. Should you have already the
scores already converted to phred style, you will need to set -LR:**ssiqf=no.
*

*Note 2: Should your quality file contain negative values, you have Solexa
scores. Else you probably have phred scores. *

*Generating the strain name data** MIRA will make use of ancillary
information when present. The strain name is such an ancillary information.
That is, we can tell MIRA the strain of each read we use in the assembly. We
could generate a TRACEINFO XML file, but we'll use only strain information
as ancillary data and there is an easier way: the straindata file. It's a
simple key-value file, one line per entry, with the name of the read as key
(first entry in line) and, separated by a blank the name of the strain as
value (second entry in line). E.g.: *

*1_1_207_113 bchocse
1_1_61_711  bchocse
1_1_182_374 bchocse
...
*

* Etcetera. *

* This file can be quickly generated automatically, using the extracted
names from the FASTA file and rewritten a little bit. Here's how: *

*arcadia:/path/to/myProject> grep ">" bchocse_in.solexa.fasta
   | sed -e 's/>//'
   | cut -f 1
   | cut -f 1 -d ' '
   | sed -e 's/$/ bchocse/'
   > bchocse_straindata_in.txt

*

 *Note 1: the above command has been splitted in multiple lines for better
overview but should be entered in one line. *

*Note 2: for larger files, this can run a minute or two. *

*Note 3: as you can also assemble sequences from more that one strain, the
read names in bchocse_straindata_in.txt can have different strain names
attached to them. *

* This creates the needed data in the file bchocse_straindata_in.txt (well,
it's one way to do it, feel free to use whatever suits you best). *

* The directory now looks like this: *

*arcadia:/path/to/myProject> ls -l
total 896106
-rw-r--r-- 1 bach users 134822451 2008-04-08 23:51 bchocse_straindata_in.txt
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fasta

-rw-r--r-- 1 bach users 517905709 2008-03-28 21:24 bchocse_in.solexa.fasta.qual
*

 *Copying and naming the reference sequence** The reference sequence (the
backbone) can be in a number of different formats: FASTA, GenBank, CAF. The
later two have the advantage of being able to carry additional information
like, e.g., annotation. In this example, we will use a GenBank file like the
ones one can download from the NCBI. So, let's assume that our wild type
strain is in the following file: bchocwt.gbf. Copy this file to the
directory, renaming it as bchocse_backbone_in.gbf. *

*arcadia:/path/to/myProject> cp /somewhere/bchocwt.gbf bchocse_backbone_in.gbf
arcadia:/path/to/myProject> ls -l
total 902504
-rw-r--r-- 1 bach users   6543511 2008-04-08 23:53 bchocse_backbone_in.gbf

-rw-r--r-- 1 bach users 134822451 2008-04-08 23:51 bchocse_straindata_in.txt
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fasta
-rw-r--r-- 1 bach users 517905709 2008-03-28 21:24 bchocse_in.solexa.fasta.qual

*

 *Starting a mapping assembly: unpaired data** Starting the assembly is now
just a matter of a simple command line with some parameters set correctly.
The following is an example of what I use when mapping onto a reference
sequence in GenBank format: *

*arcadia:/path/to/myProject> mira
  --project=bchocse --job=mapping,genome,accurate,solexa
  -AS:nop=1
  -SB:lsd=yes:bsn=bchoc_wt:bft=gbf:bbq=30
  >&log_assembly.txt }

** Logfile what i got is as follows:


*

bash: --project=bchocse: command not found

So please now can you guide me where exactly I am wrong? I am quite curious
to know, as I am excited to use your tool to assemble my seqs.

Thanking you,
With due Regards,
Sharmistha Saha

Other related posts: