[mira_talk] Re: Why is a quality file asked for a reference when mapping?

From: Chris Hoefler <hoeflerb@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Mon, 30 Oct 2017 09:49:14 -0400

Nope, it's a different error. The first error you were getting was missing
qualities for the reference (bac_assemblies/Mes-B-60444.fasta.qual). Now,
you are getting an error because your reads (60444_genome.trimmedReads_
l1to29kb.fasta.qual) have missing qualities, probably because your method
of subsampling didn't extract the qualities with the sequences. You have
two options: 1) Go back and get your qualities for your extracted reads. A
convenient way to do this is to get the list of read names in your
subsampled set and re-extract them using miraconvert (use the "-n" switch)
which preserves the qual values. Or, 2) turn off quality value checking
completely by specifiying fna as the data type. See this for more
information,
http://mira-assembler.sourceforge.net/docs-dev/DefinitiveGuideToMIRA.html#sect_ref_manifest_readgroups_data

But this is to my knowledge the only one that is able to assemble such

reads using a reference sequence.

I would describe what you are trying to do more as a form of scaffolding,
rather than reference-guided assembly. And there are quite a number of
tools aimed at doing exactly that. See, for example, pbahaScaffolder (
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Depricated---AHA),
which is a python script that comes with the PB SmrtAnalysis package, or PB
Jelly (https://sourceforge.net/projects/pb-jelly/).

On Mon, Oct 30, 2017 at 7:10 AM, Thomas Bawin <thomas.bawin@xxxxxxxxx>
wrote:

Many thanks for your answer!

I am aware that MIRA is currently not the best software to deal with
Pacbio data. But this is to my knowledge the only one that is able to
assemble such reads using a reference sequence. In fact, I get inspired
from that article (DOI: 10.1038/srep36647, Hybrid assembly based on MiSeq
and PacBio sequences) where the authors used the software to extend
sequences.

Your suggestion looks really interesting and I will certainly investigate
it in the near future.

Regarding my specific question, I am unfortunately still not able to run
the analysis.

Here are the changes that I made:

- I updated MIRA to 4.9.6.
- I sub-sampled reads between 1 to 29 kb (they account for 99% of the
dataset).
- I added "technology = text" to the manifest.conf file.

But I got apparently the same error message (see below).

I understand that it will probably not give what I want, but nonetheless I
would like to see how it goes, and be able to run the software properly
(who knows, it might be useful for other projects :) ).

Best regards,

Thomas Bawin

Fatal error (may be due to problems of the input data or parameters):

************************************************************
********************
* File not found: ../60444_genome.trimmedReads_
l1to29kb.fasta.qual *
************************************************************
********************
->Thrown: void ReadPoolIO::priv_openFiles_fasta()
->Caught: main

Aborting process, probably due to error in the input data or
parametrisation.
Please check the output log for more information.
For help, please write a mail to the mira talk mailing list.
Subscribing / unsubscribing to mira talk, see:
//www.freelists.org/list/mira_talk

CWD: /media/vol2/scratch/cassava/Thomas/build_cmd2_from_bacs/mira
Thank you for noticing that this is *NOT* a crash, but a
controlled program stop.
Wrapped MIRA process exited with an error.

2017-10-27 16:53 GMT+02:00 Chris Hoefler <hoeflerb@xxxxxxxxx>:

To answer your specific question, you should set your reference to
"technology = text" to turn off the quality value checking.

To follow up on Bastien's questions, you are using an outdated Mira
version. The latest dev version is 4.9.6. It may not be marked "stable",
but it generates better assemblies than the older 4.0.2, and has some
helpful new features. Also, while Mira is still one of the best short read
assemblers around, it has been superseded in long read assembly by other
projects that handle it better (ex: Canu, Falcon, etc). It comes down to
design considerations. Mira was designed for lots of short but accurate
reads. The long read assemblers have been designed specifically for long
noisy reads. So while Mira does have official support for PacBio, you won't
be able to use reads longer than 29900 bp, and the error profile of the
corrected reads doesn't mesh well with some of Mira's quality clipping and
repeat detection routines, so you have to do a lot of tweaking of
parameters to get it to perform well.

As a general comment on what I think you are trying to do:
If the space between your BACs is longer than your longest reads, phasing
them into the correct haplotypes is a tricky problem. A naive
overlap-extension approach will probably not give you what you want.
Probably your best bet is to de novo assemble your shotgun reads using
Falcon+Falcon-unzip. This will give you "haplotigs", with some reasonable
level of confidence. If you can then take these and overlap them with the
unique portions of your phased BACs, you might be able to reconstruct the
full region, but you will have to do this very carefully. The primary
contig output from Falcon-unzip may also prove useful by indicating where
the unphased regions are. If you can map those to your BACs, you might be
able to properly phase your haplotigs. After that, you will have to resort
to other methods, such as optical mapping or Hi-C.

On Fri, Oct 27, 2017 at 2:31 AM, Thomas Bawin <thomas.bawin@xxxxxxxxx>
wrote:

Dear Bastien,

Thanks for your reply.

To answer your questions:

- This is MIRA 4.0.2.
- It seems that my attached file was not transmitted. The very last
lines were:

Fatal error (may be due to problems of the input data or parameters):

************************************************************
********************
* File not found:
*
*
../../../../../home/gcomadira/cassava/bac_assemblies/Mes-B-60444.fasta.qual
*
************************************************************
********************
->Thrown: size_t ReadPool::loadDataFromFASTA_rgid(const string &
filename, const ReadGroupLib::ReadGroupID rgid, bool countonly, const bool
wantsqualfiletoexist, const string & qualfilename, void ($
->Caught: main

Aborting process, probably due to error in the input data or
parametrisation.
Please check the output log for more information.
For help, please write a mail to the mira talk mailing list.
Subscribing / unsubscribing to mira talk, see:
//www.freelists.org/list/mira_talk

CWD: /media/vol2/scratch/cassava/Thomas/build_cmd2_from_bacs/mira
Thank you for noticing that this is *NOT* a crash, but a
controlled program stop.
Failure, wrapped MIRA process aborted.

- Finally, the length of the reads in my Pacbio dataset range from 1 to
56 kb. Should I subsample in order to keep only those below 29 kb ?

Best regards,

Thomas Bawin

2017-10-26 20:58 GMT+02:00 Bastien Chevreux <bach@xxxxxxxxxxxx>:

Three things:

1. which version of MIRA?
2. what's the exact error message of MIRA? Having that usually helps.
3. I'm skeptical MIRA will be able to do what you hope. It will
certainly stop as soon as it encounters reads >= 29900 bp.

B.

On 26 October 2017 at 12:54 Thomas Bawin <thomas.bawin@xxxxxxxxx>

wrote:

Dear MIRA users,

I am focusing on the reconstruction of a ~2Mb plant genomic region.

What I

have is some BAC sequences from there and Pacbio reads from the whole
genome.

I am interested in extending the non-overlapping BAC sequences with

the

Pacbio reads. In fine, I would like to be able to build a reference
sequence (per haplotype) for the genomic region by overlapping those
extended BACs.
The BAC sequences resulted from Pacbio sequencing then Falcon denovo
assembly.

The Pacbio reads were error-corrected then trimmed using Canu (I know

that

Canu does not provide quality scores... but I was not able to do that

in a

better way).

So far I was not able to start the analysis because the software

cannot

find a *.fasta.qual file associated to the reference.

It however appears in the manual that such file would not be required
(regarding the examples that are provided).

Is someone can explain me that is wrong in my commands?

I would also like to take this opportunity to ask you if you have any
advice for such mapping?

Below is the code for the manifest file. You will also find attached

a log

file with the error message.

Thanks in advance!

Thomas Bawin

project = extended_bacs_60444
job = genome,mapping,accurate

# define reference sequence(s)

readgroup
is_reference
data = ../../../../../home/gcomadira/cassava/bac_assemblies/Mes-B-6
0444.fasta
strain = reference_60444

# define dataset of reads

readgroup = corrected_pacbio_reads
data = ../60444_genome.trimmedReads.fasta
technology = pcbiohq
default_qual=30
strain = reads_60444

# define parameters

parameters = \
COMMON_SETTINGS -GE:not=4 -SB:trim_overhanging_reads=no \
PCBIOHQ_SETTINGS -CO:mrpg=5 --noqualities

--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html

References:
- [mira_talk] Why is a quality file asked for a reference when mapping?
  - From: Thomas Bawin
- [mira_talk] Re: Why is a quality file asked for a reference when mapping?
  - From: Bastien Chevreux
- [mira_talk] Re: Why is a quality file asked for a reference when mapping?
  - From: Thomas Bawin
- [mira_talk] Re: Why is a quality file asked for a reference when mapping?
  - From: Chris Hoefler
- [mira_talk] Re: Why is a quality file asked for a reference when mapping?
  - From: Thomas Bawin

[mira_talk] Re: Why is a quality file asked for a reference when mapping?

Other related posts: