[liblouis-liblouisxml] Re: Python bindings and output buffer size for lou_translate*

  • From: Michael Whapples <mwhapples@xxxxxxx>
  • To: liblouis-liblouisxml@xxxxxxxxxxxxx
  • Date: Tue, 27 Jul 2010 11:32:36 +0100

Hello,
First, I have seen the later messages in this thread and agree that it would not be natural for Python programs to have to specify buffer sizes. In any case, having a Python program specify the buffer sizes doesn't really solve the problem; it only moves it into every Python program that uses the bindings.

So this leaves the task of working out what the ratio should be. I don't like the suggestion of trying one ratio and, if translation fails, retrying with a larger ratio until it succeeds. Is there a situation where translation may fail for another reason, and how would such a scheme catch that? OK, I guess we could set an upper limit on the ratio, so that if the limit is reached the bindings decide the translation is failing for some other reason. I agree this adds complexity to the code and slows things down, so it should be avoided if possible.
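
To make the retry-with-a-ceiling idea concrete, here is a rough sketch only. Nothing in it is the real binding code: `fake_translate` is a made-up stand-in for the C call that reports how many input characters it consumed, and `start_ratio` and `max_ratio` are hypothetical names.

```python
def fake_translate(text, outlen):
    """Hypothetical stand-in for the C translation call.

    ASCII letters, digits and spaces pass through unchanged; anything else
    becomes an 8-character escape such as '\\x20ac'. Returns (output,
    chars_consumed) and stops early when the output buffer is full.
    """
    pieces = []
    used = 0
    consumed = 0
    for ch in text:
        if ord(ch) < 128 and (ch.isalnum() or ch == " "):
            piece = ch
        else:
            piece = "'\\x%04x'" % ord(ch)
        if used + len(piece) > outlen:
            break  # out of room: report a partial translation
        pieces.append(piece)
        used += len(piece)
        consumed += 1
    return "".join(pieces), consumed


def translate_with_retry(translate, text, start_ratio=2, max_ratio=16):
    """Retry with a doubled ratio until all input is consumed.

    Gives up once the ratio would exceed max_ratio, on the assumption that
    the translation is then failing for some other reason.
    """
    ratio = start_ratio
    while ratio <= max_ratio:
        output, consumed = translate(text, ratio * len(text))
        if consumed == len(text):
            return output
        ratio *= 2  # the whole string is re-translated on the next pass
    raise RuntimeError("translation still incomplete at ratio %d" % max_ratio)
```

Note that every retry translates the whole string again, which is where the extra cost comes from.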

Setting the ratio to 8 seems a bit drastic, and it would need to be higher still if using 32-bit Unicode. Most of the time I doubt you would get anywhere near that sort of ratio. I get the feeling that the ratio actually needed depends on what sort of translation is being done (i.e. you are much more likely to need 8 times if translating only a character or two, but you will probably be fine with 2 or 4 for longer strings of text). So maybe the answer is to set the default ratio at a level which should be fine for over 90% of uses, but make the value configurable so that the few who need something different can set it appropriately (e.g. an application doing lots of small translations may have the line
louis.bufferRatio = 8
). My assumption here is that a long translation is unlikely to have all of its characters missing from the table, whereas a short one is more likely to, since one character is a higher percentage of the translation.
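
As a rough illustration of the configurable-ratio idea (a sketch only: `bufferRatio` does not exist in the bindings today, and `config` and `output_length` are names made up for this example), the bindings could read the ratio at call time rather than hard-coding it:

```python
from types import SimpleNamespace

# Hypothetical setting; in the real bindings this would be an attribute of
# the louis module, as in the line above. The default of 2 is the current
# assumption in the bindings.
config = SimpleNamespace(bufferRatio=2)


def output_length(text):
    """Size of the output buffer to allocate for `text`.

    The ratio is looked up on every call, so an application can change it
    at any time before translating.
    """
    return config.bufferRatio * len(text)
```

An application doing many one- or two-character translations would set `config.bufferRatio = 8` once at start-up, and everyone else would keep the default.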

Michael Whapples

On 27/07/10 03:48, James Teh wrote:
Hi all,

For lou_translate* in the Python bindings, we've made an assumption that outlen should be 2 * inlen. However, this assumption is very wrong if there are characters in the input which aren't defined in the specified tables. In the case of undefined characters, the output is "'\xnnnn'" for 16 bit unicode characters, which means that 1 input char becomes 8 chars in the output. Assuming that no one does anything ridiculous in tables, this means that an outlen which is 8 * inlen should cover the worst case scenario. I'd like to change the Python bindings to do this and suggest that perhaps the documentation should be updated with a similar guideline.

Note that this does not cover 32 bit unicode characters. I guess it's possible that the bindings might be used on a 32 bit system. In this case, the worst case scenario will be outlen = 12 * inlen.
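
The two worst-case figures above can be checked with a little arithmetic. This is only a sketch: `escape_length` is a made-up helper, and it assumes the 32-bit escape has the same quoted form as the 16-bit one, with one hex digit per 4 bits.

```python
def escape_length(bits):
    """Length of the escape emitted for one undefined character.

    The escape is: opening quote, backslash, 'x', the hex digits, closing
    quote. The form for 32-bit characters is an assumption here.
    """
    hex_digits = bits // 4
    return len("'\\x" + "n" * hex_digits + "'")


# Worst-case output length per input character.
worst_case_16 = escape_length(16)
worst_case_32 = escape_length(32)
```

With these assumptions `worst_case_16` is 8 and `worst_case_32` is 12, matching the outlen = 8 * inlen and outlen = 12 * inlen figures.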

An alternative is to keep checking whether translation wasn't completed (i.e. inlen is less than its original value) and then increase outlen if so, probably multiplying outlen by 2 each time. However, although this is probably rare, it increases code complexity and is quite expensive, since you have to keep re-translating the string in its entirety until it completes.

What do people think?

Jamie


For a description of the software and to download it go to
http://www.jjb-software.com
