[py-lmdb] Re: Multiple environments to avoid write lock?

  • From: David Wilson <dw@xxxxxxxx>
  • To: Jonatan Heyman <jonatan@xxxxxxxxxxx>
  • Date: Mon, 9 Feb 2015 13:58:29 +0000

Whoa,

Thanks for measuring this! The 30% is definitely worthwhile, your change
actually looks reasonable, and this approach might actually be fine.

The way I imagined it implemented was all the get/put/iter functions
accepting and returning ints, which would avoid the serialization
happening in Python, but your idea may be better.


David

On Mon, Feb 09, 2015 at 12:06:13PM +0100, Jonatan Heyman wrote:
> Hi again!
> 
> I've done a small change to py-lmdb that allows me to to specify the
> MDB_INTEGERKEY, MDB_INTEGERDUP and MDB_DUPFIXED flags. The change doesn't
> include any other modifications like type checking etc. but it allowed me to
> test out my service with these flags enabled. (Though I'm not entirely sure
> that those changes are enough for any possible optimizations to "take 
> effect").
> 
> My py-lmdb changes can be seen here: 
> https://github.com/heyman/py-lmdb/compare/
> dw:master...integer-flags
> 
> When running some tests, read & write speeds were mostly unaffected, but I got
> the file size to decrease by almost 30% which is really nice. If you'd
> implement "real" support for the INTEGERKEY, INTEGERDUP and DUPFIXED flags,
> that would be pretty awesome :).
> 
> Best,
> Jonatan
> 
> 
> 
> On Fri, Feb 6, 2015 at 10:38 AM, Jonatan Heyman <jonatan@xxxxxxxxxxx> wrote:
> 
>     Hi!
> 
>     Thanks again for valuable info!
> 
>     Currently I'm doing the read operations in multiple transactions rather
>     than in a single transaction. Would that be better or worse performance
>     wise? (I'm okay with the possibility of getting some inconsistency between
>     the transactions). Here's an excerpt of the code:
> 
>         def _get_similar_objects(self, object_id):
>             common_users = {}
>             user_ids = self._get_users_for_object(object_id)
>             for user_id in user_ids:
>                 for other_object_id in self._get_objects_for_user(user_id):
>                     if other_object_id == object_id:
>                         continue
>                     if not other_object_id in common_users:
>                         common_users[other_object_id] = 0
>                     common_users[other_object_id] += 1
> 
>         def _get_users_for_object(self, object_id):
>             with self.object_begin() as trans:
>                 cursor = trans.cursor()
>                 cursor.set_key(object_id)
>                 return [uid for uid in cursor.iternext_dup()]
>         
>         def _get_objects_for_user(self, user_id):
>             with self.user_begin() as trans:
>                 cursor = trans.cursor()
>                 cursor.set_key(user_id)
>                 return [oid for oid in cursor.iternext_dup()]
>    
> 
>     > If you require consistency but can live without durability, the correct
>     > option would be Environment(metasync=False).. this halves the number of
>     > disk flushes required for a writer without sacrificing crash safety,
>     > which sync=False does sacrifice.
>    
>     That's what I'm currently using. However, the documentation on the sync
>     argument to the Environment class 
> (https://lmdb.readthedocs.org/en/release/
>     #environment-class) gave me the impression that one could get crash safety
>     even with sync=False in some cases, depending on the filesystem that is
>     used. Is that incorrect?
>    
>      
> 
>         Whoops, it's worth noting these numbers were based on dirtybench.py
>         output, which by default uses a DB of 1.87 million size-13 string 
> keys.
>         I forgot that MDB_INTEGERDUP uses an optimized comparison function
>         (read: likely much faster random IO) in addition to being a storage
>         optimization, so it's entirely possible there is value in implementing
>         it in the binding.
> 
> 
>     MDB_INTEGERDUP sounds very interesting, and if you'd implement support for
>     it in the bindings it'd be fantastic! I'm currently using python's
>     struct.pack() and struct.unpack() for serializing integers to strings. In
>     my initial tests the serializing doesn't seem to be much of an overhead,
>     but if LMDB contains other optimizations that might lead to faster IO with
>     MDB_INTEGERDUP, that sounds very interesting.
>    
>     Hopefully I'll be able to test my code in production some time in the
>     upcoming week!
>    
>     Best,
>     Jonatan
> 
>      
> 
> 

Other related posts: