[py-lmdb] Re: Multiple environments to avoid write lock?

  • From: Jonatan Heyman <jonatan@xxxxxxxxxxx>
  • To: David Wilson <dw@xxxxxxxx>
  • Date: Mon, 9 Feb 2015 12:06:13 +0100

Hi again!

I've made a small change to py-lmdb that allows me to specify the
MDB_INTEGERKEY, MDB_INTEGERDUP and MDB_DUPFIXED flags. The change doesn't
include any other modifications (type checking etc.), but it allowed me to
test out my service with these flags enabled. (Though I'm not entirely sure
that those changes are enough for any possible optimizations to "take
effect".)

My py-lmdb changes can be seen here:
https://github.com/heyman/py-lmdb/compare/dw:master...integer-flags
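
In case it's useful, this is roughly how I open the database with the patched
binding (the integerkey/integerdup/dupfixed keyword names and the db name are
just what I use in my fork, so treat them as assumptions):

    import lmdb

    env = lmdb.open("/path/to/objects.lmdb", max_dbs=1, metasync=False)

    # Keys and duplicate values are fixed-size native integers, so all three
    # flags apply (INTEGERDUP and DUPFIXED also require dupsort=True).
    objects_db = env.open_db(b"objects", dupsort=True, integerkey=True,
                             integerdup=True, dupfixed=True)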

In some quick tests, read & write speeds were mostly unaffected, but the file
size decreased by almost 30%, which is really nice. If you'd implement "real"
support for the INTEGERKEY, INTEGERDUP and DUPFIXED flags, that would be
pretty awesome :).

Best,
Jonatan



On Fri, Feb 6, 2015 at 10:38 AM, Jonatan Heyman <jonatan@xxxxxxxxxxx> wrote:

> Hi!
>
> Thanks again for valuable info!
>
> Currently I'm doing the read operations in multiple transactions rather
> than in a single transaction. Would that be better or worse
> performance-wise? (I'm okay with the possibility of some inconsistency
> between the transactions.) Here's an excerpt of the code, with a
> single-transaction sketch after it for comparison:
>
>     def _get_similar_objects(self, object_id):
>         common_users = {}
>         user_ids = self._get_users_for_object(object_id)
>         for user_id in user_ids:
>             for other_object_id in self._get_objects_for_user(user_id):
>                 if other_object_id == object_id:
>                     continue
>                 if other_object_id not in common_users:
>                     common_users[other_object_id] = 0
>                 common_users[other_object_id] += 1
>         return common_users
>
>     def _get_users_for_object(self, object_id):
>         with self.object_begin() as trans:
>             cursor = trans.cursor()
>             if not cursor.set_key(object_id):
>                 return []
>             return list(cursor.iternext_dup())
>
>     def _get_objects_for_user(self, user_id):
>         with self.user_begin() as trans:
>             cursor = trans.cursor()
>             if not cursor.set_key(user_id):
>                 return []
>             return list(cursor.iternext_dup())
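>
> For comparison, a single-transaction version would look roughly like this.
> It's only a sketch: it assumes both data sets live as named databases in one
> shared environment (self.env, self.objects_db and self.users_db are made-up
> names), since a single read transaction can't span separate environments.
>
>     def _get_similar_objects_single_txn(self, object_id):
>         # Both lookups share one read transaction, so the whole
>         # computation sees a single consistent snapshot.
>         common_users = {}
>         with self.env.begin() as trans:
>             obj_cursor = trans.cursor(db=self.objects_db)
>             user_cursor = trans.cursor(db=self.users_db)
>             if not obj_cursor.set_key(object_id):
>                 return common_users
>             for user_id in list(obj_cursor.iternext_dup()):
>                 if not user_cursor.set_key(user_id):
>                     continue
>                 for other_object_id in user_cursor.iternext_dup():
>                     if other_object_id == object_id:
>                         continue
>                     common_users[other_object_id] = common_users.get(other_object_id, 0) + 1
>         return common_users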
>
>
> > If you require consistency but can live without durability, the correct
> > option would be Environment(metasync=False). This halves the number of
> > disk flushes required for a writer without sacrificing crash safety,
> > which sync=False does sacrifice.
>
> That's what I'm currently using. However, the documentation on the sync
> argument to the Environment class (
> https://lmdb.readthedocs.org/en/release/#environment-class) gave me the
> impression that one could get crash safety even with sync=False in some
> cases, depending on the filesystem that is used. Is that incorrect?
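>
> For reference, this is roughly how I open the environments right now (just a
> sketch; the path is made up):
>
>     import lmdb
>
>     # metasync=False skips the extra flush of the meta page per commit but
>     # still flushes the data pages, so a crash can't corrupt the database
>     # (though the very last commit may be lost).
>     env = lmdb.open("/data/objects.lmdb", metasync=False)
>
>     # sync=False would skip the data-page flush too; as I read the docs,
>     # whether that risks corruption or "only" lost transactions then depends
>     # on writemap and on the filesystem preserving write order.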
>
>
>
>> Whoops, it's worth noting these numbers were based on dirtybench.py
>> output, which by default uses a DB of 1.87 million size-13 string keys.
>> I forgot that MDB_INTEGERDUP uses an optimized comparison function
>> (read: likely much faster random IO) in addition to being a storage
>> optimization, so it's entirely possible there is value in implementing
>> it in the binding.
>
>
> MDB_INTEGERDUP sounds very interesting, and if you'd implement support for
> it in the bindings, that'd be fantastic! I'm currently using Python's
> struct.pack() and struct.unpack() for serializing integers to strings. In my
> initial tests the serialization doesn't seem to add much overhead, but if
> MDB_INTEGERDUP enables other optimizations in LMDB that lead to faster IO,
> that would be a nice bonus.
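>
> For completeness, the serialization is just fixed-width struct calls along
> these lines (the '=Q' width is my choice here; with MDB_INTEGERKEY /
> MDB_INTEGERDUP the width has to match the native integer size LMDB expects):
>
>     import struct
>
>     def pack_id(n):
>         # 8-byte unsigned integer in native byte order.
>         return struct.pack('=Q', n)
>
>     def unpack_id(b):
>         return struct.unpack('=Q', b)[0]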
>
> Hopefully I'll be able to test my code in production some time in the
> upcoming week!
>
> Best,
> Jonatan
>
>
>
