[py-lmdb] Re: Multiple environments to avoid write lock?

  • From: Jonatan Heyman <jonatan@xxxxxxxxxxx>
  • To: David Wilson <dw@xxxxxxxx>
  • Date: Fri, 6 Feb 2015 10:38:23 +0100

Hi!

Thanks again for the valuable info!

Currently I'm doing the read operations in multiple transactions rather
than in a single transaction. Would that be better or worse
performance-wise? (I'm okay with the possibility of some inconsistency
between the transactions.) Here's an excerpt of the code:

    def _get_similar_objects(self, object_id):
        # Count how many users each other object has in common with object_id
        common_users = {}
        user_ids = self._get_users_for_object(object_id)
        for user_id in user_ids:
            for other_object_id in self._get_objects_for_user(user_id):
                if other_object_id == object_id:
                    continue
                if other_object_id not in common_users:
                    common_users[other_object_id] = 0
                common_users[other_object_id] += 1
        return common_users

    def _get_users_for_object(self, object_id):
        with self.object_begin() as trans:
            cursor = trans.cursor()
            if not cursor.set_key(object_id):
                return []
            return list(cursor.iternext_dup())

    def _get_objects_for_user(self, user_id):
        with self.user_begin() as trans:
            cursor = trans.cursor()
            if not cursor.set_key(user_id):
                return []
            return list(cursor.iternext_dup())

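For comparison, the single-transaction variant I have in mind would look
roughly like this; it assumes both dupsort DBs are named databases inside
one environment (self.env, self.objects_db and self.users_db are
hypothetical names, not how my code is actually laid out):

    def _get_similar_objects_single_txn(self, object_id):
        # Hypothetical sketch: one read transaction spanning both
        # named databases, so every read sees the same snapshot.
        common_users = {}
        with self.env.begin() as trans:
            obj_cursor = trans.cursor(self.objects_db)
            user_cursor = trans.cursor(self.users_db)
            if not obj_cursor.set_key(object_id):
                return common_users
            for user_id in obj_cursor.iternext_dup():
                if not user_cursor.set_key(user_id):
                    continue
                for other_object_id in user_cursor.iternext_dup():
                    if other_object_id == object_id:
                        continue
                    count = common_users.get(other_object_id, 0)
                    common_users[other_object_id] = count + 1
        return common_users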

> If you require consistency but can live without durability, the correct
> option would be Environment(metasync=False).. this halves the number of
> disk flushes required for a writer without sacrificing crash safety,
> which sync=False does sacrifice.

That's what I'm currently using. However, the documentation on the sync
argument to the Environment class (
https://lmdb.readthedocs.org/en/release/#environment-class) gave me the
impression that one could get crash safety even with sync=False in some
cases, depending on the filesystem that is used. Is that incorrect?
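
For reference, this is roughly how I'm constructing the environments at
the moment (the path and map_size below are placeholders, not my real
settings):

    import lmdb

    # metasync=False: data pages are still flushed on commit, but the
    # separate meta page flush is skipped (placeholder path/map_size).
    object_env = lmdb.Environment('/data/objects.lmdb',
                                  map_size=2 ** 30,
                                  metasync=False)

    # The sync=False variant I was asking about would instead be:
    #   lmdb.Environment('/data/objects.lmdb', map_size=2 ** 30, sync=False)
    # which leaves flushing entirely to the operating system.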



> Whoops, it's worth noting these numbers were based on dirtybench.py
> output, which by default uses a DB of 1.87 million size-13 string keys.
> I forgot that MDB_INTEGERDUP uses an optimized comparison function
> (read: likely much faster random IO) in addition to being a storage
> optimization, so it's entirely possible there is value in implementing
> it in the binding.


MDB_INTEGERDUP sounds very interesting, and it would be fantastic if you
implemented support for it in the binding! I'm currently using Python's
struct.pack() and struct.unpack() for serializing integers to strings. In
my initial tests the serialization doesn't seem to add much overhead, but
if MDB_INTEGERDUP unlocks other optimizations in LMDB that lead to faster
IO, that would be well worth it.
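
Concretely, the serialization is just something along these lines (the
helper names and the 8-byte big-endian format code are illustrative
rather than a verbatim copy of my code):

    import struct

    def pack_id(value):
        # Fixed-width 8-byte big-endian, so the packed byte strings sort
        # in the same order as the integers (illustrative format code).
        return struct.pack('>Q', value)

    def unpack_id(data):
        return struct.unpack('>Q', data)[0]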

Hopefully I'll be able to test my code in production some time in the
upcoming week!

Best,
Jonatan
