Whoa, thanks for measuring this! The 30% is definitely worthwhile, your
change looks reasonable, and this approach might actually be fine. The
way I imagined it implemented was all the get/put/iter functions
accepting and returning ints, which would avoid the serialization
happening in Python, but your idea may be better.
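Something like this, very roughly (an untested sketch; the integerkey
argument to open_db and the int-accepting get/put are hypothetical,
nothing the binding supports today):

    import lmdb

    env = lmdb.open('/tmp/demo', max_dbs=1)
    # Hypothetical flag: the DB handle would remember integerkey=True,
    # so get/put/cursors pack and unpack native ints in C, not Python.
    db = env.open_db(b'counts', integerkey=True)

    with env.begin(db=db, write=True) as txn:
        txn.put(1234, b'some value')           # int key, packed in C
        assert txn.get(1234) == b'some value'  # ints accepted on reads too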
David

On Mon, Feb 09, 2015 at 12:06:13PM +0100, Jonatan Heyman wrote:
> Hi again!
>
> I've done a small change to py-lmdb that allows me to specify the
> MDB_INTEGERKEY, MDB_INTEGERDUP and MDB_DUPFIXED flags. The change
> doesn't include any other modifications like type checking etc., but it
> allowed me to test out my service with these flags enabled. (Though I'm
> not entirely sure that those changes are enough for any possible
> optimizations to "take effect".)
>
> My py-lmdb changes can be seen here:
> https://github.com/heyman/py-lmdb/compare/dw:master...integer-flags
>
> When running some tests, read & write speeds were mostly unaffected,
> but I got the file size to decrease by almost 30%, which is really
> nice. If you'd implement "real" support for the INTEGERKEY, INTEGERDUP
> and DUPFIXED flags, that would be pretty awesome :).
>
> Best,
> Jonatan
>
> On Fri, Feb 6, 2015 at 10:38 AM, Jonatan Heyman <jonatan@xxxxxxxxxxx> wrote:
>
> Hi!
>
> Thanks again for valuable info!
>
> Currently I'm doing the read operations in multiple transactions rather
> than in a single transaction. Would that be better or worse
> performance-wise? (I'm okay with the possibility of getting some
> inconsistency between the transactions.) Here's an excerpt of the code:
>
>     def _get_similar_objects(self, object_id):
>         common_users = {}
>         user_ids = self._get_users_for_object(object_id)
>         for user_id in user_ids:
>             for other_object_id in self._get_objects_for_user(user_id):
>                 if other_object_id == object_id:
>                     continue
>                 if other_object_id not in common_users:
>                     common_users[other_object_id] = 0
>                 common_users[other_object_id] += 1
>         return common_users
>
>     def _get_users_for_object(self, object_id):
>         with self.object_begin() as trans:
>             cursor = trans.cursor()
>             cursor.set_key(object_id)
>             return [uid for uid in cursor.iternext_dup()]
>
>     def _get_objects_for_user(self, user_id):
>         with self.user_begin() as trans:
>             cursor = trans.cursor()
>             cursor.set_key(user_id)
>             return [oid for oid in cursor.iternext_dup()]
>
> > If you require consistency but can live without durability, the
> > correct option would be Environment(metasync=False). This halves the
> > number of disk flushes required for a writer without sacrificing
> > crash safety, which sync=False does sacrifice.
>
> That's what I'm currently using. However, the documentation on the sync
> argument to the Environment class
> (https://lmdb.readthedocs.org/en/release/#environment-class) gave me
> the impression that one could get crash safety even with sync=False in
> some cases, depending on the filesystem that is used. Is that
> incorrect?
>
> > Whoops, it's worth noting these numbers were based on dirtybench.py
> > output, which by default uses a DB of 1.87 million size-13 string
> > keys. I forgot that MDB_INTEGERDUP uses an optimized comparison
> > function (read: likely much faster random IO) in addition to being a
> > storage optimization, so it's entirely possible there is value in
> > implementing it in the binding.
>
> MDB_INTEGERDUP sounds very interesting, and if you'd implement support
> for it in the bindings it'd be fantastic! I'm currently using Python's
> struct.pack() and struct.unpack() for serializing integers to strings.
> In my initial tests the serializing doesn't seem to be much of an
> overhead, but if LMDB contains other optimizations that might lead to
> faster IO with MDB_INTEGERDUP, that sounds very interesting.
>
> Hopefully I'll be able to test my code in production some time in the
> upcoming week!
>
> Best,
> Jonatan
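P.S. On the struct.pack() point above: MDB_INTEGERKEY and MDB_INTEGERDUP
expect keys as binary integers in native byte order, sized as unsigned
int or size_t, so any Python-side packing has to match that. A minimal
sketch of compatible helpers (untested, and only my guess at the format
in use):

    import struct

    # '@I' = native byte order, native-size unsigned int, which is the
    # layout MDB_INTEGERKEY expects; every key must be the same width.
    def int_to_bytes(i):
        return struct.pack('@I', i)

    def bytes_to_int(b):
        return struct.unpack('@I', b)[0]

    assert bytes_to_int(int_to_bytes(42)) == 42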