[py-lmdb] Re: Multiple environments to avoid write lock?

  • From: David Wilson <dw@xxxxxxxx>
  • To: Jonatan Heyman <jonatan@xxxxxxxxxxx>
  • Date: Thu, 5 Feb 2015 20:13:13 +0000

Hi Jonatan,

So, breaking this down, basically your read side looks like:

    * 1 random read on DB1 (16 bytes)
    * Provides average 120 random reads on DB2 (1920 bytes)
    * Which in turn result in 15 sequential reads (240 bytes)

So you're doing about 32KB of data (tiny) in 121 random reads. Depending
on your DB size, this should be good for somewhere north of 4,000 read
requests/sec/core on a 2.4GHz Xeon, give or take a factor of 2.

I'm not sure how well MDB_DUPSORT databases perform, since I haven't
played much with them, however since DUPSORT data is inherently
physically localized, I expect if you see a limitation with LMDB
throughput here, it'll be right around the time you're saturating the
host machine's network interface anyway (assuming GigE at 125MB/sec... at
which point you need to change strategy entirely!).

I imagine there isn't a huge hit writing some simple loops in Python to
consume the cursor output, e.g.:

    def read_request(object_id):
        all_object_ids = set()
        with    env.begin() as txn, \
                txn.cursor(db=db1) as c1, \
                txn.cursor(db=db2) as c2:
            assert c1.set_key(object_id), "Missing object!"
            for user_id in c1.iternext_dup():
                assert c2.set_key(user_id), "Missing user!"
                all_object_ids.update(c2.iternext_dup())
        return all_object_ids

Generally if there is slowness, it will be your Python algorithm, and
not LMDB causing it :)  So try to avoid any smartness, especially within
the write txns.

Your writer situation sounds just fine -- it sounds like the average
transaction is generating very localized mutations, and the total data
size is pretty tiny -- roughly somewhere under 50kb or so.

In fact I think the write load is so simple that you might not even need
to bother with the task queue to begin with, assuming you're not
generating a huge write rate.

Note that LMDB has native modes for exactly your use case --
MDB_DUPFIXED and MDB_INTEGERDUP, neither of which are exposed in the
binding. I'd be very happy to implement these if there is a need, though
it's quite a lot of code churn. If profiling reveals your Python code is
producing too much overhead in int->string conversions, this might be a
change worth making.
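Meanwhile the int<->bytes conversion itself is cheap; a sketch using struct, where big-endian packing keeps LMDB's default byte-wise key comparison identical to numeric order:

```python
import struct

# '>Q' packs an unsigned 64-bit int big-endian, so lexicographic byte
# comparison (LMDB's default comparator) matches numeric ordering.
def pack_u64(n):
    return struct.pack('>Q', n)

def unpack_u64(b):
    return struct.unpack('>Q', b)[0]

assert unpack_u64(pack_u64(120)) == 120
assert pack_u64(2) < pack_u64(10) < pack_u64(300)  # byte order == numeric order
```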

If you require consistency but can live without durability, the correct
option would be Environment(metasync=False): this halves the number of
disk flushes required for a writer without sacrificing crash safety,
which sync=False does sacrifice.

Lots of guesswork here -- the only real cure is profiling! Have fun :)


On Wed, Feb 04, 2015 at 12:32:50PM +0100, Jonatan Heyman wrote:
> Hi David!
> Thanks a lot for your input! I realise that I should have included more info 
> in
> my original e-mail.
> The app is a service that will back a web app. Reads will be performed on page
> requests, but (almost) all writes will be performed in background jobs. My app
> uses two main databases that will have the most read/writes. Both use int64s
> as keys, and a list of int64s as values (using dupsort=True). 
> The first DB maps object IDs to a list of user IDs which are associated with
> that object. The user ID list ranges in length between zero and a couple of
> thousands, and averages at about 120. The total number of unique keys in this
> DB is currently a little over 3 million (so around 360 million user ID
> values), but is expected to grow 20x in the upcoming months.
> The second DB maps user IDs to the list of object IDs that they are associated
> with. The total number of unique keys in this DB is currently about 70 
> million.
> The average number of values per key is about 5.
> The typical write scenario is where my service receives an object ID and a 
> list
> of user IDs for that object. It will then construct one list of (object_id,
> user_id) tuples, and one list of (user_id, object_id) tuples, and then use
> putmulti() to write this data to the two databases. The service will also do
> this for existing object IDs, in which case it will also remove user IDs and
> object IDs (that were previously present, but no longer exist in the user ID
> list that was passed in) from the values in both databases.
> In the typical read scenario the service will read all user IDs for an object
> ID, and then it will retrieve the list of object IDs for each of these user
> IDs. 
> The reads and writes will be pretty random, and there will be many more reads
> than writes.
> I'm not sure if I'll need MDB_SYNC. To lose the last minute of data in case 
> of
> a crash would be acceptable, but getting a corrupt database would not. The
> documentation (https://lmdb.readthedocs.org/en/release/#environment-class)
> seems to suggest that using sync=False would be okay as long as I also use
> writemap=False, and use a filesystem that preserves write order. Which
> filesystems preserve write order? How about ext4?
> To me, it seems like the only drawback I would get from using different
> environments would be that I might get the two DBs in an inconsistent state in
> case of a crash. Also I might get inconsistent backup dumps, which I could
> probably live with if it would greatly increase my write performance.
> In an ideal world I would be able to write extensive load tests that I could
> use for profiling different options, but as the only developer on a project
> with scaling pains, I'm aiming at "profiling it in production (TM)" (running
> side by side with the current system).
> Again, thanks a lot for your input. I've tried to provide as much info as
> possible about the system, and if you have any additional thoughts, I'd
> appreciate it greatly!
> Best,
> Jonatan
> On Tue, Feb 3, 2015 at 9:03 PM, David Wilson <dw@xxxxxxxx> wrote:
>     Hi Jonatan!
>     There's not quite enough info to work with here, and probably it's
>     better if you just profile how your app behaves first. For example,
>     * if you're writing all those keys from a sequential range, then
>       that will perform much better than completely random keys. If
>       your write txns are fast enough, then probably it's not worth the
>       effort of splitting your data into separate environments. Often
>       sequential write is faster by a factor of 2 or more.
>     * Are your keys/values huge or tiny? To get an intuitive feel for
>       performance, it's sometimes helpful to think of txn size in terms of
>       total MB read/written.. this is especially true when the entire DB is
>       cached or on very fast media. Writing 100,000 8 byte key + 8 byte
>       values (1.52MiB) is going to be much faster than say, 512 byte values
>       (49.59MiB).
>     * if you're overwriting existing keys during the write txn, then that
>       changes things a lot too, since DB growth will be bounded.
>     * Similarly, the presence/absence of MDB_SYNC (if suitable for your app)
>       drastically changes write performance characteristics.
>     * How many concurrent writers do you need? Is it a web app? Another
>       option is to push your large writer off to a task queue.
>     * Perhaps tweak examples/dirtybench.py from the Git repository to create
>       keys/values roughly of your size. This will produce some nice example
>       numbers for you to work from for your hardware/disk combination.
>     Drop me a line if you need some more ideas, but probably it's best for
>     you to just write some throwaway code to profile your specific use case.
>     David
>     On Tue, Feb 03, 2015 at 08:50:43AM -0800, Jonatan Heyman wrote:
>     > Hi!
>     >
>     > I'm new to LMDB and python-lmdb, but so far I find it really promising
>     and well
>     > documented.
>     >
>     > I'm writing an app where I plan to use LMDB to store records in three
>     different
>     > named databases. One action in my app might result in hundreds or
>     thousands of
>     > read/writes to these databases. If I've understood LMDB correctly, there
>     is a
>     > global write lock for each Environment. Would it then make sense for me
>     to
>     > store my different databases in three different Environments, in order 
> to
>     > decrease the time spent waiting for write locks?
>     >
>     > Best,
>     > Jonatan
