[tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space

From: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
To: Konstantin Osipov <kostja@xxxxxxxxxxxxx>
Date: Wed, 17 Oct 2018 11:20:58 +0300

On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote:

* Vladimir Davydov <vdavydov.dev@xxxxxxxxx> [18/10/08 13:52]:

If a replica permanently stops working for some reason, it will pin WAL
files it would need to resume until it is deleted from the _cluster
system space or the master is restarted. This happens in production when
an admin drops a replica and forgets to remove it from the master, and
this is quite annoying, because it may result in ENOSPC errors on the
master.

I started benching this patch to check whether fallocate() introduces a
performance regression and discovered that there is a general 45% regression
between 1.6 and 1.10.

I hope finally once I have pointed it out and A.Lyapunov has
pointed it out, it will be addressed.

In any case we need to measure fallocate() impact very carefully
before adding it. It seems we make things unnecessarily
complicated all in order to spare the user from spurious ENOSPC,
In my opinion it's a non-goal.

If we get rid of this requirement, we don't need fallocate(), and
the patch could be made simpler in a couple more dimensions.

This requirement is a must IMO. What's the point of returning ENOSPC and
alerting the user if we can avoid that? I can foresee users complaining
about it and opening issues, like "spurious ENOSPC when there's enough
disk space" (because there will be enough disk space once gc has removed
stale replicas).

Please consider making a trivial patch which follows the steps of
the patch by @belyak

It's not trivial and it's ugly. I wonder why you fail to see that.
Ping-ponging messages from WAL to TX in order to remove files?
Introducing yet another pipe for that. What for? WAL thread already has
all the necessary information about xlog files. It just needs to be told
what's the oldest WAL row it has to preserve in any case.

Moreover, shooting off consumers before deleting WAL files is
semantically incorrect, because the garbage collector knows nothing
about WAL files. For GC there's a continuous range of WAL rows it
tracks. Dividing those rows in files is a business of the WAL thread.
So how's it going to work if we put the TX thread responsible for
triggering WAL file deletion? WAL sends ENOSPC signal to TX. TX shoots
off a consumer. WAL retries, ENOSPC again, because no file was deleted!
Sends ENOSPC to TX again and so forth. Do you really want this?!

The design proposed in this patch is simple and clear. When invoked, GC
lets WAL know about rows that can be pruned right now and rows that can
be pruned in case of emergency. When hitting ENOSPC, WAL deletes old
WALs on its own basing on this information and notifies TX via the
existing notification subsystem (wal_watcher) so that the latter can
shoot off replicas that would need those files.

Regarding usage of falloate. I could implement this patch without it,
but it would be a bit more difficult, because there wouldn't be a clear
point of ENOSPC failure. Besides, what would happen if we wrote half of
a transaction to disk? How replication would work then? BTW, triggering
WAL deletion on behalf of TX suffers from the very same problem. That is
we are risking not only returning ENOSPC to the user, but also breaking
replication in a peculiar way.

Anyway, you seem to be unaware of the fact that one of fallocate use
cases is speeding writes by reducing the number of file size updates
(which require a write to the inode table). I wrote a simple test that
demonstrates that, see below.

vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test
vlad@esperanza test$ ./fallocate_test
Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size>
filename - test file
write_count - number of writes (append)
write_size - write(2) size
alloc_size - fallocate(2) size
returns time in seconds
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.548161
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.505698
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.195223
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.137996

I have stock ext4 over hdd on my laptop, configured by Debian, no
tuning. The code is right below. That is for an append-only workload
similar to WAL fallocate yields ~25% gain.

True, fallocate might need some tuning (how much to allocate for
different write sizes), but it's something we definitely want to have on
board.

vlad@esperanza test$ cat fallocate_test.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <time.h>

double gettime(void)
{
        struct timespec ts;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
                perror("clock_monotonic");
                exit(EXIT_FAILURE);
        }
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        if (argc != 5) {
                fprintf(stderr, "Usage: %s <filename> <write_count> "
                        "<write_size> <alloc_size>\n"
                        "filename - test file\n"
                        "write_count - number of writes (append)\n"
                        "write_size - write(2) size\n"
                        "alloc_size - fallocate(2) size\n"
                        "returns time in seconds\n",
                        argv[0]);
                return -1;
        }

        const char *filename = argv[1];
        int write_count = atoi(argv[2]);
        int write_size = atoi(argv[3]);
        int alloc_size = atoi(argv[4]);

        char *buf = malloc(write_size);
        if (buf == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
        }
        memset(buf, 1, write_size);

        int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
        if (fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        double t1 = gettime();

        off_t offset = 0;
        int prealloced = 0;
        for (int i = 0; i < write_count; i++) {
                if (alloc_size > 0 && prealloced < write_size) {
                        errno = posix_fallocate(fd, offset, alloc_size);
                        if (errno != 0) {
                                perror("posix_fallocate");
                                exit(EXIT_FAILURE);
                        }
                        prealloced += alloc_size;
                }
                ssize_t written = write(fd, buf, write_size);
                if (written < 0) {
                        perror("write");
                        exit(EXIT_FAILURE);
                }
                offset += written;
                prealloced -= written;
                if (prealloced < 0)
                        prealloced = 0;
        }

        double t2 = gettime();

        close(fd);
        unlink(filename);
        free(buf);

        printf("%f\n", __func__, t2 - t1);
        return 0;
}

Follow-Ups:
- [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  - From: Vladimir Davydov

References:
- [tarantool-patches] [PATCH 0/5] Delete old WAL files if running out of disk space
  - From: Vladimir Davydov
- [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  - From: Konstantin Osipov

[tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space

Other related posts: