* Postgresql performance problems with IO latency, especially during fsync()
@ 2014-03-26 19:11 Andres Freund
  2014-03-26 21:41   ` Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Andres Freund @ 2014-03-26 19:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel, linux-kernel, lsf, Wu Fengguang, rhaas; +Cc: andres

[-- Attachment #1: Type: text/plain, Size: 3423 bytes --]

Hi,

At LSF/MM there was a slot about postgres' problems with the kernel. Our
#1 concern is frequent slow read()s that happen while another process
calls fsync(), even though we'd be perfectly fine if that fsync() took
ages.
The "conclusion" of that part was that it'd be very useful to have a
demonstration of the problem without needing a full-blown postgres
setup. I've quickly hacked something together that seems to show the
problem nicely.

For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
and the "IO Scheduling" bit in
http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de

The tool's output looks like this:
gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
...
wal[12155]: avg: 0.0 msec; max: 0.0 msec
commit[12155]: avg: 0.2 msec; max: 15.4 msec
wal[12155]: avg: 0.0 msec; max: 0.0 msec
read[12157]: avg: 0.2 msec; max: 9.4 msec
...
read[12165]: avg: 0.2 msec; max: 9.4 msec
wal[12155]: avg: 0.0 msec; max: 0.0 msec
starting fsync() of files
finished fsync() of files
read[12162]: avg: 0.6 msec; max: 2765.5 msec

So, the average read time is less than one ms (SSD, and about 50% cached
workload). But once another backend does the fsync(), read latency
skyrockets.

A concurrent iostat shows the problem pretty clearly:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20

While the fsync() is going on (or the kernel decides to start writing
out aggressively for some other reason), the amount of writes to the
disk increases by two orders of magnitude, with unsurprisingly
disastrous consequences for read() performance. We really want a way to
pace the writes issued to the disk more evenly.

The attached program can currently only be configured by changing some
constants in the code itself, but I guess that's not a problem. It
allocates two files up front and then starts testing. If the files
already exist, it will reuse them.

Possible solutions:
* Add a fadvise(UNDIRTY) that doesn't stall on a full IO queue like
  sync_file_range() does (see the sketch after this list).
* Make IO triggered by writeback regard IO priorities and add it to
  schedulers other than CFQ
* Add a tunable that allows limiting the amount of dirty memory before
  writeback on a per process basis.
* ...?
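
To illustrate the first point, here's a rough sketch (a hypothetical
helper, not something from the attached program) of how a writer can try
to pace its own writeback with sync_file_range() today; the problem is
that sync_file_range() itself can stall once the device's IO queue is
full, so the process doing the pacing gets stuck anyway:

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Hypothetical pacing helper: start write-out of the range that was just
 * dirtied, then wait for everything before it, so dirty data never piles
 * up into one huge flush at fsync() time.
 */
static int
pace_writeback(int fd, off_t start, off_t len)
{
	/* kick off asynchronous write-out of [start, start + len) */
	if (sync_file_range(fd, start, len, SYNC_FILE_RANGE_WRITE) < 0)
		return -1;

	/* wait for the previously submitted range [0, start) to complete */
	if (sync_file_range(fd, 0, start,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER) < 0)
		return -1;

	return 0;
}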

If somebody familiar with buffered IO writeback is around at LSF/MM, or
rather the collab summit, Robert and I will be around for the next few
days.

Greetings,

Andres Freund

[-- Attachment #2: ioperf.c --]
[-- Type: text/plain, Size: 6565 bytes --]

/*
 * Portions Copyright (c) 2014, PostgreSQL Global Development Group
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written agreement
 * is hereby granted, provided that the above copyright notice and this
 * paragraph appear in all copies.
 *
 * Test program roughly simulating postgres' IO.
 *
 * Parameters will need to be changed to reproduce the problem on
 * individual systems.
 *
 * Author: Andres Freund, andres@2ndquadrant.com, andres@anarazel.de
 */
#define _POSIX_C_SOURCE 200809L
#define _XOPEN_SOURCE 700

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <time.h>

/* CHANGE: number of reading processes */
#define NUM_RANDOM_READERS 16

/*
 * CHANGE: set to memory size * 2 or so.
 *
 * Remove the 'data', 'wal' files after changing.
 */
static const size_t data_size = 1024L * 1024 * 1024 * 48;

/* probably ok this way */
static const size_t wal_size = 1024L * 1024 * 1024 * 1;

/* after how many iterations should stuff get reported */
static const uint64_t read_report_interval = 10000;
static const uint64_t wal_report_interval = 1000;
static const uint64_t commit_report_interval = 500;


/* internal data */
static const char initdata[8192];
static pid_t readers[NUM_RANDOM_READERS];

struct timing
{
	uint64_t iter;
	uint64_t period_total;
	uint64_t total;
	uint64_t period_max;

	struct timespec t_before;
	struct timespec t_after;
};

static void
fatal_error(int e)
{
	fprintf(stderr, "frak me: %d: %s\n", e, strerror(e));
	_exit(1);
}

static void
nsleep(int64_t s)
{
	struct timespec d;
	d.tv_sec = 0;
	d.tv_nsec = s;
	if (nanosleep(&d, NULL) < 0)
		fatal_error(errno);
}
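
/*
 * Pick a uniformly random byte offset in [0, end); note that the result
 * is not aligned to the 8kb block size.
 */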

static off_t
random_block(size_t end)
{
	return (((double) random())/RAND_MAX) * (end - 1);
}

static int64_t
nsec_diff(const struct timespec *a, const struct timespec *b)
{
	return ((int64_t)(a->tv_sec - b->tv_sec) * 1000000000)
		+ (a->tv_nsec - b->tv_nsec);
}

static void
timing_init(struct timing *t)
{
	t->iter = 0;
	t->total = 0;
	t->period_total = 0;
	t->period_max = 0;
}

static void
timing_before_action(struct timing *t)
{
	clock_gettime(CLOCK_MONOTONIC, &t->t_before);
}


static void
timing_after_action(struct timing *t, const char *ctx, int64_t report_interval)
{
	uint64_t dur;

	clock_gettime(CLOCK_MONOTONIC, &t->t_after);

	dur = nsec_diff(&t->t_after, &t->t_before);

	t->iter++;
	t->period_total += dur;
	t->period_max = t->period_max < dur ? dur : t->period_max;

	if ((t->iter % report_interval) == 0)
	{
		fprintf(stdout, "%s[%d]: avg: %.1f msec; max: %.1f msec\n",
				ctx, getpid(),
				(double) (t->period_total / report_interval) / 1000000,
				(double) t->period_max / 1000000);
		t->total += t->period_total;
		t->period_total = 0;
		t->period_max = 0;
	}
}
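
/*
 * Simulates postgres' WAL writer: sequential 8kb writes to the 'wal'
 * file, with every fifth write being a "commit" that is flushed to disk
 * with fdatasync().
 */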

static void
do_wal_writes(void)
{
	int fd;
	off_t pos = 0;
	int64_t iter = 0;

	struct timing wal_timing;
	struct timing commit_timing;

	timing_init(&wal_timing);
	timing_init(&commit_timing);

	fd = open("wal", O_RDWR, S_IRUSR|S_IWUSR);
	if (fd < 0)
		fatal_error(errno);

	while(true)
	{
		bool is_commit = (iter++ % 5) == 0;

		if (lseek(fd, pos, SEEK_SET) < 0)
			fatal_error(errno);

		timing_before_action(&wal_timing);
		if (is_commit)
			timing_before_action(&commit_timing);

		if (write(fd, initdata, 8192) < 0)
			fatal_error(errno);

		timing_after_action(&wal_timing, "wal", wal_report_interval);

		if (is_commit)
		{

			if (fdatasync(fd) < 0)
				fatal_error(errno);
			timing_after_action(&commit_timing, "commit", commit_report_interval);
		}

		pos += 8192;

		if (pos + 8192 >= wal_size)
			pos = 0;

		nsleep(1000000);
	}
}
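
/*
 * Simulates postgres' checkpointer: 8kb writes to random offsets in the
 * 'data' file, with an fsync() every 100000 writes.
 */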

static void
do_checkpointer_writes(void)
{
	int fd;
	int64_t writes = 0;

	fd = open("data", O_RDWR, S_IRUSR|S_IWUSR);
	if (fd < 0)
		fatal_error(errno);

	while(true)
	{
		off_t pos = random_block(data_size);

		if (lseek(fd, pos, SEEK_SET) < 0)
			fatal_error(errno);

		if (write(fd, initdata, 8192) < 0)
			fatal_error(errno);

		if ((++writes % 100000) == 0)
		{
			fprintf(stdout, "starting fsync() of files\n");

			if (fsync(fd) < 0)
				fatal_error(errno);

			fprintf(stdout, "finished fsync() of files\n");
		}

		nsleep(200000);
	}
}
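
/*
 * Simulates a postgres backend: 8kb reads from random offsets in the
 * 'data' file, with per-process latency reporting.
 */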

static void
do_random_reads(void)
{
	int fd;
	struct timing timing;

	timing_init(&timing);

	fd = open("data", O_RDWR, S_IRUSR|S_IWUSR);
	if (fd < 0)
		fatal_error(errno);

	while(true)
	{
		char data[8192];
		off_t pos = random_block(data_size);

		if (lseek(fd, pos, SEEK_SET) < 0)
			fatal_error(errno);

		timing_before_action(&timing);

		if (read(fd, data, 8192) < 0)
			fatal_error(errno);

		timing_after_action(&timing, "read", read_report_interval);
	}
}
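
/*
 * Create and zero-fill the 'data' and 'wal' files unless they already
 * exist; existing files are reused as-is.
 */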

static void
initialize_files(void)
{
	int fd;
	ssize_t data_size_written = 0;
	ssize_t wal_size_written = 0;

	/* initialize data file */
	fd = open("data", O_CREAT|O_EXCL|O_RDWR, S_IRUSR|S_IWUSR);
	if (fd < 0 && errno == EEXIST)
		;
	else if (fd < 0)
		fatal_error(errno);
	else
	{
		while (data_size_written <= data_size)
		{
			ssize_t ret = write(fd, initdata, sizeof(initdata));
			if (ret == -1)
				fatal_error(errno);
			data_size_written += ret;
		}
		if (fsync(fd) < 0)
			fatal_error(errno);
		close(fd);
	}

	/* initialize wal file */
	fd = open("wal", O_CREAT|O_EXCL|O_RDWR, S_IRUSR|S_IWUSR);
	if (fd < 0 && errno == EEXIST)
		;
	else if (fd < 0)
		fatal_error(errno);
	else
	{
		while (wal_size_written <= wal_size)
		{
			ssize_t ret = write(fd, initdata, sizeof(initdata));
			if (ret == -1)
				fatal_error(errno);
			wal_size_written += ret;
		}
		if (fsync(fd) < 0)
			fatal_error(errno);
		close(fd);
	}
}

static pid_t
start_subprocess(void (*sub)(void))
{
	pid_t pid;

	pid = fork();
	if (pid == -1)
		fatal_error(errno);
	else if (pid == 0)
		sub();

	return pid;
}

int
main(int argc, char **argv)
{
	int status;
	pid_t checkpointer_pid, wal_pid;

	/*
	 * Don't want to hit the same, already cached, pages after restarting.
	 */
	srandom((int)time(NULL));

	initialize_files();

	checkpointer_pid = start_subprocess(do_checkpointer_writes);
	wal_pid = start_subprocess(do_wal_writes);

	/* start all reader processes */
	for (int i = 0; i < NUM_RANDOM_READERS; i++)
		readers[i] = start_subprocess(do_random_reads);

	/* return if all subprocesses decided to die */
	for (int i = 0; i < NUM_RANDOM_READERS; i++)
		waitpid(readers[i], &status, 0);

	waitpid(checkpointer_pid, &status, 0);
	waitpid(wal_pid, &status, 0);

	return 0;
}

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 19:11 Postgresql performance problems with IO latency, especially during fsync() Andres Freund
@ 2014-03-26 21:41   ` Andy Lutomirski
  2014-03-27 15:52   ` Jan Kara
  2014-04-09  9:20   ` Dave Chinner
  2 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2014-03-26 21:41 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-mm, Linux FS Devel, linux-kernel, lsf, Wu Fengguang, rhaas, andres

On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
>
> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>

For your amusement: running this program in KVM on a 2GB disk image
failed, but it caused the *host* to go out to lunch for several
seconds while failing.  In fact, it seems to have caused the host to
fall over so badly that the guest decided that the disk controller was
timing out.  The host is btrfs, and I think that btrfs is *really* bad
at this kind of workload.

Even using ext4 is no good.  I think that dm-crypt is dying under the
load.  So I won't test your program for real :/


[...]

> Possible solutions:
> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>   sync_file_range() does.
> * Make IO triggered by writeback regard IO priorities and add it to
>   schedulers other than CFQ
> * Add a tunable that allows limiting the amount of dirty memory before
>   writeback on a per process basis.
> * ...?

I thought the problem wasn't so much that priorities weren't respected
but that the fsync call fills up the queue, so everything starts
contending for the right to enqueue a new request.

Since fsync blocks until all of its IO finishes anyway, what if it
could just limit itself to a much smaller number of outstanding
requests?

I'm not sure I understand the request queue stuff, but here's an idea.
 The block core contains this little bit of code:

    /*
     * Only allow batching queuers to allocate up to 50% over the defined
     * limit of requests, otherwise we could have thousands of requests
     * allocated with any setting of ->nr_requests
     */
    if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
        return NULL;

What if this changed to:

    /*
     * Only allow batching queuers to allocate up to 50% of the defined
     * limit of requests, so that non-batching queuers can get into the queue
     * and thus be scheduled properly.
     */
    if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
        return NULL;

I suspect that doing this right would take a bit more care than that,
but I wonder if this approach is any good.

--Andy

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 21:41   ` Andy Lutomirski
@ 2014-03-26 21:55     ` Andres Freund
  -1 siblings, 0 replies; 31+ messages in thread
From: Andres Freund @ 2014-03-26 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-mm, Linux FS Devel, linux-kernel, lsf, Wu Fengguang, rhaas

On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > top#1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together, that seems to show the
> > problem nicely.
> >
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> >
> 
> For your amusement: running this program in KVM on a 2GB disk image
> failed, but it caused the *host* to go out to lunch for several
> seconds while failing.  In fact, it seems to have caused the host to
> fall over so badly that the guest decided that the disk controller was
> timing out.  The host is btrfs, and I think that btrfs is *really* bad
> at this kind of workload.

Also, unless you changed the parameters, it's a) using a 48GB data file
and b) writing really rather fast ;)

> Even using ext4 is no good.  I think that dm-crypt is dying under the
> load.  So I won't test your program for real :/

Try reducing data_size to RAM * 2 and NUM_RANDOM_READERS to something
smaller. If that still doesn't work, consider increasing the two
nsleep()s...
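
E.g., with purely made-up numbers for a box with 8GB of RAM, something
like this in ioperf.c might already be enough:

/* fewer readers for a smaller machine */
#define NUM_RANDOM_READERS 4

/* memory size * 2 */
static const size_t data_size = 1024L * 1024 * 1024 * 16;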

I didn't have a good idea how to scale those to the current machine in a
halfway automatic fashion.

> > Possible solutions:
> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> >   sync_file_range() does.
> > * Make IO triggered by writeback regard IO priorities and add it to
> >   schedulers other than CFQ
> > * Add a tunable that allows limiting the amount of dirty memory before
> >   writeback on a per process basis.
> > * ...?
> 
> I thought the problem wasn't so much that priorities weren't respected
> but that the fsync call fills up the queue, so everything starts
> contending for the right to enqueue a new request.

I think it's both actually. If I understand correctly there's not even a
correct association to the originator anymore during a fsync triggered
flush?

> Since fsync blocks until all of its IO finishes anyway, what if it
> could just limit itself to a much smaller number of outstanding
> requests?

Yea, that could already help. If you remove the fsync()s, the problem
will periodically appear anyway, because writeback is triggered with a
vengeance. That'd need to be fixed in a similar way.

> I'm not sure I understand the request queue stuff, but here's an idea.
>  The block core contains this little bit of code:

I haven't read enough of the code yet, to comment intelligently ;)

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 21:55     ` Andres Freund
@ 2014-03-26 22:26       ` Andy Lutomirski
  -1 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2014-03-26 22:26 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-mm, Linux FS Devel, linux-kernel, lsf, Wu Fengguang, rhaas

On Wed, Mar 26, 2014 at 2:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
>> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
>> > top#1 concern is frequent slow read()s that happen while another process
>> > calls fsync(), even though we'd be perfectly fine if that fsync() took
>> > ages.
>> > The "conclusion" of that part was that it'd be very useful to have a
>> > demonstration of the problem without needing a full blown postgres
>> > setup. I've quickly hacked something together, that seems to show the
>> > problem nicely.
>> >
>> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
>> > and the "IO Scheduling" bit in
>> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>> >
>>
>> For your amusement: running this program in KVM on a 2GB disk image
>> failed, but it caused the *host* to go out to lunch for several
>> seconds while failing.  In fact, it seems to have caused the host to
>> fall over so badly that the guest decided that the disk controller was
>> timing out.  The host is btrfs, and I think that btrfs is *really* bad
>> at this kind of workload.
>
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and writes really rather fast ;)
>
>> Even using ext4 is no good.  I think that dm-crypt is dying under the
>> load.  So I won't test your program for real :/
>
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two nsleep()s...
>
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.

OK, I think I'm getting reasonably bad behavior with these qemu options:

-smp 2 -cpu host -m 600 -drive file=/var/lutotmp/test.img,cache=none

and a 2GB test partition.

>
>> > Possible solutions:
>> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>> >   sync_file_range() does.
>> > * Make IO triggered by writeback regard IO priorities and add it to
>> >   schedulers other than CFQ
>> > * Add a tunable that allows limiting the amount of dirty memory before
>> >   writeback on a per process basis.
>> > * ...?
>>
>> I thought the problem wasn't so much that priorities weren't respected
>> but that the fsync call fills up the queue, so everything starts
>> contending for the right to enqueue a new request.
>
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?
>
>> Since fsync blocks until all of its IO finishes anyway, what if it
>> could just limit itself to a much smaller number of outstanding
>> requests?
>
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with
> vengeance. That'd need to be fixed in a similar way.
>
>> I'm not sure I understand the request queue stuff, but here's an idea.
>>  The block core contains this little bit of code:
>
> I haven't read enough of the code yet, to comment intelligently ;)

My little patch doesn't seem to help.  I'm either changing the wrong
piece of code entirely or I'm penalizing readers and writers too much.

Hopefully some real block layer people can comment as to whether a
refinement of this idea could work.  The behavior I want is for
writeback to be limited to using a smallish fraction of the total
request queue size -- I think that writeback should be able to enqueue
enough requests to get decent sorting performance but not enough
requests to prevent the io scheduler from doing a good job on
non-writeback I/O.

As an even more radical idea, what if there was a way to submit truly
enormous numbers of lightweight requests, such that the queue will
give the requester some kind of callback when the request is nearly
ready for submission so the requester can finish filling in the
request?  This would allow things like dm-crypt to get the benefit of
sorting without needing to encrypt hundreds of MB of data in advance
of having that data actually be to the backing device.  It might also
allow writeback to submit multiple gigabytes of writes, in arbitrarily
large pieces, but not to need to pin pages or do whatever expensive
things are needed until the IO actually happens.
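
Purely as a strawman to make that concrete (none of these names exist
anywhere; this is just the shape of the interface I have in mind):

#include <linux/types.h>
#include <linux/blkdev.h>

struct lightweight_request {
	sector_t	sector;		/* where the data will eventually go */
	unsigned int	nr_sectors;
	void		*private;	/* cookie for the submitter */

	/*
	 * Called by the block layer shortly before the request would be
	 * issued, so the submitter only then encrypts / pins / fills in
	 * the actual bio.
	 */
	int		(*prepare)(struct lightweight_request *lwr);
};

int blk_queue_lightweight(struct request_queue *q,
			  struct lightweight_request *lwr);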

For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
        }

        /*
-        * Only allow batching queuers to allocate up to 50% over the defined
-        * limit of requests, otherwise we could have thousands of requests
-        * allocated with any setting of ->nr_requests
+        * Only allow batching queuers to allocate up to 50% of the
+        * defined limit of requests, so that non-batching queuers can
+        * get into the queue and thus be scheduled properly.
         */
-       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+       if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
                return NULL;

        q->nr_rqs[is_sync]++;

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 22:26       ` Andy Lutomirski
@ 2014-03-26 22:35         ` David Lang
  -1 siblings, 0 replies; 31+ messages in thread
From: David Lang @ 2014-03-26 22:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andres Freund, linux-mm, Linux FS Devel, linux-kernel, lsf,
	Wu Fengguang, rhaas

On Wed, 26 Mar 2014, Andy Lutomirski wrote:

>>> I'm not sure I understand the request queue stuff, but here's an idea.
>>>  The block core contains this little bit of code:
>>
>> I haven't read enough of the code yet, to comment intelligently ;)
>
> My little patch doesn't seem to help.  I'm either changing the wrong
> piece of code entirely or I'm penalizing readers and writers too much.
>
> Hopefully some real block layer people can comment as to whether a
> refinement of this idea could work.  The behavior I want is for
> writeback to be limited to using a smallish fraction of the total
> request queue size -- I think that writeback should be able to enqueue
> enough requests to get decent sorting performance but not enough
> requests to prevent the io scheduler from doing a good job on
> non-writeback I/O.

The thing is that if there are no reads that are waiting, why not use every bit 
of disk I/O available to write? If you can do that reliably with only using part 
of the queue, fine, but aren't you getting fairly close to just having separate 
queues for reading and writing with such a restriction?

> As an even more radical idea, what if there was a way to submit truly
> enormous numbers of lightweight requests, such that the queue will
> give the requester some kind of callback when the request is nearly
> ready for submission so the requester can finish filling in the
> request?  This would allow things like dm-crypt to get the benefit of
> sorting without needing to encrypt hundreds of MB of data in advance
> of having that data actually be to the backing device.  It might also
> allow writeback to submit multiple gigabytes of writes, in arbitrarily
> large pieces, but not to need to pin pages or do whatever expensive
> things are needed until the IO actually happens.

the problem with a callback is that you then need to wait for that source to get 
the CPU and finish doing its work. What happens if that takes long enough for 
you to run out of data to write? And is it worth the extra context switches to 
bounce around when the writing process was finished with that block already.

David Lang

> For reference, here's my patch that doesn't work well:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4cd5ffc..c0dedc3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
>        }
>
>        /*
> -        * Only allow batching queuers to allocate up to 50% over the defined
> -        * limit of requests, otherwise we could have thousands of requests
> -        * allocated with any setting of ->nr_requests
> +        * Only allow batching queuers to allocate up to 50% of the
> +        * defined limit of requests, so that non-batching queuers can
> +        * get into the queue and thus be scheduled properly.
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +       if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
>                return NULL;
>
>        q->nr_rqs[is_sync]++;

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 22:35         ` David Lang
@ 2014-03-26 23:11           ` Andy Lutomirski
  -1 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2014-03-26 23:11 UTC (permalink / raw)
  To: David Lang
  Cc: Andres Freund, linux-mm, Linux FS Devel, linux-kernel, lsf,
	Wu Fengguang, rhaas

On Wed, Mar 26, 2014 at 3:35 PM, David Lang <david@lang.hm> wrote:
> On Wed, 26 Mar 2014, Andy Lutomirski wrote:
>
>>>> I'm not sure I understand the request queue stuff, but here's an idea.
>>>>  The block core contains this little bit of code:
>>>
>>>
>>> I haven't read enough of the code yet, to comment intelligently ;)
>>
>>
>> My little patch doesn't seem to help.  I'm either changing the wrong
>> piece of code entirely or I'm penalizing readers and writers too much.
>>
>> Hopefully some real block layer people can comment as to whether a
>> refinement of this idea could work.  The behavior I want is for
>> writeback to be limited to using a smallish fraction of the total
>> request queue size -- I think that writeback should be able to enqueue
>> enough requests to get decent sorting performance but not enough
>> requests to prevent the io scheduler from doing a good job on
>> non-writeback I/O.
>
>
> The thing is that if there are no reads that are waiting, why not use every
> bit of disk I/O available to write? If you can do that reliably with only
> using part of the queue, fine, but aren't you getting fairly close to just
> having separate queues for reading and writing with such a restriction?
>

Hmm.

I wonder what the actual effect of queue length is on throughput.  I
suspect that using half the queue gives you well over half the
throughput as long as the queue isn't tiny.

I'm not so sure I'd go so far as having separate reader and writer
queues -- I think that small synchronous writes should also not get
stuck behind large writeback storms, but maybe that's something that
can be a secondary goal.  That being said, separate reader and writer
queues might solve the immediate problem.  It won't help for the case
where a small fsync blocks behind writeback, though, and that seems to
be a very common cause of Firefox freezing on my system.

Is there an easy way to do a proof-of-concept?  It would be great if
there was a ten-line patch that implemented something like this
correctly enough to see if it helps.  I don't think I'm the right
person to do it, because my knowledge of the block layer code is
essentially nil.

>
>> As an even more radical idea, what if there was a way to submit truly
>> enormous numbers of lightweight requests, such that the queue will
>> give the requester some kind of callback when the request is nearly
>> ready for submission so the requester can finish filling in the
>> request?  This would allow things like dm-crypt to get the benefit of
>> sorting without needing to encrypt hundreds of MB of data in advance
>> of having that data actually be to the backing device.  It might also
>> allow writeback to submit multiple gigabytes of writes, in arbitrarily
>> large pieces, but not to need to pin pages or do whatever expensive
>> things are needed until the IO actually happens.
>
>
> the problem with a callback is that you then need to wait for that source to
> get the CPU and finish doing it's work. What happens if that takes long
> enough for you to run out of data to write? And is it worth the extra
> context switches to bounce around when the writing process was finished with
> that block already.

dm-crypt is so context-switch heavy that I doubt the context switches
matter.  And you'd need to give the callback early enough that there's
a very good chance that the callback will finish in time.  There might
even need to be a way to let other non-callback-dependent IO pass by
the callback-dependent stuff, although in the particular case of
dm-crypt, dm-crypt is pretty much the only source of writes.  (Reads
don't have this problem for dm-crypt, I think.)

--Andy

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 23:11           ` Andy Lutomirski
@ 2014-03-26 23:28             ` Andy Lutomirski
  -1 siblings, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2014-03-26 23:28 UTC (permalink / raw)
  To: David Lang
  Cc: Andres Freund, linux-mm, Linux FS Devel, linux-kernel, lsf,
	Wu Fengguang, rhaas

On Wed, Mar 26, 2014 at 4:11 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Mar 26, 2014 at 3:35 PM, David Lang <david@lang.hm> wrote:
>> On Wed, 26 Mar 2014, Andy Lutomirski wrote:
>>
>>>>> I'm not sure I understand the request queue stuff, but here's an idea.
>>>>>  The block core contains this little bit of code:
>>>>
>>>>
>>>> I haven't read enough of the code yet, to comment intelligently ;)
>>>
>>>
>>> My little patch doesn't seem to help.  I'm either changing the wrong
>>> piece of code entirely or I'm penalizing readers and writers too much.
>>>
>>> Hopefully some real block layer people can comment as to whether a
>>> refinement of this idea could work.  The behavior I want is for
>>> writeback to be limited to using a smallish fraction of the total
>>> request queue size -- I think that writeback should be able to enqueue
>>> enough requests to get decent sorting performance but not enough
>>> requests to prevent the io scheduler from doing a good job on
>>> non-writeback I/O.
>>
>>
>> The thing is that if there are no reads that are waiting, why not use every
>> bit of disk I/O available to write? If you can do that reliably with only
>> using part of the queue, fine, but aren't you getting fairly close to just
>> having separate queues for reading and writing with such a restriction?
>>
>
> Hmm.
>
> I wonder what the actual effect of queue length is on throughput.  I
> suspect that using half the queue gives you well over half the
> throughput as long as the queue isn't tiny.
>
> I'm not so sure I'd go so far as having separate reader and writer
> queues -- I think that small synchronous writes should also not get
> stuck behind large writeback storms, but maybe that's something that
> can be a secondary goal.  That being said, separate reader and writer
> queues might solve the immediate problem.  It won't help for the case
> where a small fsync blocks behind writeback, though, and that seems to
> be a very common cause of Firefox freezing on my system.
>
> Is there an easy way to do a proof-of-concept?  It would be great if
> there was a ten-line patch that implemented something like this
> correctly enough to see if it helps.  I don't think I'm the right
> person to do it, because my knowledge of the block layer code is
> essentially nil.

I think it's at least a bit more subtle than this.  cfq distinguishes
SYNC and ASYNC, but very large fsyncs are presumably SYNC.  Deadline
pays no attention to rw flags.

Anyway, it seems like there's basically nothing prioritizing what
happens when the number of requests exceeds the congestion thresholds.
 I'd happily bet a beverage* that Postgres's slow requests are
spending an excessive amount of time waiting to get into the queue in
the first place.

* Since I'm back home now, any actual beverage transaction will be
rather delayed.

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 21:55     ` Andres Freund
@ 2014-03-27 15:50       ` Jan Kara
  -1 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2014-03-27 15:50 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andy Lutomirski, linux-kernel, lsf, linux-mm, rhaas,
	Linux FS Devel, Wu Fengguang

On Wed 26-03-14 22:55:18, Andres Freund wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> > On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
> > > Hi,
> > >
> > > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > > top#1 concern is frequent slow read()s that happen while another process
> > > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > > ages.
> > > The "conclusion" of that part was that it'd be very useful to have a
> > > demonstration of the problem without needing a full blown postgres
> > > setup. I've quickly hacked something together, that seems to show the
> > > problem nicely.
> > >
> > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > > and the "IO Scheduling" bit in
> > > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> > >
> > 
> > For your amusement: running this program in KVM on a 2GB disk image
> > failed, but it caused the *host* to go out to lunch for several
> > seconds while failing.  In fact, it seems to have caused the host to
> > fall over so badly that the guest decided that the disk controller was
> > timing out.  The host is btrfs, and I think that btrfs is *really* bad
> > at this kind of workload.
> 
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and writes really rather fast ;)
> 
> > Even using ext4 is no good.  I think that dm-crypt is dying under the
> > load.  So I won't test your program for real :/
> 
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two nsleep()s...
> 
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.
  That's not necessary. If we have guidance like the above, we can figure it
out ourselves (I hope ;).
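
If it helps, a minimal sketch of sizing those knobs automatically
(data_size, NUM_RANDOM_READERS and the two nsleep()s are the parameters
named above; the sysconf()-based sizing and the concrete numbers below
are assumptions, not anything taken from ioperf.c):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    uint64_t ram = (uint64_t)pages * (uint64_t)page_size;

    uint64_t data_size = ram * 2;          /* ~2x RAM, as suggested above */
    int num_random_readers = 8;            /* smaller than the default */
    long reader_sleep_ns = 1000 * 1000;    /* lengthen if the box still drowns */

    printf("data_size=%llu bytes, readers=%d, sleep=%ld ns\n",
           (unsigned long long)data_size, num_random_readers, reader_sleep_ns);
    return 0;
}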

> > > Possible solutions:
> > > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> > >   sync_file_range() does.
> > > * Make IO triggered by writeback regard IO priorities and add it to
> > >   schedulers other than CFQ
> > > * Add a tunable that allows limiting the amount of dirty memory before
> > >   writeback on a per process basis.
> > > * ...?
> > 
> > I thought the problem wasn't so much that priorities weren't respected
> > but that the fsync call fills up the queue, so everything starts
> > contending for the right to enqueue a new request.
> 
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?
  There is. The association is lost for background writeback (and sync(2)
for that matter) but IO from fsync(2) is submitted in the context of the
process doing fsync.

What I think happens here is the problem of 'dependent sync IO' vs
'independent sync IO'. Reads are an example of dependent sync IO: you
submit a read, need it to complete, and only then submit the next read.
OTOH fsync is an example of independent sync IO: you fire off tons of IO
to the drive and then wait for all of it. Since we treat both these types
of IO in the same way, it can easily happen that independent sync IO
starves out the dependent one (you execute, say, 100 IO requests for the
fsync and 1 IO request for the read). We've seen problems like this in
the past.
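
To make the distinction concrete, a minimal userspace sketch of the two
patterns (illustration only, not code from this thread; the file name,
sizes and loop counts are made up):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 8192

/* dependent: one read outstanding at a time, next offset derived from data */
static void dependent_reads(int fd, off_t file_size, int count)
{
    char buf[BLOCK];
    off_t off = 0;

    if (file_size <= BLOCK)
        return;
    for (int i = 0; i < count; i++) {
        if (pread(fd, buf, BLOCK, off) != BLOCK)
            break;
        /* pretend the data tells us where to look next (e.g. a btree walk) */
        off = ((unsigned char)buf[0] * 1021 + off) % (file_size - BLOCK);
    }
}

/* independent: dirty lots of pages, then let fsync() submit them as one burst */
static void independent_fsync(int fd, off_t file_size, int count)
{
    char buf[BLOCK];

    if (file_size <= BLOCK)
        return;
    memset(buf, 'x', sizeof buf);
    for (int i = 0; i < count; i++)
        pwrite(fd, buf, BLOCK, (rand() % (file_size / BLOCK)) * (off_t)BLOCK);
    fsync(fd);    /* everything above is waited for together */
}

int main(void)
{
    int fd = open("testfile", O_RDWR);
    if (fd < 0)
        return 1;
    off_t size = lseek(fd, 0, SEEK_END);
    dependent_reads(fd, size, 1000);
    independent_fsync(fd, size, 10000);
    close(fd);
    return 0;
}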

I'll have a look at your test program, and if my feeling is indeed
correct, I'll look into what we could do in the block layer to fix this
(and poke the block layer guys - they had some preliminary patches that
tried to address this, but they didn't go anywhere).

> > Since fsync blocks until all of its IO finishes anyway, what if it
> > could just limit itself to a much smaller number of outstanding
> > requests?
> 
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with
> vengeance. That'd need to be fixed in a similar way.
  Actually, that might be triggered by a different problem, because in the
case of background writeback the block layer knows the IO is asynchronous
and treats it differently.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 19:11 Postgresql performance problems with IO latency, especially during fsync() Andres Freund
@ 2014-03-27 15:52   ` Jan Kara
  2014-03-27 15:52   ` Jan Kara
  2014-04-09  9:20   ` Dave Chinner
  2 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2014-03-27 15:52 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-mm, linux-fsdevel, linux-kernel, lsf, Wu Fengguang, rhaas, andres

  Hello,

On Wed 26-03-14 20:11:13, Andres Freund wrote:
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
  Thanks a lot for the program!

> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> 
> The tools output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
> 
> So, the average read time is less than one ms (SSD, and about 50% cached
> workload). But once another backend does the fsync(), read latency
> skyrockets.
> 
> A concurrent iostat shows the problem pretty clearly:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
> sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
> sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
> sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
> sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
> sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
> sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
> sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20
> 
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the disk
> is increased by two orders of magnitude. Unsurprisingly with disastrous
> consequences for read() performance. We really want a way to pace the
> writes issued to the disk more regularly.
> 
> The attached program right now can only be configured by changing some
> details in the code itself, but I guess that's not a problem. It will
> upfront allocate two files, and then start testing. If the files already
> exists it will use them.
> 
> Possible solutions:
> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>   sync_file_range() does.
> * Make IO triggered by writeback regard IO priorities and add it to
>   schedulers other than CFQ
> * Add a tunable that allows limiting the amount of dirty memory before
>   writeback on a per process basis.
> * ...?
> 
> If somebody familiar with buffered IO writeback is around at LSF/MM, or
> rather collab, Robert and I will be around for the next days.
  I guess I'm your guy, at least for the writeback part. I have some
insight into the block layer as well, although there are better experts
around here. But I at least know whom to catch if there's some deeply
intricate problem ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-03-27 15:50       ` Jan Kara
@ 2014-03-27 18:10         ` Fernando Luis Vazquez Cao
  -1 siblings, 0 replies; 31+ messages in thread
From: Fernando Luis Vazquez Cao @ 2014-03-27 18:10 UTC (permalink / raw)
  To: Jan Kara, Andres Freund
  Cc: linux-kernel, Andy Lutomirski, lsf, linux-mm, rhaas,
	Linux FS Devel, Wu Fengguang

(2014/03/28 0:50), Jan Kara wrote:
> On Wed 26-03-14 22:55:18, Andres Freund wrote:
>> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
>>> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund <andres@anarazel.de> wrote:
>>>> Hi,
>>>>
>>>> At LSF/MM there was a slot about postgres' problems with the kernel. Our
>>>> top#1 concern is frequent slow read()s that happen while another process
>>>> calls fsync(), even though we'd be perfectly fine if that fsync() took
>>>> ages.
>>>> The "conclusion" of that part was that it'd be very useful to have a
>>>> demonstration of the problem without needing a full blown postgres
>>>> setup. I've quickly hacked something together, that seems to show the
>>>> problem nicely.
>>>>
>>>> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
>>>> and the "IO Scheduling" bit in
>>>> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>>>>
>>> For your amusement: running this program in KVM on a 2GB disk image
>>> failed, but it caused the *host* to go out to lunch for several
>>> seconds while failing.  In fact, it seems to have caused the host to
>>> fall over so badly that the guest decided that the disk controller was
>>> timing out.  The host is btrfs, and I think that btrfs is *really* bad
>>> at this kind of workload.
>> Also, unless you changed the parameters, it's a) using a 48GB disk file,
>> and writes really rather fast ;)
>>
>>> Even using ext4 is no good.  I think that dm-crypt is dying under the
>>> load.  So I won't test your program for real :/
>> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
>> smaller. If it still doesn't work consider increasing the two nsleep()s...
>>
>> I didn't have a good idea how to scale those to the current machine in a
>> halfway automatic fashion.
>    That's not necessary. If we have a guidance like above, we can figure it
> out ourselves (I hope ;).
>
>>>> Possible solutions:
>>>> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>>>>    sync_file_range() does.
>>>> * Make IO triggered by writeback regard IO priorities and add it to
>>>>    schedulers other than CFQ
>>>> * Add a tunable that allows limiting the amount of dirty memory before
>>>>    writeback on a per process basis.
>>>> * ...?
>>> I thought the problem wasn't so much that priorities weren't respected
>>> but that the fsync call fills up the queue, so everything starts
>>> contending for the right to enqueue a new request.
>> I think it's both actually. If I understand correctly there's not even a
>> correct association to the originator anymore during a fsync triggered
>> flush?
>    There is. The association is lost for background writeback (and sync(2)
> for that matter) but IO from fsync(2) is submitted in the context of the
> process doing fsync.
>
> What I think happens is the problem with 'dependent sync IO' vs
> 'independent sync IO'. Reads are an example of dependent sync IO where you
> submit a read, need it to complete and then you submit another read. OTOH
> fsync is an example of independent sync IO where you fire of tons of IO to
> the drive and they wait for everything. Since we treat both these types of
> IO in the same way, it can easily happen that independent sync IO starves
> out the dependent one (you execute say 100 IO requests for fsync and 1 IO
> request for read). We've seen problems like this in the past.
>
> I'll have a look into your test program and if my feeling is indeed
> correct, I'll have a look into what we could do in the block layer to fix
> this (and poke block layer guys - they had some preliminary patches that
> tried to address this but it didn't went anywhere).

We have been using PostgreSQL in production for years, so I am pretty
familiar with the symptoms described by the PostgreSQL guys.

In almost all cases the problem was the "'dependent sync IO' vs 'independent
sync IO'" issue pointed out by Jan. However, as I mentioned during the LSF&MM
session, the culprit was not the kernel but the firmware of NCQ/TCQ capable
storage that would keep read requests queued forever, leaving tasks doing
reads (dependent sync IO) waiting for an interrupt that would not come. For
the record, latencies of up to 120 seconds in supposedly enterprise storage
(I will not name names) are relatively common. This can be fixed by modifying
drivers and/or the block layer to dynamically adjust the queue depth when
dumb scheduling by the firmware is observed. If you do not want to make
changes to the kernel, you can always try to change the default queue depth.
With this mechanism in place it is relatively easy to make ionice work,
since, as Jan mentioned, both reads and fsync writes are done in process
context.
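
For completeness, a minimal sketch of the "change the default queue
depth" route from userspace (assumptions: a SCSI/SAS device that exposes
a writable /sys/block/<dev>/device/queue_depth attribute, "sda" standing
in for the real device, and 8 picked arbitrarily as the new depth):

#include <stdio.h>

int main(void)
{
    const char *attr = "/sys/block/sda/device/queue_depth";
    FILE *f = fopen(attr, "w");

    if (!f) {
        perror(attr);
        return 1;
    }
    /* cap how many tagged commands the firmware gets to juggle at once */
    fprintf(f, "8\n");
    fclose(f);
    return 0;
}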

- Fernando

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Postgresql performance problems with IO latency, especially during fsync()
  2014-03-26 19:11 Postgresql performance problems with IO latency, especially during fsync() Andres Freund
@ 2014-04-09  9:20   ` Dave Chinner
  2014-03-27 15:52   ` Jan Kara
  2014-04-09  9:20   ` Dave Chinner
  2 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2014-04-09  9:20 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-mm, linux-fsdevel, linux-kernel, lsf, Wu Fengguang, rhaas, andres

On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> Hi,
> 
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
> 
> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> 
> The tools output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
> 
> So, the average read time is less than one ms (SSD, and about 50% cached
> workload). But once another backend does the fsync(), read latency
> skyrockets.
> 
> A concurrent iostat shows the problem pretty clearly:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
> sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
> sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
> sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
> sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
> sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
> sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
> sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20
> 
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the disk
> is increased by two orders of magnitude. Unsurprisingly with disastrous
> consequences for read() performance. We really want a way to pace the
> writes issued to the disk more regularly.

Hi Andres,

I've finally dug myself out from under the backlog from LSFMM far
enough to start testing this on my local IO performance test rig.

tl;dr: I can't reproduce this peaky behaviour on my test rig.

I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
an unmodified benchmark on a current 3.15-linus tree. All storage
(guest and host) is XFS based, guest VMs use virtio and direct IO to
the backing storage.  The host is using noop IO scheduling.

The first IO setup I ran was a 100TB XFS filesystem in the guest.
The backing file is a sparse file on an XFS filesystem on a pair of
240GB SSDs (Samsung 840 EVO) in RAID 0 via DM.  The SSDs are
exported as JBOD from a RAID controller which has 1GB of FBWC.  The
guest is capable of sustaining around 65,000 random read IOPS and
40,000 write IOPS through this filesystem depending on the tests
being run.

The iostat output looks like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     0.00 1817.00  315.40    18.80     6.93    24.71     0.80    0.38    0.38    0.37   0.31  66.24
vdc               0.00     0.00 2094.20  323.20    21.82     7.10    24.50     0.81    0.33    0.34    0.27   0.28  68.48
vdc               0.00     0.00 1989.00 4500.20    20.50    56.64    24.34    24.82    3.82    0.43    5.32   0.12  80.16
vdc               0.00     0.00 2019.80  320.80    20.83     7.05    24.39     0.83    0.35    0.36    0.32   0.29  69.04
vdc               0.00     0.00 2206.60  323.20    22.57     7.10    24.02     0.87    0.34    0.34    0.33   0.28  71.92
vdc               0.00     0.00 2437.20  329.60    25.79     7.24    24.45     0.83    0.30    0.30    0.27   0.26  71.76
vdc               0.00     0.00 1224.40 11263.80    12.88   136.38    24.48    64.90    5.20    0.69    5.69   0.07  84.96
vdc               0.00     0.00 2074.60  319.40    21.03     7.01    23.99     0.84    0.35    0.36    0.30   0.29  68.96
vdc               0.00     0.00 1999.00  881.60    20.81    13.61    24.47     4.23    1.43    0.35    3.88   0.23  67.36
vdc               0.00     0.00 1193.40 2273.00    12.42    29.68    24.87    11.70    3.40    0.53    4.91   0.18  60.96
vdc               0.00     0.00 1724.00  314.40    17.80     6.91    24.83     0.74    0.36    0.36    0.35   0.30  61.52
vdc               0.00     0.00 2605.20  325.60    24.67     7.15    22.24     0.90    0.31    0.31    0.25   0.25  72.72
vdc               0.00     0.00 2340.60  324.40    23.85     7.12    23.80     0.83    0.31    0.31    0.29   0.26  68.96
vdc               0.00     0.00 2749.60  329.40    28.51     7.24    23.78     0.90    0.29    0.30    0.25   0.24  75.04
vdc               0.00     0.00 2619.60  324.60    27.72     7.13    24.24     0.88    0.30    0.30    0.28   0.24  71.76
vdc               0.00     0.00 1608.60 4532.40    17.17    56.40    24.54    25.78    4.20    0.50    5.51   0.12  75.36
vdc               0.00     0.00 2361.00  326.40    23.62     7.17    23.47     0.87    0.33    0.33    0.30   0.26  69.92
vdc               0.00     0.00 2460.00  326.00    25.89     7.16    24.30     0.88    0.32    0.32    0.26   0.26  72.72
vdc               0.00     0.00 2519.00  325.40    25.96     7.14    23.83     0.90    0.32    0.32    0.30   0.26  74.32
vdc               0.00     0.00 2709.80  326.20    28.91     7.17    24.34     0.94    0.31    0.30    0.36   0.25  75.52
vdc               0.00     0.00 2676.65  329.74    28.82     7.24    24.56     0.86    0.28    0.29    0.25   0.24  71.22
vdc               0.00     0.00 1788.40 4506.80    18.77    56.66    24.54    23.22    3.69    0.45    4.97   0.12  74.88
vdc               0.00     0.00 1850.40  319.60    19.76     7.02    25.28     0.80    0.37    0.37    0.34   0.30  64.80

It's obvious where the fsyncs are hitting, but they are making
almost no impact on the read performance. The benchmark is
simply not generating enough dirty data to overload the IO
subsystem, and hence there are no latency spikes to speak of.

Benchmark output across fsyncs also demonstrates that:

....
read[12494]: avg: 0.4 msec; max: 9.8 msec
read[12499]: avg: 0.4 msec; max: 9.8 msec
read[12495]: avg: 0.4 msec; max: 9.8 msec
commit[12491]: avg: 0.0 msec; max: 6.3 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
starting fsync() of files
finished fsync() of files
wal[12491]: avg: 0.0 msec; max: 0.7 msec
commit[12491]: avg: 0.1 msec; max: 15.8 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
read[12492]: avg: 0.6 msec; max: 10.0 msec
read[12496]: avg: 0.6 msec; max: 10.0 msec
read[12507]: avg: 0.6 msec; max: 10.0 msec
read[12505]: avg: 0.6 msec; max: 10.0 msec
....

So, I thought switching to spinning disks might show the problem.
Same VM, this time using a 17TB linearly preallocated image file on
an 18TB XFS filesystem on the host (virtio in the guest again) on a
12 disk RAID-0 (DM again) using 12x2TB SAS drives exported as JBOD
from a RAID controller with 512MB of BBWC. This is capable of about
2,000 random read IOPS, and write IOPS is dependent on the BBWC
flushing behaviour (peaks at about 15,000, sustains 1,500).

Again, I don't see any bad behaviour:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdd               0.00     0.00  128.00  154.20     1.46     6.02    54.24     1.02    3.62    7.22    0.62   3.17  89.44
vdd               0.00     0.00  133.80  154.40     1.43     6.03    53.04     1.08    3.76    7.42    0.59   3.05  87.84
vdd               0.00     0.00  132.40  751.60     1.36    12.75    32.70     7.26    8.21    7.90    8.27   1.06  93.76
vdd               0.00     0.00  138.60  288.60     1.47     7.62    43.57     2.69    6.30    7.24    5.85   2.15  91.68
vdd               0.00     0.00  141.20  157.00     1.51     6.12    52.36     1.12    3.75    7.26    0.59   3.11  92.64
vdd               0.00     0.00  136.80  154.00     1.52     6.02    53.09     1.04    3.57    6.92    0.60   3.10  90.08
vdd               0.00     0.00  135.00  154.20     1.46     6.02    52.99     1.04    3.60    7.00    0.62   3.03  87.68
vdd               0.00     0.00  147.20  155.80     1.50     6.09    51.26     1.19    3.92    7.45    0.59   3.06  92.72
vdd               0.00     0.00  139.60  154.00     1.50     6.02    52.43     1.03    3.53    6.79    0.58   2.97  87.28
vdd               0.00     0.00  126.80  913.80     1.27    14.81    31.63     7.87    7.56    9.02    7.36   0.89  92.16
vdd               0.00     0.00  148.20  156.80     1.51     6.11    51.15     1.17    3.83    7.25    0.59   3.03  92.32
vdd               0.00     0.00  142.60  155.80     1.49     6.09    52.03     1.12    3.76    7.23    0.60   3.05  90.88
vdd               0.00     0.00  146.00  156.60     1.49     6.12    51.48     1.16    3.84    7.32    0.60   3.03  91.76
vdd               0.00     0.00  148.80  157.20     1.56     6.14    51.54     1.10    3.58    6.78    0.54   2.92  89.36
vdd               0.00     0.00  149.80  156.80     1.52     6.12    51.05     1.14    3.73    7.10    0.52   2.96  90.88
vdd               0.00     0.00  127.20  910.60     1.29    14.82    31.80     8.17    7.87    8.62    7.77   0.88  91.44
vdd               0.00     0.00  150.80  157.80     1.58     6.15    51.27     1.09    3.53    6.67    0.52   2.91  89.92
vdd               0.00     0.00  153.00  156.60     1.53     6.12    50.56     1.17    3.80    7.07    0.60   2.95  91.20
vdd               0.00     0.00  161.20  157.00     1.54     6.13    49.36     1.27    4.00    7.36    0.55   2.97  94.48
vdd               0.00     0.00  152.10  157.09     1.55     6.14    50.91     1.15    3.73    7.06    0.50   2.98  92.22
vdd               0.00     0.00  154.80  157.80     1.50     6.16    50.18     1.19    3.82    7.14    0.57   2.88  90.00
vdd               0.00     0.00  122.20  922.80     1.24    14.93    31.70     8.30    7.94    8.81    7.83   0.90  93.68
vdd               0.00     0.00  139.80  153.60     1.46     5.98    51.98     1.09    3.74    7.14    0.65   3.02  88.48
vdd               0.00     0.00  134.60  153.60     1.44     6.00    52.87     1.09    3.78    7.38    0.63   3.15  90.88
vdd               0.00     0.00  141.00  153.40     1.55     5.99    52.49     1.05    3.57    6.79    0.60   3.04  89.44
vdd               0.00     0.00  130.20  154.20     1.44     6.02    53.76     1.04    3.65    7.28    0.59   3.14  89.44
vdd               0.00     0.00  136.60  154.20     1.30     6.02    51.52     1.19    4.07    8.02    0.58   3.03  88.24
vdd               0.00     0.00  119.00  872.20     1.29    14.29    32.20     7.98    8.05    8.59    7.98   0.93  92.64
vdd               0.00     0.00  139.00  154.00     1.58     6.00    53.01     1.06    3.62    6.95    0.61   3.12  91.52
vdd               0.00     0.00  146.00  155.40     1.52     6.07    51.60     1.12    3.71    7.05    0.57   3.01  90.72
vdd               0.00     0.00  145.20  156.60     1.55     6.12    52.01     1.13    3.73    7.10    0.60   3.07  92.80
vdd               0.00     0.00  147.80  156.80     1.51     6.12    51.36     1.13    3.72    7.07    0.56   2.97  90.56
vdd               0.00     0.00  142.80  157.00     1.48     6.12    51.98     1.11    3.70    7.20    0.52   3.02  90.56
vdd               0.00     0.00  125.20  922.00     1.36    14.92    31.85     8.34    7.96    8.47    7.90   0.88  92.56

Again, you can see exactly where the fsyncs are hitting, and again
they are not massive spikes of write IO. And read latencies are showing:

read[12711]: avg: 6.1 msec; max: 182.2 msec
...
read[12711]: avg: 6.0 msec; max: 198.8 msec
....
read[12711]: avg: 5.9 msec; max: 158.8 msec
....
read[12711]: avg: 6.1 msec; max: 127.5 msec
....
read[12711]: avg: 5.9 msec; max: 355.6 msec
....
read[12711]: avg: 6.0 msec; max: 262.8 msec
....

No significant outliers. The same for commits and wal:

wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 1.0 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
starting fsync() of files
finished fsync() of files
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.1 msec; max: 129.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 0.9 msec
wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 23.1 msec

I'm not sure how you were generating the behaviour you reported, but
the test program as it stands does not appear to be causing any
problems at all on the sort of storage I'd expect large databases to
be hosted on....

I've tried a few tweaks to the test program, but I haven't been able
to make it misbehave. What do I need to tweak in the test program or
my test VM to make the kernel misbehave like you reported?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Postgresql performance problems with IO latency, especially during fsync()
@ 2014-04-09  9:20   ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2014-04-09  9:20 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-mm, linux-fsdevel, linux-kernel, lsf, Wu Fengguang, rhaas, andres

On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> Hi,
> 
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
> 
> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> 
> The tools output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
> 
> So, the average read time is less than one ms (SSD, and about 50% cached
> workload). But once another backend does the fsync(), read latency
> skyrockets.
> 
> A concurrent iostat shows the problem pretty clearly:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
> sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
> sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
> sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
> sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
> sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
> sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
> sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20
> 
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the disk
> is increased by two orders of magnitude. Unsurprisingly with disastrous
> consequences for read() performance. We really want a way to pace the
> writes issued to the disk more regularly.

Hi Andreas,

I've finally dug myself out from under the backlog from LSFMM far
enough to start testing this on my local IO performance test rig.

tl;dr: I can't reproduce this peaky behaviour on my test rig.

I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
an unmodified benchmark on a current 3.15-linus tree. All storage
(guest and host) is XFS based, guest VMs use virtio and direct IO to
the backing storage.  The host is using noop IO scheduling.

The first IO setup I ran was a 100TB XFS filesystem in the guest.
The backing file is a sparse file on an XFS filesystem on a pair of
240GB SSDs (Samsung 840 EVO) in RAID 0 via DM.  The SSDs are
exported as JBOD from a RAID controller which has 1GB of FBWC.  The
guest is capable of sustaining around 65,000 random read IOPS and
40,000 write IOPS through this filesystem depending on the tests
being run.

The iostat output looks like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     0.00 1817.00  315.40    18.80     6.93    24.71     0.80    0.38    0.38    0.37   0.31  66.24
vdc               0.00     0.00 2094.20  323.20    21.82     7.10    24.50     0.81    0.33    0.34    0.27   0.28  68.48
vdc               0.00     0.00 1989.00 4500.20    20.50    56.64    24.34    24.82    3.82    0.43    5.32   0.12  80.16
vdc               0.00     0.00 2019.80  320.80    20.83     7.05    24.39     0.83    0.35    0.36    0.32   0.29  69.04
vdc               0.00     0.00 2206.60  323.20    22.57     7.10    24.02     0.87    0.34    0.34    0.33   0.28  71.92
vdc               0.00     0.00 2437.20  329.60    25.79     7.24    24.45     0.83    0.30    0.30    0.27   0.26  71.76
vdc               0.00     0.00 1224.40 11263.80    12.88   136.38    24.48    64.90    5.20    0.69    5.69   0.07  84.96
vdc               0.00     0.00 2074.60  319.40    21.03     7.01    23.99     0.84    0.35    0.36    0.30   0.29  68.96
vdc               0.00     0.00 1999.00  881.60    20.81    13.61    24.47     4.23    1.43    0.35    3.88   0.23  67.36
vdc               0.00     0.00 1193.40 2273.00    12.42    29.68    24.87    11.70    3.40    0.53    4.91   0.18  60.96
vdc               0.00     0.00 1724.00  314.40    17.80     6.91    24.83     0.74    0.36    0.36    0.35   0.30  61.52
vdc               0.00     0.00 2605.20  325.60    24.67     7.15    22.24     0.90    0.31    0.31    0.25   0.25  72.72
vdc               0.00     0.00 2340.60  324.40    23.85     7.12    23.80     0.83    0.31    0.31    0.29   0.26  68.96
vdc               0.00     0.00 2749.60  329.40    28.51     7.24    23.78     0.90    0.29    0.30    0.25   0.24  75.04
vdc               0.00     0.00 2619.60  324.60    27.72     7.13    24.24     0.88    0.30    0.30    0.28   0.24  71.76
vdc               0.00     0.00 1608.60 4532.40    17.17    56.40    24.54    25.78    4.20    0.50    5.51   0.12  75.36
vdc               0.00     0.00 2361.00  326.40    23.62     7.17    23.47     0.87    0.33    0.33    0.30   0.26  69.92
vdc               0.00     0.00 2460.00  326.00    25.89     7.16    24.30     0.88    0.32    0.32    0.26   0.26  72.72
vdc               0.00     0.00 2519.00  325.40    25.96     7.14    23.83     0.90    0.32    0.32    0.30   0.26  74.32
vdc               0.00     0.00 2709.80  326.20    28.91     7.17    24.34     0.94    0.31    0.30    0.36   0.25  75.52
vdc               0.00     0.00 2676.65  329.74    28.82     7.24    24.56     0.86    0.28    0.29    0.25   0.24  71.22
vdc               0.00     0.00 1788.40 4506.80    18.77    56.66    24.54    23.22    3.69    0.45    4.97   0.12  74.88
vdc               0.00     0.00 1850.40  319.60    19.76     7.02    25.28     0.80    0.37    0.37    0.34   0.30  64.80

Its obvious where the fsyncs are hitting, but they are making
almost no impact on the read performance. The benchmark is
simple not generating enough dirty data to overload the IO
subsystem, and hence there's no latency spikes to speak of.

Benchmark output across fsyncs also demonstrates that:

....
read[12494]: avg: 0.4 msec; max: 9.8 msec
read[12499]: avg: 0.4 msec; max: 9.8 msec
read[12495]: avg: 0.4 msec; max: 9.8 msec
commit[12491]: avg: 0.0 msec; max: 6.3 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
wal[12491]: avg: 0.0 msec; max: 0.0 msec
starting fsync() of files
finished fsync() of files
wal[12491]: avg: 0.0 msec; max: 0.7 msec
commit[12491]: avg: 0.1 msec; max: 15.8 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
wal[12491]: avg: 0.0 msec; max: 0.1 msec
read[12492]: avg: 0.6 msec; max: 10.0 msec
read[12496]: avg: 0.6 msec; max: 10.0 msec
read[12507]: avg: 0.6 msec; max: 10.0 msec
read[12505]: avg: 0.6 msec; max: 10.0 msec
....

So, I though switching to spinning disks might show the problem.
Same VM, this time using a 17TB linearly preallocated image file on
an 18TB XFS filesystem on the host (virtio in the guest again) on a
12 disk RAID-0 (DM again) using a 12x2TB SAS drives exported as JBOD
from a RAID controller with 512MB of BBWC. This is capable of about
2,000 random read IOPs, and write IOPS is dependent on the BBWC
flushing behaviour (peaks at about 15,000, sustains 1500).

Again, I don't see any bad behaviour:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdd               0.00     0.00  128.00  154.20     1.46     6.02    54.24     1.02    3.62    7.22    0.62   3.17  89.44
vdd               0.00     0.00  133.80  154.40     1.43     6.03    53.04     1.08    3.76    7.42    0.59   3.05  87.84
vdd               0.00     0.00  132.40  751.60     1.36    12.75    32.70     7.26    8.21    7.90    8.27   1.06  93.76
vdd               0.00     0.00  138.60  288.60     1.47     7.62    43.57     2.69    6.30    7.24    5.85   2.15  91.68
vdd               0.00     0.00  141.20  157.00     1.51     6.12    52.36     1.12    3.75    7.26    0.59   3.11  92.64
vdd               0.00     0.00  136.80  154.00     1.52     6.02    53.09     1.04    3.57    6.92    0.60   3.10  90.08
vdd               0.00     0.00  135.00  154.20     1.46     6.02    52.99     1.04    3.60    7.00    0.62   3.03  87.68
vdd               0.00     0.00  147.20  155.80     1.50     6.09    51.26     1.19    3.92    7.45    0.59   3.06  92.72
vdd               0.00     0.00  139.60  154.00     1.50     6.02    52.43     1.03    3.53    6.79    0.58   2.97  87.28
vdd               0.00     0.00  126.80  913.80     1.27    14.81    31.63     7.87    7.56    9.02    7.36   0.89  92.16
vdd               0.00     0.00  148.20  156.80     1.51     6.11    51.15     1.17    3.83    7.25    0.59   3.03  92.32
vdd               0.00     0.00  142.60  155.80     1.49     6.09    52.03     1.12    3.76    7.23    0.60   3.05  90.88
vdd               0.00     0.00  146.00  156.60     1.49     6.12    51.48     1.16    3.84    7.32    0.60   3.03  91.76
vdd               0.00     0.00  148.80  157.20     1.56     6.14    51.54     1.10    3.58    6.78    0.54   2.92  89.36
vdd               0.00     0.00  149.80  156.80     1.52     6.12    51.05     1.14    3.73    7.10    0.52   2.96  90.88
vdd               0.00     0.00  127.20  910.60     1.29    14.82    31.80     8.17    7.87    8.62    7.77   0.88  91.44
vdd               0.00     0.00  150.80  157.80     1.58     6.15    51.27     1.09    3.53    6.67    0.52   2.91  89.92
vdd               0.00     0.00  153.00  156.60     1.53     6.12    50.56     1.17    3.80    7.07    0.60   2.95  91.20
vdd               0.00     0.00  161.20  157.00     1.54     6.13    49.36     1.27    4.00    7.36    0.55   2.97  94.48
vdd               0.00     0.00  152.10  157.09     1.55     6.14    50.91     1.15    3.73    7.06    0.50   2.98  92.22
vdd               0.00     0.00  154.80  157.80     1.50     6.16    50.18     1.19    3.82    7.14    0.57   2.88  90.00
vdd               0.00     0.00  122.20  922.80     1.24    14.93    31.70     8.30    7.94    8.81    7.83   0.90  93.68
vdd               0.00     0.00  139.80  153.60     1.46     5.98    51.98     1.09    3.74    7.14    0.65   3.02  88.48
vdd               0.00     0.00  134.60  153.60     1.44     6.00    52.87     1.09    3.78    7.38    0.63   3.15  90.88
vdd               0.00     0.00  141.00  153.40     1.55     5.99    52.49     1.05    3.57    6.79    0.60   3.04  89.44
vdd               0.00     0.00  130.20  154.20     1.44     6.02    53.76     1.04    3.65    7.28    0.59   3.14  89.44
vdd               0.00     0.00  136.60  154.20     1.30     6.02    51.52     1.19    4.07    8.02    0.58   3.03  88.24
vdd               0.00     0.00  119.00  872.20     1.29    14.29    32.20     7.98    8.05    8.59    7.98   0.93  92.64
vdd               0.00     0.00  139.00  154.00     1.58     6.00    53.01     1.06    3.62    6.95    0.61   3.12  91.52
vdd               0.00     0.00  146.00  155.40     1.52     6.07    51.60     1.12    3.71    7.05    0.57   3.01  90.72
vdd               0.00     0.00  145.20  156.60     1.55     6.12    52.01     1.13    3.73    7.10    0.60   3.07  92.80
vdd               0.00     0.00  147.80  156.80     1.51     6.12    51.36     1.13    3.72    7.07    0.56   2.97  90.56
vdd               0.00     0.00  142.80  157.00     1.48     6.12    51.98     1.11    3.70    7.20    0.52   3.02  90.56
vdd               0.00     0.00  125.20  922.00     1.36    14.92    31.85     8.34    7.96    8.47    7.90   0.88  92.56

Again, you can see exactly where the fsyncs are hitting, and again
they are not massive spikes of write IO. The read latencies show:

read[12711]: avg: 6.1 msec; max: 182.2 msec
...
read[12711]: avg: 6.0 msec; max: 198.8 msec
....
read[12711]: avg: 5.9 msec; max: 158.8 msec
....
read[12711]: avg: 6.1 msec; max: 127.5 msec
....
read[12711]: avg: 5.9 msec; max: 355.6 msec
....
read[12711]: avg: 6.0 msec; max: 262.8 msec
....

No significant outliers. The same for commits and wal:

wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 1.0 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
starting fsync() of files
finished fsync() of files
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.1 msec; max: 129.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 0.9 msec
wal[12701]: avg: 0.0 msec; max: 0.2 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
wal[12701]: avg: 0.0 msec; max: 0.1 msec
commit[12701]: avg: 0.0 msec; max: 23.1 msec

I'm not sure how you were generating the behaviour you reported, but
the test program as it stands does not appear to be causing any
problems at all on the sort of storage I'd expect large databases to
be hosted on....

I've tried a few tweaks to the test program, but I haven't been able
to make it misbehave. What do I need to tweak in the test program or
my test VM to make the kernel misbehave like you reported?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Postgresql performance problems with IO latency, especially during fsync()
  2014-04-09  9:20   ` Dave Chinner
@ 2014-04-12 13:24     ` Andres Freund
  -1 siblings, 0 replies; 31+ messages in thread
From: Andres Freund @ 2014-04-12 13:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-mm, linux-fsdevel, linux-kernel, lsf, Wu Fengguang, rhaas

Hi Dave,

On 2014-04-09 19:20:09 +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> > 
> > A concurrent iostat shows the problem pretty clearly:
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
> > sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
> > sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
> > sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
> > sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
> > sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
> > sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
> > sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20
> > 
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the disk
> > is increased by two orders of magnitude. Unsurprisingly with disastrous
> > consequences for read() performance. We really want a way to pace the
> > writes issued to the disk more regularly.

> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage.  The host is using noop IO scheduling.

> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM.  The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC.  The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.

I think the 1GB FBWC explains the behaviour - IIRC the test as written
flushes about 400-500MB during fsync(). If the writeback cache can just
absorb that and continue as if nothing happened, you'll see no problem.
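
To put rough numbers on that (the per-IO size is my assumption, the
rest are the figures quoted above):

    400-500MB flushed per fsync()      -> fits entirely in the 1GB FBWC
    ~40,000 write IOPS at ~8KB each    -> very roughly 300MB/s behind it

so the cache can swallow each checkpoint whole and empty itself again
before the next one arrives - the reads never have to queue behind a
writeback storm.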

> I'm not sure how you were generating the behaviour you reported, but
> the test program as it stands does not appear to be causing any
> problems at all on the sort of storage I'd expect large databases to
> be hosted on....

Since I had developed it while at LSF/MM I had little choice but to run
it only on my laptop. You might remember the speed of the conference
network ;)

> I've tried a few tweaks to the test program, but I haven't been able
> to make it misbehave. What do I need to tweak in the test program or
> my test VM to make the kernel misbehave like you reported?

I think there are two tweaks that would be worthwhile to try to
reproduce the problem there (see the sketch below for where they sit
in the test program):
* replace "++writes % 100000" with something like "++writes %
  500000". That should create well over 1GB of dirty memory to be
  flushed out at the later fsync(), which should then blow through
  your amount of WC cache.
* replace the "nsleep(200000);" with something smaller. I guess 70000
  or so might also trigger the problem on its own.
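
For context, here is a rough sketch of the kind of writer loop those
two constants live in. This is not the actual ioperf.c - the block
size, file size, descriptor handling and the nsleep() helper below are
just stand-ins for whatever the real test does:

#define _XOPEN_SOURCE 700       /* for pwrite(), nanosleep(), random() */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCKSZ  8192                  /* assumed write granularity */
#define NBLOCKS  (1024 * 1024)         /* assumed data file size in blocks */

/* assumed helper: pace the writer by sleeping for 'ns' nanoseconds */
static void nsleep(long ns)
{
        struct timespec ts = { .tv_sec = 0, .tv_nsec = ns };
        nanosleep(&ts, NULL);
}

/*
 * Dirty random blocks of the data file, periodically flushing them.
 * The real test does the fsync() of the data files from a separate
 * process; it is collapsed into one loop here only to show where the
 * two constants sit.
 */
static void writer_loop(int data_fd)
{
        char buf[BLOCKSZ] = { 0 };
        uint64_t writes = 0;

        for (;;) {
                /* dirty a random block of the data file */
                off_t off = (off_t)(random() % NBLOCKS) * BLOCKSZ;
                pwrite(data_fd, buf, BLOCKSZ, off);

                /*
                 * Tweak 1: checkpoint less often, so several times
                 * more dirty data piles up before the flush.
                 * Original: ++writes % 100000
                 */
                if (++writes % 500000 == 0)
                        fsync(data_fd);

                /*
                 * Tweak 2: shorten the pause between writes so dirty
                 * data accumulates faster.  Original: nsleep(200000);
                 */
                nsleep(70000);
        }
}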

Unfortunately right now I don't have any free rig with decent storage
available...

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-04-09  9:20   ` Dave Chinner
@ 2014-04-28 23:47     ` Dave Chinner
  -1 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2014-04-28 23:47 UTC (permalink / raw)
  To: Andres Freund
  Cc: andres, rhaas, linux-kernel, lsf, linux-mm, linux-fsdevel, Wu Fengguang

ping?

On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > Hi,
> > 
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > top#1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together, that seems to show the
> > problem nicely.
> > 
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> > 
> > The tools output looks like this:
> > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> > ...
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > commit[12155]: avg: 0.2 msec; max: 15.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > read[12157]: avg: 0.2 msec; max: 9.4 msec
> > ...
> > read[12165]: avg: 0.2 msec; max: 9.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > starting fsync() of files
> > finished fsync() of files
> > read[12162]: avg: 0.6 msec; max: 2765.5 msec
> > 
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> > 
> > A concurrent iostat shows the problem pretty clearly:
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s	avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               1.00     0.00 6322.00  337.00    51.73     4.38	17.26     2.09    0.32    0.19    2.59   0.14  90.00
> > sda               0.00     0.00 6016.00  303.00    47.18     3.95	16.57     2.30    0.36    0.23    3.12   0.15  94.40
> > sda               0.00     0.00 6236.00 1059.00    49.52    12.88	17.52     5.91    0.64    0.20    3.23   0.12  88.40
> > sda               0.00     0.00  105.00 26173.00     0.89   311.39	24.34   142.37    5.42   27.73    5.33   0.04 100.00
> > sda               0.00     0.00   78.00 27199.00     0.87   324.06	24.40   142.30    5.25   11.08    5.23   0.04 100.00
> > sda               0.00     0.00   10.00 33488.00     0.11   399.05	24.40   136.41    4.07  100.40    4.04   0.03 100.00
> > sda               0.00     0.00 3819.00 10096.00    31.14   120.47	22.31    42.80    3.10    0.32    4.15   0.07  96.00
> > sda               0.00     0.00 6482.00  346.00    52.98     4.53	17.25     1.93    0.28    0.20    1.80   0.14  93.20
> > 
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the disk
> > is increased by two orders of magnitude. Unsurprisingly with disastrous
> > consequences for read() performance. We really want a way to pace the
> > writes issued to the disk more regularly.
> 
> Hi Andres,
> 
> I've finally dug myself out from under the backlog from LSFMM far
> enough to start testing this on my local IO performance test rig.
> 
> tl;dr: I can't reproduce this peaky behaviour on my test rig.
> 
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage.  The host is using noop IO scheduling.
> 
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM.  The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC.  The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
> 
> The iostat output looks like this:
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vdc               0.00     0.00 1817.00  315.40    18.80     6.93    24.71     0.80    0.38    0.38    0.37   0.31  66.24
> vdc               0.00     0.00 2094.20  323.20    21.82     7.10    24.50     0.81    0.33    0.34    0.27   0.28  68.48
> vdc               0.00     0.00 1989.00 4500.20    20.50    56.64    24.34    24.82    3.82    0.43    5.32   0.12  80.16
> vdc               0.00     0.00 2019.80  320.80    20.83     7.05    24.39     0.83    0.35    0.36    0.32   0.29  69.04
> vdc               0.00     0.00 2206.60  323.20    22.57     7.10    24.02     0.87    0.34    0.34    0.33   0.28  71.92
> vdc               0.00     0.00 2437.20  329.60    25.79     7.24    24.45     0.83    0.30    0.30    0.27   0.26  71.76
> vdc               0.00     0.00 1224.40 11263.80    12.88   136.38    24.48    64.90    5.20    0.69    5.69   0.07  84.96
> vdc               0.00     0.00 2074.60  319.40    21.03     7.01    23.99     0.84    0.35    0.36    0.30   0.29  68.96
> vdc               0.00     0.00 1999.00  881.60    20.81    13.61    24.47     4.23    1.43    0.35    3.88   0.23  67.36
> vdc               0.00     0.00 1193.40 2273.00    12.42    29.68    24.87    11.70    3.40    0.53    4.91   0.18  60.96
> vdc               0.00     0.00 1724.00  314.40    17.80     6.91    24.83     0.74    0.36    0.36    0.35   0.30  61.52
> vdc               0.00     0.00 2605.20  325.60    24.67     7.15    22.24     0.90    0.31    0.31    0.25   0.25  72.72
> vdc               0.00     0.00 2340.60  324.40    23.85     7.12    23.80     0.83    0.31    0.31    0.29   0.26  68.96
> vdc               0.00     0.00 2749.60  329.40    28.51     7.24    23.78     0.90    0.29    0.30    0.25   0.24  75.04
> vdc               0.00     0.00 2619.60  324.60    27.72     7.13    24.24     0.88    0.30    0.30    0.28   0.24  71.76
> vdc               0.00     0.00 1608.60 4532.40    17.17    56.40    24.54    25.78    4.20    0.50    5.51   0.12  75.36
> vdc               0.00     0.00 2361.00  326.40    23.62     7.17    23.47     0.87    0.33    0.33    0.30   0.26  69.92
> vdc               0.00     0.00 2460.00  326.00    25.89     7.16    24.30     0.88    0.32    0.32    0.26   0.26  72.72
> vdc               0.00     0.00 2519.00  325.40    25.96     7.14    23.83     0.90    0.32    0.32    0.30   0.26  74.32
> vdc               0.00     0.00 2709.80  326.20    28.91     7.17    24.34     0.94    0.31    0.30    0.36   0.25  75.52
> vdc               0.00     0.00 2676.65  329.74    28.82     7.24    24.56     0.86    0.28    0.29    0.25   0.24  71.22
> vdc               0.00     0.00 1788.40 4506.80    18.77    56.66    24.54    23.22    3.69    0.45    4.97   0.12  74.88
> vdc               0.00     0.00 1850.40  319.60    19.76     7.02    25.28     0.80    0.37    0.37    0.34   0.30  64.80
> 
> It's obvious where the fsyncs are hitting, but they are making
> almost no impact on the read performance. The benchmark is
> simply not generating enough dirty data to overload the IO
> subsystem, and hence there are no latency spikes to speak of.
> 
> Benchmark output across fsyncs also demonstrates that:
> 
> ....
> read[12494]: avg: 0.4 msec; max: 9.8 msec
> read[12499]: avg: 0.4 msec; max: 9.8 msec
> read[12495]: avg: 0.4 msec; max: 9.8 msec
> commit[12491]: avg: 0.0 msec; max: 6.3 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> wal[12491]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> wal[12491]: avg: 0.0 msec; max: 0.7 msec
> commit[12491]: avg: 0.1 msec; max: 15.8 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> wal[12491]: avg: 0.0 msec; max: 0.1 msec
> read[12492]: avg: 0.6 msec; max: 10.0 msec
> read[12496]: avg: 0.6 msec; max: 10.0 msec
> read[12507]: avg: 0.6 msec; max: 10.0 msec
> read[12505]: avg: 0.6 msec; max: 10.0 msec
> ....
> 
> So, I thought switching to spinning disks might show the problem.
> Same VM, this time using a 17TB linearly preallocated image file on
> an 18TB XFS filesystem on the host (virtio in the guest again) on a
> 12 disk RAID-0 (DM again) built from 12x2TB SAS drives exported as
> JBOD from a RAID controller with 512MB of BBWC. This is capable of
> about 2,000 random read IOPS, and write IOPS is dependent on the BBWC
> flushing behaviour (peaks at about 15,000, sustains about 1,500).
> 
> Again, I don't see any bad behaviour:
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vdd               0.00     0.00  128.00  154.20     1.46     6.02    54.24     1.02    3.62    7.22    0.62   3.17  89.44
> vdd               0.00     0.00  133.80  154.40     1.43     6.03    53.04     1.08    3.76    7.42    0.59   3.05  87.84
> vdd               0.00     0.00  132.40  751.60     1.36    12.75    32.70     7.26    8.21    7.90    8.27   1.06  93.76
> vdd               0.00     0.00  138.60  288.60     1.47     7.62    43.57     2.69    6.30    7.24    5.85   2.15  91.68
> vdd               0.00     0.00  141.20  157.00     1.51     6.12    52.36     1.12    3.75    7.26    0.59   3.11  92.64
> vdd               0.00     0.00  136.80  154.00     1.52     6.02    53.09     1.04    3.57    6.92    0.60   3.10  90.08
> vdd               0.00     0.00  135.00  154.20     1.46     6.02    52.99     1.04    3.60    7.00    0.62   3.03  87.68
> vdd               0.00     0.00  147.20  155.80     1.50     6.09    51.26     1.19    3.92    7.45    0.59   3.06  92.72
> vdd               0.00     0.00  139.60  154.00     1.50     6.02    52.43     1.03    3.53    6.79    0.58   2.97  87.28
> vdd               0.00     0.00  126.80  913.80     1.27    14.81    31.63     7.87    7.56    9.02    7.36   0.89  92.16
> vdd               0.00     0.00  148.20  156.80     1.51     6.11    51.15     1.17    3.83    7.25    0.59   3.03  92.32
> vdd               0.00     0.00  142.60  155.80     1.49     6.09    52.03     1.12    3.76    7.23    0.60   3.05  90.88
> vdd               0.00     0.00  146.00  156.60     1.49     6.12    51.48     1.16    3.84    7.32    0.60   3.03  91.76
> vdd               0.00     0.00  148.80  157.20     1.56     6.14    51.54     1.10    3.58    6.78    0.54   2.92  89.36
> vdd               0.00     0.00  149.80  156.80     1.52     6.12    51.05     1.14    3.73    7.10    0.52   2.96  90.88
> vdd               0.00     0.00  127.20  910.60     1.29    14.82    31.80     8.17    7.87    8.62    7.77   0.88  91.44
> vdd               0.00     0.00  150.80  157.80     1.58     6.15    51.27     1.09    3.53    6.67    0.52   2.91  89.92
> vdd               0.00     0.00  153.00  156.60     1.53     6.12    50.56     1.17    3.80    7.07    0.60   2.95  91.20
> vdd               0.00     0.00  161.20  157.00     1.54     6.13    49.36     1.27    4.00    7.36    0.55   2.97  94.48
> vdd               0.00     0.00  152.10  157.09     1.55     6.14    50.91     1.15    3.73    7.06    0.50   2.98  92.22
> vdd               0.00     0.00  154.80  157.80     1.50     6.16    50.18     1.19    3.82    7.14    0.57   2.88  90.00
> vdd               0.00     0.00  122.20  922.80     1.24    14.93    31.70     8.30    7.94    8.81    7.83   0.90  93.68
> vdd               0.00     0.00  139.80  153.60     1.46     5.98    51.98     1.09    3.74    7.14    0.65   3.02  88.48
> vdd               0.00     0.00  134.60  153.60     1.44     6.00    52.87     1.09    3.78    7.38    0.63   3.15  90.88
> vdd               0.00     0.00  141.00  153.40     1.55     5.99    52.49     1.05    3.57    6.79    0.60   3.04  89.44
> vdd               0.00     0.00  130.20  154.20     1.44     6.02    53.76     1.04    3.65    7.28    0.59   3.14  89.44
> vdd               0.00     0.00  136.60  154.20     1.30     6.02    51.52     1.19    4.07    8.02    0.58   3.03  88.24
> vdd               0.00     0.00  119.00  872.20     1.29    14.29    32.20     7.98    8.05    8.59    7.98   0.93  92.64
> vdd               0.00     0.00  139.00  154.00     1.58     6.00    53.01     1.06    3.62    6.95    0.61   3.12  91.52
> vdd               0.00     0.00  146.00  155.40     1.52     6.07    51.60     1.12    3.71    7.05    0.57   3.01  90.72
> vdd               0.00     0.00  145.20  156.60     1.55     6.12    52.01     1.13    3.73    7.10    0.60   3.07  92.80
> vdd               0.00     0.00  147.80  156.80     1.51     6.12    51.36     1.13    3.72    7.07    0.56   2.97  90.56
> vdd               0.00     0.00  142.80  157.00     1.48     6.12    51.98     1.11    3.70    7.20    0.52   3.02  90.56
> vdd               0.00     0.00  125.20  922.00     1.36    14.92    31.85     8.34    7.96    8.47    7.90   0.88  92.56
> 
> Again, you can see exactly where the fsyncs are hitting, and again
> they are not massive spikes of write IO. The read latencies show:
> 
> read[12711]: avg: 6.1 msec; max: 182.2 msec
> ...
> read[12711]: avg: 6.0 msec; max: 198.8 msec
> ....
> read[12711]: avg: 5.9 msec; max: 158.8 msec
> ....
> read[12711]: avg: 6.1 msec; max: 127.5 msec
> ....
> read[12711]: avg: 5.9 msec; max: 355.6 msec
> ....
> read[12711]: avg: 6.0 msec; max: 262.8 msec
> ....
> 
> No significant outliers. The same for commits and wal:
> 
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 1.0 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> starting fsync() of files
> finished fsync() of files
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.1 msec; max: 129.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 0.9 msec
> wal[12701]: avg: 0.0 msec; max: 0.2 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> wal[12701]: avg: 0.0 msec; max: 0.1 msec
> commit[12701]: avg: 0.0 msec; max: 23.1 msec
> 
> I'm not sure how you were generating the behaviour you reported, but
> the test program as it stands does not appear to be causing any
> problems at all on the sort of storage I'd expect large databases to
> be hosted on....
> 
> I've tried a few tweaks to the test program, but I haven't been able
> to make it misbehave. What do I need to tweak in the test program or
> my test VM to make the kernel misbehave like you reported?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-04-28 23:47     ` Dave Chinner
@ 2014-04-28 23:57       ` Andres Freund
  -1 siblings, 0 replies; 31+ messages in thread
From: Andres Freund @ 2014-04-28 23:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: rhaas, linux-kernel, lsf, linux-mm, linux-fsdevel, Wu Fengguang

Hi Dave,

On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> ping?

I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

As an additional note:

> On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > I'm not sure how you were generating the behaviour you reported, but
> > the test program as it stands does not appear to be causing any
> > problems at all on the sort of storage I'd expect large databases to
> > be hosted on....

A really, really large number of databases aren't stored on big
enterprise rigs...

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-04-28 23:57       ` Andres Freund
@ 2014-05-23  6:42         ` Dave Chinner
  -1 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2014-05-23  6:42 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-kernel, lsf, linux-mm, rhaas, linux-fsdevel, Wu Fengguang

On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> Hi Dave,
> 
> On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > ping?
> 
> I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

I missed it, sorry.

I've had a bit more time to look at this behaviour now and tweaked
it as you suggested, but I simply can't get XFS to misbehave in the
manner you demonstrated. However, I can reproduce major read latency
changes and writeback flush storms with ext4.  I originally only
tested on XFS. I'm using the no-op IO scheduler everywhere, too.

I ran the tweaked version I have for a couple of hours on XFS, and
only saw a handful of abnormal writeback events where the write IOPS
spiked above the normal periodic peaks by enough to cause any
noticeable increase in read latency. Even then the maximums were
in the 40ms range, nothing much higher.

ext4, OTOH, generated a much, much higher periodic write IO load and
regularly caused read IO latencies in the hundreds of milliseconds.
Every so often this occurred on ext4 (5s sample rate):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     3.00 3142.20  219.20    34.11    19.10    32.42     1.11    0.33    0.33    0.31   0.27  91.92
vdc               0.00     0.80 3311.60  216.20    35.86    18.90    31.79     1.17    0.33    0.33    0.39   0.26  92.56
vdc               0.00     0.80 2919.80 2750.60    31.67    48.36    28.90    20.05    3.50    0.36    6.83   0.16  92.96
vdc               0.00     0.80  435.00 15689.80     4.96   198.10    25.79   113.21    7.03    2.32    7.16   0.06  99.20
vdc               0.00     0.80 2683.80  216.20    29.72    18.98    34.39     1.13    0.39    0.39    0.40   0.32  91.92
vdc               0.00     0.80 2853.00  218.20    31.29    19.06    33.57     1.14    0.37    0.37    0.36   0.30  92.56

Which is, I think, a sign of what you'd been trying to demonstrate -
a major dip in read performance while writeback is flushing.

In comparison, this is from XFS:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     0.00 2416.40  335.00    21.02     7.85    21.49     0.78    0.28    0.30    0.19   0.24  65.28
vdc               0.00     0.00 2575.80  336.00    22.68     7.88    21.49     0.81    0.28    0.29    0.16   0.23  66.32
vdc               0.00     0.00 1740.20 4645.20    15.60    58.22    23.68    21.21    3.32    0.41    4.41   0.11  68.56
vdc               0.00     0.00 2082.80  329.00    18.28     7.71    22.07     0.81    0.34    0.35    0.26   0.28  67.44
vdc               0.00     0.00 2347.80  333.20    19.53     7.80    20.88     0.83    0.31    0.32    0.25   0.25  67.52

You can see how much less load XFS is putting on the storage - it's
only 65-70% utilised compared to the 90-100% load that ext4 is
generating.

What is interesting here is the difference in IO patterns. ext4 is
doing much larger IOs than XFS - its average IO size is 16k, while
XFS's is a bit over 8k. So while the read and background write IOPS
rates are similar, ext4 is moving a lot more data to/from disk in
larger chunks.

This also seems to translate to much larger writeback IO peaks on
ext4.  I have no idea what this means in terms of actual application
throughput, but it looks very much to me like the nasty read
latencies are much more pronounced on ext4 because of the higher
read bandwidths and write IOPS being seen.

The screen shot of the recorded behaviour is attached - the left
hand side is the tail end (~30min) of the 2 hour long XFS run, the
right hand side the first half hour of the ext4 run. The difference
in IO behaviour is quite obvious....

What is interesting is that CPU usage is not very different
between the two filesystems, but IOWait is much, much higher for
ext4. That indicates that ext4 is definitely loading the storage
more, and so is much more likely to see IO-load-related
latencies.

So, having seen the differences in behaviour just from changing
filesystems, I ran the workload on btrfs as well. Ouch - it was
even worse than ext4 in terms of read latencies - they were highly
unpredictable, and massively variable even within a read group:

....
read[11331]: avg: 0.3 msec; max: 7.0 msec
read[11340]: avg: 0.3 msec; max: 7.1 msec
read[11334]: avg: 0.3 msec; max: 7.0 msec
read[11329]: avg: 0.3 msec; max: 7.0 msec
read[11328]: avg: 0.3 msec; max: 7.0 msec
read[11332]: avg: 0.6 msec; max: 4481.2 msec
read[11342]: avg: 0.6 msec; max: 4480.6 msec
read[11332]: avg: 0.0 msec; max: 0.7 msec
read[11342]: avg: 0.0 msec; max: 1.6 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
.....

It was also not uncommon to see major commit latencies:

read[11335]: avg: 0.2 msec; max: 8.3 msec
read[11341]: avg: 0.2 msec; max: 8.5 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.7 msec; max: 5302.3 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.0 msec; max: 6.1 msec
read[11337]: avg: 0.2 msec; max: 6.2 msec

So, it appears to me right now that filesystem choice alone has a
major impact on writeback behaviour and read latencies. Each
of these filesystems implements writeback aggregation differently
(ext4 completely replaces the generic writeback path, XFS optimises
below ->writepage, and btrfs has super magic COW powers).

That means it isn't clear that there's any generic infrastructure
problem here, and it certainly isn't clear that each filesystem has
the same problem or that the issues can be solved by a generic
mechanism. I think you probably need to engage the ext4 developers
directly to understand what ext4 is doing in detail, or work out how
to prod XFS into displaying that extremely bad read latency behaviour....

> As an additional note:
> 
> > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > I'm not sure how you were generating the behaviour you reported, but
> > > the test program as it stands does not appear to be causing any
> > > problems at all on the sort of storage I'd expect large databases to
> > > be hosted on....
> 
> A really really large number of database aren't stored on big enterprise
> rigs...

I'm not using a big enterprise rig. I've reproduced these results on
a low-end Dell server with the internal H710 SAS RAID and a pair of
consumer SSDs in RAID0, as well as via a 4-year-old Perc/6e SAS RAID
HBA with 12 2TB nearline SAS drives in RAID0.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
@ 2014-05-23  6:42         ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2014-05-23  6:42 UTC (permalink / raw)
  To: Andres Freund
  Cc: linux-kernel, lsf, linux-mm, rhaas, linux-fsdevel, Wu Fengguang

On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> Hi Dave,
> 
> On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > ping?
> 
> I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

I missed it, sorry.

I've had a bit more time to look at this behaviour now and tweaked
it as you suggested, but I simply can't get XFS to misbehave in the
manner you demonstrated. However, I can reproduce major read latency
changes and writeback flush storms with ext4.  I originally only
tested on XFS. I'm using the no-op IO scheduler everywhere, too.

I ran the tweaked version I have for a couple of hours on XFS, and
only saw a handful abnormal writeback events where the write IOPS
spiked above the normal periodic peaks and was sufficient to cause
any noticable increase in read latency. Even then the maximums were
in the 40ms range, nothing much higher.

ext4, OTOH, generated a much, much higher periodic write IO load and
it's regularly causing read IO latencies in the hundreds of
milliseconds. Every so often this occurred on ext4 (5s sample rate)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     3.00 3142.20  219.20    34.11    19.10    32.42     1.11    0.33    0.33    0.31   0.27  91.92
vdc               0.00     0.80 3311.60  216.20    35.86    18.90    31.79     1.17    0.33    0.33    0.39   0.26  92.56
vdc               0.00     0.80 2919.80 2750.60    31.67    48.36    28.90    20.05    3.50    0.36    6.83   0.16  92.96
vdc               0.00     0.80  435.00 15689.80     4.96   198.10    25.79   113.21    7.03    2.32    7.16   0.06  99.20
vdc               0.00     0.80 2683.80  216.20    29.72    18.98    34.39     1.13    0.39    0.39    0.40   0.32  91.92
vdc               0.00     0.80 2853.00  218.20    31.29    19.06    33.57     1.14    0.37    0.37    0.36   0.30  92.56

Which is, i think, signs of what you'd been trying to demonstrate -
a major dip in read performance when writeback is flushing.

In comparison, this is from XFS:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     0.00 2416.40  335.00    21.02     7.85    21.49     0.78    0.28    0.30    0.19   0.24  65.28
vdc               0.00     0.00 2575.80  336.00    22.68     7.88    21.49     0.81    0.28    0.29    0.16   0.23  66.32
vdc               0.00     0.00 1740.20 4645.20    15.60    58.22    23.68    21.21    3.32    0.41    4.41   0.11  68.56
vdc               0.00     0.00 2082.80  329.00    18.28     7.71    22.07     0.81    0.34    0.35    0.26   0.28  67.44
vdc               0.00     0.00 2347.80  333.20    19.53     7.80    20.88     0.83    0.31    0.32    0.25   0.25  67.52

You can see how much less load XFS putting on the storage - it's
only 65-70% utilised compared to the 90-100% load that ext4 is
generating.

What is interesting here is the difference in IO patterns. ext4 is
doing much larger IOs than XFS - it's average IO size is 16k, while
XFS's is a bit over 8k. So while the read and background write IOPS
rates are similar, ext4 is moving a lot more data to/from disk in
larger chunks.

This seems also to translate to much larger writeback IO peaks in
ext4.  I have no idea what this means in terms of actual application
throughput, but it looks very much to me like the nasty read
latencies are much more pronounced on ext4 because of the higher
read bandwidths and write IOPS being seen.

The screen shot of the recorded behaviour is attached - the left
hand side is the tail end (~30min) of the 2 hour long XFS run, and
the first half an hour of ext4 running. The difference in IO
behaviour is quite obvious....

What is interesting is that CPU usage is not very much different
between the two filesystems, but IOWait is much, much higher for
ext4. That indicates that ext4 is definitely loading the storage
more, and so is much more likely to have IO-load-related
latencies...
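
As a cross-check that doesn't need a graphing setup, the iowait
fraction can also be sampled by diffing /proc/stat. A rough sketch
only (it ignores the irq/softirq/steal fields, so the percentage is
approximate, and it is not part of the original test program):

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/*
 * Read the first five fields of the aggregate "cpu" line in
 * /proc/stat: user, nice, system, idle, iowait.
 */
static int read_cpu(unsigned long long v[5])
{
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return -1;
        int n = fscanf(f, "cpu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4]);
        fclose(f);
        return n == 5 ? 0 : -1;
}

int main(void)
{
        unsigned long long a[5], b[5];

        for (;;) {
                if (read_cpu(a) < 0)
                        return 1;
                sleep(5);       /* same 5s interval as the iostat samples */
                if (read_cpu(b) < 0)
                        return 1;

                unsigned long long total = 0;
                for (int i = 0; i < 5; i++)
                        total += b[i] - a[i];
                printf("iowait: %.1f%%\n",
                       total ? 100.0 * (b[4] - a[4]) / total : 0.0);
        }
}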

So, seeing the differences in behaviour just by changing
filesystems, I then ran the workload on btrfs. Ouch - it was
even worse than ext4 in terms of read latencies - they were highly
unpredictable, and massively variable even within a read group:

....
read[11331]: avg: 0.3 msec; max: 7.0 msec
read[11340]: avg: 0.3 msec; max: 7.1 msec
read[11334]: avg: 0.3 msec; max: 7.0 msec
read[11329]: avg: 0.3 msec; max: 7.0 msec
read[11328]: avg: 0.3 msec; max: 7.0 msec
read[11332]: avg: 0.6 msec; max: 4481.2 msec
read[11342]: avg: 0.6 msec; max: 4480.6 msec
read[11332]: avg: 0.0 msec; max: 0.7 msec
read[11342]: avg: 0.0 msec; max: 1.6 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
.....

It was also not uncommon to see major commit latencies:

read[11335]: avg: 0.2 msec; max: 8.3 msec
read[11341]: avg: 0.2 msec; max: 8.5 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.7 msec; max: 5302.3 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.0 msec; max: 6.1 msec
read[11337]: avg: 0.2 msec; max: 6.2 msec
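
For anyone reading the archive without the test program from
upthread: the read[...] lines report the avg/max latency of
individual reads per process over each reporting interval (the
wal/commit lines are presumably the analogous write-side timings).
A minimal sketch of that kind of timed-read loop, just to make the
numbers above concrete - illustrative only, not the actual test
program; the file name and sizes are made up, and older glibc needs
-lrt for clock_gettime():

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ   8192
#define NBLOCKS (1024 * 1024)   /* assumes an 8GB data file */

static double now_msec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
        char buf[BLKSZ];
        int fd = open("data.bin", O_RDONLY);    /* hypothetical test file */

        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (;;) {
                double sum = 0, max = 0;
                int n = 1000;

                /* time each random read, report avg/max per interval */
                for (int i = 0; i < n; i++) {
                        off_t off = (off_t)(random() % NBLOCKS) * BLKSZ;
                        double start = now_msec();

                        if (pread(fd, buf, BLKSZ, off) < 0)
                                perror("pread");
                        double d = now_msec() - start;
                        sum += d;
                        if (d > max)
                                max = d;
                }
                printf("read[%d]: avg: %.1f msec; max: %.1f msec\n",
                       (int)getpid(), sum / n, max);
        }
}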

So, what it appears to me right now is that filesystem choice alone
has a major impact on writeback behaviour and read latencies. Each
of these filesystems implements writeback aggregation differently
(ext4 completely replaces the generic writeback path, XFS optimises
below ->writepage, and btrfs has super magic COW powers).

That means it isn't clear that there's any generic infrastructure
problem here, and it certainly isn't clear that each filesystem has
the same problem or the issues can be solved by a generic mechanism.
I think you probably need to engage the ext4 developers directly to
understand what ext4 is doing in detail, or work out how to prod XFS
into displaying that extremely bad read latency behaviour....

> As an additional note:
> 
> > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > I'm not sure how you were generating the behaviour you reported, but
> > > the test program as it stands does not appear to be causing any
> > > problems at all on the sort of storage I'd expect large databases to
> > > be hosted on....
> 
> A really really large number of databases aren't stored on big enterprise
> rigs...

I'm not using a big enterprise rig. I've reproduced these results on
a low end Dell server with the internal H710 SAS RAID and a pair of
consumer SSDs in RAID0, as well as via a 4 year old Perc/6e SAS RAID
HBA with 12 2T nearline SAS drives in RAID0.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
  2014-05-23  6:42         ` Dave Chinner
@ 2014-06-04 20:06           ` Andres Freund
  -1 siblings, 0 replies; 31+ messages in thread
From: Andres Freund @ 2014-06-04 20:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, lsf, linux-mm, linux-fsdevel, Wu Fengguang,
	linux-ext4, Theodore Ts'o

Hi Dave, Ted, All,

On 2014-05-23 16:42:47 +1000, Dave Chinner wrote:
> On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> > Hi Dave,
> > 
> > On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > > ping?
> > 
> > I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2
> 
> I missed it, sorry.

No worries. As you can see, I'm not quick at answering either :/

> I've had a bit more time to look at this behaviour now and tweaked
> it as you suggested, but I simply can't get XFS to misbehave in the
> manner you demonstrated. However, I can reproduce major read latency
> changes and writeback flush storms with ext4.  I originally only
> tested on XFS.

That's interesting. I know that the problem was reproducible on XFS at
some point, but that was on 2.6.18 or so...

I'll check whether I can make it perform badly on the measly hardware
I have available.

> I'm using the no-op IO scheduler everywhere, too.

And I'll check whether it's related to that.

> ext4, OTOH, generated a much, much higher periodic write IO load and
> it's regularly causing read IO latencies in the hundreds of
> milliseconds. Every so often this occurred on ext4 (5s sample rate)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vdc               0.00     3.00 3142.20  219.20    34.11    19.10    32.42     1.11    0.33    0.33    0.31   0.27  91.92
> vdc               0.00     0.80 3311.60  216.20    35.86    18.90    31.79     1.17    0.33    0.33    0.39   0.26  92.56
> vdc               0.00     0.80 2919.80 2750.60    31.67    48.36    28.90    20.05    3.50    0.36    6.83   0.16  92.96
> vdc               0.00     0.80  435.00 15689.80     4.96   198.10    25.79   113.21    7.03    2.32    7.16   0.06  99.20
> vdc               0.00     0.80 2683.80  216.20    29.72    18.98    34.39     1.13    0.39    0.39    0.40   0.32  91.92
> vdc               0.00     0.80 2853.00  218.20    31.29    19.06    33.57     1.14    0.37    0.37    0.36   0.30  92.56
> 
> Which is, I think, a sign of what you'd been trying to demonstrate -
> a major dip in read performance when writeback is flushing.

I've seen *much* worse cases than this, but it's what we're seeing in
production.

> What is interesting here is the difference in IO patterns. ext4 is
> doing much larger IOs than XFS - its average IO size is 16k, while
> XFS's is a bit over 8k. So while the read and background write IOPS
> rates are similar, ext4 is moving a lot more data to/from disk in
> larger chunks.
> 
> This seems also to translate to much larger writeback IO peaks in
> ext4.  I have no idea what this means in terms of actual application
> throughput, but it looks very much to me like the nasty read
> latencies are much more pronounced on ext4 because of the higher
> read bandwidths and write IOPS being seen.

I'll try running a benchmark of actual postgres showing the different
peak/average throughput and latencies.

> So, seeing the differences in behaviour just by changing
> filesystems, I then ran the workload on btrfs. Ouch - it was
> even worse than ext4 in terms of read latencies - they were highly
> unpredictable, and massively variable even within a read group:

I've essentially given up on btrfs for the foreseeable future :(.

> That means it isn't clear that there's any generic infrastructure
> problem here, and it certainly isn't clear that each filesystem has
> the same problem or the issues can be solved by a generic mechanism.
> I think you probably need to engage the ext4 developers directly to
> understand what ext4 is doing in detail, or work out how to prod XFS
> into displaying that extremely bad read latency behaviour....

I've CCed the ext4 list and Ted. Maybe that'll bring some insight...

> > > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > > I'm not sure how you were generating the behaviour you reported, but
> > > > the test program as it stands does not appear to be causing any
> > > > problems at all on the sort of storage I'd expect large databases to
> > > > be hosted on....
> > 
> > A really really large number of databases aren't stored on big enterprise
> > rigs...
> 
> I'm not using a big enterprise rig. I've reproduced these results on
> a low end Dell server with the internal H710 SAS RAID and a pair of
> consumer SSDs in RAID0, as well as via a 4 year old Perc/6e SAS RAID
> HBA with 12 2T nearline SAS drives in RAID0.

There are a *lot* of busy postgres installations out there running on
a single disk of spinning rust. Hopefully replicating to another piece
of spinning rust... In comparison to that, that's enterprise hardware ;)

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 31+ messages in thread


end of thread, other threads:[~2014-06-04 20:06 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-26 19:11 Postgresql performance problems with IO latency, especially during fsync() Andres Freund
2014-03-26 21:41 ` [Lsf] " Andy Lutomirski
2014-03-26 21:41   ` Andy Lutomirski
2014-03-26 21:55   ` Andres Freund
2014-03-26 21:55     ` Andres Freund
2014-03-26 22:26     ` Andy Lutomirski
2014-03-26 22:26       ` Andy Lutomirski
2014-03-26 22:35       ` David Lang
2014-03-26 22:35         ` David Lang
2014-03-26 23:11         ` Andy Lutomirski
2014-03-26 23:11           ` Andy Lutomirski
2014-03-26 23:28           ` Andy Lutomirski
2014-03-26 23:28             ` Andy Lutomirski
2014-03-27 15:50     ` Jan Kara
2014-03-27 15:50       ` Jan Kara
2014-03-27 18:10       ` Fernando Luis Vazquez Cao
2014-03-27 18:10         ` Fernando Luis Vazquez Cao
2014-03-27 15:52 ` Jan Kara
2014-03-27 15:52   ` Jan Kara
2014-04-09  9:20 ` Dave Chinner
2014-04-09  9:20   ` Dave Chinner
2014-04-12 13:24   ` Andres Freund
2014-04-12 13:24     ` Andres Freund
2014-04-28 23:47   ` [Lsf] " Dave Chinner
2014-04-28 23:47     ` Dave Chinner
2014-04-28 23:57     ` Andres Freund
2014-04-28 23:57       ` Andres Freund
2014-05-23  6:42       ` Dave Chinner
2014-05-23  6:42         ` Dave Chinner
2014-06-04 20:06         ` Andres Freund
2014-06-04 20:06           ` Andres Freund
