linux-kernel.vger.kernel.org archive mirror
* [PATCH] Memory management livelock
       [not found] <20080911101616.GA24064@agk.fab.redhat.com>
@ 2008-09-22 21:10 ` Mikulas Patocka
  2008-09-23  0:48   ` Andrew Morton
  2008-09-23 22:34   ` Mikulas Patocka
  0 siblings, 2 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-22 21:10 UTC (permalink / raw)
  To: Chris Webb, linux-kernel, linux-mm; +Cc: Alasdair G Kergon, Milan Broz

Hi

Here is a patch for the MM livelock. The original bug report follows after 
the patch. To reproduce the bug on my computer, I had to change bs=4M to 
bs=65536 in the examples from the original report.

Mikulas

---

Fix starvation in memory management.

The bug happens when one process is doing sequential buffered writes to
a block device (or file) and another process is attempting to execute
sync(), fsync() or direct-IO on that device (or file). This syncing
process will wait indefinitely, until the first writing process
finishes.

For example, run these two commands:
dd if=/dev/zero of=/dev/sda1 bs=65536 &
dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

The bug is caused by the sequential walking of the address space in
write_cache_pages and wait_on_page_writeback_range: if some other
process is constantly making dirty and writeback pages while these
functions run, the functions will wait on every new page, resulting in
an indefinite wait.

To fix the starvation, I declared a mutex starvation_barrier in struct
address_space. When the mutex is taken, anyone making dirty pages on
that address space should stop. The functions that walk the address
space sequentially first estimate the number of pages to process. If
they process more than this amount (plus some small delta), it means
that someone is making dirty pages under them --- so they take the
mutex and hold it until they finish. While the mutex is held, the
function balance_dirty_pages will wait and not allow more dirty pages
to be created.
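
In code form, the pattern looks roughly like this (a condensed sketch of
the patch below; estimate_pages() is a stand-in for the bdi_stat()-based
estimates used in the real code):

	/* walker side (write_cache_pages and friends) */
	long pages_to_process = estimate_pages(mapping);	/* hypothetical helper */
	pages_to_process += pages_to_process >> 3;		/* small delta */

	/* ... for each page found in the walked range ... */
	if (pages_to_process >= 0)
		if (!pages_to_process--)	/* we walked past the estimate */
			mutex_lock(&mapping->starvation_barrier);

	/* ... after the walk ... */
	if (pages_to_process < 0)		/* we took the barrier */
		mutex_unlock(&mapping->starvation_barrier);

	/* writer side (balance_dirty_pages): stall while the barrier is held */
	if (unlikely(mutex_is_locked(&mapping->starvation_barrier))) {
		mutex_lock(&mapping->starvation_barrier);
		mutex_unlock(&mapping->starvation_barrier);
	}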

The mutex is not really used as a mutex; it does not protect any
critical section. It is used as a barrier that blocks new dirty
pages from being created. I use a mutex rather than a wait queue to make
sure that the starvation can't happen the other way around (with a wait
queue, excessive calls to write_cache_pages and
wait_on_page_writeback_range could block balance_dirty_pages forever;
with a mutex it is guaranteed that every process eventually makes
progress).

The essential property of this patch is that if the starvation doesn't
happen, no additional locks are taken and no atomic operations are
performed. So the patch shouldn't damage performance.

The indefinite starvation was observed in write_cache_pages and
wait_on_page_writeback_range. Another place where it could happen
is invalidate_inode_pages2_range.

There are two more functions that walk all the pages in an address
space, but I think they don't need this starvation protection:
truncate_inode_pages_range: it is called with i_mutex held, so no new
pages are created under it.
__invalidate_mapping_pages: it skips locked, dirty and writeback pages,
so there's no point in protecting the function against starving on them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/inode.c          |    1 +
 include/linux/fs.h  |    1 +
 mm/filemap.c        |   16 ++++++++++++++++
 mm/page-writeback.c |   30 +++++++++++++++++++++++++++++-
 mm/swap_state.c     |    1 +
 mm/truncate.c       |   20 +++++++++++++++++++-
 6 files changed, 67 insertions(+), 2 deletions(-)

Index: linux-2.6.27-rc7-devel/fs/inode.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/fs/inode.c	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/fs/inode.c	2008-09-22 23:04:34.000000000 +0200
@@ -214,6 +214,7 @@ void inode_init_once(struct inode *inode
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
+	mutex_init(&inode->i_data.starvation_barrier);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
 	i_size_ordered_init(inode);
Index: linux-2.6.27-rc7-devel/include/linux/fs.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/fs.h	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/fs.h	2008-09-22 23:04:34.000000000 +0200
@@ -539,6 +539,7 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+	struct mutex		starvation_barrier;	/* taken when fsync starves */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-22 23:04:34.000000000 +0200
@@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct 
 	int nr_pages;
 	int ret = 0;
 	pgoff_t index;
+	long pages_to_process;
 
 	if (end < start)
 		return 0;
 
+	/*
+	 * Estimate the number of pages to process. If we process significantly
+	 * more than this, someone is making writeback pages under us.
+	 * We must pull the anti-starvation plug.
+	 */
+	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	pages_to_process += pages_to_process >> 3;
+
 	pagevec_init(&pvec, 0);
 	index = start;
 	while ((index <= end) &&
@@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct 
 			if (page->index > end)
 				continue;
 
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					mutex_lock(&mapping->starvation_barrier);
+
 			wait_on_page_writeback(page);
 			if (PageError(page))
 				ret = -EIO;
@@ -296,6 +309,9 @@ int wait_on_page_writeback_range(struct 
 		cond_resched();
 	}
 
+	if (pages_to_process < 0)
+		mutex_unlock(&mapping->starvation_barrier);
+
 	/* Check for outstanding write errors */
 	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
 		ret = -ENOSPC;
Index: linux-2.6.27-rc7-devel/mm/page-writeback.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/page-writeback.c	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/page-writeback.c	2008-09-22 23:04:34.000000000 +0200
@@ -435,6 +435,15 @@ static void balance_dirty_pages(struct a
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	/*
+	 * If there is sync() starving on this address space, block
+	 * writers until it finishes.
+	 */
+	if (unlikely(mutex_is_locked(&mapping->starvation_barrier))) {
+		mutex_lock(&mapping->starvation_barrier);
+		mutex_unlock(&mapping->starvation_barrier);
+	}
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.bdi		= bdi,
@@ -876,12 +885,21 @@ int write_cache_pages(struct address_spa
 	pgoff_t end;		/* Inclusive */
 	int scanned = 0;
 	int range_whole = 0;
+	long pages_to_process;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
 		return 0;
 	}
 
+	/*
+	 * Estimate the number of pages to process. If we process significantly
+	 * more than this, someone is making dirty pages under us.
+	 * Pull the anti-starvation plug to stop him.
+	 */
+	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+	pages_to_process += pages_to_process >> 3;
+
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
@@ -902,7 +920,13 @@ retry:
 
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
+
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					mutex_lock(&mapping->starvation_barrier);
+
+			page = pvec.pages[i];
 
 			/*
 			 * At this point we hold neither mapping->tree_lock nor
@@ -958,6 +982,10 @@ retry:
 		index = 0;
 		goto retry;
 	}
+
+	if (pages_to_process < 0)
+		mutex_unlock(&mapping->starvation_barrier);
+
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = index;
 
Index: linux-2.6.27-rc7-devel/mm/swap_state.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/swap_state.c	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/swap_state.c	2008-09-22 23:04:34.000000000 +0200
@@ -43,6 +43,7 @@ struct address_space swapper_space = {
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
+	.starvation_barrier = __MUTEX_INITIALIZER(swapper_space.starvation_barrier),
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
Index: linux-2.6.27-rc7-devel/mm/truncate.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/truncate.c	2008-09-22 23:04:31.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/truncate.c	2008-09-22 23:04:34.000000000 +0200
@@ -392,6 +392,14 @@ int invalidate_inode_pages2_range(struct
 	int ret2 = 0;
 	int did_range_unmap = 0;
 	int wrapped = 0;
+	long pages_to_process;
+
+	/*
+	 * Estimate number of pages to process. If we process more, someone
+	 * is making pages under us.
+	 */
+	pages_to_process = mapping->nrpages;
+	pages_to_process += pages_to_process >> 3;
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -399,9 +407,15 @@ int invalidate_inode_pages2_range(struct
 		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
 			pgoff_t page_index;
 
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					mutex_lock(&mapping->starvation_barrier);
+
+			page = pvec.pages[i];
+
 			lock_page(page);
 			if (page->mapping != mapping) {
 				unlock_page(page);
@@ -449,6 +463,10 @@ int invalidate_inode_pages2_range(struct
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	if (pages_to_process < 0)
+		mutex_unlock(&mapping->starvation_barrier);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

> ----- Forwarded message from Chris Webb <chris@arachsys.com> -----
> 
> Date: Thu, 11 Sep 2008 10:47:36 +0100
> From: Chris Webb <chris@arachsys.com>
> Subject: Unexpected behaviour with O_DIRECT affecting lvm
> 
> To reproduce:
> 
>   # dd if=/dev/zero of=/dev/scratch/a bs=4M &
>   [1] 2612
>   # lvm lvrename /dev/scratch/b /dev/scratch/c
>   [hangs]
> 
> (Similarly for lvremove, lvcreate, etc., and it doesn't matter whether the
> operation is on the same VG or a different one.)
> 
> Stracing lvm, I saw
> 
>   stat64("/dev/dm-7", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 7), ...}) = 0
>   open("/dev/dm-7", O_RDWR|O_DIRECT|O_LARGEFILE|O_NOATIME) = 6
>   fstat64(6, {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 7), ...}) = 0
>   ioctl(6, BLKBSZGET, 0x89bbb30)          = 0
>   _llseek(6, 0, [0], SEEK_SET)            = 0
>   read(6,
>   [hangs]
> 
> If I kill the dd at this point, the lvm unblocks and continues as normal:
> 
>   read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
>   close(6)                                = 0
>   stat64("/dev/dm-8", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 8), ...}) = 0
>   stat64("/dev/dm-8", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 8), ...}) = 0
>   open("/dev/dm-8", O_RDWR|O_DIRECT|O_LARGEFILE|O_NOATIME) = 6
>   [etc]
> 
> /dev/dm-7 is the dm device corresponding to /dev/disk.1/0.0 which is being
> written to.
> 
> I wrote a tiny test program which does a single O_DIRECT read of 4096 bytes
> from a device, and can confirm that, more generally, an O_DIRECT read from a
> device being continuously written to always hangs indefinitely. This occurs
> even if I ionice -c3 the dd, and ionice -c1 -n0 the O_DIRECT reader...
> 
>   # dd if=/dev/zero of=/dev/disk.1/0.0 bs=4M &
>   [1] 2697
>   # ~/tmp/readtest
>   [hangs]
> 
>   # ionice -c3 dd if=/dev/zero of=/dev/disk.1/0.0 bs=4M &
>   [1] 2734
>   # ionice -c1 -n0 ~/tmp/readtest
>   [hangs]
> 
> There's no problem if the dd is in the other direction, reading continuously
> from the LV, nor if the test read isn't O_DIRECT. Attempting to kill -9 the
> O_DIRECT reader doesn't succeed until after the read() returns.
> 
> Finally, I've tried replacing the LV with a raw disk /dev/sdb, and the same
> behaviour appears there for all choices of IO scheduler for the disk, so
> it's neither device mapper specific nor even IO scheduler specific! This
> means that it isn't in any sense a bug in your code, of course, but given
> that the phenomenon seems to block any LVM management operation when heavy
> writes are going on, surely we can't be the first people to hit this whilst
> using the LVM tool. I'm wondering whether you've had other similar reports
> or have any other suggestions?
> 
> For what it's worth, these tests were initially done on our dev machine
> which is running lvm 2.02.38 and a kernel from the head of Linus' tree
> (usually no more than a couple of days old with a queue of my local patches
> that Ed hasn't integrated into the upstream ata-over-ethernet drivers yet),
> but the same behaviour appears on my totally standard desktop machine with
> much older 2.6.25 and 2.6.24.4 kernels 'as released', so I think it's pretty
> long-standing.
> 
> Best wishes,
> 
> Chris.
> 
> #define _GNU_SOURCE
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> #define BLOCK 4096
> 
> int main(int argc, char **argv) {
>   int fd, n;
>   char *buffer;
> 
>   if (argc != 2) {
>     fprintf(stderr, "Usage: %s FILE\n", argv[0]);
>     return 1;
>   }
> 
>   buffer = valloc(BLOCK);
>   if (!buffer) {
>     perror("valloc");
>     return 1;
>   }
> 
>   fd = open(argv[1], O_RDONLY | O_LARGEFILE | O_DIRECT);
>   if (fd < 0) {
>     perror("open");
>     return 1;
>   }
> 
>   if (pread(fd, buffer, BLOCK, 0) < 0)
>     perror("pread");
> 
>   close(fd);
>   return 0;
> }
> 
> 


* Re: [PATCH] Memory management livelock
  2008-09-22 21:10 ` [PATCH] Memory management livelock Mikulas Patocka
@ 2008-09-23  0:48   ` Andrew Morton
  2008-09-23 22:34   ` Mikulas Patocka
  1 sibling, 0 replies; 67+ messages in thread
From: Andrew Morton @ 2008-09-23  0:48 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: chris, linux-kernel, linux-mm, agk, mbroz

On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> The bug happens when one process is doing sequential buffered writes to
> a block device (or file) and another process is attempting to execute
> sync(), fsync() or direct-IO on that device (or file). This syncing
> process will wait indefinitely, until the first writing process
> finishes.
> 
> For example, run these two commands:
> dd if=/dev/zero of=/dev/sda1 bs=65536 &
> dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> 
> The bug is caused by sequential walking of address space in
> write_cache_pages and wait_on_page_writeback_range: if some other
> process is constantly making dirty and writeback pages while these
> functions run, the functions will wait on every new page, resulting in
> indefinite wait.

Shouldn't happen.  All the data-syncing functions should have an upper
bound on the number of pages which they attempt to write.  In the
example above, we end up in here:

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
				loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = mapping->nrpages * 2,	<<--
		.range_start = start,
		.range_end = end,
	};

so generic_file_direct_write()'s filemap_write_and_wait() will attempt
to write at most 2* the number of pages which are in cache for that inode.

I'd say that either a) that logic got broken or b) you didn't wait long
enough, and we might need to do something to make it not wait so long.

But before we patch anything we should fully understand what is
happening and why the current anti-livelock code isn't working in this
case.



* Re: [PATCH] Memory management livelock
  2008-09-22 21:10 ` [PATCH] Memory management livelock Mikulas Patocka
  2008-09-23  0:48   ` Andrew Morton
@ 2008-09-23 22:34   ` Mikulas Patocka
  2008-09-23 22:49     ` Andrew Morton
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-23 22:34 UTC (permalink / raw)
  To: linux-kernel, akpm, linux-mm; +Cc: Alasdair G Kergon, Milan Broz, Chris Webb

> On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> 
> > The bug happens when one process is doing sequential buffered writes to
> > a block device (or file) and another process is attempting to execute
> > sync(), fsync() or direct-IO on that device (or file). This syncing
> > process will wait indefinitely, until the first writing process
> > finishes.
> >
> > For example, run these two commands:
> > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> >
> > The bug is caused by sequential walking of address space in
> > write_cache_pages and wait_on_page_writeback_range: if some other
> > process is constantly making dirty and writeback pages while these
> > functions run, the functions will wait on every new page, resulting in
> > indefinite wait.
> 
> Shouldn't happen. All the data-syncing functions should have an upper
> bound on the number of pages which they attempt to write. In the
> example above, we end up in here:
> 
> int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> 				loff_t end, int sync_mode)
> {
> 	int ret;
> 	struct writeback_control wbc = {
> 		.sync_mode = sync_mode,
> 		.nr_to_write = mapping->nrpages * 2,	<<--
> 		.range_start = start,
> 		.range_end = end,
> 	};
> 
> so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> to write at most 2* the number of pages which are in cache for that inode.

See write_cache_pages:

if (wbc->sync_mode != WB_SYNC_NONE)
        wait_on_page_writeback(page);	(1)
if (PageWriteback(page) ||
    !clear_page_dirty_for_io(page)) {
        unlock_page(page);		(2)
        continue;
}
ret = (*writepage)(page, wbc, data);
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
        unlock_page(page);
        ret = 0;
}
if (ret || (--(wbc->nr_to_write) <= 0))
        done = 1;

--- so if it goes through points (1) and (2), the counter is not 
decremented, yet the function waits for the page. If there is a constant 
stream of writeback pages being generated, it waits on each of them --- 
that is, forever. I have seen the livelock in this function. Doesn't that 
example with two dd's --- one buffered write and the other direct-IO read 
--- work for you? For me it livelocks here.

wait_on_page_writeback_range is another example where the livelock 
happened; there is no protection at all against starvation.


BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing 
to me.

Imagine this case: You have two pages with indices 4 and 5 dirty in a 
file. You call fsync(). It sets nr_to_write to 4.

Meanwhile, another process makes pages 0, 1, 2, 3 dirty.

The fsync() process goes to write_cache_pages, writes the first 4 dirty 
pages and exits because it goes over the limit.

Result --- you violate fsync() semantics: pages that were dirty before 
the call to fsync() are not written when fsync() exits.

> I'd say that either a) that logic got broken or b) you didn't wait long
> enough, and we might need to do something to make it not wait so long.
> 
> But before we patch anything we should fully understand what is
> happening and why the current anti-livelock code isn't working in this
> case.

Mikulas


* Re: [PATCH] Memory management livelock
  2008-09-23 22:34   ` Mikulas Patocka
@ 2008-09-23 22:49     ` Andrew Morton
  2008-09-23 23:11       ` Mikulas Patocka
  2008-10-03  2:32       ` [PATCH] " Nick Piggin
  0 siblings, 2 replies; 67+ messages in thread
From: Andrew Morton @ 2008-09-23 22:49 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > 
> > > The bug happens when one process is doing sequential buffered writes to
> > > a block device (or file) and another process is attempting to execute
> > > sync(), fsync() or direct-IO on that device (or file). This syncing
> > > process will wait indefinitely, until the first writing process
> > > finishes.
> > >
> > > For example, run these two commands:
> > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > >
> > > The bug is caused by sequential walking of address space in
> > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > process is constantly making dirty and writeback pages while these
> > > functions run, the functions will wait on every new page, resulting in
> > > indefinite wait.
> > 
> > Shouldn't happen. All the data-syncing functions should have an upper
> > bound on the number of pages which they attempt to write. In the
> > example above, we end up in here:
> > 
> > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> > 				loff_t end, int sync_mode)
> > {
> > 	int ret;
> > 	struct writeback_control wbc = {
> > 		.sync_mode = sync_mode,
> > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > 		.range_start = start,
> > 		.range_end = end,
> > 	};
> > 
> > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > to write at most 2* the number of pages which are in cache for that inode.
> 
> See write_cache_pages:
> 
> if (wbc->sync_mode != WB_SYNC_NONE)
>         wait_on_page_writeback(page);	(1)
> if (PageWriteback(page) ||
>     !clear_page_dirty_for_io(page)) {
>         unlock_page(page);		(2)
>         continue;
> }
> ret = (*writepage)(page, wbc, data);
> if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
>         unlock_page(page);
>         ret = 0;
> }
> if (ret || (--(wbc->nr_to_write) <= 0))
>         done = 1;
> 
> --- so if it goes by points (1) and (2), the counter is not decremented, 
> yet the function waits for the page. If there is constant stream of 
> writeback pages being generated, it waits on each of them --- that is, 
> forever. I have seen livelock in this function. For you that example with 
> two dd's, one buffered write and the other directIO read doesn't work? For 
> me it livelocks here.
> 
> wait_on_page_writeback_range is another example where the livelock 
> happened, there is no protection at all against starvation.

um, OK.  So someone else is initiating IO for this inode and this
thread *never* gets to initiate any writeback.  That's a bit of a
surprise.

How do we fix that?  Maybe decrement nr_to_write for these pages as
well?
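
Something along these lines, perhaps (an untested sketch against the
write_cache_pages code quoted above):

	if (PageWriteback(page) ||
	    !clear_page_dirty_for_io(page)) {
		unlock_page(page);
		/* charge skipped pages against the budget too */
		if (--(wbc->nr_to_write) <= 0)
			done = 1;
		continue;
	}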

> 
> BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing 
> to me.
> 
> Imagine this case: You have two pages with indices 4 and 5 dirty in a 
> file. You call fsync(). It sets nr_to_write to 4.
> 
> Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> 
> The fsync() process goes to write_cache_pages, writes the first 4 dirty 
> pages and exits because it goes over the limit.
> 
> result --- you violate fsync() semantics, pages that were dirty before 
> call to fsync() are not written when fsync() exits.

yup, that's pretty much unfixable, really, unless new locks are added
which block threads which are writing to unrelated sections of the
file, and that could hurt some workloads quite a lot, I expect.

Hopefully high performance applications are instantiating the file
up-front and are using sync_file_range() to prevent these sorts of
things from happening.  But they probably aren't.




* Re: [PATCH] Memory management livelock
  2008-09-23 22:49     ` Andrew Morton
@ 2008-09-23 23:11       ` Mikulas Patocka
  2008-09-23 23:46         ` Andrew Morton
  2008-10-03  2:32       ` [PATCH] " Nick Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-23 23:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, agk, mbroz, chris



On Tue, 23 Sep 2008, Andrew Morton wrote:

> On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > 
> > > > The bug happens when one process is doing sequential buffered writes to
> > > > a block device (or file) and another process is attempting to execute
> > > > sync(), fsync() or direct-IO on that device (or file). This syncing
> > > > process will wait indefinitely, until the first writing process
> > > > finishes.
> > > >
> > > > For example, run these two commands:
> > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > >
> > > > The bug is caused by sequential walking of address space in
> > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > process is constantly making dirty and writeback pages while these
> > > > functions run, the functions will wait on every new page, resulting in
> > > > indefinite wait.
> > > 
> > > Shouldn't happen. All the data-syncing functions should have an upper
> > > bound on the number of pages which they attempt to write. In the
> > > example above, we end up in here:
> > > 
> > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> > > 				loff_t end, int sync_mode)
> > > {
> > > 	int ret;
> > > 	struct writeback_control wbc = {
> > > 		.sync_mode = sync_mode,
> > > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > > 		.range_start = start,
> > > 		.range_end = end,
> > > 	};
> > > 
> > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > > to write at most 2* the number of pages which are in cache for that inode.
> > 
> > See write_cache_pages:
> > 
> > if (wbc->sync_mode != WB_SYNC_NONE)
> >         wait_on_page_writeback(page);	(1)
> > if (PageWriteback(page) ||
> >     !clear_page_dirty_for_io(page)) {
> >         unlock_page(page);		(2)
> >         continue;
> > }
> > ret = (*writepage)(page, wbc, data);
> > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> >         unlock_page(page);
> >         ret = 0;
> > }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> >         done = 1;
> > 
> > --- so if it goes by points (1) and (2), the counter is not decremented, 
> > yet the function waits for the page. If there is constant stream of 
> > writeback pages being generated, it waits on each of them --- that is, 
> > forever. I have seen livelock in this function. For you that example with 
> > two dd's, one buffered write and the other directIO read doesn't work? For 
> > me it livelocks here.
> > 
> > wait_on_page_writeback_range is another example where the livelock 
> > happened, there is no protection at all against starvation.
> 
> um, OK.  So someone else is initiating IO for this inode and this
> thread *never* gets to initiate any writeback.  That's a bit of a
> surprise.
> 
> How do we fix that?  Maybe decrement nr_to_write for these pages as
> well?

And what do you want to do with wait_on_page_writeback_range? When I 
solved that livelock in write_cache_pages(), I got another livelock in 
wait_on_page_writeback_range.

> > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing 
> > to me.
> > 
> > Imagine this case: You have two pages with indices 4 and 5 dirty in a 
> > file. You call fsync(). It sets nr_to_write to 4.
> > 
> > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> > 
> > The fsync() process goes to write_cache_pages, writes the first 4 dirty 
> > pages and exits because it goes over the limit.
> > 
> > result --- you violate fsync() semantics, pages that were dirty before 
> > call to fsync() are not written when fsync() exits.
> 
> yup, that's pretty much unfixable, really, unless new locks are added
> which block threads which are writing to unrelated sections of the
> file, and that could hurt some workloads quite a lot, I expect.

It is fixable with the patch I sent --- it doesn't take any locks unless 
the starvation happens. Then, you don't have to use .nr_to_write for 
fsync anymore.

Another solution could be to record in the page structure the jiffies 
value when the page entered the dirty state and the writeback state. The 
start-writeback/wait-on-writeback functions could then trivially ignore 
pages that were dirtied or put under writeback while the function was in 
progress.
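
Roughly like this (a sketch only --- struct page has no such timestamp
today; a new field, here called page->writeback_since, would have to be
added):

	unsigned long start = jiffies;	/* when this sync began */

	/* ... inside the page walk ... */
	/* ignore pages that entered writeback after we started */
	if (time_after(page->writeback_since, start))
		continue;
	wait_on_page_writeback(page);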

> Hopefully high performance applications are instantiating the file
> up-front and are using sync_file_range() to prevent these sorts of
> things from happening.  But they probably aren't.

--- for databases it is pretty much possible that one thread is writing 
already-journaled data (so it doesn't care when the data are really 
written) and another thread is calling fsync() on the same inode 
simultaneously --- so fsync() could mistakenly write the data generated by 
the first thread and ignore the data generated by the second thread, which 
it should really write.

Mikulas


* Re: [PATCH] Memory management livelock
  2008-09-23 23:11       ` Mikulas Patocka
@ 2008-09-23 23:46         ` Andrew Morton
  2008-09-24 18:50           ` Mikulas Patocka
                             ` (3 more replies)
  0 siblings, 4 replies; 67+ messages in thread
From: Andrew Morton @ 2008-09-23 23:46 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

On Tue, 23 Sep 2008 19:11:51 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> > > wait_on_page_writeback_range is another example where the livelock 
> > > happened, there is no protection at all against starvation.
> > 
> > um, OK.  So someone else is initiating IO for this inode and this
> > thread *never* gets to initiate any writeback.  That's a bit of a
> > surprise.
> > 
> > How do we fix that?  Maybe decrement nr_to_write for these pages as
> > well?
> 
> And what do you want to do with wait_on_page_writeback_range?

Don't know.  I was asking you.

> When I 
> solved that livelock in write_cache_pages(), I got another livelock in 
> wait_on_page_writeback_range.
> 
> > > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing 
> > > to me.
> > > 
> > > Imagine this case: You have two pages with indices 4 and 5 dirty in a 
> > > file. You call fsync(). It sets nr_to_write to 4.
> > > 
> > > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> > > 
> > > The fsync() process goes to write_cache_pages, writes the first 4 dirty 
> > > pages and exits because it goes over the limit.
> > > 
> > > result --- you violate fsync() semantics, pages that were dirty before 
> > > call to fsync() are not written when fsync() exits.
> > 
> > yup, that's pretty much unfixable, really, unless new locks are added
> > which block threads which are writing to unrelated sections of the
> > file, and that could hurt some workloads quite a lot, I expect.
> 
> It is fixable with the patch I sent --- it doesn't take any locks unless 
> the starvation happens. Then, you don't have to use .nr_to_write for 
> fsync anymore.

I agree that the patch is low-impact and relatively straightforward. 
The main problem is making the address_space larger - there can be (and
often are) millions and millions of these things in memory.  Making it
larger is a big deal.  We should work hard to seek an alternative, and
afaict that isn't happening here.

We already have existing code and design which attempts to avoid
livelock without adding stuff to the address_space.  Can it be modified
so as to patch up this quite obscure and rarely-occurring problem?

> Another solution could be to record in the page structure the jiffies 
> value when the page entered the dirty state and the writeback state. The 
> start-writeback/wait-on-writeback functions could then trivially ignore 
> pages that were dirtied or put under writeback while the function was in 
> progress.
> 
> > Hopefully high performance applications are instantiating the file
> > up-front and are using sync_file_range() to prevent these sorts of
> > things from happening.  But they probably aren't.
> 
> --- for databases it is pretty much possible that one thread is writing 
> already journaled data (so it doesn't care when the data are really 
> written) and another thread is calling fsync() on the same inode 
> simultaneously --- so fsync() could mistakenly write the data generated by 
> the first thread and ignore the data generated by the second thread, that 
> it should really write.
> 
> Mikulas


* Re: [PATCH] Memory management livelock
  2008-09-23 23:46         ` Andrew Morton
@ 2008-09-24 18:50           ` Mikulas Patocka
  2008-09-24 18:51           ` [PATCH 1/3] " Mikulas Patocka
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-24 18:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

> > > yup, that's pretty much unfixable, really, unless new locks are added
> > > which block threads which are writing to unrelated sections of the
> > > file, and that could hurt some workloads quite a lot, I expect.
> > 
> > It is fixable with the patch I sent --- it doesn't take any locks unless 
> > the starvation happens. Then, you don't have to use .nr_to_write for 
> > fsync anymore.
> 
> I agree that the patch is low-impact and relatively straightforward. 
> The main problem is making the address_space larger - there can (and
> often are) millions and millions of these things in memory.  Making it
> larger is a big deal.  We should work hard to seek an alternative and
> afaict that isn't happening here.
> 
> We already have existing code and design which attempts to avoid
> livelock without adding stuff to the address_space.  Can it be modified
> so as to patch up this quite obscure and rarely-occurring problem?

I reworked my patch to use a bit in address_space->flags and hashed wait 
queues, so it doesn't take any extra memory. I'm sending it in three 
parts:
1 - add a generic function wait_action_schedule
2 - fix the livelock; the logic is exactly the same as in my previous 
patch, with wait_on_bit_lock used instead of mutexes
3 - remove the nr_pages * 2 limit, because it causes misbehavior and 
possible data loss.

Mikulas


* [PATCH 1/3] Memory management livelock
  2008-09-23 23:46         ` Andrew Morton
  2008-09-24 18:50           ` Mikulas Patocka
@ 2008-09-24 18:51           ` Mikulas Patocka
  2008-09-24 18:52           ` [PATCH 2/3] " Mikulas Patocka
  2008-09-24 18:53           ` [PATCH 3/3] Memory management livelock Mikulas Patocka
  3 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-24 18:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

Add a generic function, wait_action_schedule, that allows wait_on_bit_lock
to be used just like a mutex.
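
For example, patch 2/3 then takes the bit lock with:

	wait_on_bit_lock(&mapping->flags, AS_STARVATION,
			 wait_action_schedule, TASK_UNINTERRUPTIBLE);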

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/wait.h |    8 +++++++-
 kernel/wait.c        |    7 +++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/wait.h	2008-09-24 03:20:59.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/wait.h	2008-09-24 03:26:34.000000000 +0200
@@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void 
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
 }
-	
+
+/**
+ * wait_action_schedule - this function can be passed to wait_on_bit or
+ * wait_on_bit_lock and it will just call schedule().
+ */
+int wait_action_schedule(void *);
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.27-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/kernel/wait.c	2008-09-24 03:22:58.000000000 +0200
+++ linux-2.6.27-rc7-devel/kernel/wait.c	2008-09-24 03:24:10.000000000 +0200
@@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
 }
 EXPORT_SYMBOL(bit_waitqueue);
+
+int wait_action_schedule(void *word)
+{
+	schedule();
+	return 0;
+}
+EXPORT_SYMBOL(wait_action_schedule);



* [PATCH 2/3] Memory management livelock
  2008-09-23 23:46         ` Andrew Morton
  2008-09-24 18:50           ` Mikulas Patocka
  2008-09-24 18:51           ` [PATCH 1/3] " Mikulas Patocka
@ 2008-09-24 18:52           ` Mikulas Patocka
  2008-10-02  5:54             ` Andrew Morton
  2008-09-24 18:53           ` [PATCH 3/3] Memory management livelock Mikulas Patocka
  3 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-24 18:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

Avoid starvation when walking address space.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/pagemap.h |    1 +
 mm/filemap.c            |   20 ++++++++++++++++++++
 mm/page-writeback.c     |   37 ++++++++++++++++++++++++++++++++++++-
 mm/truncate.c           |   24 +++++++++++++++++++++++-
 4 files changed, 80 insertions(+), 2 deletions(-)

Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
===================================================================
--- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h	2008-09-24 02:57:37.000000000 +0200
+++ linux-2.6.27-rc7-devel/include/linux/pagemap.h	2008-09-24 02:59:04.000000000 +0200
@@ -21,6 +21,7 @@
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
 #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
+#define AS_STARVATION	(__GFP_BITS_SHIFT + 3)	/* an anti-starvation barrier */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 02:59:33.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 03:13:47.000000000 +0200
@@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct 
 	int nr_pages;
 	int ret = 0;
 	pgoff_t index;
+	long pages_to_process;
 
 	if (end < start)
 		return 0;
 
+	/*
+	 * Estimate the number of pages to process. If we process significantly
+	 * more than this, someone is making writeback pages under us.
+	 * We must pull the anti-starvation plug.
+	 */
+	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	pages_to_process += (pages_to_process >> 3) + 16;
+
 	pagevec_init(&pvec, 0);
 	index = start;
 	while ((index <= end) &&
@@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct 
 			if (page->index > end)
 				continue;
 
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+
 			wait_on_page_writeback(page);
 			if (PageError(page))
 				ret = -EIO;
@@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct 
 		cond_resched();
 	}
 
+	if (pages_to_process < 0) {
+		smp_mb__before_clear_bit();
+		clear_bit(AS_STARVATION, &mapping->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&mapping->flags, AS_STARVATION);
+	}
+
 	/* Check for outstanding write errors */
 	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
 		ret = -ENOSPC;
Index: linux-2.6.27-rc7-devel/mm/page-writeback.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/page-writeback.c	2008-09-24 03:10:34.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/page-writeback.c	2008-09-24 03:20:24.000000000 +0200
@@ -435,6 +435,18 @@ static void balance_dirty_pages(struct a
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	/*
+	 * If there is sync() starving on this address space, block
+	 * writers until it finishes.
+	 */
+	if (unlikely(test_bit(AS_STARVATION, &mapping->flags))) {
+		wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+		smp_mb__before_clear_bit();
+		clear_bit(AS_STARVATION, &mapping->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&mapping->flags, AS_STARVATION);
+	}
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.bdi		= bdi,
@@ -876,12 +888,21 @@ int write_cache_pages(struct address_spa
 	pgoff_t end;		/* Inclusive */
 	int scanned = 0;
 	int range_whole = 0;
+	long pages_to_process;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
 		return 0;
 	}
 
+	/*
+	 * Estimate the number of pages to process. If we process significantly
+	 * more than this, someone is making dirty pages under us.
+	 * Pull the anti-starvation plug to stop him.
+	 */
+	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+	pages_to_process += (pages_to_process >> 3) + 16;
+
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
@@ -902,7 +923,13 @@ retry:
 
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
+
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+
+			page = pvec.pages[i];
 
 			/*
 			 * At this point we hold neither mapping->tree_lock nor
@@ -949,6 +976,14 @@ retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	if (pages_to_process < 0) {
+		smp_mb__before_clear_bit();
+		clear_bit(AS_STARVATION, &mapping->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&mapping->flags, AS_STARVATION);
+	}
+
 	if (!scanned && !done) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
Index: linux-2.6.27-rc7-devel/mm/truncate.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/truncate.c	2008-09-24 03:16:15.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/truncate.c	2008-09-24 03:18:00.000000000 +0200
@@ -392,6 +392,14 @@ int invalidate_inode_pages2_range(struct
 	int ret2 = 0;
 	int did_range_unmap = 0;
 	int wrapped = 0;
+	long pages_to_process;
+
+	/*
+	 * Estimate number of pages to process. If we process more, someone
+	 * is making pages under us.
+	 */
+	pages_to_process = mapping->nrpages;
+	pages_to_process += (pages_to_process >> 3) + 16;
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -399,9 +407,15 @@ int invalidate_inode_pages2_range(struct
 		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
 			pgoff_t page_index;
 
+			if (pages_to_process >= 0)
+				if (!pages_to_process--)
+					wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+
+			page = pvec.pages[i];
+
 			lock_page(page);
 			if (page->mapping != mapping) {
 				unlock_page(page);
@@ -449,6 +463,14 @@ int invalidate_inode_pages2_range(struct
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	if (pages_to_process < 0) {
+		smp_mb__before_clear_bit();
+		clear_bit(AS_STARVATION, &mapping->flags);
+		smp_mb__after_clear_bit();
+		wake_up_bit(&mapping->flags, AS_STARVATION);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);



* [PATCH 3/3] Memory management livelock
  2008-09-23 23:46         ` Andrew Morton
                             ` (2 preceding siblings ...)
  2008-09-24 18:52           ` [PATCH 2/3] " Mikulas Patocka
@ 2008-09-24 18:53           ` Mikulas Patocka
  3 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-09-24 18:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, agk, mbroz, chris

Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, this could lead to misbehavior.
Example: there are two pages in address space with indices 4, 5. Both are dirty.
Someone calls __filemap_fdatawrite_range, it sets .nr_to_write = 4.
Meanwhile, some other process creates dirty pages 0, 1, 2, 3.
__filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out that it reached
the limit and exits.
Result: pages that were dirty before __filemap_fdatawrite_range was invoked were
not written.

With starvation protection from the previous patch, this mapping->nrpages * 2
logic is no longer needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/filemap.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
  * opposed to a regular memory cleansing writeback.  The difference between
  * these two operations is that if a dirty page/buffer is encountered, it must
  * be waited upon, and not just skipped over.
+ *
+ * Because new dirty pages can be created while this is executing, the
+ * mapping->nrpages * 2 condition is unsafe. If we are doing a data integrity
+ * write, we must write all the pages. The AS_STARVATION bit will eventually
+ * prevent the creation of more dirty pages, to avoid starvation.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = sync_mode == WB_SYNC_NONE ? mapping->nrpages * 2 : LONG_MAX,
 		.range_start = start,
 		.range_end = end,
 	};



* Re: [PATCH 2/3] Memory management livelock
  2008-09-24 18:52           ` [PATCH 2/3] " Mikulas Patocka
@ 2008-10-02  5:54             ` Andrew Morton
  2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
                                 ` (4 more replies)
  0 siblings, 5 replies; 67+ messages in thread
From: Andrew Morton @ 2008-10-02  5:54 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, linux-mm, agk, mbroz, chris


> Subject: [PATCH 2/3] Memory management livelock

Please don't send multiple patches with identical titles - think up a
good, unique, meaningful title for each patch.

On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:

> Avoid starvation when walking address space.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 

Please include a full changelog with each iteration of each patch.

That changelog should explain the reason for playing games with
bitlocks so Linus doesn't have kittens when he sees it.

>  include/linux/pagemap.h |    1 +
>  mm/filemap.c            |   20 ++++++++++++++++++++
>  mm/page-writeback.c     |   37 ++++++++++++++++++++++++++++++++++++-
>  mm/truncate.c           |   24 +++++++++++++++++++++++-
>  4 files changed, 80 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h	2008-09-24 02:57:37.000000000 +0200
> +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h	2008-09-24 02:59:04.000000000 +0200
> @@ -21,6 +21,7 @@
>  #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
>  #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
>  #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
> +#define AS_STARVATION	(__GFP_BITS_SHIFT + 3)	/* an anti-starvation barrier */
>  
>  static inline void mapping_set_error(struct address_space *mapping, int error)
>  {
> Index: linux-2.6.27-rc7-devel/mm/filemap.c
> ===================================================================
> --- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 02:59:33.000000000 +0200
> +++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 03:13:47.000000000 +0200
> @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct 
>  	int nr_pages;
>  	int ret = 0;
>  	pgoff_t index;
> +	long pages_to_process;
>  
>  	if (end < start)
>  		return 0;
>  
> +	/*
> +	 * Estimate the number of pages to process. If we process significantly
> +	 * more than this, someone is making writeback pages under us.
> +	 * We must pull the anti-starvation plug.
> +	 */
> +	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	pages_to_process += (pages_to_process >> 3) + 16;

This sequence appears twice and it would probably be clearer to
implement it in a well-commented function.
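
For instance (just a sketch; the helper name is made up):

	/*
	 * How many pages may a walker process before we conclude that
	 * someone is generating new pages under it?  Allow the current
	 * count plus ~12% slack and a small constant, so that small
	 * mappings don't trip the barrier on trivial overshoot.
	 */
	static long starvation_limit(long pages)
	{
		return pages + (pages >> 3) + 16;
	}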

>  	pagevec_init(&pvec, 0);
>  	index = start;
>  	while ((index <= end) &&
> @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct 
>  			if (page->index > end)
>  				continue;
>  
> +			if (pages_to_process >= 0)
> +				if (!pages_to_process--)
> +					wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);

This is copied three times and perhaps also should be factored out.

Please note that an effort has been made to make mm/filemap.c look
presentable in an 80-col display.

>  			wait_on_page_writeback(page);
>  			if (PageError(page))
>  				ret = -EIO;
> @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct 
>  		cond_resched();
>  	}
>  
> +	if (pages_to_process < 0) {
> +		smp_mb__before_clear_bit();
> +		clear_bit(AS_STARVATION, &mapping->flags);
> +		smp_mb__after_clear_bit();
> +		wake_up_bit(&mapping->flags, AS_STARVATION);
> +	}

This sequence is repeated three or four times and should be pulled out
into a well-commented function.  That comment should explain the logic
behind the use of these barriers, please.
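
Something like this, say (a sketch; the name is made up):

	/*
	 * Release the anti-starvation barrier: clear the bit and wake
	 * anyone blocked in wait_on_bit_lock().  The memory barriers
	 * order the bit clear against the waiter check in wake_up_bit(),
	 * pairing with the lock side.
	 */
	static void starvation_unlock(struct address_space *mapping)
	{
		smp_mb__before_clear_bit();
		clear_bit(AS_STARVATION, &mapping->flags);
		smp_mb__after_clear_bit();
		wake_up_bit(&mapping->flags, AS_STARVATION);
	}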




* Re: [PATCH] Memory management livelock
  2008-09-23 22:49     ` Andrew Morton
  2008-09-23 23:11       ` Mikulas Patocka
@ 2008-10-03  2:32       ` Nick Piggin
  2008-10-03  2:40         ` Andrew Morton
  2008-10-03  2:54         ` Nick Piggin
  1 sibling, 2 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  2:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris


On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
>
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > >
> > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > The bug happens when one process is doing sequential buffered writes
> > > > to a block device (or file) and another process is attempting to
> > > > execute sync(), fsync() or direct-IO on that device (or file). This
> > > > syncing process will wait indefinitely, until the first writing
> > > > process finishes.
> > > >
> > > > For example, run these two commands:
> > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > >
> > > > The bug is caused by sequential walking of address space in
> > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > process is constantly making dirty and writeback pages while these
> > > > functions run, the functions will wait on every new page, resulting
> > > > in indefinite wait.

I think the problem has been misidentified, or else I have misread the
code. See below. I hope I'm right, because I think the patches are pretty
heavy on complexity in these already complex paths...

It would help if you explicitly identify the exact livelock. Ie. give a
sequence of behaviour that leads to our progress rate falling to zero.


> > > Shouldn't happen. All the data-syncing functions should have an upper
> > > bound on the number of pages which they attempt to write. In the
> > > example above, we end up in here:
> > >
> > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> > > 				loff_t end, int sync_mode)
> > > {
> > > 	int ret;
> > > 	struct writeback_control wbc = {
> > > 		.sync_mode = sync_mode,
> > > 		.nr_to_write = mapping->nrpages * 2,	<<--
> > > 		.range_start = start,
> > > 		.range_end = end,
> > > 	};
> > >
> > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
> > > to write at most 2* the number of pages which are in cache for that
> > > inode.
> >
> > See write_cache_pages:
> >
> > if (wbc->sync_mode != WB_SYNC_NONE)
> >         wait_on_page_writeback(page);	(1)
> > if (PageWriteback(page) ||
> >     !clear_page_dirty_for_io(page)) {
> >         unlock_page(page);		(2)
> >         continue;
> > }
> > ret = (*writepage)(page, wbc, data);
> > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> >         unlock_page(page);
> >         ret = 0;
> > }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> >         done = 1;
> >
> > --- so if it goes by points (1) and (2), the counter is not decremented,
> > yet the function waits for the page. If there is constant stream of
> > writeback pages being generated, it waits on each of them --- that is,
> > forever.

*What* is, forever? Data integrity syncs should have pages operated on
in-order, until we get to the end of the range. Circular writeback could
go through again, possibly, but no more than once.


> > I have seen livelock in this function. For you that example with 
> > two dd's, one buffered write and the other directIO read doesn't work?
> > For me it livelocks here.
> >
> > wait_on_page_writeback_range is another example where the livelock
> > happened, there is no protection at all against starvation.
>
> um, OK.  So someone else is initiating IO for this inode and this
> thread *never* gets to initiate any writeback.  That's a bit of a
> surprise.
>
> How do we fix that?  Maybe decrement nr_to_write for these pages as
> well?

What's the actual problem, though? nr_to_write should not be used for
data integrity operations, and it should not be critical for other
writeout. Upper layers should be able to deal with it rather than
have us lying to them.


> > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous
> > thing to me.
> >
> > Imagine this case: You have two pages with indices 4 and 5 dirty in a
> > file. You call fsync(). It sets nr_to_write to 4.
> >
> > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
> >
> > The fsync() process goes to write_cache_pages, writes the first 4 dirty
> > pages and exits because it goes over the limit.
> >
> > result --- you violate fsync() semantics, pages that were dirty before
> > call to fsync() are not written when fsync() exits.

Wow, that's really nasty. Sad we still have known data integrity problems
in such core functions.


> yup, that's pretty much unfixable, really, unless new locks are added
> which block threads which are writing to unrelated sections of the
> file, and that could hurt some workloads quite a lot, I expect.

Why is it unfixable? Just ignore nr_to_write, and write out everything
properly, I would have thought.

Some things may go a tad slower, but those are going to be the things
that are using fsync, in which case they are going to be hurt much more
by the loss of data integrity than by a slowdown.

Unfortunately, because we have played fast and loose for so long,
applications expect this behaviour, were tested and optimised with it,
and systems were designed and deployed with it; they will notice
performance regressions if we start trying to do things properly. This
is one of my main
arguments for doing things correctly up-front, even if it means a
massive slowdown in some real or imagined workload: at least then we
will hear about complaints and be able to try to improve them rather
than setting ourselves up for failure later.
/rant

Anyway, in this case, I don't think there would be really big problems.
Also, I think there is a reasonable optimisation that might improve it
(2nd last point, in attached patch).

OK, so after glancing at the code... wow, it seems like there are a lot
of bugs in there.

[-- Attachment #2: mm-fsync-fix.patch --]
[-- Type: text/x-diff, Size: 7794 bytes --]

write_cache_pages has a number of problems, which appear to be real bugs:

* scanned == 1 is supposed to mean that cyclic writeback has circled through
  zero, thus we should not circle again. However it gets set to 1 after the
  first successful pagevec lookup. This leads to cases where not enough data
  gets written. Counterexample: file with the first 10 pages dirty,
  writeback_index == 5, nr_to_write == 10. Then the last 5 pages will be
  found and scanned will be set to 1; after writing those out, we will not
  cycle back to get the first 5 (see the worked trace after this list).
  Rework this logic.

* If AOP_WRITEPAGE_ACTIVATE is returned, the filesystem is calling on us to
  drop the page lock and retry; however, the existing code would just skip
  that page regardless of whether or not it was a data integrity operation.
  Change this to always retry such a result.

  This is a data integrity bug.

* If ret signals a real error, but we still have some pages left in the
  pagevec, done would be set to 1, but the remaining pages would continue to
  be processed and ret would be overwritten in the process. It could easily
  be overwritten with success, and thus success would be returned even if
  there is an error. Fix this by bailing immediately if there is an error,
  and retaining the error code.

  This is a data integrity bug.

* nr_to_write is heeded by data integrity operations, and the callers tend to
  set it to silly values that could break data integrity semantics. For
  example, nr_to_write can be set to mapping->nrpages * 2; however, if a file
  has a single dirty page, then fsync is called, subsequent pages might be
  concurrently added and dirtied, then write_cache_pages might write out two
  of these newly dirtied pages while not writing out the old page that should
  have been written out. Fix this by ignoring nr_to_write if it is a data
  integrity sync.

  This is a data integrity bug.

* In the range_cont case, range_start is set to index << PAGE_CACHE_SHIFT,
  but index is a pgoff_t and range_start is loff_t, so we can get truncation
  of the value on 32-bit platforms. Fix this by adding the standard loff_t
  cast.

  This is a data integrity bug (depending on how range_cont is used).
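
As promised above, a worked trace of the first bug, with the code as it
stands before this patch (pages 0-9 dirty, writeback_index == 5,
nr_to_write == 10):

  pass 1: the pagevec lookup starts at index 5, finds dirty pages 5..9
          and sets scanned = 1; those five pages are written out and
          nr_to_write drops from 10 to 5
  the while loop exits: no dirty-tagged pages remain past index 9
  the wraparound test "if (!scanned && !done)" fails because
          scanned == 1, so writeback never restarts at index 0
  result: pages 0..4 stay dirty even though nr_to_write would have
          allowed five more writes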


Other problems that are not strictly bugs:

o If we get stuck behind another process that is cleaning pages, we will be
  forced to wait for them to finish, then perform our own writeout (if it
  was redirtied during the long wait), then wait for that. If a page under
  writeout is still clean, we can skip waiting for it (if we're part of a
  data integrity sync, we'll be waiting for all writeout pages afterwards).

o Control structures containing non-idempotent expressions. Break these out
  to make the flow control clearer and separate it from the data manipulation.


---
 mm/filemap.c        |    2 -
 mm/page-writeback.c |  101 +++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 70 insertions(+), 33 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -869,13 +869,13 @@ int write_cache_pages(struct address_spa
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	int ret = 0;
-	int done = 0;
 	struct pagevec pvec;
 	int nr_pages;
+	pgoff_t writeback_index;
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
-	int scanned = 0;
 	int range_whole = 0;
+	int cycled;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -884,23 +884,28 @@ int write_cache_pages(struct address_spa
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
+		cycled = 0;
+		writeback_index = mapping->writeback_index; /* prev offset */
+		index = writeback_index;
+		if (index == 0)
+			cycled = 1;
+		end = ULLONG_MAX;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 		end = wbc->range_end >> PAGE_CACHE_SHIFT;
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
-		scanned = 1;
 	}
 retry:
-	while (!done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-					      PAGECACHE_TAG_DIRTY,
-					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
-		unsigned i;
+	do {
+		int i;
+
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			      PAGECACHE_TAG_DIRTY,
+			      min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
+		if (!nr_pages)
+			break;
 
-		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
@@ -911,58 +916,90 @@ retry:
 			 * swizzled back from swapper_space to tmpfs file
 			 * mapping
 			 */
+again:
 			lock_page(page);
 
+			/*
+			 * Page truncated or invalidated. We can freely skip it
+			 * then, even for data integrity operations: the page
+			 * has disappeared concurrently, so there could be no
+			 * real expectation of this data integrity operation
+			 * even if there is now a new, dirty page at the same
+			 * pagecache address.
+			 */
 			if (unlikely(page->mapping != mapping)) {
+continue_unlock:
 				unlock_page(page);
 				continue;
 			}
 
-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
+			if (page->index > end) {
+				/* Can't be cyclic: end == ULLONG_MAX */
 				unlock_page(page);
-				continue;
+done_release:
+				pagevec_release(&pvec);
+				goto done;
 			}
 
-			if (wbc->sync_mode != WB_SYNC_NONE)
-				wait_on_page_writeback(page);
-
-			if (PageWriteback(page) ||
-			    !clear_page_dirty_for_io(page)) {
-				unlock_page(page);
-				continue;
+			if (PageWriteback(page)) {
+				/* someone else wrote it for us */
+				if (!PageDirty(page)) {
+					goto continue_unlock;
+				} else {
+					/* hmm, but it has been dirtied again */
+					if (wbc->sync_mode != WB_SYNC_NONE)
+						wait_on_page_writeback(page);
+					else
+						goto continue_unlock;
+				}
 			}
 
+			BUG_ON(PageWriteback(page));
+
+			if (!clear_page_dirty_for_io(page))
+				goto continue_unlock;
+
 			ret = (*writepage)(page, wbc, data);
 
-			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
-				unlock_page(page);
-				ret = 0;
+			if (unlikely(ret)) {
+				/* Must retry the write, esp. for integrity */
+				if (ret == AOP_WRITEPAGE_ACTIVATE) {
+					unlock_page(page);
+					ret = 0;
+					goto again;
+				}
+				goto done;
+			}
+			if (wbc->sync_mode == WB_SYNC_NONE) {
+				wbc->nr_to_write--;
+				if (wbc->nr_to_write <= 0)
+					goto done_release;
 			}
-			if (ret || (--(wbc->nr_to_write) <= 0))
-				done = 1;
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
-				done = 1;
+				goto done_release;
 			}
 		}
 		pagevec_release(&pvec);
 		cond_resched();
-	}
-	if (!scanned && !done) {
+	} while (index <= end);
+
+	if (wbc->range_cyclic && !cycled) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
 		 * back to the start of the file
 		 */
-		scanned = 1;
+		cycled = 1;
 		index = 0;
+		end = writeback_index - 1; /* won't be -ve, see above */
 		goto retry;
 	}
+done:
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = index;
-
 	if (wbc->range_cont)
-		wbc->range_start = index << PAGE_CACHE_SHIFT;
+		wbc->range_start = (loff_t)index << PAGE_CACHE_SHIFT;
+
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -209,7 +209,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = LONG_MAX,
 		.range_start = start,
 		.range_end = end,
 	};


* Re: [PATCH] Memory management livelock
  2008-10-03  2:32       ` [PATCH] " Nick Piggin
@ 2008-10-03  2:40         ` Andrew Morton
  2008-10-03  2:59           ` Nick Piggin
  2008-10-03  2:54         ` Nick Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2008-10-03  2:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > yup, that's pretty much unfixable, really, unless new locks are added
> > which block threads which are writing to unrelated sections of the
> > file, and that could hurt some workloads quite a lot, I expect.
> 
> Why is it unfixable? Just ignore nr_to_write, and write out everything
> properly, I would have thought.

That can cause fsync to wait arbitrarily long if some other process is
writing the file.  This happens.


* Re: [PATCH] Memory management livelock
  2008-10-03  2:32       ` [PATCH] " Nick Piggin
  2008-10-03  2:40         ` Andrew Morton
@ 2008-10-03  2:54         ` Nick Piggin
  2008-10-03 11:26           ` Mikulas Patocka
  2008-10-03 15:52           ` application syncing options (was Re: [PATCH] Memory management livelock) david
  1 sibling, 2 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  2:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

[-- Attachment #1: Type: text/plain, Size: 4135 bytes --]

On Friday 03 October 2008 12:32, Nick Piggin wrote:
> On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
> >
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > > >
> > > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > > The bug happens when one process is doing sequential buffered
> > > > > writes to a block device (or file) and another process is
> > > > > attempting to execute sync(), fsync() or direct-IO on that device
> > > > > (or file). This syncing process will wait indefinitelly, until the
> > > > > first writing process finishes.
> > > > >
> > > > > For example, run these two commands:
> > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > > >
> > > > > The bug is caused by sequential walking of address space in
> > > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > > process is constantly making dirty and writeback pages while these
> > > > > functions run, the functions will wait on every new page, resulting
> > > > > in indefinite wait.
>
> I think the problem has been misidentified, or else I have misread the
> code. See below. I hope I'm right, because I think the patches are pretty
> heavy on complexity in these already complex paths...
>
> It would help if you explicitly identify the exact livelock. Ie. give a
> sequence of behaviour that leads to our progress rate falling to zero.
>
> > > > Shouldn't happen. All the data-syncing functions should have an upper
> > > > bound on the number of pages which they attempt to write. In the
> > > > example above, we end up in here:
> > > >
> > > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
> > > > start,
> > > > loff_t end, int sync_mode)
> > > > {
> > > > int ret;
> > > > struct writeback_control wbc = {
> > > > .sync_mode = sync_mode,
> > > > .nr_to_write = mapping->nrpages * 2, <<--
> > > > .range_start = start,
> > > > .range_end = end,
> > > > };
> > > >
> > > > so generic_file_direct_write()'s filemap_write_and_wait() will
> > > > attempt to write at most 2* the number of pages which are in cache
> > > > for that inode.
> > >
> > > See write_cache_pages:
> > >
> > > if (wbc->sync_mode != WB_SYNC_NONE)
> > >         wait_on_page_writeback(page);	(1)
> > > if (PageWriteback(page) ||
> > >     !clear_page_dirty_for_io(page)) {
> > >         unlock_page(page);		(2)
> > >         continue;
> > > }
> > > ret = (*writepage)(page, wbc, data);
> > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > >         unlock_page(page);
> > >         ret = 0;
> > > }
> > > if (ret || (--(wbc->nr_to_write) <= 0))
> > >         done = 1;
> > >
> > > --- so if it goes by points (1) and (2), the counter is not
> > > decremented, yet the function waits for the page. If there is constant
> > > stream of writeback pages being generated, it waits on each on them ---
> > > that is, forever.
>
> *What* is, forever? Data integrity syncs should have pages operated on
> in-order, until we get to the end of the range. Circular writeback could
> go through again, possibly, but no more than once.

OK, I have been able to reproduce it somewhat. It is not a livelock,
but what is happening is that direct IO read basically does an fsync
on the file before performing the IO. The fsync gets stuck behind the
dd that is dirtying the pages, and ends up following behind it and
doing all its IO for it.

The following patch avoids the issue for direct IO, by using the range
syncs rather than trying to sync the whole file.

The underlying problem I guess is unchanged. Is it really a problem,
though? The way I'd love to solve it is actually by adding another bit
or two to the pagecache radix tree, which can be used to transiently tag
the tree for future operations. That way we could record the dirty and
writeback pages up front, and then only bother with operating on them.

That's *if* it really is a problem. I don't have much pity for someone
doing buffered IO and direct IO to the same pages of the same file :)

[-- Attachment #2: mm-starve-fix.patch --]
[-- Type: text/x-diff, Size: 1671 bytes --]

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2008-10-03 11:21:31.000000000 +1000
+++ linux-2.6/mm/filemap.c	2008-10-03 12:00:17.000000000 +1000
@@ -1304,11 +1304,8 @@ generic_file_aio_read(struct kiocb *iocb
 			goto out; /* skip atime */
 		size = i_size_read(inode);
 		if (pos < size) {
-			retval = filemap_write_and_wait(mapping);
-			if (!retval) {
-				retval = mapping->a_ops->direct_IO(READ, iocb,
+			retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
-			}
 			if (retval > 0)
 				*ppos = pos + retval;
 			if (retval) {
@@ -2110,18 +2107,10 @@ generic_file_direct_write(struct kiocb *
 	if (count != ocount)
 		*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
 
-	/*
-	 * Unmap all mmappings of the file up-front.
-	 *
-	 * This will cause any pte dirty bits to be propagated into the
-	 * pageframes for the subsequent filemap_write_and_wait().
-	 */
 	write_len = iov_length(iov, *nr_segs);
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
-	if (mapping_mapped(mapping))
-		unmap_mapping_range(mapping, pos, write_len, 0);
 
-	written = filemap_write_and_wait(mapping);
+	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
 	if (written)
 		goto out;
 
@@ -2507,7 +2496,8 @@ generic_file_buffered_write(struct kiocb
 	 * the file data here, to try to honour O_DIRECT expectations.
 	 */
 	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait(mapping);
+		status = filemap_write_and_wait_range(mapping,
+					pos, pos + written - 1);
 
 	return written ? written : status;
 }


* Re: [PATCH] Memory management livelock
  2008-10-03  2:40         ` Andrew Morton
@ 2008-10-03  2:59           ` Nick Piggin
  2008-10-03  3:14             ` Andrew Morton
  0 siblings, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  2:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Friday 03 October 2008 12:40, Andrew Morton wrote:
> On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > > yup, that's pretty much unfixable, really, unless new locks are added
> > > which block threads which are writing to unrelated sections of the
> > > file, and that could hurt some workloads quite a lot, I expect.
> >
> > Why is it unfixable? Just ignore nr_to_write, and write out everything
> > properly, I would have thought.
>
> That can cause fsync to wait arbitrarily long if some other process is
> writing the file.

It can be fixed without touching non-fsync paths (see my next email for
the way to fix it without touching fastpaths).


> This happens. 

What does such a thing? It would have been nicer to ask them to not do
that then, or get them to use range syncs or something. Now that's much
harder because we've accepted the crappy workaround for so long.

It's far far worse to just ignore data integrity of fsync because of the
problem. Should at least have returned an error from fsync in that case,
or make it interruptible or something.


* Re: [PATCH] Memory management livelock
  2008-10-03  2:59           ` Nick Piggin
@ 2008-10-03  3:14             ` Andrew Morton
  2008-10-03  3:47               ` Nick Piggin
  0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2008-10-03  3:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Friday 03 October 2008 12:40, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
> wrote:
> > > > yup, that's pretty much unfixable, really, unless new locks are added
> > > > which block threads which are writing to unrelated sections of the
> > > > file, and that could hurt some workloads quite a lot, I expect.
> > >
> > > Why is it unfixable? Just ignore nr_to_write, and write out everything
> > > properly, I would have thought.
> >
> > That can cause fsync to wait arbitrarily long if some other process is
> > writing the file.
> 
> It can be fixed without touching non-fsync paths (see my next email for
> the way to fix it without touching fastpaths).
> 

yup.

> 
> > This happens. 
> 
> What does such a thing?

I forget.  People do all sorts of weird stuff.

> It would have been nicer to ask them to not do
> that then, or get them to use range syncs or something. Now that's much
> harder because we've accepted the crappy workaround for so long.
> 
> It's far far worse to just ignore data integrity of fsync because of the
> problem. Should at least have returned an error from fsync in that case,
> or make it interruptible or something.

If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000. 
Somehow.

I expect there's no solution which avoids blocking the writers at some
stage.


* Re: [PATCH] Memory management livelock
  2008-10-03  3:14             ` Andrew Morton
@ 2008-10-03  3:47               ` Nick Piggin
  2008-10-03  3:56                 ` Andrew Morton
  0 siblings, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  3:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

[-- Attachment #1: Type: text/plain, Size: 1875 bytes --]

On Friday 03 October 2008 13:14, Andrew Morton wrote:
> On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > On Friday 03 October 2008 12:40, Andrew Morton wrote:

> > > That can cause fsync to wait arbitrarily long if some other process is
> > > writing the file.
> >
> > It can be fixed without touching non-fsync paths (see my next email for
> > the way to fix it without touching fastpaths).
>
> yup.
>
> > > This happens.
> >
> > What does such a thing?
>
> I forget.  People do all sorts of weird stuff.

Damn people...

I guess they also do non-weird stuff like expecting fsync to really fsync.


> > It would have been nicer to ask them to not do
> > that then, or get them to use range syncs or something. Now that's much
> > harder because we've accepted the crappy workaround for so long.
> >
> > It's far far worse to just ignore data integrity of fsync because of the
> > problem. Should at least have returned an error from fsync in that case,
> > or make it interruptible or something.
>
> If a file has one dirty page at offset 1000000000000000 then someone
> does an fsync() and someone else gets in first and starts madly writing
> pages at offset 0, we want to write that page at 1000000000000000.
> Somehow.
>
> I expect there's no solution which avoids blocking the writers at some
> stage.

See my other email. Something roughly like this would do the trick
(hey, it actually boots and runs and does fix the problem too).

It's ugly because we don't have quite the right radix tree operations 
yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
lookups). But the theory is to up-front tag the pages that we need to
get to disk.

Completely no impact or slowdown to any writers (although it does add
8 bytes of tags to the radix tree node... but doesn't increase memory
footprint as such due to slab).

[-- Attachment #2: mm-fsync-snapshot-fix.patch --]
[-- Type: text/x-diff, Size: 10299 bytes --]

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -578,10 +578,12 @@ struct block_device {
 
 /*
  * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
- * radix trees
+ * radix trees. Also, to snapshot all pages required to be fsync'ed in order
+ * to obey data integrity semantics.
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_FSYNC	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -848,22 +848,7 @@ void __init page_writeback_init(void)
 	prop_descriptor_init(&vm_dirties, shift);
 }
 
-/**
- * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
- * @mapping: address space structure to write
- * @wbc: subtract the number of written pages from *@wbc->nr_to_write
- * @writepage: function called for each page
- * @data: data passed to writepage function
- *
- * If a page is already under I/O, write_cache_pages() skips it, even
- * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
- * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
- * and msync() need to guarantee that all the data which was dirty at the time
- * the call was made get new I/O started against them.  If wbc->sync_mode is
- * WB_SYNC_ALL then we were called for data integrity and we must wait for
- * existing IO to complete.
- */
-int write_cache_pages(struct address_space *mapping,
+static int __write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data)
 {
@@ -947,10 +932,7 @@ done_release:
 					goto continue_unlock;
 				} else {
 					/* hmm, but it has been dirtied again */
-					if (wbc->sync_mode != WB_SYNC_NONE)
-						wait_on_page_writeback(page);
-					else
-						goto continue_unlock;
+					goto continue_unlock;
 				}
 			}
 
@@ -970,11 +952,9 @@ done_release:
 				}
 				goto done;
 			}
-			if (wbc->sync_mode == WB_SYNC_NONE) {
-				wbc->nr_to_write--;
-				if (wbc->nr_to_write <= 0)
-					goto done_release;
-			}
+			wbc->nr_to_write--;
+			if (wbc->nr_to_write <= 0)
+				goto done_release;
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
 				goto done_release;
@@ -1002,6 +982,183 @@ done:
 
 	return ret;
 }
+
+static int __write_cache_pages_sync(struct address_space *mapping,
+		      struct writeback_control *wbc, writepage_t writepage,
+		      void *data)
+{
+	int ret = 0;
+	struct pagevec pvec;
+	int nr_pages;
+	pgoff_t index;
+	pgoff_t end;		/* Inclusive */
+
+	pagevec_init(&pvec, 0);
+	index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end = wbc->range_end >> PAGE_CACHE_SHIFT;
+	do {
+		int i;
+
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			      PAGECACHE_TAG_DIRTY,
+			      min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
+		if (!nr_pages)
+			break;
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			if (unlikely(page->mapping != mapping)) {
+				unlock_page(page);
+				continue;
+			}
+			if (page->index > end) {
+				unlock_page(page);
+				break;
+			}
+			spin_lock_irq(&mapping->tree_lock);
+			radix_tree_tag_set(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);
+			spin_unlock_irq(&mapping->tree_lock);
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	} while (index <= end);
+
+	index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end = wbc->range_end >> PAGE_CACHE_SHIFT;
+	do {
+		int i;
+
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			      PAGECACHE_TAG_WRITEBACK,
+			      min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
+		if (!nr_pages)
+			break;
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			if (unlikely(page->mapping != mapping)) {
+				unlock_page(page);
+				continue;
+			}
+			if (page->index > end) {
+				unlock_page(page);
+				break;
+			}
+			spin_lock_irq(&mapping->tree_lock);
+			radix_tree_tag_set(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);
+			spin_unlock_irq(&mapping->tree_lock);
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	} while (index <= end);
+
+	index = wbc->range_start >> PAGE_CACHE_SHIFT;
+	end = wbc->range_end >> PAGE_CACHE_SHIFT;
+	do {
+		int i;
+
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			      PAGECACHE_TAG_FSYNC,
+			      min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
+		if (!nr_pages)
+			break;
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/*
+			 * At this point we hold neither mapping->tree_lock nor
+			 * lock on the page itself: the page may be truncated or
+			 * invalidated (changing page->mapping to NULL), or even
+			 * swizzled back from swapper_space to tmpfs file
+			 * mapping
+			 */
+again:
+			lock_page(page);
+
+			/*
+			 * Page truncated or invalidated. We can freely skip it
+			 * then, even for data integrity operations: the page
+			 * has disappeared concurrently, so there could be no
+			 * real expectation of this data integrity operation
+			 * even if there is now a new, dirty page at the same
+			 * pagecache address.
+			 */
+			if (unlikely(page->mapping != mapping)) {
+continue_unlock:
+				unlock_page(page);
+				continue;
+			}
+
+			if (page->index > end) {
+				unlock_page(page);
+				break;
+			}
+
+			if (PageWriteback(page)) {
+				/* someone else wrote it for us */
+				if (!PageDirty(page)) {
+					goto continue_unlock;
+				} else {
+					/* hmm, but it has been dirtied again */
+					wait_on_page_writeback(page);
+				}
+			}
+
+			BUG_ON(PageWriteback(page));
+
+			if (!clear_page_dirty_for_io(page))
+				goto continue_unlock;
+
+			ret = (*writepage)(page, wbc, data);
+
+			if (unlikely(ret)) {
+				/* Must retry the write, esp. for integrity */
+				if (ret == AOP_WRITEPAGE_ACTIVATE) {
+					unlock_page(page);
+					ret = 0;
+					goto again;
+				}
+				break;
+			}
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	} while (index <= end);
+
+	if (wbc->range_cont)
+		wbc->range_start = (loff_t)index << PAGE_CACHE_SHIFT;
+
+	return ret;
+}
+
+/**
+ * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
+ * @mapping: address space structure to write
+ * @wbc: subtract the number of written pages from *@wbc->nr_to_write
+ * @writepage: function called for each page
+ * @data: data passed to writepage function
+ *
+ * If a page is already under I/O, write_cache_pages() skips it, even
+ * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
+ * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
+ * and msync() need to guarantee that all the data which was dirty at the time
+ * the call was made get new I/O started against them.  If wbc->sync_mode is
+ * WB_SYNC_ALL then we were called for data integrity and we must wait for
+ * existing IO to complete.
+ */
+int write_cache_pages(struct address_space *mapping,
+		      struct writeback_control *wbc, writepage_t writepage,
+		      void *data)
+{
+	if (wbc->sync_mode == WB_SYNC_NONE)
+		return __write_cache_pages(mapping, wbc, writepage, data);
+	else
+		return __write_cache_pages_sync(mapping, wbc, writepage, data);
+}
 EXPORT_SYMBOL(write_cache_pages);
 
 /*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -305,6 +305,53 @@ int wait_on_page_writeback_range(struct 
 	return ret;
 }
 
+int wait_on_page_writeback_range_fsync(struct address_space *mapping,
+				pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	int nr_pages;
+	int ret = 0;
+	pgoff_t index;
+
+	if (end < start)
+		return 0;
+
+	pagevec_init(&pvec, 0);
+	index = start;
+	while ((index <= end) &&
+			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			PAGECACHE_TAG_FSYNC,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
+		unsigned i;
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* until radix tree lookup accepts end_index */
+			if (page->index > end)
+				continue;
+
+			spin_lock_irq(&mapping->tree_lock);
+			radix_tree_tag_clear(&mapping->page_tree, page->index, PAGECACHE_TAG_FSYNC);
+			spin_unlock_irq(&mapping->tree_lock);
+
+			wait_on_page_writeback(page);
+			if (PageError(page))
+				ret = -EIO;
+		}
+		pagevec_release(&pvec);
+		cond_resched();
+	}
+
+	/* Check for outstanding write errors */
+	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
+		ret = -ENOSPC;
+	if (test_and_clear_bit(AS_EIO, &mapping->flags))
+		ret = -EIO;
+
+	return ret;
+}
+
 /**
  * sync_page_range - write and wait on all pages in the passed range
  * @inode:	target inode
@@ -388,6 +435,17 @@ int filemap_fdatawait(struct address_spa
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
+int filemap_fdatawait_fsync(struct address_space *mapping)
+{
+	loff_t i_size = i_size_read(mapping->host);
+
+	if (i_size == 0)
+		return 0;
+
+	return wait_on_page_writeback_range_fsync(mapping, 0,
+				(i_size - 1) >> PAGE_CACHE_SHIFT);
+}
+
 int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
@@ -401,7 +459,7 @@ int filemap_write_and_wait(struct addres
 		 * thing (e.g. bug) happened, so we avoid waiting for it.
 		 */
 		if (err != -EIO) {
-			int err2 = filemap_fdatawait(mapping);
+			int err2 = filemap_fdatawait_fsync(mapping);
 			if (!err)
 				err = err2;
 		}


* Re: [PATCH] Memory management livelock
  2008-10-03  3:47               ` Nick Piggin
@ 2008-10-03  3:56                 ` Andrew Morton
  2008-10-03  4:07                   ` Nick Piggin
  2008-10-03 11:43                   ` Mikulas Patocka
  0 siblings, 2 replies; 67+ messages in thread
From: Andrew Morton @ 2008-10-03  3:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > I expect there's no solution which avoids blocking the writers at some
> > stage.
> 
> See my other email. Something roughly like this would do the trick
> (hey, it actually boots and runs and does fix the problem too).

It needs exclusion to protect all those temp tags.  Is do_fsync()'s
i_mutex sufficient?  It's quite unobvious (and unmaintainable?) that all
the callers of this stuff are running under that lock.

> It's ugly because we don't have quite the right radix tree operations 
> yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> lookups). But the theory is to up-front tag the pages that we need to
> get to disk.

Perhaps some callback-calling radix tree walker.

> Completely no impact or slowdown to any writers (although it does add
> 8 bytes of tags to the radix tree node... but doesn't increase memory
> footprint as such due to slab).

Can we reduce the amount of copy-n-pasting here?


* Re: [PATCH] Memory management livelock
  2008-10-03  3:56                 ` Andrew Morton
@ 2008-10-03  4:07                   ` Nick Piggin
  2008-10-03  4:17                     ` Andrew Morton
  2008-10-03 11:43                   ` Mikulas Patocka
  1 sibling, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  4:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Friday 03 October 2008 13:56, Andrew Morton wrote:
> On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > > I expect there's no solution which avoids blocking the writers at some
> > > stage.
> >
> > See my other email. Something roughly like this would do the trick
> > (hey, it actually boots and runs and does fix the problem too).
>
> It needs exclusion to protect all those temp tags.  Is do_fsync()'s
> > i_mutex sufficient?  It's quite unobvious (and unmaintainable?) that all
> the callers of this stuff are running under that lock.

Yeah... it does need a lock, which I brushed under the carpet :P
I was going to just say use i_mutex, but then we really would start
impacting on other fastpaths (eg writers).

Possibly a new mutex in the address_space? That way we can say
"anybody who holds this mutex is allowed to use the tag for anything"
and it doesn't have to be fsync specific (whether that would be of
any use to anything else, I don't know).


> > It's ugly because we don't have quite the right radix tree operations
> > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> > lookups). But the theory is to up-front tag the pages that we need to
> > get to disk.
>
> Perhaps some callback-calling radix tree walker.

Possibly, yes. That would make it fairly general. I'll have a look...


> > Completely no impact or slowdown to any writers (although it does add
> > 8 bytes of tags to the radix tree node... but doesn't increase memory
> > footprint as such due to slab).
>
> Can we reduce the amount of copy-n-pasting here?

Yeah... I went to break the sync/async cases into two, but it looks like
it may not have been worthwhile. Just another branch might be the best
way to go.

As far as the c&p in setting the FSYNC tag, yes that should all go away
if the radix-tree is up to scratch. Basically:

radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);

should be able to replace the whole thing, and we'd hold the tree_lock, so
we would not have to take the page lock etc. Basically it would be much
nicer... even somewhere close to a viable solution.
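
For concreteness, a naive sketch of that hypothetical helper (it is not in
the tree; the name and signature are made up here). It leans on the existing
per-index primitives from linux/radix-tree.h, so a real version would
instead walk the tree nodes directly, propagate tags in one pass, and skip
empty subtrees:

#include <linux/radix-tree.h>

/*
 * Illustrative only: set 'set_tag' on every index in [start, end] that
 * currently carries 'if_tag_a' or 'if_tag_b'.  Caller holds tree_lock.
 * Visiting every index like this is far too slow for sparse files; it
 * only shows the intended semantics.
 */
static void radix_tree_tag_set_if_tagged(struct radix_tree_root *root,
		unsigned long start, unsigned long end,
		unsigned int if_tag_a, unsigned int if_tag_b,
		unsigned int set_tag)
{
	unsigned long index;

	for (index = start; index <= end; index++) {
		if (radix_tree_tag_get(root, index, if_tag_a) ||
		    radix_tree_tag_get(root, index, if_tag_b))
			radix_tree_tag_set(root, index, set_tag);
	}
}

With something like that, the two tag-copying loops in
__write_cache_pages_sync above would collapse to one call made under
mapping->tree_lock:

	radix_tree_tag_set_if_tagged(&mapping->page_tree, start, end,
			PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_WRITEBACK,
			PAGECACHE_TAG_FSYNC);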


* Re: [PATCH] Memory management livelock
  2008-10-03  4:07                   ` Nick Piggin
@ 2008-10-03  4:17                     ` Andrew Morton
  2008-10-03  4:29                       ` Nick Piggin
  0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2008-10-03  4:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Friday 03 October 2008 13:56, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
> wrote:
> > > > I expect there's no solution which avoids blocking the writers at some
> > > > stage.
> > >
> > > See my other email. Something roughly like this would do the trick
> > > (hey, it actually boots and runs and does fix the problem too).
> >
> > It needs exclusion to protect all those temp tags.  Is do_fsync()'s
> > i_mutex sufficient?  It's quite unobvious (and unmaintainable?) that all
> > the callers of this stuff are running under that lock.
> 
> Yeah... it does need a lock, which I brushed under the carpet :P
> I was going to just say use i_mutex, but then we really would start
> impacting on other fastpaths (eg writers).
> 
> Possibly a new mutex in the address_space?

That's another, umm 24 bytes minimum in the address_space (and inode). 
That's fairly ouch, which is why Mikulas did that hokey bit-based
thing.

> That way we can say
> "anybody who holds this mutex is allowed to use the tag for anything"
> and it doesn't have to be fsync specific (whether that would be of
> any use to anything else, I don't know).
> 
> 
> > > It's ugly because we don't have quite the right radix tree operations
> > > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
> > > lookups). But the theory is to up-front tag the pages that we need to
> > > get to disk.
> >
> > Perhaps some callback-calling radix tree walker.
> 
> Possibly, yes. That would make it fairly general. I'll have a look...
> 
> 
> > > Completely no impact or slowdown to any writers (although it does add
> > > 8 bytes of tags to the radix tree node... but doesn't increase memory
> > > footprint as such due to slab).
> >
> > Can we reduce the amount of copy-n-pasting here?
> 
> Yeah... I went to break the sync/async cases into two, but it looks like
> it may not have been worthwhile. Just another branch might be the best
> way to go.

Yup.  Could add another do-this flag in the writeback_control, perhaps.
Or even a function pointer.

> As far as the c&p in setting the FSYNC tag, yes that should all go away
> if the radix-tree is up to scratch. Basically:
> 
> radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);
> 
> should be able to replace the whole thing, and we'd hold the tree_lock, so
> we would not have to take the page lock etc. Basically it would be much
> nicer... even somewhere close to a viable solution.


* Re: [PATCH] Memory management livelock
  2008-10-03  4:17                     ` Andrew Morton
@ 2008-10-03  4:29                       ` Nick Piggin
  0 siblings, 0 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-03  4:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mikulas Patocka, linux-kernel, linux-mm, agk, mbroz, chris

On Friday 03 October 2008 14:17, Andrew Morton wrote:
> On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
> > Possibly a new mutex in the address_space?
>
> That's another, umm 24 bytes minimum in the address_space (and inode).
> That's fairly ouch, which is why Mikulas did that hokey bit-based
> thing.

Well yeah, it would be a bit-based mutex in mapping->flags with
hashed waitqueues. Like Mikulas's.
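
Something of roughly this shape, perhaps (a rough sketch only: the
AS_FSYNC_LOCK bit and the helper names are invented here; the
wait_on_bit_lock()/wake_up_bit() calls sleep on the kernel's hashed bit
waitqueues, so the whole lock costs one bit in mapping->flags):

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/wait.h>

#define AS_FSYNC_LOCK	(__GFP_BITS_SHIFT + 3)	/* hypothetical flag bit */

static int fsync_lock_wait(void *word)
{
	schedule();		/* just wait for the wakeup */
	return 0;
}

static void mapping_fsync_lock(struct address_space *mapping)
{
	/* sleep on the hashed bit waitqueue until we win the bit */
	wait_on_bit_lock(&mapping->flags, AS_FSYNC_LOCK,
			 fsync_lock_wait, TASK_UNINTERRUPTIBLE);
}

static void mapping_fsync_unlock(struct address_space *mapping)
{
	clear_bit(AS_FSYNC_LOCK, &mapping->flags);
	smp_mb__after_clear_bit();	/* order the clear before the wakeup */
	wake_up_bit(&mapping->flags, AS_FSYNC_LOCK);
}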


> > Yeah... I went to break the sync/async cases into two, but it looks like
> > it may not have been worthwhile. Just another branch might be the best
> > way to go.
>
> Yup.  Could add another do-this flag in the writeback_control, perhaps.
> Or even a function pointer.

Yeah... possibly we could just _always_ do the PAGECACHE_TAG_FSYNC thing
if mode != WB_SYNC_NONE. I think if we had the infrastructure there to
do it all, it should always be something we want to do for data integrity
writeout.


* Re: [PATCH] Memory management livelock
  2008-10-03  2:54         ` Nick Piggin
@ 2008-10-03 11:26           ` Mikulas Patocka
  2008-10-03 12:31             ` Nick Piggin
  2008-10-03 15:52           ` application syncing options (was Re: [PATCH] Memory management livelock) david
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-03 11:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, agk, mbroz, chris

> > *What* is, forever? Data integrity syncs should have pages operated on
> > in-order, until we get to the end of the range. Circular writeback could
> > go through again, possibly, but no more than once.
> 
> OK, I have been able to reproduce it somewhat. It is not a livelock,
> but what is happening is that direct IO read basically does an fsync
> on the file before performing the IO. The fsync gets stuck behind the
> dd that is dirtying the pages, and ends up following behind it and
> doing all its IO for it.
> 
> The following patch avoids the issue for direct IO, by using the range
> syncs rather than trying to sync the whole file.
> 
> The underlying problem I guess is unchanged. Is it really a problem,
> though? The way I'd love to solve it is actually by adding another bit
> or two to the pagecache radix tree, which can be used to transiently tag
> the tree for future operations. That way we could record the dirty and
> writeback pages up front, and then only bother with operating on them.
> 
> That's *if* it really is a problem. I don't have much pity for someone
> doing buffered IO and direct IO to the same pages of the same file :)

LVM does (that is where the bug was discovered). Basically, it scans all 
the block devices with direct IO and if someone else does buffered IO on 
any device simultaneously, it locks up.

That fsync-vs-write livelock is quite improbable (why would an
application do it?) --- although it could be used as a DoS, producing an
unkillable process.

But there is another possible real-world problem --- sync-vs-write --- 
i.e. an admin plugs in two disks and copies data from one to the other. 
Meanwhile, some unrelated server process executes sync(). The server goes 
into a coma until the copy finishes.

Mikulas


* Re: [PATCH] Memory management livelock
  2008-10-03  3:56                 ` Andrew Morton
  2008-10-03  4:07                   ` Nick Piggin
@ 2008-10-03 11:43                   ` Mikulas Patocka
  2008-10-03 12:27                     ` Nick Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-03 11:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, agk, mbroz, chris

On Thu, 2 Oct 2008, Andrew Morton wrote:

> On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > > I expect there's no solution which avoids blocking the writers at some
> > > stage.
> > 
> > See my other email. Something roughly like this would do the trick
> > (hey, it actually boots and runs and does fix the problem too).
> 
> It needs exclusion to protect all those temp tags.  Is do_fsync()'s
> i_mutex sufficient?  It's quite unobvious (and unmaintainable?) that all
> the callers of this stuff are running under that lock.

That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really 
called under i_mutex (see do_fsync).

So the possible solutions are:

1. Add the jiffies when the page was dirtied and written back to struct page
+ no impact on locking and concurrency
- increases the structure by 8 bytes

2. Stop the writers when the starvation happens (what I did)
+ doesn't do any locking if the livelock doesn't happen
- locks writers when the livelock happens (I think it's not really serious 
--- because very few people complained about the livelock, very few people 
will see performance degradation from blocking the writers).

3. Add another bit to radix tree (what Nick did)
+ doesn't ever block writers
- unconditionally takes the lock on the fsync path and serializes concurrent 
syncs/fsyncs. Probably low overhead too ... or I don't know, is there any 
possible situation where multiple processes execute sync() in parallel and the 
user would see degradation if those syncs were serialized?

Mikulas


* Re: [PATCH] Memory management livelock
  2008-10-03 11:43                   ` Mikulas Patocka
@ 2008-10-03 12:27                     ` Nick Piggin
  2008-10-03 13:53                       ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-03 12:27 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Friday 03 October 2008 21:43, Mikulas Patocka wrote:
> On Thu, 2 Oct 2008, Andrew Morton wrote:
> > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > > > I expect there's no solution which avoids blocking the writers at
> > > > some stage.
> > >
> > > See my other email. Something roughly like this would do the trick
> > > (hey, it actually boots and runs and does fix the problem too).
> >
> > It needs exclusion to protect all those temp tags.  Is do_fsync()'s
> > i_mutex sufficient?  It's quite unobvious (and unmaintainable?) that all
> > the callers of this stuff are running under that lock.
>
> That filemap_fdatawrite and filemap_fdatawait in fsync() aren't really
> called under i_mutex (see do_fsync).
>
> So the possible solutions are:
>
> 1. Add the jiffies when the page was dirtied and written back to struct page
> + no impact on locking and concurrency
> - increases the structure by 8 bytes

This one is not practical.


> 2. Stop the writers when the starvation happens (what I did)
> + doesn't do any locking if the livelock doesn't happen
> - locks writers when the livelock happens (I think it's not really serious
> --- because very few people complained about the livelock, very few people
> will see performance degradation from blocking the writers).

Maybe it is because not much actually does sequential writes to a massive
file or block device while trying to fsync it as well? I don't know. You
could still have cases where fsync takes much longer than expected but it
is still not long enough for a user to report it as a "livelock" bug.


> 3. Add another bit to radix tree (what Nick did)
> + doesn't ever block writers
> - unconditionally takes the lock on the fsync path and serializes concurrent
> syncs/fsyncs. Probably low overhead too ... or I don't know, is there any
> possible situation where multiple processes execute sync() in parallel and the
> user would see degradation if those syncs were serialized?



* Re: [PATCH] Memory management livelock
  2008-10-03 11:26           ` Mikulas Patocka
@ 2008-10-03 12:31             ` Nick Piggin
  2008-10-03 13:50               ` Mikulas Patocka
  2008-10-03 14:36               ` Alasdair G Kergon
  0 siblings, 2 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-03 12:31 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, linux-mm, agk, mbroz, chris

On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
> > > *What* is, forever? Data integrity syncs should have pages operated on
> > > in-order, until we get to the end of the range. Circular writeback
> > > could go through again, possibly, but no more than once.
> >
> > OK, I have been able to reproduce it somewhat. It is not a livelock,
> > but what is happening is that direct IO read basically does an fsync
> > on the file before performing the IO. The fsync gets stuck behind the
> > dd that is dirtying the pages, and ends up following behind it and
> > doing all its IO for it.
> >
> > The following patch avoids the issue for direct IO, by using the range
> > syncs rather than trying to sync the whole file.
> >
> > The underlying problem I guess is unchanged. Is it really a problem,
> > though? The way I'd love to solve it is actually by adding another bit
> > or two to the pagecache radix tree, which can be used to transiently tag
> > the tree for future operations. That way we could record the dirty and
> > writeback pages up front, and then only bother with operating on them.
> >
> > That's *if* it really is a problem. I don't have much pity for someone
> > doing buffered IO and direct IO to the same pages of the same file :)
>
> LVM does (that is where the bug was discovered). Basically, it scans all
> the block devices with direct IO and if someone else does buffered IO on
> any device simultaneously, it locks up.

Scans all block devices with direct IO? Hmm, why, I wonder? Should
really consider using buffered (posix_fadvise to readahead/dropbehind).
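
Roughly this, as an illustrative userspace sketch (error handling omitted;
the helper name is made up):

#define _XOPEN_SOURCE 600	/* for posix_fadvise */
#include <fcntl.h>
#include <unistd.h>

static void scan_label_buffered(const char *dev, char *buf, size_t len)
{
	int fd = open(dev, O_RDONLY);

	posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);	/* readahead */
	read(fd, buf, len);				/* buffered read */
	posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);	/* dropbehind */
	close(fd);
}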


> That fsync-vs-write livelock is quite improbable (why would an
> application do it?) --- although it could be used as a DoS, producing an
> unkillable process.

I'm not sure.


* Re: [PATCH] Memory management livelock
  2008-10-03 12:31             ` Nick Piggin
@ 2008-10-03 13:50               ` Mikulas Patocka
  2008-10-03 14:50                 ` Alasdair G Kergon
  2008-10-03 14:36               ` Alasdair G Kergon
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-03 13:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

> > LVM does (that is where the bug was discovered). Basically, it scans all
> > the block devices with direct IO and if someone else does buffered IO on
> > any device simultaneously, it locks up.
> 
> Scans all block devices with direct IO? Hmm, why, I wonder? Should
> really consider using buffered (posix_fadvise to readahead/dropbehind).

LVM must not allocate any memory when doing IO because it suspends the 
block device and memory allocation could trigger writeback on the 
suspended device and deadlock.

So it preallocates heap and stack, mlockall()s itself and does direct IO.
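
In outline, that pattern is something like the sketch below (illustrative
only, not LVM's actual code; O_DIRECT needs an aligned buffer and a
block-multiple length):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static int read_sectors_direct(const char *dev, size_t len)
{
	void *buf;
	int fd, ret = -1;

	/* pin current and future allocations before suspending devices */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return -1;
	if (posix_memalign(&buf, 4096, len))	/* O_DIRECT alignment */
		return -1;
	fd = open(dev, O_RDONLY | O_DIRECT);
	if (fd >= 0) {
		if (read(fd, buf, len) == (ssize_t)len)
			ret = 0;
		close(fd);
	}
	free(buf);
	return ret;
}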

Well, there are still two allocations on the direct IO path --- one in 
__blockdev_direct_IO and the other in dio_bio_alloc. If you have an idea 
how to get rid of them (especially that kzalloc(sizeof(*dio), GFP_KERNEL); 
the bio allocation would be quite easy to avoid), it would be beneficial.

Mikulas


* Re: [PATCH] Memory management livelock
  2008-10-03 12:27                     ` Nick Piggin
@ 2008-10-03 13:53                       ` Mikulas Patocka
  0 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-03 13:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

> > So the possible solutions are:
> >
> > 1. Add the jiffies when the page was dirtied and written back to struct page
> > + no impact on locking and concurrency
> > - increases the structure by 8 bytes
> 
> This one is not practical.
> 
> 
> > 2. Stop the writers when the starvation happens (what I did)
> > + doesn't do any locking if the livelock doesn't happen
> > - locks writers when the livelock happens (I think it's not really serious
> > --- because very few people complained about the livelock, very few people
> > will see performance degradation from blocking the writers).
> 
> Maybe it is because not much actually does sequential writes to a massive
> file or block device while trying to fsync it as well? I don't know. You
> could still have cases where fsync takes much longer than expected but it
> is still not long enough for a user to report it as a "livelock" bug.

At most twice the time it would normally take (one loop over the writeback 
queue until it detects the livelock, and another loop until it drains all 
the new pages that were created during the first loop).

While with solution (3) it would take only one pass over the whole writeback 
queue.

Mikulas

> > 3. Add another bit to radix tree (what Nick did)
> > + doesn't ever block writers
> > - unconditionally takes the lock on the fsync path and serializes concurrent
> > syncs/fsyncs. Probably low overhead too ... or I don't know, is there any
> > possible situation where multiple processes execute sync() in parallel and the
> > user would see degradation if those syncs were serialized?


* Re: [PATCH] Memory management livelock
  2008-10-03 12:31             ` Nick Piggin
  2008-10-03 13:50               ` Mikulas Patocka
@ 2008-10-03 14:36               ` Alasdair G Kergon
  1 sibling, 0 replies; 67+ messages in thread
From: Alasdair G Kergon @ 2008-10-03 14:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mikulas Patocka, Andrew Morton, linux-kernel, linux-mm, agk,
	mbroz, chris

On Fri, Oct 03, 2008 at 10:31:03PM +1000, Nick Piggin wrote:
> On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
> > LVM does (that is where the bug was discovered). Basically, it scans all
> > the block devices with direct IO and if someone else does buffered IO on
> > any device simultaneously, it locks up.
> Scans all block devices with direct IO? Hmm, why, I wonder? Should
> really consider using buffered (posix_fadvise to readahead/dropbehind).
 
Scan in the sense of it reading a few sectors from each potential LVM
device to check whether it contains an LVM label.

Alasdair
-- 
agk@redhat.com


* Re: [PATCH] Memory management livelock
  2008-10-03 13:50               ` Mikulas Patocka
@ 2008-10-03 14:50                 ` Alasdair G Kergon
  0 siblings, 0 replies; 67+ messages in thread
From: Alasdair G Kergon @ 2008-10-03 14:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

On Fri, Oct 03, 2008 at 09:50:17AM -0400, Mikulas Patocka wrote:
> > > LVM does (that is where the bug was discovered). Basically, it scans all
> > > the block devices with direct IO and if someone else does buffered IO on
> > > any device simultaneously, it locks up.
> > Scans all block devices with direct IO? Hmm, why, I wonder? Should
> > really consider using buffered (posix_fadvise to readahead/dropbehind).
> LVM must not allocate any memory when doing IO because it suspends the 
> block device and memory allocation could trigger writeback on the 
> suspended device and deadlock.
> So it preallocates heap and stack, mlockall()s itself and does direct IO.
 
True, but unrelated to the scanning, which LVM performs *prior* to
entering such a state.

We use direct IO while scanning because it's essential that all nodes in a
cluster see the same updated version of the data after any node has
updated it.

Alasdair
-- 
agk@redhat.com


* application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-03  2:54         ` Nick Piggin
  2008-10-03 11:26           ` Mikulas Patocka
@ 2008-10-03 15:52           ` david
  2008-10-06  0:04             ` Mikulas Patocka
  1 sibling, 1 reply; 67+ messages in thread
From: david @ 2008-10-03 15:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mikulas Patocka, linux-kernel, linux-mm, agk,
	mbroz, chris

On Fri, 3 Oct 2008, Nick Piggin wrote:

>> *What* is, forever? Data integrity syncs should have pages operated on
>> in-order, until we get to the end of the range. Circular writeback could
>> go through again, possibly, but no more than once.
>
> OK, I have been able to reproduce it somewhat. It is not a livelock,
> but what is happening is that direct IO read basically does an fsync
> on the file before performing the IO. The fsync gets stuck behind the
> dd that is dirtying the pages, and ends up following behind it and
> doing all its IO for it.
>
> The following patch avoids the issue for direct IO, by using the range
> syncs rather than trying to sync the whole file.
>
> The underlying problem I guess is unchanged. Is it really a problem,
> though? The way I'd love to solve it is actually by adding another bit
> or two to the pagecache radix tree, which can be used to transiently tag
> the tree for future operations. That way we could record the dirty and
> writeback pages up front, and then only bother with operating on them.
>
> That's *if* it really is a problem. I don't have much pity for someone
> doing buffered IO and direct IO to the same pages of the same file :)

I've seen lots of discussions here about different options in syncing. In
this case a fix is to do an fsync of a range. I've also seen discussions
of how the kernel filesystem code can do ordered writes without having
to wait for them by using barriers. Is this capability exported to
userspace? If so, could you point me at documentation for it?
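
(For reference: a data sync restricted to a range is already reachable
from userspace via sync_file_range(). A minimal sketch, assuming an
already-open fd; note the call gives no durability guarantee for file
metadata:)

#define _GNU_SOURCE
#include <fcntl.h>

/* write back and wait on one byte range of an open file */
static int sync_range(int fd, off64_t offset, off64_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE |
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER);
}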

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)
  2008-10-02  5:54             ` Andrew Morton
@ 2008-10-05 22:11               ` Mikulas Patocka
  2008-10-11 12:06                 ` Nick Piggin
  2008-10-05 22:14               ` [PATCH 1/3] bit mutexes Mikulas Patocka
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 22:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, agk, mbroz, chris

Hi

I removed the repeated code and created new bit mutexes. They are
space-efficient mutexes that consume only one bit each. See the next 3
patches.

If you are concerned about the size of an inode, I can convert other
mutexes to bit mutexes: i_mutex and inotify_mutex. I could also create a
bit_spinlock (a one-bit spinlock that uses test_and_set_bit) and save
space for address_space->tree_lock, address_space->i_mmap_lock,
address_space->private_lock and inode->i_lock.

Please look at it and tell me what you think about the idea of
condensing mutexes into single bits.

Mikulas

> 
> > Subject: [PATCH 2/3] Memory management livelock
> 
> Please don't send multiple patches with identical titles - think up a
> good, unique, meaningful title for each patch.
> 
> On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Avoid starvation when walking address space.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > 
> 
> Please include a full changelog with each iteration of each patch.
> 
> That changelog should explain the reason for playing games with
> bitlocks so Linus doesn't have kittens when he sees it.
> 
> >  include/linux/pagemap.h |    1 +
> >  mm/filemap.c            |   20 ++++++++++++++++++++
> >  mm/page-writeback.c     |   37 ++++++++++++++++++++++++++++++++++++-
> >  mm/truncate.c           |   24 +++++++++++++++++++++++-
> >  4 files changed, 80 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
> > ===================================================================
> > --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h	2008-09-24 02:57:37.000000000 +0200
> > +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h	2008-09-24 02:59:04.000000000 +0200
> > @@ -21,6 +21,7 @@
> >  #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
> >  #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
> >  #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
> > +#define AS_STARVATION	(__GFP_BITS_SHIFT + 3)	/* an anti-starvation barrier */
> >  
> >  static inline void mapping_set_error(struct address_space *mapping, int error)
> >  {
> > Index: linux-2.6.27-rc7-devel/mm/filemap.c
> > ===================================================================
> > --- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 02:59:33.000000000 +0200
> > +++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 03:13:47.000000000 +0200
> > @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct 
> >  	int nr_pages;
> >  	int ret = 0;
> >  	pgoff_t index;
> > +	long pages_to_process;
> >  
> >  	if (end < start)
> >  		return 0;
> >  
> > +	/*
> > +	 * Estimate the number of pages to process. If we process significantly
> > +	 * more than this, someone is making writeback pages under us.
> > +	 * We must pull the anti-starvation plug.
> > +	 */
> > +	pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +	pages_to_process += (pages_to_process >> 3) + 16;
> 
> This sequence appears twice and it would probably be clearer to
> implement it in a well-commented function.
> 
> >  	pagevec_init(&pvec, 0);
> >  	index = start;
> >  	while ((index <= end) &&
> > @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct 
> >  			if (page->index > end)
> >  				continue;
> >  
> > +			if (pages_to_process >= 0)
> > +				if (!pages_to_process--)
> > +					wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
> 
> This is copied three times and perhaps also should be factored out.
> 
> Please note that an effort has been made to make mm/filemap.c look
> presentable in an 80-col display.
> 
> >  			wait_on_page_writeback(page);
> >  			if (PageError(page))
> >  				ret = -EIO;
> > @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct 
> >  		cond_resched();
> >  	}
> >  
> > +	if (pages_to_process < 0) {
> > +		smp_mb__before_clear_bit();
> > +		clear_bit(AS_STARVATION, &mapping->flags);
> > +		smp_mb__after_clear_bit();
> > +		wake_up_bit(&mapping->flags, AS_STARVATION);
> > +	}
> 
> This sequence is repeated three or four times and should be pulled out
> into a well-commented function.  That comment should explain the logic
> behind the use of these barriers, please.
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/3] bit mutexes
  2008-10-02  5:54             ` Andrew Morton
  2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
@ 2008-10-05 22:14               ` Mikulas Patocka
  2008-10-05 22:14               ` [PATCH 2/3] Fix fsync livelock Mikulas Patocka
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 22:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, agk, mbroz, chris

A bit mutex implementation.

A bit mutex is a space-efficient mutex that consumes exactly one bit in
the appropriate structure (if mutex debugging is turned off).  A bit
mutex is implemented as a bit in an unsigned long field. Other bits in
the field may be used for other purposes, if the test/set/clear_bit
functions are used to manipulate them. There is no wait queue in the
structure containing the bit mutex; hashed wait queues for waiting on
bits are used instead.

An additional structure, struct bit_mutex, is needed; it contains lock
debugging & dependency information. When the kernel is compiled without
lock debugging, this structure is empty.

Bit mutexes are used with the functions
bit_mutex_init
bit_mutex_lock
bit_mutex_unlock
bit_mutex_is_locked
These functions take 3 arguments: a pointer to the unsigned long field,
the bit position and a pointer to the struct bit_mutex.
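
For illustration, a minimal usage sketch (struct foo and FOO_LOCK are
invented for this example; they are not part of the patch):

#include <linux/bit-mutex.h>

#define FOO_LOCK 0			/* bit 0 of foo->flags */

struct foo {
	unsigned long flags;		/* FOO_LOCK lives here; other bits stay free */
	struct bit_mutex flags_mtx;	/* empty unless mutex debugging is enabled */
};

static void foo_init(struct foo *f)
{
	bit_mutex_init(&f->flags, FOO_LOCK, &f->flags_mtx);
}

static void foo_do_serialized(struct foo *f)
{
	bit_mutex_lock(&f->flags, FOO_LOCK, &f->flags_mtx);
	/* ... work serialized against other foo_do_serialized() callers ... */
	bit_mutex_unlock(&f->flags, FOO_LOCK, &f->flags_mtx);
}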

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/bit-mutex.h |   98 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/wait.h      |    8 +++
 kernel/Makefile           |    2 
 kernel/bit-mutex-debug.c  |   55 +++++++++++++++++++++++++
 kernel/wait.c             |    7 +++
 5 files changed, 169 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc8-devel/include/linux/bit-mutex.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/include/linux/bit-mutex.h	2008-10-05 19:58:30.000000000 +0200
@@ -0,0 +1,98 @@
+#ifndef __LINUX_BIT_MUTEX_H
+#define __LINUX_BIT_MUTEX_H
+
+#include <linux/bitops.h>
+#include <linux/lockdep.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+
+struct bit_mutex {
+#ifdef CONFIG_DEBUG_MUTEXES
+	unsigned long		*field;
+	int 			bit;
+	struct thread_info	*owner;
+	const char		*name;
+	void			*magic;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+#ifndef CONFIG_DEBUG_MUTEXES
+
+static inline void bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+	clear_bit(bit, field);
+	smp_mb__after_clear_bit();
+}
+
+static inline void bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+	wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+}
+
+static inline void bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+	smp_mb__before_clear_bit();
+	clear_bit(bit, field);
+	smp_mb__after_clear_bit();
+	wake_up_bit(field, bit);
+}
+
+static inline int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+	return test_bit(bit, field);
+}
+
+#define __DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx)
+
+#else
+
+void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key);
+
+#define bit_mutex_init(field, bit, mtx)				\
+do {								\
+	static struct lock_class_key __key;			\
+	__bit_mutex_init(field, bit, mtx, #mtx, &__key);	\
+} while (0)
+
+void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
+
+#define bit_mutex_lock(field, bit, mtx)				\
+	__bit_mutex_lock(field, bit, mtx, 0)
+
+void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
+
+#define bit_mutex_unlock(field, bit, mtx)			\
+	__bit_mutex_unlock(field, bit, mtx, 0)
+
+int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx);
+
+#define __DEBUG_BIT_MUTEX_INITIALIZER(field_, bit_, mtx)	\
+	.magic = &(mtx),					\
+	.field = (field_),					\
+	.bit = (bit_)
+
+#endif
+
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)	\
+	, .dep_map = { .name = #mtx }
+
+#else
+
+#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)
+
+#endif
+
+
+#define __BIT_MUTEX_INITIALIZER(field, bit, mtx)		\
+	{							\
+		__DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx)	\
+		__DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)\
+	}
+
+#endif
Index: linux-2.6.27-rc8-devel/include/linux/wait.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/wait.h	2008-10-04 13:40:46.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/wait.h	2008-10-04 13:41:21.000000000 +0200
@@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void 
 		return 0;
 	return out_of_line_wait_on_bit_lock(word, bit, action, mode);
 }
-	
+
+/**
+ * wait_action_schedule - this function can be passed to wait_on_bit or
+ * wait_on_bit_lock and it will just call schedule().
+ */
+int wait_action_schedule(void *);
+
 #endif /* __KERNEL__ */
 
 #endif
Index: linux-2.6.27-rc8-devel/kernel/wait.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/kernel/wait.c	2008-10-04 13:37:24.000000000 +0200
+++ linux-2.6.27-rc8-devel/kernel/wait.c	2008-10-04 13:38:21.000000000 +0200
@@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
 	return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
 }
 EXPORT_SYMBOL(bit_waitqueue);
+
+int wait_action_schedule(void *word)
+{
+	schedule();
+	return 0;
+}
+EXPORT_SYMBOL(wait_action_schedule);
Index: linux-2.6.27-rc8-devel/kernel/Makefile
===================================================================
--- linux-2.6.27-rc8-devel.orig/kernel/Makefile	2008-10-05 14:03:24.000000000 +0200
+++ linux-2.6.27-rc8-devel/kernel/Makefile	2008-10-05 14:11:25.000000000 +0200
@@ -17,6 +17,7 @@ ifdef CONFIG_FTRACE
 # Do not trace debug files and internal ftrace files
 CFLAGS_REMOVE_lockdep.o = -pg
 CFLAGS_REMOVE_lockdep_proc.o = -pg
+CFLAGS_REMOVE_bit-mutex-debug.o = -pg
 CFLAGS_REMOVE_mutex-debug.o = -pg
 CFLAGS_REMOVE_rtmutex-debug.o = -pg
 CFLAGS_REMOVE_cgroup-debug.o = -pg
@@ -29,6 +30,7 @@ obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sy
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
 obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-$(CONFIG_DEBUG_MUTEXES) += bit-mutex-debug.o
 obj-$(CONFIG_LOCKDEP) += lockdep.o
 ifeq ($(CONFIG_PROC_FS),y)
 obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
Index: linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c	2008-10-05 19:12:06.000000000 +0200
@@ -0,0 +1,55 @@
+#include <linux/module.h>
+#include <linux/bitops.h>
+#include <linux/debug_locks.h>
+#include <linux/bit-mutex.h>
+
+void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *)mtx, sizeof(*mtx));
+	lockdep_init_map(&mtx->dep_map, name, key, 0);
+#endif
+	mtx->field = field;
+	mtx->bit = bit;
+	mtx->owner = NULL;
+	mtx->magic = mtx;
+	clear_bit(bit, field);
+	smp_mb__after_clear_bit();
+}
+EXPORT_SYMBOL(__bit_mutex_init);
+
+void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass)
+{
+	DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+	DEBUG_LOCKS_WARN_ON(field != mtx->field);
+	DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
+	mutex_acquire(&mtx->dep_map, subclass, 0, _RET_IP_);
+	wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+	lock_acquired(&mtx->dep_map);
+	mtx->owner = current_thread_info();
+}
+EXPORT_SYMBOL(__bit_mutex_lock);
+
+void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int nested)
+{
+	DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+	DEBUG_LOCKS_WARN_ON(mtx->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(mtx->field != field);
+	DEBUG_LOCKS_WARN_ON(mtx->bit != bit);
+	mtx->owner = NULL;
+	mutex_release(&mtx->dep_map, nested, _RET_IP_);
+	smp_mb__before_clear_bit();
+	clear_bit(bit, field);
+	smp_mb__after_clear_bit();
+	wake_up_bit(field, bit);
+}
+EXPORT_SYMBOL(__bit_mutex_unlock);
+
+int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+	DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+	DEBUG_LOCKS_WARN_ON(field != mtx->field);
+	DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
+	return test_bit(bit, field);
+}
+EXPORT_SYMBOL(bit_mutex_is_locked);

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 2/3] Fix fsync livelock
  2008-10-02  5:54             ` Andrew Morton
  2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
  2008-10-05 22:14               ` [PATCH 1/3] bit mutexes Mikulas Patocka
@ 2008-10-05 22:14               ` Mikulas Patocka
  2008-10-05 22:33                 ` Arjan van de Ven
  2008-10-05 22:16               ` [PATCH 3/3] Fix fsync-vs-write misbehavior Mikulas Patocka
  2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
  4 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 22:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, agk, mbroz, chris

Fix starvation in memory management.

The bug happens when one process is doing sequential buffered writes to
a block device (or file) and another process is attempting to execute
sync(), fsync() or direct-IO on that device (or file). This syncing
process will wait indefinitely until the first writing process
finishes.

For example, run these two commands:
dd if=/dev/zero of=/dev/sda1 bs=65536 &
dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

The bug is caused by sequential walking of the address space in
write_cache_pages and wait_on_page_writeback_range: if some other
process is constantly making dirty and writeback pages while these
functions run, the functions will wait on every new page, resulting in
an indefinite wait.

To fix the starvation, I declared a bit mutex starvation_barrier in
struct address_space. The bit AS_STARVATION_BARRIER in
address_space->flags is used for the actual locking. When the mutex is
taken, anyone making dirty pages on that address space should stop. The
functions that walk the address space sequentially first estimate the
number of pages to process. If they process more than this amount (plus
some small delta), it means that someone is making dirty pages under
them --- they take the mutex and hold it until they finish. When the
mutex is taken, the function balance_dirty_pages will wait and not
allow more dirty pages to be made.

The mutex is not really used as a mutex; it does not protect any
critical section. It is used as a barrier that blocks new dirty pages
from being created. I use a mutex and not a wait queue to make sure
that the starvation can't happen the other way (with a wait queue,
excessive calls to write_cache_pages and wait_on_page_writeback_range
would block balance_dirty_pages forever; with a mutex it is guaranteed
that every process eventually makes progress).

The essential property of this patch is that if the starvation doesn't
happen, no additional locks are taken and no atomic operations are
performed. So the patch shouldn't damage performance.

The indefinite starvation was observed in write_cache_pages and
wait_on_page_writeback_range. Another place where it could happen
is invalidate_inode_pages2_range.

There are two more functions that walk all the pages in an address
space, but I think they don't need this starvation protection:
truncate_inode_pages_range: it is called with i_mutex locked, so no new
pages are created under it.
__invalidate_mapping_pages: it skips locked, dirty and writeback pages,
so there's no point in protecting the function against starving on them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/inode.c              |    1 +
 include/linux/fs.h      |    3 +++
 include/linux/pagemap.h |   14 ++++++++++++++
 mm/filemap.c            |    5 +++++
 mm/page-writeback.c     |   18 +++++++++++++++++-
 mm/swap_state.c         |    1 +
 mm/truncate.c           |   10 +++++++++-
 7 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6.27-rc8-devel/fs/inode.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/fs/inode.c	2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/fs/inode.c	2008-10-04 16:48:04.000000000 +0200
@@ -216,6 +216,7 @@ void inode_init_once(struct inode *inode
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
+	bit_mutex_init(&inode->i_data.flags, AS_STARVATION_BARRIER, &inode->i_data.starvation_barrier);
 	i_size_ordered_init(inode);
 #ifdef CONFIG_INOTIFY
 	INIT_LIST_HEAD(&inode->inotify_watches);
Index: linux-2.6.27-rc8-devel/include/linux/fs.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/fs.h	2008-10-04 16:47:54.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/fs.h	2008-10-04 16:49:54.000000000 +0200
@@ -289,6 +289,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/pid.h>
 #include <linux/mutex.h>
+#include <linux/bit-mutex.h>
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 
@@ -539,6 +540,8 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+
+	struct bit_mutex	starvation_barrier;/* taken when fsync starves */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: linux-2.6.27-rc8-devel/include/linux/pagemap.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/pagemap.h	2008-10-04 16:47:54.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/pagemap.h	2008-10-04 16:48:04.000000000 +0200
@@ -21,6 +21,7 @@
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
 #define AS_MM_ALL_LOCKS	(__GFP_BITS_SHIFT + 2)	/* under mm_take_all_locks() */
+#define AS_STARVATION_BARRIER (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
@@ -424,4 +425,17 @@ static inline int add_to_page_cache(stru
 	return error;
 }
 
+#define starvation_protection_head(n)				\
+	long pages_to_process = (n);				\
+	pages_to_process += (pages_to_process >> 3) + 16;
+
+#define starvation_protection_step(mapping)			\
+	if (pages_to_process >= 0)				\
+		if (!pages_to_process--)			\
+			bit_mutex_lock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
+
+#define starvation_protection_end(mapping)			\
+	if (pages_to_process < 0)				\
+		bit_mutex_unlock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
+
 #endif /* _LINUX_PAGEMAP_H */
Index: linux-2.6.27-rc8-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/filemap.c	2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/filemap.c	2008-10-04 16:48:04.000000000 +0200
@@ -269,6 +269,7 @@ int wait_on_page_writeback_range(struct 
 	int nr_pages;
 	int ret = 0;
 	pgoff_t index;
+	starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK));
 
 	if (end < start)
 		return 0;
@@ -288,6 +289,8 @@ int wait_on_page_writeback_range(struct 
 			if (page->index > end)
 				continue;
 
+			starvation_protection_step(mapping);
+
 			wait_on_page_writeback(page);
 			if (PageError(page))
 				ret = -EIO;
@@ -296,6 +299,8 @@ int wait_on_page_writeback_range(struct 
 		cond_resched();
 	}
 
+	starvation_protection_end(mapping);
+
 	/* Check for outstanding write errors */
 	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
 		ret = -ENOSPC;
Index: linux-2.6.27-rc8-devel/mm/page-writeback.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/page-writeback.c	2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/page-writeback.c	2008-10-04 16:48:04.000000000 +0200
@@ -435,6 +435,14 @@ static void balance_dirty_pages(struct a
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
+	if (unlikely(bit_mutex_is_locked(&mapping->flags, AS_STARVATION_BARRIER,
+	    &mapping->starvation_barrier))) {
+		bit_mutex_lock(&mapping->flags, AS_STARVATION_BARRIER,
+			&mapping->starvation_barrier);
+		bit_mutex_unlock(&mapping->flags, AS_STARVATION_BARRIER,
+			&mapping->starvation_barrier);
+	}
+
 	for (;;) {
 		struct writeback_control wbc = {
 			.bdi		= bdi,
@@ -876,6 +884,7 @@ int write_cache_pages(struct address_spa
 	pgoff_t end;		/* Inclusive */
 	int scanned = 0;
 	int range_whole = 0;
+	starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE));
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -902,7 +911,11 @@ retry:
 
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
+
+			starvation_protection_step(mapping);
+
+			page = pvec.pages[i];
 
 			/*
 			 * At this point we hold neither mapping->tree_lock nor
@@ -963,6 +976,9 @@ retry:
 
 	if (wbc->range_cont)
 		wbc->range_start = index << PAGE_CACHE_SHIFT;
+
+	starvation_protection_end(mapping);
+
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);
Index: linux-2.6.27-rc8-devel/mm/swap_state.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/swap_state.c	2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/swap_state.c	2008-10-04 17:14:03.000000000 +0200
@@ -43,6 +43,7 @@ struct address_space swapper_space = {
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
+	.starvation_barrier = __BIT_MUTEX_INITIALIZER(&swapper_space.flags, AS_STARVATION_BARRIER, swapper_space.starvation_barrier),
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
Index: linux-2.6.27-rc8-devel/mm/truncate.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/truncate.c	2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/truncate.c	2008-10-04 16:48:04.000000000 +0200
@@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
 	int ret2 = 0;
 	int did_range_unmap = 0;
 	int wrapped = 0;
+	starvation_protection_head(mapping->nrpages);
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -399,9 +400,13 @@ int invalidate_inode_pages2_range(struct
 		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			struct page *page;
 			pgoff_t page_index;
 
+			starvation_protection_step(mapping);
+
+			page = pvec.pages[i];
+
 			lock_page(page);
 			if (page->mapping != mapping) {
 				unlock_page(page);
@@ -449,6 +454,9 @@ int invalidate_inode_pages2_range(struct
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+
+	starvation_protection_end(mapping);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 3/3] Fix fsync-vs-write misbehavior
  2008-10-02  5:54             ` Andrew Morton
                                 ` (2 preceding siblings ...)
  2008-10-05 22:14               ` [PATCH 2/3] Fix fsync livelock Mikulas Patocka
@ 2008-10-05 22:16               ` Mikulas Patocka
  2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
  4 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 22:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, agk, mbroz, chris

Fix a violation of sync()/fsync() semantics. The previous code walked
up to mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, this could lead to
misbehavior. Example: there are two pages in the address space, with
indices 4 and 5, both dirty. Someone calls __filemap_fdatawrite_range,
which sets .nr_to_write = 4. Meanwhile, some other process creates
dirty pages 0, 1, 2, 3. __filemap_fdatawrite_range writes pages 0, 1,
2, 3, finds out that it reached the limit and exits.

Result: pages that were dirty before __filemap_fdatawrite_range was
invoked were not written.

With starvation protection from the previous patch, this
mapping->nrpages * 2 logic is no longer needed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/filemap.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c	2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c	2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
  * opposed to a regular memory cleansing writeback.  The difference between
  * these two operations is that if a dirty page/buffer is encountered, it must
  * be waited upon, and not just skipped over.
+ *
+ * Because new dirty pages can be created while this is executing, the
+ * mapping->nrpages * 2 condition is unsafe. If we are doing a data integrity
+ * write, we must write all the pages. The AS_STARVATION_BARRIER bit will
+ * eventually prevent the creation of more dirty pages, avoiding starvation.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = sync_mode == WB_SYNC_NONE ? mapping->nrpages * 2 : LONG_MAX,
 		.range_start = start,
 		.range_end = end,
 	};

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 22:14               ` [PATCH 2/3] Fix fsync livelock Mikulas Patocka
@ 2008-10-05 22:33                 ` Arjan van de Ven
  2008-10-05 23:02                   ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-05 22:33 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Fix starvation in memory management.
> 
> The bug happens when one process is doing sequential buffered writes
> to a block device (or file) and another process is attempting to
> execute sync(), fsync() or direct-IO on that device (or file). This
> syncing process will wait indefinitely until the first writing
> process finishes.
> 
> For example, run these two commands:
> dd if=/dev/zero of=/dev/sda1 bs=65536 &
> dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> 
> The bug is caused by sequential walking of address space in
> write_cache_pages and wait_on_page_writeback_range: if some other
> process is constantly making dirty and writeback pages while these
> functions run, the functions will wait on every new page, resulting in
> indefinite wait.

are you sure?
isn't the right fix to just walk the file pages only once?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 22:33                 ` Arjan van de Ven
@ 2008-10-05 23:02                   ` Mikulas Patocka
  2008-10-05 23:07                     ` Arjan van de Ven
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 23:02 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > Fix starvation in memory management.
> > 
> > The bug happens when one process is doing sequential buffered writes
> > to a block device (or file) and another process is attempting to
> > execute sync(), fsync() or direct-IO on that device (or file). This
> > syncing process will wait indefinitely until the first writing
> > process finishes.
> > 
> > For example, run these two commands:
> > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > 
> > The bug is caused by sequential walking of address space in
> > write_cache_pages and wait_on_page_writeback_range: if some other
> > process is constantly making dirty and writeback pages while these
> > functions run, the functions will wait on every new page, resulting in
> > indefinite wait.
> 
> are you sure?
> isn't the right fix to just walk the file pages only once?

It walks the pages only once --- and waits on each of them. But because
new pages are constantly appearing under it, that "walk only once"
becomes an infinite loop (well, a finite one: until the whole disk is
written).

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 23:02                   ` Mikulas Patocka
@ 2008-10-05 23:07                     ` Arjan van de Ven
  2008-10-05 23:18                       ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-05 23:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > are you sure?
> > isn't the right fix to just walk the file pages only once?
> 
> It walks the pages only once --- and waits on each of them. But
> because new pages are constantly appearing under it, that "walk only
> once" becomes an infinite loop (well, a finite one: until the whole
> disk is written).

well. fsync() promises that everything that's dirty at the time of the
call will hit the disk. That is not something you can compromise.
The only way out would be to not allow new dirtying during an fsync()...
which is imo even worse.

Submit them all in one go, then wait, should not be TOO bad. Unless a
lot was dirty already, but then you go back to "but it has to go to
disk".


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 23:07                     ` Arjan van de Ven
@ 2008-10-05 23:18                       ` Mikulas Patocka
  2008-10-05 23:28                         ` Arjan van de Ven
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-05 23:18 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > > are you sure?
> > > isn't the right fix to just walk the file pages only once?
> > 
> > It walks the pages only once --- and waits on each on them. But
> > because new pages are contantly appearing under it, that "walk only
> > once" becomes infinite loop (well, finite, until the whole disk is
> > written).
> 
> well. fsync() promises that everything that's dirty at the time of the
> call will hit the disk. That is not something you can compromise.
> The only way out would be to not allow new dirtying during an fsync()...
> which is imo even worse.
> 
> Submit them all in one go, then wait, should not be TOO bad. Unless a
> lot was dirty already, but then you go back to "but it has to go to
> disk".

The problem here is that you have two processes --- one is writing, the
other is simultaneously syncing. The syncing process can't distinguish
between the pages that were created before fsync() was invoked (it has
to wait on them) and the pages that were created while fsync() was
running (it doesn't have to wait on them) --- so it waits on them all.
The result is a livelock: it waits indefinitely, because more and more
pages are being created.

The patch changes it so that if it waits long enough, it stops the other 
writers creating dirty pages.

Or, how otherwise would you implement "Submit them all in one go, then 
wait"? The current code is:
you grab page 0, see it is under writeback, wait on it
you grab page 1, see it is under writeback, wait on it
you grab page 2, see it is under writeback, wait on it
you grab page 3, see it is under writeback, wait on it
...
--- and the other process is just making more and more writeback pages
while your waiting routine runs. So the waiting is indefinite.
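
In code, the schematic shape of that loop (simplified from the
wait_on_page_writeback_range() hunk quoted earlier in this thread, with
the same variables; pagevec sizing and error handling omitted) is:

	index = start;
	while (index <= end &&
	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK, PAGEVEC_SIZE))) {
		for (i = 0; i < nr_pages; i++)
			wait_on_page_writeback(pvec.pages[i]);	/* may sleep */
		pagevec_release(&pvec);
	}
	/* the writer keeps tagging new writeback pages ahead of index,
	   so the lookup keeps succeeding until the whole device is
	   written */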

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 23:18                       ` Mikulas Patocka
@ 2008-10-05 23:28                         ` Arjan van de Ven
  2008-10-06  0:01                           ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-05 23:28 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> 
> 
> The problem here is that you have two processes --- one is writing,
> the other is simultaneously syncing. The syncing process can't
> distinguish the pages that were created before fsync() was invoked
> (it has to wait on them) and the pages that were created while
> fsync() was running (it doesn't have to wait on them) --- so it waits
> on them all.

for pages it already passed that's not true.
but yes while you walk, you may find new ones. tough luck.
> 
> Or, how otherwise would you implement "Submit them all in one go,
> then wait"? The current code is:
> you grab page 0, see it is under writeback, wait on it
> you grab page 1, see it is under writeback, wait on it
> you grab page 2, see it is under writeback, wait on it
> you grab page 3, see it is under writeback, wait on it

it shouldn't be doing this. It should be "submit the lot"
"wait on the lot that got submitted".

keeping away new writers is just shifting the latency to an innocent
party (since the fsync() is just bloody expensive, the person doing it
should be paying the price, not the rest of the system), and a grave
mistake.

If the fsync() implementation isn't smart enough, sure, let's improve
it. But not by shifting latency around... let's make it more efficient
at submitting IO.
If we need to invent something like "chained IO" where if you wait on
the last of the chain, you wait on the entire chain, so be it.
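
(To sketch the shape of that idea: all the names below are invented,
this is not existing kernel code. Each inode would count submitted and
completed write bios, and fsync() would wait only for the IO issued
before it started.)

struct inode_io_chain {
	atomic_long_t		submitted;	/* bumped when a write bio is issued */
	atomic_long_t		completed;	/* bumped in the bio completion handler */
	wait_queue_head_t	wait;		/* woken from the completion handler */
};

/* fsync side: snapshot the submission count, wait for older IO only.
   Note: this sketch assumes completions are counted in submission
   order; a real implementation would have to cope with out-of-order
   completion. */
static void wait_for_prior_io(struct inode_io_chain *c)
{
	long target = atomic_long_read(&c->submitted);

	wait_event(c->wait, atomic_long_read(&c->completed) >= target);
}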


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-05 23:28                         ` Arjan van de Ven
@ 2008-10-06  0:01                           ` Mikulas Patocka
  2008-10-06  0:30                             ` Arjan van de Ven
  2008-10-06  2:51                             ` Dave Chinner
  0 siblings, 2 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06  0:01 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > 
> > 
> > The problem here is that you have two processes --- one is writing,
> > the other is simultaneously syncing. The syncing process can't
> > distinguish the pages that were created before fsync() was invoked
> > (it has to wait on them) and the pages that were created while
> > fsync() was running (it doesn't have to wait on them) --- so it waits
> > on them all.
> 
> for pages it already passed that's not true.
> but yes while you walk, you may find new ones. tough luck.
> > 
> > Or, how otherwise would you implement "Submit them all in one go,
> > then wait"? The current code is:
> > you grab page 0, see it is under writeback, wait on it
> > you grab page 1, see it is under writeback, wait on it
> > you grab page 2, see it is under writeback, wait on it
> > you grab page 3, see it is under writeback, wait on it
> 
> it shouldn't be doing this. It should be "submit the lot"
> "wait on the lot that got submitted".

And how do you want to implement that "wait on the lot that got 
submitted". Note that you can't allocate an array or linked list that 
holds pointers to all submitted pages. And if you add anything to the page 
structure or radix tree, you have to serialize concurrent sync,fsync 
calls, otherwise they'd fight over these bits.

> keeping away new writers is just shifting the latency to an innocent
> party (since the fsync() is just bloody expensive, the person doing it
> should be paying the price, not the rest of the system), and a grave
> mistake.

I assume that if very few people have complained about the livelock till
now, very few people will see degraded write performance. My patch
blocks the writes only if the livelock happens, so if the livelock
doesn't happen in the unpatched kernel for most people, the patch won't
make it worse.

> If the fsync() implementation isn't smart enough, sure, let's improve
> it. But not by shifting latency around... let's make it more efficient
> at submitting IO.
> If we need to invent something like "chained IO" where if you wait on
> the last of the chain, you wait on the entire chain, so be it.

This looks madly complicated. And ineffective, because if some page was 
submitted before fsync() was invoked, and is under writeback while fsync() 
is called, fsync() still has to wait on it.

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-03 15:52           ` application syncing options (was Re: [PATCH] Memory management livelock) david
@ 2008-10-06  0:04             ` Mikulas Patocka
  2008-10-06  0:19               ` david
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06  0:04 UTC (permalink / raw)
  To: david; +Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris



On Fri, 3 Oct 2008, david@lang.hm wrote:

> On Fri, 3 Oct 2008, Nick Piggin wrote:
> 
> > > *What* is, forever? Data integrity syncs should have pages operated on
> > > in-order, until we get to the end of the range. Circular writeback could
> > > go through again, possibly, but no more than once.
> > 
> > OK, I have been able to reproduce it somewhat. It is not a livelock,
> > but what is happening is that direct IO read basically does an fsync
> > on the file before performing the IO. The fsync gets stuck behind the
> > dd that is dirtying the pages, and ends up following behind it and
> > doing all its IO for it.
> > 
> > The following patch avoids the issue for direct IO, by using the range
> > syncs rather than trying to sync the whole file.
> > 
> > The underlying problem I guess is unchanged. Is it really a problem,
> > though? The way I'd love to solve it is actually by adding another bit
> > or two to the pagecache radix tree,  that can be used to transiently tag
> > the tree for future operations. That way we could record the dirty and
> > writeback pages up front, and then only bother with operating on them.
> > 
> > That's *if* it really is a problem. I don't have much pity for someone
> > doing buffered IO and direct IO to the same pages of the same file :)
> 
> I've seen lots of discussions here about different options in syncing. In this
> case a fix is to do an fsync of a range.

It fixes the bug in concurrent direct read+buffered write, but won't fix
the bug with concurrent sync+buffered write.

> I've also seen discussions of how the
> kernel filesystem code can do ordered writes without having to wait for them
> by using barriers. Is this capability exported to userspace? If so,
> could you point me at documentation for it?

It isn't. And it is good that it isn't --- the more complicated the API,
the more maintenance work.

Mikulas

> David Lang
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-06  0:04             ` Mikulas Patocka
@ 2008-10-06  0:19               ` david
  2008-10-06  3:42                 ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: david @ 2008-10-06  0:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008, Mikulas Patocka wrote:

> On Fri, 3 Oct 2008, david@lang.hm wrote:
>
>> I've also seen discussions of how the
>> kernel filesystem code can do ordered writes without having to wait for them
>> by using barriers. Is this capability exported to userspace? If so,
>> could you point me at documentation for it?
>
> It isn't. And it is good that it isn't --- the more complicated the
> API, the more maintenance work.

I can understand that most software would not want to deal with
complications like this, but for things that have requirements similar
to journaling filesystems (databases for example) it would seem that
there would be advantages to exposing these capabilities.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  0:01                           ` Mikulas Patocka
@ 2008-10-06  0:30                             ` Arjan van de Ven
  2008-10-06  3:30                               ` Mikulas Patocka
  2008-10-08 10:56                               ` Pavel Machek
  2008-10-06  2:51                             ` Dave Chinner
  1 sibling, 2 replies; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-06  0:30 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> I assume that if very few people complained about the livelock till
> now, very few people will see degraded write performance. My patch
> blocks the writes only if the livelock happens, so if the livelock
> doesn't happen in the unpatched kernel for most people, the patch won't
> make it worse.

I object to calling this a livelock. It's not.
And yes, fsync is slow and lots of people are seeing that.
It's not helped by how ext3 is implemented (where fsync is effectively
the equivalent of a sync in many cases).
But again, moving the latency to "innocent" parties is not acceptable.

> 
> > If the fsync() implementation isn't smart enough, sure, let's improve
> > it. But not by shifting latency around... let's make it more
> > efficient at submitting IO.
> > If we need to invent something like "chained IO" where if you wait
> > on the last of the chain, you wait on the entire chain, so be it.
> 
> This looks madly complicated. And ineffective, because if some page
> was submitted before fsync() was invoked, and is under writeback
> while fsync() is called, fsync() still has to wait on it.

so?
just make a chain per inode always...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  0:01                           ` Mikulas Patocka
  2008-10-06  0:30                             ` Arjan van de Ven
@ 2008-10-06  2:51                             ` Dave Chinner
  1 sibling, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2008-10-06  2:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Arjan van de Ven, Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, Oct 05, 2008 at 08:01:46PM -0400, Mikulas Patocka wrote:
> This looks madly complicated. And ineffective, because if some page was 
> submitted before fsync() was invoked, and is under writeback while fsync() 
> is called, fsync() still has to wait on it.

fsync() waiting on pre-issued writeback pages is the correct
behaviour.

IOW, if the page is under writeback at the time an fsync() is
issued (e.g. issued by pdflush), the page was *not clean* at the
time the fsync() was called and hence must be clean when fsync()
returns. fsync() needs to wait for all pages under I/O at the time
it is called, not just the dirty pages it issues I/O on.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  0:30                             ` Arjan van de Ven
@ 2008-10-06  3:30                               ` Mikulas Patocka
  2008-10-06  4:20                                 ` Arjan van de Ven
  2008-10-08 10:56                               ` Pavel Machek
  1 sibling, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06  3:30 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > I assume that if very few people complained about the livelock till
> > now, very few people will see degraded write performance. My patch
> > blocks the writes only if the livelock happens, so if the livelock
> > doesn't happen in the unpatched kernel for most people, the patch won't
> > make it worse.
> 
> I object to calling this a livelock. It's not. 

It unlocks itself when the whole disk is written, and that can take
several hours (or days, if you have a many-terabyte array). So formally
it is not a livelock; from the user's experience it is --- he sees an
unkillable process in 'D' state for many hours.

> And yes, fsync is slow and lots of people are seeing that.
> It's not helped by how ext3 is implemented (where fsync is effectively
> the equivalent of a sync in many cases).
> But again, moving the latency to "innocent" parties is not acceptable.
> 
> > 
> > > If the fsync() implementation isn't smart enough, sure, lets improve
> > > it. But not by shifting latency around... lets make it more
> > > efficient at submitting IO.
> > > If we need to invent something like "chained IO" where if you wait
> > > on the last of the chain, you wait on the entirely chain, so be it.
> > 
> > This looks madly complicated. And ineffective, because if some page
> > was submitted before fsync() was invoked, and is under writeback
> > while fsync() is called, fsync() still has to wait on it.
> 
> so?
> just make a chain per inode always...

The point is that many fsync()s may run in parallel and you have just
one inode and just one chain. And if you add a two-word list_head to a
page, to link it on this list, many developers will hate it for
increasing the page's size.

See the work done by Nick Piggin elsewhere in this thread. He uses just
one bit in the radix tree to mark pages to process. But he needs to
serialize all syncs on a given file; they no longer run in parallel.

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-06  0:19               ` david
@ 2008-10-06  3:42                 ` Mikulas Patocka
  2008-10-07  3:37                   ` david
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06  3:42 UTC (permalink / raw)
  To: david; +Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008, david@lang.hm wrote:

> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
> 
> > On Fri, 3 Oct 2008, david@lang.hm wrote:
> > 
> > > I've also seen discussions of how the
> > > kernel filesystem code can do ordered writes without having to wait for
> > > them
> > > by using barriers. Is this capability exported to userspace? If so,
> > > could you point me at documentation for it?
> > 
> > It isn't. And it is good that it isn't --- the more complicated the
> > API, the more maintenance work.
> 
> I can understand that most software would not want to deal with complications
> like this, but for things that have requirements similar to journaling
> filesystems (databases for example) it would seem that there would be
> advantages to exposing these capabilities.
> 
> David Lang

If you invent a new interface that allows submitting several ordered IOs
from userspace, it will require excessive maintenance overhead over a
long period of time. So it is only justified if the performance
improvement is comparably large.

It should not be a case of "you improve performance by 10% on some
synthetic benchmark in one application that was rewritten to support the
new interface" while creating a few more security vulnerabilities
(because of the complexity of the interface) and damaging overall Linux
progress, because everyone is busy catching bugs in the new interface
and checking it for correctness.

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  3:30                               ` Mikulas Patocka
@ 2008-10-06  4:20                                 ` Arjan van de Ven
  2008-10-06 13:00                                   ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-06  4:20 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008 23:30:51 -0400 (EDT
> The point is that many fsync()s may run in parallel and you have just
> one inode and just one chain. And if you add two-word list_head to a
> page, to link it on this list, many developers will hate it for
> increasing its size.

why to a page?
a list head in the inode and chain up the bios....
or not make an actual list but just an "is the previous one done" thing.
it's not all that hard to get something that works on a per-inode basis,
that gives "wait for all io up to this one".



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  4:20                                 ` Arjan van de Ven
@ 2008-10-06 13:00                                   ` Mikulas Patocka
  2008-10-06 13:50                                     ` Arjan van de Ven
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06 13:00 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 23:30:51 -0400 (EDT
> > The point is that many fsync()s may run in parallel and you have just
> > one inode and just one chain. And if you add two-word list_head to a
> > page, to link it on this list, many developers will hate it for
> > increasing its size.
> 
> why to a page?
> a list head in the inode and chain up the bios....

And if you want to wait for a bio submitted by a different process? 
There's no way you can find the bio from the page.

> or not make an actual list but just a "is the previous one done" thing
> it's not all that hard to get something that works on a per inode basis,
> that gives "wait for all io upto this one".

So code it :)

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06 13:00                                   ` Mikulas Patocka
@ 2008-10-06 13:50                                     ` Arjan van de Ven
  2008-10-06 20:44                                       ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Arjan van de Ven @ 2008-10-06 13:50 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Mon, 6 Oct 2008 09:00:14 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> 
> > On Sun, 5 Oct 2008 23:30:51 -0400 (EDT
> > > The point is that many fsync()s may run in parallel and you have
> > > just one inode and just one chain. And if you add two-word
> > > list_head to a page, to link it on this list, many developers
> > > will hate it for increasing its size.
> > 
> > why to a page?
> > a list head in the inode and chain up the bios....
> 
> And if you want to wait for a bio submitted by a different process? 
> There's no way you can find the bio from the page.

the point is that the kernel would always chain it to the inode,
independent of who submits it or when


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06 13:50                                     ` Arjan van de Ven
@ 2008-10-06 20:44                                       ` Mikulas Patocka
  0 siblings, 0 replies; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-06 20:44 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Mon, 6 Oct 2008, Arjan van de Ven wrote:

> On Mon, 6 Oct 2008 09:00:14 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> > 
> > > On Sun, 5 Oct 2008 23:30:51 -0400 (EDT
> > > > The point is that many fsync()s may run in parallel and you have
> > > > just one inode and just one chain. And if you add two-word
> > > > list_head to a page, to link it on this list, many developers
> > > > will hate it for increasing its size.
> > > 
> > > why to a page?
> > > a list head in the inode and chain up the bios....
> > 
> > And if you want to wait for a bio submitted by a different process? 
> > There's no way you can find the bio from the page.
> 
> the point is that the kernel would always chain it to the inode,
> independent of who submits it or when

If you add a list to an inode, you need to protect it with a spinlock.
So you take one more spinlock for every write bio submitted --- a lot of
developers would hate it.

Another problem: how do you want to walk all the dirty pages and submit
a bio for them?

Allocating and submitting a bio can block (if you run out of some
mempool), and in this case it waits until some other bio is finished.
During this time, more dirty pages can be created.

Also, if you find a page that is both dirty and under writeback, you
need to wait until the writeback finishes and then initiate another
writeback (because the old writeback may be writing stale data). Again
you block, and more dirty pages can appear.

And if you block and more dirty pages appear, you are prone to the 
livelock.

[ In Nick Piggin's patch, one needs to lock the whole address space, 
mark the dirty pages in one non-blocking pass, and write the marked pages 
again in a blocking pass --- so that if more dirty pages appear while bios 
are being submitted, the new pages will be skipped ]
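
A rough sketch of that two-pass scheme (simplified pseudo-C; the page
iterators and the tag helpers are hypothetical, and this is not the
actual patch):

  /* pass 1 tags the dirty pages that exist now; pass 2 writes only the
   * tagged pages, so pages dirtied during the blocking writeout are
   * skipped and cannot prolong the sync indefinitely */
  void sync_mapping_two_pass(struct address_space *mapping)
  {
          struct page *page;

          /* non-blocking pass: mark every currently-dirty page */
          for_each_dirty_page(mapping, page)            /* hypothetical */
                  tag_page_to_write(mapping, page);     /* hypothetical */

          /* blocking pass: write only the pages tagged above */
          for_each_tagged_page(mapping, page) {         /* hypothetical */
                  lock_page(page);
                  clear_page_tag(mapping, page);        /* hypothetical */
                  write_one_page(page, 1);  /* takes a locked page; may block */
          }
  }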

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-06  3:42                 ` Mikulas Patocka
@ 2008-10-07  3:37                   ` david
  2008-10-07 15:44                     ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: david @ 2008-10-07  3:37 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun, 5 Oct 2008, Mikulas Patocka wrote:

> On Sun, 5 Oct 2008, david@lang.hm wrote:
>
>> On Sun, 5 Oct 2008, Mikulas Patocka wrote:
>>
>>> On Fri, 3 Oct 2008, david@lang.hm wrote:
>>>
>>>> I've also seen discussions of how the kernel filesystem code can do
>>>> ordered writes without having to wait for them with the use of
>>>> barriers. Is this capability exported to userspace? If so, could you
>>>> point me at documentation for it?
>>>
>>> It isn't. And it is good that it isn't --- the more complicated the API,
>>> the more maintenance work.
>>
>> I can understand that most software would not want to deal with
>> complications like this, but for things that have requirements similar to
>> journaling filesystems (databases for example) it would seem that there
>> would be advantages to exposing these capabilities.
>>
>> David Lang
>
> If you invent a new interface that allows submitting several ordered IOs
> from userspace, it will require excessive maintenance overhead over a long
> period of time. So it is only justified if the performance improvement is
> excessive as well.
>
> It should not be like "here you improve performance by 10% on some
> synthetic benchmark in one application that was rewritten to support the
> new interface" and then create a few more security vulnerabilities
> (because of the complexity of the interface) and damage overall Linux
> progress, because everyone is catching bugs in the new interface and
> checking it for correctness.

the same benchmarks that show that it's far better for the in-kernel 
filesystem code to use write barriers should apply to FUSE filesystems.

this isn't a matter of a few % in performance: if an application is 
sync-limited in a way that can be converted to write-ordered, the potential 
is for the application to speed up by many times.

programs that maintain indexes or caches of data that lives in other files 
will be able to write data && barrier && write index && fsync and double 
their performance vs write data && fsync && write index && fsync
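
as a sketch, the two orderings compared above, in userspace C (fsbarrier()
is hypothetical --- no such syscall exists today):

  #include <unistd.h>

  extern void fsbarrier(void);   /* hypothetical ordering primitive */

  /* today: two full syncs, each waiting for the disk */
  void update_with_fsync(int data_fd, int idx_fd)
  {
          write(data_fd, "data", 4);
          fsync(data_fd);          /* wait for the data write */
          write(idx_fd, "idx", 3);
          fsync(idx_fd);           /* wait for the index write */
  }

  /* with an ordering barrier: one wait instead of two */
  void update_with_barrier(int data_fd, int idx_fd)
  {
          write(data_fd, "data", 4);
          fsbarrier();             /* order the writes, don't wait */
          write(idx_fd, "idx", 3);
          fsync(idx_fd);           /* the only wait */
  }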

databases can potentially do even better. today they need to fsync data to 
disk before they can update their journal to indicate that the data has 
been written; with a barrier they could order the writes so that the write 
to the journal doesn't happen until the writes of the data have completed. 
they would never need to call fsync at all (when emptying the journal)

for systems without solid-state drives or battery-backed caches, the 
ability to eliminate fsyncs by being able to rely on the order of the 
writes is a huge benefit.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-07  3:37                   ` david
@ 2008-10-07 15:44                     ` Mikulas Patocka
  2008-10-07 17:16                       ` david
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-07 15:44 UTC (permalink / raw)
  To: david; +Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

> > If you invent a new interface that allows submitting several ordered IOs
> > from userspace, it will require excessive maintenance overhead over a long
> > period of time. So it is only justified if the performance improvement is
> > excessive as well.
> > 
> > It should not be like "here you improve performance by 10% on some
> > synthetic benchmark in one application that was rewritten to support the
> > new interface" and then create a few more security vulnerabilities
> > (because of the complexity of the interface) and damage overall Linux
> > progress, because everyone is catching bugs in the new interface and
> > checking it for correctness.
> 
> the same benchmarks that show that it's far better for the in-kernel
> filesystem code to use write barriers should apply to FUSE filesystems.

FUSE is slow by design, and it is used in cases where performance isn't 
crucial.

> this isn't a matter of a few % in performance: if an application is
> sync-limited in a way that can be converted to write-ordered, the potential
> is for the application to speed up by many times.
> 
> programs that maintain indexes or caches of data that lives in other files
> will be able to write data && barrier && write index && fsync and double their
> performance vs write data && fsync && write index && fsync

They can do: write the data with O_SYNC; write another piece of data with 
O_SYNC.

And the only difference from barriers is the waiting time after the first 
O_SYNC before the second I/O is submitted (such a delay wouldn't happen 
with barriers).

And since I/O delay is in milliseconds and process wakeup time is tens of 
microseconds, it doesn't look like eliminating the process wakeup time 
would gain more than a few percent.
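
A sketch of that O_SYNC variant, using only standard POSIX flags (error
handling omitted):

  #include <fcntl.h>
  #include <unistd.h>

  /* each O_SYNC write returns only after the data is on stable storage,
   * so the ordering is implied by program order */
  void ordered_writes(const char *data_path, const char *idx_path)
  {
          int dfd = open(data_path, O_WRONLY | O_SYNC);
          int ifd = open(idx_path, O_WRONLY | O_SYNC);

          write(dfd, "data", 4);  /* returns after the data hits the disk */
          write(ifd, "idx", 3);   /* issued only after the first completes */

          close(dfd);
          close(ifd);
  }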

> databases can potentially do even better. today they need to fsync data to
> disk before they can update their journal to indicate that the data has been
> written; with a barrier they could order the writes so that the write to the
> journal doesn't happen until the writes of the data have completed. they
> would never need to call fsync at all (when emptying the journal)

Good databases can pack several user transactions into one fsync() write. 
If the database server is properly engineered, it accumulates all user 
transactions committed so far into one chunk, writes that chunk with one 
fsync() call and then reports successful commit to the clients.
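
A condensed sketch of such batching (illustrative only; bounds checks are
omitted, and a real engine would overlap building one batch with the
previous batch's fsync()):

  #include <pthread.h>
  #include <string.h>
  #include <unistd.h>

  static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
  static char log_buf[1 << 20];
  static size_t log_len;

  /* many client threads append their commit records to a shared buffer;
   * one fsync() then covers the whole accumulated batch */
  void commit(int log_fd, const char *rec, size_t len)
  {
          pthread_mutex_lock(&log_lock);
          memcpy(log_buf + log_len, rec, len);
          log_len += len;
          write(log_fd, log_buf, log_len);
          fsync(log_fd);          /* one fsync for all pending commits */
          log_len = 0;
          pthread_mutex_unlock(&log_lock);
          /* only now report success to the client */
  }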

So if you increase fsync() latency, it should have no effect on 
transactional throughput --- only on the latency of transactions. 
Similarly, if you decrease fsync() latency, it won't increase the number 
of processed transactions.

Certainly, there are primitive embedded database libraries that fsync() 
after each transaction, but they don't have good performance anyway.

> for systems without solid-state drives or battery-backed caches, the ability
> to eliminate fsyncs by being able to rely on the order of the writes is a huge
> benefit.

I may ask --- where are the applications that are actually hurt by slow 
fsync() latency? Databases are not among them; they batch transactions.

If you want to improve things, you can try:
* implement O_DSYNC (like O_SYNC, but without synchronously updating the 
inode mtime)
* implement range_fsync and range_fdatasync (sync on a file range --- the 
kernel already has support for that, you just need to add a syscall)
* turn on the FUA bit for O_DSYNC writes; that eliminates the need to 
flush the drive cache in the O_DSYNC call

--- these are definitely less invasive than a new I/O submission interface.
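
For the range sync, note that the kernel has exported something close
since 2.6.17: sync_file_range(). It syncs data only and does not flush
the drive's write cache. A usage sketch:

  #define _GNU_SOURCE
  #include <fcntl.h>

  /* write out and wait on just the bytes [off, off + len) of fd */
  int range_sync(int fd, off64_t off, off64_t len)
  {
          return sync_file_range(fd, off, len,
                                 SYNC_FILE_RANGE_WAIT_BEFORE |
                                 SYNC_FILE_RANGE_WRITE |
                                 SYNC_FILE_RANGE_WAIT_AFTER);
  }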

Mikulas

> David Lang
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: application syncing options (was Re: [PATCH] Memory management livelock)
  2008-10-07 15:44                     ` Mikulas Patocka
@ 2008-10-07 17:16                       ` david
  0 siblings, 0 replies; 67+ messages in thread
From: david @ 2008-10-07 17:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Nick Piggin, Andrew Morton, linux-kernel, agk, mbroz, chris

On Tue, 7 Oct 2008, Mikulas Patocka wrote:

>>> If you invent a new interface that allows submitting several ordered IOs
>>> from userspace, it will require excessive maintenance overhead over a long
>>> period of time. So it is only justified if the performance improvement is
>>> excessive as well.
>>>
>>> It should not be like "here you improve performance by 10% on some
>>> synthetic benchmark in one application that was rewritten to support the
>>> new interface" and then create a few more security vulnerabilities
>>> (because of the complexity of the interface) and damage overall Linux
>>> progress, because everyone is catching bugs in the new interface and
>>> checking it for correctness.
>>
>> the same benchmarks that show that it's far better for the in-kernel
>> filesystem code to use write barriers should apply to FUSE filesystems.
>
> FUSE is slow by design, and it is used in cases where performance isn't
> crucial.

FUSE is slow, but I don't believe that it's a design goal for it to be 
slow; it's a limitation of the implementation. so things that could speed 
it up would be a good thing.

>> this isn't a matter of a few % in performance: if an application is
>> sync-limited in a way that can be converted to write-ordered, the potential is
>> for the application to speed up by many times.
>>
>> programs that maintain indexes or caches of data that lives in other files
>> will be able to write data && barrier && write index && fsync and double their
>> performance vs write data && fsync && write index && fsync
>
> They can do: write the data with O_SYNC; write another piece of data with
> O_SYNC.
>
> And the only difference from barriers is the waiting time after the first
> O_SYNC before the second I/O is submitted (such a delay wouldn't happen with
> barriers).
>
> And since I/O delay is in milliseconds and process wakeup time is tens of
> microseconds, it doesn't look like eliminating the process wakeup time
> would gain more than a few percent.

each sync write needs to wait for a disk rotation (and a seek if you are 
writing to different files). if you only do two writes you save one disk 
rotation; if you do five writes you save four disk rotations.
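
(as a rough worked number: a 7200 RPM disk completes one rotation in
60 s / 7200 = ~8.3 ms, so saving four rotations saves roughly 33 ms ---
orders of magnitude more than the tens of microseconds of process wakeup
time discussed above)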

>> databases can potentially do even better. today they need to fsync data to
>> disk before they can update their journal to indicate that the data has been
>> written; with a barrier they could order the writes so that the write to the
>> journal doesn't happen until the writes of the data have completed. they
>> would never need to call fsync at all (when emptying the journal)
>
> Good databases can pack several user transactions into one fsync() write.
> If the database server is properly engineered, it accumulates all user
> transactions committed so far into one chunk, writes that chunk with one
> fsync() call and then reports successful commit to the clients.

if there are multiple users doing transactions at the same time, they will 
benefit from overlapping the fsyncs. but each user session cannot complete 
its transaction until the fsync completes.

> So if you increase fsync() latency, it should have no effect on
> transactional throughput --- only on the latency of transactions.
> Similarly, if you decrease fsync() latency, it won't increase the number
> of processed transactions.

only if you have all your transactions happening in parallel. in the real 
world, programs sometimes need to wait for one transaction to complete so 
that they can do the next one.

> Certainly, there are primitive embedded database libraries that fsync()
> after each transaction, but they don't have good performance anyway.
>
>> for systems without solid-state drives or battery-backed caches, the ability
>> to eliminate fsyncs by being able to rely on the order of the writes is a huge
>> benifit.
>
> I may ask --- where are the applications that are actually hurt by slow
> fsync() latency? Databases are not among them; they batch transactions.
>
> If you want to improve things, you can try:
> * implement O_DSYNC (like O_SYNC, but without synchronously updating the
> inode mtime)
> * implement range_fsync and range_fdatasync (sync on a file range --- the
> kernel already has support for that, you just need to add a syscall)
> * turn on the FUA bit for O_DSYNC writes; that eliminates the need to
> flush the drive cache in the O_DSYNC call
>
> --- these are definitely less invasive than a new I/O submission interface.

but all of these require that the application stop and wait for each 
separate write to complete before proceeding to the next step.

if this doesn't matter, then why the big push to have the in-kernel 
filesystems start using barriers? I understood that barriers brought large 
performance increases where they are used, simply by avoiding the need to 
drain the entire request queue --- and you are saying that applications 
would not only need to wait for the queue to flush, but also for the disk 
to acknowledge the write.

syncs are slow, in some cases _very_ slow.

David Lang

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 2/3] Fix fsync livelock
  2008-10-06  0:30                             ` Arjan van de Ven
  2008-10-06  3:30                               ` Mikulas Patocka
@ 2008-10-08 10:56                               ` Pavel Machek
  1 sibling, 0 replies; 67+ messages in thread
From: Pavel Machek @ 2008-10-08 10:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Mikulas Patocka, Andrew Morton, linux-kernel, agk, mbroz, chris

On Sun 2008-10-05 17:30:19, Arjan van de Ven wrote:
> On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > I assume that if very few people complained about the livelock till
> > now, very few people will see degraded write performance. My patch
> > blocks the writes only if the livelock happens, so if the livelock
> > doesn't happen in the unpatched kernel for most people, the patch won't
> > make it worse.
> 
> I object to calling this a livelock. It's not. 

8 hours of a process in D state is a livelock. And we can do a minimal fix
here; this almost never happens in real life anyway.

Latency imposed on the writer should not be a problem...
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH] documentation: explain memory barriers
  2008-10-02  5:54             ` Andrew Morton
                                 ` (3 preceding siblings ...)
  2008-10-05 22:16               ` [PATCH 3/3] Fix fsync-vs-write misbehavior Mikulas Patocka
@ 2008-10-09  1:12               ` Randy Dunlap
  2008-10-09  1:17                 ` Chris Snook
  2008-10-09  1:50                 ` Valdis.Kletnieks
  4 siblings, 2 replies; 67+ messages in thread
From: Randy Dunlap @ 2008-10-09  1:12 UTC (permalink / raw)
  To: Andrew Morton, Ben Hutchings; +Cc: Mikulas Patocka, linux-kernel, linux-mm

On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:

> This sequence is repeated three or four times and should be pulled out
> into a well-commented function.  That comment should explain the logic
> behind the use of these barriers, please.

and on 2008-OCT-08 Ben Hutchings wrote:

> All memory barriers need a comment to explain why and what they're doing.


Should this be added to CodingStyle or some other file?




From: Randy Dunlap <randy.dunlap@oracle.com>

We want all uses of memory barriers to be explained in the source
code.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
---
 Documentation/SubmitChecklist |    3 +++
 1 file changed, 3 insertions(+)

--- lin2627-rc9-docs.orig/Documentation/SubmitChecklist
+++ lin2627-rc9-docs/Documentation/SubmitChecklist
@@ -85,3 +85,6 @@ kernel patches.
 23: Tested after it has been merged into the -mm patchset to make sure
     that it still works with all of the other queued patches and various
     changes in the VM, VFS, and other subsystems.
+
+24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+    source code that explains the logic of what they are doing and why.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
@ 2008-10-09  1:17                 ` Chris Snook
  2008-10-09  1:31                   ` Andrew Morton
  2008-10-09  1:50                 ` Valdis.Kletnieks
  1 sibling, 1 reply; 67+ messages in thread
From: Chris Snook @ 2008-10-09  1:17 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Ben Hutchings, Mikulas Patocka, linux-kernel, linux-mm

Randy Dunlap wrote:
> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> 
>> This sequence is repeated three or four times and should be pulled out
>> into a well-commented function.  That comment should explain the logic
>> behind the use of these barriers, please.
> 
> and on 2008-OCT-08 Ben Hutchings wrote:
> 
>> All memory barriers need a comment to explain why and what they're doing.

Seriously?  When a barrier is used, it's generally self-evident what 
it's doing.  The real problem is when a barrier is *not* used.  It would 
probably be more helpful to comment every non-barrier line of code to 
explain why it's okay if memory operations are getting reordered.

-- Chris

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  1:17                 ` Chris Snook
@ 2008-10-09  1:31                   ` Andrew Morton
  2008-10-09  5:51                     ` Chris Snook
  0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2008-10-09  1:31 UTC (permalink / raw)
  To: Chris Snook
  Cc: Randy Dunlap, Ben Hutchings, Mikulas Patocka, linux-kernel, linux-mm

On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <csnook@redhat.com> wrote:

> Randy Dunlap wrote:
> > On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> > 
> >> This sequence is repeated three or four times and should be pulled out
> >> into a well-commented function.  That comment should explain the logic
> >> behind the use of these barriers, please.
> > 
> > and on 2008-OCT-08 Ben Hutchings wrote:
> > 
> >> All memory barriers need a comment to explain why and what they're doing.

I approve this message.

> Seriously?  When a barrier is used, it's generally self-evident what 
> it's doing.

fs/buffer.c:sync_buffer().  Have fun.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
  2008-10-09  1:17                 ` Chris Snook
@ 2008-10-09  1:50                 ` Valdis.Kletnieks
  2008-10-09 17:35                   ` Nick Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Valdis.Kletnieks @ 2008-10-09  1:50 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Ben Hutchings, Mikulas Patocka, linux-kernel, linux-mm

On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:

> +
> +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
> +    source code that explains the logic of what they are doing and why.

"what they are doing" will almost always be "flush value to RAM" or similar.
How about this instead:

+ 24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+    source code that explains the race condition being prevented, and states
+    the location of the other code or hardware feature that races with this.
+
+    An example comment:
+
+ 	/*
+ 	 * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
+ 	 * card will get confused and wedge the hardware...
+ 	 */
+	wmb();

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  1:31                   ` Andrew Morton
@ 2008-10-09  5:51                     ` Chris Snook
  2008-10-09  9:58                       ` Ben Hutchings
  2008-10-09 17:29                       ` Nick Piggin
  0 siblings, 2 replies; 67+ messages in thread
From: Chris Snook @ 2008-10-09  5:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Randy Dunlap, Ben Hutchings, Mikulas Patocka, linux-kernel, linux-mm

Andrew Morton wrote:
> On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <csnook@redhat.com> wrote:
> 
>> Randy Dunlap wrote:
>>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
>>>
>>>> This sequence is repeated three or four times and should be pulled out
>>>> into a well-commented function.  That comment should explain the logic
>>>> behind the use of these barriers, please.
>>> and on 2008-OCT-08 Ben Hutchings wrote:
>>>
>>>> All memory barriers need a comment to explain why and what they're doing.
> 
> I approve this message.
> 
>> Seriously?  When a barrier is used, it's generally self-evident what 
>> it's doing.
> 
> fs/buffer.c:sync_buffer().  Have fun.

The real disaster there is the clear_buffer_##name macro and friends, as
evidenced by fs/ext2/inode.c:599

                clear_buffer_new(bh_result); /* What's this do? */

I'm completely in favor of documenting everything that can potentially interact
with that train wreck, but I maintain that the vast majority of memory barriers
are self-evident.

-- Chris

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09 17:35                   ` Nick Piggin
@ 2008-10-09  6:52                     ` Valdis.Kletnieks
  0 siblings, 0 replies; 67+ messages in thread
From: Valdis.Kletnieks @ 2008-10-09  6:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Randy Dunlap, Andrew Morton, Ben Hutchings, Mikulas Patocka,
	linux-kernel, linux-mm

On Fri, 10 Oct 2008 04:35:47 +1100, Nick Piggin said:
> On Thursday 09 October 2008 12:50, Valdis.Kletnieks@vt.edu wrote:
> > On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:
> > > +
> > > +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment
> > > in the +    source code that explains the logic of what they are doing
> > > and why.
> >
> > "what they are doing" will almost always be "flush value to RAM" or
> > similar.
> 
> Memory barriers don't flush anything anywhere.

That's what I get for commenting on stuff when I'm into a 40-hour week by
Wednesday. :)

I was speaking of the generic programmer who does stuff like:

	x = 10; /* set x to 10 */ 

for "what they are doing".  You know the type. ;)

"flush value to RAM", "force memory barrier operation", and I think I've seen
a few kzalloc()'s that have "allocate zero'ed memory" on them. "what they are
doing" is usually not worth writing down, but being verbose for the *why*
is almost always good, especially for things like memory barriers that
almost nobody can get their brains wrapped around (how many flame-fests per
year do we have about "volatile"? ;)

>  /*
>   * If we don't do a wmb() here, the store to the RBFROBNIZ,
>   * above, might reach the device before the store X, below.
>   * 
>   * If that happens, then the XU293  card will get confused
>   * and wedge the hardware...
>   */
>   wmb();
> 
> If you don't comment like that, then how does the reader know that the wmb
> is not *also* supposed to order the store with any other of the limitless
> subsequent stores until the next memory ordering operation? Or any of the
> previous stores since the last one?

Even better (as I missed the "also supposed to know" case).  My general point
was that a concrete example would improve Randy's original patch by showing
what sort of things should be in the comment, and your correction pointed
out *why* a concrete example was needed. ;)


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  5:51                     ` Chris Snook
@ 2008-10-09  9:58                       ` Ben Hutchings
  2008-10-09 21:27                         ` Nick Piggin
  2008-10-09 17:29                       ` Nick Piggin
  1 sibling, 1 reply; 67+ messages in thread
From: Ben Hutchings @ 2008-10-09  9:58 UTC (permalink / raw)
  To: Chris Snook
  Cc: Andrew Morton, Randy Dunlap, Mikulas Patocka, linux-kernel, linux-mm

On Thu, 2008-10-09 at 01:51 -0400, Chris Snook wrote:
> Andrew Morton wrote:
> > On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <csnook@redhat.com> wrote:
> > 
> >> Randy Dunlap wrote:
> >>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> >>>
> >>>> This sequence is repeated three or four times and should be pulled out
> >>>> into a well-commented function.  That comment should explain the logic
> >>>> behind the use of these barriers, please.
> >>> and on 2008-OCT-08 Ben Hutchings wrote:
> >>>
> >>>> All memory barriers need a comment to explain why and what they're doing.
> > 
> > I approve this message.
> > 
> >> Seriously?  When a barrier is used, it's generally self-evident what 
> >> it's doing.
> > 
> > fs/buffer.c:sync_buffer().  Have fun.
> 
> The real disaster there is the clear_buffer_##name macro and friends, as
> evidenced by fs/ext2/inode.c:599
> 
>                 clear_buffer_new(bh_result); /* What's this do? */
> 
> I'm completely in favor of documenting everything that can potentially interact
> with that train wreck, but I maintain that the vast majority of memory barriers
> are self-evident.

Acquire and release barriers attached to operations are usually self-
evident; standalone wmb() and rmb() are much less so.  It is helpful to be
explicit about exactly which memory operations need to be ordered, which
are often not the memory operations immediately preceding and following
the barrier.  "all" may have been a bit strong, though.
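
A sketch of the explicit pairing described above, using the kernel's
smp_wmb()/smp_rmb() primitives (generic example; the names and flags are
illustrative):

  static int data, ready;

  static void writer(int value)
  {
          data = value;
          /* order the store to 'data' before the store to 'ready';
           * pairs with the smp_rmb() in reader() */
          smp_wmb();
          ready = 1;
  }

  static int reader(void)
  {
          if (!ready)
                  return -1;
          /* order the load of 'ready' before the load of 'data';
           * pairs with the smp_wmb() in writer() */
          smp_rmb();
          return data;
  }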

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  5:51                     ` Chris Snook
  2008-10-09  9:58                       ` Ben Hutchings
@ 2008-10-09 17:29                       ` Nick Piggin
  1 sibling, 0 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-09 17:29 UTC (permalink / raw)
  To: Chris Snook
  Cc: Andrew Morton, Randy Dunlap, Ben Hutchings, Mikulas Patocka,
	linux-kernel, linux-mm

On Thursday 09 October 2008 16:51, Chris Snook wrote:
> Andrew Morton wrote:
> > On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <csnook@redhat.com> wrote:
> >> Randy Dunlap wrote:
> >>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> >>>> This sequence is repeated three or four times and should be pulled out
> >>>> into a well-commented function.  That comment should explain the logic
> >>>> behind the use of these barriers, please.
> >>>
> >>> and on 2008-OCT-08 Ben Hutchings wrote:
> >>>> All memory barriers need a comment to explain why and what they're
> >>>> doing.
> >
> > I approve this message.
> >
> >> Seriously?  When a barrier is used, it's generally self-evident what
> >> it's doing.
> >
> > fs/buffer.c:sync_buffer().  Have fun.
>
> The real disaster there is the clear_buffer_##name macro and friends, as
> evidenced by fs/ext2/inode.c:599
>
>                 clear_buffer_new(bh_result); /* What's this do? */

That's not a disaster. It is relatively easy, even if you have no
idea about any of that code, to work out what the BH_New flag does:
see where it gets set and where it gets cleared, and then work out
what that does. Actually, buffer.c used to leak BH_New in some
cases, but now it should be a bug if BH_New is found to be set
there (buffer.c should take any BH_New buffers, initialize them
appropriately, and clear BH_New; a dangling BH_New would indicate
uninitialized data going to or coming from the block).

No, they're easy, because you can find every single place where anyone 
cares about them with a single simple grep.

Again, fs/buffer.c:sync_buffer(). Which stores and/or loads is that
barrier sequencing against which subsequent stores and/or loads? Why?

For another fun example, mm/filemap.c:sync_page. This actually has a
big comment, but (to me) it isn't even evident then which loads and
stores are being sequenced against which subsequent ones because it
is not explicitly documented. And I do have some experience in adding
barriers to existing mm code where they have been missed completely.

Other examples: mempool_free(), set_dumpable(), freeze_bdev().



> I'm completely in favor of documenting everything that can potentially
> interact with that train wreck,

What's the train-wreck, again?


> but I maintain that the vast majority of 
> memory barriers are self-evident.

They are self-evident if you have spent a lot of time getting the
state machine and locking/concurrency model in your mind. If you
have not, then how do you know there is not some memory operation
way back or forward in the instruction stream that is supposed to
be ordered by this barrier?

All memory barriers have to be documented, except acquire/release
for critical sections, IMO.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  1:50                 ` Valdis.Kletnieks
@ 2008-10-09 17:35                   ` Nick Piggin
  2008-10-09  6:52                     ` Valdis.Kletnieks
  0 siblings, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-09 17:35 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Randy Dunlap, Andrew Morton, Ben Hutchings, Mikulas Patocka,
	linux-kernel, linux-mm

On Thursday 09 October 2008 12:50, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:
> > +
> > +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment
> > in the +    source code that explains the logic of what they are doing
> > and why.
>
> "what they are doing" will almost always be "flush value to RAM" or
> similar.

Memory barriers don't flush anything anywhere. It's simple: you must
explain which operations you are ordering against which.


> How about this instead: 
>
> + 24: All memory barriers ({e.g., barrier(), rmb(), wmb()} need a comment
> in the +    source code that explains the race condition being prevented,
> and states +    the location of the other code or hardware feature that
> races with this. +
> +    An example comment:
> +
> + 	/*
> + 	 * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
> + 	 * card will get confused and wedge the hardware...
> + 	 */
> +	wmb();

This comment is no good, because it doesn't tell you what the memory barrier
does. It tells you what might happen if you omit it.

 /*
  * If we don't do a wmb() here, the store to the RBFROBNIZ,
  * above, might reach the device before the store X, below.
  * 
  * If that happens, then the XU293  card will get confused
  * and wedge the hardware...
  */
  wmb();

If you don't comment like that, then how does the reader know that the wmb
is not *also* supposed to order the store with any other of the limitless
subsequent stores until the next memory ordering operation? Or any of the
previous stores since the last one?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH] documentation: explain memory barriers
  2008-10-09  9:58                       ` Ben Hutchings
@ 2008-10-09 21:27                         ` Nick Piggin
  0 siblings, 0 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-09 21:27 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Chris Snook, Andrew Morton, Randy Dunlap, Mikulas Patocka,
	linux-kernel, linux-mm

On Thursday 09 October 2008 20:58, Ben Hutchings wrote:
> On Thu, 2008-10-09 at 01:51 -0400, Chris Snook wrote:

> > I'm completely in favor of documenting everything that can potentially
> > interact with that train wreck, but I maintain that the vast majority of
> > memory barriers are self-evident.
>
> Acquire and release barriers attached to operations are usually self-
> evident; standalone wmb() and rmb() much less so.  It is helpful to be
> explicit about exactly which memory operations need to be ordered, which
> are often not the memory operations immediately preceding and following
> it.  "all" may have been a bit strong though.

No, I don't think so. We should absolutely force "all". That allows nobody
to be lazy, no confusion, and reminds people that memory barriers are not
easy to follow for a new reader of the code, or necessarily even the author,
6 months later. If somebody is too lazy to write a comment, they can use
locks.

One last quick quiz, easier than the earlier ones...
mm/vmscan.c:__remove_mapping has a score of lines documenting exactly
what memory operations are being ordered, and even an example of what
happens if the ordering is not followed. This is a pretty good comment,
if I say so myself. However, it has one deficiency in that it doesn't
explicitly state where the write barrier(s) are (IMO the comments for one
part of an ordering protocol should reference the other parts of the
protocol).

Where are the store barriers, or why are they not required?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)
  2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
@ 2008-10-11 12:06                 ` Nick Piggin
  2008-10-20 20:14                   ` Mikulas Patocka
  0 siblings, 1 reply; 67+ messages in thread
From: Nick Piggin @ 2008-10-11 12:06 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Monday 06 October 2008 09:11, Mikulas Patocka wrote:
> Hi
>
> I removed the repeated code and created new bit mutexes. They are
> space-efficient mutexes that consume only one bit. See the next 3 patches.

Pretty reasonable to have.


> If you are concerned about the size of an inode, I can convert other
> mutexes to bit mutexes: i_mutex and inotify_mutex.

I wouldn't worry for now. Mutexes can be unlocked much faster than bit
mutexes, especially in the fastpath. And due to slab, it would be
unlikely to actually save any space.
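
For reference, a one-bit sleeping lock can be assembled from existing
kernel primitives --- this is roughly what lock_buffer()/unlock_buffer()
do (sketch only; the wrapper names are illustrative):

  static int bit_wait(void *word)
  {
          schedule();                     /* simple wait action */
          return 0;
  }

  static void bit_mutex_lock(unsigned long *word, int bit)
  {
          wait_on_bit_lock(word, bit, bit_wait, TASK_UNINTERRUPTIBLE);
  }

  static void bit_mutex_unlock(unsigned long *word, int bit)
  {
          clear_bit_unlock(bit, word);
          smp_mb__after_clear_bit();      /* order the clear before the wakeup */
          wake_up_bit(word, bit);
  }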


> I could also create 
> bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
> for address_space->tree_lock, address_space->i_mmap_lock,
> address_space->private_lock, inode->i_lock.

We have that already. It is much much faster to unlock spinlocks than
bit spinlocks in general (if you own the word exclusively, then it's
not, but then you would be less likely to save space), and we can also
do proper FIFO ticket locks with a larger word.
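
For comparison, the usage pattern of the existing bit_spin_lock() from
<linux/bit_spinlock.h> (illustrative structure):

  struct packed_thing {
          unsigned long flags;            /* bit 0 doubles as the lock */
          int payload;
  };
  #define THING_LOCK_BIT 0

  static void thing_update(struct packed_thing *t, int v)
  {
          bit_spin_lock(THING_LOCK_BIT, &t->flags);
          t->payload = v;
          bit_spin_unlock(THING_LOCK_BIT, &t->flags);
  }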


> Look at it and say what you think about the idea of condensing mutexes
> into single bits.

Looks pretty good to me.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)
  2008-10-11 12:06                 ` Nick Piggin
@ 2008-10-20 20:14                   ` Mikulas Patocka
  2008-10-21  1:51                     ` Nick Piggin
  0 siblings, 1 reply; 67+ messages in thread
From: Mikulas Patocka @ 2008-10-20 20:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

> > If you are concerned about the size of an inode, I can convert other
> > mutexes to bit mutexes: i_mutex and inotify_mutex.
> 
> > I wouldn't worry for now. Mutexes can be unlocked much faster than bit
> mutexes, especially in the fastpath. And due to slab, it would be
> unlikely to actually save any space.

Maybe inotify_mutex. You are right that i_mutex is so heavily contended 
that slowing it down to save a few words wouldn't be good. Do you know 
about any inotify-intensive workload?

> > I could also create 
> > bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
> > for address_space->tree_lock, address_space->i_mmap_lock,
> > address_space->private_lock, inode->i_lock.
> 
> We have that already. It is much much faster to unlock spinlocks than
> bit spinlocks in general (if you own the word exclusively, then it's
> not, but then you would be less likely to save space), and we can also
> do proper FIFO ticket locks with a larger word.

BTW, why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits? 
Are atomic 32-bit instructions faster?

Can an x86(64) system have 256 CPUs?

Mikulas

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)
  2008-10-20 20:14                   ` Mikulas Patocka
@ 2008-10-21  1:51                     ` Nick Piggin
  0 siblings, 0 replies; 67+ messages in thread
From: Nick Piggin @ 2008-10-21  1:51 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Andrew Morton, linux-kernel, agk, mbroz, chris

On Tuesday 21 October 2008 07:14, Mikulas Patocka wrote:
> > > If you are concerned about the size of an inode, I can convert other
> > > mutexes to bit mutexes: i_mutex and inotify_mutex.
> >
> > > I wouldn't worry for now. Mutexes can be unlocked much faster than bit
> > mutexes, especially in the fastpath. And due to slab, it would be
> > unlikely to actually save any space.
>
> Maybe inotify_mutex. You are right that i_mutex is so heavily contended
> that slowing it down to save a few words wouldn't be good. Do you know about
> any inotify-intensive workload?

Don't really know, no. I think most desktop environments use it to
some extent, but no idea how much.


> > > I could also create
> > > bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save
> > > space for address_space->tree_lock, address_space->i_mmap_lock,
> > > address_space->private_lock, inode->i_lock.
> >
> > We have that already. It is much much faster to unlock spinlocks than
> > bit spinlocks in general (if you own the word exclusively, then it's
> > not, but then you would be less likely to save space), and we can also
> > do proper FIFO ticket locks with a larger word.
>
> BTW, why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits?
> Are atomic 32-bit instructions faster?

In the case of <= 256 CPUs, they could be an unsigned short I think.
Probably it has never been found to be a huge win because they are
often beside other ints or longs. I think I actually booted up the
kernel with 16-bit spinlocks when doing the FIFO locks, but never
sent a patch for it... Don't let me stop you from trying though.


> Can an x86(64) system have 256 CPUs?

Well, none that I know of which actually exist. SGI is hoping to have
4096 CPU x86 systems as far as I can tell.

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2008-10-21  1:51 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20080911101616.GA24064@agk.fab.redhat.com>
2008-09-22 21:10 ` [PATCH] Memory management livelock Mikulas Patocka
2008-09-23  0:48   ` Andrew Morton
2008-09-23 22:34   ` Mikulas Patocka
2008-09-23 22:49     ` Andrew Morton
2008-09-23 23:11       ` Mikulas Patocka
2008-09-23 23:46         ` Andrew Morton
2008-09-24 18:50           ` Mikulas Patocka
2008-09-24 18:51           ` [PATCH 1/3] " Mikulas Patocka
2008-09-24 18:52           ` [PATCH 2/3] " Mikulas Patocka
2008-10-02  5:54             ` Andrew Morton
2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
2008-10-11 12:06                 ` Nick Piggin
2008-10-20 20:14                   ` Mikulas Patocka
2008-10-21  1:51                     ` Nick Piggin
2008-10-05 22:14               ` [PATCH 1/3] bit mutexes Mikulas Patocka
2008-10-05 22:14               ` [PATCH 2/3] Fix fsync livelock Mikulas Patocka
2008-10-05 22:33                 ` Arjan van de Ven
2008-10-05 23:02                   ` Mikulas Patocka
2008-10-05 23:07                     ` Arjan van de Ven
2008-10-05 23:18                       ` Mikulas Patocka
2008-10-05 23:28                         ` Arjan van de Ven
2008-10-06  0:01                           ` Mikulas Patocka
2008-10-06  0:30                             ` Arjan van de Ven
2008-10-06  3:30                               ` Mikulas Patocka
2008-10-06  4:20                                 ` Arjan van de Ven
2008-10-06 13:00                                   ` Mikulas Patocka
2008-10-06 13:50                                     ` Arjan van de Ven
2008-10-06 20:44                                       ` Mikulas Patocka
2008-10-08 10:56                               ` Pavel Machek
2008-10-06  2:51                             ` Dave Chinner
2008-10-05 22:16               ` [PATCH 3/3] Fix fsync-vs-write misbehavior Mikulas Patocka
2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
2008-10-09  1:17                 ` Chris Snook
2008-10-09  1:31                   ` Andrew Morton
2008-10-09  5:51                     ` Chris Snook
2008-10-09  9:58                       ` Ben Hutchings
2008-10-09 21:27                         ` Nick Piggin
2008-10-09 17:29                       ` Nick Piggin
2008-10-09  1:50                 ` Valdis.Kletnieks
2008-10-09 17:35                   ` Nick Piggin
2008-10-09  6:52                     ` Valdis.Kletnieks
2008-09-24 18:53           ` [PATCH 3/3] Memory management livelock Mikulas Patocka
2008-10-03  2:32       ` [PATCH] " Nick Piggin
2008-10-03  2:40         ` Andrew Morton
2008-10-03  2:59           ` Nick Piggin
2008-10-03  3:14             ` Andrew Morton
2008-10-03  3:47               ` Nick Piggin
2008-10-03  3:56                 ` Andrew Morton
2008-10-03  4:07                   ` Nick Piggin
2008-10-03  4:17                     ` Andrew Morton
2008-10-03  4:29                       ` Nick Piggin
2008-10-03 11:43                   ` Mikulas Patocka
2008-10-03 12:27                     ` Nick Piggin
2008-10-03 13:53                       ` Mikulas Patocka
2008-10-03  2:54         ` Nick Piggin
2008-10-03 11:26           ` Mikulas Patocka
2008-10-03 12:31             ` Nick Piggin
2008-10-03 13:50               ` Mikulas Patocka
2008-10-03 14:50                 ` Alasdair G Kergon
2008-10-03 14:36               ` Alasdair G Kergon
2008-10-03 15:52           ` application syncing options (was Re: [PATCH] Memory management livelock) david
2008-10-06  0:04             ` Mikulas Patocka
2008-10-06  0:19               ` david
2008-10-06  3:42                 ` Mikulas Patocka
2008-10-07  3:37                   ` david
2008-10-07 15:44                     ` Mikulas Patocka
2008-10-07 17:16                       ` david
