linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@vger.kernel.org,
	agk@redhat.com, mbroz@redhat.com, chris@arachsys.com
Subject: Re: [PATCH] Memory management livelock
Date: Fri, 3 Oct 2008 12:54:28 +1000	[thread overview]
Message-ID: <200810031254.29121.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <200810031232.23836.nickpiggin@yahoo.com.au>

[-- Attachment #1: Type: text/plain, Size: 4135 bytes --]

On Friday 03 October 2008 12:32, Nick Piggin wrote:
> On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
> > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
> >
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
> > > >
> > > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > > > > The bug happens when one process is doing sequential buffered
> > > > > writes to a block device (or file) and another process is
> > > > > attempting to execute sync(), fsync() or direct-IO on that device
> > > > > (or file). This syncing process will wait indefinitelly, until the
> > > > > first writing process finishes.
> > > > >
> > > > > For example, run these two commands:
> > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> > > > >
> > > > > The bug is caused by sequential walking of address space in
> > > > > write_cache_pages and wait_on_page_writeback_range: if some other
> > > > > process is constantly making dirty and writeback pages while these
> > > > > functions run, the functions will wait on every new page, resulting
> > > > > in indefinite wait.
>
> I think the problem has been misidentified, or else I have misread the
> code. See below. I hope I'm right, because I think the patches are pretty
> heavy on complexity in these already complex paths...
>
> It would help if you explicitly identify the exact livelock. Ie. give a
> sequence of behaviour that leads to our progress rate falling to zero.
>
> > > > Shouldn't happen. All the data-syncing functions should have an upper
> > > > bound on the number of pages which they attempt to write. In the
> > > > example above, we end up in here:
> > > >
> > > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
> > > > start,
> > > > loff_t end, int sync_mode)
> > > > {
> > > > int ret;
> > > > struct writeback_control wbc = {
> > > > .sync_mode = sync_mode,
> > > > .nr_to_write = mapping->nrpages * 2, <<--
> > > > .range_start = start,
> > > > .range_end = end,
> > > > };
> > > >
> > > > so generic_file_direct_write()'s filemap_write_and_wait() will
> > > > attempt to write at most 2* the number of pages which are in cache
> > > > for that inode.
> > >
> > > See write_cache_pages:
> > >
> > > if (wbc->sync_mode != WB_SYNC_NONE)
> > >         wait_on_page_writeback(page);	(1)
> > > if (PageWriteback(page) ||
> > >     !clear_page_dirty_for_io(page)) {
> > >         unlock_page(page);		(2)
> > >         continue;
> > > }
> > > ret = (*writepage)(page, wbc, data);
> > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > >         unlock_page(page);
> > >         ret = 0;
> > > }
> > > if (ret || (--(wbc->nr_to_write) <= 0))
> > >         done = 1;
> > >
> > > --- so if it goes by points (1) and (2), the counter is not
> > > decremented, yet the function waits for the page. If there is constant
> > > stream of writeback pages being generated, it waits on each on them ---
> > > that is, forever.
>
> *What* is, forever? Data integrity syncs should have pages operated on
> in-order, until we get to the end of the range. Circular writeback could
> go through again, possibly, but no more than once.

OK, I have been able to reproduce it somewhat. It is not a livelock,
but what is happening is that direct IO read basically does an fsync
on the file before performing the IO. The fsync gets stuck behind the
dd that is dirtying the pages, and ends up following behind it and
doing all its IO for it.

The following patch avoids the issue for direct IO, by using the range
syncs rather than trying to sync the whole file.

The underlying problem I guess is unchanged. Is it really a problem,
though? The way I'd love to solve it is actually by adding another bit
or two to the pagecache radix tree,  that can be used to transiently tag
the tree for future operations. That way we could record the dirty and
writeback pages up front, and then only bother with operating on them.

That's *if* it really is a problem. I don't have much pity for someone
doing buffered IO and direct IO to the same pages of the same file :)

[-- Attachment #2: mm-starve-fix.patch --]
[-- Type: text/x-diff, Size: 1671 bytes --]

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2008-10-03 11:21:31.000000000 +1000
+++ linux-2.6/mm/filemap.c	2008-10-03 12:00:17.000000000 +1000
@@ -1304,11 +1304,8 @@ generic_file_aio_read(struct kiocb *iocb
 			goto out; /* skip atime */
 		size = i_size_read(inode);
 		if (pos < size) {
-			retval = filemap_write_and_wait(mapping);
-			if (!retval) {
-				retval = mapping->a_ops->direct_IO(READ, iocb,
+			retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
-			}
 			if (retval > 0)
 				*ppos = pos + retval;
 			if (retval) {
@@ -2110,18 +2107,10 @@ generic_file_direct_write(struct kiocb *
 	if (count != ocount)
 		*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
 
-	/*
-	 * Unmap all mmappings of the file up-front.
-	 *
-	 * This will cause any pte dirty bits to be propagated into the
-	 * pageframes for the subsequent filemap_write_and_wait().
-	 */
 	write_len = iov_length(iov, *nr_segs);
 	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
-	if (mapping_mapped(mapping))
-		unmap_mapping_range(mapping, pos, write_len, 0);
 
-	written = filemap_write_and_wait(mapping);
+	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
 	if (written)
 		goto out;
 
@@ -2507,7 +2496,8 @@ generic_file_buffered_write(struct kiocb
 	 * the file data here, to try to honour O_DIRECT expectations.
 	 */
 	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait(mapping);
+		status = filemap_write_and_wait_range(mapping,
+					pos, pos + written - 1);
 
 	return written ? written : status;
 }

  parent reply	other threads:[~2008-10-03  2:54 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20080911101616.GA24064@agk.fab.redhat.com>
2008-09-22 21:10 ` [PATCH] Memory management livelock Mikulas Patocka
2008-09-23  0:48   ` Andrew Morton
2008-09-23 22:34   ` Mikulas Patocka
2008-09-23 22:49     ` Andrew Morton
2008-09-23 23:11       ` Mikulas Patocka
2008-09-23 23:46         ` Andrew Morton
2008-09-24 18:50           ` Mikulas Patocka
2008-09-24 18:51           ` [PATCH 1/3] " Mikulas Patocka
2008-09-24 18:52           ` [PATCH 2/3] " Mikulas Patocka
2008-10-02  5:54             ` Andrew Morton
2008-10-05 22:11               ` RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock) Mikulas Patocka
2008-10-11 12:06                 ` Nick Piggin
2008-10-20 20:14                   ` Mikulas Patocka
2008-10-21  1:51                     ` Nick Piggin
2008-10-05 22:14               ` [PATCH 1/3] bit mutexes Mikulas Patocka
2008-10-05 22:14               ` [PATCH 2/3] Fix fsync livelock Mikulas Patocka
2008-10-05 22:33                 ` Arjan van de Ven
2008-10-05 23:02                   ` Mikulas Patocka
2008-10-05 23:07                     ` Arjan van de Ven
2008-10-05 23:18                       ` Mikulas Patocka
2008-10-05 23:28                         ` Arjan van de Ven
2008-10-06  0:01                           ` Mikulas Patocka
2008-10-06  0:30                             ` Arjan van de Ven
2008-10-06  3:30                               ` Mikulas Patocka
2008-10-06  4:20                                 ` Arjan van de Ven
2008-10-06 13:00                                   ` Mikulas Patocka
2008-10-06 13:50                                     ` Arjan van de Ven
2008-10-06 20:44                                       ` Mikulas Patocka
2008-10-08 10:56                               ` Pavel Machek
2008-10-06  2:51                             ` Dave Chinner
2008-10-05 22:16               ` [PATCH 3/3] Fix fsync-vs-write misbehavior Mikulas Patocka
2008-10-09  1:12               ` [PATCH] documentation: explain memory barriers Randy Dunlap
2008-10-09  1:17                 ` Chris Snook
2008-10-09  1:31                   ` Andrew Morton
2008-10-09  5:51                     ` Chris Snook
2008-10-09  9:58                       ` Ben Hutchings
2008-10-09 21:27                         ` Nick Piggin
2008-10-09 17:29                       ` Nick Piggin
2008-10-09  1:50                 ` Valdis.Kletnieks
2008-10-09 17:35                   ` Nick Piggin
2008-10-09  6:52                     ` Valdis.Kletnieks
2008-09-24 18:53           ` [PATCH 3/3] Memory management livelock Mikulas Patocka
2008-10-03  2:32       ` [PATCH] " Nick Piggin
2008-10-03  2:40         ` Andrew Morton
2008-10-03  2:59           ` Nick Piggin
2008-10-03  3:14             ` Andrew Morton
2008-10-03  3:47               ` Nick Piggin
2008-10-03  3:56                 ` Andrew Morton
2008-10-03  4:07                   ` Nick Piggin
2008-10-03  4:17                     ` Andrew Morton
2008-10-03  4:29                       ` Nick Piggin
2008-10-03 11:43                   ` Mikulas Patocka
2008-10-03 12:27                     ` Nick Piggin
2008-10-03 13:53                       ` Mikulas Patocka
2008-10-03  2:54         ` Nick Piggin [this message]
2008-10-03 11:26           ` Mikulas Patocka
2008-10-03 12:31             ` Nick Piggin
2008-10-03 13:50               ` Mikulas Patocka
2008-10-03 14:50                 ` Alasdair G Kergon
2008-10-03 14:36               ` Alasdair G Kergon
2008-10-03 15:52           ` application syncing options (was Re: [PATCH] Memory management livelock) david
2008-10-06  0:04             ` Mikulas Patocka
2008-10-06  0:19               ` david
2008-10-06  3:42                 ` Mikulas Patocka
2008-10-07  3:37                   ` david
2008-10-07 15:44                     ` Mikulas Patocka
2008-10-07 17:16                       ` david

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200810031254.29121.nickpiggin@yahoo.com.au \
    --to=nickpiggin@yahoo.com.au \
    --cc=agk@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris@arachsys.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@vger.kernel.org \
    --cc=mbroz@redhat.com \
    --cc=mpatocka@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).