All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@osdl.org>
To: Andrea Arcangeli <andrea@novell.com>
Cc: shaggy@austin.ibm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
Date: Thu, 21 Oct 2004 19:03:20 -0700	[thread overview]
Message-ID: <20041021190320.02dccda7.akpm@osdl.org> (raw)
In-Reply-To: <20041022012211.GD14325@dualathlon.random>

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> > If you want to shootdown ptes before clearing the bitflag, that's fine
> 
> small correction s/before/after/. doing the pte shootdown before
> clearing the uptodate bitflag, would still not guarantee to read
> uptodate data after the invalidate (a minor page fault could still
> happen between the shootdown and the clear_bit; while after clearing the
> uptodate bit a major fault hitting the disk and refreshing the pagecache
> contents will be guaranteed - modulo bhs, well at least nfs is sure ok
> in that respect ;).

Yeah.  I think the only 100% reliable way to do this is:

	lock_page()
	unmap_mapping_range(page)
	ClearPageUptodate(page);
	invalidate(page);	/* Try to free the thing */
	unlock_page(page);

(Can do it for a whole range of pages if we always agree to lock pages in
file-index-ascending order).

but hrm, we don't even have the locking there to prevent do_no_page() from
reinstantiating the pte immediately after the unmap_mapping_range().

So what to do?

- The patch you originally sent has a race window which can be made
  nonfatal by removing the BUGcheck in mpage_writepage().

- NFS should probably be taught to use unmap_mapping_range() regardless
  of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
  up.

- invalidate_inode_pages2() isn't successfully invalidating the page if
  it has buffers.

  This is the biggest problem, because the pages can trivially have
  buffers - just write()ing to them will attach buffers, if they're ext2-
  or ext3-backed.

  It means that a direct-IO write to a section of a file which is mmapped
  causes pagecache and disk to get out of sync.  Question is: do we care,
  or do we say "don't do that"?  Nobody seems to have noticed thus far and
  it's a pretty dopey thing to be doing.

  If we _do_ want to fix it, it seems the simplest approach would be to
  nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
  raced - try again" loop in there to handle the "do_no_page()
  reinstantiated a pte" race.

Fun.

Something like this?  Slow as a dog, but possibly correct ;)


void invalidate_inode_pages2(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t next = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			lock_page(page);
			if (page->mapping == mapping) {	/* truncate race? */
				wait_on_page_writeback(page);
				next = page->index + 1;
				while (page_mapped(page)) {
					unmap_mapping_range(mapping,
					  page->index << PAGE_CACHE_SHIFT,
					  page->index << PAGE_CACHE_SHIFT+1,
					  0);
					clear_page_dirty(page);
				}
				invalidate_complete_page(mapping, page);
			}
			unlock_page(page);
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}

The unmapping from pagetables means that invalidate_complete_page() will
generally successfully remove the page from pagecache.  That mostly fixes
the direct-write-of-mmapped-data coherency problem: the pages simply aren't
in pagecache any more so we'll surely redo physical I/O.

But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
it really sucks that we're offering behaviour which only works most of the
time :(



WARNING: multiple messages have this Message-ID (diff)
From: Andrew Morton <akpm@osdl.org>
To: Andrea Arcangeli <andrea@novell.com>
Cc: shaggy@austin.ibm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
Date: Thu, 21 Oct 2004 19:03:20 -0700	[thread overview]
Message-ID: <20041021190320.02dccda7.akpm@osdl.org> (raw)
In-Reply-To: <20041022012211.GD14325@dualathlon.random>

Andrea Arcangeli <andrea@novell.com> wrote:
>
> On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> > If you want to shootdown ptes before clearing the bitflag, that's fine
> 
> small correction s/before/after/. doing the pte shootdown before
> clearing the uptodate bitflag, would still not guarantee to read
> uptodate data after the invalidate (a minor page fault could still
> happen between the shootdown and the clear_bit; while after clearing the
> uptodate bit a major fault hitting the disk and refreshing the pagecache
> contents will be guaranteed - modulo bhs, well at least nfs is sure ok
> in that respect ;).

Yeah.  I think the only 100% reliable way to do this is:

	lock_page()
	unmap_mapping_range(page)
	ClearPageUptodate(page);
	invalidate(page);	/* Try to free the thing */
	unlock_page(page);

(Can do it for a whole range of pages if we always agree to lock pages in
file-index-ascending order).

but hrm, we don't even have the locking there to prevent do_no_page() from
reinstantiating the pte immediately after the unmap_mapping_range().

So what to do?

- The patch you originally sent has a race window which can be made
  nonfatal by removing the BUGcheck in mpage_writepage().

- NFS should probably be taught to use unmap_mapping_range() regardless
  of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
  up.

- invalidate_inode_pages2() isn't successfully invalidating the page if
  it has buffers.

  This is the biggest problem, because the pages can trivially have
  buffers - just write()ing to them will attach buffers, if they're ext2-
  or ext3-backed.

  It means that a direct-IO write to a section of a file which is mmapped
  causes pagecache and disk to get out of sync.  Question is: do we care,
  or do we say "don't do that"?  Nobody seems to have noticed thus far and
  it's a pretty dopey thing to be doing.

  If we _do_ want to fix it, it seems the simplest approach would be to
  nuke the pte's in invalidate_inode_pages2().  And we'd need some "oh we
  raced - try again" loop in there to handle the "do_no_page()
  reinstantiated a pte" race.

Fun.

Something like this?  Slow as a dog, but possibly correct ;)


void invalidate_inode_pages2(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t next = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			lock_page(page);
			if (page->mapping == mapping) {	/* truncate race? */
				wait_on_page_writeback(page);
				next = page->index + 1;
				while (page_mapped(page)) {
					unmap_mapping_range(mapping,
					  page->index << PAGE_CACHE_SHIFT,
					  page->index << PAGE_CACHE_SHIFT+1,
					  0);
					clear_page_dirty(page);
				}
				invalidate_complete_page(mapping, page);
			}
			unlock_page(page);
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}

The unmapping from pagetables means that invalidate_complete_page() will
generally successfully remove the page from pagecache.  That mostly fixes
the direct-write-of-mmapped-data coherency problem: the pages simply aren't
in pagecache any more so we'll surely redo physical I/O.

But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
it really sucks that we're offering behaviour which only works most of the
time :(


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

  reply	other threads:[~2004-10-22  2:12 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-10-21 21:15 [PATCH] zap_pte_range should not mark non-uptodate pages dirty Dave Kleikamp
2004-10-21 21:15 ` Dave Kleikamp
2004-10-21 21:45 ` Andrew Morton
2004-10-21 21:45   ` Andrew Morton
2004-10-21 22:36   ` Andrea Arcangeli
2004-10-21 22:36     ` Andrea Arcangeli
2004-10-21 23:02     ` Andrew Morton
2004-10-21 23:02       ` Andrew Morton
2004-10-21 23:20       ` Andrea Arcangeli
2004-10-21 23:20         ` Andrea Arcangeli
2004-10-21 23:42         ` Andrew Morton
2004-10-21 23:42           ` Andrew Morton
2004-10-22  0:15           ` Andrew Morton
2004-10-22  0:15             ` Andrew Morton
2004-10-22  0:41             ` Andrea Arcangeli
2004-10-22  0:41               ` Andrea Arcangeli
2004-10-22  2:51               ` Rik van Riel
2004-10-22  2:51                 ` Rik van Riel
2004-10-22 16:19                 ` Andrea Arcangeli
2004-10-22 16:19                   ` Andrea Arcangeli
2004-10-22  0:30           ` Andrea Arcangeli
2004-10-22  0:30             ` Andrea Arcangeli
2004-10-22  1:22             ` Andrea Arcangeli
2004-10-22  1:22               ` Andrea Arcangeli
2004-10-22  2:03               ` Andrew Morton [this message]
2004-10-22  2:03                 ` Andrew Morton
2004-10-22 16:17                 ` Andrea Arcangeli
2004-10-22 16:17                   ` Andrea Arcangeli
2004-10-22 17:04                   ` Andrea Arcangeli
2004-10-22 17:04                     ` Andrea Arcangeli
2004-10-22 23:24                   ` Andrew Morton
2004-10-22 23:24                     ` Andrew Morton
2004-10-25 13:58                     ` Dave Kleikamp
2004-10-25 13:58                       ` Dave Kleikamp
2004-10-26  0:35                     ` Andrea Arcangeli
2004-10-26  0:35                       ` Andrea Arcangeli
2004-11-09 14:15                     ` Dave Kleikamp
2004-11-09 14:15                       ` Dave Kleikamp
2004-11-09 14:46                       ` Andrea Arcangeli
2004-11-09 14:46                         ` Andrea Arcangeli
2004-11-09 19:51                         ` Andrew Morton
2004-11-09 19:51                           ` Andrew Morton
2004-11-09 19:46                       ` Andrew Morton
2004-11-09 19:46                         ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20041021190320.02dccda7.akpm@osdl.org \
    --to=akpm@osdl.org \
    --cc=andrea@novell.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=shaggy@austin.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.