From: Dave Kleikamp <shaggy@austin.ibm.com>
To: Andrew Morton <akpm@osdl.org>
Cc: Andrea Arcangeli <andrea@novell.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org
Subject: Re: [PATCH] zap_pte_range should not mark non-uptodate pages dirty
Date: Mon, 25 Oct 2004 08:58:23 -0500	[thread overview]
Message-ID: <1098712703.7260.2.camel@localhost> (raw)
In-Reply-To: <20041022162433.509341e4.akpm@osdl.org>

Andrew,
This patch does fix the oops I was seeing.

Shaggy

On Fri, 2004-10-22 at 18:24, Andrew Morton wrote:
> Something like this...
> 
> 
> 
> 
> - When invalidating pages, take care to shoot down any ptes which map them
>   as well.
> 
>   This ensures that the next mmap access to the page will generate a major
>   fault, so NFS's server-side modifications are picked up.
> 
>   This also allows us to call invalidate_complete_page() on all pages, so
>   filesystems such as ext3 get a chance to invalidate the buffer_heads.
> 
> - Don't mark in-pagetable pages as non-uptodate any more.  That broke a
>   previous guarantee that mapped-into-user-process pages are always uptodate.
> 
> - Check the return value of invalidate_complete_page().  It can fail if
>   someone redirties a page after generic_file_direct_IO() has written it back.
> 
> But we still have a problem.  If invalidate_inode_pages2() calls
> unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache
> pages.  That will redirty the page's buffers and will cause
> invalidate_complete_page() to fail.
> 
> So, in generic_file_direct_IO() we do a complete pte shootdown on the file
> up-front, prior to writing back dirty pagecache.  This is only done for
> O_DIRECT writes.  It _could_ be done for O_DIRECT reads too, providing full
> mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
> permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
> other people's mapped pagecache.
> 
> NFS also uses invalidate_inode_pages2() for handling server-side modification
> notifications.  But in the NFS case the clear_page_dirty() in
> invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
> about the "dirty buffers against a clean page" problem.  (I think)
> 
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
> 
>  25-akpm/include/linux/fs.h |    2 -
>  25-akpm/mm/filemap.c       |   19 ++++++++++---
>  25-akpm/mm/truncate.c      |   63 +++++++++++++++++++++++++++------------------
>  3 files changed, 55 insertions(+), 29 deletions(-)
> 
> diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c
> --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:40:26 2004
> +++ 25-akpm/mm/truncate.c	Fri Oct 22 16:05:26 2004
> @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp
>   * be marked dirty at any time too.  So we re-check the dirtiness inside
>   * ->tree_lock.  That provides exclusion against the __set_page_dirty
>   * functions.
> + *
> + * Returns non-zero if the page was successfully invalidated.
>   */
>  static int
>  invalidate_complete_page(struct address_space *mapping, struct page *page)
> @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str
>  EXPORT_SYMBOL(invalidate_inode_pages);
>  
>  /**
> - * invalidate_inode_pages2 - remove all unmapped pages from an address_space
> + * invalidate_inode_pages2 - remove all pages from an address_space
>   * @mapping - the address_space
>   *
> - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case
> - * where the page is seen to be mapped into process pagetables.  In that case,
> - * the page is marked clean but is left attached to its address_space.
> - *
> - * The page is also marked not uptodate so that a subsequent pagefault will
> - * perform I/O to bringthe page's contents back into sync with its backing
> - * store.
> + * Any pages which are found to be mapped into pagetables are unmapped prior to
> + * invalidation.
>   *
> - * FIXME: invalidate_inode_pages2() is probably trivially livelockable.
> + * Returns -EIO if any pages could not be invalidated.
>   */
> -void invalidate_inode_pages2(struct address_space *mapping)
> +int invalidate_inode_pages2(struct address_space *mapping)
>  {
>  	struct pagevec pvec;
>  	pgoff_t next = 0;
>  	int i;
> +	int ret = 0;
> +	int did_full_unmap = 0;
>  
>  	pagevec_init(&pvec, 0);
> -	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> -		for (i = 0; i < pagevec_count(&pvec); i++) {
> +	while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> +		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			lock_page(page);
> -			if (page->mapping == mapping) {	/* truncate race? */
> -				wait_on_page_writeback(page);
> -				next = page->index + 1;
> -				if (page_mapped(page)) {
> -					clear_page_dirty(page);
> -					ClearPageUptodate(page);
> +			if (page->mapping != mapping) {	/* truncate race? */
> +				unlock_page(page);
> +				continue;
> +			}
> +			wait_on_page_writeback(page);
> +			next = page->index + 1;
> +			while (page_mapped(page)) {
> +				if (!did_full_unmap) {
> +					/*
> +					 * Zap the rest of the file in one hit.
> +					 * FIXME: invalidate_inode_pages2()
> +					 * should take start/end offsets.
> +					 */
> +					unmap_mapping_range(mapping,
> +						page->index << PAGE_CACHE_SHIFT,
> +					  	-1, 0);
> +					did_full_unmap = 1;
>  				} else {
> -					if (!invalidate_complete_page(mapping,
> -								      page)) {
> -						clear_page_dirty(page);
> -						ClearPageUptodate(page);
> -					}
> +					/*
> +					 * Just zap this page
> +					 */
> +					unmap_mapping_range(mapping,
> +					  page->index << PAGE_CACHE_SHIFT,
> +					  (page->index << PAGE_CACHE_SHIFT)+1,
> +					  0);
>  				}
>  			}
> +			clear_page_dirty(page);
> +			if (!invalidate_complete_page(mapping, page))
> +				ret = -EIO;
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
>  		cond_resched();
>  	}
> +	return ret;
>  }
> -
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h
> --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:54:50 2004
> +++ 25-akpm/include/linux/fs.h	Fri Oct 22 15:54:58 2004
> @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino
>  	    S_ISLNK(inode->i_mode))
>  		invalidate_inode_pages(inode->i_mapping);
>  }
> -extern void invalidate_inode_pages2(struct address_space *mapping);
> +extern int invalidate_inode_pages2(struct address_space *mapping);
>  extern void write_inode_now(struct inode *, int);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
> diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c
> --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:55:06 2004
> +++ 25-akpm/mm/filemap.c	Fri Oct 22 16:17:19 2004
> @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file 
>  EXPORT_SYMBOL(generic_file_writev);
>  
>  /*
> - * Called under i_sem for writes to S_ISREG files
> + * Called under i_sem for writes to S_ISREG files.   Returns -EIO if something
> + * went wrong during pagecache shootdown.
>   */
>  ssize_t
>  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki
>  	struct address_space *mapping = file->f_mapping;
>  	ssize_t retval;
>  
> +	/*
> +	 * If it's a write, unmap all mmappings of the file up-front.  This
> +	 * will cause any pte dirty bits to be propagated into the pageframes
> +	 * for the subsequent filemap_write_and_wait().
> +	 */
> +	if (rw == WRITE && mapping_mapped(mapping))
> +		unmap_mapping_range(mapping, 0, -1, 0);
> +
>  	retval = filemap_write_and_wait(mapping);
>  	if (retval == 0) {
>  		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
>  						offset, nr_segs);
> -		if (rw == WRITE && mapping->nrpages)
> -			invalidate_inode_pages2(mapping);
> +		if (rw == WRITE && mapping->nrpages) {
> +			int err = invalidate_inode_pages2(mapping);
> +			if (err)
> +				retval = err;
> +		}
>  	}
>  	return retval;
>  }
> -
>  EXPORT_SYMBOL_GPL(generic_file_direct_IO);
> _
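
To spell out the calling convention the patch introduces: invalidate_inode_pages2()
now returns 0 on success or -EIO when a page could not be invalidated, so a
direct-I/O write path is expected to propagate that failure to its caller.  A
minimal sketch of the WRITE-side pattern, using only the helpers named in the
patch above (the wrapper function itself is illustrative, not part of the patch):

	/* illustrative only -- mirrors the patched generic_file_direct_IO() write path */
	static ssize_t example_dio_write(struct kiocb *iocb, const struct iovec *iov,
					 loff_t offset, unsigned long nr_segs)
	{
		struct address_space *mapping = iocb->ki_filp->f_mapping;
		ssize_t retval;

		/* shoot down ptes up-front so pte dirty bits reach the pageframes */
		if (mapping_mapped(mapping))
			unmap_mapping_range(mapping, 0, -1, 0);

		retval = filemap_write_and_wait(mapping);
		if (retval)
			return retval;

		retval = mapping->a_ops->direct_IO(WRITE, iocb, iov, offset, nr_segs);

		/* invalidation failure now surfaces as -EIO instead of being ignored */
		if (mapping->nrpages) {
			int err = invalidate_inode_pages2(mapping);
			if (err)
				retval = err;
		}
		return retval;
	}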

