linux-kernel.vger.kernel.org archive mirror
* [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
@ 2014-12-05 14:52 Johannes Weiner
  2014-12-05 14:52 ` [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas Johannes Weiner
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-05 14:52 UTC
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara,
	Kirill A. Shutemov, linux-mm, linux-fsdevel, linux-kernel

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
				     if (PageDirty())
				       dec_zone_page_state(page, NR_FILE_DIRTY);
				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held.  The notable exception to this rule, though,
is do_wp_page(), for which this race exists.  However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes.  Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.
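
In concrete terms, the !page_mkwrite path in do_wp_page() goes from
merely waiting for the page lock to holding it across the dirtying.
Simplified sketch of that change (the full diff follows below):

	/* before: truncation can still slip in after the wait */
	wait_on_page_locked(dirty_page);
	set_page_dirty_balance(dirty_page);

	/* after: truncation locks the page before clearing
	 * page->mapping, so holding the lock across the dirtying
	 * excludes it completely */
	lock_page(dirty_page);
	dirtied = set_page_dirty(dirty_page);
	mapping = dirty_page->mapping;
	unlock_page(dirty_page);
	if (dirtied && mapping)
		balance_dirty_pages_ratelimited(mapping);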

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed.  Remove it.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/writeback.h |  1 -
 mm/memory.c               | 27 +++++++++++++++++----------
 mm/page-writeback.c       | 43 ++++++++++++-------------------------------
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
 void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..72d998eb0438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,24 @@ reuse:
 		if (!dirty_page)
 			return ret;
 
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * bit after it clear all dirty ptes, but before a racing
-		 * do_wp_page installs a dirty pte.
-		 *
-		 * do_shared_fault is protected similarly.
-		 */
 		if (!page_mkwrite) {
-			wait_on_page_locked(dirty_page);
-			set_page_dirty_balance(dirty_page);
+			struct address_space *mapping;
+			int dirtied;
+
+			lock_page(dirty_page);
+			dirtied = set_page_dirty(dirty_page);
+			VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
+			mapping = dirty_page->mapping;
+			unlock_page(dirty_page);
+
+			if (dirtied && mapping) {
+				/*
+				 * Some device drivers do not set page.mapping
+				 * but still dirty their pages
+				 */
+				balance_dirty_pages_ratelimited(mapping);
+			}
+
 			/* file_update_time outside page_lock */
 			if (vma->vm_file)
 				file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..437174a2aaa3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
 		bdi_start_background_writeback(bdi);
 }
 
-void set_page_dirty_balance(struct page *page)
-{
-	if (set_page_dirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-
-		if (mapping)
-			balance_dirty_pages_ratelimited(mapping);
-	}
-}
-
 static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
  * page dirty in that case, but not all the buffers.  This is a "bottom-up"
  * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
  *
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation.  Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
-		struct address_space *mapping2;
 		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
-		mapping2 = page_mapping(page);
-		if (mapping2) { /* Race with truncate? */
-			BUG_ON(mapping2 != mapping);
-			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-			account_page_dirtied(page, mapping);
-			radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		}
+		BUG_ON(page_mapping(page) != mapping);
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		account_page_dirtied(page, mapping);
+		radix_tree_tag_set(&mapping->page_tree, page_index(page),
+				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
-		 * at this point. We do this by having them hold the
-		 * page lock at some point after installing their
-		 * pte, but before marking the page dirty.
-		 * Pages are always locked coming in here, so we get
-		 * the desired exclusion. See mm/memory.c:do_wp_page()
-		 * for more comments.
+		 * at this point.  We do this by having them hold the
+		 * page lock while dirtying the page, and pages are
+		 * always locked coming in here, so we get the desired
+		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-- 
2.1.3



* [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas
  2014-12-05 14:52 [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
@ 2014-12-05 14:52 ` Johannes Weiner
  2014-12-05 14:52 ` [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page() Johannes Weiner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-05 14:52 UTC
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara,
	Kirill A. Shutemov, linux-mm, linux-fsdevel, linux-kernel

Shared anonymous mmaps are implemented with shmem files, so all VMAs
with shared writable semantics also have an underlying backing file.
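
This holds because mmap() routes MAP_SHARED|MAP_ANONYMOUS mappings
through shmem_zero_setup(), which installs a shmem file as the backing.
Paraphrased sketch from mm/shmem.c (details elided):

	int shmem_zero_setup(struct vm_area_struct *vma)
	{
		struct file *file;
		loff_t size = vma->vm_end - vma->vm_start;

		file = shmem_file_setup("dev/zero", size, vma->vm_flags);
		if (IS_ERR(file))
			return PTR_ERR(file);

		if (vma->vm_file)
			fput(vma->vm_file);
		vma->vm_file = file;	/* never NULL afterwards */
		vma->vm_ops = &shmem_vm_ops;
		return 0;
	}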

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 72d998eb0438..5640a718ac58 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2168,9 +2168,7 @@ reuse:
 				balance_dirty_pages_ratelimited(mapping);
 			}
 
-			/* file_update_time outside page_lock */
-			if (vma->vm_file)
-				file_update_time(vma->vm_file);
+			file_update_time(vma->vm_file);
 		}
 		put_page(dirty_page);
 		if (page_mkwrite) {
@@ -3026,8 +3024,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		balance_dirty_pages_ratelimited(mapping);
 	}
 
-	/* file_update_time outside page_lock */
-	if (vma->vm_file && !vma->vm_ops->page_mkwrite)
+	if (!vma->vm_ops->page_mkwrite)
 		file_update_time(vma->vm_file);
 
 	return ret;
-- 
2.1.3



* [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()
  2014-12-05 14:52 [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
  2014-12-05 14:52 ` [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas Johannes Weiner
@ 2014-12-05 14:52 ` Johannes Weiner
  2014-12-09 18:22   ` Jan Kara
  2014-12-09 18:18 ` [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Jan Kara
  2017-04-10  2:22 ` alexander.levin
  3 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2014-12-05 14:52 UTC
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara,
	Kirill A. Shutemov, linux-mm, linux-fsdevel, linux-kernel

Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
pretty much the same.  Make sure the page references are the same in
both cases, then merge the two branches.

It's tempting to go even further and page-lock the !page_mkwrite case,
to get it in line with everybody else setting the page table and thus
further simplify the model.  But that's not quite compelling enough to
justify dropping the pte lock, then relocking and verifying the entry
for filesystems without ->page_mkwrite, which notably includes tmpfs.
Leave it for now and lock the page late in the !page_mkwrite case.
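
For illustration, taking the page lock early in the !page_mkwrite case
would require a dance roughly like this (hypothetical sketch, not part
of this patch):

	/* can't sleep in lock_page() under the pte lock, so drop it,
	 * take the page lock, then relock and re-verify the entry */
	pte_unmap_unlock(page_table, ptl);
	lock_page(old_page);
	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (!pte_same(*page_table, orig_pte)) {
		/* lost a race with another fault, bail out */
		unlock_page(old_page);
		...
	}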

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 48 +++++++++++++++++-------------------------------
 1 file changed, 17 insertions(+), 31 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5640a718ac58..df47fd0a4b7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2046,7 +2046,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t entry;
 	int ret = 0;
 	int page_mkwrite = 0;
-	struct page *dirty_page = NULL;
+	bool dirty_shared = false;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
 	struct mem_cgroup *memcg;
@@ -2097,6 +2097,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
+		page_cache_get(old_page);
 		/*
 		 * Only catch write-faults on shared writable pages,
 		 * read-only shared pages can get COWed by
@@ -2104,7 +2105,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			int tmp;
-			page_cache_get(old_page);
+
 			pte_unmap_unlock(page_table, ptl);
 			tmp = do_page_mkwrite(vma, old_page, address);
 			if (unlikely(!tmp || (tmp &
@@ -2124,11 +2125,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unlock_page(old_page);
 				goto unlock;
 			}
-
 			page_mkwrite = 1;
 		}
-		dirty_page = old_page;
-		get_page(dirty_page);
+
+		dirty_shared = true;
 
 reuse:
 		/*
@@ -2147,43 +2147,29 @@ reuse:
 		pte_unmap_unlock(page_table, ptl);
 		ret |= VM_FAULT_WRITE;
 
-		if (!dirty_page)
-			return ret;
-
-		if (!page_mkwrite) {
+		if (dirty_shared) {
 			struct address_space *mapping;
 			int dirtied;
 
-			lock_page(dirty_page);
-			dirtied = set_page_dirty(dirty_page);
-			VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
-			mapping = dirty_page->mapping;
-			unlock_page(dirty_page);
+			if (!page_mkwrite)
+				lock_page(old_page);
 
-			if (dirtied && mapping) {
-				/*
-				 * Some device drivers do not set page.mapping
-				 * but still dirty their pages
-				 */
-				balance_dirty_pages_ratelimited(mapping);
-			}
+			dirtied = set_page_dirty(old_page);
+			VM_BUG_ON_PAGE(PageAnon(old_page), old_page);
+			mapping = old_page->mapping;
+			unlock_page(old_page);
+			page_cache_release(old_page);
 
-			file_update_time(vma->vm_file);
-		}
-		put_page(dirty_page);
-		if (page_mkwrite) {
-			struct address_space *mapping = dirty_page->mapping;
-
-			set_page_dirty(dirty_page);
-			unlock_page(dirty_page);
-			page_cache_release(dirty_page);
-			if (mapping)	{
+			if ((dirtied || page_mkwrite) && mapping) {
 				/*
 				 * Some device drivers do not set page.mapping
 				 * but still dirty their pages
 				 */
 				balance_dirty_pages_ratelimited(mapping);
 			}
+
+			if (!page_mkwrite)
+				file_update_time(vma->vm_file);
 		}
 
 		return ret;
-- 
2.1.3



* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-05 14:52 [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
  2014-12-05 14:52 ` [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas Johannes Weiner
  2014-12-05 14:52 ` [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page() Johannes Weiner
@ 2014-12-09 18:18 ` Jan Kara
  2017-04-10  2:22 ` alexander.levin
  3 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2014-12-09 18:18 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, Kirill A. Shutemov, linux-mm, linux-fsdevel,
	linux-kernel

On Fri 05-12-14 09:52:44, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
> 
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
> 				     if (PageDirty())
> 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
> 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> 
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> 
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held.  The notable exception to this rule, though,
> is do_wp_page(), for which this race exists.  However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
> 
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed.  Remove it.
> 
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@vger.kernel.org>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/writeback.h |  1 -
>  mm/memory.c               | 27 +++++++++++++++++----------
>  mm/page-writeback.c       | 43 ++++++++++++-------------------------------
>  3 files changed, 29 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a219be961c0a..00048339c23e 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
>  		      struct writeback_control *wbc, writepage_t writepage,
>  		      void *data);
>  int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> -void set_page_dirty_balance(struct page *page);
>  void writeback_set_ratelimit(void);
>  void tag_pages_for_writeback(struct address_space *mapping,
>  			     pgoff_t start, pgoff_t end);
> diff --git a/mm/memory.c b/mm/memory.c
> index 3e503831e042..72d998eb0438 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2150,17 +2150,24 @@ reuse:
>  		if (!dirty_page)
>  			return ret;
>  
> -		/*
> -		 * Yes, Virginia, this is actually required to prevent a race
> -		 * with clear_page_dirty_for_io() from clearing the page dirty
> -		 * bit after it clear all dirty ptes, but before a racing
> -		 * do_wp_page installs a dirty pte.
> -		 *
> -		 * do_shared_fault is protected similarly.
> -		 */
>  		if (!page_mkwrite) {
> -			wait_on_page_locked(dirty_page);
> -			set_page_dirty_balance(dirty_page);
> +			struct address_space *mapping;
> +			int dirtied;
> +
> +			lock_page(dirty_page);
> +			dirtied = set_page_dirty(dirty_page);
> +			VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
> +			mapping = dirty_page->mapping;
> +			unlock_page(dirty_page);
> +
> +			if (dirtied && mapping) {
> +				/*
> +				 * Some device drivers do not set page.mapping
> +				 * but still dirty their pages
> +				 */
> +				balance_dirty_pages_ratelimited(mapping);
> +			}
> +
>  			/* file_update_time outside page_lock */
>  			if (vma->vm_file)
>  				file_update_time(vma->vm_file);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 19ceae87522d..437174a2aaa3 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1541,16 +1541,6 @@ pause:
>  		bdi_start_background_writeback(bdi);
>  }
>  
> -void set_page_dirty_balance(struct page *page)
> -{
> -	if (set_page_dirty(page)) {
> -		struct address_space *mapping = page_mapping(page);
> -
> -		if (mapping)
> -			balance_dirty_pages_ratelimited(mapping);
> -	}
> -}
> -
>  static DEFINE_PER_CPU(int, bdp_ratelimits);
>  
>  /*
> @@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
>   * page dirty in that case, but not all the buffers.  This is a "bottom-up"
>   * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
>   *
> - * Most callers have locked the page, which pins the address_space in memory.
> - * But zap_pte_range() does not lock the page, however in that case the
> - * mapping is pinned by the vma's ->vm_file reference.
> - *
> - * We take care to handle the case where the page was truncated from the
> - * mapping by re-checking page_mapping() inside tree_lock.
> + * The caller must ensure this doesn't race with truncation.  Most will simply
> + * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
> + * the pte lock held, which also locks out truncation.
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
> -		struct address_space *mapping2;
>  		unsigned long flags;
>  
>  		if (!mapping)
>  			return 1;
>  
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
> -		mapping2 = page_mapping(page);
> -		if (mapping2) { /* Race with truncate? */
> -			BUG_ON(mapping2 != mapping);
> -			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> -			account_page_dirtied(page, mapping);
> -			radix_tree_tag_set(&mapping->page_tree,
> -				page_index(page), PAGECACHE_TAG_DIRTY);
> -		}
> +		BUG_ON(page_mapping(page) != mapping);
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		account_page_dirtied(page, mapping);
> +		radix_tree_tag_set(&mapping->page_tree, page_index(page),
> +				   PAGECACHE_TAG_DIRTY);
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  		if (mapping->host) {
>  			/* !PageAnon && !swapper_space */
> @@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
>  		/*
>  		 * We carefully synchronise fault handlers against
>  		 * installing a dirty pte and marking the page dirty
> -		 * at this point. We do this by having them hold the
> -		 * page lock at some point after installing their
> -		 * pte, but before marking the page dirty.
> -		 * Pages are always locked coming in here, so we get
> -		 * the desired exclusion. See mm/memory.c:do_wp_page()
> -		 * for more comments.
> +		 * at this point.  We do this by having them hold the
> +		 * page lock while dirtying the page, and pages are
> +		 * always locked coming in here, so we get the desired
> +		 * exclusion.
>  		 */
>  		if (TestClearPageDirty(page)) {
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
> -- 
> 2.1.3
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()
  2014-12-05 14:52 ` [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page() Johannes Weiner
@ 2014-12-09 18:22   ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2014-12-09 18:22 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, Kirill A. Shutemov, linux-mm, linux-fsdevel,
	linux-kernel

On Fri 05-12-14 09:52:46, Johannes Weiner wrote:
> Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
> pretty much the same.  Make sure the page references are the same in
> both cases, then merge the two branches.
> 
> It's tempting to go even further and page-lock the !page_mkwrite case,
> to get it in line with everybody else setting the page table and thus
> further simplify the model.  But that's not quite compelling enough to
> justify dropping the pte lock, then relocking and verifying the entry
> for filesystems without ->page_mkwrite, which notably includes tmpfs.
> Leave it for now and lock the page late in the !page_mkwrite case.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/memory.c | 48 +++++++++++++++++-------------------------------
>  1 file changed, 17 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5640a718ac58..df47fd0a4b7f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2046,7 +2046,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pte_t entry;
>  	int ret = 0;
>  	int page_mkwrite = 0;
> -	struct page *dirty_page = NULL;
> +	bool dirty_shared = false;
>  	unsigned long mmun_start = 0;	/* For mmu_notifiers */
>  	unsigned long mmun_end = 0;	/* For mmu_notifiers */
>  	struct mem_cgroup *memcg;
> @@ -2097,6 +2097,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unlock_page(old_page);
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
> +		page_cache_get(old_page);
>  		/*
>  		 * Only catch write-faults on shared writable pages,
>  		 * read-only shared pages can get COWed by
> @@ -2104,7 +2105,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 */
>  		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
>  			int tmp;
> -			page_cache_get(old_page);
> +
>  			pte_unmap_unlock(page_table, ptl);
>  			tmp = do_page_mkwrite(vma, old_page, address);
>  			if (unlikely(!tmp || (tmp &
> @@ -2124,11 +2125,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  				unlock_page(old_page);
>  				goto unlock;
>  			}
> -
>  			page_mkwrite = 1;
>  		}
> -		dirty_page = old_page;
> -		get_page(dirty_page);
> +
> +		dirty_shared = true;
>  
>  reuse:
>  		/*
> @@ -2147,43 +2147,29 @@ reuse:
>  		pte_unmap_unlock(page_table, ptl);
>  		ret |= VM_FAULT_WRITE;
>  
> -		if (!dirty_page)
> -			return ret;
> -
> -		if (!page_mkwrite) {
> +		if (dirty_shared) {
>  			struct address_space *mapping;
>  			int dirtied;
>  
> -			lock_page(dirty_page);
> -			dirtied = set_page_dirty(dirty_page);
> -			VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
> -			mapping = dirty_page->mapping;
> -			unlock_page(dirty_page);
> +			if (!page_mkwrite)
> +				lock_page(old_page);
>  
> -			if (dirtied && mapping) {
> -				/*
> -				 * Some device drivers do not set page.mapping
> -				 * but still dirty their pages
> -				 */
> -				balance_dirty_pages_ratelimited(mapping);
> -			}
> +			dirtied = set_page_dirty(old_page);
> +			VM_BUG_ON_PAGE(PageAnon(old_page), old_page);
> +			mapping = old_page->mapping;
> +			unlock_page(old_page);
> +			page_cache_release(old_page);
>  
> -			file_update_time(vma->vm_file);
> -		}
> -		put_page(dirty_page);
> -		if (page_mkwrite) {
> -			struct address_space *mapping = dirty_page->mapping;
> -
> -			set_page_dirty(dirty_page);
> -			unlock_page(dirty_page);
> -			page_cache_release(dirty_page);
> -			if (mapping)	{
> +			if ((dirtied || page_mkwrite) && mapping) {
>  				/*
>  				 * Some device drivers do not set page.mapping
>  				 * but still dirty their pages
>  				 */
>  				balance_dirty_pages_ratelimited(mapping);
>  			}
> +
> +			if (!page_mkwrite)
> +				file_update_time(vma->vm_file);
>  		}
>  
>  		return ret;
> -- 
> 2.1.3
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-05 14:52 [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
                   ` (2 preceding siblings ...)
  2014-12-09 18:18 ` [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Jan Kara
@ 2017-04-10  2:22 ` alexander.levin
  2017-04-10 12:06   ` Jan Kara
  3 siblings, 1 reply; 15+ messages in thread
From: alexander.levin @ 2017-04-10  2:22 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, Kirill A. Shutemov, linux-mm, linux-fsdevel,
	linux-kernel

On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
> 
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
> 				     if (PageDirty())
> 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
> 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> 
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> 
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held.  The notable exception to this rule, though,
> is do_wp_page(), for which this race exists.  However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
> 
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed.  Remove it.
> 
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@vger.kernel.org>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Hi Johannes,

I'm seeing the following while fuzzing with trinity on linux-next (I've changed
the WARN to a VM_BUG_ON_PAGE for some extra page info).

[   18.991007] page:ffffea000307c8c0 count:3 mapcount:0 mapping:ffff88010444cbf8 index:0x1
[   18.993051] flags: 0x1fffc0000000011(locked|dirty)
[   18.993621] raw: 01fffc0000000011 ffff88010444cbf8 0000000000000001 00000003ffffffff
[   18.994522] raw: dead000000000100 dead000000000200 0000000000000000 ffff880109c38008
[   18.995418] page dumped because: VM_BUG_ON_PAGE(!PagePrivate(page) && !PageUptodate(page))
[   18.996381] page->mem_cgroup:ffff880109c38008
[   18.996935] ------------[ cut here ]------------
[   18.997483] kernel BUG at mm/page-writeback.c:2486!
[   18.998063] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
[   18.998756] Modules linked in:
[   18.999129] CPU: 5 PID: 1388 Comm: trinity-c34 Not tainted 4.11.0-rc5-next-20170407-dirty #12
[   19.000117] task: ffff880106ee5d40 task.stack: ffff8800c0f40000
[   19.000828] RIP: 0010:__set_page_dirty_nobuffers (??:?)
[   19.001491] RSP: 0018:ffff8800c0f47318 EFLAGS: 00010006
[   19.002103] RAX: 0000000000000000 RBX: 1ffff100181e8e67 RCX: 0000000000000000
[   19.002929] RDX: 0000000000000021 RSI: 1ffff100181e8da7 RDI: ffffed00181e8e58
[   19.004806] RBP: ffff8800c0f47440 R08: 3830303833633930 R09: 3130383866666666
[   19.005626] R10: dffffc0000000000 R11: 0000000000001491 R12: ffff8800c0f47418
[   19.006452] R13: ffffea000307c8c0 R14: ffff88010444cc10 R15: ffff88010444cbf8
[   19.007277] FS:  00007ff6a26fb700(0000) GS:ffff88010a340000(0000) knlGS:0000000000000000
[   19.008424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.009092] CR2: 00007ff6a155267c CR3: 00000000cb301000 CR4: 00000000000406a0
[   19.009919] Call Trace:
[   19.012266] set_page_dirty (mm/page-writeback.c:2579)
[   19.020028] v9fs_write_end (fs/9p/vfs_addr.c:325)
[   19.022473] generic_perform_write (mm/filemap.c:2842)
[   19.024857] __generic_file_write_iter (mm/filemap.c:2957)
[   19.025830] generic_file_write_iter (./include/linux/fs.h:702 mm/filemap.c:2985)
[   19.028549] __do_readv_writev (./include/linux/fs.h:1734 fs/read_write.c:696 fs/read_write.c:862)
[   19.029924] do_readv_writev (fs/read_write.c:895)
[   19.034044] vfs_writev (fs/read_write.c:921)
[   19.035223] do_writev (fs/read_write.c:955)
[   19.036925] SyS_writev (fs/read_write.c:1024)
[   19.037297] do_syscall_64 (arch/x86/entry/common.c:284)
[   19.042085] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)
[   19.042608] RIP: 0033:0x7ff6a200a8e9
[   19.043015] RSP: 002b:00007fff78079608 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
[   19.044253] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff6a200a8e9
[   19.045045] RDX: 0000000000000001 RSI: 0000000002337d60 RDI: 000000000000018b
[   19.045835] RBP: 00007ff6a2601000 R08: 000000482a1a83cf R09: fffdffffffffffff
[   19.046627] R10: 0012536735f82cf7 R11: 0000000000000246 R12: 0000000000000002
[   19.047413] R13: 00007ff6a2601048 R14: 00007ff6a26fb698 R15: 00007ff6a2601000
[   19.048212] Code: 89 85 f0 fe ff ff e8 39 1b 20 00 8b 85 f0 fe ff ff eb 1a e8 2c bd 12 00 31 c0 eb 11 48 c7 c6 e0 c4 47 83 4c 89 ef e8 39 44 07 00 <0f> 0b 48 ba 00 00 00 00 00 fc ff df 48 c7 04 13 00 00 00 00 48
All code
========
   0:   89 85 f0 fe ff ff       mov    %eax,-0x110(%rbp)
   6:   e8 39 1b 20 00          callq  0x201b44 
   b:   8b 85 f0 fe ff ff       mov    -0x110(%rbp),%eax 
  11:   eb 1a                   jmp    0x2d 
  13:   e8 2c bd 12 00          callq  0x12bd44 
  18:   31 c0                   xor    %eax,%eax 
  1a:   eb 11                   jmp    0x2d 
  1c:   48 c7 c6 e0 c4 47 83    mov    $0xffffffff8347c4e0,%rsi
  23:   4c 89 ef                mov    %r13,%rdi
  26:   e8 39 44 07 00          callq  0x74464
  2b:*  0f 0b                   ud2             <-- trapping instruction
  2d:   48 ba 00 00 00 00 00    movabs $0xdffffc0000000000,%rdx
  34:   fc ff df
  37:   48 c7 04 13 00 00 00    movq   $0x0,(%rbx,%rdx,1)
  3e:   00
  3f:   48                      rex.W
        ...

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   48 ba 00 00 00 00 00    movabs $0xdffffc0000000000,%rdx
   9:   fc ff df
   c:   48 c7 04 13 00 00 00    movq   $0x0,(%rbx,%rdx,1)
  13:   00
  14:   48                      rex.W
        ...
[   19.050311] RIP: __set_page_dirty_nobuffers+0x407/0x450 RSP: ffff8800c0f47318 (??:?)

-- 

Thanks,
Sasha


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2017-04-10  2:22 ` alexander.levin
@ 2017-04-10 12:06   ` Jan Kara
  2017-04-10 15:07     ` alexander.levin
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2017-04-10 12:06 UTC
  To: alexander.levin
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Hugh Dickins,
	Michel Lespinasse, Jan Kara, Kirill A. Shutemov, linux-mm,
	linux-fsdevel, linux-kernel

On Mon 10-04-17 02:22:33, alexander.levin@verizon.com wrote:
> On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> > 
> > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> >   if (TestSetPageDirty(page))
> >                                      page->mapping = NULL
> > 				     if (PageDirty())
> > 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> > 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >     if (page->mapping)
> >       account_page_dirtied(page)
> >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> > 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > 
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > 
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held.  The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists.  However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> > 
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed.  Remove it.
> > 
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: <stable@vger.kernel.org>
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Hi Johannes,
> 
> I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> the WARN to a VM_BUG_ON_PAGE for some extra page info).

But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
a !Uptodate page?

								Honza

> 
> [   18.991007] page:ffffea000307c8c0 count:3 mapcount:0 mapping:ffff88010444cbf8 index:0x1
> [   18.993051] flags: 0x1fffc0000000011(locked|dirty)
> [   18.993621] raw: 01fffc0000000011 ffff88010444cbf8 0000000000000001 00000003ffffffff
> [   18.994522] raw: dead000000000100 dead000000000200 0000000000000000 ffff880109c38008
> [   18.995418] page dumped because: VM_BUG_ON_PAGE(!PagePrivate(page) && !PageUptodate(page))
> [   18.996381] page->mem_cgroup:ffff880109c38008
> [   18.996935] ------------[ cut here ]------------
> [   18.997483] kernel BUG at mm/page-writeback.c:2486!
> [   18.998063] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> [   18.998756] Modules linked in:
> [   18.999129] CPU: 5 PID: 1388 Comm: trinity-c34 Not tainted 4.11.0-rc5-next-20170407-dirty #12
> [   19.000117] task: ffff880106ee5d40 task.stack: ffff8800c0f40000
> [   19.000828] RIP: 0010:__set_page_dirty_nobuffers (??:?)
> [   19.001491] RSP: 0018:ffff8800c0f47318 EFLAGS: 00010006
> [   19.002103] RAX: 0000000000000000 RBX: 1ffff100181e8e67 RCX: 0000000000000000
> [   19.002929] RDX: 0000000000000021 RSI: 1ffff100181e8da7 RDI: ffffed00181e8e58
> [   19.004806] RBP: ffff8800c0f47440 R08: 3830303833633930 R09: 3130383866666666
> [   19.005626] R10: dffffc0000000000 R11: 0000000000001491 R12: ffff8800c0f47418
> [   19.006452] R13: ffffea000307c8c0 R14: ffff88010444cc10 R15: ffff88010444cbf8
> [   19.007277] FS:  00007ff6a26fb700(0000) GS:ffff88010a340000(0000) knlGS:0000000000000000
> [   19.008424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   19.009092] CR2: 00007ff6a155267c CR3: 00000000cb301000 CR4: 00000000000406a0
> [   19.009919] Call Trace:
> [   19.012266] set_page_dirty (mm/page-writeback.c:2579)
> [   19.020028] v9fs_write_end (fs/9p/vfs_addr.c:325)
> [   19.022473] generic_perform_write (mm/filemap.c:2842)
> [   19.024857] __generic_file_write_iter (mm/filemap.c:2957)
> [   19.025830] generic_file_write_iter (./include/linux/fs.h:702 mm/filemap.c:2985)
> [   19.028549] __do_readv_writev (./include/linux/fs.h:1734 fs/read_write.c:696 fs/read_write.c:862)
> [   19.029924] do_readv_writev (fs/read_write.c:895)
> [   19.034044] vfs_writev (fs/read_write.c:921)
> [   19.035223] do_writev (fs/read_write.c:955)
> [   19.036925] SyS_writev (fs/read_write.c:1024)
> [   19.037297] do_syscall_64 (arch/x86/entry/common.c:284)
> [   19.042085] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)
> [   19.042608] RIP: 0033:0x7ff6a200a8e9
> [   19.043015] RSP: 002b:00007fff78079608 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> [   19.044253] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff6a200a8e9
> [   19.045045] RDX: 0000000000000001 RSI: 0000000002337d60 RDI: 000000000000018b
> [   19.045835] RBP: 00007ff6a2601000 R08: 000000482a1a83cf R09: fffdffffffffffff
> [   19.046627] R10: 0012536735f82cf7 R11: 0000000000000246 R12: 0000000000000002
> [   19.047413] R13: 00007ff6a2601048 R14: 00007ff6a26fb698 R15: 00007ff6a2601000
> [   19.048212] Code: 89 85 f0 fe ff ff e8 39 1b 20 00 8b 85 f0 fe ff ff eb 1a e8 2c bd 12 00 31 c0 eb 11 48 c7 c6 e0 c4 47 83 4c 89 ef e8 39 44 07 00 <0f> 0b 48 ba 00 00 00 00 00 fc ff df 48 c7 04 13 00 00 00 00 48
> All code
> ========
>    0:   89 85 f0 fe ff ff       mov    %eax,-0x110(%rbp)
>    6:   e8 39 1b 20 00          callq  0x201b44 
>    b:   8b 85 f0 fe ff ff       mov    -0x110(%rbp),%eax 
>   11:   eb 1a                   jmp    0x2d 
>   13:   e8 2c bd 12 00          callq  0x12bd44 
>   18:   31 c0                   xor    %eax,%eax 
>   1a:   eb 11                   jmp    0x2d 
>   1c:   48 c7 c6 e0 c4 47 83    mov    $0xffffffff8347c4e0,%rsi
>   23:   4c 89 ef                mov    %r13,%rdi
>   26:   e8 39 44 07 00          callq  0x74464
>   2b:*  0f 0b                   ud2             <-- trapping instruction
>   2d:   48 ba 00 00 00 00 00    movabs $0xdffffc0000000000,%rdx
>   34:   fc ff df
>   37:   48 c7 04 13 00 00 00    movq   $0x0,(%rbx,%rdx,1)
>   3e:   00
>   3f:   48                      rex.W
>         ...
> 
> Code starting with the faulting instruction
> ===========================================
>    0:   0f 0b                   ud2
>    2:   48 ba 00 00 00 00 00    movabs $0xdffffc0000000000,%rdx
>    9:   fc ff df
>    c:   48 c7 04 13 00 00 00    movq   $0x0,(%rbx,%rdx,1)
>   13:   00
>   14:   48                      rex.W
>         ...
> [   19.050311] RIP: __set_page_dirty_nobuffers+0x407/0x450 RSP: ffff8800c0f47318 (??:?)
> 
> -- 
> 
> Thanks,
> Sasha
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2017-04-10 12:06   ` Jan Kara
@ 2017-04-10 15:07     ` alexander.levin
  2017-04-10 15:51       ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: alexander.levin @ 2017-04-10 15:07 UTC
  To: Jan Kara
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Hugh Dickins,
	Michel Lespinasse, Kirill A. Shutemov, linux-mm, linux-fsdevel,
	linux-kernel

On Mon, Apr 10, 2017 at 02:06:38PM +0200, Jan Kara wrote:
> On Mon 10-04-17 02:22:33, alexander.levin@verizon.com wrote:
> > On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > > Tejun, while reviewing the code, spotted the following race condition
> > > between the dirtying and truncation of a page:
> > > 
> > > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> > >   if (TestSetPageDirty(page))
> > >                                      page->mapping = NULL
> > > 				     if (PageDirty())
> > > 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> > > 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > >     if (page->mapping)
> > >       account_page_dirtied(page)
> > >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> > > 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > 
> > > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > > 
> > > Dirtiers usually lock out truncation, either by holding the page lock
> > > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > > the page table lock held.  The notable exception to this rule, though,
> > > is do_wp_page(), for which this race exists.  However, do_wp_page()
> > > already waits for a locked page to unlock before setting the dirty
> > > bit, in order to prevent a race where clear_page_dirty() misses the
> > > page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> > > locked set_page_dirty() to also cover the situation explained above.
> > > 
> > > Afterwards, the code in set_page_dirty() dealing with a truncation
> > > race is no longer needed.  Remove it.
> > > 
> > > Reported-by: Tejun Heo <tj@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: <stable@vger.kernel.org>
> > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > 
> > Hi Johannes,
> > 
> > I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> > the WARN to a VM_BUG_ON_PAGE for some extra page info).
> 
> But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
> a !Uptodate page?

I thought that 77469c3f5 ("9p: saner ->write_end() on failing copy into
non-uptodate page") prevented that from happening, but that's actually the
change that's causing it (I ended up misreading it last night).

Will fix it as follows:

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c 
index adaf6f6..be84c0c 100644 
--- a/fs/9p/vfs_addr.c 
+++ b/fs/9p/vfs_addr.c 
@@ -310,9 +310,13 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping, 
  
        p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping); 
  
-       if (unlikely(copied < len && !PageUptodate(page))) { 
-               copied = 0; 
-               goto out; 
+       if (!PageUptodate(page)) { 
+               if (unlikely(copied < len)) { 
+                       copied = 0;
+                       goto out; 
+               } else { 
+                       SetPageUptodate(page); 
+               } 
        } 
        /* 
         * No need to use i_size_read() here, the i_size
 
-- 

Thanks,
Sasha


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2017-04-10 15:07     ` alexander.levin
@ 2017-04-10 15:51       ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2017-04-10 15:51 UTC
  To: alexander.levin
  Cc: Jan Kara, Johannes Weiner, Andrew Morton, Tejun Heo,
	Hugh Dickins, Michel Lespinasse, Kirill A. Shutemov, linux-mm,
	linux-fsdevel, linux-kernel

On Mon 10-04-17 15:07:58, alexander.levin@verizon.com wrote:
> On Mon, Apr 10, 2017 at 02:06:38PM +0200, Jan Kara wrote:
> > On Mon 10-04-17 02:22:33, alexander.levin@verizon.com wrote:
> > > On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > > > Tejun, while reviewing the code, spotted the following race condition
> > > > between the dirtying and truncation of a page:
> > > > 
> > > > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> > > >   if (TestSetPageDirty(page))
> > > >                                      page->mapping = NULL
> > > > 				     if (PageDirty())
> > > > 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> > > > 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > >     if (page->mapping)
> > > >       account_page_dirtied(page)
> > > >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> > > > 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > > 
> > > > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > > > 
> > > > Dirtiers usually lock out truncation, either by holding the page lock
> > > > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > > > the page table lock held.  The notable exception to this rule, though,
> > > > is do_wp_page(), for which this race exists.  However, do_wp_page()
> > > > already waits for a locked page to unlock before setting the dirty
> > > > bit, in order to prevent a race where clear_page_dirty() misses the
> > > > page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> > > > locked set_page_dirty() to also cover the situation explained above.
> > > > 
> > > > Afterwards, the code in set_page_dirty() dealing with a truncation
> > > > race is no longer needed.  Remove it.
> > > > 
> > > > Reported-by: Tejun Heo <tj@kernel.org>
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > Cc: <stable@vger.kernel.org>
> > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > 
> > > Hi Johannes,
> > > 
> > > I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> > > the WARN to a VM_BUG_ON_PAGE for some extra page info).
> > 
> > But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
> > a !Uptodate page?
> 
> I thought that 77469c3f5 ("9p: saner ->write_end() on failing copy into
> non-uptodate page") prevented from that happening, but that's actually the
> change that's causing it (I ended up misreading it last night).
> 
> Will fix it as follows:

Yep, this looks good to me, although I'd find it more future-proof if we
had that SetPageUptodate() additionally guarded by a len == PAGE_SIZE
check.
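
I.e., roughly (untested sketch of what I mean, on top of your fix):

	} else if (len == PAGE_SIZE) {
		/* only a write spanning the whole page can make it
		 * uptodate without reading it in first */
		SetPageUptodate(page);
	}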

								Honza

> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c 
> index adaf6f6..be84c0c 100644 
> --- a/fs/9p/vfs_addr.c 
> +++ b/fs/9p/vfs_addr.c 
> @@ -310,9 +310,13 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping, 
>   
>         p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping); 
>   
> -       if (unlikely(copied < len && !PageUptodate(page))) { 
> -               copied = 0; 
> -               goto out; 
> +       if (!PageUptodate(page)) { 
> +               if (unlikely(copied < len)) { 
> +                       copied = 0;
> +                       goto out; 
> +               } else { 
> +                       SetPageUptodate(page); 
> +               } 
>         } 
>         /* 
>          * No need to use i_size_read() here, the i_size
>  
> -- 
> 
> Thanks,
> Sasha
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-16 16:18 [patch 0/3 resend] mm: close race between dirtying and truncation Johannes Weiner
@ 2014-12-16 16:18 ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-16 16:18 UTC
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara,
	Kirill A. Shutemov, linux-mm, linux-fsdevel, linux-kernel

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
				     if (PageDirty())
				       dec_zone_page_state(page, NR_FILE_DIRTY);
				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held.  The notable exception to this rule, though,
is do_wp_page(), for which this race exists.  However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes.  Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed.  Remove it.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/writeback.h |  1 -
 mm/memory.c               | 27 +++++++++++++++++----------
 mm/page-writeback.c       | 43 ++++++++++++-------------------------------
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
 void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index c3b9097251c5..4c06cdcd36cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2137,17 +2137,24 @@ reuse:
 		if (!dirty_page)
 			return ret;
 
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * bit after it clear all dirty ptes, but before a racing
-		 * do_wp_page installs a dirty pte.
-		 *
-		 * do_shared_fault is protected similarly.
-		 */
 		if (!page_mkwrite) {
-			wait_on_page_locked(dirty_page);
-			set_page_dirty_balance(dirty_page);
+			struct address_space *mapping;
+			int dirtied;
+
+			lock_page(dirty_page);
+			dirtied = set_page_dirty(dirty_page);
+			VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
+			mapping = dirty_page->mapping;
+			unlock_page(dirty_page);
+
+			if (dirtied && mapping) {
+				/*
+				 * Some device drivers do not set page.mapping
+				 * but still dirty their pages
+				 */
+				balance_dirty_pages_ratelimited(mapping);
+			}
+
 			/* file_update_time outside page_lock */
 			if (vma->vm_file)
 				file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d5d81f5384d1..6f4335238e33 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
 		bdi_start_background_writeback(bdi);
 }
 
-void set_page_dirty_balance(struct page *page)
-{
-	if (set_page_dirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-
-		if (mapping)
-			balance_dirty_pages_ratelimited(mapping);
-	}
-}
-
 static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
  * page dirty in that case, but not all the buffers.  This is a "bottom-up"
  * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
  *
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation.  Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
-		struct address_space *mapping2;
 		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
-		mapping2 = page_mapping(page);
-		if (mapping2) { /* Race with truncate? */
-			BUG_ON(mapping2 != mapping);
-			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-			account_page_dirtied(page, mapping);
-			radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		}
+		BUG_ON(page_mapping(page) != mapping);
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		account_page_dirtied(page, mapping);
+		radix_tree_tag_set(&mapping->page_tree, page_index(page),
+				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
-		 * at this point. We do this by having them hold the
-		 * page lock at some point after installing their
-		 * pte, but before marking the page dirty.
-		 * Pages are always locked coming in here, so we get
-		 * the desired exclusion. See mm/memory.c:do_wp_page()
-		 * for more comments.
+		 * at this point.  We do this by having them hold the
+		 * page lock while dirtying the page, and pages are
+		 * always locked coming in here, so we get the desired
+		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-- 
2.1.3
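
To see the fix at a glance — a condensed sketch, not the literal kernel code — the !page_mkwrite path in do_wp_page() goes from dirtying after an unlocked wait to dirtying under the page lock:

	/* Before: the page is dirtied without holding the page lock, so
	 * truncation can clear ->mapping between TestSetPageDirty() and
	 * the dirty accounting. */
	wait_on_page_locked(dirty_page);
	set_page_dirty_balance(dirty_page);

	/* After: set_page_dirty() runs under the page lock, which locks
	 * out truncation; balancing happens afterwards, outside the lock. */
	lock_page(dirty_page);
	dirtied = set_page_dirty(dirty_page);
	mapping = dirty_page->mapping;
	unlock_page(dirty_page);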



* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-02 11:56 ` Kirill A. Shutemov
@ 2014-12-02 15:11   ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-02 15:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, linux-mm, linux-fsdevel, linux-kernel

On Tue, Dec 02, 2014 at 01:56:52PM +0200, Kirill A. Shutemov wrote:
> On Mon, Dec 01, 2014 at 05:58:00PM -0500, Johannes Weiner wrote:
> > [...]
> >  		if (!page_mkwrite) {
> > -			wait_on_page_locked(dirty_page);
> > -			set_page_dirty_balance(dirty_page);
> > +			struct address_space *mapping;
> > +			int dirtied;
> > +
> > +			lock_page(dirty_page);
> > +			dirtied = set_page_dirty(dirty_page);
> > +			mapping = dirty_page->mapping;
> 
> At first, I wanted to ask why you don't use page_mapping() here, but after
> a bit of digging I see we cannot get here with an anon page.
> 
> An explicit VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page); would be great.

Fair enough, I added that.

> Otherwise looks good to me.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks!


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-02  9:12 ` Jan Kara
@ 2014-12-02 15:06   ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-02 15:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Dec 02, 2014 at 10:12:12AM +0100, Jan Kara wrote:
> On Mon 01-12-14 17:58:00, Johannes Weiner wrote:
> > [...]
> >  		if (!page_mkwrite) {
> > -			wait_on_page_locked(dirty_page);
> > -			set_page_dirty_balance(dirty_page);
> > +			struct address_space *mapping;
> > +			int dirtied;
> > +
> > +			lock_page(dirty_page);
> > +			dirtied = set_page_dirty(dirty_page);
> > +			mapping = dirty_page->mapping;
> > +			unlock_page(dirty_page);
> > +
> > +			if (dirtied && mapping) {
> > +				/*
> > +				 * Some device drivers do not set page.mapping
> > +				 * but still dirty their pages
> > +				 */
>   The comment doesn't make sense to me here. Is it meant to explain why we
> check 'mapping' in the above condition? I always thought truncate is the
> main reason.

Yes, I just copied it from the page_mkwrite case a few lines down, and
there is another copy of it in do_shared_fault().  Truncate is also a
possibility during a race, of course, but even without it we expect
that the mapping can be NULL for certain device drivers.
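
Sketching the point (illustrative only, mirroring the hunk quoted above): set_page_dirty() can report a newly dirtied page whose ->mapping is nonetheless NULL — a driver-dirtied page that was never in the page cache, or one that a racing truncate just removed — so the throttle has to be conditional:

	if (dirtied && mapping)
		/* only throttle writers against a real address_space */
		balance_dirty_pages_ratelimited(mapping);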


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-01 22:58 Johannes Weiner
  2014-12-02  9:12 ` Jan Kara
@ 2014-12-02 11:56 ` Kirill A. Shutemov
  2014-12-02 15:11   ` Johannes Weiner
  1 sibling, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2014-12-02 11:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, linux-mm, linux-fsdevel, linux-kernel

On Mon, Dec 01, 2014 at 05:58:00PM -0500, Johannes Weiner wrote:
> [...]
>  		if (!page_mkwrite) {
> -			wait_on_page_locked(dirty_page);
> -			set_page_dirty_balance(dirty_page);
> +			struct address_space *mapping;
> +			int dirtied;
> +
> +			lock_page(dirty_page);
> +			dirtied = set_page_dirty(dirty_page);
> +			mapping = dirty_page->mapping;

At first, I wanted to ask why you don't use page_mapping() here, but after
a bit of digging I see we cannot get here with an anon page.

An explicit VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page); would be great.

Otherwise looks good to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov
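
For reference, the v2 hunk earlier in this thread takes up the suggestion, placing the assertion between the dirtying and the ->mapping read:

	lock_page(dirty_page);
	dirtied = set_page_dirty(dirty_page);
	VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
	mapping = dirty_page->mapping;
	unlock_page(dirty_page);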


* Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
  2014-12-01 22:58 Johannes Weiner
@ 2014-12-02  9:12 ` Jan Kara
  2014-12-02 15:06   ` Johannes Weiner
  2014-12-02 11:56 ` Kirill A. Shutemov
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2014-12-02  9:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, linux-mm, linux-fsdevel, linux-kernel

On Mon 01-12-14 17:58:00, Johannes Weiner wrote:
> [...]
>  		if (!page_mkwrite) {
> -			wait_on_page_locked(dirty_page);
> -			set_page_dirty_balance(dirty_page);
> +			struct address_space *mapping;
> +			int dirtied;
> +
> +			lock_page(dirty_page);
> +			dirtied = set_page_dirty(dirty_page);
> +			mapping = dirty_page->mapping;
> +			unlock_page(dirty_page);
> +
> +			if (dirtied && mapping) {
> +				/*
> +				 * Some device drivers do not set page.mapping
> +				 * but still dirty their pages
> +				 */
  The comment doesn't make sense to me here. Is it meant to explain why we
check 'mapping' in the above condition? I always thought truncate is the
main reason.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* [patch 1/3] mm: protect set_page_dirty() from ongoing truncation
@ 2014-12-01 22:58 Johannes Weiner
  2014-12-02  9:12 ` Jan Kara
  2014-12-02 11:56 ` Kirill A. Shutemov
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Weiner @ 2014-12-01 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara, linux-mm,
	linux-fsdevel, linux-kernel

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
				     if (PageDirty())
				       dec_zone_page_state(page, NR_FILE_DIRTY);
				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held.  The notable exception to this rule, though,
is do_wp_page(), for which this race exists.  However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes.  Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed.  Remove it.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
---
 include/linux/writeback.h |  1 -
 mm/memory.c               | 26 ++++++++++++++++----------
 mm/page-writeback.c       | 43 ++++++++++++-------------------------------
 3 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
 		      void *data);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
 void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..73220eb6e9e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,23 @@ reuse:
 		if (!dirty_page)
 			return ret;
 
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * bit after it clear all dirty ptes, but before a racing
-		 * do_wp_page installs a dirty pte.
-		 *
-		 * do_shared_fault is protected similarly.
-		 */
 		if (!page_mkwrite) {
-			wait_on_page_locked(dirty_page);
-			set_page_dirty_balance(dirty_page);
+			struct address_space *mapping;
+			int dirtied;
+
+			lock_page(dirty_page);
+			dirtied = set_page_dirty(dirty_page);
+			mapping = dirty_page->mapping;
+			unlock_page(dirty_page);
+
+			if (dirtied && mapping) {
+				/*
+				 * Some device drivers do not set page.mapping
+				 * but still dirty their pages
+				 */
+				balance_dirty_pages_ratelimited(mapping);
+			}
+
 			/* file_update_time outside page_lock */
 			if (vma->vm_file)
 				file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..437174a2aaa3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
 		bdi_start_background_writeback(bdi);
 }
 
-void set_page_dirty_balance(struct page *page)
-{
-	if (set_page_dirty(page)) {
-		struct address_space *mapping = page_mapping(page);
-
-		if (mapping)
-			balance_dirty_pages_ratelimited(mapping);
-	}
-}
-
 static DEFINE_PER_CPU(int, bdp_ratelimits);
 
 /*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
  * page dirty in that case, but not all the buffers.  This is a "bottom-up"
  * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
  *
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation.  Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
-		struct address_space *mapping2;
 		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
-		mapping2 = page_mapping(page);
-		if (mapping2) { /* Race with truncate? */
-			BUG_ON(mapping2 != mapping);
-			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-			account_page_dirtied(page, mapping);
-			radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		}
+		BUG_ON(page_mapping(page) != mapping);
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		account_page_dirtied(page, mapping);
+		radix_tree_tag_set(&mapping->page_tree, page_index(page),
+				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
-		 * at this point. We do this by having them hold the
-		 * page lock at some point after installing their
-		 * pte, but before marking the page dirty.
-		 * Pages are always locked coming in here, so we get
-		 * the desired exclusion. See mm/memory.c:do_wp_page()
-		 * for more comments.
+		 * at this point.  We do this by having them hold the
+		 * page lock while dirtying the page, and pages are
+		 * always locked coming in here, so we get the desired
+		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-- 
2.1.3




Thread overview: 15+ messages
2014-12-05 14:52 [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
2014-12-05 14:52 ` [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas Johannes Weiner
2014-12-05 14:52 ` [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page() Johannes Weiner
2014-12-09 18:22   ` Jan Kara
2014-12-09 18:18 ` [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Jan Kara
2017-04-10  2:22 ` alexander.levin
2017-04-10 12:06   ` Jan Kara
2017-04-10 15:07     ` alexander.levin
2017-04-10 15:51       ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2014-12-16 16:18 [patch 0/3 resend] mm: close race between dirtying and truncation Johannes Weiner
2014-12-16 16:18 ` [patch 1/3] mm: protect set_page_dirty() from ongoing truncation Johannes Weiner
2014-12-01 22:58 Johannes Weiner
2014-12-02  9:12 ` Jan Kara
2014-12-02 15:06   ` Johannes Weiner
2014-12-02 11:56 ` Kirill A. Shutemov
2014-12-02 15:11   ` Johannes Weiner
