* [rfc patch] mm: protect set_page_dirty() from ongoing truncation
@ 2014-11-25 19:48 ` Johannes Weiner
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2014-11-25 19:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara, linux-mm,
	linux-fsdevel, linux-kernel

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
				     if (PageDirty())
				       dec_zone_page_state(page, NR_FILE_DIRTY);
				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held.  The notable exception to this rule, though,
is do_wp_page(), for which this race exists.  However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes.  Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed.  Remove it.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memory.c         | 11 ++---------
 mm/page-writeback.c | 33 ++++++++++++---------------------
 2 files changed, 14 insertions(+), 30 deletions(-)

It is unfortunate to hold the page lock while balancing dirty pages,
but I don't see what else would protect mapping at that point.  The
same btw applies for the page_mkwrite case: how is mapping safe to
pass to balance_dirty_pages() after unlocking page table and page?

diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..27aaee6b6f4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,10 @@ reuse:
 		if (!dirty_page)
 			return ret;
 
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * bit after it clear all dirty ptes, but before a racing
-		 * do_wp_page installs a dirty pte.
-		 *
-		 * do_shared_fault is protected similarly.
-		 */
 		if (!page_mkwrite) {
-			wait_on_page_locked(dirty_page);
+			lock_page(dirty_page);
 			set_page_dirty_balance(dirty_page);
+			unlock_page(dirty_page);
 			/* file_update_time outside page_lock */
 			if (vma->vm_file)
 				file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..86773236f42a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2123,32 +2123,25 @@ EXPORT_SYMBOL(account_page_dirtied);
  * page dirty in that case, but not all the buffers.  This is a "bottom-up"
  * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
  *
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation.  Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
-		struct address_space *mapping2;
 		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
-		mapping2 = page_mapping(page);
-		if (mapping2) { /* Race with truncate? */
-			BUG_ON(mapping2 != mapping);
-			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-			account_page_dirtied(page, mapping);
-			radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-		}
+		BUG_ON(page_mapping(page) != mapping);
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		account_page_dirtied(page, mapping);
+		radix_tree_tag_set(&mapping->page_tree, page_index(page),
+				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -2305,12 +2298,10 @@ int clear_page_dirty_for_io(struct page *page)
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
-		 * at this point. We do this by having them hold the
-		 * page lock at some point after installing their
-		 * pte, but before marking the page dirty.
-		 * Pages are always locked coming in here, so we get
-		 * the desired exclusion. See mm/memory.c:do_wp_page()
-		 * for more comments.
+		 * at this point.  We do this by having them hold the
+		 * page lock while dirtying the page, and pages are
+		 * always locked coming in here, so we get the desired
+		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-- 
2.1.3

* Re: [rfc patch] mm: protect set_page_dirty() from ongoing truncation
  2014-11-25 19:48 ` Johannes Weiner
@ 2014-11-26 22:00   ` Andrew Morton
  -1 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2014-11-26 22:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
> 
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
> 				     if (PageDirty())
> 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
> 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> 
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> 
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held.  The notable exception to this rule, though,
> is do_wp_page(), for which this race exists.  However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
> 
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed.  Remove it.
> 
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memory.c         | 11 ++---------
>  mm/page-writeback.c | 33 ++++++++++++---------------------
>  2 files changed, 14 insertions(+), 30 deletions(-)
> 
> It is unfortunate to hold the page lock while balancing dirty pages,
> but I don't see what else would protect mapping at that point.

Yes.

I'm a bit surprised that calling balance_dirty_pages() under
lock_page() doesn't just go and deadlock.  Memory fails me.

And yes, often the only thing which protects the address_space is
lock_page().

set_page_dirty_balance() and balance_dirty_pages() don't actually need
the address_space - they just use it to get at the backing_dev_info. 
So perhaps what we could do here is to change those functions to take
a bdi directly, then change do_wp_page() to do something like

	lock_page(dirty_page);
	bdi = page->mapping->backing_dev_info;
	need_balance = set_page_dirty2(bdi);
	unlock_page(page);
	if (need_balance)
		balance_dirty_pages_ratelimited2(bdi);


so we no longer require that the address_space be stabilized after
lock_page().  Of course something needs to protect the bdi and I'm not
sure what that is, but we're talking about umount and that quiesces and
evicts lots of things before proceeding, so surely there's something in
there which will save us ;)

>  The
> same btw applies for the page_mkwrite case: how is mapping safe to
> pass to balance_dirty_pages() after unlocking page table and page?

I'm not sure which code you're referring to here, but it's likely that
the switch-balancing-to-bdi approach will address that as well?


* Re: [rfc patch] mm: protect set_page_dirty() from ongoing truncation
  2014-11-26 22:00   ` Andrew Morton
@ 2014-11-27  9:40     ` Jan Kara
  -1 siblings, 0 replies; 10+ messages in thread
From: Jan Kara @ 2014-11-27  9:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	Jan Kara, linux-mm, linux-fsdevel, linux-kernel

On Wed 26-11-14 14:00:06, Andrew Morton wrote:
> On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> > 
> > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> >   if (TestSetPageDirty(page))
> >                                      page->mapping = NULL
> > 				     if (PageDirty())
> > 				       dec_zone_page_state(page, NR_FILE_DIRTY);
> > 				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >     if (page->mapping)
> >       account_page_dirtied(page)
> >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> > 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > 
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > 
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held.  The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists.  However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes.  Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> > 
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed.  Remove it.
> > 
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memory.c         | 11 ++---------
> >  mm/page-writeback.c | 33 ++++++++++++---------------------
> >  2 files changed, 14 insertions(+), 30 deletions(-)
> > 
> > It is unfortunate to hold the page lock while balancing dirty pages,
> > but I don't see what else would protect mapping at that point.
> 
> Yes.
> 
> I'm a bit surprised that calling balance_dirty_pages() under
> lock_page() doesn't just go and deadlock.  Memory fails me.
> 
> And yes, often the only thing which protects the address_space is
> lock_page().
> 
> set_page_dirty_balance() and balance_dirty_pages() don't actually need
> the address_space - they just use it to get at the backing_dev_info. 
> So perhaps what we could do here is to change those functions to take
> a bdi directly, then change do_wp_page() to do something like
> 
> 	lock_page(dirty_page);
> 	bdi = page->mapping->backing_dev_info;
> 	need_balance = set_page_dirty2(bdi);
> 	unlock_page(page);
> 	if (need_balance)
> 		balance_dirty_pages_ratelimited2(bdi);
  Yes, please! Holding lock_page() over balance_dirty_pages() definitely has
a potential for deadlock (e.g. the flusher might block on lock_page() in a
WB_SYNC_ALL pass, and then there'd be no one left to clean pages and thus
release the process stuck in balance_dirty_pages()).

> so we no longer require that the address_space be stabilized after
> lock_page().  Of course something needs to protect the bdi and I'm not
> sure what that is, but we're talking about umount and that quiesces and
> evicts lots of things before proceeding, so surely there's something in
> there which will save us ;)
  In do_wp_page() the process doing the fault and ending in
balance_dirty_pages() has to have the page mapped, thus it has to have the
file open => no umount.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: [rfc patch] mm: protect set_page_dirty() from ongoing truncation
  2014-11-27  9:40     ` Jan Kara
@ 2014-11-27  9:50       ` Andrew Morton
  -1 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2014-11-27  9:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Johannes Weiner, Tejun Heo, Hugh Dickins, Michel Lespinasse,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, 27 Nov 2014 10:40:06 +0100 Jan Kara <jack@suse.cz> wrote:

> > so we no longer require that the address_space be stabilized after
> > lock_page().  Of course something needs to protect the bdi and I'm not
> > sure what that is, but we're talking about umount and that quiesces and
> > evicts lots of things before proceeding, so surely there's something in
> > there which will save us ;)
>   In do_wp_page() the process doing the fault and ending in
> balance_dirty_pages() has to have the page mapped, thus it has to have the
> file open => no umount.

Actually, umount isn't enough to kill the backing_dev_info.  It's an
attribute of the device itself (for blockdevs it's a field in
request_queue) so I assume it will be stable until device hot-unplug,
losetup -d, rmmod, etc.  If the backing_dev can go away in the middle
of a pagefault against that device then we have bigger problems ;)




* Re: [rfc patch] mm: protect set_page_dirty() from ongoing truncation
  2014-11-26 22:00   ` Andrew Morton
@ 2014-12-01 22:52     ` Johannes Weiner
  -1 siblings, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2014-12-01 22:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Hugh Dickins, Michel Lespinasse, Jan Kara, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Nov 26, 2014 at 02:00:06PM -0800, Andrew Morton wrote:
> On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> >  The
> > same btw applies for the page_mkwrite case: how is mapping safe to
> > pass to balance_dirty_pages() after unlocking page table and page?
> 
> I'm not sure which code you're referring to here, but it's likely that
> the switch-balancing-to-bdi approach will address that as well?

This code in do_wp_page():

		pte_unmap_unlock(page_table, ptl);
[...]
		put_page(dirty_page);
		if (page_mkwrite) {
			struct address_space *mapping = dirty_page->mapping;

			set_page_dirty(dirty_page);
			unlock_page(dirty_page);
			page_cache_release(dirty_page);
			if (mapping)	{
				/*
				 * Some device drivers do not set page.mapping
				 * but still dirty their pages
				 */
				balance_dirty_pages_ratelimited(mapping);
			}
		}

And there is also this code in do_shared_fault():

	pte_unmap_unlock(pte, ptl);

	if (set_page_dirty(fault_page))
		dirtied = 1;
	mapping = fault_page->mapping;
	unlock_page(fault_page);
	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
		/*
		 * Some device drivers do not set page.mapping but still
		 * dirty their pages
		 */
		balance_dirty_pages_ratelimited(mapping);
	}

I don't see anything that ensures mapping stays alive by the time it's
passed to balance_dirty_pages() in either case.

Argh, but of course there is.  The mmap_sem.  That pins the vma, which
pins the file, which pins the inode.  In all cases.  So I think we can
just stick with passing mapping to balance_dirty_pages() for now.
