All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 2/4] vfs: add set_page_dirty_notag
@ 2009-02-13 11:56 Edward Shishkin
  2009-02-13 13:08 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Shishkin @ 2009-02-13 11:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Ryan Hope, Randy Dunlap, linux-kernel,
	Edward Shishkin, ReiserFS Mailing List

Andrew, please, review.
I'll change things, if needed..

This is a fixup for the following "todo":
akpm wrote:
> reiser4_set_page_dirty_internal() pokes around in VFS internals.
> Use __set_page_dirty_no_buffers() or create a new library function
> in mm/page-writeback.c.

Problem:

In accordance with reiser4 transactional model every dirty page
should be "captured" by some atom. However, outside reiser4 context
dirty page can not be captured in some cases, as it is accompanied
with specific work (jnode creation, etc). Reiser4 recognizes such
"anonymous" pages (i.e. pages that were dirtied outside of reiser4)
by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context
are not tagged at all: we don't need this. Indeed, once page is
dirtied and captured, it is attached to a jnode (a special header
to keep a track of transactions).

reiser4_set_page_dirty_internal() was the internal reiser4 function
that set dirty bit without tagging the page. Having such internal
function led to real problems (incorrect task io accounting, etc.
because of not updating this internal "friend").

Solution:

The following patch adds a core library function that sets a dirty
bit without tagging the page. It should be modified simultaneously
with its "friends": __set_page_dirty_nobuffers, __set_page_dirty.

Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_notag(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
+ * The same as __set_page_dirty_nobuffers, but this function
+ * 1) doesn't tag the page in its radix tree;
+ * 2) makes an assumption that there is no races with truncate.
+ * 3) is not for anonymous or swap pages.
+ */
+int set_page_dirty_notag(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (!TestSetPageDirty(page)) {
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		if (mapping_cap_account_dirty(mapping)) {
+			preempt_disable();
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+				       BDI_RECLAIMABLE);
+			task_dirty_inc(current);
+			task_io_account_write(PAGE_CACHE_SIZE);
+			preempt_enable();
+		}
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty_notag);
+
+/*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
  * redirty_page_for_writepage() and it should then unlock the page and return 0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-13 11:56 [patch 2/4] vfs: add set_page_dirty_notag Edward Shishkin
@ 2009-02-13 13:08 ` Peter Zijlstra
  2009-02-13 13:57   ` Edward Shishkin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-13 13:08 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Fri, 2009-02-13 at 14:56 +0300, Edward Shishkin wrote:

> ---
>  include/linux/mm.h  |    1 +
>  mm/page-writeback.c |   28 ++++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+)
> 
> --- mmotm.orig/include/linux/mm.h
> +++ mmotm/include/linux/mm.h
> @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
>  				struct page *page);
>  int set_page_dirty(struct page *page);
>  int set_page_dirty_lock(struct page *page);
> +int set_page_dirty_notag(struct page *page);
>  int clear_page_dirty_for_io(struct page *page);
>  
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
> --- mmotm.orig/mm/page-writeback.c
> +++ mmotm/mm/page-writeback.c
> @@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
>  /*
> + * The same as __set_page_dirty_nobuffers, but this function
> + * 1) doesn't tag the page in its radix tree;
> + * 2) makes an assumption that there is no races with truncate.
> + * 3) is not for anonymous or swap pages.
> + */
> +int set_page_dirty_notag(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (!TestSetPageDirty(page)) {
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		if (mapping_cap_account_dirty(mapping)) {
> +			preempt_disable();
> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> +			__inc_bdi_stat(mapping->backing_dev_info,
> +				       BDI_RECLAIMABLE);
> +			task_dirty_inc(current);
> +			task_io_account_write(PAGE_CACHE_SIZE);
> +			preempt_enable();
> +		}
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +		return 1;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(set_page_dirty_notag);
> +

Eew, so reiser4 will totally side-step the regular vm inode writeback
paths -- or is this fixed by a more elaborate than usual
a_ops->writepages() ?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-13 13:08 ` Peter Zijlstra
@ 2009-02-13 13:57   ` Edward Shishkin
  2009-02-13 14:09     ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Shishkin @ 2009-02-13 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

Peter Zijlstra wrote:
> On Fri, 2009-02-13 at 14:56 +0300, Edward Shishkin wrote:
>
>   
>> ---
>>  include/linux/mm.h  |    1 +
>>  mm/page-writeback.c |   28 ++++++++++++++++++++++++++++
>>  2 files changed, 29 insertions(+)
>>
>> --- mmotm.orig/include/linux/mm.h
>> +++ mmotm/include/linux/mm.h
>> @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
>>  				struct page *page);
>>  int set_page_dirty(struct page *page);
>>  int set_page_dirty_lock(struct page *page);
>> +int set_page_dirty_notag(struct page *page);
>>  int clear_page_dirty_for_io(struct page *page);
>>  
>>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
>> --- mmotm.orig/mm/page-writeback.c
>> +++ mmotm/mm/page-writeback.c
>> @@ -1248,6 +1248,34 @@ int __set_page_dirty_nobuffers(struct pa
>>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>>  
>>  /*
>> + * The same as __set_page_dirty_nobuffers, but this function
>> + * 1) doesn't tag the page in its radix tree;
>> + * 2) makes an assumption that there is no races with truncate.
>> + * 3) is not for anonymous or swap pages.
>> + */
>> +int set_page_dirty_notag(struct page *page)
>> +{
>> +	struct address_space *mapping = page->mapping;
>> +
>> +	if (!TestSetPageDirty(page)) {
>> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
>> +		if (mapping_cap_account_dirty(mapping)) {
>> +			preempt_disable();
>> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
>> +			__inc_bdi_stat(mapping->backing_dev_info,
>> +				       BDI_RECLAIMABLE);
>> +			task_dirty_inc(current);
>> +			task_io_account_write(PAGE_CACHE_SIZE);
>> +			preempt_enable();
>> +		}
>> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(set_page_dirty_notag);
>> +
>>     
>
> Eew, so reiser4 will totally side-step the regular vm inode writeback
> paths -- or is this fixed by a more elaborate than usual
> a_ops->writepages() ?
>   
The second.

reiser4_writepages() catches the anonymous  (tagged)
pages, captures them mandatory, then commits all atoms
of the file.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-13 13:57   ` Edward Shishkin
@ 2009-02-13 14:09     ` Peter Zijlstra
  2009-02-14 13:11       ` Edward Shishkin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-13 14:09 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote:
> >
> > Eew, so reiser4 will totally side-step the regular vm inode
> writeback
> > paths -- or is this fixed by a more elaborate than usual
> > a_ops->writepages() ?
> >   
> The second.
> 
> reiser4_writepages() catches the anonymous  (tagged)
> pages, captures them mandatory, then commits all atoms
> of the file.

OK, can you then make it painfully clear in the function comment, esp.
since you export this function?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-13 14:09     ` Peter Zijlstra
@ 2009-02-14 13:11       ` Edward Shishkin
  2009-02-14 21:11         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Edward Shishkin @ 2009-02-14 13:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Nick Piggin, Ryan Hope,
	Randy Dunlap, linux-kernel, ReiserFS Mailing List

Peter Zijlstra writes:
 > On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote:
 > > >
 > > > Eew, so reiser4 will totally side-step the regular vm inode
 > > writeback
 > > > paths -- or is this fixed by a more elaborate than usual
 > > > a_ops->writepages() ?
 > > >   
 > > The second.
 > > 
 > > reiser4_writepages() catches the anonymous  (tagged)
 > > pages, captures them mandatory, then commits all atoms
 > > of the file.
 > 
 > OK, can you then make it painfully clear in the function comment, esp.
 > since you export this function?

Hello.
Does it look better?

Thanks,
Edward.

This is a fixup for the following "todo":
akpm wrote:
> reiser4_set_page_dirty_internal() pokes around in VFS internals.
> Use __set_page_dirty_no_buffers() or create a new library function
> in mm/page-writeback.c.

Problem:

In accordance with reiser4 transactional model every dirty page
should be "captured" by some atom. However, outside reiser4 context
dirty page can not be captured in some cases, as it is accompanied
with specific work (jnode creation, etc). Reiser4 recognizes such
"anonymous" pages (i.e. pages that were dirtied outside of reiser4)
by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context
are not tagged at all: we don't need this. Indeed, once page is
dirtied and captured, it is attached to a jnode (a special header
to keep a track of transactions).

reiser4_set_page_dirty_internal() was the internal reiser4 function
that set dirty bit without tagging the page. Having such internal
function led to real problems (incorrect task io accounting, etc.
because of not updating this internal "friend").

Solution:

The following patch adds a core library function that sets a dirty
bit without tagging the page. It should be modified simultaneously
with its "friends": __set_page_dirty_nobuffers, __set_page_dirty.

Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_notag(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1248,6 +1248,46 @@ int __set_page_dirty_nobuffers(struct pa
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
+ * Some filesystems, which don't use buffers, provide their own
+ * writeback means. And it can happen, that even dirty tag, which
+ * is used by generic methods is not needed. In this case it would
+ * be reasonably to use the following lightweight version of
+ * __set_page_dirty_nobuffers:
+ *
+ * Don't tag page as dirty in its radix tree, just set dirty bit
+ * and update the accounting.
+ * NOTE: This function also doesn't take care of races, i.e. the
+ * caller should guarantee that page can not be truncated.
+ */
+int set_page_dirty_notag(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (!TestSetPageDirty(page)) {
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		if (mapping_cap_account_dirty(mapping)) {
+		        /*
+			 * Since we don't tag the page as dirty,
+			 * acquiring the tree_lock is replaced
+			 * with disabling preemption to protect
+			 * per-cpu data used for accounting.
+			 */
+			preempt_disable();
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+				       BDI_RECLAIMABLE);
+			task_dirty_inc(current);
+			task_io_account_write(PAGE_CACHE_SIZE);
+			preempt_enable();
+		}
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty_notag);
+
+/*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
  * redirty_page_for_writepage() and it should then unlock the page and return 0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-14 13:11       ` Edward Shishkin
@ 2009-02-14 21:11         ` Peter Zijlstra
  2009-02-16 22:43           ` Edward Shishkin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-14 21:11 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Sat, 2009-02-14 at 16:11 +0300, Edward Shishkin wrote:
> Peter Zijlstra writes:
>  > On Fri, 2009-02-13 at 16:57 +0300, Edward Shishkin wrote:
>  > > >
>  > > > Eew, so reiser4 will totally side-step the regular vm inode
>  > > writeback
>  > > > paths -- or is this fixed by a more elaborate than usual
>  > > > a_ops->writepages() ?
>  > > >   
>  > > The second.
>  > > 
>  > > reiser4_writepages() catches the anonymous  (tagged)
>  > > pages, captures them mandatory, then commits all atoms
>  > > of the file.
>  > 
>  > OK, can you then make it painfully clear in the function comment, esp.
>  > since you export this function?
> 
> Hello.
> Does it look better?
> 
> Thanks,
> Edward.
> 
> This is a fixup for the following "todo":
> akpm wrote:
> > reiser4_set_page_dirty_internal() pokes around in VFS internals.
> > Use __set_page_dirty_no_buffers() or create a new library function
> > in mm/page-writeback.c.
> 
> Problem:
> 
> In accordance with reiser4 transactional model every dirty page
> should be "captured" by some atom. However, outside reiser4 context
> dirty page can not be captured in some cases, as it is accompanied
> with specific work (jnode creation, etc). Reiser4 recognizes such
> "anonymous" pages (i.e. pages that were dirtied outside of reiser4)
> by the tag PAGECACHE_TAG_DIRTY. Pages dirtied inside reiser4 context
> are not tagged at all: we don't need this. Indeed, once page is
> dirtied and captured, it is attached to a jnode (a special header
> to keep a track of transactions).
> 
> reiser4_set_page_dirty_internal() was the internal reiser4 function
> that set dirty bit without tagging the page. Having such internal
> function led to real problems (incorrect task io accounting, etc.
> because of not updating this internal "friend").
> 
> Solution:
> 
> The following patch adds a core library function that sets a dirty
> bit without tagging the page. It should be modified simultaneously
> with its "friends": __set_page_dirty_nobuffers, __set_page_dirty.
> 
> Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
> ---
>  include/linux/mm.h  |    1 +
>  mm/page-writeback.c |   40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> --- mmotm.orig/include/linux/mm.h
> +++ mmotm/include/linux/mm.h
> @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
>  				struct page *page);
>  int set_page_dirty(struct page *page);
>  int set_page_dirty_lock(struct page *page);
> +int set_page_dirty_notag(struct page *page);
>  int clear_page_dirty_for_io(struct page *page);
>  
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
> --- mmotm.orig/mm/page-writeback.c
> +++ mmotm/mm/page-writeback.c
> @@ -1248,6 +1248,46 @@ int __set_page_dirty_nobuffers(struct pa
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
>  /*
> + * Some filesystems, which don't use buffers, provide their own
> + * writeback means. And it can happen, that even dirty tag, which
> + * is used by generic methods is not needed. In this case it would
> + * be reasonably to use the following lightweight version of
> + * __set_page_dirty_nobuffers:
> + *
> + * Don't tag page as dirty in its radix tree, just set dirty bit
> + * and update the accounting.
> + * NOTE: This function also doesn't take care of races, i.e. the
> + * caller should guarantee that page can not be truncated.
> + */

Maybe something like

/*
 * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers()
 * except it doesn't tag the page dirty in the page-cache radix tree.
 * This means that the address space using this cannot use the regular
 * filemap ->writepages() helpers and must provide its own means of
 * tracking and finding dirty pages.
 *
 * NOTE: furthermore, this version also doesn't handle truncate races.
 */

> +int set_page_dirty_notag(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (!TestSetPageDirty(page)) {
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		if (mapping_cap_account_dirty(mapping)) {
> +		        /*
> +			 * Since we don't tag the page as dirty,
> +			 * acquiring the tree_lock is replaced
> +			 * with disabling preemption to protect
> +			 * per-cpu data used for accounting.
> +			 */

This should be local_irq_save(flags)

> +			preempt_disable();
> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> +			__inc_bdi_stat(mapping->backing_dev_info,
> +				       BDI_RECLAIMABLE);
> +			task_dirty_inc(current);
> +			task_io_account_write(PAGE_CACHE_SIZE);
> +			preempt_enable();

local_irq_restore()

These accounting functions rely on being atomic wrt interrupts.

> +		}
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +		return 1;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(set_page_dirty_notag);

How much performance gain do you see by avoiding that radix tree op? 

I take it the only reason you don't use the regular
__set_page_dirty_nobuffers() and just clear the tag when you do the
write-out by whatever alternative means you have to find the page, is
that it gains you some performance.

It would be good to have some numbers to judge this on.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-14 21:11         ` Peter Zijlstra
@ 2009-02-16 22:43           ` Edward Shishkin
  2009-02-17  9:09             ` Peter Zijlstra
  2009-02-17 22:35               ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Edward Shishkin @ 2009-02-16 22:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Nick Piggin, Ryan Hope,
	Randy Dunlap, linux-kernel, ReiserFS Mailing List

Peter Zijlstra writes:
[...]
 > Maybe something like
 > 
 > /*
 >  * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers()
 >  * except it doesn't tag the page dirty in the page-cache radix tree.
 >  * This means that the address space using this cannot use the regular
 >  * filemap ->writepages() helpers and must provide its own means of
 >  * tracking and finding dirty pages.
 >  *
 >  * NOTE: furthermore, this version also doesn't handle truncate races.
 >  */
 > 

Yup, your version is better. I just have replaced
^finding dirty pages^finding non-tagged dirty pages,
see below..


 > > +int set_page_dirty_notag(struct page *page)
 > > +{
 > > +	struct address_space *mapping = page->mapping;
 > > +
 > > +	if (!TestSetPageDirty(page)) {
 > > +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 > > +		if (mapping_cap_account_dirty(mapping)) {
 > > +		        /*
 > > +			 * Since we don't tag the page as dirty,
 > > +			 * acquiring the tree_lock is replaced
 > > +			 * with disabling preemption to protect
 > > +			 * per-cpu data used for accounting.
 > > +			 */
 > 
 > This should be local_irq_save(flags)
 > 
 > > +			preempt_disable();
 > > +			__inc_zone_page_state(page, NR_FILE_DIRTY);
 > > +			__inc_bdi_stat(mapping->backing_dev_info,
 > > +				       BDI_RECLAIMABLE);
 > > +			task_dirty_inc(current);
 > > +			task_io_account_write(PAGE_CACHE_SIZE);
 > > +			preempt_enable();
 > 
 > local_irq_restore()
 > 
 > These accounting functions rely on being atomic wrt interrupts.
 > 

Thanks!

 > > +		}
 > > +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 > > +		return 1;
 > > +	}
 > > +	return 0;
 > > +}
 > > +EXPORT_SYMBOL(set_page_dirty_notag);
 > 
 > How much performance gain do you see by avoiding that radix tree op? 
 > 

Nop. We want to use it with extended semantics.
All dirty pages are divided into 2 categories:

A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY).
B) captured by atoms (usual linked lists).

reiser4_writepages() looks for pages of "A" in the radix tree
and moves them to "B". set_page_dirty_notag(), introduced by
my patch, is needed for pages of "B".

If "B" is empty, then we get the traditional semantics with
regular ->writepages().

That's all!

 > I take it the only reason you don't use the regular
 > __set_page_dirty_nobuffers() and just clear the tag when you do the
 > write-out by whatever alternative means you have to find the page, is
 > that it gains you some performance.
 > 
 > It would be good to have some numbers to judge this on.
 > 

So, performance is not a concern. We want to extend
functionality, which is currently restricted by the unnecessary(*)
property "dirty bit is set <=> dirty tag is installed".

(*) Reiser4 successfully uses such design since 2002.
I don't remember any BUG_ONs, and related problems.

Please, consider the appended patch.

Thanks,
Edward.

Add set_page_dirty_notag() to the core library to enable
extended functionality of radix tree attached to inode->i_mapping.

Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_notag(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1248,6 +1248,42 @@ int __set_page_dirty_nobuffers(struct pa
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
+ * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers()
+ * except it doesn't tag the page dirty in the page-cache radix tree.
+ * This means that the address space using this cannot use the regular
+ * filemap ->writepages() helpers and must provide its own means of
+ * tracking and finding non-tagged dirty pages.
+ *
+ * NOTE: furthermore, this version also doesn't handle truncate races.
+ */
+int set_page_dirty_notag(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (!TestSetPageDirty(page)) {
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		if (mapping_cap_account_dirty(mapping)) {
+		        /*
+			 * The accounting functions rely on
+			 * being atomic wrt interrupts.
+			 */
+		        unsigned long flags;
+			local_irq_save(flags);
+			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+				       BDI_RECLAIMABLE);
+			task_dirty_inc(current);
+			task_io_account_write(PAGE_CACHE_SIZE);
+			local_irq_restore(flags);
+		}
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty_notag);
+
+/*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
  * redirty_page_for_writepage() and it should then unlock the page and return 0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-16 22:43           ` Edward Shishkin
@ 2009-02-17  9:09             ` Peter Zijlstra
  2009-02-17  9:38               ` Nick Piggin
  2009-02-17 22:35               ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-17  9:09 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: Andrew Morton, Nick Piggin, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, 2009-02-17 at 01:43 +0300, Edward Shishkin wrote:

>  > How much performance gain do you see by avoiding that radix tree op? 
>  > 
> 
> Nop. We want to use it with extended semantics.
> All dirty pages are divided into 2 categories:
> 
> A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY).
> B) captured by atoms (usual linked lists).
> 
> reiser4_writepages() looks for pages of "A" in the radix tree
> and moves them to "B". set_page_dirty_notag(), introduced by
> my patch, is needed for pages of "B".
> 
> If "B" is empty, then we get the traditional semantics with
> regular ->writepages().
> 
> That's all!

Ah, indeed. I had not considered such a scheme.

> Add set_page_dirty_notag() to the core library to enable
> extended functionality of radix tree attached to inode->i_mapping.
> 
> Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>

Looks good to me

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> ---
>  include/linux/mm.h  |    1 +
>  mm/page-writeback.c |   36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> --- mmotm.orig/include/linux/mm.h
> +++ mmotm/include/linux/mm.h
> @@ -841,6 +841,7 @@ int redirty_page_for_writepage(struct wr
>  				struct page *page);
>  int set_page_dirty(struct page *page);
>  int set_page_dirty_lock(struct page *page);
> +int set_page_dirty_notag(struct page *page);
>  int clear_page_dirty_for_io(struct page *page);
>  
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
> --- mmotm.orig/mm/page-writeback.c
> +++ mmotm/mm/page-writeback.c
> @@ -1248,6 +1248,42 @@ int __set_page_dirty_nobuffers(struct pa
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
>  /*
> + * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers()
> + * except it doesn't tag the page dirty in the page-cache radix tree.
> + * This means that the address space using this cannot use the regular
> + * filemap ->writepages() helpers and must provide its own means of
> + * tracking and finding non-tagged dirty pages.
> + *
> + * NOTE: furthermore, this version also doesn't handle truncate races.
> + */
> +int set_page_dirty_notag(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (!TestSetPageDirty(page)) {
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		if (mapping_cap_account_dirty(mapping)) {
> +		        /*
> +			 * The accounting functions rely on
> +			 * being atomic wrt interrupts.
> +			 */
> +		        unsigned long flags;
> +			local_irq_save(flags);
> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> +			__inc_bdi_stat(mapping->backing_dev_info,
> +				       BDI_RECLAIMABLE);
> +			task_dirty_inc(current);
> +			task_io_account_write(PAGE_CACHE_SIZE);
> +			local_irq_restore(flags);
> +		}
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +		return 1;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(set_page_dirty_notag);
> +
> +/*
>   * When a writepage implementation decides that it doesn't want to write this
>   * page for some reason, it should redirty the locked page via
>   * redirty_page_for_writepage() and it should then unlock the page and return 0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-17  9:09             ` Peter Zijlstra
@ 2009-02-17  9:38               ` Nick Piggin
  2009-02-17 10:05                 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2009-02-17  9:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, Feb 17, 2009 at 10:09:41AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 01:43 +0300, Edward Shishkin wrote:
> 
> >  > How much performance gain do you see by avoiding that radix tree op? 
> >  > 
> > 
> > Nop. We want to use it with extended semantics.
> > All dirty pages are divided into 2 categories:
> > 
> > A) tagged in the radix tree (with PAGECACHE_TAG_DIRTY).
> > B) captured by atoms (usual linked lists).
> > 
> > reiser4_writepages() looks for pages of "A" in the radix tree
> > and moves them to "B". set_page_dirty_notag(), introduced by
> > my patch, is needed for pages of "B".
> > 
> > If "B" is empty, then we get the traditional semantics with
> > regular ->writepages().
> > 
> > That's all!
> 
> Ah, indeed. I had not considered such a scheme.

It is a great shame that filesystems are not properly notified
that a page may become dirty before the actual set_page_dirty
event (which is not allowed to fail and is called after the
page is already dirty).

This is a big problem I have with fsblock simply in trying to
make the memory allocation robust. page_mkwrite unfortunately
is racy and I've fixed problems there... the big problem though
is get_user_pages. Fixing that properly seems to require fixing
callers so it is not really realistic in the short term.

As such...
 
> > Add set_page_dirty_notag() to the core library to enable
> > extended functionality of radix tree attached to inode->i_mapping.
> > 
> > Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
> 
> Looks good to me
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

I'm fine with it too.

Acked-by: Nick Piggin <npiggin@suse.de>

You know... it wouldn't be terribly painful to introduce a new
pagecache radix tree tag for filesystem private use (doesn't bloat
the radix tree node size) if there is a strong need for it. But if
you have this workaround then I think it is also reasonable.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-17  9:38               ` Nick Piggin
@ 2009-02-17 10:05                 ` Peter Zijlstra
  2009-02-17 10:24                   ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-17 10:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:

> It is a great shame that filesystems are not properly notified
> that a page may become dirty before the actual set_page_dirty
> event (which is not allowed to fail and is called after the
> page is already dirty).

Not quite true, for example the set_page_dirty() done by the write fault
code is done _before_ the page becomes dirty.

This before/after thing was the reason for that horrid file corruption
bug that dragged on for a few weeks back in .19 (IIRC).

> This is a big problem I have with fsblock simply in trying to
> make the memory allocation robust. page_mkwrite unfortunately
> is racy and I've fixed problems there... the big problem though
> is get_user_pages. Fixing that properly seems to require fixing
> callers so it is not really realistic in the short term.

Right, I'm just not sure what we can do, even with a
prepage_page_dirty() function, what are you going to do, fail the fault?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-17 10:05                 ` Peter Zijlstra
@ 2009-02-17 10:24                   ` Nick Piggin
  2009-02-17 10:40                     ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2009-02-17 10:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:
> 
> > It is a great shame that filesystems are not properly notified
> > that a page may become dirty before the actual set_page_dirty
> > event (which is not allowed to fail and is called after the
> > page is already dirty).
> 
> Not quite true, for example the set_page_dirty() done by the write fault
> code is done _before_ the page becomes dirty.
> 
> This before/after thing was the reason for that horrid file corruption
> bug that dragged on for a few weeks back in .19 (IIRC).

Yeah, there are actually races though. The page can become cleaned
before set_page_dirty is reached, and there are also nasty races with
truncate.

 
> > This is a big problem I have with fsblock simply in trying to
> > make the memory allocation robust. page_mkwrite unfortunately
> > is racy and I've fixed problems there... the big problem though
> > is get_user_pages. Fixing that properly seems to require fixing
> > callers so it is not really realistic in the short term.
> 
> Right, I'm just not sure what we can do, even with a
> prepage_page_dirty() function, what are you going to do, fail the fault?

Oh, for regular page fault functions using page_mkwrite, they
definitely want to fail the fault with a SIGBUS, and actually XFS
already does that (for fsblock robust memory allocations you
would also want to fail OOM on metadata allocation failure). What
is the other option? Silently fail the write?

For XFS purpose (ie. -ENOSPC handling), the current code is reasonable
although there could be some truncate races with block allocation. But
mostly probably works. For something like fsblock it can be much more
common to have the metadata refcount reach 0 and freed before spd is
called. In that case the code actually goes into a bug situation so it
is a bit more critical.

But no that's the "easy" part. The hard part is get_user_pages
because the caller can hold onto the page indefinitely simply with a
refcount, and go along happily dirtying it at any stage (actually
writing to the page memory) before actually calling set_page_dirty.

The "cleanest" way to fix this from VM point of view is probably to
force gup callers to hold the page locked for the duration to
prevent truncation or writeout after the filesystem notification.
Don't know if that would be very popular, however.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 10:24                   ` Nick Piggin
@ 2009-02-17 10:40                     ` Peter Zijlstra
  2009-02-17 11:25                       ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-17 10:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote:
> On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote:
> > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:
> > 
> > > It is a great shame that filesystems are not properly notified
> > > that a page may become dirty before the actual set_page_dirty
> > > event (which is not allowed to fail and is called after the
> > > page is already dirty).
> > 
> > Not quite true, for example the set_page_dirty() done by the write fault
> > code is done _before_ the page becomes dirty.
> > 
> > This before/after thing was the reason for that horrid file corruption
> > bug that dragged on for a few weeks back in .19 (IIRC).
> 
> Yeah, there are actually races though. The page can become cleaned
> before set_page_dirty is reached, and there are also nasty races with
> truncate.

Hmm, so you're saying that never got properly fixed?
 
> > > This is a big problem I have with fsblock simply in trying to
> > > make the memory allocation robust. page_mkwrite unfortunately
> > > is racy and I've fixed problems there... the big problem though
> > > is get_user_pages. Fixing that properly seems to require fixing
> > > callers so it is not really realistic in the short term.
> > 
> > Right, I'm just not sure what we can do, even with a
> > prepage_page_dirty() function, what are you going to do, fail the fault?
> 
> Oh, for regular page fault functions using page_mkwrite, they
> definitely want to fail the fault with a SIGBUS, and actually XFS
> already does that (for fsblock robust memory allocations you
> would also want to fail OOM on metadata allocation failure). What
> is the other option? Silently fail the write?

OK, agreed.

> For XFS purpose (ie. -ENOSPC handling), the current code is reasonable
> although there could be some truncate races with block allocation. But
> mostly probably works. For something like fsblock it can be much more
> common to have the metadata refcount reach 0 and freed before spd is
> called. In that case the code actually goes into a bug situation so it
> is a bit more critical.
> 
> But no that's the "easy" part. The hard part is get_user_pages
> because the caller can hold onto the page indefinitely simply with a
> refcount, and go along happily dirtying it at any stage (actually
> writing to the page memory) before actually calling set_page_dirty.

Should a gup user not specify .write=1 if it wants to dirty the page, at
which point the follow_page() will do the dirty-fault thingy.

Ah, but then we can clean it because we're not holding the page-lock. I
see.

> The "cleanest" way to fix this from VM point of view is probably to
> force gup callers to hold the page locked for the duration to
> prevent truncation or writeout after the filesystem notification.
> Don't know if that would be very popular, however.

Right, so you'd want to keep the page locked over gup(.write=1)
sections.

So should we extend the gup() with put_user_page()?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 10:40                     ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra
@ 2009-02-17 11:25                       ` Nick Piggin
  2009-02-17 11:39                         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2009-02-17 11:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, Feb 17, 2009 at 11:40:00AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote:
> > On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:
> > > 
> > > > It is a great shame that filesystems are not properly notified
> > > > that a page may become dirty before the actual set_page_dirty
> > > > event (which is not allowed to fail and is called after the
> > > > page is already dirty).
> > > 
> > > Not quite true, for example the set_page_dirty() done by the write fault
> > > code is done _before_ the page becomes dirty.
> > > 
> > > This before/after thing was the reason for that horrid file corruption
> > > bug that dragged on for a few weeks back in .19 (IIRC).
> > 
> > Yeah, there are actually races though. The page can become cleaned
> > before set_page_dirty is reached, and there are also nasty races with
> > truncate.
> 
> Hmm, so you're saying that never got properly fixed?

For the purposes of dirty accounting and syncing they are OK I think.

The problem is with page_mkwrite which is just another thing in
the equation. Unfortunately it was introduced without a very
clear statement of what kind of locking and concurrency it was
expected, and a vauge promise that the filesystem would be able
to handle synchronisation with the pagecache and virtual memory
(yeah right).

I should hopefully get around to splitting up fsblock and
submitting some of the generic code changes (there aren't too
many but mainly giving filesystems better and and less racy
dirty page handling ability).


> > For XFS purpose (ie. -ENOSPC handling), the current code is reasonable
> > although there could be some truncate races with block allocation. But
> > mostly probably works. For something like fsblock it can be much more
> > common to have the metadata refcount reach 0 and freed before spd is
> > called. In that case the code actually goes into a bug situation so it
> > is a bit more critical.
> > 
> > But no that's the "easy" part. The hard part is get_user_pages
> > because the caller can hold onto the page indefinitely simply with a
> > refcount, and go along happily dirtying it at any stage (actually
> > writing to the page memory) before actually calling set_page_dirty.
> 
> Should a gup user not specify .write=1 if it wants to dirty the page, at
> which point the follow_page() will do the dirty-fault thingy.
> 
> Ah, but then we can clean it because we're not holding the page-lock. I
> see.

Yep.

 
> > The "cleanest" way to fix this from VM point of view is probably to
> > force gup callers to hold the page locked for the duration to
> > prevent truncation or writeout after the filesystem notification.
> > Don't know if that would be very popular, however.
> 
> Right, so you'd want to keep the page locked over gup(.write=1)
> sections.

Something like that. lock_page might cause complaints ;) But
that appears to be the easiest and cleanest way, so I don't
want to hurt my brain thinking of something better yet! 


> So should we extend the gup() with put_user_page()?

put_user_pages simply defined to do the put_page thing for now would
yes allow callers to slowly migrate over and perhaps one day allow a
solution. At least it would make things more flexible in case any
other issues arise in future.

I had a patch... can't find it now. I don't think I'd done many
driver conversions. Here is another one to kick it off. Thoughts?
--

Introduce put_user_pages function.

In order to have more flexibility to deal with issues surrounding
get_user_pages difficulties[*], introduce put_user_pages function
intended to release pages acquired by get_user_pages. For now, just
do the regular put_page thing. If all callers are converted, it could
be used to help with such races. In the meantime, it will actually
serve as a small extra piece of documentation for the code.

[*] eg. get_user_pages caller can bypass page_mkwrite calls into the
    filesystem to notify of page dirty activity if the page gets cleaned
    before the caller calls its final set_page_dirty).

---
 include/linux/mm.h |    1 +
 mm/memory.c        |   13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -826,6 +826,7 @@ extern int access_process_vm(struct task
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
 
+void put_user_pages(struct page **pages, int nr);
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t
 				start, len, flags,
 				pages, vmas);
 }
-
 EXPORT_SYMBOL(get_user_pages);
 
+/*
+ * put_user_pages should be used to release pages acquired with get_user_pages.
+ */
+void put_user_pages(struct page **pages, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		put_page(pages[i]);
+}
+EXPORT_SYMBOL(put_user_pages);
+
 pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 			spinlock_t **ptl)
 {


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 11:25                       ` Nick Piggin
@ 2009-02-17 11:39                         ` Peter Zijlstra
  2009-02-17 11:55                           ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-17 11:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote:

> Introduce put_user_pages function.
> 
> In order to have more flexibility to deal with issues surrounding
> get_user_pages difficulties[*], introduce put_user_pages function
> intended to release pages acquired by get_user_pages. For now, just
> do the regular put_page thing. If all callers are converted, it could
> be used to help with such races. In the meantime, it will actually
> serve as a small extra piece of documentation for the code.
> 
> [*] eg. get_user_pages caller can bypass page_mkwrite calls into the
>     filesystem to notify of page dirty activity if the page gets cleaned
>     before the caller calls its final set_page_dirty).

Hmm, if we want to distinguish between .write=1 and .write=0, we would
have to pass .write to pup too, right?

> ---
>  include/linux/mm.h |    1 +
>  mm/memory.c        |   13 ++++++++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -826,6 +826,7 @@ extern int access_process_vm(struct task
>  int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
>  		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
>  
> +void put_user_pages(struct page **pages, int nr);
>  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>  extern void do_invalidatepage(struct page *page, unsigned long offset);
>  
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t
>  				start, len, flags,
>  				pages, vmas);
>  }
> -
>  EXPORT_SYMBOL(get_user_pages);
>  
> +/*
> + * put_user_pages should be used to release pages acquired with get_user_pages.
> + */
> +void put_user_pages(struct page **pages, int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++)
> +		put_page(pages[i]);
> +}
> +EXPORT_SYMBOL(put_user_pages);
> +
>  pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
>  			spinlock_t **ptl)
>  {
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 11:39                         ` Peter Zijlstra
@ 2009-02-17 11:55                           ` Nick Piggin
  2009-02-17 12:05                             ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2009-02-17 11:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, Feb 17, 2009 at 12:39:32PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote:
> 
> > Introduce put_user_pages function.
> > 
> > In order to have more flexibility to deal with issues surrounding
> > get_user_pages difficulties[*], introduce put_user_pages function
> > intended to release pages acquired by get_user_pages. For now, just
> > do the regular put_page thing. If all callers are converted, it could
> > be used to help with such races. In the meantime, it will actually
> > serve as a small extra piece of documentation for the code.
> > 
> > [*] eg. get_user_pages caller can bypass page_mkwrite calls into the
> >     filesystem to notify of page dirty activity if the page gets cleaned
> >     before the caller calls its final set_page_dirty).
> 
> Hmm, if we want to distinguish between .write=1 and .write=0, we would
> have to pass .write to pup too, right?

Doh, yeah. I hand edited the patch to put that parameter in, but quilt
refresh must have outsmarted me!

If nobody thinks it is insane, I'll resend to Andrew in a new thread.

> > ---
> >  include/linux/mm.h |    1 +
> >  mm/memory.c        |   13 ++++++++++++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h
> > +++ linux-2.6/include/linux/mm.h
> > @@ -826,6 +826,7 @@ extern int access_process_vm(struct task
> >  int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
> >  		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
> >  
> > +void put_user_pages(struct page **pages, int nr);
> >  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
> >  extern void do_invalidatepage(struct page *page, unsigned long offset);
> >  
> > Index: linux-2.6/mm/memory.c
> > ===================================================================
> > --- linux-2.6.orig/mm/memory.c
> > +++ linux-2.6/mm/memory.c
> > @@ -1370,9 +1370,20 @@ int get_user_pages(struct task_struct *t
> >  				start, len, flags,
> >  				pages, vmas);
> >  }
> > -
> >  EXPORT_SYMBOL(get_user_pages);
> >  
> > +/*
> > + * put_user_pages should be used to release pages acquired with get_user_pages.
> > + */
> > +void put_user_pages(struct page **pages, int nr)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++)
> > +		put_page(pages[i]);
> > +}
> > +EXPORT_SYMBOL(put_user_pages);
> > +
> >  pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
> >  			spinlock_t **ptl)
> >  {
> > 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 11:55                           ` Nick Piggin
@ 2009-02-17 12:05                             ` Peter Zijlstra
  2009-02-17 12:30                               ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2009-02-17 12:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, 2009-02-17 at 12:55 +0100, Nick Piggin wrote:
> On Tue, Feb 17, 2009 at 12:39:32PM +0100, Peter Zijlstra wrote:
> > On Tue, 2009-02-17 at 12:25 +0100, Nick Piggin wrote:
> > 
> > > Introduce put_user_pages function.
> > > 
> > > In order to have more flexibility to deal with issues surrounding
> > > get_user_pages difficulties[*], introduce put_user_pages function
> > > intended to release pages acquired by get_user_pages. For now, just
> > > do the regular put_page thing. If all callers are converted, it could
> > > be used to help with such races. In the meantime, it will actually
> > > serve as a small extra piece of documentation for the code.
> > > 
> > > [*] eg. get_user_pages caller can bypass page_mkwrite calls into the
> > >     filesystem to notify of page dirty activity if the page gets cleaned
> > >     before the caller calls its final set_page_dirty).
> > 
> > Hmm, if we want to distinguish between .write=1 and .write=0, we would
> > have to pass .write to pup too, right?
> 
> Doh, yeah. I hand edited the patch to put that parameter in, but quilt
> refresh must have outsmarted me!
> 
> If nobody thinks it is insane, I'll resend to Andrew in a new thread.

Right, gup_fast() seems to also respect .write properly, so it would
also be used to balance that.

I guess gup_fast() would need to use trylock_page(), and fall back to
the slow path when we start taking PG_locked on .write.

I suppose we should start converting a few gup users over to pup before
handing the thing to Andrew, to have at least a few examples in-kernel.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag)
  2009-02-17 12:05                             ` Peter Zijlstra
@ 2009-02-17 12:30                               ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2009-02-17 12:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Edward Shishkin, Andrew Morton, Ryan Hope, Randy Dunlap,
	linux-kernel, ReiserFS Mailing List

On Tue, Feb 17, 2009 at 01:05:34PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 12:55 +0100, Nick Piggin wrote:
> > If nobody thinks it is insane, I'll resend to Andrew in a new thread.
> 
> Right, gup_fast() seems to also respect .write properly,

Phew! :)

> so it would
> also be used to balance that.
> 
> I guess gup_fast() would need to use trylock_page(), and fall back to
> the slow path when we start taking PG_locked on .write.

Yeah, you're right there. It might also be possible to have a flag
somewhere to avoid the lock if the underlying filesystem doesn't
have a page_mkwrite or doesn't account dirty... which could avoid
the overhead for the common case of anonymous or tmpfs memory.

For gup_fast that pretty much implies an extra page flag I think. But
let's not get too worried with details... I don't think put_user_pages
hurts, even if it only remains as put_page loop. Just to help reader
through the page refcounting. 
 
> I suppose we should start converting a few gup users over to pup before
> handing the thing to Andrew, to have at least a few examples in-kernel.

Could do. There are quite a few easy ones.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-16 22:43           ` Edward Shishkin
@ 2009-02-17 22:35               ` Andrew Morton
  2009-02-17 22:35               ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2009-02-17 22:35 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: peterz, edward.shishkin, npiggin, rmh3093, randy.dunlap,
	linux-kernel, reiserfs-devel

On Tue, 17 Feb 2009 01:43:28 +0300
Edward Shishkin <edward.shishkin@gmail.com> wrote:

> + */
> +int set_page_dirty_notag(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (!TestSetPageDirty(page)) {
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		if (mapping_cap_account_dirty(mapping)) {
> +		        /*
> +			 * The accounting functions rely on
> +			 * being atomic wrt interrupts.
> +			 */
> +		        unsigned long flags;
> +			local_irq_save(flags);
> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> +			__inc_bdi_stat(mapping->backing_dev_info,
> +				       BDI_RECLAIMABLE);
> +			task_dirty_inc(current);
> +			task_io_account_write(PAGE_CACHE_SIZE);
> +			local_irq_restore(flags);
> +		}
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +		return 1;
> +	}
> +	return 0;
> +}

I'll maintain this in -mm, alongside the resier4 patches which need it.

Of course, this rather obviates the purpose - if someone changes, say,
__set_page_dirty_nobuffers() then they won't similarly update
set_page_dirty_notag().  Oh well.

This problem would fix itself if those two functions were to
substantially share code.  And I think we can do that - something like

static void
update_page_accounting(struct page *page, struct address_space *mapping)
{
	__inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	task_dirty_inc(current);
	task_io_account_write(PAGE_CACHE_SIZE);
}

maybe?

We could do that as a separate patch and merge it into mainline - it
should have zero impact on code generation if gcc does the right thing
(please check this).


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
@ 2009-02-17 22:35               ` Andrew Morton
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2009-02-17 22:35 UTC (permalink / raw)
  Cc: peterz, edward.shishkin, npiggin, rmh3093, randy.dunlap,
	linux-kernel, reiserfs-devel

On Tue, 17 Feb 2009 01:43:28 +0300
Edward Shishkin <edward.shishkin@gmail.com> wrote:

> + */
> +int set_page_dirty_notag(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +
> +	if (!TestSetPageDirty(page)) {
> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> +		if (mapping_cap_account_dirty(mapping)) {
> +		        /*
> +			 * The accounting functions rely on
> +			 * being atomic wrt interrupts.
> +			 */
> +		        unsigned long flags;
> +			local_irq_save(flags);
> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> +			__inc_bdi_stat(mapping->backing_dev_info,
> +				       BDI_RECLAIMABLE);
> +			task_dirty_inc(current);
> +			task_io_account_write(PAGE_CACHE_SIZE);
> +			local_irq_restore(flags);
> +		}
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +		return 1;
> +	}
> +	return 0;
> +}

I'll maintain this in -mm, alongside the resier4 patches which need it.

Of course, this rather obviates the purpose - if someone changes, say,
__set_page_dirty_nobuffers() then they won't similarly update
set_page_dirty_notag().  Oh well.

This problem would fix itself if those two functions were to
substantially share code.  And I think we can do that - something like

static void
update_page_accounting(struct page *page, struct address_space *mapping)
{
	__inc_zone_page_state(page, NR_FILE_DIRTY);
	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	task_dirty_inc(current);
	task_io_account_write(PAGE_CACHE_SIZE);
}

maybe?

We could do that as a separate patch and merge it into mainline - it
should have zero impact on code generation if gcc does the right thing
(please check this).


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-17 22:35               ` Andrew Morton
  (?)
@ 2009-02-18  0:26               ` Edward Shishkin
  2009-02-18  0:38                 ` Andrew Morton
  -1 siblings, 1 reply; 25+ messages in thread
From: Edward Shishkin @ 2009-02-18  0:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel

Andrew Morton wrote:
> On Tue, 17 Feb 2009 01:43:28 +0300
> Edward Shishkin <edward.shishkin@gmail.com> wrote:
>
>   
>> + */
>> +int set_page_dirty_notag(struct page *page)
>> +{
>> +	struct address_space *mapping = page->mapping;
>> +
>> +	if (!TestSetPageDirty(page)) {
>> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
>> +		if (mapping_cap_account_dirty(mapping)) {
>> +		        /*
>> +			 * The accounting functions rely on
>> +			 * being atomic wrt interrupts.
>> +			 */
>> +		        unsigned long flags;
>> +			local_irq_save(flags);
>> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
>> +			__inc_bdi_stat(mapping->backing_dev_info,
>> +				       BDI_RECLAIMABLE);
>> +			task_dirty_inc(current);
>> +			task_io_account_write(PAGE_CACHE_SIZE);
>> +			local_irq_restore(flags);
>> +		}
>> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>>     
>
> I'll maintain this in -mm, alongside the resier4 patches which need it.
>
> Of course, this rather obviates the purpose - if someone changes, say,
> __set_page_dirty_nobuffers() then they won't similarly update
> set_page_dirty_notag().  Oh well.
>
> This problem would fix itself if those two functions were to
> substantially share code.

It will fix the problem only partially:
there is one more friend __set_page_dirty() in buffer.c

Maybe it makes sense to add comments with warnings
in all such places, or create a header file with a static inline
function update_page_accounting() ?

>   And I think we can do that - something like
>
> static void
> update_page_accounting(struct page *page, struct address_space *mapping)
> {
> 	__inc_zone_page_state(page, NR_FILE_DIRTY);
> 	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> 	task_dirty_inc(current);
> 	task_io_account_write(PAGE_CACHE_SIZE);
> }
>
> maybe?
>
> We could do that as a separate patch and merge it into mainline - it
> should have zero impact on code generation if gcc does the right thing
> (please check this).
>
>
>   


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/4] vfs: add set_page_dirty_notag
  2009-02-18  0:26               ` Edward Shishkin
@ 2009-02-18  0:38                 ` Andrew Morton
  2009-02-18 13:27                   ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin
  2009-02-18 13:27                   ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin
  0 siblings, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2009-02-18  0:38 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: peterz, npiggin, rmh3093, randy.dunlap, linux-kernel, reiserfs-devel

On Wed, 18 Feb 2009 03:26:16 +0300
Edward Shishkin <edward.shishkin@gmail.com> wrote:

> Andrew Morton wrote:
> > On Tue, 17 Feb 2009 01:43:28 +0300
> > Edward Shishkin <edward.shishkin@gmail.com> wrote:
> >
> >   
> >> + */
> >> +int set_page_dirty_notag(struct page *page)
> >> +{
> >> +	struct address_space *mapping = page->mapping;
> >> +
> >> +	if (!TestSetPageDirty(page)) {
> >> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> >> +		if (mapping_cap_account_dirty(mapping)) {
> >> +		        /*
> >> +			 * The accounting functions rely on
> >> +			 * being atomic wrt interrupts.
> >> +			 */
> >> +		        unsigned long flags;
> >> +			local_irq_save(flags);
> >> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
> >> +			__inc_bdi_stat(mapping->backing_dev_info,
> >> +				       BDI_RECLAIMABLE);
> >> +			task_dirty_inc(current);
> >> +			task_io_account_write(PAGE_CACHE_SIZE);
> >> +			local_irq_restore(flags);
> >> +		}
> >> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >> +		return 1;
> >> +	}
> >> +	return 0;
> >> +}
> >>     
> >
> > I'll maintain this in -mm, alongside the resier4 patches which need it.
> >
> > Of course, this rather obviates the purpose - if someone changes, say,
> > __set_page_dirty_nobuffers() then they won't similarly update
> > set_page_dirty_notag().  Oh well.
> >
> > This problem would fix itself if those two functions were to
> > substantially share code.
> 
> It will fix the problem only partially:
> there is one more friend __set_page_dirty() in buffer.c

But all three functions do the same thing?

			if (mapping_cap_account_dirty(mapping)) {
				__inc_zone_page_state(page, NR_FILE_DIRTY);
				__inc_bdi_stat(mapping->backing_dev_info,
						BDI_RECLAIMABLE);
				task_dirty_inc(current);
				task_io_account_write(PAGE_CACHE_SIZE);
			}

?

> Maybe it makes sense to add comments with warnings
> in all such places, or create a header file with a static inline
> function update_page_accounting() ?

Could just uninline the helper function I guess - if you look, those
four statements already involve doing a heck of a lot of stuff.

Try it, see how it looks?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 1/2] vfs: add/use update_page_accounting
  2009-02-18  0:38                 ` Andrew Morton
@ 2009-02-18 13:27                   ` Edward Shishkin
  2009-02-18 14:06                     ` Nick Piggin
  2009-02-18 13:27                   ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin
  1 sibling, 1 reply; 25+ messages in thread
From: Edward Shishkin @ 2009-02-18 13:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Edward Shishkin, peterz, npiggin, rmh3093, randy.dunlap,
	linux-kernel, reiserfs-devel

Andrew Morton writes:
 > On Wed, 18 Feb 2009 03:26:16 +0300
 > Edward Shishkin <edward.shishkin@gmail.com> wrote:
 > 
 > > Andrew Morton wrote:
 > > > On Tue, 17 Feb 2009 01:43:28 +0300
 > > > Edward Shishkin <edward.shishkin@gmail.com> wrote:
 > > >
 > > >   
 > > >> + */
 > > >> +int set_page_dirty_notag(struct page *page)
 > > >> +{
 > > >> +	struct address_space *mapping = page->mapping;
 > > >> +
 > > >> +	if (!TestSetPageDirty(page)) {
 > > >> +		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 > > >> +		if (mapping_cap_account_dirty(mapping)) {
 > > >> +		        /*
 > > >> +			 * The accounting functions rely on
 > > >> +			 * being atomic wrt interrupts.
 > > >> +			 */
 > > >> +		        unsigned long flags;
 > > >> +			local_irq_save(flags);
 > > >> +			__inc_zone_page_state(page, NR_FILE_DIRTY);
 > > >> +			__inc_bdi_stat(mapping->backing_dev_info,
 > > >> +				       BDI_RECLAIMABLE);
 > > >> +			task_dirty_inc(current);
 > > >> +			task_io_account_write(PAGE_CACHE_SIZE);
 > > >> +			local_irq_restore(flags);
 > > >> +		}
 > > >> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 > > >> +		return 1;
 > > >> +	}
 > > >> +	return 0;
 > > >> +}
 > > >>     
 > > >
 > > > I'll maintain this in -mm, alongside the resier4 patches which need it.
 > > >
 > > > Of course, this rather obviates the purpose - if someone changes, say,
 > > > __set_page_dirty_nobuffers() then they won't similarly update
 > > > set_page_dirty_notag().  Oh well.
 > > >
 > > > This problem would fix itself if those two functions were to
 > > > substantially share code.
 > > 
 > > It will fix the problem only partially:
 > > there is one more friend __set_page_dirty() in buffer.c
 > 
 > But all three functions do the same thing?
 > 
 > 			if (mapping_cap_account_dirty(mapping)) {
 > 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 > 				__inc_bdi_stat(mapping->backing_dev_info,
 > 						BDI_RECLAIMABLE);
 > 				task_dirty_inc(current);
 > 				task_io_account_write(PAGE_CACHE_SIZE);
 > 			}
 > 
 > ?

Yup, exactly the same.

 > 
 > > Maybe it makes sense to add comments with warnings
 > > in all such places, or create a header file with a static inline
 > > function update_page_accounting() ?
 > 
 > Could just uninline the helper function I guess - if you look, those
 > four statements already involve doing a heck of a lot of stuff.
 > 
 > Try it, see how it looks?
 > 

Done.
Please, review.

Add/use a helper function update_page_accounting().

Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
 fs/buffer.c         |    9 +--------
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   22 +++++++++++++++-------
 3 files changed, 17 insertions(+), 15 deletions(-)

--- mmotm.orig/fs/buffer.c
+++ mmotm/fs/buffer.c
@@ -803,14 +803,7 @@ static int __set_page_dirty(struct page 
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
-
-		if (mapping_cap_account_dirty(mapping)) {
-			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			__inc_bdi_stat(mapping->backing_dev_info,
-					BDI_RECLAIMABLE);
-			task_dirty_inc(current);
-			task_io_account_write(PAGE_CACHE_SIZE);
-		}
+		update_page_accounting(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -839,6 +839,7 @@ int __set_page_dirty_nobuffers(struct pa
 int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
+void update_page_accounting(struct page *page, struct address_space *mapping);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1198,6 +1198,20 @@ int __set_page_dirty_no_writeback(struct
 }
 
 /*
+ * Helper function for set_page_dirty family.
+ * NOTE: This relies on being atomic wrt interrupts.
+ */
+void update_page_accounting(struct page *page, struct address_space *mapping)
+{
+	if (mapping_cap_account_dirty(mapping)) {
+		__inc_zone_page_state(page, NR_FILE_DIRTY);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		task_dirty_inc(current);
+		task_io_account_write(PAGE_CACHE_SIZE);
+	}
+}
+
+/*
  * For address_spaces which do not use buffers.  Just tag the page as dirty in
  * its radix tree.
  *
@@ -1226,13 +1240,7 @@ int __set_page_dirty_nobuffers(struct pa
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-			if (mapping_cap_account_dirty(mapping)) {
-				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				__inc_bdi_stat(mapping->backing_dev_info,
-						BDI_RECLAIMABLE);
-				task_dirty_inc(current);
-				task_io_account_write(PAGE_CACHE_SIZE);
-			}
+			update_page_accounting(page, mapping);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 2/2] vfs: (take 2)add set_page_dirty_notag
  2009-02-18  0:38                 ` Andrew Morton
  2009-02-18 13:27                   ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin
@ 2009-02-18 13:27                   ` Edward Shishkin
  1 sibling, 0 replies; 25+ messages in thread
From: Edward Shishkin @ 2009-02-18 13:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Edward Shishkin, peterz, npiggin, rmh3093, randy.dunlap,
	linux-kernel, reiserfs-devel

In accordance with reiser4 transactional model every dirty page should be
"captured" by some atom.  However, outside reiser4 context dirty page can
not be captured in some cases, as it is accompanied with specific work
(jnode creation, etc).  Reiser4 recognizes such "anonymous" pages (i.e. 
pages that were dirtied outside of reiser4) by the tag
PAGECACHE_TAG_DIRTY.  Pages dirtied inside reiser4 context are not tagged
at all: we don't need this.  Indeed, once page is dirtied and captured, it
is attached to a jnode (a special header to keep a track of transactions).

reiser4_set_page_dirty_internal() was the internal reiser4 function that
set dirty bit without tagging the page.  Having such internal function led
to real problems (incorrect task io accounting, etc.  because of not
updating this internal "friend").

Solution:

The following patch adds a core library function that sets a dirty bit
without tagging the page.

Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
---
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

--- mmotm.orig/include/linux/mm.h
+++ mmotm/include/linux/mm.h
@@ -842,6 +842,7 @@ int redirty_page_for_writepage(struct wr
 void update_page_accounting(struct page *page, struct address_space *mapping);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_notag(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
--- mmotm.orig/mm/page-writeback.c
+++ mmotm/mm/page-writeback.c
@@ -1256,6 +1256,32 @@ int __set_page_dirty_nobuffers(struct pa
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
+ * set_page_dirty_notag() -- similar to __set_page_dirty_nobuffers()
+ * except it doesn't tag the page dirty in the page-cache radix tree.
+ * This means that the address space using this cannot use the regular
+ * filemap ->writepages() helpers and must provide its own means of
+ * tracking and finding non-tagged dirty pages.
+ *
+ * NOTE: furthermore, this version also doesn't handle truncate races.
+ */
+int set_page_dirty_notag(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+
+	if (!TestSetPageDirty(page)) {
+		unsigned long flags;
+		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+		local_irq_save(flags);
+		update_page_accounting(page, mapping);
+		local_irq_restore(flags);
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty_notag);
+
+/*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
  * redirty_page_for_writepage() and it should then unlock the page and return 0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 1/2] vfs: add/use update_page_accounting
  2009-02-18 13:27                   ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin
@ 2009-02-18 14:06                     ` Nick Piggin
  2009-02-18 18:23                       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2009-02-18 14:06 UTC (permalink / raw)
  To: Edward Shishkin
  Cc: Andrew Morton, peterz, rmh3093, randy.dunlap, linux-kernel,
	reiserfs-devel

On Wed, Feb 18, 2009 at 04:27:02PM +0300, Edward Shishkin wrote:
>  > > Maybe it makes sense to add comments with warnings
>  > > in all such places, or create a header file with a static inline
>  > > function update_page_accounting() ?
>  > 
>  > Could just uninline the helper function I guess - if you look, those
>  > four statements already involve doing a heck of a lot of stuff.
>  > 
>  > Try it, see how it looks?
>  > 
> 
> Done.
> Please, review.
> 
> Add/use a helper function update_page_accounting().

Fine patch, except the name I don't like. account_set_page_dirty, or
account_page_dirtied, or something to hint it is for accounting
dirty.

 
> Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
> ---
>  fs/buffer.c         |    9 +--------
>  include/linux/mm.h  |    1 +
>  mm/page-writeback.c |   22 +++++++++++++++-------
>  3 files changed, 17 insertions(+), 15 deletions(-)
> 
> --- mmotm.orig/fs/buffer.c
> +++ mmotm/fs/buffer.c
> @@ -803,14 +803,7 @@ static int __set_page_dirty(struct page 
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
> -
> -		if (mapping_cap_account_dirty(mapping)) {
> -			__inc_zone_page_state(page, NR_FILE_DIRTY);
> -			__inc_bdi_stat(mapping->backing_dev_info,
> -					BDI_RECLAIMABLE);
> -			task_dirty_inc(current);
> -			task_io_account_write(PAGE_CACHE_SIZE);
> -		}
> +		update_page_accounting(page, mapping);
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  	}
> --- mmotm.orig/include/linux/mm.h
> +++ mmotm/include/linux/mm.h
> @@ -839,6 +839,7 @@ int __set_page_dirty_nobuffers(struct pa
>  int __set_page_dirty_no_writeback(struct page *page);
>  int redirty_page_for_writepage(struct writeback_control *wbc,
>  				struct page *page);
> +void update_page_accounting(struct page *page, struct address_space *mapping);
>  int set_page_dirty(struct page *page);
>  int set_page_dirty_lock(struct page *page);
>  int clear_page_dirty_for_io(struct page *page);
> --- mmotm.orig/mm/page-writeback.c
> +++ mmotm/mm/page-writeback.c
> @@ -1198,6 +1198,20 @@ int __set_page_dirty_no_writeback(struct
>  }
>  
>  /*
> + * Helper function for set_page_dirty family.
> + * NOTE: This relies on being atomic wrt interrupts.
> + */
> +void update_page_accounting(struct page *page, struct address_space *mapping)
> +{
> +	if (mapping_cap_account_dirty(mapping)) {
> +		__inc_zone_page_state(page, NR_FILE_DIRTY);
> +		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +		task_dirty_inc(current);
> +		task_io_account_write(PAGE_CACHE_SIZE);
> +	}
> +}
> +
> +/*
>   * For address_spaces which do not use buffers.  Just tag the page as dirty in
>   * its radix tree.
>   *
> @@ -1226,13 +1240,7 @@ int __set_page_dirty_nobuffers(struct pa
>  		if (mapping2) { /* Race with truncate? */
>  			BUG_ON(mapping2 != mapping);
>  			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> -			if (mapping_cap_account_dirty(mapping)) {
> -				__inc_zone_page_state(page, NR_FILE_DIRTY);
> -				__inc_bdi_stat(mapping->backing_dev_info,
> -						BDI_RECLAIMABLE);
> -				task_dirty_inc(current);
> -				task_io_account_write(PAGE_CACHE_SIZE);
> -			}
> +			update_page_accounting(page, mapping);
>  			radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  		}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 1/2] vfs: add/use update_page_accounting
  2009-02-18 14:06                     ` Nick Piggin
@ 2009-02-18 18:23                       ` Andrew Morton
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2009-02-18 18:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Edward Shishkin, peterz, rmh3093, randy.dunlap, linux-kernel,
	reiserfs-devel

On Wed, 18 Feb 2009 15:06:26 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Feb 18, 2009 at 04:27:02PM +0300, Edward Shishkin wrote:
> >  > > Maybe it makes sense to add comments with warnings
> >  > > in all such places, or create a header file with a static inline
> >  > > function update_page_accounting() ?
> >  > 
> >  > Could just uninline the helper function I guess - if you look, those
> >  > four statements already involve doing a heck of a lot of stuff.
> >  > 
> >  > Try it, see how it looks?
> >  > 
> > 
> > Done.
> > Please, review.
> > 
> > Add/use a helper function update_page_accounting().
> 
> Fine patch, except the name I don't like. account_set_page_dirty, or
> account_page_dirtied, or something to hint it is for accounting
> dirty.

yep, sorry, I didn't think too hard about that for-example.

I'll edit the diffs..

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2009-02-18 18:24 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-13 11:56 [patch 2/4] vfs: add set_page_dirty_notag Edward Shishkin
2009-02-13 13:08 ` Peter Zijlstra
2009-02-13 13:57   ` Edward Shishkin
2009-02-13 14:09     ` Peter Zijlstra
2009-02-14 13:11       ` Edward Shishkin
2009-02-14 21:11         ` Peter Zijlstra
2009-02-16 22:43           ` Edward Shishkin
2009-02-17  9:09             ` Peter Zijlstra
2009-02-17  9:38               ` Nick Piggin
2009-02-17 10:05                 ` Peter Zijlstra
2009-02-17 10:24                   ` Nick Piggin
2009-02-17 10:40                     ` set_page_dirty races (was: Re: [patch 2/4] vfs: add set_page_dirty_notag) Peter Zijlstra
2009-02-17 11:25                       ` Nick Piggin
2009-02-17 11:39                         ` Peter Zijlstra
2009-02-17 11:55                           ` Nick Piggin
2009-02-17 12:05                             ` Peter Zijlstra
2009-02-17 12:30                               ` Nick Piggin
2009-02-17 22:35             ` [patch 2/4] vfs: add set_page_dirty_notag Andrew Morton
2009-02-17 22:35               ` Andrew Morton
2009-02-18  0:26               ` Edward Shishkin
2009-02-18  0:38                 ` Andrew Morton
2009-02-18 13:27                   ` [patch 1/2] vfs: add/use update_page_accounting Edward Shishkin
2009-02-18 14:06                     ` Nick Piggin
2009-02-18 18:23                       ` Andrew Morton
2009-02-18 13:27                   ` [patch 2/2] vfs: (take 2)add set_page_dirty_notag Edward Shishkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.