* No one seems to be using AOP_WRITEPAGE_ACTIVATE?
From: Theodore Ts'o @ 2010-04-25  2:40 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: linux-ext4, linux-btrfs


I happened to be going through the source code for write_cache_pages(),
and I came across a reference to AOP_WRITEPAGE_ACTIVATE.  I was curious
what the heck that was, so I searched for it, and found this in
Documentation/filesystems/vfs.txt:

      If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
      try too hard if there are problems, and may choose to write out
      other pages from the mapping if that is easier (e.g. due to
      internal dependencies).  If it chooses not to start writeout, it
      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
      calling ->writepage on that page.

      See the file "Locking" for more details.

No filesystem currently returns AOP_WRITEPAGE_ACTIVATE when it chooses
not to write out a page; they all call redirty_page_for_writepage()
instead.

Is this a change we should make, for example when btrfs refuses a
writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
if the page involved hasn't been allocated an on-disk block yet (i.e.,
delayed allocation)?  The change would be to call
redirty_page_for_writepage() as before, but then _not_ unlock the page,
and return AOP_WRITEPAGE_ACTIVATE.  Is this a good and useful thing for
us to do?
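
For concreteness, here is a minimal sketch (mine, not code from any
filesystem) of the two refusal patterns being compared, against the
2.6.3x-era APIs under discussion; the condition for refusing is elided:

    #include <linux/fs.h>
    #include <linux/pagemap.h>
    #include <linux/writeback.h>

    /*
     * Two hypothetical helpers a ->writepage() could call when it has
     * decided, under WB_SYNC_NONE, not to write the page right now.
     */

    /* What ext4/btrfs effectively do today: redirty, unlock, return 0. */
    static int refuse_by_redirty(struct page *page,
                                 struct writeback_control *wbc)
    {
            redirty_page_for_writepage(wbc, page);
            unlock_page(page);
            return 0;
    }

    /*
     * What vfs.txt suggests: redirty, keep the page locked, and tell
     * the VM to move the page to the active list; the caller then
     * deals with the still-locked page.
     */
    static int refuse_by_activate(struct page *page,
                                  struct writeback_control *wbc)
    {
            redirty_page_for_writepage(wbc, page);
            return AOP_WRITEPAGE_ACTIVATE;
    }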

Right now, the only writepage() function that returns
AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
using redirty_page_for_writepage().  Should it, for consistency's sake
if not to keep various zone accounting straight?

There are some longer-term issues, including the fact that ext4 and
btrfs are violating some of the rules laid out in
Documentation/filesystems/Locking regarding what writepage() is supposed to do
under direct reclaim -- something which isn't going to be practical for
us to change on the file-system side, at least not without doing some
pretty nasty and serious rework, for both ext4 and I suspect btrfs.  But
if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
gracefully with the fact that ext4 and btrfs will be refusing
writepage() calls under certain conditions, maybe we should make this
change?

						- Ted


* Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
From: KOSAKI Motohiro @ 2010-04-26 10:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: kosaki.motohiro, linux-mm, linux-fsdevel, linux-ext4,
	linux-btrfs, Hugh Dickins

Hi Ted


> I happened to be going through the source code for write_cache_pages(),
> and I came across a reference to AOP_WRITEPAGE_ACTIVATE.  I was curious
> what the heck that was, so I did search for it, and found this in
> Documentation/filesystems/vfs.txt:
> 
>       If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>       try too hard if there are problems, and may choose to write out
>       other pages from the mapping if that is easier (e.g. due to
>       internal dependencies).  If it chooses not to start writeout, it
>       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
>       calling ->writepage on that page.
> 
>       See the file "Locking" for more details.
> 
> No filesystems are currently returning AOP_WRITEPAGE_ACTIVATE when it
> chooses not to writeout page and call redirty_page_for_writeback()
> instead.
> 
> Is this a change we should make, for example when btrfs refuses a
> writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
> if the page involved hasn't been allocated an on-disk block yet (i.e.,
> delayed allocation)?  The change seems to be that we should call
> redirty_page_for_writeback() as before, but then _not_ unlock the page,
> and return AOP_WRITEPAGE_ACTIVATE.  Is this a good and useful thing for
> us to do?

Sorry, no.

AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs cases
(and ramdisk later switched to another approach).  It assumes that
writepage refusals don't happen for the majority of pages; in other
words, the VM assumes that plenty of other pages can be written out
even though this one can't.  The VM therefore only activates the page
when AOP_WRITEPAGE_ACTIVATE is returned.  But now ext4 and btrfs
refuse all writepage() calls, right?

In other words, I don't think that documentation had the delayed
allocation issue in mind ;)
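
For reference, the reclaim-side handling in question looks roughly like
this (a paraphrased sketch of pageout() in mm/vmscan.c from around
2.6.33, not verbatim kernel code):

    /*
     * Paraphrased sketch of mm/vmscan.c:pageout(): the only effect of
     * AOP_WRITEPAGE_ACTIVATE is to move this one page to the active
     * list, which only helps if most *other* pages can still be
     * written out.
     */
    static pageout_t pageout_sketch(struct page *page,
                                    struct address_space *mapping)
    {
            struct writeback_control wbc = {
                    .sync_mode      = WB_SYNC_NONE,
                    .nr_to_write    = SWAP_CLUSTER_MAX,
                    .for_reclaim    = 1,
            };
            int res;

            if (!clear_page_dirty_for_io(page))
                    return PAGE_CLEAN;

            SetPageReclaim(page);
            res = mapping->a_ops->writepage(page, &wbc);
            if (res == AOP_WRITEPAGE_ACTIVATE) {
                    /* Page comes back still locked; the caller re-activates it. */
                    ClearPageReclaim(page);
                    return PAGE_ACTIVATE;
            }
            return PAGE_SUCCESS;
    }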

The point is, our dirty page accounting only tracks the per-system
dirty ratio and per-task dirty pages; it doesn't track a per-NUMA-node
or per-zone dirty ratio.  So refusing writepage() combined with fake
numa abuse can easily confuse our VM: if _all_ pages on a VM LRU list
(which is per-zone) are refused, page activation doesn't help, and it
can also lead to OOM.

And I'm sorry, but I have to say that, as far as the VM developers are
concerned, fake numa is not production-level quality yet.  AFAIK,
nobody has seriously tested our VM code in such an environment.
(linux/arch/x86/Kconfig says "This is only useful for debugging".)

	--------------------------------------------------------------
	config NUMA_EMU
	        bool "NUMA emulation"
	        depends on X86_64 && NUMA
	        ---help---
	          Enable NUMA emulation. A flat machine will be split
	          into virtual nodes when booted with "numa=fake=N", where N is the
	          number of nodes. This is only useful for debugging.


> 
> Right now, the only writepage() function which is returning
> AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
> using redirty_page_for_writeback().  Should it, out of consistency's
> sake if not to keep various zone accounting straight?

Umm, I don't know the reason; I've cc'd Hugh instead :)


> There are some longer-term issues, including the fact that ext4 and
> btrfs are violating some of the rules laid out in
> Documentation/vfs/Locking regarding what writepage() is supposed to do
> under direct reclaim -- something which isn't going to be practical for
> us to change on the file-system side, at least not without doing some
> pretty nasty and serious rework, for both ext4 and I suspect btrfs.  But
> if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
> gracefully with the fact that ext4 and btrfs will be refusing
> writepage() calls under certain conditions, maybe we should make this
> change?

I'm sorry again.  I'm pretty sure our VM also needs to change if we
want to solve your company's fake numa use case.  I think our VM is
still unfriendly to delayed allocation; we hadn't noticed the ext4
delayed allocation issue ;-)

So, I have two questions:
 - I really want to understand the ext4 delayed allocation issue.  Can
   you please point me at a URL that explains ext4's high-level design
   and behavior for delayed allocation?
 - If I understand correctly, creating a large number of fake numa
   nodes and running a simple dd should reproduce your issue, right?

My guess is that a fairly small VM patch can solve this issue (that's
only a guess, maybe yes, maybe no), but a correct understanding and a
correct way of testing are really necessary.  Please help.

* Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
From: Theodore Tso @ 2010-04-26 14:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-fsdevel, linux-ext4, linux-btrfs, Hugh Dickins


On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
> AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> (and later rd choosed to use another way).
> Then, It assume writepage refusing aren't happen on majority pages.
> IOW, the VM assume other many pages can writeout although the page can't.
> Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> but now ext4 and btrfs refuse all writepage(). (right?)

No, not exactly.  Btrfs refuses the writepage() in the direct reclaim case (i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone scanning.  I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it were just due to worrying about fs recursion issues, I assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses a writepage() if the inode blocks for the page haven't yet been allocated.  (Regardless of whether it's happening for direct reclaim or zone scanning.)  However, if the on-disk block has been assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage() (e.g., if this is an mmap of an already existing file, or if the space has been pre-allocated using fallocate()).  The reason for ext4's concern is lock ordering, although I'm investigating whether I can fix this.  If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting) and then drop the page lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?

> 
> IOW, I don't think such documentation suppose delayed allocation issue ;)
> 
> The point is, Our dirty page accounting only account per-system-memory
> dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
> nor per-zone dirty ratio. and then, to refuse write page and fake numa
> abusing can make confusing our vm easily. if _all_ pages in our VM LRU
> list (it's per-zone), page activation doesn't help. It also lead to OOM.
> 
> And I'm sorry. I have to say now all vm developers fake numa is not
> production level quority yet. afaik, nobody have seriously tested our
> vm code on such environment. (linux/arch/x86/Kconfig says "This is only 
> useful for debugging".)

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a red herring.  That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging.  So please ignore it; it's our local hack, and if it breaks, that's our problem.  More importantly, just two weeks ago I talked to someone in the financial sector who was testing out ext4 on an upstream kernel, without our hacks that force 128MB zones, and he ran into the ext4/OOM problem.  It involved Oracle pinning down 3G worth of pages and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4, and he had the same issue --- an OOM that I'm pretty sure was caused by the fact that ext4_writepage() was refusing the writepage() and most of the pages that weren't nailed down by Oracle were delalloc.  The same test scenario using ext3 worked just fine, of course.

Under normal cases it's not a problem since statistically there should be enough other pages in the system compared to the number of pages that are subject to delalloc, such that pages can usually get pushed out until the writeback code can get around to writing out the pages.   But in cases where the zones have been made artificially small, or you have a big program like Oracle pinning down a large number of pages, then of course we have problems. 

I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in Documentation/filesystems/Locking as something which MUST be used if writepage() is going to refuse a page.  And then I discovered that no one is actually using it.  So that's why I was asking whether the Locking documentation file is out of date, or whether all of the file systems are doing it wrong.

On a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation, ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but are rather keeping the page locked until after they call block_write_full_page(), because of concerns about truncate coming in and screwing things up.  But now, looking at Locking, it appears that set_page_writeback() is as good as the page lock for preventing the truncate code from coming in and screwing everything up?  It's not clear to me exactly what locking guarantees are provided against truncate by set_page_writeback().  And suppose we are writing out a whole cluster of pages, say 4MB worth; do we need to call set_page_writeback() on every single page in the cluster before we do the I/O to make sure things don't change out from under us?  (I'm pretty sure at least some of the other filesystems that submit huge numbers of pages using a bio, instead of 4k at a time like ext2/3/4, aren't calling set_page_writeback() on all of the pages first.)

Part of the problem is that the writeback Locking semantics aren't well documented, and where they are documented, it's not clear they are up to date --- and all of the file systems that are doing delayed allocation writeback are doing things slightly differently, or in some cases very differently.  (And even without delalloc, as I've pointed out, ext2/3 don't use set_page_writeback() --- if this is a MUST USE as implied by the Locking file, why didn't whoever added this requirement go in and modify common filesystems like ext2 and ext3 to use the set_page_writeback()/end_page_writeback() calls?)

I'm happy to change things in ext4; in fact I'm pretty sure ext4 probably isn't completely right here.   But it's not clear what "right" actually is, and when I look to see what protects writepage() racing with vmtruncate(), it's enough to give me a headache.  :-(    

Hence my question: wouldn't it be simpler if we simply added more high-level locking to prevent truncate from racing against writepage/writeback?

-- Ted


* Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
From: Chris Mason @ 2010-04-26 17:24 UTC (permalink / raw)
  To: Theodore Tso
  Cc: KOSAKI Motohiro, linux-mm, linux-fsdevel, linux-ext4,
	linux-btrfs, Hugh Dickins

On Mon, Apr 26, 2010 at 10:50:45AM -0400, Theodore Tso wrote:
> 
> On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> > (and later rd choosed to use another way).
> > Then, It assume writepage refusing aren't happen on majority pages.
> > IOW, the VM assume other many pages can writeout although the page can't.
> > Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> > but now ext4 and btrfs refuse all writepage(). (right?)
> 
> No, not exactly.   Btrfs refuses the writepage() in the direct reclaim
> cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the
> case of zone scanning.  I don't want to speak for Chris, but I assume
> it's due to stack depth concerns --- if it was just due to worrying
> about fs recursion issues, i assume all of the btrfs allocations could
> be done GFP_NOFS.
> 

Btrfs refuses all PF_MEMALLOC writepage() calls.  It will go ahead and
process a regular writepage(), but in practice that never happens:
everyone except a few internal btrfs callers uses writepages().

I wish I had thought of stack depth back then, but really this was to
keep kswapd out of the heavy work done by delalloc.  From a locking
point of view we're properly GFP_NOFS, so it's safe, but it just isn't a
great way to use precious PF_MEMALLOC cycles.
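
The check being discussed boils down to something like this (a sketch
of the pattern, not the actual btrfs code; do_delalloc_writeout() is a
placeholder for the normal heavy delalloc path):

    #include <linux/sched.h>
    #include <linux/pagemap.h>
    #include <linux/writeback.h>

    static int memalloc_averse_writepage(struct page *page,
                                         struct writeback_control *wbc)
    {
            /* Called from direct reclaim?  Don't start delalloc work here. */
            if (current->flags & PF_MEMALLOC) {
                    redirty_page_for_writepage(wbc, page);
                    unlock_page(page);
                    return 0;
            }

            return do_delalloc_writeout(page, wbc);
    }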

> Ext4 is slightly different; it refuses writepages() if the inode
> blocks for the page haven't yet been allocated.  (Regardless of
> whether it's happening for direct reclaim or zone scanning.)  However,
> if the on-disk block has been assigned (i.e., this isn't a delalloc
> case), ext4 will honor the writepage().   (i.e., if this is an mmap of
> an already existing file, or if the space has been pre-allocated using
> fallocate()).    The reason for ext4's concern is lock ordering,
> although I'm investigating whether I can fix this.   If we call
> set_page_writeback() to set PG_writeback (plus set the various bits of
> magic fs accounting), and then drop the page_lock, does that protect
> us from random changes happening to the page (i.e., from vmtruncate,
> etc.)?

PG_writeback will protect you from vmtruncate, but you may also want to
have page_mkwrite wait for pages in flight.

> 
> > 
> > IOW, I don't think such documentation suppose delayed allocation issue ;)
> > 
> > The point is, Our dirty page accounting only account per-system-memory
> > dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
> > nor per-zone dirty ratio. and then, to refuse write page and fake numa
> > abusing can make confusing our vm easily. if _all_ pages in our VM LRU
> > list (it's per-zone), page activation doesn't help. It also lead to OOM.
> > 
> > And I'm sorry. I have to say now all vm developers fake numa is not
> > production level quority yet. afaik, nobody have seriously tested our
> > vm code on such environment. (linux/arch/x86/Kconfig says "This is only 
> > useful for debugging".)
> 

> So I'm sorry I mentioned the fake numa bit, since I think this is a
> bit of a red herring.   That code is in production here, and we've
> made all sorts of changes so ti can be used for more than just
> debugging.  So please ignore it, it's our local hack, and if it breaks
> that's our problem.    More importantly, just two weeks ago I talked
> to soeone in the financial sector, who was testing out ext4 on an
> upstream kernel, and not using our hacks that force 128MB zones, and
> he ran into the ext4/OOM problem while using an upstream kernel.  It
> involved Oracle pinning down 3G worth of pages, and him trying to do a
> huge streaming backup (which of course wasn't using fallocate or
> direct I/O) under ext4, and he had the same issue --- an OOM, that I'm
> pretty sure was caused by the fact that ext4_writepage() was refusing
> the writepage() and most of the pages weren't nailed down by Oracle
> were delalloc.    The same test scenario using ext3 worked just fine,
> of course.
> 
> Under normal cases it's not a problem since statistically there should
> be enough other pages in the system compared to the number of pages
> that are subject to delalloc, such that pages can usually get pushed
> out until the writeback code can get around to writing out the pages.
> But in cases where the zones have been made artificially small, or you
> have a big program like Oracle pinning down a large number of pages,
> then of course we have problems. 
> 
> I'm trying to fix things from the file system side, which means trying
> to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is
> described in Documentation/filesystems/Locking as something which MUST
> be used if writepage() is going refuse a page.  And then I discovered
> no one is actually using it.   So that's why I was asking with respect
> whether the Locking documentation file was out of date, or whether all
> of the file systems are doing it wrong.
> 

> On a related example of how file system code isnt' necessarily
> following what is required/recommended by the Locking documentation,
> ext2 and ext3 are both NOT using
> set_page_writeback()/end_page_writeback(), but are rather keeping the
> page locked until after they call block_write_full_page(), because of
> concerns of truncate coming in and screwing things up.

block_write_full_page() takes a locked page and, if all goes well,
produces a writeback page with the page no longer locked.  Basically it
needs the page locked until after it has the writeback bit set, to
protect against truncate and to make sure the page buffers don't go away
while it is looping over them.

So I don't think ext2/3 are breaking the rules here.
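
The ordering described above amounts to something like this (a
simplified sketch of what __block_write_full_page() does, with the
buffer mapping and locking elided; not the verbatim fs/buffer.c code):

    static int write_full_page_sketch(struct page *page,
                                      struct writeback_control *wbc)
    {
            BUG_ON(!PageLocked(page));

            /* ... map and lock the dirty buffers on this page (elided) ... */

            set_page_writeback(page);   /* truncate now waits on PG_writeback */
            unlock_page(page);          /* safe: writeback bit already set */

            /*
             * ... submit the locked buffers; end_buffer_async_write()
             * calls end_page_writeback() once the last buffer completes ...
             */
            return 0;
    }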

> But now looking at Locking, it appears that set_page_writeback() is
> as good as page_lock() for preventing the truncate code from coming
> in and screwing everything up?   It's not clear to me exactly what
> locking guarantees are provided against truncate by
> set_page_writeback().   And suppose we are writing out a whole
> cluster of pages, say 4MB worth of pages; do we need to call
> set_page_writeback() on every single page in the cluster before we
> do the I/O to make sure things don't change out from under us?  (I'm
> pretty sure at least some of the other filesystems that are
> submitting huge numbers of pages using bio instead of 4k at a time
> like ext2/3/4 aren't calling set_page_writeback() on all of the
> pages first.)
> 
> Part of the problem is that the writeback Locking semantics aren't
> well documented, and where they are documented, it's not clear they
> are up to date --- and all of the file systems that are doing delayed
> allocation writeback are doing things slightly differently, or in some
> cases very differently.    (And even without delalloc, as I've pointed
> out ext2/3 don't use set_page_writeback() --- if this is a MUST USE as
> implied by the Locking file, why did whoever added this requirement
> didn't go in and modify common filesystems like ext2 and ext3 to use
> the set_page_writeback/end_page_writeback calls?)
> 
> I'm happy to change things in ext4; in fact I'm pretty sure ext4
> probably isn't completely right here.   But it's not clear what
> "right" actually is, and when I look to see what protects writepage()
> racing with vmtruncate(), it's enough to give me a headache.  :-(    
> 
> Hence my question about wouldn't it be simpler if we simply added more
> high-level locking to prevent truncate from racing against
> writepage/writeback.  

My understanding of the current scheme is that truncate will wait on
both locked and writeback pages.  The page lock is used while setting
up the page for writeback, which is true for both writepages() and
writepage().
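
The truncate side being relied on here does roughly the following per
page (a paraphrased sketch of the truncate_inode_pages_range() loop,
not the verbatim mm/truncate.c code):

    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/pagevec.h>

    /*
     * Truncate takes the page lock and then waits for writeback, which
     * is why a page that is either locked or under PG_writeback is
     * safe from it.
     */
    static void truncate_pages_sketch(struct address_space *mapping,
                                      struct pagevec *pvec)
    {
            int i;

            for (i = 0; i < pagevec_count(pvec); i++) {
                    struct page *page = pvec->pages[i];

                    lock_page(page);                /* waits out ->writepage setup */
                    wait_on_page_writeback(page);   /* waits out I/O in flight */
                    truncate_inode_page(mapping, page);
                    unlock_page(page);
            }
    }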

I don't think we need a new lock on top of the page lock and the
writeback bit, but maybe I don't see exactly which problem you're
solving.  A given range of pages is either:

1) Allocated but not under IO.  ext4 must write these pages to disk
before truncate can finish, for data=ordered reasons, unless it manages
to log the orphan item.  Figuring out the dependencies between the
orphan item, which i_size is on disk right now, and holes is pretty
tricky, so I'd go with the less complex option: just wait for all the
allocated delalloc pages to hit the disk.

2) Allocated and under IO.  These pages go to disk.

3) Delalloc and not under IO.  Truncate (or notify_change if you lean
toward the xfs crowd) should be able to clean these up
without waiting for the IO.

Of the three, #3 is probably the most common, with #1 a close second.
Is this a case that we really need to optimize for?

-chris



* Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
From: KOSAKI Motohiro @ 2010-04-27 13:03 UTC (permalink / raw)
  To: Theodore Tso
  Cc: kosaki.motohiro, linux-mm, linux-fsdevel, linux-ext4,
	linux-btrfs, Hugh Dickins

> 
> On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> > (and later rd choosed to use another way).
> > Then, It assume writepage refusing aren't happen on majority pages.
> > IOW, the VM assume other many pages can writeout although the page can't.
> > Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> > but now ext4 and btrfs refuse all writepage(). (right?)
> 
> No, not exactly.  Btrfs refuses the writepage() in the direct reclaim case (i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone scanning.  I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it were just due to worrying about fs recursion issues, I assume all of the btrfs allocations could be done GFP_NOFS.
> 
> Ext4 is slightly different; it refuses a writepage() if the inode blocks for the page haven't yet been allocated.  (Regardless of whether it's happening for direct reclaim or zone scanning.)  However, if the on-disk block has been assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage() (e.g., if this is an mmap of an already existing file, or if the space has been pre-allocated using fallocate()).  The reason for ext4's concern is lock ordering, although I'm investigating whether I can fix this.  If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting) and then drop the page lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?
> 
> > 
> > IOW, I don't think such documentation suppose delayed allocation issue ;)
> > 
> > The point is, Our dirty page accounting only account per-system-memory
> > dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
> > nor per-zone dirty ratio. and then, to refuse write page and fake numa
> > abusing can make confusing our vm easily. if _all_ pages in our VM LRU
> > list (it's per-zone), page activation doesn't help. It also lead to OOM.
> > 
> > And I'm sorry. I have to say now all vm developers fake numa is not
> > production level quority yet. afaik, nobody have seriously tested our
> > vm code on such environment. (linux/arch/x86/Kconfig says "This is only
> > useful for debugging".)
> 
> So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a red herring.  That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging.  So please ignore it; it's our local hack, and if it breaks, that's our problem.  More importantly, just two weeks ago I talked to someone in the financial sector who was testing out ext4 on an upstream kernel, without our hacks that force 128MB zones, and he ran into the ext4/OOM problem.  It involved Oracle pinning down 3G worth of pages and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4, and he had the same issue --- an OOM that I'm pretty sure was caused by the fact that ext4_writepage() was refusing the writepage() and most of the pages that weren't nailed down by Oracle were delalloc.  The same test scenario using ext3 worked just fine, of course.
> 
> Under normal cases it's not a problem since statistically there should be enough other pages in the system compared to the number of pages that are subject to delalloc, such that pages can usually get pushed out until the writeback code can get around to writing out the pages.  But in cases where the zones have been made artificially small, or you have a big program like Oracle pinning down a large number of pages, then of course we have problems.
> 
> I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in Documentation/filesystems/Locking as something which MUST be used if writepage() is going to refuse a page.  And then I discovered that no one is actually using it.  So that's why I was asking whether the Locking documentation file is out of date, or whether all of the file systems are doing it wrong.
> 
> On a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation, ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but are rather keeping the page locked until after they call block_write_full_page(), because of concerns about truncate coming in and screwing things up.  But now, looking at Locking, it appears that set_page_writeback() is as good as the page lock for preventing the truncate code from coming in and screwing everything up?  It's not clear to me exactly what locking guarantees are provided against truncate by set_page_writeback().  And suppose we are writing out a whole cluster of pages, say 4MB worth; do we need to call set_page_writeback() on every single page in the cluster before we do the I/O to make sure things don't change out from under us?  (I'm pretty sure at least some of the other filesystems that submit huge numbers of pages using a bio, instead of 4k at a time like ext2/3/4, aren't calling set_page_writeback() on all of the pages first.)
> 
> Part of the problem is that the writeback Locking semantics aren't well documented, and where they are documented, it's not clear they are up to date --- and all of the file systems that are doing delayed allocation writeback are doing things slightly differently, or in some cases very differently.  (And even without delalloc, as I've pointed out, ext2/3 don't use set_page_writeback() --- if this is a MUST USE as implied by the Locking file, why didn't whoever added this requirement go in and modify common filesystems like ext2 and ext3 to use the set_page_writeback()/end_page_writeback() calls?)
> 
> I'm happy to change things in ext4; in fact I'm pretty sure ext4 probably isn't completely right here.  But it's not clear what "right" actually is, and when I look to see what protects writepage() racing with vmtruncate(), it's enough to give me a headache.  :-(

Umm, sorry, I'm not the best person to answer your question;
Nick probably has the deepest knowledge in this area.

AFAICS, the vmtruncate call graph is:

vmtruncate
 -> truncate_pagecache
    -> truncate_inode_pages
       -> truncate_inode_pages_range
            lock_page(page);
            wait_on_page_writeback(page);
            truncate_inode_page(mapping, page);
             -> truncate_complete_page
                -> remove_from_page_cache
             ....
            unlock_page(page);

So PG_locked and/or PG_writeback protect against remove_from_page_cache().

But now I'm afraid this can't solve the ext4 delalloc issue.  I'm
pretty sure you have already done this easy grepping yourself; I guess
you are suffering from a more difficult issue.  May I ask why the ext4
logic that counts delalloc pages can't take the page lock?

And today's grep result is:

ext2_writepage
  block_write_full_page
    block_write_full_page_endio
      __block_write_full_page
        set_page_writeback

end_buffer_async_write
  end_page_writeback


ext3 seems to have logic similar to ext2's.  Am I missing something?

> 
> Hence my question: wouldn't it be simpler if we simply added more high-level locking to prevent truncate from racing against writepage/writeback?
> 
> -- Ted



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
@ 2010-04-27 13:03       ` KOSAKI Motohiro
  0 siblings, 0 replies; 10+ messages in thread
From: KOSAKI Motohiro @ 2010-04-27 13:03 UTC (permalink / raw)
  To: Theodore Tso
  Cc: kosaki.motohiro, linux-mm, linux-fsdevel, linux-ext4,
	linux-btrfs, Hugh Dickins

> 
> On Apr 26, 2010, at 6:18 AM, KOSAK
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> > (and later rd choosed to use another way).
> > Then, It assume writepage refusing aren't happen on majority pages.
> > IOW, the VM assume other many pages can writeout although the page can't.
> > Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> > but now ext4 and btrfs refuse all writepage(). (right?)
> 
> No, not exactly.   Btrfs refuses the writepage() in the direct reclaim cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone scanning.  I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it was just due to worrying about fs recursion issues, i assume all of the btrfs allocations could be done GFP_NOFS.
> 
> Ext4 is slightly different; it refuses writepages() if the inode blocks for the page haven't yet been allocated.  (Regardless of whether it's happening for direct reclaim or zone scanning.)  However, if the on-disk block has been assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage().   (i.e., if this is an mmap of an already existing file, or if the space has been pre-allocated using fallocate()).    The reason for ext4's concern is lock ordering, although I'm investigating whether I can fix this.   If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting), and then drop the page_lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?
> 
> > 
> > IOW, I don't think that documentation anticipated the delayed allocation issue ;)
> > 
> > The point is, our dirty page accounting only tracks the per-system-memory
> > dirty ratio and per-task dirty pages; it doesn't track a per-NUMA-node
> > or per-zone dirty ratio. So refusing writepage(), combined with fake NUMA
> > abuse, can easily confuse our VM. If _all_ pages in a VM LRU list
> > (which is per-zone) are such unwritable pages, page activation doesn't help. It can also lead to OOM.
> > 
> > And I'm sorry, but I have to say that, for all VM developers, fake NUMA is not
> > production-level quality yet. AFAIK, nobody has seriously tested our
> > VM code in such an environment. (linux/arch/x86/Kconfig says "This is only
> > useful for debugging".)
> 
> So I'm sorry I mentioned the fake NUMA bit, since I think this is a bit of a red herring.   That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging.  So please ignore it; it's our local hack, and if it breaks that's our problem.    More importantly, just two weeks ago I talked to someone in the financial sector who was testing out ext4 on an upstream kernel, not using our hacks that force 128MB zones, and he ran into the ext4/OOM problem.  It involved Oracle pinning down 3G worth of pages, and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4, and he had the same issue --- an OOM, which I'm pretty sure was caused by the fact that ext4_writepage() was refusing the writepage() and most of the pages that weren't nailed down by Oracle were delalloc.    The same test scenario using ext3 worked just fine, of course.
> 
> In normal cases it's not a problem, since statistically there should be enough other pages in the system, compared to the number of pages that are subject to delalloc, that pages can usually get pushed out until the writeback code gets around to writing out the delalloc pages.   But in cases where the zones have been made artificially small, or where you have a big program like Oracle pinning down a large number of pages, of course we have problems.
> 
> I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in Documentation/filesystems/Locking as something which MUST be used if writepage() is going to refuse a page.  And then I discovered no one is actually using it.   So that's why I was asking whether the Locking documentation file is out of date, or whether all of the file systems are doing it wrong.
> 
> As a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation, ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but are rather keeping the page locked until after they call block_write_full_page(), because of concerns about truncate coming in and screwing things up.   But now looking at Locking, it appears that set_page_writeback() is as good as holding the page lock for preventing the truncate code from coming in and screwing everything up?   It's not clear to me exactly what locking guarantees are provided against truncate by set_page_writeback().   And suppose we are writing out a whole cluster of pages, say 4MB worth; do we need to call set_page_writeback() on every single page in the cluster before we do the I/O, to make sure things don't change out from under us?  (I'm pretty sure at least some of the other filesystems that submit huge numbers of pages using bios, instead of 4k at a time like ext2/3/4, aren't calling set_page_writeback() on all of the pages first.)
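
If set_page_writeback() really is sufficient, the cluster case would
presumably need something like the sketch below --- every page tagged under
its own lock before the single large bio goes out (hypothetical helper
names; whether the bio-based filesystems actually do this is exactly the
question):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Sketch: tag every page in the cluster as under writeback, each one
 * while its own page lock is held, before any I/O is submitted, so a
 * concurrent truncate blocks on whichever page it reaches first. */
static int sketch_write_cluster(struct page **pages, int nr,
				struct writeback_control *wbc)
{
	int i;

	for (i = 0; i < nr; i++) {
		lock_page(pages[i]);
		set_page_writeback(pages[i]);
		unlock_page(pages[i]);
	}

	/* Build and submit one large bio covering all nr pages
	 * (hypothetical helper, elided here). */
	sketch_submit_cluster_bio(pages, nr, wbc);
	return 0;
}

/* The bio completion path would then end writeback page by page: */
static void sketch_cluster_end_page(struct page *page, int uptodate)
{
	if (!uptodate)
		SetPageError(page);
	end_page_writeback(page);
}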
> 
> Part of the problem is that the writeback locking semantics aren't well documented, and where they are documented, it's not clear they are up to date --- and all of the file systems that are doing delayed allocation writeback are doing things slightly differently, or in some cases very differently.    (And even without delalloc, as I've pointed out, ext2/3 don't use set_page_writeback() --- if this is a MUST USE, as implied by the Locking file, why didn't whoever added this requirement go in and modify common filesystems like ext2 and ext3 to use the set_page_writeback()/end_page_writeback() calls?)
> 
> I'm happy to change things in ext4; in fact, I'm pretty sure ext4 isn't completely right here.   But it's not clear what "right" actually is, and when I look to see what protects writepage() from racing with vmtruncate(), it's enough to give me a headache.  :-(

Umm.. sorry, I'm not the right person to answer your question.
Nick probably has the best knowledge in this area.

AFAICS, the vmtruncate() call graph looks like this:

vmtruncate
 -> truncate_pagecache
    -> truncate_inode_pages
       -> truncate_inode_pages_range
            lock_page(page);
            wait_on_page_writeback(page);
            truncate_inode_page(mapping, page);
             -> truncate_complete_page
                -> remove_from_page_cache
             ....
            unlock_page(page);

So PG_locked and/or PG_writeback being set protects against remove_from_page_cache().
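
Written as code rather than a call graph --- a simplified sketch of the
per-page core of truncate_inode_pages_range(), not the actual loop --- the
ordering is:

#include <linux/mm.h>
#include <linux/pagemap.h>

static void sketch_truncate_one_page(struct address_space *mapping,
				     struct page *page)
{
	lock_page(page);
	/* If ->writepage() set PG_writeback before unlocking, truncate
	 * blocks here until the I/O has finished. */
	wait_on_page_writeback(page);
	truncate_inode_page(mapping, page);	/* -> remove_from_page_cache() */
	unlock_page(page);
}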

But now I'm afraid this can't solve the ext4 delalloc issue. I'm pretty sure
you have already done the easy grepping above, so I guess you are wrestling with a
more difficult issue. Let me ask: why can't ext4's logic for counting the number of
delalloc pages take the page lock?

And today's grep result is:

ext2_writepage
  block_write_full_page
    block_write_full_page_endio
      __block_write_full_page
        set_page_writeback

end_buffer_async_write
  end_page_writeback
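
For reference, ext2_writepage() is (if I remember the code correctly) just a
thin wrapper around that chain, roughly:

static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
	return block_write_full_page(page, ext2_get_block, wbc);
}

so it does go through set_page_writeback()/end_page_writeback() via
__block_write_full_page() and end_buffer_async_write().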


ext3 seems to have similar logic to ext2's. Am I missing something?




> 
> Hence my question: wouldn't it be simpler if we simply added more high-level locking to prevent truncate from racing against writepage/writeback?
> 
> -- Ted
> 




^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-04-27 13:03 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-25  2:40 No one seems to be using AOP_WRITEPAGE_ACTIVATE? Theodore Ts'o
2010-04-25  2:40 ` Theodore Ts'o
2010-04-26 10:18 ` KOSAKI Motohiro
2010-04-26 10:18   ` KOSAKI Motohiro
2010-04-26 14:50   ` Theodore Tso
2010-04-26 14:50     ` Theodore Tso
2010-04-26 17:24     ` Chris Mason
2010-04-26 17:24       ` Chris Mason
2010-04-27 13:03     ` KOSAKI Motohiro
2010-04-27 13:03       ` KOSAKI Motohiro

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.