* Removing GFP_NOFS
@ 2018-03-08 23:46 Matthew Wilcox
  2018-03-09  1:35 ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2018-03-08 23:46 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel


Do we have a strategy for eliminating GFP_NOFS?

As I understand it, our intent is to mark the areas in individual
filesystems that can't be reentered with memalloc_nofs_save()/restore()
pairs.  Once they're all done, then we can replace all the GFP_NOFS
users with GFP_KERNEL.
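
For reference, marking such a section looks roughly like this (a
minimal sketch; do_fs_work() is just a stand-in for real filesystem
code):

	unsigned int nofs_flags;

	nofs_flags = memalloc_nofs_save();
	/* any allocation in here implicitly behaves as GFP_NOFS */
	error = do_fs_work(GFP_KERNEL);
	memalloc_nofs_restore(nofs_flags);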

How will we know when we're done and can kill GFP_NOFS?  I was thinking
that we could put a warning in slab/page_alloc that fires when __GFP_IO
is set, __GFP_FS is clear and PF_MEMALLOC_NOFS is clear.  That would
catch every place that uses GFP_NOFS without using memalloc_nofs_save().
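
Something like the following, say (untested sketch; where exactly the
check would live in the allocator is an open question):

	/* GFP_NOFS is __GFP_IO without __GFP_FS; complain if the
	 * caller isn't inside a memalloc_nofs_save() section */
	WARN_ON_ONCE((gfp_mask & __GFP_IO) && !(gfp_mask & __GFP_FS) &&
		     !(current->flags & PF_MEMALLOC_NOFS));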

Unfortunately (and this is sort of the point), there's a lot of places
which use GFP_NOFS as a precaution; that is, they can be called from
places which both are and aren't in a nofs path.  So we'd have to pass
in GFP flags.  Which would be a lot of stupid churn.

I don't have a good solution here.  Maybe this is a good discussion
topic for LSFMM, or maybe there's already a good solution I'm overlooking.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-08 23:46 Removing GFP_NOFS Matthew Wilcox
@ 2018-03-09  1:35 ` Dave Chinner
  2018-03-09  4:06   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2018-03-09  1:35 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel

On Thu, Mar 08, 2018 at 03:46:18PM -0800, Matthew Wilcox wrote:
> 
> Do we have a strategy for eliminating GFP_NOFS?
> 
> As I understand it, our intent is to mark the areas in individual
> filesystems that can't be reentered with memalloc_nofs_save()/restore()
> pairs.  Once they're all done, then we can replace all the GFP_NOFS
> users with GFP_KERNEL.

Won't be that easy, I think.  We recently came across user-reported
allocation deadlocks in XFS where we were doing allocation with
pages held in the writeback state that lockdep has never triggered
on.

https://www.spinics.net/lists/linux-xfs/msg16154.html

IOWs, GFP_NOFS isn't a solid guide to what
memalloc_nofs_save/restore need to cover in the filesystems because
there's a surprising amount of code that isn't covered by existing
lockdep annotations to warn us about unintended recursion
problems.

I think we need to start with some documentation of all the generic
rules for where these will need to be set, then the per-filesystem
rules can be added on top of that...

> How will we know when we're done and can kill GFP_NOFS?  I was thinking
> that we could put a warning in slab/page_alloc that fires when __GFP_IO
> is set, __GFP_FS is clear and PF_MEMALLOC_NOFS is clear.  That would
> catch every place that uses GFP_NOFS without using memalloc_nofs_save().
> 
> Unfortunately (and this is sort of the point), there's a lot of places
> which use GFP_NOFS as a precaution; that is, they can be called from
> places which both are and aren't in a nofs path.  So we'd have to pass
> in GFP flags.  Which would be a lot of stupid churn.

Yup, GFP_NOFS has been used as a "go away, lockdep, you're drunk"
flag for handling false positives for quite a long time, because some
calls are already under memalloc_nofs_save/restore protection paths.
These would need to be converted to GFP_NOLOCKDEP instead of
GFP_NOFS, since they are already covered by
memalloc_nofs_save/restore() in the cases that matter...
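
i.e. conversions along these lines (illustrative only; the flag is
spelled __GFP_NOLOCKDEP in gfp.h):

	-	ptr = kmalloc(size, GFP_NOFS);
	+	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);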

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-09  1:35 ` Dave Chinner
@ 2018-03-09  4:06   ` Dave Chinner
  2018-03-09 11:14       ` Tetsuo Handa
  2018-03-09 14:48     ` Goldwyn Rodrigues
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2018-03-09  4:06 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel

On Fri, Mar 09, 2018 at 12:35:35PM +1100, Dave Chinner wrote:
> On Thu, Mar 08, 2018 at 03:46:18PM -0800, Matthew Wilcox wrote:
> > 
> > Do we have a strategy for eliminating GFP_NOFS?
> > 
> > As I understand it, our intent is to mark the areas in individual
> > filesystems that can't be reentered with memalloc_nofs_save()/restore()
> > pairs.  Once they're all done, then we can replace all the GFP_NOFS
> > users with GFP_KERNEL.
> 
> Won't be that easy, I think.  We recently came across user-reported
> allocation deadlocks in XFS where we were doing allocation with
> pages held in the writeback state that lockdep has never triggered
> on.
> 
> https://www.spinics.net/lists/linux-xfs/msg16154.html
> 
> IOWs, GFP_NOFS isn't a solid guide to what
> memalloc_nofs_save/restore need to cover in the filesystems because
> there's a surprising amount of code that isn't covered by existing
> lockdep annotations to warn us about unintended recursion
> problems.
> 
> I think we need to start with some documentation of all the generic
> rules for where these will need to be set, then the per-filesystem
> rules can be added on top of that...

So thinking a bit further here:

* page writeback state gets set and held:
	->writepage should be under memalloc_nofs_save
	->writepages should be under memalloc_nofs_save
* page cache write path is often under AOP_FLAG_NOFS
	- should probably be under memalloc_nofs_save
* metadata writeback that uses page cache and page writeback flags
  should probably be under memalloc_nofs_save
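
As a concrete example, scoping ->writepages might look something like
this (hypothetical "foo" filesystem, untested; foo_writepage_cb stands
in for the filesystem's real writepage callback):

	static int foo_writepages(struct address_space *mapping,
				  struct writeback_control *wbc)
	{
		unsigned int nofs_flags;
		int ret;

		/* pages are held in writeback state across allocations
		 * below, so reclaim must not recurse into the fs */
		nofs_flags = memalloc_nofs_save();
		ret = write_cache_pages(mapping, wbc, foo_writepage_cb,
					NULL);
		memalloc_nofs_restore(nofs_flags);
		return ret;
	}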

What other generic code paths are susceptible to allocation
deadlocks?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-09  4:06   ` Dave Chinner
@ 2018-03-09 11:14       ` Tetsuo Handa
  2018-03-09 14:48     ` Goldwyn Rodrigues
  1 sibling, 0 replies; 8+ messages in thread
From: Tetsuo Handa @ 2018-03-09 11:14 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: linux-mm, linux-fsdevel, rgoldwyn, neilb, James.Bottomley, Michal Hocko

On 2018/03/09 13:06, Dave Chinner wrote:
> On Fri, Mar 09, 2018 at 12:35:35PM +1100, Dave Chinner wrote:
>> On Thu, Mar 08, 2018 at 03:46:18PM -0800, Matthew Wilcox wrote:
>>>
>>> Do we have a strategy for eliminating GFP_NOFS?
>>>
>>> As I understand it, our intent is to mark the areas in individual
>>> filesystems that can't be reentered with memalloc_nofs_save()/restore()
>>> pairs.  Once they're all done, then we can replace all the GFP_NOFS
>>> users with GFP_KERNEL.
>>
>> Won't be that easy, I think.  We recently came across user-reported
>> allocation deadlocks in XFS where we were doing allocation with
>> pages held in the writeback state that lockdep has never triggered
>> on.
>>
>> https://www.spinics.net/lists/linux-xfs/msg16154.html
>>
>> IOWs, GFP_NOFS isn't a solid guide to what
>> memalloc_nofs_save/restore need to cover in the filesystems because
>> there's a surprising amount of code that isn't covered by existing
>> lockdep annotations to warn us about unintended recursion
>> problems.
>>
>> I think we need to start with some documentation of all the generic
>> rules for where these will need to be set, then the per-filesystem
>> rules can be added on top of that...
>
> So thinking a bit further here:
>
> * page writeback state gets set and held:
>     ->writepage should be under memalloc_nofs_save
>     ->writepages should be under memalloc_nofs_save
> * page cache write path is often under AOP_FLAG_NOFS
>     - should probably be under memalloc_nofs_save
> * metadata writeback that uses page cache and page writeback flags
>   should probably be under memalloc_nofs_save
>
> What other generic code paths are susceptible to allocation
> deadlocks?
>
> Cheers,
>
> Dave.

Goldwyn Rodrigues has proposed "[LSF/MM ATTEND] memory allocation scope".
But I think that getting rid of direct reclaim, assigning a dedicated
kernel thread to each reclaim routine, and using multiple watermark
levels based on the purpose of each memory allocation request would be
better. An OOM situation wastes CPU resources on unthrottled direct
reclaim attempts that make no progress; we lose performance when memory
allocations for memory reclaim activities fail; we happily dive into a
lockdep labyrinth that is too hard to test; and we are unable to give up
upon SIGKILL (i.e. unable to implement __GFP_KILLABLE) because of
unkillable locks taken within direct reclaim attempts.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-09  4:06   ` Dave Chinner
  2018-03-09 11:14       ` Tetsuo Handa
@ 2018-03-09 14:48     ` Goldwyn Rodrigues
  2018-03-09 22:38       ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-09 14:48 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox; +Cc: linux-mm, linux-fsdevel, penguin-kernel



On 03/08/2018 10:06 PM, Dave Chinner wrote:
> On Fri, Mar 09, 2018 at 12:35:35PM +1100, Dave Chinner wrote:
>> On Thu, Mar 08, 2018 at 03:46:18PM -0800, Matthew Wilcox wrote:
>>>
>>> Do we have a strategy for eliminating GFP_NOFS?
>>>
>>> As I understand it, our intent is to mark the areas in individual
>>> filesystems that can't be reentered with memalloc_nofs_save()/restore()
>>> pairs.  Once they're all done, then we can replace all the GFP_NOFS
>>> users with GFP_KERNEL.
>>
>> Won't be that easy, I think.  We recently came across user-reported
>> allocation deadlocks in XFS where we were doing allocation with
>> pages held in the writeback state that lockdep has never triggered
>> on.
>>
>> https://www.spinics.net/lists/linux-xfs/msg16154.html
>>
>> IOWs, GFP_NOFS isn't a solid guide to what
>> memalloc_nofs_save/restore need to cover in the filesystems because
>> there's a surprising amount of code that isn't covered by existing
>> lockdep annotations to warn us about unintended recursion
>> problems.
>>
>> I think we need to start with some documentation of all the generic
>> rules for where these will need to be set, then the per-filesystem
>> rules can be added on top of that...
> 
> So thinking a bit further here:
> 
> * page writeback state gets set and held:
> 	->writepage should be under memalloc_nofs_save
> 	->writepages should be under memalloc_nofs_save
> * page cache write path is often under AOP_FLAG_NOFS
> 	- should probably be under memalloc_nofs_save
> * metadata writeback that uses page cache and page writeback flags
>   should probably be under memalloc_nofs_save
> 
> What other generic code paths are susceptible to allocation
> deadlocks?
> 

AFAIU, these are callbacks into the filesystem from the mm code which
are executed in case of low memory. So, the memory allocations made
from filesystem code are the ones that should be under
memalloc_nofs_save() in order to guard against recursion.

OTOH (contradicting myself here), writepages (in essence, writeback) is
performed by per-BDI flusher threads, which are kicked by the mm code in
low memory situations, as opposed to being run by the thread performing
the allocation.

As Tetsuo pointed out, direct reclaim is the real problematic scenario.

Also the shrinkers registered by filesystem code. However, there are no
shrinkers that I know of which allocate memory or perform locking,
thanks to smartly splicing entries onto a temporary local list.
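
That is, the usual splice-and-dispose pattern, roughly (a sketch
against a made-up "cache" structure):

	LIST_HEAD(dispose);

	spin_lock(&cache->lock);
	list_splice_init(&cache->lru, &dispose);
	spin_unlock(&cache->lock);
	/* entries on the private "dispose" list can now be freed
	 * without holding the lock */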


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-09 14:48     ` Goldwyn Rodrigues
@ 2018-03-09 22:38       ` Dave Chinner
  2018-03-10  2:44         ` Tetsuo Handa
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2018-03-09 22:38 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Matthew Wilcox, linux-mm, linux-fsdevel, penguin-kernel

On Fri, Mar 09, 2018 at 08:48:32AM -0600, Goldwyn Rodrigues wrote:
> 
> 
> On 03/08/2018 10:06 PM, Dave Chinner wrote:
> > On Fri, Mar 09, 2018 at 12:35:35PM +1100, Dave Chinner wrote:
> >> On Thu, Mar 08, 2018 at 03:46:18PM -0800, Matthew Wilcox wrote:
> >>>
> >>> Do we have a strategy for eliminating GFP_NOFS?
> >>>
> >>> As I understand it, our intent is to mark the areas in individual
> >>> filesystems that can't be reentered with memalloc_nofs_save()/restore()
> >>> pairs.  Once they're all done, then we can replace all the GFP_NOFS
> >>> users with GFP_KERNEL.
> >>
> >> Won't be that easy, I think.  We recently came across user-reported
> >> allocation deadlocks in XFS where we were doing allocation with
> >> pages held in the writeback state that lockdep has never triggered
> >> on.
> >>
> >> https://www.spinics.net/lists/linux-xfs/msg16154.html
> >>
> >> IOWs, GFP_NOFS isn't a solid guide to what
> >> memalloc_nofs_save/restore need to cover in the filesystems because
> >> there's a surprising amount of code that isn't covered by existing
> >> lockdep annotations to warn us about unintended recursion
> >> problems.
> >>
> >> I think we need to start with some documentation of all the generic
> >> rules for where these will need to be set, then the per-filesystem
> >> rules can be added on top of that...
> > 
> > So thinking a bit further here:
> > 
> > * page writeback state gets set and held:
> > 	->writepage should be under memalloc_nofs_save
> > 	->writepages should be under memalloc_nofs_save
> > * page cache write path is often under AOP_FLAG_NOFS
> > 	- should probably be under memalloc_nofs_save
> > * metadata writeback that uses page cache and page writeback flags
> >   should probably be under memalloc_nofs_save
> > 
> > What other generic code paths are susceptible to allocation
> > deadlocks?
> > 
> 
> AFAIU, these are callbacks into the filesystem from the mm code which
> are executed in case of low memory.

Except that many filesystems reject such attempts at writeback from
direct reclaim because they are a problem:

        /*
         * Refuse to write the page out if we are called from reclaim context.
         *
         * This avoids stack overflows when called from deeply used stacks in
         * random callers for direct reclaim or memcg reclaim.  We explicitly
         * allow reclaim from kswapd as the stack usage there is relatively low.
         *
         * This should never happen except in the case of a VM regression so
         * warn about it.
         */
        if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
                        PF_MEMALLOC))
                goto redirty;

But writeback is also called on demand - via the filemap_fdatawrite()
and friends interfaces. This means it can be invoked from anywhere in
the kernel....

> So, the memory allocations made
> from filesystem code are the ones that should be under
> memalloc_nofs_save() in order to guard against recursion.

I don't think you understand the problem here - the problem is not
recursing into the writeback path - it's being in the writeback path
and doing an allocation that then triggers memory reclaim which then
recurses into the same filesystem while we hold pages in writeback
state.

i.e. the writeback path is a source of allocation deadlocks no matter
where it is called from.
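
Schematically:

	->writepages			pages locked, PageWriteback set
	  kmalloc(size, GFP_KERNEL)
	    direct reclaim
	      recurses into the same filesystem
	        blocks on the pages we already hold in writeback => deadlock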

> OTOH (contradicting myself here), writepages (in essence, writeback) is
> performed by per-BDI flusher threads, which are kicked by the mm code in
> low memory situations, as opposed to being run by the thread performing
> the allocation.
> 
> As Tetsuo pointed out, direct reclaim is the real problematic scenario.

Sure, but I've been saying for more than 10 years we need to get rid
of direct reclaim because it's horribly inefficient when there's
lots of concurrent allocation pressure, not to mention it's full of
deadlock scenarios like this.

Really, though, I'm tired of having the same arguments over and over
again about architectural problems that people just don't seem to
understand or want to fix.

> Also the shrinkers registered by filesystem code. However, there are no
> shrinkers that I know of which allocate memory or perform locking,
> thanks to smartly splicing entries onto a temporary local list.

Go look at the XFS shrinkers that will lock inodes, dquots, buffers,
etc, run transactions, issue IO, block waiting for IO to complete,
etc.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Removing GFP_NOFS
  2018-03-09 22:38       ` Dave Chinner
@ 2018-03-10  2:44         ` Tetsuo Handa
  0 siblings, 0 replies; 8+ messages in thread
From: Tetsuo Handa @ 2018-03-10  2:44 UTC (permalink / raw)
  To: david, rgoldwyn, mhocko; +Cc: willy, linux-mm, linux-fsdevel

Dave Chinner wrote:
> > OTOH (contradicting myself here), writepages (in essence, writeback) is
> > performed by per-BDI flusher threads, which are kicked by the mm code in
> > low memory situations, as opposed to being run by the thread performing
> > the allocation.
> > 
> > As Tetsuo pointed out, direct reclaim is the real problematic scenario.
> 
> Sure, but I've been saying for more than 10 years we need to get rid
> of direct reclaim because it's horribly inefficient when there's
> lots of concurrent allocation pressure, not to mention it's full of
> deadlock scenarios like this.
> 
> Really, though, I'm tired of having the same arguments over and over
> again about architectural problems that people just don't seem to
> understand or want to fix.
> 
Yeah, it is sad that developers are not interested in lowmem situations.

  Suspect the MM subsystem when your Linux system hung up!?
  https://elinux.org/images/4/49/CELFJP-Jamboree63-handa-en.pdf

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-10  2:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-08 23:46 Removing GFP_NOFS Matthew Wilcox
2018-03-09  1:35 ` Dave Chinner
2018-03-09  4:06   ` Dave Chinner
2018-03-09 11:14     ` Tetsuo Handa
2018-03-09 14:48     ` Goldwyn Rodrigues
2018-03-09 22:38       ` Dave Chinner
2018-03-10  2:44         ` Tetsuo Handa
