linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] Removing GFP_NOFS
@ 2024-01-04 21:17 Matthew Wilcox
  2024-01-05 10:13 ` Viacheslav Dubeyko
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Matthew Wilcox @ 2024-01-04 21:17 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi, linux-nvme

This is primarily a _FILESYSTEM_ track topic.  All the work has already
been done on the MM side; the FS people need to do their part.  It could
be a joint session, but I'm not sure there's much for the MM people
to say.

There are situations where we need to allocate memory, but cannot call
into the filesystem to free memory.  Generally this is because we're
holding a lock or we've started a transaction, and attempting to write
out dirty folios to reclaim memory would result in a deadlock.

The old way to solve this problem is to specify GFP_NOFS when allocating
memory.  This conveys little information about what is being protected
against, and so it is hard to know when it might be safe to remove.
It's also a reflex -- many filesystem authors use GFP_NOFS by default
even when they could use GFP_KERNEL because there's no risk of deadlock.

The new way is to use the scoped APIs -- memalloc_nofs_save() and
memalloc_nofs_restore().  These should be called when we start a
transaction or take a lock that would cause a GFP_KERNEL allocation to
deadlock.  Then just use GFP_KERNEL as normal.  The memory allocators
can see the nofs situation is in effect and will not call back into
the filesystem.
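As a rough sketch (the foo_trans_*() helpers here are made up for
illustration; the memalloc_nofs_*() calls are the real API):

        struct foo_item *item;          /* foo_* names are hypothetical */
        unsigned int nofs_flags;

        nofs_flags = memalloc_nofs_save();  /* reclaim must no longer recurse into the FS */
        foo_trans_start();

        /* GFP_KERNEL now behaves like GFP_NOFS for this task */
        item = kmalloc(sizeof(*item), GFP_KERNEL);

        foo_trans_commit();
        memalloc_nofs_restore(nofs_flags);  /* transaction done, back to normal */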

This results in better code within your filesystem as you don't need to
pass around gfp flags as much, and can lead to better performance from
the memory allocators as GFP_NOFS will not be used unnecessarily.

The memalloc_nofs APIs were introduced in May 2017, but we still have
over 1000 uses of GFP_NOFS in fs/ today (and 200 outside fs/, which is
really sad).  This session is for filesystem developers to talk about
what they need to do to fix up their own filesystem, or share stories
about how they made their filesystem better by adopting the new APIs.

My interest in this is that I'd like to get rid of the FGP_NOFS flag.
It'd also be good to get rid of the __GFP_FS flag since there's always
demand for more GFP flags.  I have a git branch with some work in this
area, so there's a certain amount of conference-driven development going
on here too.

We could do the same, mutatis mutandis, for GFP_NOIO,
memalloc_noio_save/restore, __GFP_IO, etc., so maybe the block people
are also interested.  I haven't
looked into that in any detail though.  I guess we'll see what interest
this topic gains.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
@ 2024-01-05 10:13 ` Viacheslav Dubeyko
  2024-01-05 10:26   ` [Lsf-pc] " Jan Kara
  2024-01-05 14:35   ` Vlastimil Babka (SUSE)
  2024-01-05 10:57 ` [Lsf-pc] " Jan Kara
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-05 10:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, Linux FS Devel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme



> On Jan 5, 2024, at 12:17 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> This is primarily a _FILESYSTEM_ track topic.  All the work has already
> been done on the MM side; the FS people need to do their part.  It could
> be a joint session, but I'm not sure there's much for the MM people
> to say.
> 
> There are situations where we need to allocate memory, but cannot call
> into the filesystem to free memory.  Generally this is because we're
> holding a lock or we've started a transaction, and attempting to write
> out dirty folios to reclaim memory would result in a deadlock.
> 
> The old way to solve this problem is to specify GFP_NOFS when allocating
> memory.  This conveys little information about what is being protected
> against, and so it is hard to know when it might be safe to remove.
> It's also a reflex -- many filesystem authors use GFP_NOFS by default
> even when they could use GFP_KERNEL because there's no risk of deadlock.
> 
> The new way is to use the scoped APIs -- memalloc_nofs_save() and
> memalloc_nofs_restore().  These should be called when we start a
> transaction or take a lock that would cause a GFP_KERNEL allocation to
> deadlock.  Then just use GFP_KERNEL as normal.  The memory allocators
> can see the nofs situation is in effect and will not call back into
> the filesystem.
> 
> This results in better code within your filesystem as you don't need to
> pass around gfp flags as much, and can lead to better performance from
> the memory allocators as GFP_NOFS will not be used unnecessarily.
> 
> The memalloc_nofs APIs were introduced in May 2017, but we still have
> over 1000 uses of GFP_NOFS in fs/ today (and 200 outside fs/, which is
> really sad).  This session is for filesystem developers to talk about
> what they need to do to fix up their own filesystem, or share stories
> about how they made their filesystem better by adopting the new APIs.
> 

Many file systems still heavily use GFP_NOFS for the kmalloc and
kmem_cache_alloc families of functions, even where the
memalloc_nofs_save()/memalloc_nofs_restore() pair is also used.  And I
can see that GFP_NOFS is used in radix_tree_preload(), bio_alloc(),
posix_acl_clone(), sb_issue_zeroout(), sb_issue_discard(),
alloc_inode_sb(), blkdev_issue_zeroout(), blkdev_issue_secure_erase(),
blkdev_zone_mgmt(), etc.

Would it be safe to switch to memalloc_nofs_save()/memalloc_nofs_restore()
for all possible cases?  Are there any potential issues or downsides?

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-05 10:13 ` Viacheslav Dubeyko
@ 2024-01-05 10:26   ` Jan Kara
  2024-01-05 14:17     ` Viacheslav Dubeyko
  2024-01-05 14:35   ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2024-01-05 10:26 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: Matthew Wilcox, linux-scsi, linux-ide, linux-nvme, linux-block,
	linux-mm, Linux FS Devel, lsf-pc

On Fri 05-01-24 13:13:11, Viacheslav Dubeyko wrote:
> 
> 
> > On Jan 5, 2024, at 12:17 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > [...]
> 
> Many file systems are still heavily using GFP_NOFS for kmalloc and
> kmem_cache_alloc family methods even if  memalloc_nofs_save() and
> memalloc_nofs_restore() pair is used too. But I can see that GFP_NOFS
> is used in radix_tree_preload(), bio_alloc(), posix_acl_clone(),
> sb_issue_zeroout, sb_issue_discard(), alloc_inode_sb(), blkdev_issue_zeroout(),
> blkdev_issue_secure_erase(), blkdev_zone_mgmt(), etc.

Given the nature of the scoped API, the transition has to start in the
leaves (i.e. the filesystems themselves); only once all users of, say,
radix_tree_preload() are converted to the scoped API can we remove the
GFP_NOFS use from radix_tree_preload() itself.  So Matthew is right that
we need to start in the filesystems.
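To make that concrete, a leaf conversion is roughly (sketch only):

        /* before: hardcoded GFP_NOFS at every call site */
        ret = radix_tree_preload(GFP_NOFS);

        /* after: the call site already runs inside a memalloc_nofs scope */
        ret = radix_tree_preload(GFP_KERNEL);

Only once every caller has been converted like this can the remaining
GFP_NOFS assumptions be dropped from the shared code itself.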

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
  2024-01-05 10:13 ` Viacheslav Dubeyko
@ 2024-01-05 10:57 ` Jan Kara
  2024-01-08 11:47   ` Johannes Thumshirn
  2024-01-08  6:39 ` Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2024-01-05 10:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-scsi, linux-ide, linux-nvme, linux-block, linux-mm,
	linux-fsdevel

Hello,

On Thu 04-01-24 21:17:16, Matthew Wilcox wrote:
> This is primarily a _FILESYSTEM_ track topic.  All the work has already
> been done on the MM side; the FS people need to do their part.  It could
> be a joint session, but I'm not sure there's much for the MM people
> to say.
> 
> There are situations where we need to allocate memory, but cannot call
> into the filesystem to free memory.  Generally this is because we're
> holding a lock or we've started a transaction, and attempting to write
> out dirty folios to reclaim memory would result in a deadlock.
> 
> The old way to solve this problem is to specify GFP_NOFS when allocating
> memory.  This conveys little information about what is being protected
> against, and so it is hard to know when it might be safe to remove.
> It's also a reflex -- many filesystem authors use GFP_NOFS by default
> even when they could use GFP_KERNEL because there's no risk of deadlock.
> 
> The new way is to use the scoped APIs -- memalloc_nofs_save() and
> memalloc_nofs_restore().  These should be called when we start a
> transaction or take a lock that would cause a GFP_KERNEL allocation to
> deadlock.  Then just use GFP_KERNEL as normal.  The memory allocators
> can see the nofs situation is in effect and will not call back into
> the filesystem.
> 
> This results in better code within your filesystem as you don't need to
> pass around gfp flags as much, and can lead to better performance from
> the memory allocators as GFP_NOFS will not be used unnecessarily.
> 
> The memalloc_nofs APIs were introduced in May 2017, but we still have
> over 1000 uses of GFP_NOFS in fs/ today (and 200 outside fs/, which is
> really sad).  This session is for filesystem developers to talk about
> what they need to do to fix up their own filesystem, or share stories
> about how they made their filesystem better by adopting the new APIs.

I agree this is a worthy goal and the scoped API helped us a lot in the
ext4/jbd2 land. Still we have some legacy to deal with:

~> git grep "NOFS" fs/jbd2/ | wc -l
15
~> git grep "NOFS" fs/ext4/ | wc -l
71

Since you are asking what would help filesystems with the conversion, I
actually have one wish.  The most common case is that you need to annotate
some lock that can be grabbed in the reclaim path and thus you must avoid
GFP_FS allocations from under it. For example to deal with reclaim
deadlocks in the writeback paths we had to introduce wrappers like:

static inline int ext4_writepages_down_read(struct super_block *sb)
{
        percpu_down_read(&EXT4_SB(sb)->s_writepages_rwsem);
        return memalloc_nofs_save();
}

static inline void ext4_writepages_up_read(struct super_block *sb, int ctx)
{
        memalloc_nofs_restore(ctx);
        percpu_up_read(&EXT4_SB(sb)->s_writepages_rwsem);
}

When you have to do it for 5 locks in your filesystem it gets a bit ugly
and it would be nice to have some generic way to deal with this.  We
already have the spin_lock_irqsave() precedent we might follow (and I
don't necessarily mean its calling convention, which is a bit weird by
today's standards).

Even more lovely would be if we could actually avoid passing around the
returned reclaim state because sometimes the locks get acquired / released
in different functions and passing the state around requires quite some
changes and gets ugly.  That would mean we'd have to have a
fs-reclaim-forbidden counter instead of just a flag in task_struct.  OTOH
then we could just mark the lock (mutex / rwsem / whatever) as
fs-reclaim-unsafe during init and the rest would just magically happen.
That would be super-easy to use.
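Purely as a strawman (nothing below exists today), the init-time
annotation could look something like:

        /* at init: tell the core this lock is taken in the reclaim path */
        percpu_init_rwsem(&EXT4_SB(sb)->s_writepages_rwsem);
        lock_set_fs_reclaim_unsafe(&EXT4_SB(sb)->s_writepages_rwsem); /* hypothetical */

        /*
         * at use: down/up would bump a per-task fs-reclaim-forbidden
         * counter automatically, no state to pass around
         */
        percpu_down_read(&EXT4_SB(sb)->s_writepages_rwsem);
        ...
        percpu_up_read(&EXT4_SB(sb)->s_writepages_rwsem);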

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-05 10:26   ` [Lsf-pc] " Jan Kara
@ 2024-01-05 14:17     ` Viacheslav Dubeyko
  0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2024-01-05 14:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, linux-scsi, linux-ide, linux-nvme, linux-block,
	linux-mm, Linux FS Devel, lsf-pc



> On Jan 5, 2024, at 1:26 PM, Jan Kara <jack@suse.cz> wrote:
> 
> On Fri 05-01-24 13:13:11, Viacheslav Dubeyko wrote:
>> 
>> 
>>> On Jan 5, 2024, at 12:17 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> 
>>> [...]
>> 
>> Many file systems are still heavily using GFP_NOFS for kmalloc and
>> kmem_cache_alloc family methods even if  memalloc_nofs_save() and
>> memalloc_nofs_restore() pair is used too. But I can see that GFP_NOFS
>> is used in radix_tree_preload(), bio_alloc(), posix_acl_clone(),
>> sb_issue_zeroout, sb_issue_discard(), alloc_inode_sb(), blkdev_issue_zeroout(),
>> blkdev_issue_secure_erase(), blkdev_zone_mgmt(), etc.
> 
> Given the nature of the scoped API, the transition has to start in the
> leaves (i.e. filesystems itself) and only once all users of say
> radix_tree_preload() are converted to the scoped API, we can remove the
> GFP_NOFS use from radix_tree_preload() itself. So Matthew is right that we
> need to start in the filesystems.

Makes sense to me.  So, we need to summarize which file system uses
GFP_NOFS for which methods.  Then, I assume, it will be possible to
split the whole modification into phases of getting rid of GFP_NOFS for
each particular case (particular method).  It looks like we need to
declare the whole modification plan and something like a schedule for
such a change.  Would it work that way? :)

Thanks,
Slava. 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-05 10:13 ` Viacheslav Dubeyko
  2024-01-05 10:26   ` [Lsf-pc] " Jan Kara
@ 2024-01-05 14:35   ` Vlastimil Babka (SUSE)
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-01-05 14:35 UTC (permalink / raw)
  To: Viacheslav Dubeyko, Matthew Wilcox
  Cc: lsf-pc, Linux FS Devel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme

On 1/5/24 11:13, Viacheslav Dubeyko wrote:
> 
>> On Jan 5, 2024, at 12:17 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>> The memalloc_nofs APIs were introduced in May 2017, but we still have
>> over 1000 uses of GFP_NOFS in fs/ today (and 200 outside fs/, which is
>> really sad).  This session is for filesystem developers to talk about
>> what they need to do to fix up their own filesystem, or share stories
>> about how they made their filesystem better by adopting the new APIs.
>> 
> 
> Many file systems are still heavily using GFP_NOFS for kmalloc and
> kmem_cache_alloc family methods even if  memalloc_nofs_save() and
> memalloc_nofs_restore() pair is used too. But I can see that GFP_NOFS

Yes, it should be enough to rely on memalloc_nofs_save() for
kmalloc/kmem_cache_alloc.  The kmalloc layer doesn't care about it, and
once it has run out of available slab folios and calls into the page
allocator for a new one, the page allocator evaluates the effect of
memalloc_nofs_save() as expected.
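IOW, inside a nofs scope the two variants below end up with the same
reclaim constraints by the time the page allocator is reached (sketch
only):

        /* explicit flag at the call site */
        obj = kmem_cache_alloc(cachep, GFP_NOFS);

        /* scoped: current_gfp_context() masks off __GFP_FS in the page
         * allocator because PF_MEMALLOC_NOFS is set on the task */
        obj = kmem_cache_alloc(cachep, GFP_KERNEL);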

> is used in radix_tree_preload(), bio_alloc(), posix_acl_clone(),
> sb_issue_zeroout, sb_issue_discard(), alloc_inode_sb(), blkdev_issue_zeroout(),
> blkdev_issue_secure_erase(), blkdev_zone_mgmt(), etc.
> 
> Would it be safe to switch on memalloc_nofs_save()/memalloc_nofs_restore() for
> all possible cases? Any potential issues or downsides?
> 
> Thanks,
> Slava.
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
  2024-01-05 10:13 ` Viacheslav Dubeyko
  2024-01-05 10:57 ` [Lsf-pc] " Jan Kara
@ 2024-01-08  6:39 ` Dave Chinner
  2024-01-09  4:47 ` Dave Chinner
  2024-01-09 22:44 ` Dave Chinner
  4 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-01-08  6:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> This is primarily a _FILESYSTEM_ track topic.  All the work has already
> been done on the MM side; the FS people need to do their part.  It could
> be a joint session, but I'm not sure there's much for the MM people
> to say.
> 
> There are situations where we need to allocate memory, but cannot call
> into the filesystem to free memory.  Generally this is because we're
> holding a lock or we've started a transaction, and attempting to write
> out dirty folios to reclaim memory would result in a deadlock.
> 
> The old way to solve this problem is to specify GFP_NOFS when allocating
> memory.  This conveys little information about what is being protected
> against, and so it is hard to know when it might be safe to remove.
> It's also a reflex -- many filesystem authors use GFP_NOFS by default
> even when they could use GFP_KERNEL because there's no risk of deadlock.

There are many uses in XFS where GFP_NOFS has been used because
__GFP_NOLOCKDEP did not exist. A large number of the remaining
GFP_NOFS and KM_NOFS uses in XFS fall under this category.

As a first step, I have a patchset that gets rid of KM_NOFS and
replaces it with either GFP_NOFS or __GFP_NOLOCKDEP:

$ git grep "GFP_NOFS\|KM_NOFS" fs/xfs |wc -l
64
$ git checkout guilt/xfs-kmem-cleanup
Switched to branch 'guilt/xfs-kmem-cleanup'
$ git grep "GFP_NOFS\|KM_NOFS" fs/xfs |wc -l
21

Some of these are in newly merged code that I haven't updated the
patch set to handle yet, others are in kthread/kworker contexts that
don't inherit any allocation context information.  There aren't any
big issues remaining to be fixed in XFS, though.
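For illustration, the shape of those conversions is roughly (not the
actual patch contents):

        /* before: NOFS mostly there to keep lockdep quiet */
        ptr = kmem_alloc(size, KM_NOFS);

        /* after: reclaim allowed, lockdep told not to track this site */
        ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);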

> The new way is to use the scoped APIs -- memalloc_nofs_save() and
> memalloc_nofs_restore().  These should be called when we start a
> transaction or take a lock that would cause a GFP_KERNEL allocation to
> deadlock.  Then just use GFP_KERNEL as normal.  The memory allocators
> can see the nofs situation is in effect and will not call back into
> the filesystem.

Note that this is the only way to use vmalloc() safely with GFP_NOFS
context...

> This results in better code within your filesystem as you don't need to
> pass around gfp flags as much, and can lead to better performance from
> the memory allocators as GFP_NOFS will not be used unnecessarily.
> 
> The memalloc_nofs APIs were introduced in May 2017, but we still have

For everyone else who doesn't know the history of this, the scoped
GFP_NOFS allocation code has been around for a lot longer than this
current API.  PF_FSTRANS was added in early 2002 so we didn't have to
hack magic flags into current->journal_info to determine if we were
in a transaction, and then this was added:

commit 957568938d4030414d71c583bc261fe3558d2c17
Author: Steve Lord <lord@sgi.com>
Date:   Thu Jan 31 11:17:26 2002 +0000

    Use PF_FSTRANS to detect being in a transaction

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 08a17984..282b724f 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -396,16 +396,11 @@ linvfs_release_buftarg(

 static kmem_cache_t * linvfs_inode_cachep;

-#define XFS_TRANS_MAGIC 0x5452414E
-
 static __inline__ unsigned int gfp_mask(void)
 {
         /* If we're not in a transaction, FS activity is ok */
-        if (!current->journal_info) return GFP_KERNEL;
-        /* could be set from some other filesystem */
-        if ((int)current->journal_info != XFS_TRANS_MAGIC)
-                return GFP_KERNEL;
-        return GFP_NOFS;
+        if (current->flags & PF_FSTRANS) return GFP_NOFS;
+       return GFP_KERNEL;
 }
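
(For reference, the moral equivalent of that gfp_mask() helper is what
current_gfp_context() does for every allocation today -- paraphrased,
not the exact mainline code:

        if (current->flags & PF_MEMALLOC_NOFS)
                gfp_mask &= ~__GFP_FS;

with PF_MEMALLOC_NOFS being set/cleared by
memalloc_nofs_save()/memalloc_nofs_restore().)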

> over 1000 uses of GFP_NOFS in fs/ today (and 200 outside fs/, which is
> really sad).  This session is for filesystem developers to talk about
> what they need to do to fix up their own filesystem, or share stories
> about how they made their filesystem better by adopting the new APIs.
> 
> My interest in this is that I'd like to get rid of the FGP_NOFS flag.

Isn't that flag redundant? i.e. we already have mapping_gfp_mask()
to indicate what gfp mask should be used with the mapping
operations, and at least the iomap code uses that.

Many filesystems call mapping_set_gfp_mask(GFP_NOFS) already, XFS is
the special one that does:

	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));

so it doesn't actually use GFP_NOFS there.

Given that we already have a generic way of telling mapping
operations the scoped allocation context they should run under,
perhaps we could turn this into scoped context calls somewhere in
the generic IO/mapping operation paths? e.g.
call_read_iter()/call_write_iter()
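i.e. something vaguely like this (purely a sketch, not existing code):

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
				     struct iov_iter *iter)
{
        unsigned int nofs_flags = 0;
        bool nofs = !(mapping_gfp_mask(file->f_mapping) & __GFP_FS);
        ssize_t ret;

        if (nofs)
                nofs_flags = memalloc_nofs_save();
        ret = file->f_op->read_iter(kio, iter);
        if (nofs)
                memalloc_nofs_restore(nofs_flags);
        return ret;
}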

> It'd also be good to get rid of the __GFP_FS flag since there's always
> demand for more GFP flags.  I have a git branch with some work in this
> area, so there's a certain amount of conference-driven development going
> on here too.

Worry about that when everything is using scoped contexts.  Then
nobody will be using GFP_NOFS or __GFP_FS externally, and the
allocator can then reclaim the flag.

> We could mutatis mutandi for GFP_NOIO, memalloc_noio_save/restore,
> __GFP_IO, etc, so maybe the block people are also interested.  I haven't
> looked into that in any detail though.  I guess we'll see what interest
> this topic gains.

That seems a whole lot simpler - just set the GFP_NOIO scope at
entry to the block layer and that should cover a large percentage of
the GFP_NOIO allocations...
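i.e. roughly (sketch; submit_bio_noacct() picked just as an example
entry point):

void submit_bio_noacct(struct bio *bio)
{
        unsigned int noio_flags = memalloc_noio_save();

        /* existing submission path; GFP_KERNEL allocations below here
         * now behave as GFP_NOIO */
        ...

        memalloc_noio_restore(noio_flags);
}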

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-05 10:57 ` [Lsf-pc] " Jan Kara
@ 2024-01-08 11:47   ` Johannes Thumshirn
  2024-01-08 17:39     ` David Sterba
  2024-01-09 15:47     ` Luis Henriques
  0 siblings, 2 replies; 24+ messages in thread
From: Johannes Thumshirn @ 2024-01-08 11:47 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: lsf-pc, linux-scsi, linux-ide, linux-nvme, linux-block, linux-mm,
	linux-fsdevel, linux-btrfs

On 05.01.24 11:57, Jan Kara wrote:
> Hello,
> 
> On Thu 04-01-24 21:17:16, Matthew Wilcox wrote:
>> [...]
> 
> I agree this is a worthy goal and the scoped API helped us a lot in the
> ext4/jbd2 land. Still we have some legacy to deal with:
> 
> ~> git grep "NOFS" fs/jbd2/ | wc -l
> 15
> ~> git grep "NOFS" fs/ext4/ | wc -l
> 71
>

For everyone out there following along who is curious, here are the
per-filesystem GFP_NOFS counts in fs/:
1 - affs
1 - cachefiles
1 - ecryptfs
1 - fscache
1 - notify
1 - squashfs
1 - vboxsf
1 - zonefs
2 - hfsplus
2 - tracefs
3 - 9p
3 - ext2
3 - iomap
5 - befs
5 - exfat
5 - fat
5 - udf
5 - ufs
7 - erofs
10 - fuse
11 - smb
14 - hpfs
15 - jbd2
17 - crypto
17 - jfs
17 - quota
17 - reiserfs
18 - nfs
18 - nilfs2
21 - ntfs
30 - xfs
37 - bcachefs
46 - gfs2
47 - afs
55 - dlm
61 - f2fs
63 - ceph
66 - ext4
71 - ocfs2
74 - ntfs3
84 - ubifs
199 - btrfs

As I'd already feared, we (as in btrfs) are the worst here.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-08 11:47   ` Johannes Thumshirn
@ 2024-01-08 17:39     ` David Sterba
  2024-01-09  7:43       ` Johannes Thumshirn
  2024-01-09 15:47     ` Luis Henriques
  1 sibling, 1 reply; 24+ messages in thread
From: David Sterba @ 2024-01-08 17:39 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jan Kara, Matthew Wilcox, lsf-pc, linux-scsi, linux-ide,
	linux-nvme, linux-block, linux-mm, linux-fsdevel, linux-btrfs

On Mon, Jan 08, 2024 at 11:47:11AM +0000, Johannes Thumshirn wrote:
> On 05.01.24 11:57, Jan Kara wrote:
> > Hello,
> > 
> > On Thu 04-01-24 21:17:16, Matthew Wilcox wrote:
> >> [...]

> 199 - btrfs

All the easy conversions to scoped nofs allocations have been done; the
rest requires saving the nofs state at transaction start, as said
above.  I have a wip series for that, updated every few releases, but
it's intrusive and not finished enough for a testing run.  The number
of patches is over 100, doing each conversion separately; the other
generic changes are straightforward.

It's possible to do it incrementally; there's one monster patch (300
edited lines) to add a stub parameter to transaction start,
https://lore.kernel.org/linux-btrfs/20211018173803.18353-1-dsterba@suse.com/ .
There are some counterpoints in the discussion about whether it has to
be done like that, but IIRC avoiding it is not possible; I have examples
why not.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
                   ` (2 preceding siblings ...)
  2024-01-08  6:39 ` Dave Chinner
@ 2024-01-09  4:47 ` Dave Chinner
  2024-02-08 16:02   ` Vlastimil Babka (SUSE)
  2024-01-09 22:44 ` Dave Chinner
  4 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-01-09  4:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> This is primarily a _FILESYSTEM_ track topic.  All the work has already
> been done on the MM side; the FS people need to do their part.  It could
> be a joint session, but I'm not sure there's much for the MM people
> to say.
> 
> There are situations where we need to allocate memory, but cannot call
> into the filesystem to free memory.  Generally this is because we're
> holding a lock or we've started a transaction, and attempting to write
> out dirty folios to reclaim memory would result in a deadlock.
> 
> The old way to solve this problem is to specify GFP_NOFS when allocating
> memory.  This conveys little information about what is being protected
> against, and so it is hard to know when it might be safe to remove.
> It's also a reflex -- many filesystem authors use GFP_NOFS by default
> even when they could use GFP_KERNEL because there's no risk of deadlock.
> 
> The new way is to use the scoped APIs -- memalloc_nofs_save() and
> memalloc_nofs_restore().  These should be called when we start a
> transaction or take a lock that would cause a GFP_KERNEL allocation to
> deadlock.  Then just use GFP_KERNEL as normal.  The memory allocators
> can see the nofs situation is in effect and will not call back into
> the filesystem.

So in rebasing the XFS kmem.[ch] removal patchset I've been working
on, there is a clear piece of memory allocator functionality that we
need to be able to scope: __GFP_NOFAIL.

All of the allocations done through the existing XFS kmem.[ch]
interfaces (i.e. just about everything) have __GFP_NOFAIL semantics
added except in the explicit cases where we add KM_MAYFAIL to
indicate that the allocation can fail.

The result of this conversion to remove GFP_NOFS is that I'm also
adding *dozens* of __GFP_NOFAIL annotations because we effectively
scope that behaviour.

Hence I think this discussion needs to consider that __GFP_NOFAIL is
also widely used within critical filesystem code that cannot
gracefully recover from memory allocation failures, and that this
would also be useful to scope....

Yeah, I know, mm developers hate __GFP_NOFAIL.  We've been using
these NOFAIL semantics in XFS for over 2 decades and the sky hasn't
fallen. So can we get memalloc_nofail_{save,restore}() so that we
can change the default allocation behaviour in certain contexts
(e.g. the same contexts we need NOFS allocations) to be NOFAIL
unless __GFP_RETRY_MAYFAIL or __GFP_NORETRY are set?
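
A strawman of what that might look like (memalloc_nofail_save() and
memalloc_nofail_restore() do not exist, this is purely illustrative):

        unsigned int nofail_flags;

        nofail_flags = memalloc_nofail_save();          /* hypothetical */

        /* treated as GFP_KERNEL | __GFP_NOFAIL inside the scope */
        bp = kmalloc(sizeof(*bp), GFP_KERNEL);

        /* explicit opt-out still allowed to fail */
        tmp = kmalloc(bigsize, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

        memalloc_nofail_restore(nofail_flags);          /* hypothetical */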

We already have memalloc_noreclaim_{save/restore}() for turning off
direct memory reclaim for a given context (i.e. equivalent of
clearing __GFP_DIRECT_RECLAIM), so if we are going to embrace scoped
allocation contexts, then we should be going all in and providing
all the contexts that filesystems actually need....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-08 17:39     ` David Sterba
@ 2024-01-09  7:43       ` Johannes Thumshirn
  2024-01-09 22:23         ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Thumshirn @ 2024-01-09  7:43 UTC (permalink / raw)
  To: dsterba
  Cc: Jan Kara, Matthew Wilcox, lsf-pc, linux-scsi, linux-ide,
	linux-nvme, linux-block, linux-mm, linux-fsdevel, linux-btrfs

On 08.01.24 18:40, David Sterba wrote:
>> 199 - btrfs
> 
> All the easy conversions to scoped nofs allocaionts have been done, the
> rest requires to add saving the nofs state at the transactions tart, as
> said in above. I have a wip series for that, updated every few releases
> but it's intrusive and not finished for a testing run. The number of
> patches is over 100, doing each conversion separately, the other generic
> changes are straightforward.
> 
> It's possible to do it incrementally, there's one moster patch (300
> edited lines) to add a stub parameter to transaction start,
> https://lore.kernel.org/linux-btrfs/20211018173803.18353-1-dsterba@suse.com/ .
> There are some counter points in the discussion if it has to be done
> like that but IIRC it's not possible, I have examples why not.
> 

At first glance, storing the nofs scope in the transaction handle like
Filipe proposed sounds like a good idea to me.
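i.e. roughly this shape (sketch; the nofs_flags member is made up):

        /* in btrfs_start_transaction(), before returning the handle: */
        trans->nofs_flags = memalloc_nofs_save();

        /* in btrfs_end_transaction(), once the handle is finished with: */
        memalloc_nofs_restore(trans->nofs_flags);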

Anyway, the deeper discussion of how we hope to solve it is nothing
that needs to happen in this thread.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-08 11:47   ` Johannes Thumshirn
  2024-01-08 17:39     ` David Sterba
@ 2024-01-09 15:47     ` Luis Henriques
  2024-01-09 18:04       ` Johannes Thumshirn
  1 sibling, 1 reply; 24+ messages in thread
From: Luis Henriques @ 2024-01-09 15:47 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jan Kara, Matthew Wilcox, lsf-pc, linux-scsi, linux-ide,
	linux-nvme, linux-block, linux-mm, linux-fsdevel, linux-btrfs

Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:

> On 05.01.24 11:57, Jan Kara wrote:
>> Hello,
>> 
>> On Thu 04-01-24 21:17:16, Matthew Wilcox wrote:
>>> [...]
>> 
>> I agree this is a worthy goal and the scoped API helped us a lot in the
>> ext4/jbd2 land. Still we have some legacy to deal with:
>> 
>> ~> git grep "NOFS" fs/jbd2/ | wc -l
>> 15
>> ~> git grep "NOFS" fs/ext4/ | wc -l
>> 71
>>
>
> For everyone following out there being curious:
> 1 - affs
> 1 - cachefiles
> 1 - ecryptfs
> 1 - fscache
> 1 - notify
> 1 - squashfs
> 1 - vboxsf
> 1 - zonefs
> 2 - hfsplus
> 2 - tracefs
> 3 - 9p
> 3 - ext2
> 3 - iomap
> 5 - befs
> 5 - exfat
> 5 - fat
> 5 - udf
> 5 - ufs
> 7 - erofs
> 10 - fuse
> 11 - smb
> 14 - hpfs
> 15 - jbd2
> 17 - crypto
> 17 - jfs
> 17 - quota
> 17 - reiserfs
> 18 - nfs
> 18 - nilfs2
> 21 - ntfs
> 30 - xfs
> 37 - bcachefs
> 46 - gfs2
> 47 - afs
> 55 - dlm
> 61 - f2fs
> 63 - ceph
> 66 - ext4
> 71 - ocfs2
> 74 - ntfs3
> 84 - ubifs
> 199 - btrfs
>
> As I've already feared we (as in btrfs) are the worst here.

It probably won't make you feel any better, but the value for ceph isn't
correct as you're only counting the code in 'fs/ceph/'.  If you also take
'net/ceph/' into account, it brings ceph much closer to btrfs: 63 + 48 = 111

Cheers,
-- 
Luís

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-09 15:47     ` Luis Henriques
@ 2024-01-09 18:04       ` Johannes Thumshirn
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Thumshirn @ 2024-01-09 18:04 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Jan Kara, Matthew Wilcox, lsf-pc, linux-scsi, linux-ide,
	linux-nvme, linux-block, linux-mm, linux-fsdevel, linux-btrfs

On 09.01.24 16:47, Luis Henriques wrote:
> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> writes:
>>
>> As I've already feared we (as in btrfs) are the worst here.
> 
> It probably won't make you feel any better, but the value for ceph isn't
> correct as you're just taking into account the code in 'fs/ceph/'.  If you
> also take 'net/ceph/', it brings it much closer to btrfs: 63 + 48 = 111
> 
> Cheers,

Yeah I've just quickly skimmed over fs/. There's net/ (69) and drivers/ 
(36) as well.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-09  7:43       ` Johannes Thumshirn
@ 2024-01-09 22:23         ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-01-09 22:23 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: dsterba, Jan Kara, Matthew Wilcox, lsf-pc, linux-scsi, linux-ide,
	linux-nvme, linux-block, linux-mm, linux-fsdevel, linux-btrfs

On Tue, Jan 09, 2024 at 07:43:54AM +0000, Johannes Thumshirn wrote:
> On 08.01.24 18:40, David Sterba wrote:
> >> 199 - btrfs
> > 
> > All the easy conversions to scoped nofs allocaionts have been done, the
> > rest requires to add saving the nofs state at the transactions tart, as
> > said in above. I have a wip series for that, updated every few releases
> > but it's intrusive and not finished for a testing run. The number of
> > patches is over 100, doing each conversion separately, the other generic
> > changes are straightforward.
> > 
> > It's possible to do it incrementally, there's one moster patch (300
> > edited lines) to add a stub parameter to transaction start,
> > https://lore.kernel.org/linux-btrfs/20211018173803.18353-1-dsterba@suse.com/ .
> > There are some counter points in the discussion if it has to be done
> > like that but IIRC it's not possible, I have examples why not.
> > 
> 
> At a first glance, storing the nofs scope in the transaction handle like 
> Filipe proposed sounds like a good idea to me.

That's exactly what XFS has done for the last couple of decades. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
                   ` (3 preceding siblings ...)
  2024-01-09  4:47 ` Dave Chinner
@ 2024-01-09 22:44 ` Dave Chinner
  4 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-01-09 22:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> This is primarily a _FILESYSTEM_ track topic.  All the work has already
> been done on the MM side; the FS people need to do their part.  It could
> be a joint session, but I'm not sure there's much for the MM people
> to say.
> 
> There are situations where we need to allocate memory, but cannot call
> into the filesystem to free memory.  Generally this is because we're
> holding a lock or we've started a transaction, and attempting to write
> out dirty folios to reclaim memory would result in a deadlock.
> 
> The old way to solve this problem is to specify GFP_NOFS when allocating
> memory.  This conveys little information about what is being protected
> against, and so it is hard to know when it might be safe to remove.
> It's also a reflex -- many filesystem authors use GFP_NOFS by default
> even when they could use GFP_KERNEL because there's no risk of deadlock.

Another thing that needs to be considered: GFP_NOFS has also been used
to avoid lockdep false positives when the same allocation site is
reached both as a plain GFP_KERNEL allocation and as a scoped GFP_NOFS
allocation via different call chains.

That is, if a code path does a GFP_KERNEL allocation by default,
lockdep will track this as a "reclaim allowed" allocation context.
If that same code is then executed from a scoped NOFS context
(e.g. inside a transaction context), then lockdep will see it as
a "no reclaim allowed" allocation context.

The problem then arises when the next GFP_KERNEL allocation occurs:
the code enters direct reclaim, grabs a filesystem lock, and lockdep
then throws out a warning that filesystem locks are being taken in
an allocation context that doesn't allow reclaim to take
filesystem locks.

These are typically false positives.

Prior to __GFP_NOLOCKDEP existing, we used GFP_NOFS unconditionally
in these shared context paths to stop lockdep from seeing a
GFP_KERNEL allocation context from this allocation path.  Now that we
are getting rid of GFP_NOFS and replacing these instances with
GFP_KERNEL and scoped constraints, we're removing the anti-lockdep
false positive mechanism.

IOWs, we have to replace GFP_NOFS with GFP_KERNEL | __GFP_NOLOCKDEP
in these cases to prevent false positive reclaim vs lock inversion
detections. There's at least a dozen of these cases in XFS and we
generally know where they are, but this will likely be an ongoing
issue for filesystems as we switch over to using scoped memory
allocation contexts.

I'm not sure there's a good solution to this. However, we need to
make sure people are aware that GFP_NOFS will need to be converted
to GFP_KERNEL | __GFP_NOLOCKDEP for allocations that can occur in
mixed contexts.
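
For example (sketch; the helper here is a stand-in for any code that
runs both inside and outside transaction context):

void *xfs_foo_alloc_buffer(size_t size)
{
        /*
         * Called both from plain GFP_KERNEL context and from inside
         * scoped-NOFS transaction context.  Plain GFP_KERNEL would make
         * lockdep record a "reclaim allowed" context here and later
         * warn about fs locks taken in reclaim, so tell lockdep to
         * ignore this allocation site.
         */
        return kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);
}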

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-01-09  4:47 ` Dave Chinner
@ 2024-02-08 16:02   ` Vlastimil Babka (SUSE)
  2024-02-08 17:33     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-02-08 16:02 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, Kent Overstreet, Michal Hocko

On 1/9/24 05:47, Dave Chinner wrote:
> On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
>> [...]
> 
> So in rebasing the XFS kmem.[ch] removal patchset I've been working
> on, there is a clear memory allocator function that we need to be
> scoped: __GFP_NOFAIL.
> 
> All of the allocations done through the existing XFS kmem.[ch]
> interfaces (i.e just about everything) have __GFP_NOFAIL semantics
> added except in the explicit cases where we add KM_MAYFAIL to
> indicate that the allocation can fail.
> 
> The result of this conversion to remove GFP_NOFS is that I'm also
> adding *dozens* of __GFP_NOFAIL annotations because we effectively
> scope that behaviour.
> 
> Hence I think this discussion needs to consider that __GFP_NOFAIL is
> also widely used within critical filesystem code that cannot
> gracefully recover from memory allocation failures, and that this
> would also be useful to scope....
> 
> Yeah, I know, mm developers hate __GFP_NOFAIL. We've been using
> these semantics NOFAIL in XFS for over 2 decades and the sky hasn't
> fallen. So can we get memalloc_nofail_{save,restore}() so that we
> can change the default allocation behaviour in certain contexts
> (e.g. the same contexts we need NOFS allocations) to be NOFAIL
> unless __GFP_RETRY_MAYFAIL or __GFP_NORETRY are set?

Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggest to me this
is no longer an FS-only topic, as this isn't just about converting to the
scoped APIs but also about how they should be improved.

[1] http://lkml.kernel.org/r/Zbu_yyChbCO6b2Lj@tiehlicka

> We already have memalloc_noreclaim_{save/restore}() for turning off
> direct memory reclaim for a given context (i.e. equivalent of
> clearing __GFP_DIRECT_RECLAIM), so if we are going to embrace scoped
> allocation contexts, then we should be going all in and providing
> all the contexts that filesystems actually need....
> 
> -Dave.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-08 16:02   ` Vlastimil Babka (SUSE)
@ 2024-02-08 17:33     ` Michal Hocko
  2024-02-08 19:55       ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2024-02-08 17:33 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Dave Chinner, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
> On 1/9/24 05:47, Dave Chinner wrote:
> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> >> [...]
> > 
> > [...]
> > Yeah, I know, mm developers hate __GFP_NOFAIL. We've been using
> > these semantics NOFAIL in XFS for over 2 decades and the sky hasn't
> > fallen. So can we get memalloc_nofail_{save,restore}() so that we
> > can change the default allocation behaviour in certain contexts
> > (e.g. the same contexts we need NOFS allocations) to be NOFAIL
> > unless __GFP_RETRY_MAYFAIL or __GFP_NORETRY are set?
> 
> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggests to me this
> is no longer FS-only topic as this isn't just about converting to the scoped
> apis, but also how they should be improved.

Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
mode.  It is still tricky to deal with GFP_NOWAIT requests inside the
NOFAIL scope because that makes them a non-failing busy wait for an
allocation if we need to insist on the scoped NOFAIL semantic.

On the other hand, we can define the behavior similarly to what you
propose with RETRY_MAYFAIL or NORETRY.  Existing NOWAIT users had
better handle allocation failures regardless of the external allocation
scope.

Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
not see an immediate problem with that.
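
In pseudo-code, the precedence would be something like (hypothetical,
PF_MEMALLOC_NOFAIL does not exist; it just mirrors how
current_gfp_context() handles the NOFS/NOIO scopes, and leaves
GFP_NOWAIT requests failable as described above):

        if ((current->flags & PF_MEMALLOC_NOFAIL) &&    /* hypothetical */
            (gfp & __GFP_DIRECT_RECLAIM) &&
            !(gfp & (__GFP_RETRY_MAYFAIL | __GFP_NORETRY)))
                gfp |= __GFP_NOFAIL;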

Having more NOFAIL allocations is not great but if you need to
emulate those by implementing the nofail semantic outside of the
allocator then it is better to have those retries inside the allocator
IMO.

> [1] http://lkml.kernel.org/r/Zbu_yyChbCO6b2Lj@tiehlicka
> 
> > We already have memalloc_noreclaim_{save/restore}() for turning off
> > direct memory reclaim for a given context (i.e. equivalent of
> > clearing __GFP_DIRECT_RECLAIM), so if we are going to embrace scoped
> > allocation contexts, then we should be going all in and providing
> > all the contexts that filesystems actually need....
> > 
> > -Dave.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-08 17:33     ` Michal Hocko
@ 2024-02-08 19:55       ` Vlastimil Babka (SUSE)
  2024-02-08 22:45         ` Kent Overstreet
  2024-02-12  1:20         ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-02-08 19:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On 2/8/24 18:33, Michal Hocko wrote:
> On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
>> On 1/9/24 05:47, Dave Chinner wrote:
>> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
>> 
>> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggests to me this
>> is no longer FS-only topic as this isn't just about converting to the scoped
>> apis, but also how they should be improved.
> 
> Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
> scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
> mode. It is still tricky to deal with GFP_NOWAIT requests inside the
> NOFAIL scope because that makes it a non failing busy wait for an
> allocation if we need to insist on scope NOFAIL semantic. 
> 
> On the other hand we can define the behavior similar to what you
> propose with RETRY_MAYFAIL resp. NORETRY. Existing NOWAIT users should
> better handle allocation failures regardless of the external allocation
> scope.
> 
> Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
> resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
> not see an immediate problem with that.
> 
> Having more NOFAIL allocations is not great but if you need to
> emulate those by implementing the nofail semantic outside of the
> allocator then it is better to have those retries inside the allocator
> IMO.

I see potential issues in scoping both NOWAIT and NOFAIL:

- NOFAIL - I'm assuming Dave is adding __GFP_NOFAIL to xfs allocations or
adjacent layers where he knows they must not fail for his transaction. But
could the scope affect also something else underneath that could fail
without the failure propagating in a way that it affects xfs? Maybe it's a
high-order allocation with a low-order fallback that really should not be
__GFP_NOFAIL? We would need to hope it has something like RETRY_MAYFAIL or
NORETRY already. But maybe it just relies on >costly order being more likely
to fail implicitly, and those costly orders should be kept excluded from the
scoped NOFAIL? Maybe __GFP_NOWARN should also override the scoped nofail?

- NOWAIT - as said already, we need to make sure we're not turning an
allocation that relied on too-small-to-fail into a null pointer exception or
BUG_ON(!page). It's probably not feasible to audit everything that can be
called underneath when adding a new scoped NOWAIT. Static analysis probably
won't be powerful enough as well. Kent suggested fault injection [1]. We
have the framework for a system-wide one but I don't know if anyone is
running it and how successful it is. But maybe we could have a special fault
injection mode (CONFIG_ option or something) for the NOWAIT scoped
allocations only. If everything works as expected, there are no crashes and
the pattern Kent described in [1] has a fallback that's slower but still
functional. If not, we get a report and known which place to fix, and the
testing only focuses on the relevant parts. When a new scoped NOWAIT is
added and bots/CIs running this fault injection config report no issues, we
can be reasonably sure it's fine?

[1]
https://lore.kernel.org/all/zup5yilebkgkrypis4g6zkbft7pywqi57k5aztoio2ufi5ujsd@mfnqu4rarort/
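
For concreteness, a minimal sketch of the fallback pattern referenced
in [1]; the tree/node names and the lock are made up for illustration:

#include <linux/slab.h>
#include <linux/spinlock.h>

struct node { u64 key; };                       /* illustrative */
struct tree { spinlock_t lock; };               /* illustrative */

/* Try an atomic allocation under the lock, fall back to a sleeping one. */
static int tree_insert(struct tree *tree, u64 key)
{
        struct node *new;

        spin_lock(&tree->lock);
        new = kmalloc(sizeof(*new), GFP_NOWAIT | __GFP_NOWARN);
        if (!new) {
                /* Atomic attempt failed: drop the lock, take the slow path. */
                spin_unlock(&tree->lock);
                new = kmalloc(sizeof(*new), GFP_KERNEL);  /* may reclaim */
                if (!new)
                        return -ENOMEM;
                spin_lock(&tree->lock);
                /* revalidate: the tree may have changed while unlocked */
        }
        new->key = key;
        /* link 'new' into the tree here */
        spin_unlock(&tree->lock);
        return 0;
}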

>> [1] http://lkml.kernel.org/r/Zbu_yyChbCO6b2Lj@tiehlicka
>> 
>> > We already have memalloc_noreclaim_{save/restore}() for turning off
>> > direct memory reclaim for a given context (i.e. equivalent of
>> > clearing __GFP_DIRECT_RECLAIM), so if we are going to embrace scoped
>> > allocation contexts, then we should be going all in and providing
>> > all the contexts that filesystems actually need....
>> > 
>> > -Dave.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-08 19:55       ` Vlastimil Babka (SUSE)
@ 2024-02-08 22:45         ` Kent Overstreet
  2024-02-12  1:20         ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2024-02-08 22:45 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michal Hocko, Dave Chinner, Matthew Wilcox, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, Kent Overstreet

On Thu, Feb 08, 2024 at 08:55:05PM +0100, Vlastimil Babka (SUSE) wrote:
> - NOWAIT - as said already, we need to make sure we're not turning an
> allocation that relied on too-small-to-fail into a null pointer exception or
> BUG_ON(!page). It's probably not feasible to audit everything that can be
> called underneath when adding a new scoped NOWAIT. Static analysis probably
> won't be powerful enough as well. Kent suggested fault injection [1]. We
> have the framework for a system-wide one but I don't know if anyone is
> running it and how successful it is.

I've also got a better fault injection library in the pipeline - I'll be
posting it after memory allocation profiling is merged, since that has
the library code needed for the new fault injection.

The new stuff gives us, via the same hooks as memory allocation
profiling, per-callsite individually controllable injection points -
which means it's way easier to inject memory allocation failures into
existing tests and to write tests that cover a specific codepath.

E.g. what I used to do with this code was flip on random memory
allocation failures for all code in fs/bcachefs/ after mounting, and I
had every test doing that (at one point in time, bcachefs could handle
_any_ allocation failure after startup without reporting an error to
userspace, but sadly not quite anymore).

That, plus code coverage analysis, should make this pretty tractable.
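
As a rough illustration (not Kent's library, which isn't posted yet):
the in-tree fault-injection framework already allows a per-callsite
injection point along these lines; foo_alloc and fail_foo_alloc are
made-up names:

#include <linux/fault-inject.h>
#include <linux/slab.h>

/* One fault_attr per callsite; could be exposed via fault_create_debugfs_attr(). */
static DECLARE_FAULT_ATTR(fail_foo_alloc);

static void *foo_alloc(size_t size, gfp_t gfp)
{
        if (should_fail(&fail_foo_alloc, size))
                return NULL;    /* simulated allocation failure */
        return kmalloc(size, gfp);
}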

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-08 19:55       ` Vlastimil Babka (SUSE)
  2024-02-08 22:45         ` Kent Overstreet
@ 2024-02-12  1:20         ` Dave Chinner
  2024-02-12  2:06           ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-02-12  1:20 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Thu, Feb 08, 2024 at 08:55:05PM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/8/24 18:33, Michal Hocko wrote:
> > On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
> >> On 1/9/24 05:47, Dave Chinner wrote:
> >> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> >> 
> >> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggests to me this
> >> is no longer FS-only topic as this isn't just about converting to the scoped
> >> apis, but also how they should be improved.
> > 
> > Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
> > scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
> > mode. It is still tricky to deal with GFP_NOWAIT requests inside the
> > NOFAIL scope because that makes it a non failing busy wait for an
> > allocation if we need to insist on scope NOFAIL semantic. 
> > 
> > On the other hand we can define the behavior similar to what you
> > propose with RETRY_MAYFAIL resp. NORETRY. Existing NOWAIT users should
> > better handle allocation failures regardless of the external allocation
> > scope.
> > 
> > Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
> > resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
> > not see an immediate problem with that.
> > 
> > Having more NOFAIL allocations is not great but if you need to
> > emulate those by implementing the nofail semantic outside of the
> > allocator then it is better to have those retries inside the allocator
> > IMO.
> 
> I see potential issues in scoping both the NOWAIT and NOFAIL
> 
> - NOFAIL - I'm assuming Dave is adding __GFP_NOFAIL to xfs allocations or
> adjacent layers where he knows they must not fail for his transaction. But
> could the scope affect also something else underneath that could fail
> without the failure propagating in a way that it affects xfs?

Memory allocation failures below the filesystem (i.e. in the IO
path) will fail the IO, and if that happens for a read IO within
a transaction then it will have the same effect as XFS failing a
memory allocation, i.e. it will shut down the filesystem.

The key point here is that the moment we go below the filesystem we
enter a new scoped allocation context with a guaranteed method of
returning errors: NOIO and bio errors.

Once we cross an allocation scope boundary, NOFAIL is no
longer relevant to the code that is being run, because there are
other errors that can occur there that the filesystem must handle.
Hence memory allocation errors just don't matter at this
point, and the NOFAIL constraint is no longer relevant.

Hence we really need to consider NOFAIL differently from NOFS/NOIO.
NOFS/NOIO are about avoiding reclaim recursion deadlocks, so they are
relevant all the way down the stack. NOFAIL is only relevant to a
specific subsystem to prevent that subsystem's allocations from
failing, but as soon as we cross into another subsystem that can (and
does) return errors for memory allocation failures, the NOFAIL context
is no longer relevant.

i.e. NOFAIL scopes are not relevant outside the subsystem that sets
them.  Hence we likely need helpers to clear and restore NOFAIL when
we cross allocation context boundaries, e.g. as we cross from
filesystem to block layer in the IO stack via submit_bio(). Maybe
they should be doing something like:

	nofail_flags = memalloc_nofail_clear();
	noio_flags = memalloc_noio_save();

	....

	memalloc_noio_restore(noio_flags);
	memalloc_nofail_reinstate(nofail_flags);

> Maybe it's a
> high-order allocation with a low-order fallback that really should not be
> __GFP_NOFAIL? We would need to hope it has something like RETRY_MAYFAIL or
> NORETRY already. But maybe it just relies on >costly order being more likely
> to fail implicitly, and those costly orders should be kept excluded from the
> scoped NOFAIL? Maybe __GFP_NOWARN should also override the scoped nofail?

We definitely need NORETRY/RETRY_MAYFAIL to override scoped NOFAIL
at the filesystem layer (e.g. for readahead buffer allocations,
xlog_kvmalloc(), etc to correctly fail fast within XFS
transactions), but I don't think we should force every subsystem to
have to do this just in case a higher level subsystem had a scoped
NOFAIL set for it to work correctly.
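
At the callsite level that override would just be the usual fail-fast
gfp flags, something like the following sketch:

        /* Inside a scoped-NOFAIL region: this allocation may still fail. */
        buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
        if (!buf)
                return -ENOMEM; /* fail fast; the caller handles the error */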

> - NOWAIT - as said already, we need to make sure we're not turning an
> allocation that relied on too-small-to-fail into a null pointer exception or
> BUG_ON(!page).

Agreed. NOWAIT is removing allocation failure constraints and I
don't think that can be made to work reliably. Error injection
cannot prove the absence of errors, and so we can never be certain
the code will always operate correctly and not crash when an
unexpected allocation failure occurs.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-12  1:20         ` Dave Chinner
@ 2024-02-12  2:06           ` Kent Overstreet
  2024-02-12  4:35             ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2024-02-12  2:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vlastimil Babka (SUSE),
	Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Mon, Feb 12, 2024 at 12:20:32PM +1100, Dave Chinner wrote:
> On Thu, Feb 08, 2024 at 08:55:05PM +0100, Vlastimil Babka (SUSE) wrote:
> > On 2/8/24 18:33, Michal Hocko wrote:
> > > On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
> > >> On 1/9/24 05:47, Dave Chinner wrote:
> > >> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> > >> 
> > >> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggests to me this
> > >> is no longer FS-only topic as this isn't just about converting to the scoped
> > >> apis, but also how they should be improved.
> > > 
> > > Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
> > > scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
> > > mode. It is still tricky to deal with GFP_NOWAIT requests inside the
> > > NOFAIL scope because that makes it a non failing busy wait for an
> > > allocation if we need to insist on scope NOFAIL semantic. 
> > > 
> > > On the other hand we can define the behavior similar to what you
> > > propose with RETRY_MAYFAIL resp. NORETRY. Existing NOWAIT users should
> > > better handle allocation failures regardless of the external allocation
> > > scope.
> > > 
> > > Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
> > > resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
> > > not see an immediate problem with that.
> > > 
> > > Having more NOFAIL allocations is not great but if you need to
> > > emulate those by implementing the nofail semantic outside of the
> > > allocator then it is better to have those retries inside the allocator
> > > IMO.
> > 
> > I see potential issues in scoping both the NOWAIT and NOFAIL
> > 
> > - NOFAIL - I'm assuming Dave is adding __GFP_NOFAIL to xfs allocations or
> > adjacent layers where he knows they must not fail for his transaction. But
> > could the scope affect also something else underneath that could fail
> > without the failure propagating in a way that it affects xfs?
> 
> Memory allocaiton failures below the filesystem (i.e. in the IO
> path) will fail the IO, and if that happens for a read IO within
> a transaction then it will have the same effect as XFS failing a
> memory allocation. i.e. it will shut down the filesystem.
> 
> The key point here is the moment we go below the filesystem we enter
> into a new scoped allocation context with a guaranteed method of
> returning errors: NOIO and bio errors.

Hang on, you're conflating NOIO to mean something completely different -
NOIO means "don't recurse in reclaim", it does _not_ mean anything about
what happens when the allocation fails, and in particular it definitely
does _not_ mean that failing the allocation is going to result in an IO
error.

That's because in general most code in the IO path knows how to make
effective use of biosets and mempools (which may take some work! you
have to ensure that you're always able to make forward progress when
memory is limited, and in particular that you don't double allocate from
the same mempool if you're blocking the first allocation from
completing/freeing).
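
For reference, the standard shape of that in the IO path looks roughly
like this (the io_req names are illustrative, not from any particular
driver):

#include <linux/mempool.h>
#include <linux/slab.h>

struct io_req { int error; };                   /* illustrative */

static struct kmem_cache *io_req_cache;
static mempool_t *io_req_pool;

static int io_req_pool_init(void)
{
        io_req_cache = KMEM_CACHE(io_req, 0);
        if (!io_req_cache)
                return -ENOMEM;
        /* Keep a small reserve so one request can always make progress. */
        io_req_pool = mempool_create_slab_pool(2, io_req_cache);
        return io_req_pool ? 0 : -ENOMEM;
}

static struct io_req *io_req_alloc(void)
{
        /*
         * With a reclaiming mask like GFP_NOIO, mempool_alloc() does not
         * return NULL - it waits for an in-flight request to complete and
         * give its object back via mempool_free().
         */
        return mempool_alloc(io_req_pool, GFP_NOIO);
}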

> i.e NOFAIL scopes are not relevant outside the subsystem that sets
> it.  Hence we likely need helpers to clear and restore NOFAIL when
> we cross an allocation context boundaries. e.g. as we cross from
> filesystem to block layer in the IO stack via submit_bio(). Maybe
> they should be doing something like:
> 
> 	nofail_flags = memalloc_nofail_clear();

NOFAIL is not a scoped thing at all, period; it is very much a
_callsite_ specific thing, and it depends on whether that callsite has a
fallback.

The most obvious example being, as mentioned previously, mempools.

> > - NOWAIT - as said already, we need to make sure we're not turning an
> > allocation that relied on too-small-to-fail into a null pointer exception or
> > BUG_ON(!page).
> 
> Agreed. NOWAIT is removing allocation failure constraints and I
> don't think that can be made to work reliably. Error injection
> cannot prove the absence of errors  and so we can never be certain
> the code will always operate correctly and not crash when an
> unexepected allocation failure occurs.

You saying we don't know how to test code? Come on, that's just throwing
your hands up and giving up. We can write error injection tests that
cycle through each injection point, test them individually, and then
verify that they've been tested with coverage analysis.

Anyways, NOWAIT is no different here from NORETRY/RETRY_MAYFAIL. We need
to be able to handle allocation failures, period...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-12  2:06           ` Kent Overstreet
@ 2024-02-12  4:35             ` Dave Chinner
  2024-02-12 19:30               ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2024-02-12  4:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka (SUSE),
	Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 12:20:32PM +1100, Dave Chinner wrote:
> > On Thu, Feb 08, 2024 at 08:55:05PM +0100, Vlastimil Babka (SUSE) wrote:
> > > On 2/8/24 18:33, Michal Hocko wrote:
> > > > On Thu 08-02-24 17:02:07, Vlastimil Babka (SUSE) wrote:
> > > >> On 1/9/24 05:47, Dave Chinner wrote:
> > > >> > On Thu, Jan 04, 2024 at 09:17:16PM +0000, Matthew Wilcox wrote:
> > > >> 
> > > >> Your points and Kent's proposal of scoped GFP_NOWAIT [1] suggests to me this
> > > >> is no longer FS-only topic as this isn't just about converting to the scoped
> > > >> apis, but also how they should be improved.
> > > > 
> > > > Scoped GFP_NOFAIL context is slightly easier from the semantic POV than
> > > > scoped GFP_NOWAIT as it doesn't add a potentially unexpected failure
> > > > mode. It is still tricky to deal with GFP_NOWAIT requests inside the
> > > > NOFAIL scope because that makes it a non failing busy wait for an
> > > > allocation if we need to insist on scope NOFAIL semantic. 
> > > > 
> > > > On the other hand we can define the behavior similar to what you
> > > > propose with RETRY_MAYFAIL resp. NORETRY. Existing NOWAIT users should
> > > > better handle allocation failures regardless of the external allocation
> > > > scope.
> > > > 
> > > > Overriding that scoped NOFAIL semantic with RETRY_MAYFAIL or NORETRY
> > > > resembles the existing PF_MEMALLOC and GFP_NOMEMALLOC semantic and I do
> > > > not see an immediate problem with that.
> > > > 
> > > > Having more NOFAIL allocations is not great but if you need to
> > > > emulate those by implementing the nofail semantic outside of the
> > > > allocator then it is better to have those retries inside the allocator
> > > > IMO.
> > > 
> > > I see potential issues in scoping both the NOWAIT and NOFAIL
> > > 
> > > - NOFAIL - I'm assuming Dave is adding __GFP_NOFAIL to xfs allocations or
> > > adjacent layers where he knows they must not fail for his transaction. But
> > > could the scope affect also something else underneath that could fail
> > > without the failure propagating in a way that it affects xfs?
> > 
> > Memory allocaiton failures below the filesystem (i.e. in the IO
> > path) will fail the IO, and if that happens for a read IO within
> > a transaction then it will have the same effect as XFS failing a
> > memory allocation. i.e. it will shut down the filesystem.
> > 
> > The key point here is the moment we go below the filesystem we enter
> > into a new scoped allocation context with a guaranteed method of
> > returning errors: NOIO and bio errors.
> 
> Hang on, you're conflating NOIO to mean something completely different -
> NOIO means "don't recurse in reclaim", it does _not_ mean anything about
> what happens when the allocation fails,

Yes, I know that's what NOIO means. I'm not conflating it with
anything else.

> and in particular it definitely
> does _not_ mean that failing the allocation is going to result in an IO
> error.

Exactly. FS level NOFAIL contexts simply do not apply to NOIO
context functionality. NOIO contexts require different mechanisms to
guarantee forwards progress under memory pressure. They work
pretty well, and we don't want or need to perturb them by having
them inherit filesystem level NOFAIL semantics.

i.e. architecturally speaking, NOIO is a completely separate
allocation domain to NOFS.

> That's because in general most code in the IO path knows how to make
> effective use of biosets and mempools (which may take some work! you
> have to ensure that you're always able to make forward progress when
> memory is limited, and in particular that you don't double allocate from
> the same mempool if you're blocking the first allocation from
> completing/freeing).

Yes, I understand this, and that's my point: NOIO contexts tend to
be able to use mempools and other mechanisms to prevent memory
allocation failure, not NOFAIL.

The IO layers are request based, and that enables one-in, one-out
allocation pools that can guarantee single IO progress. That's all
the IO layers need to guarantee to the filesystems so that forwards
progress can always be made under memory pressure.

However, filesystems cannot guarantee "one in, one out" allocation
behaviour. A transaction can require a largely unbounded number of
memory allocations to succeed to make progress through to
completion, and so things like mempools -cannot be used- to prevent
memory allocation failures whilst providing a forwards progress
guarantee.

Hence a NOFAIL scope is useful at the filesystem layer for
filesystem objects to ensure forwards progress under memory
pressure, but it is completely unnecessary once we transition to the
IO layer, where forwards progress guarantees ensure memory allocation
failures don't impede progress.

IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
layers. The entry points to the block layer should transition the
task to NOIO context and restore the previous context on exit. Then
it becomes relatively trivial to apply context based filtering of
allocation behaviour....
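
With the existing scoped helpers that transition would look roughly
like the following at a block layer entry point (a sketch of the idea
only - submit_bio() does not do this today):

#include <linux/bio.h>
#include <linux/sched/mm.h>

/* Sketch: enter NOIO allocation context for the whole submission path. */
static void submit_bio_noio_scoped(struct bio *bio)
{
        unsigned int noio_flags = memalloc_noio_save();

        submit_bio(bio);

        /* Restore the caller's (e.g. NOFS/NOFAIL) allocation context. */
        memalloc_noio_restore(noio_flags);
}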

> > i.e NOFAIL scopes are not relevant outside the subsystem that sets
> > it.  Hence we likely need helpers to clear and restore NOFAIL when
> > we cross an allocation context boundaries. e.g. as we cross from
> > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > they should be doing something like:
> > 
> > 	nofail_flags = memalloc_nofail_clear();
> 
> NOFAIL is not a scoped thing at all, period; it is very much a
> _callsite_ specific thing, and it depends on whether that callsite has a
> fallback.

*cough*

As I've already stated, NOFAIL allocation has been scoped in XFS for
the past 20 years.

Every memory allocation inside a transaction *must* be NOFAIL unless
otherwise specified, because memory allocation failure inside a dirty
transaction is a fatal error. However, that scoping has never been
passed to the NOIO contexts below the filesystem - it's scoped
purely within the filesystem itself and doesn't pass on to other
subsystems the filesystem calls into.

> The most obvious example being, as mentioned previously, mempools.

Yes, they require one-in, one-out guarantees to avoid starvation and
ENOMEM situations. Which, as we've known since mempools were
invented, these guarantees cannot be provided by most filesystems.

> > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > BUG_ON(!page).
> > 
> > Agreed. NOWAIT is removing allocation failure constraints and I
> > don't think that can be made to work reliably. Error injection
> > cannot prove the absence of errors  and so we can never be certain
> > the code will always operate correctly and not crash when an
> > unexepected allocation failure occurs.
> 
> You saying we don't know how to test code?

Yes, that's exactly what I'm saying.

I'm also saying that designing algorithms that aren't fail safe is
poor design. If you get it wrong and nothing bad can happen as a
result, then the design is fine.

But if the result of missing something accidentally is that the
system is guaranteed to crash when that is hit, then failure is
guaranteed and no amount of testing will prevent that failure from
occurring.

And we suck at testing, so we absolutely need to design fail
safe algorithms and APIs...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-12  4:35             ` Dave Chinner
@ 2024-02-12 19:30               ` Kent Overstreet
  2024-02-12 22:07                 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2024-02-12 19:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vlastimil Babka (SUSE),
	Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Mon, Feb 12, 2024 at 03:35:33PM +1100, Dave Chinner wrote:
> On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> > That's because in general most code in the IO path knows how to make
> > effective use of biosets and mempools (which may take some work! you
> > have to ensure that you're always able to make forward progress when
> > memory is limited, and in particular that you don't double allocate from
> > the same mempool if you're blocking the first allocation from
> > completing/freeing).
> 
> Yes, I understand this, and that's my point: NOIO context tends to
> be able to use mempools and other mechanisms to prevent memory
> allocation failure, not NOFAIL.
> 
> The IO layers are request based and that enables one-in, one out
> allocation pools that can guarantee single IO progress. That's all
> the IO layers need to guarantee to the filesystems so that forwards
> progress can always be made until memory pressure.
> 
> However, filesystems cannot guarantee "one in, one out" allocation
> behaviour. A transaction can require a largely unbound number of
> memory allocations to succeed to make progress through to
> completion, and so things like mempools -cannot be used- to prevent
> memory allocation failures whilst providing a forwards progress
> guarantee.

I don't see that that's actually true. There's no requirement that
arbitrarily large IOs must be done atomically, within a single
transaction: there's been at most talk of eventually doing atomic writes
through the pagecache, but the people on that can't even finish atomic
writes through the block layer, so who knows when that'll happen.

I generally haven't been running into filesystem operations that require
an unbounded number of memory allocations (reflink is a bit of an
exception in the current bcachefs code, and even that is just a
limitation I could solve if I really wanted to...)

> Hence a NOFAIL scope if useful at the filesystem layer for
> filesystem objects to ensure forwards progress under memory
> pressure, but it is compeltely unnecessary once we transition to the
> IO layer where forwards progress guarantees ensure memory allocation
> failures don't impede progress.
> 
> IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
> layers. The entry points to the block layer should transition the
> task to NOIO context and restore the previous context on exit. Then
> it becomes relatively trivial to apply context based filtering of
> allocation behaviour....
> 
> > > i.e NOFAIL scopes are not relevant outside the subsystem that sets
> > > it.  Hence we likely need helpers to clear and restore NOFAIL when
> > > we cross an allocation context boundaries. e.g. as we cross from
> > > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > > they should be doing something like:
> > > 
> > > 	nofail_flags = memalloc_nofail_clear();
> > 
> > NOFAIL is not a scoped thing at all, period; it is very much a
> > _callsite_ specific thing, and it depends on whether that callsite has a
> > fallback.
> 
> *cough*
> 
> As I've already stated, NOFAIL allocation has been scoped in XFS for
> the past 20 years.
> 
> Every memory allocation inside a transaction *must* be NOFAIL unless
> otherwise specified because memory allocation inside a dirty
> transaction is a fatal error.

Say you start to incrementally mempoolify your allocations inside a
transaction - those mempools aren't going to do anything if there's a
scoped NOFAIL, and sorting that out is going to get messy fast.

> However, that scoping has never been
> passed to the NOIO contexts below the filesytsem - it's scoped
> purely within the filesystem itself and doesn't pass on to other
> subsystems the filesystem calls into.

How is that managed?
> 
> > The most obvious example being, as mentioned previously, mempools.
> 
> Yes, they require one-in, one-out guarantees to avoid starvation and
> ENOMEM situations. Which, as we've known since mempools were
> invented, these guarantees cannot be provided by most filesystems.
> 
> > > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > > BUG_ON(!page).
> > > 
> > > Agreed. NOWAIT is removing allocation failure constraints and I
> > > don't think that can be made to work reliably. Error injection
> > > cannot prove the absence of errors  and so we can never be certain
> > > the code will always operate correctly and not crash when an
> > > unexepected allocation failure occurs.
> > 
> > You saying we don't know how to test code?
> 
> Yes, that's exactly what I'm saying.
> 
> I'm also saying that designing algorithms that aren't fail safe is
> poor design. If you get it wrong and nothing bad can happen as a
> result, then the design is fine.
> 
> But if the result of missing something accidentally is that the
> system is guaranteed to crash when that is hit, then failure is
> guaranteed and no amount of testing will prevent that failure from
> occurring.
> 
> And we suck at testing, so we absolutely need to design fail
> safe algorithms and APIs...

GFP_NOFAIL doesn't magically make your algorithm fail-safe, though.

Suren and I are trying to get memory allocation profiling into 6.9, and
I'll be posting the improved fault injection immediately afterwards -
this is what I used to use to make sure every allocation failure path in
the bcachefs predecessor was tested. Hopefully that'll make things
easier...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
  2024-02-12 19:30               ` Kent Overstreet
@ 2024-02-12 22:07                 ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-02-12 22:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka (SUSE),
	Michal Hocko, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, Kent Overstreet

On Mon, Feb 12, 2024 at 02:30:02PM -0500, Kent Overstreet wrote:
> On Mon, Feb 12, 2024 at 03:35:33PM +1100, Dave Chinner wrote:
> > On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> > > That's because in general most code in the IO path knows how to make
> > > effective use of biosets and mempools (which may take some work! you
> > > have to ensure that you're always able to make forward progress when
> > > memory is limited, and in particular that you don't double allocate from
> > > the same mempool if you're blocking the first allocation from
> > > completing/freeing).
> > 
> > Yes, I understand this, and that's my point: NOIO context tends to
> > be able to use mempools and other mechanisms to prevent memory
> > allocation failure, not NOFAIL.
> > 
> > The IO layers are request based and that enables one-in, one out
> > allocation pools that can guarantee single IO progress. That's all
> > the IO layers need to guarantee to the filesystems so that forwards
> > progress can always be made until memory pressure.
> > 
> > However, filesystems cannot guarantee "one in, one out" allocation
> > behaviour. A transaction can require a largely unbound number of
> > memory allocations to succeed to make progress through to
> > completion, and so things like mempools -cannot be used- to prevent
> > memory allocation failures whilst providing a forwards progress
> > guarantee.
> 
> I don't see that that's actually true. There's no requirement that
> arbitrarily large IOs must be done atomically, within a single
> transaction:

*cough*

Metadata IO needs to be set up, issued and completed before the
transaction can make progress, and then the transaction will hold
onto that metadata until it is committed and unlocked.

That means we hold every btree buffer we walk along a path in the
transaction, and if the cache is cold it means we might need to
allocate and read dozens of metadata buffers in a single
transaction.

> there's been at most talk of eventually doing atomic writes
> through the pagecache, but the people on that can't even finish atomic
> writes through the block layer, so who knows when that'll happen.

What's atomic data writes got to do with metadata transaction
contexts?

> I generally haven't been running into filesyste operations that require
> an unbounded number of memory allocations (reflink is a bit of an
> exception in the current bcachefs code, and even that is just a
> limitation I could solve if I really wanted to...)

Step outside of bcachefs for a minute, Kent. Not everything works
the same way or has the same constraints and/or freedoms as the
bcachefs implementation....

> > Hence a NOFAIL scope if useful at the filesystem layer for
> > filesystem objects to ensure forwards progress under memory
> > pressure, but it is compeltely unnecessary once we transition to the
> > IO layer where forwards progress guarantees ensure memory allocation
> > failures don't impede progress.
> > 
> > IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
> > layers. The entry points to the block layer should transition the
> > task to NOIO context and restore the previous context on exit. Then
> > it becomes relatively trivial to apply context based filtering of
> > allocation behaviour....
> > 
> > > > i.e NOFAIL scopes are not relevant outside the subsystem that sets
> > > > it.  Hence we likely need helpers to clear and restore NOFAIL when
> > > > we cross an allocation context boundaries. e.g. as we cross from
> > > > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > > > they should be doing something like:
> > > > 
> > > > 	nofail_flags = memalloc_nofail_clear();
> > > 
> > > NOFAIL is not a scoped thing at all, period; it is very much a
> > > _callsite_ specific thing, and it depends on whether that callsite has a
> > > fallback.
> > 
> > *cough*
> > 
> > As I've already stated, NOFAIL allocation has been scoped in XFS for
> > the past 20 years.
> > 
> > Every memory allocation inside a transaction *must* be NOFAIL unless
> > otherwise specified because memory allocation inside a dirty
> > transaction is a fatal error.
> 
> Say you start to incrementally mempoolify your allocations inside a
> transaction - those mempools aren't going to do anything if there's a
> scoped NOFAIL, and sorting that out is going to get messy fast.

How do you mempoolify something that can have thousands of
concurrent contexts with in-flight objects across multiple
filesystems that might get stashed in an LRU rather than freed when
finished with? Not to mention that each context has an unknown
demand on the mempool before it can complete and return objects to
the mempool?

We talked about this a decade ago at LSFMM (2014, IIRC) with the MM
developers and nothing about mempools has changed since.

> > However, that scoping has never been
> > passed to the NOIO contexts below the filesytsem - it's scoped
> > purely within the filesystem itself and doesn't pass on to other
> > subsystems the filesystem calls into.
> 
> How is that managed?

Our own internal memory allocation wrappers. Go look at what remains
in fs/xfs/kmem.c. See the loop there in kmem_alloc()? It's
guaranteeing NOFAIL behaviour unless KM_MAYFAIL is passed to the
allocation. Look at xlog_kvmalloc() - same thing, except it is
always run within transaction context (the "xlog" prefix is a
giveaway) and so will block until allocation succeeds.

IOWs, we scoped everything by having our own internal allocation
wrappers that never fail. In removing these wrappers (which is where
my "scoped NOFAIL" comments in this thread originated), the
proliferation of __GFP_NOFAIL annotations across XFS meant we went
from pretty much zero usage of __GFP_NOFAIL to almost a hundred
allocation sites annotated with __GFP_NOFAIL. And it adds a
maintenance landmine for us - we now have to ensure that all future
allocations within a transaction scope are marked __GFP_NOFAIL.

Hence I'm looking for ways to move this NOFAIL scoping into the
generic memory allocation code to replace the scoping we currently
have via subsystem-specific allocation wrappers.
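
For reference, the shape of that kind of wrapper is roughly as follows
(a sketch in the spirit of the old fs/xfs/kmem.c loop, not a verbatim
copy):

#include <linux/slab.h>
#include <linux/sched/mm.h>

/* Sketch: retry until success unless the caller explicitly allows failure. */
static void *kmem_alloc_sketch(size_t size, gfp_t gfp, bool may_fail)
{
        void *ptr;

        do {
                ptr = kmalloc(size, gfp | __GFP_NOWARN);
                if (ptr || may_fail)
                        return ptr;
                /* Back off before retrying, as the allocator does for __GFP_NOFAIL. */
                memalloc_retry_wait(gfp);
        } while (1);
}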

> > > > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > > > BUG_ON(!page).
> > > > 
> > > > Agreed. NOWAIT is removing allocation failure constraints and I
> > > > don't think that can be made to work reliably. Error injection
> > > > cannot prove the absence of errors  and so we can never be certain
> > > > the code will always operate correctly and not crash when an
> > > > unexepected allocation failure occurs.
> > > 
> > > You saying we don't know how to test code?
> > 
> > Yes, that's exactly what I'm saying.
> > 
> > I'm also saying that designing algorithms that aren't fail safe is
> > poor design. If you get it wrong and nothing bad can happen as a
> > result, then the design is fine.
> > 
> > But if the result of missing something accidentally is that the
> > system is guaranteed to crash when that is hit, then failure is
> > guaranteed and no amount of testing will prevent that failure from
> > occurring.
> > 
> > And we suck at testing, so we absolutely need to design fail
> > safe algorithms and APIs...
> 
> GFP_NOFAIL dosen't magically make your algorithm fail safe, though.

I never said it did - this part of the conversation was about the
failure prone design of proposed -NOWAIT- scoping, not about
trying to codify a generic mechanism for scoped behaviour we've been
using successfully for the past 20 years...

> Suren and I are trying to get memory allocation profiling into 6.9, and
> I'll be posting the improved fault injection immediately afterwards -
> this is what I used to use to make sure every allocation failure path in
> the bcachefs predecessor was tested. Hopefully that'll make things
> easier...

That all sounds good, but after a recent spate of "CI and
post-integration testing didn't uncover fs bugs that fstests
reproduced until after tested kernels were released to test systems",
I have little confidence in the ability of larger QA organisations,
let alone individuals, to test filesystem code adequately when they
are constrained either by time or resources.

The fact of the matter is that we are all constrained by time
and resources. Hence adding more new testing methods - which add to
the time and resources needed to validate new code and backports of
fixes, and to the test matrix overhead - does nothing to improve that
situation.

We need to start designing our code in a way that doesn't require
extensive testing to validate it as correct. If the only way to
validate new code is correct is via stochastic coverage via error
injection, then that is a clear sign we've made poor design choices
along the way.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-02-12 22:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-04 21:17 [LSF/MM/BPF TOPIC] Removing GFP_NOFS Matthew Wilcox
2024-01-05 10:13 ` Viacheslav Dubeyko
2024-01-05 10:26   ` [Lsf-pc] " Jan Kara
2024-01-05 14:17     ` Viacheslav Dubeyko
2024-01-05 14:35   ` Vlastimil Babka (SUSE)
2024-01-05 10:57 ` [Lsf-pc] " Jan Kara
2024-01-08 11:47   ` Johannes Thumshirn
2024-01-08 17:39     ` David Sterba
2024-01-09  7:43       ` Johannes Thumshirn
2024-01-09 22:23         ` Dave Chinner
2024-01-09 15:47     ` Luis Henriques
2024-01-09 18:04       ` Johannes Thumshirn
2024-01-08  6:39 ` Dave Chinner
2024-01-09  4:47 ` Dave Chinner
2024-02-08 16:02   ` Vlastimil Babka (SUSE)
2024-02-08 17:33     ` Michal Hocko
2024-02-08 19:55       ` Vlastimil Babka (SUSE)
2024-02-08 22:45         ` Kent Overstreet
2024-02-12  1:20         ` Dave Chinner
2024-02-12  2:06           ` Kent Overstreet
2024-02-12  4:35             ` Dave Chinner
2024-02-12 19:30               ` Kent Overstreet
2024-02-12 22:07                 ` Dave Chinner
2024-01-09 22:44 ` Dave Chinner
