[1/2] mm: clarify __GFP_MEMALLOC usage

Message ID 20200403083543.11552-2-mhocko@kernel.org
State In Next
Commit eabd05e1a154ddd40daf6c25458f5d180b94e819
Series
  • mm: few refinements to gfp flags documentation

Commit Message

Michal Hocko April 3, 2020, 8:35 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

It seems that the existing documentation is not explicit enough about the
expected usage and potential risks. While it calls out
that users have to free memory when using this flag, it is not really
apparent that users have to be careful not to deplete memory reserves
and that they should implement some sort of throttling wrt. the freeing
process.

This is partly based on Neil's explanation [1].

[1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h | 3 +++
 1 file changed, 3 insertions(+)

Comments

David Rientjes April 3, 2020, 7:41 p.m. UTC | #1
On Fri, 3 Apr 2020, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> It seems that the existing documentation is not explicit enough about the
> expected usage and potential risks. While it calls out
> that users have to free memory when using this flag, it is not really
> apparent that users have to be careful not to deplete memory reserves
> and that they should implement some sort of throttling wrt. the freeing
> process.
> 
> This is partly based on Neil's explanation [1].
> 
> [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/gfp.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..e3ab1c0d9140 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -110,6 +110,9 @@ struct vm_area_struct;
>   * the caller guarantees the allocation will allow more memory to be freed
>   * very shortly e.g. process exiting or swapping. Users either should
>   * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> + * Users of this flag have to be extremely careful to not deplete the reserve
> + * completely and implement a throttling mechanism which controls the consumption
> + * of the reserve based on the amount of freed memory.
>   *
>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.

Hmm, any guidance that we can offer to users of this flag that aren't 
aware of __GFP_MEMALLOC internals?  If I were to read this and not be 
aware of the implementation, I would ask "how do I know when I'm at risk 
of depleting this reserve" especially since the amount of reserve is 
controlled by sysctl.  How do I know when I'm risking a depletion of this 
shared reserve?
NeilBrown April 3, 2020, 9:23 p.m. UTC | #2
On Fri, Apr 03 2020, David Rientjes wrote:

> On Fri, 3 Apr 2020, Michal Hocko wrote:
>
>> From: Michal Hocko <mhocko@suse.com>
>> 
>> It seems that the existing documentation is not explicit enough about the
>> expected usage and potential risks. While it calls out
>> that users have to free memory when using this flag, it is not really
>> apparent that users have to be careful not to deplete memory reserves
>> and that they should implement some sort of throttling wrt. the freeing
>> process.
>> 
>> This is partly based on Neil's explanation [1].
>> 
>> [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  include/linux/gfp.h | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index e5b817cb86e7..e3ab1c0d9140 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -110,6 +110,9 @@ struct vm_area_struct;
>>   * the caller guarantees the allocation will allow more memory to be freed
>>   * very shortly e.g. process exiting or swapping. Users either should
>>   * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>> + * Users of this flag have to be extremely careful to not deplete the reserve
>> + * completely and implement a throttling mechanism which controls the consumption
>> + * of the reserve based on the amount of freed memory.
>>   *
>>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>
> Hmm, any guidance that we can offer to users of this flag that aren't 
> aware of __GFP_MEMALLOC internals?  If I were to read this and not be 
> aware of the implementation, I would ask "how do I know when I'm at risk 
> of depleting this reserve" especially since the amount of reserve is 
> controlled by sysctl.  How do I know when I'm risking a depletion of this 
> shared reserve?

"how do I know when I'm at risk of depleting this reserve" is definitely
the wrong question to be asking.  The questions to ask are:
- how little memory do I need to ensure forward progress?
- how quick will that forward progress be?

In the ideal case a small allocation will be all that is needed in order
for that allocation plus another page to be freed "quickly", in time
governed only by throughput to some device.  In that case you probably
don't need to worry about rate limiting.

The reason I brought up ratelimiting is that RCU is slow.  You can get
quite a lot of memory caught up in the kfree-rcu lists.  That's not much
of a problem for normal memory, but it might be for the more limited
reserves.

The other difficulty with the kfree_rcu case is that we have no idea
how many users there will be, so we cannot realistically model how long
the queue might get.  Compare with NFS swap-out, where the only user is
the VM swapping memory which (I think?) already tries to pace writeout
with the speed of the device (or is that just writeback...).  I'm
clearly not sure of the details but it is a more constrained environment
so it is more predictable.

In many cases, preallocating a private reserve is better than using
GFP_MEMALLOC.  That is what mempools provide and they are very effective
(though often way over-allocated*).
GFP_MEMALLOC was added because swap-over-NFS requires lots of different
allocations (transmit headers, receive buffers, possible routing changes
etc), many of them in the network layer which is very sensitive
to latency (and mempools require a spinlock to get the reserves).

Maybe the documentation should say.
 Don't use this - use a mempool.  Here be dragons.

I'm not sure you can really say anything more useful without writing a
long essay.

NeilBrown

(*) mempool sizes should not exceed 2 without measurements demonstrating
that more provides better throughput. Many are 2, (BIO_POOL_SIZE is 2,
which is perfect) but some aren't.
 #define DRBD_MIN_POOL_PAGES       128
way too big!
 #define MIN_IOS 256
even bigger!
 mempool_create_page_pool(2 * (F2FS_IO_SIZE(sbi) - 1), 0);
This is really wrong.  If the IO size is relevant, then each object in
the pool needs to be that size.  Having that many objects in the pool
doesn't mean anything useful.
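
In the spirit of the mempool suggestion above, here is a minimal userspace
sketch of a private, preallocated reserve. It is only a model: the toy_*
names, the "pressure" flag and the use of malloc() are assumptions made for
illustration, and this is not the kernel's mempool implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_MIN 2	/* hypothetical; mirrors the "2 is enough" advice */

struct toy_pool {
	void *reserve[TOY_POOL_MIN];	/* elements set aside up front */
	int nr_reserved;		/* how many are currently in reserve */
	size_t elem_size;
};

/* Stand-in for the normal allocator; 'pressure' simulates exhaustion. */
static void *toy_normal_alloc(size_t size, bool pressure)
{
	return pressure ? NULL : malloc(size);
}

/* Fill the private reserve while memory is still plentiful. */
static int toy_pool_init(struct toy_pool *p, size_t elem_size)
{
	p->elem_size = elem_size;
	p->nr_reserved = 0;
	while (p->nr_reserved < TOY_POOL_MIN) {
		void *e = malloc(elem_size);

		if (!e) {
			while (p->nr_reserved > 0)
				free(p->reserve[--p->nr_reserved]);
			return -1;
		}
		p->reserve[p->nr_reserved++] = e;
	}
	return 0;
}

/* Try the normal allocator first; use the reserve only when it fails. */
static void *toy_pool_alloc(struct toy_pool *p, bool pressure)
{
	void *e = toy_normal_alloc(p->elem_size, pressure);

	if (e)
		return e;
	if (p->nr_reserved > 0)
		return p->reserve[--p->nr_reserved];
	return NULL;	/* reserve empty: caller must wait for a free */
}

/* Freed elements refill the reserve before going back to the system. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
	assert(e != NULL);
	if (p->nr_reserved < TOY_POOL_MIN)
		p->reserve[p->nr_reserved++] = e;
	else
		free(e);
}

/* Release whatever is still held in the reserve. */
static void toy_pool_destroy(struct toy_pool *p)
{
	while (p->nr_reserved > 0)
		free(p->reserve[--p->nr_reserved]);
}
```

The point of the model is that the reserve belongs to one user only, so its
size can be reasoned about locally. That is exactly what a shared reserve
reached through __GFP_MEMALLOC cannot offer.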
Michal Hocko April 6, 2020, 7:01 a.m. UTC | #3
On Sat 04-04-20 08:23:45, Neil Brown wrote:
> On Fri, Apr 03 2020, David Rientjes wrote:
[...]
> > Hmm, any guidance that we can offer to users of this flag that aren't 
> > aware of __GFP_MEMALLOC internals?  If I were to read this and not be 
> > aware of the implementation, I would ask "how do I know when I'm at risk 
> > of depleting this reserve" especially since the amount of reserve is 
> > controlled by sysctl.  How do I know when I'm risking a depletion of this 
> > shared reserve?
> 
> "how do I know when I'm at risk of depleting this reserve" is definitely
> the wrong question to be asking.  The questions to ask are:
> - how little memory do I need to ensure forward progress?
> - how quick will that forward progress be?

Absolutely agreed. The total amount of reserves will always depend on
all other users. Unless they are perfectly coordinated, which is not the
case.

> In the ideal case a small allocation will be all that is needed in order
> for that allocation plus another page to be freed "quickly", in time
> governed only by throughput to some device.  In that case you probably
> don't need to worry about rate limiting.

Right but I wouldn't expect this to be a general usage pattern of this
flag. "Allocate to free memory" suggests this would be a part of the
memory reclaim process and that really needs some form of rate
limiting. Be it the reclaim itself directly or some other mechanism if
this happens from a different context.

> The reason I brought up ratelimiting is that RCU is slow.  You can get
> quite a lot of memory caught up in the kfree-rcu lists.  That's not much
> of a problem for normal memory, but it might be for the more limited
> reserves.

Right.

> The other difficulty with the kfree_rcu case is that we have no idea
> how many users there will be, so we cannot realistically model how long
> the queue might get.  Compare with NFS swap-out, where the only user is
> the VM swapping memory which (I think?) already tries to pace writeout
> with the speed of the device (or is that just writeback...).  I'm
> clearly not sure of the details but it is a more constrained environment
> so it is more predictable.

Mel explained this http://lkml.kernel.org/r/20200401131426.GN3772@suse.de

> In many cases, preallocating a private reserve is better than using
> GFP_MEMALLOC.  That is what mempools provide and they are very effective
> (though often way over-allocated*).
> GFP_MEMALLOC was added because swap-over-NFS requires lots of different
> allocations (transmit headers, receive buffers, possible routing changes
> etc), many of them in the network layer which is very sensitive
> to latency (and mempools require a spinlock to get the reserves).

Yes.

> Maybe the documentation should say.
>  Don't use this - use a mempool.  Here be dragons.

OK, this looks like a good idea.
 
> I'm not sure you can really say anything more useful without writing a
> long essay.

Yes and I am not sure it would be really more helpful than confusing.
What do you think about this updated patch?

From 6c90b0a19a07c87d24ad576e69b33c6e19c2f9a2 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 1 Apr 2020 14:00:56 +0200
Subject: [PATCH] mm: clarify __GFP_MEMALLOC usage

It seems that the existing documentation is not explicit enough about the
expected usage and potential risks. While it calls out
that users have to free memory when using this flag, it is not really
apparent that users have to be careful not to deplete memory reserves
and that they should implement some sort of throttling wrt. the freeing
process.

This is partly based on Neil's explanation [1].

Let's also call out that a pre-allocated pool allocator should be
considered.

[1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..9cacef1a3ee0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -110,6 +110,11 @@ struct vm_area_struct;
  * the caller guarantees the allocation will allow more memory to be freed
  * very shortly e.g. process exiting or swapping. Users either should
  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
+ * Users of this flag have to be extremely careful to not deplete the reserve
+ * completely and implement a throttling mechanism which controls the consumption
+ * of the reserve based on the amount of freed memory.
+ * Usage of a pre-allocated pool (e.g. mempool) should be always considered before
+ * using this flag.
  *
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
John Hubbard April 6, 2020, 7:02 p.m. UTC | #4
On 4/6/20 12:01 AM, Michal Hocko wrote:
...
>  From 6c90b0a19a07c87d24ad576e69b33c6e19c2f9a2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 1 Apr 2020 14:00:56 +0200
> Subject: [PATCH] mm: clarify __GFP_MEMALLOC usage
> 
> It seems that the existing documentation is not explicit enough about the
> expected usage and potential risks. While it calls out
> that users have to free memory when using this flag, it is not really
> apparent that users have to be careful not to deplete memory reserves
> and that they should implement some sort of throttling wrt. the freeing
> process.
> 
> This is partly based on Neil's explanation [1].
> 
> Let's also call out that a pre allocated pool allocator should be
> considered.
> 
> [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   include/linux/gfp.h | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..9cacef1a3ee0 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -110,6 +110,11 @@ struct vm_area_struct;
>    * the caller guarantees the allocation will allow more memory to be freed
>    * very shortly e.g. process exiting or swapping. Users either should
>    * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> + * Users of this flag have to be extremely careful to not deplete the reserve
> + * completely and implement a throttling mechanism which controls the consumption
> + * of the reserve based on the amount of freed memory.
> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered before
> + * using this flag.
>    *
>    * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>    * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
> 

Hi Michal and all,

How about using approximately this wording instead? I found Neil's wording to be
especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index be2754841369..c247a911d8c7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -111,6 +111,15 @@ struct vm_area_struct;
   * very shortly e.g. process exiting or swapping. Users either should
   * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
   *
+ * To be extra clear: users of __GFP_MEMALLOC must be working to free other
+ * memory, and that other memory needs to be freed "soon"; specifically, before
+ * the reserve is exhausted. This generally implies a throttling mechanism that
+ * balances the amount of __GFP_MEMALLOC memory used against the amount that the
+ * caller is about to free.
+ *
+ * Usage of a pre-allocated pool (e.g. mempool) should be always considered
+ * before using this flag.
+ *
   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
   */


thanks,
David Rientjes April 6, 2020, 11:32 p.m. UTC | #5
On Mon, 6 Apr 2020, John Hubbard wrote:

> Hi Michal and all,
> 
> How about using approximately this wording instead? I found Neil's wording to be
> especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index be2754841369..c247a911d8c7 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -111,6 +111,15 @@ struct vm_area_struct;
>   * very shortly e.g. process exiting or swapping. Users either should
>   * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>   *
> + * To be extra clear: users of __GFP_MEMALLOC must be working to free other
> + * memory, and that other memory needs to be freed "soon"; specifically, before
> + * the reserve is exhausted. This generally implies a throttling mechanism that
> + * balances the amount of __GFP_MEMALLOC memory used against the amount that the
> + * caller is about to free.
> + *
> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> + * before using this flag.
> + *
>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>   */

I agree this looks better, but if a developer is reading this and is 
unfamiliar with the implementation of memory reserves or __GFP_MEMALLOC, 
how do they take any action that memory allocated with this bit is freed 
before the reserve is exhausted?

It seems like it's simply saying "don't allocate a lot of this before you 
free it."  That may be very well how it goes, but any discussion of 
depletion of the reserve seems to imply we'd want to quantify it and I 
agree that's not what we want the user to do.

So maybe simply state that reserves can be extremely limited and thus it's 
best to assume there is very little reserve left?
John Hubbard April 6, 2020, 11:40 p.m. UTC | #6
On 4/6/20 4:32 PM, David Rientjes wrote:
> On Mon, 6 Apr 2020, John Hubbard wrote:
> 
>> Hi Michal and all,
>>
>> How about using approximately this wording instead? I found Neil's wording to be
>> especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index be2754841369..c247a911d8c7 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -111,6 +111,15 @@ struct vm_area_struct;
>>    * very shortly e.g. process exiting or swapping. Users either should
>>    * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>>    *
>> + * To be extra clear: users of __GFP_MEMALLOC must be working to free other
>> + * memory, and that other memory needs to be freed "soon"; specifically, before
>> + * the reserve is exhausted. This generally implies a throttling mechanism that
>> + * balances the amount of __GFP_MEMALLOC memory used against the amount that the
>> + * caller is about to free.
>> + *
>> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered
>> + * before using this flag.
>> + *
>>    * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>>    * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>>    */
> 
> I agree this looks better, but if a developer is reading this and is
> unfamiliar with the implementation of memory reserves or __GFP_MEMALLOC,
> how do they take any action that memory allocated with this bit is freed
> before the reserve is exhausted?
> 

In order to make it even possible to write documentation, I'd like to constrain
what "a developer" means a bit more. Someone who decides to use this
flag will at least get a clear indication of what's involved, and I would
expect that if it's still not clear, they would take a slightly deeper look.

So "a developer unfamiliar with the implementation of memory reserves" is
probably going to get into trouble if they remain unfamiliar. This documentation
should inspire them to learn what they need to learn.


> It seems like it's simply saying "don't allocate a lot of this before you
> free it."  That may be very well how it goes, but any discussion of
> depletion of the reserve seems to imply we'd want to quantify it and I
> agree that's not what we want the user to do.
> 
> So maybe simply state that reserves can be extremely limited and thus it's
> best to assume there is very little reserve left?
> 

Well...but now we're sort of back to the original documentation anyway. I
like the idea of putting in a bit about "you're supposed to be doing something
that frees up memory" in the comments, because it is a lot more concrete.

Because it's pretty hard to figure out what "be careful, there's not much
left" really means, in terms of code that one writes. :)

thanks,
NeilBrown April 7, 2020, 1 a.m. UTC | #7
On Mon, Apr 06 2020, John Hubbard wrote:

> On 4/6/20 12:01 AM, Michal Hocko wrote:
> ...
>>  From 6c90b0a19a07c87d24ad576e69b33c6e19c2f9a2 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Wed, 1 Apr 2020 14:00:56 +0200
>> Subject: [PATCH] mm: clarify __GFP_MEMALLOC usage
>> 
>> It seems that the existing documentation is not explicit enough about the
>> expected usage and potential risks. While it calls out
>> that users have to free memory when using this flag, it is not really
>> apparent that users have to be careful not to deplete memory reserves
>> and that they should implement some sort of throttling wrt. the freeing
>> process.
>> 
>> This is partly based on Neil's explanation [1].
>> 
>> Let's also call out that a pre allocated pool allocator should be
>> considered.
>> 
>> [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
>>   include/linux/gfp.h | 5 +++++
>>   1 file changed, 5 insertions(+)
>> 
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index e5b817cb86e7..9cacef1a3ee0 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -110,6 +110,11 @@ struct vm_area_struct;
>>    * the caller guarantees the allocation will allow more memory to be freed
>>    * very shortly e.g. process exiting or swapping. Users either should
>>    * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>> + * Users of this flag have to be extremely careful to not deplete the reserve
>> + * completely and implement a throttling mechanism which controls the consumption
>> + * of the reserve based on the amount of freed memory.
>> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered before
>> + * using this flag.

I think this version is pretty good.

>>    *
>>    * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>>    * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>> 
>
> Hi Michal and all,
>
> How about using approximately this wording instead? I found Neil's wording to be
> especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index be2754841369..c247a911d8c7 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -111,6 +111,15 @@ struct vm_area_struct;
>    * very shortly e.g. process exiting or swapping. Users either should
>    * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>    *
> + * To be extra clear: users of __GFP_MEMALLOC must be working to free other
> + * memory, and that other memory needs to be freed "soon"; specifically, before
> + * the reserve is exhausted. This generally implies a throttling mechanism that
> + * balances the amount of __GFP_MEMALLOC memory used against the amount that the
> + * caller is about to free.

I don't like this change. "balances the amount ... is about to free"
doesn't say anything about time, so it doesn't seem to be about throttling.

I think it is hard to write rules because the rules are a bit spongey.

With mempools, we have a nice clear rule.  When you allocate from a
mempool you must have a clear path to freeing that allocation which will
not block on memory allocation except from a subordinate mempool.  This
implies a partial ordering between mempools.  When you have layered
block devices the path through the layers from filesystem down to
hardware defines the order.  It isn't enforced, but it is quite easy to
reason about.

GFP_MEMALLOC effectively provides multiple mempools.  So it could
theoretically deadlock if multiple long dependency chains
happened. i.e. if 1000 threads each make a GFP_MEMALLOC allocation and
then need to make another one before the first can be freed - then you
hit problems.  There is no formal way to guarantee that this doesn't
happen.  We just say "be gentle" and minimize the users of this flag,
and keep more memory in reserve than we really need.
Note that 'threads' here might not be Linux tasks.  If you have an IO
request that proceeds asynchronously, moving from queue to queue and
being handled by different tasks, then each one is a "thread" for the
purpose of understanding mem-alloc dependency.
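
The dependency-chain hazard can be put into numbers with a deliberately
crude model. The assumptions are not taken from the kernel: every
allocation costs one unit of a single shared reserve, and each "thread"
frees its units only after it has obtained all of them. The function name
and parameters are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Toy model: can 'nr_chains' independent chains, each needing
 * 'allocs_per_chain' units before it frees anything, get stuck on a
 * shared reserve of 'reserve_units'?
 *
 * A stuck state exists when the whole reserve can be handed out so that
 * no chain holds its full set, i.e. when
 *	nr_chains * (allocs_per_chain - 1) >= reserve_units.
 */
static bool reserve_can_deadlock(unsigned long long reserve_units,
				 unsigned long long nr_chains,
				 unsigned long long allocs_per_chain)
{
	unsigned long long partial;

	if (nr_chains == 0 || allocs_per_chain == 0)
		return false;	/* nobody allocates anything */

	assert(allocs_per_chain >= 1);
	partial = allocs_per_chain - 1;	/* units held just short of done */

	/* Guard the multiplication against overflow. */
	if (partial != 0 && nr_chains > ULLONG_MAX / partial)
		return true;

	return nr_chains * partial >= reserve_units;
}
```

In this model the 1000-thread example (each holding one allocation and
needing a second) can get stuck whenever the reserve holds 1000 units or
fewer, while chains that need only a single allocation never do as long as
the reserve is not empty. Real reserves are not counted in uniform units,
so this only illustrates why the number of concurrent chains matters.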

So maybe what I really should focus on is not how quickly things happen,
but how many happen concurrently.  The idea of throttling is to allow
previous requests to complete before we start too many more.

With Swap-over-NFS, some of the things that might need to be allocated
are routing table entries.  These scale with the number of NFS servers
rather than the number of IO requests, so they are not going to cause
concurrency problems.
We also need memory to store replies, but these never exceed the number
of pending requests, so there is limited concurrency there.
NFS can send a lot of requests in parallel, but the main limit is the
RPC "slot table" and while that grows dynamically, it does so with
GFP_NOFS, so it can block or fail (I wonder if that should explicitly
disable the use of the reserves).

So there is a limit on concurrency imposed by non-GFP_MEMALLOC allocations.

So ... maybe the documentation should say that boundless concurrency of
allocations (i.e. one module allocating a boundless number of times
before previous allocations are freed) must be avoided.

NeilBrown



> + *
> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> + * before using this flag.
> + *
>    * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>    * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>    */
>
>
> thanks,
> -- 
> John Hubbard
> NVIDIA
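
One way to read the "avoid boundless concurrency" rule above is as a cap on
how many requests may hold reserve memory at once. The following is a
minimal userspace sketch using C11 atomics; the reserve_gate names and the
idea of a fixed limit are assumptions for illustration and do not describe
any existing kernel interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct reserve_gate {
	atomic_uint inflight;	/* requests currently holding reserve memory */
	unsigned int limit;	/* hard bound on concurrent holders */
};

static void reserve_gate_init(struct reserve_gate *g, unsigned int limit)
{
	atomic_init(&g->inflight, 0);
	g->limit = limit;
}

/*
 * Claim a slot before allocating from the reserve. Returns false when the
 * bound is reached; the caller then has to wait for an earlier request to
 * complete instead of starting a new one.
 */
static bool reserve_gate_try_enter(struct reserve_gate *g)
{
	unsigned int cur = atomic_load(&g->inflight);

	while (cur < g->limit) {
		/* On failure 'cur' is updated to the current counter value. */
		if (atomic_compare_exchange_weak(&g->inflight, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Give the slot back once the request's memory has been freed. */
static void reserve_gate_exit(struct reserve_gate *g)
{
	unsigned int prev = atomic_fetch_sub(&g->inflight, 1);

	assert(prev > 0);	/* exit without a matching enter is a bug */
	(void)prev;
}
```

With such a gate the worst-case demand on the reserve is the limit times the
per-request footprint, which is at least something that can be reasoned
about. Choosing the limit remains a judgment call.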
John Hubbard April 7, 2020, 1:21 a.m. UTC | #8
On 4/6/20 6:00 PM, NeilBrown wrote:
...
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index e5b817cb86e7..9cacef1a3ee0 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -110,6 +110,11 @@ struct vm_area_struct;
>>>     * the caller guarantees the allocation will allow more memory to be freed
>>>     * very shortly e.g. process exiting or swapping. Users either should
>>>     * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>>> + * Users of this flag have to be extremely careful to not deplete the reserve
>>> + * completely and implement a throttling mechanism which controls the consumption
>>> + * of the reserve based on the amount of freed memory.
>>> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered before
>>> + * using this flag.
> 
> I think this version is pretty good.
> 
>>>     *
>>>     * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>>>     * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>>>
>>
>> Hi Michal and all,
>>
>> How about using approximately this wording instead? I found Neil's wording to be
>> especially helpful so I mixed it in. (Also fixed a couple of slight 80-col overruns.)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index be2754841369..c247a911d8c7 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -111,6 +111,15 @@ struct vm_area_struct;
>>     * very shortly e.g. process exiting or swapping. Users either should
>>     * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>>     *
>> + * To be extra clear: users of __GFP_MEMALLOC must be working to free other
>> + * memory, and that other memory needs to be freed "soon"; specifically, before
>> + * the reserve is exhausted. This generally implies a throttling mechanism that
>> + * balances the amount of __GFP_MEMALLOC memory used against the amount that the
>> + * caller is about to free.
> 
> I don't like this change. "balances the amount ... is about to free"
> doesn't say anything about time, so it doesn't seem to be about throttling.
> 
> I think it is hard to write rules because the rules are a bit spongey.
> 
> With mempools, we have a nice clear rule.  When you allocate from a
> mempool you must have a clear path to freeing that allocation which will
> not block on memory allocation except from a subordinate mempool.  This
> implies a partial ordering between mempools.  When you have layered
> block devices the path through the layers from filesystem down to
> hardware defines the order.  It isn't enforced, but it is quite easy to
> reason about.
> 
> GFP_MEMALLOC effectively provides multiple mempools.  So it could
> theoretically deadlock if multiple long dependency chains
> happened. i.e. if 1000 threads each make a GFP_MEMALLOC allocation and
> then need to make another one before the first can be freed - then you
> hit problems.  There is no formal way to guarantee that this doesn't
> happen.  We just say "be gentle" and minimize the users of this flag,
> and keep more memory in reserve than we really need.
> Note that 'threads' here might not be Linux tasks.  If you have an IO
> request that proceeds asynchronously, moving from queue to queue and
> being handled by different tasks, then each one is a "thread" for the
> purpose of understanding mem-alloc dependency.
> 
> So maybe what I really should focus on is not how quickly things happen,
> but how many happen concurrently.  The idea of throttling is to allow
> previous requests to complete before we start too many more.
> 
> With Swap-over-NFS, some of the things that might need to be allocated
> are routing table entries.  These scale with the number of NFS servers
> rather than the number of IO requests, so they are not going to cause
> concurrency problems.
> We also need memory to store replies, but these never exceed the number
> of pending requests, so there is limited concurrency there.
> NFS can send a lot of requests in parallel, but the main limit is the
> RPC "slot table" and while that grows dynamically, it does so with
> GFP_NOFS, so it can block or fail (I wonder if that should explicitly
> disable the use of the reserves).
> 
> So there is a limit on concurrency imposed by non-GFP_MEMALLOC allocations.
> 
> So ... maybe the documentation should say that boundless concurrency of
> allocations (i.e. one module allocating a boundless number of times
> before previous allocations are freed) must be avoided.
> 

Well, that's a good discussion that you just wrote, above, and I think it
demonstrates that it's hard to describe the situation in just a couple of
sentences. With that in mind, perhaps it's best to take the above notes
as a starting point, adjust them slightly and drop them into
Documentation/core-api/memory-allocation.rst ?

Then the comments here could refer to it.


thanks,
Michal Hocko April 7, 2020, 7:24 a.m. UTC | #9
On Tue 07-04-20 11:00:29, Neil Brown wrote:
> On Mon, Apr 06 2020, John Hubbard wrote:
> 
> > On 4/6/20 12:01 AM, Michal Hocko wrote:
[...]
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index e5b817cb86e7..9cacef1a3ee0 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -110,6 +110,11 @@ struct vm_area_struct;
> >>    * the caller guarantees the allocation will allow more memory to be freed
> >>    * very shortly e.g. process exiting or swapping. Users either should
> >>    * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> >> + * Users of this flag have to be extremely careful to not deplete the reserve
> >> + * completely and implement a throttling mechanism which controls the consumption
> >> + * of the reserve based on the amount of freed memory.
> >> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered before
> >> + * using this flag.
> 
> I think this version is pretty good.

Thanks! I will stick with it then.

[...]

> I think it is hard to write rules because the rules are a bit spongey.

Exactly! And the more specific we are, the more likely people are going
to follow it literally. And we do not want that. We want people to be aware
of the limitation but also to think hard before using the flag.

> With mempools, we have a nice clear rule.  When you allocate from a
> mempool you must have a clear path to freeing that allocation which will
> not block on memory allocation except from a subordinate mempool.  This
> implies a partial ordering between mempools.  When you have layered
> block devices the path through the layers from filesystem down to
> hardware defines the order.  It isn't enforced, but it is quite easy to
> reason about.
> 
> GFP_MEMALLOC effectively provides multiple mempools.  So it could
> theoretically deadlock if multiple long dependency chains
> happened. i.e. if 1000 threads each make a GFP_MEMALLOC allocation and
> then need to make another one before the first can be freed - then you
> hit problems.  There is no formal way to guarantee that this doesn't
> happen.  We just say "be gentle" and minimize the users of this flag,
> and keep more memory in reserve than we really need.
> Note that 'threads' here might not be Linux tasks.  If you have an IO
> request that proceeds asynchronously, moving from queue to queue and
> being handled by different tasks, then each one is a "thread" for the
> purpose of understanding mem-alloc dependency.
> 
> So maybe what I really should focus on is not how quickly things happen,
> but how many happen concurrently.  The idea of throttling is to allow
> previous requests to complete before we start too many more.
> 
> With Swap-over-NFS, some of the things that might need to be allocated
> are routing table entries.  These scale with the number of NFS servers
> rather than the number of IO requests, so they are not going to cause
> concurrency problems.
> We also need memory to store replies, but these never exceed the number
> of pending requests, so there is limited concurrency there.
> NFS can send a lot of requests in parallel, but the main limit is the
> RPC "slot table" and while that grows dynamically, it does so with
> GFP_NOFS, so it can block or fail (I wonder if that should explicitly
> disable the use of the reserves).
> 
> So there is a limit on concurrency imposed by non-GFP_MEMALLOC allocations

This really makes sense to mention in the allocation manual
(Documentation/core-api/memory-allocation.rst) as suggested by John.
Care to make it into a patch?
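
To make the "bound the concurrency" idea above a bit more concrete, here
is a rough userspace sketch, not kernel code. It only models the
bookkeeping: a caller may have at most `limit` reserve-backed allocations
outstanding and has to back off once that bound is reached. All names
(reserve_throttle, throttled_alloc, throttled_free) are made up for
illustration, malloc() merely stands in for a __GFP_MEMALLOC allocation,
and the counter is not thread-safe; a real user would need an atomic
counter or a semaphore and would sleep rather than fail.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model; none of these names exist in the kernel. */
struct reserve_throttle {
	unsigned int in_flight;	/* reserve-backed allocations not yet freed */
	unsigned int limit;	/* upper bound on in_flight */
};

static void throttle_init(struct reserve_throttle *t, unsigned int limit)
{
	t->in_flight = 0;
	t->limit = limit;
}

/*
 * Returns NULL when the bound is reached (or malloc() fails). The caller
 * is expected to wait for earlier requests to complete and then retry,
 * instead of digging deeper into the reserve.
 */
static void *throttled_alloc(struct reserve_throttle *t, size_t size)
{
	void *p;

	if (t->in_flight >= t->limit)
		return NULL;

	p = malloc(size);	/* stand-in for a __GFP_MEMALLOC allocation */
	if (p)
		t->in_flight++;
	return p;
}

/* Frees one allocation and makes room for a new one. */
static void throttled_free(struct reserve_throttle *t, void *p)
{
	assert(t->in_flight > 0);
	free(p);
	t->in_flight--;
}
```

The point of the sketch is that the bound is on how many allocations are
outstanding at once, not on how quickly they are freed, which is the
distinction Neil draws above.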
Andrew Morton April 14, 2020, 2:15 a.m. UTC | #10
I've rather lost the plot with this little patch.  Is the below
suitable, or do we think that changes are needed?


From: Michal Hocko <mhocko@suse.com>
Subject: mm: clarify __GFP_MEMALLOC usage

It seems that the existing documentation is not explicit enough about the
expected usage and potential risks.  While it calls out that users have
to free memory when using this flag, it is not really apparent that users
have to be careful not to deplete memory reserves and that they should
implement some sort of throttling wrt. the freeing process.

This is partly based on Neil's explanation [1].

Let's also call out that a pre-allocated pool allocator should be
considered.

[1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20200403083543.11552-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
[mhocko@kernel.org: update]
  Link: http://lkml.kernel.org/r/20200406070137.GC19426@dhcp22.suse.cz
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/include/linux/gfp.h~mm-clarify-__gfp_memalloc-usage
+++ a/include/linux/gfp.h
@@ -110,6 +110,11 @@ struct vm_area_struct;
  * the caller guarantees the allocation will allow more memory to be freed
  * very shortly e.g. process exiting or swapping. Users either should
  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
+ * Users of this flag have to be extremely careful to not deplete the reserve
+ * completely and implement a throttling mechanism which controls the
+ * consumption of the reserve based on the amount of freed memory.
+ * Usage of a pre-allocated pool (e.g. mempool) should be always considered
+ * before using this flag.
  *
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
NeilBrown April 14, 2020, 3:56 a.m. UTC | #11
On Mon, Apr 13 2020, Andrew Morton wrote:

> I've rather lost the plot with this little patch.  Is the below
> suitable, or do we think that changes are needed?
>
>
> From: Michal Hocko <mhocko@suse.com>
> Subject: mm: clarify __GFP_MEMALLOC usage
>
> It seems that the existing documentation is not explicit enough about the
> expected usage and potential risks.  While it calls out that users have
> to free memory when using this flag, it is not really apparent that users
> have to be careful not to deplete memory reserves and that they should
> implement some sort of throttling wrt. the freeing process.
>
> This is partly based on Neil's explanation [1].
>
> Let's also call out that a pre-allocated pool allocator should be
> considered.
>
> [1] http://lkml.kernel.org/r/877dz0yxoa.fsf@notabene.neil.brown.name
>
> [akpm@linux-foundation.org: coding style fixes]
> Link: http://lkml.kernel.org/r/20200403083543.11552-2-mhocko@kernel.org
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> [mhocko@kernel.org: update]
>   Link: http://lkml.kernel.org/r/20200406070137.GC19426@dhcp22.suse.cz
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  include/linux/gfp.h |    5 +++++
>  1 file changed, 5 insertions(+)
>
> --- a/include/linux/gfp.h~mm-clarify-__gfp_memalloc-usage
> +++ a/include/linux/gfp.h
> @@ -110,6 +110,11 @@ struct vm_area_struct;
>   * the caller guarantees the allocation will allow more memory to be freed
>   * very shortly e.g. process exiting or swapping. Users either should
>   * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> + * Users of this flag have to be extremely careful to not deplete the reserve
> + * completely and implement a throttling mechanism which controls the
> + * consumption of the reserve based on the amount of freed memory.
> + * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> + * before using this flag.

I particularly don't like the connection between the consumption and the
amount freed.  I don't think that says anything useful and it misses the
main point which, I think, is having a bound on total usage.

Michal's previous proposal is, I think, the best concrete proposal so
far.

NeilBrown

>   *
>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
> _
John Hubbard April 14, 2020, 7:05 p.m. UTC | #12
On 2020-04-13 20:56, NeilBrown wrote:
> On Mon, Apr 13 2020, Andrew Morton wrote:
> 
>> I've rather lost the plot with this little patch.  Is the below
>> suitable, or do we think that changes are needed?
>>

I recall we were trying to talk Neil into adding some of his writings
into Documentation/core-api/memory-allocation.rst, and then refer to
that from here. But that would be a separate patch I think.


thanks,

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..e3ab1c0d9140 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -110,6 +110,9 @@  struct vm_area_struct;
  * the caller guarantees the allocation will allow more memory to be freed
  * very shortly e.g. process exiting or swapping. Users either should
  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
+ * Users of this flag have to be extremely careful to not deplete the reserve
+ * completely and implement a throttling mechanism which controls the consumption
+ * of the reserve based on the amount of freed memory.
  *
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.