linux-api.vger.kernel.org archive mirror
* Re: [PATCH 00/18] multiple preferred nodes
       [not found] <20200619162425.1052382-1-ben.widawsky@intel.com>
@ 2020-06-22  7:09 ` Michal Hocko
  2020-06-23 11:20   ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-22  7:09 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

User-visible API changes/additions should be posted to the linux-api
mailing list. Now added.

On Fri 19-06-20 09:24:07, Ben Widawsky wrote:
> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> preference for nodes which will fulfil memory allocation requests. Like the
> MPOL_BIND interface, it works over a set of nodes.
> 
> Summary:
> 1-2: Random fixes I found along the way
> 3-4: Logic to handle many preferred nodes in page allocation
> 5-9: Plumbing to allow multiple preferred nodes in mempolicy
> 10-13: Teach page allocation APIs about nodemasks
> 14: Provide a helper to generate preferred nodemasks
> 15: Have page allocation callers generate preferred nodemasks
> 16-17: Flip the switch to have __alloc_pages_nodemask take preferred mask.
> 18: Expose the new uapi
> 
> Along with these patches are patches for libnuma, numactl, numademo, and memhog.
> They still need some polish, but can be found here:
> https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
> It allows new usage: `numactl -P 0,3,4`
> 
> The goal of the new mode is to enable some use-cases for tiered-memory usage
> models, which I've lovingly named:
> 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> requirements allowing preference to be given to all nodes with "fast" memory.
> 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> perhaps slow memory), but doesn't care which node it runs on. The application
> can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> etc). This reverses how nodes are chosen today, where the kernel attempts to
> use memory local to the CPU whenever possible. This mode will instead attempt
> to use the accelerator local to the memory.
> 2. The Tortoise - The administrator (or the application itself) is aware it only
> needs slow memory, and so can prefer that.
> 
> Much of this is almost achievable with the bind interface, but the bind
> interface suffers from an inability to fall back to another set of nodes if
> binding fails on all nodes in the nodemask.
> 
> Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> preference.
> 
> > /* Set first two nodes as preferred in an 8 node system. */
> > const unsigned long nodes = 0x3;
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> 
> > /* Mimic interleave policy, but have fallback. */
> > const unsigned long nodes = 0xaa;
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
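For illustration only (not code from the series): a self-contained sketch of
the calls above. The raw syscall is used because glibc has no wrapper for new
modes, and the mode's numeric value is an assumption for the sketch.

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MPOL_PREFER_MANY 5	/* assumed value; the merged uapi header defines it */

int main(void)
{
	const unsigned long nodes = 0x3;	/* prefer nodes 0 and 1 */

	/* no glibc wrapper for new modes, so invoke the syscall directly */
	if (syscall(SYS_set_mempolicy, MPOL_PREFER_MANY, &nodes, 8))
		perror("set_mempolicy");
	return 0;
}
```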
> 
> Some internal discussion took place around the interface. There are two
> alternatives which we have discussed, plus one I stuck in:
> 1. Ordered list of nodes. Currently it's believed that the added complexity is
>    not needed for expected usecases.
> 2. A flag for bind to allow falling back to other nodes. This confuses the
>    notion of binding and is less flexible than the current solution.
> 3. Create flags or new modes that help with some ordering. This offers both a
>    friendlier API as well as a solution for more customized usage. It's unknown
>    if it's worth the complexity to support this. Here is sample code for how
>    this might work:
> 
> > // Default
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > // which is the same as
> > set_mempolicy(MPOL_DEFAULT, NULL, 0);
> >
> > // The Hare
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> >
> > // The Tortoise
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> >
> > // Prefer the fast memory of the first two sockets
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> >
> > // Prefer specific nodes for something wacky
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);
> 
> ---
> 
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Cc: Li Xinhai <lixinhai.lxh@gmail.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mina Almasry <almasrymina@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> 
> Ben Widawsky (14):
>   mm/mempolicy: Add comment for missing LOCAL
>   mm/mempolicy: Use node_mem_id() instead of node_id()
>   mm/page_alloc: start plumbing multi preferred node
>   mm/page_alloc: add preferred pass to page allocation
>   mm: Finish handling MPOL_PREFERRED_MANY
>   mm: clean up alloc_pages_vma (thp)
>   mm: Extract THP hugepage allocation
>   mm/mempolicy: Use __alloc_page_node for interleaved
>   mm: kill __alloc_pages
>   mm/mempolicy: Introduce policy_preferred_nodes()
>   mm: convert callers of __alloc_pages_nodemask to pmask
>   alloc_pages_nodemask: turn preferred nid into a nodemask
>   mm: Use less stack for page allocations
>   mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
> 
> Dave Hansen (4):
>   mm/mempolicy: convert single preferred_node to full nodemask
>   mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
>   mm/mempolicy: allow preferred code to take a nodemask
>   mm/mempolicy: refactor rebind code for PREFERRED_MANY
> 
>  .../admin-guide/mm/numa_memory_policy.rst     |  22 +-
>  include/linux/gfp.h                           |  19 +-
>  include/linux/mempolicy.h                     |   4 +-
>  include/linux/migrate.h                       |   4 +-
>  include/linux/mmzone.h                        |   3 +
>  include/uapi/linux/mempolicy.h                |   6 +-
>  mm/hugetlb.c                                  |  10 +-
>  mm/internal.h                                 |   1 +
>  mm/mempolicy.c                                | 271 +++++++++++++-----
>  mm/page_alloc.c                               | 179 +++++++++++-
>  10 files changed, 403 insertions(+), 116 deletions(-)
> 
> 
> -- 
> 2.27.0

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
@ 2020-06-23 11:20   ` Michal Hocko
  2020-06-23 16:12     ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-23 11:20 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Mon 22-06-20 09:10:00, Michal Hocko wrote:
[...]
> > The goal of the new mode is to enable some use-cases for tiered-memory usage
> > models, which I've lovingly named:
> > 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> > requirements allowing preference to be given to all nodes with "fast" memory.
> > 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> > perhaps slow memory), but doesn't care which node it runs on. The application
> > can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> > etc). This reverses how nodes are chosen today, where the kernel attempts to
> > use memory local to the CPU whenever possible. This mode will instead attempt
> > to use the accelerator local to the memory.
> > 2. The Tortoise - The administrator (or the application itself) is aware it only
> > needs slow memory, and so can prefer that.
> >
> > Much of this is almost achievable with the bind interface, but the bind
> > interface suffers from an inability to fall back to another set of nodes if
> > binding fails on all nodes in the nodemask.

Yes, and probably worth mentioning explicitly that this might lead to
the OOM killer invocation so a failure would be disruptive to any
workload which is allowed to allocate from the specific node mask (so
even tasks without any mempolicy).

> > Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> > preference.
> > 
> > > /* Set first two nodes as preferred in an 8 node system. */
> > > const unsigned long nodes = 0x3;
> > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > 
> > > /* Mimic interleave policy, but have fallback. */
> > > const unsigned long nodes = 0xaa;
> > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > 
> > Some internal discussion took place around the interface. There are two
> > alternatives which we have discussed, plus one I stuck in:
> > 1. Ordered list of nodes. Currently it's believed that the added complexity is
> >    not needed for expected usecases.

There is no ordering in MPOL_BIND either, and even though numa apis tend
to be screwed up from multiple aspects, this is not a problem I have ever
stumbled over.

> > 2. A flag for bind to allow falling back to other nodes. This confuses the
> >    notion of binding and is less flexible than the current solution.

Agreed.

> > 3. Create flags or new modes that help with some ordering. This offers both a
> >    friendlier API as well as a solution for more customized usage. It's unknown
> >    if it's worth the complexity to support this. Here is sample code for how
> >    this might work:
> > 
> > > // Default
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > > // which is the same as
> > > set_mempolicy(MPOL_DEFAULT, NULL, 0);

OK

> > > // The Hare
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> > >
> > > // The Tortoise
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> > >
> > > // Prefer the fast memory of the first two sockets
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> > >
> > > // Prefer specific nodes for something wacky
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);

I am not so sure about these though. It would be much easier to
start without additional modifiers and provide MPOL_PREFER_MANY without
any additional restrictions first (btw. I would like MPOL_PREFER_MASK
more but I do understand that naming is not the top priority now).

It would be also great to provide a high level semantic description
here. I have very quickly glanced through patches and they are not
really trivial to follow with many incremental steps so the higher level
intention is lost easily.

Do I get it right that the default semantic is essentially
	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
	  semantic)
	- fallback to numa unrestricted allocation with the default
	  numa policy on the failure

Or are there any usecases to modify how hard to keep the preference over
the fallback?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-23 11:20   ` Michal Hocko
@ 2020-06-23 16:12     ` Ben Widawsky
  2020-06-24  7:52       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-23 16:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-23 13:20:48, Michal Hocko wrote:
> On Mon 22-06-20 09:10:00, Michal Hocko wrote:
> [...]
> > > The goal of the new mode is to enable some use-cases for tiered-memory usage
> > > models, which I've lovingly named:
> > > 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> > > requirements allowing preference to be given to all nodes with "fast" memory.
> > > 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> > > perhaps slow memory), but doesn't care which node it runs on. The application
> > > can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> > > etc). This reverses how nodes are chosen today, where the kernel attempts to
> > > use memory local to the CPU whenever possible. This mode will instead attempt
> > > to use the accelerator local to the memory.
> > > 2. The Tortoise - The administrator (or the application itself) is aware it only
> > > needs slow memory, and so can prefer that.
> > >
> > > Much of this is almost achievable with the bind interface, but the bind
> > > interface suffers from an inability to fall back to another set of nodes if
> > > binding fails on all nodes in the nodemask.
> 
> Yes, and probably worth mentioning explicitly that this might lead to
> the OOM killer invocation so a failure would be disruptive to any
> workload which is allowed to allocate from the specific node mask (so
> even tasks without any mempolicy).

Thanks. I don't believe I mention this fact in any of the commit messages or
comments (and perhaps this is an indication I should have). I'll find a place to
mention this outside of the cover letter.

> 
> > > Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> > > preference.
> > > 
> > > > /* Set first two nodes as preferred in an 8 node system. */
> > > > const unsigned long nodes = 0x3;
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > > 
> > > > /* Mimic interleave policy, but have fallback. */
> > > > const unsigned long nodes = 0xaa;
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > > 
> > > Some internal discussion took place around the interface. There are two
> > > alternatives which we have discussed, plus one I stuck in:
> > > 1. Ordered list of nodes. Currently it's believed that the added complexity is
> > >    not needed for expected usecases.
> 
> There is no ordering in MPOL_BIND either, and even though numa apis tend
> to be screwed up from multiple aspects, this is not a problem I have ever
> stumbled over.
> 
> > > 2. A flag for bind to allow falling back to other nodes. This confuses the
> > >    notion of binding and is less flexible than the current solution.
> 
> Agreed.
> 
> > > 3. Create flags or new modes that help with some ordering. This offers both a
> > >    friendlier API as well as a solution for more customized usage. It's unknown
> > >    if it's worth the complexity to support this. Here is sample code for how
> > >    this might work:
> > > 
> > > > // Default
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > > > // which is the same as
> > > > set_mempolicy(MPOL_DEFAULT, NULL, 0);
> 
> OK
> 
> > > > // The Hare
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> > > >
> > > > // The Tortoise
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> > > >
> > > > // Prefer the fast memory of the first two sockets
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> > > >
> > > > // Prefer specific nodes for something wacky
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);
> 
> I am not so sure about these though. It would be much easier to
> start without additional modifiers and provide MPOL_PREFER_MANY without
> any additional restrictions first (btw. I would like MPOL_PREFER_MASK
> more but I do understand that naming is not the top priority now).

True. In fact, this is the same as making MPOL_F_PREFER_ORDER_TYPE_CUSTOM the
implicit default, and adding the others later. Luckily for me, this is
effectively what I have already :-).

It's a new domain for me, so I'm very flexible on the name. MASK seems like an
altogether better name to me as well, but I've been using "MANY" long enough now
that it seems natural.

> 
> It would be also great to provide a high level semantic description
> here. I have very quickly glanced through patches and they are not
> really trivial to follow with many incremental steps so the higher level
> intention is lost easily.
> 
> Do I get it right that the default semantic is essentially
> 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> 	  semantic)
> 	- fallback to numa unrestricted allocation with the default
> 	  numa policy on the failure
> 
> Or are there any usecases to modify how hard to keep the preference over
> the fallback?

tl;dr is: yes, and no usecases.

Longer answer:
Internal APIs (specifically, __alloc_pages_nodemask()) keep all the same
allocation semantics, with the exception that the preferred nodes are tried
first and the bound nodes next. It should be noted here that an empty
preferred mask is the same as saying: traverse nodes in distance order,
starting from the local node. Therefore, for both the preferred mask and the
bound mask, the universe set is equivalent to the empty set (∅ == U). [1]

| prefmask | bindmask | how                                    |
|----------|----------|----------------------------------------|
| ∅        | ∅        | Page allocation without policy         |
| ∅        | N ≠ ∅    | MPOL_BIND                              |
| N ≠ ∅    | ∅        | MPOL_PREFERRED* or internal preference |
| N ≠ ∅    | N ≠ ∅    | MPOL_BIND + internal preference        |
|----------|----------|----------------------------------------|
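As a sketch of the dispatch this table implies (a hypothetical helper, not
code from the series; it only restates the rows above, treating an empty or
NULL mask as the universe set):

```c
/* Illustration of the table: ∅ (or NULL) behaves as "no restriction". */
static bool restricted(const nodemask_t *mask)
{
	return mask && !nodes_empty(*mask);
}

static const char *policy_kind(const nodemask_t *prefmask,
			       const nodemask_t *bindmask)
{
	if (!restricted(prefmask) && !restricted(bindmask))
		return "page allocation without policy";
	if (!restricted(prefmask))
		return "MPOL_BIND";
	if (!restricted(bindmask))
		return "MPOL_PREFERRED* or internal preference";
	return "MPOL_BIND + internal preference";
}
```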

At the end of this patch series, there is never a case (that I can contrive
anyway) where prefmask is multiple nodes, and bindmask is multiple nodes. In the
future, if internal callers wanted to try to get clever, this could be the case.
The UAPI won't allow having both a bind and preferred node. "This system call
defines the default policy for the thread.  The thread policy governs allocation
of pages in the process's address space outside of memory ranges controlled  by
a more specific policy set by mbind(2)."

To your second question: there isn't any usecase. Sans bugs and oversights,
preferred nodes are always tried before fallback. I consider that almost the
hardest level of preference. The one thing I can think of that would be "harder"
would be some sort of mechanism to try all preferred nodes before any tricks are
used, like reclaim. I fear doing this will make the already scary
get_page_from_freelist() even more scary.

On this topic, I haven't changed anything for fragmentation. In the code right
now, fragmentation is enabled as soon as the zone chosen for allocation doesn't
match the preferred_zoneref->zone.

```
if (no_fallback && nr_online_nodes > 1 &&
		zone != ac->preferred_zoneref->zone) {
```

What might be more optimal is to move on to the next node and not allow
fragmentation yet, unless zone ∉ prefmask. Like the above, I think this would
add a decent amount of complexity.

One last thing, which I mention in a commit message but not here: OOM will scan
all nodes, not just the preferred nodes first. Changing that seemed like a
premature optimization to me.


[1] There is an underlying assumption that the geodesic distance between any
two nodes is the same for all zonelists. IOW, if you have nodes M, N, P, each
with zones A and B, the zonelists will be as follows:
M zonelist: MA -> MB -> NA -> NB -> PA -> PB
N zonelist: NA -> NB -> PA -> PB -> MA -> MB
P zonelist: PA -> PB -> MA -> MB -> NA -> NB


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-23 16:12     ` Ben Widawsky
@ 2020-06-24  7:52       ` Michal Hocko
  2020-06-24 16:16         ` Ben Widawsky
  2020-06-26 21:39         ` Ben Widawsky
  0 siblings, 2 replies; 16+ messages in thread
From: Michal Hocko @ 2020-06-24  7:52 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> On 20-06-23 13:20:48, Michal Hocko wrote:
[...]
> > It would be also great to provide a high level semantic description
> > here. I have very quickly glanced through patches and they are not
> > really trivial to follow with many incremental steps so the higher level
> > intention is lost easily.
> > 
> > Do I get it right that the default semantic is essentially
> > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > 	  semantic)
> > 	- fallback to numa unrestricted allocation with the default
> > 	  numa policy on the failure
> > 
> > Or are there any usecases to modify how hard to keep the preference over
> > the fallback?
> 
> tl;dr is: yes, and no usecases.

OK, then I am wondering why the change has to be so involved. Except for
syscall plumbing the only real change to the allocator path would be
something like

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/* Lower zones don't get a nodemask applied for MPOL_BIND */
	if (unlikely(policy->mode == MPOL_BIND || 
		     policy->mode == MPOL_PREFERRED_MANY) &&
			apply_policy_zone(policy, gfp_zone(gfp)) &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}

alloc_pages_current

	if (pol->mode == MPOL_INTERLEAVE)
		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
	else {
		gfp_t gfp_attempt = gfp;

		/*
		 * Make sure the first allocation attempt will try hard
		 * but eventually fail without OOM killer or other
		 * disruption before falling back to the full nodemask
		 */
		if (pol->mode == MPOL_PREFERRED_MANY)
			gfp_attempt |= __GFP_RETRY_MAYFAIL;	

		page = __alloc_pages_nodemask(gfp_attempt, order,
				policy_node(gfp, pol, numa_node_id()),
				policy_nodemask(gfp, pol));
		if (!page && pol->mode == MPOL_PREFERRED_MANY)
			page = __alloc_pages_nodemask(gfp, order,
				numa_node_id(), NULL);
	}

	return page;

similar (well slightly more hairy) in alloc_pages_vma

Or do I miss something that really requires more involved approach like
building custom zonelists and other larger changes to the allocator?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24  7:52       ` Michal Hocko
@ 2020-06-24 16:16         ` Ben Widawsky
  2020-06-24 18:39           ` Michal Hocko
  2020-06-26 21:39         ` Ben Widawsky
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-24 16:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > > 
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > > 
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> > 
> > tl;dr is: yes, and no usecases.
> 
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
> 
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND || 
> 		     policy->mode == MPOL_PREFERRED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
> 
> 	return NULL;
> }
> 
> alloc_pages_current
> 
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
> 
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERRED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> 
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
> 
> 	return page;
> 
> similar (well slightly more hairy) in alloc_pages_vma
> 
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

I think I'm missing how this allows selecting from multiple preferred nodes. In
this case when you try to get the page from the freelist, you'll get the
zonelist of the preferred node, and when you actually scan through on page
allocation, you have no way to filter out the non-preferred nodes. I think the
plumbing of multiple nodes has to go all the way through
__alloc_pages_nodemask(). But it's possible I've missed the point.

I do have a branch where I build a custom zonelist, but that's not the reason
here :-)


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 16:16         ` Ben Widawsky
@ 2020-06-24 18:39           ` Michal Hocko
  2020-06-24 19:37             ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-24 18:39 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would be also great to provide a high level semantic description
> > > > here. I have very quickly glanced through patches and they are not
> > > > really trivial to follow with many incremental steps so the higher level
> > > > intention is lost easily.
> > > > 
> > > > Do I get it right that the default semantic is essentially
> > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > 	  semantic)
> > > > 	- fallback to numa unrestricted allocation with the default
> > > > 	  numa policy on the failure
> > > > 
> > > > Or are there any usecases to modify how hard to keep the preference over
> > > > the fallback?
> > > 
> > > tl;dr is: yes, and no usecases.
> > 
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> > 
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > 	if (unlikely(policy->mode == MPOL_BIND || 
> > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > 		return &policy->v.nodes;
> > 
> > 	return NULL;
> > }
> > 
> > alloc_pages_current
> > 
> > 	if (pol->mode == MPOL_INTERLEAVE)
> > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > 	else {
> > 		gfp_t gfp_attempt = gfp;
> > 
> > 		/*
> > 		 * Make sure the first allocation attempt will try hard
> > 		 * but eventually fail without OOM killer or other
> > 		 * disruption before falling back to the full nodemask
> > 		 */
> > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > 
> > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > 				policy_node(gfp, pol, numa_node_id()),
> > 				policy_nodemask(gfp, pol));
> > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > 			page = __alloc_pages_nodemask(gfp, order,
> > 				numa_node_id(), NULL);
> > 	}
> > 
> > 	return page;
> > 
> > similar (well slightly more hairy) in alloc_pages_vma
> > 
> > Or do I miss something that really requires more involved approach like
> > building custom zonelists and other larger changes to the allocator?
> 
> I think I'm missing how this allows selecting from multiple preferred nodes. In
> this case when you try to get the page from the freelist, you'll get the
> zonelist of the preferred node, and when you actually scan through on page
> allocation, you have no way to filter out the non-preferred nodes. I think the
> plumbing of multiple nodes has to go all the way through
> __alloc_pages_nodemask(). But it's possible I've missed the point.

policy_nodemask() will provide the nodemask which will be used as a
filter on the policy_node.
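For reference, a sketch of the mainline mechanism being pointed at (field
names as of v5.8): the zonelist walk simply skips zones whose node is not
set in the nodemask, so a single nodemask already acts as the filter.

```c
/* In get_page_from_freelist(), roughly: only zones on nodes present in
 * ac->nodemask are ever considered. */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
				ac->highest_zoneidx, ac->nodemask) {
	/* try to allocate from this zone ... */
}
```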
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 18:39           ` Michal Hocko
@ 2020-06-24 19:37             ` Ben Widawsky
  2020-06-24 19:51               ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-24 19:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 20:39:17, Michal Hocko wrote:
> On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > [...]
> > > > > It would be also great to provide a high level semantic description
> > > > > here. I have very quickly glanced through patches and they are not
> > > > > really trivial to follow with many incremental steps so the higher level
> > > > > intention is lost easily.
> > > > > 
> > > > > Do I get it right that the default semantic is essentially
> > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > 	  semantic)
> > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > 	  numa policy on the failure
> > > > > 
> > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > the fallback?
> > > > 
> > > > tl;dr is: yes, and no usecases.
> > > 
> > > OK, then I am wondering why the change has to be so involved. Except for
> > > syscall plumbing the only real change to the allocator path would be
> > > something like
> > > 
> > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > {
> > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > 		return &policy->v.nodes;
> > > 
> > > 	return NULL;
> > > }
> > > 
> > > alloc_pages_current
> > > 
> > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > 	else {
> > > 		gfp_t gfp_attempt = gfp;
> > > 
> > > 		/*
> > > 		 * Make sure the first allocation attempt will try hard
> > > 		 * but eventually fail without OOM killer or other
> > > 		 * disruption before falling back to the full nodemask
> > > 		 */
> > > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > 
> > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > 				policy_node(gfp, pol, numa_node_id()),
> > > 				policy_nodemask(gfp, pol));
> > > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > > 			page = __alloc_pages_nodemask(gfp, order,
> > > 				numa_node_id(), NULL);
> > > 	}
> > > 
> > > 	return page;
> > > 
> > > similar (well slightly more hairy) in alloc_pages_vma
> > > 
> > > Or do I miss something that really requires more involved approach like
> > > building custom zonelists and other larger changes to the allocator?
> > 
> > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > this case when you try to get the page from the freelist, you'll get the
> > zonelist of the preferred node, and when you actually scan through on page
> > allocation, you have no way to filter out the non-preferred nodes. I think the
> > plumbing of multiple nodes has to go all the way through
> > __alloc_pages_nodemask(). But it's possible I've missed the point.
> 
> policy_nodemask() will provide the nodemask which will be used as a
> filter on the policy_node.

Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
to that point. UAPI cannot get independent masks, and callers of these functions
don't yet use them.

So let me ask before I actually type it up and find it's much much simpler, is
there not some perceived benefit to having both masks being independent?


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 19:37             ` Ben Widawsky
@ 2020-06-24 19:51               ` Michal Hocko
  2020-06-24 20:01                 ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-24 19:51 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> On 20-06-24 20:39:17, Michal Hocko wrote:
> > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > > [...]
> > > > > > It would be also great to provide a high level semantic description
> > > > > > here. I have very quickly glanced through patches and they are not
> > > > > > really trivial to follow with many incremental steps so the higher level
> > > > > > intention is lost easily.
> > > > > > 
> > > > > > Do I get it right that the default semantic is essentially
> > > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > > 	  semantic)
> > > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > > 	  numa policy on the failure
> > > > > > 
> > > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > > the fallback?
> > > > > 
> > > > > tl;dr is: yes, and no usecases.
> > > > 
> > > > OK, then I am wondering why the change has to be so involved. Except for
> > > > syscall plumbing the only real change to the allocator path would be
> > > > something like
> > > > 
> > > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > {
> > > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > > 		return &policy->v.nodes;
> > > > 
> > > > 	return NULL;
> > > > }
> > > > 
> > > > alloc_pages_current
> > > > 
> > > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > > 	else {
> > > > 		gfp_t gfp_attempt = gfp;
> > > > 
> > > > 		/*
> > > > 		 * Make sure the first allocation attempt will try hard
> > > > 		 * but eventually fail without OOM killer or other
> > > > 		 * disruption before falling back to the full nodemask
> > > > 		 */
> > > > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > > 
> > > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > > 				policy_node(gfp, pol, numa_node_id()),
> > > > 				policy_nodemask(gfp, pol));
> > > > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > > > 			page = __alloc_pages_nodemask(gfp, order,
> > > > 				numa_node_id(), NULL);
> > > > 	}
> > > > 
> > > > 	return page;
> > > > 
> > > > similar (well slightly more hairy) in alloc_pages_vma
> > > > 
> > > > Or do I miss something that really requires more involved approach like
> > > > building custom zonelists and other larger changes to the allocator?
> > > 
> > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > this case when you try to get the page from the freelist, you'll get the
> > > zonelist of the preferred node, and when you actually scan through on page
> > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > plumbing of multiple nodes has to go all the way through
> > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > 
> > policy_nodemask() will provide the nodemask which will be used as a
> > filter on the policy_node.
> 
> Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> to that point. UAPI cannot get independent masks, and callers of these functions
> don't yet use them.
> 
> So let me ask before I actually type it up and find it's much much simpler, is
> there not some perceived benefit to having both masks being independent?

I am not sure I follow. Which two masks do you have in mind? zonelist
and user provided nodemask?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 19:51               ` Michal Hocko
@ 2020-06-24 20:01                 ` Ben Widawsky
  2020-06-24 20:07                   ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 21:51:58, Michal Hocko wrote:
> On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > > > [...]
> > > > > > > It would be also great to provide a high level semantic description
> > > > > > > here. I have very quickly glanced through patches and they are not
> > > > > > > really trivial to follow with many incremental steps so the higher level
> > > > > > > intention is lost easily.
> > > > > > > 
> > > > > > > Do I get it right that the default semantic is essentially
> > > > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > > > 	  semantic)
> > > > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > > > 	  numa policy on the failure
> > > > > > > 
> > > > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > > > the fallback?
> > > > > > 
> > > > > > tl;dr is: yes, and no usecases.
> > > > > 
> > > > > OK, then I am wondering why the change has to be so involved. Except for
> > > > > syscall plumbing the only real change to the allocator path would be
> > > > > something like
> > > > > 
> > > > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > > {
> > > > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > > > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > > > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > > > 		return &policy->v.nodes;
> > > > > 
> > > > > 	return NULL;
> > > > > }
> > > > > 
> > > > > alloc_pages_current
> > > > > 
> > > > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > > > 	else {
> > > > > 		gfp_t gfp_attempt = gfp;
> > > > > 
> > > > > 		/*
> > > > > 		 * Make sure the first allocation attempt will try hard
> > > > > 		 * but eventually fail without OOM killer or other
> > > > > 		 * disruption before falling back to the full nodemask
> > > > > 		 */
> > > > > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > > > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > > > 
> > > > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > > > 				policy_node(gfp, pol, numa_node_id()),
> > > > > 				policy_nodemask(gfp, pol));
> > > > > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > > > > 			page = __alloc_pages_nodemask(gfp, order,
> > > > > 				numa_node_id(), NULL);
> > > > > 	}
> > > > > 
> > > > > 	return page;
> > > > > 
> > > > > similar (well slightly more hairy) in alloc_pages_vma
> > > > > 
> > > > > Or do I miss something that really requires more involved approach like
> > > > > building custom zonelists and other larger changes to the allocator?
> > > > 
> > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > this case when you try to get the page from the freelist, you'll get the
> > > > zonelist of the preferred node, and when you actually scan through on page
> > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > plumbing of multiple nodes has to go all the way through
> > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > 
> > > policy_nodemask() will provide the nodemask which will be used as a
> > > filter on the policy_node.
> > 
> > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > to that point. UAPI cannot get independent masks, and callers of these functions
> > don't yet use them.
> > 
> > So let me ask before I actually type it up and find it's much much simpler, is
> > there not some perceived benefit to having both masks being independent?
> 
> I am not sure I follow. Which two masks do you have in mind? zonelist
> and user provided nodemask?

Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.

> 
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:01                 ` Ben Widawsky
@ 2020-06-24 20:07                   ` Michal Hocko
  2020-06-24 20:23                     ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-24 20:07 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> On 20-06-24 21:51:58, Michal Hocko wrote:
> > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
[...]
> > > > > > Or do I miss something that really requires more involved approach like
> > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > 
> > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > plumbing of multiple nodes has to go all the way through
> > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > 
> > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > filter on the policy_node.
> > > 
> > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > don't yet use them.
> > > 
> > > So let me ask before I actually type it up and find it's much much simpler, is
> > > there not some perceived benefit to having both masks being independent?
> > 
> > I am not sure I follow. Which two masks do you have in mind? zonelist
> > and user provided nodemask?
> 
> Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.

Each mask is local to its policy object.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:07                   ` Michal Hocko
@ 2020-06-24 20:23                     ` Ben Widawsky
  2020-06-24 20:42                       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 22:07:50, Michal Hocko wrote:
> On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> [...]
> > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > 
> > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > 
> > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > filter on the policy_node.
> > > > 
> > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > don't yet use them.
> > > > 
> > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > there not some perceived benefit to having both masks being independent?
> > > 
> > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > and user provided nodemask?
> > 
> > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> 
> Each mask is local to its policy object.

I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
policy. Policy decisions are all made beforehand. The question from a few mails
ago was whether there is any use in keeping that change to
__alloc_pages_nodemask accepting two nodemasks.


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:23                     ` Ben Widawsky
@ 2020-06-24 20:42                       ` Michal Hocko
  2020-06-24 20:55                         ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-06-24 20:42 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> On 20-06-24 22:07:50, Michal Hocko wrote:
> > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > [...]
> > > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > 
> > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > 
> > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > filter on the policy_node.
> > > > > 
> > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > don't yet use them.
> > > > > 
> > > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > > there not some perceived benefit to having both masks being independent?
> > > > 
> > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > and user provided nodemask?
> > > 
> > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > 
> > Each mask is local to its policy object.
> 
> I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> policy. Policy decisions are all made beforehand. The question from a few mails
> ago was whether there is any use in keeping that change to
> __alloc_pages_nodemask accepting two nodemasks.

It is probably too late for me because I am still not following what you
mean. Maybe it would be better to provide pseudo code for what you have in
mind. Anyway, all that I am saying is that for the functionality you
propose, and _if_ the fallback strategy is fixed, then all you should need
is to use the preferred nodemask for __alloc_pages_nodemask and a
fallback allocation to the full node set (NULL nodemask). So you first try
what the userspace prefers - __GFP_RETRY_MAYFAIL gives you "try hard but
do not OOM if the memory is depleted" semantics - and the fallback
allocation goes all the way to OOM on complete memory depletion.
So I do not see much point in a custom zonelist for the policy. Maybe as
a micro-optimization to save some branches here and there.

If you envision usecases which might want to control the fallback
allocation strategy then this would get more complex, because you
would need a sorted list of zones to try. But that would really require
some solid usecase, and it should build on top of a trivial
implementation which really is BIND with a fallback.
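Condensed, the strategy described above is just (reusing the earlier sketch's
naming, where pol->v.nodes holds the user-provided preferred mask):

```c
/* Try hard within the preferred mask without OOMing, then fall back to a
 * fully unrestricted allocation. */
page = __alloc_pages_nodemask(gfp | __GFP_RETRY_MAYFAIL, order,
			      numa_node_id(), &pol->v.nodes);
if (!page)
	page = __alloc_pages_nodemask(gfp, order, numa_node_id(), NULL);
```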

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:42                       ` Michal Hocko
@ 2020-06-24 20:55                         ` Ben Widawsky
  2020-06-25  6:28                           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 22:42:32, Michal Hocko wrote:
> On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> > On 20-06-24 22:07:50, Michal Hocko wrote:
> > > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > [...]
> > > > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > > 
> > > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > > 
> > > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > > filter on the policy_node.
> > > > > > 
> > > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > > don't yet use them.
> > > > > > 
> > > > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > > > there not some perceived benefit to having both masks being independent?
> > > > > 
> > > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > > and user provided nodemask?
> > > > 
> > > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > > 
> > > Each mask is local to its policy object.
> > 
> > I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> > policy. Policy decisions are all made beforehand. The question from a few mails
> > ago was whether there is any use in keeping that change to
> > __alloc_pages_nodemask accepting two nodemasks.
> 
> It is probably too late for me because I am still not following what you
> mean. Maybe it would be better to provide pseudo code for what you have
> in mind. Anyway, all that I am saying is that for the functionality that
> you propose, and _if_ the fallback strategy is fixed, all you should
> need is to use the preferred nodemask for __alloc_pages_nodemask and a
> fallback allocation to the full node set (NULL nodemask). So you first
> try what the userspace prefers - __GFP_RETRY_MAYFAIL gives you "try hard
> but do not OOM if the memory is depleted" semantics - and the fallback
> allocation goes all the way to OOM on complete memory depletion.
> So I do not see much point in a custom zonelist for the policy. Maybe as
> a micro-optimization to save some branches here and there.
> 
> If you envision usecases which might want to control the fallback
> allocation strategy then this would get more complex because you
> would need a sorted list of zones to try, but this would really require
> some solid use case, and it should build on top of a trivial
> implementation, which really is BIND with the fallback.
> 

I will implement what you suggest. I think it's a good suggestion. Here is what
I mean though:
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-                                                       nodemask_t *nodemask);
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
+		       nodemask_t *nodemask);

Is there any value in keeping two nodemasks as part of the interface?
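
To make the question concrete, a hypothetical call site under that
signature might look like this (the preferred_nodes field is illustrative
only, not from the actual patches):

	/* two masks at the call site: which one wins, and when? */
	page = __alloc_pages_nodemask(gfp, order,
				      &pol->v.preferred_nodes, /* hypothetical */
				      &pol->v.nodes);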

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:55                         ` Ben Widawsky
@ 2020-06-25  6:28                           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2020-06-25  6:28 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:55:18, Ben Widawsky wrote:
> On 20-06-24 22:42:32, Michal Hocko wrote:
> > On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> > > On 20-06-24 22:07:50, Michal Hocko wrote:
> > > > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > > [...]
> > > > > > > > > > Or do I miss something that really requires a more involved approach like
> > > > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > > > 
> > > > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > > > 
> > > > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > > > filter on the policy_node.
> > > > > > > 
> > > > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > > > don't yet use them.
> > > > > > > 
> > > > > > > So let me ask before I actually type it up and find it's much, much simpler:
> > > > > > > is there not some perceived benefit to having both masks be independent?
> > > > > > 
> > > > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > > > and user provided nodemask?
> > > > > 
> > > > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > > > 
> > > > Each mask is local to its policy object.
> > > 
> > > I mean __alloc_pages_nodemask as an internal API. That is irrespective of
> > > policy; policy decisions are all made beforehand. The question from a few mails
> > > ago was whether there is any use in keeping the change that has
> > > __alloc_pages_nodemask accept two nodemasks.
> > 
> > It is probably too late for me because I am still not following what you
> > mean. Maybe it would be better to provide pseudo code for what you have
> > in mind. Anyway, all that I am saying is that for the functionality that
> > you propose, and _if_ the fallback strategy is fixed, all you should
> > need is to use the preferred nodemask for __alloc_pages_nodemask and a
> > fallback allocation to the full node set (NULL nodemask). So you first
> > try what the userspace prefers - __GFP_RETRY_MAYFAIL gives you "try hard
> > but do not OOM if the memory is depleted" semantics - and the fallback
> > allocation goes all the way to OOM on complete memory depletion.
> > So I do not see much point in a custom zonelist for the policy. Maybe as
> > a micro-optimization to save some branches here and there.
> > 
> > If you envision usecases which might want to control the fallback
> > allocation strategy then this would get more complex because you
> > would need a sorted list of zones to try, but this would really require
> > some solid use case, and it should build on top of a trivial
> > implementation, which really is BIND with the fallback.
> > 
> 
> I will implement what you suggest. I think it's a good suggestion. Here is what
> I mean though:
> -struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> -                                                       nodemask_t *nodemask);
> +struct page *
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
> +		       nodemask_t *nodemask);
> 
> Is there any value in keeping two nodemasks as part of the interface?

I do not see any advantage. The first thing you would have to do is
either intersect the two or special-case the code to use one over the
other, and then you would need a clear criterion for how to do that.
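
As a sketch of that first step, assuming both masks are non-NULL (the
combined variable is illustrative):

	nodemask_t combined;

	/* a two-nodemask interface forces a merge policy up front */
	nodes_and(combined, *prefmask, *nodemask);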

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24  7:52       ` Michal Hocko
  2020-06-24 16:16         ` Ben Widawsky
@ 2020-06-26 21:39         ` Ben Widawsky
  2020-06-29 10:16           ` Michal Hocko
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2020-06-26 21:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would also be great to provide a high-level semantic description
> > > here. I have very quickly glanced through the patches and they are not
> > > really trivial to follow with many incremental steps, so the higher-level
> > > intention is easily lost.
> > > 
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > > 
> > > Or are there any use cases to modify how hard to keep the preference over
> > > the fallback?
> > 
> > tl;dr is: yes, and no use cases.
> 
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
> 
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND ||
> 		     policy->mode == MPOL_PREFERRED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
> 
> 	return NULL;
> }
> 
> alloc_pages_current
> 
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
> 
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERRED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
> 
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
> 
> 	return page;
> 
> similar (well, slightly more hairy) in alloc_pages_vma
> 
> Or do I miss something that really requires a more involved approach like
> building custom zonelists and other larger changes to the allocator?

Hi Michal,

I'm mostly done implementing this change. It looks good, and so far I think it's
functionally equivalent. One thing though: above you use NULL for the fallback.
That actually should not be NULL, because of the logic in policy_node to
restrict zones and obey cpusets. I've implemented it as such, but I was hoping
someone with deeper understanding and more experience could confirm that was
the correct thing to do.
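
For reference, policy_node() around the time of this thread looked roughly
like this (paraphrased from mm/mempolicy.c, ~v5.7; details vary by kernel
version):

	static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
	{
		if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		else
			/* __GFP_THISNODE makes no sense with the bind policy */
			WARN_ON_ONCE(policy->mode == MPOL_BIND &&
				     (gfp & __GFP_THISNODE));

		return nd;
	}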

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-26 21:39         ` Ben Widawsky
@ 2020-06-29 10:16           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2020-06-29 10:16 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Fri 26-06-20 14:39:05, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would also be great to provide a high-level semantic description
> > > > here. I have very quickly glanced through the patches and they are not
> > > > really trivial to follow with many incremental steps, so the higher-level
> > > > intention is easily lost.
> > > > 
> > > > Do I get it right that the default semantic is essentially
> > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > 	  semantic)
> > > > 	- fallback to numa unrestricted allocation with the default
> > > > 	  numa policy on the failure
> > > > 
> > > > Or are there any use cases to modify how hard to keep the preference over
> > > > the fallback?
> > > 
> > > tl;dr is: yes, and no use cases.
> > 
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> > 
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > 	if (unlikely(policy->mode == MPOL_BIND ||
> > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > 		return &policy->v.nodes;
> > 
> > 	return NULL;
> > }
> > 
> > alloc_pages_current
> > 
> > 	if (pol->mode == MPOL_INTERLEAVE)
> > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > 	else {
> > 		gfp_t gfp_attempt = gfp;
> > 
> > 		/*
> > 		 * Make sure the first allocation attempt will try hard
> > 		 * but eventually fail without OOM killer or other
> > 		 * disruption before falling back to the full nodemask
> > 		 */
> > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
> > 
> > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > 				policy_node(gfp, pol, numa_node_id()),
> > 				policy_nodemask(gfp, pol));
> > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > 			page = __alloc_pages_nodemask(gfp, order,
> > 				numa_node_id(), NULL);
> > 	}
> > 
> > 	return page;
> > 
> > similar (well, slightly more hairy) in alloc_pages_vma
> > 
> > Or do I miss something that really requires a more involved approach like
> > building custom zonelists and other larger changes to the allocator?
> 
> Hi Michal,
> 
> I'm mostly done implementing this change. It looks good, and so far I think it's
> functionally equivalent. One thing though: above you use NULL for the fallback.
> That actually should not be NULL, because of the logic in policy_node to
> restrict zones and obey cpusets. I've implemented it as such, but I was hoping
> someone with deeper understanding and more experience could confirm that was
> the correct thing to do.

Cpusets are just plumbed into the allocator directly. Have a look at the
__cpuset_zone_allowed call inside get_page_from_freelist. Anyway,
functionally what you are looking for here is that the fallback
allocation should behave exactly as if there were no mempolicy in place.
And that is expressed by a NULL nodemask. The rest is done automagically...
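
The check referred to sits in the zonelist walk; roughly (paraphrased from
mm/page_alloc.c of that era, exact guards vary by kernel version):

	/* inside get_page_from_freelist(), for each candidate zone */
	if (cpusets_enabled() &&
			(alloc_flags & ALLOC_CPUSET) &&
			!__cpuset_zone_allowed(zone, gfp_mask))
		continue;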

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-06-29 20:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200619162425.1052382-1-ben.widawsky@intel.com>
2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
2020-06-23 11:20   ` Michal Hocko
2020-06-23 16:12     ` Ben Widawsky
2020-06-24  7:52       ` Michal Hocko
2020-06-24 16:16         ` Ben Widawsky
2020-06-24 18:39           ` Michal Hocko
2020-06-24 19:37             ` Ben Widawsky
2020-06-24 19:51               ` Michal Hocko
2020-06-24 20:01                 ` Ben Widawsky
2020-06-24 20:07                   ` Michal Hocko
2020-06-24 20:23                     ` Ben Widawsky
2020-06-24 20:42                       ` Michal Hocko
2020-06-24 20:55                         ` Ben Widawsky
2020-06-25  6:28                           ` Michal Hocko
2020-06-26 21:39         ` Ben Widawsky
2020-06-29 10:16           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).