* [RFC 0/3] Group pages on the direct map for permissioned vmallocs
@ 2021-04-05 20:37 Rick Edgecombe
From: Rick Edgecombe @ 2021-04-05 20:37 UTC
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

Hi,

This is a follow-up to the previous attempt to overhaul how vmalloc permissions
are done:
https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/

While working on the next version it dawned on me that we can significantly
reduce direct map breakage on x86 with a much less invasive change, so I
thought I would start there in the meantime.

In a test of booting Fedora and running the BPF unit tests, this reduced the
number of 4k pages on the direct map by 98%.

It simply serves pages for x86 module_alloc() mappings from a cache built out
of 2MB pages, so all of the later breakage clusters in 2MB blocks on the
direct map. The trade-off is that colder pages are used for these allocations.
All module_alloc() users (modules, eBPF JIT, ftrace, kprobes) get this
behavior.
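
As a rough sketch of the idea (illustrative only; grouped_page_get() and the
grouped_pages list_lru are made-up names, not the actual code in patch 2):

static struct list_lru grouped_pages; /* list_lru_init() at boot (omitted) */

/*
 * Hand out 4k pages from a cache that is refilled one 2MB page at a
 * time, so later direct map fracturing clusters within 2MB blocks.
 */
static struct page *grouped_page_get(int nid)
{
	const unsigned int order = PMD_SHIFT - PAGE_SHIFT;
	struct list_head *entry = list_lru_get_mru(&grouped_pages, nid);
	struct page *page;
	int i;

	if (entry) /* Reuse the most recently freed page first */
		return virt_to_page(entry);

	/* Cache empty: refill it with the pieces of a fresh 2MB page */
	page = alloc_pages_node(nid, GFP_KERNEL, order);
	if (!page)
		return NULL;
	split_page(page, order);

	for (i = 1; i < (1 << order); i++) {
		/* The list_head lives in the page contents themselves */
		struct list_head *e = page_address(page + i);

		INIT_LIST_HEAD(e); /* list_lru_add() expects an empty entry */
		list_lru_add(&grouped_pages, e);
	}

	return page;
}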

Potentially this behavior should also be enabled for eBPF byte code
allocations in the !CONFIG_BPF_JIT_ALWAYS_ON case.

The new APIs and the invasive changes in the callers can wait until vmalloc
huge pages bring more benefits. That said, I can post the shootdown reduction
changes with the previous review comments integrated if anyone disagrees.

Based on v5.11.

Thanks,

Rick


Rick Edgecombe (3):
  list: Support getting most recent element in list_lru
  vmalloc: Support grouped page allocations
  x86/module: Use VM_GROUP_PAGES flag

 arch/x86/Kconfig         |   1 +
 arch/x86/kernel/module.c |   2 +-
 include/linux/list_lru.h |  13 +++
 include/linux/vmalloc.h  |   1 +
 mm/Kconfig               |   9 ++
 mm/list_lru.c            |  28 +++++
 mm/vmalloc.c             | 215 +++++++++++++++++++++++++++++++++++++--
 7 files changed, 257 insertions(+), 12 deletions(-)

-- 
2.29.2




* [RFC 1/3] list: Support getting most recent element in list_lru
From: Rick Edgecombe @ 2021-04-05 20:37 UTC
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

In future patches, some functionality will use list_lru and will also need
to keep track of the most recently used element on a node. Since this
information is already contained within the list_lru, add a function to
retrieve it so that the caller does not need to maintain an additional list.

Do not support memcg aware list_lrus, since this is not needed by the
intended caller.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/list_lru.h | 13 +++++++++++++
 mm/list_lru.c            | 28 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 9dcaa3e582c9..4bde44a5024b 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -103,6 +103,19 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_get_mru: get and remove the most recently added element
+ * @lru: the lru pointer
+ * @nid: the node id
+ *
+ * This function removes the most recently added item from the list of the
+ * specified node. It should not be used if the list_lru is memcg aware,
+ * since it does not search the per-memcg lists.
+ *
+ * Return value: the element removed, or NULL if the list was empty
+ */
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid);
+
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/mm/list_lru.c b/mm/list_lru.c
index fe230081690b..a18d9ed8abd5 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -156,6 +156,34 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+struct list_head *list_lru_get_mru(struct list_lru *lru, int nid)
+{
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l = &nlru->lru;
+	struct list_head *ret;
+
+	/* This function does not attempt to search through the memcg lists */
+	if (list_lru_memcg_aware(lru)) {
+		WARN_ONCE(1, "list_lru: %s not supported on memcg aware list_lrus", __func__);
+		return NULL;
+	}
+
+	spin_lock(&nlru->lock);
+	if (list_empty(&l->list)) {
+		ret = NULL;
+	} else {
+		/* Get tail */
+		ret = l->list.prev;
+		list_del_init(ret);
+
+		l->nr_items--;
+		nlru->nr_items--;
+	}
+	spin_unlock(&nlru->lock);
+
+	return ret;
+}
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
-- 
2.29.2
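
For context, a caller might use the new helper roughly like this (an
illustrative sketch; cache_get_page() is a made-up name, and the list entries
are assumed to live in the page contents, as later in this series):

static struct page *cache_get_page(struct list_lru *cache, int nid)
{
	/* Prefer the most recently freed (warmest) page on this node */
	struct list_head *entry = list_lru_get_mru(cache, nid);

	if (!entry)
		return NULL; /* Cache empty on this node */

	return virt_to_page(entry);
}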




* [RFC 3/3] x86/module: Use VM_GROUP_PAGES flag
From: Rick Edgecombe @ 2021-04-05 20:37 UTC
  To: akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86, Rick Edgecombe

Callers of module_alloc() will set permissions on the allocation. Use the
VM_GROUP_PAGES flag to reduce direct map breakage.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/module.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 34b153cbd4ac..9161ce0e987f 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -75,7 +75,7 @@ void *module_alloc(unsigned long size)
 	p = __vmalloc_node_range(size, MODULE_ALIGN,
 				    MODULES_VADDR + get_module_load_offset(),
 				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL, 0, NUMA_NO_NODE,
+				    PAGE_KERNEL, VM_GROUP_PAGES, NUMA_NO_NODE,
 				    __builtin_return_address(0));
 	if (p && (kasan_module_alloc(p, size) < 0)) {
 		vfree(p);
-- 
2.29.2




* Re: [RFC 2/3] vmalloc: Support grouped page allocations
From: Dave Hansen @ 2021-04-05 21:01 UTC
  To: Rick Edgecombe, akpm, linux-mm, bpf, dave.hansen, peterz, luto, jeyu
  Cc: linux-kernel, ast, daniel, andrii, hch, x86

On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> +static void __dispose_pages(struct list_head *head)
> +{
> +	struct list_head *cur, *next;
> +
> +	list_for_each_safe(cur, next, head) {
> +		list_del(cur);
> +
> +		/* The list head is stored at the start of the page */
> +		free_page((unsigned long)cur);
> +	}
> +}

This is interesting.

While the page is in the allocator, you're using the page contents
themselves to store the list_head.  It took me a minute to figure out
what you were doing here because "start of the page" is a bit
ambiguous.  It could mean:

 * the first 16 bytes in 'struct page'
or
 * the first 16 bytes in the page itself, aka *page_address(page)

The fact that this doesn't work on highmem systems makes this an OK thing
to do, but it is a bit weird.  It's also doubly susceptible to bugs
where there's a page_to_virt() or virt_to_page() screwup.
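
Concretely, it's the second one, i.e. something like the sketch below
(__cache_add_page() is a made-up name for illustration, not code from the
series):

/*
 * The page's own memory, not its struct page, holds the list_head.
 * This only works because the pages are always mapped (no highmem).
 */
static void __cache_add_page(struct list_head *head, struct page *page)
{
	struct list_head *entry = page_address(page);

	list_add(entry, head);
}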

I was *hoping* there was still sufficient space in 'struct page' for
this second list_head in addition to page->lru.  I think there *should*
be.  That would at least make this allocator a bit more "normal" in not
caring about page contents while the page is free in the allocator.  If
you were able to do that you could do things like kmemcheck or page
alloc debugging while the page is in the allocator.

Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
But, if that doesn't work out, please comment the snot out of this thing
because it _is_ weird.



* Re: [RFC 2/3] vmalloc: Support grouped page allocations
From: Matthew Wilcox @ 2021-04-05 21:32 UTC
  To: Dave Hansen
  Cc: Rick Edgecombe, akpm, linux-mm, bpf, dave.hansen, peterz, luto,
	jeyu, linux-kernel, ast, daniel, andrii, hch, x86

On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +	struct list_head *cur, *next;
> > +
> > +	list_for_each_safe(cur, next, head) {
> > +		list_del(cur);
> > +
> > +		/* The list head is stored at the start of the page */
> > +		free_page((unsigned long)cur);
> > +	}
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because: "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on highmem systems makes this an OK thing
> to do, but it is a bit weird.  It's also doubly susceptible to bugs
> where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there *should*
> be.  That would at least make this allocator a bit more "normal" in not
> caring about page contents while the page is free in the allocator.  If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this thing
> because it _is_ weird.

Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!

I haven't read over these patches yet.  If these pages are in use by
vmalloc, they can't use mapping+index because get_user_pages() will call
page_mapping() and the list_head will confuse it.  I think it could use
index+private for a list_head.
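
(As a hypothetical illustration of that idea, not code from the series:
overlay a list_head on the adjacent index + private words, leaving
page->mapping alone for page_mapping():)

static struct list_head *page_index_private_list(struct page *page)
{
	/* index and private are adjacent unsigned-long-sized fields */
	BUILD_BUG_ON(offsetof(struct page, private) !=
		     offsetof(struct page, index) + sizeof(pgoff_t));
	return (struct list_head *)&page->index;
}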

If the pages are in the buddy, I _think_ mapping+index are free.  private
is in use for buddy order.  But I haven't read through the buddy code
in a while.

Does it need to be a doubly linked list?  Can it be an hlist?



* Re: [RFC 2/3] vmalloc: Support grouped page allocations
From: Edgecombe, Rick P @ 2021-04-05 21:38 UTC
  To: luto, Hansen, Dave, dave.hansen, linux-mm, jeyu, peterz, akpm, bpf
  Cc: daniel, ast, linux-kernel, andrii, x86, hch

On Mon, 2021-04-05 at 14:01 -0700, Dave Hansen wrote:
> On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > +static void __dispose_pages(struct list_head *head)
> > +{
> > +	struct list_head *cur, *next;
> > +
> > +	list_for_each_safe(cur, next, head) {
> > +		list_del(cur);
> > +
> > +		/* The list head is stored at the start of the page */
> > +		free_page((unsigned long)cur);
> > +	}
> > +}
> 
> This is interesting.
> 
> While the page is in the allocator, you're using the page contents
> themselves to store the list_head.  It took me a minute to figure out
> what you were doing here because "start of the page" is a bit
> ambiguous.  It could mean:
> 
>  * the first 16 bytes in 'struct page'
> or
>  * the first 16 bytes in the page itself, aka *page_address(page)
> 
> The fact that this doesn't work on highmem systems makes this an OK
> thing to do, but it is a bit weird.  It's also doubly susceptible to
> bugs where there's a page_to_virt() or virt_to_page() screwup.
> 
> I was *hoping* there was still sufficient space in 'struct page' for
> this second list_head in addition to page->lru.  I think there *should*
> be.  That would at least make this allocator a bit more "normal" in not
> caring about page contents while the page is free in the allocator.  If
> you were able to do that you could do things like kmemcheck or page
> alloc debugging while the page is in the allocator.
> 
> Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> But, if that doesn't work out, please comment the snot out of this
> thing because it _is_ weird.

Yes sorry, that deserved more explanation. I tried putting it in struct
page actually. The problem was that list_lru automatically determines
the node id from the list_head provided to it, via
page_to_nid(virt_to_page(head)). I guess it assumes the list_head is on
the actual item. I started adding another list_lru function that would
let the node id be passed in separately, but then I remembered this
trick from the deferred free list in vmalloc.
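
(For reference, the node derivation in v5.11's list_lru code boils down to
the following, shown here as a made-up helper: the node id comes from the
address of the item itself, so the list_head has to sit in memory on the
right node, i.e. in the cached page:)

static int list_lru_item_nid(struct list_head *item)
{
	return page_to_nid(virt_to_page(item));
}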

If this ever expands to handle direct map unmapped pages (which would
probably be the next step), the list_head will have to be moved out of
the actual page anyway. But in the meantime it resulted in the smallest
change.

I can try the other way if it's still too weird.



* Re: [RFC 2/3] vmalloc: Support grouped page allocations
From: Edgecombe, Rick P @ 2021-04-05 21:49 UTC
  To: willy, Hansen, Dave
  Cc: luto, x86, dave.hansen, linux-mm, jeyu, peterz, linux-kernel,
	hch, akpm, ast, bpf, daniel, andrii

On Mon, 2021-04-05 at 22:32 +0100, Matthew Wilcox wrote:
> On Mon, Apr 05, 2021 at 02:01:58PM -0700, Dave Hansen wrote:
> > On 4/5/21 1:37 PM, Rick Edgecombe wrote:
> > > +static void __dispose_pages(struct list_head *head)
> > > +{
> > > +	struct list_head *cur, *next;
> > > +
> > > +	list_for_each_safe(cur, next, head) {
> > > +		list_del(cur);
> > > +
> > > +		/* The list head is stored at the start of the page */
> > > +		free_page((unsigned long)cur);
> > > +	}
> > > +}
> > 
> > This is interesting.
> > 
> > While the page is in the allocator, you're using the page contents
> > themselves to store the list_head.  It took me a minute to figure out
> > what you were doing here because "start of the page" is a bit
> > ambiguous.  It could mean:
> > 
> >  * the first 16 bytes in 'struct page'
> > or
> >  * the first 16 bytes in the page itself, aka *page_address(page)
> > 
> > The fact that this doesn't work on highmem systems makes this an OK
> > thing to do, but it is a bit weird.  It's also doubly susceptible to
> > bugs where there's a page_to_virt() or virt_to_page() screwup.
> > 
> > I was *hoping* there was still sufficient space in 'struct page' for
> > this second list_head in addition to page->lru.  I think there
> > *should* be.  That would at least make this allocator a bit more
> > "normal" in not caring about page contents while the page is free in
> > the allocator.  If you were able to do that you could do things like
> > kmemcheck or page alloc debugging while the page is in the allocator.
> > 
> > Anyway, I think I'd prefer that you *try* to use 'struct page' alone.
> > But, if that doesn't work out, please comment the snot out of this
> > thing because it _is_ weird.
> 
> Hi!  Current closest-thing-we-have-to-an-expert-on-struct-page here!
> 
> I haven't read over these patches yet.  If these pages are in use by
> vmalloc, they can't use mapping+index because get_user_pages() will
> call page_mapping() and the list_head will confuse it.  I think it
> could use index+private for a list_head.
> 
> If the pages are in the buddy, I _think_ mapping+index are free.
> private is in use for buddy order.  But I haven't read through the
> buddy code in a while.
> 
> Does it need to be a doubly linked list?  Can it be an hlist?

It does need to be a doubly linked list. I think they should never be
mapped to userspace. As far as the page allocator is concerned these
pages are not free. And they are not compound.

Originally I was just using the lru member. Would it be ok in that
case?
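
(That is, hypothetically, something like the following, with a made-up
cache_park_page():)

static void cache_park_page(struct list_head *head, struct page *page)
{
	/* Track the free page via its struct page, not its contents */
	list_add(&page->lru, head);
}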


