From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: Kent Overstreet <kent.overstreet@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Dynamically allocated memory descriptors
Date: Mon, 25 Oct 2021 20:55:21 +0100
Message-ID: <YXcLqcFhDq3uUwIj@casper.infradead.org>

Kent asked:
> I ran into a major roadblock when I tried converting buddy allocator
> freelists to radix trees: freeing a page may require allocating a new
> page for the radix tree freelist, which is fine normally - we're freeing
> a page after all - but not if it's highmem. So right now I'm not sure
> if getting struct page down to two words is even possible. Oh well.

I don't think I can answer this without explaining the whole design
I have in mind, so here goes ... this is far more complicated than
I would like it to be, but I think it *works*.

Assume we have got the kernel to the point where no code uses
struct page as a memory descriptor, i.e. struct page is now:

struct page {
	unsigned long flags;
	unsigned long compound_head;
	unsigned long _pad[N];	/* for sufficiently large value of N */
};

and that all code uses helpers like slab_page() and page_slab() instead of
casting to and from struct page.  Let's further assume that we have:

struct buddy {
	unsigned long flags;
	struct list_head list;
	unsigned int order;
};

The next step is to transition to:

struct page {
	unsigned long mem_desc;
};

struct buddy {
	unsigned long flags;
	struct list_head list;
	unsigned int order;
	struct page *page;
};

(and similar changes for other memory descriptors, but I don't think we
need to discuss those at this time).  mem_desc is really a pointer, but
the bottom few bits are the type id of the memory descriptor (e.g. buddy,
slab, folio, anonmem, filemem, pgtable, vmalloc ... whatever types we
decide to create).
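
For concreteness, the packing might look something like this (the type
names, MEM_DESC() macro and page_buddy() helper are purely illustrative,
not anything that exists today):

enum mem_desc_type {
	MEM_DESC_BUDDY,		/* buddy allocator page */
	MEM_DESC_SLAB,
	MEM_DESC_FOLIO,
	/* anonmem, filemem, pgtable, vmalloc, ... */
};

/* Assumes all descriptors are at least 16-byte aligned */
#define MEM_DESC_TYPE_MASK	((1UL << 4) - 1)
#define MEM_DESC(ptr, type)	((unsigned long)(ptr) | (type))

static inline unsigned long mem_desc_type(const struct page *page)
{
	return page->mem_desc & MEM_DESC_TYPE_MASK;
}

static inline struct buddy *page_buddy(const struct page *page)
{
	return (struct buddy *)(page->mem_desc & ~MEM_DESC_TYPE_MASK);
}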

We must never allocate memory when freeing memory.  So we must allocate
the necessary struct buddy at the time we allocate the memory.  If we
cannot do so, we simply fail the allocation.
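
So the free path only ever consumes a descriptor that already exists.
Roughly (free_page_desc() and the bare free_area[] are simplified
stand-ins; real freelists are per-zone and per-migratetype):

void free_page_desc(struct page *page)
{
	struct buddy *buddy = page_buddy(page);

	/* No allocation here: the buddy struct was allocated with the page */
	page->mem_desc = MEM_DESC(buddy, MEM_DESC_BUDDY);
	list_add(&buddy->list, &free_area[buddy->order].free_list);
}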

The simplest case to handle is something like __get_free_page().  If we
already have an order-0 page available, we take the page off the free
list, mark the page as allocated (perhaps we have a different type id
for allocated-buddy and free-buddy) and return its virtual address.
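
In sketch form (take_free_buddy() and __get_free_page_slowpath() are
made-up names standing in for the real freelist handling):

unsigned long __get_free_page(gfp_t gfp)
{
	struct buddy *buddy = take_free_buddy(0);	/* pop an order-0 buddy */

	if (!buddy)
		return __get_free_page_slowpath(gfp);	/* split a larger buddy */

	/* Retag the page (perhaps with an allocated-buddy type id); nothing is allocated */
	buddy->page->mem_desc = MEM_DESC(buddy, MEM_DESC_BUDDY);

	return (unsigned long)page_address(buddy->page);
}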

If we don't have an order-0 page available when we call __get_free_page(),
we have to split a higher-order page.  Worst case, that's an order-11
page.  So we'll need to allocate 11 buddy structs to describe two order-0
pages, one order-1 page, one order-2 page, ... one order-9 page.  The new
order-10 page can be described by the former order-11 buddy descriptor.
This means changing the mem_desc pointers for 1024 struct pages, which
isn't ideal, but also not the end of the world.  We don't often have
order-11 pages around!
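
A sketch of the split, glossing over exactly which fragment reuses the
original descriptor (set_buddy_desc() and free_area[] are again only
illustrative):

static struct buddy *split_buddy(struct buddy *high, unsigned int target,
				 struct buddy **new_bd)
{
	/* new_bd[] holds buddy structs allocated before we started splitting */
	while (high->order > target) {
		unsigned int order = --high->order;
		struct buddy *half = *new_bd++;

		half->order = order;
		half->page = high->page + (1UL << order);
		/* point the 2^order struct pages in this half at 'half' */
		set_buddy_desc(half->page, order, half);
		list_add(&half->list, &free_area[order].free_list);
	}
	return high;	/* now describes an order-'target' buddy */
}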

For something like slab which requires that struct page point to a
different memory descriptor, it's the job of slab to keep track of the
buddy struct and return it at free time.  So struct slab gets a 'struct
buddy' pointer.  I think that slub's alloc_slab_page() looks like:

static inline struct slab *alloc_slab_page(struct kmem_cache *s, gfp_t flags,
		int node, struct kmem_cache_order_objects oo)
{
	struct slab *slab;
	unsigned int order = oo_order(oo);
	unsigned long desc;
	struct buddy *buddy;

	/* The recursion stops here */
	if (s == kmem_cache_slab)
		return __alloc_slab(flags);
	slab = kmem_cache_alloc(kmem_cache_slab, flags);
	if (!slab)
		return NULL;
	desc = MEM_DESC(slab, MEM_DESC_SLAB);

	if (node == NUMA_NO_NODE)
		buddy = alloc_pages_desc(flags, order, desc);
	else
		buddy = __alloc_pages_node_desc(node, flags, order, desc);
	if (!buddy) {
		kmem_cache_free(kmem_cache_slab, slab);
		return NULL;
	}

	slab->buddy = buddy;
	slab->flags = buddy->flags;
	slab->order = order;
	slab->virt = page_address(buddy->page);

	return slab;
}

Allocating a slab from the slab allocator is slightly tricky.  Usually,
this will work fine, but at some point we need to allocate a struct slab
in order to allocate a struct slab, and that's the job of __alloc_slab().
It needs a new interface to the page allocator to return the smallest
buddy currently available (or that can be made available given the GFP
flags passed).  If it is order-0, we initialise the first few bytes of
it to be a struct slab and make it self-describing.  If it is order-1
or higher, we add the second page to kmem_cache_buddy (and allocate
a second struct slab on the first page to describe it).  We can then
allocate enough buddy structs on this second page to describe all
the buddy pages.
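
Roughly, with alloc_smallest_buddy() being the new page-allocator
interface mentioned above and everything else equally hypothetical:

static struct slab *__alloc_slab(gfp_t flags)
{
	struct buddy *buddy = alloc_smallest_buddy(flags);
	struct slab *slab;

	if (!buddy)
		return NULL;

	/* The struct slab lives in the first bytes of the page it describes */
	slab = page_address(buddy->page);
	slab->buddy = buddy;
	slab->flags = buddy->flags;
	slab->order = buddy->order;
	slab->virt = page_address(buddy->page);
	buddy->page->mem_desc = MEM_DESC(slab, MEM_DESC_SLAB);

	if (buddy->order > 0) {
		/*
		 * Order-1 or higher: carve a second struct slab out of this
		 * page to describe the second page, hand that page to
		 * kmem_cache_buddy, and allocate the buddy structs from it.
		 * Omitted from this sketch.
		 */
	}

	return slab;
}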

Here's how I see the worst case call chain going (using slub):

alloc_pages(order = 0)
  kmem_cache_alloc_bulk(kmem_cache_buddy)
    ___slab_alloc(kmem_cache_buddy)
      new_slab(kmem_cache_buddy)
        allocate_slab(kmem_cache_buddy)
          alloc_slab_page(kmem_cache_buddy)
            kmem_cache_alloc(kmem_cache_slab)
              ___slab_alloc(kmem_cache_slab)
                new_slab(kmem_cache_slab)
                  allocate_slab(kmem_cache_slab)
                    alloc_slab_page(kmem_cache_slab)
                      __alloc_slab()
            alloc_pages()
              kmem_cache_alloc_bulk(kmem_cache_buddy)
                ___slab_alloc(kmem_cache_buddy)
                  new_slab(kmem_cache_buddy)
                    allocate_slab(kmem_cache_buddy)
                      alloc_slab_page(kmem_cache_buddy)
                        kmem_cache_alloc(kmem_cache_slab)
                           ___slab_alloc(kmem_cache_slab)
                             new_slab(kmem_cache_slab)
                               allocate_slab(kmem_cache_slab)
                                 alloc_slab_page(kmem_cache_slab)
                                   __alloc_slab()

To actually hit this depth of nesting, other processes/interrupts/threads
must be simultaneously consuming both slab structs and buddy structs, as
well as conspiring to keep low-order buddy lists empty while not depleting
all the high-order buddy lists.

