* Dynamically allocated memory descriptors
@ 2021-10-25 19:55 Matthew Wilcox
  2021-10-26 17:22 ` Kent Overstreet
  0 siblings, 1 reply; 3+ messages in thread
From: Matthew Wilcox @ 2021-10-25 19:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Kent Overstreet, Johannes Weiner

Kent asked:
> I ran into a major roadblock when I tried converting buddy allocator
> freelists to radix trees: freeing a page may require allocating a new
> page for the radix tree freelist, which is fine normally - we're freeing
> a page after all - but not if it's highmem. So right now I'm not sure
> if getting struct page down to two words is even possible. Oh well.

I don't think I can answer this without explaining the whole design
I have in mind, so here goes ... this is far more complicated than
I would like it to be, but I think it *works*.

Assuming that we have got the kernel to the point where no code uses
struct page as a memory descriptor; ie struct page is now:

struct page {
	unsigned long flags;
	unsigned long compound_head;
	unsigned long _pad[N];	/* for a sufficiently large value of N */
};

and all code uses helpers like slab_page() and page_slab() instead of
casting to and from struct page.  Let's further assume that we have:

struct buddy {
	unsigned long flags;
	struct list_head list;
	unsigned int order;
};

The next step is to transition to:

struct page {
	unsigned long mem_desc;
};

struct buddy {
	unsigned long flags;
	struct list_head list;
	unsigned int order;
	struct page *page;
};

(and similar changes for other memory descriptors, but I don't think we
need to discuss those at this time).  mem_desc is really a pointer, but
the bottom few bits are the type id of the memory descriptor (eg buddy,
slab, folio, anonmem, filemem, pgtable, vmalloc ... whatever types we
decide to create).
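
Concretely, I imagine the encoding looking something like this (the
tag width and all of the MEM_DESC* names are invented for the sketch,
not settled interfaces):

/* Sketch only: pack a descriptor pointer and its type id into one
 * word.  Four tag bits are an assumption; descriptors just need
 * enough alignment to keep the low bits free.
 */
#define MEM_DESC_TYPE_BITS	4
#define MEM_DESC_TYPE_MASK	((1UL << MEM_DESC_TYPE_BITS) - 1)

#define MEM_DESC_BUDDY		1
#define MEM_DESC_SLAB		2

#define MEM_DESC(ptr, type)	((unsigned long)(ptr) | (type))

static inline unsigned long mem_desc_type(struct page *page)
{
	return page->mem_desc & MEM_DESC_TYPE_MASK;
}

static inline struct slab *page_slab(struct page *page)
{
	if (mem_desc_type(page) != MEM_DESC_SLAB)
		return NULL;
	return (struct slab *)(page->mem_desc & ~MEM_DESC_TYPE_MASK);
}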

We must never allocate memory when freeing memory.  So we must allocate
the necessary struct buddy at the time we allocate the memory.  If we
cannot do so, we simply fail the allocation.

The simplest case to handle is something like __get_free_page().  If we
already have an order-0 page available, we take the page off the free
list, mark the page as allocated (perhaps we have a different type id
for allocated-buddy and free-buddy) and return its virtual address.
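
As a sketch (free_lists[] and the MEM_DESC_ALLOCATED_BUDDY type id are
invented for illustration; locking is omitted):

/* Sketch only: the easy case, where an order-0 buddy is already free. */
static unsigned long get_free_page_fast(gfp_t flags)
{
	struct buddy *buddy;

	if (list_empty(&free_lists[0]))
		return 0;	/* caller falls back to splitting */

	buddy = list_first_entry(&free_lists[0], struct buddy, list);
	list_del(&buddy->list);

	/* Retype the descriptor from free-buddy to allocated-buddy */
	buddy->page->mem_desc = MEM_DESC(buddy, MEM_DESC_ALLOCATED_BUDDY);

	return (unsigned long)page_address(buddy->page);
}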

If we don't have an order-0 page available when we call __get_free_page(),
we have to split a higher-order page.  Worst case, that's an order-11
page.  So we'll need to allocate 11 buddy structs to describe two order-0
pages, one order-1 page, one order-2 page, ... one order-9 page.  The new
order-10 page can be described by the former order-11 buddy descriptor.
This means changing the mem_desc pointers for 1024 struct pages, which
isn't ideal, but also not the end of the world.  We don't often have
order-11 pages around!
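
In code, the split might look roughly like this (buddy_split(),
set_mem_desc_range() and free_lists[] are invented for the sketch; the
buddy structs in @prealloc were allocated before committing to the
split, and locking is omitted):

/* Sketch only: split @b down to @order.  The existing descriptor
 * shrinks to cover the half that stays free; set_mem_desc_range()
 * stands in for repointing the struct pages that change descriptor.
 */
static struct buddy *buddy_split(struct buddy *b, unsigned int order,
		struct buddy **prealloc)
{
	while (b->order > order) {
		struct buddy *upper = *prealloc++;

		b->order--;
		upper->order = b->order;
		upper->page = b->page + (1UL << b->order);
		set_mem_desc_range(upper->page, 1UL << upper->order,
				   MEM_DESC(upper, MEM_DESC_BUDDY));

		/* The lower half keeps the old descriptor and stays
		 * free; keep splitting the upper half.
		 */
		list_add(&b->list, &free_lists[b->order]);
		b = upper;
	}
	return b;
}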

For something like slab which requires that struct page point to a
different memory descriptor, it's the job of slab to keep track of the
buddy struct and return it at free time.  So struct slab gets a 'struct
buddy' pointer.  I think that slub's alloc_slab_page() looks like:

static inline struct slab *alloc_slab_page(struct kmem_cache *s, gfp_t flags,
		int node, struct kmem_cache_order_objects oo)
{
	struct slab *slab;
	unsigned int order = oo_order(oo);
	unsigned long desc;
	struct buddy *buddy;

	/* The recursion stops here */
	if (s == kmem_cache_slab)
		return __alloc_slab(flags);
	slab = kmem_cache_alloc(kmem_cache_slab, flags);
	if (!slab)
		return NULL;
	desc = MEM_DESC(slab, MEM_DESC_SLAB);

	if (node == NUMA_NO_NODE)
		buddy = alloc_pages_desc(flags, order, desc);
	else
		buddy = __alloc_pages_node_desc(node, flags, order, desc);
	if (!buddy) {
		kmem_cache_free(kmem_cache_slab, slab);
		return NULL;
	}

	slab->buddy = buddy;
	slab->flags = buddy->flags;
	slab->order = order;
	slab->virt = page_address(buddy->page);

	return slab;
}

Allocating a slab from the slab allocator is slightly tricky.  Usually,
this will work fine, but at some point we need to allocate a struct slab
in order to allocate a struct slab, and that's the job of __alloc_slab().
It needs a new interface to the page allocator to return the smallest
buddy currently available (or that can be made available given the GFP
flags passed).  If it is order-0, we initialise the first few bytes of
it to be a struct slab and make it self-describing.  If it is order-1
or higher, we add the second page to kmem_cache_buddy (and allocate
a second struct slab on the first page to describe it).  We can then
allocate sufficient struct buddy on this second page to describe all
the buddy pages.
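
Something like this, perhaps (alloc_smallest_buddy() is the assumed
new page-allocator interface; everything else is illustrative):

/* Sketch only: bootstrap allocation of a struct slab when
 * kmem_cache_slab itself has no objects left.  alloc_smallest_buddy()
 * is the assumed new interface returning the smallest buddy currently
 * available (or that can be made available given @flags).
 */
static struct slab *__alloc_slab(gfp_t flags)
{
	struct buddy *buddy = alloc_smallest_buddy(flags);
	struct slab *slab;

	if (!buddy)
		return NULL;

	/* Make the page self-describing: the struct slab occupies the
	 * first bytes of the page it describes.
	 */
	slab = page_address(buddy->page);
	slab->buddy = buddy;
	slab->flags = buddy->flags;
	slab->order = buddy->order;
	slab->virt = slab;

	/* For order-1 and higher, the extra pages would be handed to
	 * kmem_cache_buddy as described above; that step is omitted
	 * here.
	 */
	return slab;
}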

Here's how I see the worst case call chain going (using slub):

alloc_pages(order = 0)
  kmem_cache_alloc_bulk(kmem_cache_buddy)
    ___slab_alloc(kmem_cache_buddy)
      new_slab(kmem_cache_buddy)
        allocate_slab(kmem_cache_buddy)
          alloc_slab_page(kmem_cache_buddy)
            kmem_cache_alloc(kmem_cache_slab)
              ___slab_alloc(kmem_cache_slab)
                new_slab(kmem_cache_slab)
                  allocate_slab(kmem_cache_slab)
                    alloc_slab_page(kmem_cache_slab)
                      __alloc_slab()
            alloc_pages()
              kmem_cache_alloc_bulk(kmem_cache_buddy)
                ___slab_alloc(kmem_cache_buddy)
                  new_slab(kmem_cache_buddy)
                    allocate_slab(kmem_cache_buddy)
                      alloc_slab_page(kmem_cache_buddy)
                        kmem_cache_alloc(kmem_cache_slab)
                          ___slab_alloc(kmem_cache_slab)
                            new_slab(kmem_cache_slab)
                              allocate_slab(kmem_cache_slab)
                                alloc_slab_page(kmem_cache_slab)
                                  __alloc_slab()

To actually hit this depth of nesting, other processes/interrupts/threads
must be simultaneously consuming both slab structs and buddy structs, as
well as conspiring to keep low-order buddy lists empty while not depleting
all the high-order buddy lists.



* Re: Dynamically allocated memory descriptors
  2021-10-25 19:55 Dynamically allocated memory descriptors Matthew Wilcox
@ 2021-10-26 17:22 ` Kent Overstreet
  2021-10-27 12:14   ` David Hildenbrand
  0 siblings, 1 reply; 3+ messages in thread
From: Kent Overstreet @ 2021-10-26 17:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Johannes Weiner

On Mon, Oct 25, 2021 at 08:55:21PM +0100, Matthew Wilcox wrote:
> Kent asked:
> > I ran into a major roadblock when I tried converting buddy allocator
> > freelists to radix trees: freeing a page may require allocating a new
> > page for the radix tree freelist, which is fine normally - we're freeing
> > a page after all - but not if it's highmem. So right now I'm not sure
> > if getting struct page down to two words is even possible. Oh well.
> 
> I don't think I can answer this without explaining the whole design
> I have in mind, so here goes ... this is far more complicated than
> I would like it to be, but I think it *works*.

So you've got two separately allocated structs per compound page - struct buddy,
for allocator/freelist state, and struct folio or slab or whatever, for
allocatee state. This lets you get struct page - our 4k page tax - down to a
single pointer.

But the shenanigans required for separately allocating struct buddy make me want
to go back to my proposal :)

The difference between your proposal and mine is that in mine, we don't
separately allocate struct buddy; instead, we only shrink struct page down to
two words/pointers, not one. We can get the state for a free page down to two
words if we replace the doubly linked freelists with a deque implemented as a
radix tree: the second word in struct page will be a pointer to allocatee state
for allocated pages, but for free pages it will be an index into the freelist.
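
Concretely, the layout I have in mind is something like (field names
are invented for the sketch):

struct page {
	unsigned long flags;		/* allocator state, incl. order */
	union {
		unsigned long mem_desc;	/* allocated: descriptor + type */
		unsigned long free_idx;	/* free: index into the deque */
	};
};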

As you also noted, splitting page->flags up between allocator state and
allocatee state (i.e. moving some of it to the folio) means we'll be able to fit
compound/buddy order in page->flags; that becomes the allocator state word in my
model.

The issue I ran into was where we have to allocate new pages for the freelist
radix tree: normally there's no issue here because we can just consume the page
we're trying to free. But if the page is highmem - oof.

So I've been kicking around the idea of implementing a version of my
lib/generic-radix-tree.c code where we use the low bit of pointers to nodes to
indicate when they're highmem pages that need to be kmap_local()'d. I think
that done this way, the performance overhead will be negligible in practice.
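
Something like this is what I have in mind (the node encoding and
helper names are a sketch; kmap_local_page()/kunmap_local() are the
existing interfaces):

/* Sketch only: tag the low bit of a radix tree node pointer to mark
 * highmem nodes, which have no permanent kernel mapping and must be
 * mapped before use.
 */
#define GENRADIX_NODE_HIGHMEM	1UL

static void *genradix_map_node(unsigned long n)
{
	if (n & GENRADIX_NODE_HIGHMEM)
		return kmap_local_page((struct page *)(n & ~GENRADIX_NODE_HIGHMEM));
	return (void *)n;	/* lowmem: stored as a virtual address */
}

static void genradix_unmap_node(unsigned long n, void *node)
{
	if (n & GENRADIX_NODE_HIGHMEM)
		kunmap_local(node);
}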

So, I'm gonna cook this up and see how it comes out...



* Re: Dynamically allocated memory descriptors
  2021-10-26 17:22 ` Kent Overstreet
@ 2021-10-27 12:14   ` David Hildenbrand
  0 siblings, 0 replies; 3+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:14 UTC (permalink / raw)
  To: Kent Overstreet, Matthew Wilcox; +Cc: linux-mm, Johannes Weiner

On 26.10.21 19:22, Kent Overstreet wrote:
> On Mon, Oct 25, 2021 at 08:55:21PM +0100, Matthew Wilcox wrote:
>> Kent asked:
>>> I ran into a major roadblock when I tried converting buddy allocator
>>> freelists to radix trees: freeing a page may require allocating a new
>>> page for the radix tree freelist, which is fine normally - we're freeing
>>> a page after all - but not if it's highmem. So right now I'm not sure
>>> if getting struct page down to two words is even possible. Oh well.
>>
>> I don't think I can answer this without explaining the whole design
>> I have in mind, so here goes ... this is far more complicated than
>> I would like it to be, but I think it *works*.
> 
> So you've got two separately allocated structs per compound page - struct buddy,
> for allocator/freelist state, and struct folio or slab or whatever, for
> allocatee state. This lets you get struct page - our 4k page tax - down to a
> single pointer.
> 
> But the shenanigans required for separately allocating struct buddy make me want
> to go back to my proposal :)
> 
> The difference between your proposal and mine is that in mine, we don't
> separately allocate struct buddy; instead, we only shrink struct page down to
> two words/pointers, not one. We can get the state for a free page down to two
> words if we replace the doubly linked freelists with a deque implemented as a
> radix tree: the second word in struct page will be a pointer to allocatee
> state for allocated pages, but for free pages it will be an index into the
> freelist.
> 
> As you also noted, splitting page->flags up between allocator state and
> allocatee state (i.e. moving some of it to the folio) means we'll be able to fit
> compound/buddy order in page->flags; that becomes the allocator state word in my
> model.
> 
> The issue I ran into was where we have to allocate new pages for the freelist
> radix tree: normally there's no issue here because we can just consume the page
> we're trying to free. But if the page is highmem - oof.

ZONE_MOVABLE and MIGRATE_CMA are similarly problematic, no?

-- 
Thanks,

David / dhildenb




