Re: [PATCH] slab: introduce the flag SLAB_MINIMIZE_WASTE

From: Mikulas Patocka <mpatocka@redhat.com>
To: Christopher Lameter <cl@linux.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, dm-devel@redhat.com,
	Mike Snitzer <msnitzer@redhat.com>
Subject: Re: [PATCH] slab: introduce the flag SLAB_MINIMIZE_WASTE
Date: Wed, 21 Mar 2018 14:36:58 -0400 (EDT)
Message-ID: <alpine.LRH.2.02.1803211425330.26409@file01.intranet.prod.int.rdu2.redhat.com>
In-Reply-To: <alpine.DEB.2.20.1803211226350.3174@nuc-kabylake>

On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> 
> > > You should not be using the slab allocators for these. Allocate higher
> > > order pages or numbers of consecutive smaller pages from the page
> > > allocator. The slab allocators are written for objects smaller than page
> > > size.
> >
> > So, do you argue that I need to write my own slab cache functionality
> > instead of using the existing slab code?
> 
> Just use the existing page allocator calls to allocate and free the
> memory you need.
> 
> > I can do it - but duplicating code is a bad thing.
> 
> There is no need to duplicate anything. There is lots of infrastructure
> already in the kernel. You just need to use the right allocation / freeing
> calls.

So, what would you recommend for allocating 640KB objects while minimizing 
wasted space?
* alloc_pages - rounds up to the next power of two
* kmalloc - rounds up to the next power of two
* alloc_pages_exact - O(n*log n) complexity; and causes memory
  fragmentation if used excessively
* vmalloc - horrible performance (modifies page tables and that causes 
  synchronization across all CPUs)

anything else?

The slab cache with a large order seems to be the best choice for this.
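
To put numbers on the waste, here is a minimal userspace sketch. It assumes 4 KiB pages and ignores any per-object metadata the slab allocator may add; the names SKETCH_PAGE_SIZE, round_up_pow2 and slab_waste are made up for illustration and are not kernel functions. Under these assumptions, rounding 640KB up to a power of two gives 1MB and wastes 384KB per object (37.5%), while an order-9 slab (2MB) holds three objects and leaves 128KB unused (6.25%).

```c
#include <assert.h>

/* Assumption: 4 KiB base pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Smallest power of two that is >= size. */
static unsigned long round_up_pow2(unsigned long size)
{
	unsigned long p = 1;

	assert(size > 0);
	while (p < size)
		p <<= 1;
	return p;
}

/*
 * Bytes left over in one slab of the given page order after packing
 * as many obj_size objects as fit. If the slab is smaller than one
 * object, the whole slab counts as unused.
 */
static unsigned long slab_waste(unsigned int order, unsigned long obj_size)
{
	unsigned long slab_bytes = SKETCH_PAGE_SIZE << order;

	assert(obj_size > 0);
	return slab_bytes % obj_size;
}
```

For a 640KB object, slab_waste() gives 384KB at order 8, 128KB at order 9 and 256KB at order 10, so the waste per slab depends heavily on which order is chosen.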

> > > What kind of problem could be caused here?
> >
> > Unlocked accesses are generally considered bad. For example, see this
> > piece of code in calculate_sizes:
> >         s->allocflags = 0;
> >         if (order)
> >                 s->allocflags |= __GFP_COMP;
> >
> >         if (s->flags & SLAB_CACHE_DMA)
> >                 s->allocflags |= GFP_DMA;
> >
> >         if (s->flags & SLAB_RECLAIM_ACCOUNT)
> >                 s->allocflags |= __GFP_RECLAIMABLE;
> >
> > If you are running this while the cache is in use (i.e. when the user
> > writes /sys/kernel/slab/<cache>/order), then other processes will see
> > invalid s->allocflags for a short time.
> 
> Calculating sizes is done when the slab has only a single accessor. Thus
> no locking is needed.

The calculation is done whenever someone writes to 
"/sys/kernel/slab/*/order"

And you can obviously write to that file while the slab cache is in use. Try
it.

So, the function calculate_sizes can actually race with allocation from 
the slab cache.
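
One way to close the window in the snippet quoted above is to build the new value in a local variable and publish it with a single store, so that a concurrent reader sees either the old or the new allocflags and never the intermediate zero. The following is a userspace sketch only, using C11 atomics: the struct sk_cache type, the SK_* bit values and both function names are hypothetical placeholders, not the kernel's definitions. It covers only allocflags, not the other fields that a recalculation may touch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Placeholder bit values for illustration; not the kernel's definitions. */
#define SK_GFP_COMP             0x1u
#define SK_GFP_DMA              0x2u
#define SK_GFP_RECLAIMABLE      0x4u
#define SK_SLAB_CACHE_DMA       0x1u
#define SK_SLAB_RECLAIM_ACCOUNT 0x2u

/* Hypothetical stand-in for the cache descriptor. */
struct sk_cache {
	unsigned int flags;
	atomic_uint allocflags;
};

/* Compute the complete flag word without touching shared state. */
static unsigned int sk_compute_allocflags(const struct sk_cache *s,
					  unsigned int order)
{
	unsigned int af = 0;

	assert(s);
	if (order)
		af |= SK_GFP_COMP;
	if (s->flags & SK_SLAB_CACHE_DMA)
		af |= SK_GFP_DMA;
	if (s->flags & SK_SLAB_RECLAIM_ACCOUNT)
		af |= SK_GFP_RECLAIMABLE;
	return af;
}

/* Publish the finished value with one store; readers never see 0. */
static void sk_update_allocflags(struct sk_cache *s, unsigned int order)
{
	atomic_store(&s->allocflags, sk_compute_allocflags(s, order));
}
```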

> Changing the size of objects in a slab cache when there is already a set
> of objects allocated and under management by the slab cache would
> cause the allocator to fail and lead to garbled data.

I am not talking about changing the size of objects in a slab cache. I am 
talking about changing the allocation order of a slab cache while the 
cache is in use. This can be done with the sysfs interface.

Mikulas