linux-kernel.vger.kernel.org archive mirror
* How reliable is SLAB_HWCACHE_ALIGN?
@ 2003-09-10  8:16 Ravikiran G Thirumalai
  2003-09-10 15:41 ` Robert Love
  0 siblings, 1 reply; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-10  8:16 UTC (permalink / raw)
  To: linux-kernel

I was assuming that if you create a slab cache with SLAB_HWCACHE_ALIGN,
objects are guaranteed to be aligned on an L1 cacheline boundary.  But this
piece of code in kmem_cache_create has raised doubts.

----------------------------------------------------------------------------
        if (flags & SLAB_HWCACHE_ALIGN) {
                /* Need to adjust size so that objs are cache aligned. */
                /* Small obj size, can get at least two per cache line. */
                while (size < align/2)
                        align /= 2;
                size = (size+align-1)&(~(align-1));
        }
----------------------------------------------------------------------------

Am I missing something or can there really be two objects on the same 
cacheline even when SLAB_HWCACHE_ALIGN is specified?
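
To make the doubt concrete (numbers assumed purely for illustration): with a
32-byte L1 line and a 12-byte object, the loop halves align to 16, so two
objects end up sharing one cache line:

----------------------------------------------------------------------------
/* Illustration only (not kernel code): what the snippet above computes
 * for a hypothetical 12-byte object and a 32-byte L1 cache line.
 */
#include <stdio.h>

int main(void)
{
	unsigned long size = 12, align = 32;	/* assumed values          */

	while (size < align/2)			/* 12 < 16: halve once     */
		align /= 2;			/* align is now 16         */
	size = (size+align-1)&(~(align-1));	/* size rounds up to 16    */

	/* two 16-byte objects now fit in one 32-byte cache line */
	printf("align=%lu size=%lu\n", align, size);
	return 0;
}
----------------------------------------------------------------------------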

Thanks,
Kiran


* Re: How reliable is SLAB_HWCACHE_ALIGN?
  2003-09-10  8:16 How reliable is SLAB_HWCACHE_ALIGN? Ravikiran G Thirumalai
@ 2003-09-10 15:41 ` Robert Love
  2003-09-11  5:54   ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 14+ messages in thread
From: Robert Love @ 2003-09-10 15:41 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: linux-kernel

On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:

> Am I missing something or can there really be two objects on the same 
> cacheline even when SLAB_HWCACHE_ALIGN is specified?

No, you are right.

If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.

But note that this will result in cache-aligning objects even if it ends
up wasting a _lot_ of space, such as when debugging is turned on.

I think we only use it for the task_struct_cachep.
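
For reference, a caller requesting forced alignment looks something like this
in the 2.6 API (placeholder names, not an actual call site):

----------------------------------------------------------------------------
/* Sketch only -- "my_cachep"/"struct my_obj" are placeholder names. */
my_cachep = kmem_cache_create("my_objs",
			      sizeof(struct my_obj),	/* object size    */
			      0,			/* colour offset  */
			      SLAB_MUST_HWCACHE_ALIGN,	/* force L1 align */
			      NULL, NULL);		/* no ctor/dtor   */
----------------------------------------------------------------------------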

	Robert Love




* Re: How reliable is SLAB_HWCACHE_ALIGN?
  2003-09-10 15:41 ` Robert Love
@ 2003-09-11  5:54   ` Ravikiran G Thirumalai
  2003-09-11 11:08     ` [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN Ravikiran G Thirumalai
  0 siblings, 1 reply; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-11  5:54 UTC (permalink / raw)
  To: Robert Love; +Cc: linux-kernel

On Wed, Sep 10, 2003 at 11:41:04AM -0400, Robert Love wrote:
> On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:
> 
> > Am I missing something or can there really be two objects on the same 
> > cacheline even when SLAB_HWCACHE_ALIGN is specified?
> 
> No, you are right.
> 
> If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.
> 

WOW!!!
Looking at the code though, SLAB_MUST_HWCACHE_ALIGN is never considered
by kmem_cache_create or kmem_cache_alloc. So, right now, there is no way of 
getting objects which are _guaranteed_ to be cacheline aligned!!!(?)

Kiran


* [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-11  5:54   ` Ravikiran G Thirumalai
@ 2003-09-11 11:08     ` Ravikiran G Thirumalai
  2003-09-11 16:19       ` Manfred Spraul
  0 siblings, 1 reply; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-11 11:08 UTC (permalink / raw)
  To: akpm, manfred; +Cc: linux-kernel, Robert Love, dipankar

On Thu, Sep 11, 2003 at 11:24:30AM +0530, Ravikiran G Thirumalai wrote:
> On Wed, Sep 10, 2003 at 11:41:04AM -0400, Robert Love wrote:
> > On Wed, 2003-09-10 at 04:16, Ravikiran G Thirumalai wrote:
> > 
> > > Am I missing something or can there really be two objects on the same 
> > > cacheline even when SLAB_HWCACHE_ALIGN is specified?
> > 
> > No, you are right.
> > 
> > If your object _must_ be cache aligned, use SLAB_MUST_HWCACHE_ALIGN.
> > 
> 
> WOW!!!
> Looking at the code though, SLAB_MUST_HWCACHE_ALIGN is never considered
> by kmem_cache_create or kmem_cache_alloc. So, right now, there is no way of 
> getting objects which are _guaranteed_ to be cacheline aligned!!!(?)
>

Hi Andrew, Manfred
Looks like SLAB_HWCACHE_ALIGN does not guarantee cacheline alignment
and SLAB_MUST_HWCACHE_ALIGN is not at all recognised by the slab code.
(Right now, SLAB_MUST_HWCACHE_ALIGN caches are aligned to sizeof (void *)!!)

The following patch fixes the slab allocator for this.  Note that I have also
replaced the open-coded alignment arithmetic with the ALIGN macro.  I have
done some basic testing on a 4-way pIII with a 32-byte cacheline -- but the
patch probably needs testing on other archs.  Please give it a whirl on the
mm* trees.  This should help the task_struct and pte_chain caches (the other
users of SLAB_MUST_HWCACHE_ALIGN also pass SLAB_HWCACHE_ALIGN -- smart, huh?)
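
For reference, ALIGN() is the kernel's generic round-up helper, i.e. roughly
the same arithmetic the open-coded version does:

----------------------------------------------------------------------------
/* ALIGN() rounds x up to the next multiple of a (a must be a power of 2). */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))
/* e.g. ALIGN(20, 32) == 32, ALIGN(33, 32) == 64 */
----------------------------------------------------------------------------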

I also propose to rename SLAB_HWCACHE_ALIGN to SLAB_MAY_HWCACHE_ALIGN since
SLAB_HWCACHE_ALIGN seems very misleading -- if everyone agrees.

Andrew, this also means alloc_percpu in its current form is not guaranteed
to work on machines with cacheline sizes greater than 32 bytes.  I am
working on a fix for alloc_percpu.

Thanks,
Kiran


diff -ruN -X dontdiff.1 linux-2.6.0-test5-mm1/mm/slab.c slabfix-2.6.0-test5-mm1/mm/slab.c
--- linux-2.6.0-test5-mm1/mm/slab.c	Thu Sep 11 15:08:37 2003
+++ slabfix-2.6.0-test5-mm1/mm/slab.c	Thu Sep 11 15:23:28 2003
@@ -1101,7 +1101,7 @@
 #endif
 #endif
 	align = BYTES_PER_WORD;
-	if (flags & SLAB_HWCACHE_ALIGN)
+	if ((flags & SLAB_HWCACHE_ALIGN) | (flags & SLAB_MUST_HWCACHE_ALIGN))
 		align = L1_CACHE_BYTES;
 
 	/* Determine if the slab management is 'on' or 'off' slab. */
@@ -1112,12 +1112,14 @@
 		 */
 		flags |= CFLGS_OFF_SLAB;
 
-	if (flags & SLAB_HWCACHE_ALIGN) {
+	if (flags & SLAB_MUST_HWCACHE_ALIGN) {
 		/* Need to adjust size so that objs are cache aligned. */
+		size = ALIGN(size, align);
+	} else if (flags & SLAB_HWCACHE_ALIGN) {
 		/* Small obj size, can get at least two per cache line. */
 		while (size <= align/2)
 			align /= 2;
-		size = (size+align-1)&(~(align-1));
+		size = ALIGN(size, align);
 	}
 
 	/* Cal size (in pages) of slabs, and the num of objs per slab.
@@ -1175,8 +1177,7 @@
 	}
 
 	/* Offset must be a multiple of the alignment. */
-	offset += (align-1);
-	offset &= ~(align-1);
+	offset = ALIGN(offset, align);
 	if (!offset)
 		offset = L1_CACHE_BYTES;
 	cachep->colour_off = offset;


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-11 11:08     ` [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN Ravikiran G Thirumalai
@ 2003-09-11 16:19       ` Manfred Spraul
  2003-09-11 21:49         ` Manfred Spraul
  2003-09-12  8:59         ` Ravikiran G Thirumalai
  0 siblings, 2 replies; 14+ messages in thread
From: Manfred Spraul @ 2003-09-11 16:19 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: akpm, linux-kernel, Robert Love, dipankar

Ravikiran G Thirumalai wrote:

>Hi Andrew, Manfred
>Looks like SLAB_HWCACHE_ALIGN does not guarantee cacheline alignment
>and SLAB_MUST_HWCACHE_ALIGN is not at all recognised by the slab code.
>(Right now, SLAB_MUST_HWCACHE_ALIGN caches are aligned to sizeof (void *)!!)
>  
>
Correct.
SLAB_HWCACHE_ALIGN is a hint, which is always honoured except with
debugging turned on. With debugging off, it's equivalent to
MUST_HWCACHE_ALIGN. The reason debugging turns it off is to break
drivers that use kmalloc+virt_to_phys instead of pci_pool. IMHO that's a
good cause, so I would like to leave that unchanged.

SLAB_MUST_HWCACHE_ALIGN guarantees alignment to the smaller of the L1
cache line size and the object size. It was added for PAE support: the
3rd-level page tables must be 32-byte aligned. It's only intended for
the PAE buffers. No one else is supposed to use it, i.e. the right
change for the pte cache and the task cache is s/_MUST_//.

But there are problems with the current implementation:
- some drivers/archs are too difficult to fix. James Bottomley asked me
to add a switch for archs that cannot convert to pci_dma completely:
basically guarantee that all cachelines touched by an object are
exclusive to the object. Left-over bytes must not be used, since they
could be overwritten randomly by the hardware.
- Russell King once asked me for the ability to get 1024-byte aligned
objects. IIRC ARM needs that for its page tables.
- If I understand you correctly, you want to avoid false sharing of the
per-cpu buffers that back alloc_percpu, correct?
I have two concerns:
- what about users that compile a kernel for a pIII and then run it on a
p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.
- what about NUMA systems? IMHO the per-cpu buffers should definitely
be from the node nearest to the target cpu. Unfortunately slab has no
concept of nodes right now. There was a patch, but it's quite intrusive
and was never fine-tuned. Thus we must either ignore NUMA, or alloc_percpu
would have to use its own allocator. And: which cacheline size is
relevant for NUMA? Is L1==L2==Ln_CACHE_BYTES on all archs?

IMHO the right solution is:
- remove SLAB_MUST_HWCACHE_ALIGN and SLAB_HWCACHE_ALIGN. The API is
broken. Always align everything to L1_CACHE_BYTES.
- rename the "offset" parameter to "align", and use that to support
explicit alignment on a byte boundary (see the sketch below). I have a
patch somewhere; it's not very large.
- alloc_percpu should set align to the real cache line size from cpu_data.
- add minimal NUMA support: a very inefficient
"kmem_cache_alloc_forcpu(cachep, flags, target_cpu)". Not that difficult
if I sacrifice performance [no lockless per-cpu buffers, etc.].

What do you think? Possible now, or too intrusive before 2.6.0? The 
align patch is not much larger than the patch Ravikiran attached. The 
minimal numa patch wouldn't contain any core changes either.
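
For illustration, with the renamed parameter a caller that needs an exact
alignment would simply pass it, e.g. something like the sketch below (this
mirrors the pmd cache conversion in the patch posted later in this thread):

----------------------------------------------------------------------------
/* Sketch: the third parameter is now "align" rather than "offset". */
pmd_cache = kmem_cache_create("pmd",
			      PTRS_PER_PMD*sizeof(pmd_t),	/* size  */
			      PTRS_PER_PMD*sizeof(pmd_t),	/* align */
			      0,				/* flags */
			      pmd_ctor, NULL);
----------------------------------------------------------------------------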

--
    Manfred



* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-11 16:19       ` Manfred Spraul
@ 2003-09-11 21:49         ` Manfred Spraul
  2003-09-12  8:59         ` Ravikiran G Thirumalai
  1 sibling, 0 replies; 14+ messages in thread
From: Manfred Spraul @ 2003-09-11 21:49 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: akpm, linux-kernel, Robert Love, dipankar

[-- Attachment #1: Type: text/plain, Size: 541 bytes --]

Attached is a forward port of my arbitrary-align patch from 2.4.something.
Only partially tested. And a small correction: rmap needs alignment to
its own object size; it crashes immediately if slab does not provide
that. It's probably a good idea to use the new API to decouple the
pte_chain size from L1_CACHE_SIZE: I'd bet that 32 bytes for a pIII
is too small.

I'm still thinking about how to implement kmem_cache_alloc_forcpu() without
having to change too much in slab - it's not as simple as I assumed
initially.

--
    Manfred

[-- Attachment #2: patch-slab-alignobj --]
[-- Type: text/plain, Size: 9835 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 6
//  SUBLEVEL = 0
//  EXTRAVERSION = -test5-mm1
--- 2.6/mm/slab.c	2003-09-11 22:30:47.000000000 +0200
+++ build-2.6/mm/slab.c	2003-09-11 23:42:16.000000000 +0200
@@ -268,6 +268,7 @@
 	unsigned int		colour_off;	/* colour offset */
 	unsigned int		colour_next;	/* cache colouring */
 	kmem_cache_t		*slabp_cache;
+	unsigned int		slab_size;
 	unsigned int		dflags;		/* dynamic flags */
 
 	/* constructor func */
@@ -488,7 +489,7 @@
 	.objsize	= sizeof(kmem_cache_t),
 	.flags		= SLAB_NO_REAP,
 	.spinlock	= SPIN_LOCK_UNLOCKED,
-	.colour_off	= L1_CACHE_BYTES,
+	.colour_off	= SMP_CACHE_BYTES,
 	.name		= "kmem_cache",
 };
 
@@ -523,7 +524,7 @@
 static void enable_cpucache (kmem_cache_t *cachep);
 
 /* Cal the num objs, wastage, and bytes left over for a given slab size. */
-static void cache_estimate (unsigned long gfporder, size_t size,
+static void cache_estimate (unsigned long gfporder, size_t size, size_t align,
 		 int flags, size_t *left_over, unsigned int *num)
 {
 	int i;
@@ -536,7 +537,7 @@
 		extra = sizeof(kmem_bufctl_t);
 	}
 	i = 0;
-	while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
+	while (i*size + ALIGN(base+i*extra, align) <= wastage)
 		i++;
 	if (i > 0)
 		i--;
@@ -546,7 +547,7 @@
 
 	*num = i;
 	wastage -= i*size;
-	wastage -= L1_CACHE_ALIGN(base+i*extra);
+	wastage -= ALIGN(base+i*extra, align);
 	*left_over = wastage;
 }
 
@@ -690,14 +691,15 @@
 	list_add(&cache_cache.next, &cache_chain);
 	cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
 
-	cache_estimate(0, cache_cache.objsize, 0,
-			&left_over, &cache_cache.num);
+	cache_estimate(0, cache_cache.objsize, SMP_CACHE_BYTES, 0,
+  			&left_over, &cache_cache.num);
 	if (!cache_cache.num)
 		BUG();
 
 	cache_cache.colour = left_over/cache_cache.colour_off;
 	cache_cache.colour_next = 0;
-
+	cache_cache.slab_size = ALIGN(cache_cache.num*sizeof(kmem_bufctl_t) + sizeof(struct slab),
+					SMP_CACHE_BYTES);
 
 	/* 2+3) create the kmalloc caches */
 	sizes = malloc_sizes;
@@ -993,7 +995,7 @@
  * kmem_cache_create - Create a cache.
  * @name: A string which is used in /proc/slabinfo to identify this cache.
  * @size: The size of objects to be created in this cache.
- * @offset: The offset to use within the page.
+ * @align: The required alignment for the objects.
  * @flags: SLAB flags
  * @ctor: A constructor for the objects.
  * @dtor: A destructor for the objects.
@@ -1018,17 +1020,15 @@
  * %SLAB_NO_REAP - Don't automatically reap this cache when we're under
  * memory pressure.
  *
- * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
- * cacheline.  This can be beneficial if you're counting cycles as closely
- * as davem.
+ * %SLAB_HWCACHE_ALIGN - This flag has no effect and will be removed soon.
  */
 kmem_cache_t *
-kmem_cache_create (const char *name, size_t size, size_t offset,
+kmem_cache_create (const char *name, size_t size, size_t align,
 	unsigned long flags, void (*ctor)(void*, kmem_cache_t *, unsigned long),
 	void (*dtor)(void*, kmem_cache_t *, unsigned long))
 {
 	const char *func_nm = KERN_ERR "kmem_create: ";
-	size_t left_over, align, slab_size;
+	size_t left_over, slab_size;
 	kmem_cache_t *cachep = NULL;
 
 	/*
@@ -1039,7 +1039,7 @@
 		(size < BYTES_PER_WORD) ||
 		(size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
 		(dtor && !ctor) ||
-		(offset < 0 || offset > size))
+		(align < 0))
 			BUG();
 
 #if DEBUG
@@ -1051,22 +1051,16 @@
 
 #if FORCED_DEBUG
 	/*
-	 * Enable redzoning and last user accounting, except
-	 * - for caches with forced alignment: redzoning would violate the
-	 *   alignment
-	 * - for caches with large objects, if the increased size would
-	 *   increase the object size above the next power of two: caches
-	 *   with object sizes just above a power of two have a significant
-	 *   amount of internal fragmentation
+	 * Enable redzoning and last user accounting, except for caches with
+	 * large objects, if the increased size would increase the object size
+	 * above the next power of two: caches with object sizes just above a
+	 * power of two have a significant amount of internal fragmentation.
 	 */
-	if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD))
-			&& !(flags & SLAB_MUST_HWCACHE_ALIGN)) {
+	if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD)))
 		flags |= SLAB_RED_ZONE|SLAB_STORE_USER;
-	}
 	flags |= SLAB_POISON;
 #endif
 #endif
-
 	/*
 	 * Always checks flags, a caller might be expecting debug
 	 * support which isn't available.
@@ -1074,15 +1068,23 @@
 	if (flags & ~CREATE_MASK)
 		BUG();
 
+	if (align) {
+		/* minimum supported alignment: */
+		if (align < BYTES_PER_WORD)
+			align = BYTES_PER_WORD;
+
+		/* combinations of forced alignment and advanced debugging is
+		 * not yet implemented.
+		 */
+		flags &= ~(SLAB_RED_ZONE|SLAB_STORE_USER);
+	}
+
 	/* Get cache's description obj. */
 	cachep = (kmem_cache_t *) kmem_cache_alloc(&cache_cache, SLAB_KERNEL);
 	if (!cachep)
 		goto opps;
 	memset(cachep, 0, sizeof(kmem_cache_t));
 
-#if DEBUG
-	cachep->reallen = size;
-#endif
 	/* Check that size is in terms of words.  This is needed to avoid
 	 * unaligned accesses for some archs when redzoning is used, and makes
 	 * sure any on-slab bufctl's are also correctly aligned.
@@ -1094,20 +1096,25 @@
 	}
 	
 #if DEBUG
+	cachep->reallen = size;
+
 	if (flags & SLAB_RED_ZONE) {
-		/*
-		 * There is no point trying to honour cache alignment
-		 * when redzoning.
-		 */
-		flags &= ~SLAB_HWCACHE_ALIGN;
+		/* redzoning only works with word aligned caches */
+		align = BYTES_PER_WORD;
+
 		/* add space for red zone words */
 		cachep->dbghead += BYTES_PER_WORD;
 		size += 2*BYTES_PER_WORD;
 	}
 	if (flags & SLAB_STORE_USER) {
-		flags &= ~SLAB_HWCACHE_ALIGN;
-		size += BYTES_PER_WORD; /* add space */
+		/* user store requires word alignment and
+		 * one word storage behind the end of the real
+		 * object.
+		 */
+		align = BYTES_PER_WORD;
+		size += BYTES_PER_WORD;
 	}
+
 #if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
 	if (size > 128 && cachep->reallen > L1_CACHE_BYTES && size < PAGE_SIZE) {
 		cachep->dbghead += PAGE_SIZE - size;
@@ -1115,9 +1122,6 @@
 	}
 #endif
 #endif
-	align = BYTES_PER_WORD;
-	if (flags & SLAB_HWCACHE_ALIGN)
-		align = L1_CACHE_BYTES;
 
 	/* Determine if the slab management is 'on' or 'off' slab. */
 	if (size >= (PAGE_SIZE>>3))
@@ -1127,13 +1131,16 @@
 		 */
 		flags |= CFLGS_OFF_SLAB;
 
-	if (flags & SLAB_HWCACHE_ALIGN) {
-		/* Need to adjust size so that objs are cache aligned. */
-		/* Small obj size, can get at least two per cache line. */
+	if (!align) {
+		/* Default alignment: compile time specified l1 cache size.
+		 * But if an object is really small, then squeeze multiple
+		 * into one cacheline.
+		 */
+		align = L1_CACHE_BYTES;
 		while (size <= align/2)
 			align /= 2;
-		size = (size+align-1)&(~(align-1));
 	}
+	size = ALIGN(size, align);
 
 	/* Cal size (in pages) of slabs, and the num of objs per slab.
 	 * This could be made much more intelligent.  For now, try to avoid
@@ -1143,7 +1150,7 @@
 	do {
 		unsigned int break_flag = 0;
 cal_wastage:
-		cache_estimate(cachep->gfporder, size, flags,
+		cache_estimate(cachep->gfporder, size, align, flags,
 						&left_over, &cachep->num);
 		if (break_flag)
 			break;
@@ -1177,8 +1184,8 @@
 		cachep = NULL;
 		goto opps;
 	}
-	slab_size = L1_CACHE_ALIGN(cachep->num*sizeof(kmem_bufctl_t) +
-			sizeof(struct slab));
+	slab_size = ALIGN(cachep->num*sizeof(kmem_bufctl_t) + sizeof(struct slab),
+			align);
 
 	/*
 	 * If the slab has been placed off-slab, and we have enough space then
@@ -1189,14 +1196,17 @@
 		left_over -= slab_size;
 	}
 
+	if (flags & CFLGS_OFF_SLAB) {
+		/* really off slab. No need for manual alignment */
+		slab_size = cachep->num*sizeof(kmem_bufctl_t)+sizeof(struct slab);
+	}
+ 
+	cachep->colour_off = L1_CACHE_BYTES;
 	/* Offset must be a multiple of the alignment. */
-	offset += (align-1);
-	offset &= ~(align-1);
-	if (!offset)
-		offset = L1_CACHE_BYTES;
-	cachep->colour_off = offset;
-	cachep->colour = left_over/offset;
-
+	if (cachep->colour_off < align)
+		cachep->colour_off = align;
+	cachep->colour = left_over/cachep->colour_off;
+	cachep->slab_size = slab_size;
 	cachep->flags = flags;
 	cachep->gfpflags = 0;
 	if (flags & SLAB_CACHE_DMA)
@@ -1470,8 +1480,7 @@
 			return NULL;
 	} else {
 		slabp = objp+colour_off;
-		colour_off += L1_CACHE_ALIGN(cachep->num *
-				sizeof(kmem_bufctl_t) + sizeof(struct slab));
+		colour_off += cachep->slab_size;
 	}
 	slabp->inuse = 0;
 	slabp->colouroff = colour_off;
--- 2.6/arch/i386/mm/init.c	2003-09-11 22:30:47.000000000 +0200
+++ build-2.6/arch/i386/mm/init.c	2003-09-11 22:38:57.000000000 +0200
@@ -509,8 +509,8 @@
 	if (PTRS_PER_PMD > 1) {
 		pmd_cache = kmem_cache_create("pmd",
 					PTRS_PER_PMD*sizeof(pmd_t),
+					PTRS_PER_PMD*sizeof(pmd_t),
 					0,
-					SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 					pmd_ctor,
 					NULL);
 		if (!pmd_cache)
@@ -519,8 +519,8 @@
 		if (TASK_SIZE > PAGE_OFFSET) {
 			kpmd_cache = kmem_cache_create("kpmd",
 					PTRS_PER_PMD*sizeof(pmd_t),
+					PTRS_PER_PMD*sizeof(pmd_t),
 					0,
-					SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 					kpmd_ctor,
 					NULL);
 			if (!kpmd_cache)
@@ -541,8 +541,8 @@
 
 	pgd_cache = kmem_cache_create("pgd",
 				PTRS_PER_PGD*sizeof(pgd_t),
+				PTRS_PER_PGD*sizeof(pgd_t),
 				0,
-				SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN,
 				ctor,
 				dtor);
 	if (!pgd_cache)
--- 2.6/mm/rmap.c	2003-09-11 19:33:57.000000000 +0200
+++ build-2.6/mm/rmap.c	2003-09-11 23:41:55.000000000 +0200
@@ -521,8 +521,8 @@
 {
 	pte_chain_cache = kmem_cache_create(	"pte_chain",
 						sizeof(struct pte_chain),
+						sizeof(struct pte_chain),
 						0,
-						SLAB_MUST_HWCACHE_ALIGN,
 						pte_chain_ctor,
 						NULL);
 


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-11 16:19       ` Manfred Spraul
  2003-09-11 21:49         ` Manfred Spraul
@ 2003-09-12  8:59         ` Ravikiran G Thirumalai
  2003-09-12  9:10           ` Arjan van de Ven
  2003-09-13 20:06           ` Manfred Spraul
  1 sibling, 2 replies; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-12  8:59 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: akpm, linux-kernel, Robert Love, dipankar

On Thu, Sep 11, 2003 at 06:19:22PM +0200, Manfred Spraul wrote:
> ... 
> But there are problems with the current implementation:
> - some drivers/archs are too difficult to fix. James Bottomley asked me 
> to add a switch for archs that cannot transfer to pci_dma completely. 
> Basically guarantee that all cachelines that are touched by an object 
> are exclusive to the object. Left over bytes must not be used, they 
> could be overwritten randomly by the hardware.
> - Russell King onced asked me for the ability for 1024 byte aligned 
> objects. IIRC ARM needs that for it's page tables.
> - If I understand you correctly you want to avoid false sharing of the 
> per-cpu buffers that back alloc_percpu, correct?

Correct.  We were assuming kmalloc always returns cacheline aligned
objects and kinda assumed that cachelines touched by the object
are exclusive to that object.  With SLAB_HWCACHE_ALIGN not being 
always honoured, this assumption turned out to be false.

> I have two concerns:
> - what about users that compile a kernel for a pIII and then run it on a 
> p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.

If the target cpu is chosen properly at compile time, L1_CACHE_BYTES
is set accordingly (CONFIG_X86_L1_CACHE_SHIFT on x86), so this should
not be an issue, right?
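
For reference, this is roughly how the compile-time constant is derived on
i386:

----------------------------------------------------------------------------
/* include/asm-i386/cache.h, roughly: */
#define L1_CACHE_SHIFT	(CONFIG_X86_L1_CACHE_SHIFT)
#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)
----------------------------------------------------------------------------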
 
> - what about NUMA systems? IMHO the per-cpu buffers should definitely 
> be from the node nearest to the target cpu. Unfortunately slab has no 
> concept of nodes right now. There was a patch, but it's quite intrusive 
> and was never fine-tuned. Thus we must either ignore NUMA, or alloc_percpu 
> would have to use its own allocator. And: which cacheline size is 
> relevant for NUMA? Is L1==L2==Ln_CACHE_BYTES on all archs?

I am working on a simplistic allocator for alloc_percpu which
1. Minimises cache footprint (simple pointer arithmetic to get to each cpu's
   version -- sketched below)
2. Does NUMA-aware allocation
3. Does not fragment
4. Is simple and uses the same pointer arithmetic to get to each cpu's offset

I wouldn't be using the slab at all, because using slabs would mean using
NR_CPUS pointers and one extra dereference, which is bad, as we had found out
earlier.  But I guess slab will have to do node-local allocations for
other applications.
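
A minimal sketch of the pointer arithmetic meant in point 1 (the names and the
stride value here are hypothetical, not from any posted patch):

----------------------------------------------------------------------------
/* Each CPU's copy lives at a fixed stride from the CPU-0 copy, so
 * locating CPU n's version is one multiply-add: no NR_CPUS pointer
 * table and no extra dereference.  PCPU_STRIDE is an assumed size.
 */
#define PCPU_STRIDE	(32*1024)

static inline void *percpu_ptr(void *cpu0_ptr, int cpu)
{
	return (char *)cpu0_ptr + cpu * PCPU_STRIDE;
}
----------------------------------------------------------------------------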

> 
> IMHO the right solution is
> - remove SLAB_MUST_HWCACHE_ALIGN and SLAB_HWCACHE_ALIGN. The API is 
> broken. Align everything always to L1_CACHE_BYTES.
> - rename the "offset" parameter to "align", and use that to support 
> explicit alignment on a byte boundary. I have a patch somewhere, it's 
> not very large.

You mean colour slabs to L1_CACHE_BYTES and align objects to the "align"
parameter? Sounds great.  I wonder why that wasn't done in the first
place; even Bonwick's paper talks of passing an align parameter and calculating
the colour offset based on wastage etc.  Maybe calculating the colour offset
in terms of slab wastage could reduce fragmentation, instead of hard-coding
it to L1_CACHE_BYTES (when you think of large cachelines and small objects)?

> - alloc_percpu should set align to the real cache line size from cpu_data.
> - add a minimal NUMA support: a very inefficient 
> "kmem_cache_alloc_forcpu(cachep, flags, target_cpu)". Not that difficult 
> if I sacrifice performance [no lockless per-cpu buffers, etc.]

As I said, a placeholder array of NR_CPUS pointers will cause performance
problems, so IMHO staying away from slab for alloc_percpu is best.

Thanks,
Kiran



* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-12  8:59         ` Ravikiran G Thirumalai
@ 2003-09-12  9:10           ` Arjan van de Ven
  2003-09-13 20:06           ` Manfred Spraul
  1 sibling, 0 replies; 14+ messages in thread
From: Arjan van de Ven @ 2003-09-12  9:10 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Manfred Spraul, akpm, linux-kernel, Robert Love, dipankar


On Fri, 2003-09-12 at 10:59, Ravikiran G Thirumalai wrote:

> > I have two concerns:
> > - what about users that compile a kernel for a pIII and then run it on a 
> > p4? L1_CACHE_BYTES is 32 bytes, but the real cache line size is 128 bytes.
> 
> If the target cpu is chosen properly during compile time, L1_CACHE_BYTES
> is set accordingly (CONFIG_X86_L1_CACHE_SHIFT on x86) so this should
> not be an issue right?

In my opinion there is no excuse to make such an assumption at compile
time if it's basically free to do it at runtime instead...
For structure padding, using the config setting is obviously the only
way to go; but for kmalloc you can use the actual cacheline size at
runtime.



* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-12  8:59         ` Ravikiran G Thirumalai
  2003-09-12  9:10           ` Arjan van de Ven
@ 2003-09-13 20:06           ` Manfred Spraul
  2003-09-13 20:58             ` Dipankar Sarma
  2003-09-14  8:09             ` Ravikiran G Thirumalai
  1 sibling, 2 replies; 14+ messages in thread
From: Manfred Spraul @ 2003-09-13 20:06 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: akpm, linux-kernel, Robert Love, dipankar

Ravikiran G Thirumalai wrote:

>I am working on a simplistic allocator for alloc_percpu which
>1. Minimises cache footprint (simple pointer arithmetic to get to each cpus 
>   version
>2. Does numa aware allocation
>3. Does not fragment
>4. Is simple and extends simple pointer arithmetic to get to cpus offsets
>
>I wouldn't be using the slab at all because using slabs would mean using
>NR_CPUS pointers and one extra dereference which is bad as we had found out
>earlier.  But I guess slab will have to do node local allocations for
>other applications.
>  
>
Interesting. Slab internally uses lots of large per-cpu arrays.
Altogether something like 40 kB/cpu, right now implemented with
NR_CPUS pointers. In the long run I'll try to switch to your allocator.

But back to the patch that started this thread: Do you still need the 
ability to set an explicit alignment for slab allocations? If yes, then 
I'd polish my patch, double check all kmem_cache_create callers and then 
send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.

--
    Manfred



* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-13 20:06           ` Manfred Spraul
@ 2003-09-13 20:58             ` Dipankar Sarma
  2003-09-14  8:09             ` Ravikiran G Thirumalai
  1 sibling, 0 replies; 14+ messages in thread
From: Dipankar Sarma @ 2003-09-13 20:58 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Ravikiran G Thirumalai, akpm, linux-kernel, Robert Love

On Sat, Sep 13, 2003 at 10:06:08PM +0200, Manfred Spraul wrote:
> Ravikiran G Thirumalai wrote:
> >I wouldn't be using the slab at all because using slabs would mean using
> >NR_CPUS pointers and one extra dereference which is bad as we had found out
> >earlier.  But I guess slab will have to do node local allocations for
> >other applications.
> > 
> >
> Interesting. Slab internally uses lots of large per-cpu arrays. 
> Altogether something like 40 kB/cpu, right now implemented with 
> NR_CPUS pointers. In the long run I'll try to switch to your allocator.
> 
> But back to the patch that started this thread: Do you still need the 
> ability to set an explicit alignment for slab allocations? If yes, then 
> I'd polish my patch, double check all kmem_cache_create callers and then 
> send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.

This is the problem - the current dynamic per-cpu allocator
(alloc_percpu()) is broken. It uses kmalloc() to allocate each
CPU's data, but kmalloc() doesn't guarantee cache line alignment
(it only passes SLAB_HWCACHE_ALIGN). This may result in per-CPU statistics
bouncing between CPUs, especially on the ones with large L1 cache lines.

We have a number of options -

1. Force kmalloc() to strictly align on a cache line boundary (with your
   strict-align patch). This wastes space elsewhere, but alloc_percpu()
   will never result in cache line sharing.
2. Make alloc_percpu() use its own caches for the various sizes, with your
   strict-align patch (sketched below). The rest of the kernel is not
   affected.
3. Let alloc_percpu() use its own allocator which supports NUMA
   and does not use an offset table.

#2 and #3 have less impact on the kernel and we should consider those,
IMO.
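
Option 2 might look roughly like the sketch below (hypothetical names; it also
assumes the "align" parameter from the arbitrary-align patch):

----------------------------------------------------------------------------
/* Hypothetical: back alloc_percpu() with dedicated caches whose objects
 * are explicitly aligned to the cache line size, so no two CPUs' data
 * ever share a line.  One cache per size class; only one shown here.
 */
static kmem_cache_t *percpu_cachep_32;

void __init percpu_caches_init(void)
{
	percpu_cachep_32 = kmem_cache_create("percpu-32",
					     32,		/* object size */
					     SMP_CACHE_BYTES,	/* align       */
					     0, NULL, NULL);
}
----------------------------------------------------------------------------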

Thanks
Dipankar


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-13 20:06           ` Manfred Spraul
  2003-09-13 20:58             ` Dipankar Sarma
@ 2003-09-14  8:09             ` Ravikiran G Thirumalai
  2003-09-14 13:00               ` Dipankar Sarma
  1 sibling, 1 reply; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-14  8:09 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: akpm, linux-kernel, Robert Love, dipankar

On Sat, Sep 13, 2003 at 10:06:08PM +0200, Manfred Spraul wrote:
> Ravikiran G Thirumalai wrote:
> > ... 
> >
> Interesting. Slab internally uses lots of large per-cpu arrays. 
> Altogether something like 40 kB/cpu, right now implemented with 
> NR_CPUS pointers. In the long run I'll try to switch to your allocator.
> 
> But back to the patch that started this thread: Do you still need the 
> ability to set an explicit alignment for slab allocations? If yes, then 
> I'd polish my patch, double check all kmem_cache_create callers and then 
> send the patch to akpm. 

No, even the arbitrary-align patch might not fix the current brokenness
of alloc_percpu, because then we would have to change the kmalloc caches
(malloc_sizes[]) to explicitly align objects on cacheline boundaries
even for small objects.  And that would be a major change at this stage.
Having different alloc_percpu caches for different objects is probably not
such a good thing right now either -- that is, if we tolerate the extra
dereference.  So, IMO, going ahead with a new allocator is a good choice
for alloc_percpu.

> Otherwise I'd wait - the patch is not a bugfix.

That said, the reasons I see for the arbitrary-align patch in 2.6 are:

1. SLAB_MUST_HWCACHE_ALIGN is broken.  If I understand the code right,
   it will result in BYTES_PER_WORD alignment.  This means there can be false
   sharing between two task_structs, depending on the slab calculations,
   and there is an atomic_t at the beginning of the task_struct.
   This has to be fixed.
2. The fix should not change the way the kmalloc caches currently work, i.e.
   by sharing cachelines for small objects.  Later benchmarks might
   prove that it is not good to share cachelines for kmalloc, but
   for now we don't know.
3. IMHO it's not right that SLAB_HWCACHE_ALIGN is taken as a hint and not
   as a directive.  It is misleading.  I don't like the fact that the
   allocator can reduce the alignment.  IMHO, alignment decisions should be
   left to the user.
4. There are already requirements from users for arbitrary alignment, as you
   mentioned in earlier mails in this thread.  I guess such requirements will
   go up as 2.6 matures, so it is a good idea to provide alignment up to
   page alignment.

The patch I posted earlier takes care of 1 & 2, but arbitrary alignment takes
care of all 4, and as you mentioned the patch is small enough.  If an interface
change is a concern for 2.6, then IMHO the interface is broken anyway... who
uses the offset parameter to specify colouring offsets?  IMHO, the slab colour
offset should be based on the slab calculations and, if needed, on the arch.
Added to that, the SLAB_HWCACHE_ALIGN name itself is broken anyway.  Clearly
we should not wait until 2.7 for these fixes, so now is the good time, right?
This is just my opinion.  It is left to akpm and you to decide....


Thanks,
Kiran


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-14  8:09             ` Ravikiran G Thirumalai
@ 2003-09-14 13:00               ` Dipankar Sarma
  2003-09-15  5:13                 ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 14+ messages in thread
From: Dipankar Sarma @ 2003-09-14 13:00 UTC (permalink / raw)
  To: Ravikiran G Thirumalai; +Cc: Manfred Spraul, akpm, linux-kernel, Robert Love

On Sun, Sep 14, 2003 at 01:39:42PM +0530, Ravikiran G Thirumalai wrote:
> The patch I posted earlier takes care of 1 & 2, but arbitrary alignment takes
> care of all 4, and as you mentioned the patch is small enough.  If interface

While we are at it, we should also probably look up the cache line
size for a cpu family from a table, store it in some per-cpu data
(cpuinfo_x86 ?) and provide an l1_cache_bytes() API to
return it at run time.
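
On x86 that could be as small as the sketch below (the field name is an
assumption -- whichever member of cpuinfo_x86 ends up holding the detected
alignment):

----------------------------------------------------------------------------
/* Sketch: return the run-time cache line size of the boot cpu.
 * Assumes cpuinfo_x86 carries the detected cache alignment; the field
 * name here is a guess.
 */
static inline int l1_cache_bytes(void)
{
	return boot_cpu_data.x86_cache_alignment;
}
----------------------------------------------------------------------------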

Thanks
Dipankar


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
  2003-09-14 13:00               ` Dipankar Sarma
@ 2003-09-15  5:13                 ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 14+ messages in thread
From: Ravikiran G Thirumalai @ 2003-09-15  5:13 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Manfred Spraul, akpm, linux-kernel, Robert Love, arjanv

On Sun, Sep 14, 2003 at 06:30:37PM +0530, Dipankar Sarma wrote:
> 
> While we are at it, we should also probably look up the cache line
> size for a cpu family from a table, store it in some per-cpu data
> (cpuinfo_x86 ?) and provide an l1_cache_bytes() api to
> return it at run time.

If we are going to solve the cache line size mismatch between the compile-time
arch and the run-time arch (compiling on a PIII and running on a P4),
we might end up with kernel code which sometimes refers to the static line size
and sometimes to the run-time size, which might cause correctness issues.
So maybe it is a good idea to have just one macro for the L1 size?  We can't
do away with the static one, so maybe we should keep it and expect users
to compile with the target cpu properly set (otherwise they lose out on
performance -- correctness won't be a problem).  I could be wrong... just
thinking out loud...

Thanks,
Kiran


* Re: [patch] Make slab allocator work with SLAB_MUST_HWCACHE_ALIGN
       [not found]           ` <voSF.8l7.17@gated-at.bofh.it>
@ 2003-09-13 22:18             ` Arnd Bergmann
  0 siblings, 0 replies; 14+ messages in thread
From: Arnd Bergmann @ 2003-09-13 22:18 UTC (permalink / raw)
  To: Manfred Spraul, linux-kernel, Martin Schwidefsky, Ravikiran G Thirumalai

Manfred Spraul wrote:

> But back to the patch that started this thread: Do you still need the 
> ability to set an explicit alignment for slab allocations? If yes, then 
> I'd polish my patch, double check all kmem_cache_create callers and then 
> send the patch to akpm. Otherwise I'd wait - the patch is not a bugfix.

The explicit alignment would be my preferred way to fix slab debugging
on s390. We still have the problem that practically all s390 device drivers
need 64-bit alignment on slab allocations.

Our current hack is to redefine BYTES_PER_WORD in slab.c to 8, but what
I'd like to see is a per-architecture alignment in kmem_cache_init
that would default to 8 on s390 and to sizeof(long) otherwise.

Using the current SLAB_MUST_HWCACHE_ALIGN is not an option since
it wastes a lot of memory with our 2048-bit L1 cache lines.
I'd also like to avoid creating special slab caches for anything
that needs 8-byte alignment.
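
For illustration, the per-architecture default could look something like the
sketch below (names made up; nothing like this exists in the tree yet):

----------------------------------------------------------------------------
/* Sketch only -- hypothetical names, not an existing kernel interface:
 * a per-architecture minimum object alignment for slab, defaulting to
 * word size and overridden by the arch where drivers need more.
 */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN	sizeof(long)
#endif

/* and in include/asm-s390/cache.h (or similar): */
/* #define ARCH_SLAB_MINALIGN	8 */
----------------------------------------------------------------------------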

        Arnd <><

