linux-kernel.vger.kernel.org archive mirror
* [patch 0/3] NUMA slab locking fixes
@ 2006-02-03 20:53 Ravikiran G Thirumalai
  2006-02-03 20:55 ` [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix Ravikiran G Thirumalai
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-03 20:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Manfred Spraul, Shai Fultheim (Shai@scalex86.org),
	Christoph Lameter, Alok Kataria, sonny

There were some locking oddities in the NUMA slab allocator which broke cpu
hotplug.  Here is a patchset to correct the locking and also fix the cpu
hotplug breakage Sonny Rao noticed on POWER5.  We hit the bug on POWER but
not on other arches, because the other arches failed to update the node
cpumasks when a cpu went down.

The following patches are included in the series:

1. slab-colour-next fix: Moves the slab colouring index from kmem_cache_t
to kmem_list3.  Per-node cache colouring makes sense, and it also lets us
avoid taking the cachep->spinlock on the alloc path (cache_grow).  Now alloc
and free need not take cachep->spinlock and just need to take the respective
kmem_list3 locks.

2. slab-locking-irq-optimization: With patch 1, we don't need to disable
on-chip interrupts while taking the cachep->spinlock.  (We needed to disable
interrupts while taking cachep->spinlock earlier because alloc can happen
from interrupt context, and we had to take cachep->spinlock to protect
colour_next -- which is no longer the case.)

3. slab-hotplug-fix: Fix the cpu_down and cpu_up slab locking.  When the last
cpu of a node went down, the cpu_down path used to free the shared array_cache,
the alien array_cache and the l3 with the kmem_list3 lock held, which resulted
in badness.  This patch fixes the locking so that no longer happens.  We also
no longer free the l3 on cpu_down (because other cpus may still request memory
from the node while its last cpu is going down).  This fixes the problem
reported by Sonny Rao.
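
To summarize, the split of lock responsibilities after the series looks
roughly like this (a compilable userspace sketch, not the actual mm/slab.c
declarations; the type and field names here are only stand-ins):

#include <stdio.h>

/*
 * Rough model of which lock covers what after these patches:
 *
 *   kmem_list3 (one per cache per node)
 *     list_lock: the slab lists, free counts, the shared/alien
 *     array_caches, and now colour_next.  Taken with interrupts
 *     disabled, since the allocator can run in interrupt context.
 *
 *   kmem_cache
 *     spinlock: no longer needed on the alloc/free paths; alloc and
 *     free only take the kmem_list3 lock of the node they touch.
 */
struct list3_model {
        int list_lock;                  /* stand-in for spinlock_t */
        unsigned int colour_next;       /* moved here by patch 1 */
        void *shared;                   /* shared array_cache */
        void **alien;                   /* alien array_caches */
};

struct cache_model {
        int spinlock;                   /* off the alloc/free paths now */
        struct list3_model *nodelists[2];
};

int main(void)
{
        printf("model kmem_list3: %zu bytes, model kmem_cache: %zu bytes\n",
               sizeof(struct list3_model), sizeof(struct cache_model));
        return 0;
}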

2.6.16-rc1mm4 with the above patches ran successfully overnight on a 4-cpu,
2-node AMD Tyan box, with 4 dbench processes running all the time and cpus
being offlined and onlined in an infinite loop.

Thanks,
Kiran


* [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix
  2006-02-03 20:53 [patch 0/3] NUMA slab locking fixes Ravikiran G Thirumalai
@ 2006-02-03 20:55 ` Ravikiran G Thirumalai
  2006-02-03 20:56 ` [patch 2/3] NUMA slab locking fixes -- slab locking irq optimizations Ravikiran G Thirumalai
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-03 20:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Manfred Spraul, Shai Fultheim (Shai@scalex86.org),
	Christoph Lameter, Alok Kataria, sonny

colour_next is used as an index to add a colouring offset to a new slab 
in the cache (colour_off * colour_next).  Now with the NUMA aware slab 
allocator, it makes sense to colour slabs added on the same node 
sequentially with colour_next.

This patch moves the colouring index "colour_next" per-node by placing it
on kmem_list3 rather than kmem_cache.

This also avoids taking the cachep->spinlock on the alloc path at
cache_grow.
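
For illustration only, here is a small standalone userspace sketch (not
kernel code; the struct and function names are made up) of how the per-node
colour index now cycles independently on each node:

#include <stdio.h>

/* Simplified stand-ins for the fields cache_grow() uses. */
struct l3_sketch {
        unsigned int colour_next;       /* per-node colouring index */
};

struct cache_sketch {
        unsigned int colour;            /* colouring range */
        unsigned int colour_off;        /* colour offset per step */
};

/*
 * Offset for the next slab grown on this node; in the real code the
 * colour_next update happens under l3->list_lock.
 */
static unsigned int next_colour_offset(struct cache_sketch *c,
                                       struct l3_sketch *l3)
{
        unsigned int offset = l3->colour_next;

        l3->colour_next++;
        if (l3->colour_next >= c->colour)
                l3->colour_next = 0;
        return offset * c->colour_off;
}

int main(void)
{
        struct cache_sketch c = { .colour = 4, .colour_off = 64 };
        struct l3_sketch node0 = { 0 }, node1 = { 0 };
        int i;

        /* Each node walks through offsets 0, 64, 128, 192 on its own. */
        for (i = 0; i < 6; i++)
                printf("node0 %u, node1 %u\n",
                       next_colour_offset(&c, &node0),
                       next_colour_offset(&c, &node1));
        return 0;
}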

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c	2006-01-29 20:20:20.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c	2006-01-29 20:23:28.000000000 -0800
@@ -294,6 +294,7 @@ struct kmem_list3 {
 	unsigned long next_reap;
 	int free_touched;
 	unsigned int free_limit;
+	unsigned int colour_next;	/* Per-node cache coloring */
 	spinlock_t list_lock;
 	struct array_cache *shared;	/* shared per node */
 	struct array_cache **alien;	/* on other nodes */
@@ -344,6 +345,7 @@ static void kmem_list3_init(struct kmem_
 	INIT_LIST_HEAD(&parent->slabs_free);
 	parent->shared = NULL;
 	parent->alien = NULL;
+	parent->colour_next = 0;
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
@@ -390,7 +392,6 @@ struct kmem_cache {
 
 	size_t colour;		/* cache colouring range */
 	unsigned int colour_off;	/* colour offset */
-	unsigned int colour_next;	/* cache colouring */
 	struct kmem_cache *slabp_cache;
 	unsigned int slab_size;
 	unsigned int dflags;	/* dynamic flags */
@@ -1127,7 +1128,6 @@ void __init kmem_cache_init(void)
 		BUG();
 
 	cache_cache.colour = left_over / cache_cache.colour_off;
-	cache_cache.colour_next = 0;
 	cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
 				      sizeof(struct slab), cache_line_size());
 
@@ -2334,18 +2334,19 @@ static int cache_grow(struct kmem_cache 
 		 */
 		ctor_flags |= SLAB_CTOR_ATOMIC;
 
-	/* About to mess with non-constant members - lock. */
+	/* Take the l3 list lock to change the colour_next on this node */
 	check_irq_off();
-	spin_lock(&cachep->spinlock);
+	l3 = cachep->nodelists[nodeid];
+	spin_lock(&l3->list_lock);
 
 	/* Get colour for the slab, and cal the next value. */
-	offset = cachep->colour_next;
-	cachep->colour_next++;
-	if (cachep->colour_next >= cachep->colour)
-		cachep->colour_next = 0;
-	offset *= cachep->colour_off;
+	offset = l3->colour_next;
+	l3->colour_next++;
+	if (l3->colour_next >= cachep->colour)
+		l3->colour_next = 0;
+	spin_unlock(&l3->list_lock);
 
-	spin_unlock(&cachep->spinlock);
+	offset *= cachep->colour_off;
 
 	check_irq_off();
 	if (local_flags & __GFP_WAIT)
@@ -2377,7 +2378,6 @@ static int cache_grow(struct kmem_cache 
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
-	l3 = cachep->nodelists[nodeid];
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */


* [patch 2/3] NUMA slab locking fixes -- slab locking irq optimizations
  2006-02-03 20:53 [patch 0/3] NUMA slab locking fixes Ravikiran G Thirumalai
  2006-02-03 20:55 ` [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix Ravikiran G Thirumalai
@ 2006-02-03 20:56 ` Ravikiran G Thirumalai
  2006-02-03 20:57 ` [patch 3/3] NUMA slab locking fixes -- slab cpu hotplug fix Ravikiran G Thirumalai
  2006-02-03 22:07 ` [patch 0/3] NUMA slab locking fixes Andrew Morton
  3 siblings, 0 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-03 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Manfred Spraul, Shai Fultheim (Shai@scalex86.org),
	Christoph Lameter, Alok Kataria, sonny

Earlier, we had to disable on-chip interrupts while taking the cachep->spinlock
because at cache_grow, on every addition of a slab to a slab cache, we
incremented colour_next, which was protected by the cachep->spinlock, and
cache_grow can occur in interrupt context.  Since the per-node colour_next is
now protected by the node's list_lock, we no longer need to disable interrupts
while taking the per-cache spinlock; we only need to disable interrupts when
taking the per-node kmem_list3 list_lock.
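
Schematically, the change looks like this (a compilable sketch with stubbed
lock primitives standing in for the kernel's; only the call ordering matters
here):

#include <stdio.h>

/* Stand-ins for the kernel primitives, just to show the ordering.
 * In the kernel, the _irq variants also disable local interrupts. */
typedef struct { int locked; } sketch_lock_t;
static void spin_lock(sketch_lock_t *l)       { l->locked = 1; }
static void spin_unlock(sketch_lock_t *l)     { l->locked = 0; }
static void spin_lock_irq(sketch_lock_t *l)   { printf("irqs off\n"); l->locked = 1; }
static void spin_unlock_irq(sketch_lock_t *l) { l->locked = 0; printf("irqs on\n"); }

static sketch_lock_t cachep_spinlock;   /* per-cache lock: irqs may stay on */
static sketch_lock_t l3_list_lock;      /* per-node lock: reachable from
                                         * interrupt context, so irqs off */

/* Rough shape of drain_cpu_caches() after this patch. */
static void drain_sketch(void)
{
        spin_lock(&cachep_spinlock);    /* was spin_lock_irq() */
        /* for each online node: */
        spin_lock_irq(&l3_list_lock);   /* was plain spin_lock() */
        /* ... drain_array_locked() ... */
        spin_unlock_irq(&l3_list_lock);
        spin_unlock(&cachep_spinlock);
}

int main(void)
{
        drain_sketch();
        return 0;
}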

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c	2006-01-29 20:23:28.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c	2006-01-29 23:13:14.000000000 -0800
@@ -995,7 +995,7 @@ static int __devinit cpuup_callback(stru
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
-			spin_lock_irq(&cachep->spinlock);
+			spin_lock(&cachep->spinlock);
 			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
@@ -1004,7 +1004,7 @@ static int __devinit cpuup_callback(stru
 			if (!l3)
 				goto unlock_cache;
 
-			spin_lock(&l3->list_lock);
+			spin_lock_irq(&l3->list_lock);
 
 			/* Free limit for this kmem_list3 */
 			l3->free_limit -= cachep->batchcount;
@@ -1012,7 +1012,7 @@ static int __devinit cpuup_callback(stru
 				free_block(cachep, nc->entry, nc->avail, node);
 
 			if (!cpus_empty(mask)) {
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 				goto unlock_cache;
 			}
 
@@ -1031,13 +1031,13 @@ static int __devinit cpuup_callback(stru
 			/* free slabs belonging to this node */
 			if (__node_shrink(cachep, node)) {
 				cachep->nodelists[node] = NULL;
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 				kfree(l3);
 			} else {
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 			}
 		      unlock_cache:
-			spin_unlock_irq(&cachep->spinlock);
+			spin_unlock(&cachep->spinlock);
 			kfree(nc);
 		}
 		mutex_unlock(&cache_chain_mutex);
@@ -2021,18 +2021,18 @@ static void drain_cpu_caches(struct kmem
 
 	smp_call_function_all_cpus(do_drain, cachep);
 	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		l3 = cachep->nodelists[node];
 		if (l3) {
-			spin_lock(&l3->list_lock);
+			spin_lock_irq(&l3->list_lock);
 			drain_array_locked(cachep, l3->shared, 1, node);
-			spin_unlock(&l3->list_lock);
+			spin_unlock_irq(&l3->list_lock);
 			if (l3->alien)
 				drain_alien_cache(cachep, l3);
 		}
 	}
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 }
 
 static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -2348,7 +2348,6 @@ static int cache_grow(struct kmem_cache 
 
 	offset *= cachep->colour_off;
 
-	check_irq_off();
 	if (local_flags & __GFP_WAIT)
 		local_irq_enable();
 
@@ -2744,6 +2743,7 @@ static void *__cache_alloc_node(struct k
 	BUG_ON(!l3);
 
       retry:
+	check_irq_off();
 	spin_lock(&l3->list_lock);
 	entry = l3->slabs_partial.next;
 	if (entry == &l3->slabs_partial) {
@@ -3323,11 +3323,11 @@ static int do_tune_cpucache(struct kmem_
 	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
 
 	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	cachep->batchcount = batchcount;
 	cachep->limit = limit;
 	cachep->shared = shared;
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 
 	for_each_online_cpu(i) {
 		struct array_cache *ccold = new.new[i];
@@ -3584,8 +3584,7 @@ static int s_show(struct seq_file *m, vo
 	int node;
 	struct kmem_list3 *l3;
 
-	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	active_objs = 0;
 	num_slabs = 0;
 	for_each_online_node(node) {
@@ -3593,7 +3592,8 @@ static int s_show(struct seq_file *m, vo
 		if (!l3)
 			continue;
 
-		spin_lock(&l3->list_lock);
+		check_irq_on();
+		spin_lock_irq(&l3->list_lock);
 
 		list_for_each(q, &l3->slabs_full) {
 			slabp = list_entry(q, struct slab, list);
@@ -3620,7 +3620,7 @@ static int s_show(struct seq_file *m, vo
 		free_objects += l3->free_objects;
 		shared_avail += l3->shared->avail;
 
-		spin_unlock(&l3->list_lock);
+		spin_unlock_irq(&l3->list_lock);
 	}
 	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
@@ -3670,7 +3670,7 @@ static int s_show(struct seq_file *m, vo
 			shrinker_stat_read(cachep->shrinker, nr_freed));
 	}
 	seq_putc(m, '\n');
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 	return 0;
 }
 
@@ -3702,10 +3702,10 @@ static void do_dump_slabp(kmem_cache_t *
 	int node;
 
 	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		struct kmem_list3 *rl3 = cachep->nodelists[node];
-		spin_lock(&rl3->list_lock);
+		spin_lock_irq(&rl3->list_lock);
 
 		list_for_each(q, &rl3->slabs_full) {
 			int i;
@@ -3719,9 +3719,9 @@ static void do_dump_slabp(kmem_cache_t *
 				printk("\n");
 			}
 		}
-		spin_unlock(&rl3->list_lock);
+		spin_unlock_irq(&rl3->list_lock);
 	}
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 #endif
 }
 


* [patch 3/3] NUMA slab locking fixes -- slab cpu hotplug fix
  2006-02-03 20:53 [patch 0/3] NUMA slab locking fixes Ravikiran G Thirumalai
  2006-02-03 20:55 ` [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix Ravikiran G Thirumalai
  2006-02-03 20:56 ` [patch 2/3] NUMA slab locking fixes -- slab locking irq optimizations Ravikiran G Thirumalai
@ 2006-02-03 20:57 ` Ravikiran G Thirumalai
  2006-02-03 22:07 ` [patch 0/3] NUMA slab locking fixes Andrew Morton
  3 siblings, 0 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-03 20:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Manfred Spraul, Shai Fultheim (Shai@scalex86.org),
	Christoph Lameter, Alok Kataria, sonny

This fixes locking and bugs in the cpu_down and cpu_up paths of the NUMA slab
allocator.  Sonny Rao <sonny@burdell.org> reported problems some time back on
POWER5 boxes when the last cpu on a node was being offlined.  We could not
reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down.  Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem was that on CPU_DOWN, if it was the last cpu on the node to go
down, the array_caches (shared, alien) and the kmem_list3 of the node were
being freed (kfree) with the kmem_list3 lock held.  If the l3 or the
array_caches came from the same cache being cleared, we hit badness.

This patch cleans up the locking in the cpu_up and cpu_down paths.
We cannot really free the l3 on cpu down because there is no node offlining
yet, and even though a cpu is not yet up, node-local memory can be allocated
for it.  So l3s are usually allocated at kmem_cache_create and destroyed at
kmem_cache_destroy.  Hence, we don't need cachep->spinlock protection to get
to cachep->nodelists[nodeid] either.
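
The shape of the CPU_DEAD path after the fix, as a standalone sketch (stubbed
types and helpers, not the kernel code): detach under the list_lock, free and
drain only after dropping it, and never free the l3 itself.

#include <stdlib.h>

/* Minimal stand-ins; the real structures and helpers live in mm/slab.c. */
struct array_cache { int avail; };

struct l3_sketch {
        struct array_cache *shared;
        struct array_cache **alien;
};

static void lock_l3(void)   { /* spin_lock_irq(&l3->list_lock) in the kernel */ }
static void unlock_l3(void) { /* spin_unlock_irq(&l3->list_lock) */ }

/* CPU_DEAD handling for one cache: the array_caches are only freed
 * (and the alien caches drained) after the lock is dropped. */
static void cpu_dead_sketch(struct l3_sketch *l3)
{
        struct array_cache *shared;
        struct array_cache **alien;

        lock_l3();
        shared = l3->shared;            /* detach under the lock */
        l3->shared = NULL;
        alien = l3->alien;
        l3->alien = NULL;
        unlock_l3();

        free(shared);                   /* kfree() outside the lock */
        if (alien) {
                /* drain_alien_cache(); free_alien_cache(); */
                free(alien);
        }
        /* The l3 itself is deliberately kept: another cpu may still
         * ask for node-local memory from this node. */
}

int main(void)
{
        struct l3_sketch l3 = {
                .shared = calloc(1, sizeof(struct array_cache)),
                .alien = NULL,
        };

        cpu_dead_sketch(&l3);
        return 0;
}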

The patch survived onlining and offlining on a 4-core, 2-node Tyan box with
4 dbench processes running all the time.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc1mm4/mm/slab.c
===================================================================
--- linux-2.6.16-rc1mm4.orig/mm/slab.c	2006-02-01 18:04:21.000000000 -0800
+++ linux-2.6.16-rc1mm4/mm/slab.c	2006-02-01 23:44:25.000000000 -0800
@@ -892,14 +892,14 @@ static void __drain_alien_cache(struct k
 	}
 }
 
-static void drain_alien_cache(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien)
 {
 	int i = 0;
 	struct array_cache *ac;
 	unsigned long flags;
 
 	for_each_online_node(i) {
-		ac = l3->alien[i];
+		ac = alien[i];
 		if (ac) {
 			spin_lock_irqsave(&ac->lock, flags);
 			__drain_alien_cache(cachep, ac, i);
@@ -910,7 +910,7 @@ static void drain_alien_cache(struct kme
 #else
 #define alloc_alien_cache(node, limit) do { } while (0)
 #define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
 #endif
 
 static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -944,6 +944,9 @@ static int __devinit cpuup_callback(stru
 				l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
 				    ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
 
+				/* The l3s don't come and go as cpus come and
+				   go.  cache_chain_mutex is sufficient 
+				   protection here */
 				cachep->nodelists[node] = l3;
 			}
 
@@ -957,27 +960,39 @@ static int __devinit cpuup_callback(stru
 		/* Now we can go ahead with allocating the shared array's
 		   & array cache's */
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 
-			nc = alloc_arraycache(node, cachep->limit,
-					      cachep->batchcount);
-			if (!nc)
+			if (!(nc = alloc_arraycache(node, 
+			      cachep->limit, cachep->batchcount)))
 				goto bad;
+			if (!(shared = alloc_arraycache(node,
+			      cachep->shared*cachep->batchcount, 0xbaadf00d)))
+				goto bad;
+#ifdef CONFIG_NUMA
+			if (!(alien = alloc_alien_cache(node,
+			      cachep->limit)))
+				goto bad;
+#endif
 			cachep->array[cpu] = nc;
 
 			l3 = cachep->nodelists[node];
 			BUG_ON(!l3);
-			if (!l3->shared) {
-				if (!(nc = alloc_arraycache(node,
-							    cachep->shared *
-							    cachep->batchcount,
-							    0xbaadf00d)))
-					goto bad;
 
+			spin_lock_irq(&l3->list_lock);
+			if (!l3->shared) {
 				/* we are serialised from CPU_DEAD or
 				   CPU_UP_CANCELLED by the cpucontrol lock */
-				l3->shared = nc;
+				l3->shared = shared;
+				shared = NULL;
+			}
+			if (!l3->alien) {
+				l3->alien = alien;
+				alien = NULL;
 			}
+			spin_unlock_irq(&l3->list_lock);
+
+			kfree(shared);
+			free_alien_cache(alien);
 		}
 		mutex_unlock(&cache_chain_mutex);
 		break;
@@ -986,23 +1001,29 @@ static int __devinit cpuup_callback(stru
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DEAD:
+		/* Even if all the cpus of a node are down, we don't
+		 * free the kmem_list3 of any cache. This to avoid a race
+		 * between cpu_down, and a kmalloc allocation from another
+		 * cpu for memory from the node of the cpu going down.  
+		 * The list3 structure is usually allocated from 
+		 * kmem_cache_create and gets destroyed at kmem_cache_destroy
+		 */
 		/* fall thru */
 	case CPU_UP_CANCELED:
 		mutex_lock(&cache_chain_mutex);
 
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
-			spin_lock(&cachep->spinlock);
 			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
 			l3 = cachep->nodelists[node];
 
 			if (!l3)
-				goto unlock_cache;
+				goto free_array_cache;
 
 			spin_lock_irq(&l3->list_lock);
 
@@ -1013,33 +1034,43 @@ static int __devinit cpuup_callback(stru
 
 			if (!cpus_empty(mask)) {
 				spin_unlock_irq(&l3->list_lock);
-				goto unlock_cache;
+				goto free_array_cache;
 			}
 
-			if (l3->shared) {
+			if ((shared = l3->shared)) {
 				free_block(cachep, l3->shared->entry,
 					   l3->shared->avail, node);
-				kfree(l3->shared);
 				l3->shared = NULL;
 			}
-			if (l3->alien) {
-				drain_alien_cache(cachep, l3);
-				free_alien_cache(l3->alien);
-				l3->alien = NULL;
-			}
 
-			/* free slabs belonging to this node */
-			if (__node_shrink(cachep, node)) {
-				cachep->nodelists[node] = NULL;
-				spin_unlock_irq(&l3->list_lock);
-				kfree(l3);
-			} else {
-				spin_unlock_irq(&l3->list_lock);
+			alien = l3->alien;
+			l3->alien = NULL;
+
+			spin_unlock_irq(&l3->list_lock);
+
+			kfree(shared);
+			if (alien) {
+				drain_alien_cache(cachep, alien);
+				free_alien_cache(alien);
 			}
-		      unlock_cache:
-			spin_unlock(&cachep->spinlock);
+      free_array_cache:
 			kfree(nc);
 		}
+		/*
+		 * In the previous loop, all the objects were freed to
+		 * the respective cache's slabs,  now we can go ahead and
+		 * shrink each nodelist to its limit.
+		 */
+		list_for_each_entry(cachep, &cache_chain, next) {
+
+			l3 = cachep->nodelists[node];
+			if (!l3)
+				continue;
+			spin_lock_irq(&l3->list_lock);
+			/* free slabs belonging to this node */
+			__node_shrink(cachep, node);
+			spin_unlock_irq(&l3->list_lock);
+		}
 		mutex_unlock(&cache_chain_mutex);
 		break;
 #endif
@@ -2021,7 +2052,6 @@ static void drain_cpu_caches(struct kmem
 
 	smp_call_function_all_cpus(do_drain, cachep);
 	check_irq_on();
-	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		l3 = cachep->nodelists[node];
 		if (l3) {
@@ -2029,10 +2059,9 @@ static void drain_cpu_caches(struct kmem
 			drain_array_locked(cachep, l3->shared, 1, node);
 			spin_unlock_irq(&l3->list_lock);
 			if (l3->alien)
-				drain_alien_cache(cachep, l3);
+				drain_alien_cache(cachep, l3->alien);
 		}
 	}
-	spin_unlock(&cachep->spinlock);
 }
 
 static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -3459,7 +3488,7 @@ static void cache_reap(void *unused)
 
 		l3 = searchp->nodelists[numa_node_id()];
 		if (l3->alien)
-			drain_alien_cache(searchp, l3);
+			drain_alien_cache(searchp, l3->alien);
 		spin_lock_irq(&l3->list_lock);
 
 		drain_array_locked(searchp, cpu_cache_get(searchp), 0,
@@ -3618,7 +3647,8 @@ static int s_show(struct seq_file *m, vo
 			num_slabs++;
 		}
 		free_objects += l3->free_objects;
-		shared_avail += l3->shared->avail;
+		if (l3->shared)
+			shared_avail += l3->shared->avail;
 
 		spin_unlock_irq(&l3->list_lock);
 	}
@@ -3702,7 +3732,6 @@ static void do_dump_slabp(kmem_cache_t *
 	int node;
 
 	check_irq_on();
-	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		struct kmem_list3 *rl3 = cachep->nodelists[node];
 		spin_lock_irq(&rl3->list_lock);
@@ -3721,7 +3750,6 @@ static void do_dump_slabp(kmem_cache_t *
 		}
 		spin_unlock_irq(&rl3->list_lock);
 	}
-	spin_unlock(&cachep->spinlock);
 #endif
 }
 


* Re: [patch 0/3] NUMA slab locking fixes
  2006-02-03 20:53 [patch 0/3] NUMA slab locking fixes Ravikiran G Thirumalai
                   ` (2 preceding siblings ...)
  2006-02-03 20:57 ` [patch 3/3] NUMA slab locking fixes -- slab cpu hotplug fix Ravikiran G Thirumalai
@ 2006-02-03 22:07 ` Andrew Morton
  2006-02-03 23:06   ` Christoph Lameter
  3 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2006-02-03 22:07 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, manfred, shai, clameter, alok.kataria, sonny

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> There were some locking oddities in the NUMA slab allocator which broke cpu
> hotplug.  Here is a patchset to correct locking and also fix the
> breakage with cpu hotplug Sonny Rao noticed on POWER5.

These patches are on rc1-mm4, which unfortunately contains quite a lot of
-mm-only debugging stuff in slab.c.

Could you please redo/retest these patches so that they apply on
2.6.16-rc2.  Also, please arrange for any bugfixes to be staged ahead of
any optimisations - the optimisations we can defer until 2.6.17.


* Re: [patch 0/3] NUMA slab locking fixes
  2006-02-03 22:07 ` [patch 0/3] NUMA slab locking fixes Andrew Morton
@ 2006-02-03 23:06   ` Christoph Lameter
  2006-02-04  1:08     ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-02-03 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, shai, alok.kataria, sonny

On Fri, 3 Feb 2006, Andrew Morton wrote:

> Could you please redo/retest these patches so that they apply on
> 2.6.16-rc2.  Also, please arrange for any bugfixes to be staged ahead of
> any optimisations - the optimisations we can defer until 2.6.17.

It seems that these two things are tightly connected. By changing the 
locking he was able to address the hotplug breakage.



* Re: [patch 0/3] NUMA slab locking fixes
  2006-02-03 23:06   ` Christoph Lameter
@ 2006-02-04  1:08     ` Ravikiran G Thirumalai
  2006-02-04  1:15       ` [patch 1/3] NUMA slab locking fixes -- move color_next to l3 Ravikiran G Thirumalai
                         ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-04  1:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-kernel, manfred, shai, alok.kataria, sonny

On Fri, Feb 03, 2006 at 03:06:16PM -0800, Christoph Lameter wrote:
> On Fri, 3 Feb 2006, Andrew Morton wrote:
> 
> > Could you please redo/retest these patches so that they apply on
> > 2.6.16-rc2.  Also, please arrange for any bugfixes to be staged ahead of
> > any optimisations - the optimisations we can defer until 2.6.17.
> 
> It seems that these two things are tightly connected. By changing the 
> locking he was able to address the hotplug breakage.

Yes, they are.  The cachep->spinlock goes away in many places due to the
colour_next patch and keeping the l3 around (we should not have called that
an optimization, I guess :)), and the irq disabling had to be moved to the
l3 lock anyway.  Maybe I could have done away with patch 2 (merged it with
patch 3), but I thought keeping them separate made it more readable.

Patchset against rc2 follows.

Thanks,
Kiran

PS: Many hotplug fixes are yet to be applied upstream I think.  I know
these slab patches work well with mm4 (with the x86_64 subtree and hotplug 
fixes in there) but I cannot really test cpu_up after cpu_down on rc2 
because it is broken as of now (pending merges, I guess).


* [patch 1/3] NUMA slab locking fixes -- move color_next to l3
  2006-02-04  1:08     ` Ravikiran G Thirumalai
@ 2006-02-04  1:15       ` Ravikiran G Thirumalai
  2006-02-04  1:22       ` [patch 0/3] NUMA slab locking fixes Andrew Morton
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-04  1:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-kernel, manfred, shai, alok.kataria, sonny

colour_next is used as an index to add a colouring offset to a new slab 
in the cache (colour_off * colour_next).  Now with the NUMA aware slab 
allocator, it makes sense to colour slabs added on the same node 
sequentially with colour_next.

This patch moves the colouring index "colour_next" per-node by placing it
on kmem_list3 rather than kmem_cache.

This also helps simplify the locking for the cpu up and down paths.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c	2006-02-03 14:31:01.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c	2006-02-03 14:35:21.000000000 -0800
@@ -294,6 +294,7 @@ struct kmem_list3 {
 	unsigned long next_reap;
 	int free_touched;
 	unsigned int free_limit;
+	unsigned int colour_next;	/* Per-node cache coloring */
 	spinlock_t list_lock;
 	struct array_cache *shared;	/* shared per node */
 	struct array_cache **alien;	/* on other nodes */
@@ -344,6 +345,7 @@ static void kmem_list3_init(struct kmem_
 	INIT_LIST_HEAD(&parent->slabs_free);
 	parent->shared = NULL;
 	parent->alien = NULL;
+	parent->colour_next = 0;
 	spin_lock_init(&parent->list_lock);
 	parent->free_objects = 0;
 	parent->free_touched = 0;
@@ -390,7 +392,6 @@ struct kmem_cache {
 
 	size_t colour;		/* cache colouring range */
 	unsigned int colour_off;	/* colour offset */
-	unsigned int colour_next;	/* cache colouring */
 	struct kmem_cache *slabp_cache;
 	unsigned int slab_size;
 	unsigned int dflags;	/* dynamic flags */
@@ -1119,7 +1120,6 @@ void __init kmem_cache_init(void)
 		BUG();
 
 	cache_cache.colour = left_over / cache_cache.colour_off;
-	cache_cache.colour_next = 0;
 	cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
 				      sizeof(struct slab), cache_line_size());
 
@@ -2324,18 +2324,19 @@ static int cache_grow(struct kmem_cache 
 		 */
 		ctor_flags |= SLAB_CTOR_ATOMIC;
 
-	/* About to mess with non-constant members - lock. */
+	/* Take the l3 list lock to change the colour_next on this node */
 	check_irq_off();
-	spin_lock(&cachep->spinlock);
+	l3 = cachep->nodelists[nodeid];
+	spin_lock(&l3->list_lock);
 
 	/* Get colour for the slab, and cal the next value. */
-	offset = cachep->colour_next;
-	cachep->colour_next++;
-	if (cachep->colour_next >= cachep->colour)
-		cachep->colour_next = 0;
-	offset *= cachep->colour_off;
+	offset = l3->colour_next;
+	l3->colour_next++;
+	if (l3->colour_next >= cachep->colour)
+		l3->colour_next = 0;
+	spin_unlock(&l3->list_lock);
 
-	spin_unlock(&cachep->spinlock);
+	offset *= cachep->colour_off;
 
 	check_irq_off();
 	if (local_flags & __GFP_WAIT)
@@ -2367,7 +2368,6 @@ static int cache_grow(struct kmem_cache 
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	check_irq_off();
-	l3 = cachep->nodelists[nodeid];
 	spin_lock(&l3->list_lock);
 
 	/* Make slab active. */


* Re: [patch 0/3] NUMA slab locking fixes
  2006-02-04  1:08     ` Ravikiran G Thirumalai
  2006-02-04  1:15       ` [patch 1/3] NUMA slab locking fixes -- move color_next to l3 Ravikiran G Thirumalai
@ 2006-02-04  1:22       ` Andrew Morton
  2006-02-04  1:28       ` [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock Ravikiran G Thirumalai
  2006-02-04  1:29       ` [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking Ravikiran G Thirumalai
  3 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2006-02-04  1:22 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: clameter, linux-kernel, manfred, shai, alok.kataria, sonny

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> PS: Many hotplug fixes are yet to be applied upstream I think.  I know
>  these slab patches work well with mm4 (with the x86_64 subtree and hotplug 
>  fixes in there) but I cannot really test cpu_up after cpu_down on rc2 
>  because it is broken as of now (pending merges, I guess).

Yes, -rc2 was a shade too early for me to complete merging stuff.  Current
-linus should be better.


* [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock
  2006-02-04  1:08     ` Ravikiran G Thirumalai
  2006-02-04  1:15       ` [patch 1/3] NUMA slab locking fixes -- move color_next to l3 Ravikiran G Thirumalai
  2006-02-04  1:22       ` [patch 0/3] NUMA slab locking fixes Andrew Morton
@ 2006-02-04  1:28       ` Ravikiran G Thirumalai
  2006-02-04  9:48         ` Andrew Morton
  2006-02-04  1:29       ` [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking Ravikiran G Thirumalai
  3 siblings, 1 reply; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-04  1:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-kernel, manfred, shai, alok.kataria, sonny

Earlier, we had to disable on-chip interrupts while taking the cachep->spinlock
because at cache_grow, on every addition of a slab to a slab cache, we
incremented colour_next, which was protected by the cachep->spinlock, and
cache_grow can occur in interrupt context.  Since the per-node colour_next is
now protected by the node's list_lock, we no longer need to disable interrupts
while taking the per-cache spinlock; we only need to disable interrupts when
taking the per-node kmem_list3 list_lock.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c	2006-02-03 15:07:56.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c	2006-02-03 15:10:04.000000000 -0800
@@ -987,7 +987,7 @@ static int __devinit cpuup_callback(stru
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
-			spin_lock_irq(&cachep->spinlock);
+			spin_lock(&cachep->spinlock);
 			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
@@ -996,7 +996,7 @@ static int __devinit cpuup_callback(stru
 			if (!l3)
 				goto unlock_cache;
 
-			spin_lock(&l3->list_lock);
+			spin_lock_irq(&l3->list_lock);
 
 			/* Free limit for this kmem_list3 */
 			l3->free_limit -= cachep->batchcount;
@@ -1004,7 +1004,7 @@ static int __devinit cpuup_callback(stru
 				free_block(cachep, nc->entry, nc->avail, node);
 
 			if (!cpus_empty(mask)) {
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 				goto unlock_cache;
 			}
 
@@ -1023,13 +1023,13 @@ static int __devinit cpuup_callback(stru
 			/* free slabs belonging to this node */
 			if (__node_shrink(cachep, node)) {
 				cachep->nodelists[node] = NULL;
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 				kfree(l3);
 			} else {
-				spin_unlock(&l3->list_lock);
+				spin_unlock_irq(&l3->list_lock);
 			}
 		      unlock_cache:
-			spin_unlock_irq(&cachep->spinlock);
+			spin_unlock(&cachep->spinlock);
 			kfree(nc);
 		}
 		mutex_unlock(&cache_chain_mutex);
@@ -2011,18 +2011,18 @@ static void drain_cpu_caches(struct kmem
 
 	smp_call_function_all_cpus(do_drain, cachep);
 	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		l3 = cachep->nodelists[node];
 		if (l3) {
-			spin_lock(&l3->list_lock);
+			spin_lock_irq(&l3->list_lock);
 			drain_array_locked(cachep, l3->shared, 1, node);
-			spin_unlock(&l3->list_lock);
+			spin_unlock_irq(&l3->list_lock);
 			if (l3->alien)
 				drain_alien_cache(cachep, l3);
 		}
 	}
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 }
 
 static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -2338,7 +2338,6 @@ static int cache_grow(struct kmem_cache 
 
 	offset *= cachep->colour_off;
 
-	check_irq_off();
 	if (local_flags & __GFP_WAIT)
 		local_irq_enable();
 
@@ -2725,6 +2724,7 @@ static void *__cache_alloc_node(struct k
 	BUG_ON(!l3);
 
       retry:
+	check_irq_off();
 	spin_lock(&l3->list_lock);
 	entry = l3->slabs_partial.next;
 	if (entry == &l3->slabs_partial) {
@@ -3304,11 +3304,11 @@ static int do_tune_cpucache(struct kmem_
 	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
 
 	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	cachep->batchcount = batchcount;
 	cachep->limit = limit;
 	cachep->shared = shared;
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 
 	for_each_online_cpu(i) {
 		struct array_cache *ccold = new.new[i];
@@ -3564,8 +3564,7 @@ static int s_show(struct seq_file *m, vo
 	int node;
 	struct kmem_list3 *l3;
 
-	check_irq_on();
-	spin_lock_irq(&cachep->spinlock);
+	spin_lock(&cachep->spinlock);
 	active_objs = 0;
 	num_slabs = 0;
 	for_each_online_node(node) {
@@ -3573,7 +3572,8 @@ static int s_show(struct seq_file *m, vo
 		if (!l3)
 			continue;
 
-		spin_lock(&l3->list_lock);
+		check_irq_on();
+		spin_lock_irq(&l3->list_lock);
 
 		list_for_each(q, &l3->slabs_full) {
 			slabp = list_entry(q, struct slab, list);
@@ -3600,7 +3600,7 @@ static int s_show(struct seq_file *m, vo
 		free_objects += l3->free_objects;
 		shared_avail += l3->shared->avail;
 
-		spin_unlock(&l3->list_lock);
+		spin_unlock_irq(&l3->list_lock);
 	}
 	num_slabs += active_slabs;
 	num_objs = num_slabs * cachep->num;
@@ -3644,7 +3644,7 @@ static int s_show(struct seq_file *m, vo
 	}
 #endif
 	seq_putc(m, '\n');
-	spin_unlock_irq(&cachep->spinlock);
+	spin_unlock(&cachep->spinlock);
 	return 0;
 }
 


* [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking
  2006-02-04  1:08     ` Ravikiran G Thirumalai
                         ` (2 preceding siblings ...)
  2006-02-04  1:28       ` [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock Ravikiran G Thirumalai
@ 2006-02-04  1:29       ` Ravikiran G Thirumalai
  2006-02-04 10:03         ` Andrew Morton
  3 siblings, 1 reply; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-04  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-kernel, manfred, shai, alok.kataria, sonny

This fixes locking and bugs in the cpu_down and cpu_up paths of the NUMA slab
allocator.  Sonny Rao <sonny@burdell.org> reported problems some time back on
POWER5 boxes when the last cpu on a node was being offlined.  We could not
reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
being updated on cpu down.  Since that issue is now fixed, we can reproduce
Sonny's problems on x86_64 NUMA, and here is the fix.

The problem was that on CPU_DOWN, if it was the last cpu on the node to go
down, the array_caches (shared, alien) and the kmem_list3 of the node were
being freed (kfree) with the kmem_list3 lock held.  If the l3 or the
array_caches came from the same cache being cleared, we hit badness.

This patch cleans up the locking in the cpu_up and cpu_down paths.
We cannot really free the l3 on cpu down because there is no node offlining
yet, and even though a cpu is not yet up, node-local memory can be allocated
for it.  So l3s are usually allocated at kmem_cache_create and destroyed at
kmem_cache_destroy.  Hence, we don't need cachep->spinlock protection to get
to cachep->nodelists[nodeid] either.

The patch survived onlining and offlining on a 4-core, 2-node Tyan box with
4 dbench processes running all the time.

Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6.16-rc2/mm/slab.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/slab.c	2006-02-03 15:10:04.000000000 -0800
+++ linux-2.6.16-rc2/mm/slab.c	2006-02-03 15:18:51.000000000 -0800
@@ -884,14 +884,14 @@ static void __drain_alien_cache(struct k
 	}
 }
 
-static void drain_alien_cache(struct kmem_cache *cachep, struct kmem_list3 *l3)
+static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien)
 {
 	int i = 0;
 	struct array_cache *ac;
 	unsigned long flags;
 
 	for_each_online_node(i) {
-		ac = l3->alien[i];
+		ac = alien[i];
 		if (ac) {
 			spin_lock_irqsave(&ac->lock, flags);
 			__drain_alien_cache(cachep, ac, i);
@@ -902,7 +902,7 @@ static void drain_alien_cache(struct kme
 #else
 #define alloc_alien_cache(node, limit) do { } while (0)
 #define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
+#define drain_alien_cache(cachep, alien) do { } while (0)
 #endif
 
 static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -936,6 +936,9 @@ static int __devinit cpuup_callback(stru
 				l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
 				    ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
 
+				/* The l3s don't come and go as cpus come and
+				   go.  cache_chain_mutex is sufficient 
+				   protection here */
 				cachep->nodelists[node] = l3;
 			}
 
@@ -949,27 +952,39 @@ static int __devinit cpuup_callback(stru
 		/* Now we can go ahead with allocating the shared array's
 		   & array cache's */
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 
-			nc = alloc_arraycache(node, cachep->limit,
-					      cachep->batchcount);
-			if (!nc)
+			if (!(nc = alloc_arraycache(node, 
+			      cachep->limit, cachep->batchcount)))
 				goto bad;
+			if (!(shared = alloc_arraycache(node,
+			      cachep->shared*cachep->batchcount, 0xbaadf00d)))
+				goto bad;
+#ifdef CONFIG_NUMA
+			if (!(alien = alloc_alien_cache(node,
+			      cachep->limit)))
+				goto bad;
+#endif
 			cachep->array[cpu] = nc;
 
 			l3 = cachep->nodelists[node];
 			BUG_ON(!l3);
-			if (!l3->shared) {
-				if (!(nc = alloc_arraycache(node,
-							    cachep->shared *
-							    cachep->batchcount,
-							    0xbaadf00d)))
-					goto bad;
 
+			spin_lock_irq(&l3->list_lock);
+			if (!l3->shared) {
 				/* we are serialised from CPU_DEAD or
 				   CPU_UP_CANCELLED by the cpucontrol lock */
-				l3->shared = nc;
+				l3->shared = shared;
+				shared = NULL;
+			}
+			if (!l3->alien) {
+				l3->alien = alien;
+				alien = NULL;
 			}
+			spin_unlock_irq(&l3->list_lock);
+
+			kfree(shared);
+			free_alien_cache(alien);
 		}
 		mutex_unlock(&cache_chain_mutex);
 		break;
@@ -978,23 +993,29 @@ static int __devinit cpuup_callback(stru
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DEAD:
+		/* Even if all the cpus of a node are down, we don't
+		 * free the kmem_list3 of any cache. This to avoid a race
+		 * between cpu_down, and a kmalloc allocation from another
+		 * cpu for memory from the node of the cpu going down.  
+		 * The list3 structure is usually allocated from 
+		 * kmem_cache_create and gets destroyed at kmem_cache_destroy
+		 */
 		/* fall thru */
 	case CPU_UP_CANCELED:
 		mutex_lock(&cache_chain_mutex);
 
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc;
+			struct array_cache *nc, *shared, **alien;
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
-			spin_lock(&cachep->spinlock);
 			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
 			l3 = cachep->nodelists[node];
 
 			if (!l3)
-				goto unlock_cache;
+				goto free_array_cache;
 
 			spin_lock_irq(&l3->list_lock);
 
@@ -1005,33 +1026,43 @@ static int __devinit cpuup_callback(stru
 
 			if (!cpus_empty(mask)) {
 				spin_unlock_irq(&l3->list_lock);
-				goto unlock_cache;
+				goto free_array_cache;
 			}
 
-			if (l3->shared) {
+			if ((shared = l3->shared)) {
 				free_block(cachep, l3->shared->entry,
 					   l3->shared->avail, node);
-				kfree(l3->shared);
 				l3->shared = NULL;
 			}
-			if (l3->alien) {
-				drain_alien_cache(cachep, l3);
-				free_alien_cache(l3->alien);
-				l3->alien = NULL;
-			}
 
-			/* free slabs belonging to this node */
-			if (__node_shrink(cachep, node)) {
-				cachep->nodelists[node] = NULL;
-				spin_unlock_irq(&l3->list_lock);
-				kfree(l3);
-			} else {
-				spin_unlock_irq(&l3->list_lock);
+			alien = l3->alien;
+			l3->alien = NULL;
+
+			spin_unlock_irq(&l3->list_lock);
+
+			kfree(shared);
+			if (alien) {
+				drain_alien_cache(cachep, alien);
+				free_alien_cache(alien);
 			}
-		      unlock_cache:
-			spin_unlock(&cachep->spinlock);
+      free_array_cache:
 			kfree(nc);
 		}
+		/*
+		 * In the previous loop, all the objects were freed to
+		 * the respective cache's slabs,  now we can go ahead and
+		 * shrink each nodelist to its limit.
+		 */
+		list_for_each_entry(cachep, &cache_chain, next) {
+
+			l3 = cachep->nodelists[node];
+			if (!l3)
+				continue;
+			spin_lock_irq(&l3->list_lock);
+			/* free slabs belonging to this node */
+			__node_shrink(cachep, node);
+			spin_unlock_irq(&l3->list_lock);
+		}
 		mutex_unlock(&cache_chain_mutex);
 		break;
 #endif
@@ -2011,7 +2042,6 @@ static void drain_cpu_caches(struct kmem
 
 	smp_call_function_all_cpus(do_drain, cachep);
 	check_irq_on();
-	spin_lock(&cachep->spinlock);
 	for_each_online_node(node) {
 		l3 = cachep->nodelists[node];
 		if (l3) {
@@ -2019,10 +2049,9 @@ static void drain_cpu_caches(struct kmem
 			drain_array_locked(cachep, l3->shared, 1, node);
 			spin_unlock_irq(&l3->list_lock);
 			if (l3->alien)
-				drain_alien_cache(cachep, l3);
+				drain_alien_cache(cachep, l3->alien);
 		}
 	}
-	spin_unlock(&cachep->spinlock);
 }
 
 static int __node_shrink(struct kmem_cache *cachep, int node)
@@ -3440,7 +3469,7 @@ static void cache_reap(void *unused)
 
 		l3 = searchp->nodelists[numa_node_id()];
 		if (l3->alien)
-			drain_alien_cache(searchp, l3);
+			drain_alien_cache(searchp, l3->alien);
 		spin_lock_irq(&l3->list_lock);
 
 		drain_array_locked(searchp, cpu_cache_get(searchp), 0,
@@ -3598,7 +3627,8 @@ static int s_show(struct seq_file *m, vo
 			num_slabs++;
 		}
 		free_objects += l3->free_objects;
-		shared_avail += l3->shared->avail;
+		if (l3->shared)
+			shared_avail += l3->shared->avail;
 
 		spin_unlock_irq(&l3->list_lock);
 	}


* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock
  2006-02-04  1:28       ` [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock Ravikiran G Thirumalai
@ 2006-02-04  9:48         ` Andrew Morton
  2006-02-06 22:51           ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2006-02-04  9:48 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: clameter, linux-kernel, manfred, shai, alok.kataria, sonny

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> Earlier, we had to disable on chip interrupts while taking the cachep->spinlock
>  because, at cache_grow, on every addition of a slab to a slab cache, we 
>  incremented colour_next which was protected by the cachep->spinlock, and
>  cache_grow could occur at interrupt context.  Since, now we protect the 
>  per-node colour_next with the node's list_lock, we do not need to disable 
>  on chip interrupts while taking the per-cache spinlock, but we
>  just need to disable interrupts when taking the per-node kmem_list3 list_lock.

It'd be nice to have some comments describing what cachep->spinlock
actually protects.

Does __cache_shrink() need some locking to prevent nodes from going offline?


* Re: [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking
  2006-02-04  1:29       ` [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking Ravikiran G Thirumalai
@ 2006-02-04 10:03         ` Andrew Morton
  2006-02-04 10:05           ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2006-02-04 10:03 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: clameter, linux-kernel, manfred, shai, alok.kataria, sonny

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> This fixes locking and bugs in cpu_down and cpu_up paths of the NUMA slab
> allocator.  Sonny Rao <sonny@burdell.org> reported problems sometime back 
> on POWER5 boxes, when the last cpu on the nodes were being offlined.  We could
> not reproduce the same on x86_64 because the cpumask (node_to_cpumask) was not
> being updated on cpu down.  Since that issue is now fixed, we can reproduce
> Sonny's problems on x86_64 NUMA, and here is the fix.
> 
> The problem earlier was on CPU_DOWN, if it was the last cpu on the node to go 
> down, the array_caches (shared, alien)  and the kmem_list3 of the node were 
> being freed (kfree) with the kmem_list3 lock held.  If the l3 or the 
> array_caches were to come from the same cache being cleared, we hit on badness.
> 
> This patch cleans up the locking in cpu_up and cpu_down path.  
> We cannot really free l3 on cpu down because, there is no node offlining yet
> and even though a cpu is not yet up, node local memory can be allocated
> for it. So l3s are usually allocated at keme_cache_create and destroyed at kmem_cache_destroy.  Hence, we don't need cachep->spinlock protection to get
> to the cachep->nodelist[nodeid] either.
> 
> Patch survived onlining and offlining on a 4 core 2 node Tyan box with a 
> 4 dbench process running all the time.
> 
> ...
>
> +			if (!(nc = alloc_arraycache(node, 
> +			      cachep->limit, cachep->batchcount)))
>  				goto bad;
> +			if (!(shared = alloc_arraycache(node,
> +			      cachep->shared*cachep->batchcount, 0xbaadf00d)))
> +				goto bad;

Please don't do things like that - it's quite hard to read and we avoid it.
Cleanup patch below.

--- devel/mm/slab.c~numa-slab-locking-fixes-fix-cpu-down-and-up-locking-tidy	2006-02-04 02:01:33.000000000 -0800
+++ devel-akpm/mm/slab.c	2006-02-04 02:01:33.000000000 -0800
@@ -936,9 +936,11 @@ static int __devinit cpuup_callback(stru
 				l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
 				    ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
 
-				/* The l3s don't come and go as cpus come and
-				   go.  cache_chain_mutex is sufficient
-				   protection here */
+				/*
+				 * The l3s don't come and go as CPUs come and
+				 * go.  cache_chain_mutex is sufficient
+				 * protection here.
+				 */
 				cachep->nodelists[node] = l3;
 			}
 
@@ -952,17 +954,22 @@ static int __devinit cpuup_callback(stru
 		/* Now we can go ahead with allocating the shared array's
 		   & array cache's */
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc, *shared, **alien;
-
-			if (!(nc = alloc_arraycache(node,
-			      cachep->limit, cachep->batchcount)))
+			struct array_cache *nc;
+			struct array_cache *shared;
+			struct array_cache **alien;
+
+			nc = alloc_arraycache(node, cachep->limit,
+						cachep->batchcount);
+			if (!nc)
 				goto bad;
-			if (!(shared = alloc_arraycache(node,
-			      cachep->shared*cachep->batchcount, 0xbaadf00d)))
+			shared = alloc_arraycache(node,
+					cachep->shared * cachep->batchcount,
+					0xbaadf00d);
+			if (!shared)
 				goto bad;
 #ifdef CONFIG_NUMA
-			if (!(alien = alloc_alien_cache(node,
-			      cachep->limit)))
+			alien = alloc_alien_cache(node, cachep->limit);
+			if (!alien)
 				goto bad;
 #endif
 			cachep->array[cpu] = nc;
@@ -972,8 +979,10 @@ static int __devinit cpuup_callback(stru
 
 			spin_lock_irq(&l3->list_lock);
 			if (!l3->shared) {
-				/* we are serialised from CPU_DEAD or
-				   CPU_UP_CANCELLED by the cpucontrol lock */
+				/*
+				 * We are serialised from CPU_DEAD or
+				 * CPU_UP_CANCELLED by the cpucontrol lock
+				 */
 				l3->shared = shared;
 				shared = NULL;
 			}
@@ -993,19 +1002,22 @@ static int __devinit cpuup_callback(stru
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_DEAD:
-		/* Even if all the cpus of a node are down, we don't
-		 * free the kmem_list3 of any cache. This to avoid a race
-		 * between cpu_down, and a kmalloc allocation from another
-		 * cpu for memory from the node of the cpu going down.
-		 * The list3 structure is usually allocated from
-		 * kmem_cache_create and gets destroyed at kmem_cache_destroy
+		/*
+		 * Even if all the cpus of a node are down, we don't free the
+		 * kmem_list3 of any cache. This to avoid a race between
+		 * cpu_down, and a kmalloc allocation from another cpu for
+		 * memory from the node of the cpu going down.  The list3
+		 * structure is usually allocated from kmem_cache_create() and
+		 * gets destroyed at kmem_cache_destroy().
 		 */
 		/* fall thru */
 	case CPU_UP_CANCELED:
 		mutex_lock(&cache_chain_mutex);
 
 		list_for_each_entry(cachep, &cache_chain, next) {
-			struct array_cache *nc, *shared, **alien;
+			struct array_cache *nc;
+			struct array_cache *shared;
+			struct array_cache **alien;
 			cpumask_t mask;
 
 			mask = node_to_cpumask(node);
@@ -1029,7 +1041,8 @@ static int __devinit cpuup_callback(stru
 				goto free_array_cache;
 			}
 
-			if ((shared = l3->shared)) {
+			shared = l3->shared;
+			if (shared) {
 				free_block(cachep, l3->shared->entry,
 					   l3->shared->avail, node);
 				l3->shared = NULL;
@@ -1045,7 +1058,7 @@ static int __devinit cpuup_callback(stru
 				drain_alien_cache(cachep, alien);
 				free_alien_cache(alien);
 			}
-      free_array_cache:
+free_array_cache:
 			kfree(nc);
 		}
 		/*
@@ -1054,7 +1067,6 @@ static int __devinit cpuup_callback(stru
 		 * shrink each nodelist to its limit.
 		 */
 		list_for_each_entry(cachep, &cache_chain, next) {
-
 			l3 = cachep->nodelists[node];
 			if (!l3)
 				continue;
_



* Re: [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking
  2006-02-04 10:03         ` Andrew Morton
@ 2006-02-04 10:05           ` Andrew Morton
  0 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2006-02-04 10:05 UTC (permalink / raw)
  To: kiran, clameter, linux-kernel, manfred, shai, alok.kataria, sonny

Andrew Morton <akpm@osdl.org> wrote:
>
> Cleanup patch below.

After which we need to address this:

mm/slab.c: In function `cpuup_callback':
mm/slab.c:959: warning: `alien' might be used uninitialized in this function

I guess it won't cause crashes, but it makes me wonder if this was tested
on non-NUMA?

Is this right?

--- devel/mm/slab.c~numa-slab-locking-fixes-fix-cpu-down-and-up-locking-fix	2006-02-04 02:01:44.000000000 -0800
+++ devel-akpm/mm/slab.c	2006-02-04 02:01:44.000000000 -0800
@@ -901,8 +901,11 @@ static void drain_alien_cache(struct kme
 }
 #else
 #define alloc_alien_cache(node, limit) do { } while (0)
-#define free_alien_cache(ac_ptr) do { } while (0)
 #define drain_alien_cache(cachep, alien) do { } while (0)
+
+static inline void free_alien_cache(struct array_cache **ac_ptr)
+{
+}
 #endif
 
 static int __devinit cpuup_callback(struct notifier_block *nfb,
@@ -986,10 +989,12 @@ static int __devinit cpuup_callback(stru
 				l3->shared = shared;
 				shared = NULL;
 			}
+#ifdef CONFIG_NUMA
 			if (!l3->alien) {
 				l3->alien = alien;
 				alien = NULL;
 			}
+#endif
 			spin_unlock_irq(&l3->list_lock);
 
 			kfree(shared);
_
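
For what it's worth, the underlying idiom in isolation (a standalone sketch,
not the kernel's actual definitions): an empty function-like macro throws its
argument away unexamined, while a static inline no-op keeps the call
type-checked and counts as a use of the variable.

#include <stdio.h>

/* Old style: the argument disappears at preprocessing time. */
#define free_stub_macro(p)      do { } while (0)

/* Suggested style: an empty inline still type-checks its argument. */
static inline void free_stub_inline(int **p)
{
        (void)p;                /* nothing to do in this configuration */
}

int main(void)
{
        int **ptr = NULL;

        free_stub_macro(ptr);   /* ptr is never even looked at */
        free_stub_inline(ptr);  /* must really be an int ** */
        printf("stubs ran\n");
        return 0;
}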



* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cahep->spinlock to l3 lock
  2006-02-04  9:48         ` Andrew Morton
@ 2006-02-06 22:51           ` Ravikiran G Thirumalai
  2006-02-06 23:30             ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-06 22:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-kernel, manfred, shai, alok.kataria, sonny

On Sat, Feb 04, 2006 at 01:48:28AM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> >
> > Earlier, we had to disable on chip interrupts while taking the cachep->spinlock
> >  because, at cache_grow, on every addition of a slab to a slab cache, we 
> >  incremented colour_next which was protected by the cachep->spinlock, and
> >  cache_grow could occur at interrupt context.  Since, now we protect the 
> >  per-node colour_next with the node's list_lock, we do not need to disable 
> >  on chip interrupts while taking the per-cache spinlock, but we
> >  just need to disable interrupts when taking the per-node kmem_list3 list_lock.
> 
> It'd be nice to have some comments describing what cachep->spinlock
> actually protects.

Actually, it does not protect much anymore.  Here's a cleanup+comments
patch (against current mainline).

The non-atomic stats on the cachep structure now need serialization, though.
Would a spinlock be better there than changing them over to atomic_t's?
I wonder...
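
(Purely to illustrate the two alternatives being weighed here, in standalone
userspace C11 rather than the kernel's atomic_t API:)

#include <stdatomic.h>
#include <stdio.h>

/* Option 1: make each statistic an atomic counter. */
static atomic_ulong num_allocations;

/* Option 2: keep a plain counter and serialize updates externally
 * (in the kernel that would be a spinlock around the increment). */
static unsigned long num_allocations_plain;

int main(void)
{
        atomic_fetch_add(&num_allocations, 1);
        num_allocations_plain++;        /* needs the external lock held */

        printf("atomic %lu, plain %lu\n",
               (unsigned long)atomic_load(&num_allocations),
               num_allocations_plain);
        return 0;
}

In the kernel, the atomic variant would use atomic_t and atomic_inc() rather
than C11 atomics.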

> 
> Does __cache_shrink() need some locking to prevent nodes from going offline?

Looks like l3 list locking (which is already done) is sufficient, although
one of the two paths calling __cache_shrink takes the cpu hotplug lock
(kmem_cache_destroy()).  The other does not seem to be used much
(kmem_cache_shrink())

Thanks,
Kiran


Remove cachep->spinlock.  Locking has moved to the kmem_list3, and
most of the structures protected earlier by cachep->spinlock are
now protected by the l3->list_lock.  Slab cache tunables like
batchcount are always accessed with the cache_chain_mutex held.

Patch tested on SMP and NUMA kernels with dbench processes
running, constant onlining/offlining, and constant cache tuning,
all at the same time.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2006-02-05 14:38:37.000000000 -0800
+++ linux-2.6/mm/slab.c	2006-02-06 12:05:26.000000000 -0800
@@ -373,17 +373,19 @@ static void kmem_list3_init(struct kmem_
 struct kmem_cache {
 /* 1) per-cpu data, touched during every alloc/free */
 	struct array_cache *array[NR_CPUS];
+/* 2) Cache tunables. Protected by cache_chain_mutex */
 	unsigned int batchcount;
 	unsigned int limit;
 	unsigned int shared;
+
 	unsigned int buffer_size;
-/* 2) touched by every alloc & free from the backend */
+/* 3) touched by every alloc & free from the backend */
 	struct kmem_list3 *nodelists[MAX_NUMNODES];
+
 	unsigned int flags;	/* constant flags */
 	unsigned int num;	/* # of objs per slab */
-	spinlock_t spinlock;
 
-/* 3) cache_grow/shrink */
+/* 4) cache_grow/shrink */
 	/* order of pgs per slab (2^n) */
 	unsigned int gfporder;
 
@@ -402,11 +404,11 @@ struct kmem_cache {
 	/* de-constructor func */
 	void (*dtor) (void *, struct kmem_cache *, unsigned long);
 
-/* 4) cache creation/removal */
+/* 5) cache creation/removal */
 	const char *name;
 	struct list_head next;
 
-/* 5) statistics */
+/* 6) statistics */
 #if STATS
 	unsigned long num_active;
 	unsigned long num_allocations;
@@ -643,7 +645,6 @@ static struct kmem_cache cache_cache = {
 	.shared = 1,
 	.buffer_size = sizeof(struct kmem_cache),
 	.flags = SLAB_NO_REAP,
-	.spinlock = SPIN_LOCK_UNLOCKED,
 	.name = "kmem_cache",
 #if DEBUG
 	.obj_size = sizeof(struct kmem_cache),
@@ -1909,7 +1910,6 @@ kmem_cache_create (const char *name, siz
 	cachep->gfpflags = 0;
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
-	spin_lock_init(&cachep->spinlock);
 	cachep->buffer_size = size;
 
 	if (flags & CFLGS_OFF_SLAB)
@@ -3334,6 +3334,7 @@ static void do_ccupdate_local(void *info
 	new->new[smp_processor_id()] = old;
 }
 
+/* Always called with the cache_chain_mutex held */
 static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount,
 			    int shared)
 {
@@ -3355,11 +3356,9 @@ static int do_tune_cpucache(struct kmem_
 	smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
 
 	check_irq_on();
-	spin_lock(&cachep->spinlock);
 	cachep->batchcount = batchcount;
 	cachep->limit = limit;
 	cachep->shared = shared;
-	spin_unlock(&cachep->spinlock);
 
 	for_each_online_cpu(i) {
 		struct array_cache *ccold = new.new[i];
@@ -3380,6 +3379,7 @@ static int do_tune_cpucache(struct kmem_
 	return 0;
 }
 
+/* Always called with the cache_chain_mutex held */
 static void enable_cpucache(struct kmem_cache *cachep)
 {
 	int err;
@@ -3615,7 +3615,6 @@ static int s_show(struct seq_file *m, vo
 	int node;
 	struct kmem_list3 *l3;
 
-	spin_lock(&cachep->spinlock);
 	active_objs = 0;
 	num_slabs = 0;
 	for_each_online_node(node) {
@@ -3696,7 +3695,6 @@ static int s_show(struct seq_file *m, vo
 	}
 #endif
 	seq_putc(m, '\n');
-	spin_unlock(&cachep->spinlock);
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock
  2006-02-06 22:51           ` Ravikiran G Thirumalai
@ 2006-02-06 23:30             ` Andrew Morton
  2006-02-07  0:21               ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2006-02-06 23:30 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: clameter, linux-kernel, manfred, shai, alok.kataria, sonny, Pekka Enberg

Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
>
> On Sat, Feb 04, 2006 at 01:48:28AM -0800, Andrew Morton wrote:
> > Ravikiran G Thirumalai <kiran@scalex86.org> wrote:
> > >
> > > Earlier, we had to disable on-chip interrupts while taking the cachep->spinlock
> > >  because, at cache_grow, on every addition of a slab to a slab cache, we
> > >  incremented colour_next, which was protected by the cachep->spinlock, and
> > >  cache_grow could occur in interrupt context.  Since we now protect the
> > >  per-node colour_next with the node's list_lock, we do not need to disable
> > >  on-chip interrupts while taking the per-cache spinlock; we
> > >  just need to disable interrupts when taking the per-node kmem_list3 list_lock.
> > 
> > It'd be nice to have some comments describing what cachep->spinlock
> > actually protects.
> 
> Actually, it does not protect much anymore.  Here's a cleanup+comments
> patch (against current mainline).

This is getting scary.  Manfred, Christoph, Pekka: have you guys taken a
close look at what's going on in here?

> The non-atomic stats on the cachep structure now need serialization though.
> Would a spinlock be better there, instead of changing them over to atomic_t's?
> I wonder...

It's common for the kernel to use non-atomic ops for stats such as this
(networking especially).  We accept the small potential for small
inaccuracy as an acceptable tradeoff for improved performance.

But given that a) these stats are only enabled with CONFIG_DEBUG_SLAB and
b) CONFIG_DEBUG_SLAB causes performance to suck the big one anyway and c)
the slab.c stats are wrapped in handy macros, yes, I guess it wouldn't hurt
to make these guys atomic_t.  Sometime.  My slab.c is looking rather
overpatched again at present.
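
Something like this, roughly (not a patch; wrapper and field names from
memory, and it assumes the counters themselves are switched to atomic_t):

#if STATS
#define STATS_INC_ACTIVE(x)	atomic_inc(&(x)->num_active)	/* was (x)->num_active++ */
#define STATS_DEC_ACTIVE(x)	atomic_dec(&(x)->num_active)
#define STATS_INC_ALLOCED(x)	atomic_inc(&(x)->num_allocations)
#else
#define STATS_INC_ACTIVE(x)	do { } while (0)
#define STATS_DEC_ACTIVE(x)	do { } while (0)
#define STATS_INC_ALLOCED(x)	do { } while (0)
#endif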

A problem right now is the -mm-only slab debug patches
(slab-leak-detector.patch, slab-cache-shrinker-statistics.patch and the
currently-dropped
periodically-scan-redzone-entries-and-slab-control-structures.patch).  It
would be good if we could finish these things off and get them into mainline
so things get a bit sane again.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock
  2006-02-06 23:30             ` Andrew Morton
@ 2006-02-07  0:21               ` Christoph Lameter
  2006-02-07  7:36                 ` Pekka J Enberg
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2006-02-07  0:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ravikiran G Thirumalai, linux-kernel, manfred, shai,
	alok.kataria, sonny, Pekka Enberg

On Mon, 6 Feb 2006, Andrew Morton wrote:

> > Actually, it does not protect much anymore.  Here's a cleanup+comments
> > patch (against current mainline).
> 
> This is getting scary.  Manfred, Christoph, Pekka: have you guys taken a
> close look at what's going on in here?

I looked at his patch and he seems to be right. Most of the kmem_cache 
structure is established at slab creation. Updates are to the debug 
counters and to nodelists[] during node online/offline and to array[] 
during cpu online/offline. The chain mutex is used to protect the 
setting of the tuning parameters. I still need to have a look at the 
details though.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock
  2006-02-07  0:21               ` Christoph Lameter
@ 2006-02-07  7:36                 ` Pekka J Enberg
  2006-02-07  7:50                   ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 20+ messages in thread
From: Pekka J Enberg @ 2006-02-07  7:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Ravikiran G Thirumalai, linux-kernel, manfred,
	shai, alok.kataria, sonny

On Mon, 6 Feb 2006, Andrew Morton wrote:
> > This is getting scary.  Manfred, Christoph, Pekka: have you guys taken a
> > close look at what's going on in here?

On Mon, 6 Feb 2006, Christoph Lameter wrote:
> I looked at his patch and he seems to be right. Most of the kmem_cache 
> structure is established at slab creation. Updates are to the debug 
> counters and to nodelists[] during node online/offline and to array[] 
> during cpu online/offline. The chain mutex is used to protect the 
> setting of the tuning parameters. I still need to have a look at the 
> details though.

The patch looks correct, but I am wondering whether we should keep the spinlock
around for clarity.  The chain mutex doesn't really have anything to do
with the tunables; it's there to protect the cache chain.  I am worried
that this patch makes code restructuring harder.  Hmm?

				Pekka

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock
  2006-02-07  7:36                 ` Pekka J Enberg
@ 2006-02-07  7:50                   ` Ravikiran G Thirumalai
  2006-02-07  7:55                     ` Pekka J Enberg
  0 siblings, 1 reply; 20+ messages in thread
From: Ravikiran G Thirumalai @ 2006-02-07  7:50 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Christoph Lameter, Andrew Morton, linux-kernel, manfred, shai,
	alok.kataria, sonny

On Tue, Feb 07, 2006 at 09:36:40AM +0200, Pekka J Enberg wrote:
> On Mon, 6 Feb 2006, Andrew Morton wrote:
> > > This is getting scary.  Manfred, Christoph, Pekka: have you guys taken a
> > > close look at what's going on in here?
> 
> On Mon, 6 Feb 2006, Christoph Lameter wrote:
> > I looked at his patch and he seems to be right. Most of the kmem_cache 
> > structure is established at slab creation. Updates are to the debug 
> > counters and to nodelists[] during node online/offline and to array[] 
> > during cpu online/offline. The chain mutex is used to protect the 
> > setting of the tuning parameters. I still need to have a look at the 
> > details though.
> 
> The patch looks correct but I am wondering if we should keep the spinlock 
> around for clarity? The chain mutex doesn't really have anything to do 
> with the tunables, it's there to protect the cache chain. I am worried 
> that this patch makes code restructuring harder. Hmm?

IMHO, if you keep something around which is not needed, it might later get
abused/misused.  And what would you add as a comment for
cachep->spinlock?

Instead, bold comments on the cachep structure stating which members are
protected by which lock/mutex should be sufficient, no?

Thanks,
Kiran

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock
  2006-02-07  7:50                   ` Ravikiran G Thirumalai
@ 2006-02-07  7:55                     ` Pekka J Enberg
  0 siblings, 0 replies; 20+ messages in thread
From: Pekka J Enberg @ 2006-02-07  7:55 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Christoph Lameter, Andrew Morton, linux-kernel, manfred, shai,
	alok.kataria, sonny

On Mon, 6 Feb 2006, Ravikiran G Thirumalai wrote:
> IMHO, if you keep something around which is not needed, it might later get
> abused/misused.  And what would you add as a comment for
> cachep->spinlock?
> 
> Instead, bold comments on the cachep structure stating which members are
> protected by which lock/mutex should be sufficient, no?

Yeah, I guess we can put the spinlock back if we ever need it.

			Pekka

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2006-02-07  7:55 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-03 20:53 [patch 0/3] NUMA slab locking fixes Ravikiran G Thirumalai
2006-02-03 20:55 ` [patch 1/3] NUMA slab locking fixes -- slab-colour-next fix Ravikiran G Thirumalai
2006-02-03 20:56 ` [patch 2/3] NUMA slab locking fixes -- slab locking irq optimizations Ravikiran G Thirumalai
2006-02-03 20:57 ` [patch 3/3] NUMA slab locking fixes -- slab cpu hotplug fix Ravikiran G Thirumalai
2006-02-03 22:07 ` [patch 0/3] NUMA slab locking fixes Andrew Morton
2006-02-03 23:06   ` Christoph Lameter
2006-02-04  1:08     ` Ravikiran G Thirumalai
2006-02-04  1:15       ` [patch 1/3] NUMA slab locking fixes -- move color_next to l3 Ravikiran G Thirumalai
2006-02-04  1:22       ` [patch 0/3] NUMA slab locking fixes Andrew Morton
2006-02-04  1:28       ` [patch 2/3] NUMA slab locking fixes - move irq disabling from cachep->spinlock to l3 lock Ravikiran G Thirumalai
2006-02-04  9:48         ` Andrew Morton
2006-02-06 22:51           ` Ravikiran G Thirumalai
2006-02-06 23:30             ` Andrew Morton
2006-02-07  0:21               ` Christoph Lameter
2006-02-07  7:36                 ` Pekka J Enberg
2006-02-07  7:50                   ` Ravikiran G Thirumalai
2006-02-07  7:55                     ` Pekka J Enberg
2006-02-04  1:29       ` [patch 3/3] NUMA slab locking fixes -- fix cpu down and up locking Ravikiran G Thirumalai
2006-02-04 10:03         ` Andrew Morton
2006-02-04 10:05           ` Andrew Morton
