* [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic
@ 2009-10-07 21:10 cl
  2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
                   ` (7 more replies)
  0 siblings, 8 replies; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today

The patchset introduces various operations that allow efficient access
to per cpu variables of the current processor. Currently there is
no way in the core to calculate the address of the current processor's
instance of a per cpu variable without a table lookup. So we see a lot of

	per_cpu_ptr(x, smp_processor_id())

The patchset introduces a way to calculate that address using the per cpu
offset that is available in arch specific ways (a register or a special
memory location):

	this_cpu_ptr(x)
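
As an illustration (the struct and variable names here are invented), the two
forms look like this for a dynamically allocated per cpu structure:

	struct counters {
		unsigned long events;
	};
	struct counters *c = alloc_percpu(struct counters);
	struct counters *p;

	/* Old: table lookup keyed by an explicitly determined processor id */
	p = per_cpu_ptr(c, smp_processor_id());

	/* New: use the arch provided offset of the executing processor
	 * (callers typically run with preemption or interrupts disabled) */
	p = this_cpu_ptr(c);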

In addition, macros are provided that can operate on per cpu
variables in a per cpu atomic way. With these, scalars in structures
allocated with the new percpu allocator can be modified without disabling
preemption or interrupts. This works by generating a single instruction that
does both the relocation of the address to the proper percpu area and
the RMW (read-modify-write) action.

For example,

	this_cpu_add(x->var, 20)

can be used to generate an instruction that uses a segment register to
relocate the per cpu address into the per cpu area of the current processor
and then increments the variable by 20. The instruction cannot be interrupted
and therefore the modification is atomic with respect to the cpu (it either
happens completely or not at all). Rescheduling or an interrupt can only
happen before or after the instruction.
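
As a sketch (the stats structure and field names are invented here), the per
cpu atomic form replaces an explicit pointer calculation bracketed by
preempt_disable()/preempt_enable():

	struct stats {
		unsigned long bytes;
	};
	struct stats *st = alloc_percpu(struct stats);

	/* Without per cpu atomic ops: pin the cpu around the RMW */
	preempt_disable();
	per_cpu_ptr(st, smp_processor_id())->bytes += 20;
	preempt_enable();

	/* With per cpu atomic ops: one uninterruptible instruction on x86 */
	this_cpu_add(st->bytes, 20);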

Per cpu atomicity does not provide protection from concurrent modifications by
other processors. In general, per cpu data is modified only from the processor
that the per cpu area is associated with, so per cpu atomicity provides a fast
and effective means of dealing with concurrency. It may allow the development
of better fastpaths for allocators and other important subsystems.

The per cpu atomic RMW operations can be used to avoid having to dimension
pointer arrays in the allocators (patches for the page allocator and SLUB are
provided) and to avoid pointer lookups in the hot paths of the allocators,
thereby decreasing the latency of critical OS paths. The macros could also be
used to revise the critical paths in the allocators so that they no longer
need to disable interrupts (not included here).

Per cpu atomic RMW operations are useful to decrease the overhead of counter
maintenance in the kernel. A this_cpu_inc(), for example, can generate a single
instruction that does not need any registers on x86. Preempt enable/disable
pairs can be avoided in many places.
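
A counter sketch (structure and field names invented here):

	struct alloc_counters {
		unsigned long nr_allocs;
	};
	struct alloc_counters *counters = alloc_percpu(struct alloc_counters);

	/* Old: pin the processor to obtain a stable cpu id */
	int cpu = get_cpu();
	per_cpu_ptr(counters, cpu)->nr_allocs++;
	put_cpu();

	/* New: a single per cpu atomic increment, no get_cpu()/put_cpu() */
	this_cpu_inc(counters->nr_allocs);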

The patchset reduces code size and increases the speed of operations on
dynamically allocated per cpu based statistics. A set of patches modifies
the fastpaths of the SLUB allocator, reducing code size and cache footprint
through the per cpu atomic operations.

---



* [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
@ 2009-10-07 21:10 ` cl
  2009-10-08 10:38   ` Tejun Heo
  2009-10-08 10:53   ` Mel Gorman
  2009-10-07 21:10 ` [this_cpu_xx V6 2/7] this_cpu ops: Remove pageset_notifier cl
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: this_cpu_page_allocator --]
[-- Type: text/plain, Size: 14169 bytes --]

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of the critical variables of
struct zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since the percpu allocator can bring up
and shut down the per cpu areas for a specific cpu as a whole. So there is no
need to allocate or free individual pagesets.
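
In short, the conversion below replaces the cpu-indexed pageset array in
struct zone with a single percpu allocation per zone:

	/* Before: NR_CPUS sized array (or pointer array on NUMA) in struct zone */
	pset = zone_pcp(zone, cpu);

	/* After: one percpu allocation per zone, same code for UP/SMP/NUMA */
	zone->pageset = alloc_percpu(struct per_cpu_pageset);
	pset = per_cpu_ptr(zone->pageset, cpu);		/* a given cpu */
	pcp = &this_cpu_ptr(zone->pageset)->pcp;	/* current cpu, irqs off */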

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 -
 include/linux/mmzone.h |   12 ---
 mm/page_alloc.c        |  187 ++++++++++++++++++-------------------------------
 mm/vmstat.c            |   14 ++-
 4 files changed, 81 insertions(+), 136 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/include/linux/mm.h	2009-10-07 14:48:09.000000000 -0500
@@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/include/linux/mmzone.h	2009-10-07 14:48:09.000000000 -0500
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-07 14:48:09.000000000 -0500
@@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		local_irq_save(flags);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
-		local_irq_save(flags);
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
@@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
 
 out:
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
-		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		list = &pcp->lists[migratetype];
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
@@ -1234,7 +1231,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1243,7 +1239,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -2172,7 +2167,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2735,10 +2730,29 @@ static void build_zonelist_cache(pg_data
 
 #endif	/* CONFIG_NUMA */
 
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+
 /* return values int ....just for stop_machine() */
 static int __build_all_zonelists(void *dummy)
 {
 	int nid;
+	int cpu;
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -2749,6 +2763,14 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 	}
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors.
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	return 0;
 }
 
@@ -3087,120 +3109,60 @@ static void setup_pagelist_highmark(stru
 }
 
 
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
-
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
 	}
-	return ret;
+	return NOTIFY_OK;
 }
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
 
+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processor
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
+
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
 
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }
 
-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3254,7 +3216,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
 		local_irq_save(flags);
@@ -3272,21 +3234,13 @@ void zone_pcp_update(struct zone *zone)
 
 static __meminit void zone_pcp_init(struct zone *zone)
 {
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
+	/* Use boot pagesets until we have the per cpu allocator up */
+	zone->pageset = &per_cpu_var(boot_pageset);
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
 	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+			zone->name, zone->present_pages,
+					 zone_batchsize(zone));
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -4800,10 +4754,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-07 14:34:25.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-07 14:48:09.000000000 -0500
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
 
-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

-- 


* [this_cpu_xx V6 2/7] this_cpu ops: Remove pageset_notifier
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
@ 2009-10-07 21:10 ` cl
  2009-10-07 21:10 ` [this_cpu_xx V6 3/7] Use this_cpu operations in slub cl
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: this_cpu_remove_pageset_notifier --]
[-- Type: text/plain, Size: 2015 bytes --]

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   28 ----------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 28 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/vmstat.c	2009-10-06 18:19:20.000000000 -0500
@@ -906,6 +906,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-10-06 18:19:17.000000000 -0500
+++ linux-2.6/mm/page_alloc.c	2009-10-06 18:19:21.000000000 -0500
@@ -3108,27 +3108,6 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3154,13 +3133,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 
 static noinline __init_refok

-- 


* [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
  2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
  2009-10-07 21:10 ` [this_cpu_xx V6 2/7] this_cpu ops: Remove pageset_notifier cl
@ 2009-10-07 21:10 ` cl
  2009-10-12 10:19   ` Tejun Heo
  2009-10-07 21:10 ` [this_cpu_xx V6 4/7] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: this_cpu_slub_conversion --]
[-- Type: text/plain, Size: 12664 bytes --]

Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These arrays could get quite big if we have to support
systems with thousands of cpus. The use of this_cpu_xx operations results in
the following (a before/after access sketch follows the list):

1. The size of kmem_cache for SMP configurations shrinks since we only
   need one pointer instead of NR_CPUS pointers. The same pointer can be
   used by all processors, which reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual number of nodes
   in the system, meaning less memory overhead for configurations that may
   potentially support up to 1k NUMA nodes / 4k cpus.

3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
   structures when bringing up and shutting down cpus. The percpu allocator
   logic does it all for us. This removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided.
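
A rough before/after of how the per cpu structure is reached (taken from the
hunks below):

	/* Before: one kmem_cache_cpu pointer per possible cpu in kmem_cache */
	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());

	/* After: a single percpu allocation shared by all processors */
	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);	/* this cpu */
	struct kmem_cache_cpu *r = per_cpu_ptr(s->cpu_slab, cpu);	/* a given cpu */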

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  207 ++++++++++-------------------------------------
 2 files changed, 49 insertions(+), 164 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-17 17:51:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 09:02:05.000000000 -0500
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-28 10:08:10.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 09:02:05.000000000 -0500
@@ -242,15 +242,6 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1124,7 +1115,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1422,7 +1413,7 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1454,7 +1445,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1673,7 +1661,7 @@ new_slab:
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __this_cpu_ptr(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1711,7 +1699,6 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1722,24 +1709,23 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __this_cpu_ptr(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
 	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, objsize, 1, s->flags, gfpflags);
+	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1800,7 +1786,7 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1872,7 +1858,7 @@ static __always_inline void slab_free(st
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, c->objsize);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
@@ -2095,130 +2081,28 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu [NR_KMEM_CACHE_CPU],
-		      kmem_cache_cpu);
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[SLUB_PAGE_SHIFT]);
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
-}
-
-static void __init init_alloc_cpu(void)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
+	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+		/*
+		 * Boot time creation of the kmalloc array. Use static per cpu data
+		 * since the per cpu allocator is not available yet.
+		 */
+		s->cpu_slab = per_cpu_var(kmalloc_percpu) + (s - kmalloc_caches);
+	else
+		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2609,9 +2493,8 @@ static inline int kmem_cache_close(struc
 	int node;
 
 	flush_all(s);
-
+	free_percpu(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2760,7 +2643,19 @@ static noinline struct kmem_cache *dma_k
 	realsize = kmalloc_caches[index].objsize;
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
-	s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+
+	if (flags & __GFP_WAIT)
+		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	else {
+		int i;
+
+		s = NULL;
+		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
+			if (kmalloc_caches[i].size) {
+				s = kmalloc_caches + i;
+				break;
+			}
+	}
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -3176,8 +3071,6 @@ void __init kmem_cache_init(void)
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3261,8 +3154,10 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3365,7 +3260,7 @@ struct kmem_cache *kmem_cache_create(con
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3422,11 +3317,9 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3436,13 +3329,9 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3928,7 +3817,7 @@ static ssize_t show_slab_objects(struct 
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4353,7 +4242,7 @@ static int show_stat(struct kmem_cache *
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 


* [this_cpu_xx V6 4/7] SLUB: Get rid of dynamic DMA kmalloc cache allocation
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (2 preceding siblings ...)
  2009-10-07 21:10 ` [this_cpu_xx V6 3/7] Use this_cpu operations in slub cl
@ 2009-10-07 21:10 ` cl
  2009-10-13 18:48   ` [FIX] patch "SLUB: Get rid of dynamic DMA kmalloc cache allocation" Christoph Lameter
  2009-10-07 21:10 ` [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields cl
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: this_cpu_slub_static_dma_kmalloc --]
[-- Type: text/plain, Size: 3687 bytes --]

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmalloc cache structures instead.
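
The reservation amounts to enlarging the static kmalloc cache array when
CONFIG_ZONE_DMA is set, so DMA caches can be carved out of static slots at
runtime instead of being allocated dynamically (taken from the hunks below):

	#ifdef CONFIG_ZONE_DMA
	/* Reserve extra caches for potential DMA use */
	#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
	#else
	#define KMALLOC_CACHES SLUB_PAGE_SHIFT
	#endif

	struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;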

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-09-29 11:43:18.000000000 -0500
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:42:06.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:43:18.000000000 -0500
@@ -2090,7 +2090,7 @@ static inline int alloc_kmem_cache_cpus(
 {
 	int cpu;
 
-	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2537,7 +2537,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2627,6 +2627,7 @@ static noinline struct kmem_cache *dma_k
 	char *text;
 	size_t realsize;
 	unsigned long slabflags;
+	int i;
 
 	s = kmalloc_caches_dma[index];
 	if (s)
@@ -2647,18 +2648,13 @@ static noinline struct kmem_cache *dma_k
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	if (flags & __GFP_WAIT)
-		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
-	else {
-		int i;
+	s = NULL;
+	for (i = 0; i < KMALLOC_CACHES; i++)
+		if (kmalloc_caches[i].size)
+			break;
 
-		s = NULL;
-		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
-			if (kmalloc_caches[i].size) {
-				s = kmalloc_caches + i;
-				break;
-			}
-	}
+	BUG_ON(i >= KMALLOC_CACHES);
+	s = kmalloc_caches + i;
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -2672,7 +2668,7 @@ static noinline struct kmem_cache *dma_k
 
 	if (!s || !text || !kmem_cache_open(s, flags, text,
 			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		kfree(s);
+		s->size = 0;
 		kfree(text);
 		goto unlock_out;
 	}

-- 


* [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (3 preceding siblings ...)
  2009-10-07 21:10 ` [this_cpu_xx V6 4/7] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
@ 2009-10-07 21:10 ` cl
  2009-10-07 23:10   ` Christoph Lameter
  2009-10-07 21:10 ` [this_cpu_xx V6 6/7] Make slub statistics use this_cpu_inc cl
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: this_cpu_slub_remove_fields --]
[-- Type: text/plain, Size: 7691 bytes --]

Remove the fields in struct kmem_cache_cpu that were used to cache data from
struct kmem_cache when the two were in different cachelines. The cacheline
that holds the per cpu pointer now also holds these values. This cuts the
size of struct kmem_cache_cpu almost in half.

The get_freepointer() and set_freepointer() functions that used to be
intended only for the slow path are now also useful for the hot path, since
access to the size field no longer requires touching an additional cacheline.
This results in consistent use of these functions for setting the freepointer
of objects throughout SLUB.

Also, all possible kmem_cache_cpu structures are initialized when a slab cache
is created. There is no need to initialize them when a processor or node comes
online.
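
The hot path change, in short (taken from the hunks below):

	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		return *(void **)(object + s->offset);
	}

	/* Before: the free pointer offset was cached in kmem_cache_cpu */
	c->freelist = object[c->offset];

	/* After: read the offset from kmem_cache, which now shares a
	 * cacheline with the cpu_slab pointer */
	c->freelist = get_freepointer(s, object);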

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   76 +++++++++++------------------------------------
 2 files changed, 19 insertions(+), 59 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-10-07 14:52:05.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2009-10-07 14:53:40.000000000 -0500
@@ -38,8 +38,6 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-07 14:52:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-07 15:02:45.000000000 -0500
@@ -260,13 +260,6 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1473,10 +1466,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1635,7 +1628,7 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1681,7 +1674,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1719,7 +1712,7 @@ static __always_inline void *slab_alloc(
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1727,8 +1720,8 @@ static __always_inline void *slab_alloc(
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, c->objsize, 1, s->flags, gfpflags);
+	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1783,7 +1776,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr, unsigned int offset)
+			void *x, unsigned long addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1797,7 +1790,8 @@ static void __slab_free(struct kmem_cach
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1862,16 +1856,16 @@ static __always_inline void slab_free(st
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, c->objsize);
-	debug_check_no_locks_freed(object, c->objsize);
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, c->objsize);
+		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -2058,19 +2052,6 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -2088,8 +2069,6 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	int cpu;
-
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2102,8 +2081,6 @@ static inline int alloc_kmem_cache_cpus(
 	if (!s->cpu_slab)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
 
@@ -2387,8 +2364,11 @@ static int kmem_cache_open(struct kmem_c
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	if (!alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+
+	if (s->cpu_slab)
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -3245,22 +3225,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3314,14 +3284,6 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 


* [this_cpu_xx V6 6/7] Make slub statistics use this_cpu_inc
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (4 preceding siblings ...)
  2009-10-07 21:10 ` [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields cl
@ 2009-10-07 21:10 ` cl
  2009-10-07 21:10 ` [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
  2009-10-13 15:40 ` [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic Mel Gorman
  7 siblings, 0 replies; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: this_cpu_slub_cleanup_stat --]
[-- Type: text/plain, Size: 5136 bytes --]

this_cpu_inc() translates into a single instruction on x86 and does not
need any register. So use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer. So pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.
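
The resulting helper (taken from the hunk below); the comment on the generated
code illustrates the single-instruction claim above:

	static inline void stat(struct kmem_cache *s, enum stat_item si)
	{
	#ifdef CONFIG_SLUB_STATS
		/* On x86 this typically becomes a single segment prefixed
		 * increment, e.g. incl %gs:(%reg), with no temporary register
		 * and no preempt or irq toggling. */
		__this_cpu_inc(s->cpu_slab->stat[si]);
	#endif
	}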

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, DEACTIVATE_EMPTY);
 		if (n->nr_partial < s->min_partial) {
 			/*
 			 * Adding an empty slab to the partial slabs in order
@@ -1438,7 +1437,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
+			stat(s, FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1453,7 +1452,7 @@ static void deactivate_slab(struct kmem_
 	int tail = 1;
 
 	if (page->freelist)
-		stat(c, DEACTIVATE_REMOTE_FREES);
+		stat(s, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
@@ -1479,7 +1478,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(c, CPUSLAB_FLUSH);
+	stat(s, CPUSLAB_FLUSH);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1619,7 +1618,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 
-	stat(c, ALLOC_REFILL);
+	stat(s, ALLOC_REFILL);
 
 load_freelist:
 	object = c->page->freelist;
@@ -1634,7 +1633,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-	stat(c, ALLOC_SLOWPATH);
+	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
@@ -1644,7 +1643,7 @@ new_slab:
 	new = get_partial(s, gfpflags, node);
 	if (new) {
 		c->page = new;
-		stat(c, ALLOC_FROM_PARTIAL);
+		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
 
@@ -1658,7 +1657,7 @@ new_slab:
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
-		stat(c, ALLOC_SLAB);
+		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
 		slab_lock(new);
@@ -1713,7 +1712,7 @@ static __always_inline void *slab_alloc(
 
 	else {
 		c->freelist = get_freepointer(s, object);
-		stat(c, ALLOC_FASTPATH);
+		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
@@ -1780,10 +1779,8 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 
-	c = __this_cpu_ptr(s->cpu_slab);
-	stat(c, FREE_SLOWPATH);
+	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
@@ -1796,7 +1793,7 @@ checks_ok:
 	page->inuse--;
 
 	if (unlikely(PageSlubFrozen(page))) {
-		stat(c, FREE_FROZEN);
+		stat(s, FREE_FROZEN);
 		goto out_unlock;
 	}
 
@@ -1809,7 +1806,7 @@ checks_ok:
 	 */
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(c, FREE_ADD_PARTIAL);
+		stat(s, FREE_ADD_PARTIAL);
 	}
 
 out_unlock:
@@ -1822,10 +1819,10 @@ slab_empty:
 		 * Slab still on the partial list.
 		 */
 		remove_partial(s, page);
-		stat(c, FREE_REMOVE_PARTIAL);
+		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-	stat(c, FREE_SLAB);
+	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
 
@@ -1863,7 +1860,7 @@ static __always_inline void slab_free(st
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
-		stat(c, FREE_FASTPATH);
+		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 

-- 


* [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (5 preceding siblings ...)
  2009-10-07 21:10 ` [this_cpu_xx V6 6/7] Make slub statistics use this_cpu_inc cl
@ 2009-10-07 21:10 ` cl
  2009-10-12 10:40   ` Tejun Heo
  2009-10-13 15:40 ` [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic Mel Gorman
  7 siblings, 1 reply; 56+ messages in thread
From: cl @ 2009-10-07 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Mel Gorman

[-- Attachment #1: this_cpu_slub_aggressive_cpu_ops --]
[-- Type: text/plain, Size: 5880 bytes --]

Use this_cpu_* operations in the hotpath to avoid calculations of
kmem_cache_cpu pointer addresses.

On x86 there is a trade off: multiple uses of segment prefixes versus a single
address calculation with more register pressure. Code size is also reduced,
which is an advantage for the icache.

The use of prefixes is necessary if we want to use a scheme for
fastpaths that does not require disabling interrupts.
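
After the conversion the allocation fastpath reads and updates the per cpu
freelist through segment prefixed operations without ever forming a
kmem_cache_cpu pointer (excerpt from the hunks below):

	object = __this_cpu_read(s->cpu_slab->freelist);
	if (unlikely(!object || !node_match(s, node)))
		object = __slab_alloc(s, gfpflags, node, addr);
	else {
		__this_cpu_write(s->cpu_slab->freelist,
			get_freepointer(s, object));
		stat(s, ALLOC_FASTPATH);
	}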

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   80 ++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 39 insertions(+), 41 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-07 14:52:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-07 14:58:50.000000000 -0500
@@ -1512,10 +1512,10 @@ static void flush_all(struct kmem_cache 
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int node_match(struct kmem_cache *s, int node)
 {
 #ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
+	if (node != -1 && __this_cpu_read(s->cpu_slab->node) != node)
 		return 0;
 #endif
 	return 1;
@@ -1603,46 +1603,46 @@ slab_out_of_memory(struct kmem_cache *s,
  * a call to the page allocator and the setup of a new slab.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+			  unsigned long addr)
 {
 	void **object;
-	struct page *new;
+	struct page *page = __this_cpu_read(s->cpu_slab->page);
 
 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
 
-	if (!c->page)
+	if (!page)
 		goto new_slab;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	slab_lock(page);
+	if (unlikely(!node_match(s, node)))
 		goto another_slab;
 
 	stat(s, ALLOC_REFILL);
 
 load_freelist:
-	object = c->page->freelist;
+	object = page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
 		goto debug;
 
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
+	__this_cpu_write(s->cpu_slab->node, page_to_nid(page));
+	__this_cpu_write(s->cpu_slab->freelist, get_freepointer(s, object));
+	page->inuse = page->objects;
+	page->freelist = NULL;
 unlock_out:
-	slab_unlock(c->page);
+	slab_unlock(page);
 	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
-	deactivate_slab(s, c);
+	deactivate_slab(s, __this_cpu_ptr(s->cpu_slab));
 
 new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
+	page = get_partial(s, gfpflags, node);
+	if (page) {
+		__this_cpu_write(s->cpu_slab->page, page);
 		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
@@ -1650,31 +1650,30 @@ new_slab:
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
-	new = new_slab(s, gfpflags, node);
+	page = new_slab(s, gfpflags, node);
 
 	if (gfpflags & __GFP_WAIT)
 		local_irq_disable();
 
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
+	if (page) {
 		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
+		if (__this_cpu_read(s->cpu_slab->page))
+			flush_slab(s, __this_cpu_ptr(s->cpu_slab));
+		slab_lock(page);
+		__SetPageSlubFrozen(page);
+		__this_cpu_write(s->cpu_slab->page, page);
 		goto load_freelist;
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
 	return NULL;
 debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
+	if (!alloc_debug_processing(s, page, object, addr))
 		goto another_slab;
 
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
+	page->inuse++;
+	page->freelist = get_freepointer(s, object);
+	__this_cpu_write(s->cpu_slab->node, -1);
 	goto unlock_out;
 }
 
@@ -1692,7 +1691,6 @@ static __always_inline void *slab_alloc(
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	gfpflags &= gfp_allowed_mask;
@@ -1704,14 +1702,14 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	object = __this_cpu_read(s->cpu_slab->freelist);
+	if (unlikely(!object || !node_match(s, node)))
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		object = __slab_alloc(s, gfpflags, node, addr);
 
 	else {
-		c->freelist = get_freepointer(s, object);
+		__this_cpu_write(s->cpu_slab->freelist,
+			get_freepointer(s, object));
 		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1847,19 +1845,19 @@ static __always_inline void slab_free(st
 			struct page *page, void *x, unsigned long addr)
 {
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
+
+	if (likely(page == __this_cpu_read(s->cpu_slab->page) &&
+			__this_cpu_read(s->cpu_slab->node) >= 0)) {
+		set_freepointer(s, object, __this_cpu_read(s->cpu_slab->freelist));
+		__this_cpu_write(s->cpu_slab->freelist, object);
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);

-- 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields
  2009-10-07 21:10 ` [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields cl
@ 2009-10-07 23:10   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-07 23:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Wed, 7 Oct 2009, cl@linux-foundation.org wrote:

> @@ -2387,8 +2364,11 @@ static int kmem_cache_open(struct kmem_c
>  	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
>  		goto error;
>
> -	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
> +	if (!alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
> +
> +	if (s->cpu_slab)
>  		return 1;
> +
>  	free_kmem_cache_nodes(s);

Argh. I goofed while fixing a diff problem shortly before release.

The following patch fixes the patch:

---
 mm/slub.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-07 18:00:06.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-07 18:03:05.000000000 -0500
@@ -2364,9 +2364,7 @@ static int kmem_cache_open(struct kmem_c
 	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
 		goto error;

-	if (!alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
-
-	if (s->cpu_slab)
+	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;

 	free_kmem_cache_nodes(s);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion
  2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
@ 2009-10-08 10:38   ` Tejun Heo
  2009-10-08 10:40     ` Tejun Heo
  2009-10-08 16:15     ` Christoph Lameter
  2009-10-08 10:53   ` Mel Gorman
  1 sibling, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2009-10-08 10:38 UTC (permalink / raw)
  To: cl; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Hello, Christoph.

cl@linux-foundation.org wrote:
> +/*
> + * Boot pageset table. One per cpu which is going to be used for all
> + * zones and all nodes. The parameters will be set in such a way
> + * that an item put on a list will immediately be handed over to
> + * the buddy list. This is safe since pageset manipulation is done
> + * with interrupts disabled.
> + *
> + * The boot_pagesets must be kept even after bootup is complete for
> + * unused processors and/or zones. They do play a role for bootstrapping
> + * hotplugged processors.
> + *
> + * zoneinfo_show() and maybe other functions do
> + * not check if the processor is online before following the pageset pointer.
> + * Other parts of the kernel may not check if the zone is available.
> + */
> +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

This looks much better but I'm not sure whether it's safe.  percpu
offsets have not been set up before setup_per_cpu_areas() is complete
on most archs.  But if all that's necessary is getting the page
allocator up and running as soon as static per cpu areas and offsets
are set up (which basically means as soon as cpu init is complete on
ia64 and setup_per_cpu_areas() is complete on all other archs), this
should be correct.  Is this what you're expecting?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion
  2009-10-08 10:38   ` Tejun Heo
@ 2009-10-08 10:40     ` Tejun Heo
  2009-10-08 16:15     ` Christoph Lameter
  1 sibling, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2009-10-08 10:40 UTC (permalink / raw)
  To: cl; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Tejun Heo wrote:
> This looks much better but I'm not sure whether it's safe.  percpu
> offsets have not been set up before setup_per_cpu_areas() is complete
> on most archs.  But if all that's necessary is getting the page
> allocator up and running as soon as static per cpu areas and offsets
> are set up (which basically means as soon as cpu init is complete on
> ia64 and setup_per_cpu_areas() is complete on all other archs), this
> should be correct.  Is this what you're expecting?

Also, as I'm not very familiar with the code, I'd really appreciate
Mel Gorman's acked or reviewed-by.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion
  2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
  2009-10-08 10:38   ` Tejun Heo
@ 2009-10-08 10:53   ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2009-10-08 10:53 UTC (permalink / raw)
  To: cl; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Wed, Oct 07, 2009 at 05:10:25PM -0400, cl@linux-foundation.org wrote:
> Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> 
> This drastically reduces the size of struct zone for systems with large
> amounts of processors and allows placement of critical variables of struct
> zone in one cacheline even on very large systems.
> 
> Another effect is that the pagesets of one processor are placed near one
> another. If multiple pagesets from different zones fit into one cacheline
> then additional cacheline fetches can be avoided on the hot paths when
> allocating memory from multiple zones.
> 
> Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
> are reduced and we can drop the zone_pcp macro.
> 
> Hotplug handling is also simplified since cpu alloc can bring up and
> shut down cpu areas for a specific cpu as a whole. So there is no need to
> allocate or free individual pagesets.
> 
> V4-V5:
> - Fix up cases where per_cpu_ptr is called before irq disable
> - Integrate the bootstrap logic that was separate before.
> 
> Cc: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 

I haven't tested the patch series but it now looks good to my eyes at
least. Thanks

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
>  include/linux/mm.h     |    4 -
>  include/linux/mmzone.h |   12 ---
>  mm/page_alloc.c        |  187 ++++++++++++++++++-------------------------------
>  mm/vmstat.c            |   14 ++-
>  4 files changed, 81 insertions(+), 136 deletions(-)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/include/linux/mm.h	2009-10-07 14:48:09.000000000 -0500
> @@ -1061,11 +1061,7 @@ extern void si_meminfo(struct sysinfo * 
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  extern int after_bootmem;
>  
> -#ifdef CONFIG_NUMA
>  extern void setup_per_cpu_pageset(void);
> -#else
> -static inline void setup_per_cpu_pageset(void) {}
> -#endif
>  
>  extern void zone_pcp_update(struct zone *zone);
>  
> Index: linux-2.6/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mmzone.h	2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/include/linux/mmzone.h	2009-10-07 14:48:09.000000000 -0500
> @@ -184,13 +184,7 @@ struct per_cpu_pageset {
>  	s8 stat_threshold;
>  	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
>  #endif
> -} ____cacheline_aligned_in_smp;
> -
> -#ifdef CONFIG_NUMA
> -#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
> -#else
> -#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
> -#endif
> +};
>  
>  #endif /* !__GENERATING_BOUNDS.H */
>  
> @@ -306,10 +300,8 @@ struct zone {
>  	 */
>  	unsigned long		min_unmapped_pages;
>  	unsigned long		min_slab_pages;
> -	struct per_cpu_pageset	*pageset[NR_CPUS];
> -#else
> -	struct per_cpu_pageset	pageset[NR_CPUS];
>  #endif
> +	struct per_cpu_pageset	*pageset;
>  	/*
>  	 * free areas of different sizes
>  	 */
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c	2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c	2009-10-07 14:48:09.000000000 -0500
> @@ -1011,10 +1011,10 @@ static void drain_pages(unsigned int cpu
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
>  
> -		pset = zone_pcp(zone, cpu);
> +		local_irq_save(flags);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  
>  		pcp = &pset->pcp;
> -		local_irq_save(flags);
>  		free_pcppages_bulk(zone, pcp->count, pcp);
>  		pcp->count = 0;
>  		local_irq_restore(flags);
> @@ -1098,7 +1098,6 @@ static void free_hot_cold_page(struct pa
>  	arch_free_page(page, 0);
>  	kernel_map_pages(page, 1, 0);
>  
> -	pcp = &zone_pcp(zone, get_cpu())->pcp;
>  	migratetype = get_pageblock_migratetype(page);
>  	set_page_private(page, migratetype);
>  	local_irq_save(flags);
> @@ -1121,6 +1120,7 @@ static void free_hot_cold_page(struct pa
>  		migratetype = MIGRATE_MOVABLE;
>  	}
>  
> +	pcp = &this_cpu_ptr(zone->pageset)->pcp;
>  	if (cold)
>  		list_add_tail(&page->lru, &pcp->lists[migratetype]);
>  	else
> @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
>  
>  out:
>  	local_irq_restore(flags);
> -	put_cpu();
>  }
>  
>  void free_hot_page(struct page *page)
> @@ -1183,17 +1182,15 @@ struct page *buffered_rmqueue(struct zon
>  	unsigned long flags;
>  	struct page *page;
>  	int cold = !!(gfp_flags & __GFP_COLD);
> -	int cpu;
>  
>  again:
> -	cpu  = get_cpu();
>  	if (likely(order == 0)) {
>  		struct per_cpu_pages *pcp;
>  		struct list_head *list;
>  
> -		pcp = &zone_pcp(zone, cpu)->pcp;
> -		list = &pcp->lists[migratetype];
>  		local_irq_save(flags);
> +		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> +		list = &pcp->lists[migratetype];
>  		if (list_empty(list)) {
>  			pcp->count += rmqueue_bulk(zone, 0,
>  					pcp->batch, list,
> @@ -1234,7 +1231,6 @@ again:
>  	__count_zone_vm_events(PGALLOC, zone, 1 << order);
>  	zone_statistics(preferred_zone, zone);
>  	local_irq_restore(flags);
> -	put_cpu();
>  
>  	VM_BUG_ON(bad_range(zone, page));
>  	if (prep_new_page(page, order, gfp_flags))
> @@ -1243,7 +1239,6 @@ again:
>  
>  failed:
>  	local_irq_restore(flags);
> -	put_cpu();
>  	return NULL;
>  }
>  
> @@ -2172,7 +2167,7 @@ void show_free_areas(void)
>  		for_each_online_cpu(cpu) {
>  			struct per_cpu_pageset *pageset;
>  
> -			pageset = zone_pcp(zone, cpu);
> +			pageset = per_cpu_ptr(zone->pageset, cpu);
>  
>  			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
>  			       cpu, pageset->pcp.high,
> @@ -2735,10 +2730,29 @@ static void build_zonelist_cache(pg_data
>  
>  #endif	/* CONFIG_NUMA */
>  
> +/*
> + * Boot pageset table. One per cpu which is going to be used for all
> + * zones and all nodes. The parameters will be set in such a way
> + * that an item put on a list will immediately be handed over to
> + * the buddy list. This is safe since pageset manipulation is done
> + * with interrupts disabled.
> + *
> + * The boot_pagesets must be kept even after bootup is complete for
> + * unused processors and/or zones. They do play a role for bootstrapping
> + * hotplugged processors.
> + *
> + * zoneinfo_show() and maybe other functions do
> + * not check if the processor is online before following the pageset pointer.
> + * Other parts of the kernel may not check if the zone is available.
> + */
> +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
> +
>  /* return values int ....just for stop_machine() */
>  static int __build_all_zonelists(void *dummy)
>  {
>  	int nid;
> +	int cpu;
>  
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -2749,6 +2763,14 @@ static int __build_all_zonelists(void *d
>  		build_zonelists(pgdat);
>  		build_zonelist_cache(pgdat);
>  	}
> +
> +	/*
> +	 * Initialize the boot_pagesets that are going to be used
> +	 * for bootstrapping processors.
> +	 */
> +	for_each_possible_cpu(cpu)
> +		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +
>  	return 0;
>  }
>  
> @@ -3087,120 +3109,60 @@ static void setup_pagelist_highmark(stru
>  }
>  
>  
> -#ifdef CONFIG_NUMA
> -/*
> - * Boot pageset table. One per cpu which is going to be used for all
> - * zones and all nodes. The parameters will be set in such a way
> - * that an item put on a list will immediately be handed over to
> - * the buddy list. This is safe since pageset manipulation is done
> - * with interrupts disabled.
> - *
> - * Some NUMA counter updates may also be caught by the boot pagesets.
> - *
> - * The boot_pagesets must be kept even after bootup is complete for
> - * unused processors and/or zones. They do play a role for bootstrapping
> - * hotplugged processors.
> - *
> - * zoneinfo_show() and maybe other functions do
> - * not check if the processor is online before following the pageset pointer.
> - * Other parts of the kernel may not check if the zone is available.
> - */
> -static struct per_cpu_pageset boot_pageset[NR_CPUS];
> -
> -/*
> - * Dynamically allocate memory for the
> - * per cpu pageset array in struct zone.
> - */
> -static int __cpuinit process_zones(int cpu)
> -{
> -	struct zone *zone, *dzone;
> -	int node = cpu_to_node(cpu);
> -
> -	node_set_state(node, N_CPU);	/* this node has a cpu */
> -
> -	for_each_populated_zone(zone) {
> -		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
> -					 GFP_KERNEL, node);
> -		if (!zone_pcp(zone, cpu))
> -			goto bad;
> -
> -		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
> -
> -		if (percpu_pagelist_fraction)
> -			setup_pagelist_highmark(zone_pcp(zone, cpu),
> -			 	(zone->present_pages / percpu_pagelist_fraction));
> -	}
> -
> -	return 0;
> -bad:
> -	for_each_zone(dzone) {
> -		if (!populated_zone(dzone))
> -			continue;
> -		if (dzone == zone)
> -			break;
> -		kfree(zone_pcp(dzone, cpu));
> -		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
> -	}
> -	return -ENOMEM;
> -}
> -
> -static inline void free_zone_pagesets(int cpu)
> -{
> -	struct zone *zone;
> -
> -	for_each_zone(zone) {
> -		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
> -
> -		/* Free per_cpu_pageset if it is slab allocated */
> -		if (pset != &boot_pageset[cpu])
> -			kfree(pset);
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -	}
> -}
> -
>  static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
>  		unsigned long action,
>  		void *hcpu)
>  {
>  	int cpu = (long)hcpu;
> -	int ret = NOTIFY_OK;
>  
>  	switch (action) {
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> -		if (process_zones(cpu))
> -			ret = NOTIFY_BAD;
> -		break;
> -	case CPU_UP_CANCELED:
> -	case CPU_UP_CANCELED_FROZEN:
> -	case CPU_DEAD:
> -	case CPU_DEAD_FROZEN:
> -		free_zone_pagesets(cpu);
> +		node_set_state(cpu_to_node(cpu), N_CPU);
>  		break;
>  	default:
>  		break;
>  	}
> -	return ret;
> +	return NOTIFY_OK;
>  }
>  
>  static struct notifier_block __cpuinitdata pageset_notifier =
>  	{ &pageset_cpuup_callback, NULL, 0 };
>  
> +/*
> + * Allocate per cpu pagesets and initialize them.
> + * Before this call only boot pagesets were available.
> + * Boot pagesets will no longer be used by this processor
> + * after setup_per_cpu_pageset().
> + */
>  void __init setup_per_cpu_pageset(void)
>  {
> -	int err;
> +	struct zone *zone;
> +	int cpu;
> +
> +	for_each_populated_zone(zone) {
> +		zone->pageset = alloc_percpu(struct per_cpu_pageset);
> +
> +		for_each_possible_cpu(cpu) {
> +			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> +
> +			setup_pageset(pcp, zone_batchsize(zone));
> +
> +			if (percpu_pagelist_fraction)
> +				setup_pagelist_highmark(pcp,
> +					(zone->present_pages /
> +						percpu_pagelist_fraction));
> +		}
> +	}
>  
> -	/* Initialize per_cpu_pageset for cpu 0.
> -	 * A cpuup callback will do this for every cpu
> -	 * as it comes online
> +	/*
> +	 * The boot cpu is always the first active.
> +	 * The boot node has a processor
>  	 */
> -	err = process_zones(smp_processor_id());
> -	BUG_ON(err);
> +	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
>  	register_cpu_notifier(&pageset_notifier);
>  }
>  
> -#endif
> -
>  static noinline __init_refok
>  int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
>  {
> @@ -3254,7 +3216,7 @@ static int __zone_pcp_update(void *data)
>  		struct per_cpu_pageset *pset;
>  		struct per_cpu_pages *pcp;
>  
> -		pset = zone_pcp(zone, cpu);
> +		pset = per_cpu_ptr(zone->pageset, cpu);
>  		pcp = &pset->pcp;
>  
>  		local_irq_save(flags);
> @@ -3272,21 +3234,13 @@ void zone_pcp_update(struct zone *zone)
>  
>  static __meminit void zone_pcp_init(struct zone *zone)
>  {
> -	int cpu;
> -	unsigned long batch = zone_batchsize(zone);
> +	/* Use boot pagesets until we have the per cpu allocator up */
> +	zone->pageset = &per_cpu_var(boot_pageset);
>  
> -	for (cpu = 0; cpu < NR_CPUS; cpu++) {
> -#ifdef CONFIG_NUMA
> -		/* Early boot. Slab allocator not functional yet */
> -		zone_pcp(zone, cpu) = &boot_pageset[cpu];
> -		setup_pageset(&boot_pageset[cpu],0);
> -#else
> -		setup_pageset(zone_pcp(zone,cpu), batch);
> -#endif
> -	}
>  	if (zone->present_pages)
> -		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
> -			zone->name, zone->present_pages, batch);
> +		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
> +			zone->name, zone->present_pages,
> +					 zone_batchsize(zone));
>  }
>  
>  __meminit int init_currently_empty_zone(struct zone *zone,
> @@ -4800,10 +4754,11 @@ int percpu_pagelist_fraction_sysctl_hand
>  	if (!write || (ret == -EINVAL))
>  		return ret;
>  	for_each_populated_zone(zone) {
> -		for_each_online_cpu(cpu) {
> +		for_each_possible_cpu(cpu) {
>  			unsigned long  high;
>  			high = zone->present_pages / percpu_pagelist_fraction;
> -			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
> +			setup_pagelist_highmark(
> +				per_cpu_ptr(zone->pageset, cpu), high);
>  		}
>  	}
>  	return 0;
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2009-10-07 14:34:25.000000000 -0500
> +++ linux-2.6/mm/vmstat.c	2009-10-07 14:48:09.000000000 -0500
> @@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
>  		threshold = calculate_threshold(zone);
>  
>  		for_each_online_cpu(cpu)
> -			zone_pcp(zone, cpu)->stat_threshold = threshold;
> +			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> +							= threshold;
>  	}
>  }
>  
> @@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
>  void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
>  				int delta)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
> +
>  	s8 *p = pcp->vm_stat_diff + item;
>  	long x;
>  
> @@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
>   */
>  void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>  
>  	(*p)++;
> @@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
>  
>  void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
>  {
> -	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> +	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
>  	s8 *p = pcp->vm_stat_diff + item;
>  
>  	(*p)--;
> @@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_pageset *p;
>  
> -		p = zone_pcp(zone, cpu);
> +		p = per_cpu_ptr(zone->pageset, cpu);
>  
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>  			if (p->vm_stat_diff[i]) {
> @@ -738,7 +740,7 @@ static void zoneinfo_show_print(struct s
>  	for_each_online_cpu(i) {
>  		struct per_cpu_pageset *pageset;
>  
> -		pageset = zone_pcp(zone, i);
> +		pageset = per_cpu_ptr(zone->pageset, i);
>  		seq_printf(m,
>  			   "\n    cpu: %i"
>  			   "\n              count: %i"
> 
> -- 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion
  2009-10-08 10:38   ` Tejun Heo
  2009-10-08 10:40     ` Tejun Heo
@ 2009-10-08 16:15     ` Christoph Lameter
  1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-08 16:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On Thu, 8 Oct 2009, Tejun Heo wrote:

> > +static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> > +static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>
> This looks much better but I'm not sure whether it's safe.  percpu
> offsets have not been set up before setup_per_cpu_areas() is complete
> on most archs.  But if all that's necessary is getting the page
> allocator up and running as soon as static per cpu areas and offsets
> are set up (which basically means as soon as cpu init is complete on
> ia64 and setup_per_cpu_areas() is complete on all other archs), this
> should be correct.  Is this what you're expecting?

paging_init() is called after the per cpu areas have been initialized. So
I thought this would be safe. Tested it on x86.

zone_pcp_init() only sets up the per cpu pointers to the pagesets. That
works regardless of the boot stage. Then build_all_zonelists()
initializes the actual contents of the per cpu variables.

Finally the per cpu pagesets are allocated from the percpu allocator when
all allocators are up and the pagesets are sized.
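
Condensed, for reference, here are the three stages as they appear in the
hunks of patch 1/7 (editorial sketch stitched together from different
functions, not a patch by itself):

	/* 1. zone_pcp_init(): only store a pointer, works at any boot stage */
	zone->pageset = &per_cpu_var(boot_pageset);

	/* 2. __build_all_zonelists(): give each boot pageset usable
	 *    (batch == 0) parameters */
	for_each_possible_cpu(cpu)
		setup_pageset(&per_cpu(boot_pageset, cpu), 0);

	/* 3. setup_per_cpu_pageset(): all allocators are up, switch to
	 *    dynamically allocated pagesets sized for the zone */
	zone->pageset = alloc_percpu(struct per_cpu_pageset);
	setup_pageset(per_cpu_ptr(zone->pageset, cpu), zone_batchsize(zone));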



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-07 21:10 ` [this_cpu_xx V6 3/7] Use this_cpu operations in slub cl
@ 2009-10-12 10:19   ` Tejun Heo
  2009-10-12 10:21     ` Tejun Heo
  2009-10-12 14:54     ` Christoph Lameter
  0 siblings, 2 replies; 56+ messages in thread
From: Tejun Heo @ 2009-10-12 10:19 UTC (permalink / raw)
  To: cl; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Hello,

cl@linux-foundation.org wrote:
> @@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
>   */
>  static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>  {
> -	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
> +	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
>  
>  	if (likely(c && c->page))
>  		flush_slab(s, c);
> @@ -1673,7 +1661,7 @@ new_slab:
>  		local_irq_disable();
>  
>  	if (new) {
> -		c = get_cpu_slab(s, smp_processor_id());
> +		c = __this_cpu_ptr(s->cpu_slab);

Shouldn't this be this_cpu_ptr() without the double underscore?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-12 10:19   ` Tejun Heo
@ 2009-10-12 10:21     ` Tejun Heo
  2009-10-12 14:54     ` Christoph Lameter
  1 sibling, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2009-10-12 10:21 UTC (permalink / raw)
  To: cl; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Tejun Heo wrote:
> Hello,
> 
> cl@linux-foundation.org wrote:
>> @@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
>>   */
>>  static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>>  {
>> -	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
>> +	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
>>  
>>  	if (likely(c && c->page))
>>  		flush_slab(s, c);
>> @@ -1673,7 +1661,7 @@ new_slab:
>>  		local_irq_disable();
>>  
>>  	if (new) {
>> -		c = get_cpu_slab(s, smp_processor_id());
>> +		c = __this_cpu_ptr(s->cpu_slab);
> 
> Shouldn't this be this_cpu_ptr() without the double underscore?

Oh... another similar conversions in slab_alloc() and slab_free() too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-07 21:10 ` [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
@ 2009-10-12 10:40   ` Tejun Heo
  2009-10-12 13:14     ` Pekka Enberg
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2009-10-12 10:40 UTC (permalink / raw)
  To: cl; +Cc: linux-kernel, Mathieu Desnoyers, Pekka Enberg, Mel Gorman

cl@linux-foundation.org wrote:
> Use this_cpu_* operations in the hotpath to avoid calculations of
> kmem_cache_cpu pointer addresses.
> 
> On x86 there is a trade off: multiple uses of segment prefixes versus an
> address calculation and more register pressure. Code size is also reduced,
> which is an advantage icache-wise.
> 
> The use of prefixes is necessary if we want to use a scheme
> for fastpaths that do not require disabling interrupts.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

The rest of the patches look good to me but I'm no expert in this area
of code.  But you're the maintainer of the allocator and the changes
definitely are percpu related, so if you're comfortable with it, I can
happily carry the patches through percpu tree.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu  operations in the hotpaths
  2009-10-12 10:40   ` Tejun Heo
@ 2009-10-12 13:14     ` Pekka Enberg
  2009-10-12 14:55       ` Christoph Lameter
  2009-10-13  9:45       ` David Rientjes
  0 siblings, 2 replies; 56+ messages in thread
From: Pekka Enberg @ 2009-10-12 13:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, linux-kernel, Mathieu Desnoyers, Mel Gorman, David Rientjes,
	Zhang Yanmin

On Mon, Oct 12, 2009 at 1:40 PM, Tejun Heo <tj@kernel.org> wrote:
> cl@linux-foundation.org wrote:
>> Use this_cpu_* operations in the hotpath to avoid calculations of
>> kmem_cache_cpu pointer addresses.
>>
>> On x86 there is a trade off: multiple uses of segment prefixes versus an
>> address calculation and more register pressure. Code size is also reduced,
>> which is an advantage icache-wise.
>>
>> The use of prefixes is necessary if we want to use a scheme
>> for fastpaths that do not require disabling interrupts.
>>
>> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> The rest of the patches look good to me but I'm no expert in this area
> of code.  But you're the maintainer of the allocator and the changes
> definitely are percpu related, so if you're comfortable with it, I can
> happily carry the patches through percpu tree.

The patch looks sane to me but the changelog contains no relevant
numbers on performance. I am fine with the patch going in -percpu but
the patch probably needs some more beating performance-wise before it
can go into .33. I'm CC'ing some more people who are known to do SLAB
performance testing just in case they're interested in looking at the
patch. In any case,

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

                        Pekka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-12 10:19   ` Tejun Heo
  2009-10-12 10:21     ` Tejun Heo
@ 2009-10-12 14:54     ` Christoph Lameter
  2009-10-13  2:13       ` Tejun Heo
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-12 14:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Mon, 12 Oct 2009, Tejun Heo wrote:

> > -		c = get_cpu_slab(s, smp_processor_id());
> > +		c = __this_cpu_ptr(s->cpu_slab);
>
> Shouldn't this be this_cpu_ptr() without the double underscore?

Interrupts are disabled so no concurrent fast path can occur.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-12 13:14     ` Pekka Enberg
@ 2009-10-12 14:55       ` Christoph Lameter
  2009-10-13  9:45       ` David Rientjes
  1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-12 14:55 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Tejun Heo, linux-kernel, Mathieu Desnoyers, Mel Gorman,
	David Rientjes, Zhang Yanmin

On Mon, 12 Oct 2009, Pekka Enberg wrote:

> The patch looks sane to me but the changelog contains no relevant
> numbers on performance. I am fine with the patch going in -percpu but
> the patch probably needs some more beating performance-wise before it
> can go into .33. I'm CC'ing some more people who are known to do SLAB
> performance testing just in case they're interested in looking at the
> patch. In any case,

I am warming up my synthetic in-kernel tests right now. Hope I have
something by tomorrow.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-12 14:54     ` Christoph Lameter
@ 2009-10-13  2:13       ` Tejun Heo
  2009-10-13 14:41         ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2009-10-13  2:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Christoph Lameter wrote:
> On Mon, 12 Oct 2009, Tejun Heo wrote:
> 
>>> -		c = get_cpu_slab(s, smp_processor_id());
>>> +		c = __this_cpu_ptr(s->cpu_slab);
>> Shouldn't this be this_cpu_ptr() without the double underscore?
> 
> Interrupts are disabled so no concurrent fast path can occur.
> 

The only difference between this_cpu_ptr() and __this_cpu_ptr() is the
usage of my_cpu_offset and __my_cpu_offset which in turn are only
different in whether they check preemption status to make sure the cpu
is pinned down when called.

The only places where the underbar prefixed versions should be used
are places where cpu locality is nice but not critical and preemption
debug check wouldn't work properly for whatever reason.  The above is
none of the two and the conversion is buried in a patch which is
supposed to do something else.  Am I missing something?
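
For reference, the generic fallback definitions differ roughly like this
(approximate shape of asm-generic/percpu.h at the time, not quoted
verbatim):

	#define __my_cpu_offset	per_cpu_offset(raw_smp_processor_id())
	#ifdef CONFIG_DEBUG_PREEMPT
	#define my_cpu_offset	per_cpu_offset(smp_processor_id())
	#else
	#define my_cpu_offset	__my_cpu_offset
	#endif

	#define this_cpu_ptr(ptr)	SHIFT_PERCPU_PTR(ptr, my_cpu_offset)
	#define __this_cpu_ptr(ptr)	SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)

smp_processor_id() carries the preemption debug check,
raw_smp_processor_id() does not.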

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-12 13:14     ` Pekka Enberg
  2009-10-12 14:55       ` Christoph Lameter
@ 2009-10-13  9:45       ` David Rientjes
  2009-10-13 14:43         ` Christoph Lameter
  1 sibling, 1 reply; 56+ messages in thread
From: David Rientjes @ 2009-10-13  9:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Tejun Heo, Christoph Lameter, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Mon, 12 Oct 2009, Pekka Enberg wrote:

> The patch looks sane to me but the changelog contains no relevant
> numbers on performance. I am fine with the patch going in -percpu but
> the patch probably needs some more beating performance-wise before it
> can go into .33. I'm CC'ing some more people who are known to do SLAB
> performance testing just in case they're interested in looking at the
> patch. In any case,
> 

I ran 60-second netperf TCP_RR benchmarks with various thread counts over 
two machines, both four quad-core Opterons.  I ran the trials ten times 
each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this 
patchset.  The transfer rates were virtually identical showing no 
improvement or regression with this patchset in this benchmark.

 [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472, 
   this benchmark continues to be the most significant regression slub has 
   compared to slab. ]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-13  2:13       ` Tejun Heo
@ 2009-10-13 14:41         ` Christoph Lameter
  2009-10-13 14:56           ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 14:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Tue, 13 Oct 2009, Tejun Heo wrote:

> The only difference between this_cpu_ptr() and __this_cpu_ptr() is the
> usage of my_cpu_offset and __my_cpu_offset which in turn are only
> different in whether they check preemption status to make sure the cpu
> is pinned down when called.

Correct.

> The only places where the underbar prefixed versions should be used
> are places where cpu locality is nice but not critical and preemption
> debug check wouldn't work properly for whatever reason.  The above is
> none of the two and the conversion is buried in a patch which is
> supposed to do something else.  Am I missing something?

I used __this_cpu_* whenever the context is already providing enough
safety that preempt disable or irq disable would not matter. The use of
__this_cpu_ptr was entirely for consistent usage here. this_cpu_ptr would
be safer because it has additional checks that preemption really is
disabled. So if someone gets confused about logic flow later it can be
detected.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13  9:45       ` David Rientjes
@ 2009-10-13 14:43         ` Christoph Lameter
  2009-10-13 19:14           ` Christoph Lameter
  2009-10-14  1:33           ` David Rientjes
  0 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 14:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, David Rientjes wrote:

> I ran 60-second netperf TCP_RR benchmarks with various thread counts over
> two machines, both four quad-core Opterons.  I ran the trials ten times
> each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this
> patchset.  The transfer rates were virtually identical showing no
> improvement or regression with this patchset in this benchmark.
>
>  [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472,
>    this benchmark continues to be the most significant regression slub has
>    compared to slab. ]

Hmmm... Last time I ran the in-kernel benchmarks this showed a reduction
in cycle counts. Did not get to run my tests yet.

Can you also try the irqless hotpath?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-13 14:41         ` Christoph Lameter
@ 2009-10-13 14:56           ` Tejun Heo
  2009-10-13 15:20             ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2009-10-13 14:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Christoph Lameter wrote:
> On Tue, 13 Oct 2009, Tejun Heo wrote:
> 
>> The only difference between this_cpu_ptr() and __this_cpu_ptr() is the
>> usage of my_cpu_offset and __my_cpu_offset which in turn are only
>> different in whether they check preemption status to make sure the cpu
>> is pinned down when called.
> 
> Correct.
> 
>> The only places where the underbar prefixed versions should be used
>> are places where cpu locality is nice but not critical and preemption
>> debug check wouldn't work properly for whatever reason.  The above is
>> none of the two and the conversion is buried in a patch which is
>> supposed to do something else.  Am I missing something?
> 
> I used __this_cpu_* whenever the context is already providing enough
> safety that preempt disable or irq disable would not matter. The use of
> __this_cpu_ptr was entirely for consistent usage here. this_cpu_ptr would
> be safer because it has additional checks that preemption really is
> disabled. So if someone gets confused about logic flow later it can be
> detected.

Yeah, widespread use of underscored versions isn't very desirable.
The underscored versions should denote specific exceptional
conditions instead of being used as a general optimization (which
doesn't make much sense after all, as the optimization is only
meaningful with the debug option turned on).  Are you interested in doing
a sweeping patch to drop underscores from __this_cpu_*() conversions?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-13 14:56           ` Tejun Heo
@ 2009-10-13 15:20             ` Christoph Lameter
  2009-10-14  1:57               ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 15:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Tue, 13 Oct 2009, Tejun Heo wrote:

> Yeah, widespread use of underscored versions isn't very desirable.
> The underscored versions should notify certain specific exceptional
> conditions instead of being used as general optimization (which
> doesn't make much sense after all as the optimization is only
> meaningful with debug option turned on).  Are you interested in doing
> a sweeping patch to drop underscores from __this_cpu_*() conversions?

Nope. __this_cpu_add/dec cannot be converted.

__this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are
useful there too to show that we are in a preempt section.

The calls to raw_smp_processor_id and smp_processor_id() are only useful
in the fallback case. There is no need for those if the arch has a way to
provide the current percpu offset. So we in effect have two meanings of __
right now.

1. We do not care about the preempt state (thus we call
raw_smp_processor_id() so that the preemption check does not trigger)

2. We do not need to disable preempt before the operation.

__this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2.
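
To illustrate with the slub code (sketch only; the stat[] example assumes
the kmem_cache_cpu statistics array that patch 6/7 touches):

	/* meaning 1 only: irqs are already off, so skip the debug check
	 * that this_cpu_ptr() would perform */
	c = __this_cpu_ptr(s->cpu_slab);

	/* meanings 1 and 2: no preempt disable around the RMW either */
	__this_cpu_inc(s->cpu_slab->stat[si]);

	/* fully preempt safe variant: the operation itself guarantees
	 * per cpu atomicity (single instruction on x86, preempt
	 * disable/enable in the fallback) */
	this_cpu_inc(s->cpu_slab->stat[si]);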


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
                   ` (6 preceding siblings ...)
  2009-10-07 21:10 ` [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
@ 2009-10-13 15:40 ` Mel Gorman
  2009-10-13 15:45   ` Christoph Lameter
  7 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2009-10-13 15:40 UTC (permalink / raw)
  To: cl; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Wed, Oct 07, 2009 at 05:10:24PM -0400, cl@linux-foundation.org wrote:
> V5->V6:
> - Drop patches merged by Tejun.
> - Drop irqless slub fastpath for now.
> - Patches against Tejun percpu for-next branch.
> 

FWIW, this fails to boot on latest mmotm on x86-64 even though the patches
apply. It fails to create basic slab caches like kmalloc-64.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-13 15:40 ` [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic Mel Gorman
@ 2009-10-13 15:45   ` Christoph Lameter
  2009-10-13 16:09     ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 15:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Tue, 13 Oct 2009, Mel Gorman wrote:

> FWIW, this fails to boot on latest mmotm on x86-64 even though the patches
> apply. It fails to create basic slab caches like kmalloc-64.

There was a fixup patch for one of the slub patches. Was that merged?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-13 15:45   ` Christoph Lameter
@ 2009-10-13 16:09     ` Mel Gorman
  2009-10-13 17:17       ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2009-10-13 16:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

On Tue, Oct 13, 2009 at 11:45:44AM -0400, Christoph Lameter wrote:
> On Tue, 13 Oct 2009, Mel Gorman wrote:
> 
> > FWIW, this fails to boot on latest mmotm on x86-64 even though the patches
> > apply. It fails to create basic slab caches like kmalloc-64.
> 
> There was a fixup patch for one of the slub patches. Was that merged?
> 

No. I missed it since the subject line did not change and I had just
exported the patch series from the thread itself. Sorry.

I might have something useful on this in the morning assuming no other
PEBKAC-related messes.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic
  2009-10-13 16:09     ` Mel Gorman
@ 2009-10-13 17:17       ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 17:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tejun Heo, linux-kernel, Pekka Enberg, Mathieu Desnoyers

I am stuck too. Sysfs is screwed up somehow and triggers the
hangcheck timer.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [FIX] patch "SLUB: Get rid of dynamic DMA kmalloc cache allocation"
  2009-10-07 21:10 ` [this_cpu_xx V6 4/7] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
@ 2009-10-13 18:48   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 18:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Slight bug when creating kmalloc dma caches on the fly. When searching for
an unused statically allocated kmem_cache structure we need to check for
size == 0 not the other way around.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-10-13 13:31:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-10-13 13:31:36.000000000 -0500
@@ -2650,7 +2650,7 @@ static noinline struct kmem_cache *dma_k

 	s = NULL;
 	for (i = 0; i < KMALLOC_CACHES; i++)
-		if (kmalloc_caches[i].size)
+		if (!kmalloc_caches[i].size)
 			break;

 	BUG_ON(i >= KMALLOC_CACHES);


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 14:43         ` Christoph Lameter
@ 2009-10-13 19:14           ` Christoph Lameter
  2009-10-13 19:44             ` Pekka Enberg
  2009-10-14  1:33           ` David Rientjes
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 19:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

Here are some cycle numbers without and with the slub patches. I will post
the full test results and the patches for these in-kernel tests in a new
thread. The regression may be due to caching behavior of SLUB that will
not change with these patches.

The alloc fastpath wins ~50%. kfree also has a 50% win if the fastpath is
being used. The first test does 10000 kmallocs and then frees them all.
The second test allocates one object, frees it, and repeats that 10000 times.
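
The tests have roughly the following shape (sketch with made up names;
the real in-kernel test patches follow in the other thread):

	/* needs linux/slab.h and linux/timex.h for get_cycles() */
	static void *objs[10000];

	static void kmalloc_cycle_test(size_t size)
	{
		unsigned long long t;
		int i;

		/* Test 1: allocate everything, then free everything */
		t = get_cycles();
		for (i = 0; i < 10000; i++)
			objs[i] = kmalloc(size, GFP_KERNEL);
		pr_info("kmalloc(%zu) -> %llu cycles\n", size,
			(get_cycles() - t) / 10000);

		t = get_cycles();
		for (i = 0; i < 10000; i++)
			kfree(objs[i]);
		pr_info("kfree -> %llu cycles\n", (get_cycles() - t) / 10000);

		/* Test 2: alloc/free pairs back to back */
		t = get_cycles();
		for (i = 0; i < 10000; i++)
			kfree(kmalloc(size, GFP_KERNEL));
		pr_info("kmalloc(%zu)/kfree -> %llu cycles\n", size,
			(get_cycles() - t) / 10000);
	}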

no this_cpu ops

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles


with this_cpu_xx

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles
10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 232 cycles
10000 times kmalloc(16)/kfree -> 150 cycles
10000 times kmalloc(32)/kfree -> 278 cycles
10000 times kmalloc(64)/kfree -> 263 cycles
10000 times kmalloc(128)/kfree -> 280 cycles
10000 times kmalloc(256)/kfree -> 279 cycles
10000 times kmalloc(512)/kfree -> 299 cycles
10000 times kmalloc(1024)/kfree -> 289 cycles
10000 times kmalloc(2048)/kfree -> 288 cycles
10000 times kmalloc(4096)/kfree -> 321 cycles
10000 times kmalloc(8192)/kfree -> 285 cycles
10000 times kmalloc(16384)/kfree -> 1002 cycles


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 19:14           ` Christoph Lameter
@ 2009-10-13 19:44             ` Pekka Enberg
  2009-10-13 19:48               ` Christoph Lameter
  2009-10-13 20:25               ` Christoph Lameter
  0 siblings, 2 replies; 56+ messages in thread
From: Pekka Enberg @ 2009-10-13 19:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

Hi Christoph,

Christoph Lameter wrote:
> Here are some cycle numbers w/o the slub patches and with. I will post the
> full test results and the patches to do these in kernel tests in a new
> thread. The regression may be due to caching behavior of SLUB that will
> not change with these patches.
> 
> Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is
> being used. First test does 10000 kmallocs and then frees them all.
> Second test alloc one and free one and does that 10000 times.

I wonder how reliable these numbers are. We did similar testing a while 
back because we thought kmalloc-96 caches had weird cache behavior but 
finally figured out the anomaly was explained by the order of the tests 
run, not cache size.

AFAICT, we have a similar artifact in these tests as well:

> no this_cpu ops
> 
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
> 10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
> 10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
> 10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles

Notice the drop from 32 to 64 and then the jump back up at 128. One would
expect to see a linear increase as object size grows and we hit the page
allocator more often, no?

> 10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
> 10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
> 10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
> 10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
> 10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
> 10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
> 10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
> 10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 292 cycles
> 10000 times kmalloc(16)/kfree -> 308 cycles
> 10000 times kmalloc(32)/kfree -> 326 cycles
> 10000 times kmalloc(64)/kfree -> 303 cycles
> 10000 times kmalloc(128)/kfree -> 257 cycles
> 10000 times kmalloc(256)/kfree -> 262 cycles
> 10000 times kmalloc(512)/kfree -> 293 cycles
> 10000 times kmalloc(1024)/kfree -> 262 cycles
> 10000 times kmalloc(2048)/kfree -> 289 cycles
> 10000 times kmalloc(4096)/kfree -> 274 cycles
> 10000 times kmalloc(8192)/kfree -> 265 cycles
> 10000 times kmalloc(16384)/kfree -> 1041 cycles
> 
> 
> with this_cpu_xx
> 
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
> 10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles

Same artifact here.

> 10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
> 10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
> 10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
> 10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
> 10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
> 10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
> 10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
> 10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
> 10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
> 10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles

And looking at these numbers:

> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 232 cycles
> 10000 times kmalloc(16)/kfree -> 150 cycles
> 10000 times kmalloc(32)/kfree -> 278 cycles
> 10000 times kmalloc(64)/kfree -> 263 cycles
> 10000 times kmalloc(128)/kfree -> 280 cycles
> 10000 times kmalloc(256)/kfree -> 279 cycles
> 10000 times kmalloc(512)/kfree -> 299 cycles
> 10000 times kmalloc(1024)/kfree -> 289 cycles
> 10000 times kmalloc(2048)/kfree -> 288 cycles
> 10000 times kmalloc(4096)/kfree -> 321 cycles
> 10000 times kmalloc(8192)/kfree -> 285 cycles
> 10000 times kmalloc(16384)/kfree -> 1002 cycles

If there's 50% improvement in the kmalloc() path, why does the 
this_cpu() version seem to be roughly as fast as the mainline version?

			Pekka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 19:44             ` Pekka Enberg
@ 2009-10-13 19:48               ` Christoph Lameter
  2009-10-13 20:15                 ` David Rientjes
  2009-10-13 20:25               ` Christoph Lameter
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 19:48 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, Pekka Enberg wrote:

> I wonder how reliable these numbers are. We did similar testing a while back
> because we thought kmalloc-96 caches had weird cache behavior but finally
> figured out the anomaly was explained by the order of the tests run, not cache
> size.

Well you need to look behind these numbers to see when the allocator uses
the fastpath or slow path. Only the fast path is optimized here.

> AFAICT, we have similar artifact in these tests as well:
>
> > no this_cpu ops
> >
> > 1. Kmalloc: Repeatedly allocate then free test
> > 10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
> > 10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
> > 10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
> > 10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
>
> Notice the jump from 32 to 64 and then back to 64. One would expect we see
> linear increase as object size grows as we hit the page allocator more often,
> no?

64 is the cacheline size for the machine. At that point you have the
advantage of no overlapping data between different allocations and the
prefetcher may do a particularly good job.

> > 10000 times kmalloc(16384)/kfree -> 1002 cycles
>
> If there's 50% improvement in the kmalloc() path, why does the this_cpu()
> version seem to be roughly as fast as the mainline version?

It's not that kmalloc() as a whole is faster. The instructions used for the
fastpath take fewer cycles; other components figure into the total
latency as well.

16k allocations, for example, are not handled by slub anymore, so the
fastpath has no effect. The win there is just the improved percpu handling
in the page allocator.
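
For context, the size cut-off being referred to works roughly like the
sketch below (illustrative only, not the exact slub inline; the 2-page
limit reflects this configuration and get_kmalloc_cache() is a
hypothetical helper standing in for the real cache lookup):

	static inline void *kmalloc_sketch(size_t size, gfp_t flags)
	{
		/* large requests bypass slub entirely, so the slub
		 * fastpath changes cannot affect them */
		if (size > 2 * PAGE_SIZE)	/* ~8k with 4k pages */
			return (void *)__get_free_pages(flags, get_order(size));

		/* small requests hit the slub fastpath via a kmalloc cache */
		return kmem_cache_alloc(get_kmalloc_cache(size), flags);
	}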

I have some numbers here for the irqless patch, which drops another half of
the fastpath latency (and adds some code to the slowpath, sigh):

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 251 cycles
10000 times kmalloc(16) -> 201 cycles kfree -> 261 cycles
10000 times kmalloc(32) -> 220 cycles kfree -> 261 cycles
10000 times kmalloc(64) -> 186 cycles kfree -> 224 cycles
10000 times kmalloc(128) -> 205 cycles kfree -> 125 cycles
10000 times kmalloc(256) -> 351 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 330 cycles kfree -> 310 cycles
10000 times kmalloc(1024) -> 416 cycles kfree -> 419 cycles
10000 times kmalloc(2048) -> 537 cycles kfree -> 439 cycles
10000 times kmalloc(4096) -> 458 cycles kfree -> 594 cycles
10000 times kmalloc(8192) -> 810 cycles kfree -> 678 cycles
10000 times kmalloc(16384) -> 879 cycles kfree -> 746 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 66 cycles
10000 times kmalloc(16)/kfree -> 187 cycles
10000 times kmalloc(32)/kfree -> 116 cycles
10000 times kmalloc(64)/kfree -> 107 cycles
10000 times kmalloc(128)/kfree -> 115 cycles
10000 times kmalloc(256)/kfree -> 65 cycles
10000 times kmalloc(512)/kfree -> 66 cycles
10000 times kmalloc(1024)/kfree -> 206 cycles
10000 times kmalloc(2048)/kfree -> 65 cycles
10000 times kmalloc(4096)/kfree -> 193 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 976 cycles





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 19:48               ` Christoph Lameter
@ 2009-10-13 20:15                 ` David Rientjes
  2009-10-13 20:28                   ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: David Rientjes @ 2009-10-13 20:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > I wonder how reliable these numbers are. We did similar testing a while back
> > because we thought kmalloc-96 caches had weird cache behavior but finally
> > figured out the anomaly was explained by the order of the tests run, not cache
> > size.
> 
> Well you need to look behind these numbers to see when the allocator uses
> the fastpath or slow path. Only the fast path is optimized here.
> 

With the netperf -t TCP_RR -l 60 benchmark I ran, CONFIG_SLUB_STATS shows 
the allocation fastpath is utilized quite a bit for a couple of key 
caches:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

For an optimized fastpath, I'd expect such a workload would result in at 
least a slightly higher transfer rate.

I'll try the irqless patch, but this particular benchmark may not 
demonstrate any performance gain appropriately because of the added code 
in the slowpath, which is also used significantly.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 19:44             ` Pekka Enberg
  2009-10-13 19:48               ` Christoph Lameter
@ 2009-10-13 20:25               ` Christoph Lameter
  1 sibling, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 20:25 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, Pekka Enberg wrote:

> I wonder how reliable these numbers are. We did similar testing a while back
> because we thought kmalloc-96 caches had weird cache behavior but finally
> figured out the anomaly was explained by the order of the tests run, not cache
> size.

The tests were all run directly after booting the respective kernel.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 20:15                 ` David Rientjes
@ 2009-10-13 20:28                   ` Christoph Lameter
  2009-10-13 22:53                     ` David Rientjes
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-13 20:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, David Rientjes wrote:

> For an optimized fastpath, I'd expect such a workload would result in at
> least a slightly higher transfer rate.

There will be no improvements if the load is dominated by the
instructions in the network layer or caching issues. None of that is
changed by the patch. It only reduces the cycle count in the fastpath.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 20:28                   ` Christoph Lameter
@ 2009-10-13 22:53                     ` David Rientjes
  2009-10-14 13:34                       ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: David Rientjes @ 2009-10-13 22:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > For an optimized fastpath, I'd expect such a workload would result in at
> > least a slightly higher transfer rate.
> 
> There will be no improvements if the load is dominated by the
> instructions in the network layer or caching issues. None of that is
> changed by the patch. It only reduces the cycle count in the fastpath.
> 

Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the 
same workload so it shows that the slab allocator does have an impact on 
transfer rate.  I understand that the performance gain with this patchset, 
however, may not be representative with the benchmark since it also 
frequently uses the slowpath for kmalloc-256 about 25% of the time and the 
added code of the irqless patch may mask the fastpath gain.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 14:43         ` Christoph Lameter
  2009-10-13 19:14           ` Christoph Lameter
@ 2009-10-14  1:33           ` David Rientjes
  1 sibling, 0 replies; 56+ messages in thread
From: David Rientjes @ 2009-10-14  1:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Tejun Heo, linux-kernel, Mathieu Desnoyers,
	Mel Gorman, Zhang Yanmin

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > I ran 60-second netperf TCP_RR benchmarks with various thread counts over
> > two machines, both four quad-core Opterons.  I ran the trials ten times
> > each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this
> > patchset.  The transfer rates were virtually identical showing no
> > improvement or regression with this patchset in this benchmark.
> >
> >  [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472,
> >    this benchmark continues to be the most significant regression slub has
> >    compared to slab. ]
> 
> Hmmm... Last time I ran the in kernel benchmarks this showed a reduction
> in cycle counts. Did not get to get my tests yet.
> 
> Can you also try the irqless hotpath?
> 

v6 of your patchset applied to percpu#for-next now at dec54bf "this_cpu: 
Use this_cpu_xx in trace_functions_graph.c" works fine, but when I apply 
the irqless patch from http://marc.info/?l=linux-kernel&m=125503037213262 
it hangs my netserver machine within the first 60 seconds when running 
this benchmark.  These kernels both include the fixes to kmem_cache_open() 
and dma_kmalloc_cache() you posted earlier.  I'll have to debug why that's 
happening before collecting results.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-13 15:20             ` Christoph Lameter
@ 2009-10-14  1:57               ` Tejun Heo
  2009-10-14 14:14                 ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2009-10-14  1:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Hello, Christoph.

Christoph Lameter wrote:
>> Yeah, widespread use of underscored versions isn't very desirable.
>> The underscored versions should notify certain specific exceptional
>> conditions instead of being used as general optimization (which
>> doesn't make much sense after all as the optimization is only
>> meaningful with debug option turned on).  Are you interested in doing
>> a sweeping patch to drop underscores from __this_cpu_*() conversions?
> 
> Nope. __this_cpu_add/dec cannot be converted.

Right.

> __this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are
> useful there too to show that we are in a preempt section.

That doesn't make much sense.  __ for this_cpu_ptr() means "bypass
sanity check, we're knowingly violating the required conditions" not
"we know sanity checks will pass here".

> The calls to raw_smp_processor_id and smp_processor_id() are only useful
> in the fallback case. There is no need for those if the arch has a way to
> provide the current percpu offset. So we in effect have two meanings of __
> right now.
> 
> 1. We do not care about the preempt state (thus we call
> raw_smp_processor_id so that the preempt state does not trigger)
> 
> 2. We do not need to disable preempt before the operation.
> 
> __this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2.

Yeah, we need to clean it up.  The naming is too confusing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-13 22:53                     ` David Rientjes
@ 2009-10-14 13:34                       ` Mel Gorman
  2009-10-14 14:08                         ` Christoph Lameter
  2009-10-15  9:03                         ` David Rientjes
  0 siblings, 2 replies; 56+ messages in thread
From: Mel Gorman @ 2009-10-14 13:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Tue, Oct 13, 2009 at 03:53:00PM -0700, David Rientjes wrote:
> On Tue, 13 Oct 2009, Christoph Lameter wrote:
> 
> > > For an optimized fastpath, I'd expect such a workload would result in at
> > > least a slightly higher transfer rate.
> > 
> > There will be no improvements if the load is dominated by the
> > instructions in the network layer or caching issues. None of that is
> > changed by the patch. It only reduces the cycle count in the fastpath.
> > 
> 
> Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the 
> same workload so it shows that the slab allocator does have an impact on 
> transfer rate.  I understand that the performance gain with this patchset, 
> however, may not be representative with the benchmark since it also 
> frequently uses the slowpath for kmalloc-256 about 25% of the time and the 
> added code of the irqless patch may mask the fastpath gain.
> 

I have a bit more detailed results based on the following machine

CPU type:	AMD Phenom 9950
CPU counts:	1 CPU (4 cores)
CPU Speed:	1.3GHz
Motherboard:	Gigabyte GA-MA78GM-S2H
Memory:		8GB

The reference kernel used is mmotm-2009-10-09-01-07. The patches applied
are the patches in this thread. The column headings are:

SLUB-vanilla	where vanilla is mmotm-2009-10-09-01-07
SLUB-this-cpu	mmotm-2009-10-09-01-07 + patches in this thread
SLAB-*		same as above but SLAB configured instead of SLUB.
		I know it wasn't necessary to run SLAB-this-cpu but
		it gives an idea of how much results can vary
		between reboots even if results are stable once the
		machine is running.

The benchmarks run were kernbench, netperf UDP_STREAM and TCP_STREAM and
sysbench with postgres.

Kernbench is five kernel compiles with the average taken. One kernel compile
is done at the start to warm the benchmark up and its result is
discarded.

Netperf is the _STREAM tests as opposed to the _RR tests reported
elsewhere. No special effort is made to bind processes to any particular
CPU. The results reported aimed for 99% confidence that the estimated
mean was within 1% of the true mean. Results where netperf failed to
achieve the necessary confidence are marked with a * and the line after
such a result states within what percentage of the true mean the
estimated mean falls. The test is run with different packet sizes.

Sysbench is a read-only test (to avoid IO) and is the "complex"
workload. The test is run with varying numbers of threads.

In all the results, SLUB-vanilla is the reference baseline. This allows
a comparison between SLUB-vanilla and SLAB-vanilla as well with the
patches applied.

                  SLUB-vanilla     SLUB-this-cpu      SLAB-vanilla     SLAB-this-cpu
Elapsed min       92.95 ( 0.00%)    92.62 ( 0.36%)    92.93 ( 0.02%)    92.62 ( 0.36%)
Elapsed mean      93.11 ( 0.00%)    92.74 ( 0.40%)    93.00 ( 0.13%)    92.82 ( 0.32%)
Elapsed stddev     0.10 ( 0.00%)     0.14 (-40.55%)     0.04 (55.47%)     0.18 (-84.33%)
Elapsed max       93.20 ( 0.00%)    92.95 ( 0.27%)    93.05 ( 0.16%)    93.09 ( 0.12%)
User    min      323.21 ( 0.00%)   322.60 ( 0.19%)   322.50 ( 0.22%)   323.26 (-0.02%)
User    mean     323.81 ( 0.00%)   323.20 ( 0.19%)   323.16 ( 0.20%)   323.54 ( 0.08%)
User    stddev     0.40 ( 0.00%)     0.46 (-15.30%)     0.48 (-20.92%)     0.29 (26.07%)
User    max      324.32 ( 0.00%)   323.72 ( 0.19%)   323.86 ( 0.14%)   323.98 ( 0.10%)
System  min       35.95 ( 0.00%)    35.50 ( 1.25%)    35.35 ( 1.67%)    36.01 (-0.17%)
System  mean      36.30 ( 0.00%)    35.96 ( 0.96%)    36.17 ( 0.36%)    36.23 ( 0.21%)
System  stddev     0.25 ( 0.00%)     0.45 (-75.60%)     0.56 (-121.14%)     0.14 (46.14%)
System  max       36.65 ( 0.00%)    36.67 (-0.05%)    36.94 (-0.79%)    36.39 ( 0.71%)
CPU     min      386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)
CPU     mean     386.25 ( 0.00%)   386.75 (-0.13%)   386.00 ( 0.06%)   387.25 (-0.26%)
CPU     stddev     0.43 ( 0.00%)     0.83 (-91.49%)     0.00 (100.00%)     0.83 (-91.49%)
CPU     max      387.00 ( 0.00%)   388.00 (-0.26%)   386.00 ( 0.26%)   388.00 (-0.26%)

Small gains in the User, System and Elapsed times with this-cpu patches
applied. It is interesting to note for the mean times that the patches more
than close the gap between SLUB and SLAB for the most part - the
exception being User which has marginally better performance. This might
indicate that SLAB is still slightly better at giving back cache-hot
memory but this is speculation.

NETPERF UDP_STREAM
  Packet           netperf-udp          udp-SLUB       netperf-udp          udp-SLAB
    Size          SLUB-vanilla          this-cpu      SLAB-vanilla          this-cpu
      64       148.48 ( 0.00%)    152.03 ( 2.34%)    147.45 (-0.70%)    150.07 ( 1.06%) 
     128       294.65 ( 0.00%)    299.92 ( 1.76%)    289.20 (-1.88%)    290.15 (-1.55%) 
     256       583.63 ( 0.00%)    609.14 ( 4.19%)    590.78 ( 1.21%)    586.42 ( 0.48%) 
    1024      2217.90 ( 0.00%)   2261.99 ( 1.95%)   2219.64 ( 0.08%)   2207.93 (-0.45%) 
    2048      4164.27 ( 0.00%)   4161.47 (-0.07%)   4216.46 ( 1.24%)   4155.11 (-0.22%) 
    3312      6284.17 ( 0.00%)   6383.24 ( 1.55%)   6231.88 (-0.84%)   6243.82 (-0.65%) 
    4096      7399.42 ( 0.00%)   7686.38 ( 3.73%)   7394.89 (-0.06%)   7487.91 ( 1.18%) 
    6144     10014.35 ( 0.00%)  10199.48 ( 1.82%)   9927.92 (-0.87%)* 10067.40 ( 0.53%) 
                 1.00%             1.00%             1.08%             1.00%        
    8192     11232.50 ( 0.00%)* 11368.13 ( 1.19%)* 12280.88 ( 8.54%)* 12244.23 ( 8.26%) 
                 1.65%             1.64%             1.32%             1.00%        
   10240     12961.87 ( 0.00%)  13099.82 ( 1.05%)* 13816.33 ( 6.18%)* 13927.18 ( 6.93%) 
                 1.00%             1.03%             1.21%             1.00%        
   12288     14403.74 ( 0.00%)* 14276.89 (-0.89%)* 15173.09 ( 5.07%)* 15464.05 ( 6.86%)*
                 1.31%             1.63%             1.93%             1.55%        
   14336     15229.98 ( 0.00%)* 15218.52 (-0.08%)* 16412.94 ( 7.21%)  16252.98 ( 6.29%) 
                 1.37%             2.76%             1.00%             1.00%        
   16384     15367.60 ( 0.00%)* 16038.71 ( 4.18%)  16635.91 ( 7.62%)  17128.87 (10.28%)*
             1.29%             1.00%             1.00%             6.36%        

The patches mostly improve the performance of netperf UDP_STREAM by a good
whack so the patches are a plus here. However, it should also be noted that
SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
my memory, how do SLUB and SLAB differ in regards to off-loading large
allocations to the page allocator these days?

NETPERF TCP_STREAM
  Packet           netperf-tcp          tcp-SLUB       netperf-tcp          tcp-SLAB
    Size          SLUB-vanilla          this-cpu      SLAB-vanilla          this-cpu
      64      1773.00 ( 0.00%)   1731.63 (-2.39%)*  1794.48 ( 1.20%)   2029.46 (12.64%) 
                 1.00%             2.43%             1.00%             1.00%        
     128      3181.12 ( 0.00%)   3471.22 ( 8.36%)   3296.37 ( 3.50%)   3251.33 ( 2.16%) 
     256      4794.35 ( 0.00%)   4797.38 ( 0.06%)   4912.99 ( 2.41%)   4846.86 ( 1.08%) 
    1024      9438.10 ( 0.00%)   8681.05 (-8.72%)*  8270.58 (-14.12%)   8268.85 (-14.14%) 
                 1.00%             7.31%             1.00%             1.00%        
    2048      9196.06 ( 0.00%)   9375.72 ( 1.92%)  11474.59 (19.86%)   9420.01 ( 2.38%) 
    3312     10338.49 ( 0.00%)* 10021.82 (-3.16%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
                 9.49%             6.36%             1.21%             2.12%        
    4096      9931.20 ( 0.00%)* 10285.38 ( 3.44%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
                 1.31%             1.38%             9.97%             8.33%        
    6144     12775.08 ( 0.00%)* 10559.63 (-20.98%)  13139.34 ( 2.77%)  13210.79 ( 3.30%)*
                 1.45%             1.00%             1.00%             2.99%        
    8192     10933.93 ( 0.00%)* 10534.41 (-3.79%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
                14.29%             2.10%            12.50%             9.55%        
   10240     12868.58 ( 0.00%)  12991.65 ( 0.95%)  10892.20 (-18.14%)  13106.01 ( 1.81%) 
   12288     11854.97 ( 0.00%)  12122.34 ( 2.21%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
                 1.00%             6.61%             5.78%             8.95%        
   14336     12552.48 ( 0.00%)* 12501.71 (-0.41%)* 12274.54 (-2.26%)  12322.63 (-1.87%)*
                 6.05%             2.58%             1.00%             2.23%        
   16384     11733.09 ( 0.00%)* 12735.05 ( 7.87%)* 13195.68 (11.08%)* 14401.62 (18.53%) 
                 1.14%             9.79%            10.30%             1.00%        

The results for the patches are a bit all over the place for TCP_STREAM
with big gains and losses depending on the packet size, particularly 6144
for some reason. SLUB vs SLAB shows SLAB often has really massive advantages
and this is not always for the larger packet sizes where the page allocator
might be a suspect.

SYSBENCH
     Threads      SLUB-vanilla      SLUB-this-cpu     SLAB-vanilla      SLAB-this-cpu
           1 26950.79 ( 0.00%) 26822.05 (-0.48%) 26919.89 (-0.11%) 26746.18 (-0.77%)
           2 51555.51 ( 0.00%) 51928.02 ( 0.72%) 51370.02 (-0.36%) 51129.82 (-0.83%)
           3 76204.23 ( 0.00%) 76333.58 ( 0.17%) 76483.99 ( 0.37%) 75954.52 (-0.33%)
           4 100599.12 ( 0.00%) 101757.98 ( 1.14%) 100499.65 (-0.10%) 101605.61 ( 0.99%)
           5 100211.45 ( 0.00%) 100435.33 ( 0.22%) 100150.98 (-0.06%) 99398.11 (-0.82%)
           6 99390.81 ( 0.00%) 99840.85 ( 0.45%) 99234.38 (-0.16%) 99244.42 (-0.15%)
           7 98740.56 ( 0.00%) 98727.61 (-0.01%) 98305.88 (-0.44%) 98123.56 (-0.63%)
           8 98075.89 ( 0.00%) 98048.62 (-0.03%) 98183.99 ( 0.11%) 97587.82 (-0.50%)
           9 96502.22 ( 0.00%) 97276.80 ( 0.80%) 96819.88 ( 0.33%) 97320.51 ( 0.84%)
          10 96598.70 ( 0.00%) 96545.37 (-0.06%) 96222.51 (-0.39%) 96221.69 (-0.39%)
          11 95500.66 ( 0.00%) 95671.11 ( 0.18%) 95003.21 (-0.52%) 95246.81 (-0.27%)
          12 94572.87 ( 0.00%) 95266.70 ( 0.73%) 93807.60 (-0.82%) 94859.82 ( 0.30%)
          13 93811.85 ( 0.00%) 94309.18 ( 0.53%) 93219.81 (-0.64%) 93051.63 (-0.82%)
          14 92972.16 ( 0.00%) 93849.87 ( 0.94%) 92641.50 (-0.36%) 92916.70 (-0.06%)
          15 92276.06 ( 0.00%) 92454.94 ( 0.19%) 91094.04 (-1.30%) 91972.79 (-0.33%)
          16 90265.35 ( 0.00%) 90416.26 ( 0.17%) 89309.26 (-1.07%) 90103.89 (-0.18%)

The patches mostly gain for sysbench although the gains are very marginal
and SLUB has a minor advantage over SLAB. I haven't actually checked how
slab-intensive this workload is. The differences are so marginal that I
would guess the answer is "not very".

Overall based on these results, I would say that the patches are a "Good Thing"
for this machine at least. With the patches applied, SLUB has a marginal
advantage over SLAB for kernbench. However, netperf TCP_STREAM and UDP_STREAM
both show significant disadvantages for SLUB and this cannot always be
explained by differing behaviour with respect to page-allocator offloading.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 13:34                       ` Mel Gorman
@ 2009-10-14 14:08                         ` Christoph Lameter
  2009-10-14 15:49                           ` Mel Gorman
  2009-10-15  9:03                         ` David Rientjes
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-14 14:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

The test did not include the irqless patch I hope?

On Wed, 14 Oct 2009, Mel Gorman wrote:

> Small gains in the User, System and Elapsed times with this-cpu patches
> applied. It is interesting to note for the mean times that the patches more
> than close the gap between SLUB and SLAB for the most part - the
> exception being User which has marginally better performance. This might
> indicate that SLAB is still slightly better at giving back cache-hot
> memory but this is speculation.

The queuing in SLAB allows a better cache hot behavior. Without a queue
SLUB has a difficult time improvising cache hot behavior based on objects
restricted to a slab page. Therefore the size of the slab page will
affect how much "queueing" SLUB can do.

> The patches mostly improve the performance of netperf UDP_STREAM by a good
> whack so the patches are a plus here. However, it should also be noted that
> SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
> my memory, how do SLUB and SLAB differ in regards to off-loading large
> allocations to the page allocator these days?

SLUB offloads allocations > 8k to the page allocator.
SLAB does create large slabs.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-14  1:57               ` Tejun Heo
@ 2009-10-14 14:14                 ` Christoph Lameter
  2009-10-15  7:47                   ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-14 14:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Wed, 14 Oct 2009, Tejun Heo wrote:

> > __this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are
> > useful there too to show that we are in a preempt section.
>
> That doesn't make much sense.  __ for this_cpu_ptr() means "bypass
> sanity check, we're knowingly violating the required conditions" not
> "we know sanity checks will pass here".

Are you defining what __ means for this_cpu_ptr?

> > The calls to raw_smp_processor_id and smp_processor_id() are only useful
> > in the fallback case. There is no need for those if the arch has a way to
> > provide the current percpu offset. So we in effect have two meanings of __
> > right now.
> >
> > 1. We do not care about the preempt state (thus we call
> > raw_smp_processor_id so that the preempt state does not trigger)
> >
> > 2. We do not need to disable preempt before the operation.
> >
> > __this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2.
>
> Yeah, we need to clean it up.  The naming is too confusing.

It's consistent if __ means both 1 and 2. If we want to distinguish them then
we may want to create raw_this_cpu_xx, which means that we do not call
smp_processor_id() in the fallback but raw_smp_processor_id(). It does not
matter if the arch provides a per cpu offset.

This would mean duplicating all the macros. The use of raw_this_cpu_xx
should be rare so maybe the best approach is to say that __ means only
that the macro does not need to disable preempt but it still checks for
preemption being off. Then audit the __this_cpu_xx uses and see if there
are any that require a raw_ variant.

The vm event counters require both no check and no preempt since they can
be implemented in a racy way.
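
To put the two meanings in C terms, the generic fallbacks differ roughly
as in the sketch below (a simplified illustration, not the real macros in
include/linux/percpu.h, which are operation-generic):

	/* this_cpu_add(): usable anywhere, so the fallback disables preempt */
	#define this_cpu_add_sketch(var, val)				\
	do {								\
		preempt_disable();					\
		per_cpu(var, smp_processor_id()) += (val);		\
		preempt_enable();					\
	} while (0)

	/* __this_cpu_add(): the caller already prevents rescheduling, so no
	 * preempt disable (meaning 2) and no preemption check, because
	 * raw_smp_processor_id() is used (meaning 1) */
	#define __this_cpu_add_sketch(var, val)				\
	do {								\
		per_cpu(var, raw_smp_processor_id()) += (val);		\
	} while (0)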





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 14:08                         ` Christoph Lameter
@ 2009-10-14 15:49                           ` Mel Gorman
  2009-10-14 15:53                             ` Pekka Enberg
  0 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2009-10-14 15:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Wed, Oct 14, 2009 at 10:08:12AM -0400, Christoph Lameter wrote:
> The test did not include the irqless patch I hope?
> 

Correct. Only the patches in this thread were tested.

> On Wed, 14 Oct 2009, Mel Gorman wrote:
> 
> > Small gains in the User, System and Elapsed times with this-cpu patches
> > applied. It is interesting to note for the mean times that the patches more
> > than close the gap between SLUB and SLAB for the most part - the
> > exception being User which has marginally better performance. This might
> > indicate that SLAB is still slightly better at giving back cache-hot
> > memory but this is speculation.
> 
> The queuing in SLAB allows a better cache hot behavior. Without a queue
> SLUB has a difficult time improvising cache hot behavior based on objects
> restricted to a slab page. Therefore the size of the slab page will
> affect how much "queueing" SLUB can do.
> 

Ok, so the speculation is a plausible explanation.

> > The patches mostly improve the performance of netperf UDP_STREAM by a good
> > whack so the patches are a plus here. However, it should also be noted that
> > SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
> > my memory, how do SLUB and SLAB differ in regards to off-loading large
> > allocations to the page allocator these days?
> 
> SLUB offloads allocations > 8k to the page allocator.
> SLAB does create large slabs.
> 

Allocations >8k might then explain why UDP_STREAM performance suffers for
8K and 16K packets. That can be marked as possible future work to sort
out within the allocator.

However, does it explain why TCP_STREAM suffers so badly even for packet
sizes like 2K? It's also important to note that in some cases SLAB was far
slower even when the packet sizes were greater than 8k, so I don't think
the page allocator is an adequate explanation for TCP_STREAM.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 15:49                           ` Mel Gorman
@ 2009-10-14 15:53                             ` Pekka Enberg
  2009-10-14 15:56                               ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Pekka Enberg @ 2009-10-14 15:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, David Rientjes, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

Hi Mel,

Mel Gorman wrote:
>>> The patches mostly improve the performance of netperf UDP_STREAM by a good
>>> whack so the patches are a plus here. However, it should also be noted that
>>> SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
>>> my memory, how do SLUB and SLAB differ in regards to off-loading large
>>> allocations to the page allocator these days?
>> SLUB offloads allocations > 8k to the page allocator.
>> SLAB does create large slabs.
>>
> 
> Allocations >8k might explain then why 8K and 16K packets for UDP_STREAM
> performance suffers. That can be marked as future possible work to sort
> out within the allocator.
> 
> However, does it explain why TCP_STREAM suffers so badly even for packet
> sizes like 2K? It's also important to note in some cases, SLAB was far
> slower even when the packet sizes were greater than 8k so I don't think
> the page allocator is an adequate explanation for TCP_STREAM.

SLAB is able to queue lots of large objects but SLUB can't do that 
because it has no queues. In SLUB, each CPU gets a page assigned to it 
that serves as a "queue" but the size of the queue gets smaller as 
object size approaches page size.

We try to offset that with higher order allocations but IIRC we don't 
increase the order linearly with object size and cap it to some 
reasonable maximum.
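
A back-of-the-envelope illustration of that shrinking "queue", assuming 4k
pages and ignoring per-slab metadata (a sketch, not slub's actual order
calculation):

	/* effective per-cpu "queue" depth: objects per slab of a given order */
	static unsigned int slub_queue_depth(unsigned int object_size,
					     unsigned int order)
	{
		/* e.g. order 0 with 2048-byte objects -> only 2 objects */
		return (4096u << order) / object_size;
	}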

			Pekka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 15:53                             ` Pekka Enberg
@ 2009-10-14 15:56                               ` Christoph Lameter
  2009-10-14 16:14                                 ` Pekka Enberg
  2009-10-16 10:50                                 ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-14 15:56 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, David Rientjes, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Wed, 14 Oct 2009, Pekka Enberg wrote:

> SLAB is able to queue lots of large objects but SLUB can't do that because it
> has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
> "queue" but the size of the queue gets smaller as object size approaches page
> size.
>
> We try to offset that with higher order allocations but IIRC we don't increase
> the order linearly with object size and cap it to some reasonable maximum.

You can test to see if larger pages have an influence by passing

slub_max_order=6

or so on the kernel command line.

You can force a large page use in slub by setting

slub_min_order=3

f.e.

Or you can force a minimum number of objects in slub through f.e.

slub_min_objects=50



slub_max_order=6 slub_min_objects=50

should result in pretty large slabs with lots of in page objects that
allow slub to queue better.





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu  operations in the hotpaths
  2009-10-14 15:56                               ` Christoph Lameter
@ 2009-10-14 16:14                                 ` Pekka Enberg
  2009-10-14 18:19                                   ` Christoph Lameter
  2009-10-16 10:50                                 ` Mel Gorman
  1 sibling, 1 reply; 56+ messages in thread
From: Pekka Enberg @ 2009-10-14 16:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, David Rientjes, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

Hi Christoph,

On Wed, 14 Oct 2009, Pekka Enberg wrote:
>> SLAB is able to queue lots of large objects but SLUB can't do that because it
>> has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
>> "queue" but the size of the queue gets smaller as object size approaches page
>> size.
>>
>> We try to offset that with higher order allocations but IIRC we don't increase
>> the order linearly with object size and cap it to some reasonable maximum.

On Wed, Oct 14, 2009 at 6:56 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> You can test to see if larger pages have an influence by passing
>
> slub_max_order=6
>
> or so on the kernel command line.
>
> You can force a large page use in slub by setting
>
> slub_min_order=3
>
> f.e.
>
> Or you can force a minimum number of objects in slub through f.e.
>
> slub_min_objects=50
>
> slub_max_order=6 slub_min_objects=50
>
> should result in pretty large slabs with lots of in page objects that
> allow slub to queue better.

Yeah, that should help but it's probably not something we can do for
mainline. I'm not sure how we can fix SLUB to support large objects
out-of-the-box as efficiently as SLAB does.

                        Pekka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 16:14                                 ` Pekka Enberg
@ 2009-10-14 18:19                                   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-14 18:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Mel Gorman, David Rientjes, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Wed, 14 Oct 2009, Pekka Enberg wrote:

> Yeah, that should help but it's probably not something we can do for
> mainline. I'm not sure how we can fix SLUB to support large objects
> out-of-the-box as efficiently as SLAB does.

We could add a per cpu "queue" through a pointer array in kmem_cache_cpu.
Which is more SLQB than SLUB.
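
Something along these lines, purely as an illustration of the idea (not a
patch; the field names and queue size are made up):

	struct kmem_cache_cpu_sketch {
		void **freelist;	/* existing per-cpu free list */
		struct page *page;	/* existing per-cpu slab page */
		void *queue[16];	/* hypothetical per-cpu object queue */
		unsigned int nr_queued;	/* objects currently in the queue */
	};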


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-14 14:14                 ` Christoph Lameter
@ 2009-10-15  7:47                   ` Tejun Heo
  2009-10-16 16:44                     ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Tejun Heo @ 2009-10-15  7:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Christoph Lameter wrote:
> On Wed, 14 Oct 2009, Tejun Heo wrote:
> 
>>> __this_cpu_ptr could be converted to this_cpu_ptr but I think the __ are
>>> useful there too to show that we are in a preempt section.
>> That doesn't make much sense.  __ for this_cpu_ptr() means "bypass
>> sanity check, we're knowingly violating the required conditions" not
>> "we know sanity checks will pass here".
> 
> Are you defining what __ means for this_cpu_ptr?

I was basically stating the difference between raw_smp_processor_id()
and smp_processor_id() which I thought applied the same to
__this_cpu_ptr() and this_cpu_ptr().

>>> The calls to raw_smp_processor_id and smp_processor_id() are only useful
>>> in the fallback case. There is no need for those if the arch has a way to
>>> provide the current percpu offset. So we in effect have two meanings of __
>>> right now.
>>>
>>> 1. We do not care about the preempt state (thus we call
>>> raw_smp_processor_id so that the preempt state does not trigger)
>>>
>>> 2. We do not need to disable preempt before the operation.
>>>
>>> __this_cpu_ptr only implies 1. __this_cpu_add uses 1 and 2.
>>
>> Yeah, we need to clean it up.  The naming is too confusing.
> 
> Its consistent if __ means both 1 and 2. If we want to distinguish it then
> we may want to create raw_this_cpu_xx which means that we do not call
> smp_processor_id() on fallback but raw_smp_processor_id(). Does not
> matter if the arch provides a per cpu offset.
> 
> This would mean duplicating all the macros. The use of raw_this_cpu_xx
> should be rare so maybe the best approach is to say that __ means only
> that the macro does not need to disable preempt but it still checks for
> preemption being off. Then audit the __this_cpu_xx uses and see if there
> are any that require a raw_ variant.
> 
> The vm event counters require both no check and no preempt since they can
> be implemented in a racy way.

The biggest grief I have is that the meaning of __ is different among
different accessors.  If that can be cleared up, we would be in much
better shape without adding any extra macros.  Can we just remove all
__'s and use meaningful pre or suffixes like raw or irq or whatever?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 13:34                       ` Mel Gorman
  2009-10-14 14:08                         ` Christoph Lameter
@ 2009-10-15  9:03                         ` David Rientjes
  2009-10-16 16:45                           ` Christoph Lameter
  1 sibling, 1 reply; 56+ messages in thread
From: David Rientjes @ 2009-10-15  9:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Wed, 14 Oct 2009, Mel Gorman wrote:

> NETPERF TCP_STREAM
>   Packet           netperf-tcp          tcp-SLUB       netperf-tcp          tcp-SLAB
>     Size          SLUB-vanilla          this-cpu      SLAB-vanilla          this-cpu
>       64      1773.00 ( 0.00%)   1731.63 (-2.39%)*  1794.48 ( 1.20%)   2029.46 (12.64%) 
>                  1.00%             2.43%             1.00%             1.00%        
>      128      3181.12 ( 0.00%)   3471.22 ( 8.36%)   3296.37 ( 3.50%)   3251.33 ( 2.16%) 
>      256      4794.35 ( 0.00%)   4797.38 ( 0.06%)   4912.99 ( 2.41%)   4846.86 ( 1.08%) 
>     1024      9438.10 ( 0.00%)   8681.05 (-8.72%)*  8270.58 (-14.12%)   8268.85 (-14.14%) 
>                  1.00%             7.31%             1.00%             1.00%        
>     2048      9196.06 ( 0.00%)   9375.72 ( 1.92%)  11474.59 (19.86%)   9420.01 ( 2.38%) 
>     3312     10338.49 ( 0.00%)* 10021.82 (-3.16%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
>                  9.49%             6.36%             1.21%             2.12%        
>     4096      9931.20 ( 0.00%)* 10285.38 ( 3.44%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
>                  1.31%             1.38%             9.97%             8.33%        
>     6144     12775.08 ( 0.00%)* 10559.63 (-20.98%)  13139.34 ( 2.77%)  13210.79 ( 3.30%)*
>                  1.45%             1.00%             1.00%             2.99%        
>     8192     10933.93 ( 0.00%)* 10534.41 (-3.79%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
>                 14.29%             2.10%            12.50%             9.55%        
>    10240     12868.58 ( 0.00%)  12991.65 ( 0.95%)  10892.20 (-18.14%)  13106.01 ( 1.81%) 
>    12288     11854.97 ( 0.00%)  12122.34 ( 2.21%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
>                  1.00%             6.61%             5.78%             8.95%        
>    14336     12552.48 ( 0.00%)* 12501.71 (-0.41%)* 12274.54 (-2.26%)  12322.63 (-1.87%)*
>                  6.05%             2.58%             1.00%             2.23%        
>    16384     11733.09 ( 0.00%)* 12735.05 ( 7.87%)* 13195.68 (11.08%)* 14401.62 (18.53%) 
>                  1.14%             9.79%            10.30%             1.00%        
> 
> The results for the patches are a bit all over the place for TCP_STREAM
> with big gains and losses depending on the packet size, particularly 6144
> for some reason. SLUB vs SLAB shows SLAB often has really massive advantages
> and this is not always for the larger packet sizes where the page allocator
> might be a suspect.
> 

TCP_STREAM stresses a few specific caches:

		ALLOC_FASTPATH	ALLOC_SLOWPATH	FREE_FASTPATH	FREE_SLOWPATH
kmalloc-256	3868530		3450592		95628		7223491
kmalloc-1024	2440434		429		2430825		10034
kmalloc-4096	3860625		1036723		85571		4811779

This demonstrates that freeing to full (or partial) slabs causes a lot of 
pain since the fastpath normally can't be utilized and that's probably 
beyond the scope of this patchset.
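
For reference, the decision behind those FREE_FASTPATH/FREE_SLOWPATH
counters looks roughly like the sketch below (illustrative only; irq
handling, debug hooks and stat accounting omitted):

	static void slab_free_sketch(struct kmem_cache *s,
				     struct page *page, void *x)
	{
		struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);

		if (page == c->page) {
			/* fastpath: the object belongs to this cpu's active
			 * slab, so just push it onto the per-cpu freelist */
			set_freepointer(s, x, c->freelist);
			c->freelist = x;
		} else {
			/* slowpath: free to a full or partial slab */
			__slab_free(s, page, x, _RET_IP_);
		}
	}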

It's also different from the cpu slab thrashing issue I identified with 
the TCP_RR benchmark and had a patchset to somewhat improve.  The 
criticism was the addition of an increment to a fastpath counter in struct 
kmem_cache_cpu which could probably now be much cheaper with these 
optimizations.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-14 15:56                               ` Christoph Lameter
  2009-10-14 16:14                                 ` Pekka Enberg
@ 2009-10-16 10:50                                 ` Mel Gorman
  2009-10-16 18:40                                   ` David Rientjes
  1 sibling, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2009-10-16 10:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, David Rientjes, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Wed, Oct 14, 2009 at 11:56:29AM -0400, Christoph Lameter wrote:
> On Wed, 14 Oct 2009, Pekka Enberg wrote:
> 
> > SLAB is able to queue lots of large objects but SLUB can't do that because it
> > has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
> > "queue" but the size of the queue gets smaller as object size approaches page
> > size.
> >
> > We try to offset that with higher order allocations but IIRC we don't increase
> > the order linearly with object size and cap it to some reasonable maximum.
> 
> You can test to see if larger pages have an influence by passing
> 
> slub_max_order=6
> 
> or so on the kernel command line.
> 
> You can force a large page use in slub by setting
> 
> slub_min_order=3
> 
> f.e.
> 
> Or you can force a minimum number of objects in slub through f.e.
> 
> slub_min_objects=50
> 
> 
> 
> slub_max_order=6 slub_min_objects=50
> 
> should result in pretty large slabs with lots of in page objects that
> allow slub to queue better.
> 

Here are the results of that suggestion. They are side-by-side with the
other results so the columns are

SLUB-vanilla		No other patches applied, SLUB configured
vanilla-highorder	No other patches + slub_max_order=6 slub_min_objects=50
SLUB-this-cpu		The patches in this set applied
this-cpu-highorder	These patches + slub_max_order=6 slub_min_objects=50
SLAB-vanilla		No other patches, SLAB configured
SLAB-this-cpu		These patches, SLAB configured

                  SLUB-vanilla   vanilla-highorder     SLUB-this-cpu  this-cpu-highorder    SLAB-vanilla     SLAB-this-cpu
Elapsed min       92.95 ( 0.00%)    92.64 ( 0.33%)    92.62 ( 0.36%)    92.77 ( 0.19%)    92.93 ( 0.02%)    92.62 ( 0.36%)
Elapsed mean      93.11 ( 0.00%)    92.89 ( 0.24%)    92.74 ( 0.40%)    92.82 ( 0.31%)    93.00 ( 0.13%)    92.82 ( 0.32%)
Elapsed stddev     0.10 ( 0.00%)     0.15 (-58.74%)     0.14 (-40.55%)     0.09 ( 7.73%)     0.04 (55.47%)     0.18 (-84.33%)
Elapsed max       93.20 ( 0.00%)    93.04 ( 0.17%)    92.95 ( 0.27%)    92.98 ( 0.24%)    93.05 ( 0.16%)    93.09 ( 0.12%)
User    min      323.21 ( 0.00%)   323.38 (-0.05%)   322.60 ( 0.19%)   323.26 (-0.02%)   322.50 ( 0.22%)   323.26 (-0.02%)
User    mean     323.81 ( 0.00%)   323.64 ( 0.05%)   323.20 ( 0.19%)   323.56 ( 0.08%)   323.16 ( 0.20%)   323.54 ( 0.08%)
User    stddev     0.40 ( 0.00%)     0.38 ( 4.24%)     0.46 (-15.30%)     0.27 (33.20%)     0.48 (-20.92%)     0.29 (26.07%)
User    max      324.32 ( 0.00%)   324.30 ( 0.01%)   323.72 ( 0.19%)   323.96 ( 0.11%)   323.86 ( 0.14%)   323.98 ( 0.10%)
System  min       35.95 ( 0.00%)    35.33 ( 1.72%)    35.50 ( 1.25%)    35.95 ( 0.00%)    35.35 ( 1.67%)    36.01 (-0.17%)
System  mean      36.30 ( 0.00%)    35.99 ( 0.87%)    35.96 ( 0.96%)    36.20 ( 0.28%)    36.17 ( 0.36%)    36.23 ( 0.21%)
System  stddev     0.25 ( 0.00%)     0.41 (-59.25%)     0.45 (-75.60%)     0.15 (41.61%)     0.56 (-121.14%)     0.14 (46.14%)
System  max       36.65 ( 0.00%)    36.44 ( 0.57%)    36.67 (-0.05%)    36.32 ( 0.90%)    36.94 (-0.79%)    36.39 ( 0.71%)
CPU     min      386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)   386.00 ( 0.00%)
CPU     mean     386.25 ( 0.00%)   386.75 (-0.13%)   386.75 (-0.13%)   386.75 (-0.13%)   386.00 ( 0.06%)   387.25 (-0.26%)
CPU     stddev     0.43 ( 0.00%)     0.83 (-91.49%)     0.83 (-91.49%)     0.43 ( 0.00%)     0.00 (100.00%)     0.83 (-91.49%)
CPU     max      387.00 ( 0.00%)   388.00 (-0.26%)   388.00 (-0.26%)   387.00 ( 0.00%)   386.00 ( 0.26%)   388.00 (-0.26%)

The high-order allocations help here, but not by a massive amount. In some
cases they made things slightly worse. However, the standard deviations are
generally high enough to file most of the results under "noise".

NETPERF UDP
              SLUB-vanilla   vanilla-highorder     SLUB-this-cpu  this-cpu-highorder    SLAB-vanilla     SLAB-this-cpu
      64   148.48 ( 0.00%)    146.28 (-1.50%)    152.03 ( 2.34%)    152.20 ( 2.44%)    147.45 (-0.70%)    150.07 ( 1.06%) 
     128   294.65 ( 0.00%)    286.80 (-2.74%)    299.92 ( 1.76%)    302.55 ( 2.61%)    289.20 (-1.88%)    290.15 (-1.55%) 
     256   583.63 ( 0.00%)    564.84 (-3.33%)    609.14 ( 4.19%)    587.53 ( 0.66%)    590.78 ( 1.21%)    586.42 ( 0.48%) 
    1024  2217.90 ( 0.00%)   2176.12 (-1.92%)   2261.99 ( 1.95%)   2312.12 ( 4.08%)   2219.64 ( 0.08%)   2207.93 (-0.45%) 
    2048  4164.27 ( 0.00%)   4154.96 (-0.22%)   4161.47 (-0.07%)   4244.60 ( 1.89%)   4216.46 ( 1.24%)   4155.11 (-0.22%) 
    3312  6284.17 ( 0.00%)   6121.32 (-2.66%)   6383.24 ( 1.55%)   6356.61 ( 1.14%)   6231.88 (-0.84%)   6243.82 (-0.65%) 
    4096  7399.42 ( 0.00%)   7327.40 (-0.98%)*  7686.38 ( 3.73%)   7633.64 ( 3.07%)   7394.89 (-0.06%)   7487.91 ( 1.18%) 
             1.00%             1.07%             1.00%             1.00%             1.00%             1.00%        
    6144 10014.35 ( 0.00%)  10061.59 ( 0.47%)  10199.48 ( 1.82%)  10223.16 ( 2.04%)   9927.92 (-0.87%)* 10067.40 ( 0.53%) 
             1.00%             1.00%             1.00%             1.00%             1.08%             1.00%        
    8192 11232.50 ( 0.00%)* 11222.92 (-0.09%)* 11368.13 ( 1.19%)* 11403.82 ( 1.50%)* 12280.88 ( 8.54%)* 12244.23 ( 8.26%) 
             1.65%             1.37%             1.64%             1.16%             1.32%             1.00%        
   10240 12961.87 ( 0.00%)  12746.40 (-1.69%)* 13099.82 ( 1.05%)* 12767.02 (-1.53%)* 13816.33 ( 6.18%)* 13927.18 ( 6.93%) 
             1.00%             2.34%             1.03%             1.26%             1.21%             1.00%        
   12288 14403.74 ( 0.00%)* 14136.36 (-1.89%)* 14276.89 (-0.89%)* 14246.18 (-1.11%)* 15173.09 ( 5.07%)* 15464.05 ( 6.86%)*
             1.31%             1.60%             1.63%             1.60%             1.93%             1.55%        
   14336 15229.98 ( 0.00%)* 14962.61 (-1.79%)* 15218.52 (-0.08%)* 15243.51 ( 0.09%)  16412.94 ( 7.21%)  16252.98 ( 6.29%) 
             1.37%             1.66%             2.76%             1.00%             1.00%             1.00%        
   16384 15367.60 ( 0.00%)* 15543.13 ( 1.13%)* 16038.71 ( 4.18%)  15870.54 ( 3.17%)* 16635.91 ( 7.62%)  17128.87 (10.28%)*
             1.29%             1.34%             1.00%             2.18%             1.00%             6.36%        

Configuring use of high-order pages actually hurt SLUB mostly on the unpatched
kernel. The results are mixed with the patches applied. Hard to draw anything
very conclusive to be honest. Based on these results, I wouldn't push the
high-order allocations aggressively.

NETPERF TCP
              SLUB-vanilla   vanilla-highorder     SLUB-this-cpu  this-cpu-highorder    SLAB-vanilla     SLAB-this-cpu
      64  1773.00 ( 0.00%)   1812.07 ( 2.16%)*  1731.63 (-2.39%)*  1717.99 (-3.20%)*  1794.48 ( 1.20%)   2029.46 (12.64%) 
             1.00%             5.88%             2.43%             2.83%             1.00%             1.00%        
     128  3181.12 ( 0.00%)   3193.06 ( 0.37%)*  3471.22 ( 8.36%)   3154.79 (-0.83%)   3296.37 ( 3.50%)   3251.33 ( 2.16%) 
             1.00%             1.70%             1.00%             1.00%             1.00%             1.00%        
     256  4794.35 ( 0.00%)   4813.37 ( 0.40%)   4797.38 ( 0.06%)   4819.16 ( 0.51%)   4912.99 ( 2.41%)   4846.86 ( 1.08%) 
    1024  9438.10 ( 0.00%)   8144.02 (-15.89%)   8681.05 (-8.72%)*  8204.11 (-15.04%)   8270.58 (-14.12%)   8268.85 (-14.14%) 
             1.00%             1.00%             7.31%             1.00%             1.00%             1.00%        
    2048  9196.06 ( 0.00%)  11233.72 (18.14%)   9375.72 ( 1.92%)  10487.89 (12.32%)* 11474.59 (19.86%)   9420.01 ( 2.38%) 
             1.00%             1.00%             1.00%             9.43%             1.00%             1.00%        
    3312 10338.49 ( 0.00%)*  9730.79 (-6.25%)* 10021.82 (-3.16%)* 10089.90 (-2.46%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
             9.49%             2.51%             6.36%             5.96%             1.21%             2.12%        
    4096  9931.20 ( 0.00%)* 12447.88 (20.22%)  10285.38 ( 3.44%)* 10548.56 ( 5.85%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
             1.31%             1.00%             1.38%             8.22%             9.97%             8.33%        
    6144 12775.08 ( 0.00%)* 10489.24 (-21.79%)* 10559.63 (-20.98%)  11033.15 (-15.79%)* 13139.34 ( 2.77%)  13210.79 ( 3.30%)*
             1.45%             8.46%             1.00%            12.65%             1.00%             2.99%        
    8192 10933.93 ( 0.00%)* 10340.42 (-5.74%)* 10534.41 (-3.79%)* 10845.36 (-0.82%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
            14.29%             2.38%             2.10%             1.83%            12.50%             9.55%        
   10240 12868.58 ( 0.00%)  11211.60 (-14.78%)* 12991.65 ( 0.95%)  11330.97 (-13.57%)* 10892.20 (-18.14%)  13106.01 ( 1.81%) 
             1.00%            11.36%             1.00%             6.64%             1.00%             1.00%        
   12288 11854.97 ( 0.00%)  11854.51 (-0.00%)  12122.34 ( 2.21%)* 12258.61 ( 3.29%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
             1.00%             1.00%             6.61%             5.69%             5.78%             8.95%        
   14336 12552.48 ( 0.00%)* 12309.15 (-1.98%)  12501.71 (-0.41%)* 13683.57 ( 8.27%)* 12274.54 (-2.26%)  12322.63 (-1.87%)*
             6.05%             1.00%             2.58%             2.46%             1.00%             2.23%        
   16384 11733.09 ( 0.00%)* 11856.66 ( 1.04%)* 12735.05 ( 7.87%)* 13482.61 (12.98%)* 13195.68 (11.08%)* 14401.62 (18.53%) 
             1.14%             1.05%             9.79%            11.52%            10.30%             1.00%        

Configuring high-order helped in a few cases here and in one or two
cases closed the gap with SLAB, particularly for large packet sizes.
However, it still suffered for the small packet sizes.

SYSBENCH
                  SLUB-vanilla  vanilla-highorder   SLUB-this-cpu  this-cpu-highorder   SLAB-vanilla     SLAB-this-cpu
           1 26950.79 ( 0.00%) 26723.98 (-0.85%) 26822.05 (-0.48%) 26877.71 (-0.27%) 26919.89 (-0.11%) 26746.18 (-0.77%)
           2 51555.51 ( 0.00%) 51231.41 (-0.63%) 51928.02 ( 0.72%) 51794.47 ( 0.46%) 51370.02 (-0.36%) 51129.82 (-0.83%)
           3 76204.23 ( 0.00%) 76060.77 (-0.19%) 76333.58 ( 0.17%) 76270.53 ( 0.09%) 76483.99 ( 0.37%) 75954.52 (-0.33%)
           4 100599.12 ( 0.00%) 100825.16 ( 0.22%) 101757.98 ( 1.14%) 100273.02 (-0.33%) 100499.65 (-0.10%) 101605.61 ( 0.99%)
           5 100211.45 ( 0.00%) 100096.77 (-0.11%) 100435.33 ( 0.22%) 101132.16 ( 0.91%) 100150.98 (-0.06%) 99398.11 (-0.82%)
           6 99390.81 ( 0.00%) 99305.36 (-0.09%) 99840.85 ( 0.45%) 99200.53 (-0.19%) 99234.38 (-0.16%) 99244.42 (-0.15%)
           7 98740.56 ( 0.00%) 98625.23 (-0.12%) 98727.61 (-0.01%) 98470.75 (-0.27%) 98305.88 (-0.44%) 98123.56 (-0.63%)
           8 98075.89 ( 0.00%) 97609.30 (-0.48%) 98048.62 (-0.03%) 97092.44 (-1.01%) 98183.99 ( 0.11%) 97587.82 (-0.50%)
           9 96502.22 ( 0.00%) 96685.39 ( 0.19%) 97276.80 ( 0.80%) 96800.23 ( 0.31%) 96819.88 ( 0.33%) 97320.51 ( 0.84%)
          10 96598.70 ( 0.00%) 96272.05 (-0.34%) 96545.37 (-0.06%) 95936.97 (-0.69%) 96222.51 (-0.39%) 96221.69 (-0.39%)
          11 95500.66 ( 0.00%) 95141.00 (-0.38%) 95671.11 ( 0.18%) 96057.84 ( 0.58%) 95003.21 (-0.52%) 95246.81 (-0.27%)
          12 94572.87 ( 0.00%) 94811.46 ( 0.25%) 95266.70 ( 0.73%) 93767.06 (-0.86%) 93807.60 (-0.82%) 94859.82 ( 0.30%)
          13 93811.85 ( 0.00%) 93597.39 (-0.23%) 94309.18 ( 0.53%) 93323.96 (-0.52%) 93219.81 (-0.64%) 93051.63 (-0.82%)
          14 92972.16 ( 0.00%) 92936.53 (-0.04%) 93849.87 ( 0.94%) 92545.83 (-0.46%) 92641.50 (-0.36%) 92916.70 (-0.06%)
          15 92276.06 ( 0.00%) 91559.63 (-0.78%) 92454.94 ( 0.19%) 91748.29 (-0.58%) 91094.04 (-1.30%) 91972.79 (-0.33%)
          16 90265.35 ( 0.00%) 89707.32 (-0.62%) 90416.26 ( 0.17%) 89253.93 (-1.13%) 89309.26 (-1.07%) 90103.89 (-0.18%)

High-order didn't really help here either.

Overall, it would appear that high-order allocations occasionally help
but the margins are pretty small.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-15  7:47                   ` Tejun Heo
@ 2009-10-16 16:44                     ` Christoph Lameter
  2009-10-18  3:11                       ` Tejun Heo
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-16 16:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

On Thu, 15 Oct 2009, Tejun Heo wrote:

> > Are you defining what __ means for this_cpu_ptr?
>
> I was basically stating the difference between raw_smp_processor_id()
> and smp_processor_id() which I thought applied the same to
> __this_cpu_ptr() and this_cpu_ptr().

It does apply. __this_cpu_ptr does not use smp_processor_id() but
raw_smp_processor_id(). this_cpu_ptr does not need to disable preempt, so
we don't do anything on that level.

> > The vm event counters require both no check and no preempt since they can
> > be implemented in a racy way.
>
> The biggest grief I have is that the meaning of __ is different among
> different accessors.  If that can be cleared up, we would be in much
> better shape without adding any extra macros.  Can we just remove all
> __'s and use meaningful pre or suffixes like raw or irq or whatever?

It currently means that we do not deal with preempt and do not check for
preemption. That is consistent.

Sure we could change the API to have even more macros than the large
amount it already has so that we can check for proper preempt disablement.

I guess that would mean adding

raw_nopreempt_this_cpu_xx and nopreempt_this_cpu_xx variants? The thing
gets huge. I think we could just leave it as is. __ suggests that
serialization and checking are not performed as in the full versions, and
that is true.
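
To make the distinction concrete, a hedged example of how a caller sees the
two flavors (illustrative only; some_counter stands in for a per cpu
variable declared with DEFINE_PER_CPU):

	/* Full version: per cpu atomic on its own, may be called with
	 * preemption enabled. */
	this_cpu_inc(some_counter);

	/* __ version: no serialization and no check. The caller must already
	 * ensure it cannot migrate to another cpu, or must tolerate the
	 * occasional race, as the vm event counters do. */
	__this_cpu_inc(some_counter);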

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-15  9:03                         ` David Rientjes
@ 2009-10-16 16:45                           ` Christoph Lameter
  2009-10-16 18:43                             ` David Rientjes
  0 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2009-10-16 16:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Thu, 15 Oct 2009, David Rientjes wrote:

> TCP_STREAM stresses a few specific caches:
>
> 		ALLOC_FASTPATH	ALLOC_SLOWPATH	FREE_FASTPATH	FREE_SLOWPATH
> kmalloc-256	3868530		3450592		95628		7223491
> kmalloc-1024	2440434		429		2430825		10034
> kmalloc-4096	3860625		1036723		85571		4811779
>
> This demonstrates that freeing to full (or partial) slabs causes a lot of
> pain since the fastpath normally can't be utilized and that's probably
> beyond the scope of this patchset.
>
> It's also different from the cpu slab thrashing issue I identified with
> the TCP_RR benchmark and had a patchset to somewhat improve.  The
> criticism was the addition of an increment to a fastpath counter in struct
> kmem_cache_cpu which could probably now be much cheaper with these
> optimizations.
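
For reference, a hedged sketch of what such a counter bump can look like
once the this_cpu ops are in place (in the spirit of patch 6/7; illustrative
only, not the actual patch):

	static inline void stat(struct kmem_cache *s, enum stat_item si)
	{
	#ifdef CONFIG_SLUB_STATS
		/* A single per cpu atomic increment, no preempt or irq games. */
		__this_cpu_inc(s->cpu_slab->stat[si]);
	#endif
	}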

Can you redo the patch?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-16 10:50                                 ` Mel Gorman
@ 2009-10-16 18:40                                   ` David Rientjes
  0 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2009-10-16 18:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Fri, 16 Oct 2009, Mel Gorman wrote:

> NETPERF TCP
>               SLUB-vanilla   vanilla-highorder     SLUB-this-cpu  this-cpu-highorder    SLAB-vanilla     SLAB-this-cpu
>       64  1773.00 ( 0.00%)   1812.07 ( 2.16%)*  1731.63 (-2.39%)*  1717.99 (-3.20%)*  1794.48 ( 1.20%)   2029.46 (12.64%) 
>              1.00%             5.88%             2.43%             2.83%             1.00%             1.00%        
>      128  3181.12 ( 0.00%)   3193.06 ( 0.37%)*  3471.22 ( 8.36%)   3154.79 (-0.83%)   3296.37 ( 3.50%)   3251.33 ( 2.16%) 
>              1.00%             1.70%             1.00%             1.00%             1.00%             1.00%        
>      256  4794.35 ( 0.00%)   4813.37 ( 0.40%)   4797.38 ( 0.06%)   4819.16 ( 0.51%)   4912.99 ( 2.41%)   4846.86 ( 1.08%) 
>     1024  9438.10 ( 0.00%)   8144.02 (-15.89%)   8681.05 (-8.72%)*  8204.11 (-15.04%)   8270.58 (-14.12%)   8268.85 (-14.14%) 
>              1.00%             1.00%             7.31%             1.00%             1.00%             1.00%        
>     2048  9196.06 ( 0.00%)  11233.72 (18.14%)   9375.72 ( 1.92%)  10487.89 (12.32%)* 11474.59 (19.86%)   9420.01 ( 2.38%) 
>              1.00%             1.00%             1.00%             9.43%             1.00%             1.00%        
>     3312 10338.49 ( 0.00%)*  9730.79 (-6.25%)* 10021.82 (-3.16%)* 10089.90 (-2.46%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
>              9.49%             2.51%             6.36%             5.96%             1.21%             2.12%        
>     4096  9931.20 ( 0.00%)* 12447.88 (20.22%)  10285.38 ( 3.44%)* 10548.56 ( 5.85%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
>              1.31%             1.00%             1.38%             8.22%             9.97%             8.33%        
>     6144 12775.08 ( 0.00%)* 10489.24 (-21.79%)* 10559.63 (-20.98%)  11033.15 (-15.79%)* 13139.34 ( 2.77%)  13210.79 ( 3.30%)*
>              1.45%             8.46%             1.00%            12.65%             1.00%             2.99%        
>     8192 10933.93 ( 0.00%)* 10340.42 (-5.74%)* 10534.41 (-3.79%)* 10845.36 (-0.82%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
>             14.29%             2.38%             2.10%             1.83%            12.50%             9.55%        
>    10240 12868.58 ( 0.00%)  11211.60 (-14.78%)* 12991.65 ( 0.95%)  11330.97 (-13.57%)* 10892.20 (-18.14%)  13106.01 ( 1.81%) 
>              1.00%            11.36%             1.00%             6.64%             1.00%             1.00%        
>    12288 11854.97 ( 0.00%)  11854.51 (-0.00%)  12122.34 ( 2.21%)* 12258.61 ( 3.29%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
>              1.00%             1.00%             6.61%             5.69%             5.78%             8.95%        
>    14336 12552.48 ( 0.00%)* 12309.15 (-1.98%)  12501.71 (-0.41%)* 13683.57 ( 8.27%)* 12274.54 (-2.26%)  12322.63 (-1.87%)*
>              6.05%             1.00%             2.58%             2.46%             1.00%             2.23%        
>    16384 11733.09 ( 0.00%)* 11856.66 ( 1.04%)* 12735.05 ( 7.87%)* 13482.61 (12.98%)* 13195.68 (11.08%)* 14401.62 (18.53%) 
>              1.14%             1.05%             9.79%            11.52%            10.30%             1.00%        
> 
> Configuring high-order helped in a few cases here and in one or two
> cases closed the gap with SLAB, particularly for large packet sizes.
> However, it still suffered for the small packet sizes.
> 

This is understandable considering the statistics that I posted for this 
workload on my machine: higher order cpu slabs will naturally be freed to 
more often from the fastpath, which also causes the allocation fastpath to 
be utilized more often (and so we can see the optimization of this 
patchset), in addition to avoiding partial list handling.
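
Rough arithmetic, assuming PAGE_SIZE is 4096 and ignoring per-slab metadata
overhead, of why the higher order helps the fastpaths here (illustrative
only):

	/* Objects that fit into one cpu slab of the given order. */
	static unsigned int objects_per_slab(unsigned int order, unsigned int size)
	{
		return (PAGE_SIZE << order) / size;
	}

	/* kmalloc-256, order 0:  4096 / 256 =  16 objects per cpu slab
	 * kmalloc-256, order 3: 32768 / 256 = 128 objects per cpu slab
	 * More objects in the active cpu slab means more frees land in the
	 * fastpath instead of falling back to full/partial slab handling. */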

The pain with the smaller packet sizes is probably the overhead from the 
page allocator more than slub, a characteristic that also caused the 
TCP_RR benchmark to suffer.  It can be mitigated somewhat with slab 
preallocation or a higher min_partial setting, but that's probably not an 
optimal solution.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-16 16:45                           ` Christoph Lameter
@ 2009-10-16 18:43                             ` David Rientjes
  2009-10-16 18:50                               ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: David Rientjes @ 2009-10-16 18:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Fri, 16 Oct 2009, Christoph Lameter wrote:

> > TCP_STREAM stresses a few specific caches:
> >
> > 		ALLOC_FASTPATH	ALLOC_SLOWPATH	FREE_FASTPATH	FREE_SLOWPATH
> > kmalloc-256	3868530		3450592		95628		7223491
> > kmalloc-1024	2440434		429		2430825		10034
> > kmalloc-4096	3860625		1036723		85571		4811779
> >
> > This demonstrates that freeing to full (or partial) slabs causes a lot of
> > pain since the fastpath normally can't be utilized and that's probably
> > beyond the scope of this patchset.
> >
> > It's also different from the cpu slab thrashing issue I identified with
> > the TCP_RR benchmark and had a patchset to somewhat improve.  The
> > criticism was the addition of an increment to a fastpath counter in struct
> > kmem_cache_cpu which could probably now be much cheaper with these
> > optimizations.
> 
> Can you redo the patch?
> 

Sure, but it would be even cheaper if we could figure out why the 
irqless patch is hanging my netserver machine within the first 60 seconds 
of the TCP_RR benchmark.  I guess nobody else has reproduced that yet.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths
  2009-10-16 18:43                             ` David Rientjes
@ 2009-10-16 18:50                               ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2009-10-16 18:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Pekka Enberg, Tejun Heo, linux-kernel,
	Mathieu Desnoyers, Zhang Yanmin

On Fri, 16 Oct 2009, David Rientjes wrote:

> Sure, but it would be even more inexpensive if we can figure out why the
> irqless patch is hanging my netserver machine within the first 60 seconds
> on the TCP_RR benchmark.  I guess nobody else has reproduced that yet.

Nope. Sorry. I have tried running some tests but so far nothing.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [this_cpu_xx V6 3/7] Use this_cpu operations in slub
  2009-10-16 16:44                     ` Christoph Lameter
@ 2009-10-18  3:11                       ` Tejun Heo
  0 siblings, 0 replies; 56+ messages in thread
From: Tejun Heo @ 2009-10-18  3:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

Hello, Christoph.

Christoph Lameter wrote:
>> The biggest grief I have is that the meaning of __ is different among
>> different accessors.  If that can be cleared up, we would be in much
>> better shape without adding any extra macros.  Can we just remove all
>> __'s and use meaningful pre or suffixes like raw or irq or whatever?
> 
> It currently means that we do not deal with preempt and do not check for
> preemption. That is consistent.

If you define it inclusively, it can be consistent.

> Sure we could change the API to have even more macros than the large
> amount it already has so that we can check for proper preempt disablement.
> 
> I guess that would mean adding
> 
> raw_nopreempt_this_cpu_xx  and nopreempt_this_cpu_xx variants? The thing
> gets huge. I think we could just leave it. __ suggests that serialization
> and checking is not performed like in the full versions and that is true.

I don't think we'll need to add new variants.  Just renaming the existing
ones so that they have more specific prefixes or suffixes should make
things clearer.  I'll give that a shot once the sparse annotation patchset
is merged.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2009-10-18  3:10 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-07 21:10 [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic cl
2009-10-07 21:10 ` [this_cpu_xx V6 1/7] this_cpu_ops: page allocator conversion cl
2009-10-08 10:38   ` Tejun Heo
2009-10-08 10:40     ` Tejun Heo
2009-10-08 16:15     ` Christoph Lameter
2009-10-08 10:53   ` Mel Gorman
2009-10-07 21:10 ` [this_cpu_xx V6 2/7] this_cpu ops: Remove pageset_notifier cl
2009-10-07 21:10 ` [this_cpu_xx V6 3/7] Use this_cpu operations in slub cl
2009-10-12 10:19   ` Tejun Heo
2009-10-12 10:21     ` Tejun Heo
2009-10-12 14:54     ` Christoph Lameter
2009-10-13  2:13       ` Tejun Heo
2009-10-13 14:41         ` Christoph Lameter
2009-10-13 14:56           ` Tejun Heo
2009-10-13 15:20             ` Christoph Lameter
2009-10-14  1:57               ` Tejun Heo
2009-10-14 14:14                 ` Christoph Lameter
2009-10-15  7:47                   ` Tejun Heo
2009-10-16 16:44                     ` Christoph Lameter
2009-10-18  3:11                       ` Tejun Heo
2009-10-07 21:10 ` [this_cpu_xx V6 4/7] SLUB: Get rid of dynamic DMA kmalloc cache allocation cl
2009-10-13 18:48   ` [FIX] patch "SLUB: Get rid of dynamic DMA kmalloc cache allocation" Christoph Lameter
2009-10-07 21:10 ` [this_cpu_xx V6 5/7] this_cpu: Remove slub kmem_cache fields cl
2009-10-07 23:10   ` Christoph Lameter
2009-10-07 21:10 ` [this_cpu_xx V6 6/7] Make slub statistics use this_cpu_inc cl
2009-10-07 21:10 ` [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths cl
2009-10-12 10:40   ` Tejun Heo
2009-10-12 13:14     ` Pekka Enberg
2009-10-12 14:55       ` Christoph Lameter
2009-10-13  9:45       ` David Rientjes
2009-10-13 14:43         ` Christoph Lameter
2009-10-13 19:14           ` Christoph Lameter
2009-10-13 19:44             ` Pekka Enberg
2009-10-13 19:48               ` Christoph Lameter
2009-10-13 20:15                 ` David Rientjes
2009-10-13 20:28                   ` Christoph Lameter
2009-10-13 22:53                     ` David Rientjes
2009-10-14 13:34                       ` Mel Gorman
2009-10-14 14:08                         ` Christoph Lameter
2009-10-14 15:49                           ` Mel Gorman
2009-10-14 15:53                             ` Pekka Enberg
2009-10-14 15:56                               ` Christoph Lameter
2009-10-14 16:14                                 ` Pekka Enberg
2009-10-14 18:19                                   ` Christoph Lameter
2009-10-16 10:50                                 ` Mel Gorman
2009-10-16 18:40                                   ` David Rientjes
2009-10-15  9:03                         ` David Rientjes
2009-10-16 16:45                           ` Christoph Lameter
2009-10-16 18:43                             ` David Rientjes
2009-10-16 18:50                               ` Christoph Lameter
2009-10-13 20:25               ` Christoph Lameter
2009-10-14  1:33           ` David Rientjes
2009-10-13 15:40 ` [this_cpu_xx V6 0/7] Introduce per cpu atomic operations and avoid per cpu address arithmetic Mel Gorman
2009-10-13 15:45   ` Christoph Lameter
2009-10-13 16:09     ` Mel Gorman
2009-10-13 17:17       ` Christoph Lameter
