* [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops
@ 2009-12-18 22:26 Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 01/16] this_cpu_ops: page allocator conversion Christoph Lameter
                   ` (15 more replies)
  0 siblings, 16 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Leftovers from the earlier patchset rediffed to 2.6.33-rc1.
Mostly applications of per cpu counters to core components.

After this patchset there will be only one user of local_t left: Mathieu's
trace ringbuffer.

I added some patches that define additional this_cpu ops in order to help
Mathieu make the trace ringbuffer use this_cpu ops. These have barely been
tested (the kernel boots fine on 32 and 64 bit, but nothing uses these
operations yet). They are in RFC state.
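
For readers who have not used these operations yet, here is a minimal sketch
(not part of the patchset; all demo_* names are made up) contrasting the
classic get_cpu()/per_cpu_ptr() pattern with a this_cpu operation on
dynamically allocated per cpu data:

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/errno.h>

struct demo_stats {			/* hypothetical per cpu payload */
	unsigned long events;
};

static struct demo_stats *demo_stats;	/* set up via alloc_percpu() */

static int demo_init(void)
{
	demo_stats = alloc_percpu(struct demo_stats);
	return demo_stats ? 0 : -ENOMEM;
}

static void demo_event_classic(void)
{
	/* pin the cpu, compute this cpu's instance, increment, unpin */
	struct demo_stats *s = per_cpu_ptr(demo_stats, get_cpu());

	s->events++;
	put_cpu();
}

static void demo_event_this_cpu(void)
{
	/* increment this cpu's instance without explicit get_cpu()/put_cpu() */
	this_cpu_inc(demo_stats->events);
}

On x86 the this_cpu variant compiles down to a single segment prefixed
increment instruction.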

V7->V8
- Fix issue in slub patch
- Fix issue in modules patch
- Rediff page allocator patch
- Provide new this_cpu ops needed for ringbuffer [RFC state]

V6->V7
- Drop patches merged in 2.6.33 merge cycle
- Drop risky slub patches

V5->V6:
- Drop patches merged by Tejun.
- Drop irqless slub fastpath for now.
- Patches against Tejun percpu for-next branch.

V4->V5:
- Avoid setup_per_cpu_area() modifications and fold the remainder of the
  patch into the page allocator patch.
- Irq disable / per cpu ptr fixes for page allocator patch.

V3->V4:
- Fix various macro definitions.
- Provide experimental percpu based fastpath that does not disable
  interrupts for SLUB.

V2->V3:
- Available via git tree against latest upstream from
	 git://git.kernel.org/pub/scm/linux/kernel/git/christoph/percpu.git linus
- Rework SLUB per cpu operations. Get rid of dynamic DMA slab creation
  for CONFIG_ZONE_DMA
- Create fallback framework so that 64 bit ops on 32 bit platforms
  can fallback to the use of preempt or interrupt disable. 64 bit
  platforms can use 64 bit atomic per cpu ops.

V1->V2:
- Various minor fixes
- Add SLUB conversion
- Add Page allocator conversion
- Patch against the git tree of today




* [this_cpu_xx V8 01/16] this_cpu_ops: page allocator conversion
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 02/16] this_cpu ops: Remove pageset_notifier Christoph Lameter
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_page_allocator --]
[-- Type: text/plain, Size: 14851 bytes --]

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
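
As a condensed illustration of the new scheme (not part of the patch; this
assumes the struct zone and per_cpu_pageset layout introduced below and the
usual mm includes):

static void zone_pageset_sketch(struct zone *zone)
{
	struct per_cpu_pages *pcp;
	unsigned long flags;
	int cpu;

	/* one percpu allocation replaces the per zone NR_CPUS array */
	zone->pageset = alloc_percpu(struct per_cpu_pageset);

	/* remote access, e.g. when draining pages or printing statistics */
	for_each_possible_cpu(cpu)
		per_cpu_ptr(zone->pageset, cpu)->pcp.count = 0;

	/* hot path: interrupts off first, then take this cpu's pageset */
	local_irq_save(flags);
	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	pcp->count++;		/* placeholder for the real list manipulation */
	local_irq_restore(flags);
}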

V7-V8:
- Explain the chicken-and-egg dilemma with the percpu allocator.

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm.h     |    4 
 include/linux/mmzone.h |   12 --
 mm/page_alloc.c        |  197 ++++++++++++++++++++-----------------------------
 mm/vmstat.c            |   14 +--
 4 files changed, 92 insertions(+), 135 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/include/linux/mm.h	2009-12-18 13:15:32.000000000 -0600
@@ -1076,11 +1076,7 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 extern void zone_pcp_update(struct zone *zone);
 
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/include/linux/mmzone.h	2009-12-18 13:15:32.000000000 -0600
@@ -184,13 +184,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -306,10 +300,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/mm/page_alloc.c	2009-12-18 13:34:02.000000000 -0600
@@ -1007,10 +1007,10 @@ static void drain_pages(unsigned int cpu
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		local_irq_save(flags);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
-		local_irq_save(flags);
 		free_pcppages_bulk(zone, pcp->count, pcp);
 		pcp->count = 0;
 		local_irq_restore(flags);
@@ -1094,7 +1094,6 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	migratetype = get_pageblock_migratetype(page);
 	set_page_private(page, migratetype);
 	local_irq_save(flags);
@@ -1117,6 +1116,7 @@ static void free_hot_cold_page(struct pa
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	else
@@ -1129,7 +1129,6 @@ static void free_hot_cold_page(struct pa
 
 out:
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1179,17 +1178,15 @@ struct page *buffered_rmqueue(struct zon
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 		struct list_head *list;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
-		list = &pcp->lists[migratetype];
 		local_irq_save(flags);
+		pcp = &this_cpu_ptr(zone->pageset)->pcp;
+		list = &pcp->lists[migratetype];
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
@@ -1230,7 +1227,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1239,7 +1235,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -2178,7 +2173,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = per_cpu_ptr(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2740,10 +2735,29 @@ static void build_zonelist_cache(pg_data
 
 #endif	/* CONFIG_NUMA */
 
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
+static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
+
 /* return values int ....just for stop_machine() */
 static int __build_all_zonelists(void *dummy)
 {
 	int nid;
+	int cpu;
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -2754,6 +2768,23 @@ static int __build_all_zonelists(void *d
 		build_zonelists(pgdat);
 		build_zonelist_cache(pgdat);
 	}
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors. The real pagesets for
+	 * each zone will be allocated later when the per cpu
+	 * allocator is available.
+	 *
+	 * boot_pagesets are used also for bootstrapping offline
+	 * cpus if the system is already booted because the pagesets
+	 * are needed to initialize allocators on a specific cpu too.
+	 * F.e. the percpu allocator needs the page allocator which
+	 * needs the percpu allocator in order to allocate its pagesets
+	 * (a chicken-egg dilemma).
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	return 0;
 }
 
@@ -3092,95 +3123,16 @@ static void setup_pagelist_highmark(stru
 }
 
 
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-	int node = cpu_to_node(cpu);
-
-	node_set_state(node, N_CPU);	/* this node has a cpu */
-
-	for_each_populated_zone(zone) {
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			    (zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = &boot_pageset[cpu];
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-	}
-}
-
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
 {
 	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	default:
 		break;
@@ -3191,21 +3143,39 @@ static int __cpuinit pageset_cpuup_callb
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
 
+/*
+ * Allocate per cpu pagesets and initialize them.
+ * Before this call only boot pagesets were available.
+ * Boot pagesets will no longer be used by this processor
+ * after setup_per_cpu_pageset().
+ */
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
+	struct zone *zone;
+	int cpu;
+
+	for_each_populated_zone(zone) {
+		zone->pageset = alloc_percpu(struct per_cpu_pageset);
 
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+		for_each_possible_cpu(cpu) {
+			struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
+
+			setup_pageset(pcp, zone_batchsize(zone));
+
+			if (percpu_pagelist_fraction)
+				setup_pagelist_highmark(pcp,
+					(zone->present_pages /
+						percpu_pagelist_fraction));
+		}
+	}
+	/*
+	 * The boot cpu is always the first active.
+	 * The boot node has a processor.
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
 	register_cpu_notifier(&pageset_notifier);
 }
 
-#endif
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -3259,7 +3229,7 @@ static int __zone_pcp_update(void *data)
 		struct per_cpu_pageset *pset;
 		struct per_cpu_pages *pcp;
 
-		pset = zone_pcp(zone, cpu);
+		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
 		local_irq_save(flags);
@@ -3277,21 +3247,17 @@ void zone_pcp_update(struct zone *zone)
 
 static __meminit void zone_pcp_init(struct zone *zone)
 {
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
+	/*
+	 * per cpu subsystem is not up at this point. The following code
+	 * relies on the ability of the linker to provide the
+	 * offset of a (static) per cpu variable into the per cpu area.
+	 */
+	zone->pageset = &per_cpu_var(boot_pageset);
 
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
 	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+			zone->name, zone->present_pages,
+					 zone_batchsize(zone));
 }
 
 __meminit int init_currently_empty_zone(struct zone *zone,
@@ -4805,10 +4771,11 @@ int percpu_pagelist_fraction_sysctl_hand
 	if (!write || (ret == -EINVAL))
 		return ret;
 	for_each_populated_zone(zone) {
-		for_each_online_cpu(cpu) {
+		for_each_possible_cpu(cpu) {
 			unsigned long  high;
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(
+				per_cpu_ptr(zone->pageset, cpu), high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2009-12-18 13:15:32.000000000 -0600
@@ -139,7 +139,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -149,7 +150,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -202,7 +204,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -223,7 +225,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = this_cpu_ptr(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -300,7 +302,7 @@ void refresh_cpu_vm_stats(int cpu)
 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset *p;
 
-		p = zone_pcp(zone, cpu);
+		p = per_cpu_ptr(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -741,7 +743,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"

-- 


* [this_cpu_xx V8 02/16] this_cpu ops: Remove pageset_notifier
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 01/16] this_cpu_ops: page allocator conversion Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 03/16] Use this_cpu operations in slub Christoph Lameter
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_page_alloc_remove_pageset_notifier --]
[-- Type: text/plain, Size: 2007 bytes --]

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.
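
Sketched from the hunks below, trimmed to the cases this patch touches, the
consolidated callback ends up looking roughly like this:

static int __cpuinit vmstat_cpuup_callback(struct notifier_block *nfb,
					   unsigned long action, void *hcpu)
{
	int cpu = (long)hcpu;

	switch (action) {
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		start_cpu_timer(cpu);
		/* formerly done by the separate pageset notifier */
		node_set_state(cpu_to_node(cpu), N_CPU);
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}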

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/page_alloc.c |   27 ---------------------------
 mm/vmstat.c     |    1 +
 2 files changed, 1 insertion(+), 27 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2009-12-18 13:15:32.000000000 -0600
+++ linux-2.6/mm/vmstat.c	2009-12-18 13:37:41.000000000 -0600
@@ -908,6 +908,7 @@ static int __cpuinit vmstat_cpuup_callba
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		start_cpu_timer(cpu);
+		node_set_state(cpu_to_node(cpu), N_CPU);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-12-18 13:34:02.000000000 -0600
+++ linux-2.6/mm/page_alloc.c	2009-12-18 13:39:19.000000000 -0600
@@ -3122,27 +3122,6 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		node_set_state(cpu_to_node(cpu), N_CPU);
-		break;
-	default:
-		break;
-	}
-	return ret;
-}
-
-static struct notifier_block __cpuinitdata pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
@@ -3168,12 +3147,6 @@ void __init setup_per_cpu_pageset(void)
 						percpu_pagelist_fraction));
 		}
 	}
-	/*
-	 * The boot cpu is always the first active.
-	 * The boot node has a processor.
-	 */
-	node_set_state(cpu_to_node(smp_processor_id()), N_CPU);
-	register_cpu_notifier(&pageset_notifier);
 }
 
 static noinline __init_refok

-- 


* [this_cpu_xx V8 03/16] Use this_cpu operations in slub
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 01/16] this_cpu_ops: page allocator conversion Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 02/16] this_cpu ops: Remove pageset_notifier Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-20  9:11   ` Pekka Enberg
  2009-12-18 22:26 ` [this_cpu_xx V8 04/16] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_conversion --]
[-- Type: text/plain, Size: 12617 bytes --]

Using per cpu allocations removes the need for the per cpu arrays in the
kmem_cache struct. These could get quite big if we have to support systems
with thousands of cpus. The use of this_cpu_xx operations results in:

1. The size of kmem_cache for SMP configurations shrinks since we will only
   need 1 pointer instead of NR_CPUS pointers. The same pointer can be used by
   all processors. This reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
   system, meaning less memory overhead for configurations that may potentially
   support up to 1k NUMA nodes / 4k cpus.

3. We can remove the fiddling with allocating and releasing of
   kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
   alloc logic will do it all for us. Removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases since per cpu pointer lookups and
   address calculations are avoided.
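
The net effect of the points above on the two common access patterns, as a
minimal sketch (the demo_* wrappers are made up; s->cpu_slab is the percpu
pointer introduced below):

static struct kmem_cache_cpu *demo_remote_slab(struct kmem_cache *s, int cpu)
{
	/* remote cpu access, e.g. when flushing slabs or reading statistics */
	return per_cpu_ptr(s->cpu_slab, cpu);
}

static void demo_fastpath(struct kmem_cache *s)
{
	struct kmem_cache_cpu *c;
	unsigned long flags;

	local_irq_save(flags);
	/* replaces s->cpu_slab[smp_processor_id()] (or &s->cpu_slab on UP) */
	c = __this_cpu_ptr(s->cpu_slab);
	/* ... allocate from c->freelist as slab_alloc() does below ... */
	local_irq_restore(flags);
}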

V7-V8
- Convert missed get_cpu_slab() under CONFIG_SLUB_STATS

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  202 +++++++++++------------------------------------
 2 files changed, 49 insertions(+), 159 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-18 13:39:26.000000000 -0600
@@ -69,6 +69,7 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -104,11 +105,6 @@ struct kmem_cache {
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-18 13:39:41.000000000 -0600
@@ -242,15 +242,6 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1124,7 +1115,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1422,7 +1413,7 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
@@ -1454,7 +1445,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1507,7 +1498,7 @@ static inline void flush_slab(struct kme
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1673,7 +1664,7 @@ new_slab:
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __this_cpu_ptr(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1711,7 +1702,7 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned int objsize;
+	unsigned long objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1722,14 +1713,14 @@ static __always_inline void *slab_alloc(
 		return NULL;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
+	object = c->freelist;
 	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
@@ -1800,7 +1791,7 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1872,7 +1863,7 @@ static __always_inline void slab_free(st
 
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __this_cpu_ptr(s->cpu_slab);
 	kmemcheck_slab_free(s, object, c->objsize);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
@@ -2095,130 +2086,28 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu [NR_KMEM_CACHE_CPU],
-		      kmem_cache_cpu);
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static DECLARE_BITMAP(kmem_cach_cpu_free_init_once, CONFIG_NR_CPUS);
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c >= per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
-static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[SLUB_PAGE_SHIFT]);
 
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpumask_test_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once)))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
-	cpumask_set_cpu(cpu, to_cpumask(kmem_cach_cpu_free_init_once));
-}
-
-static void __init init_alloc_cpu(void)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
+	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+		/*
+		 * Boot time creation of the kmalloc array. Use static per cpu data
+		 * since the per cpu allocator is not available yet.
+		 */
+		s->cpu_slab = per_cpu_var(kmalloc_percpu) + (s - kmalloc_caches);
+	else
+		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
 
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
+	if (!s->cpu_slab)
+		return 0;
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+	for_each_possible_cpu(cpu)
+		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2609,9 +2498,8 @@ static inline int kmem_cache_close(struc
 	int node;
 
 	flush_all(s);
-
+	free_percpu(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2760,7 +2648,19 @@ static noinline struct kmem_cache *dma_k
 	realsize = kmalloc_caches[index].objsize;
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
-	s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+
+	if (flags & __GFP_WAIT)
+		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
+	else {
+		int i;
+
+		s = NULL;
+		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
+			if (kmalloc_caches[i].size) {
+				s = kmalloc_caches + i;
+				break;
+			}
+	}
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -3176,8 +3076,6 @@ void __init kmem_cache_init(void)
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3261,8 +3159,10 @@ void __init kmem_cache_init(void)
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3365,7 +3265,7 @@ struct kmem_cache *kmem_cache_create(con
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3422,11 +3322,9 @@ static int __cpuinit slab_cpuup_callback
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3436,13 +3334,9 @@ static int __cpuinit slab_cpuup_callback
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3928,7 +3822,7 @@ static ssize_t show_slab_objects(struct 
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4353,7 +4247,7 @@ static int show_stat(struct kmem_cache *
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;
@@ -4376,7 +4270,7 @@ static void clear_stat(struct kmem_cache
 	int cpu;
 
 	for_each_online_cpu(cpu)
-		get_cpu_slab(s, cpu)->stat[si] = 0;
+		per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
 }
 
 #define STAT_ATTR(si, text) 					\

-- 


* [this_cpu_xx V8 04/16] SLUB: Get rid of dynamic DMA kmalloc cache allocation
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (2 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 03/16] Use this_cpu operations in slub Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 05/16] this_cpu: Remove slub kmem_cache fields Christoph Lameter
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_static_dma_kmalloc --]
[-- Type: text/plain, Size: 3688 bytes --]

Dynamic DMA kmalloc cache allocation is troublesome since the
new percpu allocator does not support allocations in atomic contexts.
Reserve some statically allocated kmem_cache structures instead.
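
To make the reservation concrete, assuming PAGE_SHIFT == 12:
SLUB_PAGE_SHIFT = 14 regular kmalloc caches, KMALLOC_CACHES = 2 * 14 - 6 = 22,
i.e. 8 spare statically allocated kmem_cache slots. A spare slot is picked up
by scanning for an unused entry; grab_static_dma_slot() below is only a
hypothetical helper, the patch does the scan inline in dma_kmalloc_cache():

static struct kmem_cache *grab_static_dma_slot(void)
{
	int i;

	for (i = 0; i < KMALLOC_CACHES; i++)
		if (!kmalloc_caches[i].size)	/* unused slot (size == 0) */
			return kmalloc_caches + i;
	return NULL;
}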

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   19 +++++++++++--------
 mm/slub.c                |   24 ++++++++++--------------
 2 files changed, 21 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-14 15:22:57.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-14 15:23:27.000000000 -0600
@@ -131,11 +131,21 @@ struct kmem_cache {
 
 #define SLUB_PAGE_SHIFT (PAGE_SHIFT + 2)
 
+#ifdef CONFIG_ZONE_DMA
+#define SLUB_DMA __GFP_DMA
+/* Reserve extra caches for potential DMA use */
+#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT - 6)
+#else
+/* Disable DMA functionality */
+#define SLUB_DMA (__force gfp_t)0
+#define KMALLOC_CACHES SLUB_PAGE_SHIFT
+#endif
+
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -203,13 +213,6 @@ static __always_inline struct kmem_cache
 	return &kmalloc_caches[index];
 }
 
-#ifdef CONFIG_ZONE_DMA
-#define SLUB_DMA __GFP_DMA
-#else
-/* Disable DMA functionality */
-#define SLUB_DMA (__force gfp_t)0
-#endif
-
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
 void *__kmalloc(size_t size, gfp_t flags);
 
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-14 15:23:19.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-14 15:23:27.000000000 -0600
@@ -2092,7 +2092,7 @@ static inline int alloc_kmem_cache_cpus(
 {
 	int cpu;
 
-	if (s < kmalloc_caches + SLUB_PAGE_SHIFT && s >= kmalloc_caches)
+	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
 		 * since the per cpu allocator is not available yet.
@@ -2539,7 +2539,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[SLUB_PAGE_SHIFT] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
 static int __init setup_slub_min_order(char *str)
@@ -2629,6 +2629,7 @@ static noinline struct kmem_cache *dma_k
 	char *text;
 	size_t realsize;
 	unsigned long slabflags;
+	int i;
 
 	s = kmalloc_caches_dma[index];
 	if (s)
@@ -2649,18 +2650,13 @@ static noinline struct kmem_cache *dma_k
 	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
 			 (unsigned int)realsize);
 
-	if (flags & __GFP_WAIT)
-		s = kmalloc(kmem_size, flags & ~SLUB_DMA);
-	else {
-		int i;
+	s = NULL;
+	for (i = 0; i < KMALLOC_CACHES; i++)
+		if (!kmalloc_caches[i].size)
+			break;
 
-		s = NULL;
-		for (i = 0; i < SLUB_PAGE_SHIFT; i++)
-			if (kmalloc_caches[i].size) {
-				s = kmalloc_caches + i;
-				break;
-			}
-	}
+	BUG_ON(i >= KMALLOC_CACHES);
+	s = kmalloc_caches + i;
 
 	/*
 	 * Must defer sysfs creation to a workqueue because we don't know
@@ -2674,7 +2670,7 @@ static noinline struct kmem_cache *dma_k
 
 	if (!s || !text || !kmem_cache_open(s, flags, text,
 			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		kfree(s);
+		s->size = 0;
 		kfree(text);
 		goto unlock_out;
 	}

-- 


* [this_cpu_xx V8 05/16] this_cpu: Remove slub kmem_cache fields
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (3 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 04/16] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 06/16] Make slub statistics use this_cpu_inc Christoph Lameter
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_remove_fields --]
[-- Type: text/plain, Size: 7862 bytes --]

Remove the fields in struct kmem_cache_cpu that were used to cache data from
struct kmem_cache when they were in different cachelines. The cacheline that
holds the per cpu array pointer now also holds these values. This cuts the
size of struct kmem_cache_cpu almost in half.

The get_freepointer() and set_freepointer() functions that used to be only
intended for the slow path now are also useful for the hot path since access
to the size field does not require accessing an additional cacheline anymore.
This results in consistent use of functions for setting the freepointer of
objects throughout SLUB.

Also, we initialize all possible kmem_cache_cpu structures when a slab cache
is created; there is no need to initialize them when a processor or node comes
online.
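
For reference, these are the two helpers the fast paths now share;
get_freepointer() appears in the hunk below and set_freepointer() is its
existing counterpart in mm/slub.c:

static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	/* the free pointer lives s->offset bytes into the object */
	return *(void **)(object + s->offset);
}

static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
	*(void **)(object + s->offset) = fp;
}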

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   76 ++++++++++-------------------------------------
 2 files changed, 17 insertions(+), 61 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2009-12-14 15:23:27.000000000 -0600
+++ linux-2.6/include/linux/slub_def.h	2009-12-14 15:23:59.000000000 -0600
@@ -38,8 +38,6 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-12-14 15:23:27.000000000 -0600
+++ linux-2.6/mm/slub.c	2009-12-14 15:24:55.000000000 -0600
@@ -260,13 +260,6 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1473,10 +1466,10 @@ static void deactivate_slab(struct kmem_
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1635,7 +1628,7 @@ load_freelist:
 	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1681,7 +1674,7 @@ debug:
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1702,7 +1695,6 @@ static __always_inline void *slab_alloc(
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
-	unsigned long objsize;
 
 	gfpflags &= gfp_allowed_mask;
 
@@ -1715,22 +1707,21 @@ static __always_inline void *slab_alloc(
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
 	object = c->freelist;
-	objsize = c->objsize;
 	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
-		memset(object, 0, objsize);
+		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, c->objsize);
-	kmemleak_alloc_recursive(object, objsize, 1, s->flags, gfpflags);
+	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
 }
@@ -1785,7 +1776,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_notr
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr, unsigned int offset)
+			void *x, unsigned long addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1799,7 +1790,8 @@ static void __slab_free(struct kmem_cach
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1864,16 +1856,16 @@ static __always_inline void slab_free(st
 	kmemleak_free_recursive(x, s->flags);
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, c->objsize);
-	debug_check_no_locks_freed(object, c->objsize);
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, c->objsize);
+		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -2060,19 +2052,6 @@ static unsigned long calculate_alignment
 	return ALIGN(align, sizeof(void *));
 }
 
-static void init_kmem_cache_cpu(struct kmem_cache *s,
-			struct kmem_cache_cpu *c)
-{
-	c->page = NULL;
-	c->freelist = NULL;
-	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
-#ifdef CONFIG_SLUB_STATS
-	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
-#endif
-}
-
 static void
 init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
 {
@@ -2090,8 +2069,6 @@ static DEFINE_PER_CPU(struct kmem_cache_
 
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
-	int cpu;
-
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
 		 * Boot time creation of the kmalloc array. Use static per cpu data
@@ -2104,8 +2081,6 @@ static inline int alloc_kmem_cache_cpus(
 	if (!s->cpu_slab)
 		return 0;
 
-	for_each_possible_cpu(cpu)
-		init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
 	return 1;
 }
 
@@ -2391,6 +2366,7 @@ static int kmem_cache_open(struct kmem_c
 
 	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
 		return 1;
+
 	free_kmem_cache_nodes(s);
 error:
 	if (flags & SLAB_PANIC)
@@ -3247,22 +3223,12 @@ struct kmem_cache *kmem_cache_create(con
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
 		 * the complete object on kzalloc.
 		 */
 		s->objsize = max(s->objsize, (int)size);
-
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			per_cpu_ptr(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 
@@ -3316,14 +3282,6 @@ static int __cpuinit slab_cpuup_callback
 	unsigned long flags;
 
 	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
-			init_kmem_cache_cpu(s, per_cpu_ptr(s->cpu_slab, cpu));
-		up_read(&slub_lock);
-		break;
-
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 	case CPU_DEAD:

-- 


* [this_cpu_xx V8 06/16] Make slub statistics use this_cpu_inc
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (4 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 05/16] this_cpu: Remove slub kmem_cache fields Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Pekka Enberg, Mel Gorman, Mathieu Desnoyers

[-- Attachment #1: percpu_slub_cleanup_stat --]
[-- Type: text/plain, Size: 5136 bytes --]

this_cpu_inc() translates into a single instruction on x86 and does not
need a scratch register, so use it in stat(). We also want to avoid the
calculation of the per cpu kmem_cache_cpu structure pointer, so pass
a kmem_cache pointer instead of a kmem_cache_cpu pointer.
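
The reworked helper, reproduced from the hunk below; on x86 the
__this_cpu_inc() expands to a single segment prefixed increment of the
statistics slot:

static inline void stat(struct kmem_cache *s, enum stat_item si)
{
#ifdef CONFIG_SLUB_STATS
	/* no kmem_cache_cpu pointer calculation needed at the call site */
	__this_cpu_inc(s->cpu_slab->stat[si]);
#endif
}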

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2009-09-29 11:44:35.000000000 -0500
+++ linux-2.6/mm/slub.c	2009-09-29 11:44:49.000000000 -0500
@@ -217,10 +217,10 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
-static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
+static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	c->stat[si]++;
+	__this_cpu_inc(s->cpu_slab->stat[si]);
 #endif
 }
 
@@ -1108,7 +1108,7 @@ static struct page *allocate_slab(struct
 		if (!page)
 			return NULL;
 
-		stat(this_cpu_ptr(s->cpu_slab), ORDER_FALLBACK);
+		stat(s, ORDER_FALLBACK);
 	}
 
 	if (kmemcheck_enabled
@@ -1406,23 +1406,22 @@ static struct page *get_partial(struct k
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
 
 	__ClearPageSlubFrozen(page);
 	if (page->inuse) {
 
 		if (page->freelist) {
 			add_partial(n, page, tail);
-			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
+			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
-			stat(c, DEACTIVATE_FULL);
+			stat(s, DEACTIVATE_FULL);
 			if (SLABDEBUG && PageSlubDebug(page) &&
 						(s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
 	} else {
-		stat(c, DEACTIVATE_EMPTY);
+		stat(s, DEACTIVATE_EMPTY);
 		if (n->nr_partial < s->min_partial) {
 			/*
 			 * Adding an empty slab to the partial slabs in order
@@ -1438,7 +1437,7 @@ static void unfreeze_slab(struct kmem_ca
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(__this_cpu_ptr(s->cpu_slab), FREE_SLAB);
+			stat(s, FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1453,7 +1452,7 @@ static void deactivate_slab(struct kmem_
 	int tail = 1;
 
 	if (page->freelist)
-		stat(c, DEACTIVATE_REMOTE_FREES);
+		stat(s, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
@@ -1479,7 +1478,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(c, CPUSLAB_FLUSH);
+	stat(s, CPUSLAB_FLUSH);
 	slab_lock(c->page);
 	deactivate_slab(s, c);
 }
@@ -1619,7 +1618,7 @@ static void *__slab_alloc(struct kmem_ca
 	if (unlikely(!node_match(c, node)))
 		goto another_slab;
 
-	stat(c, ALLOC_REFILL);
+	stat(s, ALLOC_REFILL);
 
 load_freelist:
 	object = c->page->freelist;
@@ -1634,7 +1633,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-	stat(c, ALLOC_SLOWPATH);
+	stat(s, ALLOC_SLOWPATH);
 	return object;
 
 another_slab:
@@ -1644,7 +1643,7 @@ new_slab:
 	new = get_partial(s, gfpflags, node);
 	if (new) {
 		c->page = new;
-		stat(c, ALLOC_FROM_PARTIAL);
+		stat(s, ALLOC_FROM_PARTIAL);
 		goto load_freelist;
 	}
 
@@ -1658,7 +1657,7 @@ new_slab:
 
 	if (new) {
 		c = __this_cpu_ptr(s->cpu_slab);
-		stat(c, ALLOC_SLAB);
+		stat(s, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
 		slab_lock(new);
@@ -1713,7 +1712,7 @@ static __always_inline void *slab_alloc(
 
 	else {
 		c->freelist = get_freepointer(s, object);
-		stat(c, ALLOC_FASTPATH);
+		stat(s, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
@@ -1780,10 +1779,8 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
 
-	c = __this_cpu_ptr(s->cpu_slab);
-	stat(c, FREE_SLOWPATH);
+	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
 	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
@@ -1796,7 +1793,7 @@ checks_ok:
 	page->inuse--;
 
 	if (unlikely(PageSlubFrozen(page))) {
-		stat(c, FREE_FROZEN);
+		stat(s, FREE_FROZEN);
 		goto out_unlock;
 	}
 
@@ -1809,7 +1806,7 @@ checks_ok:
 	 */
 	if (unlikely(!prior)) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(c, FREE_ADD_PARTIAL);
+		stat(s, FREE_ADD_PARTIAL);
 	}
 
 out_unlock:
@@ -1822,10 +1819,10 @@ slab_empty:
 		 * Slab still on the partial list.
 		 */
 		remove_partial(s, page);
-		stat(c, FREE_REMOVE_PARTIAL);
+		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-	stat(c, FREE_SLAB);
+	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
 
@@ -1863,7 +1860,7 @@ static __always_inline void slab_free(st
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
-		stat(c, FREE_FASTPATH);
+		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);
 

-- 


* [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (5 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 06/16] Make slub statistics use this_cpu_inc Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-21  7:47   ` Tejun Heo
  2009-12-18 22:26 ` [this_cpu_xx V8 08/16] Remove cpu_local_xx macros Christoph Lameter
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_local_t_remove_modules --]
[-- Type: text/plain, Size: 6378 bytes --]

Use this_cpu ops to deal with the per cpu data instead of a local_t. This
reduces memory requirements and cache footprint, and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr(), which would make per cpu data
handling complicated. The this_cpu_xx operations have their own fallback
for !SMP.

After this patch has been applied, the last remaining user of local_t is the
tracing ring buffer.
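
Condensed sketch of the resulting reference counting (mod->refptr is the per
cpu struct module_ref pointer introduced below; the demo_* names are made up
and the tracing calls are omitted):

static inline void demo_module_get(struct module *module)
{
	preempt_disable();
	/* bump this cpu's slot of the module refcount */
	__this_cpu_inc(module->refptr->count);
	preempt_enable();
}

static unsigned int demo_module_refcount(struct module *mod)
{
	unsigned int total = 0;
	int cpu;

	/* the true count is the sum over all per cpu slots */
	for_each_possible_cpu(cpu)
		total += per_cpu_ptr(mod->refptr, cpu)->count;
	return total;
}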

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/module.h     |   38 ++++++++++++++------------------------
 kernel/module.c            |   30 ++++++++++++++++--------------
 kernel/trace/ring_buffer.c |    1 +
 3 files changed, 31 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/include/linux/module.h	2009-12-18 14:16:17.000000000 -0600
@@ -16,8 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
 #include <asm/module.h>
 
 #include <trace/events/module.h>
@@ -363,11 +362,9 @@ struct module
 	/* Destruction function. */
 	void (*exit)(void);
 
-#ifdef CONFIG_SMP
-	char *refptr;
-#else
-	local_t ref;
-#endif
+	struct module_ref {
+		int count;
+	} *refptr;
 #endif
 
 #ifdef CONFIG_CONSTRUCTORS
@@ -454,25 +451,16 @@ void __symbol_put(const char *symbol);
 #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
 void symbol_put_addr(void *addr);
 
-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
-	return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
-	return &mod->ref;
-#endif
-}
-
 /* Sometimes we know we already have a refcount, and it's easier not
    to handle the error case (which only happens with rmmod --wait). */
 static inline void __module_get(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_inc(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_inc(module->refptr->count);
 		trace_module_get(module, _THIS_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
-		put_cpu();
+				 __this_cpu_read(module->refptr->count));
+		preempt_enable();
 	}
 }
 
@@ -481,15 +469,17 @@ static inline int try_module_get(struct 
 	int ret = 1;
 
 	if (module) {
-		unsigned int cpu = get_cpu();
+		preempt_disable();
+
 		if (likely(module_is_live(module))) {
-			local_inc(__module_ref_addr(module, cpu));
+			__this_cpu_inc(module->refptr->count);
 			trace_module_get(module, _THIS_IP_,
-				local_read(__module_ref_addr(module, cpu)));
+				__this_cpu_read(module->refptr->count));
 		}
 		else
 			ret = 0;
-		put_cpu();
+
+		preempt_enable();
 	}
 	return ret;
 }
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/kernel/module.c	2009-12-18 14:15:57.000000000 -0600
@@ -474,9 +474,10 @@ static void module_unload_init(struct mo
 
 	INIT_LIST_HEAD(&mod->modules_which_use_me);
 	for_each_possible_cpu(cpu)
-		local_set(__module_ref_addr(mod, cpu), 0);
+		per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
 	/* Hold reference count during initialization. */
-	local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+	__this_cpu_write(mod->refptr->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -555,6 +556,7 @@ static void module_unload_free(struct mo
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				free_percpu(i->refptr);
 				break;
 			}
 		}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct modu
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		total += local_read(__module_ref_addr(mod, cpu));
+		total += per_cpu_ptr(mod->refptr, cpu)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt = 
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_dec(module->refptr->count);
+
 		trace_module_put(module, _RET_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
+				 __this_cpu_read(module->refptr->count));
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);
@@ -1394,9 +1397,9 @@ static void free_module(struct module *m
 	kfree(mod->args);
 	if (mod->percpu)
 		percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
 	if (mod->refptr)
-		percpu_modfree(mod->refptr);
+		free_percpu(mod->refptr);
 #endif
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2159,9 +2162,8 @@ static noinline struct module *load_modu
 	mod = (void *)sechdrs[modindex].sh_addr;
 	kmemleak_load_module(mod, hdr, sechdrs, secstrings);
 
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
-				      mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+	mod->refptr = alloc_percpu(struct module_ref);
 	if (!mod->refptr) {
 		err = -ENOMEM;
 		goto free_init;
@@ -2393,8 +2395,8 @@ static noinline struct module *load_modu
 	kobject_put(&mod->mkobj.kobj);
  free_unload:
 	module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+	free_percpu(mod->refptr);
  free_init:
 #endif
 	module_free(mod, mod->module_init);
Index: linux-2.6/kernel/trace/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
+++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
@@ -12,6 +12,7 @@
 #include <linux/hardirq.h>
 #include <linux/kmemcheck.h>
 #include <linux/module.h>
+#include <asm/local.h>
 #include <linux/percpu.h>
 #include <linux/mutex.h>
 #include <linux/init.h>

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 08/16] Remove cpu_local_xx macros
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (6 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately Christoph Lameter
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_local_t_remove_cpu_alloc --]
[-- Type: text/plain, Size: 9721 bytes --]

These macros have not been used for a while now and are not used by the last
remaining local_t user in the tree (the trace ringbuffer logic).
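
For any remaining out-of-tree user the replacement is straightforward; a sketch
with a made-up variable:

#include <linux/percpu.h>

/* was: static DEFINE_PER_CPU(local_t, nr_events); */
static DEFINE_PER_CPU(unsigned long, nr_events);

static void count_event(void)
{
	/* was: cpu_local_inc(nr_events);
	 * the generic this_cpu fallback disables preemption itself */
	this_cpu_inc(nr_events);
}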

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/alpha/include/asm/local.h   |   17 -----------------
 arch/m32r/include/asm/local.h    |   25 -------------------------
 arch/mips/include/asm/local.h    |   25 -------------------------
 arch/powerpc/include/asm/local.h |   25 -------------------------
 arch/x86/include/asm/local.h     |   37 -------------------------------------
 include/asm-generic/local.h      |   19 -------------------
 6 files changed, 148 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/arch/alpha/include/asm/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -98,21 +98,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/arch/m32r/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/m32r/include/asm/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/arch/m32r/include/asm/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -338,29 +338,4 @@ static inline void local_set_mask(unsign
  * a variable, not an address.
  */
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* __M32R_LOCAL_H */
Index: linux-2.6/arch/mips/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/mips/include/asm/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/arch/mips/include/asm/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -193,29 +193,4 @@ static __inline__ long local_sub_return(
 #define __local_add(i, l)	((l)->a.counter+=(i))
 #define __local_sub(i, l)	((l)->a.counter-=(i))
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/arch/powerpc/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/arch/powerpc/include/asm/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -172,29 +172,4 @@ static __inline__ long local_dec_if_posi
 #define __local_add(i,l)	((l)->a.counter+=(i))
 #define __local_sub(i,l)	((l)->a.counter-=(i))
 
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/arch/x86/include/asm/local.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -195,41 +195,4 @@ static inline long local_sub_return(long
 #define __local_add(i, l)	local_add((i), (l))
 #define __local_sub(i, l)	local_sub((i), (l))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)		\
-({					\
-	local_t res__;			\
-	preempt_disable(); 		\
-	res__ = (l);			\
-	preempt_enable();		\
-	res__;				\
-})
-#define cpu_local_wrap(l)		\
-({					\
-	preempt_disable();		\
-	(l);				\
-	preempt_enable();		\
-})					\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
 #endif /* _ASM_X86_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h	2009-12-16 13:50:11.000000000 -0600
+++ linux-2.6/include/asm-generic/local.h	2009-12-18 14:19:17.000000000 -0600
@@ -52,23 +52,4 @@ typedef struct
 #define __local_add(i,l)	local_set((l), local_read(l) + (i))
 #define __local_sub(i,l)	local_set((l), local_read(l) - (i))
 
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
 #endif /* _ASM_GENERIC_LOCAL_H */

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (7 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 08/16] Remove cpu_local_xx macros Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-21  7:25   ` Tejun Heo
  2009-12-18 22:26 ` [this_cpu_xx V8 10/16] Support generating inc/dec for this_cpu_inc/dec Christoph Lameter
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_inc_dec --]
[-- Type: text/plain, Size: 4752 bytes --]

The current this_cpu ops only allow an arch to override the add RMW operations
per size, or inc and dec only for all sizes at once. Some arches can do more
efficient inc and dec operations. Allow a size specific override of the
fallback functions, as is already possible for the other operations.
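
A sketch of what an arch header can now do (myarch_fast_inc/dec are
placeholders, not real interfaces):

/* in arch/<arch>/include/asm/percpu.h, picked up before the generic fallbacks */
#define __this_cpu_inc_4(pcp)	myarch_fast_inc(pcp)	/* arch specific single instruction */
#define __this_cpu_dec_4(pcp)	myarch_fast_dec(pcp)

/*
 * Sizes that are not overridden (here 1, 2 and 8) still get the generic
 * __this_cpu_add(pcp, 1) / __this_cpu_sub(pcp, 1) fallback from
 * include/linux/percpu.h.
 */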

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/percpu.h |   96 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 90 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-12-18 13:34:13.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2009-12-18 14:19:25.000000000 -0600
@@ -257,6 +257,18 @@ do {									\
 	}								\
 } while (0)
 
+#define __pcpu_size_call_noparam(stem, variable)			\
+do {									\
+	switch(sizeof(variable)) {					\
+		case 1: stem##1(variable);break;			\
+		case 2: stem##2(variable);break;			\
+		case 4: stem##4(variable);break;			\
+		case 8: stem##8(variable);break;			\
+		default: 						\
+			__bad_size_call_parameter();break;		\
+	}								\
+} while (0)
+
 /*
  * Optimized manipulation for memory allocated through the per cpu
  * allocator or for addresses of per cpu variables (can be determined
@@ -352,11 +364,35 @@ do {									\
 #endif
 
 #ifndef this_cpu_inc
-# define this_cpu_inc(pcp)		this_cpu_add((pcp), 1)
+# ifndef this_cpu_inc_1
+#  define this_cpu_inc_1(pcp)	__this_cpu_add((pcp), 1)
+# endif
+# ifndef this_cpu_inc_2
+#  define this_cpu_inc_2(pcp)	__this_cpu_add((pcp), 1)
+# endif
+# ifndef this_cpu_inc_4
+#  define this_cpu_inc_4(pcp)	__this_cpu_add((pcp), 1)
+# endif
+# ifndef this_cpu_inc_8
+#  define this_cpu_inc_8(pcp)	__this_cpu_add((pcp), 1)
+# endif
+# define this_cpu_inc(pcp)		__pcpu_size_call_noparam(this_cpu_inc_, (pcp))
 #endif
 
 #ifndef this_cpu_dec
-# define this_cpu_dec(pcp)		this_cpu_sub((pcp), 1)
+# ifndef this_cpu_dec_1
+#  define this_cpu_dec_1(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef this_cpu_dec_2
+#  define this_cpu_dec_2(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef this_cpu_dec_4
+#  define this_cpu_dec_4(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef this_cpu_dec_8
+#  define this_cpu_dec_8(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# define this_cpu_dec(pcp)		__pcpu_size_call_noparam(this_cpu_dec_, (pcp))
 #endif
 
 #ifndef this_cpu_and
@@ -479,11 +515,35 @@ do {									\
 #endif
 
 #ifndef __this_cpu_inc
-# define __this_cpu_inc(pcp)		__this_cpu_add((pcp), 1)
+# ifndef __this_cpu_inc_1
+#  define __this_cpu_inc_1(pcp)	__this_cpu_add_1((pcp), 1)
+# endif
+# ifndef __this_cpu_inc_2
+#  define __this_cpu_inc_2(pcp)	__this_cpu_add_2((pcp), 1)
+# endif
+# ifndef __this_cpu_inc_4
+#  define __this_cpu_inc_4(pcp)	__this_cpu_add_4((pcp), 1)
+# endif
+# ifndef __this_cpu_inc_8
+#  define __this_cpu_inc_8(pcp)	__this_cpu_add_8((pcp), 1)
+# endif
+# define __this_cpu_inc(pcp)	__pcpu_size_call_noparam(__this_cpu_inc_, (pcp))
 #endif
 
 #ifndef __this_cpu_dec
-# define __this_cpu_dec(pcp)		__this_cpu_sub((pcp), 1)
+# ifndef __this_cpu_dec_1
+#  define __this_cpu_dec_1(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef __this_cpu_dec_2
+#  define __this_cpu_dec_2(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef __this_cpu_dec_4
+#  define __this_cpu_dec_4(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# ifndef __this_cpu_dec_8
+#  define __this_cpu_dec_8(pcp)	__this_cpu_sub((pcp), 1)
+# endif
+# define __this_cpu_dec(pcp)	__pcpu_size_call_noparam(__this_cpu_dec_, (pcp))
 #endif
 
 #ifndef __this_cpu_and
@@ -570,11 +630,35 @@ do {									\
 #endif
 
 #ifndef irqsafe_cpu_inc
-# define irqsafe_cpu_inc(pcp)	irqsafe_cpu_add((pcp), 1)
+# ifndef irqsafe_cpu_inc_1
+#  define irqsafe_cpu_inc_1(pcp)	irqsafe_cpu_add_1((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_inc_2
+#  define irqsafe_cpu_inc_2(pcp)	irqsafe_cpu_add_2((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_inc_4
+#  define irqsafe_cpu_inc_4(pcp)	irqsafe_cpu_add_4((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_inc_8
+#  define irqsafe_cpu_inc_8(pcp)	irqsafe_cpu_add_8((pcp), 1)
+# endif
+# define irqsafe_cpu_inc(pcp)	__pcpu_size_call_noparam(irqsafe_cpu_inc_, (pcp))
 #endif
 
 #ifndef irqsafe_cpu_dec
-# define irqsafe_cpu_dec(pcp)	irqsafe_cpu_sub((pcp), 1)
+# ifndef irqsafe_cpu_dec_1
+#  define irqsafe_cpu_dec_1(pcp)	irqsafe_cpu_sub_1((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_dec_2
+#  define irqsafe_cpu_dec_2(pcp)	irqsafe_cpu_sub_2((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_dec_4
+#  define irqsafe_cpu_dec_4(pcp)	irqsafe_cpu_sub_4((pcp), 1)
+# endif
+# ifndef irqsafe_cpu_dec_8
+#  define irqsafe_cpu_dec_8(pcp)	irqsafe_cpu_sub_8((pcp), 1)
+# endif
+# define irqsafe_cpu_dec(pcp)	__pcpu_size_call_noparam(irqsafe_cpu_dec_, (pcp))
 #endif
 
 #ifndef irqsafe_cpu_and

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 10/16] Support generating inc/dec for this_cpu_inc/dec
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (8 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg Christoph Lameter
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_inc_dec_x86 --]
[-- Type: text/plain, Size: 5208 bytes --]

Support generating inc/dec instructions. Currently an add-of-1 instruction is
generated instead. Using inc/dec saves one byte per use of this_cpu_xx.
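
Roughly what this means for the generated code on x86_64 (where per cpu data
lives in the %gs segment), using a made-up per cpu variable:

static DEFINE_PER_CPU(int, nr);

static void tick(void)
{
	__this_cpu_inc(nr);	/* before: addl $1,%gs:nr  --  after: incl %gs:nr */
}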

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   47 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-12-18 12:45:31.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-12-18 14:22:40.000000000 -0600
@@ -133,6 +133,29 @@ do {							\
 	pfo_ret__;					\
 })
 
+#define percpu_var_op(op, var)				\
+({							\
+	switch (sizeof(var)) {				\
+	case 1:						\
+		asm(op "b "__percpu_arg(0)		\
+		    : "+m" (var) :);			\
+		break;					\
+	case 2:						\
+		asm(op "w "__percpu_arg(0)		\
+		    : "+m" (var) :);			\
+		break;					\
+	case 4:						\
+		asm(op "l "__percpu_arg(0)		\
+		    : "+m" (var) :);			\
+		break;					\
+	case 8:						\
+		asm(op "q "__percpu_arg(0)		\
+		    : "+m" (var) :);			\
+		break;					\
+	default: __bad_percpu_size();			\
+	}						\
+})
+
 /*
  * percpu_read() makes gcc load the percpu variable every time it is
  * accessed while percpu_read_stable() allows the value to be cached.
@@ -163,6 +186,12 @@ do {							\
 #define __this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
 #define __this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
 #define __this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_inc_1(pcp)		percpu_var_op("inc", (pcp))
+#define __this_cpu_inc_2(pcp)		percpu_var_op("inc", (pcp))
+#define __this_cpu_inc_4(pcp)		percpu_var_op("inc", (pcp))
+#define __this_cpu_dec_1(pcp)		percpu_var_op("dec", (pcp))
+#define __this_cpu_dec_2(pcp)		percpu_var_op("dec", (pcp))
+#define __this_cpu_dec_4(pcp)		percpu_var_op("dec", (pcp))
 #define __this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
 #define __this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
 #define __this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
@@ -182,6 +211,12 @@ do {							\
 #define this_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
 #define this_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
 #define this_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_inc_1(pcp)		percpu_var_op("inc", (pcp))
+#define this_cpu_inc_2(pcp)		percpu_var_op("inc", (pcp))
+#define this_cpu_inc_4(pcp)		percpu_var_op("inc", (pcp))
+#define this_cpu_dec_1(pcp)		percpu_var_op("dec", (pcp))
+#define this_cpu_dec_2(pcp)		percpu_var_op("dec", (pcp))
+#define this_cpu_dec_4(pcp)		percpu_var_op("dec", (pcp))
 #define this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
 #define this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
 #define this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
@@ -195,6 +230,12 @@ do {							\
 #define irqsafe_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_add_4(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_inc_1(pcp)		percpu_var_op("inc", (pcp))
+#define irqsafe_cpu_inc_2(pcp)		percpu_var_op("inc", (pcp))
+#define irqsafe_cpu_inc_4(pcp)		percpu_var_op("inc", (pcp))
+#define irqsafe_cpu_dec_1(pcp)		percpu_var_op("dec", (pcp))
+#define irqsafe_cpu_dec_2(pcp)		percpu_var_op("dec", (pcp))
+#define irqsafe_cpu_dec_4(pcp)		percpu_var_op("dec", (pcp))
 #define irqsafe_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
 #define irqsafe_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
 #define irqsafe_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
@@ -213,6 +254,8 @@ do {							\
 #define __this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define __this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
 #define __this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define __this_cpu_inc_8(pcp)		percpu_var_op("inc", (pcp))
+#define __this_cpu_dec_8(pcp)		percpu_var_op("dec", (pcp))
 #define __this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define __this_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define __this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
@@ -220,11 +263,15 @@ do {							\
 #define this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
 #define this_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define this_cpu_inc_8(pcp)		percpu_var_op("inc", (pcp))
+#define this_cpu_dec_8(pcp)		percpu_var_op("dec", (pcp))
 #define this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define this_cpu_or_8(pcp, val)		percpu_to_op("or", (pcp), val)
 #define this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 
 #define irqsafe_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
+#define irqsafe_cpu_inc_8(pcp)		percpu_var_op("inc", (pcp))
+#define irqsafe_cpu_dec_8(pcp)		percpu_var_op("dec", (pcp))
 #define irqsafe_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (9 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 10/16] Support generating inc/dec for this_cpu_inc/dec Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-19 14:45   ` Mathieu Desnoyers
  2009-12-18 22:26 ` [this_cpu_xx V8 12/16] Add percpu cmpxchg operations Christoph Lameter
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_cmpxchg --]
[-- Type: text/plain, Size: 4454 bytes --]

Provide support for this_cpu_cmpxchg.
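
A usage sketch with a made-up variable, showing the intended lock-free update
pattern:

static DEFINE_PER_CPU(unsigned long, budget);

static void consume(unsigned long amount)
{
	unsigned long old, new;

	do {
		old = this_cpu_read(budget);
		new = old > amount ? old - amount : 0;
		/* retried if another update changed the value under us */
	} while (this_cpu_cmpxchg(budget, old, new) != old);
}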

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/percpu.h |   99 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-12-18 15:17:05.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2009-12-18 15:33:12.000000000 -0600
@@ -443,6 +443,62 @@ do {									\
 # define this_cpu_xor(pcp, val)		__pcpu_size_call(this_cpu_or_, (pcp), (val))
 #endif
 
+static inline unsigned long __this_cpu_cmpxchg_generic(volatile void *pptr,
+		unsigned long old, unsigned long new, int size)
+{
+	unsigned long prev;
+	volatile void * ptr = __this_cpu_ptr(pptr);
+
+	switch (size) {
+	case 1: prev = *(u8 *)ptr;
+		if (prev == old)
+			*(u8 *)ptr = (u8)new;
+		break;
+	case 2: prev = *(u16 *)ptr;
+		if (prev == old)
+			*(u16 *)ptr = (u16)new;
+		break;
+	case 4: prev = *(u32 *)ptr;
+		if (prev == old)
+			*(u32 *)ptr = (u32)new;
+		break;
+	case 8: prev = *(u64 *)ptr;
+		if (prev == old)
+			*(u64 *)ptr = (u64)new;
+		break;
+	default:
+		__bad_size_call_parameter();
+	}
+	return prev;
+}
+
+static inline unsigned long this_cpu_cmpxchg_generic(volatile void *ptr,
+		unsigned long old, unsigned long new, int size)
+{
+	unsigned long prev;
+
+	preempt_disable();
+	prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
+	preempt_enable();
+	return prev;
+}
+
+#ifndef this_cpu_cmpxchg
+# ifndef this_cpu_cmpxchg_1
+#  define this_cpu_cmpxchg_1(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
+# endif
+# ifndef this_cpu_cmpxchg_2
+#  define this_cpu_cmpxchg_2(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
+# endif
+# ifndef this_cpu_cmpxchg_4
+#  define this_cpu_cmpxchg_4(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
+# endif
+# ifndef this_cpu_cmpxchg_8
+#  define this_cpu_cmpxchg_8(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
+# endif
+# define this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
+#endif
+
 /*
  * Generic percpu operations that do not require preemption handling.
  * Either we do not care about races or the caller has the
@@ -594,6 +650,22 @@ do {									\
 # define __this_cpu_xor(pcp, val)	__pcpu_size_call(__this_cpu_xor_, (pcp), (val))
 #endif
 
+#ifndef __this_cpu_cmpxchg
+# ifndef __this_cpu_cmpxchg_1
+#  define __this_cpu_cmpxchg_1(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
+# endif
+# ifndef __this_cpu_cmpxchg_2
+#  define __this_cpu_cmpxchg_2(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
+# endif
+# ifndef __this_cpu_cmpxchg_4
+#  define __this_cpu_cmpxchg_4(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
+# endif
+# ifndef __this_cpu_cmpxchg_8
+#  define __this_cpu_cmpxchg_8(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
+# endif
+# define __this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
+#endif
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
@@ -709,4 +781,31 @@ do {									\
 # define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
 #endif
 
+static inline unsigned long irqsafe_cpu_cmpxchg_generic(volatile void *ptr,
+		unsigned long old, unsigned long new, int size)
+{
+	unsigned long flags, prev;
+
+	local_irq_save(flags);
+	prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
+	local_irq_restore(flags);
+	return prev;
+}
+
+#ifndef irqsafe_cpu_cmpxchg
+# ifndef irqsafe_cpu_cmpxchg_1
+#  define irqsafe_cpu_cmpxchg_1(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 1)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_2
+#  define irqsafe_cpu_cmpxchg_2(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 2)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_4
+#  define irqsafe_cpu_cmpxchg_4(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 4)
+# endif
+# ifndef irqsafe_cpu_cmpxchg_8
+#  define irqsafe_cpu_cmpxchg_8(pcp, old, new) irqsafe_cpu_cmpxchg_generic(((pcp), (old), (new), 8)
+# endif
+# define irqsafe_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(irqsafe_cpu_cmpxchg_, (old), (new))
+#endif
+
 #endif /* __LINUX_PERCPU_H */

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 12/16] Add percpu cmpxchg operations
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (10 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 13/16] Generic support for this_cpu_xchg Christoph Lameter
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_cmpxchg_x86 --]
[-- Type: text/plain, Size: 3580 bytes --]

These are needed for the ringbuffer logic.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-12-18 15:00:53.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-12-18 15:06:09.000000000 -0600
@@ -201,6 +201,9 @@ do {							\
 #define __this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define __this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define __this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
+#define __this_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
+#define __this_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
 
 #define this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
@@ -226,6 +229,9 @@ do {							\
 #define this_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define this_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define this_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define this_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define this_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 
 #define irqsafe_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
@@ -245,6 +251,9 @@ do {							\
 #define irqsafe_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define irqsafe_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define irqsafe_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
@@ -259,6 +268,7 @@ do {							\
 #define __this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define __this_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define __this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define __this_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
 
 #define this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
@@ -268,6 +278,7 @@ do {							\
 #define this_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define this_cpu_or_8(pcp, val)		percpu_to_op("or", (pcp), val)
 #define this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define this_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 
 #define irqsafe_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_inc_8(pcp)		percpu_var_op("inc", (pcp))
@@ -275,6 +286,7 @@ do {							\
 #define irqsafe_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
 #define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
+#define irqsafe_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 
 #endif
 

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 13/16] Generic support for this_cpu_xchg
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (11 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 12/16] Add percpu cmpxchg operations Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 14/16] x86 percpu xchg operation Christoph Lameter
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_xchg --]
[-- Type: text/plain, Size: 4106 bytes --]

Provide generic this_cpu_xchg() and all fallbacks.
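
A usage sketch with a made-up variable: fetch and clear a per cpu value in one
operation:

static DEFINE_PER_CPU(unsigned long, pending);

static unsigned long drain_pending(void)
{
	/* returns the old value and leaves 0 behind; preemption handled by the op */
	return this_cpu_xchg(pending, 0);
}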

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/percpu.h |   95 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-12-18 15:33:12.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2009-12-18 15:34:15.000000000 -0600
@@ -499,6 +499,58 @@ static inline unsigned long this_cpu_cmp
 # define this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
 #endif
 
+static inline unsigned long __this_cpu_xchg_generic(volatile void *pptr,
+		unsigned long new, int size)
+{
+	unsigned long prev;
+	volatile void *ptr = __this_cpu_ptr(pptr);
+
+	switch (size) {
+	case 1: prev = *(u8 *)ptr;
+		*(u8 *)ptr = (u8)new;
+		break;
+	case 2: prev = *(u16 *)ptr;
+		*(u16 *)ptr = (u16)new;
+		break;
+	case 4: prev = *(u32 *)ptr;
+		*(u32 *)ptr = (u32)new;
+		break;
+	case 8: prev = *(u64 *)ptr;
+		*(u64 *)ptr = (u64)new;
+		break;
+	default:
+		__bad_size_call_parameter();
+	}
+	return prev;
+}
+
+static inline unsigned long this_cpu_xchg_generic(volatile void *ptr,
+		unsigned long new, int size)
+{
+	unsigned long prev;
+
+	preempt_disable();
+	prev = __this_cpu_xchg_generic(ptr, new, size);
+	preempt_enable();
+	return prev;
+}
+
+#ifndef this_cpu_xchg
+# ifndef this_cpu_xchg_1
+#  define this_cpu_xchg_1(pcp, new) this_cpu_xchg_generic((pcp), (new), 1)
+# endif
+# ifndef this_cpu_xchg_2
+#  define this_cpu_xchg_2(pcp, new) this_cpu_xchg_generic((pcp), (new), 2)
+# endif
+# ifndef this_cpu_xchg_4
+#  define this_cpu_xchg_4(pcp, new) this_cpu_xchg_generic((pcp), (new), 4)
+# endif
+# ifndef this_cpu_xchg_8
+#  define this_cpu_xchg_8(pcp, new) this_cpu_xchg_generic((pcp), (new), 8)
+# endif
+# define this_cpu_xchg(pcp, new) __pcpu_size_call_return(__this_cpu_xchg_, (new))
+#endif
+
 /*
  * Generic percpu operations that do not require preemption handling.
  * Either we do not care about races or the caller has the
@@ -666,6 +718,22 @@ do {									\
 # define __this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
 #endif
 
+#ifndef __this_cpu_xchg
+# ifndef __this_cpu_xchg_1
+#  define __this_cpu_xchg_1(pcp, new) __this_cpu_xchg_generic((pcp), (new), 1)
+# endif
+# ifndef __this_cpu_xchg_2
+#  define __this_cpu_xchg_2(pcp, new) __this_cpu_xchg_generic((pcp), (new), 2)
+# endif
+# ifndef __this_cpu_xchg_4
+#  define __this_cpu_xchg_4(pcp, new) __this_cpu_xchg_generic((pcp), (new), 4)
+# endif
+# ifndef __this_cpu_xchg_8
+#  define __this_cpu_xchg_8(pcp, new) __this_cpu_xchg_generic((pcp), (new), 8)
+# endif
+# define __this_cpu_xchg(pcp, new) __pcpu_size_call_return(__this_cpu_xchg_, (new))
+#endif
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
@@ -808,4 +876,31 @@ static inline unsigned long irqsafe_cpu_
 # define irqsafe_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(irqsafe_cpu_cmpxchg_, (old), (new))
 #endif
 
+static inline unsigned long irqsafe_cpu_xchg_generic(volatile void *ptr,
+		unsigned long new, int size)
+{
+	unsigned long flags, prev;
+
+	local_irq_save(flags);
+	prev = __this_cpu_xchg_generic(ptr, new, size);
+	local_irq_restore(flags);
+	return prev;
+}
+
+#ifndef irqsafe_cpu_xchg
+# ifndef irqsafe_cpu_xchg_1
+#  define irqsafe_cpu_xchg_1(pcp, new) irqsafe_cpu_xchg_generic(((pcp), (new), 1)
+# endif
+# ifndef irqsafe_cpu_xchg_2
+#  define irqsafe_cpu_xchg_2(pcp, new) irqsafe_cpu_xchg_generic(((pcp), (new), 2)
+# endif
+# ifndef irqsafe_cpu_xchg_4
+#  define irqsafe_cpu_xchg_4(pcp, new) irqsafe_cpu_xchg_generic(((pcp), (new), 4)
+# endif
+# ifndef irqsafe_cpu_xchg_8
+#  define irqsafe_cpu_xchg_8(pcp, new) irqsafe_cpu_xchg_generic(((pcp), (new), 8)
+# endif
+# define irqsafe_cpu_xchg(pcp, new) __pcpu_size_call_return(irqsafe_cpu_xchg_, (new))
+#endif
+
 #endif /* __LINUX_PERCPU_H */

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 14/16] x86 percpu xchg operation
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (12 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 13/16] Generic support for this_cpu_xchg Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 15/16] Generic support for this_cpu_add_return() Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 16/16] x86 support for this_cpu_add_return Christoph Lameter
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_xchg_x86 --]
[-- Type: text/plain, Size: 4101 bytes --]

xchg on x86 implies LOCK semantics and is therefore slow. Approximate it with a
loop around this_cpu_cmpxchg() instead.
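
The emulation is equivalent to the following open-coded loop (illustration
only, the variable is made up):

static DEFINE_PER_CPU(unsigned long, slot);

static unsigned long slot_xchg(unsigned long new)
{
	unsigned long old;

	do {
		old = __this_cpu_read(slot);
	} while (__this_cpu_cmpxchg(slot, old, new) != old);

	return old;	/* previous value, without the implied LOCK prefix of xchg */
}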

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-12-18 15:49:09.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-12-18 15:50:12.000000000 -0600
@@ -156,6 +156,15 @@ do {							\
 	}						\
 })
 
+#define this_cpu_xchg_x86(var, new)						\
+({	__typeof(*var) __tmp_var,__tmp_old;				\
+	do {								\
+		__tmp_old = __this_cpu_read(var);			\
+		__tmp_var = __this_cpu_cmpxchg(var, __tmp_old, new);	\
+	} while (__tmp_var != __tmp_old);				\
+	__tmp_old;							\
+})
+
 /*
  * percpu_read() makes gcc load the percpu variable every time it is
  * accessed while percpu_read_stable() allows the value to be cached.
@@ -204,6 +213,9 @@ do {							\
 #define __this_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
 #define __this_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
 #define __this_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
+#define __this_cpu_xchg_1(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define __this_cpu_xchg_2(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define __this_cpu_xchg_4(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 #define this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
@@ -232,6 +244,9 @@ do {							\
 #define this_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 #define this_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 #define this_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define this_cpu_xchg_1(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define this_cpu_xchg_2(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define this_cpu_xchg_4(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 #define irqsafe_cpu_add_1(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_add_2(pcp, val)	percpu_to_op("add", (pcp), val)
@@ -254,6 +269,9 @@ do {							\
 #define irqsafe_cpu_cmpxchg_1(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 #define irqsafe_cpu_cmpxchg_2(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
 #define irqsafe_cpu_cmpxchg_4(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define irqsafe_cpu_xchg_1(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define irqsafe_cpu_xchg_2(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define irqsafe_cpu_xchg_4(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
@@ -269,6 +287,7 @@ do {							\
 #define __this_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define __this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define __this_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(__this_cpu_ptr(pcp), old, new)
+#define __this_cpu_xchg_8(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 #define this_cpu_read_8(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_write_8(pcp, val)	percpu_to_op("mov", (pcp), val)
@@ -279,6 +298,7 @@ do {							\
 #define this_cpu_or_8(pcp, val)		percpu_to_op("or", (pcp), val)
 #define this_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define this_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define this_cpu_xchg_8(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 #define irqsafe_cpu_add_8(pcp, val)	percpu_to_op("add", (pcp), val)
 #define irqsafe_cpu_inc_8(pcp)		percpu_var_op("inc", (pcp))
@@ -287,6 +307,7 @@ do {							\
 #define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
 #define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
 #define irqsafe_cpu_cmpxchg_8(pcp, old,new)	cmpxchg_local(this_cpu_ptr(pcp), old, new)
+#define irqsafe_cpu_xchg_8(pcp, new)	this_cpu_xchg_x86((pcp), new)
 
 #endif
 

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 15/16] Generic support for this_cpu_add_return()
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (13 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 14/16] x86 percpu xchg operation Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  2009-12-18 22:26 ` [this_cpu_xx V8 16/16] x86 support for this_cpu_add_return Christoph Lameter
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_add_return --]
[-- Type: text/plain, Size: 4357 bytes --]

This allows core code to use this_cpu_add_return(). All fallback scenarios are
provided.
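
A usage sketch with a made-up variable: per cpu sequence numbers where the
caller needs the value after the increment:

static DEFINE_PER_CPU(unsigned long, seq);

static unsigned long next_seq(void)
{
	return this_cpu_add_return(seq, 1);	/* returns the incremented value */
}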

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/percpu.h |   90 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2009-12-18 15:34:15.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2009-12-18 15:34:49.000000000 -0600
@@ -551,6 +551,53 @@ static inline unsigned long this_cpu_xch
 # define this_cpu_xchg(pcp, new) __pcpu_size_call_return(__this_cpu_xchg_, (new))
 #endif
 
+static inline unsigned long __this_cpu_add_return_generic(volatile void *ptr,
+		unsigned long val, int size)
+{
+	unsigned long result;
+
+	switch (size) {
+	case 1: result = (*__this_cpu_ptr((u8 *)ptr) += val);
+		break;
+	case 2: result = (*__this_cpu_ptr((u16 *)ptr) += val);
+		break;
+	case 4: result = (*__this_cpu_ptr((u32 *)ptr) += val);
+		break;
+	case 8: result = (*__this_cpu_ptr((u64 *)ptr) += val);
+		break;
+	default:
+		__bad_size_call_parameter();
+	}
+	return result;
+}
+
+static inline unsigned long this_cpu_add_return_generic(volatile void *ptr,
+		unsigned long val, int size)
+{
+	unsigned long result;
+
+	preempt_disable();
+	result = __this_cpu_add_return_generic(ptr, val, size);
+	preempt_enable();
+	return result;
+}
+
+#ifndef this_cpu_add_return
+# ifndef this_cpu_add_return_1
+#  define this_cpu_add_return_1(pcp, val) this_cpu_add_return_generic((pcp), (val), 1)
+# endif
+# ifndef this_cpu_add_return_2
+#  define this_cpu_add_return_2(pcp, val) this_cpu_add_return_generic((pcp), (val), 2)
+# endif
+# ifndef this_cpu_add_return_4
+#  define this_cpu_add_return_4(pcp, val) this_cpu_add_return_generic((pcp), (val), 4)
+# endif
+# ifndef this_cpu_add_return_8
+#  define this_cpu_add_return_8(pcp, val) this_cpu_add_return_generic((pcp), (val), 8)
+# endif
+# define this_cpu_add_return(pcp, val) __pcpu_size_call_return(__this_cpu_add_return_, (val))
+#endif
+
 /*
  * Generic percpu operations that do not require preemption handling.
  * Either we do not care about races or the caller has the
@@ -734,6 +781,22 @@ do {									\
 # define __this_cpu_xchg(pcp, new) __pcpu_size_call_return(__this_cpu_xchg_, (new))
 #endif
 
+#ifndef __this_cpu_add_return
+# ifndef __this_cpu_add_return_1
+#  define __this_cpu_add_return_1(pcp, val) __this_cpu_add_return_generic((pcp), (val), 1)
+# endif
+# ifndef __this_cpu_add_return_2
+#  define __this_cpu_add_return_2(pcp, val) __this_cpu_add_return_generic((pcp), (val), 2)
+# endif
+# ifndef __this_cpu_add_return_4
+#  define __this_cpu_add_return_4(pcp, val) __this_cpu_add_return_generic((pcp), (val), 4)
+# endif
+# ifndef __this_cpu_add_return_8
+#  define __this_cpu_add_return_8(pcp, val) __this_cpu_add_return_generic((pcp), (val), 8)
+# endif
+# define __this_cpu_add_return(pcp, val) __pcpu_size_call_return(__this_cpu_add_return_, (val))
+#endif
+
 /*
  * IRQ safe versions of the per cpu RMW operations. Note that these operations
  * are *not* safe against modification of the same variable from another
@@ -903,4 +966,31 @@ static inline unsigned long irqsafe_cpu_
 # define irqsafe_cpu_xchg(pcp, new) __pcpu_size_call_return(irqsafe_cpu_xchg_, (new))
 #endif
 
+static inline unsigned long irqsafe_cpu_add_return_generic(volatile void *ptr,
+		unsigned long val, int size)
+{
+	unsigned long flags, result;
+
+	local_irq_save(flags);
+	result = __this_cpu_add_return_generic(ptr, val, size);
+	local_irq_restore(flags);
+	return result;
+}
+
+#ifndef irqsafe_cpu_add_return
+# ifndef irqsafe_cpu_add_return_1
+#  define irqsafe_cpu_add_return_1(pcp, val) irqsafe_cpu_add_return_generic(((pcp), (val), 1)
+# endif
+# ifndef irqsafe_cpu_add_return_2
+#  define irqsafe_cpu_add_return_2(pcp, val) irqsafe_cpu_add_return_generic(((pcp), (val), 2)
+# endif
+# ifndef irqsafe_cpu_add_return_4
+#  define irqsafe_cpu_add_return_4(pcp, val) irqsafe_cpu_add_return_generic(((pcp), (val), 4)
+# endif
+# ifndef irqsafe_cpu_add_return_8
+#  define irqsafe_cpu_add_return_8(pcp, val) irqsafe_cpu_add_return_generic(((pcp), (val), 8)
+# endif
+# define irqsafe_cpu_add_return(pcp, val) __pcpu_size_call_return(irqsafe_cpu_add_return_, (val))
+#endif
+
 #endif /* __LINUX_PERCPU_H */

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [this_cpu_xx V8 16/16] x86 support for this_cpu_add_return
  2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
                   ` (14 preceding siblings ...)
  2009-12-18 22:26 ` [this_cpu_xx V8 15/16] Generic support for this_cpu_add_return() Christoph Lameter
@ 2009-12-18 22:26 ` Christoph Lameter
  15 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-18 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

[-- Attachment #1: percpu_add_add_return_x86 --]
[-- Type: text/plain, Size: 2676 bytes --]

x86 has the xadd instruction, which can be used to implement a per cpu atomic
add-and-return operation.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 arch/x86/include/asm/percpu.h |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2009-12-18 15:55:07.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2009-12-18 16:06:33.000000000 -0600
@@ -165,6 +165,35 @@ do {							\
 	__tmp_old;							\
 })
 
+#define this_cpu_add_return_x86(var, val)		\
+({							\
+	typeof(var) pfo_ret__;				\
+	switch (sizeof(var)) {				\
+	case 1:						\
+		asm("xaddb %0,"__percpu_arg(1)		\
+		    : "+r" (val), "+m" (var)		\
+		    : "qi" ((pto_T__)(val)));		\
+		break;					\
+	case 2:						\
+		asm("xaddw %0,"__percpu_arg(1)		\
+		    : "+r" (val), "+m" (var)		\
+		    : "=r" (pfo_ret__)			\
+		break;					\
+	case 4:						\
+		asm("xaddl %0,"__percpu_arg(1)		\
+		    : "+r" (val), "+m" (var)		\
+		    : "=r" (pfo_ret__)			\
+		break;					\
+	case 8:						\
+		asm("xaddq %0,"__percpu_arg(1)		\
+		    : "+r" (val), "+m" (var)		\
+		    : "re" ((pto_T__)(val)));		\
+		break;					\
+	default: __bad_percpu_size();			\
+	}						\
+	pfo_ret__;					\
+})
+
 /*
  * percpu_read() makes gcc load the percpu variable every time it is
  * accessed while percpu_read_stable() allows the value to be cached.
@@ -216,6 +245,9 @@ do {							\
 #define __this_cpu_xchg_1(pcp, new)	this_cpu_xchg_x86((pcp), new)
 #define __this_cpu_xchg_2(pcp, new)	this_cpu_xchg_x86((pcp), new)
 #define __this_cpu_xchg_4(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define __this_cpu_add_return_1(pcp, val)	this_cpu_add_return_x86((pcp), val)
+#define __this_cpu_add_return_2(pcp, val)	this_cpu_add_return_x86((pcp), val)
+#define __this_cpu_add_return_4(pcp, val)	this_cpu_add_return_x86((pcp), val)
 
 #define this_cpu_read_1(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
 #define this_cpu_read_2(pcp)		percpu_from_op("mov", (pcp), "m"(pcp))
@@ -272,6 +304,9 @@ do {							\
 #define irqsafe_cpu_xchg_1(pcp, new)	this_cpu_xchg_x86((pcp), new)
 #define irqsafe_cpu_xchg_2(pcp, new)	this_cpu_xchg_x86((pcp), new)
 #define irqsafe_cpu_xchg_4(pcp, new)	this_cpu_xchg_x86((pcp), new)
+#define irqsafe_cpu_add_return_1(pcp, val)	this_cpu_add_return_x86((pcp), val)
+#define irqsafe_cpu_add_return_2(pcp, val)	this_cpu_add_return_x86((pcp), val)
+#define irqsafe_cpu_add_return_4(pcp, val)	this_cpu_add_return_x86((pcp), val)
 
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.

-- 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2009-12-18 22:26 ` [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg Christoph Lameter
@ 2009-12-19 14:45   ` Mathieu Desnoyers
  2009-12-22 15:54     ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2009-12-19 14:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

* Christoph Lameter (cl@linux-foundation.org) wrote:
> Provide support for this_cpu_cmpxchg.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  include/linux/percpu.h |   99 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 99 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2009-12-18 15:17:05.000000000 -0600
> +++ linux-2.6/include/linux/percpu.h	2009-12-18 15:33:12.000000000 -0600
> @@ -443,6 +443,62 @@ do {									\
>  # define this_cpu_xor(pcp, val)		__pcpu_size_call(this_cpu_or_, (pcp), (val))
>  #endif
>  
> +static inline unsigned long __this_cpu_cmpxchg_generic(volatile void *pptr,
> +		unsigned long old, unsigned long new, int size)
> +{
> +	unsigned long prev;
> +	volatile void * ptr = __this_cpu_ptr(pptr);
> +
> +	switch (size) {
> +	case 1: prev = *(u8 *)ptr;
> +		if (prev == old)
> +			*(u8 *)ptr = (u8)new;
> +		break;
> +	case 2: prev = *(u16 *)ptr;
> +		if (prev == old)
> +			*(u16 *)ptr = (u16)new;
> +		break;
> +	case 4: prev = *(u32 *)ptr;
> +		if (prev == old)
> +			*(u32 *)ptr = (u32)new;
> +		break;
> +	case 8: prev = *(u64 *)ptr;
> +		if (prev == old)
> +			*(u64 *)ptr = (u64)new;
> +		break;
> +	default:
> +		__bad_size_call_parameter();
> +	}
> +	return prev;
> +}

Hi Christoph,

I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
Given that what LTTng needs is basically an atomic, NMI-safe version of
the primitive (on all architectures that have something close to an NMI),
it could not switch over to your primitives until we add the equivalent
of the support we currently have with local_t to all architectures. The
transition would be faster if we created an atomic_cpu_*() variant which
would map to local_t operations in the initial version.

Or maybe have I missed something in your patchset that address this ?
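
To illustrate the concern with a minimal, hypothetical sketch (the
variable and the loop below are invented, they are not LTTng code): with
the generic fallback the cmpxchg is a separate load and store, so an NMI
firing in between that touches the same counter has its update silently
overwritten; a local_cmpxchg(), being a single cmpxchg instruction on
x86, cannot be split that way:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, commit_count);

static void reserve_slot(void)
{
	unsigned long old, new;

	do {
		old = this_cpu_read(commit_count);
		new = old + 1;
		/* With the generic fallback, an NMI can run between the
		 * load of prev and the store of new inside the cmpxchg
		 * below; its update is then lost even though prev == old
		 * was observed, so the retry loop gives no protection. */
	} while (this_cpu_cmpxchg(commit_count, old, new) != old);
}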

Thanks,

Mathieu

> +
> +static inline unsigned long this_cpu_cmpxchg_generic(volatile void *ptr,
> +		unsigned long old, unsigned long new, int size)
> +{
> +	unsigned long prev;
> +
> +	preempt_disable();
> +	prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
> +	preempt_enable();
> +	return prev;
> +}
> +
> +#ifndef this_cpu_cmpxchg
> +# ifndef this_cpu_cmpxchg_1
> +#  define this_cpu_cmpxchg_1(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
> +# endif
> +# ifndef this_cpu_cmpxchg_2
> +#  define this_cpu_cmpxchg_2(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
> +# endif
> +# ifndef this_cpu_cmpxchg_4
> +#  define this_cpu_cmpxchg_4(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
> +# endif
> +# ifndef this_cpu_cmpxchg_8
> +#  define this_cpu_cmpxchg_8(pcp, old, new) this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
> +# endif
> +# define this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
> +#endif
> +
>  /*
>   * Generic percpu operations that do not require preemption handling.
>   * Either we do not care about races or the caller has the
> @@ -594,6 +650,22 @@ do {									\
>  # define __this_cpu_xor(pcp, val)	__pcpu_size_call(__this_cpu_xor_, (pcp), (val))
>  #endif
>  
> +#ifndef __this_cpu_cmpxchg
> +# ifndef __this_cpu_cmpxchg_1
> +#  define __this_cpu_cmpxchg_1(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 1)
> +# endif
> +# ifndef __this_cpu_cmpxchg_2
> +#  define __this_cpu_cmpxchg_2(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 2)
> +# endif
> +# ifndef __this_cpu_cmpxchg_4
> +#  define __this_cpu_cmpxchg_4(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 4)
> +# endif
> +# ifndef __this_cpu_cmpxchg_8
> +#  define __this_cpu_cmpxchg_8(pcp, old, new) __this_cpu_cmpxchg_generic((pcp), (old), (new), 8)
> +# endif
> +# define __this_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(__this_cpu_cmpxchg_, (old), (new))
> +#endif
> +
>  /*
>   * IRQ safe versions of the per cpu RMW operations. Note that these operations
>   * are *not* safe against modification of the same variable from another
> @@ -709,4 +781,31 @@ do {									\
>  # define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
>  #endif
>  
> +static inline unsigned long irqsafe_cpu_cmpxchg_generic(volatile void *ptr,
> +		unsigned long old, unsigned long new, int size)
> +{
> +	unsigned long flags, prev;
> +
> +	local_irq_save(flags);
> +	prev = __this_cpu_cmpxchg_generic(ptr, old, new, size);
> +	local_irq_restore(flags);
> +	return prev;
> +}
> +
> +#ifndef irqsafe_cpu_cmpxchg
> +# ifndef irqsafe_cpu_cmpxchg_1
> +#  define irqsafe_cpu_cmpxchg_1(pcp, old, new) irqsafe_cpu_cmpxchg_generic((pcp), (old), (new), 1)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_2
> +#  define irqsafe_cpu_cmpxchg_2(pcp, old, new) irqsafe_cpu_cmpxchg_generic((pcp), (old), (new), 2)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_4
> +#  define irqsafe_cpu_cmpxchg_4(pcp, old, new) irqsafe_cpu_cmpxchg_generic((pcp), (old), (new), 4)
> +# endif
> +# ifndef irqsafe_cpu_cmpxchg_8
> +#  define irqsafe_cpu_cmpxchg_8(pcp, old, new) irqsafe_cpu_cmpxchg_generic((pcp), (old), (new), 8)
> +# endif
> +# define irqsafe_cpu_cmpxchg(pcp, old, new) __pcpu_size_call_return(irqsafe_cpu_cmpxchg_, (old), (new))
> +#endif
> +
>  #endif /* __LINUX_PERCPU_H */
> 
> -- 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 03/16] Use this_cpu operations in slub
  2009-12-18 22:26 ` [this_cpu_xx V8 03/16] Use this_cpu operations in slub Christoph Lameter
@ 2009-12-20  9:11   ` Pekka Enberg
  0 siblings, 0 replies; 38+ messages in thread
From: Pekka Enberg @ 2009-12-20  9:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Mathieu Desnoyers

Christoph Lameter wrote:
> Using per cpu allocations removes the need for the per cpu arrays in the
> kmem_cache struct. These could get quite big if we have to support systems
> with thousands of cpus. The use of this_cpu_xx operations results in:
> 
> 1. The size of kmem_cache for SMP configuration shrinks since we will only
>    need 1 pointer instead of NR_CPUS. The same pointer can be used by all
>    processors. Reduces cache footprint of the allocator.
> 
> 2. We can dynamically size kmem_cache according to the actual nodes in the
>    system meaning less memory overhead for configurations that may potentially
>    support up to 1k NUMA nodes / 4k cpus.
> 
> 3. We can remove the fiddling with allocating and releasing of
>    kmem_cache_cpu structures when bringing up and shutting down cpus. The cpu
>    alloc logic will do it all for us. Removes some portions of the cpu hotplug
>    functionality.
> 
> 4. Fastpath performance increases since per cpu pointer lookups and
>    address calculations are avoided.
> 
> V7-V8
> - Convert missed get_cpu_slab() under CONFIG_SLUB_STATS
> 
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Patches 3-6 applied.
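
For readers following along, a rough sketch of the layout change point 1
above describes (abbreviated, invented field names, not the actual slub
structures): the NR_CPUS-sized pointer array in kmem_cache becomes a
single pointer into the percpu area that every processor can use:

#include <linux/percpu.h>

struct kmem_cache_cpu_ex {		/* illustrative stand-in */
	void **freelist;
	struct page *page;
};

struct kmem_cache_ex {
	/* before: struct kmem_cache_cpu_ex *cpu_slab[NR_CPUS]; */
	struct kmem_cache_cpu_ex *cpu_slab;	/* from alloc_percpu() */
};

static void *fastpath_object_ex(struct kmem_cache_ex *s)
{
	struct kmem_cache_cpu_ex *c;
	void *object = NULL;

	preempt_disable();
	c = __this_cpu_ptr(s->cpu_slab);  /* same pointer value on every cpu */
	if (c->freelist)
		object = *c->freelist;
	preempt_enable();
	return object;
}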

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately
  2009-12-18 22:26 ` [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately Christoph Lameter
@ 2009-12-21  7:25   ` Tejun Heo
  2009-12-22 15:56     ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2009-12-21  7:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On 12/19/2009 07:26 AM, Christoph Lameter wrote:
> Current this_cpu ops only allow an arch to specify add RMW operations or inc
> and dec for all sizes. Some arches can do more efficient inc and dec
> operations. Allow size specific override of fallback functions like with
> the other operations.

Wouldn't it be better to use __builtin_constant_p() and switching in
arch add/dec macros?  It just seems a bit extreme to define all those
different variants from generic header.  I'm quite unsure whether
providing overrides for all the different size variants from generic
header is necessary at all.  If an arch is gonna override the
operation, it's probably best to just let it override the whole thing.
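
Something along these lines, for example (x86-flavoured and purely
illustrative; the macro below is only a sketch that assumes the existing
__percpu_arg() helper from asm/percpu.h, not a proposed implementation):
the add macro itself picks inc/dec when the addend is a compile-time
constant, so the generic header would not need per-size inc/dec hooks
at all.

#define arch_this_cpu_add_4_ex(pcp, val)			\
do {								\
	if (__builtin_constant_p(val) && (val) == 1)		\
		asm("incl "__percpu_arg(0) : "+m" (pcp));	\
	else if (__builtin_constant_p(val) && (val) == -1)	\
		asm("decl "__percpu_arg(0) : "+m" (pcp));	\
	else							\
		asm("addl %1,"__percpu_arg(0)			\
		    : "+m" (pcp) : "ri" (val));			\
} while (0)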

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-18 22:26 ` [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
@ 2009-12-21  7:47   ` Tejun Heo
  2009-12-21  7:59     ` Tejun Heo
  2009-12-22 15:57     ` Christoph Lameter
  0 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2009-12-21  7:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

Hello,

On 12/19/2009 07:26 AM, Christoph Lameter wrote:
> Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
> requirements, cache footprint and decreases cycle counts.
> 
> The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
> not drop the use of __module_ref_addr() which would make per cpu data handling
> complicated. this_cpu_xx operations have their own fallback for !SMP.
> 
> The last hold out of users of local_t is the tracing ringbuffer after this patch
> has been applied.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Please keep Rusty Russell cc'd on module changes.

> Index: linux-2.6/kernel/trace/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
> +++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
> @@ -12,6 +12,7 @@
>  #include <linux/hardirq.h>
>  #include <linux/kmemcheck.h>
>  #include <linux/module.h>
> +#include <asm/local.h>

This doesn't belong to this patch, right?  I stripped this part out,
added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch.  I'll
post the updated patch here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21  7:47   ` Tejun Heo
@ 2009-12-21  7:59     ` Tejun Heo
  2009-12-21  8:19       ` Tejun Heo
                         ` (2 more replies)
  2009-12-22 15:57     ` Christoph Lameter
  1 sibling, 3 replies; 38+ messages in thread
From: Tejun Heo @ 2009-12-21  7:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

Here's the version I committed to percpu branch.

Thanks.

>From 8af47ddd01364ae3c663c0bc92415c06fe887ba1 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 18 Dec 2009 16:26:24 -0600
Subject: [PATCH] module: Use this_cpu_xx to dynamically allocate counters

Use cpu ops to deal with the per cpu data instead of a local_t. This reduces
memory requirements and cache footprint and decreases cycle counts.

The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
not drop the use of __module_ref_addr(), which would make per cpu data handling
complicated. this_cpu_xx operations have their own fallback for !SMP.

After this patch is applied, the last remaining user of local_t is the tracing
ring buffer.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/module.h |   38 ++++++++++++++------------------------
 kernel/module.c        |   30 ++++++++++++++++--------------
 2 files changed, 30 insertions(+), 38 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 482efc8..9350577 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -16,8 +16,7 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
-
-#include <asm/local.h>
+#include <linux/percpu.h>
 #include <asm/module.h>
 
 #include <trace/events/module.h>
@@ -361,11 +360,9 @@ struct module
 	/* Destruction function. */
 	void (*exit)(void);
 
-#ifdef CONFIG_SMP
-	char *refptr;
-#else
-	local_t ref;
-#endif
+	struct module_ref {
+		int count;
+	} *refptr;
 #endif
 
 #ifdef CONFIG_CONSTRUCTORS
@@ -452,25 +449,16 @@ void __symbol_put(const char *symbol);
 #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
 void symbol_put_addr(void *addr);
 
-static inline local_t *__module_ref_addr(struct module *mod, int cpu)
-{
-#ifdef CONFIG_SMP
-	return (local_t *) (mod->refptr + per_cpu_offset(cpu));
-#else
-	return &mod->ref;
-#endif
-}
-
 /* Sometimes we know we already have a refcount, and it's easier not
    to handle the error case (which only happens with rmmod --wait). */
 static inline void __module_get(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_inc(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_inc(module->refptr->count);
 		trace_module_get(module, _THIS_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
-		put_cpu();
+				 __this_cpu_read(module->refptr->count));
+		preempt_enable();
 	}
 }
 
@@ -479,15 +467,17 @@ static inline int try_module_get(struct module *module)
 	int ret = 1;
 
 	if (module) {
-		unsigned int cpu = get_cpu();
+		preempt_disable();
+
 		if (likely(module_is_live(module))) {
-			local_inc(__module_ref_addr(module, cpu));
+			__this_cpu_inc(module->refptr->count);
 			trace_module_get(module, _THIS_IP_,
-				local_read(__module_ref_addr(module, cpu)));
+				__this_cpu_read(module->refptr->count));
 		}
 		else
 			ret = 0;
-		put_cpu();
+
+		preempt_enable();
 	}
 	return ret;
 }
diff --git a/kernel/module.c b/kernel/module.c
index 64787cd..9bc04de 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -474,9 +474,10 @@ static void module_unload_init(struct module *mod)
 
 	INIT_LIST_HEAD(&mod->modules_which_use_me);
 	for_each_possible_cpu(cpu)
-		local_set(__module_ref_addr(mod, cpu), 0);
+		per_cpu_ptr(mod->refptr, cpu)->count = 0;
+
 	/* Hold reference count during initialization. */
-	local_set(__module_ref_addr(mod, raw_smp_processor_id()), 1);
+	__this_cpu_write(mod->refptr->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				free_percpu(i->refptr);
 				break;
 			}
 		}
@@ -619,7 +621,7 @@ unsigned int module_refcount(struct module *mod)
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		total += local_read(__module_ref_addr(mod, cpu));
+		total += per_cpu_ptr(mod->refptr, cpu)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -796,14 +798,15 @@ static struct module_attribute refcnt = {
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(__module_ref_addr(module, cpu));
+		preempt_disable();
+		__this_cpu_dec(module->refptr->count);
+
 		trace_module_put(module, _RET_IP_,
-				 local_read(__module_ref_addr(module, cpu)));
+				 __this_cpu_read(module->refptr->count));
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);
@@ -1377,9 +1380,9 @@ static void free_module(struct module *mod)
 	kfree(mod->args);
 	if (mod->percpu)
 		percpu_modfree(mod->percpu);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
+#if defined(CONFIG_MODULE_UNLOAD)
 	if (mod->refptr)
-		percpu_modfree(mod->refptr);
+		free_percpu(mod->refptr);
 #endif
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -2145,9 +2148,8 @@ static noinline struct module *load_module(void __user *umod,
 	mod = (void *)sechdrs[modindex].sh_addr;
 	kmemleak_load_module(mod, hdr, sechdrs, secstrings);
 
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	mod->refptr = percpu_modalloc(sizeof(local_t), __alignof__(local_t),
-				      mod->name);
+#if defined(CONFIG_MODULE_UNLOAD)
+	mod->refptr = alloc_percpu(struct module_ref);
 	if (!mod->refptr) {
 		err = -ENOMEM;
 		goto free_init;
@@ -2373,8 +2375,8 @@ static noinline struct module *load_module(void __user *umod,
 	kobject_put(&mod->mkobj.kobj);
  free_unload:
 	module_unload_free(mod);
-#if defined(CONFIG_MODULE_UNLOAD) && defined(CONFIG_SMP)
-	percpu_modfree(mod->refptr);
+#if defined(CONFIG_MODULE_UNLOAD)
+	free_percpu(mod->refptr);
  free_init:
 #endif
 	module_free(mod, mod->module_init);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21  7:59     ` Tejun Heo
@ 2009-12-21  8:19       ` Tejun Heo
  2009-12-21 23:28       ` Rusty Russell
  2009-12-22 15:58       ` Christoph Lameter
  2 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2009-12-21  8:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

On 12/21/2009 04:59 PM, Tejun Heo wrote:
> Here's the version I committed to percpu branch.
> 
> Thanks.
> 
> From 8af47ddd01364ae3c663c0bc92415c06fe887ba1 Mon Sep 17 00:00:00 2001
> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Fri, 18 Dec 2009 16:26:24 -0600
> Subject: [PATCH] module: Use this_cpu_xx to dynamically allocate counters
> 
> Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory
> requirements, cache footprint and decreases cycle counts.
> 
> The this_cpu_xx operations are also used for !SMP mode. Otherwise we could
> not drop the use of __module_ref_addr() which would make per cpu data handling
> complicated. this_cpu_xx operations have their own fallback for !SMP.
> 
> The last hold out of users of local_t is the tracing ringbuffer after this patch
> has been applied.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Cc: Rusty Russell <rusty@rustcorp.com.au>

This was changed to Acked-by as Rusty acked on the previous thread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21  7:59     ` Tejun Heo
  2009-12-21  8:19       ` Tejun Heo
@ 2009-12-21 23:28       ` Rusty Russell
  2009-12-22  0:02         ` Tejun Heo
  2009-12-22 15:58       ` Christoph Lameter
  2 siblings, 1 reply; 38+ messages in thread
From: Rusty Russell @ 2009-12-21 23:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
>  				kfree(use);
>  				sysfs_remove_link(i->holders_dir, mod->name);
>  				/* There can be at most one match. */
> +				free_percpu(i->refptr);
>  				break;
>  			}
>  		}

This looks very wrong.

Rusty

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21 23:28       ` Rusty Russell
@ 2009-12-22  0:02         ` Tejun Heo
  2009-12-22 16:17           ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2009-12-22  0:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, linux-kernel, Mel Gorman, Pekka Enberg,
	Mathieu Desnoyers

On 12/22/2009 08:28 AM, Rusty Russell wrote:
> On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
>> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
>>  				kfree(use);
>>  				sysfs_remove_link(i->holders_dir, mod->name);
>>  				/* There can be at most one match. */
>> +				free_percpu(i->refptr);
>>  				break;
>>  			}
>>  		}
> 
> This looks very wrong.

Indeed, thanks for spotting it.  Christoph, I'm rolling back all
patches from this series.  Please re-post with updates.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2009-12-19 14:45   ` Mathieu Desnoyers
@ 2009-12-22 15:54     ` Christoph Lameter
  2009-12-22 17:24       ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-22 15:54 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

On Sat, 19 Dec 2009, Mathieu Desnoyers wrote:

> I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> Given that what LTTng needs is basically an atomic, nmi-safe version of
> the primitive (on all architectures that have something close to a NMI),
> this means that it could not switch over to your primitives until we add
> the equivalent support we currently have with local_t to all
> architectures. The transition would be faster if we create an
> atomic_cpu_*() variant which would map to local_t operations in the
> initial version.
>
> Or maybe have I missed something in your patchset that address this ?

NMI safety is not covered by the this_cpu operations.

We could add nmi_safe_... ops?

The atomic_cpu reference makes me think that you want full (LOCK)
semantics? Then use the regular atomic ops?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately
  2009-12-21  7:25   ` Tejun Heo
@ 2009-12-22 15:56     ` Christoph Lameter
  2009-12-23  2:08       ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-22 15:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On Mon, 21 Dec 2009, Tejun Heo wrote:

> On 12/19/2009 07:26 AM, Christoph Lameter wrote:
> > Current this_cpu ops only allow an arch to specify add RMW operations or inc
> > and dec for all sizes. Some arches can do more efficient inc and dec
> > operations. Allow size specific override of fallback functions like with
> > the other operations.
>
> Wouldn't it be better to use __builtin_constant_p() and switching in
> arch add/dec macros?  It just seems a bit extreme to define all those

Yes that could be done but I am on vacation till next year ;-)...


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21  7:47   ` Tejun Heo
  2009-12-21  7:59     ` Tejun Heo
@ 2009-12-22 15:57     ` Christoph Lameter
  2009-12-23  2:07       ` Tejun Heo
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2009-12-22 15:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

On Mon, 21 Dec 2009, Tejun Heo wrote:

>
> > Index: linux-2.6/kernel/trace/ring_buffer.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
> > +++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
> > @@ -12,6 +12,7 @@
> >  #include <linux/hardirq.h>
> >  #include <linux/kmemcheck.h>
> >  #include <linux/module.h>
> > +#include <asm/local.h>
>
> This doesn't belong to this patch, right?  I stripped this part out,
> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch.  I'll
> post the updated patch here. Thanks.

If you strip this part out then ring_buffer.c will no longer compile,
since it relies on the "#include <asm/local.h>" that this patch removes
from module.h.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-21  7:59     ` Tejun Heo
  2009-12-21  8:19       ` Tejun Heo
  2009-12-21 23:28       ` Rusty Russell
@ 2009-12-22 15:58       ` Christoph Lameter
  2 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-22 15:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

On Mon, 21 Dec 2009, Tejun Heo wrote:

> index 482efc8..9350577 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -16,8 +16,7 @@
>  #include <linux/kobject.h>
>  #include <linux/moduleparam.h>
>  #include <linux/tracepoint.h>
> -
> -#include <asm/local.h>
> +#include <linux/percpu.h>

This is going to break ring_buffer.c.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-22  0:02         ` Tejun Heo
@ 2009-12-22 16:17           ` Christoph Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2009-12-22 16:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

On Tue, 22 Dec 2009, Tejun Heo wrote:

> On 12/22/2009 08:28 AM, Rusty Russell wrote:
> > On Mon, 21 Dec 2009 06:29:36 pm Tejun Heo wrote:
> >> @@ -555,6 +556,7 @@ static void module_unload_free(struct module *mod)
> >>  				kfree(use);
> >>  				sysfs_remove_link(i->holders_dir, mod->name);
> >>  				/* There can be at most one match. */
> >> +				free_percpu(i->refptr);
> >>  				break;
> >>  			}
> >>  		}
> >
> > This looks very wrong.
>
> Indeed, thanks for spotting it.  Christoph, I'm rolling back all
> patches from this series.  Please re-post with updates.

Simply drop the chunk?



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2009-12-22 15:54     ` Christoph Lameter
@ 2009-12-22 17:24       ` Mathieu Desnoyers
  2010-01-04 17:21         ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2009-12-22 17:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Sat, 19 Dec 2009, Mathieu Desnoyers wrote:
> 
> > I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> > Given that what LTTng needs is basically an atomic, nmi-safe version of
> > the primitive (on all architectures that have something close to a NMI),
> > this means that it could not switch over to your primitives until we add
> > the equivalent support we currently have with local_t to all
> > architectures. The transition would be faster if we create an
> > atomic_cpu_*() variant which would map to local_t operations in the
> > initial version.
> >
> > Or maybe have I missed something in your patchset that address this ?
> 
> NMI safeness is not covered by this_cpu operations.
> 
> We could add nmi_safe_.... ops?
> 
> The atomic_cpu reference make me think that you want full (LOCK)
> semantics? Then use the regular atomic ops?

nmi_safe would probably make sense here.

But given that we have to disable preemption anyway to get precise trace
clock timestamps, I wonder whether we would really gain anything
significant performance-wise.

I also thought about the design change this requires for the per-cpu
buffer commit count pointer, which would have to become a per-cpu pointer
independent of the buffer structure, and I foresee a problem with
Steven's irqs-off tracing, which needs to perform buffer exchanges while
tracing is active. Basically, having only one top-level pointer for the
buffer makes it possible to exchange it atomically, but if we have to
keep two separate pointers (one for the per-cpu buffer, one for the
per-cpu commit count array), then we are stuck.

So given that the per-cpu ops limit us in terms of data structure layout,
I am less and less sure they are the best fit for ring buffers, especially
if we don't gain much performance-wise.
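
To make the layout concern concrete, here is a rough sketch (types and
names are invented for illustration, this is not the actual ring buffer
code): with the commit counter embedded in the per-cpu buffer, one xchg()
swaps both together; with a separate percpu allocation for the counters
there is no single pointer left to exchange atomically.

#include <linux/list.h>
#include <linux/threads.h>
#include <asm/local.h>
#include <asm/system.h>

struct rb_cpu_buffer_ex {
	local_t			commit;	/* embedded: travels with the buffer */
	struct list_head	pages;
};

struct rb_ex {
	struct rb_cpu_buffer_ex	*buffers[NR_CPUS];	/* one pointer per cpu */
	/* unsigned long *commit_counts;  <- the problematic split layout,
	 * which would have to be a separate alloc_percpu() allocation */
};

static void rb_swap_cpu_ex(struct rb_ex *rb, int cpu,
			   struct rb_cpu_buffer_ex **spare)
{
	/* a single atomic exchange covers the buffer and its commit count */
	*spare = xchg(&rb->buffers[cpu], *spare);
}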

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-22 15:57     ` Christoph Lameter
@ 2009-12-23  2:07       ` Tejun Heo
  2010-01-04 17:22         ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2009-12-23  2:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> On Mon, 21 Dec 2009, Tejun Heo wrote:
> 
>>
>>> Index: linux-2.6/kernel/trace/ring_buffer.c
>>> ===================================================================
>>> --- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
>>> +++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
>>> @@ -12,6 +12,7 @@
>>>  #include <linux/hardirq.h>
>>>  #include <linux/kmemcheck.h>
>>>  #include <linux/module.h>
>>> +#include <asm/local.h>
>>
>> This doesn't belong to this patch, right?  I stripped this part out,
>> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch.  I'll
>> post the updated patch here. Thanks.
> 
> If you strip this part out then ringbuffer.c will no longer compile
> since it relies on the "#include <local.h>" that is removed by this patch
> from the module.h file.

Oh... alright.  I'll add that comment and drop the offending chunk and
recommit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately
  2009-12-22 15:56     ` Christoph Lameter
@ 2009-12-23  2:08       ` Tejun Heo
  0 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2009-12-23  2:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers

Hello, Christoph.

On 12/23/2009 12:56 AM, Christoph Lameter wrote:
> On Mon, 21 Dec 2009, Tejun Heo wrote:
> 
>> On 12/19/2009 07:26 AM, Christoph Lameter wrote:
>>> Current this_cpu ops only allow an arch to specify add RMW operations or inc
>>> and dec for all sizes. Some arches can do more efficient inc and dec
>>> operations. Allow size specific override of fallback functions like with
>>> the other operations.
>>
>> Wouldn't it be better to use __builtin_constant_p() and switching in
>> arch add/dec macros?  It just seems a bit extreme to define all those
> 
> Yes that could be done but I am on vacation till next year ;-)...

I'm not sure how much I will be working from now till the end of the
year, but if I end up doing some work I'll give it a shot.

Happy holidays and enjoy your vacation!

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2009-12-22 17:24       ` Mathieu Desnoyers
@ 2010-01-04 17:21         ` Christoph Lameter
  2010-01-05 22:29           ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2010-01-04 17:21 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

On Tue, 22 Dec 2009, Mathieu Desnoyers wrote:

> > > I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> > > Given that what LTTng needs is basically an atomic, nmi-safe version of
> > > the primitive (on all architectures that have something close to a NMI),
> > > this means that it could not switch over to your primitives until we add
> > > the equivalent support we currently have with local_t to all
> > > architectures. The transition would be faster if we create an
> > > atomic_cpu_*() variant which would map to local_t operations in the
> > > initial version.
> > >
> > > Or maybe have I missed something in your patchset that address this ?
> >
> > NMI safeness is not covered by this_cpu operations.
> >
> > We could add nmi_safe_.... ops?
> >
> > The atomic_cpu reference make me think that you want full (LOCK)
> > semantics? Then use the regular atomic ops?
>
> nmi_safe would probably make sense here.

I am not sure how to implement a fallback for nmi_safe operations, though,
since there is no way of disabling NMIs.

> But given that we have to disable preemption to add precision in terms
> of trace clock timestamp, I wonder if we would really gain something
> considerable performance-wise.

Not sure what exactly you are attempting to do there.

> I also thought about the design change this requires for the per-cpu
> buffer commit count pointer which would have to become a per-cpu pointer
> independent of the buffer structure, and I foresee a problem with
> Steven's irq off tracing which need to perform buffer exchanges while
> tracing is active. Basically, having only one top-level pointer for the
> buffer makes it possible to exchange it atomically, but if we have to
> have two separate pointers (one for per-cpu buffer, one for per-cpu
> commit count array), then we are stucked.

You just need to keep percpu pointers that are offsets into the percpu
area. They can be relocated as needed to the processor-specific addresses
using the cpu ops.
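
Something like the following minimal sketch (invented names, not taken
from the ring buffer code): the structure only stores the offset returned
by alloc_percpu(), and each access relocates it, either to the local cpu
with the this_cpu ops or to an explicit cpu with per_cpu_ptr():

#include <linux/percpu.h>

struct rb_counts_ex {
	unsigned long *commit;		/* percpu offset from alloc_percpu() */
};

static void rb_commit_ex(struct rb_counts_ex *rb)
{
	/* relocate to this cpu's instance and increment it; the write
	 * path is assumed to already run with preemption disabled */
	__this_cpu_inc(*rb->commit);
}

static unsigned long rb_read_commit_ex(struct rb_counts_ex *rb, int cpu)
{
	/* relocate to an arbitrary cpu's instance, e.g. for readers */
	return *per_cpu_ptr(rb->commit, cpu);
}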

> So given that per-cpu ops limits us in terms of data structure layout, I
> am less and less sure it's the best fit for ring buffers, especially if
> we don't gain much performance-wise.

I don't understand exactly how the ring buffer logic works and what you are
trying to do here.

The ring buffers are per-cpu structures, right, and you do not change cpus
while performing operations on them? If you do change cpus, then the per cpu
ops are not useful to you.

But in that case: how can you safely use the local_t operations for the
ring buffer logic either?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2009-12-23  2:07       ` Tejun Heo
@ 2010-01-04 17:22         ` Christoph Lameter
  2010-01-04 17:51           ` Mathieu Desnoyers
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Lameter @ 2010-01-04 17:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Mel Gorman, Pekka Enberg, Mathieu Desnoyers, Rusty Russell

On Wed, 23 Dec 2009, Tejun Heo wrote:

> On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> > On Mon, 21 Dec 2009, Tejun Heo wrote:
> >
> >>
> >>> Index: linux-2.6/kernel/trace/ring_buffer.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
> >>> +++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
> >>> @@ -12,6 +12,7 @@
> >>>  #include <linux/hardirq.h>
> >>>  #include <linux/kmemcheck.h>
> >>>  #include <linux/module.h>
> >>> +#include <asm/local.h>
> >>
> >> This doesn't belong to this patch, right?  I stripped this part out,
> >> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch.  I'll
> >> post the updated patch here. Thanks.
> >
> > If you strip this part out then ringbuffer.c will no longer compile
> > since it relies on the "#include <local.h>" that is removed by this patch
> > from the module.h file.
>
> Oh... alright.  I'll add that comment and drop the offending chunk and
> recommit.

So we need a separate patch to deal with removal of the #include
<asm/local.h> from module.h?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters
  2010-01-04 17:22         ` Christoph Lameter
@ 2010-01-04 17:51           ` Mathieu Desnoyers
  0 siblings, 0 replies; 38+ messages in thread
From: Mathieu Desnoyers @ 2010-01-04 17:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg, Rusty Russell,
	Steven Rostedt

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Wed, 23 Dec 2009, Tejun Heo wrote:
> 
> > On 12/23/2009 12:57 AM, Christoph Lameter wrote:
> > > On Mon, 21 Dec 2009, Tejun Heo wrote:
> > >
> > >>
> > >>> Index: linux-2.6/kernel/trace/ring_buffer.c
> > >>> ===================================================================
> > >>> --- linux-2.6.orig/kernel/trace/ring_buffer.c	2009-12-18 13:13:24.000000000 -0600
> > >>> +++ linux-2.6/kernel/trace/ring_buffer.c	2009-12-18 14:15:57.000000000 -0600
> > >>> @@ -12,6 +12,7 @@
> > >>>  #include <linux/hardirq.h>
> > >>>  #include <linux/kmemcheck.h>
> > >>>  #include <linux/module.h>
> > >>> +#include <asm/local.h>
> > >>
> > >> This doesn't belong to this patch, right?  I stripped this part out,
> > >> added Cc: to Rusty and applied 1, 2, 7 and 8 to percpu branch.  I'll
> > >> post the updated patch here. Thanks.
> > >
> > > If you strip this part out then ringbuffer.c will no longer compile
> > > since it relies on the "#include <local.h>" that is removed by this patch
> > > from the module.h file.
> >
> > Oh... alright.  I'll add that comment and drop the offending chunk and
> > recommit.
> 
> So we need a separate patch to deal with remval of the #include
> <asm/local.h> from modules.h?
> 

I think this would make sense, yes. This would be a patch specific to
the ring-buffer code that would go through (or be acked by) Steven.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2010-01-04 17:21         ` Christoph Lameter
@ 2010-01-05 22:29           ` Mathieu Desnoyers
  2010-01-05 22:35             ` Christoph Lameter
  0 siblings, 1 reply; 38+ messages in thread
From: Mathieu Desnoyers @ 2010-01-05 22:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

* Christoph Lameter (cl@linux-foundation.org) wrote:
> On Tue, 22 Dec 2009, Mathieu Desnoyers wrote:
> 
> > > > I am a bit concerned about the "generic" version of this_cpu_cmpxchg.
> > > > Given that what LTTng needs is basically an atomic, nmi-safe version of
> > > > the primitive (on all architectures that have something close to a NMI),
> > > > this means that it could not switch over to your primitives until we add
> > > > the equivalent support we currently have with local_t to all
> > > > architectures. The transition would be faster if we create an
> > > > atomic_cpu_*() variant which would map to local_t operations in the
> > > > initial version.
> > > >
> > > > Or maybe have I missed something in your patchset that address this ?
> > >
> > > NMI safeness is not covered by this_cpu operations.
> > >
> > > We could add nmi_safe_.... ops?
> > >
> > > The atomic_cpu reference make me think that you want full (LOCK)
> > > semantics? Then use the regular atomic ops?
> >
> > nmi_safe would probably make sense here.
> 
> I am not sure how to implement fallback for nmi_safe operations though
> since there is no way of disabling NMIs.
> 
> > But given that we have to disable preemption to add precision in terms
> > of trace clock timestamp, I wonder if we would really gain something
> > considerable performance-wise.
> 
> Not sure what exactly you attempt to do there.
> 
> > I also thought about the design change this requires for the per-cpu
> > buffer commit count pointer which would have to become a per-cpu pointer
> > independent of the buffer structure, and I foresee a problem with
> > Steven's irq off tracing which need to perform buffer exchanges while
> > tracing is active. Basically, having only one top-level pointer for the
> > buffer makes it possible to exchange it atomically, but if we have to
> > have two separate pointers (one for per-cpu buffer, one for per-cpu
> > commit count array), then we are stucked.
> 
> You just need to keep percpu pointers that are offsets into the percpu
> area. They can be relocated as needed to the processor specific addresses
> using the cpu ops.
> 
> > So given that per-cpu ops limits us in terms of data structure layout, I
> > am less and less sure it's the best fit for ring buffers, especially if
> > we don't gain much performance-wise.
> 
> I dont understand how exactly the ring buffer logic works and what you are
> trying to do here.
> 
> The ringbuffers are per cpu structures right and you do not change cpus
> while performing operations on them? If not then the per cpu ops are not
> useful to you.

To make a long story short:

In the scheme where Steven moves buffers from one CPU to another, he only
performs this operation when some other exclusion mechanism ensures that
the buffer is not being used for writing by that CPU while the move is
done.

It is therefore correct, and it needs the local_t type to deal with the
data structure hierarchy vs. atomic exchange issue I pointed out in my
email.

Mathieu

> 
> If you dont: How can you safely use the local_t operations for the
> ringbuffer logic?
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg
  2010-01-05 22:29           ` Mathieu Desnoyers
@ 2010-01-05 22:35             ` Christoph Lameter
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Lameter @ 2010-01-05 22:35 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Tejun Heo, linux-kernel, Mel Gorman, Pekka Enberg

On Tue, 5 Jan 2010, Mathieu Desnoyers wrote:

> This scheme where Steven moves buffers from one CPU to another, he only
> performs this operation when some other exclusion mechanism ensures that
> the buffer is not used for writing by the CPU when this move operation
> is done.
>
> It is therefore correct, and needs local_t type to deal with the data
> structure hierarchy vs atomic exchange as I pointed in my email.

Yes, this_cpu_xx does not seem to work for your scheme. Please review the
patchset that I sent you instead.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2010-01-05 22:37 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-18 22:26 [this_cpu_xx V8 00/16] Per cpu atomics in core allocators, cleanup and more this_cpu_ops Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 01/16] this_cpu_ops: page allocator conversion Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 02/16] this_cpu ops: Remove pageset_notifier Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 03/16] Use this_cpu operations in slub Christoph Lameter
2009-12-20  9:11   ` Pekka Enberg
2009-12-18 22:26 ` [this_cpu_xx V8 04/16] SLUB: Get rid of dynamic DMA kmalloc cache allocation Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 05/16] this_cpu: Remove slub kmem_cache fields Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 06/16] Make slub statistics use this_cpu_inc Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 07/16] Module handling: Use this_cpu_xx to dynamically allocate counters Christoph Lameter
2009-12-21  7:47   ` Tejun Heo
2009-12-21  7:59     ` Tejun Heo
2009-12-21  8:19       ` Tejun Heo
2009-12-21 23:28       ` Rusty Russell
2009-12-22  0:02         ` Tejun Heo
2009-12-22 16:17           ` Christoph Lameter
2009-12-22 15:58       ` Christoph Lameter
2009-12-22 15:57     ` Christoph Lameter
2009-12-23  2:07       ` Tejun Heo
2010-01-04 17:22         ` Christoph Lameter
2010-01-04 17:51           ` Mathieu Desnoyers
2009-12-18 22:26 ` [this_cpu_xx V8 08/16] Remove cpu_local_xx macros Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 09/16] Allow arch to provide inc/dec functionality for each size separately Christoph Lameter
2009-12-21  7:25   ` Tejun Heo
2009-12-22 15:56     ` Christoph Lameter
2009-12-23  2:08       ` Tejun Heo
2009-12-18 22:26 ` [this_cpu_xx V8 10/16] Support generating inc/dec for this_cpu_inc/dec Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 11/16] Generic support for this_cpu_cmpxchg Christoph Lameter
2009-12-19 14:45   ` Mathieu Desnoyers
2009-12-22 15:54     ` Christoph Lameter
2009-12-22 17:24       ` Mathieu Desnoyers
2010-01-04 17:21         ` Christoph Lameter
2010-01-05 22:29           ` Mathieu Desnoyers
2010-01-05 22:35             ` Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 12/16] Add percpu cmpxchg operations Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 13/16] Generic support for this_cpu_xchg Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 14/16] x86 percpu xchg operation Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 15/16] Generic support for this_cpu_add_return() Christoph Lameter
2009-12-18 22:26 ` [this_cpu_xx V8 16/16] x86 support for this_cpu_add_return Christoph Lameter
