* [PATCH 00/10] perf/x86  queue
@ 2010-01-22 15:50 Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 01/10] perf_events: improve x86 event scheduling (v5) Peter Zijlstra
                   ` (9 more replies)
  0 siblings, 10 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

This is my current perf/x86 queue. Once I've tested this lot a bit, I'll
look at stacking the AMD stuff on top.

After that I'll have a look at Robert's bits; most of that looked quite nice.



* [PATCH 01/10] perf_events: improve x86 event scheduling (v5)
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 02/10] perf_events: Add fast-path to the rescheduling code Peter Zijlstra
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: stephane-perf_events-improve_x86_event_scheduling.patch --]
[-- Type: text/plain, Size: 30456 bytes --]

This patch improves event scheduling by maximizing the use
of PMU registers regardless of the order in which events are
created in a group.

The algorithm takes into account the list of counter constraints
for each event. It assigns events to counters from the most
constrained, i.e., an event that works on only one counter, to the
least constrained, i.e., an event that works on any counter.

Intel fixed-counter events and the BTS special event are also
handled via this algorithm, which is designed to be fairly generic.

The patch also updates the validation of an event to use the
scheduling algorithm. This will cause early failure in
perf_event_open().

The second version of this patch follows the model used by PPC,
running the scheduling algorithm and the actual assignment
separately. Actual assignment takes place in hw_perf_enable(),
whereas scheduling is implemented in hw_perf_group_sched_in()
and x86_pmu_enable().
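
To illustrate the order in which counters are handed out, here is a
small stand-alone user-space sketch of the idea (the counter count,
event names and constraint masks are made up, and it models the
approach rather than reproducing the kernel code): events are placed
starting with the most constrained, and each one takes the first free
counter its mask allows; if an event ends up without a counter, the
schedule as a whole fails.

#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 4

struct ev {
	const char *name;
	uint64_t allowed;		/* bit i set => event may use counter i */
};

static int popcount(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)		/* clear the lowest set bit */
		n++;
	return n;
}

int main(void)
{
	struct ev evs[] = {
		{ "FP_ASSIST", 0x2 },	/* counter 1 only                */
		{ "MUL",       0x2 },	/* counter 1 only: will conflict */
		{ "CYCLES",    0xf },	/* any counter                   */
		{ "INSTR",     0xf },	/* any counter                   */
	};
	enum { NUM_EVENTS = sizeof(evs) / sizeof(evs[0]) };
	int assign[NUM_EVENTS];
	uint64_t used = 0;
	int scheduled = 0;
	int w, i, c;

	for (i = 0; i < NUM_EVENTS; i++)
		assign[i] = -1;

	/*
	 * weight w = number of usable counters; place weight-1 events
	 * first, any-counter events last (most to least constrained)
	 */
	for (w = 1; w <= NUM_COUNTERS && scheduled < NUM_EVENTS; w++) {
		for (i = 0; i < NUM_EVENTS; i++) {
			if (assign[i] != -1 || popcount(evs[i].allowed) != w)
				continue;
			for (c = 0; c < NUM_COUNTERS; c++) {
				if ((evs[i].allowed & (1ULL << c)) &&
				    !(used & (1ULL << c))) {
					used |= 1ULL << c;
					assign[i] = c;
					scheduled++;
					break;
				}
			}
		}
	}

	for (i = 0; i < NUM_EVENTS; i++)
		printf("%-10s -> counter %d\n", evs[i].name, assign[i]);
	printf("%s\n", scheduled == NUM_EVENTS ? "schedule ok" : "schedule fails");
	return 0;
}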

Signed-off-by: Stephane Eranian <eranian@google.com>
[a.p.zijlstra@chello.nl: fixup whitespace and style nits as well as
                         adding is_x86_event() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b5430c6.0f975e0a.1bf9.ffff85fe@mx.google.com>
---
 arch/x86/include/asm/perf_event.h |   16 
 arch/x86/kernel/cpu/perf_event.c  |  771 +++++++++++++++++++++++++++-----------
 2 files changed, 572 insertions(+), 215 deletions(-)

Index: linux-2.6/arch/x86/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -27,7 +27,14 @@
 /*
  * Includes eventsel and unit mask as well:
  */
-#define ARCH_PERFMON_EVENT_MASK				    0xffff
+
+
+#define INTEL_ARCH_EVTSEL_MASK		0x000000FFULL
+#define INTEL_ARCH_UNIT_MASK		0x0000FF00ULL
+#define INTEL_ARCH_EDGE_MASK		0x00040000ULL
+#define INTEL_ARCH_INV_MASK		0x00800000ULL
+#define INTEL_ARCH_CNT_MASK		0xFF000000ULL
+#define INTEL_ARCH_EVENT_MASK	(INTEL_ARCH_UNIT_MASK|INTEL_ARCH_EVTSEL_MASK)
 
 /*
  * filter mask to validate fixed counter events.
@@ -38,7 +45,12 @@
  *  The other filters are supported by fixed counters.
  *  The any-thread option is supported starting with v3.
  */
-#define ARCH_PERFMON_EVENT_FILTER_MASK			0xff840000
+#define INTEL_ARCH_FIXED_MASK \
+	(INTEL_ARCH_CNT_MASK| \
+	 INTEL_ARCH_INV_MASK| \
+	 INTEL_ARCH_EDGE_MASK|\
+	 INTEL_ARCH_UNIT_MASK|\
+	 INTEL_ARCH_EVENT_MASK)
 
 #define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL		      0x3c
 #define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK		(0x00 << 8)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -7,6 +7,7 @@
  *  Copyright (C) 2009 Advanced Micro Devices, Inc., Robert Richter
  *  Copyright (C) 2008-2009 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *  Copyright (C) 2009 Intel Corporation, <markus.t.metzger@intel.com>
+ *  Copyright (C) 2009 Google, Inc., Stephane Eranian
  *
  *  For licencing details see kernel-base/COPYING
  */
@@ -68,26 +69,37 @@ struct debug_store {
 	u64	pebs_event_reset[MAX_PEBS_EVENTS];
 };
 
+#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
+
+struct event_constraint {
+	u64	idxmsk[BITS_TO_U64(X86_PMC_IDX_MAX)];
+	int	code;
+	int	cmask;
+};
+
 struct cpu_hw_events {
-	struct perf_event	*events[X86_PMC_IDX_MAX];
-	unsigned long		used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	struct perf_event	*events[X86_PMC_IDX_MAX]; /* in counter order */
 	unsigned long		active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	unsigned long		interrupts;
 	int			enabled;
 	struct debug_store	*ds;
-};
 
-struct event_constraint {
-	unsigned long	idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
-	int		code;
+	int			n_events;
+	int			n_added;
+	int			assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
+	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 };
 
-#define EVENT_CONSTRAINT(c, m) { .code = (c), .idxmsk[0] = (m) }
-#define EVENT_CONSTRAINT_END  { .code = 0, .idxmsk[0] = 0 }
+#define EVENT_CONSTRAINT(c, n, m) { \
+	.code = (c),	\
+	.cmask = (m),	\
+	.idxmsk[0] = (n) }
 
-#define for_each_event_constraint(e, c) \
-	for ((e) = (c); (e)->idxmsk[0]; (e)++)
+#define EVENT_CONSTRAINT_END \
+	{ .code = 0, .cmask = 0, .idxmsk[0] = 0 }
 
+#define for_each_event_constraint(e, c) \
+	for ((e) = (c); (e)->cmask; (e)++)
 
 /*
  * struct x86_pmu - generic x86 pmu
@@ -114,8 +126,9 @@ struct x86_pmu {
 	u64		intel_ctrl;
 	void		(*enable_bts)(u64 config);
 	void		(*disable_bts)(void);
-	int		(*get_event_idx)(struct cpu_hw_events *cpuc,
-					 struct hw_perf_event *hwc);
+	void		(*get_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event, u64 *idxmsk);
+	void		(*put_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event);
+	const struct event_constraint *event_constraints;
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
@@ -124,7 +137,8 @@ static DEFINE_PER_CPU(struct cpu_hw_even
 	.enabled = 1,
 };
 
-static const struct event_constraint *event_constraints;
+static int x86_perf_event_set_period(struct perf_event *event,
+			     struct hw_perf_event *hwc, int idx);
 
 /*
  * Not sure about some of these
@@ -171,14 +185,14 @@ static u64 p6_pmu_raw_event(u64 hw_event
 	return hw_event & P6_EVNTSEL_MASK;
 }
 
-static const struct event_constraint intel_p6_event_constraints[] =
+static struct event_constraint intel_p6_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc1, 0x1),	/* FLOPS */
-	EVENT_CONSTRAINT(0x10, 0x1),	/* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x1),	/* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2),	/* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2),	/* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1),	/* CYCLES_DIV_BUSY */
+	EVENT_CONSTRAINT(0xc1, 0x1, INTEL_ARCH_EVENT_MASK),	/* FLOPS */
+	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_COMP_OPS_EXE */
+	EVENT_CONSTRAINT(0x11, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_ASSIST */
+	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK),	/* MUL */
+	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK),	/* DIV */
+	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK),	/* CYCLES_DIV_BUSY */
 	EVENT_CONSTRAINT_END
 };
 
@@ -196,32 +210,43 @@ static const u64 intel_perfmon_event_map
   [PERF_COUNT_HW_BUS_CYCLES]		= 0x013c,
 };
 
-static const struct event_constraint intel_core_event_constraints[] =
+static struct event_constraint intel_core_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0x10, 0x1),	/* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x2),	/* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2),	/* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2),	/* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1),	/* CYCLES_DIV_BUSY */
-	EVENT_CONSTRAINT(0x18, 0x1),	/* IDLE_DURING_DIV */
-	EVENT_CONSTRAINT(0x19, 0x2),	/* DELAYED_BYPASS */
-	EVENT_CONSTRAINT(0xa1, 0x1),	/* RS_UOPS_DISPATCH_CYCLES */
-	EVENT_CONSTRAINT(0xcb, 0x1),	/* MEM_LOAD_RETIRED */
+	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
+	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
+	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK), /* FP_COMP_OPS_EXE */
+	EVENT_CONSTRAINT(0x11, 0x2, INTEL_ARCH_EVENT_MASK), /* FP_ASSIST */
+	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK), /* MUL */
+	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK), /* DIV */
+	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK), /* CYCLES_DIV_BUSY */
+	EVENT_CONSTRAINT(0x18, 0x1, INTEL_ARCH_EVENT_MASK), /* IDLE_DURING_DIV */
+	EVENT_CONSTRAINT(0x19, 0x2, INTEL_ARCH_EVENT_MASK), /* DELAYED_BYPASS */
+	EVENT_CONSTRAINT(0xa1, 0x1, INTEL_ARCH_EVENT_MASK), /* RS_UOPS_DISPATCH_CYCLES */
+	EVENT_CONSTRAINT(0xcb, 0x1, INTEL_ARCH_EVENT_MASK), /* MEM_LOAD_RETIRED */
 	EVENT_CONSTRAINT_END
 };
 
-static const struct event_constraint intel_nehalem_event_constraints[] =
+static struct event_constraint intel_nehalem_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0x40, 0x3),	/* L1D_CACHE_LD */
-	EVENT_CONSTRAINT(0x41, 0x3),	/* L1D_CACHE_ST */
-	EVENT_CONSTRAINT(0x42, 0x3),	/* L1D_CACHE_LOCK */
-	EVENT_CONSTRAINT(0x43, 0x3),	/* L1D_ALL_REF */
-	EVENT_CONSTRAINT(0x4e, 0x3),	/* L1D_PREFETCH */
-	EVENT_CONSTRAINT(0x4c, 0x3),	/* LOAD_HIT_PRE */
-	EVENT_CONSTRAINT(0x51, 0x3),	/* L1D */
-	EVENT_CONSTRAINT(0x52, 0x3),	/* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0x53, 0x3),	/* L1D_CACHE_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0xc5, 0x3),	/* CACHE_LOCK_CYCLES */
+	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
+	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
+	EVENT_CONSTRAINT(0x40, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LD */
+	EVENT_CONSTRAINT(0x41, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_ST */
+	EVENT_CONSTRAINT(0x42, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK */
+	EVENT_CONSTRAINT(0x43, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_ALL_REF */
+	EVENT_CONSTRAINT(0x4e, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_PREFETCH */
+	EVENT_CONSTRAINT(0x4c, 0x3, INTEL_ARCH_EVENT_MASK), /* LOAD_HIT_PRE */
+	EVENT_CONSTRAINT(0x51, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D */
+	EVENT_CONSTRAINT(0x52, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
+	EVENT_CONSTRAINT(0x53, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK_FB_HIT */
+	EVENT_CONSTRAINT(0xc5, 0x3, INTEL_ARCH_EVENT_MASK), /* CACHE_LOCK_CYCLES */
+	EVENT_CONSTRAINT_END
+};
+
+static struct event_constraint intel_gen_event_constraints[] =
+{
+	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
+	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
 	EVENT_CONSTRAINT_END
 };
 
@@ -527,11 +552,11 @@ static u64 intel_pmu_raw_event(u64 hw_ev
 #define CORE_EVNTSEL_REG_MASK		0xFF000000ULL
 
 #define CORE_EVNTSEL_MASK		\
-	(CORE_EVNTSEL_EVENT_MASK |	\
-	 CORE_EVNTSEL_UNIT_MASK  |	\
-	 CORE_EVNTSEL_EDGE_MASK  |	\
-	 CORE_EVNTSEL_INV_MASK  |	\
-	 CORE_EVNTSEL_REG_MASK)
+	(INTEL_ARCH_EVTSEL_MASK |	\
+	 INTEL_ARCH_UNIT_MASK   |	\
+	 INTEL_ARCH_EDGE_MASK   |	\
+	 INTEL_ARCH_INV_MASK    |	\
+	 INTEL_ARCH_CNT_MASK)
 
 	return hw_event & CORE_EVNTSEL_MASK;
 }
@@ -1120,9 +1145,15 @@ static void amd_pmu_disable_all(void)
 
 void hw_perf_disable(void)
 {
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
 	if (!x86_pmu_initialized())
 		return;
-	return x86_pmu.disable_all();
+
+	if (cpuc->enabled)
+		cpuc->n_added = 0;
+
+	x86_pmu.disable_all();
 }
 
 static void p6_pmu_enable_all(void)
@@ -1189,10 +1220,237 @@ static void amd_pmu_enable_all(void)
 	}
 }
 
+static const struct pmu pmu;
+
+static inline int is_x86_event(struct perf_event *event)
+{
+	return event->pmu == &pmu;
+}
+
+static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+	int i, j , w, num;
+	int weight, wmax;
+	unsigned long *c;
+	u64 constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	struct hw_perf_event *hwc;
+
+	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+
+	for (i = 0; i < n; i++) {
+		x86_pmu.get_event_constraints(cpuc,
+					      cpuc->event_list[i],
+					      constraints[i]);
+	}
+
+	/*
+	 * weight = number of possible counters
+	 *
+	 * 1    = most constrained, only works on one counter
+	 * wmax = least constrained, works on any counter
+	 *
+	 * assign events to counters starting with most
+	 * constrained events.
+	 */
+	wmax = x86_pmu.num_events;
+
+	/*
+	 * when fixed event counters are present,
+	 * wmax is incremented by 1 to account
+	 * for one more choice
+	 */
+	if (x86_pmu.num_events_fixed)
+		wmax++;
+
+	num = n;
+	for (w = 1; num && w <= wmax; w++) {
+		/* for each event */
+		for (i = 0; i < n; i++) {
+			c = (unsigned long *)constraints[i];
+			hwc = &cpuc->event_list[i]->hw;
+
+			weight = bitmap_weight(c, X86_PMC_IDX_MAX);
+			if (weight != w)
+				continue;
+
+			/*
+			 * try to reuse previous assignment
+			 *
+			 * This is possible despite the fact that
+			 * events or events order may have changed.
+			 *
+			 * What matters is the level of constraints
+			 * of an event and this is constant for now.
+			 *
+			 * This is possible also because we always
+			 * scan from most to least constrained. Thus,
+			 * if a counter can be reused, it means no
+			 * more constrained event needed it. And
+			 * next events will either compete for it
+			 * (which cannot be solved anyway) or they
+			 * have fewer constraints, and they can use
+			 * another counter.
+			 */
+			j = hwc->idx;
+			if (j != -1 && !test_bit(j, used_mask))
+				goto skip;
+
+			for_each_bit(j, c, X86_PMC_IDX_MAX) {
+				if (!test_bit(j, used_mask))
+					break;
+			}
+
+			if (j == X86_PMC_IDX_MAX)
+				break;
+skip:
+			set_bit(j, used_mask);
+
+#if 0
+			pr_debug("CPU%d config=0x%llx idx=%d assign=%c\n",
+				smp_processor_id(),
+				hwc->config,
+				j,
+				assign ? 'y' : 'n');
+#endif
+
+			if (assign)
+				assign[i] = j;
+			num--;
+		}
+	}
+	/*
+	 * scheduling failed or is just a simulation,
+	 * free resources if necessary
+	 */
+	if (!assign || num) {
+		for (i = 0; i < n; i++) {
+			if (x86_pmu.put_event_constraints)
+				x86_pmu.put_event_constraints(cpuc, cpuc->event_list[i]);
+		}
+	}
+	return num ? -ENOSPC : 0;
+}
+
+/*
+ * dogrp: true if we must also collect the sibling events (i.e. the whole group)
+ * returns the total number of events, or an error code
+ */
+static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader, bool dogrp)
+{
+	struct perf_event *event;
+	int n, max_count;
+
+	max_count = x86_pmu.num_events + x86_pmu.num_events_fixed;
+
+	/* current number of events already accepted */
+	n = cpuc->n_events;
+
+	if (is_x86_event(leader)) {
+		if (n >= max_count)
+			return -ENOSPC;
+		cpuc->event_list[n] = leader;
+		n++;
+	}
+	if (!dogrp)
+		return n;
+
+	list_for_each_entry(event, &leader->sibling_list, group_entry) {
+		if (!is_x86_event(event) ||
+		    event->state == PERF_EVENT_STATE_OFF)
+			continue;
+
+		if (n >= max_count)
+			return -ENOSPC;
+
+		cpuc->event_list[n] = event;
+		n++;
+	}
+	return n;
+}
+
+
+static inline void x86_assign_hw_event(struct perf_event *event,
+				struct hw_perf_event *hwc, int idx)
+{
+	hwc->idx = idx;
+
+	if (hwc->idx == X86_PMC_IDX_FIXED_BTS) {
+		hwc->config_base = 0;
+		hwc->event_base	= 0;
+	} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
+		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+		/*
+		 * We set it so that event_base + idx in wrmsr/rdmsr maps to
+		 * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2:
+		 */
+		hwc->event_base =
+			MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED;
+	} else {
+		hwc->config_base = x86_pmu.eventsel;
+		hwc->event_base  = x86_pmu.perfctr;
+	}
+}
+
 void hw_perf_enable(void)
 {
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct perf_event *event;
+	struct hw_perf_event *hwc;
+	int i;
+
 	if (!x86_pmu_initialized())
 		return;
+	if (cpuc->n_added) {
+		/*
+		 * apply assignment obtained either from
+		 * hw_perf_group_sched_in() or x86_pmu_enable()
+		 *
+		 * step1: save events moving to new counters
+		 * step2: reprogram moved events into new counters
+		 */
+		for (i = 0; i < cpuc->n_events; i++) {
+
+			event = cpuc->event_list[i];
+			hwc = &event->hw;
+
+			if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
+				continue;
+
+			x86_pmu.disable(hwc, hwc->idx);
+
+			clear_bit(hwc->idx, cpuc->active_mask);
+			barrier();
+			cpuc->events[hwc->idx] = NULL;
+
+			x86_perf_event_update(event, hwc, hwc->idx);
+
+			hwc->idx = -1;
+		}
+
+		for (i = 0; i < cpuc->n_events; i++) {
+
+			event = cpuc->event_list[i];
+			hwc = &event->hw;
+
+			if (hwc->idx == -1) {
+				x86_assign_hw_event(event, hwc, cpuc->assign[i]);
+				x86_perf_event_set_period(event, hwc, hwc->idx);
+			}
+			/*
+			 * need to mark as active because x86_pmu_disable()
+			 * clear active_mask and eventsp[] yet it preserves
+			 * idx
+			 */
+			set_bit(hwc->idx, cpuc->active_mask);
+			cpuc->events[hwc->idx] = event;
+
+			x86_pmu.enable(hwc, hwc->idx);
+			perf_event_update_userpage(event);
+		}
+		cpuc->n_added = 0;
+		perf_events_lapic_init();
+	}
 	x86_pmu.enable_all();
 }
 
@@ -1398,148 +1656,43 @@ static void amd_pmu_enable_event(struct 
 		x86_pmu_enable_event(hwc, idx);
 }
 
-static int fixed_mode_idx(struct hw_perf_event *hwc)
-{
-	unsigned int hw_event;
-
-	hw_event = hwc->config & ARCH_PERFMON_EVENT_MASK;
-
-	if (unlikely((hw_event ==
-		      x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS)) &&
-		     (hwc->sample_period == 1)))
-		return X86_PMC_IDX_FIXED_BTS;
-
-	if (!x86_pmu.num_events_fixed)
-		return -1;
-
-	/*
-	 * fixed counters do not take all possible filters
-	 */
-	if (hwc->config & ARCH_PERFMON_EVENT_FILTER_MASK)
-		return -1;
-
-	if (unlikely(hw_event == x86_pmu.event_map(PERF_COUNT_HW_INSTRUCTIONS)))
-		return X86_PMC_IDX_FIXED_INSTRUCTIONS;
-	if (unlikely(hw_event == x86_pmu.event_map(PERF_COUNT_HW_CPU_CYCLES)))
-		return X86_PMC_IDX_FIXED_CPU_CYCLES;
-	if (unlikely(hw_event == x86_pmu.event_map(PERF_COUNT_HW_BUS_CYCLES)))
-		return X86_PMC_IDX_FIXED_BUS_CYCLES;
-
-	return -1;
-}
-
-/*
- * generic counter allocator: get next free counter
- */
-static int
-gen_get_event_idx(struct cpu_hw_events *cpuc, struct hw_perf_event *hwc)
-{
-	int idx;
-
-	idx = find_first_zero_bit(cpuc->used_mask, x86_pmu.num_events);
-	return idx == x86_pmu.num_events ? -1 : idx;
-}
-
-/*
- * intel-specific counter allocator: check event constraints
- */
-static int
-intel_get_event_idx(struct cpu_hw_events *cpuc, struct hw_perf_event *hwc)
-{
-	const struct event_constraint *event_constraint;
-	int i, code;
-
-	if (!event_constraints)
-		goto skip;
-
-	code = hwc->config & CORE_EVNTSEL_EVENT_MASK;
-
-	for_each_event_constraint(event_constraint, event_constraints) {
-		if (code == event_constraint->code) {
-			for_each_bit(i, event_constraint->idxmsk, X86_PMC_IDX_MAX) {
-				if (!test_and_set_bit(i, cpuc->used_mask))
-					return i;
-			}
-			return -1;
-		}
-	}
-skip:
-	return gen_get_event_idx(cpuc, hwc);
-}
-
-static int
-x86_schedule_event(struct cpu_hw_events *cpuc, struct hw_perf_event *hwc)
-{
-	int idx;
-
-	idx = fixed_mode_idx(hwc);
-	if (idx == X86_PMC_IDX_FIXED_BTS) {
-		/* BTS is already occupied. */
-		if (test_and_set_bit(idx, cpuc->used_mask))
-			return -EAGAIN;
-
-		hwc->config_base	= 0;
-		hwc->event_base		= 0;
-		hwc->idx		= idx;
-	} else if (idx >= 0) {
-		/*
-		 * Try to get the fixed event, if that is already taken
-		 * then try to get a generic event:
-		 */
-		if (test_and_set_bit(idx, cpuc->used_mask))
-			goto try_generic;
-
-		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
-		/*
-		 * We set it so that event_base + idx in wrmsr/rdmsr maps to
-		 * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2:
-		 */
-		hwc->event_base =
-			MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED;
-		hwc->idx = idx;
-	} else {
-		idx = hwc->idx;
-		/* Try to get the previous generic event again */
-		if (idx == -1 || test_and_set_bit(idx, cpuc->used_mask)) {
-try_generic:
-			idx = x86_pmu.get_event_idx(cpuc, hwc);
-			if (idx == -1)
-				return -EAGAIN;
-
-			set_bit(idx, cpuc->used_mask);
-			hwc->idx = idx;
-		}
-		hwc->config_base = x86_pmu.eventsel;
-		hwc->event_base  = x86_pmu.perfctr;
-	}
-
-	return idx;
-}
-
 /*
- * Find a PMC slot for the freshly enabled / scheduled in event:
+ * activate a single event
+ *
+ * The event is added to the group of enabled events
+ * but only if it can be scheduled with existing events.
+ *
+ * Called with PMU disabled. If successful and return value 1,
+ * then guaranteed to call perf_enable() and hw_perf_enable()
  */
 static int x86_pmu_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct hw_perf_event *hwc = &event->hw;
-	int idx;
-
-	idx = x86_schedule_event(cpuc, hwc);
-	if (idx < 0)
-		return idx;
+	struct hw_perf_event *hwc;
+	int assign[X86_PMC_IDX_MAX];
+	int n, n0, ret;
 
-	perf_events_lapic_init();
+	hwc = &event->hw;
 
-	x86_pmu.disable(hwc, idx);
+	n0 = cpuc->n_events;
+	n = collect_events(cpuc, event, false);
+	if (n < 0)
+		return n;
 
-	cpuc->events[idx] = event;
-	set_bit(idx, cpuc->active_mask);
+	ret = x86_schedule_events(cpuc, n, assign);
+	if (ret)
+		return ret;
+	/*
+	 * copy the new assignment now that we know it is possible;
+	 * it will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n*sizeof(int));
 
-	x86_perf_event_set_period(event, hwc, idx);
-	x86_pmu.enable(hwc, idx);
+	cpuc->n_events = n;
+	cpuc->n_added  = n - n0;
 
-	perf_event_update_userpage(event);
+	if (hwc->idx != -1)
+		x86_perf_event_set_period(event, hwc, hwc->idx);
 
 	return 0;
 }
@@ -1583,7 +1736,7 @@ void perf_event_print_debug(void)
 		pr_info("CPU#%d: overflow:   %016llx\n", cpu, overflow);
 		pr_info("CPU#%d: fixed:      %016llx\n", cpu, fixed);
 	}
-	pr_info("CPU#%d: used:       %016llx\n", cpu, *(u64 *)cpuc->used_mask);
+	pr_info("CPU#%d: active:       %016llx\n", cpu, *(u64 *)cpuc->active_mask);
 
 	for (idx = 0; idx < x86_pmu.num_events; idx++) {
 		rdmsrl(x86_pmu.eventsel + idx, pmc_ctrl);
@@ -1671,7 +1824,7 @@ static void x86_pmu_disable(struct perf_
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
-	int idx = hwc->idx;
+	int i, idx = hwc->idx;
 
 	/*
 	 * Must be done before we disable, otherwise the nmi handler
@@ -1697,8 +1850,19 @@ static void x86_pmu_disable(struct perf_
 		intel_pmu_drain_bts_buffer(cpuc);
 
 	cpuc->events[idx] = NULL;
-	clear_bit(idx, cpuc->used_mask);
 
+	for (i = 0; i < cpuc->n_events; i++) {
+		if (event == cpuc->event_list[i]) {
+
+			if (x86_pmu.put_event_constraints)
+				x86_pmu.put_event_constraints(cpuc, event);
+
+			while (++i < cpuc->n_events)
+				cpuc->event_list[i-1] = cpuc->event_list[i];
+
+			--cpuc->n_events;
+		}
+	}
 	perf_event_update_userpage(event);
 }
 
@@ -1969,6 +2133,176 @@ perf_event_nmi_handler(struct notifier_b
 	return NOTIFY_STOP;
 }
 
+static struct event_constraint bts_constraint = {
+	.code = 0,
+	.cmask = 0,
+	.idxmsk[0] = 1ULL << X86_PMC_IDX_FIXED_BTS
+};
+
+static int intel_special_constraints(struct perf_event *event,
+				     u64 *idxmsk)
+{
+	unsigned int hw_event;
+
+	hw_event = event->hw.config & INTEL_ARCH_EVENT_MASK;
+
+	if (unlikely((hw_event ==
+		      x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS)) &&
+		     (event->hw.sample_period == 1))) {
+
+		bitmap_copy((unsigned long *)idxmsk,
+			    (unsigned long *)bts_constraint.idxmsk,
+			    X86_PMC_IDX_MAX);
+		return 1;
+	}
+	return 0;
+}
+
+static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
+					struct perf_event *event,
+					u64 *idxmsk)
+{
+	const struct event_constraint *c;
+
+	/*
+	 * cleanup bitmask
+	 */
+	bitmap_zero((unsigned long *)idxmsk, X86_PMC_IDX_MAX);
+
+	if (intel_special_constraints(event, idxmsk))
+		return;
+
+	if (x86_pmu.event_constraints) {
+		for_each_event_constraint(c, x86_pmu.event_constraints) {
+			if ((event->hw.config & c->cmask) == c->code) {
+
+				bitmap_copy((unsigned long *)idxmsk,
+					    (unsigned long *)c->idxmsk,
+					    X86_PMC_IDX_MAX);
+				return;
+			}
+		}
+	}
+	/* no constraints, means supports all generic counters */
+	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
+}
+
+static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
+				      struct perf_event *event,
+				      u64 *idxmsk)
+{
+}
+
+static int x86_event_sched_in(struct perf_event *event,
+			  struct perf_cpu_context *cpuctx, int cpu)
+{
+	int ret = 0;
+
+	event->state = PERF_EVENT_STATE_ACTIVE;
+	event->oncpu = cpu;
+	event->tstamp_running += event->ctx->time - event->tstamp_stopped;
+
+	if (!is_x86_event(event))
+		ret = event->pmu->enable(event);
+
+	if (!ret && !is_software_event(event))
+		cpuctx->active_oncpu++;
+
+	if (!ret && event->attr.exclusive)
+		cpuctx->exclusive = 1;
+
+	return ret;
+}
+
+static void x86_event_sched_out(struct perf_event *event,
+			    struct perf_cpu_context *cpuctx, int cpu)
+{
+	event->state = PERF_EVENT_STATE_INACTIVE;
+	event->oncpu = -1;
+
+	if (!is_x86_event(event))
+		event->pmu->disable(event);
+
+	event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
+
+	if (!is_software_event(event))
+		cpuctx->active_oncpu--;
+
+	if (event->attr.exclusive || !cpuctx->active_oncpu)
+		cpuctx->exclusive = 0;
+}
+
+/*
+ * Called to enable a whole group of events.
+ * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
+ * Assumes the caller has disabled interrupts and has
+ * frozen the PMU with hw_perf_save_disable.
+ *
+ * called with PMU disabled. If successful and return value 1,
+ * then guaranteed to call perf_enable() and hw_perf_enable()
+ */
+int hw_perf_group_sched_in(struct perf_event *leader,
+	       struct perf_cpu_context *cpuctx,
+	       struct perf_event_context *ctx, int cpu)
+{
+	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
+	struct perf_event *sub;
+	int assign[X86_PMC_IDX_MAX];
+	int n0, n1, ret;
+
+	/* n0 = total number of events */
+	n0 = collect_events(cpuc, leader, true);
+	if (n0 < 0)
+		return n0;
+
+	ret = x86_schedule_events(cpuc, n0, assign);
+	if (ret)
+		return ret;
+
+	ret = x86_event_sched_in(leader, cpuctx, cpu);
+	if (ret)
+		return ret;
+
+	n1 = 1;
+	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
+		if (sub->state != PERF_EVENT_STATE_OFF) {
+			ret = x86_event_sched_in(sub, cpuctx, cpu);
+			if (ret)
+				goto undo;
+			++n1;
+		}
+	}
+	/*
+	 * copy the new assignment now that we know it is possible;
+	 * it will be used by hw_perf_enable()
+	 */
+	memcpy(cpuc->assign, assign, n0*sizeof(int));
+
+	cpuc->n_events  = n0;
+	cpuc->n_added   = n1;
+	ctx->nr_active += n1;
+
+	/*
+	 * 1 means successful and events are active
+	 * This is not quite true because we defer
+	 * actual activation until hw_perf_enable() but
+	 * this way we ensure the caller won't try to enable
+	 * individual events
+	 */
+	return 1;
+undo:
+	x86_event_sched_out(leader, cpuctx, cpu);
+	n0  = 1;
+	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
+		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+			x86_event_sched_out(sub, cpuctx, cpu);
+			if (++n0 == n1)
+				break;
+		}
+	}
+	return ret;
+}
+
 static __read_mostly struct notifier_block perf_event_nmi_notifier = {
 	.notifier_call		= perf_event_nmi_handler,
 	.next			= NULL,
@@ -2000,7 +2334,8 @@ static __initconst struct x86_pmu p6_pmu
 	 */
 	.event_bits		= 32,
 	.event_mask		= (1ULL << 32) - 1,
-	.get_event_idx		= intel_get_event_idx,
+	.get_event_constraints	= intel_get_event_constraints,
+	.event_constraints	= intel_p6_event_constraints
 };
 
 static __initconst struct x86_pmu intel_pmu = {
@@ -2024,7 +2359,7 @@ static __initconst struct x86_pmu intel_
 	.max_period		= (1ULL << 31) - 1,
 	.enable_bts		= intel_pmu_enable_bts,
 	.disable_bts		= intel_pmu_disable_bts,
-	.get_event_idx		= intel_get_event_idx,
+	.get_event_constraints	= intel_get_event_constraints
 };
 
 static __initconst struct x86_pmu amd_pmu = {
@@ -2045,7 +2380,7 @@ static __initconst struct x86_pmu amd_pm
 	.apic			= 1,
 	/* use highest bit to detect overflow */
 	.max_period		= (1ULL << 47) - 1,
-	.get_event_idx		= gen_get_event_idx,
+	.get_event_constraints	= amd_get_event_constraints
 };
 
 static __init int p6_pmu_init(void)
@@ -2058,12 +2393,9 @@ static __init int p6_pmu_init(void)
 	case 7:
 	case 8:
 	case 11: /* Pentium III */
-		event_constraints = intel_p6_event_constraints;
-		break;
 	case 9:
 	case 13:
 		/* Pentium M */
-		event_constraints = intel_p6_event_constraints;
 		break;
 	default:
 		pr_cont("unsupported p6 CPU model %d ",
@@ -2128,23 +2460,29 @@ static __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		x86_pmu.event_constraints = intel_core_event_constraints;
 		pr_cont("Core2 events, ");
-		event_constraints = intel_core_event_constraints;
 		break;
-	default:
 	case 26:
 		memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
-		event_constraints = intel_nehalem_event_constraints;
+		x86_pmu.event_constraints = intel_nehalem_event_constraints;
 		pr_cont("Nehalem/Corei7 events, ");
 		break;
 	case 28:
 		memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		x86_pmu.event_constraints = intel_gen_event_constraints;
 		pr_cont("Atom events, ");
 		break;
+	default:
+		/*
+		 * default constraints for v2 and up
+		 */
+		x86_pmu.event_constraints = intel_gen_event_constraints;
+		pr_cont("generic architected perfmon, ");
 	}
 	return 0;
 }
@@ -2241,36 +2579,43 @@ static const struct pmu pmu = {
 	.unthrottle	= x86_pmu_unthrottle,
 };
 
-static int
-validate_event(struct cpu_hw_events *cpuc, struct perf_event *event)
-{
-	struct hw_perf_event fake_event = event->hw;
-
-	if (event->pmu && event->pmu != &pmu)
-		return 0;
-
-	return x86_schedule_event(cpuc, &fake_event) >= 0;
-}
-
+/*
+ * validate a single event group
+ *
+ * validation includes:
+ * 	- check events are compatible with each other
+ * 	- events do not compete for the same counter
+ * 	- number of events <= number of counters
+ *
+ * validation ensures the group can be loaded onto the
+ * PMU if it was the only group available.
+ */
 static int validate_group(struct perf_event *event)
 {
-	struct perf_event *sibling, *leader = event->group_leader;
-	struct cpu_hw_events fake_pmu;
+	struct perf_event *leader = event->group_leader;
+	struct cpu_hw_events fake_cpuc;
+	int n;
 
-	memset(&fake_pmu, 0, sizeof(fake_pmu));
+	memset(&fake_cpuc, 0, sizeof(fake_cpuc));
 
-	if (!validate_event(&fake_pmu, leader))
+	/*
+	 * the event is not yet connected with its
+	 * siblings; therefore we must first collect
+	 * existing siblings, then add the new event
+	 * before we can simulate the scheduling
+	 */
+	n = collect_events(&fake_cpuc, leader, true);
+	if (n < 0)
 		return -ENOSPC;
 
-	list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
-		if (!validate_event(&fake_pmu, sibling))
-			return -ENOSPC;
-	}
-
-	if (!validate_event(&fake_pmu, event))
+	fake_cpuc.n_events = n;
+	n = collect_events(&fake_cpuc, event, false);
+	if (n < 0)
 		return -ENOSPC;
 
-	return 0;
+	fake_cpuc.n_events = n;
+
+	return x86_schedule_events(&fake_cpuc, n, NULL);
 }
 
 const struct pmu *hw_perf_event_init(struct perf_event *event)

-- 



* [PATCH 02/10] perf_events: Add fast-path to the rescheduling code
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 01/10] perf_events: improve x86 event scheduling (v5) Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 03/10] perf_event: x86: Allocate the fake_cpuc Peter Zijlstra
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: stephane-perf_events-improve_x86_event_scheduling__v6_incremental.patch --]
[-- Type: text/plain, Size: 8390 bytes --]

The 6th version does the following:
- implement correct fastpath scheduling, i.e., reuse the previous assignment
- skip reprogramming counters in hw_perf_enable() by using a generation number
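
As an illustration of the two points above, here is a minimal
user-space sketch; the field names (idx, last_cpu, last_tag, tags[])
mirror the patch, but the structures and helpers are simplified
stand-ins rather than the kernel code, and the tag bookkeeping is
modelled per counter:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 4

struct hw_event {
	int      idx;			/* counter used last time, -1 if never  */
	int      last_cpu;		/* CPU it was last programmed on        */
	uint64_t last_tag;		/* generation of that counter back then */
};

struct cpu_state {
	int      assign[NUM_COUNTERS];	/* freshly computed event->counter map     */
	uint64_t tags[NUM_COUNTERS];	/* bumped whenever a counter is handed out */
};

/* program counter 'idx' for the event and record the generation */
static void assign_counter(struct hw_event *hwc, struct cpu_state *st,
			   int idx, int cpu)
{
	hwc->idx      = idx;
	hwc->last_cpu = cpu;
	hwc->last_tag = ++st->tags[idx];
}

/*
 * Reprogramming can be skipped only if the event keeps the same
 * counter, on the same CPU, and nobody else claimed that counter in
 * between (the generation number did not move).
 */
static bool match_prev(const struct hw_event *hwc,
		       const struct cpu_state *st, int i, int cpu)
{
	return hwc->idx == st->assign[i] &&
	       hwc->last_cpu == cpu &&
	       hwc->last_tag == st->tags[st->assign[i]];
}

int main(void)
{
	struct cpu_state st = { .assign = { 2 } };
	struct hw_event ev = { .idx = -1 };

	assign_counter(&ev, &st, 2, 0);
	printf("reuse right after assign: %d\n", match_prev(&ev, &st, 0, 0));

	st.tags[2]++;			/* another event used counter 2 since */
	printf("reuse after interference: %d\n", match_prev(&ev, &st, 0, 0));
	return 0;
}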

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4b588464.1818d00a.4456.383b@mx.google.com>
---
 arch/x86/kernel/cpu/perf_event.c |  148 +++++++++++++++++++++++----------------
 include/linux/perf_event.h       |    2 
 2 files changed, 93 insertions(+), 57 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -87,6 +87,7 @@ struct cpu_hw_events {
 	int			n_events;
 	int			n_added;
 	int			assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
+	u64			tags[X86_PMC_IDX_MAX];
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 };
 
@@ -1013,6 +1014,8 @@ static int __hw_perf_event_init(struct p
 	hwc->config = ARCH_PERFMON_EVENTSEL_INT;
 
 	hwc->idx = -1;
+	hwc->last_cpu = -1;
+	hwc->last_tag = ~0ULL;
 
 	/*
 	 * Count user and OS events unless requested not to.
@@ -1245,6 +1248,46 @@ static int x86_schedule_events(struct cp
 	}
 
 	/*
+	 * fastpath, try to reuse previous register
+	 */
+	for (i = 0, num = n; i < n; i++, num--) {
+		hwc = &cpuc->event_list[i]->hw;
+		c = (unsigned long *)constraints[i];
+
+		/* never assigned */
+		if (hwc->idx == -1)
+			break;
+
+		/* constraint still honored */
+		if (!test_bit(hwc->idx, c))
+			break;
+
+		/* not already used */
+		if (test_bit(hwc->idx, used_mask))
+			break;
+
+#if 0
+		pr_debug("CPU%d fast config=0x%llx idx=%d assign=%c\n",
+			 smp_processor_id(),
+			 hwc->config,
+			 hwc->idx,
+			 assign ? 'y' : 'n');
+#endif
+
+		set_bit(hwc->idx, used_mask);
+		if (assign)
+			assign[i] = hwc->idx;
+	}
+	if (!num)
+		goto done;
+
+	/*
+	 * begin slow path
+	 */
+
+	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+
+	/*
 	 * weight = number of possible counters
 	 *
 	 * 1    = most constrained, only works on one counter
@@ -1263,10 +1306,9 @@ static int x86_schedule_events(struct cp
 	if (x86_pmu.num_events_fixed)
 		wmax++;
 
-	num = n;
-	for (w = 1; num && w <= wmax; w++) {
+	for (w = 1, num = n; num && w <= wmax; w++) {
 		/* for each event */
-		for (i = 0; i < n; i++) {
+		for (i = 0; num && i < n; i++) {
 			c = (unsigned long *)constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
@@ -1274,28 +1316,6 @@ static int x86_schedule_events(struct cp
 			if (weight != w)
 				continue;
 
-			/*
-			 * try to reuse previous assignment
-			 *
-			 * This is possible despite the fact that
-			 * events or events order may have changed.
-			 *
-			 * What matters is the level of constraints
-			 * of an event and this is constant for now.
-			 *
-			 * This is possible also because we always
-			 * scan from most to least constrained. Thus,
-			 * if a counter can be reused, it means no
-			 * more constrained event needed it. And
-			 * next events will either compete for it
-			 * (which cannot be solved anyway) or they
-			 * have fewer constraints, and they can use
-			 * another counter.
-			 */
-			j = hwc->idx;
-			if (j != -1 && !test_bit(j, used_mask))
-				goto skip;
-
 			for_each_bit(j, c, X86_PMC_IDX_MAX) {
 				if (!test_bit(j, used_mask))
 					break;
@@ -1303,22 +1323,23 @@ static int x86_schedule_events(struct cp
 
 			if (j == X86_PMC_IDX_MAX)
 				break;
-skip:
-			set_bit(j, used_mask);
 
 #if 0
-			pr_debug("CPU%d config=0x%llx idx=%d assign=%c\n",
+			pr_debug("CPU%d slow config=0x%llx idx=%d assign=%c\n",
 				smp_processor_id(),
 				hwc->config,
 				j,
 				assign ? 'y' : 'n');
 #endif
 
+			set_bit(j, used_mask);
+
 			if (assign)
 				assign[i] = j;
 			num--;
 		}
 	}
+done:
 	/*
 	 * scheduling failed or is just a simulation,
 	 * free resources if necessary
@@ -1357,7 +1378,7 @@ static int collect_events(struct cpu_hw_
 
 	list_for_each_entry(event, &leader->sibling_list, group_entry) {
 		if (!is_x86_event(event) ||
-		    event->state == PERF_EVENT_STATE_OFF)
+		    event->state <= PERF_EVENT_STATE_OFF)
 			continue;
 
 		if (n >= max_count)
@@ -1369,11 +1390,15 @@ static int collect_events(struct cpu_hw_
 	return n;
 }
 
-
 static inline void x86_assign_hw_event(struct perf_event *event,
-				struct hw_perf_event *hwc, int idx)
+				struct cpu_hw_events *cpuc,
+				int idx)
 {
+	struct hw_perf_event *hwc = &event->hw;
+
 	hwc->idx = idx;
+	hwc->last_cpu = smp_processor_id();
+	hwc->last_tag = ++cpuc->tags[idx];
 
 	if (hwc->idx == X86_PMC_IDX_FIXED_BTS) {
 		hwc->config_base = 0;
@@ -1392,6 +1417,14 @@ static inline void x86_assign_hw_event(s
 	}
 }
 
+static bool match_prev_assignment(struct hw_perf_event *hwc,
+				  struct cpu_hw_events *cpuc, int idx)
+{
+	return hwc->idx == cpuc->assign[idx] &&
+	       hwc->last_cpu == smp_processor_id() &&
+	       hwc->last_tag == cpuc->tags[idx];
+}
+
 void hw_perf_enable(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1401,45 +1434,33 @@ void hw_perf_enable(void)
 
 	if (!x86_pmu_initialized())
 		return;
+
 	if (cpuc->n_added) {
 		/*
 		 * apply assignment obtained either from
 		 * hw_perf_group_sched_in() or x86_pmu_enable()
 		 *
-		 * step1: save events moving to new counters
-		 * step2: reprogram moved events into new counters
+		 * We either re-enable or re-program and re-enable.
+		 * All events are disabled by the time we come here.
+		 * That means their state has been saved already.
 		 */
 		for (i = 0; i < cpuc->n_events; i++) {
-
 			event = cpuc->event_list[i];
 			hwc = &event->hw;
 
-			if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
-				continue;
-
-			x86_pmu.disable(hwc, hwc->idx);
-
-			clear_bit(hwc->idx, cpuc->active_mask);
-			barrier();
-			cpuc->events[hwc->idx] = NULL;
-
-			x86_perf_event_update(event, hwc, hwc->idx);
-
-			hwc->idx = -1;
-		}
-
-		for (i = 0; i < cpuc->n_events; i++) {
-
-			event = cpuc->event_list[i];
-			hwc = &event->hw;
-
-			if (hwc->idx == -1) {
-				x86_assign_hw_event(event, hwc, cpuc->assign[i]);
+			/*
+			 * we can avoid reprogramming counter if:
+			 * - assigned same counter as last time
+			 * - running on same CPU as last time
+			 * - no other event has used the counter since
+			 */
+			if (!match_prev_assignment(hwc, cpuc, i)) {
+				x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
 				x86_perf_event_set_period(event, hwc, hwc->idx);
 			}
 			/*
 			 * need to mark as active because x86_pmu_disable()
-			 * clear active_mask and eventsp[] yet it preserves
+			 * clear active_mask and events[] yet it preserves
 			 * idx
 			 */
 			set_bit(hwc->idx, cpuc->active_mask);
@@ -2191,6 +2212,8 @@ static void amd_get_event_constraints(st
 				      struct perf_event *event,
 				      u64 *idxmsk)
 {
+	/* no constraints, means supports all generic counters */
+	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
 }
 
 static int x86_event_sched_in(struct perf_event *event,
@@ -2265,7 +2288,7 @@ int hw_perf_group_sched_in(struct perf_e
 
 	n1 = 1;
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		if (sub->state != PERF_EVENT_STATE_OFF) {
+		if (sub->state > PERF_EVENT_STATE_OFF) {
 			ret = x86_event_sched_in(sub, cpuctx, cpu);
 			if (ret)
 				goto undo;
@@ -2620,12 +2643,23 @@ static int validate_group(struct perf_ev
 
 const struct pmu *hw_perf_event_init(struct perf_event *event)
 {
+	const struct pmu *tmp;
 	int err;
 
 	err = __hw_perf_event_init(event);
 	if (!err) {
+		/*
+		 * we temporarily connect event to its pmu
+		 * such that validate_group() can classify
+		 * it as an x86 event using is_x86_event()
+		 */
+		tmp = event->pmu;
+		event->pmu = &pmu;
+
 		if (event->group_leader != event)
 			err = validate_group(event);
+
+		event->pmu = tmp;
 	}
 	if (err) {
 		if (event->destroy)
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -478,9 +478,11 @@ struct hw_perf_event {
 	union {
 		struct { /* hardware */
 			u64		config;
+			u64		last_tag;
 			unsigned long	config_base;
 			unsigned long	event_base;
 			int		idx;
+			int		last_cpu;
 		};
 		struct { /* software */
 			s64		remaining;

-- 



* [PATCH 03/10] perf_event: x86: Allocate the fake_cpuc
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 01/10] perf_events: improve x86 event scheduling (v5) Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 02/10] perf_events: Add fast-path to the rescheduling code Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 04/10] perf_event: x86: Fixup weight tying issue Peter Zijlstra
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-alloc-fake-pmu.patch --]
[-- Type: text/plain, Size: 1788 bytes --]

GCC was complaining that the stack usage was too large, so allocate
the structure dynamically instead.
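
A user-space sketch of the same pattern (the structure and its size are
made up for illustration): a scratch structure that is too large for
the stack is allocated zeroed from the heap and freed when done.

#include <stdlib.h>

struct fake_cpuc { long scratch[4096]; };	/* far too big for the stack */

static int validate_something(void)
{
	struct fake_cpuc *fake;
	int ret;

	fake = calloc(1, sizeof(*fake));	/* was: struct fake_cpuc fake; */
	if (!fake)
		return -1;

	/* ... run the scheduling simulation on 'fake' ... */
	ret = 0;

	free(fake);
	return ret;
}

int main(void)
{
	return validate_something();
}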

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |   29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -2614,10 +2614,13 @@ static const struct pmu pmu = {
 static int validate_group(struct perf_event *event)
 {
 	struct perf_event *leader = event->group_leader;
-	struct cpu_hw_events fake_cpuc;
-	int n;
+	struct cpu_hw_events *fake_cpuc;
+	int ret, n;
 
-	memset(&fake_cpuc, 0, sizeof(fake_cpuc));
+	ret = -ENOMEM;
+	fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
+	if (!fake_cpuc)
+		goto out;
 
 	/*
 	 * the event is not yet connected with its
@@ -2625,18 +2628,24 @@ static int validate_group(struct perf_ev
 	 * existing siblings, then add the new event
 	 * before we can simulate the scheduling
 	 */
-	n = collect_events(&fake_cpuc, leader, true);
+	ret = -ENOSPC;
+	n = collect_events(fake_cpuc, leader, true);
 	if (n < 0)
-		return -ENOSPC;
+		goto out_free;
 
-	fake_cpuc.n_events = n;
-	n = collect_events(&fake_cpuc, event, false);
+	fake_cpuc->n_events = n;
+	n = collect_events(fake_cpuc, event, false);
 	if (n < 0)
-		return -ENOSPC;
+		goto out_free;
 
-	fake_cpuc.n_events = n;
+	fake_cpuc->n_events = n;
 
-	return x86_schedule_events(&fake_cpuc, n, NULL);
+	ret = x86_schedule_events(fake_cpuc, n, NULL);
+
+out_free:
+	kfree(fake_cpuc);
+out:
+	return ret;
 }
 
 const struct pmu *hw_perf_event_init(struct perf_event *event)

-- 



* [PATCH 04/10] perf_event: x86: Fixup weight tying issue
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (2 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 03/10] perf_event: x86: Allocate the fake_cpuc Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:27   ` [tip:perf/core] perf_event: x86: Fixup constraints typing issue tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 05/10] perf_event: x86: Clean up some of the u64/long bitmask casting Peter Zijlstra
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-fix-types.patch --]
[-- Type: text/plain, Size: 1434 bytes --]

The constraints array is defined as u64 but sized in unsigned long
quantities and then cast to unsigned long; declare it as unsigned long
instead.
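
A small stand-alone demonstration of why the mixed types matter (the
constants are illustrative; on 64-bit the two sizes happen to coincide,
which hides the mismatch, while on 32-bit they differ):

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_LONG	 (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define IDX_MAX 64

int main(void)
{
	/*
	 * The old declaration mixed the two: u64 elements, but an element
	 * count computed in unsigned long units, with the result cast to
	 * unsigned long * for the bitmap helpers.
	 */
	size_t longs = BITS_TO_LONGS(IDX_MAX);

	printf("bitmap helpers expect: %zu * %zu = %zu bytes\n",
	       longs, sizeof(unsigned long), longs * sizeof(unsigned long));
	printf("u64-typed array holds: %zu * %zu = %zu bytes\n",
	       longs, sizeof(uint64_t), longs * sizeof(uint64_t));
	return 0;
}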

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -1235,7 +1235,7 @@ static int x86_schedule_events(struct cp
 	int i, j , w, num;
 	int weight, wmax;
 	unsigned long *c;
-	u64 constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	unsigned long constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
 
@@ -1252,7 +1252,7 @@ static int x86_schedule_events(struct cp
 	 */
 	for (i = 0, num = n; i < n; i++, num--) {
 		hwc = &cpuc->event_list[i]->hw;
-		c = (unsigned long *)constraints[i];
+		c = constraints[i];
 
 		/* never assigned */
 		if (hwc->idx == -1)
@@ -1309,7 +1309,7 @@ static int x86_schedule_events(struct cp
 	for (w = 1, num = n; num && w <= wmax; w++) {
 		/* for each event */
 		for (i = 0; num && i < n; i++) {
-			c = (unsigned long *)constraints[i];
+			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
 			weight = bitmap_weight(c, X86_PMC_IDX_MAX);

-- 



* [PATCH 05/10] perf_event: x86: Clean up some of the u64/long bitmask casting
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (3 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 04/10] perf_event: x86: Fixup weight tying issue Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 06/10] perf_event: x86: Reduce some overly long lines with some MACROs Peter Zijlstra
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-change-idxmask-type.patch --]
[-- Type: text/plain, Size: 3843 bytes --]

We need this to be u64 for direct assignment, but the bitmask functions
all work on unsigned long, leading to a pile of casts; solve this by
using a union.
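
A minimal sketch of the union trick in stand-alone C11 (sizes and the
bit test are illustrative; it assumes a little-endian layout, as on
x86, so the u64 view and the unsigned long view line up):

#include <stdint.h>
#include <stdio.h>

#define IDX_MAX		64
#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct constraint {
	union {
		unsigned long idxmsk[TO_LONGS(IDX_MAX)];  /* for bitmap helpers    */
		uint64_t      idxmsk64[1];                /* for direct assignment */
	};
	int code;
	int cmask;
};

int main(void)
{
	/* direct 64-bit initialisation through the u64 view ... */
	struct constraint c = {
		.idxmsk64 = { 0x3 | (1ULL << 33) },
		.code  = 0xc0,
		.cmask = 0xff,
	};
	int bit = 33;

	/* ... while bit tests use the unsigned long view, no casts needed */
	int set = !!(c.idxmsk[bit / BITS_PER_LONG] &
		     (1UL << (bit % BITS_PER_LONG)));

	printf("code=%#x bit33=%d\n", (unsigned)c.code, set);
	return 0;
}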

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |   47 +++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -69,10 +69,11 @@ struct debug_store {
 	u64	pebs_event_reset[MAX_PEBS_EVENTS];
 };
 
-#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
-
 struct event_constraint {
-	u64	idxmsk[BITS_TO_U64(X86_PMC_IDX_MAX)];
+	union {
+		unsigned long	idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+		u64		idxmsk64[1];
+	};
 	int	code;
 	int	cmask;
 };
@@ -91,13 +92,14 @@ struct cpu_hw_events {
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 };
 
-#define EVENT_CONSTRAINT(c, n, m) { \
-	.code = (c),	\
-	.cmask = (m),	\
-	.idxmsk[0] = (n) }
+#define EVENT_CONSTRAINT(c, n, m) { 	\
+	{ .idxmsk64[0] = (n) },		\
+	.code = (c),			\
+	.cmask = (m),			\
+}
 
 #define EVENT_CONSTRAINT_END \
-	{ .code = 0, .cmask = 0, .idxmsk[0] = 0 }
+	EVENT_CONSTRAINT(0, 0, 0)
 
 #define for_each_event_constraint(e, c) \
 	for ((e) = (c); (e)->cmask; (e)++)
@@ -127,8 +129,11 @@ struct x86_pmu {
 	u64		intel_ctrl;
 	void		(*enable_bts)(u64 config);
 	void		(*disable_bts)(void);
-	void		(*get_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event, u64 *idxmsk);
-	void		(*put_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event);
+	void		(*get_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event,
+						 unsigned long *idxmsk);
+	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event);
 	const struct event_constraint *event_constraints;
 };
 
@@ -2154,14 +2159,11 @@ perf_event_nmi_handler(struct notifier_b
 	return NOTIFY_STOP;
 }
 
-static struct event_constraint bts_constraint = {
-	.code = 0,
-	.cmask = 0,
-	.idxmsk[0] = 1ULL << X86_PMC_IDX_FIXED_BTS
-};
+static struct event_constraint bts_constraint =
+	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
 
 static int intel_special_constraints(struct perf_event *event,
-				     u64 *idxmsk)
+				     unsigned long *idxmsk)
 {
 	unsigned int hw_event;
 
@@ -2181,14 +2183,14 @@ static int intel_special_constraints(str
 
 static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
 					struct perf_event *event,
-					u64 *idxmsk)
+					unsigned long *idxmsk)
 {
 	const struct event_constraint *c;
 
 	/*
 	 * cleanup bitmask
 	 */
-	bitmap_zero((unsigned long *)idxmsk, X86_PMC_IDX_MAX);
+	bitmap_zero(idxmsk, X86_PMC_IDX_MAX);
 
 	if (intel_special_constraints(event, idxmsk))
 		return;
@@ -2196,10 +2198,7 @@ static void intel_get_event_constraints(
 	if (x86_pmu.event_constraints) {
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
 			if ((event->hw.config & c->cmask) == c->code) {
-
-				bitmap_copy((unsigned long *)idxmsk,
-					    (unsigned long *)c->idxmsk,
-					    X86_PMC_IDX_MAX);
+				bitmap_copy(idxmsk, c->idxmsk, X86_PMC_IDX_MAX);
 				return;
 			}
 		}
@@ -2210,10 +2209,10 @@ static void intel_get_event_constraints(
 
 static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
 				      struct perf_event *event,
-				      u64 *idxmsk)
+				      unsigned long *idxmsk)
 {
 	/* no constraints, means supports all generic counters */
-	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
+	bitmap_fill(idxmsk, x86_pmu.num_events);
 }
 
 static int x86_event_sched_in(struct perf_event *event,

-- 



* [PATCH 06/10] perf_event: x86: Reduce some overly long lines with some MACROs
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (4 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 05/10] perf_event: x86: Clean up some of the u64/long bitmask casting Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} Peter Zijlstra
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-event_constraint.patch --]
[-- Type: text/plain, Size: 5645 bytes --]

Introduce INTEL_EVENT_CONSTRAINT and FIXED_EVENT_CONSTRAINT to reduce
some line length and typing work.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |   68 +++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 31 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -98,6 +98,12 @@ struct cpu_hw_events {
 	.cmask = (m),			\
 }
 
+#define INTEL_EVENT_CONSTRAINT(c, n)	\
+	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)
+
+#define FIXED_EVENT_CONSTRAINT(c, n)	\
+	EVENT_CONSTRAINT(c, n, INTEL_ARCH_FIXED_MASK)
+
 #define EVENT_CONSTRAINT_END \
 	EVENT_CONSTRAINT(0, 0, 0)
 
@@ -193,12 +199,12 @@ static u64 p6_pmu_raw_event(u64 hw_event
 
 static struct event_constraint intel_p6_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc1, 0x1, INTEL_ARCH_EVENT_MASK),	/* FLOPS */
-	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK),	/* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK),	/* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK),	/* CYCLES_DIV_BUSY */
+	INTEL_EVENT_CONSTRAINT(0xc1, 0x1),	/* FLOPS */
+	INTEL_EVENT_CONSTRAINT(0x10, 0x1),	/* FP_COMP_OPS_EXE */
+	INTEL_EVENT_CONSTRAINT(0x11, 0x1),	/* FP_ASSIST */
+	INTEL_EVENT_CONSTRAINT(0x12, 0x2),	/* MUL */
+	INTEL_EVENT_CONSTRAINT(0x13, 0x2),	/* DIV */
+	INTEL_EVENT_CONSTRAINT(0x14, 0x1),	/* CYCLES_DIV_BUSY */
 	EVENT_CONSTRAINT_END
 };
 
@@ -218,41 +224,41 @@ static const u64 intel_perfmon_event_map
 
 static struct event_constraint intel_core_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
-	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK), /* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x2, INTEL_ARCH_EVENT_MASK), /* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK), /* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK), /* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK), /* CYCLES_DIV_BUSY */
-	EVENT_CONSTRAINT(0x18, 0x1, INTEL_ARCH_EVENT_MASK), /* IDLE_DURING_DIV */
-	EVENT_CONSTRAINT(0x19, 0x2, INTEL_ARCH_EVENT_MASK), /* DELAYED_BYPASS */
-	EVENT_CONSTRAINT(0xa1, 0x1, INTEL_ARCH_EVENT_MASK), /* RS_UOPS_DISPATCH_CYCLES */
-	EVENT_CONSTRAINT(0xcb, 0x1, INTEL_ARCH_EVENT_MASK), /* MEM_LOAD_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0x10, 0x1), /* FP_COMP_OPS_EXE */
+	INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
+	INTEL_EVENT_CONSTRAINT(0x12, 0x2), /* MUL */
+	INTEL_EVENT_CONSTRAINT(0x13, 0x2), /* DIV */
+	INTEL_EVENT_CONSTRAINT(0x14, 0x1), /* CYCLES_DIV_BUSY */
+	INTEL_EVENT_CONSTRAINT(0x18, 0x1), /* IDLE_DURING_DIV */
+	INTEL_EVENT_CONSTRAINT(0x19, 0x2), /* DELAYED_BYPASS */
+	INTEL_EVENT_CONSTRAINT(0xa1, 0x1), /* RS_UOPS_DISPATCH_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0xcb, 0x1), /* MEM_LOAD_RETIRED */
 	EVENT_CONSTRAINT_END
 };
 
 static struct event_constraint intel_nehalem_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
-	EVENT_CONSTRAINT(0x40, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LD */
-	EVENT_CONSTRAINT(0x41, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_ST */
-	EVENT_CONSTRAINT(0x42, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK */
-	EVENT_CONSTRAINT(0x43, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_ALL_REF */
-	EVENT_CONSTRAINT(0x4e, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_PREFETCH */
-	EVENT_CONSTRAINT(0x4c, 0x3, INTEL_ARCH_EVENT_MASK), /* LOAD_HIT_PRE */
-	EVENT_CONSTRAINT(0x51, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D */
-	EVENT_CONSTRAINT(0x52, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0x53, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0xc5, 0x3, INTEL_ARCH_EVENT_MASK), /* CACHE_LOCK_CYCLES */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0x40, 0x3), /* L1D_CACHE_LD */
+	INTEL_EVENT_CONSTRAINT(0x41, 0x3), /* L1D_CACHE_ST */
+	INTEL_EVENT_CONSTRAINT(0x42, 0x3), /* L1D_CACHE_LOCK */
+	INTEL_EVENT_CONSTRAINT(0x43, 0x3), /* L1D_ALL_REF */
+	INTEL_EVENT_CONSTRAINT(0x4e, 0x3), /* L1D_PREFETCH */
+	INTEL_EVENT_CONSTRAINT(0x4c, 0x3), /* LOAD_HIT_PRE */
+	INTEL_EVENT_CONSTRAINT(0x51, 0x3), /* L1D */
+	INTEL_EVENT_CONSTRAINT(0x52, 0x3), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
+	INTEL_EVENT_CONSTRAINT(0x53, 0x3), /* L1D_CACHE_LOCK_FB_HIT */
+	INTEL_EVENT_CONSTRAINT(0xc5, 0x3), /* CACHE_LOCK_CYCLES */
 	EVENT_CONSTRAINT_END
 };
 
 static struct event_constraint intel_gen_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
 	EVENT_CONSTRAINT_END
 };
 

-- 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (5 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 06/10] perf_event: x86: Reduce some overly long lines with some MACROs Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
  2010-01-29 10:32   ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} John Kacur
  2010-01-22 15:50 ` [PATCH 08/10] perf_event: Optimize the constraint searching bits Peter Zijlstra
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: generic-HWEIGHT.patch --]
[-- Type: text/plain, Size: 1047 bytes --]

Provide compile time versions of hweight.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/bitops.h |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,6 +45,20 @@ static inline unsigned long hweight_long
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
+#define HWEIGHT8(w) 			\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
+#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
+#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
+
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate
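
For reference, a minimal usage sketch of what the compile-time macros make
possible (the struct and macro names below are illustrative only, not taken
from this series): because HWEIGHT64() expands to plain constant arithmetic,
it can be used in a static initializer, where a call to hweight64() could not.

/* Illustrative sketch only; all identifiers here are made up. */
#include <linux/types.h>
#include <linux/bitops.h>

struct example_constraint {
	u64	idxmsk;
	int	weight;		/* number of bits set in idxmsk */
};

#define EXAMPLE_CONSTRAINT(mask) {		\
	.idxmsk = (mask),			\
	.weight = HWEIGHT64((u64)(mask)),	\
}

/* 0x3 has two bits set, so .weight becomes the compile-time constant 2 */
static struct example_constraint example = EXAMPLE_CONSTRAINT(0x3ULL);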

-- 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 08/10] perf_event: Optimize the constraint searching bits
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (6 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-22 16:08   ` Stephane Eranian
  2010-01-29  9:28   ` [tip:perf/core] perf_event: x86: " tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 09/10] perf_event: x86: Optimize constraint weight computation Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 10/10] perf_event: Optimize the fast path a little more Peter Zijlstra
  9 siblings, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-opt-constraint.patch --]
[-- Type: text/plain, Size: 5202 bytes --]

Instead of copying bitmasks around, pass pointers to the constraint structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |   81 ++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 43 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -135,12 +138,14 @@ struct x86_pmu {
 	u64		intel_ctrl;
 	void		(*enable_bts)(u64 config);
 	void		(*disable_bts)(void);
-	void		(*get_event_constraints)(struct cpu_hw_events *cpuc,
-						 struct perf_event *event,
-						 unsigned long *idxmsk);
+
+	struct event_constraint *
+			(*get_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event);
+
 	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
-	const struct event_constraint *event_constraints;
+	struct event_constraint *event_constraints;
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
@@ -1244,18 +1249,16 @@ static inline int is_x86_event(struct pe
 static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
 	int i, j , w, num;
-	int weight, wmax;
-	unsigned long *c;
-	unsigned long constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	int weight, wmax;
+	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
 
 	for (i = 0; i < n; i++) {
-		x86_pmu.get_event_constraints(cpuc,
-					      cpuc->event_list[i],
-					      constraints[i]);
+		constraints[i] =
+		  x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
 	}
 
 	/*
@@ -1270,7 +1273,7 @@ static int x86_schedule_events(struct cp
 			break;
 
 		/* constraint still honored */
-		if (!test_bit(hwc->idx, c))
+		if (!test_bit(hwc->idx, c->idxmsk))
 			break;
 
 		/* not already used */
@@ -1323,11 +1326,11 @@ static int x86_schedule_events(struct cp
 			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
-			weight = bitmap_weight(c, X86_PMC_IDX_MAX);
+			weight = bitmap_weight(c->idxmsk, X86_PMC_IDX_MAX);
 			if (weight != w)
 				continue;
 
-			for_each_bit(j, c, X86_PMC_IDX_MAX) {
+			for_each_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
 				if (!test_bit(j, used_mask))
 					break;
 			}
@@ -2165,11 +2167,13 @@ perf_event_nmi_handler(struct notifier_b
 	return NOTIFY_STOP;
 }
 
+static struct event_constraint unconstrained;
+
 static struct event_constraint bts_constraint =
 	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
 
-static int intel_special_constraints(struct perf_event *event,
-				     unsigned long *idxmsk)
+static struct event_constraint *
+intel_special_constraints(struct perf_event *event)
 {
 	unsigned int hw_event;
 
@@ -2179,46 +2183,34 @@ static int intel_special_constraints(str
 		      x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS)) &&
 		     (event->hw.sample_period == 1))) {
 
-		bitmap_copy((unsigned long *)idxmsk,
-			    (unsigned long *)bts_constraint.idxmsk,
-			    X86_PMC_IDX_MAX);
-		return 1;
+		return &bts_constraint;
 	}
-	return 0;
+	return NULL;
 }
 
-static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
-					struct perf_event *event,
-					unsigned long *idxmsk)
+static struct event_constraint *
+intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
 {
-	const struct event_constraint *c;
+	struct event_constraint *c;
 
-	/*
-	 * cleanup bitmask
-	 */
-	bitmap_zero(idxmsk, X86_PMC_IDX_MAX);
-
-	if (intel_special_constraints(event, idxmsk))
-		return;
+	c = intel_special_constraints(event);
+	if (c)
+		return c;
 
 	if (x86_pmu.event_constraints) {
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
-			if ((event->hw.config & c->cmask) == c->code) {
-				bitmap_copy(idxmsk, c->idxmsk, X86_PMC_IDX_MAX);
-				return;
-			}
+			if ((event->hw.config & c->cmask) == c->code)
+				return c;
 		}
 	}
-	/* no constraints, means supports all generic counters */
-	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
+
+	return &unconstrained;
 }
 
-static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
-				      struct perf_event *event,
-				      unsigned long *idxmsk)
+static struct event_constraint *
+amd_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
 {
-	/* no constraints, means supports all generic counters */
-	bitmap_fill(idxmsk, x86_pmu.num_events);
+	return &unconstrained;
 }
 
 static int x86_event_sched_in(struct perf_event *event,
@@ -2586,6 +2578,9 @@ void __init init_hw_perf_events(void)
 	perf_events_lapic_init();
 	register_die_notifier(&perf_event_nmi_notifier);
 
+	unconstrained = (struct event_constraint)
+		EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_events) - 1, 0);
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.event_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_events);

-- 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 09/10] perf_event: x86: Optimize constraint weight computation
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (7 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 08/10] perf_event: Optimize the constraint searching bits Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
  2010-01-22 15:50 ` [PATCH 10/10] perf_event: Optimize the fast path a little more Peter Zijlstra
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-opt-weight.patch --]
[-- Type: text/plain, Size: 1653 bytes --]

Add a weight member to the constraint structure and avoid recomputing
the weight at runtime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -23,6 +23,7 @@
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
 #include <linux/cpu.h>
+#include <linux/bitops.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -76,6 +77,7 @@ struct event_constraint {
 	};
 	int	code;
 	int	cmask;
+	int	weight;
 };
 
 struct cpu_hw_events {
@@ -96,6 +98,7 @@ struct cpu_hw_events {
 	{ .idxmsk64[0] = (n) },		\
 	.code = (c),			\
 	.cmask = (m),			\
+	.weight = HWEIGHT64((u64)(n)),	\
 }
 
 #define INTEL_EVENT_CONSTRAINT(c, n)	\
@@ -1245,8 +1248,7 @@ static inline int is_x86_event(struct pe
 
 static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	int i, j , w, num;
-	int weight, wmax;
+	int i, j, w, num, wmax;
 	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
@@ -1323,8 +1325,7 @@ static int x86_schedule_events(struct cp
 			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
-			weight = bitmap_weight(c->idxmsk, X86_PMC_IDX_MAX);
-			if (weight != w)
+			if (c->weight != w)
 				continue;
 
 			for_each_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {

-- 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 10/10] perf_event: Optimize the fast path a little more
  2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
                   ` (8 preceding siblings ...)
  2010-01-22 15:50 ` [PATCH 09/10] perf_event: x86: Optimize constraint weight computation Peter Zijlstra
@ 2010-01-22 15:50 ` Peter Zijlstra
  2010-01-29  9:28   ` [tip:perf/core] perf_event: x86: " tip-bot for Peter Zijlstra
  9 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 15:50 UTC (permalink / raw)
  To: eranian, fweisbec, paulus, mingo, davem, robert.richter
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-opt-num.patch --]
[-- Type: text/plain, Size: 1269 bytes --]

Remove num from the fast path and save a few ops.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/kernel/cpu/perf_event.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -1248,9 +1248,9 @@ static inline int is_x86_event(struct pe
 
 static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	int i, j, w, num, wmax;
 	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	int i, j, w, wmax, num = 0;
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -1263,7 +1263,7 @@ static int x86_schedule_events(struct cp
 	/*
 	 * fastpath, try to reuse previous register
 	 */
-	for (i = 0, num = n; i < n; i++, num--) {
+	for (i = 0; i < n; i++) {
 		hwc = &cpuc->event_list[i]->hw;
 		c = constraints[i];
 
@@ -1291,7 +1291,7 @@ static int x86_schedule_events(struct cp
 		if (assign)
 			assign[i] = hwc->idx;
 	}
-	if (!num)
+	if (i == n)
 		goto done;
 
 	/*

-- 


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/10] perf_event: Optimize the constraint searching bits
  2010-01-22 15:50 ` [PATCH 08/10] perf_event: Optimize the constraint searching bits Peter Zijlstra
@ 2010-01-22 16:08   ` Stephane Eranian
  2010-01-22 16:22     ` Peter Zijlstra
  2010-01-29  9:28   ` [tip:perf/core] perf_event: x86: " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Stephane Eranian @ 2010-01-22 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweisbec, paulus, mingo, davem, robert.richter, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 8385 bytes --]

On Fri, Jan 22, 2010 at 4:50 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Instead of copying bitmasks around, pass pointers to the constraint structure.
>
The reason I changed that is because of the AMD situation where constraints
are dynamically generated. So we need to get some storage from somwhere.
See my AMD NB patch,


--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/10] perf_event: Optimize the constraint searching bits
  2010-01-22 16:08   ` Stephane Eranian
@ 2010-01-22 16:22     ` Peter Zijlstra
  2010-01-22 16:28       ` Stephane Eranian
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-22 16:22 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: fweisbec, paulus, mingo, davem, robert.richter, linux-kernel

On Fri, 2010-01-22 at 17:08 +0100, Stephane Eranian wrote:
> On Fri, Jan 22, 2010 at 4:50 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Instead of copying bitmasks around, pass pointers to the constraint structure.
> >
> The reason I changed that is because of the AMD situation where constraints
> are dynamically generated. So we need to get some storage from somwhere.
> See my AMD NB patch,

Can't we still do that by embedding 4 event_constraint structures in
struct amd_nb and rewriting those as we need?
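
A rough sketch of that idea, purely as an illustration (the amd_nb layout and
the helper below are assumptions for this example, not code from Stephane's
AMD NB patch):

#define MAX_AMD_NB_CONSTRAINTS	4

struct amd_nb {
	int			nb_id;		/* NorthBridge id */
	int			refcnt;
	struct perf_event	*owners[X86_PMC_IDX_MAX];
	/* scratch storage for dynamically generated constraints */
	struct event_constraint	event_constraints[MAX_AMD_NB_CONSTRAINTS];
};

/*
 * Rewrite one embedded constraint in place and return a pointer to it,
 * so ->get_event_constraints() can keep returning a pointer even when
 * the index mask is computed at runtime.
 */
static struct event_constraint *
amd_nb_rewrite_constraint(struct amd_nb *nb, int slot, u64 idxmsk)
{
	struct event_constraint *c = &nb->event_constraints[slot];

	*c = (struct event_constraint)EVENT_CONSTRAINT(0, idxmsk, 0);
	return c;
}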




^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 08/10] perf_event: Optimize the constraint searching bits
  2010-01-22 16:22     ` Peter Zijlstra
@ 2010-01-22 16:28       ` Stephane Eranian
  0 siblings, 0 replies; 126+ messages in thread
From: Stephane Eranian @ 2010-01-22 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweisbec, paulus, mingo, davem, robert.richter, linux-kernel

On Fri, Jan 22, 2010 at 5:22 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-01-22 at 17:08 +0100, Stephane Eranian wrote:
>> On Fri, Jan 22, 2010 at 4:50 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > Instead of copying bitmasks around, pass pointers to the constraint structure.
>> >
>> The reason I changed that is because of the AMD situation where constraints
>> are dynamically generated. So we need to get some storage from somwhere.
>> See my AMD NB patch,
>
> Can't we still do that by embedding 4 event_constraint structures in
> struct amd_nb and rewriting those as we need?
>
We sure could do that too.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Allocate the fake_cpuc
  2010-01-22 15:50 ` [PATCH 03/10] perf_event: x86: Allocate the fake_cpuc Peter Zijlstra
@ 2010-01-29  9:27   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  502568d563bcc37ac505a83341c0c95b88c015a8
Gitweb:     http://git.kernel.org/tip/502568d563bcc37ac505a83341c0c95b88c015a8
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 14:35:46 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:35 +0100

perf_event: x86: Allocate the fake_cpuc

GCC was complaining that the stack usage was too large, so allocate the
structure dynamically instead.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.411197266@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 0bd23d0..7bd359a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2606,10 +2606,13 @@ static const struct pmu pmu = {
 static int validate_group(struct perf_event *event)
 {
 	struct perf_event *leader = event->group_leader;
-	struct cpu_hw_events fake_cpuc;
-	int n;
+	struct cpu_hw_events *fake_cpuc;
+	int ret, n;
 
-	memset(&fake_cpuc, 0, sizeof(fake_cpuc));
+	ret = -ENOMEM;
+	fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
+	if (!fake_cpuc)
+		goto out;
 
 	/*
 	 * the event is not yet connected with its
@@ -2617,18 +2620,24 @@ static int validate_group(struct perf_event *event)
 	 * existing siblings, then add the new event
 	 * before we can simulate the scheduling
 	 */
-	n = collect_events(&fake_cpuc, leader, true);
+	ret = -ENOSPC;
+	n = collect_events(fake_cpuc, leader, true);
 	if (n < 0)
-		return -ENOSPC;
+		goto out_free;
 
-	fake_cpuc.n_events = n;
-	n = collect_events(&fake_cpuc, event, false);
+	fake_cpuc->n_events = n;
+	n = collect_events(fake_cpuc, event, false);
 	if (n < 0)
-		return -ENOSPC;
+		goto out_free;
 
-	fake_cpuc.n_events = n;
+	fake_cpuc->n_events = n;
 
-	return x86_schedule_events(&fake_cpuc, n, NULL);
+	ret = x86_schedule_events(fake_cpuc, n, NULL);
+
+out_free:
+	kfree(fake_cpuc);
+out:
+	return ret;
 }
 
 const struct pmu *hw_perf_event_init(struct perf_event *event)

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Fixup constraints typing issue
  2010-01-22 15:50 ` [PATCH 04/10] perf_event: x86: Fixup weight tying issue Peter Zijlstra
@ 2010-01-29  9:27   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  81269a085669b5130058a0275aa7ba9f94abd1fa
Gitweb:     http://git.kernel.org/tip/81269a085669b5130058a0275aa7ba9f94abd1fa
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 14:55:22 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:36 +0100

perf_event: x86: Fixup constraints typing issue

Constraints get defined as u64 but are used in long quantities and then cast
to long.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.504916780@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 7bd359a..7e181a5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1232,7 +1232,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	int i, j , w, num;
 	int weight, wmax;
 	unsigned long *c;
-	u64 constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	unsigned long constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
 
@@ -1249,7 +1249,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	 */
 	for (i = 0, num = n; i < n; i++, num--) {
 		hwc = &cpuc->event_list[i]->hw;
-		c = (unsigned long *)constraints[i];
+		c = constraints[i];
 
 		/* never assigned */
 		if (hwc->idx == -1)
@@ -1306,7 +1306,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	for (w = 1, num = n; num && w <= wmax; w++) {
 		/* for each event */
 		for (i = 0; num && i < n; i++) {
-			c = (unsigned long *)constraints[i];
+			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
 			weight = bitmap_weight(c, X86_PMC_IDX_MAX);

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Clean up some of the u64/long bitmask casting
  2010-01-22 15:50 ` [PATCH 05/10] perf_event: x86: Clean up some of the u64/long bitmask casting Peter Zijlstra
@ 2010-01-29  9:27   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  c91e0f5da81c6f3a611a1bd6d0cca6717c90fdab
Gitweb:     http://git.kernel.org/tip/c91e0f5da81c6f3a611a1bd6d0cca6717c90fdab
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 15:25:59 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:37 +0100

perf_event: x86: Clean up some of the u64/long bitmask casting

We need this to be u64 for direct assignment, but the bitmask functions
all work on unsigned long, leading to cast heaven; solve this by using a
union.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.595961269@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |   47 ++++++++++++++++++-------------------
 1 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 7e181a5..921bbf7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -69,10 +69,11 @@ struct debug_store {
 	u64	pebs_event_reset[MAX_PEBS_EVENTS];
 };
 
-#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
-
 struct event_constraint {
-	u64	idxmsk[BITS_TO_U64(X86_PMC_IDX_MAX)];
+	union {
+		unsigned long	idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+		u64		idxmsk64[1];
+	};
 	int	code;
 	int	cmask;
 };
@@ -90,13 +91,14 @@ struct cpu_hw_events {
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 };
 
-#define EVENT_CONSTRAINT(c, n, m) { \
-	.code = (c),	\
-	.cmask = (m),	\
-	.idxmsk[0] = (n) }
+#define EVENT_CONSTRAINT(c, n, m) { 	\
+	{ .idxmsk64[0] = (n) },		\
+	.code = (c),			\
+	.cmask = (m),			\
+}
 
 #define EVENT_CONSTRAINT_END \
-	{ .code = 0, .cmask = 0, .idxmsk[0] = 0 }
+	EVENT_CONSTRAINT(0, 0, 0)
 
 #define for_each_event_constraint(e, c) \
 	for ((e) = (c); (e)->cmask; (e)++)
@@ -126,8 +128,11 @@ struct x86_pmu {
 	u64		intel_ctrl;
 	void		(*enable_bts)(u64 config);
 	void		(*disable_bts)(void);
-	void		(*get_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event, u64 *idxmsk);
-	void		(*put_event_constraints)(struct cpu_hw_events *cpuc, struct perf_event *event);
+	void		(*get_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event,
+						 unsigned long *idxmsk);
+	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event);
 	const struct event_constraint *event_constraints;
 };
 
@@ -2144,14 +2149,11 @@ perf_event_nmi_handler(struct notifier_block *self,
 	return NOTIFY_STOP;
 }
 
-static struct event_constraint bts_constraint = {
-	.code = 0,
-	.cmask = 0,
-	.idxmsk[0] = 1ULL << X86_PMC_IDX_FIXED_BTS
-};
+static struct event_constraint bts_constraint =
+	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
 
 static int intel_special_constraints(struct perf_event *event,
-				     u64 *idxmsk)
+				     unsigned long *idxmsk)
 {
 	unsigned int hw_event;
 
@@ -2171,14 +2173,14 @@ static int intel_special_constraints(struct perf_event *event,
 
 static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
 					struct perf_event *event,
-					u64 *idxmsk)
+					unsigned long *idxmsk)
 {
 	const struct event_constraint *c;
 
 	/*
 	 * cleanup bitmask
 	 */
-	bitmap_zero((unsigned long *)idxmsk, X86_PMC_IDX_MAX);
+	bitmap_zero(idxmsk, X86_PMC_IDX_MAX);
 
 	if (intel_special_constraints(event, idxmsk))
 		return;
@@ -2186,10 +2188,7 @@ static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
 	if (x86_pmu.event_constraints) {
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
 			if ((event->hw.config & c->cmask) == c->code) {
-
-				bitmap_copy((unsigned long *)idxmsk,
-					    (unsigned long *)c->idxmsk,
-					    X86_PMC_IDX_MAX);
+				bitmap_copy(idxmsk, c->idxmsk, X86_PMC_IDX_MAX);
 				return;
 			}
 		}
@@ -2200,10 +2199,10 @@ static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
 
 static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
 				      struct perf_event *event,
-				      u64 *idxmsk)
+				      unsigned long *idxmsk)
 {
 	/* no constraints, means supports all generic counters */
-	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
+	bitmap_fill(idxmsk, x86_pmu.num_events);
 }
 
 static int x86_event_sched_in(struct perf_event *event,

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Reduce some overly long lines with some MACROs
  2010-01-22 15:50 ` [PATCH 06/10] perf_event: x86: Reduce some overly long lines with some MACROs Peter Zijlstra
@ 2010-01-29  9:27   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  8433be1184e4f22c37d4b8ed36cde529a47882f4
Gitweb:     http://git.kernel.org/tip/8433be1184e4f22c37d4b8ed36cde529a47882f4
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 15:38:26 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:38 +0100

perf_event: x86: Reduce some overly long lines with some MACROs

Introduce INTEL_EVENT_CONSTRAINT and FIXED_EVENT_CONSTRAINT to reduce
some line length and typing work.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.688730371@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |   68 ++++++++++++++++++++-----------------
 1 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 921bbf7..4d1ed10 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -97,6 +97,12 @@ struct cpu_hw_events {
 	.cmask = (m),			\
 }
 
+#define INTEL_EVENT_CONSTRAINT(c, n)	\
+	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)
+
+#define FIXED_EVENT_CONSTRAINT(c, n)	\
+	EVENT_CONSTRAINT(c, n, INTEL_ARCH_FIXED_MASK)
+
 #define EVENT_CONSTRAINT_END \
 	EVENT_CONSTRAINT(0, 0, 0)
 
@@ -192,12 +198,12 @@ static u64 p6_pmu_raw_event(u64 hw_event)
 
 static struct event_constraint intel_p6_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc1, 0x1, INTEL_ARCH_EVENT_MASK),	/* FLOPS */
-	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x1, INTEL_ARCH_EVENT_MASK),	/* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK),	/* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK),	/* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK),	/* CYCLES_DIV_BUSY */
+	INTEL_EVENT_CONSTRAINT(0xc1, 0x1),	/* FLOPS */
+	INTEL_EVENT_CONSTRAINT(0x10, 0x1),	/* FP_COMP_OPS_EXE */
+	INTEL_EVENT_CONSTRAINT(0x11, 0x1),	/* FP_ASSIST */
+	INTEL_EVENT_CONSTRAINT(0x12, 0x2),	/* MUL */
+	INTEL_EVENT_CONSTRAINT(0x13, 0x2),	/* DIV */
+	INTEL_EVENT_CONSTRAINT(0x14, 0x1),	/* CYCLES_DIV_BUSY */
 	EVENT_CONSTRAINT_END
 };
 
@@ -217,41 +223,41 @@ static const u64 intel_perfmon_event_map[] =
 
 static struct event_constraint intel_core_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
-	EVENT_CONSTRAINT(0x10, 0x1, INTEL_ARCH_EVENT_MASK), /* FP_COMP_OPS_EXE */
-	EVENT_CONSTRAINT(0x11, 0x2, INTEL_ARCH_EVENT_MASK), /* FP_ASSIST */
-	EVENT_CONSTRAINT(0x12, 0x2, INTEL_ARCH_EVENT_MASK), /* MUL */
-	EVENT_CONSTRAINT(0x13, 0x2, INTEL_ARCH_EVENT_MASK), /* DIV */
-	EVENT_CONSTRAINT(0x14, 0x1, INTEL_ARCH_EVENT_MASK), /* CYCLES_DIV_BUSY */
-	EVENT_CONSTRAINT(0x18, 0x1, INTEL_ARCH_EVENT_MASK), /* IDLE_DURING_DIV */
-	EVENT_CONSTRAINT(0x19, 0x2, INTEL_ARCH_EVENT_MASK), /* DELAYED_BYPASS */
-	EVENT_CONSTRAINT(0xa1, 0x1, INTEL_ARCH_EVENT_MASK), /* RS_UOPS_DISPATCH_CYCLES */
-	EVENT_CONSTRAINT(0xcb, 0x1, INTEL_ARCH_EVENT_MASK), /* MEM_LOAD_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0x10, 0x1), /* FP_COMP_OPS_EXE */
+	INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
+	INTEL_EVENT_CONSTRAINT(0x12, 0x2), /* MUL */
+	INTEL_EVENT_CONSTRAINT(0x13, 0x2), /* DIV */
+	INTEL_EVENT_CONSTRAINT(0x14, 0x1), /* CYCLES_DIV_BUSY */
+	INTEL_EVENT_CONSTRAINT(0x18, 0x1), /* IDLE_DURING_DIV */
+	INTEL_EVENT_CONSTRAINT(0x19, 0x2), /* DELAYED_BYPASS */
+	INTEL_EVENT_CONSTRAINT(0xa1, 0x1), /* RS_UOPS_DISPATCH_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0xcb, 0x1), /* MEM_LOAD_RETIRED */
 	EVENT_CONSTRAINT_END
 };
 
 static struct event_constraint intel_nehalem_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
-	EVENT_CONSTRAINT(0x40, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LD */
-	EVENT_CONSTRAINT(0x41, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_ST */
-	EVENT_CONSTRAINT(0x42, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK */
-	EVENT_CONSTRAINT(0x43, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_ALL_REF */
-	EVENT_CONSTRAINT(0x4e, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_PREFETCH */
-	EVENT_CONSTRAINT(0x4c, 0x3, INTEL_ARCH_EVENT_MASK), /* LOAD_HIT_PRE */
-	EVENT_CONSTRAINT(0x51, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D */
-	EVENT_CONSTRAINT(0x52, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0x53, 0x3, INTEL_ARCH_EVENT_MASK), /* L1D_CACHE_LOCK_FB_HIT */
-	EVENT_CONSTRAINT(0xc5, 0x3, INTEL_ARCH_EVENT_MASK), /* CACHE_LOCK_CYCLES */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
+	INTEL_EVENT_CONSTRAINT(0x40, 0x3), /* L1D_CACHE_LD */
+	INTEL_EVENT_CONSTRAINT(0x41, 0x3), /* L1D_CACHE_ST */
+	INTEL_EVENT_CONSTRAINT(0x42, 0x3), /* L1D_CACHE_LOCK */
+	INTEL_EVENT_CONSTRAINT(0x43, 0x3), /* L1D_ALL_REF */
+	INTEL_EVENT_CONSTRAINT(0x4e, 0x3), /* L1D_PREFETCH */
+	INTEL_EVENT_CONSTRAINT(0x4c, 0x3), /* LOAD_HIT_PRE */
+	INTEL_EVENT_CONSTRAINT(0x51, 0x3), /* L1D */
+	INTEL_EVENT_CONSTRAINT(0x52, 0x3), /* L1D_CACHE_PREFETCH_LOCK_FB_HIT */
+	INTEL_EVENT_CONSTRAINT(0x53, 0x3), /* L1D_CACHE_LOCK_FB_HIT */
+	INTEL_EVENT_CONSTRAINT(0xc5, 0x3), /* CACHE_LOCK_CYCLES */
 	EVENT_CONSTRAINT_END
 };
 
 static struct event_constraint intel_gen_event_constraints[] =
 {
-	EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32)), INTEL_ARCH_FIXED_MASK), /* INSTRUCTIONS_RETIRED */
-	EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33)), INTEL_ARCH_FIXED_MASK), /* UNHALTED_CORE_CYCLES */
+	FIXED_EVENT_CONSTRAINT(0xc0, (0x3|(1ULL<<32))), /* INSTRUCTIONS_RETIRED */
+	FIXED_EVENT_CONSTRAINT(0x3c, (0x3|(1ULL<<33))), /* UNHALTED_CORE_CYCLES */
 	EVENT_CONSTRAINT_END
 };
 

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-22 15:50 ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} Peter Zijlstra
@ 2010-01-29  9:28   ` tip-bot for Peter Zijlstra
  2010-01-29 10:01     ` Andrew Morton
  2010-01-29 10:32   ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} John Kacur
  1 sibling, 1 reply; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, torvalds, a.p.zijlstra, akpm,
	tglx, mingo

Commit-ID:  9f41699ed067fa695faff8e2e9981b2550abec62
Gitweb:     http://git.kernel.org/tip/9f41699ed067fa695faff8e2e9981b2550abec62
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 15:59:29 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:39 +0100

bitops: Provide compile time HWEIGHT{8,16,32,64}

Provide compile time versions of hweight.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <20100122155535.797688466@chello.nl>
[ Remove some whitespace damage while we are at it ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/bitops.h |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index c05a29c..ba0fd1e 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -25,7 +25,7 @@
 static __inline__ int get_bitmask_order(unsigned int count)
 {
 	int order;
-	
+
 	order = fls(count);
 	return order;	/* We could be slightly more clever with -1 here... */
 }
@@ -33,7 +33,7 @@ static __inline__ int get_bitmask_order(unsigned int count)
 static __inline__ int get_count_order(unsigned int count)
 {
 	int order;
-	
+
 	order = fls(count) - 1;
 	if (count & (count - 1))
 		order++;
@@ -45,6 +45,20 @@ static inline unsigned long hweight_long(unsigned long w)
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
+#define HWEIGHT8(w)			\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
+#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
+#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
+
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Optimize the constraint searching bits
  2010-01-22 15:50 ` [PATCH 08/10] perf_event: Optimize the constraint searching bits Peter Zijlstra
  2010-01-22 16:08   ` Stephane Eranian
@ 2010-01-29  9:28   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  63b146490befc027a7e0923e333269e68b20d380
Gitweb:     http://git.kernel.org/tip/63b146490befc027a7e0923e333269e68b20d380
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 16:32:17 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:40 +0100

perf_event: x86: Optimize the constraint searching bits

Instead of copying bitmasks around, pass pointers to the constraint
structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.887853503@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |   75 +++++++++++++++++---------------------
 1 files changed, 34 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4d1ed10..092ad56 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -134,12 +134,14 @@ struct x86_pmu {
 	u64		intel_ctrl;
 	void		(*enable_bts)(u64 config);
 	void		(*disable_bts)(void);
-	void		(*get_event_constraints)(struct cpu_hw_events *cpuc,
-						 struct perf_event *event,
-						 unsigned long *idxmsk);
+
+	struct event_constraint *
+			(*get_event_constraints)(struct cpu_hw_events *cpuc,
+						 struct perf_event *event);
+
 	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
-	const struct event_constraint *event_constraints;
+	struct event_constraint *event_constraints;
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
@@ -1242,17 +1244,15 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
 	int i, j , w, num;
 	int weight, wmax;
-	unsigned long *c;
-	unsigned long constraints[X86_PMC_IDX_MAX][BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
 
 	for (i = 0; i < n; i++) {
-		x86_pmu.get_event_constraints(cpuc,
-					      cpuc->event_list[i],
-					      constraints[i]);
+		constraints[i] =
+		  x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
 	}
 
 	/*
@@ -1267,7 +1267,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			break;
 
 		/* constraint still honored */
-		if (!test_bit(hwc->idx, c))
+		if (!test_bit(hwc->idx, c->idxmsk))
 			break;
 
 		/* not already used */
@@ -1320,11 +1320,11 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
-			weight = bitmap_weight(c, X86_PMC_IDX_MAX);
+			weight = bitmap_weight(c->idxmsk, X86_PMC_IDX_MAX);
 			if (weight != w)
 				continue;
 
-			for_each_bit(j, c, X86_PMC_IDX_MAX) {
+			for_each_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
 				if (!test_bit(j, used_mask))
 					break;
 			}
@@ -2155,11 +2155,13 @@ perf_event_nmi_handler(struct notifier_block *self,
 	return NOTIFY_STOP;
 }
 
+static struct event_constraint unconstrained;
+
 static struct event_constraint bts_constraint =
 	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
 
-static int intel_special_constraints(struct perf_event *event,
-				     unsigned long *idxmsk)
+static struct event_constraint *
+intel_special_constraints(struct perf_event *event)
 {
 	unsigned int hw_event;
 
@@ -2169,46 +2171,34 @@ static int intel_special_constraints(struct perf_event *event,
 		      x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS)) &&
 		     (event->hw.sample_period == 1))) {
 
-		bitmap_copy((unsigned long *)idxmsk,
-			    (unsigned long *)bts_constraint.idxmsk,
-			    X86_PMC_IDX_MAX);
-		return 1;
+		return &bts_constraint;
 	}
-	return 0;
+	return NULL;
 }
 
-static void intel_get_event_constraints(struct cpu_hw_events *cpuc,
-					struct perf_event *event,
-					unsigned long *idxmsk)
+static struct event_constraint *
+intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
 {
-	const struct event_constraint *c;
+	struct event_constraint *c;
 
-	/*
-	 * cleanup bitmask
-	 */
-	bitmap_zero(idxmsk, X86_PMC_IDX_MAX);
-
-	if (intel_special_constraints(event, idxmsk))
-		return;
+	c = intel_special_constraints(event);
+	if (c)
+		return c;
 
 	if (x86_pmu.event_constraints) {
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
-			if ((event->hw.config & c->cmask) == c->code) {
-				bitmap_copy(idxmsk, c->idxmsk, X86_PMC_IDX_MAX);
-				return;
-			}
+			if ((event->hw.config & c->cmask) == c->code)
+				return c;
 		}
 	}
-	/* no constraints, means supports all generic counters */
-	bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
+
+	return &unconstrained;
 }
 
-static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
-				      struct perf_event *event,
-				      unsigned long *idxmsk)
+static struct event_constraint *
+amd_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
 {
-	/* no constraints, means supports all generic counters */
-	bitmap_fill(idxmsk, x86_pmu.num_events);
+	return &unconstrained;
 }
 
 static int x86_event_sched_in(struct perf_event *event,
@@ -2576,6 +2566,9 @@ void __init init_hw_perf_events(void)
 	perf_events_lapic_init();
 	register_die_notifier(&perf_event_nmi_notifier);
 
+	unconstrained = (struct event_constraint)
+		EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_events) - 1, 0);
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.event_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_events);

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Optimize constraint weight computation
  2010-01-22 15:50 ` [PATCH 09/10] perf_event: x86: Optimize constraint weight computation Peter Zijlstra
@ 2010-01-29  9:28   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  272d30be622c9c6cbd514b1211ff359292001baa
Gitweb:     http://git.kernel.org/tip/272d30be622c9c6cbd514b1211ff359292001baa
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 16:32:17 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:41 +0100

perf_event: x86: Optimize constraint weight computation

Add a weight member to the constraint structure and avoid recomputing the
weight at runtime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155535.963944926@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 092ad56..2c22ce4 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -23,6 +23,7 @@
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
 #include <linux/cpu.h>
+#include <linux/bitops.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -76,6 +77,7 @@ struct event_constraint {
 	};
 	int	code;
 	int	cmask;
+	int	weight;
 };
 
 struct cpu_hw_events {
@@ -95,6 +97,7 @@ struct cpu_hw_events {
 	{ .idxmsk64[0] = (n) },		\
 	.code = (c),			\
 	.cmask = (m),			\
+	.weight = HWEIGHT64((u64)(n)),	\
 }
 
 #define INTEL_EVENT_CONSTRAINT(c, n)	\
@@ -1242,8 +1245,7 @@ static inline int is_x86_event(struct perf_event *event)
 
 static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	int i, j , w, num;
-	int weight, wmax;
+	int i, j, w, num, wmax;
 	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct hw_perf_event *hwc;
@@ -1320,8 +1322,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			c = constraints[i];
 			hwc = &cpuc->event_list[i]->hw;
 
-			weight = bitmap_weight(c->idxmsk, X86_PMC_IDX_MAX);
-			if (weight != w)
+			if (c->weight != w)
 				continue;
 
 			for_each_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:perf/core] perf_event: x86: Optimize the fast path a little more
  2010-01-22 15:50 ` [PATCH 10/10] perf_event: Optimize the fast path a little more Peter Zijlstra
@ 2010-01-29  9:28   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-01-29  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  c933c1a603d5bf700ddce79216c1be0ec3bc0e6c
Gitweb:     http://git.kernel.org/tip/c933c1a603d5bf700ddce79216c1be0ec3bc0e6c
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Fri, 22 Jan 2010 16:40:12 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 29 Jan 2010 09:01:42 +0100

perf_event: x86: Optimize the fast path a little more

Remove num from the fast path and save a few ops.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <20100122155536.056430539@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2c22ce4..33c889f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1245,9 +1245,9 @@ static inline int is_x86_event(struct perf_event *event)
 
 static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
-	int i, j, w, num, wmax;
 	struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	int i, j, w, wmax, num = 0;
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -1260,7 +1260,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	/*
 	 * fastpath, try to reuse previous register
 	 */
-	for (i = 0, num = n; i < n; i++, num--) {
+	for (i = 0; i < n; i++) {
 		hwc = &cpuc->event_list[i]->hw;
 		c = constraints[i];
 
@@ -1288,7 +1288,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 		if (assign)
 			assign[i] = hwc->idx;
 	}
-	if (!num)
+	if (i == n)
 		goto done;
 
 	/*

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
@ 2010-01-29 10:01     ` Andrew Morton
  2010-01-29 10:04       ` Ingo Molnar
  2010-01-29 11:03       ` Peter Zijlstra
  0 siblings, 2 replies; 126+ messages in thread
From: Andrew Morton @ 2010-01-29 10:01 UTC (permalink / raw)
  To: mingo, hpa, eranian, linux-kernel, akpm, a.p.zijlstra, torvalds,
	tglx, mingo
  Cc: tip-bot for Peter Zijlstra, linux-tip-commits, linux-kernel,
	eranian, hpa, mingo, torvalds, tglx, mingo

On Fri, 29 Jan 2010 09:28:04 GMT tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> +#define HWEIGHT8(w)			\
> +      (	(!!((w) & (1ULL << 0))) +	\
> +	(!!((w) & (1ULL << 1))) +	\
> +	(!!((w) & (1ULL << 2))) +	\
> +	(!!((w) & (1ULL << 3))) +	\
> +	(!!((w) & (1ULL << 4))) +	\
> +	(!!((w) & (1ULL << 5))) +	\
> +	(!!((w) & (1ULL << 6))) +	\
> +	(!!((w) & (1ULL << 7)))	)
> +
> +#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
> +#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
> +#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))

Would be nice if it had a comment explaining why it exists.  If people
accidentally use this with non-constant arguments, the generated code
will be pretty ghastly.

Or add some barf-if-not-__constant_p() thing, perhaps.
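
One way such a guard could look, sketched here only for illustration (this is
not code posted in the thread): BUILD_BUG_ON_ZERO() evaluates to 0 when its
condition is false and breaks the build otherwise, so wrapping the expansion
in it rejects non-constant arguments at compile time.

#define HWEIGHT(w)						\
	(BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +		\
	 HWEIGHT64((u64)(w)))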

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 10:01     ` Andrew Morton
@ 2010-01-29 10:04       ` Ingo Molnar
  2010-01-29 10:13         ` Andrew Morton
  2010-01-29 11:03       ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Ingo Molnar @ 2010-01-29 10:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, hpa, eranian, linux-kernel, a.p.zijlstra, torvalds, tglx,
	linux-tip-commits


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 29 Jan 2010 09:28:04 GMT tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > +#define HWEIGHT8(w)			\
> > +      (	(!!((w) & (1ULL << 0))) +	\
> > +	(!!((w) & (1ULL << 1))) +	\
> > +	(!!((w) & (1ULL << 2))) +	\
> > +	(!!((w) & (1ULL << 3))) +	\
> > +	(!!((w) & (1ULL << 4))) +	\
> > +	(!!((w) & (1ULL << 5))) +	\
> > +	(!!((w) & (1ULL << 6))) +	\
> > +	(!!((w) & (1ULL << 7)))	)
> > +
> > +#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
> > +#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
> > +#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
> 
> Would be nice if it had a comment explaining why it exists.  If people
> accidentally use this with non-constant arguments, the generated code
> will be pretty ghastly.
> 
> Or add some barf-if-not-__constant_p() thing, perhaps.

Yeah, agreed.

	Ingo

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 10:04       ` Ingo Molnar
@ 2010-01-29 10:13         ` Andrew Morton
  0 siblings, 0 replies; 126+ messages in thread
From: Andrew Morton @ 2010-01-29 10:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, eranian, linux-kernel, a.p.zijlstra, torvalds, tglx,
	linux-tip-commits

On Fri, 29 Jan 2010 11:04:31 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Fri, 29 Jan 2010 09:28:04 GMT tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > +#define HWEIGHT8(w)			\
> > > +      (	(!!((w) & (1ULL << 0))) +	\
> > > +	(!!((w) & (1ULL << 1))) +	\
> > > +	(!!((w) & (1ULL << 2))) +	\
> > > +	(!!((w) & (1ULL << 3))) +	\
> > > +	(!!((w) & (1ULL << 4))) +	\
> > > +	(!!((w) & (1ULL << 5))) +	\
> > > +	(!!((w) & (1ULL << 6))) +	\
> > > +	(!!((w) & (1ULL << 7)))	)
> > > +
> > > +#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
> > > +#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
> > > +#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
> > 
> > Would be nice if it had a comment explaining why it exists.  If people
> > accidentally use this with non-constant arguments, the generated code
> > will be pretty ghastly.
> > 
> > Or add some barf-if-not-__constant_p() thing, perhaps.
> 
> Yeah, agreed.
> 

Also...

Should we just do

#define HWEIGHT(x) HWEIGHT64(x)

and make HWEIGHT() the sole officially-exported interface?  I mean, all
it does is emit an obfuscated constant - perhaps we can save users from
having to pick which one of the above to use by giving them a "this one
always works" interface.

That might require some casting to suppress "shift out of range"
warnings though.

<wonders if we'd otherwise end up needing an HWEIGHT_LONG()>
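
As a sketch of the cast variant (illustration only, not a tested definition):

#define HWEIGHT(w)	HWEIGHT64((u64)(w))

Widening the argument to u64 up front should keep the (w) >> 32 inside
HWEIGHT64() from triggering shift-width warnings for 32-bit arguments,
without needing a separate HWEIGHT_LONG().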

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-22 15:50 ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} Peter Zijlstra
  2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
@ 2010-01-29 10:32   ` John Kacur
  2010-01-29 11:05     ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: John Kacur @ 2010-01-29 10:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, fweisbec, paulus, mingo, davem, robert.richter,
	linux-kernel, Andrew Morton

On Fri, Jan 22, 2010 at 4:50 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Provide compile time versions of hweight.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> ---
>  include/linux/bitops.h |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> Index: linux-2.6/include/linux/bitops.h
> ===================================================================
> --- linux-2.6.orig/include/linux/bitops.h
> +++ linux-2.6/include/linux/bitops.h
> @@ -45,6 +45,20 @@ static inline unsigned long hweight_long
>        return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
>  }

I like it. Maybe provide a comment that this provides the Hamming weight, it
took me a second to realize what this was.

>
> +#define HWEIGHT8(w)                    \
> +      (        (!!((w) & (1ULL << 0))) +       \
> +       (!!((w) & (1ULL << 1))) +       \
> +       (!!((w) & (1ULL << 2))) +       \
> +       (!!((w) & (1ULL << 3))) +       \
> +       (!!((w) & (1ULL << 4))) +       \
> +       (!!((w) & (1ULL << 5))) +       \
> +       (!!((w) & (1ULL << 6))) +       \
> +       (!!((w) & (1ULL << 7))) )
> +
> +#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
> +#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
> +#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
> +
>  /**
>  * rol32 - rotate a 32-bit value left
>  * @word: value to rotate
>
> --
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 10:01     ` Andrew Morton
  2010-01-29 10:04       ` Ingo Molnar
@ 2010-01-29 11:03       ` Peter Zijlstra
  2010-01-29 16:24         ` Linus Torvalds
  2010-01-29 22:50         ` H. Peter Anvin
  1 sibling, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-29 11:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, hpa, eranian, linux-kernel, torvalds, tglx, mingo,
	linux-tip-commits

On Fri, 2010-01-29 at 02:01 -0800, Andrew Morton wrote:
> On Fri, 29 Jan 2010 09:28:04 GMT tip-bot for Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > +#define HWEIGHT8(w)			\
> > +      (	(!!((w) & (1ULL << 0))) +	\
> > +	(!!((w) & (1ULL << 1))) +	\
> > +	(!!((w) & (1ULL << 2))) +	\
> > +	(!!((w) & (1ULL << 3))) +	\
> > +	(!!((w) & (1ULL << 4))) +	\
> > +	(!!((w) & (1ULL << 5))) +	\
> > +	(!!((w) & (1ULL << 6))) +	\
> > +	(!!((w) & (1ULL << 7)))	)
> > +
> > +#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
> > +#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
> > +#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
> 
> Would be nice if it had a comment explaining why it exists.  If people
> accidentally use this with non-constant arguments, the generated code
> will be pretty ghastly.

*sigh* and here I thought it being placed right next to hweight_long()
which uses the arch hweightN() would be clue enough.

If people are so clueless, who says they'll read a comment.. but sure I
guess I can add one.

> Or add some barf-if-not-__constant_p() thing, perhaps.

I've actually sneaked one non-constant usage in, but since it's in an
init path I didn't care to fix that, but I guess here goes:

---
Subject: bitops: Dummyify the compile-time hweight versions

Because it seems allowed to not think and write kernel code.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -93,13 +93,16 @@ struct cpu_hw_events {
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 };
 
-#define EVENT_CONSTRAINT(c, n, m) {	\
+#define __EVENT_CONSTRAINT(c, n, m, w) {\
 	{ .idxmsk64[0] = (n) },		\
 	.code = (c),			\
 	.cmask = (m),			\
-	.weight = HWEIGHT64((u64)(n)),	\
+	.weight = (w),			\
 }
 
+#define EVENT_CONSTRAINT(c, n, m)	\
+	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n))
+
 #define INTEL_EVENT_CONSTRAINT(c, n)	\
 	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVTSEL_MASK)
 
@@ -2646,7 +2649,8 @@ void __init init_hw_perf_events(void)
 	register_die_notifier(&perf_event_nmi_notifier);
 
 	unconstrained = (struct event_constraint)
-		EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_events) - 1, 0);
+		__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_events) - 1, 
+				   0, x86_pmu.num_events);
 
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.event_bits);
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,19 +45,29 @@ static inline unsigned long hweight_long
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-#define HWEIGHT8(w)			\
-      (	(!!((w) & (1ULL << 0))) +	\
-	(!!((w) & (1ULL << 1))) +	\
-	(!!((w) & (1ULL << 2))) +	\
-	(!!((w) & (1ULL << 3))) +	\
-	(!!((w) & (1ULL << 4))) +	\
-	(!!((w) & (1ULL << 5))) +	\
-	(!!((w) & (1ULL << 6))) +	\
+/*
+ * Clearly slow versions of the hweightN() functions, their benefit is
+ * of course compile time evaluation of constant arguments.
+ */
+#define HWEIGHT8(w)					\
+      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
+	(!!((w) & (1ULL << 0))) +			\
+	(!!((w) & (1ULL << 1))) +			\
+	(!!((w) & (1ULL << 2))) +			\
+	(!!((w) & (1ULL << 3))) +			\
+	(!!((w) & (1ULL << 4))) +			\
+	(!!((w) & (1ULL << 5))) +			\
+	(!!((w) & (1ULL << 6))) +			\
 	(!!((w) & (1ULL << 7)))	)
 
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8(w >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16(w >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32(w >> 32))
+#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
+#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
+#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
+
+/*
+ * For us lazy bastards
+ */
+#define HWEIGHT(w)   HWEIGHT64((u64)(w))
 
 /**
  * rol32 - rotate a 32-bit value left



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 10:32   ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} John Kacur
@ 2010-01-29 11:05     ` Peter Zijlstra
  2010-01-29 11:13       ` John Kacur
                         ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-29 11:05 UTC (permalink / raw)
  To: John Kacur
  Cc: eranian, fweisbec, paulus, mingo, davem, robert.richter,
	linux-kernel, Andrew Morton

On Fri, 2010-01-29 at 11:32 +0100, John Kacur wrote:
> 
> I like it. Maybe provide a comment that this provides the Hamming weight, it
> took me a second to realize what this was.

it's called _H_weight for crying out loud.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 11:05     ` Peter Zijlstra
@ 2010-01-29 11:13       ` John Kacur
  2010-01-30  0:09       ` H. Peter Anvin
  2010-01-30  7:34       ` Ingo Molnar
  2 siblings, 0 replies; 126+ messages in thread
From: John Kacur @ 2010-01-29 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, fweisbec, paulus, mingo, davem, robert.richter,
	linux-kernel, Andrew Morton

On Fri, Jan 29, 2010 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-01-29 at 11:32 +0100, John Kacur wrote:
>>
>> I like it. Maybe provide a comment that this provides the Hamming weight, it
>> took me a second to realize what this was.
>
> it's called _H_weight for crying out loud.
>

Ha. :) You're right, but it's always obvious when your hands are dirty with
algorithms that can use it. Easy to forget when you're working on other stuff.
Put the term in there so we can google it more easily.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 11:03       ` Peter Zijlstra
@ 2010-01-29 16:24         ` Linus Torvalds
  2010-01-29 22:50         ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: Linus Torvalds @ 2010-01-29 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, mingo, hpa, eranian, linux-kernel, tglx, mingo,
	linux-tip-commits



On Fri, 29 Jan 2010, Peter Zijlstra wrote:
> 
> *sigh* and here I thought it being placed right next to hweight_long()
> which uses the arch hweightN() would be clue enough.

No. People who add new uses may be copying old uses, without looking at 
the definition.

Also, people who _change_ uses may be changing a value that used to be a 
constant into a variable when something is made more dynamic. At which 
point it really makes sense to have a function that requires a constant to 
_check_ that it gets a constant.

> If people are so clueless, who says they'll read a comment.. but sure I
> guess I can add one.

The comment nobody cares about. But surprisingly crap code generation? 
That's bad.

> Subject: bitops: Dummyify the compile-time hweight versions
> 
> Because it seems allowed to not think and write kernel code.

I would suggest you look in the mirror at some point.

		Linus

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 11:03       ` Peter Zijlstra
  2010-01-29 16:24         ` Linus Torvalds
@ 2010-01-29 22:50         ` H. Peter Anvin
  2010-01-30 16:28           ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-01-29 22:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, mingo, eranian, linux-kernel, torvalds, tglx,
	mingo, linux-tip-commits

On 01/29/2010 03:03 AM, Peter Zijlstra wrote:
> 
> *sigh* and here I thought it being placed right next to hweight_long()
> which uses the arch hweightN() would be clue enough.
> 
> If people are so clueless, who says they'll read a comment.. but sure I
> guess I can add one.
> 

I would personally say that the Right Way[TM] to do this is to call
these __constant_hweightX() -- so the name reflects the function -- and
then have

#define hweight(x) (__builtin_constant_p(x) ? __constant_hweight(x) :
__arch_hweight(x))

[example does not reflect actual naming]

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 11:05     ` Peter Zijlstra
  2010-01-29 11:13       ` John Kacur
@ 2010-01-30  0:09       ` H. Peter Anvin
  2010-01-30  7:34       ` Ingo Molnar
  2 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-01-30  0:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Kacur, eranian, fweisbec, paulus, mingo, davem,
	robert.richter, linux-kernel, Andrew Morton

On 01/29/2010 03:05 AM, Peter Zijlstra wrote:
> On Fri, 2010-01-29 at 11:32 +0100, John Kacur wrote:
>>
>> I like it. Maybe provide a comment that this provides the Hamming weight, it
>> took me a second to realize what this was.
> 
> it's called _H_weight for crying out loud.
> 

True... although in computer architecture, when talking about bits, I
think you find that the term population count (popcount) is more readily
recognized.

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 11:05     ` Peter Zijlstra
  2010-01-29 11:13       ` John Kacur
  2010-01-30  0:09       ` H. Peter Anvin
@ 2010-01-30  7:34       ` Ingo Molnar
  2 siblings, 0 replies; 126+ messages in thread
From: Ingo Molnar @ 2010-01-30  7:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Kacur, eranian, fweisbec, paulus, davem, robert.richter,
	linux-kernel, Andrew Morton


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2010-01-29 at 11:32 +0100, John Kacur wrote:
> > 
> > I like it. Maybe provide a comment that this provides the Hamming weight, 
> > it took me a second to realize what this was.
> 
> it's called _H_weight for crying out loud.

I always forget such abbreviations :-/

I think when a single kernel hacker requests a clarifying comment that's 
reason enough to add it. After all he broke through a barrier of laziness to 
write that email ;-)

	Ingo

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits
@ 2010-01-30  9:45 Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
                   ` (4 more replies)
  0 siblings, 5 replies; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, Al Viro, linux-fsdevel

Andrew,

Resend the patches related to FMODE_* and O_*, rebased to the -mm tree.

O_* and FMODE_NONOTIFY collision fix/check
	[PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
	[PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
	[PATCH 3/5] vfs: O_* bit numbers uniqueness check

allow negative f_pos for /dev/kmem
	[PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos
	[PATCH 5/5] devmem: dont allow seek to last page

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-02-01 20:44   ` Andrew Morton
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, Eric Paris, Al Viro, linux-fsdevel

[-- Attachment #1: fanotify-bit-fix --]
[-- Type: text/plain, Size: 1183 bytes --]

FMODE_NONOTIFY=0x800000 collides with __O_SYNC in sparc,
so change it to 0x1000000.

CC: Eric Paris <eparis@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/asm-generic/fcntl.h |    2 +-
 include/linux/fs.h          |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-13 10:16:27.000000000 +0800
+++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
@@ -5,7 +5,7 @@
 
 /*
  * FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x800000
+ * FMODE_NONOTIFY is 0x1000000
  * These cannot be used by userspace O_* until internal and external open
  * flags are split.
  * -Eric Paris
--- linux-mm.orig/include/linux/fs.h	2010-01-30 17:14:14.000000000 +0800
+++ linux-mm/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
@@ -91,7 +91,7 @@ struct inodes_stat_t {
 #define FMODE_RANDOM		((__force fmode_t)0x1000)
 
 /* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY		((__force fmode_t)0x800000)
+#define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
 /*
  * The below are the various read and write types that we support. Some of



^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-02-01 20:48   ` Andrew Morton
  2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel

[-- Attachment #1: constant-hweight32.patch --]
[-- Type: text/plain, Size: 874 bytes --]

This allows use of hweight_long() in BUILD_BUG_ON().
Suggested by Jamie.

CC: Jamie Lokier <jamie@shareable.org>
CC: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/bitops.h |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- linux-mm.orig/include/linux/bitops.h	2009-07-20 20:10:19.000000000 +0800
+++ linux-mm/include/linux/bitops.h	2010-01-30 17:41:15.000000000 +0800
@@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
 	return order;
 }
 
-static inline unsigned long hweight_long(unsigned long w)
-{
-	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
-}
+#define hweight_long(x)				\
+(						\
+	__builtin_constant_p(x) ?		\
+		__builtin_popcountl(x) :	\
+		(sizeof(x) <= 4 ?		\
+			 hweight32(x) :		\
+			 hweight64(x))		\
+)
 
 /**
  * rol32 - rotate a 32-bit value left



^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 3/5] vfs: O_* bit numbers uniqueness check
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
  2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang
  4 siblings, 0 replies; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, David Miller, Stephen Rothwell, Al Viro,
	Christoph Hellwig, Eric Paris, Roland Dreier, Jamie Lokier,
	Andreas Schwab, Al Viro, linux-fsdevel

[-- Attachment #1: fcntl-bit-check.patch --]
[-- Type: text/plain, Size: 2107 bytes --]

The O_* bit numbers are defined in 20+ arch/*, and can silently overlap.
Add a compile time check to ensure their uniqueness, as suggested by David
Miller.

v4: use the nice hweight_long() (suggested by Jamie)
    split O_SYNC to (__O_SYNC | O_DSYNC) (suggested by Andreas)
    take away the FMODE_* and O_RANDOM bits
v3: change to BUILD_BUG_ON() (suggested by Roland)

CC: David Miller <davem@davemloft.net>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Christoph Hellwig <hch@infradead.org>
CC: Eric Paris <eparis@redhat.com>
CC: Roland Dreier <rdreier@cisco.com>
CC: Jamie Lokier <jamie@shareable.org>
CC: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fcntl.c                  |   14 ++++++++++++--
 include/asm-generic/fcntl.h |    2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

--- linux-mm.orig/fs/fcntl.c	2010-01-30 17:14:13.000000000 +0800
+++ linux-mm/fs/fcntl.c	2010-01-30 17:41:36.000000000 +0800
@@ -741,11 +741,21 @@ void kill_fasync(struct fasync_struct **
 }
 EXPORT_SYMBOL(kill_fasync);
 
-static int __init fasync_init(void)
+static int __init fcntl_init(void)
 {
+	/* please add new bits here to ensure allocation uniqueness */
+	BUILD_BUG_ON(17 != hweight_long(
+		O_RDONLY	| O_WRONLY	| O_RDWR	|
+		O_CREAT		| O_EXCL	| O_NOCTTY	|
+		O_TRUNC		| O_APPEND	| O_NONBLOCK	|
+		__O_SYNC	| O_DSYNC	| FASYNC	|
+		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
+		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC
+		));
+
 	fasync_cache = kmem_cache_create("fasync_cache",
 		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
 	return 0;
 }
 
-module_init(fasync_init)
+module_init(fcntl_init)
--- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
+++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:41:36.000000000 +0800
@@ -4,6 +4,8 @@
 #include <linux/types.h>
 
 /*
+ * When introducing new O_* bits, please check its uniqueness in fcntl_init().
+ *
  * FMODE_EXEC is 0x20
  * FMODE_NONOTIFY is 0x1000000
  * These cannot be used by userspace O_* until internal and external open



^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang
  4 siblings, 0 replies; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Al Viro, Heiko Carstens, KAMEZAWA Hiroyuki,
	linux-fsdevel

[-- Attachment #1: f_pos-fix --]
[-- Type: text/plain, Size: 3728 bytes --]

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Currently, rw_verify_area() checks whether f_pos is negative and, if so,
returns -EINVAL.

But some special files such as /dev/(k)mem and /proc/<pid>/mem have
legitimate negative offsets, and we can't do any access via read/write
to those files (devices).

So introduce FMODE_NEG_OFFSET to allow negative file offsets.

Changelog: v5->v6
 - use FMODE_NEG_OFFSET (suggested by Al)
 - rebased onto 2.6.33-rc1

Changelog: v4->v5
 - clean up patches dor /dev/mem.
 - rebased onto 2.6.32-rc1

Changelog: v3->v4
 - make changes in mem.c aligned.
 - change __negative_fpos_check() to return int. 
 - fixed bug in "pos" check.
 - added comments.

Changelog: v2->v3
 - fixed bug in rw_verify_area (it cannot be compiled)

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 drivers/char/mem.c |    4 ++++
 fs/proc/base.c     |    2 ++
 fs/read_write.c    |   21 +++++++++++++++++++--
 include/linux/fs.h |    3 +++
 4 files changed, 28 insertions(+), 2 deletions(-)

--- linux-mm.orig/fs/read_write.c	2010-01-22 11:20:29.000000000 +0800
+++ linux-mm/fs/read_write.c	2010-01-30 17:41:51.000000000 +0800
@@ -205,6 +205,20 @@ bad:
 }
 #endif
 
+static int
+__negative_fpos_check(struct file *file, loff_t pos, size_t count)
+{
+	/*
+	 * pos or pos+count is negative here, check overflow.
+	 * too big "count" will be caught in rw_verify_area().
+	 */
+	if ((pos < 0) && (pos + count < pos))
+		return -EOVERFLOW;
+	if (file->f_mode & FMODE_NEG_OFFSET)
+		return 0;
+	return -EINVAL;
+}
+
 /*
  * rw_verify_area doesn't like huge counts. We limit
  * them to something that fits in "int" so that others
@@ -222,8 +236,11 @@ int rw_verify_area(int read_write, struc
 	if (unlikely((ssize_t) count < 0))
 		return retval;
 	pos = *ppos;
-	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
-		return retval;
+	if (unlikely((pos < 0) || (loff_t) (pos + count) < 0)) {
+		retval = __negative_fpos_check(file, pos, count);
+		if (retval)
+			return retval;
+	}
 
 	if (unlikely(inode->i_flock && mandatory_lock(inode))) {
 		retval = locks_mandatory_area(
--- linux-mm.orig/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
+++ linux-mm/include/linux/fs.h	2010-01-30 17:41:51.000000000 +0800
@@ -93,6 +93,9 @@ struct inodes_stat_t {
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
+/* File is huge (eg. /dev/kmem): treat loff_t as unsigned */
+#define FMODE_NEG_OFFSET	((__force fmode_t)0x2000)
+
 /*
  * The below are the various read and write types that we support. Some of
  * them include behavioral modifiers that send information down to the
--- linux-mm.orig/drivers/char/mem.c	2010-01-30 17:23:31.000000000 +0800
+++ linux-mm/drivers/char/mem.c	2010-01-30 17:41:51.000000000 +0800
@@ -882,6 +882,10 @@ static int memory_open(struct inode *ino
 	if (dev->dev_info)
 		filp->f_mapping->backing_dev_info = dev->dev_info;
 
+	/* Is /dev/mem or /dev/kmem ? */
+	if (dev->dev_info == &directly_mappable_cdev_bdi)
+		filp->f_mode |= FMODE_NEG_OFFSET;
+
 	if (dev->fops->open)
 		return dev->fops->open(inode, filp);
 
--- linux-mm.orig/fs/proc/base.c	2010-01-30 17:14:14.000000000 +0800
+++ linux-mm/fs/proc/base.c	2010-01-30 17:41:51.000000000 +0800
@@ -880,6 +880,8 @@ static const struct file_operations proc
 static int mem_open(struct inode* inode, struct file* file)
 {
 	file->private_data = (void*)((long)current->self_exec_id);
+	/* OK to pass negative loff_t, we can catch out-of-range */
+	file->f_mode |= FMODE_NEG_OFFSET;
 	return 0;
 }
 



^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 5/5] devmem: dont allow seek to last page
  2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
@ 2010-01-30  9:45 ` Wu Fengguang
  4 siblings, 0 replies; 126+ messages in thread
From: Wu Fengguang @ 2010-01-30  9:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, LKML, OGAWA Hirofumi, Al Viro, linux-fsdevel

[-- Attachment #1: mem-seek-fix --]
[-- Type: text/plain, Size: 1380 bytes --]

So as to return a uniform error -EOVERFLOW instead of a random one:

# kmem-seek 0xfffffffffffffff0
seek /dev/kmem: Device or resource busy
# kmem-seek 0xfffffffffffffff1
seek /dev/kmem: Block device required

Suggested by OGAWA Hirofumi.

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 drivers/char/mem.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- linux-mm.orig/drivers/char/mem.c	2010-01-30 17:41:51.000000000 +0800
+++ linux-mm/drivers/char/mem.c	2010-01-30 17:42:29.000000000 +0800
@@ -721,16 +721,23 @@ static loff_t memory_lseek(struct file *
 
 	mutex_lock(&file->f_path.dentry->d_inode->i_mutex);
 	switch (orig) {
-		case 0:
+		case SEEK_CUR:
+			offset += file->f_pos;
+			if ((unsigned long long)offset <
+			    (unsigned long long)file->f_pos) {
+				ret = -EOVERFLOW;
+				break;
+			}
+		case SEEK_SET:
+			/* to avoid userland mistaking f_pos=-9 as -EBADF=-9 */
+			if ((unsigned long long)offset >= ~0xFFFULL) {
+				ret = -EOVERFLOW;
+				break;
+			}
 			file->f_pos = offset;
 			ret = file->f_pos;
 			force_successful_syscall_return();
 			break;
-		case 1:
-			file->f_pos += offset;
-			ret = file->f_pos;
-			force_successful_syscall_return();
-			break;
 		default:
 			ret = -EINVAL;
 	}



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-29 22:50         ` H. Peter Anvin
@ 2010-01-30 16:28           ` Peter Zijlstra
  2010-02-01 12:43             ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-01-30 16:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, mingo, eranian, linux-kernel, torvalds, tglx,
	mingo, linux-tip-commits

On Fri, 2010-01-29 at 14:50 -0800, H. Peter Anvin wrote:

> I would personally say that the Right Way[TM] to do this is to call
> these __constant_hweightX() -- so the name reflects the function -- and
> then have
> 
> #define hweight(x) (__builtin_constant_p(x) ? __constant_hweight(x) :
> __arch_hweight(x))

I actually considered that, but since I didn't have a full cross compile
set around I wasn't sure.

The trouble is that asm/bitops.h used to be sufficient for hweightN(),
but with such a scheme we'd need linux/bitops.h.

Anyway, something like the below, I'll try and run it through the cross
compilers I have on monday or something.


---
 arch/alpha/include/asm/bitops.h      |   14 +++++++-------
 arch/sparc/include/asm/bitops_64.h   |    8 ++++----
 include/asm-generic/bitops/hweight.h |    8 ++++----
 include/linux/bitops.h               |   33 +++++++++++++++++++--------------
 lib/hweight.c                        |   19 ++++++++++---------
 5 files changed, 44 insertions(+), 38 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/bitops.h
+++ linux-2.6/arch/alpha/include/asm/bitops.h
@@ -405,24 +405,24 @@ static inline int fls(int x)
 
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
 /* Whee.  EV67 can calculate it directly.  */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
-	return hweight64(w);
+	return __arch_hweight64(w);
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
-	return hweight64(w & 0xffff);
+	return __arch_hweight64(w & 0xffff);
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
-	return hweight64(w & 0xff);
+	return __arch_hweight64(w & 0xff);
 }
 #else
 #include <asm-generic/bitops/hweight.h>
Index: linux-2.6/arch/sparc/include/asm/bitops_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/bitops_64.h
+++ linux-2.6/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr,
 
 #ifdef ULTRA_HAS_POPULATION_COUNT
 
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
 {
 	unsigned int res;
 
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(uns
 	return res;
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(uns
 	return res;
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res;
 
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(uns
 	return res;
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res;
 
Index: linux-2.6/include/asm-generic/bitops/hweight.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/hweight.h
+++ linux-2.6/include/asm-generic/bitops/hweight.h
@@ -3,9 +3,9 @@
 
 #include <asm/types.h>
 
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -40,16 +40,7 @@ static __inline__ int get_count_order(un
 	return order;
 }
 
-static inline unsigned long hweight_long(unsigned long w)
-{
-	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
-}
-
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
+#define __const_hweight8(w)				\
       (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
 	(!!((w) & (1ULL << 0))) +			\
 	(!!((w) & (1ULL << 1))) +			\
@@ -60,15 +51,29 @@ static inline unsigned long hweight_long
 	(!!((w) & (1ULL << 6))) +			\
 	(!!((w) & (1ULL << 7)))	)
 
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
+#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w) >> 8))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+#define hweight8(w)	\
+	(__builtin_constant_p(w) ? __const_hweight8(w) : __arch_hweight8(w))
+#define hweight16(w)	\
+	(__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w)	\
+	(__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w)	\
+	(__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
 
 /*
  * Type invariant version that simply casts things to the
  * largest type.
  */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
+#define HWEIGHT(w)   __const_hweight64((u64)(w))
+
+static inline unsigned long hweight_long(unsigned long w)
+{
+	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
+}
 
 /**
  * rol32 - rotate a 32-bit value left
Index: linux-2.6/lib/hweight.c
===================================================================
--- linux-2.6.orig/lib/hweight.c
+++ linux-2.6/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+	return __arch_hweight32((unsigned int)(w >> 32)) +
+	       __arch_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-01-30 16:28           ` Peter Zijlstra
@ 2010-02-01 12:43             ` Peter Zijlstra
  2010-02-01 19:06               ` H. Peter Anvin
  2010-04-06 23:03               ` [tip:core/hweight] bitops: Optimize hweight() by making use of compile-time evaluation tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-01 12:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, mingo, eranian, linux-kernel, torvalds, tglx,
	mingo, linux-tip-commits

On Sat, 2010-01-30 at 17:28 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-29 at 14:50 -0800, H. Peter Anvin wrote:
> 
> > I would personally say that the Right Way[TM] to do this is to call
> > these __constant_hweightX() -- so the name reflects the function -- and
> > then have
> > 
> > #define hweight(x) (__builtin_constant_p(x) ? __constant_hweight(x) :
> > __arch_hweight(x))
> 
> I actually considered that, but since I didn't have a full cross compile
> set around I wasn't sure.
> 
> The trouble is that asm/bitops.h used to be sufficient for hweightN(),
> but with such a scheme we'd need linux/bitops.h.
> 
> Anyway, something like the below, I'll try and run it through the cross
> compilers I have on monday or something.

Ok, that didn't compile, utter include dependency hell.

The below does work, but is still a tad ugly in that if you want to use
any of the HWEIGHT functions that use BUILD_BUG_ON_ZERO() you have to
have included linux/kernel.h yourself.

But at least it builds on x86_64, alpha and sparc64 (didn't have a ia64
compiler around).

FWIW I was tempted to change the return type of hweight64() from
unsigned long to unsigned int; it's not as if it'll ever return a value
larger than 64.

---
 arch/alpha/include/asm/bitops.h            |   18 ++++++------
 arch/ia64/include/asm/bitops.h             |   11 ++++---
 arch/sparc/include/asm/bitops_64.h         |   11 ++++---
 include/asm-generic/bitops/arch_hweight.h  |   11 +++++++
 include/asm-generic/bitops/const_hweight.h |   42 +++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h       |    8 +----
 include/linux/bitops.h                     |   25 -----------------
 lib/hweight.c                              |   19 ++++++-------
 8 files changed, 87 insertions(+), 58 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/bitops.h
+++ linux-2.6/arch/alpha/include/asm/bitops.h
@@ -405,29 +405,31 @@ static inline int fls(int x)
 
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
 /* Whee.  EV67 can calculate it directly.  */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
-	return hweight64(w);
+	return __arch_hweight64(w);
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
-	return hweight64(w & 0xffff);
+	return __arch_hweight64(w & 0xffff);
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
-	return hweight64(w & 0xff);
+	return __arch_hweight64(w & 0xff);
 }
 #else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 #endif
 
+#include <asm-generic/bitops/const_hweight.h>
+
 #endif /* __KERNEL__ */
 
 #include <asm-generic/bitops/find.h>
Index: linux-2.6/arch/ia64/include/asm/bitops.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/bitops.h
+++ linux-2.6/arch/ia64/include/asm/bitops.h
@@ -437,17 +437,18 @@ __fls (unsigned long x)
  * hweightN: returns the hamming weight (i.e. the number
  * of bits set) of a N-bit word
  */
-static __inline__ unsigned long
-hweight64 (unsigned long x)
+static __inline__ unsigned long __arch_hweight64(unsigned long x)
 {
 	unsigned long result;
 	result = ia64_popcnt(x);
 	return result;
 }
 
-#define hweight32(x)	(unsigned int) hweight64((x) & 0xfffffffful)
-#define hweight16(x)	(unsigned int) hweight64((x) & 0xfffful)
-#define hweight8(x)	(unsigned int) hweight64((x) & 0xfful)
+#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
+#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
+#define __arch_hweight8(x)  ((unsigned int) __arch_hweight64((x) & 0xfful))
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
Index: linux-2.6/arch/sparc/include/asm/bitops_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/bitops_64.h
+++ linux-2.6/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr,
 
 #ifdef ULTRA_HAS_POPULATION_COUNT
 
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
 {
 	unsigned int res;
 
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(uns
 	return res;
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(uns
 	return res;
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res;
 
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(uns
 	return res;
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res;
 
@@ -78,9 +78,10 @@ static inline unsigned int hweight8(unsi
 
 #else
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 
 #endif
+#include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
 
Index: linux-2.6/include/asm-generic/bitops/arch_hweight.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/bitops/arch_hweight.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+
+#include <asm/types.h>
+
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#endif /* _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_ */
Index: linux-2.6/include/asm-generic/bitops/const_hweight.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/bitops/const_hweight.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+
+/*
+ * Compile time versions of __arch_hweightN()
+ */
+#define __const_hweight8(w)		\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+/*
+ * Generic interface.
+ */
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+
+/*
+ * Interface for known constant arguments
+ */
+#define HWEIGHT8(w)  (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight8(w))
+#define HWEIGHT16(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight16(w))
+#define HWEIGHT32(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight32(w))
+#define HWEIGHT64(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight64(w))
+
+/*
+ * Type invariant interface to the compile time constant hweight functions.
+ */
+#define HWEIGHT(w)   HWEIGHT64((u64)(w))
+
+#endif /* _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_ */
Index: linux-2.6/include/asm-generic/bitops/hweight.h
===================================================================
--- linux-2.6.orig/include/asm-generic/bitops/hweight.h
+++ linux-2.6/include/asm-generic/bitops/hweight.h
@@ -1,11 +1,7 @@
 #ifndef _ASM_GENERIC_BITOPS_HWEIGHT_H_
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
-#include <asm/types.h>
-
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+#include <asm-generic/bitops/arch_hweight.h>
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,31 +45,6 @@ static inline unsigned long hweight_long
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate
Index: linux-2.6/lib/hweight.c
===================================================================
--- linux-2.6.orig/lib/hweight.c
+++ linux-2.6/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+	return __arch_hweight32((unsigned int)(w >> 32)) +
+	       __arch_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [tip:perf/core] bitops: Provide compile time HWEIGHT{8,16,32,64}
  2010-02-01 12:43             ` Peter Zijlstra
@ 2010-02-01 19:06               ` H. Peter Anvin
  2010-04-06 23:03               ` [tip:core/hweight] bitops: Optimize hweight() by making use of compile-time evaluation tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-01 19:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, mingo, eranian, linux-kernel, torvalds, tglx,
	mingo, linux-tip-commits

On 02/01/2010 04:43 AM, Peter Zijlstra wrote:
> 
> The below does work, but is still a tad ugly in that if you want to use
> any of the HWEIGHT functions that use BUILD_BUG_ON_ZERO() you have to
> have included linux/kernel.h yourself.
> 
> But at least it builds on x86_64, alpha and sparc64 (didn't have a ia64
> compiler around).
> 
> FWIW I was tempted to change the return type of hweight64() from
> unsigned long to unsigned int; it's not as if it'll ever return a value
> larger than 64.
> 

That might generate worse code in some cases, though (something that
needs it as a 64-bit value would have to extend it unnecessarily), but
probably the easiest is to just compile and see what happens.

x86 also has a POPCNT instruction now, which we should be able to use
via alternatives.  That's something to do after your cleanup is in.
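
Roughly along these lines, as a sketch only (the __sw_hweight32 fallback
helper is a made-up name here, and the register/clobber handling a real
patch would need around the call is glossed over):

static inline unsigned int __arch_hweight32(unsigned int w)
{
	unsigned int res;

	/* Patched to a single POPCNT on CPUs with X86_FEATURE_POPCNT. */
	asm (ALTERNATIVE("call __sw_hweight32",	/* software fallback */
			 "popcntl %1, %0",
			 X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "D" (w));

	return res;
}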

Anyway, I like the cleanup.

Acked-by: H. Peter Anvin <hpa@zytor.com>

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number
  2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
@ 2010-02-01 20:44   ` Andrew Morton
  0 siblings, 0 replies; 126+ messages in thread
From: Andrew Morton @ 2010-02-01 20:44 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: LKML, Eric Paris, Al Viro, linux-fsdevel

On Sat, 30 Jan 2010 17:45:16 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> FMODE_NONOTIFY=0x800000 collides with __O_SYNC in sparc,
> so change it to 0x1000000.
> 
> CC: Eric Paris <eparis@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/asm-generic/fcntl.h |    2 +-
>  include/linux/fs.h          |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-mm.orig/include/asm-generic/fcntl.h	2010-01-13 10:16:27.000000000 +0800
> +++ linux-mm/include/asm-generic/fcntl.h	2010-01-30 17:40:48.000000000 +0800
> @@ -5,7 +5,7 @@
>  
>  /*
>   * FMODE_EXEC is 0x20
> - * FMODE_NONOTIFY is 0x800000
> + * FMODE_NONOTIFY is 0x1000000
>   * These cannot be used by userspace O_* until internal and external open
>   * flags are split.
>   * -Eric Paris
> --- linux-mm.orig/include/linux/fs.h	2010-01-30 17:14:14.000000000 +0800
> +++ linux-mm/include/linux/fs.h	2010-01-30 17:41:04.000000000 +0800
> @@ -91,7 +91,7 @@ struct inodes_stat_t {
>  #define FMODE_RANDOM		((__force fmode_t)0x1000)
>  
>  /* File was opened by fanotify and shouldn't generate fanotify events */
> -#define FMODE_NONOTIFY		((__force fmode_t)0x800000)
> +#define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
>  
>  /*
>   * The below are the various read and write types that we support. Some of
> 

Eric, please fix up "vfs: introduce FMODE_NONOTIFY" in your tree?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
@ 2010-02-01 20:48   ` Andrew Morton
  2010-02-03 13:39     ` Wu Fengguang
  0 siblings, 1 reply; 126+ messages in thread
From: Andrew Morton @ 2010-02-01 20:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Peter Zijlstra

On Sat, 30 Jan 2010 17:45:17 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> This allows use of hweight_long() in BUILD_BUG_ON().
> Suggested by Jamie.
> 
> CC: Jamie Lokier <jamie@shareable.org>
> CC: Roland Dreier <rdreier@cisco.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/bitops.h |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> --- linux-mm.orig/include/linux/bitops.h	2009-07-20 20:10:19.000000000 +0800
> +++ linux-mm/include/linux/bitops.h	2010-01-30 17:41:15.000000000 +0800
> @@ -40,10 +40,14 @@ static __inline__ int get_count_order(un
>  	return order;
>  }
>  
> -static inline unsigned long hweight_long(unsigned long w)
> -{
> -	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> -}
> +#define hweight_long(x)				\
> +(						\
> +	__builtin_constant_p(x) ?		\
> +		__builtin_popcountl(x) :	\
> +		(sizeof(x) <= 4 ?		\
> +			 hweight32(x) :		\
> +			 hweight64(x))		\
> +)
>  
>  /**
>   * rol32 - rotate a 32-bit value left

Peter's been mucking with a compile-time HWEIGHT().  An outdated
version of that is presently in linux-next.

(I wonder if it could have used __builtin_popcount)

(I wonder which gcc versions support __builtin_popcount)

Anyway, I suspect you should be using that rather than tweaking
hweight_long().

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-01 20:48   ` Andrew Morton
@ 2010-02-03 13:39     ` Wu Fengguang
  2010-02-03 15:08       ` Andrew Morton
  2010-02-03 17:10       ` H. Peter Anvin
  0 siblings, 2 replies; 126+ messages in thread
From: Wu Fengguang @ 2010-02-03 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Peter Zijlstra, H. Peter Anvin

On Mon, Feb 01, 2010 at 01:48:25PM -0700, Andrew Morton wrote:
> > -static inline unsigned long hweight_long(unsigned long w)
> > -{
> > -	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
> > -}
> > +#define hweight_long(x)				\
> > +(						\
> > +	__builtin_constant_p(x) ?		\
> > +		__builtin_popcountl(x) :	\
> > +		(sizeof(x) <= 4 ?		\
> > +			 hweight32(x) :		\
> > +			 hweight64(x))		\
> > +)
> >  
> >  /**
> >   * rol32 - rotate a 32-bit value left
> 
> Peter's been mucking with a compile-time HWEIGHT().  An outdated
> version of that is presently in linux-next.

Ah, I found it. Thanks.

> (I wonder if it could have used __builtin_popcount)
> 
> (I wonder which gcc versions support __builtin_popcount)

This is how Jamie Lokier first recommended it to me:

        Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
        constant if the argument is a compile-time constant, so can be
        used in BUILD_BUG_ON() and even for array sizes etc.

Is there an official lowest GCC version that Linux supports?

> Anyway, I suspect you should be using that rather than tweaking
> hweight_long().

OK.

But.. if ever the GCC builtins can be used:

        __builtin_popcount(unsigned int)
        __builtin_popcountl(unsigned long)
        __builtin_popcountll(unsigned long long)

it would be more natural to use hweight_long() or even introduce
hweight_longlong() for (transparent) compile time computation.
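
(In the same spirit as the hweight_long() change above, a
hweight_longlong() would presumably be just -- sketch only:

#define hweight_longlong(x)			\
(						\
	__builtin_constant_p(x) ?		\
		__builtin_popcountll(x) :	\
		hweight64(x)			\
)

i.e. constants still fold at compile time and only the non-constant
case goes through hweight64().)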

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 13:39     ` Wu Fengguang
@ 2010-02-03 15:08       ` Andrew Morton
  2010-02-03 15:15         ` Peter Zijlstra
  2010-02-03 17:10       ` H. Peter Anvin
  1 sibling, 1 reply; 126+ messages in thread
From: Andrew Morton @ 2010-02-03 15:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Peter Zijlstra, H. Peter Anvin

On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> Is there an official lowest GCC version that Linux supports?

Documentation/Changes says gcc-3.2.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:08       ` Andrew Morton
@ 2010-02-03 15:15         ` Peter Zijlstra
  2010-02-03 15:42           ` Andrew Morton
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-03 15:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, H. Peter Anvin

On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Is there an official lowest GCC version that Linux supports?
> 
> Documentation/Changes says gcc-3.2.

Anyway, it's not like that macro I used isn't straightforward. There's
not much a builtin can do better.

The only benefit the builtin has is possibly using machine popcnt
instructions, but we already deal with that in the kernel.




^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:15         ` Peter Zijlstra
@ 2010-02-03 15:42           ` Andrew Morton
  2010-02-03 15:47             ` Peter Zijlstra
  2010-02-03 18:14             ` Borislav Petkov
  0 siblings, 2 replies; 126+ messages in thread
From: Andrew Morton @ 2010-02-03 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, H. Peter Anvin

On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Is there an official lowest GCC version that Linux supports?
> > 
> > Documentation/Changes says gcc-3.2.
> 
> Anyway, its not like that macro I used isn't straight forward. Not much
> a builtin can do better.
> 
> The only benefit the builtin has is possibly use machine popcnt
> instructions but we already deal with that in the kernel.

We didn't deal with it on every architecture, which is something which
the compiler extension takes care of.

In fact I can't find anywhere where we dealt with it on x86.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:42           ` Andrew Morton
@ 2010-02-03 15:47             ` Peter Zijlstra
  2010-02-03 17:11               ` H. Peter Anvin
  2010-02-03 18:14             ` Borislav Petkov
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-03 15:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, H. Peter Anvin

On Wed, 2010-02-03 at 07:42 -0800, Andrew Morton wrote:
> On Wed, 03 Feb 2010 16:15:57 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, 2010-02-03 at 07:08 -0800, Andrew Morton wrote:
> > > On Wed, 3 Feb 2010 21:39:51 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Is there an official lowest GCC version that Linux supports?
> > > 
> > > Documentation/Changes says gcc-3.2.
> > 
> > Anyway, its not like that macro I used isn't straight forward. Not much
> > a builtin can do better.
> > 
> > The only benefit the builtin has is possibly use machine popcnt
> > instructions but we already deal with that in the kernel.
> 
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
> 
> In fact I can't find anywhere where we dealt with it on x86.

From what I know, the popcnt instruction on x86 is relatively new; hpa
was going to look at supporting it using alternatives.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 13:39     ` Wu Fengguang
  2010-02-03 15:08       ` Andrew Morton
@ 2010-02-03 17:10       ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-03 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Peter Zijlstra

On 02/03/2010 05:39 AM, Wu Fengguang wrote:
> 
>> (I wonder if it could have used __builtin_popcount)
>>
>> (I wonder which gcc versions support __builtin_popcount)
> 

I think 4.0 or later, but it often generates calls to libgcc, which
would mean adding libgcc functions to the kernel (which some arches
already do.)

> This is how Jamie Lokier first recommended it to me:
> 
>         Checked GCC 3.4.3 / 4.1 / 4.4: It expands as a compile-time
>         constant if the argument is a compile-time constant, so can be
>         used in BUILD_BUG_ON() and even for array sizes etc.
> 
> Is there an official lowest GCC version that Linux supports?

3.4, I believe.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:47             ` Peter Zijlstra
@ 2010-02-03 17:11               ` H. Peter Anvin
  0 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-03 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar

On 02/03/2010 07:47 AM, Peter Zijlstra wrote:
> 
> From what I know the popcnt ins on x86 is relatively new, hpa was going
> to look at supporting it using alternatives.
> 

And kernel-specific stuff like alternatives is pretty much the one thing
that gcc *can't* do for us.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 15:42           ` Andrew Morton
  2010-02-03 15:47             ` Peter Zijlstra
@ 2010-02-03 18:14             ` Borislav Petkov
  2010-02-03 18:47               ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-03 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, H. Peter Anvin

On Wed, Feb 03, 2010 at 07:42:51AM -0800, Andrew Morton wrote:
> We didn't deal with it on every architecture, which is something which
> the compiler extension takes care of.
> 
> In fact I can't find anywhere where we dealt with it on x86.

Yeah, we talked briefly about using hardware popcnt, see thread
beginning at

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-06/msg00245.html

for example. I did an ftrace of the cpumask_weight() calls in sched.c to
see whether there would be a measurable performance gain, but it didn't
seem so at the time. My numbers said something like ca. 170 hweight
calls per second, and since the <lib/hweight.c> implementations roughly
translate to something like ~20 insns (hweight64 to about ~30), the whole
thing wasn't worth the trouble of checking binutils versions and
slapping in opcodes, or of using gcc intrinsics, which involves gcc
version checking.

An alternatives solution based on the CPUID flag could add the
popcnt opcode without checking any toolchain versions, but what is the
replacement instruction going to look like? Something like

alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

by making sure the arg is in some register first?

Hmm..

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 18:14             ` Borislav Petkov
@ 2010-02-03 18:47               ` Peter Zijlstra
  2010-02-03 19:49                 ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-03 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, H. Peter Anvin

On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:

> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)

Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
and hweight64() needs a bit of magic for 32bit, but yes, something like
that ought to work nicely.
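
That is, roughly (sketch; __arch_hweight32() standing in for whatever
the popcnt-backed 32-bit primitive ends up being called):

static inline unsigned int arch_hweight8(unsigned int w)
{
	/* mask first so the popcnt only ever sees the low byte */
	return __arch_hweight32(w & 0xff);
}

#ifdef CONFIG_X86_32
static inline unsigned long arch_hweight64(u64 w)
{
	/* no 64-bit popcnt on a 32-bit kernel: do it in two halves */
	return __arch_hweight32((u32)w) + __arch_hweight32((u32)(w >> 32));
}
#endif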


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 18:47               ` Peter Zijlstra
@ 2010-02-03 19:49                 ` H. Peter Anvin
  2010-02-04 15:10                   ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-03 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> 
>> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> 
> Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> and hweight64() needs a bit of magic for 32bit, but yes, something like
> that ought to work nicely.
> 

Arguably the "best" option is to have the alternative being a jump to an
out-of-line stub which does the necessary parameter marshalling before
calling a stub.  This technique is already used in a few other places.
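
I.e. the inline side stays a bare alternative with no clobber list,
because the out-of-line stub saves/restores everything the C fallback
would clobber.  A rough 64-bit sketch (POPCNT32_DI_AX and
__sw_hweight32_stub are made-up names):

#include <asm/cpufeature.h>
#include <asm/alternative.h>

/* popcnt %edi, %eax */
#define POPCNT32_DI_AX ".byte 0xf3, 0x0f, 0xb8, 0xc7"

static __always_inline unsigned int arch_hweight32(unsigned int w)
{
	unsigned int res;

	/* the stub preserves every caller-clobbered register around the
	 * C fallback; argument already in %edi, result comes back in %eax */
	asm (ALTERNATIVE("call __sw_hweight32_stub", POPCNT32_DI_AX,
			 X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "D" (w));

	return res;
}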

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-03 19:49                 ` H. Peter Anvin
@ 2010-02-04 15:10                   ` Borislav Petkov
  2010-02-04 15:13                     ` Peter Zijlstra
                                       ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-04 15:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> > 
> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> > 
> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > that ought to work nicely.
> > 
> 
> Arguably the "best" option is to have the alternative being a jump to an
> out-of-line stub which does the necessary parameter marshalling before
> calling a stub.  This technique is already used in a few other places.

Ok, here's a first alpha prototype and completely untested. The asm
output looks ok though. I've added separate 32-bit and 64-bit helpers in
order to dispense with the if-else tests. The hw-popcnt versions are the
opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
so %rAX has to be preloaded with the bitmask and the computed value
has to be retrieved from there afterwards. And yes, it doesn't look that
elegant, so I'm open to suggestions.

The good thing is, this should work on any toolchain since we don't rely
on the compiler to know about popcnt and we're protected by CPUID flag
so that the hw-popcnt version is used only on processors which support
it.

Please take a good look and let me know what do you guys think.

Thanks.

--
 arch/x86/include/asm/bitops.h |    4 ++
 arch/x86/lib/Makefile         |    2 +-
 arch/x86/lib/popcnt.c         |   62 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/popcnt.c

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..deb5013 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -434,6 +434,10 @@ static inline int fls(int x)
 #endif
 	return r + 1;
 }
+
+
+extern int arch_hweight_long(unsigned long);
+
 #endif /* __KERNEL__ */
 
 #undef ADDR
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..c03fe2d 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
new file mode 100644
index 0000000..179a6e8
--- /dev/null
+++ b/arch/x86/lib/popcnt.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+
+int _hweight32(void)
+{
+	unsigned long w;
+
+	asm volatile("" : "=a" (w));
+
+	return hweight32(w);
+}
+
+int _hweight64(void)
+{
+	unsigned long w;
+
+	asm volatile("" : "=a" (w));
+
+	return hweight64(w);
+}
+
+int _popcnt32(void)
+{
+
+	unsigned long w;
+
+	asm volatile(".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0\n\t"
+			: "=a" (w));
+
+	return w;
+}
+
+int _popcnt64(void)
+{
+
+	unsigned long w;
+
+	asm volatile(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t."
+		     "byte 0xb8\n\t.byte 0xc0\n\t"
+		     : "=a" (w));
+
+	return w;
+}
+
+int arch_hweight_long(unsigned long w)
+{
+	if (sizeof(w) == 4) {
+		asm volatile("movl %[w], %%eax" :: [w] "r" (w));
+		alternative("call _hweight32",
+			    "call _popcnt32",
+			    X86_FEATURE_POPCNT);
+		asm volatile("" : "=a" (w));
+
+	} else {
+		asm volatile("movq %[w], %%rax" :: [w] "r" (w));
+		alternative("call _hweight64",
+			    "call _popcnt64",
+			    X86_FEATURE_POPCNT);
+		asm volatile("" : "=a" (w));
+	}
+	return w;
+}
-- 
1.6.6

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
@ 2010-02-04 15:13                     ` Peter Zijlstra
  2010-02-04 15:54                       ` Borislav Petkov
  2010-02-04 15:16                     ` H. Peter Anvin
  2010-02-04 15:39                       ` Brian Gerst
  2 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-04 15:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On Thu, 2010-02-04 at 16:10 +0100, Borislav Petkov wrote:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
> > On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
> > > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
> > > 
> > >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
> > > 
> > > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
> > > and hweight64() needs a bit of magic for 32bit, but yes, something like
> > > that ought to work nicely.
> > > 
> > 
> > Arguably the "best" option is to have the alternative being a jump to an
> > out-of-line stub which does the necessary parameter marshalling before
> > calling a stub.  This technique is already used in a few other places.
> 
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
> 
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
> 
> Please take a good look and let me know what do you guys think.

> +int arch_hweight_long(unsigned long w)
> +{
> +	if (sizeof(w) == 4) {
> +		asm volatile("movl %[w], %%eax" :: [w] "r" (w));
> +		alternative("call _hweight32",
> +			    "call _popcnt32",
> +			    X86_FEATURE_POPCNT);
> +		asm volatile("" : "=a" (w));
> +
> +	} else {
> +		asm volatile("movq %[w], %%rax" :: [w] "r" (w));
> +		alternative("call _hweight64",
> +			    "call _popcnt64",
> +			    X86_FEATURE_POPCNT);
> +		asm volatile("" : "=a" (w));
> +	}
> +	return w;
> +}

hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
are.



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
  2010-02-04 15:13                     ` Peter Zijlstra
@ 2010-02-04 15:16                     ` H. Peter Anvin
  2010-02-04 15:39                       ` Brian Gerst
  2 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-04 15:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On 02/04/2010 07:10 AM, Borislav Petkov wrote:
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub.  This technique is already used in a few other places.
> 
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
> 
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
> 
> Please take a good look and let me know what do you guys think.
> 

I think you misunderstood me.  The idea was to use the alternatives
mechanism to put the popcnt instruction inline instead of the call.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:10                   ` Borislav Petkov
@ 2010-02-04 15:39                       ` Brian Gerst
  2010-02-04 15:16                     ` H. Peter Anvin
  2010-02-04 15:39                       ` Brian Gerst
  2 siblings, 0 replies; 126+ messages in thread
From: Brian Gerst @ 2010-02-04 15:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar

2010/2/4 Borislav Petkov <bp@amd64.org>:
> On Wed, Feb 03, 2010 at 11:49:54AM -0800, H. Peter Anvin wrote:
>> On 02/03/2010 10:47 AM, Peter Zijlstra wrote:
>> > On Wed, 2010-02-03 at 19:14 +0100, Borislav Petkov wrote:
>> >
>> >> alternative("call hweightXX", "popcnt", X86_FEATURE_POPCNT)
>> >
>> > Make sure to apply a 0xff bitmask to the popcnt r16 call for hweight8(),
>> > and hweight64() needs a bit of magic for 32bit, but yes, something like
>> > that ought to work nicely.
>> >
>>
>> Arguably the "best" option is to have the alternative being a jump to an
>> out-of-line stub which does the necessary parameter marshalling before
>> calling a stub.  This technique is already used in a few other places.
>
> Ok, here's a first alpha prototype and completely untested. The asm
> output looks ok though. I've added separate 32-bit and 64-bit helpers in
> order to dispense with the if-else tests. The hw-popcnt versions are the
> opcodes for "popcnt %eax, %eax" and "popcnt %rax, %rax", respectively,
> so %rAX has to be preloaded with the bitmask and the computed value
> has to be retrieved from there afterwards. And yes, it looks not that
> elegant so I'm open for suggestions.
>
> The good thing is, this should work on any toolchain since we don't rely
> on the compiler to know about popcnt and we're protected by CPUID flag
> so that the hw-popcnt version is used only on processors which support
> it.
>
> Please take a good look and let me know what do you guys think.
>
> Thanks.
>
> --
>  arch/x86/include/asm/bitops.h |    4 ++
>  arch/x86/lib/Makefile         |    2 +-
>  arch/x86/lib/popcnt.c         |   62 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 67 insertions(+), 1 deletions(-)
>  create mode 100644 arch/x86/lib/popcnt.c
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 02b47a6..deb5013 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -434,6 +434,10 @@ static inline int fls(int x)
>  #endif
>        return r + 1;
>  }
> +
> +
> +extern int arch_hweight_long(unsigned long);
> +
>  #endif /* __KERNEL__ */
>
>  #undef ADDR
> diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
> index cffd754..c03fe2d 100644
> --- a/arch/x86/lib/Makefile
> +++ b/arch/x86/lib/Makefile
> @@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
>  lib-y += memcpy_$(BITS).o
>  lib-$(CONFIG_KPROBES) += insn.o inat.o
>
> -obj-y += msr.o msr-reg.o msr-reg-export.o
> +obj-y += msr.o msr-reg.o msr-reg-export.o popcnt.o
>
>  ifeq ($(CONFIG_X86_32),y)
>         obj-y += atomic64_32.o
> diff --git a/arch/x86/lib/popcnt.c b/arch/x86/lib/popcnt.c
> new file mode 100644
> index 0000000..179a6e8
> --- /dev/null
> +++ b/arch/x86/lib/popcnt.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/bitops.h>
> +
> +int _hweight32(void)
> +{
> +       unsigned long w;
> +
> +       asm volatile("" : "=a" (w));
> +
> +       return hweight32(w);
> +

You should just use a normal parameter here (use %rdi for 64-bit).
The way you have it written isn't guaranteed to work.  GCC could
clobber %eax/%rax before you read it, although it is highly unlikely.
For a good example of how this should be written, look at
__mutex_fastpath_lock().
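
I.e. something along these lines (sketch, 32-bit case; with the
kernel's regparm(3) the first argument and the return value are both in
%eax, which is also where the popcnt replacement wants them):

#include <asm/cpufeature.h>
#include <asm/alternative.h>

#define POPCNT32 ".byte 0xf3, 0x0f, 0xb8, 0xc0"	/* popcnt %eax, %eax */

/* the helper takes a normal C parameter ... */
unsigned int _hweight32(unsigned int w)
{
	return hweight32(w);	/* existing software implementation */
}

/* ... and the caller-side asm pins the argument to %eax via a
 * constraint instead of hoping the register survives */
static inline unsigned int arch_hweight32(unsigned int w)
{
	unsigned int res;

	asm (ALTERNATIVE("call _hweight32", POPCNT32, X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "a" (w)
	     : "ecx", "edx");

	return res;
}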

--
Brian Gerst

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:13                     ` Peter Zijlstra
@ 2010-02-04 15:54                       ` Borislav Petkov
  2010-02-04 16:04                         ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-04 15:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> are.

Yeah, I'm still looking for the proper location. hweight_long() is the
generic version so do we want to do the

#ifndef __HAVE_ARCH_POPCNT
static inline unsigned long hweight_long(unsigned long w)
...

#endif

thing and define an x86-specific version?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 15:54                       ` Borislav Petkov
@ 2010-02-04 16:04                         ` Peter Zijlstra
  2010-02-05 12:11                           ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-04 16:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar

On Thu, 2010-02-04 at 16:54 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 04:13:52PM +0100, Peter Zijlstra wrote:
> > hweight_long() isn't an arch primitive, only __arch_hweight{8,16,32,64}
> > are.
> 
> Yeah, I'm still looking for the proper location. hweight_long() is the
> generic version so do we want to do the
> 
> #ifndef __HAVE_ARCH_POPCNT
> static inline unsigned long hweight_long(unsigned long w)
> ....
> 
> #endif
> 
> thing and define a x86-specific version? 

No, just don't touch hweight_long(), simply provide
__arch_hweight{8,16,32,64} and all will be well.

hweight_long() is provided by <linux/bitops.h> and is constructed from
hweight32() and hweight64() depending on the actual word size.
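
I.e. the generic part stays exactly what it is today:

static inline unsigned long hweight_long(unsigned long w)
{
	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}

and only hweight32()/hweight64() underneath pick up the per-arch
__arch_hweight*() overrides.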


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-04 16:04                         ` Peter Zijlstra
@ 2010-02-05 12:11                           ` Borislav Petkov
  2010-02-05 12:14                             ` Peter Zijlstra
  2010-02-05 21:54                             ` H. Peter Anvin
  0 siblings, 2 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-05 12:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> No, just don't touch hweight_long(), simply provide
> __arch_hweight{8,16,32,64} and all will be well.

Ok, another day, another version :)

It is, of course, completely untested but it builds and the asm looks
ok. I think I've addressed all concerns so far.

--
 arch/x86/include/asm/hweight.h       |   14 ++++++++
 arch/x86/lib/Makefile                |    2 +-
 arch/x86/lib/hweight.c               |   57 ++++++++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h |   45 ++++++++++++++++++++++++--
 lib/hweight.c                        |   16 +++++-----
 5 files changed, 121 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/hweight.h
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/include/asm/hweight.h b/arch/x86/include/asm/hweight.h
new file mode 100644
index 0000000..762125f
--- /dev/null
+++ b/arch/x86/include/asm/hweight.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight64 __arch_hweight64
+
+extern unsigned int __arch_hweight8(unsigned int);
+extern unsigned int __arch_hweight16(unsigned int);
+extern unsigned int __arch_hweight32(unsigned int);
+extern unsigned long __arch_hweight64(__u64);
+
+#endif /*_ASM_X86_HWEIGHT_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..3cf51c8
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,57 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define __arch_hweight_alt(size)	\
+	ALTERNATIVE("call _hweight"#size, POPCNT##size, X86_FEATURE_POPCNT)
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile("xor %%dh, %%dh\n\t"
+		     __arch_hweight_alt(32)
+			: "=di" (res)
+			: "di" (w)
+			: "ecx", "memory");
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight16(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile(__arch_hweight_alt(32)
+			: "=di" (res)
+			: "di" (w)
+			: "ecx", "memory");
+
+	return res;
+
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned int res = 0;
+
+	asm volatile(__arch_hweight_alt(64)
+			: "=di" (res)
+			: "di" (w)
+			: "rsi", "rcx", "r8", "r9", "r10", "r11",
+			  "memory");
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..340ad4e 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -2,10 +2,47 @@
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
 #include <asm/types.h>
+#include <asm/hweight.h>
 
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+extern unsigned int _hweight32(unsigned int w);
+extern unsigned int _hweight16(unsigned int w);
+extern unsigned int _hweight8(unsigned int w);
+extern unsigned long _hweight64(__u64 w);
+
+static inline unsigned int hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+	return __arch_hweight8(w);
+#else
+	return _hweight8(w);
+#endif
+}
+
+static inline unsigned int hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+	return __arch_hweight16(w);
+#else
+	return _hweight16(w);
+#endif
+}
+
+static inline unsigned int hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+	return __arch_hweight32(w);
+#else
+	return _hweight32(w);
+#endif
+}
+
+static inline unsigned long hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+	return __arch_hweight64(w);
+#else
+	return _hweight64(w);
+#endif
+}
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/hweight.c b/lib/hweight.c
index 389424e..f7b81a1 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int _hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,26 +17,26 @@ unsigned int hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int _hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int _hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long _hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
 	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
@@ -56,4 +56,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(_hweight64);
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 12:11                           ` Borislav Petkov
@ 2010-02-05 12:14                             ` Peter Zijlstra
  2010-02-05 21:54                             ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-05 12:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Fri, 2010-02-05 at 13:11 +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2010 at 05:04:17PM +0100, Peter Zijlstra wrote:
> > No, just don't touch hweight_long(), simply provide
> > __arch_hweight{8,16,32,64} and all will be well.
> 
> Ok, another day, another version :)
> 
> It is, of course, completely untested but it builds and the asm looks
> ok. I think I've addressed all concerns so far.

please work against a tree that has:

  http://lkml.org/lkml/2010/2/4/119

in.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 12:11                           ` Borislav Petkov
  2010-02-05 12:14                             ` Peter Zijlstra
@ 2010-02-05 21:54                             ` H. Peter Anvin
  2010-02-06  9:36                               ` Borislav Petkov
  1 sibling, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-05 21:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm volatile("xor %%dh, %%dh\n\t"
> +		     __arch_hweight_alt(32)
> +			: "=di" (res)
> +			: "di" (w)
> +			: "ecx", "memory");
> +

This is wrong in more ways than I can shake a stick at.

a) "di" doesn't mean the DI register - it means the DX register (d) or
an immediate (i).  Since you don't have any reference to either %0 or %1
in your code, you have no way of knowing which one it is.  The
constraint for the di register is "D".

b) On 32 bits, the first argument register is in %eax (with %edx used
for the upper half of a 32-bit argument), but on 64 bits, the first
argument is in %rdi, with the return still in %rax.

c) You call a C function, but you don't clobber the set of registers
that a C function would clobber.  You either need to put the function in
an assembly wrapper (which is better in the long run), or clobber the
full set of registers that is clobbered by a C function (which is better
in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

d) On the other hand, you do *not* need a "memory" clobber.
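
Putting (a)-(d) together for the 64-bit case gives roughly the following
(sketch only; _hweight64 is the C fallback from your patch, while
POPCNT64_DI_AX is a made-up name for a popcnt that lands the result in
%rax; note that %rdi has to be an in/out operand rather than a plain
input, since the called C function may clobber it):

#include <asm/cpufeature.h>
#include <asm/alternative.h>

/* popcnt %rdi, %rax */
#define POPCNT64_DI_AX ".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc7"

static inline unsigned long __arch_hweight64(__u64 w)
{
	unsigned long res;

	asm (ALTERNATIVE("call _hweight64", POPCNT64_DI_AX,
			 X86_FEATURE_POPCNT)
	     : "=a" (res), "+D" (w)
	     :
	     : "rsi", "rdx", "rcx", "r8", "r9", "r10", "r11");

	return res;
}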

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-05 21:54                             ` H. Peter Anvin
@ 2010-02-06  9:36                               ` Borislav Petkov
  2010-02-07  1:55                                 ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-06  9:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Fri, Feb 05, 2010 at 01:54:42PM -0800, H. Peter Anvin wrote:
> On 02/05/2010 04:11 AM, Borislav Petkov wrote:
> > +
> > +unsigned int __arch_hweight16(unsigned int w)
> > +{
> > +	unsigned int res = 0;
> > +
> > +	asm volatile("xor %%dh, %%dh\n\t"
> > +		     __arch_hweight_alt(32)
> > +			: "=di" (res)
> > +			: "di" (w)
> > +			: "ecx", "memory");
> > +
> 
> This is wrong in more ways than I can shake a stick at.

Thanks for reviewing it though - how else would I learn :).

> a) "di" doesn't mean the DI register - it means the DX register (d) or
> an immediate (i).  Since you don't have any reference to either %0 or %1
> in your code, you have no way of knowing which one it is.  The
> constraint for the di register is "D".

right.

> b) On 32 bits, the first argument register is in %eax (with %edx used
> for the upper half of a 32-bit argument), but on 64 bits, the first
> argument is in %rdi, with the return still in %rax.

Sure, it is right there in arch/x86/include/asm/calling.h. Shame on me.

> c) You call a C function, but you don't clobber the set of registers
> that a C function would clobber.  You either need to put the function in
> an assembly wrapper (which is better in the long run), or clobber the
> full set of registers that is clobbered by a C function (which is better
> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.

I think you mean rsi instead of esi here.

Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
the full set of clobbered registers. Please elaborate on the assembly
wrapper for the function - wouldn't I need to list all the clobbered
registers there too, or am I missing something?

> d) On the other hand, you do *not* need a "memory" clobber.

Right, in this case we have all non-barrier-like inlines, so no memory
clobber, according to the comment above the alternative() macro.

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-06  9:36                               ` Borislav Petkov
@ 2010-02-07  1:55                                 ` H. Peter Anvin
  2010-02-08  9:28                                   ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-07  1:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Andrew Morton, Wu Fengguang, LKML, Jamie Lokier,
	Roland Dreier, Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/06/2010 01:36 AM, Borislav Petkov wrote:
> 
>> c) You call a C function, but you don't clobber the set of registers
>> that a C function would clobber.  You either need to put the function in
>> an assembly wrapper (which is better in the long run), or clobber the
>> full set of registers that is clobbered by a C function (which is better
>> in the short term) -- which is eax, edx, ecx on 32 bits, but rax, rdi,
>> esi, rdx, rcx, r8, r9, r10, r11 on 64 bits.
> 
> I think you mean rsi instead of esi here.
> 
> Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> the full set of clobbered registers. Please elaborate on the assembly
> wrapper for the function, wouldn't I need to list all the clobbered
> registers there too or am I missing something?
> 

The notion there would be that you do push/pop in the assembly wrapper.

>> d) On the other hand, you do *not* need a "memory" clobber.
> 
> Right, in this case we have all non-barrier like inlines so no memory
> clobber, according to the comment above alternative() macro.

OK, I'm missing something here.

A few more notions:

a. This is exactly the kind of code where you don't want to put
"volatile" on your asm statement, because it's a pure compute.

b. It is really rather pointless to go through the whole alternatives
work if you are then going to put it inside a function which isn't an
inline ...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-07  1:55                                 ` H. Peter Anvin
@ 2010-02-08  9:28                                   ` Borislav Petkov
  2010-02-08  9:35                                       ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-08  9:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Sat, Feb 06, 2010 at 05:55:47PM -0800, H. Peter Anvin wrote:
> > Well, the example Brian pointed me to - __mutex_fastpath_lock - lists
> > the full set of clobbered registers. Please elaborate on the assembly
> > wrapper for the function, wouldn't I need to list all the clobbered
> > registers there too or am I missing something?
> > 
> 
> The notion there would be that you do push/pop in the assembly wrapper.

Oh yes, something similar to SAVE/RESTORE_ALL in
<arch/x86/kernel/entry_32.S> could work. Good idea!

> >> d) On the other hand, you do *not* need a "memory" clobber.
> > 
> > Right, in this case we have all non-barrier like inlines so no memory
> > clobber, according to the comment above alternative() macro.
> 
> OK, I'm missing something here.
> 
> A few more notions:
> 
> a. This is exactly the kind of code where you don't want to put
> "volatile" on your asm statement, because it's a pure compute.
> 
> b. It is really rather pointless to go through the whole alternatives
> work if you are then going to put it inside a function which isn't an
> inline ...

Well, in the second version I did replace a 'call _hweightXX' with
the actual popcnt opcode, so the alternative is only needed to do the
replacement during boot. We might just as well do

if (X86_FEATURE_POPCNT)
	__hw_popcnt()
else
	__software_hweight()

The only advantage of the alternatives is that it would save us the
if-else test above each time we do cpumask_weight. However, the if-else
approach is much more readable and obviates the need for all that macro
magic and for taking special care when calling a C function from within
asm. And since we do not call cpumask_weight all that often, I'll
honestly opt for the alternative-less solution...
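
Which, spelled out, would be no more than (sketch; both helper names are
made up, as in the pseudo-code above):

#include <asm/cpufeature.h>

static inline unsigned int hweight32_cpu(unsigned int w)
{
	if (boot_cpu_has(X86_FEATURE_POPCNT))
		return __hw_popcnt32(w);	/* raw popcnt wrapper */

	return __software_hweight32(w);		/* the lib/hweight.c version */
}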

Hmm...

Thanks,
Boris.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:28                                   ` Borislav Petkov
@ 2010-02-08  9:35                                       ` H. Peter Anvin
  0 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-08  9:35 UTC (permalink / raw)
  To: Borislav Petkov, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/08/2010 01:28 AM, Borislav Petkov wrote:
>
> Well, in the second version I did replace a 'call _hweightXX' with
> the actual popcnt opcode so the alternatives is only needed to do the
> replacement during boot. We might just as well do
>
> if (X86_FEATURE_POPCNT)
> 	__hw_popcnt()
> else
> 	__software_hweight()
>
> The only advantage of the alternatives is that it would save us the
> if-else test above each time we do cpumask_weight. However, the if-else
> approach is much more readable and obviates the need for all that macro
> magic and taking special care of calling c function from within asm. And
> since we do not call cpumask_weight all that often I'll honestly opt for
> alternative-less solution...
>

The highest performance will be gotten by alternatives, but it only makes
sense if they are inlined at the point of use... otherwise it's 
basically pointless.

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:35                                       ` H. Peter Anvin
  (?)
@ 2010-02-08  9:59                                       ` Borislav Petkov
  2010-02-11 17:24                                         ` Borislav Petkov
  -1 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-08  9:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Mon, Feb 08, 2010 at 01:35:41AM -0800, H. Peter Anvin wrote:
> On 02/08/2010 01:28 AM, Borislav Petkov wrote:
> >
> >Well, in the second version I did replace a 'call _hweightXX' with
> >the actual popcnt opcode so the alternatives is only needed to do the
> >replacement during boot. We might just as well do
> >
> >if (X86_FEATURE_POPCNT)
> >	__hw_popcnt()
> >else
> >	__software_hweight()
> >
> >The only advantage of the alternatives is that it would save us the
> >if-else test above each time we do cpumask_weight. However, the if-else
> >approach is much more readable and obviates the need for all that macro
> >magic and taking special care of calling c function from within asm. And
> >since we do not call cpumask_weight all that often I'll honestly opt for
> >alternative-less solution...
> >
> 
> The highest performance will be gotten by alternatives, but it only
> make sense if they are inlined at the point of use... otherwise it's
> basically pointless.

The popcnt-replacement part of the alternative would be as fast as
possible since we're adding the opcode there, but the slow version would
add the additional overhead of saving/restoring the registers before
calling the software hweight implementation. I'll do some tracing to see
what a change like that would cost on machines which don't have popcnt.

Let me prep another version when I get back on Wed. (currently
travelling) with all the stuff we discussed to see how it would turn.

Thanks,
Boris.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-08  9:59                                       ` Borislav Petkov
@ 2010-02-11 17:24                                         ` Borislav Petkov
  2010-02-11 17:33                                           ` H. Peter Anvin
  2010-02-14 10:12                                           ` Peter Zijlstra
  0 siblings, 2 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-11 17:24 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> Let me prep another version when I get back on Wed. (currently
> travelling) with all the stuff we discussed to see how it would turn.

Ok, here's another version on top of PeterZ's patch at
http://lkml.org/lkml/2010/2/4/119. I need to handle 32-bit and 64-bit
differently wrt the popcnt opcode, so on 32-bit I do "popcnt %eax, %eax"
while on 64-bit I do "popcnt %rdi, %rdi".

I also did some rudimentary tracing with the function graph tracer of
all the cpumask_weight() calls in <kernel/sched.c> while doing a kernel
compile, and the preliminary results show that the software hweight takes
about 9.768 usecs at the longest, while the hardware popcnt takes about
8.515 usecs. The machine is a Fam10 revB2 quadcore.

What remains to be done is to see whether the saving/restoring of
callee-clobbered regs with this patch has any noticeable negative
effects on the software hweight case on machines which don't support
popcnt. Also, I'm open to better tracing ideas :).

Thanks.

---
 arch/alpha/include/asm/bitops.h            |    4 +
 arch/ia64/include/asm/bitops.h             |    1 +
 arch/sparc/include/asm/bitops_64.h         |    4 +
 arch/x86/include/asm/bitops.h              |   10 +++
 arch/x86/lib/Makefile                      |    2 +-
 arch/x86/lib/hweight.c                     |   96 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h  |    8 +-
 include/asm-generic/bitops/const_hweight.h |   43 +++++++++++-
 lib/hweight.c                              |   20 +++---
 9 files changed, 169 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_weight32(unsigned int w)
 {
 	return __arch_hweight64(w);
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	return __arch_hweight64(w & 0xffff);
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	return __arch_hweight64(w & 0xff);
 }
+#define __arch_hweight8 __arch_hweight8
 #else
 #include <asm-generic/bitops/arch_hweight.h>
 #endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
 	result = ia64_popcnt(x);
 	return result;
 }
+#define __arch_hweight64 __arch_hweight64
 
 #define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
 #define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
 	return res;
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_hweight32(unsigned int w)
 {
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
 	return res;
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
 	return res;
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
 	return res;
 }
+#define __arch_hweight8 __arch_hweight8
 
 #else
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
 #include <asm-generic/bitops/hweight.h>
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..4a1c5c9
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,96 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+/* popcnt %eax, %eax */
+#define POPCNT32 ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+
+/* popcnt %rdi, %rdi */
+#define POPCNT64 ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xff"
+
+#define PUSH_CLOBBERED_32	\
+	"pushl %%edx\n\t"	\
+	"pushl %%ecx\n\t"
+
+#define POP_CLOBBERED_32	\
+	"\n\t"			\
+	"popl %%ecx\n\t"	\
+	"popl %%edx\n\t"
+
+#define PUSH_CLOBBERED_64	\
+	"pushq %%rax\n\t"	\
+	"pushq %%rsi\n\t"	\
+	"pushq %%rdx\n\t"	\
+	"pushq %%rcx\n\t"	\
+	"pushq %%r8\n\t"	\
+	"pushq %%r9\n\t"	\
+	"pushq %%r10\n\t"	\
+	"pushq %%r11\n\t"
+
+#define POP_CLOBBERED_64	\
+	"\n\t"			\
+	"popq %%r11\n\t"	\
+	"popq %%r10\n\t"	\
+	"popq %%r9\n\t"		\
+	"popq %%r8\n\t"		\
+	"popq %%rcx\n\t"	\
+	"popq %%rdx\n\t"	\
+	"popq %%rsi\n\t"	\
+	"popq %%rax\n\t"
+
+#ifdef CONFIG_64BIT
+#define PUSH_CLOBBERED PUSH_CLOBBERED_64
+#define POP_CLOBBERED  POP_CLOBBERED_64
+#define POPCNT POPCNT64
+#define ARG0 "D"
+#else
+#define PUSH_CLOBBERED PUSH_CLOBBERED_32
+#define POP_CLOBBERED  POP_CLOBBERED_32
+#define POPCNT POPCNT32
+#define ARG0 "a"
+#endif
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm volatile(PUSH_CLOBBERED
+		     ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     POP_CLOBBERED
+		     : "="ARG0 (res)
+		     : ARG0 (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm volatile(PUSH_CLOBBERED
+		     ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		      POP_CLOBBERED
+		     : "="ARG0 (res)
+		     : ARG0 (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
 #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
 #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
 
+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+       return __arch_hweight8(w);
+#else
+       return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+       return __arch_hweight16(w);
+#else
+       return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+       return __arch_hweight32(w);
+#else
+       return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+       return __arch_hweight64(w);
+#else
+       return __sw_hweight64(w);
+#endif
+}
 /*
  * Generic interface.
  */
-#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))
 
 /*
  * Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
-- 
1.6.6.1


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:24                                         ` Borislav Petkov
@ 2010-02-11 17:33                                           ` H. Peter Anvin
  2010-02-12 17:06                                             ` Borislav Petkov
  2010-02-14 10:12                                           ` Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-11 17:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On 02/11/2010 09:24 AM, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
>> Let me prep another version when I get back on Wed. (currently
>> travelling) with all the stuff we discussed to see how it would turn.
> 
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

On 64 bits it should be "popcnt %rdi, %rax".

> I also did some rudimentary tracing with the function graph tracer of
> all the cpumask_weight-calls in <kernel/sched.c> while doing a kernel
> compile, and the preliminary results show that the software hweight takes
> about 9.768 usecs at the longest, while the hardware popcnt takes about
> 8.515 usecs. The machine is a Fam10 revB2 quadcore.
> 
> What remains to be done is to see whether the saving/restoring of
> callee-clobbered regs with this patch has any noticeable negative
> effects on the software hweight case on machines which don't support
> popcnt. Also, I'm open to better tracing ideas :).
> 
> +	asm volatile(PUSH_CLOBBERED
> +		     ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> +		      POP_CLOBBERED
> +		     : "="ARG0 (res)
> +		     : ARG0 (w));


Sorry, no.

You don't do the push/pop inline -- if you're going to take the hit of
pushing this into the caller, it's better to list them as explicit
clobbers and let the compiler figure out how to do it.  The point of
doing an explicit push/pop is that it can be pushed into the out-of-line
subroutine.

Furthermore, you're still putting "volatile" on there... this is a pure
computation -- no side effects -- so it is exactly when you *shouldn't*
declare your asm statement volatile.

Note: given how simple and regular a popcnt actually is, it might be
preferable to have the out-of-line implementation either in assembly,
or using gcc's -fcall-saved-* options to reduce the number of registers
that are clobbered by the routine.
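
For the explicit-clobbers variant, the 32-bit flavour would look roughly
like this (a sketch only; the clobber list is what the 32-bit calling
convention suggests, and POPCNT32 refers to the opcode byte-sequence
macro from the patch above):

static inline unsigned int __arch_hweight32(unsigned int w)
{
	unsigned int res;

	/* pure computation, so no "volatile" */
	asm (ALTERNATIVE("call __sw_hweight32", POPCNT32, X86_FEATURE_POPCNT)
	     : "=a" (res)
	     : "a" (w)
	     /* whatever the out-of-line __sw_hweight32 may clobber */
	     : "ecx", "edx", "cc");

	return res;
}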

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:33                                           ` H. Peter Anvin
@ 2010-02-12 17:06                                             ` Borislav Petkov
  2010-02-12 17:28                                               ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-12 17:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> You don't do the push/pop inline -- if you're going to take the hit of
> pushing this into the caller, it's better to list them as explicit
> clobbers and let the compiler figure out how to do it.  The point of
> doing an explicit push/pop is that it can be pushed into the out-of-line
> subroutine.

Doh! Of course, this was wrong. See below.

> Furthermore, you're still putting "volatile" on there... this is a pure
> computation -- no side effects -- so it is exactly when you *shouldn't*
> declare your asm statement volatile.

Also gone.

> Note: given how simple and regular a popcnt actually is, it might be
> preferable to have the out-of-line implementation either in assembly,
> or using gcc's -fcall-saved-* options to reduce the number of registers
> that are clobbered by the routine.

Yep, this is actually a very good idea. And it seems to work, I've
tested it on an F10, K8 and even on an Atom and no blow ups so far...
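
A quick cross-check of the two implementations could also be wired up as
an initcall, something like the sketch below (purely illustrative, not
part of the patch; it assumes the __arch_/__sw_ prototypes are visible
via <linux/bitops.h>):

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/bitops.h>

static int __init hweight_selftest(void)
{
	u64 w = 0x0123456789abcdefULL;
	int i;

	for (i = 0; i < 64; i++) {
		WARN_ON(__arch_hweight64(w) != __sw_hweight64(w));
		WARN_ON(__arch_hweight32((u32)w) != __sw_hweight32((u32)w));
		/* simple LCG just to vary the bit patterns */
		w = w * 6364136223846793005ULL + 1442695040888963407ULL;
	}

	return 0;
}
late_initcall(hweight_selftest);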

---
 arch/alpha/include/asm/bitops.h            |    4 ++
 arch/ia64/include/asm/bitops.h             |    1 +
 arch/sparc/include/asm/bitops_64.h         |    4 ++
 arch/x86/include/asm/bitops.h              |   10 ++++
 arch/x86/lib/Makefile                      |    8 +++-
 arch/x86/lib/hweight.c                     |   72 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h  |    8 ++--
 include/asm-generic/bitops/const_hweight.h |   43 +++++++++++++++--
 lib/hweight.c                              |   20 ++++----
 9 files changed, 151 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 296da1d..7fe6943 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -409,21 +409,25 @@ static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_weight32(unsigned int w)
 {
 	return __arch_hweight64(w);
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	return __arch_hweight64(w & 0xffff);
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	return __arch_hweight64(w & 0xff);
 }
+#define __arch_hweight8 __arch_hweight8
 #else
 #include <asm-generic/bitops/arch_hweight.h>
 #endif
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 9da3df6..9b64f59 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -443,6 +443,7 @@ static __inline__ unsigned long __arch_hweight64(unsigned long x)
 	result = ia64_popcnt(x);
 	return result;
 }
+#define __arch_hweight64 __arch_hweight64
 
 #define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
 #define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index 766121a..4982bc4 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -51,6 +51,7 @@ static inline unsigned int __arch_hweight64(unsigned long w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w));
 	return res;
 }
+#define __arch_hweight64 __arch_hweight64
 
 static inline unsigned int __arch_hweight32(unsigned int w)
 {
@@ -59,6 +60,7 @@ static inline unsigned int __arch_hweight32(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffffffff));
 	return res;
 }
+#define __arch_hweight32 __arch_hweight32
 
 static inline unsigned int __arch_hweight16(unsigned int w)
 {
@@ -67,6 +69,7 @@ static inline unsigned int __arch_hweight16(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xffff));
 	return res;
 }
+#define __arch_hweight16 __arch_hweight16
 
 static inline unsigned int __arch_hweight8(unsigned int w)
 {
@@ -75,6 +78,7 @@ static inline unsigned int __arch_hweight8(unsigned int w)
 	__asm__ ("popc %1,%0" : "=r" (res) : "r" (w & 0xff));
 	return res;
 }
+#define __arch_hweight8 __arch_hweight8
 
 #else
 
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..187defe 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,6 +444,16 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#define __arch_hweight32 __arch_hweight32
+#define __arch_hweight16 __arch_hweight16
+#define __arch_hweight8 __arch_hweight8
+#define __arch_hweight64 __arch_hweight64
+
 #include <asm-generic/bitops/hweight.h>
 
 #endif /* __KERNEL__ */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..ddc32eb 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,9 +22,11 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
+	CFLAGS_hweight.o = -fcall-saved-ecx -fcall-saved-edx
+
         obj-y += atomic64_32.o
         lib-y += checksum_32.o
         lib-y += strstr_32.o
@@ -34,6 +36,10 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
 endif
         lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
 else
+	CFLAGS_hweight.o = -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx \
+			   -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 \
+			   -fcall-saved-r11
+
         obj-y += io_64.o iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
         lib-y += thunk_64.o clear_page_64.o copy_page_64.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..9b4bfa5
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,72 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * Those are called out-of-line in the alternative below and are added here only
+ * so that gcc is able to figure out which registers have been clobbered by
+ * __sw_hweightXX so that it could restore their values before returning from
+ * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
+ */
+unsigned int _sw_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
+
+unsigned long _sw_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
+
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call _sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..c72de8b 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,9 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+extern unsigned int __sw_hweight32(unsigned int w);
+extern unsigned int __sw_hweight16(unsigned int w);
+extern unsigned int __sw_hweight8(unsigned int w);
+extern unsigned long __sw_hweight64(__u64 w);
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
index fa2a50b..8cf8169 100644
--- a/include/asm-generic/bitops/const_hweight.h
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -18,13 +18,48 @@
 #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
 #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
 
+static inline unsigned int _hweight8(unsigned int w)
+{
+#ifdef __arch_hweight8
+       return __arch_hweight8(w);
+#else
+       return __sw_hweight8(w);
+#endif
+}
+
+static inline unsigned int _hweight16(unsigned int w)
+{
+#ifdef __arch_hweight16
+       return __arch_hweight16(w);
+#else
+       return __sw_hweight16(w);
+#endif
+}
+
+static inline unsigned int _hweight32(unsigned int w)
+{
+#ifdef __arch_hweight32
+       return __arch_hweight32(w);
+#else
+       return __sw_hweight32(w);
+#endif
+}
+
+static inline unsigned long _hweight64(__u64 w)
+{
+#ifdef __arch_hweight64
+       return __arch_hweight64(w);
+#else
+       return __sw_hweight64(w);
+#endif
+}
 /*
  * Generic interface.
  */
-#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
-#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
-#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
-#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : _hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : _hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : _hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : _hweight64(w))
 
 /*
  * Interface for known constant arguments
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
-- 
1.6.6.1


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:06                                             ` Borislav Petkov
@ 2010-02-12 17:28                                               ` H. Peter Anvin
  2010-02-12 17:47                                                 ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-12 17:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
>> You don't do the push/pop inline -- if you're going to take the hit of
>> pushing this into the caller, it's better to list them as explicit
>> clobbers and let the compiler figure out how to do it.  The point of
>> doing an explicit push/pop is that it can be pushed into the out-of-line
>> subroutine.
> 
> Doh! Of course, this was wrong. See below.
> 
> diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> new file mode 100644
> index 0000000..9b4bfa5
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,72 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * Those are called out-of-line in the alternative below and are added here only
> + * so that gcc is able to figure out which registers have been clobbered by
> + * __sw_hweightXX so that it could restore their values before returning from
> + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> + */
> +unsigned int _sw_hweight32(unsigned int w)
> +{
> +	return __sw_hweight32(w);
> +}
> +
> +unsigned long _sw_hweight64(__u64 w)
> +{
> +	return __sw_hweight64(w);
> +}
> +
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +

Now you have no clobbers and no assembly wrapper.  That really doesn't work.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:28                                               ` H. Peter Anvin
@ 2010-02-12 17:47                                                 ` Borislav Petkov
  2010-02-12 19:05                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-12 17:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Fri, Feb 12, 2010 at 09:28:32AM -0800, H. Peter Anvin wrote:
> On 02/12/2010 09:06 AM, Borislav Petkov wrote:
> > On Thu, Feb 11, 2010 at 09:33:49AM -0800, H. Peter Anvin wrote:
> >> You don't do the push/pop inline -- if you're going to take the hit of
> >> pushing this into the caller, it's better to list them as explicit
> >> clobbers and let the compiler figure out how to do it.  The point of
> >> doing an explicit push/pop is that it can be pushed into the out-of-line
> >> subroutine.
> > 
> > Doh! Of course, this was wrong. See below.
> > 
> > diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
> > new file mode 100644
> > index 0000000..9b4bfa5
> > --- /dev/null
> > +++ b/arch/x86/lib/hweight.c
> > @@ -0,0 +1,72 @@
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/bitops.h>
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> > +/* popcnt %eax, %eax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> > +#define REG_IN "a"
> > +#define REG_OUT "a"
> > +#endif
> > +
> > +/*
> > + * Those are called out-of-line in the alternative below and are added here only
> > + * so that gcc is able to figure out which registers have been clobbered by
> > + * __sw_hweightXX so that it could restore their values before returning from
> > + * the __arch_hweightXX versions. See also <arch/x86/lib/Makefile>.
> > + */
> > +unsigned int _sw_hweight32(unsigned int w)
> > +{
> > +	return __sw_hweight32(w);
> > +}
> > +
> > +unsigned long _sw_hweight64(__u64 w)
> > +{
> > +	return __sw_hweight64(w);
> > +}
> > +
> > +unsigned int __arch_hweight32(unsigned int w)
> > +{
> > +	unsigned int res = 0;
> > +
> > +	asm (ALTERNATIVE("call _sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> > +		     : "="REG_OUT (res)
> > +		     : REG_IN (w));
> > +
> 
> Now you have no clobbers and no assembly wrapper.  That really doesn't work.

I see, this means we have to build lib/hweight.c with -fcall-saved*. I
just did a test and it uses %rcx and %rdx temporarily by saving their
values on the stack and restoring them before it returns.

However, this is generic code and for the above to work we have to
enforce x86-specific CFLAGS for it. What is the preferred way to do
that?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 17:47                                                 ` Borislav Petkov
@ 2010-02-12 19:05                                                   ` H. Peter Anvin
  2010-02-17 13:57                                                     ` Michal Marek
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-12 19:05 UTC (permalink / raw)
  To: Borislav Petkov, Michal Marek, linux-kbuild
  Cc: Borislav Petkov, Peter Zijlstra, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> 
> However, this is generic code and for the above to work we have to
> enforce x86-specific CFLAGS for it. What is the preferred way to do
> that?
> 

That's a question for Michal and the kbuild list.  Michal?

	-hpa


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-11 17:24                                         ` Borislav Petkov
  2010-02-11 17:33                                           ` H. Peter Anvin
@ 2010-02-14 10:12                                           ` Peter Zijlstra
  2010-02-14 11:24                                             ` Borislav Petkov
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-14 10:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > Let me prep another version when I get back on Wed. (currently
> > travelling) with all the stuff we discussed to see how it would turn.
> 
> Ok, here's another version ontop of PeterZ's patch at
> http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> while on 64-bit I do "popcnt %rdi, %rdi".

Right, so I don't like how you need to touch !x86 for this, and I think
that is easily avoidable by not making x86 include
asm-generic/bitops/arch_hweight.h.

If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
and simply provide __arch_hweightN() from x86/include/asm/bitops.h
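
That is, roughly something like this for the generic header (just a
sketch of the idea, not a tested patch; it assumes the __sw_hweightN()
prototypes are already visible, e.g. from <linux/bitops.h>):

/* include/asm-generic/bitops/arch_hweight.h */
#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_

#include <asm/types.h>

/* default: no hardware popcount, fall back to the lib/hweight.c versions */
static inline unsigned int __arch_hweight32(unsigned int w)
{
	return __sw_hweight32(w);
}

static inline unsigned int __arch_hweight16(unsigned int w)
{
	return __sw_hweight16(w);
}

static inline unsigned int __arch_hweight8(unsigned int w)
{
	return __sw_hweight8(w);
}

static inline unsigned long __arch_hweight64(__u64 w)
{
	return __sw_hweight64(w);
}

#endif /* _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_ */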



^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 10:12                                           ` Peter Zijlstra
@ 2010-02-14 11:24                                             ` Borislav Petkov
  2010-02-14 12:23                                               ` Peter Zijlstra
  2010-02-14 18:36                                                 ` H. Peter Anvin
  0 siblings, 2 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-14 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 11:12:23AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-02-11 at 18:24 +0100, Borislav Petkov wrote:
> > On Mon, Feb 08, 2010 at 10:59:45AM +0100, Borislav Petkov wrote:
> > > Let me prep another version when I get back on Wed. (currently
> > > travelling) with all the stuff we discussed to see how it would turn.
> > 
> > Ok, here's another version ontop of PeterZ's patch at
> > http://lkml.org/lkml/2010/2/4/119. I need to handle 32- and 64-bit
> > differently wrt to popcnt opcode so on 32-bit I do "popcnt %eax, %eax"
> > while on 64-bit I do "popcnt %rdi, %rdi".
> 
> Right, so I don't like how you need to touch !x86 for this, and I think
> that is easily avoidable by not making x86 include
> asm-generic/bitops/arch_hweight.h.
> 
> If you then add __sw_hweightN() -> __arch_hweightN() wrappers in
> arch_hweight.h, then you can leave const_hweight.h use __arch_hweightN()
> and simply provide __arch_hweightN() from x86/include/asm/bitops.h

Hmm, all these different names start to get a little confusing. Can we
first agree on the naming, please? Here's my proposal:

__const_hweightN - for at compile time known constants as arguments
__arch_hweightN - arch possibly has an optimized hweight version
__sw_hweightN - fall back when nothing else is there, aka the functions in
lib/hweight.c

Now, in the x86 case, when the compiler can't know that the argument is
a constant, we call the __arch_hweightN versions. The alternative does
call the __sw_hweightN version in case the CPU doesn't support popcnt.
In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
to be able to take care of the regs clobbered by __sw_hweightN.

So, if I understand you correctly, your suggestion might work, we
simply need to rename the lib/hweight.c versions to __sw_hweightN
and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
__sw_hweightN wrappers in the default case, all arches which have an
optimized version will provide it in their respective bitops header...

How's that?

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 11:24                                             ` Borislav Petkov
@ 2010-02-14 12:23                                               ` Peter Zijlstra
  2010-02-14 14:19                                                 ` Borislav Petkov
  2010-02-14 18:36                                                 ` H. Peter Anvin
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-14 12:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
> 
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
> 
I'm not quite sure what the last 'it' refers to, does that refer to:
 1) an __arch_hweightN() implementation, or
 2) __arch_hweightN() -> __sw_hweightN() wrappers ?

If you meant 1, then yes.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 12:23                                               ` Peter Zijlstra
@ 2010-02-14 14:19                                                 ` Borislav Petkov
  0 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-14 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, H. Peter Anvin, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 01:23:35PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-02-14 at 12:24 +0100, Borislav Petkov wrote:
> > 
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> > 
> I'm not quite sure what the last 'it' refers to, does that refer to:
>  1) an __arch_hweightN() implementation, or
>  2) __arch_hweightN() -> __sw_hweightN() wrappers ?
> 
> If you meant 1, then yes.

Yes, I mean that all architectures which have an optimized hweight will
use that optimized version in their __arch_hweightN, while as a default
fallback for the remaining architectures we'll have __arch_hweightN() ->
__sw_hweightN() wrappers in <asm-generic/bitops/arch_hweight.h>.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 11:24                                             ` Borislav Petkov
@ 2010-02-14 18:36                                                 ` H. Peter Anvin
  2010-02-14 18:36                                                 ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-14 18:36 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> 
> __const_hweightN - for at compile time known constants as arguments
> __arch_hweightN - arch possibly has an optimized hweight version
> __sw_hweightN - fall back when nothing else is there, aka the functions in
> lib/hweight.c
> 
> Now, in the x86 case, when the compiler can't know that the argument is
> a constant, we call the __arch_hweightN versions. The alternative does
> call the __sw_hweightN version in case the CPU doesn't support popcnt.
> In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> to be able to take care of the regs clobbered by __sw_hweightN.
> 
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
> 

I'm not entirely sure what you're asking; if what you're asking is what
to name an x86-specific fallback function, it presumably should be
__arch_sw_hweightN (i.e. an __arch prefix with a modifier).

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
@ 2010-02-14 18:36                                                 ` H. Peter Anvin
  0 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-14 18:36 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang

On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> 
> __const_hweightN - for at compile time known constants as arguments
> __arch_hweightN - arch possibly has an optimized hweight version
> __sw_hweightN - fall back when nothing else is there, aka the functions in
> lib/hweight.c
> 
> Now, in the x86 case, when the compiler can't know that the argument is
> a constant, we call the __arch_hweightN versions. The alternative does
> call the __sw_hweightN version in case the CPU doesn't support popcnt.
> In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> to be able to take care of the regs clobbered by __sw_hweightN.
> 
> So, if I understand you correctly, your suggestion might work, we
> simply need to rename the lib/hweight.c versions to __sw_hweightN
> and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> __sw_hweightN wrappers in the default case, all arches which have an
> optimized version will provide it in their respective bitops header...
> 

I'm not entirely sure what you're asking; if what you're asking is what
to name an x86-specific fallback function, it presumably should be
__arch_sw_hweightN (i.e. an __arch prefix with a modifier).

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 18:36                                                 ` H. Peter Anvin
  (?)
@ 2010-02-14 20:28                                                 ` Borislav Petkov
  2010-02-14 22:13                                                   ` H. Peter Anvin
  2010-02-14 22:13                                                   ` H. Peter Anvin
  -1 siblings, 2 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-14 20:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Borislav Petkov, Andrew Morton, Wu Fengguang,
	LKML, Jamie Lokier, Roland Dreier, Al Viro, linux-fsdevel,
	Ingo Molnar, Brian Gerst

On Sun, Feb 14, 2010 at 10:36:48AM -0800, H. Peter Anvin wrote:
> On 02/14/2010 03:24 AM, Borislav Petkov wrote:
> > 
> > __const_hweightN - for at compile time known constants as arguments
> > __arch_hweightN - arch possibly has an optimized hweight version
> > __sw_hweightN - fall back when nothing else is there, aka the functions in
> > lib/hweight.c
> > 
> > Now, in the x86 case, when the compiler can't know that the argument is
> > a constant, we call the __arch_hweightN versions. The alternative does
> > call the __sw_hweightN version in case the CPU doesn't support popcnt.
> > In this case, we need to build __sw_hweightN with -fcall-saved* for gcc
> > to be able to take care of the regs clobbered by __sw_hweightN.
> > 
> > So, if I understand you correctly, your suggestion might work, we
> > simply need to rename the lib/hweight.c versions to __sw_hweightN
> > and have <asm-generic/bitops/arch_hweight.h> have __arch_hweightN ->
> > __sw_hweightN wrappers in the default case, all arches which have an
> > optimized version will provide it in their respective bitops header...
> > 
> 
> I'm not entirely sure what you're asking; if what you're asking is what
> to name an x86-specific fallback function, it presumably should be
> __arch_sw_hweightN (i.e. an __arch prefix with a modifier).

Hmm, basically, what PeterZ suggested is that I drop one indirection
under __arch_hweightN, which would make x86-specific fallback functions
superfluous.

IOW, what we have so far is:

#define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))

and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
-> __sw_hweightN wrappers per default, where the __sw_hweightN are the
lib/hweight.c generic versions.

On architectures/CPUs which provide popcnt in
hardware, we create __arch_hweightN implementations in
<arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
<asm-generic/bitops/arch_hweight.h> versions by simply not including
that last header.

Is that agreeable?
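
To make it concrete, the resulting include structure would roughly be as
follows (a sketch only, reusing the names above; the x86 __arch_hweightN
bodies stay the ALTERNATIVE()-based ones from the earlier patch):

/* include/asm-generic/bitops/hweight.h -- what most architectures pull in: */
#include <asm-generic/bitops/arch_hweight.h>	/* __arch_hweightN -> __sw_hweightN */
#include <asm-generic/bitops/const_hweight.h>	/* hweightN() dispatch */

/* arch/x86/include/asm/bitops.h -- provides its own __arch_hweightN and
 * deliberately skips <asm-generic/bitops/arch_hweight.h>: */
extern unsigned int __arch_hweight32(unsigned int w);
extern unsigned int __arch_hweight16(unsigned int w);
extern unsigned int __arch_hweight8(unsigned int w);
extern unsigned long __arch_hweight64(__u64 w);

#include <asm-generic/bitops/const_hweight.h>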

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 20:28                                                 ` Borislav Petkov
@ 2010-02-14 22:13                                                   ` H. Peter Anvin
  2010-02-14 22:13                                                   ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-14 22:13 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/14/2010 12:28 PM, Borislav Petkov wrote:
> 
> Hmm, basically, what PeterZ suggested is that I drop one indirection
> under __arch_hweightN, which would make x86-specific fallback functions
> superfluous.
> 
> IOW, what we have so far is:
> 
> #define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))
> 
> and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
> -> __sw_hweightN wrappers per default, where the __sw_hweightN are the
> lib/hweight.c generic versions.
> 
> On architectures/CPUs which provide popcnt in
> hardware, we create __arch_hweightN implementations in
> <arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
> <asm-generic/bitops/arch_hweight.h> versions by simply not including
> that last header.
> 
> Is that agreeable?
> 

That makes sense... after all, that's a pretty typical use of asm-generic.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-14 20:28                                                 ` Borislav Petkov
  2010-02-14 22:13                                                   ` H. Peter Anvin
@ 2010-02-14 22:13                                                   ` H. Peter Anvin
  1 sibling, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-14 22:13 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Borislav Petkov, Andrew Morton,
	Wu Fengguang

On 02/14/2010 12:28 PM, Borislav Petkov wrote:
> 
> Hmm, basically, what PeterZ suggested is that I drop one indirection
> under __arch_hweightN, which would make x86-specific fallback functions
> superfluous.
> 
> IOW, what we have so far is:
> 
> #define hweightN(w) (__builtin_constant_p(w) ? __const_hweightN(w) : __arch_hweightN(w))
> 
> and have <asm-generic/bitops/arch_hweight.h> provide __arch_hweightN()
> -> __sw_hweightN wrappers per default, where the __sw_hweightN are the
> lib/hweight.c generic versions.
> 
> On architectures/CPUs which provide popcnt in
> hardware, we create __arch_hweightN implementations in
> <arch/[:ARCH_NAME:]/include/asm/bitops.h> overriding the
> <asm-generic/bitops/arch_hweight.h> versions by simply not including
> that last header.
> 
> Is that agreeable?
> 

That makes sense... after all, that's a pretty typical use of asm-generic.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-12 19:05                                                   ` H. Peter Anvin
@ 2010-02-17 13:57                                                     ` Michal Marek
  2010-02-17 17:20                                                       ` Borislav Petkov
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
  0 siblings, 2 replies; 126+ messages in thread
From: Michal Marek @ 2010-02-17 13:57 UTC (permalink / raw)
  To: H. Peter Anvin, Borislav Petkov
  Cc: linux-kbuild, Borislav Petkov, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 12.2.2010 20:05, H. Peter Anvin wrote:
> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>
>> However, this is generic code and for the above to work we have to
>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>> that?
>>
> 
> That's a question for Michal and the kbuild list.  Michal?

(I was offline last week).

The _preferred_ way probably is not to do it :), but otherwise you can
set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
in arch/x86/lib/Makefile already.

Michal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 13:57                                                     ` Michal Marek
@ 2010-02-17 17:20                                                       ` Borislav Petkov
  2010-02-17 17:31                                                         ` Michal Marek
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
  1 sibling, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:20 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> > 
> > That's a question for Michal and the kbuild list.  Michal?
> 
> (I was offline last week).
> 
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
which gets set in the arch Makefile instead?

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:20                                                       ` Borislav Petkov
@ 2010-02-17 17:31                                                         ` Michal Marek
  2010-02-17 17:34                                                           ` Borislav Petkov
  2010-02-17 17:39                                                           ` Michal Marek
  0 siblings, 2 replies; 126+ messages in thread
From: Michal Marek @ 2010-02-17 17:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On 17.2.2010 18:20, Borislav Petkov wrote:
> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>
>>>> However, this is generic code and for the above to work we have to
>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>> that?
>>>>
>>>
>>> That's a question for Michal and the kbuild list.  Michal?
>>
>> (I was offline last week).
>>
>> The _preferred_ way probably is not to do it :), but otherwise you can
>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>> in arch/x86/lib/Makefile already.
> 
> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> which gets set in the arch Makefile instead?

We could, but is it worth it if there is only one potential user so far?
IMO just put the condition to lib/Makefile now and if there turn out to
be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Michal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:31                                                         ` Michal Marek
@ 2010-02-17 17:34                                                           ` Borislav Petkov
  2010-02-17 17:39                                                           ` Michal Marek
  1 sibling, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-17 17:34 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 06:31:04PM +0100, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
> > On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
> >> On 12.2.2010 20:05, H. Peter Anvin wrote:
> >>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>>>
> >>>> However, this is generic code and for the above to work we have to
> >>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >>>> that?
> >>>>
> >>>
> >>> That's a question for Michal and the kbuild list.  Michal?
> >>
> >> (I was offline last week).
> >>
> >> The _preferred_ way probably is not to do it :), but otherwise you can
> >> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> >> in arch/x86/lib/Makefile already.
> > 
> > Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
> > which gets set in the arch Makefile instead?
> 
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition to lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

Ok, I'm fine with that too, thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:31                                                         ` Michal Marek
  2010-02-17 17:34                                                           ` Borislav Petkov
@ 2010-02-17 17:39                                                           ` Michal Marek
  2010-02-18  6:19                                                             ` Borislav Petkov
  1 sibling, 1 reply; 126+ messages in thread
From: Michal Marek @ 2010-02-17 17:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, linux-kbuild, Borislav Petkov, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On 17.2.2010 18:31, Michal Marek wrote:
> On 17.2.2010 18:20, Borislav Petkov wrote:
>> On Wed, Feb 17, 2010 at 02:57:42PM +0100, Michal Marek wrote:
>>> On 12.2.2010 20:05, H. Peter Anvin wrote:
>>>> On 02/12/2010 09:47 AM, Borislav Petkov wrote:
>>>>>
>>>>> However, this is generic code and for the above to work we have to
>>>>> enforce x86-specific CFLAGS for it. What is the preferred way to do
>>>>> that?
>>>>>
>>>>
>>>> That's a question for Michal and the kbuild list.  Michal?
>>>
>>> (I was offline last week).
>>>
>>> The _preferred_ way probably is not to do it :), but otherwise you can
>>> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
>>> in arch/x86/lib/Makefile already.
>>
>> Wouldn't it be better if we had something like ARCH_CFLAGS_hweight.o
>> which gets set in the arch Makefile instead?
> 
> We could, but is it worth it if there is only one potential user so far?
> IMO just put the condition to lib/Makefile now and if there turn out to
> be more cases like this, we can add support for ARCH_CFLAGS_foo.o then.

It wouldn't work actually, because such variable would then apply to all
hweight.o targets in the tree. But another way would be:

arch/x86/Kconfig
config ARCH_HWEIGHT_CFLAGS
        string
        default "..." if X86_32
        default "..." if X86_64

lib/Makefile
CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Michal

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 17:39                                                           ` Michal Marek
@ 2010-02-18  6:19                                                             ` Borislav Petkov
  2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-18  6:19 UTC (permalink / raw)
  To: Michal Marek
  Cc: Borislav Petkov, H. Peter Anvin, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Wed, Feb 17, 2010 at 06:39:13PM +0100, Michal Marek wrote:
> It wouldn't work actually, because such variable would then apply to all
> hweight.o targets in the tree. But another way would be:
> 
> arch/x86/Kconfig
> config ARCH_HWEIGHT_CFLAGS
>         string
>         default "..." if X86_32
>         default "..." if X86_64
> 
> lib/Makefile
> CFLAGS_hweight.o = $(CONFIG_ARCH_HWEIGHT_CFLAGS)

Yep, this works, albeit with a small adjustment since
CONFIG_ARCH_HWEIGHT_CFLAGS is quoted in the Kconfig and the quotes
appear in the $(CC) call like this:

gcc -Wp,-MD,lib/.hweight.o.d  ... "-fcall-saved-ecx..."

which I fixed like this (idea reused from the make manual):

---
 source "init/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="



I'm open for better suggestions though.
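
(For the record, the quote:=" assignment just gives make a variable that
holds a literal double-quote character, so the $(subst $(quote),,...)
above strips the quoting which Kconfig adds around string options before
the flags reach the $(CC) command line.)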

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-17 13:57                                                     ` Michal Marek
  2010-02-17 17:20                                                       ` Borislav Petkov
@ 2010-02-18 10:51                                                       ` Peter Zijlstra
  2010-02-18 11:51                                                         ` Borislav Petkov
  1 sibling, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-02-18 10:51 UTC (permalink / raw)
  To: Michal Marek
  Cc: H. Peter Anvin, Borislav Petkov, linux-kbuild, Borislav Petkov,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Wed, 2010-02-17 at 14:57 +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> > 
> > That's a question for Michal and the kbuild list.  Michal?
> 
> (I was offline last week).
> 
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

I guess one way to achieve that is to create an arch/x86/lib/hweight.c
that includes lib/hweight.c, give the x86 one special compile flags
and not build the lib one.
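
Roughly something like this, say (a sketch of the idea only; the flags
would be whatever arch/x86/lib/Makefile attaches to that one object):

/*
 * arch/x86/lib/hweight.c (sketch): pull in the generic implementation
 * so that only this object needs the x86-specific -fcall-saved-*
 * flags, while the generic lib/hweight.o is left out of the build.
 */
#include "../../../lib/hweight.c"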


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT)
  2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
@ 2010-02-18 11:51                                                         ` Borislav Petkov
  0 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-18 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Marek, H. Peter Anvin, linux-kbuild, Borislav Petkov,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

On Thu, Feb 18, 2010 at 11:51:50AM +0100, Peter Zijlstra wrote:
> I guess one way to achieve that is to create an arch/x86/lib/hweight.c
> that includes lib/hweight.c, give the x86 one special compile flags
> and not build the lib one.

That's what I thought initially too but that won't fly because the
lib/hweight.c helpers have to be inlined into arch/x86/lib/hweight.c so
that gcc can take care of the clobbered registers. Otherwise, it's just a
"call __sw_hweightXX" that gets issued into asm.


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH] x86: Add optimized popcnt variants
  2010-02-18  6:19                                                             ` Borislav Petkov
@ 2010-02-19 14:22                                                               ` Borislav Petkov
  2010-02-19 16:06                                                                 ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-19 14:22 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, Borislav Petkov, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: Borislav Petkov <borislav.petkov@amd.com>

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/bitops.h             |    7 +++-
 arch/x86/lib/Makefile                     |    2 +-
 arch/x86/lib/hweight.c                    |   62 +++++++++++++++++++++++++++++
 include/asm-generic/bitops/arch_hweight.h |   22 ++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 109 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/lib/hweight.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..5ec3bd8 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,12 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index cffd754..e811bbd 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -22,7 +22,7 @@ lib-y += usercopy_$(BITS).o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_KPROBES) += insn.o inat.o
 
-obj-y += msr.o msr-reg.o msr-reg-export.o
+obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
 ifeq ($(CONFIG_X86_32),y)
         obj-y += atomic64_32.o
diff --git a/arch/x86/lib/hweight.c b/arch/x86/lib/hweight.c
new file mode 100644
index 0000000..54d3cb0
--- /dev/null
+++ b/arch/x86/lib/hweight.c
@@ -0,0 +1,62 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bitops.h>
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight32);
+
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+EXPORT_SYMBOL(__arch_hweight16);
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+EXPORT_SYMBOL(__arch_hweight8);
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+EXPORT_SYMBOL(__arch_hweight64);
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..1c82306 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.5.4

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
@ 2010-02-19 16:06                                                                 ` H. Peter Anvin
  2010-02-19 16:45                                                                   ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/19/2010 06:22 AM, Borislav Petkov wrote:
> --- /dev/null
> +++ b/arch/x86/lib/hweight.c
> @@ -0,0 +1,62 @@
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else
> +/* popcnt %eax, %eax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
> +#define REG_IN "a"
> +#define REG_OUT "a"
> +#endif
> +
> +/*
> + * __sw_hweightXX are called from within the alternatives below
> + * and callee-clobbered registers need to be taken care of. See
> + * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
> + * compiler switches.
> + */
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	unsigned int res = 0;
> +
> +	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +
> +	return res;
> +}
> +EXPORT_SYMBOL(__arch_hweight32);
> +
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	return __arch_hweight32(w & 0xffff);
> +}
> +EXPORT_SYMBOL(__arch_hweight16);
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> +	return __arch_hweight32(w & 0xff);
> +}
> +EXPORT_SYMBOL(__arch_hweight8);
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> +	unsigned long res = 0;
> +
> +#ifdef CONFIG_X86_32
> +	return  __arch_hweight32((u32)w) +
> +		__arch_hweight32((u32)(w >> 32));
> +#else
> +	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> +		     : "="REG_OUT (res)
> +		     : REG_IN (w));
> +#endif /* CONFIG_X86_32 */
> +
> +	return res;
> +}

You're still not inlining these.  They should be: there is absolutely no
reason for code size to not inline them anymore.

> diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
> index 3a7be84..1c82306 100644
> --- a/include/asm-generic/bitops/arch_hweight.h
> +++ b/include/asm-generic/bitops/arch_hweight.h
> @@ -3,9 +3,23 @@
>  
>  #include <asm/types.h>
>  
> -extern unsigned int __arch_hweight32(unsigned int w);
> -extern unsigned int __arch_hweight16(unsigned int w);
> -extern unsigned int __arch_hweight8(unsigned int w);
> -extern unsigned long __arch_hweight64(__u64 w);
> +unsigned int __arch_hweight32(unsigned int w)
> +{
> +	return __sw_hweight32(w);
> +}
>  
> +unsigned int __arch_hweight16(unsigned int w)
> +{
> +	return __sw_hweight16(w);
> +}
> +
> +unsigned int __arch_hweight8(unsigned int w)
> +{
> +	return __sw_hweight8(w);
> +}
> +
> +unsigned long __arch_hweight64(__u64 w)
> +{
> +	return __sw_hweight64(w);
> +}
>  #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */

and these are in a header file and *definitely* should be inlines.
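
I.e., presumably just:

static inline unsigned int __arch_hweight32(unsigned int w)
{
	return __sw_hweight32(w);
}

and likewise for the 8/16/64-bit variants.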

	-hpa
-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:06                                                                 ` H. Peter Anvin
@ 2010-02-19 16:45                                                                   ` Borislav Petkov
  2010-02-19 16:53                                                                     ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-19 16:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:06:07AM -0800

<snip>

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > +	unsigned long res = 0;
> > +
> > +#ifdef CONFIG_X86_32
> > +	return  __arch_hweight32((u32)w) +
> > +		__arch_hweight32((u32)(w >> 32));
> > +#else
> > +	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
> > +		     : "="REG_OUT (res)
> > +		     : REG_IN (w));
> > +#endif /* CONFIG_X86_32 */
> > +
> > +	return res;
> > +}
> 
> You're still not inlining these.  They should be: there is absolutely no
> reason for code size to not inline them anymore.

Isn't it better to have only those 4 locations for apply_alternatives to
patch wrt popcnt instead of sprinkling alternatives sections around
the kernel in every callsite of hweight and its users? Or is the aim to
optimize even that "call __arch_hweightXX" away?

> > +unsigned long __arch_hweight64(__u64 w)
> > +{
> > +	return __sw_hweight64(w);
> > +}
> >  #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
> 
> and these are in a header file and *definitely* should be inlines.

Yep, done.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:45                                                                   ` Borislav Petkov
@ 2010-02-19 16:53                                                                     ` H. Peter Anvin
  2010-02-22 14:17                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-19 16:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/19/2010 08:45 AM, Borislav Petkov wrote:
>>
>> You're still not inlining these.  They should be: there is absolutely no
>> reason for code size to not inline them anymore.
> 
> Isn't it better to have only those 4 locations for apply_alternatives to
> patch wrt popcnt instead of sprinkling alternatives sections around
> the kernel in every callsite of hweight and its users? Or is the aim to
> optimize even that "call __arch_hweightXX" away?
> 

That's the idea, yes.  We use inline alternatives in quite a few other
places.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-19 16:53                                                                     ` H. Peter Anvin
@ 2010-02-22 14:17                                                                       ` Borislav Petkov
  2010-02-22 17:21                                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-22 14:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 19, 2010 at 08:53:32AM -0800

> That's the idea, yes.  We use inline alternatives in quite a few other
> places.

Ok, inlining results in circa 100+ replacements here both on 32- and
64-bit. Here we go:

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..f79b733
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 14:17                                                                       ` Borislav Petkov
@ 2010-02-22 17:21                                                                         ` H. Peter Anvin
  2010-02-22 18:49                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-22 17:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/22/2010 06:17 AM, Borislav Petkov wrote:
>  
> +config ARCH_HWEIGHT_CFLAGS
> +	string
> +	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> +

[...]

> +
> +#ifdef CONFIG_64BIT
> +/* popcnt %rdi, %rax */
> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> +#define REG_IN "D"
> +#define REG_OUT "a"
> +#else

Just a note: this still means rdi is clobbered on x86-64, which is
probably fine, but needs to be recorded as such.  Since gcc doesn't
support clobbers for registers used as operands (sigh), you have to
create a dummy output and assign it a "=D" constraint.

I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
reliably.
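
Something along these lines, say (a sketch only; hw64 is a made-up name
and the constraints are spelled out instead of your REG_IN/REG_OUT
macros):

static inline unsigned long hw64(unsigned long w)	/* placeholder name */
{
	unsigned long res, dummy;

	/*
	 * The dummy "=D" output records that %rdi does not survive,
	 * since gcc won't accept a clobber for a register that is
	 * also used as an operand.
	 */
	asm ("call __sw_hweight64"
	     : "=a" (res), "=D" (dummy)
	     : "D" (w));

	return res;
}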

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 17:21                                                                         ` H. Peter Anvin
@ 2010-02-22 18:49                                                                           ` Borislav Petkov
  2010-02-22 19:55                                                                             ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-22 18:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 09:21:05AM -0800

> On 02/22/2010 06:17 AM, Borislav Petkov wrote:
> >  
> > +config ARCH_HWEIGHT_CFLAGS
> > +	string
> > +	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
> > +
> 
> [...]
> 
> > +
> > +#ifdef CONFIG_64BIT
> > +/* popcnt %rdi, %rax */
> > +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
> > +#define REG_IN "D"
> > +#define REG_OUT "a"
> > +#else
> 
> Just a note: this still means rdi is clobbered on x86-64, which is
> probably fine, but needs to be recorded as such.  Since gcc doesn't
> support clobbers for registers used as operands (sigh), you have to
> create a dummy output and assign it a "=D" constraint.
> 
> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> reliably.

Ok, from looking at kernel/sched.s output it looks like it saves rdi
content over the alternative where needed. I'll do some more testing
just to make sure.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..cc3a188
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3\n\t.byte 0x0f\n\t.byte 0xb8\n\t.byte 0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0, dummy;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res), "="REG_IN (dummy)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 18:49                                                                           ` Borislav Petkov
@ 2010-02-22 19:55                                                                             ` H. Peter Anvin
  2010-02-23  6:37                                                                               ` Borislav Petkov
  2010-02-23 15:58                                                                               ` Borislav Petkov
  0 siblings, 2 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-22 19:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/22/2010 10:49 AM, Borislav Petkov wrote:
>>
>> Just a note: this still means rdi is clobbered on x86-64, which is
>> probably fine, but needs to be recorded as such.  Since gcc doesn't
>> support clobbers for registers used as operands (sigh), you have to
>> create a dummy output and assign it a "=D" constraint.
>>
>> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
>> reliably.
> 
> Ok, from looking at kernel/sched.s output it looks like it saves rdi
> content over the alternative where needed. I'll do some more testing
> just to make sure.
> 

No, you can't rely on behavioral observation.  A different version of
gcc could behave differently.  We need to make sure we tell gcc what the
requirements actually are, as opposed to thinking we can just fix them.

+#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
0xb8\n\t.byte 0xc7"

BTW, this can be written:

#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
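
(Decoded, those bytes are: 0xf3 the mandatory prefix selecting POPCNT,
0x48 REX.W for 64-bit operand size, 0x0f 0xb8 the POPCNT opcode, and
0xc7 the ModRM byte with mod=11, reg=000 (%rax), rm=111 (%rdi), i.e.
"popcnt %rdi,%rax" in AT&T syntax.)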

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 19:55                                                                             ` H. Peter Anvin
@ 2010-02-23  6:37                                                                               ` Borislav Petkov
  2010-02-23 15:58                                                                               ` Borislav Petkov
  1 sibling, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-23  6:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such.  Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.
> > 
> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> > 
> 
> No, you can't rely on behavioral observation.  A different version of
> gcc could behave differently.  We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Ok, I've added the dummy "=D" constraint since it sounded like the more
stable thing to do WRT gcc versions. BTW, I left the machine overnight
and it is still alive cheerfully building randconfigs.

I'll try the -fcall-saved-rdi thing also later today.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
> 
> BTW, this can be written:
> 
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

done, updated version coming up.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-22 19:55                                                                             ` H. Peter Anvin
  2010-02-23  6:37                                                                               ` Borislav Petkov
@ 2010-02-23 15:58                                                                               ` Borislav Petkov
  2010-02-23 17:34                                                                                 ` H. Peter Anvin
  1 sibling, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-23 15:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Mon, Feb 22, 2010 at 11:55:48AM -0800

> On 02/22/2010 10:49 AM, Borislav Petkov wrote:
> >>
> >> Just a note: this still means rdi is clobbered on x86-64, which is
> >> probably fine, but needs to be recorded as such.  Since gcc doesn't
> >> support clobbers for registers used as operands (sigh), you have to
> >> create a dummy output and assign it a "=D" constraint.
> >>
> >> I don't know if gcc would handle -fcall-saved-rdi here... and if so, how
> >> reliably.

Hmm, we cannot do that with the current design since __arch_hweight64
is being inlined into every callsite and AFAICT we would have to build
every callsite with "-fcall-saved-rdi" which is clearly too much. The
explicit "=D" dummy constraint is straightforward, instead.

> > Ok, from looking at kernel/sched.s output it looks like it saves rdi
> > content over the alternative where needed. I'll do some more testing
> > just to make sure.
> > 
> 
> No, you can't rely on behavioral observation.  A different version of
> gcc could behave differently.  We need to make sure we tell gcc what the
> requirements actually are, as opposed to thinking we can just fix them.

Just to be clear: those observations were referring to the version
_with_ the dummy "=D" constraint - I was simply verifying the asm
output and wasn't relying on behavioral observation.

> +#define POPCNT ".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t.byte
> 0xb8\n\t.byte 0xc7"
> 
> BTW, this can be written:
> 
> #define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"

Done, here's the latest version, it boots fine and testing is ongoing:

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   61 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 ++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 110 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..176950e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..eaefadc
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,61 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+	/* tell gcc that %rdi is clobbered as an input operand */
+	unsigned long dummy;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res), "="REG_IN (dummy)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
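
The fallback above is the classic parallel bit-count: each step folds
neighbouring partial sums into wider fields until the whole word is summed.
A minimal stand-alone version (illustration only; the userspace wrapper and
the name sw_hweight32 here are hypothetical, the arithmetic is the same as
the renamed __sw_hweight32() above) can be checked against gcc's builtin:

#include <assert.h>
#include <stdio.h>

static unsigned int sw_hweight32(unsigned int w)
{
        unsigned int res = w - ((w >> 1) & 0x55555555);         /* 2-bit sums */
        res = (res & 0x33333333) + ((res >> 2) & 0x33333333);   /* 4-bit sums */
        res = (res + (res >> 4)) & 0x0F0F0F0F;                  /* 8-bit sums */
        res = res + (res >> 8);                                 /* 16-bit sums */
        return (res + (res >> 16)) & 0x000000FF;                /* full count */
}

int main(void)
{
        unsigned int w = 0xdeadbeef;

        assert(sw_hweight32(w) == (unsigned int)__builtin_popcount(w));
        printf("hweight32(%#x) = %u\n", w, sw_hweight32(w));    /* prints 24 */
        return 0;
}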

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 15:58                                                                               ` Borislav Petkov
@ 2010-02-23 17:34                                                                                 ` H. Peter Anvin
  2010-02-23 17:54                                                                                     ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-23 17:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/23/2010 07:58 AM, Borislav Petkov wrote:
>
> Hmm, we cannot do that with the current design since __arch_hweight64
> is being inlined into every callsite and AFAICT we would have to build
> every callsite with "-fcall-saved-rdi" which is clearly too much. The
> explicit "=D" dummy constraint is straightforward, instead.
>

Uh... the -fcall-saved-rdi would go with all the other ones.  Assuming 
it can actually work and that gcc doesn't choke on an inbound argument 
being saved.

	-hpa

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 17:34                                                                                 ` H. Peter Anvin
@ 2010-02-23 17:54                                                                                     ` Borislav Petkov
  0 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-02-23 17:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 09:34:04AM -0800

> On 02/23/2010 07:58 AM, Borislav Petkov wrote:
> >
> >Hmm, we cannot do that with the current design since __arch_hweight64
> >is being inlined into every callsite and AFAICT we would have to build
> >every callsite with "-fcall-saved-rdi" which is clearly too much. The
> >explicit "=D" dummy constraint is straightforward, instead.
> >
> 
> Uh... the -fcall-saved-rdi would go with all the other ones.
> Assuming it can actually work and that gcc doesn't choke on an
> inbound argument being saved.

Right, doh. Ok, just added it and it builds fine with gcc (Gentoo
4.4.1 p1.0) 4.4.1. If you suspect that some older gcc versions
might choke on it, I could leave the "=D" dummy constraint in?

BTW, the current version screams

/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h: In function ‘__arch_hweight64’:
/usr/src/linux-2.6/arch/x86/include/asm/arch_hweight.h:47: warning: unused variable ‘dummy’

on x86-32. I'll send a fixed version in a second.
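
For reference, the variant in question looks roughly like this (a sketch
reconstructed for illustration; the in-tree details may differ slightly):

static inline unsigned long __arch_hweight64(__u64 w)
{
        unsigned long res = 0;
        __u64 dummy;    /* only consumed by the 64-bit asm below */

#ifdef CONFIG_X86_32
        return  __arch_hweight32((u32)w) +
                __arch_hweight32((u32)(w >> 32));
#else
        /* "=D" (dummy) tells gcc that %rdi is written across the call */
        asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
                     : "="REG_OUT (res), "=D" (dummy)
                     : "1" (w));
#endif

        return res;
}

On 32-bit the asm is compiled out while 'dummy' stays declared, hence the
warning.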

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 17:54                                                                                     ` Borislav Petkov
  (?)
@ 2010-02-23 18:17                                                                                     ` H. Peter Anvin
  2010-02-23 19:06                                                                                       ` Borislav Petkov
  -1 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-23 18:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> 
> Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> might choke on it, I could leave the "=D" dummy constraint in?
> 

I can try it with gcc 3.4 here.  -fcall-saved-rdi is cleaner, if it works.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 18:17                                                                                     ` H. Peter Anvin
@ 2010-02-23 19:06                                                                                       ` Borislav Petkov
  2010-02-26  5:27                                                                                         ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-23 19:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Tue, Feb 23, 2010 at 10:17:39AM -0800

> On 02/23/2010 09:54 AM, Borislav Petkov wrote:
> > 
> > Right, doh. Ok, just added it and it builds fine with a gcc (Gentoo
> > 4.4.1 p1.0) 4.4.1. If you have suspicion that some older gcc versions
> > might choke on it, I could leave the "=D" dummy constraint in?
> > 
> 
> I can try it with gcc 3.4 here.  -fcall-saved-rdi is cleaner, if it works.

Ok, here you go.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Feb 2010 00:48:31 +0100
Subject: [PATCH] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertize it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fallback to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..4673dc5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 69b74a7..0720c96 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -91,6 +88,12 @@ static inline void alternatives_smp_switch(int smp) {}
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 3b0b4a6..e2ad17c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
diff --git a/lib/hweight.c b/lib/hweight.c
index 9ff86df..f9ce440 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55555555);
 	res = (res & 0x33333333) + ((res >> 2) & 0x33333333);
@@ -17,30 +17,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	res = res + (res >> 8);
 	return (res + (res >> 16)) & 0x000000FF;
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -57,4 +57,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.6.4.2


-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center
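
As a quick sanity check outside the kernel, the CPUID bit the changelog
refers to (leaf 0x00000001, ECX bit 23) can be read with gcc's <cpuid.h>.
A small illustrative program, not part of the patch:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return 1;       /* leaf 1 not available */

        printf("popcnt %ssupported\n", (ecx & (1u << 23)) ? "" : "not ");
        return 0;
}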

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-23 19:06                                                                                       ` Borislav Petkov
@ 2010-02-26  5:27                                                                                         ` H. Peter Anvin
  2010-02-26  7:47                                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-26  5:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/23/2010 11:06 AM, Borislav Petkov wrote:
> 
> Ok, here you go.
> 
> --
> From: Borislav Petkov <borislav.petkov@amd.com>
> Date: Thu, 11 Feb 2010 00:48:31 +0100
> Subject: [PATCH] x86: Add optimized popcnt variants
> 
> Add support for the hardware version of the Hamming weight function,
> popcnt, present in CPUs which advertise it under CPUID, Function
> 0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
> default lib/hweight.c sw versions.
> 
> A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
> a 3x speedup on a F10h machine.
> 
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>

OK, this patch looks pretty good now, but I'm completely lost as to what
the baseline of this patch is supposed to be.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
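
A rough userspace analogue of the comparison quoted above (an illustration
under stated assumptions, not the benchmark behind the 3x figure): the same
SWAR arithmetic as the ARCH_HAS_FAST_MULTIPLIER fallback versus the
compiler's popcnt builtin, built with -mpopcnt (link with -lrt on older
glibc):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static unsigned long sw_hweight64(uint64_t w)
{
        w -= (w >> 1) & 0x5555555555555555ULL;
        w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
        w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
        return (w * 0x0101010101010101ULL) >> 56;
}

static unsigned long hw_hweight64(uint64_t w)
{
        return (unsigned long)__builtin_popcountll(w);  /* popcnt with -mpopcnt */
}

static long bench(unsigned long (*fn)(uint64_t))
{
        struct timespec t0, t1;
        unsigned long acc = 0;
        long i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < 100000000L; i++)
                acc += fn(0xdeadbeefcafebabeULL + i);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("acc=%lu ", acc);        /* keep the loop alive */
        return (t1.tv_sec - t0.tv_sec) * 1000000000L +
               (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
        printf("sw: %ld ns\n", bench(sw_hweight64));
        printf("hw: %ld ns\n", bench(hw_hweight64));
        return 0;
}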


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26  5:27                                                                                         ` H. Peter Anvin
@ 2010-02-26  7:47                                                                                           ` Borislav Petkov
  2010-02-26 17:48                                                                                               ` H. Peter Anvin
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-26  7:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Thu, Feb 25, 2010 at 09:27:25PM -0800

> OK, this patch looks pretty good now, but I'm completely lost as to
> what the baseline of this patch is supposed to be.

Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119

But I'm not sure which tree has it...

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26  7:47                                                                                           ` Borislav Petkov
@ 2010-02-26 17:48                                                                                               ` H. Peter Anvin
  0 siblings, 0 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-26 17:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst, Andrew Morton

On 02/25/2010 11:47 PM, Borislav Petkov wrote:
> From: "H. Peter Anvin" <hpa@zytor.com>
> Date: Thu, Feb 25, 2010 at 09:27:25PM -0800
> 
>> OK, this patch looks pretty good now, but I'm completely lost as to
>> what the baseline of this patch is supposed to be.
> 
> Yeah, this is based on PeterZ's http://lkml.org/lkml/2010/2/4/119
> 
> But I'm not sure which tree has it...
> 

Looks like -mm, which really means that either Andrew has to take your
patch, too, or we have to wait until that is upstream before we can merge
your patch.

I'm a little nervous about just acking the patch and telling Andrew to
test it, because I don't know what the fallout would look like.  I'm
particularly concerned about gcc version dependencies.

I guess, on the other hand, if it ends up not getting merged until .35
it's not a huge deal either.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-26 17:48                                                                                               ` H. Peter Anvin
  (?)
@ 2010-02-27  8:28                                                                                               ` Borislav Petkov
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  -1 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-02-27  8:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Michal Marek, linux-kbuild, Peter Zijlstra,
	Andrew Morton, Wu Fengguang, LKML, Jamie Lokier, Roland Dreier,
	Al Viro, linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Fri, Feb 26, 2010 at 09:48:46AM -0800

> Looks like -mm, which really means that either Andrew has to take your
> patch, too, or we have to wait until that is upstream until we can merge
> your patch.
> 
> I'm a little nervous about just acking the patch and telling Andrew to
> test it, because I don't know what the fallout would look like.  I'm
> particularly concerned about gcc version dependencies.

I have the same concern. Actually, I'd be much more at ease if it saw a
bit of wider testing without hitting mainline just yet. I'll try to give
it some more testing with the machines and toolchains I can get my hands
on next week.

> I guess, on the other hand, if it ends up not getting merged until .35
> it's not a huge deal either.

Yeah, let's give it another round of testing and queue it for .35 -
AFAIR Ingo runs also a wide testing effort so it spending another cycle
in -tip and being hammered on by us could give us a bit more certainty.

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27  8:28                                                                                               ` Borislav Petkov
@ 2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
                                                                                                                     ` (3 more replies)
  0 siblings, 4 replies; 126+ messages in thread
From: H. Peter Anvin @ 2010-02-27 20:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> 
>> I guess, on the other hand, if it ends up not getting merged until .35
>> it's not a huge deal either.
> 
> Yeah, let's give it another round of testing and queue it for .35 -
> AFAIR Ingo runs also a wide testing effort so it spending another cycle
> in -tip and being hammered on by us could give us a bit more certainty.
> 

Yes, if we can get into -tip then we'll get more test coverage, so I'll
queue it up for .35 as soon as the merge window closes.  Please remind
me if I forget.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
@ 2010-03-09 15:36                                                                                                   ` Borislav Petkov
  2010-03-09 15:50                                                                                                     ` Peter Zijlstra
  2010-03-18 11:17                                                                                                   ` Borislav Petkov
                                                                                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-03-09 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> > 
> 
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes.  Please remind
> me if I forget.

Hi Peter,

I see that you've added the HWEIGHT-capitalized interfaces for
compile-time constants with fce877e3. Which means, the bits in
<include/asm-generic/bitops/const_hweight.h> from your original patch at
http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
but I've missed them).

IOW, where can I get the current version of that last patch so that I can
base my __arch_hweightXX stuff on top of it for testing?

Thanks.

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
@ 2010-03-09 15:50                                                                                                     ` Peter Zijlstra
  2010-03-09 16:23                                                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-03-09 15:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 16:36 +0100, Borislav Petkov wrote:
> > On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > > Yeah, let's give it another round of testing and queue it for .35 -
> > > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > > in -tip and being hammered on by us could give us a bit more certainty.
> > > 
> > 
> > Yes, if we can get into -tip then we'll get more test coverage, so I'll
> > queue it up for .35 as soon as the merge window closes.  Please remind
> > me if I forget.
> 
> Hi Peter,
> 
> I see that you've added the HWEIGHT-capitalized interfaces for
> compile-time constants with fce877e3. Which means, the bits in
> <include/asm-generic/bitops/const_hweight.h> from your original patch at
> http://lkml.org/lkml/2010/2/4/119 need changing (or have changed already
> but I've missed them).
> 
> IOW, where can I get the current version of that last patch so that I can
> base my __arch_hweightXX stuff ontop of it for testing?

Should all be fine as it is, that patch
( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
in, I've just checked and it still applies to tip/master as of this
writing (although it grew a single 2 line offset for 1 hunk).




^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 15:50                                                                                                     ` Peter Zijlstra
@ 2010-03-09 16:23                                                                                                       ` Borislav Petkov
  2010-03-09 16:32                                                                                                         ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-03-09 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 04:50:40PM +0100

> Should all be fine as it is, that patch
> ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> in, I've just checked and it still applies to tip/master as of this
> writing (although it grew a single 2 line offset for 1 hunk).

Well, this way, I'm getting

...
In file included from include/linux/kernel.h:15,
                 from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
                 from /home/linux-2.6/arch/x86/include/asm/current.h:5,
                 from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
                 from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
                 from include/linux/crypto.h:20,
                 from arch/x86/kernel/asm-offsets_64.c:8,
                 from arch/x86/kernel/asm-offsets.c:4:
include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
...

due to the fact that we have multiple definitions of HWEIGHT*:

The one batch is in <include/linux/bitops.h> introduced by fce877e3.

The other is in <include/asm-generic/bitops/const_hweight.h>, which
is pulled into <include/linux/bitops.h> through "#include
<asm/bitops.h>", which, in turn, includes <asm/arch_hweight.h> and
<include/asm-generic/bitops/const_hweight.h>.

The obvious resolution is to remove the HWEIGHT* batch from
<include/asm-generic/bitops/const_hweight.h> since they're functionally
identical with the ones in <include/linux/bitops.h>, no?

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 16:23                                                                                                       ` Borislav Petkov
@ 2010-03-09 16:32                                                                                                         ` Peter Zijlstra
  2010-03-09 17:32                                                                                                           ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: Peter Zijlstra @ 2010-03-09 16:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 17:23 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 04:50:40PM +0100
> 
> > Should all be fine as it is, that patch
> > ( http://lkml.org/lkml/2010/2/4/119 ) is against a kernel with fce877e3
> > in, I've just checked and it still applies to tip/master as of this
> > writing (although it grew a single 2 line offset for 1 hunk).
> 
> Well, this way, I'm getting
> 
> ...
> In file included from include/linux/kernel.h:15,
>                  from /home/linux-2.6/arch/x86/include/asm/percpu.h:45,
>                  from /home/linux-2.6/arch/x86/include/asm/current.h:5,
>                  from /home/linux-2.6/arch/x86/include/asm/processor.h:15,
>                  from /home/linux-2.6/arch/x86/include/asm/atomic.h:6,
>                  from include/linux/crypto.h:20,
>                  from arch/x86/kernel/asm-offsets_64.c:8,
>                  from arch/x86/kernel/asm-offsets.c:4:
> include/linux/bitops.h:52:1: warning: "HWEIGHT8" redefined
> ...
> 
> due to the fact that we have multiple definitions of HWEIGHT*:
> 
> The one batch is in <include/linux/bitops.h> introduced by fce877e3.
> 
> The other is in <include/asm-generic/bitops/const_hweight.h> which
> is pulled in into <include/linux/bitops.h> through "#include
> <asm/bitops.h>", which, in turn, <includes asm/arch_hweight.h> and
> <include/asm-generic/bitops/const_hweight.h>.
> 
> The obvious resolution is to remove the HWEIGHT* batch from
> <include/asm-generic/bitops/const_hweight.h> since they're functionally
> identical with the ones in <include/linux/bitops.h>, no?

I thought the patch did that, see this hunk (straight from
http://lkml.org/lkml/2010/2/4/119 ):


---
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -45,31 +45,6 @@ static inline unsigned long hweight_long
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate


^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 16:32                                                                                                         ` Peter Zijlstra
@ 2010-03-09 17:32                                                                                                           ` Borislav Petkov
  2010-03-09 17:37                                                                                                             ` Peter Zijlstra
  0 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-03-09 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, Mar 09, 2010 at 05:32:49PM +0100

> I thought the patch did that, see this hunk (straight from
> http://lkml.org/lkml/2010/2/4/119 ):

Bollocks, I seem to have lost that hunk while applying the patch by
foot, sorry for the noise.

By the way, I can't seem to find your patch in Andrew's tree, is it
still going through his tree?

-- 
Regards/Gruss,
Boris.

-
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-03-09 17:32                                                                                                           ` Borislav Petkov
@ 2010-03-09 17:37                                                                                                             ` Peter Zijlstra
  0 siblings, 0 replies; 126+ messages in thread
From: Peter Zijlstra @ 2010-03-09 17:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Michal Marek, linux-kbuild, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

On Tue, 2010-03-09 at 18:32 +0100, Borislav Petkov wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, Mar 09, 2010 at 05:32:49PM +0100
> 
> > I thought the patch did that, see this hunk (straight from
> > http://lkml.org/lkml/2010/2/4/119 ):
> 
> Bollocks, I seem to have lost that hunk while applying the patch by
> foot, sorry for the noise.
> 
> By the way, I can't seem to find your patch in Andrew's tree, is it
> still going through his tree?

I hope so, Andrew, need me to resend or do you still have a copy?




^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
@ 2010-03-18 11:17                                                                                                   ` Borislav Petkov
  2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
  2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
  3 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: "H. Peter Anvin" <hpa@zytor.com>
Date: Sat, Feb 27, 2010 at 12:00:26PM -0800

> On 02/27/2010 12:28 AM, Borislav Petkov wrote:
> > 
> >> I guess, on the other hand, if it ends up not getting merged until .35
> >> it's not a huge deal either.
> > 
> > Yeah, let's give it another round of testing and queue it for .35 -
> > AFAIR Ingo runs also a wide testing effort so it spending another cycle
> > in -tip and being hammered on by us could give us a bit more certainty.
> > 
> 
> Yes, if we can get into -tip then we'll get more test coverage, so I'll
> queue it up for .35 as soon as the merge window closes.  Please remind
> me if I forget.

Ok, I've been pretty busy lately and this got pushed back on the todo
list. I finally got around to doing some build-testing with a bunch of
compilers, and the -fcall-saved* flags seem to get accepted. I haven't
stared at their asm output yet, though:


command:
make CC=gcc-<version> HOSTCC=gcc-<version> -j4

compile stats (64bit only):

not ok:
- gcc-3.3 (GCC) 3.3.5 (Debian 1:3.3.5-13):	OOM KILLER goes off, gcc-3.3 leak maybe

ok:
- gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-13sarge1)
- gcc-4.1 (GCC) 4.1.3 20080704 (prerelease) (Debian 4.1.2-27)
- gcc-4.3 (Debian 4.3.4-6) 4.3.4
- gcc (Debian 4.4.2-6) 4.4.2
- gcc (Debian 4.4.3-3) 4.4.3

- gcc-3.4.6 (GCC) 3.4.6 (Gentoo 3.4.6-r2 p1.6, ssp-3.4.6-1.0, pie-8.7.10)
- gcc-4.1.2 (GCC) 4.1.2 (Gentoo 4.1.2 p1.3)

I'm attaching the versions of the patches I'm using. The first one by
PeterZ touches a bunch of arches and Andrew hasn't picked it up yet so
the question of getting the second (popcnt) patch to see wider testing
in some tree is still unresolved. Suggestions, ideas?

Thanks.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
  2010-03-09 15:36                                                                                                   ` Borislav Petkov
  2010-03-18 11:17                                                                                                   ` Borislav Petkov
@ 2010-03-18 11:19                                                                                                   ` Borislav Petkov
  2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
  3 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon, 1 Feb 2010 15:03:07 +0100
Subject: [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation

Rename the existing runtime hweight() implementations to
__arch_hweight(), rename the compile-time versions to __const_hweight()
and then have hweight() pick between them.

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1265028224.24455.154.camel@laptop>
---
 arch/alpha/include/asm/bitops.h            |   18 ++++++-----
 arch/ia64/include/asm/bitops.h             |   11 ++++---
 arch/sparc/include/asm/bitops_64.h         |   11 ++++---
 include/asm-generic/bitops/arch_hweight.h  |   11 +++++++
 include/asm-generic/bitops/const_hweight.h |   42 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h       |    8 +----
 include/linux/bitops.h                     |   25 ----------------
 lib/hweight.c                              |   19 ++++++------
 8 files changed, 87 insertions(+), 58 deletions(-)
 create mode 100644 include/asm-generic/bitops/arch_hweight.h
 create mode 100644 include/asm-generic/bitops/const_hweight.h

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 15f3ae2..296da1d 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -405,29 +405,31 @@ static inline int fls(int x)
 
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
 /* Whee.  EV67 can calculate it directly.  */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
-	return hweight64(w);
+	return __arch_hweight64(w);
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
-	return hweight64(w & 0xffff);
+	return __arch_hweight64(w & 0xffff);
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
-	return hweight64(w & 0xff);
+	return __arch_hweight64(w & 0xff);
 }
 #else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 #endif
 
+#include <asm-generic/bitops/const_hweight.h>
+
 #endif /* __KERNEL__ */
 
 #include <asm-generic/bitops/find.h>
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 6ebc229..9da3df6 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -437,17 +437,18 @@ __fls (unsigned long x)
  * hweightN: returns the hamming weight (i.e. the number
  * of bits set) of a N-bit word
  */
-static __inline__ unsigned long
-hweight64 (unsigned long x)
+static __inline__ unsigned long __arch_hweight64(unsigned long x)
 {
 	unsigned long result;
 	result = ia64_popcnt(x);
 	return result;
 }
 
-#define hweight32(x)	(unsigned int) hweight64((x) & 0xfffffffful)
-#define hweight16(x)	(unsigned int) hweight64((x) & 0xfffful)
-#define hweight8(x)	(unsigned int) hweight64((x) & 0xfful)
+#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
+#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
+#define __arch_hweight8(x)  ((unsigned int) __arch_hweight64((x) & 0xfful))
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index e72ac9c..766121a 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr, volatile unsigned long *addr);
 
 #ifdef ULTRA_HAS_POPULATION_COUNT
 
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
 {
 	unsigned int res;
 
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(unsigned long w)
 	return res;
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res;
 
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res;
 
@@ -78,9 +78,10 @@ static inline unsigned int hweight8(unsigned int w)
 
 #else
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 
 #endif
+#include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
new file mode 100644
index 0000000..3a7be84
--- /dev/null
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+
+#include <asm/types.h>
+
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
new file mode 100644
index 0000000..fa2a50b
--- /dev/null
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+
+/*
+ * Compile time versions of __arch_hweightN()
+ */
+#define __const_hweight8(w)		\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+/*
+ * Generic interface.
+ */
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+
+/*
+ * Interface for known constant arguments
+ */
+#define HWEIGHT8(w)  (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight8(w))
+#define HWEIGHT16(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight16(w))
+#define HWEIGHT32(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight32(w))
+#define HWEIGHT64(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight64(w))
+
+/*
+ * Type invariant interface to the compile time constant hweight functions.
+ */
+#define HWEIGHT(w)   HWEIGHT64((u64)(w))
+
+#endif /* _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..a94d651 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -1,11 +1,7 @@
 #ifndef _ASM_GENERIC_BITOPS_HWEIGHT_H_
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
-#include <asm/types.h>
-
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+#include <asm-generic/bitops/arch_hweight.h>
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index b793898..c55d5bc 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -47,31 +47,6 @@ static inline unsigned long hweight_long(unsigned long w)
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate
diff --git a/lib/hweight.c b/lib/hweight.c
index 63ee4eb..a6927e7 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+	return __arch_hweight32((unsigned int)(w >> 32)) +
+	       __arch_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);
-- 
1.7.0.2

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
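
A small stand-alone demonstration (illustration only, not from the thread)
that the __const_hweight*() construction above really folds to an integer
constant expression -- the file-scope array below would not compile
otherwise:

#include <stdio.h>

#define __const_hweight8(w)             \
      ( (!!((w) & (1ULL << 0))) +       \
        (!!((w) & (1ULL << 1))) +       \
        (!!((w) & (1ULL << 2))) +       \
        (!!((w) & (1ULL << 3))) +       \
        (!!((w) & (1ULL << 4))) +       \
        (!!((w) & (1ULL << 5))) +       \
        (!!((w) & (1ULL << 6))) +       \
        (!!((w) & (1ULL << 7))) )

#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))

/* 0xf0f0f0f0 has 16 bits set, so this is a char[16], sized at compile time */
static char buf[__const_hweight32(0xf0f0f0f0u)];

int main(void)
{
        printf("__const_hweight32(0xf0f0f0f0) = %zu\n", sizeof(buf));
        return 0;
}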

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 2/2] x86: Add optimized popcnt variants
  2010-02-27 20:00                                                                                                 ` H. Peter Anvin
                                                                                                                     ` (2 preceding siblings ...)
  2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
@ 2010-03-18 11:20                                                                                                   ` Borislav Petkov
  2010-04-06 23:04                                                                                                     ` [tip:core/hweight] " tip-bot for Borislav Petkov
  3 siblings, 1 reply; 126+ messages in thread
From: Borislav Petkov @ 2010-03-18 11:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Michal Marek, linux-kbuild, Peter Zijlstra, Andrew Morton,
	Wu Fengguang, LKML, Jamie Lokier, Roland Dreier, Al Viro,
	linux-fsdevel, Ingo Molnar, Brian Gerst

From: Borislav Petkov <borislav.petkov@amd.com>
Date: Fri, 5 Mar 2010 17:34:46 +0100
Subject: [PATCH 2/2] x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present in CPUs which advertise it under CPUID, Function
0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
default lib/hweight.c sw versions.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on a F10h machine.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)
 create mode 100644 arch/x86/include/asm/arch_hweight.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..89d8c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index b09ec55..67dae51 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -96,6 +93,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 2e152ae..abe63a8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_BTREE) += btree.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
diff --git a/lib/hweight.c b/lib/hweight.c
index a6927e7..3c79d50 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,30 +24,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -64,4 +64,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="
-- 
1.7.0.2

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:core/hweight] bitops: Optimize hweight() by making use of compile-time evaluation
  2010-02-01 12:43             ` Peter Zijlstra
  2010-02-01 19:06               ` H. Peter Anvin
@ 2010-04-06 23:03               ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 126+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-04-06 23:03 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx

Commit-ID:  1527bc8b928dd1399c3d3467dd47d9ede210978a
Gitweb:     http://git.kernel.org/tip/1527bc8b928dd1399c3d3467dd47d9ede210978a
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 1 Feb 2010 15:03:07 +0100
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 6 Apr 2010 15:52:11 -0700

bitops: Optimize hweight() by making use of compile-time evaluation

Rename the existing runtime hweight() implementations to
__arch_hweight(), rename the compile-time versions to __const_hweight(),
and then have hweight() pick between them.
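
As a rough, standalone sketch of the resulting scheme (the macro and
function names below are made up for illustration; the real interfaces
are the __const_hweightN(), __arch_hweightN() and hweightN() ones in the
diff further down):

#include <stdio.h>

static unsigned int runtime_hweight32(unsigned int w)
{
	unsigned int res = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		res++;
	}
	return res;
}

#define CONST_HWEIGHT8(w)					\
	((unsigned int)(!!((w) & 0x01)) + (!!((w) & 0x02)) +	\
	 (!!((w) & 0x04)) + (!!((w) & 0x08)) +			\
	 (!!((w) & 0x10)) + (!!((w) & 0x20)) +			\
	 (!!((w) & 0x40)) + (!!((w) & 0x80)))

#define CONST_HWEIGHT32(w)					\
	(CONST_HWEIGHT8(w) + CONST_HWEIGHT8((w) >> 8) +		\
	 CONST_HWEIGHT8((w) >> 16) + CONST_HWEIGHT8((w) >> 24))

#define HWEIGHT32(w)						\
	(__builtin_constant_p(w) ? CONST_HWEIGHT32(w) : runtime_hweight32(w))

int main(void)
{
	unsigned int v = 0x12345678;

	printf("%u\n", HWEIGHT32(0x000000f0));	/* folds to 4 at compile time */
	printf("%u\n", HWEIGHT32(v));		/* calls runtime_hweight32() */
	return 0;
}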

Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100318111929.GB11152@aftab>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1265028224.24455.154.camel@laptop>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/alpha/include/asm/bitops.h            |   18 ++++++-----
 arch/ia64/include/asm/bitops.h             |   11 ++++---
 arch/sparc/include/asm/bitops_64.h         |   11 ++++---
 include/asm-generic/bitops/arch_hweight.h  |   11 +++++++
 include/asm-generic/bitops/const_hweight.h |   42 ++++++++++++++++++++++++++++
 include/asm-generic/bitops/hweight.h       |    8 +----
 include/linux/bitops.h                     |   25 ----------------
 lib/hweight.c                              |   19 ++++++------
 8 files changed, 87 insertions(+), 58 deletions(-)

diff --git a/arch/alpha/include/asm/bitops.h b/arch/alpha/include/asm/bitops.h
index 15f3ae2..296da1d 100644
--- a/arch/alpha/include/asm/bitops.h
+++ b/arch/alpha/include/asm/bitops.h
@@ -405,29 +405,31 @@ static inline int fls(int x)
 
 #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
 /* Whee.  EV67 can calculate it directly.  */
-static inline unsigned long hweight64(unsigned long w)
+static inline unsigned long __arch_hweight64(unsigned long w)
 {
 	return __kernel_ctpop(w);
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
-	return hweight64(w);
+	return __arch_hweight64(w);
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
-	return hweight64(w & 0xffff);
+	return __arch_hweight64(w & 0xffff);
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
-	return hweight64(w & 0xff);
+	return __arch_hweight64(w & 0xff);
 }
 #else
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 #endif
 
+#include <asm-generic/bitops/const_hweight.h>
+
 #endif /* __KERNEL__ */
 
 #include <asm-generic/bitops/find.h>
diff --git a/arch/ia64/include/asm/bitops.h b/arch/ia64/include/asm/bitops.h
index 6ebc229..9da3df6 100644
--- a/arch/ia64/include/asm/bitops.h
+++ b/arch/ia64/include/asm/bitops.h
@@ -437,17 +437,18 @@ __fls (unsigned long x)
  * hweightN: returns the hamming weight (i.e. the number
  * of bits set) of a N-bit word
  */
-static __inline__ unsigned long
-hweight64 (unsigned long x)
+static __inline__ unsigned long __arch_hweight64(unsigned long x)
 {
 	unsigned long result;
 	result = ia64_popcnt(x);
 	return result;
 }
 
-#define hweight32(x)	(unsigned int) hweight64((x) & 0xfffffffful)
-#define hweight16(x)	(unsigned int) hweight64((x) & 0xfffful)
-#define hweight8(x)	(unsigned int) hweight64((x) & 0xfful)
+#define __arch_hweight32(x) ((unsigned int) __arch_hweight64((x) & 0xfffffffful))
+#define __arch_hweight16(x) ((unsigned int) __arch_hweight64((x) & 0xfffful))
+#define __arch_hweight8(x)  ((unsigned int) __arch_hweight64((x) & 0xfful))
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/sparc/include/asm/bitops_64.h b/arch/sparc/include/asm/bitops_64.h
index e72ac9c..766121a 100644
--- a/arch/sparc/include/asm/bitops_64.h
+++ b/arch/sparc/include/asm/bitops_64.h
@@ -44,7 +44,7 @@ extern void change_bit(unsigned long nr, volatile unsigned long *addr);
 
 #ifdef ULTRA_HAS_POPULATION_COUNT
 
-static inline unsigned int hweight64(unsigned long w)
+static inline unsigned int __arch_hweight64(unsigned long w)
 {
 	unsigned int res;
 
@@ -52,7 +52,7 @@ static inline unsigned int hweight64(unsigned long w)
 	return res;
 }
 
-static inline unsigned int hweight32(unsigned int w)
+static inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
@@ -60,7 +60,7 @@ static inline unsigned int hweight32(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight16(unsigned int w)
+static inline unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res;
 
@@ -68,7 +68,7 @@ static inline unsigned int hweight16(unsigned int w)
 	return res;
 }
 
-static inline unsigned int hweight8(unsigned int w)
+static inline unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res;
 
@@ -78,9 +78,10 @@ static inline unsigned int hweight8(unsigned int w)
 
 #else
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm-generic/bitops/arch_hweight.h>
 
 #endif
+#include <asm-generic/bitops/const_hweight.h>
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
new file mode 100644
index 0000000..3a7be84
--- /dev/null
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_ARCH_HWEIGHT_H_
+
+#include <asm/types.h>
+
+extern unsigned int __arch_hweight32(unsigned int w);
+extern unsigned int __arch_hweight16(unsigned int w);
+extern unsigned int __arch_hweight8(unsigned int w);
+extern unsigned long __arch_hweight64(__u64 w);
+
+#endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/const_hweight.h b/include/asm-generic/bitops/const_hweight.h
new file mode 100644
index 0000000..fa2a50b
--- /dev/null
+++ b/include/asm-generic/bitops/const_hweight.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+#define _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_
+
+/*
+ * Compile time versions of __arch_hweightN()
+ */
+#define __const_hweight8(w)		\
+      (	(!!((w) & (1ULL << 0))) +	\
+	(!!((w) & (1ULL << 1))) +	\
+	(!!((w) & (1ULL << 2))) +	\
+	(!!((w) & (1ULL << 3))) +	\
+	(!!((w) & (1ULL << 4))) +	\
+	(!!((w) & (1ULL << 5))) +	\
+	(!!((w) & (1ULL << 6))) +	\
+	(!!((w) & (1ULL << 7)))	)
+
+#define __const_hweight16(w) (__const_hweight8(w)  + __const_hweight8((w)  >> 8 ))
+#define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
+#define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
+
+/*
+ * Generic interface.
+ */
+#define hweight8(w)  (__builtin_constant_p(w) ? __const_hweight8(w)  : __arch_hweight8(w))
+#define hweight16(w) (__builtin_constant_p(w) ? __const_hweight16(w) : __arch_hweight16(w))
+#define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
+#define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
+
+/*
+ * Interface for known constant arguments
+ */
+#define HWEIGHT8(w)  (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight8(w))
+#define HWEIGHT16(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight16(w))
+#define HWEIGHT32(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight32(w))
+#define HWEIGHT64(w) (BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) + __const_hweight64(w))
+
+/*
+ * Type invariant interface to the compile time constant hweight functions.
+ */
+#define HWEIGHT(w)   HWEIGHT64((u64)w)
+
+#endif /* _ASM_GENERIC_BITOPS_CONST_HWEIGHT_H_ */
diff --git a/include/asm-generic/bitops/hweight.h b/include/asm-generic/bitops/hweight.h
index fbbc383..a94d651 100644
--- a/include/asm-generic/bitops/hweight.h
+++ b/include/asm-generic/bitops/hweight.h
@@ -1,11 +1,7 @@
 #ifndef _ASM_GENERIC_BITOPS_HWEIGHT_H_
 #define _ASM_GENERIC_BITOPS_HWEIGHT_H_
 
-#include <asm/types.h>
-
-extern unsigned int hweight32(unsigned int w);
-extern unsigned int hweight16(unsigned int w);
-extern unsigned int hweight8(unsigned int w);
-extern unsigned long hweight64(__u64 w);
+#include <asm-generic/bitops/arch_hweight.h>
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index b793898..c55d5bc 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -47,31 +47,6 @@ static inline unsigned long hweight_long(unsigned long w)
 	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
 }
 
-/*
- * Clearly slow versions of the hweightN() functions, their benefit is
- * of course compile time evaluation of constant arguments.
- */
-#define HWEIGHT8(w)					\
-      (	BUILD_BUG_ON_ZERO(!__builtin_constant_p(w)) +	\
-	(!!((w) & (1ULL << 0))) +			\
-	(!!((w) & (1ULL << 1))) +			\
-	(!!((w) & (1ULL << 2))) +			\
-	(!!((w) & (1ULL << 3))) +			\
-	(!!((w) & (1ULL << 4))) +			\
-	(!!((w) & (1ULL << 5))) +			\
-	(!!((w) & (1ULL << 6))) +			\
-	(!!((w) & (1ULL << 7)))	)
-
-#define HWEIGHT16(w) (HWEIGHT8(w)  + HWEIGHT8((w) >> 8))
-#define HWEIGHT32(w) (HWEIGHT16(w) + HWEIGHT16((w) >> 16))
-#define HWEIGHT64(w) (HWEIGHT32(w) + HWEIGHT32((w) >> 32))
-
-/*
- * Type invariant version that simply casts things to the
- * largest type.
- */
-#define HWEIGHT(w)   HWEIGHT64((u64)(w))
-
 /**
  * rol32 - rotate a 32-bit value left
  * @word: value to rotate
diff --git a/lib/hweight.c b/lib/hweight.c
index 63ee4eb..a6927e7 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int hweight32(unsigned int w)
+unsigned int __arch_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,29 +24,30 @@ unsigned int hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(hweight32);
+EXPORT_SYMBOL(__arch_hweight32);
 
-unsigned int hweight16(unsigned int w)
+unsigned int __arch_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(hweight16);
+EXPORT_SYMBOL(__arch_hweight16);
 
-unsigned int hweight8(unsigned int w)
+unsigned int __arch_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(hweight8);
+EXPORT_SYMBOL(__arch_hweight8);
 
-unsigned long hweight64(__u64 w)
+unsigned long __arch_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return hweight32((unsigned int)(w >> 32)) + hweight32((unsigned int)w);
+	return __arch_hweight32((unsigned int)(w >> 32)) +
+	       __arch_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -63,4 +64,4 @@ unsigned long hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(hweight64);
+EXPORT_SYMBOL(__arch_hweight64);

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [tip:core/hweight] x86: Add optimized popcnt variants
  2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
@ 2010-04-06 23:04                                                                                                     ` tip-bot for Borislav Petkov
  2010-04-07  7:02                                                                                                       ` Borislav Petkov
  0 siblings, 1 reply; 126+ messages in thread
From: tip-bot for Borislav Petkov @ 2010-04-06 23:04 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tglx, borislav.petkov

Commit-ID:  d61931d89be506372d01a90d1755f6d0a9fafe2d
Gitweb:     http://git.kernel.org/tip/d61931d89be506372d01a90d1755f6d0a9fafe2d
Author:     Borislav Petkov <borislav.petkov@amd.com>
AuthorDate: Fri, 5 Mar 2010 17:34:46 +0100
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Tue, 6 Apr 2010 15:52:11 -0700

x86: Add optimized popcnt variants

Add support for the hardware version of the Hamming weight function,
popcnt, present on CPUs which advertise it via CPUID, function
0x0000_0001, ECX[23]. On CPUs which don't support it, we fall back to
the default software versions in lib/hweight.c.

A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
a 3x speedup on an F10h machine.
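
A rough userspace approximation of such a synthetic benchmark could look
like the sketch below (illustrative only; numbers will not match the
in-kernel measurement, and older glibc may need -lrt for clock_gettime).
Build it once with -mpopcnt and once without to compare the hardware and
software popcount paths of __builtin_popcountll():

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	volatile uint64_t sink = 0;
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 100000000L; i++)
		sink += __builtin_popcountll(0x123456789abcdef0ull + i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("sum %llu in %.3f s\n", (unsigned long long)sink,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}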

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100318112015.GC11152@aftab>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/Kconfig                          |    5 ++
 arch/x86/include/asm/alternative.h        |    9 +++-
 arch/x86/include/asm/arch_hweight.h       |   59 +++++++++++++++++++++++++++++
 arch/x86/include/asm/bitops.h             |    4 +-
 include/asm-generic/bitops/arch_hweight.h |   22 +++++++++--
 lib/Makefile                              |    3 +
 lib/hweight.c                             |   20 +++++-----
 scripts/Makefile.lib                      |    4 ++
 8 files changed, 108 insertions(+), 18 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0eacb1f..89d8c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,11 @@ config X86_32_LAZY_GS
 	def_bool y
 	depends on X86_32 && !CC_STACKPROTECTOR
 
+config ARCH_HWEIGHT_CFLAGS
+	string
+	default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
+	default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
+
 config KTIME_SCALAR
 	def_bool X86_32
 source "init/Kconfig"
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index b09ec55..67dae51 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -39,9 +39,6 @@
 #define LOCK_PREFIX ""
 #endif
 
-/* This must be included *after* the definition of LOCK_PREFIX */
-#include <asm/cpufeature.h>
-
 struct alt_instr {
 	u8 *instr;		/* original instruction */
 	u8 *replacement;
@@ -96,6 +93,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
       ".previous"
 
 /*
+ * This must be included *after* the definition of ALTERNATIVE due to
+ * <asm/arch_hweight.h>
+ */
+#include <asm/cpufeature.h>
+
+/*
  * Alternative instructions for different CPU types or capabilities.
  *
  * This allows to use optimized instructions even on generic binary
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
new file mode 100644
index 0000000..d1fc3c2
--- /dev/null
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_HWEIGHT_H
+#define _ASM_X86_HWEIGHT_H
+
+#ifdef CONFIG_64BIT
+/* popcnt %rdi, %rax */
+#define POPCNT ".byte 0xf3,0x48,0x0f,0xb8,0xc7"
+#define REG_IN "D"
+#define REG_OUT "a"
+#else
+/* popcnt %eax, %eax */
+#define POPCNT ".byte 0xf3,0x0f,0xb8,0xc0"
+#define REG_IN "a"
+#define REG_OUT "a"
+#endif
+
+/*
+ * __sw_hweightXX are called from within the alternatives below
+ * and callee-clobbered registers need to be taken care of. See
+ * ARCH_HWEIGHT_CFLAGS in <arch/x86/Kconfig> for the respective
+ * compiler switches.
+ */
+static inline unsigned int __arch_hweight32(unsigned int w)
+{
+	unsigned int res = 0;
+
+	asm (ALTERNATIVE("call __sw_hweight32", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+
+	return res;
+}
+
+static inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __arch_hweight32(w & 0xffff);
+}
+
+static inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __arch_hweight32(w & 0xff);
+}
+
+static inline unsigned long __arch_hweight64(__u64 w)
+{
+	unsigned long res = 0;
+
+#ifdef CONFIG_X86_32
+	return  __arch_hweight32((u32)w) +
+		__arch_hweight32((u32)(w >> 32));
+#else
+	asm (ALTERNATIVE("call __sw_hweight64", POPCNT, X86_FEATURE_POPCNT)
+		     : "="REG_OUT (res)
+		     : REG_IN (w));
+#endif /* CONFIG_X86_32 */
+
+	return res;
+}
+
+#endif
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 02b47a6..545776e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -444,7 +444,9 @@ static inline int fls(int x)
 
 #define ARCH_HAS_FAST_MULTIPLIER 1
 
-#include <asm-generic/bitops/hweight.h>
+#include <asm/arch_hweight.h>
+
+#include <asm-generic/bitops/const_hweight.h>
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-generic/bitops/arch_hweight.h b/include/asm-generic/bitops/arch_hweight.h
index 3a7be84..9a81c1e 100644
--- a/include/asm-generic/bitops/arch_hweight.h
+++ b/include/asm-generic/bitops/arch_hweight.h
@@ -3,9 +3,23 @@
 
 #include <asm/types.h>
 
-extern unsigned int __arch_hweight32(unsigned int w);
-extern unsigned int __arch_hweight16(unsigned int w);
-extern unsigned int __arch_hweight8(unsigned int w);
-extern unsigned long __arch_hweight64(__u64 w);
+inline unsigned int __arch_hweight32(unsigned int w)
+{
+	return __sw_hweight32(w);
+}
 
+inline unsigned int __arch_hweight16(unsigned int w)
+{
+	return __sw_hweight16(w);
+}
+
+inline unsigned int __arch_hweight8(unsigned int w)
+{
+	return __sw_hweight8(w);
+}
+
+inline unsigned long __arch_hweight64(__u64 w)
+{
+	return __sw_hweight64(w);
+}
 #endif /* _ASM_GENERIC_BITOPS_HWEIGHT_H_ */
diff --git a/lib/Makefile b/lib/Makefile
index 2e152ae..abe63a8 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -39,7 +39,10 @@ lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
 lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
 obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
+
+CFLAGS_hweight.o = $(subst $(quote),,$(CONFIG_ARCH_HWEIGHT_CFLAGS))
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
+
 obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
 obj-$(CONFIG_BTREE) += btree.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
diff --git a/lib/hweight.c b/lib/hweight.c
index a6927e7..3c79d50 100644
--- a/lib/hweight.c
+++ b/lib/hweight.c
@@ -9,7 +9,7 @@
  * The Hamming Weight of a number is the total number of bits set in it.
  */
 
-unsigned int __arch_hweight32(unsigned int w)
+unsigned int __sw_hweight32(unsigned int w)
 {
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x55555555;
@@ -24,30 +24,30 @@ unsigned int __arch_hweight32(unsigned int w)
 	return (res + (res >> 16)) & 0x000000FF;
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight32);
+EXPORT_SYMBOL(__sw_hweight32);
 
-unsigned int __arch_hweight16(unsigned int w)
+unsigned int __sw_hweight16(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x5555);
 	res = (res & 0x3333) + ((res >> 2) & 0x3333);
 	res = (res + (res >> 4)) & 0x0F0F;
 	return (res + (res >> 8)) & 0x00FF;
 }
-EXPORT_SYMBOL(__arch_hweight16);
+EXPORT_SYMBOL(__sw_hweight16);
 
-unsigned int __arch_hweight8(unsigned int w)
+unsigned int __sw_hweight8(unsigned int w)
 {
 	unsigned int res = w - ((w >> 1) & 0x55);
 	res = (res & 0x33) + ((res >> 2) & 0x33);
 	return (res + (res >> 4)) & 0x0F;
 }
-EXPORT_SYMBOL(__arch_hweight8);
+EXPORT_SYMBOL(__sw_hweight8);
 
-unsigned long __arch_hweight64(__u64 w)
+unsigned long __sw_hweight64(__u64 w)
 {
 #if BITS_PER_LONG == 32
-	return __arch_hweight32((unsigned int)(w >> 32)) +
-	       __arch_hweight32((unsigned int)w);
+	return __sw_hweight32((unsigned int)(w >> 32)) +
+	       __sw_hweight32((unsigned int)w);
 #elif BITS_PER_LONG == 64
 #ifdef ARCH_HAS_FAST_MULTIPLIER
 	w -= (w >> 1) & 0x5555555555555555ul;
@@ -64,4 +64,4 @@ unsigned long __arch_hweight64(__u64 w)
 #endif
 #endif
 }
-EXPORT_SYMBOL(__arch_hweight64);
+EXPORT_SYMBOL(__sw_hweight64);
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index f9bdf26..cbcd654 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -245,3 +245,7 @@ quiet_cmd_lzo = LZO    $@
 cmd_lzo = (cat $(filter-out FORCE,$^) | \
 	lzop -9 && $(call size_append, $(filter-out FORCE,$^))) > $@ || \
 	(rm -f $@ ; false)
+
+# misc stuff
+# ---------------------------------------------------------------------------
+quote:="

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [tip:core/hweight] x86: Add optimized popcnt variants
  2010-04-06 23:04                                                                                                     ` [tip:core/hweight] " tip-bot for Borislav Petkov
@ 2010-04-07  7:02                                                                                                       ` Borislav Petkov
  0 siblings, 0 replies; 126+ messages in thread
From: Borislav Petkov @ 2010-04-07  7:02 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, mingo, tglx

From: tip-bot for Borislav Petkov <borislav.petkov@amd.com>
Date: Tue, Apr 06, 2010 at 11:04:07PM +0000

> Commit-ID:  d61931d89be506372d01a90d1755f6d0a9fafe2d
> Gitweb:     http://git.kernel.org/tip/d61931d89be506372d01a90d1755f6d0a9fafe2d
> Author:     Borislav Petkov <borislav.petkov@amd.com>
> AuthorDate: Fri, 5 Mar 2010 17:34:46 +0100
> Committer:  H. Peter Anvin <hpa@zytor.com>
> CommitDate: Tue, 6 Apr 2010 15:52:11 -0700

Just so you know, Andrew picked those up already, along with Peter's
multi-arch-touching patch.

-- 
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center

^ permalink raw reply	[flat|nested] 126+ messages in thread

end of thread, other threads:[~2010-04-07  7:03 UTC | newest]

Thread overview: 126+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-30  9:45 [PATCH 0/5] [RESEND] FMODE_NONOTIFY and FMODE_NEG_OFFSET bits Wu Fengguang
2010-01-30  9:45 ` [PATCH 1/5] fanotify: fix FMODE_NONOTIFY bit number Wu Fengguang
2010-02-01 20:44   ` Andrew Morton
2010-01-30  9:45 ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Wu Fengguang
2010-02-01 20:48   ` Andrew Morton
2010-02-03 13:39     ` Wu Fengguang
2010-02-03 15:08       ` Andrew Morton
2010-02-03 15:15         ` Peter Zijlstra
2010-02-03 15:42           ` Andrew Morton
2010-02-03 15:47             ` Peter Zijlstra
2010-02-03 17:11               ` H. Peter Anvin
2010-02-03 18:14             ` Borislav Petkov
2010-02-03 18:47               ` Peter Zijlstra
2010-02-03 19:49                 ` H. Peter Anvin
2010-02-04 15:10                   ` Borislav Petkov
2010-02-04 15:13                     ` Peter Zijlstra
2010-02-04 15:54                       ` Borislav Petkov
2010-02-04 16:04                         ` Peter Zijlstra
2010-02-05 12:11                           ` Borislav Petkov
2010-02-05 12:14                             ` Peter Zijlstra
2010-02-05 21:54                             ` H. Peter Anvin
2010-02-06  9:36                               ` Borislav Petkov
2010-02-07  1:55                                 ` H. Peter Anvin
2010-02-08  9:28                                   ` Borislav Petkov
2010-02-08  9:35                                     ` H. Peter Anvin
2010-02-08  9:35                                       ` H. Peter Anvin
2010-02-08  9:59                                       ` Borislav Petkov
2010-02-11 17:24                                         ` Borislav Petkov
2010-02-11 17:33                                           ` H. Peter Anvin
2010-02-12 17:06                                             ` Borislav Petkov
2010-02-12 17:28                                               ` H. Peter Anvin
2010-02-12 17:47                                                 ` Borislav Petkov
2010-02-12 19:05                                                   ` H. Peter Anvin
2010-02-17 13:57                                                     ` Michal Marek
2010-02-17 17:20                                                       ` Borislav Petkov
2010-02-17 17:31                                                         ` Michal Marek
2010-02-17 17:34                                                           ` Borislav Petkov
2010-02-17 17:39                                                           ` Michal Marek
2010-02-18  6:19                                                             ` Borislav Petkov
2010-02-19 14:22                                                               ` [PATCH] x86: Add optimized popcnt variants Borislav Petkov
2010-02-19 16:06                                                                 ` H. Peter Anvin
2010-02-19 16:45                                                                   ` Borislav Petkov
2010-02-19 16:53                                                                     ` H. Peter Anvin
2010-02-22 14:17                                                                       ` Borislav Petkov
2010-02-22 17:21                                                                         ` H. Peter Anvin
2010-02-22 18:49                                                                           ` Borislav Petkov
2010-02-22 19:55                                                                             ` H. Peter Anvin
2010-02-23  6:37                                                                               ` Borislav Petkov
2010-02-23 15:58                                                                               ` Borislav Petkov
2010-02-23 17:34                                                                                 ` H. Peter Anvin
2010-02-23 17:54                                                                                   ` Borislav Petkov
2010-02-23 17:54                                                                                     ` Borislav Petkov
2010-02-23 18:17                                                                                     ` H. Peter Anvin
2010-02-23 19:06                                                                                       ` Borislav Petkov
2010-02-26  5:27                                                                                         ` H. Peter Anvin
2010-02-26  7:47                                                                                           ` Borislav Petkov
2010-02-26 17:48                                                                                             ` H. Peter Anvin
2010-02-26 17:48                                                                                               ` H. Peter Anvin
2010-02-27  8:28                                                                                               ` Borislav Petkov
2010-02-27 20:00                                                                                                 ` H. Peter Anvin
2010-03-09 15:36                                                                                                   ` Borislav Petkov
2010-03-09 15:50                                                                                                     ` Peter Zijlstra
2010-03-09 16:23                                                                                                       ` Borislav Petkov
2010-03-09 16:32                                                                                                         ` Peter Zijlstra
2010-03-09 17:32                                                                                                           ` Borislav Petkov
2010-03-09 17:37                                                                                                             ` Peter Zijlstra
2010-03-18 11:17                                                                                                   ` Borislav Petkov
2010-03-18 11:19                                                                                                   ` [PATCH 1/2] bitops: Optimize hweight() by making use of compile-time evaluation Borislav Petkov
2010-03-18 11:20                                                                                                   ` [PATCH 2/2] x86: Add optimized popcnt variants Borislav Petkov
2010-04-06 23:04                                                                                                     ` [tip:core/hweight] " tip-bot for Borislav Petkov
2010-04-07  7:02                                                                                                       ` Borislav Petkov
2010-02-18 10:51                                                       ` [PATCH 2/5] bitops: compile time optimization for hweight_long(CONSTANT) Peter Zijlstra
2010-02-18 11:51                                                         ` Borislav Petkov
2010-02-14 10:12                                           ` Peter Zijlstra
2010-02-14 11:24                                             ` Borislav Petkov
2010-02-14 12:23                                               ` Peter Zijlstra
2010-02-14 14:19                                                 ` Borislav Petkov
2010-02-14 18:36                                               ` H. Peter Anvin
2010-02-14 18:36                                                 ` H. Peter Anvin
2010-02-14 20:28                                                 ` Borislav Petkov
2010-02-14 22:13                                                   ` H. Peter Anvin
2010-02-14 22:13                                                   ` H. Peter Anvin
2010-02-04 15:16                     ` H. Peter Anvin
2010-02-04 15:39                     ` Brian Gerst
2010-02-04 15:39                       ` Brian Gerst
2010-02-03 17:10       ` H. Peter Anvin
2010-01-30  9:45 ` [PATCH 3/5] vfs: O_* bit numbers uniqueness check Wu Fengguang
2010-01-30  9:45 ` [PATCH 4/5] vfs: introduce FMODE_NEG_OFFSET for allowing negative f_pos Wu Fengguang
2010-01-30  9:45 ` [PATCH 5/5] devmem: dont allow seek to last page Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2010-01-22 15:50 [PATCH 00/10] perf/x86 queue Peter Zijlstra
2010-01-22 15:50 ` [PATCH 01/10] perf_events: improve x86 event scheduling (v5) Peter Zijlstra
2010-01-22 15:50 ` [PATCH 02/10] perf_events: Add fast-path to the rescheduling code Peter Zijlstra
2010-01-22 15:50 ` [PATCH 03/10] perf_event: x86: Allocate the fake_cpuc Peter Zijlstra
2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 04/10] perf_event: x86: Fixup weight tying issue Peter Zijlstra
2010-01-29  9:27   ` [tip:perf/core] perf_event: x86: Fixup constraints typing issue tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 05/10] perf_event: x86: Clean up some of the u64/long bitmask casting Peter Zijlstra
2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 06/10] perf_event: x86: Reduce some overly long lines with some MACROs Peter Zijlstra
2010-01-29  9:27   ` [tip:perf/core] " tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} Peter Zijlstra
2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
2010-01-29 10:01     ` Andrew Morton
2010-01-29 10:04       ` Ingo Molnar
2010-01-29 10:13         ` Andrew Morton
2010-01-29 11:03       ` Peter Zijlstra
2010-01-29 16:24         ` Linus Torvalds
2010-01-29 22:50         ` H. Peter Anvin
2010-01-30 16:28           ` Peter Zijlstra
2010-02-01 12:43             ` Peter Zijlstra
2010-02-01 19:06               ` H. Peter Anvin
2010-04-06 23:03               ` [tip:core/hweight] bitops: Optimize hweight() by making use of compile-time evaluation tip-bot for Peter Zijlstra
2010-01-29 10:32   ` [PATCH 07/10] bitops: Provide compile time HWEIGHT{8,16,32,64} John Kacur
2010-01-29 11:05     ` Peter Zijlstra
2010-01-29 11:13       ` John Kacur
2010-01-30  0:09       ` H. Peter Anvin
2010-01-30  7:34       ` Ingo Molnar
2010-01-22 15:50 ` [PATCH 08/10] perf_event: Optimize the constraint searching bits Peter Zijlstra
2010-01-22 16:08   ` Stephane Eranian
2010-01-22 16:22     ` Peter Zijlstra
2010-01-22 16:28       ` Stephane Eranian
2010-01-29  9:28   ` [tip:perf/core] perf_event: x86: " tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 09/10] perf_event: x86: Optimize constraint weight computation Peter Zijlstra
2010-01-29  9:28   ` [tip:perf/core] " tip-bot for Peter Zijlstra
2010-01-22 15:50 ` [PATCH 10/10] perf_event: Optimize the fast path a little more Peter Zijlstra
2010-01-29  9:28   ` [tip:perf/core] perf_event: x86: " tip-bot for Peter Zijlstra
