[RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update
@ 2007-01-29 19:45 Maynard Johnson
  2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-29 19:45 UTC (permalink / raw)
  To: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

On December 14, 2006, I posted a patch that added support to the 
OProfile kernel driver for profiling Cell SPUs.  There have been some 
changes/fixes to this patch since the original posting (including 
forward porting from 2.6.18-based kernel to 2.6.20-rc1), so I am 
reposting the patch for review now.  This patch relies upon the 
following patches that have not been accepted yet:
   1. oprofile cleanup patch (submitted on Nov 27)
   2. Fix for PPU profiling (not submitted yet, since it depends on #1)
   3. SPU task notification patch (last submitted on Jan 26)

For those who may want to apply and build the oprofile SPU support 
patch, it would be necessary to first apply the above patches.  For 
convenience, I will post all three of the above patches, along with the 
oprofile SPU support patch.

Comments appreciated.

Thank you.

Maynard Johnson
IBM LTC Toolchain

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
@ 2007-01-29 19:46 ` Maynard Johnson
  2007-01-30  4:07   ` [Cbe-oss-dev] " Arnd Bergmann
  2007-01-30 10:39   ` Christoph Hellwig
  2007-01-29 19:47 ` [RFC, PATCH 2/4] " Maynard Johnson
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-29 19:46 UTC (permalink / raw)
  To: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

[-- Attachment #1: Type: text/plain, Size: 2 bytes --]




[-- Attachment #2: oprof-ppu-cleanup1.diff --]
[-- Type: text/x-patch, Size: 12489 bytes --]

This is a clean up patch that includes the following changes:

        -It removes some macro definitions that are only used once
         with the actual code.
        -Some comments were added to clarify the code based on feedback
         from the community.
        -The write_pm_cntrl() and set_count_mode() were passed a structure
         element from a global variable.  The argument was removed so the
         functions now just operate on the global directly.
        -The set_pm_event() function call in the cell_virtual_cntr() routine
         was moved to a for-loop before the for_each_cpu loop

Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>

Index: linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/op_model_cell.c	2007-01-18 16:43:14.428510224 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c	2007-01-18 16:56:47.300605984 -0600
@@ -41,8 +41,12 @@
 #define PPU_CYCLES_EVENT_NUM 1	/*  event number for CYCLES */
 #define CBE_COUNT_ALL_CYCLES 0x42800000	/* PPU cycle event specifier */
 
-#define NUM_THREADS 2
-#define VIRT_CNTR_SW_TIME_NS 100000000	// 0.5 seconds
+#define NUM_THREADS 2         /* number of physical threads in 
+			       * physical processor 
+			       */
+#define NUM_TRACE_BUS_WORDS 4
+#define NUM_INPUT_BUS_WORDS 2
+
 
 struct pmc_cntrl_data {
 	unsigned long vcntr;
@@ -94,14 +98,6 @@
 } pm_regs;
 
 
-#define GET_SUB_UNIT(x) ((x & 0x0000f000) >> 12)
-#define GET_BUS_WORD(x) ((x & 0x000000f0) >> 4)
-#define GET_BUS_TYPE(x) ((x & 0x00000300) >> 8)
-#define GET_POLARITY(x) ((x & 0x00000002) >> 1)
-#define GET_COUNT_CYCLES(x) (x & 0x00000001)
-#define GET_INPUT_CONTROL(x) ((x & 0x00000004) >> 2)
-
-
 static DEFINE_PER_CPU(unsigned long[NR_PHYS_CTRS], pmc_values);
 
 static struct pmc_cntrl_data pmc_cntrl[NUM_THREADS][NR_PHYS_CTRS];
@@ -129,8 +125,8 @@
 
 static u32 ctr_enabled;
 
-static unsigned char trace_bus[4];
-static unsigned char input_bus[2];
+static unsigned char trace_bus[NUM_TRACE_BUS_WORDS];
+static unsigned char input_bus[NUM_INPUT_BUS_WORDS];
 
 /*
  * Firmware interface functions
@@ -183,7 +179,8 @@
 	for (j = 0; j < count; j++) {
 		/* fw expects physical cpu # */
 		pm_signal_local[j].cpu = node;
-		pm_signal_local[j].signal_group = pm_signal[j].signal_group;
+		pm_signal_local[j].signal_group 
+			= pm_signal[j].signal_group;
 		pm_signal_local[j].bus_word = pm_signal[j].bus_word;
 		pm_signal_local[j].sub_unit = pm_signal[j].sub_unit;
 		pm_signal_local[j].bit = pm_signal[j].bit;
@@ -221,24 +218,32 @@
 		pm_regs.pm07_cntrl[ctr] = 0;
 	}
 
-	bus_word = GET_BUS_WORD(unit_mask);
-	bus_type = GET_BUS_TYPE(unit_mask);
-	count_cycles = GET_COUNT_CYCLES(unit_mask);
-	polarity = GET_POLARITY(unit_mask);
-	input_control = GET_INPUT_CONTROL(unit_mask);
+	bus_word = (unit_mask & 0x000000f0) >> 4;
+	bus_type = (unit_mask & 0x00000300) >> 8;
+	count_cycles = unit_mask & 0x00000001;
+	polarity = (unit_mask & 0x00000002) >> 1;
+	input_control = (unit_mask & 0x00000004) >> 2;
 	signal_bit = (event % 100);
 
 	p = &(pm_signal[ctr]);
 
 	p->signal_group = event / 100;
 	p->bus_word = bus_word;
-	p->sub_unit = unit_mask & 0x0000f000;
+	p->sub_unit = (unit_mask & 0x0000f000) >> 12;
 
 	pm_regs.pm07_cntrl[ctr] = 0;
-	pm_regs.pm07_cntrl[ctr] |= PM07_CTR_COUNT_CYCLES(count_cycles);
-	pm_regs.pm07_cntrl[ctr] |= PM07_CTR_POLARITY(polarity);
-	pm_regs.pm07_cntrl[ctr] |= PM07_CTR_INPUT_CONTROL(input_control);
-
+	pm_regs.pm07_cntrl[ctr] |= (count_cycles & 1) << 23;
+	pm_regs.pm07_cntrl[ctr] |= (polarity & 1) << 24;
+	pm_regs.pm07_cntrl[ctr] |= (input_control & 1) << 25;
+
+	/* Some of the islands signal selection is based on 64 bit words.
+	 * The debug bus words are 32 bits, the input words to the performance
+	 * counters are defined as 32 bits.  Need to convert the 64 bit island
+	 * specification to the appropriate 32 input bit and bus word for the
+	 * performance counter event selection.  See the CELL Performance
+	 * monitoring signals manual and the Perf cntr hardware descriptions
+	 * for the details.
+	 */
 	if (input_control == 0) {
 		if (signal_bit > 31) {
 			signal_bit -= 32;
@@ -253,18 +258,18 @@
 		if ((bus_type == 1) && p->signal_group >= 50)
 			bus_type = 0;
 
-		pm_regs.pm07_cntrl[ctr] |= PM07_CTR_INPUT_MUX(signal_bit);
+		pm_regs.pm07_cntrl[ctr] |= (signal_bit & 0x3F) << 26;
 	} else {
 		pm_regs.pm07_cntrl[ctr] = 0;
 		p->bit = signal_bit;
 	}
 
-	for (i = 0; i < 4; i++) {
+	for (i = 0; i < NUM_TRACE_BUS_WORDS; i++) {
 		if (bus_word & (1 << i)) {
 			pm_regs.debug_bus_control |=
 			    (bus_type << (31 - (2 * i) + 1));
 
-			for (j = 0; j < 2; j++) {
+			for (j = 0; j < NUM_INPUT_BUS_WORDS; j++) {
 				if (input_bus[j] == 0xff) {
 					input_bus[j] = i;
 					pm_regs.group_control |=
@@ -278,52 +283,58 @@
 	;
 }
 
-static void write_pm_cntrl(int cpu, struct pm_cntrl *pm_cntrl)
+static void write_pm_cntrl(int cpu)
 {
-	/* Oprofile will use 32 bit counters, set bits 7:10 to 0 */
+	/* Oprofile will use 32 bit counters, set bits 7:10 to 0 
+	 * pmregs.pm_cntrl is a global
+	 */
+        
 	u32 val = 0;
-	if (pm_cntrl->enable == 1)
+	if (pm_regs.pm_cntrl.enable == 1)
 		val |= CBE_PM_ENABLE_PERF_MON;
 
-	if (pm_cntrl->stop_at_max == 1)
+	if (pm_regs.pm_cntrl.stop_at_max == 1)
 		val |= CBE_PM_STOP_AT_MAX;
 
-	if (pm_cntrl->trace_mode == 1)
-		val |= CBE_PM_TRACE_MODE_SET(pm_cntrl->trace_mode);
+	if (pm_regs.pm_cntrl.trace_mode == 1)
+		val |= CBE_PM_TRACE_MODE_SET(pm_regs.pm_cntrl.trace_mode);
 
-	if (pm_cntrl->freeze == 1)
+	if (pm_regs.pm_cntrl.freeze == 1)
 		val |= CBE_PM_FREEZE_ALL_CTRS;
 
 	/* Routine set_count_mode must be called previously to set
 	 * the count mode based on the user selection of user and kernel.
 	 */
-	val |= CBE_PM_COUNT_MODE_SET(pm_cntrl->count_mode);
+	val |= CBE_PM_COUNT_MODE_SET(pm_regs.pm_cntrl.count_mode);
 	cbe_write_pm(cpu, pm_control, val);
 }
 
 static inline void
-set_count_mode(u32 kernel, u32 user, struct pm_cntrl *pm_cntrl)
+set_count_mode(u32 kernel, u32 user)
 {
 	/* The user must specify user and kernel if they want them. If
-	 *  neither is specified, OProfile will count in hypervisor mode
+	 *  neither is specified, OProfile will count in hypervisor mode.
+	 *  pm_regs.pm_cntrl is a global
 	 */
 	if (kernel) {
 		if (user)
-			pm_cntrl->count_mode = CBE_COUNT_ALL_MODES;
+			pm_regs.pm_cntrl.count_mode = CBE_COUNT_ALL_MODES;
 		else
-			pm_cntrl->count_mode = CBE_COUNT_SUPERVISOR_MODE;
+			pm_regs.pm_cntrl.count_mode = 
+				CBE_COUNT_SUPERVISOR_MODE;
 	} else {
 		if (user)
-			pm_cntrl->count_mode = CBE_COUNT_PROBLEM_MODE;
+			pm_regs.pm_cntrl.count_mode = CBE_COUNT_PROBLEM_MODE;
 		else
-			pm_cntrl->count_mode = CBE_COUNT_HYPERVISOR_MODE;
+			pm_regs.pm_cntrl.count_mode = 
+				CBE_COUNT_HYPERVISOR_MODE;
 	}
 }
 
 static inline void enable_ctr(u32 cpu, u32 ctr, u32 * pm07_cntrl)
 {
 
-	pm07_cntrl[ctr] |= PM07_CTR_ENABLE(1);
+	pm07_cntrl[ctr] |= CBE_PM_CTR_ENABLE;
 	cbe_write_pm07_control(cpu, ctr, pm07_cntrl[ctr]);
 }
 
@@ -365,6 +376,14 @@
 	hdw_thread = 1 ^ hdw_thread;
 	next_hdw_thread = hdw_thread;
 
+	for (i = 0; i < num_counters; i++)
+	/* There are some per thread events.  Must do the
+	 * set event, for the thread that is being started
+	 */
+		set_pm_event(i,
+			pmc_cntrl[next_hdw_thread][i].evnts,
+			pmc_cntrl[next_hdw_thread][i].masks);
+
 	/* The following is done only once per each node, but
 	 * we need cpu #, not node #, to pass to the cbe_xxx functions.
 	 */
@@ -385,12 +404,13 @@
 			    == 0xFFFFFFFF)
 				/* If the cntr value is 0xffffffff, we must
 				 * reset that to 0xfffffff0 when the current
-				 * thread is restarted.  This will generate a new
-				 * interrupt and make sure that we never restore
-				 * the counters to the max value.  If the counters
-				 * were restored to the max value, they do not
-				 * increment and no interrupts are generated.  Hence
-				 * no more samples will be collected on that cpu.
+				 * thread is restarted.  This will generate a 
+				 * new interrupt and make sure that we never 
+				 * restore the counters to the max value.  If 
+				 * the counters were restored to the max value,
+				 * they do not increment and no interrupts are
+				 * generated.  Hence no more samples will be 
+				 * collected on that cpu.
 				 */
 				cbe_write_ctr(cpu, i, 0xFFFFFFF0);
 			else
@@ -410,9 +430,6 @@
 				 * Must do the set event, enable_cntr
 				 * for each cpu.
 				 */
-				set_pm_event(i,
-				     pmc_cntrl[next_hdw_thread][i].evnts,
-				     pmc_cntrl[next_hdw_thread][i].masks);
 				enable_ctr(cpu, i,
 					   pm_regs.pm07_cntrl);
 			} else {
@@ -465,8 +482,7 @@
 	pm_regs.pm_cntrl.trace_mode = 0;
 	pm_regs.pm_cntrl.freeze = 1;
 
-	set_count_mode(sys->enable_kernel, sys->enable_user,
-		       &pm_regs.pm_cntrl);
+	set_count_mode(sys->enable_kernel, sys->enable_user);
 
 	/* Setup the thread 0 events */
 	for (i = 0; i < num_ctrs; ++i) {
@@ -498,10 +514,10 @@
 		pmc_cntrl[1][i].vcntr = i;
 	}
 
-	for (i = 0; i < 4; i++)
+	for (i = 0; i < NUM_INPUT_BUS_WORDS; i++)
 		trace_bus[i] = 0xff;
 
-	for (i = 0; i < 2; i++)
+	for (i = 0; i < NUM_INPUT_BUS_WORDS; i++)
 		input_bus[i] = 0xff;
 
 	/* Our counters count up, and "count" refers to
@@ -560,7 +576,7 @@
 	cbe_write_pm(cpu, pm_start_stop, 0);
 	cbe_write_pm(cpu, group_control, pm_regs.group_control);
 	cbe_write_pm(cpu, debug_bus_control, pm_regs.debug_bus_control);
-	write_pm_cntrl(cpu, &pm_regs.pm_cntrl);
+	write_pm_cntrl(cpu);
 
 	for (i = 0; i < num_counters; ++i) {
 		if (ctr_enabled & (1 << i)) {
@@ -602,7 +618,7 @@
 			}
 		}
 
-		cbe_clear_pm_interrupts(cpu);
+		cbe_get_and_clear_pm_interrupts(cpu);
 		cbe_enable_pm_interrupts(cpu, hdw_thread, interrupt_mask);
 		cbe_enable_pm(cpu);
 	}
@@ -672,7 +688,7 @@
 
 	cbe_disable_pm(cpu);
 
-	interrupt_mask = cbe_clear_pm_interrupts(cpu);
+	interrupt_mask = cbe_get_and_clear_pm_interrupts(cpu);
 
 	/* If the interrupt mask has been cleared, then the virt cntr
 	 * has cleared the interrupt.  When the thread that generated
Index: linux-2.6.20-rc1/arch/powerpc/platforms/cell/pmu.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/platforms/cell/pmu.c	2007-01-18 16:43:14.319526792 -0600
+++ linux-2.6.20-rc1/arch/powerpc/platforms/cell/pmu.c	2007-01-18 16:49:58.319611848 -0600
@@ -345,18 +345,12 @@
  * Enabling/disabling interrupts for the entire performance monitoring unit.
  */
 
-u32 cbe_query_pm_interrupts(u32 cpu)
-{
-	return cbe_read_pm(cpu, pm_status);
-}
-EXPORT_SYMBOL_GPL(cbe_query_pm_interrupts);
-
-u32 cbe_clear_pm_interrupts(u32 cpu)
+u32 cbe_get_and_clear_pm_interrupts(u32 cpu)
 {
 	/* Reading pm_status clears the interrupt bits. */
-	return cbe_query_pm_interrupts(cpu);
+	return cbe_read_pm(cpu, pm_status);
 }
-EXPORT_SYMBOL_GPL(cbe_clear_pm_interrupts);
+EXPORT_SYMBOL_GPL(cbe_get_and_clear_pm_interrupts);
 
 void cbe_enable_pm_interrupts(u32 cpu, u32 thread, u32 mask)
 {
@@ -371,7 +365,7 @@
 
 void cbe_disable_pm_interrupts(u32 cpu)
 {
-	cbe_clear_pm_interrupts(cpu);
+	cbe_get_and_clear_pm_interrupts(cpu);
 	cbe_write_pm(cpu, pm_status, 0);
 }
 EXPORT_SYMBOL_GPL(cbe_disable_pm_interrupts);
Index: linux-2.6.20-rc1/include/asm-powerpc/cell-pmu.h
===================================================================
--- linux-2.6.20-rc1.orig/include/asm-powerpc/cell-pmu.h	2007-01-18 16:43:19.932605072 -0600
+++ linux-2.6.20-rc1/include/asm-powerpc/cell-pmu.h	2007-01-18 16:59:38.230579688 -0600
@@ -89,8 +89,7 @@
 
 extern void cbe_enable_pm_interrupts(u32 cpu, u32 thread, u32 mask);
 extern void cbe_disable_pm_interrupts(u32 cpu);
-extern u32  cbe_query_pm_interrupts(u32 cpu);
-extern u32  cbe_clear_pm_interrupts(u32 cpu);
+extern u32  cbe_get_and_clear_pm_interrupts(u32 cpu);
 extern void cbe_sync_irq(int node);
 
 /* Utility functions, macros */
@@ -103,11 +102,4 @@
 #define CBE_COUNT_PROBLEM_MODE          2
 #define CBE_COUNT_ALL_MODES             3
 
-/* Macros for the pm07_control registers. */
-#define PM07_CTR_INPUT_MUX(x)                    (((x) & 0x3F) << 26)
-#define PM07_CTR_INPUT_CONTROL(x)                (((x) & 1) << 25)
-#define PM07_CTR_POLARITY(x)                     (((x) & 1) << 24)
-#define PM07_CTR_COUNT_CYCLES(x)                 (((x) & 1) << 23)
-#define PM07_CTR_ENABLE(x)                       (((x) & 1) << 22)
-
 #endif /* __ASM_CELL_PMU_H__ */

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC, PATCH 2/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
  2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
@ 2007-01-29 19:47 ` Maynard Johnson
  2007-01-30  4:08   ` [Cbe-oss-dev] " Arnd Bergmann
  2007-01-29 19:48 ` [RFC, PATCH 3/4] " Maynard Johnson
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-01-29 19:47 UTC (permalink / raw)
  To: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: oprof-ppu-fix2.diff --]
[-- Type: text/x-patch, Size: 3634 bytes --]

The code was setting up the debug bus for group 21 when profiling on the 
event PPU CYCLES.  The debug bus is not actually used by the hardware 
performance counters when counting PPU CYCLES.  Setting up the debug bus
for PPU CYCLES causes signal routing conflicts on the debug bus when 
profiling PPU cycles and another PPU event.  This patch fixes the code to 
only setup the debug bus to route the performance signals for the non
PPU CYCLE events.

Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
Signed-off-by: Carl Love <carll@us.ibm.com>

Index: linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/op_model_cell.c	2007-01-18 16:56:47.300605984 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c	2007-01-24 12:16:16.225609136 -0600
@@ -39,6 +39,9 @@
 #include "../platforms/cell/interrupt.h"
 
 #define PPU_CYCLES_EVENT_NUM 1	/*  event number for CYCLES */
+#define PPU_CYCLES_GRP_NUM   1  /* special group number for identifying
+                                 * PPU_CYCLES event
+                                 */
 #define CBE_COUNT_ALL_CYCLES 0x42800000	/* PPU cycle event specifier */
 
 #define NUM_THREADS 2         /* number of physical threads in 
@@ -62,7 +65,7 @@
 struct pm_signal {
 	u16 cpu;		/* Processor to modify */
 	u16 sub_unit;		/* hw subunit this applies to (if applicable) */
-	u16 signal_group;	/* Signal Group to Enable/Disable */
+	short int signal_group;	/* Signal Group to Enable/Disable */
 	u8 bus_word;		/* Enable/Disable on this Trace/Trigger/Event
 				 * Bus Word(s) (bitmask)
 				 */
@@ -173,26 +176,40 @@
 static void pm_rtas_activate_signals(u32 node, u32 count)
 {
 	int ret;
-	int j;
+	int i, j;
 	struct pm_signal pm_signal_local[NR_PHYS_CTRS];
 
+	/* There is no debug setup required for the cycles event.
+	 * Note that only events in the same group can be used.
+	 * Otherwise, there will be conflicts in correctly routing
+	 * the signals on the debug bus.  It is the responsiblity
+	 * of the OProfile user tool to check the events are in
+	 * the same group.
+	 */
+	i = 0;
 	for (j = 0; j < count; j++) {
-		/* fw expects physical cpu # */
-		pm_signal_local[j].cpu = node;
-		pm_signal_local[j].signal_group 
-			= pm_signal[j].signal_group;
-		pm_signal_local[j].bus_word = pm_signal[j].bus_word;
-		pm_signal_local[j].sub_unit = pm_signal[j].sub_unit;
-		pm_signal_local[j].bit = pm_signal[j].bit;
-	}
+		if (pm_signal[j].signal_group != PPU_CYCLES_GRP_NUM) {
 
-	ret = rtas_ibm_cbe_perftools(SUBFUNC_ACTIVATE, PASSTHRU_ENABLE,
-				     pm_signal_local,
-				     count * sizeof(struct pm_signal));
+			/* fw expects physical cpu # */
+			pm_signal_local[i].cpu = node;
+			pm_signal_local[i].signal_group 
+				= pm_signal[j].signal_group;
+			pm_signal_local[i].bus_word = pm_signal[j].bus_word;
+			pm_signal_local[i].sub_unit = pm_signal[j].sub_unit;
+			pm_signal_local[i].bit = pm_signal[j].bit;
+			i++;
+		}
+	}
 
-	if (ret)
-		printk(KERN_WARNING "%s: rtas returned: %d\n",
-		       __FUNCTION__, ret);
+	if (i != 0) {
+		ret = rtas_ibm_cbe_perftools(SUBFUNC_ACTIVATE, PASSTHRU_ENABLE,
+					     pm_signal_local,
+					     i * sizeof(struct pm_signal));
+
+		if (ret)
+			printk(KERN_WARNING "%s: rtas returned: %d\n",
+			       __FUNCTION__, ret);
+	}
 }
 
 /*
@@ -209,7 +226,7 @@
 		/* Special Event: Count all cpu cycles */
 		pm_regs.pm07_cntrl[ctr] = CBE_COUNT_ALL_CYCLES;
 		p = &(pm_signal[ctr]);
-		p->signal_group = 21;
+		p->signal_group = PPU_CYCLES_GRP_NUM;
 		p->bus_word = 1;
 		p->sub_unit = 0;
 		p->bit = 0;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC, PATCH 3/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
  2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
  2007-01-29 19:47 ` [RFC, PATCH 2/4] " Maynard Johnson
@ 2007-01-29 19:48 ` Maynard Johnson
  2007-01-30  4:24   ` [Cbe-oss-dev] " Arnd Bergmann
  2007-01-29 19:48 ` [RFC, PATCH 4/4] " Maynard Johnson
  2007-01-30  8:37 ` [RFC, PATCH 0/4] " Arnd Bergmann
  4 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-01-29 19:48 UTC (permalink / raw)
  To: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: spu-notifier.patch --]
[-- Type: text/x-diff, Size: 4589 bytes --]

Subject: Enable SPU switch notification to detect currently active SPU tasks.

From: Maynard Johnson <maynardj@us.ibm.com>

This patch adds to the capability of spu_switch_event_register so that the
caller is also notified of currently active SPU tasks.  It also exports
spu_switch_event_register and spu_switch_event_unregister.

Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>


Index: linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-18 16:43:14.324526032 -0600
+++ linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-26 16:16:35.219668640 -0600
@@ -78,21 +78,46 @@
 
 static BLOCKING_NOTIFIER_HEAD(spu_switch_notifier);
 
-static void spu_switch_notify(struct spu *spu, struct spu_context *ctx)
+void spu_switch_notify(struct spu *spu, struct spu_context *ctx)
 {
 	blocking_notifier_call_chain(&spu_switch_notifier,
 			    ctx ? ctx->object_id : 0, spu);
 }
 
+static void notify_spus_active(void)
+{
+       int node;
+	/* Wake up the active spu_contexts. When the awakened processes 
+	 * sees their notify_active flag is set, they will call
+	 * spu_switch_notify();
+	 */
+	for (node = 0; node < MAX_NUMNODES; node++) {
+		struct spu *spu;
+		mutex_lock(&spu_prio->active_mutex[node]);
+                list_for_each_entry(spu, &spu_prio->active_list[node], list) {
+			struct spu_context *ctx = spu->ctx;
+			spu->notify_active = 1;
+			wake_up_all(&ctx->stop_wq);
+		}
+                mutex_unlock(&spu_prio->active_mutex[node]);
+	}
+}
+
 int spu_switch_event_register(struct notifier_block * n)
 {
-	return blocking_notifier_chain_register(&spu_switch_notifier, n);
+	int ret;
+	ret = blocking_notifier_chain_register(&spu_switch_notifier, n);
+	if (!ret)
+		notify_spus_active();
+	return ret;
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_register);
 
 int spu_switch_event_unregister(struct notifier_block * n)
 {
 	return blocking_notifier_chain_unregister(&spu_switch_notifier, n);
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_unregister);
 
 
 static inline void bind_context(struct spu *spu, struct spu_context *ctx)
Index: linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/spufs.h
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-18 16:43:14.340523600 -0600
+++ linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-26 16:26:49.733703448 -0600
@@ -180,6 +180,7 @@
 int spu_activate(struct spu_context *ctx, u64 flags);
 void spu_deactivate(struct spu_context *ctx);
 void spu_yield(struct spu_context *ctx);
+void spu_switch_notify(struct spu *spu, struct spu_context *ctx);
 int __init spu_sched_init(void);
 void __exit spu_sched_exit(void);
 
Index: linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/run.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/platforms/cell/spufs/run.c	2007-01-18 16:43:14.340523600 -0600
+++ linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/run.c	2007-01-26 16:24:38.979744856 -0600
@@ -45,9 +45,10 @@
 	u64 pte_fault;
 
 	*stat = ctx->ops->status_read(ctx);
-	if (ctx->state != SPU_STATE_RUNNABLE)
-		return 1;
+
 	spu = ctx->spu;
+	if (ctx->state != SPU_STATE_RUNNABLE || spu->notify_active)
+		return 1;
 	pte_fault = spu->dsisr &
 	    (MFC_DSISR_PTE_NOT_FOUND | MFC_DSISR_ACCESS_DENIED);
 	return (!(*stat & 0x1) || pte_fault || spu->class_0_pending) ? 1 : 0;
@@ -305,6 +306,7 @@
 		   u32 *npc, u32 *event)
 {
 	int ret;
+	struct spu * spu;
 	u32 status;
 
 	if (down_interruptible(&ctx->run_sema))
@@ -318,8 +320,16 @@
 
 	do {
 		ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
+		spu = ctx->spu;
 		if (unlikely(ret))
 			break;
+		if (unlikely(spu->notify_active)) {
+			spu->notify_active = 0;
+			if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
+				spu_switch_notify(spu, ctx);
+				continue;
+			}
+		}
 		if ((status & SPU_STATUS_STOPPED_BY_STOP) &&
 		    (status >> SPU_STOP_STATUS_SHIFT == 0x2104)) {
 			ret = spu_process_callback(ctx);
Index: linux-2.6.20-rc1/include/asm-powerpc/spu.h
===================================================================
--- linux-2.6.20-rc1.orig/include/asm-powerpc/spu.h	2007-01-18 16:43:19.932605072 -0600
+++ linux-2.6.20-rc1/include/asm-powerpc/spu.h	2007-01-24 12:17:30.209676992 -0600
@@ -144,6 +144,7 @@
 
 	void* pdata; /* platform private data */
 	struct sys_device sysdev;
+	int notify_active;
 };
 
 struct spu *spu_alloc(void);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
                   ` (2 preceding siblings ...)
  2007-01-29 19:48 ` [RFC, PATCH 3/4] " Maynard Johnson
@ 2007-01-29 19:48 ` Maynard Johnson
       [not found]   ` <200701300839.05144.arnd@arndb.de>
  2007-01-30  8:37 ` [RFC, PATCH 0/4] " Arnd Bergmann
  4 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-01-29 19:48 UTC (permalink / raw)
  To: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: oprof-spu.diff --]
[-- Type: text/x-patch, Size: 48333 bytes --]

Subject: Add support to OProfile for profiling Cell BE SPUs

From: Maynard Johnson <maynardj@us.ibm.com>

This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
code.

Signed-off-by: Carl Love <carll@us.ibm.com>
Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>

Index: linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/configs/cell_defconfig	2007-01-18 16:43:14.230540320 -0600
+++ linux-2.6.20-rc1/arch/powerpc/configs/cell_defconfig	2007-01-29 10:32:03.386789608 -0600
@@ -1403,7 +1403,7 @@
 # Instrumentation Support
 #
 CONFIG_PROFILING=y
-CONFIG_OPROFILE=y
+CONFIG_OPROFILE=m
 # CONFIG_KPROBES is not set
 
 #
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/pr_util.h	2007-01-29 10:32:03.388789304 -0600
@@ -0,0 +1,75 @@
+ /*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Author: Maynard Johnson <maynardj@us.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef PR_UTIL_H
+#define PR_UTIL_H
+
+#include <linux/cpumask.h>
+#include <linux/oprofile.h>
+#include <asm/cell-pmu.h>
+#include <asm/spu.h>
+
+#define number_of_online_nodes(nodes) {        \
+        u32 cpu; u32 tmp;                      \
+        nodes = 0;                             \
+        for_each_online_cpu(cpu) {             \
+                tmp = cbe_cpu_to_node(cpu) + 1;\
+                if (tmp > nodes)               \
+                        nodes++;               \
+	}                                      \
+}
+
+
+/* Defines used for sync_start */
+#define SKIP_GENERIC_SYNC 0
+#define SYNC_START_ERROR -1
+#define DO_GENERIC_SYNC 1
+
+typedef struct vma_map
+{
+	struct vma_map *next;
+	unsigned int vma;
+	unsigned int size;
+	unsigned int offset;
+	unsigned int guard_ptr;
+	unsigned int guard_val;
+} vma_map_t;
+
+/* The three functions below are for maintaining and accessing
+ * the vma-to-file offset map.
+ */
+vma_map_t * create_vma_map(const struct spu * spu, u64 objectid);
+unsigned int vma_map_lookup(vma_map_t *map, unsigned int vma,
+			    const struct spu * aSpu);
+void vma_map_free(struct vma_map *map);
+
+/*
+ * Entry point for SPU profiling.
+ * cycles_reset is the SPU_CYCLES count value specified by the user.
+ */
+void start_spu_profiling(unsigned int cycles_reset);
+
+void stop_spu_profiling(void);
+
+ 
+/* add the necessary profiling hooks */
+int spu_sync_start(void);
+
+/* remove the hooks */
+int spu_sync_stop(void);
+ 
+/* Record SPU program counter samples to the oprofile event buffer. */
+void spu_sync_buffer(int spu_num, unsigned int * samples, 
+		     int num_samples);
+
+#endif    // PR_UTIL_H 
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_profiler.c	2007-01-29 10:32:03.392788696 -0600
@@ -0,0 +1,204 @@
+/*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Authors: Maynard Johnson <maynardj@us.ibm.com>
+ *          Carl Love <carll@us.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/hrtimer.h>
+#include <linux/smp.h>
+#include <linux/slab.h>
+#include <asm/cell-pmu.h>
+#include "pr_util.h"
+
+#define TRACE_ARRAY_SIZE 1024
+static u32 * samples;
+static u32 * samples_per_node;
+
+static int spu_prof_running = 0;
+static unsigned int profiling_interval = 0;
+
+extern int num_nodes;
+extern unsigned int khzfreq;
+
+/*
+ * Oprofile setup functions
+ */
+
+#define NUM_SPU_BITS_TRBUF 16
+#define SPUS_PER_TB_ENTRY   4
+#define SPUS_PER_NODE       8
+
+/* 
+ * Collect the SPU program counter samples from the trace buffer.
+ * The global variable usage is as follows:
+ *    samples[<total-spus>][TRACE_ARRAY_SIZE] - array to store SPU PC samples
+ *           Assumption, the array will be all zeros on entry.
+ *    u32 samples_per_node[num_nodes] - array of how many valid samples per node
+ */
+static void cell_spu_pc_collection(void)
+{
+	int cpu;
+	int node;
+	int spu;
+	u32 trace_addr;
+        /* the trace buffer is 128 bits */
+	u64 trace_buffer[2];
+	u64 spu_pc_lower;  
+	u64 spu_pc_upper;
+	u64 spu_mask;
+	int entry, node_factor;
+	// process the collected SPU PC for each node
+	for_each_online_cpu(cpu) {
+		if (cbe_get_hw_thread_id(cpu))
+			continue;
+
+		node = cbe_cpu_to_node(cpu);
+		node_factor = node * SPUS_PER_NODE;
+                /* number of valid entries for this node */
+		entry = 0;
+
+		trace_addr = cbe_read_pm(cpu, trace_address);
+		while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
+		{
+			/* there is data in the trace buffer to process */
+			cbe_read_trace_buffer(cpu, trace_buffer);  
+			spu_mask = 0xFFFF000000000000;
+
+			/* Each SPU PC is 16 bits; hence, four spus in each of 
+			 * the two 64-bit buffer entries that make up the
+			 * 128-bit trace_buffer entry.  Process the upper and
+			 * lower 64-bit values simultaneously.
+			 */
+			for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
+				spu_pc_lower = spu_mask & trace_buffer[0];
+				spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
+								* (SPUS_PER_TB_ENTRY-spu-1));
+
+				spu_pc_upper = spu_mask & trace_buffer[1];
+				spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
+								* (SPUS_PER_TB_ENTRY-spu-1));
+
+				spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
+
+				/* spu PC trace entry is upper 16 bits of the
+				 * 18 bit SPU program counter 
+				 */
+				spu_pc_lower = spu_pc_lower << 2;
+				spu_pc_upper = spu_pc_upper << 2;
+
+				samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
+					= (u32) spu_pc_lower;
+				samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
+					= (u32) spu_pc_upper;
+			}
+
+			entry++;
+
+			if (entry >= TRACE_ARRAY_SIZE) 
+				/* spu_samples is full */
+				break;
+
+			trace_addr = cbe_read_pm(cpu, trace_address);
+		}
+		samples_per_node[node] = entry;
+	}
+}
+
+
+static int profile_spus(struct hrtimer * timer)
+{
+	ktime_t kt;
+        int cpu, node, k, num_samples, spu_num;
+	
+	if (!spu_prof_running)
+		goto STOP;
+
+	cell_spu_pc_collection();
+	for_each_online_cpu(cpu) {
+		if (cbe_get_hw_thread_id(cpu))
+			continue;
+
+		node = cbe_cpu_to_node(cpu);
+
+		num_samples = samples_per_node[node];
+		if (num_samples == 0)
+			continue;
+		for (k = 0; k < SPUS_PER_NODE; k++) {
+			spu_num = k + (node * SPUS_PER_NODE);
+			spu_sync_buffer(spu_num, samples + (spu_num * TRACE_ARRAY_SIZE), num_samples);
+		}
+	}
+	smp_wmb();
+
+	kt = ktime_set(0, profiling_interval);
+	if (!spu_prof_running)
+		goto STOP;
+	hrtimer_forward(timer, timer->base->get_time(), kt);
+	return HRTIMER_RESTART;
+
+ STOP:
+	printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
+	return HRTIMER_NORESTART;
+}
+
+static struct hrtimer timer;
+#define SCALE_SHIFT 14 
+/*
+ * Entry point for SPU profiling.
+ * NOTE:  SPU profiling is done system-wide, not per-CPU.
+ *
+ * cycles_reset is the count value specified by the user when
+ * setting up OProfile to count SPU_CYCLES.
+ */
+void start_spu_profiling(unsigned int cycles_reset) {
+
+	ktime_t kt;
+
+        /* To calculate a timeout in nanoseconds, the basic
+	 * formula is ns = cycles_reset * (NSEC_PER_SEC / cpu frequency).
+	 * To avoid floating point math, we use the scale math
+	 * technique as described in linux/jiffies.h.  We use
+	 * a scale factor of SCALE_SHIFT,which provides 4 decimal places
+	 * of precision, which is close enough for the purpose at hand.
+	 */
+
+	/* Since cpufreq_quick_get returns frequency in kHz, we use
+	 * USEC_PER_SEC here vs NSEC_PER_SEC.
+	 */
+	unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
+	profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
+	
+	pr_debug("timer resolution: %lu\n", 
+		 TICK_NSEC);
+	kt = ktime_set(0, profiling_interval);
+	hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_REL);
+	timer.expires = kt;
+	timer.function = profile_spus;
+
+        /* Allocate arrays for collecting SPU PC samples */
+	samples = (u32 *) kzalloc(num_nodes * SPUS_PER_NODE * TRACE_ARRAY_SIZE * sizeof(u32), GFP_ATOMIC);
+	samples_per_node = (u32 *) kzalloc(num_nodes * sizeof(u32), GFP_ATOMIC);
+
+	spu_prof_running = 1;
+	hrtimer_start(&timer, kt, HRTIMER_REL);
+}
+
+void stop_spu_profiling(void) 
+{
+
+	hrtimer_cancel(&timer);
+	kfree(samples);
+	kfree(samples_per_node);
+	pr_debug("SPU_PROF: stop_spu_profiling issued\n");
+	spu_prof_running = 0;
+}
+
+
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_task_sync.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/spu_task_sync.c	2007-01-29 10:32:03.398787784 -0600
@@ -0,0 +1,493 @@
+/*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Author: Maynard Johnson <maynardj@us.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/* The purpose of this file is to handle SPU event task switching
+ * and to record SPU context information into the OProfile
+ * event buffer. 
+ *
+ * Additionally, the spu_sync_buffer function is provided as a helper
+ * for recoding actual SPU program counter samples to the event buffer.
+ */
+
+#include <linux/notifier.h>
+#include <linux/list.h>
+#include <linux/numa.h>
+#include <linux/mm.h>
+#include <linux/dcookies.h>
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+#include <linux/oprofile.h>
+#include "pr_util.h"
+
+#define DISCARD_ALL 9999
+
+static spinlock_t buffer_lock = SPIN_LOCK_UNLOCKED;
+static int num_spu_nodes;
+int num_nodes;
+
+/* Conainer for caching information about an active SPU task.
+ *   :map -- pointer to a list of vma_maps
+ *   :spu -- the spu for this active SPU task
+ *   :list -- potentially could be used to contain the cached_infos
+ *            for inactive SPU tasks.
+ * 
+ * Ideally, we would like to be able to create the cached_info for
+ * an SPU task just one time -- when libspe first loads the SPU 
+ * binary file.  We would store the cached_info in a list.  Then, as
+ * SPU tasks are switched out and new ones switched in, the cached_info
+ * for inactive tasks would be kept, and the active one would be placed
+ * at the head of the list.  But this technique may not with
+ * current spufs functionality since the spu used in bind_context may
+ * be a different spu than was used in a previous bind_context for a
+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
+ * assigned to run on a different physical SPE.  We will investigate
+ * further if this can be done.
+ *
+ */
+struct cached_info {
+	vma_map_t * map;
+	struct spu * the_spu;
+	struct kref cache_ref;
+	struct list_head list;
+};
+
+/* A data structure for cached information about active SPU tasks.
+ * Storage is dynamically allocated, sized as
+ * "number of active nodes multplied by 8". 
+ * The info_list[n] member holds 0 or more 
+ * 'struct cached_info' objects for SPU#=n. 
+ *
+ * As currently implemented, there will only ever be one cached_info 
+ * in the list for a given SPU.  If we can devise a way to maintain
+ * multiple cached_infos in our list, then it would make sense
+ * to also cache the dcookie representing the PPU task application.
+ * See above description of struct cached_info for more details.
+ */
+struct spu_info_stacks {
+	struct list_head * info_list;
+};
+
+static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
+
+
+static struct spu_info_stacks * spu_info;
+
+static void destroy_cached_info(struct kref * kref)
+{
+	struct cached_info * info;
+	info = container_of(kref, struct cached_info, cache_ref);
+	vma_map_free(info->map);
+	kfree(info);
+}
+
+static int put_cached_info(struct cached_info * info)
+{
+	return kref_put(&info->cache_ref, &destroy_cached_info);
+}
+
+/* Return the cached_info for the passed SPU number.
+ * Current implementation is such that a list will hold, at most,
+ * one cached_info.
+ * 
+ * NOTE: Clients of this function MUST call put_cached_info()
+ *       when finished using the returned cached_info (if the
+ *       returned value is non-null).
+ */
+static struct cached_info * get_cached_info(int spu_num)
+{
+	struct cached_info * ret_info, * info;
+	unsigned long flags = 0;
+	ret_info = NULL;
+	spin_lock_irqsave(&cache_lock, flags);
+	if (spu_info == NULL) {
+		pr_debug("SPU_PROF:%s: spu_info does not exist\n",
+			 __FUNCTION__);
+		goto out;
+	}
+	if (spu_num >= num_spu_nodes) {
+		printk(KERN_ERR "SPU_PROF: " 
+		       "%s, line %d: Invalid index %d into spu info cache\n",
+		       __FUNCTION__, __LINE__, spu_num);
+		goto out;
+	}
+	list_for_each_entry(info, &spu_info->info_list[spu_num], list) {
+		/* Only one item in the list, so return it. */
+		ret_info = info;
+		kref_get(&info->cache_ref);
+		break;
+	}
+
+out:
+	spin_unlock_irqrestore(&cache_lock, flags);
+	return ret_info;
+}
+
+
+/* Looks for cached info for the passed spu.  If not found, the
+ * cached info is created for the passed spu.
+ * Returns 0 for success; otherwise, -1 for error.  
+ */ 
+static int
+prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
+{
+	vma_map_t * new_map;
+	unsigned long flags = 0;
+	int retval = 0;
+	/* spu->number is a system-wide value, not a per-node value. */
+	struct cached_info * info = get_cached_info(spu->number);
+	if (info == NULL) {
+		/* create cached_info and add it to the list for SPU #<n>.*/
+		info = kzalloc(sizeof(struct cached_info), GFP_ATOMIC);
+		if (!info) {
+			printk(KERN_ERR "SPU_PROF: "
+			       "%s, line %d: create vma_map failed\n",
+			       __FUNCTION__, __LINE__);
+			goto ERR_ALLOC;
+		}
+		new_map = create_vma_map(spu, objectId);
+		if (!new_map) {
+			printk(KERN_ERR "SPU_PROF: "
+			       "%s, line %d: create vma_map failed\n",
+			       __FUNCTION__, __LINE__);
+			goto ERR_ALLOC;
+		}
+
+		pr_debug("Created vma_map\n");
+ 		info->map = new_map;
+		info->the_spu = spu;
+		kref_init(&info->cache_ref);
+		spin_lock_irqsave(&cache_lock, flags);
+		list_add(&info->list, &spu_info->info_list[spu->number]);
+		spin_unlock_irqrestore(&cache_lock, flags);
+		goto OUT;
+	} else {
+        /* Immedidately put back reference to cached_info since we don't
+	 * really need it -- just checking whether we have it.
+	 */
+		put_cached_info(info);
+		pr_debug("Found cached SPU info.\n");
+	}
+	
+ERR_ALLOC:
+	retval = -1;
+OUT:
+	return retval;
+}
+
+
+/* Discard all cached info and free the memory.
+ * NOTE:  The caller is responsible for locking the
+ *        spu_info struct containing the cached_info
+ *        prior to calling this function.
+ */
+static int discard_cached_info(int spu_index)
+{
+	struct cached_info * info, * tmp;
+	int index, end;
+	if (spu_index == DISCARD_ALL) {
+		end = num_spu_nodes;
+		index = 0;
+	} else {
+	        if (spu_index >= num_spu_nodes) {
+        	        printk(KERN_ERR "SPU_PROF: "
+			       "%s, line %d: Invalid index %d into spu info cache\n",
+               	               __FUNCTION__, __LINE__, spu_index);
+	                goto out;
+	        }
+		end = spu_index +1;
+		index = spu_index;
+	}
+	for (; index < end; index++) {
+		list_for_each_entry_safe(info, tmp, 
+					 &spu_info->info_list[index],
+					 list) {
+			list_del(&info->list);
+			put_cached_info(info);
+		}
+	}
+out:
+	return 0;
+}
+
+/* The source code for fast_get_dcookie was "borrowed"
+ * from drivers/oprofile/buffer_sync.c.
+ */
+
+/* Optimisation. We can manage without taking the dcookie sem
+ * because we cannot reach this code without at least one
+ * dcookie user still being registered (namely, the reader
+ * of the event buffer).
+ */
+static inline unsigned long fast_get_dcookie(struct dentry * dentry,
+					     struct vfsmount * vfsmnt)
+{
+        unsigned long cookie;
+
+        if (dentry->d_cookie)
+                return (unsigned long)dentry;
+        get_dcookie(dentry, vfsmnt, &cookie);
+        return cookie;
+}
+
+/* Look up the dcookie for the task's first VM_EXECUTABLE mapping,
+ * which corresponds loosely to "application name". Also, determine
+ * the offset for the SPU ELF object.  If computed offset is 
+ * non-zero, it implies an embedded SPU object; otherwise, it's a
+ * separate SPU binary, in which case we retrieve it's dcookie.
+ */
+static unsigned long 
+get_exec_dcookie_and_offset(
+	struct spu * spu, unsigned int * offsetp,
+	unsigned long * spu_bin_dcookie,
+	unsigned int spu_ref)
+{
+        unsigned long cookie = 0;
+	unsigned int my_offset = 0;
+        struct vm_area_struct * vma;
+	struct mm_struct * mm = spu->mm;
+
+        if (!mm)
+                goto OUT;
+
+        for (vma = mm->mmap; vma; vma = vma->vm_next) {
+                if (!vma->vm_file)
+                        continue;
+                if (!(vma->vm_flags & VM_EXECUTABLE))
+                        continue;
+                cookie = fast_get_dcookie(vma->vm_file->f_dentry,
+                        vma->vm_file->f_vfsmnt);
+		pr_debug("got dcookie for %s\n",
+			  vma->vm_file->f_dentry->d_name.name);
+                break;
+        }
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_start > spu_ref || vma->vm_end < spu_ref)
+			continue;
+		my_offset = spu_ref - vma->vm_start;
+		pr_debug("Found spu ELF at "
+			  " %X for file %s\n", my_offset,
+			  vma->vm_file->f_dentry->d_name.name);
+		*offsetp = my_offset;
+		if (my_offset == 0) {
+			if (!vma->vm_file) {
+				goto FAIL_NO_SPU_COOKIE;
+			}
+			*spu_bin_dcookie = fast_get_dcookie(
+				vma->vm_file->f_dentry,
+				vma->vm_file->f_vfsmnt);
+			pr_debug("got dcookie for %s\n",
+				  vma->vm_file->f_dentry->d_name.name);
+		}
+		break;			
+	}
+	
+OUT:
+        return cookie;
+
+FAIL_NO_SPU_COOKIE:
+	printk(KERN_ERR "SPU_PROF: "
+	       "%s, line %d: Cannot find dcookie for SPU binary\n",
+	       __FUNCTION__, __LINE__);
+	goto OUT;
+}
+
+
+
+/* This function finds or creates cached context information for the
+ * passed SPU and records SPU context information into the OProfile
+ * event buffer.
+ */
+static int process_context_switch(struct spu * spu, unsigned int objectId)
+{
+	unsigned long flags = 0;
+	int retval = 0;
+	unsigned int offset = 0;
+	unsigned long spu_cookie = 0, app_dcookie = 0;
+	retval = prepare_cached_spu_info(spu, objectId);
+	if (retval == -1) {
+		goto OUT;
+	}
+        /* Get dcookie first because a mutex_lock is taken in that
+	 * code path, so interrupts must not be disabled.
+	 */
+	app_dcookie = get_exec_dcookie_and_offset(spu, &offset, 
+						  &spu_cookie, objectId);
+
+        /* Record context info in event buffer */
+	spin_lock_irqsave(&buffer_lock, flags);
+	add_event_entry(ESCAPE_CODE);
+	add_event_entry(SPU_CTX_SWITCH_CODE);
+	add_event_entry(spu->number);
+	add_event_entry(spu->pid);
+	add_event_entry(spu->tgid);
+	add_event_entry(app_dcookie);
+
+	add_event_entry(ESCAPE_CODE);
+	if (offset) {
+	  /* When offset is non-zero,  this means the SPU ELF was embedded;
+	   * otherwise, it was loaded from a separate binary file.  For the
+	   * embedded case, we record the offset of the SPU ELF into the PPU
+	   * executable; for the non-embedded case, we record a dcookie that
+	   * points to the location of the SPU binary that was loaded.
+	   */
+		add_event_entry(SPU_OFFSET_CODE);
+		add_event_entry(offset);
+	} else {
+		add_event_entry(SPU_COOKIE_CODE);
+		add_event_entry(spu_cookie);
+	}
+	spin_unlock_irqrestore(&buffer_lock, flags);
+	smp_wmb();
+OUT:
+	return retval;
+}
+
+/* 
+ * This function is invoked on either a bind_context or unbind_context.  
+ * If called for an unbind_context, the val arg is 0; otherwise, 
+ * it is the object-id value for the spu context.
+ * The data arg is of type 'struct spu *'.
+ */
+static int spu_active_notify(struct notifier_block * self, unsigned long val,
+			     void * data)
+{
+	int retval ;
+	unsigned long flags = 0;
+	struct spu * the_spu = (struct spu *) data;
+	pr_debug("SPU event notification arrived\n");
+	if (val == 0){
+		spin_lock_irqsave(&cache_lock, flags);
+		retval = discard_cached_info(the_spu->number);
+		spin_unlock_irqrestore(&cache_lock, flags);
+	} else {
+		retval = process_context_switch(the_spu, val);
+	}
+	return retval;
+}
+
+static struct notifier_block spu_active = {
+	.notifier_call = spu_active_notify,
+};
+
+/* The main purpose of this function is to synchronize
+ * OProfile with SPUFS by registering to be notified of
+ * SPU task switches.
+ *
+ * NOTE: When profiling SPUs, we must ensure that only
+ * spu_sync_start is invoked and not the generic sync_start
+ * in drivers/oprofile/oprof.c.  A return value of
+ * SKIP_GENERIC_SYNC or SYNC_START_ERROR will
+ * accomplish this.
+ */
+int spu_sync_start(void) {
+	int ret = SKIP_GENERIC_SYNC;
+	int register_ret;
+	int i;
+	unsigned long flags = 0;
+	number_of_online_nodes(num_nodes);
+	num_spu_nodes = num_nodes * 8;
+	spin_lock_irqsave(&cache_lock, flags);
+	spu_info = kzalloc(sizeof(struct spu_info_stacks), GFP_ATOMIC);
+	spu_info->info_list = kzalloc(sizeof(struct list_head) * num_spu_nodes,
+				      GFP_ATOMIC);
+
+	for (i = 0; i <  num_spu_nodes; i++) {
+		INIT_LIST_HEAD(&spu_info->info_list[i]);
+	}
+	spin_unlock_irqrestore(&cache_lock, flags);
+
+	spin_lock_irqsave(&buffer_lock, flags);
+	add_event_entry(ESCAPE_CODE);
+	add_event_entry(SPU_PROFILING_CODE);
+	add_event_entry(num_spu_nodes);
+	spin_unlock_irqrestore(&buffer_lock, flags);
+
+        /* Register for SPU events  */
+	register_ret = spu_switch_event_register(&spu_active);
+	if (register_ret) {
+		ret = SYNC_START_ERROR;
+		goto OUT;
+	}
+
+	pr_debug("spu_sync_start -- running.\n");
+OUT:
+	return ret;	
+}
+
+/* Record SPU program counter samples to the oprofile event buffer. */
+void spu_sync_buffer(int spu_num, unsigned int * samples, 
+		     int num_samples)
+{
+	unsigned long flags = 0;
+	int i;
+	vma_map_t * map;
+	struct spu * the_spu;
+	unsigned long long spu_num_ll = spu_num;
+	unsigned long long spu_num_shifted = spu_num_ll << 32;
+	struct cached_info * c_info = get_cached_info(spu_num);
+	if (c_info == NULL) {
+/* This legitimately happens when the SPU task ends before all 
+ * samples are recorded.  No big deal -- so we just drop a few samples.
+ */
+		pr_debug("SPU_PROF: No cached SPU contex "
+			  "for SPU #%d. Dropping samples.\n", spu_num);
+		return ;
+	}
+
+	map = c_info->map;
+	the_spu = c_info->the_spu;
+	spin_lock_irqsave(&buffer_lock, flags);
+	for (i = 0; i < num_samples; i++) {
+		unsigned long long file_offset;
+		unsigned int sample = *(samples+i);
+		if (sample == 0)
+			continue;
+		file_offset = vma_map_lookup(
+			map, sample, the_spu);
+		/* For now, we'll drop samples that can't be mapped.
+		 * This can happen for generated stubs executed from
+		 * the SPU stack.  Do we need to record these somehow?
+		 */
+		if (unlikely(file_offset == -1))
+			continue;
+		add_event_entry(file_offset | spu_num_shifted);
+	}
+	spin_unlock_irqrestore(&buffer_lock, flags);
+	put_cached_info(c_info);
+}
+
+
+int spu_sync_stop(void)
+{
+	unsigned long flags = 0;
+	int ret = spu_switch_event_unregister(&spu_active);
+	if (ret) {
+		printk(KERN_ERR "SPU_PROF: "
+		       "%s, line %d: spu_switch_event_unregister returned %d\n",
+		       __FUNCTION__, __LINE__, ret);
+		goto OUT;
+	} 
+
+	spin_lock_irqsave(&cache_lock, flags);
+	ret = discard_cached_info(DISCARD_ALL);
+	kfree(spu_info->info_list);
+	kfree(spu_info);
+	spin_unlock_irqrestore(&cache_lock, flags);
+
+OUT:
+	pr_debug("spu_sync_stop -- done.\n");
+	return ret;
+}
+
+
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/cell/vma_map.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/cell/vma_map.c	2007-01-29 10:32:03.401787328 -0600
@@ -0,0 +1,228 @@
+ /*
+ * Cell Broadband Engine OProfile Support
+ *
+ * (C) Copyright IBM Corporation 2006
+ *
+ * Author: Maynard Johnson <maynardj@us.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/* The code in this source file is responsible for generating
+ * vma-to-fileOffset maps for both overlay and non-overlay SPU
+ * applications.
+ */
+
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/elf.h>
+#include "pr_util.h"
+
+
+void vma_map_free(struct vma_map *map)
+{
+	while (map) {
+		vma_map_t *next = map->next;
+		kfree(map);
+		map = next;
+	}
+}
+
+unsigned int vma_map_lookup(vma_map_t *map, unsigned int vma,
+			    const struct spu * aSpu)
+{
+	u32 offset = -1;
+	u32 ovly_grd;
+	for (; map; map = map->next) {
+		if (vma < map->vma || vma >= map->vma + map->size)
+			continue;
+
+		if (map->guard_ptr) {
+			ovly_grd = *(u32 *)(aSpu->local_store + map->guard_ptr);
+			if (ovly_grd != map->guard_val)
+				continue;
+		}
+		break;
+	}
+
+	if (likely(map != NULL)) {
+		offset = vma - map->vma + map->offset;
+	}
+	return offset;
+}
+
+static vma_map_t *
+vma_map_add(vma_map_t *map, unsigned int vma, unsigned int size,
+	     unsigned int offset, unsigned int guard_ptr, 
+	     unsigned int guard_val)
+{
+	vma_map_t *new = kzalloc(sizeof(vma_map_t), GFP_ATOMIC);
+	if (!new) {
+		printk(KERN_ERR "SPU_PROF: %s, line %d: malloc failed\n",
+		       __FUNCTION__, __LINE__);
+		vma_map_free(map);
+		return NULL;
+	}
+
+	new->next = map;
+	new->vma = vma;
+	new->size = size;
+	new->offset = offset;
+	new->guard_ptr = guard_ptr;
+	new->guard_val = guard_val;
+
+	return new;
+}
+
+
+/* Parse SPE ELF header and generate a list of vma_maps.
+ * A pointer to the first vma_map in the generated list
+ * of vma_maps is returned.  */
+vma_map_t * create_vma_map(const struct spu * aSpu, 
+			   unsigned long spu_elf_start)
+{
+	static const unsigned char expected[EI_PAD] = {
+		[EI_MAG0] = ELFMAG0,
+		[EI_MAG1] = ELFMAG1,
+		[EI_MAG2] = ELFMAG2,
+		[EI_MAG3] = ELFMAG3,
+		[EI_CLASS] = ELFCLASS32,
+		[EI_DATA] = ELFDATA2MSB,
+		[EI_VERSION] = EV_CURRENT,
+		[EI_OSABI] = ELFOSABI_NONE
+	};
+
+	struct vma_map *map = NULL;
+	unsigned int overlay_tbl_offset = -1;
+	unsigned long phdr_start, shdr_start;
+	Elf32_Ehdr ehdr;
+	Elf32_Phdr phdr;
+	Elf32_Shdr shdr, shdr_str;
+	Elf32_Sym sym;
+	int i, j;
+	char name[32];
+
+	unsigned int ovly_table_sym = 0;
+	unsigned int ovly_buf_table_sym = 0;
+	unsigned int ovly_table_end_sym = 0;
+	unsigned int ovly_buf_table_end_sym = 0;
+	unsigned long ovly_table;
+	unsigned int n_ovlys;
+
+	struct {
+		unsigned int vma;
+		unsigned int size;
+		unsigned int offset;
+		unsigned int buf;
+	} ovly;
+
+	/* Get and validate ELF header.  */
+
+	copy_from_user(&ehdr, (void *) spu_elf_start, sizeof (ehdr));
+	if (memcmp(ehdr.e_ident, expected, EI_PAD) != 0) {
+		printk(KERN_ERR "SPU_PROF: "
+		       "%s, line %d: Unexpected value parsing SPU ELF\n",
+		       __FUNCTION__, __LINE__);
+		return NULL;
+	}
+	if (ehdr.e_machine != 23) {
+		printk(KERN_ERR "SPU_PROF: "
+		       "%s, line %d: Unexpected value parsing SPU ELF\n",
+		       __FUNCTION__,  __LINE__);
+
+		return NULL;
+	}
+	if (ehdr.e_type != ET_EXEC) {
+		printk(KERN_ERR "SPU_PROF: "
+		       "%s, line %d: Unexpected value parsing SPU ELF\n",
+		       __FUNCTION__, __LINE__);
+		return NULL;
+	}
+	phdr_start = spu_elf_start + ehdr.e_phoff;
+	shdr_start = spu_elf_start + ehdr.e_shoff;
+
+	/* Traverse program headers.  */
+	for (i = 0; i < ehdr.e_phnum; i++) {
+		copy_from_user(&phdr, (void *) (phdr_start + i * sizeof(phdr)), 
+			       sizeof(phdr));
+		if (phdr.p_type != PT_LOAD)
+			continue;
+		if (phdr.p_flags & (1 << 27))
+			continue;
+
+		map = vma_map_add(map, phdr.p_vaddr, phdr.p_memsz, 
+				  phdr.p_offset, 0, 0);
+		if (!map)
+			return NULL;
+	}
+
+	pr_debug("SPU_PROF: Created non-overlay maps\n");	
+	/* Traverse section table and search for overlay-related symbols.  */
+	for (i = 0; i < ehdr.e_shnum; i++) {
+		copy_from_user(&shdr, (void *) (shdr_start + i * sizeof(shdr)), 
+			       sizeof(shdr));
+		if (shdr.sh_type != SHT_SYMTAB)
+			continue;
+		if (shdr.sh_entsize != sizeof (sym))
+			continue;
+
+		copy_from_user(&shdr_str, 
+			       (void *) (shdr_start + shdr.sh_link * sizeof(shdr)),
+			       sizeof(shdr));
+		if (shdr_str.sh_type != SHT_STRTAB)
+			return NULL;
+
+		for (j = 0; j < shdr.sh_size / sizeof (sym); j++) {
+			copy_from_user(&sym, (void *) (spu_elf_start +
+						       shdr.sh_offset + j * sizeof (sym)),
+				       sizeof (sym));
+			copy_from_user(name, (void *) (spu_elf_start + shdr_str.sh_offset + 
+						       sym.st_name),
+				       20);
+			if (memcmp(name, "_ovly_table", 12) == 0)
+				ovly_table_sym = sym.st_value;
+			if (memcmp(name, "_ovly_buf_table", 16) == 0)
+				ovly_buf_table_sym = sym.st_value;
+			if (memcmp(name, "_ovly_table_end", 16) == 0)
+				ovly_table_end_sym = sym.st_value;
+			if (memcmp(name, "_ovly_buf_table_end", 20) == 0)
+				ovly_buf_table_end_sym = sym.st_value;
+		}
+	}
+
+	/* If we don't have overlays, we're done.  */
+	if (ovly_table_sym == 0 || ovly_buf_table_sym == 0
+	    || ovly_table_end_sym == 0 || ovly_buf_table_end_sym == 0) {
+		pr_debug("SPU_PROF: No overlay table found\n");
+		return map;
+	}
+	else {
+		pr_debug("SPU_PROF: Overlay table found\n");
+	}
+
+	overlay_tbl_offset = vma_map_lookup(map, ovly_table_sym, aSpu);
+	if (overlay_tbl_offset < 0) {
+		printk(KERN_ERR "SPU_PROF: "
+		       "%s, line %d: Error finding SPU overlay table\n",
+		       __FUNCTION__, __LINE__);
+		return NULL;
+	}
+	ovly_table = spu_elf_start + overlay_tbl_offset;
+	n_ovlys = (ovly_table_end_sym - ovly_table_sym) / sizeof (ovly);
+
+	/* Traverse overlay table.  */
+	for (i = 0; i < n_ovlys; i++) {
+		copy_from_user(&ovly, (void *) (ovly_table + i * sizeof (ovly)),
+			       sizeof (ovly));
+		map = vma_map_add(map, ovly.vma, ovly.size, ovly.offset,
+				   ovly_buf_table_sym + (ovly.buf - 1) * 4, i + 1);
+		if (!map)
+			return NULL;
+	}
+	
+	return map;
+}
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/common.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/common.c	2007-01-18 16:43:14.429510072 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/common.c	2007-01-29 10:32:03.403787024 -0600
@@ -150,6 +150,8 @@
 #ifdef CONFIG_PPC_CELL_NATIVE
 		case PPC_OPROFILE_CELL:
 			model = &op_model_cell;
+			ops->sync_start = model->sync_start;
+			ops->sync_stop = model->sync_stop;
 			break;
 #endif
 		case PPC_OPROFILE_RS64:
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Kconfig	2007-01-18 16:43:14.426510528 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/Kconfig	2007-01-29 10:32:03.404786872 -0600
@@ -7,7 +7,8 @@
 
 config OPROFILE
 	tristate "OProfile system profiling (EXPERIMENTAL)"
-	depends on PROFILING
+	default m
+	depends on SPU_FS && PROFILING && CBE_CPUFREQ
 	help
 	  OProfile is a profiling system capable of profiling the
 	  whole system, include the kernel, kernel modules, libraries,
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/Makefile
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/Makefile	2007-01-18 16:43:14.429510072 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/Makefile	2007-01-29 10:32:03.405786720 -0600
@@ -11,7 +11,8 @@
 		timer_int.o )
 
 oprofile-y := $(DRIVER_OBJS) common.o backtrace.o
-oprofile-$(CONFIG_PPC_CELL_NATIVE) += op_model_cell.o
+oprofile-$(CONFIG_PPC_CELL_NATIVE) += op_model_cell.o \
+					cell/spu_profiler.o cell/vma_map.o cell/spu_task_sync.o
 oprofile-$(CONFIG_PPC64) += op_model_rs64.o op_model_power4.o
 oprofile-$(CONFIG_FSL_BOOKE) += op_model_fsl_booke.o
 oprofile-$(CONFIG_6xx) += op_model_7450.o
Index: linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/oprofile/op_model_cell.c	2007-01-24 12:16:16.225609136 -0600
+++ linux-2.6.20-rc1/arch/powerpc/oprofile/op_model_cell.c	2007-01-29 10:32:03.410785960 -0600
@@ -37,6 +37,17 @@
 #include <asm/system.h>
 
 #include "../platforms/cell/interrupt.h"
+#include "cell/pr_util.h"
+
+/* spu_cycle_reset is the number of cycles between samples.
+ * This variable is used for SPU profiling and should ONLY be set
+ * at the beginning of cell_reg_setup; otherwise, it's read-only.
+ */
+static unsigned int spu_cycle_reset = 0;
+unsigned int khzfreq;
+
+#define NUM_SPUS_PER_NODE    8
+#define SPU_CYCLES_EVENT_NUM 2        /*  event number for SPU_CYCLES */
 
 #define PPU_CYCLES_EVENT_NUM 1	/*  event number for CYCLES */
 #define PPU_CYCLES_GRP_NUM   1  /* special group number for identifying
@@ -50,7 +61,6 @@
 #define NUM_TRACE_BUS_WORDS 4
 #define NUM_INPUT_BUS_WORDS 2
 
-
 struct pmc_cntrl_data {
 	unsigned long vcntr;
 	unsigned long evnts;
@@ -134,6 +144,7 @@
 /*
  * Firmware interface functions
  */
+
 static int
 rtas_ibm_cbe_perftools(int subfunc, int passthru,
 		       void *address, unsigned long length)
@@ -480,7 +491,22 @@
 	       struct op_system_config *sys, int num_ctrs)
 {
 	int i, j, cpu;
+	spu_cycle_reset = 0;
 
+	/* The cpufreq_quick_get function requires that cbe_cpufreq module
+	 * be loaded.  This function is not actually provided and exported
+	 * by cbe_cpufreq, but it relies on cbe_cpufreq initialize kernel
+	 * data structures.  Since there's no way for depmod to realize
+	 * that our OProfile module depends on cbe_cpufreq, we currently
+	 * are letting the userspace tool, opcontrol, ensure that the
+	 * cbe_cpufreq module is loaded.
+	 */
+	khzfreq = cpufreq_quick_get(smp_processor_id());
+
+	if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
+		spu_cycle_reset = ctr[0].count;
+		return;
+	}
 	pm_rtas_token = rtas_token("ibm,cbe-perftools");
 	if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
 		printk(KERN_WARNING "%s: RTAS_UNKNOWN_SERVICE\n",
@@ -566,6 +592,8 @@
 	;
 }
 
+
+
 /* This function is called once for each cpu */
 static void cell_cpu_setup(struct op_counter_config *cntr)
 {
@@ -573,6 +601,9 @@
 	u32 num_enabled = 0;
 	int i;
 
+	if (spu_cycle_reset)
+		return;
+
 	/* There is one performance monitor per processor chip (i.e. node),
 	 * so we only need to perform this function once per node.
 	 */
@@ -607,7 +638,121 @@
 	;
 }
 
-static void cell_global_start(struct op_counter_config *ctr)
+static int calculate_lfsr(int n)
+{
+#define size 24
+	int i;
+	unsigned int newlfsr0;
+	unsigned int lfsr = 0xFFFFFF;
+	unsigned int howmany = lfsr - n;
+
+	for (i = 2; i < howmany + 2; i++) {
+		newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
+			    ((lfsr >> (size - 1 - 1)) & 1) ^
+			    (((lfsr >> (size - 1 - 6)) & 1) ^
+			     ((lfsr >> (size - 1 - 23)) & 1)));
+
+		lfsr >>= 1;
+		lfsr = lfsr | (newlfsr0 << (size - 1));
+	}
+	return lfsr;
+
+}
+
+static void pm_rtas_activate_spu_profiling(u32 node)
+{
+	int ret, i;
+	struct pm_signal pm_signal_local[NR_PHYS_CTRS];
+
+	/* Set up the rtas call to configure the debug bus to 
+	 * route the SPU PCs.  Setup the pm_signal for each SPU */
+	for (i = 0; i < NUM_SPUS_PER_NODE; i++) {
+		pm_signal_local[i].cpu = node;
+		pm_signal_local[i].signal_group = 41;
+		pm_signal_local[i].bus_word = 1 << i / 2; /* spu i on 
+							   * word (i/2) 
+							   */
+		pm_signal_local[i].sub_unit = i;	/* spu i */
+		pm_signal_local[i].bit = 63;
+	}
+
+	pm_rtas_token = rtas_token("ibm,cbe-perftools");
+	if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
+		printk(KERN_WARNING "%s: RTAS_UNKNOWN_SERVICE \n",
+		       __FUNCTION__);
+	}
+
+	ret = rtas_ibm_cbe_perftools(SUBFUNC_ACTIVATE, PASSTHRU_ENABLE,
+				     pm_signal_local,
+				     8 * sizeof(struct pm_signal)); //FIXME 8 to #define
+
+	if (ret)
+		printk(KERN_WARNING "%s: rtas returned: %d\n",
+		       __FUNCTION__, ret);
+
+}
+
+static void cell_global_start_spu(struct op_counter_config *ctr)
+{
+	int subfunc, rtn_value;
+	unsigned int lfsr_value;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (cbe_get_hw_thread_id(cpu))
+			continue;
+		/* Setup SPU cycle-based profiling.
+		 * Set perf_mon_control bit 0 to a zero before
+		 * enabling spu collection hardware.
+		 */
+		cbe_write_pm(cpu, pm_control, 0);
+
+		pm_rtas_activate_spu_profiling(cbe_cpu_to_node(cpu));
+
+		if (spu_cycle_reset > 0xFFFFFE) 
+				lfsr_value = calculate_lfsr(1);  /* use largest possible 
+								  *  value 
+								  */
+		else 
+		    lfsr_value = calculate_lfsr(spu_cycle_reset);
+
+		if (lfsr_value == 0) {  /* must use a non zero value.  Zero
+					 * disables data collection.
+					 */
+				lfsr_value = calculate_lfsr(1);  /* use largest possible 
+								  * value 
+								 */
+		}
+
+		lfsr_value = lfsr_value << 8; /* shift lfsr to correct 
+					       * register location
+					       */
+		
+		pm_rtas_token = rtas_token("ibm,cbe-spu-perftools");  
+
+		if (pm_rtas_token == RTAS_UNKNOWN_SERVICE) {
+			printk(KERN_ERR
+			       "%s: rtas token ibm,cbe-spu-perftools unknown\n",
+			       __FUNCTION__);
+		}
+
+		subfunc = 2;	// 2 - activate SPU tracing, 3 - deactivate
+
+		rtn_value = rtas_call(pm_rtas_token, 3, 1, NULL, subfunc,
+			  cbe_cpu_to_node(cpu), lfsr_value);
+
+		if (rtn_value != 0)
+			printk(KERN_ERR
+			       "%s: rtas call ibm,cbe-spu-perftools failed, return = %d\n",
+			       __FUNCTION__, rtn_value);
+	}
+
+	start_spu_profiling(spu_cycle_reset);
+
+	oprofile_running = 1;
+}
+
+static void cell_global_start_ppu(struct op_counter_config *ctr)
 {
 	u32 cpu;
 	u32 interrupt_mask = 0;
@@ -652,7 +797,44 @@
 	start_virt_cntrs();
 }
 
-static void cell_global_stop(void)
+static void cell_global_start(struct op_counter_config *ctr)
+{
+	if (spu_cycle_reset) {
+		cell_global_start_spu(ctr);
+	} else {
+		cell_global_start_ppu(ctr);
+	}
+}
+
+static void cell_global_stop_spu(void)
+{
+	int subfunc, rtn_value;
+	unsigned int lfsr_value;
+	int cpu;
+
+	oprofile_running = 0;
+
+	for_each_online_cpu(cpu) {
+		if (cbe_get_hw_thread_id(cpu))
+			continue;
+
+		subfunc = 3;	// 2 - activate SPU tracing, 3 - deactivate
+		lfsr_value = 0x8f100000;
+
+		rtn_value =
+		    rtas_call(pm_rtas_token, 3, 1, NULL, subfunc,
+			      cbe_cpu_to_node(cpu), lfsr_value);
+
+		if (rtn_value != 0)
+			printk
+			    ("ERROR, rtas call ibm,cbe-spu-perftools failed, return = %d\n",
+			     rtn_value);
+	}
+
+	stop_spu_profiling();
+}
+
+static void cell_global_stop_ppu(void)
 {
 	int cpu;
 
@@ -680,6 +862,16 @@
 	}
 }
 
+static void cell_global_stop(void)
+{
+	if (spu_cycle_reset) {
+		cell_global_stop_spu();
+	} else {
+		cell_global_stop_ppu();
+	}
+
+}
+
 static void
 cell_handle_interrupt(struct pt_regs *regs, struct op_counter_config *ctr)
 {
@@ -748,10 +940,33 @@
 	spin_unlock_irqrestore(&virt_cntr_lock, flags);
 }
 
+/* This function is called from the generic OProfile
+ * driver.  When profiling PPUs, we need to do the
+ * generic sync start; otherwise, do spu_sync_start.
+ */
+static int cell_sync_start(void)
+{
+	if (spu_cycle_reset)
+		return spu_sync_start();
+	else
+		return DO_GENERIC_SYNC;
+}
+
+static int cell_sync_stop(void)
+{
+	if (spu_cycle_reset)
+		return spu_sync_stop();
+	else
+		return 1;
+}
+
+
 struct op_powerpc_model op_model_cell = {
 	.reg_setup = cell_reg_setup,
 	.cpu_setup = cell_cpu_setup,
 	.global_start = cell_global_start,
 	.global_stop = cell_global_stop,
+	.sync_start = cell_sync_start,
+	.sync_stop = cell_sync_stop,
 	.handle_interrupt = cell_handle_interrupt,
 };
Index: linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux-2.6.20-rc1.orig/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-26 16:16:35.219668640 -0600
+++ linux-2.6.20-rc1/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-29 10:32:03.413785504 -0600
@@ -94,7 +94,7 @@
 	for (node = 0; node < MAX_NUMNODES; node++) {
 		struct spu *spu;
 		mutex_lock(&spu_prio->active_mutex[node]);
-                list_for_each_entry(spu, &spu_prio->active_list[node], list) {
+		list_for_each_entry(spu, &spu_prio->active_list[node], list) {
 			struct spu_context *ctx = spu->ctx;
 			spu->notify_active = 1;
 			wake_up_all(&ctx->stop_wq);
@@ -129,6 +129,7 @@
 	ctx->spu = spu;
 	ctx->ops = &spu_hw_ops;
 	spu->pid = current->pid;
+	spu->tgid = current->tgid;
 	spu->prio = current->prio;
 	spu->mm = ctx->owner;
 	mm_needs_global_tlbie(spu->mm);
@@ -161,6 +162,7 @@
 	spu->dma_callback = NULL;
 	spu->mm = NULL;
 	spu->pid = 0;
+	spu->tgid = 0;
 	spu->prio = MAX_PRIO;
 	ctx->ops = &spu_backing_ops;
 	ctx->spu = NULL;
Index: linux-2.6.20-rc1/drivers/oprofile/buffer_sync.c
===================================================================
--- linux-2.6.20-rc1.orig/drivers/oprofile/buffer_sync.c	2007-01-18 16:43:11.675529376 -0600
+++ linux-2.6.20-rc1/drivers/oprofile/buffer_sync.c	2007-01-29 10:32:03.415785200 -0600
@@ -26,6 +26,7 @@
 #include <linux/profile.h>
 #include <linux/module.h>
 #include <linux/fs.h>
+#include <linux/oprofile.h>
  
 #include "oprofile_stats.h"
 #include "event_buffer.h"
Index: linux-2.6.20-rc1/drivers/oprofile/event_buffer.h
===================================================================
--- linux-2.6.20-rc1.orig/drivers/oprofile/event_buffer.h	2007-01-18 16:43:11.673529680 -0600
+++ linux-2.6.20-rc1/drivers/oprofile/event_buffer.h	2007-01-29 10:32:03.417784896 -0600
@@ -19,28 +19,10 @@
  
 /* wake up the process sleeping on the event file */
 void wake_up_buffer_waiter(void);
- 
-/* Each escaped entry is prefixed by ESCAPE_CODE
- * then one of the following codes, then the
- * relevant data.
- */
-#define ESCAPE_CODE			~0UL
-#define CTX_SWITCH_CODE 		1
-#define CPU_SWITCH_CODE 		2
-#define COOKIE_SWITCH_CODE 		3
-#define KERNEL_ENTER_SWITCH_CODE	4
-#define KERNEL_EXIT_SWITCH_CODE		5
-#define MODULE_LOADED_CODE		6
-#define CTX_TGID_CODE			7
-#define TRACE_BEGIN_CODE		8
-#define TRACE_END_CODE			9
- 
+  
 #define INVALID_COOKIE ~0UL
 #define NO_COOKIE 0UL
 
-/* add data to the event buffer */
-void add_event_entry(unsigned long data);
- 
 extern struct file_operations event_buffer_fops;
  
 /* mutex between sync_cpu_buffers() and the
Index: linux-2.6.20-rc1/drivers/oprofile/oprof.c
===================================================================
--- linux-2.6.20-rc1.orig/drivers/oprofile/oprof.c	2007-01-18 16:43:11.675529376 -0600
+++ linux-2.6.20-rc1/drivers/oprofile/oprof.c	2007-01-29 10:32:03.419784592 -0600
@@ -53,9 +53,23 @@
 	 * us missing task deaths and eventually oopsing
 	 * when trying to process the event buffer.
 	 */
+	if (oprofile_ops.sync_start) {
+		int sync_ret = oprofile_ops.sync_start();
+		switch (sync_ret) {
+			case 0: goto post_sync;
+				break;
+			case 1: goto do_generic;
+				break;
+			case -1: goto out3;
+				break;
+			default: goto out3;
+		}
+	}
+do_generic:
 	if ((err = sync_start()))
 		goto out3;
 
+post_sync:
 	is_setup = 1;
 	mutex_unlock(&start_mutex);
 	return 0;
@@ -118,7 +132,19 @@
 void oprofile_shutdown(void)
 {
 	mutex_lock(&start_mutex);
+        if (oprofile_ops.sync_stop) {
+                int sync_ret = oprofile_ops.sync_stop();
+                switch (sync_ret) {
+                        case 0: goto post_sync;
+                                break;
+                        case 1: goto do_generic;
+                                break;
+			default: goto post_sync;
+                }
+        }
+do_generic:
 	sync_stop();
+post_sync:
 	if (oprofile_ops.shutdown)
 		oprofile_ops.shutdown();
 	is_setup = 0;
Index: linux-2.6.20-rc1/include/asm-powerpc/oprofile_impl.h
===================================================================
--- linux-2.6.20-rc1.orig/include/asm-powerpc/oprofile_impl.h	2007-01-18 16:43:19.315566704 -0600
+++ linux-2.6.20-rc1/include/asm-powerpc/oprofile_impl.h	2007-01-29 10:32:03.421784288 -0600
@@ -47,6 +47,8 @@
         void (*global_start) (struct op_counter_config *);
 	void (*stop) (void);
 	void (*global_stop) (void);
+	int (*sync_start)(void);
+	int (*sync_stop)(void);
 	void (*handle_interrupt) (struct pt_regs *,
 				  struct op_counter_config *);
 	int num_counters;
Index: linux-2.6.20-rc1/include/asm-powerpc/spu.h
===================================================================
--- linux-2.6.20-rc1.orig/include/asm-powerpc/spu.h	2007-01-24 12:17:30.209676992 -0600
+++ linux-2.6.20-rc1/include/asm-powerpc/spu.h	2007-01-29 10:32:03.423783984 -0600
@@ -128,6 +128,7 @@
 	struct spu_runqueue *rq;
 	unsigned long long timestamp;
 	pid_t pid;
+	pid_t tgid;
 	int prio;
 	int class_0_pending;
 	spinlock_t register_lock;
Index: linux-2.6.20-rc1/include/linux/oprofile.h
===================================================================
--- linux-2.6.20-rc1.orig/include/linux/oprofile.h	2007-01-18 16:43:18.379575976 -0600
+++ linux-2.6.20-rc1/include/linux/oprofile.h	2007-01-29 10:32:03.425783680 -0600
@@ -17,6 +17,28 @@
 #include <linux/spinlock.h>
 #include <asm/atomic.h>
  
+/* Each escaped entry is prefixed by ESCAPE_CODE
+ * then one of the following codes, then the
+ * relevant data.
+ * These #defines live in this file so that arch-specific
+ * buffer sync'ing code can access them.  
+ */
+#define ESCAPE_CODE                     ~0UL
+#define CTX_SWITCH_CODE                 1
+#define CPU_SWITCH_CODE                 2
+#define COOKIE_SWITCH_CODE              3
+#define KERNEL_ENTER_SWITCH_CODE        4
+#define KERNEL_EXIT_SWITCH_CODE         5
+#define MODULE_LOADED_CODE              6
+#define CTX_TGID_CODE                   7
+#define TRACE_BEGIN_CODE                8
+#define TRACE_END_CODE                  9
+#define XEN_ENTER_SWITCH_CODE          10
+#define SPU_PROFILING_CODE             11
+#define SPU_CTX_SWITCH_CODE            12
+#define SPU_OFFSET_CODE                13
+#define SPU_COOKIE_CODE                14
+
 struct super_block;
 struct dentry;
 struct file_operations;
@@ -35,6 +57,14 @@
 	int (*start)(void);
 	/* Stop delivering interrupts. */
 	void (*stop)(void);
+	/* Arch-specific buffer sync functions.
+	 * Return value = 0:  Success
+	 * Return value = -1: Failure
+	 * Return value = 1:  Run generic sync function
+	 */
+	int (*sync_start)(void);
+	int (*sync_stop)(void);
+
 	/* Initiate a stack backtrace. Optional. */
 	void (*backtrace)(struct pt_regs * const regs, unsigned int depth);
 	/* CPU identification string. */
@@ -56,6 +86,13 @@
 void oprofile_arch_exit(void);
 
 /**
+ * Add data to the event buffer.
+ * The data passed is free-form, but typically consists of
+ * file offsets, dcookies, context information, and ESCAPE codes.
+ */
+void add_event_entry(unsigned long data);
+ 
+/**
  * Add a sample. This may be called from any context. Pass
  * smp_processor_id() as cpu.
  */
Index: linux-2.6.20-rc1/kernel/hrtimer.c
===================================================================
--- linux-2.6.20-rc1.orig/kernel/hrtimer.c	2007-01-18 16:43:05.808489704 -0600
+++ linux-2.6.20-rc1/kernel/hrtimer.c	2007-01-29 10:32:48.321748656 -0600
@@ -335,6 +335,7 @@
 
 	return orun;
 }
+EXPORT_SYMBOL_GPL(hrtimer_forward);
 
 /*
  * enqueue_hrtimer - internal function to (re)start a timer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
@ 2007-01-30  4:07   ` Arnd Bergmann
  2007-01-30 10:39   ` Christoph Hellwig
  1 sibling, 0 replies; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-30  4:07 UTC (permalink / raw)
  To: cbe-oss-dev, maynardj; +Cc: linux-kernel, linuxppc-dev, oprofile-list

On Monday 29 January 2007 20:46, Maynard Johnson wrote:
>   This is a clean up patch that includes the following changes:
> 
>         -It removes some macro definitions that are only used once
>          with the actual code.
>         -Some comments were added to clarify the code based on feedback
>          from the community.
>         -The write_pm_cntrl() and set_count_mode() were passed a structure
>          element from a global variable.  The argument was removed so the
>          functions now just operate on the global directly.
>         -The set_pm_event() function call in the cell_virtual_cntr() routine
>          was moved to a for-loop before the for_each_cpu loop
> 
> Signed-off-by: Carl Love <carll@us.ibm.com>
> Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
> 

Acked-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>

Just a small side note: Please give each of your patches a one-line
summary in the subject of the email. I'm filing this one under:
"cell: oprofile cleanup".

It would also be good if you could use a mailer that sends out
patches as inline, so you don't need to resort to using attachments.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 2/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:47 ` [RFC, PATCH 2/4] " Maynard Johnson
@ 2007-01-30  4:08   ` Arnd Bergmann
  2007-01-30 23:51     ` Carl Love
  0 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-30  4:08 UTC (permalink / raw)
  To: cbe-oss-dev, maynardj; +Cc: linux-kernel, linuxppc-dev, oprofile-list

On Monday 29 January 2007 20:47, Maynard Johnson wrote:
>   The code was setting up the debug bus for group 21 when profiling on the 
> event PPU CYCLES.  The debug bus is not actually used by the hardware 
> performance counters when counting PPU CYCLES.  Setting up the debug bus
> for PPU CYCLES causes signal routing conflicts on the debug bus when 
> profiling PPU cycles and another PPU event.  This patch fixes the code to 
> only setup the debug bus to route the performance signals for the non
> PPU CYCLE events.
> 
> Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
> Signed-off-by: Carl Love <carll@us.ibm.com>

Acked-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>

Any suggestion for a one-line patch title?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 3/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:48 ` [RFC, PATCH 3/4] " Maynard Johnson
@ 2007-01-30  4:24   ` Arnd Bergmann
  2007-01-30 15:31     ` Maynard Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-30  4:24 UTC (permalink / raw)
  To: cbe-oss-dev, maynardj; +Cc: linux-kernel, linuxppc-dev, oprofile-list

On Monday 29 January 2007 20:48, Maynard Johnson wrote:
> Subject: Enable SPU switch notification to detect currently active SPU tasks.
> 
> From: Maynard Johnson <maynardj@us.ibm.com>
> 
> This patch adds to the capability of spu_switch_event_register so that the
> caller is also notified of currently active SPU tasks.  It also exports
> spu_switch_event_register and spu_switch_event_unregister.
> 
> Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>

I looked through it again, and think I found a serious bug, but that
should be easy enough to solve:

> +static void notify_spus_active(void)
> +{
> +       int node;
> +       /* Wake up the active spu_contexts. When the awakened processes 
> +        * sees their notify_active flag is set, they will call
> +        * spu_switch_notify();
> +        */
> +       for (node = 0; node < MAX_NUMNODES; node++) {
> +               struct spu *spu;
> +               mutex_lock(&spu_prio->active_mutex[node]);
> +                list_for_each_entry(spu, &spu_prio->active_list[node], list) {
> +                       struct spu_context *ctx = spu->ctx;

[side note]
There is a small whitespace breakage in here, please make sure you always
use tabs for indenting, not space characters.
[/side note]

> @@ -45,9 +45,10 @@
>         u64 pte_fault;
>  
>         *stat = ctx->ops->status_read(ctx);
> -       if (ctx->state != SPU_STATE_RUNNABLE)
> -               return 1;
> +
>         spu = ctx->spu;
> +       if (ctx->state != SPU_STATE_RUNNABLE || spu->notify_active)
> +               return 1;
>         pte_fault = spu->dsisr &
>             (MFC_DSISR_PTE_NOT_FOUND | MFC_DSISR_ACCESS_DENIED);
>         return (!(*stat & 0x1) || pte_fault || spu->class_0_pending) ? 1 : 0;
> @@ -305,6 +306,7 @@
>                    u32 *npc, u32 *event)
>  {
>         int ret;
> +       struct spu * spu;
>         u32 status;
>  
>         if (down_interruptible(&ctx->run_sema))
> @@ -318,8 +320,16 @@
>  
>         do {
>                 ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
> +               spu = ctx->spu;
>                 if (unlikely(ret))
>                         break;
> +               if (unlikely(spu->notify_active)) {
> +                       spu->notify_active = 0;
> +                       if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
> +                               spu_switch_notify(spu, ctx);
> +                               continue;
> +                       }
> +               }

This is before spu_reacquire_runnable, so in case the spu got
preempted at the same time when oprofile was enabled, ctx->spu
is NULL, and you can't load the notify_active flag from it.

On solution would be to move the notify_active flag from ctx->spu
into ctx itself, but maybe there are other ways to solve this.

Thanks,

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
       [not found]   ` <200701300839.05144.arnd@arndb.de>
@ 2007-01-30  7:53     ` Benjamin Herrenschmidt
  2007-01-30 10:41       ` Christoph Hellwig
  2007-01-30 21:41     ` Maynard Johnson
  2007-02-03 23:49     ` Maynard Johnson
  2 siblings, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-30  7:53 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: cbe-oss-dev, maynardj, linuxppc-dev, linux-kernel, oprofile-list


> > +/* Defines used for sync_start */
> > +#define SKIP_GENERIC_SYNC 0
> > +#define SYNC_START_ERROR -1
> > +#define DO_GENERIC_SYNC 1
> > +
> > +typedef struct vma_map
> > +{
> > +       struct vma_map *next;
> > +       unsigned int vma;
> > +       unsigned int size;
> > +       unsigned int offset;
> > +       unsigned int guard_ptr;
> > +       unsigned int guard_val;
> > +} vma_map_t;

I haven't had time to look in details yet but in that context, what does
"vma" stands for ? There's already an important vm data structure in
linux routinely called "vma" and thus I suspect this is a poor naming
choice as it will cause confusion.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
                   ` (3 preceding siblings ...)
  2007-01-29 19:48 ` [RFC, PATCH 4/4] " Maynard Johnson
@ 2007-01-30  8:37 ` Arnd Bergmann
  4 siblings, 0 replies; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-30  8:37 UTC (permalink / raw)
  To: linuxppc-dev, maynardj; +Cc: cbe-oss-dev, linux-kernel, oprofile-list

On Monday 29 January 2007 20:45, Maynard Johnson wrote:
> On December 14, 2006, I posted a patch that added support to the 
> OProfile kernel driver for profiling Cell SPUs.  There have been some 
> changes/fixes to this patch since the original posting (including 
> forward porting from 2.6.18-based kernel to 2.6.20-rc1), so I am 
> reposting the patch for review now.  This patch relies upon the 
> following patches that have not been accepted yet:
>    1. oprofile cleanup patch (submitted on Nov 27)
>    2. Fix for PPU profiling (not submitted yet, since it depends on #1)
>    3. SPU task notification patch (last submitted on Jan 26)
> 

Sorry for taking so much time before reviewing this. I knew it would
be a long patch that requires time to go through in detail (though
certainly not nearly as much time as it took you to write it), so I
kept procrastinating.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
  2007-01-30  4:07   ` [Cbe-oss-dev] " Arnd Bergmann
@ 2007-01-30 10:39   ` Christoph Hellwig
  2007-01-30 22:49     ` Carl Love
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2007-01-30 10:39 UTC (permalink / raw)
  To: Maynard Johnson; +Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

On Mon, Jan 29, 2007 at 01:46:50PM -0600, Maynard Johnson wrote:
> 
> 

I don't think the macro removal is helpful, getting rid of the names
makes the code less readable to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30  7:53     ` [Cbe-oss-dev] " Benjamin Herrenschmidt
@ 2007-01-30 10:41       ` Christoph Hellwig
  2007-01-30 23:09         ` Maynard Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2007-01-30 10:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, linuxppc-dev, cbe-oss-dev, oprofile-list, linux-kernel

On Tue, Jan 30, 2007 at 06:53:50PM +1100, Benjamin Herrenschmidt wrote:
> 
> > > +/* Defines used for sync_start */
> > > +#define SKIP_GENERIC_SYNC 0
> > > +#define SYNC_START_ERROR -1
> > > +#define DO_GENERIC_SYNC 1
> > > +
> > > +typedef struct vma_map
> > > +{
> > > +       struct vma_map *next;
> > > +       unsigned int vma;
> > > +       unsigned int size;
> > > +       unsigned int offset;
> > > +       unsigned int guard_ptr;
> > > +       unsigned int guard_val;
> > > +} vma_map_t;
> 
> I haven't had time to look in details yet but in that context, what does
> "vma" stands for ? There's already an important vm data structure in
> linux routinely called "vma" and thus I suspect this is a poor naming
> choice as it will cause confusion.

It looks like it actually is dealing with vma to me.  But then again:

 - please don't use typedefs for structures
 - there might be a more descriptive name for this than just vma_map

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 3/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30  4:24   ` [Cbe-oss-dev] " Arnd Bergmann
@ 2007-01-30 15:31     ` Maynard Johnson
  2007-01-31  0:35       ` Arnd Bergmann
  0 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-01-30 15:31 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

Arnd Bergmann wrote:

> On Monday 29 January 2007 20:48, Maynard Johnson wrote:
> 
>>Subject: Enable SPU switch notification to detect currently active SPU tasks.
>>
>>From: Maynard Johnson <maynardj@us.ibm.com>
>>
>>This patch adds to the capability of spu_switch_event_register so that the
>>caller is also notified of currently active SPU tasks.  It also exports
>>spu_switch_event_register and spu_switch_event_unregister.
>>
>>Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
> 
> 
> I looked through it again, and think I found a serious bug, but that
> should be easy enough to solve:
> 
> 
>>+static void notify_spus_active(void)
>>+{
>>+       int node;
>>+       /* Wake up the active spu_contexts. When the awakened processes 
>>+        * sees their notify_active flag is set, they will call
>>+        * spu_switch_notify();
>>+        */
>>+       for (node = 0; node < MAX_NUMNODES; node++) {
>>+               struct spu *spu;
>>+               mutex_lock(&spu_prio->active_mutex[node]);
>>+                list_for_each_entry(spu, &spu_prio->active_list[node], list) {
>>+                       struct spu_context *ctx = spu->ctx;
> 
> 
> [side note]
> There is a small whitespace breakage in here, please make sure you always
> use tabs for indenting, not space characters.
> [/side note]
> 
> 
>>@@ -45,9 +45,10 @@
>>        u64 pte_fault;
>> 
>>        *stat = ctx->ops->status_read(ctx);
>>-       if (ctx->state != SPU_STATE_RUNNABLE)
>>-               return 1;
>>+
>>        spu = ctx->spu;
>>+       if (ctx->state != SPU_STATE_RUNNABLE || spu->notify_active)
>>+               return 1;
>>        pte_fault = spu->dsisr &
>>            (MFC_DSISR_PTE_NOT_FOUND | MFC_DSISR_ACCESS_DENIED);
>>        return (!(*stat & 0x1) || pte_fault || spu->class_0_pending) ? 1 : 0;
>>@@ -305,6 +306,7 @@
>>                   u32 *npc, u32 *event)
>> {
>>        int ret;
>>+       struct spu * spu;
>>        u32 status;
>> 
>>        if (down_interruptible(&ctx->run_sema))
>>@@ -318,8 +320,16 @@
>> 
>>        do {
>>                ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
>>+               spu = ctx->spu;
>>                if (unlikely(ret))
>>                        break;
>>+               if (unlikely(spu->notify_active)) {
>>+                       spu->notify_active = 0;
>>+                       if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
>>+                               spu_switch_notify(spu, ctx);
>>+                               continue;
>>+                       }
>>+               }
> 
> 
> This is before spu_reacquire_runnable, so in case the spu got
> preempted at the same time when oprofile was enabled, ctx->spu
> is NULL, and you can't load the notify_active flag from it.
> 
> On solution would be to move the notify_active flag from ctx->spu
> into ctx itself, but maybe there are other ways to solve this.
In an earlier review of this patch, Christopher Hellwig suggested I move 
the notify_active flag to be a bit in the sched_flags field that's added 
in his scheduler patch series.  If this patch series will be a available 
in an "Arnd" tree that we'll be using for our current OProfile 
development, perhaps I should wait until that time to change this, since 
the window of vulnerability is quite small.  What do you think?

-Maynard
> 
> Thanks,
> 
> 	Arnd <><



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
       [not found]   ` <200701300839.05144.arnd@arndb.de>
  2007-01-30  7:53     ` [Cbe-oss-dev] " Benjamin Herrenschmidt
@ 2007-01-30 21:41     ` Maynard Johnson
  2007-01-30 22:54       ` Maynard Johnson
                         ` (2 more replies)
  2007-02-03 23:49     ` Maynard Johnson
  2 siblings, 3 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-30 21:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list, Anton Blanchard

Arnd Bergmann wrote:

> On Monday 29 January 2007 20:48, Maynard Johnson wrote:
> 
>>Subject: Add support to OProfile for profiling Cell BE SPUs
>>
>>From: Maynard Johnson <maynardj@us.ibm.com>
>>
>>This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
>>to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
>>was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
>>code.
>>
>>Signed-off-by: Carl Love <carll@us.ibm.com>
>>Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
> 
> 
> I can't really say much about the common oprofile files that you are
> touching, maybe someone from oprofile-list (Philippe?) to look over them
> and ack/nack them.
Anton (added to cc list) may be another good reviewer of 
drivers/oprofile changes.
> 
> 
>>+#define number_of_online_nodes(nodes) {        \
>>+        u32 cpu; u32 tmp;                      \
>>+        nodes = 0;                             \
>>+        for_each_online_cpu(cpu) {             \
>>+                tmp = cbe_cpu_to_node(cpu) + 1;\
>>+                if (tmp > nodes)               \
>>+                        nodes++;               \
>>+       }                                      \
>>+}
> 
> 
> I've been discussing with benh about a better way to do this. We should
> change all the references to nodes and cpu numbers to something more
> correct in the future, so we get rid of the assumption that each
> numa node is a cell chip. It's probably best to leave your code as
> is for now, but we need to remember this one when cleaning it up.
> 
> You should however convert this into an inline function
> of prototype 
> 
> static inline int number_of_online_nodes(void)
> 
> instead of a macro.
OK.
> 
> 
>>+/* Defines used for sync_start */
>>+#define SKIP_GENERIC_SYNC 0
>>+#define SYNC_START_ERROR -1
>>+#define DO_GENERIC_SYNC 1
>>+
>>+typedef struct vma_map
>>+{
>>+       struct vma_map *next;
>>+       unsigned int vma;
>>+       unsigned int size;
>>+       unsigned int offset;
>>+       unsigned int guard_ptr;
>>+       unsigned int guard_val;
>>+} vma_map_t;
> 
> 
> please don't typedef structures.
Sure.
> 
> 
[snip]
>>+
>>+static int spu_prof_running = 0;
>>+static unsigned int profiling_interval = 0;
>>+
>>+extern int num_nodes;
>>+extern unsigned int khzfreq;
> 
> 
> You really can't have global variable with such generic names. For
> static variables, it's less of a problem, since they are not in the
> same name space, but for easier debugging, it's good to always have
> the name of your module (e.g. spu_prof_) as a prefix to the identifier.
OK, we'll add 'spu_prof' prefix to them.
> 
> Of course, the best way would be to avoid global and static variables
> entirely, but that's not always possible.
> 
> 
[snip]
>>+       // process the collected SPU PC for each node
>>+       for_each_online_cpu(cpu) {
>>+               if (cbe_get_hw_thread_id(cpu))
>>+                       continue;
>>+
>>+               node = cbe_cpu_to_node(cpu);
>>+               node_factor = node * SPUS_PER_NODE;
>>+                /* number of valid entries for this node */
>>+               entry = 0;
>>+
>>+               trace_addr = cbe_read_pm(cpu, trace_address);
>>+               while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
>>+               {
>>+                       /* there is data in the trace buffer to process */
>>+                       cbe_read_trace_buffer(cpu, trace_buffer);  
>>+                       spu_mask = 0xFFFF000000000000;
>>+
>>+                       /* Each SPU PC is 16 bits; hence, four spus in each of 
>>+                        * the two 64-bit buffer entries that make up the
>>+                        * 128-bit trace_buffer entry.  Process the upper and
>>+                        * lower 64-bit values simultaneously.
>>+                        */
>>+                       for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
>>+                               spu_pc_lower = spu_mask & trace_buffer[0];
>>+                               spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
>>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
>>+
>>+                               spu_pc_upper = spu_mask & trace_buffer[1];
>>+                               spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
>>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
>>+
>>+                               spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
>>+
>>+                               /* spu PC trace entry is upper 16 bits of the
>>+                                * 18 bit SPU program counter 
>>+                                */
>>+                               spu_pc_lower = spu_pc_lower << 2;
>>+                               spu_pc_upper = spu_pc_upper << 2;
>>+
>>+                               samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
>>+                                       = (u32) spu_pc_lower;
>>+                               samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
>>+                                       = (u32) spu_pc_upper;
>>+                       }
>>+
>>+                       entry++;
>>+
>>+                       if (entry >= TRACE_ARRAY_SIZE) 
>>+                               /* spu_samples is full */
>>+                               break;
>>+
>>+                       trace_addr = cbe_read_pm(cpu, trace_address);
>>+               }
>>+               samples_per_node[node] = entry;
>>+       }
>>+}
> 
> 
> While I can't see anything technically wrong with this function, it would be
> good to split it into smaller functions. Since you are nesting three
> loops, it should be possible to make a separate function from one of the
> inner loops without changing the actual logic behind it.
Will do.
> 
> 
>>+
>>+static int profile_spus(struct hrtimer * timer)
>>+{
>>+       ktime_t kt;
>>+        int cpu, node, k, num_samples, spu_num;
> 
> 
> whitespace damage
fixed
> 
> 
>>+       
>>+       if (!spu_prof_running)
>>+               goto STOP;
>>+
>>+       cell_spu_pc_collection();
>>+       for_each_online_cpu(cpu) {
>>+               if (cbe_get_hw_thread_id(cpu))
>>+                       continue;
> 
> 
> Here, you enter the same top-level loop again, why not make it
> 	for_each_online_cpu(cpu) {
> 		if (cbe_get_hw_thread_id(cpu))
>                          continue;
> 		num_samples = cell_spu_pc_collection(cpu);
> 		...
Yes, good suggestion.
> 
> 
>>+       kt = ktime_set(0, profiling_interval);
>>+       if (!spu_prof_running)
>>+               goto STOP;
>>+       hrtimer_forward(timer, timer->base->get_time(), kt);
>>+       return HRTIMER_RESTART;
> 
> 
> is hrtimer_forward really the right interface here? You are ignoring
> the number of overruns anyway, so hrtimer_start(,,) sounds more
> correct to me.
According to Tom Gleixner, "hrtimer_forward is a convenience function to 
move the expiry time of a timer forward in multiples of the interval, so 
it is in the future.  After setting the expiry time you restart the 
timer either with [sic] a return HRTIMER_RESTART (if you are in the 
timer callback function)."
> 
> 
>>+
>>+ STOP:
> 
> 
> labels should be in small letters.
OK
> 
> 
>>+       printk(KERN_INFO "SPU_PROF: spu-prof timer ending\n");
>>+       return HRTIMER_NORESTART;
>>+}
> 
> 
>>+void start_spu_profiling(unsigned int cycles_reset) {
>>+
>>+       ktime_t kt;
>>+
>>+        /* To calculate a timeout in nanoseconds, the basic
>>+        * formula is ns = cycles_reset * (NSEC_PER_SEC / cpu frequency).
>>+        * To avoid floating point math, we use the scale math
>>+        * technique as described in linux/jiffies.h.  We use
>>+        * a scale factor of SCALE_SHIFT,which provides 4 decimal places
>>+        * of precision, which is close enough for the purpose at hand.
>>+        */
>>+
>>+       /* Since cpufreq_quick_get returns frequency in kHz, we use
>>+        * USEC_PER_SEC here vs NSEC_PER_SEC.
>>+        */
>>+       unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
>>+       profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
>>+       
>>+       pr_debug("timer resolution: %lu\n", 
>>+                TICK_NSEC);
> 
> 
> Don't you need to adapt the profiling_interval at run time, when cpufreq
> changes the core frequency? You should probably use
> cpufreq_register_notifier() to update this.
Since OProfile is a statistical profiler, the exact frequency is not 
critical.  The user is going to be looking for hot spots in their code, 
so it's all relative.  With that said,  I don't imagine using the 
cpufreq notiication would be a big deal.  We'll look at it.
> 
> 
>>+       kt = ktime_set(0, profiling_interval);
>>+       hrtimer_init(&timer, CLOCK_MONOTONIC, HRTIMER_REL);
>>+       timer.expires = kt;
>>+       timer.function = profile_spus;
>>+
>>+        /* Allocate arrays for collecting SPU PC samples */
>>+       samples = (u32 *) kzalloc(num_nodes * SPUS_PER_NODE * TRACE_ARRAY_SIZE * sizeof(u32), GFP_ATOMIC);
> 
> 
> Try to avoid atomic allocations. I don't think you are in an atomic
> context here at all, so you can just use GFP_KERNEL.
OK, I'll check it out.
> 
> 
>>+       samples_per_node = (u32 *) kzalloc(num_nodes * sizeof(u32), GFP_ATOMIC);
> 
> 
> Since MAX_NUMNODES is small, it's probably more efficient to just allocate this
> statically.
OK.
> 
> 
>>+
>>+       spu_prof_running = 1;
>>+       hrtimer_start(&timer, kt, HRTIMER_REL);
>>+}
>>
>>+
>>+void stop_spu_profiling(void) 
>>+{
>>+
>>+       hrtimer_cancel(&timer);
>>+       kfree(samples);
>>+       kfree(samples_per_node);
>>+       pr_debug("SPU_PROF: stop_spu_profiling issued\n");
>>+       spu_prof_running = 0;
>>+}
> 
> 
> shouldn't you set spu_prof_running = 0 before doing any of the other things?
> It looks to me like you could otherwise get into a use-after-free
> situation. If I'm wrong with that, you probably don't need spu_prof_running
> at all ;-)
No, you're right.  :-)
> 
> 
>>+/* Conainer for caching information about an active SPU task.
>>+ *   :map -- pointer to a list of vma_maps
>>+ *   :spu -- the spu for this active SPU task
>>+ *   :list -- potentially could be used to contain the cached_infos
>>+ *            for inactive SPU tasks.
> 
> 
> Documenting structures is good, but please use the common kerneldoc format
> for it. There are a number of examples for this in include/linux/
OK
> 
> 
>>+ * 
>>+ * Ideally, we would like to be able to create the cached_info for
>>+ * an SPU task just one time -- when libspe first loads the SPU 
>>+ * binary file.  We would store the cached_info in a list.  Then, as
>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>+ * for inactive tasks would be kept, and the active one would be placed
>>+ * at the head of the list.  But this technique may not with
>>+ * current spufs functionality since the spu used in bind_context may
>>+ * be a different spu than was used in a previous bind_context for a
>>+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
>>+ * assigned to run on a different physical SPE.  We will investigate
>>+ * further if this can be done.
>>+ *
>>+ */
> 
> 
> You should stuff a pointer to cached_info into struct spu_context,
> e.g. 'void *profile_private'.
> 
> 
>>+struct cached_info {
>>+       vma_map_t * map;
>>+       struct spu * the_spu;
>>+       struct kref cache_ref;
>>+       struct list_head list;
>>+};
> 
> 
> And replace the 'the_spu' member with a back pointer to the
> spu_context if you need it.
> 
> 
>>+
>>+/* A data structure for cached information about active SPU tasks.
>>+ * Storage is dynamically allocated, sized as
>>+ * "number of active nodes multplied by 8". 
>>+ * The info_list[n] member holds 0 or more 
>>+ * 'struct cached_info' objects for SPU#=n. 
>>+ *
>>+ * As currently implemented, there will only ever be one cached_info 
>>+ * in the list for a given SPU.  If we can devise a way to maintain
>>+ * multiple cached_infos in our list, then it would make sense
>>+ * to also cache the dcookie representing the PPU task application.
>>+ * See above description of struct cached_info for more details.
>>+ */
>>+struct spu_info_stacks {
>>+       struct list_head * info_list;
>>+};
> 
> 
> Why do you store pointers to list_head structures? If you want to store
> lists, you should have a lists_head itself in here.
info_list is an array of n lists, where n is the number of SPUs.
> 
> Why do you store them per spu in the first place? The physical spu
> doesn't have any relevance to this at all, the only data that is
> per spu is the sample data collected on a profiling interrupt,
> which you can then copy in the per-context data on a context switch.
The sample data is written out to the event buffer on every profiling 
interrupt.  But we don't write out the SPU program counter samples 
directly to the event buffer.  First, we have to find the cached_info 
for the appropriate SPU context to retrieve the cached vma-to-fileoffset 
map.  Then we do the vma_map_lookup to find the fileoffset corresponding 
to the SPU PC sample, which we then write out to the event buffer.  This 
is one of the most time-critical pieces of the SPU profiling code, so I 
used an array to hold the cached_info for fast random access.  But as I 
stated in a code comment above, the negative implication of this current 
implementation is that the array can only hold the cached_info for 
currently running SPU tasks.  I need to give this some more thought.
> 
> 
> 
>>+/* Looks for cached info for the passed spu.  If not found, the
>>+ * cached info is created for the passed spu.
>>+ * Returns 0 for success; otherwise, -1 for error.  
>>+ */ 
>>+static int
>>+prepare_cached_spu_info(struct spu * spu, unsigned int objectId)
>>+{
> 
> 
> see above, this should get the spu_context pointer as its argument,
> not the spu.
> 
> 
>>+       vma_map_t * new_map;
>>+       unsigned long flags = 0;
>>+       int retval = 0;
>>+       /* spu->number is a system-wide value, not a per-node value. */
>>+       struct cached_info * info = get_cached_info(spu->number);
>>+       if (info == NULL) {
> 
> 
> if you revert the logic to
> 
> 	if (info)
> 		goto out;
> 
> then the bulk of your function doesn't need to get indented,
> which helps readability.
OK
> 
> 
>>+               /* create cached_info and add it to the list for SPU #<n>.*/
>>+               info = kzalloc(sizeof(struct cached_info), GFP_ATOMIC);
> 
> 
> GFP_KERNEL
OK
> 
> 
>>+               if (!info) {
>>+                       printk(KERN_ERR "SPU_PROF: "
>>+                              "%s, line %d: create vma_map failed\n",
>>+                              __FUNCTION__, __LINE__);
>>+                       goto ERR_ALLOC;
>>+               }
>>+               new_map = create_vma_map(spu, objectId);
>>+               if (!new_map) {
>>+                       printk(KERN_ERR "SPU_PROF: "
>>+                              "%s, line %d: create vma_map failed\n",
>>+                              __FUNCTION__, __LINE__);
>>+                       goto ERR_ALLOC;
>>+               }
>>+
>>+               pr_debug("Created vma_map\n");
>>+               info->map = new_map;
>>+               info->the_spu = spu;
>>+               kref_init(&info->cache_ref);
>>+               spin_lock_irqsave(&cache_lock, flags);
>>+               list_add(&info->list, &spu_info->info_list[spu->number]);
>>+               spin_unlock_irqrestore(&cache_lock, flags);
>>+               goto OUT;
>>+       } else {
>>+        /* Immedidately put back reference to cached_info since we don't
>>+        * really need it -- just checking whether we have it.
>>+        */
>>+               put_cached_info(info);
>>+               pr_debug("Found cached SPU info.\n");
>>+       }
>>+       
>>+ERR_ALLOC:
>>+       retval = -1;
>>+OUT:
>>+       return retval;
>>+}
> 
> 
>>+/* Look up the dcookie for the task's first VM_EXECUTABLE mapping,
>>+ * which corresponds loosely to "application name". Also, determine
>>+ * the offset for the SPU ELF object.  If computed offset is 
>>+ * non-zero, it implies an embedded SPU object; otherwise, it's a
>>+ * separate SPU binary, in which case we retrieve it's dcookie.
>>+ */
>>+static unsigned long 
>>+get_exec_dcookie_and_offset(
>>+       struct spu * spu, unsigned int * offsetp,
>>+       unsigned long * spu_bin_dcookie,
>>+       unsigned int spu_ref)
>>+{
>>+        unsigned long cookie = 0;
>>+       unsigned int my_offset = 0;
>>+        struct vm_area_struct * vma;
>>+       struct mm_struct * mm = spu->mm;
> 
> 
> indenting
uh-huh
> 
> 
>>+        if (!mm)
>>+                goto OUT;
>>+
>>+        for (vma = mm->mmap; vma; vma = vma->vm_next) {
>>+                if (!vma->vm_file)
>>+                        continue;
>>+                if (!(vma->vm_flags & VM_EXECUTABLE))
>>+                        continue;
>>+                cookie = fast_get_dcookie(vma->vm_file->f_dentry,
>>+                        vma->vm_file->f_vfsmnt);
>>+               pr_debug("got dcookie for %s\n",
>>+                         vma->vm_file->f_dentry->d_name.name);
>>+                break;
>>+        }
>>+
>>+       for (vma = mm->mmap; vma; vma = vma->vm_next) {
>>+               if (vma->vm_start > spu_ref || vma->vm_end < spu_ref)
>>+                       continue;
>>+               my_offset = spu_ref - vma->vm_start;
>>+               pr_debug("Found spu ELF at "
>>+                         " %X for file %s\n", my_offset,
>>+                         vma->vm_file->f_dentry->d_name.name);
>>+               *offsetp = my_offset;
>>+               if (my_offset == 0) {
>>+                       if (!vma->vm_file) {
>>+                               goto FAIL_NO_SPU_COOKIE;
>>+                       }
>>+                       *spu_bin_dcookie = fast_get_dcookie(
>>+                               vma->vm_file->f_dentry,
>>+                               vma->vm_file->f_vfsmnt);
>>+                       pr_debug("got dcookie for %s\n",
>>+                                 vma->vm_file->f_dentry->d_name.name);
>>+               }
>>+               break;
>>+       }
>>+       
>>+OUT:
>>+        return cookie;
>>+
>>+FAIL_NO_SPU_COOKIE:
>>+       printk(KERN_ERR "SPU_PROF: "
>>+              "%s, line %d: Cannot find dcookie for SPU binary\n",
>>+              __FUNCTION__, __LINE__);
>>+       goto OUT;
>>+}
>>+
>>+
>>+
>>+/* This function finds or creates cached context information for the
>>+ * passed SPU and records SPU context information into the OProfile
>>+ * event buffer.
>>+ */
>>+static int process_context_switch(struct spu * spu, unsigned int objectId)
>>+{
>>+       unsigned long flags = 0;
>>+       int retval = 0;
>>+       unsigned int offset = 0;
>>+       unsigned long spu_cookie = 0, app_dcookie = 0;
>>+       retval = prepare_cached_spu_info(spu, objectId);
>>+       if (retval == -1) {
>>+               goto OUT;
>>+       }
>>+        /* Get dcookie first because a mutex_lock is taken in that
>>+        * code path, so interrupts must not be disabled.
>>+        */
>>+       app_dcookie = get_exec_dcookie_and_offset(spu, &offset, 
>>+                                                 &spu_cookie, objectId);
>>+
>>+        /* Record context info in event buffer */
>>+       spin_lock_irqsave(&buffer_lock, flags);
>>+       add_event_entry(ESCAPE_CODE);
>>+       add_event_entry(SPU_CTX_SWITCH_CODE);
>>+       add_event_entry(spu->number);
>>+       add_event_entry(spu->pid);
>>+       add_event_entry(spu->tgid);
>>+       add_event_entry(app_dcookie);
>>+
>>+       add_event_entry(ESCAPE_CODE);
>>+       if (offset) {
>>+         /* When offset is non-zero,  this means the SPU ELF was embedded;
>>+          * otherwise, it was loaded from a separate binary file.  For the
>>+          * embedded case, we record the offset of the SPU ELF into the PPU
>>+          * executable; for the non-embedded case, we record a dcookie that
>>+          * points to the location of the SPU binary that was loaded.
>>+          */
>>+               add_event_entry(SPU_OFFSET_CODE);
>>+               add_event_entry(offset);
>>+       } else {
>>+               add_event_entry(SPU_COOKIE_CODE);
>>+               add_event_entry(spu_cookie);
>>+       }
> 
> 
> I don't get it. What is the app_dcookie used for? If the spu binary is
> embedded into a library, you are still missing the dcookie to that .so file,
> because you return only an offset.
For embedded SPU app, the post-processing tool opens the PPE binary app 
file and obtains the SPU ELF embedded thereine, and from that, we obtain 
the name of the SPU binary.  Also, the app name is included in the 
report, along with the SPU binary filename, if the report contains 
samples from more than one application.
> 
> <nitpicking>
> 
>>+       unsigned long flags = 0;
> 
> 
> no need to initialize
> 
> 
>>+       struct spu * the_spu = (struct spu *) data;
> 
> 
> no need for the cast
> 
> 
>>+       pr_debug("SPU event notification arrived\n");
>>+       if (val == 0){
> 
> 
> if (!val)
> 
> 
>>+       pr_debug("spu_sync_start -- running.\n");
>>+OUT:
> 
> 
> out:
> 
> 
>>+       return ret;     
>>+}
> 
> 
> </nitpicking>
OK, we'll fix up.
> 
> 
> 
>>@@ -480,7 +491,22 @@
>>               struct op_system_config *sys, int num_ctrs)
>> {
>>        int i, j, cpu;
>>+       spu_cycle_reset = 0;
>> 
>>+       /* The cpufreq_quick_get function requires that cbe_cpufreq module
>>+        * be loaded.  This function is not actually provided and exported
>>+        * by cbe_cpufreq, but it relies on cbe_cpufreq initialize kernel
>>+        * data structures.  Since there's no way for depmod to realize
>>+        * that our OProfile module depends on cbe_cpufreq, we currently
>>+        * are letting the userspace tool, opcontrol, ensure that the
>>+        * cbe_cpufreq module is loaded.
>>+        */
>>+       khzfreq = cpufreq_quick_get(smp_processor_id());
> 
> 
> You should probably have a fallback in here in case the cpufreq module
> is not loaded. There is a global variable ppc_proc_freq (in Hz) that
> you can access.
Our userspace tool ensures the cpufreq module is loaded.
> 
>>        ;
>> }
>> 
>>-static void cell_global_start(struct op_counter_config *ctr)
>>+static int calculate_lfsr(int n)
>>+{
>>+#define size 24
>>+       int i;
>>+       unsigned int newlfsr0;
>>+       unsigned int lfsr = 0xFFFFFF;
>>+       unsigned int howmany = lfsr - n;
>>+
>>+       for (i = 2; i < howmany + 2; i++) {
>>+               newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
>>+                           ((lfsr >> (size - 1 - 1)) & 1) ^
>>+                           (((lfsr >> (size - 1 - 6)) & 1) ^
>>+                            ((lfsr >> (size - 1 - 23)) & 1)));
>>+
>>+               lfsr >>= 1;
>>+               lfsr = lfsr | (newlfsr0 << (size - 1));
>>+       }
>>+       return lfsr;
>>+
>>+}
> 
> 
> I don't have the slightest idea what this code is about, but
Me neither.  Carl, can you comment?
> it certainly looks inefficient to loop 16 million times to
> compute a constant. Could you use a faster algorithm instead,
> or at least add a comment about why you do it this way?
> 
> 
>>+static void cell_global_stop(void)
>>+{
>>+       if (spu_cycle_reset) {
>>+               cell_global_stop_spu();
>>+       } else {
>>+               cell_global_stop_ppu();
>>+       }
>>+
>>+}
> 
> 
> This looks weird as well. I suppose it's a limitation of the hardware
> that you can only do either ppu or spu profiling. However, making that
> dependent of whether the 'spu_cycle_reset' variable is set sounds
> rather bogus.
It is the only file-scoped variable relating to SPU profiling, and will 
always be non-zero when the user selects to perform SPU profiling. 
Seemed like a logical-enough choice to me.
> 
> I don't know what the best interface for choosing the target from
> user space would be, but you probably also want to be able to
> switch between them at run time.
The hardware setup is so completely different, I don't think there's a 
viable way of switching between PPU profiling and SPU profiling without 
a "stop" in between.
> 
> 	Arnd <><



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 10:39   ` Christoph Hellwig
@ 2007-01-30 22:49     ` Carl Love
  2007-01-30 22:57       ` Benjamin Herrenschmidt
  2007-01-30 22:59       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 38+ messages in thread
From: Carl Love @ 2007-01-30 22:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Maynard Johnson, linuxppc-dev, cbe-oss-dev, oprofile-list, linux-kernel

Christoph:

In our earlier work on the PPU profiling patch, Benjamin Herrenschmidt
said that we should remove macros that are only used once and just put
the actual code in.  That is why the macros were removed. 

            Carl Love


On Tue, 2007-01-30 at 11:39 +0100, Christoph Hellwig wrote:
> On Mon, Jan 29, 2007 at 01:46:50PM -0600, Maynard Johnson wrote:
> > 
> > 
> 
> I don't think the macro removal is helpful, getting rid of the names
> makes the code less readable to me.
> _______________________________________________
> cbe-oss-dev mailing list
> cbe-oss-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/cbe-oss-dev


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 21:41     ` Maynard Johnson
@ 2007-01-30 22:54       ` Maynard Johnson
  2007-01-30 23:34         ` Benjamin Herrenschmidt
  2007-01-31  6:52         ` Arnd Bergmann
  2007-01-30 23:31       ` Carl Love
  2007-01-31  5:57       ` Arnd Bergmann
  2 siblings, 2 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-30 22:54 UTC (permalink / raw)
  Cc: Arnd Bergmann, linuxppc-dev, cbe-oss-dev, oprofile-list, linux-kernel

Maynard Johnson wrote:

> Arnd Bergmann wrote:
> 
> 
>>On Monday 29 January 2007 20:48, Maynard Johnson wrote:
>>
>>
>>>Subject: Add support to OProfile for profiling Cell BE SPUs
>>>
>>>From: Maynard Johnson <maynardj@us.ibm.com>
>>>
>>>This patch updates the existing arch/powerpc/oprofile/op_model_cell.c
>>>to add in the SPU profiling capabilities.  In addition, a 'cell' subdirectory
>>>was added to arch/powerpc/oprofile to hold Cell-specific SPU profiling
>>>code.
>>>
>>>Signed-off-by: Carl Love <carll@us.ibm.com>
>>>Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
>>
[snip]
>>
>>>+ * 
>>>+ * Ideally, we would like to be able to create the cached_info for
>>>+ * an SPU task just one time -- when libspe first loads the SPU 
>>>+ * binary file.  We would store the cached_info in a list.  Then, as
>>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>>+ * for inactive tasks would be kept, and the active one would be placed
>>>+ * at the head of the list.  But this technique may not with
>>>+ * current spufs functionality since the spu used in bind_context may
>>>+ * be a different spu than was used in a previous bind_context for a
>>>+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
>>>+ * assigned to run on a different physical SPE.  We will investigate
>>>+ * further if this can be done.
>>>+ *
>>>+ */
>>
>>
>>You should stuff a pointer to cached_info into struct spu_context,
>>e.g. 'void *profile_private'.
>>
>>
>>
>>>+struct cached_info {
>>>+       vma_map_t * map;
>>>+       struct spu * the_spu;
>>>+       struct kref cache_ref;
>>>+       struct list_head list;
>>>+};
>>
>>
>>And replace the 'the_spu' member with a back pointer to the
>>spu_context if you need it.
>>
>>
>>
>>>+
>>>+/* A data structure for cached information about active SPU tasks.
>>>+ * Storage is dynamically allocated, sized as
>>>+ * "number of active nodes multplied by 8". 
>>>+ * The info_list[n] member holds 0 or more 
>>>+ * 'struct cached_info' objects for SPU#=n. 
>>>+ *
>>>+ * As currently implemented, there will only ever be one cached_info 
>>>+ * in the list for a given SPU.  If we can devise a way to maintain
>>>+ * multiple cached_infos in our list, then it would make sense
>>>+ * to also cache the dcookie representing the PPU task application.
>>>+ * See above description of struct cached_info for more details.
>>>+ */
>>>+struct spu_info_stacks {
>>>+       struct list_head * info_list;
>>>+};
>>
>>
>>Why do you store pointers to list_head structures? If you want to store
>>lists, you should have a lists_head itself in here.
> 
> info_list is an array of n lists, where n is the number of SPUs.
> 
>>Why do you store them per spu in the first place? The physical spu
>>doesn't have any relevance to this at all, the only data that is
>>per spu is the sample data collected on a profiling interrupt,
>>which you can then copy in the per-context data on a context switch.
> 
> The sample data is written out to the event buffer on every profiling 
> interrupt.  But we don't write out the SPU program counter samples 
> directly to the event buffer.  First, we have to find the cached_info 
> for the appropriate SPU context to retrieve the cached vma-to-fileoffset 
> map.  Then we do the vma_map_lookup to find the fileoffset corresponding 
> to the SPU PC sample, which we then write out to the event buffer.  This 
> is one of the most time-critical pieces of the SPU profiling code, so I 
> used an array to hold the cached_info for fast random access.  But as I 
> stated in a code comment above, the negative implication of this current 
> implementation is that the array can only hold the cached_info for 
> currently running SPU tasks.  I need to give this some more thought.

I've given this some more thought, and I'm coming to the conclusion that 
a pure array-based implementation for holding cached_info (getting rid 
of the lists) would work well for the vast majority of cases in which 
OProfile will be used.  Yes, it is true that the mapping of an SPU 
context to a phsyical spu-numbered array location cannot be guaranteed 
to stay valid, and that's why I discard the cached_info at that array 
location when the SPU task is switched out.  Yes, it would be terribly 
inefficient if the same SPU task gets switched back in later and we 
would have to recreate the cached_info.  However, I contend that 
OProfile users are interested in profiling one application at a time. 
They are not going to want to muddy the waters with multiple SPU apps 
running at the same time.  I can't think of any reason why someone would 
conscisouly choose to do that.

Any thoughts from the general community, especially OProfile users?

Thanks.
-Maynard
> 
>>
[snip]


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 22:49     ` Carl Love
@ 2007-01-30 22:57       ` Benjamin Herrenschmidt
  2007-01-31  8:47         ` Christoph Hellwig
  2007-01-30 22:59       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-30 22:57 UTC (permalink / raw)
  To: Carl Love
  Cc: Christoph Hellwig, linuxppc-dev, cbe-oss-dev, oprofile-list,
	linux-kernel

On Tue, 2007-01-30 at 14:49 -0800, Carl Love wrote:
> Christoph:
> 
> In our earlier work on the PPU profiling patch, Benjamin Herrenschmidt
> said that we should remove macros that are only used once and just put
> the actual code in.  That is why the macros were removed. 

Heh... there is a balance to be found... In some cases, inline functions
might be better too.

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 22:49     ` Carl Love
  2007-01-30 22:57       ` Benjamin Herrenschmidt
@ 2007-01-30 22:59       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-30 22:59 UTC (permalink / raw)
  To: Carl Love
  Cc: Christoph Hellwig, linuxppc-dev, cbe-oss-dev, oprofile-list,
	linux-kernel

On Tue, 2007-01-30 at 14:49 -0800, Carl Love wrote:
> Christoph:
> 
> In our earlier work on the PPU profiling patch, Benjamin Herrenschmidt
> said that we should remove macros that are only used once and just put
> the actual code in.  That is why the macros were removed. 

I've looked at the macros you remove and indeed, they look like stuff
you actually want to keep in macros. I don't have off the top of my head
the circumstances where I asked you to remove macros in the PPE code,
but I'm sure it was different.

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 10:41       ` Christoph Hellwig
@ 2007-01-30 23:09         ` Maynard Johnson
  0 siblings, 0 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-30 23:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Benjamin Herrenschmidt, linuxppc-dev, oprofile-list, cbe-oss-dev,
	Arnd Bergmann, linux-kernel

Christoph Hellwig wrote:

> On Tue, Jan 30, 2007 at 06:53:50PM +1100, Benjamin Herrenschmidt wrote:
> 
>>>>+/* Defines used for sync_start */
>>>>+#define SKIP_GENERIC_SYNC 0
>>>>+#define SYNC_START_ERROR -1
>>>>+#define DO_GENERIC_SYNC 1
>>>>+
>>>>+typedef struct vma_map
>>>>+{
>>>>+       struct vma_map *next;
>>>>+       unsigned int vma;
>>>>+       unsigned int size;
>>>>+       unsigned int offset;
>>>>+       unsigned int guard_ptr;
>>>>+       unsigned int guard_val;
>>>>+} vma_map_t;
>>
>>I haven't had time to look in details yet but in that context, what does
>>"vma" stands for ? There's already an important vm data structure in
>>linux routinely called "vma" and thus I suspect this is a poor naming
>>choice as it will cause confusion.
> 
> 
> It looks like it actually is dealing with vma to me.  But then again:
> 
>  - please don't use typedefs for structures
>  - there might be a more descriptive name for this than just vma_map
Yes, I'll come up with some (hopefully) better name.
> 
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> oprofile-list mailing list
> oprofile-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oprofile-list



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 21:41     ` Maynard Johnson
  2007-01-30 22:54       ` Maynard Johnson
@ 2007-01-30 23:31       ` Carl Love
  2007-01-31  1:25         ` Christian Krafft
  2007-01-31  6:06         ` Arnd Bergmann
  2007-01-31  5:57       ` Arnd Bergmann
  2 siblings, 2 replies; 38+ messages in thread
From: Carl Love @ 2007-01-30 23:31 UTC (permalink / raw)
  To: maynardj
  Cc: Arnd Bergmann, linuxppc-dev, cbe-oss-dev, oprofile-list, linux-kernel

On Tue, 2007-01-30 at 15:41 -0600, Maynard Johnson wrote:
[snip
> [snip]
> >>+       // process the collected SPU PC for each node
> >>+       for_each_online_cpu(cpu) {
> >>+               if (cbe_get_hw_thread_id(cpu))
> >>+                       continue;
> >>+
> >>+               node = cbe_cpu_to_node(cpu);
> >>+               node_factor = node * SPUS_PER_NODE;
> >>+                /* number of valid entries for this node */
> >>+               entry = 0;
> >>+
> >>+               trace_addr = cbe_read_pm(cpu, trace_address);
> >>+               while ((trace_addr & CBE_PM_TRACE_BUF_EMPTY) != 0x400)
> >>+               {
> >>+                       /* there is data in the trace buffer to process */
> >>+                       cbe_read_trace_buffer(cpu, trace_buffer);  
> >>+                       spu_mask = 0xFFFF000000000000;
> >>+
> >>+                       /* Each SPU PC is 16 bits; hence, four spus in each of 
> >>+                        * the two 64-bit buffer entries that make up the
> >>+                        * 128-bit trace_buffer entry.  Process the upper and
> >>+                        * lower 64-bit values simultaneously.
> >>+                        */
> >>+                       for (spu = 0; spu < SPUS_PER_TB_ENTRY; spu++) {
> >>+                               spu_pc_lower = spu_mask & trace_buffer[0];
> >>+                               spu_pc_lower = spu_pc_lower >> (NUM_SPU_BITS_TRBUF
> >>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+                               spu_pc_upper = spu_mask & trace_buffer[1];
> >>+                               spu_pc_upper = spu_pc_upper >> (NUM_SPU_BITS_TRBUF
> >>+                                                               * (SPUS_PER_TB_ENTRY-spu-1));
> >>+
> >>+                               spu_mask = spu_mask >> NUM_SPU_BITS_TRBUF;
> >>+
> >>+                               /* spu PC trace entry is upper 16 bits of the
> >>+                                * 18 bit SPU program counter 
> >>+                                */
> >>+                               spu_pc_lower = spu_pc_lower << 2;
> >>+                               spu_pc_upper = spu_pc_upper << 2;
> >>+
> >>+                               samples[((node_factor + spu) * TRACE_ARRAY_SIZE) + entry]
> >>+                                       = (u32) spu_pc_lower;
> >>+                               samples[((node_factor + spu + SPUS_PER_TB_ENTRY) * TRACE_ARRAY_SIZE) + entry]
> >>+                                       = (u32) spu_pc_upper;
> >>+                       }
> >>+
> >>+                       entry++;
> >>+
> >>+                       if (entry >= TRACE_ARRAY_SIZE) 
> >>+                               /* spu_samples is full */
> >>+                               break;
> >>+
> >>+                       trace_addr = cbe_read_pm(cpu, trace_address);
> >>+               }
> >>+               samples_per_node[node] = entry;
> >>+       }
> >>+}
> > 
> > 
> > While I can't see anything technically wrong with this function, it would be
> > good to split it into smaller functions. Since you are nesting three
> > loops, it should be possible to make a separate function from one of the
> > inner loops without changing the actual logic behind it.
> Will do.
> > 
> > 
> >>+
> >>+static int profile_spus(struct hrtimer * timer)
> >>+{
> >>+       ktime_t kt;
> >>+        int cpu, node, k, num_samples, spu_num;
> > 
> > 
> > whitespace damage
> fixed
> > 
> > 
> >>+       
> >>+       if (!spu_prof_running)
> >>+               goto STOP;
> >>+
> >>+       cell_spu_pc_collection();
> >>+       for_each_online_cpu(cpu) {
> >>+               if (cbe_get_hw_thread_id(cpu))
> >>+                       continue;
> > 
> > 
> > Here, you enter the same top-level loop again, why not make it
> > 	for_each_online_cpu(cpu) {
> > 		if (cbe_get_hw_thread_id(cpu))
> >                          continue;
> > 		num_samples = cell_spu_pc_collection(cpu);
> > 		...
> Yes, good suggestion.
I believe what you are asking here is why can't the
cell_spu_pc_collection() function put the data in to the samples array
for a given node, in the loop that then processes the samples array for
that node.  Yes, I believe that this can be done.  The only restriction
is that cell_spu_pc_collection() will have to extract the SPU program
counter data for all SPUs on that node.  This is due to the fact the
data for the 8 SPUs are all stored in a single entry of the hardware
trace buffer.  Once the hardware trace buffer is read, the hardware
advances the read pointer so there is no way to go back and re-read the
entry.

[snip]


[snip]
> >>+static int calculate_lfsr(int n)
> >>+{
> >>+#define size 24
> >>+       int i;
> >>+       unsigned int newlfsr0;
> >>+       unsigned int lfsr = 0xFFFFFF;
> >>+       unsigned int howmany = lfsr - n;
> >>+
> >>+       for (i = 2; i < howmany + 2; i++) {
> >>+               newlfsr0 = (((lfsr >> (size - 1 - 0)) & 1) ^
> >>+                           ((lfsr >> (size - 1 - 1)) & 1) ^
> >>+                           (((lfsr >> (size - 1 - 6)) & 1) ^
> >>+                            ((lfsr >> (size - 1 - 23)) & 1)));
> >>+
> >>+               lfsr >>= 1;
> >>+               lfsr = lfsr | (newlfsr0 << (size - 1));
> >>+       }
> >>+       return lfsr;
> >>+
> >>+}
> > 
> > 
> > I don't have the slightest idea what this code is about, but
> Me neither.  Carl, can you comment?
> > it certainly looks inefficient to loop 16 million times to
> > compute a constant. Could you use a faster algorithm instead,
> > or at least add a comment about why you do it this way?
> > 
> > 

An LFSR sequence is similar to a pseudo random number sequence. For a 24
bit LFSR sequence each number between 0 and 2^24 will occur once in the
sequence but not in a normal counting order.  The hardware uses the LFSR
sequence to count to since it is much simpler to implement in hardware
then a normal counter.  Unfortunately, the only way we know how to
figure out what the LFSR value that corresponds to the number in the
sequence that is N before the last value (0xFFFFFF) is to calculate the
previous value N times.  It is like trying to ask what is the pseudo
random number that is N before this pseudo random number?

I will add a short comment to the code that will summarize the above
paragraph.
[snip]


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 22:54       ` Maynard Johnson
@ 2007-01-30 23:34         ` Benjamin Herrenschmidt
  2007-01-31  0:29           ` Maynard Johnson
  2007-01-31  6:52         ` Arnd Bergmann
  1 sibling, 1 reply; 38+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-30 23:34 UTC (permalink / raw)
  To: maynardj
  Cc: linuxppc-dev, oprofile-list, cbe-oss-dev, Arnd Bergmann, linux-kernel


> I've given this some more thought, and I'm coming to the conclusion that 
> a pure array-based implementation for holding cached_info (getting rid 
> of the lists) would work well for the vast majority of cases in which 
> OProfile will be used.  Yes, it is true that the mapping of an SPU 
> context to a phsyical spu-numbered array location cannot be guaranteed 
> to stay valid, and that's why I discard the cached_info at that array 
> location when the SPU task is switched out.  Yes, it would be terribly 
> inefficient if the same SPU task gets switched back in later and we 
> would have to recreate the cached_info.  However, I contend that 
> OProfile users are interested in profiling one application at a time. 
> They are not going to want to muddy the waters with multiple SPU apps 
> running at the same time.  I can't think of any reason why someone would 
> conscisouly choose to do that.
> 
> Any thoughts from the general community, especially OProfile users?

Well, it's my understanding that quite a few typical usage scenario
involve different tasks running on different SPUs passing each other
data around.

Ben.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 2/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30  4:08   ` [Cbe-oss-dev] " Arnd Bergmann
@ 2007-01-30 23:51     ` Carl Love
  0 siblings, 0 replies; 38+ messages in thread
From: Carl Love @ 2007-01-30 23:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: cbe-oss-dev, maynardj, linuxppc-dev, linux-kernel, oprofile-list

Arnd:

Well most of the patch is simply cleaning up the
pm_rtas_activate_signals() routine.  I would be think
"pm_rtas_activat_signals routine cleanup" would be good.

                   Carl Love


On Tue, 2007-01-30 at 05:08 +0100, Arnd Bergmann wrote:
> On Monday 29 January 2007 20:47, Maynard Johnson wrote:
> >   The code was setting up the debug bus for group 21 when profiling on the 
> > event PPU CYCLES.  The debug bus is not actually used by the hardware 
> > performance counters when counting PPU CYCLES.  Setting up the debug bus
> > for PPU CYCLES causes signal routing conflicts on the debug bus when 
> > profiling PPU cycles and another PPU event.  This patch fixes the code to 
> > only setup the debug bus to route the performance signals for the non
> > PPU CYCLE events.
> > 
> > Signed-off-by: Maynard Johnson <mpjohn@us.ibm.com>
> > Signed-off-by: Carl Love <carll@us.ibm.com>
> 
> Acked-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
> 
> Any suggestion for a one-line patch title?
> _______________________________________________
> cbe-oss-dev mailing list
> cbe-oss-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/cbe-oss-dev


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 23:34         ` Benjamin Herrenschmidt
@ 2007-01-31  0:29           ` Maynard Johnson
  0 siblings, 0 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-01-31  0:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev, oprofile-list, cbe-oss-dev, Arnd Bergmann, linux-kernel

Benjamin Herrenschmidt wrote:

>>I've given this some more thought, and I'm coming to the conclusion that 
>>a pure array-based implementation for holding cached_info (getting rid 
>>of the lists) would work well for the vast majority of cases in which 
>>OProfile will be used.  Yes, it is true that the mapping of an SPU 
>>context to a phsyical spu-numbered array location cannot be guaranteed 
>>to stay valid, and that's why I discard the cached_info at that array 
>>location when the SPU task is switched out.  Yes, it would be terribly 
>>inefficient if the same SPU task gets switched back in later and we 
>>would have to recreate the cached_info.  However, I contend that 
>>OProfile users are interested in profiling one application at a time. 
>>They are not going to want to muddy the waters with multiple SPU apps 
>>running at the same time.  I can't think of any reason why someone would 
>>conscisouly choose to do that.
>>
>>Any thoughts from the general community, especially OProfile users?
> 
> 
> Well, it's my understanding that quite a few typical usage scenario
> involve different tasks running on different SPUs passing each other
> data around.
That shouldn't be a problem.  I would consider this to be "one large 
application" consisting of multiple SPU binaries running simultaneously. 
  Such a scenario can be handled with no negative performance impact 
using a simple 16 element array of cached_info objects -- as long as 
there isn't (much) SPU task switching being done.

-Maynard
> 
> Ben.
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 3/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 15:31     ` Maynard Johnson
@ 2007-01-31  0:35       ` Arnd Bergmann
  0 siblings, 0 replies; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-31  0:35 UTC (permalink / raw)
  To: maynardj; +Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

On Tuesday 30 January 2007 16:31, Maynard Johnson wrote:
> 
> > On solution would be to move the notify_active flag from ctx->spu
> > into ctx itself, but maybe there are other ways to solve this.
> In an earlier review of this patch, Christopher Hellwig suggested I move 
> the notify_active flag to be a bit in the sched_flags field that's added 
> in his scheduler patch series.  If this patch series will be a available 
> in an "Arnd" tree that we'll be using for our current OProfile 
> development, perhaps I should wait until that time to change this, since 
> the window of vulnerability is quite small.  What do you think?

Sounds good to me.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 23:31       ` Carl Love
@ 2007-01-31  1:25         ` Christian Krafft
  2007-01-31  6:06         ` Arnd Bergmann
  1 sibling, 0 replies; 38+ messages in thread
From: Christian Krafft @ 2007-01-31  1:25 UTC (permalink / raw)
  To: Carl Love
  Cc: maynardj, linuxppc-dev, oprofile-list, cbe-oss-dev,
	Arnd Bergmann, linux-kernel

On Tue, 30 Jan 2007 15:31:09 -0800
Carl Love <cel@us.ibm.com> wrote:

> An LFSR sequence is similar to a pseudo random number sequence. For a 24
> bit LFSR sequence each number between 0 and 2^24 will occur once in the
> sequence but not in a normal counting order.  The hardware uses the LFSR
> sequence to count to since it is much simpler to implement in hardware
> then a normal counter.  Unfortunately, the only way we know how to
> figure out what the LFSR value that corresponds to the number in the
> sequence that is N before the last value (0xFFFFFF) is to calculate the
> previous value N times.  It is like trying to ask what is the pseudo
> random number that is N before this pseudo random number?

That should be no problem. 
You can just revers your algorithm and let it run x times instead of 0xFFFFFF-x.

> 
> I will add a short comment to the code that will summarize the above
> paragraph.
> [snip]
> 
> _______________________________________________
> cbe-oss-dev mailing list
> cbe-oss-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/cbe-oss-dev


-- 
Mit freundlichen Grüssen,
kind regards,

Christian Krafft
IBM Systems & Technology Group, 
Linux Kernel Development
IT Specialist

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 21:41     ` Maynard Johnson
  2007-01-30 22:54       ` Maynard Johnson
  2007-01-30 23:31       ` Carl Love
@ 2007-01-31  5:57       ` Arnd Bergmann
  2007-02-02 19:27         ` Maynard Johnson
  2 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-31  5:57 UTC (permalink / raw)
  To: maynardj
  Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list, Anton Blanchard

On Tuesday 30 January 2007 22:41, Maynard Johnson wrote:
> Arnd Bergmann wrote:

> >>+       kt = ktime_set(0, profiling_interval);
> >>+       if (!spu_prof_running)
> >>+               goto STOP;
> >>+       hrtimer_forward(timer, timer->base->get_time(), kt);
> >>+       return HRTIMER_RESTART;
> > 
> > 
> > is hrtimer_forward really the right interface here? You are ignoring
> > the number of overruns anyway, so hrtimer_start(,,) sounds more
> > correct to me.
> According to Tom Gleixner, "hrtimer_forward is a convenience function to 
> move the expiry time of a timer forward in multiples of the interval, so 
> it is in the future.  After setting the expiry time you restart the 
> timer either with [sic] a return HRTIMER_RESTART (if you are in the 
> timer callback function)."
> > 

Ok, I see. Have you seen the timer actually coming in late, resulting
in hrtimer_forward returning non-zero? I guess it's not a big problem
for statistic data collection if that happens, but you might still want
to be able to see it.

> >>+       /* Since cpufreq_quick_get returns frequency in kHz, we use
> >>+        * USEC_PER_SEC here vs NSEC_PER_SEC.
> >>+        */
> >>+       unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
> >>+       profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
> >>+       
> >>+       pr_debug("timer resolution: %lu\n", 
> >>+                TICK_NSEC);
> > 
> > 
> > Don't you need to adapt the profiling_interval at run time, when cpufreq
> > changes the core frequency? You should probably use
> > cpufreq_register_notifier() to update this.
> Since OProfile is a statistical profiler, the exact frequency is not 
> critical.  The user is going to be looking for hot spots in their code, 
> so it's all relative.  With that said,  I don't imagine using the 
> cpufreq notiication would be a big deal.  We'll look at it.
>
> >>@@ -480,7 +491,22 @@
> >>               struct op_system_config *sys, int num_ctrs)
> >> {
> >>        int i, j, cpu;
> >>+       spu_cycle_reset = 0;
> >> 
> >>+       /* The cpufreq_quick_get function requires that cbe_cpufreq module
> >>+        * be loaded.  This function is not actually provided and exported
> >>+        * by cbe_cpufreq, but it relies on cbe_cpufreq initialize kernel
> >>+        * data structures.  Since there's no way for depmod to realize
> >>+        * that our OProfile module depends on cbe_cpufreq, we currently
> >>+        * are letting the userspace tool, opcontrol, ensure that the
> >>+        * cbe_cpufreq module is loaded.
> >>+        */
> >>+       khzfreq = cpufreq_quick_get(smp_processor_id());
> > 
> > 
> > You should probably have a fallback in here in case the cpufreq module
> > is not loaded. There is a global variable ppc_proc_freq (in Hz) that
> > you can access.
>
> Our userspace tool ensures the cpufreq module is loaded.

You should not rely on user space tools to do the right thing in the kernel.

Moreover, if the exact frequency is not that important, as you mentioned
above, you can probably just hardcode a compile-time constant here.

> >>+ * 
> >>+ * Ideally, we would like to be able to create the cached_info for
> >>+ * an SPU task just one time -- when libspe first loads the SPU 
> >>+ * binary file.  We would store the cached_info in a list.  Then, as
> >>+ * SPU tasks are switched out and new ones switched in, the cached_info
> >>+ * for inactive tasks would be kept, and the active one would be placed
> >>+ * at the head of the list.  But this technique may not with
> >>+ * current spufs functionality since the spu used in bind_context may
> >>+ * be a different spu than was used in a previous bind_context for a
> >>+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
> >>+ * assigned to run on a different physical SPE.  We will investigate
> >>+ * further if this can be done.
> >>+ *
> >>+ */
> > 
> > 
> > You should stuff a pointer to cached_info into struct spu_context,
> > e.g. 'void *profile_private'.
> > 
> > 
> >>+struct cached_info {
> >>+       vma_map_t * map;
> >>+       struct spu * the_spu;
> >>+       struct kref cache_ref;
> >>+       struct list_head list;
> >>+};
> > 
> > 
> > And replace the 'the_spu' member with a back pointer to the
> > spu_context if you need it.
> > 
> > 
> >>+
> >>+/* A data structure for cached information about active SPU tasks.
> >>+ * Storage is dynamically allocated, sized as
> >>+ * "number of active nodes multplied by 8". 
> >>+ * The info_list[n] member holds 0 or more 
> >>+ * 'struct cached_info' objects for SPU#=n. 
> >>+ *
> >>+ * As currently implemented, there will only ever be one cached_info 
> >>+ * in the list for a given SPU.  If we can devise a way to maintain
> >>+ * multiple cached_infos in our list, then it would make sense
> >>+ * to also cache the dcookie representing the PPU task application.
> >>+ * See above description of struct cached_info for more details.
> >>+ */
> >>+struct spu_info_stacks {
> >>+       struct list_head * info_list;
> >>+};
> > 
> > 
> > Why do you store pointers to list_head structures? If you want to store
> > lists, you should have a lists_head itself in here.
> info_list is an array of n lists, where n is the number of SPUs.

My point was that it's not an array of lists, but an array of pointers
to lists. The way that include/list.h works is by having a struct list_head
as the anchor and then add nodes to it. By simply pointing to a list_head,
you won't be able to use the list_for_each_entry() macro the way it is meant.

> > 
> > I don't get it. What is the app_dcookie used for? If the spu binary is
> > embedded into a library, you are still missing the dcookie to that .so file,
> > because you return only an offset.
> For embedded SPU app, the post-processing tool opens the PPE binary app 
> file and obtains the SPU ELF embedded thereine, and from that, we obtain 
> the name of the SPU binary.  Also, the app name is included in the 
> report, along with the SPU binary filename, if the report contains 
> samples from more than one application.

That's not what I meant. If you have an embedded spu ELF file in a shared
library object, you end up recording the file of the main PPE binary, and
the offset of the SPU ELF inside of the .so file, but not the name of the
.so file itself.

You also say that the you record the name of the application on purpose for
separate SPU ELF files. My assumption was that this is not necessary, but
I don't know much about that aspect of oprofile. My feeling is that the
samples for an external SPU ELF file should be handled the same way that
oprofile handles events in shared libraries on the PPU (e.g. in libc.so).

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 23:31       ` Carl Love
  2007-01-31  1:25         ` Christian Krafft
@ 2007-01-31  6:06         ` Arnd Bergmann
  1 sibling, 0 replies; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-31  6:06 UTC (permalink / raw)
  To: cbe-oss-dev
  Cc: Carl Love, maynardj, linuxppc-dev, oprofile-list, linux-kernel

On Wednesday 31 January 2007 00:31, Carl Love wrote:
> Unfortunately, the only way we know how to
> figure out what the LFSR value that corresponds to the number in the
> sequence that is N before the last value (0xFFFFFF) is to calculate the
> previous value N times.  It is like trying to ask what is the pseudo
> random number that is N before this pseudo random number?

Well, you can at least implement the lfsr both ways, and choose the one
that is faster to get at, like

u32 get_lfsr(u32 v)
{
	int i;
	u32 r = 0xffffff;
	if (v < 0x7fffff) {
		for (i = 0; i < v; i++)
			r = lfsr_forwards(r);
	} else {
		for (i = 0; i < (0x1000000 - v); i++)
			r = lfsr_backwards(r);
	}
	return r;
}

Also, if the value doesn't have to be really exact, you could have
a small lookup table with precomputed values, like:

u32 get_lfsr(u32 v)
{
	static const lookup[256] = {
		0xab3492, 0x3e3f34, 0xc47610c, ... /* insert actual values */
	};

	return lookup[v >> 16];
}

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 22:54       ` Maynard Johnson
  2007-01-30 23:34         ` Benjamin Herrenschmidt
@ 2007-01-31  6:52         ` Arnd Bergmann
  2007-02-02 16:47           ` Maynard Johnson
  1 sibling, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-01-31  6:52 UTC (permalink / raw)
  To: cbe-oss-dev, maynardj; +Cc: linuxppc-dev, oprofile-list, linux-kernel

On Tuesday 30 January 2007 23:54, Maynard Johnson wrote:

> >>Why do you store them per spu in the first place? The physical spu
> >>doesn't have any relevance to this at all, the only data that is
> >>per spu is the sample data collected on a profiling interrupt,
> >>which you can then copy in the per-context data on a context switch.
> > 
> > The sample data is written out to the event buffer on every profiling 
> > interrupt.  But we don't write out the SPU program counter samples 
> > directly to the event buffer.  First, we have to find the cached_info 
> > for the appropriate SPU context to retrieve the cached vma-to-fileoffset 
> > map.  Then we do the vma_map_lookup to find the fileoffset corresponding 
> > to the SPU PC sample, which we then write out to the event buffer.  This 
> > is one of the most time-critical pieces of the SPU profiling code, so I 
> > used an array to hold the cached_info for fast random access.  But as I 
> > stated in a code comment above, the negative implication of this current 
> > implementation is that the array can only hold the cached_info for 
> > currently running SPU tasks.  I need to give this some more thought.
> 
> I've given this some more thought, and I'm coming to the conclusion that 
> a pure array-based implementation for holding cached_info (getting rid 
> of the lists) would work well for the vast majority of cases in which 
> OProfile will be used.  Yes, it is true that the mapping of an SPU 
> context to a phsyical spu-numbered array location cannot be guaranteed 
> to stay valid, and that's why I discard the cached_info at that array 
> location when the SPU task is switched out.  Yes, it would be terribly 
> inefficient if the same SPU task gets switched back in later and we 
> would have to recreate the cached_info.  However, I contend that 
> OProfile users are interested in profiling one application at a time. 
> They are not going to want to muddy the waters with multiple SPU apps 
> running at the same time.  I can't think of any reason why someone would 
> conscisouly choose to do that.
> 
> Any thoughts from the general community, especially OProfile users?
> 
Please assume that in the near future we will be scheduling SPU contexts
in and out multiple times a second. Even in a single application, you
can easily have more contexts than you have physical SPUs.

The event buffer by definition needs to be per context. If you for some
reason want to collect the samples per physical SPU during an event
interrupt, you should at least make sure that they are copied into the
per-context event buffer on a context switch.

At the context switch point, you probably also want to drain the
hw event counters, so that you account all events correctly.

We also want to be able to profile the context switch code itself, which
means that we also need one event buffer associated with the kernel to
collect events that for a zero context_id.

Of course, the recording of raw samples in the per-context buffer does
not need to have the dcookies along with it, you can still resolve
the pointers when the SPU context gets destroyed (or an object gets
unmapped).

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 1/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-30 22:57       ` Benjamin Herrenschmidt
@ 2007-01-31  8:47         ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2007-01-31  8:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Carl Love, Christoph Hellwig, linuxppc-dev, cbe-oss-dev,
	oprofile-list, linux-kernel

On Wed, Jan 31, 2007 at 09:57:26AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2007-01-30 at 14:49 -0800, Carl Love wrote:
> > Christoph:
> > 
> > In our earlier work on the PPU profiling patch, Benjamin Herrenschmidt
> > said that we should remove macros that are only used once and just put
> > the actual code in.  That is why the macros were removed. 
> 
> Heh... there is a balance to be found... In some cases, inline functions
> might be better too.

Well, unless there's a very good reasons against it (token pasting,
header pollution) inlines are always preferable over macros, but I
didn't want to bring that issue up aswell..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-31  6:52         ` Arnd Bergmann
@ 2007-02-02 16:47           ` Maynard Johnson
  2007-02-03  7:40             ` Arnd Bergmann
  0 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-02-02 16:47 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linuxppc-dev, oprofile-list, linux-kernel

Arnd Bergmann wrote:

> On Tuesday 30 January 2007 23:54, Maynard Johnson wrote:
> 
> 
>>>>Why do you store them per spu in the first place? The physical spu
>>>>doesn't have any relevance to this at all, the only data that is
>>>>per spu is the sample data collected on a profiling interrupt,
>>>>which you can then copy in the per-context data on a context switch.
>>>
>>>The sample data is written out to the event buffer on every profiling 
>>>interrupt.  But we don't write out the SPU program counter samples 
>>>directly to the event buffer.  First, we have to find the cached_info 
>>>for the appropriate SPU context to retrieve the cached vma-to-fileoffset 
>>>map.  Then we do the vma_map_lookup to find the fileoffset corresponding 
>>>to the SPU PC sample, which we then write out to the event buffer.  This 
>>>is one of the most time-critical pieces of the SPU profiling code, so I 
>>>used an array to hold the cached_info for fast random access.  But as I 
>>>stated in a code comment above, the negative implication of this current 
>>>implementation is that the array can only hold the cached_info for 
>>>currently running SPU tasks.  I need to give this some more thought.
>>
>>I've given this some more thought, and I'm coming to the conclusion that 
>>a pure array-based implementation for holding cached_info (getting rid 
>>of the lists) would work well for the vast majority of cases in which 
>>OProfile will be used.  Yes, it is true that the mapping of an SPU 
>>context to a phsyical spu-numbered array location cannot be guaranteed 
>>to stay valid, and that's why I discard the cached_info at that array 
>>location when the SPU task is switched out.  Yes, it would be terribly 
>>inefficient if the same SPU task gets switched back in later and we 
>>would have to recreate the cached_info.  However, I contend that 
>>OProfile users are interested in profiling one application at a time. 
>>They are not going to want to muddy the waters with multiple SPU apps 
>>running at the same time.  I can't think of any reason why someone would 
>>conscisouly choose to do that.
>>
>>Any thoughts from the general community, especially OProfile users?
>>
> 
> Please assume that in the near future we will be scheduling SPU contexts
> in and out multiple times a second. Even in a single application, you
> can easily have more contexts than you have physical SPUs.
Arnd, thanks for pointing this out.  That's definitely a good reason why 
my simplistic approach won't work.  I'll look at other options.
> 
> The event buffer by definition needs to be per context. If you for some
Yes, and it is.  Right now, with the current simplistic approach, the 
context and the physical SPU are kept in sync.
> reason want to collect the samples per physical SPU during an event
> interrupt, you should at least make sure that they are copied into the
> per-context event buffer on a context switch.
> 
> At the context switch point, you probably also want to drain the
> hw event counters, so that you account all events correctly.
Yeah, that's a good idea.  The few extraneous invalid samples would 
probably never rise above the noise level, but we should do this anyway 
for completeness.
> 
> We also want to be able to profile the context switch code itself, which
> means that we also need one event buffer associated with the kernel to
> collect events that for a zero context_id.
The hardware design precludes tracing both SPU and PPU simultaneously.

-Maynard
> 
> Of course, the recording of raw samples in the per-context buffer does
> not need to have the dcookies along with it, you can still resolve
> the pointers when the SPU context gets destroyed (or an object gets
> unmapped).
> 
> 	Arnd <><



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-01-31  5:57       ` Arnd Bergmann
@ 2007-02-02 19:27         ` Maynard Johnson
  0 siblings, 0 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-02-02 19:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list, Anton Blanchard

Arnd Bergmann wrote:

> On Tuesday 30 January 2007 22:41, Maynard Johnson wrote:
> 
>>Arnd Bergmann wrote:
> 
> 
>>>>+       kt = ktime_set(0, profiling_interval);
>>>>+       if (!spu_prof_running)
>>>>+               goto STOP;
>>>>+       hrtimer_forward(timer, timer->base->get_time(), kt);
>>>>+       return HRTIMER_RESTART;
>>>
>>>
>>>is hrtimer_forward really the right interface here? You are ignoring
>>>the number of overruns anyway, so hrtimer_start(,,) sounds more
>>>correct to me.
>>
>>According to Tom Gleixner, "hrtimer_forward is a convenience function to 
>>move the expiry time of a timer forward in multiples of the interval, so 
>>it is in the future.  After setting the expiry time you restart the 
>>timer either with [sic] a return HRTIMER_RESTART (if you are in the 
>>timer callback function)."
>>
> 
> Ok, I see. Have you seen the timer actually coming in late, resulting
> in hrtimer_forward returning non-zero? I guess it's not a big problem
> for statistic data collection if that happens, but you might still want
> to be able to see it.
I don't think there's much point in knowing if we have overrun(s). 
We're not going to schedule the timer any differently. We want to keep 
the timer interrupts as consistent as possible according to the user's 
request.
> 
> 
>>>>+       /* Since cpufreq_quick_get returns frequency in kHz, we use
>>>>+        * USEC_PER_SEC here vs NSEC_PER_SEC.
>>>>+        */
>>>>+       unsigned long nsPerCyc = (USEC_PER_SEC << SCALE_SHIFT)/khzfreq;
>>>>+       profiling_interval = (nsPerCyc * cycles_reset) >> SCALE_SHIFT;
>>>>+       
>>>>+       pr_debug("timer resolution: %lu\n", 
>>>>+                TICK_NSEC);
>>>
>>>
>>>Don't you need to adapt the profiling_interval at run time, when cpufreq
>>>changes the core frequency? You should probably use
>>>cpufreq_register_notifier() to update this.
>>
>>Since OProfile is a statistical profiler, the exact frequency is not 
>>critical.  The user is going to be looking for hot spots in their code, 
>>so it's all relative.  With that said,  I don't imagine using the 
>>cpufreq notiication would be a big deal.  We'll look at it.
>>
>>
>>>>@@ -480,7 +491,22 @@
>>>>              struct op_system_config *sys, int num_ctrs)
>>>>{
>>>>       int i, j, cpu;
>>>>+       spu_cycle_reset = 0;
>>>>
>>>>+       /* The cpufreq_quick_get function requires that cbe_cpufreq module
>>>>+        * be loaded.  This function is not actually provided and exported
>>>>+        * by cbe_cpufreq, but it relies on cbe_cpufreq initialize kernel
>>>>+        * data structures.  Since there's no way for depmod to realize
>>>>+        * that our OProfile module depends on cbe_cpufreq, we currently
>>>>+        * are letting the userspace tool, opcontrol, ensure that the
>>>>+        * cbe_cpufreq module is loaded.
>>>>+        */
>>>>+       khzfreq = cpufreq_quick_get(smp_processor_id());
>>>
>>>
>>>You should probably have a fallback in here in case the cpufreq module
>>>is not loaded. There is a global variable ppc_proc_freq (in Hz) that
>>>you can access.
>>
>>Our userspace tool ensures the cpufreq module is loaded.
> 
> 
> You should not rely on user space tools to do the right thing in the kernel.
Ok, we'll look at the fallback option you suggest.  I don't recall if I 
even knew about ppc_proc_freq before or why I originally chose 
cpufreq_guick_get.  Maybe we can do without the cpufreq and use 
ppc_proc_freq all the time.  We'll see . . .
> 
> Moreover, if the exact frequency is not that important, as you mentioned
> above, you can probably just hardcode a compile-time constant here.
Well, exact frequency isn't critical, but it should, as close as 
possible, correspond with the user's requested value for "spu cycle reset".
> 
> 
>>>>+ * 
>>>>+ * Ideally, we would like to be able to create the cached_info for
>>>>+ * an SPU task just one time -- when libspe first loads the SPU 
>>>>+ * binary file.  We would store the cached_info in a list.  Then, as
>>>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>>>+ * for inactive tasks would be kept, and the active one would be placed
>>>>+ * at the head of the list.  But this technique may not with
>>>>+ * current spufs functionality since the spu used in bind_context may
>>>>+ * be a different spu than was used in a previous bind_context for a
>>>>+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
>>>>+ * assigned to run on a different physical SPE.  We will investigate
>>>>+ * further if this can be done.
>>>>+ *
>>>>+ */
>>>
>>>
>>>You should stuff a pointer to cached_info into struct spu_context,
>>>e.g. 'void *profile_private'.
>>>
>>>
>>>
>>>>+struct cached_info {
>>>>+       vma_map_t * map;
>>>>+       struct spu * the_spu;
>>>>+       struct kref cache_ref;
>>>>+       struct list_head list;
>>>>+};
>>>
>>>
>>>And replace the 'the_spu' member with a back pointer to the
>>>spu_context if you need it.
>>>
>>>
>>>
>>>>+
>>>>+/* A data structure for cached information about active SPU tasks.
>>>>+ * Storage is dynamically allocated, sized as
>>>>+ * "number of active nodes multplied by 8". 
>>>>+ * The info_list[n] member holds 0 or more 
>>>>+ * 'struct cached_info' objects for SPU#=n. 
>>>>+ *
>>>>+ * As currently implemented, there will only ever be one cached_info 
>>>>+ * in the list for a given SPU.  If we can devise a way to maintain
>>>>+ * multiple cached_infos in our list, then it would make sense
>>>>+ * to also cache the dcookie representing the PPU task application.
>>>>+ * See above description of struct cached_info for more details.
>>>>+ */
>>>>+struct spu_info_stacks {
>>>>+       struct list_head * info_list;
>>>>+};
>>>
>>>
>>>Why do you store pointers to list_head structures? If you want to store
>>>lists, you should have a lists_head itself in here.
>>
>>info_list is an array of n lists, where n is the number of SPUs.
> 
> 
> My point was that it's not an array of lists, but an array of pointers
> to lists. The way that include/list.h works is by having a struct list_head
> as the anchor and then add nodes to it. By simply pointing to a list_head,
> you won't be able to use the list_for_each_entry() macro the way it is meant.
I've got to rework this implementation anyway . . .
> 
> 
[snip]

-Maynard

> 
> 	Arnd <><



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-02 16:47           ` Maynard Johnson
@ 2007-02-03  7:40             ` Arnd Bergmann
  2007-02-03 20:03               ` Maynard Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-02-03  7:40 UTC (permalink / raw)
  To: maynardj; +Cc: cbe-oss-dev, linuxppc-dev, oprofile-list, linux-kernel

On Friday 02 February 2007 17:47, Maynard Johnson wrote:
> 
> > We also want to be able to profile the context switch code itself, which
> > means that we also need one event buffer associated with the kernel to
> > collect events that for a zero context_id.
> The hardware design precludes tracing both SPU and PPU simultaneously.
> 
I mean the SPU-side part of the context switch code, which you can find
in arch/powerpc/platforms/cell/spufs/spu_{save,restore}*.

This code is the one that runs when context_id == 0 is passed to the
callback.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-03  7:40             ` Arnd Bergmann
@ 2007-02-03 20:03               ` Maynard Johnson
  2007-02-04  2:42                 ` Arnd Bergmann
  0 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-02-03 20:03 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linuxppc-dev, oprofile-list, linux-kernel

Arnd Bergmann wrote:

>On Friday 02 February 2007 17:47, Maynard Johnson wrote:
>  
>
>>>We also want to be able to profile the context switch code itself, which
>>>means that we also need one event buffer associated with the kernel to
>>>collect events that for a zero context_id.
>>>      
>>>
>>The hardware design precludes tracing both SPU and PPU simultaneously.
>>
>>    
>>
>I mean the SPU-side part of the context switch code, which you can find
>in arch/powerpc/platforms/cell/spufs/spu_{save,restore}*.
>
>This code is the one that runs when context_id == 0 is passed to the
>callback.
>  
>
I presume you mean 'object_id'.  What you're asking for is a new 
requirement, and one which I don't believe is achievable in the current 
timeframe.  Since this is spufs code that's dynamicaly loaded into the 
SPU at runtime, the symbol information for this code is not accessible 
to the userspace post-processing tools.  It would require an altogether 
different mechanism to record samples along with necessary information, 
not to mention the changes required in the post-processing tools.  This 
will have to be a future enhancement.

-Maynard

>	Arnd <><
>  
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
       [not found]   ` <200701300839.05144.arnd@arndb.de>
  2007-01-30  7:53     ` [Cbe-oss-dev] " Benjamin Herrenschmidt
  2007-01-30 21:41     ` Maynard Johnson
@ 2007-02-03 23:49     ` Maynard Johnson
  2007-02-04  2:52       ` Arnd Bergmann
  2 siblings, 1 reply; 38+ messages in thread
From: Maynard Johnson @ 2007-02-03 23:49 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linux-kernel, linuxppc-dev, oprofile-list

Arnd Bergmann wrote:

>On Monday 29 January 2007 20:48, Maynard Johnson wrote:
>  
>
>>Subject: Add support to OProfile for profiling Cell BE SPUs
>>
>>    
>>
>  
>
[snip]

>>+ * 
>>+ * Ideally, we would like to be able to create the cached_info for
>>+ * an SPU task just one time -- when libspe first loads the SPU 
>>+ * binary file.  We would store the cached_info in a list.  Then, as
>>+ * SPU tasks are switched out and new ones switched in, the cached_info
>>+ * for inactive tasks would be kept, and the active one would be placed
>>+ * at the head of the list.  But this technique may not with
>>+ * current spufs functionality since the spu used in bind_context may
>>+ * be a different spu than was used in a previous bind_context for a
>>+ * reactivated SPU task.  Additionally, a reactivated SPU task may be
>>+ * assigned to run on a different physical SPE.  We will investigate
>>+ * further if this can be done.
>>+ *
>>+ */
>>    
>>
>
>You should stuff a pointer to cached_info into struct spu_context,
>e.g. 'void *profile_private'.
>  
>
I seem to recall looking at this option a while back, but didn't go that 
route since struct spu_context is opaque to me.  With such a teqnique, I 
could then use a simple 16-element array of  pointers to cached_info 
objects, creating them as needed when spu_context->profile_private is 
NULL.  I suppose the better option for now is to add a 
get_profile_private() function to SPUFs, rather than requiring 
spu_context to be visible.  Don't know why I didn't think to do that 
before.  Ah, well, live and learn.

-Maynard

>  
>
>>+struct cached_info {
>>+       vma_map_t * map;
>>+       struct spu * the_spu;
>>+       struct kref cache_ref;
>>+       struct list_head list;
>>+};
>>    
>>
>
>And replace the 'the_spu' member with a back pointer to the
>spu_context if you need it.
>
>  
>
>	Arnd <><
>  
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-03 20:03               ` Maynard Johnson
@ 2007-02-04  2:42                 ` Arnd Bergmann
  2007-02-04 17:11                   ` Maynard Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-02-04  2:42 UTC (permalink / raw)
  To: cbe-oss-dev; +Cc: Maynard Johnson, linuxppc-dev, oprofile-list, linux-kernel

On Saturday 03 February 2007 21:03, Maynard Johnson wrote:
> I presume you mean 'object_id'. 

Right, sorry for the confusion.

> What you're asking for is a new  
> requirement, and one which I don't believe is achievable in the current 
> timeframe.  Since this is spufs code that's dynamicaly loaded into the 
> SPU at runtime, the symbol information for this code is not accessible 
> to the userspace post-processing tools.

We can always fix the user space tool later, but it is important to
get at least the kernel interface right, so we can add the functionality
later without breaking new user space on old kernels or vice versa.

> It would require an altogether  
> different mechanism to record samples along with necessary information, 
> not to mention the changes required in the post-processing tools.  This 
> will have to be a future enhancement.

So what do you do with the samples for object_id == 0? I would expect them
to be simply added to the sample buffer for the kernel, which is sort
of special in oprofile anyway.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-03 23:49     ` Maynard Johnson
@ 2007-02-04  2:52       ` Arnd Bergmann
  2007-02-04 17:33         ` Maynard Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Arnd Bergmann @ 2007-02-04  2:52 UTC (permalink / raw)
  To: cbe-oss-dev; +Cc: Maynard Johnson, linuxppc-dev, oprofile-list, linux-kernel

On Sunday 04 February 2007 00:49, Maynard Johnson wrote:
> I seem to recall looking at this option a while back, but didn't go that 
> route since struct spu_context is opaque to me.  With such a teqnique, I 
> could then use a simple 16-element array of  pointers to cached_info 
> objects, creating them as needed when spu_context->profile_private is 
> NULL.  I suppose the better option for now is to add a 
> get_profile_private() function to SPUFs, rather than requiring 
> spu_context to be visible.

Yes, that sounds good. Note that the file providing the 
spufs_get_profile_private (and respective spufs_set_profile_private)
functions needs to be compiled into the kernel then in case oprofile
gets linked in but spufs is a module.

I think it would also be necessary to have another interface for cleaning
up this data when spufs destroys the context. That could possibly
a variation of the existing notifier call, or a new call, or you
establish the convention that if the private pointer is non-NULL,
spufs will kfree it.

	Arnd <><

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-04  2:42                 ` Arnd Bergmann
@ 2007-02-04 17:11                   ` Maynard Johnson
  0 siblings, 0 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-02-04 17:11 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linuxppc-dev, oprofile-list, linux-kernel

Arnd Bergmann wrote:

>On Saturday 03 February 2007 21:03, Maynard Johnson wrote:
>  
>
>>I presume you mean 'object_id'. 
>>    
>>
>
>Right, sorry for the confusion.
>
>  
>
>>What you're asking for is a new  
>>requirement, and one which I don't believe is achievable in the current 
>>timeframe.  Since this is spufs code that's dynamicaly loaded into the 
>>SPU at runtime, the symbol information for this code is not accessible 
>>to the userspace post-processing tools.
>>    
>>
>
>We can always fix the user space tool later, but it is important to
>get at least the kernel interface right, so we can add the functionality
>later without breaking new user space on old kernels or vice versa.
>  
>
There's no obvious solution to this problem, so it's going to take some 
design effort to come up with a solution.  We can work on the problem, 
but I don't think we can allow it to get in the way of getting our 
currently proposed SPU profiling functionality into the kernel.

>  
>
>>It would require an altogether  
>>different mechanism to record samples along with necessary information, 
>>not to mention the changes required in the post-processing tools.  This 
>>will have to be a future enhancement.
>>    
>>
>
>So what do you do with the samples for object_id == 0? I would expect them
>to be simply added to the sample buffer for the kernel, which is sort
>of special in oprofile anyway.
>  
>
There is no sample buffer for the kernel in SPU profiling.  When 
OProfile gets notified of an SPU task switching out (object_if == 0), we 
stop recording samples for the corresponding SPU.  If SPUFs sends the 
notification after the spu_save operation, we still would be collecting 
samples during that time; however, since the VMAs of these samples would 
not map to any fileoffset in the SPU binary that's executing, our 
current implementation throws them away.  We could change that behavior 
and record them in the samples buffer as "anonymous samples".  Not sure 
it that would help too much, as you wouldn't be able to distinguish 
between spu_save samples and samples from generated stubs that are 
executing from the stack.

-Maynard

>	Arnd <><
>  
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Cbe-oss-dev] [RFC, PATCH 4/4] Add support to OProfile for profiling Cell BE SPUs -- update
  2007-02-04  2:52       ` Arnd Bergmann
@ 2007-02-04 17:33         ` Maynard Johnson
  0 siblings, 0 replies; 38+ messages in thread
From: Maynard Johnson @ 2007-02-04 17:33 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: cbe-oss-dev, linuxppc-dev, oprofile-list, linux-kernel

Arnd Bergmann wrote:

>On Sunday 04 February 2007 00:49, Maynard Johnson wrote:
>  
>
>>I seem to recall looking at this option a while back, but didn't go that 
>>route since struct spu_context is opaque to me.  With such a teqnique, I 
>>could then use a simple 16-element array of  pointers to cached_info 
>>objects, creating them as needed when spu_context->profile_private is 
>>NULL.  I suppose the better option for now is to add a 
>>get_profile_private() function to SPUFs, rather than requiring 
>>spu_context to be visible.
>>    
>>
>
>Yes, that sounds good. Note that the file providing the 
>spufs_get_profile_private (and respective spufs_set_profile_private)
>functions needs to be compiled into the kernel then in case oprofile
>gets linked in but spufs is a module.
>  
>
Hmm . . . we already depend on the register/unregister functions in 
sched.c, so my patch changes the oprofile Kconfig to default to 'm' and 
'depends on SPU_FS'.

>I think it would also be necessary to have another interface for cleaning
>up this data when spufs destroys the context. That could possibly
>a variation of the existing notifier call, or a new call, or you
>establish the convention that if the private pointer is non-NULL,
>spufs will kfree it.
>  
>
Yes, I was thnking along the lines of your last suggestion.  I presume 
OProfile gets notified (object_id == 0) before the context is actually 
destroyed.  At that time, we would NULL-out the reference to the 
cached_info, so then SPUFS would kfree it at destroy time.

-Maynard

>	Arnd <><
>  
>



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2007-02-04 17:33 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-29 19:45 [RFC, PATCH 0/4] Add support to OProfile for profiling Cell BE SPUs -- update Maynard Johnson
2007-01-29 19:46 ` [RFC, PATCH 1/4] " Maynard Johnson
2007-01-30  4:07   ` [Cbe-oss-dev] " Arnd Bergmann
2007-01-30 10:39   ` Christoph Hellwig
2007-01-30 22:49     ` Carl Love
2007-01-30 22:57       ` Benjamin Herrenschmidt
2007-01-31  8:47         ` Christoph Hellwig
2007-01-30 22:59       ` Benjamin Herrenschmidt
2007-01-29 19:47 ` [RFC, PATCH 2/4] " Maynard Johnson
2007-01-30  4:08   ` [Cbe-oss-dev] " Arnd Bergmann
2007-01-30 23:51     ` Carl Love
2007-01-29 19:48 ` [RFC, PATCH 3/4] " Maynard Johnson
2007-01-30  4:24   ` [Cbe-oss-dev] " Arnd Bergmann
2007-01-30 15:31     ` Maynard Johnson
2007-01-31  0:35       ` Arnd Bergmann
2007-01-29 19:48 ` [RFC, PATCH 4/4] " Maynard Johnson
     [not found]   ` <200701300839.05144.arnd@arndb.de>
2007-01-30  7:53     ` [Cbe-oss-dev] " Benjamin Herrenschmidt
2007-01-30 10:41       ` Christoph Hellwig
2007-01-30 23:09         ` Maynard Johnson
2007-01-30 21:41     ` Maynard Johnson
2007-01-30 22:54       ` Maynard Johnson
2007-01-30 23:34         ` Benjamin Herrenschmidt
2007-01-31  0:29           ` Maynard Johnson
2007-01-31  6:52         ` Arnd Bergmann
2007-02-02 16:47           ` Maynard Johnson
2007-02-03  7:40             ` Arnd Bergmann
2007-02-03 20:03               ` Maynard Johnson
2007-02-04  2:42                 ` Arnd Bergmann
2007-02-04 17:11                   ` Maynard Johnson
2007-01-30 23:31       ` Carl Love
2007-01-31  1:25         ` Christian Krafft
2007-01-31  6:06         ` Arnd Bergmann
2007-01-31  5:57       ` Arnd Bergmann
2007-02-02 19:27         ` Maynard Johnson
2007-02-03 23:49     ` Maynard Johnson
2007-02-04  2:52       ` Arnd Bergmann
2007-02-04 17:33         ` Maynard Johnson
2007-01-30  8:37 ` [RFC, PATCH 0/4] " Arnd Bergmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).