All of lore.kernel.org
 help / color / mirror / Atom feed
From: Carl Love <cel@us.ibm.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: cbe-oss-dev@ozlabs.org, Milton Miller <miltonm@bga.com>,
	linuxppc-dev@ozlabs.org, LKML <linux-kernel@vger.kernel.org>,
	oprofile-list@lists.sourceforge.net
Subject: Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Date: Thu, 08 Feb 2007 14:51:59 -0800	[thread overview]
Message-ID: <1170975119.5204.99.camel@dyn9047021078.beaverton.ibm.com> (raw)
In-Reply-To: <200702081821.57358.arnd@arndb.de>

On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:
> On Thursday 08 February 2007 15:18, Milton Miller wrote:
>  
> > 1) sample rate setup
> > 
> >     In the current patch, the user specifies a sample rate as a time 
> > interval.
> >     The kernel is (a) calling cpufreq to get the current cpu frequency, 
> > (b)
> >     converting the rate to a cycle count, (c) converting this to a 24 bit
> >     LFSR count, an iterative algorithm (in this patch, starting from
> >     one of 256 values so a max of 2^16 or 64k iterations), (d) 
> > calculating
> >     an trace unload interval.   In addition, a cpufreq notifier is 
> > registered
> >     to recalculate on frequency changes.

No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.  The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is the
performance counter entry for that event.  Specifically with SPU
profiling, we do not use performance counters because the CELL HW does
not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is taken
and stored in the trace buffer.  Hence, the value of N specified by the
user must get converted to the LFSR value that is N from the end of the
sequence.  The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per CPU
clock cycle regardless of the CPU frequency or changes in the frequency.
There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.  

Milton had a comment about the code 

 if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> +             spu_cycle_reset = ctr[0].count;
> +             return;
> +     }

Well, given the above description, it is clear that if you are doing SPU
event profiling, the value N is put into the cntr[0].cnt entry since
there is only one event.  Thus in cell_global_start_spu() you use
spu_cycle_reset as the argument to the lfsr calculation routine to get
the LFSR value that is N from the end of the sequence.

> > 
> >     The obvious problem is step (c), running a loop potentially 64 
> > thousand
> >     times in kernel space will have a noticeable impact on other threads.
> > 
> >     I propose instead that user space perform the above 4 steps, and 
> > provide
> >     the kernel with two inputs: (1) the value to load in the LFSR and (2)
> >     the periodic frequency / time interval at which to empty the hardware
> >     trace buffer, perform sample analysis, and send the data to the 
> > oprofile
> >     subsystem.
> > 
> >     There should be no security issues with this approach.   If the LFSR 
> > value
> >     is calculated incorrectly, either it will be too short, causing the 
> > trace
> >     array to overfill and data to be dropped, or it will be too long, and
> >     there will be fewer samples.   Likewise, the kernel periodic poll 
> > can be
> >     too long, again causing overflow, or too frequent, causing only timer
> >     execution overhead.
> > 
> >     Various data is collected by the kernel while processing the 
> > periodic timer,
> >     this approach would also allow the profiling tools to control the
> >     frequency of this collection.   More frequent collection results in 
> > more
> >     accurate sample data, with the linear cost of poll execution 
> > overhead.
> > 
> >     Frequency changes can be handled either by the profile code setting
> >     collection at a higher than necessary rate, or by interacting with 
> > the
> >     governor to limit the speeds.
> > 
> >     Optionally, the kernel can add a record indicating that some data was
> >     likely dropped if it is able to read all 256 entries without 
> > underflowing
> >     the array.  This can be used as hint to user space that the kernel 
> > time
> >     was too long for the collection rate.
>  
> Moving the sample rate computation to user space sounds like the right
> idea, but why not have a more drastic version of it:

No, I do not agree.  The user/kernel API pass N where N is the number of
events between samples.  We are not at liberty to just change the API.
We we did do this, we fully expect that John Levon will push back saying
why make an architecture specific API change when it isn't necessary.  

Please define "drastic" in this context.  Do you mean make the table
bigger i.e. more then 256 precomputed elements?  I did 256 since Arnd
seemed to think that would be a reasonable size. Based on his example
How much kernel space are we willing to use to same some computation?
Keep in mind only one of the entries in the table will ever be used.

I think if we found the LFSR that was with in 2^10 of the desired value
that would be good enough. It would be within 1% of the minimum N the
user can specify.  That would require a table with 2^14 entries.  That
seems unreasonably large.

Anyway, the user controls how often sampling is done by setting N. 

>  
                Carl Love
> [cut]


WARNING: multiple messages have this Message-ID (diff)
From: Carl Love <cel@us.ibm.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: linuxppc-dev@ozlabs.org, oprofile-list@lists.sourceforge.net,
	cbe-oss-dev@ozlabs.org, Milton Miller <miltonm@bga.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
Date: Thu, 08 Feb 2007 14:51:59 -0800	[thread overview]
Message-ID: <1170975119.5204.99.camel@dyn9047021078.beaverton.ibm.com> (raw)
In-Reply-To: <200702081821.57358.arnd@arndb.de>

On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:
> On Thursday 08 February 2007 15:18, Milton Miller wrote:
>  
> > 1) sample rate setup
> > 
> >     In the current patch, the user specifies a sample rate as a time 
> > interval.
> >     The kernel is (a) calling cpufreq to get the current cpu frequency, 
> > (b)
> >     converting the rate to a cycle count, (c) converting this to a 24 bit
> >     LFSR count, an iterative algorithm (in this patch, starting from
> >     one of 256 values so a max of 2^16 or 64k iterations), (d) 
> > calculating
> >     an trace unload interval.   In addition, a cpufreq notifier is 
> > registered
> >     to recalculate on frequency changes.

No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.  The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is the
performance counter entry for that event.  Specifically with SPU
profiling, we do not use performance counters because the CELL HW does
not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is taken
and stored in the trace buffer.  Hence, the value of N specified by the
user must get converted to the LFSR value that is N from the end of the
sequence.  The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per CPU
clock cycle regardless of the CPU frequency or changes in the frequency.
There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.  

Milton had a comment about the code 

 if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
> +             spu_cycle_reset = ctr[0].count;
> +             return;
> +     }

Well, given the above description, it is clear that if you are doing SPU
event profiling, the value N is put into the cntr[0].cnt entry since
there is only one event.  Thus in cell_global_start_spu() you use
spu_cycle_reset as the argument to the lfsr calculation routine to get
the LFSR value that is N from the end of the sequence.

> > 
> >     The obvious problem is step (c), running a loop potentially 64 
> > thousand
> >     times in kernel space will have a noticeable impact on other threads.
> > 
> >     I propose instead that user space perform the above 4 steps, and 
> > provide
> >     the kernel with two inputs: (1) the value to load in the LFSR and (2)
> >     the periodic frequency / time interval at which to empty the hardware
> >     trace buffer, perform sample analysis, and send the data to the 
> > oprofile
> >     subsystem.
> > 
> >     There should be no security issues with this approach.   If the LFSR 
> > value
> >     is calculated incorrectly, either it will be too short, causing the 
> > trace
> >     array to overfill and data to be dropped, or it will be too long, and
> >     there will be fewer samples.   Likewise, the kernel periodic poll 
> > can be
> >     too long, again causing overflow, or too frequent, causing only timer
> >     execution overhead.
> > 
> >     Various data is collected by the kernel while processing the 
> > periodic timer,
> >     this approach would also allow the profiling tools to control the
> >     frequency of this collection.   More frequent collection results in 
> > more
> >     accurate sample data, with the linear cost of poll execution 
> > overhead.
> > 
> >     Frequency changes can be handled either by the profile code setting
> >     collection at a higher than necessary rate, or by interacting with 
> > the
> >     governor to limit the speeds.
> > 
> >     Optionally, the kernel can add a record indicating that some data was
> >     likely dropped if it is able to read all 256 entries without 
> > underflowing
> >     the array.  This can be used as hint to user space that the kernel 
> > time
> >     was too long for the collection rate.
>  
> Moving the sample rate computation to user space sounds like the right
> idea, but why not have a more drastic version of it:

No, I do not agree.  The user/kernel API pass N where N is the number of
events between samples.  We are not at liberty to just change the API.
We we did do this, we fully expect that John Levon will push back saying
why make an architecture specific API change when it isn't necessary.  

Please define "drastic" in this context.  Do you mean make the table
bigger i.e. more then 256 precomputed elements?  I did 256 since Arnd
seemed to think that would be a reasonable size. Based on his example
How much kernel space are we willing to use to same some computation?
Keep in mind only one of the entries in the table will ever be used.

I think if we found the LFSR that was with in 2^10 of the desired value
that would be good enough. It would be within 1% of the minimum N the
user can specify.  That would require a table with 2^14 entries.  That
seems unreasonably large.

Anyway, the user controls how often sampling is done by setting N. 

>  
                Carl Love
> [cut]

  parent reply	other threads:[~2007-02-08 22:52 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-02-06  0:28 [RFC,PATCH] CELL PPU Oprofile SPU profiling updated patch Carl Love
2007-02-06  0:46 ` Paul Mackerras
2007-02-06  0:46   ` Paul Mackerras
2007-02-06  0:49 ` [Cbe-oss-dev] [RFC, PATCH] " Benjamin Herrenschmidt
2007-02-06  0:49   ` Benjamin Herrenschmidt
2007-02-06 23:02 ` [Cbe-oss-dev] [RFC, PATCH] CELL " Carl Love
2007-02-06 23:02   ` Carl Love
2007-02-07 15:41   ` Maynard Johnson
2007-02-07 15:41     ` Maynard Johnson
2007-02-07 22:48     ` Michael Ellerman
2007-02-07 22:48       ` Michael Ellerman
2007-02-08 15:03       ` Maynard Johnson
2007-02-08 15:03         ` Maynard Johnson
2007-02-08 14:18   ` Milton Miller
2007-02-08 14:18     ` Milton Miller
2007-02-08 17:21     ` Arnd Bergmann
2007-02-08 17:21       ` Arnd Bergmann
2007-02-08 18:01       ` Adrian Reber
2007-02-08 18:01         ` Adrian Reber
2007-02-08 22:51       ` Carl Love [this message]
2007-02-08 22:51         ` Carl Love
2007-02-09  2:46         ` Milton Miller
2007-02-09  2:46           ` Milton Miller
2007-02-09 16:17           ` Carl Love
2007-02-09 16:17             ` Carl Love
2007-02-11 22:46             ` Milton Miller
2007-02-11 22:46               ` Milton Miller
2007-02-12 16:38               ` Carl Love
2007-02-12 16:38                 ` Carl Love
2007-02-09 18:47       ` Milton Miller
2007-02-09 18:47         ` Milton Miller
2007-02-09 19:10         ` Arnd Bergmann
2007-02-09 19:10           ` Arnd Bergmann
2007-02-09 19:46           ` Milton Miller
2007-02-09 19:46             ` Milton Miller
2007-02-08 23:59     ` Maynard Johnson
2007-02-08 23:59       ` Maynard Johnson
2007-02-09 18:03       ` Milton Miller
2007-02-09 18:03         ` Milton Miller
2007-02-14 23:52 Carl Love
2007-02-15 14:37 ` Arnd Bergmann
2007-02-15 14:37   ` Arnd Bergmann
2007-02-15 16:15   ` Maynard Johnson
2007-02-15 16:15     ` Maynard Johnson
2007-02-15 18:13     ` Arnd Bergmann
2007-02-15 18:13       ` Arnd Bergmann
2007-02-15 20:21   ` Carl Love
2007-02-15 20:21     ` Carl Love
2007-02-15 21:03     ` Arnd Bergmann
2007-02-15 21:03       ` Arnd Bergmann
2007-02-15 21:50     ` Paul E. McKenney
2007-02-15 21:50       ` Paul E. McKenney
2007-02-16  0:33       ` Arnd Bergmann
2007-02-16  0:33         ` Arnd Bergmann
2007-02-16  0:32   ` Maynard Johnson
2007-02-16  0:32     ` Maynard Johnson
2007-02-16 17:14     ` Arnd Bergmann
2007-02-16 17:14       ` Arnd Bergmann
2007-02-16 21:43       ` Maynard Johnson
2007-02-16 21:43         ` Maynard Johnson
2007-02-18 23:18         ` Maynard Johnson
2007-02-18 23:18           ` Maynard Johnson
2007-02-22  0:02 Carl Love
2007-02-26 23:50 ` Arnd Bergmann
2007-02-26 23:50   ` Arnd Bergmann
2007-02-27  1:31   ` Michael Ellerman
2007-02-27  1:31     ` Michael Ellerman
2007-02-27 16:52   ` Maynard Johnson
2007-02-27 16:52     ` Maynard Johnson
2007-02-28  1:44     ` Arnd Bergmann
2007-02-28  1:44       ` Arnd Bergmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1170975119.5204.99.camel@dyn9047021078.beaverton.ibm.com \
    --to=cel@us.ibm.com \
    --cc=arnd@arndb.de \
    --cc=cbe-oss-dev@ozlabs.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=miltonm@bga.com \
    --cc=oprofile-list@lists.sourceforge.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.