linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] fix up perfmon to build on -mm
@ 2007-11-07  0:34 Greg KH
  2007-11-07 10:34 ` Stephane Eranian
                   ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: Greg KH @ 2007-11-07  0:34 UTC (permalink / raw)
  To: Andrew Morton, Stephane Eranian; +Cc: perfmon, linux-kernel

Here's a patch against my current tree that gets the perfmon code
building and hopefully working.

Note, it needs the kobject_create_and_register() patch which is in my
tree, but I do not think it made it to -mm yet.  The next -mm cycle
should have it.

Also, the sysfs usage in the perfmon code is quite strange and not
documented at all.  Yes, there is a little bit in the documentation
about what a few of the files do, but there are _way_ more files and
even directories being created under /sys/kernel/perfmon/ that are not
documented at all here.

If you document this stuff, I think I can clean up your sysfs code a
lot, making things simpler, easier to extend, and easier to understand.
But as it is, I don't want to break anything as it's totally unknown how
this stuff is supposed to work...

Hint, use the Documentation/ABI directory to document your sysfs
interfaces, that is what it is there for...

thanks,

greg k-h

---------------
From: Greg Kroah-Hartman <gregkh@suse.de>
Subject: perfmon: fix up some static kobject usages

This gets the perfmon code to build properly on the latest -mm tree, as
well as removing some static kobjects.

A lot of future kobject cleanups can be done on this code, but the
documentation for the perfmon sysfs interface is very limited and does
not describe all of the different files and subdirectories at all.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 perfmon/perfmon_sysfs.c |   37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

--- a/perfmon/perfmon_sysfs.c
+++ b/perfmon/perfmon_sysfs.c
@@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);
 
 DECLARE_PER_CPU(struct pfm_stats, pfm_stats);
 
-static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
+static struct kobject *pfm_kernel_kobj;
+static struct kobject *pfm_kernel_fmt_kobj;
 
 static void pfm_reset_stats(int cpu)
 {
@@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel
 
 int __init pfm_init_sysfs(void)
 {
-	int ret;
+	int ret = -ENOMEM;
 	int i, cpu = -1;
 
-	kobject_init(&pfm_kernel_kobj);
-	kobject_init(&pfm_kernel_fmt_kobj);
-
-	pfm_kernel_kobj.parent = &kernel_subsys.kobj;
-	kobject_set_name(&pfm_kernel_kobj, "perfmon");
-
-	pfm_kernel_fmt_kobj.parent = &pfm_kernel_kobj;
-	kobject_set_name(&pfm_kernel_fmt_kobj, "formats");
-
-	ret = kobject_add(&pfm_kernel_kobj);
-	if (ret) {
-		PFM_INFO("cannot add kernel object: %d", ret);
+	pfm_kernel_kobj = kobject_create_and_register("perfmon", kernel_kobj);
+	if (!pfm_kernel_kobj) {
+		PFM_INFO("cannot create perfmon kernel object");
 		goto error;
 	}
 
-	ret = kobject_add(&pfm_kernel_fmt_kobj);
-	if (ret) {
-		PFM_INFO("cannot add fmt object: %d", ret);
+	pfm_kernel_fmt_kobj = kobject_create_and_register("formats",
+							  pfm_kernel_kobj);
+	if (!pfm_kernel_fmt_kobj) {
+		PFM_INFO("cannot add fmt object");
 		goto error_fmt;
 	}
 
-	ret = sysfs_create_group(&pfm_kernel_kobj, &pfm_kernel_attr_group);
+	ret = sysfs_create_group(pfm_kernel_kobj, &pfm_kernel_attr_group);
 	if (ret) {
 		PFM_INFO("cannot create kernel group");
 		goto error_group;
@@ -449,9 +442,9 @@ int __init pfm_init_sysfs(void)
 	return 0;
 
 error_group:
-	kobject_del(&pfm_kernel_fmt_kobj);
+	kobject_unregister(pfm_kernel_fmt_kobj);
 error_fmt:
-	kobject_del(&pfm_kernel_kobj);
+	kobject_unregister(pfm_kernel_kobj);
 
 	for (i=0; i < cpu; i++)
 		pfm_sysfs_del_cpu(i);
@@ -683,7 +676,7 @@ int pfm_sysfs_add_fmt(struct pfm_smpl_fm
 
 	kobject_set_name(&fmt->kobj, fmt->fmt_name);
 	//kobj_set_kset_s(fmt, pfm_fmt_subsys);
-	fmt->kobj.parent = &pfm_kernel_fmt_kobj;
+	fmt->kobj.parent = pfm_kernel_fmt_kobj;
 
 	ret = kobject_add(&fmt->kobj);
 	if (ret)
@@ -861,7 +854,7 @@ int pfm_sysfs_add_pmu(struct pfm_pmu_con
 	kobject_init(&pmu->kobj);
 	kobject_set_name(&pmu->kobj, "pmu_desc");
 	//kobj_set_kset_s(pmu, pfm_pmu_subsys);
-	pmu->kobj.parent = &pfm_kernel_kobj;
+	pmu->kobj.parent = pfm_kernel_kobj;
 
 	ret = kobject_add(&pmu->kobj);
 	if (ret)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07  0:34 [PATCH] fix up perfmon to build on -mm Greg KH
@ 2007-11-07 10:34 ` Stephane Eranian
  2007-11-07 17:07   ` Greg KH
  2007-11-07 13:42 ` Stephane Eranian
  2007-11-09 20:06 ` Andrew Morton
  2 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-07 10:34 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel

Greg,

On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.
> 
Thanks for your quick help.

> Note, it needs the kobject_create_and_register() patch which is in my
> tree, but I do not think it made it to -mm yet.  The next -mm cycle
> should have it.
> 
> Also, the sysfs usage in the perfmon code is quite strange and not
> documented at all.  Yes, there is a little bit in the documentation
> about what a few of the files do, but there are _way_ more files and
> even directories being created under /sys/kernel/perfmon/ that are not
> documented at all here.
> 
The full documentation for /sys/kernel/perfmon is in Documentation/perfmon2.txt

> If you document this stuff, I think I can clean up your sysfs code a
> lot, making things simpler, easier to extend, and easier to understand.
> But as it is, I don't want to break anything as it's totally unknown how
> this stuff is supposed to work...
> 
I certainly welcome your help.

> Hint, use the Documentation/ABI directory to document your sysfs
> interfaces, that is what it is there for...
> 
I will move the description from perfmon2.txt to its own file in
ABI/testing.

--
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07  0:34 [PATCH] fix up perfmon to build on -mm Greg KH
  2007-11-07 10:34 ` Stephane Eranian
@ 2007-11-07 13:42 ` Stephane Eranian
  2007-11-07 17:08   ` Greg KH
  2007-11-07 17:47   ` Greg KH
  2007-11-09 20:06 ` Andrew Morton
  2 siblings, 2 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-07 13:42 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel

Greg,

Perfmon sysfs document has been updated following your adivce.
you can check out in my perfmon tree  the following commit:

	e83278f879e52ecee025effe9ad509fd51e4a516

Thanks.

On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.
> 
> Note, it needs the kobject_create_and_register() patch which is in my
> tree, but I do not think it made it to -mm yet.  The next -mm cycle
> should have it.
> 
> Also, the sysfs usage in the perfmon code is quite strange and not
> documented at all.  Yes, there is a little bit in the documentation
> about what a few of the files do, but there are _way_ more files and
> even directories being created under /sys/kernel/perfmon/ that are not
> documented at all here.
> 
> If you document this stuff, I think I can clean up your sysfs code a
> lot, making things simpler, easier to extend, and easier to understand.
> But as it is, I don't want to break anything as it's totally unknown how
> this stuff is supposed to work...
> 
> Hint, use the Documentation/ABI directory to document your sysfs
> interfaces, that is what it is there for...
> 
> thanks,
> 
> greg k-h
> 
> ---------------
> From: Greg Kroah-Hartman <gregkh@suse.de>
> Subject: perfmon: fix up some static kobject usages
> 
> This gets the perfmon code to build properly on the latest -mm tree, as
> well as removing some static kobjects.
> 
> A lot of future kobject cleanups can be done on this code, but the
> documentation for the perfmon sysfs interface is very limited and does
> not describe all of the different files and subdirectories at all.
> 
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
> 
> ---
>  perfmon/perfmon_sysfs.c |   37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
> 
> --- a/perfmon/perfmon_sysfs.c
> +++ b/perfmon/perfmon_sysfs.c
> @@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);
>  
>  DECLARE_PER_CPU(struct pfm_stats, pfm_stats);
>  
> -static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
> +static struct kobject *pfm_kernel_kobj;
> +static struct kobject *pfm_kernel_fmt_kobj;
>  
>  static void pfm_reset_stats(int cpu)
>  {
> @@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel
>  
>  int __init pfm_init_sysfs(void)
>  {
> -	int ret;
> +	int ret = -ENOMEM;
>  	int i, cpu = -1;
>  
> -	kobject_init(&pfm_kernel_kobj);
> -	kobject_init(&pfm_kernel_fmt_kobj);
> -
> -	pfm_kernel_kobj.parent = &kernel_subsys.kobj;
> -	kobject_set_name(&pfm_kernel_kobj, "perfmon");
> -
> -	pfm_kernel_fmt_kobj.parent = &pfm_kernel_kobj;
> -	kobject_set_name(&pfm_kernel_fmt_kobj, "formats");
> -
> -	ret = kobject_add(&pfm_kernel_kobj);
> -	if (ret) {
> -		PFM_INFO("cannot add kernel object: %d", ret);
> +	pfm_kernel_kobj = kobject_create_and_register("perfmon", kernel_kobj);
> +	if (!pfm_kernel_kobj) {
> +		PFM_INFO("cannot create perfmon kernel object");
>  		goto error;
>  	}
>  
> -	ret = kobject_add(&pfm_kernel_fmt_kobj);
> -	if (ret) {
> -		PFM_INFO("cannot add fmt object: %d", ret);
> +	pfm_kernel_fmt_kobj = kobject_create_and_register("formats",
> +							  pfm_kernel_kobj);
> +	if (!pfm_kernel_fmt_kobj) {
> +		PFM_INFO("cannot add fmt object");
>  		goto error_fmt;
>  	}
>  
> -	ret = sysfs_create_group(&pfm_kernel_kobj, &pfm_kernel_attr_group);
> +	ret = sysfs_create_group(pfm_kernel_kobj, &pfm_kernel_attr_group);
>  	if (ret) {
>  		PFM_INFO("cannot create kernel group");
>  		goto error_group;
> @@ -449,9 +442,9 @@ int __init pfm_init_sysfs(void)
>  	return 0;
>  
>  error_group:
> -	kobject_del(&pfm_kernel_fmt_kobj);
> +	kobject_unregister(pfm_kernel_fmt_kobj);
>  error_fmt:
> -	kobject_del(&pfm_kernel_kobj);
> +	kobject_unregister(pfm_kernel_kobj);
>  
>  	for (i=0; i < cpu; i++)
>  		pfm_sysfs_del_cpu(i);
> @@ -683,7 +676,7 @@ int pfm_sysfs_add_fmt(struct pfm_smpl_fm
>  
>  	kobject_set_name(&fmt->kobj, fmt->fmt_name);
>  	//kobj_set_kset_s(fmt, pfm_fmt_subsys);
> -	fmt->kobj.parent = &pfm_kernel_fmt_kobj;
> +	fmt->kobj.parent = pfm_kernel_fmt_kobj;
>  
>  	ret = kobject_add(&fmt->kobj);
>  	if (ret)
> @@ -861,7 +854,7 @@ int pfm_sysfs_add_pmu(struct pfm_pmu_con
>  	kobject_init(&pmu->kobj);
>  	kobject_set_name(&pmu->kobj, "pmu_desc");
>  	//kobj_set_kset_s(pmu, pfm_pmu_subsys);
> -	pmu->kobj.parent = &pfm_kernel_kobj;
> +	pmu->kobj.parent = pfm_kernel_kobj;
>  
>  	ret = kobject_add(&pmu->kobj);
>  	if (ret)

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 10:34 ` Stephane Eranian
@ 2007-11-07 17:07   ` Greg KH
  0 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-07 17:07 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Andrew Morton, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 02:34:49AM -0800, Stephane Eranian wrote:
> Greg,
> 
> On Tue, Nov 06, 2007 at 04:34:54PM -0800, Greg KH wrote:
> > Here's a patch against my current tree that gets the perfmon code
> > building and hopefully working.
> > 
> Thanks for your quick help.
> 
> > Note, it needs the kobject_create_and_register() patch which is in my
> > tree, but I do not think it made it to -mm yet.  The next -mm cycle
> > should have it.
> > 
> > Also, the sysfs usage in the perfmon code is quite strange and not
> > documented at all.  Yes, there is a little bit in the documentation
> > about what a few of the files do, but there are _way_ more files and
> > even directories being created under /sys/kernel/perfmon/ that are not
> > documented at all here.
> > 
> The full documentation for /sys/kernel/perfmon is in Documentation/perfmon2.txt

That is what I was referring to, that file does not describe all of the
sysfs files in /sys/kernel/perfmon by far.

> > If you document this stuff, I think I can clean up your sysfs code a
> > lot, making things simpler, easier to extend, and easier to understand.
> > But as it is, I don't want to break anything as it's totally unknown how
> > this stuff is supposed to work...
> > 
> I certainly welcome your help.
> 
> > Hint, use the Documentation/ABI directory to document your sysfs
> > interfaces, that is what it is there for...
> > 
> I will move the description from perfmon2.txt to its own file in
> ABI/testing.

That would be great to have, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 13:42 ` Stephane Eranian
@ 2007-11-07 17:08   ` Greg KH
  2007-11-07 17:33     ` Andrew Morton
  2007-11-07 17:50     ` Stephane Eranian
  2007-11-07 17:47   ` Greg KH
  1 sibling, 2 replies; 116+ messages in thread
From: Greg KH @ 2007-11-07 17:08 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Andrew Morton, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> Greg,
> 
> Perfmon sysfs document has been updated following your adivce.
> you can check out in my perfmon tree  the following commit:
> 
> 	e83278f879e52ecee025effe9ad509fd51e4a516

Where is this git tree located?  On git.kernel.org somewhere?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 17:08   ` Greg KH
@ 2007-11-07 17:33     ` Andrew Morton
  2007-11-07 17:41       ` Greg KH
  2007-11-07 17:50     ` Stephane Eranian
  1 sibling, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2007-11-07 17:33 UTC (permalink / raw)
  To: Greg KH; +Cc: eranian, perfmon, linux-kernel

> On Wed, 7 Nov 2007 09:08:20 -0800 Greg KH <greg@kroah.com> wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> > 
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree  the following commit:
> > 
> > 	e83278f879e52ecee025effe9ad509fd51e4a516
> 
> Where is this git tree located?  On git.kernel.org somewhere?
> 


I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 17:33     ` Andrew Morton
@ 2007-11-07 17:41       ` Greg KH
  0 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-07 17:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: eranian, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 09:33:13AM -0800, Andrew Morton wrote:
> > On Wed, 7 Nov 2007 09:08:20 -0800 Greg KH <greg@kroah.com> wrote:
> > On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > > Greg,
> > > 
> > > Perfmon sysfs document has been updated following your adivce.
> > > you can check out in my perfmon tree  the following commit:
> > > 
> > > 	e83278f879e52ecee025effe9ad509fd51e4a516
> > 
> > Where is this git tree located?  On git.kernel.org somewhere?
> > 
> 
> 
> I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git

Thanks, that worked, let me go read the new documentation...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 13:42 ` Stephane Eranian
  2007-11-07 17:08   ` Greg KH
@ 2007-11-07 17:47   ` Greg KH
  2007-11-07 17:57     ` Stephane Eranian
  1 sibling, 1 reply; 116+ messages in thread
From: Greg KH @ 2007-11-07 17:47 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Andrew Morton, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> Greg,
> 
> Perfmon sysfs document has been updated following your adivce.
> you can check out in my perfmon tree  the following commit:
> 
> 	e83278f879e52ecee025effe9ad509fd51e4a516

Thanks, that looks a lot better.

Do you want me to send you patches based on this tree to help clean up
the sysfs usage now that it's documented?

Also, a lot of your per-cpu sysfs files should probably move to debugfs
as they are for debugging only, right?  No need to clutter up sysfs with
them when only the very few perfmon developers would be needing access
to them.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 17:08   ` Greg KH
  2007-11-07 17:33     ` Andrew Morton
@ 2007-11-07 17:50     ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-07 17:50 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 09:08:20AM -0800, Greg KH wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> > 
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree  the following commit:
> > 
> > 	e83278f879e52ecee025effe9ad509fd51e4a516
> 
> Where is this git tree located?  On git.kernel.org somewhere?
> 
	http://git.kernel.org/?p=linux/kernel/git/eranian/linux-2.6.git

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 17:47   ` Greg KH
@ 2007-11-07 17:57     ` Stephane Eranian
  2007-11-07 19:53       ` Greg KH
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-07 17:57 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel

Greg,

On Wed, Nov 07, 2007 at 09:47:47AM -0800, Greg KH wrote:
> On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > Greg,
> > 
> > Perfmon sysfs document has been updated following your adivce.
> > you can check out in my perfmon tree  the following commit:
> > 
> > 	e83278f879e52ecee025effe9ad509fd51e4a516
> 
> Thanks, that looks a lot better.
> 
> Do you want me to send you patches based on this tree to help clean up
> the sysfs usage now that it's documented?
> 
Yes, send me the patches. But from what you were saying earlier it seems
I would need an extra sysfs patches to make this compile. Is that particular
patch already in Linus's tree?


> Also, a lot of your per-cpu sysfs files should probably move to debugfs
> as they are for debugging only, right?  No need to clutter up sysfs with
> them when only the very few perfmon developers would be needing access
> to them.
> 
Yes, this is mostly debugging. If debugfs is meant for this, then I'll
be happy to move this stuff over there. Is there some good example of how
I could do that based on my current sysfs code?

Thanks.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 17:57     ` Stephane Eranian
@ 2007-11-07 19:53       ` Greg KH
  2007-11-07 20:39         ` Stephane Eranian
  2007-11-08 15:27         ` Stephane Eranian
  0 siblings, 2 replies; 116+ messages in thread
From: Greg KH @ 2007-11-07 19:53 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Andrew Morton, perfmon, linux-kernel

On Wed, Nov 07, 2007 at 09:57:42AM -0800, Stephane Eranian wrote:
> Greg,
> 
> On Wed, Nov 07, 2007 at 09:47:47AM -0800, Greg KH wrote:
> > On Wed, Nov 07, 2007 at 05:42:55AM -0800, Stephane Eranian wrote:
> > > Greg,
> > > 
> > > Perfmon sysfs document has been updated following your adivce.
> > > you can check out in my perfmon tree  the following commit:
> > > 
> > > 	e83278f879e52ecee025effe9ad509fd51e4a516
> > 
> > Thanks, that looks a lot better.
> > 
> > Do you want me to send you patches based on this tree to help clean up
> > the sysfs usage now that it's documented?
> > 
> Yes, send me the patches. But from what you were saying earlier it seems
> I would need an extra sysfs patches to make this compile. Is that particular
> patch already in Linus's tree?

No, it's in my tree, and will be in the next -mm.  You will need a few
patches to get this to work, not just a single patch.

> > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > as they are for debugging only, right?  No need to clutter up sysfs with
> > them when only the very few perfmon developers would be needing access
> > to them.
> > 
> Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> be happy to move this stuff over there. Is there some good example of how
> I could do that based on my current sysfs code?

There is documentation for debugfs in the kernel api document :)

And, there are many in-kernel users of debugfs, a grep for
"debugfs_create_" should show you some examples of how to use this.  If
you have any questions, please let me know.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 19:53       ` Greg KH
@ 2007-11-07 20:39         ` Stephane Eranian
  2007-11-08 15:27         ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-07 20:39 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel

Greg,

On Wed, Nov 07, 2007 at 11:53:20AM -0800, Greg KH wrote:
> > > 
> > > Do you want me to send you patches based on this tree to help clean up
> > > the sysfs usage now that it's documented?
> > > 
> > Yes, send me the patches. But from what you were saying earlier it seems
> > I would need an extra sysfs patches to make this compile. Is that particular
> > patch already in Linus's tree?
> 
> No, it's in my tree, and will be in the next -mm.  You will need a few
> patches to get this to work, not just a single patch.
> 
Could you send them to me? if they are not too intrusive I could add them
to my tree. Yet I don't want something to distant from Linus's tree which
I pull from. My goal is to ensure that my tree still compiles and works.

> > > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > > as they are for debugging only, right?  No need to clutter up sysfs with
> > > them when only the very few perfmon developers would be needing access
> > > to them.
> > > 
> > Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> > be happy to move this stuff over there. Is there some good example of how
> > I could do that based on my current sysfs code?
> 
> There is documentation for debugfs in the kernel api document :)
> 
> And, there are many in-kernel users of debugfs, a grep for
> "debugfs_create_" should show you some examples of how to use this.  If
> you have any questions, please let me know.
> 
Ok, I'll look at that next.

Thanks,

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07 19:53       ` Greg KH
  2007-11-07 20:39         ` Stephane Eranian
@ 2007-11-08 15:27         ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-08 15:27 UTC (permalink / raw)
  To: Greg KH; +Cc: Andrew Morton, perfmon, linux-kernel, perfmon2-devel

Greg,

On Wed, Nov 07, 2007 at 11:53:20AM -0800, Greg KH wrote:
> > > Also, a lot of your per-cpu sysfs files should probably move to debugfs
> > > as they are for debugging only, right?  No need to clutter up sysfs with
> > > them when only the very few perfmon developers would be needing access
> > > to them.
> > > 
> > Yes, this is mostly debugging. If debugfs is meant for this, then I'll
> > be happy to move this stuff over there. Is there some good example of how
> > I could do that based on my current sysfs code?
> 
> There is documentation for debugfs in the kernel api document :)
> 
> And, there are many in-kernel users of debugfs, a grep for
> "debugfs_create_" should show you some examples of how to use this.  If
> you have any questions, please let me know.

I have now removed all the perfmon2 statistics from sysfs and moved them
to debugfs. I must admit, I like it better this way. Debugfs is also so
much easier to program.

Patch has been pushed into my tree. Let me know if you think I can improve
the sysfs code some more.

Thanks.

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-07  0:34 [PATCH] fix up perfmon to build on -mm Greg KH
  2007-11-07 10:34 ` Stephane Eranian
  2007-11-07 13:42 ` Stephane Eranian
@ 2007-11-09 20:06 ` Andrew Morton
  2007-11-09 21:38   ` Greg KH
  2 siblings, 1 reply; 116+ messages in thread
From: Andrew Morton @ 2007-11-09 20:06 UTC (permalink / raw)
  To: Greg KH; +Cc: eranian, perfmon, linux-kernel

On Tue, 6 Nov 2007 16:34:54 -0800
Greg KH <greg@kroah.com> wrote:

> Here's a patch against my current tree that gets the perfmon code
> building and hopefully working.

Unfortunately I still haven't merged perfmon due to recently-occurring
minor conflicts with Tony's ia64 tree and more major recently-occurring
conflicts with the x86 tree.

There's not really a lot which Stephane can practically do about this -
normally I'll just get down and fix stuff like this up.  But the impression
I get from various people is that the perfmon tree in its present form
would not be a popular merge.

The impression which people have (and I admit to sharing it) is that
there's just too much stuff in there and it might not all be justifiable. 
But I suspect that people have largely forgotten what is in there, and why
it is in there.

We really need to get this ball rolling, and that will require a sustained
effort from more people than just Stephane.  I suppose as a starting point
we could yet again review the existing patches, please.  People will mainly
concentrate upon the changelogging to understand which features are being
proposed and why, so that submission should describe these things pretty
carefully: what are the features and why do we need each of them.

tia.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-09 20:06 ` Andrew Morton
@ 2007-11-09 21:38   ` Greg KH
  2007-11-10 20:32     ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Greg KH @ 2007-11-09 21:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: eranian, perfmon, linux-kernel

On Fri, Nov 09, 2007 at 12:06:27PM -0800, Andrew Morton wrote:
> On Tue, 6 Nov 2007 16:34:54 -0800
> Greg KH <greg@kroah.com> wrote:
> 
> > Here's a patch against my current tree that gets the perfmon code
> > building and hopefully working.
> 
> Unfortunately I still haven't merged perfmon due to recently-occurring
> minor conflicts with Tony's ia64 tree and more major recently-occurring
> conflicts with the x86 tree.
> 
> There's not really a lot which Stephane can practically do about this -
> normally I'll just get down and fix stuff like this up.  But the impression
> I get from various people is that the perfmon tree in its present form
> would not be a popular merge.
> 
> The impression which people have (and I admit to sharing it) is that
> there's just too much stuff in there and it might not all be justifiable. 
> But I suspect that people have largely forgotten what is in there, and why
> it is in there.
> 
> We really need to get this ball rolling, and that will require a sustained
> effort from more people than just Stephane.  I suppose as a starting point
> we could yet again review the existing patches, please.  People will mainly
> concentrate upon the changelogging to understand which features are being
> proposed and why, so that submission should describe these things pretty
> carefully: what are the features and why do we need each of them.

Is there some way to rebase these patches/git tree to be a bit more easy
to review?  Right now there are over 75 patches in the tree and many (if
not most) can be removed by merging them with previous patches.

If someone could break this stuff down into reviewable pieces, it would
go a very long way toward making it acceptable.

Is there any way to just provide a basic framework that everyone can
agree on and then add on more stuff as time goes on?  Do we have to have
every different processor/arch with support to start with?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH] fix up perfmon to build on -mm
  2007-11-09 21:38   ` Greg KH
@ 2007-11-10 20:32     ` Andi Kleen
  2007-11-13 15:17       ` perfmon2 merge news Robert Richter
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-10 20:32 UTC (permalink / raw)
  To: gregkh; +Cc: akpm, eranian, linux-kernel

Greg KH <greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> writes:

[dropped perfmon list because gmane messed it up and it's apparently
closed anyways]

> Is there any way to just provide a basic framework that everyone can
> agree on and then add on more stuff as time goes on?  Do we have to have
> every different processor/arch with support to start with?

I think the real problem are not the architectures (the processor
adaption layer is usually relatively straight forward IIRC), but the
excessive functionality implemented by the user interface.

It would be really good to extract a core perfmon and start with
that and then add stuff as it makes sense.

e.g. core perfmon could be something simple like just support
to context switch state and initialize counters in a basic way 
and perhaps get counter numbers for RDPMC in ring3 on x86[1]

Next step could be basic event on overflow/underflow support.

Then more features as they make sense, with clear rationale
what they're good for and proper step by step patches. 

-Andi

[1] On x86 we urgently need a replacement to RDTSC for counting
cycles.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* perfmon2 merge news
  2007-11-10 20:32     ` Andi Kleen
@ 2007-11-13 15:17       ` Robert Richter
  2007-11-13 15:35         ` [perfmon2] " William Cohen
  2007-11-13 18:32         ` Stephane Eranian
  0 siblings, 2 replies; 116+ messages in thread
From: Robert Richter @ 2007-11-13 15:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gregkh, akpm, eranian, linux-kernel, perfmon2-devel

On 10.11.07 21:32:39, Andi Kleen wrote:
> It would be really good to extract a core perfmon and start with
> that and then add stuff as it makes sense.
> 
> e.g. core perfmon could be something simple like just support
> to context switch state and initialize counters in a basic way 
> and perhaps get counter numbers for RDPMC in ring3 on x86[1]

Perhaps a core could provide also as much functionality so that
Perfmon can be used with an *unpatched* kernel using loadable modules?
One drawback with today's Perfmon is that it can not be used with a
vanilla kernel. But maybe such a core is by far too complex for a
first merge.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-11-13 15:17       ` perfmon2 merge news Robert Richter
@ 2007-11-13 15:35         ` William Cohen
  2007-11-13 17:55           ` Stephane Eranian
  2007-11-13 20:42           ` Andi Kleen
  2007-11-13 18:32         ` Stephane Eranian
  1 sibling, 2 replies; 116+ messages in thread
From: William Cohen @ 2007-11-13 15:35 UTC (permalink / raw)
  To: Robert Richter; +Cc: Andi Kleen, akpm, gregkh, linux-kernel, perfmon2-devel

Robert Richter wrote:
> On 10.11.07 21:32:39, Andi Kleen wrote:
>> It would be really good to extract a core perfmon and start with
>> that and then add stuff as it makes sense.
>>
>> e.g. core perfmon could be something simple like just support
>> to context switch state and initialize counters in a basic way 
>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> 
> Perhaps a core could provide also as much functionality so that
> Perfmon can be used with an *unpatched* kernel using loadable modules?
> One drawback with today's Perfmon is that it can not be used with a
> vanilla kernel. But maybe such a core is by far too complex for a
> first merge.
> 
> -Robert
> 

Hi Robert,

In the past I suggested that it might be useful to have a version of perfmon2 
that only set up the perfmon on a global basis. That would allow the patches for 
context switches to be added as a separate step, splitting up the patch into 
smaller set of patches.

Perfmon2 uses a set of system calls to control the performance monitoring 
hardware. This would make it difficult to use an unpatch kernel unless perfmon 
changed the mechanism used to control the performance monitoring hardware.

-Will

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-11-13 15:35         ` [perfmon2] " William Cohen
@ 2007-11-13 17:55           ` Stephane Eranian
  2007-11-13 18:33             ` [perfmon] " William Cohen
  2007-11-13 18:47             ` Philip Mucci
  2007-11-13 20:42           ` Andi Kleen
  1 sibling, 2 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 17:55 UTC (permalink / raw)
  To: William Cohen
  Cc: Robert Richter, akpm, Andi Kleen, gregkh, perfmon2-devel,
	linux-kernel, perfmon

Hello,

On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
> Robert Richter wrote:
> > On 10.11.07 21:32:39, Andi Kleen wrote:
> >> It would be really good to extract a core perfmon and start with
> >> that and then add stuff as it makes sense.
> >>
> >> e.g. core perfmon could be something simple like just support
> >> to context switch state and initialize counters in a basic way 
> >> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> > 
> > Perhaps a core could provide also as much functionality so that
> > Perfmon can be used with an *unpatched* kernel using loadable modules?
> > One drawback with today's Perfmon is that it can not be used with a
> > vanilla kernel. But maybe such a core is by far too complex for a
> > first merge.
> > 
> > -Robert
> > 
> 
> Hi Robert,
> 
> In the past I suggested that it might be useful to have a version of perfmon2 
> that only set up the perfmon on a global basis. That would allow the patches for 
> context switches to be added as a separate step, splitting up the patch into 
> smaller set of patches.
> 
> Perfmon2 uses a set of system calls to control the performance monitoring 
> hardware. This would make it difficult to use an unpatch kernel unless perfmon 
> changed the mechanism used to control the performance monitoring hardware.
>
Yes, that would be a possibility but as you pointed out there are some problems:

	- perfmon2 uses system calls. So unless you can dynamically patch the
	  syscall table we would have to go back to the ioctl() and driver model.
	  I was under the impression that people did not quite like multiplexing
	  syscalls such as ioctl(). I also do prefer the multi syscall approach.

	- perfmon2 needs to install a PMU interrupt handler. On X86, this is not just
	  an external device interrupts. There needs to be some APIC and interrupt
	  gate setup. There maybe other constraints on other architectures as well.
	  Not sure if all functions/structures necessary for this are available to
	  modules.

	- we could not support per-thread mode with the kernel module approach due to
	  link to the context switch code. I do believe per-thread is a key value-add
	  for performance monitoring.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-13 15:17       ` perfmon2 merge news Robert Richter
  2007-11-13 15:35         ` [perfmon2] " William Cohen
@ 2007-11-13 18:32         ` Stephane Eranian
  2007-11-13 22:29           ` Christoph Hellwig
  2007-11-16 18:25           ` PMC core internal API design Mathieu Desnoyers
  1 sibling, 2 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 18:32 UTC (permalink / raw)
  To: Robert Richter
  Cc: Andi Kleen, gregkh, akpm, linux-kernel, perfmon2-devel, perfmon

Hello,

On Tue, Nov 13, 2007 at 04:17:18PM +0100, Robert Richter wrote:
> On 10.11.07 21:32:39, Andi Kleen wrote:
> > It would be really good to extract a core perfmon and start with
> > that and then add stuff as it makes sense.
> > 
> > e.g. core perfmon could be something simple like just support
> > to context switch state and initialize counters in a basic way 
> > and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> 
> Perhaps a core could provide also as much functionality so that
> Perfmon can be used with an *unpatched* kernel using loadable modules?
> One drawback with today's Perfmon is that it can not be used with a
> vanilla kernel. But maybe such a core is by far too complex for a
> first merge.
> 
Note that I am not against the gradual approach such as:
	- system-wide only counting
	- per-thread counting
	- user-level sampling support
	- in-kernel sampling buffer support
	- in-kernel customizable sampling buffer formats via modules
	- event set multiplexing
	- PMU description modules

It would obvisouly cause a lot of troubles to existing perfmon libraries and
applications (e.g. PAPI). It would also be fairly tricky to do because you'd 
have to make sure that in the beginning, you leave enough flexiblity such that
you can add the rest while maintaining total backward compatibility. But given
that we already have the full solution, it could just be a matter of dropping
features without disrupting the user level API. Of course there would be a bigger
burden on the maintainer because he would have two trees to maintain but I think
that is already commonplace in many of the kernel-related projects.

Let's take a simple example. The set of syscalls necessary to control a system-wide
monitoring session is exactly the same as for a per-thread session. The difference is
just a flag when the session is created. Thus, we could keep the same set of syscalls,
but only accept system-wide sessions. Later on, when we add per-thread, we would just
have to expose the per-thread session flag.

Having said that, does not mean that this is necessarily what we will do. I am just
try to present my understanding of the comments from Andrew, Andi and others.

I think that going with a kernel module will not address the 'complexity/bloat' perception
that some people have. There is a logic to that, I did not just wakeup one day saying
'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or
developers and it was justfied by what the hardware offered then. This unfortunately still
stands today. I admit that justification is not necessarily spelled out clearly in the code. So
I understand most of those worries and I am trying to figure out how we could best address them.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 17:55           ` Stephane Eranian
@ 2007-11-13 18:33             ` William Cohen
  2007-11-13 21:13               ` Stephane Eranian
  2007-11-13 18:47             ` Philip Mucci
  1 sibling, 1 reply; 116+ messages in thread
From: William Cohen @ 2007-11-13 18:33 UTC (permalink / raw)
  To: eranian
  Cc: akpm, Robert Richter, gregkh, linux-kernel, perfmon, Andi Kleen,
	perfmon2-devel

Stephane Eranian wrote:
> Hello,
> 
> On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
>> Robert Richter wrote:
>>> On 10.11.07 21:32:39, Andi Kleen wrote:
>>>> It would be really good to extract a core perfmon and start with
>>>> that and then add stuff as it makes sense.
>>>>
>>>> e.g. core perfmon could be something simple like just support
>>>> to context switch state and initialize counters in a basic way 
>>>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>>> Perhaps a core could provide also as much functionality so that
>>> Perfmon can be used with an *unpatched* kernel using loadable modules?
>>> One drawback with today's Perfmon is that it can not be used with a
>>> vanilla kernel. But maybe such a core is by far too complex for a
>>> first merge.
>>>
>>> -Robert
>>>
>> Hi Robert,
>>
>> In the past I suggested that it might be useful to have a version of perfmon2 
>> that only set up the perfmon on a global basis. That would allow the patches for 
>> context switches to be added as a separate step, splitting up the patch into 
>> smaller set of patches.
>>
>> Perfmon2 uses a set of system calls to control the performance monitoring 
>> hardware. This would make it difficult to use an unpatch kernel unless perfmon 
>> changed the mechanism used to control the performance monitoring hardware.
>>
> Yes, that would be a possibility but as you pointed out there are some problems:
> 
> 	- perfmon2 uses system calls. So unless you can dynamically patch the
> 	  syscall table we would have to go back to the ioctl() and driver model.
> 	  I was under the impression that people did not quite like multiplexing
> 	  syscalls such as ioctl(). I also do prefer the multi syscall approach.
> 
> 	- perfmon2 needs to install a PMU interrupt handler. On X86, this is not just
> 	  an external device interrupts. There needs to be some APIC and interrupt
> 	  gate setup. There maybe other constraints on other architectures as well.
> 	  Not sure if all functions/structures necessary for this are available to
> 	  modules.

The oprofile module can setup a handler for PMU interrupts. This is done in 
archi/x86/oprofile/nmi_int:nmi_cpu_setup().  Other modules could do the same. 
However, it bumps what ever was using the nmi/pmu off, then restores nmi/pmu 
when oprofile is shut down. Maybe the pmu/nmi resource reservation mechanism 
should be another self-contained patch.

> 	- we could not support per-thread mode with the kernel module approach due to
> 	  link to the context switch code. I do believe per-thread is a key value-add
> 	  for performance monitoring.

The per-thread monitoring is useful to a number of people and many people want 
it. The thought was how to break the large perfmon patch into set of smaller 
incremental patches. So it isn't whether to have per-thread pmu virtualization, 
but rather when/how to get it in.

-Will

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 17:55           ` Stephane Eranian
  2007-11-13 18:33             ` [perfmon] " William Cohen
@ 2007-11-13 18:47             ` Philip Mucci
  2007-11-13 18:59               ` Greg KH
  2007-11-13 22:27               ` Christoph Hellwig
  1 sibling, 2 replies; 116+ messages in thread
From: Philip Mucci @ 2007-11-13 18:47 UTC (permalink / raw)
  To: eranian
  Cc: William Cohen, akpm, Robert Richter, gregkh, linux-kernel,
	Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel, papi list

Hi folks,

Well, I can say the mood here at supercomputing'07 is pretty somber  
in regards to the latest exchange of messages regarding the perfmon  
patches. Our community has been the largest user of both the PerfCtr  
and the Perfmon patches, the former being regularly installed by  
vendors and integrators on clusters at install time, and the latter  
now being adopted into vendor kernels by IBM, Cray, AMD, SiCortex and  
others. Of course, adoption by a vendor, does not a good kernel patch  
make. However, it should be viewed as a strong data point on demand  
for such functionality. We are a community focused on performance and  
we have long had a need for these tools.

A solution that does not provide 64 bit virtualized per-thread counts  
is not a solution at all. That would need to be ripped out by all of  
us using this functionality so we could get something that actually  
does what the community needs, not what the you folks think we need.  
Device level access and/or root access to the counters is not  
unacceptable for machines in production. If that was fine, oprofile  
would have satisfied everyone and we wouldn't be sucking up your  
bandwidth. Please understand that people outside of the your  
community are desperate for adoption of any form of 'per-thread' PMU  
functionality into the kernel. For those of you who are (still) not  
convinced of this, I can arrange your inbox to be spammed by 1000's  
of HPC geeks, managers, vendors, etc. My point is, let's start  
somewhere that the community finds useful. Otherwise we run the risk  
of developing an interface that everyone isn't comfortable with and  
no-one uses. Hardly a productive exercise.

So please, do consider a set of core functionality that provides for  
(at least) the following:

- per-CPU and per-thread 64 bit virtualized counts
- third person operation (attach/ptrace)
- dispatch of signal upon interrupt on overflow if requested
- 'buffered' interrupts into a buffer that can be mmap'd into user space
- support for a variety of the major processor platforms

Regards,


On Nov 13, 2007, at 9:55 AM, Stephane Eranian wrote:

> Hello,
>
> On Tue, Nov 13, 2007 at 10:35:11AM -0500, William Cohen wrote:
>> Robert Richter wrote:
>>> On 10.11.07 21:32:39, Andi Kleen wrote:
>>>> It would be really good to extract a core perfmon and start with
>>>> that and then add stuff as it makes sense.
>>>>
>>>> e.g. core perfmon could be something simple like just support
>>>> to context switch state and initialize counters in a basic way
>>>> and perhaps get counter numbers for RDPMC in ring3 on x86[1]
>>>
>>> Perhaps a core could provide also as much functionality so that
>>> Perfmon can be used with an *unpatched* kernel using loadable  
>>> modules?
>>> One drawback with today's Perfmon is that it can not be used with a
>>> vanilla kernel. But maybe such a core is by far too complex for a
>>> first merge.
>>>
>>> -Robert
>>>
>>
>> Hi Robert,
>>
>> In the past I suggested that it might be useful to have a version  
>> of perfmon2
>> that only set up the perfmon on a global basis. That would allow  
>> the patches for
>> context switches to be added as a separate step, splitting up the  
>> patch into
>> smaller set of patches.
>>
>> Perfmon2 uses a set of system calls to control the performance  
>> monitoring
>> hardware. This would make it difficult to use an unpatch kernel  
>> unless perfmon
>> changed the mechanism used to control the performance monitoring  
>> hardware.
>>
> Yes, that would be a possibility but as you pointed out there are  
> some problems:
>
> 	- perfmon2 uses system calls. So unless you can dynamically patch the
> 	  syscall table we would have to go back to the ioctl() and driver  
> model.
> 	  I was under the impression that people did not quite like  
> multiplexing
> 	  syscalls such as ioctl(). I also do prefer the multi syscall  
> approach.
>
> 	- perfmon2 needs to install a PMU interrupt handler. On X86, this  
> is not just
> 	  an external device interrupts. There needs to be some APIC and  
> interrupt
> 	  gate setup. There maybe other constraints on other architectures  
> as well.
> 	  Not sure if all functions/structures necessary for this are  
> available to
> 	  modules.
>
> 	- we could not support per-thread mode with the kernel module  
> approach due to
> 	  link to the context switch code. I do believe per-thread is a  
> key value-add
> 	  for performance monitoring.
>
> -- 
> -Stephane
> _______________________________________________
> perfmon mailing list
> perfmon@linux.hpl.hp.com
> http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 18:47             ` Philip Mucci
@ 2007-11-13 18:59               ` Greg KH
  2007-11-13 20:07                 ` Andrew Morton
  2007-11-13 21:33                 ` [perfmon] Re: [perfmon2] " Stephane Eranian
  2007-11-13 22:27               ` Christoph Hellwig
  1 sibling, 2 replies; 116+ messages in thread
From: Greg KH @ 2007-11-13 18:59 UTC (permalink / raw)
  To: Philip Mucci
  Cc: eranian, William Cohen, akpm, Robert Richter, linux-kernel,
	Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel, papi list

On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> Hi folks,
>
> Well, I can say the mood here at supercomputing'07 is pretty somber in 
> regards to the latest exchange of messages regarding the perfmon patches. 

"somber"?

Why?

We (a number of the kernel developers) want to see the perfmon code make
it into the kernel tree, unfortunatly, in the current state it is in,
that's not going to happen.

Andi specified a way that this can happen, just refactor your patches
into smaller bits that can be reviewed and applied.

If you, or anyone else has any questions about this, please let us know.
So far, I have not seen any response to his message, so I'm guessing
that the perfmon developers either are off working on this, or don't
care.

And if they don't care, then yes, I agree with your "somber" feeling...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 18:59               ` Greg KH
@ 2007-11-13 20:07                 ` Andrew Morton
  2007-11-13 20:14                   ` Greg KH
                                     ` (2 more replies)
  2007-11-13 21:33                 ` [perfmon] Re: [perfmon2] " Stephane Eranian
  1 sibling, 3 replies; 116+ messages in thread
From: Andrew Morton @ 2007-11-13 20:07 UTC (permalink / raw)
  To: Greg KH
  Cc: Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

On Tue, 13 Nov 2007 10:59:24 -0800 Greg KH <gregkh@suse.de> wrote:

> On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> > Hi folks,
> >
> > Well, I can say the mood here at supercomputing'07 is pretty somber in 
> > regards to the latest exchange of messages regarding the perfmon patches. 
> 
> "somber"?
> 
> Why?
> 
> We (a number of the kernel developers) want to see the perfmon code make
> it into the kernel tree, unfortunatly, in the current state it is in,
> that's not going to happen.
> 
> Andi specified a way that this can happen, just refactor your patches
> into smaller bits that can be reviewed and applied.
> 
> If you, or anyone else has any questions about this, please let us know.
> So far, I have not seen any response to his message, so I'm guessing
> that the perfmon developers either are off working on this, or don't
> care.
> 
> And if they don't care, then yes, I agree with your "somber" feeling...
> 

Well...  Philip is (I assume) a numerical-computing guy and not a
kernel-developing guy (probably a wise choice).

He speaks for quite a few people - they have serious need for this feature
but they've had to scruff around with out-of-tree patches for years to get
it, and still there are problems.

I was hoping that after the round of release-and-review which Stephane,
Andi and I did about twelve months ago that we were on track to merge the
perfmon codebase as-offered.  But now it turns out that the sentiment is
that the code simply has too many bells-and-whistles to be acceptable.

My problem with that sentiment is that it is quite likely the case that
those bells-n-whistles are actually useful and needed features.  Perfmon
has been out there for quite a few years and the code which is in there
_should_ be in response to real-world in-the-field experience.  Such
requirements never go away.


So.  If what I am saying is correct then the best course of action would be
for Stephane to help us all to understand what these features are and why
we need them.  The ideal way in which to do this is

[patch] perfmon: core
[patch] perfmon: whizzy feature #1
[patch] perfmon: whizzy feature #2
[patch] perfmon: whizzy feature #3

etc.  Where the changelog in each whizzy-feature-n explains what it does,
why it does it and why our users need it.

Whatever happens, perfmon is so big and so old and has been out-of-tree for
so long that it's going to take a pile of work from lots of people to get
any of it landed.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 20:07                 ` Andrew Morton
@ 2007-11-13 20:14                   ` Greg KH
  2007-11-13 20:36                   ` Andi Kleen
  2007-11-14  7:24                   ` [perfmon] Re: [perfmon2] " Paul Mackerras
  2 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-13 20:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

On Tue, Nov 13, 2007 at 12:07:28PM -0800, Andrew Morton wrote:
> 
> So.  If what I am saying is correct then the best course of action would be
> for Stephane to help us all to understand what these features are and why
> we need them.  The ideal way in which to do this is
> 
> [patch] perfmon: core
> [patch] perfmon: whizzy feature #1
> [patch] perfmon: whizzy feature #2
> [patch] perfmon: whizzy feature #3
> 
> etc.  Where the changelog in each whizzy-feature-n explains what it does,
> why it does it and why our users need it.

I agree.  Right now their git tree has over 80 patches in it, without
descriptions like this to help those of us who want to review and help
out, it is quite difficult.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 20:07                 ` Andrew Morton
  2007-11-13 20:14                   ` Greg KH
@ 2007-11-13 20:36                   ` Andi Kleen
  2007-11-14  0:28                     ` Philip Mucci
  2007-11-14  7:24                   ` [perfmon] Re: [perfmon2] " Paul Mackerras
  2 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-13 20:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg KH, Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

> He speaks for quite a few people - they have serious need for this feature

Most likely they have serious need for a very small subset of perfmon2.
The point of my proposal was to get this very small subset in quickly.

Phil, how many of the command line options of pfmon do you
actually use? How many do the people at your conference use? Or what
functions, what performance counters etc. in PAPI or whatever 
library you use? 

Make use understand the use cases better, that would already help a lot
in merging by concentrating on what people actually really need.

-Andi


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-11-13 15:35         ` [perfmon2] " William Cohen
  2007-11-13 17:55           ` Stephane Eranian
@ 2007-11-13 20:42           ` Andi Kleen
  1 sibling, 0 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-13 20:42 UTC (permalink / raw)
  To: William Cohen
  Cc: Robert Richter, Andi Kleen, akpm, gregkh, linux-kernel, perfmon2-devel

> In the past I suggested that it might be useful to have a version of 
> perfmon2 that only set up the perfmon on a global basis. That would allow 

Context switch is imho the main differentiating feature of perfmon 
over oprofile.  Not sure it makes sense to take that one out.

I don't think the complexity of the patches comes from the context
switch anyways, it comes from the lots of other things it does.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 18:33             ` [perfmon] " William Cohen
@ 2007-11-13 21:13               ` Stephane Eranian
  2007-11-13 21:29                 ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 21:13 UTC (permalink / raw)
  To: William Cohen
  Cc: akpm, Robert Richter, gregkh, linux-kernel, perfmon, Andi Kleen,
	perfmon2-devel, perfmon

Will,

On Tue, Nov 13, 2007 at 01:33:55PM -0500, William Cohen wrote:
> 
> The oprofile module can setup a handler for PMU interrupts. This is done in 
> archi/x86/oprofile/nmi_int:nmi_cpu_setup().  Other modules could do the 
> same. However, it bumps what ever was using the nmi/pmu off, then restores 
> nmi/pmu when oprofile is shut down. Maybe the pmu/nmi resource reservation 
> mechanism should be another self-contained patch.
> 

Oprofile does not setup the PMU interrupt. It builds on top of the NMI watchdog
setup. It uses the register_die() mechanism, if I recall. The low level APIC
and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because
of the NMI watchdog. 


> >	- we could not support per-thread mode with the kernel module 
> >	approach due to
> >	  link to the context switch code. I do believe per-thread is a key 
> >	  value-add
> >	  for performance monitoring.
> 
> The per-thread monitoring is useful to a number of people and many people 
> want it. The thought was how to break the large perfmon patch into set of 
> smaller incremental patches. So it isn't whether to have per-thread pmu 
> virtualization, but rather when/how to get it in.

I think we all agree on this.

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 21:13               ` Stephane Eranian
@ 2007-11-13 21:29                 ` Andi Kleen
  2007-11-13 21:46                   ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-13 21:29 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: William Cohen, akpm, Robert Richter, gregkh, linux-kernel,
	perfmon, Andi Kleen, perfmon2-devel

On Tue, Nov 13, 2007 at 01:13:45PM -0800, Stephane Eranian wrote:
> Oprofile does not setup the PMU interrupt. It builds on top of the NMI watchdog
> setup.

Oprofile works without the NMI watchdog too, but it just happens to be another
NMI user.

> It uses the register_die() mechanism, 

Not correct.

> if I recall. The low level APIC
> and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because
> of the NMI watchdog. 

It could handle it in the same way as oprofile if it wanted. But given
NMIs make everything more complicated and it might not be worth it.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 18:59               ` Greg KH
  2007-11-13 20:07                 ` Andrew Morton
@ 2007-11-13 21:33                 ` Stephane Eranian
  2007-11-13 21:45                   ` Greg KH
  1 sibling, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 21:33 UTC (permalink / raw)
  To: Greg KH
  Cc: Philip Mucci, William Cohen, akpm, Robert Richter, linux-kernel,
	Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel, papi list,
	perfmon

Greg,

On Tue, Nov 13, 2007 at 10:59:24AM -0800, Greg KH wrote:
> On Tue, Nov 13, 2007 at 10:47:45AM -0800, Philip Mucci wrote:
> > Hi folks,
> >
> > Well, I can say the mood here at supercomputing'07 is pretty somber in 
> > regards to the latest exchange of messages regarding the perfmon patches. 
> 
> "somber"?
> 

I am the core developer of this and I am not as pessimistic as Phil. Yet I admit
Phil has been asking for this kind of kernel interface for a very very long time ;-<

> Why?
> 
> We (a number of the kernel developers) want to see the perfmon code make
> it into the kernel tree, unfortunatly, in the current state it is in,
> that's not going to happen.
> 
> Andi specified a way that this can happen, just refactor your patches
> into smaller bits that can be reviewed and applied.
> 

I think I understand your concerns. I will work on this. I think it is possible to
refactor. It will certainly be painful (for me), but I think it can be done within
some reasonable delay. Of course, it would be help if you could better qualify what
you mean by 'smaller'.

> If you, or anyone else has any questions about this, please let us know.
> So far, I have not seen any response to his message, so I'm guessing
> that the perfmon developers either are off working on this, or don't
> care.
> 

I will start working on this once I fix the tickless/hrtimer issues.

> And if they don't care, then yes, I agree with your "somber" feeling...
> 

I do care a lot actually. Believe me, I do spend a lot of effort and energy
on this project everyday, like many others around the world, and I intend for
it to succeed. We have reached a point in the development of processor hardware
where this kind of features is crucial and it is not just for HPC folks anymore.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 21:33                 ` [perfmon] Re: [perfmon2] " Stephane Eranian
@ 2007-11-13 21:45                   ` Greg KH
  0 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-13 21:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Philip Mucci, William Cohen, akpm, Robert Richter, linux-kernel,
	Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel, papi list

On Tue, Nov 13, 2007 at 01:33:13PM -0800, Stephane Eranian wrote:
> I think I understand your concerns. I will work on this. I think it is possible to
> refactor. It will certainly be painful (for me), but I think it can be done within
> some reasonable delay. Of course, it would be help if you could better qualify what
> you mean by 'smaller'.

I think Andrew already spelled this out.  If after reading his message,
you still have questions, please let me know and I'll be glad to work
with you to address them.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 21:29                 ` Andi Kleen
@ 2007-11-13 21:46                   ` Stephane Eranian
  2007-11-13 21:50                     ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 21:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, Robert Richter, gregkh, linux-kernel, perfmon,
	William Cohen, perfmon2-devel

Andi.

On Tue, Nov 13, 2007 at 10:29:02PM +0100, Andi Kleen wrote:
> On Tue, Nov 13, 2007 at 01:13:45PM -0800, Stephane Eranian wrote:
> > Oprofile does not setup the PMU interrupt. It builds on top of the NMI watchdog
> > setup.
> 
> Oprofile works without the NMI watchdog too, but it just happens to be another
> NMI user.
> 
I have no doubt it can work with a "regular" interrupt.

> > It uses the register_die() mechanism, 
> 
> Not correct.
> 
I meant the register_die_notifier() mechanism which allow you to
chain a handler on NMI interrupts. At least that's my understanding
reading the code:

static int nmi_setup(void)
{
        int err=0;
        int cpu;

        if (!allocate_msrs())
                return -ENOMEM;

        if ((err = register_die_notifier(&profile_exceptions_nb))){
                free_msrs();
                pfm_release_allcpus();
                return err;
        }
	...


> > if I recall. The low level APIC
> > and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because
> > of the NMI watchdog. 
> 
> It could handle it in the same way as oprofile if it wanted. But given
> NMIs make everything more complicated and it might not be worth it.
> 
Yes, horribly more complicated because of locking issues within perfmon.
As soon as you expose a file descriptor, you need some locking to prevent
multiple user threads (malicious or not) to compete to access the PMU state.
I think the value add of NMI can be as well achieved with advanced PMU features
such as Intel Core 2 PEBS.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 21:46                   ` Stephane Eranian
@ 2007-11-13 21:50                     ` Andi Kleen
  2007-11-13 22:22                       ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-13 21:50 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, akpm, Robert Richter, gregkh, linux-kernel, perfmon,
	William Cohen, perfmon2-devel

> Yes, horribly more complicated because of locking issues within perfmon.
> As soon as you expose a file descriptor, you need some locking to prevent
> multiple user threads (malicious or not) to compete to access the PMU state.

Why do you need the file descriptor? 

One of the main problems with perfmon is the complicated user interface.

Naively I would assume just some thread global state should be sufficient. 

> I think the value add of NMI can be as well achieved with advanced PMU features
> such as Intel Core 2 PEBS.

True probably, although only on CPUs that support PEBS. Dropping features
for old CPUs is unfortunately quite difficult in Linux, and in this case
probably not an option because there are so many of them (e.g. all of AMD
not Fam10h) 

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 21:50                     ` Andi Kleen
@ 2007-11-13 22:22                       ` Stephane Eranian
  2007-11-13 22:25                         ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 22:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, Robert Richter, gregkh, linux-kernel, perfmon,
	William Cohen, perfmon2-devel

Andi,
On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > Yes, horribly more complicated because of locking issues within perfmon.
> > As soon as you expose a file descriptor, you need some locking to prevent
> > multiple user threads (malicious or not) to compete to access the PMU state.
> 
> Why do you need the file descriptor? 
> 

To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread.
file descriptor allows you to use close, read, select, poll and you leverage the
existing file descriptor sharing/inheritance sematics. At the kernel level, a 
descriptor provides all the callback necessary to make sure you clean up the perfmon
session state on exit.


> One of the main problems with perfmon is the complicated user interface.
> 
> Naively I would assume just some thread global state should be sufficient. 
> 
> > I think the value add of NMI can be as well achieved with advanced PMU features
> > such as Intel Core 2 PEBS.
> 
> True probably, although only on CPUs that support PEBS. Dropping features
> for old CPUs is unfortunately quite difficult in Linux, and in this case
> probably not an option because there are so many of them (e.g. all of AMD
> not Fam10h) 
> 

Yes, I know that. Also note that unfortunately, AMD Fam10h IBS feature does not
allow you to capture more than one sample in critical sections. It is still
interrupt based sampling with one entry-deep buffer: one interrupt = one sample.
Perfmon does support NMI though it is much more expensive to use.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 22:22                       ` Stephane Eranian
@ 2007-11-13 22:25                         ` Andi Kleen
  2007-11-13 22:58                           ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-13 22:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, akpm, Robert Richter, gregkh, linux-kernel, perfmon,
	William Cohen, perfmon2-devel

On Tue, Nov 13, 2007 at 02:22:34PM -0800, Stephane Eranian wrote:
> Andi,
> On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > > Yes, horribly more complicated because of locking issues within perfmon.
> > > As soon as you expose a file descriptor, you need some locking to prevent
> > > multiple user threads (malicious or not) to compete to access the PMU state.
> > 
> > Why do you need the file descriptor? 
> > 
> 
> To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread.
> file descriptor allows you to use close, read, select, poll and you leverage the

Surely that could be done with a flag for each call too? Keeping file descriptors
to pass essentially a boolean seems overkill.

> existing file descriptor sharing/inheritance sematics. At the kernel level, a 
> descriptor provides all the callback necessary to make sure you clean up the perfmon
> session state on exit.

Didn't you already have a thread destructor for it?

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 18:47             ` Philip Mucci
  2007-11-13 18:59               ` Greg KH
@ 2007-11-13 22:27               ` Christoph Hellwig
  1 sibling, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2007-11-13 22:27 UTC (permalink / raw)
  To: Philip Mucci
  Cc: eranian, William Cohen, akpm, Robert Richter, gregkh,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

<stupid bullshitting snipped>

What about investing some effort to do a proper performance counter
infrastructure or turning the mess perfom is into one instead of this
useless rant?  Code is not getting any better by your complain ccing
gazillions of useless list.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-13 18:32         ` Stephane Eranian
@ 2007-11-13 22:29           ` Christoph Hellwig
  2007-11-16 18:25           ` PMC core internal API design Mathieu Desnoyers
  1 sibling, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2007-11-13 22:29 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Robert Richter, Andi Kleen, gregkh, akpm, linux-kernel,
	perfmon2-devel, perfmon

On Tue, Nov 13, 2007 at 10:32:39AM -0800, Stephane Eranian wrote:
> It would obvisouly cause a lot of troubles to existing perfmon libraries and
> applications (e.g. PAPI). It would also be fairly tricky to do because you'd 
> have to make sure that in the beginning, you leave enough flexiblity such that
> you can add the rest while maintaining total backward compatibility. But given
> that we already have the full solution, it could just be a matter of dropping
> features without disrupting the user level API.

There no way we'll keep this completely idiotic userland API.  If people start
to use out of tree APIs they can pretty much expect that they're not going
to stay around.  And in this case they most certainly won't.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 22:25                         ` Andi Kleen
@ 2007-11-13 22:58                           ` Stephane Eranian
  2007-11-14  2:07                             ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-13 22:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, Robert Richter, gregkh, linux-kernel, perfmon,
	William Cohen, perfmon2-devel

Andi,

On Tue, Nov 13, 2007 at 11:25:34PM +0100, Andi Kleen wrote:
> On Tue, Nov 13, 2007 at 02:22:34PM -0800, Stephane Eranian wrote:
> > Andi,
> > On Tue, Nov 13, 2007 at 10:50:56PM +0100, Andi Kleen wrote:
> > > > Yes, horribly more complicated because of locking issues within perfmon.
> > > > As soon as you expose a file descriptor, you need some locking to prevent
> > > > multiple user threads (malicious or not) to compete to access the PMU state.
> > > 
> > > Why do you need the file descriptor? 
> > > 
> > 
> > To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread.
> > file descriptor allows you to use close, read, select, poll and you leverage the
> 
> Surely that could be done with a flag for each call too? Keeping file descriptors
> to pass essentially a boolean seems overkill.
> 

I don't understand this.

Let's take the simplest possible example (self-monitoring per-thread)
counting one event in one data register.

int
main(int argc, char **argv)
{
	int ctx_fd;
	pfarg_pmd_t pd[1];
	pfarg_pmc_t pc[1];
	pfarg_ctx_t ctx;
	pfarg_load_t load_args;

	memset(&ctx, 0, sizeof(ctx));
	memset(pc, 0, sizeof(pc));
	memset(pd, 0, sizeof(pd));

	/* create session (context) and get file descriptor back (identifier) */
	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

	/* setup one config register (PMC0) */
	pc[0].reg_num   = 0
	pc[0].reg_value = 0x1234;

	/* setup one data register (PMD0) */
	pd[0].reg_num = 0;
	pd[0].reg_value = 0;

	/* program the registers */
	pfm_write_pmcs(ctx_fd, pc, 1);
	pfm_write_pmds(ctx_fd, pd, 1);

	/* attach the context to self */
	load_args.load_pid = getpid();
	pfm_load_context(ctx_fd, &load_args);

	/* activate monitoring */
	pfm_start(ctx_fd, NULL);

	/*
	 * run code to measure
	 */

	/* stop monitoring */
	pfm_stop(ctx_fd);

	/* read data register */
	pfm_read_pmds(ctx_fd, pd, 1);

	printf("PMD0 %llu\n", pd[0].reg_value);

	/* destroy session */
	close(ctx_fd);

	return 0;
}

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:44                             ` Paul Mackerras
@ 2007-11-13 23:49                               ` Nick Piggin
  2007-11-14 11:58                                 ` David Miller
  2007-11-14 11:52                               ` David Miller
  2007-11-14 13:51                               ` Stephane Eranian
  2 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2007-11-13 23:49 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> David Miller writes:
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
>
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way.  I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier.  For that a system call is better.
>
> For instance, if you have something that kind-of looks like
>
> 	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
>
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

Could you implement it with readv()?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:58                                 ` David Miller
@ 2007-11-14  0:25                                   ` Nick Piggin
  2007-11-14 21:30                                     ` Paul Mackerras
  0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2007-11-14  0:25 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

On Wednesday 14 November 2007 22:58, David Miller wrote:
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Wed, 14 Nov 2007 10:49:48 +1100
>
> > On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > > David Miller writes:
> > > > This is my impression too, all of the things being done with
> > > > a slew of system calls would be better served by real special
> > > > files and appropriate fops.
> > >
> > > Special files and fops really only work well if you can coerce the
> > > interface into one where data flows predominantly one way.  I don't
> > > think they work so well for something that is more like an RPC across
> > > the user/kernel barrier.  For that a system call is better.
> > >
> > > For instance, if you have something that kind-of looks like
> > >
> > > 	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> > >
> > > where the caller supplies an array of PMD numbers and the function
> > > returns their values (and you want that reading to be done atomically
> > > in some sense), how would you do that using special files and fops?
> >
> > Could you implement it with readv()?
>
> Sure, why not?  Just cook up an iovec.  pmd_numbers goes to offset
> X and pmd_values goes to offset Y, with some helpers like what
> we have in the networking already for recvmsg.
>
> But why would you want readv() for this?  The syscall thing
> Paul asked me to translate into a read() doesn't provide
> iovec-like behavior so I don't see why readv() is necessary
> at all.

Ah sorry, that's what I get for typing before I think: of course
readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also
vectorises the file offset. Maybe this is useful enough as a generic
syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an
important requirement. If not, you can obviously just do it with
multiple read syscalls...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 20:36                   ` Andi Kleen
@ 2007-11-14  0:28                     ` Philip Mucci
  2007-11-14  1:52                       ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Philip Mucci @ 2007-11-14  0:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Greg KH, Stephane Eranian, William Cohen,
	Robert Richter, linux-kernel, Perfmon, perfmon2-devel, papi list

Hi Andi,

pfmon is a single tool and fairly low level, the HPC folks don't use  
it so much because it isn't parallel aware and is meant for power- 
users. It is not representative of the tools used in HPC at all. Our  
community uses tools built on the infrastructure provided by libpfm  
and PAPI for the most part.

I know you don't want to hear this, but we actually use all of the  
features of perfmon, because a) we wanted to use the best methods  
available and b) areas where user level solutions could be made (like  
multiplexing) introduced too much noise and overhead to be of use.  
For years we relied on PerfCtr which did 'just enough' for us. But  
when Perfmon2 became available, we adopted technology where it meant  
a significant increase in accuracy for the resulting measurements,  
specifically for us that meant, kernel multiplexing and sample buffers.
Note that PAPI is just middleware. The tools built upon it are what  
people use...some of those are commercial tools like Vampir but most  
are Open Source. These tools are cross platform, as such they run on  
nearly everything...although intel/amd/ppc systems dominate the HPC  
market.

The usage cases are always the same and can be broken down into  
simple counting and sampling:

	- providing virtualized 64-bit counters per-thread
	- providing notification (buffered or non) on interrupt/overflow of  
the above.

If you'd like to outline further what you'd like to hear from the  
community, I can arrange that. I seem to remember going through this  
once before, but I'd be happy to do it again. For reference, here's a  
quick list from memory of some of the tools in active use and built  
on this infrastructure. These are used heavily around the globe.  
You'll see that each basically follows one of the 2 usage models above.

- HPCToolkit (Rice)
- PerfSuite (NCSA)
- Vampir (Dresden)
- Kojak (Juelich)
- TAU (UOregon)
- PAPIEX (me)
- GPTL (NCAR)
- HPM-Linux (IBM)
- Paraver (Barcelona)

Time to go give a talk here at a tools session at SC'07 about this  
very subject.

Phil

On Nov 13, 2007, at 12:36 PM, Andi Kleen wrote:

>> He speaks for quite a few people - they have serious need for this  
>> feature
>
> Most likely they have serious need for a very small subset of  
> perfmon2.
> The point of my proposal was to get this very small subset in quickly.
>
> Phil, how many of the command line options of pfmon do you
> actually use? How many do the people at your conference use? Or what
> functions, what performance counters etc. in PAPI or whatever
> library you use?
>
> Make use understand the use cases better, that would already help a  
> lot
> in merging by concentrating on what people actually really need.
>
> -Andi
>


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:07                                   ` David Miller
@ 2007-11-14  0:28                                     ` Nick Piggin
  2007-11-14 21:50                                     ` Paul Mackerras
  1 sibling, 0 replies; 116+ messages in thread
From: Nick Piggin @ 2007-11-14  0:28 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

On Wednesday 14 November 2007 23:07, David Miller wrote:
> From: Paul Mackerras <paulus@samba.org>
> Date: Wed, 14 Nov 2007 23:03:24 +1100
>
> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read?  Gack!  Surely you have better
> > taste than that?
>
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
>
> I see nothing wrong or gross with these semantics.  Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

True, but is it now any so different to an ioctl?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14  0:28                     ` Philip Mucci
@ 2007-11-14  1:52                       ` Andi Kleen
  2007-11-16  9:18                         ` Philip Mucci
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-14  1:52 UTC (permalink / raw)
  To: Philip Mucci
  Cc: Andi Kleen, Andrew Morton, Greg KH, Stephane Eranian,
	William Cohen, Robert Richter, linux-kernel, Perfmon,
	perfmon2-devel, papi list

On Tue, Nov 13, 2007 at 04:28:52PM -0800, Philip Mucci wrote:
> I know you don't want to hear this, but we actually use all of the  
> features of perfmon, because a) we wanted to use the best methods  

That is hard to believe.

But let's go for it temporarily for the argument. 

Can you instead prioritize features.  What is most essential, what is 
important, what is just nice to have, what is rarely used? 

> Note that PAPI is just middleware. The tools built upon it are what  

Surely the tools on top cannot use more than the middleware provides.

> 	- providing virtualized 64-bit counters per-thread
> 	- providing notification (buffered or non) on interrupt/overflow of  
> the above.

Ok that makes sense and should be possible with a reasonable simple
interface.

> If you'd like to outline further what you'd like to hear from the  
> community, I can arrange that. I seem to remember going through this  
> once before, but I'd be happy to do it again. For reference, here's a  
> quick list from memory of some of the tools in active use and built  
> on this infrastructure. These are used heavily around the globe.  

Please list concrete features, throwing around random names is not useful.

-Andi


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 22:58                           ` Stephane Eranian
@ 2007-11-14  2:07                             ` Andi Kleen
  2007-11-14 13:09                               ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-14  2:07 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, akpm, Robert Richter, gregkh, linux-kernel, William Cohen


[dropped all these bouncing email lists. Adding closed lists to public
cc lists is just a bad idea]

> int
> main(int argc, char **argv)
> {
> 	int ctx_fd;
> 	pfarg_pmd_t pd[1];
> 	pfarg_pmc_t pc[1];
> 	pfarg_ctx_t ctx;
> 	pfarg_load_t load_args;
> 
> 	memset(&ctx, 0, sizeof(ctx));
> 	memset(pc, 0, sizeof(pc));
> 	memset(pd, 0, sizeof(pd));
> 
> 	/* create session (context) and get file descriptor back (identifier) */
> 	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

There's nothing in your example that makes the file descriptor needed.

> 
> 	/* setup one config register (PMC0) */
> 	pc[0].reg_num   = 0
> 	pc[0].reg_value = 0x1234;

That would be nicer if it was just two arguments.

> 
> 	/* setup one data register (PMD0) */
> 	pd[0].reg_num = 0;
> 	pd[0].reg_value = 0;

Why do you need to set the data register? Wouldn't it make
more sense to let the kernel handle that and just return one.

> 
> 	/* program the registers */
> 	pfm_write_pmcs(ctx_fd, pc, 1);
> 	pfm_write_pmds(ctx_fd, pd, 1);
> 
> 	/* attach the context to self */
> 	load_args.load_pid = getpid();
> 	pfm_load_context(ctx_fd, &load_args);

My replacement would be to just add a flags argument to write_pmcs 
with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
> 
> 	/* activate monitoring */
> 	pfm_start(ctx_fd, NULL);

Why can't that be done by the call setting up the register?

Or if someone needs to do it for a specific region they can read
the register before and then afterwards.

> 
> 	/*
> 	 * run code to measure
> 	 */
> 
> 	/* stop monitoring */
> 	pfm_stop(ctx_fd);
> 
> 	/* read data register */
> 	pfm_read_pmds(ctx_fd, pd, 1);

On x86 i think it would be much simpler to just let the set/alloc
register call return a number and then use RDPMC directly. That would
be actually faster and be much simpler too.

I suppose most architectures have similar facilities, if not a call could be 
added for them but it's not really essential. The call might be also needed
for event multiplexing, but frankly I would just leave that out for now.

e.g. here is one use case I would personally see as useful. We need
a replacement for simple cycle counting since RDTSC doesn't do that anymore
on modern x86 CPUs.  It could be something like:

	/* 0 is the initial value */

	/* could be either library or syscall */
	event = get_event(COUNTER_CYCLES); 
	if (event < 0) 
		/* CPU has no cycle counter */

	reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */

	rdpmc(reg, start);
	.... some code to run ...
	rdpmc(reg, end);

	free_perfctr(reg);	/* syscall */

On other architectures rdpmc would be different of course, but 
the rest could be probably similar.

-Andi


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 20:07                 ` Andrew Morton
  2007-11-13 20:14                   ` Greg KH
  2007-11-13 20:36                   ` Andi Kleen
@ 2007-11-14  7:24                   ` Paul Mackerras
  2007-11-14  7:40                     ` Andrew Morton
  2007-11-14 10:38                     ` Christoph Hellwig
  2 siblings, 2 replies; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg KH, Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

Andrew Morton writes:

> I was hoping that after the round of release-and-review which Stephane,
> Andi and I did about twelve months ago that we were on track to merge the
> perfmon codebase as-offered.  But now it turns out that the sentiment is
> that the code simply has too many bells-and-whistles to be acceptable.

Whose sentiment?

I've had a bit of a look at it today together with David Gibson.  Our
impression is that the latest version is a lot cleaner and simpler
than it used to be.  I'm also reading Stephane's technical report
which describes the interface, and whilst I'm only part-way through
it, I haven't seen anything yet which strikes me as unnecessary or
overly complicated.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14  7:24                   ` [perfmon] Re: [perfmon2] " Paul Mackerras
@ 2007-11-14  7:40                     ` Andrew Morton
  2007-11-14 10:38                     ` Christoph Hellwig
  1 sibling, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2007-11-14  7:40 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Greg KH, Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, Andi Kleen, perfmon2-devel, OSPAT devel,
	papi list

On Wed, 14 Nov 2007 18:24:36 +1100 Paul Mackerras <paulus@samba.org> wrote:

> Andrew Morton writes:
> 
> > I was hoping that after the round of release-and-review which Stephane,
> > Andi and I did about twelve months ago that we were on track to merge the
> > perfmon codebase as-offered.  But now it turns out that the sentiment is
> > that the code simply has too many bells-and-whistles to be acceptable.
> 
> Whose sentiment?

Andi and hch, maybe others I've forgotten about.

> I've had a bit of a look at it today together with David Gibson.  Our
> impression is that the latest version is a lot cleaner and simpler
> than it used to be.  I'm also reading Stephane's technical report
> which describes the interface, and whilst I'm only part-way through
> it, I haven't seen anything yet which strikes me as unnecessary or
> overly complicated.

Yes, that's quite possible.  I don't know how up-to-date people's
knowledge is.  I know I haven't looked seriously at the code in around
twelve months.

Let's get it on the wires as outlined and take a look at it all.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 21:30                                     ` Paul Mackerras
@ 2007-11-14 10:17                                       ` Nick Piggin
  2007-11-14 22:56                                         ` Chuck Ebbert
  0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2007-11-14 10:17 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

On Thursday 15 November 2007 08:30, Paul Mackerras wrote:
> Nick Piggin writes:
> > What I really mean is a readv-like syscall, but one that also
> > vectorises the file offset. Maybe this is useful enough as a generic
> > syscall that also helps Paul's example...
>
> I've sometimes thought it would be useful to have a "transaction"
> system call that is like a write + read combined into one:
>
> 	int transaction(int fd, char *req, size_t req_nb,
> 			char *reply, size_t reply_nb);
>
> as a way to provide a general request/reply interface for special
> files.

Maybe not a bad idea, though I'm not the one to ask about taste ;)
In this case, it is enough for your requests to be a set of scalars
(eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually
some structured data (that is not visible at the syscall layer).
So I don't see a big problem to have a similarly arbitrarily
structured request.


> > Of course, I guess this all depends on whether the atomicity is an
> > important requirement. If not, you can obviously just do it with
> > multiple read syscalls...
>
> That would take N system calls instead of one, which could have a
> performance impact if you need to read the counters frequently (which
> I believe you do in some performance monitoring situations).

That's true too.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14  7:24                   ` [perfmon] Re: [perfmon2] " Paul Mackerras
  2007-11-14  7:40                     ` Andrew Morton
@ 2007-11-14 10:38                     ` Christoph Hellwig
  2007-11-14 10:43                       ` Paul Mackerras
  1 sibling, 1 reply; 116+ messages in thread
From: Christoph Hellwig @ 2007-11-14 10:38 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, Greg KH, Philip Mucci, eranian, William Cohen,
	Robert Richter, linux-kernel, Perfmon, Andi Kleen,
	perfmon2-devel, OSPAT devel, papi list

On Wed, Nov 14, 2007 at 06:24:36PM +1100, Paul Mackerras wrote:
> Whose sentiment?

Mine for example.  The whole userspace interface is just on crack,
and the code is full of complexities aswell.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 10:38                     ` Christoph Hellwig
@ 2007-11-14 10:43                       ` Paul Mackerras
  2007-11-14 11:00                         ` Christoph Hellwig
  0 siblings, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 10:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Greg KH, Philip Mucci, eranian, William Cohen,
	Robert Richter, linux-kernel, Perfmon, Andi Kleen,
	perfmon2-devel, OSPAT devel, papi list

Christoph Hellwig writes:

> Mine for example.  The whole userspace interface is just on crack,
> and the code is full of complexities aswell.

Could you give some _technical_ details of what you don't like?

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 10:43                       ` Paul Mackerras
@ 2007-11-14 11:00                         ` Christoph Hellwig
  2007-11-14 11:12                           ` David Miller
                                             ` (2 more replies)
  0 siblings, 3 replies; 116+ messages in thread
From: Christoph Hellwig @ 2007-11-14 11:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Hellwig, Andrew Morton, Greg KH, Philip Mucci, eranian,
	William Cohen, Robert Richter, linux-kernel, Perfmon, Andi Kleen,
	perfmon2-devel, OSPAT devel, papi list

On Wed, Nov 14, 2007 at 09:43:02PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
> 
> > Mine for example.  The whole userspace interface is just on crack,
> > and the code is full of complexities aswell.
> 
> Could you give some _technical_ details of what you don't like?

I've done this a gazillion times before, so maybe instead of beeing a lazy
bastard you could look up mailinglist archive.  It's not like this is the
first discussion of perfmon.  But to get start look at the systems calls,
many of them are beasts like:

  int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

This is basically a read(2) (or for other syscalls a write) on something
else than the file descriptor provided to the system call.   The right thing
to do is obviously have a pmds and pmcs file in procfs for the thread beeing
monitored instead of these special-case files, with another set for global
tracing.  Similarly I'm pretty sure we can get a much better interface
if we introduce marching files in procfs for the other calls.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 22:56                                         ` Chuck Ebbert
@ 2007-11-14 11:03                                           ` Nick Piggin
  0 siblings, 0 replies; 116+ messages in thread
From: Nick Piggin @ 2007-11-14 11:03 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Paul Mackerras, David Miller, hch, akpm, gregkh, mucci, eranian,
	wcohen, robert.richter, linux-kernel, andi

On Thursday 15 November 2007 09:56, Chuck Ebbert wrote:
> On 11/14/2007 05:17 AM, Nick Piggin wrote:
> > But in general, for special files, I guess the response is usually
> > some structured data (that is not visible at the syscall layer).
> > So I don't see a big problem to have a similarly arbitrarily
> > structured request.
>
> IOW, an ioctl.

In the same way a read of structured data from a special file
"is an" ioctl, yeah. You could implement either with an ioctl.

The main difference is they have more explicitly typed interfaces
Whether that's enough argument (and if Paul's proposal is widely
usable enough) is another question. Which I won't try to answer.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:00                         ` Christoph Hellwig
@ 2007-11-14 11:12                           ` David Miller
  2007-11-14 11:14                             ` David Miller
  2007-11-14 11:44                             ` Paul Mackerras
  2007-11-14 11:39                           ` Paul Mackerras
  2007-11-14 12:38                           ` Andi Kleen
  2 siblings, 2 replies; 116+ messages in thread
From: David Miller @ 2007-11-14 11:12 UTC (permalink / raw)
  To: hch
  Cc: paulus, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, perfmon, andi, perfmon2-devel, ospat-devel,
	ptools-perfapi

From: Christoph Hellwig <hch@infradead.org>
Date: Wed, 14 Nov 2007 11:00:09 +0000

> I've done this a gazillion times before, so maybe instead of beeing a lazy
> bastard you could look up mailinglist archive.  It's not like this is the
> first discussion of perfmon.  But to get start look at the systems calls,
> many of them are beasts like:
> 
>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> 
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call.   The right thing
> to do is obviously have a pmds and pmcs file in procfs for the thread beeing
> monitored instead of these special-case files, with another set for global
> tracing.  Similarly I'm pretty sure we can get a much better interface
> if we introduce marching files in procfs for the other calls.

This is my impression too, all of the things being done with
a slew of system calls would be better served by real special
files and appropriate fops.  Whether the thing is some kind
of misc device or procfs is less important than simply getting
away from these system calls.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:12                           ` David Miller
@ 2007-11-14 11:14                             ` David Miller
  2007-11-14 11:44                             ` Paul Mackerras
  1 sibling, 0 replies; 116+ messages in thread
From: David Miller @ 2007-11-14 11:14 UTC (permalink / raw)
  To: hch
  Cc: paulus, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, perfmon, andi, perfmon2-devel, ospat-devel,
	ptools-perfapi


Ok, I just got 4 freakin' bounces from all of these subscriber only
perfmon etc. mailing lists.

Please remove those lists from the CC: as it's pointless for those of
us not on the lists to participate if those lists can't even see the
feedback we are giving.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:00                         ` Christoph Hellwig
  2007-11-14 11:12                           ` David Miller
@ 2007-11-14 11:39                           ` Paul Mackerras
  2007-11-14 11:52                             ` David Miller
  2007-11-14 13:47                             ` Stephane Eranian
  2007-11-14 12:38                           ` Andi Kleen
  2 siblings, 2 replies; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Greg KH, Philip Mucci, eranian, William Cohen,
	Robert Richter, linux-kernel, Andi Kleen

Christoph Hellwig writes:

>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> 
> This is basically a read(2) (or for other syscalls a write) on something
> else than the file descriptor provided to the system call.

No it's not basically a read().  It's more like a request/reply
interface, which a read()/write() interface doesn't handle very well.
The request in this case is "tell me about this particular collection
of PMDs" and the reply is the values.

It seems to me that an important part of this is to be able to collect
values from several PMDs at a single point in time, or at least an
approximation to a single point in time.  So that means that you don't
want a file per PMD either.

Basically we don't have a good abstraction for a request/reply (or
command/response) type of interface, and this is a case where we need
one.  Having a syscall that takes a struct containing the request and
reply is as good a way as any, particularly for something that needs
to be quick.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:12                           ` David Miller
  2007-11-14 11:14                             ` David Miller
@ 2007-11-14 11:44                             ` Paul Mackerras
  2007-11-13 23:49                               ` Nick Piggin
                                                 ` (2 more replies)
  1 sibling, 3 replies; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 11:44 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> This is my impression too, all of the things being done with
> a slew of system calls would be better served by real special
> files and appropriate fops.

Special files and fops really only work well if you can coerce the
interface into one where data flows predominantly one way.  I don't
think they work so well for something that is more like an RPC across
the user/kernel barrier.  For that a system call is better.

For instance, if you have something that kind-of looks like

	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);

where the caller supplies an array of PMD numbers and the function
returns their values (and you want that reading to be done atomically
in some sense), how would you do that using special files and fops?

>  Whether the thing is some kind
> of misc device or procfs is less important than simply getting
> away from these system calls.

Why?  What's inherently offensive about system calls?

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:44                             ` Paul Mackerras
  2007-11-13 23:49                               ` Nick Piggin
@ 2007-11-14 11:52                               ` David Miller
  2007-11-14 12:03                                 ` Paul Mackerras
  2007-11-14 13:51                               ` Stephane Eranian
  2 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-14 11:52 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Wed, 14 Nov 2007 22:44:56 +1100

> For instance, if you have something that kind-of looks like
> 
> 	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> 
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?

The same way we handle some of the multicast "getsockopt()"
calls.  The parameters passed in are both inputs and outputs.

For the above example:

	struct pmd_info {
		int *pmd_numbers;
		u64 *pmd_values;
		int n;
	} *p;

	buffer_size = N;
	p = malloc(buffer_size);
	p->pmd_numbers = p + foo;
	p->pmd_values = p + bar;
	p->n = whatever(N);
	err = read(fd, p, N);

It's definitely doable, use your imagination.

You can encode all kinds of operation types into the
header as well.

Another alternative is to use generic netlink.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:39                           ` Paul Mackerras
@ 2007-11-14 11:52                             ` David Miller
  2007-11-14 13:47                             ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: David Miller @ 2007-11-14 11:52 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Wed, 14 Nov 2007 22:39:24 +1100

> No it's not basically a read().  It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.

Yes it can, see my other reply.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-13 23:49                               ` Nick Piggin
@ 2007-11-14 11:58                                 ` David Miller
  2007-11-14  0:25                                   ` Nick Piggin
  0 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-14 11:58 UTC (permalink / raw)
  To: nickpiggin
  Cc: paulus, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Wed, 14 Nov 2007 10:49:48 +1100

> On Wednesday 14 November 2007 22:44, Paul Mackerras wrote:
> > David Miller writes:
> > > This is my impression too, all of the things being done with
> > > a slew of system calls would be better served by real special
> > > files and appropriate fops.
> >
> > Special files and fops really only work well if you can coerce the
> > interface into one where data flows predominantly one way.  I don't
> > think they work so well for something that is more like an RPC across
> > the user/kernel barrier.  For that a system call is better.
> >
> > For instance, if you have something that kind-of looks like
> >
> > 	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> >
> > where the caller supplies an array of PMD numbers and the function
> > returns their values (and you want that reading to be done atomically
> > in some sense), how would you do that using special files and fops?
> 
> Could you implement it with readv()?

Sure, why not?  Just cook up an iovec.  pmd_numbers goes to offset
X and pmd_values goes to offset Y, with some helpers like what
we have in the networking already for recvmsg.

But why would you want readv() for this?  The syscall thing
Paul asked me to translate into a read() doesn't provide
iovec-like behavior so I don't see why readv() is necessary
at all.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:52                               ` David Miller
@ 2007-11-14 12:03                                 ` Paul Mackerras
  2007-11-14 12:07                                   ` David Miller
  0 siblings, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 12:03 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> The same way we handle some of the multicast "getsockopt()"
> calls.  The parameters passed in are both inputs and outputs.

For a read??!!!

> For the above example:
> 
> 	struct pmd_info {
> 		int *pmd_numbers;
> 		u64 *pmd_values;
> 		int n;
> 	} *p;
> 
> 	buffer_size = N;
> 	p = malloc(buffer_size);
> 	p->pmd_numbers = p + foo;
> 	p->pmd_values = p + bar;
> 	p->n = whatever(N);
> 	err = read(fd, p, N);

You're suggesting that the behaviour of a read() should depend on what
was in the buffer before the read?  Gack!  Surely you have better
taste than that?

Or are you saying that a read (or write) has a side-effect of altering
some other area of memory besides the buffer you give to read()?  That
seems even worse to me.

> Another alternative is to use generic netlink.

Then you end up with two system calls to get the data rather than one
(one to send the request and another to read the reply).  For
something that needs to be quick that is a suboptimal interface.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:03                                 ` Paul Mackerras
@ 2007-11-14 12:07                                   ` David Miller
  2007-11-14  0:28                                     ` Nick Piggin
  2007-11-14 21:50                                     ` Paul Mackerras
  0 siblings, 2 replies; 116+ messages in thread
From: David Miller @ 2007-11-14 12:07 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Wed, 14 Nov 2007 23:03:24 +1100

> You're suggesting that the behaviour of a read() should depend on what
> was in the buffer before the read?  Gack!  Surely you have better
> taste than that?

Absolutely that's what I mean, it's atomic and gives you exactly what
you need.

I see nothing wrong or gross with these semantics.  Nothing in the
"book of UNIX" specifies that for a device or special file the passed
in buffer cannot contain input control data.

> > Another alternative is to use generic netlink.
> 
> Then you end up with two system calls to get the data rather than one
> (one to send the request and another to read the reply).  For
> something that needs to be quick that is a suboptimal interface.

Not necessarily, consider the possibility of using recvmsg() control
message data.  With that it could be done in one go.

This also suggests that it could be implemented as it's own protocol
family.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:00                         ` Christoph Hellwig
  2007-11-14 11:12                           ` David Miller
  2007-11-14 11:39                           ` Paul Mackerras
@ 2007-11-14 12:38                           ` Andi Kleen
  2007-11-14 14:13                             ` Stephane Eranian
                                               ` (2 more replies)
  2 siblings, 3 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-14 12:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Paul Mackerras, Andrew Morton, Greg KH, Philip Mucci, eranian,
	William Cohen, Robert Richter, linux-kernel, Perfmon, Andi Kleen,
	perfmon2-devel, OSPAT devel, papi list

Christoph Hellwig <hch@infradead.org> writes:
>
> I've done this a gazillion times before, so maybe instead of beeing a lazy
> bastard you could look up mailinglist archive.  It's not like this is the
> first discussion of perfmon.  But to get start look at the systems calls,
> many of them are beasts like:
>
>   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
>
> This is basically a read(2) (or for other syscalls a write) on something

At least for x86 and I suspect some 1other architectures we don't
initially need a syscall at all for this. There is an instruction
RDPMC who can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used (replacement of RDTSC for cycle
counting) 

Later a syscall might be needed with event multiplexing, but that seems
more like a far away non essential feature.

> else than the file descriptor provided to the system call.   The right thing

I don't like read/write for this too much. I think it's better to
have individual syscalls.  After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14  2:07                             ` Andi Kleen
@ 2007-11-14 13:09                               ` Stephane Eranian
  2007-11-14 14:24                                 ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-14 13:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, Robert Richter, gregkh, linux-kernel, William Cohen

Andi,

On Wed, Nov 14, 2007 at 03:07:02AM +0100, Andi Kleen wrote:
> 
> [dropped all these bouncing email lists. Adding closed lists to public
> cc lists is just a bad idea]
> 

Just want to make sure perfmon2 users participate in this discussion.

> > int
> > main(int argc, char **argv)
> > {
> > 	int ctx_fd;
> > 	pfarg_pmd_t pd[1];
> > 	pfarg_pmc_t pc[1];
> > 	pfarg_ctx_t ctx;
> > 	pfarg_load_t load_args;
> > 
> > 	memset(&ctx, 0, sizeof(ctx));
> > 	memset(pc, 0, sizeof(pc));
> > 	memset(pd, 0, sizeof(pd));
> > 
> > 	/* create session (context) and get file descriptor back (identifier) */
> > 	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);
> 
> There's nothing in your example that makes the file descriptor needed.
> 

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and full sampling buffer. You extract notification messages via read() and you can
use SIGIO, select/poll.

The example shows how you can leverage existing mechanisms to destroy the session, i.e.,
free the associated kernel resources. For that, you use close() instead of adding yet
another syscall. It also provides a resource limitation mechanisms to control consumption
of kernel memory, i.e., you can only create as many sessions as you can have open files.

> > 
> > 	/* setup one config register (PMC0) */
> > 	pc[0].reg_num   = 0
> > 	pc[0].reg_value = 0x1234;
> 
> That would be nicer if it was just two arguments.
> 
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to setup: one
syscall per register. The perfmon syscalls to read/write registers accept vector
of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).

With many tools, registers are not just setup once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state and that is on the critical
path.

You do not want a call that programs the entire PMU state all at once either. Many times,
you only want to modify a small subset. Having the full state does also cause some portability
problems.


> > 
> > 	/* setup one data register (PMD0) */
> > 	pd[0].reg_num = 0;
> > 	pd[0].reg_value = 0;
> 
> Why do you need to set the data register? Wouldn't it make
> more sense to let the kernel handle that and just return one.
> 
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is the
case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
well, and some may need to be initialized to non zero value, i.e., the IBS sampling
period.

With event-based sampling,  the period is expressed as the number of occurrences
of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
The way you express this with perfmon2 is that you program a counter to measure
L2 cache misses, and then you initialize the corresponding data register (counter)
to overflow after 2000 occurrences. Given that the interface guarantees all counters
are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
Thus you see that you need a call to actual program the data registers.

> > 
> > 	/* program the registers */
> > 	pfm_write_pmcs(ctx_fd, pc, 1);
> > 	pfm_write_pmds(ctx_fd, pd, 1);
> > 
> > 	/* attach the context to self */
> > 	load_args.load_pid = getpid();
> > 	pfm_load_context(ctx_fd, &load_args);
> 
> My replacement would be to just add a flags argument to write_pmcs 
> with one flag bit meaning "GLOBAL CONTEXT" versus "MY CONTEXT"
> > 

You are mixing PMU programming with the type of measurement you want to do.

Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
before you attach to either a CPU or a thread. This way, you can prepare your measurement
and then attach-and-go. Thus is is possible to create batches of ready-to-go sessions.
That is useful, for instance, when you are trying to measure across fork, pthread_create
which you can catch on-the-fly.

Take the per-thread example, you can setup your session before you fork/exec the program
you want to measure.

Note also that perfmon2 supports attaching to an already running thread. So there is
more than "GLOBAL CONTEXT" versus "MY CONTEXT".


> > 	/* activate monitoring */
> > 	pfm_start(ctx_fd, NULL);
> 
> Why can't that be done by the call setting up the register?
> 

Good question. If you do what say, you assume that the start/stop bit lives in the
config (or data) registers of the PMU. This is not true on all hardware. On Itanium
for instance, the start/stop bit is part of the Processor Status Register (psr).
That is not a PMU register.

On X86, you set the enable bit the PERFEVTSEL, but nothing really happens until you issue
pfm_start(), i.e., the PERFEVTSEL registers are not touched until then.

> Or if someone needs to do it for a specific region they can read
> the register before and then afterwards.
> 
> > 
> > 	/*
> > 	 * run code to measure
> > 	 */
> > 
> > 	/* stop monitoring */
> > 	pfm_stop(ctx_fd);
> > 
> > 	/* read data register */
> > 	pfm_read_pmds(ctx_fd, pd, 1);
> 
> On x86 i think it would be much simpler to just let the set/alloc
> register call return a number and then use RDPMC directly. That would
> be actually faster and be much simpler too.
> 
One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
a self-monitoring thread from reading the counters directly. You'll just get the
lower 32-bit of it. So if you read frequently enough, you should not have a problem.

But keep in mind that we do want a uniform interface across all hardware and all type
of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
on. You want an interface that guarantees that with pfm_read_pmds() you'll be able to
read on any hardware platforms, then on some you may be able to use a more efficient
method, e.g., rdpmc on X86.

Reducing performance monitoring to self-monitoring is not what we want. In fact, there
are only a few domains where you can actually do this and HPC is one of them. But in 
many other situations, you cannot and don't want to have to instrument applications
or libraries to collect performance data. It is quite handy to be able to do:
	$ pfmon /bin/ls
or
	$ pfmon --attach-task=`pidof sshd` -timeout=10s


Also note that there is no guarantee that RDPMC allows you to access all data registers
on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
RDPMC.


> I suppose most architectures have similar facilities, if not a call could be 
> added for them but it's not really essential. The call might be also needed
> for event multiplexing, but frankly I would just leave that out for now.
> 
Itanium does allow user level read of data registers. It also allows start/stop.
Perfmon2 allows this only for self-monitoring per-thread sessions.

I think restricting per-thread mode to only self-monitoring is just too limiting
even for a start.


> e.g. here is one use case I would personally see as useful. We need
> a replacement for simple cycle counting since RDTSC doesn't do that anymore
> on modern x86 CPUs.  It could be something like:
> 
You can do exactly this with the perfmon2 interface as it exists today.
Your example is perfectly fine, your interface works in your case.

But you are driving the design of the interface from your very specific need
and you are ignoring all the other usage models. This has been a problem with so
many other interfaces and that explains the current situation. You have to
take a broader view, look at what the hardware (across the board) provides and
build from there. We do not need yet another interface to support one tool or one
type of measurement, we need a true programming interface with a uniform set
of calls. So sure, several calls may look overkill for basic measurements, but
they become necessary with others.

> 	/* 0 is the initial value */
> 
> 	/* could be either library or syscall */
> 	event = get_event(COUNTER_CYCLES); 
> 	if (event < 0) 
> 		/* CPU has no cycle counter */
> 
> 	reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */
> 
> 	rdpmc(reg, start);
> 	.... some code to run ...
> 	rdpmc(reg, end);
> 
> 	free_perfctr(reg);	/* syscall */
> 
-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:39                           ` Paul Mackerras
  2007-11-14 11:52                             ` David Miller
@ 2007-11-14 13:47                             ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-14 13:47 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Hellwig, Andrew Morton, Greg KH, Philip Mucci,
	William Cohen, Robert Richter, linux-kernel, Andi Kleen

Hello,

On Wed, Nov 14, 2007 at 10:39:24PM +1100, Paul Mackerras wrote:
> Christoph Hellwig writes:
> 
> >   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> > 
> > This is basically a read(2) (or for other syscalls a write) on something
> > else than the file descriptor provided to the system call.
> 
> No it's not basically a read().  It's more like a request/reply
> interface, which a read()/write() interface doesn't handle very well.
> The request in this case is "tell me about this particular collection
> of PMDs" and the reply is the values.
> 

Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the structure passed read/write
PMD registers is the same. On write, we pass other information such as 
the reset values (sampling periods), randomization parameters and some
flags. They are not needed on read.

> It seems to me that an important part of this is to be able to collect
> values from several PMDs at a single point in time, or at least an
> approximation to a single point in time.  So that means that you don't
> want a file per PMD either.
> 

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file
descriptor per register looks overkill to me.

> Basically we don't have a good abstraction for a request/reply (or
> command/response) type of interface, and this is a case where we need
> one.  Having a syscall that takes a struct containing the request and
> reply is as good a way as any, particularly for something that needs
> to be quick.
> 

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 11:44                             ` Paul Mackerras
  2007-11-13 23:49                               ` Nick Piggin
  2007-11-14 11:52                               ` David Miller
@ 2007-11-14 13:51                               ` Stephane Eranian
  2 siblings, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-14 13:51 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, hch, akpm, gregkh, mucci, wcohen, robert.richter,
	linux-kernel, andi


On Wed, Nov 14, 2007 at 10:44:56PM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > This is my impression too, all of the things being done with
> > a slew of system calls would be better served by real special
> > files and appropriate fops.
> 
> Special files and fops really only work well if you can coerce the
> interface into one where data flows predominantly one way.  I don't
> think they work so well for something that is more like an RPC across
> the user/kernel barrier.  For that a system call is better.
> 
> For instance, if you have something that kind-of looks like
> 
> 	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);
> 
> where the caller supplies an array of PMD numbers and the function
> returns their values (and you want that reading to be done atomically
> in some sense), how would you do that using special files and fops?
> 
Yes, the read call could be simplified to the level proposed above by Paul.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:38                           ` Andi Kleen
@ 2007-11-14 14:13                             ` Stephane Eranian
  2007-11-14 14:26                               ` Andi Kleen
  2007-11-14 19:48                             ` David Miller
  2007-11-15  4:20                             ` dean gaudet
  2 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-14 14:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Paul Mackerras, Andrew Morton, Greg KH,
	Philip Mucci, William Cohen, Robert Richter, linux-kernel,
	Perfmon, perfmon2-devel, OSPAT devel, papi list

Andi,

On Wed, Nov 14, 2007 at 01:38:38PM +0100, Andi Kleen wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> >
> > I've done this a gazillion times before, so maybe instead of beeing a lazy
> > bastard you could look up mailinglist archive.  It's not like this is the
> > first discussion of perfmon.  But to get start look at the systems calls,
> > many of them are beasts like:
> >
> >   int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
> >
> > This is basically a read(2) (or for other syscalls a write) on something
> 
> At least for x86 and I suspect some 1other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC who can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting) 
> 

This only works when counting (not sampling) and only for self-monitoring.

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.
> 
On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
to one generic counter on on Core 2.

> > else than the file descriptor provided to the system call.   The right thing
> 
> I don't like read/write for this too much. I think it's better to
> have individual syscalls.  After all that is CPU state and having
> syscalls for that does seem reasonable.

As I said earlier, we do use read(), not for reading counters but to extract overflow
notification messages when we are sampling. It makes more sense for this usage because
this is where you want to leverage some key mechanisms such as:

	 - asynchronous notification via SIGIO. this is how you can implement self-sampling
	   for instance.

	 - select/poll to allow monitoring tools to wait for notification coming from
	   multiple sessions in one call. This is useful when monitoring across fork or
	   pthread_create.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 13:09                               ` Stephane Eranian
@ 2007-11-14 14:24                                 ` Andi Kleen
  2007-11-14 15:44                                   ` William Cohen
  2007-11-15  0:07                                   ` Stephane Eranian
  0 siblings, 2 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-14 14:24 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, akpm, Robert Richter, gregkh, linux-kernel, William Cohen

On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> 
> Partially true. The file descriptor becomes really useful when you sample.
> You leverage the file descriptor to receive notifications of counter overflows
> and full sampling buffer. You extract notification messages via read() and you can
> use SIGIO, select/poll.

Hmm, ok for the event notification we would need a nice interface. Still
have my doubts a file descriptor is the best way to do this though.

> Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

See my example below.
> 
> That would be quite expensive when you have lots of registers to setup: one
> syscall per register. The perfmon syscalls to read/write registers accept vector
> of arguments to amortize the cost of the syscall over multiple registers
> (similar to poll(2)).


First system calls are not that slow on Linux. Measure it.


> 
> With many tools, registers are not just setup once. During certain measurements,
> data registers may be read multiple times. When you sample or multiplex at

I think you optimize the wrong thing here.

There are basically two cases I see:

-  Global measurement of lots of things:
Things are slow anyways with large context switch overheads. The 
overheads are large anyways. Doing one or more system calls probably
does not matter much. Most important is a clean interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow to read those registers with
minimal latency in user space. Interface should support those.

Also for this case programming time does not matter too much. You
just program once and then do RDPMC before code to measure and then
afterwards and take the difference. The actual counter setup is out 
of the latency critical path.


> It depends on what you are doing. Here, this was not really necessary. It was
> meant to show how you can program the data registers as well. Perfmon2 provides
> default values for all data registers. For counters, the value is guaranteed to
> be zero.
> 
> But it is important to note that not all data registers are counters. That is the
> case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
> well, and some may need to be initialized to non zero value, i.e., the IBS sampling
> period.

Setting period should be a separate call. Mixing the two together into one
 does not look like a nice interface.

> 
> With event-based sampling,  the period is expressed as the number of occurrences
> of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
> The way you express this with perfmon2 is that you program a counter to measure
> L2 cache misses, and then you initialize the corresponding data register (counter)
> to overflow after 2000 occurrences. Given that the interface guarantees all counters
> are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
> Thus you see that you need a call to actual program the data registers.

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you
a register number.


> Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
> before you attach to either a CPU or a thread. This way, you can prepare your measurement
> and then attach-and-go. Thus is is possible to create batches of ready-to-go sessions.
> That is useful, for instance, when you are trying to measure across fork, pthread_create
> which you can catch on-the-fly.
> 
> Take the per-thread example, you can setup your session before you fork/exec the program
> you want to measure.

And?  You didn't say what the advantage of that is? 

All the approaches add context switch latencies. It is not clear that the separate
session setup helps it all that much.

> 
> Note also that perfmon2 supports attaching to an already running thread. So there is
> more than "GLOBAL CONTEXT" versus "MY CONTEXT".

What is the use case of this? Do users use that? 

> 
> 
> > > 	/* activate monitoring */
> > > 	pfm_start(ctx_fd, NULL);
> > 
> > Why can't that be done by the call setting up the register?
> > 
> 
> Good question. If you do what say, you assume that the start/stop bit lives in the
> config (or data) registers of the PMU. This is not true on all hardware. On Itanium
> for instance, the start/stop bit is part of the Processor Status Register (psr).
> That is not a PMU register.


Well the system call layer can manage that transparently with a little software state
(counter). No need to expose it.

> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
> a self-monitoring thread from reading the counters directly. You'll just get the
> lower 32-bit of it. So if you read frequently enough, you should not have a problem.

Hmm? RDPMC is 64bit.
> 
> But keep in mind that we do want a uniform interface across all hardware and all type
> of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
> an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so

I disagree. Using RDPMC is essential for at least some of the things I would like
to do with perfmon2. If the interface does not provide it it is useless to me at least.
System calls are far too slow for cycle measurements. 

And when RDPMC is already supported it should be as widely used as possible.

Regarding the portable code problem: of course you would have some header in user space
that hides the details in a hopefully portable macro.

> on. You want an interface that guarantees that with pfm_read_pmds() you'll be able to
> read on any hardware platforms, then on some you may be able to use a more efficient
> method, e.g., rdpmc on X86.
> 
> Reducing performance monitoring to self-monitoring is not what we want. In fact, there
> are only a few domains where you can actually do this and HPC is one of them. But in 
> many other situations, you cannot and don't want to have to instrument applications
> or libraries to collect performance data. It is quite handy to be able to do:
> 	$ pfmon /bin/ls
> or
> 	$ pfmon --attach-task=`pidof sshd` -timeout=10s

I think only supporting global and self monitoring as first step is totally fine.
All the bells'n'whistles can be added later if users really want them.

> 
> 
> Also note that there is no guarantee that RDPMC allows you to access all data registers
> on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
> RDPMC.

Sure at some point a system call for the more complex cases (also like multiplexing) would
be needed. But I don't think we need it as first step. The goal would be to define a 
simple subset that is actually mergeable.

> But you are driving the design of the interface from your very specific need
> and you are ignoring all the other usage models. This has been a problem with so

I asked your noisy user base to specify more concrete use cases, but so far
they have not provided anything except rather vacuous complaints. Short of that I'll stick 
with what I know currently.

> many other interfaces and that explains the current situation. You have to
> take a broader view, look at what the hardware (across the board) provides and
> build from there. We do not need yet another interface to support one tool or one


Well your "broad view" resulted in a incredible mess of interface moloch to be honest.
I really think we need a fresh start examining many of the underlying assumptions.

Regarding itanium: I suppose it could provide a RDPMC replacement using your 
fast priviledged vsyscalls.

-Andi


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 14:13                             ` Stephane Eranian
@ 2007-11-14 14:26                               ` Andi Kleen
  2007-11-15  0:23                                 ` Paul Mackerras
  0 siblings, 1 reply; 116+ messages in thread
From: Andi Kleen @ 2007-11-14 14:26 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, Christoph Hellwig, Paul Mackerras, Andrew Morton,
	Greg KH, Philip Mucci, William Cohen, Robert Richter,
	linux-kernel, Perfmon, perfmon2-devel, OSPAT devel, papi list

On Wed, Nov 14, 2007 at 06:13:42AM -0800, Stephane Eranian wrote:
> > At least for x86 and I suspect some 1other architectures we don't
> > initially need a syscall at all for this. There is an instruction
> > RDPMC who can read a performance counter just fine. It is also much
> > faster and generally preferable for the case where a process measures
> > events about itself. In fact it is essential for one of the use cases
> > I would like to see perfmon used (replacement of RDTSC for cycle
> > counting) 
> > 
> 
> This only works when counting (not sampling) and only for self-monitoring.

It works for global monitoring too.

> 
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> > 
> On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
> multiplexing offers some advantages. If NMI watchdog is enabled, then you drop
> to one generic counter on on Core 2.

NMI watchdog is off by default now.

Yes longer term we might need multiplexing, but definitely not as first step.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 14:24                                 ` Andi Kleen
@ 2007-11-14 15:44                                   ` William Cohen
  2007-11-14 16:13                                     ` Stephane Eranian
  2007-11-14 18:53                                     ` Philippe Elie
  2007-11-15  0:07                                   ` Stephane Eranian
  1 sibling, 2 replies; 116+ messages in thread
From: William Cohen @ 2007-11-14 15:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stephane Eranian, akpm, Robert Richter, gregkh, linux-kernel

Andi Kleen wrote:

>> One approach does not prevent the other. Assuming you allow cr4.pce, then nothing prevents
>> a self-monitoring thread from reading the counters directly. You'll just get the
>> lower 32-bit of it. So if you read frequently enough, you should not have a problem.
> 
> Hmm? RDPMC is 64bit.

There are a number of processors that have 32-bit counters such as the IBM power 
processors. On many x86 processors the upper bits of the counter are sign 
extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are 
available. Roll over of values is quite possible (<2 seconds of cycle count), so 
additional work needs to be done to obtain a valid value.

>> But keep in mind that we do want a uniform interface across all hardware and all type
>> of sessions (self-monitoring, CPU-wide, monitoring of another thread). You don't want
>> an interface that says on x86 you have to use rdpmc, on Itanium pfm_read_pmds() and so
> 
> I disagree. Using RDPMC is essential for at least some of the things I would like
> to do with perfmon2. If the interface does not provide it it is useless to me at least.
> System calls are far too slow for cycle measurements. 

What range of cycles are you interested in measuring? 100's of cycles? A couple 
thousand? Are you just looking at cycle counts or other events?

-Will

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 15:44                                   ` William Cohen
@ 2007-11-14 16:13                                     ` Stephane Eranian
  2007-11-14 18:53                                     ` Philippe Elie
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-14 16:13 UTC (permalink / raw)
  To: William Cohen; +Cc: Andi Kleen, akpm, Robert Richter, gregkh, linux-kernel


On Wed, Nov 14, 2007 at 10:44:20AM -0500, William Cohen wrote:
> Andi Kleen wrote:
> 
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then 
> >>nothing prevents
> >>a self-monitoring thread from reading the counters directly. You'll just 
> >>get the
> >>lower 32-bit of it. So if you read frequently enough, you should not have 
> >>a problem.
> >
> >Hmm? RDPMC is 64bit.
> 
> There are a number of processors that have 32-bit counters such as the IBM 
> power processors. On many x86 processors the upper bits of the counter are 
> sign extended from the lower 32 bits. Thus, one can only assume the lower 
> 32-bit are available. Roll over of values is quite possible (<2 seconds of 
> cycle count), so additional work needs to be done to obtain a valid value.
> 

Exactly, on Intel's only the bottom 32-bit actually are useable, the rest is
sign-extension. That's why it is okay for measuring small sections of code,
but that's it. On AMD, I think it is better. On Itanium you get the 47-bit worth.
Don't know about Power or Cell.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 15:44                                   ` William Cohen
  2007-11-14 16:13                                     ` Stephane Eranian
@ 2007-11-14 18:53                                     ` Philippe Elie
  2007-11-14 19:15                                       ` Andi Kleen
  1 sibling, 1 reply; 116+ messages in thread
From: Philippe Elie @ 2007-11-14 18:53 UTC (permalink / raw)
  To: William Cohen
  Cc: Andi Kleen, Stephane Eranian, akpm, Robert Richter, gregkh, linux-kernel

On Wed, 14 Nov 2007 at 10:44 +0000, Will Cohen wrote:

> Andi Kleen wrote:
> 
> >>One approach does not prevent the other. Assuming you allow cr4.pce, then 
> >>nothing prevents
> >>a self-monitoring thread from reading the counters directly. You'll just 
> >>get the
> >>lower 32-bit of it. So if you read frequently enough, you should not have 
> >>a problem.
> >
> >Hmm? RDPMC is 64bit.
> 
> There are a number of processors that have 32-bit counters such as the IBM 
> power processors. On many x86 processors the upper bits of the counter are 
> sign extended from the lower 32 bits. Thus, one can only assume the lower 
> 32-bit are available. Roll over of values is quite possible (<2 seconds of 
> cycle count), so additional work needs to be done to obtain a valid value.

On x86 they are sign-extended only on write, on read they are 40 bits wide
for intel, 48 bits for AMD.

BTW, isn't rdpmc only enable for ring 0 on linux ? I remember a patch
to disable it, dunno if it has been applied.

-- 
Phe


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 18:53                                     ` Philippe Elie
@ 2007-11-14 19:15                                       ` Andi Kleen
  0 siblings, 0 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-14 19:15 UTC (permalink / raw)
  To: Philippe Elie
  Cc: William Cohen, Andi Kleen, Stephane Eranian, akpm,
	Robert Richter, gregkh, linux-kernel

> BTW, isn't rdpmc only enable for ring 0 on linux ? I remember a patch
> to disable it, dunno if it has been applied.

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:38                           ` Andi Kleen
  2007-11-14 14:13                             ` Stephane Eranian
@ 2007-11-14 19:48                             ` David Miller
  2007-11-15  4:20                             ` dean gaudet
  2 siblings, 0 replies; 116+ messages in thread
From: David Miller @ 2007-11-14 19:48 UTC (permalink / raw)
  To: andi
  Cc: hch, paulus, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, perfmon, perfmon2-devel,
	ospat-devel, ptools-perfapi

From: Andi Kleen <andi@firstfloor.org>
Date: Wed, 14 Nov 2007 13:38:38 +0100

> At least for x86 and I suspect some 1other architectures we don't
> initially need a syscall at all for this. There is an instruction
> RDPMC who can read a performance counter just fine. It is also much
> faster and generally preferable for the case where a process measures
> events about itself. In fact it is essential for one of the use cases
> I would like to see perfmon used (replacement of RDTSC for cycle
> counting) 

I wouldn't even want to use a syscall for something like
that on Sparc, I'd rather give this a dedicated software
trap so that I can code it completely in assembler.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14  0:25                                   ` Nick Piggin
@ 2007-11-14 21:30                                     ` Paul Mackerras
  2007-11-14 10:17                                       ` Nick Piggin
  0 siblings, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 21:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: David Miller, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

Nick Piggin writes:

> What I really mean is a readv-like syscall, but one that also
> vectorises the file offset. Maybe this is useful enough as a generic
> syscall that also helps Paul's example...

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

	int transaction(int fd, char *req, size_t req_nb,
			char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special
files.

> Of course, I guess this all depends on whether the atomicity is an
> important requirement. If not, you can obviously just do it with
> multiple read syscalls...

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:07                                   ` David Miller
  2007-11-14  0:28                                     ` Nick Piggin
@ 2007-11-14 21:50                                     ` Paul Mackerras
  2007-11-14 23:03                                       ` David Miller
  1 sibling, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 21:50 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> > You're suggesting that the behaviour of a read() should depend on what
> > was in the buffer before the read?  Gack!  Surely you have better
> > taste than that?
> 
> Absolutely that's what I mean, it's atomic and gives you exactly what
> you need.
> 
> I see nothing wrong or gross with these semantics.  Nothing in the
> "book of UNIX" specifies that for a device or special file the passed
> in buffer cannot contain input control data.

Ohhhhh.... kayyyyy.... *shudders*

It really violates the abstract model of "read" pretty badly.  "Read"
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to
Nick rather than overloading read() like this.

> > Then you end up with two system calls to get the data rather than one
> > (one to send the request and another to read the reply).  For
> > something that needs to be quick that is a suboptimal interface.
> 
> Not necessarily, consider the possibility of using recvmsg() control
> message data.  With that it could be done in one go.
> 
> This also suggests that it could be implemented as it's own protocol
> family.

There's all sorts of possible ways that it could be implemented.  On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful.  I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 10:17                                       ` Nick Piggin
@ 2007-11-14 22:56                                         ` Chuck Ebbert
  2007-11-14 11:03                                           ` Nick Piggin
  0 siblings, 1 reply; 116+ messages in thread
From: Chuck Ebbert @ 2007-11-14 22:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Mackerras, David Miller, hch, akpm, gregkh, mucci, eranian,
	wcohen, robert.richter, linux-kernel, andi

On 11/14/2007 05:17 AM, Nick Piggin wrote:
> 
> But in general, for special files, I guess the response is usually
> some structured data (that is not visible at the syscall layer).
> So I don't see a big problem to have a similarly arbitrarily
> structured request.
> 
> 

IOW, an ioctl.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 21:50                                     ` Paul Mackerras
@ 2007-11-14 23:03                                       ` David Miller
  2007-11-14 23:12                                         ` Paul Mackerras
  0 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-14 23:03 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Thu, 15 Nov 2007 08:50:22 +1100

> I'd prefer to have a transaction() system call like I suggested to
> Nick rather than overloading read() like this.

So much for getting rid of the extra system calls...

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 23:03                                       ` David Miller
@ 2007-11-14 23:12                                         ` Paul Mackerras
  2007-11-14 23:21                                           ` David Miller
  0 siblings, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-14 23:12 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> From: Paul Mackerras <paulus@samba.org>
> Date: Thu, 15 Nov 2007 08:50:22 +1100
> 
> > I'd prefer to have a transaction() system call like I suggested to
> > Nick rather than overloading read() like this.
> 
> So much for getting rid of the extra system calls...

*I* never had a problem with a few extra system calls.  I don't
 understand why you (apparently) do.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 23:12                                         ` Paul Mackerras
@ 2007-11-14 23:21                                           ` David Miller
  2007-11-15  1:11                                             ` Paul Mackerras
  0 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-14 23:21 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Thu, 15 Nov 2007 10:12:22 +1100

> *I* never had a problem with a few extra system calls.  I don't
>  understand why you (apparently) do.

We're stuck with them forever, they are hard to version and extend
cleanly.

Those are my main objections.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 14:24                                 ` Andi Kleen
  2007-11-14 15:44                                   ` William Cohen
@ 2007-11-15  0:07                                   ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-15  0:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, Robert Richter, gregkh, linux-kernel, William Cohen,
	perfmon2-devel

Andi,

On Wed, Nov 14, 2007 at 03:24:11PM +0100, Andi Kleen wrote:
> On Wed, Nov 14, 2007 at 05:09:09AM -0800, Stephane Eranian wrote:
> > 
> > Partially true. The file descriptor becomes really useful when you sample.
> > You leverage the file descriptor to receive notifications of counter overflows
> > and full sampling buffer. You extract notification messages via read() and you can
> > use SIGIO, select/poll.
> 
> Hmm, ok for the event notification we would need a nice interface. Still
> have my doubts a file descriptor is the best way to do this though.
> 

Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?

>From my experience read(), select(), and SIGIO are fine. I know many tools use that.

As for the file descriptor, you would need to replace that with another identifier of
some sort. As I pointed out in another message on this thread, you don't want to use
a pid-based identifier. This is not usable when you monitor other threads and you
want to read out the results after their death.


> > Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?
> 
> See my example below.
> > 
> > That would be quite expensive when you have lots of registers to setup: one
> > syscall per register. The perfmon syscalls to read/write registers accept vector
> > of arguments to amortize the cost of the syscall over multiple registers
> > (similar to poll(2)).
> 
> 
> First system calls are not that slow on Linux. Measure it.
> 
If people do not like vector arguments, then I think I can live with N system calls
to program N registers. Now you have two choices for passing the arguments:

	- a pointer to a struct
		struct pfarg_pmc {
			uint64_t reg_value;
			uint16_t reg_num;
		} pmc0;
		pmc0.reg_value = 0; pmc0.reg_value = 0x1234;
		pfm_write_pmcs(fd, &pmc0);

	- explicitly passing every field:
		pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event set and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?

Of course the same approach would work for the data registers at least for counting.

> > With many tools, registers are not just setup once. During certain measurements,
> > data registers may be read multiple times. When you sample or multiplex at
> 
> I think you optimize the wrong thing here.
> 
> There are basically two cases I see:
> 
> -  Global measurement of lots of things:

I am not sure I understand what you mean by 'lots of things'?
Are you still talking per-thread and self-monitoring?


> Things are slow anyways with large context switch overheads. The 
> overheads are large anyways. Doing one or more system calls probably
> does not matter much. Most important is a clean interface.
> 
> - Exact measurement of the current process. For that you need very
> low latencies. Any system call is too slow. That is why CPUs have
> instructions like RDPMC that allow to read those registers with
> minimal latency in user space. Interface should support those.
> 

I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
adding it back for self-monitoring sessions.


> Also for this case programming time does not matter too much. You
> just program once and then do RDPMC before code to measure and then
> afterwards and take the difference. The actual counter setup is out 
> of the latency critical path.
> 
Agreed.

> 
> > It depends on what you are doing. Here, this was not really necessary. It was
> > meant to show how you can program the data registers as well. Perfmon2 provides
> > default values for all data registers. For counters, the value is guaranteed to
> > be zero.
> > 
> > But it is important to note that not all data registers are counters. That is the
> > case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
> > well, and some may need to be initialized to non zero value, i.e., the IBS sampling
> > period.
> 
> Setting period should be a separate call. Mixing the two together into one
>  does not look like a nice interface.
> 
Periods are setup by data register. Given that there is already a call to program
the data register why add another one? You don't need to treat the sampling period
differently from the register value. This just a value that will cause the register
to overflow after an explicit number of occurrences.


> > With event-based sampling,  the period is expressed as the number of occurrences
> > of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
> > The way you express this with perfmon2 is that you program a counter to measure
> > L2 cache misses, and then you initialize the corresponding data register (counter)
> > to overflow after 2000 occurrences. Given that the interface guarantees all counters
> > are 64-bit regardless of the hardware, you simply have to program the counter to -2000.
> > Thus you see that you need a call to actual program the data registers.
> 
> I didn't object to providing the initial value -- my example had that.

Should you support a kernel level sampling buffer (like Oprofile) you'd also want
to specify the reset value on overflow. And you would not necessarily want it to
be identical to the initial value (period). So you'd to have a way to specify that
one as well.

> Just having a separate concept of data registers seems too complicated to me.

I am not against providing a flat namespace. But I think it is nice to separate config
from data. 

> You should just pass event types and values and the kernel gives you
> a register number.

Absolutely not, you don't want to the kernel to know about events. This has to
remain at the user level. The event -> register problem is best solved in a user
library (such as libpfm). You don't want to bloat the kernel with event tables.
Many PMU models have over 200 events. And there is worse, in many PMU models,
you have tons of constraints as to each counter can measure, it can become very complicated,
e.g., Itanium and Power and Pentium 4 are good examples. It is difficult to get right, vendors
are constantly correcting their spec so maintenance is a pain.

The kernel interface must just deal with PMU registers and not events.


> 
> 
> > Perfmon2 decouples the two operations. In fact, no PMU hardware is actually touched
> > before you attach to either a CPU or a thread. This way, you can prepare your measurement
> > and then attach-and-go. Thus is is possible to create batches of ready-to-go sessions.
> > That is useful, for instance, when you are trying to measure across fork, pthread_create
> > which you can catch on-the-fly.
> > 
> > Take the per-thread example, you can setup your session before you fork/exec the program
> > you want to measure.
> 
> And?  You didn't say what the advantage of that is? 
> 
You pass to the kernel all the register values (config, data), you setup the kernel sampling
buffer and the mapping of it. Then it is just ma tter of attaching + start. The value of this
is that it lets you create a pool of ready-to-go sessions and when you are monitoring across
fork/pthread_create, each time you receive a notification from ptrace, you simply have to
attach, start and go, i..e, you minimize the overhead on the application you are measuring.

> All the approaches add context switch latencies. It is not clear that the separate
> session setup helps it all that much.
> 
This is a different issue. Sure the more PMU register you use the more expensive the
context switch gets. Yet the current perfmon2 implementation tries to mitigate this by
using lazy restore scheme, similar to the one used for FP registers.,

> > 
> > Note also that perfmon2 supports attaching to an already running thread. So there is
> > more than "GLOBAL CONTEXT" versus "MY CONTEXT".
> 
> What is the use case of this? Do users use that? 
> 
I think this is even the first approach when you get code to measure. You want to try
and characterize the workload without having to instrument and recompile. Furthermore, their
are certain workloads which are very long to restart and that cannot be stopped and restarted
easily, yet you may want to attach for several seconds. You may also want to use this approach
to avoid monitoring the initialization phase of an application. Sometimes you may not even
all have the sources to be able to instrument (e.g. 3rd party libraries).


> > 
> > 
> > > > 	/* activate monitoring */
> > > > 	pfm_start(ctx_fd, NULL);
> > > 
> > > Why can't that be done by the call setting up the register?
> > > 
> > 
> > Good question. If you do what say, you assume that the start/stop bit lives in the
> > config (or data) registers of the PMU. This is not true on all hardware. On Itanium
> > for instance, the start/stop bit is part of the Processor Status Register (psr).
> > That is not a PMU register.
> 
> 
> Well the system call layer can manage that transparently with a little software state
> (counter). No need to expose it.
> 
Are you suggesting virtual PMU registers that map to other resources, e.g., Itanium's PSR?

> 
> I disagree. Using RDPMC is essential for at least some of the things I would like
> to do with perfmon2. If the interface does not provide it it is useless to me at least.
> System calls are far too slow for cycle measurements. 
> 
> And when RDPMC is already supported it should be as widely used as possible.
> 
I am perfectly fine with RDPMC for self-monitoring and simple counting. I need to check and
see if this could work for self-sampling. But I also want to provide an interface
that would work for: non self-monitoring, self-monitoring, architecture without RDPMC equivalent.
This is important for people who want to write portable tools. The syscall would
return the full 64-bit value of the counter without the sign-extension.

> > 
> > Reducing performance monitoring to self-monitoring is not what we want. In fact, there
> > are only a few domains where you can actually do this and HPC is one of them. But in 
> > many other situations, you cannot and don't want to have to instrument applications
> > or libraries to collect performance data. It is quite handy to be able to do:
> > 	$ pfmon /bin/ls
> > or
> > 	$ pfmon --attach-task=`pidof sshd` -timeout=10s
> 
> I think only supporting global and self monitoring as first step is totally fine.
I asssume by 'global' you mean system-wide, i.e., measuring all threads running on
a cpu.

> All the bells'n'whistles can be added later if users really want them.
> 
They do because it provides such a simplicity of use. On production systems, it is not
uncommon to not even have compilers installed yet you may want to diagnose performance
problems by simply running a performance tool for a while.

> > 
> > Also note that there is no guarantee that RDPMC allows you to access all data registers
> > on a PMU. For instance, on AMD Barcelona, it seems you cannot read the IBS register using
> > RDPMC.
> 
> Sure at some point a system call for the more complex cases (also like multiplexing) would
> be needed. But I don't think we need it as first step. The goal would be to define a 
> simple subset that is actually mergeable.
> 
> > But you are driving the design of the interface from your very specific need
> > and you are ignoring all the other usage models. This has been a problem with so
> 
> I asked your noisy user base to specify more concrete use cases, but so far
> they have not provided anything except rather vacuous complaints. Short of that I'll stick 
> with what I know currently.
> 
I think they will respond but Phil is busy at Supercomputing  right now. They'll be able
to provide lots of use cases based on their experience with the popular PAPI toolkit.

> > many other interfaces and that explains the current situation. You have to
> > take a broader view, look at what the hardware (across the board) provides and
> > build from there. We do not need yet another interface to support one tool or one
> 
> 
> Well your "broad view" resulted in a incredible mess of interface moloch to be honest.

That is your opinion. I am not trying to say perfmon2 is perfect and I don't want to make changes.
I have proved in the past and still today that I am willing to make changes. See my comments about
pfm_write_pmcs() above.

But what I also know now is that people have managed to port this interface on all major hardware
platforms from X86, Itanium, Cray, Power*, Cell and derivative such Sony Playstation 3. They were
able to do so while providing access to all the advanced features (PEBS, IBS, DEAR, IPEAR, opcode
matchers, range restriction) and not just counters. They have never had to make changes to the
user level API to make their hardware work.

I just trying to say that you need to consider the arguments of people who have been involved with
performance monitoring and development of monitoring tools for a long time and on different architectures.
What you want to do with it is perfectly fine but it only represents a tiny fraction of what you can do
with the hardware and of what many people already want todo today. I would not want to have one interface
to do self-monitoring very well, then another one to do sampling, and another one for multiplexing.

> I really think we need a fresh start examining many of the underlying assumptions.
> 
I am happy to go over every design choices with you and others.

> Regarding itanium: I suppose it could provide a RDPMC replacement using your 
> fast priviledged vsyscalls.
> 

We don't need that. Itanium allows reading of PMD registers directly from user space with
a single instruction once we clear the protection mechanism similar to cr4.pce. And this 
is already done for self-monitoring per-thread sessions today.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 14:26                               ` Andi Kleen
@ 2007-11-15  0:23                                 ` Paul Mackerras
  0 siblings, 0 replies; 116+ messages in thread
From: Paul Mackerras @ 2007-11-15  0:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Christoph Hellwig, Andrew Morton, Greg KH,
	Philip Mucci, William Cohen, Robert Richter, linux-kernel,
	Perfmon, perfmon2-devel, OSPAT devel, papi list

Andi Kleen writes:

> > This only works when counting (not sampling) and only for self-monitoring.
> 
> It works for global monitoring too.

How would you provide access to the counters of another process?
Through an extension to ptrace perhaps?

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 23:21                                           ` David Miller
@ 2007-11-15  1:11                                             ` Paul Mackerras
  2007-11-15  1:27                                               ` David Miller
  2007-11-15  8:29                                               ` [perfmon] " Stephane Eranian
  0 siblings, 2 replies; 116+ messages in thread
From: Paul Mackerras @ 2007-11-15  1:11 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> From: Paul Mackerras <paulus@samba.org>
> Date: Thu, 15 Nov 2007 10:12:22 +1100
> 
> > *I* never had a problem with a few extra system calls.  I don't
> >  understand why you (apparently) do.
> 
> We're stuck with them forever, they are hard to version and extend
> cleanly.
> 
> Those are my main objections.

The first is valid (for suitable values of "forever") but applies to
any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to
syscalls specifically more than to other interfaces.  It's just a
matter of designing it correctly in the first place.  For example, the
sys_swapcontext system call we have on powerpc takes an argument which
is the size of the ucontext_t that userland is using, which allows us
to extend it in future if necessary.  (Note that I'm not saying that
the current perfmon2 interfaces are well-designed in this respect.)

The third (hard to extend cleanly) is a good point, and is a valid
criticism of the current set of perfmon2 system calls, I think.
However, the goal of being able to extend the interface tends to be in
opposition to the goal of having strong typing of the interface.
Things like a multiplexed syscall or an ioctl are much easier to
extend but that is at the expense of losing strong typing.  Something
like my transaction() (or your weird kind of read() :) also provides
extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with,
not some I/O device, so treating the whole of perfmon2 (or any
performance monitoring infrastructure) as a driver doesn't fit very
well, and in fact system calls are appropriate.  Just like we don't
try to make access to debugging facilities fit into a driver, we
shouldn't make performance monitoring fit into a driver either.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  1:11                                             ` Paul Mackerras
@ 2007-11-15  1:27                                               ` David Miller
  2007-11-15  2:34                                                 ` Paul Mackerras
  2007-11-19 13:08                                                 ` David Miller
  2007-11-15  8:29                                               ` [perfmon] " Stephane Eranian
  1 sibling, 2 replies; 116+ messages in thread
From: David Miller @ 2007-11-15  1:27 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

From: Paul Mackerras <paulus@samba.org>
Date: Thu, 15 Nov 2007 12:11:10 +1100

> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.

I disagree.

With netlink we can just add new attributes when a new need arises for
a particular interface.  The attribute code describes the type
precisely, so there is no loss of strong typing at all.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  1:27                                               ` David Miller
@ 2007-11-15  2:34                                                 ` Paul Mackerras
  2007-11-15  7:48                                                   ` Herbert Xu
  2007-11-19 13:08                                                 ` David Miller
  1 sibling, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-15  2:34 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> From: Paul Mackerras <paulus@samba.org>
> Date: Thu, 15 Nov 2007 12:11:10 +1100
> 
> > The third (hard to extend cleanly) is a good point, and is a valid
> > criticism of the current set of perfmon2 system calls, I think.
> > However, the goal of being able to extend the interface tends to be in
> > opposition to the goal of having strong typing of the interface.
> > Things like a multiplexed syscall or an ioctl are much easier to
> > extend but that is at the expense of losing strong typing.
> 
> I disagree.
> 
> With netlink we can just add new attributes when a new need arises for
> a particular interface.  The attribute code describes the type
> precisely, so there is no loss of strong typing at all.

Well you must mean something different by "strong typing" from the
rest of us.  Strong typing means that the compiler can check that you
have passed in the correct types of arguments, but the compiler
doesn't have any visibility into what structures are valid in netlink
messages.

In any case, I think that adding a structure size argument to the
current perfmon2 system calls where appropriate would mean that we
could extend them cleanly later on if necessary.  It would mean that
we could add fields at the end, and that the kernel could know what
version of the structures that userspace was using.

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-14 12:38                           ` Andi Kleen
  2007-11-14 14:13                             ` Stephane Eranian
  2007-11-14 19:48                             ` David Miller
@ 2007-11-15  4:20                             ` dean gaudet
  2007-11-15  4:47                               ` Paul Mackerras
                                                 ` (2 more replies)
  2 siblings, 3 replies; 116+ messages in thread
From: dean gaudet @ 2007-11-15  4:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Hellwig, Paul Mackerras, Andrew Morton, Greg KH,
	Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel, Perfmon, perfmon2-devel, OSPAT devel, papi list

On Wed, 14 Nov 2007, Andi Kleen wrote:

> Later a syscall might be needed with event multiplexing, but that seems
> more like a far away non essential feature.

actually multiplexing is the main feature i am in need of. there are an 
insufficient number of counters (even on k8 with 4 counters) to do 
complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
hit rates, average miss latency, time spent in various stalls, and the 
memory system utilization (or HT bus utilization).  this runs out to 
something like 30 events which are interesting... and re-running a 
benchmark over and over just to get around the lack of multiplexing is a 
royal pain in the ass.

it's not a "far away non-essential feature" to me.  it's something i would 
use daily if i had all the pieces together now (and i'm constrained 
because i cannot add an out-of-tree patch which adds unofficial syscalls 
to the kernel i use).

-dean

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  4:20                             ` dean gaudet
@ 2007-11-15  4:47                               ` Paul Mackerras
  2007-11-15  5:14                                 ` dean gaudet
  2007-11-15  8:53                               ` Stephane Eranian
  2007-11-15 17:01                               ` [perfmon2] [perfmon] " Dan Terpstra
  2 siblings, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-15  4:47 UTC (permalink / raw)
  To: dean gaudet
  Cc: Andi Kleen, Christoph Hellwig, Andrew Morton, Greg KH,
	Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel

dean gaudet writes:

> actually multiplexing is the main feature i am in need of. there are an 
> insufficient number of counters (even on k8 with 4 counters) to do 
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> hit rates, average miss latency, time spent in various stalls, and the 
> memory system utilization (or HT bus utilization).  this runs out to 
> something like 30 events which are interesting... and re-running a 
> benchmark over and over just to get around the lack of multiplexing is a 
> royal pain in the ass.

So by "multiplexing" do you mean the ability to have multiple event
sets associated with a context and have the kernel switch between them
automatically?

Paul.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  4:47                               ` Paul Mackerras
@ 2007-11-15  5:14                                 ` dean gaudet
  0 siblings, 0 replies; 116+ messages in thread
From: dean gaudet @ 2007-11-15  5:14 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andi Kleen, Christoph Hellwig, Andrew Morton, Greg KH,
	Philip Mucci, eranian, William Cohen, Robert Richter,
	linux-kernel

On Thu, 15 Nov 2007, Paul Mackerras wrote:

> dean gaudet writes:
> 
> > actually multiplexing is the main feature i am in need of. there are an 
> > insufficient number of counters (even on k8 with 4 counters) to do 
> > complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> > hit rates, average miss latency, time spent in various stalls, and the 
> > memory system utilization (or HT bus utilization).  this runs out to 
> > something like 30 events which are interesting... and re-running a 
> > benchmark over and over just to get around the lack of multiplexing is a 
> > royal pain in the ass.
> 
> So by "multiplexing" do you mean the ability to have multiple event
> sets associated with a context and have the kernel switch between them
> automatically?

yep.

-dean

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  2:34                                                 ` Paul Mackerras
@ 2007-11-15  7:48                                                   ` Herbert Xu
  2007-11-15  8:19                                                     ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Herbert Xu @ 2007-11-15  7:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: davem, hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

Paul Mackerras <paulus@samba.org> wrote:
>
> Well you must mean something different by "strong typing" from the
> rest of us.  Strong typing means that the compiler can check that you
> have passed in the correct types of arguments, but the compiler
> doesn't have any visibility into what structures are valid in netlink
> messages.

That's strong static typing.  Netlink is 90% strong static
typing plus 10% strong dynamic typing.  That is, it'll tell
you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute is checked at compile
time.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  7:48                                                   ` Herbert Xu
@ 2007-11-15  8:19                                                     ` Andi Kleen
  0 siblings, 0 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-15  8:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul Mackerras, davem, hch, akpm, gregkh, mucci, eranian, wcohen,
	robert.richter, linux-kernel, andi

Herbert Xu <herbert@gondor.apana.org.au> writes:

> That's strong static typing.  Netlink is 90% strong static
> typing plus 10% strong dynamic typing.  That is, it'll tell
> you at run-time if you give it the wrong netlink attribute.

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  1:11                                             ` Paul Mackerras
  2007-11-15  1:27                                               ` David Miller
@ 2007-11-15  8:29                                               ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-15  8:29 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: mucci, wcohen, robert.richter, linux-kernel, andi, Stephane Eranian

Hi,

On Thu, Nov 15, 2007 at 12:11:10PM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > From: Paul Mackerras <paulus@samba.org>
> > Date: Thu, 15 Nov 2007 10:12:22 +1100
> > 
> > > *I* never had a problem with a few extra system calls.  I don't
> > >  understand why you (apparently) do.
> > 
> > We're stuck with them forever, they are hard to version and extend
> > cleanly.
> > 
> > Those are my main objections.
> 
> The first is valid (for suitable values of "forever") but applies to
> any user/kernel interface, not just system calls.
> 
Agreed.

> As for the second (hard to version) I don't see why it applies to
> syscalls specifically more than to other interfaces.  It's just a
> matter of designing it correctly in the first place.  For example, the
> sys_swapcontext system call we have on powerpc takes an argument which
> is the size of the ucontext_t that userland is using, which allows us
> to extend it in future if necessary.  (Note that I'm not saying that
> the current perfmon2 interfaces are well-designed in this respect.)
> 
> The third (hard to extend cleanly) is a good point, and is a valid
> criticism of the current set of perfmon2 system calls, I think.
> However, the goal of being able to extend the interface tends to be in
> opposition to the goal of having strong typing of the interface.
> Things like a multiplexed syscall or an ioctl are much easier to
> extend but that is at the expense of losing strong typing.  Something
> like my transaction() (or your weird kind of read() :) also provides
> extensibility but loses type safety to some degree.
> 
In the initial design there was only one perfmon syscall perfmonctl()
and it was a multiplexing call. People objected to it and thus I split it
up into multiple system calls. I like the strong typing but I agree that
it is harder to extend without creating new syscalls. In the current
state, all perfmon syscalls take a pointer to structs which have reserved
fields for future extensions. If you specify that reserved fields must be
zeroed, then it leaves you *some* flexibility for extending the structs.

Another alternative, similar to your ucontext, would be to pass the size
of the structure. If we assume we drop the vector arguments, we could do:

	pfm_write_pmcs(fd, &pmc, sizeof(pmc));
instead of
	pfm_write_pmcs(fd, &pmc);

Should the sizeof(pmc) need to change we could demultiplex inside the
kernel. Another, probably cleaner, possibility is to version structures
that are passed:
	union pfarg_pmc {
		int version;
		struct {
			int version;
			int reg_num;
			u64 reg_value;
		}
	}

But that seems overkill. I think versioning could be passed when the session
is created instead of at every call:

	fd = pfm_create_session(version, &ctx, ....);


> Also, as Andi says, this is core CPU state that we are dealing with,
> not some I/O device, so treating the whole of perfmon2 (or any
> performance monitoring infrastructure) as a driver doesn't fit very
> well, and in fact system calls are appropriate.  Just like we don't
> try to make access to debugging facilities fit into a driver, we
> shouldn't make performance monitoring fit into a driver either.
> 

Agreed 100%. This is especially true because we support per-thread
monitoring.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  4:20                             ` dean gaudet
  2007-11-15  4:47                               ` Paul Mackerras
@ 2007-11-15  8:53                               ` Stephane Eranian
  2007-11-15 17:01                               ` [perfmon2] [perfmon] " Dan Terpstra
  2 siblings, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-15  8:53 UTC (permalink / raw)
  To: dean gaudet
  Cc: Andi Kleen, Christoph Hellwig, Paul Mackerras, Andrew Morton,
	Greg KH, Philip Mucci, William Cohen, Robert Richter,
	linux-kernel, Stephane Eranian

Hello,

On Wed, Nov 14, 2007 at 08:20:22PM -0800, dean gaudet wrote:
> On Wed, 14 Nov 2007, Andi Kleen wrote:
> 
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> 
> actually multiplexing is the main feature i am in need of. there are an 
> insufficient number of counters (even on k8 with 4 counters) to do 
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
> hit rates, average miss latency, time spent in various stalls, and the 
> memory system utilization (or HT bus utilization).  this runs out to 
> something like 30 events which are interesting... and re-running a 
> benchmark over and over just to get around the lack of multiplexing is a 
> royal pain in the ass.
> 
> it's not a "far away non-essential feature" to me.  it's something i would 
> use daily if i had all the pieces together now (and i'm constrained 
> because i cannot add an out-of-tree patch which adds unofficial syscalls 
> to the kernel i use).
> 

Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, that are many important metrics which do require more events
than there are counters. Making multiple runs can be difficult with some workloads.

But there are also other, less known, reasons why you'd want to do this. This is
not because you have lots of counters that you can necessarily measure lots of
related events simultaneously. Take pentium 4 for instance, it has 18 counters, but
for most interesting metrics, you cannot measure all the events at once. Why? Because
there are important hardware constraints which translate into event combination 
constraints. It is not uncommon to have constraints such as:
	- event A and B cannot be measured together
	- event A can only be measured by counter X
	- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has limitations,
AMD Opterons also have limitations.

When you combine limited number of counters with strong constraints, it can quickly
become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously but
if you run for long enough and with a reasonable switching periods, the *estimates*
you get by scaling the obtained counts can be very close to what they would have
been had you measured all events all the time. You have to balance precision with
overhead.

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at the user
level. That's true and we do support this as well. You have to:
		- stop monitoring
		- read out current counter
		- reprogram config and data registers
		- restart monitoring

But there are some important benefits for doing this in the kernel especially for
per-thread monitoring. When you are not self-monitoring, you would need to stop the
other thread first, then issue a minimum of 4 system calls and incur a couple of
context switches. By doing it in the kernel, you guaranteed that switching always occur
in the context of the monitored thread.

Furthermore it can be integrated with kernel-level sampling. Adding the notion
of event set is fairly pervasive and you need to make sure that it fits well with
the other parts of the interface.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* RE: [perfmon2] [perfmon] Re:  perfmon2 merge news
  2007-11-15  4:20                             ` dean gaudet
  2007-11-15  4:47                               ` Paul Mackerras
  2007-11-15  8:53                               ` Stephane Eranian
@ 2007-11-15 17:01                               ` Dan Terpstra
  2 siblings, 0 replies; 116+ messages in thread
From: Dan Terpstra @ 2007-11-15 17:01 UTC (permalink / raw)
  To: 'dean gaudet', 'Andi Kleen'
  Cc: 'papi list', 'OSPAT devel', 'Greg KH',
	'Perfmon', linux-kernel, 'Christoph Hellwig',
	'Paul Mackerras', 'Andrew Morton',
	perfmon2-devel, 'Philip Mucci'

We've provided multiplexing in PAPI at the user level for years. That forced
it to the user level, which wasn't pretty. Or very statistically accurate.
We've been eagerly anticipating the improvements provided by in-kernel
multiplexing in perfmon2. We and our user base don't consider this a "far
away non-essential feature", but a deficiency that's needed addressing for a
long time.
- d

> -----Original Message-----
> From: perfmon2-devel-bounces@lists.sourceforge.net [mailto:perfmon2-devel-
> bounces@lists.sourceforge.net] On Behalf Of dean gaudet
> Sent: Wednesday, November 14, 2007 11:20 PM
> To: Andi Kleen
> Cc: papi list; OSPAT devel; Greg KH; Perfmon; linux-
> kernel@vger.kernel.org; Christoph Hellwig; Paul Mackerras; Andrew Morton;
> perfmon2-devel@lists.sourceforge.net; Philip Mucci
> Subject: Re: [perfmon2] [perfmon] Re: perfmon2 merge news
> 
> On Wed, 14 Nov 2007, Andi Kleen wrote:
> 
> > Later a syscall might be needed with event multiplexing, but that seems
> > more like a far away non essential feature.
> 
> actually multiplexing is the main feature i am in need of. there are an
> insufficient number of counters (even on k8 with 4 counters) to do
> complete stall accounting or to get a general overview of L1d/L1i/L2 cache
> hit rates, average miss latency, time spent in various stalls, and the
> memory system utilization (or HT bus utilization).  this runs out to
> something like 30 events which are interesting... and re-running a
> benchmark over and over just to get around the lack of multiplexing is a
> royal pain in the ass.
> 
> it's not a "far away non-essential feature" to me.  it's something i would
> use daily if i had all the pieces together now (and i'm constrained
> because i cannot add an out-of-tree patch which adds unofficial syscalls
> to the kernel i use).
> 
> -dean
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-14  1:52                       ` Andi Kleen
@ 2007-11-16  9:18                         ` Philip Mucci
  2007-11-16 15:15                           ` Andi Kleen
  0 siblings, 1 reply; 116+ messages in thread
From: Philip Mucci @ 2007-11-16  9:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Greg KH, Stephane Eranian, William Cohen,
	Robert Richter, linux-kernel, papi list

Just getting back to this now that SC07 is finally over...

On Nov 13, 2007, at 5:52 PM, Andi Kleen wrote:

> On Tue, Nov 13, 2007 at 04:28:52PM -0800, Philip Mucci wrote:
>> I know you don't want to hear this, but we actually use all of the
>> features of perfmon, because a) we wanted to use the best methods
>
> That is hard to believe.
>

You are welcome to download the code and some of the tools and verify  
the functionality yourself. It might be a good exercise.

> But let's go for it temporarily for the argument.
>
> Can you instead prioritize features.  What is most essential, what is
> important, what is just nice to have, what is rarely used?

Yes, although this has been done before. You've got the list below in  
the previous
emails which should be considered the absolute minimum.

- A feature which was dropped earlier by Stefane (only to satiate  
LKML), we consider
very important. Allowing one tomapping of the kernels view of the  
PMD's, allowing
user-space access to full 64-bit counts, if the architecture
supports a user-level read instruction. Getting the counts in a  
couple of dozen cycles
is ALWAYS a win for us. This is because the HPC community is mainly  
interested in
self-monitoring, not third-party, because the former can be easily  
associated with
context in the app through instrumentation in various forms.

- Kernel multiplexing is very nice to have, saves you tremendous  
overhead at user
level. PAPI has an implementation in user-space for the platforms  
that don't support
this. The flexibility of the current implementation is not exploited,  
here I'm
referring to the concept of eventsets. Having multiplexing is  
important. Being able
to allocate/reallocate eventsets and the threshold of individual  
eventsets is just nice
to have.

- Custom sample formats would be considered not often used in our  
community, largely
because the tools run on all HPC/Linux architectures. PAPI uses the  
default sample
format which has been sufficient for our needs. However, the lack of  
custom sample
formats preclude the dev of the specialized tools that access the  
sampling
hardware as found on the IA64, PPC64, the Barcelona and the SiCortex  
node chip.
pfmon exports this functionality quite well, and it does get used.


>> 	- providing virtualized 64-bit counters per-thread
>> 	- providing notification (buffered or non) on interrupt/overflow of
>> the above.
>
> Ok that makes sense and should be possible with a reasonable simple
> interface.

Well that's good news. The above is what we have used via the PerfCtr  
set of
patches for a long time. It wasn't quite enough, but it got the job  
done.

>> If you'd like to outline further what you'd like to hear from the
>> community, I can arrange that. I seem to remember going through this
>> once before, but I'd be happy to do it again. For reference, here's a
>> quick list from memory of some of the tools in active use and built
>> on this infrastructure. These are used heavily around the globe.
>
> Please list concrete features, throwing around random names is not  
> useful.
>

This is kind of comment that makes the Linux/HPC folks 'somber'. What  
isn't useful, is being dismissive of an entire community that moves a  
heck of a lot of Linux DVD's. >80% of the top500 list is Linux these  
days (compared to < 10% just a few years back), and so is the bulk of  
the HPC clusters in the marketplace, large and small. (ref those  
expensive IDC reports) These are tools used daily in HPC centers and  
industry around the globe, doing real work for folks that buy a lot  
of hardware and actually pay for Linux distributions. These tools  
seem random to you, because you haven't spent any time educating  
yourself about this community since we first talked about this >3  
years back when considering PerfCtr. Really, there are dozens of HPC/ 
Linux events held every year around the world of varying sizes;  
should you ever attend one, this all might not seem so 'random'. And  
yes, you would be warmly welcomed.

-Phil
  

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16  9:18                         ` Philip Mucci
@ 2007-11-16 15:15                           ` Andi Kleen
  2007-11-16 16:00                             ` Stephane Eranian
                                               ` (4 more replies)
  0 siblings, 5 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-16 15:15 UTC (permalink / raw)
  To: Philip Mucci
  Cc: Andi Kleen, Andrew Morton, Greg KH, Stephane Eranian,
	William Cohen, Robert Richter, linux-kernel, papi list

Philip Mucci <mucci@cs.utk.edu> writes:
>
> Yes, although this has been done before. You've got the list below in
> the previous
> emails which should be considered the absolute minimum.

I didn't see a clear list. 

My impression so far is that you're not quite sure what you want,
otherwise you would be more concrete.

> - A feature which was dropped earlier by Stefane (only to satiate
> LKML), we consider
> very important. Allowing one tomapping of the kernels view of the
> PMD's, allowing
> user-space access to full 64-bit counts, if the architecture
> supports a user-level read instruction.

You mean returning the register number for RDPMC or equivalent
and a way to enable it for ring 3 access? 

I'm considering that an essential feature too. I wasn't aware
it was dropped.

> Getting the counts in a
> couple of dozen cycles
> is ALWAYS a win for us.

Yes it is for everybody. I've been rather questioning if the slow
ways (complicated syscalls) to get the counter information are really 
needed.

> referring to the concept of eventsets. Having multiplexing is
> important.

Why is it important? 

> - Custom sample formats would be considered not often used in our
> community, largely
> because the tools run on all HPC/Linux architectures. PAPI uses the
> default sample
> format which has been sufficient for our needs. However, the lack of
> custom sample
> formats preclude the dev of the specialized tools that access the
> sampling
> hardware as found on the IA64, PPC64, the Barcelona and the SiCortex
> node chip.
> pfmon exports this functionality quite well, and it does get used.

What do you mean with custom sample formats exactly?  What information
do you want in there? And why?

e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
so they only way to get a custom format would be to use a separate buffer.

I can think of one reason why the kernel should add more information
in a separate buffer (log the instruction bytes so that it can
be disassembled and a address histogram be generated using the PEBS
register values), but it is a relatively obscure one and definitely
not a essential feature. Unfortunately it is also hard to implement completely
race-free.

> This is kind of comment that makes the Linux/HPC folks 'somber'. What
> isn't useful, is being dismissive of an entire community that moves a
> heck of a lot of Linux DVD's. 

Sorry, but these kind of non technical BS arguments will just make
you be ignored in mainline Linux lands. They might work if you pay
a lot of money to specific Linux companies (do you?), but here
on linux-kernel you have to convince with purely technical arguments.

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 15:15                           ` Andi Kleen
@ 2007-11-16 16:00                             ` Stephane Eranian
  2007-11-16 16:28                               ` Andi Kleen
  2007-11-16 17:51                             ` dean gaudet
                                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-16 16:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Philip Mucci, Andrew Morton, Greg KH, William Cohen,
	Robert Richter, linux-kernel, Stephane Eranian

Andi,

On Fri, Nov 16, 2007 at 04:15:56PM +0100, Andi Kleen wrote:
> My impression so far is that you're not quite sure what you want,
> otherwise you would be more concrete.
> 
> > - A feature which was dropped earlier by Stefane (only to satiate
> > LKML), we consider
> > very important. Allowing one tomapping of the kernels view of the
> > PMD's, allowing
> > user-space access to full 64-bit counts, if the architecture
> > supports a user-level read instruction.
> 
> You mean returning the register number for RDPMC or equivalent
> and a way to enable it for ring 3 access? 
> 
No, he is talking about something similar to what was in perfctr.
The kernel emulates 64-bit counters in software and that is you
get back when you read the counters. If you read via RDPMC, you
get 40 bits. To reconstruct the full 64-bit value from user land
you need the upper bits. One approach is for the kernel to allow
you to remap a page that has the 64-bit (software) counters. With
that and a bit of mask/shifting you can reconstruct the full value.

> I'm considering that an essential feature too. I wasn't aware
> it was dropped.
> 
What I dropped is the cr4.pce enabled for self-monitoring sessions.

> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really 
> needed.
> 
> > referring to the concept of eventsets. Having multiplexing is
> > important.
> 
> Why is it important? 
> 
Read my follow-up message to Dean's message.

> > - Custom sample formats would be considered not often used in our
> > community, largely
> > because the tools run on all HPC/Linux architectures. PAPI uses the
> > default sample
> > format which has been sufficient for our needs. However, the lack of
> > custom sample
> > formats preclude the dev of the specialized tools that access the
> > sampling
> > hardware as found on the IA64, PPC64, the Barcelona and the SiCortex
> > node chip.
> > pfmon exports this functionality quite well, and it does get used.
> 
> What do you mean with custom sample formats exactly?  What information
> do you want in there? And why?
> 
Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
not new, Oprofile has this as well. The problem here is that if the
buffer is in the kernel the format of the samples is fixed and it
should have to. Tools may want to record samples in different formats
and as you said some may need extra information gathered in the kernel.
Some may want to aggregate samples in the kernel (Oprofile used to
do that), some may want to use a double-buffer approach to minimize
blind spots, others may simply use the counter overflow mechanism to
record something that is non-PMU related, e.g, kernel call stack.
I have built such a module and it was quite interesting to collect
the call stack when you hit a last cache level miss.

The idea behind customizable sampling format is simple: extract the
format from the perfmon core and put this into a kernel module. The
core provides a simple registration mechanism and the two communicate
via a set of callbacks.

Perfmon2 comes with a basic default format which works on all
platforms. But it is possible to develop others without having to
patch the kernel nor recompile nor reboot. At its core, each format provides
a handler routine which is called on counter overflow. The handler routine
controls what is recorded, how it is recorded, how it is exported to
userland, and wheher overflow notifications need to be sent.

Using this mechanism, for instance, we were able to connect the
Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
code. The exact same approach would also work on X86 Oprofile as well.

> e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> so they only way to get a custom format would be to use a separate buffer.
> 

This is also how we support PEBS because, as you said, the format of the
samples is not under your control. if you want zero-copy PEBS support,
you have to follow the PEBS format.

I am sure other processors haev and will have hardware buffers as well.

> I can think of one reason why the kernel should add more information
> in a separate buffer (log the instruction bytes so that it can
> be disassembled and a address histogram be generated using the PEBS
> register values), but it is a relatively obscure one and definitely
> not a essential feature. Unfortunately it is also hard to implement
> completely race-free.
> 
Yes, you could do that without changing the core implementation of
perfmon2.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 16:00                             ` Stephane Eranian
@ 2007-11-16 16:28                               ` Andi Kleen
  2007-11-16 17:13                                 ` William Cohen
  2007-11-16 17:36                                 ` Stephane Eranian
  0 siblings, 2 replies; 116+ messages in thread
From: Andi Kleen @ 2007-11-16 16:28 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, Philip Mucci, Andrew Morton, Greg KH, William Cohen,
	Robert Richter, linux-kernel

On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> No, he is talking about something similar to what was in perfctr.
> The kernel emulates 64-bit counters in software and that is you
> get back when you read the counters. If you read via RDPMC, you
> get 40 bits. To reconstruct the full 64-bit value from user land
> you need the upper bits. One approach is for the kernel to allow
> you to remap a page that has the 64-bit (software) counters. With
> that and a bit of mask/shifting you can reconstruct the full value.

You mean the page contains the upper [40;63] bits? 

Sounds reasonable, although I don't remember seeing that when I looked
at the perfmon code last.

> 
> > I'm considering that an essential feature too. I wasn't aware
> > it was dropped.
> > 
> What I dropped is the cr4.pce enabled for self-monitoring sessions.

That sounds bad.

> Perfmon2 allows you to have an in-kernel sampling buffer. The idea is

... you also didn't say *why* that is needed.

Can you give a concrete use case for something that cannot be done
without custom buffer formats? 

> Using this mechanism, for instance, we were able to connect the
> Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
> code. The exact same approach would also work on X86 Oprofile as well.

The existing oprofile code works already fine on x86, no real
need for another one.

> > e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> > so they only way to get a custom format would be to use a separate buffer.
> > 
> 
> This is also how we support PEBS because, as you said, the format of the
> samples is not under your control. if you want zero-copy PEBS support,
> you have to follow the PEBS format.

Exactly that makes the support for random custom buffers questionable.

e.g. as I can see the main advantage of perfmon over existing setups
is that it support PEBS etc., but with your custom buffer formats which
are by definition incompatible with PEBS you would negate that advantage
again.

Ok IBS will probably need some special handling.

> Yes, you could do that without changing the core implementation of
> perfmon2.

Why this insistence against changing anything?

-Andi

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 16:28                               ` Andi Kleen
@ 2007-11-16 17:13                                 ` William Cohen
  2007-11-16 21:56                                   ` Stephane Eranian
  2007-11-16 17:36                                 ` Stephane Eranian
  1 sibling, 1 reply; 116+ messages in thread
From: William Cohen @ 2007-11-16 17:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Philip Mucci, Andrew Morton, Greg KH,
	Robert Richter, linux-kernel

Andi Kleen wrote:
> On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
>> No, he is talking about something similar to what was in perfctr.
>> The kernel emulates 64-bit counters in software and that is you
>> get back when you read the counters. If you read via RDPMC, you
>> get 40 bits. To reconstruct the full 64-bit value from user land
>> you need the upper bits. One approach is for the kernel to allow
>> you to remap a page that has the 64-bit (software) counters. With
>> that and a bit of mask/shifting you can reconstruct the full value.
> 
> You mean the page contains the upper [40;63] bits? 
> 
> Sounds reasonable, although I don't remember seeing that when I looked
> at the perfmon code last.

Upper 32-bit ([32:63]). On many implementations the only lower 32-bit are 
available in the register. the 32:40 bits in several processor implementation of 
x86 processors can not be set to bit outside of sign extension of bit 32. On 
other processor implementations the event counters are only 32-bit in width.

> 
>>> I'm considering that an essential feature too. I wasn't aware
>>> it was dropped.
>>>
>> What I dropped is the cr4.pce enabled for self-monitoring sessions.
> 
> That sounds bad.
> 
>> Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
> 
> ... you also didn't say *why* that is needed.
> 
> Can you give a concrete use case for something that cannot be done
> without custom buffer formats? 
> 
>> Using this mechanism, for instance, we were able to connect the
>> Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
>> code. The exact same approach would also work on X86 Oprofile as well.
> 
> The existing oprofile code works already fine on x86, no real
> need for another one.

OProfile is very useful in many cases, but it only perform sampling. If one want 
to take a look at the number events a specific section of code causes, one can't 
really do that with oprofile. The counters are running systemwide, not per 
thread. For some experiments developers really like to have per thread counters.

The rewrite of oprofile to use the perfmon code was to consolidate code using 
the performance monitoring hardware. Use one interface for accessing the 
performance monitoring hardware rather than have one for sampling and another 
for virtualizing the counters on a per thread basis.

>>> e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
>>> so they only way to get a custom format would be to use a separate buffer.
>>>
>> This is also how we support PEBS because, as you said, the format of the
>> samples is not under your control. if you want zero-copy PEBS support,
>> you have to follow the PEBS format.
> 
> Exactly that makes the support for random custom buffers questionable.
> 
> e.g. as I can see the main advantage of perfmon over existing setups
> is that it support PEBS etc., but with your custom buffer formats which
> are by definition incompatible with PEBS you would negate that advantage
> again.
> 
> Ok IBS will probably need some special handling.
> 
>> Yes, you could do that without changing the core implementation of
>> perfmon2.
> 
> Why this insistence against changing anything?
> 
> -Andi

So the alternative approach is to write a new device driver for each of the new 
performance monitoring mechanisms, e.g. one for PEBS and another for IBS?

One of the reason for the custom sample buffers was to avoid having an expensive 
user-space signal for a process to record some simple pieces of data each time 
the data becomes available. For the oprofile port to the perfmon2 custom buffer 
  mechanism the instruction pointer and the counter that overflowed are 
recorded. The buffer can be processed in one large chunk by userspace, reducing 
overhead. In essence the current implementation of OProfile in the mainline 
kernels has a custom buffer mechanism.

-Will


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 16:28                               ` Andi Kleen
  2007-11-16 17:13                                 ` William Cohen
@ 2007-11-16 17:36                                 ` Stephane Eranian
  1 sibling, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-16 17:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Philip Mucci, Andrew Morton, Greg KH, William Cohen,
	Robert Richter, linux-kernel, Stephane Eranian

Andi,
On Fri, Nov 16, 2007 at 05:28:13PM +0100, Andi Kleen wrote:
> On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> > No, he is talking about something similar to what was in perfctr.
> > The kernel emulates 64-bit counters in software and that is you
> > get back when you read the counters. If you read via RDPMC, you
> > get 40 bits. To reconstruct the full 64-bit value from user land
> > you need the upper bits. One approach is for the kernel to allow
> > you to remap a page that has the 64-bit (software) counters. With
> > that and a bit of mask/shifting you can reconstruct the full value.
> 
> You mean the page contains the upper [40;63] bits? 
> 
> Sounds reasonable, although I don't remember seeing that when I looked
> at the perfmon code last.
> 
I dropped that quite some time ago.

> > 
> > > I'm considering that an essential feature too. I wasn't aware
> > > it was dropped.
> > > 
> > What I dropped is the cr4.pce enabled for self-monitoring sessions.
> 
> That sounds bad.

That's because you said you were going to enable it system-wide by default.

> 
> > Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
> 
> ... you also didn't say *why* that is needed.
> 
Do you question why Oprofile has one ;->

But I am happy to explain.

With sampling, you want to record information about the execution of a
thread at some interval. The interval could be expressed as time or
number of occurences of an PMU event.

Typically you get a notification. Then you need to collect certain 
information about the execution. Typically you record the instruction
pointer (e.g. Oprofile), but you may want to record the value of other
counters, PMU registers or other HW/SW resources. While you're doing
this monitoring is typically stopped so you get a consitent view. After
you're done recording you need to re-arm the sampling period. If you
use event-based sampling, you need to reprogram the counter(s). Then
you resume monitoring. You have to repeat this process for each sample
regardless of whether you are self-monitoring, monitoring another thread,
or monitoring a CPU.

Such sequence of operations is quite expensive, especially in the case
where you are monitoring another thread, because it incurs at least
a couple of context switches per sample in addition to the various
register manipulations and syscalls.

The idea with the kernel sampling buffer is that you amortize the
cost of notification to userland over LOTS of samples. On counter
overflow, the kernel records the samples on your behalf. There is
no context switch, samples are always recorded in the context on
the monitored thread.

Now, you need a bit more information for this to work correctly
because the kernel records on *your behalf*,  thus
you need to express:
	- what you want to see recorded

	- the value to reload into the overflowed counter(s)
	  so the kernel can re-arm the next period.

Because you have multiple counters, you may use them for sampling
periods, i.e., overlap sampling measurements. That is something
done very frequently.

For instance, the q-syscollect tool that D. Mosberger wrote, is
overlapping elapsed cycles and branch trace buffer (BTB) sampling
to collect, in *one* run, a flat profile and a statistical call graph.

Depending on which counter overflowed, you may one to record
different things. For instance, the flat profile requires
just the instruction pointer. But for the BTB, the buffer
is implemented by PMU registers, thus you need to record
them (16 total). You don't want to record all register possible
in each sample: reading PMU register is costly and you
want to maximize buffer space usage.

As you can see, you need to express per counter:
	- what other resources to record when it overflows
	- the value to reload into the counter after overflow

In perfmon2 this information is passed by the pfm_write_pmds()
call. You can say:
	PMD2.value     = -5000; /* initial period */
	PMD2.reset     = -2000; /* repeat period */
	PMD2.smpl_pmds = 0xf0;  /* to record PMD4-7 on overlow */

Now, it is important to note that this is not just on Itanium
that we need this kind of flexibility. Given that you mentioned
IBS, I will use it as a non-X86 example. IBS is implemented
using PMU registers, 10 to be precise. There is no need for a
custom sampling format to support that, the default format is
sufficient.

The default sampling format does record more than the instruction
pointer.  Each sample has a fixed size header including the instruction
pointer but also PID/TID/CPU. But it also has a variable size body
where the kernel stores the other registers you want to record
in each sample based on which counter overflowed. So for IBS, it
would store the 10 data registers.

> Can you give a concrete use case for something that cannot be done
> without custom buffer formats? 
> 

PEBS is one. You would have to special this. PEBS includes
the instruction pointer + values of all registers. You'd have
to devise a scheme to allocate the PEBS buffer and then on
PEBS interrupt you'd have to copy the data  into the other
buffer. Not counting on the fact that PEBS between P4 and Intel
Core 2 different and that this is an Intel X86 only feature.
I think this is better isolated into X86 specific code and
into a kernel module because it does not work on all models.


> > Using this mechanism, for instance, we were able to connect the
> > Oprofile kernel code to perfmon2 on Itanium with a 100 lines of
> > code. The exact same approach would also work on X86 Oprofile as well.
> 
> The existing oprofile code works already fine on x86, no real
> need for another one.
> 
Can you support advanced monitoring like I just described above?

> > > e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
> > > so they only way to get a custom format would be to use a separate buffer.
> > > 
> > 
> > This is also how we support PEBS because, as you said, the format of the
> > samples is not under your control. if you want zero-copy PEBS support,
> > you have to follow the PEBS format.
> 
> Exactly that makes the support for random custom buffers questionable.
> 
Quite the contrary, without the custom buffers we would have horrible
hacks to support PEBS.

> e.g. as I can see the main advantage of perfmon over existing setups
> is that it support PEBS etc., but with your custom buffer formats which
> are by definition incompatible with PEBS you would negate that advantage
> again.
> 

I think  you are confused about the terms here. The custom sampling
format is a kernel-level interface to plug-in kernel modules
which implement custom sampling formats. PEBS requires a custom
format  because you do not control what is recorded. Thus what
you do is you *create* a format whose sample format *maps* the PEBS
format exactly. And that format is *different* from the one used
by the default sampling format.


> Ok IBS will probably need some special handling.
> 
No, it does not. No sampling format, no extra tricks.

> > Yes, you could do that without changing the core implementation of
> > perfmon2.
> 
> Why this insistence against changing anything?
> 

Because hardware is very diverse and is changing rapidly.
Changing the kernel is difficult and it takes a very long time
for new features to reach end-users. You are not without knowing
that most users do not download their production kernels from
kernel.org. Monitoring is not just reserved for core developers
and it is also very useful on production systems to diagnose
performance problems.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 15:15                           ` Andi Kleen
  2007-11-16 16:00                             ` Stephane Eranian
@ 2007-11-16 17:51                             ` dean gaudet
  2007-11-17  0:29                               ` David Miller
  2007-11-16 20:16                             ` Philip Mucci
                                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: dean gaudet @ 2007-11-16 17:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Philip Mucci, Andrew Morton, Greg KH, Stephane Eranian,
	William Cohen, Robert Richter, linux-kernel, papi list

On Fri, 16 Nov 2007, Andi Kleen wrote:

> I didn't see a clear list. 

- cross platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring 
  infrastructure the vendors through at us in the future
- low overhead:  must minimize the "probe effect" of monitoring
- low noise in measurements:  cannot achieve this in userland

permon2 has all of this and more i've probably neglected...

-dean

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: PMC core internal API design
  2007-11-13 18:32         ` Stephane Eranian
  2007-11-13 22:29           ` Christoph Hellwig
@ 2007-11-16 18:25           ` Mathieu Desnoyers
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2007-11-16 18:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Robert Richter, Andi Kleen, gregkh, akpm, linux-kernel,
	perfmon2-devel, perfmon, Christoph Hellwig

* Stephane Eranian (eranian@hpl.hp.com) wrote:
> Hello,
> 
> On Tue, Nov 13, 2007 at 04:17:18PM +0100, Robert Richter wrote:
> > On 10.11.07 21:32:39, Andi Kleen wrote:
> > > It would be really good to extract a core perfmon and start with
> > > that and then add stuff as it makes sense.
> > > 
> > > e.g. core perfmon could be something simple like just support
> > > to context switch state and initialize counters in a basic way 
> > > and perhaps get counter numbers for RDPMC in ring3 on x86[1]
> > 
> > Perhaps a core could provide also as much functionality so that
> > Perfmon can be used with an *unpatched* kernel using loadable modules?
> > One drawback with today's Perfmon is that it can not be used with a
> > vanilla kernel. But maybe such a core is by far too complex for a
> > first merge.
> > 
> Note that I am not against the gradual approach such as:
> 	- system-wide only counting

(jumping in late in the game)

Linux Trace Toolkit Next Generation would _happily_ use global PMC
counters, but I would prefer to interact with an internal kernel API
rather than being required to start/stop counters from user-space. There
is a big precision loss involved in having to start things from
userspace.

Ideally, this API would manage access to available PMCs and even use the
same counters for both system-wide tracing/profiling done at the same
time as user-space profiling. This would however involve having a
wrapper around both user-space and kernel-space performance counter
reads, which is fine with me. I would suggest that user-space still go
through a system call for this, since this is available a early boot,
before the filesystem is mounted.

This API could offer to in-kernel architecture _independent_ PMC control
interface to :
- list available PMCs
  - That would involve mapping the common PMCs to some generic
    identifier
- attach to these PMCs, with a certain priority

We could call a single connexion to a PMC a "virtual PMC". All PMC
accesses should then be done through this internally managed structure
(giving callbacks to be called after a certain count, reads, stop...).
We could have virtual PMCs that are : system wide, or per thread.

As a starting point, we could limit one virtual PMC attached to a
physical PMC at a given time. Later, we could add support for multiple
virtual PMCs connected to a single physical PMC. The priorities could be
used to kick out the PMC users with lower priorities (that involves that
a PMC read could fail!).

Then, to get interrupts or signals upon PMC overflow, we could manage
each physical PMC like a timer, using the lowest requested value for the
next time were are to be awakened. Some logic would have to be added to
the pmc read operation to get the "real" expected value, but this is
nothing difficult.

Those were the ideas I had last OLS after hearing the talk about
perfmon2. I hope they can be useful. If things need to be clarified, I
will gladly discuss them further.

Mathieu

P.S. : the rest of the feature list _should_ be easy to implement on top
of this internal architecture.

> 	- per-thread counting
> 	- user-level sampling support
> 	- in-kernel sampling buffer support
> 	- in-kernel customizable sampling buffer formats via modules
> 	- event set multiplexing
> 	- PMU description modules
> 
> It would obvisouly cause a lot of troubles to existing perfmon libraries and
> applications (e.g. PAPI). It would also be fairly tricky to do because you'd 
> have to make sure that in the beginning, you leave enough flexiblity such that
> you can add the rest while maintaining total backward compatibility. But given
> that we already have the full solution, it could just be a matter of dropping
> features without disrupting the user level API. Of course there would be a bigger
> burden on the maintainer because he would have two trees to maintain but I think
> that is already commonplace in many of the kernel-related projects.
> 
> Let's take a simple example. The set of syscalls necessary to control a system-wide
> monitoring session is exactly the same as for a per-thread session. The difference is
> just a flag when the session is created. Thus, we could keep the same set of syscalls,
> but only accept system-wide sessions. Later on, when we add per-thread, we would just
> have to expose the per-thread session flag.
> 
> Having said that, does not mean that this is necessarily what we will do. I am just
> try to present my understanding of the comments from Andrew, Andi and others.
> 
> I think that going with a kernel module will not address the 'complexity/bloat' perception
> that some people have. There is a logic to that, I did not just wakeup one day saying
> 'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or
> developers and it was justfied by what the hardware offered then. This unfortunately still
> stands today. I admit that justification is not necessarily spelled out clearly in the code. So
> I understand most of those worries and I am trying to figure out how we could best address them.
> 
> -- 
> -Stephane
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 15:15                           ` Andi Kleen
  2007-11-16 16:00                             ` Stephane Eranian
  2007-11-16 17:51                             ` dean gaudet
@ 2007-11-16 20:16                             ` Philip Mucci
  2007-11-17  0:15                             ` David Miller
       [not found]                             ` <1d7226b10711161713j675341b7wdb4f050c59a8be0a@mail.gmail.com>
  4 siblings, 0 replies; 116+ messages in thread
From: Philip Mucci @ 2007-11-16 20:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Greg KH, Stephane Eranian, William Cohen,
	Robert Richter, linux-kernel, papi list


> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really
> needed.

I suppose by complicated here, your referring to the gather semantics  
of the
pfm_read/write_pmds/pmcs calls. Many processors may have 100's of  
registers
(IA64, BG/P, SiCortex), some of which have different access times. So a
naive syscall of 'give me all the registers you've got' isn't going  
to cut it.
However, any additional simplicity (performance) we can squeeze out  
of this
particular primitive is a huge win as it sits in the critical path of  
the user
tools (unless one is sampling).

>> referring to the concept of eventsets. Having multiplexing is
>> important.
>
> Why is it important?
>

Performance and noise. See the earlier message about our user-land  
implementation versus kernel mode implementations. Any any useful  
granularity, you begin to seriously affect the counts with noise as  
well as dilate the run-time. But let's punt on this one until after  
we get the basics in. It's a non-essential feature at this point.

>> - Custom sample formats would be considered not often used in our
>> community, largely
>> because the tools run on all HPC/Linux architectures. PAPI uses the
>> default sample
>> format which has been sufficient for our needs. However, the lack of
>> custom sample
>> formats preclude the dev of the specialized tools that access the
>> sampling
>> hardware as found on the IA64, PPC64, the Barcelona and the SiCortex
>> node chip.
>> pfmon exports this functionality quite well, and it does get used.
>
> What do you mean with custom sample formats exactly?  What information
> do you want in there? And why?

By custom here, I mean the ability to have the kernel take samples  
containing
more than just the IP, the PID and a bitmask of which registers  
overflowed at this
point. Myself and others have worked hard to get effective address  
sampling into the
hardware (there are registers that contain EA's of misses as well as  
branch mispredict
data on the PPC, IA64, Barcelona and SiCortex) that are handled  
through the use
of a format that gathers up that information at interrupt time for  
deposit into
the sample buffer. We are not wedded to Perfmon2's implementation of  
these formats, we
are however, wedded to having this information collected at interrupt  
time as the data
may change by the time you get back to user-mode. This hardware is  
not obscure any more,
it's the norm, as we've learned at thus simple aggregate counters,  
even those with precise
interrupt abilities, are not sufficient to satisfy all of our needs.

> e.g. PEBS and so on pretty much fix the in memory sample format in  
> hardware,
> so they only way to get a custom format would be to use a separate  
> buffer.
>
> I can think of one reason why the kernel should add more information
> in a separate buffer (log the instruction bytes so that it can
> be disassembled and a address histogram be generated using the PEBS
> register values), but it is a relatively obscure one and definitely
> not a essential feature. Unfortunately it is also hard to implement  
> completely
> race-free.
>

>> This is kind of comment that makes the Linux/HPC folks 'somber'. What
>> isn't useful, is being dismissive of an entire community that moves a
>> heck of a lot of Linux DVD's.
>
> Sorry, but these kind of non technical BS arguments will just make
> you be ignored in mainline Linux lands. They might work if you pay
> a lot of money to specific Linux companies (do you?), but here
> on linux-kernel you have to convince with purely technical arguments.

I love it when kernel folks refer to their own revenue streams
(and yes, we do, ask your VP of sales) and the needs of a user  
community as
"BS non-technical arguments".

But let's get back to basics here. We can sort that out over a beer  
sometime.
At this point, let's try and agree on the minimum set of
functionality acceptable for a first round of patches.

- per-CPU (system-wide) and per-thread 64-bit virtualized counters
- dispatch of interrupt on overflow via a signal
- first (self) and third-party (attach) semantics
- extensible to new lines within an architecture without repatching
   (By this I mean that through the use of modules that contain PMU
    description tables, i.e. patches don't have to be issued for  
every new rev
    of HW that Intel releases)

To be considered later:
- Sample buffers and formats
- Multiplexing (by event threshold and time slicing)
- fast-read support if the hardware supports it (mmap + user rdpmc)

I think(?) we are all clear now why oprofile is not sufficient, i.e.
simultaneous usage by non-root users, each with different counter  
configurations,
lack of read/write access  etc. Oprofile is however, a very important  
tool
and any initial set of functionality should allow for a very simple  
port to
each version of the infrastructure along the way. I'd happily port
incremental versions of PAPI to the patches, so the performance tools  
can be
accessible to the LKML community while testing/benchmarking the
patchset on a variety of architectures. If we can agree on the  
starting point,
we can move the discussion of the API to the Perfmon2 mailing list  
and with your
input, finally 'get it right' in terms of acceptance.

If there's anything we do have in common at the moment, it's momentum.
We (speaking for the HPC community/vendors again) are not in favor of  
useless bloat, every
TLB slot, mispredict, miss, timer-tick or pipeline bubble, we care  
about, kernel
space or not. It's precisely why this type of infrastructure has  
become so
vital to us over the years.

-Phil

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 17:13                                 ` William Cohen
@ 2007-11-16 21:56                                   ` Stephane Eranian
  0 siblings, 0 replies; 116+ messages in thread
From: Stephane Eranian @ 2007-11-16 21:56 UTC (permalink / raw)
  To: William Cohen
  Cc: Andi Kleen, Philip Mucci, Andrew Morton, Greg KH, Robert Richter,
	linux-kernel

Will,

On Fri, Nov 16, 2007 at 12:13:07PM -0500, William Cohen wrote:
> Andi Kleen wrote:
> >On Fri, Nov 16, 2007 at 08:00:56AM -0800, Stephane Eranian wrote:
> >>No, he is talking about something similar to what was in perfctr.
> >>The kernel emulates 64-bit counters in software and that is you
> >>get back when you read the counters. If you read via RDPMC, you
> >>get 40 bits. To reconstruct the full 64-bit value from user land
> >>you need the upper bits. One approach is for the kernel to allow
> >>you to remap a page that has the 64-bit (software) counters. With
> >>that and a bit of mask/shifting you can reconstruct the full value.
> >
> >You mean the page contains the upper [40;63] bits? 
> >
> >Sounds reasonable, although I don't remember seeing that when I looked
> >at the perfmon code last.
> 
> Upper 32-bit ([32:63]). On many implementations the only lower 32-bit are 
> available in the register. the 32:40 bits in several processor 
> implementation of x86 processors can not be set to bit outside of sign 
> extension of bit 32. On other processor implementations the event counters 
> are only 32-bit in width.
> 
That is quite true on Intel's. Perfmon2 only considers the bottom 31 bits as
true counter bits, the rest is forced to 1. This is true even on Intel Core 2.

--
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 15:15                           ` Andi Kleen
                                               ` (2 preceding siblings ...)
  2007-11-16 20:16                             ` Philip Mucci
@ 2007-11-17  0:15                             ` David Miller
       [not found]                             ` <1d7226b10711161713j675341b7wdb4f050c59a8be0a@mail.gmail.com>
  4 siblings, 0 replies; 116+ messages in thread
From: David Miller @ 2007-11-17  0:15 UTC (permalink / raw)
  To: andi
  Cc: mucci, akpm, gregkh, eranian, wcohen, robert.richter,
	linux-kernel, ptools-perfapi

From: Andi Kleen <andi@firstfloor.org>
Date: Fri, 16 Nov 2007 16:15:56 +0100

> Philip Mucci <mucci@cs.utk.edu> writes:
> > - A feature which was dropped earlier by Stefane (only to satiate
> > LKML), we consider
> > very important. Allowing one tomapping of the kernels view of the
> > PMD's, allowing
> > user-space access to full 64-bit counts, if the architecture
> > supports a user-level read instruction.
> 
> You mean returning the register number for RDPMC or equivalent
> and a way to enable it for ring 3 access? 
> 
> I'm considering that an essential feature too. I wasn't aware
> it was dropped.
> 
> > Getting the counts in a
> > couple of dozen cycles
> > is ALWAYS a win for us.
> 
> Yes it is for everybody. I've been rather questioning if the slow
> ways (complicated syscalls) to get the counter information are really 
> needed.

I would like to add sparc64 support to perfmon2 as well
and therefore I've been considering this angle of the
API issues as well.

The counters on sparc64 can be configured to be readable by userspace,
so for the self-monitoring cases I really would like to make sure the
perfmon2 library interface could use direct reads for sampling instead
of system calls or specialized traps.

If I get some spare time I'll look at the current perfmon2 patches
and see if I can toss together sparc64 support to get a feel for
how things stand currently.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-16 17:51                             ` dean gaudet
@ 2007-11-17  0:29                               ` David Miller
  2007-11-17  1:07                                 ` Greg KH
  0 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-17  0:29 UTC (permalink / raw)
  To: dean
  Cc: andi, mucci, akpm, gregkh, eranian, wcohen, robert.richter,
	linux-kernel, ptools-perfapi

From: dean gaudet <dean@arctic.org>
Date: Fri, 16 Nov 2007 09:51:08 -0800 (PST)

> On Fri, 16 Nov 2007, Andi Kleen wrote:
> 
> > I didn't see a clear list. 
> 
> - cross platform extensible API for configuring perf counters
> - support for multiplexed counters
> - support for virtualized 64-bit counters
> - support for PC and call graph sampling at specific intervals
> - support for reading counters not necessarily with sampling
> - taskswitch support for counters
> - API available from userland
> - ability to self-monitor: need select/poll/etc interface
> - support for PEBS, IBS and whatever other new perf monitoring 
>   infrastructure the vendors through at us in the future
> - low overhead:  must minimize the "probe effect" of monitoring
> - low noise in measurements:  cannot achieve this in userland
> 
> permon2 has all of this and more i've probably neglected...

I want to state that even though I've been a stickler on the system
call stuff, in general I want to see perfmon2 go into tree and I agree
with how most of the infrastructure is implemented and the features it
provides.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
  2007-11-17  0:29                               ` David Miller
@ 2007-11-17  1:07                                 ` Greg KH
  0 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-17  1:07 UTC (permalink / raw)
  To: David Miller
  Cc: dean, andi, mucci, akpm, eranian, wcohen, robert.richter,
	linux-kernel, ptools-perfapi

On Fri, Nov 16, 2007 at 04:29:05PM -0800, David Miller wrote:
> From: dean gaudet <dean@arctic.org>
> Date: Fri, 16 Nov 2007 09:51:08 -0800 (PST)
> 
> > On Fri, 16 Nov 2007, Andi Kleen wrote:
> > 
> > > I didn't see a clear list. 
> > 
> > - cross platform extensible API for configuring perf counters
> > - support for multiplexed counters
> > - support for virtualized 64-bit counters
> > - support for PC and call graph sampling at specific intervals
> > - support for reading counters not necessarily with sampling
> > - taskswitch support for counters
> > - API available from userland
> > - ability to self-monitor: need select/poll/etc interface
> > - support for PEBS, IBS and whatever other new perf monitoring 
> >   infrastructure the vendors through at us in the future
> > - low overhead:  must minimize the "probe effect" of monitoring
> > - low noise in measurements:  cannot achieve this in userland
> > 
> > permon2 has all of this and more i've probably neglected...
> 
> I want to state that even though I've been a stickler on the system
> call stuff, in general I want to see perfmon2 go into tree and I agree
> with how most of the infrastructure is implemented and the features it
> provides.

Now if we only had a series of patches that we could actually review and
apply to the -mm tree so that people can try them out... :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
       [not found]                             ` <1d7226b10711161713j675341b7wdb4f050c59a8be0a@mail.gmail.com>
@ 2007-11-17  1:25                               ` Greg KH
       [not found]                                 ` <1d7226b10711161748n39b7f195q796d85282ef66134@mail.gmail.com>
  0 siblings, 1 reply; 116+ messages in thread
From: Greg KH @ 2007-11-17  1:25 UTC (permalink / raw)
  To: Patrick DEMICHEL
  Cc: linux-kernel, Philip Mucci, Andrew Morton, Stephane Eranian,
	William Cohen, Robert Richter, papi list, Andi Kleen

On Sat, Nov 17, 2007 at 02:13:13AM +0100, Patrick DEMICHEL wrote:
> Yet another noisy linux HPC user
> 
> I hope to convince you, lkml developers, to pay more attention to our HPC
> performance problems.

We do pay attention, and want to help out, we just need either bug
reports of problems that we can work to address, or patches in a
reviewable state whereby we are able to review, work with, and apply to
our trees.

Please do not think we are ignoring you at all, we are glad to work with
anyone who uses the Linux kernel on whatever platform as we well know
this allows us to create a kernel that works even better for everyone.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: perfmon2 merge news
       [not found]                                 ` <1d7226b10711161748n39b7f195q796d85282ef66134@mail.gmail.com>
@ 2007-11-17  2:13                                   ` Greg KH
  0 siblings, 0 replies; 116+ messages in thread
From: Greg KH @ 2007-11-17  2:13 UTC (permalink / raw)
  To: Patrick DEMICHEL
  Cc: linux-kernel, Philip Mucci, Andrew Morton, Stephane Eranian,
	William Cohen, Robert Richter, papi list, Andi Kleen

On Sat, Nov 17, 2007 at 02:48:45AM +0100, Patrick DEMICHEL wrote:
> Thanks Greg,
> 
>    but for external people it seems there is lot of people with opposite
> opinions, for sure some are valid and they can be focused on different
> things. But for example this critical topic seems quite not under control.
> And we don't like that.
>    At least not under the control of Stephane, whatever the efforts if could
> generate, we have the feeling we will never have something serious to us.
>    Also I never see a clear statement after long exchanges on what is the
> accepted final common view of some topic like that
>    Maybe there is never definitive position, but a resume could be
> interesting to make some reference point
> 
>    What I would like to see is:

<snip>

Heh, no, code is our currency here, it's the center of everything that
we do and work with.  Agreements, deadlines and plans are just not
relevant at all here, sorry.

So again, post the code, in reviewable patches, and then let's talk.  A
number of developers have expressed a concrete interest in getting this
kind of feature into the kernel tree, so show us the code so that we can
move forward.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-15  1:27                                               ` David Miller
  2007-11-15  2:34                                                 ` Paul Mackerras
@ 2007-11-19 13:08                                                 ` David Miller
  2007-11-19 20:53                                                   ` Stephane Eranian
  2007-11-19 21:43                                                   ` Paul Mackerras
  1 sibling, 2 replies; 116+ messages in thread
From: David Miller @ 2007-11-19 13:08 UTC (permalink / raw)
  To: paulus
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi


Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written.  The syscalls
aren't that bad and really I see not reason to block it's inclusion.

I rescind all of my earlier objections, let's merge this soon :-)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-19 13:08                                                 ` David Miller
@ 2007-11-19 20:53                                                   ` Stephane Eranian
  2007-11-20  0:55                                                     ` David Miller
  2007-11-19 21:43                                                   ` Paul Mackerras
  1 sibling, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-19 20:53 UTC (permalink / raw)
  To: David Miller
  Cc: paulus, hch, akpm, gregkh, mucci, wcohen, robert.richter,
	linux-kernel, andi, Stephane Eranian

David,

On Mon, Nov 19, 2007 at 05:08:43AM -0800, David Miller wrote:
> 
> Instead of blabbering further about this topic, I decided to put my
> code where my mouth is and spent the weekend porting the perfmon2
> kernel bits, and the user bits (libpfm and pfmon) to sparc64.
> 

I appreciate your effort. I am glad to see that the interface
and implementation survived yet another architecture. I think at this
point ARM is the only major architecture missing. In anycase, I would
be happy to integrate your sparc64 patches.

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written.  The syscalls
> aren't that bad and really I see not reason to block it's inclusion.
> 

As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vectors
arguments are problematic, we can go with single register/call.

I think there are other areas where perfmon2 could benefit from the
help of the LKML developers. I will post a list shortly.

> I rescind all of my earlier objections, let's merge this soon :-)

Thanks.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-19 13:08                                                 ` David Miller
  2007-11-19 20:53                                                   ` Stephane Eranian
@ 2007-11-19 21:43                                                   ` Paul Mackerras
  2007-11-19 22:48                                                     ` Stephane Eranian
  1 sibling, 1 reply; 116+ messages in thread
From: Paul Mackerras @ 2007-11-19 21:43 UTC (permalink / raw)
  To: David Miller
  Cc: hch, akpm, gregkh, mucci, eranian, wcohen, robert.richter,
	linux-kernel, andi

David Miller writes:

> As a result I've found that perfmon2 is quite nice and allows
> incredibly useful and powerful tools to be written.  The syscalls
> aren't that bad and really I see not reason to block it's inclusion.
> 
> I rescind all of my earlier objections, let's merge this soon :-)

Strongly agree.  However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32.  On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it.  At the least you also need to XOR the bit numbers
with 32 on those platforms.  Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).

Paul.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-19 21:43                                                   ` Paul Mackerras
@ 2007-11-19 22:48                                                     ` Stephane Eranian
  2007-11-20  0:53                                                       ` David Miller
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-11-19 22:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David Miller, hch, akpm, gregkh, mucci, wcohen, robert.richter,
	linux-kernel, andi, Stephane Eranian

Paul,

On Tue, Nov 20, 2007 at 08:43:32AM +1100, Paul Mackerras wrote:
> David Miller writes:
> 
> > As a result I've found that perfmon2 is quite nice and allows
> > incredibly useful and powerful tools to be written.  The syscalls
> > aren't that bad and really I see not reason to block it's inclusion.
> > 
> > I rescind all of my earlier objections, let's merge this soon :-)
> 
> Strongly agree.  However, I think we need to add structure size
> arguments to most of the syscalls so we can extend them later.
> 
Yes, that is one way. It works well if you only extend structures at the end.
Given that you need to obtain the file descriptor first via a pfm_create_context
call, an alternative could be that you pass a version number to that call to
identify the version the application is requesting.

> Also, something I've been meaning to mention to Stephane is that the
> use of the cast_ulp() macro in perfmon is bogus and won't work on
> 32-bit big-endian platforms such as ppc32 and sparc32.  On such

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly with the big-endian issue, we need
to find something else. The bitmap*() macros make unsigned long *.

The interface uses fixed size type to ensure ABI compatibility between
32 and 64 bit modes. This way there is no need to marhsall syscall arguments
for a 32-bit app running on a 64-bit host.

Looks like we will have to use bytes (u8) instead.  This may have some
performance impact as well. Several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() macros.
Now, only a small subset of the bitmap() macros are used, so it may be okay
to duplicate them for u8.

What do you think?

> platforms you can't take a pointer to an array of u64, cast it to
> unsigned long * and expect the kernel bitmap operations to work
> correctly on it.  At the least you also need to XOR the bit numbers
> with 32 on those platforms.  Another alternative is to define the
> bitmaps as arrays of bytes instead, which eliminates all byte ordering
> and wordsize problems (but makes it more tricky to use the kernel
> bitmap functions directly).
> 

-- 

-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-19 22:48                                                     ` Stephane Eranian
@ 2007-11-20  0:53                                                       ` David Miller
  2007-12-13 16:00                                                         ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: David Miller @ 2007-11-20  0:53 UTC (permalink / raw)
  To: eranian
  Cc: paulus, hch, akpm, gregkh, mucci, wcohen, robert.richter,
	linux-kernel, andi

From: Stephane Eranian <eranian@hpl.hp.com>
Date: Mon, 19 Nov 2007 14:48:46 -0800

> Looks like we will have to use bytes (u8) instead.  This may have some
> performance impact as well. Several bitmaps are used in the context/interrupt
> routines. Even with u8, there is still a problem with the bitmap*() macros.
> Now, only a small subset of the bitmap() macros are used, so it may be okay
> to duplicate them for u8.

I think it would be fine to just create a set of bitop interfaces that
operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and those
could thus be provided entirely under include/asm-generic/bitops/

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon] Re: [perfmon2] perfmon2 merge news
  2007-11-19 20:53                                                   ` Stephane Eranian
@ 2007-11-20  0:55                                                     ` David Miller
  0 siblings, 0 replies; 116+ messages in thread
From: David Miller @ 2007-11-20  0:55 UTC (permalink / raw)
  To: eranian
  Cc: paulus, hch, akpm, gregkh, mucci, wcohen, robert.richter,
	linux-kernel, andi

From: Stephane Eranian <eranian@hpl.hp.com>
Date: Mon, 19 Nov 2007 12:53:30 -0800

> In anycase, I would be happy to integrate your sparc64 patches.

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-11-20  0:53                                                       ` David Miller
@ 2007-12-13 16:00                                                         ` Stephane Eranian
  2007-12-14 19:12                                                           ` Frank Ch. Eigler
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-12-13 16:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: davem, paulus, akpm, gregkh, mucci, wcohen, robert.richter, andi,
	eranian, Stephane Eranian

Hello,

A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.

Here is one that I would like to solve right now and
for which I am interested in your comments.

One of the perfmon syscall (pfm_restart()) is used to
resume monitoring after a user level notification. When
 operating in per-thread non self-monitoring mode, the
syscall needs to operate on the machine state of the
monitored thread. So you get into this situation:


        Thread T0                        Thread T1
            |                                |
       pfm_restart()                         |
            |                                |
    spin_lock_irqsave()                      |
            |                                |
  <modify T1's machine state>--------------->|
            |                                |
    spin_unlock_irqrestore()                 |
            |                                |
            v                                v

Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.

But there is a big difference between self-monitoring and non
self-monitoring. The pfm_restart() syscall does not provide the
same guarantee.

In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. Whereas when
monitoring another thread, the call currently does not provide such
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a semaphore
to enforce that guarantee but it gets difficult with corner cases and
cleanups in case of unpexected termination.

AFAIK, there is no single call to stop T1 and wait until it is completely
off the CPU, unless we go through the (internal) ptrace interface. 

Would you have anything better to suggest?

Thanks.

--
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-12-13 16:00                                                         ` Stephane Eranian
@ 2007-12-14 19:12                                                           ` Frank Ch. Eigler
  2007-12-14 21:07                                                             ` Stephane Eranian
  0 siblings, 1 reply; 116+ messages in thread
From: Frank Ch. Eigler @ 2007-12-14 19:12 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, davem, paulus, akpm, gregkh, mucci, wcohen,
	robert.richter, andi, eranian, roland


Stephane Eranian <eranian@hpl.hp.com> writes:

> [...]  AFAIK, there is no single call to stop T1 and wait until it
> is completely off the CPU, unless we go through the (internal)
> ptrace interface.

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-12-14 19:12                                                           ` Frank Ch. Eigler
@ 2007-12-14 21:07                                                             ` Stephane Eranian
  2007-12-15 15:54                                                               ` Frank Ch. Eigler
  0 siblings, 1 reply; 116+ messages in thread
From: Stephane Eranian @ 2007-12-14 21:07 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: linux-kernel, davem, paulus, akpm, gregkh, mucci, wcohen,
	robert.richter, andi, eranian, roland

Charles,

On Fri, Dec 14, 2007 at 02:12:17PM -0500, Frank Ch. Eigler wrote:
> 
> Stephane Eranian <eranian@hpl.hp.com> writes:
> 
> > [...]  AFAIK, there is no single call to stop T1 and wait until it
> > is completely off the CPU, unless we go through the (internal)
> > ptrace interface.
> 
> The utrace code supports this style of thread manipulation better
> than ptrace.

Afre you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU. And then there
is a utrace_thread_resume(tid) call. If that's the case then that is
what I need.

How are we with regards to utrace integration?

Thanks.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [perfmon2] perfmon2 merge news
  2007-12-14 21:07                                                             ` Stephane Eranian
@ 2007-12-15 15:54                                                               ` Frank Ch. Eigler
  0 siblings, 0 replies; 116+ messages in thread
From: Frank Ch. Eigler @ 2007-12-15 15:54 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, davem, paulus, akpm, gregkh, mucci, wcohen,
	robert.richter, andi, eranian, roland

Stephane Eranian <eranian@hpl.hp.com> writes:

> [...]
>> > [...]  AFAIK, there is no single call to stop T1 and wait until it
>> > is completely off the CPU, unless we go through the (internal)
>> > ptrace interface.
>> 
>> The utrace code supports this style of thread manipulation better
>> than ptrace.
>
> Afre you saying that utrace provides a utrace_thread_stop(tid) call
> that returns only when the thread tid is off the CPU. And then there
> is a utrace_thread_resume(tid) call. If that's the case then that is
> what I need.

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...),
then waiting for a callback.  Roland, is there a more compact way?

> How are we with regards to utrace integration?

Roland McGrath is working on breaking the patches down.

- FChE

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2007-12-15 15:59 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-11-07  0:34 [PATCH] fix up perfmon to build on -mm Greg KH
2007-11-07 10:34 ` Stephane Eranian
2007-11-07 17:07   ` Greg KH
2007-11-07 13:42 ` Stephane Eranian
2007-11-07 17:08   ` Greg KH
2007-11-07 17:33     ` Andrew Morton
2007-11-07 17:41       ` Greg KH
2007-11-07 17:50     ` Stephane Eranian
2007-11-07 17:47   ` Greg KH
2007-11-07 17:57     ` Stephane Eranian
2007-11-07 19:53       ` Greg KH
2007-11-07 20:39         ` Stephane Eranian
2007-11-08 15:27         ` Stephane Eranian
2007-11-09 20:06 ` Andrew Morton
2007-11-09 21:38   ` Greg KH
2007-11-10 20:32     ` Andi Kleen
2007-11-13 15:17       ` perfmon2 merge news Robert Richter
2007-11-13 15:35         ` [perfmon2] " William Cohen
2007-11-13 17:55           ` Stephane Eranian
2007-11-13 18:33             ` [perfmon] " William Cohen
2007-11-13 21:13               ` Stephane Eranian
2007-11-13 21:29                 ` Andi Kleen
2007-11-13 21:46                   ` Stephane Eranian
2007-11-13 21:50                     ` Andi Kleen
2007-11-13 22:22                       ` Stephane Eranian
2007-11-13 22:25                         ` Andi Kleen
2007-11-13 22:58                           ` Stephane Eranian
2007-11-14  2:07                             ` Andi Kleen
2007-11-14 13:09                               ` Stephane Eranian
2007-11-14 14:24                                 ` Andi Kleen
2007-11-14 15:44                                   ` William Cohen
2007-11-14 16:13                                     ` Stephane Eranian
2007-11-14 18:53                                     ` Philippe Elie
2007-11-14 19:15                                       ` Andi Kleen
2007-11-15  0:07                                   ` Stephane Eranian
2007-11-13 18:47             ` Philip Mucci
2007-11-13 18:59               ` Greg KH
2007-11-13 20:07                 ` Andrew Morton
2007-11-13 20:14                   ` Greg KH
2007-11-13 20:36                   ` Andi Kleen
2007-11-14  0:28                     ` Philip Mucci
2007-11-14  1:52                       ` Andi Kleen
2007-11-16  9:18                         ` Philip Mucci
2007-11-16 15:15                           ` Andi Kleen
2007-11-16 16:00                             ` Stephane Eranian
2007-11-16 16:28                               ` Andi Kleen
2007-11-16 17:13                                 ` William Cohen
2007-11-16 21:56                                   ` Stephane Eranian
2007-11-16 17:36                                 ` Stephane Eranian
2007-11-16 17:51                             ` dean gaudet
2007-11-17  0:29                               ` David Miller
2007-11-17  1:07                                 ` Greg KH
2007-11-16 20:16                             ` Philip Mucci
2007-11-17  0:15                             ` David Miller
     [not found]                             ` <1d7226b10711161713j675341b7wdb4f050c59a8be0a@mail.gmail.com>
2007-11-17  1:25                               ` Greg KH
     [not found]                                 ` <1d7226b10711161748n39b7f195q796d85282ef66134@mail.gmail.com>
2007-11-17  2:13                                   ` Greg KH
2007-11-14  7:24                   ` [perfmon] Re: [perfmon2] " Paul Mackerras
2007-11-14  7:40                     ` Andrew Morton
2007-11-14 10:38                     ` Christoph Hellwig
2007-11-14 10:43                       ` Paul Mackerras
2007-11-14 11:00                         ` Christoph Hellwig
2007-11-14 11:12                           ` David Miller
2007-11-14 11:14                             ` David Miller
2007-11-14 11:44                             ` Paul Mackerras
2007-11-13 23:49                               ` Nick Piggin
2007-11-14 11:58                                 ` David Miller
2007-11-14  0:25                                   ` Nick Piggin
2007-11-14 21:30                                     ` Paul Mackerras
2007-11-14 10:17                                       ` Nick Piggin
2007-11-14 22:56                                         ` Chuck Ebbert
2007-11-14 11:03                                           ` Nick Piggin
2007-11-14 11:52                               ` David Miller
2007-11-14 12:03                                 ` Paul Mackerras
2007-11-14 12:07                                   ` David Miller
2007-11-14  0:28                                     ` Nick Piggin
2007-11-14 21:50                                     ` Paul Mackerras
2007-11-14 23:03                                       ` David Miller
2007-11-14 23:12                                         ` Paul Mackerras
2007-11-14 23:21                                           ` David Miller
2007-11-15  1:11                                             ` Paul Mackerras
2007-11-15  1:27                                               ` David Miller
2007-11-15  2:34                                                 ` Paul Mackerras
2007-11-15  7:48                                                   ` Herbert Xu
2007-11-15  8:19                                                     ` Andi Kleen
2007-11-19 13:08                                                 ` David Miller
2007-11-19 20:53                                                   ` Stephane Eranian
2007-11-20  0:55                                                     ` David Miller
2007-11-19 21:43                                                   ` Paul Mackerras
2007-11-19 22:48                                                     ` Stephane Eranian
2007-11-20  0:53                                                       ` David Miller
2007-12-13 16:00                                                         ` Stephane Eranian
2007-12-14 19:12                                                           ` Frank Ch. Eigler
2007-12-14 21:07                                                             ` Stephane Eranian
2007-12-15 15:54                                                               ` Frank Ch. Eigler
2007-11-15  8:29                                               ` [perfmon] " Stephane Eranian
2007-11-14 13:51                               ` Stephane Eranian
2007-11-14 11:39                           ` Paul Mackerras
2007-11-14 11:52                             ` David Miller
2007-11-14 13:47                             ` Stephane Eranian
2007-11-14 12:38                           ` Andi Kleen
2007-11-14 14:13                             ` Stephane Eranian
2007-11-14 14:26                               ` Andi Kleen
2007-11-15  0:23                                 ` Paul Mackerras
2007-11-14 19:48                             ` David Miller
2007-11-15  4:20                             ` dean gaudet
2007-11-15  4:47                               ` Paul Mackerras
2007-11-15  5:14                                 ` dean gaudet
2007-11-15  8:53                               ` Stephane Eranian
2007-11-15 17:01                               ` [perfmon2] [perfmon] " Dan Terpstra
2007-11-13 21:33                 ` [perfmon] Re: [perfmon2] " Stephane Eranian
2007-11-13 21:45                   ` Greg KH
2007-11-13 22:27               ` Christoph Hellwig
2007-11-13 20:42           ` Andi Kleen
2007-11-13 18:32         ` Stephane Eranian
2007-11-13 22:29           ` Christoph Hellwig
2007-11-16 18:25           ` PMC core internal API design Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).