* [PATCH] sysctl: Document that sys_sysctl will be removed.
@ 2006-07-10 22:39 Eric W. Biederman
  2006-07-10 22:50 ` Randy.Dunlap
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-10 22:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/feature-removal-schedule.txt |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index e978943..bef1bf0 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -250,3 +250,14 @@ Why:	These drivers never compiled since 
 Who:	Jean Delvare <khali@linux-fr.org>
 
 ---------------------------
+
+What:	sys_sysctl
+When:	January 2007
+Why:	The same information is available through /proc/sys and that is the
+	interface user space prefers to use. And there do not appear to be
+	any existing user in user space of sys_sysctl.  The additional
+	maintenance overhead of keeping a set of binary names gets
+	in the way of doing a good job of maintaining this interface.
+
+Who:	Eric Biederman <ebiederm@xmission.com>
+
-- 
1.4.1.gac83a



* Re: [PATCH] sysctl: Document that sys_sysctl will be removed.
  2006-07-10 22:39 [PATCH] sysctl: Document that sys_sysctl will be removed Eric W. Biederman
@ 2006-07-10 22:50 ` Randy.Dunlap
  2006-07-11  4:10   ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: Randy.Dunlap @ 2006-07-10 22:50 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: akpm, linux-kernel

On Mon, 10 Jul 2006 16:39:47 -0600 Eric W. Biederman wrote:

> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  Documentation/feature-removal-schedule.txt |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
> index e978943..bef1bf0 100644
> --- a/Documentation/feature-removal-schedule.txt
> +++ b/Documentation/feature-removal-schedule.txt
> @@ -250,3 +250,14 @@ Why:	These drivers never compiled since 
>  Who:	Jean Delvare <khali@linux-fr.org>
>  
>  ---------------------------
> +
> +What:	sys_sysctl
> +When:	January 2007
> +Why:	The same information is available through /proc/sys and that is the
> +	interface user space prefers to use. And there do not appear to be
> +	any existing user in user space of sys_sysctl.  The additional
> +	maintenance overhead of keeping a set of binary names gets
> +	in the way of doing a good job of maintaining this interface.
> +
> +Who:	Eric Biederman <ebiederm@xmission.com>


aha, patch 1/2 and patch 2/2 would have helped that.  :)

---
~Randy


* Re: [PATCH] sysctl: Document that sys_sysctl will be removed.
  2006-07-10 22:50 ` Randy.Dunlap
@ 2006-07-11  4:10   ` Eric W. Biederman
  2006-07-11  7:07     ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-11  4:10 UTC (permalink / raw)
  To: Randy.Dunlap; +Cc: akpm, linux-kernel

"Randy.Dunlap" <rdunlap@xenotime.net> writes:

>
> aha, patch 1/2 and patch 2/2 would have helped that.  :)

Sorry.  I finally have found the original deprecation commit.

> commit 073cd7b5515a7f5b74dbb4917c717e3c390013e7
> Author: ak <ak>
> Date:   Sat Jul 12 16:45:55 2003 +0000
> 
>     [PATCH] Deprecate numerical sysctl
>     
>     Deprecate the numerical sysctl name space. People can use /proc/sys
>     instead.
>     
>     The numeric name space was never well maintained and especially
>     in distribution kernels is not very consistent (everybody has their
>     own extensions, conflicting with others). It's also a great
>     source of rejects when merging patches.  The name-based /proc/sys
>     is a much better interface for this, which people should use instead.
>     
>     Discussion of this on l-k found no advocate for it, so it seems to not
>     be very popular anyways.
>     
>     This patch deprecates numerical name space accesses to make it possible
>     to remove them in the future. The only exception is kernel.version,
>     which is used by glibc (this one has to be maintained forever)
>     
>     BKrev: 3f103b43JQH2fwSWpRLoTKziIiqH1w

The comment about kernel.version is odd. That information is available in
uname so I can't imagine why sys_sysctl would be an interesting source.
Also kernel.version is the compile string so it is pretty uninteresting
to glibc.
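
For comparison, a minimal sketch of the name-based route: reading one
entry under /proc/sys by its path.  The helper name read_proc_sys and
its buffer handling are assumptions made for illustration, not code
from glibc or the kernel tree.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the sysctl entry "name" (for example
 * "kernel/version") through /proc/sys into buf.  Returns the number of
 * bytes stored, not counting the stripped trailing newline, or -1 on
 * any error.
 */
static long read_proc_sys(const char *name, char *buf, size_t len)
{
	char path[256];
	ssize_t n;
	int fd, r;

	if (len == 0)
		return -1;
	r = snprintf(path, sizeof(path), "/proc/sys/%s", name);
	if (r < 0 || (size_t)r >= sizeof(path))
		return -1;	/* name too long for the path buffer */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;	/* no /proc, or no such entry */
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;
	/* The proc file ends its value with a newline; drop it. */
	if (n > 0 && buf[n - 1] == '\n')
		n--;
	buf[n] = '\0';
	return (long)n;
}
```

On a system with /proc mounted, read_proc_sys("kernel/version", ...)
should hand back the same compile string that uname reports in its
version field.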

I guess if it is really needed someone will scream before the code gets
deleted completely.

Eric


* Re: [PATCH] sysctl: Document that sys_sysctl will be removed.
  2006-07-11  4:10   ` Eric W. Biederman
@ 2006-07-11  7:07     ` Arjan van de Ven
  2006-07-12 16:25       ` [PATCH] Use uname not sysctl to get the kernel revision Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2006-07-11  7:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Randy.Dunlap, akpm, linux-kernel


> The comment about kernel.version is odd. That information is available in
> uname so I can't imagine why sys_sysctl would be an interesting source.

glibc used it (past tense); sometimes it's better to not ask why ;)




* [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-11  7:07     ` Arjan van de Ven
@ 2006-07-12 16:25       ` Eric W. Biederman
  2006-07-12 16:50         ` Ulrich Drepper
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-12 16:25 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha


Currently it is felt by at least a subset of the kernel maintainers
that the binary sysctl interface is not maintainable, and the
/proc/sys interface should be used instead.  In investigating this it
turns out that the pthread code in glibc for detecting a SMP kernel
appears to be the primary user.

The information that we are asking for is available from the uname
system call so I don't understand why the code is using sysctl.

To understand the cost of the various approaches I put together
a little test program.  Using time for timing and running 100000
repetitions of the various system calls I get about
sysctl: 0.3s to 0.2s
uname:  0.1s to 0.07s
proc:   7.5s to 4.1s

proc is significantly slower which puzzles me.
But uname is noticeably faster than sysctl and uname is more portable
across linux flavors.  So updating the glibc pthread code to use
uname looks like the right way to implement is_smp_system. 

I do think detecting a SMP kernel to enable busy waiting on contended
mutexes is a very peculiar thing to be doing.  

My performance test program:
> #include <string.h>
> #include <stdio.h>
> #include <sys/utsname.h>
> #include <errno.h>
> #include <stdarg.h>
> #include <stdlib.h>
> #include <sys/sysctl.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> static void uname_test(void)
> {
> 	struct utsname uts;
> 	uname(&uts);
> }
> 
> static void proc_test(void)
> {
> 	int fd;
> 	char buf[512];
> 	fd = open("/proc/sys/kernel/version", O_RDONLY);
> 	if (fd < 0)
> 		return;	/* /proc not mounted or entry missing */
> 	read(fd, buf, sizeof(buf));
> 	close(fd);
> }
> 
> static void sysctl_test(void)
> {
> 	static int sysctl_args[] = { CTL_KERN, KERN_VERSION };
> 	char buf[512];
> 	size_t reslen = sizeof(buf);
> 
> 	sysctl(sysctl_args, sizeof(sysctl_args)/sizeof(sysctl_args[0]),
> 		buf, &reslen, NULL, 0);
> }
> 
> int main(int argc, char *argv[])
> {
> 	void (*test)(void) = NULL;
> 	int reps = -1;
> 	int i;
> 
> 	for (i = 1; i < argc; i++) {
> 		if (strcmp(argv[i], "--sysctl") == 0)
> 			test = sysctl_test;
> 		else if (strcmp(argv[i], "--uname") == 0)
> 			test = uname_test;
> 		else if (strcmp(argv[i], "--proc") == 0)
> 			test = proc_test;
> 		else 
> 			reps = atol(argv[i]);
> 	}
> 	if ((reps == -1) || (test == NULL)) {
> 		fprintf(stderr, "usage: [--sysctl | --uname | --proc] <reps>\n");
> 		return 1;
> 	}
> 
> 	for (i = 0; i < reps; i++) {
> 		test();
> 	}
> 	return 0;
> }


My patch to use uname instead of proc or sysctl to get the 

--- glibc-2.4/nptl/sysdeps/unix/sysv/linux/smp.h-sysctl	2006-07-12 08:48:44.000000000 -0600
+++ glibc-2.4/nptl/sysdeps/unix/sysv/linux/smp.h	2006-07-12 09:57:07.000000000 -0600
@@ -17,11 +17,8 @@
    write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
    Boston, MA 02111-1307, USA.  */
 
-#include <errno.h>
-#include <fcntl.h>
 #include <string.h>
-#include <sys/sysctl.h>
-#include <not-cancel.h>
+#include <sys/utsname.h>
 
 /* Test whether the machine has more than one processor.  This is not the
    best test but good enough.  More complicated tests would require `malloc'
@@ -29,24 +26,8 @@
 static inline int
 is_smp_system (void)
 {
-  static const int sysctl_args[] = { CTL_KERN, KERN_VERSION };
-  char buf[512];
-  size_t reslen = sizeof (buf);
-
-  /* Try reading the number using `sysctl' first.  */
-  if (__sysctl ((int *) sysctl_args,
-		sizeof (sysctl_args) / sizeof (sysctl_args[0]),
-		buf, &reslen, NULL, 0) < 0)
-    {
-      /* This was not successful.  Now try reading the /proc filesystem.  */
-      int fd = open_not_cancel_2 ("/proc/sys/kernel/version", O_RDONLY);
-      if (__builtin_expect (fd, 0) == -1
-	  || (reslen = read_not_cancel (fd, buf, sizeof (buf))) <= 0)
-	/* This also didn't work.  We give up and say it's a UP machine.  */
-	buf[0] = '\0';
-
-      close_not_cancel_no_status (fd);
-    }
-
-  return strstr (buf, "SMP") != NULL;
+  struct utsname uts;
+  if (uname(&uts) < 0)
+	  uts.version[0] = '\0';
+  return strstr (uts.version, "SMP") != NULL;
 }
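
For trying the idea outside of the glibc tree, a minimal standalone
sketch of the same check.  Splitting out version_has_smp is an
assumption made here so the string test can be exercised on its own;
it is not part of the patch.

```c
#include <string.h>
#include <sys/utsname.h>

/* Return 1 if a kernel version string carries the "SMP" tag, else 0. */
static int version_has_smp(const char *version)
{
	return strstr(version, "SMP") != NULL;
}

/*
 * Standalone variant of is_smp_system: ask uname() for the version
 * string and report a UP machine if the call fails.
 */
static int is_smp_system(void)
{
	struct utsname uts;

	if (uname(&uts) < 0)
		return 0;
	return version_has_smp(uts.version);
}
```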


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 16:25       ` [PATCH] Use uname not sysctl to get the kernel revision Eric W. Biederman
@ 2006-07-12 16:50         ` Ulrich Drepper
  2006-07-12 17:42           ` Eric W. Biederman
  2006-07-12 18:44           ` Roland McGrath
  0 siblings, 2 replies; 47+ messages in thread
From: Ulrich Drepper @ 2006-07-12 16:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha


Eric W. Biederman wrote:
> But uname is noticeably faster than sysctl and uname is more portable
> across linux flavors.  So updating the glibc pthread code to use
> uname looks like the right way to implement is_smp_system. 

This is (was?) not the universal truth.  We used uname at some point
but then I did some profiling and sysctl turned out to be faster.

If the reverse is true now I can certainly look into changing this but
the evidence and ideally has to be there.  The simplicity of the uname
code should mean that it's faster.

In a year or two I'll remove the test anyway.  By then there will likely
not be any UP kernels on reasonable machines anymore and I can drop all
the conditional code.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 16:50         ` Ulrich Drepper
@ 2006-07-12 17:42           ` Eric W. Biederman
  2006-07-12 23:24             ` Theodore Tso
  2006-07-12 18:44           ` Roland McGrath
  1 sibling, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-12 17:42 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha,
	Andi Kleen

Ulrich Drepper <drepper@redhat.com> writes:

> Eric W. Biederman wrote:
>> But uname is noticeably faster than sysctl and uname is more portable
>> across linux flavors.  So updating the glibc pthread code to use
>> uname looks like the right way to implement is_smp_system. 
>
> This is (was?) not the universal truth.  We used uname at some point
> but then I did some profiling and sysctl turned out to be faster.

I tracked the code back as far as I could, to about 2000 in pthread.c
when the code was introduced, and it has always used sys_sysctl.

> If the reverse is true now I can certainly look into changing this but
> the evidence and ideally has to be there.  The simplicity of the uname
> code should mean that it's faster.

The evidence and ideally what has to be there?

> In a year or two I'll remove the test anyway.  By then there will likely
> not be any UP kernels on reasonable machines anymore and I can drop all
> the conditional code.

Well, there are embedded targets, but I guess uClibc takes care of them.

Unless a darn good reason for keeping it is found, sys_sysctl won't be
in the kernel several months from now.  And uname is faster by a large
margin than /proc.

Right now, because there has been a deprecation note in
"include/linux/sysctl.h" since 2003, people feel fine with letting
the sys_sysctl code bit-rot.  I am trying to resolve that situation,
most likely by just updating the few stray pieces of user space that
care and then cutting out that chunk of kernel code.

Eric




* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 16:50         ` Ulrich Drepper
  2006-07-12 17:42           ` Eric W. Biederman
@ 2006-07-12 18:44           ` Roland McGrath
  2006-07-12 19:33             ` Ulrich Drepper
  1 sibling, 1 reply; 47+ messages in thread
From: Roland McGrath @ 2006-07-12 18:44 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric W. Biederman, Arjan van de Ven, Randy.Dunlap, akpm,
	linux-kernel, libc-alpha

We could also put the uname info (modulo nodename) into the vDSO.


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 18:44           ` Roland McGrath
@ 2006-07-12 19:33             ` Ulrich Drepper
  2006-07-12 19:53               ` Jakub Jelinek
  0 siblings, 1 reply; 47+ messages in thread
From: Ulrich Drepper @ 2006-07-12 19:33 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Eric W. Biederman, Arjan van de Ven, Randy.Dunlap, akpm,
	linux-kernel, libc-alpha


Roland McGrath wrote:
> We could also put the uname info (modulo nodename) into the vDSO.

Or even better: real topology information.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 19:33             ` Ulrich Drepper
@ 2006-07-12 19:53               ` Jakub Jelinek
  2006-07-12 20:09                 ` H. Peter Anvin
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Jelinek @ 2006-07-12 19:53 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Roland McGrath, Eric W. Biederman, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
> Roland McGrath wrote:
> > We could also put the uname info (modulo nodename) into the vDSO.
> 
> Or even better: real topology information.

AND rather than OR would be even better.  So glibc could find kernel
version, etc. and topology in the vDSO cheaply.

	Jakub


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 19:53               ` Jakub Jelinek
@ 2006-07-12 20:09                 ` H. Peter Anvin
  2006-07-12 21:23                   ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-12 20:09 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, Roland McGrath, Eric W. Biederman,
	Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha

Jakub Jelinek wrote:
> On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
>> Roland McGrath wrote:
>>> We could also put the uname info (modulo nodename) into the vDSO.
>> Or even better: real topology information.
> 
> AND rather than OR would be even better.  So glibc could find kernel
> version, etc. and topology in the vDSO cheaply.

Wouldn't it make more sense for this to be in ELF tags, rather than the 
vdso?  Another alternative, I guess, would be to put a pointer in the 
ELF tags, which may point into the vdso.

	-hpa


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 20:09                 ` H. Peter Anvin
@ 2006-07-12 21:23                   ` Eric W. Biederman
  2006-07-12 21:29                     ` Arjan van de Ven
                                       ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-12 21:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jakub Jelinek, Ulrich Drepper, Roland McGrath, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

"H. Peter Anvin" <hpa@zytor.com> writes:

> Jakub Jelinek wrote:
>> On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
>>> Roland McGrath wrote:
>>>> We could also put the uname info (modulo nodename) into the vDSO.
>>> Or even better: real topology information.
>> AND rather than OR would be even better.  So glibc could find kernel
>> version, etc. and topology in the vDSO cheaply.
>
> Wouldn't it make more sense for this to be in ELF tags, rather than the vdso?
> Another alternative, I guess, would be to put a pointer in the ELF tags, which
> may point into the vdso.

Cheap and simple access to topology information would be interesting.

Glibc just wants to know if our kernel is SMP so it can know if it is
ok to busy wait for a bit waiting for a mutex.  Or if busy waiting is
a complete loss.

The practical challenge is that topology information is not fixed but
potentially varies at runtime.
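
As an illustration of the runtime aspect, a minimal sketch that asks
how many CPUs are online at the moment of the call.  This is not the
vDSO or ELF tag approach discussed in this thread; it uses
sysconf(_SC_NPROCESSORS_ONLN), which is an extension rather than part
of POSIX, and the function name multiple_cpus_online is made up for
illustration.

```c
#include <unistd.h>

/*
 * Return 1 if more than one CPU is online right now, 0 otherwise.
 * A failing sysconf() returns -1, which is reported as "not SMP".
 * The answer can change later, for example after CPU hotplug, so a
 * caller that caches it may end up with a stale value.
 */
static int multiple_cpus_online(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	return n > 1;
}
```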

Ulrich what would be interesting besides the possibility of having
multiple cpus?

Eric


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:23                   ` Eric W. Biederman
@ 2006-07-12 21:29                     ` Arjan van de Ven
  2006-07-12 21:56                       ` Eric W. Biederman
  2006-07-12 21:29                     ` H. Peter Anvin
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2006-07-12 21:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Wed, 2006-07-12 at 15:23 -0600, Eric W. Biederman wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> 
> > Jakub Jelinek wrote:
> >> On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
> >>> Roland McGrath wrote:
> >>>> We could also put the uname info (modulo nodename) into the vDSO.
> >>> Or even better: real topology information.
> >> AND rather than OR would be even better.  So glibc could find kernel
> >> version, etc. and topology in the vDSO cheaply.
> >
> > Wouldn't it make more sense for this to be in ELF tags, rather than the vdso?
> > Another alternative, I guess, would be to put a pointer in the ELF tags, which
> > may point into the vdso.
> 
> Cheap and simple access to topology information would be interesting.
> 
> Glibc just wants to know if our kernel is SMP so it can know if it is
> ok to busy wait for a bit waiting for a mutex.  Or if busy waiting is
> a complete loss.


with current power management... busy waiting pretty much is a loss even
on UP



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:23                   ` Eric W. Biederman
  2006-07-12 21:29                     ` Arjan van de Ven
@ 2006-07-12 21:29                     ` H. Peter Anvin
  2006-07-12 21:33                     ` Michael Tokarev
  2006-07-13  5:17                     ` Ulrich Drepper
  3 siblings, 0 replies; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-12 21:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jakub Jelinek, Ulrich Drepper, Roland McGrath, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

Eric W. Biederman wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> 
>> Jakub Jelinek wrote:
>>> On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
>>>> Roland McGrath wrote:
>>>>> We could also put the uname info (modulo nodename) into the vDSO.
>>>> Or even better: real topology information.
>>> AND rather than OR would be even better.  So glibc could find kernel
>>> version, etc. and topology in the vDSO cheaply.
>> Wouldn't it make more sense for this to be in ELF tags, rather than the vdso?
>> Another alternative, I guess, would be to put a pointer in the ELF tags, which
>> may point into the vdso.
> 
> Cheap and simple access to topology information would be interesting.
> 
> Glibc just wants to know if our kernel is SMP so it can know if it is
> ok to busy wait for a bit waiting for a mutex.  Or if busy waiting is
> a complete loss.
> 
> The practical challenge is that topology information is not fixed but
> potentially varies at runtime.
> 
> Ulrich what would be interesting besides the possibility of having
> multiple cpus?
> 

Something that might make sense to ask CPU vendors for in the future: an 
instruction that can either trap or be a noop (or better, cpu_relax) 
based on a control register.

Not that that solves any problem any time soon.

	-hpa


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:23                   ` Eric W. Biederman
  2006-07-12 21:29                     ` Arjan van de Ven
  2006-07-12 21:29                     ` H. Peter Anvin
@ 2006-07-12 21:33                     ` Michael Tokarev
  2006-07-13  5:17                     ` Ulrich Drepper
  3 siblings, 0 replies; 47+ messages in thread
From: Michael Tokarev @ 2006-07-12 21:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha

Eric W. Biederman wrote:
[]
> Glibc just wants to know if our kernel is SMP so it can know if it is
> ok to busy wait for a bit waiting for a mutex.  Or if busy waiting is
> a complete loss.

BTW, with smp-alternatives thing merged, "SMP or not" may not be that
simple question anymore.

I for one stopped compiling UP and SMP kernels for x86 since 2.6.17,
because SMP kernel works just fine on UP, including benchmarks (as
opposed to SMP kernel w/o smp-alternatives).  But I don't remember
if uname shows SMP in this case or not (don't have any running UP
machine with that kernel right now).

But the thing is: smp-alternatives + cpu-hotplug changes things at
runtime...

/mjt


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:29                     ` Arjan van de Ven
@ 2006-07-12 21:56                       ` Eric W. Biederman
  2006-07-12 22:01                         ` Arjan van de Ven
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-12 21:56 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: H. Peter Anvin, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

Arjan van de Ven <arjan@infradead.org> writes:

> On Wed, 2006-07-12 at 15:23 -0600, Eric W. Biederman wrote:
>> "H. Peter Anvin" <hpa@zytor.com> writes:
>> 
>> > Jakub Jelinek wrote:
>> >> On Wed, Jul 12, 2006 at 12:33:56PM -0700, Ulrich Drepper wrote:
>> >>> Roland McGrath wrote:
>> >>>> We could also put the uname info (modulo nodename) into the vDSO.
>> >>> Or even better: real topology information.
>> >> AND rather than OR would be even better.  So glibc could find kernel
>> >> version, etc. and topology in the vDSO cheaply.
>> >
>> > Wouldn't it make more sense for this to be in ELF tags, rather than the
> vdso?
>> > Another alternative, I guess, would be to put a pointer in the ELF tags,
> which
>> > may point into the vdso.
>> 
>> Cheap and simple access to topology information would be interesting.
>> 
>> Glibc just wants to know if our kernel is SMP so it can know if it is
>> ok to busy wait for a bit waiting for a mutex.  Or if busy waiting is
>> a complete loss.
>
>
> with current power management... busy waiting pretty much is a loss even
> on UP

It is a short busy wait before falling asleep.  I assume you mean
busy wait is a loss even on SMP?

Eric



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:56                       ` Eric W. Biederman
@ 2006-07-12 22:01                         ` Arjan van de Ven
  2006-07-12 22:02                           ` H. Peter Anvin
  0 siblings, 1 reply; 47+ messages in thread
From: Arjan van de Ven @ 2006-07-12 22:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha


> It is a short busy wait before falling asleep.  I assume you mean
> busy wait is a loss even on SMP?

eh yeah I forgot to think for a second. But yes even for SMP busy wait
is pretty bad power wise nowadays.. at least if you wait more than a few
hundred cycles. (and if you wait less... then it's almost unlikely that
it'll be useful as well)




* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 22:01                         ` Arjan van de Ven
@ 2006-07-12 22:02                           ` H. Peter Anvin
  2006-07-12 22:26                             ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-12 22:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Eric W. Biederman, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

Arjan van de Ven wrote:
>> It is a short busy wait before falling asleep.  I assume you mean
>> busy wait is a loss even on SMP?
> 
> eh yeah I forgot to think for a second. But yes even for SMP busy wait
> is pretty bad power wise nowadays.. at least if you wait more than a few
> hundred cycles. (and if you wait less... then it's almost unlikely that
> it'll be useful as well)
> 

It depends greatly; if a lock is likely to get released by the user 
after a few memory accesses, spinning is likely to be a win.
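
A minimal sketch of the pattern in question, spinning briefly before
going to sleep, written against the plain pthread mutex API.  The
function name spin_then_lock and the spin count passed in are
assumptions for illustration; this is not how glibc implements its
mutexes internally.

```c
#include <pthread.h>

/*
 * Hypothetical helper: try the lock up to "spin" times without
 * sleeping, then fall back to a blocking pthread_mutex_lock().
 * Returns 0 once the mutex is held, or the error code from
 * pthread_mutex_lock().
 */
static int spin_then_lock(pthread_mutex_t *m, int spin)
{
	int i;

	for (i = 0; i < spin; i++) {
		/* Cheap attempt: succeeds if the holder just released. */
		if (pthread_mutex_trylock(m) == 0)
			return 0;
	}
	/* Still held after the spin budget: sleep until it is free. */
	return pthread_mutex_lock(m);
}
```

The spin only pays off when another CPU can release the lock while we
loop, which is why the SMP test matters to this code path at all.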

	-hpa



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 22:02                           ` H. Peter Anvin
@ 2006-07-12 22:26                             ` Eric W. Biederman
  2006-07-12 22:31                               ` H. Peter Anvin
  2006-07-12 23:07                               ` Alan Cox
  0 siblings, 2 replies; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-12 22:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arjan van de Ven, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

"H. Peter Anvin" <hpa@zytor.com> writes:

> Arjan van de Ven wrote:
>>> It is a short busy wait before falling asleep.  I assume you mean
>>> busy wait is a loss even on SMP?
>> eh yeah I forgot to think for a second. But yes even for SMP busy wait
>> is pretty bad power wise nowadays.. at least if you wait more than a few
>> hundred cycles. (and if you wait less... then it's almost unlikely that
>> it'll be useful as well)
>>
>
> It depends greatly; if a lock is likely to get released by the user after a few
> memory accesses, spinning is likely to be a win.

But this requires that the lock be short lived, and highly contended.

If the lock is not short lived then the release is like to be a long
ways off.  If the lock is not highly contended then you are not likely
to hit the window when someone else as the contended lock.

How frequent are highly contended short lived locks in user space?

Eric


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 22:26                             ` Eric W. Biederman
@ 2006-07-12 22:31                               ` H. Peter Anvin
  2006-07-12 23:07                               ` Alan Cox
  1 sibling, 0 replies; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-12 22:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Arjan van de Ven, Jakub Jelinek, Ulrich Drepper, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

Eric W. Biederman wrote:
>>>
>> It depends greatly; if a lock is likely to get released by the user after a few
>> memory accesses, spinning is likely to be a win.
> 
> But this requires that the lock be short lived, and highly contended.
> 

Correct, and incorrect, in that order.

The contention level of the lock determines how likely you are to fail 
to acquire it immediately, not how long it takes until it can be 
acquired *after you know a failure has already happened.*

> If the lock is not short lived then the release is like to be a long
> ways off.  If the lock is not highly contended then you are not likely
> to hit the window when someone else as the contended lock.

The last sentence makes no sense either grammatically or technically. 
Sorry.

> How frequent are highly contended short lived locks in user space?

Short-lived locks (which may be significantly contended) are very common 
to protect data structures.

	-hpa


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 22:26                             ` Eric W. Biederman
  2006-07-12 22:31                               ` H. Peter Anvin
@ 2006-07-12 23:07                               ` Alan Cox
  2006-07-12 23:19                                 ` H. Peter Anvin
  2006-07-14 18:45                                 ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 47+ messages in thread
From: Alan Cox @ 2006-07-12 23:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Arjan van de Ven, Jakub Jelinek, Ulrich Drepper,
	Roland McGrath, Randy.Dunlap, akpm, linux-kernel, libc-alpha

Ar Mer, 2006-07-12 am 16:26 -0600, ysgrifennodd Eric W. Biederman:
> If the lock is not short lived then the release is like to be a long
> ways off.  If the lock is not highly contended then you are not likely
> to hit the window when someone else as the contended lock.
> 
> How frequent are highly contended short lived locks in user space?

I'm not sure it matters.

If you want to do the job right then do this

- Stick an indicator of how much else wants to run on this CPU in the
vsyscall page or similar location

In your locks you can now do

              while(try_and_grab_lock() == FAILED) {
                       if (kernelpage->waiting > 0)
                              sys_somelockwaitthing()
              }

Furthermore the kernel can be intelligent about the waiting indicator
for power or other global scheduling reasons
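
The loop above can be simulated entirely in user space.  In this
minimal sketch everything kernel-side is a stand-in: struct
kernel_page and its waiting field are hypothetical (no such exported
page exists in the kernel under discussion), and sched_yield() takes
the place of sys_somelockwaitthing().

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel-exported indicator page. */
struct kernel_page {
	atomic_int waiting;	/* how much else wants to run on this CPU */
};

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

/* Return 1 if the lock was taken, 0 if somebody else holds it. */
static int try_and_grab_lock(void)
{
	return !atomic_flag_test_and_set_explicit(&lock_word,
						  memory_order_acquire);
}

static void release_lock(void)
{
	atomic_flag_clear_explicit(&lock_word, memory_order_release);
}

/*
 * Spin on the lock, but give up the CPU whenever the indicator says
 * something else is waiting to run.  Returns how often we yielded.
 */
static int grab_lock(struct kernel_page *kp)
{
	int yields = 0;

	while (!try_and_grab_lock()) {
		if (atomic_load(&kp->waiting) > 0) {
			sched_yield();	/* stand-in for the wait syscall */
			yields++;
		}
	}
	return yields;
}
```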

[Disclaimer: There is a patent issue around this technique but it's not
one that will impact GPL code as permissions are given for GPL use.]

Alan



* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:07                               ` Alan Cox
@ 2006-07-12 23:19                                 ` H. Peter Anvin
  2006-07-13 11:15                                   ` Alan Cox
  2006-07-14 18:45                                 ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-12 23:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Eric W. Biederman, Arjan van de Ven, Jakub Jelinek,
	Ulrich Drepper, Roland McGrath, Randy.Dunlap, akpm, linux-kernel,
	libc-alpha

Alan Cox wrote:
> 
> [Disclaimer: There is a patent issue around this technique but it's not
> one that will impact GPL code as permissions are given for GPL use.]
> 

glibc is (and has to be) LGPL.

Anyway, it seems absolutely insane that having a programmable threshold 
for spinning is patented...

	-hpa

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 17:42           ` Eric W. Biederman
@ 2006-07-12 23:24             ` Theodore Tso
  2006-07-12 23:31               ` Andi Kleen
                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Theodore Tso @ 2006-07-12 23:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ulrich Drepper, Arjan van de Ven, Randy.Dunlap, akpm,
	linux-kernel, libc-alpha, Andi Kleen

On Wed, Jul 12, 2006 at 11:42:47AM -0600, Eric W. Biederman wrote:
> Unless a darn good reason for keeping it is found, sys_sysctl won't be
> in the kernel several months from now.  And uname is faster by a large
> margin than /proc.

Um, if glibc is using sys_sysctl, then that's a pretty good reason.
Once we remove it from the kernel, then people will be forced to
upgrade glibc before they can install a newer kernel.  Can we please
give people some time for a version of glibc with this change to make
it out to most deployed systems, first?  It's really annoying when
it's not possible to install a stock kernel.org kernel on a system,
and often upgrading glibc is not a trivial thing to do on a
distribution userspace, especially if there is a concern for ISV
compatibility.  (Especially if C++ code is involved, unfortunately.)

> Right now because there has been a deprecated note in
> "include/linux/sysctl.h" since 2003 people currently feel fine with
> letting sys_sysctl code bit rot.  I am trying to resolve that
> situation most likely by just updating the few stray pieces of user
> space that care and then cutting out that chunk of kernel code.

What we should do is what we've done in the past before removing a
system call like this:  printk a deprecation warning no more than n
times an hour with the name of the process using the deprecated interface.
A deprecated note in a header isn't necessarily something which will
be noticed by userspace programmers.  Heck, it isn't even in
Documentation/feature-removal-schedule.txt yet.

If people want to remove it, let's please do this in an orderly
fashion, and with ample warning that people besides kernel developers
will actually *notice*.

						- Ted

P.S.  I happen to be one of those developers who think the binary
interface is not so bad, and compared to reading from /proc/sys,
the sysctl syscall *is* faster.  That said, there really
isn't anything that requires that kind of speed, so that
point is moot.  But at the same time, what is the cost of leaving
sys_sysctl in the kernel for an extra 6-12 months, or even longer,
starting from now?  

Or if we are going to remove parts of sysctl, can we at least keep enough
there so that existing glibc systems don't break?
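
The "no more than n times an hour" warning described in this message can be modelled in user space as a fixed-window rate limiter. This is only a sketch: the limit values, the `struct warn_limiter` type, the function names and the message text are illustrative assumptions, and real kernel code would use printk and the kernel's own time source.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical limit: at most MAX_WARNINGS messages per WINDOW_SECONDS. */
#define MAX_WARNINGS   5
#define WINDOW_SECONDS 3600

struct warn_limiter {
    time_t window_start;  /* start of the current one-hour window */
    int    emitted;       /* warnings printed in this window */
};

/* Returns 1 if a warning may be printed at time `now`, 0 if suppressed. */
int warn_allowed(struct warn_limiter *w, time_t now)
{
    assert(w != NULL);
    if (now - w->window_start >= WINDOW_SECONDS) {
        w->window_start = now;   /* a new window begins */
        w->emitted = 0;
    }
    if (w->emitted >= MAX_WARNINGS)
        return 0;
    w->emitted++;
    return 1;
}

/* Print the warning with the calling process name, subject to the limit.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
int warn_deprecated_sysctl(struct warn_limiter *w, const char *comm, time_t now)
{
    if (!warn_allowed(w, now))
        return 0;
    fprintf(stderr, "warning: process `%s' used the deprecated sysctl "
                    "system call\n", comm);
    return 1;
}
```

Naming the process in the message is what lets people other than kernel developers see which program still depends on the interface.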

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:24             ` Theodore Tso
@ 2006-07-12 23:31               ` Andi Kleen
  2006-07-13  0:12                 ` Theodore Tso
  2006-07-12 23:44               ` Steve Munroe
  2006-07-13  0:19               ` Eric W. Biederman
  2 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2006-07-12 23:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Eric W. Biederman, Ulrich Drepper, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Thursday 13 July 2006 01:24, Theodore Tso wrote:

> Um, if glibc is using sys_sysctl, then that's a pretty good reason.
> Once we remove it from the kernel, then people will be forced to
> upgrade glibc before they can install a newer kernel.  Can we please
> give people some time for a version of glibc with this change to make
> it out to most deployed systems, first?  It's really annoying when
> it's not possible to install a stock kernel.org kernel on a system,
> and often upgrading glibc is not a trivial thing to do on a
> distribution userspace, especially if there is a concern for ISV
> compatibility.  (Especially if C++ code is involved, unfortunately.)

glibc still works, just slower. But I think the best strategy 
is just to emulate the single sysctl glibc is using and printk
for the rest.

> What we should do is what we've done in the past before removing a
> system call like this:  printk a deprecation warning no more than n
> times an hour with the name of the process using the deprecated interface.

We did this some time ago, but Andrew took it out (partly because
the original code was somewhat broken and the printk tended to trigger
too often in crashme) 

Hopefully he puts it back in now.

> P.S.  I happen to be one of those developers who think the binary
> interface is not so bad, and compared to reading from /proc/sys,
> the sysctl syscall *is* faster.  That said, there really
> isn't anything that requires that kind of speed, so that
> point is moot.  But at the same time, what is the cost of leaving
> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
> starting from now?  

The numerical namespace for sysctl is unsalvageable imho. e.g. distributions
regularly break it because there is no central repository of numbers
so it's not very usable anyway in practice.
 
-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:24             ` Theodore Tso
  2006-07-12 23:31               ` Andi Kleen
@ 2006-07-12 23:44               ` Steve Munroe
  2006-07-14 18:49                 ` Benjamin Herrenschmidt
  2006-07-13  0:19               ` Eric W. Biederman
  2 siblings, 1 reply; 47+ messages in thread
From: Steve Munroe @ 2006-07-12 23:44 UTC (permalink / raw)
  To: Theodore Tso, libc-alpha, linux-kernel
  Cc: Andi Kleen, akpm, Arjan van de Ven, Ulrich Drepper,
	Eric W. Biederman, Randy.Dunlap


Theodore Tso <tytso@mit.edu> wrote on 07/12/2006 06:24:14 PM:

> On Wed, Jul 12, 2006 at 11:42:47AM -0600, Eric W. Biederman wrote:
> > Unless a darn good reason for keeping it is found, sys_sysctl won't be
> > in the kernel several months from now.  And uname is faster by a large
> > margin than /proc.
>
> Um, if glibc is using sys_sysctl, then that's a pretty good reason.
> Once we remove it from the kernel, then people will be forced to
> upgrade glibc before they can install a newer kernel.  Can we please
> give people some time for a version of glibc with this change to make
> it out to most deployed systems, first?  It's really annoying when
> it's not possible to install a stock kernel.org kernel on a system,
> and often upgrading glibc is not a trivial thing to do on a
> distribution userspace, especially if there is a concern for ISV
> compatibility.  (Especially if C++ code is involved, unfortunately.)
>
We will need an implementation that will fall back to sys_sysctl for older
kernels. This is already common practice in glibc. I don't really
understand the performance concern because it seems to me that
_is_smp_system() is only called once per process.

But isn't this the kind of thing that the Aux Vector is for? I like vDSO
too, but I think it is best deployed for information of a more dynamic
nature and performance sensitive.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:31               ` Andi Kleen
@ 2006-07-13  0:12                 ` Theodore Tso
  2006-07-13  2:33                   ` Eric W. Biederman
  2006-07-13 12:15                   ` Andi Kleen
  0 siblings, 2 replies; 47+ messages in thread
From: Theodore Tso @ 2006-07-13  0:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Eric W. Biederman, Ulrich Drepper, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Thu, Jul 13, 2006 at 01:31:46AM +0200, Andi Kleen wrote:
> 
> glibc still works, just slower. But I think the best strategy 
> is just to emulate the single sysctl glibc is using and printk
> for the rest.
> 

That sounds reasonable, yes.


> > point is moot.  But at the same time, what is the cost of leaving
> > sys_sysctl in the kernel for an extra 6-12 months, or even longer,
> > starting from now?  
>
> The numerical namespace for sysctl is unsalvageable imho. e.g. distributions
> regularly break it because there is no central repository of numbers
> so it's not very usable anyway in practice.

That may be true, but it doesn't answer the question: what's the cost
of leaving sys_sysctl in there for now?  

In any case, if we really do want to get rid of it, the next step
should be a working deprecation printk and adding something to
Documentation/feature-removal-schedule.txt.

						- Ted

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:24             ` Theodore Tso
  2006-07-12 23:31               ` Andi Kleen
  2006-07-12 23:44               ` Steve Munroe
@ 2006-07-13  0:19               ` Eric W. Biederman
  2 siblings, 0 replies; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13  0:19 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ulrich Drepper, Arjan van de Ven, Randy.Dunlap, akpm,
	linux-kernel, libc-alpha, Andi Kleen

Theodore Tso <tytso@mit.edu> writes:

> On Wed, Jul 12, 2006 at 11:42:47AM -0600, Eric W. Biederman wrote:
>> Unless a darn good reason for keeping it is found, sys_sysctl won't be
>> in the kernel several months from now.  And uname is faster by a large
>> margin than /proc.
>
> Um, if glibc is using sys_sysctl, then that's a pretty good reason.
> Once we remove it from the kernel, then people will be forced to
> upgrade glibc before they can install a newer kernel.  Can we please
> give people some time for a version of glibc with this change to make
> it out to most deployed systems, first?  It's really annoying when
> it's not possible to install a stock kernel.org kernel on a system,
> and often upgrading glibc is not a trivial thing to do on a
> distribution userspace, especially if there is a concern for ISV
> compatibility.  (Especially if C++ code is involved, unfortunately.)

I agree.

The reason for stopping this is that sys_sysctl at that location
in glibc is unnecessary, we can use uname now.

Currently that usage by glibc gives false positives if we want
to warn users.

>> Right now because there has been a deprecated note in
>> "include/linux/sysctl.h" since 2003 people currently feel fine with
>> letting sys_sysctl code bit rot.  I am trying to resolve that
>> situation most likely by just updating the few stray pieces of user
>> space that care and then cutting out that chunk of kernel code.
>
> What we should do is what we've done in the past before removing a
> system call like this:  printk a deprecation warning no more than n
> times an hour with the name of the process using the deprecated interface.
> A deprecated note in a header isn't necessarily something which will
> be noticed by userspace programmers.  Heck, it isn't even in
> Documentation/feature-removal-schedule.txt yet.

I sent Andrew patches yesterday to put it in 
Documentation/feature-removal-schedule.txt, to print a warning, and
to optionally compile out sys_sysctl. 

> If people want to remove it, let's please do this in an orderly
> fashion, and with ample warning that people besides kernel developers
> will actually *notice*.

I agree.  Beyond the deprecation message, part of that is sending
patches to fix up the few remaining users and talking about it a lot,
so that even someone who doesn't run a kernel with the deprecation
message might notice something.

> 						- Ted
>
> P.S.  I happen to be one of those developers who think the binary
> interface is not so bad, and compared to reading from /proc/sys,
> the sysctl syscall *is* faster.  That said, there really
> isn't anything that requires that kind of speed, so that
> point is moot.  But at the same time, what is the cost of leaving
> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
> starting from now?  

The core problem is that enough people have read that deprecation note
that the binary interface of kernel/sysctl.c is not being maintained
seriously.  So the code must move out of this half-deprecated state:
either all the way gone (preferably) or reinstated as an
interface that we are serious about maintaining.  Code that
bit rots and that people don't care about is a problem.

> Or if we are going to remove parts of sysctl, can we at least keep enough
> there so that existing glibc systems don't break?

That is not a problem. glibc will happily fall back to reading
the values from /proc/sys/kernel/version if sysctl fails.  It
just makes more sense (to me at least) to use uname for
getting uname data.

Eric
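
For reference, a small sketch of reading the kernel revision through uname(2) rather than sysctl or /proc. The helper names are illustrative, and the idea that an SMP kernel mentions "SMP" in its version string is an assumption about how such a check might be written, not a statement of what glibc does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

/* Copy the kernel release string (the `uname -r` value) into buf.
 * Returns 0 on success, -1 if uname() fails. */
int kernel_release(char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    snprintf(buf, len, "%s", u.release);
    return 0;
}

/* Assumption: SMP kernels include "SMP" in the version string reported by
 * uname().  The answer is computed once per process and then cached. */
int kernel_version_mentions_smp(void)
{
    static int cached = -1;
    struct utsname u;

    if (cached < 0) {
        if (uname(&u) != 0)
            return 0;   /* not cached, so a later call can retry */
        cached = strstr(u.version, "SMP") != NULL;
    }
    return cached;
}
```

Both helpers cost a single system call with no path lookup or file descriptor, which is the practical argument for uname over /proc here.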


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  0:12                 ` Theodore Tso
@ 2006-07-13  2:33                   ` Eric W. Biederman
  2006-07-13 12:15                   ` Andi Kleen
  1 sibling, 0 replies; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13  2:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andi Kleen, Ulrich Drepper, Arjan van de Ven, Randy.Dunlap, akpm,
	linux-kernel, libc-alpha

Theodore Tso <tytso@mit.edu> writes:

> That may be true, but it doesn't answer the question: what's the cost
> of leaving sys_sysctl in there for now?  

Among other things, the implementations of all of:
CTL_KERN, {KERN_OSTYPE, KERN_OSRELEASE, KERN_OSREV, KERN_VERSION,
          KERN_SECUREMASK, KERN_PROF, KERN_NODENAME, KERN_DOMAINNAME}
are broken in kernel/sysctl.c because the locking is missing.

Eric

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 21:23                   ` Eric W. Biederman
                                       ` (2 preceding siblings ...)
  2006-07-12 21:33                     ` Michael Tokarev
@ 2006-07-13  5:17                     ` Ulrich Drepper
  2006-07-13  6:27                       ` Ian Wienand
  2006-07-13 14:39                       ` Eric W. Biederman
  3 siblings, 2 replies; 47+ messages in thread
From: Ulrich Drepper @ 2006-07-13  5:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: H. Peter Anvin, Jakub Jelinek, Roland McGrath, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 1533 bytes --]

Eric W. Biederman wrote:
> Ulrich what would be interesting besides the possibility of having
> multiple cpus?

What is needed for various things like memory handling etc is all
topology information.  Somebody might remember the numa library proposal
I had in April 2004 which was cast aside because people were only
looking for a "quick fix".  Well, the problem still isn't solved.

IMO the vdso should export information about:

- processors and their relationship (hyperthreads, cores)

- the CPU caches and how they relate to the cores (e.g., dual core
  with shared L2)

- local main memory for each processor

- relative costs of the memory access of the various memory regions
  (for numa local memory to a node, intra-node costs)

- ideally, relative costs of main memory and CPU caches


All this information can be steadily updated by the kernel as new
CPUs/memory get added/removed.  The vdso should have functions to access
this information.  It's easy enough to make this access race free.

I guess I should try to come up with a representation for this
knowledge.  Collecting the information (except the costs) should be
easy.  Determining the costs also shouldn't be that hard but it can be
very useful.  Some of this information could be determined at userlevel
but you really don't want every process to compute all this from
scratch.  And stored data in a file is stale if the system changes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:17                     ` Ulrich Drepper
@ 2006-07-13  6:27                       ` Ian Wienand
  2006-07-13 14:39                       ` Eric W. Biederman
  1 sibling, 0 replies; 47+ messages in thread
From: Ian Wienand @ 2006-07-13  6:27 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric W. Biederman, H. Peter Anvin, Jakub Jelinek, Roland McGrath,
	Arjan van de Ven, Randy.Dunlap, akpm, linux-kernel, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 337 bytes --]

On Wed, Jul 12, 2006 at 10:17:51PM -0700, Ulrich Drepper wrote:
> I guess I should try to come up with a representation for this
> knowledge.

Sounds a little like the "Machine Description" as mentioned in the
UltraSPARC Virtual Machine Specification, Chapter 8

http://opensparc.sunsource.net/specs/Hypervisor-api-current-draft.pdf

-i

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 191 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:19                                 ` H. Peter Anvin
@ 2006-07-13 11:15                                   ` Alan Cox
  0 siblings, 0 replies; 47+ messages in thread
From: Alan Cox @ 2006-07-13 11:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Eric W. Biederman, Arjan van de Ven, Jakub Jelinek,
	Ulrich Drepper, Roland McGrath, Randy.Dunlap, akpm, linux-kernel,
	libc-alpha

Ar Mer, 2006-07-12 am 16:19 -0700, ysgrifennodd H. Peter Anvin:
> glibc is (and has to be) LGPL.
> 
> Anyway, it seems absolutely insane that having a programmable threshold 
> for spinning is patented...

I'm not aware programmable thresholds are patented/patent-pending, just
having the kernel indicate through a shared variable whether other tasks
are waiting so that it avoids syscalls and latency costs.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  0:12                 ` Theodore Tso
  2006-07-13  2:33                   ` Eric W. Biederman
@ 2006-07-13 12:15                   ` Andi Kleen
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2006-07-13 12:15 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Eric W. Biederman, Ulrich Drepper, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Thursday 13 July 2006 02:12, Theodore Tso wrote:
> On Thu, Jul 13, 2006 at 01:31:46AM +0200, Andi Kleen wrote:
> > glibc still works, just slower. But I think the best strategy
> > is just to emulate the single sysctl glibc is using and printk
> > for the rest.
>
> That sounds reasonable, yes.
>
> > > point is moot.  But at the same time, what is the cost of leaving
> > > sys_sysctl in the kernel for an extra 6-12 months, or even longer,
> > > starting from now?
> >
> > The numerical namespace for sysctl is unsalvageable imho. e.g.
> > distributions regularly break it because there is no central repository
> > of numbers so it's not very usable anyway in practice.
>
> That may be true, but it doesn't answer the question: what's the cost
> of leaving sys_sysctl in there for now?

For one, linux/sysctl.h is one of the biggest sources of patch rejects.
The sooner it goes the better.

>
> In any case, if we really do want to get rid of it, the next step
> should be a working deprecation printk 

It was in there for months already.

> and adding something to 
> Documentation/feature-removal-schedule.txt.

That is what Eric's patch did.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:17                     ` Ulrich Drepper
  2006-07-13  6:27                       ` Ian Wienand
@ 2006-07-13 14:39                       ` Eric W. Biederman
  2006-07-13 15:05                         ` Arjan van de Ven
  1 sibling, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13 14:39 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: H. Peter Anvin, Jakub Jelinek, Roland McGrath, Arjan van de Ven,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

Ulrich Drepper <drepper@redhat.com> writes:

> Eric W. Biederman wrote:
>> Ulrich what would be interesting besides the possibility of having
>> multiple cpus?
>
> What is needed for various things like memory handling etc is all
> topology information.  Somebody might remember the numa library proposal
> I had in April 2004 which was cast aside because people were only
> looking for a "quick fix".  Well, the problem still isn't solved.
>
> IMO the vdso should export information about:
>
> - processors and their relationship (hyperthreads, cores)
>
> - the CPU caches and how they relate to the cores (e.g., dual core
>   with shared L2)
>
> - local main memory for each processor
>
> - relative costs of the memory access of the various memory regions
>   (for numa local memory to a node, intra-node costs)
>
> - ideally, relative costs of main memory and CPU caches
>
>
> All this information can be steadily updated by the kernel as new
> CPUs/memory get added/removed.  The vdso should have functions to access
> this information.  It's easy enough to make this access race free.
>
> I guess I should try to come up with a representation for this
> knowledge.  Collecting the information (except the costs) should be
> easy.  Determining the costs also shouldn't be that hard but it can be
> very useful.  Some of this information could be determined at userlevel
> but you really don't want every process to compute all this from
> scratch.  And stored data in a file is stale if the system changes.

The history of Linux shows that auto-tuning while not always perfect
is much more effective than manual tuning.  How are you envisioning
using this information? 

I find it really easy to see how topology information can be used
to manually tune a system.  I don't currently see how it can be used
to automatically tune a system.

My fear is that we will end up making things brittle, with user space
over-specifying things.

I have another, related concern about what rules need to be in place
so I can upgrade the kernel vdso while a user space program is running.
That makes me seriously wonder how sane the vdso concept is, but that
is a completely different issue.


Eric

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13 14:39                       ` Eric W. Biederman
@ 2006-07-13 15:05                         ` Arjan van de Ven
  0 siblings, 0 replies; 47+ messages in thread
From: Arjan van de Ven @ 2006-07-13 15:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ulrich Drepper, H. Peter Anvin, Jakub Jelinek, Roland McGrath,
	Randy.Dunlap, akpm, linux-kernel, libc-alpha

On Thu, 2006-07-13 at 08:39 -0600, Eric W. Biederman wrote:
> Ulrich Drepper <drepper@redhat.com> writes:
> 
> > Eric W. Biederman wrote:
> >> Ulrich what would be interesting besides the possibility of having
> >> multiple cpus?
> >
> > What is needed for various things like memory handling etc is all
> > topology information.  Somebody might remember the numa library proposal
> > I had in April 2004 which was cast aside because people were only
> > looking for a "quick fix".  Well, the problem still isn't solved.
> >
> > IMO the vdso should export information about:
> >
> > - processors and their relationship (hyperthreads, cores)
> >
> > - the CPU caches and how they relate to the cores (e.g., dual core
> >   with shared L2)
> >
> > - local main memory for each processor
> >
> > - relative costs of the memory access of the various memory regions
> >   (for numa local memory to a node, intra-node costs)
> >
> > - ideally, relative costs of main memory and CPU caches
> >
> >
> > All this information can be steadily updated by the kernel as new
> > CPUs/memory get added/removed.  The vdso should have functions to access
> > this information.  It's easy enough to make this access race free.
> >

Why does this have to be in the vdso? It's not like the code can't be a
regular userspace lib/daemon that gets all the hotplug events and that
processes the info from /proc and /sys once during boot. A bit like how
nscd works, I suppose.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:07                               ` Alan Cox
  2006-07-12 23:19                                 ` H. Peter Anvin
@ 2006-07-14 18:45                                 ` Benjamin Herrenschmidt
  2006-07-14 19:11                                   ` H. Peter Anvin
  1 sibling, 1 reply; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2006-07-14 18:45 UTC (permalink / raw)
  To: Alan Cox
  Cc: Eric W. Biederman, H. Peter Anvin, Arjan van de Ven,
	Jakub Jelinek, Ulrich Drepper, Roland McGrath, Randy.Dunlap,
	akpm, linux-kernel, libc-alpha

On Thu, 2006-07-13 at 00:07 +0100, Alan Cox wrote:
> Ar Mer, 2006-07-12 am 16:26 -0600, ysgrifennodd Eric W. Biederman:
> > If the lock is not short lived then the release is likely to be a long
> > way off.  If the lock is not highly contended then you are not likely
> > to hit the window when someone else has the contended lock.
> > 
> > How frequent are highly contended short lived locks in user space?
> 
> I'm not sure it matters.
> 
> If you want to do the job right then do this
> 
> - Stick an indicator of how much else wants to run on this CPU in the
> vsyscall page or similar location

Except that "this cpu" doesn't really mean anything in userspace, and
while I think Andi has some tricks to get some sort of CPU number to
userspace (though it's really only valid during the execution of the
instruction that reads it :) I haven't yet found an equivalent for
powerpc (and possibly other architectures will have the same problem).

> In your locks you can now do
> 
>               while(try_and_grab_lock() == FAILED) {
>                        if (kernelpage->waiting > 0)
>                               sys_somelockwaitthing()
>               }
> 
> Furthermore the kernel can be intelligent about the waiting indicator
> for power or other global scheduling reasons
> 
> [Disclaimer: There is a patent issue around this technique but it's not
> one that will impact GPL code as permissions are given for GPL use.]
> 
> Alan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-12 23:44               ` Steve Munroe
@ 2006-07-14 18:49                 ` Benjamin Herrenschmidt
  2006-07-14 19:09                   ` Andi Kleen
  0 siblings, 1 reply; 47+ messages in thread
From: Benjamin Herrenschmidt @ 2006-07-14 18:49 UTC (permalink / raw)
  To: Steve Munroe
  Cc: Theodore Tso, libc-alpha, linux-kernel, Andi Kleen, akpm,
	Arjan van de Ven, Ulrich Drepper, Eric W. Biederman,
	Randy.Dunlap

> We will need an implementation that will fall back to sys_sysctl for older
> kernels. This is already common practice in glibc. I don't really
> understand the performance concern because it seems to me that
> _is_smp_system() is only called once per process.
> 
> But isn't this the kind of thing that the Aux Vector is for? I like vDSO
> too, but I think it is best deployed for information of a more dynamic
> nature and performance sensitive.

For a simple "is_smp" kind of flag, I would tend to agree with the
above... for more complex NUMA topology and/or cache characteristics,
which is quite a bigger amount of information, I'm not sure it's worth
copying all of that data on every process exec (and making the initial
AT_ parsing slower). Especially since very few processes actually care
about those.

Ben.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-14 18:49                 ` Benjamin Herrenschmidt
@ 2006-07-14 19:09                   ` Andi Kleen
  0 siblings, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2006-07-14 19:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Steve Munroe, Theodore Tso, libc-alpha, linux-kernel, akpm,
	Arjan van de Ven, Ulrich Drepper, Eric W. Biederman,
	Randy.Dunlap

On Friday 14 July 2006 20:49, Benjamin Herrenschmidt wrote:
> > We will need an implementation that will fall back to sys_sysctl for older
> > kernels. This is already common practice in glibc. I don't really
> > understand the performance concern because it seems to me that
> > _is_smp_system() is only called once per process.
> > 
> > But isn't this the kind of thing that the Aux Vector is for? I like vDSO
> > too, but I think it is best deployed for information of a more dynamic
> > nature and performance sensitive.
> 
> For a simple "is_smp" kind of flag, I would tend to agree with the
> above... for more complex NUMA topology and/or cache characteristics,
> which is quite a bigger amount of information, I'm not sure it's worth
> copying all of that data on every process exec (and making the initial
> AT_ parsing slower). Especially since very few processes actually care
> about those.

I've actually spent some thought on that recently. The motivation
came from someone who wanted the number of CPUs in a fast way 
to tune AMD64 memcpy etc. better.

My proposal was to supply four new counts:
number of cores, number of siblings, number of sockets, number of nodes

These all fit easily in 16 bits each, so it would be 2 new entries in the
aux vector (128 bits total). Shouldn't be much overhead to write this.

If you need more exact topology you can probably eat the overhead
of parsing /proc/cpuinfo or read it from sysfs (or just use libnuma
which supplies most of this in an easy way) 

Doing it in a vDSO would be in theory ok for me too, except that x86-64
doesn't have one so far. Even in vDSO I wouldn't add much more than this
(like bitmaps and similar) because otherwise cpu/node hotplug could be racy.
Also I'm reluctant to redo /proc/cpuinfo and /sys for this.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-14 18:45                                 ` Benjamin Herrenschmidt
@ 2006-07-14 19:11                                   ` H. Peter Anvin
  0 siblings, 0 replies; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-14 19:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alan Cox, Eric W. Biederman, Arjan van de Ven, Jakub Jelinek,
	Ulrich Drepper, Roland McGrath, Randy.Dunlap, akpm, linux-kernel,
	libc-alpha

Benjamin Herrenschmidt wrote:
>>
>> If you want to do the job right then do this
>>
>> - Stick an indicator of how much else wants to run on this CPU in the
>> vsyscall page or similar location
> 
> Except that "this cpu" doesn't really mean anything in userspace, and
> while I think Andi has some tricks to get some sort of CPU number to
> userspace (though it's really only valid during the execution of the
> instruction that reads it :) I haven't yet found an equivalent for
> powerpc (and possibly other architectures will have the same problem).
> 

Sure it does... although its validity in terms of a locality metric 
decays with time.

	-hpa


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13 16:53     ` Eric W. Biederman
@ 2006-07-13 17:06       ` Albert Cahalan
  0 siblings, 0 replies; 47+ messages in thread
From: Albert Cahalan @ 2006-07-13 17:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: ak, tytso, drepper, arjan, rdunlap, akpm, linux-kernel, libc-alpha

On 7/13/06, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Albert Cahalan" <acahalan@gmail.com> writes:

> > Matching keywords, as is needed for /proc/*/status,
> > is also horribly slow. I ended up using gperf to make
> > a perfect hash table, then gcc's computed goto for
> > jumping to the code, and it still wasn't cheap to do.
> > (while /sys lacks this, the extra open-read-close is
> > certain to be far worse)
>
> I agree matching keywords and such seems slow.
>
> If the only overhead comes from open-read-close we can
> come up with a sys_readfile that doesn't need to actually
> open the file for one shot cases.

A sys_readfile would be great. It probably should
work like readlink. Supplying a struct stat without
a race condition would be good too.
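
As a rough illustration of the calling convention being discussed, here is
a user-space sketch of a readlink-style helper. No sys_readfile exists; the
name readfile and its (path, bufsiz) signature are assumptions, and this
emulation still performs the open, read, and close that a real system call
would be meant to avoid.

```python
import os


def readfile(path, bufsiz):
    """Return at most bufsiz bytes from path in a single call.

    Like readlink, the result is silently truncated at bufsiz and the
    caller never sees a file descriptor.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, bufsiz)
    finally:
        os.close(fd)
```

A caller would write something like readfile("/proc/sys/kernel/ostype", 64)
and get the bytes back directly.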

Note that /sys will still be needlessly slow because
of the one-item-per-file idea. One of the few good
things about /proc is that you can get a whole
struct full of data all at once.

Fixing one bottleneck just leads to the next. It's best
to fix all the anti-performance stupidity at once.


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13 16:15   ` Albert Cahalan
@ 2006-07-13 16:53     ` Eric W. Biederman
  2006-07-13 17:06       ` Albert Cahalan
  0 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13 16:53 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: ak, tytso, drepper, arjan, rdunlap, akpm, linux-kernel, libc-alpha

"Albert Cahalan" <acahalan@gmail.com> writes:

> On 7/13/06, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> "Albert Cahalan" <acahalan@gmail.com> writes:
>> > Andi Kleen writes:
>> >> On Thursday 13 July 2006 01:24, Theodore Tso wrote:
>> >
>> >>> P.S.  I happen to be one of those developers who think the binary
>> >>> interface is not so bad, and compared to reading from /proc/sys,
>> >>> the sysctl syscall *is* faster.  But at the same time, there really
>> >>> isn't anything that really does require that kind of speed, so that
>> >>> point is moot.  But at the same time, what is the cost of leaving
>> >>> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
>> >>> starting from now?
>> >>
>> >> The numerical namespace for sysctl is unsalvagable imho. e.g.
>> >> distributions regularly break it because there is no central
>> >> repository of numbers so it's not very usable anyways in practice.
>> >
>> > Huh? How exactly is this different from system call numbers,
>> > ioctl numbers, fcntl numbers, ptrace command numbers, and every
>> > other part of the Linux ABI?
>>
>> The only practical difference is that what people use is
>> /proc/sys so the binary sysctl interface is not seriously maintained
>> and bugs crop up.
>
> There is a chicken-and-egg problem here then.
> Let's fix it.
>
> I maintain the sysctl program, which most Linux
> distributions run at boot. I agree to switch to the
> binary sysctl interface if somebody will maintain
> the kernel side of things. This will shave a bit of
> time off boot on nearly every Linux box out there.
> The total time saved is probably a human lifetime,
> so it's like saving somebody's life.

:)

I don't want to make any commitments until we have thrashed
this out at kernel summit and OLS.  

>> > Normal sysctl works very well for FreeBSD. I'm jealous.
>> > They also have a few related calls that are very nice.
>> >
>> > Here we fight over a few CPU cycles in the syscall entry path,
>> > then piss away performance by requiring open-read-close and
>> > marshalling everything through decimal ASCII text. WTF? Let's
>> > just have one system call (make_XML_SOAP_request) and be done.
>>
>> There is a cost to open-read-close.  But as a simple benchmark
>> against a file will show reading data from /proc/sys is much slower
>> than reading data from a file.
>>
>> From what I have been able to measure so far, open-read-close only
>> seems to double the cost over sysctl, and access can do the filename
>> resolution about as quickly as sysctl can deal with a binary path.  So
>> I suspect it is the allocation of struct file that makes
>> open-read-close more expensive.  Reading the data is in the noise.
>
> Eh? A factor of two is not "in the noise".

The factor of two comes from just the fd = open(); close(fd); pair.
Throwing a read into the loop where I am measuring things does not change
the cache-hot cost in a measurable way, which says the cost
is in the gyrations we use to get to the data, not in getting
the data itself.

The fact that we have bottlenecks even when cache hot is interesting.
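
A measurement of this kind can be sketched as below. This is not the
benchmark used for the numbers in this thread; the function name and
parameters are assumptions, and interpreter overhead will blur the
differences compared with a C loop.

```python
import os
import time


def time_open_close(path, iterations, do_read):
    """Return the average seconds per open[/read]/close cycle on path."""
    start = time.perf_counter()
    for _ in range(iterations):
        fd = os.open(path, os.O_RDONLY)
        if do_read:
            os.read(fd, 4096)
        os.close(fd)
    return (time.perf_counter() - start) / iterations
```

Running it with do_read set to False and then True on
/proc/sys/kernel/ostype, and again on a small regular file, gives the
comparisons described above: open and close alone versus open, read, and
close, and /proc/sys versus a real file.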

>> sysfs currently does a lot better than /proc/sys; I think it was only
>> 60% heavier than performing the same operation on a real file.
>
> That is still a horrible way to piss away performance.

Agreed.  But it is a lot better than the 5x performance hit of /proc/sys
I see over regular files.

Since I was measuring the cache hot case it may simply be that
generating the data no matter how fast we do it is slower than
simply copying the data.

>> Performance wise there does seem to be a problem with the
>> implementation.  How to fix it I don't yet know.  But I have
>> yet to see ascii text be implicated.
>
> I have more experience with /proc. There, ASCII is
> known to be a problem.
>
> Parsing a 64-bit number is horribly slow on i386.

I can see that.  Of course if that was the only problem
converting the number into hex would make the parsing
trivial again.
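
The difference can be sketched digit by digit. The functions below are
illustrative only (their names are made up for this example): the decimal
loop needs a multiply by ten per digit, which on a 32-bit machine has to be
synthesized for a 64-bit value, while the hex loop needs only a shift and
an OR.

```python
def parse_decimal(text):
    """Parse an unsigned decimal string one digit at a time."""
    value = 0
    for ch in text:
        digit = ord(ch) - ord("0")
        if not 0 <= digit <= 9:
            raise ValueError("not a decimal digit: %r" % ch)
        value = value * 10 + digit      # one wide multiply per digit
    return value


HEX_DIGITS = {c: i for i, c in enumerate("0123456789abcdef")}


def parse_hex(text):
    """Parse an unsigned hex string one digit at a time."""
    value = 0
    for ch in text.lower():
        if ch not in HEX_DIGITS:
            raise ValueError("not a hex digit: %r" % ch)
        value = (value << 4) | HEX_DIGITS[ch]   # shift and OR only
    return value
```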

> Matching keywords, as is needed for /proc/*/status,
> is also horribly slow. I ended up using gperf to make
> a perfect hash table, then gcc's computed goto for
> jumping to the code, and it still wasn't cheap to do.
> (while /sys lacks this, the extra open-read-close is
> certain to be far worse)

I agree matching keywords and such seems slow.

If the only overhead comes from open-read-close we can
come up with a sys_readfile that doesn't need to actually
open the file for one shot cases.

The cost of a system call (getpid) is largely in the noise 
compared to the cost of sysctl and open-read-close on files.
So I'm not convinced a multiple system call approach is a bad thing.

Regardless of where we go what is clear from my preliminary
investigation is that our interfaces to in kernel data are
slow and there is a lot of room for improvement.

Eric





* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  6:38 ` Eric W. Biederman
@ 2006-07-13 16:15   ` Albert Cahalan
  2006-07-13 16:53     ` Eric W. Biederman
  0 siblings, 1 reply; 47+ messages in thread
From: Albert Cahalan @ 2006-07-13 16:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: ak, tytso, drepper, arjan, rdunlap, akpm, linux-kernel, libc-alpha

On 7/13/06, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Albert Cahalan" <acahalan@gmail.com> writes:
> > Andi Kleen writes:
> >> On Thursday 13 July 2006 01:24, Theodore Tso wrote:
> >
> >>> P.S.  I happen to be one of those developers who think the binary
> >>> interface is not so bad, and compared to reading from /proc/sys,
> >>> the sysctl syscall *is* faster.  But at the same time, there really
> >>> isn't anything that really does require that kind of speed, so that
> >>> point is moot.  But at the same time, what is the cost of leaving
> >>> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
> >>> starting from now?
> >>
> >> The numerical namespace for sysctl is unsalvagable imho. e.g.
> >> distributions regularly break it because there is no central
> >> repository of numbers so it's not very usable anyways in practice.
> >
> > Huh? How exactly is this different from system call numbers,
> > ioctl numbers, fcntl numbers, ptrace command numbers, and every
> > other part of the Linux ABI?
>
> The only practical difference is that what people use is
> /proc/sys so the binary sysctl interface is not seriously maintained
> and bugs crop up.

There is a chicken-and-egg problem here then.
Let's fix it.

I maintain the sysctl program, which most Linux
distributions run at boot. I agree to switch to the
binary sysctl interface if somebody will maintain
the kernel side of things. This will shave a bit of
time off boot on nearly every Linux box out there.
The total time saved is probably a human lifetime,
so it's like saving somebody's life.

> > Normal sysctl works very well for FreeBSD. I'm jealous.
> > They also have a few related calls that are very nice.
> >
> > Here we fight over a few CPU cycles in the syscall entry path,
> > then piss away performance by requiring open-read-close and
> > marshalling everything through decimal ASCII text. WTF? Let's
> > just have one system call (make_XML_SOAP_request) and be done.
>
> There is a cost to open-read-close.  But as a simple benchmark
> against a file will show reading data from /proc/sys is much slower
> than reading data from a file.
>
> From what I have been able to measure so far, open-read-close only
> seems to double the cost over sysctl, and access can do the filename
> resolution about as quickly as sysctl can deal with a binary path.  So
> I suspect it is the allocation of struct file that makes
> open-read-close more expensive.  Reading the data is in the noise.

Eh? A factor of two is not "in the noise".

> sysfs currently does a lot better than /proc/sys; I think it was only
> 60% heavier than performing the same operation on a real file.

That is still a horrible way to piss away performance.

> Performance wise there does seem to be a problem with the
> implementation.  How to fix it I don't yet know.  But I have
> yet to see ascii text be implicated.

I have more experience with /proc. There, ASCII is
known to be a problem.

Parsing a 64-bit number is horribly slow on i386.

Matching keywords, as is needed for /proc/*/status,
is also horribly slow. I ended up using gperf to make
a perfect hash table, then gcc's computed goto for
jumping to the code, and it still wasn't cheap to do.
(while /sys lacks this, the extra open-read-close is
certain to be far worse)
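
For readers unfamiliar with the format, the keyword matching looks roughly
like the sketch below. It is not the procps code: a set lookup stands in
for the gperf perfect hash, the sample text is made up, and the function
name is an assumption.

```python
# Made-up sample in the "Key:<tab>value" layout of /proc/<pid>/status.
SAMPLE_STATUS = (
    "Name:\tbash\n"
    "State:\tS (sleeping)\n"
    "Pid:\t1234\n"
    "VmRSS:\t    2048 kB\n"
)


def parse_status(text, wanted):
    """Return {key: value} for the keys in wanted (a set of field names)."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:       # hashed keyword match
            found[key] = value.strip()
    return found


fields = parse_status(SAMPLE_STATUS, {"Pid", "VmRSS"})
```

Every line still has to be scanned and its keyword hashed even when only
two fields are wanted, which is the cost being complained about.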


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:00 Albert Cahalan
  2006-07-13  5:42 ` H. Peter Anvin
  2006-07-13  6:38 ` Eric W. Biederman
@ 2006-07-13 15:20 ` Eric W. Biederman
  2 siblings, 0 replies; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13 15:20 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: ak, tytso, drepper, arjan, rdunlap, akpm, linux-kernel, libc-alpha

"Albert Cahalan" <acahalan@gmail.com> writes:

> Normal sysctl works very well for FreeBSD. I'm jealous.
> They also have a few related calls that are very nice.

Of course, as I recall, the BSDs reserve the right
to change all of their magic numbers periodically.

Eric


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:00 Albert Cahalan
  2006-07-13  5:42 ` H. Peter Anvin
@ 2006-07-13  6:38 ` Eric W. Biederman
  2006-07-13 16:15   ` Albert Cahalan
  2006-07-13 15:20 ` Eric W. Biederman
  2 siblings, 1 reply; 47+ messages in thread
From: Eric W. Biederman @ 2006-07-13  6:38 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: ak, tytso, drepper, arjan, rdunlap, akpm, linux-kernel, libc-alpha

"Albert Cahalan" <acahalan@gmail.com> writes:

> Andi Kleen writes:
>> On Thursday 13 July 2006 01:24, Theodore Tso wrote:
>
>>> P.S.  I happen to be one of those developers who think the binary
>>> interface is not so bad, and compared to reading from /proc/sys,
>>> the sysctl syscall *is* faster.  But at the same time, there really
>>> isn't anything that really does require that kind of speed, so that
>>> point is moot.  But at the same time, what is the cost of leaving
>>> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
>>> starting from now?
>>
>> The numerical namespace for sysctl is unsalvagable imho. e.g.
>> distributions regularly break it because there is no central
>> repository of numbers so it's not very usable anyways in practice.
>
> Huh? How exactly is this different from system call numbers,
> ioctl numbers, fcntl numbers, ptrace command numbers, and every
> other part of the Linux ABI?

The only practical difference is that what people use is
/proc/sys so the binary sysctl interface is not seriously maintained
and bugs crop up.

> Normal sysctl works very well for FreeBSD. I'm jealous.
> They also have a few related calls that are very nice.
>
> Here we fight over a few CPU cycles in the syscall entry path,
> then piss away performance by requiring open-read-close and
> marshalling everything through decimal ASCII text. WTF? Let's
> just have one system call (make_XML_SOAP_request) and be done.

There is a cost to open-read-close.  But as a simple benchmark
against a file will show reading data from /proc/sys is much slower
than reading data from a file.

From what I have been able to measure so far, open-read-close only
seems to double the cost over sysctl, and access can do the filename
resolution about as quickly as sysctl can deal with a binary path.  So
I suspect it is the allocation of struct file that makes
open-read-close more expensive.  Reading the data is in the noise.

sysfs currently does a lot better than /proc/sys; I think it was only
60% heavier than performing the same operation on a real file.

Part of the problem with /proc/sys and other data in proc is
that we deliberately drop everything out of the dcache
as soon as we have found it, which is terrible performance-wise.

All of those measurements were with string data that I don't
interpret on either side.

Performance wise there does seem to be a problem with the
implementation.  How to fix it I don't yet know.  But I have
yet to see ascii text be implicated.

Eric


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  6:09   ` Albert Cahalan
@ 2006-07-13  6:13     ` Albert Cahalan
  0 siblings, 0 replies; 47+ messages in thread
From: Albert Cahalan @ 2006-07-13  6:13 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: ak, tytso, ebiederm, drepper, arjan, rdunlap, akpm, linux-kernel,
	libc-alpha

On 7/13/06, Albert Cahalan <acahalan@gmail.com> wrote:
> On 7/13/06, H. Peter Anvin <hpa@zytor.com> wrote:
> > Albert Cahalan wrote:
> > >>
> > >> The numerical namespace for sysctl is unsalvagable imho. e.g.
> > >> distributions regularly break it because there is no central
> > >> repository of numbers so it's not very usable anyways in practice.
> > >
> > > Huh? How exactly is this different from system call numbers,
> > > ioctl numbers, fcntl numbers, ptrace command numbers, and every
> > > other part of the Linux ABI?
> > >
> >
> > Mostly because some branches of the sysctl tree have dynamic content
> > which is hard to marshal into a numeric form.
>
> Dynamic content is no problem. FreeBSD uses sysctl
> to implement their "ps" program. The process info comes
> out of sysctl now. The sysctl man page has an example.
>
> Non-numeric data is more troublesome. FreeBSD has
> a syscall that will take text (still faster than /proc/sys),
> and another that will convert the text representation
> into numeric form for later high-performance use.
>
> Look up all 3 calls here, in section 2:
> http://www.freebsd.org/cgi/man.cgi?manpath=FreeBSD+7.0-current

Excuse me, it's in section 3. I don't know if they use
a _sysctl like we do or what. Anyway, they claim that
the numeric version is several times faster, so I don't
think this is just a libc wrapper around /proc.


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:42 ` H. Peter Anvin
@ 2006-07-13  6:09   ` Albert Cahalan
  2006-07-13  6:13     ` Albert Cahalan
  0 siblings, 1 reply; 47+ messages in thread
From: Albert Cahalan @ 2006-07-13  6:09 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: ak, tytso, ebiederm, drepper, arjan, rdunlap, akpm, linux-kernel,
	libc-alpha

On 7/13/06, H. Peter Anvin <hpa@zytor.com> wrote:
> Albert Cahalan wrote:
> >>
> >> The numerical namespace for sysctl is unsalvagable imho. e.g.
> >> distributions regularly break it because there is no central
> >> repository of numbers so it's not very usable anyways in practice.
> >
> > Huh? How exactly is this different from system call numbers,
> > ioctl numbers, fcntl numbers, ptrace command numbers, and every
> > other part of the Linux ABI?
> >
>
> Mostly because some branches of the sysctl tree have dynamic content
> which is hard to marshal into a numeric form.

Dynamic content is no problem. FreeBSD uses sysctl
to implement their "ps" program. The process info comes
out of sysctl now. The sysctl man page has an example.

Non-numeric data is more troublesome. FreeBSD has
a syscall that will take text (still faster than /proc/sys),
and another that will convert the text representation
into numeric form for later high-performance use.

Look up all 3 calls here, in section 2:
http://www.freebsd.org/cgi/man.cgi?manpath=FreeBSD+7.0-current
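
The split between a slow by-name lookup and a fast by-number lookup can be
sketched with a toy table. Nothing below calls the FreeBSD interfaces: the
tree, the numbers, the stored values, and both function names are
illustrative assumptions, not the real MIB.

```python
# Toy name tree: each entry maps a name to (number, subtree).
TOY_TREE = {
    "kern": (1, {"ostype": (1, {}), "osrelease": (2, {})}),
    "hw": (6, {"ncpu": (3, {})}),
}

# Toy values, keyed by the numeric path.
TOY_VALUES = {(1, 1): "FreeBSD", (1, 2): "7.0-CURRENT", (6, 3): 2}


def name_to_mib(name):
    """Translate a dotted name into its numeric path (done once)."""
    mib = []
    level = TOY_TREE
    for part in name.split("."):
        number, level = level[part]     # KeyError for unknown names
        mib.append(number)
    return tuple(mib)


def lookup_by_mib(mib):
    """Fetch a value by numeric path (the fast, repeated operation)."""
    return TOY_VALUES[mib]


ncpu_mib = name_to_mib("hw.ncpu")       # pay for the text lookup once
ncpu = lookup_by_mib(ncpu_mib)          # then reuse the numeric form
```

The point of the design is that string handling happens only in the
translation step; later lookups work purely on integers.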


* Re: [PATCH] Use uname not sysctl to get the kernel revision
  2006-07-13  5:00 Albert Cahalan
@ 2006-07-13  5:42 ` H. Peter Anvin
  2006-07-13  6:09   ` Albert Cahalan
  2006-07-13  6:38 ` Eric W. Biederman
  2006-07-13 15:20 ` Eric W. Biederman
  2 siblings, 1 reply; 47+ messages in thread
From: H. Peter Anvin @ 2006-07-13  5:42 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: ak, tytso, ebiederm, drepper, arjan, rdunlap, akpm, linux-kernel,
	libc-alpha

Albert Cahalan wrote:
>>
>> The numerical namespace for sysctl is unsalvagable imho. e.g.
>> distributions regularly break it because there is no central
>> repository of numbers so it's not very usable anyways in practice.
> 
> Huh? How exactly is this different from system call numbers,
> ioctl numbers, fcntl numbers, ptrace command numbers, and every
> other part of the Linux ABI?
> 

Mostly because some branches of the sysctl tree have dynamic content 
which is hard to marshal into a numeric form.

	-hpa


* Re: [PATCH] Use uname not sysctl to get the kernel revision
@ 2006-07-13  5:00 Albert Cahalan
  2006-07-13  5:42 ` H. Peter Anvin
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Albert Cahalan @ 2006-07-13  5:00 UTC (permalink / raw)
  To: ak, tytso, ebiederm, drepper, arjan, rdunlap, akpm, linux-kernel,
	libc-alpha

Andi Kleen writes:
> On Thursday 13 July 2006 01:24, Theodore Tso wrote:

>> P.S.  I happen to be one of those developers who think the binary
>> interface is not so bad, and compared to reading from /proc/sys,
>> the sysctl syscall *is* faster.  But at the same time, there really
>> isn't anything that really does require that kind of speed, so that
>> point is moot.  But at the same time, what is the cost of leaving
>> sys_sysctl in the kernel for an extra 6-12 months, or even longer,
>> starting from now?
>
> The numerical namespace for sysctl is unsalvagable imho. e.g.
> distributions regularly break it because there is no central
> repository of numbers so it's not very usable anyways in practice.

Huh? How exactly is this different from system call numbers,
ioctl numbers, fcntl numbers, ptrace command numbers, and every
other part of the Linux ABI?

Normal sysctl works very well for FreeBSD. I'm jealous.
They also have a few related calls that are very nice.

Here we fight over a few CPU cycles in the syscall entry path,
then piss away performance by requiring open-read-close and
marshalling everything through decimal ASCII text. WTF? Let's
just have one system call (make_XML_SOAP_request) and be done.


Thread overview: 47+ messages
2006-07-10 22:39 [PATCH] sysctl: Document that sys_sysctl will be removed Eric W. Biederman
2006-07-10 22:50 ` Randy.Dunlap
2006-07-11  4:10   ` Eric W. Biederman
2006-07-11  7:07     ` Arjan van de Ven
2006-07-12 16:25       ` [PATCH] Use uname not sysctl to get the kernel revision Eric W. Biederman
2006-07-12 16:50         ` Ulrich Drepper
2006-07-12 17:42           ` Eric W. Biederman
2006-07-12 23:24             ` Theodore Tso
2006-07-12 23:31               ` Andi Kleen
2006-07-13  0:12                 ` Theodore Tso
2006-07-13  2:33                   ` Eric W. Biederman
2006-07-13 12:15                   ` Andi Kleen
2006-07-12 23:44               ` Steve Munroe
2006-07-14 18:49                 ` Benjamin Herrenschmidt
2006-07-14 19:09                   ` Andi Kleen
2006-07-13  0:19               ` Eric W. Biederman
2006-07-12 18:44           ` Roland McGrath
2006-07-12 19:33             ` Ulrich Drepper
2006-07-12 19:53               ` Jakub Jelinek
2006-07-12 20:09                 ` H. Peter Anvin
2006-07-12 21:23                   ` Eric W. Biederman
2006-07-12 21:29                     ` Arjan van de Ven
2006-07-12 21:56                       ` Eric W. Biederman
2006-07-12 22:01                         ` Arjan van de Ven
2006-07-12 22:02                           ` H. Peter Anvin
2006-07-12 22:26                             ` Eric W. Biederman
2006-07-12 22:31                               ` H. Peter Anvin
2006-07-12 23:07                               ` Alan Cox
2006-07-12 23:19                                 ` H. Peter Anvin
2006-07-13 11:15                                   ` Alan Cox
2006-07-14 18:45                                 ` Benjamin Herrenschmidt
2006-07-14 19:11                                   ` H. Peter Anvin
2006-07-12 21:29                     ` H. Peter Anvin
2006-07-12 21:33                     ` Michael Tokarev
2006-07-13  5:17                     ` Ulrich Drepper
2006-07-13  6:27                       ` Ian Wienand
2006-07-13 14:39                       ` Eric W. Biederman
2006-07-13 15:05                         ` Arjan van de Ven
2006-07-13  5:00 Albert Cahalan
2006-07-13  5:42 ` H. Peter Anvin
2006-07-13  6:09   ` Albert Cahalan
2006-07-13  6:13     ` Albert Cahalan
2006-07-13  6:38 ` Eric W. Biederman
2006-07-13 16:15   ` Albert Cahalan
2006-07-13 16:53     ` Eric W. Biederman
2006-07-13 17:06       ` Albert Cahalan
2006-07-13 15:20 ` Eric W. Biederman
