linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Better CLONE_SETTLS support for Hammer
@ 2003-03-05 18:55 Ulrich Drepper
  2003-03-05 19:06 ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-05 18:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi,

Please consider using this patch which changes the way CLONE_SETTLS is
handled on Hammer completely.  The old approach was to slavishly follow
what x86 does with the desastrous result that TCBs (and therefore
stacks) could only be allocated in the low 4GB.  This would have been a
really bad limitation going forward.

But as it turns out the kernel already has support for handling %fs in a
different way, to support prctl(ARCH_SET_FS).  So let's just use the
same mechanism.  clone() will simply take an 64-bit address and use it
as if prctl() was called.

The changes are pretty minimal.  The appened patch is relative to the
current BK sources and they also incorporate the bug fix to use the
correct register for the SETTLS clone() parameter.


- --- arch/x86_64/kernel/process.c	2003-02-23 22:02:23.000000000 -0800
+++ arch/x86_64/kernel/process.c-new	2003-03-05 10:14:46.000000000 -0800
@@ -269,11 +269,18 @@

 	p->thread.rip = (unsigned long) ret_from_fork;

- -	p->thread.fs = me->thread.fs;
+	if ((clone_flags & CLONE_SETTLS) && !test_thread_flag(TIF_IA32)) {
+		if (regs->r8 >= TASK_SIZE)
+			return -EPERM;
+		p->thread.fs = regs->r8;
+		p->thread.fsindex = 0;
+	} else {
+		p->thread.fs = me->thread.fs;
+		asm("movl %%fs,%0" : "=m" (p->thread.fsindex));
+	}
 	p->thread.gs = me->thread.gs;

 	asm("movl %%gs,%0" : "=m" (p->thread.gsindex));
- -	asm("movl %%fs,%0" : "=m" (p->thread.fsindex));
 	asm("movl %%es,%0" : "=m" (p->thread.es));
 	asm("movl %%ds,%0" : "=m" (p->thread.ds));

@@ -291,14 +298,12 @@
 	/*
 	 * Set a new TLS for the child thread?
 	 */
- -	if (clone_flags & CLONE_SETTLS) {
+	if ((clone_flags & CLONE_SETTLS) && test_thread_flag(TIF_IA32)) {
 		struct n_desc_struct *desc;
 		struct user_desc info;
 		int idx;

- -		if (copy_from_user(&info, test_thread_flag(TIF_IA32) ?
- -								  (void *)childregs->rsi :
- -								  (void *)childregs->rdx, sizeof(info)))
+		if (copy_from_user(&info, (void *)childregs->rsi))
 			return -EFAULT;
 		if (LDT_empty(&info))
 			return -EINVAL;


- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+Zkg32ijCOnn/RHQRAr9cAKDL+9bmX26v4P6GRpAq11kzUrgDOwCdFcrk
RO3CMzlLQrEpO28itl0JdxM=
=YxTu
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 18:55 Better CLONE_SETTLS support for Hammer Ulrich Drepper
@ 2003-03-05 19:06 ` Andi Kleen
  2003-03-05 19:25   ` Ulrich Drepper
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andi Kleen @ 2003-03-05 19:06 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 10:55:50AM -0800, Ulrich Drepper wrote:
> Please consider using this patch which changes the way CLONE_SETTLS is
> handled on Hammer completely.  The old approach was to slavishly follow
> what x86 does with the desastrous result that TCBs (and therefore
> stacks) could only be allocated in the low 4GB.  This would have been a
> really bad limitation going forward.

Why stacks? is there a technical reason %fs has to point in the same
area as the stack?

If you just want to save one mmap/allocation: I think the context switch
overhead will be more expensive than the allocation.

> 
> But as it turns out the kernel already has support for handling %fs in a
> different way, to support prctl(ARCH_SET_FS).  So let's just use the
> same mechanism.  clone() will simply take an 64-bit address and use it
> as if prctl() was called.

The problem is that the context switch is much more expensive with that
(wrmsr is quite expensive compared to the memcpy or index reload). The kernel 
optimizes it away when not needed, but with glibc using them 
for everything all processes will switch slower.

I had hoped that user land would prefer fast context switches and
just stuff the index tables for the thread local data into the 
first 2GB (using MAP_32BIT). Yes limiting the thread stacks to 4GB 
would be bad I agree, but is it that big a problem to split the
index table for thread local data and the stack? 

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 19:06 ` Andi Kleen
@ 2003-03-05 19:25   ` Ulrich Drepper
  2003-03-05 21:19     ` Andi Kleen
  2003-03-05 19:32   ` Ulrich Drepper
  2003-03-06  5:33   ` H. Peter Anvin
  2 siblings, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-05 19:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:

> The problem is that the context switch is much more expensive with that
> (wrmsr is quite expensive compared to the memcpy or index reload). The kernel 
> optimizes it away when not needed, but with glibc using them 
> for everything all processes will switch slower.

And the loadsegment() call with all the preparations if faster?  And
faster in future revisions of the processor?  Since I cannot get any
recent kernel to run you'll have to do the timing.  I wouldn't expect
the difference to be significant.


>  but is it that big a problem to split the
> index table for thread local data and the stack? 

Yes, it it.  It would basically double thread create-destroy costs.
double the internal administrative overhead (and time and memory), would
add more dcache pressure, and so on.  It is simply stupid.  We don't
have to do it for any other architecture, so don't force such hacks on
us for an architecture whose lifespan just starts.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+Zk8m2ijCOnn/RHQRAiUWAJ4o3akYg11tw4PGoIrqln3r/9v4kQCgm+MD
kcrsGMVa0Z++yccEkxolxX8=
=gHz/
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 19:06 ` Andi Kleen
  2003-03-05 19:25   ` Ulrich Drepper
@ 2003-03-05 19:32   ` Ulrich Drepper
  2003-03-05 21:21     ` Andi Kleen
  2003-03-06  5:33   ` H. Peter Anvin
  2 siblings, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-05 19:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:

> If you just want to save one mmap/allocation: I think the context switch
> overhead will be more expensive than the allocation.

And two more things:

1. every process will already use the prctl(ARCH_SET_FS).  We are
talking here only about the 2nd thread and later.  We need to use
prctl(ARCH_SET_FS) for TLS support.


2. May I remind you that you were in the crowd who complained when we
requested a dedicated thread register?  Yes, I still would prefer that
but I have no energy to battle for that.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+ZlDV2ijCOnn/RHQRAncTAJ9TOHCpPZKWqD/1BeZVpzGenYvtZACcDaYJ
Y1kyJd90RT60PpyjnR9q1Mg=
=WADr
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 19:25   ` Ulrich Drepper
@ 2003-03-05 21:19     ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2003-03-05 21:19 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 11:25:26AM -0800, Ulrich Drepper wrote:
> > The problem is that the context switch is much more expensive with that
> > (wrmsr is quite expensive compared to the memcpy or index reload). The kernel 
> > optimizes it away when not needed, but with glibc using them 
> > for everything all processes will switch slower.
> 
> And the loadsegment() call with all the preparations if faster?  And

loadsegment and preparation has to be done anyways for compatibility
(we tried to do that lazy too, but failed) 

The 64bit base switch is additional cost.

> 
> >  but is it that big a problem to split the
> > index table for thread local data and the stack? 
> 
> Yes, it it.  It would basically double thread create-destroy costs.
> double the internal administrative overhead (and time and memory), would
> add more dcache pressure, and so on.  It is simply stupid.  We don't
> have to do it for any other architecture, so don't force such hacks on
> us for an architecture whose lifespan just starts.

I would definitely prefer double thread-create/delete costs over even
slightly higher context switch costs. Compared to a context switch a 
thread creation or deletion is a once-in-a-millenium event.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 19:32   ` Ulrich Drepper
@ 2003-03-05 21:21     ` Andi Kleen
  2003-03-05 23:04       ` Ulrich Drepper
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2003-03-05 21:21 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 11:32:36AM -0800, Ulrich Drepper wrote:
> > If you just want to save one mmap/allocation: I think the context switch
> > overhead will be more expensive than the allocation.
> 
> And two more things:
> 
> 1. every process will already use the prctl(ARCH_SET_FS).  We are
> talking here only about the 2nd thread and later.  We need to use
> prctl(ARCH_SET_FS) for TLS support.

Not when you put it into the first four GB. Then you can use the same
calls as 32bit.

> 2. May I remind you that you were in the crowd who complained when we
> requested a dedicated thread register?  Yes, I still would prefer that
> but I have no energy to battle for that.

I don't see the connection.


-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 21:21     ` Andi Kleen
@ 2003-03-05 23:04       ` Ulrich Drepper
  2003-03-06  1:05         ` Andi Kleen
  2003-03-06  2:08         ` Benjamin LaHaise
  0 siblings, 2 replies; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-05 23:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:

>>1. every process will already use the prctl(ARCH_SET_FS).  We are
>>talking here only about the 2nd thread and later.  We need to use
>>prctl(ARCH_SET_FS) for TLS support.
> 
> 
> Not when you put it into the first four GB. Then you can use the same
> calls as 32bit.

Show me numbers.  Or even better: fix the kernel so that I can run it
myself.  What is the time difference between using the segment register
switching with all the different allocated thread areas version using wrmsr.


*Iff* using the MSR is slower, as far as I'm concerned the
set_thread_area syscall should complete go away from x86-64.  Require
the use of prctl to get and set the base address.  Then internally in
the prctl call map it to either the use of a 32 base address segment or
use the MSR.

This way whoever needs a segment base address can preferably allocate it
in the low 4GB, but if it fails the kernel support still work.  And with
the same interface.  Currently this is not the case and this is not
acceptable.


>>2. May I remind you that you were in the crowd who complained when we
>>requested a dedicated thread register?  Yes, I still would prefer that
>>but I have no energy to battle for that.
> 
> 
> I don't see the connection.

I wouldn't want to either in your position.  You caused this whole mess
and now you're not willing to fix it.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+ZoJn2ijCOnn/RHQRAtZRAKCFYu1q0IX92o7axPFqK8YuYIdISACfaA1M
TnmREkEevHxAfrhNWYk9uZs=
=ivkl
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 23:04       ` Ulrich Drepper
@ 2003-03-06  1:05         ` Andi Kleen
  2003-03-06  3:53           ` Ulrich Drepper
  2003-03-06  4:14           ` Ulrich Drepper
  2003-03-06  2:08         ` Benjamin LaHaise
  1 sibling, 2 replies; 18+ messages in thread
From: Andi Kleen @ 2003-03-06  1:05 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 03:04:07PM -0800, Ulrich Drepper wrote:
> >>1. every process will already use the prctl(ARCH_SET_FS).  We are
> >>talking here only about the 2nd thread and later.  We need to use
> >>prctl(ARCH_SET_FS) for TLS support.
> > 
> > 
> > Not when you put it into the first four GB. Then you can use the same
> > calls as 32bit.
> 
> Show me numbers.  Or even better: fix the kernel so that I can run it

You want the change, not me ;)

I did benchmark it some time ago (cannot share numbers sorry) and it 
didn't look like a good idea.

> myself.  What is the time difference between using the segment register
> switching with all the different allocated thread areas version using wrmsr.

It should already work on the current kernel, modulo clone.
(but arch_prctl, set_thread_area in 2.5, ldt in 2.4 etc.) 

> 
> 
> *Iff* using the MSR is slower, as far as I'm concerned the
> set_thread_area syscall should complete go away from x86-64.  Require

It's needed for 32bit emulation at least. The code is 100% shared
between the emulation and the native 64bit model.
In theory it could be removed from the system call table for 64bit
but there didn't seem a good reason to do so - after all 64bit programs
can put their thread local data into the first 4GB and 
fast context switches.

> the use of prctl to get and set the base address.  Then internally in
> the prctl call map it to either the use of a 32 base address segment or
> use the MSR.

The problem is that the 64bit base has different semantics. 

When you use a segment register you have to do:

	call kernel to set gdt/ldt
	movl index,%%fs

But when the kernel did set the 64bit base in the kernel call the
following movl to the selector would destroy it again

Loading the index inside the system call would also be problematic:
First it would be different from what i386 does, causing porting headaches.
Also you could not easily do it from a different thread unlike the 
LDT load.
	

> This way whoever needs a segment base address can preferably allocate it
> in the low 4GB, but if it fails the kernel support still work.  And with
> the same interface.  Currently this is not the case and this is not
> acceptable.

That should already work and it is in fact how I imagined this to be: 

do MAP_32BIT - if yes use set_thread_area or an LDT entry;

if not use arch_prctl

The NPTL signal race problem for the clones in case you have a 64bit
base is a bit ugly though I agree.

I don't like your patch currently because it'll guarantee slow 
context switch times for 64bit.

Automatic switching based on the set bits in the base may be possible 
(in fact I had something like this in set_thread_area for some time, but 
removed it because of the ugly semantics because set_thread_area doesn't
already load the selector). If the selector load is forced
in clone however it would not be as weird, just only somewhat
ugly. You'll have to guarantee in user space then that you don't
reload it.

Real solution would be Windowish - Create clone7() with both
selectors and bases 

[I suspect 2.8 and 3.0 will get that anyways as experience
on other operating systems who started on the same path 
shows. e.g. AmigaOS grew more CreateTask
variants with more arguments with each release until they eventually
settled on passing tag lists.]

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 23:04       ` Ulrich Drepper
  2003-03-06  1:05         ` Andi Kleen
@ 2003-03-06  2:08         ` Benjamin LaHaise
  2003-03-06  3:52           ` Ulrich Drepper
  1 sibling, 1 reply; 18+ messages in thread
From: Benjamin LaHaise @ 2003-03-06  2:08 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 03:04:07PM -0800, Ulrich Drepper wrote:
> I wouldn't want to either in your position.  You caused this whole mess
> and now you're not willing to fix it.

I disagree.  We don't need to use a segment register to get the thread 
library equivalent of "current".  Changing the segment registers on a 
per process basis is a waste of time in the context switch.  Instead, 
making x86-64 TLS support based off of the stack pointer, or even using 
a fixed per-cpu segment register such that gs:0 holds the pointer to the 
thread "current" would be better.  Make the users of threads suffer, not 
every single application and syscall in the system.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  2:08         ` Benjamin LaHaise
@ 2003-03-06  3:52           ` Ulrich Drepper
  2003-03-06  5:29             ` Benjamin LaHaise
  0 siblings, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-06  3:52 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andi Kleen, Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Benjamin LaHaise wrote:

> Instead, 
> making x86-64 TLS support based off of the stack pointer, or even using 
> a fixed per-cpu segment register such that gs:0 holds the pointer to the 
> thread "current" would be better.

Gee, here is somebody who knows about thread APIs and ABIs.  You're
really embarrassing yourself.


> Make the users of threads suffer, not 
> every single application and syscall in the system.

Whether you like it or not, people are using threads.  TLS requires a
thread register even in single-threaded applications.  The mechanism I
proposed would use segments if <4GB addresses can be allocated,
otherwise fall back to prctl(ARCH_SET_FS).  This is about as good as you
can get it.

Remove anything and hammer is the only architecture without good thread
support which undoubtedly makes you happy but almost nobody else.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+ZsX02ijCOnn/RHQRAv87AJ4hHmL0N8ckGGsqROBk439jiUPAVgCgi/1K
NayzBhn7WQwjGATGaDzv0Pc=
=JXCv
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  1:05         ` Andi Kleen
@ 2003-03-06  3:53           ` Ulrich Drepper
  2003-03-06  4:14           ` Ulrich Drepper
  1 sibling, 0 replies; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-06  3:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:

> You want the change, not me ;)

But I cannot test it since the kernel doesn't work.


> It should already work on the current kernel, modulo clone.
> (but arch_prctl, set_thread_area in 2.5, ldt in 2.4 etc.) 

I cannot confirm this.  I wasted a lot of time on getting it to work.
Without avail.


> It's needed for 32bit emulation at least. The code is 100% shared
> between the emulation and the native 64bit model.
> In theory it could be removed from the system call table for 64bit
> but there didn't seem a good reason to do so - after all 64bit programs
> can put their thread local data into the first 4GB and 
> fast context switches.
> 
> 
>>the use of prctl to get and set the base address.  Then internally in
>>the prctl call map it to either the use of a 32 base address segment or
>>use the MSR.
> 
> 
> The problem is that the 64bit base has different semantics. 
> 
> When you use a segment register you have to do:
> 
> 	call kernel to set gdt/ldt
> 	movl index,%%fs
> 
> But when the kernel did set the 64bit base in the kernel call the
> following movl to the selector would destroy it again
> 
> Loading the index inside the system call would also be problematic:
> First it would be different from what i386 does, causing porting headaches.
> Also you could not easily do it from a different thread unlike the 
> LDT load.
> 	
> 
> 
>>This way whoever needs a segment base address can preferably allocate it
>>in the low 4GB, but if it fails the kernel support still work.  And with
>>the same interface.  Currently this is not the case and this is not
>>acceptable.
> 
> 
> That should already work and it is in fact how I imagined this to be: 
> 
> do MAP_32BIT - if yes use set_thread_area or an LDT entry;
> 
> if not use arch_prctl
> 
> The NPTL signal race problem for the clones in case you have a 64bit
> base is a bit ugly though I agree.
> 
> I don't like your patch currently because it'll guarantee slow 
> context switch times for 64bit.
> 
> Automatic switching based on the set bits in the base may be possible 
> (in fact I had something like this in set_thread_area for some time, but 
> removed it because of the ugly semantics because set_thread_area doesn't
> already load the selector). If the selector load is forced
> in clone however it would not be as weird, just only somewhat
> ugly. You'll have to guarantee in user space then that you don't
> reload it.
> 
> Real solution would be Windowish - Create clone7() with both
> selectors and bases 
> 
> [I suspect 2.8 and 3.0 will get that anyways as experience
> on other operating systems who started on the same path 
> shows. e.g. AmigaOS grew more CreateTask
> variants with more arguments with each release until they eventually
> settled on passing tag lists.]
> 
> -Andi
> 


- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+ZsZE2ijCOnn/RHQRAphoAJ9YRohA3FrNkAWrTlk0nigBj1/NCwCdGmkR
uxv9VRkBY//SftCcmk2KwgQ=
=W1al
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  1:05         ` Andi Kleen
  2003-03-06  3:53           ` Ulrich Drepper
@ 2003-03-06  4:14           ` Ulrich Drepper
  2003-03-06 10:27             ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-06  4:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[Dammit, send prematurely.  Ignore the first copy.]

Andi Kleen wrote:

> You want the change, not me ;)

But I cannot test it since the kernel doesn't work.


> It should already work on the current kernel, modulo clone.
> (but arch_prctl, set_thread_area in 2.5, ldt in 2.4 etc.)

I cannot confirm this.  I wasted a lot of time on getting it to work.
Without avail.


> It's needed for 32bit emulation at least. The code is 100% shared
> between the emulation and the native 64bit model.

That's irrelevant.  Nobody needs the interface in 64-bit mode if there
is the prctl() code.  Yes, there will be one more if() in copy_thread,
but so what?


> In theory it could be removed from the system call table for 64bit
> but there didn't seem a good reason to do so - after all 64bit programs
> can put their thread local data into the first 4GB and
> fast context switches.

This limits the scalability.  Forever.


> The problem is that the 64bit base has different semantics.
>
> When you use a segment register you have to do:
>
> 	call kernel to set gdt/ldt
> 	movl index,%%fs
>
> But when the kernel did set the 64bit base in the kernel call the
> following movl to the selector would destroy it again

And the problem is?  Nobody must mug around with the segment registers
without knowing what s/he does.

If you want to change the %fs/%gs value temporarily follow this receipe:


  if (%[fg]s index is zero)
    saveidx = 0
    saveaddr = prctl (ARCH_GET_[FG]S)
  else
    saveidx = index of %[fg]s

  [...do your work...]

  if (saveidx == 0)
    prctl(ARCH_SET_[FG]S, saveaddr)
  else
    %[fg]s = saveindex


It's nothing you cannot handle from userspace.


> Loading the index inside the system call would also be problematic:
> First it would be different from what i386 does, causing porting
headaches.

Every program needs porting.  Normal memory allocation isn't guaranteed
to yield addresses < 4GB and therefore the use as a segment base address
fails.  Beside that, any program which modifies segment registers (I
cannot even think of more than two) are special and already have
portability code included if they run on machines != x86.


> Also you could not easily do it from a different thread unlike the
> LDT load.

Who says something about changing the LDT handling?  The set_thread_area
syscall handles the GDT.  And you cannot modify the GDT address
registered with a given index in another thread or process except with
ptrace().  Not even on x86.


> That should already work and it is in fact how I imagined this to be:
>
> do MAP_32BIT - if yes use set_thread_area or an LDT entry;
>
> if not use arch_prctl
>
> The NPTL signal race problem for the clones in case you have a 64bit
> base is a bit ugly though I agree.

You don't need two interface.  Make prctl() do it automatically.  It has
all the info it needs.  Forget about the set_thread_area syscall in
64-bit mode and simply use one fixed GDT entry in case the address
passed to pcrtl() is small enough.  Same for clone(): the SETTLS
parameter shole be a simple address.  Treat it as passed to prctl() and
use a segment or the MSR.

Te only remaining question is who to load the segment register.   There
are two possibilities:

- - have prctl() return the index and expect the user to load it.  This is
  slightly binary incompatible (existing code depends on no such
  requirement).  It could be solved by introducing ARCH_SET_FS_AUTO or
  so;

- - automatically load the %fs or %gs register with the correct value
  before returning from prctl().  This introduces no binary
  incompatibilities and it's really the expected behavior.


If you don't want to do the work help me to get 2.5 running on my
machine and I'll come up with a patch.

- --
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+Zssb2ijCOnn/RHQRAubyAJ4lHZHO0ZFgHdWOab1mnezIJ1k/KQCggmOU
WXZQEdvrAZsGhbwNzskoXX4=
=m9SY
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  3:52           ` Ulrich Drepper
@ 2003-03-06  5:29             ` Benjamin LaHaise
  2003-03-06  5:47               ` Ulrich Drepper
  0 siblings, 1 reply; 18+ messages in thread
From: Benjamin LaHaise @ 2003-03-06  5:29 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 07:52:20PM -0800, Ulrich Drepper wrote:
> Benjamin LaHaise wrote:
> 
> > Instead, 
> > making x86-64 TLS support based off of the stack pointer, or even using 
> > a fixed per-cpu segment register such that gs:0 holds the pointer to the 
> > thread "current" would be better.
> 
> Gee, here is somebody who knows about thread APIs and ABIs.  You're
> really embarrassing yourself.

If you're saying that drawing an analagy between the kernel's concept 
of the current task pointer and thread APIs is invalid, I'm rather 
surprised.  Ulrich, comments like this are unnecessarily inflammatory 
and don't lead to a productive discussion of the merits and failings 
of specific points.

> > Make the users of threads suffer, not 
> > every single application and syscall in the system.
> 
> Whether you like it or not, people are using threads.  TLS requires a
> thread register even in single-threaded applications.  The mechanism I
> proposed would use segments if <4GB addresses can be allocated,
> otherwise fall back to prctl(ARCH_SET_FS).  This is about as good as you
> can get it.

My position is that requiring a thread register for unthreaded applications 
is unnecessary bloat.  Since the thread register is yet another register 
that must be saved and reloaded on context switch, while it provides no 
improvements in performance or functionality for single threaded programs, 
it is pure bloat.  Every single change like this that slows down unthreads 
apps goes directly onto the boot and startup time of your system and 
various applications.  Think about it!

> Remove anything and hammer is the only architecture without good thread
> support which undoubtedly makes you happy but almost nobody else.

I'm trying to help reach a point of consensus where you are failing to 
see Andi's point: using the TLS segment everywhere slows everything down 
a tiny bit.  Don't do it for unthreaded programs.  We're trying to avoid 
pessimizations while we still can for a new architecture.  I fail to see 
why you are not listening to this.  Yes, threaded apps exist, but their 
existance does not preclude the slowing down of unthreaded apps.

		-ben
-- 
Junk email?  <a href="mailto:aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-05 19:06 ` Andi Kleen
  2003-03-05 19:25   ` Ulrich Drepper
  2003-03-05 19:32   ` Ulrich Drepper
@ 2003-03-06  5:33   ` H. Peter Anvin
  2 siblings, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2003-03-06  5:33 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20030305190622.GA5400@wotan.suse.de>
By author:    Andi Kleen <ak@suse.de>
In newsgroup: linux.dev.kernel
> > 
> > But as it turns out the kernel already has support for handling %fs in a
> > different way, to support prctl(ARCH_SET_FS).  So let's just use the
> > same mechanism.  clone() will simply take an 64-bit address and use it
> > as if prctl() was called.
> 
> The problem is that the context switch is much more expensive with
> that (wrmsr is quite expensive compared to the memcpy or index
> reload). The kernel optimizes it away when not needed, but with
> glibc using them for everything all processes will switch slower.
> 

This is almost certainly the biggest brainfuck in the x86-64
architecture.  It should have been supported to set the segment
registers via "movq rm64,%fs|gs" (i.e. REX64 MOV sr,r/m64).  Barring
that, it would have been better if *all* setting of the segment
registers from userspace had been completely outlawed, so the kernel
at least could have tracked the usage.  The combination is in so many
ways worse than ever.

IMNSHO we should assume in the ABI that this will be fixed at some
point, and therefore we shouldn't work too hard to create kluges that
are based on warts in the first x86-64 generation (K8) only, that
later can't be fixed.

There is plenty of reason to believe the x86-64 architecture will be
around for a long time to come.  It is more important to create sane
ABIs than it is for those ABIs to be microoptimized to the first
generation of x86-64 processors.

	-hpa

-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  5:29             ` Benjamin LaHaise
@ 2003-03-06  5:47               ` Ulrich Drepper
  0 siblings, 0 replies; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-06  5:47 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Andi Kleen, Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Benjamin LaHaise wrote:
> using the TLS segment everywhere slows everything down
> a tiny bit.  Don't do it for unthreaded programs.

You completely (still!) fail to understand the issue.  How can I take
somebody who suggest using the stack as a pseudo thread register
serious?  You don't see the impact of this so you don't recognize who
absolutely absurd this is.

Wrt inthreaded apps: either TLS is used everywhere or not at all.
Single threaded code uses library code which relies on TLS.  And since
TLS is part of the ABI there is not question about the "not at all".
The remaining issue is how to do it with the least impact.  For this
I've proposed a method which does this (according to Andi's
measurements) with the least impact.

And nobody forces you to use the standard runtime environment.  Go on,
create your own.  Then you won't have any penalties except one single
'if' in the context switching code which you hopefully can live with.
Mark it with unlikely() for all I care.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+ZuEB2ijCOnn/RHQRAnlwAJ4kcZ7FbESC+FIsOyn6Ia0wN8FskgCgvh/K
SR1Ki1CnTe2QXq0Gn7TsvAY=
=jJ0e
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06  4:14           ` Ulrich Drepper
@ 2003-03-06 10:27             ` Andi Kleen
  2003-03-06 18:58               ` Ulrich Drepper
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2003-03-06 10:27 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Andi Kleen, Linux Kernel

On Wed, Mar 05, 2003 at 08:14:18PM -0800, Ulrich Drepper wrote:
> 
> > It should already work on the current kernel, modulo clone.
> > (but arch_prctl, set_thread_area in 2.5, ldt in 2.4 etc.)
> 
> I cannot confirm this.  I wasted a lot of time on getting it to work.
> Without avail.

We're using %fs enabled glibc (currently with arch_prctl, but I would
like to change that because it's slow) 

> And the problem is?  Nobody must mug around with the segment registers
> without knowing what s/he does.

It's hardcoding the magic fs register in the kernel for once.

> You don't need two interface.  Make prctl() do it automatically.  It has
> all the info it needs.  Forget about the set_thread_area syscall in
> 64-bit mode and simply use one fixed GDT entry in case the address
> passed to pcrtl() is small enough.  Same for clone(): the SETTLS
> parameter shole be a simple address.  Treat it as passed to prctl() and
> use a segment or the MSR.

I had some code like this for some time - not in prctl, but in set_thread_area,
but I removed it because the selector messing looked too ugly.

But that was before prctl reloaded the selector forcefully to zero.
That was later changed to fix another bug.

Now with it getting reloaded it would make sense to set it in the GDT
too if possible, I agree. I'll implement that.

It will also transparent speed up the glibcs already using arch_prctl.
I like that.

I can do a similar thing in clone. It unfortuately also hardcodes fs there,
but I guess that ugly hack will be needed to get the broken NPTL design for this
to work.

You just have to guarantee from user space that you don't do nasty
things with the selector.

> 
> - - have prctl() return the index and expect the user to load it.  This is
>   slightly binary incompatible (existing code depends on no such
>   requirement).  It could be solved by introducing ARCH_SET_FS_AUTO or
>   so;
> 
> - - automatically load the %fs or %gs register with the correct value
>   before returning from prctl().  This introduces no binary
>   incompatibilities and it's really the expected behavior.

It's already done (set to zero) to not confuse the lazy switch logic.

> 
> 
> If you don't want to do the work help me to get 2.5 running on my
> machine and I'll come up with a patch.

2.5.64 currently doesn't boot (known issue); 2.5.63 works however.
I'll look into the .64 problems later today and put a fix onto the
usual place when done (ftp://ftp.x86-64.org/pub/linux/v2.5/) 

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06 10:27             ` Andi Kleen
@ 2003-03-06 18:58               ` Ulrich Drepper
  2003-03-06 19:09                 ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Ulrich Drepper @ 2003-03-06 18:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andi Kleen wrote:

> We're using %fs enabled glibc (currently with arch_prctl, but I would
> like to change that because it's slow) 

It's already part of the documented ABI.  Yes, if right now somebody
said r11 or whatever is declared the thread register, we could change
it.  But give it a few more weeks and it'll probably be impossible.


> It's hardcoding the magic fs register in the kernel for once.

The kernel just corresponds to what userland expects.  It's the same for
otherarchitectures where clone() loads the appropriate register.


> Now with it getting reloaded it would make sense to set it in the GDT
> too if possible, I agree. I'll implement that.

I'm not sure what your plans wrt to set_thread_area in 64-bit mode are
but I would strongly suggest to remove it completely.


> It will also transparent speed up the glibcs already using arch_prctl.
> I like that.

Yes.


> I can do a similar thing in clone. It unfortuately also hardcodes fs there,
> but I guess that ugly hack will be needed to get the broken NPTL design for this
> to work.

You don't understand threads.  There is nothing broken or NPTL-specific
about the design.  Every thread creation mechanism must be signal save.
 LinuxThreads's is not, which is a bit less of a problem since the
signal handling is broken, too, and signals are unlikely to arrive at
the new thread. Nevertheless, there are certainly spurious crashes which
are hard to reproduce and explain which can be attributed to this problem.

With fixed signal handling the newly created thread needs to have a
valid and correct thread register from the first instruction on and this
can only be implemented in the kernel.  Again, the other archs implement
similar dependencies.  ia-64 loads the tp register with the given value.
 The kernel doesn't live in a vacuum so this kind of dependency is OK.


> You just have to guarantee from user space that you don't do nasty
> things with the selector.

Sure.  Write this in the ABI docs.  Or leave the whole segment register
handling undefined so that nobody gets the idea it's useful to use.


>>- - automatically load the %fs or %gs register with the correct value
>>  before returning from prctl().  This introduces no binary
>>  incompatibilities and it's really the expected behavior.
> 
> 
> It's already done (set to zero) to not confuse the lazy switch logic.

Switching to zero is only correct when using the MSR.  I'm talking about
the case when the value is used as the base address of a segment.  In
that case %fs has to be loaded with the value corresponding to the
apppropriate GDT index.  The loading can happen in the kernel (or when
returning from it) or left to the user.  I think the former is better
since it corresponds more cleanly to what happens if the MSR is used.
If you don't want it return the GDT index and the user code will have to
issue the mov instruction if the return index is != 0 (indicating the
MSR is used).


> 2.5.64 currently doesn't boot (known issue); 2.5.63 works however.
> I'll look into the .64 problems later today and put a fix onto the
> usual place when done (ftp://ftp.x86-64.org/pub/linux/v2.5/) 

Thanks, looking forward to that.

- -- 
- --------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+Z5pV2ijCOnn/RHQRAsmhAJ9A9JXlw6YJ/5RAG+Y5NF708UChVwCgv7Cc
1JoFN0hrXYjZ7eynW3Nyj8I=
=+qtg
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Better CLONE_SETTLS support for Hammer
  2003-03-06 18:58               ` Ulrich Drepper
@ 2003-03-06 19:09                 ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2003-03-06 19:09 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: Linux Kernel

On Thu, 2003-03-06 at 19:58, Ulrich Drepper wrote:

> 
> It's already part of the documented ABI.  Yes, if right now somebody
> said r11 or whatever is declared the thread register, we could change
> it.  But give it a few more weeks and it'll probably be impossible

It's already pretty much impossible.


> You don't understand threads.  There is nothing broken or NPTL-specific
> about the design.  Every thread creation mechanism must be signal save.
>  LinuxThreads's is not, which is a bit less of a problem since the
> signal handling is broken, too, and signals are unlikely to arrive at
> the new thread. Nevertheless, there are certainly spurious crashes which
> are hard to reproduce and explain which can be attributed to this problem.

I would have designed it with CLONE_BLOCKALLSIGNALS and then set it in
the target process before unblocking signals. This would be in the 
spirit of fork/exec and also what posix APIs do for similar
problems (like pselect)

But of course it's too late now. We'll have to live with this
ugly thing.

Just hope you won't run out of the argument registers for the next
round of enhancements ;)

-Andi



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2003-03-06 18:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-05 18:55 Better CLONE_SETTLS support for Hammer Ulrich Drepper
2003-03-05 19:06 ` Andi Kleen
2003-03-05 19:25   ` Ulrich Drepper
2003-03-05 21:19     ` Andi Kleen
2003-03-05 19:32   ` Ulrich Drepper
2003-03-05 21:21     ` Andi Kleen
2003-03-05 23:04       ` Ulrich Drepper
2003-03-06  1:05         ` Andi Kleen
2003-03-06  3:53           ` Ulrich Drepper
2003-03-06  4:14           ` Ulrich Drepper
2003-03-06 10:27             ` Andi Kleen
2003-03-06 18:58               ` Ulrich Drepper
2003-03-06 19:09                 ` Andi Kleen
2003-03-06  2:08         ` Benjamin LaHaise
2003-03-06  3:52           ` Ulrich Drepper
2003-03-06  5:29             ` Benjamin LaHaise
2003-03-06  5:47               ` Ulrich Drepper
2003-03-06  5:33   ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).