* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
@ 2010-05-29 10:31 Albert Cahalan
  2010-06-01 19:32   ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 32+ messages in thread
From: Albert Cahalan @ 2010-05-29 10:31 UTC (permalink / raw)
  To: linux-kernel, sukadev, randy.dunlap, linuxppc-dev

Sukadev Bhattiprolu writes:

> Randy Dunlap [randy.dunlap at oracle.com] wrote:
>>> base of the region allocated for stack. These architectures
>>> must pass in the size of the stack-region in ->child_stack_size.
>>
>>                               stack region
>>
>> Seems unfortunate that different architectures use
>> the fields differently.
>
> Yes and no. The field still has a single purpose, just that
> some architectures may not need it. We enforce that if unused
> on an architecture, the field must be 0. It looked like
> the easiest way to keep the API common across architectures.

Yuck. You're forcing userspace to have #ifdef messes or,
more likely, just not work on all architectures. There is
no reason to have field usage vary by architecture. The
original clone syscall was not designed with ia64 and hppa
in mind, and has been causing trouble ever since. Let's not
perpetuate the problem.

Given code like this:   stack_base = malloc(stack_size);
stack_base and stack_size are what the kernel needs.
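
A minimal sketch of that idea, not code from the eclone patches: the
struct stack_region type and the initial_sp() helper are hypothetical
names. It assumes the caller hands over only the two malloc values and
that whoever consumes them derives the initial stack pointer from the
growth direction. Alignment is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: exactly the two values the caller already has. */
struct stack_region {
    void  *base;  /* lowest address of the region (malloc's return value) */
    size_t size;  /* length in bytes (malloc's argument) */
};

/*
 * Initial stack pointer derived from the region.
 * grows_down is nonzero where the stack grows toward lower addresses
 * (e.g. x86) and zero where it grows upward (e.g. hppa).
 */
static uintptr_t initial_sp(struct stack_region r, int grows_down)
{
    assert(r.base != NULL && r.size > 0);
    uintptr_t lo = (uintptr_t)r.base;
    return grows_down ? lo + r.size : lo;
}
```

With base and size kept separate, the addition happens in one place
instead of in every caller.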

I suspect that you chose the defective method for some reason
related to restarting processes that were created with the
older system calls. I can't say most of us even care, but in
that broken-already case your process restarter can make up
some numbers that will work. (for i386, the base could be the
lowest address in the vma in which %esp lies, or even address 0)

A related issue is that stack allocation and deallocation can
be quite painful: it is difficult (some assembly required) to
free one's own stack, and impossible if one is already dead.
We could use a flag to let the kernel handle allocation, with
the stack getting freed just after any ptracer gets a last look.
This issue is especially troublesome for me because the syscall
essentially requires per-thread memory to work; it is currently
extremely difficult to use the syscall in code which lacks that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-05-29 10:31 [PATCH v21 011/100] eclone (11/11): Document sys_eclone Albert Cahalan
@ 2010-06-01 19:32   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 32+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-01 19:32 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel, randy.dunlap, linuxppc-dev

Albert Cahalan [acahalan@gmail.com] wrote:
| Sukadev Bhattiprolu writes:
| 
| > Randy Dunlap [randy.dunlap at oracle.com] wrote:
| >>> base of the region allocated for stack. These architectures
| >>> must pass in the size of the stack-region in ->child_stack_size.
| >>
| >>                               stack region
| >>
| >> Seems unfortunate that different architectures use
| >> the fields differently.
| >
| > Yes and no. The field still has a single purpose, just that
| > some architectures may not need it. We enforce that if unused
| > on an architecture, the field must be 0. It looked like
| > the easiest way to keep the API common across architectures.
| 
| Yuck. You're forcing userspace to have #ifdef messes or,
| more likely, just not work on all architectures.

There is going to be #ifdef code in the library interface to eclone().
But applications should not need any #ifdefs. Please see the test cases
for eclone in

	git://git.sr71.net/~hallyn/cr_tests.git

There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

These use the libeclone.a built from the following git tree, which has the
arch-dependent user-space code.

	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Is that the #ifdef mess you are talking about ? I don't see that as a
consequence of the API. So maybe you can elaborate.

| There is no reason to have field usage vary by architecture. The

The field usage does not vary by architecture. Some architectures
don't use some fields and those fields must be 0. A simple 

	memset(&clone_args, 0, sizeof(clone_args))

before initializing fields is all that is required.
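
As a rough illustration of that pattern (a sketch only): the struct below
is a reduced, assumed layout. Only ->child_stack_size and
->clone_flags_high are named in this thread; the child_stack field and
the init_clone_args() helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed, reduced layout for illustration; not the real kernel header. */
struct clone_args {
    uint64_t clone_flags_high;
    uint64_t child_stack;
    uint64_t child_stack_size;
};

/* Zero everything first, then set only the fields this caller uses. */
static void init_clone_args(struct clone_args *ca, void *stack_top)
{
    assert(ca != NULL);
    memset(ca, 0, sizeof(*ca));
    ca->child_stack = (uint64_t)(uintptr_t)stack_top;
    /* child_stack_size stays 0 where the architecture does not use it. */
}
```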

| original clone syscall was not designed with ia64 and hppa
| in mind, and has been causing trouble ever since. Let's not
| perpetuate the problem.

And a lot of folks contributed to this new API to try to make sure
it is portable and meets the foreseeable requirements.

| 
| Given code like this:   stack_base = malloc(stack_size);
| stack_base and stack_size are what the kernel needs.
| 
| I suspect that you chose the defective method for some reason
| related to restarting processes that were created with the
| older system calls. I can't say most of us even care, but in
| that broken-already case your process restarter can make up
| some numbers that will work. (for i386, the base could be the
| lowest address in the vma in which %esp lies, or even address 0)

I don't understand how "making up some numbers (pids) that will work"
is more portable/cleaner than the proposed eclone(). 

Sukadev

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-01 19:32   ` Sukadev Bhattiprolu
@ 2010-06-01 19:59     ` Albert Cahalan
  -1 siblings, 0 replies; 32+ messages in thread
From: Albert Cahalan @ 2010-06-01 19:59 UTC (permalink / raw)
  To: Sukadev Bhattiprolu; +Cc: linux-kernel, randy.dunlap, linuxppc-dev

On Tue, Jun 1, 2010 at 3:32 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> Albert Cahalan [acahalan@gmail.com] wrote:
> | Sukadev Bhattiprolu writes:
> | > Randy Dunlap [randy.dunlap at oracle.com] wrote:

> | >>> base of the region allocated for stack. These architectures
> | >>> must pass in the size of the stack-region in ->child_stack_size.
> | >>
> | >>                               stack region
> | >>
> | >> Seems unfortunate that different architectures use
> | >> the fields differently.
> | >
> | > Yes and no. The field still has a single purpose, just that
> | > some architectures may not need it. We enforce that if unused
> | > on an architecture, the field must be 0. It looked like
> | > the easiest way to keep the API common across architectures.
> |
> | Yuck. You're forcing userspace to have #ifdef messes or,
> | more likely, just not work on all architectures.
>
> There is going to be #ifdef code in the library interface to eclone().
> But applications should not need any #ifdefs. Please see the test cases
> for eclone in
>
>        git://git.sr71.net/~hallyn/cr_tests.git
>
> There is no #ifdef and the tests work on x86, x86_64, ppc, s390.

Come on, seriously, you know it's ia64 and hppa that
have issues. Maybe the nommu ports also have issues.

The only portable way to specify the stack is base and offset,
with flags or magic values for "share" and "kernel managed".

> | There is no reason to have field usage vary by architecture. The
>
> The field usage does not vary by architecture. Some architectures
> don't use some fields and those fields must be 0.

It looks like you contradict yourself. Please explain how
those two sentences are compatible.

> | original clone syscall was not designed with ia64 and hppa
> | in mind, and has been causing trouble ever since. Let's not
> | perpetuate the problem.
>
> and lot of folks contributed to this new API to try and make sure
> it is portable and meets the foreseeable requirements.

Right, and some folks were ignored.

> | Given code like this:   stack_base = malloc(stack_size);
> | stack_base and stack_size are what the kernel needs.
> |
> | I suspect that you chose the defective method for some reason
> | related to restarting processes that were created with the
> | older system calls. I can't say most of us even care, but in
> | that broken-already case your process restarter can make up
> | some numbers that will work. (for i386, the base could be the
> | lowest address in the vma in which %esp lies, or even address 0)
>
> I don't understand how "making up some numbers (pids) that will work"
> is more portable/cleaner than the proposed eclone().

It isolates the cross-platform problems to an obscure tool
instead of polluting the kernel interface that everybody uses.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-01 19:59     ` Albert Cahalan
@ 2010-06-02  1:38       ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 32+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-02  1:38 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel, randy.dunlap, linuxppc-dev

| Come on, seriously, you know it's ia64 and hppa that
| have issues. Maybe the nommu ports also have issues.
| 
| The only portable way to specify the stack is base and offset,
| with flags or magic values for "share" and "kernel managed".

Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
comes in.

But are you saying that we should force x86 and other architectures to
specify base and offset for eclone() even though they currently specify
just the stack pointer to clone() ?

That would remove the ifdef, but could be a big change to applications
on x86 and other architectures.

| 
| > | There is no reason to have field usage vary by architecture. The
| >
| > The field usage does not vary by architecture. Some architectures
| > don't use some fields and those fields must be 0.
| 
| It looks like you contradict yourself. Please explain how
| those two sentences are compatible.
| 
| > | original clone syscall was not designed with ia64 and hppa
| > | in mind, and has been causing trouble ever since. Let's not
| > | perpetuate the problem.
| >
| > and lot of folks contributed to this new API to try and make sure
| > it is portable and meets the foreseeable requirements.
| 
| Right, and some folks were ignored.

I don't think your comment was ignored. The ->child_stack_size field was
added specifically for IA64 and my understanding was that ->clone_flags_high
could be used to specify the "kernel managed" or "shared" mode you mention
above.

| >
| > I don't understand how "making up some numbers (pids) that will work"
| > is more portable/cleaner than the proposed eclone().
| 
| It isolates the cross-platform problems to an obscure tool
| instead of polluting the kernel interface that everybody uses.

Sure, there was talk about using an approach like /proc/<pid>/next_pid
where you write your target pid into the file and the next time you
fork() you get that target pid. But it was considered racy and ugly.

Sukadev

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-02  1:38       ` Sukadev Bhattiprolu
@ 2010-06-05 11:49         ` Albert Cahalan
  -1 siblings, 0 replies; 32+ messages in thread
From: Albert Cahalan @ 2010-06-05 11:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu; +Cc: linux-kernel, randy.dunlap, linuxppc-dev

On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?

Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).

This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.

> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.

It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.

> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.

Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I didn't mean PID numbers. I meant
stack numbers for processes that your strange tool is restarting.

You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.

For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x20000000 when
restarting the process.
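
The guess described above can be sketched as follows. This is an
illustration only: struct stack_guess and guess_stack() are hypothetical
names, and it assumes the restart tool knows the start of the mapping and
the last stack address.

```c
#include <assert.h>
#include <stdint.h>

struct stack_guess {
    uint64_t base;  /* assumed stack base: start of the enclosing mapping */
    uint64_t size;  /* bytes from that base up to the last stack address */
};

/*
 * vma_start:       lowest address of the mapping that contains the stack
 * last_stack_addr: highest stack address, the only value the old clone()
 *                  call leaves behind on i386
 */
static struct stack_guess guess_stack(uint64_t vma_start,
                                      uint64_t last_stack_addr)
{
    assert(last_stack_addr >= vma_start);
    struct stack_guess g = { vma_start, last_stack_addr - vma_start + 1 };
    return g;
}
```

For the addresses in the example, guess_stack(0x40000000, 0x5fffffff)
gives a base of 0x40000000 and a size of 0x20000000 bytes (512 MiB).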

For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.

Right now user code must be like this:

base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif

The man page is likewise messy.

Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.
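
Until such an API exists, the #ifdef can at least be confined to one
wrapper that takes base and size. The sketch below is hypothetical:
spawn_on_stack(), set_flag() and run_demo() are made-up names, __clone2
is the glibc spelling of clone2 on ia64, and its declaration being
visible through <sched.h> there is an assumption. Only the generic
branch has been thought through in detail.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for clone() in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Callers always pass base and size; the per-arch choice is made once. */
static int spawn_on_stack(int (*fn)(void *), void *base, size_t size,
                          int flags, void *arg)
{
    assert(base != NULL && size > 0);
#if defined(__hppa__)
    return clone(fn, base, flags, arg);                /* stack grows up */
#elif defined(__ia64__)
    return __clone2(fn, base, size, flags, arg);       /* takes base and size */
#else
    return clone(fn, (char *)base + size, flags, arg); /* stack grows down */
#endif
}

/* Child body for the demo: writes through the shared address space. */
static int set_flag(void *arg)
{
    *(volatile int *)arg = 1;
    return 0;
}

/* Returns 1 if the child ran on the malloc'ed stack, -1 on failure. */
static int run_demo(void)
{
    size_t size = 64 * 1024;
    volatile int flag = 0;
    void *base = malloc(size);
    if (base == NULL)
        return -1;
    int tid = spawn_on_stack(set_flag, base, size,
                             CLONE_VM | SIGCHLD, (void *)&flag);
    int ok = tid > 0 && waitpid(tid, NULL, 0) == tid;
    free(base);                /* child has exited (or never started) */
    return ok ? flag : -1;
}
```

Application code then calls spawn_on_stack(fn, base, size, flags, arg)
with the same arguments on every architecture.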

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-02  1:38       ` Sukadev Bhattiprolu
@ 2010-06-05 12:08         ` Albert Cahalan
  -1 siblings, 0 replies; 32+ messages in thread
From: Albert Cahalan @ 2010-06-05 12:08 UTC (permalink / raw)
  To: Sukadev Bhattiprolu; +Cc: linux-kernel, randy.dunlap, linuxppc-dev

On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?

Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).

This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.

> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.

It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.

> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.

Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I wasn't meaning PID numbers. I was meaning
stack numbers for processes that your strange tool is restarting.

You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.

For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x20000000 when
restarting the process.

For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.

Right now user code must be like this:

base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif

The man page is likewise messy.

Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.
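
As a rough sketch of what a base/size interface buys the caller, the #ifdef block above can be pushed into one helper so that user code only ever deals with the two numbers it already has. This is only an illustration: clone_on_region(), demo_child() and demo_spawn() are hypothetical names, and the ia64 branch assumes the clone2 call is reachable under glibc's __clone2 name.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#if defined(__ia64__)
/* Assumption: glibc exports clone2 under this name on ia64. */
extern int __clone2(int (*fn)(void *), void *stack_base, size_t stack_size,
		    int flags, void *arg, ...);
#endif

/* Hypothetical helper: run fn on the stack region [base, base + size). */
static pid_t clone_on_region(int (*fn)(void *), void *base, size_t size,
			     int flags, void *arg)
{
#if defined(__hppa__)
	(void)size;			/* stack grows up: pass the base */
	return clone(fn, base, flags, arg);
#elif defined(__ia64__)
	return __clone2(fn, base, size, flags, arg);
#else
	/* stack grows down: pass the address just past the region */
	return clone(fn, (char *)base + size, flags, arg);
#endif
}

/* Trivial child used only by the demo below. */
static int demo_child(void *arg)
{
	(void)arg;
	return 7;
}

/* Spawn demo_child on a malloc'ed stack and return its exit status. */
static int demo_spawn(void)
{
	size_t size = 64 * 1024;
	void *base = malloc(size);
	int status = 0;
	pid_t pid;

	if (base == NULL)
		return -1;
	pid = clone_on_region(demo_child, base, size, SIGCHLD, NULL);
	if (pid < 0 || waitpid(pid, &status, 0) != pid) {
		free(base);
		return -1;
	}
	free(base);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The caller in demo_spawn() contains no architecture tests at all, which is the property a base/size system call would give every program without the wrapper.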

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-05 12:08         ` Albert Cahalan
@ 2010-06-09 18:14           ` Sukadev Bhattiprolu
  -1 siblings, 0 replies; 32+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-09 18:14 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: linux-kernel, randy.dunlap, linuxppc-dev, hpa, roland, arnd

Albert Cahalan [acahalan@gmail.com] wrote:
| On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
| <sukadev@linux.vnet.ibm.com> wrote:
| > | Come on, seriously, you know it's ia64 and hppa that
| > | have issues. Maybe the nommu ports also have issues.
| > |
| > | The only portable way to specify the stack is base and offset,
| > | with flags or magic values for "share" and "kernel managed".
| >
| > Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
| > comes in.
| >
| > But are you saying that we should force x86 and other architectures to
| > specify base and offset for eclone() even though they currently specify
| > just the stack pointer to clone() ?
| 
| Even for x86, it's an easier API. Callers would be specifying
| two numbers they already have: the argument and return value
| for malloc. Currently the numbers must be added together,
| destroying information, except on hppa (must not add size)
| and ia64 (must use what I'm proposing already).

I agree it's easier and would avoid #ifdefs in the applications.

Peter, Arnd, Roland - do you have any concerns with requiring all
architectures to specify the stack to eclone() as [base, offset]?

To recap, currently we have:

struct clone_args {
	u64 clone_flags_high;
	/*
	 * Architectures can use child_stack for either the stack pointer or
	 * the base of the stack. If child_stack is used as the stack pointer,
	 * child_stack_size must be 0. Otherwise child_stack_size must be
	 * set to the size of the allocated stack.
	 */
	u64 child_stack;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u32 nr_pids;
	u32 reserved0;
};

sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
		pid_t * __user pids)


Most architectures would specify the stack pointer in ->child_stack and
ignore the ->child_stack_size.

IA64 specifies the *stack-base* in ->child_stack and the stack size in
->child_stack_size.

Albert and Randy point out that this would require #ifdefs in the
application code that intends to be portable across say IA64 and x86.

Can we instead have all architectures specify [base, size] ?
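
For illustration only, this is roughly what a caller would fill in if every architecture took [base, size]. The struct mirrors the layout recapped above with userspace integer types; set_child_stack() is a hypothetical helper, and the exact semantics would of course depend on what we end up agreeing on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the struct above (u64 -> uint64_t, u32 -> uint32_t). */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;		/* assumed: base of the stack region */
	uint64_t child_stack_size;	/* assumed: size of the stack region */
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical helper: zero the args (unused and reserved fields must
 * be 0) and record the region exactly as malloc() handed it out.
 */
static void set_child_stack(struct clone_args *ca, void *base, size_t size)
{
	memset(ca, 0, sizeof(*ca));
	ca->child_stack = (uint64_t)(uintptr_t)base;
	ca->child_stack_size = size;
}
```

The same two assignments would then be valid on x86, IA64 and hppa alike, with no pointer arithmetic in the application.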

Thanks

Sukadev

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-09 18:14           ` Sukadev Bhattiprolu
@ 2010-06-09 18:46             ` H. Peter Anvin
  -1 siblings, 0 replies; 32+ messages in thread
From: H. Peter Anvin @ 2010-06-09 18:46 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, roland, arnd

On 06/09/2010 11:14 AM, Sukadev Bhattiprolu wrote:
> | 
> | Even for x86, it's an easier API. Callers would be specifying
> | two numbers they already have: the argument and return value
> | for malloc. Currently the numbers must be added together,
> | destroying information, except on hppa (must not add size)
> | and ia64 (must use what I'm proposing already).
> 
> I agree its easier and would avoid #ifdefs in the applications.
> 
> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]
> 

Makes sense to me.  There might be advantages to be able to track the
size of the "stack allocation" even for other architectures, too.

	-hpa

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-09 18:14           ` Sukadev Bhattiprolu
@ 2010-06-09 22:32             ` Roland McGrath
  -1 siblings, 0 replies; 32+ messages in thread
From: Roland McGrath @ 2010-06-09 22:32 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, hpa, arnd

> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]

I can't see why that would be a problem.  
It's consistent with the sigaltstack interface we already have.
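
For comparison, a minimal sketch of how sigaltstack() already takes a region as base plus size through stack_t (ss_sp and ss_size). The helper names install_alt_stack() and alt_stack_roundtrip() are made up for this example; the sigaltstack() call and the stack_t fields are the standard ones.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Register [base, base + size) as the alternate signal stack. */
static int install_alt_stack(void *base, size_t size)
{
	stack_t ss;

	ss.ss_sp = base;	/* lowest address of the region */
	ss.ss_size = size;	/* length of the region in bytes */
	ss.ss_flags = 0;
	return sigaltstack(&ss, NULL);
}

/*
 * Install a malloc'ed region, read it back, then disable it again.
 * Returns 1 if the kernel reports the same base and size, 0 if not,
 * and -1 on error.
 */
static int alt_stack_roundtrip(size_t size)
{
	void *base = malloc(size);
	stack_t cur, off;
	int same;

	if (base == NULL)
		return -1;
	if (install_alt_stack(base, size) != 0 ||
	    sigaltstack(NULL, &cur) != 0) {
		free(base);
		return -1;
	}
	same = (cur.ss_sp == base && cur.ss_size == size);

	off.ss_sp = NULL;
	off.ss_size = 0;
	off.ss_flags = SS_DISABLE;	/* stop using the region before freeing */
	sigaltstack(&off, NULL);
	free(base);
	return same;
}
```

The caller passes the return value and argument of malloc() unchanged, on every architecture.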


Thanks,
Roland

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-06-09 18:14           ` Sukadev Bhattiprolu
@ 2010-06-10  9:15             ` Arnd Bergmann
  -1 siblings, 0 replies; 32+ messages in thread
From: Arnd Bergmann @ 2010-06-10  9:15 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, hpa, roland

On Wednesday 09 June 2010, Sukadev Bhattiprolu wrote:
> Albert and Randy point out that this would require #ifdefs in the
> application code that intends to be portable across say IA64 and x86.
> 
> Can we instead have all architectures specify [base, size] ?

No objections from me on that.

	Arnd

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
       [not found]     ` <20100505141447.fc2397f6.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2010-05-05 22:25       ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 32+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-05 22:25 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxppc-dev-mnsaURCQ41sdnm+yROfE0A,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Pavel Emelyanov

Randy Dunlap [randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org] wrote:
| > +		base of the region allocated for stack. These architectures
| > +		must pass in the size of the stack-region in ->child_stack_size.
| 
| 		                             stack region
| 
| Seems unfortunate that different architectures use the fields differently.

Yes and no. The field still has a single purpose, just that some architectures
may not need it. We enforce that if unused on an architecture, the field must
be 0. It looked like the easiest way to keep the API common across
architectures.

| 
| Is this example program meant to build only on i386?

Yes. Will add a pointer to the clone*.[chS] and libeclone.a files in

	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

for other architectures (currently x86_64, ppc, s390).

Thanks for the review. Will fix the errors and repost.

Sukadev

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
       [not found]   ` <1272723382-19470-12-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-05-05 21:14     ` Randy Dunlap
  0 siblings, 0 replies; 32+ messages in thread
From: Randy Dunlap @ 2010-05-05 21:14 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA, Sukadev Bhattiprolu,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxppc-dev-mnsaURCQ41sdnm+yROfE0A,
	Pavel-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Andrew Morton,
	Emelyanov, Serge-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Sat,  1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> This gives a brief overview of the eclone() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Acked-by: Oren Laadan  <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> ---
>  Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 348 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/eclone
> 
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does, the

	                                that the clone()

> +	eclone() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid namespaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint. Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.

	in the existing

> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When eclone() supports more than 32 flags, the additional bits
> +		in the clone_flags should be specified in this field. This
> +		field is currently unused and must be set to 0.
> +
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields in
> +		clone() and clone2() (on IA64) system calls. The usage of
> +		these two fields depends on the processor architecture.
> +
> +		Most architectures use ->child_stack to pass-in a stack-pointer

		                                     to pass in

> +		itself and don't need the ->child_stack_size field. On these
> +		architectures the ->child_stack_size field must be 0.
> +
> +		Some architectures, eg IA64, use ->child_stack to pass-in the

		                    e.g.                        to pass in

> +		base of the region allocated for stack. These architectures
> +		must pass in the size of the stack-region in ->child_stack_size.

		                             stack region

Seems unfortunate that different architectures use the fields differently.

> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call

		                                      system call.

> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to eclone() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the

		                                                  i.e.

> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 reserved0;
> +	u64 reserved1;
> +
> +		These fields are intended to extend the functionality of the
> +		eclone() in the future, while preserving backward compatibility.
> +		They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.

> +	The @cargs_size parameter specifes the sizeof(struct clone_args) and
> +	is intended to enable extending this structure in the future, while
> +	preserving backward compatibility.  For now, this field must be set
> +	to the sizeof(struct clone_args) and this size must match the kernel's
> +	view of the structure.
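
Since the size has to match exactly, the reserved1 discrepancy noted above matters. A minimal sketch, assuming the struct as declared at the top of the file (without reserved1) is the kernel's view; cargs_size_ok() is a hypothetical user-space mirror of the check described here, not a function from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct clone_args declared at the top of the document. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/* Five 64-bit fields plus two 32-bit fields: 5 * 8 + 2 * 4 = 48 bytes. */
_Static_assert(sizeof(struct clone_args) == 48, "unexpected clone_args size");
_Static_assert(offsetof(struct clone_args, nr_pids) == 40,
	       "unexpected nr_pids offset");

/* Hypothetical check: the size passed in must match the struct exactly. */
static int cargs_size_ok(int cargs_size)
{
	return cargs_size == (int)sizeof(struct clone_args);
}
```

A caller built against a struct that also carried a u64 reserved1 would pass 56 and fail this kind of check.
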
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid namespaces. The
> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid

	                         of the clone(2)

> +	namespaces.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. i.e if the process is nested in

	                 the youngest namespace. I.e.

> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).
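
The ordering rule above can be sketched as follows. expand_pids() is a hypothetical helper, not kernel code; it assumes level 0 is init_pid_ns, so a process at @level has level + 1 pids, and it also applies the nr_pids limit described earlier.

```c
#include <sys/types.h>

/*
 * Hypothetical helper: expand @pids (oldest namespace first) into one
 * entry per level, 0..@level, where 0 means "let the kernel choose".
 * Returns 0 on success, or -1 if nr_pids exceeds the nesting level.
 */
static int expand_pids(const pid_t *pids, unsigned int nr_pids,
		       unsigned int level, pid_t *per_level)
{
	unsigned int nr_levels = level + 1;
	unsigned int first, i;

	if (nr_pids > nr_levels)
		return -1;

	/* The given pids cover the youngest nr_pids levels. */
	first = nr_levels - nr_pids;
	for (i = 0; i < nr_levels; i++)
		per_level[i] = (i < first) ? 0 : pids[i - first];

	return 0;
}
```

With level 6 and three pids, pids[0] lands on level 4, pids[1] on level 5 and pids[2] on level 6, while levels 0 through 3 get 0.
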
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, eclone() returns -1 and sets 'errno' to one of the following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specified,
> +		CAP_SYS_ADMIN is not required).
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process

		                                    process.

> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field)

		                                         field).

> +
> +	EBUSY	A requested pid is in use by another process in that namespace.
> +
> +---
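
For callers, the error list above boils down to three errno values, with EINVAL covering four distinct causes. A minimal sketch of a message helper, where eclone_strerror() is a hypothetical name and the strings simply paraphrase the list:

```c
#include <errno.h>

/* Hypothetical helper: map the documented eclone() errors to messages. */
static const char *eclone_strerror(int err)
{
	switch (err) {
	case EPERM:
		return "pids were specified without CAP_SYS_ADMIN";
	case EINVAL:
		return "bad nr_pids, clone flags, reserved fields "
		       "or child_stack_size";
	case EBUSY:
		return "a requested pid is already in use in that namespace";
	default:
		return "error not listed in the eclone() documentation";
	}
}
```

Because EINVAL is shared, user space cannot tell from errno alone which of the four conditions it hit.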


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'
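
Presumably both messages come from the "#if defined(__i386__) && defined(__NR_eclone)" guard: elsewhere eclone() and setup_stack() are compiled out, so do_clone() sees them as implicitly declared functions returning int (hence the warning) and the link fails. A minimal sketch of fallbacks that would let the program build and fail cleanly at run time; the stub bodies are hypothetical and not part of the patch:

```c
#include <errno.h>
#include <stddef.h>

struct clone_args;	/* layout is not needed by the stubs */

#if !defined(__i386__)
/* Hypothetical fallback: report that eclone() is not wired up here. */
int eclone(unsigned int flags_low, struct clone_args *clone_args,
	   int args_size, int *pids)
{
	(void)flags_low;
	(void)clone_args;
	(void)args_size;
	(void)pids;

	errno = ENOSYS;
	return -1;
}

/* Hypothetical fallback: no child stack is prepared on this architecture. */
void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
{
	(void)child_fn;
	(void)child_arg;
	(void)size;

	return NULL;
}
#endif
```

An #error in an #else branch would be the other obvious choice if the example is meant to be i386-only.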


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone		337
> +#define CLONE_NEWPID            0x20000000
> +#define CLONE_CHILD_SETTID      0x01000000
> +#define CLONE_PARENT_SETTID     0x00100000
> +#define CLONE_UNUSED		0x00001000
> +
> +#define STACKSIZE		8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit		_exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
> +		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
> +		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
> +		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
> +
> +		 "pushl %%ebp\n\t"	/* save value of ebp */
> +		 "int $0x80\n\t"	/* Linux/i386 system call */
> +		 "testl %0,%0\n\t"	/* check return value */
> +		 "jne 1f\n\t"		/* jump if parent */
> +
> +		 "popl %%esi\n\t"	/* get subthread function */
> +		 "call *%%esi\n\t"	/* start subthread function */
> +		 "movl %2,%0\n\t"
> +		 "int $0x80\n"		/* exit system call: exit subthread */
> +		 "1:\n\t"
> +		 "popl %%ebp\t"		/* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +		);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
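
As far as can be told from the quoted code, setup_stack() allocates size + size bytes but starts the stack at offset size - 4, so the upper half of the allocation is never used, and the 4 hard-codes a 32-bit pointer size. A minimal sketch of a variant without those two assumptions; setup_stack_sketch() and noop_child() are hypothetical names, and the [child_fn, child_arg] layout is only meaningful for the i386 asm wrapper above:

```c
#include <stdlib.h>

/* Stand-in child function, used only to exercise the sketch. */
static int noop_child(void *arg)
{
	(void)arg;
	return 0;
}

/*
 * Hypothetical variant of setup_stack(): allocate exactly @size bytes
 * and leave child_fn and child_arg in the top two slots. Returns the
 * initial stack pointer, or NULL on failure. *base_out receives the
 * pointer to pass to free() later.
 */
static void **setup_stack_sketch(int (*child_fn)(void *), void *child_arg,
				 size_t size, void **base_out)
{
	char *base;
	void **sp;

	/* Need room for two slots and a pointer-aligned top of stack. */
	if (size < 2 * sizeof(void *) || size % sizeof(void *))
		return NULL;

	base = malloc(size);
	if (!base)
		return NULL;

	sp = (void **)(base + size);
	*--sp = child_arg;		/* on top of the stack after the pop */
	*--sp = (void *)child_fn;	/* popped by the child, then called */

	*base_out = base;
	return sp;
}
```

Returning the base separately also gives the parent something it can free once the child has exited.
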
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1	377
> +#define CHILD_TID2	1177
> +#define CHILD_TID3	2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg)  {
> +		printf("Child: Incorrect child arg pointer, expected %p, "
> +				"actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack		= (u64)(unsigned long)stack;
> +	ca->child_stack_size	= (u64)0;
> +	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
> +	ca->nr_pids		= nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +				rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +			 gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
@ 2010-05-05 21:14     ` Randy Dunlap
  0 siblings, 0 replies; 32+ messages in thread
From: Randy Dunlap @ 2010-05-05 21:14 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Andrew Morton,
	Sukadev Bhattiprolu, Pavel Emelyanov

On Sat,  1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> 
> This gives a brief overview of the eclone() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Acked-by: Oren Laadan  <orenl@cs.columbia.edu>
> ---
>  Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 348 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/eclone
> 
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does, the

	                                that the clone()

> +	eclone() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid namespaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint. Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.

	in the existing

> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When eclone() supports more than 32 flags, the additional bits
> +		in the clone_flags should be specified in this field. This
> +		field is currently unused and must be set to 0.
> +
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields in
> +		clone() and clone2() (on IA64) system calls. The usage of
> +		these two fields depends on the processor architecture.
> +
> +		Most architectures use ->child_stack to pass-in a stack-pointer

		                                     to pass in

> +		itself and don't need the ->child_stack_size field. On these
> +		architectures the ->child_stack_size field must be 0.
> +
> +		Some architectures, eg IA64, use ->child_stack to pass-in the

		                    e.g.                        to pass in

> +		base of the region allocated for stack. These architectures
> +		must pass in the size of the stack-region in ->child_stack_size.

		                             stack region

Seems unfortunate that different architectures use the fields differently.

> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call

		                                      system call.

> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to eclone() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the

		                                                  i.e.

> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 reserved0;
> +	u64 reserved1;
> +
> +		These fields are intended to extend the functionality of the
> +		eclone() in the future, while preserving backward compatibility.
> +		They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.

> +	The @cargs_size parameter specifies the sizeof(struct clone_args) and
> +	is intended to enable extending this structure in the future, while
> +	preserving backward compatibility.  For now, this field must be set
> +	to the sizeof(struct clone_args) and this size must match the kernel's
> +	view of the structure.
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid namespaces. The
> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid

	                         of the clone(2)

> +	namespaces.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. i.e if the process is nested in

	                 the youngest namespace. I.e.

> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, eclone() returns -1 and sets 'errno' to one of the following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specified,
> +		CAP_SYS_ADMIN is not required).
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process

		                                    process.

> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field)

		                                         field).

> +
> +	EBUSY	A requested pid is in use by another process in that namespace.
> +
> +---


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone		337
> +#define CLONE_NEWPID            0x20000000
> +#define CLONE_CHILD_SETTID      0x01000000
> +#define CLONE_PARENT_SETTID     0x00100000
> +#define CLONE_UNUSED		0x00001000
> +
> +#define STACKSIZE		8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit		_exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
> +		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
> +		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
> +		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
> +
> +		 "pushl %%ebp\n\t"	/* save value of ebp */
> +		 "int $0x80\n\t"	/* Linux/i386 system call */
> +		 "testl %0,%0\n\t"	/* check return value */
> +		 "jne 1f\n\t"		/* jump if parent */
> +
> +		 "popl %%esi\n\t"	/* get subthread function */
> +		 "call *%%esi\n\t"	/* start subthread function */
> +		 "movl %2,%0\n\t"
> +		 "int $0x80\n"		/* exit system call: exit subthread */
> +		 "1:\n\t"
> +		 "popl %%ebp\t"		/* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +		);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1	377
> +#define CHILD_TID2	1177
> +#define CHILD_TID3	2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg)  {
> +		printf("Child: Incorrect child arg pointer, expected %p,"
> +				" actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack		= (u64)(unsigned long)stack;
> +	ca->child_stack_size	= (u64)0;
> +	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
> +	ca->nr_pids		= nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +				rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +			 gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
@ 2010-05-05 21:14     ` Randy Dunlap
  0 siblings, 0 replies; 32+ messages in thread
From: Randy Dunlap @ 2010-05-05 21:14 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu,
	linux-api-u79uwXL29TY76Z2rM5mHXA, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-s390-u79uwXL29TY76Z2rM5mHXA,
	linuxppc-dev-mnsaURCQ41sdnm+yROfE0A

On Sat,  1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> 
> This gives a brief overview of the eclone() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Acked-by: Oren Laadan  <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
> ---
>  Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 348 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/eclone
> 
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does, the

	                                that the clone()

> +	eclone() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid namespaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint. Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.

	in the existing

> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When eclone() supports more than 32 flags, the additional bits
> +		in the clone_flags should be specified in this field. This
> +		field is currently unused and must be set to 0.
> +
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields in
> +		clone() and clone2() (on IA64) system calls. The usage of
> +		these two fields depends on the processor architecture.
> +
> +		Most architectures use ->child_stack to pass-in a stack-pointer

		                                     to pass in

> +		itself and don't need the ->child_stack_size field. On these
> +		architectures the ->child_stack_size field must be 0.
> +
> +		Some architectures, eg IA64, use ->child_stack to pass-in the

		                    e.g.                        to pass in

> +		base of the region allocated for stack. These architectures
> +		must pass in the size of the stack-region in ->child_stack_size.

		                             stack region

Seems unfortunate that different architectures use the fields differently.
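
To make the cost of that difference concrete, here is a minimal sketch of what a portable caller would have to do with the two fields as documented. The helper name set_child_stack() is hypothetical and not part of the patch, and the non-IA64 branch assumes a downward-growing stack, so the stack pointer starts at the top of the region:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned long long u64;
typedef unsigned int u32;

struct clone_args {
	u64 clone_flags_high;
	u64 child_stack;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u32 nr_pids;
	u32 reserved0;
};

/*
 * Hypothetical helper (not in the patch): fill in the stack fields from a
 * region that starts at @base and is @size bytes long.
 */
static void set_child_stack(struct clone_args *ca, void *base, size_t size)
{
	assert(ca != NULL && base != NULL);
#if defined(__ia64__)
	/* IA64: pass the base of the region plus its size. */
	ca->child_stack = (u64)(uintptr_t)base;
	ca->child_stack_size = (u64)size;
#else
	/* Assumed downward-growing stack: pass the initial stack pointer. */
	ca->child_stack = (u64)(uintptr_t)((char *)base + size);
	ca->child_stack_size = 0;	/* must be 0 on these architectures */
#endif
}
```

The caller starts from the same (base, size) pair on every architecture; only the #ifdef decides how it is encoded.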

> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call

		                                      system call.

> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to eclone() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the

		                                                  i.e.

> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 reserved0;
> +	u64 reserved1;
> +
> +		These fields are intended to extend the functionality of the
> +		eclone() in the future, while preserving backward compatibility.
> +		They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.
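
A quick way to check this is to look at the layout of the struct as declared at the top of the file. The sketch below assumes the usual ABIs where u64 is 8 bytes and u32 is 4 bytes; the function name clone_args_layout_ok() is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;
typedef unsigned int u32;

struct clone_args {
	u64 clone_flags_high;
	u64 child_stack;
	u64 child_stack_size;
	u64 parent_tid_ptr;
	u64 child_tid_ptr;
	u32 nr_pids;
	u32 reserved0;
};

/*
 * Hypothetical check: five u64 fields (40 bytes) followed by two u32
 * fields (8 bytes) give 48 bytes, with reserved0 as the last member.
 * There is no room left for a u64 reserved1.
 */
static int clone_args_layout_ok(void)
{
	return sizeof(struct clone_args) == 48 &&
	       offsetof(struct clone_args, nr_pids) == 40 &&
	       offsetof(struct clone_args, reserved0) == 44;
}
```

Since @cargs_size must match the kernel's view of the structure, a description that lists an extra field would mislead anyone who builds the struct from the text alone.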

> +	The @cargs_size parameter specifies the sizeof(struct clone_args) and
> +	is intended to enable extending this structure in the future, while
> +	preserving backward compatibility.  For now, this field must be set
> +	to the sizeof(struct clone_args) and this size must match the kernel's
> +	view of the structure.
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid namespaces. The
> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid

	                         of the clone(2)

> +	namespaces.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. i.e if the process is nested in

	                 the youngest namespace. I.e.

> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
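
The ordering rule above may be easier to follow as code. This is only a sketch of the mapping as described in the text, not the kernel implementation; requested_pid() is a hypothetical name, and levels are assumed to be numbered from 0 (init_pid_ns) up to the caller's active namespace, as in the level-6 example:

```c
#include <assert.h>

/*
 * Hypothetical helper: return the pid requested for namespace @level,
 * where @depth is the level of the caller's active pid namespace and
 * @pids holds @nr_pids entries ordered oldest to youngest.
 * A return value of 0 means "let the kernel pick a pid".
 */
static int requested_pid(const int *pids, int nr_pids, int depth, int level)
{
	/* pids[] is applied from the youngest namespace backwards. */
	int first = depth - nr_pids + 1;	/* level covered by pids[0] */

	assert(nr_pids >= 0 && nr_pids <= depth + 1);
	if (level < first || level > depth)
		return 0;
	return pids[level - first];
}
```

With depth 6 and three pids, pids[0] lands on level 4 and pids[2] on level 6, while levels 0 through 3 get 0.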
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, eclone() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specified
> +		CAP_SYS_ADMIN is not required).
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process

		                                    process.

> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field)

		                                         field).

> +
> +	EBUSY	A requested pid is in use by another process in that namespace.
> +
> +---


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'
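
Both undefined references come from the fact that eclone() and setup_stack() are defined only inside the "#if defined(__i386__) && defined(__NR_eclone)" block, while do_clone() calls them unconditionally. If the example is meant to be i386-only, one option is to have it fail cleanly elsewhere. A minimal sketch, with hypothetical function names that could be wired into an #else branch of that guard:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct clone_args;	/* defined earlier in the example program */

/* Hypothetical fallback: report that eclone() is not wired up here. */
static int eclone_unsupported(unsigned int flags_low,
		struct clone_args *clone_args, int args_size, int *pids)
{
	(void)flags_low;
	(void)clone_args;
	(void)args_size;
	(void)pids;
	errno = ENOSYS;
	return -1;
}

/* Hypothetical fallback: no child stack is set up on this architecture. */
static void *setup_stack_unsupported(int (*child_fn)(void *),
		void *child_arg, int size)
{
	(void)child_fn;
	(void)child_arg;
	(void)size;
	return NULL;
}
```

With stubs like these the program would link on x86_64, and do_clone() would print the ENOSYS error instead of the build failing.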


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone		337
> +#define CLONE_NEWPID            0x20000000
> +#define CLONE_CHILD_SETTID      0x01000000
> +#define CLONE_PARENT_SETTID     0x00100000
> +#define CLONE_UNUSED		0x00001000
> +
> +#define STACKSIZE		8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit		_exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
> +		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
> +		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
> +		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
> +
> +		 "pushl %%ebp\n\t"	/* save value of ebp */
> +		 "int $0x80\n\t"	/* Linux/i386 system call */
> +		 "testl %0,%0\n\t"	/* check return value */
> +		 "jne 1f\n\t"		/* jump if parent */
> +
> +		 "popl %%esi\n\t"	/* get subthread function */
> +		 "call *%%esi\n\t"	/* start subthread function */
> +		 "movl %2,%0\n\t"
> +		 "int $0x80\n"		/* exit system call: exit subthread */
> +		 "1:\n\t"
> +		 "popl %%ebp\t"		/* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +		);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1	377
> +#define CHILD_TID2	1177
> +#define CHILD_TID3	2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg)  {
> +		printf("Child: Incorrect child arg pointer, expected %p,"
> +				" actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack		= (u64)(unsigned long)stack;
> +	ca->child_stack_size	= (u64)0;
> +	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
> +	ca->nr_pids		= nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +				rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +			 gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* [PATCH v21 011/100] eclone (11/11): Document sys_eclone
       [not found] ` <1272723382-19470-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-05-01 14:14   ` Oren Laadan
  0 siblings, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxppc-dev-mnsaURCQ41sdnm+yROfE0A, Sukadev Bhattiprolu,
	Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This gives a brief overview of the eclone() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v13]:
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it.
	- [Arnd Bergmann] Remove ->reserved1 field
	- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
	  example into one and use memory constraint to avoid unnecessary copies.
Changelog[v12]:
	- [Serge Hallyn] Fix/simplify stack-setup in the example code
	- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org
Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Acked-by: Oren Laadan  <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/eclone

diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+		pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does, the
+	eclone() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint. Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When eclone() supports more than 32 flags, the additional bits
+		in the clone_flags should be specified in this field. This
+		field is currently unused and must be set to 0.
+
+	u64 child_stack;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields in
+		clone() and clone2() (on IA64) system calls. The usage of
+		these two fields depends on the processor architecture.
+
+		Most architectures use ->child_stack to pass-in a stack-pointer
+		itself and don't need the ->child_stack_size field. On these
+		architectures the ->child_stack_size field must be 0.
+
+		Some architectures, eg IA64, use ->child_stack to pass-in the
+		base of the region allocated for stack. These architectures
+		must pass in the size of the stack-region in ->child_stack_size.
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to eclone() (see below). nr_pids should not exceed
+		the current nesting level of the calling process (i.e if the
+		process is in init_pid_ns, nr_pids must be 1, if process is
+		in a pid namespace that is a child of init-pid-ns, nr_pids
+		cannot exceed 2, and so on).
+
+	u32 reserved0;
+	u64 reserved1;
+
+		These fields are intended to extend the functionality of the
+		eclone() in the future, while preserving backward compatibility.
+		They must be set to 0 for now.
+
+	The @cargs_size parameter specifies the sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility.  For now, this field must be set
+	to the sizeof(struct clone_args) and this size must match the kernel's
+	view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The order of pids in @pids is oldest in pids[0] to youngest pid
+	namespace in pids[nr_pids-1]. If the number of pids specified in the
+	@pids list is fewer than the nesting level of the process, the pids
+	are applied from youngest namespace. i.e if the process is nested in
+	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+	have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, eclone() returns -1 and sets 'errno' to one of following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specified
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EINVAL	The child_stack_size field is not 0 (on architectures that
+		pass in a stack pointer in ->child_stack field)
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example eclone() usage - Create a child process with pid CHILD_TID1 in
+ * the current pid namespace. The child gets the usual "random" pid in any
+ * ancestor pid namespaces.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_eclone		337
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE		8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+	u32 nr_pids;
+
+	u32 reserved0;
+};
+
+#define exit		_exit
+
+/*
+ * Following eclone() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_eclone)
+
+int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
+		int *pids)
+{
+	long retval;
+
+	__asm__ __volatile__(
+		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
+		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
+		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
+		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
+
+		 "pushl %%ebp\n\t"	/* save value of ebp */
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+
+		 "popl %%esi\n\t"	/* get subthread function */
+		 "call *%%esi\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+
+		:"=a" (retval)
+
+		:"0" (__NR_eclone),
+		 "i" (__NR_exit),
+		 "m" (flags_low),
+		 "m" (clone_args),
+		 "m" (args_size),
+		 "m" (pids)
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
+{
+	void *stack_base;
+	void **stack_top;
+
+	stack_base = malloc(size + size);
+	if (!stack_base) {
+		perror("malloc()");
+		exit(1);
+	}
+
+	stack_top = (void **)((char *)stack_base + (size - 4));
+	*--stack_top = child_arg;
+	*--stack_top = child_fn;
+
+	return stack_top;
+}
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	1177
+#define CHILD_TID3	2799
+
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p,"
+				" actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
+	if (ctid != CHILD_TID1) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID1, ctid);
+		exit(1);
+	} else {
+		printf("Child got the expected tid, %d\n", gettid());
+	}
+	sleep(2);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg, STACKSIZE);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack		= (u64)(unsigned long)stack;
+	ca->child_stack_size	= (u64)0;
+	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
+	ca->nr_pids		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = eclone(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
+				rc, errno);
+	return rc;
+}
+
+/*
+ * Multiple pid_t values in pids_list[] here are just for illustration.
+ * The test case creates a child in the current pid namespace and uses only
+ * the first value, CHILD_TID1.
+ */
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
+int main()
+{
+	int rc, pid, status;
+	unsigned long flags;
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		}
+	}
+	return 0;
+}
-- 
1.6.3.3


* [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-05-01 14:14 [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
@ 2010-05-01 14:14   ` Oren Laadan
  2010-05-01 14:14   ` Oren Laadan
  1 sibling, 0 replies; 32+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
	linuxppc-dev

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

This gives a brief overview of the eclone() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v13]:
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it.
	- [Arnd Bergmann] Remove ->reserved1 field
	- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
	  example into one and use memory constraints to avoid unnecessary copies.
Changelog[v12]:
	- [Serge Hallyn] Fix/simplify stack-setup in the example code
	- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan  <orenl@cs.columbia.edu>
---
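For illustration only (not part of the patch): the -EINVAL rules listed in
Documentation/eclone below can be sketched as a userspace pre-check. The
helper name check_clone_args() and its ns_level and stack_size_used
parameters are assumptions made for this sketch; the real validation is done
by the kernel and may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout given in Documentation/eclone. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical helper: returns 0 if @ca passes the documented checks,
 * -EINVAL otherwise.  @ns_level is the caller's pid namespace level
 * (0 for init_pid_ns).  @stack_size_used is non-zero only on
 * architectures, such as IA64, that take a stack base plus a size.
 */
int check_clone_args(const struct clone_args *ca, size_t cargs_size,
		     unsigned int ns_level, int stack_size_used)
{
	assert(ca != NULL);

	/* The size must match this build's view of the structure. */
	if (cargs_size != sizeof(*ca))
		return -EINVAL;

	/* Fields that are currently unused must be 0. */
	if (ca->clone_flags_high != 0 || ca->reserved0 != 0)
		return -EINVAL;

	/* A stack size is only meaningful with a stack base. */
	if (!stack_size_used && ca->child_stack_size != 0)
		return -EINVAL;

	/* At most one pid per namespace level, levels 0..ns_level. */
	if (ca->nr_pids > ns_level + 1)
		return -EINVAL;

	return 0;
}
```

A caller in init_pid_ns would pass ns_level 0, so nr_pids values of 0 or 1
pass this check and 2 is rejected.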
 Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/eclone
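
For illustration only (not part of the patch): the ordering rule for @pids
described in the document below, where pids[0] is the oldest namespace and a
short list is applied starting from the youngest namespace, can be sketched
in userspace as follows. The helper name expand_pids() and its parameters
are hypothetical, and the sketch assumes the child stays in the caller's pid
namespace (no CLONE_NEWPID).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: expand a short @pids list into one requested pid
 * per namespace level.  @ns_level is the level of the active pid
 * namespace (0 for init_pid_ns), so @out must hold ns_level + 1 entries.
 * A 0 in @out means "let the kernel choose a pid at that level".
 */
int expand_pids(unsigned int ns_level, unsigned int nr_pids,
		const pid_t *pids, pid_t *out)
{
	unsigned int first, i;

	assert(out != NULL);
	assert(pids != NULL || nr_pids == 0);

	if (nr_pids > ns_level + 1)
		return -EINVAL;		/* more pids than namespace levels */

	/* Oldest level that receives an explicitly requested pid. */
	first = ns_level + 1 - nr_pids;

	for (i = 0; i <= ns_level; i++)
		out[i] = (i < first) ? 0 : pids[i - first];

	return 0;
}
```

With ns_level 6 and pids { 10, 20, 30 }, levels 0 through 3 get 0 and levels
4, 5 and 6 get 10, 20 and 30, matching the level-6 example in the document.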

diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+		pid_t * __user pids)
+
+	In addition to doing everything that the clone() system call does, the
+	eclone() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows the user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint. Such a restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in the existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high;
+
+		When eclone() supports more than 32 flags, the additional bits
+		in the clone_flags should be specified in this field. This
+		field is currently unused and must be set to 0.
+
+	u64 child_stack;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' arguments of
+		the clone() and clone2() (on IA64) system calls. The usage of
+		these two fields depends on the processor architecture.
+
+		Most architectures use ->child_stack to pass in the stack
+		pointer itself and don't need the ->child_stack_size field. On
+		these architectures the ->child_stack_size field must be 0.
+
+		Some architectures, e.g. IA64, use ->child_stack to pass in the
+		base of the region allocated for the stack. These architectures
+		must pass in the size of the stack region in ->child_stack_size.
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call.
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to eclone() (see below). nr_pids must not exceed
+		the current nesting level of the calling process (i.e. if
+		the process is in init_pid_ns, nr_pids cannot exceed 1; if
+		it is in a pid namespace that is a child of init_pid_ns,
+		nr_pids cannot exceed 2, and so on).
+
+	u32 reserved0;
+
+		This field is reserved for extending the functionality
+		of eclone() in the future, while preserving backward
+		compatibility with existing callers. It must be set to
+		0 for now.
+
+	The @cargs_size parameter specifies sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility.  For now, this parameter must be
+	set to sizeof(struct clone_args), and this size must match the
+	kernel's view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See the CLONE_NEWPID section of the clone(2) man page for details
+	about pid namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The pids in @pids are ordered from the oldest pid namespace in pids[0]
+	to the youngest in pids[nr_pids-1]. If fewer pids are specified in
+	@pids than the nesting level of the process, the pids are applied
+	starting from the youngest namespace, i.e. if the process is nested in
+	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+	have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, eclone() returns -1 and sets 'errno' to one of the following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specified,
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of the parent process.
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EINVAL	The child_stack_size field is not 0 (on architectures that
+		pass in a stack pointer in the ->child_stack field).
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example eclone() usage - Create a child process with pid CHILD_TID1 in
+ * the current pid namespace. The child gets the usual "random" pid in any
+ * ancestor pid namespaces.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_eclone		337
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE		8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+	u32 nr_pids;
+
+	u32 reserved0;
+};
+
+#define exit		_exit
+
+/*
+ * The following eclone() wrapper is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_eclone)
+
+int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
+		int *pids)
+{
+	long retval;
+
+	__asm__ __volatile__(
+		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
+		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
+		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
+		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
+
+		 "pushl %%ebp\n\t"	/* save value of ebp */
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+
+		 "popl %%esi\n\t"	/* get subthread function */
+		 "call *%%esi\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+
+		:"=a" (retval)
+
+		:"0" (__NR_eclone),
+		 "i" (__NR_exit),
+		 "m" (flags_low),
+		 "m" (clone_args),
+		 "m" (args_size),
+		 "m" (pids)
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
+{
+	void *stack_base;
+	void **stack_top;
+
+	stack_base = malloc(size + size);
+	if (!stack_base) {
+		perror("malloc()");
+		exit(1);
+	}
+
+	stack_top = (void **)((char *)stack_base + (size - 4));
+	*--stack_top = child_arg;
+	*--stack_top = child_fn;
+
+	return stack_top;
+}
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	1177
+#define CHILD_TID3	2799
+
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p,"
+				" actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
+	if (ctid != CHILD_TID1) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID1, ctid);
+		exit(1);
+	} else {
+		printf("Child got the expected tid, %d\n", gettid());
+	}
+	sleep(2);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg, STACKSIZE);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack		= (u64)(unsigned long)stack;
+	ca->child_stack_size	= (u64)0;
+	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
+	ca->nr_pids		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = eclone(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
+				rc, errno);
+	return rc;
+}
+
+/*
+ * Multiple pid_t values in pids_list[] here are just for illustration.
+ * The test case creates a child in the current pid namespace and uses only
+ * the first value, CHILD_TID1.
+ */
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
+int main()
+{
+	int rc, pid, status;
+	unsigned long flags;
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		}
+	}
+	return 0;
+}
-- 
1.6.3.3


end of thread, other threads:[~2010-06-10  9:22 UTC | newest]

Thread overview: 32+ messages
2010-05-29 10:31 [PATCH v21 011/100] eclone (11/11): Document sys_eclone Albert Cahalan
2010-06-01 19:32 ` Sukadev Bhattiprolu
2010-06-01 19:32   ` Sukadev Bhattiprolu
2010-06-01 19:59   ` Albert Cahalan
2010-06-01 19:59     ` Albert Cahalan
2010-06-02  1:38     ` Sukadev Bhattiprolu
2010-06-02  1:38       ` Sukadev Bhattiprolu
2010-06-05 11:49       ` Albert Cahalan
2010-06-05 11:49         ` Albert Cahalan
2010-06-05 11:58       ` Albert Cahalan
2010-06-05 11:58         ` Albert Cahalan
2010-06-05 12:08       ` Albert Cahalan
2010-06-05 12:08         ` Albert Cahalan
2010-06-09 18:14         ` Sukadev Bhattiprolu
2010-06-09 18:14           ` Sukadev Bhattiprolu
2010-06-09 18:46           ` H. Peter Anvin
2010-06-09 18:46             ` H. Peter Anvin
2010-06-09 22:32           ` Roland McGrath
2010-06-09 22:32             ` Roland McGrath
2010-06-10  9:15           ` Arnd Bergmann
2010-06-10  9:15             ` Arnd Bergmann
  -- strict thread matches above, loose matches on Subject: below --
2010-05-01 14:14 [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
     [not found] ` <1272723382-19470-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-05-01 14:14   ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
2010-05-01 14:14 ` Oren Laadan
2010-05-01 14:14   ` Oren Laadan
     [not found]   ` <1272723382-19470-12-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-05-05 21:14     ` Randy Dunlap
2010-05-05 21:14   ` Randy Dunlap
2010-05-05 21:14     ` Randy Dunlap
2010-05-05 21:14     ` Randy Dunlap
     [not found]     ` <20100505141447.fc2397f6.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2010-05-05 22:25       ` Sukadev Bhattiprolu
2010-05-05 22:25     ` Sukadev Bhattiprolu
2010-05-05 22:25       ` Sukadev Bhattiprolu
2010-05-05 22:25       ` Sukadev Bhattiprolu
