* Which of the virtualization approaches is more suitable for kernel?
@ 2006-02-20 15:45 Kirill Korotaev
  2006-02-20 16:12 ` Herbert Poetzl
  2006-02-24 21:44 ` Eric W. Biederman
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill Korotaev @ 2006-02-20 15:45 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel,
      Eric W. Biederman, Andrey Savochkin, Alexey Kuznetsov,
      Stanislav Protassov, serue, frankeh, clg, haveblue, mrmacman_g4,
      alan, Herbert Poetzl, Andrew Morton

Linus, Andrew,

We need your help on what virtualization approach you would accept into
mainstream (if any) and where we should go.

If we drop VPID virtualization, which caused many disputes, we actually
have one virtualization solution, but two approaches to it. Which one
goes in depends on the goals and on your approval anyway.

So what are the approaches?

1. Namespaces for all types of resources (Eric W. Biederman)

The proposed solution is similar to fs namespaces. This approach
introduces namespaces for IPC, IPv4 networking, IPv6 networking, pids
etc. It also adds to task_struct a set of pointers to the namespaces
the task belongs to, i.e. current->ipv4_ns, current->ipc_ns etc.

Benefits:
- fine-grained namespaces
- task_struct points directly to a structure describing the namespace
  and its data (not to a container)
- maybe a bit more logical/clean implementation, since no effective
  container is involved, unlike in the second approach.

Disadvantages:
- it is only proof-of-concept code right now, nothing more.
- such an approach requires adding an additional argument to many
  functions (e.g. Eric's patch for networking is 1.5 times bigger than
  OpenVZ's). It also adds code for getting the namespace from objects,
  e.g. skb->sk->namespace.
- so it increases stack usage
- it can't efficiently compile down to the same non-virtualized kernel,
  which can be undesirable for embedded Linux.
- fine-grained namespaces are actually an obfuscation, since kernel
  subsystems are tightly interconnected, e.g. network -> sysctl ->
  proc, mqueues -> netlink, ipc -> fs, and most often they can be used
  only as a whole container.
- it involves a slightly more complicated procedure of container
  create/enter which requires exec or something like it, since there is
  no effective container which could simply be switched to.

2. Containers (OpenVZ.org / linux-vserver.org)

The container solution was discussed before, and actually it is also a
namespace solution, but with one total namespace, described by a single
kernel structure. Every task has two container pointers: container and
effective container. The latter is used to temporarily switch to other
contexts, e.g. when handling IRQs, TCP/IP etc.

Benefits:
- a clear, logically bounded container; it is clear when a container is
  alive and when it is not.
- it doesn't introduce additional args for most functions, and no
  additional stack usage.
- it compiles down to the good old kernel when virtualization is off,
  so it doesn't disturb other configurations.
- Eric brought up an interesting idea about introducing an interface
  like DEFINE_CPU_VAR(), which could potentially allow creating
  virtualized variables automagically and accessing them via
  econtainer().
- mature working code exists which has been used in production for
  years, so a first working version can be done much more quickly.

Disadvantages:
- one additional pointer dereference when accessing virtualized
  resources, e.g. current->econtainer->shm_ids

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
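To make the two shapes concrete, a minimal sketch of how each approach
hangs state off task_struct might look as follows. All struct and field
names here are illustrative assumptions drawn from the description
above, not code from either patch set:

	/* Approach 1: fine-grained namespaces -- one pointer per
	 * subsystem, each pointing straight at that subsystem's state. */
	struct task_struct {
		/* ... */
		struct ipc_namespace *ipc_ns;	/* msg/sem/shm ids, counters */
		struct net_namespace *ipv4_ns;	/* routes, sockets, sysctls  */
		struct pid_namespace *pid_ns;	/* pid allocation and lookup */
	};

	/* Approach 2: one container structure holding all virtualized
	 * state, reached through an extra "effective" indirection. */
	struct container {
		struct ipc_ids shm_ids;		/* ... and everything else   */
	};

	struct task_struct {
		/* ... */
		struct container *container;	/* where the task lives      */
		struct container *econtainer;	/* effective view, switched
						 * temporarily for IRQs,
						 * TCP/IP processing etc.    */
	};

The "one additional pointer dereference" disadvantage listed above is
visible in the access patterns: current->ipc_ns->shm_ids versus
current->econtainer->shm_ids.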
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-20 15:45 Which of the virtualization approaches is more suitable for kernel? Kirill Korotaev
@ 2006-02-20 16:12 ` Herbert Poetzl
  2006-02-21 16:00   ` Kirill Korotaev
  2006-02-24 21:44 ` Eric W. Biederman
  1 sibling, 1 reply; 27+ messages in thread
From: Herbert Poetzl @ 2006-02-20 16:12 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel,
      Eric W. Biederman, Andrey Savochkin, Alexey Kuznetsov,
      Stanislav Protassov, serue, frankeh, clg, haveblue, mrmacman_g4,
      alan, Andrew Morton

On Mon, Feb 20, 2006 at 06:45:21PM +0300, Kirill Korotaev wrote:
> Linus, Andrew,
>
> We need your help on what virtualization approach you would accept
> into mainstream (if any) and where we should go.
>
> If we drop VPID virtualization, which caused many disputes, we
> actually have one virtualization solution, but two approaches to it.
> Which one goes in depends on the goals and on your approval anyway.
>
> So what are the approaches?
>
> 1. Namespaces for all types of resources (Eric W. Biederman)
>
> The proposed solution is similar to fs namespaces. This approach
> introduces namespaces for IPC, IPv4 networking, IPv6 networking,
> pids etc. It also adds to task_struct a set of pointers to the
> namespaces the task belongs to, i.e. current->ipv4_ns,
> current->ipc_ns etc.
>
> Benefits:
> - fine-grained namespaces
> - task_struct points directly to a structure describing the
>   namespace and its data (not to a container)
> - maybe a bit more logical/clean implementation, since no effective
>   container is involved, unlike in the second approach.

> Disadvantages:
> - it is only proof-of-concept code right now, nothing more.

sorry, but that is no disadvantage in my book ...

> - such an approach requires adding an additional argument to many
>   functions (e.g. Eric's patch for networking is 1.5 times bigger
>   than OpenVZ's).

hmm? last time I checked OpenVZ was quite bloated
compared to Linux-VServer, and Eric's network part
isn't even there yet ...

>   It also adds code for getting the namespace from objects, e.g.
>   skb->sk->namespace.
> - so it increases stack usage
> - it can't efficiently compile down to the same non-virtualized
>   kernel, which can be undesirable for embedded Linux.

while OpenVZ does?

> - fine-grained namespaces are actually an obfuscation, since kernel
>   subsystems are tightly interconnected, e.g. network -> sysctl ->
>   proc, mqueues -> netlink, ipc -> fs, and most often they can be
>   used only as a whole container.

I think a lot of _strange_ interconnects there could
use some cleanup, and after that the interconnections
would be very small

> - it involves a slightly more complicated procedure of container
>   create/enter which requires exec or something like it, since there
>   is no effective container which could simply be switched to.

I don't understand this argument ...

> 2. Containers (OpenVZ.org / linux-vserver.org)

please do not generalize here, Linux-VServer does
not use a single container structure as you might
think ...

> The container solution was discussed before, and actually it is also
> a namespace solution, but with one total namespace, described by a
> single kernel structure.

that might be true for OpenVZ, but it is not for
Linux-VServer, as we have structures for network
and process contexts as well as different ones for
disk limits

> Every task has two container pointers: container and effective
> container. The latter is used to temporarily switch to other
> contexts, e.g. when handling IRQs, TCP/IP etc.

this doesn't look very cool to me, as IRQs should
be handled in the host context and TCP/IP in the
proper network space ...

> Benefits:
> - a clear, logically bounded container; it is clear when a container
>   is alive and when it is not.

how does that handle the issues you described with
sockets in wait state which have very long timeouts?

> - it doesn't introduce additional args for most functions, and no
>   additional stack usage.

a single additional arg here and there won't hurt,
and I'm pretty sure most of them will be in inlined
code, where it doesn't really matter

> - it compiles down to the good old kernel when virtualization is
>   off, so it doesn't disturb other configurations.

the question here is, do we really want to turn it
off at all? IMHO the design and implementation
should be sufficiently good so that it does neither
impose unnecessary overhead nor change the default
behaviour ...

> - Eric brought up an interesting idea about introducing an interface
>   like DEFINE_CPU_VAR(), which could potentially allow creating
>   virtualized variables automagically and accessing them via
>   econtainer().

how is that an advantage of the container approach?

> - mature working code exists which has been used in production for
>   years, so a first working version can be done much more quickly.

from the OpenVZ/Virtuozzo(tm) page:

  Specific benefits of Virtuozzo(tm) compared to OpenVZ can be
  found below:
  - Higher VPS density. Virtuozzo(tm) provides efficient memory
    and file sharing mechanisms enabling higher VPS density and
    better performance of VPSs.
  - Improved Stability, Scalability, and Performance. Virtuozzo(tm)
    is designed to run 24×7 environments with production workloads
    on hosts with up to 32 CPUs.

so I conclude, OpenVZ does not contain the code which
provides all this ...

> Disadvantages:
> - one additional pointer dereference when accessing virtualized
>   resources, e.g. current->econtainer->shm_ids

best,
Herbert

> Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-20 16:12 ` Herbert Poetzl
@ 2006-02-21 16:00   ` Kirill Korotaev
  2006-02-21 20:33     ` Sam Vilain
  2006-02-21 23:50     ` Herbert Poetzl
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill Korotaev @ 2006-02-21 16:00 UTC (permalink / raw)
  To: Herbert Poetzl
  Cc: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel,
      Eric W. Biederman, Andrey Savochkin, Alexey Kuznetsov,
      Stanislav Protassov, serue, frankeh, clg, haveblue, mrmacman_g4,
      alan, Andrew Morton

>> - such an approach requires adding an additional argument to many
>>   functions (e.g. Eric's patch for networking is 1.5 times bigger
>>   than OpenVZ's).
> hmm? last time I checked OpenVZ was quite bloated
> compared to Linux-VServer, and Eric's network part
> isn't even there yet ...
This is a rather subjective feeling. I can say the same about VServer.

>> - it can't efficiently compile down to the same non-virtualized
>>   kernel, which can be undesirable for embedded Linux.
> while OpenVZ does?
yes. In _most_ cases it does.
If Linus/Andrew/others feel this is not an issue at all, I will be the
first to welcome Eric's approach. I'm not against it, though it looks
like a disadvantage to me.

>> - fine-grained namespaces are actually an obfuscation, since kernel
>>   subsystems are tightly interconnected, e.g. network -> sysctl ->
>>   proc, mqueues -> netlink, ipc -> fs, and most often they can be
>>   used only as a whole container.
> I think a lot of _strange_ interconnects there could
> use some cleanup, and after that the interconnections
> would be very small
Why do you think they are strange!? Is it strange that networking
exports its sysctls and statistics via proc?
Is it strange for you that IPC uses fs?
It is by _design_.

>> - it involves a slightly more complicated procedure of container
>>   create/enter which requires exec or something like it, since there
>>   is no effective container which could simply be switched to.
> I don't understand this argument ...
- you need to track dependencies between namespaces (e.g. NAT requires
conntracks, IPC requires FS etc.). This should be handled; otherwise a
container that is able to create a nested container will be able to
cause an oops.

>> 2. Containers (OpenVZ.org / linux-vserver.org)
>
> please do not generalize here, Linux-VServer does
> not use a single container structure as you might
> think ...
1.
This topic is not a question of the single container only...
also AFAIK you use it all together only.
2.
Just to be clear: once again, I'm not against namespaces.

>> The container solution was discussed before, and actually it is also
>> a namespace solution, but with one total namespace, described by a
>> single kernel structure.
>
> that might be true for OpenVZ, but it is not for
> Linux-VServer, as we have structures for network
> and process contexts as well as different ones for
> disk limits
do you have support for it in tools?
i.e. do you support namespaces somehow? Can you create a
half-virtualized container?

>> Every task has two container pointers: container and effective
>> container. The latter is used to temporarily switch to other
>> contexts, e.g. when handling IRQs, TCP/IP etc.
>
> this doesn't look very cool to me, as IRQs should
> be handled in the host context and TCP/IP in the
> proper network space ...
this is exactly what it does.
On IRQ, context is switched to the host.
In TCP/IP, to the context of the socket or network device.

>> Benefits:
>> - a clear, logically bounded container; it is clear when a container
>>   is alive and when it is not.
> how does that handle the issues you described with
> sockets in wait state which have very long timeouts?
easily.
We have clear logic of container lifetime - it is alive until the last
process in it is alive. When the processes die, the container is
destroyed and so are all its sockets. From the namespaces point of
view, this means that the lifetime of the network namespace is limited
to the lifetime of the pid namespace.

>> - it doesn't introduce additional args for most functions, and no
>>   additional stack usage.
> a single additional arg here and there won't hurt,
> and I'm pretty sure most of them will be in inlined
> code, where it doesn't really matter
have you analyzed that before thinking about inlining?

>> - it compiles down to the good old kernel when virtualization is
>>   off, so it doesn't disturb other configurations.
> the question here is, do we really want to turn it
> off at all? IMHO the design and implementation
> should be sufficiently good so that it does neither
> impose unnecessary overhead nor change the default
> behaviour ...
this is the question I want answered by Linus/Andrew.
I don't believe in low overhead. It starts with virtualization, then
comes resource management etc. These features _definitely_ introduce
overhead and increase resource consumption. Not big, but why not make
it configurable? You don't need CPUSETS on a UP i386 machine, do you?
Why would I want this stuff in my embedded Linux? The same goes for
secured Linux distributions; it only opens ways for possible security
issues.

>> - Eric brought up an interesting idea about introducing an interface
>>   like DEFINE_CPU_VAR(), which could potentially allow creating
>>   virtualized variables automagically and accessing them via
>>   econtainer().
> how is that an advantage of the container approach?
Such vars can automatically be defined to something like
"(econtainer()->virtualized_variable)".
This looks similar to the percpu variable interfaces.

>> - mature working code exists which has been used in production for
>>   years, so a first working version can be done much more quickly.
> from the OpenVZ/Virtuozzo(tm) page:
>   Specific benefits of Virtuozzo(tm) compared to OpenVZ can be
>   found below:
>   - Higher VPS density. Virtuozzo(tm) provides efficient memory
>     and file sharing mechanisms enabling higher VPS density and
>     better performance of VPSs.
>   - Improved Stability, Scalability, and Performance. Virtuozzo(tm)
>     is designed to run 24×7 environments with production workloads
>     on hosts with up to 32 CPUs.
> so I conclude, OpenVZ does not contain the code which
> provides all this ..
:))))
Doesn't provide what? Stability?
The Q/A process used for Virtuozzo ends up in the OpenVZ code
eventually as well. This is more a subject of support/QA.
Performance? We optimize systems for customers, have
HA/monitoring/tuning/management tools for it etc.

Seems you are just trying to move away from the topic. Great.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
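For illustration, the DEFINE_CPU_VAR()-style idea discussed above could
look roughly like the sketch below, modeled on the kernel's per-cpu
variable interface. Everything here is an assumption pieced together
from the thread; the econtainer() accessor and macro names are not
taken from any posted patch:

	/* Formerly-global kernel variables become container fields. */
	struct container {
		atomic_t msg_bytes;	/* was: static atomic_t msg_bytes */
		atomic_t msg_hdrs;	/* was: static atomic_t msg_hdrs  */
	};

	/* The effective container of the running context; in the
	 * container approach this is current->econtainer. */
	#define econtainer()		(current->econtainer)

	/* Access a virtualized variable much like per_cpu() does for
	 * per-cpu data: call sites keep a global-variable look while
	 * the storage quietly becomes per-container. */
	#define container_var(name)	(econtainer()->name)

	/* usage: atomic_add(msgsz, &container_var(msg_bytes)); */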
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-21 16:00 ` Kirill Korotaev
@ 2006-02-21 20:33   ` Sam Vilain
  2006-02-21 23:50   ` Herbert Poetzl
  1 sibling, 0 replies; 27+ messages in thread
From: Sam Vilain @ 2006-02-21 20:33 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Herbert Poetzl, Linus Torvalds, Rik van Riel, Linux Kernel
      Mailing List, devel, Eric W. Biederman, Andrey Savochkin,
      Alexey Kuznetsov, Stanislav Protassov, serue, frankeh, clg,
      haveblue, mrmacman_g4, alan, Andrew Morton

Kirill Korotaev wrote:
>>> - fine-grained namespaces are actually an obfuscation, since kernel
>>>   subsystems are tightly interconnected, e.g. network -> sysctl ->
>>>   proc, mqueues -> netlink, ipc -> fs, and most often they can be
>>>   used only as a whole container.
>> I think a lot of _strange_ interconnects there could
>> use some cleanup, and after that the interconnections
>> would be very small
> Why do you think they are strange!? Is it strange that networking
> exports its sysctls and statistics via proc?
> Is it strange for you that IPC uses fs?
> It is by _design_.

Great, and this kind of simple design also worked well for the first
few iterations of Linux-VServer. However, some people need more
flexibility, as we are seeing from the wide range of virtualisation
schemes being proposed.

In the 2.1.x VServer patch the network and (process & IPC) isolation
and virtualisation have been kept separate, and can be managed with
separate utilities. There is also a syscall and utility to manage the
existing kernel filesystem namespaces. Eric's pspace work keeps the
PID aspect separate too, which I never envisioned possible.

I think that if we can keep as much separation between systems as
possible, then we will have a cleaner design. Also it will make life
easier for the core team, as we can more easily divide up the patches
for consideration by the relevant subsystem maintainer.

> - you need to track dependencies between namespaces (e.g. NAT
> requires conntracks, IPC requires FS etc.). This should be handled;
> otherwise a container that is able to create a nested container will
> be able to cause an oops.

This is just normal refcounting. Yes, IPC requires filesystem code,
but it doesn't care about the VFS, which is what filesystem namespaces
abstract.

> do you have support for it in tools?
> i.e. do you support namespaces somehow? Can you create a
> half-virtualized container?

See the util-vserver package; it comes with chbind and vnamespace,
which allow the creation of 'half-virtualized' containers, though most
of the rest of the functionality, such as per-vserver ulimits, disk
limits, etc., has been shoehorned into the general vx_info structure.
As we merge into the mainstream we can review each of these decisions
and decide whether each is an inherently per-process decision, or
whether more XX_info structures are warranted.

>> this doesn't look very cool to me, as IRQs should
>> be handled in the host context and TCP/IP in the
>> proper network space ...
> this is exactly what it does.
> On IRQ, context is switched to the host.
> In TCP/IP, to the context of the socket or network device.

That sounds like an interesting innovation, and we can compare our
patches in this space once we have some common terms of reference and
starting points.

>> the question here is, do we really want to turn it
>> off at all? IMHO the design and implementation
>> should be sufficiently good so that it does neither
>> impose unnecessary overhead nor change the default
>> behaviour ...
> this is the question I want answered by Linus/Andrew.
> I don't believe in low overhead. It starts with virtualization, then
> comes resource management etc. These features _definitely_ introduce
> overhead and increase resource consumption. Not big, but why not
> make it configurable?

Obviously, our projects have different goals; Linux-VServer has very
little performance overhead. Special provisions are made to achieve
scalability on SMP and to avoid unnecessary cacheline issues. Once
that is sorted out, it's very hard to measure any performance overhead
of it, especially when the task_struct->vx_info pointer is null.

However, I see nothing wrong with making all the code disappear when
the kernel config option is disabled. I expect that as time goes on,
you'd just as soon disable it as you would disable the open() system
call. I think that's what Herbert was getting at with his comment.

> Seems you are just trying to move away from the topic. Great.

I always did want to be a Lumberjack!

Sam.

^ permalink raw reply	[flat|nested] 27+ messages in thread
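The null-pointer fast path Sam describes can be sketched roughly as
follows. This is illustrative only: vx_info is the Linux-VServer field
name, but the helper, its signature, and the id check are assumptions,
not the project's actual code:

	/* Tasks outside any vserver context keep task->vx_info == NULL,
	 * so the common (host) case costs a single test-and-branch and
	 * touches no additional cachelines. */
	static inline int task_in_context(struct task_struct *p,
					  unsigned int xid)
	{
		struct vx_info *vxi = p->vx_info;

		if (!vxi)	/* host task: no context to check */
			return 1;
		return vxi->vx_id == xid;
	}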
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-21 16:00 ` Kirill Korotaev
  2006-02-21 20:33   ` Sam Vilain
@ 2006-02-21 23:50   ` Herbert Poetzl
  2006-02-22 10:09     ` [Devel] " Kir Kolyshkin
  1 sibling, 1 reply; 27+ messages in thread
From: Herbert Poetzl @ 2006-02-21 23:50 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel,
      Eric W. Biederman, Andrey Savochkin, Alexey Kuznetsov,
      Stanislav Protassov, serue, frankeh, clg, haveblue, mrmacman_g4,
      alan, Andrew Morton

On Tue, Feb 21, 2006 at 07:00:55PM +0300, Kirill Korotaev wrote:
>>> - such an approach requires adding an additional argument to many
>>>   functions (e.g. Eric's patch for networking is 1.5 times bigger
>>>   than OpenVZ's).
>> hmm? last time I checked OpenVZ was quite bloated
>> compared to Linux-VServer, and Eric's network part
>> isn't even there yet ...
> This is a rather subjective feeling.

of course, of course ...

OpenVZ stable patches:
  1857829  patch-022stab032-core
  1886915  patch-022stab034-core
  7390511  patch-022stab045-combined
  7570326  patch-022stab050-combined
  8042889  patch-022stab056-combined
  8059201  patch-022stab064-combined

Linux-VServer stable releases:
   100130  patch-2.4.20-vs1.00.diff
   135068  patch-2.4.21-vs1.20.diff
   587170  patch-2.6.12.4-vs2.0.diff
   593052  patch-2.6.14.3-vs2.01.diff
   619268  patch-2.6.15.4-vs2.0.2-rc6.diff

> I can say the same about VServer.

really?

>>> - it can't efficiently compile down to the same non-virtualized
>>>   kernel, which can be undesirable for embedded Linux.
>> while OpenVZ does?
> yes. In _most_ cases it does.
> If Linus/Andrew/others feel this is not an issue at all, I will be
> the first to welcome Eric's approach. I'm not against it, though it
> looks like a disadvantage to me.
>
>>> - fine-grained namespaces are actually an obfuscation, since kernel
>>>   subsystems are tightly interconnected, e.g. network -> sysctl ->
>>>   proc, mqueues -> netlink, ipc -> fs, and most often they can be
>>>   used only as a whole container.
>> I think a lot of _strange_ interconnects there could
>> use some cleanup, and after that the interconnections
>> would be very small
> Why do you think they are strange!? Is it strange that networking
> exports its sysctls and statistics via proc?
> Is it strange for you that IPC uses fs?
> It is by _design_.
>
>>> - it involves a slightly more complicated procedure of container
>>>   create/enter which requires exec or something like it, since
>>>   there is no effective container which could simply be switched to.
>> I don't understand this argument ...
> - you need to track dependencies between namespaces (e.g. NAT
> requires conntracks, IPC requires FS etc.). This should be handled;
> otherwise a container that is able to create a nested container will
> be able to cause an oops.
>
>>> 2. Containers (OpenVZ.org / linux-vserver.org)
>>
>> please do not generalize here, Linux-VServer does
>> not use a single container structure as you might
>> think ...
> 1.
> This topic is not a question of the single container only...
> also AFAIK you use it all together only.
> 2.
> Just to be clear: once again, I'm not against namespaces.
>
>>> The container solution was discussed before, and actually it is
>>> also a namespace solution, but with one total namespace, described
>>> by a single kernel structure.
>>
>> that might be true for OpenVZ, but it is not for
>> Linux-VServer, as we have structures for network
>> and process contexts as well as different ones for
>> disk limits
> do you have support for it in tools?
> i.e. do you support namespaces somehow? Can you create a
> half-virtualized container?

sure, just get the tools and use vnamespace
or if you prefer chcontext or chbind ...

>>> Every task has two container pointers: container and effective
>>> container. The latter is used to temporarily switch to other
>>> contexts, e.g. when handling IRQs, TCP/IP etc.
>>
>> this doesn't look very cool to me, as IRQs should
>> be handled in the host context and TCP/IP in the
>> proper network space ...
> this is exactly what it does.
> On IRQ, context is switched to the host.
> In TCP/IP, to the context of the socket or network device.
>
>>> Benefits:
>>> - a clear, logically bounded container; it is clear when a
>>>   container is alive and when it is not.
>> how does that handle the issues you described with
>> sockets in wait state which have very long timeouts?
> easily.
> We have clear logic of container lifetime - it is alive until the
> last process in it is alive. When the processes die, the container
> is destroyed and so are all its sockets. From the namespaces point
> of view, this means that the lifetime of the network namespace is
> limited to the lifetime of the pid namespace.
>
>>> - it doesn't introduce additional args for most functions, and no
>>>   additional stack usage.
>> a single additional arg here and there won't hurt,
>> and I'm pretty sure most of them will be in inlined
>> code, where it doesn't really matter
> have you analyzed that before thinking about inlining?
>
>>> - it compiles down to the good old kernel when virtualization is
>>>   off, so it doesn't disturb other configurations.
>> the question here is, do we really want to turn it
>> off at all? IMHO the design and implementation
>> should be sufficiently good so that it does neither
>> impose unnecessary overhead nor change the default
>> behaviour ...
> this is the question I want answered by Linus/Andrew.
> I don't believe in low overhead. It starts with virtualization, then
> comes resource management etc. These features _definitely_ introduce
> overhead and increase resource consumption. Not big, but why not
> make it configurable? You don't need CPUSETS on a UP i386 machine,
> do you? Why would I want this stuff in my embedded Linux? The same
> goes for secured Linux distributions; it only opens ways for
> possible security issues.

ah, you got me wrong there, of course embedded systems
have other requirements, and it might turn out that some
of those virtualizations require config options to
disable them ... but, I do not see a measurable overhead
there and I do not consider it a problem to disable
certain more expensive parts ...

>>> - Eric brought up an interesting idea about introducing an
>>>   interface like DEFINE_CPU_VAR(), which could potentially allow
>>>   creating virtualized variables automagically and accessing them
>>>   via econtainer().
>> how is that an advantage of the container approach?
> Such vars can automatically be defined to something like
> "(econtainer()->virtualized_variable)".
> This looks similar to the percpu variable interfaces.
>
>>> - mature working code exists which has been used in production for
>>>   years, so a first working version can be done much more quickly.
>> from the OpenVZ/Virtuozzo(tm) page:
>>   Specific benefits of Virtuozzo(tm) compared to OpenVZ can be
>>   found below:
>>   - Higher VPS density. Virtuozzo(tm) provides efficient memory
>>     and file sharing mechanisms enabling higher VPS density and
>>     better performance of VPSs.
>>   - Improved Stability, Scalability, and Performance. Virtuozzo(tm)
>>     is designed to run 24×7 environments with production workloads
>>     on hosts with up to 32 CPUs.
>> so I conclude, OpenVZ does not contain the code which
>> provides all this ..
> :))))
> Doesn't provide what? Stability?
> The Q/A process used for Virtuozzo ends up in the OpenVZ code
> eventually as well. This is more a subject of support/QA.
> Performance? We optimize systems for customers, have
> HA/monitoring/tuning/management tools for it etc.
>
> Seems you are just trying to move away from the topic. Great.

I guess I was right on topic ...

best,
Herbert

> Thanks,
> Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [Devel] Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-21 23:50 ` Herbert Poetzl
@ 2006-02-22 10:09   ` Kir Kolyshkin
  2006-02-22 15:26     ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Kir Kolyshkin @ 2006-02-22 10:09 UTC (permalink / raw)
  To: devel
  Cc: Kirill Korotaev, Andrew Morton, Rik van Riel, Andrey Savochkin,
      alan, Linux Kernel Mailing List, mrmacman_g4, Linus Torvalds,
      frankeh, Eric W. Biederman, serue, Alexey Kuznetsov

Herbert Poetzl wrote:
> On Tue, Feb 21, 2006 at 07:00:55PM +0300, Kirill Korotaev wrote:
>>>> - such an approach requires adding an additional argument to many
>>>>   functions (e.g. Eric's patch for networking is 1.5 times bigger
>>>>   than OpenVZ's).
>>> hmm? last time I checked OpenVZ was quite bloated
>>> compared to Linux-VServer, and Eric's network part
>>> isn't even there yet ...
>> This is a rather subjective feeling.
>
> of course, of course ...
>
> OpenVZ stable patches:
>   1857829  patch-022stab032-core
>   1886915  patch-022stab034-core
>   7390511  patch-022stab045-combined
>   7570326  patch-022stab050-combined
>   8042889  patch-022stab056-combined
>   8059201  patch-022stab064-combined
>
> Linux-VServer stable releases:
>    100130  patch-2.4.20-vs1.00.diff
>    135068  patch-2.4.21-vs1.20.diff
>    587170  patch-2.6.12.4-vs2.0.diff
>    593052  patch-2.6.14.3-vs2.01.diff
>    619268  patch-2.6.15.4-vs2.0.2-rc6.diff

Herbert,

Please stop spreading, hmm, falsehoods. The OpenVZ patches you mention
are against the 2.6.8 kernel, thus they contain tons of backported
mainstream bugfixes and driver updates; so most of this size is not
virtualization, but general security/stability/drivers stuff. And yes,
that size also indirectly tells how much work we do to keep our users
happy.

Back to the topic. If you (or somebody else) want to see the real size
of things, take a look at the broken-out patch set, available from
http://download.openvz.org/kernel/broken-out/. Here (for the
2.6.15-025stab014.1 kernel) we see that it all boils down to:

  Virtualization stuff (diff-vemix-20060120-core):             817K
  Resource management, User Beancounters (diff-ubc-20060120):  377K
  Two-level disk quota (diff-vzdq-20051219-2):                 154K

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [Devel] Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-22 10:09 ` [Devel] " Kir Kolyshkin
@ 2006-02-22 15:26   ` Eric W. Biederman
  2006-02-23 12:02     ` Kir Kolyshkin
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2006-02-22 15:26 UTC (permalink / raw)
  To: Kir Kolyshkin
  Cc: devel, Kirill Korotaev, Andrew Morton, Rik van Riel,
      Andrey Savochkin, alan, Linux Kernel Mailing List, mrmacman_g4,
      Linus Torvalds, frankeh, serue, Alexey Kuznetsov

Kir Kolyshkin <kir@openvz.org> writes:

> Please stop spreading, hmm, falsehoods. The OpenVZ patches you
> mention are against the 2.6.8 kernel, thus they contain tons of
> backported mainstream bugfixes and driver updates; so most of this
> size is not virtualization, but general security/stability/drivers
> stuff. And yes, that size also indirectly tells how much work we do
> to keep our users happy.

I think Herbert was trying to add some balance to the equation.

> Back to the topic. If you (or somebody else) want to see the real
> size of things, take a look at the broken-out patch set, available
> from http://download.openvz.org/kernel/broken-out/. Here (for the
> 2.6.15-025stab014.1 kernel) we see that it all boils down to:

Thanks. This is the first indication I have seen that you even have
broken-out patches. Why those aren't in your source rpms is beyond me.
Everything seems to have been posted in a 2-3 day window at the end of
January and the beginning of February. Is this something you are now
providing?

Shakes head. You have a patch in broken-out that is 817K. Do you
really maintain it this way, as one giant patch?

>   Virtualization stuff (diff-vemix-20060120-core):             817K
>   Resource management, User Beancounters (diff-ubc-20060120):  377K
>   Two-level disk quota (diff-vzdq-20051219-2):                 154K

As for the size of my code, sure, parts of it are big; I haven't
really measured. Primarily this is because I'm not afraid of doing the
heavy lifting necessary for a clean, long-term maintainable solution.

Now, while all of this is interesting, it really is beside the point,
because neither the current VServer nor the current OpenVZ code is
ready for mainstream kernel inclusion.

Please let's not get sidetracked playing whose patch is bigger.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [Devel] Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-22 15:26 ` Eric W. Biederman
@ 2006-02-23 12:02   ` Kir Kolyshkin
  2006-02-23 13:25     ` Eric W. Biederman
  0 siblings, 1 reply; 27+ messages in thread
From: Kir Kolyshkin @ 2006-02-23 12:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: devel, Kirill Korotaev, Andrew Morton, Rik van Riel,
      Andrey Savochkin, alan, Linux Kernel Mailing List, mrmacman_g4,
      Linus Torvalds, frankeh, serue, Alexey Kuznetsov

Eric W. Biederman wrote:
>> Back to the topic. If you (or somebody else) want to see the real
>> size of things, take a look at the broken-out patch set, available
>> from http://download.openvz.org/kernel/broken-out/. Here (for the
>> 2.6.15-025stab014.1 kernel) we see that it all boils down to:
>
> Thanks. This is the first indication I have seen that you even have
> broken-out patches.

When Kirill Korotaev announced the OpenVZ patch set on LKML (two times
-- initially and for 2.6.15), he gave the links to the broken-out
patch set both times.

> Why those aren't in your source rpms is beyond me.

That reflects our internal organization: we have a core virtualization
team which comes up with a core patch (combining all the stuff), and a
maintenance team which can add some extra patches (driver updates,
some bugfixes). So those extra patches come as separate patches in the
src.rpms, while the virtualization stuff comes as a single patch. That
way it is easier for our maintainers group.

Sure, we understand this is not convenient for developers who want to
look at our code -- and thus we provide broken-out kernel patch sets
from time to time (not for every release, as it requires some effort
from Kirill, who is really busy anyway). So, if you want this for a
specific kernel -- just ask.

I understand that this might look strange, but again, it reflects our
internal development structure.

> Everything seems to have been posted in a 2-3 day window at the end
> of January and the beginning of February. Is this something you are
> now providing?

Again, yes, occasionally, from time to time, or upon request.

> Shakes head. You have a patch in broken-out that is 817K. Do you
> really maintain it this way, as one giant patch?

In the version I took (025stab014) it was indeed one big patch, and I
believe Kirill maintains it that way.

Previous kernel version (025stab012) was more fine-grained; take a
look at
http://download.openvz.org/kernel/broken-out/2.6.15-025stab012.1

> Please let's not get sidetracked playing whose patch is bigger.

Absolutely agree!

Regards,
  Kir Kolyshkin, OpenVZ team.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [Devel] Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-23 12:02 ` Kir Kolyshkin
@ 2006-02-23 13:25   ` Eric W. Biederman
  2006-02-23 14:00     ` Kir Kolyshkin
  0 siblings, 1 reply; 27+ messages in thread
From: Eric W. Biederman @ 2006-02-23 13:25 UTC (permalink / raw)
  To: Kir Kolyshkin
  Cc: devel, Kirill Korotaev, Andrew Morton, Rik van Riel,
      Andrey Savochkin, alan, Linux Kernel Mailing List, mrmacman_g4,
      Linus Torvalds, frankeh, serue, Alexey Kuznetsov

Kir Kolyshkin <kir@openvz.org> writes:

> When Kirill Korotaev announced the OpenVZ patch set on LKML (two
> times -- initially and for 2.6.15), he gave the links to the
> broken-out patch set both times.

Hmm. I guess I just missed it.

> That reflects our internal organization: we have a core
> virtualization team which comes up with a core patch (combining all
> the stuff), and a maintenance team which can add some extra patches
> (driver updates, some bugfixes). So those extra patches come as
> separate patches in the src.rpms, while the virtualization stuff
> comes as a single patch. That way it is easier for our maintainers
> group.
>
> Sure, we understand this is not convenient for developers who want
> to look at our code -- and thus we provide broken-out kernel patch
> sets from time to time (not for every release, as it requires some
> effort from Kirill, who is really busy anyway). So, if you want this
> for a specific kernel -- just ask.
>
> I understand that this might look strange, but again, it reflects
> our internal development structure.

There is something this brings up. Currently OpenVZ seems to be a
project where you guys do the work and release the source under the
GPL, making it technically an open source project. However, at the
development level it does not appear to be a community project.

There is nothing wrong with not involving the larger community in the
development, but what it does mean is that, as a developer, I largely
find OpenVZ uninteresting.

> In the version I took (025stab014) it was indeed one big patch, and
> I believe Kirill maintains it that way.
>
> Previous kernel version (025stab012) was more fine-grained; take a
> look at
> http://download.openvz.org/kernel/broken-out/2.6.15-025stab012.1

Looks a little better, yes. This actually surprises me, because my
past experience is that well-focused patches are usually easier to
forward-port, as they usually have fewer collisions and those
collisions are usually easier to resolve.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [Devel] Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-23 13:25 ` Eric W. Biederman
@ 2006-02-23 14:00   ` Kir Kolyshkin
  0 siblings, 0 replies; 27+ messages in thread
From: Kir Kolyshkin @ 2006-02-23 14:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kir Kolyshkin, devel, Kirill Korotaev, Andrew Morton,
      Rik van Riel, Andrey Savochkin, alan, Linux Kernel Mailing List,
      mrmacman_g4, Linus Torvalds, frankeh, serue, Alexey Kuznetsov

Eric W. Biederman wrote:
>> That reflects our internal organization: we have a core
>> virtualization team which comes up with a core patch (combining all
>> the stuff), and a maintenance team which can add some extra patches
>> (driver updates, some bugfixes). So those extra patches come as
>> separate patches in the src.rpms, while the virtualization stuff
>> comes as a single patch. That way it is easier for our maintainers
>> group.
>>
>> Sure, we understand this is not convenient for developers who want
>> to look at our code -- and thus we provide broken-out kernel patch
>> sets from time to time (not for every release, as it requires some
>> effort from Kirill, who is really busy anyway). So, if you want
>> this for a specific kernel -- just ask.
>>
>> I understand that this might look strange, but again, it reflects
>> our internal development structure.
>
> There is something this brings up. Currently OpenVZ seems to be a
> project where you guys do the work and release the source under the
> GPL, making it technically an open source project. However, at the
> development level it does not appear to be a community project.
>
> There is nothing wrong with not involving the larger community in
> the development, but what it does mean is that, as a developer, I
> largely find OpenVZ uninteresting.

I thought that the first thing that makes a particular technology
interesting or otherwise appealing to developers is the technology
itself, i.e. is it interesting, appealing, innovative and superior; is
it "tomorrow, today"; and so on. From that point of view, OpenVZ is
pretty interesting.

From the point of openness -- well, you might be right; there is still
something we could do. I understand it should work both ways -- we
should provide easier ways to access the code, to contribute, etc.
Still, I see little to no interest in contributing to the OpenVZ
kernel. Probably this is because of the high entry barrier, probably
it is because we are not yet open enough, or so.

Anyway, I would love to hear any comments/suggestions on how we can
improve this situation from our side (and let me express the hope that
you will improve it from yours :)).

Regards,
  Kir.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-20 15:45 Which of the virtualization approaches is more suitable for kernel? Kirill Korotaev
  2006-02-20 16:12 ` Herbert Poetzl
@ 2006-02-24 21:44 ` Eric W. Biederman
  2006-02-24 23:01   ` Herbert Poetzl
  2006-02-27 17:42   ` Dave Hansen
  1 sibling, 2 replies; 27+ messages in thread
From: Eric W. Biederman @ 2006-02-24 21:44 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel,
      Andrey Savochkin, Alexey Kuznetsov, Stanislav Protassov, serue,
      frankeh, clg, haveblue, mrmacman_g4, alan, Herbert Poetzl,
      Andrew Morton

Kirill Korotaev <dev@sw.ru> writes:

> Linus, Andrew,
>
> We need your help on what virtualization approach you would accept
> into mainstream (if any) and where we should go.
>
> If we drop VPID virtualization, which caused many disputes, we
> actually have one virtualization solution, but two approaches to it.
> Which one goes in depends on the goals and on your approval anyway.

My apologies for not replying sooner.

From the looks of the previous replies, I think we have some valid
commonalities that we can focus on.

Largely we all agree that to applications things should look exactly
as they do now. Currently we do not agree on management interfaces.

We seem to have much more agreement on everything except pids, so
discussing some of the other pieces looks worthwhile.

So I propose we split the patches to solve the problem into three
categories:
- General cleanups that simplify or fix problems now, but have a
  major advantage for our work.
- The kernel-internal implementation of the various namespaces,
  without an interface to create new ones.
- The new interfaces for how we create and control
  containers/namespaces.

This should allow the various approaches to start sharing code,
getting progressively closer to each other until we have an
implementation we can agree is ready to go into Linus's kernel. Plus,
that will allow us to have our technical flame wars without totally
stopping progress.

We can start on a broad front, looking at several different things.
But I suggest the first thing we all look at is SYSVIPC. It is
currently a clearly recognized namespace in the kernel, so the scope
is well defined. SYSVIPC is just complicated enough to have a
non-trivial implementation while at the same time being simple enough
that we can go through the code in exhausting detail, getting the
group dynamics working properly.

Then we can as a group look at networking, pids, and the other pieces.

But I do think it is important that we take the problem in pieces,
because otherwise it is simply too large to review properly.

Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: Which of the virtualization approaches is more suitable for kernel?
  2006-02-24 21:44 ` Eric W. Biederman
@ 2006-02-24 23:01   ` Herbert Poetzl
  2006-02-27 17:42   ` Dave Hansen
  1 sibling, 0 replies; 27+ messages in thread
From: Herbert Poetzl @ 2006-02-24 23:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Korotaev, Linus Torvalds, Rik van Riel, Linux Kernel
      Mailing List, devel, Andrey Savochkin, Alexey Kuznetsov,
      Stanislav Protassov, serue, frankeh, clg, haveblue, mrmacman_g4,
      alan, Andrew Morton

On Fri, Feb 24, 2006 at 02:44:42PM -0700, Eric W. Biederman wrote:
> Kirill Korotaev <dev@sw.ru> writes:
>
> > Linus, Andrew,
> >
> > We need your help on what virtualization approach you would accept
> > into mainstream (if any) and where we should go.
> >
> > If we drop VPID virtualization, which caused many disputes, we
> > actually have one virtualization solution, but two approaches to
> > it. Which one goes in depends on the goals and on your approval
> > anyway.
>
> My apologies for not replying sooner.
>
> From the looks of the previous replies, I think we have some valid
> commonalities that we can focus on.
>
> Largely we all agree that to applications things should look exactly
> as they do now. Currently we do not agree on management interfaces.
>
> We seem to have much more agreement on everything except pids, so
> discussing some of the other pieces looks worthwhile.
>
> So I propose we split the patches to solve the problem into three
> categories:
> - General cleanups that simplify or fix problems now, but have a
>   major advantage for our work.
> - The kernel-internal implementation of the various namespaces,
>   without an interface to create new ones.
> - The new interfaces for how we create and control
>   containers/namespaces.

proposal accepted on my side

> This should allow the various approaches to start sharing code,
> getting progressively closer to each other until we have an
> implementation we can agree is ready to go into Linus's kernel. Plus,
> that will allow us to have our technical flame wars without totally
> stopping progress.
>
> We can start on a broad front, looking at several different things.
> But I suggest the first thing we all look at is SYSVIPC. It is
> currently a clearly recognized namespace in the kernel, so the scope
> is well defined. SYSVIPC is just complicated enough to have a
> non-trivial implementation while at the same time being simple
> enough that we can go through the code in exhausting detail, getting
> the group dynamics working properly.

okay, sounds good ...

> Then we can as a group look at networking, pids, and the other
> pieces.
>
> But I do think it is important that we take the problem in pieces,
> because otherwise it is simply too large to review properly.

definitely

best,
Herbert

> Eric

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: Which of the virtualization approaches is more suitable for kernel? 2006-02-24 21:44 ` Eric W. Biederman 2006-02-24 23:01 ` Herbert Poetzl @ 2006-02-27 17:42 ` Dave Hansen 2006-02-27 21:14 ` Eric W. Biederman 1 sibling, 1 reply; 27+ messages in thread From: Dave Hansen @ 2006-02-27 17:42 UTC (permalink / raw) To: Eric W. Biederman Cc: Kirill Korotaev, Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel, Andrey Savochkin, Alexey Kuznetsov, Stanislav Protassov, serue, frankeh, clg, mrmacman_g4, alan, Herbert Poetzl, Andrew Morton On Fri, 2006-02-24 at 14:44 -0700, Eric W. Biederman wrote: > We can start on a broad front, looking at several different things. > But I suggest the first thing we all look at is SYSVIPC. It is > currently a clearly recognized namespace in the kernel so the scope is > well defined. SYSVIPC is just complicated enough to have a > non-trivial implementation while at the same time being simple enough > that we can go through the code in exhausting detail. Getting the > group dynamics working properly. Here's a quick stab at the ipc/msg.c portion of this work. The basic approach was to move msg_ids, msg_bytes, and msg_hdrs into a structure, put a pointer to that structure in the task_struct and then dynamically allocate it. There is still only one system-wide one of these for now. It can obviously be extended, though. :) This is a very simple, brute-force, hack-until-it-compiles-and-boots approach. (I just realized that I didn't check the return of the alloc properly.) Is this the form that we'd like these patches to take? Any comments about the naming? Do we want to keep the _namespace nomenclature, or does the "context" that I used here make more sense -- Dave work-dave/include/linux/ipc.h | 12 +++ work-dave/include/linux/sched.h | 1 work-dave/ipc/msg.c | 152 ++++++++++++++++++++++------------------ work-dave/ipc/util.c | 7 + work-dave/ipc/util.h | 2 work-dave/kernel/fork.c | 5 + 6 files changed, 108 insertions(+), 71 deletions(-) --- work/ipc/msg.c~sysv-container 2006-02-27 09:30:23.000000000 -0800 +++ work-dave/ipc/msg.c 2006-02-27 09:32:18.000000000 -0800 @@ -60,35 +60,44 @@ struct msg_sender { #define SEARCH_NOTEQUAL 3 #define SEARCH_LESSEQUAL 4 -static atomic_t msg_bytes = ATOMIC_INIT(0); -static atomic_t msg_hdrs = ATOMIC_INIT(0); +#define msg_lock(ctx, id) ((struct msg_queue*)ipc_lock(&ctx->msg_ids,id)) +#define msg_unlock(ctx, msq) ipc_unlock(&(msq)->q_perm) +#define msg_rmid(ctx, id) ((struct msg_queue*)ipc_rmid(&ctx->msg_ids,id)) +#define msg_checkid(ctx, msq, msgid) \ + ipc_checkid(&ctx->msg_ids,&msq->q_perm,msgid) +#define msg_buildid(ctx, id, seq) \ + ipc_buildid(&ctx->msg_ids, id, seq) -static struct ipc_ids msg_ids; - -#define msg_lock(id) ((struct msg_queue*)ipc_lock(&msg_ids,id)) -#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm) -#define msg_rmid(id) ((struct msg_queue*)ipc_rmid(&msg_ids,id)) -#define msg_checkid(msq, msgid) \ - ipc_checkid(&msg_ids,&msq->q_perm,msgid) -#define msg_buildid(id, seq) \ - ipc_buildid(&msg_ids, id, seq) - -static void freeque (struct msg_queue *msq, int id); -static int newque (key_t key, int msgflg); +static void freeque (struct ipc_msg_context *, struct msg_queue *msq, int id); +static int newque (struct ipc_msg_context *context, key_t key, int id); #ifdef CONFIG_PROC_FS static int sysvipc_msg_proc_show(struct seq_file *s, void *it); #endif -void __init msg_init (void) +struct ipc_msg_context *alloc_ipc_msg_context(gfp_t flags) +{ + struct ipc_msg_context *msg_context; + + msg_context = 
kzalloc(sizeof(*msg_context), flags); + if (!msg_context) + return NULL; + + atomic_set(&msg_context->msg_bytes, 0); + atomic_set(&msg_context->msg_hdrs, 0); + + return msg_context; +} + +void __init msg_init (struct ipc_msg_context *context) { - ipc_init_ids(&msg_ids,msg_ctlmni); + ipc_init_ids(&context->msg_ids,msg_ctlmni); ipc_init_proc_interface("sysvipc/msg", " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n", - &msg_ids, + &context->msg_ids, sysvipc_msg_proc_show); } -static int newque (key_t key, int msgflg) +static int newque (struct ipc_msg_context *context, key_t key, int msgflg) { int id; int retval; @@ -108,14 +117,14 @@ static int newque (key_t key, int msgflg return retval; } - id = ipc_addid(&msg_ids, &msq->q_perm, msg_ctlmni); + id = ipc_addid(&context->msg_ids, &msq->q_perm, msg_ctlmni); if(id == -1) { security_msg_queue_free(msq); ipc_rcu_putref(msq); return -ENOSPC; } - msq->q_id = msg_buildid(id,msq->q_perm.seq); + msq->q_id = msg_buildid(context,id,msq->q_perm.seq); msq->q_stime = msq->q_rtime = 0; msq->q_ctime = get_seconds(); msq->q_cbytes = msq->q_qnum = 0; @@ -124,7 +133,7 @@ static int newque (key_t key, int msgflg INIT_LIST_HEAD(&msq->q_messages); INIT_LIST_HEAD(&msq->q_receivers); INIT_LIST_HEAD(&msq->q_senders); - msg_unlock(msq); + msg_unlock(context, msq); return msq->q_id; } @@ -182,23 +191,24 @@ static void expunge_all(struct msg_queue * msg_ids.sem and the spinlock for this message queue is hold * before freeque() is called. msg_ids.sem remains locked on exit. */ -static void freeque (struct msg_queue *msq, int id) +static void freeque (struct ipc_msg_context *context, + struct msg_queue *msq, int id) { struct list_head *tmp; expunge_all(msq,-EIDRM); ss_wakeup(&msq->q_senders,1); - msq = msg_rmid(id); - msg_unlock(msq); + msq = msg_rmid(context, id); + msg_unlock(context, msq); tmp = msq->q_messages.next; while(tmp != &msq->q_messages) { struct msg_msg* msg = list_entry(tmp,struct msg_msg,m_list); tmp = tmp->next; - atomic_dec(&msg_hdrs); + atomic_dec(&context->msg_hdrs); free_msg(msg); } - atomic_sub(msq->q_cbytes, &msg_bytes); + atomic_sub(msq->q_cbytes, &context->msg_bytes); security_msg_queue_free(msq); ipc_rcu_putref(msq); } @@ -207,32 +217,34 @@ asmlinkage long sys_msgget (key_t key, i { int id, ret = -EPERM; struct msg_queue *msq; - - down(&msg_ids.sem); + struct ipc_msg_context *context = current->ipc_msg_context; + + down(&context->msg_ids.sem); if (key == IPC_PRIVATE) - ret = newque(key, msgflg); - else if ((id = ipc_findkey(&msg_ids, key)) == -1) { /* key not used */ + ret = newque(context, key, msgflg); + else if ((id = ipc_findkey(&context->msg_ids, key)) == -1) { + /* key not used */ if (!(msgflg & IPC_CREAT)) ret = -ENOENT; else - ret = newque(key, msgflg); + ret = newque(context, key, msgflg); } else if (msgflg & IPC_CREAT && msgflg & IPC_EXCL) { ret = -EEXIST; } else { - msq = msg_lock(id); + msq = msg_lock(context, id); if(msq==NULL) BUG(); if (ipcperms(&msq->q_perm, msgflg)) ret = -EACCES; else { - int qid = msg_buildid(id, msq->q_perm.seq); + int qid = msg_buildid(context, id, msq->q_perm.seq); ret = security_msg_queue_associate(msq, msgflg); if (!ret) ret = qid; } - msg_unlock(msq); + msg_unlock(context, msq); } - up(&msg_ids.sem); + up(&context->msg_ids.sem); return ret; } @@ -333,6 +345,7 @@ asmlinkage long sys_msgctl (int msqid, i struct msg_queue *msq; struct msq_setbuf setbuf; struct kern_ipc_perm *ipcp; + struct ipc_msg_context *context = current->ipc_msg_context; if (msqid < 0 || cmd < 0) return 
-EINVAL; @@ -362,18 +375,18 @@ asmlinkage long sys_msgctl (int msqid, i msginfo.msgmnb = msg_ctlmnb; msginfo.msgssz = MSGSSZ; msginfo.msgseg = MSGSEG; - down(&msg_ids.sem); + down(&context->msg_ids.sem); if (cmd == MSG_INFO) { - msginfo.msgpool = msg_ids.in_use; - msginfo.msgmap = atomic_read(&msg_hdrs); - msginfo.msgtql = atomic_read(&msg_bytes); + msginfo.msgpool = context->msg_ids.in_use; + msginfo.msgmap = atomic_read(&context->msg_hdrs); + msginfo.msgtql = atomic_read(&context->msg_bytes); } else { msginfo.msgmap = MSGMAP; msginfo.msgpool = MSGPOOL; msginfo.msgtql = MSGTQL; } - max_id = msg_ids.max_id; - up(&msg_ids.sem); + max_id = context->msg_ids.max_id; + up(&context->msg_ids.sem); if (copy_to_user (buf, &msginfo, sizeof(struct msginfo))) return -EFAULT; return (max_id < 0) ? 0: max_id; @@ -385,20 +398,21 @@ asmlinkage long sys_msgctl (int msqid, i int success_return; if (!buf) return -EFAULT; - if(cmd == MSG_STAT && msqid >= msg_ids.entries->size) + if(cmd == MSG_STAT && msqid >= context->msg_ids.entries->size) return -EINVAL; memset(&tbuf,0,sizeof(tbuf)); - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); if (msq == NULL) return -EINVAL; if(cmd == MSG_STAT) { - success_return = msg_buildid(msqid, msq->q_perm.seq); + success_return = + msg_buildid(context, msqid, msq->q_perm.seq); } else { err = -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock; success_return = 0; } @@ -419,7 +433,7 @@ asmlinkage long sys_msgctl (int msqid, i tbuf.msg_qbytes = msq->q_qbytes; tbuf.msg_lspid = msq->q_lspid; tbuf.msg_lrpid = msq->q_lrpid; - msg_unlock(msq); + msg_unlock(context, msq); if (copy_msqid_to_user(buf, &tbuf, version)) return -EFAULT; return success_return; @@ -438,14 +452,14 @@ asmlinkage long sys_msgctl (int msqid, i return -EINVAL; } - down(&msg_ids.sem); - msq = msg_lock(msqid); + down(&context->msg_ids.sem); + msq = msg_lock(context, msqid); err=-EINVAL; if (msq == NULL) goto out_up; err = -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock_up; ipcp = &msq->q_perm; err = -EPERM; @@ -480,22 +494,22 @@ asmlinkage long sys_msgctl (int msqid, i * due to a larger queue size. 
*/ ss_wakeup(&msq->q_senders,0); - msg_unlock(msq); + msg_unlock(context, msq); break; } case IPC_RMID: - freeque (msq, msqid); + freeque (context, msq, msqid); break; } err = 0; out_up: - up(&msg_ids.sem); + up(&context->msg_ids.sem); return err; out_unlock_up: - msg_unlock(msq); + msg_unlock(context, msq); goto out_up; out_unlock: - msg_unlock(msq); + msg_unlock(context, msq); return err; } @@ -558,7 +572,8 @@ asmlinkage long sys_msgsnd (int msqid, s struct msg_msg *msg; long mtype; int err; - + struct ipc_msg_context *context = current->ipc_msg_context; + if (msgsz > msg_ctlmax || (long) msgsz < 0 || msqid < 0) return -EINVAL; if (get_user(mtype, &msgp->mtype)) @@ -573,13 +588,13 @@ asmlinkage long sys_msgsnd (int msqid, s msg->m_type = mtype; msg->m_ts = msgsz; - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); err=-EINVAL; if(msq==NULL) goto out_free; err= -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock_free; for (;;) { @@ -605,7 +620,7 @@ asmlinkage long sys_msgsnd (int msqid, s } ss_add(msq, &s); ipc_rcu_getref(msq); - msg_unlock(msq); + msg_unlock(context, msq); schedule(); ipc_lock_by_ptr(&msq->q_perm); @@ -630,15 +645,15 @@ asmlinkage long sys_msgsnd (int msqid, s list_add_tail(&msg->m_list,&msq->q_messages); msq->q_cbytes += msgsz; msq->q_qnum++; - atomic_add(msgsz,&msg_bytes); - atomic_inc(&msg_hdrs); + atomic_add(msgsz,&context->msg_bytes); + atomic_inc(&context->msg_hdrs); } err = 0; msg = NULL; out_unlock_free: - msg_unlock(msq); + msg_unlock(context, msq); out_free: if(msg!=NULL) free_msg(msg); @@ -670,17 +685,18 @@ asmlinkage long sys_msgrcv (int msqid, s struct msg_queue *msq; struct msg_msg *msg; int mode; + struct ipc_msg_context *context = current->ipc_msg_context; if (msqid < 0 || (long) msgsz < 0) return -EINVAL; mode = convert_mode(&msgtyp,msgflg); - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); if(msq==NULL) return -EINVAL; msg = ERR_PTR(-EIDRM); - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock; for (;;) { @@ -720,10 +736,10 @@ asmlinkage long sys_msgrcv (int msqid, s msq->q_rtime = get_seconds(); msq->q_lrpid = current->tgid; msq->q_cbytes -= msg->m_ts; - atomic_sub(msg->m_ts,&msg_bytes); - atomic_dec(&msg_hdrs); + atomic_sub(msg->m_ts,&context->msg_bytes); + atomic_dec(&context->msg_hdrs); ss_wakeup(&msq->q_senders,0); - msg_unlock(msq); + msg_unlock(context, msq); break; } /* No message waiting. 
Wait for a message */ @@ -741,7 +757,7 @@ asmlinkage long sys_msgrcv (int msqid, s msr_d.r_maxsize = msgsz; msr_d.r_msg = ERR_PTR(-EAGAIN); current->state = TASK_INTERRUPTIBLE; - msg_unlock(msq); + msg_unlock(context, msq); schedule(); @@ -794,7 +810,7 @@ asmlinkage long sys_msgrcv (int msqid, s if (signal_pending(current)) { msg = ERR_PTR(-ERESTARTNOHAND); out_unlock: - msg_unlock(msq); + msg_unlock(context, msq); break; } } diff -puN ipc/msgutil.c~sysv-container ipc/msgutil.c diff -puN ipc/sem.c~sysv-container ipc/sem.c diff -puN ipc/shm.c~sysv-container ipc/shm.c diff -puN ipc/util.c~sysv-container ipc/util.c --- work/ipc/util.c~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/ipc/util.c 2006-02-27 09:31:43.000000000 -0800 @@ -45,11 +45,14 @@ struct ipc_proc_iface { * The various system5 IPC resources (semaphores, messages and shared * memory are initialised */ - + +struct ipc_msg_context *ipc_msg_context; static int __init ipc_init(void) { + ipc_msg_context = alloc_ipc_msg_context(GFP_KERNEL); + sem_init(); - msg_init(); + msg_init(ipc_msg_context); shm_init(); return 0; } diff -puN ipc/util.h~sysv-container ipc/util.h --- work/ipc/util.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/ipc/util.h 2006-02-27 09:30:24.000000000 -0800 @@ -12,7 +12,7 @@ #define SEQ_MULTIPLIER (IPCMNI) void sem_init (void); -void msg_init (void); +void msg_init (struct ipc_msg_context *context); void shm_init (void); struct seq_file; diff -puN include/linux/sched.h~sysv-container include/linux/sched.h --- work/include/linux/sched.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/include/linux/sched.h 2006-02-27 09:30:24.000000000 -0800 @@ -793,6 +793,7 @@ struct task_struct { int link_count, total_link_count; /* ipc stuff */ struct sysv_sem sysvsem; + struct ipc_msg_context *ipc_msg_context; /* CPU-specific state of this task */ struct thread_struct thread; /* filesystem information */ diff -puN include/linux/ipc.h~sysv-container include/linux/ipc.h --- work/include/linux/ipc.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/include/linux/ipc.h 2006-02-27 09:30:24.000000000 -0800 @@ -2,6 +2,9 @@ #define _LINUX_IPC_H #include <linux/types.h> +#include <linux/kref.h> +#include <asm/atomic.h> +#include <asm/semaphore.h> #define IPC_PRIVATE ((__kernel_key_t) 0) @@ -83,6 +86,15 @@ struct ipc_ids { struct ipc_id_ary* entries; }; +struct ipc_msg_context { + atomic_t msg_bytes; + atomic_t msg_hdrs; + + struct ipc_ids msg_ids; + struct kref count; +}; + +extern struct ipc_msg_context *alloc_ipc_msg_context(gfp_t flags); #endif /* __KERNEL__ */ #endif /* _LINUX_IPC_H */ --- work/kernel/fork.c~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/kernel/fork.c 2006-02-27 09:30:24.000000000 -0800 @@ -1184,6 +1184,11 @@ static task_t *copy_process(unsigned lon } attach_pid(p, PIDTYPE_TGID, p->tgid); attach_pid(p, PIDTYPE_PID, p->pid); + { // this extern will go away when we start to dynamically + // allocate these, nothing to see here + extern struct ipc_msg_context ipc_msg_context; + p->ipc_msg_context = current->ipc_msg_context; + } nr_threads++; total_forks++; _ ^ permalink raw reply [flat|nested] 27+ messages in thread
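The diff above is long, but the transformation it makes is mechanical. A condensed sketch of the pattern, using the same names as the patch (a sketch only; the real hunks also carry the locking and error paths):

	/* Before: ipc/msg.c keeps its bookkeeping in file-scope globals. */
	static struct ipc_ids msg_ids;
	static atomic_t msg_bytes = ATOMIC_INIT(0);
	static atomic_t msg_hdrs = ATOMIC_INIT(0);

	/* After: the same state hangs off a reference-counted structure... */
	struct ipc_msg_context {
		atomic_t msg_bytes;
		atomic_t msg_hdrs;
		struct ipc_ids msg_ids;
		struct kref count;
	};

	/* ...and each syscall entry point first resolves the context from the
	 * current task, then threads it through every helper: msg_ids becomes
	 * context->msg_ids, msg_bytes becomes context->msg_bytes, and so on. */
	asmlinkage long sys_msgget(key_t key, int msgflg)
	{
		struct ipc_msg_context *context = current->ipc_msg_context;
		/* ... unchanged logic, with context passed to msg_lock() etc. ... */
	}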
* Re: Which of the virtualization approaches is more suitable for kernel? 2006-02-27 17:42 ` Dave Hansen @ 2006-02-27 21:14 ` Eric W. Biederman 2006-02-27 21:35 ` Dave Hansen 2006-03-04 3:17 ` sysctls inside containers Dave Hansen 0 siblings, 2 replies; 27+ messages in thread From: Eric W. Biederman @ 2006-02-27 21:14 UTC (permalink / raw) To: Dave Hansen Cc: Kirill Korotaev, Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel, Andrey Savochkin, Alexey Kuznetsov, Stanislav Protassov, serue, frankeh, clg, mrmacman_g4, alan, Herbert Poetzl, Andrew Morton Dave Hansen <haveblue@us.ibm.com> writes: > On Fri, 2006-02-24 at 14:44 -0700, Eric W. Biederman wrote: >> We can start on a broad front, looking at several different things. >> But I suggest the first thing we all look at is SYSVIPC. It is >> currently a clearly recognized namespace in the kernel so the scope is >> well defined. SYSVIPC is just complicated enough to have a >> non-trivial implementation while at the same time being simple enough >> that we can go through the code in exhaustive detail, getting the >> group dynamics working properly. > > Here's a quick stab at the ipc/msg.c portion of this work. The basic > approach was to move msg_ids, msg_bytes, and msg_hdrs into a structure, > put a pointer to that structure in the task_struct and then dynamically > allocate it. > > There is still only one system-wide one of these for now. It can > obviously be extended, though. :) > > This is a very simple, brute-force, hack-until-it-compiles-and-boots > approach. (I just realized that I didn't check the return of the alloc > properly.) > > Is this the form that we'd like these patches to take? Any comments > about the naming? Do we want to keep the _namespace nomenclature, or > does the "context" that I used here make more sense? I think from 10,000 feet the form is about right. I like the namespace nomenclature. (It can be shortened to _space or _ns). In part because it shortens well, and in part because it emphasizes that we are *just* dealing with the names. You split the resolution at just ipc_msgs, when really I think it should be everything ipc deals with. Performing the assignment inside the tasklist_lock is not something we want to do in do_fork(). So it looks like a good start. There are a lot of details yet to be filled in: proc, sysctl, cleanup on namespace release. (We can still provide the create/destroy methods even if we don't hook them up). I think in this case I would put the actual namespace structure definition in util.h, and just put a struct ipc_ns in sched.h. sysvipc is isolated enough that nothing outside of the ipc/ directory needs to know the implementation details. It probably makes sense to have a statically allocated structure and to set the pointer initially in init_task.h. Until we reach the point where we can have multiple instances, that even removes the need to have a pointer copy in do_fork(), as that happens already as part of the structure copy. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
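To make the placement Eric describes concrete, a rough sketch follows; the names struct ipc_ns and init_ipc_ns are invented here, since none of this code exists yet:

	/* ipc/util.h: the full definition stays private to the ipc/ directory */
	struct ipc_ns {
		atomic_t msg_bytes;
		atomic_t msg_hdrs;
		struct ipc_ids msg_ids;
		struct kref count;
	};
	extern struct ipc_ns init_ipc_ns;	/* the statically allocated instance */

	/* include/linux/sched.h: the rest of the kernel only sees a pointer */
	struct ipc_ns;

	struct task_struct {
		/* ... */
		struct ipc_ns *ipc_ns;
		/* ... */
	};

	/* include/linux/init_task.h, inside the INIT_TASK() initializer:
	 * point the first task at the static instance; copy_process() then
	 * propagates the pointer for free as part of the structure copy,
	 * so do_fork() needs no explicit assignment. */
		.ipc_ns = &init_ipc_ns,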
* Re: Which of the virtualization approaches is more suitable for kernel? 2006-02-27 21:14 ` Eric W. Biederman @ 2006-02-27 21:35 ` Dave Hansen 2006-02-27 21:56 ` Eric W. Biederman 0 siblings, 1 reply; 27+ messages in thread From: Dave Hansen @ 2006-02-27 21:35 UTC (permalink / raw) To: Eric W. Biederman Cc: Kirill Korotaev, Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel, Andrey Savochkin, Alexey Kuznetsov, Stanislav Protassov, serue, frankeh, clg, mrmacman_g4, alan, Herbert Poetzl, Andrew Morton On Mon, 2006-02-27 at 14:14 -0700, Eric W. Biederman wrote: > I like the namespace nomenclature. (It can be shortened to _space or _ns). > In part because it shortens well, and in part because it emphasizes that > we are *just* dealing with the names. When I was looking at this, I was pretending to be just somebody looking at sysv code, with no knowledge of containers or namespaces. For a person like that, I think names like _space or _ns are pretty much not an option, unless those terms become as integral to the kernel as things like kobjects. > You split the resolution at just ipc_msgs, when really I think it should > be everything ipc deals with. This was just the first patch. :) > Performing the assignment inside the tasklist_lock is not something we > want to do in do_fork(). Any particular reason why? There seem to be a number of things done in there that aren't _strictly_ needed under the tasklist_lock. Where would you do it? > So it looks like a good start. There are a lot of details yet to be filled > in: proc, sysctl, cleanup on namespace release. (We can still provide > the create/destroy methods even if we don't hook them up). Yeah, I saved shm for last because it has the largest number of outside interactions. My current thoughts are that we'll need _contexts or _namespaces associated with /proc mounts as well. > I think in this case I would put the actual namespace structure > definition in util.h, and just put a struct ipc_ns in sched.h. Ahhh, as in struct ipc_ns; And just keep a pointer from the task? Yeah, that does keep it quite isolated. -- Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Which of the virtualization approaches is more suitable for kernel? 2006-02-27 21:35 ` Dave Hansen @ 2006-02-27 21:56 ` Eric W. Biederman 0 siblings, 0 replies; 27+ messages in thread From: Eric W. Biederman @ 2006-02-27 21:56 UTC (permalink / raw) To: Dave Hansen Cc: Kirill Korotaev, Linus Torvalds, Rik van Riel, Linux Kernel Mailing List, devel, Andrey Savochkin, Alexey Kuznetsov, Stanislav Protassov, serue, frankeh, clg, mrmacman_g4, alan, Herbert Poetzl, Andrew Morton Dave Hansen <haveblue@us.ibm.com> writes: > On Mon, 2006-02-27 at 14:14 -0700, Eric W. Biederman wrote: >> I like the namespace nomenclature. (It can be shortened to _space or _ns). >> In part because it shortens well, and in part because it emphasizes that >> we are *just* dealing with the names. > > When I was looking at this, I was pretending to be just somebody looking > at sysv code, with no knowledge of containers or namespaces. > > For a person like that, I think names like _space or _ns are pretty much > not an option, unless those terms become as integral to the kernel as > things like kobjects. To be clear, I was talking about name suffixes. So ipc_space certainly conveys something, and even ipc_ns may be ok. >> You split the resolution at just ipc_msgs, when really I think it should >> be everything ipc deals with. > > This was just the first patch. :) :) Just wanted to make certain we agreed on the scope. >> Performing the assignment inside the tasklist_lock is not something we >> want to do in do_fork(). > > Any particular reason why? There seem to be a number of things done in > there that aren't _strictly_ needed under the tasklist_lock. Where > would you do it? Well, all of the other things we can share or not share are already outside of the tasklist_lock. We may not be quite minimal but we actually are fairly close to minimal inside the tasklist_lock. >> So it looks like a good start. There are a lot of details yet to be filled >> in: proc, sysctl, cleanup on namespace release. (We can still provide >> the create/destroy methods even if we don't hook them up). > > Yeah, I saved shm for last because it has the largest number of outside > interactions. My current thoughts are that we'll need _contexts or > _namespaces associated with /proc mounts as well. Yes. I think the easy way to handle this is to have a symlink from /proc/sysvipc to /proc/self/sysvipc. And then we have a per-process reporting area. That preserves all of the old programs but enables us to get the information out. >> I think in this case I would put the actual namespace structure >> definition in util.h, and just put a struct ipc_ns in sched.h. > > Ahhh, as in > > struct ipc_ns; > > And just keep a pointer from the task? Yeah, that does keep it quite > isolated. Yep. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
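For reference, the compatibility symlink Eric mentions needs almost no new machinery. A sketch using the stock procfs helper (the per-process /proc/<pid>/sysvipc directory it points at is still hypothetical):

	/* Keep the legacy path alive while letting it resolve per process.
	 * With a NULL parent, proc_symlink() creates the link at the top of
	 * /proc, so /proc/sysvipc becomes a symlink to /proc/self/sysvipc. */
	static int __init ipc_ns_proc_init(void)
	{
		proc_symlink("sysvipc", NULL, "self/sysvipc");
		return 0;
	}

Old tools keep reading /proc/sysvipc, but each reader then sees the namespace of its own process.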
* sysctls inside containers 2006-02-27 21:14 ` Eric W. Biederman 2006-02-27 21:35 ` Dave Hansen @ 2006-03-04 3:17 ` Dave Hansen 2006-03-04 10:27 ` Eric W. Biederman ` (2 more replies) 1 sibling, 3 replies; 27+ messages in thread From: Dave Hansen @ 2006-03-04 3:17 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Linux Kernel Mailing List, serue, frankeh, clg Trimming the cc list down, because the scope has narrowed significantly... On Mon, 2006-02-27 at 14:14 -0700, Eric W. Biederman wrote: > So it looks like a good start. There are a lot of details yet to be filled > in: proc, sysctl, cleanup on namespace release. (We can still provide > the create/destroy methods even if we don't hook them up). Well, I at least got to the point of seeing how the sysctls interact when I tried to containerize them. Eric, I think the idea of the sysv code being nicely and completely isolated is pretty much gone, due to its connection to sysctls. I think I'll go back and just isolate the "struct ipc_ids" portion. We can do the accounting bits later. The patches I have will isolate the IDs, but I'm not sure how much sense that makes without doing things like the shm_tot variable. Does anybody think we need to go after sysctls first, perhaps? Or, is this a problem graph with cycles in it? :) I don't see an immediately clear solution on how to containerize sysctls properly. The entire construct seems to be built around getting data from in and out of global variables and into /proc files. We obviously want to be rid of many of these global variables. So, does it make sense to introduce different classes of sysctls, at least internally? There are probably just two types: global (writable only from the root container) and container-private. Does it make sense to have _both_? Perhaps a sysadmin Eric, can you think of how you would represent these in the hierarchical container model? How would they work? On another note, after messing with putting data in the init_task for these things, I'm a little more convinced that we aren't going to want to clutter up the task_struct with all kinds of containerized resources, _plus_ make all of the interfaces to share or unshare each of those. That global 'struct container' is looking a bit more attractive. -- Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
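Purely as a sketch of the two classes Dave describes (nothing below exists; the enum and the wrapper are invented for illustration):

	/* Each sysctl would carry a class deciding who may see and write it. */
	enum sysctl_class {
		SYSCTL_GLOBAL,		/* one copy; writable only from the root container */
		SYSCTL_PER_CONTAINER,	/* every container sees and writes its own copy */
	};

	struct classed_ctl_table {
		struct ctl_table *table;	/* the existing, unmodified entry */
		enum sysctl_class class;	/* its visibility/ownership class */
	};

Whether a given variable should ever be both is exactly the open question above.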
* Re: sysctls inside containers 2006-03-04 3:17 ` sysctls inside containers Dave Hansen @ 2006-03-04 10:27 ` Eric W. Biederman 2006-03-06 16:27 ` Dave Hansen 2006-03-10 10:17 ` Kirill Korotaev 2006-03-10 10:19 ` Kirill Korotaev 2 siblings, 1 reply; 27+ messages in thread From: Eric W. Biederman @ 2006-03-04 10:27 UTC (permalink / raw) To: Dave Hansen Cc: Linux Kernel Mailing List, serue, frankeh, clg, Herbert Poetzl, Sam Vilain Dave Hansen <haveblue@us.ibm.com> writes: > Trimming the cc list down, because the scope has narrowed > significantly... Undoing some of the trimming to include the vserver guys at least. Herbert at least has a better handle on how to do resource limits than I do. > On Mon, 2006-02-27 at 14:14 -0700, Eric W. Biederman wrote: >> So it looks like a good start. There are a lot of details yet to be filled >> in: proc, sysctl, cleanup on namespace release. (We can still provide >> the create/destroy methods even if we don't hook them up). > > Well, I at least got to the point of seeing how the sysctls interact > when I tried to containerize them. Eric, I think the idea of the sysv > code being nicely and completely isolated is pretty much gone, due to > its connection to sysctls. I think I'll go back and just isolate the > "struct ipc_ids" portion. We can do the accounting bits later. > The patches I have will isolate the IDs, but I'm not sure how much sense > that makes without doing things like the shm_tot variable. Does > anybody think we need to go after sysctls first, perhaps? Or, is this a > problem graph with cycles in it? :) There is a reason I'm cleaning up /proc at the moment :) But the only real gotcha I see is how we do per-namespace resource limits, as opposed to global resource limits. For the moment just having truly global limits seems reasonable. > I don't see an immediately clear solution on how to containerize sysctls > properly. The entire construct seems to be built around getting data > from in and out of global variables and into /proc files. I successfully handled pid_max. So while it is awkward, you can get per-task values in and out of sysctl. > We obviously want to be rid of many of these global variables. So, does > it make sense to introduce different classes of sysctls, at least > internally? There are probably just two types: global (writable only > from the root container) and container-private. Does it make sense to > have _both_? Perhaps a sysadmin So having both is doable, although most of that goes to resource limits, and resource limits are something we clearly need to discuss; they really are a separate problem. > Eric, can you think of how you would represent these in the hierarchical > container model? How would they work? There is nothing in the implementation of sysvipc that is inherently hierarchical. So it would simply be disjoint flat namespaces that are only hierarchical in the sense that they are connected to processes which form a hierarchical process tree. As for the resource limit problem, that is the domain of CKRM and the bean counter patches. The isolation work we are doing may put a new spin on the problem and help it get solved, but it isn't something we need to solve. Herbert Poetzl has suggested in the past that we may need some kind of namespace off to the side that is the essence of a container and has all of the container resource limits.
> On another note, after messing with putting data in the init_task for > these things, I'm a little more convinced that we aren't going to want > to clutter up the task_struct with all kinds of containerized resources, > _plus_ make all of the interfaces to share or unshare each of those. > That global 'struct container' is looking a bit more attractive. There aren't that many of them, and what different groups want to share/unshare is different. The biggest difference is that for migration/restart it is not necessary to solve the problem of having the same UID map to multiple user_structs depending on the context, and all of the weird things that will do to permission checking in the kernel. There are also a few items migration cares about that people doing virtual private servers don't, like the ability to isolate monotonic timers, so even after migration the monotonic properties are maintained. If you always remain on the same kernel, that is a problem that simply does not come up. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
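A sketch of the side namespace idea, for concreteness; every name below is invented, as no such patch exists:

	/* The essence-of-a-container namespace Herbert has suggested: nothing
	 * but the container-wide limits, refcounted like the other namespaces
	 * so that tasks and namespaces can share one instance. */
	struct limits_ns {
		struct kref count;
		int shm_tot_max;	/* ceiling on total shm pages in the container */
		int msg_queues_max;	/* ceiling on message queues in the container */
	};

Keeping the limits off to the side like this would let the isolation work and the resource-limit work (CKRM, bean counters) evolve separately.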
* Re: sysctls inside containers 2006-03-04 10:27 ` Eric W. Biederman @ 2006-03-06 16:27 ` Dave Hansen 2006-03-06 17:08 ` Herbert Poetzl 2006-03-06 18:56 ` Eric W. Biederman 0 siblings, 2 replies; 27+ messages in thread From: Dave Hansen @ 2006-03-06 16:27 UTC (permalink / raw) To: Eric W. Biederman Cc: Linux Kernel Mailing List, serue, frankeh, clg, Herbert Poetzl, Sam Vilain On Sat, 2006-03-04 at 03:27 -0700, Eric W. Biederman wrote: > > I don't see an immediately clear solution on how to containerize sysctls > > properly. The entire construct seems to be built around getting data > > from in and out of global variables and into /proc files. > > I successfully handled pid_max. So while it is awkward, you can get > per-task values in and out of sysctl. This: http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6-ns.git;a=commitdiff;h=1150082e0bae41a3621043b4c5ce15e9112884fa sir, is a hack :) We can't possibly do that for each and every sysctl variable. It would mean about fourteen billion duplicated _conv() functions. I'm wondering if, instead of having the .data field in that table, we can have a function pointer which, when called, gives a pointer to the data. I'll give it a shot. -- Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: sysctls inside containers 2006-03-06 16:27 ` Dave Hansen @ 2006-03-06 17:08 ` Herbert Poetzl 2006-03-06 17:18 ` Dave Hansen 1 sibling, 1 reply; 27+ messages in thread From: Herbert Poetzl @ 2006-03-06 17:08 UTC (permalink / raw) To: Dave Hansen Cc: Eric W. Biederman, Linux Kernel Mailing List, serue, frankeh, clg, Sam Vilain On Mon, Mar 06, 2006 at 08:27:16AM -0800, Dave Hansen wrote: > On Sat, 2006-03-04 at 03:27 -0700, Eric W. Biederman wrote: > > > I don't see an immediately clear solution on how to containerize sysctls > > > properly. The entire construct seems to be built around getting data > > > from in and out of global variables and into /proc files. > > > > I successfully handled pid_max. So while it is awkward, you can get > > per-task values in and out of sysctl. > > This: > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6-ns.git;a=commitdiff;h=1150082e0bae41a3621043b4c5ce15e9112884fa > > sir, is a hack :) > > We can't possibly do that for each and every sysctl variable. It would > mean about fourteen billion duplicated _conv() functions. > > I'm wondering if, instead of having the .data field in that table, we > can have a function pointer which, when called, gives a pointer to the > data. I'll give it a shot. something similar to this? http://www.13thfloor.at/vserver/d_rel26/v2.1.0/split-2.6.14.4-vs2.1.0/15_2.6.14.4_virt.diff.hl (look for virt_handler() for sysctl ops) best, Herbert > -- Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: sysctls inside containers 2006-03-06 17:08 ` Herbert Poetzl @ 2006-03-06 17:18 ` Dave Hansen 0 siblings, 0 replies; 27+ messages in thread From: Dave Hansen @ 2006-03-06 17:18 UTC (permalink / raw) To: Herbert Poetzl Cc: Eric W. Biederman, Linux Kernel Mailing List, serue, frankeh, clg, Sam Vilain [-- Attachment #1: Type: text/plain, Size: 1655 bytes --] On Mon, 2006-03-06 at 18:08 +0100, Herbert Poetzl wrote: > > I'm wondering if, instead of having the .data field in that table, we > > can have a function pointer which, when called, gives a pointer to the > > data. I'll give it a shot. > > something similar to this? > > http://www.13thfloor.at/vserver/d_rel26/v2.1.0/split-2.6.14.4-vs2.1.0/15_2.6.14.4_virt.diff.hl > > (look for virt_handler() for sysctl ops) Yeah, that's the right general idea. However, I think for mainline we probably want to make it a little less of an add-on feature. I was thinking that we could replace all of the current, flat sysctl .data pointers with some simple helper functions. Something like this: #define SYSCTL_HELPER_FOR(name) sysctl_return_var_named##name #define DECLARE_HELPER_FOR(type, name) \ void *SYSCTL_HELPER_FOR(name)(void) \ { \ extern type name; \ return &name; \ } And, the declaration for max_threads: DECLARE_HELPER_FOR(int, max_threads); and the ctl_table entry: { .ctl_name = KERN_MAX_THREADS, .procname = "threads-max", .data_access = &SYSCTL_HELPER_FOR(max_threads), .maxlen = sizeof(int), .mode = 0644, .proc_handler = &proc_dointvec, }, Note that the .data = &max_threads, line is gone. I hate multi-line #defines as much as the next guy, but it might be worth it. I've attached what I have so far, but it is a major work in progress. -- Dave [-- Attachment #2: sysv-container.patch --] [-- Type: text/x-patch, Size: 14577 bytes --] --- work-dave/include/linux/ipc.h | 12 +++ work-dave/include/linux/sched.h | 1 work-dave/ipc/msg.c | 152 ++++++++++++++++++++++------------ work-dave/ipc/util.c | 7 + work-dave/ipc/util.h | 2 work-dave/kernel/fork.c | 5 + 6 files changed, 108 insertions(+), 71 deletions(-) diff -puN ipc/compat.c~sysv-container ipc/compat.c diff -puN ipc/compat_mq.c~sysv-container ipc/compat_mq.c diff -puN ipc/mqueue.c~sysv-container ipc/mqueue.c diff -puN ipc/msg.c~sysv-container ipc/msg.c --- work/ipc/msg.c~sysv-container 2006-02-27 09:30:23.000000000 -0800 +++ work-dave/ipc/msg.c 2006-02-27 09:32:18.000000000 -0800 @@ -60,35 +60,44 @@ struct msg_sender { #define SEARCH_NOTEQUAL 3 #define SEARCH_LESSEQUAL 4 -static atomic_t msg_bytes = ATOMIC_INIT(0); -static atomic_t msg_hdrs = ATOMIC_INIT(0); +#define msg_lock(ctx, id) ((struct msg_queue*)ipc_lock(&ctx->msg_ids,id)) +#define msg_unlock(ctx, msq) ipc_unlock(&(msq)->q_perm) +#define msg_rmid(ctx, id) ((struct msg_queue*)ipc_rmid(&ctx->msg_ids,id)) +#define msg_checkid(ctx, msq, msgid) \ + ipc_checkid(&ctx->msg_ids,&msq->q_perm,msgid) +#define msg_buildid(ctx, id, seq) \ + ipc_buildid(&ctx->msg_ids, id, seq) -static struct ipc_ids msg_ids; - -#define msg_lock(id) ((struct msg_queue*)ipc_lock(&msg_ids,id)) -#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm) -#define msg_rmid(id) ((struct msg_queue*)ipc_rmid(&msg_ids,id)) -#define msg_checkid(msq, msgid) \ - ipc_checkid(&msg_ids,&msq->q_perm,msgid) -#define msg_buildid(id, seq) \ - ipc_buildid(&msg_ids, id, seq) -static void freeque (struct msg_queue *msq, int id); -static int newque (key_t key, int msgflg); +static void freeque (struct ipc_msg_context *, struct msg_queue *msq, int id); +static int newque (struct ipc_msg_context
*context, key_t key, int id); #ifdef CONFIG_PROC_FS static int sysvipc_msg_proc_show(struct seq_file *s, void *it); #endif -void __init msg_init (void) +struct ipc_msg_context *alloc_ipc_msg_context(gfp_t flags) +{ + struct ipc_msg_context *msg_context; + + msg_context = kzalloc(sizeof(*msg_context), flags); + if (!msg_context) + return NULL; + + atomic_set(&msg_context->msg_bytes, 0); + atomic_set(&msg_context->msg_hdrs, 0); + + return msg_context; +} + +void __init msg_init (struct ipc_msg_context *context) { - ipc_init_ids(&msg_ids,msg_ctlmni); + ipc_init_ids(&context->msg_ids,msg_ctlmni); ipc_init_proc_interface("sysvipc/msg", " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n", - &msg_ids, + &context->msg_ids, sysvipc_msg_proc_show); } -static int newque (key_t key, int msgflg) +static int newque (struct ipc_msg_context *context, key_t key, int msgflg) { int id; int retval; @@ -108,14 +117,14 @@ static int newque (key_t key, int msgflg return retval; } - id = ipc_addid(&msg_ids, &msq->q_perm, msg_ctlmni); + id = ipc_addid(&context->msg_ids, &msq->q_perm, msg_ctlmni); if(id == -1) { security_msg_queue_free(msq); ipc_rcu_putref(msq); return -ENOSPC; } - msq->q_id = msg_buildid(id,msq->q_perm.seq); + msq->q_id = msg_buildid(context,id,msq->q_perm.seq); msq->q_stime = msq->q_rtime = 0; msq->q_ctime = get_seconds(); msq->q_cbytes = msq->q_qnum = 0; @@ -124,7 +133,7 @@ static int newque (key_t key, int msgflg INIT_LIST_HEAD(&msq->q_messages); INIT_LIST_HEAD(&msq->q_receivers); INIT_LIST_HEAD(&msq->q_senders); - msg_unlock(msq); + msg_unlock(context, msq); return msq->q_id; } @@ -182,23 +191,24 @@ static void expunge_all(struct msg_queue * msg_ids.sem and the spinlock for this message queue is hold * before freeque() is called. msg_ids.sem remains locked on exit. 
*/ -static void freeque (struct msg_queue *msq, int id) +static void freeque (struct ipc_msg_context *context, + struct msg_queue *msq, int id) { struct list_head *tmp; expunge_all(msq,-EIDRM); ss_wakeup(&msq->q_senders,1); - msq = msg_rmid(id); - msg_unlock(msq); + msq = msg_rmid(context, id); + msg_unlock(context, msq); tmp = msq->q_messages.next; while(tmp != &msq->q_messages) { struct msg_msg* msg = list_entry(tmp,struct msg_msg,m_list); tmp = tmp->next; - atomic_dec(&msg_hdrs); + atomic_dec(&context->msg_hdrs); free_msg(msg); } - atomic_sub(msq->q_cbytes, &msg_bytes); + atomic_sub(msq->q_cbytes, &context->msg_bytes); security_msg_queue_free(msq); ipc_rcu_putref(msq); } @@ -207,32 +217,34 @@ asmlinkage long sys_msgget (key_t key, i { int id, ret = -EPERM; struct msg_queue *msq; - - down(&msg_ids.sem); + struct ipc_msg_context *context = current->ipc_msg_context; + + down(&context->msg_ids.sem); if (key == IPC_PRIVATE) - ret = newque(key, msgflg); - else if ((id = ipc_findkey(&msg_ids, key)) == -1) { /* key not used */ + ret = newque(context, key, msgflg); + else if ((id = ipc_findkey(&context->msg_ids, key)) == -1) { + /* key not used */ if (!(msgflg & IPC_CREAT)) ret = -ENOENT; else - ret = newque(key, msgflg); + ret = newque(context, key, msgflg); } else if (msgflg & IPC_CREAT && msgflg & IPC_EXCL) { ret = -EEXIST; } else { - msq = msg_lock(id); + msq = msg_lock(context, id); if(msq==NULL) BUG(); if (ipcperms(&msq->q_perm, msgflg)) ret = -EACCES; else { - int qid = msg_buildid(id, msq->q_perm.seq); + int qid = msg_buildid(context, id, msq->q_perm.seq); ret = security_msg_queue_associate(msq, msgflg); if (!ret) ret = qid; } - msg_unlock(msq); + msg_unlock(context, msq); } - up(&msg_ids.sem); + up(&context->msg_ids.sem); return ret; } @@ -333,6 +345,7 @@ asmlinkage long sys_msgctl (int msqid, i struct msg_queue *msq; struct msq_setbuf setbuf; struct kern_ipc_perm *ipcp; + struct ipc_msg_context *context = current->ipc_msg_context; if (msqid < 0 || cmd < 0) return -EINVAL; @@ -362,18 +375,18 @@ asmlinkage long sys_msgctl (int msqid, i msginfo.msgmnb = msg_ctlmnb; msginfo.msgssz = MSGSSZ; msginfo.msgseg = MSGSEG; - down(&msg_ids.sem); + down(&context->msg_ids.sem); if (cmd == MSG_INFO) { - msginfo.msgpool = msg_ids.in_use; - msginfo.msgmap = atomic_read(&msg_hdrs); - msginfo.msgtql = atomic_read(&msg_bytes); + msginfo.msgpool = context->msg_ids.in_use; + msginfo.msgmap = atomic_read(&context->msg_hdrs); + msginfo.msgtql = atomic_read(&context->msg_bytes); } else { msginfo.msgmap = MSGMAP; msginfo.msgpool = MSGPOOL; msginfo.msgtql = MSGTQL; } - max_id = msg_ids.max_id; - up(&msg_ids.sem); + max_id = context->msg_ids.max_id; + up(&context->msg_ids.sem); if (copy_to_user (buf, &msginfo, sizeof(struct msginfo))) return -EFAULT; return (max_id < 0) ? 
0: max_id; @@ -385,20 +398,21 @@ asmlinkage long sys_msgctl (int msqid, i int success_return; if (!buf) return -EFAULT; - if(cmd == MSG_STAT && msqid >= msg_ids.entries->size) + if(cmd == MSG_STAT && msqid >= context->msg_ids.entries->size) return -EINVAL; memset(&tbuf,0,sizeof(tbuf)); - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); if (msq == NULL) return -EINVAL; if(cmd == MSG_STAT) { - success_return = msg_buildid(msqid, msq->q_perm.seq); + success_return = + msg_buildid(context, msqid, msq->q_perm.seq); } else { err = -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock; success_return = 0; } @@ -419,7 +433,7 @@ asmlinkage long sys_msgctl (int msqid, i tbuf.msg_qbytes = msq->q_qbytes; tbuf.msg_lspid = msq->q_lspid; tbuf.msg_lrpid = msq->q_lrpid; - msg_unlock(msq); + msg_unlock(context, msq); if (copy_msqid_to_user(buf, &tbuf, version)) return -EFAULT; return success_return; @@ -438,14 +452,14 @@ asmlinkage long sys_msgctl (int msqid, i return -EINVAL; } - down(&msg_ids.sem); - msq = msg_lock(msqid); + down(&context->msg_ids.sem); + msq = msg_lock(context, msqid); err=-EINVAL; if (msq == NULL) goto out_up; err = -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock_up; ipcp = &msq->q_perm; err = -EPERM; @@ -480,22 +494,22 @@ asmlinkage long sys_msgctl (int msqid, i * due to a larger queue size. */ ss_wakeup(&msq->q_senders,0); - msg_unlock(msq); + msg_unlock(context, msq); break; } case IPC_RMID: - freeque (msq, msqid); + freeque (context, msq, msqid); break; } err = 0; out_up: - up(&msg_ids.sem); + up(&context->msg_ids.sem); return err; out_unlock_up: - msg_unlock(msq); + msg_unlock(context, msq); goto out_up; out_unlock: - msg_unlock(msq); + msg_unlock(context, msq); return err; } @@ -558,7 +572,8 @@ asmlinkage long sys_msgsnd (int msqid, s struct msg_msg *msg; long mtype; int err; - + struct ipc_msg_context *context = current->ipc_msg_context; + if (msgsz > msg_ctlmax || (long) msgsz < 0 || msqid < 0) return -EINVAL; if (get_user(mtype, &msgp->mtype)) @@ -573,13 +588,13 @@ asmlinkage long sys_msgsnd (int msqid, s msg->m_type = mtype; msg->m_ts = msgsz; - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); err=-EINVAL; if(msq==NULL) goto out_free; err= -EIDRM; - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock_free; for (;;) { @@ -605,7 +620,7 @@ asmlinkage long sys_msgsnd (int msqid, s } ss_add(msq, &s); ipc_rcu_getref(msq); - msg_unlock(msq); + msg_unlock(context, msq); schedule(); ipc_lock_by_ptr(&msq->q_perm); @@ -630,15 +645,15 @@ asmlinkage long sys_msgsnd (int msqid, s list_add_tail(&msg->m_list,&msq->q_messages); msq->q_cbytes += msgsz; msq->q_qnum++; - atomic_add(msgsz,&msg_bytes); - atomic_inc(&msg_hdrs); + atomic_add(msgsz,&context->msg_bytes); + atomic_inc(&context->msg_hdrs); } err = 0; msg = NULL; out_unlock_free: - msg_unlock(msq); + msg_unlock(context, msq); out_free: if(msg!=NULL) free_msg(msg); @@ -670,17 +685,18 @@ asmlinkage long sys_msgrcv (int msqid, s struct msg_queue *msq; struct msg_msg *msg; int mode; + struct ipc_msg_context *context = current->ipc_msg_context; if (msqid < 0 || (long) msgsz < 0) return -EINVAL; mode = convert_mode(&msgtyp,msgflg); - msq = msg_lock(msqid); + msq = msg_lock(context, msqid); if(msq==NULL) return -EINVAL; msg = ERR_PTR(-EIDRM); - if (msg_checkid(msq,msqid)) + if (msg_checkid(context,msq,msqid)) goto out_unlock; for (;;) { @@ -720,10 +736,10 @@ asmlinkage long sys_msgrcv (int msqid, s msq->q_rtime = 
get_seconds(); msq->q_lrpid = current->tgid; msq->q_cbytes -= msg->m_ts; - atomic_sub(msg->m_ts,&msg_bytes); - atomic_dec(&msg_hdrs); + atomic_sub(msg->m_ts,&context->msg_bytes); + atomic_dec(&context->msg_hdrs); ss_wakeup(&msq->q_senders,0); - msg_unlock(msq); + msg_unlock(context, msq); break; } /* No message waiting. Wait for a message */ @@ -741,7 +757,7 @@ asmlinkage long sys_msgrcv (int msqid, s msr_d.r_maxsize = msgsz; msr_d.r_msg = ERR_PTR(-EAGAIN); current->state = TASK_INTERRUPTIBLE; - msg_unlock(msq); + msg_unlock(context, msq); schedule(); @@ -794,7 +810,7 @@ asmlinkage long sys_msgrcv (int msqid, s if (signal_pending(current)) { msg = ERR_PTR(-ERESTARTNOHAND); out_unlock: - msg_unlock(msq); + msg_unlock(context, msq); break; } } diff -puN ipc/msgutil.c~sysv-container ipc/msgutil.c diff -puN ipc/sem.c~sysv-container ipc/sem.c diff -puN ipc/shm.c~sysv-container ipc/shm.c diff -puN ipc/util.c~sysv-container ipc/util.c --- work/ipc/util.c~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/ipc/util.c 2006-02-27 09:31:43.000000000 -0800 @@ -45,11 +45,14 @@ struct ipc_proc_iface { * The various system5 IPC resources (semaphores, messages and shared * memory are initialised */ - + +struct ipc_msg_context *ipc_msg_context; static int __init ipc_init(void) { + ipc_msg_context = alloc_ipc_msg_context(GFP_KERNEL); + sem_init(); - msg_init(); + msg_init(ipc_msg_context); shm_init(); return 0; } diff -puN ipc/util.h~sysv-container ipc/util.h --- work/ipc/util.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/ipc/util.h 2006-02-27 09:30:24.000000000 -0800 @@ -12,7 +12,7 @@ #define SEQ_MULTIPLIER (IPCMNI) void sem_init (void); -void msg_init (void); +void msg_init (struct ipc_msg_context *context); void shm_init (void); struct seq_file; diff -puN include/linux/sched.h~sysv-container include/linux/sched.h --- work/include/linux/sched.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/include/linux/sched.h 2006-02-27 09:30:24.000000000 -0800 @@ -793,6 +793,7 @@ struct task_struct { int link_count, total_link_count; /* ipc stuff */ struct sysv_sem sysvsem; + struct ipc_msg_context *ipc_msg_context; /* CPU-specific state of this task */ struct thread_struct thread; /* filesystem information */ diff -puN include/linux/ipc.h~sysv-container include/linux/ipc.h --- work/include/linux/ipc.h~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/include/linux/ipc.h 2006-02-27 09:30:24.000000000 -0800 @@ -2,6 +2,9 @@ #define _LINUX_IPC_H #include <linux/types.h> +#include <linux/kref.h> +#include <asm/atomic.h> +#include <asm/semaphore.h> #define IPC_PRIVATE ((__kernel_key_t) 0) @@ -83,6 +86,15 @@ struct ipc_ids { struct ipc_id_ary* entries; }; +struct ipc_msg_context { + atomic_t msg_bytes; + atomic_t msg_hdrs; + + struct ipc_ids msg_ids; + struct kref count; +}; + +extern struct ipc_msg_context *alloc_ipc_msg_context(gfp_t flags); #endif /* __KERNEL__ */ #endif /* _LINUX_IPC_H */ diff -puN fs/proc/kmsg.c~sysv-container fs/proc/kmsg.c diff -puN include/linux/tipc_config.h~sysv-container include/linux/tipc_config.h diff -puN fs/ufs/util.c~sysv-container fs/ufs/util.c diff -puN fs/ufs/util.h~sysv-container fs/ufs/util.h diff -puN include/linux/cn_proc.h~sysv-container include/linux/cn_proc.h diff -puN kernel/fork.c~sysv-container kernel/fork.c --- work/kernel/fork.c~sysv-container 2006-02-27 09:30:24.000000000 -0800 +++ work-dave/kernel/fork.c 2006-02-27 09:30:24.000000000 -0800 @@ -1184,6 +1184,11 @@ static task_t *copy_process(unsigned lon } 
attach_pid(p, PIDTYPE_TGID, p->tgid); attach_pid(p, PIDTYPE_PID, p->pid); + { // this extern will go away when we start to dynamically + // allocate these + extern struct ipc_msg_context ipc_msg_context; + p->ipc_msg_context = current->ipc_msg_context; + } nr_threads++; total_forks++; _ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: sysctls inside containers 2006-03-06 16:27 ` Dave Hansen 2006-03-06 17:08 ` Herbert Poetzl @ 2006-03-06 18:56 ` Eric W. Biederman 1 sibling, 0 replies; 27+ messages in thread From: Eric W. Biederman @ 2006-03-06 18:56 UTC (permalink / raw) To: Dave Hansen Cc: Linux Kernel Mailing List, serue, frankeh, clg, Herbert Poetzl, Sam Vilain Dave Hansen <haveblue@us.ibm.com> writes: > On Sat, 2006-03-04 at 03:27 -0700, Eric W. Biederman wrote: >> > I don't see an immediately clear solution on how to containerize sysctls >> > properly. The entire construct seems to be built around getting data >> > from in and out of global variables and into /proc files. >> >> I successfully handled pid_max. So while it is awkward, you can get >> per-task values in and out of sysctl. > This: > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6-ns.git;a=commitdiff;h=1150082e0bae41a3621043b4c5ce15e9112884fa > sir, is a hack :) I did mention awkward! > We can't possibly do that for each and every sysctl variable. It would > mean about fourteen billion duplicated _conv() functions. > I'm wondering if, instead of having the .data field in that table, we > can have a function pointer which, when called, gives a pointer to the > data. I'll give it a shot. Sounds like a step in the right direction. In the context of _conv() functions it would also be nice if the limit checking between the sys_sysctl path and the proc path were the same. We definitely need some infrastructure cleanups if we are going to do much with sysctl. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
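The mismatch Eric points at is visible in the stock table layout; both paths can at least be pointed at one set of bounds, as in this sketch (max_threads is the real variable from kernel/fork.c, and the handlers shown are the stock 2.6 ones):

	/* One pair of bounds for both access paths: proc_dointvec_minmax
	 * checks them on the /proc/sys side, sysctl_intvec on the binary
	 * sys_sysctl(2) side. */
	extern int max_threads;
	static int threads_min = 1;
	static int threads_max_bound = INT_MAX;

	static struct ctl_table thread_ctl[] = {
		{
			.ctl_name	= KERN_MAX_THREADS,
			.procname	= "threads-max",
			.data		= &max_threads,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= &proc_dointvec_minmax,
			.strategy	= &sysctl_intvec,
			.extra1		= &threads_min,
			.extra2		= &threads_max_bound,
		},
		{ .ctl_name = 0 }
	};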
* Re: sysctls inside containers 2006-03-04 3:17 ` sysctls inside containers Dave Hansen 2006-03-04 10:27 ` Eric W. Biederman @ 2006-03-10 10:17 ` Kirill Korotaev 2006-03-10 13:22 ` Eric W. Biederman 2006-03-10 10:19 ` Kirill Korotaev 2 siblings, 1 reply; 27+ messages in thread From: Kirill Korotaev @ 2006-03-10 10:17 UTC (permalink / raw) To: Dave Hansen Cc: Eric W. Biederman, Linux Kernel Mailing List, serue, frankeh, clg > Well, I at least got to the point of seeing how the sysctls interact > when I tried to containerize them. Eric, I think the idea of the sysv > code being nicely and completely isolated is pretty much gone, due to > its connection to sysctls. I think I'll go back and just isolate the > "struct ipc_ids" portion. We can do the accounting bits later. > > The patches I have will isolate the IDs, but I'm not sure how much sense > that makes without doing things like the shm_tot variable. Does > anybody think we need to go after sysctls first, perhaps? Or, is this a > problem graph with cycles in it? :) > > I don't see an immediately clear solution on how to containerize sysctls > properly. The entire construct seems to be built around getting data > from in and out of global variables and into /proc files. > > We obviously want to be rid of many of these global variables. So, does > it make sense to introduce different classes of sysctls, at least > internally? There are probably just two types: global (writable only > from the root container) and container-private. Does it make sense to > have _both_? Perhaps a sysadmin > > Eric, can you think of how you would represent these in the hierarchical > container model? How would they work? > > On another note, after messing with putting data in the init_task for > these things, I'm a little more convinced that we aren't going to want > to clutter up the task_struct with all kinds of containerized resources, > _plus_ make all of the interfaces to share or unshare each of those. > That global 'struct container' is looking a bit more attractive. After checking the proposed solutions (yours, Eric's, and the vserver one), I must say that these are all hacks. If we want to virtualize sysctl we need to do it in an honest way: multiple sysctl trees, which can be different in different namespaces. For example, one namespace can see /proc/sys/net/route and the other one cannot. Introducing helpers/handlers etc. doesn't fully solve the problem of visibility of different parts of the sysctl tree and its access rights. Another example: the same network device can be present in 2 namespaces, and these are dynamically(!) created entries in sysctl. So we actually need to address 2 issues: - the ability to limit visibility of parts of the sysctl tree to a namespace - the ability to limit/change sysctl access rights in a namespace You can check OpenVZ for the sysctl tree cloning code. It is not clean, nor elegant, but it can be cleaned up. Thanks, Kirill ^ permalink raw reply [flat|nested] 27+ messages in thread
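Kirill's cloning suggestion, sketched for concreteness (clone_sysctl_tree() is invented here, and error unwinding is omitted); a namespace that should not see /proc/sys/net/route simply drops that entry from its private copy:

	/* Recursively duplicate a ctl_table so each namespace can hide or
	 * override entries in its own copy; tables end in a zeroed sentinel. */
	static struct ctl_table *clone_sysctl_tree(struct ctl_table *src)
	{
		struct ctl_table *new;
		int i, n = 0;

		while (src[n].ctl_name)
			n++;
		new = kmalloc((n + 1) * sizeof(*new), GFP_KERNEL);
		if (!new)
			return NULL;
		memcpy(new, src, (n + 1) * sizeof(*new));
		for (i = 0; i < n; i++)
			if (new[i].child)	/* recurse into subdirectories */
				new[i].child = clone_sysctl_tree(new[i].child);
		return new;
	}

Dynamically created entries (per-device ones, for instance) would have to be registered into every tree that is allowed to see the device.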
* Re: sysctls inside containers 2006-03-10 10:17 ` Kirill Korotaev @ 2006-03-10 13:22 ` Eric W. Biederman 0 siblings, 0 replies; 27+ messages in thread From: Eric W. Biederman @ 2006-03-10 13:22 UTC (permalink / raw) To: Kirill Korotaev Cc: Dave Hansen, Linux Kernel Mailing List, serue, frankeh, clg Kirill Korotaev <dev@openvz.org> writes: > After checking the proposed solutions (yours, Eric's, and the vserver one), I must > say that these are all hacks. > If we want to virtualize sysctl we need to do it in an honest way: > multiple sysctl trees, which can be different in different namespaces. > For example, one namespace can see /proc/sys/net/route and the other one > cannot. At least a different copy of /proc/sys/net/route :) > Introducing helpers/handlers etc. doesn't fully solve the problem of > visibility of different parts of the sysctl tree and its access rights. I need to look a little deeper, but I think we should be ok if we add two helper functions: one that returns the address of a value based upon our state, and another that returns a subdirectory based upon our state. Both of them would take a struct task_struct argument so we can decide what to show based upon the calling process. > Another example: the same network device can be present in 2 namespaces, and these > are dynamically(!) created entries in sysctl. > > So we actually need to address 2 issues: > - the ability to limit visibility of parts of the sysctl tree to a namespace > - the ability to limit/change sysctl access rights in a namespace > > You can check OpenVZ for the sysctl tree cloning code. It is not clean, nor elegant, > but it can be cleaned up. Sounds like a decent idea. What I have found so far with access rights is that if you dig deep enough you don't need magic to make it safe. But this may only be because I have not hit something that is fundamentally different. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
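A sketch of the two helpers Eric describes (both names and the example routing are hypothetical; ipc_msg_context is borrowed from Dave's patch, though it does not hold msg_ctlmnb today):

	/* Resolve the data pointer for the calling task instead of always
	 * dereferencing the global .data field. */
	static void *sysctl_data_for(struct ctl_table *table,
				     struct task_struct *task)
	{
		if (table->ctl_name == KERN_MSGMNB)	/* example per-ns entry */
			return &task->ipc_msg_context->msg_ctlmnb;
		return table->data;	/* everything else stays global */
	}

	/* Pick which subdirectory, if any, this task may descend into. */
	static struct ctl_table *sysctl_child_for(struct ctl_table *table,
						  struct task_struct *task)
	{
		return table->child;	/* a namespace could return its clone here */
	}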
* Re: sysctls inside containers 2006-03-04 3:17 ` sysctls inside containers Dave Hansen 2006-03-04 10:27 ` Eric W. Biederman 2006-03-10 10:17 ` Kirill Korotaev @ 2006-03-10 10:19 ` Kirill Korotaev 2006-03-10 11:55 ` Eric W. Biederman 2006-03-10 18:58 ` Dave Hansen 2 siblings, 2 replies; 27+ messages in thread From: Kirill Korotaev @ 2006-03-10 10:19 UTC (permalink / raw) To: Dave Hansen Cc: Eric W. Biederman, Linux Kernel Mailing List, serue, frankeh, clg > On another note, after messing with putting data in the init_task for > these things, I'm a little more convinced that we aren't going to want > to clutter up the task_struct with all kinds of containerized resources, > _plus_ make all of the interfaces to share or unshare each of those. > That global 'struct container' is looking a bit more attractive. BTW, Dave, have you noticed that ipc/mqueue.c uses netlink to send messages? This essentially means that they are tied as well... Thanks, Kirill ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: sysctls inside containers 2006-03-10 10:19 ` Kirill Korotaev @ 2006-03-10 11:55 ` Eric W. Biederman 0 siblings, 0 replies; 27+ messages in thread From: Eric W. Biederman @ 2006-03-10 11:55 UTC (permalink / raw) To: Kirill Korotaev Cc: Dave Hansen, Linux Kernel Mailing List, serue, frankeh, clg Kirill Korotaev <dev@sw.ru> writes: >> On another note, after messing with putting data in the init_task for >> these things, I'm a little more convinced that we aren't going to want >> to clutter up the task_struct with all kinds of containerized resources, >> _plus_ make all of the interfaces to share or unshare each of those. >> That global 'struct container' is looking a bit more attractive. > BTW, Dave, > > have you noticed that ipc/mqueue.c uses netlink to send messages? > This essentially means that they are tied as well... Yes, netlink is something to be considered in the great untangling. However, for a sysvipc namespace ipc/mqueue.c is something that doesn't need to be handled, because that is the implementation of posix message queues, not sysv ipc. I think I succeeded in untangling the worst of netlink in my proof of concept implementation, but certainly there is more to do. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: sysctls inside containers 2006-03-10 10:19 ` Kirill Korotaev 2006-03-10 11:55 ` Eric W. Biederman @ 2006-03-10 18:58 ` Dave Hansen 1 sibling, 0 replies; 27+ messages in thread From: Dave Hansen @ 2006-03-10 18:58 UTC (permalink / raw) To: Kirill Korotaev Cc: Eric W. Biederman, Linux Kernel Mailing List, serue, frankeh, clg On Fri, 2006-03-10 at 13:19 +0300, Kirill Korotaev wrote: > > On another note, after messing with putting data in the init_task for > > these things, I'm a little more convinced that we aren't going to want > > to clutter up the task_struct with all kinds of containerized resources, > > _plus_ make all of the interfaces to share or unshare each of those. > > That global 'struct container' is looking a bit more attractive. > > have you noticed that ipc/mqueue.c uses netlink to send messages? > This essentially means that they are tied as well... Nope, I missed that. But netlink is probably a completely separate issue, at least for now. I'm sure we're going to have "container leaks" and boundary violations for a long time, but I'll add netlink to the list. -- Dave ^ permalink raw reply [flat|nested] 27+ messages in thread