linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/30] C/R OpenVZ/Virtuozzo style
@ 2009-04-10  2:32 Alexey Dobriyan
  2009-04-10  2:44 ` Alexey Dobriyan
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-10  2:32 UTC (permalink / raw)
  To: akpm, containers
  Cc: xemul, serue, dave, mingo, orenl, hch, torvalds, linux-kernel

This is to show how we see C/R and to provoke discussion on number of
important issues (mounts, ...).

This is small part of long-awaited to be cleanuped code.

It's able to restore busyloop on i386 and x86_64 and restore i386
busyloop on x86_64. It wasn't tested much more than that.

I'm currently starting formal testsuite, otherwise it's whack-a-mole
game and formal TODO list (a huge one).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
@ 2009-04-10  2:44 ` Alexey Dobriyan
  2009-04-10  5:07 ` Dave Hansen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-10  2:44 UTC (permalink / raw)
  To: akpm, containers
  Cc: xemul, serue, dave, mingo, orenl, hch, torvalds, linux-kernel

On Fri, Apr 10, 2009 at 06:32:07AM +0400, Alexey Dobriyan wrote:
> This is to show how we see C/R and to provoke discussion on number of
> important issues (mounts, ...).
> 
> This is small part of long-awaited to be cleanuped code.
> 
> It's able to restore busyloop on i386 and x86_64 and restore i386
> busyloop on x86_64. It wasn't tested much more than that.
> 
> I'm currently starting formal testsuite, otherwise it's whack-a-mole
> game and formal TODO list (a huge one).

Andrew, patches 1-9 stand by themselves and can be applied at any moment
(except 4 and 7 where I made functions needlessly global).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
  2009-04-10  2:44 ` Alexey Dobriyan
@ 2009-04-10  5:07 ` Dave Hansen
  2009-04-13  9:14   ` Alexey Dobriyan
  2009-04-10  8:28 ` [PATCH 00/30] C/R OpenVZ/Virtuozzo style Ingo Molnar
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2009-04-10  5:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, mingo, orenl, hch, torvalds,
	linux-kernel

Hey Alexey,

I'm curious how you see these fitting in with the work that we've been
doing with Oren.  Do you mean to just start a discussion or are you
really proposing these as an alternative to what Oren has been posting?

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
  2009-04-10  2:44 ` Alexey Dobriyan
  2009-04-10  5:07 ` Dave Hansen
@ 2009-04-10  8:28 ` Ingo Molnar
  2009-04-10 11:45   ` Alexey Dobriyan
  2009-04-10 15:06 ` Linus Torvalds
  2009-04-14  5:46 ` Oren Laadan
  4 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2009-04-10  8:28 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, dave, orenl, hch, torvalds, linux-kernel


* Alexey Dobriyan <adobriyan@gmail.com> wrote:

> This is to show how we see C/R and to provoke discussion on number 
> of important issues (mounts, ...).
> 
> This is small part of long-awaited to be cleanuped code.
> 
> It's able to restore busyloop on i386 and x86_64 and restore i386 
> busyloop on x86_64. It wasn't tested much more than that.
> 
> I'm currently starting formal testsuite, otherwise it's 
> whack-a-mole game and formal TODO list (a huge one).

One simple patch-submission request: if you start sending large 
series of patches then please post it like others do: by using 
git-format-patch or some other tool that preserves the threading of 
the discussion and links all the 1/30..30/30 patches back to the 
0/30 mail.

This submission has created 31 separate threads on lkml, intermixed 
with other threads - making it more difficult to review it as a 
coherent unit.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  8:28 ` [PATCH 00/30] C/R OpenVZ/Virtuozzo style Ingo Molnar
@ 2009-04-10 11:45   ` Alexey Dobriyan
  0 siblings, 0 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-10 11:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: akpm, containers, xemul, serue, dave, orenl, hch, torvalds, linux-kernel

On Fri, Apr 10, 2009 at 10:28:15AM +0200, Ingo Molnar wrote:
> 
> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > This is to show how we see C/R and to provoke discussion on number 
> > of important issues (mounts, ...).
> > 
> > This is small part of long-awaited to be cleanuped code.
> > 
> > It's able to restore busyloop on i386 and x86_64 and restore i386 
> > busyloop on x86_64. It wasn't tested much more than that.
> > 
> > I'm currently starting formal testsuite, otherwise it's 
> > whack-a-mole game and formal TODO list (a huge one).
> 
> One simple patch-submission request: if you start sending large 
> series of patches then please post it like others do: by using 
> git-format-patch or some other tool that preserves the threading of 
> the discussion and links all the 1/30..30/30 patches back to the 
> 0/30 mail.
> 
> This submission has created 31 separate threads on lkml, intermixed 
> with other threads - making it more difficult to review it as a 
> coherent unit.

OK, I thought stgit will screw something, so sent it old proven way :-\

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
                   ` (2 preceding siblings ...)
  2009-04-10  8:28 ` [PATCH 00/30] C/R OpenVZ/Virtuozzo style Ingo Molnar
@ 2009-04-10 15:06 ` Linus Torvalds
  2009-04-13  7:39   ` Alexey Dobriyan
  2009-04-14  5:46 ` Oren Laadan
  4 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-04-10 15:06 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, dave, mingo, orenl, hch, linux-kernel



On Fri, 10 Apr 2009, Alexey Dobriyan wrote:
>
> This is to show how we see C/R and to provoke discussion on number of
> important issues (mounts, ...).

My only initial reaction is that I absolutely hate the naming (not to say 
I love the code - just to say that I didn't even look at it, because I got 
hung up on the name).

"cr"? It could be anything. I realize that to _you_ that is meaningful, 
but to somebody less specifically interested in checkpoint-restore 'cr' 
means 'carriage return' or just doesn't really say anything at all. 

That goes both for file naming (kernel/cr/xyzzy.c) and to a lesser degree 
for function naming too. I also don't think it makes sense to have 
something like kernel/cr/cr-x86_32.c or kernel/cr/cr-tty.c - maybe that is 
good right now, but I sure hope that the long-term goal is to have these 
things in the code that will need to change them when the code gets 
updated (ie arch/x86/kernel and drivers/char/)

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10 15:06 ` Linus Torvalds
@ 2009-04-13  7:39   ` Alexey Dobriyan
  2009-04-13 18:39     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-13  7:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, containers, xemul, serue, dave, mingo, orenl, hch, linux-kernel

On Fri, Apr 10, 2009 at 08:06:55AM -0700, Linus Torvalds wrote:
> On Fri, 10 Apr 2009, Alexey Dobriyan wrote:
> >
> > This is to show how we see C/R and to provoke discussion on number of
> > important issues (mounts, ...).
> 
> My only initial reaction is that I absolutely hate the naming (not to say 
> I love the code - just to say that I didn't even look at it, because I got 
> hung up on the name).
> 
> "cr"? It could be anything. I realize that to _you_ that is meaningful, 
> but to somebody less specifically interested in checkpoint-restore 'cr' 
> means 'carriage return' or just doesn't really say anything at all. 

Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
and "rst_". And I think "cr_" is super nice prefix: it's short, it's C-like,
it reminds about restart part. Eventually, C/R will become standard
in-kernel thing everyone should be at least aware of, so it's like
learning what "vma" means.

> That goes both for file naming (kernel/cr/xyzzy.c) and to a lesser degree 
> for function naming too. I also don't think it makes sense to have 
> something like kernel/cr/cr-x86_32.c or kernel/cr/cr-tty.c - maybe that is 
> good right now, but I sure hope that the long-term goal is to have these 
> things in the code that will need to change them when the code gets 
> updated (ie arch/x86/kernel and drivers/char/)

In the long run, yes, C/R should be moved closer to core code it tries to
checkpoint. Right now, however, doing "make kernel/cr/" is much quicker
and C/R can not do much, so it's unclear how exactly splitting should be
done.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  5:07 ` Dave Hansen
@ 2009-04-13  9:14   ` Alexey Dobriyan
  2009-04-13 11:16     ` Dave Hansen
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-13  9:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, containers, xemul, serue, mingo, orenl, hch, torvalds,
	linux-kernel

On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> I'm curious how you see these fitting in with the work that we've been
> doing with Oren.  Do you mean to just start a discussion or are you
> really proposing these as an alternative to what Oren has been posting?

Yes, this is posted as alternative.

Some design decisions are seen as incorrect from here like:
* not rejecting checkpoint with possible "leaks" from container
* not having CAP_SYS_ADMIN on restart(2)
* having small (TASK_COMM_LEN) and bigger (objref[1]) image format
  misdesigns.
* doing fork(2)+restart(2) per restarted task and whole orchestration
  done from userspace/future init task.
* not seeing bigger picture (note, this is not equivalent to supporting
  everything at once, nobody is asking for everything at once) wrt shared
  objects and format and code changes because of that (note again, image
  format will change, but it's easy to design high level structure which
  won't change)
* checking of unsupported features done at wrong place and wrong time
  and runtime overhead because of that on CR=y kernels.

There are also low-level things, but it's cumulative effect.

[1] Do I inderstand correctly that cookie for shared object is an
address on kernel stack? This is obviously unreliable, if yes :-)

	int objref;
		...
	/* adding 'file' to the hash will keep a reference to it */
	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
					^^^^^^^

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13  9:14   ` Alexey Dobriyan
@ 2009-04-13 11:16     ` Dave Hansen
  2009-04-13 18:07     ` Dave Hansen
  2009-04-14  4:26     ` Oren Laadan
  2 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2009-04-13 11:16 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul, containers, linux-kernel, hch, akpm, torvalds, mingo

On Mon, 2009-04-13 at 13:14 +0400, Alexey Dobriyan wrote:
> [1] Do I inderstand correctly that cookie for shared object is an
> address on kernel stack? This is obviously unreliable, if yes :-)
> 
>         int objref;
>                 ...
>         /* adding 'file' to the hash will keep a reference to it */
>         new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);

No, that's just Oren's way to get two return variables.

He needs 'new' to figure out whether to write out the full file or just
an objref record.  He also needs 'objref' itself in case of writing
either.  cr_obj_add_ptr() modifies objref.

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13  9:14   ` Alexey Dobriyan
  2009-04-13 11:16     ` Dave Hansen
@ 2009-04-13 18:07     ` Dave Hansen
  2009-04-14  4:26     ` Oren Laadan
  2 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2009-04-13 18:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, mingo, orenl, hch, torvalds,
	linux-kernel

On Mon, 2009-04-13 at 13:14 +0400, Alexey Dobriyan wrote:
> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> > I'm curious how you see these fitting in with the work that we've been
> > doing with Oren.  Do you mean to just start a discussion or are you
> > really proposing these as an alternative to what Oren has been posting?
> 
> Yes, this is posted as alternative.

I'm sure that you can understand that we've been working on Oren's
patches for a bit and don't want to jump ship on a whim. :)

Do you think we can change Oren's patch sufficiently to meet your needs
here?  Do you think we can change your patch sufficiently to meet Oren's
needs?

> Some design decisions are seen as incorrect from here like:
> * not rejecting checkpoint with possible "leaks" from container

Could you elaborate on this a bit?  This sounds really important, but
I'm having difficulty seeing how your patch addresses this or how Oren's
doesn't.

> * not having CAP_SYS_ADMIN on restart(2)
> * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
>   misdesigns.

The format is certainly still in a huge amount of flux.  I'd be really
happy to see patches fixing these or clarifying them.  I'm planning on
going and looking at your patches in detail right now.  Would you mind
doing a more detailed review of Oren's so that we could use your
expertise to close some of these gaps in the format?

> * doing fork(2)+restart(2) per restarted task and whole orchestration
>   done from userspace/future init task.

Yeah, we've certainly argued plenty about this one amongst ourselves.
I'm personally open to changing this especially if you feel strongly
about it.

> * not seeing bigger picture (note, this is not equivalent to supporting
>   everything at once, nobody is asking for everything at once) wrt shared
>   objects and format and code changes because of that (note again, image
>   format will change, but it's easy to design high level structure which
>   won't change)
> * checking of unsupported features done at wrong place and wrong time
>   and runtime overhead because of that on CR=y kernels.

Yeah, I'm especially worried with respect to the shared objects right
now.  I'm pestering Oren to replace a big chunk of that stuff with other
garbage that I've dreamed up.  I'll go take a look at how you do this in
your patches...  Perhaps it will work even better.

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13  7:39   ` Alexey Dobriyan
@ 2009-04-13 18:39     ` Linus Torvalds
  2009-04-13 19:30       ` Ingo Molnar
  2009-04-14 12:29       ` Alexey Dobriyan
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-04-13 18:39 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, dave, mingo, orenl, hch, linux-kernel



On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> 
> Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
> and "rst_".

So?

We're not merging OpenVZ code _either_.

> And I think "cr_" is super nice prefix: it's short, it's C-like,
> it reminds about restart part

It does no such thing. THAT'S THE POINT. "cr" means _nothing_ to anybody 
else than some CR-specific people, and those people don't even need it!

Look around you. We try to use nicer names. We spell out "cpufreq", we 
don't call it "cf".

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13 18:39     ` Linus Torvalds
@ 2009-04-13 19:30       ` Ingo Molnar
  2009-04-14 12:29       ` Alexey Dobriyan
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2009-04-13 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Dobriyan, akpm, containers, xemul, serue, dave, orenl,
	hch, linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> > 
> > Well, in OpenVZ everything is in kernel/cpt/ and prefixed with 
> > "cpt_" and "rst_".
> 
> So?
> 
> We're not merging OpenVZ code _either_.
> 
> > And I think "cr_" is super nice prefix: it's short, it's C-like, 
> > it reminds about restart part
> 
> It does no such thing. THAT'S THE POINT. "cr" means _nothing_ to 
> anybody else than some CR-specific people, and those people don't 
> even need it!
> 
> Look around you. We try to use nicer names. We spell out 
> "cpufreq", we don't call it "cf".

I've done an analysis about what a good kernel-global naming scheme 
for checkpoint/restore might be (see further below).

Here's the current namespace (from a stupid sed+grep script combo 
done on my mbox - so it combines all the submitted patches):

cr_align
cr_arch
cr_arch_check_image_header
cr_arch_check_image_task_struct
cr_arch_check_mm_struct
cr_arch_check_task_struct
cr_arch_dump_mm_struct
cr_arch_dump_task_struct
cr_arch_len_mm_struct
cr_arch_len_task_struct
cr_arch_restore_mm_struct
cr_arch_restore_task_struct
cr_arg_end
cr_arg_start
cr_attach_get_file
cr_bits
cr_blocked
cr_brk
cr_can_checkpoint_file
cr_cap_bset
cr_cap_effective
cr_cap_inheritable
cr_cap_permitted
cr_c_cc
cr_c_cflag
cr_check_*
cr_check_cred
cr_check_file
cr_check_file:
cr_check_image_header
cr_check_mm
cr_check_mm_struct
cr_check_pid_ns
cr_check_sighand
cr_check_sighand_struct
cr_check_signal
cr_check_signal_struct
cr_check_sock
cr_check_task_struct
cr_check_tty
cr_check_user_struct
cr_check_vma
cr_c_iflag
cr_c_ispeed
cr_c_lflag
cr_c_line
cr_c_oflag
cr_collect
cr_collect_all_cred
cr_collect_all_file
cr_collect_all_files_struct
cr_collect_all_fs_struct
cr_collect_all_group_info
cr_collect_all_ipc_ns
cr_collect_all_mm_struct
cr_collect_all_mnt_ns
cr_collect_all_net_ns
cr_collect_all_nsproxy
cr_collect_all_pid
cr_collect_all_pid_ns
cr_collect_all_sighand_struct
cr_collect_all_signal_struct
cr_collect_all_sock
cr_collect_all_task_struct
cr_collect_all_tty_struct
cr_collect_all_user_ns
cr_collect_all_user_struct
cr_collect_all_uts_ns
cr_collect_cred
cr_collect_file
cr_collect_files_struct
cr_collect_fs_struct
cr_collect_group_info
cr_collect_ipc_ns
cr_collect_mm
cr_collect_mm_struct
cr_collect_mnt_ns
cr_collect_net_ns
cr_collect_nsproxy
cr_collect_object
cr_collect_pid
cr_collect_pid_ns
cr_collect_sighand
cr_collect_sighand_struct
cr_collect_signal
cr_collect_signal_struct
cr_collect_sock
cr_collect_tasks
cr_collect_task_struct
cr_collect_tty
cr_collect_user_ns
cr_collect_user_struct
cr_collect_uts_ns
cr_comm
cr_context
cr_context_create
cr_context_destroy
cr_context_obj_type
cr_context_task_struct
cr_core
cr_c_ospeed
cr_cred
cr_cs
cr_ct
cr_ctrl_status
cr_ctx
cr_ctx_count
cr_ctx_put
cr_data
cr_debug
cr_def_flags
cr_domainname
cr_dr0
cr_dr1
cr_dr2
cr_dr3
cr_dr6
cr_dr7
cr_driver_flags
cr_driver_subtype
cr_driver_type
cr_ds
cr_dump
cr_dump_all_cred
cr_dump_all_file
cr_dump_all_files_struct
cr_dump_all_fs_struct
cr_dump_all_group_info
cr_dump_all_ipc_ns
cr_dump_all_mm_struct
cr_dump_all_mnt_ns
cr_dump_all_net_ns
cr_dump_all_nsproxy
cr_dump_all_pid
cr_dump_all_pid_ns
cr_dump_all_sighand_struct
cr_dump_all_signal_struct
cr_dump_all_sock
cr_dump_all_task_struct
cr_dump_all_tty
cr_dump_all_user_ns
cr_dump_all_user_struct
cr_dump_all_uts_ns
cr_dump_anonvma
cr_dump_cred
cr_dump_fd
cr_dump_file
cr_dump_files_struct
cr_dump_fs_struct
cr_dump_group_info
cr_dump_header
cr_dump_image_header
cr_dump_ipc_ns
cr_dump_mm_struct
cr_dump_mnt_ns
cr_dump_net_ns
cr_dump_nsproxy
cr_dump_pid
cr_dump_pid_ns
cr_dump_ptr
cr_dump_sighand_struct
cr_dump_signal_struct
cr_dump_sigset
cr_dump_sock
cr_dump_task_struct
cr_dump_task_struct_x86_32
cr_dump_task_struct_x86_64
cr_dump_terminator
cr_dump_tty
cr_dump_user_ns
cr_dump_user_struct
cr_dump_uts_ns
cr_dump_vma
cr_dump_vma_pages
cr_dump_xstate
cr_eax
cr_ebp
cr_ebx
cr_ecx
cr_edi
cr_edx
cr_eflags
cr_egid
cr_eip
cr_enabled
cr_end_code
cr_end_data
cr_env_end
cr_env_start
cr_es
cr_esi
cr_esp
cr_euid
cr_explain_file
cr_fd
cr_fd_flags
cr_f_flags
cr_file
cr_file_checkpoint
cr_file_checkpointable
cr_file_get_func
cr_files_struct
cr_file_supported
cr_fill_name
cr_find_obj_by_pos
cr_find_obj_by_ptr
cr_flags
cr_flow_stopped
cr_f_owner
cr_f_owner_euid
cr_f_owner_pid_type
cr_f_owner_signum
cr_f_owner_uid
cr_f_pos
cr_freeze_tasks
cr_fs
cr_fs_checkpointable
cr_fsgid
cr_fsindex
cr_fsuid
cr_gid
cr_gs
cr_gsindex
cr_hbuf_get
cr_hbuf_put
cr_hdr
cr_hdr_cpu
cr_hdr_fd
cr_hdr_fd_data
cr_hdr_fd_ent
cr_hdr_files
cr_hdr_head
cr_hdr_head_arch
cr_hdr_mm
cr_hdr_mm_context
cr_hdr_pgarr
cr_hdr_pids
cr_hdr_tail
cr_hdr_tree
cr_hdr_vma
cr_header
cr_hw_stopped
cr_imag
cr_image_arch_x86_32
cr_image_arch_x86_64
cr_image_cred
cr_image_fd
cr_image_file
cr_image_files_struct
cr_image_fs_struct
cr_image_group_info
cr_image_header
cr_image_header_arch
cr_image_ipc_ns
cr_image_magic
cr_image_mm_struct
cr_image_mnt_ns
cr_image_net_ns
cr_image_nsproxy
cr_image_pid
cr_image_pid_ns
cr_image_sighand_struct
cr_image_signal_struct
cr_image_sigset
cr_image_sock
cr_image_task_st
cr_image_task_struct
cr_image_tty
cr_image_user_ns
cr_image_user_struct
cr_image_uts_ns
cr_image_version
cr_image_vma
cr_image_vma_content
cr_image_vma_vdso
cr_i_mode
cr_index
cr_ipc_ns
cr_kill_tasks
cr_kread
cr_last_pid
cr_len
cr_len_arch
cr_len_xstate
cr_level
cr_link_index
cr_low_latency
cr_machine
cr_mm_struct
cr_mnt_ns
cr_msg_ctlmax
cr_msg_ctlmnb
cr_msg_ctlmni
cr_name
cr_name_len
cr_net_ns
cr_ngroups
cr_nodename
cr_nr
cr_nr_pages
cr_nsproxy
cr_obj
cr_obj_add_ptr
cr_obj_add_ref
cr_object
cr_object_create
cr_object_destroy
cr_object_header
cr_orig_eax
cr_orig_rax
cr_packet
cr_page_size
cr_pgarr
cr_pid_ns
cr_pid_type
cr_pos_cred
cr_pos_f_cred
cr_pos_file
cr_pos_files
cr_pos_f_owner_pid
cr_pos_fs
cr_pos_group_info
cr_pos_ipc_ns
cr_pos_mm
cr_pos_mm_struct
cr_pos_mnt_ns
cr_pos_net_ns
cr_pos_nsproxy
cr_pos_parent
cr_pos_pgrp
cr_pos_pid
cr_pos_pid_ns
cr_pos_pids
cr_pos_real_cred
cr_pos_real_parent
cr_pos_session
cr_pos_sighand
cr_pos_signal
cr_pos_sk_net
cr_pos_t
cr_pos_user
cr_pos_user_ns
cr_pos_uts_ns
cr_pos_vm_file
cr_pread
cr_prepare_image
cr_pwd
cr_pwd_len
cr_quan@163
cr_r10
cr_r11
cr_r12
cr_r13
cr_r14
cr_r15
cr_r8
cr_r9
cr_rax
cr_rbp
cr_rbx
cr_rcx
cr_rdi
cr_rdx
cr_read_buffer
cr_read_buf_type
cr_read_cpu
cr_read_cpu_fpu
cr_read_fd_data
cr_read_files
cr_read_head_arch
cr_read_mm
cr_read_mm_context
cr_read_obj_type
cr_read_open_fname
cr_read_string
cr_read_task_struct
cr_read_thread
cr_real_blocked
cr_release
cr_restart
cr_restore_all_fd
cr_restore_all_vma
cr_restore_cred
cr_restore_fd
cr_restore_file
cr_restore_files_struct
cr_restore_fs_struct
cr_restore_group_info
cr_restore_ipc_ns
cr_restore_mm_struct
cr_restore_mnt_ns
cr_restore_net_ns
cr_restore_nsproxy
cr_restore_pid
cr_restore_pid_ns
cr_restore_ptr
cr_restore_sighand_struct
cr_restore_signal_struct
cr_restore_sigset
cr_restore_task_struct
cr_restore_task_struct_x86_32
cr_restore_task_struct_x86_64
cr_restore_uts_ns
cr_restore_vma
cr_restore_vma_content
cr_restore_vma_vdso
cr_restore_xstate
cr_rflags
cr_rip
cr_rlim
cr_rlim_cur
cr_rlim_max
cr_root
cr_root_len
cr_rsi
cr_rsp
cr_s390_mm
cr_s390_regs
cr_sa
cr_sa_flags
cr_sa_handler
cr_sa_mask
cr_sa_restorer
cr_save_cpu_regs
cr_saved_auxv
cr_saved_sigmask
cr_scan_fds
cr_securebits
cr_sem_ctls
cr_sgid
cr_shm_ctlall
cr_shm_ctlmax
cr_shm_ctlmni
cr_sighand
cr_signal
cr_signature
cr_signum
cr_sk_family
cr_ss
cr_start_addr
cr_start_brk
cr_start_code
cr_start_data
cr_start_stack
cr_stopped
cr_suid
cr_sysctl_max_dgram_qlen
cr_sysctl_somaxconn
cr_sysname
cr_task_struct
cr_task_struct_arch
cr_termios
cr_thaw_tasks
cr_tls_array
cr_tsk_arch
cr_type
cr_uid
cr_umask
cr_unx
cr_uts_ns
cr_uts_release
cr_version
cr_vm_end
cr_vm_flags
cr_vm_page_prot
cr_vm_pgoff
cr_vm_start
cr_winsize
cr_write
cr_write_buffer
cr_write_cpu
cr_write_error
cr_write_fd_data
cr_write_files
cr_write_head
cr_write_head_arch
cr_write_mm_context
cr_write_obj
cr_write_thread
cr_write_vma
cr_ws_col
cr_ws_row
cr_ws_xpixel
cr_ws_ypixel
cr_xstate

The cr_ prefix for internal symbols and fields should be dropped. 

For global symbols, the following naming scheme looks pretty 
intuitive, unique and self-descriptive to me:

  state_checkpoint_XYZ()
  state_restore_XYZ()
  state_collect_XYZ()
  state_dump_XYZ()
  state_image_XYZ()
  ...

Seeing a state_*() symbol anywhere in the kernel will tell me 'ah, 
this is kernel object state checkpoint/restore related 
functionality', at a glance.

[ And equally importantly, when seeing a state_*() symbol i can
  skip it easily, as an uninteresting-to-my-current-activity item. ]

I had a quick look at various existing uses of "\<state_.*" symbols 
in the kernel, and no widely used variant looked particularly 
confusing to me, so i think this would be a good choice IMHO.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13  9:14   ` Alexey Dobriyan
  2009-04-13 11:16     ` Dave Hansen
  2009-04-13 18:07     ` Dave Hansen
@ 2009-04-14  4:26     ` Oren Laadan
  2009-04-14 14:58       ` Alexey Dobriyan
  2 siblings, 1 reply; 37+ messages in thread
From: Oren Laadan @ 2009-04-14  4:26 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel



Alexey Dobriyan wrote:
> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
>> I'm curious how you see these fitting in with the work that we've been
>> doing with Oren.  Do you mean to just start a discussion or are you
>> really proposing these as an alternative to what Oren has been posting?
> 
> Yes, this is posted as alternative.
> 
> Some design decisions are seen as incorrect from here like:

A definition of "design" would help; I find most of your comments
below either vague, cryptic, or technical nits...

> * not rejecting checkpoint with possible "leaks" from container

...like this, for example.
Anything in the current design makes it impossible ?

Anything prohibiting your from adding this feature to the current
patch-set ?

> * not having CAP_SYS_ADMIN on restart(2)

Surely you have read already on the containers mailing list that
for the *time being* we attempt to get as far as possible without
requiring root privileges, to identify security hot-spots.

And surely you have read there the observation that for the general
case root privileges will probably be inevitable.

And surely you don't seriously think that adding this check changes
the "design"...

> * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
>   misdesigns.

Eh ?

> * doing fork(2)+restart(2) per restarted task and whole orchestration
>   done from userspace/future init task.

Why is it "incorrect" ?
What makes it "better" to do it in the kernel ?
Only because you say so is not convincing.

(also see my other post in this matter).

> * not seeing bigger picture (note, this is not equivalent to supporting
>   everything at once, nobody is asking for everything at once) wrt shared
>   objects and format and code changes because of that (note again, image
>   format will change, but it's easy to design high level structure which
>   won't change)

Why don't you describe the bigger picture so that the rest of can
finally see it, too ?!
(what a waste to have spent all this effort in vain...)

> * checking of unsupported features done at wrong place and wrong time
>   and runtime overhead because of that on CR=y kernels.

Eh ?   Did you follow the code recently ?

> 
> There are also low-level things, but it's cumulative effect.
> 
> [1] Do I inderstand correctly that cookie for shared object is an
> address on kernel stack? This is obviously unreliable, if yes :-)

Ah... I see... you didn't look at it that hard, not even read the
documentation with the code.

> 
> 	int objref;
> 		...
> 	/* adding 'file' to the hash will keep a reference to it */
> 	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
> 					^^^^^^^

That said, there are more similarities than differences between your
suggested template and the current patchset. With your expertise you
can contribute tremendously if you decide to work together.

Oren.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
                   ` (3 preceding siblings ...)
  2009-04-10 15:06 ` Linus Torvalds
@ 2009-04-14  5:46 ` Oren Laadan
  2009-04-14 15:19   ` Alexey Dobriyan
  4 siblings, 1 reply; 37+ messages in thread
From: Oren Laadan @ 2009-04-14  5:46 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, dave, mingo, hch, torvalds, linux-kernel


Some meta comments about this patch set:

* Patches 1-9 are cleanups, unrelated to checkpoint/restart. They
deserve a separate thread.

* You barely take locks or reference counts to objects that you
later refer to. What if something really bad happens ?

* (contd) If you don't take locks, then you at the very least need
to rely on the container remaining frozen for the duration of the
operation (during checkpoint).

* (contd) Still with locks and references, during restart you can't
even freeze the container, so need extra logic to prevent bad things
(e.g. OOM killer, signal or ptrace from parent container etc).

* What is the rationale behind doing the freeze/unfreeze from within
sys_checkpoint/sys_restart ?   Instead of you let userspace do it
(and only verify in kernel) you gain, again, flexibility. For example,
you want to also snapshot the filesystem, then userspace will do
something like:  freeze container -> snapshot filesystem -> checkpoint
-> thaw container.

* A plethora of "FIXME" comments ...

Alexey Dobriyan wrote:
> This is to show how we see C/R and to provoke discussion on number of
> important issues (mounts, ...).

Quoting your patch:
---
This is one big FIXME:
	What to do with overmounted files?
	What to do with mounts at all, who should restore them?

just restore something to not oops on task exit
---

> 
> This is small part of long-awaited to be cleanuped code.
> 
> It's able to restore busyloop on i386 and x86_64 and restore i386
> busyloop on x86_64. It wasn't tested much more than that.

Oh .. I really wish you'd sent a x86_64 patch our way, too ;)

> 
> I'm currently starting formal testsuite, otherwise it's whack-a-mole
> game and formal TODO list (a huge one).
> 

So I'm still struggling to see the major different in the approaches
that would justify throwing away our hard worked, reviewed, tested
and functional code, and take this - similar in design, largely
incomplete and unreviewed code.

Best,

Oren.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-13 18:39     ` Linus Torvalds
  2009-04-13 19:30       ` Ingo Molnar
@ 2009-04-14 12:29       ` Alexey Dobriyan
  2009-04-14 13:44         ` Ingo Molnar
  2009-04-14 17:09         ` Linus Torvalds
  1 sibling, 2 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 12:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, containers, xemul, serue, dave, mingo, orenl, hch, linux-kernel

On Mon, Apr 13, 2009 at 11:39:51AM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> > 
> > Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
> > and "rst_".
> 
> So?
> 
> We're not merging OpenVZ code _either_.

This is to give example of other prefixes: cpt_ and rst_
Are they fine?

> > And I think "cr_" is super nice prefix: it's short, it's C-like,
> > it reminds about restart part
> 
> It does no such thing. THAT'S THE POINT. "cr" means _nothing_ to anybody 
> else than some CR-specific people, and those people don't even need it!
> 
> Look around you. We try to use nicer names. We spell out "cpufreq", we 
> don't call it "cf".

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 12:29       ` Alexey Dobriyan
@ 2009-04-14 13:44         ` Ingo Molnar
  2009-04-14 16:53           ` Alexey Dobriyan
  2009-04-14 17:09         ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2009-04-14 13:44 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Linus Torvalds, akpm, containers, xemul, serue, dave, orenl, hch,
	linux-kernel


* Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Mon, Apr 13, 2009 at 11:39:51AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> > > 
> > > Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
> > > and "rst_".
> > 
> > So?
> > 
> > We're not merging OpenVZ code _either_.
> 
> This is to give example of other prefixes: cpt_ and rst_
> Are they fine?

Not really. 'rst' can be easily mistaken for 'reset' and neither 
really tells me at a glance what they do. They are also quite 
tongue-twisters.

See my namespace analysis and suggestions from yesterday for a 
proper naming scheme.

The key i believe is to move away from this singular 'the world is 
all about checkpoint and restore', and move it to a IMHO clearer 
state_*() type of naming which really isolates all these kernel 
state save/restore management APIs from other kernel APIs. (See my 
mail from yesterday for details.)

kstate_*() would be another, perhaps even clearer naming scheme. 
I.e.:

  kstate_checkpoint_XYZ()
  kstate_restore_XYZ()
  kstate_collect_XYZ()
  kstate_dump_XYZ()
  kstate_image_XYZ()
  ...

Just _look_ at them - they are expressive at a glance, and 
reasonably short. That is the kind of first-time impression
we need, not a 'wtf?' moment.

I just checked, there's zero hits on "git grep \<kstate_" in the 
kernel, so it's a pristine namespace. IMHO, go wild ...

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14  4:26     ` Oren Laadan
@ 2009-04-14 14:58       ` Alexey Dobriyan
  2009-04-14 18:08         ` Oren Laadan
  0 siblings, 1 reply; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 14:58 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel

On Tue, Apr 14, 2009 at 12:26:50AM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
> > On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> >> I'm curious how you see these fitting in with the work that we've been
> >> doing with Oren.  Do you mean to just start a discussion or are you
> >> really proposing these as an alternative to what Oren has been posting?
> > 
> > Yes, this is posted as alternative.
> > 
> > Some design decisions are seen as incorrect from here like:
> 
> A definition of "design" would help; I find most of your comments
> below either vague, cryptic, or technical nits...
> 
> > * not rejecting checkpoint with possible "leaks" from container
> 
> ...like this, for example.

Like checkpointing one process out of many living together.

If you allow this you consequently drop checks (e.g. refcount checks)
for "somebody else is using structure to be checkpointed".

If you drop these checks, you can't decipher legal sutiations like
"process genuinely doesn't care about routing table of netns it lives in"
from "illegal" situations like "process created shm segment but currently
doesn't use it so not checkpointing ipcns will result in breakagenlater".

You'll have to move responsibility to user, so user exactly knows what
app relies on and on what. And probably add flags like CKPT_SHM,
CKPT_NETNS_ROUTE ad infinitum.

And user will screw it badly and complain: "after restart my app
segfaulted". And user himself is screwed now: old running process is
already killed (it was checkpointed on purpose) and new process in image
segfaults every time it's restarted.

All of this in out opinion results in doing C/R unreliably and badly.

We are going to do it well and dig from the other side.

If "leak" (any "leak") is detected, C/R is aborted because kernel
doesn't know what app relies on and what app doesn't care about.

This protected from situations and failure modes described above.

This also protects to some extent from in-kernel changes where C/R code
should have been updated but wasn't. Person doing incomplete change won't
notice e.g refcount checks and won't try to "fix" them. But we'll notice it,
e.g. when running testsuite (amen) and update C/R code accordingly.

I'm talking about these checks so that everyone understands:

	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
                struct mm_struct *mm = obj->o_obj;
                unsigned int cnt = atomic_read(&mm->mm_users);

                if (obj->o_count != cnt) {
                        printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
                        return -EINVAL;
                }
        }

They are like moving detectors, small, invisible, something moved, you don't
know what, but you don't care because you have to investigate anyway.

In this scheme, if user wants to checkpoint just one process, he should
start it alone in separate container. Right now, in posted patchset
as cloned process with
CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET

All of this by definition won't create situation "mom, where did my shm segment go?".

> Anything in the current design makes it impossible ?
> 
> Anything prohibiting your from adding this feature to the current
> patch-set ?
> 
> > * not having CAP_SYS_ADMIN on restart(2)
> 
> Surely you have read already on the containers mailing list that
> for the *time being* we attempt to get as far as possible without
> requiring root privileges, to identify security hot-spots.

More or less everything is hotspot.

> And surely you have read there the observation that for the general
> case root privileges will probably be inevitable.
> 
> And surely you don't seriously think that adding this check changes
> the "design"...

This is not about design now, this is about if you allow restart(2) for
everyone, you open kernel to very very high number of holes of all
types. I wrote about it in reply to your code.

Numbers of states described by any valid image as described by header
files is very much higher than number of states that can validly end up
in an image. Reducing set of states to only valid ones will require
many checks.

For example, just one structure: struct cr_hdr_cpu.

Segments are directly loaded regardless of RPL.

Debug registers are directly loaded regardless of to where breakpoint
points to.

If you don't understand scale of problem and insist on restart(2) not
having CAP_SYS_ADMIN, we'll just drop from discussion and write email
to bugtraq to warn sysadmins at least.

restart(2) for everyone sounds like excellent feature and is excellent
feature but simultaneusly is completely unrealistic for Unix-like kernel.

> > * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
> >   misdesigns.
> 
> Eh ?

TASK_COMM_LEN is 16 bytes and is not going to change, so you do

	__u8 cr_comm[16];

and forget. If TASK_COMM_LEN changes you'll just bump size of
task_struct image and image version.

Saving ->comm as variable-sized string is complication out of nowhere.

As for objref, my apoligies, objection dropped, I understood how it can be useful.

> > * doing fork(2)+restart(2) per restarted task and whole orchestration
> >   done from userspace/future init task.
> 
> Why is it "incorrect" ?
> What makes it "better" to do it in the kernel ?
> Only because you say so is not convincing.
> 
> (also see my other post in this matter).
> 
> > * not seeing bigger picture (note, this is not equivalent to supporting
> >   everything at once, nobody is asking for everything at once) wrt shared
> >   objects and format and code changes because of that (note again, image
> >   format will change, but it's easy to design high level structure which
> >   won't change)
> 
> Why don't you describe the bigger picture so that the rest of can
> finally see it, too ?!
> (what a waste to have spent all this effort in vain...)

Sit and graph down all structures that you will eventually checkpoint and
relations about them. Things like "struct cr_hdr_pids" won't even be on
the radar.

> > * checking of unsupported features done at wrong place and wrong time
> >   and runtime overhead because of that on CR=y kernels.
> 
> Eh ?   Did you follow the code recently ?

Checks during dup_fd() were dropped? Good.

> > There are also low-level things, but it's cumulative effect.
> > 
> > [1] Do I inderstand correctly that cookie for shared object is an
> > address on kernel stack? This is obviously unreliable, if yes :-)
> 
> Ah... I see... you didn't look at it that hard, not even read the
> documentation with the code.
> 
> > 
> > 	int objref;
> > 		...
> > 	/* adding 'file' to the hash will keep a reference to it */
> > 	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
> > 					^^^^^^^
> 
> That said, there are more similarities than differences between your
> suggested template and the current patchset. With your expertise you
> can contribute tremendously if you decide to work together.

OK, totally misread code.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14  5:46 ` Oren Laadan
@ 2009-04-14 15:19   ` Alexey Dobriyan
  0 siblings, 0 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 15:19 UTC (permalink / raw)
  To: Oren Laadan
  Cc: akpm, containers, xemul, serue, dave, mingo, hch, torvalds, linux-kernel

On Tue, Apr 14, 2009 at 01:46:36AM -0400, Oren Laadan wrote:
> Some meta comments about this patch set:
> 
> * Patches 1-9 are cleanups, unrelated to checkpoint/restart. They
> deserve a separate thread.

They will be sent separatedly.

> * You barely take locks or reference counts to objects that you
> later refer to. What if something really bad happens ?

I'm thinking right now what is needed and isn't.

Tasks are frozen so unmap VMA in the middle of dump can't happen.

Refcounts aren't needed  because tasks are frozen but alive and pin
everything needed. This is during dump part.

What is needed is impossibility of a task to dissapear in frozen state.

On restart, I probably leak refcount to every object, but haven't
checked just it.

> * (contd) If you don't take locks, then you at the very least need
> to rely on the container remaining frozen for the duration of the
> operation (during checkpoint).

> * (contd) Still with locks and references, during restart you can't
> even freeze the container, so need extra logic to prevent bad things
> (e.g. OOM killer, signal or ptrace from parent container etc).
> 
> * What is the rationale behind doing the freeze/unfreeze from within
> sys_checkpoint/sys_restart ?   Instead of you let userspace do it
> (and only verify in kernel) you gain, again, flexibility. For example,
> you want to also snapshot the filesystem, then userspace will do
> something like:  freeze container -> snapshot filesystem -> checkpoint
> -> thaw container.

This is incomplete part. But, yes, freeze, dump, thaw/kill as separate
actions make sense.

	checkpoint(CR_CPT_FREEZE);
	[rsync fs]
	checkpoint(CR_CPT_DUMP|CR_CPT_KILL);

with check that CR_CPT_THAW doesn't happen during dump.

> * A plethora of "FIXME" comments ...
> 
> Alexey Dobriyan wrote:
> > This is to show how we see C/R and to provoke discussion on number of
> > important issues (mounts, ...).
> 
> Quoting your patch:
> ---
> This is one big FIXME:
> 	What to do with overmounted files?
> 	What to do with mounts at all, who should restore them?
> 
> just restore something to not oops on task exit

This is just a skeleton for mnt_ns checkpoint, not seriously suggested
to merge it.

> > This is small part of long-awaited to be cleanuped code.
> > 
> > It's able to restore busyloop on i386 and x86_64 and restore i386
> > busyloop on x86_64. It wasn't tested much more than that.
> 
> Oh .. I really wish you'd sent a x86_64 patch our way, too ;)
> 
> > 
> > I'm currently starting formal testsuite, otherwise it's whack-a-mole
> > game and formal TODO list (a huge one).
> > 
> 
> So I'm still struggling to see the major different in the approaches
> that would justify throwing away our hard worked, reviewed, tested
> and functional code, and take this - similar in design, largely
> incomplete and unreviewed code.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 13:44         ` Ingo Molnar
@ 2009-04-14 16:53           ` Alexey Dobriyan
  0 siblings, 0 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 16:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, akpm, containers, xemul, serue, dave, orenl, hch,
	linux-kernel

On Tue, Apr 14, 2009 at 03:44:20PM +0200, Ingo Molnar wrote:
> 
> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > On Mon, Apr 13, 2009 at 11:39:51AM -0700, Linus Torvalds wrote:
> > > 
> > > 
> > > On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> > > > 
> > > > Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
> > > > and "rst_".
> > > 
> > > So?
> > > 
> > > We're not merging OpenVZ code _either_.
> > 
> > This is to give example of other prefixes: cpt_ and rst_
> > Are they fine?
> 
> Not really. 'rst' can be easily mistaken for 'reset' and neither 
> really tells me at a glance what they do. They are also quite 
> tongue-twisters.
> 
> See my namespace analysis and suggestions from yesterday for a 
> proper naming scheme.
> 
> The key i believe is to move away from this singular 'the world is 
> all about checkpoint and restore', and move it to a IMHO clearer 
> state_*() type of naming which really isolates all these kernel 
> state save/restore management APIs from other kernel APIs. (See my 
> mail from yesterday for details.)
> 
> kstate_*() would be another, perhaps even clearer naming scheme. 
> I.e.:
> 
>   kstate_checkpoint_XYZ()
>   kstate_restore_XYZ()
>   kstate_collect_XYZ()
>   kstate_dump_XYZ()
>   kstate_image_XYZ()
>   ...
> 
> Just _look_ at them - they are expressive at a glance, and 
> reasonably short. That is the kind of first-time impression
> we need, not a 'wtf?' moment.
> 
> I just checked, there's zero hits on "git grep \<kstate_" in the 
> kernel, so it's a pristine namespace. IMHO, go wild ...

Need to try it for real.

One minor nit. This kstate_ doesn't include quite a noticable of
in-kernel state. For example, task readahead state isn't relevant
to C/R at all. It's state but irrelevant state.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 12:29       ` Alexey Dobriyan
  2009-04-14 13:44         ` Ingo Molnar
@ 2009-04-14 17:09         ` Linus Torvalds
  2009-04-14 17:19           ` Randy Dunlap
  1 sibling, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2009-04-14 17:09 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, containers, xemul, serue, dave, mingo, orenl, hch, linux-kernel



On Tue, 14 Apr 2009, Alexey Dobriyan wrote:
> > 
> > We're not merging OpenVZ code _either_.
> 
> This is to give example of other prefixes: cpt_ and rst_
> Are they fine?

Do you secretly work for IBM?

IBM has a well-known disdain for vowels, and basically refuses to use them 
for mnemonics (they were called on this, and did "eieio" as an instruction 
just to try to make up for it).

But I'm from Finland. In Finnish, about 75% of all letters are vowels. I 
find this dis-emvoweling to be stupid and impractical. Without vowels, you 
can't tell Finnish words apart (admittedly, _with_ vowels, you generally 
cannot pronounce them, so to a non-Finn it doesn't much matter).

My point is, let's go for a happy medium - LEAVE THE F*CKING VOWELS IN 
PLACE ALREADY.

So let's call it "checkpoint" and "restore". Ok?

		Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 17:09         ` Linus Torvalds
@ 2009-04-14 17:19           ` Randy Dunlap
  2009-04-14 17:32             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Randy Dunlap @ 2009-04-14 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Dobriyan, akpm, containers, xemul, serue, dave, mingo,
	orenl, hch, linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 14 Apr 2009, Alexey Dobriyan wrote:
>>> We're not merging OpenVZ code _either_.
>> This is to give example of other prefixes: cpt_ and rst_
>> Are they fine?

http://www.rfc-editor.org/rfc/rfc5513.txt


> Do you secretly work for IBM?
> 
> IBM has a well-known disdain for vowels, and basically refuses to use them 
> for mnemonics (they were called on this, and did "eieio" as an instruction 
> just to try to make up for it).
> 
> But I'm from Finland. In Finnish, about 75% of all letters are vowels. I 
> find this dis-emvoweling to be stupid and impractical. Without vowels, you 
> can't tell Finnish words apart (admittedly, _with_ vowels, you generally 
> cannot pronounce them, so to a non-Finn it doesn't much matter).
> 
> My point is, let's go for a happy medium - LEAVE THE F*CKING VOWELS IN 
> PLACE ALREADY.
> 
> So let's call it "checkpoint" and "restore". Ok?

restart ?

-- 
~Randy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 17:19           ` Randy Dunlap
@ 2009-04-14 17:32             ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2009-04-14 17:32 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Alexey Dobriyan, akpm, containers, xemul, serue, dave, mingo,
	orenl, hch, linux-kernel



On Tue, 14 Apr 2009, Randy Dunlap wrote:
> > 
> > So let's call it "checkpoint" and "restore". Ok?
> 
> restart ?

Even better (and proving my point that trying to use contractions like 
"rst" can be misleading, although I agree with Ingo that the most common 
mistake would be "reset")

			Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 14:58       ` Alexey Dobriyan
@ 2009-04-14 18:08         ` Oren Laadan
  2009-04-14 18:34           ` Alexey Dobriyan
  2009-04-14 20:49           ` Alexey Dobriyan
  0 siblings, 2 replies; 37+ messages in thread
From: Oren Laadan @ 2009-04-14 18:08 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel



Alexey Dobriyan wrote:
> On Tue, Apr 14, 2009 at 12:26:50AM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
>>>> I'm curious how you see these fitting in with the work that we've been
>>>> doing with Oren.  Do you mean to just start a discussion or are you
>>>> really proposing these as an alternative to what Oren has been posting?
>>> Yes, this is posted as alternative.
>>>
>>> Some design decisions are seen as incorrect from here like:
>> A definition of "design" would help; I find most of your comments
>> below either vague, cryptic, or technical nits...
>>
>>> * not rejecting checkpoint with possible "leaks" from container
>> ...like this, for example.
> 
> Like checkpointing one process out of many living together.

See the thread on creating tasks in userspace vs. kernel space:
the argument here is that is an interesting enough use case for
a checkpoint of not-an-entire-container.

Of course it will require more logic to it, so the user can choose
what she cares or does not care about, and the kernel could alert
the user about it.

The point is, that it is, IMHO, a desirable capability.

> 
> If you allow this you consequently drop checks (e.g. refcount checks)
> for "somebody else is using structure to be checkpointed".
> 

>From this point below, I totally agree with you that for the purpose
of a whole-container-checkpoint this is certainly desirable. My point
was that it can be easily added the existing patchset (not yours).
Why not add it there ?

> If you drop these checks, you can't decipher legal sutiations like
> "process genuinely doesn't care about routing table of netns it lives in"
> from "illegal" situations like "process created shm segment but currently
> doesn't use it so not checkpointing ipcns will result in breakagenlater".
> 
> You'll have to move responsibility to user, so user exactly knows what
> app relies on and on what. And probably add flags like CKPT_SHM,
> CKPT_NETNS_ROUTE ad infinitum.
> 
> And user will screw it badly and complain: "after restart my app
> segfaulted". And user himself is screwed now: old running process is
> already killed (it was checkpointed on purpose) and new process in image
> segfaults every time it's restarted.
> 
> All of this in out opinion results in doing C/R unreliably and badly.
> 
> We are going to do it well and dig from the other side.
> 
> If "leak" (any "leak") is detected, C/R is aborted because kernel
> doesn't know what app relies on and what app doesn't care about.
> 
> This protected from situations and failure modes described above.
> 
> This also protects to some extent from in-kernel changes where C/R code
> should have been updated but wasn't. Person doing incomplete change won't
> notice e.g refcount checks and won't try to "fix" them. But we'll notice it,
> e.g. when running testsuite (amen) and update C/R code accordingly.
> 
> I'm talking about these checks so that everyone understands:
> 
> 	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
>                 struct mm_struct *mm = obj->o_obj;
>                 unsigned int cnt = atomic_read(&mm->mm_users);
> 
>                 if (obj->o_count != cnt) {
>                         printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
>                         return -EINVAL;
>                 }
>         }
> 
> They are like moving detectors, small, invisible, something moved, you don't
> know what, but you don't care because you have to investigate anyway.
> 
> In this scheme, if user wants to checkpoint just one process, he should
> start it alone in separate container. Right now, in posted patchset
> as cloned process with
> CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET

So you suggest that to checkpoint a single process, say a cpu job that
would run a week, which runs in the topmost pid_ns, I will need to
checkpoint the entire topmost pid_ns (as a container, if at all possible
- surely there will non-checkpointable tasks there) and then in
user-space filter out the data and leave only one task, and then to
restart I'll use a container again ?

Pffff ... why not just allow subtree checkpoint, not in a container,
with its well known limitations -- would work the same, for very little
additional implementation cost.

> 
> All of this by definition won't create situation "mom, where did my shm segment go?".
> 
>> Anything in the current design makes it impossible ?
>>
>> Anything prohibiting your from adding this feature to the current
>> patch-set ?
>>
>>> * not having CAP_SYS_ADMIN on restart(2)
>> Surely you have read already on the containers mailing list that
>> for the *time being* we attempt to get as far as possible without
>> requiring root privileges, to identify security hot-spots.
> 
> More or less everything is hotspot.

Going back to my example of a subtree above: what sort of security
issues you would have here, assuming all restart operations go via
the usual security checks just like syscalls, and we take special
care about values we allow as input, e.g. cpu debug registers etc ?

BTW, the checks for cpu registers and other values for sanity is
needed anyway to ensure that the kernel freak out if wrong input
is fed (maliciously or by mistake).

> 
>> And surely you have read there the observation that for the general
>> case root privileges will probably be inevitable.
>>
>> And surely you don't seriously think that adding this check changes
>> the "design"...
> 
> This is not about design now, this is about if you allow restart(2) for
> everyone, you open kernel to very very high number of holes of all
> types. I wrote about it in reply to your code.
> 

And there are many examples where this is not the case. The current
patchset (v14-rc3), for example, to the best of my knowledge, does
not have such issues (well, need to add checks for validity of regs
but there is a FIXME comment :)

> Numbers of states described by any valid image as described by header
> files is very much higher than number of states that can validly end up
> in an image. Reducing set of states to only valid ones will require
> many checks.
> 
> For example, just one structure: struct cr_hdr_cpu.
> 
> Segments are directly loaded regardless of RPL.
> 
> Debug registers are directly loaded regardless of to where breakpoint
> points to.
> 
> If you don't understand scale of problem and insist on restart(2) not
> having CAP_SYS_ADMIN, we'll just drop from discussion and write email
> to bugtraq to warn sysadmins at least.

I humbly suggest that for many useful use-cases restart(2) can succeed
and work well without requiring CAP_SYS_ADMIN. I'm also citing Serge
saying that it's a good exercise to identify all those specific points
where CAP_SYS_ADMIN is needed and make sure we understand them well.

For this purpose, _starting_ with not requiring CAP_SYS_ADMIN is
actually a good idea.  Then we can isolate where it's strictly needed,
what are the security implications, and add the requirement there.

> 
> restart(2) for everyone sounds like excellent feature and is excellent
> feature but simultaneusly is completely unrealistic for Unix-like kernel.
> 
>>> * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
>>>   misdesigns.
>> Eh ?
> 
> TASK_COMM_LEN is 16 bytes and is not going to change, so you do
> 
> 	__u8 cr_comm[16];
> 
> and forget. If TASK_COMM_LEN changes you'll just bump size of
> task_struct image and image version.

That's an extra safety check. So that kernel doesn't _crash_ if a new
image is fed to an older kernel (and for some reason the image altered).

And also needed when you support non-CAP_SYS_ADMIN restart(2) (yeah,
yeah, I got the point, limited version only).

Oren.

> 
> Saving ->comm as variable-sized string is complication out of nowhere.
> 
> As for objref, my apoligies, objection dropped, I understood how it can be useful.
> 
>>> * doing fork(2)+restart(2) per restarted task and whole orchestration
>>>   done from userspace/future init task.
>> Why is it "incorrect" ?
>> What makes it "better" to do it in the kernel ?
>> Only because you say so is not convincing.
>>
>> (also see my other post in this matter).
>>
>>> * not seeing bigger picture (note, this is not equivalent to supporting
>>>   everything at once, nobody is asking for everything at once) wrt shared
>>>   objects and format and code changes because of that (note again, image
>>>   format will change, but it's easy to design high level structure which
>>>   won't change)
>> Why don't you describe the bigger picture so that the rest of can
>> finally see it, too ?!
>> (what a waste to have spent all this effort in vain...)
> 
> Sit and graph down all structures that you will eventually checkpoint and
> relations about them. Things like "struct cr_hdr_pids" won't even be on
> the radar.
> 
>>> * checking of unsupported features done at wrong place and wrong time
>>>   and runtime overhead because of that on CR=y kernels.
>> Eh ?   Did you follow the code recently ?
> 
> Checks during dup_fd() were dropped? Good.
> 
>>> There are also low-level things, but it's cumulative effect.
>>>
>>> [1] Do I inderstand correctly that cookie for shared object is an
>>> address on kernel stack? This is obviously unreliable, if yes :-)
>> Ah... I see... you didn't look at it that hard, not even read the
>> documentation with the code.
>>
>>> 	int objref;
>>> 		...
>>> 	/* adding 'file' to the hash will keep a reference to it */
>>> 	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
>>> 					^^^^^^^
>> That said, there are more similarities than differences between your
>> suggested template and the current patchset. With your expertise you
>> can contribute tremendously if you decide to work together.
> 
> OK, totally misread code.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 18:08         ` Oren Laadan
@ 2009-04-14 18:34           ` Alexey Dobriyan
  2009-04-14 19:31             ` Oren Laadan
  2009-04-14 20:49           ` Alexey Dobriyan
  1 sibling, 1 reply; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 18:34 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel

On Tue, Apr 14, 2009 at 02:08:21PM -0400, Oren Laadan wrote:
> 
> 
> Alexey Dobriyan wrote:
> > On Tue, Apr 14, 2009 at 12:26:50AM -0400, Oren Laadan wrote:
> >> Alexey Dobriyan wrote:
> >>> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> >>>> I'm curious how you see these fitting in with the work that we've been
> >>>> doing with Oren.  Do you mean to just start a discussion or are you
> >>>> really proposing these as an alternative to what Oren has been posting?
> >>> Yes, this is posted as alternative.
> >>>
> >>> Some design decisions are seen as incorrect from here like:
> >> A definition of "design" would help; I find most of your comments
> >> below either vague, cryptic, or technical nits...
> >>
> >>> * not rejecting checkpoint with possible "leaks" from container
> >> ...like this, for example.
> > 
> > Like checkpointing one process out of many living together.
> 
> See the thread on creating tasks in userspace vs. kernel space:
> the argument here is that is an interesting enough use case for
> a checkpoint of not-an-entire-container.
> 
> Of course it will require more logic to it, so the user can choose
> what she cares or does not care about, and the kernel could alert
> the user about it.
> 
> The point is, that it is, IMHO, a desirable capability.
> 
> > 
> > If you allow this you consequently drop checks (e.g. refcount checks)
> > for "somebody else is using structure to be checkpointed".
> > 
> 
> From this point below, I totally agree with you that for the purpose
> of a whole-container-checkpoint this is certainly desirable. My point
> was that it can be easily added the existing patchset (not yours).
> Why not add it there ?
> 
> > If you drop these checks, you can't decipher legal sutiations like
> > "process genuinely doesn't care about routing table of netns it lives in"
> > from "illegal" situations like "process created shm segment but currently
> > doesn't use it so not checkpointing ipcns will result in breakagenlater".
> > 
> > You'll have to move responsibility to user, so user exactly knows what
> > app relies on and on what. And probably add flags like CKPT_SHM,
> > CKPT_NETNS_ROUTE ad infinitum.
> > 
> > And user will screw it badly and complain: "after restart my app
> > segfaulted". And user himself is screwed now: old running process is
> > already killed (it was checkpointed on purpose) and new process in image
> > segfaults every time it's restarted.
> > 
> > All of this in out opinion results in doing C/R unreliably and badly.
> > 
> > We are going to do it well and dig from the other side.
> > 
> > If "leak" (any "leak") is detected, C/R is aborted because kernel
> > doesn't know what app relies on and what app doesn't care about.
> > 
> > This protected from situations and failure modes described above.
> > 
> > This also protects to some extent from in-kernel changes where C/R code
> > should have been updated but wasn't. Person doing incomplete change won't
> > notice e.g refcount checks and won't try to "fix" them. But we'll notice it,
> > e.g. when running testsuite (amen) and update C/R code accordingly.
> > 
> > I'm talking about these checks so that everyone understands:
> > 
> > 	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
> >                 struct mm_struct *mm = obj->o_obj;
> >                 unsigned int cnt = atomic_read(&mm->mm_users);
> > 
> >                 if (obj->o_count != cnt) {
> >                         printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
> >                         return -EINVAL;
> >                 }
> >         }
> > 
> > They are like moving detectors, small, invisible, something moved, you don't
> > know what, but you don't care because you have to investigate anyway.
> > 
> > In this scheme, if user wants to checkpoint just one process, he should
> > start it alone in separate container. Right now, in posted patchset
> > as cloned process with
> > CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET
> 
> So you suggest that to checkpoint a single process, say a cpu job that
> would run a week, which runs in the topmost pid_ns, I will need to
> checkpoint the entire topmost pid_ns (as a container, if at all possible
> - surely there will non-checkpointable tasks there) and then in
> user-space filter out the data and leave only one task, and then to
> restart I'll use a container again ?

No, you do little preparations and start CPU job in container from the very
beginning.

> Pffff ... why not just allow subtree checkpoint, not in a container,
> with its well known limitations -- would work the same, for very little
> additional implementation cost.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 18:34           ` Alexey Dobriyan
@ 2009-04-14 19:31             ` Oren Laadan
  2009-04-14 20:08               ` Alexey Dobriyan
  0 siblings, 1 reply; 37+ messages in thread
From: Oren Laadan @ 2009-04-14 19:31 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel



Alexey Dobriyan wrote:
> On Tue, Apr 14, 2009 at 02:08:21PM -0400, Oren Laadan wrote:
>>
>> Alexey Dobriyan wrote:
>>> On Tue, Apr 14, 2009 at 12:26:50AM -0400, Oren Laadan wrote:
>>>> Alexey Dobriyan wrote:
>>>>> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
>>>>>> I'm curious how you see these fitting in with the work that we've been
>>>>>> doing with Oren.  Do you mean to just start a discussion or are you
>>>>>> really proposing these as an alternative to what Oren has been posting?
>>>>> Yes, this is posted as alternative.
>>>>>
>>>>> Some design decisions are seen as incorrect from here like:
>>>> A definition of "design" would help; I find most of your comments
>>>> below either vague, cryptic, or technical nits...
>>>>
>>>>> * not rejecting checkpoint with possible "leaks" from container
>>>> ...like this, for example.
>>> Like checkpointing one process out of many living together.
>> See the thread on creating tasks in userspace vs. kernel space:
>> the argument here is that is an interesting enough use case for
>> a checkpoint of not-an-entire-container.
>>
>> Of course it will require more logic to it, so the user can choose
>> what she cares or does not care about, and the kernel could alert
>> the user about it.
>>
>> The point is, that it is, IMHO, a desirable capability.
>>
>>> If you allow this you consequently drop checks (e.g. refcount checks)
>>> for "somebody else is using structure to be checkpointed".
>>>
>> From this point below, I totally agree with you that for the purpose
>> of a whole-container-checkpoint this is certainly desirable. My point
>> was that it can be easily added the existing patchset (not yours).
>> Why not add it there ?
>>
>>> If you drop these checks, you can't decipher legal sutiations like
>>> "process genuinely doesn't care about routing table of netns it lives in"
>>> from "illegal" situations like "process created shm segment but currently
>>> doesn't use it so not checkpointing ipcns will result in breakagenlater".
>>>
>>> You'll have to move responsibility to user, so user exactly knows what
>>> app relies on and on what. And probably add flags like CKPT_SHM,
>>> CKPT_NETNS_ROUTE ad infinitum.
>>>
>>> And user will screw it badly and complain: "after restart my app
>>> segfaulted". And user himself is screwed now: old running process is
>>> already killed (it was checkpointed on purpose) and new process in image
>>> segfaults every time it's restarted.
>>>
>>> All of this in out opinion results in doing C/R unreliably and badly.
>>>
>>> We are going to do it well and dig from the other side.
>>>
>>> If "leak" (any "leak") is detected, C/R is aborted because kernel
>>> doesn't know what app relies on and what app doesn't care about.
>>>
>>> This protected from situations and failure modes described above.
>>>
>>> This also protects to some extent from in-kernel changes where C/R code
>>> should have been updated but wasn't. Person doing incomplete change won't
>>> notice e.g refcount checks and won't try to "fix" them. But we'll notice it,
>>> e.g. when running testsuite (amen) and update C/R code accordingly.
>>>
>>> I'm talking about these checks so that everyone understands:
>>>
>>> 	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
>>>                 struct mm_struct *mm = obj->o_obj;
>>>                 unsigned int cnt = atomic_read(&mm->mm_users);
>>>
>>>                 if (obj->o_count != cnt) {
>>>                         printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
>>>                         return -EINVAL;
>>>                 }
>>>         }
>>>
>>> They are like moving detectors, small, invisible, something moved, you don't
>>> know what, but you don't care because you have to investigate anyway.
>>>
>>> In this scheme, if user wants to checkpoint just one process, he should
>>> start it alone in separate container. Right now, in posted patchset
>>> as cloned process with
>>> CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET
>> So you suggest that to checkpoint a single process, say a cpu job that
>> would run a week, which runs in the topmost pid_ns, I will need to
>> checkpoint the entire topmost pid_ns (as a container, if at all possible
>> - surely there will non-checkpointable tasks there) and then in
>> user-space filter out the data and leave only one task, and then to
>> restart I'll use a container again ?
> 
> No, you do little preparations and start CPU job in container from the very
> beginning.

So you are denying all those other users that don't want to do that
the joy of checkpointing and restarting their stuff ... :(

Or, for users who do run everything in container, but some task is not
checkpointable - it is using this electronic microscope device attached
to their handheld. Alas, they do want to checkpoint that useful program
they are running there that calculates fibonacci numbers ...

Or, a nested container that shares something with the parent container,
so is not checkpointable by itself...

Ok, you probably got the idea.

Oren.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 19:31             ` Oren Laadan
@ 2009-04-14 20:08               ` Alexey Dobriyan
  0 siblings, 0 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 20:08 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel

On Tue, Apr 14, 2009 at 03:31:55PM -0400, Oren Laadan wrote:
> 
> 
> Alexey Dobriyan wrote:
> > On Tue, Apr 14, 2009 at 02:08:21PM -0400, Oren Laadan wrote:
> >>
> >> Alexey Dobriyan wrote:
> >>> On Tue, Apr 14, 2009 at 12:26:50AM -0400, Oren Laadan wrote:
> >>>> Alexey Dobriyan wrote:
> >>>>> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> >>>>>> I'm curious how you see these fitting in with the work that we've been
> >>>>>> doing with Oren.  Do you mean to just start a discussion or are you
> >>>>>> really proposing these as an alternative to what Oren has been posting?
> >>>>> Yes, this is posted as alternative.
> >>>>>
> >>>>> Some design decisions are seen as incorrect from here like:
> >>>> A definition of "design" would help; I find most of your comments
> >>>> below either vague, cryptic, or technical nits...
> >>>>
> >>>>> * not rejecting checkpoint with possible "leaks" from container
> >>>> ...like this, for example.
> >>> Like checkpointing one process out of many living together.
> >> See the thread on creating tasks in userspace vs. kernel space:
> >> the argument here is that is an interesting enough use case for
> >> a checkpoint of not-an-entire-container.
> >>
> >> Of course it will require more logic to it, so the user can choose
> >> what she cares or does not care about, and the kernel could alert
> >> the user about it.
> >>
> >> The point is, that it is, IMHO, a desirable capability.
> >>
> >>> If you allow this you consequently drop checks (e.g. refcount checks)
> >>> for "somebody else is using structure to be checkpointed".
> >>>
> >> From this point below, I totally agree with you that for the purpose
> >> of a whole-container-checkpoint this is certainly desirable. My point
> >> was that it can be easily added the existing patchset (not yours).
> >> Why not add it there ?
> >>
> >>> If you drop these checks, you can't decipher legal sutiations like
> >>> "process genuinely doesn't care about routing table of netns it lives in"
> >>> from "illegal" situations like "process created shm segment but currently
> >>> doesn't use it so not checkpointing ipcns will result in breakagenlater".
> >>>
> >>> You'll have to move responsibility to user, so user exactly knows what
> >>> app relies on and on what. And probably add flags like CKPT_SHM,
> >>> CKPT_NETNS_ROUTE ad infinitum.
> >>>
> >>> And user will screw it badly and complain: "after restart my app
> >>> segfaulted". And user himself is screwed now: old running process is
> >>> already killed (it was checkpointed on purpose) and new process in image
> >>> segfaults every time it's restarted.
> >>>
> >>> All of this in out opinion results in doing C/R unreliably and badly.
> >>>
> >>> We are going to do it well and dig from the other side.
> >>>
> >>> If "leak" (any "leak") is detected, C/R is aborted because kernel
> >>> doesn't know what app relies on and what app doesn't care about.
> >>>
> >>> This protected from situations and failure modes described above.
> >>>
> >>> This also protects to some extent from in-kernel changes where C/R code
> >>> should have been updated but wasn't. Person doing incomplete change won't
> >>> notice e.g refcount checks and won't try to "fix" them. But we'll notice it,
> >>> e.g. when running testsuite (amen) and update C/R code accordingly.
> >>>
> >>> I'm talking about these checks so that everyone understands:
> >>>
> >>> 	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
> >>>                 struct mm_struct *mm = obj->o_obj;
> >>>                 unsigned int cnt = atomic_read(&mm->mm_users);
> >>>
> >>>                 if (obj->o_count != cnt) {
> >>>                         printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
> >>>                         return -EINVAL;
> >>>                 }
> >>>         }
> >>>
> >>> They are like moving detectors, small, invisible, something moved, you don't
> >>> know what, but you don't care because you have to investigate anyway.
> >>>
> >>> In this scheme, if user wants to checkpoint just one process, he should
> >>> start it alone in separate container. Right now, in posted patchset
> >>> as cloned process with
> >>> CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET
> >> So you suggest that to checkpoint a single process, say a cpu job that
> >> would run a week, which runs in the topmost pid_ns, I will need to
> >> checkpoint the entire topmost pid_ns (as a container, if at all possible
> >> - surely there will non-checkpointable tasks there) and then in
> >> user-space filter out the data and leave only one task, and then to
> >> restart I'll use a container again ?
> > 
> > No, you do little preparations and start CPU job in container from the very
> > beginning.
> 
> So you are denying all those other users that don't want to do that
> the joy of checkpointing and restarting their stuff ... :(

That's the price for a feature. In return kernel promises to not create
surprises after restart(2).

> Or, for users who do run everything in container, but some task is not
> checkpointable - it is using this electronic microscope device attached
> to their handheld.

Why this example?

If code for opened file will register checkpoint/restart hooks, everything
will be fine.

> Alas, they do want to checkpoint that useful program they are running there
> that calculates fibonacci numbers ...
> 
> Or, a nested container that shares something with the parent container,
> so is not checkpointable by itself...
>
> Ok, you probably got the idea.

You should too.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 18:08         ` Oren Laadan
  2009-04-14 18:34           ` Alexey Dobriyan
@ 2009-04-14 20:49           ` Alexey Dobriyan
  2009-04-14 21:11             ` Dave Hansen
  2009-04-14 21:39             ` Serge E. Hallyn
  1 sibling, 2 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 20:49 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, akpm, containers, xemul, serue, mingo, hch,
	torvalds, linux-kernel

> >>> * not having CAP_SYS_ADMIN on restart(2)
> >> Surely you have read already on the containers mailing list that
> >> for the *time being* we attempt to get as far as possible without
> >> requiring root privileges, to identify security hot-spots.
> > 
> > More or less everything is hotspot.
> 
> Going back to my example of a subtree above: what sort of security
> issues you would have here, assuming all restart operations go via
> the usual security checks just like syscalls, and we take special
> care about values we allow as input, e.g. cpu debug registers etc ?

This is like asking if everything goes through correct security checks
how will you screw something up. Everything won't go via usual security
checks.

Just for giggles, writing exploit for localhost hole becomes easier --
you very effectively turn off all randomization kernel does because VMA
boundaries comes from disk.

> BTW, the checks for cpu registers and other values for sanity is
> needed anyway to ensure that the kernel freak out if wrong input
> is fed (maliciously or by mistake).

Out of head, mucking with registers of setuid program is no-no.
As well as mucking with dirty data of setuid program.

Capabilities will come from disk.

Many EPERM checks are suspect.

do_arch_prctl(ARCH_SET_GS) has TASK_SIZE check, do you?

> >> And surely you have read there the observation that for the general
> >> case root privileges will probably be inevitable.
> >>
> >> And surely you don't seriously think that adding this check changes
> >> the "design"...
> > 
> > This is not about design now, this is about if you allow restart(2) for
> > everyone, you open kernel to very very high number of holes of all
> > types. I wrote about it in reply to your code.
> > 
> 
> And there are many examples where this is not the case. The current
> patchset (v14-rc3), for example, to the best of my knowledge, does
> not have such issues (well, need to add checks for validity of regs
> but there is a FIXME comment :)

And such an obvious issue like segmentt RPL is fixed? :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 20:49           ` Alexey Dobriyan
@ 2009-04-14 21:11             ` Dave Hansen
  2009-04-14 21:39             ` Serge E. Hallyn
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2009-04-14 21:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Oren Laadan, xemul, containers, linux-kernel, hch, akpm, torvalds, mingo

On Wed, 2009-04-15 at 00:49 +0400, Alexey Dobriyan wrote:
> > Going back to my example of a subtree above: what sort of security
> > issues you would have here, assuming all restart operations go via
> > the usual security checks just like syscalls, and we take special
> > care about values we allow as input, e.g. cpu debug registers etc ?
> 
> This is like asking if everything goes through correct security checks
> how will you screw something up. Everything won't go via usual security
> checks.
> 
> Just for giggles, writing exploit for localhost hole becomes easier --
> you very effectively turn off all randomization kernel does because VMA
> boundaries comes from disk.

Alexey, I think there's a bit of a misunderstanding here.  We're
proposing that whatever gets done during the restart would not be able
to elevate above the privilege level of the user doing the restart.  For
instance, a user would not be able to restart a suid application -- that
would require privilege.

In the same way, at checkpoint time, we should deny users the ability to
checkpoint things that are privileged.  We need to pay very close
attention to the kernel interfaces that we use to ensure that all the
normal security checks are observed, of course.  

There's no new hole.  A restore operation is just a pieced-together set
of operations that the user would have already been able to perform.

Again, I think this stems from some of us that think that c/r should be
used outside exclusively checkpointing whole containers.  I think we'll
find some expanded use for c/r on things that aren't strict system
containers.

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
  2009-04-14 20:49           ` Alexey Dobriyan
  2009-04-14 21:11             ` Dave Hansen
@ 2009-04-14 21:39             ` Serge E. Hallyn
  2009-04-15 19:21               ` CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style) Alexey Dobriyan
  1 sibling, 1 reply; 37+ messages in thread
From: Serge E. Hallyn @ 2009-04-14 21:39 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Oren Laadan, Dave Hansen, akpm, containers, xemul, mingo, hch,
	torvalds, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > >>> * not having CAP_SYS_ADMIN on restart(2)
> > >> Surely you have read already on the containers mailing list that
> > >> for the *time being* we attempt to get as far as possible without
> > >> requiring root privileges, to identify security hot-spots.
> > > 
> > > More or less everything is hotspot.
> > 
> > Going back to my example of a subtree above: what sort of security
> > issues you would have here, assuming all restart operations go via
> > the usual security checks just like syscalls, and we take special
> > care about values we allow as input, e.g. cpu debug registers etc ?
> 
> This is like asking if everything goes through correct security checks
> how will you screw something up. Everything won't go via usual security
> checks.

(Everything which doesn't go through usual security checks gets special
security checks...)

You are asking an administrator to give users setuid-root programs when
they wouldn't otherwise need it.  So now not only is there potential of
security holes in kernel code which was written under the assumption
that 'hey you have privilege when you run this so are trusted'.  You
also have more people running probably even less-vetted userspace code
with privilege.

> Just for giggles, writing exploit for localhost hole becomes easier --
> you very effectively turn off all randomization kernel does because VMA
> boundaries comes from disk.

Hmm, interesting point.

But again you're going on a naive assumption that if restart requires
privilege, only trusted users will use it.  Rather, in the real world,
I claim that unprivileged users will be given more privilege than they
need.

> > BTW, the checks for cpu registers and other values for sanity is
> > needed anyway to ensure that the kernel freak out if wrong input
> > is fed (maliciously or by mistake).
> 
> Out of head, mucking with registers of setuid program is no-no.
> As well as mucking with dirty data of setuid program.

No argument there, no idea why you bring this up.

> Capabilities will come from disk.

Sure.

> Many EPERM checks are suspect.

No idea what you're saying.

> do_arch_prctl(ARCH_SET_GS) has TASK_SIZE check, do you?

We don't support x86_64 yet :)

> > >> And surely you have read there the observation that for the general
> > >> case root privileges will probably be inevitable.
> > >>
> > >> And surely you don't seriously think that adding this check changes
> > >> the "design"...
> > > 
> > > This is not about design now, this is about if you allow restart(2) for
> > > everyone, you open kernel to very very high number of holes of all
> > > types. I wrote about it in reply to your code.
> > > 
> > 
> > And there are many examples where this is not the case. The current
> > patchset (v14-rc3), for example, to the best of my knowledge, does
> > not have such issues (well, need to add checks for validity of regs
> > but there is a FIXME comment :)
> 
> And such an obvious issue like segmentt RPL is fixed? :-)

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style)
  2009-04-14 21:39             ` Serge E. Hallyn
@ 2009-04-15 19:21               ` Alexey Dobriyan
  2009-04-15 20:22                 ` Serge E. Hallyn
  2009-04-15 20:23                 ` Dave Hansen
  0 siblings, 2 replies; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-15 19:21 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Oren Laadan, Dave Hansen, akpm, containers, xemul, mingo, hch,
	torvalds, linux-kernel

Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style)
  2009-04-15 19:21               ` CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style) Alexey Dobriyan
@ 2009-04-15 20:22                 ` Serge E. Hallyn
  2009-04-15 20:23                 ` Dave Hansen
  1 sibling, 0 replies; 37+ messages in thread
From: Serge E. Hallyn @ 2009-04-15 20:22 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Oren Laadan, Dave Hansen, akpm, containers, xemul, mingo, hch,
	torvalds, linux-kernel

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?

You mean a sysctl to specify whether to require CAP_SYS_ADMIN for
restart(2)?

Yeah I wouldn't object to that - it certainly seems like something
sane for an admin to use depending on their users.

Though I think the bigger fish to fry first is whether we only support
whole-container checkpoint/restart.  If that is the case, then
CAP_SYS_ADMIN will always be needed for restart since it will always
unshare some namespaces.

thanks,
-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style)
  2009-04-15 19:21               ` CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style) Alexey Dobriyan
  2009-04-15 20:22                 ` Serge E. Hallyn
@ 2009-04-15 20:23                 ` Dave Hansen
  2009-04-15 20:39                   ` Serge E. Hallyn
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2009-04-15 20:23 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Serge E. Hallyn, xemul, containers, linux-kernel, hch, akpm,
	torvalds, mingo

On Wed, 2009-04-15 at 23:21 +0400, Alexey Dobriyan wrote:
> Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?

If the point is not to let users even *try* restarting things if it
*might* not work, then I guess this might be reasonable.  

If the goal is to increase security by always requiring CAP_SYS_ADMIN
for "dangerous" operations, I fear it will be harmful.  We may have
people adding features that are not considering the security impact of
what they're doing just because the cases they care about all require
privilege.

What would the goal be?

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style)
  2009-04-15 20:23                 ` Dave Hansen
@ 2009-04-15 20:39                   ` Serge E. Hallyn
  2009-04-15 21:05                     ` CAP_SYS_ADMIN on restart(2) Oren Laadan
  0 siblings, 1 reply; 37+ messages in thread
From: Serge E. Hallyn @ 2009-04-15 20:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexey Dobriyan, xemul, containers, linux-kernel, hch, akpm,
	torvalds, mingo

Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> On Wed, 2009-04-15 at 23:21 +0400, Alexey Dobriyan wrote:
> > Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?
> 
> If the point is not to let users even *try* restarting things if it
> *might* not work, then I guess this might be reasonable.  
> 
> If the goal is to increase security by always requiring CAP_SYS_ADMIN
> for "dangerous" operations, I fear it will be harmful.  We may have
> people adding features that are not considering the security impact of
> what they're doing just because the cases they care about all require
> privilege.

Nah, I disagree.  (Or put another way, that wouldn't be the goal)
There are two administrators we want to satisfy:

1. the one who wants his users to do partial checkpoints, but doesn't
want to risk giving away any privilege at all in the process.  He'll
be satisified by setting restart(2) to not require cap_sys_admin,
and his users just won't be able to do a whole container.  A lot of
users will be happy with that (though no SYSVIPC support, then).

2. the one who may have one or two users who he trusts to do
checkpoint/restart, but otherwise doesn't want even the slightest
risk of other users using restart, and maybe finding an exploit.
That one is probably the more common admin, and he'll be satisified
with requiring CAP_SYS_ADMIN for all restart(2)'s, since he's ok
risking giving extra privilege to the ones he trusts.

And meanwhile, by virtue of leaving (1) supported, we are still
more under the gun to make sure that everything restart(2) does
is properly checked.

> What would the goal be?
> 
> -- Dave

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2)
  2009-04-15 20:39                   ` Serge E. Hallyn
@ 2009-04-15 21:05                     ` Oren Laadan
  2009-04-15 21:16                       ` Serge E. Hallyn
  0 siblings, 1 reply; 37+ messages in thread
From: Oren Laadan @ 2009-04-15 21:05 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, xemul, containers, mingo, linux-kernel, hch, akpm,
	torvalds, Alexey Dobriyan



Serge E. Hallyn wrote:
> Quoting Dave Hansen (dave@linux.vnet.ibm.com):
>> On Wed, 2009-04-15 at 23:21 +0400, Alexey Dobriyan wrote:
>>> Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?
>> If the point is not to let users even *try* restarting things if it
>> *might* not work, then I guess this might be reasonable.  
>>
>> If the goal is to increase security by always requiring CAP_SYS_ADMIN
>> for "dangerous" operations, I fear it will be harmful.  We may have
>> people adding features that are not considering the security impact of
>> what they're doing just because the cases they care about all require
>> privilege.
> 
> Nah, I disagree.  (Or put another way, that wouldn't be the goal)
> There are two administrators we want to satisfy:
> 
> 1. the one who wants his users to do partial checkpoints, but doesn't
> want to risk giving away any privilege at all in the process.  He'll
> be satisified by setting restart(2) to not require cap_sys_admin,
> and his users just won't be able to do a whole container.  A lot of
> users will be happy with that (though no SYSVIPC support, then).

There is also a middle way: use setuid program to allow creation
of a new namespace (under your favorite policy), then drop the
privileges and continue as unprivileged inside that container.

IOW, don't make the initial container-creation a barrier for the
entire operation.

Oren.

> 
> 2. the one who may have one or two users who he trusts to do
> checkpoint/restart, but otherwise doesn't want even the slightest
> risk of other users using restart, and maybe finding an exploit.
> That one is probably the more common admin, and he'll be satisified
> with requiring CAP_SYS_ADMIN for all restart(2)'s, since he's ok
> risking giving extra privilege to the ones he trusts.
> 
> And meanwhile, by virtue of leaving (1) supported, we are still
> more under the gun to make sure that everything restart(2) does
> is properly checked.
> 
>> What would the goal be?
>>
>> -- Dave
> 
> -serge
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2)
  2009-04-15 21:05                     ` CAP_SYS_ADMIN on restart(2) Oren Laadan
@ 2009-04-15 21:16                       ` Serge E. Hallyn
  2009-04-16 15:35                         ` Alexey Dobriyan
  0 siblings, 1 reply; 37+ messages in thread
From: Serge E. Hallyn @ 2009-04-15 21:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, xemul, containers, mingo, linux-kernel, hch, akpm,
	torvalds, Alexey Dobriyan

Quoting Oren Laadan (orenl@cs.columbia.edu):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> >> On Wed, 2009-04-15 at 23:21 +0400, Alexey Dobriyan wrote:
> >>> Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?
> >> If the point is not to let users even *try* restarting things if it
> >> *might* not work, then I guess this might be reasonable.  
> >>
> >> If the goal is to increase security by always requiring CAP_SYS_ADMIN
> >> for "dangerous" operations, I fear it will be harmful.  We may have
> >> people adding features that are not considering the security impact of
> >> what they're doing just because the cases they care about all require
> >> privilege.
> > 
> > Nah, I disagree.  (Or put another way, that wouldn't be the goal)
> > There are two administrators we want to satisfy:
> > 
> > 1. the one who wants his users to do partial checkpoints, but doesn't
> > want to risk giving away any privilege at all in the process.  He'll
> > be satisified by setting restart(2) to not require cap_sys_admin,
> > and his users just won't be able to do a whole container.  A lot of
> > users will be happy with that (though no SYSVIPC support, then).
> 
> There is also a middle way: use setuid program to allow creation
> of a new namespace (under your favorite policy), then drop the
> privileges and continue as unprivileged inside that container.
> 
> IOW, don't make the initial container-creation a barrier for the
> entire operation.

That is still possible here.  But I don't think it's relevant.

What Alexey wants, I believe, is for users to be able to not have
to worry about there being exploitable bugs in restart(2) which
unprivileged users can play with.  And for the usual distro-kernel
reasons, saying use 'CONFIG_CHECKPOINT=n' is not an option.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2)
  2009-04-15 21:16                       ` Serge E. Hallyn
@ 2009-04-16 15:35                         ` Alexey Dobriyan
  2009-04-16 16:29                           ` Serge E. Hallyn
  0 siblings, 1 reply; 37+ messages in thread
From: Alexey Dobriyan @ 2009-04-16 15:35 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Oren Laadan, Dave Hansen, xemul, containers, mingo, linux-kernel,
	hch, akpm, torvalds

On Wed, Apr 15, 2009 at 04:16:09PM -0500, Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@cs.columbia.edu):
> > 
> > 
> > Serge E. Hallyn wrote:
> > > Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> > >> On Wed, 2009-04-15 at 23:21 +0400, Alexey Dobriyan wrote:
> > >>> Is sysctl to control CAP_SYS_ADMIN on restart(2) OK?
> > >> If the point is not to let users even *try* restarting things if it
> > >> *might* not work, then I guess this might be reasonable.  
> > >>
> > >> If the goal is to increase security by always requiring CAP_SYS_ADMIN
> > >> for "dangerous" operations, I fear it will be harmful.  We may have
> > >> people adding features that are not considering the security impact of
> > >> what they're doing just because the cases they care about all require
> > >> privilege.
> > > 
> > > Nah, I disagree.  (Or put another way, that wouldn't be the goal)
> > > There are two administrators we want to satisfy:
> > > 
> > > 1. the one who wants his users to do partial checkpoints, but doesn't
> > > want to risk giving away any privilege at all in the process.  He'll
> > > be satisified by setting restart(2) to not require cap_sys_admin,
> > > and his users just won't be able to do a whole container.  A lot of
> > > users will be happy with that (though no SYSVIPC support, then).
> > 
> > There is also a middle way: use setuid program to allow creation
> > of a new namespace (under your favorite policy), then drop the
> > privileges and continue as unprivileged inside that container.
> > 
> > IOW, don't make the initial container-creation a barrier for the
> > entire operation.
> 
> That is still possible here.  But I don't think it's relevant.
> 
> What Alexey wants, I believe, is for users to be able to not have
> to worry about there being exploitable bugs in restart(2) which
> unprivileged users can play with.  And for the usual distro-kernel
> reasons, saying use 'CONFIG_CHECKPOINT=n' is not an option.

This is correct, yes. If I would be a sysadmin who knows a bit about
kernel internals, I'd never trust restart(2) to get it right.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CAP_SYS_ADMIN on restart(2)
  2009-04-16 15:35                         ` Alexey Dobriyan
@ 2009-04-16 16:29                           ` Serge E. Hallyn
  0 siblings, 0 replies; 37+ messages in thread
From: Serge E. Hallyn @ 2009-04-16 16:29 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Oren Laadan, Dave Hansen, xemul, containers, mingo, linux-kernel,
	hch, akpm, torvalds

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > What Alexey wants, I believe, is for users to be able to not have
> > to worry about there being exploitable bugs in restart(2) which
> > unprivileged users can play with.  And for the usual distro-kernel
> > reasons, saying use 'CONFIG_CHECKPOINT=n' is not an option.
> 
> This is correct, yes. If I would be a sysadmin who knows a bit about
> kernel internals, I'd never trust restart(2) to get it right.

Now I suppose what we could do is define a new CAP_SYS_RESTART
capability and require that.  Then the admin to whom I'm trying
to cater could simply 'capset cap_sys_restart=pe /bin/restart'.
Then all users could use restart without being granted the
extra privilege implied by CAP_SYS_ADMIN.

-serge

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2009-04-16 16:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-10  2:32 [PATCH 00/30] C/R OpenVZ/Virtuozzo style Alexey Dobriyan
2009-04-10  2:44 ` Alexey Dobriyan
2009-04-10  5:07 ` Dave Hansen
2009-04-13  9:14   ` Alexey Dobriyan
2009-04-13 11:16     ` Dave Hansen
2009-04-13 18:07     ` Dave Hansen
2009-04-14  4:26     ` Oren Laadan
2009-04-14 14:58       ` Alexey Dobriyan
2009-04-14 18:08         ` Oren Laadan
2009-04-14 18:34           ` Alexey Dobriyan
2009-04-14 19:31             ` Oren Laadan
2009-04-14 20:08               ` Alexey Dobriyan
2009-04-14 20:49           ` Alexey Dobriyan
2009-04-14 21:11             ` Dave Hansen
2009-04-14 21:39             ` Serge E. Hallyn
2009-04-15 19:21               ` CAP_SYS_ADMIN on restart(2) (was: Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style) Alexey Dobriyan
2009-04-15 20:22                 ` Serge E. Hallyn
2009-04-15 20:23                 ` Dave Hansen
2009-04-15 20:39                   ` Serge E. Hallyn
2009-04-15 21:05                     ` CAP_SYS_ADMIN on restart(2) Oren Laadan
2009-04-15 21:16                       ` Serge E. Hallyn
2009-04-16 15:35                         ` Alexey Dobriyan
2009-04-16 16:29                           ` Serge E. Hallyn
2009-04-10  8:28 ` [PATCH 00/30] C/R OpenVZ/Virtuozzo style Ingo Molnar
2009-04-10 11:45   ` Alexey Dobriyan
2009-04-10 15:06 ` Linus Torvalds
2009-04-13  7:39   ` Alexey Dobriyan
2009-04-13 18:39     ` Linus Torvalds
2009-04-13 19:30       ` Ingo Molnar
2009-04-14 12:29       ` Alexey Dobriyan
2009-04-14 13:44         ` Ingo Molnar
2009-04-14 16:53           ` Alexey Dobriyan
2009-04-14 17:09         ` Linus Torvalds
2009-04-14 17:19           ` Randy Dunlap
2009-04-14 17:32             ` Linus Torvalds
2009-04-14  5:46 ` Oren Laadan
2009-04-14 15:19   ` Alexey Dobriyan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).