* C/R: File substitution at restart
@ 2010-09-08 10:03 Matthieu Fertré
       [not found] ` <4C875F6E.2030004-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Matthieu Fertré @ 2010-09-08 10:03 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Serge E. Hallyn, Nathan Lynch, Louis Rilling, Dan Smith,
	Sukadev Bhattiprolu

Hi,

Here is a proposal for a C/R related feature already developed in
Kerrighed: file substitution at restart.

The goal of this mail is to start a discussion about adding such feature
to Linux cr. Comments are welcome!



What is file substitution?
==========================

It is the ability, when restarting a checkpointed application, to
replace some of its open files with other files.

Only files accessed through an FD can be substituted, not mapped files
(unless they are also reachable through an FD).

The feature ensures 'struct file' sharing as before the checkpoint.
Thus, if a process is for instance sharing the same struct file for
stdin, stdout, and stderr, it is not possible to give a different file
for each at restart.
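
The sharing constraint can be illustrated from userspace. The sketch
below is not Kerrighed code; it only uses standard POSIX calls to show
that descriptors created with dup() refer to one open file description
(one 'struct file' in the kernel) and therefore share a file offset.
The function name is made up for this illustration.

```python
import os
import tempfile


def shared_offset_after_write() -> int:
    """Write through one descriptor, then read the offset via its duplicate."""
    with tempfile.TemporaryFile() as tmp:
        fd = tmp.fileno()
        dup_fd = os.dup(fd)  # refers to the same open file description as fd
        try:
            os.write(fd, b"hello")  # advances the shared offset by 5 bytes
            # The duplicate sees the new offset because the offset lives in
            # the shared open file description, not in the descriptor.
            return os.lseek(dup_fd, 0, os.SEEK_CUR)
        finally:
            os.close(dup_fd)
```

Because both descriptors are views of one object, a restart that keeps
this sharing can only replace that object as a whole.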

Use cases
=========

1) Circumvent Checkpointer limitations:

* Allow restarting an application that has open files not supported by
the checkpoint/restart implementation.

2) Conflicts between existing files and files that should be restored:

* Allow restarting an application when one of its input data files is
not writable by the user and thus cannot be restored/replaced.

* Allow restarting an application when one of its output data files is
already open in another instance of the program.

3) Checkpoint/restart optimization/flexibility:

* Let the application checkpoint and restore files by itself to get
better performance or flexibility.

Example: OpenMPI sockets. This avoids having to handle communication
buffers and ensures a consistent distributed state.

How can users use it?
=====================

(Kerrighed) restart manual page quote:
"-s file_identifier,fd, --substitute-file=file_identifier,fd

This option allows to replace one of the open files of the checkpointed
application by one of the file opened by the process calling the restart
command.

fd is the file descriptor (as given by open (2)) of the calling process
that will be used as a replacement after the restart.

file_identifier is an identifier of one the open files of the
checkpointed application. This identifier is generated at checkpoint
time. It can be retrieved from the file(s) user_info_*.txt that live(s)
in the checkpoint directory. Each line of this file refers to one of the
open files of the checkpointed application. For each open file, we get
the following information:
type|file_identifier|symbolic name|list of pid:fd

This option can be used several times to substitute several files."

Here is a simple (and stupid?) example extracted from Louis's talk at OLS:

$ krgcr-run ping localhost
Running application 6315097
(ping output omitted)
$ checkpoint -i 6315097
$
$ cat /var/chkpt/6315097/v1/user_info_1.txt
socket |0001FFFF880066D68EA8|socket:[219057]|6315097:3
tty    |0001FFFF88007D040EA8|/dev/pts/1|6315097:0,6315097:1,6315097:2

$ # Use current terminal for standard I/O at restart:
$ # ('-t' stands for tty and acts as a wrapper for option '-s')
$ restart -t 6315097 1

$ # Use stdin instead of socket at restart:
$ restart -s 0001FFFF880066D68EA8,0 6315097 1
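
As a rough illustration of how a wrapper script might consume
user_info_*.txt, the sketch below parses one line in the
"type|file_identifier|symbolic name|list of pid:fd" layout quoted above
and builds the matching --substitute-file argument. The names
OpenFileInfo, parse_user_info_line and substitute_option are invented
here, and the parser assumes only what the two sample lines show.

```python
from typing import List, NamedTuple, Tuple


class OpenFileInfo(NamedTuple):
    file_type: str
    file_identifier: str
    symbolic_name: str
    users: List[Tuple[int, int]]  # (pid, fd) pairs sharing this open file


def parse_user_info_line(line: str) -> OpenFileInfo:
    """Parse 'type|file_identifier|symbolic name|pid:fd,pid:fd,...'."""
    # Split the first two fields from the left and the last one from the
    # right, so a '|' inside the symbolic name would not shift the fields.
    file_type, file_identifier, rest = line.rstrip("\n").split("|", 2)
    symbolic_name, users_field = rest.rsplit("|", 1)
    users = []
    for item in users_field.split(","):
        pid, fd = item.split(":")
        users.append((int(pid), int(fd)))
    return OpenFileInfo(file_type.strip(), file_identifier.strip(),
                        symbolic_name.strip(), users)


def substitute_option(info: OpenFileInfo, replacement_fd: int) -> str:
    """Build the long-form option described in the manual page quote."""
    return f"--substitute-file={info.file_identifier},{replacement_fd}"
```

For the socket line in the example, substitute_option(info, 0) yields
the long form of the "-s 0001FFFF880066D68EA8,0" argument used above.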



Thanks,

Matthieu


* Re: C/R: File substitution at restart
       [not found] ` <4C875F6E.2030004-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
@ 2010-09-08 13:09   ` Serge E. Hallyn
       [not found]     ` <20100908130931.GA11161-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2010-09-08 13:09 UTC (permalink / raw)
  To: Matthieu Fertré
  Cc: Serge E. Hallyn,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nathan Lynch, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> Hi,
> 
> Here is a proposal for a C/R related feature already developed in
> Kerrighed: file substitution at restart.
> 
> The goal of this mail is to start a discussion about adding such feature
> to Linux cr. Comments are welcome!

Yup, AFAIK metacluster and zap do this too.  I don't think there is
any question about whether we want to support this, but rather
what the user-kernel API should look like.  Perhaps the easiest
"API" is to have the userspace program rewrite the checkpoint image,
but that probably isn't quite as simple as just substituting #s in
the image, bc we'll have to also find the place where the source of
the original fd was specified and tweak that.

I assume this is one of the things Oren would have 'cradvise()'
do, and at this point that sounds nice to me - might be worth
seeing how the community reacts.  Sentiments on such things change,
after all.

Have there been any other suggestions?

-serge


* Re: C/R: File substitution at restart
       [not found]     ` <20100908130931.GA11161-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2010-09-08 17:56       ` Sukadev Bhattiprolu
       [not found]         ` <20100908175648.GA12281-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2010-09-08 19:35       ` Matt Helsley
  1 sibling, 1 reply; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2010-09-08 17:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge E. Hallyn, Nathan Lynch,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Louis Rilling, Dan Smith

Serge Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
| > Hi,
| > 
| > Here is a proposal for a C/R related feature already developed in
| > Kerrighed: file substitution at restart.
| > 
| > The goal of this mail is to start a discussion about adding such feature
| > to Linux cr. Comments are welcome!
| 
| Yup, AFAIK metacluster and zap do this too.  I don't think there is
| any question about whether we want to support this, but rather
| what the user-kernel API should look like.  Perhaps the easiest
| "API" is to have the userspace program rewrite the checkpoint image,
| but that probably isn't quite as simple as just substituting #s in
| the image, bc we'll have to also find the place where the source of
| the original fd was specified and tweak that.
| 
| I assume this is one of the things Oren would have 'cradvise()'
| do, and at this point that sounds nice to me - might be worth
| seeing how the community reacts.  Sentiments on such things change,
| after all.

Yes, I had the same question about the kernel API. cradvise() would be
one option, but I am not too clear on the details. For each process in
the checkpoint image for which we want to substitute one or more fds,
do we call cradvise() *before* the call to sys_restart()? This would
require the kernel to save these substitution pairs in memory until
the following sys_restart(), right?

Passing in a list of fd-substitution pairs to sys_restart() might be one
option, but would require modifying the sys_restart() API.

Sukadev


* Re: C/R: File substitution at restart
       [not found]     ` <20100908130931.GA11161-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  2010-09-08 17:56       ` Sukadev Bhattiprolu
@ 2010-09-08 19:35       ` Matt Helsley
       [not found]         ` <20100908193531.GB8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Matt Helsley @ 2010-09-08 19:35 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge E. Hallyn, Nathan Lynch,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Louis Rilling, Dan Smith, Sukadev Bhattiprolu

On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > Hi,
> > 
> > Here is a proposal for a C/R related feature already developed in
> > Kerrighed: file substitution at restart.
> > 
> > The goal of this mail is to start a discussion about adding such feature
> > to Linux cr. Comments are welcome!
> 
> Yup, AFAIK metacluster and zap do this too.  I don't think there is
> any question about whether we want to support this, but rather
> what the user-kernel API should look like.  Perhaps the easiest
> "API" is to have the userspace program rewrite the checkpoint image,
> but that probably isn't quite as simple as just substituting #s in
> the image, bc we'll have to also find the place where the source of
> the original fd was specified and tweak that.
> 
> I assume this is one of the things Oren would have 'cradvise()'
> do, and at this point that sounds nice to me - might be worth
> seeing how the community reacts.  Sentiments on such things change,
> after all.
> 
> Have there been any other suggestions?

I think it can be split into two composable pieces which may also be
useful independently.

The first uses the fcntl() interface to add a flag like
O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
restart. That way we don't have to specify an fd number and a "source"
to the kernel. Just tell the kernel to keep the fd. The source can
be opened and dup2'd via userspace. This is useful without the
second piece if we want to simply add rather than replace an fd.
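
The userspace half of this idea, opening the source and placing it on
the wanted descriptor number, needs no new kernel interface. A minimal
sketch follows. The preservation flag itself is only a proposal in this
thread and does not exist, so it is not shown; install_replacement and
its default flags are illustrative choices.

```python
import os


def install_replacement(path: str, target_fd: int,
                        flags: int = os.O_WRONLY | os.O_CREAT | os.O_APPEND) -> None:
    """Open `path` and make it available under the number `target_fd`."""
    src = os.open(path, flags, 0o600)
    if src == target_fd:
        return  # the kernel already handed us the wanted number
    try:
        # dup2() atomically closes whatever target_fd referred to and
        # makes it refer to the same open file description as src.
        os.dup2(src, target_fd)
    finally:
        os.close(src)  # keep only the copy at target_fd
```

After the call, writes to target_fd land in the file at `path`, which is
the state a restart would then have to be told to keep.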

Then a separate interface/tool is needed to ignore/delete
the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
part. It's difficult because depending on the open file the portions of
the image to ignore/delete can vary wildly. For instance, imagine if an
epoll fd was being ignored. It starts much like a generic file but there
is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
delete/ignore this section prior to parsing then it completely breaks
the parsing. In contrast, CKPT_OBJ_* do not break the parsing since
they aren't expected in a strict order -- the parser is capable of
parsing them at any time and the only order constraint on them is that
they appear in the image before they are referenced.
This piece is also useful by itself if we want to ignore/delete an fd
rather than substitute it.
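
The ordering constraint on shared objects can be made concrete with a
small model. The record layout below, (object id, ids it refers to), is
an assumption for illustration and is not the linux-cr image format; it
does not model the strictly ordered non-CKPT_OBJ_* sections such as the
epoll header mentioned above.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical record: (object id, ids of objects this record refers to).
Record = Tuple[int, Sequence[int]]


def first_forward_reference(records: Iterable[Record]) -> Optional[Tuple[int, int]]:
    """Return (object id, missing id) for the first reference to an object
    that has not appeared earlier in the stream, or None if all are fine."""
    seen = set()
    for obj_id, refs in records:
        for ref in refs:
            if ref not in seen:
                return (obj_id, ref)
        seen.add(obj_id)
    return None
```

Under this model, deleting an object record from the stream is safe
only if nothing later refers to it; otherwise the check reports the
first record left with a dangling reference.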

Cheers,
	-Matt Helsley


* Re: C/R: File substitution at restart
       [not found]         ` <20100908175648.GA12281-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-09-08 22:49           ` Serge E. Hallyn
  0 siblings, 0 replies; 10+ messages in thread
From: Serge E. Hallyn @ 2010-09-08 22:49 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Serge E. Hallyn, Nathan Lynch,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Louis Rilling, Dan Smith

Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Serge Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote:
> | Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> | > Hi,
> | > 
> | > Here is a proposal for a C/R related feature already developed in
> | > Kerrighed: file substitution at restart.
> | > 
> | > The goal of this mail is to start a discussion about adding such feature
> | > to Linux cr. Comments are welcome!
> | 
> | Yup, AFAIK metacluster and zap do this too.  I don't think there is
> | any question about whether we want to support this, but rather
> | what the user-kernel API should look like.  Perhaps the easiest
> | "API" is to have the userspace program rewrite the checkpoint image,
> | but that probably isn't quite as simple as just substituting #s in
> | the image, bc we'll have to also find the place where the source of
> | the original fd was specified and tweak that.
> | 
> | I assume this is one of the things Oren would have 'cradvise()'
> | do, and at this point that sounds nice to me - might be worth
> | seeing how the community reacts.  Sentiments on such things change,
> | after all.
> 
> Yes, I had the same question about the kernel API. cradvise() would be
> one option, but am not too clear on the details. For each process in
> the checkpoint image that we want to substitute one or more fds,
> do we call cradvise() *before* the call to sys_restart() ? This would

No, I would rather think that we follow the Kerrighed example,
and specify a checkpoint-wide id for the fd (the objhash id, I
guess).  The first cr_advise() starts to create a restart context,
which finally gets used at sys_restart by the coordinator (and of
course all subsequent tasks).

> require the kernel to save these substitution pairs in memory until
> the following sys_restart() right ?
> 
> Passing in a list of fd-substitution pairs to sys_restart() might be one
> option, but would require modifying the sys_restart() API.
> 
> Sukadev


* Re: C/R: File substitution at restart
       [not found]         ` <20100908193531.GB8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-09-09  1:03           ` Serge E. Hallyn
       [not found]             ` <20100909010352.GA13880-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Serge E. Hallyn @ 2010-09-09  1:03 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Serge E. Hallyn, Nathan Lynch,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Louis Rilling, Dan Smith, Sukadev Bhattiprolu

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > > Hi,
> > > 
> > > Here is a proposal for a C/R related feature already developed in
> > > Kerrighed: file substitution at restart.
> > > 
> > > The goal of this mail is to start a discussion about adding such feature
> > > to Linux cr. Comments are welcome!
> > 
> > Yup, AFAIK metacluster and zap do this too.  I don't think there is
> > any question about whether we want to support this, but rather
> > what the user-kernel API should look like.  Perhaps the easiest
> > "API" is to have the userspace program rewrite the checkpoint image,
> > but that probably isn't quite as simple as just substituting #s in
> > the image, bc we'll have to also find the place where the source of
> > the original fd was specified and tweak that.
> > 
> > I assume this is one of the things Oren would have 'cradvise()'
> > do, and at this point that sounds nice to me - might be worth
> > seeing how the community reacts.  Sentiments on such things change,
> > after all.
> > 
> > Have there been any other suggestions?
> 
> I think it can be split into two composable pieces which may also be
> useful independently.
> 
> The first uses the fcntl() interface to add a flag like
> O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> restart. That way we don't have to specify an fd number and a "source"
> to the kernel. Just tell the kernel to keep the fd. The source can
> be opened and dup2'd via userspace. This is useful without the
> second piece if we want to simply add rather than replace an fd.

Can you think of any other use for this flag other than restart?
If so, then having a fcntl flag (and later madvise) makes sense.
But if we're going to add options to various different APIs which
really are all only useful for c/r, then maybe a single new cr_advise()
really does make sense.  The alternative may be more popular at first
but would IMO turn into a disaster.

> Then a separate interface/tool is needed to ignore/delete
> the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
> part. It's difficult because depending on the open file the portions of
> the image to ignore/delete can vary wildly. For instance, imagine if an
> epoll fd was being ignored. It starts much like a generic file but there
> is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
> delete/ignore this section prior to parsing then it completely breaks
> the parsing.

Yup, that is precisely what stopped me when I tried to do this 6 months
or so ago just for stdin/stdout/stderr.

> In contrast, CKPT_OBJ_* do not break the parsing since
> they aren't expected in a strict order -- the parser is capable of
> parsing them at any time and the only order constraint on them is that
> they appear in the image before they are referenced.
> This piece is also useful by itself if we want to ignore/delete an fd
> rather than substitute it.

Are you working on any of this?


* Re: C/R: File substitution at restart
       [not found]             ` <20100909010352.GA13880-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2010-09-09  4:06               ` Matt Helsley
       [not found]                 ` <20100909040635.GE8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Matt Helsley @ 2010-09-09  4:06 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Serge E. Hallyn, Nathan Lynch,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matthieu Fertré,
	Louis Rilling, Dan Smith, Sukadev Bhattiprolu

On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > > > Hi,
> > > > 
> > > > Here is a proposal for a C/R related feature already developed in
> > > > Kerrighed: file substitution at restart.
> > > > 
> > > > The goal of this mail is to start a discussion about adding such feature
> > > > to Linux cr. Comments are welcome!
> > > 
> > > Yup, AFAIK metacluster and zap do this too.  I don't think there is
> > > any question about whether we want to support this, but rather
> > > what the user-kernel API should look like.  Perhaps the easiest
> > > "API" is to have the userspace program rewrite the checkpoint image,
> > > but that probably isn't quite as simple as just substituting #s in
> > > the image, bc we'll have to also find the place where the source of
> > > the original fd was specified and tweak that.

If the object to be replaced is specified by obj id then we could
simply rewrite the obj id to refer to the replacement. The original
object could still be in the image. It would get parsed but the only
reference to it would be in the objhash. When the objhash is
cleaned up the original object would be too.

(more below)

> > > 
> > > I assume this is one of the things Oren would have 'cradvise()'
> > > do, and at this point that sounds nice to me - might be worth
> > > seeing how the community reacts.  Sentiments on such things change,
> > > after all.
> > > 
> > > Have there been any other suggestions?
> > 
> > I think it can be split into two composable pieces which may also be
> > useful independently.
> > 
> > The first uses the fcntl() interface to add a flag like
> > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > restart. That way we don't have to specify an fd number and a "source"
> > to the kernel. Just tell the kernel to keep the fd. The source can
> > be opened and dup2'd via userspace. This is useful without the
> > second piece if we want to simply add rather than replace an fd.
> 
> Can you think of any other use for this flag other than restart?

<joking>
I can't think of any other uses for O_CLOEXEC.
</joking>

Seriously though, restart will be used _much_ less often than exec so yes
it does seem like a waste of a valuable bit and something that wouldn't
quite belong in an fcntl interface.

However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
Right now restart closes all file descriptors and pays absolutely
no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
want to keep we do not mark with O_CLOEXEC.
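
A sketch of what the restart tool's side of this could look like,
assuming the proposal were adopted. Note that the per-descriptor flag
is FD_CLOEXEC, read and written with fcntl(F_GETFD/F_SETFD); O_CLOEXEC
is the open(2) flag that sets it at open time. Whether restart would
honor the flag is only suggested in this thread, and the function names
are invented here.

```python
import fcntl
from typing import Iterable, Set


def set_cloexec(fd: int, enabled: bool) -> None:
    """Set or clear the close-on-exec flag of one descriptor."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def mark_unwanted_cloexec(candidates: Iterable[int], keep: Set[int]) -> None:
    """Mark every candidate descriptor close-on-exec except those in `keep`.

    Under the proposal, the unmarked descriptors are the ones a restart
    would preserve. The caller supplies the candidate list, for example
    from /proc/self/fd on Linux.
    """
    for fd in candidates:
        set_cloexec(fd, fd not in keep)
```

The tool would call mark_unwanted_cloexec() just before invoking
restart, with `keep` holding the descriptors opened as replacements.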


Here's another idea which I haven't fully thought out yet.

We could introduce the concept of object id substitutions in the image.
So the image would look like (going from file pos 0 at the top..):

0 +-------------------------------+
  |                               |
                .....
  +-------------------------------+
  |     <substitute object>       | <--- object with id == <substitute id>
                .....
  +---------------+---------------+
  |  <object id>  |<substitute id>|
  +---------------+---------------+
                .....
  +---------------+---------------+
  |     <object to ignore>        | <-- object with id == <object id>
                .....

(The above is ignoring the ckpt_hdr fields..)

When we read the image during restart we use the substitute ids to
create indirect objhash entries. When we encounter an obj id and
it refers to an indirect entry we first parse the object (ignoring
errors and dropping references on new objhash insertions), flip
a bit on the indirect entry (indicating the object has been parsed),
and then lookup the substitute id and return whatever that resolved to.

We can ignore the new objhash objects by making the objhash have its
own operation struct. When we're parsing an object that's been
substituted we just temporarily set the objhash add/lookup operations
to something suitable for properly dropping references to the new
object(s). This way we don't have to add checks for this peculiar
need all over the checkpoint/restart code. Sure it'll be slower...
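
To make the indirection easier to discuss, here is a toy model of it.
It is a simplification under stated assumptions, not the linux-cr
objhash: objects are plain Python values, "dropping references" is
modelled by discarding the value, and the class and method names are
invented for this sketch.

```python
class ObjHash:
    """Toy object hash with indirect (substituted) entries."""

    def __init__(self):
        self._objects = {}   # obj id -> restored object
        self._indirect = {}  # obj id -> [substitute id, parsed flag]

    def add_substitution(self, obj_id, substitute_id):
        """Record an <object id, substitute id> pair read from the image."""
        self._indirect[obj_id] = [substitute_id, False]

    def insert(self, obj_id, obj):
        """Called once an object from the image has been parsed."""
        if obj_id in self._indirect:
            # The object being ignored: remember that it was parsed,
            # then drop it instead of keeping a reference.
            self._indirect[obj_id][1] = True
            return
        self._objects[obj_id] = obj

    def was_parsed(self, obj_id):
        """True once the ignored original of a substitution was parsed."""
        return self._indirect[obj_id][1]

    def lookup(self, obj_id):
        """Resolve an id, following a substitution if one is recorded."""
        entry = self._indirect.get(obj_id)
        if entry is not None:
            obj_id = entry[0]
        return self._objects[obj_id]
```

The model follows the layout in the diagram: the substitute object is
inserted first, then the pair is recorded, and the original is parsed
and discarded when it is reached.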

I can think of a few problems with that already. If the substituted
obj differs wildly in file type then any defer queue entries that use
obj ids to complete the deferred work would fail miserably...

That said, so far I've never heard folks discuss substituting anything
but fds. Perhaps enabling substitution at the objhash level is just too
broad and we'd be better off only allowing fd substitutions?

> If so, then having a fcntl flag (and later madvise) makes sense.
> But if we're going to add options to various different APIs which
> really are all only useful for c/r, then maybe a single new cr_advise()
> really does make sense.  The alternative may be more popular at first
> but would IMO turn into a disaster.

Good point.

> > Then a separate interface/tool is needed to ignore/delete
> > the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
> > part. It's difficult because depending on the open file the portions of
> > the image to ignore/delete can vary wildly. For instance, imagine if an
> > epoll fd was being ignored. It starts much like a generic file but there
> > is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
> > delete/ignore this section prior to parsing then it completely breaks
> > the parsing.
> 
> Yup, that is precisely what stopped me when I tried to do this 6 months
> or so ago just for stdin/stdout/stderr.
> 
> > In contrast, CKPT_OBJ_* do not break the parsing since
> > they aren't expected in a strict order -- the parser is capable of
> > parsing them at any time and the only order constraint on them is that
> > they appear in the image before they are referenced.
> > This piece is also useful by itself if we want to ignore/delete an fd
> > rather than substitute it.
> 
> Are you working on any of this?

Not really. I wrote a quick patch to introduce a new fcntl flag I called
O_NOCLOREST but ran into the same problem with parsing.

Cheers,
	-Matt


* Re: C/R: File substitution at restart
       [not found]                 ` <20100909040635.GE8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-09-09 10:37                   ` Louis Rilling
       [not found]                     ` <20100909103720.GF4812-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Louis Rilling @ 2010-09-09 10:37 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Serge E. Hallyn, Matthieu Fertré,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nathan Lynch, Dan Smith, Sukadev Bhattiprolu



On 08/09/10 21:06 -0700, Matt Helsley wrote:
> On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > I think it can be split into two composable pieces which may also be
> > > useful independently.
> > > 
> > > The first uses the fcntl() interface to add a flag like
> > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > restart. That way we don't have to specify an fd number and a "source"
> > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > be opened and dup2'd via userspace. This is useful without the
> > > second piece if we want to simply add rather than replace an fd.
> > 
> > Can you think of any other use for this flag other than restart?
> 
> <joking>
> I can't think of any other uses for O_CLOEXEC.
> </joking>
> 
> Seriously though, restart will be used _much_ less often than exec so yes
> it does seem like a waste of a valuable bit and something that wouldn't
> quite belong in an fcntl interface.
> 
> However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> Right now restart closes all file descriptors and pays absolutely
> no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> want to keep we do not mark with O_CLOEXEC.

This would also be useful at checkpoint, to tell sys_checkpoint() which fds
should be ignored, either because they are not supported or because the
application has a better way to deal with them.

> 
> 
> Here's another idea which I haven't fully thought out yet.
> 
> We could introduce the concept of object id substitutions in the image.
> So the image would look like (going from file pos 0 at the top..):
> 
> 0 +-------------------------------+
>   |                               |
>                 .....
>   +-------------------------------+
>   |     <substitute object>       | <--- object with id == <substitute id>
>                 .....
>   +---------------+---------------+
>   |  <object id>  |<substitute id>|
>   +---------------+---------------+
>                 .....
>   +---------------+---------------+
>   |     <object to ignore>        | <-- object with id == <object id>
>                 .....
> 
> (The above is ignoring the ckpt_hdr fields..)
> 
> When we read the image during restart we use the substitute ids to
> create indirect objhash entries. When we encounter an obj id and
> it refers to an indirect entry we first parse the object (ignoring
> errors and dropping references on new objhash insertions), flip
> a bit on the indirect entry (indicating the object has been parsed),
> and then lookup the substitute id and return whatever that resolved to.
> 
> We can ignore the new objhash objects by making the objhash have its
> own operation struct. When we're parsing an object that's been
> substituted we just temporarily set the objhash add/lookup operations
> to something suitable for properly dropping references to the new
> object(s). This way we don't have to add checks for this peculiar
> need all over the checkpoint/restart code. Sure it'll be slower...

If at checkpoint we can take care to ignore files that we know will be
substituted, this should not be much slower.

> 
> I can think of a few problems with that already. If the substituted
> obj differs wildly in file type then any defer queue entries that use
> obj ids to complete the deferred work would fail miserably...

The problem I see with rewriting the image is that this may impose additional
I/O, for instance to duplicate the image before rewriting, or if it is simply
rewritten to disk. In contrast, having an easily parsable table of fds at the
beginning of the image, with associated object ids (and preferably more info
like file type, path, owner rights, etc.) makes it easy and lightweight to
build a separate substitution table, that we could feed sys_restart() with
(maybe only the coordinator could feed sys_restart() with such a table).
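
A sketch of how such a substitution table might be built in userspace,
assuming the image exposed a table of (pid, fd, object id) entries. That
table layout, the function name, and the shape of the result are
assumptions for illustration; no sys_restart() argument of this kind
exists.

```python
from typing import Dict, Iterable, Tuple


def build_substitution_table(
    fd_table: Iterable[Tuple[int, int, int]],   # (pid, fd, object id)
    requests: Dict[Tuple[int, int], int],       # (pid, fd) -> replacement fd
) -> Dict[int, int]:
    """Translate per-(pid, fd) requests into object id -> replacement fd."""
    obj_of_slot = {(pid, fd): obj_id for pid, fd, obj_id in fd_table}
    table: Dict[int, int] = {}
    for slot, replacement_fd in requests.items():
        if slot not in obj_of_slot:
            raise KeyError(f"no such pid:fd in the image: {slot}")
        obj_id = obj_of_slot[slot]
        # Descriptors that shared one open file before the checkpoint
        # must keep sharing, so they cannot get different replacements.
        if table.setdefault(obj_id, replacement_fd) != replacement_fd:
            raise ValueError(f"conflicting replacements for object {obj_id}")
    return table
```

Keying the result by object id rather than by fd number keeps the table
small and enforces the sharing rule in one place.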

> 
> That said, so far I've never heard folks discuss substituting anything
> but fds. Perhaps enabling substitution at the objhash level is just too
> broad and we'd be better off only allowing fd substitutions?

Well... A while ago I asked about substituting SYSV IPC objects (I was talking
about SHM at that moment, but semaphore sets or message queues are even easier
to substitute). In such a scenario a pipeline of video transcoding would use
SYSV SHMs to store transitional frames between the stages of the pipeline, and
the SHMs themselves would not need to be checkpointed, or could be checkpointed
at a lower frequency than the processes.

Substituting memory mapped files (for instance POSIX SHMs) would be useful too.

> 
> > If so, then having a fcntl flag (and later madvise) makes sense.
> > But if we're going to add options to various different APIs which
> > really are all only useful for c/r, then maybe a single new cr_advise()
> > really does make sense.  The alternative may be more popular at first
> > but would IMO turn into a disaster.
> 
> Good point.

cr_advise() or changing sys_checkpoint() and sys_restart() are both fine with me.

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


* Re: C/R: File substitution at restart
       [not found]                     ` <20100909103720.GF4812-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
@ 2010-09-09 11:02                       ` Matt Helsley
       [not found]                         ` <20100909110220.GF8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Matt Helsley @ 2010-09-09 11:02 UTC (permalink / raw)
  To: Matt Helsley, Serge E. Hallyn, Serge E. Hallyn, Nathan Lynch

On Thu, Sep 09, 2010 at 12:37:20PM +0200, Louis Rilling wrote:
> On 08/09/10 21:06 -0700, Matt Helsley wrote:
> > On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > > I think it can be split into two composable pieces which may also be
> > > > useful independently.
> > > > 
> > > > The first uses the fcntl() interface to add a flag like
> > > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > > restart. That way we don't have to specify an fd number and a "source"
> > > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > > be opened and dup2'd via userspace. This is useful without the
> > > > second piece if we want to simply add rather than replace an fd.
> > > 
> > > Can you think of any other use for this flag other than restart?
> > 
> > <joking>
> > I can't think of any other uses for O_CLOEXEC.
> > </joking>
> > 
> > Seriously though, restart will be used _much_ less often than exec so yes
> > it does seem like a waste of a valuable bit and something that wouldn't
> > quite belong in an fcntl interface.
> > 
> > However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> > Right now restart closes all file descriptors and pays absolutely
> > no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> > too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> > want to keep we do not mark with O_CLOEXEC.
> 
> This would also be useful at checkpoint, to tell sys_checkpoint() which fds
> should be ignored, either because they are not supported or because the
> application has a better way to deal with them.

True. Though unlike restart, I don't think we can just (ab|re)use O_CLOEXEC
for that purpose.
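
For reference, the userspace half of the restart-side idea quoted above could
look roughly like the sketch below. It is only an illustration under stated
assumptions: install_substitute() and mark_unwanted_fds() are hypothetical
helper names, not part of user-cr, and the "restart keeps fds that are not
close-on-exec" behaviour is the proposal, not something the kernel does today.
Note that the per-fd flag set through fcntl(F_SETFD) is FD_CLOEXEC; O_CLOEXEC
is the open() flag that requests it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Set or clear the close-on-exec flag on fd. Returns 0 or -1 (errno set). */
int set_cloexec(int fd, int on)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
	return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: open the substitute "source" in userspace and
 * dup2() it onto the fd number the restarted task expects.
 */
int install_substitute(const char *path, int open_flags, int target_fd)
{
	int fd = open(path, open_flags, 0600);

	if (fd < 0)
		return -1;
	if (fd != target_fd) {
		int ret = dup2(fd, target_fd);

		close(fd);
		if (ret < 0)
			return -1;
	}
	/* Make sure the fd we want to keep is not marked close-on-exec. */
	return set_cloexec(target_fd, 0);
}

/*
 * Hypothetical helper: mark every open fd below max_fd as close-on-exec,
 * except the ones listed in keep[], which are explicitly left unmarked.
 */
int mark_unwanted_fds(int max_fd, const int *keep, int nkeep)
{
	for (int fd = 0; fd < max_fd; fd++) {
		int wanted = 0;

		for (int i = 0; i < nkeep; i++)
			if (keep[i] == fd)
				wanted = 1;
		/* EBADF only means this fd number is not open; skip it. */
		if (set_cloexec(fd, !wanted) < 0 && errno != EBADF)
			return -1;
	}
	return 0;
}
```

A restart tool would call install_substitute() for each fd it wants to hand
over, then mark_unwanted_fds() once, right before invoking restart.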

> 
> > 
> > 
> > Here's another idea which I haven't fully thought out yet.
> > 
> > We could introduce the concept of object id substitutions in the image.
> > So the image would look like (going from file pos 0 at the top..):
> > 
> > 0 +-------------------------------+
> >   |                               |
> >                 .....
> >   +-------------------------------+
> >   |     <substitute object>       | <--- object with id == <substitute id>
> >                 .....
> >   +---------------+---------------+
> >   |  <object id>  |<substitute id>|
> >   +---------------+---------------+
> >                 .....
> >   +---------------+---------------+
> >   |     <object to ignore>        | <-- object with id == <object id>
> >                 .....
> > 
> > (The above is ignoring the ckpt_hdr fields..)
> > 
> > When we read the image during restart we use the substitute ids to
> > create indirect objhash entries. When we encounter an obj id and
> > it refers to an indirect entry we first parse the object (ignoring
> > errors and dropping references on new objhash insertions), flip
> > a bit on the indirect entry (indicating the object has been parsed),
> > and then lookup the substitute id and return whatever that resolved to.
> > 
> > We can ignore the new objhash objects by making the objhash have its
> > own operation struct. When we're parsing an object that's been
> > substituted we just temporarily set the objhash add/lookup operations
> > to something suitable for properly dropping references to the new
> > object(s). This way we don't have to add checks for this peculiar
> > need all over the checkpoint/restart code. Sure it'll be slower...
> 
> If at checkpoint we can take care to ignore files that we know will be
> substituted, this should not be that much slower.

So, would you say typically it's the application developer who knows
what to ignore? Are we expecting distros/packagers to be able to set
that up? Admins? These specific optimizations seem like they would be a
bit fragile unless the application developer is involved.
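
To make the indirect-entry lookup from the quoted proposal a bit more
concrete, here is a userspace toy model of it. Everything in it is an
assumption made for illustration: the toy_* names, the fixed-size table and
the "discarded" counter are hypothetical and do not mirror the real objhash
code. Parsing and dropping the ignored object is reduced to bumping a counter.

```c
#include <stddef.h>

#define TOY_MAX_OBJS 16

/* Hypothetical stand-in for one objhash entry, indexed by object id. */
struct toy_entry {
	int in_use;
	int indirect;      /* 1 if lookups are redirected to substitute_id */
	int parsed;        /* set once the ignored object has been consumed */
	int substitute_id; /* meaningful only when indirect is set */
	void *obj;         /* meaningful only when indirect is clear */
};

struct toy_objhash {
	struct toy_entry e[TOY_MAX_OBJS];
	int discarded;     /* how many ignored objects were parsed and dropped */
};

/* Insert a regular object under id. Returns 0, or -1 if id is unusable. */
int toy_add(struct toy_objhash *h, int id, void *obj)
{
	if (id < 0 || id >= TOY_MAX_OBJS || h->e[id].in_use)
		return -1;
	h->e[id].in_use = 1;
	h->e[id].obj = obj;
	return 0;
}

/* Record an <object id, substitute id> pair as an indirect entry. */
int toy_add_indirect(struct toy_objhash *h, int id, int substitute_id)
{
	if (toy_add(h, id, NULL) < 0)
		return -1;
	h->e[id].indirect = 1;
	h->e[id].substitute_id = substitute_id;
	return 0;
}

/* Resolve id, following substitutions. Returns NULL if unresolved. */
void *toy_lookup(struct toy_objhash *h, int id)
{
	/* The hop limit guards against a cycle of substitutions. */
	for (int hops = 0; hops <= TOY_MAX_OBJS; hops++) {
		struct toy_entry *ent;

		if (id < 0 || id >= TOY_MAX_OBJS || !h->e[id].in_use)
			return NULL;
		ent = &h->e[id];
		if (!ent->indirect)
			return ent->obj;
		if (!ent->parsed) {
			/* Stand-in for parsing the ignored object once and
			 * dropping whatever references that created. */
			h->discarded++;
			ent->parsed = 1;
		}
		id = ent->substitute_id;
	}
	return NULL;
}
```

The point of the parsed bit shows up on the second lookup of the same id: the
ignored object is not consumed again, only the redirection is followed.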

Cheers,
	-Matt Helsley

* Re: C/R: File substitution at restart
       [not found]                         ` <20100909110220.GF8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-09-09 11:34                           ` Louis Rilling
  0 siblings, 0 replies; 10+ messages in thread
From: Louis Rilling @ 2010-09-09 11:34 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Serge E. Hallyn, Matthieu Fertré,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nathan Lynch, Dan Smith, Sukadev Bhattiprolu


[-- Attachment #1.1: Type: text/plain, Size: 5239 bytes --]

On 09/09/10  4:02 -0700, Matt Helsley wrote:
> On Thu, Sep 09, 2010 at 12:37:20PM +0200, Louis Rilling wrote:
> > On 08/09/10 21:06 -0700, Matt Helsley wrote:
> > > On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > > > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > > > I think it can be split into two composable pieces which may also be
> > > > > useful independently.
> > > > > 
> > > > > The first uses the fcntl() interface to add a flag like
> > > > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > > > restart. That way we don't have to specify an fd number and a "source"
> > > > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > > > be opened and dup2'd via userspace. This is useful without the
> > > > > second piece if we want to simply add rather than replace an fd.
> > > > 
> > > > Can you think of any other use for this flag other than restart?
> > > 
> > > <joking>
> > > I can't think of any other uses for O_CLOEXEC.
> > > </joking>
> > > 
> > > Seriously though, restart will be used _much_ less often than exec so yes
> > > it does seem like a waste of a valuable bit and something that wouldn't
> > > quite belong in an fcntl interface.
> > > 
> > > However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> > > Right now restart closes all file descriptors and pays absolutely
> > > no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> > > too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> > > want to keep we do not mark with O_CLOEXEC.
> > 
> > > This would also be useful at checkpoint, to tell sys_checkpoint() which fds
> > > should be ignored, either because they are not supported or because the
> > > application has a better way to deal with them.
> 
> > True. Though unlike restart, I don't think we can just (ab|re)use O_CLOEXEC
> > for that purpose.
> 
> > 
> > > 
> > > 
> > > Here's another idea which I haven't fully thought out yet.
> > > 
> > > We could introduce the concept of object id substitutions in the image.
> > > So the image would look like (going from file pos 0 at the top..):
> > > 
> > > 0 +-------------------------------+
> > >   |                               |
> > >                 .....
> > >   +-------------------------------+
> > >   |     <substitute object>       | <--- object with id == <substitute id>
> > >                 .....
> > >   +---------------+---------------+
> > >   |  <object id>  |<substitute id>|
> > >   +---------------+---------------+
> > >                 .....
> > >   +---------------+---------------+
> > >   |     <object to ignore>        | <-- object with id == <object id>
> > >                 .....
> > > 
> > > (The above is ignoring the ckpt_hdr fields..)
> > > 
> > > When we read the image during restart we use the substitute ids to
> > > create indirect objhash entries. When we encounter an obj id and
> > > it refers to an indirect entry we first parse the object (ignoring
> > > errors and dropping references on new objhash insertions), flip
> > > a bit on the indirect entry (indicating the object has been parsed),
> > > and then lookup the substitute id and return whatever that resolved to.
> > > 
> > > We can ignore the new objhash objects by making the objhash have its
> > > own operation struct. When we're parsing an object that's been
> > > substituted we just temporarily set the objhash add/lookup operations
> > > to something suitable for properly dropping references to the new
> > > object(s). This way we don't have to add checks for this peculiar
> > > need all over the checkpoint/restart code. Sure it'll be slower...
> > 
> > > If at checkpoint we can take care to ignore files that we know will be
> > > substituted, this should not be that much slower.
> 
> So, would you say typically it's the application developer who knows
> what to ignore? Are we expecting distros/packagers to be able to set
> that up? Admins? These specific optimizations seem like they would be a
> bit fragile unless the application developer is involved.

If you look at OpenMPI's C/R framework, the policy to ignore/substitute fds
(mostly sockets and log files) is programmed in the C/R plugin. In that case,
the middleware knows.

Otherwise, for such optimization cases I expect applications to come with
dedicated C/R helpers. So, yes, the application developer is involved.

However, in some special cases like stdio redirection of containers,
administrators should be able to do it, or even users. Imagine a user feeding
some file to the checkpointed app, and wanting the app to work on a different
file at restart.  With a parsing tool and enough info in the fd table of the
checkpoint image, the user could easily know which fd should be substituted
because its path matches the file that was fed to the app.
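
As a rough sketch of what such a tool would do once the fd table has been
parsed out of the image: the record layout and the function name below are
hypothetical, since the amount of information actually stored in the image's
fd table is exactly what is under discussion.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fd-table record, as a parsing tool might expose it. */
struct fd_record {
	int fd;
	const char *path; /* NULL when the image holds no path for this fd */
};

/*
 * Collect the fds whose recorded path equals `path`. At most `max` fd
 * numbers are stored in out[]; the return value is the total number of
 * matches, which may be larger than `max`.
 */
size_t find_fds_by_path(const struct fd_record *table, size_t n,
			const char *path, int *out, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (table[i].path && strcmp(table[i].path, path) == 0) {
			if (found < max)
				out[found] = table[i].fd;
			found++;
		}
	}
	return found;
}
```

Several fds can match the same path. Whether they can then be given different
substitutes depends on whether they shared one struct file at checkpoint,
which a path comparison alone cannot tell.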

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes


end of thread, other threads:[~2010-09-09 11:34 UTC | newest]

Thread overview: 10+ messages
2010-09-08 10:03 C/R: File substitution at restart Matthieu Fertré
     [not found] ` <4C875F6E.2030004-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org>
2010-09-08 13:09   ` Serge E. Hallyn
     [not found]     ` <20100908130931.GA11161-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
2010-09-08 17:56       ` Sukadev Bhattiprolu
     [not found]         ` <20100908175648.GA12281-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-09-08 22:49           ` Serge E. Hallyn
2010-09-08 19:35       ` Matt Helsley
     [not found]         ` <20100908193531.GB8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-09-09  1:03           ` Serge E. Hallyn
     [not found]             ` <20100909010352.GA13880-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
2010-09-09  4:06               ` Matt Helsley
     [not found]                 ` <20100909040635.GE8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-09-09 10:37                   ` Louis Rilling
     [not found]                     ` <20100909103720.GF4812-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>
2010-09-09 11:02                       ` Matt Helsley
     [not found]                         ` <20100909110220.GF8957-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-09-09 11:34                           ` Louis Rilling
