* C/R: File substitution at restart
From: Matthieu Fertré @ 2010-09-08 10:03 UTC
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: Serge E. Hallyn, Nathan Lynch, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

Hi,

Here is a proposal for a C/R related feature already developed in
Kerrighed: file substitution at restart.

The goal of this mail is to start a discussion about adding such a feature
to Linux cr. Comments are welcome!

What is file substitution?
==========================

It is the ability, at restart of a checkpointed application, to substitute
some of the open files with other files. Only files accessed through an FD
can be substituted, not mapped files (unless they are also reachable
through an FD at the same time).

The feature preserves 'struct file' sharing as it was before the
checkpoint. Thus, if a process shares the same struct file for stdin,
stdout, and stderr, for instance, it is not possible to give a different
file for each at restart.

Use cases
=========

1) Circumvent checkpointer limitations:
   * Allow restarting an application that has some files not supported by
     the checkpoint/restart implementation.

2) Conflicts between existing files and files that should be restored:
   * Allow restarting an application one of whose input data files is not
     writable by the user and thus cannot be restored/replaced.
   * Allow restarting an application one of whose output data files is
     already open in another instance of the program.

3) Checkpoint/restart optimization/flexibility:
   * Let the application checkpoint and restore files by itself to get
     better performance or flexibility. Example: OpenMPI sockets. This
     avoids having to handle communication buffers and ensure a consistent
     distributed state.

How can users use it?
=====================

(Kerrighed) restart manual page quote:

"-s file_identifier,fd, --substitute-file=file_identifier,fd

  This option allows to replace one of the open files of the checkpointed
  application by one of the file opened by the process calling the restart
  command.

  fd is the file descriptor (as given by open (2)) of the calling process
  that will be used as a replacement after the restart.

  file_identifier is an identifier of one the open files of the
  checkpointed application. This identifier is generated at checkpoint
  time. It can be retrieved from the file(s) user_info_*.txt that live(s)
  in the checkpoint directory. Each line of this file refers to one of the
  open files of the checkpointed application. For each open file, we get
  the following information:

  type|file_identifier|symbolic name|list of pid:fd

  This option can be used several times to substitute several files."

Here is a simple (and stupid?) example extracted from Louis's talk at OLS:

$ krgcr-run ping localhost
Running application 6315097
(ping output omitted)
$ checkpoint -i 6315097
$
$ cat /var/chkpt/6315097/v1/user_info_1.txt
socket |0001FFFF880066D68EA8|socket:[219057]|6315097:3
tty |0001FFFF88007D040EA8|/dev/pts/1|6315097:0,6315097:1,6315097:2
$ # Use current terminal for standard I/O at restart:
$ # ('-t' stands for tty and acts as a wrapper for option '-s')
$ restart -t 6315097 1
$ # Use stdin instead of socket at restart:
$ restart -s 0001FFFF880066D68EA8,0 6315097 1

Thanks,

Matthieu
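The user_info_*.txt format and the '-s' argument quoted above can be read
mechanically. The following is a minimal Python sketch, not Kerrighed code:
the names OpenFile, parse_user_info_line and parse_substitution are made up
for illustration, and the parsing assumes the symbolic name never contains a
'|' character. It also shows why the three tty descriptors can only be
substituted together: they appear on a single line, i.e. they share one
open file.

```python
from typing import NamedTuple


class OpenFile(NamedTuple):
    file_type: str
    file_id: str
    name: str
    users: list  # (pid, fd) pairs that share this open file


def parse_user_info_line(line: str) -> OpenFile:
    """Parse one 'type|file_identifier|symbolic name|list of pid:fd' line."""
    file_type, file_id, name, users = (f.strip() for f in line.split("|"))
    pairs = []
    for item in users.split(","):
        pid, fd = item.split(":")
        pairs.append((int(pid), int(fd)))
    return OpenFile(file_type, file_id, name, pairs)


def parse_substitution(arg: str) -> tuple:
    """Parse the value given to '-s', i.e. 'file_identifier,fd'."""
    file_id, fd = arg.split(",")
    return file_id, int(fd)


# The two lines shown in the example session above.
lines = [
    "socket |0001FFFF880066D68EA8|socket:[219057]|6315097:3",
    "tty |0001FFFF88007D040EA8|/dev/pts/1|6315097:0,6315097:1,6315097:2",
]
files = {f.file_id: f for f in map(parse_user_info_line, lines)}

# 'restart -s 0001FFFF880066D68EA8,0 ...' replaces the socket with fd 0.
file_id, new_fd = parse_substitution("0001FFFF880066D68EA8,0")
print(files[file_id].name, files[file_id].users, "->", new_fd)
```

Keying the table by file identifier rather than by pid:fd mirrors the rule
stated in the mail that 'struct file' sharing is preserved at restart.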
* Re: C/R: File substitution at restart
From: Serge E. Hallyn @ 2010-09-08 13:09 UTC
To: Matthieu Fertré
Cc: Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Nathan Lynch, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> Hi,
>
> Here is a proposal for a C/R related feature already developed in
> Kerrighed: file substitution at restart.
>
> The goal of this mail is to start a discussion about adding such feature
> to Linux cr. Comments are welcome!

Yup, AFAIK metacluster and zap do this too. I don't think there is
any question about whether we want to support this, but rather
what the user-kernel API should look like. Perhaps the easiest
"API" is to have the userspace program rewrite the checkpoint image,
but that probably isn't quite as simple as just substituting #s in
the image, bc we'll have to also find the place where the source of
the original fd was specified and tweak that.

I assume this is one of the things Oren would have 'cradvise()'
do, and at this point that sounds nice to me - might be worth
seeing how the community reacts. Sentiments on such things change,
after all.

Have there been any other suggestions?

-serge
* Re: C/R: File substitution at restart
From: Sukadev Bhattiprolu @ 2010-09-08 17:56 UTC
To: Serge E. Hallyn
Cc: Serge E. Hallyn, Nathan Lynch, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Matthieu Fertré, Louis Rilling, Dan Smith

Serge Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
| > Hi,
| >
| > Here is a proposal for a C/R related feature already developed in
| > Kerrighed: file substitution at restart.
| >
| > The goal of this mail is to start a discussion about adding such feature
| > to Linux cr. Comments are welcome!
|
| Yup, AFAIK metacluster and zap do this too. I don't think there is
| any question about whether we want to support this, but rather
| what the user-kernel API should look like. Perhaps the easiest
| "API" is to have the userspace program rewrite the checkpoint image,
| but that probably isn't quite as simple as just substituting #s in
| the image, bc we'll have to also find the place where the source of
| the original fd was specified and tweak that.
|
| I assume this is one of the things Oren would have 'cradvise()'
| do, and at this point that sounds nice to me - might be worth
| seeing how the community reacts. Sentiments on such things change,
| after all.

Yes, I had the same question about the kernel API. cradvise() would be
one option, but I am not too clear on the details. For each process in
the checkpoint image for which we want to substitute one or more fds,
do we call cradvise() *before* the call to sys_restart()? This would
require the kernel to save these substitution pairs in memory until
the following sys_restart(), right?

Passing in a list of fd-substitution pairs to sys_restart() might be one
option, but would require modifying the sys_restart() API.

Sukadev
* Re: C/R: File substitution at restart
From: Serge E. Hallyn @ 2010-09-08 22:49 UTC
To: Sukadev Bhattiprolu
Cc: Serge E. Hallyn, Nathan Lynch, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Matthieu Fertré, Louis Rilling, Dan Smith

Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Serge Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote:
> | Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> | > Hi,
> | >
> | > Here is a proposal for a C/R related feature already developed in
> | > Kerrighed: file substitution at restart.
> | >
> | > The goal of this mail is to start a discussion about adding such feature
> | > to Linux cr. Comments are welcome!
> |
> | Yup, AFAIK metacluster and zap do this too. I don't think there is
> | any question about whether we want to support this, but rather
> | what the user-kernel API should look like. Perhaps the easiest
> | "API" is to have the userspace program rewrite the checkpoint image,
> | but that probably isn't quite as simple as just substituting #s in
> | the image, bc we'll have to also find the place where the source of
> | the original fd was specified and tweak that.
> |
> | I assume this is one of the things Oren would have 'cradvise()'
> | do, and at this point that sounds nice to me - might be worth
> | seeing how the community reacts. Sentiments on such things change,
> | after all.
>
> Yes, I had the same question about the kernel API. cradvise() would be
> one option, but am not too clear on the details. For each process in
> the checkpoint image that we want to substitute one or more fds,
> do we call cradvise() *before* the call to sys_restart() ? This would

No, I would rather think that we follow the Kerrighed example, and
specify a checkpoint-wide id for the fd (the objhash id i guess). The
first cr_advise() starts to create a restart context, which finally
gets used at sys_restart by the coordinator (and of course all subsequent
tasks).

> require the kernel to save these substitution pairs in memory until
> the following sys_restart() right ?
>
> Passing in a list of fd-substition pairs to sys_restart() might be one
> option, but would require modifying the sys_restart() API.
>
> Sukadev
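The flow described in this reply (advice is accumulated in a restart context
keyed by a checkpoint-wide object id, then consumed once at restart) can be
modelled in a few lines. This is a toy userspace model in Python, purely for
illustration: cr_advise() was only a proposal, and the class RestartContext,
its methods, and the dictionary standing in for the image are all
hypothetical.

```python
class RestartContext:
    """Toy stand-in for a kernel-side restart context (hypothetical)."""

    def __init__(self):
        # checkpoint-wide object id -> fd of the caller to use instead
        self.substitutions = {}

    def advise_substitute(self, obj_id: int, fd: int) -> None:
        """Record one piece of advice; nothing is restored yet."""
        if obj_id in self.substitutions:
            raise ValueError(f"object {obj_id} already substituted")
        self.substitutions[obj_id] = fd

    def restart(self, image_files: dict) -> dict:
        """Consume the context: decide, per object, restore or substitute."""
        unknown = set(self.substitutions) - set(image_files)
        if unknown:
            raise KeyError(f"no such object(s) in image: {sorted(unknown)}")
        outcome = {}
        for obj_id, description in image_files.items():
            if obj_id in self.substitutions:
                outcome[obj_id] = ("substituted", self.substitutions[obj_id])
            else:
                outcome[obj_id] = ("restored", description)
        return outcome


# The coordinator gives advice first, then restarts once.
ctx = RestartContext()
ctx.advise_substitute(7, 0)  # use the caller's fd 0 for object 7
outcome = ctx.restart({7: "socket:[219057]", 8: "/dev/pts/1"})
print(outcome)
```

Because the id is checkpoint-wide, one piece of advice covers every task
that shares the object, so no per-process (pid, fd) pairs need to be kept.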
* Re: C/R: File substitution at restart
From: Matt Helsley @ 2010-09-08 19:35 UTC
To: Serge E. Hallyn
Cc: Serge E. Hallyn, Nathan Lynch, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Matthieu Fertré, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > Hi,
> >
> > Here is a proposal for a C/R related feature already developed in
> > Kerrighed: file substitution at restart.
> >
> > The goal of this mail is to start a discussion about adding such feature
> > to Linux cr. Comments are welcome!
>
> Yup, AFAIK metacluster and zap do this too. I don't think there is
> any question about whether we want to support this, but rather
> what the user-kernel API should look like. Perhaps the easiest
> "API" is to have the userspace program rewrite the checkpoint image,
> but that probably isn't quite as simple as just substituting #s in
> the image, bc we'll have to also find the place where the source of
> the original fd was specified and tweak that.
>
> I assume this is one of the things Oren would have 'cradvise()'
> do, and at this point that sounds nice to me - might be worth
> seeing how the community reacts. Sentiments on such things change,
> after all.
>
> Have there been any other suggestions?

I think it can be split into two composable pieces which may also be
useful independently.

The first uses the fcntl() interface to add a flag like
O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
restart. That way we don't have to specify an fd number and a "source"
to the kernel. Just tell the kernel to keep the fd. The source can
be opened and dup2'd via userspace. This is useful without the
second piece if we want to simply add rather than replace an fd.

Then a separate interface/tool is needed to ignore/delete
the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
part. It's difficult because depending on the open file the portions of
the image to ignore/delete can vary wildly. For instance, imagine if an
epoll fd was being ignored. It starts much like a generic file but there
is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
delete/ignore this section prior to parsing then it completely breaks
the parsing. In contrast, CKPT_OBJ_* do not break the parsing since
they aren't expected in a strict order -- the parser is capable of
parsing them at any time and the only order constraint on them is that
they appear in the image before they are referenced.

This piece is also useful by itself if we want to ignore/delete an fd
rather than substitute it.

Cheers,
	-Matt Helsley
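The userspace half of the first piece ("the source can be opened and dup2'd
via userspace") already works with existing system calls. Below is a small
Python sketch for Unix systems; install_replacement and keep_across_exec are
illustrative names. The proposed keep-across-restart flag does not exist, so
the sketch uses the existing FD_CLOEXEC bit only as the closest available
analogue of a per-fd "keep" marker.

```python
import fcntl
import os
import tempfile


def keep_across_exec(fd: int) -> None:
    """Clear FD_CLOEXEC so the descriptor survives exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags & ~fcntl.FD_CLOEXEC)


def install_replacement(path: str, target_fd: int) -> None:
    """Open `path` and make it available at descriptor number `target_fd`."""
    src = os.open(path, os.O_RDONLY)
    if src != target_fd:
        os.dup2(src, target_fd)  # target_fd now refers to the same open file
        os.close(src)
    keep_across_exec(target_fd)


# Demo: a scratch file takes the place of an already open descriptor.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("replacement input\n")

target = os.open(os.devnull, os.O_RDONLY)  # stands in for the fd to replace
install_replacement(tmp.name, target)
data = os.read(target, 100)
cloexec = fcntl.fcntl(target, fcntl.F_GETFD) & fcntl.FD_CLOEXEC

os.close(target)
os.unlink(tmp.name)
print(data, cloexec)
```

The second, harder piece (dropping the matching CKPT_OBJ_FILE from the
image) has no userspace equivalent and is not covered by this sketch.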
* Re: C/R: File substitution at restart
From: Serge E. Hallyn @ 2010-09-09 1:03 UTC
To: Matt Helsley
Cc: Serge E. Hallyn, Nathan Lynch, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Matthieu Fertré, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > > Hi,
> > >
> > > Here is a proposal for a C/R related feature already developed in
> > > Kerrighed: file substitution at restart.
> > >
> > > The goal of this mail is to start a discussion about adding such feature
> > > to Linux cr. Comments are welcome!
> >
> > Yup, AFAIK metacluster and zap do this too. I don't think there is
> > any question about whether we want to support this, but rather
> > what the user-kernel API should look like. Perhaps the easiest
> > "API" is to have the userspace program rewrite the checkpoint image,
> > but that probably isn't quite as simple as just substituting #s in
> > the image, bc we'll have to also find the place where the source of
> > the original fd was specified and tweak that.
> >
> > I assume this is one of the things Oren would have 'cradvise()'
> > do, and at this point that sounds nice to me - might be worth
> > seeing how the community reacts. Sentiments on such things change,
> > after all.
> >
> > Have there been any other suggestions?
>
> I think it can be split into two composable pieces which may also be
> useful independently.
>
> The first uses the fcntl() interface to add a flag like
> O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> restart. That way we don't have to specify an fd number and a "source"
> to the kernel. Just tell the kernel to keep the fd. The source can
> be opened and dup2'd via userspace. This is useful without the
> second piece if we want to simply add rather than replace an fd.

Can you think of any other use for this flag other than restart?

If so, then having a fcntl flag (and later madvise) makes sense.
But if we're going to add options to various different APIs which
really are all only useful for c/r, then maybe a single new cr_advise()
really does make sense. The alternative may be more popular at first
but would IMO turn into a disaster.

> Then a separate interface/tool is needed to ignore/delete
> the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
> part. It's difficult because depending on the open file the portions of
> the image to ignore/delete can vary wildly. For instance, imagine if an
> epoll fd was being ignored. It starts much like a generic file but there
> is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
> delete/ignore this section prior to parsing then it completely breaks
> the parsing.

Yup, that is precisely what stopped me when I tried to do this 6 months
or so ago just for stdin/stdout/stderr.

> In contrast, CKPT_OBJ_* do not break the parsing since
> they aren't expected in a strict order -- the parser is capable of
> parsing them at any time and the only order constraint on them is that
> they appear in the image before they are referenced.
> This piece is also useful by itself if we want to ignore/delete an fd
> rather than substitute it.

Are you working on any of this?
* Re: C/R: File substitution at restart
From: Matt Helsley @ 2010-09-09 4:06 UTC
To: Serge E. Hallyn
Cc: Serge E. Hallyn, Nathan Lynch, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Matthieu Fertré, Louis Rilling, Dan Smith, Sukadev Bhattiprolu

On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > Quoting Matthieu Fertré (matthieu.fertre-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org):
> > > > Hi,
> > > >
> > > > Here is a proposal for a C/R related feature already developed in
> > > > Kerrighed: file substitution at restart.
> > > >
> > > > The goal of this mail is to start a discussion about adding such feature
> > > > to Linux cr. Comments are welcome!
> > >
> > > Yup, AFAIK metacluster and zap do this too. I don't think there is
> > > any question about whether we want to support this, but rather
> > > what the user-kernel API should look like. Perhaps the easiest
> > > "API" is to have the userspace program rewrite the checkpoint image,
> > > but that probably isn't quite as simple as just substituting #s in
> > > the image, bc we'll have to also find the place where the source of
> > > the original fd was specified and tweak that.

If the object to be replaced is specified by obj id then we could simply
rewrite the obj id to refer to the replacement. The original object could
still be in the image. It would get parsed but the only reference to it
would be in the objhash. When the objhash is cleaned up the original
object would be too.

(more below)

> > >
> > > I assume this is one of the things Oren would have 'cradvise()'
> > > do, and at this point that sounds nice to me - might be worth
> > > seeing how the community reacts. Sentiments on such things change,
> > > after all.
> > >
> > > Have there been any other suggestions?
> >
> > I think it can be split into two composable pieces which may also be
> > useful independently.
> >
> > The first uses the fcntl() interface to add a flag like
> > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > restart. That way we don't have to specify an fd number and a "source"
> > to the kernel. Just tell the kernel to keep the fd. The source can
> > be opened and dup2'd via userspace. This is useful without the
> > second piece if we want to simply add rather than replace an fd.
>
> Can you think of any other use for this flag other than restart?

<joking>
I can't think of any other uses for O_CLOEXEC.
</joking>

Seriously though, restart will be used _much_ less often than exec so yes
it does seem like a waste of a valuable bit and something that wouldn't
quite belong in an fcntl interface.

However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
Right now restart closes all file descriptors and pays absolutely
no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
want to keep we do not mark with O_CLOEXEC.

Here's another idea which I haven't fully thought out yet.

We could introduce the concept of object id substitutions in the image.
So the image would look like (going from file pos 0 at the top..):

0 +-------------------------------+
  |                               |
  .....
  +-------------------------------+
  |      <substitute object>      | <--- object with id == <substitute id>
  .....
  +---------------+---------------+
  |  <object id>  |<substitute id>|
  +---------------+---------------+
  .....
  +---------------+---------------+
  |      <object to ignore>       | <-- object with id == <object id>
  .....

(The above is ignoring the ckpt_hdr fields..)

When we read the image during restart we use the substitute ids to
create indirect objhash entries. When we encounter an obj id and
it refers to an indirect entry we first parse the object (ignoring
errors and dropping references on new objhash insertions), flip
a bit on the indirect entry (indicating the object has been parsed),
and then lookup the substitute id and return whatever that resolved to.

We can ignore the new objhash objects by making the objhash have its
own operation struct. When we're parsing an object that's been
substituted we just temporarily set the objhash add/lookup operations
to something suitable for properly dropping references to the new
object(s). This way we don't have to add checks for this peculiar
need all over the checkpoint/restart code. Sure it'll be slower...

I can think of a few problems with that already. If the substituted
obj differs wildly in file type then any defer queue entries that use
obj ids to complete the deferred work would fail miserably...

That said, so far I've never heard folks discuss substituting anything
but fds. Perhaps enabling substitution at the objhash level is just too
broad and we'd be better off only allowing fd substitutions?

> If so, then having a fcntl flag (and later madvise) makes sense.
> But if we're going to add options to various different APIS which
> really are all only useful for c/r, then maybe a single new cr_advise()
> really does make sense. The alternative may be more popular at first
> but would IMO turn into a disaster.

Good point.

> > Then a separate interface/tool is needed to ignore/delete
> > the extra CKPT_OBJ_FILE in the checkpoint image. That's the difficult
> > part. It's difficult because depending on the open file the portions of
> > the image to ignore/delete can vary wildly. For instance, imagine if an
> > epoll fd was being ignored. It starts much like a generic file but there
> > is an image header related to it that isn't a CKPT_OBJ_*. If we fail to
> > delete/ignore this section prior to parsing then it completely breaks
> > the parsing.
>
> Yup, that is precisely what stopped me when I tried to do this 6 months
> or so ago just for stdin/stdout/stderr.
>
> > In contrast, CKPT_OBJ_* do not break the parsing since
> > they aren't expected in a strict order -- the parser is capable of
> > parsing them at any time and the only order constraint on them is that
> > they appear in the image before they are referenced.
> > This piece is also useful by itself if we want to ignore/delete an fd
> > rather than substitute it.
>
> Are you working on any of this?

Not really. I wrote a quick patch to introduce a new fcntl flag I called
O_NOCLOREST but ran into the same problem with parsing.

Cheers,
	-Matt
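The indirect-entry idea from this message can be tried out on a toy image.
The Python sketch below is only a model of the lookup logic, under loud
assumptions: the record tuples ("obj", "subst", "ref") are an invented
stand-in for the real image format, plain dictionaries stand in for the
objhash, and the "parse, then drop the references" step is reduced to not
inserting the ignored object.

```python
def restore(records):
    """Walk a toy image and resolve references through indirect entries.

    records: list of tuples, one of
      ("obj", obj_id, payload)           an object stored in the image
      ("subst", obj_id, substitute_id)   redirect obj_id to substitute_id
      ("ref", obj_id)                    a later use of an object id
    """
    objhash = {}    # obj id -> restored object
    indirect = {}   # obj id -> {"substitute": id, "parsed": bool}
    resolved = []
    for kind, *fields in records:
        if kind == "obj":
            obj_id, payload = fields
            if obj_id in indirect:
                # Parsed so the stream stays in sync, then dropped.
                indirect[obj_id]["parsed"] = True
            else:
                objhash[obj_id] = payload
        elif kind == "subst":
            obj_id, substitute_id = fields
            indirect[obj_id] = {"substitute": substitute_id, "parsed": False}
        elif kind == "ref":
            (obj_id,) = fields
            entry = indirect.get(obj_id)
            if entry is not None:
                obj_id = entry["substitute"]
            resolved.append(objhash[obj_id])
    return resolved, indirect


# Same order as in the diagram: substitute object, id pair, ignored object.
image = [
    ("obj", 10, "/dev/pts/5"),
    ("subst", 3, 10),
    ("obj", 3, "socket:[219057]"),
    ("ref", 3),
    ("ref", 10),
]
resolved, indirect = restore(image)
print(resolved, indirect)
```

The model keeps the ordering constraint mentioned earlier in the thread: an
object, including the substitute, must appear before it is referenced.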
* Re: C/R: File substitution at restart
From: Louis Rilling @ 2010-09-09 10:37 UTC
To: Matt Helsley
Cc: Serge E. Hallyn, Matthieu Fertré, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Nathan Lynch, Dan Smith, Sukadev Bhattiprolu

On 08/09/10 21:06 -0700, Matt Helsley wrote:
> On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > I think it can be split into two composable pieces which may also be
> > > useful independently.
> > >
> > > The first uses the fcntl() interface to add a flag like
> > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > restart. That way we don't have to specify an fd number and a "source"
> > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > be opened and dup2'd via userspace. This is useful without the
> > > second piece if we want to simply add rather than replace an fd.
> >
> > Can you think of any other use for this flag other than restart?
>
> <joking>
> I can't think of any other uses for O_CLOEXEC.
> </joking>
>
> Seriously though, restart will be used _much_ less often than exec so yes
> it does seem like a waste of a valuable bit and something that wouldn't
> quite belong in an fcntl interface.
>
> However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> Right now restart closes all file descriptors and pays absolutely
> no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> want to keep we do not mark with O_CLOEXEC.

This would also be useful at checkpoint, to tell sys_checkpoint() which fds
should be ignored, either because they are not supported or because the
application has a better way to deal with them.

>
>
> Here's another idea which I haven't fully thought out yet.
>
> We could introduce the concept of object id substitutions in the image.
> So the image would look like (going from file pos 0 at the top..):
>
> 0 +-------------------------------+
>   |                               |
>   .....
>   +-------------------------------+
>   |      <substitute object>      | <--- object with id == <substitute id>
>   .....
>   +---------------+---------------+
>   |  <object id>  |<substitute id>|
>   +---------------+---------------+
>   .....
>   +---------------+---------------+
>   |      <object to ignore>       | <-- object with id == <object id>
>   .....
>
> (The above is ignoring the ckpt_hdr fields..)
>
> When we read the image during restart we use the substitute ids to
> create indirect objhash entries. When we encounter an obj id and
> it refers to an indirect entry we first parse the object (ignoring
> errors and dropping references on new objhash insertions), flip
> a bit on the indirect entry (indicating the object has been parsed),
> and then lookup the substitute id and return whatever that resolved to.
>
> We can ignore the new objhash objects by making the objhash have its
> own operation struct. When we're parsing an object that's been
> substituted we just temporarily set the objhash add/lookup operations
> to something suitable for properly dropping references to the new
> object(s). This way we don't have to add checks for this peculiar
> need all over the checkpoint/restart code. Sure it'll be slower...

If at checkpoint we can take care to ignore files that we know will be
substituted, this should not be that much slower.

>
> I can think of a few problems with that already. If the substituted
> obj differs wildly in file type then any defer queue entries that use
> obj ids to complete the deferred work would fail miserably...

The problem I see with rewriting the image is that this may impose
additional I/O, for instance to duplicate the image before rewriting it, or
if it is simply rewritten to disk. In contrast, having an easily parsable
table of fds at the beginning of the image, with associated object ids (and
preferably more info like file type, path, owner rights, etc.), makes it
easy and lightweight to build a separate substitution table that we could
feed sys_restart() with (maybe only the coordinator could feed
sys_restart() with such a table).

>
> That said, so far I've never heard folks discuss substituting anything
> but fds. Perhaps enabling substitution at the objhash level is just too
> broad and we'd be better off only allowing fd substitutions?

Well... A while ago I asked about substituting SYSV IPC objects (I was
talking about SHM at that moment, but semaphore sets or message queues are
even easier to substitute). In such a scenario, a video transcoding pipeline
would use SYSV SHMs to store transitional frames between the stages of the
pipeline, and the SHMs themselves would not need to be checkpointed, or
could be checkpointed at a lower frequency than the processes.

Substituting memory mapped files (for instance POSIX SHMs) would be useful
too.

>
> > If so, then having a fcntl flag (and later madvise) makes sense.
> > But if we're going to add options to various different APIS which
> > really are all only useful for c/r, then maybe a single new cr_advise()
> > really does make sense. The alternative may be more popular at first
> > but would IMO turn into a disaster.
>
> Good point.

cr_advise() or changing sys_checkpoint() and sys_restart() are both fine
with me.

Thanks,

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes
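Louis's alternative (leave the image untouched, read a small fd table from
its start, and hand restart a separate substitution table) can be sketched as
follows. This is an illustrative Python model only: FdEntry, its fields, and
build_substitution_table are hypothetical, and no such table exists in the
image format as discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FdEntry:
    """One row of a hypothetical fd table at the start of the image."""
    obj_id: int
    file_type: str
    path: str


def build_substitution_table(fd_table, choose):
    """Return {obj id: replacement fd} for rows where choose() gives an fd.

    choose(entry) returns the caller's fd to use instead of the saved file,
    or None to let the file be restored normally.
    """
    table = {}
    for entry in fd_table:
        fd = choose(entry)
        if fd is not None:
            table[entry.obj_id] = fd
    return table


header = [
    FdEntry(3, "socket", "socket:[219057]"),
    FdEntry(4, "tty", "/dev/pts/1"),
]

# Policy comparable to 'restart -t': every tty is replaced by the caller's
# fd 0; everything else is restored from the image.
table = build_substitution_table(
    header, lambda e: 0 if e.file_type == "tty" else None
)
print(table)
```

Only the header is read and the table is built in memory, so the extra I/O
of copying or rewriting the image is avoided.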
* Re: C/R: File substitution at restart
From: Matt Helsley @ 2010-09-09 11:02 UTC
To: Matt Helsley, Serge E. Hallyn, Serge E. Hallyn, Nathan Lynch

On Thu, Sep 09, 2010 at 12:37:20PM +0200, Louis Rilling wrote:
> On 08/09/10 21:06 -0700, Matt Helsley wrote:
> > On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > > I think it can be split into two composable pieces which may also be
> > > > useful independently.
> > > >
> > > > The first uses the fcntl() interface to add a flag like
> > > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > > restart. That way we don't have to specify an fd number and a "source"
> > > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > > be opened and dup2'd via userspace. This is useful without the
> > > > second piece if we want to simply add rather than replace an fd.
> > >
> > > Can you think of any other use for this flag other than restart?
> >
> > <joking>
> > I can't think of any other uses for O_CLOEXEC.
> > </joking>
> >
> > Seriously though, restart will be used _much_ less often than exec so yes
> > it does seem like a waste of a valuable bit and something that wouldn't
> > quite belong in an fcntl interface.
> >
> > However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> > Right now restart closes all file descriptors and pays absolutely
> > no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> > too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> > want to keep we do not mark with O_CLOEXEC.
>
> This would also be useful at checkpoint, to tell sys_checkpoint() which fds
> should be ignored, being because it is not supported or because the application
> has a better way to deal with it.

True. Though unlike restart I don't think we just can (ab|re)use O_CLOEXEC
for that purpose.

> >
> > Here's another idea which I haven't fully thought out yet.
> >
> > We could introduce the concept of object id substitutions in the image.
> > So the image would look like (going from file pos 0 at the top..):
> >
> > 0 +-------------------------------+
> >   |                               |
> >   .....
> >   +-------------------------------+
> >   |      <substitute object>      | <--- object with id == <substitute id>
> >   .....
> >   +---------------+---------------+
> >   |  <object id>  |<substitute id>|
> >   +---------------+---------------+
> >   .....
> >   +---------------+---------------+
> >   |      <object to ignore>       | <-- object with id == <object id>
> >   .....
> >
> > (The above is ignoring the ckpt_hdr fields..)
> >
> > When we read the image during restart we use the substitute ids to
> > create indirect objhash entries. When we encounter an obj id and
> > it refers to an indirect entry we first parse the object (ignoring
> > errors and dropping references on new objhash insertions), flip
> > a bit on the indirect entry (indicating the object has been parsed),
> > and then lookup the substitute id and return whatever that resolved to.
> >
> > We can ignore the new objhash objects by making the objhash have its
> > own operation struct. When we're parsing an object that's been
> > substituted we just temporarily set the objhash add/lookup operations
> > to something suitable for properly dropping references to the new
> > object(s). This way we don't have to add checks for this peculiar
> > need all over the checkpoint/restart code. Sure it'll be slower...
>
> If at checkpoint we can take care to ignore files that we know will be
> substituted, this should not be that slower.

So, would you say typically it's the application developer who knows
what to ignore? Are we expecting distros/packagers to be able to set
that up? Admins? These specific optimizations seem like they would be a
bit fragile unless the application developer is involved.

Cheers,
	-Matt Helsley
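Editor's illustration of the "(ab|re)use O_CLOEXEC" idea discussed above: a restart tool would mark every descriptor it does not want carried over as close-on-exec and leave the ones to keep unmarked. What follows is only a minimal userspace sketch of that marking step. Note that at the fcntl() level the descriptor flag is FD_CLOEXEC (O_CLOEXEC is the open(2) flag that sets it). The helper names set_cloexec, is_cloexec and mark_for_restart are hypothetical, and nothing here implies that any restart implementation actually interprets the flag this way; that is the proposal under discussion, not existing behaviour.

```python
import fcntl
import os


def set_cloexec(fd, enabled):
    """Set or clear the close-on-exec descriptor flag (FD_CLOEXEC) on fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def is_cloexec(fd):
    """Return True if fd currently has FD_CLOEXEC set."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def mark_for_restart(candidate_fds, keep):
    """Hypothetical policy from the thread: fds in `keep` are left unmarked,
    every other candidate fd is marked close-on-exec ("unwanted")."""
    for fd in candidate_fds:
        set_cloexec(fd, fd not in keep)


# Small demonstration on a pipe: keep the read end, discard the write end.
r, w = os.pipe()
mark_for_restart([r, w], keep={r})
print(is_cloexec(r), is_cloexec(w))  # prints: False True
```

The sketch takes an explicit list of candidate descriptors rather than scanning the process's fd table, so that the policy (which fds to keep) stays separate from the mechanism (setting the flag).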
* Re: C/R: File substitution at restart
From: Louis Rilling @ 2010-09-09 11:34 UTC
To: Matt Helsley
Cc: Serge E. Hallyn, Matthieu Fertré, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Nathan Lynch, Dan Smith, Sukadev Bhattiprolu

On 09/09/10 4:02 -0700, Matt Helsley wrote:
> On Thu, Sep 09, 2010 at 12:37:20PM +0200, Louis Rilling wrote:
> > On 08/09/10 21:06 -0700, Matt Helsley wrote:
> > > On Wed, Sep 08, 2010 at 08:03:52PM -0500, Serge E. Hallyn wrote:
> > > > Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> > > > > On Wed, Sep 08, 2010 at 08:09:31AM -0500, Serge E. Hallyn wrote:
> > > > > I think it can be split into two composable pieces which may also be
> > > > > useful independently.
> > > > >
> > > > > The first uses the fcntl() interface to add a flag like
> > > > > O_CLOEXEC. Unlike O_CLOEXEC it marks an fd for preservation during
> > > > > restart. That way we don't have to specify an fd number and a "source"
> > > > > to the kernel. Just tell the kernel to keep the fd. The source can
> > > > > be opened and dup2'd via userspace. This is useful without the
> > > > > second piece if we want to simply add rather than replace an fd.
> > > >
> > > > Can you think of any other use for this flag other than restart?
> > >
> > > <joking>
> > > I can't think of any other uses for O_CLOEXEC.
> > > </joking>
> > >
> > > Seriously though, restart will be used _much_ less often than exec so yes
> > > it does seem like a waste of a valuable bit and something that wouldn't
> > > quite belong in an fcntl interface.
> > >
> > > However we can try to be a tad clever -- we could (ab|re)use O_CLOEXEC.
> > > Right now restart closes all file descriptors and pays absolutely
> > > no attention to O_CLOEXEC. We could reuse O_CLOEXEC to mean O_CLOREST
> > > too. Have user-cr's restart tool mark all unwanted fds O_CLOEXEC. Any we
> > > want to keep we do not mark with O_CLOEXEC.
> >
> > This would also be useful at checkpoint, to tell sys_checkpoint() which fds
> > should be ignored, being because it is not supported or because the application
> > has a better way to deal with it.
>
> True. Though unlike restart I don't think we just can (ab|re)use O_CLOEXEC
> for that purpose.
>
> > >
> > > Here's another idea which I haven't fully thought out yet.
> > >
> > > We could introduce the concept of object id substitutions in the image.
> > > So the image would look like (going from file pos 0 at the top..):
> > >
> > > 0 +-------------------------------+
> > >   |                               |
> > >   .....
> > >   +-------------------------------+
> > >   |      <substitute object>      | <--- object with id == <substitute id>
> > >   .....
> > >   +---------------+---------------+
> > >   |  <object id>  |<substitute id>|
> > >   +---------------+---------------+
> > >   .....
> > >   +---------------+---------------+
> > >   |      <object to ignore>       | <-- object with id == <object id>
> > >   .....
> > >
> > > (The above is ignoring the ckpt_hdr fields..)
> > >
> > > When we read the image during restart we use the substitute ids to
> > > create indirect objhash entries. When we encounter an obj id and
> > > it refers to an indirect entry we first parse the object (ignoring
> > > errors and dropping references on new objhash insertions), flip
> > > a bit on the indirect entry (indicating the object has been parsed),
> > > and then lookup the substitute id and return whatever that resolved to.
> > >
> > > We can ignore the new objhash objects by making the objhash have its
> > > own operation struct. When we're parsing an object that's been
> > > substituted we just temporarily set the objhash add/lookup operations
> > > to something suitable for properly dropping references to the new
> > > object(s). This way we don't have to add checks for this peculiar
> > > need all over the checkpoint/restart code. Sure it'll be slower...
> >
> > If at checkpoint we can take care to ignore files that we know will be
> > substituted, this should not be that slower.
>
> So, would you say typically it's the application developer who knows
> what to ignore? Are we expecting distros/packagers to be able to set
> that up? Admins? These specific optimizations seem like they would be a
> bit fragile unless the application developer is involved.

If you look at OpenMPI's C/R framework, the policy to ignore/substitute fds
(mostly sockets and log files) is programmed in the C/R plugin. In that case,
the middleware knows. Otherwise, for such optimization cases I expect
applications to come with dedicated C/R helpers. So, yes, the application
developer is involved.

However, in some special cases like stdio redirection of containers,
administrators should be able to do it, or even users. Imagine a user feeding
some file to the checkpointed app, and wanting the app to work on a different
file at restart. With a parsing tool and enough info in the fd table of the
checkpoint image, the user could easily know which fd should be substituted
because its path matches the file that was fed to the app.

Thanks,

Louis

-- 
Dr Louis Rilling                        Kerlabs
Skype: louis.rilling                    Batiment Germanium
Phone: (+33|0) 6 80 89 08 23            80 avenue des Buttes de Coesmes
http://www.kerlabs.com/                 35700 Rennes
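Editor's illustration of the last point (finding the file to substitute because its path matches): the Kerrighed user_info_*.txt format quoted in the first message of this thread, type|file_identifier|symbolic name|list of pid:fd, already carries enough information for such a lookup. The sketch below parses that format and returns the identifiers whose symbolic name equals a given path. The function names parse_user_info and find_by_path are hypothetical, the sample lines are the two from the original example, and the layout of a Linux-cr checkpoint image is not assumed to look like this.

```python
def parse_user_info(text):
    """Parse lines of the form: type|file_identifier|symbolic name|pid:fd,..."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumes the symbolic name itself contains no '|' character.
        ftype, ident, name, users = (field.strip() for field in line.split("|"))
        fds = []
        for item in users.split(","):
            pid, fd = item.split(":")
            fds.append((int(pid), int(fd)))
        entries.append({"type": ftype, "id": ident, "name": name, "fds": fds})
    return entries


def find_by_path(entries, path):
    """Return the identifiers of open files whose symbolic name equals path."""
    return [entry["id"] for entry in entries if entry["name"] == path]


SAMPLE = """\
socket |0001FFFF880066D68EA8|socket:[219057]|6315097:3
tty |0001FFFF88007D040EA8|/dev/pts/1|6315097:0,6315097:1,6315097:2
"""

entries = parse_user_info(SAMPLE)
print(find_by_path(entries, "/dev/pts/1"))  # prints: ['0001FFFF88007D040EA8']
```

With Kerrighed, the identifier found this way is what the restart manual page quoted earlier expects as the first half of the -s file_identifier,fd argument.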
end of thread (newest message: 2010-09-09 11:34 UTC)

Thread overview: 10+ messages
2010-09-08 10:03 C/R: File substitution at restart Matthieu Fertré
2010-09-08 13:09 ` Serge E. Hallyn
2010-09-08 17:56   ` Sukadev Bhattiprolu
2010-09-08 22:49     ` Serge E. Hallyn
2010-09-08 19:35   ` Matt Helsley
2010-09-09  1:03     ` Serge E. Hallyn
2010-09-09  4:06       ` Matt Helsley
2010-09-09 10:37         ` Louis Rilling
2010-09-09 11:02           ` Matt Helsley
2010-09-09 11:34             ` Louis Rilling