linux-kselftest.vger.kernel.org archive mirror
From: avagin at virtuozzo.com (Andrey Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Tue, 25 Sep 2018 01:42:02 +0000
Message-ID: <20180925014150.GA6302@outlook.office365.com>
In-Reply-To: <874leezh8n.fsf@xmission.com>

On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> Andrey Vagin <avagin at virtuozzo.com> writes:
> 
> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> >> Dmitry Safonov <dima at arista.com> writes:
> >> 
> >> > Discussions around time virtualization have been going on for a long time.
> >> > The first attempt to implement time namespace was in 2006 by Jeff Dike.
> >> > Since then, the topic has appeared on and off in various discussions.
> >> >
> >> > There are two main use cases for time namespaces:
> >> > 1. change date and time inside a container;
> >> > 2. adjust clocks for a container restored from a checkpoint.
> >> >
> >> > “It seems like this might be one of the last major obstacles keeping
> >> > migration from being used in production systems, given that not all
> >> > containers and connections can be migrated as long as a time dependency
> >> > is capable of messing it up.” (by github.com/dav-ell)
> >> >
> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but the
> >> > start points for them are not defined and are different for each running
> >> > system. When a container is migrated from one node to another, all
> >> > clocks have to be restored into consistent states; in other words, they
> >> > have to continue running from the same points where they have been
> >> > dumped.
> >> >
> >> > The main idea behind this patch set is adding per-namespace offsets for
> >> > system clocks. When a process in a non-root time namespace requests
> >> > time of a clock, a namespace offset is added to the current value of
> >> > this clock on a host and the sum is returned.
> >> >
> >> > All offsets are placed on a separate page; this allows us to map it as
> >> > part of vvar into user processes and use offsets from vdso calls.
> >> >
> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> >> > clocks.
> >> >
> >> > Questions to discuss:
> >> >
> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >> > bit left, and it may be worth using it to extend arguments of the clone
> >> > system call.
> >> >
> >> > * Realtime clock implementation details:
> >> >   Is having a simple offset enough?
> >> >   What to do when date and time is changed on the host?
> >> >   Is there a need to adjust vfs modification and creation times? 
> >> >   Implementation for adjtime() syscall.
> >> 
> >> Overall I support this effort.  In my quick skim this code looked good.
> >
> > Hi Eric,
> >
> > Thank you for the feedback.
> >
> >> 
> >> My feeling is that we need to be able to support running ntpd and
> >> support one namespace doing Google's smoothing of leap seconds while
> >> another namespace takes the leap second.
> >> 
> >> What I was imagining when I was last thinking about this was one
> >> instance of struct timekeeper aka tk_core per time namespace.  That
> >> structure already keeps offsets for all of the various clocks from
> >> the kernel internal time sources.  What would be needed would be to
> >> pass in an appropriate time namespace pointer.
> >> 
> >> I could be completely wrong as I have not taken the time to completely
> >> trace through the code.  Have you looked at pushing the time namespace
> >> down as far as tk_core?
> >> 
> >> What I think would be the big advantage (besides ntp working) is that
> >> the bulk of the code could be reused.  Allowing testing of the kernel's
> >> time code by setting up a new time namespace.  So a person in production
> >> could set up a time namespace with the time set ahead a little bit and
> >> be able to verify that the kernel handles the upcoming leap second
> >> properly.
> >>
> >
> > It is an interesting idea, but I have a few questions:
> >
> > 1. Does it mean that timekeeping_update() will be called for each
> > namespace? This function is called periodically; it updates times in the
> > timekeeper structure, updates vsyscall_gtod_data, etc. What will the
> > overhead of this be?
> 
> I don't know if periodically is a proper characterization.  There may be
> a code path that does that.  But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
> 
> So it appears to make sense for timekeeping_update to be per namespace.
> 
> Hmm.  Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> how big that period would be.  As long as the period is reasonably large,
> or the time namespaces were sufficiently desynchronized, it should not
> be a problem.  But that is the class of problem that could make
> my ideal impractical if there is measurable overhead.
> 
> Where were you seeing timekeeping_update being called periodically?

timekeeping_update() is called HZ times per second:

[   67.912858]  timekeeping_update.cold.26+0x5/0xa
[   67.913332]  timekeeping_advance+0x361/0x5c0
[   67.913857]  ? tick_sched_do_timer+0x55/0x70
[   67.914409]  ? tick_sched_do_timer+0x70/0x70
[   67.914947]  tick_sched_do_timer+0x55/0x70
[   67.915505]  tick_sched_timer+0x27/0x70
[   67.916042]  __hrtimer_run_queues+0x10f/0x440
[   67.916639]  hrtimer_interrupt+0x100/0x220
[   67.917305]  smp_apic_timer_interrupt+0x79/0x220
[   67.918030]  apic_timer_interrupt+0xf/0x20
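
To put a rough number on the concern, here is a minimal sketch, assuming
that every time namespace had its own timekeeper and that each of them was
updated on every tick. The helper name updates_per_second() is hypothetical
and is not part of the patch set; it only does the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): how many
 * timekeeping_update()-style calls per second would be needed if each
 * time namespace owned a timekeeper that is advanced on every tick.
 */
static uint64_t updates_per_second(uint32_t hz, uint32_t nr_time_namespaces)
{
	assert(hz > 0);
	/* Widen before multiplying so large inputs cannot overflow 32 bits. */
	return (uint64_t)hz * nr_time_namespaces;
}
```

With HZ=1000 and 500 time namespaces, that is 500000 updates per second
under this assumption, each of which would also refresh the corresponding
vsyscall data.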

> 
> > 2. What will we do with vdso? It looks like we will have to have a
> > separate vsyscall_gtod_data for each ns and update each of them
> > separately.
> 
> Yes.  But you don't have to introduce another variable; just make
> certain vsyscall_gtod_data is a page-aligned thing per time namespace.
> 
> If I read the summary of the existing patchset something very similar
> is already going on.

I mean that vsyscall_gtod_data contains data that is updated often: the
timestamps for the monotonic and wall clocks. clock_gettime() reads a
timestamp from vsyscall_gtod_data and then uses the TSC to extrapolate the
current value of the clock.

Actually, this is not a second question; it is part of the first one:
update_vsyscall() is called from timekeeping_update().
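
As an illustration of the read path described above, here is a simplified
user-space sketch. The struct gtod_snapshot, its fields and clock_ns() are
hypothetical stand-ins, not the kernel's definitions; the real vDSO code
also reads the data under a sequence counter, which is left out here. The
ns_offset parameter models the per-namespace offset proposed in the cover
letter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the data the vDSO reads. */
struct gtod_snapshot {
	uint64_t cycle_last;	/* counter value at the last update */
	uint64_t base_ns;	/* clock value, in ns, at cycle_last */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns right shift */
};

/*
 * Extrapolate the clock from the last snapshot:
 *   ns = base_ns + ((cycles_now - cycle_last) * mult >> shift) + ns_offset
 * ns_offset is 0 for the root namespace in this sketch.
 */
static int64_t clock_ns(const struct gtod_snapshot *g, uint64_t cycles_now,
			int64_t ns_offset)
{
	uint64_t delta;

	assert(g->shift < 64);
	delta = cycles_now - g->cycle_last;
	return (int64_t)(g->base_ns + ((delta * g->mult) >> g->shift)) + ns_offset;
}
```

The point of the sketch is that base_ns and cycle_last change on every
update, while ns_offset does not; with one timekeeper per namespace, the
frequently changing part would have to be rewritten for each namespace.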

> 
> Each process would only map one.  And unshare of the time namespace
> would need to act like the pid namespace or be limited to only being
> allowed when there is only a single task using the mm.
> 
> Eric
