linux-kselftest.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: avagin at gmail.com (Andrei Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx at linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers at lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: avagin@gmail.com (Andrei Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
Message-ID: <20181021035436.OosCt59uINa5f8_NsytwDTIr9rc3wUQIK4egsXdQPm0@z> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018@06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018@07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx at linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers at lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

  parent reply	other threads:[~2018-10-21  3:54 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-19 20:50 [RFC 00/20] ns: Introduce Time Namespace dima
2018-09-19 20:50 ` Dmitry Safonov
2018-09-19 20:50 ` [RFC 16/20] selftest: Add Time Namespace test for supported clocks dima
2018-09-19 20:50   ` Dmitry Safonov
2018-09-24 21:36   ` shuah
2018-09-24 21:36     ` Shuah Khan
2018-09-19 20:50 ` [RFC 17/20] selftest/timens: Add test for timerfd dima
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50 ` [RFC 18/20] selftest/timens: Add test for clock_nanosleep dima
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50 ` [RFC 19/20] timens/selftest: Add procfs selftest dima
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50 ` [RFC 20/20] timens/selftest: Add timer offsets test dima
2018-09-19 20:50   ` Dmitry Safonov
2018-09-21 12:27 ` [RFC 00/20] ns: Introduce Time Namespace ebiederm
2018-09-21 12:27   ` Eric W. Biederman
2018-09-24 20:51   ` avagin
2018-09-24 20:51     ` Andrey Vagin
2018-09-24 22:02     ` ebiederm
2018-09-24 22:02       ` Eric W. Biederman
2018-09-25  1:42       ` avagin
2018-09-25  1:42         ` Andrey Vagin
2018-09-26 17:36         ` ebiederm
2018-09-26 17:36           ` Eric W. Biederman
2018-09-26 17:59           ` 0x7f454c46
2018-09-26 17:59             ` Dmitry Safonov
2018-09-27 21:30           ` tglx
2018-09-27 21:30             ` Thomas Gleixner
2018-09-27 21:41             ` tglx
2018-09-27 21:41               ` Thomas Gleixner
2018-10-01 23:20               ` avagin
2018-10-01 23:20                 ` Andrey Vagin
2018-10-02  6:15                 ` tglx
2018-10-02  6:15                   ` Thomas Gleixner
2018-10-02 21:05                   ` 0x7f454c46
2018-10-02 21:05                     ` Dmitry Safonov
2018-10-02 21:26                     ` tglx
2018-10-02 21:26                       ` Thomas Gleixner
2018-09-28 17:03             ` ebiederm
2018-09-28 17:03               ` Eric W. Biederman
2018-09-28 19:32               ` tglx
2018-09-28 19:32                 ` Thomas Gleixner
2018-10-01  9:05                 ` ebiederm
2018-10-01  9:05                   ` Eric W. Biederman
2018-10-01  9:15                 ` Setting monotonic time? ebiederm
2018-10-01  9:15                   ` Eric W. Biederman
2018-10-01 18:52                   ` tglx
2018-10-01 18:52                     ` Thomas Gleixner
2018-10-02 20:00                     ` arnd
2018-10-02 20:00                       ` Arnd Bergmann
2018-10-02 20:06                       ` tglx
2018-10-02 20:06                         ` Thomas Gleixner
2018-10-03  4:50                         ` ebiederm
2018-10-03  4:50                           ` Eric W. Biederman
2018-10-03  5:25                           ` tglx
2018-10-03  5:25                             ` Thomas Gleixner
2018-10-03  6:14                             ` ebiederm
2018-10-03  6:14                               ` Eric W. Biederman
2018-10-03  7:02                               ` arnd
2018-10-03  7:02                                 ` Arnd Bergmann
2018-10-03  6:14                             ` tglx
2018-10-03  6:14                               ` Thomas Gleixner
2018-10-01 20:51                   ` avagin
2018-10-01 20:51                     ` Andrey Vagin
2018-10-02  6:16                     ` tglx
2018-10-02  6:16                       ` Thomas Gleixner
2018-10-21  1:41               ` [RFC 00/20] ns: Introduce Time Namespace avagin
2018-10-21  1:41                 ` Andrei Vagin
2018-10-21  3:54                 ` avagin [this message]
2018-10-21  3:54                   ` Andrei Vagin
2018-10-29 20:33                 ` tglx
2018-10-29 20:33                   ` Thomas Gleixner
2018-10-29 21:21                   ` ebiederm
2018-10-29 21:21                     ` Eric W. Biederman
2018-10-29 21:36                     ` tglx
2018-10-29 21:36                       ` Thomas Gleixner
2018-10-31 16:26                   ` avagin
2018-10-31 16:26                     ` Andrei Vagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181021035435.GA21328@gmail.com \
    --to=linux-kselftest@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).