From: avagin at gmail.com (Andrei Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700
Message-ID: <20181021035435.GA21328@gmail.com>
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx at linutronix.de> writes:
> >
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >> tick_do_update_jiffies64
> > >> update_wall_time
> > >> timekeeping_advance
> > >> timekeeping_update
> > >>
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >>
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> >
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around? I have not dug
> > deeply enough into the code to see that yet.
> >
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeuing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dmitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> >
> > Then it sounds like this will take some more digging.
> >
> > Please pardon me for thinking out loud.
> >
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> >
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't lose time when looking
> > at that time source.
> >
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> >
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> > real time.
> >
> > The problem is that CLOCK_MONOTONIC between nodes in the cluster
> > has no relationship to each other (except a synchronized length of
> > the second). So applications that migrate can see CLOCK_MONOTONIC
> > and CLOCK_BOOTTIME go backwards.
> >
> > This is the truly pressing problem and adding some kind of offset
> > sounds like it would be the solution. Possibly by allowing a boot
> > time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> >
> > 2) Dealing with two separate time management domains. Say a machine
> > that needs to deal with both something inside of Google where they
> > slew time to avoid leap time seconds and something in the outside
> > world where proper UTC time is kept as an offset from TAI with the
> > occasional leap seconds.
> >
> > In the latter case it would fundamentally require having seconds of
> > different length.
> >
>
> I want to add that the second case should be optional.
>
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
>
> Before starting this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
>
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
>
> I know that we need to:
>
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
>
> Anything else?
>
> Thanks,
> Andrei
>
> >
> > A pure 64bit nanosecond counter is good for 500 years. So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> >
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> >
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> >
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited. Just deal with the accounting needed to cope
> > with tick rollover.
> >
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over. With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> >
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then. If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> >
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> >
> > Not that I think a final implementation would necessarily look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lengths of second.
> >
> > Thomas does it sound like I am completely out of touch with reality?
> >
> > It does though sound like it is going to take some serious digging
> > through the code to understand what everything does and how and why
> > everything works the way it does. Not something grafted on top with just
> > a cursory understanding of how the code works.
> >
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers at lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
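
For illustration, the "offsets for a few clocks" approach discussed above for the migration case might look roughly like the C sketch below. It is only a sketch under stated assumptions, not code from the patch series: the struct and function names (timens_offsets_sketch, timens_offset_for_restore, timens_apply) are hypothetical, times are plain signed nanosecond counts, and the real-time clock is deliberately left without an offset, as the message proposes.

```c
#include <stdint.h>

/* Hypothetical per-namespace state: one offset per virtualized clock.
 * CLOCK_REALTIME has no entry because the container is expected to
 * keep using the host real-time clock. */
struct timens_offsets_sketch {
    int64_t monotonic_ns; /* added to the host CLOCK_MONOTONIC */
    int64_t boottime_ns;  /* added to the host CLOCK_BOOTTIME */
};

/* Offset chosen once at restore time on the destination node:
 * (value the container last saw on the source node) minus
 * (current value of the same clock on the destination host). */
static int64_t timens_offset_for_restore(int64_t saved_ns, int64_t host_now_ns)
{
    return saved_ns - host_now_ns;
}

/* Time as seen inside the namespace: host time plus the fixed offset. */
static int64_t timens_apply(int64_t host_ns, int64_t offset_ns)
{
    return host_ns + offset_ns;
}
```

Because the offset is a constant added to a host clock that never goes backwards, the namespace view continues from the saved value and also never goes backwards, which is the property the migration case needs.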
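
The two-value scheme Eric describes in the quoted text (raw ticks from the time source plus a count of rollovers) can be sketched in C as follows. This is a hand-waving illustration in the same spirit as the original argument, not kernel code: the names tick_ext, tick_ext_init and tick_ext_read are hypothetical, the counter width is assumed to be between 1 and 63 bits, and the sketch assumes the counter is sampled at least once per wrap period, which is exactly the constraint the thread points out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state for one wrapping hardware counter. */
struct tick_ext {
    unsigned bits;      /* counter width in bits, assumed 1..63 */
    uint64_t last_raw;  /* raw value seen at the previous sample */
    uint64_t rollovers; /* number of wraps observed so far */
};

static void tick_ext_init(struct tick_ext *t, unsigned bits, uint64_t raw)
{
    assert(bits >= 1 && bits <= 63);
    t->bits = bits;
    t->last_raw = raw & ((UINT64_C(1) << bits) - 1);
    t->rollovers = 0;
}

/* Fold a new raw sample into a 64-bit tick count.
 * Correct only if called at least once per wrap period; a missed
 * wrap is silently lost. */
static uint64_t tick_ext_read(struct tick_ext *t, uint64_t raw)
{
    raw &= (UINT64_C(1) << t->bits) - 1;
    if (raw < t->last_raw)  /* counter went backwards: it wrapped */
        t->rollovers++;
    t->last_raw = raw;
    return (t->rollovers << t->bits) | raw;
}
```

With an 8-bit counter initialized at raw value 250, a later sample of 5 is treated as one wrap and yields 256 + 5 = 261 extended ticks.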