All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrei Vagin <avagin@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	Dmitry Safonov <dima@arista.com>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	Jeff Dike <jdike@addtoit.com>, "x86@kernel.org" <x86@kernel.org>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,
	"criu@openvz.org" <criu@openvz.org>,
	Ingo Molnar <mingo@redhat.com>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Andy Lutomirski <luto@kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Pavel Emelianov <xemul@virtuozzo.com>,
	Shuah Khan <shuah@kernel.org>,
	"containers@lists.linux-foundation.org" 
	<containers@lists.linux-foundation.org>,
	Adrian Reber <adrian@lisas.de>
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: avagin at gmail.com (Andrei Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx at linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers at lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: avagin@gmail.com (Andrei Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
Message-ID: <20181021035436.OosCt59uINa5f8_NsytwDTIr9rc3wUQIK4egsXdQPm0@z> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018@06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018@07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx at linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers at lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

WARNING: multiple messages have this Message-ID (diff)
From: Andrei Vagin <avagin@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	Dmitry Safonov <dima@arista.com>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	Jeff Dike <jdike@addtoit.com>, "x86@kernel.org" <x86@kernel.org>,
	Dmitry Safonov <0x7f454c46@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Oleg Nesterov <oleg@redhat.com>,
	"criu@openvz.org" <criu@openvz.org>,
	Ingo Molnar <mingo@redhat.com>,
	Alexey Dobriyan <adobriyan@gmail.com>,
	Andy Lutomirski <luto@kernel.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Pavel Emelianov <xemul@virtuozzo.com>,
	Shuah Khan <shuah@kernel.org>,
	"containers@lists.linux-foundation.org" <containers@li>
Subject: Re: [RFC 00/20] ns: Introduce Time Namespace
Date: Sat, 20 Oct 2018 20:54:36 -0700	[thread overview]
Message-ID: <20181021035435.GA21328@gmail.com> (raw)
In-Reply-To: <20181021014121.GA23474@gmail.com>

On Sat, Oct 20, 2018 at 06:41:23PM -0700, Andrei Vagin wrote:
> On Fri, Sep 28, 2018 at 07:03:22PM +0200, Eric W. Biederman wrote:
> > Thomas Gleixner <tglx@linutronix.de> writes:
> > 
> > > On Wed, 26 Sep 2018, Eric W. Biederman wrote:
> > >> Reading the code the calling sequence there is:
> > >> tick_sched_do_timer
> > >>    tick_do_update_jiffies64
> > >>       update_wall_time
> > >>           timekeeping_advance
> > >>              timekeepging_update
> > >> 
> > >> If I read that properly under the right nohz circumstances that update
> > >> can be delayed indefinitely.
> > >> 
> > >> So I think we could prototype a time namespace that was per
> > >> timekeeping_update and just had update_wall_time iterate through
> > >> all of the time namespaces.
> > >
> > > Please don't go there. timekeeping_update() is already heavy and walking
> > > through a gazillion of namespaces will just make it horrible,
> > >
> > >> I don't think the naive version would scale to very many time
> > >> namespaces.
> > >
> > > :)
> > >
> > >> At the same time using the techniques from the nohz work and a little
> > >> smarts I expect we could get the code to scale.
> > >
> > > You'd need to invoke the update when the namespace is switched in and
> > > hasn't been updated since the last tick happened. That might be doable, but
> > > you also need to take the wraparound constraints of the underlying
> > > clocksources into account, which again can cause walking all name spaces
> > > when they are all idle long enough.
> > 
> > The wrap around constraints being how long before the time sources wrap
> > around so you have to read them once per wrap around?  I have not dug
> > deeply enough into the code to see that yet.
> > 
> > > From there it becomes hairy, because it's not only timekeeping,
> > > i.e. reading time, this is also affecting all timers which are armed from a
> > > namespace.
> > >
> > > That gets really ugly because when you do settimeofday() or adjtimex() for
> > > a particular namespace, then you have to search for all armed timers of
> > > that namespace and adjust them.
> > >
> > > The original posix timer code had the same issue because it mapped the
> > > clock realtime timers to the timer wheel so any setting of the clock caused
> > > a full walk of all armed timers, disarming, adjusting and requeing
> > > them. That's horrible not only performance wise, it's also a locking
> > > nightmare of all sorts.
> > >
> > > Add time skew via NTP/PTP into the picture and you might have to adjust
> > > timers as well, because you need to guarantee that they are not expiring
> > > early.
> > >
> > > I haven't looked through Dimitry's patches yet, but I don't see how this
> > > can work at all without introducing subtle issues all over the place.
> > 
> > Then it sounds like this will take some more digging.
> > 
> > Please pardon me for thinking out load.
> > 
> > There are one or more time sources that we use to compute the time
> > and for each time source we have a conversion from ticks of the
> > time source to nanoseconds.
> > 
> > Each time source needs to be sampled at least once per wrap-around
> > and something incremented so that we don't loose time when looking
> > at that time source.
> > 
> > There are several clocks presented to userspace and they all share the
> > same length of second and are all fundamentally offsets from
> > CLOCK_MONOTONIC.
> > 
> > I see two fundamental driving cases for a time namespace.
> > 1) Migration from one node to another node in a cluster in almost
> >    real time.
> > 
> >    The problem is that CLOCK_MONOTONIC between nodes in the cluster
> >    has not relation ship to each other (except a synchronized length of
> >    the second).  So applications that migrate can see CLOCK_MONOTONIC
> >    and CLOCK_BOOTTIME go backwards.
> > 
> >    This is the truly pressing problem and adding some kind of offset
> >    sounds like it would be the solution.  Possibly by allowing a boot
> >    time synchronization of CLOCK_BOOTTIME and CLOCK_MONOTONIC.
> > 
> > 2) Dealing with two separate time management domains.  Say a machine
> >    that needes to deal with both something inside of google where they
> >    slew time to avoid leap time seconds and something in the outside
> >    world proper UTC time is kept as an offset from TAI with the
> >    occasional leap seconds.
> > 
> >    In the later case it would fundamentally require having seconds of
> >    different length.
> > 
> 
> I want to add that the second case should be optional.
> 
> When a container is migrated to another host, we have to restore its
> monotonic and boottime clocks, but we still expect that the container
> will continue using the host real-time clock.
> 
> Before stating this series, I was thinking about this, I decided that
> these cases can be solved independently. Probably, the full isolation of
> the time sub-system will have much higher overhead than just offsets for
> a few clocks. And the idea that isolation of the real-time clock should
> be optional gives us another hint that offsets for monotonic and
> boot-time clocks can be implemented independently.
> 
> Eric and Tomas, what do you think about this? If you agree that these

Sorry Thomas, I mistyped your name.

> two cases can be implemented separately, what should we do with this
> series to make it ready to be merged?
> 
> I know that we need to:
> 
> * look at device drivers that report timestamps in CLOCK_MONOTONIC base.
> * forbid changing offsets after creating timers
> 
> Anything else?
> 
> Thanks,
> Andrei
> 
> > 
> > A pure 64bit nanoseond counter is good for 500 years.  So 64bit
> > variables can be used to hold time, and everything can be converted from
> > there.
> > 
> > This suggests we can for ticks have two values.
> > - The number of ticks from the time source.
> > - The number of times the ticks would have rolled over.
> > 
> > That sounds like it may be a little simplistic as it would require being
> > very diligent about firing a timer exactly at rollover and not losing
> > that, but for a handwaving argument is probably enough to generate
> > a 64bit tick counter.
> > 
> > If the focus is on a 64bit tick counter then what update_wall_time
> > has to do is very limited.  Just deal the accounting needed to cope with
> > tick rollover.
> > 
> > Getting the actual time looks like it would be as simple as now, with
> > perhaps an extra addition to account for the number of times the tick
> > counter has rolled over.  With limited precision arithmetic and various
> > optimizations I don't think it is that simple to implement but it feels
> > like it should be very little extra work.
> > 
> > For timers my inclination would be to assume no adjustments to the
> > current time parameters and set the timer to go off then.   If the time
> > on the appropriate clock has been changed since the timer was set and
> > the timer is going off early reschedule so the timer fires at the
> > appropriate time.
> > 
> > With the above I think it is theoretically possible to build a time
> > namespace that supports multiple lengths of second, and does not have
> > much overhead.
> > 
> > Not that I think a final implementation would necessary look like what I
> > have described. I just think it is possible with extreme care to evolve
> > the current code base into something that can efficiently handle
> > multiple time domains with slightly different lenghts of second.
> > 
> > Thomas does it sound like I am completely out of touch with reality?
> > 
> > It does though sound like it is going to take some serious digging
> > through the code to understand how what everything does and how and why
> > everthing works the way it does.  Not something grafted on top with just
> > a cursory understanding of how the code works.
> > 
> > Eric
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/containers

  reply	other threads:[~2018-10-21  3:54 UTC|newest]

Thread overview: 164+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-19 20:50 [RFC 00/20] ns: Introduce Time Namespace Dmitry Safonov
2018-09-19 20:50 ` Dmitry Safonov
2018-09-19 20:50 ` Dmitry Safonov
2018-09-19 20:50 ` dima
2018-09-19 20:50 ` [RFC 01/20] " Dmitry Safonov
2018-09-28 18:20   ` Laurent Vivier
2018-09-19 20:50 ` [RFC 02/20] timens: Add timens_offsets Dmitry Safonov
2018-09-20 18:45   ` Cyrill Gorcunov
2018-09-20 22:14     ` Cyrill Gorcunov
2018-09-19 20:50 ` [RFC 03/20] timens: Introduce CLOCK_MONOTONIC offsets Dmitry Safonov
2018-09-19 20:50 ` [RFC 04/20] timens: Introduce CLOCK_BOOTTIME offset Dmitry Safonov
2018-09-30  3:18   ` [LKP] [timens] 3cc8de9dcb: RIP:posix_get_boottime kernel test robot
2018-09-30  3:18     ` kernel test robot
2018-09-30  3:18     ` [LKP] " kernel test robot
2018-09-19 20:50 ` [RFC 05/20] timerfd/timens: Take into account ns clock offsets Dmitry Safonov
2018-09-19 20:50 ` [RFC 06/20] kernel: Take into account timens clock offsets in clock_nanosleep Dmitry Safonov
2018-09-19 20:50 ` [RFC 07/20] timens: Shift /proc/uptime Dmitry Safonov
2018-09-19 20:50 ` [RFC 08/20] x86/vdso: Restrict splitting vvar vma Dmitry Safonov
2018-09-19 20:50 ` [RFC 09/20] x86/vdso/timens: Add offsets page in vvar Dmitry Safonov
2018-09-19 20:50 ` [RFC 10/20] x86/vdso: Use set_normalized_timespec() to avoid 32 bit overflow Dmitry Safonov
2018-09-19 20:50 ` [RFC 11/20] x86/vdso: Purge timens page on setns()/unshare()/clone() Dmitry Safonov
2018-09-19 20:50 ` [RFC 12/20] x86/vdso: Look for vvar vma to purge timens page Dmitry Safonov
2018-09-19 20:50 ` [RFC 13/20] posix-timers/timens: Take into account clock offsets Dmitry Safonov
2018-09-30  3:11   ` [LKP] [posix] 25217c6e39: BUG:KASAN:null-ptr-deref_in_c kernel test robot
2018-09-30  3:11     ` kernel test robot
2018-09-30  3:11     ` [LKP] " kernel test robot
2018-09-19 20:50 ` [RFC 14/20] timens: Add align for timens_offsets Dmitry Safonov
2018-09-19 20:50 ` [RFC 15/20] timens: Optimize zero-offsets Dmitry Safonov
2018-09-19 20:50 ` [RFC 16/20] selftest: Add Time Namespace test for supported clocks Dmitry Safonov
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50   ` dima
2018-09-24 21:36   ` Shuah Khan
2018-09-24 21:36     ` Shuah Khan
2018-09-24 21:36     ` shuah
2018-09-19 20:50 ` [RFC 17/20] selftest/timens: Add test for timerfd Dmitry Safonov
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50   ` dima
2018-09-19 20:50 ` [RFC 18/20] selftest/timens: Add test for clock_nanosleep Dmitry Safonov
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50   ` dima
2018-09-19 20:50 ` [RFC 19/20] timens/selftest: Add procfs selftest Dmitry Safonov
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50   ` dima
2018-09-19 20:50 ` [RFC 20/20] timens/selftest: Add timer offsets test Dmitry Safonov
2018-09-19 20:50   ` Dmitry Safonov
2018-09-19 20:50   ` dima
2018-09-21 12:27 ` [RFC 00/20] ns: Introduce Time Namespace Eric W. Biederman
2018-09-21 12:27   ` Eric W. Biederman
2018-09-21 12:27   ` ebiederm
2018-09-24 20:51   ` Andrey Vagin
2018-09-24 20:51     ` Andrey Vagin
2018-09-24 20:51     ` Andrey Vagin
2018-09-24 20:51     ` avagin
2018-09-24 22:02     ` Eric W. Biederman
2018-09-24 22:02       ` Eric W. Biederman
2018-09-24 22:02       ` Eric W. Biederman
2018-09-24 22:02       ` ebiederm
2018-09-25  1:42       ` Andrey Vagin
2018-09-25  1:42         ` Andrey Vagin
2018-09-25  1:42         ` Andrey Vagin
2018-09-25  1:42         ` avagin
2018-09-26 17:36         ` Eric W. Biederman
2018-09-26 17:36           ` Eric W. Biederman
2018-09-26 17:36           ` Eric W. Biederman
2018-09-26 17:36           ` ebiederm
2018-09-26 17:59           ` Dmitry Safonov
2018-09-26 17:59             ` Dmitry Safonov
2018-09-26 17:59             ` Dmitry Safonov
2018-09-26 17:59             ` 0x7f454c46
2018-09-27 21:30           ` Thomas Gleixner
2018-09-27 21:30             ` Thomas Gleixner
2018-09-27 21:30             ` Thomas Gleixner
2018-09-27 21:30             ` tglx
2018-09-27 21:41             ` Thomas Gleixner
2018-09-27 21:41               ` Thomas Gleixner
2018-09-27 21:41               ` Thomas Gleixner
2018-09-27 21:41               ` tglx
2018-10-01 23:20               ` Andrey Vagin
2018-10-01 23:20                 ` Andrey Vagin
2018-10-01 23:20                 ` Andrey Vagin
2018-10-01 23:20                 ` avagin
2018-10-02  6:15                 ` Thomas Gleixner
2018-10-02  6:15                   ` Thomas Gleixner
2018-10-02  6:15                   ` Thomas Gleixner
2018-10-02  6:15                   ` tglx
2018-10-02 21:05                   ` Dmitry Safonov
2018-10-02 21:05                     ` Dmitry Safonov
2018-10-02 21:05                     ` 0x7f454c46
2018-10-02 21:26                     ` Thomas Gleixner
2018-10-02 21:26                       ` Thomas Gleixner
2018-10-02 21:26                       ` tglx
2018-09-28 17:03             ` Eric W. Biederman
2018-09-28 17:03               ` Eric W. Biederman
2018-09-28 17:03               ` Eric W. Biederman
2018-09-28 17:03               ` ebiederm
2018-09-28 19:32               ` Thomas Gleixner
2018-09-28 19:32                 ` Thomas Gleixner
2018-09-28 19:32                 ` Thomas Gleixner
2018-09-28 19:32                 ` tglx
2018-10-01  9:05                 ` Eric W. Biederman
2018-10-01  9:05                   ` Eric W. Biederman
2018-10-01  9:05                   ` Eric W. Biederman
2018-10-01  9:05                   ` ebiederm
2018-10-01  9:15                 ` Setting monotonic time? Eric W. Biederman
2018-10-01  9:15                   ` Eric W. Biederman
2018-10-01  9:15                   ` Eric W. Biederman
2018-10-01  9:15                   ` ebiederm
2018-10-01 18:52                   ` Thomas Gleixner
2018-10-01 18:52                     ` Thomas Gleixner
2018-10-01 18:52                     ` Thomas Gleixner
2018-10-01 18:52                     ` tglx
2018-10-02 20:00                     ` Arnd Bergmann
2018-10-02 20:00                       ` Arnd Bergmann
2018-10-02 20:00                       ` arnd
2018-10-02 20:06                       ` Thomas Gleixner
2018-10-02 20:06                         ` Thomas Gleixner
2018-10-02 20:06                         ` tglx
2018-10-03  4:50                         ` Eric W. Biederman
2018-10-03  4:50                           ` Eric W. Biederman
2018-10-03  4:50                           ` ebiederm
2018-10-03  5:25                           ` Thomas Gleixner
2018-10-03  5:25                             ` Thomas Gleixner
2018-10-03  5:25                             ` tglx
2018-10-03  6:14                             ` Eric W. Biederman
2018-10-03  6:14                               ` Eric W. Biederman
2018-10-03  6:14                               ` ebiederm
2018-10-03  7:02                               ` Arnd Bergmann
2018-10-03  7:02                                 ` Arnd Bergmann
2018-10-03  7:02                                 ` arnd
2018-10-03  6:14                             ` Thomas Gleixner
2018-10-03  6:14                               ` Thomas Gleixner
2018-10-03  6:14                               ` tglx
2018-10-01 20:51                   ` Andrey Vagin
2018-10-01 20:51                     ` Andrey Vagin
2018-10-01 20:51                     ` Andrey Vagin
2018-10-01 20:51                     ` avagin
2018-10-02  6:16                     ` Thomas Gleixner
2018-10-02  6:16                       ` Thomas Gleixner
2018-10-02  6:16                       ` Thomas Gleixner
2018-10-02  6:16                       ` tglx
2018-10-21  1:41               ` [RFC 00/20] ns: Introduce Time Namespace Andrei Vagin
2018-10-21  1:41                 ` Andrei Vagin
2018-10-21  1:41                 ` Andrei Vagin
2018-10-21  1:41                 ` avagin
2018-10-21  3:54                 ` Andrei Vagin [this message]
2018-10-21  3:54                   ` Andrei Vagin
2018-10-21  3:54                   ` Andrei Vagin
2018-10-21  3:54                   ` avagin
2018-10-29 20:33                 ` Thomas Gleixner
2018-10-29 20:33                   ` Thomas Gleixner
2018-10-29 20:33                   ` Thomas Gleixner
2018-10-29 20:33                   ` tglx
2018-10-29 21:21                   ` Eric W. Biederman
2018-10-29 21:21                     ` Eric W. Biederman
2018-10-29 21:21                     ` Eric W. Biederman
2018-10-29 21:21                     ` ebiederm
2018-10-29 21:36                     ` Thomas Gleixner
2018-10-29 21:36                       ` Thomas Gleixner
2018-10-29 21:36                       ` Thomas Gleixner
2018-10-29 21:36                       ` tglx
2018-10-31 16:26                   ` Andrei Vagin
2018-10-31 16:26                     ` Andrei Vagin
2018-10-31 16:26                     ` Andrei Vagin
2018-10-31 16:26                     ` avagin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181021035435.GA21328@gmail.com \
    --to=avagin@gmail.com \
    --cc=0x7f454c46@gmail.com \
    --cc=adobriyan@gmail.com \
    --cc=adrian@lisas.de \
    --cc=christian.brauner@ubuntu.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=criu@openvz.org \
    --cc=dima@arista.com \
    --cc=ebiederm@xmission.com \
    --cc=gorcunov@openvz.org \
    --cc=hpa@zytor.com \
    --cc=jdike@addtoit.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.