From: avagin at virtuozzo.com (Andrey Vagin)
Subject: [RFC 00/20] ns: Introduce Time Namespace
Date: Tue, 25 Sep 2018 01:42:02 +0000
Message-ID: <20180925014150.GA6302@outlook.office365.com>
In-Reply-To: <874leezh8n.fsf@xmission.com>

On Tue, Sep 25, 2018 at 12:02:32AM +0200, Eric W. Biederman wrote:
> Andrey Vagin <avagin at virtuozzo.com> writes:
>
> > On Fri, Sep 21, 2018 at 02:27:29PM +0200, Eric W. Biederman wrote:
> >> Dmitry Safonov <dima at arista.com> writes:
> >>
> >> > Discussions around time virtualization have been around for a long time.
> >> > The first attempt to implement a time namespace was in 2006 by Jeff Dike.
> >> > Since then, the topic has appeared on and off in various discussions.
> >> >
> >> > There are two main use cases for time namespaces:
> >> > 1. change date and time inside a container;
> >> > 2. adjust clocks for a container restored from a checkpoint.
> >> >
> >> > “It seems like this might be one of the last major obstacles keeping
> >> > migration from being used in production systems, given that not all
> >> > containers and connections can be migrated as long as a time dependency
> >> > is capable of messing it up.” (by github.com/dav-ell)
> >> >
> >> > The kernel provides access to several clocks: CLOCK_REALTIME,
> >> > CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
> >> > their start points are not defined and are different for each running
> >> > system. When a container is migrated from one node to another, all
> >> > clocks have to be restored into consistent states; in other words, they
> >> > have to continue running from the same points where they have been
> >> > dumped.
> >> >
> >> > The main idea behind this patch set is adding per-namespace offsets for
> >> > system clocks. When a process in a non-root time namespace requests
> >> > time of a clock, a namespace offset is added to the current value of
> >> > this clock on a host and the sum is returned.
> >> >
> >> > All offsets are placed on a separate page, this allows us to map it as
> >> > part of vvar into user processes and use offsets from vdso calls.
> >> >
> >> > Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
> >> > clocks.
> >> >
> >> > Questions to discuss:
> >> >
> >> > * Clone flags exhaustion. Currently there is only one unused clone flag
> >> > bit left, and it may be worth using it to extend the arguments of the
> >> > clone system call.
> >> >
> >> > * Realtime clock implementation details:
> >> >   Is having a simple offset enough?
> >> >   What to do when date and time is changed on the host?
> >> >   Is there a need to adjust vfs modification and creation times?
> >> >   Implementation for adjtime() syscall.
> >>
> >> Overall I support this effort. In my quick skim this code looked good.
> >
> > Hi Eric,
> >
> > Thank you for the feedback.
> >
> >>
> >> My feeling is that we need to be able to support running ntpd and
> >> support one namespace doing Google's smoothing of leap seconds while
> >> another namespace takes the leap second.
> >>
> >> What I was imagining when I was last thinking about this was one
> >> instance of struct timekeeper aka tk_core per time namespace. That
> >> structure already keeps offsets for all of the various clocks from
> >> the kernel internal time sources. What would be needed would be to
> >> pass in an appropriate time namespace pointer.
> >>
> >> I could be completely wrong as I have not taken the time to completely
> >> trace through the code. Have you looked at pushing the time namespace
> >> down as far as tk_core?
> >>
> >> What I think would be the big advantage (besides ntp working) is that
> >> the bulk of the code could be reused. Allowing testing of the kernel's
> >> time code by setting up a new time namespace. So a person in production
> >> could set up a time namespace with the time set ahead a little bit and
> >> be able to verify that the kernel handles the upcoming leap second
> >> properly.
> >>
> >
> > It is an interesting idea, but I have a few questions:
> >
> > 1. Does it mean that timekeeping_update() will be called for each
> > namespace? This function is called periodically, it updates times on the
> > timekeeper structure, updates vsyscall_gtod_data, etc. What will be the
> > overhead of this?
>
> I don't know if periodically is a proper characterization. There may be
> a code path that does that. But from what I can see timekeeping_update
> is the guts of settimeofday (and a few related functions).
>
> So it appears to make sense for timekeeping_update to be per namespace.
>
> Hmm. Looking at what is updated in the vsyscall_gtod_data it does
> look like you would have to periodically update things, but I don't know
> how big that period would be. As long as the period is reasonably large,
> or the time namespaces were sufficiently desynchronized it should not
> be a problem. But that is the class of problem that could make
> my ideal impractical if there is measurable overhead.
>
> Where were you seeing timekeeping_update being called periodically?

timekeeping_update() is called HZ times per second:

[   67.912858]  timekeeping_update.cold.26+0x5/0xa
[   67.913332]  timekeeping_advance+0x361/0x5c0
[   67.913857]  ? tick_sched_do_timer+0x55/0x70
[   67.914409]  ? tick_sched_do_timer+0x70/0x70
[   67.914947]  tick_sched_do_timer+0x55/0x70
[   67.915505]  tick_sched_timer+0x27/0x70
[   67.916042]  __hrtimer_run_queues+0x10f/0x440
[   67.916639]  hrtimer_interrupt+0x100/0x220
[   67.917305]  smp_apic_timer_interrupt+0x79/0x220
[   67.918030]  apic_timer_interrupt+0xf/0x20

>
> > 2. What will we do with vdso? It looks like we will have to have a
> > separate vsyscall_gtod_data for each ns and update each of them
> > separately.
>
> Yes. But you don't have to introduce another variable, just make
> certain vsyscall_gtod_data is a page aligned thing per time namespace.
>
> If I read the summary of the existing patchset something very similar
> is already going on.

I mean that vsyscall_gtod_data holds data that is updated often: it has
timestamps for the monotonic and wall clocks. clock_gettime() reads a
timestamp from vsyscall_gtod_data and then uses the TSC to approximate
the current value of the clock.

Actually, this is not a second question; it is part of the first one.
update_vsyscall() is called from timekeeping_update().

>
> Each process would only map one. And unshare of the time namespace
> would need to act like the pid namespace or be limited to only being
> allowed when there is only a single task using the mm.
>
> Eric
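
For readers following along, the two mechanisms discussed in this message
can be sketched in plain userspace C. This is only an illustration under
stated assumptions, not code from the patch set or from the kernel: the
names timens_apply_offset_sketch() and gtod_extrapolate_sketch() are
hypothetical, as are the mult/shift parameters used to scale the cycle
delta. The first function shows the "host clock value plus namespace
offset" idea from the cover letter. The second shows, roughly, why the
data behind clock_gettime() must be refreshed often: the reader takes a
stored base timestamp and extrapolates from it with a cycle counter. The
sketch omits the locking and overflow handling that real timekeeping
code needs.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/*
 * Hypothetical helper: return the clock value a task in a time namespace
 * would see, i.e. the host value plus the namespace offset.
 * Assumes |tv_nsec| < 1e9 on both inputs, so one carry or borrow suffices.
 */
static inline struct timespec
timens_apply_offset_sketch(struct timespec host, struct timespec offset)
{
	struct timespec sum;

	sum.tv_sec = host.tv_sec + offset.tv_sec;
	sum.tv_nsec = host.tv_nsec + offset.tv_nsec;

	/* Normalize tv_nsec back into [0, 1e9). */
	if (sum.tv_nsec >= NSEC_PER_SEC_SKETCH) {
		sum.tv_sec++;
		sum.tv_nsec -= NSEC_PER_SEC_SKETCH;
	} else if (sum.tv_nsec < 0) {
		sum.tv_sec--;
		sum.tv_nsec += NSEC_PER_SEC_SKETCH;
	}
	return sum;
}

/*
 * Hypothetical helper: extrapolate the current time in nanoseconds from a
 * stored base timestamp and a cycle counter reading. The mult/shift pair
 * converts cycles to nanoseconds as (delta * mult) >> shift.
 * Assumes delta * mult does not overflow 64 bits, which only holds if the
 * base timestamp is refreshed frequently enough.
 */
static inline uint64_t
gtod_extrapolate_sketch(uint64_t base_ns, uint64_t cycle_last,
			uint64_t cycles_now, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);

	uint64_t delta = cycles_now - cycle_last;

	return base_ns + ((delta * mult) >> shift);
}
```

In a real caller the host value passed to the first helper would come
from something like clock_gettime(CLOCK_MONOTONIC, ...). If every time
namespace had its own copy of the base timestamp, each copy would need
the periodic refresh, which is the overhead question raised above.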