Message-ID: <4F9EFC70.50405@linaro.org>
Date: Mon, 30 Apr 2012 13:56:16 -0700
From: John Stultz
To: Richard Cochran
CC: linux-kernel@vger.kernel.org, Thomas Gleixner
Subject: Re: [PATCH RFC V1 0/5] Rationalize time keeping

On 04/28/2012 01:04 AM, Richard Cochran wrote:
> On Fri, Apr 27, 2012 at 03:49:51PM -0700, John Stultz wrote:
>> On 04/27/2012 01:12 AM, Richard Cochran wrote:
>>> * Benefits
>>>    - Fixes the buggy, inconsistent time reporting surrounding a leap
>>>      second event.
>> Just to clarify this, so we've got the right scope on the problem,
>> you're trying to address the fact that the leap second is not
>> actually applied until the tick after the leap second, correct?
> That is one problem, yes.
>
>> Where basically you can see small offsets like:
> I can synchronize over the network to under 100 nanoseconds, so to me,
> one second is a large offset.

Well, the leap-offset is a second, but when it arrives is only
tick-accurate.
:)

>> My only concern is how we
>> manage it along with possible smeared-leap-seconds ala:
>> http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html
>>
>> ( I shudder at the idea of managing two separate frequency
>> corrections for different time domains).
> Are you planning to implement that? This approach is by no means
> universally accepted.

No no. I have no plans there.

> In my view, what Google is doing is hack (albeit a sensible for
> business applications). For test and measurement or scientific
> applications, it does not make sense to introduce artificial frequency
> errors in this way.

True, although even if it is a hack, Google *is* using it. My concern
is that if CLOCK_REALTIME is smeared to avoid a leap second jump, in
that environment we cannot also accurately provide a correct CLOCK_TAI.
So far that's not been a problem, because CLOCK_TAI isn't a clockid we
yet support. But the expectations bar always rises, so I suspect once
we have a CLOCK_TAI, someone will want us to handle smeared leap
seconds without affecting CLOCK_TAI's correctness.

> Another variant of this idea: http://www.cl.cam.ac.uk/~mgk25/time/utc-sls/
>
> Here is a nice quote from that page:
>
>   All other objections to UTC-SLS that I heard were not directed
>   against its specific design choices, but against the (very well
>   established) practice of using UTC at all in the applications that
>   this proposal targets:
>
>   * Some people argue that operating system interfaces, such as the
>     POSIX "seconds since the epoch" scale used in time_t APIs, should
>     be changed from being an encoding of UTC to being an encoding of
>     the leap-second free TAI timescale.
>
>   * Some people want to go even further and abandon UTC and leap
>     seconds entirely, detach all civilian time zones from the
>     rotation of Earth, and redefine them purely based on atomic time.
>
>   While these people are usually happy to agree that UTC-SLS is a
>   sensible engineering solution as long as UTC remains the main time
>   basis of distributed computing, they argue that this is just a
>   workaround that will be obsolete once their grand vision of giving
>   up UTC entirely has become true, and that it is therefore just an
>   unwelcome distraction from their ultimate goal.

I think this last point is very telling. Neither of the above options
is really viable in my mind, as I don't see any real consensus on
giving up UTC. What is in practice is actually way more important than
where folks wish things would go.

> Until the whole world agrees to this "work around" I think we should
> stick to current standards. If and when this practice becomes
> standardized (I'm not holding my breath), then we could simply drop
> the internal difference between the kernel time scale and UTC, and
> steer out the leap second synchronously with the rest of the world.

Well, I think that Google shows some folks are starting to use
workarounds like smeared-leap-seconds/UTC-SLS. So it's something we
should watch carefully, and expect more folks to follow. It's true that
you don't want to mix UTC-SLS and standard UTC time domains, but it's
likely this will be a site-specific configuration. So it's a concern
when a correct CLOCK_TAI would be incompatible on systems using these
hacks/workarounds.

>>> * Performance Impacts
>>> ** con
>>>    - Small extra cost when reading the time (one integer addition plus
>>>      one integer test).
>> This may not be so small when it comes to folks who are very
>> concerned about the clock_gettime hotpath.
> If you would support the option to only insert leap seconds, then the
> cost is one integer addition and one integer test.

*Any* extra work is a big deal to folks who are sensitive to
clock_gettime performance. That said, I don't see why it's more
complicated to also handle leap removal?
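
To make the insert-versus-removal question concrete, here is a minimal
userspace sketch of deriving a UTC second from a continuous TAI second.
It is not code from the patch set: the struct, its field names, and both
functions are hypothetical, and the offset is stored as TAI minus UTC
(the opposite sign of a tai_to_utc addend). Under those assumptions,
insertion and removal differ only in the value of offset_after, so the
read path stays at one integer test plus one integer subtraction either
way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the leap state (names are illustrative only). */
struct leap_info {
    int64_t leap_tai_sec;  /* first TAI second at which the new offset applies */
    int32_t offset_before; /* TAI - UTC in seconds before the event */
    int32_t offset_after;  /* offset_before + 1 (insert) or - 1 (removal) */
};

/* One comparison and one subtraction, whichever direction the leap goes. */
static inline int64_t tai_to_utc_sec(const struct leap_info *li, int64_t tai_sec)
{
    assert(li != NULL);
    int32_t off = (tai_sec >= li->leap_tai_sec) ? li->offset_after
                                                : li->offset_before;
    return tai_sec - off;
}

/*
 * True only during an inserted leap second, i.e. the second in which the
 * UTC value of 23:59:59 is reported a second time because time_t cannot
 * represent 23:59:60.
 */
static inline bool leap_second_in_progress(const struct leap_info *li,
                                           int64_t tai_sec)
{
    assert(li != NULL);
    return li->offset_after > li->offset_before &&
           tai_sec == li->leap_tai_sec;
}
```

With an insertion the UTC value repeats for one second; with a removal
it skips one. This sketch only covers whole seconds and says nothing
about how the values would be exported to the vsyscall/vdso paths.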
> Also, once we have a rational time interface (like CLOCK_TAI), then
> time sensitive will want to use that instead anyhow.

Well, performance sensitive and correctness sensitive are two different
things. :) I think CLOCK_TAI is much cleaner for things, but at the
same time, the world "thinks" in UTC, and converting between them isn't
always trivial (very similar to the timezone presentation layer, which
isn't fun). So I'd temper any hopes of mass conversion. :)

>> Further, the correction will be needed to be made in the vsyscall
>> paths, which isn't done with your current patchset (causing userland
>> to see different time values than what kernel space calculates).
> Do you mean __current_kernel_time? What did I miss?

No. So, on architectures that support vsyscalls/vdso (x86_64, powerpc,
ia64, and maybe a few others) getnstimeofday() is really only an
internal interface for in-kernel access. Userland uses the
vsyscall/vdso interface to be able to read the time completely from
userland context (with no syscall overhead). Since this is done in
different ways for each architecture, you need to export the proper
information out via update_vsyscall() and also update the arch-specific
vsyscall gettimeofday paths (which is non-trivial, as some arches are
implemented in asm, etc - my sympathies here, it's a pain).

>> One possible thing to consider? Since the TIME_OOP flag is only
>> visible via the adjtimex() interface, maybe it alone should have the
>> extra overhead of the conditional?
> This would mean that you would have to do the conditional somehow
> backwards in order to provide TAI time values. To me, the logical way
> is to keep a continuous time scale, and then compute UTC from it.

? Not sure I'm following you here.
What I'm recommending is this: even if you rework the kernel so that it
constructs time as follows:

    CLOCK_TAI = CLOCK_MONOTONIC + monotonic_to_tai
    CLOCK_REALTIME = CLOCK_TAI + tai_to_utc

the adjustment made to tai_to_utc by the leap second would still be
applied at tick time, but the logic to avoid the sub-tick inconsistency
at the second edge would be added only to the adjtimex() interface.
Thus folks who really care about leap seconds, who already need to use
adjtimex() in order to detect the TIME_OOP flag, would get the very
correct time value, but the performance sensitive users of
clock_gettime wouldn't be affected.

>> I'm not excited about the
>> gettimeofday field returned by adjtimex not matching what
>> gettimeofday actually provides for that single-tick interval, but
>> maybe its a reasonable middle ground?
> Not sure what you mean, but to me it is not acceptable to deliver
> inconsistent time values to userspace!

For users of clock_gettime/gettimeofday, a leapsecond is an
inconsistency. Neither interface provides a way to detect that the
TIME_OOP flag is set and that it's not 23:59:59 again, but 23:59:60
(which can't be represented by a time_t). Thus even if the behavior was
perfect, and the leapsecond landed at exactly the second edge, it is
still a time hiccup to most applications anyway.

Thus, most of userland doesn't really care if the hiccup happens up to
a tick after the second's edge. They don't expect it anyway. So they
really don't want a constant performance drop in order for the hiccup
to be more "correct" when it happens. :)

That's why I'm suggesting that you consider starting by modifying the
adjtimex() interface. Any application that actually cares about
leapseconds should be using adjtimex(), since it's the only interface
that allows you to realize that's what's happening. It's not a
performance optimized path, and so it's a fine candidate for being
slow-but-correct.
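
As an illustration of the userspace side of that suggestion, the sketch
below reads the time and the leap state in one adjtimex() call, using
the glibc wrapper from <sys/timex.h>. The two helper names are made up
for this example. Whether the returned tx.time would be corrected to the
second edge is exactly what is being proposed above, not something this
code can guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/time.h>
#include <sys/timex.h>

/* Describe the clock state that adjtimex() returns (helper is hypothetical). */
static const char *leap_state_name(int state)
{
    switch (state) {
    case TIME_OK:    return "no leap second pending";
    case TIME_INS:   return "leap second insertion pending";
    case TIME_DEL:   return "leap second deletion pending";
    case TIME_OOP:   return "leap second in progress";
    case TIME_WAIT:  return "leap second has occurred";
    case TIME_ERROR: return "clock not synchronized";
    default:         return "adjtimex failed";
    }
}

/*
 * Read-only query: with modes == 0 nothing is adjusted, so no special
 * privilege is needed. Stores the kernel's time in *now and returns the
 * clock state, or -1 with errno set if the call fails.
 */
static int read_time_and_leap_state(struct timeval *now)
{
    struct timex tx;

    assert(now != NULL);
    memset(&tx, 0, sizeof(tx));
    int state = adjtimex(&tx);
    if (state >= 0)
        *now = tx.time;
    return state;
}
```

A caller that sees TIME_OOP knows the seconds value in *now belongs to
the leap second itself, which a plain clock_gettime() or gettimeofday()
caller cannot tell.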
My only concern there is that it would cause problems when mixing
adjtimex() calls with clock_gettime() calls, because you could have a
tick-length of time when they report different time values. But this
may be acceptable.

>>> ** pro
>>>    - Removes repetitive, periodic division (secs % 86400 == 0) the whole
>>>      day long preceding a leap second.
>>>    - Cost of maintaining leap second status goes to the user of the
>>>      NTP adjtimex() interface, if any.
>> Not sure I follow this last point. How are we pushing this
>> maintenance to adjtimex() users?
> Only adjtimex calls timekeeper_gettod_status, where the leap second is
> calculated, outside of timekeeper.lock, on the NTP user space's kernel
> time.

So it's not really a cost-of-maintaining, but a cost-of-calculation. We
only calculate the next leap second when it's provided via adjtimex,
rather than doing the check periodically in the kernel.

> In current Linux, the modulus is done in update_wall_time and
> logarithmic_accumulation, on kernel time.
>
>>> * Todo
>>>    - The function __current_kernel_time accesses the time variables
>>>      without taking the lock. I can't figure that out.
>>>
>> There's a few cases where we want the current second value when we
>> already hold the xtime_lock, or we might possibly hold the
>> xtime_lock. It's a special internal interface for special users
>> (update_vsyscall, for example).
> What about kdb_summary?

I don't know the kdb patch especially well, but I suspect kdb_summary
might be triggered at unexpected times if you're trying to debug a
remote kernel. Thus we want to be able to get the time_t value (which
can be read safely without a lock on most systems) without trying to
grab a lock that might be held. This avoids deadlock should kdb be
blocking the lock-holder from running.

thanks
-john