LKML Archive on lore.kernel.org
From: John Stultz <john.stultz@linaro.org>
To: Karsten Blees <karsten.blees@gmail.com>
Cc: lkml <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH] time.c::timespec_trunc: fix nanosecond file time rounding
Date: Tue, 16 Jun 2015 16:08:12 -0700
Message-ID: <CALAqxLVSKSgf2+c1oHdSv9aH7gUjE9=q-E-j81DGsPNM1LoH9g@mail.gmail.com>
In-Reply-To: <5580A586.7060202@gmail.com>

On Tue, Jun 16, 2015 at 3:39 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
> Am 16.06.2015 um 19:07 schrieb John Stultz:
>> On Tue, Jun 9, 2015 at 10:36 AM, Karsten Blees <karsten.blees@gmail.com> wrote:
>>> From: Karsten Blees <blees@dcon.de>
>>> Date: Tue, 9 Jun 2015 10:50:28 +0200
>>>
>>> The rounding optimization in timespec_trunc() is based on the incorrect
>>> assumptions that current_kernel_time() is rounded to jiffies resolution,
>>> and that jiffies resolution is a multiple of all potential file time
>>> granularities.
>>
>> Sorry, this is a little opaque on the first read. You're saying that
>> there are filesystems where the on-disk granularity is smaller than a
>> tick/jiffy, but larger than a nanosecond, right?
>>
>
> Yes, examples include CIFS, NTFS (100 ns) and CEPH, UDF (1000 ns).

Thanks. Adding these concrete examples to the commit message would be good.


> The current code assumes that rounding can be avoided if (gran <= ns_per_tick).
>
> However, this optimization is only valid if:
>
> 1. current_kernel_time().tv_nsec is already rounded to tick resolution.
>    E.g. with HZ=1000 you would get tv_nsec = 1000000, 2000000, 3000000, but
>    never 1000001. AFAICT this is not true; current_kernel_time() may be
>    incremented only once per tick, but it's not rounded to tick resolution.
>
> 2. ns_per_tick is evenly divisible by gran, for all potential HZ and
>    granularity values. IOW "(ns_per_tick % gran) == 0". This may have been
>    true for HZ=100, 250, 1000, but not for HZ=300. E.g. if assumption 1
>    above was true, HZ=300 would give you tv_nsec = 3333333, 6666666,
>    9999999... This would definitely need to be rounded to e.g. UDF
>    resolution, even though (1000 <= 3333333) is clearly true.
>
>>> Thus, sub-second portions of in-core file times are not rounded to on-disk
>>> granularity. I.e. file times may change when the inode is re-read from disk
>>> or when the file system is remounted.
>>>
>>> File systems with on-disk resolutions of exactly 1 ns or 1 s are not
>>> affected by this.
>>>
>>> Steps to reproduce with e.g. UDF:
>>>
>>>   $ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
>>>   $ mkdir udf && mount udfdisk udf
>>>   $ touch udf/test && stat -c %y udf/test
>>>   2015-06-09 10:22:56.130006767 +0200
>>>   $ umount udf && mount udfdisk udf
>>>   $ stat -c %y udf/test
>>>   2015-06-09 10:22:56.130006000 +0200
>>>
>>> Remounting rounds the mtime to 1µs.
>>>
>>> Fix the rounding in timespec_trunc() and update the documentation.
>>>
>>> Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
>>> as struct super_block.s_time_gran isn't prepared to handle different
>>> ctime / mtime / atime resolutions nor resolutions > 1 second.
>>>
>>> Signed-off-by: Karsten Blees <blees@dcon.de>
>>> ---
>>>
>>> This issue came up in a recent discussion on the git ML about enabling
>>> nanosecond file times on Windows, see
>>>
>>> http://thread.gmane.org/gmane.comp.version-control.msysgit/21290/focus=21315
>>>
>>>
>>>  kernel/time/time.c | 17 ++++-------------
>>>  1 file changed, 4 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/kernel/time/time.c b/kernel/time/time.c
>>> index 972e3bb..362ee06 100644
>>> --- a/kernel/time/time.c
>>> +++ b/kernel/time/time.c
>>> @@ -287,23 +287,14 @@ EXPORT_SYMBOL(jiffies_to_usecs);
>>>   * @t: Timespec
>>>   * @gran: Granularity in ns.
>>>   *
>>> - * Truncate a timespec to a granularity. gran must be smaller than a second.
>>> - * Always rounds down.
>>> - *
>>> - * This function should be only used for timestamps returned by
>>> - * current_kernel_time() or CURRENT_TIME, not with do_gettimeofday() because
>>> - * it doesn't handle the better resolution of the latter.
>>> + * Truncate a timespec to a granularity. gran must not be greater than a
>>> + * second (10^9 ns). Always rounds down.
>>>   */
>>>  struct timespec timespec_trunc(struct timespec t, unsigned gran)
>>>  {
>>> -       /*
>>> -        * Division is pretty slow so avoid it for common cases.
>>> -        * Currently current_kernel_time() never returns better than
>>> -        * jiffies resolution. Exploit that.
>>> -        */
>>> -       if (gran <= jiffies_to_usecs(1) * 1000) {
>>> +       if (gran <= 1) {
>>>                 /* nothing */
>>
>> So this change will in effect, cause us to truncate where granularity
>> was less than one tick, where before we didn't do anything. Have you
>> reviewed all users to ensure this is safe (I assume you have, but it
>> might be good to describe which users are affected in the commit
>> message)?
>>
>>
>
> timespec_trunc() is exclusively used to calculate inode's [acm]time.
> It is mostly called through current_fs_time(), only a handful of fs
> drivers use it directly (but always with super_block.s_time_gran as
> second argument).
>
> So I think changing the function to do what the documentation says it
> does should be safe...

Yea, though existing behavior is often more "expected" than documented
behavior. :)


>
>>> -       } else if (gran == 1000000000) {
>>> +       } else if (gran >= 1000000000) {
>>>                 t.tv_nsec = 0;
>>
>> While the code (which is quite old) wasn't super intuitive, this looks
>> to be making it more subtle instead of more clear. So if the
>> granularity is larger than a second, we just truncate to a second?
>> That seems surprising. If handling granularity larger than a second
>> isn't supported, we should probably make that explicit and add a
>> WARN_ON to catch problematic users of the function.
>
> Indeed, I changed this to catch invalid arguments (similar to how
> "gran <= 1" catches 0 and thus prevents division by zero).
>
> What about this instead?
>
>         if (gran == 1) {
>                 /* nothing */
>         } else if (gran == 1000000000) {
>                 t.tv_nsec = 0;
>         } else if (gran < 1 || gran > 1000000000) {
>                 WARN_ON(1);
>         } else {
>                 t.tv_nsec -= t.tv_nsec % gran;
>         }
>         return t;

Logically it's OK. I might suggest cleaning it up as:

if ((gran < 1) || (gran > NSEC_PER_SEC))
    WARN_ON(1);     /* catch invalid granularity values */
else if (gran == NSEC_PER_SEC)
    t.tv_nsec = 0;  /* special case to avoid div */
else if ((gran > 1) && (gran < NSEC_PER_SEC))
    t.tv_nsec -= t.tv_nsec % gran;
return t;

Also it would be good to make it clear in the function comment that
gran > NSEC_PER_SEC are invalid.

thanks
-john


Thread overview: 6+ messages
2015-06-09 17:36 Karsten Blees
2015-06-16 17:07 ` John Stultz
2015-06-16 22:39   ` Karsten Blees
2015-06-16 23:08     ` John Stultz [this message]
2015-06-25 12:13       ` [PATCH v2] " Karsten Blees
2015-07-01 18:07         ` John Stultz
