From: Linus Torvalds <>
To: Jeff Layton <>
Cc: Christian Brauner <>,,,
	Jan Kara <>
Subject: Re: [GIT PULL v2] timestamp fixes
Date: Thu, 21 Sep 2023 12:28:13 -0700
Message-ID: <>
In-Reply-To: <>

On Thu, 21 Sept 2023 at 11:51, Jeff Layton <> wrote:
> We have many, many inodes though, and 12 bytes per adds up!

That was my thinking, but honestly, who knows what other alignment
issues might eat up some - or all - of the theoretical 12 bytes.

It might be, for example, that the inode is already some aligned size,
and that the allocation alignment means that the size wouldn't
*really* shrink at all.

So I just want to make clear that I think the 12 bytes isn't
necessarily there. Maybe you'd get it, maybe it would be hidden by
other things.
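
To illustrate the alignment point, here is a minimal userspace sketch.
The struct names (demo_timespec64, times_padded, times_split) and the
helper bytes_saved() are made up for this example and are not the
kernel's definitions; the real 'struct inode' has many more fields that
affect the outcome.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct timespec64: s64 seconds plus a long. */
struct demo_timespec64 {
	int64_t tv_sec;
	long    tv_nsec;
};

/* Current style: three full timespecs, padding included in each one. */
struct times_padded {
	struct demo_timespec64 atime, mtime, ctime;
};

/* Split style: 3 * 8 + 3 * 4 = 36 bytes of actual payload. */
struct times_split {
	int64_t  atime_sec, mtime_sec, ctime_sec;
	uint32_t atime_nsec, mtime_nsec, ctime_nsec;
};

/* The split form never grows, but sizeof() is rounded up to the alignment. */
static_assert(sizeof(struct times_split) <= sizeof(struct times_padded),
	      "split layout should not be larger");

/* How many bytes the split layout saves on this particular target. */
static size_t bytes_saved(void)
{
	return sizeof(struct times_padded) - sizeof(struct times_split);
}
```

On a typical 64-bit target bytes_saved() would be expected to report 8
rather than 12, because the split struct is rounded up to a multiple of
its 8-byte alignment. Whether the last 4 bytes are recovered depends on
what sits next to these fields in the enclosing struct.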

My biggest impetus was really that whole abuse of a type that I
already disliked for other reasons.

> I'm on board with the idea, but...that's likely to be as big a patch
> series as the ctime overhaul was. In fact, it'll touch a lot of the same
> code. I can take a stab at that in the near future though.

Yea, it's likely to be fairly big and invasive.  That was one of the
reasons for my suggested "inode_time()" macro hack: using the macro
argument concatenation is really a hack to "gather" the pieces based
on name, and while it's odd and not a very typical kernel model, I
think doing it that way might allow the conversion to be slightly less
painful.

You'd obviously have to have the same kind of thing for assignment.

Without that kind of name-based hack, you'd have to create all these
random helper functions that just do the same thing over and over for
the different times, which seems really annoying.
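
A sketch of what such a name-based macro pair could look like, written
as userspace C. Only the name "inode_time()" comes from the mail; the
expansion shown here, the struct names (ts64, demo_inode), the split
field names and the inode_time_set() helper are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct timespec64. */
struct ts64 {
	int64_t tv_sec;
	long    tv_nsec;
};

/* Hypothetical inode fragment with split second/nanosecond fields. */
struct demo_inode {
	int64_t  i_atime_sec, i_mtime_sec, i_ctime_sec;
	uint32_t i_atime_nsec, i_mtime_nsec, i_ctime_nsec;
};

/*
 * "Gather" the two pieces by name: inode_time(ino->i_mtime) pastes
 * _sec and _nsec onto the last token of the argument, yielding
 * ino->i_mtime_sec and ino->i_mtime_nsec.
 */
#define inode_time(x) \
	((struct ts64){ .tv_sec = x##_sec, .tv_nsec = (long)x##_nsec })

/* The matching assignment form (hypothetical name). */
#define inode_time_set(x, ts) do {					\
	struct ts64 ts_tmp_ = (ts);					\
	assert(ts_tmp_.tv_nsec >= 0 && ts_tmp_.tv_nsec < 1000000000L);	\
	x##_sec  = ts_tmp_.tv_sec;					\
	x##_nsec = (uint32_t)ts_tmp_.tv_nsec;				\
} while (0)
```

With this, inode_time_set(ino.i_mtime, t) followed by
inode_time(ino.i_mtime) round-trips a timestamp, and the same two
macros serve atime, mtime and ctime without per-field helpers.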

> Since we're on the subject...another thing that bothers me with all of
> the timestamp handling is that we don't currently try to mitigate "torn
> reads" across the two different words. It seems like you could fetch a
> tv_sec value and then get a tv_nsec value that represents an entirely
> different timestamp if there are stores between them.

Hmm. I think that's an issue that we have always had in theory, and
have ignored because it's simply not problematic in practice, and
fixing it is *hugely* painful.

I suspect we'd have to use some kind of sequence lock for it (to make
reads be cheap), and while it's _possible_ that having the separate
accessor functions for reading/writing those times might help things
out, I suspect the reading/writing happens for the different times (ie
atime/mtime/ctime) together often enough that you might want to have
the locking done at an outer level, and _not_ do it at the accessor
level.

So I suspect this is a completely separate issue (ie even an accessor
doesn't make the "hugely painful" go away). And probably not worth
worrying about *unless* somebody decides that they really really care
about the race.
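
For reference, the sequence-lock idea can be sketched in userspace with
C11 atomics. This only illustrates the retry mechanism and is not
kernel code: struct demo_time and both functions are hypothetical,
writers are assumed to be serialized by some outer lock, and
sequentially consistent atomics are used for simplicity where a real
implementation would likely use weaker ordering. It also guards a
single timestamp, whereas the mail above expects the locking to sit at
an outer level.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct demo_time {
	atomic_uint      seq;	/* even: stable, odd: write in progress */
	_Atomic int64_t  sec;
	_Atomic uint32_t nsec;
};

/* Writer side: bump the counter to odd, store both words, bump to even. */
static void demo_time_write(struct demo_time *t, int64_t sec, uint32_t nsec)
{
	unsigned int s = atomic_load(&t->seq);

	assert((s & 1) == 0);	/* assumption: writers are serialized */
	atomic_store(&t->seq, s + 1);
	atomic_store(&t->sec, sec);
	atomic_store(&t->nsec, nsec);
	atomic_store(&t->seq, s + 2);
}

/* Reader side: retry until both words were read under one even count. */
static void demo_time_read(struct demo_time *t, int64_t *sec, uint32_t *nsec)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load(&t->seq);
		*sec = atomic_load(&t->sec);
		*nsec = atomic_load(&t->nsec);
		s2 = atomic_load(&t->seq);
	} while ((s1 & 1) || s1 != s2);
}
```

Readers take no lock and only loop when a write overlapped the read,
which is what keeps the read side cheap.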

That said, one thing that *could* help is if people decide that the
right format for inode times is to just have one 64-bit word that has
"sufficient resolution". That's what we did for "kernel time", ie
"ktime_t" is a 64-bit nanosecond count, and by being just a single
value, it avoids not just the horrible padding with 'struct
timespec64', it is also dense _and_ can be accessed as one atomic
value.

Sadly, that "sufficient resolution" couldn't be nanoseconds, because
64-bit nanoseconds isn't enough of a spread. It's fine for the kernel
time, because 2**63 nanoseconds is 292 years, so it moved the "year
2038" problem to "year 2262".
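
The arithmetic behind those figures, as a small sketch. Both helpers
are hypothetical names, and a 365-day year is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SECS_PER_YEAR (365LL * 24 * 60 * 60)	/* assumption: 365-day years */

/*
 * Whole years covered by the positive half of a signed 64-bit counter
 * that advances ticks_per_sec times per second.
 */
static int64_t s64_range_years(int64_t ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return INT64_MAX / ticks_per_sec / SECS_PER_YEAR;
}

/* Last representable year when the counter starts at 1970. */
static int64_t s64_last_year(int64_t ticks_per_sec)
{
	return 1970 + s64_range_years(ticks_per_sec);
}
```

With ticks_per_sec = 1000000000 (nanoseconds) this gives 292 years and
the year 2262.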

And that's ok when we're talking about times that are kernel running
times and we have a couple of centuries to say "ok, we'll need to make
it be a bigger type", but when you save the values to disk, things are
different. I suspect filesystem people are *not* willing to deal with
a "year 2262" issue.

But if we were to say that "a tenth of microsecond resolution is
sufficient for inode timestamps", then suddenly 64 bits is *enormous*.
So we could do a

    // tenths of a microsecond since Jan 1, 1970
    typedef s64 fstime_t;

and have a nice dense timestamp format with reasonable - but not
nanosecond - accuracy. Now that 292 year range has become 29,247
years, and filesystem people *might* find the "year-31k" problem
acceptable.

I happen to think that "100ns timestamp resolution on files is
sufficient" is a very reasonable statement, but I suspect that we'll
still find lots of people who say "that's completely unacceptable"
both to that resolution, and to the 31k-year problem.
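
A minimal sketch of converting between a seconds/nanoseconds pair and
such a 100ns-tick value. The typedef uses int64_t as a userspace
stand-in for s64; the constants and both helper functions are
hypothetical, and overflow of the multiplication for far-out dates is
deliberately not handled here.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t fstime_t;	/* tenths of a microsecond since Jan 1, 1970 */

#define FSTIME_PER_SEC   10000000LL	/* 100ns ticks per second */
#define NSEC_PER_FSTIME  100LL		/* nanoseconds per tick */

/* Pack seconds + nanoseconds; anything below 100ns is dropped. */
static fstime_t fstime_from_parts(int64_t sec, uint32_t nsec)
{
	assert(nsec < 1000000000u);
	return sec * FSTIME_PER_SEC + nsec / NSEC_PER_FSTIME;
}

/* Unpack, keeping nsec in [0, 1e9) for pre-1970 (negative) values too. */
static void fstime_to_parts(fstime_t t, int64_t *sec, uint32_t *nsec)
{
	int64_t s = t / FSTIME_PER_SEC;
	int64_t rem = t % FSTIME_PER_SEC;

	if (rem < 0) {		/* C division truncates toward zero */
		rem += FSTIME_PER_SEC;
		s--;
	}
	*sec = s;
	*nsec = (uint32_t)(rem * NSEC_PER_FSTIME);
}
```

For example, 1 second plus 123456789 ns packs to 11234567 ticks and
unpacks to 1 second plus 123456700 ns, which makes the loss of the last
two decimal digits of the nanosecond field visible.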

But wouldn't it be nice to have just one single "fstime_t" for file
timestamps, the same way we have "ktime_t" for CPU timestamps?

Then we'd save even more space in the 'struct inode'....

