Re: [RFC][PATCH] xfs: extended timestamp range

From: Dave Chinner <david@fromorbit.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Amir Goldstein <amir73il@gmail.com>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Deepa Dinamani <deepa.kernel@gmail.com>
Subject: Re: [RFC][PATCH] xfs: extended timestamp range
Date: Thu, 14 Nov 2019 07:40:57 +1100	[thread overview]
Message-ID: <20191113204057.GV4614@dread.disaster.area> (raw)
In-Reply-To: <CAK8P3a2+AhLj+eAJVmKZ_V82Xgdb87vv8o01CzYQ=MCNA5bU-A@mail.gmail.com>

On Wed, Nov 13, 2019 at 02:20:44PM +0100, Arnd Bergmann wrote:
> On Wed, Nov 13, 2019 at 5:59 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Wed, Nov 13, 2019 at 06:42:09AM +0200, Amir Goldstein wrote:
> > > [CC:Arnd,Deepa]
> > > On Wed, Nov 13, 2019 at 5:56 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > On Wed, Nov 13, 2019 at 08:11:53AM +1100, Dave Chinner wrote:
> > > > > On Tue, Nov 12, 2019 at 06:00:19AM +0200, Amir Goldstein wrote:
> > > > > > On Tue, Nov 12, 2019 at 12:35 AM Darrick J. Wong
> > > > > > <darrick.wong@oracle.com> wrote:
> > > >
> > >
> > > I didn't get why you dropped the di_pad idea (Arnd's and Dave's patches).
> > > Is it to save the pad for another rainy day?
> >
> > Something like that.  Someone /else/ can worry about the y2486 problem.
> >
> > <duck>
> >
> > Practically speaking I'd almost rather drop the precision in order to
> > extend the seconds range, since timestamp updates are only precise to
> > HZ anyway.
> 
> I think there would be noticeable overhead from anything that requires
> a 64-bit division to update a timestamp, or to read it into an in-memory
> inode, especially on 32-bit architectures.

Comapred to actually allocating the inode, initialising it, etc?
The cost of a division is noise and, as such, is completely
irrelevant. It's classic premature micro-optimisation of CPU usage
without any regard for the context in which the division might
occur.

Seriously, XFS is not a filesystem for small, poorly performing 32
bit CPUs. It's a filesystem for 64 bit servers with hundreds of
CPUs, terabytes of memory and storage capable of millions of IOPs.
All I care about is that it /works/ on tiny systems and doesn't
corrupt filesystems - we've already made so many compromises in
terms of algorithmic scalability and memory usage that XFS
performance does not scale down effectively to tiny machines and
tiny filesystems.

IOWs, if you have a slow, poorly implemented 32 bit CPU with limited
RAM and storage capability, then XFS is not the filesystem for your
hardware. And in that context, the cost of a 64bit integer division
is largely irrelevant to us.

> I think the reason why separate seconds/nanoseconds are easy is that
> this is what both the in-kernel and user space interface are based around.

That's also largely irrelevant to what we store on disk. We store
lots of stuff in encoded data structures because it's more space
efficient to do so. We trade off CPU time for space efficiency
-everywhere- in the XFS filesystem, because CPU is far cheaper than
the RAM to store it and cheaper than moving information to/from disk.

> We clearly don't use the precision, but anything else is more expensive
> at runtime.

It's noise in runtime CPU profiles. It's far more important for the
filesystem on-disk format to be sane, scalable, maintainable and
repairable than it is to save a CPU cycle or two in a path
that is several thousand CPU instructions long already....

> > > Is there any core xfs developer that was going to tackle this?
> > >
> > > I'm here, so if you need my help moving things forward let me know.
> >
> > I wrote a trivial garbage version this afternoon, will have something
> > more polished tomorrow.  None of this is 5.6 material, we have time.
> 
> I think from a user perspective, it would be the nicest to just add the

The *user* does not see or know anything to do with what is on disk,
nor does the kernel code outside the disk->memory translation
functions. What the users passes to/from the kernel is irrelevant
when discussing how we store something on disk. Indeed, what the
kernel passes to the filesystem is largely irrelevant, too :)

> the existing behavior: setting a timestamp after 2038 using utimensat()
> silently wraps the seconds back to the regular epoch. With the
> extension patch, you get the correct results as long as the inode was
> both written and read on a new enough kernel, while all pre-5.4
> kernels produce the same incorrect data that they always have.

Feature bits prevent the new format being read on old kernels.
We can't allow an old kernel to parse an structure that it will
present to the user incorrectly because it was written by a more
recent kernel. Changing timestamp formats means old kernels can no
longer mount that filesystem, and that goes for your changes as well
- an old kernel will silently mishandle a >2038 date, and may even
detect it as corruption because we expect padding to be zero on
unused on-disk fields. Worse, older kernels will overwrite the new
epoch fields with zeros when the inode is re-written, destroying
the >y2038 timestamp information that is in that padding.

IOWs, old kernels do not preserve stuff written into on-disk
strucutre padding - the expect it to be zero and will write zeros
there whenever the structure is written to disk. That means these
epoch based y2038k format changes are not forwards or backwards
compatible - you can't interchange the filesystem between kernels
that do/don't support y2038k timestamps and expect it to works
correctly - feature bits are needed to prevent the timestamps from
being mis-interpretted or silently corrupted.

That means it's a one-way conversion. Hence if the old format
wrapped back to <1970, then new value written to disk will encode
that <1970 date to disk *in the new format*. Nothing gets lost or
changed in the process, and it's clear that an old kernel cannot
mangle it because it can no longer mount the filesystem that holds
new format timestamps....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com