LKML Archive on lore.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	joseph@codesourcery.com, john.stultz@linaro.org,
	hch@infradead.org, tglx@linutronix.de, geert@linux-m68k.org,
	lftan@altera.com, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: [RFC 11/32] xfs: convert to struct inode_time
Date: Tue, 3 Jun 2014 10:32:27 +1000
Message-ID: <20140603003227.GP6677@dastard> (raw)
In-Reply-To: <7106937.MLk03lftzD@wuerfel>

On Mon, Jun 02, 2014 at 01:43:44PM +0200, Arnd Bergmann wrote:
> On Monday 02 June 2014 10:28:22 Dave Chinner wrote:
> > On Sun, Jun 01, 2014 at 10:24:37AM +1000, Dave Chinner wrote:
> > > On Sat, May 31, 2014 at 05:37:52PM +0200, Arnd Bergmann wrote:
> > > > In my list at http://kernelnewbies.org/y2038, I found that almost
> > > > all file systems at least support times until 2106, because they treat
> > > > the on-disk value as unsigned on 64-bit systems, or they use
> > > > a completely different representation. My guess is that somebody
> > > > earlier spent a lot of work on making that happen.
> > > > 
> > > > The exceptions are:
> > > > 
> > > > * exofs uses signed values, which can probably be changed to be
> > > >   consistent with the others.
> > > > * isofs has a bug that limits it until 2027 on architectures with
> > > >   a signed 'char' type (otherwise it's 2155).
> > > > * udf can represent times for many thousands of years through a
> > > >   16-bit year representation, but the code to convert to epoch
> > > >   uses a const array that ends at 2038.
> > > > * afs uses signed seconds and can probably be fixed
> > > > * coda relies on user space time representation getting passed
> > > >   through an ioctl.
> > > > * I miscategorized xfs/ext2/ext3 as having unsigned 32-bit seconds,
> > > >   where they really use signed.
> > > > 
> > > > I was confused about XFS since I didn't notice that there are
> > > > separate xfs_ictimestamp_t and xfs_timestamp_t types, so I expected
> > > > XFS to also use the 1970-2106 time range on 64-bit systems today.
> > > 
> > > You've missed an awful lot more than just the implications for the
> > > core kernel code.
> > > 
> > > There's a good chance such changes propagate to APIs elsewhere in
> > > the filesystems, because something you haven't realised is that XFS
> > > effectively exposes the on-disk timestamp format directly to
> > > userspace via the bulkstat interface (see struct xfs_bstat). It also
> > > affects the XFS open-by-handle ioctl and the swap extent ioctl used
> > > by the online defragmenter.
> 
> I really didn't look at them at all, as ioctl is very late on my
> mental list of things to change. I do realize that a lot of drivers
> and file systems do have ioctls that pass time values and we need to
> address them one by one.
> 
> I just looked at the ioctls you mentioned but don't see how open-by-handle
> is affected by this. Can you point me to what you mean?

Sorry, I misremembered how some of the XFS open-by-handle code works
in userspace (XFS has a pretty rich open-by-handle ioctl() interface
that predates the kernel syscalls by at least 10 years).  Basically
there is code in userspace that uses the information returned from
bulkstat to construct file handles to pass to the open-by-handle
ioctls. xfs_fsr then opens each file by a handle built from the
bulkstat output and feeds that, along with the bulkstat output
itself, into the swap extent ioctls....

i.e. the filesystem's idea of what time is is passed to userspace as
an opaque cookie in this case, but it is not used directly by the
open-by-handle interfaces like I implied it was.

> > Just to put that in context, here's the kernel patch to add extended
> > epoch support to XFS. It's completely untested as I haven't done any
> > userspace code changes to enable the feature. However, it should
> > give you an indication of how far the simple act of changing the
> > kernel time representation spread through the filesystem. This does
> > not include any of the VFS infrastructure for specifying the range of
> > supported timestamps.  It survives some smoke testing, but dies when
> > the online defragmenter starts using the bulkstat and swap extent
> > ioctls (the assert in xfs_inode_time_from_epoch() fires), so I
> > probably don't have that all sorted correctly yet...
> > 
> > To test extended epoch support, however, I need some fstests that
> > define and validate the behaviour of the new syscalls - until we get
> > those we can't validate that the filesystem follows the spec
> > properly. I also suspect we are going to need an interface to query
> > the supported range of timestamps from a filesystem so that we can
> > test boundary conditions in an automated fashion....
> 
> Thanks a lot for having an initial look at this yourself!
> 
> I'd still consider the two problems largely orthogonal.

Depends how you look at it. You can't extend the kernel's idea of
time without permanent storage being able to specify the supported
bounds - that's a non-negotiable aspect of introducing extended
epoch timestamp support.

The actual addition of extended timestamp support to each individual
filesystem is orthogonal to the introduction of the struct
inode_time, but doing this addition properly is dependent on the VFS
infrastructure being there in the first place.

> My patch set
> (at least with the 64-bit tv_sec) just gets 32-bit kernels to behave
> more like 64-bit kernels regarding inode time stamps, which does
> impact all the file systems that use a 64-bit time or the NFS
> unsigned epoch (1970-2106), while your patch extends the file
> system internal epoch (1901-2038 for XFS) so it can be used by
> anything that knows how to handle larger than 32-bit second values
> (either 64-bit kernel or 32-bit with inode_time patch).

Right, but the issue is that 64 bit second counters are broken right
now because most filesystems can't support more than 32 bit values.
So it doesn't matter whether it's 32 bit or 64 bit machines, just
adding explicit support for >32 bit second counters without doing
anything else just extends that brokenness into the indefinite
future.

If we don't fix it now (i.e. in the new user API and supporting
infrastructure), then we'll *never be able to fix it* and we'll be
stuck with timestamps that do really weird things when you pass
arbitrary future dates to the kernel.
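
To make "really weird things" concrete, here is a minimal user-space
sketch of what happens when a 64-bit second count lands in a signed
32-bit on-disk field with no range check. The helper names are made
up for illustration; this is not code from any filesystem.

```c
#include <stdint.h>

/*
 * Hypothetical helper: keep only the low 32 bits of a 64-bit second
 * count and reinterpret them as a signed on-disk value, with no range
 * check. Written out explicitly to avoid an implementation-defined cast.
 */
static int32_t disk_store_sec(int64_t sec)
{
	uint32_t low = (uint32_t)sec;

	if (low >= 0x80000000u)
		return (int32_t)(low - 0x80000000u) - INT32_MAX - 1;
	return (int32_t)low;
}

/* Hypothetical helper: read the on-disk value back as 64-bit seconds. */
static int64_t disk_load_sec(int32_t disk)
{
	return disk;
}
```

With these helpers, a timestamp one second past the January 2038
limit (2147483648) reads back as -2147483648, i.e. December 1901,
and anything beyond 2^32 silently aliases onto an earlier time.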

> > diff --git a/fs/xfs/xfs_dinode.h b/fs/xfs/xfs_dinode.h
> > index 623bbe8..79f94722 100644
> > --- a/fs/xfs/xfs_dinode.h
> > +++ b/fs/xfs/xfs_dinode.h
> > @@ -21,11 +21,53 @@
> >  #define        XFS_DINODE_MAGIC                0x494e  /* 'IN' */
> >  #define XFS_DINODE_GOOD_VERSION(v)     ((v) >= 1 && (v) <= 3)
> >  
> > +/*
> > + * Inode timestamps get more complex when we consider supporting times beyond
> > + * the standard unix epoch of Jan 2038. The struct xfs_timestamp cannot support
> > + * more than a single extension by playing sign games, and that is still not
> > + * reliable. We also can't extend the timestamp structure because there is no
> > + * free space around them in the on-disk inode.
> > + *
> > + * Hence the simplest thing to do is to add an epoch counter for each timestamp
> > + * in the inode. This can be a single byte for each timestamp and make use of
> > + * a hole we currently pad. This gives us another 255 epochs range for the
> > + * timestamps, but requires a superblock feature bit to indicate that these
> > + * fields have meaning and can be non-zero.
> 
> Nice trick!

It's a pretty common way of extending the range of a variable for
on-disk formats. The on-disk format is completely disconnected from
the in-memory representation, so it's "easy" to play games like this
within the on-disk format.

If you look closely at ext4, you'll see all the lo/hi variables
where extension of 16->32 bits or 32->48 bits has occurred from
the ext2/3 variable formats... ;)
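
As a rough illustration of that lo/hi pattern, a 32-bit on-disk
field can be widened to 48 bits by putting the upper 16 bits into
space that used to be padding. The structure and field names below
are hypothetical, not the real ext4 definitions, and endian
conversion is left out to keep the sketch short.

```c
#include <stdint.h>

/*
 * Hypothetical on-disk record: 'blocks_lo' is the original 32-bit
 * field, 'blocks_hi' occupies what used to be zeroed padding.
 */
struct disk_rec {
	uint32_t blocks_lo;
	uint16_t blocks_hi;
};

/* Combine the two halves into the in-memory 64-bit value. */
static uint64_t rec_get_blocks(const struct disk_rec *r)
{
	return ((uint64_t)r->blocks_hi << 32) | r->blocks_lo;
}

/* Split the in-memory value back into the two on-disk halves. */
static void rec_set_blocks(struct disk_rec *r, uint64_t v)
{
	r->blocks_lo = (uint32_t)v;
	r->blocks_hi = (uint16_t)(v >> 32);
}
```

Because old records have zero in the padding, they decode to the
same value as before the extension.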

> 
> > +static inline __uint8_t
> > +xfs_timestamp_epoch(
> > +       struct timespec         *time)
> > +{
> > +       /* will be zero until the extended struct inode_time is introduced */
> > +       return 0;
> > +}
> > +
> > +static inline __int32_t
> > +xfs_timestamp_sec(
> > +       struct timespec         *time)
> > +{
> > +       return time->tv_sec;
> > +}
> > +
> > +static inline __kernel_time_t
> > +xfs_inode_time_from_epoch(
> > +       __uint8_t       epoch,
> > +       __int32_t       seconds)
> > +{
> > +       /* need to handle non-zero epoch when struct inode_time is introduced */
> > +       ASSERT(epoch == 0);
> > +       return seconds;
> > +}
> 
> Why don't you already implement epoch conversion for 64-bit kernels that
> are able to represent the time today?

Because I wasn't trying to solve the entire problem, just
demonstrate the infrastructure needed to support extended
timestamps.....
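
For what it's worth, one way the non-zero epoch case could look is
sketched below. It assumes each epoch spans 2^32 seconds and that
the signed 32-bit seconds field is an offset within it; the patch
does not define that encoding, and the function names here are
illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define EPOCH_SPAN	(INT64_C(1) << 32)	/* seconds per epoch (assumed) */
#define EPOCH_BIAS	(INT64_C(1) << 31)	/* maps INT32_MIN to zero */

/* Combine the epoch byte and signed 32-bit seconds into 64-bit seconds. */
static int64_t time_from_epoch(uint8_t epoch, int32_t seconds)
{
	return (int64_t)epoch * EPOCH_SPAN + seconds;
}

/* Split 64-bit seconds; returns false if the value is not representable. */
static bool time_to_epoch(int64_t t, uint8_t *epoch, int32_t *seconds)
{
	int64_t biased;

	if (t < -EPOCH_BIAS || t >= 256 * EPOCH_SPAN - EPOCH_BIAS)
		return false;

	biased = t + EPOCH_BIAS;	/* now in [0, 256 * 2^32) */
	*epoch = (uint8_t)(biased / EPOCH_SPAN);
	*seconds = (int32_t)(biased % EPOCH_SPAN - EPOCH_BIAS);
	return true;
}
```

With epoch == 0 the decode reduces to returning the seconds value,
which matches the stub in the patch, so existing inodes would read
back unchanged under this assumed encoding.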

> This is how ext4 does it (I mean
> the sizeof() trick, not the bit stuffing they do):
....
> I guess if there is general agreement on introducing 'struct inode_time',
> we can skip that intermediate step.

Also, I don't like the concept of having filesystems that will work
on 64 bit but not 32 bit machines. Over the past 10 years, we've
managed to remove most of those differences from the VFS and XFS,
so adding new distinctions between 32/64 bit machines is not the
direction I want to head in.

As it is, I'm expecting to do this only after the struct inode_time
and the superblock "time range" infrastructure have been added to
the kernel and VFS.  If that change is not made, then we've still
only got 32 bit time....
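
A sketch of what such "time range" infrastructure might boil down
to is below. The structure and function names are hypothetical, and
clamping out-of-range values (rather than rejecting them with an
error) is an assumption made only for the example.

```c
#include <stdint.h>

/* Hypothetical per-superblock range, filled in by the filesystem at mount. */
struct fs_time_range {
	int64_t min_sec;
	int64_t max_sec;
};

/*
 * Clamp a requested timestamp into the range the filesystem can store.
 * Clamping instead of failing is an assumption of this sketch.
 */
static int64_t fs_clamp_time(const struct fs_time_range *r, int64_t sec)
{
	if (sec < r->min_sec)
		return r->min_sec;
	if (sec > r->max_sec)
		return r->max_sec;
	return sec;
}
```

A filesystem limited to signed 32-bit seconds would advertise
{ INT32_MIN, INT32_MAX }, and one with extended epochs would
advertise a wider pair, so generic code and tests could find the
boundaries without knowing the on-disk format.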

> > @@ -509,8 +509,11 @@ xfs_sb_has_ro_compat_feature(
> >  }
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE     (1 << 0)        /* filetype in dirent */
> > +#define XFS_SB_FEAT_INCOMPAT_EPOCH     (1 << 1)        /* Time beyond 2038 */
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > -               (XFS_SB_FEAT_INCOMPAT_FTYPE)
> > +               (XFS_SB_FEAT_INCOMPAT_FTYPE | \
> > +                XFS_SB_FEAT_INCOMPAT_EPOCH | \
> > +                0)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN   ~XFS_SB_FEAT_INCOMPAT_ALL
> 
> How does this flag get set?

mkfs.xfs

> Do you have to manually change it in the
> superblock? Since most of the time I'd suspect you wouldn't actually
> use it for the foreseeable future, would it make sense to have a mount
> option that allows it to be set, but doesn't actually change the
> superblock until the first inode gets written with a nonzero epoch?

Yes, we could set the flag on the first timestamp that goes beyond
the current epoch, but that has two problems:

	1. filesystem silently becomes incompatible with older
	kernels so failed upgrade rollbacks become problematic; and

	2. It adds unnecessary complexity, as this will end up being
	the default behaviour for all new filesystems within a year.
	Then we end up with a mount option and conversion functions
	that never get used but we have to support for years....
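
To spell out problem 1, the incompat check amounts to something
like the sketch below. The macro and function names are simplified
stand-ins for the XFS_SB_FEAT_INCOMPAT_* definitions in the patch,
not the actual kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEAT_INCOMPAT_FTYPE	(UINT32_C(1) << 0)	/* filetype in dirent */
#define FEAT_INCOMPAT_EPOCH	(UINT32_C(1) << 1)	/* time beyond 2038 */

/*
 * Returns true if every incompat bit set in the superblock is one this
 * kernel understands. 'known' plays the role of the ..._ALL mask.
 */
static bool can_mount(uint32_t sb_incompat, uint32_t known)
{
	return (sb_incompat & ~known) == 0;
}
```

A kernel whose known mask is only FEAT_INCOMPAT_FTYPE mounts the
filesystem fine until the EPOCH bit is set, and refuses from then
on. If the bit is flipped lazily by the first extended timestamp,
that refusal shows up at some arbitrary later point rather than at
mkfs time.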

> That way, you'd still be able to mount it with an older kernel but
> also be forward compatible with time moving on.

We've got plenty of time to roll this out so I don't see any need
for putting in place temporary support mechanisms that unnecessarily
complicate the code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
