From: griffin tucker <xfssxsltislti2490@griffintucker.id.au>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 08/11] xfs: widen ondisk timestamps to deal with y2038 problem
Date: Fri, 21 Aug 2020 15:02:33 +1000 [thread overview]
Message-ID: <CAH3XsHF=bPBvogcWTRaK8XDFF2zu5Q5Mc++FUJYb8eHY9u_zqQ@mail.gmail.com> (raw)
In-Reply-To: <20200820162049.GI6096@magnolia>
thanks for taking the time to respond to my seemingly stupid questions.
what about interplanetary timestamps? and during transit to the
moon/planets when time slows?
On Fri, 21 Aug 2020 at 02:24, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Thu, Aug 20, 2020 at 02:42:29PM +1000, griffin tucker wrote:
> > how difficult would it be to implement 128 bit timestamps?
>
> Somewhat. __uint128_t exists as a gcc extension, but you'd also have to
> expand the ondisk inode structure definition to accomodate 64 more bits
> of data per timestamp. If you split the timestamp between a low u64 and
> a high u64 you'd also need to recombine them carefully to avoid screwing
> up sign extension like ext4 did when they adopted 34-bit timestamps.
>
> > and how much further in time beyond 2486 would this allow?
>
> I dunno. 2^66 seconds beyond December 1901, I suppose.
>
> That's what, the year ... 2338218323135 or something? Far beyond what
> the maximum kernel supports (2^63-1 seconds).
>
> > is the implementation of 128 bit timestamps just not feasible due to
> > 64 bit alu width?
>
> Just my opinion, but I think the difficulty here is the added complexity
> to push XFS beyond the 25th century. I don't plan to be around for
> that, and prefer to let Captain Picard deal with that expansion. ;)
>
> > how long is it until hardware capability reaches beyond the 16exabyte limit?
>
> Still a few years, even if you staple together all the SMR HDDs in the
> world. You'd need, what, like, 400,000 20TB HDDs to reach 8EB?
>
> --D
>
> > On Thu, 20 Aug 2020 at 10:01, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 02:43:22PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Aug 19, 2020 at 09:35:35AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 17, 2020 at 03:57:39PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > >
> > > > > > Redesign the ondisk timestamps to be a simple unsigned 64-bit counter of
> > > > > > nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix
> > > > > > time epoch). This enables us to handle dates up to 2486, which solves
> > > > > > the y2038 problem.
> > > > > >
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > .....
> > > > > > +/* Convert an ondisk timestamp into the 64-bit safe incore format. */
> > > > > > void
> > > > > > xfs_inode_from_disk_timestamp(
> > > > > > + struct xfs_dinode *dip,
> > > > > > struct timespec64 *tv,
> > > > > > const union xfs_timestamp *ts)
> > > > > > {
> > > > > > + if (dip->di_version >= 3 &&
> > > > > > + (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME))) {
> > > > > > + uint64_t t = be64_to_cpu(ts->t_bigtime);
> > > > > > + uint64_t s;
> > > > > > + uint32_t n;
> > > > > > +
> > > > > > + s = div_u64_rem(t, NSEC_PER_SEC, &n);
> > > > > > + tv->tv_sec = s - XFS_INO_BIGTIME_EPOCH;
> > > > > > + tv->tv_nsec = n;
> > > > > > + return;
> > > > > > + }
> > > > > > +
> > > > > > tv->tv_sec = (int)be32_to_cpu(ts->t_sec);
> > > > > > tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec);
> > > > > > }
> > > > >
> > > > > Can't say I'm sold on this union. It seems cleaner to me to just
> > > > > make the timestamp an opaque 64 bit field on disk and convert it to
> > > > > the in-memory representation directly in the to/from disk
> > > > > operations. e.g.:
> > > > >
> > > > > void
> > > > > xfs_inode_from_disk_timestamp(
> > > > > struct xfs_dinode *dip,
> > > > > struct timespec64 *tv,
> > > > > __be64 ts)
> > > > > {
> > > > >
> > > > > uint64_t t = be64_to_cpu(ts);
> > > > > uint64_t s;
> > > > > uint32_t n;
> > > > >
> > > > > if (xfs_dinode_is_bigtime(dip)) {
> > > > > s = div_u64_rem(t, NSEC_PER_SEC, &n) - XFS_INO_BIGTIME_EPOCH;
> > > > > } else {
> > > > > s = (int)(t >> 32);
> > > > > n = (int)(t & 0xffffffff);
> > > > > }
> > > > > tv->tv_sec = s;
> > > > > tv->tv_nsec = n;
> > > > > }
> > > >
> > > > I don't like this open-coded union approach at all because now I have to
> > > > keep the t_sec and t_nsec bits separate in my head instead of letting
> > > > the C compiler take care of that detail. The sample code above doesn't
> > > > handle that correctly either:
> > > >
> > > > Start with an old kernel on a little endian system; each uppercase
> > > > letter represents a byte (A is the LSB of t_sec, D is the MSB of t_sec,
> > > > E is the LSB of t_nsec, and H is the MSB of t_nsec):
> > > >
> > > > sec nsec (incore)
> > > > ABCD EFGH
> > > >
> > > > That gets written out as:
> > > >
> > > > sec nsec (ondisk)
> > > > DCBA HGFE
> > > >
> > > > Now reboot with a new kernel that only knows 64bit timestamps on disk:
> > > >
> > > > 64bit (ondisk)
> > > > DCBAHGFE
> > > >
> > > > Now it does the first be64_to_cpu conversion:
> > > > 64bit (incore)
> > > > EFGHABCD
> > > >
> > > > And then masks and shifts:
> > > > sec nsec (incore)
> > > > EFGH ABCD
> > > >
> > > > Oops, we just switched the values!
> > > >
> > > > The correct approach (I think) is to perform the shifting and masking on
> > > > the raw __be64 value before converting them to incore format via
> > > > be32_to_cpu, but now I have to work out all four cases by hand instead
> > > > of letting the compiler do the legwork for me. I don't remember if it's
> > > > correct to go around shifting and masking __be64 values.
> > > >
> > > > I guess the good news is that at least we have generic/402 to catch
> > > > these kinds of persistence problems, but ugh.
> > > >
> > > > Anyway, what are you afraid of? The C compiler smoking crack and not
> > > > actually overlapping the two union elements? We could control for
> > > > that...
> > >
> > > (Following up on the mailing list with something I pasted into IRC)
> > >
> > > Ok, so I temporarily patched up my dev tree with this approximation of
> > > how that would work, properly done:
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index c59ddb56bb90..7c71e4440402 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -176,8 +176,8 @@ xfs_inode_from_disk_timestamp(
> > > return;
> > > }
> > >
> > > - tv->tv_sec = (int)be32_to_cpu(ts->t_sec);
> > > - tv->tv_nsec = (int)be32_to_cpu(ts->t_nsec);
> > > + tv->tv_sec = (time64_t)be32_to_cpu((__be32)(ts->t_bigtime >> 32));
> > > + tv->tv_nsec = be32_to_cpu(ts->t_bigtime & 0xFFFFFFFFU);
> > > }
> > >
> > > int
> > > @@ -294,8 +294,8 @@ xfs_inode_to_disk_timestamp(
> > > return;
> > > }
> > >
> > > - ts->t_sec = cpu_to_be32(tv->tv_sec);
> > > - ts->t_nsec = cpu_to_be32(tv->tv_nsec);
> > > + ts->t_bigtime = (__be64)cpu_to_be32(tv->tv_sec) << 32 |
> > > + cpu_to_be32(tv->tv_nsec);
> > > }
> > >
> > > void
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index d44e8932979b..5d36d6dea326 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -308,8 +308,8 @@ xfs_log_dinode_to_disk_ts(
> > > return;
> > > }
> > >
> > > - ts->t_sec = cpu_to_be32(its->t_sec);
> > > - ts->t_nsec = cpu_to_be32(its->t_nsec);
> > > + ts->t_bigtime = (__be64)cpu_to_be32(its->t_sec) << 32 |
> > > + cpu_to_be32(its->t_nsec);
> > > }
> > >
> > > And immediately got a ton of smatch warnings:
> > >
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:179:32: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:179:32: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:180:23: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:180:23: warning: cast to restricted __be32
> > > xfs_inode_buf.c:297:26: warning: cast to restricted __be64
> > > xfs_inode_buf.c:297:26: warning: cast from restricted __be32
> > > xfs_inode_buf.c:297:26: warning: restricted __be64 degrades to integer
> > > xfs_inode_buf.c:298:25: warning: restricted __be32 degrades to integer
> > > xfs_inode_buf.c:297:23: warning: incorrect type in assignment (different base types)
> > > xfs_inode_buf.c:297:23: expected restricted __be64 [usertype] t_bigtime
> > > xfs_inode_buf.c:297:23: got unsigned long long
> > >
> > > (and even more in xfs_inode_item.c)
> > >
> > > So... while we could get rid of the union and hand-decode the timestamp
> > > from a __be64 on legacy filesystems, I see the static checker complaints
> > > as a second piece of evidence that this would be unnecessarily risky.
> > >
> > > --D
> > >
> > > > > > @@ -220,9 +234,9 @@ xfs_inode_from_disk(
> > > > > > * a time before epoch is converted to a time long after epoch
> > > > > > * on 64 bit systems.
> > > > > > */
> > > > > > - xfs_inode_from_disk_timestamp(&inode->i_atime, &from->di_atime);
> > > > > > - xfs_inode_from_disk_timestamp(&inode->i_mtime, &from->di_mtime);
> > > > > > - xfs_inode_from_disk_timestamp(&inode->i_ctime, &from->di_ctime);
> > > > > > + xfs_inode_from_disk_timestamp(from, &inode->i_atime, &from->di_atime);
> > > > > > + xfs_inode_from_disk_timestamp(from, &inode->i_mtime, &from->di_mtime);
> > > > > > + xfs_inode_from_disk_timestamp(from, &inode->i_ctime, &from->di_ctime);
> > > > > >
> > > > > > to->di_size = be64_to_cpu(from->di_size);
> > > > > > to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > > > @@ -235,9 +249,17 @@ xfs_inode_from_disk(
> > > > > > if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
> > > > > > inode_set_iversion_queried(inode,
> > > > > > be64_to_cpu(from->di_changecount));
> > > > > > - xfs_inode_from_disk_timestamp(&to->di_crtime, &from->di_crtime);
> > > > > > + xfs_inode_from_disk_timestamp(from, &to->di_crtime,
> > > > > > + &from->di_crtime);
> > > > > > to->di_flags2 = be64_to_cpu(from->di_flags2);
> > > > > > to->di_cowextsize = be32_to_cpu(from->di_cowextsize);
> > > > > > + /*
> > > > > > + * Set the bigtime flag incore so that we automatically convert
> > > > > > + * this inode's ondisk timestamps to bigtime format the next
> > > > > > + * time we write the inode core to disk.
> > > > > > + */
> > > > > > + if (xfs_sb_version_hasbigtime(&ip->i_mount->m_sb))
> > > > > > + to->di_flags2 |= XFS_DIFLAG2_BIGTIME;
> > > > > > }
> > > > >
> > > > > We do not want on-disk flags to be changed outside transactions like
> > > > > this. Indeed, this has implications for O_DSYNC operation, in that
> > > > > we do not trigger inode sync operations if the inode is only
> > > > > timestamp dirty. If we've changed this flag, then the inode is more
> > > > > than "timestamp dirty" and O_DSYNC will need to flush the entire
> > > > > inode.... :/
> > > >
> > > > I forgot about XFS_ILOG_TIMESTAMP.
> > > >
> > > > > IOWs, I think we should only change this flag in a timestamp
> > > > > transaction where the timestamps are actually being logged and hence
> > > > > we can set inode dirty state appropriately so that everything will
> > > > > get logged, changed and written back correctly....
> > > >
> > > > Yeah, that's fair. I'll change xfs_trans_log_inode to set the bigtime
> > > > flag if we're logging either the timestamps or the core.
> > > >
> > > > --D
> > > >
> > > > > Cheers,
> > > > >
> > > > > Dave.
> > > > > --
> > > > > Dave Chinner
> > > > > david@fromorbit.com
next prev parent reply other threads:[~2020-08-21 5:02 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-08-17 22:56 [PATCH v2 00/11] xfs: widen timestamps to deal with y2038 Darrick J. Wong
2020-08-17 22:56 ` [PATCH 01/11] xfs: explicitly define inode timestamp range Darrick J. Wong
2020-08-18 6:25 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 02/11] xfs: refactor quota expiration timer modification Darrick J. Wong
2020-08-18 6:48 ` Amir Goldstein
2020-08-18 15:40 ` Darrick J. Wong
2020-08-17 22:57 ` [PATCH 03/11] xfs: refactor default quota grace period setting code Darrick J. Wong
2020-08-18 10:46 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 04/11] xfs: remove xfs_timestamp_t Darrick J. Wong
2020-08-18 10:50 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 05/11] xfs: move xfs_log_dinode_to_disk to the log code Darrick J. Wong
2020-08-18 10:51 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 06/11] xfs: refactor inode timestamp coding Darrick J. Wong
2020-08-18 11:20 ` Amir Goldstein
2020-08-18 15:42 ` Darrick J. Wong
2020-08-17 22:57 ` [PATCH 07/11] xfs: convert struct xfs_timestamp to union Darrick J. Wong
2020-08-18 11:24 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 08/11] xfs: widen ondisk timestamps to deal with y2038 problem Darrick J. Wong
2020-08-18 12:00 ` Amir Goldstein
2020-08-18 12:53 ` Amir Goldstein
2020-08-18 15:53 ` Darrick J. Wong
2020-08-18 20:52 ` Darrick J. Wong
2020-08-18 15:44 ` Darrick J. Wong
2020-08-18 23:35 ` Dave Chinner
2020-08-19 21:43 ` Darrick J. Wong
2020-08-19 23:58 ` Dave Chinner
2020-08-20 0:01 ` Darrick J. Wong
2020-08-20 4:42 ` griffin tucker
2020-08-20 16:23 ` Darrick J. Wong
2020-08-21 5:02 ` griffin tucker [this message]
2020-08-21 15:31 ` Mike Fleetwood
2020-08-20 5:11 ` Amir Goldstein
2020-08-20 22:47 ` Dave Chinner
2020-08-17 22:57 ` [PATCH 09/11] xfs: refactor quota timestamp coding Darrick J. Wong
2020-08-18 12:25 ` Amir Goldstein
2020-08-17 22:57 ` [PATCH 10/11] xfs: enable bigtime for quota timers Darrick J. Wong
2020-08-18 13:58 ` Amir Goldstein
2020-08-18 15:59 ` Darrick J. Wong
2020-08-17 22:57 ` [PATCH 11/11] xfs: enable big timestamps Darrick J. Wong
2020-08-18 14:04 ` Amir Goldstein
2020-08-18 23:01 ` [PATCH v2 00/11] xfs: widen timestamps to deal with y2038 Dave Chinner
2020-08-18 23:10 ` Darrick J. Wong
2020-08-18 23:41 ` Dave Chinner
2020-08-21 2:11 [PATCH v3 " Darrick J. Wong
2020-08-21 2:12 ` [PATCH 08/11] xfs: widen ondisk timestamps to deal with y2038 problem Darrick J. Wong
2020-08-22 7:33 ` Christoph Hellwig
2020-08-24 2:43 ` Darrick J. Wong
2020-08-25 0:39 ` Darrick J. Wong
2020-08-24 1:25 ` Dave Chinner
2020-08-24 3:13 ` Darrick J. Wong
2020-08-24 6:15 ` Dave Chinner
2020-08-24 16:24 ` Darrick J. Wong
2020-08-24 21:13 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAH3XsHF=bPBvogcWTRaK8XDFF2zu5Q5Mc++FUJYb8eHY9u_zqQ@mail.gmail.com' \
--to=xfssxsltislti2490@griffintucker.id.au \
--cc=darrick.wong@oracle.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).