From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, UNPARSEABLE_RELAY,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8BCC9C433E2 for ; Fri, 4 Sep 2020 15:51:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4AD752074D for ; Fri, 4 Sep 2020 15:51:52 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="SiICJ4e8" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726111AbgIDPvu (ORCPT ); Fri, 4 Sep 2020 11:51:50 -0400 Received: from userp2120.oracle.com ([156.151.31.85]:54628 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726184AbgIDPvo (ORCPT ); Fri, 4 Sep 2020 11:51:44 -0400 Received: from pps.filterd (userp2120.oracle.com [127.0.0.1]) by userp2120.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 084FnikE046773; Fri, 4 Sep 2020 15:51:38 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=corp-2020-01-29; bh=nLhSygs+UQAUo4ZZihitXIlq+LSFXDoPjn+Sn28ybq0=; b=SiICJ4e8qbasu6cEfj31IaFxYeybRfC9szEu4gUvgE7AeWgQ2Y6sVLP9vgN44HZ1Ucot AI95OILXNZx1r19iiqH41Ye9g3+V9eiI08OR/4TegxW6wBnIBdj9tHT1nYfLq8Oeh2WF 7C2K/Txn5Y62I4vPXBDVBweEmN0+i2uLeeThcoKFZcrDSwZn0oWdmzxmGfSIwlOhOQkp z78Q0RIw2GP4SawEQABZjhfkcUw3WyLa2FSfO7Srv6X7hVlp/TjCifNMU/fvfpQwNB8X urXsXV3NYhwjcXfZP0bW2E/ZmUOUJzEvfOtnUBIvjNbPMH3tASBZggeDR+K+X2cJrWxZ Xg== Received: from aserp3020.oracle.com (aserp3020.oracle.com [141.146.126.70]) by userp2120.oracle.com with ESMTP id 339dmndmhk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 04 Sep 2020 15:51:38 +0000 Received: from pps.filterd (aserp3020.oracle.com [127.0.0.1]) by aserp3020.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 084FonWD135718; Fri, 4 Sep 2020 15:51:38 GMT Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserp3020.oracle.com with ESMTP id 3380xdh700-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 04 Sep 2020 15:51:37 +0000 Received: from abhmp0012.oracle.com (abhmp0012.oracle.com [141.146.116.18]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id 084FpabE006539; Fri, 4 Sep 2020 15:51:36 GMT Received: from localhost (/67.169.218.210) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 04 Sep 2020 08:51:35 -0700 Date: Fri, 4 Sep 2020 08:51:35 -0700 From: "Darrick J. Wong" To: Chandan Babu R Cc: Dave Chinner , linux-xfs@vger.kernel.org, bfoster@redhat.com Subject: Re: [PATCH 3/3] xfs: Extend data/attr fork extent counter width Message-ID: <20200904155123.GH6096@magnolia> References: <20200831130010.454-1-chandanrlinux@gmail.com> <20200831130010.454-4-chandanrlinux@gmail.com> <20200903225145.GG12131@dread.disaster.area> <6807151.PTRk86Yujf@garuda> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <6807151.PTRk86Yujf@garuda> X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9734 signatures=668679 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 mlxscore=0 spamscore=0 phishscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 bulkscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2009040137 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9734 signatures=668679 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 bulkscore=0 phishscore=0 mlxlogscore=999 adultscore=0 impostorscore=0 mlxscore=0 suspectscore=0 spamscore=0 clxscore=1015 malwarescore=0 lowpriorityscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2009040137 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Fri, Sep 04, 2020 at 02:27:35PM +0530, Chandan Babu R wrote: > On Friday 4 September 2020 4:21:45 AM IST Dave Chinner wrote: > > On Mon, Aug 31, 2020 at 06:30:10PM +0530, Chandan Babu R wrote: > > > The commit xfs: fix inode fork extent count overflow > > > (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion > > > data fork extents should be possible to create. However the > > > corresponding on-disk field has a signed 32-bit type. Hence this > > > commit extends the per-inode data extent counter to 47 bits. The > > > length of 47-bits was chosen because, > > > Maximum file size = 2^63. > > > Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47. > > > > > > Also, XFS has a per-inode xattr extent counter which is 16 bits > > > wide. A workload which > > > 1. Creates 1 million 255-byte sized xattrs, > > > 2. Deletes 50% of these xattrs in an alternating manner, > > > 3. Tries to insert 400,000 new 255-byte sized xattrs > > > causes the xattr extent counter to overflow. > > > > > > Dave tells me that there are instances where a single file has more than > > > 100 million hardlinks. With parent pointers being stored in xattrs, we > > > will overflow the signed 16-bits wide xattr extent counter when large > > > number of hardlinks are created. Hence this commit extends the on-disk > > > field to 32-bits. > > > > > > The following changes are made to accomplish this, > > > > > > 1. A new incompat superblock flag to prevent older kernels from mounting > > > the filesystem. This flag has to be set during mkfs time. > > > 2. Carve out a new 32-bit field from xfs_dinode->di_pad2[]. This field > > > holds the most significant 15 bits of the data extent counter. > > > 3. Carve out a new 16-bit field from xfs_dinode->di_pad2[]. This field > > > holds the most significant 16 bits of the attr extent counter. > > > > > > Signed-off-by: Chandan Babu R > > > --- > > > fs/xfs/libxfs/xfs_bmap.c | 8 ++++--- > > > fs/xfs/libxfs/xfs_format.h | 20 ++++++++++++---- > > > fs/xfs/libxfs/xfs_inode_buf.c | 42 ++++++++++++++++++++++++++------- > > > fs/xfs/libxfs/xfs_inode_buf.h | 4 ++-- > > > fs/xfs/libxfs/xfs_inode_fork.h | 17 +++++++++---- > > > fs/xfs/libxfs/xfs_log_format.h | 8 ++++--- > > > fs/xfs/libxfs/xfs_types.h | 10 ++++---- > > > fs/xfs/scrub/inode.c | 2 +- > > > fs/xfs/xfs_inode.c | 2 +- > > > fs/xfs/xfs_inode_item.c | 12 ++++++++-- > > > fs/xfs/xfs_inode_item_recover.c | 20 ++++++++++++---- > > > 11 files changed, 105 insertions(+), 40 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > > > index 16b983b8977d..8788f47ba59e 100644 > > > --- a/fs/xfs/libxfs/xfs_bmap.c > > > +++ b/fs/xfs/libxfs/xfs_bmap.c > > > @@ -52,9 +52,9 @@ xfs_bmap_compute_maxlevels( > > > xfs_mount_t *mp, /* file system mount structure */ > > > int whichfork) /* data or attr fork */ > > > { > > > + xfs_extnum_t maxleafents; /* max leaf entries possible */ > > > int level; /* btree level */ > > > uint maxblocks; /* max blocks at this level */ > > > - uint maxleafents; /* max leaf entries possible */ > > > int maxrootrecs; /* max records in root block */ > > > int minleafrecs; /* min records in leaf block */ > > > int minnoderecs; /* min records in node block */ > > > @@ -64,7 +64,9 @@ xfs_bmap_compute_maxlevels( > > > * The maximum number of extents in a file, hence the maximum number of > > > * leaf entries, is controlled by the size of the on-disk extent count, > > > * either a signed 32-bit number for the data fork, or a signed 16-bit > > > - * number for the attr fork. > > > + * number for the attr fork. With mkfs.xfs' wide-extcount option > > > + * enabled, the data fork extent count is unsigned 47-bits wide, while > > > + * the corresponding attr fork extent count is unsigned 32-bits wide. > > > > This doesn't really need to state what the sizes of the on disk > > fields are. If anything should state that, it's a description of the > > helper function that returns the maximum supported extent count. > > Also, it's the maximum extents in a the fork, not the _file_. > > > > i.e. this should probably just read > > > > * The maximum number of extents in a fork, hence the maximum number of > > * leaf entries, is controlled by the size of the on-disk extent count. > > I agree. I will fix this up. > > > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h > > > index 5f41e177dbda..2684cafd0356 100644 > > > --- a/fs/xfs/libxfs/xfs_format.h > > > +++ b/fs/xfs/libxfs/xfs_format.h > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature( > > > #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ > > > #define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */ > > > #define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */ > > > -#define XFS_SB_FEAT_INCOMPAT_ALL \ > > > +#define XFS_SB_FEAT_INCOMPAT_WIDEEXTCNT (1 << 3) /* Wider data/attr fork extent counters */ > > > +#define XFS_SB_FEAT_INCOMPAT_ALL \ > > > (XFS_SB_FEAT_INCOMPAT_FTYPE| \ > > > XFS_SB_FEAT_INCOMPAT_SPINODES| \ > > > - XFS_SB_FEAT_INCOMPAT_META_UUID) > > > + XFS_SB_FEAT_INCOMPAT_META_UUID| \ > > > + XFS_SB_FEAT_INCOMPAT_WIDEEXTCNT) > > > > Don't we normally add the feature bit in a standalone patch once all > > the infrastructure has already been put in place? > > Yes, I now realize that code changes like "defining new fields in on-disk > inode structure" and "promoting xfs_extnum_t to uint64_t" can be moved to a > separate patch. I will split this patch into as many required parts before > posting the next version. > > > > > > #define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL > > > static inline bool > > > @@ -551,6 +553,12 @@ static inline bool xfs_sb_version_hasmetauuid(struct xfs_sb *sbp) > > > (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID); > > > } > > > > > > +static inline bool xfs_sb_version_haswideextcnt(struct xfs_sb *sbp) > > > +{ > > > + return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) && > > > + (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_WIDEEXTCNT); > > > +} > > > > I don't really like the name of the feature :/ > > > > Precendence in naming feature additions like this is "32 bit project > > IDs" - when we extended them from 16 to 32 bits, we didn't call them > > "wide project IDs" as "wide" could mean anything. What do we do if > > we later need to increase the size of the attribute fork extent > > count? :/ > > > > xfs_sb_version_hasextcount_64bit() would match the > > xfs_sb_version_hasprojid_32bit() naming internally.... I was about to suggest "nexts64" but my brain typo'd that into "next4" and no don't go there. ;) > > I agree. I will fix the name here and in xfsprogs. > > > > > > static inline bool xfs_sb_version_hasrmapbt(struct xfs_sb *sbp) > > > { > > > return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) && > > > @@ -873,8 +881,8 @@ typedef struct xfs_dinode { > > > __be64 di_size; /* number of bytes in file */ > > > __be64 di_nblocks; /* # of direct & btree blocks used */ > > > __be32 di_extsize; /* basic/minimum extent size for file */ > > > - __be32 di_nextents; /* number of extents in data fork */ > > > - __be16 di_anextents; /* number of extents in attribute fork*/ > > > + __be32 di_nextents_lo; /* lower part of data fork extent count */ > > > + __be16 di_anextents_lo;/* lower part of attr fork extent count */ > > > __u8 di_forkoff; /* attr fork offs, <<3 for 64b align */ > > > __s8 di_aformat; /* format of attr fork's data */ > > > __be32 di_dmevmask; /* DMIG event mask */ > > > @@ -891,7 +899,9 @@ typedef struct xfs_dinode { > > > __be64 di_lsn; /* flush sequence */ > > > __be64 di_flags2; /* more random flags */ > > > __be32 di_cowextsize; /* basic cow extent size for file */ > > > - __u8 di_pad2[12]; /* more padding for future expansion */ > > > + __be32 di_nextents_hi; /* higher part of data fork extent count */ > > > + __be16 di_anextents_hi;/* higher part of attr fork extent count */ > > > + __u8 di_pad2[6]; /* more padding for future expansion */ > > > > I think I've mentioned this before - I don't really like extending > > inode variables this way. We did it for projid32 because we did not > > have any spare space in the v4 inode to do anything else. > > Yes, You had suggested the "add new inode member" approach in one of the older > versions of the patchset. But Christoph had objected to this approach > (https://www.spinics.net/lists/linux-xfs/msg40112.html). Hence I had dropped > the idea. Sorry, I should have consulted with you before taking that decision. > > > > > I would kinda prefer to do something like this: > > > > - __be32 di_nextents; /* number of extents in data fork */ > > - __be16 di_anextents; /* number of extents in attribute fork*/ > > + __be32 di_nextents32; /* 32 bit fork extent count */ > > + __be16 di_nextents16; /* 16 bit fork extent count */ > > .... > > - __u8 di_pad2[12]; /* more padding for future expansion */ > > + __u8 di_pad2[4]; /* more padding for future expansion */ > > + __be64 di_nextents64; /* 64 bit fork extent count */ The comments for these fields had better document the fact that we have this shifty encoding scheme. Something like: /* * On a extcount64 filesystem, di_nextents64 holds the data fork * extent count, di_nextents32 holds the attr fork extent count, * and di_nextents16 must be zero. * * Without that feature, di_nextents32 holds the data fork * extent count, di_nextents16 holds the attr fork extent count, * and di_nextents64 must be zero. */ __be32 di_nextents32; __be16 di_nextents16; .... __be64 di_nextents64; I more or less agree with the rest of Dave's reply. --D > > > > > > And then depending on the hasextcount_64bit bit is set, we read from > > disk like this: > > > > if (hasextcount_64bit) { > > to->di_nextents = be64_to_cpu(dip->di_nextents64); > > to->di_naextents = be32_to_cpu(dip->di_nextents32); > > if (dip->di_nextents16 != 0) > > return -EFSCORRUPTED; > > } else { > > to->di_nextents = be32_to_cpu(dip->di_nextents32); > > to->di_naextents = be16_to_cpu(dip->di_nextents16); > > if (dip->di_nextents64 != 0) > > return -EFSCORRUPTED; > > } > > > > and the writing to disk is equally simple. There's no bit shifting > > or masking, and we still end up with the same amount of unused space > > in the inode when hasextcount_64bit is set because di_nextents16 can > > be reused by another new feature.... > > > > > @@ -408,10 +425,17 @@ xfs_dfork_nextents(struct xfs_sb *sbp, struct xfs_dinode *dip, int whichfork) > > > { > > > xfs_extnum_t nextents; > > > > > > - if (whichfork == XFS_DATA_FORK) > > > - nextents = be32_to_cpu(dip->di_nextents); > > > - else > > > - nextents = be16_to_cpu(dip->di_anextents); > > > + if (whichfork == XFS_DATA_FORK) { > > > + nextents = be32_to_cpu(dip->di_nextents_lo); > > > + if (xfs_sb_version_haswideextcnt(sbp)) > > > + nextents |= > > > + ((xfs_extnum_t)be32_to_cpu(dip->di_nextents_hi) << 32); > > > + } else { > > > + nextents = be16_to_cpu(dip->di_anextents_lo); > > > + if (xfs_sb_version_haswideextcnt(sbp)) > > > + nextents |= > > > + ((xfs_aextnum_t)be16_to_cpu(dip->di_anextents_hi) << 16); > > > + } > > > > ... and we get rid of this bit of messy code :) > > I agree. I am pretty sure that this will also make it easy to code up the > corresponding changes in xfs_db. > > > > > > @@ -157,10 +157,17 @@ static inline xfs_extnum_t xfs_iext_max(struct xfs_sb *sbp, int whichfork) > > > { > > > ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK); > > > > > > - if (whichfork == XFS_DATA_FORK) > > > - return MAXEXTNUM; > > > - else > > > - return MAXAEXTNUM; > > > + if (whichfork == XFS_DATA_FORK) { > > > + if (xfs_sb_version_haswideextcnt(sbp)) > > > + return MAXEXTNUM_HI; > > > + else > > > + return MAXEXTNUM; > > > + } else { > > > + if (xfs_sb_version_haswideextcnt(sbp)) > > > + return MAXAEXTNUM_HI; > > > + else > > > + return MAXAEXTNUM; > > > + } > > > > I think we should actually rework MAXEXTNUM/MAXAEXTNUM before doing > > this. They are defined in xfs_types.h as in-memory limits, while > > these are actually returning on-disk format limits which should be > > defined in xfs_format.h > > > > e.g: > > > > #define XFS_IFORK_EXTCNT_MAX64 .... > > #define XFS_IFORK_EXTCNT_MAX32 .... > > #define XFS_IFORK_EXTCNT_MAX16 .... > > > > And in xfs_iext_max() we do: > > > > bool has64 = xfs_sb_version_haswideextcnt() > > > > switch (whichfork) { > > case XFS_DATA_FORK: > > return has64 ? XFS_IFORK_EXTCNT_MAX64 : XFS_IFORK_EXTCNT_MAX32; > > case XFS_ATTR_FORK: > > return has64 ? XFS_IFORK_EXTCNT_MAX32 : XFS_IFORK_EXTCNT_MAX16; > > case XFS_COW_FORK: > > return XFS_IFORK_EXTCNT_MAX32; > > default: > > ASSERT(0); > > break; > > } > > return -EFSCORRUPTED; > > > > > @@ -59,8 +59,10 @@ typedef void * xfs_failaddr_t; > > > * Max values for extlen, extnum, aextnum. > > > */ > > > #define MAXEXTLEN ((xfs_extlen_t)0x001fffff) /* 21 bits */ > > > -#define MAXEXTNUM ((xfs_extnum_t)0x7fffffff) /* signed int */ > > > -#define MAXAEXTNUM ((xfs_aextnum_t)0x7fff) /* signed short */ > > > +#define MAXEXTNUM ((int32_t)0x7fffffff) /* signed int */ > > > +#define MAXAEXTNUM ((int16_t)0x7fff) /* signed short */ > > > +#define MAXEXTNUM_HI ((xfs_extnum_t)0x7fffffffffff) /* unsigned 47 bits */ > > > +#define MAXAEXTNUM_HI ((xfs_aextnum_t)0xffffffff) /* unsigned 32 bits */ > > > > Yeah, these on-disk limits need to go into xfs_format.h and not used > > directly anymore... > > > > Sure. I will apply the above comments in the next version of this patchset. > > Thanks a lot for your time! > > -- > chandan > > >