* [PATCH 0/3] xfs: fix stale disk exposure after crash @ 2020-05-19 0:49 Darrick J. Wong 2020-05-19 0:49 ` [PATCH 1/3] xfs: force writes to delalloc regions to unwritten Darrick J. Wong ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Darrick J. Wong @ 2020-05-19 0:49 UTC (permalink / raw) To: darrick.wong; +Cc: Christoph Hellwig, linux-xfs, hch, bfoster Hi all, These three patches try to shrink the window during which a crash during writeback can expose stale disk contents. The first patch causes delalloc reservations to be converted to unwritten extents for any writeback that's going on within EOF. The second patch fixes a minor error encountered during writeback; and the third patch fixes speculative preallocation to work when the EOF block could be unwritten. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=stale-exposure xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=stale-exposure ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/3] xfs: force writes to delalloc regions to unwritten 2020-05-19 0:49 [PATCH 0/3] xfs: fix stale disk exposure after crash Darrick J. Wong @ 2020-05-19 0:49 ` Darrick J. Wong 2020-05-19 12:45 ` Brian Foster 2020-05-19 0:49 ` [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong 2020-05-19 0:49 ` [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong 2 siblings, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2020-05-19 0:49 UTC (permalink / raw) To: darrick.wong; +Cc: Christoph Hellwig, linux-xfs, hch, bfoster From: Darrick J. Wong <darrick.wong@oracle.com> When writing to a delalloc region in the data fork, commit the new allocations (of the da reservation) as unwritten so that the mappings are only marked written once writeback completes successfully. This fixes the problem of stale data exposure if the system goes down during targeted writeback of a specific region of a file, as tested by generic/042. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_bmap.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index fda13cd7add0..825d170e1503 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -4193,17 +4193,7 @@ xfs_bmapi_allocate( bma->got.br_blockcount = bma->length; bma->got.br_state = XFS_EXT_NORM; - /* - * In the data fork, a wasdelay extent has been initialized, so - * shouldn't be flagged as unwritten. - * - * For the cow fork, however, we convert delalloc reservations - * (extents allocated for speculative preallocation) to - * allocated unwritten extents, and only convert the unwritten - * extents to real extents when we're about to write the data. - */ - if ((!bma->wasdel || (bma->flags & XFS_BMAPI_COWFORK)) && - (bma->flags & XFS_BMAPI_PREALLOC)) + if (bma->flags & XFS_BMAPI_PREALLOC) bma->got.br_state = XFS_EXT_UNWRITTEN; if (bma->wasdel) @@ -4611,8 +4601,24 @@ xfs_bmapi_convert_delalloc( bma.offset = bma.got.br_startoff; bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN); bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork); + + /* + * When we're converting the delalloc reservations backing dirty pages + * in the page cache, we must be careful about how we create the new + * extents: + * + * New CoW fork extents are created unwritten, turned into real extents + * when we're about to write the data to disk, and mapped into the data + * fork after the write finishes. End of story. + * + * New data fork extents must be mapped in as unwritten and converted + * to real extents after the write succeeds to avoid exposing stale + * disk contents if we crash. + */ if (whichfork == XFS_COW_FORK) bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC; + else + bma.flags = XFS_BMAPI_PREALLOC; if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev)) bma.prev.br_startoff = NULLFILEOFF; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/3] xfs: force writes to delalloc regions to unwritten 2020-05-19 0:49 ` [PATCH 1/3] xfs: force writes to delalloc regions to unwritten Darrick J. Wong @ 2020-05-19 12:45 ` Brian Foster 0 siblings, 0 replies; 15+ messages in thread From: Brian Foster @ 2020-05-19 12:45 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, hch On Mon, May 18, 2020 at 05:49:11PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > When writing to a delalloc region in the data fork, commit the new > allocations (of the da reservation) as unwritten so that the mappings > are only marked written once writeback completes successfully. This > fixes the problem of stale data exposure if the system goes down during > targeted writeback of a specific region of a file, as tested by > generic/042. > We could probably add generic/042 into the auto group once this patch lands. > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> > --- > fs/xfs/libxfs/xfs_bmap.c | 28 +++++++++++++++++----------- > 1 file changed, 17 insertions(+), 11 deletions(-) > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index fda13cd7add0..825d170e1503 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c ... > @@ -4611,8 +4601,24 @@ xfs_bmapi_convert_delalloc( > bma.offset = bma.got.br_startoff; > bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN); > bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork); > + > + /* > + * When we're converting the delalloc reservations backing dirty pages > + * in the page cache, we must be careful about how we create the new > + * extents: > + * > + * New CoW fork extents are created unwritten, turned into real extents > + * when we're about to write the data to disk, and mapped into the data > + * fork after the write finishes. End of story. > + * > + * New data fork extents must be mapped in as unwritten and converted > + * to real extents after the write succeeds to avoid exposing stale > + * disk contents if we crash. > + */ > if (whichfork == XFS_COW_FORK) > bma.flags = XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC; > + else > + bma.flags = XFS_BMAPI_PREALLOC; The following seems a bit cleaner: bma.flags = XFS_BMAPI_PREALLOC; if (whichfork == XFS_COW_FORK) bma.flags |= XFS_BMAPI_COWFORK; ... but nit aside, LGTM: Reviewed-by: Brian Foster <bfoster@redhat.com> > > if (!xfs_iext_peek_prev_extent(ifp, &bma.icur, &bma.prev)) > bma.prev.br_startoff = NULLFILEOFF; > ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot 2020-05-19 0:49 [PATCH 0/3] xfs: fix stale disk exposure after crash Darrick J. Wong 2020-05-19 0:49 ` [PATCH 1/3] xfs: force writes to delalloc regions to unwritten Darrick J. Wong @ 2020-05-19 0:49 ` Darrick J. Wong 2020-05-19 7:13 ` Christoph Hellwig 2020-05-19 12:46 ` Brian Foster 2020-05-19 0:49 ` [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong 2 siblings, 2 replies; 15+ messages in thread From: Darrick J. Wong @ 2020-05-19 0:49 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, hch, bfoster From: Darrick J. Wong <darrick.wong@oracle.com> During writeback, it's possible for the quota block reservation in xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota limit. This causes writeback errors for data that was already written to disk, when it's not even guaranteed that the bmbt will expand to exceed the quota limit. Irritatingly, this condition is reported to userspace as EIO by fsync, which is confusing. We wrote the data, so allow the reservation. That might put us slightly above the hard limit, but it's better than losing data after a write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_iomap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index bb590a267a7f..ac970b13b1f8 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -563,7 +563,7 @@ xfs_iomap_write_unwritten( xfs_trans_ijoin(tp, ip, 0); error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0, - XFS_QMOPT_RES_REGBLKS); + XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES); if (error) goto error_on_bmapi_transaction; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot 2020-05-19 0:49 ` [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong @ 2020-05-19 7:13 ` Christoph Hellwig 2020-05-19 12:46 ` Brian Foster 1 sibling, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2020-05-19 7:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch, bfoster On Mon, May 18, 2020 at 05:49:17PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > During writeback, it's possible for the quota block reservation in > xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota > limit. This causes writeback errors for data that was already written > to disk, when it's not even guaranteed that the bmbt will expand to > exceed the quota limit. Irritatingly, this condition is reported to > userspace as EIO by fsync, which is confusing. > > We wrote the data, so allow the reservation. That might put us slightly > above the hard limit, but it's better than losing data after a write. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot 2020-05-19 0:49 ` [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong 2020-05-19 7:13 ` Christoph Hellwig @ 2020-05-19 12:46 ` Brian Foster 1 sibling, 0 replies; 15+ messages in thread From: Brian Foster @ 2020-05-19 12:46 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch On Mon, May 18, 2020 at 05:49:17PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > During writeback, it's possible for the quota block reservation in > xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota > limit. This causes writeback errors for data that was already written > to disk, when it's not even guaranteed that the bmbt will expand to > exceed the quota limit. Irritatingly, this condition is reported to > userspace as EIO by fsync, which is confusing. > > We wrote the data, so allow the reservation. That might put us slightly > above the hard limit, but it's better than losing data after a write. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- Reviewed-by: Brian Foster <bfoster@redhat.com> > fs/xfs/xfs_iomap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index bb590a267a7f..ac970b13b1f8 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -563,7 +563,7 @@ xfs_iomap_write_unwritten( > xfs_trans_ijoin(tp, ip, 0); > > error = xfs_trans_reserve_quota_nblks(tp, ip, resblks, 0, > - XFS_QMOPT_RES_REGBLKS); > + XFS_QMOPT_RES_REGBLKS | XFS_QMOPT_FORCE_RES); > if (error) > goto error_on_bmapi_transaction; > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-19 0:49 [PATCH 0/3] xfs: fix stale disk exposure after crash Darrick J. Wong 2020-05-19 0:49 ` [PATCH 1/3] xfs: force writes to delalloc regions to unwritten Darrick J. Wong 2020-05-19 0:49 ` [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong @ 2020-05-19 0:49 ` Darrick J. Wong 2020-05-19 12:48 ` Brian Foster 2020-05-19 12:54 ` Christoph Hellwig 2 siblings, 2 replies; 15+ messages in thread From: Darrick J. Wong @ 2020-05-19 0:49 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, hch, bfoster From: Darrick J. Wong <darrick.wong@oracle.com> When we're estimating a new speculative preallocation length for an extending write, we should walk backwards through the extent list to determine the number of number of blocks that are physically and logically contiguous with the write offset, and use that as an input to the preallocation size computation. This way, preallocation length is truly measured by the effectiveness of the allocator in giving us contiguous allocations without being influenced by the state of a given extent. This fixes both the problem where ZERO_RANGE within an EOF can reduce preallocation, and prevents the unnecessary shrinkage of preallocation when delalloc extents are turned into unwritten extents. This was found as a regression in xfs/014 after changing delalloc writes to create unwritten extents during writeback. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_iomap.c | 63 +++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 52 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index ac970b13b1f8..2dffd56a433c 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -351,6 +351,46 @@ xfs_quota_calc_throttle( } } +/* + * Determine if the previous extent's range of offsets is contiguous with + * @offset_fsb. If so, set @prev_contig to the number of blocks that are + * physically contiguous with that previous extent and return true. If there + * is no previous extent or there's a hole right before @offset_fsb, return + * false. + * + * Note that we don't care if the previous extents are written or not. + */ +static inline bool +xfs_iomap_prev_contiguous( + struct xfs_ifork *ifp, + struct xfs_iext_cursor *cur, + xfs_fileoff_t offset_fsb, + xfs_extlen_t *prev_contig) +{ + struct xfs_iext_cursor ncur = *cur; + struct xfs_bmbt_irec got, old; + + xfs_iext_prev(ifp, &ncur); + if (!xfs_iext_get_extent(ifp, &ncur, &old)) + return false; + if (old.br_startoff + old.br_blockcount < offset_fsb) + return false; + + *prev_contig = old.br_blockcount; + + xfs_iext_prev(ifp, &ncur); + while (xfs_iext_get_extent(ifp, &ncur, &got) && + got.br_blockcount + got.br_startoff == old.br_startoff && + got.br_blockcount + got.br_startblock == old.br_startblock && + *prev_contig <= MAXEXTLEN) { + *prev_contig += got.br_blockcount; + old = got; /* struct copy */ + xfs_iext_prev(ifp, &ncur); + } + + return true; +} + /* * If we are doing a write at the end of the file and there are no allocations * past this one, then extend the allocation out to the file system's write @@ -380,12 +420,12 @@ xfs_iomap_prealloc_size( struct xfs_mount *mp = ip->i_mount; struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); - struct xfs_bmbt_irec prev; int shift = 0; int64_t freesp; xfs_fsblock_t qblocks; int qshift = 0; xfs_fsblock_t alloc_blocks = 0; + xfs_extlen_t plen = 0; if (offset + count <= XFS_ISIZE(ip)) return 0; @@ -400,9 +440,9 @@ xfs_iomap_prealloc_size( */ if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) || XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || - !xfs_iext_peek_prev_extent(ifp, icur, &prev) || - prev.br_startoff + prev.br_blockcount < offset_fsb) + !xfs_iomap_prev_contiguous(ifp, icur, offset_fsb, &plen)) { return mp->m_allocsize_blocks; + } /* * Determine the initial size of the preallocation. We are beyond the @@ -413,15 +453,16 @@ xfs_iomap_prealloc_size( * preallocation size. * * If the extent is a hole, then preallocation is essentially disabled. - * Otherwise we take the size of the preceding data extent as the basis - * for the preallocation size. If the size of the extent is greater than - * half the maximum extent length, then use the current offset as the - * basis. This ensures that for large files the preallocation size - * always extends to MAXEXTLEN rather than falling short due to things - * like stripe unit/width alignment of real extents. + * Otherwise we take the size of the contiguous preceding data extents + * as the basis for the preallocation size. If the size of the extent + * is greater than half the maximum extent length, then use the current + * offset as the basis. This ensures that for large files the + * preallocation size always extends to MAXEXTLEN rather than falling + * short due to things like stripe unit/width alignment of real + * extents. */ - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) - alloc_blocks = prev.br_blockcount << 1; + if (plen <= (MAXEXTLEN >> 1)) + alloc_blocks = plen << 1; else alloc_blocks = XFS_B_TO_FSB(mp, offset); if (!alloc_blocks) ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-19 0:49 ` [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong @ 2020-05-19 12:48 ` Brian Foster 2020-05-20 13:23 ` Brian Foster 2020-05-19 12:54 ` Christoph Hellwig 1 sibling, 1 reply; 15+ messages in thread From: Brian Foster @ 2020-05-19 12:48 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch On Mon, May 18, 2020 at 05:49:23PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > When we're estimating a new speculative preallocation length for an > extending write, we should walk backwards through the extent list to > determine the number of number of blocks that are physically and > logically contiguous with the write offset, and use that as an input to > the preallocation size computation. > > This way, preallocation length is truly measured by the effectiveness of > the allocator in giving us contiguous allocations without being > influenced by the state of a given extent. This fixes both the problem > where ZERO_RANGE within an EOF can reduce preallocation, and prevents > the unnecessary shrinkage of preallocation when delalloc extents are > turned into unwritten extents. > > This was found as a regression in xfs/014 after changing delalloc writes > to create unwritten extents during writeback. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > fs/xfs/xfs_iomap.c | 63 +++++++++++++++++++++++++++++++++++++++++++--------- > 1 file changed, 52 insertions(+), 11 deletions(-) > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index ac970b13b1f8..2dffd56a433c 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -351,6 +351,46 @@ xfs_quota_calc_throttle( > } > } > > +/* > + * Determine if the previous extent's range of offsets is contiguous with > + * @offset_fsb. If so, set @prev_contig to the number of blocks that are > + * physically contiguous with that previous extent and return true. If there > + * is no previous extent or there's a hole right before @offset_fsb, return > + * false. > + * > + * Note that we don't care if the previous extents are written or not. Why? :) Might be helpful to elaborate on why we require this algorithm now... > + */ > +static inline bool > +xfs_iomap_prev_contiguous( > + struct xfs_ifork *ifp, > + struct xfs_iext_cursor *cur, > + xfs_fileoff_t offset_fsb, > + xfs_extlen_t *prev_contig) > +{ > + struct xfs_iext_cursor ncur = *cur; > + struct xfs_bmbt_irec got, old; > + > + xfs_iext_prev(ifp, &ncur); > + if (!xfs_iext_get_extent(ifp, &ncur, &old)) > + return false; > + if (old.br_startoff + old.br_blockcount < offset_fsb) > + return false; > + > + *prev_contig = old.br_blockcount; > + > + xfs_iext_prev(ifp, &ncur); > + while (xfs_iext_get_extent(ifp, &ncur, &got) && > + got.br_blockcount + got.br_startoff == old.br_startoff && > + got.br_blockcount + got.br_startblock == old.br_startblock && > + *prev_contig <= MAXEXTLEN) { > + *prev_contig += got.br_blockcount; > + old = got; /* struct copy */ > + xfs_iext_prev(ifp, &ncur); > + } > + > + return true; > +} > + > /* > * If we are doing a write at the end of the file and there are no allocations > * past this one, then extend the allocation out to the file system's write > @@ -380,12 +420,12 @@ xfs_iomap_prealloc_size( > struct xfs_mount *mp = ip->i_mount; > struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); > xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > - struct xfs_bmbt_irec prev; > int shift = 0; > int64_t freesp; > xfs_fsblock_t qblocks; > int qshift = 0; > xfs_fsblock_t alloc_blocks = 0; > + xfs_extlen_t plen = 0; > > if (offset + count <= XFS_ISIZE(ip)) > return 0; > @@ -400,9 +440,9 @@ xfs_iomap_prealloc_size( > */ > if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) || > XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || > - !xfs_iext_peek_prev_extent(ifp, icur, &prev) || > - prev.br_startoff + prev.br_blockcount < offset_fsb) > + !xfs_iomap_prev_contiguous(ifp, icur, offset_fsb, &plen)) { > return mp->m_allocsize_blocks; > + } > > /* > * Determine the initial size of the preallocation. We are beyond the > @@ -413,15 +453,16 @@ xfs_iomap_prealloc_size( > * preallocation size. > * > * If the extent is a hole, then preallocation is essentially disabled. > - * Otherwise we take the size of the preceding data extent as the basis > - * for the preallocation size. If the size of the extent is greater than > - * half the maximum extent length, then use the current offset as the > - * basis. This ensures that for large files the preallocation size > - * always extends to MAXEXTLEN rather than falling short due to things > - * like stripe unit/width alignment of real extents. > + * Otherwise we take the size of the contiguous preceding data extents > + * as the basis for the preallocation size. If the size of the extent I'd refer to it as the "size of the contiguous range" or some such since we're now talking about aggregating many extents to form the prealloc basis. I am a little curious if there's any noticeable impact from having to perform the worst case extent walk in this path. For example, suppose we had a speculatively preallocated file where we started writing (i.e. converting) every other unwritten block such that this path had to walk an extent per block until hitting the 8g max (8g == 2097152 4k blocks). I guess the hope is that either most of those blocks wouldn't have been written back and converted by the time we'd trigger the next post-eof prealloc or that it would just be a wash in the stream of staggered writes falling into our max sized preallocations. Either way, I think it's more important to maintain the existing heuristic so this seems reasonable from that perspective. This scenario also makes me wonder if we should consider continuing to do write time file size extension zeroing in certain cases vs. leaving around unwritten extents. For example, the current post-eof prealloc behavior is that writes that jump over current EOF will zero (via buffered writes) any allocated blocks (delalloc or physical) between current EOF and the start of the write. That behavior doesn't change with delalloc -> unwritten if the prealloc is still delalloc at the time of the extending write, but we'd presumably skip those zeroing writes if the prealloc had been converted to real+unwritten first. Hmm.. that does seem a little random to me, particularly if somebody starts to see unwritten extents sprinkled throughout a file that never explicitly saw preallocation. Perhaps we should avoid converting post-eof blocks? OTOH, unwritten probably makes sense for large jumps in EOF vs zeroing GBs of blocks, so one could argue that we should convert such ranges (if still delalloc) rather than zero them at all. Thoughts? We should probably work this out one way or another before landing the delalloc -> unwritten behavior.. Brian > + * is greater than half the maximum extent length, then use the current > + * offset as the basis. This ensures that for large files the > + * preallocation size always extends to MAXEXTLEN rather than falling > + * short due to things like stripe unit/width alignment of real > + * extents. > */ > - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) > - alloc_blocks = prev.br_blockcount << 1; > + if (plen <= (MAXEXTLEN >> 1)) > + alloc_blocks = plen << 1; > else > alloc_blocks = XFS_B_TO_FSB(mp, offset); > if (!alloc_blocks) > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-19 12:48 ` Brian Foster @ 2020-05-20 13:23 ` Brian Foster 2020-05-20 19:48 ` Darrick J. Wong 0 siblings, 1 reply; 15+ messages in thread From: Brian Foster @ 2020-05-20 13:23 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch On Tue, May 19, 2020 at 08:48:27AM -0400, Brian Foster wrote: > On Mon, May 18, 2020 at 05:49:23PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > When we're estimating a new speculative preallocation length for an > > extending write, we should walk backwards through the extent list to > > determine the number of number of blocks that are physically and > > logically contiguous with the write offset, and use that as an input to > > the preallocation size computation. > > > > This way, preallocation length is truly measured by the effectiveness of > > the allocator in giving us contiguous allocations without being > > influenced by the state of a given extent. This fixes both the problem > > where ZERO_RANGE within an EOF can reduce preallocation, and prevents > > the unnecessary shrinkage of preallocation when delalloc extents are > > turned into unwritten extents. > > > > This was found as a regression in xfs/014 after changing delalloc writes > > to create unwritten extents during writeback. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > fs/xfs/xfs_iomap.c | 63 +++++++++++++++++++++++++++++++++++++++++++--------- > > 1 file changed, 52 insertions(+), 11 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index ac970b13b1f8..2dffd56a433c 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c ... > > @@ -413,15 +453,16 @@ xfs_iomap_prealloc_size( > > * preallocation size. > > * > > * If the extent is a hole, then preallocation is essentially disabled. > > - * Otherwise we take the size of the preceding data extent as the basis > > - * for the preallocation size. If the size of the extent is greater than > > - * half the maximum extent length, then use the current offset as the > > - * basis. This ensures that for large files the preallocation size > > - * always extends to MAXEXTLEN rather than falling short due to things > > - * like stripe unit/width alignment of real extents. > > + * Otherwise we take the size of the contiguous preceding data extents > > + * as the basis for the preallocation size. If the size of the extent > > I'd refer to it as the "size of the contiguous range" or some such since > we're now talking about aggregating many extents to form the prealloc > basis. > > I am a little curious if there's any noticeable impact from having to > perform the worst case extent walk in this path. For example, suppose we > had a speculatively preallocated file where we started writing (i.e. > converting) every other unwritten block such that this path had to walk > an extent per block until hitting the 8g max (8g == 2097152 4k blocks). > I guess the hope is that either most of those blocks wouldn't have been > written back and converted by the time we'd trigger the next post-eof > prealloc or that it would just be a wash in the stream of staggered > writes falling into our max sized preallocations. Either way, I think > it's more important to maintain the existing heuristic so this seems > reasonable from that perspective. > I had a system spinning yesterday to create this worst case condition across a couple max sized extents. Testing out the preallocation path, I see a hit from ~5ms to ~35ms between the baseline variant and the updated calculation algorithm. Note that the baseline here is only doing a 16 block prealloc vs. 8gb with the fix, but regardless a 30ms difference for an 8gb allocation doesn't really seem noticeable enough to matter. IOW, I think we can disgregard this particular concern... > This scenario also makes me wonder if we should consider continuing to > do write time file size extension zeroing in certain cases vs. leaving > around unwritten extents. For example, the current post-eof prealloc > behavior is that writes that jump over current EOF will zero (via > buffered writes) any allocated blocks (delalloc or physical) between > current EOF and the start of the write. That behavior doesn't change > with delalloc -> unwritten if the prealloc is still delalloc at the time > of the extending write, but we'd presumably skip those zeroing writes if > the prealloc had been converted to real+unwritten first. Hmm.. that does > seem a little random to me, particularly if somebody starts to see > unwritten extents sprinkled throughout a file that never explicitly saw > preallocation. Perhaps we should avoid converting post-eof blocks? OTOH, > unwritten probably makes sense for large jumps in EOF vs zeroing GBs of > blocks, so one could argue that we should convert such ranges (if still > delalloc) rather than zero them at all. Thoughts? We should probably > work this out one way or another before landing the delalloc -> > unwritten behavior.. > Thinking about this one a little more.. it seems it's probably not worth conditionally changing post-eof conversion behavior. Since there are cases where we probably want post-eof prealloc to remain as unwritten if it's carried within i_size, that would either be extra code for post-eof blocks or would split up any kind of heuristic for manually zeroing such blocks. ISTM that the right approach might be to convert all delalloc -> unwritten (as this series already does), then perhaps consider a write time heuristic that would perform manual zeroing of extents if certain conditions are met (e.g., for unwritten extents under a certain size). The remaining question to me is whether we should include something like that before changing delalloc behavior or consider it separately as an optimization... Brian P.S., BTW I forgot to mention on my last pass of this series that it probably makes sense to reorder these behavioral fixup patches to precede the patch that actually changes delalloc to allocate unwritten extents. > Brian > > > + * is greater than half the maximum extent length, then use the current > > + * offset as the basis. This ensures that for large files the > > + * preallocation size always extends to MAXEXTLEN rather than falling > > + * short due to things like stripe unit/width alignment of real > > + * extents. > > */ > > - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) > > - alloc_blocks = prev.br_blockcount << 1; > > + if (plen <= (MAXEXTLEN >> 1)) > > + alloc_blocks = plen << 1; > > else > > alloc_blocks = XFS_B_TO_FSB(mp, offset); > > if (!alloc_blocks) > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-20 13:23 ` Brian Foster @ 2020-05-20 19:48 ` Darrick J. Wong 2020-05-21 12:24 ` Brian Foster 0 siblings, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2020-05-20 19:48 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs, hch On Wed, May 20, 2020 at 09:23:34AM -0400, Brian Foster wrote: > On Tue, May 19, 2020 at 08:48:27AM -0400, Brian Foster wrote: > > On Mon, May 18, 2020 at 05:49:23PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > When we're estimating a new speculative preallocation length for an > > > extending write, we should walk backwards through the extent list to > > > determine the number of number of blocks that are physically and > > > logically contiguous with the write offset, and use that as an input to > > > the preallocation size computation. > > > > > > This way, preallocation length is truly measured by the effectiveness of > > > the allocator in giving us contiguous allocations without being > > > influenced by the state of a given extent. This fixes both the problem > > > where ZERO_RANGE within an EOF can reduce preallocation, and prevents > > > the unnecessary shrinkage of preallocation when delalloc extents are > > > turned into unwritten extents. > > > > > > This was found as a regression in xfs/014 after changing delalloc writes > > > to create unwritten extents during writeback. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > fs/xfs/xfs_iomap.c | 63 +++++++++++++++++++++++++++++++++++++++++++--------- > > > 1 file changed, 52 insertions(+), 11 deletions(-) > > > > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index ac970b13b1f8..2dffd56a433c 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > ... > > > @@ -413,15 +453,16 @@ xfs_iomap_prealloc_size( > > > * preallocation size. > > > * > > > * If the extent is a hole, then preallocation is essentially disabled. > > > - * Otherwise we take the size of the preceding data extent as the basis > > > - * for the preallocation size. If the size of the extent is greater than > > > - * half the maximum extent length, then use the current offset as the > > > - * basis. This ensures that for large files the preallocation size > > > - * always extends to MAXEXTLEN rather than falling short due to things > > > - * like stripe unit/width alignment of real extents. > > > + * Otherwise we take the size of the contiguous preceding data extents > > > + * as the basis for the preallocation size. If the size of the extent > > > > I'd refer to it as the "size of the contiguous range" or some such since > > we're now talking about aggregating many extents to form the prealloc > > basis. > > > > I am a little curious if there's any noticeable impact from having to > > perform the worst case extent walk in this path. For example, suppose we > > had a speculatively preallocated file where we started writing (i.e. > > converting) every other unwritten block such that this path had to walk > > an extent per block until hitting the 8g max (8g == 2097152 4k blocks). > > I guess the hope is that either most of those blocks wouldn't have been > > written back and converted by the time we'd trigger the next post-eof > > prealloc or that it would just be a wash in the stream of staggered > > writes falling into our max sized preallocations. Either way, I think > > it's more important to maintain the existing heuristic so this seems > > reasonable from that perspective. > > > > I had a system spinning yesterday to create this worst case condition > across a couple max sized extents. Testing out the preallocation path, I > see a hit from ~5ms to ~35ms between the baseline variant and the > updated calculation algorithm. Note that the baseline here is only doing > a 16 block prealloc vs. 8gb with the fix, but regardless a 30ms > difference for an 8gb allocation doesn't really seem noticeable enough > to matter. IOW, I think we can disgregard this particular concern... <nod> I'd figured that walking backwards through the incore extent tree shouldn't be all that costly, particularly since it tends to result in more aggressive speculative preallocation. > > This scenario also makes me wonder if we should consider continuing to > > do write time file size extension zeroing in certain cases vs. leaving > > around unwritten extents. For example, the current post-eof prealloc > > behavior is that writes that jump over current EOF will zero (via > > buffered writes) any allocated blocks (delalloc or physical) between > > current EOF and the start of the write. Right... > > That behavior doesn't change with delalloc -> unwritten if the > > prealloc is still delalloc at the time of the extending write, but > > we'd presumably skip those zeroing writes if the prealloc had been > > converted to real+unwritten first. ...though I wonder how often we'll encounter the situation where we've created a posteof preallocation, and we end up with written extents beyond EOF, and then the next write jumps past EOF? > > Hmm.. that does > > seem a little random to me, particularly if somebody starts to see > > unwritten extents sprinkled throughout a file that never explicitly saw > > preallocation. Perhaps we should avoid converting post-eof blocks? OTOH, > > unwritten probably makes sense for large jumps in EOF vs zeroing GBs of > > blocks, so one could argue that we should convert such ranges (if still > > delalloc) rather than zero them at all. Thoughts? We should probably > > work this out one way or another before landing the delalloc -> > > unwritten behavior.. /me wonders if that will be a problem in practice? If writeback starts, it'll (try to) convert the delalloc reservation into an unwritten extent (even if the extent crosses EOF). Writeback completion will convert the written parts to regular extents, leaving the rest unwritten. iomap_zero_range is a NOP on unwritten extents, so starting a new write far beyond EOF will do very little work to make sure [oldsize, write start] range is zero. (Or am I misunderstanding this...?) > Thinking about this one a little more.. it seems it's probably not worth > conditionally changing post-eof conversion behavior. Since there are > cases where we probably want post-eof prealloc to remain as unwritten if > it's carried within i_size, that would either be extra code for post-eof > blocks or would split up any kind of heuristic for manually zeroing such > blocks. ISTM that the right approach might be to convert all delalloc -> I might just leave it all unwritten and not try to be clever about pre-zeroing when we extend EOF because that just adds more complexity. > unwritten (as this series already does), then perhaps consider a write > time heuristic that would perform manual zeroing of extents if certain > conditions are met (e.g., for unwritten extents under a certain size). ISTR that ext4 has some sort of heuristic to do that. I'm not sure which is better as writes get relatively more expensive, but so long as the amount of zeroing is less than an erase block size it probably doesn't matter... > The remaining question to me is whether we should include something like > that before changing delalloc behavior or consider it separately as an > optimization... Separately. It's getting late in the cycle. ;) > Brian > > P.S., BTW I forgot to mention on my last pass of this series that it > probably makes sense to reorder these behavioral fixup patches to > precede the patch that actually changes delalloc to allocate unwritten > extents. Yeah, I'll do that. I think I mostly like Christoph's rewrite of the third patch of the series. --D > > Brian > > > > > + * is greater than half the maximum extent length, then use the current > > > + * offset as the basis. This ensures that for large files the > > > + * preallocation size always extends to MAXEXTLEN rather than falling > > > + * short due to things like stripe unit/width alignment of real > > > + * extents. > > > */ > > > - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) > > > - alloc_blocks = prev.br_blockcount << 1; > > > + if (plen <= (MAXEXTLEN >> 1)) > > > + alloc_blocks = plen << 1; > > > else > > > alloc_blocks = XFS_B_TO_FSB(mp, offset); > > > if (!alloc_blocks) > > > > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-20 19:48 ` Darrick J. Wong @ 2020-05-21 12:24 ` Brian Foster 0 siblings, 0 replies; 15+ messages in thread From: Brian Foster @ 2020-05-21 12:24 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch On Wed, May 20, 2020 at 12:48:53PM -0700, Darrick J. Wong wrote: > On Wed, May 20, 2020 at 09:23:34AM -0400, Brian Foster wrote: > > On Tue, May 19, 2020 at 08:48:27AM -0400, Brian Foster wrote: > > > On Mon, May 18, 2020 at 05:49:23PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > When we're estimating a new speculative preallocation length for an > > > > extending write, we should walk backwards through the extent list to > > > > determine the number of number of blocks that are physically and > > > > logically contiguous with the write offset, and use that as an input to > > > > the preallocation size computation. > > > > > > > > This way, preallocation length is truly measured by the effectiveness of > > > > the allocator in giving us contiguous allocations without being > > > > influenced by the state of a given extent. This fixes both the problem > > > > where ZERO_RANGE within an EOF can reduce preallocation, and prevents > > > > the unnecessary shrinkage of preallocation when delalloc extents are > > > > turned into unwritten extents. > > > > > > > > This was found as a regression in xfs/014 after changing delalloc writes > > > > to create unwritten extents during writeback. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --- > > > > fs/xfs/xfs_iomap.c | 63 +++++++++++++++++++++++++++++++++++++++++++--------- > > > > 1 file changed, 52 insertions(+), 11 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > index ac970b13b1f8..2dffd56a433c 100644 > > > > --- a/fs/xfs/xfs_iomap.c > > > > +++ b/fs/xfs/xfs_iomap.c > > ... > > > > @@ -413,15 +453,16 @@ xfs_iomap_prealloc_size( > > > > * preallocation size. > > > > * > > > > * If the extent is a hole, then preallocation is essentially disabled. > > > > - * Otherwise we take the size of the preceding data extent as the basis > > > > - * for the preallocation size. If the size of the extent is greater than > > > > - * half the maximum extent length, then use the current offset as the > > > > - * basis. This ensures that for large files the preallocation size > > > > - * always extends to MAXEXTLEN rather than falling short due to things > > > > - * like stripe unit/width alignment of real extents. > > > > + * Otherwise we take the size of the contiguous preceding data extents > > > > + * as the basis for the preallocation size. If the size of the extent > > > > > > I'd refer to it as the "size of the contiguous range" or some such since > > > we're now talking about aggregating many extents to form the prealloc > > > basis. > > > > > > I am a little curious if there's any noticeable impact from having to > > > perform the worst case extent walk in this path. For example, suppose we > > > had a speculatively preallocated file where we started writing (i.e. > > > converting) every other unwritten block such that this path had to walk > > > an extent per block until hitting the 8g max (8g == 2097152 4k blocks). > > > I guess the hope is that either most of those blocks wouldn't have been > > > written back and converted by the time we'd trigger the next post-eof > > > prealloc or that it would just be a wash in the stream of staggered > > > writes falling into our max sized preallocations. Either way, I think > > > it's more important to maintain the existing heuristic so this seems > > > reasonable from that perspective. > > > > > > > I had a system spinning yesterday to create this worst case condition > > across a couple max sized extents. Testing out the preallocation path, I > > see a hit from ~5ms to ~35ms between the baseline variant and the > > updated calculation algorithm. Note that the baseline here is only doing > > a 16 block prealloc vs. 8gb with the fix, but regardless a 30ms > > difference for an 8gb allocation doesn't really seem noticeable enough > > to matter. IOW, I think we can disgregard this particular concern... > > <nod> I'd figured that walking backwards through the incore extent tree > shouldn't be all that costly, particularly since it tends to result in > more aggressive speculative preallocation. > Yeah, I was more curious about going from oneshot "check the prev extent" logic to "walk N extents" logic with a significantly higher upper bound and wanted to make sure we've analyzed worst case behavior. > > > This scenario also makes me wonder if we should consider continuing to > > > do write time file size extension zeroing in certain cases vs. leaving > > > around unwritten extents. For example, the current post-eof prealloc > > > behavior is that writes that jump over current EOF will zero (via > > > buffered writes) any allocated blocks (delalloc or physical) between > > > current EOF and the start of the write. > > Right... > > > > That behavior doesn't change with delalloc -> unwritten if the > > > prealloc is still delalloc at the time of the extending write, but > > > we'd presumably skip those zeroing writes if the prealloc had been > > > converted to real+unwritten first. > > ...though I wonder how often we'll encounter the situation where we've > created a posteof preallocation, and we end up with written extents > beyond EOF, and then the next write jumps past EOF? > I don't think it's all that uncommon. The first contributing factor is preallocation size. If a write workload is sparsely appending to a file and sees one good speculative preallocation condition, further extending writes are more likely to fall into the preallocation and essentially compound the behavior for subsequent preallocations. The second contributing factor is writeback converting the delalloc extent. I guess we have less control over that between writeback heuristics, workload (fsync?), memory availability, etc., but suffice it to say that the larger the file and EOF extent grows, the more likely writeback converts the extent to real (now unwritten) blocks. > > > Hmm.. that does > > > seem a little random to me, particularly if somebody starts to see > > > unwritten extents sprinkled throughout a file that never explicitly saw > > > preallocation. Perhaps we should avoid converting post-eof blocks? OTOH, > > > unwritten probably makes sense for large jumps in EOF vs zeroing GBs of > > > blocks, so one could argue that we should convert such ranges (if still > > > delalloc) rather than zero them at all. Thoughts? We should probably > > > work this out one way or another before landing the delalloc -> > > > unwritten behavior.. > > /me wonders if that will be a problem in practice? If writeback starts, > it'll (try to) convert the delalloc reservation into an unwritten extent > (even if the extent crosses EOF). Writeback completion will convert the > written parts to regular extents, leaving the rest unwritten. > > iomap_zero_range is a NOP on unwritten extents, so starting a new write > far beyond EOF will do very little work to make sure [oldsize, write > start] range is zero. > > (Or am I misunderstanding this...?) > Nope, that's what happens. It's not a correctness issue. I'm just calling out the subtle behavior change in that we can potentially leave unwritten post-eof ranges sprinkled around as unwritten extents and asking whether we think that might cause any notable issues for users. We used to have a steady trickle of questions along the lines of "why does my file/fs use so much space?" that boiled down to speculative preallocation. The solution was usually more user education (i.e., FAQ updates) and I think users have kind of become used to it by now, but the broader point is that some users might notice such things and wonder why some workload that previously created large files with fewer large extents now creates files with hundreds (or more) extents purely due to extent state of unwritten ranges. I'm not totally convinced it matters. It could very well be just another case of more education, but I wanted to at least make sure we discussed it. ;) What particularly annoys me about this is that the resulting behavior is so inconsistent. Some workloads that might involve fsync or otherwise more frequent writeback events could see more unwritten extents whereas others with very similar write patterns might see none at all. Conversely, some users might continue to see huge post-eof zeroing writes while others won't depending on whether a post-eof write beats out writeback of the associated of extent. > > Thinking about this one a little more.. it seems it's probably not worth > > conditionally changing post-eof conversion behavior. Since there are > > cases where we probably want post-eof prealloc to remain as unwritten if > > it's carried within i_size, that would either be extra code for post-eof > > blocks or would split up any kind of heuristic for manually zeroing such > > blocks. ISTM that the right approach might be to convert all delalloc -> > > I might just leave it all unwritten and not try to be clever about > pre-zeroing when we extend EOF because that just adds more complexity. > Well we do already have the post-eof zeroing logic, but indeed it would be more complexity to conditionalize it on something other than extent state. I guess we'd need to pass a size hint or something to iomap or otherwise modify our callback path to detect this situation. There's also the question of whether to overwrite explicit (fallocate) post-eof preallocation or not... > > unwritten (as this series already does), then perhaps consider a write > > time heuristic that would perform manual zeroing of extents if certain > > conditions are met (e.g., for unwritten extents under a certain size). > > ISTR that ext4 has some sort of heuristic to do that. I'm not sure > which is better as writes get relatively more expensive, but so long as > the amount of zeroing is less than an erase block size it probably > doesn't matter... > Yeah, though note again that historical behavior is to perform buffered writes to zero the range regardless of its size or delalloc vs. real allocation state (explicit preallocation notwithstanding). We could use a cutoff size of tens of MBs or more and still be pretty conservative from a behavioral change standpoint. > > The remaining question to me is whether we should include something like > > that before changing delalloc behavior or consider it separately as an > > optimization... > > Separately. It's getting late in the cycle. ;) > To clarify, my question is whether we should require manual zeroing logic for some cases before changing delalloc allocation behavior or if we're Ok with whatever the "worst case" behavior is (and thus consider such manual zeroing a fragmentation mitigation optimization that can come later). I think this is independent of wherever the current release cycle is at because it's more a question of getting it right vs. getting it done. :) Brian > > Brian > > > > P.S., BTW I forgot to mention on my last pass of this series that it > > probably makes sense to reorder these behavioral fixup patches to > > precede the patch that actually changes delalloc to allocate unwritten > > extents. > > Yeah, I'll do that. I think I mostly like Christoph's rewrite of the > third patch of the series. > > --D > > > > Brian > > > > > > > + * is greater than half the maximum extent length, then use the current > > > > + * offset as the basis. This ensures that for large files the > > > > + * preallocation size always extends to MAXEXTLEN rather than falling > > > > + * short due to things like stripe unit/width alignment of real > > > > + * extents. > > > > */ > > > > - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) > > > > - alloc_blocks = prev.br_blockcount << 1; > > > > + if (plen <= (MAXEXTLEN >> 1)) > > > > + alloc_blocks = plen << 1; > > > > else > > > > alloc_blocks = XFS_B_TO_FSB(mp, offset); > > > > if (!alloc_blocks) > > > > > > > > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-19 0:49 ` [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong 2020-05-19 12:48 ` Brian Foster @ 2020-05-19 12:54 ` Christoph Hellwig 2020-05-20 21:17 ` Darrick J. Wong 1 sibling, 1 reply; 15+ messages in thread From: Christoph Hellwig @ 2020-05-19 12:54 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch, bfoster The actual logic looks good, but I think the new helper and another third set of comment explaining what is going on makes this area even more confusing. What about something like this instead? diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index bb590a267a7f9..26f9874361cd3 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -352,22 +352,10 @@ xfs_quota_calc_throttle( } /* - * If we are doing a write at the end of the file and there are no allocations - * past this one, then extend the allocation out to the file system's write - * iosize. - * * If we don't have a user specified preallocation size, dynamically increase * the preallocation size as the size of the file grows. Cap the maximum size * at a single extent or less if the filesystem is near full. The closer the - * filesystem is to full, the smaller the maximum prealocation. - * - * As an exception we don't do any preallocation at all if the file is smaller - * than the minimum preallocation and we are using the default dynamic - * preallocation scheme, as it is likely this is the only write to the file that - * is going to be done. - * - * We clean up any extra space left over when the file is closed in - * xfs_inactive(). + * filesystem is to full, the smaller the maximum preallocation. */ STATIC xfs_fsblock_t xfs_iomap_prealloc_size( @@ -380,52 +368,58 @@ xfs_iomap_prealloc_size( struct xfs_mount *mp = ip->i_mount; struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); - struct xfs_bmbt_irec prev; + struct xfs_iext_cursor ncur = *icur; + struct xfs_bmbt_irec prev, got; int shift = 0; int64_t freesp; xfs_fsblock_t qblocks; int qshift = 0; - xfs_fsblock_t alloc_blocks = 0; + xfs_fsblock_t alloc_blocks; + xfs_extlen_t plen; - if (offset + count <= XFS_ISIZE(ip)) - return 0; - - if (!(mp->m_flags & XFS_MOUNT_ALLOCSIZE) && - (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks))) + /* + * As an exception we don't do any preallocation at all if the file is + * smaller than the minimum preallocation and we are using the default + * dynamic preallocation scheme, as it is likely this is the only write + * to the file that is going to be done. + */ + if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)) return 0; /* - * If an explicit allocsize is set, the file is small, or we - * are writing behind a hole, then use the minimum prealloc: + * Otherwise use the minimum prealloca size for small files, or if we + * are writing right after a hole. */ - if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) || - XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || - !xfs_iext_peek_prev_extent(ifp, icur, &prev) || + if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || + !xfs_iext_prev_extent(ifp, &ncur, &prev) || prev.br_startoff + prev.br_blockcount < offset_fsb) return mp->m_allocsize_blocks; /* - * Determine the initial size of the preallocation. We are beyond the - * current EOF here, but we need to take into account whether this is - * a sparse write or an extending write when determining the - * preallocation size. Hence we need to look up the extent that ends - * at the current write offset and use the result to determine the - * preallocation size. - * - * If the extent is a hole, then preallocation is essentially disabled. - * Otherwise we take the size of the preceding data extent as the basis - * for the preallocation size. If the size of the extent is greater than - * half the maximum extent length, then use the current offset as the - * basis. This ensures that for large files the preallocation size - * always extends to MAXEXTLEN rather than falling short due to things - * like stripe unit/width alignment of real extents. + * Take the size of the contiguous preceding data extents as the basis + * for the preallocation size. Note that we don't care if the previous + * extents are written or not. */ - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) - alloc_blocks = prev.br_blockcount << 1; - else + plen = prev.br_blockcount; + while (xfs_iext_prev_extent(ifp, &ncur, &got)) { + if (plen > MAXEXTLEN / 2 || + got.br_startoff + got.br_blockcount != prev.br_startoff || + got.br_startblock + got.br_blockcount != prev.br_startblock) + break; + plen += got.br_blockcount; + prev = got; + } + + /* + * If the size of the extents is greater than half the maximum extent + * length, then use the current offset as the basis. This ensures that + * for large files the preallocation size always extends to MAXEXTLEN + * rather than falling short due to things like stripe unit/width + * alignment of real extents. + */ + alloc_blocks = plen * 2; + if (alloc_blocks > MAXEXTLEN) alloc_blocks = XFS_B_TO_FSB(mp, offset); - if (!alloc_blocks) - goto check_writeio; qblocks = alloc_blocks; /* @@ -494,7 +488,6 @@ xfs_iomap_prealloc_size( */ while (alloc_blocks && alloc_blocks >= freesp) alloc_blocks >>= 4; -check_writeio: if (alloc_blocks < mp->m_allocsize_blocks) alloc_blocks = mp->m_allocsize_blocks; trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift, @@ -961,9 +954,16 @@ xfs_buffered_write_iomap_begin( if (error) goto out_unlock; - if (eof) { - prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, offset, - count, &icur); + if (eof && offset + count > XFS_ISIZE(ip)) { + /* + * Determine the initial size of the preallocation. + * We clean up any extra preallocation when the file is closed. + */ + if (mp->m_flags & XFS_MOUNT_ALLOCSIZE) + prealloc_blocks = mp->m_allocsize_blocks; + else + prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, + offset, count, &icur); if (prealloc_blocks) { xfs_extlen_t align; xfs_off_t end_offset; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-19 12:54 ` Christoph Hellwig @ 2020-05-20 21:17 ` Darrick J. Wong 2020-05-21 9:31 ` Christoph Hellwig 0 siblings, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2020-05-20 21:17 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs, bfoster On Tue, May 19, 2020 at 05:54:37AM -0700, Christoph Hellwig wrote: > The actual logic looks good, but I think the new helper and another > third set of comment explaining what is going on makes this area even > more confusing. What about something like this instead? This seems reasonable, but the callsite cleanups ought to be a separate patch from the behavior change. > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index bb590a267a7f9..26f9874361cd3 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -352,22 +352,10 @@ xfs_quota_calc_throttle( > } > > /* > - * If we are doing a write at the end of the file and there are no allocations > - * past this one, then extend the allocation out to the file system's write > - * iosize. > - * > * If we don't have a user specified preallocation size, dynamically increase > * the preallocation size as the size of the file grows. Cap the maximum size > * at a single extent or less if the filesystem is near full. The closer the > - * filesystem is to full, the smaller the maximum prealocation. > - * > - * As an exception we don't do any preallocation at all if the file is smaller > - * than the minimum preallocation and we are using the default dynamic > - * preallocation scheme, as it is likely this is the only write to the file that > - * is going to be done. > - * > - * We clean up any extra space left over when the file is closed in > - * xfs_inactive(). > + * filesystem is to full, the smaller the maximum preallocation. > */ > STATIC xfs_fsblock_t > xfs_iomap_prealloc_size( > @@ -380,52 +368,58 @@ xfs_iomap_prealloc_size( > struct xfs_mount *mp = ip->i_mount; > struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); > xfs_fileoff_t offset_fsb = XFS_B_TO_FSBT(mp, offset); > - struct xfs_bmbt_irec prev; > + struct xfs_iext_cursor ncur = *icur; > + struct xfs_bmbt_irec prev, got; > int shift = 0; > int64_t freesp; > xfs_fsblock_t qblocks; > int qshift = 0; > - xfs_fsblock_t alloc_blocks = 0; > + xfs_fsblock_t alloc_blocks; > + xfs_extlen_t plen; > > - if (offset + count <= XFS_ISIZE(ip)) > - return 0; > - > - if (!(mp->m_flags & XFS_MOUNT_ALLOCSIZE) && > - (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks))) > + /* > + * As an exception we don't do any preallocation at all if the file is > + * smaller than the minimum preallocation and we are using the default > + * dynamic preallocation scheme, as it is likely this is the only write > + * to the file that is going to be done. > + */ > + if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_allocsize_blocks)) > return 0; > > /* > - * If an explicit allocsize is set, the file is small, or we > - * are writing behind a hole, then use the minimum prealloc: > + * Otherwise use the minimum prealloca size for small files, or if we "preallocation"? > + * are writing right after a hole. > */ > - if ((mp->m_flags & XFS_MOUNT_ALLOCSIZE) || > - XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || > - !xfs_iext_peek_prev_extent(ifp, icur, &prev) || > + if (XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, mp->m_dalign) || > + !xfs_iext_prev_extent(ifp, &ncur, &prev) || > prev.br_startoff + prev.br_blockcount < offset_fsb) > return mp->m_allocsize_blocks; > > /* > - * Determine the initial size of the preallocation. We are beyond the > - * current EOF here, but we need to take into account whether this is > - * a sparse write or an extending write when determining the > - * preallocation size. Hence we need to look up the extent that ends > - * at the current write offset and use the result to determine the > - * preallocation size. > - * > - * If the extent is a hole, then preallocation is essentially disabled. > - * Otherwise we take the size of the preceding data extent as the basis > - * for the preallocation size. If the size of the extent is greater than > - * half the maximum extent length, then use the current offset as the > - * basis. This ensures that for large files the preallocation size > - * always extends to MAXEXTLEN rather than falling short due to things > - * like stripe unit/width alignment of real extents. > + * Take the size of the contiguous preceding data extents as the basis > + * for the preallocation size. Note that we don't care if the previous > + * extents are written or not. > */ > - if (prev.br_blockcount <= (MAXEXTLEN >> 1)) > - alloc_blocks = prev.br_blockcount << 1; > - else > + plen = prev.br_blockcount; > + while (xfs_iext_prev_extent(ifp, &ncur, &got)) { > + if (plen > MAXEXTLEN / 2 || > + got.br_startoff + got.br_blockcount != prev.br_startoff || > + got.br_startblock + got.br_blockcount != prev.br_startblock) > + break; > + plen += got.br_blockcount; > + prev = got; > + } > + > + /* > + * If the size of the extents is greater than half the maximum extent > + * length, then use the current offset as the basis. This ensures that > + * for large files the preallocation size always extends to MAXEXTLEN > + * rather than falling short due to things like stripe unit/width > + * alignment of real extents. > + */ > + alloc_blocks = plen * 2; > + if (alloc_blocks > MAXEXTLEN) > alloc_blocks = XFS_B_TO_FSB(mp, offset); > - if (!alloc_blocks) > - goto check_writeio; > qblocks = alloc_blocks; > > /* > @@ -494,7 +488,6 @@ xfs_iomap_prealloc_size( > */ > while (alloc_blocks && alloc_blocks >= freesp) > alloc_blocks >>= 4; > -check_writeio: > if (alloc_blocks < mp->m_allocsize_blocks) > alloc_blocks = mp->m_allocsize_blocks; > trace_xfs_iomap_prealloc_size(ip, alloc_blocks, shift, > @@ -961,9 +954,16 @@ xfs_buffered_write_iomap_begin( > if (error) > goto out_unlock; > > - if (eof) { > - prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, offset, > - count, &icur); > + if (eof && offset + count > XFS_ISIZE(ip)) { > + /* > + * Determine the initial size of the preallocation. > + * We clean up any extra preallocation when the file is closed. > + */ > + if (mp->m_flags & XFS_MOUNT_ALLOCSIZE) > + prealloc_blocks = mp->m_allocsize_blocks; > + else > + prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, > + offset, count, &icur); I'm not sure how much we're really gaining from moving the MOUNT_ALLOCSIZE check out to the caller, but I don't feel all that passionate about this. --D > if (prealloc_blocks) { > xfs_extlen_t align; > xfs_off_t end_offset; ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-20 21:17 ` Darrick J. Wong @ 2020-05-21 9:31 ` Christoph Hellwig 2020-05-21 17:19 ` Darrick J. Wong 0 siblings, 1 reply; 15+ messages in thread From: Christoph Hellwig @ 2020-05-21 9:31 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, bfoster On Wed, May 20, 2020 at 02:17:16PM -0700, Darrick J. Wong wrote: > On Tue, May 19, 2020 at 05:54:37AM -0700, Christoph Hellwig wrote: > > The actual logic looks good, but I think the new helper and another > > third set of comment explaining what is going on makes this area even > > more confusing. What about something like this instead? > > This seems reasonable, but the callsite cleanups ought to be a separate > patch from the behavior change. Do you want me to send prep patches, or do you want to split it our yourself? > > + if (eof && offset + count > XFS_ISIZE(ip)) { > > + /* > > + * Determine the initial size of the preallocation. > > + * We clean up any extra preallocation when the file is closed. > > + */ > > + if (mp->m_flags & XFS_MOUNT_ALLOCSIZE) > > + prealloc_blocks = mp->m_allocsize_blocks; > > + else > > + prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, > > + offset, count, &icur); > > I'm not sure how much we're really gaining from moving the > MOUNT_ALLOCSIZE check out to the caller, but I don't feel all that > passionate about this. From the pure code stats point of view it doensn't matter. But from the software architecture POV it does - now xfs_iomap_prealloc_size contains the dynamic prealloc size algorithm, while the hard coded case is handled in the caller. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size 2020-05-21 9:31 ` Christoph Hellwig @ 2020-05-21 17:19 ` Darrick J. Wong 0 siblings, 0 replies; 15+ messages in thread From: Darrick J. Wong @ 2020-05-21 17:19 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs, bfoster On Thu, May 21, 2020 at 02:31:40AM -0700, Christoph Hellwig wrote: > On Wed, May 20, 2020 at 02:17:16PM -0700, Darrick J. Wong wrote: > > On Tue, May 19, 2020 at 05:54:37AM -0700, Christoph Hellwig wrote: > > > The actual logic looks good, but I think the new helper and another > > > third set of comment explaining what is going on makes this area even > > > more confusing. What about something like this instead? > > > > This seems reasonable, but the callsite cleanups ought to be a separate > > patch from the behavior change. > > Do you want me to send prep patches, or do you want to split it our > yourself? I split them already, so I'll send v2 shortly. > > > + if (eof && offset + count > XFS_ISIZE(ip)) { > > > + /* > > > + * Determine the initial size of the preallocation. > > > + * We clean up any extra preallocation when the file is closed. > > > + */ > > > + if (mp->m_flags & XFS_MOUNT_ALLOCSIZE) > > > + prealloc_blocks = mp->m_allocsize_blocks; > > > + else > > > + prealloc_blocks = xfs_iomap_prealloc_size(ip, allocfork, > > > + offset, count, &icur); > > > > I'm not sure how much we're really gaining from moving the > > MOUNT_ALLOCSIZE check out to the caller, but I don't feel all that > > passionate about this. > > From the pure code stats point of view it doensn't matter. But from the > software architecture POV it does - now xfs_iomap_prealloc_size contains > the dynamic prealloc size algorithm, while the hard coded case is > handled in the caller. <nod> --D ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2020-05-21 17:19 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-19 0:49 [PATCH 0/3] xfs: fix stale disk exposure after crash Darrick J. Wong 2020-05-19 0:49 ` [PATCH 1/3] xfs: force writes to delalloc regions to unwritten Darrick J. Wong 2020-05-19 12:45 ` Brian Foster 2020-05-19 0:49 ` [PATCH 2/3] xfs: don't fail unwritten extent conversion on writeback due to edquot Darrick J. Wong 2020-05-19 7:13 ` Christoph Hellwig 2020-05-19 12:46 ` Brian Foster 2020-05-19 0:49 ` [PATCH 3/3] xfs: measure all contiguous previous extents for prealloc size Darrick J. Wong 2020-05-19 12:48 ` Brian Foster 2020-05-20 13:23 ` Brian Foster 2020-05-20 19:48 ` Darrick J. Wong 2020-05-21 12:24 ` Brian Foster 2020-05-19 12:54 ` Christoph Hellwig 2020-05-20 21:17 ` Darrick J. Wong 2020-05-21 9:31 ` Christoph Hellwig 2020-05-21 17:19 ` Darrick J. Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).