* [PATCH 0/3] xfs: hold ilock across insert and collapse range @ 2019-12-13 17:12 Brian Foster 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw) To: linux-xfs Hi all, This is a followup to the recent bugfix I sent on collapse range. Dave suggested that insert/collapse should probably be atomic wrt ilock, so this series reworks both operations appropriately. This survives a couple of fstests runs and I'll be running an fsx test over the weekend... Brian Brian Foster (3): xfs: open code insert range extent split helper xfs: rework insert range into an atomic operation xfs: rework collapse range into an atomic operation fs/xfs/libxfs/xfs_bmap.c | 32 ++-------------------- fs/xfs/libxfs/xfs_bmap.h | 3 ++- fs/xfs/xfs_bmap_util.c | 57 ++++++++++++++++++++++------------------ 3 files changed, 36 insertions(+), 56 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/3] xfs: open code insert range extent split helper 2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster @ 2019-12-13 17:12 ` Brian Foster 2019-12-17 17:02 ` Allison Collins ` (2 more replies) 2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster 2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster 2 siblings, 3 replies; 23+ messages in thread From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw) To: linux-xfs The insert range operation currently splits the extent at the target offset in a separate transaction and lock cycle from the one that shifts extents. In preparation for reworking insert range into an atomic operation, lift the code into the caller so it can be easily condensed to a single rolling transaction and lock cycle and eliminate the helper. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------ fs/xfs/libxfs/xfs_bmap.h | 3 ++- fs/xfs/xfs_bmap_util.c | 14 +++++++++++++- 3 files changed, 17 insertions(+), 32 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index a9ad1f991ba3..2bba0f983e4f 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents( * @split_fsb is a block where the extents is split. If split_fsb lies in a * hole or the first block of extents, just return 0. 
*/ -STATIC int -xfs_bmap_split_extent_at( +int +xfs_bmap_split_extent( struct xfs_trans *tp, struct xfs_inode *ip, xfs_fileoff_t split_fsb) @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at( return error; } -int -xfs_bmap_split_extent( - struct xfs_inode *ip, - xfs_fileoff_t split_fsb) -{ - struct xfs_mount *mp = ip->i_mount; - struct xfs_trans *tp; - int error; - - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, - XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); - if (error) - return error; - - xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); - - error = xfs_bmap_split_extent_at(tp, ip, split_fsb); - if (error) - goto out; - - return xfs_trans_commit(tp); - -out: - xfs_trans_cancel(tp); - return error; -} - /* Deferred mapping is only for real extents in the data fork. */ static bool xfs_bmap_is_update_needed( diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index 14d25e0b7d9c..f3259ad5c22c 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -222,7 +222,8 @@ int xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off, int xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip, xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb, bool *done, xfs_fileoff_t stop_fsb); -int xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset); +int xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip, + xfs_fileoff_t split_offset); int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork, xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc, struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur, diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 2efd78a9719e..829ab1a804c9 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1139,7 +1139,19 @@ xfs_insert_file_space( * is not the starting block of extent, we need to split the extent at * stop_fsb. 
*/ - error = xfs_bmap_split_extent(ip, stop_fsb); + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, + XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); + if (error) + return error; + + xfs_ilock(ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); + + error = xfs_bmap_split_extent(tp, ip, stop_fsb); + if (error) + goto out_trans_cancel; + + error = xfs_trans_commit(tp); if (error) return error; -- 2.20.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] xfs: open code insert range extent split helper 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster @ 2019-12-17 17:02 ` Allison Collins 2019-12-17 22:21 ` Darrick J. Wong 2019-12-18 2:16 ` Darrick J. Wong 2019-12-24 11:18 ` Christoph Hellwig 2 siblings, 1 reply; 23+ messages in thread From: Allison Collins @ 2019-12-17 17:02 UTC (permalink / raw) To: Brian Foster, linux-xfs On 12/13/19 10:12 AM, Brian Foster wrote: > The insert range operation currently splits the extent at the target > offset in a separate transaction and lock cycle from the one that > shifts extents. In preparation for reworking insert range into an > atomic operation, lift the code into the caller so it can be easily > condensed to a single rolling transaction and lock cycle and > eliminate the helper. No functional changes. > > Signed-off-by: Brian Foster <bfoster@redhat.com> Looks ok to me. Reviewed by: Allison Collins <allison.henderson@oracle.com> > --- > fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------ > fs/xfs/libxfs/xfs_bmap.h | 3 ++- > fs/xfs/xfs_bmap_util.c | 14 +++++++++++++- > 3 files changed, 17 insertions(+), 32 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index a9ad1f991ba3..2bba0f983e4f 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents( > * @split_fsb is a block where the extents is split. If split_fsb lies in a > * hole or the first block of extents, just return 0. 
> */ > -STATIC int > -xfs_bmap_split_extent_at( > +int > +xfs_bmap_split_extent( > struct xfs_trans *tp, > struct xfs_inode *ip, > xfs_fileoff_t split_fsb) > @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at( > return error; > } > > -int > -xfs_bmap_split_extent( > - struct xfs_inode *ip, > - xfs_fileoff_t split_fsb) > -{ > - struct xfs_mount *mp = ip->i_mount; > - struct xfs_trans *tp; > - int error; > - > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > - XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > - if (error) > - return error; > - > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > - > - error = xfs_bmap_split_extent_at(tp, ip, split_fsb); > - if (error) > - goto out; > - > - return xfs_trans_commit(tp); > - > -out: > - xfs_trans_cancel(tp); > - return error; > -} > - > /* Deferred mapping is only for real extents in the data fork. */ > static bool > xfs_bmap_is_update_needed( > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h > index 14d25e0b7d9c..f3259ad5c22c 100644 > --- a/fs/xfs/libxfs/xfs_bmap.h > +++ b/fs/xfs/libxfs/xfs_bmap.h > @@ -222,7 +222,8 @@ int xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off, > int xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip, > xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb, > bool *done, xfs_fileoff_t stop_fsb); > -int xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset); > +int xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip, > + xfs_fileoff_t split_offset); > int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork, > xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc, > struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur, > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 2efd78a9719e..829ab1a804c9 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1139,7 +1139,19 @@ xfs_insert_file_space( > * is not the starting block of 
extent, we need to split the extent at > * stop_fsb. > */ > - error = xfs_bmap_split_extent(ip, stop_fsb); > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > + XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > + if (error) > + return error; > + > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + > + error = xfs_bmap_split_extent(tp, ip, stop_fsb); > + if (error) > + goto out_trans_cancel; > + > + error = xfs_trans_commit(tp); > if (error) > return error; > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] xfs: open code insert range extent split helper 2019-12-17 17:02 ` Allison Collins @ 2019-12-17 22:21 ` Darrick J. Wong 0 siblings, 0 replies; 23+ messages in thread From: Darrick J. Wong @ 2019-12-17 22:21 UTC (permalink / raw) To: Allison Collins; +Cc: Brian Foster, linux-xfs On Tue, Dec 17, 2019 at 10:02:42AM -0700, Allison Collins wrote: > On 12/13/19 10:12 AM, Brian Foster wrote: > > The insert range operation currently splits the extent at the target > > offset in a separate transaction and lock cycle from the one that > > shifts extents. In preparation for reworking insert range into an > > atomic operation, lift the code into the caller so it can be easily > > condensed to a single rolling transaction and lock cycle and > > eliminate the helper. No functional changes. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > Looks ok to me. > > Reviewed by: Allison Collins <allison.henderson@oracle.com> "Reviewed-by", with dash? --D > > --- > > fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------ > > fs/xfs/libxfs/xfs_bmap.h | 3 ++- > > fs/xfs/xfs_bmap_util.c | 14 +++++++++++++- > > 3 files changed, 17 insertions(+), 32 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > > index a9ad1f991ba3..2bba0f983e4f 100644 > > --- a/fs/xfs/libxfs/xfs_bmap.c > > +++ b/fs/xfs/libxfs/xfs_bmap.c > > @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents( > > * @split_fsb is a block where the extents is split. If split_fsb lies in a > > * hole or the first block of extents, just return 0. 
> > */ > > -STATIC int > > -xfs_bmap_split_extent_at( > > +int > > +xfs_bmap_split_extent( > > struct xfs_trans *tp, > > struct xfs_inode *ip, > > xfs_fileoff_t split_fsb) > > @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at( > > return error; > > } > > -int > > -xfs_bmap_split_extent( > > - struct xfs_inode *ip, > > - xfs_fileoff_t split_fsb) > > -{ > > - struct xfs_mount *mp = ip->i_mount; > > - struct xfs_trans *tp; > > - int error; > > - > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > - XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > - if (error) > > - return error; > > - > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > - > > - error = xfs_bmap_split_extent_at(tp, ip, split_fsb); > > - if (error) > > - goto out; > > - > > - return xfs_trans_commit(tp); > > - > > -out: > > - xfs_trans_cancel(tp); > > - return error; > > -} > > - > > /* Deferred mapping is only for real extents in the data fork. */ > > static bool > > xfs_bmap_is_update_needed( > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h > > index 14d25e0b7d9c..f3259ad5c22c 100644 > > --- a/fs/xfs/libxfs/xfs_bmap.h > > +++ b/fs/xfs/libxfs/xfs_bmap.h > > @@ -222,7 +222,8 @@ int xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off, > > int xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip, > > xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb, > > bool *done, xfs_fileoff_t stop_fsb); > > -int xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset); > > +int xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip, > > + xfs_fileoff_t split_offset); > > int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork, > > xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc, > > struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur, > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > index 2efd78a9719e..829ab1a804c9 100644 > > --- a/fs/xfs/xfs_bmap_util.c 
> > +++ b/fs/xfs/xfs_bmap_util.c > > @@ -1139,7 +1139,19 @@ xfs_insert_file_space( > > * is not the starting block of extent, we need to split the extent at > > * stop_fsb. > > */ > > - error = xfs_bmap_split_extent(ip, stop_fsb); > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > + XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > + if (error) > > + return error; > > + > > + xfs_ilock(ip, XFS_ILOCK_EXCL); > > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > + > > + error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > + if (error) > > + goto out_trans_cancel; > > + > > + error = xfs_trans_commit(tp); > > if (error) > > return error; > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] xfs: open code insert range extent split helper 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster 2019-12-17 17:02 ` Allison Collins @ 2019-12-18 2:16 ` Darrick J. Wong 2019-12-24 11:18 ` Christoph Hellwig 2 siblings, 0 replies; 23+ messages in thread From: Darrick J. Wong @ 2019-12-18 2:16 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Dec 13, 2019 at 12:12:56PM -0500, Brian Foster wrote: > The insert range operation currently splits the extent at the target > offset in a separate transaction and lock cycle from the one that > shifts extents. In preparation for reworking insert range into an > atomic operation, lift the code into the caller so it can be easily > condensed to a single rolling transaction and lock cycle and > eliminate the helper. No functional changes. > > Signed-off-by: Brian Foster <bfoster@redhat.com> Looks ok, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > fs/xfs/libxfs/xfs_bmap.c | 32 ++------------------------------ > fs/xfs/libxfs/xfs_bmap.h | 3 ++- > fs/xfs/xfs_bmap_util.c | 14 +++++++++++++- > 3 files changed, 17 insertions(+), 32 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index a9ad1f991ba3..2bba0f983e4f 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -6021,8 +6021,8 @@ xfs_bmap_insert_extents( > * @split_fsb is a block where the extents is split. If split_fsb lies in a > * hole or the first block of extents, just return 0. 
> */ > -STATIC int > -xfs_bmap_split_extent_at( > +int > +xfs_bmap_split_extent( > struct xfs_trans *tp, > struct xfs_inode *ip, > xfs_fileoff_t split_fsb) > @@ -6138,34 +6138,6 @@ xfs_bmap_split_extent_at( > return error; > } > > -int > -xfs_bmap_split_extent( > - struct xfs_inode *ip, > - xfs_fileoff_t split_fsb) > -{ > - struct xfs_mount *mp = ip->i_mount; > - struct xfs_trans *tp; > - int error; > - > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > - XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > - if (error) > - return error; > - > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > - > - error = xfs_bmap_split_extent_at(tp, ip, split_fsb); > - if (error) > - goto out; > - > - return xfs_trans_commit(tp); > - > -out: > - xfs_trans_cancel(tp); > - return error; > -} > - > /* Deferred mapping is only for real extents in the data fork. */ > static bool > xfs_bmap_is_update_needed( > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h > index 14d25e0b7d9c..f3259ad5c22c 100644 > --- a/fs/xfs/libxfs/xfs_bmap.h > +++ b/fs/xfs/libxfs/xfs_bmap.h > @@ -222,7 +222,8 @@ int xfs_bmap_can_insert_extents(struct xfs_inode *ip, xfs_fileoff_t off, > int xfs_bmap_insert_extents(struct xfs_trans *tp, struct xfs_inode *ip, > xfs_fileoff_t *next_fsb, xfs_fileoff_t offset_shift_fsb, > bool *done, xfs_fileoff_t stop_fsb); > -int xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset); > +int xfs_bmap_split_extent(struct xfs_trans *tp, struct xfs_inode *ip, > + xfs_fileoff_t split_offset); > int xfs_bmapi_reserve_delalloc(struct xfs_inode *ip, int whichfork, > xfs_fileoff_t off, xfs_filblks_t len, xfs_filblks_t prealloc, > struct xfs_bmbt_irec *got, struct xfs_iext_cursor *cur, > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 2efd78a9719e..829ab1a804c9 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1139,7 +1139,19 @@ xfs_insert_file_space( > * is not the starting block of 
extent, we need to split the extent at > * stop_fsb. > */ > - error = xfs_bmap_split_extent(ip, stop_fsb); > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > + XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > + if (error) > + return error; > + > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + > + error = xfs_bmap_split_extent(tp, ip, stop_fsb); > + if (error) > + goto out_trans_cancel; > + > + error = xfs_trans_commit(tp); > if (error) > return error; > > -- > 2.20.1 > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/3] xfs: open code insert range extent split helper 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster 2019-12-17 17:02 ` Allison Collins 2019-12-18 2:16 ` Darrick J. Wong @ 2019-12-24 11:18 ` Christoph Hellwig 2 siblings, 0 replies; 23+ messages in thread From: Christoph Hellwig @ 2019-12-24 11:18 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Dec 13, 2019 at 12:12:56PM -0500, Brian Foster wrote: > The insert range operation currently splits the extent at the target > offset in a separate transaction and lock cycle from the one that > shifts extents. In preparation for reworking insert range into an > atomic operation, lift the code into the caller so it can be easily > condensed to a single rolling transaction and lock cycle and > eliminate the helper. No functional changes. The helper already is rather questionable as-is, so even without looking at the following patches: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster @ 2019-12-13 17:12 ` Brian Foster 2019-12-17 17:04 ` Allison Collins ` (2 more replies) 2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster 2 siblings, 3 replies; 23+ messages in thread From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw) To: linux-xfs The insert range operation uses a unique transaction and ilock cycle for the extent split and each extent shift iteration of the overall operation. While this works, it risks racing with other operations in subtle ways such as COW writeback modifying an extent tree in the middle of a shift operation. To avoid this problem, make insert range atomic with respect to ilock. Hold the ilock across the entire operation, replace the individual transactions with a single rolling transaction sequence and relog the inode to keep it moving in the log. This guarantees that nothing else can change the extent mapping of an inode while an insert range operation is in progress. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- 1 file changed, 13 insertions(+), 19 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 829ab1a804c9..555c8b49a223 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1134,47 +1134,41 @@ xfs_insert_file_space( if (error) return error; - /* - * The extent shifting code works on extent granularity. So, if stop_fsb - * is not the starting block of extent, we need to split the extent at - * stop_fsb. 
- */ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); if (error) return error; xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, ip, 0); + /* + * The extent shifting code works on extent granularity. So, if stop_fsb + * is not the starting block of extent, we need to split the extent at + * stop_fsb. + */ error = xfs_bmap_split_extent(tp, ip, stop_fsb); if (error) goto out_trans_cancel; - error = xfs_trans_commit(tp); - if (error) - return error; - - while (!error && !done) { - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, - &tp); + do { + error = xfs_trans_roll_inode(&tp, ip); if (error) - break; + goto out_trans_cancel; - xfs_ilock(ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, &done, stop_fsb); if (error) goto out_trans_cancel; + } while (!done); - error = xfs_trans_commit(tp); - } - + error = xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; out_trans_cancel: xfs_trans_cancel(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
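[Editorial note: distilled from the hunks above, the control-flow change in this patch can be sketched as follows. This is schematic kernel-style pseudocode, not a compilable excerpt; arguments are abbreviated and error paths omitted.]

```
/*
 * Before: a fresh transaction and ilock cycle per shift iteration.
 * The inode is unlocked between iterations, so e.g. COW writeback
 * can modify the extent tree in the middle of the overall operation.
 */
while (!done) {
	xfs_trans_alloc(..., &tp);
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);	/* commit drops ilock */
	xfs_bmap_insert_extents(tp, ip, ..., &done, stop_fsb);
	xfs_trans_commit(tp);
}

/*
 * After: one ilock held across the whole operation. The single
 * transaction is rolled (committed and renewed without releasing the
 * lock) each iteration to keep the relogged inode moving in the log.
 */
xfs_trans_alloc(..., &tp);
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, ip, 0);		/* 0: lock not released on commit */
xfs_bmap_split_extent(tp, ip, stop_fsb);
do {
	xfs_trans_roll_inode(&tp, ip);
	xfs_bmap_insert_extents(tp, ip, ..., &done, stop_fsb);
} while (!done);
xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
```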
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster @ 2019-12-17 17:04 ` Allison Collins 2019-12-18 2:37 ` Darrick J. Wong 2019-12-24 11:26 ` Christoph Hellwig 2 siblings, 0 replies; 23+ messages in thread From: Allison Collins @ 2019-12-17 17:04 UTC (permalink / raw) To: Brian Foster, linux-xfs On 12/13/19 10:12 AM, Brian Foster wrote: > The insert range operation uses a unique transaction and ilock cycle > for the extent split and each extent shift iteration of the overall > operation. While this works, it is risks racing with other > operations in subtle ways such as COW writeback modifying an extent > tree in the middle of a shift operation. > > To avoid this problem, make insert range atomic with respect to > ilock. Hold the ilock across the entire operation, replace the > individual transactions with a single rolling transaction sequence > and relog the inode to keep it moving in the log. This guarantees > that nothing else can change the extent mapping of an inode while > an insert range operation is in progress. > > Signed-off-by: Brian Foster <bfoster@redhat.com> Ok, this looks good to me. Thanks! Reviewed by: Allison Collins <allison.henderson@oracle.com> > --- > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > 1 file changed, 13 insertions(+), 19 deletions(-) > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 829ab1a804c9..555c8b49a223 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > if (error) > return error; > > - /* > - * The extent shifting code works on extent granularity. So, if stop_fsb > - * is not the starting block of extent, we need to split the extent at > - * stop_fsb. 
> - */ > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > if (error) > return error; > > xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, 0); > > + /* > + * The extent shifting code works on extent granularity. So, if stop_fsb > + * is not the starting block of extent, we need to split the extent at > + * stop_fsb. > + */ > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > if (error) > goto out_trans_cancel; > > - error = xfs_trans_commit(tp); > - if (error) > - return error; > - > - while (!error && !done) { > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > - &tp); > + do { > + error = xfs_trans_roll_inode(&tp, ip); > if (error) > - break; > + goto out_trans_cancel; > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > &done, stop_fsb); > if (error) > goto out_trans_cancel; > + } while (!done); > > - error = xfs_trans_commit(tp); > - } > - > + error = xfs_trans_commit(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > > out_trans_cancel: > xfs_trans_cancel(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > } > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster 2019-12-17 17:04 ` Allison Collins @ 2019-12-18 2:37 ` Darrick J. Wong 2019-12-18 12:10 ` Brian Foster 2019-12-24 11:26 ` Christoph Hellwig 2 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2019-12-18 2:37 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > The insert range operation uses a unique transaction and ilock cycle > for the extent split and each extent shift iteration of the overall > operation. While this works, it is risks racing with other > operations in subtle ways such as COW writeback modifying an extent > tree in the middle of a shift operation. > > To avoid this problem, make insert range atomic with respect to > ilock. Hold the ilock across the entire operation, replace the > individual transactions with a single rolling transaction sequence > and relog the inode to keep it moving in the log. This guarantees > that nothing else can change the extent mapping of an inode while > an insert range operation is in progress. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > 1 file changed, 13 insertions(+), 19 deletions(-) > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 829ab1a804c9..555c8b49a223 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > if (error) > return error; > > - /* > - * The extent shifting code works on extent granularity. So, if stop_fsb > - * is not the starting block of extent, we need to split the extent at > - * stop_fsb. 
> - */ > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > if (error) > return error; > > xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, 0); > > + /* > + * The extent shifting code works on extent granularity. So, if stop_fsb > + * is not the starting block of extent, we need to split the extent at > + * stop_fsb. > + */ > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > if (error) > goto out_trans_cancel; > > - error = xfs_trans_commit(tp); > - if (error) > - return error; > - > - while (!error && !done) { > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > - &tp); I'm a little concerned about the livelock potential here, if there are a lot of other threads that have eaten all the transaction reservation and are trying to get our ILOCK, while at the same time this thread has the ILOCK and is trying to roll the transaction to move another extent, having already rolled the transaction more than logcount times. I think the extent shifting loop starts with the highest offset mapping and shifts it up and continues in order of decreasing offset until it gets to @stop_fsb, correct? Can we use "alloc trans; ilock; move; commit" for every extent higher than the one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; insert_extents; commit" to deal with that one extent that crosses @stop_fsb? tr_write pre-reserves enough space so that the roll won't need to get more, which would eliminate that potential problem, I think. 
--D > + do { > + error = xfs_trans_roll_inode(&tp, ip); > if (error) > - break; > + goto out_trans_cancel; > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > &done, stop_fsb); > if (error) > goto out_trans_cancel; > + } while (!done); > > - error = xfs_trans_commit(tp); > - } > - > + error = xfs_trans_commit(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > > out_trans_cancel: > xfs_trans_cancel(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > } > > -- > 2.20.1 > ^ permalink raw reply [flat|nested] 23+ messages in thread
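[Editorial note: the alternative Darrick sketches above would look roughly like the following. This is schematic pseudocode of the suggestion for clarity only, not code from this series; in particular, driving the shift one mapping per transaction is an assumed structure, and arguments and error handling are omitted.]

```
/*
 * Per-extent transactions for every mapping strictly above stop_fsb:
 * each cycle gets a full tr_write reservation, so no long-lived roll
 * can outrun its log count.
 */
while (/* next mapping does not cross stop_fsb */) {
	xfs_trans_alloc(mp, &M_RES(mp)->tr_write, ..., &tp);
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
	/* shift this one mapping up by shift_fsb */
	xfs_trans_commit(tp);
}

/*
 * One bounded transaction for the extent crossing stop_fsb: split,
 * roll once (the pre-reserved tr_write space covers it), then insert.
 */
xfs_trans_alloc(mp, &M_RES(mp)->tr_write, ..., &tp);
xfs_ilock(ip, XFS_ILOCK_EXCL);
xfs_trans_ijoin(tp, ip, 0);
xfs_bmap_split_extent(tp, ip, stop_fsb);
xfs_trans_roll_inode(&tp, ip);
xfs_bmap_insert_extents(tp, ip, ..., &done, stop_fsb);
xfs_trans_commit(tp);
xfs_iunlock(ip, XFS_ILOCK_EXCL);
```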
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-18 2:37 ` Darrick J. Wong @ 2019-12-18 12:10 ` Brian Foster 2019-12-18 21:15 ` Darrick J. Wong 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-12-18 12:10 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote: > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > > The insert range operation uses a unique transaction and ilock cycle > > for the extent split and each extent shift iteration of the overall > > operation. While this works, it is risks racing with other > > operations in subtle ways such as COW writeback modifying an extent > > tree in the middle of a shift operation. > > > > To avoid this problem, make insert range atomic with respect to > > ilock. Hold the ilock across the entire operation, replace the > > individual transactions with a single rolling transaction sequence > > and relog the inode to keep it moving in the log. This guarantees > > that nothing else can change the extent mapping of an inode while > > an insert range operation is in progress. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > > 1 file changed, 13 insertions(+), 19 deletions(-) > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > index 829ab1a804c9..555c8b49a223 100644 > > --- a/fs/xfs/xfs_bmap_util.c > > +++ b/fs/xfs/xfs_bmap_util.c > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > > if (error) > > return error; > > > > - /* > > - * The extent shifting code works on extent granularity. So, if stop_fsb > > - * is not the starting block of extent, we need to split the extent at > > - * stop_fsb. 
> > - */ > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > if (error) > > return error; > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > + xfs_trans_ijoin(tp, ip, 0); > > > > + /* > > + * The extent shifting code works on extent granularity. So, if stop_fsb > > + * is not the starting block of extent, we need to split the extent at > > + * stop_fsb. > > + */ > > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > if (error) > > goto out_trans_cancel; > > > > - error = xfs_trans_commit(tp); > > - if (error) > > - return error; > > - > > - while (!error && !done) { > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > > - &tp); > > I'm a little concerned about the livelock potential here, if there are a lot of > other threads that have eaten all the transaction reservation and are trying to > get our ILOCK, while at the same time this thread has the ILOCK and is trying > to roll the transaction to move another extent, having already rolled the > transaction more than logcount times. > My understanding is that the regrant mechanism is intended to deal with that scenario. Even after the initial (res * logcount) reservation is consumed, a regrant is different from an initial reservation in that the reservation head is unconditionally updated with a new reservation unit. We do wait on the write head in regrant, but IIUC that should be limited to the pool of already allocated transactions (and is woken before the reserve head waiters IIRC). I suppose something like this might be possible in theory if we were blocked on regrant and the entirety of remaining log space was consumed by transactions waiting on our ilock, but I think that is highly unlikely since we also hold the iolock here. 
Also note that this approach is based on the current truncate algorithm, which is probably a better barometer of potential for this kind of issue as it is a less specialized operation. I'd argue that if this is safe enough for truncate, it should be safe enough for range shifting. > I think the extent shifting loop starts with the highest offset mapping and > shifts it up and continues in order of decreasing offset until it gets to > @stop_fsb, correct? > Yep, for insert range at least. > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; > insert_extents; commit" to deal with that one extent that crosses @stop_fsb? > tr_write pre-reserves enough space so that the roll won't need to get more, > which would eliminate that potential problem, I think. > We'd have to reorder the extent split for that kind of approach, which I think you've noted in the sequence above, as the race window is between the split and subsequent shift. Otherwise I think that would work. That said, I'd prefer not to introduce the extra complexity and functional variance unless it were absolutely necessary, and it's not clear to me that it is. If it is, we'd probably have seen similar issues in truncate and should target a fix there before worrying about range shift. 
Brian > --D > > > + do { > > + error = xfs_trans_roll_inode(&tp, ip); > > if (error) > > - break; > > + goto out_trans_cancel; > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > > &done, stop_fsb); > > if (error) > > goto out_trans_cancel; > > + } while (!done); > > > > - error = xfs_trans_commit(tp); > > - } > > - > > + error = xfs_trans_commit(tp); > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > return error; > > > > out_trans_cancel: > > xfs_trans_cancel(tp); > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > return error; > > } > > > > -- > > 2.20.1 > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
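Brian's description of regrant above — an initial (res * logcount) reservation taken up front, with rolls past logcount taking exactly one more unit via the write head instead of re-queuing behind brand-new reservations — can be sketched as a toy userspace model. Everything here (the toy_* names, the unit accounting, the numbers) is invented for illustration and deliberately ignores tail pushing and waiter queues; it is not the real xlog grant code:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_log {
	int space_left;		/* free log grant space, arbitrary units */
};

struct toy_trans {
	int unit_res;		/* reservation per transaction unit */
	int units;		/* pre-reserved units remaining (logcount) */
};

/*
 * Initial reservation: grab res * logcount up front, as xfs_trans_alloc()
 * does.  A caller that cannot get the full amount would sleep on the
 * reserve head; in this toy model we just fail.
 */
static bool toy_trans_alloc(struct toy_log *log, struct toy_trans *tp,
			    int unit_res, int logcount)
{
	if (log->space_left < unit_res * logcount)
		return false;
	log->space_left -= unit_res * logcount;
	tp->unit_res = unit_res;
	tp->units = logcount;
	return true;
}

/*
 * Rolling: burn one pre-reserved unit if any remain; once the initial
 * logcount is consumed, regrant exactly one more unit.  A real roller
 * waits on the write head, which is serviced ahead of threads queued
 * for new reservations; space taken by earlier rolls is freed as the
 * log tail moves, which this model ignores.
 */
static bool toy_trans_roll(struct toy_log *log, struct toy_trans *tp)
{
	if (tp->units > 1) {
		tp->units--;
		return true;
	}
	if (log->space_left < tp->unit_res)
		return false;
	log->space_left -= tp->unit_res;
	return true;
}

/* Final commit returns whatever is still reserved. */
static void toy_trans_commit(struct toy_log *log, struct toy_trans *tp)
{
	log->space_left += tp->unit_res * tp->units;
	tp->units = 0;
}
```

The point of the sketch is the asymmetry: once allocated, a roller only ever needs one unit at a time, while a fresh allocation needs the full res * logcount, so the long-running roll sequence keeps making progress even when new reservations would block.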
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-18 12:10 ` Brian Foster @ 2019-12-18 21:15 ` Darrick J. Wong 2019-12-19 11:55 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2019-12-18 21:15 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote: > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote: > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > > > The insert range operation uses a unique transaction and ilock cycle > > > for the extent split and each extent shift iteration of the overall > > > operation. While this works, it is risks racing with other > > > operations in subtle ways such as COW writeback modifying an extent > > > tree in the middle of a shift operation. > > > > > > To avoid this problem, make insert range atomic with respect to > > > ilock. Hold the ilock across the entire operation, replace the > > > individual transactions with a single rolling transaction sequence > > > and relog the inode to keep it moving in the log. This guarantees > > > that nothing else can change the extent mapping of an inode while > > > an insert range operation is in progress. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > > > 1 file changed, 13 insertions(+), 19 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > index 829ab1a804c9..555c8b49a223 100644 > > > --- a/fs/xfs/xfs_bmap_util.c > > > +++ b/fs/xfs/xfs_bmap_util.c > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > > > if (error) > > > return error; > > > > > > - /* > > > - * The extent shifting code works on extent granularity. So, if stop_fsb > > > - * is not the starting block of extent, we need to split the extent at > > > - * stop_fsb. 
> > > - */ > > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > > if (error) > > > return error; > > > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > + /* > > > + * The extent shifting code works on extent granularity. So, if stop_fsb > > > + * is not the starting block of extent, we need to split the extent at > > > + * stop_fsb. > > > + */ > > > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > > if (error) > > > goto out_trans_cancel; > > > > > > - error = xfs_trans_commit(tp); > > > - if (error) > > > - return error; > > > - > > > - while (!error && !done) { > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > > > - &tp); > > > > I'm a little concerned about the livelock potential here, if there are a lot of > > other threads that have eaten all the transaction reservation and are trying to > > get our ILOCK, while at the same time this thread has the ILOCK and is trying > > to roll the transaction to move another extent, having already rolled the > > transaction more than logcount times. > > > > My understanding is that the regrant mechanism is intended to deal with > that scenario. Even after the initial (res * logcount) reservation is > consumed, a regrant is different from an initial reservation in that the > reservation head is unconditionally updated with a new reservation unit. > We do wait on the write head in regrant, but IIUC that should be limited > to the pool of already allocated transactions (and is woken before the > reserve head waiters IIRC). I suppose something like this might be > possible in theory if we were blocked on regrant and the entirety of > remaining log space was consumed by transactions waiting on our ilock, > but I think that is highly unlikely since we also hold the iolock here. True. 
The only time I saw this happen was with buffered COW writeback completions (which hold locks other than ILOCK), which should have been fixed by the patch I made to put all the writeback items to a single inode queue and run /one/ worker thread to process them all. So maybe my fears are unfounded nowadays. :) The only other place I can think of that does a lot of transaction rolling on a single inode is online repair, and it always holds all three exclusive locks. > Also note that this approach is based on the current truncate algorithm, > which is probably a better barometer of potential for this kind of issue > as it is a less specialized operation. I'd argue that if this is safe > enough for truncate, it should be safe enough for range shifting. Hehehe. > > I think the extent shifting loop starts with the highest offset mapping and > > shifts it up and continues in order of decreasing offset until it gets to > > @stop_fsb, correct? > > > > Yep, for insert range at least. > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb? > > tr_write pre-reserves enough space so that the roll won't need to get more, > > which would eliminate that potential problem, I think. > > > > We'd have to reorder the extent split for that kind of approach, which I > > think you've noted in the sequence above, as the race window is between > > the split and subsequent shift. Otherwise I think that would work. > > > > That said, I'd prefer not to introduce the extra complexity and > > functional variance unless it were absolutely necessary, and it's not > > clear to me that it is. If it is, we'd probably have seen similar issues > > in truncate and should target a fix there before worrying about range > > shift. Ok. Looking back through lore I don't see any complaints about insert range, so I guess it's fine. 
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > Brian > > > --D > > > > > + do { > > > + error = xfs_trans_roll_inode(&tp, ip); > > > if (error) > > > - break; > > > + goto out_trans_cancel; > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > > > &done, stop_fsb); > > > if (error) > > > goto out_trans_cancel; > > > + } while (!done); > > > > > > - error = xfs_trans_commit(tp); > > > - } > > > - > > > + error = xfs_trans_commit(tp); > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > return error; > > > > > > out_trans_cancel: > > > xfs_trans_cancel(tp); > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > return error; > > > } > > > > > > -- > > > 2.20.1 > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
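The control-flow change under review — a single ilock cycle wrapping the extent split plus a roll-per-shift loop, instead of a transaction allocation and lock cycle per iteration — has roughly the following shape. The toy_* stubs below are invented stand-ins, not the XFS API; they exist only to show the loop structure from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for a rolling transaction; counts how often it rolled. */
struct toy_tp {
	int rolls;
};

/* Stand-in for xfs_trans_roll_inode(): relog and keep going. */
static int toy_roll(struct toy_tp *tp)
{
	tp->rolls++;
	return 0;
}

/*
 * Stand-in for one xfs_bmap_insert_extents() call: pretend each call
 * shifts one extent and sets *done once nothing remains.
 */
static int toy_shift(int *remaining, bool *done)
{
	if (--(*remaining) <= 0)
		*done = true;
	return 0;
}

/*
 * The reworked pattern: lock once (and split first, not shown), then
 * roll+shift until done, committing or cancelling before a single
 * unlock.  Nothing else can touch the extent tree in between.
 */
static int toy_insert_range(struct toy_tp *tp, int extents)
{
	bool done = false;
	int error = 0;

	/* xfs_ilock(ip, XFS_ILOCK_EXCL) + split happen here in real code */
	do {
		error = toy_roll(tp);	/* keep the inode moving in the log */
		if (error)
			break;
		error = toy_shift(&extents, &done);
		if (error)
			break;
	} while (!done);
	/* xfs_trans_commit()/cancel + xfs_iunlock() here in real code */
	return error;
}
```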
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-18 21:15 ` Darrick J. Wong @ 2019-12-19 11:55 ` Brian Foster 2019-12-20 20:17 ` Darrick J. Wong 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-12-19 11:55 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote: > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote: > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote: > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > > > > The insert range operation uses a unique transaction and ilock cycle > > > > for the extent split and each extent shift iteration of the overall > > > > operation. While this works, it is risks racing with other > > > > operations in subtle ways such as COW writeback modifying an extent > > > > tree in the middle of a shift operation. > > > > > > > > To avoid this problem, make insert range atomic with respect to > > > > ilock. Hold the ilock across the entire operation, replace the > > > > individual transactions with a single rolling transaction sequence > > > > and relog the inode to keep it moving in the log. This guarantees > > > > that nothing else can change the extent mapping of an inode while > > > > an insert range operation is in progress. > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > > > > 1 file changed, 13 insertions(+), 19 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > index 829ab1a804c9..555c8b49a223 100644 > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > > > > if (error) > > > > return error; > > > > > > > > - /* > > > > - * The extent shifting code works on extent granularity. 
So, if stop_fsb > > > > - * is not the starting block of extent, we need to split the extent at > > > > - * stop_fsb. > > > > - */ > > > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > > > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > > > if (error) > > > > return error; > > > > > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > > > + /* > > > > + * The extent shifting code works on extent granularity. So, if stop_fsb > > > > + * is not the starting block of extent, we need to split the extent at > > > > + * stop_fsb. > > > > + */ > > > > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > > > if (error) > > > > goto out_trans_cancel; > > > > > > > > - error = xfs_trans_commit(tp); > > > > - if (error) > > > > - return error; > > > > - > > > > - while (!error && !done) { > > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > > > > - &tp); > > > > > > I'm a little concerned about the livelock potential here, if there are a lot of > > > other threads that have eaten all the transaction reservation and are trying to > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying > > > to roll the transaction to move another extent, having already rolled the > > > transaction more than logcount times. > > > > > > > My understanding is that the regrant mechanism is intended to deal with > > that scenario. Even after the initial (res * logcount) reservation is > > consumed, a regrant is different from an initial reservation in that the > > reservation head is unconditionally updated with a new reservation unit. > > We do wait on the write head in regrant, but IIUC that should be limited > > to the pool of already allocated transactions (and is woken before the > > reserve head waiters IIRC). 
I suppose something like this might be > > possible in theory if we were blocked on regrant and the entirety of > > remaining log space was consumed by transactions waiting on our ilock, > > but I think that is highly unlikely since we also hold the iolock here. > > True. The only time I saw this happen was with buffered COW writeback > completions (which hold locks other than ILOCK), which should have been > fixed by the patch I made to put all the writeback items to a single > inode queue and run /one/ worker thread to process them all. So maybe > my fears are unfounded nowadays. :) > Ok. If I'm following correctly, that goes back to this[1] commit. The commit log describes a situation where we basically have unbound, concurrent attempts to do COW remappings on writeback completion. That leads to heavy ilock contention and the potential for deadlocks between log reservation and inode locks. Out of curiosity, was that old scheme bad enough to reproduce that kind of deadlock or was the deadlock description based on code analysis? I am a bit curious how "deadlock avoidant" the current transaction rolling/regrant mechanism is intended to be on its own vs. how much responsibility is borne by the implementation of certain operations (like truncate holding IOLOCK and assuming that sufficiently restricts object access). ISTM that it's technically possible for this lockup state to occur, but it relies on something unusual like the above where we allow enough unbound (open reservation -> lock) concurrency on a single metadata object to exhaust all log space. [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues") > The only other place I can think of that does a lot of transaction > rolling on a single inode is online repair, and it always holds all > three exclusive locks. 
> So I guess that should similarly hold off transaction reservation from userspace intended to modify the particular inode, provided the XFS internals play nice (re: your workqueue fix above). > > Also note that this approach is based on the current truncate algorithm, > > which is probably a better barometer of potential for this kind of issue > > as it is a less specialized operation. I'd argue that if this is safe > > enough for truncate, it should be safe enough for range shifting. > > Hehehe. > > > > I think the extent shifting loop starts with the highest offset mapping and > > > shifts it up and continues in order of decreasing offset until it gets to > > > @stop_fsb, correct? > > > > > > > Yep, for insert range at least. > > > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb? > > > tr_write pre-reserves enough space to that the roll won't need to get more, > > > which would eliminate that potential problem, I think. > > > > > > > We'd have to reorder the extent split for that kind of approach, which I > > think you've noted in the sequence above, as the race window is between > > the split and subsequent shift. Otherwise I think that would work. > > > > That said, I'd prefer not to introduce the extra complexity and > > functional variance unless it were absolutely necessary, and it's not > > clear to me that it is. If it is, we'd probably have seen similar issues > > in truncate and should target a fix there before worrying about range > > shift. > > Ok. Looking back through lore I don't see any complaints about insert > range, so I guess it's fine. > Ok, thanks. Brian > Reviewed-by: Darrick J. 
Wong <darrick.wong@oracle.com> > > --D > > > Brian > > > > > --D > > > > > > > + do { > > > > + error = xfs_trans_roll_inode(&tp, ip); > > > > if (error) > > > > - break; > > > > + goto out_trans_cancel; > > > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > > > > &done, stop_fsb); > > > > if (error) > > > > goto out_trans_cancel; > > > > + } while (!done); > > > > > > > > - error = xfs_trans_commit(tp); > > > > - } > > > > - > > > > + error = xfs_trans_commit(tp); > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > return error; > > > > > > > > out_trans_cancel: > > > > xfs_trans_cancel(tp); > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > return error; > > > > } > > > > > > > > -- > > > > 2.20.1 > > > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
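The per-inode completion queue fix referenced above (commit cb357bf3d1) batches buffered-writeback completions so one worker drains a single per-inode queue under one ilock cycle, instead of one worker and one lock cycle per ioend. A single-threaded toy sketch of that batching idea follows; all names are invented and no real locking or concurrency is modeled:

```c
#include <assert.h>

#define TOY_MAX_IOENDS 16

/*
 * Invented per-inode state: a pending-completion queue plus a counter
 * of how many times the "ilock" had to be cycled to drain it.
 */
struct toy_inode {
	int pending[TOY_MAX_IOENDS];
	int npending;
	int lock_cycles;
};

/* Writeback completion: just append the ioend to the inode's queue. */
static void toy_queue_ioend(struct toy_inode *ip, int ioend)
{
	if (ip->npending < TOY_MAX_IOENDS)
		ip->pending[ip->npending++] = ioend;
}

/*
 * One worker drains everything queued so far under a single lock
 * cycle; returns the number of completions processed.  Contrast with
 * the old scheme, where each ioend would have taken its own lock
 * cycle (and its own log reservation) concurrently.
 */
static int toy_run_completions(struct toy_inode *ip)
{
	int completed = 0;

	ip->lock_cycles++;		/* xfs_ilock(ip, XFS_ILOCK_EXCL) */
	while (ip->npending > 0) {
		ip->npending--;		/* unwritten conversion / COW remap */
		completed++;
	}
	/* xfs_iunlock() */
	return completed;
}
```

However many completions pile up, the inode sees one lock cycle per drain, which is what removes the unbounded reservation-vs-ilock concurrency discussed in the thread.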
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-19 11:55 ` Brian Foster @ 2019-12-20 20:17 ` Darrick J. Wong 2019-12-23 12:12 ` Brian Foster 2019-12-24 11:28 ` Christoph Hellwig 0 siblings, 2 replies; 23+ messages in thread From: Darrick J. Wong @ 2019-12-20 20:17 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote: > On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote: > > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote: > > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote: > > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > > > > > The insert range operation uses a unique transaction and ilock cycle > > > > > for the extent split and each extent shift iteration of the overall > > > > > operation. While this works, it is risks racing with other > > > > > operations in subtle ways such as COW writeback modifying an extent > > > > > tree in the middle of a shift operation. > > > > > > > > > > To avoid this problem, make insert range atomic with respect to > > > > > ilock. Hold the ilock across the entire operation, replace the > > > > > individual transactions with a single rolling transaction sequence > > > > > and relog the inode to keep it moving in the log. This guarantees > > > > > that nothing else can change the extent mapping of an inode while > > > > > an insert range operation is in progress. 
> > > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > > --- > > > > > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > > > > > 1 file changed, 13 insertions(+), 19 deletions(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > > index 829ab1a804c9..555c8b49a223 100644 > > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > > > > > if (error) > > > > > return error; > > > > > > > > > > - /* > > > > > - * The extent shifting code works on extent granularity. So, if stop_fsb > > > > > - * is not the starting block of extent, we need to split the extent at > > > > > - * stop_fsb. > > > > > - */ > > > > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > > > > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > > > > if (error) > > > > > return error; > > > > > > > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > > > > > + /* > > > > > + * The extent shifting code works on extent granularity. So, if stop_fsb > > > > > + * is not the starting block of extent, we need to split the extent at > > > > > + * stop_fsb. 
> > > > > + */ > > > > > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > > > > if (error) > > > > > goto out_trans_cancel; > > > > > > > > > > - error = xfs_trans_commit(tp); > > > > > - if (error) > > > > > - return error; > > > > > - > > > > > - while (!error && !done) { > > > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > > > > > - &tp); > > > > > > > > I'm a little concerned about the livelock potential here, if there are a lot of > > > > other threads that have eaten all the transaction reservation and are trying to > > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying > > > > to roll the transaction to move another extent, having already rolled the > > > > transaction more than logcount times. > > > > > > > > > > My understanding is that the regrant mechanism is intended to deal with > > > that scenario. Even after the initial (res * logcount) reservation is > > > consumed, a regrant is different from an initial reservation in that the > > > reservation head is unconditionally updated with a new reservation unit. > > > We do wait on the write head in regrant, but IIUC that should be limited > > > to the pool of already allocated transactions (and is woken before the > > > reserve head waiters IIRC). I suppose something like this might be > > > possible in theory if we were blocked on regrant and the entirety of > > > remaining log space was consumed by transactions waiting on our ilock, > > > but I think that is highly unlikely since we also hold the iolock here. > > > > True. The only time I saw this happen was with buffered COW writeback > > completions (which hold lock other than ILOCK), which should have been > > fixed by the patch I made to put all the writeback items to a single > > inode queue and run /one/ worker thread to process them all. So maybe > > my fears are unfounded nowadays. :) > > > > Ok. If I'm following correctly, that goes back to this[1] commit. 
The > commit log describes a situation where we basically have unbound, > concurrent attempts to do to COW remappings on writeback completion. > That leads to heavy ilock contention and the potential for deadlocks > between log reservation and inode locks. Out of curiosity, was that old > scheme bad enough to reproduce that kind of deadlock or was the deadlock > description based on code analysis? It was bad enough to cause an internal complaint about log livelock. :( I think directio completions might suffer from the same class of problem though, since we allow concurrent dio writes and dio doesn't do any of the ioend batching that we do with buffered write ioends. I think fixing directio is going to be a tougher nut to crack because it means that in the non-overwrite case we can't really do the ioend completion processing from the dio submitter thread. It might also be nice to find a way to unify the ioend paths since they both do "convert unwritten and do cow remapping" on the entire range, and diverge only once that's done. > I am a bit curious how "deadlock avoidant" the current transaction > rolling/regrant mechanism is intended to be on its own vs. how much > responsibility is beared by the implementation of certain operations > (like truncate holding IOLOCK and assuming that sufficiently restricts > object access). ISTM that it's technically possible for this lockup > state to occur, but it relies on something unusual like the above where > we allow enough unbound (open reservation -> lock) concurrency on a > single metadata object to exhaust all log space. > > [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues") <nod> So far I haven't seen anyone in practice managing to hit a thread-happy directio cow remap deadlock, though I think triggering it should be as simple as turning on alwayscow and coaxing an aio stress tester into issuing a lot of random writes. 
> > The only other place I can think of that does a lot of transaction > > rolling on a single inode is online repair, and it always holds all > > three exclusive locks. > > > > So I guess that should similarly hold off transaction reservation from > userspace intended to modify the particular inode, provided the XFS > internals play nice (re: your workqueue fix above). Yeah. ISTR the scrub setup functions flush dirty pages and drain directios before taking the ILOCK, so it should never be picking any fights with writeback. :) > > > Also note that this approach is based on the current truncate algorithm, > > > which is probably a better barometer of potential for this kind of issue > > > as it is a less specialized operation. I'd argue that if this is safe > > > enough for truncate, it should be safe enough for range shifting. > > > > Hehehe. > > > > > > I think the extent shifting loop starts with the highest offset mapping and > > > > shifts it up and continues in order of decreasing offset until it gets to > > > > @stop_fsb, correct? > > > > > > > > > > Yep, for insert range at least. > > > > > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the > > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; > > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb? > > > > tr_write pre-reserves enough space to that the roll won't need to get more, > > > > which would eliminate that potential problem, I think. > > > > > > > > > > We'd have to reorder the extent split for that kind of approach, which I > > > think you've noted in the sequence above, as the race window is between > > > the split and subsequent shift. Otherwise I think that would work. > > > > > > That said, I'd prefer not to introduce the extra complexity and > > > functional variance unless it were absolutely necessary, and it's not > > > clear to me that it is. 
If it is, we'd probably have seen similar issues > > > in truncate and should target a fix there before worrying about range > > > shift. > > > > Ok. Looking back through lore I don't see any complaints about insert > > range, so I guess it's fine. > > > > Ok, thanks. > > Brian > > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --D > > > > > Brian > > > > > > > --D > > > > > > > > > + do { > > > > > + error = xfs_trans_roll_inode(&tp, ip); > > > > > if (error) > > > > > - break; > > > > > + goto out_trans_cancel; > > > > > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > > > > > &done, stop_fsb); > > > > > if (error) > > > > > goto out_trans_cancel; > > > > > + } while (!done); > > > > > > > > > > - error = xfs_trans_commit(tp); > > > > > - } > > > > > - > > > > > + error = xfs_trans_commit(tp); > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > return error; > > > > > > > > > > out_trans_cancel: > > > > > xfs_trans_cancel(tp); > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > return error; > > > > > } > > > > > > > > > > -- > > > > > 2.20.1 > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-20 20:17 ` Darrick J. Wong @ 2019-12-23 12:12 ` Brian Foster 2019-12-24 11:28 ` Christoph Hellwig 1 sibling, 0 replies; 23+ messages in thread From: Brian Foster @ 2019-12-23 12:12 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote: > On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote: > > On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote: > > > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote: > > > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote: > > > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > > > > > > The insert range operation uses a unique transaction and ilock cycle > > > > > > for the extent split and each extent shift iteration of the overall > > > > > > operation. While this works, it is risks racing with other > > > > > > operations in subtle ways such as COW writeback modifying an extent > > > > > > tree in the middle of a shift operation. > > > > > > > > > > > > To avoid this problem, make insert range atomic with respect to > > > > > > ilock. Hold the ilock across the entire operation, replace the > > > > > > individual transactions with a single rolling transaction sequence > > > > > > and relog the inode to keep it moving in the log. This guarantees > > > > > > that nothing else can change the extent mapping of an inode while > > > > > > an insert range operation is in progress. 
> > > > > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > > > --- > > > > > > fs/xfs/xfs_bmap_util.c | 32 +++++++++++++------------------- > > > > > > 1 file changed, 13 insertions(+), 19 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > > > index 829ab1a804c9..555c8b49a223 100644 > > > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space( > > > > > > if (error) > > > > > > return error; > > > > > > > > > > > > - /* > > > > > > - * The extent shifting code works on extent granularity. So, if stop_fsb > > > > > > - * is not the starting block of extent, we need to split the extent at > > > > > > - * stop_fsb. > > > > > > - */ > > > > > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, > > > > > > XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp); > > > > > > if (error) > > > > > > return error; > > > > > > > > > > > > xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > > > > > > > + /* > > > > > > + * The extent shifting code works on extent granularity. So, if stop_fsb > > > > > > + * is not the starting block of extent, we need to split the extent at > > > > > > + * stop_fsb. 
> > > > > > + */ > > > > > > error = xfs_bmap_split_extent(tp, ip, stop_fsb); > > > > > > if (error) > > > > > > goto out_trans_cancel; > > > > > > > > > > > > - error = xfs_trans_commit(tp); > > > > > > - if (error) > > > > > > - return error; > > > > > > - > > > > > > - while (!error && !done) { > > > > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, > > > > > > - &tp); > > > > > > > > > > I'm a little concerned about the livelock potential here, if there are a lot of > > > > > other threads that have eaten all the transaction reservation and are trying to > > > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying > > > > > to roll the transaction to move another extent, having already rolled the > > > > > transaction more than logcount times. > > > > > > > > > > > > > My understanding is that the regrant mechanism is intended to deal with > > > > that scenario. Even after the initial (res * logcount) reservation is > > > > consumed, a regrant is different from an initial reservation in that the > > > > reservation head is unconditionally updated with a new reservation unit. > > > > We do wait on the write head in regrant, but IIUC that should be limited > > > > to the pool of already allocated transactions (and is woken before the > > > > reserve head waiters IIRC). I suppose something like this might be > > > > possible in theory if we were blocked on regrant and the entirety of > > > > remaining log space was consumed by transactions waiting on our ilock, > > > > but I think that is highly unlikely since we also hold the iolock here. > > > > > > True. The only time I saw this happen was with buffered COW writeback > > > completions (which hold lock other than ILOCK), which should have been > > > fixed by the patch I made to put all the writeback items to a single > > > inode queue and run /one/ worker thread to process them all. So maybe > > > my fears are unfounded nowadays. :) > > > > > > > Ok. 
If I'm following correctly, that goes back to this[1] commit. The > > commit log describes a situation where we basically have unbound, > > concurrent attempts to do to COW remappings on writeback completion. > > That leads to heavy ilock contention and the potential for deadlocks > > between log reservation and inode locks. Out of curiosity, was that old > > scheme bad enough to reproduce that kind of deadlock or was the deadlock > > description based on code analysis? > > It was bad enough to cause an internal complaint about log livelock. :( > Interesting.. > I think directio completions might suffer from the same class of problem > though, since we allow concurrent dio writes and dio doesn't do any of > the ioend batching that we do with buffered write ioends. > Indeed, that makes sense since it sounds like a similar pattern. > I think fixing directio is going to be a tougher nut to crack because it > means that in the non-overwrite case we can't really do the ioend > completion processing from the dio submitter thread. > Hm.. yeah, but it might be a little strange to have highly concurrent dio writers to the same inode, right? IOW, it seems like the risk there more depends on foolishness of userspace rather than of our backend implementation deciding to allocate more tasks in the event of contention. That isn't to say we shouldn't think about ways to protect ourselves from the former case.. > It might also be nice to find a way to unify the ioend paths since they > both do "convert unwritten and do cow remapping" on the entire range, > and diverge only once that's done. > > > I am a bit curious how "deadlock avoidant" the current transaction > > rolling/regrant mechanism is intended to be on its own vs. how much > > responsibility is beared by the implementation of certain operations > > (like truncate holding IOLOCK and assuming that sufficiently restricts > > object access). 
ISTM that it's technically possible for this lockup > > state to occur, but it relies on something unusual like the above where > > we allow enough unbound (open reservation -> lock) concurrency on a > > single metadata object to exhaust all log space. > > > > [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues") > > <nod> So far I haven't seen anyone in practice managing to hit a > thread-happy directio cow remap deadlock, though I think triggering it > should be as simple as turning on alwayscow and coaxing an aio stress > tester into issuing a lot of random writes. > > > > The only other place I can think of that does a lot of transaction > > > rolling on a single inode is online repair, and it always holds all > > > three exclusive locks. > > > > > > > So I guess that should similarly hold off transaction reservation from > > userspace intended to modify the particular inode, provided the XFS > > internals play nice (re: your workqueue fix above). > > Yeah. ISTR the scrub setup functions flush dirty pages and drain > directios before taking the ILOCK, so it should never be picking any > fights with writeback. :) > That seems to be the right thing to do, at least for now. It doesn't look like truncate currently waits on dio. Extent shifting might by virtue of eofblocks trimming I guess, but that isn't guaranteed. Regardless, I think either operation could wait on dio without a problem given that they already isolate the inode for I/O, force writeback, etc. I suspect it's just not been a likely enough situation to have a heavy enough I/O workload concurrent with truncating (or shifting) a file to see this problem in traditional (i.e. unlike the "always COW" mode) codepaths. Brian > > > > Also note that this approach is based on the current truncate algorithm, > > > > which is probably a better barometer of potential for this kind of issue > > > > as it is a less specialized operation. 
I'd argue that if this is safe > > > > enough for truncate, it should be safe enough for range shifting. > > > > > > Hehehe. > > > > > > > > I think the extent shifting loop starts with the highest offset mapping and > > > > > shifts it up and continues in order of decreasing offset until it gets to > > > > > @stop_fsb, correct? > > > > > > > > > > > > > Yep, for insert range at least. > > > > > > > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the > > > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll; > > > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb? > > > > > tr_write pre-reserves enough space so that the roll won't need to get more, > > > > > which would eliminate that potential problem, I think. > > > > > > > > > > > > > We'd have to reorder the extent split for that kind of approach, which I > > > > think you've noted in the sequence above, as the race window is between > > > > the split and subsequent shift. Otherwise I think that would work. > > > > > > > > That said, I'd prefer not to introduce the extra complexity and > > > > functional variance unless it were absolutely necessary, and it's not > > > > clear to me that it is. If it is, we'd probably have seen similar issues > > > > in truncate and should target a fix there before worrying about range > > > > shift. > > > > > > Ok. Looking back through lore I don't see any complaints about insert > > > range, so I guess it's fine. > > > > > > > Ok, thanks. > > > > Brian > > > > > Reviewed-by: Darrick J. 
Wong <darrick.wong@oracle.com> > > > > > > --D > > > > > > > Brian > > > > > > > > > --D > > > > > > > > > > > + do { > > > > > > + error = xfs_trans_roll_inode(&tp, ip); > > > > > > if (error) > > > > > > - break; > > > > > > + goto out_trans_cancel; > > > > > > > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > > > error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb, > > > > > > &done, stop_fsb); > > > > > > if (error) > > > > > > goto out_trans_cancel; > > > > > > + } while (!done); > > > > > > > > > > > > - error = xfs_trans_commit(tp); > > > > > > - } > > > > > > - > > > > > > + error = xfs_trans_commit(tp); > > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > return error; > > > > > > > > > > > > out_trans_cancel: > > > > > > xfs_trans_cancel(tp); > > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > return error; > > > > > > } > > > > > > > > > > > > -- > > > > > > 2.20.1 > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
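[Editorial aside: the regrant behavior discussed in the thread above can be sketched with a tiny toy accounting model. All names here (`toy_log`, `toy_trans_alloc`, `toy_trans_roll`) are hypothetical stand-ins, not the real xlog grant code; in particular, a real roll regrants on the write head and may sleep there rather than fail, and grant space is recycled as the log tail moves, which this model omits. The only point illustrated is that a fresh reservation must find unit_res * logcount space up front, while a roll of an existing transaction needs just one more unit, so a long rolling chain keeps making progress after its initial logcount worth of reservation is spent.]

```c
#include <assert.h>

/* Toy model of log grant accounting (illustration only). */
struct toy_log {
	int space;	/* grant space still available */
};

struct toy_ticket {
	int unit_res;	/* space consumed per transaction roll */
};

/* Analogue of xfs_trans_alloc(): reserve unit_res * logcount or fail. */
int toy_trans_alloc(struct toy_log *log, struct toy_ticket *tic,
		    int unit_res, int logcount)
{
	if (log->space < unit_res * logcount)
		return -1;	/* stand-in for -ENOSPC / blocking */
	log->space -= unit_res * logcount;
	tic->unit_res = unit_res;
	return 0;
}

/* Analogue of xfs_trans_roll(): regrant one more unit on the ticket. */
int toy_trans_roll(struct toy_log *log, struct toy_ticket *tic)
{
	if (log->space < tic->unit_res)
		return -1;	/* real code would wait on the write head */
	log->space -= tic->unit_res;
	return 0;
}
```

With `space = 10`, `unit_res = 2` and `logcount = 2`, the initial allocation takes 4 units and the chain can still roll three more times (2 units each), while a concurrent fresh allocation fails once fewer than 4 units remain.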
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-20 20:17 ` Darrick J. Wong 2019-12-23 12:12 ` Brian Foster @ 2019-12-24 11:28 ` Christoph Hellwig 2019-12-24 16:45 ` Darrick J. Wong 1 sibling, 1 reply; 23+ messages in thread From: Christoph Hellwig @ 2019-12-24 11:28 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote: > I think directio completions might suffer from the same class of problem > though, since we allow concurrent dio writes and dio doesn't do any of > the ioend batching that we do with buffered write ioends. OTOH direct I/O completions are per-I/O, and not per-extent like buffered I/O completions. Moreover for the case where we don't update i_size and don't need a separate log force (overwrites without O_SYNC or using fua) we could actually avoid the workqueue entirely with just a little work. > It might also be nice to find a way to unify the ioend paths since they > both do "convert unwritten and do cow remapping" on the entire range, > and diverge only once that's done. They were common a while ago and it was a complete mess. That is why I split them. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-24 11:28 ` Christoph Hellwig @ 2019-12-24 16:45 ` Darrick J. Wong 0 siblings, 0 replies; 23+ messages in thread From: Darrick J. Wong @ 2019-12-24 16:45 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs On Tue, Dec 24, 2019 at 03:28:45AM -0800, Christoph Hellwig wrote: > On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote: > > I think directio completions might suffer from the same class of problem > > though, since we allow concurrent dio writes and dio doesn't do any of > > the ioend batching that we do with buffered write ioends. > > OTOH direct I/O completions are per-I/O, and not per-extent like > buffered I/O completions. Moreover for the case where we don't update > i_size and don't need a separate log force (overwrites without O_SYNC > or using fua) we could actually avoid the workqueue entirely with just > a little work. <nod> > > It might also be nice to find a way to unify the ioend paths since they > > both do "convert unwritten and do cow remapping" on the entire range, > > and diverge only once that's done. > > They were common a while ago and it was a complete mess. That is why > I split them. And I couldn't figure out a sane way to make them work together so I guess it's just as well. :) --D ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] xfs: rework insert range into an atomic operation 2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster 2019-12-17 17:04 ` Allison Collins 2019-12-18 2:37 ` Darrick J. Wong @ 2019-12-24 11:26 ` Christoph Hellwig 2 siblings, 0 replies; 23+ messages in thread From: Christoph Hellwig @ 2019-12-24 11:26 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote: > The insert range operation uses a unique transaction and ilock cycle > for the extent split and each extent shift iteration of the overall > operation. While this works, it risks racing with other > operations in subtle ways such as COW writeback modifying an extent > tree in the middle of a shift operation. > > To avoid this problem, make insert range atomic with respect to > ilock. Hold the ilock across the entire operation, replace the > individual transactions with a single rolling transaction sequence > and relog the inode to keep it moving in the log. This guarantees > that nothing else can change the extent mapping of an inode while > an insert range operation is in progress. This looks good, and similar to our usual truncate sequence: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 23+ messages in thread
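[Editorial aside: the race window that motivates holding the ILOCK across the whole shift can be shown with a minimal single-threaded sketch. Everything here is hypothetical: `intruder()` stands in for any other would-be ILOCK holder, the flag stands in for the real lock, and the real code shifts actual extents and rolls a transaction per iteration. Cycling the lock between iterations lets another task observe a half-shifted extent map; holding it across the loop does not.]

```c
#include <assert.h>

int ilocked;			/* stand-in for XFS_ILOCK_EXCL */
int intruder_saw_partial;	/* another task observed a half-shifted map */

/* A would-be concurrent ILOCK holder; runs only when the lock is free. */
void intruder(int shifted, int total)
{
	if (ilocked)
		return;		/* would block on the ILOCK */
	if (shifted > 0 && shifted < total)
		intruder_saw_partial = 1;
}

/* Old scheme: one transaction and lock cycle per extent shifted. */
void shift_cycling(int total)
{
	for (int shifted = 0; shifted < total; shifted++) {
		ilocked = 1;
		/* ... shift one extent, commit ... */
		ilocked = 0;
		intruder(shifted + 1, total);
	}
}

/* New scheme: hold the lock, roll the transaction inside the loop. */
void shift_atomic(int total)
{
	ilocked = 1;
	for (int shifted = 0; shifted < total; shifted++) {
		/* ... shift one extent, roll the transaction ... */
		intruder(shifted + 1, total);	/* blocked: still locked */
	}
	ilocked = 0;
}
```

Running `shift_cycling(4)` lets the intruder see a partially shifted map between iterations, while `shift_atomic(4)` never does.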
* [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-13 17:12 [PATCH 0/3] xfs: hold ilock across insert and collapse range Brian Foster 2019-12-13 17:12 ` [PATCH 1/3] xfs: open code insert range extent split helper Brian Foster 2019-12-13 17:12 ` [PATCH 2/3] xfs: rework insert range into an atomic operation Brian Foster @ 2019-12-13 17:12 ` Brian Foster 2019-12-18 2:39 ` Darrick J. Wong 2019-12-24 11:29 ` Christoph Hellwig 2 siblings, 2 replies; 23+ messages in thread From: Brian Foster @ 2019-12-13 17:12 UTC (permalink / raw) To: linux-xfs The collapse range operation uses a unique transaction and ilock cycle for the hole punch and each extent shift iteration of the overall operation. While the hole punch is safe as a separate operation due to the iolock, cycling the ilock after each extent shift is risky similar to insert range. To avoid this problem, make collapse range atomic with respect to ilock. Hold the ilock across the entire operation, replace the individual transactions with a single rolling transaction sequence and finish dfops on each iteration to perform pending frees and roll the transaction. Remove the unnecessary quota reservation as collapse range can only ever merge extents (and thus remove extent records and potentially free bmap blocks). The dfops call automatically relogs the inode to keep it moving in the log. This guarantees that nothing else can change the extent mapping of an inode while a collapse range operation is in progress. 
Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 555c8b49a223..1c34a34997ca 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1050,7 +1050,6 @@ xfs_collapse_file_space( int error; xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len); xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len); - uint resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); bool done = false; ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); @@ -1066,32 +1065,34 @@ xfs_collapse_file_space( if (error) return error; - while (!error && !done) { - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, - &tp); - if (error) - break; + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp); + if (error) + return error; - xfs_ilock(ip, XFS_ILOCK_EXCL); - error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, - ip->i_gdquot, ip->i_pdquot, resblks, 0, - XFS_QMOPT_RES_REGBLKS); - if (error) - goto out_trans_cancel; - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); + xfs_ilock(ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, ip, 0); + while (!done) { error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb, &done); if (error) goto out_trans_cancel; + if (done) + break; - error = xfs_trans_commit(tp); + /* finish any deferred frees and roll the transaction */ + error = xfs_defer_finish(&tp); + if (error) + goto out_trans_cancel; } + error = xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; out_trans_cancel: xfs_trans_cancel(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); return error; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster @ 2019-12-18 2:39 ` Darrick J. Wong 2019-12-18 12:11 ` Brian Foster 2019-12-24 11:29 ` Christoph Hellwig 1 sibling, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2019-12-18 2:39 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote: > The collapse range operation uses a unique transaction and ilock > cycle for the hole punch and each extent shift iteration of the > overall operation. While the hole punch is safe as a separate > operation due to the iolock, cycling the ilock after each extent > shift is risky similar to insert range. It is? I thought collapse range was safe because we started by punching out the doomed range and shifting downwards, which eliminates the problems that come with two adjacent mappings that could be combined? <confused?> --D > To avoid this problem, make collapse range atomic with respect to > ilock. Hold the ilock across the entire operation, replace the > individual transactions with a single rolling transaction sequence > and finish dfops on each iteration to perform pending frees and roll > the transaction. Remove the unnecessary quota reservation as > collapse range can only ever merge extents (and thus remove extent > records and potentially free bmap blocks). The dfops call > automatically relogs the inode to keep it moving in the log. This > guarantees that nothing else can change the extent mapping of an > inode while a collapse range operation is in progress. 
> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++-------------- > 1 file changed, 15 insertions(+), 14 deletions(-) > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > index 555c8b49a223..1c34a34997ca 100644 > --- a/fs/xfs/xfs_bmap_util.c > +++ b/fs/xfs/xfs_bmap_util.c > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space( > int error; > xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len); > xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len); > - uint resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > bool done = false; > > ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space( > if (error) > return error; > > - while (!error && !done) { > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, > - &tp); > - if (error) > - break; > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp); > + if (error) > + return error; > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > - error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, > - ip->i_gdquot, ip->i_pdquot, resblks, 0, > - XFS_QMOPT_RES_REGBLKS); > - if (error) > - goto out_trans_cancel; > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, 0); > > + while (!done) { > error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb, > &done); > if (error) > goto out_trans_cancel; > + if (done) > + break; > > - error = xfs_trans_commit(tp); > + /* finish any deferred frees and roll the transaction */ > + error = xfs_defer_finish(&tp); > + if (error) > + goto out_trans_cancel; > } > > + error = xfs_trans_commit(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > > out_trans_cancel: > xfs_trans_cancel(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > return error; > } > > -- > 2.20.1 > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-18 2:39 ` Darrick J. Wong @ 2019-12-18 12:11 ` Brian Foster 2019-12-18 21:19 ` Darrick J. Wong 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-12-18 12:11 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote: > On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote: > > The collapse range operation uses a unique transaction and ilock > > cycle for the hole punch and each extent shift iteration of the > > overall operation. While the hole punch is safe as a separate > > operation due to the iolock, cycling the ilock after each extent > > shift is risky similar to insert range. > > It is? I thought collapse range was safe because we started by punching > out the doomed range and shifting downwards, which eliminates the > problems that come with two adjacent mappings that could be combined? > > <confused?> > This is somewhat vague wording. I don't mean to say the same bug is possible on collapse. Indeed, I don't think that it is. What I mean to say is just that cycling the ilock generally opens the operation up to concurrent extent changes, similar to the behavior that resulted in the insert range bug and against the general design principle of the operation (as implied by the iolock hold -> flush -> unmap preparation sequence). IOW, it seems to me that a similar behavior is possible on collapse, it just might occur after an extent has been shifted into its new target range rather than before. That wouldn't be a corruption/bug because it doesn't change the semantics of the shift operation or the content of the file, but it's subtle and arguably a misbehavior and/or landmine. Brian > --D > > > To avoid this problem, make collapse range atomic with respect to > > ilock. 
Hold the ilock across the entire operation, replace the > > individual transactions with a single rolling transaction sequence > > and finish dfops on each iteration to perform pending frees and roll > > the transaction. Remove the unnecessary quota reservation as > > collapse range can only ever merge extents (and thus remove extent > > records and potentially free bmap blocks). The dfops call > > automatically relogs the inode to keep it moving in the log. This > > guarantees that nothing else can change the extent mapping of an > > inode while a collapse range operation is in progress. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++-------------- > > 1 file changed, 15 insertions(+), 14 deletions(-) > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > index 555c8b49a223..1c34a34997ca 100644 > > --- a/fs/xfs/xfs_bmap_util.c > > +++ b/fs/xfs/xfs_bmap_util.c > > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space( > > int error; > > xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len); > > xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len); > > - uint resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > bool done = false; > > > > ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); > > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space( > > if (error) > > return error; > > > > - while (!error && !done) { > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, > > - &tp); > > - if (error) > > - break; > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp); > > + if (error) > > + return error; > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > - error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, > > - ip->i_gdquot, ip->i_pdquot, resblks, 0, > > - XFS_QMOPT_RES_REGBLKS); > > - if (error) > > - goto out_trans_cancel; > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > + xfs_ilock(ip, XFS_ILOCK_EXCL); > > + xfs_trans_ijoin(tp, ip, 0); > > > > + while (!done) { > > error = 
xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb, > > &done); > > if (error) > > goto out_trans_cancel; > > + if (done) > > + break; > > > > - error = xfs_trans_commit(tp); > > + /* finish any deferred frees and roll the transaction */ > > + error = xfs_defer_finish(&tp); > > + if (error) > > + goto out_trans_cancel; > > } > > > > + error = xfs_trans_commit(tp); > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > return error; > > > > out_trans_cancel: > > xfs_trans_cancel(tp); > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > return error; > > } > > > > -- > > 2.20.1 > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-18 12:11 ` Brian Foster @ 2019-12-18 21:19 ` Darrick J. Wong 2019-12-19 11:56 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Darrick J. Wong @ 2019-12-18 21:19 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Dec 18, 2019 at 07:11:20AM -0500, Brian Foster wrote: > On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote: > > On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote: > > > The collapse range operation uses a unique transaction and ilock > > > cycle for the hole punch and each extent shift iteration of the > > > overall operation. While the hole punch is safe as a separate > > > operation due to the iolock, cycling the ilock after each extent > > > shift is risky similar to insert range. > > > > It is? I thought collapse range was safe because we started by punching > > out the doomed range and shifting downwards, which eliminates the > > problems that come with two adjacent mappings that could be combined? > > > > <confused?> > > > > This is somewhat vague wording. I don't mean to say the same bug is > possible on collapse. Indeed, I don't think that it is. What I mean to > say is just that cycling the ilock generally opens the operation up to > concurrent extent changes, similar to the behavior that resulted in the > insert range bug and against the general design principle of the > operation (as implied by the iolock hold -> flush -> unmap preparation > sequence). Oh, ok, you're merely trying to prevent anyone from seeing the inode metadata while we're in the middle of a collapse-range operation. I wonder then if we need to take a look at the remap range operations, but oh wow is that a gnarly mess of inode locking. :) > IOW, it seems to me that a similar behavior is possible on collapse, it > just might occur after an extent has been shifted into its new target > range rather than before. 
That wouldn't be a corruption/bug because it > doesn't change the semantics of the shift operation or the content of > the file, but it's subtle and arguably a misbehavior and/or landmine. <nod> Ok, just making sure. :) Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > Brian > > > --D > > > > > To avoid this problem, make collapse range atomic with respect to > > > ilock. Hold the ilock across the entire operation, replace the > > > individual transactions with a single rolling transaction sequence > > > and finish dfops on each iteration to perform pending frees and roll > > > the transaction. Remove the unnecessary quota reservation as > > > collapse range can only ever merge extents (and thus remove extent > > > records and potentially free bmap blocks). The dfops call > > > automatically relogs the inode to keep it moving in the log. This > > > guarantees that nothing else can change the extent mapping of an > > > inode while a collapse range operation is in progress. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++-------------- > > > 1 file changed, 15 insertions(+), 14 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > index 555c8b49a223..1c34a34997ca 100644 > > > --- a/fs/xfs/xfs_bmap_util.c > > > +++ b/fs/xfs/xfs_bmap_util.c > > > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space( > > > int error; > > > xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len); > > > xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len); > > > - uint resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > > bool done = false; > > > > > > ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); > > > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space( > > > if (error) > > > return error; > > > > > > - while (!error && !done) { > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, > > > - &tp); > > > - if (error) > > > - break; > > > + error = xfs_trans_alloc(mp, 
&M_RES(mp)->tr_write, 0, 0, 0, &tp); > > > + if (error) > > > + return error; > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > - error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, > > > - ip->i_gdquot, ip->i_pdquot, resblks, 0, > > > - XFS_QMOPT_RES_REGBLKS); > > > - if (error) > > > - goto out_trans_cancel; > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > + xfs_ilock(ip, XFS_ILOCK_EXCL); > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > + while (!done) { > > > error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb, > > > &done); > > > if (error) > > > goto out_trans_cancel; > > > + if (done) > > > + break; > > > > > > - error = xfs_trans_commit(tp); > > > + /* finish any deferred frees and roll the transaction */ > > > + error = xfs_defer_finish(&tp); > > > + if (error) > > > + goto out_trans_cancel; > > > } > > > > > > + error = xfs_trans_commit(tp); > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > return error; > > > > > > out_trans_cancel: > > > xfs_trans_cancel(tp); > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > return error; > > > } > > > > > > -- > > > 2.20.1 > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-18 21:19 ` Darrick J. Wong @ 2019-12-19 11:56 ` Brian Foster 0 siblings, 0 replies; 23+ messages in thread From: Brian Foster @ 2019-12-19 11:56 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Wed, Dec 18, 2019 at 01:19:05PM -0800, Darrick J. Wong wrote: > On Wed, Dec 18, 2019 at 07:11:20AM -0500, Brian Foster wrote: > > On Tue, Dec 17, 2019 at 06:39:58PM -0800, Darrick J. Wong wrote: > > > On Fri, Dec 13, 2019 at 12:12:58PM -0500, Brian Foster wrote: > > > > The collapse range operation uses a unique transaction and ilock > > > > cycle for the hole punch and each extent shift iteration of the > > > > overall operation. While the hole punch is safe as a separate > > > > operation due to the iolock, cycling the ilock after each extent > > > > shift is risky similar to insert range. > > > > > > It is? I thought collapse range was safe because we started by punching > > > out the doomed range and shifting downwards, which eliminates the > > > problems that come with two adjacent mappings that could be combined? > > > > > > <confused?> > > > > > > > This is somewhat vague wording. I don't mean to say the same bug is > > possible on collapse. Indeed, I don't think that it is. What I mean to > > say is just that cycling the ilock generally opens the operation up to > > concurrent extent changes, similar to the behavior that resulted in the > > insert range bug and against the general design principle of the > > operation (as implied by the iolock hold -> flush -> unmap preparation > > sequence). > > Oh, ok, you're merely trying to prevent anyone from seeing the inode > metadata while we're in the middle of a collapse-range operation. I > wonder then if we need to take a look at the remap range operations, but > oh wow is that a gnarly mess of inode locking. :) > Yeah, that's actually a much better way to put it. 
;) Feel free to tweak the commit log along those lines if my current description is too vague/confusing. Thanks for the reviews.. Brian > > IOW, it seems to me that a similar behavior is possible on collapse, it > > just might occur after an extent has been shifted into its new target > > range rather than before. That wouldn't be a corruption/bug because it > > doesn't change the semantics of the shift operation or the content of > > the file, but it's subtle and arguably a misbehavior and/or landmine. > > <nod> Ok, just making sure. :) > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > > --D > > > Brian > > > > > --D > > > > > > > To avoid this problem, make collapse range atomic with respect to > > > > ilock. Hold the ilock across the entire operation, replace the > > > > individual transactions with a single rolling transaction sequence > > > > and finish dfops on each iteration to perform pending frees and roll > > > > the transaction. Remove the unnecessary quota reservation as > > > > collapse range can only ever merge extents (and thus remove extent > > > > records and potentially free bmap blocks). The dfops call > > > > automatically relogs the inode to keep it moving in the log. This > > > > guarantees that nothing else can change the extent mapping of an > > > > inode while a collapse range operation is in progress. 
> > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > fs/xfs/xfs_bmap_util.c | 29 +++++++++++++++-------------- > > > > 1 file changed, 15 insertions(+), 14 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c > > > > index 555c8b49a223..1c34a34997ca 100644 > > > > --- a/fs/xfs/xfs_bmap_util.c > > > > +++ b/fs/xfs/xfs_bmap_util.c > > > > @@ -1050,7 +1050,6 @@ xfs_collapse_file_space( > > > > int error; > > > > xfs_fileoff_t next_fsb = XFS_B_TO_FSB(mp, offset + len); > > > > xfs_fileoff_t shift_fsb = XFS_B_TO_FSB(mp, len); > > > > - uint resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0); > > > > bool done = false; > > > > > > > > ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); > > > > @@ -1066,32 +1065,34 @@ xfs_collapse_file_space( > > > > if (error) > > > > return error; > > > > > > > > - while (!error && !done) { > > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, > > > > - &tp); > > > > - if (error) > > > > - break; > > > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp); > > > > + if (error) > > > > + return error; > > > > > > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > - error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot, > > > > - ip->i_gdquot, ip->i_pdquot, resblks, 0, > > > > - XFS_QMOPT_RES_REGBLKS); > > > > - if (error) > > > > - goto out_trans_cancel; > > > > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); > > > > + xfs_ilock(ip, XFS_ILOCK_EXCL); > > > > + xfs_trans_ijoin(tp, ip, 0); > > > > > > > > + while (!done) { > > > > error = xfs_bmap_collapse_extents(tp, ip, &next_fsb, shift_fsb, > > > > &done); > > > > if (error) > > > > goto out_trans_cancel; > > > > + if (done) > > > > + break; > > > > > > > > - error = xfs_trans_commit(tp); > > > > + /* finish any deferred frees and roll the transaction */ > > > > + error = xfs_defer_finish(&tp); > > > > + if (error) > > > > + goto out_trans_cancel; > > > > } > > > > > > > > + error = xfs_trans_commit(tp); > > > > + 
xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > return error; > > > > > > > > out_trans_cancel: > > > > xfs_trans_cancel(tp); > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > return error; > > > > } > > > > > > > > -- > > > > 2.20.1 > > > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/3] xfs: rework collapse range into an atomic operation 2019-12-13 17:12 ` [PATCH 3/3] xfs: rework collapse " Brian Foster 2019-12-18 2:39 ` Darrick J. Wong @ 2019-12-24 11:29 ` Christoph Hellwig 1 sibling, 0 replies; 23+ messages in thread From: Christoph Hellwig @ 2019-12-24 11:29 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs Looks good modulo any commit log issues: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 23+ messages in thread