Re: [PATCH] xfs: don't reuse busy extents on extent trim

From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: linux-xfs <linux-xfs@vger.kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	"Luis R. Rodriguez" <mcgrof@kernel.org>,
	Brian Foster <bfoster@redhat.com>,
	Amir Goldstein <amir73il@gmail.com>
Subject: Re: [PATCH] xfs: don't reuse busy extents on extent trim
Date: Wed, 11 Jan 2023 22:41:43 +0800	[thread overview]
Message-ID: <5287e7e6-adea-e865-5818-9cc34400cd0b@linux.alibaba.com> (raw)
In-Reply-To: <CAOQ4uxiWajRgGG2V=dYhBmVJYiRmdD+7YgkH2DMWGz6BAOXjvg@mail.gmail.com>

Hi,

On 2022/5/26 19:34, Amir Goldstein wrote:
> On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote:
>>
>> On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
>>> On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
>>>> Freed extents are marked busy from the point the freeing transaction
>>>> commits until the associated CIL context is checkpointed to the log.
>>>> This prevents reuse and overwrite of recently freed blocks before
>>>> the changes are committed to disk, which can lead to corruption
>>>> after a crash. The exception to this rule is that metadata
>>>> allocation is allowed to reuse busy extents because metadata changes
>>>> are also logged.
>>>>
>>>> As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
>>>> has allowed modification or complete invalidation of outstanding
>>>> busy extents for metadata allocations. This implementation assumes
>>>> that use of the associated extent is imminent, which is not always
>>>> the case. For example, the trimmed extent might not satisfy the
>>>> minimum length of the allocation request, or the allocation
>>>> algorithm might be involved in a search for the optimal result based
>>>> on locality.
>>>>
>>>> generic/019 reproduces a corruption caused by this scenario. First,
>>>> a metadata block (usually a bmbt or symlink block) is freed from an
>>>> inode. A subsequent bmbt split on an unrelated inode attempts a near
>>>> mode allocation request that invalidates the busy block during the
>>>> search, but does not ultimately allocate it. Due to the busy state
>>>> invalidation, the block is no longer considered busy to subsequent
>>>> allocation. A direct I/O write request immediately allocates the
>>>> block and writes to it.
>>>

...

>>
> 
> Hi Brian,
> 
> This patch was one of my selected fixes to backport for v5.10.y.
> It has a very scary looking commit message and the change seems
> to be independent of any infrastructure changes(?).
> 
> The problem is that applying this patch to v5.10.y reliably reproduces
> this buffer corruption assertion [*] with test xfs/076.
> 
> This happens on the kdevops system that is using loop devices over
> sparse files inside qemu images. It does not reproduce on my small
> VM at home.
> 
> Normally, I would just drop this patch from the stable candidates queue
> and move on, but I thought you might be interested to investigate this
> reliable reproducer, because maybe this system exercises an error
> that is otherwise rare to hit.
> 
> It seemed weird to me that NOT reusing the extent would result in
> data corruption, but it could indicate that reusing the extent was masking
> the assertion and hiding another bug(?).
> 
> Can you think of another reason to explain the regression this fix
> introduces to 5.10.y?
> 
> Do you care to investigate this failure or shall I just move on?
> 
> Thanks,
> Amir.
> 
> [*]
> : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file
> fs/xfs/xfs_trans.c.  Caller xfs_create+0x22f/0x590 [xfs]
> : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G            E
>     5.10.109-xfs-2 #8
> : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> 04/01/2014
> : Call Trace:
> :  dump_stack+0x6d/0x88
> :  xfs_trans_cancel+0x17b/0x1a0 [xfs]
> :  xfs_create+0x22f/0x590 [xfs]
> :  xfs_generic_create+0x245/0x310 [xfs]
> :  ? d_splice_alias+0x13a/0x3c0
> :  path_openat+0xe3f/0x1080
> :  do_filp_open+0x93/0x100
> :  ? handle_mm_fault+0x148e/0x1690
> :  ? __check_object_size+0x162/0x180
> :  do_sys_openat2+0x228/0x2d0
> :  do_sys_open+0x4b/0x80
> :  do_syscall_64+0x33/0x80
> :  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> : RIP: 0033:0x7f36b02eff1e
> : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00
> 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d
> 00 f0 ff ff 0
> : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e
> : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c
> : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
> : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001
> : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file
> fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764
> : XFS (loop5): Corruption of in-memory data detected.  Shutting down filesystem
> : XFS (loop5): Please unmount the filesystem and rectify the problem(s)
> 

(...just for the record) We also encountered this issue but without commit
  xfs: don't reuse busy extents on extent trim
applied in our 5.10 codebase...

Need to find some time to look into that...

[  413.283300] XFS (loop1): Internal error xfs_trans_cancel at line 950 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x219/0x590 [xfs]
[  413.284295] CPU: 0 PID: 27484 Comm: touch Tainted: G            E     5.10.134-13.an8.x86_64 #1
[  413.284296] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[  413.284297] Call Trace:
[  413.284314]  dump_stack+0x57/0x6e
[  413.284373]  xfs_trans_cancel+0xa3/0x110 [xfs]
[  413.284412]  xfs_create+0x219/0x590 [xfs]
[  413.284458]  xfs_generic_create+0x21f/0x2d0 [xfs]
[  413.284462]  path_openat+0xdee/0x1020
[  413.284464]  do_filp_open+0x80/0xd0
[  413.284467]  ? __check_object_size+0x16a/0x180
[  413.284469]  do_sys_openat2+0x207/0x2c0
[  413.284471]  do_sys_open+0x3b/0x60
[  413.284475]  do_syscall_64+0x33/0x40
[  413.284478]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  413.284481] RIP: 0033:0x7fe623920252
[  413.284482] Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 55 43 2a 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
[  413.284483] RSP: 002b:00007ffd7a38ca70 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  413.284485] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fe623920252
[  413.284486] RDX: 0000000000000941 RSI: 00007ffd7a38d79d RDI: 00000000ffffff9c
[  413.284487] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[  413.284488] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
[  413.284488] R13: 0000000000000001 R14: 00007ffd7a38d79d R15: 00007fe623bbf374
[  413.289856] XFS (loop1): xfs_do_force_shutdown(0x8) called from line 951 of file fs/xfs/xfs_trans.c. Return address = 000000003fe0b8ba
[  413.289858] XFS (loop1): Corruption of in-memory data detected.  Shutting down filesystem
[  413.290573] XFS (loop1): Please unmount the filesystem and rectify the problem(s)

Thanks,
Gao Xiang