linux-fsdevel.vger.kernel.org archive mirror
* [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
@ 2023-05-22  2:07 Pengfei Xu
  2023-05-22  6:39 ` Bagas Sanjaya
  0 siblings, 1 reply; 14+ messages in thread
From: Pengfei Xu @ 2023-05-22  2:07 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, heng.su, dchinner, lkp

Hi Darrick,

Greetings!
There is a "BUG: unable to handle kernel NULL pointer dereference" in
xfs_extent_free_diff_items in v6.4-rc3.

The issue can be reproduced on v6.4-rc3 and v6.4-rc2 kernels in a guest.

I bisected the issue between v6.4-rc2 and v5.11 and found that the problem
commit is:
"
f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
"

Detailed info (report0, repro.stat, and so on): https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
Syzkaller reproducer code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
Syzkaller reproducer prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log

v6.4-rc3 reproduced info:
"
[   91.419498] loop0: detected capacity change from 0 to 65536
[   91.420095] XFS: attr2 mount option is deprecated.
[   91.420500] XFS: ikeep mount option is deprecated.
[   91.422379] XFS (loop0): Deprecated V4 format (crc=0) will not be supported after September 2030.
[   91.423468] XFS (loop0): Mounting V4 Filesystem d28317a9-9e04-4f2a-be27-e55b4c413ff6
[   91.428169] XFS (loop0): Ending clean mount
[   91.429120] XFS (loop0): Quotacheck needed: Please wait.
[   91.432182] BUG: kernel NULL pointer dereference, address: 0000000000000008
[   91.432770] #PF: supervisor read access in kernel mode
[   91.433216] #PF: error_code(0x0000) - not-present page
[   91.433640] PGD 0 P4D 0 
[   91.433864] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   91.434232] CPU: 0 PID: 33 Comm: kworker/u4:2 Not tainted 6.4.0-rc3-kvm #2
[   91.434793] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[   91.435445] Workqueue: xfs_iwalk-393 xfs_pwork_work
[   91.435855] RIP: 0010:xfs_extent_free_diff_items+0x27/0x40
[   91.436312] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 d3 e8 05 73 7d ff 49 8b 44 24 28 48 8b 53 28 5b 41 5c <8b> 40 08 5d 2b 42 08 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
[   91.437812] RSP: 0000:ffffc9000012b8c0 EFLAGS: 00010246
[   91.438250] RAX: 0000000000000000 RBX: ffff8880015826c8 RCX: ffffffff81d71e41
[   91.438840] RDX: 0000000000000000 RSI: ffff888001ca4800 RDI: 0000000000000002
[   91.439430] RBP: ffffc9000012b8c0 R08: ffffc9000012b8e0 R09: 0000000000000000
[   91.440019] R10: ffff88800613f290 R11: ffffffff83e426c0 R12: ffff888001582230
[   91.440610] R13: ffff888001582428 R14: ffffffff81b042c0 R15: ffffc9000012b908
[   91.441202] FS:  0000000000000000(0000) GS:ffff88807ec00000(0000) knlGS:0000000000000000
[   91.441864] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.442343] CR2: 0000000000000008 CR3: 000000000ed22006 CR4: 0000000000770ef0
[   91.442941] PKRU: 55555554
[   91.443178] Call Trace:
[   91.443394]  <TASK>
[   91.443585]  list_sort+0xb8/0x3a0
[   91.443885]  xfs_extent_free_create_intent+0xb6/0xc0
[   91.444312]  xfs_defer_create_intents+0xc3/0x220
[   91.444711]  ? write_comp_data+0x2f/0x90
[   91.445056]  xfs_defer_finish_noroll+0x9e/0xbc0
[   91.445449]  ? list_sort+0x344/0x3a0
[   91.445768]  __xfs_trans_commit+0x4be/0x630
[   91.446135]  xfs_trans_commit+0x20/0x30
[   91.446473]  xfs_dquot_disk_alloc+0x45d/0x4e0
[   91.446860]  xfs_qm_dqread+0x2f7/0x310
[   91.447192]  xfs_qm_dqget+0xd5/0x300
[   91.447506]  xfs_qm_quotacheck_dqadjust+0x5a/0x230
[   91.447921]  xfs_qm_dqusage_adjust+0x249/0x300
[   91.448313]  xfs_iwalk_ag_recs+0x1bd/0x2e0
[   91.448671]  xfs_iwalk_run_callbacks+0xc3/0x1c0
[   91.449071]  xfs_iwalk_ag+0x32e/0x3f0
[   91.449398]  xfs_iwalk_ag_work+0xbe/0xf0
[   91.449744]  xfs_pwork_work+0x2c/0xc0
[   91.450064]  process_one_work+0x3b1/0x860
[   91.450416]  worker_thread+0x52/0x660
[   91.450739]  ? __pfx_worker_thread+0x10/0x10
[   91.451113]  kthread+0x16d/0x1c0
[   91.451406]  ? __pfx_kthread+0x10/0x10
[   91.451740]  ret_from_fork+0x29/0x50
[   91.452064]  </TASK>
[   91.452261] Modules linked in:
[   91.452530] CR2: 0000000000000008
[   91.452819] ---[ end trace 0000000000000000 ]---
[   91.487979] RIP: 0010:xfs_extent_free_diff_items+0x27/0x40
[   91.488463] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 f4 53 48 89 d3 e8 05 73 7d ff 49 8b 44 24 28 48 8b 53 28 5b 41 5c <8b> 40 08 5d 2b 42 08 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
[   91.490021] RSP: 0000:ffffc9000012b8c0 EFLAGS: 00010246
[   91.490472] RAX: 0000000000000000 RBX: ffff8880015826c8 RCX: ffffffff81d71e41
[   91.491080] RDX: 0000000000000000 RSI: ffff888001ca4800 RDI: 0000000000000002
[   91.491689] RBP: ffffc9000012b8c0 R08: ffffc9000012b8e0 R09: 0000000000000000
[   91.492298] R10: ffff88800613f290 R11: ffffffff83e426c0 R12: ffff888001582230
[   91.492909] R13: ffff888001582428 R14: ffffffff81b042c0 R15: ffffc9000012b908
[   91.493516] FS:  0000000000000000(0000) GS:ffff88807ec00000(0000) knlGS:0000000000000000
[   91.494199] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.494695] CR2: 0000000000000008 CR3: 000000000ed22006 CR4: 0000000000770ef0
[   91.495306] PKRU: 55555554
[   91.495549] note: kworker/u4:2[33] exited with irqs disabled
"
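
For orientation, the bytes around the faulting instruction in the Code: line
decode to "mov 0x28(%r12),%rax; mov 0x28(%rbx),%rdx; ... <mov 0x8(%rax),%eax>;
sub 0x8(%rdx),%eax", and RAX is 0, so the faulting load targets address 0x8,
which matches CR2. The sketch below is a minimal userspace model, not the
kernel source: the struct names are hypothetical stand-ins, and the field
order is an assumption modeled on the v6.4 definitions of
struct xfs_extent_free_item and struct xfs_perag on x86-64 (LP64). It only
shows how a NULL xefi_pag would lead to a read at address 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_head_model {
	struct list_head_model *next, *prev;
};

/* Assumed prefix of struct xfs_perag: mount pointer, then the AG number. */
struct perag_model {
	void     *pag_mount;	/* offset 0x00 */
	uint32_t  pag_agno;	/* offset 0x08 on LP64 */
};

/* Assumed field order of struct xfs_extent_free_item in v6.4. */
struct extent_free_item_model {
	struct list_head_model	 xefi_list;		/* 0x00 */
	uint64_t		 xefi_owner;		/* 0x10 */
	uint64_t		 xefi_startblock;	/* 0x18 */
	uint32_t		 xefi_blockcount;	/* 0x20, then 4 bytes padding */
	struct perag_model	*xefi_pag;		/* 0x28 on LP64 */
	unsigned int		 xefi_flags;
};

/*
 * Compute the address a comparator would load when it evaluates
 * item->xefi_pag->pag_agno, without actually dereferencing xefi_pag.
 */
static uintptr_t agno_load_address(const struct extent_free_item_model *item)
{
	return (uintptr_t)item->xefi_pag + offsetof(struct perag_model, pag_agno);
}
```

With xefi_pag left NULL, agno_load_address() yields 0x8 in this model, which
is consistent with the CR2 value in the log. RDX is also 0 in the register
dump, which suggests that both items being compared had no perag pointer.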

I hope it's helpful.
Thanks!

---

If you don't need the following environment to reproduce the problem, or if you
already have one, please ignore the information below.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64; I used v7.1.0
  // start3.sh loads bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65, a v6.2-rc5 kernel
  // You can change bzImage_xxx to the image you want
  // You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
You can log in with the command below; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the VM (virtual machine) successfully, you can transfer the
reproducer binary to the VM as shown below and reproduce the problem in the VM:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Build the bzImage for the target kernel:
Copy the target kconfig to kernel_src/.config, then run:
make olddefconfig
make -jx bzImage           // x should be less than or equal to the number of CPUs your machine has

Put the resulting bzImage file into start3.sh above to load the target kernel in the VM.


Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

Thanks!
BR.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-22  2:07 [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3 Pengfei Xu
@ 2023-05-22  6:39 ` Bagas Sanjaya
  2023-05-22 16:05   ` Darrick J. Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Bagas Sanjaya @ 2023-05-22  6:39 UTC (permalink / raw)
  To: Pengfei Xu, djwong
  Cc: linux-xfs, linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> Hi Darrick,
> 
> Greeting!
> There is BUG: unable to handle kernel NULL pointer dereference in
> xfs_extent_free_diff_items in v6.4-rc3:
> 
> Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> 
> Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> "
> f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> "
> 
> report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> 
> v6.4-rc3 reproduced info:
> [...]
> 

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: f6b384631e1e34
#regzbot title: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items (due to xfs_extfree_intent perag change)
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=217470

-- 
An old man doll... just what I always wanted! - Clara


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-22  6:39 ` Bagas Sanjaya
@ 2023-05-22 16:05   ` Darrick J. Wong
  2023-05-22 17:05     ` Linux regression tracking (Thorsten Leemhuis)
  2023-05-23  0:00     ` Eric Biggers
  0 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2023-05-22 16:05 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Pengfei Xu, linux-xfs, linux-fsdevel, heng.su, dchinner, lkp,
	Linux Regressions

On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > Hi Darrick,
> > 
> > Greeting!
> > There is BUG: unable to handle kernel NULL pointer dereference in
> > xfs_extent_free_diff_items in v6.4-rc3:
> > 
> > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > 
> > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > "
> > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > "
> > 
> > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > 
> > v6.4-rc3 reproduced info:

Diagnosis and patches welcomed.

Or are we doing the usual syzbot bullshit where you all assume that I'm
going to do all the fucking work for you?

--D


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-22 16:05   ` Darrick J. Wong
@ 2023-05-22 17:05     ` Linux regression tracking (Thorsten Leemhuis)
  2023-05-23  6:08       ` Bagas Sanjaya
  2023-05-23  0:00     ` Eric Biggers
  1 sibling, 1 reply; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-05-22 17:05 UTC (permalink / raw)
  To: Darrick J. Wong, Bagas Sanjaya
  Cc: Pengfei Xu, linux-xfs, linux-fsdevel, heng.su, dchinner, lkp,
	Linux Regressions

On 22.05.23 18:05, Darrick J. Wong wrote:
> On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
>> On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
>>> Greeting!
>>> There is BUG: unable to handle kernel NULL pointer dereference in
>>> xfs_extent_free_diff_items in v6.4-rc3:
>>>
>>> Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
>>>
>>> Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
>>> "
>>> f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
>>> "
>>>
>>> report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
>>> Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
>>> Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
>>> Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
>>> Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
>>> Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
>>>
>>> v6.4-rc3 reproduced info:
> 
> Diagnosis and patches welcomed.
> 
> Or are we doing the usual syzbot bullshit where you all assume that I'm
> going to do all the fucking work for you?

Darrick, sorry for the trouble. Bagas recently out of the blue started
to help with adding regressions to the tracking. That's great, but OTOH
it means that it's likely time to write a few things up that are obvious
to some of us and myself.

Bagas, for the foreseeable future please don't add regressions found by
syzkaller to the regression tracking unless a well-known developer has
actually looked into the issue and indicated that it's something that
needs to be fixed.

Syzbot is great. But it occasionally does odd things or goes off the
rails. And it can easily find problems that didn't happen in an earlier
version, but are unlikely to be encountered by users in practice (aka
"in the wild"). And we normally don't consider those regressions that
need to be fixed.

Ciao, Thorsten

#regzbot inconclusive: syzbot regression that would need further analysis


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-22 16:05   ` Darrick J. Wong
  2023-05-22 17:05     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-05-23  0:00     ` Eric Biggers
  2023-05-23  7:31       ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2023-05-23  0:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Bagas Sanjaya, Pengfei Xu, linux-xfs, linux-fsdevel, heng.su,
	dchinner, lkp, Linux Regressions

On Mon, May 22, 2023 at 09:05:25AM -0700, Darrick J. Wong wrote:
> On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> > On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > > Hi Darrick,
> > > 
> > > Greeting!
> > > There is BUG: unable to handle kernel NULL pointer dereference in
> > > xfs_extent_free_diff_items in v6.4-rc3:
> > > 
> > > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > > 
> > > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > > "
> > > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > > "
> > > 
> > > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > > 
> > > v6.4-rc3 reproduced info:
> 
> Diagnosis and patches welcomed.
> 
> Or are we doing the usual syzbot bullshit where you all assume that I'm
> going to do all the fucking work for you?
> 

It looks like Pengfei already took the time to manually bisect this issue to a
very recent commit authored by you.  Is that not helpful?

(Apologies if I didn't include enough profanities for this email to be suitable
for linux-xfs@.)

- Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-22 17:05     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-05-23  6:08       ` Bagas Sanjaya
  2023-05-23  6:44         ` Pengfei Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Bagas Sanjaya @ 2023-05-23  6:08 UTC (permalink / raw)
  To: Linux regressions mailing list, Darrick J. Wong
  Cc: Pengfei Xu, linux-xfs, linux-fsdevel, heng.su, dchinner, lkp

On 5/23/23 00:05, Linux regression tracking (Thorsten Leemhuis) wrote:
> Darrick, sorry for the trouble. Bagas recently out of the blue started
> to help with adding regressions to the tracking. That's great, but OTOH
> it means that it's likely time to write a few things up that are obvious
> to some of us and myself.
> 
> Bagas, please for the foreseeable future don't add regressions found by
> syzkaller to the regression tracking, unless some well known developer
> actually looked into the issue and indicated that it's something that
> needs to be fixed.
> 
> Syzbot is great. But it occasionally does odd things or goes off the
> rails. And it can easily find problems that didn't happen in an earlier
> version, but are unlikely to be encountered by users in practice (aka
> "in the wild"). And we normally don't consider those regressions that
> need to be fixed.
> 

Oops, at the moment I didn't know how to distinguish true regressions
and issues found by the bot, so I thought that both are regressions.

Thanks for the tip!

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23  6:08       ` Bagas Sanjaya
@ 2023-05-23  6:44         ` Pengfei Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Pengfei Xu @ 2023-05-23  6:44 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Linux regressions mailing list, Darrick J. Wong, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp

Hi Bagas Sanjaya,

On 2023-05-23 at 13:08:32 +0700, Bagas Sanjaya wrote:
> On 5/23/23 00:05, Linux regression tracking (Thorsten Leemhuis) wrote:
> > Darrick, sorry for the trouble. Bagas recently out of the blue started
> > to help with adding regressions to the tracking. That's great, but OTOH
> > it means that it's likely time to write a few things up that are obvious
> > to some of us and myself.
> > 
> > Bagas, please for the foreseeable future don't add regressions found by
> > syzkaller to the regression tracking, unless some well known developer
> > actually looked into the issue and indicated that it's something that
> > needs to be fixed.
> > 
> > Syzbot is great. But it occasionally does odd things or goes off the
> > rails. And it can easily find problems that didn't happen in an earlier
> > version, but are unlikely to be encountered by users in practice (aka
> > "in the wild"). And we normally don't consider those regressions that
> > need to be fixed.
> > 
> 
> Oops, at the moment I didn't know how to distinguish true regressions
> and issues found by the bot, so I thought that both are regressions.
> 
> Thanks for the tip!
> 
  The bisect used the keyword "xfs_extent_free_diff_items" to classify each
  step, and it seems it was not accurate this time. I will consider improving it.

  Thanks!
  BR.

> -- 
> An old man doll... just what I always wanted! - Clara
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23  0:00     ` Eric Biggers
@ 2023-05-23  7:31       ` Dave Chinner
  2023-05-23  9:14         ` Pengfei Xu
  2023-05-23 16:50         ` Eric Biggers
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2023-05-23  7:31 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J. Wong, Bagas Sanjaya, Pengfei Xu, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On Tue, May 23, 2023 at 12:00:29AM +0000, Eric Biggers wrote:
> On Mon, May 22, 2023 at 09:05:25AM -0700, Darrick J. Wong wrote:
> > On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> > > On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > > > Hi Darrick,
> > > > 
> > > > Greeting!
> > > > There is BUG: unable to handle kernel NULL pointer dereference in
> > > > xfs_extent_free_diff_items in v6.4-rc3:
> > > > 
> > > > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > > > 
> > > > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > > > "
> > > > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > > > "
> > > > 
> > > > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > > > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > > > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > > > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > > > 
> > > > v6.4-rc3 reproduced info:
> > 
> > Diagnosis and patches welcomed.
> > 
> > Or are we doing the usual syzbot bullshit where you all assume that I'm
> > going to do all the fucking work for you?
> > 
> 
> It looks like Pengfei already took the time to manually bisect this issue to a
> very recent commit authored by you.  Is that not helpful?

No. The bisect is completely meaningless.

The cause of the problem is going to be some piece of corrupted
metadata that has got through a verifier check or log recovery and has
resulted in a perag lookup failing. The bisect landed on the commit
where the perag dependency was introduced; whatever is letting
unchecked corrupted metadata through the verifiers has existed long
before this recent change was made.

I've already spent two hours analysing this report - I've got to the
point where I've isolated the transaction in the trace, I see the
allocation being run as expected, I see all the right things
happening, and then it goes splat after the allocation has committed
and it starts processing deferred extent free operations. Neither the
code nor the trace actually tell me anything about the nature of the
failure that has occurred.

At this point, I still don't know where the corrupted metadata is
coming from. That's the next thing I need to look at, and then I
realised that this bug report *doesn't include a pointer to the
corrupted filesystem image that is being mounted*.

IOWs, the bug report is deficient and not complete, and so I'm
forced to spend unnecessary time trying to work out how to extract
the filesystem image from a weird syzkaller report that is basically
just a bunch of undocumented blobs in a github tree.

This is the same sort of shit we've been having to deal with right from
the start with syzkaller. It doesn't matter that syzbot might have
improved its reporting a bit these days, we still have to deal with
this sort of poor reporting from all the private syzkaller bot crank
handles that are being turned by people who know little more than
how to turn a crank handle.

To make matters worse, this is a v4 filesystem which has known
unfixable issues when handling corrupted filesystems in both log
replay and in runtime detection of corruption. We've repeatedly told
people running syzkaller (including Pengfei) to stop running it on
v4 filesystems and only report bugs on V5 format filesystems. This
is to avoid wasting time triaging these problems back down to v4
specific format bugs that can only be fixed by moving to the v5
format.

.....

And now after 4 hours, I have found several corruptions in the on
disk format that v5 filesystems will have caught and v4 filesystems
will not.

The AGFL indexes in the AGF have been corrupted. They are within
valid bounds, but last - first != count. On a V5 filesystem we catch
this and trigger an AGFL reset that is done on the first allocation.
v4 filesystems do not do this last - first = count validation at
all.
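
For readers unfamiliar with the AGFL, the index/count consistency check
described above can be sketched roughly as follows. This is a simplified
user-space illustration, not the kernel code: the function names are
hypothetical, and the exact arithmetic (the +1 and the wrap-around case
for a circular list) is an assumption of this sketch.

```python
def agfl_active_count(first: int, last: int, count: int, agfl_size: int) -> int:
    """Slots that the first/last indexes imply are in use (circular list)."""
    if count == 0:
        return 0
    if last >= first:
        # No wrap-around: the active range is first..last inclusive.
        return last - first + 1
    # The active range wraps past the end of the array.
    return agfl_size - first + last + 1


def agfl_needs_reset(first: int, last: int, count: int, agfl_size: int) -> bool:
    """True when the indexes and the recorded count disagree."""
    # Indexes or a count that fall outside the array are always inconsistent.
    if first >= agfl_size or last >= agfl_size or count > agfl_size:
        return True
    return agfl_active_count(first, last, count, agfl_size) != count
```

With a 128-slot list, agfl_needs_reset(2, 5, 7, 128) reports an
inconsistency even though every individual value is within bounds, which
is the kind of corruption described above.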

Further, the AGFL has also been corrupted - it is full of null
blocks. This is another problem that V5 filesystems can catch and
report, but v4 filesystems don't because they don't have headers in
the AGFL that enable verification.

Yes, there's definitely scope for further improvements in validation
here, but the unhandled corruptions that I've found still don't
explain how we got a null perag in the xefi created from a
referenced perag that is causing the crash.

So, yeah, the bisect is completely useless, and I've got half a day
into triage and I still don't have any clue what the corruption is
that is causing the kernel to crash....

----

Do you see the problem now, Eric?

Performing root-cause analysis of syzkaller based malicious
filesystem corruption bugs is anything but simple. It takes hours to
days just to work through triage of a single bug report, and we're
getting a couple of these sorts of bugs reported every week.

People who do nothing but turn the bot crank handle and throw stuff like
this over the wall at us are easy to find. Bots and bot crank
turners scale really easily. Engineers who can find and fix the
problems, OTOH, don't.

And just to rub salt into the wounds, we now have people who turn
crank handles on other bots to tell everyone else how important
they think the problem is without having performed any triage at
all. And then we're expected to make an untriaged bug report our
highest priority and immediately spend hours of time to make sense
of the steaming pile that has just been dumped on us.

Worse, we've had people who track regressions imply that if we don't
prioritise fixing regressions ahead of anything else we might be
working on, then we might not get new work merged until the
regressions have been fixed. In my book, that's akin to extortion,
and it might give you some insight to why Darrick reacted so
vigorously to having an untriaged syzkaller bug tracked as a high
visibility, must fix regression.

What we really need is more people who are capable of triaging bug
reports like this instead of having lots of people cranking on bot
handles and dumping untriaged bug reports on the maintainer.
Further, if you aren't capable of triaging the bug report, then you
aren't qualified to classify it as a "must fix" regression.

It's like people don't have any common sense or decency anymore:
it's not very nice to classify a bug as a "must fix" regression
without first having consulted the engineers responsible for that
code. If you don't know what the cause of the bug is, then don't
crank handles that cause people to have to address it immediately!

If nothing changes, then the ever increasing amount of bot cranking
is going to burn us out completely. Nobody wins when that
happens....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23  7:31       ` Dave Chinner
@ 2023-05-23  9:14         ` Pengfei Xu
  2023-05-23 21:52           ` Dave Chinner
  2023-05-23 16:50         ` Eric Biggers
  1 sibling, 1 reply; 14+ messages in thread
From: Pengfei Xu @ 2023-05-23  9:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Darrick J. Wong, Bagas Sanjaya, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

Hi Dave,

On 2023-05-23 at 17:31:23 +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 12:00:29AM +0000, Eric Biggers wrote:
> > On Mon, May 22, 2023 at 09:05:25AM -0700, Darrick J. Wong wrote:
> > > On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> > > > On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > > > > Hi Darrick,
> > > > > 
> > > > > Greeting!
> > > > > There is BUG: unable to handle kernel NULL pointer dereference in
> > > > > xfs_extent_free_diff_items in v6.4-rc3:
> > > > > 
> > > > > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > > > > 
> > > > > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > > > > "
> > > > > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > > > > "
> > > > > 
> > > > > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > > > > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > > > > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > > > > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > > > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > > > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > > > > 
> > > > > v6.4-rc3 reproduced info:
> > > 
> > > Diagnosis and patches welcomed.
> > > 
> > > Or are we doing the usual syzbot bullshit where you all assume that I'm
> > > going to do all the fucking work for you?
> > > 
> > 
> > It looks like Pengfei already took the time to manually bisect this issue to a
> > very recent commit authored by you.  Is that not helpful?
> 
> No. The bisect is completely meaningless.
> 
> The cause of the problem is going to be some piece of corrupted
> metadata that has got through a verifier check or log recovery and has
> resulted in a perag lookup failing. The bisect landed on the commit
> where the perag dependency was introduced; whatever is letting
> unchecked corrupted metadata through the verifiers has existed long
> before this recent change was made.
> 
> I've already spent two hours analysing this report - I've got to the
> point where I've isolated the transaction in the trace, I see the
> allocation being run as expected, I see all the right things
> happening, and then it goes splat after the allocation has committed
> and it starts processing deferred extent free operations. Neither the
> code nor the trace actually tell me anything about the nature of the
> failure that has occurred.
> 
> At this point, I still don't know where the corrupted metadata is
> coming from. That's the next thing I need to look at, and then I
> realised that this bug report *doesn't include a pointer to the
> corrupted filesystem image that is being mounted*.
> 
> IOWs, the bug report is deficient and not complete, and so I'm
> forced to spend unnecessary time trying to work out how to extract
> the filesystem image from a weird syzkaller report that is basically
> just a bunch of undocumented blobs in a github tree.
> 
> This is the same sort of shit we've been having to deal with right from
> the start with syzkaller. It doesn't matter that syzbot might have
> improved its reporting a bit these days, we still have to deal with
> this sort of poor reporting from all the private syzkaller bot crank
> handles that are being turned by people who know little more than
> how to turn a crank handle.
> 
> To make matters worse, this is a v4 filesystem which has known
> unfixable issues when handling corrupted filesystems in both log
> replay and in runtime detection of corruption. We've repeatedly told
> people running syzkaller (including Pengfei) to stop running it on
> v4 filesystems and only report bugs on V5 format filesystems. This
> is to avoid wasting time triaging these problems back down to v4
> specific format bugs that can only be fixed by moving to the v5
> format.
> 
> .....
> 
> And now after 4 hours, I have found several corruptions in the on
> disk format that v5 filesystems will have caught and v4 filesystems
> will not.
> 
> The AGFL indexes in the AGF have been corrupted. They are within
> valid bounds, but last - first != count. On a V5 filesystem we catch
> this and trigger an AGFL reset that is done on the first allocation.
> v4 filesystems do not do this last - first = count validation at
> all.
> 
> Further, the AGFL has also been corrupted - it is full of null
> blocks. This is another problem that V5 filesystems can catch and
> report, but v4 filesystems don't because they don't have headers in
> the AGFL that enable verification.
> 
> Yes, there's definitely scope for further improvements in validation
> here, but the unhandled corruptions that I've found still don't
> explain how we got a null perag in the xefi created from a
> referenced perag that is causing the crash.
> 
> So, yeah, the bisect is completely useless, and I've got half a day
> into triage and I still don't have any clue what the corruption is
> that is causing the kernel to crash....
> 
> ----
> 
> Do you see the problem now, Eric?
> 
> Performing root-cause analysis of syzkaller based malicious
> filesystem corruption bugs is anything but simple. It takes hours to
> days just to work through triage of a single bug report, and we're
> getting a couple of these sorts of bugs reported every week.
> 
> People who do nothing but turn the bot crank handle and throw stuff like
> this over the wall at us are easy to find. Bots and bot crank
> turners scale really easily. Engineers who can find and fix the
> problems, OTOH, don't.
> 
> And just to rub salt into the wounds, we now have people who turn
> crank handles on other bots to tell everyone else how important
> they think the problem is without having performed any triage at
> all. And then we're expected to make an untriaged bug report our
> highest priority and immediately spend hours of time to make sense
> of the steaming pile that has just been dumped on us.
> 
> Worse, we've had people who track regressions imply that if we don't
> prioritise fixing regressions ahead of anything else we might be
> working on, then we might not get new work merged until the
> regressions have been fixed. In my book, that's akin to extortion,
> and it might give you some insight to why Darrick reacted so
> vigorously to having an untriaged syzkaller bug tracked as a high
> visibility, must fix regression.
> 
> What we really need is more people who are capable of triaging bug
> reports like this instead of having lots of people cranking on bot
> handles and dumping untriaged bug reports on the maintainer.
> Further, if you aren't capable of triaging the bug report, then you
> aren't qualified to classify it as a "must fix" regression.
> 
> It's like people don't have any common sense or decency anymore:
> it's not very nice to classify a bug as a "must fix" regression
> without first having consulted the engineers responsible for that
> code. If you don't know what the cause of the bug is, then don't
> crank handles that cause people to have to address it immediately!
> 
> If nothing changes, then the ever increasing amount of bot cranking
> is going to burn us out completely. Nobody wins when that
> happens....
> 
  I did not do well on two points, which led to this useless bisect info:
  1. I should double check "V4 Filesystem" related issues carefully, and
     should give the reason for the problem.
  2. I should double check the bisect bad and good dmesg info. This time the
     "good" (actually not good) dmesg also contained "BUG" related
     messages, but it did not contain the keyword
     "xfs_extent_free_diff_items", which gave the wrong bisect info.
     Sorry for the inconvenience...

     I will fix the above two problems thoroughly a.s.a.p., and
     I won't report useless bisection information caused by them.
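
  One possible way to avoid counting a crashing kernel as "good" is to
  classify each bisect step into three outcomes instead of two. The sketch
  below is only an illustration under assumptions: the function name and
  the list of crash markers are hypothetical and not taken from the actual
  bisect scripts. The return values follow the exit-code convention of
  `git bisect run` (0 = good, 125 = skip, other codes from 1 to 127 = bad).

```python
import re

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125

# Assumed markers of "some kernel crash"; extend as needed.
CRASH_MARKERS = re.compile(
    r"BUG:|Oops|general protection fault|Kernel panic|KASAN:"
)


def classify_dmesg(dmesg: str, keyword: str) -> int:
    """Classify one bisect step from its dmesg text."""
    if keyword in dmesg:
        # The crash being bisected was reproduced.
        return BAD
    if CRASH_MARKERS.search(dmesg):
        # A different crash: this step is neither good nor bad for this bug.
        return SKIP
    return GOOD
```

  Whether a different crash should count as "skip" or as "bad" is a
  judgment call; the point is only that it should not count as "good".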

  Thanks!
  BR.

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23  7:31       ` Dave Chinner
  2023-05-23  9:14         ` Pengfei Xu
@ 2023-05-23 16:50         ` Eric Biggers
  2023-05-23 22:16           ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2023-05-23 16:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Bagas Sanjaya, Pengfei Xu, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

Hi Dave,

On Tue, May 23, 2023 at 05:31:23PM +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 12:00:29AM +0000, Eric Biggers wrote:
> > On Mon, May 22, 2023 at 09:05:25AM -0700, Darrick J. Wong wrote:
> > > On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> > > > On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > > > > Hi Darrick,
> > > > > 
> > > > > Greeting!
> > > > > There is BUG: unable to handle kernel NULL pointer dereference in
> > > > > xfs_extent_free_diff_items in v6.4-rc3:
> > > > > 
> > > > > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > > > > 
> > > > > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > > > > "
> > > > > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > > > > "
> > > > > 
> > > > > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > > > > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > > > > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > > > > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > > > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > > > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > > > > 
> > > > > v6.4-rc3 reproduced info:
> > > 
> > > Diagnosis and patches welcomed.
> > > 
> > > Or are we doing the usual syzbot bullshit where you all assume that I'm
> > > going to do all the fucking work for you?
> > > 
> > 
> > It looks like Pengfei already took the time to manually bisect this issue to a
> > very recent commit authored by you.  Is that not helpful?
> 
> No. The bisect is completely meaningless.
> 
> The cause of the problem is going to be some piece of corrupted
> metadata that has got through a verifier check or log recovery and has
> resulted in a perag lookup failing. The bisect landed on the commit
> where the perag dependency was introduced; whatever is letting
> unchecked corrupted metadata through the verifiers has existed long
> before this recent change was made.
> 
> I've already spent two hours analysing this report - I've got to the
> point where I've isolated the transaction in the trace, I see the
> allocation being run as expected, I see all the right things
> happening, and then it goes splat after the allocation has committed
> and it starts processing deferred extent free operations. Neither the
> code nor the trace actually tell me anything about the nature of the
> failure that has occurred.
> 
> At this point, I still don't know where the corrupted metadata is
> coming from. That's the next thing I need to look at, and then I
> realised that this bug report *doesn't include a pointer to the
> corrupted filesystem image that is being mounted*.
> 
> IOWs, the bug report is deficient and not complete, and so I'm
> forced to spend unnecessary time trying to work out how to extract
> the filesystem image from a weird syzkaller report that is basically
> just a bunch of undocumented blobs in a github tree.
> 
> This is the same sort of shit we've been having to deal with right from
> the start with syzkaller. It doesn't matter that syzbot might have
> improved its reporting a bit these days, we still have to deal with
> this sort of poor reporting from all the private syzkaller bot crank
> handles that are being turned by people who know little more than
> how to turn a crank handle.
> 
> To make matters worse, this is a v4 filesystem which has known
> unfixable issues when handling corrupted filesystems in both log
> replay and in runtime detection of corruption. We've repeatedly told
> people running syzkaller (including Pengfei) to stop running it on
> v4 filesystems and only report bugs on V5 format filesystems. This
> is to avoid wasting time triaging these problems back down to v4
> specific format bugs that can only be fixed by moving to the v5
> format.
> 
> .....
> 
> And now after 4 hours, I have found several corruptions in the on
> disk format that v5 filesystems will have caught and v4 filesystems
> will not.
> 
> The AGFL indexes in the AGF have been corrupted. They are within
> valid bounds, but last - first != count. On a V5 filesystem we catch
> this and trigger an AGFL reset that is done on the first allocation.
> v4 filesystems do not do this last - first = count validation at
> all.
> 
> Further, the AGFL has also been corrupted - it is full of null
> blocks. This is another problem that V5 filesystems can catch and
> report, but v4 filesystems don't because they don't have headers in
> the AGFL that enable verification.
> 
> Yes, there's definitely scope for further improvements in validation
> here, but the unhandled corruptions that I've found still don't
> explain how we got a null perag in the xefi created from a
> referenced perag that is causing the crash.
> 
> So, yeah, the bisect is completely useless, and I've got half a day
> into triage and I still don't have any clue what the corruption is
> that is causing the kernel to crash....
> 
> ----
> 
> Do you see the problem now, Eric?
> 
> Performing root-cause analysis of syzkaller based malicious
> filesystem corruption bugs is anything but simple. It takes hours to
> days just to work through triage of a single bug report, and we're
> getting a couple of these sorts of bugs reported every week.
> 
> People who do nothing but turn the bot crank handle and throw stuff like
> this over the wall at us are easy to find. Bots and bot crank
> turners scale really easily. Engineers who can find and fix the
> problems, OTOH, don't.
> 
> And just to rub salt into the wounds, we now have people who turn
> crank handles on other bots to tell everyone else how important
> they think the problem is without having performed any triage at
> all. And then we're expected to make an untriaged bug report our
> highest priority and immediately spend hours of time to make sense
> of the steaming pile that has just been dumped on us.
> 
> Worse, we've had people who track regressions imply that if we don't
> prioritise fixing regressions ahead of anything else we might be
> working on, then we might not get new work merged until the
> regressions have been fixed. In my book, that's akin to extortion,
> and it might give you some insight to why Darrick reacted so
> vigorously to having an untriaged syzkaller bug tracked as a high
> visibility, must fix regression.
> 
> What we really need is more people who are capable of triaging bug
> reports like this instead of having lots of people cranking on bot
> handles and dumping untriaged bug reports on the maintainer.
> Further, if you aren't capable of triaging the bug report, then you
> aren't qualified to classify it as a "must fix" regression.
> 
> It's like people don't have any common sense or decency anymore:
> it's not very nice to classify a bug as a "must fix" regression
> without first having consulted the engineers responsible for that
> code. If you don't know what the cause of the bug is, then don't
> crank handles that cause people to have to address it immediately!
> 
> If nothing changes, then the ever increasing amount of bot cranking
> is going to burn us out completely. Nobody wins when that
> happens....
> 

Thanks for the explanation.  I personally didn't need such a long explanation,
but it should be helpful for Pengfei and others.

I was mostly just concerned that a report of a bug with a reproducer and a
bisection to a recent commit just got a response from the maintainer with
profanities and no helpful information.  ("It's like people don't have any
common sense or decency anymore...")  I think that makes it hard for people like
Pengfei to understand what they did wrong, especially when they might have
gotten very different responses from other kernel subsystems.  So, thank you for
providing a more detailed explanation, though honestly, something much shorter
would have sufficed.  (Maybe you should even have a write-up on the XFS wiki or
in Documentation/ that you point to whenever this sort of thing comes up?)

BTW, given that XFS has a policy of not fixing bugs in XFS v4 filesystems, I
suggest adding a kconfig option that disables the support for mounting XFS v4
filesystems.  Then you could just tell the people fuzzing XFS filesystem images
that they need to use that option.  That would save everyone a lot of time.
(To be clear, I'm not arguing for the XFS policy on v4 filesystems being right
or wrong; that's really not something I'd like to get into again...  I'm just
saying that if that's indeed your policy, this is what you should do.)

- Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23  9:14         ` Pengfei Xu
@ 2023-05-23 21:52           ` Dave Chinner
  2023-05-24  2:20             ` Pengfei Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-05-23 21:52 UTC (permalink / raw)
  To: Pengfei Xu
  Cc: Eric Biggers, Darrick J. Wong, Bagas Sanjaya, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On Tue, May 23, 2023 at 05:14:24PM +0800, Pengfei Xu wrote:
>   I did not do well on two points, which led to this useless bisect info:
>   1. I should double check "V4 Filesystem" related issues carefully, and
>      should give the reason for the problem.
>   2. I should double check the bisect bad and good dmesg info. This time the
>      "good" (actually not good) dmesg also contained "BUG" related
>      messages, but it did not contain the keyword
>      "xfs_extent_free_diff_items", which gave the wrong bisect info.
>      Sorry for the inconvenience...

I think you misunderstand.

The bisect you did was correct - the commit it
identified certainly does expose the underlying issue.

The reason the bisect, while correct, is actually useless is that
the underlying issue that the commit tripped over is not caused by
the change in the commit. The underlying issue has been there for a
long while - probably a decade - and it's that old, underlying issue
that has caused the new code to fail.

IOWs, the problem is not the new code (i.e. it is not a regression
in the new code identified by the bisect), the problem is in other
code that has been silently propagating undetected corruption for
years. Hence the bisect is not actually useful in diagnosing the
root cause of the problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23 16:50         ` Eric Biggers
@ 2023-05-23 22:16           ` Dave Chinner
  2023-05-23 23:46             ` Eric Biggers
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-05-23 22:16 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J. Wong, Bagas Sanjaya, Pengfei Xu, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On Tue, May 23, 2023 at 04:50:44PM +0000, Eric Biggers wrote:
> On Tue, May 23, 2023 at 05:31:23PM +1000, Dave Chinner wrote:
> > On Tue, May 23, 2023 at 12:00:29AM +0000, Eric Biggers wrote:
> > > On Mon, May 22, 2023 at 09:05:25AM -0700, Darrick J. Wong wrote:
> > > > On Mon, May 22, 2023 at 01:39:27PM +0700, Bagas Sanjaya wrote:
> > > > > On Mon, May 22, 2023 at 10:07:28AM +0800, Pengfei Xu wrote:
> > > > > > Hi Darrick,
> > > > > > 
> > > > > > Greeting!
> > > > > > There is BUG: unable to handle kernel NULL pointer dereference in
> > > > > > xfs_extent_free_diff_items in v6.4-rc3:
> > > > > > 
> > > > > > Above issue could be reproduced in v6.4-rc3 and v6.4-rc2 kernel in guest.
> > > > > > 
> > > > > > Bisected this issue between v6.4-rc2 and v5.11, found the problem commit is:
> > > > > > "
> > > > > > f6b384631e1e xfs: give xfs_extfree_intent its own perag reference
> > > > > > "
> > > > > > 
> > > > > > report0, repro.stat and so on detailed info is link: https://github.com/xupengfe/syzkaller_logs/tree/main/230521_043336_xfs_extent_free_diff_items
> > > > > > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.c
> > > > > > Syzkaller reproduced prog: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/repro.prog
> > > > > > Kconfig: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/kconfig_origin
> > > > > > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/bisect_info.log
> > > > > > Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/230521_043336_xfs_extent_free_diff_items/v6.4-rc3_reproduce_dmesg.log
> > > > > > 
> > > > > > v6.4-rc3 reproduced info:
> > > > 
> > > > Diagnosis and patches welcomed.
> > > > 
> > > > Or are we doing the usual syzbot bullshit where you all assume that I'm
> > > > going to do all the fucking work for you?
> > > > 
> > > 
> > > It looks like Pengfei already took the time to manually bisect this issue to a
> > > very recent commit authored by you.  Is that not helpful?
> > 
> > No. The bisect is completely meaningless.
> > 
> > The cause of the problem is going to be some piece of corrupted
> > metadata that has got through a verifier check or log recovery and
> > has resulted in a perag lookup failing. The bisect landed on the
> > commit where the perag dependency was introduced; whatever is letting
> > unchecked corrupted metadata through the verifiers has existed long
> > before this recent change was made.
> > 
> > I've already spent two hours analysing this report - I've got to the
> > point where I've isolated the transaction in the trace, I see the
> > allocation being run as expected, I see all the right things
> > happening, and then it goes splat after the allocation has committed
> > and it starts processing deferred extent free operations. Neither the
> > code nor the trace actually tell me anything about the nature of the
> > failure that has occurred.
> > 
> > At this point, I still don't know where the corrupted metadata is
> > coming from. That's the next thing I need to look at, and then I
> > realised that this bug report *doesn't include a pointer to the
> > corrupted filesystem image that is being mounted*.
> > 
> > IOWs, the bug report is deficient and not complete, and so I'm
> > forced to spend unnecessary time trying to work out how to extract
> > the filesystem image from a weird syzkaller report that is basically
> > just a bunch of undocumented blobs in a github tree.
> > 
> > This is the same sort of shit we've had to deal with right from
> > the start with syzkaller. It doesn't matter that syzbot might have
> > improved its reporting a bit these days, we still have to deal with
> > this sort of poor reporting from all the private syzkaller bot crank
> > handles that are being turned by people who know little more than
> > how to turn a crank handle.
> > 
> > To make matters worse, this is a v4 filesystem which has known
> > unfixable issues when handling corrupted filesystems in both log
> > replay and in runtime detection of corruption. We've repeatedly told
> > people running syzkaller (including Pengfei) to stop running it on
> > v4 filesystems and only report bugs on V5 format filesystems. This
> > is to avoid wasting time triaging these problems back down to v4
> > specific format bugs that can only be fixed by moving to the v5
> > format.
> > 
> > .....
> > 
> > And now after 4 hours, I have found several corruptions in the on
> > disk format that v5 filesystems will have caught and v4 filesystems
> > will not.
> > 
> > The AGFL indexes in the AGF have been corrupted. They are within
> > valid bounds, but last - first + 1 != count. On a V5 filesystem we
> > catch this and trigger an AGFL reset that is done on the first
> > allocation. v4 filesystems do not do this last - first + 1 == count
> > validation at all.
> > 
> > Further, the AGFL has also been corrupted - it is full of null
> > blocks. This is another problem that V5 filesystems can catch and
> > report, but v4 filesystems don't because they don't have headers in
> > the AGFL that enable verification.
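
As a rough illustration of the index check described in the quoted
text, here is a minimal sketch in C. It assumes the free list is a
circular array with inclusive first/last indexes, so the active range
may wrap around the end. The struct, the function name and the
agfl_size parameter are hypothetical stand-ins for illustration, not
the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the free-list fields kept in the AGF. */
struct agfl_indexes {
	uint32_t first;	/* index of the first active slot */
	uint32_t last;	/* index of the last active slot (inclusive) */
	uint32_t count;	/* number of active slots recorded on disk */
};

/*
 * Return true if the recorded count matches the number of slots in the
 * active range [first, last], allowing the range to wrap (assumption:
 * the list is circular with agfl_size slots).
 */
static bool agfl_indexes_consistent(const struct agfl_indexes *a,
				    uint32_t agfl_size)
{
	uint32_t active;

	assert(agfl_size > 0);

	/* Bounds checks: each field on its own must be plausible. */
	if (a->first >= agfl_size || a->last >= agfl_size ||
	    a->count > agfl_size)
		return false;

	if (a->count == 0)
		active = 0;
	else if (a->last >= a->first)
		active = a->last - a->first + 1;
	else
		active = agfl_size - a->first + a->last + 1;

	/* In-bounds indexes can still disagree with the recorded count. */
	return active == a->count;
}
```

The point of the sketch is the last comparison: indexes that each pass
a bounds check can still describe a range that does not match count,
which is the kind of corruption described above.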
> > 
> > Yes, there's definitely scope for further improvements in validation
> > here, but the unhandled corruptions that I've found still don't
> > explain how we got a null perag in the xefi created from a
> > referenced perag that is causing the crash.
> > 
> > So, yeah, the bisect is completely useless, and I've got half a day
> > into triage and I still don't have any clue what the corruption is
> > that is causing the kernel to crash....
> > 
> > ----
> > 
> > Do you see the problem now, Eric?
> > 
> > Performing root-cause analysis of syzkaller based malicious
> > filesystem corruption bugs is anything but simple. It takes hours to
> > days just to work through triage of a single bug report, and we're
> > getting a couple of these sorts of bugs reported every week.
> > 
> > People who do nothing but turn the bot crank handle and throw stuff
> > like this over the wall at us are easy to find. Bots and bot crank
> > turners scale really easily. Engineers who can find and fix the
> > problems, OTOH, don't.
> > 
> > And just to rub salt into the wounds, we now have people who turn
> > crank handles on other bots to tell everyone else how important
> > they think the problem is without having performed any triage at
> > all. And then we're expected to make an untriaged bug report our
> > highest priority and immediately spend hours of time to make sense
> > of the steaming pile that has just been dumped on us.
> > 
> > Worse, we've had people who track regressions imply that if we don't
> > prioritise fixing regressions ahead of anything else we might be
> > working on, then we might not get new work merged until the
> > regressions have been fixed. In my book, that's akin to extortion,
> > and it might give you some insight to why Darrick reacted so
> > vigorously to having an untriaged syzkaller bug tracked as a high
> > visibility, must fix regression.
> > 
> > What we really need is more people who are capable of triaging bug
> > reports like this instead of having lots of people cranking on bot
> > handles and dumping untriaged bug reports on the maintainer.
> > Further, if you aren't capable of triaging the bug report, then you
> > aren't qualified to classify it as a "must fix" regression.
> > 
> > It's like people don't have any common sense or decency anymore:
> > it's not very nice to classify a bug as a "must fix" regression
> > without first having consulted the engineers responsible for that
> > code. If you don't know what the cause of the bug is, then don't
> > crank handles that cause people to have to address it immediately!
> > 
> > If nothing changes, then the ever increasing amount of bot cranking
> > is going to burn us out completely. Nobody wins when that
> > happens....
> > 
> 
> Thanks for the explanation.  I personally didn't need such a long explanation,
> but it should be helpful for Pengfei and others.
> 
> I was mostly just concerned that a report of a bug with a reproducer and a
> bisection to a recent commit just got a response from the maintainer with
> profanities and no helpful information.  ("It's like people don't have any
> common sense or decency anymore...")  I think that makes it hard for people like
> Pengfei to understand what they did wrong, especially when they might have
> gotten very different responses from other kernel subsystems.  So, thank you for
> providing a more detailed explanation, though honestly, something much shorter
> would have sufficed.  (Maybe you should even have a write-up on the XFS wiki or
> in Documentation/ that you point to whenever this sort of thing comes up?)
> 
> BTW, given that XFS has a policy of not fixing bugs in XFS v4 filesystems, I

No, I *did not say that*. That is most definitely not our policy.

I said that *V4 filesystems have known bugs that can only be fixed
by moving to the V5 format*.

Some of the checks that V5 filesystems do can also be done on v4
filesystems. There are some missing checks for both v4 and v5
filesystems. Once I get to the root cause and understand how this
all blew up, I'll fix the problems in common v4/v5 code. I might not
test the v4 code that much, because....

> suggest adding a kconfig option that disables the support for mounting XFS v4
> filesystems.

fs/xfs/Kconfig:

config XFS_SUPPORT_V4
        bool "Support deprecated V4 (crc=0) format"
        depends on XFS_FS
        default y
        help
          The V4 filesystem format lacks certain features that are supported
          by the V5 format, such as metadata checksumming, strengthened
          metadata verification, and the ability to store timestamps past the
          year 2038.  Because of this, the V4 format is deprecated.  All users
          should upgrade by backing up their files, reformatting, and restoring
          from the backup.

          Administrators and users can detect a V4 filesystem by running
          xfs_info against a filesystem mountpoint and checking for a string
          beginning with "crc=".  If the string "crc=0" is found, the
          filesystem is a V4 filesystem.  If no such string is found, please
          upgrade xfsprogs to the latest version and try again.

          This option will become default N in September 2025.  Support for the
          V4 format will be removed entirely in September 2030.  Distributors
          can say N here to withdraw support earlier.

          To continue supporting the old V4 format (crc=0), say Y.
          To close off an attack surface, say N.

This was added almost 3 years ago in mid-2020. We're more than
halfway through the deprecation period, after which we're going to
turn off v4 support by default. At this point, nobody should be using
v4 filesystems in new production systems, and those that are should
be preparing for upstream and distro support to be withdrawn in the
next couple of years...
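
For a fuzzing kernel, a minimal sketch of the resulting .config
fragment with V4 support compiled out might look like this (assuming
XFS is built in rather than as a module; only the second line matters
here):

```
CONFIG_XFS_FS=y
# CONFIG_XFS_SUPPORT_V4 is not set
```

With the option off, a crc=0 image should be refused at mount time
rather than exercised by the fuzzer.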

> Then you could just tell the people fuzzing XFS filesystem images
> that they need to use that option.  That would save everyone a lot of time.
> (To be clear, I'm not arguing for the XFS policy on v4 filesystems being right
> or wrong; that's really not something I'd like to get into again...  I'm just
> saying that if that's indeed your policy, this is what you should do.)

It should be obvious by now that we've already done this. 3 years
ago, in fact. And yet we are still having the same problems. Maybe
this helps you understand the level of frustration we have with all
the people running fuzzing bots out there....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23 22:16           ` Dave Chinner
@ 2023-05-23 23:46             ` Eric Biggers
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Biggers @ 2023-05-23 23:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Bagas Sanjaya, Pengfei Xu, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On Wed, May 24, 2023 at 08:16:58AM +1000, Dave Chinner wrote:
> config XFS_SUPPORT_V4
>         bool "Support deprecated V4 (crc=0) format"
>         depends on XFS_FS
>         default y
>         help
>           The V4 filesystem format lacks certain features that are supported
>           by the V5 format, such as metadata checksumming, strengthened
>           metadata verification, and the ability to store timestamps past the
>           year 2038.  Because of this, the V4 format is deprecated.  All users
>           should upgrade by backing up their files, reformatting, and restoring
>           from the backup.
> 
>           Administrators and users can detect a V4 filesystem by running
>           xfs_info against a filesystem mountpoint and checking for a string
>           beginning with "crc=".  If the string "crc=0" is found, the
>           filesystem is a V4 filesystem.  If no such string is found, please
>           upgrade xfsprogs to the latest version and try again.
> 
>           This option will become default N in September 2025.  Support for the
>           V4 format will be removed entirely in September 2030.  Distributors
>           can say N here to withdraw support earlier.
> 
>           To continue supporting the old V4 format (crc=0), say Y.
>           To close off an attack surface, say N.
> 
> This was added almost 3 years ago in mid-2020. We're more than
> halfway through the deprecation period, after which we're going to
> turn off v4 support by default. At this point, nobody should be using
> v4 filesystems in new production systems, and those that are should
> be preparing for upstream and distro support to be withdrawn in the
> next couple of years...

Great to see that this exists now and there is a specific deprecation plan!

> > Then you could just tell the people fuzzing XFS filesystem images
> > that they need to use that option.  That would save everyone a lot of time.
> > (To be clear, I'm not arguing for the XFS policy on v4 filesystems being right
> > or wrong; that's really not something I'd like to get into again...  I'm just
> > saying that if that's indeed your policy, this is what you should do.)
> 
> It should be obvious by now that we've already done this. 3 years
> ago, in fact. And yet we are still having the same problems. Maybe
> this helps you understand the level of frustration we have with all
> the people running fuzzing bots out there....

I don't see evidence that this actually happened, though perhaps we are not
looking in the same places.  https://lore.kernel.org/all/?q=XFS_SUPPORT_V4
brings up little except the original patch thread, nor did
https://github.com/search?q=XFS_SUPPORT_V4&type=issues find anything.

Anyway, I took 3 minutes to file an issue in the syzkaller repo
(https://github.com/google/syzkaller/issues/3918), so at least this should get
resolved for syzbot soon.

- Eric


* Re: [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3
  2023-05-23 21:52           ` Dave Chinner
@ 2023-05-24  2:20             ` Pengfei Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Pengfei Xu @ 2023-05-24  2:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Darrick J. Wong, Bagas Sanjaya, linux-xfs,
	linux-fsdevel, heng.su, dchinner, lkp, Linux Regressions

On 2023-05-24 at 07:52:59 +1000, Dave Chinner wrote:
> On Tue, May 23, 2023 at 05:14:24PM +0800, Pengfei Xu wrote:
> >   I did not do well in two points, which led to the problem of this useless
> >   bisect info:
> >   1. Should double check "V4 Filesystem" related issue carefully, and should
> >      give reason of problem.
> >   2. Double check the bisect bad and good dmesg info, this time actually
> >      "good(actually not good)" dmesg also contains "BUG" related
> >      dmesg, but it doesn't contain the keyword "xfs_extent_free_diff_items"
> >      dmesg info, and give the wrong bisect info.
> >      Sorry for inconvenience...
> 
> I think you misunderstand.
> 
> The bisect you did was correct - the commit it identified certainly
> does expose the underlying issue.
> 
> The reason the bisect, while correct, is actually useless is that
> the underlying issue that the commit tripped over is not caused by
> the change in the commit. The underlying issue has been there for a
> long while - probably a decade - and it's that old, underlying issue
> that has caused the new code to fail.
> 
> IOWs, the problem is not the new code (i.e. it is not a regression
> in the new code identified by the bisect), the problem is in other
> code that has been silently propagating undetected corruption for
> years. Hence the bisect is not actually useful in diagnosing the
> root cause of the problem.
> 
  Thanks a lot for Dave's description! It's clear.
  Anyway, I will disable "CONFIG_XFS_SUPPORT_V4" in syzkaller fuzzing tests
  next time to avoid the noise.

  Thanks also to Eric Biggers, Bagas Sanjaya and the whole community for the help!

  Thanks!
  BR.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


Thread overview: 14+ messages
2023-05-22  2:07 [Syzkaller & bisect] There is BUG: unable to handle kernel NULL pointer dereference in xfs_extent_free_diff_items in v6.4-rc3 Pengfei Xu
2023-05-22  6:39 ` Bagas Sanjaya
2023-05-22 16:05   ` Darrick J. Wong
2023-05-22 17:05     ` Linux regression tracking (Thorsten Leemhuis)
2023-05-23  6:08       ` Bagas Sanjaya
2023-05-23  6:44         ` Pengfei Xu
2023-05-23  0:00     ` Eric Biggers
2023-05-23  7:31       ` Dave Chinner
2023-05-23  9:14         ` Pengfei Xu
2023-05-23 21:52           ` Dave Chinner
2023-05-24  2:20             ` Pengfei Xu
2023-05-23 16:50         ` Eric Biggers
2023-05-23 22:16           ` Dave Chinner
2023-05-23 23:46             ` Eric Biggers
