* kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
@ 2022-11-23  8:48 Bruno Goncalves
  2022-11-23 13:46 ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Bruno Goncalves @ 2022-11-23  8:48 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe, CKI Project

Hello,

We recently started to hit the following panic when testing the block
tree (for-next branch).

[ 5076.172749] list_add corruption. prev->next should be next
(ffff91cd6f7fa568), but was ffff91c991ca6670. (prev=ffff91c991ca6670).
[ 5076.173863] ------------[ cut here ]------------
[ 5076.174853] kernel BUG at lib/list_debug.c:30!
[ 5076.175523] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 5076.175853] CPU: 15 PID: 16415 Comm: kworker/15:13 Tainted: G
   I        6.1.0-rc6 #1
[ 5076.176799] Hardware name: HP ProLiant DL360p Gen8, BIOS P71 05/24/2019
[ 5076.177198] Workqueue: cgwb_release cgwb_release_workfn
[ 5076.177497] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
[ 5076.177788] Code: f2 48 89 c1 48 89 fe 48 c7 c7 48 d8 76 ad e8 5a
8f fd ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f0 d7 76 ad e8 43
8f fd ff <0f> 0b 48 89 c1 48 c7 c7 98 d7 76 ad e8 32 8f fd ff 0f 0b 48
c7 c7
[ 5076.179173] RSP: 0018:ffffa1c98a6afdb0 EFLAGS: 00010082
[ 5076.179472] RAX: 0000000000000075 RBX: ffff91c991ca6668 RCX: 0000000000000000
[ 5076.180241] RDX: 0000000000000002 RSI: ffffffffad752ad3 RDI: 00000000ffffffff
[ 5076.181069] RBP: ffff91cd6f7fa500 R08: 0000000000000000 R09: ffffa1c98a6afc60
[ 5076.182209] R10: 0000000000000003 R11: ffff91cd7ff42fe8 R12: ffff91cd6f7fa568
[ 5076.183002] R13: ffff91c991ca6670 R14: ffff91c991ca6670 R15: ffff91cd6f7f1440
[ 5076.183902] FS:  0000000000000000(0000) GS:ffff91cd6f7c0000(0000)
knlGS:0000000000000000
[ 5076.184377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5076.185084] CR2: 0000560ff67e11b8 CR3: 000000020d010005 CR4: 00000000000606e0
[ 5076.185945] Call Trace:
[ 5076.186110]  <TASK>
[ 5076.186916]  insert_work+0x46/0xc0
[ 5076.187533]  __queue_work+0x1d4/0x460
[ 5076.187788]  queue_work_on+0x37/0x40
[ 5076.187993]  blkcg_unpin_online+0x1ad/0x1b0
[ 5076.188244]  cgwb_release_workfn+0x6a/0x200
[ 5076.188464]  process_one_work+0x1c7/0x380
[ 5076.188675]  worker_thread+0x4d/0x380
[ 5076.188881]  ? rescuer_thread+0x380/0x380
[ 5076.189089]  kthread+0xe9/0x110
[ 5076.189716]  ? kthread_complete_and_exit+0x20/0x20
[ 5076.190407]  ret_from_fork+0x22/0x30
[ 5076.190677]  </TASK>
[ 5076.190816] Modules linked in: nvme nvme_core nvme_common loop tls
rfkill intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp sunrpc kvm_intel kvm iTCO_wdt iapl
intel_cstate intel_uncore pcspkr lpc_ich ipmi_ssif hpilo tg3 acpi_ipmi
ioatdma ipmi_si ipmi_devintf dca ipmi_msghandler acpi_power_meter fuse
zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni
polyval_generic ghash_clmulni_intel sha512_ssse3 serio_raw hpsa
mgag200 scsi_transport_sas [last unloaded: scsi_debug]
[ 5076.293315] ---[ end trace 0000000000000000 ]---
[ 5076.295226] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
[ 5076.295587] Code: f2 48 89 c1 48 89 fe 48 c7 c7 48 d8 76 ad e8 5a
8f fd ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f0 d7 76 ad e8 43
8f fd ff <0f> 0b 48 89 c1 48 c7 c7 98 d7 76 ad e8 32 8f fd ff 0f 0b 48
c7 c7
[ 5076.296921] RSP: 0018:ffffa1c98a6afdb0 EFLAGS: 00010082
[ 5076.297239] RAX: 0000000000000075 RBX: ffff91c991ca6668 RCX: 0000000000000000
[ 5076.297983] RDX: 0000000000000002 RSI: ffffffffad752ad3 RDI: 00000000ffffffff
[ 5076.298768] RBP: ffff91cd6f7fa500 R08: 0000000000000000 R09: ffffa1c98a6afc60
[ 5076.299525] R10: 0000S:  0000000000000000(0000)
GS:ffff91cd6f7c0000(0000) knlGS:0000000000000000
[ 5076.700351] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5076.701046] CR2: 0000560ff67e11b8 CR3: 000000020d010005 CR4: 00000000000606e0
[ 5076ernel panic - not syncing: Fatal exception
[ 5077.924713] Shutting down cpus with NMI
[ 5077.924986] Kernel Offset: 0x2b000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 5077.927946] ---[ end Kernel panic - not syncing: Fatal exception ]---

It seems to happen frequently, across different tests.
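
For reference, lib/list_debug.c:30 is the prev->next sanity check in
__list_add_valid(), shown here abridged from the v6.1 sources (the NULL
and double-add checks are omitted). With CONFIG_BUG_ON_DATA_CORRUPTION=y,
a failed check ends in the BUG() seen above:

bool __list_add_valid(struct list_head *new, struct list_head *prev,
		      struct list_head *next)
{
	if (CHECK_DATA_CORRUPTION(prev->next != next,
			"list_add corruption. prev->next should be next (%px), but was %px. (prev=%px).\n",
			next, prev->next, prev))
		return false;

	return true;
}

Note that prev->next equals prev itself here (both ffff91c991ca6670):
the list node points back at itself, which typically means the entry
was freed or re-initialized while still on the pending list.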

full console.log:
https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2022/11/21/redhat:700955106/build_x86_64_redhat:700955106_x86_64/tests/1/results_0001/console.log/console.log

kernel tarball:
https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/700955106/publish%20x86_64/3356091217/artifacts/kernel-block-redhat_700955106_x86_64.tar.gz

kernel config: https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/700955106/build%20x86_64/3356091207/artifacts/kernel-block-redhat_700955106_x86_64.config

test logs: https://datawarehouse.cki-project.org/kcidb/tests/6061677

We didn't bisect, but the first commit where we hit the problem was
"f65d92c600fe6eecdbd6e7fab7893c9c094dfcbf
(io_uring-6.1-2022-11-18-2180-gf65d92c600fe)" and the last commit where
we didn't hit the problem was
"40fa774af7fd04d06014ac74947c351649b6f64f
(io_uring-6.1-2022-11-11-1843-g40fa774af7fd)".

cki issue tracker: https://datawarehouse.cki-project.org/issue/1732

Thank you,
Bruno Goncalves



* Re: kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-23  8:48 kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex) Bruno Goncalves
@ 2022-11-23 13:46 ` Jens Axboe
  2022-11-24 14:57   ` Bruno Goncalves
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2022-11-23 13:46 UTC (permalink / raw)
  To: Bruno Goncalves, linux-block; +Cc: CKI Project

On 11/23/22 1:48 AM, Bruno Goncalves wrote:
> Hello,
> 
> We recently started to hit the following panic when testing the block
> tree (for-next branch).
> 
> [...]
> 
> We didn't bisect, but the first commit where we hit the problem was
> "f65d92c600fe6eecdbd6e7fab7893c9c094dfcbf
> (io_uring-6.1-2022-11-18-2180-gf65d92c600fe)" and the last commit where
> we didn't hit the problem was
> "40fa774af7fd04d06014ac74947c351649b6f64f
> (io_uring-6.1-2022-11-11-1843-g40fa774af7fd)".
> 
> cki issue tracker: https://datawarehouse.cki-project.org/issue/1732

Please just try and clone for-6.2/block from the block tree and bisect
it?

-- 
Jens Axboe




* Re: kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-23 13:46 ` Jens Axboe
@ 2022-11-24 14:57   ` Bruno Goncalves
  2022-11-25  8:38     ` Yi Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Bruno Goncalves @ 2022-11-24 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, CKI Project

On Wed, 23 Nov 2022 at 14:46, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 11/23/22 1:48 AM, Bruno Goncalves wrote:
> > [...]
>
> Please just try and clone for-6.2/block from the block tree and bisect
> it?
>

Hi,
I've tried with commit 93c68cc46a070775cc6675e3543dd909eb9f6c9e (drbd:
use consistent license), but I was not able to hit the panic with it.


Bruno




* Re: kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-24 14:57   ` Bruno Goncalves
@ 2022-11-25  8:38     ` Yi Zhang
  2022-11-26 14:29       ` [bisected]kernel " Yi Zhang
  0 siblings, 1 reply; 9+ messages in thread
From: Yi Zhang @ 2022-11-25  8:38 UTC (permalink / raw)
  To: Bruno Goncalves; +Cc: Jens Axboe, linux-block, CKI Project

I reproduced this issue even when the system boots with the latest
linux-block/for-next; I will try to bisect it later.

43f3ae1898c9 (HEAD -> for-next, origin/for-next) Merge branch
'for-6.2/writeback' into for-next
d6798bc243fa writeback: Add asserts for adding freed inode to lists

[   24.183829] list_add corruption. prev->next should be next
(ffff9a1d9f337f68), but was ffff9a1a02119e70. (prev=ffff9a1a02119e70).
[   24.195478] ------------[ cut here ]------------
[   24.200088] kernel BUG at lib/list_debug.c:30!
[   24.204532] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   24.209751] CPU: 4 PID: 167 Comm: kworker/4:1 Not tainted 6.1.0-rc6+ #1
[   24.216365] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
2.8.5 08/18/2022
[   24.223930] Workqueue: cgwb_release cgwb_release_workfn
[   24.229157] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
[   24.234208] Code: f2 4c 89 c1 48 89 fe 48 c7 c7 20 23 65 a8 e8 d2
a2 fe ff 0f 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 c8 22 65 a8 e8 bb
a2 fe ff <0f> 0b 4c 89 c1 48 c7 c7 70 22 65 a8 e8 aa a2 fe ff 0f 0b 48
c7 c7
[   24.252953] RSP: 0018:ffffb035407e7da8 EFLAGS: 00010046
[   24.258172] RAX: 0000000000000075 RBX: ffff9a1a02119e68 RCX: 0000000000000000
[   24.265303] RDX: 0000000000000000 RSI: ffff9a1d9f31f840 RDI: ffff9a1d9f31f840
[   24.272428] RBP: ffff9a1d9f337f00 R08: 0000000000000000 R09: 00000000ffff7fff
[   24.279560] R10: ffffb035407e7c50 R11: ffffffffa8be75e8 R12: ffff9a1d9f337f68
[   24.286683] R13: ffff9a1a02119e70 R14: ffff9a1a02119e70 R15: ffff9a1d9f330340
[   24.293808] FS:  0000000000000000(0000) GS:ffff9a1d9f300000(0000)
knlGS:0000000000000000
[   24.301894] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.307641] CR2: 000055b5f1f28050 CR3: 0000000104e38000 CR4: 0000000000350ee0
[   24.314772] Call Trace:
[   24.314774]  <TASK>
[   24.314774]  insert_work+0x46/0xc0
[   24.314780]  __queue_work+0x1d5/0x380
[   24.326376]  queue_work_on+0x24/0x30
[   24.329955]  blkcg_unpin_online+0x1b5/0x1c0
[   24.334143]  cgwb_release_workfn+0x6a/0x200
[   24.338327]  process_one_work+0x1e5/0x3b0
[   24.342342]  ? rescuer_thread+0x390/0x390
[   24.346352]  worker_thread+0x50/0x3a0
[   24.350019]  ? rescuer_thread+0x390/0x390
[   24.354030]  kthread+0xd9/0x100
[   24.357177]  ? kthread_complete_and_exit+0x20/0x20
[   24.361970]  ret_from_fork+0x22/0x30
[   24.365550]  </TASK>
[   24.367742] Modules linked in: sunrpc intel_rapl_msr
intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd kvm
mgag200 ledtrig_audio rfkill video i2c_algo_bit drm_shmem_helper
dcdbas drm_kms_helper irqbypass dell_smbios rapl dell_wmi_descriptor
wmi_bmof pcspkr syscopyarea acpi_ipmi sysfillrect sysimgblt
fb_sys_fops ipmi_si ipmi_devintf ptdma i2c_piix4 k10temp
ipmi_msghandler acpi_power_meter vfat fat acpi_cpufreq drm fuse xfs
libcrc32c sd_mod sg ahci crct10dif_pclmul crc32_pclmul libahci
crc32c_intel ghash_clmulni_intel mpt3sas nvme tg3 libata nvme_core ccp
raid_class nvme_common t10_pi sp5100_tco scsi_transport_sas wmi
dm_mirror dm_region_hash dm_log dm_mod
[   24.426475] ---[ end trace 0000000000000000 ]---
[   24.505278] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
[   24.510331] Code: f2 4c 89 c1 48 89 fe 48 c7 c7 20 23 65 a8 e8 d2
a2 fe ff 0f 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 c8 22 65 a8 e8 bb
a2 fe ff <0f> 0b 4c 89 c1 48 c7 c7 70 22 65 a8 e8 aa a2 fe ff 0f 0b 48
c7 c7
[   24.510332] RSP: 0018:ffffb035407e7da8 EFLAGS: 00010046
[   24.510333] RAX: 0000000000000075 RBX: ffff9a1a02119e68 RCX: 0000000000000000
[   24.510334] RDX: 0000000000000000 RSI: ffff9a1d9f31f840 RDI: ffff9a1d9f31f840
[   24.510335] RBP: ffff9a1d9f337f00 R08: 0000000000000000 R09: 00000000ffff7fff
[   24.510337] R10: ffffb035407e7c50 R11: ffffffffa8be75e8 R12: ffff9a1d9f337f68
[   24.562805] R13: ffff9a1a02119e70 R14: ffff9a1a02119e70 R15: ffff9a1d9f330340
[   24.569929] FS:  0000000000000000(0000) GS:ffff9a1d9f300000(0000)
knlGS:0000000000000000
[   24.578017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.578018] CR2: 000055b5f1f28050 CR3: 0000000104e38000 CR4: 0000000000350ee0
[   24.578019] Kernel panic - not syncing: Fatal exception
[   24.578653] Kernel Offset: 0x26200000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   24.682013] ---[ end Kernel panic - not syncing: Fatal exception ]---
[   24.339396] r[-- MARK -- Fri Nov 25 06:25:00 2022]


On Thu, Nov 24, 2022 at 11:00 PM Bruno Goncalves <bgoncalv@redhat.com> wrote:
>
> On Wed, 23 Nov 2022 at 14:46, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 11/23/22 1:48 AM, Bruno Goncalves wrote:
> > > [...]
> >
> > Please just try and clone for-6.2/block from the block tree and bisect
> > it?
> >
>
> Hi,
> I've tried with commit 93c68cc46a070775cc6675e3543dd909eb9f6c9e (drbd:
> use consistent license), but I was not able to hit the panic with it.
>
>
> Bruno


-- 
Best Regards,
  Yi Zhang



* Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-25  8:38     ` Yi Zhang
@ 2022-11-26 14:29       ` Yi Zhang
  2022-11-26 15:53         ` Jens Axboe
  2022-11-28 18:55         ` Bart Van Assche
  0 siblings, 2 replies; 9+ messages in thread
From: Yi Zhang @ 2022-11-26 14:29 UTC (permalink / raw)
  To: Jens Axboe, Waiman Long; +Cc: linux-block, CKI Project, Bruno Goncalves

Hi Jens
Sorry for the delay as I couldn't reproduce it with the original
for-6.2/block branch.
Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
bisect it:


951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
Author: Waiman Long <longman@redhat.com>
Date:   Fri Nov 4 20:59:02 2022 -0400

    blk-cgroup: Flush stats at blkgs destruction path

    As noted by Michal, the blkg_iostat_set's in the lockless list
    hold reference to blkg's to protect against their removal. Those
    blkg's hold reference to blkcg. When a cgroup is being destroyed,
    cgroup_rstat_flush() is only called at css_release_work_fn() which is
    called when the blkcg reference count reaches 0. This circular dependency
    will prevent blkcg from being freed until some other events cause
    cgroup_rstat_flush() to be called to flush out the pending blkcg stats.

    To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
    function to flush stats for a given css and cpu and call it at the blkgs
    destruction path, blkcg_destroy_blkgs(), whenever there are still some
    pending stats to be flushed. This will ensure that blkcg reference
    count can reach 0 ASAP.

    Signed-off-by: Waiman Long <longman@redhat.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
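
For context, a minimal sketch of what this commit adds at the blkgs
destruction path, pieced together from the commit message above (not the
exact diff; the per-cpu lockless list head is assumed to be blkcg->lhead,
as introduced earlier in the same series):

/* In blkcg_destroy_blkgs(): flush any per-cpu blkg_iostat_set entries
 * still parked on the lockless lists, so they drop the blkg (and thus
 * blkcg) references they hold and the blkcg refcount can reach zero. */
int cpu;

for_each_possible_cpu(cpu) {
	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);

	if (!llist_empty(lhead))
		cgroup_rstat_css_flush(&blkcg->css, cpu);
}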


On Fri, Nov 25, 2022 at 4:38 PM Yi Zhang <yi.zhang@redhat.com> wrote:
>
> I reproduced this issue even when the system boots with the latest
> linux-block/for-next; I will try to bisect it later.
>
> [...]



-- 
Best Regards,
  Yi Zhang



* Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-26 14:29       ` [bisected]kernel " Yi Zhang
@ 2022-11-26 15:53         ` Jens Axboe
  2022-11-26 22:54           ` Waiman Long
  2022-11-28 18:55         ` Bart Van Assche
  1 sibling, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2022-11-26 15:53 UTC (permalink / raw)
  To: Yi Zhang, Waiman Long; +Cc: linux-block, CKI Project, Bruno Goncalves

On 11/26/22 7:29 AM, Yi Zhang wrote:
> Hi Jens
> Sorry for the delay as I couldn't reproduce it with the original
> for-6.2/block branch.
> Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
> bisect it:
> 
> 
> 951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
> commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
> Author: Waiman Long <longman@redhat.com>
> Date:   Fri Nov 4 20:59:02 2022 -0400
> 
>     blk-cgroup: Flush stats at blkgs destruction path
> 
> [...]

Waiman, let me know if you have an idea what is going on here and can
send in a fix, or if I need to revert this one. From looking at the
lists of commits after these reports came in, I did suspect this
commit. But I don't know enough about this area to render an opinion
on a fix without spending more time on it.

-- 
Jens Axboe




* Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-26 15:53         ` Jens Axboe
@ 2022-11-26 22:54           ` Waiman Long
  2022-11-27  4:13             ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-11-26 22:54 UTC (permalink / raw)
  To: Jens Axboe, Yi Zhang; +Cc: linux-block, CKI Project, Bruno Goncalves


On 11/26/22 10:53, Jens Axboe wrote:
> On 11/26/22 7:29 AM, Yi Zhang wrote:
>> Hi Jens
>> Sorry for the delay as I couldn't reproduce it with the original
>> for-6.2/block branch.
>> Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
>> bisect it:
>>
>>
>> 951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
>> commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
>> Author: Waiman Long <longman@redhat.com>
>> Date:   Fri Nov 4 20:59:02 2022 -0400
>>
>>      blk-cgroup: Flush stats at blkgs destruction path
>>
>> [...]
> Waiman, let me know if you have an idea what is going on here and can
> send in a fix, or if I need to revert this one. From looking at the
> lists of commits after these reports came in, I did suspect this
> commit. But I don't know enough about this area to render an opinion
> on a fix without spending more time on it.
>
Sure. I will take a closer look at that. Will let you know my 
investigation result ASAP.

Thanks,
Longman



* Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-26 22:54           ` Waiman Long
@ 2022-11-27  4:13             ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-11-27  4:13 UTC (permalink / raw)
  To: Jens Axboe, Yi Zhang; +Cc: linux-block, CKI Project, Bruno Goncalves

On 11/26/22 17:54, Waiman Long wrote:
>
> On 11/26/22 10:53, Jens Axboe wrote:
>> On 11/26/22 7:29 AM, Yi Zhang wrote:
>>> Hi Jens
>>> Sorry for the delay as I couldn't reproduce it with the original
>>> for-6.2/block branch.
>>> Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
>>> bisect it:
>>>
>>>
>>> 951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
>>> commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
>>> Author: Waiman Long <longman@redhat.com>
>>> Date:   Fri Nov 4 20:59:02 2022 -0400
>>>
>>>      blk-cgroup: Flush stats at blkgs destruction path
>>>
>>> [...]
>> Waiman, let me know if you have an idea what is going on here and can
>> send in a fix, or if I need to revert this one. From looking at the
>> lists of commits after these reports came in, I did suspect this
>> commit. But I don't know enough about this area to render an opinion
>> on a fix without spending more time on it.
>>
> Sure. I will take a closer look at that. Will let you know my 
> investigation result ASAP.
>
Thanks Yi for giving me access to the system that reproduces the bug. I
found that the panic is fixed by moving the rstat flushing ahead of the
destruction of the blkgs in blkcg_destroy_blkgs(). I will post another
patch later to fix the bug. However, I want to spend a bit more time to
see if I can figure out what caused the panic in the first place.
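
In other words, roughly (a hedged sketch of the reordering described
above, using the same assumed names as the earlier sketch, not the
actual patch):

void blkcg_destroy_blkgs(struct blkcg *blkcg)
{
	int cpu;

	/*
	 * Per the investigation above: flush the pending per-cpu stats
	 * *before* tearing the blkgs down, so the flush never touches
	 * blkgs that are already being destroyed.
	 */
	for_each_possible_cpu(cpu) {
		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);

		if (!llist_empty(lhead))
			cgroup_rstat_css_flush(&blkcg->css, cpu);
	}

	spin_lock_irq(&blkcg->lock);
	/* ... destroy the blkgs on blkcg->blkg_list as before ... */
	spin_unlock_irq(&blkcg->lock);
}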

Cheers,
Longman



* Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)
  2022-11-26 14:29       ` [bisected]kernel " Yi Zhang
  2022-11-26 15:53         ` Jens Axboe
@ 2022-11-28 18:55         ` Bart Van Assche
  1 sibling, 0 replies; 9+ messages in thread
From: Bart Van Assche @ 2022-11-28 18:55 UTC (permalink / raw)
  To: Yi Zhang, Jens Axboe, Waiman Long
  Cc: linux-block, CKI Project, Bruno Goncalves

On 11/26/22 06:29, Yi Zhang wrote:
> Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
> bisect it:
> 
> 
> 951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
> commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
> Author: Waiman Long <longman@redhat.com>
> Date:   Fri Nov 4 20:59:02 2022 -0400
> 
>      blk-cgroup: Flush stats at blkgs destruction path
> 
> [...]

I can confirm this report. If I revert the patch "blk-cgroup: Flush
stats at blkgs destruction path" on top of the block/for-next branch
from last Wednesday, test block/027 passes; with an unmodified
block/for-next branch, it fails systematically.

Bart.


end of thread, newest: 2022-11-28 18:55 UTC

Thread overview: 9 messages
2022-11-23  8:48 kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex) Bruno Goncalves
2022-11-23 13:46 ` Jens Axboe
2022-11-24 14:57   ` Bruno Goncalves
2022-11-25  8:38     ` Yi Zhang
2022-11-26 14:29       ` [bisected]kernel " Yi Zhang
2022-11-26 15:53         ` Jens Axboe
2022-11-26 22:54           ` Waiman Long
2022-11-27  4:13             ` Waiman Long
2022-11-28 18:55         ` Bart Van Assche
