linux-kernel.vger.kernel.org archive mirror
* [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
@ 2020-10-09 22:10 Mikhail Gavrilov
  2021-03-04  8:42 ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Mikhail Gavrilov @ 2020-10-09 22:10 UTC
  To: linux-block, paolo.valente, axboe, Linux List Kernel Mailing

Paolo, Jens, I am sorry for the noise.
But today I hit a kernel panic, and git blame says that you created
the file in which the panic happened (I saw this from the trace).

$ /usr/src/kernels/`uname -r`/scripts/faddr2line \
    /lib/debug/lib/modules/`uname -r`/vmlinux \
    __bfq_deactivate_entity+0x15a
__bfq_deactivate_entity+0x15a/0x240:
bfq_gt at block/bfq-wf2q.c:20
(inlined by) bfq_insert at block/bfq-wf2q.c:381
(inlined by) bfq_idle_insert at block/bfq-wf2q.c:621
(inlined by) __bfq_deactivate_entity at block/bfq-wf2q.c:1203

https://github.com/torvalds/linux/blame/master/block/bfq-wf2q.c#L1203
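
For readers decoding the inlined frames: the innermost one, bfq_gt at
block/bfq-wf2q.c:20, is a timestamp comparison reached from the
red-black tree walk in bfq_insert. The user-space sketch below assumes
that helper is the usual wraparound-safe comparison (a signed view of
the unsigned difference). The name ts_gt is illustrative and is not
taken from the kernel source. If the assumption holds, the comparison
itself cannot fault; what can fault is the load of the timestamp from
the tree node being compared against.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of a wraparound-safe "a is later than b"
 * test on 64-bit virtual timestamps. The unsigned subtraction wraps
 * modulo 2^64, and viewing the difference as signed tells which value
 * is ahead, as long as the two are less than 2^63 apart.
 */
static bool ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

With this definition a timestamp that has just wrapped past zero still
compares as later than one slightly below UINT64_MAX.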

$ head /sys/block/*/queue/scheduler
==> /sys/block/nvme0n1/queue/scheduler <==
[none] mq-deadline kyber bfq

==> /sys/block/sda/queue/scheduler <==
mq-deadline kyber [bfq] none

==> /sys/block/zram0/queue/scheduler <==
none

Trace:
general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
CPU: 27 PID: 1018 Comm: kworker/27:1H Tainted: G        W --------- ---  5.9.0-0.rc8.28.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2606 08/13/2020
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:__bfq_deactivate_entity+0x15a/0x240
Code: 48 2b 41 28 48 85 c0 7e 05 49 89 5c 24 18 49 8b 44 24 08 4d 8d 74 24 08 48 85 c0 0f 84 d6 00 00 00 48 8b 7b 28 eb 03 48 89 c8 <48> 8b 48 28 48 8d 70 10 48 8d 50 08 48 29 f9 48 85 c9 48 0f 4f d6
RSP: 0018:ffffadf6c0c6fc00 EFLAGS: 00010002
RAX: 46b1b0f0d8856e4a RBX: ffff8dc2773b5c88 RCX: 46b1b0f0d8856e4a
RDX: ffff8dc7d02ed0a0 RSI: ffff8dc7d02ed0a8 RDI: 0000584e64e96beb
RBP: ffff8dc2773b5c00 R08: ffff8dc9054cb938 R09: 0000000000000000
R10: 0000000000000018 R11: 0000000000000018 R12: ffff8dc904927150
R13: 0000000000000001 R14: ffff8dc904927158 R15: ffff8dc2773b5c88
FS:  0000000000000000(0000) GS:ffff8dc90e0c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003e8ebe4000 CR3: 00000007c2546000 CR4: 0000000000350ee0
Call Trace:
 bfq_deactivate_entity+0x4f/0xc0
 bfq_del_bfqq_busy+0xbf/0x170
 __bfq_bfqq_expire+0x95/0xc0
 bfq_bfqq_expire+0x3c5/0x9a0
 ? bfq_active_extract+0x8e/0x140
 bfq_dispatch_request+0x438/0x1070
 __blk_mq_do_dispatch_sched+0x1c7/0x290
 ? dequeue_entity+0xa4/0x420
 __blk_mq_sched_dispatch_requests+0x129/0x180
 blk_mq_sched_dispatch_requests+0x30/0x60
 __blk_mq_run_hw_queue+0x49/0x110
 process_one_work+0x1b4/0x370
 worker_thread+0x53/0x3e0
 ? process_one_work+0x370/0x370
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
Modules linked in: tun snd_seq_dummy snd_hrtimer uinput rfcomm
xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp
nf_conntrack_tftp bridge stp llc nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set
nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac
bnep sunrpc vfat fat mt76x2u snd_hda_codec_realtek mt76x2_common
mt76x02_usb snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi
mt76_usb mt76x02_lib edac_mce_amd iwlmvm snd_hda_intel mt76
snd_intel_dspcfg kvm_amd mac80211 gspca_zc3xx snd_usb_audio
snd_hda_codec gspca_main uvcvideo btusb snd_usbmidi_lib iwlwifi
snd_hda_core videobuf2_vmalloc kvm videobuf2_memops btrtl snd_rawmidi
videobuf2_v4l2 snd_hwdep
 btbcm snd_seq btintel videobuf2_common eeepc_wmi irqbypass
snd_seq_device asus_wmi xpad bluetooth joydev sparse_keymap libarc4
rapl cfg80211 ff_memless snd_pcm videodev video pcspkr wmi_bmof
sp5100_tco snd_timer mc k10temp i2c_piix4 snd ecdh_generic ecc
soundcore rfkill acpi_cpufreq binfmt_misc zram ip_tables
hid_logitech_hidpp hid_logitech_dj amdgpu iommu_v2 gpu_sched ttm
drm_kms_helper cec crct10dif_pclmul crc32_pclmul crc32c_intel drm ccp
igb ghash_clmulni_intel nvme nvme_core dca i2c_algo_bit wmi
pinctrl_amd fuse
---[ end trace 09deb55d1b05f40c ]---
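
The "non-canonical address" wording in the first line of the trace can
be checked by hand. On x86-64 with 4-level paging (48-bit virtual
addresses), bits 63 through 47 of a valid address must all be equal.
The sketch below applies that rule. The helper name is_canonical is
hypothetical, and the 48-bit width is an assumption about this machine,
so it is passed as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if addr is canonical for a CPU that
 * implements va_bits of virtual address (48 with 4-level paging).
 * Bits 63..(va_bits - 1) must be all zeros or all ones.
 * va_bits is expected to be in the range 1..64.
 */
static bool is_canonical(uint64_t addr, unsigned int va_bits)
{
	uint64_t top = addr >> (va_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (va_bits - 1);

	return top == 0 || top == all_ones;
}
```

Under the 48-bit assumption, the value held in RAX and RCX
(0x46b1b0f0d8856e4a) fails this test, while the pointer in RBX
(0xffff8dc2773b5c88) passes it.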


Full system log: https://pastebin.com/6cKHZzAi
Full kernel log: https://pastebin.com/316HjHit

Unfortunately, I do not know how to reproduce this bug. I was not doing
anything unusual on the computer when it happened.
I can provide any information that would help further investigation.


--
Best Regards,
Mike Gavrilov.


* Re: [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
  2020-10-09 22:10 [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI Mikhail Gavrilov
@ 2021-03-04  8:42 ` Ming Lei
       [not found]   ` <20210305090022.1863-1-hdanton@sina.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2021-03-04  8:42 UTC
  To: Mikhail Gavrilov
  Cc: linux-block, Paolo Valente, Jens Axboe, Linux List Kernel Mailing

On Sat, Oct 10, 2020 at 1:40 PM Mikhail Gavrilov
<mikhail.v.gavrilov@gmail.com> wrote:
>
> Paolo, Jens I am sorry for the noise.
> But today I hit the kernel panic and git blame said that you have
> created the file in which happened panic (this I saw from trace)
>
> $ /usr/src/kernels/`uname -r`/scripts/faddr2line
> /lib/debug/lib/modules/`uname -r`/vmlinux
> __bfq_deactivate_entity+0x15a
> __bfq_deactivate_entity+0x15a/0x240:
> bfq_gt at block/bfq-wf2q.c:20
> (inlined by) bfq_insert at block/bfq-wf2q.c:381
> (inlined by) bfq_idle_insert at block/bfq-wf2q.c:621
> (inlined by) __bfq_deactivate_entity at block/bfq-wf2q.c:1203
>
> https://github.com/torvalds/linux/blame/master/block/bfq-wf2q.c#L1203
>
> $ head /sys/block/*/queue/scheduler
> ==> /sys/block/nvme0n1/queue/scheduler <==
> [none] mq-deadline kyber bfq
>
> ==> /sys/block/sda/queue/scheduler <==
> mq-deadline kyber [bfq] none
>
> ==> /sys/block/zram0/queue/scheduler <==
> none
>
> Trace:
> general protection fault, probably for non-canonical address
> 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
> CPU: 27 PID: 1018 Comm: kworker/27:1H Tainted: G        W
> --------- ---  5.9.0-0.rc8.28.fc34.x86_64 #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 2606 08/13/2020
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:__bfq_deactivate_entity+0x15a/0x240
> Code: 48 2b 41 28 48 85 c0 7e 05 49 89 5c 24 18 49 8b 44 24 08 4d 8d
> 74 24 08 48 85 c0 0f 84 d6 00 00 00 48 8b 7b 28 eb 03 48 89 c8 <48> 8b
> 48 28 48 8d 70 10 48 8d 50 08 48 29 f9 48 85 c9 48 0f 4f d6
> RSP: 0018:ffffadf6c0c6fc00 EFLAGS: 00010002
> RAX: 46b1b0f0d8856e4a RBX: ffff8dc2773b5c88 RCX: 46b1b0f0d8856e4a
> RDX: ffff8dc7d02ed0a0 RSI: ffff8dc7d02ed0a8 RDI: 0000584e64e96beb
> RBP: ffff8dc2773b5c00 R08: ffff8dc9054cb938 R09: 0000000000000000
> R10: 0000000000000018 R11: 0000000000000018 R12: ffff8dc904927150
> R13: 0000000000000001 R14: ffff8dc904927158 R15: ffff8dc2773b5c88
> FS:  0000000000000000(0000) GS:ffff8dc90e0c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000003e8ebe4000 CR3: 00000007c2546000 CR4: 0000000000350ee0
> Call Trace:
>  bfq_deactivate_entity+0x4f/0xc0

Hello,

The same stack trace was also observed in RH internal testing, on
kernel 5.11.0-0.rc6, but there is no reproducer yet.


-- 
Ming Lei


* Re: [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
       [not found]   ` <20210305090022.1863-1-hdanton@sina.com>
@ 2021-03-05  9:27     ` Ming Lei
  2021-03-05  9:32       ` Paolo Valente
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2021-03-05  9:27 UTC
  To: Hillf Danton
  Cc: Mikhail Gavrilov, linux-block, Paolo Valente, Jens Axboe, LKML

Hello Hillf,

Thanks for the debug patch.

On Fri, Mar 5, 2021 at 5:00 PM Hillf Danton <hdanton@sina.com> wrote:
>
> Add some debug info.
>
> --- x/block/bfq-wf2q.c
> +++ y/block/bfq-wf2q.c
> @@ -647,8 +647,10 @@ static void bfq_forget_entity(struct bfq
>
>         entity->on_st_or_in_serv = false;
>         st->wsum -= entity->weight;
> -       if (bfqq && !is_in_service)
> +       if (bfqq && !is_in_service) {
> +               WARN_ON(entity->tree != NULL);
>                 bfq_put_queue(bfqq);
> +       }
>  }
>
>  /**
> @@ -1631,6 +1633,7 @@ bool __bfq_bfqd_reset_in_service(struct
>                  * bfqq gets freed here.
>                  */
>                 int ref = in_serv_bfqq->ref;
> +               WARN_ON(in_serv_entity->tree != NULL);
>                 bfq_put_queue(in_serv_bfqq);
>                 if (ref == 1)
>                         return true;
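
One reading of the two WARN_ON() checks above is that they try to catch
a queue reference being dropped while its entity is still linked into a
service tree, which would leave a dangling node for a later tree walk
to follow. A minimal user-space model of that invariant is sketched
below. The type and function names (struct linked_entity, linked_put)
are hypothetical and are not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entity: tree is non-NULL while linked. */
struct linked_entity {
	void *tree;
	int ref;
};

/*
 * Drop one reference. Returns true if the drop violates the invariant
 * the debug patch appears to check: the entity is still linked into a
 * tree at the moment a reference to its queue is put.
 */
static bool linked_put(struct linked_entity *e)
{
	bool still_linked = e->tree != NULL;

	e->ref--;
	return still_linked;
}
```

In the model, a true return corresponds to the point where the patched
kernel would print a warning with a backtrace.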

This kernel oops isn't easy to reproduce, and we have another crash
report [1], also in __bfq_deactivate_entity() and also hard to trigger.
Does your debug patch cover report [1] as well? If not, feel free to add
more debug messages, and I will then try to reproduce both.

[1] Another kernel oops log in __bfq_deactivate_entity()

[  899.790606] systemd-sysv-generator[25205]: SysV service
'/etc/rc.d/init.d/anamon' lacks a native systemd unit file.
Automatically generating a unit file for compatibility. Please update
package to include a native systemd unit file, in order to make it
more safe and robust.
[  901.937047] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  901.944005] #PF: supervisor read access in kernel mode
[  901.949143] #PF: error_code(0x0000) - not-present page
[  901.954285] PGD 0 P4D 0
[  901.956824] Oops: 0000 [#1] SMP NOPTI
[  901.960490] CPU: 13 PID: 22966 Comm: kworker/13:0 Tainted: G   I    X --------- ---  5.11.0-1.el9.x86_64 #1
[  901.970829] Hardware name: Dell Inc. PowerEdge R740xd/0WXD1Y, BIOS 2.5.4 01/13/2020
[  901.978480] Workqueue: cgwb_release cgwb_release_workfn
[  901.983705] RIP: 0010:__bfq_deactivate_entity+0x5b/0x240
[  901.989016] Code: b8 30 00 00 00 75 18 48 81 ff 88 00 00 00 74 0f 0f b7 47 8a 83 e8 01 48 8d 04 40 48 c1 e0 04 4c 8b 73 68 48 63 73 40 48 89 df <4d> 8b 3e 4d 8d 64 06 10 e8 48 f0 ff ff 49 39 df 0f 84 87 01 00 00
[  902.007763] RSP: 0018:ffffb77107f0bd98 EFLAGS: 00010002
[  902.012986] RAX: 0000002fffffffd0 RBX: ffff9853ca9c6098 RCX: 0000000000000046
[  902.020119] RDX: 0000000000000001 RSI: 00000000474b1168 RDI: ffff9853ca9c6098
[  902.027253] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff985470c2fed0
[  902.034383] R10: 0000000000000001 R11: ffff9853c9287d98 R12: ffff9853ca8b8000
[  902.041515] R13: 00000000000000ff R14: 0000000000000000 R15: ffff985b44308098
[  902.048647] FS:  0000000000000000(0000) GS:ffff98631f980000(0000) knlGS:0000000000000000
[  902.056732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  902.062479] CR2: 0000000000000000 CR3: 00000001c0ac2002 CR4: 00000000007706e0
[  902.069611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  902.076744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  902.083876] PKRU: 55555554
[  902.086589] Call Trace:
[  902.089042]  bfq_pd_offline+0x89/0xd0
[  902.092708]  blkg_destroy+0x52/0xf0
[  902.096200]  blkcg_destroy_blkgs+0x46/0xc0
[  902.100300]  cgwb_release_workfn+0xbe/0x150
[  902.104485]  process_one_work+0x1e6/0x380
[  902.108497]  worker_thread+0x53/0x3d0
[  902.112161]  ? process_one_work+0x380/0x380
[  902.116346]  kthread+0x11b/0x140
[  902.119581]  ? kthread_associate_blkcg+0xa0/0xa0
[  902.124199]  ret_from_fork+0x1f/0x30
[  902.127780] Modules linked in: sunrpc scsi_debug iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
rfkill intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit
libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
ipmi_ssif irqbypass mgag200 rapl i2c_algo_bit iTCO_wdt drm_kms_helper
intel_cstate iTCO_vendor_support syscopyarea sysfillrect sysimgblt
acpi_ipmi mei_me fb_sys_fops intel_uncore pcspkr dell_smbios dcdbas
dell_wmi_descriptor wmi_bmof mei cec i2c_i801 ipmi_si acpi_power_meter
lpc_ich i2c_smbus ipmi_devintf ipmi_msghandler drm fuse xfs libcrc32c
sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci
megaraid_sas tg3 ghash_clmulni_intel libata wmi dm_mirror
dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
[  902.208546] CR2: 0000000000000000
[  902.211881] ---[ end trace 827b8521dc634ca4 ]---


-- 
Ming Lei


* Re: [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
  2021-03-05  9:27     ` Ming Lei
@ 2021-03-05  9:32       ` Paolo Valente
  2021-03-05 10:01         ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Valente @ 2021-03-05  9:32 UTC
  To: Ming Lei; +Cc: Hillf Danton, Mikhail Gavrilov, linux-block, Jens Axboe, LKML

I'm thinking of a way to debug this too.  The symptom may hint at a
use-after-free.  Could you enable KASAN in your tests?  (On the flip
side, I know this might change timings, thereby making the fault
disappear).

Thanks,
Paolo




* Re: [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
  2021-03-05  9:32       ` Paolo Valente
@ 2021-03-05 10:01         ` Ming Lei
       [not found]           ` <20210307021524.13260-1-hdanton@sina.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2021-03-05 10:01 UTC
  To: Paolo Valente
  Cc: Ming Lei, Hillf Danton, Mikhail Gavrilov, linux-block, Jens Axboe, LKML

On Fri, Mar 05, 2021 at 10:32:04AM +0100, Paolo Valente wrote:
> I'm thinking of a way to debug this too.  The symptom may hint at a
> use-after-free.  Could you enable KASAN in your tests?  (On the flip
> side, I know this might change timings, thereby making the fault
> disappear).

I have asked our QE to reproduce the issue with a debug kernel, which may
take a while. I can't trigger it on my own machine.

BTW, for the 2nd 'kernel NULL pointer dereference', the RIP points to:

(gdb) l *(__bfq_deactivate_entity+0x5b)
0xffffffff814c31cb is in __bfq_deactivate_entity (block/bfq-wf2q.c:1181).
1176		 * bfq_group_set_parent has already been invoked for the group
1177		 * represented by entity. Therefore, the field
1178		 * entity->sched_data has been set, and we can safely use it.
1179		 */
1180		st = bfq_entity_service_tree(entity);
1181		is_in_service = entity == sd->in_service_entity;
1182
1183		bfq_calc_finish(entity, entity->service);
1184
1185		if (is_in_service)

It seems that entity->sched_data is NULL.
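
The register dump in the second oops looks consistent with this: the
faulting bytes <4d> 8b 3e appear to decode to a load through R14
(mov (%r14),%r15), and R14 is 0 in the dump. A minimal user-space model
of the two loads on line 1181 is sketched below. The structs are
cut-down, hypothetical versions rather than the kernel's layouts, and
the NULL guard is there only so the model can run without faulting. It
is not a proposed fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down, hypothetical versions of the structures involved. */
struct model_entity;

struct model_sched_data {
	struct model_entity *in_service_entity;
};

struct model_entity {
	struct model_sched_data *sched_data;
};

/*
 * Mirrors "is_in_service = entity == sd->in_service_entity".
 * With sd == NULL the unguarded expression reads through a NULL
 * pointer, which matches a fault reported at address 0.
 */
static bool model_is_in_service(const struct model_entity *entity)
{
	const struct model_sched_data *sd = entity->sched_data;

	if (sd == NULL)	/* guard added for the model only */
		return false;
	return sd->in_service_entity == entity;
}
```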



-- 
Ming



* Re: [bugreport 5.9-rc8] general protection fault in __bfq_deactivate_entity
       [not found]           ` <20210307021524.13260-1-hdanton@sina.com>
@ 2021-03-07  7:46             ` Dmitry Vyukov
       [not found]               ` <20210307100900.13768-1-hdanton@sina.com>
  2021-05-21  2:50             ` Ming Lei
  1 sibling, 1 reply; 8+ messages in thread
From: Dmitry Vyukov @ 2021-03-07  7:46 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Ming Lei, Paolo Valente, Ming Lei, Mikhail Gavrilov, linux-block,
	Jens Axboe, LKML, kasan-dev

On Sun, Mar 7, 2021 at 3:15 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Fri, 5 Mar 2021 18:01:04 +0800  Ming Lei wrote:
> > On Fri, Mar 05, 2021 at 10:32:04AM +0100, Paolo Valente wrote:
> > > I'm thinking of a way to debug this too.  The symptom may hint at a
> > > use-after-free.  Could you enable KASAN in your tests?  (On the flip
> > > side, I know this might change timings, thereby making the fault
> > > disappear).
> >
> > I have asked our QE to reproduce the issue with debug kernel, which may take a
> > while. And I can't trigger it in my box.
> >
> > BTW, for the 2nd 'kernel NULL pointer dereference', the RIP points to:
> >
> > (gdb) l *(__bfq_deactivate_entity+0x5b)
> > 0xffffffff814c31cb is in __bfq_deactivate_entity (block/bfq-wf2q.c:1181).
> > 1176           * bfq_group_set_parent has already been invoked for the group
> > 1177           * represented by entity. Therefore, the field
> > 1178           * entity->sched_data has been set, and we can safely use it.
> > 1179           */
> > 1180          st = bfq_entity_service_tree(entity);
> > 1181          is_in_service = entity == sd->in_service_entity;
> > 1182
> > 1183          bfq_calc_finish(entity, entity->service);
> > 1184
> > 1185          if (is_in_service)
> >
> > Seems entity->sched_data points to NULL.
>
> Hi Ming,
>
> Thanks for your report.
>
> Given the invalid pointer cannot explain line 1180, you are reporting
> a different issue from what Mike reported, and we can do nothing now
> for both without a reproducer.
>
> Dmitry can you shed some light on the tricks to config kasan to print
> Call Trace as the reports with the leading [syzbot] on the subject line do?

+kasan-dev

Hi Hillf,

KASAN always prints stack traces, unconditionally. There is nothing you
need to configure. Do you have any reports without stack traces?

The "[syzbot]" prefix is prepended by syzbot code. If you want a prefix,
you would need to prepend it manually.



> > > Thanks,
> > > Paolo
> > >
> > > > Il giorno 5 mar 2021, alle ore 10:27, Ming Lei <tom.leiming@gmail.com> ha scritto:
> > > >
> > > > Hello Hillf,
> > > >
> > > > Thanks for the debug patch.
> > > >
> > > > On Fri, Mar 5, 2021 at 5:00 PM Hillf Danton <hdanton@sina.com> wrote:
> > > >>
> > > >> On Thu, 4 Mar 2021 16:42:30 +0800  Ming Lei wrote:
> > > >>> On Sat, Oct 10, 2020 at 1:40 PM Mikhail Gavrilov
> > > >>> <mikhail.v.gavrilov@gmail.com> wrote:
> > > >>>>
> > > >>>> Paolo, Jens I am sorry for the noise.
> > > >>>> But today I hit the kernel panic and git blame said that you have
> > > >>>> created the file in which happened panic (this I saw from trace)
> > > >>>>
> > > >>>> $ /usr/src/kernels/`uname -r`/scripts/faddr2line
> > > >>>> /lib/debug/lib/modules/`uname -r`/vmlinux
> > > >>>> __bfq_deactivate_entity+0x15a
> > > >>>> __bfq_deactivate_entity+0x15a/0x240:
> > > >>>> bfq_gt at block/bfq-wf2q.c:20
> > > >>>> (inlined by) bfq_insert at block/bfq-wf2q.c:381
> > > >>>> (inlined by) bfq_idle_insert at block/bfq-wf2q.c:621
> > > >>>> (inlined by) __bfq_deactivate_entity at block/bfq-wf2q.c:1203
> > > >>>>
> > > >>>> https://github.com/torvalds/linux/blame/master/block/bfq-wf2q.c#L1203
> > > >>>>
> > > >>>> $ head /sys/block/*/queue/scheduler
> > > >>>> ==> /sys/block/nvme0n1/queue/scheduler <==
> > > >>>> [none] mq-deadline kyber bfq
> > > >>>>
> > > >>>> ==> /sys/block/sda/queue/scheduler <==
> > > >>>> mq-deadline kyber [bfq] none
> > > >>>>
> > > >>>> ==> /sys/block/zram0/queue/scheduler <==
> > > >>>> none
> > > >>>>
> > > >>>> Trace:
> > > >>>> general protection fault, probably for non-canonical address
> > > >>>> 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI
> > > >>>> CPU: 27 PID: 1018 Comm: kworker/27:1H Tainted: G        W
> > > >>>> --------- ---  5.9.0-0.rc8.28.fc34.x86_64 #1
> > > >>>> Hardware name: System manufacturer System Product Name/ROG STRIX
> > > >>>> X570-I GAMING, BIOS 2606 08/13/2020
> > > >>>> Workqueue: kblockd blk_mq_run_work_fn
> > > >>>> RIP: 0010:__bfq_deactivate_entity+0x15a/0x240
> > > >>>> Code: 48 2b 41 28 48 85 c0 7e 05 49 89 5c 24 18 49 8b 44 24 08 4d 8d
> > > >>>> 74 24 08 48 85 c0 0f 84 d6 00 00 00 48 8b 7b 28 eb 03 48 89 c8 <48> 8b
> > > >>>> 48 28 48 8d 70 10 48 8d 50 08 48 29 f9 48 85 c9 48 0f 4f d6
> > > >>>> RSP: 0018:ffffadf6c0c6fc00 EFLAGS: 00010002
> > > >>>> RAX: 46b1b0f0d8856e4a RBX: ffff8dc2773b5c88 RCX: 46b1b0f0d8856e4a
> > > >>>> RDX: ffff8dc7d02ed0a0 RSI: ffff8dc7d02ed0a8 RDI: 0000584e64e96beb
> > > >>>> RBP: ffff8dc2773b5c00 R08: ffff8dc9054cb938 R09: 0000000000000000
> > > >>>> R10: 0000000000000018 R11: 0000000000000018 R12: ffff8dc904927150
> > > >>>> R13: 0000000000000001 R14: ffff8dc904927158 R15: ffff8dc2773b5c88
> > > >>>> FS:  0000000000000000(0000) GS:ffff8dc90e0c0000(0000) knlGS:0000000000000000
> > > >>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >>>> CR2: 0000003e8ebe4000 CR3: 00000007c2546000 CR4: 0000000000350ee0
> > > >>>> Call Trace:
> > > >>>> bfq_deactivate_entity+0x4f/0xc0
> > > >>>
> > > >>> Hello,
> > > >>>
> > > >>> The same stack trace was observed in RH internal test too, and kernel
> > > >>> is 5.11.0-0.rc6,
> > > >>> but there isn't reproducer yet.
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Ming Lei
> > > >>
> > > >> Add some debug info.
> > > >>
> > > >> --- x/block/bfq-wf2q.c
> > > >> +++ y/block/bfq-wf2q.c
> > > >> @@ -647,8 +647,10 @@ static void bfq_forget_entity(struct bfq
> > > >>
> > > >>        entity->on_st_or_in_serv = false;
> > > >>        st->wsum -= entity->weight;
> > > >> -       if (bfqq && !is_in_service)
> > > >> +       if (bfqq && !is_in_service) {
> > > >> +               WARN_ON(entity->tree != NULL);
> > > >>                bfq_put_queue(bfqq);
> > > >> +       }
> > > >> }
> > > >>
> > > >> /**
> > > >> @@ -1631,6 +1633,7 @@ bool __bfq_bfqd_reset_in_service(struct
> > > >>                 * bfqq gets freed here.
> > > >>                 */
> > > >>                int ref = in_serv_bfqq->ref;
> > > >> +               WARN_ON(in_serv_entity->tree != NULL);
> > > >>                bfq_put_queue(in_serv_bfqq);
> > > >>                if (ref == 1)
> > > >>                        return true;
> > > >
> > > > This kernel oops isn't easy to be reproduced, and  we have got another crash
> > > > report[1] too, still on __bfq_deactivate_entity(), and not easy to
> > > > trigger.  Can your
> > > > debug patch cover the report[1]? If not, feel free to add more debug messages,
> > > > then I will try to reproduce the two.
> > > >
> > > > [1] another kernel oops log on __bfq_deactivate_entity
> > > >
> > > > [  899.790606] systemd-sysv-generator[25205]: SysV service
> > > > '/etc/rc.d/init.d/anamon' lacks a native systemd unit file.
> > > > Automatically generating a unit file for compatibility. Please update
> > > > package to include a native systemd unit file, in order to make it
> > > > more safe and robust.
> > > > [  901.937047] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > > > [  901.944005] #PF: supervisor read access in kernel mode
> > > > [  901.949143] #PF: error_code(0x0000) - not-present page
> > > > [  901.954285] PGD 0 P4D 0
> > > > [  901.956824] Oops: 0000 [#1] SMP NOPTI
> > > > [  901.960490] CPU: 13 PID: 22966 Comm: kworker/13:0 Tainted: G
> > > >  I    X --------- ---  5.11.0-1.el9.x86_64 #1
> > > > [  901.970829] Hardware name: Dell Inc. PowerEdge R740xd/0WXD1Y, BIOS
> > > > 2.5.4 01/13/2020
> > > > [  901.978480] Workqueue: cgwb_release cgwb_release_workfn
> > > > [  901.983705] RIP: 0010:__bfq_deactivate_entity+0x5b/0x240
> > > > [  901.989016] Code: b8 30 00 00 00 75 18 48 81 ff 88 00 00 00 74 0f
> > > > 0f b7 47 8a 83 e8 01 48 8d 04 40 48 c1 e0 04 4c 8b 73 68 48 63 73 40
> > > > 48 89 df <4d> 8b 3e 4d 8d 64 06 10 e8 48 f0 ff ff 49 39 df 0f 84 87 01
> > > > 00 00
> > > > [  902.007763] RSP: 0018:ffffb77107f0bd98 EFLAGS: 00010002
> > > > [  902.012986] RAX: 0000002fffffffd0 RBX: ffff9853ca9c6098 RCX: 0000000000000046
> > > > [  902.020119] RDX: 0000000000000001 RSI: 00000000474b1168 RDI: ffff9853ca9c6098
> > > > [  902.027253] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff985470c2fed0
> > > > [  902.034383] R10: 0000000000000001 R11: ffff9853c9287d98 R12: ffff9853ca8b8000
> > > > [  902.041515] R13: 00000000000000ff R14: 0000000000000000 R15: ffff985b44308098
> > > > [  902.048647] FS:  0000000000000000(0000) GS:ffff98631f980000(0000)
> > > > knlGS:0000000000000000
> > > > [  902.056732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  902.062479] CR2: 0000000000000000 CR3: 00000001c0ac2002 CR4: 00000000007706e0
> > > > [  902.069611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > [  902.076744] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > [  902.083876] PKRU: 55555554
> > > > [  902.086589] Call Trace:
> > > > [  902.089042]  bfq_pd_offline+0x89/0xd0
> > > > [  902.092708]  blkg_destroy+0x52/0xf0
> > > > [  902.096200]  blkcg_destroy_blkgs+0x46/0xc0
> > > > [  902.100300]  cgwb_release_workfn+0xbe/0x150
> > > > [  902.104485]  process_one_work+0x1e6/0x380
> > > > [  902.108497]  worker_thread+0x53/0x3d0
> > > > [  902.112161]  ? process_one_work+0x380/0x380
> > > > [  902.116346]  kthread+0x11b/0x140
> > > > [  902.119581]  ? kthread_associate_blkcg+0xa0/0xa0
> > > > [  902.124199]  ret_from_fork+0x1f/0x30
> > > > [  902.127780] Modules linked in: sunrpc scsi_debug iscsi_tcp
> > > > libiscsi_tcp libiscsi scsi_transport_iscsi nft_reject_inet
> > > > nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
> > > > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
> > > > rfkill intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit
> > > > libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> > > > ipmi_ssif irqbypass mgag200 rapl i2c_algo_bit iTCO_wdt drm_kms_helper
> > > > intel_cstate iTCO_vendor_support syscopyarea sysfillrect sysimgblt
> > > > acpi_ipmi mei_me fb_sys_fops intel_uncore pcspkr dell_smbios dcdbas
> > > > dell_wmi_descriptor wmi_bmof mei cec i2c_i801 ipmi_si acpi_power_meter
> > > > lpc_ich i2c_smbus ipmi_devintf ipmi_msghandler drm fuse xfs libcrc32c
> > > > sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci
> > > > megaraid_sas tg3 ghash_clmulni_intel libata wmi dm_mirror
> > > > dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
> > > > [  902.208546] CR2: 0000000000000000
> > > > [  902.211881] ---[ end trace 827b8521dc634ca4 ]---
> > > >
> > > >
> > > > --
> > > > Ming Lei
> > >
> >
> > --
> > Ming
> >
> >


* Re: [bugreport 5.9-rc8] general protection fault in __bfq_deactivate_entity
       [not found]               ` <20210307100900.13768-1-hdanton@sina.com>
@ 2021-03-07 10:17                 ` Dmitry Vyukov
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Vyukov @ 2021-03-07 10:17 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Ming Lei, Paolo Valente, Ming Lei, Mikhail Gavrilov,
	Palash Oswal, linux-block, Jens Axboe, LKML, kasan-dev

On Sun, Mar 7, 2021 at 11:09 AM Hillf Danton <hdanton@sina.com> wrote:
>
> On Sun, 7 Mar 2021 08:46:19 +0100  Dmitry Vyukov wrote:
> > On Sun, Mar 7, 2021 at 3:15 AM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > Dmitry can you shed some light on the tricks to config kasan to print
> > > Call Trace as the reports with the leading [syzbot] on the subject line do?
> >
> > +kasan-dev
> >
> > Hi Hillf,
> >
> > KASAN prints stack traces always unconditionally. There is nothing you
> > need to do at all.
>
> Got it, thanks.
>
> > Do you have any reports w/o stack traces?
>
> No, but I saw different formats in Call Trace prints.
>
> Below from [1] is the instance without file name and line number printed,
> while both info help spot the cause of the reported issue.


KASAN always prints stack traces without file:line info, like any other
kernel bug detection facility. The kernel itself never symbolizes reports.
syzkaller symbolizes the reports it collects and adds the file:line info.
The main config option this requires is CONFIG_DEBUG_INFO.

You can find the syzkaller kernel configuration guide here:
https://github.com/google/syzkaller/blob/master/docs/linux/kernel_configs.md

Or see the fragments that are actually used to generate syzbot configs
(the guide above may be out of date):
https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/base.yml
https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/debug.yml
https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/kasan.yml

Or a complete syzbot config here:
https://github.com/google/syzkaller/blob/master/dashboard/config/linux/upstream-apparmor-kasan.config
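
Since the kernel leaves symbolization to external tooling, here is a minimal
sketch of the first step such tooling has to do: pulling the
"func+0xoffset/0xsize" frames out of a raw (possibly mail-quoted) Call Trace
so they can be handed to scripts/faddr2line together with a vmlinux built
with CONFIG_DEBUG_INFO. The helper names parse_frames and faddr2line_args are
hypothetical and exist only for this illustration; this is not how syzkaller
does it internally.

```python
import re

# Matches entries such as "bfq_pd_offline+0x89/0xd0", optionally preceded by
# a "[  902.089042]" timestamp and a "? " marker for unreliable frames.
FRAME_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?\s*(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)\s*$"
)


def parse_frames(log_text):
    """Return (symbol, offset, size, reliable) tuples, one per trace frame."""
    frames = []
    for line in log_text.splitlines():
        # Strip mail quote prefixes ("> > ") before matching.
        line = re.sub(r"^(?:>\s?)+", "", line)
        match = FRAME_RE.match(line)
        if match:
            frames.append((
                match.group("symbol"),
                int(match.group("offset"), 16),
                int(match.group("size"), 16),
                match.group("unreliable") is None,
            ))
    return frames


def faddr2line_args(frames):
    """Format the reliable frames as func+offset/size arguments."""
    return [f"{sym}+{off:#x}/{size:#x}" for sym, off, size, ok in frames if ok]


# Hypothetical usage: a few lines copied from the oops quoted in this thread.
SAMPLE = """\
[  902.086589] Call Trace:
[  902.089042]  bfq_pd_offline+0x89/0xd0
[  902.092708]  blkg_destroy+0x52/0xf0
[  902.112161]  ? process_one_work+0x380/0x380
"""

frames = parse_frames(SAMPLE)
print(faddr2line_args(frames))
```

The resulting strings can be passed as arguments after the vmlinux path, in
the same way the original report ran faddr2line on
__bfq_deactivate_entity+0x15a. Frames marked with "?" are skipped because the
unwinder itself flags them as unreliable. The "RIP:" line is deliberately not
matched by this sketch and would need separate handling.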


> >>>>>>>>>>>>>>>>>>>>>>>>>
>
> I was running syzkaller and I found the following issue :
>
> Head Commit : b1313fe517ca3703119dcc99ef3bbf75ab42bcfb ( v5.10.4 )
> Git Tree : stable
> Console Output :
> [  242.769080] INFO: task repro:2639 blocked for more than 120 seconds.
> [  242.769096]       Not tainted 5.10.4 #8
> [  242.769103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  242.769112] task:repro           state:D stack:    0 pid: 2639
> ppid:  2638 flags:0x00000004
> [  242.769126] Call Trace:
> [  242.769148]  __schedule+0x28d/0x7e0
> [  242.769162]  ? __percpu_counter_sum+0x75/0x90
> [  242.769175]  schedule+0x4f/0xc0
> [  242.769187]  __io_uring_task_cancel+0xad/0xf0
> [  242.769198]  ? wait_woken+0x80/0x80
> [  242.769210]  bprm_execve+0x67/0x8a0
> [  242.769223]  do_execveat_common+0x1d2/0x220
> [  242.769235]  __x64_sys_execveat+0x5d/0x70
> [  242.769249]  do_syscall_64+0x38/0x90
> [  242.769260]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [1] https://lore.kernel.org/lkml/CAGyP=7cFM6BJE7X2PN9YUptQgt5uQYwM4aVmOiVayQPJg1pqaA@mail.gmail.com/


* Re: [bugreport 5.9-rc8] general protection fault in __bfq_deactivate_entity
       [not found]           ` <20210307021524.13260-1-hdanton@sina.com>
  2021-03-07  7:46             ` [bugreport 5.9-rc8] general protection fault in __bfq_deactivate_entity Dmitry Vyukov
@ 2021-05-21  2:50             ` Ming Lei
  1 sibling, 0 replies; 8+ messages in thread
From: Ming Lei @ 2021-05-21  2:50 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Paolo Valente, Ming Lei, Dmitry Vyukov, Mikhail Gavrilov,
	linux-block, Jens Axboe, LKML

On Sat, Mar 06, 2021 at 07:15:24PM -0700, Hillf Danton wrote:
> On Fri, 5 Mar 2021 18:01:04 +0800  Ming Lei wrote:
> > On Fri, Mar 05, 2021 at 10:32:04AM +0100, Paolo Valente wrote:
> > > I'm thinking of a way to debug this too.  The symptom may hint at a
> > > use-after-free.  Could you enable KASAN in your tests?  (On the flip
> > > side, I know this might change timings, thereby making the fault
> > > disappear).
> > 
> > I have asked our QE to reproduce the issue with debug kernel, which may take a
> > while. And I can't trigger it in my box.
> > 
> > BTW, for the 2nd 'kernel NULL pointer dereference', the RIP points to:
> > 
> > (gdb) l *(__bfq_deactivate_entity+0x5b)
> > 0xffffffff814c31cb is in __bfq_deactivate_entity (block/bfq-wf2q.c:1181).
> > 1176		 * bfq_group_set_parent has already been invoked for the group
> > 1177		 * represented by entity. Therefore, the field
> > 1178		 * entity->sched_data has been set, and we can safely use it.
> > 1179		 */
> > 1180		st = bfq_entity_service_tree(entity);
> > 1181		is_in_service = entity == sd->in_service_entity;
> > 1182
> > 1183		bfq_calc_finish(entity, entity->service);
> > 1184
> > 1185		if (is_in_service)
> > 
> > Seems entity->sched_data points to NULL.
> 
> Hi Ming,
> 
> Thanks for your report.
> 
> Given the invalid pointer cannot explain line 1180, you are reporting
> a different issue from what Mike reported, and we can do nothing now
> for both without a reproducer.

BTW, we have seen this report twice on the 5.12 kernel. The kernel log
follows; this time there is a hard LOCKUP.


[  337.526984] systemd-shutdown[1]: Not all DM devices detached, 1 left.
[  337.526988] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing.
[  337.531043] systemd-shutdown[1]: Successfully changed into root pivot.
[  337.531046] systemd-shutdown[1]: Returning to initrd...
[  337.533136] watchdog: watchdog0: watchdog did not stop!
[  337.569177] dracut Warning: Killing all remaining processes
[  337.706605] XFS (dm-0): Unmounting Filesystem
[  351.593888] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[  351.593890] Modules linked in: dm_multipath rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 dcdbas iTCO_wdt irqbypass i2c_algo_bit iTCO_vendor_support rapl drm_kms_helper intel_cstate syscopyarea sysfillrect sysimgblt fb_sys_fops cec intel_uncore pcspkr ipmi_ssif mei_me mei lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse ip_tables xfs libcrc32c sd_mod qla2xxx ahci libahci nvme_fc crct10dif_pclmul crc32_pclmul crc32c_intel nvme_fabrics libata ghash_clmulni_intel tg3 nvme_core megaraid_sas t10_pi scsi_transport_fc wmi dm_mirror dm_region_hash dm_log dm_mod
[  351.593929] CPU: 2 PID: 95 Comm: kworker/2:1 Kdump: loaded Tainted: G               X --------- ---  5.12.0-1.el9.x86_64 #1
[  351.593930] Hardware name: Dell Inc. PowerEdge R430/0HFG24, BIOS 1.6.2 01/08/2016
[  351.593931] Workqueue: cgwb_release cgwb_release_workfn
[  351.593932] RIP: 0010:rb_prev+0x18/0x50
[  351.593933] Code: 31 c0 eb db 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 17 48 39 d7 74 35 48 8b 47 10 48 85 c0 74 1c 49 89 c0 48 8b 40 08 <48> 85 c0 75 f4 4c 89 c0 c3 48 3b 78 10 75 f6 48 8b 10 48 89 c7 48
[  351.593934] RSP: 0018:ffffb7280048fd70 EFLAGS: 00000086
[  351.593935] RAX: ffff98bc30f448a0 RBX: ffff98bc10d1e150 RCX: 0000000000000014
[  351.593936] RDX: 0000000000000001 RSI: ffff98bc00b39098 RDI: ffff98bc00b39098
[  351.593937] RBP: ffff98bc00b39098 R08: ffff98bc30f448a0 R09: 0000000000000000
[  351.593938] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  351.593939] R13: 0000000000000001 R14: ffff98bc10d1e110 R15: 0000000000000000
[  351.593940] FS:  0000000000000000(0000) GS:ffff98c37fa80000(0000) knlGS:0000000000000000
[  351.593941] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  351.593941] CR2: 00007fffeeceaea0 CR3: 0000000105b40003 CR4: 00000000001706e0
[  351.593942] Call Trace:
[  351.593943]  bfq_idle_extract+0x98/0xb0
[  351.593943]  __bfq_deactivate_entity+0x224/0x240
[  351.593944]  bfq_pd_offline+0xaa/0xd0
[  351.593945]  blkg_destroy+0x52/0xf0
[  351.593945]  blkcg_destroy_blkgs+0x46/0xc0
[  351.593946]  cgwb_release_workfn+0xbe/0x150
[  351.593947]  process_one_work+0x1e6/0x380
[  351.593947]  worker_thread+0x53/0x3d0
[  351.593948]  ? process_one_work+0x380/0x380
[  351.593949]  kthread+0x11b/0x140
[  351.593949]  ? kthread_associate_blkcg+0xa0/0xa0
[  351.593950]  ret_from_fork+0x22/0x30
[  351.593950] Kernel panic - not syncing: Hard LOCKUP


Thanks,
Ming



end of thread, other threads:[~2021-05-21  2:50 UTC | newest]

Thread overview: 8+ messages
2020-10-09 22:10 [bugreport 5.9-rc8] general protection fault, probably for non-canonical address 0x46b1b0f0d8856e4a: 0000 [#1] SMP NOPTI Mikhail Gavrilov
2021-03-04  8:42 ` Ming Lei
     [not found]   ` <20210305090022.1863-1-hdanton@sina.com>
2021-03-05  9:27     ` Ming Lei
2021-03-05  9:32       ` Paolo Valente
2021-03-05 10:01         ` Ming Lei
     [not found]           ` <20210307021524.13260-1-hdanton@sina.com>
2021-03-07  7:46             ` [bugreport 5.9-rc8] general protection fault in __bfq_deactivate_entity Dmitry Vyukov
     [not found]               ` <20210307100900.13768-1-hdanton@sina.com>
2021-03-07 10:17                 ` Dmitry Vyukov
2021-05-21  2:50             ` Ming Lei
