linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* io_uring: BUG: kernel NULL pointer dereference
@ 2020-05-25 10:30 Stefano Garzarella
  2020-05-25 13:45 ` Stefano Garzarella
  0 siblings, 1 reply; 8+ messages in thread
From: Stefano Garzarella @ 2020-05-25 10:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-kernel, linux-fsdevel

Hi Jens,
using fio and io_uring engine with SQPOLL and IOPOLL enabled, I had the
following issue that happens after 4/5 seconds fio has started.
Initially I had this issue on Linux v5.7-rc6, but I just tried also
Linux v5.7-rc7:

[   75.343479] nvme nvme0: pci function 0000:04:00.0
[   75.355110] nvme nvme0: 16/0/15 default/read/poll queues
[   75.364946]  nvme0n1: p1
[   82.739285] BUG: kernel NULL pointer dereference, address: 00000000000003b0
[   82.747054] #PF: supervisor read access in kernel mode
[   82.752785] #PF: error_code(0x0000) - not-present page
[   82.758516] PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
[   82.766186] Oops: 0000 [#1] SMP PTI
[   82.770076] CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
[   82.777939] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
[   82.786290] RIP: 0010:task_numa_work+0x4f/0x2c0
[   82.791341] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
[   82.812296] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
[   82.818123] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
[   82.826083] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
[   82.834042] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
[   82.842002] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
[   82.849962] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
[   82.857922] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
[   82.866948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   82.873357] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
[   82.881316] Call Trace:
[   82.884046]  task_work_run+0x68/0xa0
[   82.888026]  io_sq_thread+0x252/0x3d0
[   82.892111]  ? finish_wait+0x80/0x80
[   82.896097]  kthread+0xf9/0x130
[   82.899598]  ? __ia32_sys_io_uring_enter+0x370/0x370
[   82.905134]  ? kthread_park+0x90/0x90
[   82.909217]  ret_from_fork+0x35/0x40
[   82.913203] Modules linked in: nvme nvme_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul dcdbas ghash_clmulni_intel iTCO_vendor_support intel_cstate intel_uncore pcspkr intel_rapl_perf ipmi_ssif ixgbe mei_me mdio tg3 dca mei lpc_ich ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c mgag200 drm_kms_helper drm_vram_helper drm_ttm_helper ttm drm megaraid_sas crc32c_intel i2c_algo_bit wmi
[   82.990613] CR2: 00000000000003b0
[   82.994307] ---[ end trace 6d1725e8f60fece7 ]---
[   83.039157] RIP: 0010:task_numa_work+0x4f/0x2c0
[   83.044211] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
[   83.065165] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
[   83.070993] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
[   83.078953] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
[   83.086913] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
[   83.094873] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
[   83.102833] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
[   83.110793] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
[   83.119821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   83.126230] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
[  113.113624] nvme nvme0: I/O 219 QID 19 timeout, aborting
[  113.120135] nvme nvme0: Abort status: 0x0

Steps I did:

  $ modprobe nvme poll_queues=15
  $ fio fio_iou.job

This is the fio_iou.job that I used:

  [global]
  filename=/dev/nvme0n1
  ioengine=io_uring
  direct=1
  runtime=60
  ramp_time=5
  gtod_reduce=1

  cpus_allowed=4

  [job1]
  rw=randread
  bs=4K
  iodepth=1
  registerfiles
  sqthread_poll=1
  sqthread_poll_cpu=2
  hipri

I'll try to bisect, but I have some suspicions about:
b41e98524e42 io_uring: add per-task callback handler

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 10:30 io_uring: BUG: kernel NULL pointer dereference Stefano Garzarella
@ 2020-05-25 13:45 ` Stefano Garzarella
  2020-05-25 14:10   ` Jens Axboe
  2020-05-25 14:11   ` Pavel Begunkov
  0 siblings, 2 replies; 8+ messages in thread
From: Stefano Garzarella @ 2020-05-25 13:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-kernel, linux-fsdevel

On Mon, May 25, 2020 at 12:30:51PM +0200, Stefano Garzarella wrote:
> Hi Jens,
> using fio and io_uring engine with SQPOLL and IOPOLL enabled, I had the
> following issue that happens after 4/5 seconds fio has started.
> Initially I had this issue on Linux v5.7-rc6, but I just tried also
> Linux v5.7-rc7:
> 
> [   75.343479] nvme nvme0: pci function 0000:04:00.0
> [   75.355110] nvme nvme0: 16/0/15 default/read/poll queues
> [   75.364946]  nvme0n1: p1
> [   82.739285] BUG: kernel NULL pointer dereference, address: 00000000000003b0
> [   82.747054] #PF: supervisor read access in kernel mode
> [   82.752785] #PF: error_code(0x0000) - not-present page
> [   82.758516] PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
> [   82.766186] Oops: 0000 [#1] SMP PTI
> [   82.770076] CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
> [   82.777939] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
> [   82.786290] RIP: 0010:task_numa_work+0x4f/0x2c0
> [   82.791341] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> [   82.812296] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> [   82.818123] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> [   82.826083] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> [   82.834042] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> [   82.842002] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> [   82.849962] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> [   82.857922] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> [   82.866948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   82.873357] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> [   82.881316] Call Trace:
> [   82.884046]  task_work_run+0x68/0xa0
> [   82.888026]  io_sq_thread+0x252/0x3d0
> [   82.892111]  ? finish_wait+0x80/0x80
> [   82.896097]  kthread+0xf9/0x130
> [   82.899598]  ? __ia32_sys_io_uring_enter+0x370/0x370
> [   82.905134]  ? kthread_park+0x90/0x90
> [   82.909217]  ret_from_fork+0x35/0x40
> [   82.913203] Modules linked in: nvme nvme_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul dcdbas ghash_clmulni_intel iTCO_vendor_support intel_cstate intel_uncore pcspkr intel_rapl_perf ipmi_ssif ixgbe mei_me mdio tg3 dca mei lpc_ich ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c mgag200 drm_kms_helper drm_vram_helper drm_ttm_helper ttm drm megaraid_sas crc32c_intel i2c_algo_bit wmi
> [   82.990613] CR2: 00000000000003b0
> [   82.994307] ---[ end trace 6d1725e8f60fece7 ]---
> [   83.039157] RIP: 0010:task_numa_work+0x4f/0x2c0
> [   83.044211] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> [   83.065165] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> [   83.070993] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> [   83.078953] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> [   83.086913] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> [   83.094873] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> [   83.102833] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> [   83.110793] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> [   83.119821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   83.126230] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> [  113.113624] nvme nvme0: I/O 219 QID 19 timeout, aborting
> [  113.120135] nvme nvme0: Abort status: 0x0
> 
> Steps I did:
> 
>   $ modprobe nvme poll_queues=15
>   $ fio fio_iou.job
> 
> This is the fio_iou.job that I used:
> 
>   [global]
>   filename=/dev/nvme0n1
>   ioengine=io_uring
>   direct=1
>   runtime=60
>   ramp_time=5
>   gtod_reduce=1
> 
>   cpus_allowed=4
> 
>   [job1]
>   rw=randread
>   bs=4K
>   iodepth=1
>   registerfiles
>   sqthread_poll=1
>   sqthread_poll_cpu=2
>   hipri
> 
> I'll try to bisect, but I have some suspicions about:
> b41e98524e42 io_uring: add per-task callback handler

I confirm, the bisection ended with this:
b41e98524e424d104aa7851d54fd65820759875a is the first bad commit

I'll try to figure out what happened.
Stefano


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 13:45 ` Stefano Garzarella
@ 2020-05-25 14:10   ` Jens Axboe
  2020-05-25 15:02     ` Stefano Garzarella
  2020-05-25 19:21     ` Peter Zijlstra
  2020-05-25 14:11   ` Pavel Begunkov
  1 sibling, 2 replies; 8+ messages in thread
From: Jens Axboe @ 2020-05-25 14:10 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: io-uring, linux-kernel, linux-fsdevel, Peter Zijlstra, Ingo Molnar

On 5/25/20 7:45 AM, Stefano Garzarella wrote:
> On Mon, May 25, 2020 at 12:30:51PM +0200, Stefano Garzarella wrote:
>> Hi Jens,
>> using fio and io_uring engine with SQPOLL and IOPOLL enabled, I had the
>> following issue that happens after 4/5 seconds fio has started.
>> Initially I had this issue on Linux v5.7-rc6, but I just tried also
>> Linux v5.7-rc7:
>>
>> [   75.343479] nvme nvme0: pci function 0000:04:00.0
>> [   75.355110] nvme nvme0: 16/0/15 default/read/poll queues
>> [   75.364946]  nvme0n1: p1
>> [   82.739285] BUG: kernel NULL pointer dereference, address: 00000000000003b0
>> [   82.747054] #PF: supervisor read access in kernel mode
>> [   82.752785] #PF: error_code(0x0000) - not-present page
>> [   82.758516] PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
>> [   82.766186] Oops: 0000 [#1] SMP PTI
>> [   82.770076] CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
>> [   82.777939] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
>> [   82.786290] RIP: 0010:task_numa_work+0x4f/0x2c0
>> [   82.791341] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
>> [   82.812296] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
>> [   82.818123] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
>> [   82.826083] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
>> [   82.834042] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
>> [   82.842002] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
>> [   82.849962] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
>> [   82.857922] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
>> [   82.866948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   82.873357] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
>> [   82.881316] Call Trace:
>> [   82.884046]  task_work_run+0x68/0xa0
>> [   82.888026]  io_sq_thread+0x252/0x3d0
>> [   82.892111]  ? finish_wait+0x80/0x80
>> [   82.896097]  kthread+0xf9/0x130
>> [   82.899598]  ? __ia32_sys_io_uring_enter+0x370/0x370
>> [   82.905134]  ? kthread_park+0x90/0x90
>> [   82.909217]  ret_from_fork+0x35/0x40
>> [   82.913203] Modules linked in: nvme nvme_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul dcdbas ghash_clmulni_intel iTCO_vendor_support intel_cstate intel_uncore pcspkr intel_rapl_perf ipmi_ssif ixgbe mei_me mdio tg3 dca mei lpc_ich ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c mgag200 drm_kms_helper drm_vram_helper drm_ttm_helper ttm drm megaraid_sas crc32c_intel i2c_algo_bit wmi
>> [   82.990613] CR2: 00000000000003b0
>> [   82.994307] ---[ end trace 6d1725e8f60fece7 ]---
>> [   83.039157] RIP: 0010:task_numa_work+0x4f/0x2c0
>> [   83.044211] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
>> [   83.065165] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
>> [   83.070993] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
>> [   83.078953] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
>> [   83.086913] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
>> [   83.094873] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
>> [   83.102833] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
>> [   83.110793] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
>> [   83.119821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   83.126230] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
>> [  113.113624] nvme nvme0: I/O 219 QID 19 timeout, aborting
>> [  113.120135] nvme nvme0: Abort status: 0x0
>>
>> Steps I did:
>>
>>   $ modprobe nvme poll_queues=15
>>   $ fio fio_iou.job
>>
>> This is the fio_iou.job that I used:
>>
>>   [global]
>>   filename=/dev/nvme0n1
>>   ioengine=io_uring
>>   direct=1
>>   runtime=60
>>   ramp_time=5
>>   gtod_reduce=1
>>
>>   cpus_allowed=4
>>
>>   [job1]
>>   rw=randread
>>   bs=4K
>>   iodepth=1
>>   registerfiles
>>   sqthread_poll=1
>>   sqthread_poll_cpu=2
>>   hipri
>>
>> I'll try to bisect, but I have some suspicions about:
>> b41e98524e42 io_uring: add per-task callback handler
> 
> I confirm, the bisection ended with this:
> b41e98524e424d104aa7851d54fd65820759875a is the first bad commit

I think the odd part here is that task_tick_numa() checks for a
valid mm, and queues work if the task has it. But for the sqpoll
kthread, the mm can come and go. By the time the task work is run,
the mm is gone and we oops on current->mm == NULL.

I think the below should fix it:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538ba5d94e99..24a8557f001f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2908,7 +2908,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/*
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
-	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
+	    work->next != work)
 		return;
 
 	/*

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 13:45 ` Stefano Garzarella
  2020-05-25 14:10   ` Jens Axboe
@ 2020-05-25 14:11   ` Pavel Begunkov
  1 sibling, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2020-05-25 14:11 UTC (permalink / raw)
  To: Stefano Garzarella, Jens Axboe; +Cc: io-uring, linux-kernel, linux-fsdevel

On 25/05/2020 16:45, Stefano Garzarella wrote:
> On Mon, May 25, 2020 at 12:30:51PM +0200, Stefano Garzarella wrote:
>>
>> I'll try to bisect, but I have some suspicions about:
>> b41e98524e42 io_uring: add per-task callback handler
> 
> I confirm, the bisection ended with this:
> b41e98524e424d104aa7851d54fd65820759875a is the first bad commit
> 
> I'll try to figure out what happened.

Hmm, I'd start with temporary un-union req->task_work.
Could you give it a try?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index e173cea621de..cfab79254d0a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -645,7 +645,7 @@ struct io_kiocb {

 	struct percpu_ref	*fixed_file_refs;

-	union {
+	// union {
 		/*
 		 * Only commands that never go async can use the below fields,
 		 * obviously. Right now only IORING_OP_POLL_ADD uses them, and
@@ -658,7 +658,7 @@ struct io_kiocb {
 			struct async_poll	*apoll;
 		};
 		struct io_wq_work	work;
-	};
+	// };
 };

 #define IO_PLUG_THRESHOLD		2


-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 14:10   ` Jens Axboe
@ 2020-05-25 15:02     ` Stefano Garzarella
  2020-05-25 19:07       ` Jens Axboe
  2020-05-25 19:21     ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Stefano Garzarella @ 2020-05-25 15:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-kernel, linux-fsdevel, Peter Zijlstra,
	Ingo Molnar, Pavel Begunkov, Hillf Danton

On Mon, May 25, 2020 at 08:10:27AM -0600, Jens Axboe wrote:
> On 5/25/20 7:45 AM, Stefano Garzarella wrote:
> > On Mon, May 25, 2020 at 12:30:51PM +0200, Stefano Garzarella wrote:
> >> Hi Jens,
> >> using fio and io_uring engine with SQPOLL and IOPOLL enabled, I had the
> >> following issue that happens after 4/5 seconds fio has started.
> >> Initially I had this issue on Linux v5.7-rc6, but I just tried also
> >> Linux v5.7-rc7:
> >>
> >> [   75.343479] nvme nvme0: pci function 0000:04:00.0
> >> [   75.355110] nvme nvme0: 16/0/15 default/read/poll queues
> >> [   75.364946]  nvme0n1: p1
> >> [   82.739285] BUG: kernel NULL pointer dereference, address: 00000000000003b0
> >> [   82.747054] #PF: supervisor read access in kernel mode
> >> [   82.752785] #PF: error_code(0x0000) - not-present page
> >> [   82.758516] PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
> >> [   82.766186] Oops: 0000 [#1] SMP PTI
> >> [   82.770076] CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
> >> [   82.777939] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
> >> [   82.786290] RIP: 0010:task_numa_work+0x4f/0x2c0
> >> [   82.791341] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> >> [   82.812296] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> >> [   82.818123] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> >> [   82.826083] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> >> [   82.834042] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> >> [   82.842002] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> >> [   82.849962] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> >> [   82.857922] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> >> [   82.866948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   82.873357] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> >> [   82.881316] Call Trace:
> >> [   82.884046]  task_work_run+0x68/0xa0
> >> [   82.888026]  io_sq_thread+0x252/0x3d0
> >> [   82.892111]  ? finish_wait+0x80/0x80
> >> [   82.896097]  kthread+0xf9/0x130
> >> [   82.899598]  ? __ia32_sys_io_uring_enter+0x370/0x370
> >> [   82.905134]  ? kthread_park+0x90/0x90
> >> [   82.909217]  ret_from_fork+0x35/0x40
> >> [   82.913203] Modules linked in: nvme nvme_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul dcdbas ghash_clmulni_intel iTCO_vendor_support intel_cstate intel_uncore pcspkr intel_rapl_perf ipmi_ssif ixgbe mei_me mdio tg3 dca mei lpc_ich ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c mgag200 drm_kms_helper drm_vram_helper drm_ttm_helper ttm drm megaraid_sas crc32c_intel i2c_algo_bit wmi
> >> [   82.990613] CR2: 00000000000003b0
> >> [   82.994307] ---[ end trace 6d1725e8f60fece7 ]---
> >> [   83.039157] RIP: 0010:task_numa_work+0x4f/0x2c0
> >> [   83.044211] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
> >> [   83.065165] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
> >> [   83.070993] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
> >> [   83.078953] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
> >> [   83.086913] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
> >> [   83.094873] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
> >> [   83.102833] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
> >> [   83.110793] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
> >> [   83.119821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   83.126230] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
> >> [  113.113624] nvme nvme0: I/O 219 QID 19 timeout, aborting
> >> [  113.120135] nvme nvme0: Abort status: 0x0
> >>
> >> Steps I did:
> >>
> >>   $ modprobe nvme poll_queues=15
> >>   $ fio fio_iou.job
> >>
> >> This is the fio_iou.job that I used:
> >>
> >>   [global]
> >>   filename=/dev/nvme0n1
> >>   ioengine=io_uring
> >>   direct=1
> >>   runtime=60
> >>   ramp_time=5
> >>   gtod_reduce=1
> >>
> >>   cpus_allowed=4
> >>
> >>   [job1]
> >>   rw=randread
> >>   bs=4K
> >>   iodepth=1
> >>   registerfiles
> >>   sqthread_poll=1
> >>   sqthread_poll_cpu=2
> >>   hipri
> >>
> >> I'll try to bisect, but I have some suspicions about:
> >> b41e98524e42 io_uring: add per-task callback handler
> > 
> > I confirm, the bisection ended with this:
> > b41e98524e424d104aa7851d54fd65820759875a is the first bad commit
> 
> I think the odd part here is that task_tick_numa() checks for a
> valid mm, and queues work if the task has it. But for the sqpoll
> kthread, the mm can come and go. By the time the task work is run,
> the mm is gone and we oops on current->mm == NULL.
> 
> I think the below should fix it:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 538ba5d94e99..24a8557f001f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2908,7 +2908,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	/*
>  	 * We don't care about NUMA placement if we don't have memory.
>  	 */
> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
> +	    work->next != work)
>  		return;
>  
>  	/*
> 

Great Jens! With this patch applied, everything works properly!

Pavel, Hillf, thanks both for the comments, this patch solves the problem I
have described.

Thanks,
Stefano


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 15:02     ` Stefano Garzarella
@ 2020-05-25 19:07       ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2020-05-25 19:07 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: io-uring, linux-kernel, linux-fsdevel, Peter Zijlstra,
	Ingo Molnar, Pavel Begunkov, Hillf Danton

On 5/25/20 9:02 AM, Stefano Garzarella wrote:
> On Mon, May 25, 2020 at 08:10:27AM -0600, Jens Axboe wrote:
>> On 5/25/20 7:45 AM, Stefano Garzarella wrote:
>>> On Mon, May 25, 2020 at 12:30:51PM +0200, Stefano Garzarella wrote:
>>>> Hi Jens,
>>>> using fio and io_uring engine with SQPOLL and IOPOLL enabled, I had the
>>>> following issue that happens after 4/5 seconds fio has started.
>>>> Initially I had this issue on Linux v5.7-rc6, but I just tried also
>>>> Linux v5.7-rc7:
>>>>
>>>> [   75.343479] nvme nvme0: pci function 0000:04:00.0
>>>> [   75.355110] nvme nvme0: 16/0/15 default/read/poll queues
>>>> [   75.364946]  nvme0n1: p1
>>>> [   82.739285] BUG: kernel NULL pointer dereference, address: 00000000000003b0
>>>> [   82.747054] #PF: supervisor read access in kernel mode
>>>> [   82.752785] #PF: error_code(0x0000) - not-present page
>>>> [   82.758516] PGD 800000046c042067 P4D 800000046c042067 PUD 461fcf067 PMD 0 
>>>> [   82.766186] Oops: 0000 [#1] SMP PTI
>>>> [   82.770076] CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
>>>> [   82.777939] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
>>>> [   82.786290] RIP: 0010:task_numa_work+0x4f/0x2c0
>>>> [   82.791341] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
>>>> [   82.812296] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
>>>> [   82.818123] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
>>>> [   82.826083] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
>>>> [   82.834042] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
>>>> [   82.842002] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
>>>> [   82.849962] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
>>>> [   82.857922] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
>>>> [   82.866948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   82.873357] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
>>>> [   82.881316] Call Trace:
>>>> [   82.884046]  task_work_run+0x68/0xa0
>>>> [   82.888026]  io_sq_thread+0x252/0x3d0
>>>> [   82.892111]  ? finish_wait+0x80/0x80
>>>> [   82.896097]  kthread+0xf9/0x130
>>>> [   82.899598]  ? __ia32_sys_io_uring_enter+0x370/0x370
>>>> [   82.905134]  ? kthread_park+0x90/0x90
>>>> [   82.909217]  ret_from_fork+0x35/0x40
>>>> [   82.913203] Modules linked in: nvme nvme_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul dcdbas ghash_clmulni_intel iTCO_vendor_support intel_cstate intel_uncore pcspkr intel_rapl_perf ipmi_ssif ixgbe mei_me mdio tg3 dca mei lpc_ich ipmi_si acpi_power_meter ipmi_devintf ipmi_msghandler ip_tables xfs libcrc32c mgag200 drm_kms_helper drm_vram_helper drm_ttm_helper ttm drm megaraid_sas crc32c_intel i2c_algo_bit wmi
>>>> [   82.990613] CR2: 00000000000003b0
>>>> [   82.994307] ---[ end trace 6d1725e8f60fece7 ]---
>>>> [   83.039157] RIP: 0010:task_numa_work+0x4f/0x2c0
>>>> [   83.044211] Code: 18 4c 8b 25 e3 f0 8e 01 49 8b 9f 00 08 00 00 4d 8b af c8 00 00 00 49 39 c7 0f 85 e8 01 00 00 48 89 6d 00 41 f6 47 24 04 75 67 <48> 8b ab b0 03 00 00 48 85 ed 75 16 8b 3d 6f 68 94 01 e8 aa fb 04
>>>> [   83.065165] RSP: 0018:ffffaaa98415be10 EFLAGS: 00010246
>>>> [   83.070993] RAX: ffff953ee36b8000 RBX: 0000000000000000 RCX: 0000000000000000
>>>> [   83.078953] RDX: 0000000000000001 RSI: ffff953ee36b8000 RDI: ffff953ee36b8dc8
>>>> [   83.086913] RBP: ffff953ee36b8dc8 R08: 00000000001200db R09: ffff9542e3ad2e08
>>>> [   83.094873] R10: ffff9542ecd20070 R11: 0000000000000000 R12: 00000000fffca35b
>>>> [   83.102833] R13: 000000012a06a949 R14: ffff9542e3ad2c00 R15: ffff953ee36b8000
>>>> [   83.110793] FS:  0000000000000000(0000) GS:ffff953eefc40000(0000) knlGS:0000000000000000
>>>> [   83.119821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   83.126230] CR2: 00000000000003b0 CR3: 000000046bbd0002 CR4: 00000000001606e0
>>>> [  113.113624] nvme nvme0: I/O 219 QID 19 timeout, aborting
>>>> [  113.120135] nvme nvme0: Abort status: 0x0
>>>>
>>>> Steps I did:
>>>>
>>>>   $ modprobe nvme poll_queues=15
>>>>   $ fio fio_iou.job
>>>>
>>>> This is the fio_iou.job that I used:
>>>>
>>>>   [global]
>>>>   filename=/dev/nvme0n1
>>>>   ioengine=io_uring
>>>>   direct=1
>>>>   runtime=60
>>>>   ramp_time=5
>>>>   gtod_reduce=1
>>>>
>>>>   cpus_allowed=4
>>>>
>>>>   [job1]
>>>>   rw=randread
>>>>   bs=4K
>>>>   iodepth=1
>>>>   registerfiles
>>>>   sqthread_poll=1
>>>>   sqthread_poll_cpu=2
>>>>   hipri
>>>>
>>>> I'll try to bisect, but I have some suspicions about:
>>>> b41e98524e42 io_uring: add per-task callback handler
>>>
>>> I confirm, the bisection ended with this:
>>> b41e98524e424d104aa7851d54fd65820759875a is the first bad commit
>>
>> I think the odd part here is that task_tick_numa() checks for a
>> valid mm, and queues work if the task has it. But for the sqpoll
>> kthread, the mm can come and go. By the time the task work is run,
>> the mm is gone and we oops on current->mm == NULL.
>>
>> I think the below should fix it:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 538ba5d94e99..24a8557f001f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2908,7 +2908,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>  	/*
>>  	 * We don't care about NUMA placement if we don't have memory.
>>  	 */
>> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
>> +	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
>> +	    work->next != work)
>>  		return;
>>  
>>  	/*
>>
> 
> Great Jens! With this patch applied, everything works properly!

Thanks for testing! Ingo/Peter, let me know how you want to handle this.
I can send in a properly formatted patch.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 14:10   ` Jens Axboe
  2020-05-25 15:02     ` Stefano Garzarella
@ 2020-05-25 19:21     ` Peter Zijlstra
  2020-05-25 19:25       ` Jens Axboe
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2020-05-25 19:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stefano Garzarella, io-uring, linux-kernel, linux-fsdevel, Ingo Molnar

On Mon, May 25, 2020 at 08:10:27AM -0600, Jens Axboe wrote:
> I think the odd part here is that task_tick_numa() checks for a
> valid mm, and queues work if the task has it. But for the sqpoll
> kthread, the mm can come and go. By the time the task work is run,
> the mm is gone and we oops on current->mm == NULL.
> 
> I think the below should fix it:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 538ba5d94e99..24a8557f001f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2908,7 +2908,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	/*
>  	 * We don't care about NUMA placement if we don't have memory.
>  	 */
> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
> +	    work->next != work)
>  		return;

Ah, I think that's one more instance of '!p->mm' != is_kthread(). A
while ago someone went and cleaned a bunch of them up. Clearly this one
was missed.

I'm thinking just:

	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)

should be enough.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: io_uring: BUG: kernel NULL pointer dereference
  2020-05-25 19:21     ` Peter Zijlstra
@ 2020-05-25 19:25       ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2020-05-25 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefano Garzarella, io-uring, linux-kernel, linux-fsdevel, Ingo Molnar

On 5/25/20 1:21 PM, Peter Zijlstra wrote:
> On Mon, May 25, 2020 at 08:10:27AM -0600, Jens Axboe wrote:
>> I think the odd part here is that task_tick_numa() checks for a
>> valid mm, and queues work if the task has it. But for the sqpoll
>> kthread, the mm can come and go. By the time the task work is run,
>> the mm is gone and we oops on current->mm == NULL.
>>
>> I think the below should fix it:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 538ba5d94e99..24a8557f001f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2908,7 +2908,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>  	/*
>>  	 * We don't care about NUMA placement if we don't have memory.
>>  	 */
>> -	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
>> +	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
>> +	    work->next != work)
>>  		return;
> 
> Ah, I think that's one more instance of '!p->mm' != is_kthread(). A
> while ago someone went and cleaned a bunch of them up. Clearly this one
> was missed.
> 
> I'm thinking just:
> 
> 	if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
> 
> should be enough.

Yeah it should, no point in checking both ->mm == NULL and PF_KTHREAD.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-05-25 19:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-25 10:30 io_uring: BUG: kernel NULL pointer dereference Stefano Garzarella
2020-05-25 13:45 ` Stefano Garzarella
2020-05-25 14:10   ` Jens Axboe
2020-05-25 15:02     ` Stefano Garzarella
2020-05-25 19:07       ` Jens Axboe
2020-05-25 19:21     ` Peter Zijlstra
2020-05-25 19:25       ` Jens Axboe
2020-05-25 14:11   ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).