Linux-Block Archive on lore.kernel.org
* io_uring NULL pointer dereference on Linux v5.4-rc1
@ 2019-10-09  9:23 Stefan Hajnoczi
  2019-10-09 11:27 ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-09  9:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
I haven't found any obvious fixes in Jens' tree, so it's likely that
this bug is still present:

BUG: kernel NULL pointer dereference, address: 0000000000000102
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
RIP: 0010:__queue_work+0x1f/0x3b0
Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? __io_queue_sqe+0xa1/0x200
 queue_work_on+0x36/0x40
 __io_queue_sqe+0x16e/0x200
 io_ring_submit+0xd2/0x230
 ? percpu_ref_resurrect+0x46/0x70
 ? __io_uring_register+0x207/0xa30
 ? __schedule+0x286/0x700
 __x64_sys_io_uring_enter+0x1a3/0x280
 ? __x64_sys_io_uring_register+0x64/0xb0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f7d3439f1fd
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
 intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
CR2: 0000000000000102
---[ end trace 2ac747acabe218da ]---
RIP: 0010:__queue_work+0x1f/0x3b0
Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Unfortunately I don't have time to find the root cause.  What I've
figured out so far is:

  bool queue_work_on(int cpu, struct workqueue_struct *wq,
                     struct work_struct *work)
  {
      bool ret = false;
      unsigned long flags;

      local_irq_save(flags);

      if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                                                     ~~~~~~~~~~~~~~~~~~~~

The address of work is 0x102 so this line causes a page fault when it
tries to access the data field (offset 0).

The caller provided the 0x102 pointer so let's see where it comes from:

  static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
                            struct sqe_submit *s, bool force_nonblock)
  {
      ...
      if (!io_add_to_prev_work(list, req)) {
          if (list)
              atomic_inc(&list->cnt);
          INIT_WORK(&req->work, io_sq_wq_submit_work);
          io_queue_async_work(ctx, req);
	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and queue_work() is called here:

  static inline void io_queue_async_work(struct io_ring_ctx *ctx,
                                         struct io_kiocb *req)
  {
      int rw = 0;

      if (req->submit.sqe) {
          switch (req->submit.sqe->opcode) {
          case IORING_OP_WRITEV:
          case IORING_OP_WRITE_FIXED:
              rw = !(req->rw.ki_flags & IOCB_DIRECT);
              break;
          }
      }

      queue_work(ctx->sqo_wq[rw], &req->work);
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I must be missing something though because it seems impossible to get
this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
still haven't reached the 0x102 offset from the Oops report.
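
One way to cross-check the 0x102 value is to decode the faulting
instruction from the Code: line.  The bytes at the <f6> marker are
f6 86 02 01 00 00 01, which on x86-64 encode testb $0x1,0x102(%rsi).
RSI is 0 in the register dump, so the faulting address is 0 + 0x102,
which matches CR2.  Assuming the usual x86-64 calling convention, and
assuming __queue_work() takes its arguments in the same (cpu, wq, work)
order as queue_work_on(), RSI would hold the workqueue pointer rather
than the work item.  A minimal sketch of that decoding follows; the
struct and function names are made up for illustration, and it handles
only this one encoding:

```c
#include <stdint.h>

/* Decoded form of "TEST r/m8, imm8" (opcode F6 /0) with a
 * [base + disp32] memory operand.  Hypothetical helper type. */
struct test_insn {
    int base_reg;   /* ModRM.rm; 6 means RSI when there is no REX prefix */
    uint32_t disp;  /* 32-bit displacement, stored little-endian */
    uint8_t imm;    /* 8-bit immediate mask */
};

/* Returns 0 on success, -1 if the bytes are not this exact encoding. */
int decode_test_disp32(const uint8_t *b, struct test_insn *out)
{
    if (b[0] != 0xf6)               /* opcode group 3, byte operand */
        return -1;
    if ((b[1] >> 6) != 2)           /* ModRM.mod = 10b: disp32 follows */
        return -1;
    if (((b[1] >> 3) & 7) != 0)     /* ModRM.reg = 000b selects TEST */
        return -1;
    if ((b[1] & 7) == 4)            /* rm = 100b would need a SIB byte */
        return -1;

    out->base_reg = b[1] & 7;
    out->disp = (uint32_t)b[2] |
                (uint32_t)b[3] << 8 |
                (uint32_t)b[4] << 16 |
                (uint32_t)b[5] << 24;
    out->imm = b[6];
    return 0;
}
```

Feeding it the seven bytes above gives base register 6 (RSI),
displacement 0x102 and mask 0x01.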

Any ideas?

Stefan


* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-09  9:23 io_uring NULL pointer dereference on Linux v5.4-rc1 Stefan Hajnoczi
@ 2019-10-09 11:27 ` Jens Axboe
  2019-10-09 17:46   ` Stefan Hajnoczi
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2019-10-09 11:27 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: linux-block

On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
> I haven't found any obvious fixes in Jens' tree, so it's likely that
> this bug is still present:
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000102
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
> RIP: 0010:__queue_work+0x1f/0x3b0
> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>   ? __io_queue_sqe+0xa1/0x200
>   queue_work_on+0x36/0x40
>   __io_queue_sqe+0x16e/0x200
>   io_ring_submit+0xd2/0x230
>   ? percpu_ref_resurrect+0x46/0x70
>   ? __io_uring_register+0x207/0xa30
>   ? __schedule+0x286/0x700
>   __x64_sys_io_uring_enter+0x1a3/0x280
>   ? __x64_sys_io_uring_register+0x64/0xb0
>   do_syscall_64+0x5b/0x180
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f7d3439f1fd
> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
>   intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
> CR2: 0000000000000102
> ---[ end trace 2ac747acabe218da ]---
> RIP: 0010:__queue_work+0x1f/0x3b0
> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 
> Unfortunately I don't have time to find the root cause.  What I've
> figured out so far is:
> 
>    bool queue_work_on(int cpu, struct workqueue_struct *wq,
>                       struct work_struct *work)
>    {
>        bool ret = false;
>        unsigned long flags;
> 
>        local_irq_save(flags);
> 
>        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>                                                       ~~~~~~~~~~~~~~~~~~~~
> 
> The address of work is 0x102 so this line causes a page fault when it
> tries to access the data field (offset 0).
> 
> The caller provided the 0x102 pointer so let's see where it comes from:
> 
>    static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>                              struct sqe_submit *s, bool force_nonblock)
>    {
>        ...
>        if (!io_add_to_prev_work(list, req)) {
>            if (list)
>                atomic_inc(&list->cnt);
>            INIT_WORK(&req->work, io_sq_wq_submit_work);
>            io_queue_async_work(ctx, req);
> 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> and queue_work() is called here:
> 
>    static inline void io_queue_async_work(struct io_ring_ctx *ctx,
>                                           struct io_kiocb *req)
>    {
>        int rw = 0;
> 
>        if (req->submit.sqe) {
>            switch (req->submit.sqe->opcode) {
>            case IORING_OP_WRITEV:
>            case IORING_OP_WRITE_FIXED:
>                rw = !(req->rw.ki_flags & IOCB_DIRECT);
>                break;
>            }
>        }
> 
>        queue_work(ctx->sqo_wq[rw], &req->work);
>        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I must be missing something though because it seems impossible to get
> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
> still haven't reached the 0x102 offset from the Oops report.
> 
> Any ideas?

This is new in 5.4-rc1? And how are you reproducing it? If I had some
hints in that area, it'd make life much easier for me.

-- 
Jens Axboe



* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-09 11:27 ` Jens Axboe
@ 2019-10-09 17:46   ` Stefan Hajnoczi
  2019-10-09 20:36     ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-09 17:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
> > I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
> > on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
> > I haven't found any obvious fixes in Jens' tree, so it's likely that
> > this bug is still present:
> > 
> > BUG: kernel NULL pointer dereference, address: 0000000000000102
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
> > Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
> > RIP: 0010:__queue_work+0x1f/0x3b0
> > Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> > RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> > RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> > RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> > RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> > R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> > R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> > FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   ? __io_queue_sqe+0xa1/0x200
> >   queue_work_on+0x36/0x40
> >   __io_queue_sqe+0x16e/0x200
> >   io_ring_submit+0xd2/0x230
> >   ? percpu_ref_resurrect+0x46/0x70
> >   ? __io_uring_register+0x207/0xa30
> >   ? __schedule+0x286/0x700
> >   __x64_sys_io_uring_enter+0x1a3/0x280
> >   ? __x64_sys_io_uring_register+0x64/0xb0
> >   do_syscall_64+0x5b/0x180
> >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x7f7d3439f1fd
> > Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
> > RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> > RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
> > RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
> > R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
> > R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
> > Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
> >   intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
> > CR2: 0000000000000102
> > ---[ end trace 2ac747acabe218da ]---
> > RIP: 0010:__queue_work+0x1f/0x3b0
> > Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> > RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> > RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> > RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> > RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> > R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> > R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> > FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > 
> > Unfortunately I don't have time to find the root cause.  What I've
> > figured out so far is:
> > 
> >    bool queue_work_on(int cpu, struct workqueue_struct *wq,
> >                       struct work_struct *work)
> >    {
> >        bool ret = false;
> >        unsigned long flags;
> > 
> >        local_irq_save(flags);
> > 
> >        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >                                                       ~~~~~~~~~~~~~~~~~~~~
> > 
> > The address of work is 0x102 so this line causes a page fault when it
> > tries to access the data field (offset 0).
> > 
> > The caller provided the 0x102 pointer so let's see where it comes from:
> > 
> >    static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >                              struct sqe_submit *s, bool force_nonblock)
> >    {
> >        ...
> >        if (!io_add_to_prev_work(list, req)) {
> >            if (list)
> >                atomic_inc(&list->cnt);
> >            INIT_WORK(&req->work, io_sq_wq_submit_work);
> >            io_queue_async_work(ctx, req);
> > 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > and queue_work() is called here:
> > 
> >    static inline void io_queue_async_work(struct io_ring_ctx *ctx,
> >                                           struct io_kiocb *req)
> >    {
> >        int rw = 0;
> > 
> >        if (req->submit.sqe) {
> >            switch (req->submit.sqe->opcode) {
> >            case IORING_OP_WRITEV:
> >            case IORING_OP_WRITE_FIXED:
> >                rw = !(req->rw.ki_flags & IOCB_DIRECT);
> >                break;
> >            }
> >        }
> > 
> >        queue_work(ctx->sqo_wq[rw], &req->work);
> >        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > I must be missing something though because it seems impossible to get
> > this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
> > offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
> > still haven't reached the 0x102 offset from the Oops report.
> > 
> > Any ideas?
> 
> This is new in 5.4-rc1?

I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
bug exists in older kernels.

> And how are you reproducing it?

  $ git clone -b io_uring https://github.com/stefanha/qemu
  $ cd qemu
  $ ./configure --target-list=x86_64-softmmu
  $ make -j$(nproc)
  $ (cd tests/qemu-iotests && ./check -i io_uring 052)

You can mount the file system of your choice at
tests/qemu-iotests/scratch/ before running the test.

You can view the test case at tests/qemu-iotests/052.

Stefan


* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-09 17:46   ` Stefan Hajnoczi
@ 2019-10-09 20:36     ` Jens Axboe
  2019-10-11  8:46       ` Stefan Hajnoczi
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2019-10-09 20:36 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: linux-block

On 10/9/19 11:46 AM, Stefan Hajnoczi wrote:
> On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
>> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
>>> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
>>> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
>>> I haven't found any obvious fixes in Jens' tree, so it's likely that
>>> this bug is still present:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000102
>>> #PF: supervisor read access in kernel mode
>>> #PF: error_code(0x0000) - not-present page
>>> PGD 0 P4D 0
>>> Oops: 0000 [#1] SMP PTI
>>> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
>>> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
>>> RIP: 0010:__queue_work+0x1f/0x3b0
>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>>    ? __io_queue_sqe+0xa1/0x200
>>>    queue_work_on+0x36/0x40
>>>    __io_queue_sqe+0x16e/0x200
>>>    io_ring_submit+0xd2/0x230
>>>    ? percpu_ref_resurrect+0x46/0x70
>>>    ? __io_uring_register+0x207/0xa30
>>>    ? __schedule+0x286/0x700
>>>    __x64_sys_io_uring_enter+0x1a3/0x280
>>>    ? __x64_sys_io_uring_register+0x64/0xb0
>>>    do_syscall_64+0x5b/0x180
>>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> RIP: 0033:0x7f7d3439f1fd
>>> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
>>> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
>>> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
>>> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
>>> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
>>> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
>>>    intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
>>> CR2: 0000000000000102
>>> ---[ end trace 2ac747acabe218da ]---
>>> RIP: 0010:__queue_work+0x1f/0x3b0
>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>
>>> Unfortunately I don't have time to find the root cause.  What I've
>>> figured out so far is:
>>>
>>>     bool queue_work_on(int cpu, struct workqueue_struct *wq,
>>>                        struct work_struct *work)
>>>     {
>>>         bool ret = false;
>>>         unsigned long flags;
>>>
>>>         local_irq_save(flags);
>>>
>>>         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>>>                                                        ~~~~~~~~~~~~~~~~~~~~
>>>
>>> The address of work is 0x102 so this line causes a page fault when it
>>> tries to access the data field (offset 0).
>>>
>>> The caller provided the 0x102 pointer so let's see where it comes from:
>>>
>>>     static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>                               struct sqe_submit *s, bool force_nonblock)
>>>     {
>>>         ...
>>>         if (!io_add_to_prev_work(list, req)) {
>>>             if (list)
>>>                 atomic_inc(&list->cnt);
>>>             INIT_WORK(&req->work, io_sq_wq_submit_work);
>>>             io_queue_async_work(ctx, req);
>>> 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> and queue_work() is called here:
>>>
>>>     static inline void io_queue_async_work(struct io_ring_ctx *ctx,
>>>                                            struct io_kiocb *req)
>>>     {
>>>         int rw = 0;
>>>
>>>         if (req->submit.sqe) {
>>>             switch (req->submit.sqe->opcode) {
>>>             case IORING_OP_WRITEV:
>>>             case IORING_OP_WRITE_FIXED:
>>>                 rw = !(req->rw.ki_flags & IOCB_DIRECT);
>>>                 break;
>>>             }
>>>         }
>>>
>>>         queue_work(ctx->sqo_wq[rw], &req->work);
>>>         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> I must be missing something though because it seems impossible to get
>>> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
>>> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
>>> still haven't reached the 0x102 offset from the Oops report.
>>>
>>> Any ideas?
>>
>> This is new in 5.4-rc1?
> 
> I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
> bug exists in older kernels.
> 
>> And how are you reproducing it?
> 
>    $ git clone -b io_uring https://github.com/stefanha/qemu
>    $ cd qemu
>    $ ./configure --target-list=x86_64-softmmu
>    $ make -j$(nproc)
>    $ (cd tests/qemu-iotests && ./check -i io_uring 052)
> 
> You can mount the file system of your choice at
> tests/qemu-iotests/scratch/ before running the test.
> 
> You can view the test case at tests/qemu-iotests/052.

Thanks, that's useful. Need to look closer into this, but seems wrong
that we're killing the workqueue for SCM_RIGHTS removal. We just need to
sync it. Does this work for you?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8a0381f1a43b..a8755582c688 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2920,8 +2920,12 @@ static void io_finish_async(struct io_ring_ctx *ctx)
 static void io_destruct_skb(struct sk_buff *skb)
 {
 	struct io_ring_ctx *ctx = skb->sk->sk_user_data;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++)
+		if (ctx->sqo_wq[i])
+			flush_workqueue(ctx->sqo_wq[i]);
 
-	io_finish_async(ctx);
 	unix_destruct_scm(skb);
 }
 

-- 
Jens Axboe



* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-09 20:36     ` Jens Axboe
@ 2019-10-11  8:46       ` Stefan Hajnoczi
  2019-10-11 12:08         ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-11  8:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Wed, Oct 09, 2019 at 02:36:01PM -0600, Jens Axboe wrote:
> On 10/9/19 11:46 AM, Stefan Hajnoczi wrote:
> > On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
> >> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
> >>> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
> >>> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
> >>> I haven't found any obvious fixes in Jens' tree, so it's likely that
> >>> this bug is still present:
> >>>
> >>> BUG: kernel NULL pointer dereference, address: 0000000000000102
> >>> #PF: supervisor read access in kernel mode
> >>> #PF: error_code(0x0000) - not-present page
> >>> PGD 0 P4D 0
> >>> Oops: 0000 [#1] SMP PTI
> >>> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
> >>> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
> >>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>> Call Trace:
> >>>    ? __io_queue_sqe+0xa1/0x200
> >>>    queue_work_on+0x36/0x40
> >>>    __io_queue_sqe+0x16e/0x200
> >>>    io_ring_submit+0xd2/0x230
> >>>    ? percpu_ref_resurrect+0x46/0x70
> >>>    ? __io_uring_register+0x207/0xa30
> >>>    ? __schedule+0x286/0x700
> >>>    __x64_sys_io_uring_enter+0x1a3/0x280
> >>>    ? __x64_sys_io_uring_register+0x64/0xb0
> >>>    do_syscall_64+0x5b/0x180
> >>>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>> RIP: 0033:0x7f7d3439f1fd
> >>> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
> >>> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> >>> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
> >>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
> >>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
> >>> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
> >>> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
> >>> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
> >>>    intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
> >>> CR2: 0000000000000102
> >>> ---[ end trace 2ac747acabe218da ]---
> >>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>
> >>> Unfortunately I don't have time to find the root cause.  What I've
> >>> figured out so far is:
> >>>
> >>>     bool queue_work_on(int cpu, struct workqueue_struct *wq,
> >>>                        struct work_struct *work)
> >>>     {
> >>>         bool ret = false;
> >>>         unsigned long flags;
> >>>
> >>>         local_irq_save(flags);
> >>>
> >>>         if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >>>                                                        ~~~~~~~~~~~~~~~~~~~~
> >>>
> >>> The address of work is 0x102 so this line causes a page fault when it
> >>> tries to access the data field (offset 0).
> >>>
> >>> The caller provided the 0x102 pointer so let's see where it comes from:
> >>>
> >>>     static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >>>                               struct sqe_submit *s, bool force_nonblock)
> >>>     {
> >>>         ...
> >>>         if (!io_add_to_prev_work(list, req)) {
> >>>             if (list)
> >>>                 atomic_inc(&list->cnt);
> >>>             INIT_WORK(&req->work, io_sq_wq_submit_work);
> >>>             io_queue_async_work(ctx, req);
> >>> 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>
> >>> and queue_work() is called here:
> >>>
> >>>     static inline void io_queue_async_work(struct io_ring_ctx *ctx,
> >>>                                            struct io_kiocb *req)
> >>>     {
> >>>         int rw = 0;
> >>>
> >>>         if (req->submit.sqe) {
> >>>             switch (req->submit.sqe->opcode) {
> >>>             case IORING_OP_WRITEV:
> >>>             case IORING_OP_WRITE_FIXED:
> >>>                 rw = !(req->rw.ki_flags & IOCB_DIRECT);
> >>>                 break;
> >>>             }
> >>>         }
> >>>
> >>>         queue_work(ctx->sqo_wq[rw], &req->work);
> >>>         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>
> >>> I must be missing something though because it seems impossible to get
> >>> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
> >>> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
> >>> still haven't reached the 0x102 offset from the Oops report.
> >>>
> >>> Any ideas?
> >>
> >> This is new in 5.4-rc1?
> > 
> > I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
> > bug exists in older kernels.
> > 
> >> And how are you reproducing it?
> > 
> >    $ git clone -b io_uring https://github.com/stefanha/qemu
> >    $ cd qemu
> >    $ ./configure --target-list=x86_64-softmmu
> >    $ make -j$(nproc)
> >    $ (cd tests/qemu-iotests && ./check -i io_uring 052)
> > 
> > You can mount the file system of your choice at
> > tests/qemu-iotests/scratch/ before running the test.
> > 
> > You can view the test case at tests/qemu-iotests/052.
> 
> Thanks, that's useful. Need to look closer into this, but seems wrong
> that we're killing the workqueue for SCM_RIGHTS removal. We just need to
> sync it. Does this work for you?
> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 8a0381f1a43b..a8755582c688 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2920,8 +2920,12 @@ static void io_finish_async(struct io_ring_ctx *ctx)
>  static void io_destruct_skb(struct sk_buff *skb)
>  {
>  	struct io_ring_ctx *ctx = skb->sk->sk_user_data;
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++)
> +		if (ctx->sqo_wq[i])
> +			flush_workqueue(ctx->sqo_wq[i]);
>  
> -	io_finish_async(ctx);
>  	unix_destruct_scm(skb);
>  }

I tried this patch but still hit the same NULL pointer dereference.

Stefan


* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-11  8:46       ` Stefan Hajnoczi
@ 2019-10-11 12:08         ` Jens Axboe
  2019-10-11 15:51           ` Stefan Hajnoczi
                             ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jens Axboe @ 2019-10-11 12:08 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: linux-block

On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
> On Wed, Oct 09, 2019 at 02:36:01PM -0600, Jens Axboe wrote:
>> On 10/9/19 11:46 AM, Stefan Hajnoczi wrote:
>>> On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
>>>> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
>>>>> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
>>>>> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
>>>>> I haven't found any obvious fixes in Jens' tree, so it's likely that
>>>>> this bug is still present:
>>>>>
>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000102
>>>>> #PF: supervisor read access in kernel mode
>>>>> #PF: error_code(0x0000) - not-present page
>>>>> PGD 0 P4D 0
>>>>> Oops: 0000 [#1] SMP PTI
>>>>> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
>>>>> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
>>>>> RIP: 0010:__queue_work+0x1f/0x3b0
>>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
>>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
>>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
>>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
>>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
>>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
>>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
>>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> Call Trace:
>>>>>     ? __io_queue_sqe+0xa1/0x200
>>>>>     queue_work_on+0x36/0x40
>>>>>     __io_queue_sqe+0x16e/0x200
>>>>>     io_ring_submit+0xd2/0x230
>>>>>     ? percpu_ref_resurrect+0x46/0x70
>>>>>     ? __io_uring_register+0x207/0xa30
>>>>>     ? __schedule+0x286/0x700
>>>>>     __x64_sys_io_uring_enter+0x1a3/0x280
>>>>>     ? __x64_sys_io_uring_register+0x64/0xb0
>>>>>     do_syscall_64+0x5b/0x180
>>>>>     entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> RIP: 0033:0x7f7d3439f1fd
>>>>> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
>>>>> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
>>>>> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
>>>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
>>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
>>>>> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
>>>>> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
>>>>> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
>>>>>     intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
>>>>> CR2: 0000000000000102
>>>>> ---[ end trace 2ac747acabe218da ]---
>>>>> RIP: 0010:__queue_work+0x1f/0x3b0
>>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
>>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
>>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
>>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
>>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
>>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
>>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
>>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>
>>>>> Unfortunately I don't have time to find the root cause.  What I've
>>>>> figured out so far is:
>>>>>
>>>>>      bool queue_work_on(int cpu, struct workqueue_struct *wq,
>>>>>                         struct work_struct *work)
>>>>>      {
>>>>>          bool ret = false;
>>>>>          unsigned long flags;
>>>>>
>>>>>          local_irq_save(flags);
>>>>>
>>>>>          if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
>>>>>                                                         ~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> The address of work is 0x102 so this line causes a page fault when it
>>>>> tries to access the data field (offset 0).
>>>>>
>>>>> The caller provided the 0x102 pointer so let's see where it comes from:
>>>>>
>>>>>      static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>>>                                struct sqe_submit *s, bool force_nonblock)
>>>>>      {
>>>>>          ...
>>>>>          if (!io_add_to_prev_work(list, req)) {
>>>>>              if (list)
>>>>>                  atomic_inc(&list->cnt);
>>>>>              INIT_WORK(&req->work, io_sq_wq_submit_work);
>>>>>              io_queue_async_work(ctx, req);
>>>>> 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> and queue_work() is called here:
>>>>>
>>>>>      static inline void io_queue_async_work(struct io_ring_ctx *ctx,
>>>>>                                             struct io_kiocb *req)
>>>>>      {
>>>>>          int rw = 0;
>>>>>
>>>>>          if (req->submit.sqe) {
>>>>>              switch (req->submit.sqe->opcode) {
>>>>>              case IORING_OP_WRITEV:
>>>>>              case IORING_OP_WRITE_FIXED:
>>>>>                  rw = !(req->rw.ki_flags & IOCB_DIRECT);
>>>>>                  break;
>>>>>              }
>>>>>          }
>>>>>
>>>>>          queue_work(ctx->sqo_wq[rw], &req->work);
>>>>>          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> I must be missing something though because it seems impossible to get
>>>>> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
>>>>> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
>>>>> still haven't reached the 0x102 offset from the Oops report.
>>>>>
>>>>> Any ideas?
>>>>
>>>> This is new in 5.4-rc1?
>>>
>>> I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
>>> bug exists in older kernels.
>>>
>>>> And how are you reproducing it?
>>>
>>>     $ git clone -b io_uring https://github.com/stefanha/qemu
>>>     $ cd qemu
>>>     $ ./configure --target-list=x86_64-softmmu
>>>     $ make -j$(nproc)
>>>     $ (cd tests/qemu-iotests && ./check -i io_uring 052)
>>>
>>> You can mount the file system of your choice at
>>> tests/qemu-iotests/scratch/ before running the test.
>>>
>>> You can view the test case at tests/qemu-iotests/052.
>>
>> Thanks, that's useful. Need to look closer into this, but seems wrong
>> that we're killing the workqueue for SCM_RIGHTS removal. We just need to
>> sync it. Does this work for you?
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 8a0381f1a43b..a8755582c688 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2920,8 +2920,12 @@ static void io_finish_async(struct io_ring_ctx *ctx)
>>   static void io_destruct_skb(struct sk_buff *skb)
>>   {
>>   	struct io_ring_ctx *ctx = skb->sk->sk_user_data;
>> +	int i;
>> +
>> +	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++)
>> +		if (ctx->sqo_wq[i])
>> +			flush_workqueue(ctx->sqo_wq[i]);
>>   
>> -	io_finish_async(ctx);
>>   	unix_destruct_scm(skb);
>>   }
> 
> I tried this patch but still hit the same NULL pointer dereference.

How certain are you that you booted the right kernel when you tested
that? Because I'm very certain that this patch will fix the issue you
saw.

You can also pull:

git://git.kernel.dk/linux-block for-linus

into master and test that.

-- 
Jens Axboe



* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-11 12:08         ` Jens Axboe
@ 2019-10-11 15:51           ` Stefan Hajnoczi
  2019-10-12 10:22           ` Stefan Hajnoczi
  2019-10-12 16:46           ` Stefan Hajnoczi
  2 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-11 15:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Fri, Oct 11, 2019 at 06:08:34AM -0600, Jens Axboe wrote:
> On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
> > On Wed, Oct 09, 2019 at 02:36:01PM -0600, Jens Axboe wrote:
> >> On 10/9/19 11:46 AM, Stefan Hajnoczi wrote:
> >>> On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
> >>>> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
> >>>>> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
> >>>>> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
> >>>>> I haven't found any obvious fixes in Jens' tree, so it's likely that
> >>>>> this bug is still present:
> >>>>>
> >>>>> BUG: kernel NULL pointer dereference, address: 0000000000000102
> >>>>> #PF: supervisor read access in kernel mode
> >>>>> #PF: error_code(0x0000) - not-present page
> >>>>> PGD 0 P4D 0
> >>>>> Oops: 0000 [#1] SMP PTI
> >>>>> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
> >>>>> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
> >>>>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>> Call Trace:
> >>>>>     ? __io_queue_sqe+0xa1/0x200
> >>>>>     queue_work_on+0x36/0x40
> >>>>>     __io_queue_sqe+0x16e/0x200
> >>>>>     io_ring_submit+0xd2/0x230
> >>>>>     ? percpu_ref_resurrect+0x46/0x70
> >>>>>     ? __io_uring_register+0x207/0xa30
> >>>>>     ? __schedule+0x286/0x700
> >>>>>     __x64_sys_io_uring_enter+0x1a3/0x280
> >>>>>     ? __x64_sys_io_uring_register+0x64/0xb0
> >>>>>     do_syscall_64+0x5b/0x180
> >>>>>     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>>> RIP: 0033:0x7f7d3439f1fd
> >>>>> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
> >>>>> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> >>>>> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
> >>>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
> >>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
> >>>>> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
> >>>>> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
> >>>>> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
> >>>>>     intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
> >>>>> CR2: 0000000000000102
> >>>>> ---[ end trace 2ac747acabe218da ]---
> >>>>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>
> >>>>> Unfortunately I don't have time to find the root cause.  What I've
> >>>>> figured out so far is:
> >>>>>
> >>>>>      bool queue_work_on(int cpu, struct workqueue_struct *wq,
> >>>>>                         struct work_struct *work)
> >>>>>      {
> >>>>>          bool ret = false;
> >>>>>          unsigned long flags;
> >>>>>
> >>>>>          local_irq_save(flags);
> >>>>>
> >>>>>          if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >>>>>                                                         ~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> The address of work is 0x102 so this line causes a page fault when it
> >>>>> tries to access the data field (offset 0).
> >>>>>
> >>>>> The caller provided the 0x102 pointer so let's see where it comes from:
> >>>>>
> >>>>>      static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >>>>>                                struct sqe_submit *s, bool force_nonblock)
> >>>>>      {
> >>>>>          ...
> >>>>>          if (!io_add_to_prev_work(list, req)) {
> >>>>>              if (list)
> >>>>>                  atomic_inc(&list->cnt);
> >>>>>              INIT_WORK(&req->work, io_sq_wq_submit_work);
> >>>>>              io_queue_async_work(ctx, req);
> >>>>> 	  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> and queue_work() is called here:
> >>>>>
> >>>>>      static inline void io_queue_async_work(struct io_ring_ctx *ctx,
> >>>>>                                             struct io_kiocb *req)
> >>>>>      {
> >>>>>          int rw = 0;
> >>>>>
> >>>>>          if (req->submit.sqe) {
> >>>>>              switch (req->submit.sqe->opcode) {
> >>>>>              case IORING_OP_WRITEV:
> >>>>>              case IORING_OP_WRITE_FIXED:
> >>>>>                  rw = !(req->rw.ki_flags & IOCB_DIRECT);
> >>>>>                  break;
> >>>>>              }
> >>>>>          }
> >>>>>
> >>>>>          queue_work(ctx->sqo_wq[rw], &req->work);
> >>>>>          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> I must be missing something though because it seems impossible to get
> >>>>> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
> >>>>> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
> >>>>> still haven't reached the 0x102 offset from the Oops report.
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> This is new in 5.4-rc1?
> >>>
> >>> I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
> >>> bug exists in older kernels.
> >>>
> >>>> And how are you reproducing it?
> >>>
> >>>     $ git clone -b io_uring https://github.com/stefanha/qemu
> >>>     $ cd qemu
> >>>     $ ./configure --target-list=x86_64-softmmu
> >>>     $ make -j$(nproc)
> >>>     $ (cd tests/qemu-iotests && ./check -i io_uring 052)
> >>>
> >>> You can mount the file system of your choice at
> >>> tests/qemu-iotests/scratch/ before running the test.
> >>>
> >>> You can view the test case at tests/qemu-iotests/052.
> >>
> >> Thanks, that's useful. Need to look closer into this, but seems wrong
> >> that we're killing the workqueue for SCM_RIGHTS removal. We just need to
> >> sync it. Does this work for you?
> >>
> >>
> >> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >> index 8a0381f1a43b..a8755582c688 100644
> >> --- a/fs/io_uring.c
> >> +++ b/fs/io_uring.c
> >> @@ -2920,8 +2920,12 @@ static void io_finish_async(struct io_ring_ctx *ctx)
> >>   static void io_destruct_skb(struct sk_buff *skb)
> >>   {
> >>   	struct io_ring_ctx *ctx = skb->sk->sk_user_data;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++)
> >> +		if (ctx->sqo_wq[i])
> >> +			flush_workqueue(ctx->sqo_wq[i]);
> >>   
> >> -	io_finish_async(ctx);
> >>   	unix_destruct_scm(skb);
> >>   }
> > 
> > I tried this patch but still hit the same NULL pointer dereference.
> 
> How certain are you that you booted the right kernel when you tested
> that? Because I'm very certain that this patch will fix the issue you
> saw.

Quite certain but I can try again.  It was Linux v5.4.0-rc1 plus your
patch.

> You can also pull:
> 
> git://git.kernel.dk/linux-block for-linus
> 
> into master and test that.

Cool, I'll try this tree instead.

Stefan


* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-11 12:08         ` Jens Axboe
  2019-10-11 15:51           ` Stefan Hajnoczi
@ 2019-10-12 10:22           ` Stefan Hajnoczi
  2019-10-12 16:46           ` Stefan Hajnoczi
  2 siblings, 0 replies; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-12 10:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Fri, Oct 11, 2019 at 06:08:34AM -0600, Jens Axboe wrote:
> On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
> > I tried this patch but still hit the same NULL pointer dereference.
> 
> How certain are you that you booted the right kernel when you tested
> that? Because I'm very certain that this patch will fix the issue you
> saw.
> 
> You can also pull:
> 
> git://git.kernel.dk/linux-block for-linus
> 
> into master and test that.

Yay, that tree doesn't hit the NULL pointer dereference :).

I'm not sure which of these explains it:
1. I made a mistake when testing your patch.
2. Your linux-block for-linus tree has additional fixes which are also
   needed.
3. Linux v5.4-rc1 introduced this regression.

Anyway, I'll let you know if it reappears in newer kernel versions.

Thanks,
Stefan

* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-11 12:08         ` Jens Axboe
  2019-10-11 15:51           ` Stefan Hajnoczi
  2019-10-12 10:22           ` Stefan Hajnoczi
@ 2019-10-12 16:46           ` Stefan Hajnoczi
  2019-10-12 16:58             ` Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: Stefan Hajnoczi @ 2019-10-12 16:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Fri, Oct 11, 2019 at 06:08:34AM -0600, Jens Axboe wrote:
> On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
> > I tried this patch but still hit the same NULL pointer dereference.
> 
> How certain are you that you booted the right kernel when you tested
> that? Because I'm very certain that this patch will fix the issue you
> saw.
> 
> You can also pull:
> 
> git://git.kernel.dk/linux-block for-linus
> 
> into master and test that.

It was bugging me that we don't know if Linux v5.4-rc1 has a regression,
so I merged your for-linus branch on top of Linux v5.4-rc1 as you
suggested.

The test passes, so it's now certain that the fix(es) in your for-linus
branch solve the NULL pointer dereference :).

Stefan

* Re: io_uring NULL pointer dereference on Linux v5.4-rc1
  2019-10-12 16:46           ` Stefan Hajnoczi
@ 2019-10-12 16:58             ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2019-10-12 16:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: linux-block

On 10/12/19 10:46 AM, Stefan Hajnoczi wrote:
> On Fri, Oct 11, 2019 at 06:08:34AM -0600, Jens Axboe wrote:
>> On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
>>> I tried this patch but still hit the same NULL pointer dereference.
>>
>> How certain are you that you booted the right kernel when you tested
>> that? Because I'm very certain that this patch will fix the issue you
>> saw.
>>
>> You can also pull:
>>
>> git://git.kernel.dk/linux-block for-linus
>>
>> into master and test that.
> 
> It was bugging me that we don't know if Linux v5.4-rc1 has a regression,
> so I merged your for-linux branch on top of Linux v5.4-rc1 as you
> suggested.
> 
> The test passes, so it's now certain that the fix(es) in your for-linux
> branch solve the NULL pointer dereference :).

Thanks for double checking! It's not a recent regression; it's been
there a while. But it does depend on timing for your test case, which
may be why you haven't seen it before. In any case, it's fixed in
Linus's tree now.

-- 
Jens Axboe
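
For anyone reading the Oops from the start of this thread later, the
faulting instruction can be decoded by hand. The sketch below is only an
illustration, not part of any kernel tooling: the helper name
decode_test_imm8 is made up, the byte string is the part of the "Code:"
line starting at the <f6> marker, and the register values are copied from
the register dump. It assumes the x86-64 SysV calling convention, under
which RSI would hold the second argument of __queue_work() (the
workqueue pointer) and RDX the third (the work item).

```python
# Bytes of the faulting instruction, taken from the Oops "Code:" line
# starting at the <f6> marker.
FAULT_BYTES = bytes.fromhex("f6 86 02 01 00 00 01")

# Register values copied from the Oops register dump.
REGS = {"rdi": 0x40, "rsi": 0x0, "rdx": 0xFFFF9903901F4460}

# 64-bit register names indexed by the ModRM r/m field (no REX prefix).
RM_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_test_imm8(code):
    """Decode 'F6 /0 ib' (TEST r/m8, imm8) with a [reg + disp32] operand."""
    opcode, modrm = code[0], code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xF6 or reg != 0:
        raise ValueError("not a TEST r/m8, imm8 instruction")
    if mod != 0b10 or rm == 0b100:
        raise ValueError("this sketch only handles [reg + disp32] without SIB")
    disp = int.from_bytes(code[2:6], "little", signed=True)
    imm = code[6]
    return RM_NAMES[rm], disp, imm


base, disp, imm = decode_test_imm8(FAULT_BYTES)
fault_addr = (REGS[base] + disp) & (2**64 - 1)

print(f"testb ${imm:#x}, {disp:#x}(%{base})")
print(f"effective address = {fault_addr:#x}")

# A byte test at displacement +2 with mask 0x01 corresponds to bit 16 of
# a 32-bit field that would start at disp - 2 (an inference, not verified
# against this kernel build).
print(f"bit {2 * 8} of a 32-bit field at offset {disp - 2:#x}")
```

The decoded address matches CR2 (0x102), and the base register is RSI,
which is 0 in the dump, while RDX holds a normal-looking kernel pointer.
Under the assumptions above, that would mean the workqueue pointer
(ctx->sqo_wq[rw]) was NULL rather than the work item, and a test of bit
16 would be consistent with the wq->flags & __WQ_DRAINING check near the
top of __queue_work(). That reading fits a workqueue having been torn
down underneath a submitter, but it has not been checked against the
exact build.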



Thread overview: 10+ messages
2019-10-09  9:23 io_uring NULL pointer dereference on Linux v5.4-rc1 Stefan Hajnoczi
2019-10-09 11:27 ` Jens Axboe
2019-10-09 17:46   ` Stefan Hajnoczi
2019-10-09 20:36     ` Jens Axboe
2019-10-11  8:46       ` Stefan Hajnoczi
2019-10-11 12:08         ` Jens Axboe
2019-10-11 15:51           ` Stefan Hajnoczi
2019-10-12 10:22           ` Stefan Hajnoczi
2019-10-12 16:46           ` Stefan Hajnoczi
2019-10-12 16:58             ` Jens Axboe
