Linux-Block Archive on lore.kernel.org
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Subject: Re: io_uring NULL pointer dereference on Linux v5.4-rc1
Date: Sat, 12 Oct 2019 11:22:37 +0100
Message-ID: <20191012102237.GA17940@stefanha-x1.localdomain>
In-Reply-To: <c24b0ee5-361c-20da-1b7a-27aab947d4f2@kernel.dk>

On Fri, Oct 11, 2019 at 06:08:34AM -0600, Jens Axboe wrote:
> On 10/11/19 2:46 AM, Stefan Hajnoczi wrote:
> > On Wed, Oct 09, 2019 at 02:36:01PM -0600, Jens Axboe wrote:
> >> On 10/9/19 11:46 AM, Stefan Hajnoczi wrote:
> >>> On Wed, Oct 09, 2019 at 05:27:44AM -0600, Jens Axboe wrote:
> >>>> On 10/9/19 3:23 AM, Stefan Hajnoczi wrote:
> >>>>> I hit this NULL pointer dereference when running qemu-iotests 052 (raw)
> >>>>> on both ext4 and XFS on dm-thin/luks.  The kernel is Linux v5.4-rc1 but
> >>>>> I haven't found any obvious fixes in Jens' tree, so it's likely that
> >>>>> this bug is still present:
> >>>>>
> >>>>> BUG: kernel NULL pointer dereference, address: 0000000000000102
> >>>>> #PF: supervisor read access in kernel mode
> >>>>> #PF: error_code(0x0000) - not-present page
> >>>>> PGD 0 P4D 0
> >>>>> Oops: 0000 [#1] SMP PTI
> >>>>> CPU: 2 PID: 6656 Comm: qemu-io Not tainted 5.4.0-rc1 #1
> >>>>> Hardware name: LENOVO 20BTS1N70V/20BTS1N70V, BIOS N14ET37W (1.15 ) 09/06/2016
> >>>>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>> Call Trace:
> >>>>>     ? __io_queue_sqe+0xa1/0x200
> >>>>>     queue_work_on+0x36/0x40
> >>>>>     __io_queue_sqe+0x16e/0x200
> >>>>>     io_ring_submit+0xd2/0x230
> >>>>>     ? percpu_ref_resurrect+0x46/0x70
> >>>>>     ? __io_uring_register+0x207/0xa30
> >>>>>     ? __schedule+0x286/0x700
> >>>>>     __x64_sys_io_uring_enter+0x1a3/0x280
> >>>>>     ? __x64_sys_io_uring_register+0x64/0xb0
> >>>>>     do_syscall_64+0x5b/0x180
> >>>>>     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>>> RIP: 0033:0x7f7d3439f1fd
> >>>>> Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5b 8c 0c 00 f7 d8 64 89 01 48
> >>>>> RSP: 002b:00007f7d2918d408 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> >>>>> RAX: ffffffffffffffda RBX: 00007f7d2918d4f0 RCX: 00007f7d3439f1fd
> >>>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a
> >>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008
> >>>>> R10: 0000000000000000 R11: 0000000000000216 R12: 00005616e3c32ab8
> >>>>> R13: 00005616e3c32b78 R14: 00005616e3c32ab0 R15: 0000000000000001
> >>>>> Modules linked in: fuse ccm xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables sunrpc vfat fat intel_rapl_msr rmi_smbus iwlmvm rmi_core intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_intel snd_hda_intel kvm snd_intel_nhlt snd_hda_codec snd_usb_audio irqbypass uvcvideo snd_hda_core snd_usbmidi_lib snd_rawmidi iTCO_wdt snd_hwdep libarc4 intel_cstate cdc_ether intel_uncore videobuf2_vmalloc iwlwifi mei_wdt mei_hdcp iTCO_vendor_support snd_seq videobuf2_memops usbnet videobuf2_v4l2 snd_seq_device
> >>>>>     intel_rapl_perf pcspkr videobuf2_common joydev wmi_bmof snd_pcm cfg80211 r8152 videodev intel_pch_thermal i2c_i801 mii mc thinkpad_acpi snd_timer mei_me ledtrig_audio snd lpc_ich mei soundcore rfkill binfmt_misc xfs dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c dm_crypt i915 i2c_algo_bit drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw wmi video
> >>>>> CR2: 0000000000000102
> >>>>> ---[ end trace 2ac747acabe218da ]---
> >>>>> RIP: 0010:__queue_work+0x1f/0x3b0
> >>>>> Code: eb df 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 49 89 f7 41 56 41 89 fe 41 55 41 89 fd 41 54 55 48 89 d5 53 48 83 ec 10 <f6> 86 02 01 00 00 01 0f 85 bc 02 00 00 49 bc eb 83 b5 80 46 86 c8
> >>>>> RSP: 0018:ffffbef4884bbd58 EFLAGS: 00010082
> >>>>> RAX: 0000000000000246 RBX: 0000000000000246 RCX: 0000000000000000
> >>>>> RDX: ffff9903901f4460 RSI: 0000000000000000 RDI: 0000000000000040
> >>>>> RBP: ffff9903901f4460 R08: ffff9903901fb040 R09: ffff990398614700
> >>>>> R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
> >>>>> R13: 0000000000000040 R14: 0000000000000040 R15: 0000000000000000
> >>>>> FS:  00007f7d2a4e4a80(0000) GS:ffff9903a5a80000(0000) knlGS:0000000000000000
> >>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> CR2: 0000000000000102 CR3: 0000000203da8004 CR4: 00000000003606e0
> >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>
> >>>>> Unfortunately I don't have time to find the root cause.  What I've
> >>>>> figured out so far is:
> >>>>>
> >>>>>      bool queue_work_on(int cpu, struct workqueue_struct *wq,
> >>>>>                         struct work_struct *work)
> >>>>>      {
> >>>>>          bool ret = false;
> >>>>>          unsigned long flags;
> >>>>>
> >>>>>          local_irq_save(flags);
> >>>>>
> >>>>>          if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> >>>>>                                                         ~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> The address of work is 0x102 so this line causes a page fault when it
> >>>>> tries to access the data field (offset 0).
> >>>>>
> >>>>> The caller provided the 0x102 pointer so let's see where it comes from:
> >>>>>
> >>>>>      static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >>>>>                                struct sqe_submit *s, bool force_nonblock)
> >>>>>      {
> >>>>>          ...
> >>>>>          if (!io_add_to_prev_work(list, req)) {
> >>>>>              if (list)
> >>>>>                  atomic_inc(&list->cnt);
> >>>>>              INIT_WORK(&req->work, io_sq_wq_submit_work);
> >>>>>              io_queue_async_work(ctx, req);
> >>>>>              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> and queue_work() is called here:
> >>>>>
> >>>>>      static inline void io_queue_async_work(struct io_ring_ctx *ctx,
> >>>>>                                             struct io_kiocb *req)
> >>>>>      {
> >>>>>          int rw = 0;
> >>>>>
> >>>>>          if (req->submit.sqe) {
> >>>>>              switch (req->submit.sqe->opcode) {
> >>>>>              case IORING_OP_WRITEV:
> >>>>>              case IORING_OP_WRITE_FIXED:
> >>>>>                  rw = !(req->rw.ki_flags & IOCB_DIRECT);
> >>>>>                  break;
> >>>>>              }
> >>>>>          }
> >>>>>
> >>>>>          queue_work(ctx->sqo_wq[rw], &req->work);
> >>>>>          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>>>
> >>>>> I must be missing something though because it seems impossible to get
> >>>>> this far if req is NULL.  INIT_WORK() would have Oopsed already.  Also,
> >>>>> offsetof(struct io_kiocb, work) is 0xa0 according to pahole(1) so we
> >>>>> still haven't reached the 0x102 offset from the Oops report.
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> This is new in 5.4-rc1?
> >>>
> >>> I didn't hit it with 5.3, but I hit other issues so I'm not sure if this
> >>> bug exists in older kernels.
> >>>
> >>>> And how are you reproducing it?
> >>>
> >>>     $ git clone -b io_uring https://github.com/stefanha/qemu
> >>>     $ cd qemu
> >>>     $ ./configure --target-list=x86_64-softmmu
> >>>     $ make -j$(nproc)
> >>>     $ (cd tests/qemu-iotests && ./check -i io_uring 052)
> >>>
> >>> You can mount the file system of your choice at
> >>> tests/qemu-iotests/scratch/ before running the test.
> >>>
> >>> You can view the test case at tests/qemu-iotests/052.
> >>
> >> Thanks, that's useful. Need to look closer into this, but seems wrong
> >> that we're killing the workqueue for SCM_RIGHTS removal. We just need to
> >> sync it. Does this work for you?
> >>
> >>
> >> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >> index 8a0381f1a43b..a8755582c688 100644
> >> --- a/fs/io_uring.c
> >> +++ b/fs/io_uring.c
> >> @@ -2920,8 +2920,12 @@ static void io_finish_async(struct io_ring_ctx *ctx)
> >>   static void io_destruct_skb(struct sk_buff *skb)
> >>   {
> >>   	struct io_ring_ctx *ctx = skb->sk->sk_user_data;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++)
> >> +		if (ctx->sqo_wq[i])
> >> +			flush_workqueue(ctx->sqo_wq[i]);
> >>   
> >> -	io_finish_async(ctx);
> >>   	unix_destruct_scm(skb);
> >>   }
> > 
> > I tried this patch but still hit the same NULL pointer dereference.
> 
> How certain are you that you booted the right kernel when you tested
> that? Because I'm very certain that this patch will fix the issue you
> saw.
> 
> You can also pull:
> 
> git://git.kernel.dk/linux-block for-linus
> 
> into master and test that.

Yay, that tree doesn't hit the NULL pointer dereference :).

I'm not sure if:
1. I made a mistake when testing your patch.
2. Your linux-block for-linus tree has additional fixes which are also
   needed.
3. Linux v5.4.0-rc1 introduced this regression.

Anyway, I'll let you know if it reappears in newer kernel versions.

Thanks,
Stefan

Thread overview: 10+ messages
2019-10-09  9:23 Stefan Hajnoczi
2019-10-09 11:27 ` Jens Axboe
2019-10-09 17:46   ` Stefan Hajnoczi
2019-10-09 20:36     ` Jens Axboe
2019-10-11  8:46       ` Stefan Hajnoczi
2019-10-11 12:08         ` Jens Axboe
2019-10-11 15:51           ` Stefan Hajnoczi
2019-10-12 10:22           ` Stefan Hajnoczi [this message]
2019-10-12 16:46           ` Stefan Hajnoczi
2019-10-12 16:58             ` Jens Axboe
