io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/1] io_uring: preserve work->mm since actual work processing may need it
@ 2020-04-09 22:03 Bijan Mottahedeh
  2020-04-09 22:03 ` [RFC 1/1] " Bijan Mottahedeh
  0 siblings, 1 reply; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-04-09 22:03 UTC (permalink / raw)
  To: axboe; +Cc: io-uring

The liburing madvise test crashes the system with a NULL pointer
dereference because io_madvise() is passing a NULL mm value, previously
cleared in io_wq_switch_mm(), to do_advise().

I'm not clear why work->mm is being cleared, especially since it seems
to run contrary to what the comment above it states, but in any case
preserving the work->mm value gets rid of the crash.

--------------------------------------------------------------------------

Running test madvise
[  165.733724] BUG: kernel NULL pointer dereference, address: 0000000000000138
[  165.735088] #PF: supervisor read access in kernel mode
[  165.736027] #PF: error_code(0x0000) - not-present page
[  165.736971] PGD 8000000fa3c32067 P4D 8000000fa3c32067 PUD fc4e17067 PMD 0
[  165.738254] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[  165.739140] CPU: 18 PID: 30105 Comm: io_wqe_worker-0 Not tainted 5.6.0-next-1
[  165.740640] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-4
[  165.742721] RIP: 0010:__lock_acquire.isra.29+0x37/0x6c0
[  165.743656] Code: 25 40 8e 01 00 53 48 83 ec 18 44 8b 35 e6 2f 61 01 45 85 fc
[  165.747020] RSP: 0018:ffffc9000b08bba0 EFLAGS: 00010097
[  165.747989] RAX: 0000000000000000 RBX: 0000000000000130 RCX: 0000000000000001
[  165.749276] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000130
[  165.750552] RBP: ffff888fa35224c0 R08: 0000000000000000 R09: 0000000000000000
[  165.751862] R10: 0000000000000130 R11: 0000000000000000 R12: 0000000000000000
[  165.753195] R13: 0000000000000001 R14: 0000000000000000 R15: 00007f5c4ecea000
[  165.754490] FS:  0000000000000000(0000) GS:ffff888ff4600000(0000) knlGS:00000
[  165.756007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  165.757054] CR2: 0000000000000138 CR3: 0000000fc709c002 CR4: 0000000000160ee0
[  165.758339] Call Trace:
[  165.758805]  ? load_balance+0x1b4/0xd00
[  165.759525]  lock_acquire+0xf9/0x160
[  165.760202]  ? do_madvise+0xa59/0xb20
[  165.760894]  down_read+0x3c/0xe0
[  165.761479]  ? do_madvise+0xa59/0xb20
[  165.762188]  do_madvise+0xa59/0xb20
[  165.762830]  ? kvm_sched_clock_read+0xd/0x20
[  165.763643]  ? free_debug_processing+0x291/0x2c8
[  165.764535]  ? do_raw_spin_unlock+0x83/0x90
[  165.765303]  ? free_debug_processing+0x291/0x2c8
[  165.766184]  io_issue_sqe+0xafa/0x11e0
[  165.766867]  ? kvm_sched_clock_read+0xd/0x20
[  165.767641]  ? __free_pages_ok+0x3db/0x550
[  165.768390]  ? _raw_spin_unlock+0x1f/0x30
[  165.769129]  io_wq_submit_work+0x2f/0x80
[  165.769800]  io_worker_handle_work+0x38a/0x540
[  165.770650]  io_wqe_worker+0x32a/0x370
[  165.771342]  kthread+0x118/0x120
[  165.771948]  ? io_worker_handle_work+0x540/0x540
[  165.772784]  ? kthread_insert_work_sanity_check+0x60/0x60
[  165.773766]  ret_from_fork+0x1f/0x30
[  165.774419] Modules linked in: xfs dm_mod sr_mod sd_mod cdrom crc32c_intel nt
[  165.777124] CR2: 0000000000000138
[  165.777733] ---[ end trace 2a1a5b9c912bd387 ]---

Bijan Mottahedeh (1):
  io_uring: preserve work->mm since actual work processing may need it

 fs/io-wq.c | 2 --
 1 file changed, 2 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-09 22:03 [RFC 0/1] io_uring: preserve work->mm since actual work processing may need it Bijan Mottahedeh
@ 2020-04-09 22:03 ` Bijan Mottahedeh
  2020-04-10  8:47   ` Pavel Begunkov
  0 siblings, 1 reply; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-04-09 22:03 UTC (permalink / raw)
  To: axboe; +Cc: io-uring

Do not clear work->mm since io_madvise() passes it to do_madvise()
when the request is actually processed.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
---
 fs/io-wq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4023c98..4d20754 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -431,8 +431,6 @@ static void io_wq_switch_mm(struct io_worker *worker, struct io_wq_work *work)
 		if (!worker->mm)
 			set_fs(USER_DS);
 		worker->mm = work->mm;
-		/* hang on to this mm */
-		work->mm = NULL;
 		return;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-09 22:03 ` [RFC 1/1] " Bijan Mottahedeh
@ 2020-04-10  8:47   ` Pavel Begunkov
  2020-04-10 16:54     ` Bijan Mottahedeh
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Begunkov @ 2020-04-10  8:47 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe; +Cc: io-uring

On 4/10/2020 1:03 AM, Bijan Mottahedeh wrote:
> Do not clear work->mm since io_madvise() passes it to do_madvise()
> when the request is actually processed.

As I see, this down_read() from the trace is
down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
just several lines above your change. So, what do you mean by passing? I
don't see do_madvise() __explicitly__ accepting mm as an argument.

What tree do you use? Extra patches on top?

> 
> Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
> ---
>  fs/io-wq.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 4023c98..4d20754 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -431,8 +431,6 @@ static void io_wq_switch_mm(struct io_worker *worker, struct io_wq_work *work)
>  		if (!worker->mm)
>  			set_fs(USER_DS);
>  		worker->mm = work->mm;
> -		/* hang on to this mm */
> -		work->mm = NULL;
>  		return;
>  	}
>  
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-10  8:47   ` Pavel Begunkov
@ 2020-04-10 16:54     ` Bijan Mottahedeh
  2020-04-10 17:51       ` Pavel Begunkov
  0 siblings, 1 reply; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-04-10 16:54 UTC (permalink / raw)
  To: Pavel Begunkov, axboe; +Cc: io-uring


> As I see, this down_read() from the trace is
> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
> just several lines above your change. So, what do you mean by passing? I
> don't see do_madvise() __explicitly__ accepting mm as an argument.

I think the sequence is:

io_madvise()
-> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
                     ^^^^^^^^^^^^
    -> down_read(&mm->mmap_sem)

I added an assert in do_madvise() for a NULL mm value and hit it running 
the test.

> What tree do you use? Extra patches on top?

I'm using next-20200409 with no patches.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-10 16:54     ` Bijan Mottahedeh
@ 2020-04-10 17:51       ` Pavel Begunkov
  2020-04-10 17:57         ` Pavel Begunkov
  2020-04-10 19:09         ` Bijan Mottahedeh
  0 siblings, 2 replies; 10+ messages in thread
From: Pavel Begunkov @ 2020-04-10 17:51 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe; +Cc: io-uring

On 10/04/2020 19:54, Bijan Mottahedeh wrote:
> 
>> As I see, this down_read() from the trace is
>> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
>> just several lines above your change. So, what do you mean by passing? I
>> don't see do_madvise() __explicitly__ accepting mm as an argument.
> 
> I think the sequence is:
> 
> io_madvise()
> -> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
>                     ^^^^^^^^^^^^
>    -> down_read(&mm->mmap_sem)
> 
> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
> 
>> What tree do you use? Extra patches on top?
> 
> I'm using next-20200409 with no patches.

I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
in Jen's tree.

I don't think your patch will do, because it changes mm refcounting with extra
mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.

Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
as it actually was at some point in the mentioned patch (v5).

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-10 17:51       ` Pavel Begunkov
@ 2020-04-10 17:57         ` Pavel Begunkov
  2020-04-10 19:09         ` Bijan Mottahedeh
  1 sibling, 0 replies; 10+ messages in thread
From: Pavel Begunkov @ 2020-04-10 17:57 UTC (permalink / raw)
  To: Bijan Mottahedeh, axboe; +Cc: io-uring

On 10/04/2020 20:51, Pavel Begunkov wrote:
>> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
>>
>>> What tree do you use? Extra patches on top?
>>
>> I'm using next-20200409 with no patches.
> 
> I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
> in Jen's tree

oops, sorry for mistyping your name.

> 
> I don't think your patch will do, because it changes mm refcounting with extra
> mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.
> 
> Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
> as it actually was at some point in the mentioned patch (v5). 

Jens, how this should be handled? Through what tree it has to go?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-10 17:51       ` Pavel Begunkov
  2020-04-10 17:57         ` Pavel Begunkov
@ 2020-04-10 19:09         ` Bijan Mottahedeh
  2020-04-11  2:17           ` Jens Axboe
  1 sibling, 1 reply; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-04-10 19:09 UTC (permalink / raw)
  To: Pavel Begunkov, axboe; +Cc: io-uring

On 4/10/2020 10:51 AM, Pavel Begunkov wrote:
> On 10/04/2020 19:54, Bijan Mottahedeh wrote:
>>> As I see, this down_read() from the trace is
>>> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
>>> just several lines above your change. So, what do you mean by passing? I
>>> don't see do_madvise() __explicitly__ accepting mm as an argument.
>> I think the sequence is:
>>
>> io_madvise()
>> -> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
>>                      ^^^^^^^^^^^^
>>     -> down_read(&mm->mmap_sem)
>>
>> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
>>
>>> What tree do you use? Extra patches on top?
>> I'm using next-20200409 with no patches.
> I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
> in Jen's tree.
>
> I don't think your patch will do, because it changes mm refcounting with extra
> mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.
>
> Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
> as it actually was at some point in the mentioned patch (v5).
>
Ok. Jens had suggested to use req->work.mm in the patch comments so 
let's just get him to confirm:

"I think we want to use req->work.mm here - it'll be the same as
current->mm at this point, but it makes it clear that we're using a
grabbed mm."


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-10 19:09         ` Bijan Mottahedeh
@ 2020-04-11  2:17           ` Jens Axboe
  2020-04-16 20:24             ` Minchan Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2020-04-11  2:17 UTC (permalink / raw)
  To: Bijan Mottahedeh, Pavel Begunkov; +Cc: io-uring, Minchan Kim

On 4/10/20 12:09 PM, Bijan Mottahedeh wrote:
> On 4/10/2020 10:51 AM, Pavel Begunkov wrote:
>> On 10/04/2020 19:54, Bijan Mottahedeh wrote:
>>>> As I see, this down_read() from the trace is
>>>> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
>>>> just several lines above your change. So, what do you mean by passing? I
>>>> don't see do_madvise() __explicitly__ accepting mm as an argument.
>>> I think the sequence is:
>>>
>>> io_madvise()
>>> -> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
>>>                      ^^^^^^^^^^^^
>>>     -> down_read(&mm->mmap_sem)
>>>
>>> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
>>>
>>>> What tree do you use? Extra patches on top?
>>> I'm using next-20200409 with no patches.
>> I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
>> in Jen's tree.
>>
>> I don't think your patch will do, because it changes mm refcounting with extra
>> mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.
>>
>> Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
>> as it actually was at some point in the mentioned patch (v5).
>>
> Ok. Jens had suggested to use req->work.mm in the patch comments so 
> let's just get him to confirm:
> 
> "I think we want to use req->work.mm here - it'll be the same as
> current->mm at this point, but it makes it clear that we're using a
> grabbed mm."

We should just use current->mm, as that matches at that point anyway
since IORING_OP_MADVISE had needs_mm set.

Minchan, can you please make that change?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-11  2:17           ` Jens Axboe
@ 2020-04-16 20:24             ` Minchan Kim
  2020-04-16 20:30               ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Minchan Kim @ 2020-04-16 20:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bijan Mottahedeh, Pavel Begunkov, io-uring

Hi Jens,

Sorry for the late.

On Fri, Apr 10, 2020 at 08:17:29PM -0600, Jens Axboe wrote:
> On 4/10/20 12:09 PM, Bijan Mottahedeh wrote:
> > On 4/10/2020 10:51 AM, Pavel Begunkov wrote:
> >> On 10/04/2020 19:54, Bijan Mottahedeh wrote:
> >>>> As I see, this down_read() from the trace is
> >>>> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
> >>>> just several lines above your change. So, what do you mean by passing? I
> >>>> don't see do_madvise() __explicitly__ accepting mm as an argument.
> >>> I think the sequence is:
> >>>
> >>> io_madvise()
> >>> -> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
> >>>                      ^^^^^^^^^^^^
> >>>     -> down_read(&mm->mmap_sem)
> >>>
> >>> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
> >>>
> >>>> What tree do you use? Extra patches on top?
> >>> I'm using next-20200409 with no patches.
> >> I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
> >> in Jen's tree.
> >>
> >> I don't think your patch will do, because it changes mm refcounting with extra
> >> mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.
> >>
> >> Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
> >> as it actually was at some point in the mentioned patch (v5).
> >>
> > Ok. Jens had suggested to use req->work.mm in the patch comments so 
> > let's just get him to confirm:
> > 
> > "I think we want to use req->work.mm here - it'll be the same as
> > current->mm at this point, but it makes it clear that we're using a
> > grabbed mm."
> 
> We should just use current->mm, as that matches at that point anyway
> since IORING_OP_MADVISE had needs_mm set.
> 
> Minchan, can you please make that change?

Do you mean this?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a9537cd77aeb..3edbb4764993 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3280,7 +3280,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
        if (force_nonblock)
                return -EAGAIN;

-       ret = do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice);
+       ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
        if (ret < 0)
                req_set_fail_links(req);
        io_cqring_add_event(req, ret);

Since I have a plan to resend whole patchset again, I will carry on
that.


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 1/1] io_uring: preserve work->mm since actual work processing may need it
  2020-04-16 20:24             ` Minchan Kim
@ 2020-04-16 20:30               ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2020-04-16 20:30 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Bijan Mottahedeh, Pavel Begunkov, io-uring

On 4/16/20 2:24 PM, Minchan Kim wrote:
> Hi Jens,
> 
> Sorry for the late.
> 
> On Fri, Apr 10, 2020 at 08:17:29PM -0600, Jens Axboe wrote:
>> On 4/10/20 12:09 PM, Bijan Mottahedeh wrote:
>>> On 4/10/2020 10:51 AM, Pavel Begunkov wrote:
>>>> On 10/04/2020 19:54, Bijan Mottahedeh wrote:
>>>>>> As I see, this down_read() from the trace is
>>>>>> down_read(&current->mm->mmap_sem), where current->mm is set by use_mm()
>>>>>> just several lines above your change. So, what do you mean by passing? I
>>>>>> don't see do_madvise() __explicitly__ accepting mm as an argument.
>>>>> I think the sequence is:
>>>>>
>>>>> io_madvise()
>>>>> -> do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice)
>>>>>                      ^^^^^^^^^^^^
>>>>>     -> down_read(&mm->mmap_sem)
>>>>>
>>>>> I added an assert in do_madvise() for a NULL mm value and hit it running the test.
>>>>>
>>>>>> What tree do you use? Extra patches on top?
>>>>> I'm using next-20200409 with no patches.
>>>> I see, it came from 676a179 ("mm: pass task and mm to do_madvise"), which isn't
>>>> in Jen's tree.
>>>>
>>>> I don't think your patch will do, because it changes mm refcounting with extra
>>>> mmdrop() in io_req_work_drop_env(). That's assuming it worked well before.
>>>>
>>>> Better fix then is to make it ```do_madvise(NULL, current->mm, ...)```
>>>> as it actually was at some point in the mentioned patch (v5).
>>>>
>>> Ok. Jens had suggested to use req->work.mm in the patch comments so 
>>> let's just get him to confirm:
>>>
>>> "I think we want to use req->work.mm here - it'll be the same as
>>> current->mm at this point, but it makes it clear that we're using a
>>> grabbed mm."
>>
>> We should just use current->mm, as that matches at that point anyway
>> since IORING_OP_MADVISE had needs_mm set.
>>
>> Minchan, can you please make that change?
> 
> Do you mean this?
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index a9537cd77aeb..3edbb4764993 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3280,7 +3280,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
>         if (force_nonblock)
>                 return -EAGAIN;
> 
> -       ret = do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice);
> +       ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
>         if (ret < 0)
>                 req_set_fail_links(req);
>         io_cqring_add_event(req, ret);
> 
> Since I have a plan to resend whole patchset again, I will carry on
> that.

Yeah exactly like that, thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-04-16 20:30 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-09 22:03 [RFC 0/1] io_uring: preserve work->mm since actual work processing may need it Bijan Mottahedeh
2020-04-09 22:03 ` [RFC 1/1] " Bijan Mottahedeh
2020-04-10  8:47   ` Pavel Begunkov
2020-04-10 16:54     ` Bijan Mottahedeh
2020-04-10 17:51       ` Pavel Begunkov
2020-04-10 17:57         ` Pavel Begunkov
2020-04-10 19:09         ` Bijan Mottahedeh
2020-04-11  2:17           ` Jens Axboe
2020-04-16 20:24             ` Minchan Kim
2020-04-16 20:30               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).