Re: [PATCH 1/3] io_uring: add support for async work inheriting files table

From: Jens Axboe <axboe@kernel.dk>
To: Jann Horn <jannh@google.com>
Cc: linux-block@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH 1/3] io_uring: add support for async work inheriting files table
Date: Fri, 18 Oct 2019 12:16:41 -0600	[thread overview]
Message-ID: <0a3de9b2-3d3a-07b5-0e1c-515f610fbf75@kernel.dk> (raw)
In-Reply-To: <CAG48ez2ZQBVEe8yYRwWX2=TMYWsJ=tK44NM+wqiLW2AmfYEcHw@mail.gmail.com>

On 10/18/19 12:06 PM, Jann Horn wrote:
> On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/18/19 10:36 AM, Jens Axboe wrote:
>>>> Ignoring the locking elision, basically the logic is now this:
>>>>
>>>> static void io_sq_wq_submit_work(struct work_struct *work)
>>>> {
>>>>            struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>>>            struct files_struct *cur_files = NULL, *old_files;
>>>>            [...]
>>>>            old_files = current->files;
>>>>            [...]
>>>>            do {
>>>>                    struct sqe_submit *s = &req->submit;
>>>>                    [...]
>>>>                    if (cur_files)
>>>>                            /* drop cur_files reference; borrow lifetime must
>>>>                             * end before here */
>>>>                            put_files_struct(cur_files);
>>>>                    /* move reference ownership to cur_files */
>>>>                    cur_files = s->files;
>>>>                    if (cur_files) {
>>>>                            task_lock(current);
>>>>                            /* current->files borrows reference from cur_files;
>>>>                             * existing borrow from previous loop ends here */
>>>>                            current->files = cur_files;
>>>>                            task_unlock(current);
>>>>                    }
>>>>
>>>>                    [call __io_submit_sqe()]
>>>>                    [...]
>>>>            } while (req);
>>>>            [...]
>>>>            /* existing borrow ends here */
>>>>            task_lock(current);
>>>>            current->files = old_files;
>>>>            task_unlock(current);
>>>>            if (cur_files)
>>>>                    /* drop cur_files reference; borrow lifetime must
>>>>                     * end before here */
>>>>                    put_files_struct(cur_files);
>>>> }
>>>>
>>>> If you run two iterations of this loop, with a first element that has
>>>> a ->files pointer and a second element that doesn't, then in the
>>>> second run through the loop, the reference to the files_struct will be
>>>> dropped while current->files still points to it; current->files is
>>>> only reset after the loop has ended. If someone accesses
>>>> current->files through procfs directly after that, AFAICS you'd get a
>>>> use-after-free.
>>>
>>> Amazing how this is still broken. You are right, and it's especially
>>> annoying since that's exactly the case I originally talked about (not
>>> flipping current->files if we don't have to). I just did it wrong, so
>>> we'll leave a dangling pointer in ->files.
>>>
>>> The by far most common case is if one sqe has a files it needs to
>>> attach, then others that also have files will be the same set. So I want
>>> to optimize for the case where we only flip current->files once when we
>>> see the files, and once when we're done with the loop.
>>>
>>> Let me see if I can get this right...
>>
>> I _think_ the simplest way to do it is simply to have both cur_files and
>> current->files hold a reference to the file table. That won't really add
>> any extra cost as the double increments / decrements are following each
>> other. Something like this incremental, totally untested.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2fed0badad38..b3cf3f3d7911 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                          put_files_struct(cur_files);
>>                  cur_files = s->files;
>>                  if (cur_files && cur_files != current->files) {
>> +                       struct files_struct *old;
>> +
>> +                       atomic_inc(&cur_files->count);
>>                          task_lock(current);
>> +                       old = current->files;
>>                          current->files = cur_files;
>>                          task_unlock(current);
>> +                       put_files_struct(old);
>>                  }
>>
>>                  if (!ret) {
>> @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                  mmput(cur_mm);
>>          }
>>          if (old_files != current->files) {
>> +               struct files_struct *old;
>> +
>>                  task_lock(current);
>> +               old = current->files;
>>                  current->files = old_files;
>>                  task_unlock(current);
>> +               put_files_struct(old);
>>          }
>>          if (cur_files)
>>                  put_files_struct(cur_files);
> 
> The only part I still feel a bit twitchy about is this part at the end:
> 
>          if (old_files != current->files) {
>                  struct files_struct *old;
> 
>                  task_lock(current);
>                  old = current->files;
>                  current->files = old_files;
>                  task_unlock(current);
>                  put_files_struct(old);
>          }
> 
> If it was possible for the initial ->files to be the same as the
> ->files of a submission, and we got two submissions with first a
> different files_struct and then our old one, then this branch would
> not be executed even though it should, which would leave the refcount
> of the files_struct one too high. But that probably can't happen?
> Since kernel workers should be running with &init_files (I think?) and
> that thing is never used for userspace tasks. But still, I'd feel
> better if you could change it like this:

Right, that is never going to happen. But your solution is simpler,
so... I'll throw some testing at it.

> But actually, by the way: Is this whole files_struct thing creating a
> reference loop? The files_struct has a reference to the uring file,
> and the uring file has ACCEPT work that has a reference to the
> files_struct. If the task gets killed and the accept work blocks, the
> entire files_struct will stay alive, right?

Yes, for the lifetime of the request, it does create a loop. So if the
application goes away, I think you're right, the files_struct will stay.
And so will the io_uring, for that matter, as we depend on the closing
of the files to do the final reap.

Hmm, not sure how best to handle that, to be honest. We need some way to
break the loop, if the request never finishes.

-- 
Jens Axboe