Re: Loophole in async page I/O

From: Jens Axboe <axboe@kernel.dk>
To: Matthew Wilcox <willy@infradead.org>, io-uring@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>, Hao Xu <haoxu@linux.alibaba.com>
Subject: Re: Loophole in async page I/O
Date: Mon, 12 Oct 2020 16:22:43 -0600
Message-ID: <0a2918fc-b2e4-bea0-c7e1-265a3da65fc9@kernel.dk>
In-Reply-To: <14d97ab3-edf7-c72a-51eb-d335e2768b65@kernel.dk>

On 10/12/20 4:08 PM, Jens Axboe wrote:
> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>> This one's pretty unlikely, but there's a case in buffered reads where
>> an IOCB_WAITQ read can end up sleeping.
>>
>> generic_file_buffered_read():
>>                 page = find_get_page(mapping, index);
>> ...
>>                 if (!PageUptodate(page)) {
>> ...
>>                         if (iocb->ki_flags & IOCB_WAITQ) {
>> ...
>>                                 error = wait_on_page_locked_async(page,
>>                                                                 iocb->ki_waitq);
>> wait_on_page_locked_async():
>>         if (!PageLocked(page))
>>                 return 0;
>> (back to generic_file_buffered_read):
>>                         if (!mapping->a_ops->is_partially_uptodate(page,
>>                                                         offset, iter->count))
>>                                 goto page_not_up_to_date_locked;
>>
>> page_not_up_to_date_locked:
>>                 if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>                         unlock_page(page);
>>                         put_page(page);
>>                         goto would_block;
>>                 }
>> ...
>>                 error = mapping->a_ops->readpage(filp, page);
>> (will unlock page on I/O completion)
>>                 if (!PageUptodate(page)) {
>>                         error = lock_page_killable(page);
>>
>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>> and wait for the I/O to complete.  I can't quite figure out if this is
>> intentional -- I think not; if I understand the semantics right, we
>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>> kick off the I/O and wait.
>>
>> I think the right fix is to return -EIOCBQUEUED from
>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>
>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>                                      struct wait_page_queue *wait)
>>  {
>>         if (!PageLocked(page))
>> -               return 0;
>> +               return -EIOCBQUEUED;
>>         return __wait_on_page_locked_async(compound_head(page), wait, false);
>>  }
>>  
>> But as I said, I'm not sure what the semantics are supposed to be.
> 
> If NOWAIT isn't set, then the issue attempt is from the helper thread
> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
> matter for this discussion). So it's totally fine and expected to block
> at that point.
> 
> Hmm actually, I believe that:
> 
> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
> Author: Hao Xu <haoxu@linux.alibaba.com>
> Date:   Tue Sep 29 20:00:45 2020 +0800
> 
>     io_uring: fix async buffered reads when readahead is disabled
> 
> maybe messed up that case, so we could block off the retry-path. I'll
> take a closer look, looks like that can be the case if read-ahead is
> disabled.
> 
> In general, we can only return -EIOCBQUEUED if the IO has been started
> or is in progress already. That means we can safely rely on being told
> when it's unlocked/done. If we need to block, we should be returning
> -EAGAIN, which would punt to a worker thread.

Something like the below might be a better solution - just always use
the read-ahead path to generate the IO for the requested range. The
forced read-ahead won't issue any IO beyond what we asked for. And make
sure we don't clear IOCB_NOWAIT on the io_uring side for the retry.

Totally untested, just trying to get the idea across. We might need
some low cap on req_count in case the range is large. Hao Xu, can you
try this? Thinking of the slowdown you saw with read-ahead disabled,
this could very well be the reason for it.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aae0ef2ec34d..9a2dfe132665 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3107,7 +3107,6 @@ static bool io_rw_should_retry(struct io_kiocb *req)
 	wait->wait.flags = 0;
 	INIT_LIST_HEAD(&wait->wait.entry);
 	kiocb->ki_flags |= IOCB_WAITQ;
-	kiocb->ki_flags &= ~IOCB_NOWAIT;
 	kiocb->ki_waitq = wait;
 
 	io_get_req_task(req);
diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..693af86d171d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -568,15 +568,16 @@ void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra, struct file *filp,
 			       pgoff_t index, unsigned long req_count)
 {
-	/* no read-ahead */
-	if (!ra->ra_pages)
-		return;
-
 	if (blk_cgroup_congested())
 		return;
 
-	/* be dumb */
-	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+	/*
+	 * Even if read-ahead is disabled, issue this request as read-ahead
+	 * as we'll need it to satisfy the requested range. The forced
+	 * read-ahead will do the right thing and limit the read to just the
+	 * requested range.
+	 */
+	if (!ra->ra_pages || (filp && (filp->f_mode & FMODE_RANDOM))) {
 		force_page_cache_readahead(mapping, filp, index, req_count);
 		return;
 	}

-- 
Jens Axboe