* Loophole in async page I/O
@ 2020-10-12 21:13 Matthew Wilcox
  2020-10-12 22:08 ` Jens Axboe
  2020-10-13  5:13 ` Hao_Xu
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2020-10-12 21:13 UTC (permalink / raw)
  To: io-uring; +Cc: Johannes Weiner, Jens Axboe

This one's pretty unlikely, but there's a case in buffered reads where
an IOCB_WAITQ read can end up sleeping.

generic_file_buffered_read():
                page = find_get_page(mapping, index);
...
                if (!PageUptodate(page)) {
...
                        if (iocb->ki_flags & IOCB_WAITQ) {
...
                                error = wait_on_page_locked_async(page,
                                                                iocb->ki_waitq);
wait_on_page_locked_async():
        if (!PageLocked(page))
                return 0;
(back to generic_file_buffered_read):
                        if (!mapping->a_ops->is_partially_uptodate(page,
                                                        offset, iter->count))
                                goto page_not_up_to_date_locked;

page_not_up_to_date_locked:
                if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
                        unlock_page(page);
                        put_page(page);
                        goto would_block;
                }
...
                error = mapping->a_ops->readpage(filp, page);
(will unlock page on I/O completion)
                if (!PageUptodate(page)) {
                        error = lock_page_killable(page);

So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
and wait for the I/O to complete.  I can't quite figure out if this is
intentional -- I think not; if I understand the semantics right, we
should be returning -EIOCBQUEUED and punting to an I/O thread to
kick off the I/O and wait.

I think the right fix is to return -EIOCBQUEUED from
wait_on_page_locked_async() if the page isn't locked.  ie this:

@@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
                                     struct wait_page_queue *wait)
 {
        if (!PageLocked(page))
-               return 0;
+               return -EIOCBQUEUED;
        return __wait_on_page_locked_async(compound_head(page), wait, false);
 }
 
But as I said, I'm not sure what the semantics are supposed to be.
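
For anyone who wants to poke at this, a minimal userspace sketch that
drives the path above is just a plain buffered read submitted through
io_uring (liburing assumed; filename and sizes are arbitrary, error
handling omitted):

/*
 * Minimal reproducer sketch: one buffered (non-O_DIRECT) read via
 * io_uring.  On a cache-cold page this goes through the IOCB_NOWAIT
 * attempt first, then the IOCB_WAITQ retry discussed above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        static char buf[4096];
        int fd = open("testfile", O_RDONLY);

        io_uring_queue_init(8, &ring, 0);
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}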

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 21:13 Loophole in async page I/O Matthew Wilcox
@ 2020-10-12 22:08 ` Jens Axboe
  2020-10-12 22:22   ` Jens Axboe
  2020-10-13  5:31   ` Hao_Xu
  2020-10-13  5:13 ` Hao_Xu
  1 sibling, 2 replies; 14+ messages in thread
From: Jens Axboe @ 2020-10-12 22:08 UTC (permalink / raw)
  To: Matthew Wilcox, io-uring; +Cc: Johannes Weiner

On 10/12/20 3:13 PM, Matthew Wilcox wrote:
> This one's pretty unlikely, but there's a case in buffered reads where
> an IOCB_WAITQ read can end up sleeping.
> 
> generic_file_buffered_read():
>                 page = find_get_page(mapping, index);
> ...
>                 if (!PageUptodate(page)) {
> ...
>                         if (iocb->ki_flags & IOCB_WAITQ) {
> ...
>                                 error = wait_on_page_locked_async(page,
>                                                                 iocb->ki_waitq);
> wait_on_page_locked_async():
>         if (!PageLocked(page))
>                 return 0;
> (back to generic_file_buffered_read):
>                         if (!mapping->a_ops->is_partially_uptodate(page,
>                                                         offset, iter->count))
>                                 goto page_not_up_to_date_locked;
> 
> page_not_up_to_date_locked:
>                 if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>                         unlock_page(page);
>                         put_page(page);
>                         goto would_block;
>                 }
> ...
>                 error = mapping->a_ops->readpage(filp, page);
> (will unlock page on I/O completion)
>                 if (!PageUptodate(page)) {
>                         error = lock_page_killable(page);
> 
> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
> and wait for the I/O to complete.  I can't quite figure out if this is
> intentional -- I think not; if I understand the semantics right, we
> should be returning -EIOCBQUEUED and punting to an I/O thread to
> kick off the I/O and wait.
> 
> I think the right fix is to return -EIOCBQUEUED from
> wait_on_page_locked_async() if the page isn't locked.  ie this:
> 
> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>                                      struct wait_page_queue *wait)
>  {
>         if (!PageLocked(page))
> -               return 0;
> +               return -EIOCBQUEUED;
>         return __wait_on_page_locked_async(compound_head(page), wait, false);
>  }
>  
> But as I said, I'm not sure what the semantics are supposed to be.

If NOWAIT isn't set, then the issue attempt is from the helper thread
already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
matter for this discussion). So it's totally fine and expected to block
at that point.

Hmm actually, I believe that:

commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
Author: Hao Xu <haoxu@linux.alibaba.com>
Date:   Tue Sep 29 20:00:45 2020 +0800

    io_uring: fix async buffered reads when readahead is disabled

maybe messed up that case, so we could block off the retry-path. I'll
take a closer look, looks like that can be the case if read-ahead is
disabled.

In general, we can only return -EIOCBQUEUED if the IO has been started
or is in progress already. That means we can safely rely on being told
when it's unlocked/done. If we need to block, we should be returning
-EAGAIN, which would punt to a worker thread.
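
Roughly, the contract at the issue side looks like this (a condensed
sketch only; the helper names approximate the current fs/io_uring.c
internals, not literal code):

/* Sketch of the -EIOCBQUEUED vs -EAGAIN contract described above. */
static void io_issue_buffered_read(struct io_kiocb *req, struct kiocb *kiocb,
                                   struct iov_iter *iter)
{
        ssize_t ret = call_read_iter(req->file, kiocb, iter);

        if (ret == -EIOCBQUEUED) {
                /* IO already in flight; the ki_waitq callback will
                 * rearm and retry when the page is unlocked. */
                return;
        }
        if (ret == -EAGAIN) {
                /* nothing started; punt to io-wq, where blocking is OK */
                io_queue_async_work(req);
                return;
        }
        kiocb->ki_complete(kiocb, ret, 0);      /* done inline, or hard error */
}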

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 22:08 ` Jens Axboe
@ 2020-10-12 22:22   ` Jens Axboe
  2020-10-12 22:42     ` Jens Axboe
  2020-10-13  5:31   ` Hao_Xu
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-10-12 22:22 UTC (permalink / raw)
  To: Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Hao_Xu

On 10/12/20 4:08 PM, Jens Axboe wrote:
> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>> This one's pretty unlikely, but there's a case in buffered reads where
>> an IOCB_WAITQ read can end up sleeping.
>>
>> generic_file_buffered_read():
>>                 page = find_get_page(mapping, index);
>> ...
>>                 if (!PageUptodate(page)) {
>> ...
>>                         if (iocb->ki_flags & IOCB_WAITQ) {
>> ...
>>                                 error = wait_on_page_locked_async(page,
>>                                                                 iocb->ki_waitq);
>> wait_on_page_locked_async():
>>         if (!PageLocked(page))
>>                 return 0;
>> (back to generic_file_buffered_read):
>>                         if (!mapping->a_ops->is_partially_uptodate(page,
>>                                                         offset, iter->count))
>>                                 goto page_not_up_to_date_locked;
>>
>> page_not_up_to_date_locked:
>>                 if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>                         unlock_page(page);
>>                         put_page(page);
>>                         goto would_block;
>>                 }
>> ...
>>                 error = mapping->a_ops->readpage(filp, page);
>> (will unlock page on I/O completion)
>>                 if (!PageUptodate(page)) {
>>                         error = lock_page_killable(page);
>>
>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>> and wait for the I/O to complete.  I can't quite figure out if this is
>> intentional -- I think not; if I understand the semantics right, we
>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>> kick off the I/O and wait.
>>
>> I think the right fix is to return -EIOCBQUEUED from
>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>
>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>                                      struct wait_page_queue *wait)
>>  {
>>         if (!PageLocked(page))
>> -               return 0;
>> +               return -EIOCBQUEUED;
>>         return __wait_on_page_locked_async(compound_head(page), wait, false);
>>  }
>>  
>> But as I said, I'm not sure what the semantics are supposed to be.
> 
> If NOWAIT isn't set, then the issue attempt is from the helper thread
> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
> matter for this discussion). So it's totally fine and expected to block
> at that point.
> 
> Hmm actually, I believe that:
> 
> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
> Author: Hao Xu <haoxu@linux.alibaba.com>
> Date:   Tue Sep 29 20:00:45 2020 +0800
> 
>     io_uring: fix async buffered reads when readahead is disabled
> 
> maybe messed up that case, so we could block off the retry-path. I'll
> take a closer look, looks like that can be the case if read-ahead is
> disabled.
> 
> In general, we can only return -EIOCBQUEUED if the IO has been started
> or is in progress already. That means we can safely rely on being told
> when it's unlocked/done. If we need to block, we should be returning
> -EAGAIN, which would punt to a worker thread.

Something like the below might be a better solution - just always use
the read-ahead to generate the IO, for the requested range. That won't
issue any IO beyond what we asked for. And ensure we don't clear NOWAIT
on the io_uring side for retry.

Totally untested... Just trying to get the idea across. We might need
some low cap on req_count in case the range is large. Hao Xu, can you
try with this? Thinking of your read-ahead disabled slowdown as well,
this could very well be the reason why.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index aae0ef2ec34d..9a2dfe132665 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3107,7 +3107,6 @@ static bool io_rw_should_retry(struct io_kiocb *req)
 	wait->wait.flags = 0;
 	INIT_LIST_HEAD(&wait->wait.entry);
 	kiocb->ki_flags |= IOCB_WAITQ;
-	kiocb->ki_flags &= ~IOCB_NOWAIT;
 	kiocb->ki_waitq = wait;
 
 	io_get_req_task(req);
diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..693af86d171d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -568,15 +568,16 @@ void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra, struct file *filp,
 			       pgoff_t index, unsigned long req_count)
 {
-	/* no read-ahead */
-	if (!ra->ra_pages)
-		return;
-
 	if (blk_cgroup_congested())
 		return;
 
-	/* be dumb */
-	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+	/*
+	 * Even if read-ahead is disabled, issue this request as read-ahead
+	 * as we'll need it to satisfy the requested range. The forced
+	 * read-ahead will do the right thing and limit the read to just the
+	 * requested range.
+	 */
+	if (!ra->ra_pages || (filp && (filp->f_mode & FMODE_RANDOM))) {
 		force_page_cache_readahead(mapping, filp, index, req_count);
 		return;
 	}

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 22:22   ` Jens Axboe
@ 2020-10-12 22:42     ` Jens Axboe
  2020-10-14 20:31       ` Hao_Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-10-12 22:42 UTC (permalink / raw)
  To: Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Hao_Xu, Andrew Morton

On 10/12/20 4:22 PM, Jens Axboe wrote:
> On 10/12/20 4:08 PM, Jens Axboe wrote:
>> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>>> This one's pretty unlikely, but there's a case in buffered reads where
>>> an IOCB_WAITQ read can end up sleeping.
>>>
>>> generic_file_buffered_read():
>>>                 page = find_get_page(mapping, index);
>>> ...
>>>                 if (!PageUptodate(page)) {
>>> ...
>>>                         if (iocb->ki_flags & IOCB_WAITQ) {
>>> ...
>>>                                 error = wait_on_page_locked_async(page,
>>>                                                                 iocb->ki_waitq);
>>> wait_on_page_locked_async():
>>>         if (!PageLocked(page))
>>>                 return 0;
>>> (back to generic_file_buffered_read):
>>>                         if (!mapping->a_ops->is_partially_uptodate(page,
>>>                                                         offset, iter->count))
>>>                                 goto page_not_up_to_date_locked;
>>>
>>> page_not_up_to_date_locked:
>>>                 if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>>                         unlock_page(page);
>>>                         put_page(page);
>>>                         goto would_block;
>>>                 }
>>> ...
>>>                 error = mapping->a_ops->readpage(filp, page);
>>> (will unlock page on I/O completion)
>>>                 if (!PageUptodate(page)) {
>>>                         error = lock_page_killable(page);
>>>
>>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>>> and wait for the I/O to complete.  I can't quite figure out if this is
>>> intentional -- I think not; if I understand the semantics right, we
>>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>>> kick off the I/O and wait.
>>>
>>> I think the right fix is to return -EIOCBQUEUED from
>>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>>
>>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>>                                      struct wait_page_queue *wait)
>>>  {
>>>         if (!PageLocked(page))
>>> -               return 0;
>>> +               return -EIOCBQUEUED;
>>>         return __wait_on_page_locked_async(compound_head(page), wait, false);
>>>  }
>>>  
>>> But as I said, I'm not sure what the semantics are supposed to be.
>>
>> If NOWAIT isn't set, then the issue attempt is from the helper thread
>> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
>> matter for this discussion). So it's totally fine and expected to block
>> at that point.
>>
>> Hmm actually, I believe that:
>>
>> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
>> Author: Hao Xu <haoxu@linux.alibaba.com>
>> Date:   Tue Sep 29 20:00:45 2020 +0800
>>
>>     io_uring: fix async buffered reads when readahead is disabled
>>
>> maybe messed up that case, so we could block off the retry-path. I'll
>> take a closer look, looks like that can be the case if read-ahead is
>> disabled.
>>
>> In general, we can only return -EIOCBQUEUED if the IO has been started
>> or is in progress already. That means we can safely rely on being told
>> when it's unlocked/done. If we need to block, we should be returning
>> -EAGAIN, which would punt to a worker thread.
> 
> Something like the below might be a better solution - just always use
> the read-ahead to generate the IO, for the requested range. That won't
> issue any IO beyond what we asked for. And ensure we don't clear NOWAIT
> on the io_uring side for retry.
> 
> Totally untested... Just trying to get the idea across. We might need
> some low cap on req_count in case the range is large. Hao Xu, can you
> try with this? Thinking of your read-ahead disabled slowdown as well,
> this could very well be the reason why.

Here's one that caps us at 1 page, if read-ahead is disabled or we're
congested. Should still be fine in terms of being async, and it allows
us to use the same path for this instead of special casing it.

I ran some quick testing on this, and it seems to Work For Me. I'll do
some more targeted testing.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aae0ef2ec34d..9a2dfe132665 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3107,7 +3107,6 @@ static bool io_rw_should_retry(struct io_kiocb *req)
 	wait->wait.flags = 0;
 	INIT_LIST_HEAD(&wait->wait.entry);
 	kiocb->ki_flags |= IOCB_WAITQ;
-	kiocb->ki_flags &= ~IOCB_NOWAIT;
 	kiocb->ki_waitq = wait;
 
 	io_get_req_task(req);
diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..d0f556612fd6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -568,15 +568,20 @@ void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra, struct file *filp,
 			       pgoff_t index, unsigned long req_count)
 {
-	/* no read-ahead */
-	if (!ra->ra_pages)
-		return;
+	bool do_forced_ra = filp && (filp->f_mode & FMODE_RANDOM);
 
-	if (blk_cgroup_congested())
-		return;
+	/*
+	 * Even if read-ahead is disabled, issue this request as read-ahead
+	 * as we'll need it to satisfy the requested range. The forced
+	 * read-ahead will do the right thing and limit the read to just the
+	 * requested range, which we'll set to 1 page for this case.
+	 */
+	if (!ra->ra_pages || blk_cgroup_congested()) {
+		req_count = 1;
+		do_forced_ra = true;
+	}
 
-	/* be dumb */
-	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+	if (do_forced_ra) {
 		force_page_cache_readahead(mapping, filp, index, req_count);
 		return;
 	}

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 21:13 Loophole in async page I/O Matthew Wilcox
  2020-10-12 22:08 ` Jens Axboe
@ 2020-10-13  5:13 ` Hao_Xu
  2020-10-13 12:01   ` Matthew Wilcox
  1 sibling, 1 reply; 14+ messages in thread
From: Hao_Xu @ 2020-10-13  5:13 UTC (permalink / raw)
  To: Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Jens Axboe

On 2020/10/13 5:13 AM, Matthew Wilcox wrote:
> This one's pretty unlikely, but there's a case in buffered reads where
> an IOCB_WAITQ read can end up sleeping.
> 
> generic_file_buffered_read():
>                  page = find_get_page(mapping, index);
> ...
>                  if (!PageUptodate(page)) {
> ...
>                          if (iocb->ki_flags & IOCB_WAITQ) {
> ...
>                                  error = wait_on_page_locked_async(page,
>                                                                  iocb->ki_waitq);
> wait_on_page_locked_async():
>          if (!PageLocked(page))
>                  return 0;
> (back to generic_file_buffered_read):
>                          if (!mapping->a_ops->is_partially_uptodate(page,
>                                                          offset, iter->count))
>                                  goto page_not_up_to_date_locked;
> 
> page_not_up_to_date_locked:
>                  if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>                          unlock_page(page);
>                          put_page(page);
>                          goto would_block;
>                  }
> ...
>                  error = mapping->a_ops->readpage(filp, page);
> (will unlock page on I/O completion)
>                  if (!PageUptodate(page)) {
>                          error = lock_page_killable(page);
> 
> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
> and wait for the I/O to complete.  I can't quite figure out if this is
> intentional -- I think not; if I understand the semantics right, we
> should be returning -EIOCBQUEUED and punting to an I/O thread to
> kick off the I/O and wait.
> 
> I think the right fix is to return -EIOCBQUEUED from
> wait_on_page_locked_async() if the page isn't locked.  ie this:
> 
> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>                                       struct wait_page_queue *wait)
>   {
>          if (!PageLocked(page))
> -               return 0;
> +               return -EIOCBQUEUED;
>          return __wait_on_page_locked_async(compound_head(page), wait, false);
>   }
>   
> But as I said, I'm not sure what the semantics are supposed to be.
> 
Hi Matthew,
which kernel version are you using? I believe I've fixed this case in
commit c8d317aa1887b40b188ec3aaa6e9e524333caed1.
In that commit, I made this modification:

diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..ea383478fc22 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2267,7 +2267,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
                 }

                 if (!PageUptodate(page)) {
-                       error = lock_page_killable(page);
+                       if (iocb->ki_flags & IOCB_WAITQ)
+                               error = lock_page_async(page, iocb->ki_waitq);
+                       else
+                               error = lock_page_killable(page);
+
                         if (unlikely(error))
                                 goto readpage_error;
                         if (!PageUptodate(page)) {

lock_page_killable() won't be called in this case.
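
For reference, the helper it goes through instead looks roughly like
this in 5.9's include/linux/pagemap.h; __lock_page_async() returns
-EIOCBQUEUED once the wait entry has been queued, so the caller never
sleeps:

static inline int lock_page_async(struct page *page,
                                  struct wait_page_queue *wait)
{
        if (!trylock_page(page))
                return __lock_page_async(page, wait);
        return 0;
}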

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 22:08 ` Jens Axboe
  2020-10-12 22:22   ` Jens Axboe
@ 2020-10-13  5:31   ` Hao_Xu
  2020-10-13 17:50     ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: Hao_Xu @ 2020-10-13  5:31 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, io-uring; +Cc: Johannes Weiner

On 2020/10/13 6:08 AM, Jens Axboe wrote:
> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>> This one's pretty unlikely, but there's a case in buffered reads where
>> an IOCB_WAITQ read can end up sleeping.
>>
>> generic_file_buffered_read():
>>                  page = find_get_page(mapping, index);
>> ...
>>                  if (!PageUptodate(page)) {
>> ...
>>                          if (iocb->ki_flags & IOCB_WAITQ) {
>> ...
>>                                  error = wait_on_page_locked_async(page,
>>                                                                  iocb->ki_waitq);
>> wait_on_page_locked_async():
>>          if (!PageLocked(page))
>>                  return 0;
>> (back to generic_file_buffered_read):
>>                          if (!mapping->a_ops->is_partially_uptodate(page,
>>                                                          offset, iter->count))
>>                                  goto page_not_up_to_date_locked;
>>
>> page_not_up_to_date_locked:
>>                  if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>                          unlock_page(page);
>>                          put_page(page);
>>                          goto would_block;
>>                  }
>> ...
>>                  error = mapping->a_ops->readpage(filp, page);
>> (will unlock page on I/O completion)
>>                  if (!PageUptodate(page)) {
>>                          error = lock_page_killable(page);
>>
>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>> and wait for the I/O to complete.  I can't quite figure out if this is
>> intentional -- I think not; if I understand the semantics right, we
>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>> kick off the I/O and wait.
>>
>> I think the right fix is to return -EIOCBQUEUED from
>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>
>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>                                       struct wait_page_queue *wait)
>>   {
>>          if (!PageLocked(page))
>> -               return 0;
>> +               return -EIOCBQUEUED;
>>          return __wait_on_page_locked_async(compound_head(page), wait, false);
>>   }
>>   
>> But as I said, I'm not sure what the semantics are supposed to be.
> 
> If NOWAIT isn't set, then the issue attempt is from the helper thread
> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
> matter for this discussion). So it's totally fine and expected to block
> at that point.
> 
> Hmm actually, I believe that:
> 
> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
> Author: Hao Xu <haoxu@linux.alibaba.com>
> Date:   Tue Sep 29 20:00:45 2020 +0800
> 
>      io_uring: fix async buffered reads when readahead is disabled
> 
> maybe messed up that case, so we could block off the retry-path. I'll
> take a closer look, looks like that can be the case if read-ahead is
> disabled.
> 
> In general, we can only return -EIOCBQUEUED if the IO has been started
> or is in progress already. That means we can safely rely on being told
> when it's unlocked/done. If we need to block, we should be returning
> -EAGAIN, which would punt to a worker thread.
> 
Hi Jens,
My understanding of the io_uring buffered read process, after commit
c8d317aa1887b40b188ec3aaa6e9e524333caed1 was merged, is:
the first io_uring issue attempt is with IOCB_NOWAIT, and the retry in
the same context is with IOCB_WAITQ but without IOCB_NOWAIT.
So in Matthew's case, lock_page_async() will be called after calling
mapping->a_ops->readpage(), and it won't end up sleeping.
This case is exactly what happens when readahead is disabled or somehow
skipped for reasons like blk_cgroup_congested() returning true, and it
is what my commit c8d317aa1887b40b188e is for.
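
In code terms, the second pass is set up by io_rw_should_retry()
(condensed from the hunk quoted elsewhere in this thread; `wait` points
at the request's wait_page_queue):

        wait->wait.func = io_async_buf_func;
        wait->wait.private = req;
        wait->wait.flags = 0;
        INIT_LIST_HEAD(&wait->wait.entry);
        kiocb->ki_flags |= IOCB_WAITQ;
        kiocb->ki_flags &= ~IOCB_NOWAIT;        /* pass 2 is allowed to block */
        kiocb->ki_waitq = wait;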

Regards,
Hao


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-13  5:13 ` Hao_Xu
@ 2020-10-13 12:01   ` Matthew Wilcox
  2020-10-13 19:57     ` Hao_Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2020-10-13 12:01 UTC (permalink / raw)
  To: Hao_Xu; +Cc: io-uring, Johannes Weiner, Jens Axboe

On Tue, Oct 13, 2020 at 01:13:48PM +0800, Hao_Xu wrote:
> On 2020/10/13 5:13 AM, Matthew Wilcox wrote:
> > This one's pretty unlikely, but there's a case in buffered reads where
> > an IOCB_WAITQ read can end up sleeping.
> > 
> > generic_file_buffered_read():
> >                  page = find_get_page(mapping, index);
> > ...
> >                  if (!PageUptodate(page)) {
> > ...
> >                          if (iocb->ki_flags & IOCB_WAITQ) {
> > ...
> >                                  error = wait_on_page_locked_async(page,
> >                                                                  iocb->ki_waitq);
> > wait_on_page_locked_async():
> >          if (!PageLocked(page))
> >                  return 0;
> > (back to generic_file_buffered_read):
> >                          if (!mapping->a_ops->is_partially_uptodate(page,
> >                                                          offset, iter->count))
> >                                  goto page_not_up_to_date_locked;
> > 
> > page_not_up_to_date_locked:
> >                  if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
> >                          unlock_page(page);
> >                          put_page(page);
> >                          goto would_block;
> >                  }
> > ...
> >                  error = mapping->a_ops->readpage(filp, page);
> > (will unlock page on I/O completion)
> >                  if (!PageUptodate(page)) {
> >                          error = lock_page_killable(page);
> > 
> > So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
> > and wait for the I/O to complete.  I can't quite figure out if this is
> > intentional -- I think not; if I understand the semantics right, we
> > should be returning -EIOCBQUEUED and punting to an I/O thread to
> > kick off the I/O and wait.
> > 
> > I think the right fix is to return -EIOCBQUEUED from
> > wait_on_page_locked_async() if the page isn't locked.  ie this:
> > 
> > @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
> >                                       struct wait_page_queue *wait)
> >   {
> >          if (!PageLocked(page))
> > -               return 0;
> > +               return -EIOCBQUEUED;
> >          return __wait_on_page_locked_async(compound_head(page), wait, false);
> >   }
> > But as I said, I'm not sure what the semantics are supposed to be.
> > 
> Hi Matthew,
> which kernel version are you using? I believe I've fixed this case in
> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1

Ah, I don't have that commit in my tree.

Nevertheless, there is still a problem.  The ->readpage implementation
is not required to execute asynchronously.  For example, it may enter
page reclaim by using GFP_KERNEL.  Indeed, I feel it is better if it
works synchronously as it can then report the actual error from an I/O
instead of the almost-meaningless -EIO.
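
To make that concrete, a synchronous implementation has this shape (the
myfs_* names are hypothetical):

/*
 * Hypothetical synchronous ->readpage: the read, and any GFP_KERNEL
 * allocation it triggers (which may enter reclaim), completes before
 * returning, so the real error can be passed up instead of a later
 * catch-all -EIO.
 */
static int myfs_readpage(struct file *file, struct page *page)
{
        int err = myfs_read_page_sync(file, page);      /* may sleep */

        if (!err)
                SetPageUptodate(page);
        unlock_page(page);
        return err;
}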

This patch series documents 12 filesystems which implement ->readpage
in a synchronous way today (for at least some cases) and converts iomap
to be synchronous (making two more filesystems synchronous).

https://lore.kernel.org/linux-fsdevel/20201009143104.22673-1-willy@infradead.org/


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-13  5:31   ` Hao_Xu
@ 2020-10-13 17:50     ` Jens Axboe
  2020-10-13 19:50       ` Hao_Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-10-13 17:50 UTC (permalink / raw)
  To: Hao_Xu, Matthew Wilcox, io-uring; +Cc: Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 4211 bytes --]

On 10/12/20 11:31 PM, Hao_Xu wrote:
> On 2020/10/13 6:08 AM, Jens Axboe wrote:
>> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>>> This one's pretty unlikely, but there's a case in buffered reads where
>>> an IOCB_WAITQ read can end up sleeping.
>>>
>>> generic_file_buffered_read():
>>>                  page = find_get_page(mapping, index);
>>> ...
>>>                  if (!PageUptodate(page)) {
>>> ...
>>>                          if (iocb->ki_flags & IOCB_WAITQ) {
>>> ...
>>>                                  error = wait_on_page_locked_async(page,
>>>                                                                  iocb->ki_waitq);
>>> wait_on_page_locked_async():
>>>          if (!PageLocked(page))
>>>                  return 0;
>>> (back to generic_file_buffered_read):
>>>                          if (!mapping->a_ops->is_partially_uptodate(page,
>>>                                                          offset, iter->count))
>>>                                  goto page_not_up_to_date_locked;
>>>
>>> page_not_up_to_date_locked:
>>>                  if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>>                          unlock_page(page);
>>>                          put_page(page);
>>>                          goto would_block;
>>>                  }
>>> ...
>>>                  error = mapping->a_ops->readpage(filp, page);
>>> (will unlock page on I/O completion)
>>>                  if (!PageUptodate(page)) {
>>>                          error = lock_page_killable(page);
>>>
>>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>>> and wait for the I/O to complete.  I can't quite figure out if this is
>>> intentional -- I think not; if I understand the semantics right, we
>>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>>> kick off the I/O and wait.
>>>
>>> I think the right fix is to return -EIOCBQUEUED from
>>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>>
>>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>>                                       struct wait_page_queue *wait)
>>>   {
>>>          if (!PageLocked(page))
>>> -               return 0;
>>> +               return -EIOCBQUEUED;
>>>          return __wait_on_page_locked_async(compound_head(page), wait, false);
>>>   }
>>>   
>>> But as I said, I'm not sure what the semantics are supposed to be.
>>
>> If NOWAIT isn't set, then the issue attempt is from the helper thread
>> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
>> matter for this discussion). So it's totally fine and expected to block
>> at that point.
>>
>> Hmm actually, I believe that:
>>
>> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
>> Author: Hao Xu <haoxu@linux.alibaba.com>
>> Date:   Tue Sep 29 20:00:45 2020 +0800
>>
>>      io_uring: fix async buffered reads when readahead is disabled
>>
>> maybe messed up that case, so we could block off the retry-path. I'll
>> take a closer look, looks like that can be the case if read-ahead is
>> disabled.
>>
>> In general, we can only return -EIOCBQUEUED if the IO has been started
>> or is in progress already. That means we can safely rely on being told
>> when it's unlocked/done. If we need to block, we should be returning
>> -EAGAIN, which would punt to a worker thread.
>>
> Hi Jens,
> My understanding of the io_uring buffered read process, after commit
> c8d317aa1887b40b188ec3aaa6e9e524333caed1 was merged, is:
> the first io_uring issue attempt is with IOCB_NOWAIT, and the retry in
> the same context is with IOCB_WAITQ but without IOCB_NOWAIT.
> So in Matthew's case, lock_page_async() will be called after calling
> mapping->a_ops->readpage(), and it won't end up sleeping.
> This case is exactly what happens when readahead is disabled or somehow
> skipped for reasons like blk_cgroup_congested() returning true, and it
> is what my commit c8d317aa1887b40b188e is for.

Well, try the patches. I agree it's not going to sleep with the previous
fix, but we're definitely driving lower utilization by not using the
read-ahead path when read-ahead is disabled.

Re-run your previous tests with these two applied and see what you get.

-- 
Jens Axboe


[-- Attachment #2: 0002-io_uring-don-t-clear-IOCB_NOWAIT-for-async-buffered-.patch --]
[-- Type: text/x-patch, Size: 1173 bytes --]

From 19185e0ea3a91a1d8b9c7e013a32f96bf006052a Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Mon, 12 Oct 2020 16:48:57 -0600
Subject: [PATCH 2/2] io_uring: don't clear IOCB_NOWAIT for async buffered
 retry

If we do, and read-ahead is disabled, we can be blocking on the page to
finish before making progress. This defeats the purpose of async IO.
Now that we know that read-ahead will most likely trigger the IO, we can
make progress even for ra_pages == 0 without punting to io-wq to satisfy
the IO in a blocking fashion.

Fixes: c8d317aa1887 ("io_uring: fix async buffered reads when readahead is disabled")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c043d889a2eb..be70f3e38fb2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3248,7 +3248,6 @@ static bool io_rw_should_retry(struct io_kiocb *req)
 	wait->wait.flags = 0;
 	INIT_LIST_HEAD(&wait->wait.entry);
 	kiocb->ki_flags |= IOCB_WAITQ;
-	kiocb->ki_flags &= ~IOCB_NOWAIT;
 	kiocb->ki_waitq = wait;
 	return true;
 }
-- 
2.28.0


[-- Attachment #3: 0001-readahead-use-limited-read-ahead-to-satisfy-read.patch --]
[-- Type: text/x-patch, Size: 2221 bytes --]

From 10b8c31e8085a85d5a71c7e271387c2edbcf7b96 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe@kernel.dk>
Date: Mon, 12 Oct 2020 16:44:23 -0600
Subject: [PATCH 1/2] readahead: use limited read-ahead to satisfy read

Willy reports that there's a case where async buffered reads will be
blocking, and that's due to not using read-ahead to generate the reads
when read-ahead is disabled. io_uring relies on read-ahead triggering
the reads, if not, it needs to fallback to threaded helpers.

For the case where read-ahead is disabled on the file, or if the cgroup
is congested, ensure that we can at least do 1 page of read-ahead to
make progress on the read in an async fashion. This could potentially be
larger, but it's not needed in terms of functionality, so let's err on
the side of caution, as larger counts of pages may run into reclaim
issues (particularly if we're congested).

Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/readahead.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..e5975f4e0ee5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -568,15 +568,21 @@ void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra, struct file *filp,
 			       pgoff_t index, unsigned long req_count)
 {
-	/* no read-ahead */
-	if (!ra->ra_pages)
-		return;
+	bool do_forced_ra = filp && (filp->f_mode & FMODE_RANDOM);
 
-	if (blk_cgroup_congested())
-		return;
+	/*
+	 * Even if read-ahead is disabled, start this request as read-ahead.
+	 * This makes regular read-ahead disabled use the same path as normal
+	 * reads, instead of having to punt to ->readpage() manually. We limit
+	 * ourselves to 1 page for this case, to avoid causing problems if
+	 * we're congested or tight on memory.
+	 */
+	if (!ra->ra_pages || blk_cgroup_congested()) {
+		req_count = 1;
+		do_forced_ra = true;
+	}
 
-	/* be dumb */
-	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+	if (do_forced_ra) {
 		force_page_cache_readahead(mapping, filp, index, req_count);
 		return;
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-13 17:50     ` Jens Axboe
@ 2020-10-13 19:50       ` Hao_Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao_Xu @ 2020-10-13 19:50 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, io-uring; +Cc: Johannes Weiner

On 2020/10/14 1:50 AM, Jens Axboe wrote:
> On 10/12/20 11:31 PM, Hao_Xu wrote:
>> On 2020/10/13 6:08 AM, Jens Axboe wrote:
>>> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>>>> This one's pretty unlikely, but there's a case in buffered reads where
>>>> an IOCB_WAITQ read can end up sleeping.
>>>>
>>>> generic_file_buffered_read():
>>>>                   page = find_get_page(mapping, index);
>>>> ...
>>>>                   if (!PageUptodate(page)) {
>>>> ...
>>>>                           if (iocb->ki_flags & IOCB_WAITQ) {
>>>> ...
>>>>                                   error = wait_on_page_locked_async(page,
>>>>                                                                   iocb->ki_waitq);
>>>> wait_on_page_locked_async():
>>>>           if (!PageLocked(page))
>>>>                   return 0;
>>>> (back to generic_file_buffered_read):
>>>>                           if (!mapping->a_ops->is_partially_uptodate(page,
>>>>                                                           offset, iter->count))
>>>>                                   goto page_not_up_to_date_locked;
>>>>
>>>> page_not_up_to_date_locked:
>>>>                   if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>>>                           unlock_page(page);
>>>>                           put_page(page);
>>>>                           goto would_block;
>>>>                   }
>>>> ...
>>>>                   error = mapping->a_ops->readpage(filp, page);
>>>> (will unlock page on I/O completion)
>>>>                   if (!PageUptodate(page)) {
>>>>                           error = lock_page_killable(page);
>>>>
>>>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>>>> and wait for the I/O to complete.  I can't quite figure out if this is
>>>> intentional -- I think not; if I understand the semantics right, we
>>>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>>>> kick off the I/O and wait.
>>>>
>>>> I think the right fix is to return -EIOCBQUEUED from
>>>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>>>
>>>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>>>                                        struct wait_page_queue *wait)
>>>>    {
>>>>           if (!PageLocked(page))
>>>> -               return 0;
>>>> +               return -EIOCBQUEUED;
>>>>           return __wait_on_page_locked_async(compound_head(page), wait, false);
>>>>    }
>>>>    
>>>> But as I said, I'm not sure what the semantics are supposed to be.
>>>
>>> If NOWAIT isn't set, then the issue attempt is from the helper thread
>>> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
>>> matter for this discussion). So it's totally fine and expected to block
>>> at that point.
>>>
>>> Hmm actually, I believe that:
>>>
>>> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
>>> Author: Hao Xu <haoxu@linux.alibaba.com>
>>> Date:   Tue Sep 29 20:00:45 2020 +0800
>>>
>>>       io_uring: fix async buffered reads when readahead is disabled
>>>
>>> maybe messed up that case, so we could block off the retry-path. I'll
>>> take a closer look, looks like that can be the case if read-ahead is
>>> disabled.
>>>
>>> In general, we can only return -EIOCBQUEUED if the IO has been started
>>> or is in progress already. That means we can safely rely on being told
>>> when it's unlocked/done. If we need to block, we should be returning
>>> -EAGAIN, which would punt to a worker thread.
>>>
>> Hi Jens,
>> My understanding of the io_uring buffered read process, after commit
>> c8d317aa1887b40b188ec3aaa6e9e524333caed1 was merged, is:
>> the first io_uring issue attempt is with IOCB_NOWAIT, and the retry in
>> the same context is with IOCB_WAITQ but without IOCB_NOWAIT.
>> So in Matthew's case, lock_page_async() will be called after calling
>> mapping->a_ops->readpage(), and it won't end up sleeping.
>> This case is exactly what happens when readahead is disabled or somehow
>> skipped for reasons like blk_cgroup_congested() returning true, and it
>> is what my commit c8d317aa1887b40b188e is for.
> 
> Well, try the patches. I agree it's not going to sleep with the previous
> fix, but we're definitely driving lower utilization by not using the
> read-ahead path when read-ahead is disabled.
> 
> Re-run your previous tests with these two applied and see what you get.
> 
Sure, I agree, looks good to me. I'll try the tests with the new code.
Thanks

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-13 12:01   ` Matthew Wilcox
@ 2020-10-13 19:57     ` Hao_Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao_Xu @ 2020-10-13 19:57 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, Johannes Weiner, Jens Axboe

On 2020/10/13 8:01 PM, Matthew Wilcox wrote:
> On Tue, Oct 13, 2020 at 01:13:48PM +0800, Hao_Xu wrote:
>> On 2020/10/13 5:13 AM, Matthew Wilcox wrote:
>>> This one's pretty unlikely, but there's a case in buffered reads where
>>> an IOCB_WAITQ read can end up sleeping.
>>>
>>> generic_file_buffered_read():
>>>                   page = find_get_page(mapping, index);
>>> ...
>>>                   if (!PageUptodate(page)) {
>>> ...
>>>                           if (iocb->ki_flags & IOCB_WAITQ) {
>>> ...
>>>                                   error = wait_on_page_locked_async(page,
>>>                                                                   iocb->ki_waitq);
>>> wait_on_page_locked_async():
>>>           if (!PageLocked(page))
>>>                   return 0;
>>> (back to generic_file_buffered_read):
>>>                           if (!mapping->a_ops->is_partially_uptodate(page,
>>>                                                           offset, iter->count))
>>>                                   goto page_not_up_to_date_locked;
>>>
>>> page_not_up_to_date_locked:
>>>                   if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>>                           unlock_page(page);
>>>                           put_page(page);
>>>                           goto would_block;
>>>                   }
>>> ...
>>>                   error = mapping->a_ops->readpage(filp, page);
>>> (will unlock page on I/O completion)
>>>                   if (!PageUptodate(page)) {
>>>                           error = lock_page_killable(page);
>>>
>>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>>> and wait for the I/O to complete.  I can't quite figure out if this is
>>> intentional -- I think not; if I understand the semantics right, we
>>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>>> kick off the I/O and wait.
>>>
>>> I think the right fix is to return -EIOCBQUEUED from
>>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>>
>>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>>                                        struct wait_page_queue *wait)
>>>    {
>>>           if (!PageLocked(page))
>>> -               return 0;
>>> +               return -EIOCBQUEUED;
>>>           return __wait_on_page_locked_async(compound_head(page), wait, false);
>>>    }
>>> But as I said, I'm not sure what the semantics are supposed to be.
>>>
>> Hi Matthew,
>> which kernel version are you using? I believe I've fixed this case in
>> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
> 
> Ah, I don't have that commit in my tree.
> 
> Nevertheless, there is still a problem.  The ->readpage implementation
> is not required to execute asynchronously.  For example, it may enter
> page reclaim by using GFP_KERNEL.  Indeed, I feel it is better if it
> works synchronously as it can then report the actual error from an I/O
> instead of the almost-meaningless -EIO.
> 
> This patch series documents 12 filesystems which implement ->readpage
> in a synchronous way today (for at least some cases) and converts iomap
> to be synchronous (making two more filesystems synchronous).
> 
> https://lore.kernel.org/linux-fsdevel/20201009143104.22673-1-willy@infradead.org/
> 
Thanks, Matthew. I didn't have this knowledge before; thank you for
sharing the information. I'll look into it soon.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-12 22:42     ` Jens Axboe
@ 2020-10-14 20:31       ` Hao_Xu
  2020-10-14 20:57         ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Hao_Xu @ 2020-10-14 20:31 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Andrew Morton

On 2020/10/13 6:42 AM, Jens Axboe wrote:
> On 10/12/20 4:22 PM, Jens Axboe wrote:
>> On 10/12/20 4:08 PM, Jens Axboe wrote:
>>> On 10/12/20 3:13 PM, Matthew Wilcox wrote:
>>>> This one's pretty unlikely, but there's a case in buffered reads where
>>>> an IOCB_WAITQ read can end up sleeping.
>>>>
>>>> generic_file_buffered_read():
>>>>                  page = find_get_page(mapping, index);
>>>> ...
>>>>                  if (!PageUptodate(page)) {
>>>> ...
>>>>                          if (iocb->ki_flags & IOCB_WAITQ) {
>>>> ...
>>>>                                  error = wait_on_page_locked_async(page,
>>>>                                                                  iocb->ki_waitq);
>>>> wait_on_page_locked_async():
>>>>          if (!PageLocked(page))
>>>>                  return 0;
>>>> (back to generic_file_buffered_read):
>>>>                          if (!mapping->a_ops->is_partially_uptodate(page,
>>>>                                                          offset, iter->count))
>>>>                                  goto page_not_up_to_date_locked;
>>>>
>>>> page_not_up_to_date_locked:
>>>>                  if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
>>>>                          unlock_page(page);
>>>>                          put_page(page);
>>>>                          goto would_block;
>>>>                  }
>>>> ...
>>>>                  error = mapping->a_ops->readpage(filp, page);
>>>> (will unlock page on I/O completion)
>>>>                  if (!PageUptodate(page)) {
>>>>                          error = lock_page_killable(page);
>>>>
>>>> So if we have IOCB_WAITQ set but IOCB_NOWAIT clear, we'll call ->readpage()
>>>> and wait for the I/O to complete.  I can't quite figure out if this is
>>>> intentional -- I think not; if I understand the semantics right, we
>>>> should be returning -EIOCBQUEUED and punting to an I/O thread to
>>>> kick off the I/O and wait.
>>>>
>>>> I think the right fix is to return -EIOCBQUEUED from
>>>> wait_on_page_locked_async() if the page isn't locked.  ie this:
>>>>
>>>> @@ -1258,7 +1258,7 @@ static int wait_on_page_locked_async(struct page *page,
>>>>                                       struct wait_page_queue *wait)
>>>>   {
>>>>          if (!PageLocked(page))
>>>> -               return 0;
>>>> +               return -EIOCBQUEUED;
>>>>          return __wait_on_page_locked_async(compound_head(page), wait, false);
>>>>   }
>>>>   
>>>> But as I said, I'm not sure what the semantics are supposed to be.
>>>
>>> If NOWAIT isn't set, then the issue attempt is from the helper thread
>>> already, and IOCB_WAITQ shouldn't be set either (the latter doesn't
>>> matter for this discussion). So it's totally fine and expected to block
>>> at that point.
>>>
>>> Hmm actually, I believe that:
>>>
>>> commit c8d317aa1887b40b188ec3aaa6e9e524333caed1
>>> Author: Hao Xu <haoxu@linux.alibaba.com>
>>> Date:   Tue Sep 29 20:00:45 2020 +0800
>>>
>>>      io_uring: fix async buffered reads when readahead is disabled
>>>
>>> maybe messed up that case, so we could block off the retry-path. I'll
>>> take a closer look, looks like that can be the case if read-ahead is
>>> disabled.
>>>
>>> In general, we can only return -EIOCBQUEUED if the IO has been started
>>> or is in progress already. That means we can safely rely on being told
>>> when it's unlocked/done. If we need to block, we should be returning
>>> -EAGAIN, which would punt to a worker thread.
>>
>> Something like the below might be a better solution - just always use
>> the read-ahead to generate the IO, for the requested range. That won't
>> issue any IO beyond what we asked for. And ensure we don't clear NOWAIT
>> on the io_uring side for retry.
>>
>> Totally untested... Just trying to get the idea across. We might need
>> some low cap on req_count in case the range is large. Hao Xu, can you
>> try with this? Thinking of your read-ahead disabled slowdown as well,
>> this could very well be the reason why.
> 
> Here's one that caps us at 1 page, if read-ahead is disabled or we're
> congested. Should still be fine in terms of being async, and it allows
> us to use the same path for this instead of special casing it.
> 
> I ran some quick testing on this, and it seems to Work For Me. I'll do
> some more targeted testing.
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index aae0ef2ec34d..9a2dfe132665 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3107,7 +3107,6 @@ static bool io_rw_should_retry(struct io_kiocb *req)
>   	wait->wait.flags = 0;
>   	INIT_LIST_HEAD(&wait->wait.entry);
>   	kiocb->ki_flags |= IOCB_WAITQ;
> -	kiocb->ki_flags &= ~IOCB_NOWAIT;
>   	kiocb->ki_waitq = wait;
>   
>   	io_get_req_task(req);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 3c9a8dd7c56c..d0f556612fd6 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -568,15 +568,20 @@ void page_cache_sync_readahead(struct address_space *mapping,
>   			       struct file_ra_state *ra, struct file *filp,
>   			       pgoff_t index, unsigned long req_count)
>   {
> -	/* no read-ahead */
> -	if (!ra->ra_pages)
> -		return;
> +	bool do_forced_ra = filp && (filp->f_mode & FMODE_RANDOM);
>   
> -	if (blk_cgroup_congested())
> -		return;
> +	/*
> +	 * Even if read-ahead is disabled, issue this request as read-ahead
> +	 * as we'll need it to satisfy the requested range. The forced
> +	 * read-ahead will do the right thing and limit the read to just the
> +	 * requested range, which we'll set to 1 page for this case.
> +	 */
> +	if (!ra->ra_pages || blk_cgroup_congested()) {
> +		req_count = 1;
> +		do_forced_ra = true;
> +	}
>   
> -	/* be dumb */
> -	if (filp && (filp->f_mode & FMODE_RANDOM)) {
> +	if (do_forced_ra) {
>   		force_page_cache_readahead(mapping, filp, index, req_count);
>   		return;
>   	}
> 

Hi Jens,
I've run some tests of the new fix code with readahead disabled from
userspace. Here are the results.
As for the perf reports, since I'm new to kernel stuff, I'm still
investigating them. I'll keep looking into what causes the difference
among the four perf reports (copy_user_enhanced_fast_string() catches
my eye there).

My environment is:
     server: physical server
     kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
     fs: ext4
     device: nvme ssd
     fio: 3.20

I did the tests by setting or commenting out the line:
     filp->f_mode |= FMODE_BUF_RASYNC;
in fs/ext4/file.c ext4_file_open().
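
In context, the toggle sits at the end of ext4_file_open(); a sketch
with the surrounding code elided:

static int ext4_file_open(struct inode *inode, struct file *filp)
{
        ...
        filp->f_mode |= FMODE_BUF_RASYNC;       /* commented out for the
                                                 * "not set" runs below */
        return dquot_file_open(inode, filp);
}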

The IOPS with readahead disabled from userspace are below:

With the new fix code (forced read-ahead):
QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
1                    10.8k                  10.3k
2                    21.2k                  20.1k
4                    41.1k                  39.1k
8                    76.1k                  72.2k
16                   133k                   126k
32                   169k                   147k
64                   176k                   160k
128                  (1)187k                (2)156k

Now the async buffered reads feature looks better in terms of IOPS,
but it still performs about the same as the async buffered reads
feature in the mainline code.

With the mainline code (the fix in commit c8d317aa1887 ("io_uring: fix
async buffered reads when readahead is disabled")):
QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
1                       10.9k            10.2k
2                       21.6k            20.2k
4                       41.0k            39.9k
8                       79.7k            75.9k
16                      141k             138k
32                      169k             237k
64                      190k             316k
128                     (3)195k          (4)315k

Comparing the numbers at places (1)(2)(3)(4), the new fix doesn't seem
to fix the slowdown; rather, it makes number (4) drop to number (2).

The perf reports for situations (1)(2)(3)(4) are:
(1)
   9 # Overhead  Command  Shared Object       Symbol
  10 # ........  .......  ..................  ..............................................
  11 #
  12     10.19%  fio      [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
  13      8.53%  fio      fio                 [.] clock_thread_fn
  14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
  15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
  16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
  17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
  18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
  19      1.48%  fio      [kernel.vmlinux]    [k] native_irq_return_iret
  20      1.48%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
  21      1.46%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
  22      1.45%  fio      [nvme]              [k] nvme_irq
  23      1.25%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
  24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
  25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
  26      1.12%  fio      fio                 [.] get_io_u
  27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
  28      0.78%  fio      fio                 [.] fio_gettime
  29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
  30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
  31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
  32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
  33      0.68%  fio      fio                 [.] io_u_queued_complete
  34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
  35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes
  ...

(2)
   9 # Overhead  Command  Shared Object       Symbol
  10 # ........  .......  ..................  ..............................................
  11 #
  12      7.94%  fio      fio                 [.] clock_thread_fn
  13      3.83%  fio      [kernel.vmlinux]    [k] xas_load
  14      2.57%  fio      [kernel.vmlinux]    [k] io_prep_async_work
  15      2.24%  fio      [kernel.vmlinux]    [k] clear_page_erms
  16      1.99%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
  17      1.94%  fio      libc-2.24.so        [.] __memset_avx2_erms
  18      1.83%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
  19      1.78%  fio      [kernel.vmlinux]    [k] __fget_files
  20      1.67%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
  21      1.50%  fio      fio                 [.] get_io_u
  22      1.41%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
  23      1.40%  fio      [kernel.vmlinux]    [k] io_prep_rw
  24      1.39%  fio      [kernel.vmlinux]    [k] mutex_unlock
  25      1.28%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irq
  26      1.21%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
  27      1.17%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
  28      1.11%  fio      [ext4]              [k] ext4_mpage_readpages
  29      1.11%  fio      fio                 [.] fio_gettime
  30      1.09%  fio      [kernel.vmlinux]    [k] __pagevec_lru_add_fn
  31      1.04%  fio      [kernel.vmlinux]    [k] kmem_cache_alloc_bulk
  32      0.99%  fio      [kernel.vmlinux]    [k] io_submit_sqes
  33      0.95%  fio      fio                 [.] io_u_queued_complete
  34      0.90%  fio      [kernel.vmlinux]    [k] io_wqe_wake_worker
  35      0.78%  fio      [vdso]              [.] __vdso_clock_gettime
  ...

(3)
   9 # Overhead  Command  Shared Object       Symbol
  10 # ........  .......  ..................  ..............................................
  11 #
  12      9.06%  fio      fio                 [.] clock_thread_fn
  13      6.05%  fio      [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
  14      4.27%  fio      [kernel.vmlinux]    [k] xas_load
  15      2.31%  fio      [kernel.vmlinux]    [k] clear_page_erms
  16      2.09%  fio      libc-2.24.so        [.] __memset_avx2_erms
  17      1.70%  fio      fio                 [.] get_io_u
  18      1.67%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
  19      1.67%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
  20      1.61%  fio      [kernel.vmlinux]    [k] native_irq_return_iret
  21      1.56%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
  22      1.34%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
  23      1.29%  fio      [kernel.vmlinux]    [k] mutex_unlock
  24      1.24%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
  25      1.11%  fio      fio                 [.] fio_gettime
  26      1.01%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
  27      0.90%  fio      [ext4]              [k] ext4_mpage_readpages
  28      0.89%  fio      [vdso]              [.] __vdso_clock_gettime
  29      0.82%  fio      [kernel.vmlinux]    [k] audit_filter_syscall.constprop.20
  30      0.74%  fio      [kernel.vmlinux]    [k] xas_store
  31      0.73%  fio      [kernel.vmlinux]    [k] find_get_entries
  32      0.72%  fio      [kernel.vmlinux]    [k] find_get_entry
  33      0.70%  fio      fio                 [.] io_u_queued_complete
  34      0.66%  fio      [kernel.vmlinux]    [k] release_pages
  35      0.66%  fio      [kernel.vmlinux]    [k] io_submit_sqes
  ...

(4)
   9 # Overhead  Command  Shared Object       Symbol
  10 # ........  .......  ..................  ..............................................
  11 #
  11 #
  12     12.30%  fio      fio                 [.] clock_thread_fn
  13      4.69%  fio      [kernel.vmlinux]    [k] xas_load
  14      3.12%  fio      [kernel.vmlinux]    [k] clear_page_erms
  15      2.87%  fio      libc-2.24.so        [.] __memset_avx2_erms
  16      2.80%  fio      [kernel.vmlinux]    [k] io_prep_async_work
  17      2.43%  fio      [kernel.vmlinux]    [k] io_prep_rw
  18      2.32%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
  19      2.24%  fio      [kernel.vmlinux]    [k] __fget_files
  20      2.18%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
  21      2.15%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
  22      2.10%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
  23      1.81%  fio      [kernel.vmlinux]    [k] native_queued_spin_lock_slowpath
  24      1.77%  fio      [kernel.vmlinux]    [k] lru_cache_add
  25      1.69%  fio      [kernel.vmlinux]    [k] _raw_spin_lock_irq
  26      1.65%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
  27      1.36%  fio      [kernel.vmlinux]    [k] __pagevec_lru_add_fn
  28      1.27%  fio      fio                 [.] get_io_u
  29      1.26%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
  30      1.21%  fio      [kernel.vmlinux]    [k] io_submit_sqes
  31      1.19%  fio      fio                 [.] account_io_completion
  32      1.16%  fio      [vdso]              [.] __vdso_clock_gettime
  33      1.14%  fio      fio                 [.] fio_gettime
  34      1.12%  fio      [kernel.vmlinux]    [k] allocate_slab
  35      0.97%  fio      [kernel.vmlinux]    [k] __x64_sys_io_uring_enter
  ...

The fio arguments I use are:
fio_test.sh:
#!/bin/bash
# $1: queue depth, passed through to -iodepth
fio -filename=/mnt/nvme0n1/haul.xh/fio_read_test.txt \
     -buffered=1 \
     -iodepth "$1" \
     -rw=randread \
     -ioengine=io_uring \
     -direct=0 \
     -bs=4k \
     -size=4G \
     -name=rand_read_4k \
     -numjobs=1
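
For reference, the "readahead disabled from userspace" part can be done
per file descriptor with posix_fadvise(POSIX_FADV_RANDOM), which tells
the kernel to expect random access and suppresses the readahead
heuristics for that fd. A minimal sketch (the path is just the test
file from fio_test.sh; this is one way to do it, not necessarily how
these runs disabled it):

/* ra_random.c: advise the kernel that access to one file is random */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	int fd = open("/mnt/nvme0n1/haul.xh/fio_read_test.txt", O_RDONLY);
	int ret;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* offset 0, len 0: the advice covers the whole file */
	ret = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	if (ret) {
		/* posix_fadvise() returns the error number directly */
		fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));
		return 1;
	}
	/* buffered reads on fd now bypass the readahead heuristics */
	return 0;
}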

Thanks && Regards,
Hao

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-14 20:31       ` Hao_Xu
@ 2020-10-14 20:57         ` Jens Axboe
  2020-10-15 11:27           ` Hao_Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2020-10-14 20:57 UTC (permalink / raw)
  To: Hao_Xu, Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Andrew Morton

On 10/14/20 2:31 PM, Hao_Xu wrote:
> Hi Jens,
> I've done some tests for the new fix code with readahead disabled from
> userspace. Here are some results.
> For the perf reports, since I'm new to kernel stuff, I'm still
> investigating them.
> I'll keep digging into what causes the difference among the four perf
> reports (in which copy_user_enhanced_fast_string() catches my eye).
> 
> my environment is:
>      server: physical server
>      kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
>      fs: ext4
>      device: nvme ssd
>      fio: 3.20
> 
> I did the tests by setting and commenting out the line:
>      filp->f_mode |= FMODE_BUF_RASYNC;
> in fs/ext4/file.c ext4_file_open()

You don't have to modify the kernel; if you use a newer fio then you can
essentially just add:

--force_async=1

after setting the engine to io_uring to get the same effect. Just a
heads up, as that might make it easier for you.
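With the fio_test.sh you posted, that would just be one more option on
the command line, e.g. adding -force_async=1 right after
-ioengine=io_uring.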

> the IOPS with readahead disabled from userspace is below:
> 
> with new fix code(force readahead)
> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
> 1                    10.8k                  10.3k
> 2                    21.2k                  20.1k
> 4                    41.1k                  39.1k
> 8                    76.1k                  72.2k
> 16                   133k                   126k
> 32                   169k                   147k
> 64                   176k                   160k
> 128                  (1)187k                (2)156k
> 
> now the async buffered reads feature looks better in terms of IOPS,
> but it still looks similar to the async buffered reads feature in the
> mainline code.

I'd say it looks better all around. And what you're completely
forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
using QD number of async workers to achieve that result. Hence you have
1..128 threads potentially running on that one, vs having a _single_
process running with FMODE_BUF_RASYNC.
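
You can also see this while a run is active; something like:

ps -eLf | grep io_wqe

should show the worker threads that end up doing the IO (just a sketch;
the exact thread names are an implementation detail).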

> with mainline code(the fix code in commit c8d317aa1887 ("io_uring: fix 
> async buffered reads when readahead is disabled"))
> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
> 1                       10.9k            10.2k
> 2                       21.6k            20.2k
> 4                       41.0k            39.9k
> 8                       79.7k            75.9k
> 16                      141k             138k
> 32                      169k             237k
> 64                      190k             316k
> 128                     (3)195k          (4)315k
> 
> Considering the numbers in places (1)(2)(3)(4), the new fix doesn't seem
> to fix the slowdown, but rather makes number (4) become number (2)

Not sure why there would be a difference between 2 and 4, that does seem
odd. I'll see if I can reproduce that. More questions below.

> the perf reports of (1)(2)(3)(4) situations are:
> (1)
>    9 # Overhead  Command  Shared Object       Symbol
>   10 # ........  .......  .................. 
> ..............................................
>   11 #
>   12     10.19%  fio      [kernel.vmlinux]    [k] 
> copy_user_enhanced_fast_string
>   13      8.53%  fio      fio                 [.] clock_thread_fn
>   14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
>   15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
>   16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
>   17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
>   18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
>   19      1.48%  fio      [kernel.vmlinux]    [k] native_irq_return_iret
>   20      1.48%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
>   21      1.46%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
>   22      1.45%  fio      [nvme]              [k] nvme_irq
>   23      1.25%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
>   24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
>   25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
>   26      1.12%  fio      fio                 [.] get_io_u
>   27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
>   28      0.78%  fio      fio                 [.] fio_gettime
>   29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
>   30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
>   31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
>   32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
>   33      0.68%  fio      fio                 [.] io_u_queued_complete
>   34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
>   35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes

These profiles are of marginal use, as you're only profiling fio itself,
not all of the async workers that are running for !FMODE_BUF_RASYNC.
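
Something like:

perf record -a -g -- ./fio_test.sh 128

(just a sketch; any system-wide profiling method works) would capture
those workers as well.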

How long does the test run? It looks suspect that clock_thread_fn shows
up in the profiles at all.

And is it actually doing IO, or are you using shm/tmpfs for this test?
Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), makes
me a little confused.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-14 20:57         ` Jens Axboe
@ 2020-10-15 11:27           ` Hao_Xu
  2020-10-15 12:17             ` Hao_Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Hao_Xu @ 2020-10-15 11:27 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Andrew Morton

On 2020/10/15 4:57 AM, Jens Axboe wrote:
> On 10/14/20 2:31 PM, Hao_Xu wrote:
>> Hi Jens,
>> I've done some tests for the new fix code with readahead disabled from
>> userspace. Here are some results.
>> For the perf reports, since I'm new to kernel stuff, I'm still
>> investigating them.
>> I'll keep digging into what causes the difference among the four perf
>> reports (in which copy_user_enhanced_fast_string() catches my eye).
>>
>> my environment is:
>>       server: physical server
>>       kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
>>       fs: ext4
>>       device: nvme ssd
>>       fio: 3.20
>>
>> I did the tests by setting and commenting out the line:
>>       filp->f_mode |= FMODE_BUF_RASYNC;
>> in fs/ext4/file.c ext4_file_open()
> 
> You don't have to modify the kernel; if you use a newer fio then you can
> essentially just add:
> 
> --force_async=1
> 
> after setting the engine to io_uring to get the same effect. Just a
> heads up, as that might make it easier for you.
> 
>> the IOPS with readahead disabled from userspace is below:
>>
>> with new fix code(force readahead)
>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>> 1                    10.8k                  10.3k
>> 2                    21.2k                  20.1k
>> 4                    41.1k                  39.1k
>> 8                    76.1k                  72.2k
>> 16                   133k                   126k
>> 32                   169k                   147k
>> 64                   176k                   160k
>> 128                  (1)187k                (2)156k
>>
>> now the async buffered reads feature looks better in terms of IOPS,
>> but it still looks similar to the async buffered reads feature in the
>> mainline code.
> 
> I'd say it looks better all around. And what you're completely
> forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
> using QD number of async workers to achieve that result. Hence you have
> 1..128 threads potentially running on that one, vs having a _single_
> process running with FMODE_BUF_RASYNC.
I totally agree with this; the server I use has many CPUs, which lets
the multiple async workers run truly in parallel.

> 
>> with mainline code(the fix code in commit c8d317aa1887 ("io_uring: fix
>> async buffered reads when readahead is disabled"))
>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>> 1                       10.9k            10.2k
>> 2                       21.6k            20.2k
>> 4                       41.0k            39.9k
>> 8                       79.7k            75.9k
>> 16                      141k             138k
>> 32                      169k             237k
>> 64                      190k             316k
>> 128                     (3)195k          (4)315k
>>
>> Considering the numbers in places (1)(2)(3)(4), the new fix doesn't seem
>> to fix the slowdown, but rather makes number (4) become number (2)
> 
> Not sure why there would be a difference between 2 and 4, that does seem
> odd. I'll see if I can reproduce that. More questions below.
> 
>> the perf reports of (1)(2)(3)(4) situations are:
>> (1)
>>     9 # Overhead  Command  Shared Object       Symbol
>>    10 # ........  .......  ..................
>> ..............................................
>>    11 #
>>    12     10.19%  fio      [kernel.vmlinux]    [k]
>> copy_user_enhanced_fast_string
>>    13      8.53%  fio      fio                 [.] clock_thread_fn
>>    14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
>>    15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
>>    16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
>>    17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
>>    18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
>>    19      1.48%  fio      [kernel.vmlinux]    [k] native_irq_return_iret
>>    20      1.48%  fio      [kernel.vmlinux]    [k] get_page_from_freelist
>>    21      1.46%  fio      [kernel.vmlinux]    [k] generic_file_buffered_read
>>    22      1.45%  fio      [nvme]              [k] nvme_irq
>>    23      1.25%  fio      [kernel.vmlinux]    [k] __list_del_entry_valid
>>    24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
>>    25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
>>    26      1.12%  fio      fio                 [.] get_io_u
>>    27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
>>    28      0.78%  fio      fio                 [.] fio_gettime
>>    29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
>>    30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
>>    31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
>>    32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
>>    33      0.68%  fio      fio                 [.] io_u_queued_complete
>>    34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
>>    35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes
> 
> These profiles are of marginal use, as you're only profiling fio itself,
> not all of the async workers that are running for !FMODE_BUF_RASYNC.
> 
Ah, I got it. Thanks.
> How long does the test run? It looks suspect that clock_thread_fn shows
> up in the profiles at all.
> 
it runs about 5 msec, randread 4G with bs=4k
> And is it actually doing IO, or are you using shm/tmpfs for this test?
> Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), makes
> me a little confused.
> 
I'm using ext4 on a real nvme ssd device. From the call stack,
shmem_getpage_gfp() is reached from __memset_avx2_erms in libc.
There are ext4-related functions in all four reports.
I'm doing more checks to see whether it is my test process that causes
the high IOPS in case (4).
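For example, watching iostat -x 1 on the nvme device while the test is
running should show whether the reads really hit the disk.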


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Loophole in async page I/O
  2020-10-15 11:27           ` Hao_Xu
@ 2020-10-15 12:17             ` Hao_Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao_Xu @ 2020-10-15 12:17 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, io-uring; +Cc: Johannes Weiner, Andrew Morton

On 2020/10/15 7:27 PM, Hao_Xu wrote:
> On 2020/10/15 4:57 AM, Jens Axboe wrote:
>> On 10/14/20 2:31 PM, Hao_Xu wrote:
>>> Hi Jens,
>>> I've done some tests for the new fix code with readahead disabled from
>>> userspace. Here are some results.
>>> For the perf reports, since I'm new to kernel stuff, I'm still
>>> investigating them.
>>> I'll keep digging into what causes the difference among the four perf
>>> reports (in which copy_user_enhanced_fast_string() catches my eye).
>>>
>>> my environment is:
>>>       server: physical server
>>>       kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
>>>       fs: ext4
>>>       device: nvme ssd
>>>       fio: 3.20
>>>
>>> I did the tests by setting and commenting out the line:
>>>       filp->f_mode |= FMODE_BUF_RASYNC;
>>> in fs/ext4/file.c ext4_file_open()
>>
>> You don't have to modify the kernel; if you use a newer fio then you can
>> essentially just add:
>>
>> --force_async=1
>>
>> after setting the engine to io_uring to get the same effect. Just a
>> heads up, as that might make it easier for you.
>>
>>> the IOPS with readahead disabled from userspace is below:
>>>
>>> with new fix code(force readahead)
>>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>>> 1                    10.8k                  10.3k
>>> 2                    21.2k                  20.1k
>>> 4                    41.1k                  39.1k
>>> 8                    76.1k                  72.2k
>>> 16                   133k                   126k
>>> 32                   169k                   147k
>>> 64                   176k                   160k
>>> 128                  (1)187k                (2)156k
>>>
>>> now the async buffered reads feature looks better in terms of IOPS,
>>> but it still looks similar to the async buffered reads feature in the
>>> mainline code.
>>
>> I'd say it looks better all around. And what you're completely
>> forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
>> using QD number of async workers to achieve that result. Hence you have
>> 1..128 threads potentially running on that one, vs having a _single_
>> process running with FMODE_BUF_RASYNC.
> I totally agree with this; the server I use has many CPUs, which lets
> the multiple async workers run truly in parallel.
> 
>>
>>> with mainline code(the fix code in commit c8d317aa1887 ("io_uring: fix
>>> async buffered reads when readahead is disabled"))
>>> QD/Test        FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
>>> 1                       10.9k            10.2k
>>> 2                       21.6k            20.2k
>>> 4                       41.0k            39.9k
>>> 8                       79.7k            75.9k
>>> 16                      141k             138k
>>> 32                      169k             237k
>>> 64                      190k             316k
>>> 128                     (3)195k          (4)315k
>>>
>>> Considering the numbers in places (1)(2)(3)(4), the new fix doesn't
>>> seem to fix the slowdown, but rather makes number (4) become number (2)
>>
>> Not sure why there would be a difference between 2 and 4, that does seem
>> odd. I'll see if I can reproduce that. More questions below.
>>
>>> the perf reports of (1)(2)(3)(4) situations are:
>>> (1)
>>>     9 # Overhead  Command  Shared Object       Symbol
>>>    10 # ........  .......  ..................
>>> ..............................................
>>>    11 #
>>>    12     10.19%  fio      [kernel.vmlinux]    [k]
>>> copy_user_enhanced_fast_string
>>>    13      8.53%  fio      fio                 [.] clock_thread_fn
>>>    14      4.67%  fio      [kernel.vmlinux]    [k] xas_load
>>>    15      2.18%  fio      [kernel.vmlinux]    [k] clear_page_erms
>>>    16      2.02%  fio      libc-2.24.so        [.] __memset_avx2_erms
>>>    17      1.55%  fio      [kernel.vmlinux]    [k] mutex_unlock
>>>    18      1.51%  fio      [kernel.vmlinux]    [k] shmem_getpage_gfp
>>>    19      1.48%  fio      [kernel.vmlinux]    [k] 
>>> native_irq_return_iret
>>>    20      1.48%  fio      [kernel.vmlinux]    [k] 
>>> get_page_from_freelist
>>>    21      1.46%  fio      [kernel.vmlinux]    [k] 
>>> generic_file_buffered_read
>>>    22      1.45%  fio      [nvme]              [k] nvme_irq
>>>    23      1.25%  fio      [kernel.vmlinux]    [k] 
>>> __list_del_entry_valid
>>>    24      1.22%  fio      [kernel.vmlinux]    [k] free_pcppages_bulk
>>>    25      1.15%  fio      [kernel.vmlinux]    [k] _raw_spin_lock
>>>    26      1.12%  fio      fio                 [.] get_io_u
>>>    27      0.81%  fio      [ext4]              [k] ext4_mpage_readpages
>>>    28      0.78%  fio      fio                 [.] fio_gettime
>>>    29      0.76%  fio      [kernel.vmlinux]    [k] find_get_entries
>>>    30      0.75%  fio      [vdso]              [.] __vdso_clock_gettime
>>>    31      0.73%  fio      [kernel.vmlinux]    [k] release_pages
>>>    32      0.68%  fio      [kernel.vmlinux]    [k] find_get_entry
>>>    33      0.68%  fio      fio                 [.] io_u_queued_complete
>>>    34      0.67%  fio      [kernel.vmlinux]    [k] io_async_buf_func
>>>    35      0.65%  fio      [kernel.vmlinux]    [k] io_submit_sqes
>>
>> These profiles are of marginal use, as you're only profiling fio itself,
>> not all of the async workers that are running for !FMODE_BUF_RASYNC.
>>
> Ah, I got it. Thanks.
>> How long does the test run? It looks suspect that clock_thread_fn shows
>> up in the profiles at all.
>>
> it runs about 5 msec, randread 4G with bs=4k
Sorry, 5 seconds, not 5 msec.
>> And is it actually doing IO, or are you using shm/tmpfs for this test?
>> Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), makes
>> me a little confused.
>>
> I'm using ext4 on a real nvme ssd device. From the call stack,
> shmem_getpage_gfp() is reached from __memset_avx2_erms in libc.
> There are ext4-related functions in all four reports.
> I'm doing more checks to see whether it is my test process that causes
> the high IOPS in case (4).


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-10-15 12:17 UTC | newest]

Thread overview: 14+ messages
-- links below jump to the message on this page --
2020-10-12 21:13 Loophole in async page I/O Matthew Wilcox
2020-10-12 22:08 ` Jens Axboe
2020-10-12 22:22   ` Jens Axboe
2020-10-12 22:42     ` Jens Axboe
2020-10-14 20:31       ` Hao_Xu
2020-10-14 20:57         ` Jens Axboe
2020-10-15 11:27           ` Hao_Xu
2020-10-15 12:17             ` Hao_Xu
2020-10-13  5:31   ` Hao_Xu
2020-10-13 17:50     ` Jens Axboe
2020-10-13 19:50       ` Hao_Xu
2020-10-13  5:13 ` Hao_Xu
2020-10-13 12:01   ` Matthew Wilcox
2020-10-13 19:57     ` Hao_Xu
