io-uring.vger.kernel.org archive mirror
* [PATCH 0/2] optimise ctx's refs grabbing in io_uring
@ 2019-12-17 22:28 Pavel Begunkov
  2019-12-17 22:28 ` [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
  2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  0 siblings, 2 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-17 22:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

Optimise percpu_ref_tryget() by batching it instead of calling it for
each request. This gave a measurable performance boost, though with a
somewhat unconventional (perhaps unrealistic) workload.
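
In short, a condensed sketch of the change (simplified from the patches
below; error handling and the return of unused refs are trimmed):

        /* before: one tryget per request */
        for (i = 0; i < nr; i++) {
                if (!percpu_ref_tryget(&ctx->refs))
                        break;
                /* ... prepare and submit one request ... */
        }

        /* after: one tryget_many for the whole batch */
        if (!percpu_ref_tryget_many(&ctx->refs, nr))
                return -EAGAIN;
        for (i = 0; i < nr; i++) {
                /* ... prepare and submit one request ... */
        }
        /* leftover refs go back via percpu_ref_put_many() */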

There is still one more step, not implemented in this patchset, which
will amortise the grabbing across calls to io_uring_enter().
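
(For the curious: that extra step, posted later in this thread as the RFC
patch 3/3, roughly amounts to the sketch below: grab a large block of refs
once under ctx->uring_lock and track how many are still in hand.)

        if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
                if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
                        return -EAGAIN;
                ctx->taken_refs += IORING_REFS_THRESHOLD;
        }
        /* ... the submission loop consumes refs ... */
        ctx->taken_refs -= submitted;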

rebased on top of for-5.6/io_uring

Pavel Begunkov (2):
  pcpu_ref: add percpu_ref_tryget_many()
  io_uring: batch getting pcpu references

 fs/io_uring.c                   | 11 ++++++++---
 include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
 2 files changed, 28 insertions(+), 7 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-17 22:28 [PATCH 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
@ 2019-12-17 22:28 ` Pavel Begunkov
  2019-12-17 23:42   ` Jens Axboe
  2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-17 22:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs a specified number of refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..19079b62ce31 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
  * @ref: percpu_ref to try-get
+ * @nr: number of references to get
  *
  * Increment a percpu refcount unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
  */
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+					  unsigned long nr)
 {
 	unsigned long __percpu *percpu_count;
 	bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	rcu_read_lock();
 
 	if (__ref_is_percpu(ref, &percpu_count)) {
-		this_cpu_inc(*percpu_count);
+		this_cpu_add(*percpu_count, nr);
 		ret = true;
 	} else {
-		ret = atomic_long_inc_not_zero(&ref->count);
+		ret = atomic_long_add_unless(&ref->count, nr, 0);
 	}
 
 	rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	return ret;
 }
 
+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	return percpu_ref_tryget_many(ref, 1);
+}
+
 /**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 22:28 [PATCH 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-17 22:28 ` [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
@ 2019-12-17 22:28 ` Pavel Begunkov
  2019-12-17 23:21   ` Jens Axboe
                     ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-17 22:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

A basic benchmark submitting and waiting on 128 non-linked nops showed a
~5% performance gain (7044 vs 7423 KIOPS).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

Note: this could be done without the @extra_refs variable, but it looked
too tangled because of the gotos.


 fs/io_uring.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index cf4138f0e504..6c85dfc62224 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -845,9 +845,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct io_kiocb *req;
 
-	if (!percpu_ref_tryget(&ctx->refs))
-		return NULL;
-
 	if (!state) {
 		req = kmem_cache_alloc(req_cachep, gfp);
 		if (unlikely(!req))
@@ -3929,6 +3926,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
 	int i, submitted = 0;
+	unsigned int extra_refs;
 	bool mm_fault = false;
 
 	/* if we have a backlog and couldn't flush it all, return BUSY */
@@ -3941,6 +3939,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 		statep = &state;
 	}
 
+	if (!percpu_ref_tryget_many(&ctx->refs, nr))
+		return -EAGAIN;
+	extra_refs = nr;
+
 	for (i = 0; i < nr; i++) {
 		struct io_kiocb *req = io_get_req(ctx, statep);
 
@@ -3949,6 +3951,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 				submitted = -EAGAIN;
 			break;
 		}
+		--extra_refs;
 		if (!io_get_sqring(ctx, req)) {
 			__io_free_req(req);
 			break;
@@ -3976,6 +3979,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 		io_queue_link_head(link);
 	if (statep)
 		io_submit_state_end(&state);
+	if (extra_refs)
+		percpu_ref_put_many(&ctx->refs, extra_refs);
 
 	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
 	io_commit_sqring(ctx);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
@ 2019-12-17 23:21   ` Jens Axboe
  2019-12-17 23:31     ` Jens Axboe
  2019-12-18  9:23     ` Pavel Begunkov
  2019-12-18  0:02   ` Jens Axboe
  2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2019-12-17 23:21 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 12/17/19 3:28 PM, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead getting a reference
> for each request, grab a bunch once per io_submit_sqes().
> 
> basic benchmark with submit and wait 128 non-linked nops showed ~5%
> performance gain. (7044 KIOPS vs 7423)
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> 
> For notice: it could be done without @extra_refs variable,
> but looked too tangled because of gotos.
> 
> 
>  fs/io_uring.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index cf4138f0e504..6c85dfc62224 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -845,9 +845,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>  	struct io_kiocb *req;
>  
> -	if (!percpu_ref_tryget(&ctx->refs))
> -		return NULL;
> -
>  	if (!state) {
>  		req = kmem_cache_alloc(req_cachep, gfp);
>  		if (unlikely(!req))
> @@ -3929,6 +3926,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  	struct io_submit_state state, *statep = NULL;
>  	struct io_kiocb *link = NULL;
>  	int i, submitted = 0;
> +	unsigned int extra_refs;
>  	bool mm_fault = false;
>  
>  	/* if we have a backlog and couldn't flush it all, return BUSY */
> @@ -3941,6 +3939,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  		statep = &state;
>  	}
>  
> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
> +		return -EAGAIN;
> +	extra_refs = nr;
> +
>  	for (i = 0; i < nr; i++) {
>  		struct io_kiocb *req = io_get_req(ctx, statep);
>  
> @@ -3949,6 +3951,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  				submitted = -EAGAIN;
>  			break;
>  		}
> +		--extra_refs;
>  		if (!io_get_sqring(ctx, req)) {
>  			__io_free_req(req);
>  			break;
> @@ -3976,6 +3979,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  		io_queue_link_head(link);
>  	if (statep)
>  		io_submit_state_end(&state);
> +	if (extra_refs)
> +		percpu_ref_put_many(&ctx->refs, extra_refs);

Might be cleaner to introduce a 'ret' variable, and leave submitted to be just
that, the number submitted. Then you could just do:

if (submitted != nr)
	percpu_ref_put_many(&ctx->refs, nr - submitted);

and not need that weird extra_refs.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 23:21   ` Jens Axboe
@ 2019-12-17 23:31     ` Jens Axboe
  2019-12-18  9:25       ` Pavel Begunkov
  2019-12-18  9:23     ` Pavel Begunkov
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-17 23:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 12/17/19 4:21 PM, Jens Axboe wrote:
> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>> for each request, grab a bunch once per io_submit_sqes().
>>
>> basic benchmark with submit and wait 128 non-linked nops showed ~5%
>> performance gain. (7044 KIOPS vs 7423)
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> For notice: it could be done without @extra_refs variable,
>> but looked too tangled because of gotos.
>>
>>
>>  fs/io_uring.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cf4138f0e504..6c85dfc62224 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -845,9 +845,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>  	struct io_kiocb *req;
>>  
>> -	if (!percpu_ref_tryget(&ctx->refs))
>> -		return NULL;
>> -
>>  	if (!state) {
>>  		req = kmem_cache_alloc(req_cachep, gfp);
>>  		if (unlikely(!req))
>> @@ -3929,6 +3926,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  	struct io_submit_state state, *statep = NULL;
>>  	struct io_kiocb *link = NULL;
>>  	int i, submitted = 0;
>> +	unsigned int extra_refs;
>>  	bool mm_fault = false;
>>  
>>  	/* if we have a backlog and couldn't flush it all, return BUSY */
>> @@ -3941,6 +3939,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  		statep = &state;
>>  	}
>>  
>> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
>> +		return -EAGAIN;
>> +	extra_refs = nr;
>> +
>>  	for (i = 0; i < nr; i++) {
>>  		struct io_kiocb *req = io_get_req(ctx, statep);
>>  

This also needs to come before the submit_state_start().
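
A minimal illustration of the problem (a sketch, not the actual code; in
this tree io_submit_state_start() starts the blk plug that
io_submit_state_end() later finishes, so an early return leaves the plug
unfinished):

        if (nr > IO_PLUG_THRESHOLD) {
                io_submit_state_start(&state, nr);      /* starts a blk plug */
                statep = &state;
        }

        if (!percpu_ref_tryget_many(&ctx->refs, nr))
                return -EAGAIN;         /* io_submit_state_end() never runs */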

I forgot to mention that I really like the idea, no point in NOT batching
the refs when we know exactly how many refs we need.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-17 22:28 ` [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
@ 2019-12-17 23:42   ` Jens Axboe
  2019-12-18 16:26     ` Tejun Heo
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-17 23:42 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel, Tejun Heo

CC Tejun on this one. Looks fine to me, and matches the put path.


On 12/17/19 3:28 PM, Pavel Begunkov wrote:
> Add percpu_ref_tryget_many(), which works the same way as
> percpu_ref_tryget(), but grabs specified number of refs.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
> index 390031e816dc..19079b62ce31 100644
> --- a/include/linux/percpu-refcount.h
> +++ b/include/linux/percpu-refcount.h
> @@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
>  }
>  
>  /**
> - * percpu_ref_tryget - try to increment a percpu refcount
> + * percpu_ref_tryget_many - try to increment a percpu refcount
>   * @ref: percpu_ref to try-get
> + * @nr: number of references to get
>   *
>   * Increment a percpu refcount unless its count already reached zero.
>   * Returns %true on success; %false on failure.
>   *
>   * This function is safe to call as long as @ref is between init and exit.
>   */
> -static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> +static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
> +					  unsigned long nr)
>  {
>  	unsigned long __percpu *percpu_count;
>  	bool ret;
> @@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>  	rcu_read_lock();
>  
>  	if (__ref_is_percpu(ref, &percpu_count)) {
> -		this_cpu_inc(*percpu_count);
> +		this_cpu_add(*percpu_count, nr);
>  		ret = true;
>  	} else {
> -		ret = atomic_long_inc_not_zero(&ref->count);
> +		ret = atomic_long_add_unless(&ref->count, nr, 0);
>  	}
>  
>  	rcu_read_unlock();
> @@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>  	return ret;
>  }
>  
> +/**
> + * percpu_ref_tryget - try to increment a percpu refcount
> + * @ref: percpu_ref to try-get
> + *
> + * Increment a percpu refcount unless its count already reached zero.
> + * Returns %true on success; %false on failure.
> + *
> + * This function is safe to call as long as @ref is between init and exit.
> + */
> +static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> +{
> +	return percpu_ref_tryget_many(ref, 1);
> +}
> +
>  /**
>   * percpu_ref_tryget_live - try to increment a live percpu refcount
>   * @ref: percpu_ref to try-get
> 


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  2019-12-17 23:21   ` Jens Axboe
@ 2019-12-18  0:02   ` Jens Axboe
  2019-12-18 10:41     ` Pavel Begunkov
  2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-18  0:02 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 12/17/19 3:28 PM, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead getting a reference
> for each request, grab a bunch once per io_submit_sqes().
> 
> basic benchmark with submit and wait 128 non-linked nops showed ~5%
> performance gain. (7044 KIOPS vs 7423)

Confirmed about 5% here as well, doing polled IO to a fast device.
That's a huge gain!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 23:21   ` Jens Axboe
  2019-12-17 23:31     ` Jens Axboe
@ 2019-12-18  9:23     ` Pavel Begunkov
  1 sibling, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-18  9:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 12/18/2019 2:21 AM, Jens Axboe wrote:
> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>> for each request, grab a bunch once per io_submit_sqes().
>>
>> basic benchmark with submit and wait 128 non-linked nops showed ~5%
>> performance gain. (7044 KIOPS vs 7423)
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> For notice: it could be done without @extra_refs variable,
>> but looked too tangled because of gotos.
>>
>>
>>  fs/io_uring.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cf4138f0e504..6c85dfc62224 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -845,9 +845,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>  	struct io_kiocb *req;
>>  
>> -	if (!percpu_ref_tryget(&ctx->refs))
>> -		return NULL;
>> -
>>  	if (!state) {
>>  		req = kmem_cache_alloc(req_cachep, gfp);
>>  		if (unlikely(!req))
>> @@ -3929,6 +3926,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  	struct io_submit_state state, *statep = NULL;
>>  	struct io_kiocb *link = NULL;
>>  	int i, submitted = 0;
>> +	unsigned int extra_refs;
>>  	bool mm_fault = false;
>>  
>>  	/* if we have a backlog and couldn't flush it all, return BUSY */
>> @@ -3941,6 +3939,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  		statep = &state;
>>  	}
>>  
>> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
>> +		return -EAGAIN;
>> +	extra_refs = nr;
>> +
>>  	for (i = 0; i < nr; i++) {
>>  		struct io_kiocb *req = io_get_req(ctx, statep);
>>  
>> @@ -3949,6 +3951,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  				submitted = -EAGAIN;
>>  			break;
>>  		}
>> +		--extra_refs;
>>  		if (!io_get_sqring(ctx, req)) {
>>  			__io_free_req(req);
>>  			break;
>> @@ -3976,6 +3979,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>  		io_queue_link_head(link);
>>  	if (statep)
>>  		io_submit_state_end(&state);
>> +	if (extra_refs)
>> +		percpu_ref_put_many(&ctx->refs, extra_refs);
> 
> Might be cleaner to introduce a 'ret' variable, and leave submitted to be just
> that, the number submitted. Then you could just do:
> 
> if (submitted != nr)
> 	percpu_ref_put_many(&ctx->refs, nr - submitted);
> 
> and not need that weird extra_refs.
> 
That won't work as is, because __io_free_req() after a failed
io_get_sqring() puts a ref itself. And that's the reason why --extra_refs;
is placed in between those two if's. Also, percpu_ref_put_many() is
inline, so I want to have only one call site.
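
To illustrate (a simplified sketch of the loop from the patch): on the
io_get_sqring() failure path the ref is dropped by __io_free_req(), so a
final put of nr - submitted would drop that ref a second time:

        for (i = 0; i < nr; i++) {
                req = io_get_req(ctx, statep);
                if (unlikely(!req))
                        break;                  /* ref unused, still in extra_refs */
                --extra_refs;                   /* req now owns one ctx ref */
                if (!io_get_sqring(ctx, req)) {
                        __io_free_req(req);     /* already puts req's ctx ref */
                        break;                  /* not submitted, must not be put again */
                }
                /* ... issue the request ... */
        }
        if (extra_refs)
                percpu_ref_put_many(&ctx->refs, extra_refs);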

I'll send another version, which won't even need the "if" at the end,
but is more knotty.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-17 23:31     ` Jens Axboe
@ 2019-12-18  9:25       ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-18  9:25 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 12/18/2019 2:31 AM, Jens Axboe wrote:
> On 12/17/19 4:21 PM, Jens Axboe wrote:
>> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
>>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>>> for each request, grab a bunch once per io_submit_sqes().
>>>
>>> basic benchmark with submit and wait 128 non-linked nops showed ~5%
>>> performance gain. (7044 KIOPS vs 7423)
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>
>>> For notice: it could be done without @extra_refs variable,
>>> but looked too tangled because of gotos.
>>>
>>>
>>>  fs/io_uring.c | 11 ++++++++---
>>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index cf4138f0e504..6c85dfc62224 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -845,9 +845,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>>  	struct io_kiocb *req;
>>>  
>>> -	if (!percpu_ref_tryget(&ctx->refs))
>>> -		return NULL;
>>> -
>>>  	if (!state) {
>>>  		req = kmem_cache_alloc(req_cachep, gfp);
>>>  		if (unlikely(!req))
>>> @@ -3929,6 +3926,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>>  	struct io_submit_state state, *statep = NULL;
>>>  	struct io_kiocb *link = NULL;
>>>  	int i, submitted = 0;
>>> +	unsigned int extra_refs;
>>>  	bool mm_fault = false;
>>>  
>>>  	/* if we have a backlog and couldn't flush it all, return BUSY */
>>> @@ -3941,6 +3939,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>>  		statep = &state;
>>>  	}
>>>  
>>> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
>>> +		return -EAGAIN;
>>> +	extra_refs = nr;
>>> +
>>>  	for (i = 0; i < nr; i++) {
>>>  		struct io_kiocb *req = io_get_req(ctx, statep);
>>>  
> 
> This also needs to come before the submit_state_start().
> 
Good catch, forgot to handle this. Thanks

> I forgot to mention that I really like the idea, no point in NOT batching
> the refs when we know exactly how many refs we need.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] io_uring: batch getting pcpu references
  2019-12-18  0:02   ` Jens Axboe
@ 2019-12-18 10:41     ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-18 10:41 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 12/18/2019 3:02 AM, Jens Axboe wrote:
> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>> for each request, grab a bunch once per io_submit_sqes().
>>
>> basic benchmark with submit and wait 128 non-linked nops showed ~5%
>> performance gain. (7044 KIOPS vs 7423)
> 
> Confirmed about 5% here as well, doing polled IO to a fast device.
> That's a huge gain!
> 
Great that you got it on a real drive!
As mentioned, after we are done with this one, I have an idea of how to
optimise it even further, one which would work for small QD as well.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-17 23:42   ` Jens Axboe
@ 2019-12-18 16:26     ` Tejun Heo
  2019-12-18 17:49       ` Dennis Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Tejun Heo @ 2019-12-18 16:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, io-uring, linux-kernel, Dennis Zhou, Christoph Lameter

(cc'ing Dennis and Christoph and quoting whole body)

Pavel, can you please cc percpu maintainers on related changes?

The patch looks fine to me.  Please feel free to add my acked-by.

On Tue, Dec 17, 2019 at 04:42:59PM -0700, Jens Axboe wrote:
> CC Tejun on this one. Looks fine to me, and matches the put path.
> 
> 
> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
> > Add percpu_ref_tryget_many(), which works the same way as
> > percpu_ref_tryget(), but grabs specified number of refs.
> > 
> > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > ---
> >  include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
> > index 390031e816dc..19079b62ce31 100644
> > --- a/include/linux/percpu-refcount.h
> > +++ b/include/linux/percpu-refcount.h
> > @@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
> >  }
> >  
> >  /**
> > - * percpu_ref_tryget - try to increment a percpu refcount
> > + * percpu_ref_tryget_many - try to increment a percpu refcount
> >   * @ref: percpu_ref to try-get
> > + * @nr: number of references to get
> >   *
> >   * Increment a percpu refcount unless its count already reached zero.
> >   * Returns %true on success; %false on failure.
> >   *
> >   * This function is safe to call as long as @ref is between init and exit.
> >   */
> > -static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > +static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
> > +					  unsigned long nr)
> >  {
> >  	unsigned long __percpu *percpu_count;
> >  	bool ret;
> > @@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> >  	rcu_read_lock();
> >  
> >  	if (__ref_is_percpu(ref, &percpu_count)) {
> > -		this_cpu_inc(*percpu_count);
> > +		this_cpu_add(*percpu_count, nr);
> >  		ret = true;
> >  	} else {
> > -		ret = atomic_long_inc_not_zero(&ref->count);
> > +		ret = atomic_long_add_unless(&ref->count, nr, 0);
> >  	}
> >  
> >  	rcu_read_unlock();
> > @@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> >  	return ret;
> >  }
> >  
> > +/**
> > + * percpu_ref_tryget - try to increment a percpu refcount
> > + * @ref: percpu_ref to try-get
> > + *
> > + * Increment a percpu refcount unless its count already reached zero.
> > + * Returns %true on success; %false on failure.
> > + *
> > + * This function is safe to call as long as @ref is between init and exit.
> > + */
> > +static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > +{
> > +	return percpu_ref_tryget_many(ref, 1);
> > +}
> > +
> >  /**
> >   * percpu_ref_tryget_live - try to increment a live percpu refcount
> >   * @ref: percpu_ref to try-get
> > 
> 
> 
> -- 
> Jens Axboe
> 

-- 
tejun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-18 16:26     ` Tejun Heo
@ 2019-12-18 17:49       ` Dennis Zhou
  2019-12-21 15:36         ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Dennis Zhou @ 2019-12-18 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel, Dennis Zhou,
	Christoph Lameter

On Wed, Dec 18, 2019 at 08:26:42AM -0800, Tejun Heo wrote:
> (cc'ing Dennis and Christoph and quoting whole body)
> 
> Pavel, can you please cc percpu maintainers on related changes?
> 
> The patch looks fine to me.  Please feel free to add my acked-by.
> 
> On Tue, Dec 17, 2019 at 04:42:59PM -0700, Jens Axboe wrote:
> > CC Tejun on this one. Looks fine to me, and matches the put path.
> > 
> > 
> > On 12/17/19 3:28 PM, Pavel Begunkov wrote:
> > > Add percpu_ref_tryget_many(), which works the same way as
> > > percpu_ref_tryget(), but grabs specified number of refs.
> > > 
> > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> > > ---
> > >  include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
> > >  1 file changed, 20 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
> > > index 390031e816dc..19079b62ce31 100644
> > > --- a/include/linux/percpu-refcount.h
> > > +++ b/include/linux/percpu-refcount.h
> > > @@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
> > >  }
> > >  
> > >  /**
> > > - * percpu_ref_tryget - try to increment a percpu refcount
> > > + * percpu_ref_tryget_many - try to increment a percpu refcount
> > >   * @ref: percpu_ref to try-get
> > > + * @nr: number of references to get
> > >   *
> > >   * Increment a percpu refcount unless its count already reached zero.
> > >   * Returns %true on success; %false on failure.

Minor nit: would be nice to change this so the two don't have identical
comments. (eg: Increment a percpu refcount by @nr unless...)
> > >   *
> > >   * This function is safe to call as long as @ref is between init and exit.
> > >   */
> > > -static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > > +static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
> > > +					  unsigned long nr)
> > >  {
> > >  	unsigned long __percpu *percpu_count;
> > >  	bool ret;
> > > @@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > >  	rcu_read_lock();
> > >  
> > >  	if (__ref_is_percpu(ref, &percpu_count)) {
> > > -		this_cpu_inc(*percpu_count);
> > > +		this_cpu_add(*percpu_count, nr);
> > >  		ret = true;
> > >  	} else {
> > > -		ret = atomic_long_inc_not_zero(&ref->count);
> > > +		ret = atomic_long_add_unless(&ref->count, nr, 0);
> > >  	}
> > >  
> > >  	rcu_read_unlock();
> > > @@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > >  	return ret;
> > >  }
> > >  
> > > +/**
> > > + * percpu_ref_tryget - try to increment a percpu refcount
> > > + * @ref: percpu_ref to try-get
> > > + *
> > > + * Increment a percpu refcount unless its count already reached zero.
> > > + * Returns %true on success; %false on failure.
> > > + *
> > > + * This function is safe to call as long as @ref is between init and exit.
> > > + */
> > > +static inline bool percpu_ref_tryget(struct percpu_ref *ref)
> > > +{
> > > +	return percpu_ref_tryget_many(ref, 1);
> > > +}
> > > +
> > >  /**
> > >   * percpu_ref_tryget_live - try to increment a live percpu refcount
> > >   * @ref: percpu_ref to try-get
> > > 
> > 
> > 
> > -- 
> > Jens Axboe
> > 
> 
> -- 
> tejun

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-18 17:49       ` Dennis Zhou
@ 2019-12-21 15:36         ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 15:36 UTC (permalink / raw)
  To: Dennis Zhou, Tejun Heo
  Cc: Jens Axboe, io-uring, linux-kernel, Christoph Lameter



On 18/12/2019 20:49, Dennis Zhou wrote:
> On Wed, Dec 18, 2019 at 08:26:42AM -0800, Tejun Heo wrote:
>> (cc'ing Dennis and Christoph and quoting whole body)
>>
>> Pavel, can you please cc percpu maintainers on related changes?

My bad, lost the cc's along the way.

>>
>> The patch looks fine to me.  Please feel free to add my acked-by.
>>
>> On Tue, Dec 17, 2019 at 04:42:59PM -0700, Jens Axboe wrote:
>>> CC Tejun on this one. Looks fine to me, and matches the put path.

Thanks both for taking a look!

>>>
>>>
>>> On 12/17/19 3:28 PM, Pavel Begunkov wrote:
>>>> Add percpu_ref_tryget_many(), which works the same way as
>>>> percpu_ref_tryget(), but grabs specified number of refs.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>  include/linux/percpu-refcount.h | 24 ++++++++++++++++++++----
>>>>  1 file changed, 20 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
>>>> index 390031e816dc..19079b62ce31 100644
>>>> --- a/include/linux/percpu-refcount.h
>>>> +++ b/include/linux/percpu-refcount.h
>>>> @@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
>>>>  }
>>>>  
>>>>  /**
>>>> - * percpu_ref_tryget - try to increment a percpu refcount
>>>> + * percpu_ref_tryget_many - try to increment a percpu refcount
>>>>   * @ref: percpu_ref to try-get
>>>> + * @nr: number of references to get
>>>>   *
>>>>   * Increment a percpu refcount unless its count already reached zero.
>>>>   * Returns %true on success; %false on failure.
> 
> Minor nit: would be nice to change this so the two don't have identical
> comments. (eg: Increment a percpu refcount by @nr unless...)
>>>>   *
>>>>   * This function is safe to call as long as @ref is between init and exit.
>>>>   */
>>>> -static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>>>> +static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
>>>> +					  unsigned long nr)
>>>>  {
>>>>  	unsigned long __percpu *percpu_count;
>>>>  	bool ret;
>>>> @@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>>>>  	rcu_read_lock();
>>>>  
>>>>  	if (__ref_is_percpu(ref, &percpu_count)) {
>>>> -		this_cpu_inc(*percpu_count);
>>>> +		this_cpu_add(*percpu_count, nr);
>>>>  		ret = true;
>>>>  	} else {
>>>> -		ret = atomic_long_inc_not_zero(&ref->count);
>>>> +		ret = atomic_long_add_unless(&ref->count, nr, 0);
>>>>  	}
>>>>  
>>>>  	rcu_read_unlock();
>>>> @@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +/**
>>>> + * percpu_ref_tryget - try to increment a percpu refcount
>>>> + * @ref: percpu_ref to try-get
>>>> + *
>>>> + * Increment a percpu refcount unless its count already reached zero.
>>>> + * Returns %true on success; %false on failure.
>>>> + *
>>>> + * This function is safe to call as long as @ref is between init and exit.
>>>> + */
>>>> +static inline bool percpu_ref_tryget(struct percpu_ref *ref)
>>>> +{
>>>> +	return percpu_ref_tryget_many(ref, 1);
>>>> +}
>>>> +
>>>>  /**
>>>>   * percpu_ref_tryget_live - try to increment a live percpu refcount
>>>>   * @ref: percpu_ref to try-get
>>>>
>>>
>>>
>>> -- 
>>> Jens Axboe
>>>
>>
>> -- 
>> tejun
> 
> Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Thanks,
> Dennis
> 

-- 
Pavel Begunkov



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring
  2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  2019-12-17 23:21   ` Jens Axboe
  2019-12-18  0:02   ` Jens Axboe
@ 2019-12-21 16:15   ` Pavel Begunkov
  2019-12-21 16:15     ` [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
                       ` (2 more replies)
  2 siblings, 3 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

Optimise percpu_ref_tryget() by batching it instead of calling it for
each request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
    better comments for percpu_ref_tryget_many (Dennis Zhou)
    amortise across io_uring_enter() boundary

Pavel Begunkov (3):
  pcpu_ref: add percpu_ref_tryget_many()
  io_uring: batch getting pcpu references
  io_uring: batch get(ctx->ref) across submits

 fs/io_uring.c                   | 29 ++++++++++++++++++++++++++---
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 2 files changed, 47 insertions(+), 8 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
@ 2019-12-21 16:15     ` Pavel Begunkov
  2019-12-21 16:15     ` [PATCH v2 2/3] io_uring: batch getting pcpu references Pavel Begunkov
  2019-12-21 16:15     ` [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits Pavel Begunkov
  2 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs a specified number of refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..22d9d183950d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
  * @ref: percpu_ref to try-get
+ * @nr: number of references to get
  *
- * Increment a percpu refcount unless its count already reached zero.
+ * Increment a percpu refcount  by @nr unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
  */
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+					  unsigned long nr)
 {
 	unsigned long __percpu *percpu_count;
 	bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	rcu_read_lock();
 
 	if (__ref_is_percpu(ref, &percpu_count)) {
-		this_cpu_inc(*percpu_count);
+		this_cpu_add(*percpu_count, nr);
 		ret = true;
 	} else {
-		ret = atomic_long_inc_not_zero(&ref->count);
+		ret = atomic_long_add_unless(&ref->count, nr, 0);
 	}
 
 	rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	return ret;
 }
 
+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	return percpu_ref_tryget_many(ref, 1);
+}
+
 /**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 2/3] io_uring: batch getting pcpu references
  2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-21 16:15     ` [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
@ 2019-12-21 16:15     ` Pavel Begunkov
  2019-12-21 16:15     ` [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits Pavel Begunkov
  2 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

A basic benchmark submitting and waiting on 128 non-linked nops showed a
~5% performance gain (7044 vs 7423 KIOPS).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 513f1922ce6a..5392134f042f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1045,9 +1045,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct io_kiocb *req;
 
-	if (!percpu_ref_tryget(&ctx->refs))
-		return NULL;
-
 	if (!state) {
 		req = kmem_cache_alloc(req_cachep, gfp);
 		if (unlikely(!req))
@@ -4391,6 +4388,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
 	int i, submitted = 0;
+	unsigned int extra_refs;
 	bool mm_fault = false;
 
 	/* if we have a backlog and couldn't flush it all, return BUSY */
@@ -4400,6 +4398,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			return -EBUSY;
 	}
 
+	if (!percpu_ref_tryget_many(&ctx->refs, nr))
+		return -EAGAIN;
+	extra_refs = nr;
+
 	if (nr > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, nr);
 		statep = &state;
@@ -4415,6 +4417,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 				submitted = -EAGAIN;
 			break;
 		}
+		--extra_refs;
 		if (!io_get_sqring(ctx, req, &sqe)) {
 			__io_free_req(req);
 			break;
@@ -4451,6 +4454,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 		io_queue_link_head(link);
 	if (statep)
 		io_submit_state_end(&state);
+	if (extra_refs)
+		percpu_ref_put_many(&ctx->refs, extra_refs);
 
 	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
 	io_commit_sqring(ctx);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-21 16:15     ` [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
  2019-12-21 16:15     ` [PATCH v2 2/3] io_uring: batch getting pcpu references Pavel Begunkov
@ 2019-12-21 16:15     ` Pavel Begunkov
  2019-12-21 16:20       ` Pavel Begunkov
  2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2 siblings, 2 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

Double-account ctx->refs by keeping the number of taken refs in ctx. As
io_uring takes per-request ctx->refs during submission while holding
ctx->uring_lock, this allows bypassing percpu_ref_get*() and its
overhead most of the time.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5392134f042f..eef09de94609 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -84,6 +84,9 @@
 #define IORING_MAX_ENTRIES	32768
 #define IORING_MAX_CQ_ENTRIES	(2 * IORING_MAX_ENTRIES)
 
+/* Not less than IORING_MAX_ENTRIES, so can grab once per submission loop */
+#define IORING_REFS_THRESHOLD	IORING_MAX_ENTRIES
+
 /*
  * Shift of 9 is 512 entries, or exactly one page on 64-bit archs
  */
@@ -197,6 +200,7 @@ struct fixed_file_data {
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
+		unsigned long		taken_refs; /* used under @uring_lock */
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -690,6 +694,13 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
 	complete(&ctx->completions[0]);
 }
 
+static void io_free_taken_refs(struct io_ring_ctx *ctx)
+{
+	if (ctx->taken_refs)
+		percpu_ref_put_many(&ctx->refs, ctx->taken_refs);
+	ctx->taken_refs = 0;
+}
+
 static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 {
 	struct io_ring_ctx *ctx;
@@ -4388,7 +4399,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
 	int i, submitted = 0;
-	unsigned int extra_refs;
 	bool mm_fault = false;
 
 	/* if we have a backlog and couldn't flush it all, return BUSY */
@@ -4398,9 +4408,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			return -EBUSY;
 	}
 
-	if (!percpu_ref_tryget_many(&ctx->refs, nr))
-		return -EAGAIN;
-	extra_refs = nr;
+	if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
+		if (unlikely(percpu_ref_is_dying(&ctx->refs))) {
+			io_free_taken_refs(ctx);
+			return -ENXIO;
+		}
+		if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
+			return -EAGAIN;
+		ctx->taken_refs += IORING_REFS_THRESHOLD;
+	}
 
 	if (nr > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, nr);
@@ -4417,8 +4433,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 				submitted = -EAGAIN;
 			break;
 		}
-		--extra_refs;
 		if (!io_get_sqring(ctx, req, &sqe)) {
+			/* not submitted, but a ref is freed */
+			ctx->taken_refs--;
 			__io_free_req(req);
 			break;
 		}
@@ -4454,8 +4471,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 		io_queue_link_head(link);
 	if (statep)
 		io_submit_state_end(&state);
-	if (extra_refs)
-		percpu_ref_put_many(&ctx->refs, extra_refs);
+	ctx->taken_refs -= submitted;
 
 	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
 	io_commit_sqring(ctx);
@@ -5731,6 +5747,7 @@ static int io_uring_fasync(int fd, struct file *file, int on)
 static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 {
 	mutex_lock(&ctx->uring_lock);
+	io_free_taken_refs(ctx);
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
@@ -6196,6 +6213,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 
 	if (opcode != IORING_UNREGISTER_FILES &&
 	    opcode != IORING_REGISTER_FILES_UPDATE) {
+		io_free_taken_refs(ctx);
 		percpu_ref_kill(&ctx->refs);
 
 		/*
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 16:15     ` [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits Pavel Begunkov
@ 2019-12-21 16:20       ` Pavel Begunkov
  2019-12-21 16:38         ` Jens Axboe
  2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:20 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter



On 21/12/2019 19:15, Pavel Begunkov wrote:
> Double account ctx->refs keeping number of taken refs in ctx. As
> io_uring gets per-request ctx->refs during submission, while holding
> ctx->uring_lock, this allows in most of the time to bypass
> percpu_ref_get*() and its overhead.

Jens, could you please benchmark with this one? Especially the offloaded
QD1 case. I haven't seen any difference in the nops test and don't have a
decent SSD at hand to test it myself. We could drop it if there is no
benefit.

This rewrites the @extra_refs handling from the second patch, so I left it
for now.


> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/io_uring.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5392134f042f..eef09de94609 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -84,6 +84,9 @@
>  #define IORING_MAX_ENTRIES	32768
>  #define IORING_MAX_CQ_ENTRIES	(2 * IORING_MAX_ENTRIES)
>  
> +/* Not less than IORING_MAX_ENTRIES, so can grab once per submission loop */
> +#define IORING_REFS_THRESHOLD	IORING_MAX_ENTRIES
> +
>  /*
>   * Shift of 9 is 512 entries, or exactly one page on 64-bit archs
>   */
> @@ -197,6 +200,7 @@ struct fixed_file_data {
>  struct io_ring_ctx {
>  	struct {
>  		struct percpu_ref	refs;
> +		unsigned long		taken_refs; /* used under @uring_lock */
>  	} ____cacheline_aligned_in_smp;
>  
>  	struct {
> @@ -690,6 +694,13 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
>  	complete(&ctx->completions[0]);
>  }
>  
> +static void io_free_taken_refs(struct io_ring_ctx *ctx)
> +{
> +	if (ctx->taken_refs)
> +		percpu_ref_put_many(&ctx->refs, ctx->taken_refs);
> +	ctx->taken_refs = 0;
> +}
> +
>  static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>  {
>  	struct io_ring_ctx *ctx;
> @@ -4388,7 +4399,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  	struct io_submit_state state, *statep = NULL;
>  	struct io_kiocb *link = NULL;
>  	int i, submitted = 0;
> -	unsigned int extra_refs;
>  	bool mm_fault = false;
>  
>  	/* if we have a backlog and couldn't flush it all, return BUSY */
> @@ -4398,9 +4408,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  			return -EBUSY;
>  	}
>  
> -	if (!percpu_ref_tryget_many(&ctx->refs, nr))
> -		return -EAGAIN;
> -	extra_refs = nr;
> +	if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
> +		if (unlikely(percpu_ref_is_dying(&ctx->refs))) {
> +			io_free_taken_refs(ctx);
> +			return -ENXIO;
> +		}
> +		if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
> +			return -EAGAIN;
> +		ctx->taken_refs += IORING_REFS_THRESHOLD;
> +	}
>  
>  	if (nr > IO_PLUG_THRESHOLD) {
>  		io_submit_state_start(&state, nr);
> @@ -4417,8 +4433,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  				submitted = -EAGAIN;
>  			break;
>  		}
> -		--extra_refs;
>  		if (!io_get_sqring(ctx, req, &sqe)) {
> +			/* not submitted, but a ref is freed */
> +			ctx->taken_refs--;
>  			__io_free_req(req);
>  			break;
>  		}
> @@ -4454,8 +4471,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  		io_queue_link_head(link);
>  	if (statep)
>  		io_submit_state_end(&state);
> -	if (extra_refs)
> -		percpu_ref_put_many(&ctx->refs, extra_refs);
> +	ctx->taken_refs -= submitted;
>  
>  	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
>  	io_commit_sqring(ctx);
> @@ -5731,6 +5747,7 @@ static int io_uring_fasync(int fd, struct file *file, int on)
>  static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
>  {
>  	mutex_lock(&ctx->uring_lock);
> +	io_free_taken_refs(ctx);
>  	percpu_ref_kill(&ctx->refs);
>  	mutex_unlock(&ctx->uring_lock);
>  
> @@ -6196,6 +6213,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>  
>  	if (opcode != IORING_UNREGISTER_FILES &&
>  	    opcode != IORING_REGISTER_FILES_UPDATE) {
> +		io_free_taken_refs(ctx);
>  		percpu_ref_kill(&ctx->refs);
>  
>  		/*
> 

-- 
Pavel Begunkov



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 16:20       ` Pavel Begunkov
@ 2019-12-21 16:38         ` Jens Axboe
  2019-12-21 16:48           ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-21 16:38 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

On 12/21/19 9:20 AM, Pavel Begunkov wrote:
> On 21/12/2019 19:15, Pavel Begunkov wrote:
>> Double account ctx->refs keeping number of taken refs in ctx. As
>> io_uring gets per-request ctx->refs during submission, while holding
>> ctx->uring_lock, this allows in most of the time to bypass
>> percpu_ref_get*() and its overhead.
> 
> Jens, could you please benchmark with this one? Especially for offloaded QD1
> case. I haven't got any difference for nops test and don't have a decent SSD
> at hands to test it myself. We could drop it, if there is no benefit.
> 
> This rewrites that @extra_refs from the second one, so I left it for now.

Sure, let me run a peak test, qd1 test, qd1+sqpoll test on
for-5.6/io_uring, same branch with 1-2, and same branch with 1-3. That
should give us a good comparison. One core used for all, and we're going
to be core speed bound for the performance in all cases on this setup.
So it'll be a good comparison.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 16:38         ` Jens Axboe
@ 2019-12-21 16:48           ` Pavel Begunkov
  2019-12-21 17:01             ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 16:48 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter



On 21/12/2019 19:38, Jens Axboe wrote:
> On 12/21/19 9:20 AM, Pavel Begunkov wrote:
>> On 21/12/2019 19:15, Pavel Begunkov wrote:
>>> Double account ctx->refs keeping number of taken refs in ctx. As
>>> io_uring gets per-request ctx->refs during submission, while holding
>>> ctx->uring_lock, this allows in most of the time to bypass
>>> percpu_ref_get*() and its overhead.
>>
>> Jens, could you please benchmark with this one? Especially for offloaded QD1
>> case. I haven't got any difference for nops test and don't have a decent SSD
>> at hands to test it myself. We could drop it, if there is no benefit.
>>
>> This rewrites that @extra_refs from the second one, so I left it for now.
> 
> Sure, let me run a peak test, qd1 test, qd1+sqpoll test on
> for-5.6/io_uring, same branch with 1-2, and same branch with 1-3. That
> should give us a good comparison. One core used for all, and we're going
> to be core speed bound for the performance in all cases on this setup.
> So it'll be a good comparison.
> 
Great, thanks!

-- 
Pavel Begunkov



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 16:48           ` Pavel Begunkov
@ 2019-12-21 17:01             ` Jens Axboe
  2019-12-21 17:26               ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-21 17:01 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

On 12/21/19 9:48 AM, Pavel Begunkov wrote:
> On 21/12/2019 19:38, Jens Axboe wrote:
>> On 12/21/19 9:20 AM, Pavel Begunkov wrote:
>>> On 21/12/2019 19:15, Pavel Begunkov wrote:
>>>> Double account ctx->refs keeping number of taken refs in ctx. As
>>>> io_uring gets per-request ctx->refs during submission, while holding
>>>> ctx->uring_lock, this allows in most of the time to bypass
>>>> percpu_ref_get*() and its overhead.
>>>
>>> Jens, could you please benchmark with this one? Especially for offloaded QD1
>>> case. I haven't got any difference for nops test and don't have a decent SSD
>>> at hands to test it myself. We could drop it, if there is no benefit.
>>>
>>> This rewrites that @extra_refs from the second one, so I left it for now.
>>
>> Sure, let me run a peak test, qd1 test, qd1+sqpoll test on
>> for-5.6/io_uring, same branch with 1-2, and same branch with 1-3. That
>> should give us a good comparison. One core used for all, and we're going
>> to be core speed bound for the performance in all cases on this setup.
>> So it'll be a good comparison.
>>
> Great, thanks!

For some reason, I'm not seeing much of a change between for-5.6/io_uring
and 1+2 and 1+2+3; it's about the same and the results seem very stable.
For reference, the top of the profile with 1-3 applied looks like this:

+    3.92%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
+    3.87%  io_uring  [kernel.vmlinux]  [k] blk_mq_get_request
+    3.43%  io_uring  [kernel.vmlinux]  [k] io_iopoll_getevents
+    3.03%  io_uring  [kernel.vmlinux]  [k] __slab_free
+    2.87%  io_uring  io_uring          [.] submitter_fn
+    2.79%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+    2.75%  io_uring  [kernel.vmlinux]  [k] bio_alloc_bioset
+    2.70%  io_uring  [nvme_core]       [k] nvme_setup_cmd
+    2.59%  io_uring  [kernel.vmlinux]  [k] blk_mq_make_request
+    2.46%  io_uring  [kernel.vmlinux]  [k] io_prep_rw
+    2.32%  io_uring  [kernel.vmlinux]  [k] io_read
+    2.25%  io_uring  [kernel.vmlinux]  [k] blk_mq_free_request
+    2.19%  io_uring  [kernel.vmlinux]  [k] io_put_req
+    2.06%  io_uring  [kernel.vmlinux]  [k] kmem_cache_alloc
+    2.01%  io_uring  [kernel.vmlinux]  [k] generic_make_request_checks
+    1.90%  io_uring  [kernel.vmlinux]  [k] __sbitmap_get_word
+    1.86%  io_uring  [kernel.vmlinux]  [k] sbitmap_queue_clear
+    1.85%  io_uring  [kernel.vmlinux]  [k] io_issue_sqe


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits
  2019-12-21 17:01             ` Jens Axboe
@ 2019-12-21 17:26               ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 17:26 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter



On 21/12/2019 20:01, Jens Axboe wrote:
> On 12/21/19 9:48 AM, Pavel Begunkov wrote:
>> On 21/12/2019 19:38, Jens Axboe wrote:
>>> On 12/21/19 9:20 AM, Pavel Begunkov wrote:
>>>> On 21/12/2019 19:15, Pavel Begunkov wrote:
>>>>> Double account ctx->refs keeping number of taken refs in ctx. As
>>>>> io_uring gets per-request ctx->refs during submission, while holding
>>>>> ctx->uring_lock, this allows in most of the time to bypass
>>>>> percpu_ref_get*() and its overhead.
>>>>
>>>> Jens, could you please benchmark with this one? Especially for offloaded QD1
>>>> case. I haven't got any difference for nops test and don't have a decent SSD
>>>> at hands to test it myself. We could drop it, if there is no benefit.
>>>>
>>>> This rewrites that @extra_refs from the second one, so I left it for now.
>>>
>>> Sure, let me run a peak test, qd1 test, qd1+sqpoll test on
>>> for-5.6/io_uring, same branch with 1-2, and same branch with 1-3. That
>>> should give us a good comparison. One core used for all, and we're going
>>> to be core speed bound for the performance in all cases on this setup.
>>> So it'll be a good comparison.
>>>
>> Great, thanks!
> 
> For some reason, not seeing much of a change between for-5.6/io_uring
> and 1+2 and 1+2+3, it's about the same and results seem very stable.
> For reference, top of profile with 1-3 applied looks like this:

I see. I'll probably drop the last one, as it only complicates things.

My apologies for the misleading terminology. Read-only QD1 (submit one
request and wait in userspace for its completion) obviously won't
saturate a CPU. Writes probably wouldn't either (though that depends on
the HW). It would be more accurate to say: submit one at a time,
complete in a bunch.
Just curious, what did you use for testing? Is it fio?
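
To illustrate that pattern, a rough liburing loop is below. It's only a
sketch: the ring size, the nop opcode and the lack of error handling are
placeholders, not the exact benchmark that was run.

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned submitted = 0, completed = 0;
	int i;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	for (i = 0; i < 128; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			break;
		io_uring_prep_nop(sqe);
		io_uring_submit(&ring);		/* submit one at a time */
		submitted++;

		/* reap whatever has already completed, without blocking */
		while (!io_uring_peek_cqe(&ring, &cqe)) {
			io_uring_cqe_seen(&ring, cqe);
			completed++;
		}
	}

	/* then wait for the rest in a bunch */
	while (completed < submitted) {
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
		completed++;
	}

	io_uring_queue_exit(&ring);
	printf("submitted %u, completed %u\n", submitted, completed);
	return 0;
}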

> 
> +    3.92%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
> +    3.87%  io_uring  [kernel.vmlinux]  [k] blk_mq_get_request
> +    3.43%  io_uring  [kernel.vmlinux]  [k] io_iopoll_getevents
> +    3.03%  io_uring  [kernel.vmlinux]  [k] __slab_free
> +    2.87%  io_uring  io_uring          [.] submitter_fn
> +    2.79%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
> +    2.75%  io_uring  [kernel.vmlinux]  [k] bio_alloc_bioset
> +    2.70%  io_uring  [nvme_core]       [k] nvme_setup_cmd
> +    2.59%  io_uring  [kernel.vmlinux]  [k] blk_mq_make_request
> +    2.46%  io_uring  [kernel.vmlinux]  [k] io_prep_rw
> +    2.32%  io_uring  [kernel.vmlinux]  [k] io_read
> +    2.25%  io_uring  [kernel.vmlinux]  [k] blk_mq_free_request
> +    2.19%  io_uring  [kernel.vmlinux]  [k] io_put_req
> +    2.06%  io_uring  [kernel.vmlinux]  [k] kmem_cache_alloc
> +    2.01%  io_uring  [kernel.vmlinux]  [k] generic_make_request_checks
> +    1.90%  io_uring  [kernel.vmlinux]  [k] __sbitmap_get_word
> +    1.86%  io_uring  [kernel.vmlinux]  [k] sbitmap_queue_clear
> +    1.85%  io_uring  [kernel.vmlinux]  [k] io_issue_sqe
> 
> 

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring
  2019-12-21 16:15     ` [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits Pavel Begunkov
  2019-12-21 16:20       ` Pavel Begunkov
@ 2019-12-21 20:12       ` Pavel Begunkov
  2019-12-21 20:12         ` [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
                           ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 20:12 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

Optimise percpu_ref_tryget() by batching it rather than calling it for
each request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
    better comments for percpu_ref_tryget_many (Dennis Zhou)
    amortise across io_uring_enter() boundary

v3: drop "batching across syscalls"
    remove error handling in io_submit_sqes() from common path

Pavel Begunkov (2):
  pcpu_ref: add percpu_ref_tryget_many()
  io_uring: batch getting pcpu references

 fs/io_uring.c                   | 14 ++++++++++----
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 9 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
@ 2019-12-21 20:12         ` Pavel Begunkov
  2019-12-21 20:12         ` [PATCH v3 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  2019-12-28 11:13         ` [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 20:12 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs specified number of refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..22d9d183950d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
  * @ref: percpu_ref to try-get
+ * @nr: number of references to get
  *
- * Increment a percpu refcount unless its count already reached zero.
+ * Increment a percpu refcount  by @nr unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
  */
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+					  unsigned long nr)
 {
 	unsigned long __percpu *percpu_count;
 	bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	rcu_read_lock();
 
 	if (__ref_is_percpu(ref, &percpu_count)) {
-		this_cpu_inc(*percpu_count);
+		this_cpu_add(*percpu_count, nr);
 		ret = true;
 	} else {
-		ret = atomic_long_inc_not_zero(&ref->count);
+		ret = atomic_long_add_unless(&ref->count, nr, 0);
 	}
 
 	rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	return ret;
 }
 
+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	return percpu_ref_tryget_many(ref, 1);
+}
+
 /**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 2/2] io_uring: batch getting pcpu references
  2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-21 20:12         ` [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
@ 2019-12-21 20:12         ` Pavel Begunkov
  2019-12-21 21:56           ` Pavel Begunkov
  2019-12-28 11:13         ` [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 20:12 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

A basic benchmark that submits and waits for 128 non-linked nops showed
a ~5% performance gain (7044 KIOPS vs 7423 KIOPS).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

It's just becoming bulkier with the ret handling for me, and I'd love to
hear how to make it clearer. This version removes all error handling from
the hot path, though at the cost of a goto.

 fs/io_uring.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 513f1922ce6a..b89a8b975c69 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1045,9 +1045,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct io_kiocb *req;
 
-	if (!percpu_ref_tryget(&ctx->refs))
-		return NULL;
-
 	if (!state) {
 		req = kmem_cache_alloc(req_cachep, gfp);
 		if (unlikely(!req))
@@ -4400,6 +4397,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			return -EBUSY;
 	}
 
+	if (!percpu_ref_tryget_many(&ctx->refs, nr))
+		return -EAGAIN;
+
 	if (nr > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, nr);
 		statep = &state;
@@ -4408,16 +4408,22 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	for (i = 0; i < nr; i++) {
 		const struct io_uring_sqe *sqe;
 		struct io_kiocb *req;
+		unsigned int unused_refs;
 
 		req = io_get_req(ctx, statep);
 		if (unlikely(!req)) {
+			unused_refs = nr - submitted;
 			if (!submitted)
 				submitted = -EAGAIN;
+put_refs:
+			percpu_ref_put_many(&ctx->refs, unused_refs);
 			break;
 		}
 		if (!io_get_sqring(ctx, req, &sqe)) {
 			__io_free_req(req);
-			break;
+			/* __io_free_req() puts a ref */
+			unused_refs = nr - submitted - 1;
+			goto put_refs;
 		}
 
 		/* will complete beyond this point, count as submitted */
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/2] io_uring: batch getting pcpu references
  2019-12-21 20:12         ` [PATCH v3 2/2] io_uring: batch getting pcpu references Pavel Begunkov
@ 2019-12-21 21:56           ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-21 21:56 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel


On 21/12/2019 23:12, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead getting a reference
> for each request, grab a bunch once per io_submit_sqes().
> 
> basic benchmark with submit and wait 128 non-linked nops showed ~5%
> performance gain. (7044 KIOPS vs 7423)
> 

Hmm, this won't work. Please drop it.

> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> 
> It's just becoming more bulky with ret for me, and would love to hear,
> hot to make it clearer. This version removes all error handling from
> hot path, though with goto.
> 
>  fs/io_uring.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 513f1922ce6a..b89a8b975c69 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1045,9 +1045,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>  	struct io_kiocb *req;
>  
> -	if (!percpu_ref_tryget(&ctx->refs))
> -		return NULL;
> -
>  	if (!state) {
>  		req = kmem_cache_alloc(req_cachep, gfp);
>  		if (unlikely(!req))
> @@ -4400,6 +4397,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  			return -EBUSY;
>  	}
>  
> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
> +		return -EAGAIN;
> +
>  	if (nr > IO_PLUG_THRESHOLD) {
>  		io_submit_state_start(&state, nr);
>  		statep = &state;
> @@ -4408,16 +4408,22 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  	for (i = 0; i < nr; i++) {
>  		const struct io_uring_sqe *sqe;
>  		struct io_kiocb *req;
> +		unsigned int unused_refs;
>  
>  		req = io_get_req(ctx, statep);
>  		if (unlikely(!req)) {
> +			unused_refs = nr - submitted;
>  			if (!submitted)
>  				submitted = -EAGAIN;
> +put_refs:
> +			percpu_ref_put_many(&ctx->refs, unused_refs);
>  			break;
>  		}
>  		if (!io_get_sqring(ctx, req, &sqe)) {
>  			__io_free_req(req);
> -			break;
> +			/* __io_free_req() puts a ref */
> +			unused_refs = nr - submitted - 1;
> +			goto put_refs;
>  		}
>  
>  		/* will complete beyond this point, count as submitted */
> 

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring
  2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-21 20:12         ` [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
  2019-12-21 20:12         ` [PATCH v3 2/2] io_uring: batch getting pcpu references Pavel Begunkov
@ 2019-12-28 11:13         ` Pavel Begunkov
  2019-12-28 11:13           ` [PATCH v4 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
  2019-12-28 11:13           ` [PATCH v4 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  2 siblings, 2 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-28 11:13 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

Optimise percpu_ref_tryget() by batching it rather than calling it for
each request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
    better comments for percpu_ref_tryget_many (Dennis Zhou)
    amortise across io_uring_enter() boundary

v3: drop "batching across syscalls"
    remove error handling in io_submit_sqes() from common path

v4: fix error handling

Pavel Begunkov (2):
  pcpu_ref: add percpu_ref_tryget_many()
  io_uring: batch getting pcpu references

 fs/io_uring.c                   | 26 +++++++++++++++++---------
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 14 deletions(-)

-- 
2.24.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/2] pcpu_ref: add percpu_ref_tryget_many()
  2019-12-28 11:13         ` [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
@ 2019-12-28 11:13           ` Pavel Begunkov
  2019-12-28 11:13           ` [PATCH v4 2/2] io_uring: batch getting pcpu references Pavel Begunkov
  1 sibling, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-28 11:13 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel
  Cc: Tejun Heo, Dennis Zhou, Christoph Lameter

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs specified number of refs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..22d9d183950d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
  * @ref: percpu_ref to try-get
+ * @nr: number of references to get
  *
- * Increment a percpu refcount unless its count already reached zero.
+ * Increment a percpu refcount  by @nr unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
  * This function is safe to call as long as @ref is between init and exit.
  */
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+					  unsigned long nr)
 {
 	unsigned long __percpu *percpu_count;
 	bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	rcu_read_lock();
 
 	if (__ref_is_percpu(ref, &percpu_count)) {
-		this_cpu_inc(*percpu_count);
+		this_cpu_add(*percpu_count, nr);
 		ret = true;
 	} else {
-		ret = atomic_long_inc_not_zero(&ref->count);
+		ret = atomic_long_add_unless(&ref->count, nr, 0);
 	}
 
 	rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	return ret;
 }
 
+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	return percpu_ref_tryget_many(ref, 1);
+}
+
 /**
  * percpu_ref_tryget_live - try to increment a live percpu refcount
  * @ref: percpu_ref to try-get
-- 
2.24.0
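
To show the intended shape of a caller, here is a minimal sketch pairing
the batched try-get with percpu_ref_put_many() for whatever part of the
batch goes unused. It is illustrative only: process_one() is a
hypothetical stand-in for per-item work and is not part of this series.

#include <linux/percpu-refcount.h>

static bool process_one(void);	/* hypothetical per-item work */

static int process_batch(struct percpu_ref *ref, unsigned int nr)
{
	unsigned int done = 0;

	/* take references for the whole batch up front */
	if (!percpu_ref_tryget_many(ref, nr))
		return -EAGAIN;

	while (done < nr) {
		if (!process_one())
			break;
		/* each processed item keeps (and later puts) its own ref */
		done++;
	}

	/* hand back the refs that were taken but not consumed */
	if (done != nr)
		percpu_ref_put_many(ref, nr - done);

	return done;
}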


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-28 11:13         ` [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
  2019-12-28 11:13           ` [PATCH v4 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
@ 2019-12-28 11:13           ` Pavel Begunkov
  2019-12-28 11:15             ` Pavel Begunkov
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-28 11:13 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

~5% throughput boost for a "submit and wait 128 nops" benchmark.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7fc1158bf9a4..404946080e86 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
 	struct io_kiocb *req;
 
-	if (!percpu_ref_tryget(&ctx->refs))
-		return NULL;
-
 	if (!state) {
 		req = kmem_cache_alloc(req_cachep, gfp);
 		if (unlikely(!req))
@@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
 	}
 }
 
+static void __io_req_free_empty(struct io_kiocb *req)
+{
+	if (likely(!io_is_fallback_req(req)))
+		kmem_cache_free(req_cachep, req);
+	else
+		clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req);
+}
+
 static void __io_free_req(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -1162,11 +1167,9 @@ static void __io_free_req(struct io_kiocb *req)
 			wake_up(&ctx->inflight_wait);
 		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
 	}
-	percpu_ref_put(&ctx->refs);
-	if (likely(!io_is_fallback_req(req)))
-		kmem_cache_free(req_cachep, req);
-	else
-		clear_bit_unlock(0, (unsigned long *) ctx->fallback_req);
+
+	percpu_ref_put(&req->ctx->refs);
+	__io_req_free_empty(req);
 }
 
 static bool io_link_cancel_timeout(struct io_kiocb *req)
@@ -4551,6 +4554,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			return -EBUSY;
 	}
 
+	if (!percpu_ref_tryget_many(&ctx->refs, nr))
+		return -EAGAIN;
+
 	if (nr > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, nr);
 		statep = &state;
@@ -4567,7 +4573,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			break;
 		}
 		if (!io_get_sqring(ctx, req, &sqe)) {
-			__io_free_req(req);
+			__io_req_free_empty(req);
 			break;
 		}
 
@@ -4598,6 +4604,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			break;
 	}
 
+	if (submitted != nr)
+		percpu_ref_put_many(&ctx->refs, nr - submitted);
 	if (link)
 		io_queue_link_head(link);
 	if (statep)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-28 11:13           ` [PATCH v4 2/2] io_uring: batch getting pcpu references Pavel Begunkov
@ 2019-12-28 11:15             ` Pavel Begunkov
  2019-12-28 17:03               ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-28 11:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 28/12/2019 14:13, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead getting a reference
> for each request, grab a bunch once per io_submit_sqes().
> 
> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/io_uring.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 7fc1158bf9a4..404946080e86 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>  	struct io_kiocb *req;
>  
> -	if (!percpu_ref_tryget(&ctx->refs))
> -		return NULL;
> -
>  	if (!state) {
>  		req = kmem_cache_alloc(req_cachep, gfp);
>  		if (unlikely(!req))
> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>  	}
>  }
>  
> +static void __io_req_free_empty(struct io_kiocb *req)

If anybody has better naming (or a better approach at all), I'm all ears.


> +{
> +	if (likely(!io_is_fallback_req(req)))
> +		kmem_cache_free(req_cachep, req);
> +	else
> +		clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req);
> +}
> +
>  static void __io_free_req(struct io_kiocb *req)
>  {
>  	struct io_ring_ctx *ctx = req->ctx;
> @@ -1162,11 +1167,9 @@ static void __io_free_req(struct io_kiocb *req)
>  			wake_up(&ctx->inflight_wait);
>  		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
>  	}
> -	percpu_ref_put(&ctx->refs);
> -	if (likely(!io_is_fallback_req(req)))
> -		kmem_cache_free(req_cachep, req);
> -	else
> -		clear_bit_unlock(0, (unsigned long *) ctx->fallback_req);
> +
> +	percpu_ref_put(&req->ctx->refs);
> +	__io_req_free_empty(req);
>  }
>  
>  static bool io_link_cancel_timeout(struct io_kiocb *req)
> @@ -4551,6 +4554,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  			return -EBUSY;
>  	}
>  
> +	if (!percpu_ref_tryget_many(&ctx->refs, nr))
> +		return -EAGAIN;
> +
>  	if (nr > IO_PLUG_THRESHOLD) {
>  		io_submit_state_start(&state, nr);
>  		statep = &state;
> @@ -4567,7 +4573,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  			break;
>  		}
>  		if (!io_get_sqring(ctx, req, &sqe)) {
> -			__io_free_req(req);
> +			__io_req_free_empty(req);
>  			break;
>  		}
>  
> @@ -4598,6 +4604,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>  			break;
>  	}
>  
> +	if (submitted != nr)
> +		percpu_ref_put_many(&ctx->refs, nr - submitted);
>  	if (link)
>  		io_queue_link_head(link);
>  	if (statep)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-28 11:15             ` Pavel Begunkov
@ 2019-12-28 17:03               ` Jens Axboe
  2019-12-28 18:37                 ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2019-12-28 17:03 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 12/28/19 4:15 AM, Pavel Begunkov wrote:
> On 28/12/2019 14:13, Pavel Begunkov wrote:
>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>> for each request, grab a bunch once per io_submit_sqes().
>>
>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>  fs/io_uring.c | 26 +++++++++++++++++---------
>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7fc1158bf9a4..404946080e86 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>  	struct io_kiocb *req;
>>  
>> -	if (!percpu_ref_tryget(&ctx->refs))
>> -		return NULL;
>> -
>>  	if (!state) {
>>  		req = kmem_cache_alloc(req_cachep, gfp);
>>  		if (unlikely(!req))
>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>>  	}
>>  }
>>  
>> +static void __io_req_free_empty(struct io_kiocb *req)
> 
> If anybody have better naming (or a better approach at all), I'm all ears.

__io_req_do_free()?

I think that's better than the "empty" name, as I'm not quite sure what
that means. If you're fine with that, I can just make that edit when
applying. The rest looks fine to me now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-28 17:03               ` Jens Axboe
@ 2019-12-28 18:37                 ` Pavel Begunkov
  2019-12-30  3:33                   ` Brian Gianforcaro
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-28 18:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel


On 28/12/2019 20:03, Jens Axboe wrote:
> On 12/28/19 4:15 AM, Pavel Begunkov wrote:
>> On 28/12/2019 14:13, Pavel Begunkov wrote:
>>> percpu_ref_tryget() has its own overhead. Instead getting a reference
>>> for each request, grab a bunch once per io_submit_sqes().
>>>
>>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>  fs/io_uring.c | 26 +++++++++++++++++---------
>>>  1 file changed, 17 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 7fc1158bf9a4..404946080e86 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>>  	struct io_kiocb *req;
>>>  
>>> -	if (!percpu_ref_tryget(&ctx->refs))
>>> -		return NULL;
>>> -
>>>  	if (!state) {
>>>  		req = kmem_cache_alloc(req_cachep, gfp);
>>>  		if (unlikely(!req))
>>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>>>  	}
>>>  }
>>>  
>>> +static void __io_req_free_empty(struct io_kiocb *req)
>>
>> If anybody have better naming (or a better approach at all), I'm all ears.
> 
> __io_req_do_free()?

Not quite clear what the difference from __io_req_free() would be then.

> 
> I think that's better than the empty, not quite sure what that means.

Probably so. It was meant as a kind of "request without a bound sqe".
Does io_free_{hollow,empty}_req() sound better?

> If you're fine with that, I can just make that edit when applying.
> The rest looks fine to me now.
> 

Please do

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-28 18:37                 ` Pavel Begunkov
@ 2019-12-30  3:33                   ` Brian Gianforcaro
  2019-12-30 18:45                     ` Pavel Begunkov
  0 siblings, 1 reply; 34+ messages in thread
From: Brian Gianforcaro @ 2019-12-30  3:33 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-kernel

On Sat, Dec 28, 2019 at 09:37:35PM +0300, Pavel Begunkov wrote:
> On 28/12/2019 20:03, Jens Axboe wrote:
> > On 12/28/19 4:15 AM, Pavel Begunkov wrote:
> >> On 28/12/2019 14:13, Pavel Begunkov wrote:
> >>> percpu_ref_tryget() has its own overhead. Instead getting a reference
> >>> for each request, grab a bunch once per io_submit_sqes().
> >>>
> >>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
> >>>
> >>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >>> ---
> >>>  fs/io_uring.c | 26 +++++++++++++++++---------
> >>>  1 file changed, 17 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >>> index 7fc1158bf9a4..404946080e86 100644
> >>> --- a/fs/io_uring.c
> >>> +++ b/fs/io_uring.c
> >>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
> >>>  	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> >>>  	struct io_kiocb *req;
> >>>  
> >>> -	if (!percpu_ref_tryget(&ctx->refs))
> >>> -		return NULL;
> >>> -
> >>>  	if (!state) {
> >>>  		req = kmem_cache_alloc(req_cachep, gfp);
> >>>  		if (unlikely(!req))
> >>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
> >>>  	}
> >>>  }
> >>>  
> >>> +static void __io_req_free_empty(struct io_kiocb *req)
> >>
> >> If anybody have better naming (or a better approach at all), I'm all ears.
> > 
> > __io_req_do_free()?
> 
> Not quite clear what's the difference with __io_req_free() then
> 
> > 
> > I think that's better than the empty, not quite sure what that means.
> 
> Probably, so. It was kind of "request without a bound sqe".
> Does io_free_{hollow,empty}_req() sound better?

Given your description, perhaps io_free_unbound_req() makes sense?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/2] io_uring: batch getting pcpu references
  2019-12-30  3:33                   ` Brian Gianforcaro
@ 2019-12-30 18:45                     ` Pavel Begunkov
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Begunkov @ 2019-12-30 18:45 UTC (permalink / raw)
  To: Brian Gianforcaro; +Cc: Jens Axboe, io-uring, linux-kernel


On 30/12/2019 06:33, Brian Gianforcaro wrote:
>>>>> +static void __io_req_free_empty(struct io_kiocb *req)
>>>>
>>>> If anybody have better naming (or a better approach at all), I'm all ears.
>>>
>>> __io_req_do_free()?
>>
>> Not quite clear what's the difference with __io_req_free() then
>>
>>>
>>> I think that's better than the empty, not quite sure what that means.
>>
>> Probably, so. It was kind of "request without a bound sqe".
>> Does io_free_{hollow,empty}_req() sound better?
> 
> Given your description, perhaps io_free_unbound_req() makes sense?
> 

I like it more, though neither of these has a set meaning in the io_uring
context. The patch has already landed in Jens' repo, so you can send
another one if you think it's worth it.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-12-30 18:46 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-17 22:28 [PATCH 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
2019-12-17 22:28 ` [PATCH 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
2019-12-17 23:42   ` Jens Axboe
2019-12-18 16:26     ` Tejun Heo
2019-12-18 17:49       ` Dennis Zhou
2019-12-21 15:36         ` Pavel Begunkov
2019-12-17 22:28 ` [PATCH 2/2] io_uring: batch getting pcpu references Pavel Begunkov
2019-12-17 23:21   ` Jens Axboe
2019-12-17 23:31     ` Jens Axboe
2019-12-18  9:25       ` Pavel Begunkov
2019-12-18  9:23     ` Pavel Begunkov
2019-12-18  0:02   ` Jens Axboe
2019-12-18 10:41     ` Pavel Begunkov
2019-12-21 16:15   ` [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring Pavel Begunkov
2019-12-21 16:15     ` [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
2019-12-21 16:15     ` [PATCH v2 2/3] io_uring: batch getting pcpu references Pavel Begunkov
2019-12-21 16:15     ` [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits Pavel Begunkov
2019-12-21 16:20       ` Pavel Begunkov
2019-12-21 16:38         ` Jens Axboe
2019-12-21 16:48           ` Pavel Begunkov
2019-12-21 17:01             ` Jens Axboe
2019-12-21 17:26               ` Pavel Begunkov
2019-12-21 20:12       ` [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
2019-12-21 20:12         ` [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
2019-12-21 20:12         ` [PATCH v3 2/2] io_uring: batch getting pcpu references Pavel Begunkov
2019-12-21 21:56           ` Pavel Begunkov
2019-12-28 11:13         ` [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring Pavel Begunkov
2019-12-28 11:13           ` [PATCH v4 1/2] pcpu_ref: add percpu_ref_tryget_many() Pavel Begunkov
2019-12-28 11:13           ` [PATCH v4 2/2] io_uring: batch getting pcpu references Pavel Begunkov
2019-12-28 11:15             ` Pavel Begunkov
2019-12-28 17:03               ` Jens Axboe
2019-12-28 18:37                 ` Pavel Begunkov
2019-12-30  3:33                   ` Brian Gianforcaro
2019-12-30 18:45                     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).