linux-kernel.vger.kernel.org archive mirror
* [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
@ 2018-09-09 12:58 Ming Lei
  2018-09-09 18:46 ` Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Ming Lei @ 2018-09-09 12:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ming Lei, Tejun Heo, Jianchao Wang, Kent Overstreet, linux-block

Now percpu_ref_reinit() can only be done on a percpu refcounter after
it has dropped to zero. The limit needn't be so strict, since it is
quite straightforward for percpu_ref_reinit() to be done while the
counter is in atomic mode.

This patch relaxes the limit, so we may avoid the extra change[1] needed
for NVMe timeout handling.

[1] https://marc.info/?l=linux-kernel&m=153612052611020&w=2

Cc: Tejun Heo <tj@kernel.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 lib/percpu-refcount.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..af6b514c7d72 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -130,8 +130,10 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 	unsigned long count = 0;
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		count += *per_cpu_ptr(percpu_count, cpu);
+		*per_cpu_ptr(percpu_count, cpu) = 0;
+	}
 
 	pr_debug("global %ld percpu %ld",
 		 atomic_long_read(&ref->count), (long)count);
@@ -187,7 +189,6 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
-	int cpu;
 
 	BUG_ON(!percpu_count);
 
@@ -196,15 +197,6 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 
 	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
 
-	/*
-	 * Restore per-cpu operation.  smp_store_release() is paired
-	 * with READ_ONCE() in __ref_is_percpu() and guarantees that the
-	 * zeroing is visible to all percpu accesses which can see the
-	 * following __PERCPU_REF_ATOMIC clearing.
-	 */
-	for_each_possible_cpu(cpu)
-		*per_cpu_ptr(percpu_count, cpu) = 0;
-
 	smp_store_release(&ref->percpu_count_ptr,
 			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
 }
@@ -349,7 +341,7 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  *
  * Re-initialize @ref so that it's in the same state as when it finished
  * percpu_ref_init() ignoring %PERCPU_REF_INIT_DEAD.  @ref must have been
- * initialized successfully and reached 0 but not exited.
+ * initialized successfully and in atomic mode but not exited.
  *
  * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
  * this function is in progress.
@@ -357,10 +349,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
 void percpu_ref_reinit(struct percpu_ref *ref)
 {
 	unsigned long flags;
+	unsigned long __percpu *percpu_count;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
-	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
+	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
 
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
-- 
2.9.5



* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 12:58 [PATCH] percpu-refcount: relax limit on percpu_ref_reinit() Ming Lei
@ 2018-09-09 18:46 ` Bart Van Assche
  2018-09-09 23:59   ` Ming Lei
  2018-09-10  1:40 ` jianchao.wang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2018-09-09 18:46 UTC (permalink / raw)
  To: Ming Lei, linux-kernel
  Cc: Tejun Heo, Jianchao Wang, Kent Overstreet, linux-block

On Sun, 2018-09-09 at 20:58 +0800, Ming Lei wrote:
> Now percpu_ref_reinit() can only be done on one percpu refcounter
> when it drops zero. And the limit shouldn't be so strict, and it
> is quite straightforward that percpu_ref_reinit() can be done when
> this counter is at atomic mode.
> 
> This patch relaxes the limit, so we may avoid extra change[1] for NVMe
> timeout's requirement.
> 
> [1] https://marc.info/?l=linux-kernel&m=153612052611020&w=2

Is the NVMe driver the only block driver that hangs when an attempt is made to
freeze its request queue while a request times out? If so, can this hang be
fixed by modifying the NVMe driver instead of by modifying the percpu
refcount implementation?

Thanks,

Bart.



* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 18:46 ` Bart Van Assche
@ 2018-09-09 23:59   ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2018-09-09 23:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-kernel, Tejun Heo, Jianchao Wang, Kent Overstreet, linux-block

On Sun, Sep 09, 2018 at 11:46:20AM -0700, Bart Van Assche wrote:
> On Sun, 2018-09-09 at 20:58 +0800, Ming Lei wrote:
> > Now percpu_ref_reinit() can only be done on one percpu refcounter
> > when it drops zero. And the limit shouldn't be so strict, and it
> > is quite straightforward that percpu_ref_reinit() can be done when
> > this counter is at atomic mode.
> > 
> > This patch relaxes the limit, so we may avoid extra change[1] for NVMe
> > timeout's requirement.
> > 
> > [1] https://marc.info/?l=linux-kernel&m=153612052611020&w=2
> 
> Is the NVMe driver the only block driver that hangs if it is attempted to
> freeze its request queue when a request times out? If so, can this hang be
> fixed by modifying the NVMe driver instead of by modifying the percpu
> refcount implementation?

IMO, this patch is quite simple, and it should be much simpler than
solving this issue in the block layer or in NVMe. Could we focus on the
technical correctness of this patch first?

However, if anyone has a better/simpler solution for this issue, in
whatever form, I am open to it too.

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 12:58 [PATCH] percpu-refcount: relax limit on percpu_ref_reinit() Ming Lei
  2018-09-09 18:46 ` Bart Van Assche
@ 2018-09-10  1:40 ` jianchao.wang
  2018-09-10 16:11   ` Ming Lei
  2018-09-10  1:54 ` jianchao.wang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: jianchao.wang @ 2018-09-10  1:40 UTC (permalink / raw)
  To: Ming Lei, linux-kernel; +Cc: Tejun Heo, Kent Overstreet, linux-block

Hi Ming

On 09/09/2018 08:58 PM, Ming Lei wrote:
> Now percpu_ref_reinit() can only be done on one percpu refcounter
> when it drops zero. And the limit shouldn't be so strict, and it
> is quite straightforward that percpu_ref_reinit() can be done when
> this counter is at atomic mode.

As we know, when the percpu_ref is switched to atomic mode, the per-cpu
values are summed up into the atomic counter in percpu_ref_switch_to_atomic_rcu().

However, the tricky part is:
when we switch back to percpu mode, how can we know the exact value of every cpu's counter?

Draining the percpu refcounter to zero before switching it back to percpu mode should be
relatively easy to implement. This is also the initial intention of the percpu refcounter:
only switch to atomic mode when we want to drain the refcounter.


Thanks
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 12:58 [PATCH] percpu-refcount: relax limit on percpu_ref_reinit() Ming Lei
  2018-09-09 18:46 ` Bart Van Assche
  2018-09-10  1:40 ` jianchao.wang
@ 2018-09-10  1:54 ` jianchao.wang
  2018-09-10 16:49 ` Tejun Heo
  2018-09-18  3:21 ` jianchao.wang
  4 siblings, 0 replies; 30+ messages in thread
From: jianchao.wang @ 2018-09-10  1:54 UTC (permalink / raw)
  To: Ming Lei, linux-kernel; +Cc: Tejun Heo, Kent Overstreet, linux-block

Hi Ming

On 09/09/2018 08:58 PM, Ming Lei wrote:
> Now percpu_ref_reinit() can only be done on one percpu refcounter
> when it drops zero. And the limit shouldn't be so strict, and it
> is quite straightforward that percpu_ref_reinit() can be done when
> this counter is at atomic mode.

As we know, when the percpu_ref is switched to atomic mode, the per-cpu
values are summed up into the atomic counter in percpu_ref_switch_to_atomic_rcu().

However, the tricky part is:
when we switch back to percpu mode, how can we know the exact value of the ref counter of every cpu?

Draining the percpu refcounter to zero before switching it back to percpu mode should be
relatively easy to implement. This is also the initial intention of the percpu refcounter:
only switch to atomic mode when we want to drain the refcounter.

Thanks
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-10  1:40 ` jianchao.wang
@ 2018-09-10 16:11   ` Ming Lei
  2018-09-11  1:48     ` jianchao.wang
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-10 16:11 UTC (permalink / raw)
  To: jianchao.wang; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

Hi Jianchao,

On Mon, Sep 10, 2018 at 09:40:35AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/09/2018 08:58 PM, Ming Lei wrote:
> > Now percpu_ref_reinit() can only be done on one percpu refcounter
> > when it drops zero. And the limit shouldn't be so strict, and it
> > is quite straightforward that percpu_ref_reinit() can be done when
> > this counter is at atomic mode.
> 
> As we know, when the percpu_ref is switched to atomic mode, the values
> of the per cpu will be sumed up to the atomic conter in percpu_ref_switch_to_atomic_rcu.

Right.

> 
> However, the tricky part is:
> when we switch back to percpu mode, how can we know the exact value of the value of every cpu ?

The exact value on each CPU is zero at that point:

1) when the ref is switched to percpu mode from atomic mode

percpu_ref_switch_to_atomic_rcu() is the point after which no percpu inc/dec
can happen any more. In this function the percpu counts are summed up into
the atomic counter, and meanwhile this patch clears the percpu values. It
means that once the refcount is switched to atomic mode, the percpu values
are always zero, doesn't it?

2) when the percpu-refcount is initialized in percpu mode

the percpu values are zero too.
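
(For reference, a condensed sketch of the two switch paths as they look with
this patch applied, based on the hunks above rather than the complete code in
lib/percpu-refcount.c:)

	/* switch to atomic (RCU callback): fold per-cpu counts into ref->count */
	for_each_possible_cpu(cpu) {
		count += *per_cpu_ptr(percpu_count, cpu);
		*per_cpu_ptr(percpu_count, cpu) = 0;	/* slots stay zero from now on */
	}
	atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);

	/* switch back to percpu: slots are already zero, so only the bias
	 * and the mode flag need to change
	 */
	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
	smp_store_release(&ref->percpu_count_ptr,
			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);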

> 
> Draining the percpu refcounter to zero before switch it back to percpu mode should be relatively
> easy to implement. And also, this is the initial intention of percpu refcounter, only switch

No, I don't think so. We can extend the percpu-refcount implementation to
cover the NVMe timeout case easily, so there is no need to reinvent the wheel
to address that issue.

> to atomic mode when want to drain the refcounter. 


Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 12:58 [PATCH] percpu-refcount: relax limit on percpu_ref_reinit() Ming Lei
                   ` (2 preceding siblings ...)
  2018-09-10  1:54 ` jianchao.wang
@ 2018-09-10 16:49 ` Tejun Heo
  2018-09-11  0:00   ` Ming Lei
  2018-09-18  3:21 ` jianchao.wang
  4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-10 16:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

Hello, Ming.

On Sun, Sep 09, 2018 at 08:58:24PM +0800, Ming Lei wrote:
> @@ -196,15 +197,6 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
>  
>  	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
>  
> -	/*
> -	 * Restore per-cpu operation.  smp_store_release() is paired
> -	 * with READ_ONCE() in __ref_is_percpu() and guarantees that the
> -	 * zeroing is visible to all percpu accesses which can see the
> -	 * following __PERCPU_REF_ATOMIC clearing.
> -	 */

So, while the location of percpu counter resetting moved, the pairing
of store_release and READ_ONCE is still required to ensure that the
clearing is visible before the switching to percpu mode becomes
effective.  Can you please rephrase and keep the above comment?

> -	for_each_possible_cpu(cpu)
> -		*per_cpu_ptr(percpu_count, cpu) = 0;
> -
>  	smp_store_release(&ref->percpu_count_ptr,
>  			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
>  }
...
> @@ -357,10 +349,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
>  void percpu_ref_reinit(struct percpu_ref *ref)
>  {
>  	unsigned long flags;
> +	unsigned long __percpu *percpu_count;
>  
>  	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
>  
> -	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
> +	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));

Can you elaborate this part?  This doesn't seem required for the
described change.  Why is it part of the patch?

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-10 16:49 ` Tejun Heo
@ 2018-09-11  0:00   ` Ming Lei
  2018-09-11 13:48     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11  0:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

Hi Tejun,

On Mon, Sep 10, 2018 at 09:49:20AM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Sun, Sep 09, 2018 at 08:58:24PM +0800, Ming Lei wrote:
> > @@ -196,15 +197,6 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
> >  
> >  	atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
> >  
> > -	/*
> > -	 * Restore per-cpu operation.  smp_store_release() is paired
> > -	 * with READ_ONCE() in __ref_is_percpu() and guarantees that the
> > -	 * zeroing is visible to all percpu accesses which can see the
> > -	 * following __PERCPU_REF_ATOMIC clearing.
> > -	 */
> 
> So, while the location of percpu counter resetting moved, the pairing
> of store_release and READ_ONCE is still required to ensure that the
> clearing is visible before the switching to percpu mode becomes
> effective.  Can you please rephrase and keep the above comment?

OK, will do it in V2.

> 
> > -	for_each_possible_cpu(cpu)
> > -		*per_cpu_ptr(percpu_count, cpu) = 0;
> > -
> >  	smp_store_release(&ref->percpu_count_ptr,
> >  			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
> >  }
> ...
> > @@ -357,10 +349,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
> >  void percpu_ref_reinit(struct percpu_ref *ref)
> >  {
> >  	unsigned long flags;
> > +	unsigned long __percpu *percpu_count;
> >  
> >  	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
> >  
> > -	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
> > +	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
> 
> Can you elaborate this part?  This doesn't seem required for the
> described change.  Why is it part of the patch?

The motivation of this patch is to avoid the above warning and allow
the ref to switch back to percpu mode without dropping to zero.

That is why the check has to be changed in the above way.


Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-10 16:11   ` Ming Lei
@ 2018-09-11  1:48     ` jianchao.wang
  2018-09-11  4:03       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: jianchao.wang @ 2018-09-11  1:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

Hi Ming

On 09/11/2018 12:11 AM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Mon, Sep 10, 2018 at 09:40:35AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/09/2018 08:58 PM, Ming Lei wrote:
>>> Now percpu_ref_reinit() can only be done on one percpu refcounter
>>> when it drops zero. And the limit shouldn't be so strict, and it
>>> is quite straightforward that percpu_ref_reinit() can be done when
>>> this counter is at atomic mode.
>>
>> As we know, when the percpu_ref is switched to atomic mode, the values
>> of the per cpu will be sumed up to the atomic conter in percpu_ref_switch_to_atomic_rcu.
> 
> Right.
> 
>>
>> However, the tricky part is:
>> when we switch back to percpu mode, how can we know the exact value of the value of every cpu ?
> 
> The exact value of each CPU is zero at the exact time:
> 
> 1) when percpu mode is switched from atomic mode
> 
> percpu_ref_switch_to_atomic_rcu() is the point where no any percpu inc/dec
> can happen any more. And in this function the percpu count is sumed up to
> the atomic counter, meantime this patch clears the percpu value. It means
> once the refcount is switched to atomic mode, the percpu value is always
> zero, doesn't it?
> 
> 2) when the percpu-refcount is initialized at percpu mode
> 
> the percpu value is zero too.

What we want is to switch the percpu refcounter to percpu mode from atomic mode while
the refcounter has _not_ been drained to zero, instead of just discarding the warning, right?

When we have summed the values of every cpu's refcounter into a global atomic counter, how can
we give the values back to the refcounter of every cpu?


Thanks
Jianchao

> 
>>
>> Draining the percpu refcounter to zero before switch it back to percpu mode should be relatively
>> easy to implement. And also, this is the initial intention of percpu refcounter, only switch
> 
> No, I don't think so, we can extend the percpu-refcount implementation to
> cover the NVMe timeout case easily. Then no necessary to reinvent a new wheel
> to address that issue.
> 
>> to atomic mode when want to drain the refcounter. 
> 
> 
> Thanks,
> Ming
> 


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  1:48     ` jianchao.wang
@ 2018-09-11  4:03       ` Ming Lei
  2018-09-11  4:40         ` jianchao.wang
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11  4:03 UTC (permalink / raw)
  To: jianchao.wang; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 09:48:15AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/11/2018 12:11 AM, Ming Lei wrote:
> > Hi Jianchao,
> > 
> > On Mon, Sep 10, 2018 at 09:40:35AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 09/09/2018 08:58 PM, Ming Lei wrote:
> >>> Now percpu_ref_reinit() can only be done on one percpu refcounter
> >>> when it drops zero. And the limit shouldn't be so strict, and it
> >>> is quite straightforward that percpu_ref_reinit() can be done when
> >>> this counter is at atomic mode.
> >>
> >> As we know, when the percpu_ref is switched to atomic mode, the values
> >> of the per cpu will be sumed up to the atomic conter in percpu_ref_switch_to_atomic_rcu.
> > 
> > Right.
> > 
> >>
> >> However, the tricky part is:
> >> when we switch back to percpu mode, how can we know the exact value of the value of every cpu ?
> > 
> > The exact value of each CPU is zero at the exact time:
> > 
> > 1) when percpu mode is switched from atomic mode
> > 
> > percpu_ref_switch_to_atomic_rcu() is the point where no any percpu inc/dec
> > can happen any more. And in this function the percpu count is sumed up to
> > the atomic counter, meantime this patch clears the percpu value. It means
> > once the refcount is switched to atomic mode, the percpu value is always
> > zero, doesn't it?
> > 
> > 2) when the percpu-refcount is initialized at percpu mode
> > 
> > the percpu value is zero too.
> 
> What we want to get is to switch the percpu refcounter to percpu mode from atomic mode when
> the refcounter has _not_ been drained to zero, instead of just to discard the warning, right ?

Right, it does work this way, as I explained.

The idea is simple: atomic mode is an easy mode to switch to percpu
mode from, and we don't have to wait until the whole ref-count (in atomic
mode) drops to zero before switching.

> 
> When we have sumed the values of every cpu's refcounter to a global atomic counter, how can
> we give the values back to the refcounter of every cpu ?

Who would use a percpu-refcount in this way? From the user's view, only the
whole count matters.

After the refcount is switched to atomic mode, the whole counting is
done on the atomic part. Then, if the refcount needs to switch to percpu mode
again, the percpu part of the counter is simply re-initialized to zero. This
invariant holds with or without this patch.

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  4:03       ` Ming Lei
@ 2018-09-11  4:40         ` jianchao.wang
  2018-09-11  8:20           ` Ming Lei
  2018-09-11 13:44           ` Tejun Heo
  0 siblings, 2 replies; 30+ messages in thread
From: jianchao.wang @ 2018-09-11  4:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

Hi Ming

On 09/11/2018 12:03 PM, Ming Lei wrote:
> After the refcount is switched to atomic mode, the whole counting is
> done on the atomic part. Then if the refcount need to switch to percpu mode
> again, all percpu part of the counter is re-initialized as zero simply. This
> is invariant with/without this patch.

Does the "whole counting" here means ?

(long)(sum (every cpu's refcounter)) + atomic refcounter

and when switch to atomic mode, there could be value left in atomic refcounter.
then the unsigned long percpu refcounter cound be decreased from 0.
	
From another angle, one request could be ended on a different cpu from the one where
it is issued.

Thanks
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  4:40         ` jianchao.wang
@ 2018-09-11  8:20           ` Ming Lei
  2018-09-11 14:22             ` jianchao.wang
  2018-09-11 13:44           ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11  8:20 UTC (permalink / raw)
  To: jianchao.wang; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 12:40:36PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/11/2018 12:03 PM, Ming Lei wrote:
> > After the refcount is switched to atomic mode, the whole counting is
> > done on the atomic part. Then if the refcount need to switch to percpu mode
> > again, all percpu part of the counter is re-initialized as zero simply. This
> > is invariant with/without this patch.
> 
> Does the "whole counting" here means ?
> 
> (long)(sum (every cpu's refcounter)) + atomic refcounter

No.

We only check the 'whole count' in atomic mode, and it is the atomic
part ('ref->count'); please see percpu_ref_put_many(), in which ref->release()
is only called in atomic mode.

In percpu mode, the ref-count is only decreased/increased on the local CPU.
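
(For reference, the put-side fast path, roughly as it looks in
include/linux/percpu-refcount.h at this time, shows why ->release() can
only run in atomic mode:)

	static inline void percpu_ref_put_many(struct percpu_ref *ref,
					       unsigned long nr)
	{
		unsigned long __percpu *percpu_count;

		rcu_read_lock_sched();

		if (__ref_is_percpu(ref, &percpu_count))
			this_cpu_sub(*percpu_count, nr);	/* percpu mode: local op only */
		else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
			ref->release(ref);			/* atomic mode only */

		rcu_read_unlock_sched();
	}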

> 
> and when switch to atomic mode, there could be value left in atomic refcounter.
> then the unsigned long percpu refcounter cound be decreased from 0.
> 	
> From another angle, one request could be ended on a different cpu from the one where
> it is issued.

Right.

But,

In the fast path (either the completion or the issue path), .q_usage_counter is only
increased/decreased, which can be done in either atomic or percpu mode.

And when we want to check the queued/inflight requests, .q_usage_counter has
to be killed first by switching to atomic mode, as is done in all the
users of blk_freeze_queue_start().
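
(The issue-path side is essentially a tryget on that counter; a simplified
sketch of what blk_queue_enter() does around this time:)

	/* every new request has to enter the queue first */
	if (percpu_ref_tryget_live(&q->q_usage_counter))
		return 0;			/* queue is live, proceed */
	if (flags & BLK_MQ_REQ_NOWAIT)
		return -EBUSY;
	/* otherwise wait until the queue is unfrozen or dying */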


Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  4:40         ` jianchao.wang
  2018-09-11  8:20           ` Ming Lei
@ 2018-09-11 13:44           ` Tejun Heo
  2018-09-11 14:13             ` jianchao.wang
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-11 13:44 UTC (permalink / raw)
  To: jianchao.wang; +Cc: Ming Lei, linux-kernel, Kent Overstreet, linux-block

Hello,

On Tue, Sep 11, 2018 at 12:40:36PM +0800, jianchao.wang wrote:
> (long)(sum (every cpu's refcounter)) + atomic refcounter
> 
> and when switch to atomic mode, there could be value left in atomic refcounter.
> then the unsigned long percpu refcounter cound be decreased from 0.

It can always do that.  Imagine a task getting a ref on one cpu,
migrating and putting it on another.  The only thing which matters is
that the sum of everything adds up to what it should be.
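
(Concretely: a get on CPU0 followed by a put on CPU1 leaves CPU0's slot at 1
and CPU1's slot at (unsigned long)-1; each slot alone looks odd, but the sum
of all slots is still 0, so the overall count is unchanged.)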

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  0:00   ` Ming Lei
@ 2018-09-11 13:48     ` Tejun Heo
  2018-09-11 15:45       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-11 13:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

Hello, Ming.

On Tue, Sep 11, 2018 at 08:00:50AM +0800, Ming Lei wrote:
> > > @@ -357,10 +349,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
> > >  void percpu_ref_reinit(struct percpu_ref *ref)
> > >  {
> > >  	unsigned long flags;
> > > +	unsigned long __percpu *percpu_count;
> > >  
> > >  	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
> > >  
> > > -	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
> > > +	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
> > 
> > Can you elaborate this part?  This doesn't seem required for the
> > described change.  Why is it part of the patch?
> 
> The motivation of this patch is to avoid the above warning and allow
> the ref to switch back to percpu mode without dropping to zero.
> 
> That is why the check has to be changed to the above way.

So, this part seems wrong.  The function is called percpu_ref_reinit()
- the refcnt is expected to be in its initial state with just the base
ref once this function returns.  If you're removing the restriction on
when this can be called, you should also make sure that the function
actually enforces the target state.  Also, this is a separate logical
change, please put it in a separate patch.

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 13:44           ` Tejun Heo
@ 2018-09-11 14:13             ` jianchao.wang
  0 siblings, 0 replies; 30+ messages in thread
From: jianchao.wang @ 2018-09-11 14:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ming Lei, linux-kernel, Kent Overstreet, linux-block

Hi tejun

On 09/11/2018 09:44 PM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Sep 11, 2018 at 12:40:36PM +0800, jianchao.wang wrote:
>> (long)(sum (every cpu's refcounter)) + atomic refcounter
>>
>> and when switch to atomic mode, there could be value left in atomic refcounter.
>> then the unsigned long percpu refcounter cound be decreased from 0.
> 
> It can always do that.  Imagine a task getting a ref on one cpu,
> migrating and putting it on another.  The only thing which matters is
> that the sum of everything adds upto what it should be.
> 

Thanks for your kind response.
I have got the point now.

Thanks
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11  8:20           ` Ming Lei
@ 2018-09-11 14:22             ` jianchao.wang
  0 siblings, 0 replies; 30+ messages in thread
From: jianchao.wang @ 2018-09-11 14:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

Hi Ming

On 09/11/2018 04:20 PM, Ming Lei wrote:
> We only check the 'whole counting' at atomic mode, so it is the atomic
> part('ref->count'), please see percpu_ref_put_many(), in which ref->release()
> is only called at atomic mode.
> 
> At percpu mode, the ref-count is only decreased/increased on the local CPU.

Yes, I got it.
Thanks for your kind response.

Sincerely
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 13:48     ` Tejun Heo
@ 2018-09-11 15:45       ` Ming Lei
  2018-09-11 15:49         ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11 15:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 06:48:36AM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Tue, Sep 11, 2018 at 08:00:50AM +0800, Ming Lei wrote:
> > > > @@ -357,10 +349,11 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
> > > >  void percpu_ref_reinit(struct percpu_ref *ref)
> > > >  {
> > > >  	unsigned long flags;
> > > > +	unsigned long __percpu *percpu_count;
> > > >  
> > > >  	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
> > > >  
> > > > -	WARN_ON_ONCE(!percpu_ref_is_zero(ref));
> > > > +	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
> > > 
> > > Can you elaborate this part?  This doesn't seem required for the
> > > described change.  Why is it part of the patch?
> > 
> > The motivation of this patch is to avoid the above warning and allow
> > the ref to switch back to percpu mode without dropping to zero.
> > 
> > That is why the check has to be changed to the above way.
> 
> So, this part seems wrong.  The function is called percpu_ref_reinit()
> - the refcnt is expected to be in its initial state with just the base
> ref once this function returns.  If you're removing the restriction on

But the comment says that 'Re-initialize @ref so that it's in the same
state as when it finished', and this invariant isn't changed with this
patch.

> when this can be called, you should also make sure that the function
> actually enforces the target state.  Also, this is a separate logical
> change, please put it in a separate patch.

OK, will do it in V2.

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 15:45       ` Ming Lei
@ 2018-09-11 15:49         ` Tejun Heo
  2018-09-11 16:05           ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-11 15:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 11:45:41PM +0800, Ming Lei wrote:
> > So, this part seems wrong.  The function is called percpu_ref_reinit()
> > - the refcnt is expected to be in its initial state with just the base
> > ref once this function returns.  If you're removing the restriction on
> 
> But the comment says that 'Re-initialize @ref so that it's in the same
> state as when it finished', and this invariant isn't changed with this
> patch.

The comment goes "when percpu_ref_init() finished".  The function is
called re _init_.  It should put the ref in the initial state, right?

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 15:49         ` Tejun Heo
@ 2018-09-11 16:05           ` Ming Lei
  2018-09-11 16:30             ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11 16:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 08:49:59AM -0700, Tejun Heo wrote:
> On Tue, Sep 11, 2018 at 11:45:41PM +0800, Ming Lei wrote:
> > > So, this part seems wrong.  The function is called percpu_ref_reinit()
> > > - the refcnt is expected to be in its initial state with just the base
> > > ref once this function returns.  If you're removing the restriction on
> > 
> > But the comment says that 'Re-initialize @ref so that it's in the same
> > state as when it finished', and this invariant isn't changed with this
> > patch.
> 
> The comment goes "when perpcu_ref_init() finished".  The function is
> called re _init_.  It should put the ref in the initial state, right?

OK, I am fine with keeping the behaviour of this API, and will introduce a new
helper for NVMe.

However, it is doable to switch to percpu mode from atomic mode when the count
hasn't dropped to zero, something like percpu_ref_reinit(inherit_old_refcount).

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 16:05           ` Ming Lei
@ 2018-09-11 16:30             ` Tejun Heo
  2018-09-11 16:34               ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-11 16:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Wed, Sep 12, 2018 at 12:05:33AM +0800, Ming Lei wrote:
> On Tue, Sep 11, 2018 at 08:49:59AM -0700, Tejun Heo wrote:
> > On Tue, Sep 11, 2018 at 11:45:41PM +0800, Ming Lei wrote:
> > > > So, this part seems wrong.  The function is called percpu_ref_reinit()
> > > > - the refcnt is expected to be in its initial state with just the base
> > > > ref once this function returns.  If you're removing the restriction on
> > > 
> > > But the comment says that 'Re-initialize @ref so that it's in the same
> > > state as when it finished', and this invariant isn't changed with this
> > > patch.
> > 
> > The comment goes "when perpcu_ref_init() finished".  The function is
> > called re _init_.  It should put the ref in the initial state, right?
> 
> OK, I am fine to keep the behaviour for this API, will introduce a new helper
> for NVMe.

Why aren't switch_to_atomic/percpu enough?

> However, it is doable to switch to percpu mode from atomic mode when it
> doesn't drop to zero, looks like sort of perpcu_ref_reinit(inherit_old_refcount).

Isn't that way more contorted than just switching operating modes?

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 16:30             ` Tejun Heo
@ 2018-09-11 16:34               ` Ming Lei
  2018-09-11 16:38                 ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-11 16:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 09:30:32AM -0700, Tejun Heo wrote:
> On Wed, Sep 12, 2018 at 12:05:33AM +0800, Ming Lei wrote:
> > On Tue, Sep 11, 2018 at 08:49:59AM -0700, Tejun Heo wrote:
> > > On Tue, Sep 11, 2018 at 11:45:41PM +0800, Ming Lei wrote:
> > > > > So, this part seems wrong.  The function is called percpu_ref_reinit()
> > > > > - the refcnt is expected to be in its initial state with just the base
> > > > > ref once this function returns.  If you're removing the restriction on
> > > > 
> > > > But the comment says that 'Re-initialize @ref so that it's in the same
> > > > state as when it finished', and this invariant isn't changed with this
> > > > patch.
> > > 
> > > The comment goes "when perpcu_ref_init() finished".  The function is
> > > called re _init_.  It should put the ref in the initial state, right?
> > 
> > OK, I am fine to keep the behaviour for this API, will introduce a new helper
> > for NVMe.
> 
> Why aren't switch_to_atomic/percpu enough?

blk-mq's use case is that this _reinit is done on a refcount which was
killed via percpu_ref_kill(), so the DEAD flag has to be cleared.

> 
> > However, it is doable to switch to percpu mode from atomic mode when it
> > doesn't drop to zero, looks like sort of perpcu_ref_reinit(inherit_old_refcount).
> 
> Isn't that way more contorted than just switching operating modes?

As explained above.

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 16:34               ` Ming Lei
@ 2018-09-11 16:38                 ` Tejun Heo
  2018-09-12  1:52                   ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-11 16:38 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

Hello, Ming.

On Wed, Sep 12, 2018 at 12:34:44AM +0800, Ming Lei wrote:
> > Why aren't switch_to_atomic/percpu enough?
> 
> The blk-mq's use case is this _reinit is done on one refcount which was
> killed via percpu_ref_kill(), so the DEAD flag has to be cleared.

If you killed and waited until kill finished, you should be able to
re-init.  Is it that you want to kill but abort killing in some cases?
How do you then handle the race against release?  Can you please
describe the exact usage you have in mind?

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-11 16:38                 ` Tejun Heo
@ 2018-09-12  1:52                   ` Ming Lei
  2018-09-12 15:53                     ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-12  1:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

On Tue, Sep 11, 2018 at 09:38:56AM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Wed, Sep 12, 2018 at 12:34:44AM +0800, Ming Lei wrote:
> > > Why aren't switch_to_atomic/percpu enough?
> > 
> > The blk-mq's use case is this _reinit is done on one refcount which was
> > killed via percpu_ref_kill(), so the DEAD flag has to be cleared.
> 
> If you killed and waited until kill finished, you should be able to
> re-init.  Is it that you want to kill but abort killing in some cases?

Yes, it can be re-inited, just with the warning from WARN_ON_ONCE(!percpu_ref_is_zero(ref)).

> How do you then handle the race against release?  Can you please

.release is only called in atomic mode, and once we switch to
percpu mode, .release can't be called at all. Or maybe I don't follow you;
could you explain the race with release a bit?

> describe the exact usage you have on mind?

Let me explain the use case:

1) an nvme timeout comes

2) all pending requests are canceled, but won't be completed because
they have to be retried after the controller is recovered

3) meantime, the queue has to be frozen to avoid new requests, so
the refcount is killed via percpu_ref_kill().

4) after the queue is recovered (or the controller is reset successfully), it
isn't necessary to wait until the refcount drops to zero, since it is fine to
reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
Waiting for the refcount to drop to zero in the reset handler may trigger an
IO hang if an IO timeout happens again during the reset.


So what I am trying to propose is the following usage:

1) percpu_ref_kill() on .q_usage_counter before recovering the controller, to
prevent new requests from entering the queue

2) the controller is recovered

3) percpu_ref_reinit() on .q_usage_counter, without waiting for
.q_usage_counter to drop to zero; then we needn't wait in the NVMe reset
handler (which can be thought of as a single thread), and we avoid an IO hang
when a new timeout is triggered during the waiting.
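
(A minimal sketch of that flow; the recovery step is a made-up placeholder,
and only the percpu_ref calls on .q_usage_counter are the point here:)

	/* hypothetical reset-handler fragment, q is the namespace request_queue */

	/* 1) stop new requests from entering the queue */
	percpu_ref_kill(&q->q_usage_counter);

	/* 2) cancel/park in-flight requests and recover the controller */
	nvme_recover_controller(ctrl);		/* hypothetical placeholder */

	/* 3) clear DEAD and switch back to percpu mode without waiting
	 *    for the counter to drop to zero (what this patch allows)
	 */
	percpu_ref_reinit(&q->q_usage_counter);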

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-12  1:52                   ` Ming Lei
@ 2018-09-12 15:53                     ` Tejun Heo
  2018-09-12 22:11                       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-12 15:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block

Hello,

On Wed, Sep 12, 2018 at 09:52:48AM +0800, Ming Lei wrote:
> > If you killed and waited until kill finished, you should be able to
> > re-init.  Is it that you want to kill but abort killing in some cases?
> 
> Yes, it can be re-init, just with the warning of WARN_ON_ONCE(!percpu_ref_is_zero(ref)).

We can add another interface but it can't be re _init_.

> > How do you then handle the race against release?  Can you please
> 
> The .release is only called at atomic mode, and once we switch to
> percpu mode, .release can't be called at all. Or I may not follow you,
> could you explain a bit the race with release?

Yeah but what guards ->release() starting to run and then the ref
being switched to percpu mode?  Or maybe that doesn't matter?

> > describe the exact usage you have on mind?
> 
> Let me explain the use case:
> 
> 1) nvme timeout comes
> 
> 2) all pending requests are canceled, but won't be completed because
> they have to be retried after the controller is recovered
> 
> 3) meantime, the queue has to be frozen for avoiding new request, so
> the refcount is killed via percpu_ref_kill().
> 
> 4) after the queue is recovered(or the controller is reset successfully), it
> isn't necessary to wait until the refcount drops zero, since it is fine to
> reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> And waiting for the refcount dropping to zero in the reset handler may trigger
> IO hang if IO timeout happens again during reset.

Does the recovery need the in-flight commands actually drained, or does
it just need to block new issues for a while?  If the latter, why is
percpu_ref even being used?

> So what I am trying to propose is the following usage:
> 
> 1) percpu_ref_kill() on .q_usage_counter before recovering the controller for
> preventing new requests from entering queue

The way you're describing it, the above part is no different from
having a global bool which gates new issues.

> 2) controller is recovered
> 
> 3) percpu_ref_reinit() on .q_usage_counter, and do not wait for
> .q_usage_counter dropping to zero, then we needn't to wait in NVMe reset
> handler which can be thought as single thread, and avoid IO hang when
> new timeout is triggered during the waiting.

This sounds possibly confused to me.  Can you please explain how the
recovery may hang if you wait for the ref to drain?

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-12 15:53                     ` Tejun Heo
@ 2018-09-12 22:11                       ` Ming Lei
  2018-09-18 12:49                         ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-12 22:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block, linux-nvme

On Wed, Sep 12, 2018 at 08:53:21AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 12, 2018 at 09:52:48AM +0800, Ming Lei wrote:
> > > If you killed and waited until kill finished, you should be able to
> > > re-init.  Is it that you want to kill but abort killing in some cases?
> > 
> > Yes, it can be re-init, just with the warning of WARN_ON_ONCE(!percpu_ref_is_zero(ref)).
> 
> We can add another interface but it can't be re _init_.

OK.

> 
> > > How do you then handle the race against release?  Can you please
> > 
> > The .release is only called at atomic mode, and once we switch to
> > percpu mode, .release can't be called at all. Or I may not follow you,
> > could you explain a bit the race with release?
> 
> Yeah but what guards ->release() starting to run and then the ref
> being switched to percpu mode?  Or maybe that doesn't matter?

OK, we may add synchronize_rcu() just after clearing the DEAD flag in
the newly introduced helper to avoid the race.

> 
> > > describe the exact usage you have on mind?
> > 
> > Let me explain the use case:
> > 
> > 1) nvme timeout comes
> > 
> > 2) all pending requests are canceled, but won't be completed because
> > they have to be retried after the controller is recovered
> > 
> > 3) meantime, the queue has to be frozen for avoiding new request, so
> > the refcount is killed via percpu_ref_kill().
> > 
> > 4) after the queue is recovered(or the controller is reset successfully), it
> > isn't necessary to wait until the refcount drops zero, since it is fine to
> > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > And waiting for the refcount dropping to zero in the reset handler may trigger
> > IO hang if IO timeout happens again during reset.
> 
> Does the recovery need the in-flight commands actually drained or does
> it just need to block new issues for a while.  If latter, why is

The recovery doesn't actually need to drain the in-flight commands.

> percpu_ref even being used?

Just to avoid reinventing the wheel, especially since .q_usage_counter
has served this purpose for a long time.

> 
> > So what I am trying to propose is the following usage:
> > 
> > 1) percpu_ref_kill() on .q_usage_counter before recovering the controller for
> > preventing new requests from entering queue
> 
> The way you're describing it, the above part is no different from
> having a global bool which gates new issues.

Right, but the global bool has to be checked in the fast path, and the sync
between updating the flag and checking it has to be considered. Given that
blk-mq has already used .q_usage_counter for this purpose, that is why
I suggest scaling percpu-refcount to cover this use case.

> 
> > 2) controller is recovered
> > 
> > 3) percpu_ref_reinit() on .q_usage_counter, and do not wait for
> > .q_usage_counter dropping to zero, then we needn't to wait in NVMe reset
> > handler which can be thought as single thread, and avoid IO hang when
> > new timeout is triggered during the waiting.
> 
> This sounds possibly confused to me.  Can you please explain how the
> recovery may hang if you wait for the ref to drain?

The reset handler can be thought of as one single dedicated thread; if it hangs
while draining in-flight commands, then it won't be run again to deal with the
next timeout event.


thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-09 12:58 [PATCH] percpu-refcount: relax limit on percpu_ref_reinit() Ming Lei
                   ` (3 preceding siblings ...)
  2018-09-10 16:49 ` Tejun Heo
@ 2018-09-18  3:21 ` jianchao.wang
  2018-09-18  7:34   ` Ming Lei
  4 siblings, 1 reply; 30+ messages in thread
From: jianchao.wang @ 2018-09-18  3:21 UTC (permalink / raw)
  To: Ming Lei, linux-kernel; +Cc: Tejun Heo, Kent Overstreet, linux-block

Hi Ming

Will you post the next version?

Thanks
Jianchao


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-18  3:21 ` jianchao.wang
@ 2018-09-18  7:34   ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2018-09-18  7:34 UTC (permalink / raw)
  To: jianchao.wang; +Cc: linux-kernel, Tejun Heo, Kent Overstreet, linux-block

On Tue, Sep 18, 2018 at 11:21:25AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Will post the next version ?

I haven't seen further comments from Tejun yet, so I suppose he is fine with
the approach; I will do that soon for further review.

Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-12 22:11                       ` Ming Lei
@ 2018-09-18 12:49                         ` Tejun Heo
  2018-09-19  2:51                           ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2018-09-18 12:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-kernel, Jianchao Wang, Kent Overstreet, linux-block, linux-nvme

Hello, Ming.

Sorry about the delay.

On Thu, Sep 13, 2018 at 06:11:40AM +0800, Ming Lei wrote:
> > Yeah but what guards ->release() starting to run and then the ref
> > being switched to percpu mode?  Or maybe that doesn't matter?
> 
> OK, we may add synchronize_rcu() just after clearing the DEAD flag in
> the new introduced helper to avoid the race.

That doesn't make sense to me.  How is synchronize_rcu() gonna change
anything there?

> > > 4) after the queue is recovered(or the controller is reset successfully), it
> > > isn't necessary to wait until the refcount drops zero, since it is fine to
> > > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > > And waiting for the refcount dropping to zero in the reset handler may trigger
> > > IO hang if IO timeout happens again during reset.
> > 
> > Does the recovery need the in-flight commands actually drained or does
> > it just need to block new issues for a while.  If latter, why is
> 
> The recovery needn't to drain the in-flight commands actually.

Is it just waiting till confirm_kill is called?  So that new ref is
not given away?  If synchronization like that is gonna work, the
percpu ref operations on the reader side must be wrapped in a larger
critical region, which brings up two issues.

1. Callers of percpu_ref must not depend on what internal
   synchronization construct percpu_ref uses.  Again, percpu_ref
   doesn't even use regular RCU.

2. If there is already an outer RCU protection around ref operation,
   that RCU critical section can and should be used for
   synchronization, not percpu_ref.

> > percpu_ref even being used?
> 
> Just for avoiding to invent a new wheel, especially .q_usage_counter
> has served for this purpose for long time.

It sounds like this was more of an abuse.  So, basically what you want
is something like the following.

READER

 rcu_read_lock();
 if (can_issue_new_commands)
	issue;
 else
	abort;
 rcu_read_unlock();

WRITER

 can_issue_new_commands = false;
 synchronize_rcu();
 // no new command will be issued anymore

Right?  There isn't much wheel to reinvent here and using percpu_ref
for the above is likely already incorrect due to the different RCU
type being used.
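
(A slightly more concrete C rendering of that gate; the flag and helpers here
are made up for illustration:)

	static bool issue_allowed = true;	/* hypothetical per-queue flag */

	/* reader / issue path */
	rcu_read_lock();
	if (READ_ONCE(issue_allowed))
		issue_request();		/* hypothetical */
	else
		abort_request();		/* hypothetical */
	rcu_read_unlock();

	/* writer / recovery path */
	WRITE_ONCE(issue_allowed, false);
	synchronize_rcu();
	/* from here on, no new request can be issued */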

> > > So what I am trying to propose is the following usage:
> > > 
> > > 1) percpu_ref_kill() on .q_usage_counter before recovering the controller for
> > > preventing new requests from entering queue
> > 
> > The way you're describing it, the above part is no different from
> > having a global bool which gates new issues.
> 
> Right, but the global bool has to be checked in fast path, and the sync

That likely bool test isn't gonna cost anything.

> between updating the flag and checking it has to be considered. Given
> blk-mq has already used .q_usage_counter for this purpose, that is why
> I suggest to scale percpu-refcount to cover this use case.

And the synchronization part should always be considered and is
already likely wrong.

Thanks.

-- 
tejun


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-18 12:49                         ` Tejun Heo
@ 2018-09-19  2:51                           ` Ming Lei
  2018-09-19 20:36                             ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2018-09-19  2:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-block, linux-nvme, linux-kernel, Jianchao Wang, Kent Overstreet

Hi Tejun,

On Tue, Sep 18, 2018 at 05:49:09AM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> Sorry about the delay.
> 
> On Thu, Sep 13, 2018 at 06:11:40AM +0800, Ming Lei wrote:
> > > Yeah but what guards ->release() starting to run and then the ref
> > > being switched to percpu mode?  Or maybe that doesn't matter?
> > 
> > OK, we may add synchronize_rcu() just after clearing the DEAD flag in
> > the new introduced helper to avoid the race.
> 
> That doesn't make sense to me.  How is synchronize_rcu() gonna change
> anything there?

As you saw in the new post, synchronize_rcu() isn't used to avoid
the race. Instead, it is done by grabbing one extra ref on the atomic part.

> 
> > > > 4) after the queue is recovered(or the controller is reset successfully), it
> > > > isn't necessary to wait until the refcount drops zero, since it is fine to
> > > > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > > > And waiting for the refcount dropping to zero in the reset handler may trigger
> > > > IO hang if IO timeout happens again during reset.
> > > 
> > > Does the recovery need the in-flight commands actually drained or does
> > > it just need to block new issues for a while.  If latter, why is
> > 
> > The recovery needn't to drain the in-flight commands actually.
> 
> Is it just waiting till confirm_kill is called?  So that new ref is
> not given away?  If synchronization like that is gonna work, the
> percpu ref operations on the reader side must be wrapped in a larger
> critical region, which brings up two issues.
> 
> 1. Callers of percpu_ref must not depend on what internal
>    synchronization construct percpu_ref uses.  Again, percpu_ref
>    doesn't even use regular RCU.
> 
> 2. If there is already an outer RCU protection around ref operation,
>    that RCU critical section can and should be used for
>    synchronization, not percpu_ref.

I guess the above doesn't apply any more because there isn't a new
synchronize_rcu() introduced in my new post.

> 
> > > percpu_ref even being used?
> > 
> > Just for avoiding to invent a new wheel, especially .q_usage_counter
> > has served for this purpose for long time.
> 
> It sounds like this was more of an abuse.  So, basically what you want
> is sth like the following.
> 
> READER
> 
>  rcu_read_lock();
>  if (can_issue_new_commands)
> 	issue;
>  else
> 	abort;
>  rcu_read_unlock();
> 
> WRITER
> 
>  can_issue_new_commands = false;
>  synchronize_rcu();
>  // no new command will be issued anymore
> 
> Right?  There isn't much wheel to reinvent here and using percpu_ref
> for the above is likely already incorrect due to the different RCU
> type being used.

No RCU story any more, :-)

It might work, but it is still a reinvented wheel since percpu-refcount
provides the same function. Not to mention that the interaction between the
two mechanisms may have to be considered.

Also, there is still a cost introduced on the WRITER side, and
synchronize_rcu() often takes a while, especially since there might be lots
of namespaces, each needing to run one synchronize_rcu(). We have learned
lessons in converting scsi to blk-mq, in which synchronize_rcu()
introduced long delays in booting.


Thanks,
Ming


* Re: [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
  2018-09-19  2:51                           ` Ming Lei
@ 2018-09-19 20:36                             ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2018-09-19 20:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, linux-nvme, linux-kernel, Jianchao Wang, Kent Overstreet

Hello, Ming.

On Wed, Sep 19, 2018 at 10:51:49AM +0800, Ming Lei wrote:
> > That doesn't make sense to me.  How is synchronize_rcu() gonna change
> > anything there?
> 
> As you saw in the new post, synchronize_rcu() isn't used for avoiding
> the race. Instead, it is done by grabbing one extra ref on atomic part.

This is a layering violation.  It just isn't a good idea to depend on
percpu_ref internal implementation details like this.

> > 1. Callers of percpu_ref must not depend on what internal
> >    synchronization construct percpu_ref uses.  Again, percpu_ref
> >    doesn't even use regular RCU.
> > 
> > 2. If there is already an outer RCU protection around ref operation,
> >    that RCU critical section can and should be used for
> >    synchronization, not percpu_ref.
> 
> I guess the above doesn't apply any more because there isn't new 
> synchronize_rcu() introduced in my new post.

It still does.  The problem is that what you're doing creates
dependencies on percpu_ref's implementation details - how it
guarantees the mode transition visibility using what sort of
synchronization construct.

> > Right?  There isn't much wheel to reinvent here and using percpu_ref
> > for the above is likely already incorrect due to the different RCU
> > type being used.
> 
> No RCU story any more, :-)
> 
> It might work, but still a reinvented wheel since perpcu-refcount does
> provide same function. Not mention the inter-action between the two
> mechanism may have to be considered.

Why would the two independent mechanisms interact with each other?
What's problematic is entangling two mechanisms in an implementation
dependent way.

> Also there is still cost introduced in WRITER side, and the
> synchronize_rcu() often takes a bit long, especially there might be lots
> of namespaces, each need to run one synchronize_rcu(). We have learned
> lessons in converting to blk-mq for scsi, in which synchronize_rcu()
> introduces long delay in booting.

You're already paying that latency.  It's not like percpu_ref can make
it happen magically without paying the same cost.  You also can easily
overlay the two grace periods as the percpu_ref part can be
asynchronous (if you still care about it).  But, from what I've read
till now, it doesn't even look like you'd need to do anything with
percpu_ref if all you need to do is shut down the issuing of new
commands.

Thanks.

-- 
tejun

