All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rusty Russell <rusty@rustcorp.com.au>
To: Kent Overstreet <koverstreet@google.com>, Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-aio@kvack.org, akpm@linux-foundation.org,
	Zach Brown <zab@redhat.com>, Felipe Balbi <balbi@ti.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Mark Fasheh <mfasheh@suse.com>, Joel Becker <jlbec@evilplan.org>,
	Jens Axboe <axboe@kernel.dk>,
	Asai Thambi S P <asamymuthupa@micron.com>,
	Selvan Mani <smani@micron.com>,
	Sam Bradshaw <sbradshaw@micron.com>,
	Jeff Moyer <jmoyer@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Benjamin LaHaise <bcrl@kvack.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH 04/21] Generic percpu refcounting
Date: Wed, 29 May 2013 14:29:56 +0930	[thread overview]
Message-ID: <87hahmmldf.fsf@rustcorp.com.au> (raw)
In-Reply-To: <20130528234728.GB2291@google.com>

Kent Overstreet <koverstreet@google.com> writes:
> On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
>> Can you please expand it on a bit and, more importantly, describe in
>> what limits, it's safe?  This should be safe as long as the actual sum
>> of refcnts given out doesn't overflow the original type, right? 
>
> Precisely.
>
>> It'd be great if that is explained clearly in more intuitive way.  The
>> only actual explanation above is "modular arithmatic is commutative"
>> which is a very compact way to put it and I really think it deserves
>> an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
>  * (More precisely: because moduler arithmatic is commutative the sum of all the
>  * pcpu_count vars will be equal to what it would have been if all the gets and
>  * puts were done to a single integer, even if some of the percpu integers
>  * overflow or underflow).

This seems intuitively obvious, so I wouldn't sweat it too much.  What
goes up, has to come down somewhere.

>> > > Are we sure this is enough?  1<<31 is a fairly large number but it's
>> > > just easy enough to breach from time to time and it's gonna be hellish
>> > > to reproduce / debug when it actually overflows.  Maybe we want
>> > > atomic64_t w/ 1LLU << 63 bias?  Or is there something else which
>> > > guarantees that the bias can't over/underflow?
>> > 
>> > Well, it has the effect of halving the usable range of the refcount,
>> > which I think is probably ok - the thing is, the range of an atomic_t
>> > doesn't really correspond to anything useful on 64 bit machines so if
>> > you're concerned about overflow you probably need to be using an
>> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
>> > too.
>> 
>> I'm not worrying about the total refcnt overflowing 31 bits, that's
>> fine.  What I'm worried about is the percpu refs having systmetic
>> drift (got on certain cpus and put on others), and the total counter
>> being overflowed while percpu draining is in progress.  To me, the
>> problem is that the bias which tags that draining in progress can be
>> overflown by percpu refs.  The summing can be the same but the tagging
>> should be put where summing can't overflow it.  It'd be great if you
>> can explain in the comment in what range it's safe and why, because
>> that'd make the limits clear to both you and other people reading the
>> code and would help a lot in deciding whether it's safe enough.
>
> (This is why I initially didn't (don't) like the bias method, it makes
> things harder to reason about).
>
> The fact that the counter is percpu is irrelevant w.r.t. the bias; we
> sum all the percpu counters up before adding them to the atomic counter
> and subtracting the bias, so when we go to add the percpu counters it's
> no different from if the percpu counter was a single integer all along.
>
> So there's only two counters we're adding together; there's the percpu
> counter (just think of it as a single integer) that we started out
> using, but then at some point in time we start applying the gets and
> puts to the atomic counter.
>
> Note that there's no systemic drift here; at time t all the gets and
> puts were happening to one counter, and then at time t+1 they switch to
> a different counter.
>
> We know the sum of the counters will be positive (again, because modular
> arithmatic is still commutative; when we sum the counters it's as if
> there was a single counter all along) but that doesn't mean either of
> the individual counters can't be negative.
>
> (Actually, unless I'm mistaken in this version the percpu counter can
> never go negative - it definitely could with dynamic percpu allocation,
> as you need a atomic_t -> percpu transition when the atomic_t was > 0
> for the percpu counter to go negative; but in this version we start out
> using the percpu counters and the atomic_t 0 (ignoring for the moment
> the bias and the initial ref).
>
> So, the sum must be positive but the atomic_t could be negative. How
> negative?
>
> We can't do a get() to the percpu counters after we've seen that the ref
> is no longer in percpu mode - so after we've done one put to the
> atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
> we can't do a get to the percpu counter.
>
> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.
>
> So if we say arbitrarily that the maximum legal value of the refcount is
> - say - 1U << 31, then the atomic_t will always be greater than
> -((int) (1U << 31)).
>
> So as long as the total number of outstanding refs never exceeds the
> bias we're fine.

Yes.  We should note the 31 bit limit somewhere.  We could WARN_ON() if
count is >= BIAS in percpu_ref_kill(), perhaps.

>> I probably should have made it clearer.  Sorry about that.  tryget()
>> is fine.  I was curious about count() as it's always a bit dangerous a
>> query interface which is racy and can return something unexpected like
>> false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Open code it there?

>> Let's just have percpu_ref_kill(ref, release) which puts the base ref
>> and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Or hand it to percpu_ref_put(), too, as per kref_put().  I hate indirect
magic.

Cheers,
Rusty.

WARNING: multiple messages have this Message-ID (diff)
From: Rusty Russell <rusty@rustcorp.com.au>
To: Kent Overstreet <koverstreet@google.com>, Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-aio@kvack.org, akpm@linux-foundation.org,
	Zach Brown <zab@redhat.com>, Felipe Balbi <balbi@ti.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Mark Fasheh <mfasheh@suse.com>, Joel Becker <jlbec@evilplan.org>,
	Jens Axboe <axboe@kernel.dk>,
	Asai Thambi S P <asamymuthupa@micron.com>,
	Selvan Mani <smani@micron.com>,
	Sam Bradshaw <sbradshaw@micron.com>,
	Jeff Moyer <jmoyer@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Benjamin LaHaise <bcrl@kvack.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH 04/21] Generic percpu refcounting
Date: Wed, 29 May 2013 14:29:56 +0930	[thread overview]
Message-ID: <87hahmmldf.fsf@rustcorp.com.au> (raw)
In-Reply-To: <20130528234728.GB2291@google.com>

Kent Overstreet <koverstreet@google.com> writes:
> On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
>> Can you please expand it on a bit and, more importantly, describe in
>> what limits, it's safe?  This should be safe as long as the actual sum
>> of refcnts given out doesn't overflow the original type, right? 
>
> Precisely.
>
>> It'd be great if that is explained clearly in more intuitive way.  The
>> only actual explanation above is "modular arithmatic is commutative"
>> which is a very compact way to put it and I really think it deserves
>> an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
>  * (More precisely: because moduler arithmatic is commutative the sum of all the
>  * pcpu_count vars will be equal to what it would have been if all the gets and
>  * puts were done to a single integer, even if some of the percpu integers
>  * overflow or underflow).

This seems intuitively obvious, so I wouldn't sweat it too much.  What
goes up, has to come down somewhere.

>> > > Are we sure this is enough?  1<<31 is a fairly large number but it's
>> > > just easy enough to breach from time to time and it's gonna be hellish
>> > > to reproduce / debug when it actually overflows.  Maybe we want
>> > > atomic64_t w/ 1LLU << 63 bias?  Or is there something else which
>> > > guarantees that the bias can't over/underflow?
>> > 
>> > Well, it has the effect of halving the usable range of the refcount,
>> > which I think is probably ok - the thing is, the range of an atomic_t
>> > doesn't really correspond to anything useful on 64 bit machines so if
>> > you're concerned about overflow you probably need to be using an
>> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
>> > too.
>> 
>> I'm not worrying about the total refcnt overflowing 31 bits, that's
>> fine.  What I'm worried about is the percpu refs having systmetic
>> drift (got on certain cpus and put on others), and the total counter
>> being overflowed while percpu draining is in progress.  To me, the
>> problem is that the bias which tags that draining in progress can be
>> overflown by percpu refs.  The summing can be the same but the tagging
>> should be put where summing can't overflow it.  It'd be great if you
>> can explain in the comment in what range it's safe and why, because
>> that'd make the limits clear to both you and other people reading the
>> code and would help a lot in deciding whether it's safe enough.
>
> (This is why I initially didn't (don't) like the bias method, it makes
> things harder to reason about).
>
> The fact that the counter is percpu is irrelevant w.r.t. the bias; we
> sum all the percpu counters up before adding them to the atomic counter
> and subtracting the bias, so when we go to add the percpu counters it's
> no different from if the percpu counter was a single integer all along.
>
> So there's only two counters we're adding together; there's the percpu
> counter (just think of it as a single integer) that we started out
> using, but then at some point in time we start applying the gets and
> puts to the atomic counter.
>
> Note that there's no systemic drift here; at time t all the gets and
> puts were happening to one counter, and then at time t+1 they switch to
> a different counter.
>
> We know the sum of the counters will be positive (again, because modular
> arithmatic is still commutative; when we sum the counters it's as if
> there was a single counter all along) but that doesn't mean either of
> the individual counters can't be negative.
>
> (Actually, unless I'm mistaken in this version the percpu counter can
> never go negative - it definitely could with dynamic percpu allocation,
> as you need a atomic_t -> percpu transition when the atomic_t was > 0
> for the percpu counter to go negative; but in this version we start out
> using the percpu counters and the atomic_t 0 (ignoring for the moment
> the bias and the initial ref).
>
> So, the sum must be positive but the atomic_t could be negative. How
> negative?
>
> We can't do a get() to the percpu counters after we've seen that the ref
> is no longer in percpu mode - so after we've done one put to the
> atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
> we can't do a get to the percpu counter.
>
> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.
>
> So if we say arbitrarily that the maximum legal value of the refcount is
> - say - 1U << 31, then the atomic_t will always be greater than
> -((int) (1U << 31)).
>
> So as long as the total number of outstanding refs never exceeds the
> bias we're fine.

Yes.  We should note the 31 bit limit somewhere.  We could WARN_ON() if
count is >= BIAS in percpu_ref_kill(), perhaps.

>> I probably should have made it clearer.  Sorry about that.  tryget()
>> is fine.  I was curious about count() as it's always a bit dangerous a
>> query interface which is racy and can return something unexpected like
>> false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Open code it there?

>> Let's just have percpu_ref_kill(ref, release) which puts the base ref
>> and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Or hand it to percpu_ref_put(), too, as per kref_put().  I hate indirect
magic.

Cheers,
Rusty.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

  parent reply	other threads:[~2013-05-29  5:07 UTC|newest]

Thread overview: 104+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-14  1:18 AIO refactoring/performance improvements/cancellation Kent Overstreet
2013-05-14  1:18 ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 01/21] aio: fix kioctx not being freed after cancellation at exit time Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 02/21] aio: reqs_active -> reqs_available Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 03/21] aio: percpu reqs_available Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 04/21] Generic percpu refcounting Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14 13:51   ` Oleg Nesterov
2013-05-14 13:51     ` Oleg Nesterov
2013-05-15  8:21     ` Kent Overstreet
2013-05-15  8:21       ` Kent Overstreet
2013-05-14 14:59   ` Tejun Heo
2013-05-14 14:59     ` Tejun Heo
2013-05-14 15:28     ` Oleg Nesterov
2013-05-14 15:28       ` Oleg Nesterov
2013-05-15  9:00       ` Kent Overstreet
2013-05-15  9:00         ` Kent Overstreet
2013-05-15  8:58     ` Kent Overstreet
2013-05-15  8:58       ` Kent Overstreet
2013-05-15 17:37       ` Tejun Heo
2013-05-15 17:37         ` Tejun Heo
2013-05-28 23:47         ` Kent Overstreet
2013-05-28 23:47           ` Kent Overstreet
2013-05-29  1:11           ` Tejun Heo
2013-05-29  1:11             ` Tejun Heo
2013-05-29  4:59           ` Rusty Russell [this message]
2013-05-29  4:59             ` Rusty Russell
2013-05-31 20:12             ` Kent Overstreet
2013-05-31 20:12               ` Kent Overstreet
2013-05-14 21:59   ` Tejun Heo
2013-05-14 21:59     ` Tejun Heo
2013-05-14 22:15     ` Tejun Heo
2013-05-14 22:15       ` Tejun Heo
2013-05-15  9:07     ` Kent Overstreet
2013-05-15  9:07       ` Kent Overstreet
2013-05-15 17:56       ` Tejun Heo
2013-05-15 17:56         ` Tejun Heo
2013-05-16  0:26   ` Rusty Russell
2013-05-16  0:26     ` Rusty Russell
2013-05-14  1:18 ` [PATCH 05/21] aio: percpu ioctx refcount Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 06/21] aio: io_cancel() no longer returns the io_event Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 07/21] aio: Don't use ctx->tail unnecessarily Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 08/21] aio: Kill aio_rw_vect_retry() Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 09/21] aio: Kill unneeded kiocb members Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 10/21] aio: Kill ki_users Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 11/21] aio: Kill ki_dtor Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 12/21] aio: convert the ioctx list to radix tree Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 13/21] block: prep work for batch completion Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 14/21] block, aio: batch completion for bios/kiocbs Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 15/21] virtio-blk: convert to batch completion Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 16/21] mtip32xx: " Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 17/21] Percpu tag allocator Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14 13:48   ` Oleg Nesterov
2013-05-14 13:48     ` Oleg Nesterov
2013-05-14 14:24     ` Oleg Nesterov
2013-05-14 14:24       ` Oleg Nesterov
2013-05-15  9:34       ` Kent Overstreet
2013-05-15  9:34         ` Kent Overstreet
2013-05-15  9:25     ` Kent Overstreet
2013-05-15  9:25       ` Kent Overstreet
2013-05-15 15:41       ` Oleg Nesterov
2013-05-15 15:41         ` Oleg Nesterov
2013-05-15 16:10         ` Oleg Nesterov
2013-05-15 16:10           ` Oleg Nesterov
2013-06-10 23:20         ` Kent Overstreet
2013-06-10 23:20           ` Kent Overstreet
2013-06-11 17:42           ` Oleg Nesterov
2013-06-11 17:42             ` Oleg Nesterov
2013-05-14 15:03   ` Tejun Heo
2013-05-14 15:03     ` Tejun Heo
2013-05-15 20:19   ` Andi Kleen
2013-05-15 20:19     ` Andi Kleen
2013-05-14  1:18 ` [PATCH 18/21] aio: Allow cancellation without a cancel callback, new kiocb lookup Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 19/21] aio/usb: Update cancellation for new synchonization Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 20/21] direct-io: Set dio->io_error directly Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-14  1:18 ` [PATCH 21/21] block: Bio cancellation Kent Overstreet
2013-05-14  1:18   ` Kent Overstreet
2013-05-15 17:52   ` Jens Axboe
2013-05-15 17:52     ` Jens Axboe
2013-05-15 19:29     ` Kent Overstreet
2013-05-15 19:29       ` Kent Overstreet
2013-05-15 20:01       ` Jens Axboe
2013-05-15 20:01         ` Jens Axboe
2013-05-31 22:52         ` Kent Overstreet
2013-05-31 22:52           ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87hahmmldf.fsf@rustcorp.com.au \
    --to=rusty@rustcorp.com.au \
    --cc=akpm@linux-foundation.org \
    --cc=asamymuthupa@micron.com \
    --cc=axboe@kernel.dk \
    --cc=balbi@ti.com \
    --cc=bcrl@kvack.org \
    --cc=cl@linux-foundation.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=jlbec@evilplan.org \
    --cc=jmoyer@redhat.com \
    --cc=koverstreet@google.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mfasheh@suse.com \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=sbradshaw@micron.com \
    --cc=smani@micron.com \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=zab@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.