Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface

From: Sagi Grimberg <sagi@grimberg.me>
To: paulmck@kernel.org
Cc: Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Jens Axboe <axboe@kernel.dk>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	Chao Leng <lengchao@huawei.com>, Keith Busch <kbusch@kernel.org>,
	Ming Lin <mlin@kernel.org>
Subject: Re: [PATCH v5 1/2] blk-mq: add tagset quiesce interface
Date: Tue, 28 Jul 2020 16:46:23 -0700	[thread overview]
Message-ID: <d1ba2009-130a-d423-1389-c7af72e25a6a@grimberg.me> (raw)
In-Reply-To: <20200728135436.GP9247@paulmck-ThinkPad-P72>

Hey Paul,

> Indeed you cannot.  And if you build with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> it will yell at you when you try.
> 
> You -can- pass on-stack rcu_head structures to call_srcu(), though,
> if that helps.  You of course must have some way of waiting for the
> callback to be invoked before exiting that function.  This should be
> easy for me to package into an API, maybe using one of the existing
> reference-counting APIs.
> 
> So, do you have a separate stack frame for each of the desired call_srcu()
> invocations?  If not, do you know at build time how many rcu_head
> structures you need?  If the answer to both of these is "no", then
> it is likely that there needs to be an rcu_head in each of the relevant
> data structures, as was noted earlier in this thread.
> 
> Yeah, I should go read the code.  But I would need to know where it is
> and it is still early in the morning over here!  ;-)
> 
> I probably should also have read the remainder of the thread before
> replying, as well.  But what is the fun in that?

The use-case is to quiesce submissions to queues. This flow is where we
want to teardown stuff, and we can potentially have 1000's of queues
that we need to quiesce each one.

each queue (hctx) has either rcu or srcu depending if it may sleep
during submission.

The goal is that the overall quiesce should be fast, so we want
to wait for all of these queues elapsed period ~once, in parallel,
instead of synchronizing each serially as done today.

The guys here are resisting to add a rcu_synchronize to each and
every hctx because it will take 32 bytes more or less from 1000's
of hctxs.

Dynamically allocating each one is possible but not very scalable.

The question is if there is some way, we can do this with on-stack
or a single on-heap rcu_head or equivalent that can achieve the same
effect.