Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

From: Ming Lei <ming.lei@redhat.com>
To: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	joseph.qi@linux.alibaba.com, ming.lei@redhat.com
Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support
Date: Mon, 29 Aug 2022 10:08:11 +0800	[thread overview]
Message-ID: <Ywwfi7Dgi0JC2kQ/@T590> (raw)
In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com>

On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
> ublk_drv is a driver simply passes all blk-mq rqs to ublksrv[1] in
> userspace. For each ublk queue, there is one ubq_daemon(pthread).
> All ubq_daemons share the same process which opens /dev/ublkcX.
> The ubq_daemon code infinitely loops on io_uring_enter() to
> send/receive io_uring cmds which pass information of blk-mq
> rqs.
> 
> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
> must abort the dying ubq, stop the device and release everything.
> This is not a good choice in practice because users do not expect
> aborted requests, I/O errors and a released device. They may want
> a recovery machenism so that no requests are aborted and no I/O
> error occurs. Anyway, users just want everything works as uaual.

I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
won't be deleted & re-added from user viewpoint after user recovery,
so the device context won't be lost.

> 
> This RFC patchset implements USER_RECOVERY support. If the process
> crashes, we allow ublksrv to provide new process and ubq_daemons.
> We do not support single ubq_daemon(pthread) recovery because a
> pthread rarely crashes.
> 
> Recovery feature is quite useful for products do not expect to
> return any I/O error to frontend users.

That looks one very ideal requirement. To be honest, no any block driver
can guarantee that 100%, so it is just one soft requirement?

Cause memory allocation may fail, network may be disconnected,
re-creating pthread or process may fail too, ...

> In detail, we support
> this scenario:
> (1) The /dev/ublkc0 is opened by process 0;
> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
>     rqs are handled by process 0.
> (3) Process 0 suddenly crashes(e.g. segfault);
> (4) Fio is still running and submit IOs(but these IOs cannot
>     complete now)
> (5) User recovers with process 1 and attach it to /dev/ublkc0
> (6) All rqs are handled by process 1 now and IOs can be
>     completed now.
> 
> Note: The backend must tolerate double-write because we re-issue
> a rq sent to the old(dying) process before. We allow users to
> choose whether re-issue these rqs or not, please see patch 7 for
> more detail.
> 
> We provide a sample script here to simulate the above steps:
> 
> ***************************script***************************
> LOOPS=10
> 
> __ublk_get_pid() {
> 	pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
> 	echo $pid
> }
> 
> ublk_recover_kill()
> {
> 	for CNT in `seq $LOOPS`; do
> 		dmesg -C
>                 pid=`__ublk_get_pid`
>                 echo -e "*** kill $pid now ***"
> 		kill -9 $pid
> 		sleep 2
>                 echo -e "*** recover now ***"
>                 ./ublk recover -n 0

The current behavior is that /dev/ublkb* is removed after device is
aborted because ubq daemon is killed.

What if 'ublk recover' command isn't sent? So the current behavior
without recovery is changed? Or just changed with this feature enabled?

BTW, I do not mean the change isn't reasonable, but suggest to document
the user visible change, so it can get reviewed from either user
viewpoint and technical point.

> 		sleep 4
> 	done
> }
> 
> ublk_test()
> {
>         dmesg -C
>         echo -e "*** add ublk device ***"
>         ./ublk add -t null -d 4 -i 1
>         sleep 2
>         echo -e "*** start fio ***"
>         fio --bs=4k \
>             --filename=/dev/ublkb0 \
>             --runtime=100s \
>             --rw=read &
>         sleep 4
>         ublk_recover_kill
>         wait
>         echo -e "*** delete ublk device ***"
>         ./ublk del -n 0
> }
> 
> for CNT in `seq 4`; do
>         modprobe -rv ublk_drv
>         modprobe ublk_drv
>         echo -e "************ round $CNT ************"
>         ublk_test
>         sleep 5
> done
> ***************************script***************************
> 
> You may run it with our modified ublksrv[3] which supports
> recovey feature. No I/O error occurs and you can verify it
> by typing
>     $ perf-tools/bin/tpoint block:block_rq_error
> 
> The basic idea of USER_RECOVERY is quite straightfoward:
> 
> (1) release/free everything belongs to the dying process.
> 
>     Note: Since ublk_drv does save information about user process,
>     this work is important because we don't expect any resource
>     lekage. Particularly, ioucmds from the dying ubq_daemons
>     need to be completed(freed). Current ublk_drv code cannot satisfy
>     our need while considering USER_RECOVERY. So we refactor some code
>     shown in patch 1-5 to gracefully free all ioucmds.
> 
> (2) init ublk queues including requeuing/aborting rqs.
> 
> (3) allow new ubq_daemons issue FETCH_REQ.
> 
> Here is steps to reocver:
> 
> (1) For a user, after a process crash(how he detect a crash is not related
>     to this patchset), he sends START_USER_RECOVERY ctrl-cmd to

I'd suggest to describe crash detector a bit at least, as one whole use case,
crash detector should be the input of the use case of user recovery, which is
usually one part of use case when modeling software requirement/design.

Such as, crash is detected after the ubq daemon pthread/process is crashed?
Will you consider io hang in the daemon pthread/process? IMO, long-term,
the crash detector utility should be part of ublksrv.

We don't implement ublk driver's IO timeout yet, but that implementation may be
related with this recovery feature closely, such as, one simple approach is to
kill ubq-daemon if we can't move on after retrying several times, then
let userspace detect & recovery.

Thanks,
Ming