All of lore.kernel.org
 help / color / mirror / Atom feed
From: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
To: ming.lei@redhat.com
Cc: axboe@kernel.dk, xiaoguang.wang@linux.alibaba.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	joseph.qi@linux.alibaba.com,
	ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Subject: [PATCH V4 0/8] ublk_drv: add USER_RECOVERY support
Date: Wed, 21 Sep 2022 17:58:41 +0800	[thread overview]
Message-ID: <20220921095849.84988-1-ZiyangZhang@linux.alibaba.com> (raw)

ublk_drv is a driver simply passes all blk-mq rqs to userspace
target(such as ublksrv[1]). For each ublk queue, there is one
ubq_daemon(pthread). All ubq_daemons share the same process
which opens /dev/ublkcX. The ubq_daemon code infinitely loops on
io_uring_enter() to send/receive io_uring cmds which pass
information of blk-mq rqs.

Since the real IO handler(the process/thread opening /dev/ublkcX) is
in userspace, it could crash if:
(1) the user kills -9 it because of IO hang on backend, system
    reboot, etc...
(2) the process/thread catches a exception(segfault, divisor error,
oom...) Therefore, the kernel driver has to deal with a dying
ubq_daemon or the process.

Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
must abort the dying ubq, stop the device and release everything.
This is not a good choice in practice because users do not expect
aborted requests, I/O errors and a released device. They may want
a recovery machenism so that no requests are aborted and no I/O
error occurs. Anyway, users just want everything works as usual.

This patchset implements USER_RECOVERY support. If the process
or any ubq_daemon(pthread) crashes(exits accidentally), we allow
user to provide new process and ubq_daemons.

Note: The responsibility of recovery belongs to the user who opens
/dev/ublkcX. After a crash, the kernel driver only switch the
device's state to be ready for recovery(START_USER_RECOVERY) or
termination(STOP_DEV). The state is defined as UBLK_S_DEV_QUIESCED.
This patchset does not provide how to detect such a crash in userspace.
The user has may ways to do so. For example, user may:
(1) send GET_DEV_INFO on specific dev_id and check if its state is
    UBLK_S_DEV_QUIESCED.
(2) 'ps' on ublksrv_pid.

Recovery feature is quite useful for real products. In detail,
we support this scenario:
(1) The /dev/ublkc0 is opened by process 0.
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
    rqs are handled by process 0.
(3) Process 0 suddenly crashes(e.g. segfault);
(4) Fio is still running and submit IOs(but these IOs cannot
    be dispatched now)
(5) User starts process 1 and attach it to /dev/ublkc0
(6) All rqs are handled by process 1 now and IOs can be
    completed now.

Note: The backend must tolerate double-write because we re-issue
a rq sent to the old process 0 before.

We provide a sample script here to simulate the above steps:

***************************script***************************
LOOPS=10

__ublk_get_pid() {
	pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
	echo $pid
}

ublk_recover_kill()
{
	for CNT in `seq $LOOPS`; do
		dmesg -C
                pid=`__ublk_get_pid`
                echo -e "*** kill $pid now ***"
		kill -9 $pid
		sleep 6
                echo -e "*** recover now ***"
                ./ublk recover -n 0
		sleep 6
	done
}

ublk_test()
{
        echo -e "*** add ublk device ***"
        ./ublk add -t null -d 4 -i 1
        sleep 2
        echo -e "*** start fio ***"
        fio --bs=4k \
            --filename=/dev/ublkb0 \
            --runtime=140s \
            --rw=read &
        sleep 4
        ublk_recover_kill
        wait
        echo -e "*** delete ublk device ***"
        ./ublk del -n 0
}

for CNT in `seq 4`; do
        modprobe -rv ublk_drv
        modprobe ublk_drv
        echo -e "************ round $CNT ************"
        ublk_test
        sleep 5
done
***************************script***************************

You may run it with our modified ublksrv[2] which supports
recovery feature. No I/O error occurs and you can verify it
by typing
    $ perf-tools/bin/tpoint block:block_rq_error

The basic idea of USER_RECOVERY is quite straightfoward:
(1) quiesce ublk queues and requeue/abort rqs.
(2) release/free everything belongs to the dying process.
    Note: Since ublk_drv does save information about user process,
    this work is important because we don't expect any resource
    lekage. Particularly, ioucmds from the dying ubq_daemons
    need to be completed(freed).
(3) allow new ubq_daemons issue FETCH_REQ.
    Note: ublk_ch_uring_cmd() checks some states and flags. We
    have to set them to a correct value.

Here is steps to reocver:
(0) requests dispatched after the corresponding ubq_daemon is dying 
    are requeued.
(1) monitor_work finds one dying ubq_daemon, and it should
    schedule quiesce_work and requeue/abort requests issued to
    userspace before the ubq_daemon is dying.
(2) quiesce_work must (a)quiesce request queue to ban any incoming
    ublk_queue_rq(), (b)wait unitl all rqs are IDLE, (c)complete old
	  ioucmds. Then the ublk device is ready for recovery or stop.
(3) Since io_uring resources are released, ublk_ch_release() is called
    and all ublk_ios are reset to be ready for a new process.
(4) Then, user should start a new process and ubq_daemons(pthreads) and
    send FETCH_REQ by io_uring_enter() to make all ubqs be ready. The
    user must correctly setup queues, flags and so on(how to persist
    user's information is not related to this patchset).
(6) The user sends RESTART_DEV ctrl-cmd to /dev/ublk-control with a
    dev_id X.
(7) After receiving RESTART_DEV, ublk_drv waits for all ubq_daemons
    getting ready. Then it unquiesces request queue and new rqs are
    allowed.

You should use ublksrv[2] and tests[3] provided by us. We add 3 additional
tests to verify that recovery feature works. Our code will be PR-ed to
Ming's repo soon.

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic

Since V3:
(1) do not kick requeue list in ublk_queue_rq() or io_uring fallback wq
    with a dying ubq_daemon but kicking the list once while unquiescing dev
(2) add comment on requeing rqs in ublk_queue_rq(), or io_uring fallback wq
    with a dying ubq_daemon
(3) split support for UBLK_F_USER_RECOVERY_REISSUE into a single patch
(4) let monitor_work abort/requeue rqs issued to userspace instead of
    quiesce_work with recovery enabled
(5) alway wait until no INFLIGHT rq exists in ublk_quiesce_dev()
(6) move ublk re-init stuff into ublk_ch_release()
(7) let ublk_quiesce_dev() go on as long as one ubq_daemon is dying
(8) add only one ctrl-cmd and rename it as RESTART_DEV
(9) check ub.dev_info->flags instead of iterating on all ubqs
(10) do not disable recoevry feature, but always qiuesce dev in
     ublk_stop_dev() and then unquiesce it
(11) add doc on USER_RECOVERY feature
 
Since V2:
(1) run ublk_quiesce_dev() in a standalone work.
(2) do not run monitor_work after START_USER_RECOVERY is handled.
(3) refactor recovery feature code so that it does not affect current code.

Since V1:
(1) refactor cover letter. Add intruduction on "how to detect a crash" and
    "why we need recovery feature".
(2) do not refactor task_work and ublk_queue_rq().
(3) allow users freely stop/recover the device.
(4) add comment on ublk_cancel_queue().
(5) refactor monitor_work and aborting machenism since we add recovery
    machenism in monitor_work.

ZiyangZhang (8):
  ublk_drv: check 'current' instead of 'ubq_daemon'
  ublk_drv: refactor ublk_cancel_queue()
  ublk_drv: define macros for recovery feature and check them
  ublk_drv: requeue rqs with recovery feature enabled
  ublk_drv: consider recovery feature in aborting mechanism
  ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE
  ublk_drv: allow new process to open ublk chardev with recovery feature
    enabled
  Documentation: document ublk user recovery feature

 Documentation/block/ublk.rst  |  25 +++
 drivers/block/ublk_drv.c      | 345 +++++++++++++++++++++++++++++++---
 include/uapi/linux/ublk_cmd.h |   6 +
 3 files changed, 354 insertions(+), 22 deletions(-)

-- 
2.27.0


             reply	other threads:[~2022-09-21  9:59 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-21  9:58 ZiyangZhang [this message]
2022-09-21  9:58 ` [PATCH V4 1/8] ublk_drv: check 'current' instead of 'ubq_daemon' ZiyangZhang
2022-09-21  9:58 ` [PATCH V4 2/8] ublk_drv: refactor ublk_cancel_queue() ZiyangZhang
2022-09-21  9:58 ` [PATCH V4 3/8] ublk_drv: define macros for recovery feature and check them ZiyangZhang
2022-09-21  9:58 ` [PATCH V4 4/8] ublk_drv: requeue rqs with recovery feature enabled ZiyangZhang
2022-09-22  0:28   ` Ming Lei
2022-09-22  2:04     ` Ziyang Zhang
2022-09-21  9:58 ` [PATCH V4 5/8] ublk_drv: consider recovery feature in aborting mechanism ZiyangZhang
2022-09-22  0:18   ` Ming Lei
2022-09-22  2:06     ` Ziyang Zhang
2022-09-21  9:58 ` [PATCH V4 6/8] ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE ZiyangZhang
2022-09-22  2:13   ` Ming Lei
2022-09-21  9:58 ` [PATCH V4 7/8] ublk_drv: allow new process to open ublk chardev with recovery feature enabled ZiyangZhang
2022-09-22  2:38   ` Ming Lei
2022-09-22  3:00     ` Ziyang Zhang
2022-09-21  9:58 ` [PATCH V4 8/8] Documentation: document ublk user recovery feature ZiyangZhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220921095849.84988-1-ZiyangZhang@linux.alibaba.com \
    --to=ziyangzhang@linux.alibaba.com \
    --cc=axboe@kernel.dk \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.