* [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support
@ 2022-08-31 15:51 ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 1/6] ublk_drv: check 'current' instead of 'ubq_daemon' ZiyangZhang
  ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

ublk_drv is a driver that simply passes all blk-mq rqs to ublksrv[1] in userspace. For each ublk queue there is one ubq_daemon (pthread). All ubq_daemons share the same process, which opens /dev/ublkcX. The ubq_daemon code loops forever on io_uring_enter() to send/receive io_uring cmds that carry information about blk-mq rqs.

Since the real IO handler (the process opening /dev/ublkcX) is in userspace, it could crash if:
(1) the user kills -9 it because of an IO hang on the backend, a system reboot, etc.
(2) the process catches an exception (segfault, divisor error, oom...)

Therefore, the kernel driver has to deal with a dying process. Now, if one ubq_daemon (pthread) or the whole process crashes, ublk_drv must abort the dying ubq, stop the device and release everything. This is not a good choice in practice because users do not expect aborted requests, I/O errors and a released device. They may want a recovery mechanism so that no requests are aborted and no I/O error occurs. In short, users just want everything to work as usual.

This RFC patchset implements USER_RECOVERY support. If the process crashes, we allow ublksrv to provide a new process and new ubq_daemons. We do not support recovery of a single ubq_daemon (pthread) because a pthread rarely crashes.

Note: The responsibility of recovery belongs to the user who opens /dev/ublkcX. After a process crash, the kernel driver only switches the device's status to be ready for recovery or termination (STOP_DEV). This patchset does not cover how to detect such a process crash in userspace. A very straightforward idea is adding a watchdog.
The recovery feature is quite useful for real products. In detail, we support this scenario:
(1) /dev/ublkc0 is opened by process 0;
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all rqs are handled by process 0.
(3) Process 0 suddenly crashes (e.g. segfault);
(4) Fio is still running and submits IOs (but these IOs cannot complete now)
(5) The user recovers with process 1 and attaches it to /dev/ublkc0
(6) All rqs are handled by process 1 now and IOs can be completed.

Note: The backend must tolerate double-writes because we may re-issue an rq that was already sent to the old (dying) process. We let users choose whether to re-issue these rqs or not; please see patch 5 for more detail.

We provide a sample script here to simulate the above steps:

***************************script***************************
LOOPS=10

__ublk_get_pid() {
	pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
	echo $pid
}

ublk_recover_kill()
{
	for CNT in `seq $LOOPS`; do
		dmesg -C
		pid=`__ublk_get_pid`
		echo -e "*** kill $pid now ***"
		kill -9 $pid
		sleep 6
		echo -e "*** recover now ***"
		./ublk recover -n 0
		sleep 6
	done
}

ublk_test()
{
	dmesg -C
	echo -e "*** add ublk device ***"
	./ublk add -t null -d 4 -i 1
	sleep 2
	echo -e "*** start fio ***"
	fio --bs=4k \
		--filename=/dev/ublkb0 \
		--runtime=140s \
		--rw=read &
	sleep 4
	ublk_recover_kill
	wait
	echo -e "*** delete ublk device ***"
	./ublk del -n 0
}

for CNT in `seq 4`; do
	modprobe -rv ublk_drv
	modprobe ublk_drv
	echo -e "************ round $CNT ************"
	ublk_test
	sleep 5
done
***************************script***************************

You may run it with our modified ublksrv[2] which supports the recovery feature. No I/O error occurs and you can verify it by typing

    $ perf-tools/bin/tpoint block:block_rq_error

The basic idea of USER_RECOVERY is quite straightforward:

(1) release/free everything that belongs to the dying process.
    Note: Since ublk_drv does save information about the user process, this work is important because we don't expect any resource leakage.
    Particularly, ioucmds from the dying ubq_daemons need to be completed (freed).
(2) init ublk queues, including requeuing/aborting rqs.
(3) allow new ubq_daemons to issue FETCH_REQ.

Here are the steps to recover:
(1) The monitor_work detects a crash; it should requeue/abort inflight rqs, complete old ioucmds and quiesce the request queue to ban any incoming ublk_queue_rq(). Then the ublk device is ready for a recovery/stop procedure.
(2) After a process crash, the user sends the START_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a dev_id X (such as 3 for /dev/ublkc3).
(3) Then ublk_drv prepares for a new process to attach /dev/ublkcX. All ublk_io structures are cleared and ubq_daemons are reset.
(4) Then the user starts a new process plus ubq_daemons (pthreads) and sends FETCH_REQ by io_uring_enter() to make all ubqs ready. The user must correctly set up queues, flags and so on (how to persist the user's information is not related to this patchset).
(5) The user sends the END_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a dev_id X.
(6) ublk_drv waits for all ubq_daemons to get ready. Then it unquiesces the request queue and new rqs are allowed.

You should use the ublksrv[2] and tests[3] provided by us. We add 2 additional tests to verify that the recovery feature works. Our code will be PR-ed to Ming's repo soon.

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic

Since V1:
(1) refactor the cover letter. Add an introduction on "how to detect a crash" and "why we need the recovery feature".
(2) do not refactor task_work and ublk_queue_rq().
(3) allow users to freely stop/recover the device.
(4) add a comment on ublk_cancel_queue().
(5) refactor monitor_work and the aborting mechanism since we add the recovery mechanism in monitor_work.
ZiyangZhang (6):
  ublk_drv: check 'current' instead of 'ubq_daemon'
  ublk_drv: refactor ublk_cancel_queue()
  ublk_drv: define macros for recovery feature and check them
  ublk_drv: requeue rqs with recovery feature enabled
  ublk_drv: consider recovery feature in aborting mechanism
  ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support

 drivers/block/ublk_drv.c      | 439 +++++++++++++++++++++++++++++++---
 include/uapi/linux/ublk_cmd.h |   7 +
 2 files changed, 419 insertions(+), 27 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [RFC PATCH V2 1/6] ublk_drv: check 'current' instead of 'ubq_daemon'
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 2/6] ublk_drv: refactor ublk_cancel_queue() ZiyangZhang
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

This check is not atomic, so with the recovery feature the ubq_daemon may be modified simultaneously by the recovery task. Instead, checking 'current' is safe here because 'current' never changes.

Also add a comment explaining this check, which is really important for understanding the recovery feature.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6a4a94b4cdf4..c39b67d7133d 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req)
 	struct ublk_device *ub = ubq->dev;
 	int tag = req->tag;
 	struct ublk_io *io = &ubq->ios[tag];
-	bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq);
 	unsigned int mapped_bytes;
 
 	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
 			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
 			ublk_get_iod(ubq, req->tag)->addr);
 
-	if (unlikely(task_exiting)) {
+	/*
+	 * Task is exiting if either:
+	 *
+	 * (1) current != ubq_daemon.
+	 * io_uring_cmd_complete_in_task() tries to run task_work
+	 * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING.
+	 *
+	 * (2) current->flags & PF_EXITING.
+	 */
+	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
 		blk_mq_end_request(req, BLK_STS_IOERR);
 		mod_delayed_work(system_wq, &ub->monitor_work, 0);
 		return;
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* [RFC PATCH V2 2/6] ublk_drv: refactor ublk_cancel_queue()
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 1/6] ublk_drv: check 'current' instead of 'ubq_daemon' ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-09-03 11:16   ` Ming Lei
  2022-08-31 15:51 ` [RFC PATCH V2 3/6] ublk_drv: define macros for recovery feature and check them ZiyangZhang
  ` (4 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

Assume only a few FETCH_REQ ioucmds are sent to ublk_drv and then the ubq_daemon exits: we have to call io_uring_cmd_done() for all received ioucmds so that the io_uring ctx will not leak.

ublk_cancel_queue() may be called before START_DEV or after STOP_DEV. We decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so that we won't call io_uring_cmd_done() twice for one ioucmd, which avoids a UAF. Clearing UBLK_IO_FLAG_ACTIVE also makes the code more reasonable.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c39b67d7133d..0c6db0978ed0 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -963,22 +963,39 @@ static inline bool ublk_queue_ready(struct ublk_queue *ubq)
 	return ubq->nr_io_ready == ubq->q_depth;
 }
 
+/* If ublk_cancel_queue() is called before sending START_DEV(), ->mutex
+ * provides protection on above update.
+ *
+ * If ublk_cancel_queue() is called after sending START_DEV(), disk is
+ * deleted first, UBLK_IO_RES_ABORT is returned so that any new io
+ * command can't be issued to driver, so updating on io flags and
+ * nr_io_ready is safe here.
+ *
+ * Also ->nr_io_ready is guaranteed to become zero after ublk_cancel_queue()
+ * returns since request queue is either frozen or not present in both
+ * cases.
+ */
 static void ublk_cancel_queue(struct ublk_queue *ubq)
 {
 	int i;
 
-	if (!ublk_queue_ready(ubq))
+	if (!ubq->nr_io_ready)
 		return;
 
 	for (i = 0; i < ubq->q_depth; i++) {
 		struct ublk_io *io = &ubq->ios[i];
 
-		if (io->flags & UBLK_IO_FLAG_ACTIVE)
+		if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+			pr_devel("%s: done old cmd: qid %d tag %d\n",
+					__func__, ubq->q_id, i);
 			io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+			io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+			ubq->nr_io_ready--;
+		}
 	}
 
 	/* all io commands are canceled */
-	ubq->nr_io_ready = 0;
+	WARN_ON_ONCE(ubq->nr_io_ready);
 }
 
 /* Cancel all pending commands, must be called after del_gendisk() returns */
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread
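The bookkeeping in the refactored ublk_cancel_queue() can be modeled in plain userspace C. The model below is illustrative only (simplified flags, no real io_uring cmds): each ACTIVE slot is completed exactly once, and a second call finds nr_io_ready already zero, which is how the double io_uring_cmd_done()/UAF case is avoided.

```c
/*
 * Userspace model of the refactored ublk_cancel_queue() bookkeeping.
 * Simplified flags only -- no real io_uring cmds are involved.
 */
#include <assert.h>

#define MODEL_IO_ACTIVE	0x1
#define MODEL_Q_DEPTH	4

struct model_queue {
	int nr_io_ready;		/* how many ios were FETCH_REQ-ed */
	unsigned int flags[MODEL_Q_DEPTH];
	int completions;		/* counts "io_uring_cmd_done()" calls */
};

static void model_cancel_queue(struct model_queue *q)
{
	if (!q->nr_io_ready)		/* nothing fetched, or already canceled */
		return;

	for (int i = 0; i < MODEL_Q_DEPTH; i++) {
		if (q->flags[i] & MODEL_IO_ACTIVE) {
			q->completions++;		/* "io_uring_cmd_done()" */
			q->flags[i] &= ~MODEL_IO_ACTIVE;
			q->nr_io_ready--;
		}
	}
	/* mirrors the WARN_ON_ONCE(ubq->nr_io_ready) in the patch */
	assert(q->nr_io_ready == 0);
}
```

Calling model_cancel_queue() twice completes each slot only once: the early return on !nr_io_ready makes the second call a no-op.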
* Re: [RFC PATCH V2 2/6] ublk_drv: refactor ublk_cancel_queue()
  2022-08-31 15:51 ` [RFC PATCH V2 2/6] ublk_drv: refactor ublk_cancel_queue() ZiyangZhang
@ 2022-09-03 11:16   ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-09-03 11:16 UTC (permalink / raw)
  To: ZiyangZhang; +Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi

On Wed, Aug 31, 2022 at 11:51:32PM +0800, ZiyangZhang wrote:
> Assume only a few FETCH_REQ ioucmds are sent to ublk_drv, then the
> ubq_daemon exits, We have to call io_uring_cmd_done() for all ioucmds
> received so that io_uring ctx will not leak.
> 
> ublk_cancel_queue() may be called before START_DEV or after STOP_DEV,
> we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so that we
> won't call io_uring_cmd_done() twice for one ioucmd to avoid UAF. Also
> clearing UBLK_IO_FLAG_ACTIVE makes the code more reasonable.
> 
> Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
> ---

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [RFC PATCH V2 3/6] ublk_drv: define macros for recovery feature and check them
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 1/6] ublk_drv: check 'current' instead of 'ubq_daemon' ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 2/6] ublk_drv: refactor ublk_cancel_queue() ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-09-03 11:18   ` Ming Lei
  2022-08-31 15:51 ` [RFC PATCH V2 4/6] ublk_drv: requeue rqs with recovery feature enabled ZiyangZhang
  ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

Define some macros for the recovery feature. Especially, define a new state, UBLK_S_DEV_RECOVERING, which implies the ublk_device is recovering.

UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables the recovery feature. It won't let monitor_work
    automatically abort rqs and release the device. Instead, it waits
    for the user's START_USER_RECOVERY ctrl-cmd.
(2) In monitor_work after a crash, ublk_drv ends (aborts) rqs issued to
    userspace (ublksrv) before the crash.
(3) In task work and ublk_queue_rq() after a crash, ublk_drv requeues
    rqs dispatched after the crash.

UBLK_F_USER_RECOVERY_REISSUE implies everything UBLK_F_USER_RECOVERY implies, except that ublk_drv requeues (rather than aborts) rqs issued to userspace (ublksrv) before the crash.

UBLK_F_USER_RECOVERY_REISSUE is designed for backends that:
(1) tolerate double-writes, because we may issue the same rq twice.
(2) cannot let frontend users get an I/O error, such as a RDONLY system.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c      | 31 ++++++++++++++++++++++++++++++-
 include/uapi/linux/ublk_cmd.h |  7 +++++++
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0c6db0978ed0..0c3d32e8d686 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -49,7 +49,9 @@
 /* All UBLK_F_* have to be included into UBLK_F_ALL */
 #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
 		| UBLK_F_URING_CMD_COMP_IN_TASK \
-		| UBLK_F_NEED_GET_DATA)
+		| UBLK_F_NEED_GET_DATA \
+		| UBLK_F_USER_RECOVERY \
+		| UBLK_F_USER_RECOVERY_REISSUE)
 
 /* All UBLK_PARAM_TYPE_* should be included here */
 #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
@@ -323,6 +325,33 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
 			PAGE_SIZE);
 }
 
+/*
+ * TODO: UBLK_F_USER_RECOVERY should be a flag for device, not for queue,
+ * since "some queues are aborted while others are recovered" is really weird.
+ */
+static inline bool ublk_can_use_recovery(struct ublk_device *ub)
+{
+	struct ublk_queue *ubq = ublk_get_queue(ub, 0);
+
+	if (ubq->flags & UBLK_F_USER_RECOVERY)
+		return true;
+	return false;
+}
+
+/*
+ * TODO: UBLK_F_USER_RECOVERY_REISSUE should be a flag for device, not for queue,
+ * since "some queues are aborted while others are recovered" is really weird.
+ */
+static inline bool ublk_can_use_recovery_reissue(struct ublk_device *ub)
+{
+	struct ublk_queue *ubq = ublk_get_queue(ub, 0);
+
+	if ((ubq->flags & UBLK_F_USER_RECOVERY) &&
+			(ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
+		return true;
+	return false;
+}
+
 static void ublk_free_disk(struct gendisk *disk)
 {
 	struct ublk_device *ub = disk->private_data;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 677edaab2b66..7f7e6f44cec5 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -17,6 +17,8 @@
 #define UBLK_CMD_STOP_DEV	0x07
 #define UBLK_CMD_SET_PARAMS	0x08
 #define UBLK_CMD_GET_PARAMS	0x09
+#define UBLK_CMD_START_USER_RECOVERY	0x10
+#define UBLK_CMD_END_USER_RECOVERY	0x11
 
 /*
  * IO commands, issued by ublk server, and handled by ublk driver.
@@ -74,9 +76,14 @@
  */
 #define UBLK_F_NEED_GET_DATA (1UL << 2)
 
+#define UBLK_F_USER_RECOVERY	(1UL << 3)
+
+#define UBLK_F_USER_RECOVERY_REISSUE	(1UL << 4)
+
 /* device state */
 #define UBLK_S_DEV_DEAD	0
 #define UBLK_S_DEV_LIVE	1
+#define UBLK_S_DEV_RECOVERING	2
 
 /* shipped via sqe->cmd of io_uring command */
 struct ublksrv_ctrl_cmd {
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread
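The two feature flags are plain bits in ubq->flags, so their semantics can be checked in a few lines of userspace C. The flag values below are copied from the patch's uapi header addition; the helper names shadow the driver's inline functions for illustration only. The key point: REISSUE only takes effect together with UBLK_F_USER_RECOVERY.

```c
/* Flag values copied from the patch's uapi header addition. */
#include <stdbool.h>

#define UBLK_F_USER_RECOVERY		(1UL << 3)
#define UBLK_F_USER_RECOVERY_REISSUE	(1UL << 4)

/* Mirrors ublk_can_use_recovery(): the recovery bit alone is enough. */
static bool can_use_recovery(unsigned long flags)
{
	return flags & UBLK_F_USER_RECOVERY;
}

/* Mirrors ublk_can_use_recovery_reissue(): both bits are required. */
static bool can_use_recovery_reissue(unsigned long flags)
{
	return (flags & UBLK_F_USER_RECOVERY) &&
	       (flags & UBLK_F_USER_RECOVERY_REISSUE);
}
```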
* Re: [RFC PATCH V2 3/6] ublk_drv: define macros for recovery feature and check them
  2022-08-31 15:51 ` [RFC PATCH V2 3/6] ublk_drv: define macros for recovery feature and check them ZiyangZhang
@ 2022-09-03 11:18   ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-09-03 11:18 UTC (permalink / raw)
  To: ZiyangZhang; +Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi

On Wed, Aug 31, 2022 at 11:51:33PM +0800, ZiyangZhang wrote:
> Define some macros for recovery feature. Especially define a new state:
> UBLK_S_DEV_RECOVERING which implies the ublk_device is recovering.
> 
> UBLK_F_USER_RECOVERY implies that:
> (1) ublk_drv enables recovery feature. It won't let monitor_work to
>     automatically abort rqs and release the device. Instead, it waits
>     for user's START_USER_RECOVERY ctrl-cmd.
> 
> (2) In monitor_work after a crash, ublk_drv ends(aborts) rqs issued to
>     userspace(ublksrv) before crash.
> 
> (3) In task work and ublk_queue_rq() after a crash, ublk_drv requeues
>     rqs dispatched after crash.
> 
> UBLK_F_USER_RECOVERY_REISSUE implies that:
> (1) everything UBLK_F_USER_RECOVERY implies except
> (2) ublk_drv requeues rqs issued to userspace(ublksrv) before crash.
> 
> UBLK_F_USER_RECOVERY_REISSUE is designed for backends which:
> (1) tolerate double-writes because we may issue the same rq twice.
> (2) cannot let frontend users get I/O error, such as a RDONLY system.
> 
> Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
> ---
>  drivers/block/ublk_drv.c      | 31 ++++++++++++++++++++++++++++++-
>  include/uapi/linux/ublk_cmd.h |  7 +++++++
>  2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 0c6db0978ed0..0c3d32e8d686 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -49,7 +49,9 @@
>  /* All UBLK_F_* have to be included into UBLK_F_ALL */
>  #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
>  		| UBLK_F_URING_CMD_COMP_IN_TASK \
> -		| UBLK_F_NEED_GET_DATA)
> +		| UBLK_F_NEED_GET_DATA \
> +		| UBLK_F_USER_RECOVERY \
> +		| UBLK_F_USER_RECOVERY_REISSUE)
> 
>  /* All UBLK_PARAM_TYPE_* should be included here */
>  #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
> @@ -323,6 +325,33 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
>  			PAGE_SIZE);
>  }
> 
> +/*
> + * TODO: UBLK_F_USER_RECOVERY should be a flag for device, not for queue,
> + * since "some queues are aborted while others are recoverd" is really weird.
> + */
> +static inline bool ublk_can_use_recovery(struct ublk_device *ub)
> +{
> +	struct ublk_queue *ubq = ublk_get_queue(ub, 0);

This way is too tricky, just wondering why you don't pass ubq to
ublk_can_use_recovery()?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [RFC PATCH V2 4/6] ublk_drv: requeue rqs with recovery feature enabled
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
  ` (2 preceding siblings ...)
  2022-08-31 15:51 ` [RFC PATCH V2 3/6] ublk_drv: define macros for recovery feature and check them ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-08-31 15:51 ` [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism ZiyangZhang
  ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

With the recovery feature enabled, in ublk_queue_rq or task work (in exit_task_work or the fallback wq), we requeue rqs instead of ending (aborting) them.

Whether the recovery feature is enabled or not, schedule monitor_work immediately after detecting a crash so that it can find the crash and start the aborting/recovery mechanism.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0c3d32e8d686..296b9d80f003 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -690,7 +690,16 @@ static inline void __ublk_rq_task_work(struct request *req)
 	 * (2) current->flags & PF_EXITING.
 	 */
 	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
-		blk_mq_end_request(req, BLK_STS_IOERR);
+		pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
+				(ublk_can_use_recovery(ub)) ? "requeue" : "abort",
+				ubq->q_id, req->tag, io->flags);
+
+		if (ublk_can_use_recovery(ub)) {
+			/* We cannot process this req so just requeue it. */
+			blk_mq_requeue_request(req, false);
+		} else {
+			blk_mq_end_request(req, BLK_STS_IOERR);
+		}
 		mod_delayed_work(system_wq, &ub->monitor_work, 0);
 		return;
 	}
@@ -770,6 +779,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 {
 	struct ublk_queue *ubq = hctx->driver_data;
 	struct request *rq = bd->rq;
+	struct ublk_io *io = &ubq->ios[rq->tag];
 	blk_status_t res;
 
 	/* fill iod to slot in io cmd buffer */
@@ -781,8 +791,18 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	if (unlikely(ubq_daemon_is_dying(ubq))) {
 fail:
+		pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
+				(ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
+				ubq->q_id, rq->tag, io->flags);
+
 		mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
-		return BLK_STS_IOERR;
+		if (ublk_can_use_recovery(ubq->dev)) {
+			/* We cannot process this rq so just requeue it. */
+			blk_mq_requeue_request(rq, false);
+			return BLK_STS_OK;
+		} else {
+			return BLK_STS_IOERR;
+		}
 	}
 
 	if (ublk_can_use_task_work(ubq)) {
@@ -793,7 +813,6 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 		if (task_work_add(ubq->ubq_daemon, &data->work, notify_mode))
 			goto fail;
 	} else {
-		struct ublk_io *io = &ubq->ios[rq->tag];
 		struct io_uring_cmd *cmd = io->cmd;
 		struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
-- 
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread
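The dispatch decision this patch adds to ublk_queue_rq() reduces to a small table. The snippet below is a simplified model (an enum stands in for blk_status_t and the real blk-mq calls): a request that hits a dying daemon is requeued when recovery is enabled, otherwise it fails with an I/O error.

```c
/*
 * Simplified model of the ublk_queue_rq() decision added by this patch.
 * enum rq_action stands in for blk_mq_requeue_request()/BLK_STS_* results.
 */
enum rq_action {
	RQ_DISPATCH,	/* daemon healthy: hand the rq to the ubq_daemon */
	RQ_REQUEUE,	/* daemon dying, recovery on: blk_mq_requeue_request() */
	RQ_IOERR,	/* daemon dying, recovery off: BLK_STS_IOERR */
};

static enum rq_action queue_rq_action(int daemon_dying, int can_use_recovery)
{
	if (!daemon_dying)
		return RQ_DISPATCH;
	return can_use_recovery ? RQ_REQUEUE : RQ_IOERR;
}
```

Note that in the requeue case the real code still returns BLK_STS_OK to blk-mq, since the rq stays owned by the block layer's requeue list instead of failing.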
* [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
  ` (3 preceding siblings ...)
  2022-08-31 15:51 ` [RFC PATCH V2 4/6] ublk_drv: requeue rqs with recovery feature enabled ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-09-03 13:30   ` Ming Lei
  2022-08-31 15:51 ` [RFC PATCH V2 6/6] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support ZiyangZhang
  2022-09-06  1:14 ` [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support Ming Lei
  6 siblings, 1 reply; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
  To: ming.lei, axboe
  Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

We change the default behavior of the aborting mechanism. Now monitor_work is not manually scheduled by ublk_queue_rq() or task work after a ubq_daemon or the process is dying (PF_EXITING). The monitor work should find a dying ubq_daemon or a crashed process by itself and then start the aborting mechanism.

We make this modification because we want to strictly separate the STOP_DEV procedure from monitor_work. More specifically, we ensure that monitor_work must not be scheduled after we start deleting the gendisk and ending (aborting) all inflight rqs. In this way it is easy to take the recovery feature into account and unify it with the existing aborting mechanism. We really do not want too many "if can_use_recovery" checks.

With the recovery feature disabled and after a ubq_daemon crash:
(1) monitor_work notices the crash and schedules stop_work.
(2) stop_work calls ublk_stop_dev().
(3) In ublk_stop_dev():
    (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
        if ublk_queue_rq() does not see it, rqs can still be ended
        (aborted) in the fallback wq.
    (b) Then it cancels monitor_work;
    (c) Then it schedules abort_work, which ends (aborts) all inflight rqs.
    (d) At the same time, del_gendisk() is called.
    (e) Finally, we complete all ioucmds.

Note: we do not change the existing behavior with recovery disabled. Note that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work.

With the recovery feature enabled and after a process crash:
(1) monitor_work notices the crash and that all ubq_daemons are dying.
    We do not consider a "single" ubq_daemon (pthread) crash; please
    send the STOP_DEV ctrl-cmd, which calls ublk_stop_dev(), for that
    case.
(2) monitor_work quiesces the request queue.
(3) monitor_work checks if there is any inflight rq with
    UBLK_IO_FLAG_ACTIVE unset. If so, we give up and schedule
    monitor_work later to retry. This is because we have to wait for
    these rqs to be requeued (IDLE) before we are safe to complete
    their ioucmds later; otherwise we may cause a UAF on an ioucmd in
    the fallback wq.
(4) If the check in (3) passes, we requeue/abort the inflight rqs that
    were issued to the crashed ubq_daemon. If
    UBLK_F_USER_RECOVERY_REISSUE is set, an rq is requeued; otherwise
    it is aborted.
(5) All ioucmds are completed by calling io_uring_cmd_done().
(6) monitor_work sets ub's state to UBLK_S_DEV_RECOVERING and does not
    schedule itself anymore. Now we are ready for START_USER_RECOVERY.

Note: If (3) fails, monitor_work schedules itself and retries (3). We allow the user to manually start the STOP_DEV procedure without regard to monitor_work. STOP_DEV can cancel monitor_work, unquiesce the request queue and drain all requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds since monitor_work has been canceled at that moment.
Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++----
 1 file changed, 202 insertions(+), 20 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 296b9d80f003..0e185d1fa2c4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -156,7 +156,7 @@ struct ublk_device {
 	struct completion	completion;
 	unsigned int		nr_queues_ready;
-	atomic_t		nr_aborted_queues;
+	bool			force_abort;
 
 	/*
	 * Our ubq->daemon may be killed without any notification, so
 	 */
 	struct delayed_work	monitor_work;
 	struct work_struct	stop_work;
+	struct work_struct	abort_work;
 };
 
 /* header of ublk_params */
@@ -643,9 +644,13 @@ static void ublk_complete_rq(struct request *req)
  */
 static void __ublk_fail_req(struct ublk_io *io, struct request *req)
 {
+	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
 	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
 
 	if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
+		pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
+				__func__, ubq->q_id, req->tag, io->flags);
 		io->flags |= UBLK_IO_FLAG_ABORTED;
 		blk_mq_end_request(req, BLK_STS_IOERR);
 	}
@@ -664,6 +669,8 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
 
 	/* tell ublksrv one io request is coming */
 	io_uring_cmd_done(io->cmd, res, 0);
+	pr_devel("%s: complete ioucmd: res %d io_flags %x\n",
+			__func__, res, io->flags);
 }
 
 #define UBLK_REQUEUE_DELAY_MS	3
@@ -675,11 +682,6 @@ static inline void __ublk_rq_task_work(struct request *req)
 	int tag = req->tag;
 	struct ublk_io *io = &ubq->ios[tag];
 	unsigned int mapped_bytes;
-
-	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
-			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
-			ublk_get_iod(ubq, req->tag)->addr);
-
 	/*
 	 * Task is exiting if either:
 	 *
@@ -700,10 +702,13 @@ static inline void __ublk_rq_task_work(struct request *req)
 		} else {
 			blk_mq_end_request(req, BLK_STS_IOERR);
 		}
-
 		mod_delayed_work(system_wq, &ub->monitor_work, 0);
 		return;
 	}
+	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
+			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
+			ublk_get_iod(ubq, req->tag)->addr);
+
 	if (ublk_need_get_data(ubq) &&
 			(req_op(req) == REQ_OP_WRITE ||
 			req_op(req) == REQ_OP_FLUSH)) {
@@ -782,6 +787,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct ublk_io *io = &ubq->ios[rq->tag];
 	blk_status_t res;
 
+	if (unlikely(ubq->dev->force_abort)) {
+		pr_devel("%s: abort q_id %d tag %d io_flags %x.\n",
+				__func__, ubq->q_id, rq->tag, io->flags);
+		return BLK_STS_IOERR;
+	}
 	/* fill iod to slot in io cmd buffer */
 	res = ublk_setup_iod(ubq, rq);
 	if (unlikely(res != BLK_STS_OK))
@@ -789,13 +799,15 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n",
+			__func__, ubq->q_id, rq->tag, io->flags);
+
 	if (unlikely(ubq_daemon_is_dying(ubq))) {
 fail:
 		pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
 				(ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
 				ubq->q_id, rq->tag, io->flags);
-
 		mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
 		if (ublk_can_use_recovery(ubq->dev)) {
 			/* We cannot process this rq so just requeue it. */
 			blk_mq_requeue_request(rq, false);
@@ -880,6 +892,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
 	if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
 		return -EBUSY;
 	filp->private_data = ub;
+	pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -888,6 +901,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
 	struct ublk_device *ub = filp->private_data;
 
 	clear_bit(UB_STATE_OPEN, &ub->state);
+	pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -971,37 +985,180 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 			 * will do it
 			 */
 			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
-			if (rq)
+			if (rq && blk_mq_request_started(rq))
 				__ublk_fail_req(io, rq);
 		}
 	}
 	ublk_put_device(ub);
 }
 
-static void ublk_daemon_monitor_work(struct work_struct *work)
+
+
+static void ublk_abort_work_fn(struct work_struct *work)
 {
 	struct ublk_device *ub =
-		container_of(work, struct ublk_device, monitor_work.work);
+		container_of(work, struct ublk_device, abort_work);
+
+	int i;
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (ubq_daemon_is_dying(ubq))
+			ublk_abort_queue(ub, ubq);
+	}
+
+	if (ub->force_abort)
+		schedule_work(&ub->abort_work);
+}
+
+static void ublk_reinit_queue(struct ublk_device *ub,
+		struct ublk_queue *ubq)
+{
+	int i;
+
+	for (i = 0; i < ubq->q_depth; i++) {
+		struct ublk_io *io = &ubq->ios[i];
+
+		if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
+			struct request *rq = blk_mq_tag_to_rq(
+					ub->tag_set.tags[ubq->q_id], i);
+
+			WARN_ON_ONCE(!rq);
+			pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n",
+					__func__,
+					ublk_can_use_recovery_reissue(ub) ?
+					"requeue" : "abort",
+					ubq->q_id, i, io->flags);
+			if (ublk_can_use_recovery_reissue(ub))
+				blk_mq_requeue_request(rq, false);
+			else
+				__ublk_fail_req(io, rq);
+
+		} else {
+			io_uring_cmd_done(io->cmd,
+					UBLK_IO_RES_ABORT, 0);
+			io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+			pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n",
+					__func__, ubq->q_id, i, io->flags);
+		}
+		ubq->nr_io_ready--;
+	}
+	WARN_ON_ONCE(ubq->nr_io_ready);
+}
+
+static bool ublk_check_inflight_rq(struct request *rq, void *data)
+{
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+	struct ublk_io *io = &ubq->ios[rq->tag];
+	bool *busy = data;
+
+	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+		*busy = true;
+		return false;
+	}
+	return true;
+}
+
+static void ublk_reinit_dev(struct ublk_device *ub)
+{
+	int i;
+	bool busy = false;
+
+	if (!ublk_get_device(ub))
+		return;
+
+	/* If we have quiesced q, all ubq_daemons are dying */
+	if (blk_queue_quiesced(ub->ub_disk->queue))
+		goto check_inflight;
+
+	/* Recovery feature is for 'process' crash. */
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
+			goto out;
+	}
+
+	pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n",
+			__func__, ub->dev_info.nr_hw_queues);
+
+	/* Now all ubq_daemons are PF_EXITING, let's quiesce q. */
+	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	pr_devel("%s: queue quiesced.\n", __func__);
+ check_inflight:
+	/* All rqs have to be requeued(and stay in queue now) */
+	blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy);
+	if (busy) {
+		pr_devel("%s: still some inflight rqs, retry later...\n",
+				__func__);
+		goto out;
+	}
+
+	pr_devel("%s: all inflight rqs are requeued.\n", __func__);
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		ublk_reinit_queue(ub, ubq);
+	}
+	/* So monitor_work won't be scheduled anymore */
+	ub->dev_info.state = UBLK_S_DEV_RECOVERING;
+	pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n",
+			__func__);
+ out:
+	ublk_put_device(ub);
+}
+
+static void ublk_kill_dev(struct ublk_device *ub)
+{
 	int i;
 
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
 		struct ublk_queue *ubq = ublk_get_queue(ub, i);
 
 		if (ubq_daemon_is_dying(ubq)) {
+			pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n",
+					__func__, i);
 			schedule_work(&ub->stop_work);
-
-			/* abort queue is for making forward progress */
-			ublk_abort_queue(ub, ubq);
 		}
 	}
+}
+
+static void ublk_daemon_monitor_work(struct work_struct *work)
+{
+	struct ublk_device *ub =
+		container_of(work, struct ublk_device, monitor_work.work);
+
+	pr_devel("%s: mode(%s) running...\n",
+			__func__,
+			ublk_can_use_recovery(ub) ? "recovery" : "kill");
+	/*
+	 * We can't schedule monitor work if:
+	 * (1) The state is DEAD.
+	 *     The gendisk is not alive, so either all rqs are ended
+	 *     or request queue is not created.
+	 *
+	 * (2) The state is RECOVERING.
+	 *     The process crashed, all rqs were requeued and request queue
+	 *     was quiesced.
	 */
+	WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE);
+	if (ublk_can_use_recovery(ub))
+		ublk_reinit_dev(ub);
+	else
+		ublk_kill_dev(ub);
 	/*
-	 * We can't schedule monitor work after ublk_remove() is started.
+	 * No need for ub->mutex: monitor work is canceled after ub is marked
+	 * as force_abort, which is observed reliably.
+	 *
+	 * Note:
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
 	 *
-	 * No need ub->mutex, monitor work are canceled after state is marked
-	 * as DEAD, so DEAD state is observed reliably.
 	 */
-	if (ub->dev_info.state != UBLK_S_DEV_DEAD)
+	if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort)
 		schedule_delayed_work(&ub->monitor_work,
 				UBLK_DAEMON_MONITOR_PERIOD);
 }
@@ -1058,10 +1215,35 @@ static void ublk_cancel_dev(struct ublk_device *ub)
 static void ublk_stop_dev(struct ublk_device *ub)
 {
 	mutex_lock(&ub->mutex);
-	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
 		goto unlock;
+	/*
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
+	 */
+	ub->force_abort = true;
+	/* wait until monitor_work is not scheduled */
+	cancel_delayed_work_sync(&ub->monitor_work);
+	pr_devel("%s: monitor work is canceled.\n", __func__);
+	/* unquiesce q and let all inflight rqs be aborted */
+	if (blk_queue_quiesced(ub->ub_disk->queue)) {
+		blk_mq_unquiesce_queue(ub->ub_disk->queue);
+		pr_devel("%s: queue unquiesced.\n", __func__);
+	}
+	/* requeued requests will be aborted ASAP because of 'force_abort' */
+	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	/* forward progress */
+	schedule_work(&ub->abort_work);
+	pr_devel("%s: abort work is scheduled, start delete gendisk...\n",
+			__func__);
 	del_gendisk(ub->ub_disk);
+	pr_devel("%s: gendisk is deleted.\n", __func__);
+	ub->force_abort = false;
+	cancel_work_sync(&ub->abort_work);
+	pr_devel("%s: abort work is canceled.\n", __func__);
 	ub->dev_info.state = UBLK_S_DEV_DEAD;
 	ub->dev_info.ublksrv_pid = -1;
 	put_disk(ub->ub_disk);
@@ -1069,7 +1251,6 @@ static
void ublk_stop_dev(struct ublk_device *ub) unlock: ublk_cancel_dev(ub); mutex_unlock(&ub->mutex); - cancel_delayed_work_sync(&ub->monitor_work); } /* device can only be started after all IOs are ready */ @@ -1560,6 +1741,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd) goto out_unlock; mutex_init(&ub->mutex); spin_lock_init(&ub->mm_lock); + INIT_WORK(&ub->abort_work, ublk_abort_work_fn); INIT_WORK(&ub->stop_work, ublk_stop_work_fn); INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work); -- 2.27.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism 2022-08-31 15:51 ` [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism ZiyangZhang @ 2022-09-03 13:30 ` Ming Lei 2022-09-04 11:23 ` Ziyang Zhang 0 siblings, 1 reply; 13+ messages in thread From: Ming Lei @ 2022-09-03 13:30 UTC (permalink / raw) To: ZiyangZhang Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi, Ming Lei On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang <ZiyangZhang@linux.alibaba.com> wrote: > > We change the default behavior of the aborting mechanism. Now monitor_work > will not be manually scheduled by ublk_queue_rq() or task_work after a > ubq_daemon or process is dying(PF_EXITING). The monitor work should > find a dying ubq_daemon or a crashed process by itself. Then, it can Looks like you don't consider one dying ubq_daemon as a crash candidate. Most io implementation is done in the ubq pthread, so it should be covered by the crash recovery. > start the aborting mechanism. We make this modification because we want > to strictly separate the STOP_DEV procedure and monitor_work. More > specifically, we ensure that monitor_work must not be scheduled after > we start deleting the gendisk and ending(aborting) all inflight rqs. In this > way it is easy to take the recovery feature into account and unify it with the existing > aborting mechanism. Really we do not want too many "if can_use_recovery" > checks. Frankly speaking, I am not sure we need to reinvent the wheel for the 'aborting' mechanism. In theory, you needn't change the current monitor work and cancel dev/queue. What you need to decide is how to handle the dying ubq daemon: 1) without user recovery, delete the disk if any ubq daemon has died.
2) with user recovery: - quiesce the request queue and wait until all inflight requests are requeued(become IDLE); - call io_uring_cmd_done for any active io slot - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling the potential crash; if it is confirmed as a crash by userspace, userspace will send a command to handle it. (this way will simplify userspace too, since we can add one utility and provide it via a udev script for handling recovery) > > With recovery feature disabled and after a ubq_daemon crash: > (1) monitor_work notices the crash and schedules stop_work The driver can't figure out whether it is a crash; it can only see whether the ubq daemon has died or not. And the crash detection logic should be done in userspace, IMO. > (2) stop_work calls ublk_stop_dev() > (3) In ublk_stop_dev(): > (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq(); Please don't add a new flag checked locklessly in the fast path; the original check is supposed to be reused for the recovery feature. > If ublk_queue_rq() does not see it, rqs can still be ended(aborted) > in fallback wq. > (b) Then it cancels monitor_work; > (c) Then it schedules abort_work which ends(aborts) all inflight rqs. > (d) At the same time del_gendisk() is called. > (e) Finally, we complete all ioucmds. > > Note: we do not change the existing behavior with recovery disabled. Note > that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work. > > With recovery feature enabled and after a process crash: > (1) monitor_work notices the crash and all ubq_daemons are dying. > We do not consider a "single" ubq_daemon(pthread) crash. Please send > a STOP_DEV ctrl-cmd which calls ublk_stop_dev() for this case. Can you consider why you don't consider it as one crash? IMO, most of the userspace block logic is run in the ubq_daemon, so it is reasonable to consider it.
ublk_reinit_dev() is supposed to be run in a standalone context, just like ublk_stop_dev(); we need monitor_work to provide forward progress, so don't wait inside the monitor work. And please don't change this model for making forward progress. > (2) The monitor_work quiesces the request queue. > (3) The monitor_work checks if there is any inflight rq with > UBLK_IO_FLAG_ACTIVE unset. If so, we give up and schedule monitor_work > later to retry. This is because we have to wait until these rqs are requeued(IDLE) > and we are safe to complete their ioucmds later. Otherwise we may cause > a UAF on the ioucmd in fallback wq. > (4) If the check in (3) passes, we should requeue/abort inflight rqs issued > to the crashed ubq_daemon before. If UBLK_F_USER_RECOVERY_REISSUE is set, > the rq is requeued. Otherwise it is aborted. > (5) All ioucmds are completed by calling io_uring_cmd_done(). > (6) monitor_work sets ub's state to UBLK_S_DEV_RECOVERING. It does not > schedule itself anymore. Now we are ready for START_USER_RECOVERY. > > Note: If (3) fails, monitor_work schedules itself and retries (3). We allow > the user to manually start the STOP_DEV procedure without regard to monitor_work. > STOP_DEV can cancel monitor_work, unquiesce the request queue and drain all > requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds > since monitor_work has been canceled at that moment.
> > Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> > --- > drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++---- > 1 file changed, 202 insertions(+), 20 deletions(-) > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c > index 296b9d80f003..0e185d1fa2c4 100644 > --- a/drivers/block/ublk_drv.c > +++ b/drivers/block/ublk_drv.c > @@ -156,7 +156,7 @@ struct ublk_device { > > struct completion completion; > unsigned int nr_queues_ready; > - atomic_t nr_aborted_queues; > + bool force_abort; > > /* > * Our ubq->daemon may be killed without any notification, so > @@ -164,6 +164,7 @@ struct ublk_device { > */ > struct delayed_work monitor_work; > struct work_struct stop_work; > + struct work_struct abort_work; > }; > > /* header of ublk_params */ > @@ -643,9 +644,13 @@ static void ublk_complete_rq(struct request *req) > */ > static void __ublk_fail_req(struct ublk_io *io, struct request *req) > { > + struct ublk_queue *ubq = req->mq_hctx->driver_data; > + > WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE); > > if (!(io->flags & UBLK_IO_FLAG_ABORTED)) { > + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n", > + __func__, ubq->q_id, req->tag, io->flags); > io->flags |= UBLK_IO_FLAG_ABORTED; > blk_mq_end_request(req, BLK_STS_IOERR); > } > @@ -664,6 +669,8 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res) > > /* tell ublksrv one io request is coming */ > io_uring_cmd_done(io->cmd, res, 0); > + pr_devel("%s: complete ioucmd: res %d io_flags %x\n", > + __func__, res, io->flags); > } > > #define UBLK_REQUEUE_DELAY_MS 3 > @@ -675,11 +682,6 @@ static inline void __ublk_rq_task_work(struct request *req) > int tag = req->tag; > struct ublk_io *io = &ubq->ios[tag]; > unsigned int mapped_bytes; > - > - pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", > - __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, > - ublk_get_iod(ubq, req->tag)->addr); > - > /* > * Task is exiting if either: > * > @@ -700,10 
+702,13 @@ static inline void __ublk_rq_task_work(struct request *req) > } else { > blk_mq_end_request(req, BLK_STS_IOERR); > } > - mod_delayed_work(system_wq, &ub->monitor_work, 0); > return; > } > > + pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", > + __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, > + ublk_get_iod(ubq, req->tag)->addr); > + > if (ublk_need_get_data(ubq) && > (req_op(req) == REQ_OP_WRITE || > req_op(req) == REQ_OP_FLUSH)) { > @@ -782,6 +787,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, > struct ublk_io *io = &ubq->ios[rq->tag]; > blk_status_t res; > > + if (unlikely(ubq->dev->force_abort)) { > + pr_devel("%s: abort q_id %d tag %d io_flags %x.\n", > + __func__, ubq->q_id, rq->tag, io->flags); > + return BLK_STS_IOERR; > + } Checking one flag lockless is usually not safe, also not sure why we need such flag here, and the original check is supposed to work. > /* fill iod to slot in io cmd buffer */ > res = ublk_setup_iod(ubq, rq); > if (unlikely(res != BLK_STS_OK)) > @@ -789,13 +799,15 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, > > blk_mq_start_request(bd->rq); > > + pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n", > + __func__, ubq->q_id, rq->tag, io->flags); > + > if (unlikely(ubq_daemon_is_dying(ubq))) { > fail: > pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__, > (ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort", > ubq->q_id, rq->tag, io->flags); > > - mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0); > if (ublk_can_use_recovery(ubq->dev)) { > /* We cannot process this rq so just requeue it. 
*/ > blk_mq_requeue_request(rq, false); > @@ -880,6 +892,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp) > if (test_and_set_bit(UB_STATE_OPEN, &ub->state)) > return -EBUSY; > filp->private_data = ub; > + pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id); > return 0; > } > > @@ -888,6 +901,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp) > struct ublk_device *ub = filp->private_data; > > clear_bit(UB_STATE_OPEN, &ub->state); > + pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id); > return 0; > } > > @@ -971,37 +985,180 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq) > * will do it > */ > rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i); > - if (rq) > + if (rq && blk_mq_request_started(rq)) > __ublk_fail_req(io, rq); > } > } > ublk_put_device(ub); > } > > -static void ublk_daemon_monitor_work(struct work_struct *work) > + > + > +static void ublk_abort_work_fn(struct work_struct *work) > { > struct ublk_device *ub = > - container_of(work, struct ublk_device, monitor_work.work); > + container_of(work, struct ublk_device, abort_work); > + > + int i; > + > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > + struct ublk_queue *ubq = ublk_get_queue(ub, i); > + > + if (ubq_daemon_is_dying(ubq)) > + ublk_abort_queue(ub, ubq); > + } > + > + if (ub->force_abort) > + schedule_work(&ub->abort_work); > +} > + > +static void ublk_reinit_queue(struct ublk_device *ub, > + struct ublk_queue *ubq) ublk_reinit_queue() is bad name, it is called during aborting, what we need is to shutdown them completely. And reinit or reset will be done when you start to recovery, at that time, you can reinit or reset the queue really. 
> +{ > + int i; > + > + for (i = 0; i < ubq->q_depth; i++) { > + struct ublk_io *io = &ubq->ios[i]; > + > + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) { > + struct request *rq = blk_mq_tag_to_rq( > + ub->tag_set.tags[ubq->q_id], i); > + > + WARN_ON_ONCE(!rq); > + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", > + __func__, > + ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort", > + ubq->q_id, i, io->flags); > + if (ublk_can_use_recovery_reissue(ub)) > + blk_mq_requeue_request(rq, false); > + else > + __ublk_fail_req(io, rq); > + > + } else { > + io_uring_cmd_done(io->cmd, > + UBLK_IO_RES_ABORT, 0); > + io->flags &= ~UBLK_IO_FLAG_ACTIVE; > + pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n", > + __func__, ubq->q_id, i, io->flags); > + } > + ubq->nr_io_ready--; > + } > + WARN_ON_ONCE(ubq->nr_io_ready); > +} > + > +static bool ublk_check_inflight_rq(struct request *rq, void *data) > +{ > + struct ublk_queue *ubq = rq->mq_hctx->driver_data; > + struct ublk_io *io = &ubq->ios[rq->tag]; > + bool *busy = data; > + > + if (io->flags & UBLK_IO_FLAG_ACTIVE) { > + *busy = true; > + return false; > + } > + return true; > +} > + > +static void ublk_reinit_dev(struct ublk_device *ub) bad name too. > +{ > + int i; > + bool busy = false; > + > + if (!ublk_get_device(ub)) > + return; > + > + /* If we have quiesced q, all ubq_daemons are dying */ > + if (blk_queue_quiesced(ub->ub_disk->queue)) > + goto check_inflight; > + > + /* Recovery feature is for 'process' crash. */ > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > + struct ublk_queue *ubq = ublk_get_queue(ub, i); > + > + if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))) > + goto out; > + } > + > + pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n", > + __func__, ub->dev_info.nr_hw_queues); > + > + /* Now all ubq_daemons are PF_EXITING, let's quiesce q. 
*/ > + blk_mq_quiesce_queue(ub->ub_disk->queue); > + pr_devel("%s: queue quiesced.\n", __func__); > + check_inflight: > + /* All rqs have to be requeued(and stay in queue now) */ > + blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy); > + if (busy) { > + pr_devel("%s: still some inflight rqs, retry later...\n", > + __func__); > + goto out; > + } > + > + pr_devel("%s: all inflight rqs are requeued.\n", __func__); > + > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > + struct ublk_queue *ubq = ublk_get_queue(ub, i); > + > + ublk_reinit_queue(ub, ubq); > + } > + /* So monitor_work won't be scheduled anymore */ > + ub->dev_info.state = UBLK_S_DEV_RECOVERING; > + pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n", > + __func__); > + out: > + ublk_put_device(ub); > +} > + > +static void ublk_kill_dev(struct ublk_device *ub) > +{ > int i; > > for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { > struct ublk_queue *ubq = ublk_get_queue(ub, i); > > if (ubq_daemon_is_dying(ubq)) { > + pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n", > + __func__, i); > schedule_work(&ub->stop_work); > - > - /* abort queue is for making forward progress */ > - ublk_abort_queue(ub, ubq); > } > } > +} > + > +static void ublk_daemon_monitor_work(struct work_struct *work) > +{ > + struct ublk_device *ub = > + container_of(work, struct ublk_device, monitor_work.work); > + > + pr_devel("%s: mode(%s) running...\n", > + __func__, > + ublk_can_use_recovery(ub) ? "recovery" : "kill"); > + /* > + * We can't schedule monitor work if: > + * (1) The state is DEAD. > + * The gendisk is not alive, so either all rqs are ended > + * or request queue is not created. > + * > + * (2) The state is RECOVERYING. > + * The process crashed, all rqs were requeued and request queue > + * was quiesced. 
> + */ > + WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE); > > + if (ublk_can_use_recovery(ub)) > + ublk_reinit_dev(ub); ublk_reinit_dev() is supposed to be run in standalone context, just like ublk_stop_dev(), we need monitor_work to provide forward progress, so don't run wait in monitor work. And please don't change this model of making forward progress. > + else > + ublk_kill_dev(ub); > /* > - * We can't schedule monitor work after ublk_remove() is started. > + * No need ub->mutex, monitor work are canceled after ub is marked > + * as force_abort which is observed reliably. > + * > + * Note: > + * All incoming rqs are aborted in ublk_queue_rq ASAP. Then > + * we will hang on del_gendisk() and wait for all inflight rqs' > + * completion. > * > - * No need ub->mutex, monitor work are canceled after state is marked > - * as DEAD, so DEAD state is observed reliably. > */ > - if (ub->dev_info.state != UBLK_S_DEV_DEAD) > + if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort) > schedule_delayed_work(&ub->monitor_work, > UBLK_DAEMON_MONITOR_PERIOD); > } > @@ -1058,10 +1215,35 @@ static void ublk_cancel_dev(struct ublk_device *ub) > static void ublk_stop_dev(struct ublk_device *ub) > { > mutex_lock(&ub->mutex); > - if (ub->dev_info.state != UBLK_S_DEV_LIVE) > + if (ub->dev_info.state == UBLK_S_DEV_DEAD) > goto unlock; > > + /* > + * All incoming rqs are aborted in ublk_queue_rq ASAP. Then > + * we will hang on del_gendisk() and wait for all inflight rqs' > + * completion. 
> + */ > + ub->force_abort = true; > + /* wait until monitor_work is not scheduled */ > + cancel_delayed_work_sync(&ub->monitor_work); > + pr_devel("%s: monitor work is canceled.\n", __func__); > + /* unquiesce q and let all inflight rqs' be aborted */ > + if (blk_queue_quiesced(ub->ub_disk->queue)) { > + blk_mq_unquiesce_queue(ub->ub_disk->queue); > + pr_devel("%s: queue unquiesced.\n", __func__); > + } > + /* requeued requests will be aborted ASAP because of 'force_abort' */ > + blk_mq_kick_requeue_list(ub->ub_disk->queue); > + /* forward progress */ > + schedule_work(&ub->abort_work); > + pr_devel("%s: abort work is scheduled, start delete gendisk...\n", > + __func__); > + pr_devel("%s: gendisk is deleted.\n", __func__); > del_gendisk(ub->ub_disk); > + pr_devel("%s: gendisk is deleted.\n", __func__); > + ub->force_abort = false; > + cancel_work_sync(&ub->abort_work); > + pr_devel("%s: abort work is canceled.\n", __func__); > ub->dev_info.state = UBLK_S_DEV_DEAD; > ub->dev_info.ublksrv_pid = -1; > put_disk(ub->ub_disk); ublk_stop_dev() isn't supposed to be changed so much, and it can be just for handling non-recovery, but it can be renamed as ublk_delete_disk(). For recovery, you can add one function of ublk_quiesce_disk() for preparing for recovering. thanks Ming ^ permalink raw reply [flat|nested] 13+ messages in thread
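To make the abort-versus-reissue split debated above concrete, the per-slot rule that the patch's ublk_reinit_queue() applies (and that the suggested quiesce flow needs as well) can be modeled in plain userspace C. This is only an illustrative sketch: the enum, the name dispose_io(), and the comments are inventions of this example; only the UBLK_IO_FLAG_ACTIVE test and the requeue-vs-fail split mirror the patch as posted.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit as in ublk_drv: the io slot is owned by the driver and its
 * io_uring cmd has not been handed back to ublksrv yet. */
#define UBLK_IO_FLAG_ACTIVE (1U << 0)

/* Hypothetical outcome codes for one io slot during abort/recovery. */
enum io_disposal {
	IO_COMPLETE_CMD, /* slot still ACTIVE: complete its pending cmd with UBLK_IO_RES_ABORT */
	IO_REQUEUE,      /* rq was delivered to the dying daemon: requeue it for a new daemon */
	IO_FAIL,         /* no recovery-reissue: end the rq with an I/O error */
};

/* Mirrors the branch in ublk_reinit_queue(): an inactive slot means the
 * request was already handed to the (now dead) daemon, so it is either
 * reissued or failed; an active slot only needs its uring cmd completed. */
static enum io_disposal dispose_io(unsigned int io_flags, bool recovery_reissue)
{
	if (!(io_flags & UBLK_IO_FLAG_ACTIVE))
		return recovery_reissue ? IO_REQUEUE : IO_FAIL;
	return IO_COMPLETE_CMD;
}
```

With UBLK_F_USER_RECOVERY_REISSUE set, every such outstanding request is reissued to the new daemon — which is exactly why the cover letter warns that the backend must tolerate double writes; without the flag, those requests fail with I/O errors.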
* Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism 2022-09-03 13:30 ` Ming Lei @ 2022-09-04 11:23 ` Ziyang Zhang 2022-09-06 1:12 ` Ming Lei 0 siblings, 1 reply; 13+ messages in thread From: Ziyang Zhang @ 2022-09-04 11:23 UTC (permalink / raw) To: Ming Lei; +Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi On 2022/9/3 21:30, Ming Lei wrote: > On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang > <ZiyangZhang@linux.alibaba.com> wrote: > >> > >> We change the default behavior of the aborting mechanism. Now monitor_work > >> will not be manually scheduled by ublk_queue_rq() or task_work after a > >> ubq_daemon or process is dying(PF_EXITING). The monitor work should > >> find a dying ubq_daemon or a crashed process by itself. Then, it can > > > > Looks like you don't consider one dying ubq_daemon as a crash candidate. > > Most io implementation is done in the ubq pthread, so it should be > > covered by the crash recovery. > > > >> start the aborting mechanism. We make this modification because we want > >> to strictly separate the STOP_DEV procedure and monitor_work. More > >> specifically, we ensure that monitor_work must not be scheduled after > >> we start deleting the gendisk and ending(aborting) all inflight rqs. In this > >> way it is easy to take the recovery feature into account and unify it with the existing > >> aborting mechanism. Really we do not want too many "if can_use_recovery" > >> checks. > > > > Frankly speaking, I am not sure we need to reinvent the wheel for the > > 'aborting' mechanism. > > > > In theory, you needn't change the current monitor work and cancel > > dev/queue. What you need to decide is how to handle the dying ubq daemon: > > > > 1) without user recovery, delete the disk if any ubq daemon has died.
> > 2) with user recovery: > > - quiesce the request queue and wait until all inflight requests are > > requeued(become IDLE); > > - call io_uring_cmd_done for any active io slot > > - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling > > the potential crash; if it is confirmed as a crash by userspace, > > userspace will send a command to handle it. > > (this way will simplify userspace too, since we can add one utility > > and provide it via a udev script for handling recovery) > > > >> > >> With recovery feature disabled and after a ubq_daemon crash: > >> (1) monitor_work notices the crash and schedules stop_work > > > > The driver can't figure out whether it is a crash; it can only see whether the > > ubq daemon has died or not. And the crash detection logic should be done > > in userspace, IMO. > > > >> (2) stop_work calls ublk_stop_dev() > >> (3) In ublk_stop_dev(): > >> (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq(); > > > > Please don't add a new flag checked locklessly in the fast path; the original check > > is supposed to be reused for the recovery feature. > > > >> If ublk_queue_rq() does not see it, rqs can still be ended(aborted) > >> in fallback wq. > >> (b) Then it cancels monitor_work; > >> (c) Then it schedules abort_work which ends(aborts) all inflight rqs. > >> (d) At the same time del_gendisk() is called. > >> (e) Finally, we complete all ioucmds. > >> > >> Note: we do not change the existing behavior with recovery disabled. Note > >> that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work. > >> > >> With recovery feature enabled and after a process crash: > >> (1) monitor_work notices the crash and all ubq_daemons are dying. > >> We do not consider a "single" ubq_daemon(pthread) crash. Please send > >> a STOP_DEV ctrl-cmd which calls ublk_stop_dev() for this case. > > Can you consider why you don't consider it as one crash?
IMO, most of the > userspace block logic is run in the ubq_daemon, so it is reasonable to > consider it. > > ublk_reinit_dev() is supposed to be run in a standalone context, just like > ublk_stop_dev(); we need monitor_work to provide forward progress, > so don't wait inside the monitor work. > > And please don't change this model for making forward progress. > > Hi, Ming. I will take your advice and provide V4 soon. Here is the new design: (0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon and schedule monitor_work immediately. BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs. (1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon is found dying with recovery enabled. (2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock, and quiesces the request queue(only once). On each dying ubq, call ublk_quiesce_queue(). It waits until all inflight rqs(ACTIVE) are requeued(IDLE). Finally it completes all ioucmds. Note: we need to add a per ubq flag 'quiesced', which means we have done this 'quiesce && clean' stuff on the ubq. (3) After the request queue is quiesced, change ub's state to STATE_QUIESCED. This state can be checked by the GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So the user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED back. BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users have many ways to detect a dying process/pthread. For example, they can 'ps' ublksrv_pid or check ub's state by the GET_DEV_INFO ctrl-cmd. Anyway, this work can be done in the future. We can introduce a better way to detect a crash. For this patchset, let's focus on how to deal with a dying ubq_daemon. Do you agree? (4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev() exclude each other by acquiring the ub lock. (5) ublk_stop_dev() has to consider a quiesced ubq. 
It should unquiesce the request queue(only once) if it is quiesced. There is nothing else ublk_stop_dev() has to do. Inflight rqs requeued before will be aborted naturally by del_gendisk(). (6) ublk_quiesce_dev() cannot be run after the gendisk is removed(STATE_DEAD). (7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag. Note: I think this check is safe here. (8) START_USER_RECOVERY needs to consider both a dying process and pthread(ubq_daemon). For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can be done by passing qid = -1 in the ctrl-cmd. We should make sure all ubq_daemons are dying in this case. Otherwise it is a dying pthread, and only this ubq is reinitialized. Users may send many START_USER_RECOVERY commands with different qids to recover many ubqs. Thanks for reviewing the patches. Regards, Zhang. ^ permalink raw reply [flat|nested] 13+ messages in thread
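Zhang's items (0)–(8) above amount to a small device state machine. The standalone C model below is a reader's sketch only: the event names and the UBLK_S_DEV_QUIESCED constant are this example's assumptions (the proposal talks about STATE_QUIESCED/UBLK_S_DEV_RECOVERING), while the LIVE and DEAD states exist in ublk_drv today.

```c
#include <assert.h>

/* LIVE and DEAD exist in ublk_drv today; QUIESCED stands in for the
 * state the V4 design reports via GET_DEV_INFO after a crash. */
enum ub_state { UBLK_S_DEV_DEAD, UBLK_S_DEV_LIVE, UBLK_S_DEV_QUIESCED };

/* Hypothetical events driving the recovery flow. */
enum ub_event {
	EV_DAEMON_DIED,    /* monitor_work found the ubq_daemon(s) PF_EXITING */
	EV_USER_RECOVERED, /* START_USER_RECOVERY done, new daemons attached */
	EV_STOP_DEV,       /* STOP_DEV ctrl-cmd: unquiesce if needed, del_gendisk */
};

static enum ub_state ub_next_state(enum ub_state cur, enum ub_event ev,
				   int recovery_enabled)
{
	switch (ev) {
	case EV_DAEMON_DIED:
		if (cur == UBLK_S_DEV_LIVE)
			return recovery_enabled ? UBLK_S_DEV_QUIESCED
						: UBLK_S_DEV_DEAD;
		return cur;
	case EV_USER_RECOVERED:
		/* Only a quiesced device can be attached to a new process. */
		return cur == UBLK_S_DEV_QUIESCED ? UBLK_S_DEV_LIVE : cur;
	case EV_STOP_DEV:
		/* STOP_DEV works from LIVE or QUIESCED alike. */
		return UBLK_S_DEV_DEAD;
	}
	return cur;
}
```

The table makes the two points of contention in the thread visible: whether EV_DAEMON_DIED should also fire for a single dying pthread, and that EV_STOP_DEV must be accepted from the quiesced state as well.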
* Re: [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism 2022-09-04 11:23 ` Ziyang Zhang @ 2022-09-06 1:12 ` Ming Lei 0 siblings, 0 replies; 13+ messages in thread From: Ming Lei @ 2022-09-06 1:12 UTC (permalink / raw) To: Ziyang Zhang Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ming.lei On Sun, Sep 04, 2022 at 07:23:49PM +0800, Ziyang Zhang wrote: > On 2022/9/3 21:30, Ming Lei wrote: > > On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang > > <ZiyangZhang@linux.alibaba.com> wrote: > >> > >> We change the default behavior of the aborting mechanism. Now monitor_work > >> will not be manually scheduled by ublk_queue_rq() or task_work after a > >> ubq_daemon or process is dying(PF_EXITING). The monitor work should > >> find a dying ubq_daemon or a crashed process by itself. Then, it can > > > > Looks like you don't consider one dying ubq_daemon as a crash candidate. > > Most io implementation is done in the ubq pthread, so it should be > > covered by the crash recovery. > > > >> start the aborting mechanism. We make this modification because we want > >> to strictly separate the STOP_DEV procedure and monitor_work. More > >> specifically, we ensure that monitor_work must not be scheduled after > >> we start deleting the gendisk and ending(aborting) all inflight rqs. In this > >> way it is easy to take the recovery feature into account and unify it with the existing > >> aborting mechanism. Really we do not want too many "if can_use_recovery" > >> checks. > > > > Frankly speaking, I am not sure we need to reinvent the wheel for the > > 'aborting' mechanism. > > > > In theory, you needn't change the current monitor work and cancel > > dev/queue. What you need to decide is how to handle the dying ubq daemon: > > > > 1) without user recovery, delete the disk if any ubq daemon has died. 
> > > > 2) with user recovery: > > - quiesce the request queue and wait until all inflight requests are > > requeued(become IDLE); > > - call io_uring_cmd_done for any active io slot > > - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling > > the potential crash; if it is confirmed as a crash by userspace, > > userspace will send a command to handle it. > > (this way will simplify userspace too, since we can add one utility > > and provide it via a udev script for handling recovery) > > > >> > >> With recovery feature disabled and after a ubq_daemon crash: > >> (1) monitor_work notices the crash and schedules stop_work > > > > The driver can't figure out whether it is a crash; it can only see whether the > > ubq daemon has died or not. And the crash detection logic should be done > > in userspace, IMO. > > > >> (2) stop_work calls ublk_stop_dev() > >> (3) In ublk_stop_dev(): > >> (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq(); > > > > Please don't add a new flag checked locklessly in the fast path; the original check > > is supposed to be reused for the recovery feature. > > > >> If ublk_queue_rq() does not see it, rqs can still be ended(aborted) > >> in fallback wq. > >> (b) Then it cancels monitor_work; > >> (c) Then it schedules abort_work which ends(aborts) all inflight rqs. > >> (d) At the same time del_gendisk() is called. > >> (e) Finally, we complete all ioucmds. > >> > >> Note: we do not change the existing behavior with recovery disabled. Note > >> that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work. > >> > >> With recovery feature enabled and after a process crash: > >> (1) monitor_work notices the crash and all ubq_daemons are dying. > >> We do not consider a "single" ubq_daemon(pthread) crash. Please send > >> a STOP_DEV ctrl-cmd which calls ublk_stop_dev() for this case. 
> > > > Can you consider why you don't consider it as one crash? IMO, most of the > > userspace block logic is run in the ubq_daemon, so it is reasonable to > > consider it. > > > > ublk_reinit_dev() is supposed to be run in a standalone context, just like > > ublk_stop_dev(); we need monitor_work to provide forward progress, > > so don't wait inside the monitor work. > > > > And please don't change this model for making forward progress. > > > > > > Hi, Ming. > > I will take your advice and provide V4 soon. Here is the new design: > > (0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon > and schedule monitor_work immediately. > > BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after > requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs. No. IMO you can add one helper for either ending a request or adding it to the requeue list; then the code is still clean. And when you call del_gendisk() while in the recovery state, blk_mq_unquiesce_queue() will handle/abort them and make del_gendisk() move on. > > (1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon > is found dying with recovery enabled. > > (2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock, and > quiesces the request queue(only once). On each dying ubq, call > ublk_quiesce_queue(). It waits until all inflight rqs(ACTIVE) are > requeued(IDLE). Finally it completes all ioucmds. > Note: we need to add a per ubq flag 'quiesced', which means > we have done this 'quiesce && clean' stuff on the ubq. ubq->nr_io_ready should be reused for checking whether the queue is quiesced. It is actually the same as the current usage. > > (3) After the request queue is quiesced, change ub's state to STATE_QUIESCED. > This state can be checked by the GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So > the user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED > back. > > BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. 
>     Users have many ways to detect a dying process/pthread. For example, they
>     can 'ps' the ublksrv_pid or check ub's state by the GET_DEV_INFO ctrl-cmd.
>     Anyway, this work can be done in the future. We can introduce a better way
>     to detect a crash. For this patchset, let's focus on how to deal with a
>     dying ubq_daemon. Do you agree?

kobj_uevent(KOBJ_CHANGE) is useful, since it can avoid userspace polling on
the device. That provides one accurate chance for a userspace utility to
confirm whether it really is a crash. And checking whether the ubq
daemon/process has crashed is really userspace's responsibility, but sending
the uevent after the ubq daemon/process is dying helps userspace run the
check; at least polling can be avoided, or done in time.

If you don't want to add it from the beginning, that is fine, and we can do it
after your patchset is merged.

> (4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev()
>     exclude each other by acquiring the ub lock.
>
> (5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce the
>     request queue(only once) if it is quiesced. There is nothing else
>     ublk_stop_dev() has to do. Inflight rqs requeued before will be aborted
>     naturally by del_gendisk().
>
> (6) ublk_quiesce_dev() cannot be run after the gendisk is removed(STATE_DEAD).
>
> (7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq, by checking the
>     flag. Note: I think this check is safe here.
>
> (8) START_USER_RECOVERY needs to consider both a dying process and a dying
>     pthread(ubq_daemon).
>
>     For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm.
>     This can be done by passing qid = -1 in the ctrl-cmd. We should make sure
>     all ubq_daemons are dying in this case.
>
>     Otherwise it is a dying pthread. Only this ubq is reinitialized. Users may
>     send many START_USER_RECOVERY with different qid to recover many ubqs.
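A minimal sketch of the kobj_uevent(KOBJ_CHANGE) notification discussed above, assuming kernel context (this fragment is not compilable standalone; kobj_uevent() and disk_to_dev() are existing kernel APIs, while the helper name ublk_notify_userspace() is hypothetical):

```c
/*
 * Sketch only: emit a KOBJ_CHANGE uevent on the ublk disk so a udev rule
 * can launch a recovery utility, instead of userspace polling GET_DEV_INFO.
 */
static void ublk_notify_userspace(struct ublk_device *ub)
{
	kobj_uevent(&disk_to_dev(ub->ub_disk)->kobj, KOBJ_CHANGE);
}
```

A matching udev rule (illustrative only; the script path is an assumption) could look like: `ACTION=="change", KERNEL=="ublkb*", RUN+="/usr/local/bin/ublk-recover.sh %k"`.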
The simplest handling might be to exit all ublk queues first, and re-create
one new process to recover all of them, since the request queue is required to
be quiesced first, and all ublk queues are actually quiesced too. So from the
user viewpoint, there is nothing visible compared with just recovering a
single ubq daemon/pthread.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread
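Ming's earlier suggestion of "one helper for either ending or adding request in requeue list" could be sketched roughly as below. This is a kernel-context sketch, not compilable standalone: the helper name ublk_fail_rq() and the UBLK_REQUEUE_DELAY_MS constant are hypothetical, while blk_mq_requeue_request(), blk_mq_delay_kick_requeue_list() and blk_mq_end_request() are existing blk-mq APIs, and ublk_can_use_recovery() comes from this patchset:

```c
#define UBLK_REQUEUE_DELAY_MS	10	/* hypothetical delay */

/*
 * Sketch: single fail path for a rq whose ubq_daemon is dying. With
 * recovery enabled the rq is kept and re-issued after recovery;
 * otherwise it is errored so upper layers see the failure.
 */
static void ublk_fail_rq(struct ublk_queue *ubq, struct request *rq)
{
	if (ublk_can_use_recovery(ubq->dev)) {
		blk_mq_requeue_request(rq, false);
		/* delayed kick so del_gendisk() is not blocked forever */
		blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
	} else {
		blk_mq_end_request(rq, BLK_STS_IOERR);
	}
}
```

This keeps the fast path free of new lockless flags: the decision is made once per failed request, in the same place for both the recovery and non-recovery cases.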
* [RFC PATCH V2 6/6] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
                   ` (4 preceding siblings ...)
  2022-08-31 15:51 ` [RFC PATCH V2 5/6] ublk_drv: consider recovery feature in aborting mechanism ZiyangZhang
@ 2022-08-31 15:51 ` ZiyangZhang
  2022-09-06  1:14 ` [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support Ming Lei
  6 siblings, 0 replies; 13+ messages in thread
From: ZiyangZhang @ 2022-08-31 15:51 UTC (permalink / raw)
To: ming.lei, axboe
Cc: xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ZiyangZhang

START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
to support the user recovery feature.

After a crash, the user should send START_USER_RECOVERY, which will:
(1) check that (a) the current ublk_device is UBLK_S_DEV_RECOVERING, which
    was set by monitor_work, and (b) the file struct is released. We always
    expect a crash of the ublksrv 'process', not an exit of a single
    ubq_daemon 'pthread'.
(2) reinit all ubqs, including:
    (a) put the dying task(thread) and reset ->ubq_daemon to NULL.
    (b) reset all ublk_io.
(3) reset ub->mm to NULL.

Then, the user should start a new 'process' and send FETCH_REQ on each
ubq_daemon.

Finally, the user should send END_USER_RECOVERY, which will:
(1) wait for all new ubq_daemons to get ready.
(2) unquiesce the request queue and expect incoming ublk_queue_rq()
(3) convert state to UBLK_S_DEV_LIVE
(4) schedule monitor_work again

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 130 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0e185d1fa2c4..2d1f3e032606 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1964,6 +1964,130 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)
 	return ret;
 }

+static void ublk_queue_start_recovery(struct ublk_device *ub, struct ublk_queue *ubq)
+{
+	int i;
+
+	/* All old ioucmds have to be completed/canceled by io_uring_cmd_done(). */
+	WARN_ON_ONCE(ubq->nr_io_ready);
+
+	/* old daemon is PF_EXITING, put it now */
+	put_task_struct(ubq->ubq_daemon);
+	/* have to set it to NULL, otherwise ub won't accept new FETCH_REQ */
+	ubq->ubq_daemon = NULL;
+
+	for (i = 0; i < ubq->q_depth; i++) {
+		struct ublk_io *io = &ubq->ios[i];
+
+		/* forget everything now and be ready for new FETCH_REQ */
+		io->flags = 0;
+		io->cmd = NULL;
+		io->addr = 0;
+	}
+}
+
+static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
+{
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublk_device *ub;
+	int ret = -EINVAL;
+	int i;
+
+	ub = ublk_get_device_from_id(header->dev_id);
+	if (!ub)
+		return ret;
+
+	mutex_lock(&ub->mutex);
+
+	if (!ublk_can_use_recovery(ub))
+		goto out_unlock;
+
+	/*
+	 * START_RECOVERY is only allowed after:
+	 *
+	 * (1) UB_STATE_OPEN is not set, which means the dying process is exited
+	 *     and related io_uring ctx is freed so file struct of /dev/ublkcX is
+	 *     released.
+	 *
+	 * (2) UBLK_S_DEV_RECOVERING is set, which means the monitor_work:
+	 *     (a) has requeued all inflight rqs whose io_flags is ACTIVE
+	 *     (b) has requeued/aborted all inflight rqs whose io_flags is NOT ACTIVE
+	 *     (c) has completed/canceled all ioucmds owned by the dying process
+	 */
+	if (test_bit(UB_STATE_OPEN, &ub->state) ||
+			ub->dev_info.state != UBLK_S_DEV_RECOVERING) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
+		pr_devel("%s: prepare for recovering qid %d\n", __func__, ubq->q_id);
+		ublk_queue_start_recovery(ub, ubq);
+	}
+
+	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
+	ub->mm = NULL;
+	ub->nr_queues_ready = 0;
+	init_completion(&ub->completion);
+	ret = 0;
+ out_unlock:
+	mutex_unlock(&ub->mutex);
+	ublk_put_device(ub);
+	return ret;
+}
+
+static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
+{
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	int ublksrv_pid = (int)header->data[0];
+	struct ublk_device *ub;
+	int ret = -EINVAL;
+
+	ub = ublk_get_device_from_id(header->dev_id);
+	if (!ub)
+		return ret;
+
+	pr_devel("%s: Waiting for new ubq_daemon is ready, dev id %d...\n",
+			__func__, header->dev_id);
+	/* wait until new ubq_daemon sending all FETCH_REQ */
+	wait_for_completion_interruptible(&ub->completion);
+	pr_devel("%s: All new ubq_daemon is ready, dev id %d\n",
+			__func__, header->dev_id);
+
+	mutex_lock(&ub->mutex);
+
+	if (!ublk_can_use_recovery(ub))
+		goto out_unlock;
+
+	/* monitor_work should set UBLK_S_DEV_RECOVERING */
+	if (ub->dev_info.state != UBLK_S_DEV_RECOVERING) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	ub->dev_info.ublksrv_pid = ublksrv_pid;
+	pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
+			__func__, ublksrv_pid, header->dev_id);
+	blk_mq_unquiesce_queue(ub->ub_disk->queue);
+	pr_devel("%s: queue unquiesced, dev id %d.\n",
+			__func__, header->dev_id);
+
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
+	schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);
+	/* We are good to redo requests now */
+	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	ret = 0;
+ out_unlock:
+	mutex_unlock(&ub->mutex);
+	ublk_put_device(ub);
+	return ret;
+}
+
 static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
@@ -2005,6 +2129,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_CMD_SET_PARAMS:
 		ret = ublk_ctrl_set_params(cmd);
 		break;
+	case UBLK_CMD_START_USER_RECOVERY:
+		ret = ublk_ctrl_start_recovery(cmd);
+		break;
+	case UBLK_CMD_END_USER_RECOVERY:
+		ret = ublk_ctrl_end_recovery(cmd);
+		break;
 	default:
 		break;
 	}
--
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support
  2022-08-31 15:51 [RFC PATCH V2 0/6] ublk_drv: add USER_RECOVERY support ZiyangZhang
                   ` (5 preceding siblings ...)
  2022-08-31 15:51 ` [RFC PATCH V2 6/6] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support ZiyangZhang
@ 2022-09-06  1:14 ` Ming Lei
  6 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-09-06 1:14 UTC (permalink / raw)
To: ZiyangZhang
Cc: axboe, xiaoguang.wang, linux-block, linux-kernel, joseph.qi, ming.lei

On Wed, Aug 31, 2022 at 11:51:30PM +0800, ZiyangZhang wrote:
> ublk_drv is a driver that simply passes all blk-mq rqs to ublksrv[1] in
> userspace. For each ublk queue, there is one ubq_daemon(pthread).
> All ubq_daemons share the same process which opens /dev/ublkcX.
> The ubq_daemon code infinitely loops on io_uring_enter() to
> send/receive io_uring cmds which pass information of blk-mq
> rqs.

BTW, given the ublk document is merged, please document the newly added
commands in Documentation/block/ublk.rst in following versions.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread