On 9/29/20 5:27 PM, Victor Gladkov wrote: > Commands get stuck while Host NVMe-oF controller is in reconnect state. > NVMe controller enters into reconnect state when it loses connection with the > target. It tries to reconnect every 10 seconds > (default) until successful reconnection or until reconnect time-out is reached. > The default reconnect time out is 10 minutes. > > Applications are expecting commands to complete with success or error within > a certain timeout (30 seconds by default). The NVMe host is enforcing that > timeout while it is connected, never the less, during reconnection, the timeout > is not enforced and commands may get stuck for a long period or even forever. > > To fix this long delay due to the default timeout we introduce new session > parameter "fast_io_fail_tmo". The timeout is measured in seconds from the > controller reconnect, any command beyond that timeout is rejected. The new > parameter value may be passed during 'connect'. > The default value of -1 means no timeout (similar to current behavior). > > We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective > delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag. > > When the controller is entering the CONNECTING state, we schedule the > delayed_work based on failfast timeout value. If the transition is out of > CONNECTING, terminate delayed work item and ensure failfast_expired is > false. If delayed work item expires then set "NVME_CTRL_FAILFAST_EXPIRED" > flag to true. > > We also update nvmf_fail_nonready_command() and > nvme_available_path() functions with check the > "NVME_CTRL_FAILFAST_EXPIRED" controller flag. > > For multipath (function nvme_available_path()): > The path will not be considered available if > "NVME_CTRL_FAILFAST_EXPIRED" controller flag is set > and the controller in NVME_CTRL_CONNECTING state. > This prevents commands from getting stuck > when available paths have tried to reconnect for too long. > > Signed-off-by: Victor Gladkov > Signed-off-by: Chaitanya Kulkarni > Reviewed-by: Hannes Reinecke > > --- > Changes from V8: > > 1. Added multipath behavior description as requested by Sagi Grimberg > (Fri Sep 18 16:38:58 EDT 2020) > > Changes from V7: > > 1. Expanded the patch description as requested by James Smart > (Thu Aug 13 11:00:25 EDT 2020) > > Changes from V6: > > 1. Changed according to Hannes Reinecke review: > in nvme_start_failfast_work() and nvme_stop_failfast_work() procedures. > > Changes from V5: > > 1. Drop the "off" string option for fast_io_fail_tmo. > > Changes from V4: > > 1. Remove subsysnqn from dev_info just keep "failfast expired" > in nvme_failfast_work(). > 2. Remove excess lock in nvme_failfast_work(). > 3. Change "timeout disabled" value to -1, '0' is to fail I/O right away now. > 4. Add "off" string for fast_io_fail_tmo option as a -1 equivalent. > > Changes from V3: > > 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() amd > nvme_stop_failfast_work() when accessing ctrl->opts as it will fail > for PCIe transport when is called nvme_change_ctrl_state() > from nvme_reset_work(), since we don't set the ctrl->opts for > PCIe transport. > 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and > for macro NVMF_ALLOWED_OPTS definition. > 3. Just like all the state change code add a switch for newly > added state handling outside of the state machine in > nvme_state_change(). > 4. nvme_available_path() add /* fallthru */ after if..break inside > switch which is under list_for_each_entry_rcu(). > 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp. > 6. Fix the tabs before if in nvme_available_path() and line wrap > for the same. > 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING > instead of == to get rid of the parentheses and avoid char > 80. > 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp. > 9. Change the commit log style to match the one we have in the NVMe > repo. > > Changes from V2: > > 1. Several coding style and small fixes. > 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO. > 3. Don't call functionality from the state machine. > 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI > semantics. > > Changes from V1: > > 1. Add a new session parameter called "fast_fail_tmo". The timeout > is measured in seconds from the controller reconnect, any command > beyond that timeout is rejected. The new parameter value may be > passed during 'connect', and its default value is 30 seconds. > A value of 0 means no timeout (similar to current behavior). > 2. Add a controller flag of "failfast_expired". > 3. Add dedicated delayed_work that update the "failfast_expired" > controller flag. > 4. When entering CONNECTING, schedule the delayed_work based on > failfast timeout value. If transition out of CONNECTING, terminate > delayed work item and ensure failfast_expired is false. > If delayed work item expires: set "failfast_expired" flag to true. > 5. Update nvmf_fail_nonready_command() with check the > "failfast_expired" controller flag. > > --- > drivers/nvme/host/core.c | 49 > ++++++++++++++++++++++++++++++++++++++++++- > drivers/nvme/host/fabrics.c | 25 +++++++++++++++++++--- > drivers/nvme/host/fabrics.h | 5 +++++ > drivers/nvme/host/multipath.c | 5 ++++- > drivers/nvme/host/nvme.h | 3 +++ > 5 files changed, 82 insertions(+), 5 deletions(-) > I did some more experiments with this, and found that there are some issues with ANA handling. If reconnect works, but the ANA state indicates that we still can't sent I/O (eg by still being in transitioning), we hit the 'requeueing I/O' state despite fast_io_fail_tmo being set. Not sure if that's the expected outcome. For that it might be better to move the FAILFAST_EXPIRED bit into the namespace, as then we could selectively clear the bit in nvme_failfast_work(): @@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset); static void nvme_failfast_work(struct work_struct *work) { struct nvme_ctrl *ctrl = container_of(to_delayed_work(work), struct nvme_ctrl, failfast_work); + struct nvme_ns *ns; - if (ctrl->state != NVME_CTRL_CONNECTING) - return; - - - set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags); + down_read(&ctrl->namespaces_rwsem); + list_for_each_entry(ns, &ctrl->namespaces, list) { + if (ctrl->state != NVME_CTRL_LIVE || + (ns->ana_state != NVME_ANA_OPTIMIZED && + ns->ana_state != NVME_ANA_NONOPTIMIZED)) + set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags); + } + up_read(&ctrl->namespaces_rwsem); dev_info(ctrl->device, "failfast expired\n"); ...and we could leave the failfast worker running even after the controller transitioned to LIVE. Cf the attached patch for details. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer