From: Haris Iqbal
Date: Wed, 17 Nov 2021 12:21:14 +0100
Subject: Re: Task hung while using RTRS with rxe and using "ip" utility to down the interface
To: Leon Romanovsky
Cc: RDMA mailing list, Jason Gunthorpe, Jinpu Wang, Danil Kipnis, Aleksei Marov
X-Mailing-List: linux-rdma@vger.kernel.org

On Sun, Nov 14, 2021 at 7:12 AM Leon Romanovsky wrote:
>
> On Thu, Nov 11, 2021 at 05:25:11PM +0100, Haris Iqbal wrote:
> > Hi,
> >
> > We are experiencing a hang while using RTRS with softROCE when
> > transitioning the network interface down (with "ifconfig <iface> down"
> > or "ip link set <iface> down").
> >
> > Steps:
> > 1) Map an RNBD/RTRS device through a softROCE port.
> > 2) Once mapped, transition the eth interface (on which the softROCE
> > interface was created) to down, using "ifconfig <iface> down" or
> > "ip link set <iface> down".
> > 3) The device errors out, and one can see RTRS connection errors in
> > dmesg. So far so good.
> > 4) After a while, we see task hung traces in dmesg:
> >
> > [ 550.866462] INFO: task kworker/1:2:170 blocked for more than 184 seconds.
> > [ 550.868820] Tainted: G O 5.10.42-pserver+ #84
> > [ 550.869337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 550.869963] task:kworker/1:2 state:D stack: 0 pid: 170 ppid: 2 flags:0x00004000
> > [ 550.870619] Workqueue: rtrs_server_wq rtrs_srv_close_work [rtrs_server]
> > [ 550.871134] Call Trace:
> > [ 550.871375]  __schedule+0x421/0x810
> > [ 550.871683]  schedule+0x46/0xb0
> > [ 550.871964]  schedule_timeout+0x20e/0x2a0
> > [ 550.872300]  ? internal_add_timer+0x44/0x70
> > [ 550.872650]  wait_for_completion+0x86/0xe0
> > [ 550.872994]  cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [ 550.873357]  ? _cond_resched+0x15/0x30
> > [ 550.873680]  ? wait_for_completion+0x33/0xe0
> > [ 550.874036]  _destroy_id+0x57/0x210 [rdma_cm]
> > [ 550.874395]  rtrs_srv_close_work+0xcc/0x250 [rtrs_server]
> > [ 550.874819]  process_one_work+0x1d4/0x370
> > [ 550.875156]  worker_thread+0x4a/0x3b0
> > [ 550.875471]  ? process_one_work+0x370/0x370
> > [ 550.875817]  kthread+0xfe/0x140
> > [ 550.876098]  ? kthread_park+0x90/0x90
> > [ 550.876453]  ret_from_fork+0x1f/0x30
> >
> > Our observations so far:
> >
> > 1) The hang does not occur if we use "ifdown <iface>" instead. There is
> > a difference between the commands (see
> > https://access.redhat.com/solutions/27166), but we are not sure the
> > above ones should lead to a task hang.
> > 2) We have verified the v5.10 and v5.15.1 kernels, and both have this issue.
> > 3) We tried the same test with an NVMeoF target and host over softROCE.
> > We get the same task hang after doing "ifconfig <iface> down":
> >
> > [Tue Nov 9 14:28:51 2021] INFO: task kworker/1:1:34 blocked for more than 184 seconds.
> > [Tue Nov 9 14:28:51 2021] Tainted: G O 5.10.42-pserver+ #84
> > [Tue Nov 9 14:28:51 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Nov 9 14:28:51 2021] task:kworker/1:1 state:D stack: 0 pid: 34 ppid: 2 flags:0x00004000
> > [Tue Nov 9 14:28:51 2021] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021] Call Trace:
> > [Tue Nov 9 14:28:51 2021]  __schedule+0x421/0x810
> > [Tue Nov 9 14:28:51 2021]  schedule+0x46/0xb0
> > [Tue Nov 9 14:28:51 2021]  schedule_timeout+0x20e/0x2a0
> > [Tue Nov 9 14:28:51 2021]  ? internal_add_timer+0x44/0x70
> > [Tue Nov 9 14:28:51 2021]  wait_for_completion+0x86/0xe0
> > [Tue Nov 9 14:28:51 2021]  cm_destroy_id+0x18c/0x5a0 [ib_cm]
> > [Tue Nov 9 14:28:51 2021]  ? _cond_resched+0x15/0x30
> > [Tue Nov 9 14:28:51 2021]  ? wait_for_completion+0x33/0xe0
> > [Tue Nov 9 14:28:51 2021]  _destroy_id+0x57/0x210 [rdma_cm]
> > [Tue Nov 9 14:28:51 2021]  nvmet_rdma_free_queue+0x2e/0xc0 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021]  nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma]
> > [Tue Nov 9 14:28:51 2021]  process_one_work+0x1d4/0x370
> > [Tue Nov 9 14:28:51 2021]  worker_thread+0x4a/0x3b0
> > [Tue Nov 9 14:28:51 2021]  ? process_one_work+0x370/0x370
> > [Tue Nov 9 14:28:51 2021]  kthread+0xfe/0x140
> > [Tue Nov 9 14:28:51 2021]  ? kthread_park+0x90/0x90
> > [Tue Nov 9 14:28:51 2021]  ret_from_fork+0x1f/0x30
> >
> > Is this a known issue with ifconfig or the rxe driver? Thoughts?
>
> It doesn't look related to RXE and, judging by the call trace, the
> issue is one of two: a missing call to cm_deref_id() or an extra call
> to refcount_inc(&cm_id_priv->refcount). Both can be related to the
> callers of the rdma-cm interface, for example by not releasing the
> cm_id properly.

Hi,

@Aleksei did some more investigation and found the cause of the hang.

1. We enabled the rdma_cma and ib_cma trace events.

2. When we use ifdown (the case that does not hang), the following
trace can be observed:

 kworker/1:2-100  [001] ....    98.191532: cm_disconnect: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
 kworker/1:2-100  [001] dN..    98.208935: icm_send_drep_err: local_id=3791663578 remote_id=2340456478 state=TIMEWAIT lap_state=LAP_UNINIT
 kworker/1:2-100  [001] .N..    98.213924: cm_disconnect: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
 kworker/1:2-100  [001] dN..    98.218595: icm_send_drep_err: local_id=3808440794 remote_id=2323679262 state=TIMEWAIT lap_state=LAP_UNINIT
 kworker/1:2-100  [001] .N..    98.219583: cm_disconnect: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
 kworker/1:2-100  [001] dN..    98.228200: icm_send_drep_err: local_id=3758109146 remote_id=2306902046 state=TIMEWAIT lap_state=LAP_UNINIT
 kworker/1:2-100  [001] ....    98.322883: cm_qp_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=17
 kworker/1:2-100  [001] ....    98.322913: cm_id_destroy: cm.id=2 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
 kworker/1:2-100  [001] ....    98.332065: cm_qp_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=18
 kworker/1:2-100  [001] ....    98.332115: cm_id_destroy: cm.id=3 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0
 kworker/1:2-100  [001] ....    98.340362: cm_qp_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0 qp_num=19
 kworker/1:2-100  [001] ....    98.340400: cm_id_destroy: cm.id=4 src=10.154.0.1:1234 dst=10.154.0.3:53749 tos=0

Explanation: In this example we have three cm.ids (2, 3, 4). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which in
turn calls cm_send_dreq_locked(), where we hit

	ret = cm_alloc_msg(cm_id_priv, &msg);
	if (ret) {
		cm_enter_timewait(cm_id_priv);
		return ret;
	}

cm_alloc_msg() returns an error (propagated up from rdma_create_ah())
before it reaches refcount_inc(&cm_id_priv->refcount), so after
rdma_disconnect() the refcount of each cm.id is still 1. Hence, when
cm_destroy_id() later calls cm_deref_id() once more, the refcount drops
to 0 and we do NOT hang on wait_for_completion(&cm_id_priv->comp).
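To make it clearer why the refcount value at destroy time decides
whether we hang, here is the teardown path we keep ending up in,
paraphrased from drivers/infiniband/core/cm.c (v5.10); this is a
trimmed sketch, not the exact source:

/*
 * Simplified sketch of the ib_cm teardown path
 * (drivers/infiniband/core/cm.c, v5.10); state handling and most
 * cleanup details trimmed, comments are ours.
 */
static void cm_deref_id(struct cm_id_private *cm_id_priv)
{
	/* complete() fires only when the last reference goes away */
	if (refcount_dec_and_test(&cm_id_priv->refcount))
		complete(&cm_id_priv->comp);
}

static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
{
	struct cm_id_private *cm_id_priv =
		container_of(cm_id, struct cm_id_private, id);

	/* ... state machine handling, DREQ/REJ sending, etc. ... */

	cm_deref_id(cm_id_priv);
	/*
	 * Blocks until every outstanding reference is dropped, including
	 * the one taken by cm_alloc_msg() for an in-flight DREQ. If that
	 * DREQ neither completes nor times out, this is where we wait
	 * forever.
	 */
	wait_for_completion(&cm_id_priv->comp);

	/* ... free private data and the cm_id_private itself ... */
}

So any path that leaves the cm.id refcount above 1 when cm_destroy_id()
runs will sit in wait_for_completion() until the remaining reference is
released, normally by the send completion handler or the MAD timeout.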
3. In the case of "ifconfig <iface> down" (the case that hangs), the
trace is the following:

 kworker/1:0-19   [001] ....   289.211230: cm_disconnect: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] d...   289.222918: icm_send_dreq: local_id=3636223036 remote_id=2994969923 state=ESTABLISHED lap_state=LAP_UNINIT
 kworker/1:0-19   [001] .N..   289.224789: cm_sent_dreq: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/u5:0-63  [001] d...   289.224820: icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR
 kworker/u5:0-63  [001] ....   289.224821: cm_event_handler: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED (10/-110)
 kworker/u5:0-63  [001] ....   289.224822: cm_event_done: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 DISCONNECTED consumer returns 0
 kworker/1:0-19   [001] .N..   289.226797: cm_disconnect: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] dN..   289.238556: icm_send_dreq: local_id=3619445820 remote_id=3045301571 state=ESTABLISHED lap_state=LAP_UNINIT
 kworker/1:0-19   [001] .N..   289.240126: cm_sent_dreq: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] .N..   289.241162: cm_disconnect: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] dN..   289.251148: icm_send_dreq: local_id=3602668604 remote_id=3028524355 state=ESTABLISHED lap_state=LAP_UNINIT
 kworker/1:0-19   [001] .N..   289.252624: cm_sent_dreq: cm.id=7 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] ....   289.340055: cm_qp_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=19
 kworker/1:0-19   [001] ....   289.340159: cm_id_destroy: cm.id=5 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0
 kworker/1:0-19   [001] ....   289.367393: cm_qp_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0 qp_num=20
 kworker/1:0-19   [001] ....   289.367445: cm_id_destroy: cm.id=6 src=10.154.0.1:1234 dst=10.154.0.3:60177 tos=0

Explanation: In this example we have three cm.ids (5, 6, 7). For each of
them we call rdma_disconnect(), which calls ib_send_cm_dreq(), which
calls cm_send_dreq_locked(), which calls cm_alloc_msg(). This time
cm_alloc_msg() does not fail; it increases the refcount of each cm.id
to 2, and the DREQ is sent.

Later, the send handler is invoked only for cm.id=5, with LOC_PROT_ERR:

 icm_mad_send_err: state=DREQ_SENT completion status=LOC_PROT_ERR

which triggers cm_process_send_error() -> cm_free_msg() -> cm_deref_id()
and drops the refcount of cm.id=5 back to 1. For the other cm.ids (6, 7)
the refcount stays at 2, because their DREQs were sent but neither
completed nor timed out; nothing further happens with them.

So when cm_destroy_id() runs for cm.id=5 (refcount=1), the final
cm_deref_id() brings the refcount to 0 and we do not hang on
wait_for_completion(&cm_id_priv->comp). For cm.id=6 and cm.id=7,
cm_deref_id() only brings the refcount from 2 to 1, and we wait in
wait_for_completion() forever, because the references taken for those
sent DREQs are never released.

4. The reason why cm_alloc_msg() returns an error without taking the
reference in one case (ifdown) but takes the reference and returns 0 in
the other (ifconfig down) is what happens inside cm_alloc_msg(), namely
the state of the GID entry in the ib_gid_table.

Explanation: cm_alloc_msg() ends up in the following call chain:

 cm_alloc_msg -> rdma_create_ah -> rdma_lag_get_ah_roce_slave ->
 rdma_read_gid_attr_ndev_rcu -> is_gid_entry_valid

and is_gid_entry_valid() returns

	entry && entry->state == GID_TABLE_ENTRY_VALID;

In the ifdown case the entry is either NULL or already in the
GID_TABLE_ENTRY_PENDING_DEL state, so the check returns false and
cm_alloc_msg() fails without increasing the refcount. In the
"ifconfig down" case the entry is still in the GID_TABLE_ENTRY_VALID
state, so the check returns true and cm_alloc_msg() takes the extra
reference on the cm.id. (A short paraphrase of the relevant check is
included at the end of this mail.)

We think the problem boils down to one of the following questions:

1. Why was cm_process_send_error() triggered for only 1 of the 3 sent
DREQs? Why is the send handler not called for the others?

2. When RTRS receives an error and starts closing the connection,
during which rdma_disconnect() is called, shouldn't cm_alloc_msg() fail
in that case as well?

The same behavior is observed for both RTRS and NVMeoF targets.
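For reference, the check we end up hitting looks roughly like the
following, paraphrased from drivers/infiniband/core/cache.c (v5.10);
a simplified sketch, not the exact source, and the comments are ours:

/* Paraphrased from drivers/infiniband/core/cache.c (v5.10), trimmed. */
enum gid_table_entry_state {
	GID_TABLE_ENTRY_INVALID		= 1,
	GID_TABLE_ENTRY_VALID		= 2,
	/*
	 * The entry is being removed, but users may still hold
	 * references to it, so it cannot be freed yet.
	 */
	GID_TABLE_ENTRY_PENDING_DEL	= 3,
};

static bool is_gid_entry_valid(const struct ib_gid_table_entry *entry)
{
	return entry && entry->state == GID_TABLE_ENTRY_VALID;
}

With ifdown the GID entry has already moved to
GID_TABLE_ENTRY_PENDING_DEL (or is gone) by the time the DREQ is built,
while after "ifconfig down" it is still GID_TABLE_ENTRY_VALID, which is
why the extra cm.id reference is taken only in the second case.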