From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6EC5DC2BA83 for ; Wed, 12 Feb 2020 07:27:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 46197206DB for ; Wed, 12 Feb 2020 07:27:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1581492425; bh=pCoeSZfs2f8IrNoSWugPCwW+l4OU/Emg1V08aXOXa40=; h=From:To:Cc:Subject:Date:In-Reply-To:References:List-ID:From; b=GacLanSGCHrt1rHj8zv3nlTVAfkn6j2CwGdwQ9gJo1xRbN/16ETOInBK1xDZe62Hv 79StxVpaLEyH/bG+WEgyhnuE+6PJsF8z9BOSaUKmE2ZKHFpJgY3dxvUPI44YG6wek/ oHz6a6MwAlVXcjyXJbeyPgsIw/dG07wmhqJVXyDI= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727669AbgBLH1E (ORCPT ); Wed, 12 Feb 2020 02:27:04 -0500 Received: from mail.kernel.org ([198.145.29.99]:59798 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728329AbgBLH1E (ORCPT ); Wed, 12 Feb 2020 02:27:04 -0500 Received: from localhost (unknown [213.57.247.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 1F6C22082F; Wed, 12 Feb 2020 07:27:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1581492423; bh=pCoeSZfs2f8IrNoSWugPCwW+l4OU/Emg1V08aXOXa40=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=alNDbwJn3WpCUCoR2zONWosXN+sorXckm91RE8HeKu9u7Uqi8W4CyCQxa69tUOh+k MTppj5ttaHGmVmwgShYWVyyun72kqOh+vsZZG02hKd1Mz+6dKWPThPBP4uDUfvtwAc 8+PX0nnemL39Ge3c+vTLtE8StKV1tYL03Oq3DA0E= From: Leon Romanovsky To: Doug Ledford , Jason Gunthorpe Cc: Leon Romanovsky , RDMA mailing list , Daniel Jurgens , Erez Shitrit , Jason Gunthorpe , Maor Gottlieb , Michael Guralnik , Moni Shoua , Parav Pandit , Sean Hefty , Valentine Fatiev , Yishai Hadas , Yonatan Cohen , Zhu Yanjun Subject: [PATCH rdma-rc 7/9] RDMA/rxe: Fix soft lockup problem due to using tasklets in softirq Date: Wed, 12 Feb 2020 09:26:33 +0200 Message-Id: <20200212072635.682689-8-leon@kernel.org> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20200212072635.682689-1-leon@kernel.org> References: <20200212072635.682689-1-leon@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org From: Zhu Yanjun When run stress tests with RXE, the following Call Traces often occur " watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0] ... Call Trace: create_object+0x3f/0x3b0 kmem_cache_alloc_node_trace+0x129/0x2d0 __kmalloc_reserve.isra.52+0x2e/0x80 __alloc_skb+0x83/0x270 rxe_init_packet+0x99/0x150 [rdma_rxe] rxe_requester+0x34e/0x11a0 [rdma_rxe] rxe_do_task+0x85/0xf0 [rdma_rxe] tasklet_action_common.isra.21+0xeb/0x100 __do_softirq+0xd0/0x298 irq_exit+0xc5/0xd0 smp_apic_timer_interrupt+0x68/0x120 apic_timer_interrupt+0xf/0x20 ... " The root cause is that tasklet is actually a softirq. In a tasklet handler, another softirq handler is triggered. Usually these softirq handlers run on the same cpu core. So this will cause "soft lockup Bug". Fixes: 8700e3e7c485 ("Soft RoCE driver") CC: Yossef Efraim Signed-off-by: Zhu Yanjun Signed-off-by: Leon Romanovsky --- drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c index 116cafc9afcf..4bc88708b355 100644 --- a/drivers/infiniband/sw/rxe/rxe_comp.c +++ b/drivers/infiniband/sw/rxe/rxe_comp.c @@ -329,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp, qp->comp.psn = pkt->psn; if (qp->req.wait_psn) { qp->req.wait_psn = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_task(&qp->req.task, 0); } } return COMPST_ERROR_RETRY; @@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe) */ if (qp->req.wait_fence) { qp->req.wait_fence = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_task(&qp->req.task, 0); } } @@ -479,7 +479,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp, if (qp->req.need_rd_atomic) { qp->comp.timeout_retry = 0; qp->req.need_rd_atomic = 0; - rxe_run_task(&qp->req.task, 1); + rxe_run_task(&qp->req.task, 0); } } @@ -725,7 +725,7 @@ int rxe_completer(void *arg) RXE_CNT_COMP_RETRY); qp->req.need_retry = 1; qp->comp.started_retry = 1; - rxe_run_task(&qp->req.task, 1); + rxe_run_task(&qp->req.task, 0); } if (pkt) { -- 2.24.1