From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH for-next 3/9] RDMA/hns: Completely release qp resources when hw err
To: Doug Ledford, Lijun Ou
References: <1565343666-73193-1-git-send-email-oulijun@huawei.com> <1565343666-73193-4-git-send-email-oulijun@huawei.com>
From: Yangyang Li
Message-ID: <0d325f78-a929-f088-cc29-e2c7af98fd40@huawei.com>
Date: Wed, 14 Aug 2019 14:02:49 +0800
X-Mailing-List: linux-rdma@vger.kernel.org

Hi, Doug

Thanks a lot for your reply.

On 2019/8/12 23:29, Doug Ledford wrote:
> On Fri, 2019-08-09 at 17:41 +0800, Lijun Ou wrote:
>> From: Yangyang Li
>>
>> Even if no response from hardware, make sure that qp related
>> resources are completely released.
>>
>> Signed-off-by: Yangyang Li
>> ---
>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 12 ++++--------
>>  1 file changed, 4 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> index 7a14f0b..0409851 100644
>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>> @@ -4562,16 +4562,14 @@ static int
>> hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev,
>>  {
>>  	struct hns_roce_cq *send_cq, *recv_cq;
>>  	struct ib_device *ibdev = &hr_dev->ib_dev;
>> -	int ret;
>> +	int ret = 0;
>>
>>  	if (hr_qp->ibqp.qp_type == IB_QPT_RC && hr_qp->state !=
>> IB_QPS_RESET) {
>>  		/* Modify qp to reset before destroying qp */
>>  		ret = hns_roce_v2_modify_qp(&hr_qp->ibqp, NULL, 0,
>>  					    hr_qp->state, IB_QPS_RESET);
>> -		if (ret) {
>> +		if (ret)
>>  			ibdev_err(ibdev, "modify QP to Reset
>> failed.\n");
>> -			return ret;
>> -		}
>>  	}
>>
>>  	send_cq = to_hr_cq(hr_qp->ibqp.send_cq);
>> @@ -4627,7 +4625,7 @@ static int hns_roce_v2_destroy_qp_common(struct
>> hns_roce_dev *hr_dev,
>>  		kfree(hr_qp->rq_inl_buf.wqe_list);
>>  	}
>>
>> -	return 0;
>> +	return ret;
>>  }
>>
>>  static int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata
>> *udata)
>> @@ -4637,11 +4635,9 @@ static int hns_roce_v2_destroy_qp(struct ib_qp
>> *ibqp, struct ib_udata *udata)
>>  	int ret;
>>
>>  	ret = hns_roce_v2_destroy_qp_common(hr_dev, hr_qp, udata);
>> -	if (ret) {
>> +	if (ret)
>>  		ibdev_err(&hr_dev->ib_dev, "Destroy qp 0x%06lx
>> failed(%d)\n",
>>  			  hr_qp->qpn, ret);
>> -		return ret;
>> -	}
>>
>>  	if (hr_qp->ibqp.qp_type == IB_QPT_GSI)
>>  		kfree(hr_to_hr_sqp(hr_qp));
>
> I don't know your hardware, but this patch sounds wrong/dangerous to me.
> As long as the resources this card might access are allocated by the
> kernel, you can't get random data corruption by the card writing to
> memory used elsewhere in the kernel. So if your card is not responding
> to your requests to free the resources, it would seem safer to leak
> those resources permanently than to free them and risk the card coming
> back to life long enough to corrupt memory reallocated to some other
> task.
>
> Only if you can guarantee me that there is no way your commands to the
> card will fail and then the card start working again later would I
> consider this patch safe. And if it's possible for the card to hang
> like this, should that be triggering a reset of the device?
>

Thanks for your suggestion. I agree with you: it is safer to leak those
resources permanently than to free them while the hardware may still
access them. I will abandon this change and consider cleaning up the
leaked resources during uninstallation or reset.

Thanks
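
For illustration only, here is a minimal userspace sketch of the "leak now,
reclaim at reset/uninstall" idea discussed above. It is not hns driver code:
every name in it (demo_dev, demo_qp, demo_modify_qp_to_reset, and so on) is
hypothetical, the hardware command is simulated by a flag, and malloc/free
stand in for the kernel allocators. The point is only the control flow: when
the reset command fails, the QP memory is parked on a per-device list instead
of being freed, and it is released later, once the device can no longer
write to it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a QP whose buffer the device may DMA into. */
struct demo_qp {
	void *wqe_buf;               /* memory the hardware may still write to */
	struct demo_qp *next_leaked; /* link in the device's leaked list */
};

/* Hypothetical per-device state. */
struct demo_dev {
	bool hw_responds;       /* simulates whether commands to the card succeed */
	struct demo_qp *leaked; /* QPs parked until reset or uninstall */
	size_t leaked_count;
};

/* Simulated "modify QP to reset" command: 0 on success, -1 on failure. */
static int demo_modify_qp_to_reset(const struct demo_dev *dev)
{
	return dev->hw_responds ? 0 : -1;
}

/* Allocate a QP and its buffer; returns NULL on allocation failure. */
static struct demo_qp *demo_create_qp(size_t buf_len)
{
	struct demo_qp *qp = calloc(1, sizeof(*qp));

	if (!qp)
		return NULL;
	qp->wqe_buf = malloc(buf_len);
	if (!qp->wqe_buf) {
		free(qp);
		return NULL;
	}
	return qp;
}

/*
 * Destroy a QP. If the hardware did not acknowledge the reset, it may
 * still own the buffer, so park the QP instead of freeing it.
 */
static int demo_destroy_qp(struct demo_dev *dev, struct demo_qp *qp)
{
	int ret = demo_modify_qp_to_reset(dev);

	if (ret) {
		qp->next_leaked = dev->leaked;
		dev->leaked = qp;
		dev->leaked_count++;
		return ret;
	}

	free(qp->wqe_buf);
	free(qp);
	return 0;
}

/*
 * To be called only after a device reset or at uninstall, when the
 * hardware can no longer touch the parked memory. Returns the number
 * of QPs reclaimed.
 */
static size_t demo_reclaim_leaked(struct demo_dev *dev)
{
	size_t n = 0;

	while (dev->leaked) {
		struct demo_qp *qp = dev->leaked;

		dev->leaked = qp->next_leaked;
		free(qp->wqe_buf);
		free(qp);
		n++;
	}
	dev->leaked_count = 0;
	return n;
}
```

In this sketch the caller still gets the error from demo_destroy_qp(), so the
failure stays visible, and whether a failed command should also trigger a
device reset (as asked above) is left open.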