From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A8F6CC6FA90 for ; Wed, 21 Sep 2022 17:59:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230153AbiIUR7v (ORCPT ); Wed, 21 Sep 2022 13:59:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229490AbiIUR7t (ORCPT ); Wed, 21 Sep 2022 13:59:49 -0400 Received: from mail-qv1-xf2b.google.com (mail-qv1-xf2b.google.com [IPv6:2607:f8b0:4864:20::f2b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A120F25C61 for ; Wed, 21 Sep 2022 10:59:45 -0700 (PDT) Received: by mail-qv1-xf2b.google.com with SMTP id j8so5006385qvt.13 for ; Wed, 21 Sep 2022 10:59:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ziepe.ca; s=google; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date; bh=ojFVpzQQBU65/cGnBwiD8XrHjwc6yoknKk2FoCB1U+E=; b=O/yO+1pTVH+zv9lgZ6ajaiHQaZ3PYprJYilZ8UutSTafDsVkKMpXizw9/imuY0AtOC NYW8Mswf9f4b0Lsrpbgl9u1P+xgRrxnbfGvYOjv1/vgDuYXpDHMJwNLgi684KgMrw7yo fiUPjMEnVf5rG46oAG+S4HyC9ur8chpqMlDO+5sfilqDTNVhMKXKLEnnVcYxULKTK5VA AQBdCNF62VZ9b3HfZxMhuaoRec6029+5PYkmBxHsn+yuxUyzLF4gqEZwlo/RtBNMiUve K3NzQAETMcW9rwyNJQ2V3fj7XLUNgd3jy5oraffH3ZVwzYeSacO+Cpg+jgSb3HQUCa/I poIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date; bh=ojFVpzQQBU65/cGnBwiD8XrHjwc6yoknKk2FoCB1U+E=; b=5+MSyMw3kAadZBRaUNBY/qxh/kB8eamQJu1+oepkQuPn918TuExnIoKMjSwXaGeCqD tXY0JeKT2BouBBSsD3iSWz7MxAZw8IDeMgeira7ooYpE83c9F+CC3HZMIlRSeYoQM6l0 ARy0fUuEbgayKFOtaF1V9WeAb7OzuBlwHdIc+CXrzGCvG31cOJ5Lu9fKnj6K8DFHZYH4 Bo2CSSo1R8WMKM6Onws9QADzpcHFWiMEJRPdBcNmm/O++fw/lJrbAuLip6qwxj8BpWbc r5Zf+CRrPfZNCtftVjh/OTtl1MOl6jje4BZ/OJH/29cwgwRxzkLc7McHxWQpjTvW2+1L hIqA== X-Gm-Message-State: ACrzQf1JfoiajmfLGkz696aIO5sV5Wrsq3k5CblAi+Y+TFFdWXLWFodO 7aoIzINO6C9rxjroWbVL/kzg8g== X-Google-Smtp-Source: AMsMyM5fRcrw2C3mhiySyFtDFy2sCBzxuDmRwENMhWkDVUkzIilCF7ArZc23WJWPhbBRVXc2yPQX5g== X-Received: by 2002:a0c:b3c1:0:b0:4ac:bfe7:af9d with SMTP id b1-20020a0cb3c1000000b004acbfe7af9dmr24500399qvf.38.1663783184673; Wed, 21 Sep 2022 10:59:44 -0700 (PDT) Received: from ziepe.ca (hlfxns017vw-142-162-113-129.dhcp-dynamic.fibreop.ns.bellaliant.net. [142.162.113.129]) by smtp.gmail.com with ESMTPSA id ga8-20020a05622a590800b00304fe5247bfsm2053130qtb.36.2022.09.21.10.59.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 21 Sep 2022 10:59:43 -0700 (PDT) Received: from jgg by wakko with local (Exim 4.95) (envelope-from ) id 1ob40c-0015tZ-Fk; Wed, 21 Sep 2022 14:59:42 -0300 Date: Wed, 21 Sep 2022 14:59:42 -0300 From: Jason Gunthorpe To: longli@microsoft.com Cc: "K. Y. Srinivasan" , Haiyang Zhang , Stephen Hemminger , Wei Liu , Dexuan Cui , "David S. Miller" , Jakub Kicinski , Paolo Abeni , Leon Romanovsky , edumazet@google.com, shiraz.saleem@intel.com, Ajay Sharma , linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org Subject: Re: [Patch v6 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter Message-ID: References: <1663723352-598-1-git-send-email-longli@linuxonhyperv.com> <1663723352-598-13-git-send-email-longli@linuxonhyperv.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1663723352-598-13-git-send-email-longli@linuxonhyperv.com> Precedence: bulk List-ID: X-Mailing-List: linux-hyperv@vger.kernel.org On Tue, Sep 20, 2022 at 06:22:32PM -0700, longli@linuxonhyperv.com wrote: > +int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr, > + struct ib_udata *udata) > +{ > + struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq); > + struct ib_device *ibdev = ibcq->device; > + struct mana_ib_create_cq ucmd = {}; > + struct mana_ib_dev *mdev; > + int err; > + > + mdev = container_of(ibdev, struct mana_ib_dev, ib_dev); Stylistically these container_of's are usually in the definitions section, at least pick a form and stick to it consistently. > + if (udata->inlen < sizeof(ucmd)) > + return -EINVAL; > + > + err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen)); > + if (err) { > + ibdev_dbg(ibdev, > + "Failed to copy from udata for create cq, %d\n", err); > + return -EFAULT; > + } > + > + if (attr->cqe > MAX_SEND_BUFFERS_PER_QUEUE) { > + ibdev_dbg(ibdev, "CQE %d exceeding limit\n", attr->cqe); > + return -EINVAL; > + } > + > + cq->cqe = attr->cqe; > + cq->umem = ib_umem_get(ibdev, ucmd.buf_addr, cq->cqe * COMP_ENTRY_SIZE, > + IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(cq->umem)) { > + err = PTR_ERR(cq->umem); > + ibdev_dbg(ibdev, "Failed to get umem for create cq, err %d\n", > + err); > + return err; > + } > + > + err = mana_ib_gd_create_dma_region(mdev, cq->umem, &cq->gdma_region, > + PAGE_SIZE); > + if (err) { > + ibdev_err(ibdev, > + "Failed to create dma region for create cq, %d\n", > + err); Prints on userspace paths are not allowed, this should be dbg. There are many other cases like this, please fix them all. This driver may have too many dbg prints too, not every failure if should have a print :( > + ibdev_dbg(ibdev, > + "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n", > + err, cq->gdma_region); > + > + /* The CQ ID is not known at this time > + * The ID is generated at create_qp > + */ Wrong comment style, rdma uses the leading empty blank line for some reason /* * The CQ ID is not known at this time. The ID is generated at create_qp. */ > +static void mana_ib_remove(struct auxiliary_device *adev) > +{ > + struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev); > + > + ib_unregister_device(&dev->ib_dev); > + ib_dealloc_device(&dev->ib_dev); > +} > + > +static const struct auxiliary_device_id mana_id_table[] = { > + { > + .name = "mana.rdma", > + }, > + {}, > +}; > + > +MODULE_DEVICE_TABLE(auxiliary, mana_id_table); > + > +static struct auxiliary_driver mana_driver = { > + .name = "rdma", > + .probe = mana_ib_probe, > + .remove = mana_ib_remove, > + .id_table = mana_id_table, > +}; > + > +static int __init mana_ib_init(void) > +{ > + auxiliary_driver_register(&mana_driver); > + > + return 0; > +} > + > +static void __exit mana_ib_cleanup(void) > +{ > + auxiliary_driver_unregister(&mana_driver); > +} > + All this is just module_auxiliary_driver() > + mutex_lock(&pd->vport_mutex); > + > + pd->vport_use_count++; > + if (pd->vport_use_count > 1) { > + ibdev_dbg(&dev->ib_dev, > + "Skip as this PD is already configured vport\n"); > + mutex_unlock(&pd->vport_mutex); This leaves vport_use_count elevated. > + return 0; > +int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata) > +{ > + struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd); > + struct ib_device *ibdev = ibpd->device; > + struct mana_ib_dev *dev; > + > + dev = container_of(ibdev, struct mana_ib_dev, ib_dev); > + return mana_ib_gd_destroy_pd(dev, pd->pd_handle); This is the only place that calls mana_ib_gd_destroy_pd(), don't have a spaghetti of functions calling single other functions like this, just inline it. Also it shouldn't have been non-static, please check everything that everything non-static actually has an out-of-file user. > +void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext) > +{ > + struct mana_ib_ucontext *mana_ucontext = > + container_of(ibcontext, struct mana_ib_ucontext, ibucontext); > + struct ib_device *ibdev = ibcontext->device; > + struct mana_ib_dev *mdev; > + struct gdma_context *gc; > + int ret; > + > + mdev = container_of(ibdev, struct mana_ib_dev, ib_dev); > + gc = mdev->gdma_dev->gdma_context; > + > + ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell); > + if (ret) > + ibdev_err(ibdev, "Failed to destroy doorbell page %d\n", ret); This already printing on error And again, why is this driver split up so strangely? mana_gd_destroy_doorbell_page() is an RDMA function, it is only called by the RDMA code, why is it located under drivers/net/ethernet? I do not want an RDMA driver in drivers/net/ethernet. > +} > + > +int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem, > + mana_handle_t *gdma_region, u64 page_sz) > +{ > + size_t num_pages_total = ib_umem_num_dma_blocks(umem, page_sz); > + struct gdma_dma_region_add_pages_req *add_req = NULL; > + struct gdma_create_dma_region_resp create_resp = {}; > + struct gdma_create_dma_region_req *create_req; > + size_t num_pages_cur, num_pages_to_handle; > + unsigned int create_req_msg_size; > + struct hw_channel_context *hwc; > + struct ib_block_iter biter; > + size_t max_pgs_create_cmd; > + struct gdma_context *gc; > + struct gdma_dev *mdev; > + unsigned int i; > + int err; > + > + mdev = dev->gdma_dev; > + gc = mdev->gdma_context; > + hwc = gc->hwc.driver_data; > + max_pgs_create_cmd = > + (hwc->max_req_msg_size - sizeof(*create_req)) / sizeof(u64); > + > + num_pages_to_handle = > + min_t(size_t, num_pages_total, max_pgs_create_cmd); > + create_req_msg_size = > + struct_size(create_req, page_addr_list, num_pages_to_handle); > + > + create_req = kzalloc(create_req_msg_size, GFP_KERNEL); > + if (!create_req) Is this a multi-order allocation, I can't tell how big max_req_msg_size is? This design seems to repeat the mistakes we made in mlx5, the low levels of the driver already has memory allocated - why not get a pointer to that memory here and directly fill the message buffer instead of all this allocation and memory copying? > + ibdev_dbg(&dev->ib_dev, > + "size_dma_region %lu num_pages_total %lu, " > + "page_sz 0x%llx offset_in_page %u\n", > + umem->length, num_pages_total, page_sz, > + create_req->offset_in_page); > + > + ibdev_dbg(&dev->ib_dev, "num_pages_to_handle %lu, gdma_page_type %u", > + num_pages_to_handle, create_req->gdma_page_type); > + > + __rdma_umem_block_iter_start(&biter, umem, page_sz); > + for (i = 0; i < num_pages_to_handle; ++i) { > + dma_addr_t cur_addr; > + > + __rdma_block_iter_next(&biter); > + cur_addr = rdma_block_iter_dma_address(&biter); > + > + create_req->page_addr_list[i] = cur_addr; > + > + ibdev_dbg(&dev->ib_dev, "page num %u cur_addr 0x%llx\n", i, > + cur_addr); Please get rid of the worthless debugging code, actually test your driver with EBUG enabled and ensure it is usuable. Printing thousands of lines of garbage is not OK. Especially when what should have been a 1 line loop body is expended into a big block just to have a worthless debug statement. > + if (num_pages_cur < num_pages_total) { > + unsigned int add_req_msg_size; > + size_t max_pgs_add_cmd = > + (hwc->max_req_msg_size - sizeof(*add_req)) / > + sizeof(u64); > + > + num_pages_to_handle = > + min_t(size_t, num_pages_total - num_pages_cur, > + max_pgs_add_cmd); > + > + /* Calculate the max num of pages that will be handled */ > + add_req_msg_size = struct_size(add_req, page_addr_list, > + num_pages_to_handle); > + > + add_req = kmalloc(add_req_msg_size, GFP_KERNEL); And allocating every loop iteration seems like overkill, why not just reuse the large buffer that create_req allocated? Usually the way these loops are structured is to fill the array and then check for fullness, trigger an action to drain the array, and reset the indexes back to the start. > +int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id, > + enum gdma_pd_flags flags) > +{ > + struct gdma_dev *mdev = dev->gdma_dev; > + struct gdma_create_pd_resp resp = {}; > + struct gdma_create_pd_req req = {}; > + struct gdma_context *gc; > + int err; > + > + gc = mdev->gdma_context; > + > + mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_PD, sizeof(req), > + sizeof(resp)); > + > + req.flags = flags; > + err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp); > + > + if (err || resp.hdr.status) { > + ibdev_err(&dev->ib_dev, > + "Failed to get pd_id err %d status %u\n", err, > + resp.hdr.status); > + if (!err) > + err = -EPROTO; This pattern is repeated everywhere, you should fix mana_gd_send_request() to return EPROTO. > + return err; > + } > + > + *pd_handle = resp.pd_handle; > + *pd_id = resp.pd_id; > + ibdev_dbg(&dev->ib_dev, "pd_handle 0x%llx pd_id %d\n", *pd_handle, > + *pd_id); > + > + return 0; > +} > + > +int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle) > +{ > + struct gdma_dev *mdev = dev->gdma_dev; > + struct gdma_destory_pd_resp resp = {}; > + struct gdma_destroy_pd_req req = {}; > + struct gdma_context *gc; > + int err; > + > + gc = mdev->gdma_context; Why the local for a variable that is used once? > +int mana_ib_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma) > +{ > + struct mana_ib_ucontext *mana_ucontext = > + container_of(ibcontext, struct mana_ib_ucontext, ibucontext); > + struct ib_device *ibdev = ibcontext->device; > + struct mana_ib_dev *mdev; > + struct gdma_context *gc; > + phys_addr_t pfn; > + pgprot_t prot; > + int ret; > + > + mdev = container_of(ibdev, struct mana_ib_dev, ib_dev); > + gc = mdev->gdma_dev->gdma_context; > + > + if (vma->vm_pgoff != 0) { > + ibdev_err(ibdev, "Unexpected vm_pgoff %lu\n", vma->vm_pgoff); > + return -EINVAL; More user triggerable printing > +int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) > +{ > + struct mana_ib_mr *mr = container_of(ibmr, struct mana_ib_mr, ibmr); > + struct ib_device *ibdev = ibmr->device; > + struct mana_ib_dev *dev; > + int err; > + > + dev = container_of(ibdev, struct mana_ib_dev, ib_dev); > + > + err = mana_ib_gd_destroy_mr(dev, mr->mr_handle); > + if (err) > + return err; mana_ib_gd_destroy_mr() is only ever called here, why is it in main.c? > +static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd, > + struct ib_qp_init_attr *attr, > + struct ib_udata *udata) > +{ > + struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp); > + struct mana_ib_dev *mdev = > + container_of(pd->device, struct mana_ib_dev, ib_dev); > + struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl; > + struct mana_ib_create_qp_rss_resp resp = {}; > + struct mana_ib_create_qp_rss ucmd = {}; > + struct gdma_dev *gd = mdev->gdma_dev; > + mana_handle_t *mana_ind_table; > + struct mana_port_context *mpc; > + struct mana_context *mc; > + struct net_device *ndev; > + struct mana_ib_cq *cq; > + struct mana_ib_wq *wq; > + struct ib_cq *ibcq; > + struct ib_wq *ibwq; > + int i = 0, ret; > + u32 port; > + > + mc = gd->driver_data; > + > + if (udata->inlen < sizeof(ucmd)) > + return -EINVAL; > + > + ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen)); > + if (ret) { > + ibdev_dbg(&mdev->ib_dev, > + "Failed copy from udata for create rss-qp, err %d\n", > + ret); > + return -EFAULT; > + } > + > + if (attr->cap.max_recv_wr > MAX_SEND_BUFFERS_PER_QUEUE) { > + ibdev_dbg(&mdev->ib_dev, > + "Requested max_recv_wr %d exceeding limit.\n", > + attr->cap.max_recv_wr); > + return -EINVAL; > + } > + > + if (attr->cap.max_recv_sge > MAX_RX_WQE_SGL_ENTRIES) { > + ibdev_dbg(&mdev->ib_dev, > + "Requested max_recv_sge %d exceeding limit.\n", > + attr->cap.max_recv_sge); > + return -EINVAL; > + } > + > + if (ucmd.rx_hash_function != MANA_IB_RX_HASH_FUNC_TOEPLITZ) { > + ibdev_dbg(&mdev->ib_dev, > + "RX Hash function is not supported, %d\n", > + ucmd.rx_hash_function); > + return -EINVAL; > + } > + > + /* IB ports start with 1, MANA start with 0 */ > + port = ucmd.port; > + if (port < 1 || port > mc->num_ports) { > + ibdev_dbg(&mdev->ib_dev, "Invalid port %u in creating qp\n", > + port); > + return -EINVAL; > + } > + ndev = mc->ports[port - 1]; > + mpc = netdev_priv(ndev); > + > + ibdev_dbg(&mdev->ib_dev, "rx_hash_function %d port %d\n", > + ucmd.rx_hash_function, port); > + > + mana_ind_table = kzalloc(sizeof(mana_handle_t) * > + (1 << ind_tbl->log_ind_tbl_size), > + GFP_KERNEL); Should be careful about maths overflow on this calculation. > + ibdev_dbg(&mdev->ib_dev, "ucmd sq_buf_addr 0x%llx port %u\n", > + ucmd.sq_buf_addr, ucmd.port); > + > + umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size, > + IB_ACCESS_LOCAL_WRITE); > + if (IS_ERR(umem)) { > + err = PTR_ERR(umem); > + ibdev_dbg(&mdev->ib_dev, > + "Failed to get umem for create qp-raw, err %d\n", > + err); > + goto err_free_vport; > + } > + qp->sq_umem = umem; > + > + err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem, > + &qp->sq_gdma_region, PAGE_SIZE); > + if (err) { All these cases that process a page list have to call ib_umem_find_best_XXX()! It does important validation against hostile user input, including checking the critical iova. You have missed that the user can request that an unaligned umem be created and this creates a starting offset from the first page that must be honored in HW, or there will be weird memory corruption. Since it looks like this HW doesn't support a starting page offset this should call ib_umem_find_best_pgsz() with a SZ_4K and a 0 iova. Also never use PAGE_SIZE to describe a HW limitation. There is alot of repeated code here, it would do well to have a wrapper function consolidating this pattern. Jason