Date: Tue, 28 Sep 2021 14:36:21 -0300
From: Jason Gunthorpe
To: Oded Gabbay
Cc: linux-kernel@vger.kernel.org, gregkh@linuxfoundation.org,
    christian.koenig@amd.com, daniel.vetter@ffwll.ch, galpress@amazon.com,
    sleybo@amazon.com, dri-devel@lists.freedesktop.org,
    linux-rdma@vger.kernel.org, linux-media@vger.kernel.org,
    dledford@redhat.com, airlied@gmail.com, alexander.deucher@amd.com,
    leonro@nvidia.com, hch@lst.de, amd-gfx@lists.freedesktop.org,
    linaro-mm-sig@lists.linaro.org, Tomer Tayar
Subject: Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter
Message-ID: <20210928173621.GG3544071@ziepe.ca>
References: <20210912165309.98695-1-ogabbay@kernel.org>
 <20210912165309.98695-3-ogabbay@kernel.org>
In-Reply-To: <20210912165309.98695-3-ogabbay@kernel.org>

On Sun, Sep 12, 2021 at 07:53:09PM +0300, Oded Gabbay wrote:
> From: Tomer Tayar
> 
> Implement the calls to the dma-buf kernel api to create a dma-buf
> object backed by FD.
> 
> We block the option to mmap the DMA-BUF object because we don't support
> DIRECT_IO and implicit P2P.

This statement doesn't make sense, you can mmap your dmabuf if you
like. All dmabuf mmaps are supposed to set the special bit/etc to
exclude them from get_user_pages() anyhow - and since this is BAR
memory, not struct page memory, this driver would be doing that anyhow.

> We check the p2p distance using pci_p2pdma_distance_many() and refusing
> to map dmabuf in case the distance doesn't allow p2p.

Does this actually allow the p2p transfer for your intended use cases?

> diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
> index 33986933aa9e..8cf5437c0390 100644
> +++ b/drivers/misc/habanalabs/common/memory.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
>  /*
> - * Copyright 2016-2019 HabanaLabs, Ltd.
> + * Copyright 2016-2021 HabanaLabs, Ltd.
>   * All Rights Reserved.
>   */
>  
> @@ -11,11 +11,13 @@
>  
>  #include <linux/uaccess.h>
>  #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
>  
>  #define HL_MMU_DEBUG	0
>  
>  /* use small pages for supporting non-pow2 (32M/40M/48M) DRAM phys page sizes */
> -#define DRAM_POOL_PAGE_SIZE	SZ_8M
> +#define DRAM_POOL_PAGE_SIZE	SZ_8M
> +

??

>  /*
>   * The va ranges in context object contain a list with the available chunks of
> @@ -347,6 +349,13 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
>                  return -EINVAL;
>          }
>  
> +        if (phys_pg_pack->exporting_cnt) {
> +                dev_err(hdev->dev,
> +                        "handle %u is exported, cannot free\n", handle);
> +                spin_unlock(&vm->idr_lock);

Don't write to the kernel log from user space triggered actions

> +static int alloc_sgt_from_device_pages(struct hl_device *hdev,
> +                        struct sg_table **sgt, u64 *pages,
> +                        u64 npages, u64 page_size,
> +                        struct device *dev,
> +                        enum dma_data_direction dir)

Why doesn't this return a sg_table * and an ERR_PTR?
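i.e. something like this (untested sketch, and
fill_sgt_from_device_pages() is just a made-up name standing in for the
body this function already has):

static struct sg_table *alloc_sgt_from_device_pages(struct hl_device *hdev,
                        u64 *pages, u64 npages, u64 page_size,
                        struct device *dev, enum dma_data_direction dir)
{
        struct sg_table *sgt;
        int rc;

        sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
        if (!sgt)
                return ERR_PTR(-ENOMEM);

        /* the chunk counting and sg filling that is open coded today */
        rc = fill_sgt_from_device_pages(hdev, sgt, pages, npages, page_size,
                                        dev, dir);
        if (rc) {
                kfree(sgt);
                return ERR_PTR(rc);
        }

        return sgt;
}

Then the caller just checks IS_ERR() and can pass the ERR_PTR straight
out of hl_map_dmabuf().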
> +{
> +        u64 chunk_size, bar_address, dma_max_seg_size;
> +        struct asic_fixed_properties *prop;
> +        int rc, i, j, nents, cur_page;
> +        struct scatterlist *sg;
> +
> +        prop = &hdev->asic_prop;
> +
> +        dma_max_seg_size = dma_get_max_seg_size(dev);
> +
> +        /* We would like to align the max segment size to PAGE_SIZE, so the
> +         * SGL will contain aligned addresses that can be easily mapped to
> +         * an MMU
> +         */
> +        dma_max_seg_size = ALIGN_DOWN(dma_max_seg_size, PAGE_SIZE);
> +        if (dma_max_seg_size < PAGE_SIZE) {
> +                dev_err_ratelimited(hdev->dev,
> +                        "dma_max_seg_size %llu can't be smaller than PAGE_SIZE\n",
> +                        dma_max_seg_size);
> +                return -EINVAL;
> +        }
> +
> +        *sgt = kzalloc(sizeof(**sgt), GFP_KERNEL);
> +        if (!*sgt)
> +                return -ENOMEM;
> +
> +        /* If the size of each page is larger than the dma max segment size,
> +         * then we can't combine pages and the number of entries in the SGL
> +         * will just be the
> +         *  * 
> +         */
> +        if (page_size > dma_max_seg_size)
> +                nents = npages * DIV_ROUND_UP_ULL(page_size, dma_max_seg_size);
> +        else
> +                /* Get number of non-contiguous chunks */
> +                for (i = 1, nents = 1, chunk_size = page_size ; i < npages ; i++) {
> +                        if (pages[i - 1] + page_size != pages[i] ||
> +                                        chunk_size + page_size > dma_max_seg_size) {
> +                                nents++;
> +                                chunk_size = page_size;
> +                                continue;
> +                        }
> +
> +                        chunk_size += page_size;
> +                }
> +
> +        rc = sg_alloc_table(*sgt, nents, GFP_KERNEL | __GFP_ZERO);
> +        if (rc)
> +                goto error_free;
> +
> +        /* Because we are not going to include a CPU list we want to have some
> +         * chance that other users will detect this by setting the orig_nents
> +         * to 0 and using only nents (length of DMA list) when going over the
> +         * sgl
> +         */
> +        (*sgt)->orig_nents = 0;

Maybe do this at the end so you don't have to undo it on the error
path?

> +        cur_page = 0;
> +
> +        if (page_size > dma_max_seg_size) {
> +                u64 size_left, cur_device_address = 0;
> +
> +                size_left = page_size;
> +
> +                /* Need to split each page into the number of chunks of
> +                 * dma_max_seg_size
> +                 */
> +                for_each_sgtable_dma_sg((*sgt), sg, i) {
> +                        if (size_left == page_size)
> +                                cur_device_address =
> +                                        pages[cur_page] - prop->dram_base_address;
> +                        else
> +                                cur_device_address += dma_max_seg_size;
> +
> +                        chunk_size = min(size_left, dma_max_seg_size);
> +
> +                        bar_address = hdev->dram_pci_bar_start + cur_device_address;
> +
> +                        rc = set_dma_sg(sg, bar_address, chunk_size, dev, dir);
> +                        if (rc)
> +                                goto error_unmap;
> +
> +                        if (size_left > dma_max_seg_size) {
> +                                size_left -= dma_max_seg_size;
> +                        } else {
> +                                cur_page++;
> +                                size_left = page_size;
> +                        }
> +                }
> +        } else {
> +                /* Merge pages and put them into the scatterlist */
> +                for_each_sgtable_dma_sg((*sgt), sg, i) {
> +                        chunk_size = page_size;
> +                        for (j = cur_page + 1 ; j < npages ; j++) {
> +                                if (pages[j - 1] + page_size != pages[j] ||
> +                                                chunk_size + page_size > dma_max_seg_size)
> +                                        break;
> +
> +                                chunk_size += page_size;
> +                        }
> +
> +                        bar_address = hdev->dram_pci_bar_start +
> +                                        (pages[cur_page] - prop->dram_base_address);
> +
> +                        rc = set_dma_sg(sg, bar_address, chunk_size, dev, dir);
> +                        if (rc)
> +                                goto error_unmap;
> +
> +                        cur_page = j;
> +                }
> +        }

We have this sg_append stuff now that is intended to help building
these things. It can only build CPU page lists, not these DMA lists,
but I do wonder if open coding in drivers is slipping back a bit.
Especially since AMD seems to be doing something different.

Could the DMABUF layer gain some helpers styled after the sg_append to
simplify building these things? And convert the AMD driver of course.
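Strawman, with invented names (nothing like this exists in the dmabuf
core today), just to show the shape I mean - a small append-state that
owns the max_seg_size splitting and the merging of contiguous device
addresses, so drivers only feed it (dma_addr, len) pairs:

struct dma_buf_sgt_builder {
        struct sg_table *sgt;
        struct scatterlist *cur;        /* entry currently being filled */
        size_t max_seg;                 /* from dma_get_max_seg_size() */
};

int dma_buf_sgt_builder_init(struct dma_buf_sgt_builder *b,
                             struct device *dev, unsigned int nents_hint);
int dma_buf_sgt_append_resource(struct dma_buf_sgt_builder *b,
                                dma_addr_t addr, size_t len);
struct sg_table *dma_buf_sgt_builder_finish(struct dma_buf_sgt_builder *b);

The driver loop then collapses to walking its BAR chunks and calling
the append helper, and the splitting at dma_max_seg_size plus the
merging of physically contiguous pages lives in one place instead of
being open coded per driver.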
> +static int hl_dmabuf_attach(struct dma_buf *dmabuf,
> +                        struct dma_buf_attachment *attachment)
> +{
> +        struct hl_dmabuf_wrapper *hl_dmabuf;
> +        struct hl_device *hdev;
> +        int rc;
> +
> +        hl_dmabuf = dmabuf->priv;
> +        hdev = hl_dmabuf->ctx->hdev;
> +
> +        rc = pci_p2pdma_distance_many(hdev->pdev, &attachment->dev, 1, true);
> +
> +        if (rc < 0)
> +                attachment->peer2peer = false;

Extra blank line

> +
> +        return 0;
> +}
> +
> +static struct sg_table *hl_map_dmabuf(struct dma_buf_attachment *attachment,
> +                        enum dma_data_direction dir)
> +{
> +        struct dma_buf *dma_buf = attachment->dmabuf;
> +        struct hl_vm_phys_pg_pack *phys_pg_pack;
> +        struct hl_dmabuf_wrapper *hl_dmabuf;
> +        struct hl_device *hdev;
> +        struct sg_table *sgt;
> +        int rc;
> +
> +        hl_dmabuf = dma_buf->priv;
> +        hdev = hl_dmabuf->ctx->hdev;
> +        phys_pg_pack = hl_dmabuf->phys_pg_pack;
> +
> +        if (!attachment->peer2peer) {
> +                dev_err(hdev->dev,
> +                        "Failed to map dmabuf because p2p is disabled\n");
> +                return ERR_PTR(-EPERM);

User triggerable print again?

> +static void hl_unmap_dmabuf(struct dma_buf_attachment *attachment,
> +                        struct sg_table *sgt,
> +                        enum dma_data_direction dir)
> +{
> +        struct scatterlist *sg;
> +        int i;
> +
> +        for_each_sgtable_dma_sg(sgt, sg, i)
> +                dma_unmap_resource(attachment->dev, sg_dma_address(sg),
> +                                sg_dma_len(sg), dir,
> +                                DMA_ATTR_SKIP_CPU_SYNC);

Why can we skip the CPU_SYNC? Seems like a comment is needed.

Something has to do a CPU_SYNC before recycling this memory for another
purpose, where is it?
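If the answer is that this is PCI BAR (MMIO) memory that the CPU never
touches through a cacheable mapping, then spell that out next to the
unmap, something like this (the rationale in the comment is my guess,
not something the patch states):

        for_each_sgtable_dma_sg(sgt, sg, i)
                /*
                 * The entries point at PCI BAR space, there is no CPU
                 * cacheable alias of this memory, so there is nothing
                 * for the CPU to sync.
                 */
                dma_unmap_resource(attachment->dev, sg_dma_address(sg),
                                   sg_dma_len(sg), dir,
                                   DMA_ATTR_SKIP_CPU_SYNC);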
> +static void hl_release_dmabuf(struct dma_buf *dmabuf)
> +{
> +        struct hl_dmabuf_wrapper *hl_dmabuf = dmabuf->priv;

Maybe hl_dmabuf_wrapper should be hl_dmabuf_priv

> + * export_dmabuf_from_addr() - export a dma-buf object for the given memory
> + *                             address and size.
> + * @ctx: pointer to the context structure.
> + * @device_addr: device memory physical address.
> + * @size: size of device memory.
> + * @flags: DMA-BUF file/FD flags.
> + * @dmabuf_fd: pointer to result FD that represents the dma-buf object.
> + *
> + * Create and export a dma-buf object for an existing memory allocation inside
> + * the device memory, and return a FD which is associated with the dma-buf
> + * object.
> + *
> + * Return: 0 on success, non-zero for failure.
> + */
> +static int export_dmabuf_from_addr(struct hl_ctx *ctx, u64 device_addr,
> +                        u64 size, int flags, int *dmabuf_fd)
> +{
> +        struct hl_dmabuf_wrapper *hl_dmabuf;
> +        struct hl_device *hdev = ctx->hdev;
> +        struct asic_fixed_properties *prop;
> +        u64 bar_address;
> +        int rc;
> +
> +        prop = &hdev->asic_prop;
> +
> +        if (!IS_ALIGNED(device_addr, PAGE_SIZE)) {
> +                dev_err_ratelimited(hdev->dev,
> +                        "address of exported device memory should be aligned to 0x%lx, address 0x%llx\n",
> +                        PAGE_SIZE, device_addr);
> +                return -EINVAL;
> +        }
> +
> +        if (size < PAGE_SIZE) {
> +                dev_err_ratelimited(hdev->dev,
> +                        "size %llu of exported device memory should be equal to or greater than %lu\n",
> +                        size, PAGE_SIZE);
> +                return -EINVAL;
> +        }
> +
> +        if (device_addr < prop->dram_user_base_address ||
> +                        device_addr + size > prop->dram_end_address ||
> +                        device_addr + size < device_addr) {
> +                dev_err_ratelimited(hdev->dev,
> +                        "DRAM memory range is outside of DRAM boundaries, address 0x%llx, size 0x%llx\n",
> +                        device_addr, size);
> +                return -EINVAL;
> +        }
> +
> +        bar_address = hdev->dram_pci_bar_start +
> +                        (device_addr - prop->dram_base_address);
> +
> +        if (bar_address + size >
> +                        hdev->dram_pci_bar_start + prop->dram_pci_bar_size ||
> +                        bar_address + size < bar_address) {
> +                dev_err_ratelimited(hdev->dev,
> +                        "DRAM memory range is outside of PCI BAR boundaries, address 0x%llx, size 0x%llx\n",
> +                        device_addr, size);
> +                return -EINVAL;
> +        }

More prints from userspace

> +static int export_dmabuf_from_handle(struct hl_ctx *ctx, u64 handle, int flags,
> +                        int *dmabuf_fd)
> +{
> +        struct hl_vm_phys_pg_pack *phys_pg_pack;
> +        struct hl_dmabuf_wrapper *hl_dmabuf;
> +        struct hl_device *hdev = ctx->hdev;
> +        struct asic_fixed_properties *prop;
> +        struct hl_vm *vm = &hdev->vm;
> +        u64 bar_address;
> +        u32 idr_handle;
> +        int rc, i;
> +
> +        prop = &hdev->asic_prop;
> +
> +        idr_handle = lower_32_bits(handle);

Why silent truncation? Shouldn't setting the upper 32 bits be an error?

> +        case HL_MEM_OP_EXPORT_DMABUF_FD:
> +                rc = export_dmabuf_from_addr(ctx,
> +                                args->in.export_dmabuf_fd.handle,
> +                                args->in.export_dmabuf_fd.mem_size,
> +                                args->in.flags,
> +                                &dmabuf_fd);
> +                memset(args, 0, sizeof(*args));
> +                args->out.fd = dmabuf_fd;

Would expect the installed fd to be the positive return, not a pointer

Jason