From: Daniel Vetter
Date: Fri, 20 Aug 2021 09:25:30 +0200
Subject: Re: [RFC] Make use of non-dynamic dmabuf in RDMA
To: Jason Gunthorpe
Cc: Gal Pressman, Sumit Semwal, Christian König, Doug Ledford,
 "open list:DMA BUFFER SHARING FRAMEWORK", dri-devel,
 Linux Kernel Mailing List, linux-rdma, Oded Gabbay, Tomer Tayar,
 Yossi Leybovich, Alexander Matushevsky, Leon Romanovsky,
 Jianxin Xiong, John Hubbard
References: <20210818074352.29950-1-galpress@amazon.com>
 <20210819230602.GU543798@ziepe.ca>
In-Reply-To: <20210819230602.GU543798@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe wrote:
> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman wrote:
> > >
> > > Hey all,
> > >
> > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > attachments, which requires the RDMA device to support on-demand-paging
> > > (ODP) which is not common on most devices (only supported by mlx5).
> > >
> > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > (such as habanalabs) have device memory that is always "pinned" and do
> > > not need/use the move_notify operation.
> > >
> > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > >
> > > This draft patch changes the dmabuf driver to differentiate between
> > > static/dynamic attachments by looking at the move_notify op instead of
> > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > in case of a static exporter.
> > >
> > > Thanks
> > >
> > > Signed-off-by: Gal Pressman
> >
> > Given that habanalabs dma-buf support is very firmly in limbo (at
> > least it's not yet in linux-next or anywhere else) I think you want to
> > solve that problem first before we tackle the additional issue of
> > making p2p work without dynamic dma-buf. Without that it just doesn't
> > make a lot of sense really to talk about solutions here.
>
> I have been thinking about adding a dmabuf exporter to VFIO, for
> basically the same reason habana labs wants to do it.
>
> In that situation we'd want to see an approach similar to this as well
> to have a broad usability.
>
> The GPU drivers also want this for certain sophisticated scenarios
> with RDMA, the intree drivers just haven't quite got there yet.
>
> So, I think it is worthwhile to start thinking about this regardless
> of habana labs.

Oh sure, I've been having these thoughts for a while. I think there are
two options:

- some kind of soft-pin, where the contract is that we only revoke
when absolutely necessary, and it's expected to be catastrophic on the
importer's side. The use-case would be a single user that fully controls
all accelerator local memory, and so kernel driver evicting stuff.
I haven't implemented it, but the idea is that essentially in eviction
we check whom we're evicting for (by checking owners of buffers maybe,
atm those are not tracked in generic code but not that hard to add),
and if it's the same userspace owner we don't ever pick these buffers
as victims for eviction, preferring -ENOMEM/-ENOSPC. If a new user
comes around then we'd still throw these out to avoid abuse, and it
would be up to sysadmins to make sure this doesn't happen at a bad
time, maybe with the next thing.

- cgroups for (pinned) buffers. Mostly because cgroups for local
memory is somewhere on the plans anyway, but that one could also take
forever since there are questions about overlap with memcg and things
like that, plus thus far everyone who cares made an incompatible
proposal about how it should be done :-/

A variant of the first one would be device-level revoke, which is a
concept we already have in drm for the modesetting side and also for
like 20-year-old gpu drivers. We could dust that off and close some of
the gaps (a student is fixing the locking right now, the thing left to
do is mmap revoke), and I think that model of exclusive device
ownership with the option to revoke fits pretty well for at least some
of the accelerators floating around. In that case importers would
never get a move_notify (maybe we should call this revoke_notify to
make it clear it's a bit different) callback, except when the entire
thing has been yanked. I think that would fit pretty well for VFIO,
and I think we should be able to make it work for rdma too as some
kind of auto-deregister.

The locking might be fun with both of these since I expect some
inversions compared to the register path, we'll have to figure these
out.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
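
For illustration only, the soft-pin victim selection from the first
option above could look roughly like the userspace sketch below. This is
not kernel code and not an existing API: struct buf, its fields and
pick_victim() are hypothetical names, and the "owner" field stands in
for the per-buffer owner tracking that generic code does not have today.
Evicting unpinned buffers first is an extra assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer record, not a real kernel structure. */
struct buf {
	int owner;        /* id of the userspace owner (assumed to be tracked) */
	bool soft_pinned; /* buffer is covered by the soft-pin contract */
	bool resident;    /* buffer currently occupies device-local memory */
};

/*
 * Pick an eviction victim to make room for an allocation by "requester".
 * Returns the index of the victim, or -ENOSPC if nothing may be evicted.
 */
static int pick_victim(const struct buf *bufs, size_t n, int requester)
{
	int fallback = -ENOSPC;

	assert(bufs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!bufs[i].resident)
			continue;
		/* Assumption of this sketch: unpinned buffers always go first. */
		if (!bufs[i].soft_pinned)
			return (int)i;
		/*
		 * A soft-pinned buffer is never a victim for its own owner.
		 * For a different owner it stays revocable, to avoid abuse.
		 */
		if (bufs[i].owner != requester && fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}
```

With only soft-pinned buffers from owner 1 resident, a request from
owner 1 gets -ENOSPC, while a request from a new owner 2 picks one of
them as the victim, which is the point where the importer would see the
revoke.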