dri-devel Archive on lore.kernel.org
 help / color / Atom feed
From: Daniel Vetter <daniel@ffwll.ch>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Leon Romanovsky" <leon@kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"Doug Ledford" <dledford@redhat.com>,
	"Vetter, Daniel" <daniel.vetter@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Xiong, Jianxin" <jianxin.xiong@intel.com>
Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
Date: Thu, 2 Jul 2020 15:10:00 +0200
Message-ID: <20200702131000.GW3278063@phenom.ffwll.local> (raw)
In-Reply-To: <20200701171524.GN25301@ziepe.ca>

On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > >> make sure that new ones grab a new mapping.
> > > > Swap and flush isn't a general HW ability either..
> > > >
> > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > in-progress writes?
> > > >
> > > > Did you mean pause, swap and resume? That's ODP.
> > >
> > > Yes, something like this. And good to know, never heard of ODP.
> > 
> > Hm I thought ODP was full hw page faults at an individual page
> > level,
> Yes
> > and this stop&resume is for the entire nic. Under the hood both apply
> > back-pressure on the network if a transmission can't be received,
> > but
> NIC's don't do stop and resume, blocking the Rx pipe is very
> problematic and performance destroying.
> The strategy for something like ODP is more complex, and so far no NIC
> has deployed it at any granularity larger than per-page.
> > So since Jason really doesn't like dma_fence much I think for rdma
> > synchronous it is. And it shouldn't really matter, since waiting for a
> > small transaction to complete at rdma wire speed isn't really that
> > long an operation. 
> Even if DMA fence were to somehow be involved, how would it look?

Well above you're saying it would be performance destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

0. rdma driver maintains a list of all dma-buf that it has mapped
somewhere and is currently using for transactions

1. rdma driver gets a dma_buf->notify_move callback on one of these
buffers. To handle that it:
	1. stops hw access somehow at the rx
	2. flushes caches and whatever else is needed
	3. moves the unmapped buffer on a special list or marks it in some
	different way as unavailable
	4. launch the kthread/work_struct to fix everything back up

2. dma-buf export (gpu driver) can now issue the commands to move the
buffer around

3. rdma driver worker gets busy to restart rx:
	1. lock all dma-buf that are currently in use (dma_resv_lock).
	thanks to ww_mutex deadlock avoidance this is possible
	2. for any buffers which have been marked as unavailable in 1.3
	grab a new mapping (which might now be in system memory, or again
	peer2peer but different address)
	3. restart hw and rx
	4. unlock all dma-buf locks (dma_resv_unlock)

There is a minor problem because step 2 only queues up the entire buffer
moves behind a pile of dma_fence, and atm we haven't made it absolutely
clear who's responsible for waiting for those to complete. For gpu drivers
it's the importer since gpu drivers don't have big qualms about
dma_fences, so 3.2. would perhaps also include a dma_fence_wait to make
sure the buffer move has actually completed.

Above flow is more or less exactly what happens for gpu workloads where we
can preempt running computations. Instead of stopping rx we preempt the
compute job and remove it from the scheduler queues, and instead of
restarting rx we just put the compute job back onto the scheduler queue as
eligible for gpu time. Otherwise it's exactly the same stuff.

Of course if you only have a single compute job and too many such
interruptions, then performance is going to tank. Don't do that, instead
make sure you have enough vram or system memory or whatever :-)

Cheers, Daniel
Daniel Vetter
Software Engineer, Intel Corporation
dri-devel mailing list

  reply index

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1593451903-30959-1-git-send-email-jianxin.xiong@intel.com>
     [not found] ` <20200629185152.GD25301@ziepe.ca>
     [not found]   ` <MW3PR11MB4555A99038FA0CFC3ED80D3DE56F0@MW3PR11MB4555.namprd11.prod.outlook.com>
     [not found]     ` <20200630173435.GK25301@ziepe.ca>
2020-06-30 18:46       ` Xiong, Jianxin
2020-06-30 19:17         ` Jason Gunthorpe
2020-06-30 20:08           ` Xiong, Jianxin
2020-07-02 12:27             ` Jason Gunthorpe
2020-07-01  9:03         ` Christian König
2020-07-01 12:07           ` Daniel Vetter
2020-07-01 12:14             ` Daniel Vetter
2020-07-01 12:39           ` Jason Gunthorpe
2020-07-01 12:55             ` Christian König
2020-07-01 15:42               ` Daniel Vetter
2020-07-01 17:15                 ` Jason Gunthorpe
2020-07-02 13:10                   ` Daniel Vetter [this message]
2020-07-02 13:29                     ` Jason Gunthorpe
2020-07-02 14:50                       ` Christian König
2020-07-02 18:15                         ` Daniel Vetter
2020-07-03 12:03                           ` Jason Gunthorpe
2020-07-03 12:52                             ` Daniel Vetter
2020-07-03 13:14                               ` Jason Gunthorpe
2020-07-03 13:21                                 ` Christian König
2020-07-07 21:58                                   ` Xiong, Jianxin
2020-07-08  9:38                                     ` Christian König
2020-07-08  9:49                                       ` Daniel Vetter
2020-07-08 14:20                                         ` Christian König
2020-07-08 14:33                                           ` Alex Deucher
2020-06-30 18:56 ` Xiong, Jianxin
     [not found] ` <1593451903-30959-2-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:04   ` [RFC PATCH v2 1/3] RDMA/umem: Support importing dma-buf as user memory region Xiong, Jianxin
     [not found] ` <1593451903-30959-3-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:04   ` [RFC PATCH v2 2/3] RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf Xiong, Jianxin
     [not found] ` <1593451903-30959-4-git-send-email-jianxin.xiong@intel.com>
2020-06-30 19:05   ` [RFC PATCH v2 3/3] RDMA/uverbs: Add uverbs command for dma-buf based MR registration Xiong, Jianxin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200702131000.GW3278063@phenom.ffwll.local \
    --to=daniel@ffwll.ch \
    --cc=christian.koenig@amd.com \
    --cc=daniel.vetter@intel.com \
    --cc=dledford@redhat.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=jgg@ziepe.ca \
    --cc=jianxin.xiong@intel.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

dri-devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/dri-devel/0 dri-devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dri-devel dri-devel/ https://lore.kernel.org/dri-devel \
	public-inbox-index dri-devel

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git