From: Jason Gunthorpe <jgg@nvidia.com>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: Matthew Wilcox <willy@infradead.org>,
	nvdimm@lists.linux.dev, linux-rdma@vger.kernel.org,
	John Hubbard <jhubbard@nvidia.com>,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	Ming Lei <ming.lei@redhat.com>,
	linux-block@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, Joao Martins <joao.m.martins@oracle.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: Phyr Starter
Date: Tue, 11 Jan 2022 16:26:48 -0400	[thread overview]
Message-ID: <20220111202648.GP2328285@nvidia.com> (raw)
In-Reply-To: <CAKMK7uFfpTKQEPpVQxNDi0NeO732PJMfiZ=N6u39bSCFY3d6VQ@mail.gmail.com>

On Tue, Jan 11, 2022 at 10:05:40AM +0100, Daniel Vetter wrote:

> If we go with page size I think hardcoding a PHYS_PAGE_SIZE KB(4)
> would make sense, because thanks to x86 that's pretty much the lowest
> common denominator that all hw (I know of at least) supports. Not
> having to fiddle with "which page size do we have" in driver code
> would be neat. It makes writing portable gup code in drivers just
> needlessly silly.

What I did in RDMA was add an iterator, rdma_umem_for_each_dma_block().

The driver passes in the page size it wants and the iterator breaks the
SGL up into blocks of that size.

So, e.g. on a 16k page size system the SGL would be full of 16k
entries, but if the driver only supports 4k the iterator hands out
four 4k blocks for each SGL entry.

All the drivers use this to build their DMA lists and tables, and it
works really well.
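
Roughly, the driver side ends up looking like the sketch below - a
minimal, made-up example (fill_page_list() and pas[] are illustrative
names, not from any real driver) of walking the umem in fixed 4k
blocks:

#include <linux/sizes.h>
#include <rdma/ib_umem.h>

/* Sketch only: copy one DMA address per 4k block into a HW page list */
static void fill_page_list(struct ib_umem *umem, __le64 *pas)
{
	struct ib_block_iter biter;
	unsigned int i = 0;

	/* The SGL may hold 16k (or larger) segments; the iterator still
	 * hands back one 4k-aligned DMA address per block. */
	rdma_umem_for_each_dma_block(umem, &biter, SZ_4K)
		pas[i++] = cpu_to_le64(rdma_block_iter_dma_address(&biter));
}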

The other part is that most RDMA drivers support many page sizes, so
there is another API that inspects the SGL, takes in the device's page
size support, and computes what page size the driver should use.
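
Something like this sketch shows how the two pieces compose;
dev_pgsz_bitmap and program_hw_entry() are placeholder names for the
device's supported-page-size mask and the driver's table writer:

/* Sketch only: pick the best page size, then iterate with it */
static int build_mr_table(struct ib_umem *umem, u64 iova,
			  unsigned long dev_pgsz_bitmap)
{
	struct ib_block_iter biter;
	unsigned long pgsz;

	/* Intersect what the device supports with what the umem's
	 * layout and the IOVA actually allow. */
	pgsz = ib_umem_find_best_pgsz(umem, dev_pgsz_bitmap, iova);
	if (!pgsz)
		return -EINVAL;

	rdma_umem_for_each_dma_block(umem, &biter, pgsz)
		program_hw_entry(rdma_block_iter_dma_address(&biter));

	return 0;
}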

> - I think minimally an sg list form of dma-mapped stuff which does not
> have a struct page, iirc last time we discussed that we agreed that
> this really needs to be part of such a rework or it's not really
> improving things much

Yes, this seems important..
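
Purely as a strawman (none of these names exist in the kernel today),
the struct-page-less, DMA-mapped form being discussed might look
something like:

/* Hypothetical sketch of a DMA-mapped range list with no struct page */
struct dma_range {
	dma_addr_t	addr;
	u32		len;
};

struct dma_range_list {
	unsigned int		nr_ranges;
	struct dma_range	ranges[];
};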

> - a few per-entry driver bits would be nice in both the phys/dma
> chains, if we can have them. gpus have funny gpu interconnects, this
> would allow us to put all the gpu addresses into dma_addr_t if we can
> have some bits indicating whether it's on the pci bus, gpu local
> memory or the gpu<->gpu interconnect.

It seems useful, see my other email for a suggested coding..
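
Purely as an illustration of the idea (the names and bit layout here
are invented, building on the strawman struct above), stealing a
couple of low bits from a page-aligned entry could tag which address
space it targets:

#define DMA_RANGE_SPACE_PCI	0x0	/* address on the PCI/host bus */
#define DMA_RANGE_SPACE_LMEM	0x1	/* GPU local memory */
#define DMA_RANGE_SPACE_LINK	0x2	/* gpu<->gpu interconnect */
#define DMA_RANGE_SPACE_MASK	0x3

static inline dma_addr_t dma_range_addr(const struct dma_range *r)
{
	/* entries are page aligned, so the low bits are free for tags */
	return r->addr & ~(dma_addr_t)DMA_RANGE_SPACE_MASK;
}

static inline unsigned int dma_range_space(const struct dma_range *r)
{
	return r->addr & DMA_RANGE_SPACE_MASK;
}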

Jason 

