Re: [RFC 00/16] padata, vfio, sched: Multithreaded VFIO page pinning

From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Ben Segall <bsegall@google.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Ingo Molnar <mingo@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Michal Hocko <mhocko@suse.com>, Nico Pache <npache@redhat.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Steve Sistare <steven.sistare@oracle.com>,
	Tejun Heo <tj@kernel.org>, Tim Chen <tim.c.chen@linux.intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org
Subject: Re: [RFC 00/16] padata, vfio, sched: Multithreaded VFIO page pinning
Date: Tue, 11 Jan 2022 11:20:27 -0500	[thread overview]
Message-ID: <20220111162027.3brb7ga3vgtvv6th@oracle.com> (raw)
In-Reply-To: <20220111001751.GI2328285@nvidia.com>

On Mon, Jan 10, 2022 at 08:17:51PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
> 
> > > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > > experiment, bringing the time from 1.8 seconds to .2 seconds.  That's a
> > > > significant savings IMHO
> > > 
> > > And here is where I suspect we'd get similar results from folio's
> > > based on the unpin performance uplift we already saw.
> > > 
> > > As long as PUP doesn't have to COW its work is largely proportional to
> > > the number of struct pages it processes, so we should be expecting an
> > > upper limit of 512x gains on the PUP alone with foliation.
> > >
> > > This is in line with what we saw with the prior unpin work.
> > 
> > "in line with what we saw"  Not following.  The unpin work had two
> > optimizations, I think, 4.5x and 3.5x which together give 16x.  Why is
> > that in line with the potential gains from pup?
> 
> It is the same basic issue, doing extra work, dirtying extra memory..

Ok, gotcha.

> I don't know of other users that use such huge memory sizes this would
> matter, besides a VMM..

Right, all the VMMs out there that use vfio.

> > My assumption going into this series was that multithreading VFIO page
> > pinning in the kernel was a viable way forward given the positive
> > feedback I got from the VFIO maintainer last time I posted this, which
> > was admittedly a while ago, and I've since been focused on the other
> > parts of this series rather than what's been happening in the mm lately.
> > Anyway, your arguments are reasonable, so I'll go take a look at some of
> > these optimizations and see where I get.
> 
> Well, it is not *unreasonable* it just doesn't seem compelling to me
> yet.
> 
> Especially since we are not anywhere close to the limit of single
> threaded performance. Aside from GUP, the whole way we transfer the
> physical pages into the iommu is just begging for optimizations
> eg Matthew's struct phyr needs to be an input and output at the iommu
> layer to make this code really happy.

/nods/  There are other ways forward.  As I say, I'll take a look.