From: Ashish Kalra <ashish.kalra@amd.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: hch@lst.de, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, luto@kernel.org,
	peterz@infradead.org, dave.hansen@linux-intel.com,
	iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	brijesh.singh@amd.com, Thomas.Lendacky@amd.com,
	ssg.sos.patches@amd.com, jon.grimm@amd.com, rientjes@google.com
Subject: Re: [PATCH v3] swiotlb: Adjust SWIOTBL bounce buffer size for SEV guests.
Date: Tue, 17 Nov 2020 15:33:34 +0000	[thread overview]
Message-ID: <20201117153302.GA29293@ashkalra_ubuntu_server> (raw)
In-Reply-To: <20201113211925.GA6096@char.us.oracle.com>

On Fri, Nov 13, 2020 at 04:19:25PM -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 05, 2020 at 09:20:45PM +0000, Ashish Kalra wrote:
> > On Thu, Nov 05, 2020 at 03:20:07PM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Nov 05, 2020 at 07:38:28PM +0000, Ashish Kalra wrote:
> > > > On Thu, Nov 05, 2020 at 02:06:49PM -0500, Konrad Rzeszutek Wilk wrote:
> > > > > .
> > > > > > > Right, so I am wondering if we can do this better.
> > > > > > > 
> > > > > > > That is, you are never going to get any 32-bit devices with SEV, right? That
> > > > > > > is, there is nothing that binds you to always use the memory below 4GB?
> > > > > > > 
> > > > > > 
> > > > > > We do support 32-bit PCIe passthrough devices with SEV.
> > > > > 
> > > > > Ewww..  Which devices would this be?
> > > > 
> > > > That will be difficult to predict as customers could be doing
> > > > passthrough of all kinds of devices.
> > > 
> > > But SEV is not on some 1990s hardware. It has PCIe; there are no PCI slots in there.
> > > 
> > > Is it really possible to have a PCIe device that can't do more than 32-bit DMA?
> > > 
> > > > 
> > > > > > 
> > > > > > Therefore, we can't just depend on >4G memory for SWIOTLB bounce buffering
> > > > > > when there is I/O pressure, because we do need to support device
> > > > > > passthrough of 32-bit devices.
> > > > > 
> > > > > Presumably there is just a handful of them?
> > > > >
> > > > Again, it will be incorrect to assume this.
> > > > 
> > > > > > 
> > > > > > Considering this, we believe that this patch needs to adjust/extend
> > > > > > the boot-allocation of SWIOTLB, and we want to keep it simple by doing this
> > > > > > within a range determined by the amount of allocated guest memory.
> > > > > 
> > > > > I would prefer not to have to revert this in a year as customers
> > > > > complain "I paid $$$ and I am wasting half a gig on something
> > > > > I am not using", and not to end up giving customers knobs to tweak this
> > > > > instead of doing the right thing from the start.
> > > > 
> > > > Currently, we face a lot of situations where we have to tell our
> > > > internal teams/external customers to explicitly increase the SWIOTLB buffer
> > > > via the swiotlb parameter on the kernel command line, especially to
> > > > get better I/O performance numbers with SEV.
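
(For context, a hedged illustration of the knob being referred to above: the
documented swiotlb= kernel parameter takes the number of 2KB I/O TLB slabs,
so a guest could be booted with, for example:

    swiotlb=131072    # 131072 slabs x 2KB/slab = 256MB bounce buffer vs. the 64MB default

The exact value is only illustrative; it is not a recommendation made anywhere
in this thread.)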
> > > 
> > > Presumably these are 64-bit?
> > > 
> > > And what devices do you speak of that are actually affected by
> > > this performance issue? Increasing the SWIOTLB just means we have more
> > > memory, which in my mind means you can have _more_ devices in the guest
> > > that won't handle the fact that DMA mapping returns an error.
> > > 
> > > Not necessarily that one device suddenly can go faster.
> > > 
> > > > 
> > > > So having this SWIOTLB size adjustment done implicitly (even using
> > > > static logic) is a great win-win situation. In other words, having even
> > > > a simple and static default increase of the SWIOTLB buffer size for SEV is
> > > > really useful for us.
> > > > 
> > > > We can always think of adding all kinds of heuristics to this, but that
> > > > just adds too much complexity without any predictable performance gain.
> > > > 
> > > > And to add, the patch extends the SWIOTLB size via an architecture-specific
> > > > callback; currently it is simple and static logic specific to SEV/x86,
> > > > but there is always an option to tweak/extend it with additional logic
> > > > in the future.
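
A minimal sketch, with hypothetical names, of the shape being described above
(this is not the actual v3 patch, only an illustration of a weak arch hook
that SEV/x86 could override):

    /* e.g. in kernel/dma/swiotlb.c -- generic default, no adjustment */
    unsigned long __init __weak arch_swiotlb_default_size(unsigned long bytes)
    {
            return bytes;                   /* keep the usual 64MB default */
    }

    /* e.g. in arch/x86/mm/mem_encrypt.c -- hypothetical SEV override that
     * scales the bounce buffer with guest memory, capped at a maximum */
    unsigned long __init arch_swiotlb_default_size(unsigned long bytes)
    {
            unsigned long total = PFN_PHYS(max_pfn);

            if (!sev_active())
                    return bytes;

            /* example policy only: ~6% of guest memory, capped at 1GB */
            return min(total / 16, 1UL << 30);
    }

The exact sizing policy in the real patch may differ; the point is only that
the policy sits behind an architecture-specific callback, as stated above.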
> > > 
> > > Right, and that is what I would like to talk about as I think you
> > > are going to disappear (aka, busy with other stuff) after this patch goes in.
> > > 
> > > I need to understand this more than "performance" and "internal teams"
> > > requirements to come up with a better way going forward as surely other
> > > platforms will hit the same issue anyhow.
> > > 
> > > Let's break this down:
> > > 
> > > How does the performance improve for one single device if you increase the SWIOTLB?
> > > Is there a specific device/driver that you can talk about that improves with this patch?
> > > 
> > > 
> > 
> > Yes, these are mainly multi-queue devices such as NICs or even
> > multi-queue virtio.
> > 
> > This basically improves performance with concurrent DMA, hence it mainly
> > benefits multi-queue devices.
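
(A hedged back-of-envelope example of why concurrent DMA matters here: the
default pool is 64MB, carved into 2KB slots, and a single mapping can span at
most 128 contiguous slots (256KB). If a 16-queue NIC keeps, say, 64 in-flight
64KB bounce-buffered buffers per queue, then:

    16 queues x 64 buffers x 64KB = 64MB

which already equals the entire default pool, before fragmentation is even
considered. The per-queue numbers are illustrative, not measurements reported
in this thread.)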
> 
> OK, and for a _1GB_ guest - what number of CPUs do the "internal teams/external
> customers" use? Please, let's use real use-cases.

> > I am sure you will understand we cannot share any external customer
> > data, as all that customer information is proprietary.
> >
> > In a similar situation, if you had to share Oracle data, you would
> > surely have the same concerns, and I don't think you would be able
> > to share any such information externally, i.e., outside Oracle.
> >
> I am asking a simple question - what number of CPUs does a 1GB
> guest have? The reason for this should be fairly obvious - if
> it is 1 vCPU, then there is no multi-queue and the existing
> SWIOTLB pool size is OK as it is.
>
> If however there are say 2 and multiqueue is enabled, that
> gives me an idea of how many you use, and I can find out what
> the maximum pool size usage of virtio is with that configuration.

Again, we cannot share any customer data.

Also, I don't think there can be a definitive answer to how many vCPUs a
1GB guest will have; it will depend on what kind of configuration we are
testing.

For example, I usually set up 4-16 vCPUs for as little as 512M of configured
guest memory.

I have also been testing with 16 vCPU configurations for 512M-1G guest
memory with Mellanox SR-IOV NICs, and this is a multi-queue NIC
device environment.

So we might have less configured guest memory, but we might still be
using that configuration with I/O-intensive workloads.

Thanks,
Ashish
