Date: Fri, 8 Feb 2019 16:28:50 +1100
From: David Gibson
Message-ID: <20190208052849.GB6434@umbus.fritz.box>
In-Reply-To: <20190207202620.23e9c063@x1.home>
Subject: Re: [Qemu-devel] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough
To: Alex Williamson
Cc: Alexey Kardashevskiy, Daniel Henrique Barboza, qemu-devel@nongnu.org, qemu-ppc@nongnu.org, Piotr Jaroszynski, Jose Ricardo Ziviani

On Thu, Feb 07, 2019 at 08:26:20PM -0700, Alex Williamson wrote:
> On Fri, 8 Feb 2019 13:29:37 +1100
> Alexey Kardashevskiy wrote:
> 
> > On 08/02/2019 02:18, Alex Williamson wrote:
> > > On Thu, 7 Feb 2019 15:43:18 +1100
> > > Alexey Kardashevskiy wrote:
> > > 
> > >> On 07/02/2019 04:22, Daniel Henrique Barboza wrote:
> > >>> Based on this series, I've sent a Libvirt patch to allow a QEMU process
> > >>> to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
> > >>> GPU:
> > >>>
> > >>> https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html
> > >>>
> > >>> In that thread, Alex raised concerns
about allowing QEMU to freely lock
> > >>> all the memory it wants. Is this an issue to be considered in the review
> > >>> of this series here?
> > >>>
> > >>> Reading the patches, especially patch 3/3, it seems to me that QEMU is
> > >>> going to lock the KVM memory to populate the NUMA node with memory
> > >>> of the GPU itself, so at first there is no risk of taking over the
> > >>> host RAM.
> > >>> Am I missing something?
> > >>
> > >>
> > >> The GPU memory belongs to the device: it is not visible to the host as
> > >> memory blocks and is not covered by page structs. For the host it is more
> > >> like MMIO, which is passed through to the guest without that locked-memory
> > >> accounting. I'd expect libvirt to keep working as usual except that:
> > >>
> > >> when libvirt calculates the amount of memory needed for TCE tables
> > >> (which is guestRAM/64k*8), it now needs to use the end of the last GPU
> > >> RAM window as the guest RAM size. For example, in QEMU HMP "info mtree -f":
> > >>
> > >> FlatView #2
> > >>  AS "memory", root: system
> > >>  AS "cpu-memory-0", root: system
> > >>  Root memory region: system
> > >>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> > >>   0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr
> > >>
> > >> So previously the DMA window would cover 0x7fffffff+1; now it has to
> > >> cover 0x11fffffffff+1.
> > >
> > > This looks like a chicken-and-egg problem: you're saying libvirt needs
> > > to query mtree to understand the extent of the GPU layout, but we need
> > > to specify the locked memory limits in order for QEMU to start. Is
> > > libvirt supposed to start the VM with unlimited locked memory and fix
> > > it at some indeterminate point in the future? Run a dummy VM with
> > > unlimited locked memory in order to determine the limits for the real
> > > VM? Neither of these sounds practical.
> > > Thanks,
> > 
> > 
> > QEMU maps GPU RAM at known locations (which depend only on the vPHB's
> > index, or can be set explicitly) and libvirt knows how many GPUs are
> > passed through, so it is quite easy to calculate the required amount of
> > memory.
> > 
> > Here is the window start calculation:
> > https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812
> > 
> > We do not know the exact GPU RAM window size until QEMU reads it from
> > VFIO/nvlink2, but we know that all existing hardware has a window of
> > 128GB (the adapters I have access to only have 16/32GB on board).
> 
> So you're asking that libvirt add 128GB per GPU with magic nvlink
> properties, which may be 8x what's actually necessary, and libvirt
> determines which GPUs to apply this to how? Does libvirt need to sort
> through device tree properties for this? Thanks,

Hm. If the GPU memory is really separate from main RAM, which it
sounds like, I don't think it makes sense to account it against the
same locked memory limit as regular RAM.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
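[Editorial note] The sizing argument in the thread can be sketched numerically. The following is an illustrative Python sketch, not QEMU or libvirt code: the 64 KiB IOMMU page size and 8-byte TCE entry come from Alexey's guestRAM/64k*8 formula, the 128 GiB per-GPU window from his follow-up, and the 1 TiB window base is a hypothetical constant chosen to match the mtree dump; the real placement logic is in the qemu commit linked above.

```python
# Illustrative sketch of the locked-memory sizing discussed in this thread.
# Assumptions (not authoritative): 64 KiB TCE (IOMMU) page, 8 bytes per TCE
# entry, a 128 GiB GPU RAM window per NVLink2 GPU, and a hypothetical window
# base of 1 TiB matching where nvlink2-mr appears in the mtree dump.

KIB, GIB = 1024, 1024 ** 3
TCE_PAGE = 64 * KIB          # sPAPR IOMMU page size
TCE_ENTRY = 8                # bytes per table entry
GPU_WINDOW = 128 * GIB       # window size on all existing NVLink2 hardware
GPU_BASE = 0x10000000000     # 1 TiB; nvlink2-mr starts here in the dump

def dma_window_end(guest_ram: int, n_gpus: int) -> int:
    """End of the address range the DMA window must cover."""
    if n_gpus == 0:
        return guest_ram
    return GPU_BASE + n_gpus * GPU_WINDOW

def tce_table_bytes(window_end: int) -> int:
    """TCE table size: one 8-byte entry per 64 KiB page of the window."""
    return window_end // TCE_PAGE * TCE_ENTRY

# 2 GiB guest, no GPU: the window covers 0x7fffffff+1, needing 256 KiB of TCEs.
print(hex(dma_window_end(0x80000000, 0)), tce_table_bytes(0x80000000))

# Same guest with one GPU: the window must now cover 0x11fffffffff+1,
# so the table grows to 144 MiB, which libvirt would have to account for.
end = dma_window_end(0x80000000, 1)
print(hex(end), tce_table_bytes(end))
```

Under these assumptions, each 128 GiB GPU window adds roughly 144 MiB of TCE table on top of the guest-RAM-based estimate, which gives a feel for what "use the end of the last GPU RAM window as the guest RAM size" costs in practice.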