Date: Thu, 14 Feb 2019 15:59:32 +1100
From: David Gibson
To: Alexey Kardashevskiy
Cc: Alex Williamson, Daniel Henrique Barboza, qemu-devel@nongnu.org,
 qemu-ppc@nongnu.org, Piotr Jaroszynski, Jose Ricardo Ziviani
Subject: Re: [Qemu-devel] [PATCH qemu 0/3] spapr_pci, vfio: NVIDIA V100 + P9 passthrough
Message-ID: <20190214045926.GB1884@umbus.fritz.box>
References: <20190117025115.81178-1-aik@ozlabs.ru>
 <20190207081830.4dcbb822@x1.home>
 <295fa9ca-29c1-33e6-5168-8991bc0ef7b1@ozlabs.ru>
 <20190207202620.23e9c063@x1.home>
 <20190208052849.GB6434@umbus.fritz.box>
 <45e89e77-02ad-aebf-ab2f-e5e7f7e12ecb@ozlabs.ru>
In-Reply-To: <45e89e77-02ad-aebf-ab2f-e5e7f7e12ecb@ozlabs.ru>

On Mon, Feb 11, 2019 at 02:49:49PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 08/02/2019 16:28, David Gibson wrote:
> > On Thu, Feb 07, 2019 at 08:26:20PM -0700, Alex Williamson wrote:
> >> On Fri, 8 Feb 2019 13:29:37 +1100
> >> Alexey Kardashevskiy wrote:
> >>
> >>> On 08/02/2019 02:18, Alex Williamson wrote:
> >>>> On Thu, 7 Feb 2019 15:43:18 +1100
> >>>> Alexey Kardashevskiy wrote:
> >>>>
> >>>>> On 07/02/2019 04:22, Daniel Henrique Barboza wrote:
> >>>>>> Based on this series, I've sent a Libvirt patch to allow a QEMU process
> >>>>>> to inherit IPC_LOCK when using VFIO passthrough with the Tesla V100
> >>>>>> GPU:
> >>>>>>
> >>>>>> https://www.redhat.com/archives/libvir-list/2019-February/msg00219.html
> >>>>>>
> >>>>>> In that thread, Alex raised concerns about allowing QEMU to freely lock
> >>>>>> all the memory it wants. Is this an issue to be considered in the review
> >>>>>> of this series here?
> >>>>>>
> >>>>>> Reading the patches, especially patch 3/3, it seems to me that QEMU is
> >>>>>> going to lock the KVM memory to populate the NUMA node with memory
> >>>>>> of the GPU itself, so at first there is no risk of taking over the
> >>>>>> host RAM.
> >>>>>> Am I missing something?
> >>>>>
> >>>>>
> >>>>> The GPU memory belongs to the device; it is not visible to the host as
> >>>>> memory blocks and is not covered by page structs. For the host it is more
> >>>>> like MMIO, which is passed through to the guest without that locked-memory
> >>>>> accounting. I'd expect libvirt to keep working as usual, except that:
> >>>>>
> >>>>> when libvirt calculates the amount of memory needed for TCE tables
> >>>>> (which is guestRAM/64k*8), now it needs to use the end of the last GPU
> >>>>> RAM window as the guest RAM size.
> >>>>> For example, in QEMU HMP "info mtree -f":
> >>>>>
> >>>>> FlatView #2
> >>>>>  AS "memory", root: system
> >>>>>  AS "cpu-memory-0", root: system
> >>>>>  Root memory region: system
> >>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>>>>   0000010000000000-0000011fffffffff (prio 0, ram): nvlink2-mr
> >>>>>
> >>>>> So previously the DMA window would cover 0x7fffffff+1, now it has to
> >>>>> cover 0x11fffffffff+1.
> >>>>
> >>>> This looks like a chicken and egg problem, you're saying libvirt needs
> >>>> to query mtree to understand the extent of the GPU layout, but we need
> >>>> to specify the locked memory limits in order for QEMU to start? Is
> >>>> libvirt supposed to start the VM with unlimited locked memory and fix
> >>>> it at some indeterminate point in the future? Run a dummy VM with
> >>>> unlimited locked memory in order to determine the limits for the real
> >>>> VM? Neither of these sounds practical. Thanks,
> >>>
> >>>
> >>> QEMU maps GPU RAM at known locations (which depend only on the vPHB's
> >>> index, or can be set explicitly) and libvirt knows how many GPUs are
> >>> passed through, so it is quite easy to calculate the required amount of
> >>> memory.
> >>>
> >>> Here is the window start calculation:
> >>> https://github.com/aik/qemu/commit/7073cad3ae7708d657e01672bcf53092808b54fb#diff-662409c2a5a150fe231d07ea8384b920R3812
> >>>
> >>> We do not know the exact GPU RAM window size until QEMU reads it from
> >>> VFIO/nvlink2, but we know that all existing hardware has a window of
> >>> 128GB (the adapters I have access to only have 16/32GB on board).
> >>
> >> So you're asking that libvirt add 128GB per GPU with magic nvlink
> >> properties, which may be 8x what's actually necessary, and libvirt
> >> determines which GPUs to apply this to how? Does libvirt need to sort
> >> through device tree properties for this? Thanks,
> > 
> > Hm. If the GPU memory is really separate from main RAM, which it
> > sounds like, I don't think it makes sense to account it against the
> > same locked memory limit as regular RAM.
> 
> This is accounting for the TCE table to cover GPU RAM, not for GPU RAM itself.

Ah, ok, that makes sense then.

> So I am asking libvirt to add 128GB/64k*8 = 16MB to the locked_vm. It
> already does so for the guest RAM.

That seems reasonable. IIRC we already have some slop in the amount of
locked vm that libvirt allocates; not sure if it'll be enough as is.
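To make the numbers above concrete, here is a minimal stand-alone sketch (an illustration only, not libvirt or QEMU code). It assumes the 64KiB IOMMU page size and 8-byte TCE entry size implied by the guestRAM/64k*8 formula, and takes the window ends from the flatview quoted earlier:

/*
 * Illustration only, not libvirt code: TCE-table size for a DMA window,
 * assuming 64KiB IOMMU pages and 8-byte TCE entries (the guestRAM/64k*8
 * formula quoted in this thread).
 */
#include <inttypes.h>
#include <stdio.h>

#define TCE_PAGE_SHIFT  16  /* 64KiB IOMMU page size (assumed) */
#define TCE_ENTRY_SIZE  8   /* bytes per TCE entry (assumed)   */

static uint64_t tce_table_bytes(uint64_t dma_window_size)
{
    return (dma_window_size >> TCE_PAGE_SHIFT) * TCE_ENTRY_SIZE;
}

int main(void)
{
    uint64_t ram_end = 0x7fffffffULL + 1;    /* end of ppc_spapr.ram in the flatview */
    uint64_t gpu_end = 0x11fffffffffULL + 1; /* end of nvlink2-mr in the flatview    */

    printf("window up to RAM end: %" PRIu64 " KiB of TCEs\n",
           tce_table_bytes(ram_end) >> 10);
    printf("window up to GPU end: %" PRIu64 " KiB of TCEs\n",
           tce_table_bytes(gpu_end) >> 10);
    /* The per-GPU increment requested of libvirt: 128GB/64k*8 = 16MB. */
    printf("per 128GB GPU window: %" PRIu64 " KiB of TCEs\n",
           tce_table_bytes(128ULL << 30) >> 10);
    return 0;
}

Run, it shows the jump from 256KiB of TCEs for the 2GB of guest RAM to 144MB once the DMA window has to reach the end of the nvlink2 region, and the 16MB per-128GB-window increment libvirt is being asked to add to locked_vm.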
-- 
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson
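On the "known locations" point above, the placement of each vPHB's GPU RAM window can be sketched as a fixed base plus a per-index stride, which is what lets libvirt derive the highest guest address (and hence the TCE overhead) from the number of passed-through GPUs alone. The base and stride below are illustrative values consistent with the flatview quoted earlier (nvlink2-mr at 1TiB, a 128GB window per device); they are not the actual QEMU macros:

/*
 * Hypothetical sketch of per-vPHB GPU RAM window placement; the constants
 * are assumptions for illustration, chosen to match the flatview above.
 */
#include <inttypes.h>
#include <stdio.h>

#define GPU_WIN_BASE  (1ULL << 40)    /* 1TiB; matches 0000010000000000 above */
#define GPU_WIN_SIZE  (128ULL << 30)  /* assumed 128GB window per vPHB        */

/* Hypothetical helper: start of the GPU RAM window for a given vPHB index. */
static uint64_t gpu_window_start(unsigned int phb_index)
{
    return GPU_WIN_BASE + (uint64_t)phb_index * GPU_WIN_SIZE;
}

int main(void)
{
    for (unsigned int i = 0; i < 3; i++) {
        printf("vPHB %u: GPU RAM window %016" PRIx64 "..%016" PRIx64 "\n",
               i, gpu_window_start(i),
               gpu_window_start(i) + GPU_WIN_SIZE - 1);
    }
    return 0;
}

With these assumed constants, vPHB 0 gets 0000010000000000..0000011fffffffff, matching the nvlink2-mr range shown in the flatview.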