From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 83FC7C4743D for ; Tue, 8 Jun 2021 06:54:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6D06C61246 for ; Tue, 8 Jun 2021 06:54:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230448AbhFHG4n (ORCPT ); Tue, 8 Jun 2021 02:56:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43254 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230323AbhFHG43 (ORCPT ); Tue, 8 Jun 2021 02:56:29 -0400 Received: from ozlabs.org (bilbo.ozlabs.org [IPv6:2401:3900:2:1::2]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E131BC06178B; Mon, 7 Jun 2021 23:54:36 -0700 (PDT) Received: by ozlabs.org (Postfix, from userid 1007) id 4Fzgvs0xfSz9sW6; Tue, 8 Jun 2021 16:54:33 +1000 (AEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gibson.dropbear.id.au; s=201602; t=1623135273; bh=6XrXYPDJzXROOAXXdd8a889l8o4k0qJLMeKixWa/h/0=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=QnUcTxkPj0P8bQn4t4lMXMTlpj+BS6+QLHtSAxmuZckPqPajgMMpSFAX5Z2hoqHtw 4Hxfli9VuKXYkZOJQZUHmpaUHjJD+toFoqGOdSK8Z9aT52LyvySbGtqGvjPjREBsla BEwUIpcVDf7x+PabPrjj8X7f3dvdgGyyjUyZ0/04= Date: Tue, 8 Jun 2021 15:49:25 +1000 From: David Gibson To: "Tian, Kevin" Cc: LKML , Joerg Roedel , Jason Gunthorpe , Lu Baolu , David Woodhouse , "iommu@lists.linux-foundation.org" , "kvm@vger.kernel.org" , "Alex Williamson (alex.williamson@redhat.com)" , Jason Wang , Eric Auger , Jonathan Corbet , "Raj, Ashok" , "Liu, Yi L" , "Wu, Hao" , "Jiang, Dave" , Jacob Pan , Jean-Philippe Brucker , Kirti Wankhede , Robin Murphy Subject: Re: [RFC] /dev/ioasid uAPI proposal Message-ID: References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="0m5ZmT5muBxt/Zml" Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --0m5ZmT5muBxt/Zml Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote: > > From: David Gibson > > Sent: Wednesday, June 2, 2021 2:15 PM > >=20 > [...]=20 > > > An I/O address space takes effect in the IOMMU only after it is attac= hed > > > to a device. The device in the /dev/ioasid context always refers to a > > > physical one or 'pdev' (PF or VF). > >=20 > > What you mean by "physical" device here isn't really clear - VFs > > aren't really physical devices, and the PF/VF terminology also doesn't > > extent to non-PCI devices (which I think we want to consider for the > > API, even if we're not implemenenting it any time soon). >=20 > Yes, it's not very clear, and more in PCI context to simplify the=20 > description. A "physical" one here means an PCI endpoint function > which has a unique RID. It's more to differentiate with later mdev/ > subdevice which uses both RID+PASID. Naming is always a hard > exercise to me... Possibly I'll just use device vs. subdevice in future > versions. >=20 > >=20 > > Now, it's clear that we can't program things into the IOMMU before > > attaching a device - we might not even know which IOMMU to use. >=20 > yes >=20 > > However, I'm not sure if its wise to automatically make the AS "real" > > as soon as we attach a device: > >=20 > > * If we're going to attach a whole bunch of devices, could we (for at > > least some IOMMU models) end up doing a lot of work which then has > > to be re-done for each extra device we attach? >=20 > which extra work did you specifically refer to? each attach just implies > writing the base address of the I/O page table to the IOMMU structure > corresponding to this device (either being a per-device entry, or per > device+PASID entry). >=20 > and generally device attach should not be in a hot path. >=20 > >=20 > > * With kernel managed IO page tables could attaching a second device > > (at least on some IOMMU models) require some operation which would > > require discarding those tables? e.g. if the second device somehow > > forces a different IO page size >=20 > Then the attach should fail and the user should create another IOASID > for the second device. Couldn't this make things weirdly order dependent though? If device A has strictly more capabilities than device B, then attaching A then B will be fine, but B then A will trigger a new ioasid fd. > > For that reason I wonder if we want some sort of explicit enable or > > activate call. Device attaches would only be valid before, map or > > attach pagetable calls would only be valid after. >=20 > I'm interested in learning a real example requiring explicit enable... >=20 > >=20 > > > One I/O address space could be attached to multiple devices. In this = case, > > > /dev/ioasid uAPI applies to all attached devices under the specified = IOASID. > > > > > > Based on the underlying IOMMU capability one device might be allowed > > > to attach to multiple I/O address spaces, with DMAs accessing them by > > > carrying different routing information. One of them is the default I/O > > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The > > > remaining are routed by RID + Process Address Space ID (PASID) or > > > Stream+Substream ID. For simplicity the following context uses RID and > > > PASID when talking about the routing information for I/O address spac= es. > >=20 > > I'm not really clear on how this interacts with nested ioasids. Would > > you generally expect the RID+PASID IOASes to be children of the base > > RID IOAS, or not? >=20 > No. With Intel SIOV both parent/children could be RID+PASID, e.g. > when one enables vSVA on a mdev. Hm, ok. I really haven't understood how the PASIDs fit into this then. I'll try again on v2. > > If the PASID ASes are children of the RID AS, can we consider this not > > as the device explicitly attaching to multiple IOASIDs, but instead > > attaching to the parent IOASID with awareness of the child ones? > >=20 > > > Device attachment is initiated through passthrough framework uAPI (use > > > VFIO for simplicity in following context). VFIO is responsible for id= entifying > > > the routing information and registering it to the ioasid driver when = calling > > > ioasid attach helper function. It could be RID if the assigned device= is > > > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In additi= on, > > > user might also provide its view of virtual routing information (vPAS= ID) in > > > the attach call, e.g. when multiple user-managed I/O address spaces a= re > > > attached to the vfio_device. In this case VFIO must figure out whether > > > vPASID should be directly used (for pdev) or converted to a kernel- > > > allocated one (pPASID, for mdev) for physical routing (see section 4). > > > > > > Device must be bound to an IOASID FD before attach operation can be > > > conducted. This is also through VFIO uAPI. In this proposal one device > > > should not be bound to multiple FD's. Not sure about the gain of > > > allowing it except adding unnecessary complexity. But if others have > > > different view we can further discuss. > > > > > > VFIO must ensure its device composes DMAs with the routing information > > > attached to the IOASID. For pdev it naturally happens since vPASID is > > > directly programmed to the device by guest software. For mdev this > > > implies any guest operation carrying a vPASID on this device must be > > > trapped into VFIO and then converted to pPASID before sent to the > > > device. A detail explanation about PASID virtualization policies can = be > > > found in section 4. > > > > > > Modern devices may support a scalable workload submission interface > > > based on PCI DMWr capability, allowing a single work queue to access > > > multiple I/O address spaces. One example is Intel ENQCMD, having > > > PASID saved in the CPU MSR and carried in the instruction payload > > > when sent out to the device. Then a single work queue shared by > > > multiple processes can compose DMAs carrying different PASIDs. > >=20 > > Is the assumption here that the processes share the IOASID FD > > instance, but not memory? >=20 > I didn't get this question Ok, stepping back, what exactly do you mean by "processes" above? Do you mean Linux processes, or something else? > > > When executing ENQCMD in the guest, the CPU MSR includes a vPASID > > > which, if targeting a mdev, must be converted to pPASID before sent > > > to the wire. Intel CPU provides a hardware PASID translation capabili= ty > > > for auto-conversion in the fast path. The user is expected to setup t= he > > > PASID mapping through KVM uAPI, with information about {vpasid, > > > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM > > > to figure out the actual pPASID given an IOASID. > > > > > > With above design /dev/ioasid uAPI is all about I/O address spaces. > > > It doesn't include any device routing information, which is only > > > indirectly registered to the ioasid driver through VFIO uAPI. For > > > example, I/O page fault is always reported to userspace per IOASID, > > > although it's physically reported per device (RID+PASID). If there is= a > > > need of further relaying this fault into the guest, the user is respo= nsible > > > of identifying the device attached to this IOASID (randomly pick one = if > > > multiple attached devices) and then generates a per-device virtual I/O > > > page fault into guest. Similarly the iotlb invalidation uAPI describe= s the > > > granularity in the I/O address space (all, or a range), different fro= m the > > > underlying IOMMU semantics (domain-wide, PASID-wide, range-based). > > > > > > I/O page tables routed through PASID are installed in a per-RID PASID > > > table structure. Some platforms implement the PASID table in the guest > > > physical space (GPA), expecting it managed by the guest. The guest > > > PASID table is bound to the IOMMU also by attaching to an IOASID, > > > representing the per-RID vPASID space. > >=20 > > Do we need to consider two management modes here, much as we have for > > the pagetables themsleves: either kernel managed, in which we have > > explicit calls to bind a vPASID to a parent PASID, or user managed in > > which case we register a table in some format. >=20 > yes, this is related to PASID virtualization in section 4. And based on= =20 > suggestion from Jason, the vPASID requirement will be reported to > user space via the per-device reporting interface. >=20 > Thanks > Kevin >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --0m5ZmT5muBxt/Zml Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEdfRlhq5hpmzETofcbDjKyiDZs5IFAmC/BOMACgkQbDjKyiDZ s5LK5BAAs3fDG9X4n5k72QI9E302AvlcHhqDEV+QO9FVrJIBUNIGWtArJ3QpaMo9 YsRdjSzjUe4K2FixvHk+vhrIL0UgsEpKME/3wnAcTdcLRlmuiYOxnoNmay2FSBvz +agr6JKxIPqKlIaweXV+ZgPE+UslAGwB+Y0WYZpAU/AgjHrv1WqDWKBdUdndhnKk tqqXFUhAPV/SiRNFx1kZf5b8UKpGJtAQPmoE8lwpE+9PQSNPeDNX9gXgNkPpCu6E 86Yci0bFwMyb+DMgzmZn4iIiE4wIRT5C96dQ3obnW8bKsQm36HZmNB9pEjnjvBiM GLG452DaoWg1eFxg8lEMyxsbo/eiEwC3QMzvXA8PlDsbptqxsRaE9GazeXKM97Vz YBOQ1kbBzfa0FzNl62c0JiXiksCG7hZk/HvUKkycKW/919eYSqEyh505caHGDY2f 7hbCyXSoHxk3SUYo3tF8pUKAesWi+9DUJHEcWMygcwyM+W5CEjnsE3MUjnJWeWk/ e92lErcq27pLOYi0nWWZJVTvDWHBGMdzm5QQjjCh3s8a5iD5s+Q8fOr2ersQtDyO /F+CcUqV+/E5auEHonn/6AKBcVapffQfoPCPQrsiuAEAOSoNYctoBcHAXX3BCHi1 gu8QAvrPQrUbLwXaqL2cgayEOif9QcKT/li/J7FAOA5xlqNPvzI= =qPF8 -----END PGP SIGNATURE----- --0m5ZmT5muBxt/Zml--