Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

From: Auger Eric <eric.auger@redhat.com>
To: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	Jacob Pan <jacob.pan.linux@gmail.com>,
	Raj Ashok <ashok.raj@intel.com>,
	David Woodhouse <dwmw2@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	iommu@lists.linux-foundation.org,
	Jean-Philippe Brucker <jean-philippe@linaro.com>,
	Wu Hao <hao.wu@intel.com>
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs
Date: Mon, 7 Sep 2020 10:03:39 +0200	[thread overview]
Message-ID: <9ab100e0-71ad-75d7-ffd7-caab4871cc9c@redhat.com> (raw)
In-Reply-To: <20200901095640.6f657756@jacob-builder>

Hi Jacob,

On 9/1/20 6:56 PM, Jacob Pan wrote:
> Hi Eric,
> 
> On Thu, 27 Aug 2020 18:21:07 +0200
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Jacob,
>> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
>>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:  
>>>> IOASID is used to identify address spaces that can be targeted by
>>>> device DMA. It is a system-wide resource that is essential to its
>>>> many users. This document is an attempt to help developers from
>>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
>>>> Scalable IO Virtualization (SIOV) enabled platforms are the
>>>> primary users of IOASID. Examples of how SIOV components interact
>>>> with IOASID APIs are provided in that many APIs are driven by the
>>>> requirements from SIOV.
>>>>
>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>>> Signed-off-by: Wu Hao <hao.wu@intel.com>
>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>> ---
>>>>  Documentation/ioasid.rst | 618
>>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
>>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
>>>>
>>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst  
>>>
>>> Thanks for writing this up. Should it go to
>>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I
>>> think this also needs to Cc linux-doc@vger.kernel.org and
>>> corbet@lwn.net 
>>>> new file mode 100644
>>>> index 000000000000..b6a8cdc885ff
>>>> --- /dev/null
>>>> +++ b/Documentation/ioasid.rst
>>>> @@ -0,0 +1,618 @@
>>>> +.. ioasid:
>>>> +
>>>> +=====================================
>>>> +IO Address Space ID
>>>> +=====================================
>>>> +
>>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or
>>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space
>>>> that DMA  
>>>
>>> "SubstreamID"  
>> On ARM if we don't use PASIDs we have streamids (SID) which can also
>> identify address spaces that DMA requests can target. So maybe this
>> definition is not sufficient.
>>
> According to SMMU spec, the SubstreamID is equivalent to PASID. My
> understanding is that SID is equivalent to PCI requester ID that
> identifies stage 2. Do you plan to use IOASID for stage 2?
No. So actually if PASID is not used we still have a default single
IOASID matching the single context. So that may be fine as a definition.
> IOASID is mostly for SVA and DMA request w/ PASID.
> 
>>>   
>>>> +requests can target.
>>>> +
>>>> +The primary use cases for IOASID are Shared Virtual Address (SVA)
>>>> and +IO Virtual Address (IOVA). However, the requirements for
>>>> IOASID  
>>>
>>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per
>>> device"? 
>>>> +management can vary among hardware architectures.
>>>> +
>>>> +This document covers the generic features supported by IOASID
>>>> +APIs. Vendor-specific use cases are also illustrated with Intel's
>>>> VT-d +based platforms as the first example.
>>>> +
>>>> +.. contents:: :local:
>>>> +
>>>> +Glossary
>>>> +========
>>>> +PASID - Process Address Space ID
>>>> +
>>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
>>>> +sub-stream ID in SMMU)  
>>>
>>> "SubstreamID"
>>>   
>>>> +
>>>> +SVA/SVM - Shared Virtual Addressing/Memory
>>>> +
>>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
>>>> [1]  
>>>
>>> Maybe drop the "New", to keep the documentation perennial. It might
>>> be good to add internal links here to the specifications URLs at
>>> the bottom. 
>>>> +
>>>> +DSA - Intel Data Streaming Accelerator [2]
>>>> +
>>>> +VDCM - Virtual device composition module [3]
>>>> +
>>>> +SIOV - Intel Scalable IO Virtualization
>>>> +
>>>> +
>>>> +Key Concepts
>>>> +============
>>>> +
>>>> +IOASID Set
>>>> +-----------
>>>> +An IOASID set is a group of IOASIDs allocated from the system-wide
>>>> +IOASID pool. An IOASID set is created and can be identified by a
>>>> +token of u64. Refer to IOASID set APIs for more details.  
>>>
>>> Identified either by an u64 or an mm_struct, right?  Maybe just
>>> drop the second sentence if it's detailed in the IOASID set section
>>> below. 
>>>> +
>>>> +IOASID set is particularly useful for guest SVA where each guest
>>>> could +have its own IOASID set for security and efficiency reasons.
>>>> +
>>>> +IOASID Set Private ID (SPID)
>>>> +----------------------------
>>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
>>>> a +system-wide IOASID but the namespace of SPID is within its
>>>> IOASID +set.  
>>>
>>> The intro isn't super clear. Perhaps this is simpler:
>>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
>>> single system-wide IOASID."  
>> or, "within an ioasid set, each ioasid can be associated with an alias
>> ID, named SPID."
> I don't have strong opinion, I feel it is good to explain the
> relationship between SPID and IOASID in both directions, how about add?
> " Conversely, each IOASID is associated with an alias ID, named SPID."
yep. I amy suggest: each IOASID may be associated with an alias ID,
local to the IOASID set, named SPID.
> 
>>>   
>>>> SPIDs can be used as guest IOASIDs where each guest could do
>>>> +IOASID allocation from its own pool and map them to host physical
>>>> +IOASIDs. SPIDs are particularly useful for supporting live
>>>> migration +where decoupling guest and host physical resources are
>>>> necessary. +
>>>> +For example, two VMs can both allocate guest PASID/SPID #101 but
>>>> map to +different host PASIDs #201 and #202 respectively as shown
>>>> in the +diagram below.
>>>> +::
>>>> +
>>>> + .------------------.    .------------------.
>>>> + |   VM 1           |    |   VM 2           |
>>>> + |                  |    |                  |
>>>> + |------------------|    |------------------|
>>>> + | GPASID/SPID 101  |    | GPASID/SPID 101  |
>>>> + '------------------'    -------------------'     Guest
>>>> + __________|______________________|______________________
>>>> +           |                      |               Host
>>>> +           v                      v
>>>> + .------------------.    .------------------.
>>>> + | Host IOASID 201  |    | Host IOASID 202  |
>>>> + '------------------'    '------------------'
>>>> + |   IOASID set 1   |    |   IOASID set 2   |
>>>> + '------------------'    '------------------'
>>>> +
>>>> +Guest PASID is treated as IOASID set private ID (SPID) within an
>>>> +IOASID set, mappings between guest and host IOASIDs are stored in
>>>> the +set for inquiry.
>>>> +
>>>> +IOASID APIs
>>>> +===========
>>>> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
>>>> These APIs +serve the following functionalities:
>>>> +
>>>> +  - IOASID allocation/Free
>>>> +  - Group management in the form of ioasid_set
>>>> +  - Private data storage and lookup
>>>> +  - Reference counting
>>>> +  - Event notification in case of state change  
>> (a)
> got it
> 
>>>> +
>>>> +IOASID Set Level APIs
>>>> +--------------------------
>>>> +For use cases such as guest SVA it is necessary to manage IOASIDs
>>>> at +a group level. For example, VMs may allocate multiple IOASIDs
>>>> for  
>> I would use the introduced ioasid_set terminology instead of "group".
> Right, we already introduced it.
> 
>>>> +guest process address sharing (vSVA). It is imperative to enforce
>>>> +VM-IOASID ownership such that malicious guest cannot target DMA  
>>>
>>> "a malicious guest"
>>>   
>>>> +traffic outside its own IOASIDs, or free an active IOASID belong
>>>> to  
>>>
>>> "that belongs to"
>>>   
>>>> +another VM.
>>>> +::
>>>> +
>>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>>> u32 type)  
>> what is this void *token? also the type may be explained here.
> token is explained in the text following API list. I can move it up.
> 
>>>> +
>>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);  
>>>
>>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
>>> be consistent with the rest of the API.
>>>   
>>>> +
>>>> + void ioasid_set_get(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_get_locked(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put_locked(struct ioasid_set *set)
>>>> +
>>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,  
>>>
>>> Might be nicer to keep the same argument names within the API. Here
>>> "set" rather than "sdata".
>>>   
>>>> +                                void (*fn)(ioasid_t id, void
>>>> *data),
>>>> +				void *data)  
>>>
>>> (alignment)
>>>   
>>>> +
>>>> +
>>>> +IOASID set concept is introduced to represent such IOASID groups.
>>>> Each  
>>>
>>> Or just "IOASID sets represent such IOASID groups", but might be
>>> redundant.
>>>   
>>>> +IOASID set is created with a token which can be one of the
>>>> following +types:  
>> I think this explanation should happen before the above function
>> prototypes
> ditto.
> 
>>>> +
>>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
>>>> +
>>>> +The explicit MM token type is useful when multiple users of an
>>>> IOASID +set under the same process need to communicate about their
>>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
>>>> can be associated +with the KVM instance for the same guest since
>>>> they share a common mm_struct. +
>>>> +The IOASID set APIs serve the following purposes:
>>>> +
>>>> + - Ownership/permission enforcement
>>>> + - Take collective actions, e.g. free an entire set
>>>> + - Event notifications within a set
>>>> + - Look up a set based on token
>>>> + - Quota enforcement  
>>>
>>> This paragraph could be earlier in the section  
>>
>> yes this is a kind of repetition of (a), above
> I meant to highlight on what the APIs do such that readers don't
> need to read the code instead.
> 
>>>   
>>>> +
>>>> +Individual IOASID APIs
>>>> +----------------------
>>>> +Once an ioasid_set is created, IOASIDs can be allocated from the
>>>> set. +Within the IOASID set namespace, set private ID (SPID) is
>>>> supported. In +the VM use case, SPID can be used for storing guest
>>>> PASID. +
>>>> +::
>>>> +
>>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>>> ioasid_t max,
>>>> +                       void *private);
>>>> +
>>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>>> +                   bool (*getter)(void *));
>>>> +
>>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
>>>> spid) +
>>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
>>>> +                        void *data);
>>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
>>>> +                        ioasid_t ssid);  
>>>   
>>> s/ssid/spid>
> got it
> 
>>>> +
>>>> +
>>>> +Notifications
>>>> +-------------
>>>> +An IOASID may have multiple users, each user may have hardware
>>>> context +associated with an IOASID. When the status of an IOASID
>>>> changes, +e.g. an IOASID is being freed, users need to be notified
>>>> such that the +associated hardware context can be cleared,
>>>> flushed, and drained. +
>>>> +::
>>>> +
>>>> + int ioasid_register_notifier(struct ioasid_set *set, struct
>>>> +                              notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier(struct ioasid_set *set,
>>>> +                                 struct notifier_block *nb)
>>>> +
>>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>>>> +                                 notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>>>> +                                    notifier_block *nb)  
>> the mm_struct prototypes may be justified
> This is the mm type token, i.e.
>  - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> I am not sure if it is better to keep the explanation in code or in
> this document, certainly don't want to duplicate.
OK. Maybe add a text explaining why it makes sense to register a
notifier at mm_struct granularity.
> 
>>>> +
>>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>>>> +                   unsigned int flags)  
>> this one is not obvious either.
> Here I just wanted to list the API functions, perhaps readers can check
> out the code comments?
OK never mind. The exercise is difficult anyway.
> 
>>>> +
>>>> +
>>>> +Events
>>>> +~~~~~~
>>>> +Notification events are pertinent to individual IOASIDs, they can
>>>> be +one of the following:
>>>> +
>>>> + - ALLOC
>>>> + - FREE
>>>> + - BIND
>>>> + - UNBIND
>>>> +
>>>> +Ordering
>>>> +~~~~~~~~
>>>> +Ordering is supported by IOASID notification priorities as the
>>>> +following (in ascending order):
>>>> +
>>>> +::
>>>> +
>>>> + enum ioasid_notifier_prios {
>>>> +	IOASID_PRIO_LAST,
>>>> +	IOASID_PRIO_IOMMU,
>>>> +	IOASID_PRIO_DEVICE,
>>>> +	IOASID_PRIO_CPU,
>>>> + };  
>>
>> Maybe:
>> when registered, notifiers are assigned a priority that affect the
>> call order. Notifiers with CPU priority get called before notifiers
>> with device priority and so on.
> Sounds good.
> 
>>>> +
>>>> +The typical use case is when an IOASID is freed due to an
>>>> exception, DMA +source should be quiesced before tearing down
>>>> other hardware contexts +in the system. This will reduce the churn
>>>> in handling faults. DMA work +submission is performed by the CPU
>>>> which is granted higher priority than +devices.
>>>> +
>>>> +
>>>> +Scopes
>>>> +~~~~~~
>>>> +There are two types of notifiers in IOASID core: system-wide and
>>>> +ioasid_set-wide.
>>>> +
>>>> +System-wide notifier is catering for users that need to handle all
>>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
>>>> +
>>>> +Per ioasid_set notifier can be used by VM specific components
>>>> such as +KVM. After all, each KVM instance only cares about
>>>> IOASIDs within its +own set.
>>>> +
>>>> +
>>>> +Atomicity
>>>> +~~~~~~~~~
>>>> +IOASID notifiers are atomic due to spinlocks used inside the
>>>> IOASID +core. For tasks cannot be completed in the notifier
>>>> handler, async work  
>>>
>>> "tasks that cannot be"
>>>   
>>>> +can be submitted to complete the work later as long as there is no
>>>> +ordering requirement.
>>>> +
>>>> +Reference counting
>>>> +------------------
>>>> +IOASID lifecycle management is based on reference counting. Users
>>>> of +IOASID intend to align lifecycle with the IOASID need to hold  
>>>
>>> "who intend to"
>>>   
>>>> +reference of the IOASID. IOASID will not be returned to the pool
>>>> for  
>>>
>>> "a reference to the IOASID. The IOASID"
>>>   
>>>> +allocation until all references are dropped. Calling ioasid_free()
>>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
>>>> +reference. ioasid_get() is not allowed once an IOASID is in the
>>>> +FREE_PENDING state.
>>>> +
>>>> +Event notifications are used to inform users of IOASID status
>>>> change. +IOASID_FREE event prompts users to drop their references
>>>> after +clearing its context.
>>>> +
>>>> +For example, on VT-d platform when an IOASID is freed, teardown
>>>> +actions are performed on KVM, device driver, and IOMMU driver.
>>>> +KVM shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_kvm = {
>>>> +	.notifier_call = pasid_status_change_kvm,
>>>> +	.priority      = IOASID_PRIO_CPU,
>>>> + };
>>>> +
>>>> +VDCM driver shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_vdcm = {
>>>> +	.notifier_call = pasid_status_change_vdcm,
>>>> +	.priority      = IOASID_PRIO_DEVICE,
>>>> + };  
>> not sure those code snippets are really useful. Maybe simply say who
>> is supposed to use each prio.
> Agreed, not all the bits in the snippets are explained. I will explain
> KVM and VDCM need to use priority to ensure call order.
> 
>>>> +
>>>> +In both cases, notifier blocks shall be registered on the IOASID
>>>> set +such that *only* events from the matching VM is received.
>>>> +
>>>> +If KVM attempts to register notifier block before the IOASID set
>>>> is +created for the MM token, the notifier block will be placed on
>>>> a  
>> using the MM token
> sounds good
> 
>>>> +pending list inside IOASID core. Once the token matching IOASID
>>>> set +is created, IOASID will register the notifier block
>>>> automatically.  
>> Is this implementation mandated? Can't you enforce the ioasid_set to
>> be created before the notifier gets registered?
>>>> +IOASID core does not replay events for the existing IOASIDs in the
>>>> +set. For IOASID set of MM type, notification blocks can be
>>>> registered +on empty sets only. This is to avoid lost events.
>>>> +
>>>> +IOMMU driver shall register notifier block on global chain::
>>>> +
>>>> + static struct notifier_block pasid_nb_vtd = {
>>>> +	.notifier_call = pasid_status_change_vtd,
>>>> +	.priority      = IOASID_PRIO_IOMMU,
>>>> + };
>>>> +
>>>> +Custom allocator APIs
>>>> +---------------------
>>>> +
>>>> +::
>>>> +
>>>> + int ioasid_register_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> +Allocator Choices
>>>> +~~~~~~~~~~~~~~~~~
>>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
>>>> However, +allocators can be different. For example, on VT-d guest
>>>> PASID +allocation must be performed via a virtual command
>>>> interface which is +emulated by VMM.
>>>> +
>>>> +IOASID core has the notion of "custom allocator" such that guest
>>>> can +register virtual command allocator that precedes the default
>>>> one. +
>>>> +Namespaces
>>>> +~~~~~~~~~~
>>>> +IOASIDs are limited system resources that default to 20 bits in
>>>> +size. Since each device has its own table, theoretically the
>>>> namespace +can be per device also. However, for security reasons
>>>> sharing PASID +tables among devices are not good for isolation.
>>>> Therefore, IOASID +namespace is system-wide.  
>>>
>>> I don't follow this development. Having per-device PASID table
>>> would work fine for isolation (assuming no hardware bug
>>> necessitating IOMMU groups). If I remember correctly IOASID space
>>> was chosen to be OS-wide because it simplifies the management code
>>> (single PASID per task), and it is system-wide across VMs only in
>>> the case of VT-d scalable mode. 
>>>> +
>>>> +There are also other reasons to have this simpler system-wide
>>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
>>>> +and ENQCMD[1] where one IOASID could be used to submit work on  
>>>
>>> Maybe use the Sphinx glossary syntax rather than "[1]"
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>>>   
>>>> +multiple devices that are shared with other VMs. This requires
>>>> IOASID +to be system-wide. This is also the reason why guests must
>>>> use an +emulated virtual command interface to allocate IOASID from
>>>> the host. +
>>>> +
>>>> +Life cycle
>>>> +==========
>>>> +This section covers IOASID lifecycle management for both
>>>> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
>>>> directly hooked +up with IOMMU driver, therefore the process
>>>> address space (MM) +lifecycle is aligned with IOASID.  
>> therefore the IOASID lifecyle matches the process address space (MM)
>> lifecyle?
> Sounds good.
> 
>>>> +
>>>> +However, guest MMU notifier is not available to host IOMMU
>>>> driver,  
>> the guest MMU notifier
>>>> +when guest MM terminates unexpectedly, the events have to go
>>>> through  
>> the guest MM
>>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
>>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
>>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
>>>> +
>>>> +Native IOASID Life Cycle (VT-d Example)
>>>> +---------------------------------------
>>>> +
>>>> +The normal flow of native SVA code with Intel Data Streaming
>>>> +Accelerator(DSA) [2] as example:
>>>> +
>>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
>>>> +2. DSA driver allocate WQ, do sva_bind_device();
>>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
>>>> +   mmu_notifier_get()
>>>> +4. DMA starts by DSA driver userspace
>>>> +5. DSA userspace close FD
>>>> +6. DSA/uacce kernel driver handles FD.close()
>>>> +7. DSA driver stops DMA
>>>> +8. DSA driver calls sva_unbind_device();
>>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
>>>> +   TLBs. mmu_notifier_put() called.
>>>> +10. mmu_notifier.release() called, IOMMU SVA code calls
>>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
>>>> +
>>>> +::
>>>> +  
>>>
>>> Use a footnote?
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes 
>>>> +   * With ENQCMD, PASID used on VT-d is not released in
>>>> mmu_notifier() but
>>>> +     mmdrop(). mmdrop comes after FD close. Should not matter.  
>>>
>>> "comes after FD close, which doesn't make a difference?"
>>> The following might not be necessary since early process
>>> termination is described later.
>>>   
>>>> +     If the user process dies unexpectedly, Step #10 may come
>>>> before
>>>> +     Step #5, in between, all DMA faults discarded. PRQ responded
>>>> with  
>>>
>>> PRQ hasn't been defined in this document.
>>>   
>>>> +     code INVALID REQUEST.
>>>> +
>>>> +During the normal teardown, the following three steps would
>>>> happen in +order:  
>> can't this be illustrated in the above 1-11 sequence, just adding
>> NORMAL TEARDONW before #7?
>>>> +
>>>> +1. Device driver stops DMA request
>>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
>>>> in-flight
>>>> +   requests.
>>>> +3. IOASID freed
>>>> +  
>> Then you can just focus on abnormal termination
> Yes, will refer to the steps starting #7. These can be removed.
> 
>>>> +Exception happens when process terminates *before* device driver
>>>> stops +DMA and call IOMMU driver to unbind. The flow of process
>>>> exists are as  
>> Can't this be explained with something simpler looking at the steps
>> 1-11?
> It meant to be educational given this level of details. Simpler
> steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
> out right? I will use the steps in the 1-11 sequence.
> 
>>>
>>> "exits"
>>>   
>>>> +follows:
>>>> +
>>>> +::
>>>> +
>>>> +   do_exit() {
>>>> +	exit_mm() {
>>>> +		mm_put();
>>>> +		exit_mmap() {
>>>> +			intel_invalidate_range() //mmu notifier
>>>> +			tlb_finish_mmu()
>>>> +			mmu_notifier_release(mm) {
>>>> +				intel_iommu_release() {
>>>> +   [2]
>>>> intel_iommu_teardown_pasid();  
>>>
>>> Parentheses might be better than square brackets for step numbers
>>>   
>>>> +                                        intel_iommu_flush_tlbs();
>>>> +				}
>>>> +				// tlb_invalidate_range cb removed
>>>> +			}
>>>> +			unmap_vmas();
>>>> +                        free_pgtables(); // IOMMU cannot walk PGT
>>>> after this
>>>> +		};
>>>> +	}
>>>> +	exit_files(tsk) {
>>>> +		close_files() {
>>>> +			dsa_close();
>>>> +   [1]			dsa_stop_dma();
>>>> +                        intel_svm_unbind_pasid(); //nothing to do
>>>> +		}
>>>> +	}
>>>> +   }
>>>> +
>>>> +   mmdrop() /* some random time later, lazy mm user */ {
>>>> +   	mm_free_pgd();
>>>> +        destroy_context(mm); {
>>>> +   [3]	        ioasid_free();
>>>> +	}
>>>> +   }
>>>> +
>>>> +As shown in the list above, step #2 could happen before
>>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>>>> +
>>>> +Also notice that TLB invalidation occurs at mmu_notifier
>>>> +invalidate_range callback as well as the release callback. The
>>>> reason +is that release callback will delete IOMMU driver from the
>>>> notifier +chain which may skip invalidate_range() calls during the
>>>> exit path. +
>>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
>>>> disable  
>> UR?
> Unrecoverable, mentioned in the previous paragraph.
> 
>>>> +fault reporting after free and before unbind.
>>>> +
>>>> +Guest IOASID Life Cycle (VT-d Example)
>>>> +--------------------------------------
>>>> +Guest IOASID life cycle starts with guest driver open(), this
>>>> could be +uacce or individual accelerator driver such as DSA. At
>>>> FD open, +sva_bind_device() is called which triggers a series of
>>>> actions. +
>>>> +The example below is an illustration of *normal* operations that
>>>> +involves *all* the SW components in VT-d. The flow can be simpler
>>>> if +no ENQCMD is supported.
>>>> +
>>>> +::
>>>> +
>>>> +     VFIO        IOMMU        KVM        VDCM        IOASID
>>>> Ref
>>>> +   ..................................................................
>>>> +   1             ioasid_register_notifier/_mm()
>>>> +   2 ioasid_alloc()
>>>> 1
>>>> +   3 bind_gpasid()
>>>> +   4             iommu_bind()->ioasid_get()
>>>> 2
>>>> +   5             ioasid_notify(BIND)
>>>> +   6                          -> ioasid_get()
>>>> 3
>>>> +   7                          -> vmcs_update_atomic()
>>>> +   8 mdev_write(gpasid)
>>>> +   9                                    hpasid=
>>>> +   10                                   find_by_spid(gpasid)
>>>> 4
>>>> +   11                                   vdev_write(hpasid)
>>>> +   12 -------- GUEST STARTS DMA --------------------------
>>>> +   13 -------- GUEST STOPS DMA --------------------------
>>>> +   14 mdev_clear(gpasid)
>>>> +   15                                   vdev_clear(hpasid)
>>>> +   16
>>>> ioasid_put()               3
>>>> +   17 unbind_gpasid()
>>>> +   18            iommu_ubind()
>>>> +   19            ioasid_notify(UNBIND)
>>>> +   20                          -> vmcs_update_atomic()
>>>> +   21                          ->
>>>> ioasid_put()                     2
>>>> +   22
>>>> ioasid_free()                                                1
>>>> +   23
>>>> ioasid_put()                                      0
>>>> +   24                                                 Reclaimed
>>>> +   -------------- New Life Cycle Begin
>>>> ----------------------------
>>>> +   1  ioasid_alloc()
>>>> ->           1 +
>>>> +   Note: IOASID Notification Events: FREE, BIND, UNBIND
>>>> +
>>>> +Exception cases arise when a guest crashes or a malicious guest
>>>> +attempts to cause disruption on the host system. The fault
>>>> handling +rules are:
>>>> +
>>>> +1. IOASID free must *always* succeed.
>>>> +2. An inactive period may be required before the freed IOASID is
>>>> +   reclaimed. During this period, consumers of IOASID perform
>>>> cleanup. +3. Malfunction is limited to the guest owned resources
>>>> for all
>>>> +   programming errors.
>>>> +
>>>> +The primary source of exception is when the following are out of
>>>> +order:
>>>> +
>>>> +1. Start/Stop of DMA activity
>>>> +   (Guest device driver, mdev via VFIO)  
>> please explain the meaning of what is inside (): initiator?
>>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
>>>> +   (Host IOMMU driver bind/unbind)
>>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
>>>> +   case of ENQCMD
>>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
>>>> +5. IOASID alloc/free (Host IOASID)
>>>> +
>>>> +VFIO is the *only* user-kernel interface, which is ultimately
>>>> +responsible for exception handlings.  
>>>
>>> "handling"
>>>   
>>>> +
>>>> +#1 is processed the same way as the assigned device today based on
>>>> +device file descriptors and events. There is no special handling.
>>>> +
>>>> +#3 is based on bind/unbind events emitted by #2.
>>>> +
>>>> +#4 is naturally aligned with IOASID life cycle in that an illegal
>>>> +guest PASID programming would fail in obtaining reference of the
>>>> +matching host IOASID.
>>>> +
>>>> +#5 is similar to #4. The fault will be reported to the user if
>>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation
>>>> table. +
>>>> +Therefore, the remaining out of order problem is between #2 and
>>>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
>>>> +
>>>> +IOASID notifier and refcounting are used to ensure order.
>>>> Following +a publisher-subscriber pattern where:  
>> with the following actors:
>>>> +
>>>> +- Publishers: VFIO & IOMMU
>>>> +- Subscribers: KVM, VDCM, IOMMU  
>> this may be introduced before.
>>>> +
>>>> +IOASID notifier is atomic which requires subscribers to do quick
>>>> +handling of the event in the atomic context. Workqueue can be
>>>> used for +any processing that requires thread context.  
>> repetition of what was said before.
>>  IOASID reference must be
> Right, will remove.
> 
>>>> +acquired before receiving the FREE event. The reference must be
>>>> +dropped at the end of the processing in order to return the
>>>> IOASID to +the pool.
>>>> +
>>>> +Let's examine the IOASID life cycle again when free happens
>>>> *before* +unbind. This could be a result of misbehaving guests or
>>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
>>>> that the setup part up +until step #12 is identical to the normal
>>>> case, the flow below starts +with step 13.
>>>> +
>>>> +::
>>>> +
>>>> +     VFIO        IOMMU        KVM        VDCM        IOASID
>>>> Ref
>>>> +   ..................................................................
>>>> +   13 -------- GUEST STARTS DMA --------------------------
>>>> +   14 -------- *GUEST MISBEHAVES!!!* ----------------
>>>> +   15 ioasid_free()
>>>> +   16
>>>> ioasid_notify(FREE)
>>>> +   17
>>>> mark_ioasid_inactive[1]
>>>> +   18                          kvm_nb_handler(FREE)
>>>> +   19                          vmcs_update_atomic()
>>>> +   20                          ioasid_put_locked()   ->
>>>> 3
>>>> +   21                                   vdcm_nb_handler(FREE)
>>>> +   22            iomm_nb_handler(FREE)
>>>> +   23 ioasid_free() returns[2]          schedule_work()
>>>> 2
>>>> +   24            schedule_work()        vdev_clear_wk(hpasid)
>>>> +   25            teardown_pasid_wk()
>>>> +   26                                   ioasid_put() ->
>>>> 1
>>>> +   27            ioasid_put()
>>>> 0
>>>> +   28                                                 Reclaimed
>>>> +   29 unbind_gpasid()
>>>> +   30            iommu_unbind()->ioasid_find() Fails[3]
>>>> +   -------------- New Life Cycle Begin
>>>> ---------------------------- +
>>>> +Note:
>>>> +
>>>> +1. By marking IOASID inactive at step #17, no new references can
>>>> be  
>>>
>>> Is "inactive" FREE_PENDING?
>>>   
>>>> +   held. ioasid_get/find() will return -ENOENT;
>>>> +2. After step #23, all events can go out of order. Shall not
>>>> affect
>>>> +   the outcome.
>>>> +3. IOMMU driver fails to find private data for unbinding. If
>>>> unbind is
>>>> +   called after the same IOASID is allocated for the same guest
>>>> again,
>>>> +   this is a programming error. The damage is limited to the guest
>>>> +   itself since unbind performs permission checking based on the
>>>> +   IOASID set associated with the guest process.
>>>> +
>>>> +KVM PASID Translation Table Updates
>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> +Per VM PASID translation table is maintained by KVM in order to
>>>> +support ENQCMD in the guest. The table contains host-guest PASID
>>>> +translations to be consumed by CPU ucode. The synchronization of
>>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
>>>> atomic +notifiers are used. KVM must register IOASID notifier per
>>>> VM instance +during launch time. The following events are handled:
>>>> +
>>>> +1. BIND/UNBIND
>>>> +2. FREE
>>>> +
>>>> +Rules:
>>>> +
>>>> +1. Multiple devices can bind with the same PASID, this can be
>>>> different PCI
>>>> +   devices or mdevs within the same PCI device. However, only the
>>>> +   *first* BIND and *last* UNBIND emit notifications.
>>>> +2. IOASID code is responsible for ensuring the correctness of H-G
>>>> +   PASID mapping. There is no need for KVM to validate the
>>>> +   notification data.
>>>> +3. When UNBIND happens *after* FREE, KVM will see error in
>>>> +   ioasid_get() even when the reclaim is not done. IOMMU driver
>>>> will
>>>> +   also avoid sending UNBIND if the PASID is already FREE.
>>>> +4. When KVM terminates *before* FREE & UNBIND, references will be
>>>> +   dropped for all host PASIDs.
>>>> +
>>>> +VDCM PASID Programming
>>>> +~~~~~~~~~~~~~~~~~~~~~~
>>>> +VDCM composes virtual devices and exposes them to the guests. When
>>>> +the guest allocates a PASID then program it to the virtual
>>>> device, VDCM  
>> programs as well
>>>> +intercepts the programming attempt then program the matching
>>>> host  
>>>
>>> "programs"
>>>
>>> Thanks,
>>> Jean
>>>   
>>>> +PASID on to the hardware.
>>>> +Conversely, when a device is going away, VDCM must be informed
>>>> such +that PASID context on the hardware can be cleared. There
>>>> could be +multiple mdevs assigned to different guests in the same
>>>> VDCM. Since +the PASID table is shared at PCI device level, lazy
>>>> clearing is not +secure. A malicious guest can attack by using
>>>> newly freed PASIDs that +are allocated by another guest.
>>>> +
>>>> +By holding a reference of the PASID until VDCM cleans up the HW
>>>> context, +it is guaranteed that PASID life cycles do not cross
>>>> within the same +device.
>>>> +
>>>> +
>>>> +Reference
>>>> +====================================================
>>>> +1.
>>>> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
>>>> + +2.
>>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>> + +3.
>>>> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>>>> -- 2.7.4  
>>
>> Thanks
>>
>> Eric
>>>>  
>>>   
>>
>> _______________________________________________
>> iommu mailing list
>> iommu@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> [Jacob Pan]
> 
Thanks

Eric

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu