* [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment
@ 2024-05-07  6:18 Yan Zhao
  2024-05-07  6:19 ` [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range Yan Zhao
                   ` (4 more replies)
  0 siblings, 5 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:18 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

This is a follow-up series to fix the security risk for non-coherent device
assignment raised by Jason in [1].

When the IOMMU does not enforce cache coherency, devices are allowed to
perform non-coherent DMAs (DMAs that lack CPU cache snooping). This scenario
poses a risk of information leakage when the device is assigned to a VM.
Specifically, a malicious guest could potentially retrieve stale host data
through non-coherent DMA reads of physical memory, while the data
initialized by the host (e.g., zeros) still resides in the cache.

Furthermore, the host kernel (e.g. a ksm thread) might encounter inconsistent
data between the CPU cache and physical memory (left by a malicious guest)
after a page is unpinned for DMA but before the page is recycled.

Therefore, a mitigation in VFIO/IOMMUFD is required to flush CPU caches on
pages involved in non-coherent DMAs, before they are mapped into the IOMMU
and after they are unmapped from it.

The mitigation is not implemented in the DMA API layer, so as to avoid
slowing down DMA API users. Users of the DMA API are expected to take care
of CPU cache flushing in one of two ways: (a) by using a DMA API that is
aware of the non-coherence and does the flushes internally, or (b) by being
aware of their flushing needs and handling them on their own if they
override the platform using no-snoop. A general mitigation in the DMA API
layer will only come when non-coherent DMAs are common, which is not the
case today (currently only Intel GPUs and some ARM devices).
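
As an illustration of (a), a driver doing streaming DMA simply relies on the
DMA API to perform whatever cache maintenance the platform needs. A generic
sketch, not part of this series; example_dma_rx(), "dev", "buf" and "len"
are placeholders, and <linux/dma-mapping.h> is assumed to be included:

	/* e.g. in some driver's receive path */
	static int example_dma_rx(struct device *dev, void *buf, size_t len)
	{
		dma_addr_t handle;

		handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, handle))
			return -ENOMEM;

		/* ... device DMAs into buf ... */

		dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);
		/* CPU reads of buf are coherent from here on */

		dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
		return 0;
	}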

The mitigation is also not implemented in the IOMMU core exclusively for
VMs, because the IOMMU core lacks information about the IOVA-to-PFN
relationship, which would force a large IOTLB flush range to be split.

Given that non-coherent devices exist on both x86 and ARM, this series
introduces an arch helper to flush CPU caches for non-coherent DMAs that is
available to both VFIO and IOMMUFD, though currently only the x86
implementation is provided.
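
For reference, the helper introduced in patch 3 has the following shape; x86
provides the implementation and other architectures currently fall back to a
no-op stub in <linux/cacheflush.h>:

	/* declared in arch/x86/include/asm/cacheflush.h (patch 3) */
	void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t length);

	/* example call shape in the VFIO/IOMMUFD paths: flush one pinned
	 * page before it becomes reachable by a non-coherent domain
	 */
	arch_clean_nonsnoop_dma(PFN_PHYS(pfn), PAGE_SIZE);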


Series Layout:

Patch 1 first fixes an error in pat_pfn_immune_to_uc_mtrr(), whose
        underlying lookup_memtype() always returns WB for untracked PAT
        ranges. This error leads to KVM treating all PFNs within these
        untracked PAT ranges as cacheable memory types, even when a PFN's
        MTRR type is UC (an example is the VGA range 0xa0000-0xbffff).
        Patch 3 will use pat_pfn_immune_to_uc_mtrr() to determine
        uncacheable PFNs.

Patch 2 is a side fix in KVM to prevent guest cacheable access to PFNs
        mapped as UC in the host.

Patch 3 introduces and exports an arch helper arch_clean_nonsnoop_dma() to
        flush CPU cachelines. It takes a physical address and size as inputs
        and provides an implementation for x86.
        Given that executing CLFLUSH on certain MMIO ranges on x86 can be
        problematic, potentially causing machine check exceptions on some
        platforms, while flushing is necessary on some other MMIO ranges
        (e.g., some MMIO ranges for PMEM), this patch determines
        cacheability by consulting the PAT (if enabled) or the MTRR type (if
        PAT is disabled) to assess whether a PFN is considered uncacheable
        by the host. For reserved pages or !pfn_valid() PFNs, CLFLUSH is
        avoided if the PFN is recognized as uncacheable by the host.

Patches 4/5 implement the mitigation in vfio/iommufd to flush CPU caches
         - before a page is accessible to non-coherent DMAs, and
         - after the page is inaccessible to non-coherent DMAs, right
           before it's unpinned for DMAs (a sketch of both points follows).
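
A minimal sketch of where the flushes land in the VFIO type1 paths
(simplified from patch 4; the IOMMUFD side in patch 5 follows the same
pattern):

	/* map path (vfio_pin_map_dma(), simplified): flush before the
	 * pinned pages become reachable by non-coherent DMA
	 */
	if (dma->cache_flush_required)
		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
	ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, dma->prot);

	/* unpin path (vfio_unpin_pages_remote(), simplified): flush after
	 * the pages are unreachable, right before they are unpinned
	 */
	if (dma->cache_flush_required)
		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
	/* ... put_pfn() on each page follows ... */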


Performance data:

The overhead of flushing CPU caches is measured below:
CPU MHz: 4494.377, 4 vCPUs, 8G guest memory
Pass-through GPU: 1G aperture

Across each VM boot up and tear down,

IOMMUFD     |     Map        |   Unmap        | Teardown 
------------|----------------|----------------|-------------
w/o clflush | 1167M          |   40M          |  201M
w/  clflush | 2400M (+1233M) |  276M (+236M)  | 1160M (+959M)

Map = total cycles of iommufd_ioas_map() during VM boot up
Unmap = total cycles of iommufd_ioas_unmap() during VM boot up
Teardown = total cycles of iommufd_hwpt_paging_destroy() at VM teardown

VFIO        |     Map        |   Unmap        | Teardown 
------------|----------------|----------------|-------------
w/o clflush | 3058M          |  379M          |  448M
w/  clflush | 5664M (+2606M) | 1653M (+1274M) | 1522M (+1074M)

Map = total cycles of vfio_dma_do_map() during VM boot up
Unmap = total cycles of vfio_dma_do_unmap() during VM boot up
Teardown = total cycles of vfio_iommu_type1_detach_group() at VM teardown

[1] https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com

Yan Zhao (5):
  x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT
    range
  KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is
    MMIO
  x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  iommufd: Flush CPU caches on DMA pages in non-coherent domains

 arch/x86/include/asm/cacheflush.h       |  3 +
 arch/x86/kvm/mmu/spte.c                 | 14 +++-
 arch/x86/mm/pat/memtype.c               | 12 +++-
 arch/x86/mm/pat/set_memory.c            | 88 +++++++++++++++++++++++++
 drivers/iommu/iommufd/hw_pagetable.c    | 19 +++++-
 drivers/iommu/iommufd/io_pagetable.h    |  5 ++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/pages.c           | 44 ++++++++++++-
 drivers/vfio/vfio_iommu_type1.c         | 51 ++++++++++++++
 include/linux/cacheflush.h              |  6 ++
 10 files changed, 237 insertions(+), 6 deletions(-)


base-commit: e67572cd2204894179d89bd7b984072f19313b03
-- 
2.17.1



* [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
@ 2024-05-07  6:19 ` Yan Zhao
  2024-05-07  8:26   ` Tian, Kevin
  2024-05-07  6:20 ` [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO Yan Zhao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:19 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

Let pat_pfn_immune_to_uc_mtrr() check MTRR type for PFNs in untracked PAT
range.

pat_pfn_immune_to_uc_mtrr() is used by KVM to distinguish MMIO PFNs and
give them UC memory type in the EPT page tables.
When pat_pfn_immune_to_uc_mtrr() identifies a PFN as having a PAT type of
UC/WC/UC-, it indicates that the PFN should be accessed using an
uncacheable memory type. Consequently, KVM maps it with UC in the EPT to
ensure that the guest's memory access is uncacheable.

Internally, pat_pfn_immune_to_uc_mtrr() utilizes lookup_memtype() to
determine PAT type for a PFN. For a PFN outside untracked PAT range, the
returned PAT type is either
- The type set by memtype_reserve()
  (which, in turn, calls pat_x_mtrr_type() to adjust the requested type to
   UC- if the requested type is WB but the MTRR type does not match WB),
- Or UC-, if memtype_reserve() has not yet been invoked for this PFN.

However, lookup_memtype() defaults to returning WB for PFNs within the
untracked PAT range, regardless of their actual MTRR type. This behavior
could lead KVM to misclassify the PFN as non-MMIO, permitting cacheable
guest access. Such access might result in an MCE on certain platforms (e.g.
CLFLUSH on the VGA range 0xA0000-0xBFFFF triggers an MCE on some platforms).

Hence, invoke pat_x_mtrr_type() for PFNs within the untracked PAT range so
as to take MTRR type into account to mitigate potential MCEs.

Fixes: b8d7044bcff7 ("x86/mm: add a function to check if a pfn is UC/UC-/WC")
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/mm/pat/memtype.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 36b603d0cdde..e85e8c5737ad 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -705,7 +705,17 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
  */
 bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
 {
-	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
+	u64 paddr = PFN_PHYS(pfn);
+	enum page_cache_mode cm;
+
+	/*
+	 * Check MTRR type for untracked pat range since lookup_memtype() always
+	 * returns WB for this range.
+	 */
+	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
+		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE, _PAGE_CACHE_MODE_WB);
+	else
+		cm = lookup_memtype(paddr);
 
 	return cm == _PAGE_CACHE_MODE_UC ||
 	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
-- 
2.17.1



* [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO
  2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
  2024-05-07  6:19 ` [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range Yan Zhao
@ 2024-05-07  6:20 ` Yan Zhao
  2024-05-07  8:39   ` Tian, Kevin
  2024-05-07  6:20 ` [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma() Yan Zhao
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:20 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

Add a fine-grained check to decide whether a PFN, which is !pfn_valid() and
identified within the raw e820 table as RAM, should be treated as MMIO by
KVM in order to prevent guest cacheable access.

Previously, a PFN that is !pfn_valid() and identified within the raw e820
table as RAM was not considered as MMIO. This covers the scenario where
"mem=" was passed to the kernel, resulting in certain valid pages lacking
an associated struct page. See commit 0c55671f84ff ("kvm, x86: Properly
check whether a pfn is an MMIO or not").

However, that approach is based only on the guest performance perspective
and may permit cacheable access to potential MMIO PFNs even when
pat_pfn_immune_to_uc_mtrr() identifies the PFN as having a PAT type of
UC/WC/UC-. Therefore, do a fine-grained check of the PAT in the primary MMU
so that KVM maps the PFN as UC in the EPT to prevent cacheable access from
the guest.

For the rare case when PAT is not enabled, default the PFN to MMIO to avoid
further checking of MTRRs (since the MTRR-checking functions are not
currently exported).

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/spte.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..5db0fb7b74f5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -101,9 +101,21 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 			 */
 			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
 
+	/*
+	 * If the PFN is invalid and not RAM in raw e820 table, keep treating it
+	 * as MMIO.
+	 *
+	 * If the PFN is invalid and is RAM in raw e820 table,
+	 * - if PAT is not enabled, always treat the PFN as MMIO to avoid further
+	 *   checking of MTRRs.
+	 * - if PAT is enabled, treat the PFN as MMIO if its PAT is UC/WC/UC- in
+	 *   primary MMU.
+	 * to prevent guest cacheable access to MMIO PFNs.
+	 */
 	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
-				     E820_TYPE_RAM);
+				     E820_TYPE_RAM) ? true :
+				     (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
 }
 
 /*
-- 
2.17.1



* [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
  2024-05-07  6:19 ` [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range Yan Zhao
  2024-05-07  6:20 ` [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO Yan Zhao
@ 2024-05-07  6:20 ` Yan Zhao
  2024-05-07  8:51   ` Tian, Kevin
  2024-05-20 14:07   ` Christoph Hellwig
  2024-05-07  6:21 ` [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains Yan Zhao
  2024-05-07  6:22 ` [PATCH 5/5] iommufd: " Yan Zhao
  4 siblings, 2 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:20 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

Introduce and export interface arch_clean_nonsnoop_dma() to flush CPU
caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
snooping).

When the IOMMU does not enforce cache coherency, devices are allowed to
perform non-coherent DMAs. This scenario poses a risk of information leakage
when the device is assigned to a VM. Specifically, a malicious guest could
potentially retrieve stale host data through non-coherent DMA reads of
physical memory, while the data initialized by the host (e.g., zeros) still
resides in the cache.

Additionally, the host kernel (e.g. a ksm kthread) might read inconsistent
data from the CPU cache/memory (left by a malicious guest) after a page is
unpinned for non-coherent DMA but before it's freed.

Therefore, VFIO/IOMMUFD must initiate a CPU cache flush for pages involved
in non-coherent DMAs prior to or following their mapping or unmapping to or
from the IOMMU.

Introduce and export an interface accepting a contiguous physical address
range as input to help flush CPU caches in an architecture-specific way for
VFIO/IOMMUFD (currently x86 only).

CLFLUSH on MMIO ranges on x86 is generally undesired and sometimes causes an
MCE on certain platforms (e.g. executing CLFLUSH on the VGA range
0xA0000-0xBFFFF causes an MCE on some platforms). Meanwhile, some MMIO
ranges are cacheable and demand CLFLUSH (e.g. certain MMIO ranges for PMEM).
Hence, a method of checking the host PAT/MTRR for uncacheable memory is
adopted.

This implementation always performs CLFLUSH on "pfn_valid() && !reserved"
pages (since they cannot be MMIO). For the reserved or !pfn_valid() cases,
check the host PAT/MTRR to bypass uncacheable physical ranges in the host
and do CLFLUSH on the remaining cacheable ranges.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/cacheflush.h |  3 ++
 arch/x86/mm/pat/set_memory.c      | 88 +++++++++++++++++++++++++++++++
 include/linux/cacheflush.h        |  6 +++
 3 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index b192d917a6d0..b63607994285 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -10,4 +10,7 @@
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t length);
+#define arch_clean_nonsnoop_dma arch_clean_nonsnoop_dma
+
 #endif /* _ASM_X86_CACHEFLUSH_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 80c9037ffadf..7ff08ad20369 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -34,6 +34,7 @@
 #include <asm/memtype.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
+#include <asm/mtrr.h>
 
 #include "../mm_internal.h"
 
@@ -349,6 +350,93 @@ void arch_invalidate_pmem(void *addr, size_t size)
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
 
+/*
+ * Flush pfn_valid() and !PageReserved() page
+ */
+static void clflush_page(struct page *page)
+{
+	const int size = boot_cpu_data.x86_clflush_size;
+	unsigned int i;
+	void *va;
+
+	va = kmap_local_page(page);
+
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+	for (i = 0; i < PAGE_SIZE; i += size)
+		clflushopt(va + i);
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+
+	kunmap_local(va);
+}
+
+/*
+ * Flush a reserved page or !pfn_valid() PFN.
+ * Flush is not performed if the PFN is accessed in uncacheable type. i.e.
+ * - PAT type is UC/UC-/WC when PAT is enabled
+ * - MTRR type is UC/WC/WT/WP when PAT is not enabled.
+ *   (no need to do CLFLUSH though WT/WP is cacheable).
+ */
+static void clflush_reserved_or_invalid_pfn(unsigned long pfn)
+{
+	const int size = boot_cpu_data.x86_clflush_size;
+	unsigned int i;
+	void *va;
+
+	if (!pat_enabled()) {
+		u64 start = PFN_PHYS(pfn), end = start + PAGE_SIZE;
+		u8 mtrr_type, uniform;
+
+		mtrr_type = mtrr_type_lookup(start, end, &uniform);
+		if (mtrr_type != MTRR_TYPE_WRBACK)
+			return;
+	} else if (pat_pfn_immune_to_uc_mtrr(pfn)) {
+		return;
+	}
+
+	va = memremap(pfn << PAGE_SHIFT, PAGE_SIZE, MEMREMAP_WB);
+	if (!va)
+		return;
+
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+	for (i = 0; i < PAGE_SIZE; i += size)
+		clflushopt(va + i);
+	/* CLFLUSHOPT is unordered and requires full memory barrier */
+	mb();
+
+	memunmap(va);
+}
+
+static inline void clflush_pfn(unsigned long pfn)
+{
+	if (pfn_valid(pfn) &&
+	    (!PageReserved(pfn_to_page(pfn)) || is_zero_pfn(pfn)))
+		return clflush_page(pfn_to_page(pfn));
+
+	clflush_reserved_or_invalid_pfn(pfn);
+}
+
+/**
+ * arch_clean_nonsnoop_dma - flush a cache range for non-coherent DMAs
+ *                           (DMAs that lack CPU cache snooping).
+ * @phys_addr:	physical address start
+ * @length:	number of bytes to flush
+ */
+void arch_clean_nonsnoop_dma(phys_addr_t phys_addr, size_t length)
+{
+	unsigned long nrpages, pfn;
+	unsigned long i;
+
+	pfn = PHYS_PFN(phys_addr);
+	nrpages = PAGE_ALIGN((phys_addr & ~PAGE_MASK) + length) >> PAGE_SHIFT;
+
+	for (i = 0; i < nrpages; i++, pfn++)
+		clflush_pfn(pfn);
+}
+EXPORT_SYMBOL_GPL(arch_clean_nonsnoop_dma);
+
 #ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
 bool cpu_cache_has_invalidate_memregion(void)
 {
diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
index 55f297b2c23f..0bfc6551c6d3 100644
--- a/include/linux/cacheflush.h
+++ b/include/linux/cacheflush.h
@@ -26,4 +26,10 @@ static inline void flush_icache_pages(struct vm_area_struct *vma,
 
 #define flush_icache_page(vma, page)	flush_icache_pages(vma, page, 1)
 
+#ifndef arch_clean_nonsnoop_dma
+static inline void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t length)
+{
+}
+#endif
+
 #endif /* _LINUX_CACHEFLUSH_H */
-- 
2.17.1



* [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
                   ` (2 preceding siblings ...)
  2024-05-07  6:20 ` [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma() Yan Zhao
@ 2024-05-07  6:21 ` Yan Zhao
  2024-05-09 18:10   ` Alex Williamson
  2024-05-07  6:22 ` [PATCH 5/5] iommufd: " Yan Zhao
  4 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:21 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

Flush CPU cache on DMA pages before mapping them into the first
non-coherent domain (domain that does not enforce cache coherency, i.e. CPU
caches are not force-snooped) and after unmapping them from the last
domain.

Devices attached to non-coherent domains can execute non-coherent DMAs
(DMAs that lack CPU cache snooping) to access physical memory with CPU
caches bypassed.

Such a scenario could be exploited by a malicious guest, allowing it to
access stale host data in memory rather than the data initialized by the
host (e.g., zeros) in the cache, thus posing a risk of information leakage.

Furthermore, the host kernel (e.g. a ksm thread) might encounter
inconsistent data between the CPU cache and memory (left by a malicious
guest) after a page is unpinned for DMA but before it's recycled.

Therefore, it is required to flush the CPU cache before a page is
accessible to non-coherent DMAs and after the page is inaccessible to
non-coherent DMAs.

However, the CPU cache is not flushed immediately when the page is unmapped
from the last non-coherent domain. Instead, the flushing is performed
lazily, right before the page is unpinned.
Take the following example to illustrate the process. The CPU cache is
flushed right before step 2 and step 5.
1. A page is mapped into a coherent domain.
2. The page is mapped into a non-coherent domain.
3. The page is unmapped from the non-coherent domain, e.g. due to hot-unplug.
4. The page is unmapped from the coherent domain.
5. The page is unpinned.

Reasons for adopting this lazy flushing design include:
- There are several unmap paths and only one unpin path. Lazily flushing
  before unpin wipes out the inconsistency between the cache and physical
  memory before a page is globally visible and produces code that is
  simpler, more maintainable and easier to backport.
- It avoids dividing a large unmap range into several smaller ones or
  allocating additional memory to hold the IOVA-to-HPA relationship.

Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b5c15fe8f9fc..ce873f4220bf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			has_noncoherent_domain;
 	struct list_head	emulated_iommu_groups;
 };
 
@@ -99,6 +100,7 @@ struct vfio_dma {
 	unsigned long		*bitmap;
 	struct mm_struct	*mm;
 	size_t			locked_vm;
+	bool			cache_flush_required; /* For noncoherent domain */
 };
 
 struct vfio_batch {
@@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	long unlocked = 0, locked = 0;
 	long i;
 
+	if (dma->cache_flush_required)
+		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
+
 	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
 		if (put_pfn(pfn++, dma->prot)) {
 			unlocked++;
@@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 					    &iotlb_gather);
 	}
 
+	dma->cache_flush_required = false;
+
 	if (do_accounting) {
 		vfio_lock_acct(dma, -unlocked, true);
 		return 0;
@@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	iommu->dma_avail++;
 }
 
+static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	bool has_noncoherent = false;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (domain->enforce_cache_coherency)
+			continue;
+
+		has_noncoherent = true;
+		break;
+	}
+	iommu->has_noncoherent_domain = has_noncoherent;
+}
+
 static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
@@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	vfio_batch_init(&batch);
 
+	/*
+	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
+	 * for both pin & map and unmap & unpin (for unwind) paths.
+	 */
+	dma->cache_flush_required = iommu->has_noncoherent_domain;
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
@@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			break;
 		}
 
+		if (dma->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						npage << PAGE_SHIFT);
+
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
 				     dma->prot);
@@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
+		bool cache_flush_required;
 
 		dma = rb_entry(n, struct vfio_dma, node);
 		iova = dma->iova;
+		cache_flush_required = !domain->enforce_cache_coherency &&
+				       !dma->cache_flush_required;
+		if (cache_flush_required)
+			dma->cache_flush_required = true;
 
 		while (iova < dma->iova + dma->size) {
 			phys_addr_t phys;
@@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size = npage << PAGE_SHIFT;
 			}
 
+			if (cache_flush_required)
+				arch_clean_nonsnoop_dma(phys, size);
+
 			ret = iommu_map(domain->domain, iova, phys, size,
 					dma->prot | IOMMU_CACHE,
 					GFP_KERNEL_ACCOUNT);
@@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
 						size >> PAGE_SHIFT, true);
 		}
+		dma->cache_flush_required = false;
 	}
 
 	vfio_batch_fini(&batch);
@@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 	if (!pages)
 		return;
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	list_for_each_entry(region, regions, list) {
 		start = ALIGN(region->start, PAGE_SIZE * 2);
 		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
@@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 		break;
 	}
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	__free_pages(pages, order);
 }
 
@@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	list_add(&domain->next, &iommu->domain_list);
 	vfio_update_pgsize_bitmap(iommu);
+	if (!domain->enforce_cache_coherency)
+		vfio_update_noncoherent_domain_state(iommu);
 done:
 	/* Delete the old one and insert new iova list */
 	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
@@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 			}
 			iommu_domain_free(domain->domain);
 			list_del(&domain->next);
+			if (!domain->enforce_cache_coherency)
+				vfio_update_noncoherent_domain_state(iommu);
 			kfree(domain);
 			vfio_iommu_aper_expand(iommu, &iova_copy);
 			vfio_update_pgsize_bitmap(iommu);
-- 
2.17.1



* [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
                   ` (3 preceding siblings ...)
  2024-05-07  6:21 ` [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains Yan Zhao
@ 2024-05-07  6:22 ` Yan Zhao
  2024-05-09 14:13   ` Jason Gunthorpe
  4 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  6:22 UTC (permalink / raw)
  To: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Yan Zhao

Flush CPU cache on DMA pages before mapping them into the first
non-coherent domain (domain that does not enforce cache coherency, i.e. CPU
caches are not force-snooped) and after unmapping them from the last
domain.

Devices attached to non-coherent domains can execute non-coherent DMAs
(DMAs that lack CPU cache snooping) to access physical memory with CPU
caches bypassed.

Such a scenario could be exploited by a malicious guest, allowing it to
access stale host data in memory rather than the data initialized by the
host (e.g., zeros) in the cache, thus posing a risk of information leakage.

Furthermore, the host kernel (e.g. a ksm thread) might encounter
inconsistent data between the CPU cache and memory (left by a malicious
guest) after a page is unpinned for DMA but before it's recycled.

Therefore, it is required to flush the CPU cache before a page is
accessible to non-coherent DMAs and after the page is inaccessible to
non-coherent DMAs.

However, the CPU cache is not flushed immediately when the page is unmapped
from the last non-coherent domain. Instead, the flushing is performed
lazily, right before the page is unpinned.
Take the following example to illustrate the process. The CPU cache is
flushed right before step 2 and step 5.
1. A page is mapped into a coherent domain.
2. The page is mapped into a non-coherent domain.
3. The page is unmapped from the non-coherent domain, e.g. due to hot-unplug.
4. The page is unmapped from the coherent domain.
5. The page is unpinned.

Reasons for adopting this lazy flushing design include:
- There are several unmap paths and only one unpin path. Lazily flushing
  before unpin wipes out the inconsistency between the cache and physical
  memory before a page is globally visible and produces code that is
  simpler, more maintainable and easier to backport.
- It avoids dividing a large unmap range into several smaller ones or
  allocating additional memory to hold the IOVA-to-HPA relationship.

Unlike "has_noncoherent_domain" flag used in vfio_iommu, the
"noncoherent_domain_cnt" counter is implemented in io_pagetable to track
whether an iopt has non-coherent domains attached.
Such a difference is because in iommufd only hwpt of type paging contains
flag "enforce_cache_coherency" and iommu domains in io_pagetable has no
flag "enforce_cache_coherency" as that in vfio_domain.
A counter in io_pagetable can avoid traversing ioas->hwpt_list and holding
ioas->mutex.

Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
Fixes: e8d57210035b ("iommufd: Add kAPI toward external drivers for physical devices")
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/iommu/iommufd/hw_pagetable.c    | 19 +++++++++--
 drivers/iommu/iommufd/io_pagetable.h    |  5 +++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/pages.c           | 44 +++++++++++++++++++++++--
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 33d142f8057d..e3099d732c5c 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -14,12 +14,18 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
 		container_of(obj, struct iommufd_hwpt_paging, common.obj);
 
 	if (!list_empty(&hwpt_paging->hwpt_item)) {
+		struct io_pagetable *iopt = &hwpt_paging->ioas->iopt;
 		mutex_lock(&hwpt_paging->ioas->mutex);
 		list_del(&hwpt_paging->hwpt_item);
 		mutex_unlock(&hwpt_paging->ioas->mutex);
 
-		iopt_table_remove_domain(&hwpt_paging->ioas->iopt,
-					 hwpt_paging->common.domain);
+		iopt_table_remove_domain(iopt, hwpt_paging->common.domain);
+
+		if (!hwpt_paging->enforce_cache_coherency) {
+			down_write(&iopt->domains_rwsem);
+			iopt->noncoherent_domain_cnt--;
+			up_write(&iopt->domains_rwsem);
+		}
 	}
 
 	if (hwpt_paging->common.domain)
@@ -176,6 +182,12 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			goto out_abort;
 	}
 
+	if (!hwpt_paging->enforce_cache_coherency) {
+		down_write(&ioas->iopt.domains_rwsem);
+		ioas->iopt.noncoherent_domain_cnt++;
+		up_write(&ioas->iopt.domains_rwsem);
+	}
+
 	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);
 	if (rc)
 		goto out_detach;
@@ -183,6 +195,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 	return hwpt_paging;
 
 out_detach:
+	down_write(&ioas->iopt.domains_rwsem);
+	ioas->iopt.noncoherent_domain_cnt--;
+	up_write(&ioas->iopt.domains_rwsem);
 	if (immediate_attach)
 		iommufd_hw_pagetable_detach(idev);
 out_abort:
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0ec3509b7e33..557da8fb83d9 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -198,6 +198,11 @@ struct iopt_pages {
 	void __user *uptr;
 	bool writable:1;
 	u8 account_mode;
+	/*
+	 * CPU cache flush is required before mapping the pages to or after
+	 * unmapping it from a noncoherent domain
+	 */
+	bool cache_flush_required:1;
 
 	struct xarray pinned_pfns;
 	/* Of iopt_pages_access::node */
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 991f864d1f9b..fc77fd43b232 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -53,6 +53,7 @@ struct io_pagetable {
 	struct rb_root_cached reserved_itree;
 	u8 disable_large_pages;
 	unsigned long iova_alignment;
+	unsigned int noncoherent_domain_cnt;
 };
 
 void iopt_init_table(struct io_pagetable *iopt);
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 528f356238b3..8f4b939cba5b 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -272,6 +272,17 @@ struct pfn_batch {
 	unsigned int total_pfns;
 };
 
+static void iopt_cache_flush_pfn_batch(struct pfn_batch *batch)
+{
+	unsigned long cur, i;
+
+	for (cur = 0; cur < batch->end; cur++) {
+		for (i = 0; i < batch->npfns[cur]; i++)
+			arch_clean_nonsnoop_dma(PFN_PHYS(batch->pfns[cur] + i),
+						PAGE_SIZE);
+	}
+}
+
 static void batch_clear(struct pfn_batch *batch)
 {
 	batch->total_pfns = 0;
@@ -637,10 +648,18 @@ static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
 	while (npages) {
 		size_t to_unpin = min_t(size_t, npages,
 					batch->npfns[cur] - first_page_off);
+		unsigned long pfn = batch->pfns[cur] + first_page_off;
+
+		/*
+		 * Lazily flushing CPU caches when a page is about to be
+		 * unpinned if the page was mapped into a noncoherent domain
+		 */
+		if (pages->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						to_unpin << PAGE_SHIFT);
 
 		unpin_user_page_range_dirty_lock(
-			pfn_to_page(batch->pfns[cur] + first_page_off),
-			to_unpin, pages->writable);
+			pfn_to_page(pfn), to_unpin, pages->writable);
 		iopt_pages_sub_npinned(pages, to_unpin);
 		cur++;
 		first_page_off = 0;
@@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 {
 	unsigned long done_end_index;
 	struct pfn_reader pfns;
+	bool cache_flush_required;
 	int rc;
 
 	lockdep_assert_held(&area->pages->mutex);
 
+	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
+			       !area->pages->cache_flush_required;
+
+	if (cache_flush_required)
+		area->pages->cache_flush_required = true;
+
 	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
 			      iopt_area_last_index(area));
 	if (rc)
@@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 
 	while (!pfn_reader_done(&pfns)) {
 		done_end_index = pfns.batch_start_index;
+		if (cache_flush_required)
+			iopt_cache_flush_pfn_batch(&pfns.batch);
+
 		rc = batch_to_domain(&pfns.batch, domain, area,
 				     pfns.batch_start_index);
 		if (rc)
@@ -1413,6 +1442,7 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	unsigned long unmap_index;
 	struct pfn_reader pfns;
 	unsigned long index;
+	bool cache_flush_required;
 	int rc;
 
 	lockdep_assert_held(&area->iopt->domains_rwsem);
@@ -1426,9 +1456,19 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	if (rc)
 		goto out_unlock;
 
+	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
+			       !pages->cache_flush_required;
+
+	if (cache_flush_required)
+		pages->cache_flush_required = true;
+
 	while (!pfn_reader_done(&pfns)) {
 		done_first_end_index = pfns.batch_end_index;
 		done_all_end_index = pfns.batch_start_index;
+
+		if (cache_flush_required)
+			iopt_cache_flush_pfn_batch(&pfns.batch);
+
 		xa_for_each(&area->iopt->domains, index, domain) {
 			rc = batch_to_domain(&pfns.batch, domain, area,
 					     pfns.batch_start_index);
-- 
2.17.1



* RE: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-07  6:19 ` [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range Yan Zhao
@ 2024-05-07  8:26   ` Tian, Kevin
  2024-05-07  9:12     ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-07  8:26 UTC (permalink / raw)
  To: Zhao, Yan Y, kvm, linux-kernel, x86, alex.williamson, jgg
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, May 7, 2024 2:19 PM
> 
> However, lookup_memtype() defaults to returning WB for PFNs within the
> untracked PAT range, regardless of their actual MTRR type. This behavior
> could lead KVM to misclassify the PFN as non-MMIO, permitting cacheable
> guest access. Such access might result in MCE on certain platforms, (e.g.
> clflush on VGA range (0xA0000-0xBFFFF) triggers MCE on some platforms).

the VGA range is not exposed to any guest today. So is it just trying to
fix a theoretical problem?

> @@ -705,7 +705,17 @@ static enum page_cache_mode
> lookup_memtype(u64 paddr)
>   */
>  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
>  {
> -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> +	u64 paddr = PFN_PHYS(pfn);
> +	enum page_cache_mode cm;
> +
> +	/*
> +	 * Check MTRR type for untracked pat range since lookup_memtype()
> always
> +	 * returns WB for this range.
> +	 */
> +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> _PAGE_CACHE_MODE_WB);

doing so violates the name of this function. The PAT of the untracked
range is still WB and not immune to UC MTRR.

> +	else
> +		cm = lookup_memtype(paddr);
> 
>  	return cm == _PAGE_CACHE_MODE_UC ||
>  	       cm == _PAGE_CACHE_MODE_UC_MINUS ||



* RE: [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO
  2024-05-07  6:20 ` [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO Yan Zhao
@ 2024-05-07  8:39   ` Tian, Kevin
  2024-05-07  9:19     ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-07  8:39 UTC (permalink / raw)
  To: Zhao, Yan Y, kvm, linux-kernel, x86, alex.williamson, jgg
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, May 7, 2024 2:20 PM
> @@ -101,9 +101,21 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  			 */
>  			(!pat_enabled() ||
> pat_pfn_immune_to_uc_mtrr(pfn));
> 
> +	/*
> +	 * If the PFN is invalid and not RAM in raw e820 table, keep treating it
> +	 * as MMIO.
> +	 *
> +	 * If the PFN is invalid and is RAM in raw e820 table,
> +	 * - if PAT is not enabled, always treat the PFN as MMIO to avoid
> further
> +	 *   checking of MTRRs.
> +	 * - if PAT is enabled, treat the PFN as MMIO if its PAT is UC/WC/UC-
> in
> +	 *   primary MMU.
> +	 * to prevent guest cacheable access to MMIO PFNs.
> +	 */
>  	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
>  				     pfn_to_hpa(pfn + 1) - 1,
> -				     E820_TYPE_RAM);
> +				     E820_TYPE_RAM) ? true :
> +				     (!pat_enabled() ||
> pat_pfn_immune_to_uc_mtrr(pfn));

Is it for another theoretical problem in case the primary
mmu uses a non-WB type on an invalid RAM-type pfn so
you want to do additional scrutiny here?


* RE: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-07  6:20 ` [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma() Yan Zhao
@ 2024-05-07  8:51   ` Tian, Kevin
  2024-05-07  9:40     ` Yan Zhao
  2024-05-20 14:07   ` Christoph Hellwig
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-07  8:51 UTC (permalink / raw)
  To: Zhao, Yan Y, kvm, linux-kernel, x86, alex.williamson, jgg
  Cc: iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, May 7, 2024 2:21 PM
> 
> +
> +/*
> + * Flush a reserved page or !pfn_valid() PFN.
> + * Flush is not performed if the PFN is accessed in uncacheable type. i.e.
> + * - PAT type is UC/UC-/WC when PAT is enabled
> + * - MTRR type is UC/WC/WT/WP when PAT is not enabled.
> + *   (no need to do CLFLUSH though WT/WP is cacheable).
> + */

As long as a page is cacheable (being WB/WT/WP) the malicious
guest can always use non-coherent DMA to make cache/memory
inconsistent, hence clflush is still required after unmapping such
page from the IOMMU page table to avoid leaking the inconsistency
state back to the host.

> +
> +/**
> + * arch_clean_nonsnoop_dma - flush a cache range for non-coherent DMAs
> + *                           (DMAs that lack CPU cache snooping).
> + * @phys_addr:	physical address start
> + * @length:	number of bytes to flush
> + */
> +void arch_clean_nonsnoop_dma(phys_addr_t phys_addr, size_t length)
> +{
> +	unsigned long nrpages, pfn;
> +	unsigned long i;
> +
> +	pfn = PHYS_PFN(phys_addr);
> +	nrpages = PAGE_ALIGN((phys_addr & ~PAGE_MASK) + length) >>
> PAGE_SHIFT;
> +
> +	for (i = 0; i < nrpages; i++, pfn++)
> +		clflush_pfn(pfn);
> +}
> +EXPORT_SYMBOL_GPL(arch_clean_nonsnoop_dma);

this is not a good name. The code has nothing to do with nonsnoop
dma aspect. It's just a general helper accepting a physical pfn to flush
CPU cache, with nonsnoop dma as one potential caller usage.

It's clearer to be arch_flush_cache_phys().

and probably drm_clflush_pages() can be converted to use this
helper too.
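
For illustration, a hypothetical sketch of what that conversion could look
like (arch_flush_cache_phys() is only the name suggested above, not an
existing interface):

	void drm_clflush_pages(struct page *pages[], unsigned long num_pages)
	{
		unsigned long i;

		/* flush each page by physical address via the new helper */
		for (i = 0; i < num_pages; i++)
			arch_flush_cache_phys(page_to_phys(pages[i]),
					      PAGE_SIZE);
	}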


* Re: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-07  8:26   ` Tian, Kevin
@ 2024-05-07  9:12     ` Yan Zhao
  2024-05-08 22:14       ` Alex Williamson
  2024-05-16  7:42       ` Tian, Kevin
  0 siblings, 2 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  9:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, linux-kernel, x86, alex.williamson, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Tuesday, May 7, 2024 2:19 PM
> > 
> > However, lookup_memtype() defaults to returning WB for PFNs within the
> > untracked PAT range, regardless of their actual MTRR type. This behavior
> > could lead KVM to misclassify the PFN as non-MMIO, permitting cacheable
> > guest access. Such access might result in MCE on certain platforms, (e.g.
> > clflush on VGA range (0xA0000-0xBFFFF) triggers MCE on some platforms).
> 
> the VGA range is not exposed to any guest today. So is it just trying to
> fix a theoretical problem?

Yes. Not sure if VGA range is allowed to be exposed to guest in future, given
we have VFIO variant drivers.

> 
> > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > lookup_memtype(u64 paddr)
> >   */
> >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> >  {
> > -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > +	u64 paddr = PFN_PHYS(pfn);
> > +	enum page_cache_mode cm;
> > +
> > +	/*
> > +	 * Check MTRR type for untracked pat range since lookup_memtype()
> > always
> > +	 * returns WB for this range.
> > +	 */
> > +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > _PAGE_CACHE_MODE_WB);
> 
> doing so violates the name of this function. The PAT of the untracked
> range is still WB and not immune to UC MTRR.
Right.
Do you think we can rename this function to something like
pfn_of_uncachable_effective_memory_type() and make it work under !pat_enabled()
too?

> 
> > +	else
> > +		cm = lookup_memtype(paddr);
> > 
> >  	return cm == _PAGE_CACHE_MODE_UC ||
> >  	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
> 


* Re: [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether an invalid & RAM PFN is MMIO
  2024-05-07  8:39   ` Tian, Kevin
@ 2024-05-07  9:19     ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  9:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, linux-kernel, x86, alex.williamson, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, May 07, 2024 at 04:39:27PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Tuesday, May 7, 2024 2:20 PM
> > @@ -101,9 +101,21 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> >  			 */
> >  			(!pat_enabled() ||
> > pat_pfn_immune_to_uc_mtrr(pfn));
> > 
> > +	/*
> > +	 * If the PFN is invalid and not RAM in raw e820 table, keep treating it
> > +	 * as MMIO.
> > +	 *
> > +	 * If the PFN is invalid and is RAM in raw e820 table,
> > +	 * - if PAT is not enabled, always treat the PFN as MMIO to avoid
> > further
> > +	 *   checking of MTRRs.
> > +	 * - if PAT is enabled, treat the PFN as MMIO if its PAT is UC/WC/UC-
> > in
> > +	 *   primary MMU.
> > +	 * to prevent guest cacheable access to MMIO PFNs.
> > +	 */
> >  	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
> >  				     pfn_to_hpa(pfn + 1) - 1,
> > -				     E820_TYPE_RAM);
> > +				     E820_TYPE_RAM) ? true :
> > +				     (!pat_enabled() ||
> > pat_pfn_immune_to_uc_mtrr(pfn));
> 
> Is it for another theoretical problem in case the primary
> mmu uses a non-WB type on a invalid RAM-type pfn so
> you want to do additional scrutiny here?
Yes. Another untold reason is that patch 3 does not do CLFLUSH on this type of
memory since it's mapped as uncacheable in the primary MMU. I feel it's better
to ensure the guest will not access it with a cacheable memory type either.


* Re: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-07  8:51   ` Tian, Kevin
@ 2024-05-07  9:40     ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-07  9:40 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, linux-kernel, x86, alex.williamson, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, May 07, 2024 at 04:51:31PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Tuesday, May 7, 2024 2:21 PM
> > 
> > +
> > +/*
> > + * Flush a reserved page or !pfn_valid() PFN.
> > + * Flush is not performed if the PFN is accessed in uncacheable type. i.e.
> > + * - PAT type is UC/UC-/WC when PAT is enabled
> > + * - MTRR type is UC/WC/WT/WP when PAT is not enabled.
> > + *   (no need to do CLFLUSH though WT/WP is cacheable).
> > + */
> 
> As long as a page is cacheable (being WB/WT/WP) the malicious
> guest can always use non-coherent DMA to make cache/memory
> inconsistent, hence clflush is still required after unmapping such
> page from the IOMMU page table to avoid leaking the inconsistency
> state back to the host.
You are right.
I should only check whether the MTRR type is UC or WC, as below.

static void clflush_reserved_or_invalid_pfn(unsigned long pfn)                  
{                                                                               
       const int size = boot_cpu_data.x86_clflush_size;                         
       unsigned int i;                                                          
       void *va;                                                                
                                                                                
       if (!pat_enabled()) {                                                    
               u64 start = PFN_PHYS(pfn), end = start + PAGE_SIZE;              
               u8 mtrr_type, uniform;                                           
                                                                                
               mtrr_type = mtrr_type_lookup(start, end, &uniform);              
                if ((mtrr_type == MTRR_TYPE_UNCACHABLE) || (mtrr_type == MTRR_TYPE_WRCOMB))
                       return;                                                  
       } else if (pat_pfn_immune_to_uc_mtrr(pfn)) {                             
               return;                                                          
       }                                                                        
       ...                                                                           
} 

Also, for the pat_enabled() case where pat_pfn_immune_to_uc_mtrr() is called,
maybe pat_x_mtrr_type() cannot be called in patch 1 for the untracked PAT
range, because pat_x_mtrr_type() will return UC- if the MTRR type is WT/WP,
which will cause pat_pfn_immune_to_uc_mtrr() to return true and CLFLUSH to be
skipped.


static unsigned long pat_x_mtrr_type(u64 start, u64 end,
                                     enum page_cache_mode req_type)
{
        /*
         * Look for MTRR hint to get the effective type in case where PAT
         * request is for WB.
         */
        if (req_type == _PAGE_CACHE_MODE_WB) {
                u8 mtrr_type, uniform;

                mtrr_type = mtrr_type_lookup(start, end, &uniform);
                if (mtrr_type != MTRR_TYPE_WRBACK)
                        return _PAGE_CACHE_MODE_UC_MINUS;

                return _PAGE_CACHE_MODE_WB;
        }

        return req_type;
}

> 
> > +
> > +/**
> > + * arch_clean_nonsnoop_dma - flush a cache range for non-coherent DMAs
> > + *                           (DMAs that lack CPU cache snooping).
> > + * @phys_addr:	physical address start
> > + * @length:	number of bytes to flush
> > + */
> > +void arch_clean_nonsnoop_dma(phys_addr_t phys_addr, size_t length)
> > +{
> > +	unsigned long nrpages, pfn;
> > +	unsigned long i;
> > +
> > +	pfn = PHYS_PFN(phys_addr);
> > +	nrpages = PAGE_ALIGN((phys_addr & ~PAGE_MASK) + length) >>
> > PAGE_SHIFT;
> > +
> > +	for (i = 0; i < nrpages; i++, pfn++)
> > +		clflush_pfn(pfn);
> > +}
> > +EXPORT_SYMBOL_GPL(arch_clean_nonsnoop_dma);
> 
> this is not a good name. The code has nothing to do with nonsnoop
> dma aspect. It's just a general helper accepting a physical pfn to flush
> CPU cache, with nonsnoop dma as one potential caller usage.
> 
> It's clearer to be arch_flush_cache_phys().
> 
> and probably drm_clflush_pages() can be converted to use this
> helper too.
Yes, I agree, though arch_clean_nonsnoop_dma() might have its merit if its
implementations on other platforms needed to do something specific to
non-snoop DMA.





* Re: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-07  9:12     ` Yan Zhao
@ 2024-05-08 22:14       ` Alex Williamson
  2024-05-09  3:36         ` Yan Zhao
  2024-05-16  7:42       ` Tian, Kevin
  1 sibling, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-08 22:14 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Tian, Kevin, kvm, linux-kernel, x86, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, 7 May 2024 17:12:40 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > Sent: Tuesday, May 7, 2024 2:19 PM
> > > 
> > > However, lookup_memtype() defaults to returning WB for PFNs within the
> > > untracked PAT range, regardless of their actual MTRR type. This behavior
> > > could lead KVM to misclassify the PFN as non-MMIO, permitting cacheable
> > > guest access. Such access might result in MCE on certain platforms, (e.g.
> > > clflush on VGA range (0xA0000-0xBFFFF) triggers MCE on some platforms).  
> > 
> > the VGA range is not exposed to any guest today. So is it just trying to
> > fix a theoretical problem?  
> 
> Yes. Not sure if VGA range is allowed to be exposed to guest in future, given
> we have VFIO variant drivers.

include/uapi/linux/vfio.h:
        /*
         * Expose VGA regions defined for PCI base class 03, subclass 00.
         * This includes I/O port ranges 0x3b0 to 0x3bb and 0x3c0 to 0x3df
         * as well as the MMIO range 0xa0000 to 0xbffff.  Each implemented
         * range is found at it's identity mapped offset from the region
         * offset, for example 0x3b0 is region_info.offset + 0x3b0.  Areas
         * between described ranges are unimplemented.
         */
        VFIO_PCI_VGA_REGION_INDEX,

We don't currently support mmap for this region though, so I think we
still don't technically require this, but I guess an mmap through KVM
is theoretically possible.  Thanks,

Alex

> > > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > > lookup_memtype(u64 paddr)
> > >   */
> > >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> > >  {
> > > -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > > +	u64 paddr = PFN_PHYS(pfn);
> > > +	enum page_cache_mode cm;
> > > +
> > > +	/*
> > > +	 * Check MTRR type for untracked pat range since lookup_memtype()
> > > always
> > > +	 * returns WB for this range.
> > > +	 */
> > > +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> > > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > > _PAGE_CACHE_MODE_WB);  
> > 
> > doing so violates the name of this function. The PAT of the untracked
> > range is still WB and not immune to UC MTRR.  
> Right.
> Do you think we can rename this function to something like
> pfn_of_uncachable_effective_memory_type() and make it work under !pat_enabled()
> too?
> 
> >   
> > > +	else
> > > +		cm = lookup_memtype(paddr);
> > > 
> > >  	return cm == _PAGE_CACHE_MODE_UC ||
> > >  	       cm == _PAGE_CACHE_MODE_UC_MINUS ||  
> >   
> 



* Re: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-08 22:14       ` Alex Williamson
@ 2024-05-09  3:36         ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-09  3:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, kvm, linux-kernel, x86, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Wed, May 08, 2024 at 04:14:24PM -0600, Alex Williamson wrote:
> On Tue, 7 May 2024 17:12:40 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > > Sent: Tuesday, May 7, 2024 2:19 PM
> > > > 
> > > > However, lookup_memtype() defaults to returning WB for PFNs within the
> > > > untracked PAT range, regardless of their actual MTRR type. This behavior
> > > > could lead KVM to misclassify the PFN as non-MMIO, permitting cacheable
> > > > guest access. Such access might result in MCE on certain platforms, (e.g.
> > > > clflush on VGA range (0xA0000-0xBFFFF) triggers MCE on some platforms).  
> > > 
> > > the VGA range is not exposed to any guest today. So is it just trying to
> > > fix a theoretical problem?  
> > 
> > Yes. Not sure if the VGA range will be allowed to be exposed to a guest in the
> > future, given that we have VFIO variant drivers.
> 
> include/uapi/linux/vfio.h:
>         /*
>          * Expose VGA regions defined for PCI base class 03, subclass 00.
>          * This includes I/O port ranges 0x3b0 to 0x3bb and 0x3c0 to 0x3df
>          * as well as the MMIO range 0xa0000 to 0xbffff.  Each implemented
>          * range is found at it's identity mapped offset from the region
>          * offset, for example 0x3b0 is region_info.offset + 0x3b0.  Areas
>          * between described ranges are unimplemented.
>          */
>         VFIO_PCI_VGA_REGION_INDEX,
> 
> We don't currently support mmap for this region though, so I think we
> still don't technically require this, but I guess an mmap through KVM
> is theoretically possible.  Thanks,

Thanks, Alex, for pointing it out.
KVM does not mmap this region currently, and I guess KVM will not do the mmap
by itself in the future either.

I added this check for the VGA range because I want to call
pat_pfn_immune_to_uc_mtrr() in arch_clean_nonsnoop_dma() in patch 3 to exclude VGA
ranges from CLFLUSH, as arch_clean_nonsnoop_dma() is under arch/x86 and not
virtualization specific.

Also, as Jason once said, "Nothing about vfio actually guarantees that"
"there's no ISA range" (the VGA range), I think KVM might see this range after
hva_to_pfn_remapped() translation, so adding this check may be helpful to KVM,
too.
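
For reference, a minimal sketch of that usage (assuming a simple per-page
kmap + CLFLUSH loop; the exact implementation is in patch 3):

void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t size)
{
	unsigned long pfn = PHYS_PFN(phys);
	unsigned long end_pfn = PHYS_PFN(phys + size - 1) + 1;

	for (; pfn < end_pfn; pfn++) {
		void *va;

		/* Skip PFNs whose effective memory type is uncacheable */
		if (!pfn_valid(pfn) || pat_pfn_immune_to_uc_mtrr(pfn))
			continue;

		va = kmap_local_pfn(pfn);
		clflush_cache_range(va, PAGE_SIZE);
		kunmap_local(va);
	}
}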

Thanks
Yan

> 
> > > > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > > > lookup_memtype(u64 paddr)
> > > >   */
> > > >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> > > >  {
> > > > -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > > > +	u64 paddr = PFN_PHYS(pfn);
> > > > +	enum page_cache_mode cm;
> > > > +
> > > > +	/*
> > > > +	 * Check MTRR type for untracked pat range since lookup_memtype()
> > > > always
> > > > +	 * returns WB for this range.
> > > > +	 */
> > > > +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> > > > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > > > _PAGE_CACHE_MODE_WB);  
> > > 
> > > doing so violates the name of this function. The PAT of the untracked
> > > range is still WB and not immune to UC MTRR.  
> > Right.
> > Do you think we can rename this function to something like
> > pfn_of_uncachable_effective_memory_type() and make it work under !pat_enabled()
> > too?
> > 
> > >   
> > > > +	else
> > > > +		cm = lookup_memtype(paddr);
> > > > 
> > > >  	return cm == _PAGE_CACHE_MODE_UC ||
> > > >  	       cm == _PAGE_CACHE_MODE_UC_MINUS ||  
> > >   
> > 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-07  6:22 ` [PATCH 5/5] iommufd: " Yan Zhao
@ 2024-05-09 14:13   ` Jason Gunthorpe
  2024-05-10  8:03     ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-09 14:13 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Tue, May 07, 2024 at 02:22:12PM +0800, Yan Zhao wrote:
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 33d142f8057d..e3099d732c5c 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -14,12 +14,18 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
>  		container_of(obj, struct iommufd_hwpt_paging, common.obj);
>  
>  	if (!list_empty(&hwpt_paging->hwpt_item)) {
> +		struct io_pagetable *iopt = &hwpt_paging->ioas->iopt;
>  		mutex_lock(&hwpt_paging->ioas->mutex);
>  		list_del(&hwpt_paging->hwpt_item);
>  		mutex_unlock(&hwpt_paging->ioas->mutex);
>  
> -		iopt_table_remove_domain(&hwpt_paging->ioas->iopt,
> -					 hwpt_paging->common.domain);
> +		iopt_table_remove_domain(iopt, hwpt_paging->common.domain);
> +
> +		if (!hwpt_paging->enforce_cache_coherency) {
> +			down_write(&iopt->domains_rwsem);
> +			iopt->noncoherent_domain_cnt--;
> +			up_write(&iopt->domains_rwsem);

I think it would be nicer to put this in iopt_table_remove_domain()
since we already have the lock there anyhow. It would be OK to pass
in the hwpt. Same remark for the incr side.

> @@ -176,6 +182,12 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>  			goto out_abort;
>  	}
>  
> +	if (!hwpt_paging->enforce_cache_coherency) {
> +		down_write(&ioas->iopt.domains_rwsem);
> +		ioas->iopt.noncoherent_domain_cnt++;
> +		up_write(&ioas->iopt.domains_rwsem);
> +	}
> +
>  	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);

iopt_table_add_domain also already gets the required locks too

>  	if (rc)
>  		goto out_detach;
> @@ -183,6 +195,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
>  	return hwpt_paging;
>  
>  out_detach:
> +	down_write(&ioas->iopt.domains_rwsem);
> +	ioas->iopt.noncoherent_domain_cnt--;
> +	up_write(&ioas->iopt.domains_rwsem);

And then you don't need this error unwind

> diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> index 0ec3509b7e33..557da8fb83d9 100644
> --- a/drivers/iommu/iommufd/io_pagetable.h
> +++ b/drivers/iommu/iommufd/io_pagetable.h
> @@ -198,6 +198,11 @@ struct iopt_pages {
>  	void __user *uptr;
>  	bool writable:1;
>  	u8 account_mode;
> +	/*
> +	 * CPU cache flush is required before mapping the pages to or after
> +	 * unmapping it from a noncoherent domain
> +	 */
> +	bool cache_flush_required:1;

Move this up a line so it packs with the other bool bitfield.

>  static void batch_clear(struct pfn_batch *batch)
>  {
>  	batch->total_pfns = 0;
> @@ -637,10 +648,18 @@ static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
>  	while (npages) {
>  		size_t to_unpin = min_t(size_t, npages,
>  					batch->npfns[cur] - first_page_off);
> +		unsigned long pfn = batch->pfns[cur] + first_page_off;
> +
> +		/*
> +		 * Lazily flushing CPU caches when a page is about to be
> +		 * unpinned if the page was mapped into a noncoherent domain
> +		 */
> +		if (pages->cache_flush_required)
> +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> +						to_unpin << PAGE_SHIFT);
>  
>  		unpin_user_page_range_dirty_lock(
> -			pfn_to_page(batch->pfns[cur] + first_page_off),
> -			to_unpin, pages->writable);
> +			pfn_to_page(pfn), to_unpin, pages->writable);
>  		iopt_pages_sub_npinned(pages, to_unpin);
>  		cur++;
>  		first_page_off = 0;

Make sense

> @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
>  {
>  	unsigned long done_end_index;
>  	struct pfn_reader pfns;
> +	bool cache_flush_required;
>  	int rc;
>  
>  	lockdep_assert_held(&area->pages->mutex);
>  
> +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> +			       !area->pages->cache_flush_required;
> +
> +	if (cache_flush_required)
> +		area->pages->cache_flush_required = true;
> +
>  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
>  			      iopt_area_last_index(area));
>  	if (rc)
> @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
>  
>  	while (!pfn_reader_done(&pfns)) {
>  		done_end_index = pfns.batch_start_index;
> +		if (cache_flush_required)
> +			iopt_cache_flush_pfn_batch(&pfns.batch);
> +

This is a bit unfortunate, it means we are going to flush for every
domain, even though it is not required. I don't see any easy way out
of that :(

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-07  6:21 ` [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains Yan Zhao
@ 2024-05-09 18:10   ` Alex Williamson
  2024-05-10 10:31     ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-09 18:10 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Tue,  7 May 2024 14:21:38 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> Flush CPU cache on DMA pages before mapping them into the first
> non-coherent domain (domain that does not enforce cache coherency, i.e. CPU
> caches are not force-snooped) and after unmapping them from the last
> domain.
> 
> Devices attached to non-coherent domains can execute non-coherent DMAs
> (DMAs that lack CPU cache snooping) to access physical memory with CPU
> caches bypassed.
> 
> Such a scenario could be exploited by a malicious guest, allowing them to
> access stale host data in memory rather than the data initialized by the
> host (e.g., zeros) in the cache, thus posing a risk of information leakage
> attack.
> 
> Furthermore, the host kernel (e.g. a ksm thread) might encounter
> inconsistent data between the CPU cache and memory (left by a malicious
> guest) after a page is unpinned for DMA but before it's recycled.
> 
> Therefore, it is required to flush the CPU cache before a page is
> accessible to non-coherent DMAs and after the page is inaccessible to
> non-coherent DMAs.
> 
> However, the CPU cache is not flushed immediately when the page is unmapped
> from the last non-coherent domain. Instead, the flushing is performed
> lazily, right before the page is unpinned.
> Take the following example to illustrate the process. The CPU cache is
> flushed right before step 2 and step 5.
> 1. A page is mapped into a coherent domain.
> 2. The page is mapped into a non-coherent domain.
> 3. The page is unmapped from the non-coherent domain, e.g. due to hot-unplug.
> 4. The page is unmapped from the coherent domain.
> 5. The page is unpinned.
> 
> Reasons for adopting this lazily flushing design include:
> - There're several unmap paths and only one unpin path. Lazily flushing before
>   unpin wipes out the inconsistency between cache and physical memory
>   before a page is globally visible and produces code that is simpler, more
>   maintainable and easier to backport.
> - Avoid dividing a large unmap range into several smaller ones or
>   allocating additional memory to hold IOVA to HPA relationship.
> 
> Reported-by: Jason Gunthorpe <jgg@nvidia.com>
> Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index b5c15fe8f9fc..ce873f4220bf 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -74,6 +74,7 @@ struct vfio_iommu {
>  	bool			v2;
>  	bool			nesting;
>  	bool			dirty_page_tracking;
> +	bool			has_noncoherent_domain;
>  	struct list_head	emulated_iommu_groups;
>  };
>  
> @@ -99,6 +100,7 @@ struct vfio_dma {
>  	unsigned long		*bitmap;
>  	struct mm_struct	*mm;
>  	size_t			locked_vm;
> +	bool			cache_flush_required; /* For noncoherent domain */

Poor packing, minimally this should be grouped with the other bools in
the structure, longer term they should likely all be converted to
bit fields.

>  };
>  
>  struct vfio_batch {
> @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>  	long unlocked = 0, locked = 0;
>  	long i;
>  
> +	if (dma->cache_flush_required)
> +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> +
>  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>  		if (put_pfn(pfn++, dma->prot)) {
>  			unlocked++;
> @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  					    &iotlb_gather);
>  	}
>  
> +	dma->cache_flush_required = false;
> +
>  	if (do_accounting) {
>  		vfio_lock_acct(dma, -unlocked, true);
>  		return 0;
> @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  	iommu->dma_avail++;
>  }
>  
> +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain;
> +	bool has_noncoherent = false;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (domain->enforce_cache_coherency)
> +			continue;
> +
> +		has_noncoherent = true;
> +		break;
> +	}
> +	iommu->has_noncoherent_domain = has_noncoherent;
> +}

This should be merged with vfio_domains_have_enforce_cache_coherency()
and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).

> +
>  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>  	struct vfio_domain *domain;
> @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  
>  	vfio_batch_init(&batch);
>  
> +	/*
> +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> +	 * for both pin & map and unmap & unpin (for unwind) paths.
> +	 */
> +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
>  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			break;
>  		}
>  
> +		if (dma->cache_flush_required)
> +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> +						npage << PAGE_SHIFT);
> +
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
>  				     dma->prot);
> @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> +		bool cache_flush_required;
>  
>  		dma = rb_entry(n, struct vfio_dma, node);
>  		iova = dma->iova;
> +		cache_flush_required = !domain->enforce_cache_coherency &&
> +				       !dma->cache_flush_required;
> +		if (cache_flush_required)
> +			dma->cache_flush_required = true;

The variable name here isn't accurate and the logic is confusing.  If
the domain does not enforce coherency and the mapping is not tagged as
requiring a cache flush, then we need to mark the mapping as requiring
a cache flush.  So the variable state is something more akin to
set_cache_flush_required.  But all we're saving with this is a
redundant set if the mapping is already tagged as requiring a cache
flush, so it could really be simplified to:

		dma->cache_flush_required = !domain->enforce_cache_coherency;

It might add more clarity to just name the mapping flag
dma->mapped_noncoherent.

>  
>  		while (iova < dma->iova + dma->size) {
>  			phys_addr_t phys;
> @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  				size = npage << PAGE_SHIFT;
>  			}
>  
> +			if (cache_flush_required)
> +				arch_clean_nonsnoop_dma(phys, size);
> +

I agree with others as well that this arch callback should be named
something relative to the cache-flush/write-back operation that it
actually performs instead of the overall reason for us requiring it.

>  			ret = iommu_map(domain->domain, iova, phys, size,
>  					dma->prot | IOMMU_CACHE,
>  					GFP_KERNEL_ACCOUNT);
> @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
>  						size >> PAGE_SHIFT, true);
>  		}
> +		dma->cache_flush_required = false;
>  	}
>  
>  	vfio_batch_fini(&batch);
> @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
>  	if (!pages)
>  		return;
>  
> +	if (!domain->enforce_cache_coherency)
> +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> +
>  	list_for_each_entry(region, regions, list) {
>  		start = ALIGN(region->start, PAGE_SIZE * 2);
>  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
>  		break;
>  	}
>  
> +	if (!domain->enforce_cache_coherency)
> +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> +

Seems like this use case isn't subject to the unmap aspect since these
are kernel allocated and freed pages rather than userspace pages.
There's not an "ongoing use of the page" concern.

The window of opportunity for a device to discover and exploit the
mapping side issue appears almost impossibly small.

>  	__free_pages(pages, order);
>  }
>  
> @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  
>  	list_add(&domain->next, &iommu->domain_list);
>  	vfio_update_pgsize_bitmap(iommu);
> +	if (!domain->enforce_cache_coherency)
> +		vfio_update_noncoherent_domain_state(iommu);

Why isn't this simply:

	if (!domain->enforce_cache_coherency)
		iommu->has_noncoherent_domain = true;

Or maybe:

	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains++;

>  done:
>  	/* Delete the old one and insert new iova list */
>  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  			}
>  			iommu_domain_free(domain->domain);
>  			list_del(&domain->next);
> +			if (!domain->enforce_cache_coherency)
> +				vfio_update_noncoherent_domain_state(iommu);

If we were to just track the number of noncoherent domains, this could
simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:

	return iommu->noncoherent_domains ? 1 : 0;

Maybe there should be wrappers for list_add() and list_del() relative
to the iommu domain list to make it just be a counter.  Thanks,

Alex

>  			kfree(domain);
>  			vfio_iommu_aper_expand(iommu, &iova_copy);
>  			vfio_update_pgsize_bitmap(iommu);


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-09 14:13   ` Jason Gunthorpe
@ 2024-05-10  8:03     ` Yan Zhao
  2024-05-10 13:29       ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-10  8:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Thu, May 09, 2024 at 11:13:32AM -0300, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 02:22:12PM +0800, Yan Zhao wrote:
> > diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> > index 33d142f8057d..e3099d732c5c 100644
> > --- a/drivers/iommu/iommufd/hw_pagetable.c
> > +++ b/drivers/iommu/iommufd/hw_pagetable.c
> > @@ -14,12 +14,18 @@ void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
> >  		container_of(obj, struct iommufd_hwpt_paging, common.obj);
> >  
> >  	if (!list_empty(&hwpt_paging->hwpt_item)) {
> > +		struct io_pagetable *iopt = &hwpt_paging->ioas->iopt;
> >  		mutex_lock(&hwpt_paging->ioas->mutex);
> >  		list_del(&hwpt_paging->hwpt_item);
> >  		mutex_unlock(&hwpt_paging->ioas->mutex);
> >  
> > -		iopt_table_remove_domain(&hwpt_paging->ioas->iopt,
> > -					 hwpt_paging->common.domain);
> > +		iopt_table_remove_domain(iopt, hwpt_paging->common.domain);
> > +
> > +		if (!hwpt_paging->enforce_cache_coherency) {
> > +			down_write(&iopt->domains_rwsem);
> > +			iopt->noncoherent_domain_cnt--;
> > +			up_write(&iopt->domains_rwsem);
> 
> I think it would be nicer to put this in iopt_table_remove_domain()
> since we already have the lock there anyhow. It would be OK to pass
> in the hwpt. Same remark for the incr side.
Ok. Passed hwpt to the two functions.

int iopt_table_add_domain(struct io_pagetable *iopt,
                          struct iommufd_hw_pagetable *hwpt);

void iopt_table_remove_domain(struct io_pagetable *iopt,
                              struct iommufd_hw_pagetable *hwpt);

> 
> > @@ -176,6 +182,12 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
> >  			goto out_abort;
> >  	}
> >  
> > +	if (!hwpt_paging->enforce_cache_coherency) {
> > +		down_write(&ioas->iopt.domains_rwsem);
> > +		ioas->iopt.noncoherent_domain_cnt++;
> > +		up_write(&ioas->iopt.domains_rwsem);
> > +	}
> > +
> >  	rc = iopt_table_add_domain(&ioas->iopt, hwpt->domain);
> 
> iopt_table_add_domain also already gets the required locks too
Right.

> 
> >  	if (rc)
> >  		goto out_detach;
> > @@ -183,6 +195,9 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
> >  	return hwpt_paging;
> >  
> >  out_detach:
> > +	down_write(&ioas->iopt.domains_rwsem);
> > +	ioas->iopt.noncoherent_domain_cnt--;
> > +	up_write(&ioas->iopt.domains_rwsem);
> 
> And then you don't need this error unwind
Yes :)

> > diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
> > index 0ec3509b7e33..557da8fb83d9 100644
> > --- a/drivers/iommu/iommufd/io_pagetable.h
> > +++ b/drivers/iommu/iommufd/io_pagetable.h
> > @@ -198,6 +198,11 @@ struct iopt_pages {
> >  	void __user *uptr;
> >  	bool writable:1;
> >  	u8 account_mode;
> > +	/*
> > +	 * CPU cache flush is required before mapping the pages to or after
> > +	 * unmapping it from a noncoherent domain
> > +	 */
> > +	bool cache_flush_required:1;
> 
> Move this up a line so it packs with the other bool bitfield.
Yes, thanks!

> >  static void batch_clear(struct pfn_batch *batch)
> >  {
> >  	batch->total_pfns = 0;
> > @@ -637,10 +648,18 @@ static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
> >  	while (npages) {
> >  		size_t to_unpin = min_t(size_t, npages,
> >  					batch->npfns[cur] - first_page_off);
> > +		unsigned long pfn = batch->pfns[cur] + first_page_off;
> > +
> > +		/*
> > +		 * Lazily flushing CPU caches when a page is about to be
> > +		 * unpinned if the page was mapped into a noncoherent domain
> > +		 */
> > +		if (pages->cache_flush_required)
> > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > +						to_unpin << PAGE_SHIFT);
> >  
> >  		unpin_user_page_range_dirty_lock(
> > -			pfn_to_page(batch->pfns[cur] + first_page_off),
> > -			to_unpin, pages->writable);
> > +			pfn_to_page(pfn), to_unpin, pages->writable);
> >  		iopt_pages_sub_npinned(pages, to_unpin);
> >  		cur++;
> >  		first_page_off = 0;
> 
> Make sense
> 
> > @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> >  {
> >  	unsigned long done_end_index;
> >  	struct pfn_reader pfns;
> > +	bool cache_flush_required;
> >  	int rc;
> >  
> >  	lockdep_assert_held(&area->pages->mutex);
> >  
> > +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> > +			       !area->pages->cache_flush_required;
> > +
> > +	if (cache_flush_required)
> > +		area->pages->cache_flush_required = true;
> > +
> >  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
> >  			      iopt_area_last_index(area));
> >  	if (rc)
> > @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> >  
> >  	while (!pfn_reader_done(&pfns)) {
> >  		done_end_index = pfns.batch_start_index;
> > +		if (cache_flush_required)
> > +			iopt_cache_flush_pfn_batch(&pfns.batch);
> > +
> 
> This is a bit unfortunate, it means we are going to flush for every
> domain, even though it is not required. I don't see any easy way out
> of that :(
Yes. Do you think it's possible to add an op get_cache_coherency_enforced
to iommu_domain_ops?
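
Purely to illustrate the idea, something like the below (hypothetical sketch;
no such op exists today):

/* Hypothetical op, shown for illustration only */
struct iommu_domain_ops {
	/* ... existing ops ... */
	bool (*get_cache_coherency_enforced)(struct iommu_domain *domain);
};

static bool domain_enforces_cache_coherency(struct iommu_domain *domain)
{
	return domain->ops->get_cache_coherency_enforced &&
	       domain->ops->get_cache_coherency_enforced(domain);
}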

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-09 18:10   ` Alex Williamson
@ 2024-05-10 10:31     ` Yan Zhao
  2024-05-10 16:57       ` Alex Williamson
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-10 10:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> On Tue,  7 May 2024 14:21:38 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
... 
> >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index b5c15fe8f9fc..ce873f4220bf 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -74,6 +74,7 @@ struct vfio_iommu {
> >  	bool			v2;
> >  	bool			nesting;
> >  	bool			dirty_page_tracking;
> > +	bool			has_noncoherent_domain;
> >  	struct list_head	emulated_iommu_groups;
> >  };
> >  
> > @@ -99,6 +100,7 @@ struct vfio_dma {
> >  	unsigned long		*bitmap;
> >  	struct mm_struct	*mm;
> >  	size_t			locked_vm;
> > +	bool			cache_flush_required; /* For noncoherent domain */
> 
> Poor packing, minimally this should be grouped with the other bools in
> the structure, longer term they should likely all be converted to
> bit fields.
Yes. Will do!

> 
> >  };
> >  
> >  struct vfio_batch {
> > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >  	long unlocked = 0, locked = 0;
> >  	long i;
> >  
> > +	if (dma->cache_flush_required)
> > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > +
> >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> >  		if (put_pfn(pfn++, dma->prot)) {
> >  			unlocked++;
> > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  					    &iotlb_gather);
> >  	}
> >  
> > +	dma->cache_flush_required = false;
> > +
> >  	if (do_accounting) {
> >  		vfio_lock_acct(dma, -unlocked, true);
> >  		return 0;
> > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >  	iommu->dma_avail++;
> >  }
> >  
> > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > +{
> > +	struct vfio_domain *domain;
> > +	bool has_noncoherent = false;
> > +
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		if (domain->enforce_cache_coherency)
> > +			continue;
> > +
> > +		has_noncoherent = true;
> > +		break;
> > +	}
> > +	iommu->has_noncoherent_domain = has_noncoherent;
> > +}
> 
> This should be merged with vfio_domains_have_enforce_cache_coherency()
> and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).
Will convert it to a counter and do the merge.
Thanks for pointing it out!

> 
> > +
> >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> >  {
> >  	struct vfio_domain *domain;
> > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  
> >  	vfio_batch_init(&batch);
> >  
> > +	/*
> > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > +	 */
> > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > +
> >  	while (size) {
> >  		/* Pin a contiguous chunk of memory */
> >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> >  			break;
> >  		}
> >  
> > +		if (dma->cache_flush_required)
> > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > +						npage << PAGE_SHIFT);
> > +
> >  		/* Map it! */
> >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> >  				     dma->prot);
> > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  	for (; n; n = rb_next(n)) {
> >  		struct vfio_dma *dma;
> >  		dma_addr_t iova;
> > +		bool cache_flush_required;
> >  
> >  		dma = rb_entry(n, struct vfio_dma, node);
> >  		iova = dma->iova;
> > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > +				       !dma->cache_flush_required;
> > +		if (cache_flush_required)
> > +			dma->cache_flush_required = true;
> 
> The variable name here isn't accurate and the logic is confusing.  If
> the domain does not enforce coherency and the mapping is not tagged as
> requiring a cache flush, then we need to mark the mapping as requiring
> a cache flush.  So the variable state is something more akin to
> set_cache_flush_required.  But all we're saving with this is a
> redundant set if the mapping is already tagged as requiring a cache
> flush, so it could really be simplified to:
> 
> 		dma->cache_flush_required = !domain->enforce_cache_coherency;
Sorry about the confusion.

If dma->cache_flush_required is set to true by a domain not enforcing cache
coherency, we hope it will not be reset to false by a later attach to a domain
enforcing cache coherency, due to the lazy flushing design.

> It might add more clarity to just name the mapping flag
> dma->mapped_noncoherent.

The dma->cache_flush_required is to mark whether pages in a vfio_dma require a
cache flush at the subsequent mapping into the first non-coherent domain
and at page unpinning.
So, mapped_noncoherent may not be accurate.
Do you think it's better to put a comment for explanation? 

struct vfio_dma {
        ...    
        bool                    iommu_mapped;
        bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
        bool                    vaddr_invalid;
        /*
         *  Mark whether it is required to flush CPU caches when mapping pages
         *  of the vfio_dma to the first non-coherent domain and when unpinning
         *  pages of the vfio_dma
         */
        bool                    cache_flush_required;
        ...    
};
> 
> >  
> >  		while (iova < dma->iova + dma->size) {
> >  			phys_addr_t phys;
> > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  				size = npage << PAGE_SHIFT;
> >  			}
> >  
> > +			if (cache_flush_required)
> > +				arch_clean_nonsnoop_dma(phys, size);
> > +
> 
> I agree with others as well that this arch callback should be named
> something relative to the cache-flush/write-back operation that it
> actually performs instead of the overall reason for us requiring it.
>
Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
suggested by Kevin.

> >  			ret = iommu_map(domain->domain, iova, phys, size,
> >  					dma->prot | IOMMU_CACHE,
> >  					GFP_KERNEL_ACCOUNT);
> > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> >  						size >> PAGE_SHIFT, true);
> >  		}
> > +		dma->cache_flush_required = false;
> >  	}
> >  
> >  	vfio_batch_fini(&batch);
> > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> >  	if (!pages)
> >  		return;
> >  
> > +	if (!domain->enforce_cache_coherency)
> > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > +
> >  	list_for_each_entry(region, regions, list) {
> >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> >  		break;
> >  	}
> >  
> > +	if (!domain->enforce_cache_coherency)
> > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > +
> 
> Seems like this use case isn't subject to the unmap aspect since these
> are kernel allocated and freed pages rather than userspace pages.
> There's not an "ongoing use of the page" concern.
> 
> The window of opportunity for a device to discover and exploit the
> mapping side issue appears almost impossibly small.
>
The concern is for a malicious device attempting DMAs automatically.
Do you think this concern is valid?
As there're only extra flushes for 4 pages, what about keeping it for safety?

> >  	__free_pages(pages, order);
> >  }
> >  
> > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  
> >  	list_add(&domain->next, &iommu->domain_list);
> >  	vfio_update_pgsize_bitmap(iommu);
> > +	if (!domain->enforce_cache_coherency)
> > +		vfio_update_noncoherent_domain_state(iommu);
> 
> Why isn't this simply:
> 
> 	if (!domain->enforce_cache_coherency)
> 		iommu->has_noncoherent_domain = true;
Yes, it's simpler during attach.

> Or maybe:
> 
> 	if (!domain->enforce_cache_coherency)
> 		iommu->noncoherent_domains++;
Yes, this counter is better.
I previously thought a bool can save some space.

> >  done:
> >  	/* Delete the old one and insert new iova list */
> >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  			}
> >  			iommu_domain_free(domain->domain);
> >  			list_del(&domain->next);
> > +			if (!domain->enforce_cache_coherency)
> > +				vfio_update_noncoherent_domain_state(iommu);
> 
> If we were to just track the number of noncoherent domains, this could
> simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> 
> 	return iommu->noncoherent_domains ? 1 : 0;
> 
> Maybe there should be wrappers for list_add() and list_del() relative
> to the iommu domain list to make it just be a counter.  Thanks,

Do you think we can skip the "iommu->noncoherent_domains--" in
vfio_iommu_type1_release() when the iommu is about to be freed?

Asking that is also because it's hard for me to find a good name for the wrapper
around list_del().  :)

It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
vfio_iommu_type1_detach_group().

> 
> 
> >  			kfree(domain);
> >  			vfio_iommu_aper_expand(iommu, &iova_copy);
> >  			vfio_update_pgsize_bitmap(iommu);
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-10  8:03     ` Yan Zhao
@ 2024-05-10 13:29       ` Jason Gunthorpe
  2024-05-13  7:43         ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-10 13:29 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Fri, May 10, 2024 at 04:03:04PM +0800, Yan Zhao wrote:
> > > @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > >  {
> > >  	unsigned long done_end_index;
> > >  	struct pfn_reader pfns;
> > > +	bool cache_flush_required;
> > >  	int rc;
> > >  
> > >  	lockdep_assert_held(&area->pages->mutex);
> > >  
> > > +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> > > +			       !area->pages->cache_flush_required;
> > > +
> > > +	if (cache_flush_required)
> > > +		area->pages->cache_flush_required = true;
> > > +
> > >  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
> > >  			      iopt_area_last_index(area));
> > >  	if (rc)
> > > @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > >  
> > >  	while (!pfn_reader_done(&pfns)) {
> > >  		done_end_index = pfns.batch_start_index;
> > > +		if (cache_flush_required)
> > > +			iopt_cache_flush_pfn_batch(&pfns.batch);
> > > +
> > 
> > This is a bit unfortunate, it means we are going to flush for every
> > domain, even though it is not required. I don't see any easy way out
> > of that :(
> Yes. Do you think it's possible to add an op get_cache_coherency_enforced
> to iommu_domain_ops?

Do we need that? The hwpt already keeps track of that? The enforced flag could be
copied into the area alongside storage_domain.

Then I guess you could avoid flushing in the case the page came from
the storage_domain...

You'd want the storage_domain to preferentially point to any
non-enforced domain.

Is it worth it? How slow is this stuff?

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-10 10:31     ` Yan Zhao
@ 2024-05-10 16:57       ` Alex Williamson
  2024-05-13  7:11         ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-10 16:57 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Fri, 10 May 2024 18:31:13 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > On Tue,  7 May 2024 14:21:38 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> ... 
> > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > >  1 file changed, 51 insertions(+)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > >  	bool			v2;
> > >  	bool			nesting;
> > >  	bool			dirty_page_tracking;
> > > +	bool			has_noncoherent_domain;
> > >  	struct list_head	emulated_iommu_groups;
> > >  };
> > >  
> > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > >  	unsigned long		*bitmap;
> > >  	struct mm_struct	*mm;
> > >  	size_t			locked_vm;
> > > +	bool			cache_flush_required; /* For noncoherent domain */  
> > 
> > Poor packing, minimally this should be grouped with the other bools in
> > the structure, longer term they should likely all be converted to
> > bit fields.  
> Yes. Will do!
> 
> >   
> > >  };
> > >  
> > >  struct vfio_batch {
> > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > >  	long unlocked = 0, locked = 0;
> > >  	long i;
> > >  
> > > +	if (dma->cache_flush_required)
> > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > +
> > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > >  		if (put_pfn(pfn++, dma->prot)) {
> > >  			unlocked++;
> > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  					    &iotlb_gather);
> > >  	}
> > >  
> > > +	dma->cache_flush_required = false;
> > > +
> > >  	if (do_accounting) {
> > >  		vfio_lock_acct(dma, -unlocked, true);
> > >  		return 0;
> > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > >  	iommu->dma_avail++;
> > >  }
> > >  
> > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > +{
> > > +	struct vfio_domain *domain;
> > > +	bool has_noncoherent = false;
> > > +
> > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > +		if (domain->enforce_cache_coherency)
> > > +			continue;
> > > +
> > > +		has_noncoherent = true;
> > > +		break;
> > > +	}
> > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > +}  
> > 
> > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).  
> Will convert it to a counter and do the merge.
> Thanks for pointing it out!
> 
> >   
> > > +
> > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  {
> > >  	struct vfio_domain *domain;
> > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  
> > >  	vfio_batch_init(&batch);
> > >  
> > > +	/*
> > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > +	 */
> > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > +
> > >  	while (size) {
> > >  		/* Pin a contiguous chunk of memory */
> > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > >  			break;
> > >  		}
> > >  
> > > +		if (dma->cache_flush_required)
> > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > +						npage << PAGE_SHIFT);
> > > +
> > >  		/* Map it! */
> > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > >  				     dma->prot);
> > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  	for (; n; n = rb_next(n)) {
> > >  		struct vfio_dma *dma;
> > >  		dma_addr_t iova;
> > > +		bool cache_flush_required;
> > >  
> > >  		dma = rb_entry(n, struct vfio_dma, node);
> > >  		iova = dma->iova;
> > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > +				       !dma->cache_flush_required;
> > > +		if (cache_flush_required)
> > > +			dma->cache_flush_required = true;  
> > 
> > The variable name here isn't accurate and the logic is confusing.  If
> > the domain does not enforce coherency and the mapping is not tagged as
> > requiring a cache flush, then we need to mark the mapping as requiring
> > a cache flush.  So the variable state is something more akin to
> > set_cache_flush_required.  But all we're saving with this is a
> > redundant set if the mapping is already tagged as requiring a cache
> > flush, so it could really be simplified to:
> > 
> > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> Sorry about the confusion.
> 
> > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > coherency, we hope it will not be reset to false by a later attach to a domain
> > enforcing cache coherency, due to the lazy flushing design.

Right, ok, the vfio_dma objects are shared between domains so we never
want to set 'dma->cache_flush_required = false' due to the addition of a
'domain->enforce_cache_coherency == true'.  So this could be:

	if (!dma->cache_flush_required)
		dma->cache_flush_required = !domain->enforce_cache_coherency;

> > It might add more clarity to just name the mapping flag
> > dma->mapped_noncoherent.  
> 
> > The dma->cache_flush_required is to mark whether pages in a vfio_dma require a
> > cache flush at the subsequent mapping into the first non-coherent domain
> > and at page unpinning.

How do we arrive at a sequence where we have dma->cache_flush_required
that isn't the result of being mapped into a domain with
!domain->enforce_cache_coherency?

It seems to me that we only get 'dma->cache_flush_required == true' as
a result of being mapped into a 'domain->enforce_cache_coherency ==
false' domain.  In that case the flush-on-map is handled at the time
we're setting dma->cache_flush_required and what we're actually
tracking with the flag is that the dma object has been mapped into a
noncoherent domain.

> So, mapped_noncoherent may not be accurate.
> Do you think it's better to put a comment for explanation? 
> 
> struct vfio_dma {
>         ...    
>         bool                    iommu_mapped;
>         bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
>         bool                    vaddr_invalid;
>         /*
>          *  Mark whether it is required to flush CPU caches when mapping pages
>          *  of the vfio_dma to the first non-coherent domain and when unpinning
>          *  pages of the vfio_dma
>          */
>         bool                    cache_flush_required;
>         ...    
> };
> >   
> > >  
> > >  		while (iova < dma->iova + dma->size) {
> > >  			phys_addr_t phys;
> > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  				size = npage << PAGE_SHIFT;
> > >  			}
> > >  
> > > +			if (cache_flush_required)
> > > +				arch_clean_nonsnoop_dma(phys, size);
> > > +  
> > 
> > I agree with others as well that this arch callback should be named
> > something relative to the cache-flush/write-back operation that it
> > actually performs instead of the overall reason for us requiring it.
> >  
> Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
> suggested by Kevin.

Yes, better.

> > >  			ret = iommu_map(domain->domain, iova, phys, size,
> > >  					dma->prot | IOMMU_CACHE,
> > >  					GFP_KERNEL_ACCOUNT);
> > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> > >  						size >> PAGE_SHIFT, true);
> > >  		}
> > > +		dma->cache_flush_required = false;
> > >  	}
> > >  
> > >  	vfio_batch_fini(&batch);
> > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > >  	if (!pages)
> > >  		return;
> > >  
> > > +	if (!domain->enforce_cache_coherency)
> > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > +
> > >  	list_for_each_entry(region, regions, list) {
> > >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> > >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > >  		break;
> > >  	}
> > >  
> > > +	if (!domain->enforce_cache_coherency)
> > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > +  
> > 
> > Seems like this use case isn't subject to the unmap aspect since these
> > are kernel allocated and freed pages rather than userspace pages.
> > There's not an "ongoing use of the page" concern.
> > 
> > The window of opportunity for a device to discover and exploit the
> > mapping side issue appears almost impossibly small.
> >  
> The concern is for a malicious device attempting DMAs automatically.
> Do you think this concern is valid?
> As there're only extra flushes for 4 pages, what about keeping it for safety?

Userspace doesn't know anything about these mappings, so to exploit
them the device would somehow need to discover and interact with the
mapping in the split second that the mapping exists, without exposing
itself with mapping faults at the IOMMU.

I don't mind keeping the flush before map so that the infinitesimal gap where
previous data in physical memory is exposed to the device is closed, but I
have a much harder time seeing that the flush on unmap to synchronize physical
memory is required.

For example, the potential KSM use case doesn't exist since the pages
are not owned by the user.  Any subsequent use of the pages would be
subject to the same condition we assumed after allocation, where the
physical data may be inconsistent with the cached data.  It's easy to
flush 2 pages, but I think it obscures the function of the flush if we
can't articulate the value in this case.


> > >  	__free_pages(pages, order);
> > >  }
> > >  
> > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > >  
> > >  	list_add(&domain->next, &iommu->domain_list);
> > >  	vfio_update_pgsize_bitmap(iommu);
> > > +	if (!domain->enforce_cache_coherency)
> > > +		vfio_update_noncoherent_domain_state(iommu);  
> > 
> > Why isn't this simply:
> > 
> > 	if (!domain->enforce_cache_coherency)
> > 		iommu->has_noncoherent_domain = true;  
> Yes, it's simpler during attach.
> 
> > Or maybe:
> > 
> > 	if (!domain->enforce_cache_coherency)
> > 		iommu->noncoherent_domains++;  
> Yes, this counter is better.
> I previously thought a bool can save some space.
> 
> > >  done:
> > >  	/* Delete the old one and insert new iova list */
> > >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> > >  			}
> > >  			iommu_domain_free(domain->domain);
> > >  			list_del(&domain->next);
> > > +			if (!domain->enforce_cache_coherency)
> > > +				vfio_update_noncoherent_domain_state(iommu);  
> > 
> > If we were to just track the number of noncoherent domains, this could
> > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> > 
> > 	return iommu->noncoherent_domains ? 1 : 0;
> > 
> > Maybe there should be wrappers for list_add() and list_del() relative
> > to the iommu domain list to make it just be a counter.  Thanks,  
> 
> Do you think we can skip the "iommu->noncoherent_domains--" in
> vfio_iommu_type1_release() when the iommu is about to be freed?
> 
> Asking that is also because it's hard for me to find a good name for the wrapper
> around list_del().  :)

vfio_iommu_link_domain(), vfio_iommu_unlink_domain()?
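
i.e. something along these lines (sketch only, using the noncoherent_domains
counter discussed above):

/* Callers are expected to hold iommu->lock */
static void vfio_iommu_link_domain(struct vfio_iommu *iommu,
				   struct vfio_domain *domain)
{
	list_add(&domain->next, &iommu->domain_list);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains++;
}

static void vfio_iommu_unlink_domain(struct vfio_iommu *iommu,
				     struct vfio_domain *domain)
{
	list_del(&domain->next);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains--;
}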

> 
> It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
> vfio_iommu_type1_detach_group().

I'm not sure I understand the concern here; detach_group is performed
under the iommu->lock, where the value of iommu->noncoherent_domains is
only guaranteed while this lock is held.  In the release callback the
iommu->lock is not held, but we have no external users at this point.
It's not strictly required that we decrement each domain, but it's also
not a bad sanity test that iommu->noncoherent_domains should be zero
after unlinking the domains.  Thanks,

Alex
 
> > >  			kfree(domain);
> > >  			vfio_iommu_aper_expand(iommu,
> > > &iova_copy); vfio_update_pgsize_bitmap(iommu);  
> >   
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-10 16:57       ` Alex Williamson
@ 2024-05-13  7:11         ` Yan Zhao
  2024-05-16  7:53           ` Tian, Kevin
                             ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-13  7:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> On Fri, 10 May 2024 18:31:13 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > > On Tue,  7 May 2024 14:21:38 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> > ... 
> > > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > > >  1 file changed, 51 insertions(+)
> > > > 
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > > >  	bool			v2;
> > > >  	bool			nesting;
> > > >  	bool			dirty_page_tracking;
> > > > +	bool			has_noncoherent_domain;
> > > >  	struct list_head	emulated_iommu_groups;
> > > >  };
> > > >  
> > > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > > >  	unsigned long		*bitmap;
> > > >  	struct mm_struct	*mm;
> > > >  	size_t			locked_vm;
> > > > +	bool			cache_flush_required; /* For noncoherent domain */  
> > > 
> > > Poor packing, minimally this should be grouped with the other bools in
> > > the structure, longer term they should likely all be converted to
> > > bit fields.  
> > Yes. Will do!
> > 
> > >   
> > > >  };
> > > >  
> > > >  struct vfio_batch {
> > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > > >  	long unlocked = 0, locked = 0;
> > > >  	long i;
> > > >  
> > > > +	if (dma->cache_flush_required)
> > > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > > +
> > > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > >  		if (put_pfn(pfn++, dma->prot)) {
> > > >  			unlocked++;
> > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  					    &iotlb_gather);
> > > >  	}
> > > >  
> > > > +	dma->cache_flush_required = false;
> > > > +
> > > >  	if (do_accounting) {
> > > >  		vfio_lock_acct(dma, -unlocked, true);
> > > >  		return 0;
> > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > > >  	iommu->dma_avail++;
> > > >  }
> > > >  
> > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > > +{
> > > > +	struct vfio_domain *domain;
> > > > +	bool has_noncoherent = false;
> > > > +
> > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > +		if (domain->enforce_cache_coherency)
> > > > +			continue;
> > > > +
> > > > +		has_noncoherent = true;
> > > > +		break;
> > > > +	}
> > > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > > +}  
> > > 
> > > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).  
> > Will convert it to a counter and do the merge.
> > Thanks for pointing it out!
> > 
> > >   
> > > > +
> > > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > > >  {
> > > >  	struct vfio_domain *domain;
> > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  
> > > >  	vfio_batch_init(&batch);
> > > >  
> > > > +	/*
> > > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > > +	 */
> > > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > > +
> > > >  	while (size) {
> > > >  		/* Pin a contiguous chunk of memory */
> > > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > >  			break;
> > > >  		}
> > > >  
> > > > +		if (dma->cache_flush_required)
> > > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > > +						npage << PAGE_SHIFT);
> > > > +
> > > >  		/* Map it! */
> > > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > > >  				     dma->prot);
> > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  	for (; n; n = rb_next(n)) {
> > > >  		struct vfio_dma *dma;
> > > >  		dma_addr_t iova;
> > > > +		bool cache_flush_required;
> > > >  
> > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > >  		iova = dma->iova;
> > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > +				       !dma->cache_flush_required;
> > > > +		if (cache_flush_required)
> > > > +			dma->cache_flush_required = true;  
> > > 
> > > The variable name here isn't accurate and the logic is confusing.  If
> > > the domain does not enforce coherency and the mapping is not tagged as
> > > requiring a cache flush, then we need to mark the mapping as requiring
> > > a cache flush.  So the variable state is something more akin to
> > > set_cache_flush_required.  But all we're saving with this is a
> > > redundant set if the mapping is already tagged as requiring a cache
> > > flush, so it could really be simplified to:
> > > 
> > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> > Sorry about the confusion.
> > 
> > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > coherency, we hope it will not be reset to false by a later attach to a domain
> > enforcing cache coherency, due to the lazy flushing design.
> 
> Right, ok, the vfio_dma objects are shared between domains so we never
> want to set 'dma->cache_flush_required = false' due to the addition of a
> 'domain->enforce_cache_coherency == true'.  So this could be:
> 
> 	if (!dma->cache_flush_required)
> 		dma->cache_flush_required = !domain->enforce_cache_coherency;

Though this code is easier to understand, it leads to unnecessarily setting
dma->cache_flush_required to false, given that domain->enforce_cache_coherency
is true most of the time.

> > > It might add more clarity to just name the mapping flag
> > > dma->mapped_noncoherent.  
> > 
> > The dma->cache_flush_required is to mark whether pages in a vfio_dma require a
> > cache flush at the subsequent mapping into the first non-coherent domain
> > and at page unpinning.
> 
> How do we arrive at a sequence where we have dma->cache_flush_required
> that isn't the result of being mapped into a domain with
> !domain->enforce_cache_coherency?
Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
!domain->enforce_cache_coherency.
My concern comes only from the actual code sequence, i.e.
dma->cache_flush_required is set to true before the actual mapping.

If we rename it to dma->mapped_noncoherent and only set it to true after the
actual successful mapping, it would lead to more code to handle flushing for the
unwind case.
Currently, the flush for unwind is handled centrally in vfio_unpin_pages_remote()
by checking dma->cache_flush_required, which is true even before a fully
successful mapping, so we won't miss flushing any pages that were mapped into a
non-coherent domain, even for a short window.

> 
> It seems to me that we only get 'dma->cache_flush_required == true' as
> a result of being mapped into a 'domain->enforce_cache_coherency ==
> false' domain.  In that case the flush-on-map is handled at the time
> we're setting dma->cache_flush_required and what we're actually
> tracking with the flag is that the dma object has been mapped into a
> noncoherent domain.
> 
> > So, mapped_noncoherent may not be accurate.
> > Do you think it's better to put a comment for explanation? 
> > 
> > struct vfio_dma {
> >         ...    
> >         bool                    iommu_mapped;
> >         bool                    lock_cap;       /* capable(CAP_IPC_LOCK) */
> >         bool                    vaddr_invalid;
> >         /*
> >          *  Mark whether it is required to flush CPU caches when mapping pages
> >          *  of the vfio_dma to the first non-coherent domain and when unpinning
> >          *  pages of the vfio_dma
> >          */
> >         bool                    cache_flush_required;
> >         ...    
> > };
> > >   
> > > >  
> > > >  		while (iova < dma->iova + dma->size) {
> > > >  			phys_addr_t phys;
> > > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  				size = npage << PAGE_SHIFT;
> > > >  			}
> > > >  
> > > > +			if (cache_flush_required)
> > > > +				arch_clean_nonsnoop_dma(phys, size);
> > > > +  
> > > 
> > > I agree with others as well that this arch callback should be named
> > > something relative to the cache-flush/write-back operation that it
> > > actually performs instead of the overall reason for us requiring it.
> > >  
> > Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as
> > suggested by Kevin.
> 
> Yes, better.
> 
> > > >  			ret = iommu_map(domain->domain, iova, phys, size,
> > > >  					dma->prot | IOMMU_CACHE,
> > > >  					GFP_KERNEL_ACCOUNT);
> > > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > >  			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
> > > >  						size >> PAGE_SHIFT, true);
> > > >  		}
> > > > +		dma->cache_flush_required = false;
> > > >  	}
> > > >  
> > > >  	vfio_batch_fini(&batch);
> > > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > > >  	if (!pages)
> > > >  		return;
> > > >  
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > > +
> > > >  	list_for_each_entry(region, regions, list) {
> > > >  		start = ALIGN(region->start, PAGE_SIZE * 2);
> > > >  		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
> > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
> > > >  		break;
> > > >  	}
> > > >  
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
> > > > +  
> > > 
> > > Seems like this use case isn't subject to the unmap aspect since these
> > > are kernel allocated and freed pages rather than userspace pages.
> > > There's not an "ongoing use of the page" concern.
> > > 
> > > The window of opportunity for a device to discover and exploit the
> > > mapping side issue appears almost impossibly small.
> > >  
> > The concern is for a malicious device attempting DMAs automatically.
> > Do you think this concern is valid?
> > As there're only extra flushes for 4 pages, what about keeping it for safety?
> 
> Userspace doesn't know anything about these mappings, so to exploit
> them the device would somehow need to discover and interact with the
> mapping in the split second that the mapping exists, without exposing
> itself with mapping faults at the IOMMU.
> 
> I don't mind keeping the flush before map so that infinitesimal gap
> where previous data in physical memory exposed to the device is closed,
> but I have a much harder time seeing that the flush on unmap to
> synchronize physical memory is required.
> 
> For example, the potential KSM use case doesn't exist since the pages
> are not owned by the user.  Any subsequent use of the pages would be
> subject to the same condition we assumed after allocation, where the
> physical data may be inconsistent with the cached data.  It's easy to
> flush 2 pages, but I think it obscures the function of the flush if we
> can't articulate the value in this case.
>
I agree the second flush is not necessary if we are confident that the functions
in between the two flushes do not and will not touch the page from the CPU side.
However, can we guarantee this? For instance, is it possible for some IOMMU
driver to read/write the page for some quirk? (Or is that just being overly
paranoid?)
If that can't be ruled out, then wouldn't ensuring cache/memory coherency before
the page is reclaimed be better?

> 
> > > >  	__free_pages(pages, order);
> > > >  }
> > > >  
> > > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > > >  
> > > >  	list_add(&domain->next, &iommu->domain_list);
> > > >  	vfio_update_pgsize_bitmap(iommu);
> > > > +	if (!domain->enforce_cache_coherency)
> > > > +		vfio_update_noncoherent_domain_state(iommu);  
> > > 
> > > Why isn't this simply:
> > > 
> > > 	if (!domain->enforce_cache_coherency)
> > > 		iommu->has_noncoherent_domain = true;  
> > Yes, it's simpler during attach.
> > 
> > > Or maybe:
> > > 
> > > 	if (!domain->enforce_cache_coherency)
> > > 		iommu->noncoherent_domains++;  
> > Yes, this counter is better.
> > I previously thought a bool can save some space.
> > 
> > > >  done:
> > > >  	/* Delete the old one and insert new iova list */
> > > >  	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
> > > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> > > >  			}
> > > >  			iommu_domain_free(domain->domain);
> > > >  			list_del(&domain->next);
> > > > +			if (!domain->enforce_cache_coherency)
> > > > +				vfio_update_noncoherent_domain_state(iommu);  
> > > 
> > > If we were to just track the number of noncoherent domains, this could
> > > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be:
> > > 
> > > 	return iommu->noncoherent_domains ? 1 : 0;
> > > 
> > > Maybe there should be wrappers for list_add() and list_del() relative
> > > to the iommu domain list to make it just be a counter.  Thanks,  
> > 
> > Do you think we can skip the "iommu->noncoherent_domains--" in
> > vfio_iommu_type1_release() when iommu is about to be freed.
> > 
> > Asking that is also because it's hard for me to find a good name for the wrapper
> > around list_del().  :)
> 
> vfio_iommu_link_domain(), vfio_iommu_unlink_domain()?

Ah, this is a good name!

> > 
> > It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in
> > vfio_iommu_type1_detach_group().
> 
> I'm not sure I understand the concern here, detach_group is performed
> under the iommu->lock where the value of iommu->noncohernet_domains is
> only guaranteed while this lock is held.  In the release callback the
> iommu->lock is not held, but we have no external users at this point.
> It's not strictly required that we decrement each domain, but it's also
> not a bad sanity test that iommu->noncoherent_domains should be zero
> after unlinking the domains.  Thanks,
I previously thought I couldn't find a good name for a domain operation that is
called after vfio_release_domain(), and I couldn't merge the list_del() into
vfio_release_domain() given that it's not called in vfio_iommu_type1_detach_group().

But vfio_iommu_unlink_domain() is a good one.
I'll replace the bare list_del() with vfio_iommu_unlink_domain().
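
For reference, a rough sketch of what I have in mind (untested; it assumes the
bool is converted to an iommu->noncoherent_domains counter as discussed above):

static void vfio_iommu_link_domain(struct vfio_iommu *iommu,
				   struct vfio_domain *domain)
{
	list_add(&domain->next, &iommu->domain_list);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains++;
}

static void vfio_iommu_unlink_domain(struct vfio_iommu *iommu,
				     struct vfio_domain *domain)
{
	list_del(&domain->next);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains--;
}

vfio_domains_have_enforce_cache_coherency() could then be derived from the
counter as well.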

Thanks!
Yan


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-10 13:29       ` Jason Gunthorpe
@ 2024-05-13  7:43         ` Yan Zhao
  2024-05-14 15:11           ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-13  7:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Fri, May 10, 2024 at 10:29:28AM -0300, Jason Gunthorpe wrote:
> On Fri, May 10, 2024 at 04:03:04PM +0800, Yan Zhao wrote:
> > > > @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > >  {
> > > >  	unsigned long done_end_index;
> > > >  	struct pfn_reader pfns;
> > > > +	bool cache_flush_required;
> > > >  	int rc;
> > > >  
> > > >  	lockdep_assert_held(&area->pages->mutex);
> > > >  
> > > > +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> > > > +			       !area->pages->cache_flush_required;
> > > > +
> > > > +	if (cache_flush_required)
> > > > +		area->pages->cache_flush_required = true;
> > > > +
> > > >  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
> > > >  			      iopt_area_last_index(area));
> > > >  	if (rc)
> > > > @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > >  
> > > >  	while (!pfn_reader_done(&pfns)) {
> > > >  		done_end_index = pfns.batch_start_index;
> > > > +		if (cache_flush_required)
> > > > +			iopt_cache_flush_pfn_batch(&pfns.batch);
> > > > +
> > > 
> > > This is a bit unfortunate, it means we are going to flush for every
> > > domain, even though it is not required. I don't see any easy way out
> > > of that :(
> > Yes. Do you think it's possible to add an op get_cache_coherency_enforced
> > to iommu_domain_ops?
> 
> Do we need that? The hwpt already keeps track of that? the enforced could be
> copied into the area along side storage_domain
> 
> Then I guess you could avoid flushing in the case the page came from
> the storage_domain...
> 
> You'd want the storage_domain to preferentially point to any
> non-enforced domain.
> 
> Is it worth it? How slow is this stuff?
Sorry, I might have misunderstood your intentions in my previous mail.
In iopt_area_fill_domain(), flushing CPU caches is only performed when
(1) noncoherent_domain_cnt is non-zero and
(2) area->pages->cache_flush_required is false.
area->pages->cache_flush_required is also set to true once both conditions are
met, so the next flush of the same "area->pages" in the filling phase will be
skipped.

In my last mail, I thought you wanted to flush for every domain even if
area->pages->cache_flush_required is true, because I thought you were worried
that checking area->pages->cache_flush_required might result in some pages,
which ought to be flushed, not being flushed.
So, I was wondering if we could do the flush for every non-coherent domain by
checking whether the domain enforces cache coherency.

However, as you said, we can check the hwpt instead if it's passed into
iopt_area_fill_domain().

On the other hand, on second thought, isn't it still good to check
area->pages->cache_flush_required?
- "area" and "pages" are 1:1. In other words, there is no case where several
  "area"s point to the same "pages".
  Is this assumption right?
- Once area->pages->cache_flush_required is set to true, it means all pages
  indicated by "area->pages" have been mapped into a non-coherent domain
  (though the domain is not necessarily the storage domain).
  Is this assumption correct as well?
  If so, we can safely skip the flush in iopt_area_fill_domain() if
  area->pages->cache_flush_required is true.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-13  7:43         ` Yan Zhao
@ 2024-05-14 15:11           ` Jason Gunthorpe
  2024-05-15  7:06             ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-14 15:11 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Mon, May 13, 2024 at 03:43:45PM +0800, Yan Zhao wrote:
> On Fri, May 10, 2024 at 10:29:28AM -0300, Jason Gunthorpe wrote:
> > On Fri, May 10, 2024 at 04:03:04PM +0800, Yan Zhao wrote:
> > > > > @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > > >  {
> > > > >  	unsigned long done_end_index;
> > > > >  	struct pfn_reader pfns;
> > > > > +	bool cache_flush_required;
> > > > >  	int rc;
> > > > >  
> > > > >  	lockdep_assert_held(&area->pages->mutex);
> > > > >  
> > > > > +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> > > > > +			       !area->pages->cache_flush_required;
> > > > > +
> > > > > +	if (cache_flush_required)
> > > > > +		area->pages->cache_flush_required = true;
> > > > > +
> > > > >  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
> > > > >  			      iopt_area_last_index(area));
> > > > >  	if (rc)
> > > > > @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > > >  
> > > > >  	while (!pfn_reader_done(&pfns)) {
> > > > >  		done_end_index = pfns.batch_start_index;
> > > > > +		if (cache_flush_required)
> > > > > +			iopt_cache_flush_pfn_batch(&pfns.batch);
> > > > > +
> > > > 
> > > > This is a bit unfortunate, it means we are going to flush for every
> > > > domain, even though it is not required. I don't see any easy way out
> > > > of that :(
> > > Yes. Do you think it's possible to add an op get_cache_coherency_enforced
> > > to iommu_domain_ops?
> > 
> > Do we need that? The hwpt already keeps track of that? the enforced could be
> > copied into the area along side storage_domain
> > 
> > Then I guess you could avoid flushing in the case the page came from
> > the storage_domain...
> > 
> > You'd want the storage_domain to preferentially point to any
> > non-enforced domain.
> > 
> > Is it worth it? How slow is this stuff?
> Sorry, I might have misunderstood your intentions in my previous mail.
> In iopt_area_fill_domain(), flushing CPU caches is only performed when
> (1) noncoherent_domain_cnt is non-zero and
> (2) area->pages->cache_flush_required is false.
> area->pages->cache_flush_required is also set to true after the two are met, so
> that the next flush to the same "area->pages" in filling phase will be skipped.
> 
> In my last mail, I thought you wanted to flush for every domain even if
> area->pages->cache_flush_required is true, because I thought that you were
> worried about that checking area->pages->cache_flush_required might results in
> some pages, which ought be flushed, not being flushed.
> So, I was wondering if we could do the flush for every non-coherent domain by
> checking whether domain enforces cache coherency.
> 
> However, as you said, we can check hwpt instead if it's passed in
> iopt_area_fill_domain().
> 
> On the other side, after a second thought, looks it's still good to check
> area->pages->cache_flush_required?
> - "area" and "pages" are 1:1. In other words, there's no such a condition that
>   several "area"s are pointing to the same "pages".
>   Is this assumption right?

copy can create new areas that point to shared pages. That is why
there are two structs.

> - Once area->pages->cache_flush_required is set to true, it means all pages
>   indicated by "area->pages" has been mapped into a non-coherent
>   domain

Also not true, multiple areas can take sub slices of the pages, so two hwpts
can be mapping disjoint sets of pages, and thus have disjoint cacheability.

So whether flushing a span is needed has to be calculated closer to a
page-by-page basis (really a span-by-span basis), based on where the pages
came from. Only pages that came from a non-coherent hwpt can skip the
flushing.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-14 15:11           ` Jason Gunthorpe
@ 2024-05-15  7:06             ` Yan Zhao
  2024-05-15 20:43               ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-15  7:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Tue, May 14, 2024 at 12:11:19PM -0300, Jason Gunthorpe wrote:
> On Mon, May 13, 2024 at 03:43:45PM +0800, Yan Zhao wrote:
> > On Fri, May 10, 2024 at 10:29:28AM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 10, 2024 at 04:03:04PM +0800, Yan Zhao wrote:
> > > > > > @@ -1358,10 +1377,17 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > > > >  {
> > > > > >  	unsigned long done_end_index;
> > > > > >  	struct pfn_reader pfns;
> > > > > > +	bool cache_flush_required;
> > > > > >  	int rc;
> > > > > >  
> > > > > >  	lockdep_assert_held(&area->pages->mutex);
> > > > > >  
> > > > > > +	cache_flush_required = area->iopt->noncoherent_domain_cnt &&
> > > > > > +			       !area->pages->cache_flush_required;
> > > > > > +
> > > > > > +	if (cache_flush_required)
> > > > > > +		area->pages->cache_flush_required = true;
> > > > > > +
> > > > > >  	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
> > > > > >  			      iopt_area_last_index(area));
> > > > > >  	if (rc)
> > > > > > @@ -1369,6 +1395,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
> > > > > >  
> > > > > >  	while (!pfn_reader_done(&pfns)) {
> > > > > >  		done_end_index = pfns.batch_start_index;
> > > > > > +		if (cache_flush_required)
> > > > > > +			iopt_cache_flush_pfn_batch(&pfns.batch);
> > > > > > +
> > > > > 
> > > > > This is a bit unfortunate, it means we are going to flush for every
> > > > > domain, even though it is not required. I don't see any easy way out
> > > > > of that :(
> > > > Yes. Do you think it's possible to add an op get_cache_coherency_enforced
> > > > to iommu_domain_ops?
> > > 
> > > Do we need that? The hwpt already keeps track of that? the enforced could be
> > > copied into the area along side storage_domain
> > > 
> > > Then I guess you could avoid flushing in the case the page came from
> > > the storage_domain...
> > > 
> > > You'd want the storage_domain to preferentially point to any
> > > non-enforced domain.
> > > 
> > > Is it worth it? How slow is this stuff?
> > Sorry, I might have misunderstood your intentions in my previous mail.
> > In iopt_area_fill_domain(), flushing CPU caches is only performed when
> > (1) noncoherent_domain_cnt is non-zero and
> > (2) area->pages->cache_flush_required is false.
> > area->pages->cache_flush_required is also set to true after the two are met, so
> > that the next flush to the same "area->pages" in filling phase will be skipped.
> > 
> > In my last mail, I thought you wanted to flush for every domain even if
> > area->pages->cache_flush_required is true, because I thought that you were
> > worried about that checking area->pages->cache_flush_required might results in
> > some pages, which ought be flushed, not being flushed.
> > So, I was wondering if we could do the flush for every non-coherent domain by
> > checking whether domain enforces cache coherency.
> > 
> > However, as you said, we can check hwpt instead if it's passed in
> > iopt_area_fill_domain().
> > 
> > On the other side, after a second thought, looks it's still good to check
> > area->pages->cache_flush_required?
> > - "area" and "pages" are 1:1. In other words, there's no such a condition that
> >   several "area"s are pointing to the same "pages".
> >   Is this assumption right?
> 
> copy can create new areas that point to shared pages. That is why
> there are two structs.
Oh, thanks for the explanation, and glad to learn that!
Though in this case, the new area is identical to the old area.
> 
> > - Once area->pages->cache_flush_required is set to true, it means all pages
> >   indicated by "area->pages" has been mapped into a non-coherent
> >   domain
> 
> Also not true, the multiple area's can take sub slices of the pages,
Ah, right, e.g. after iopt_area_split().

> so two hwpts' can be mapping disjoint sets of pages, and thus have
> disjoint cachability.
Indeed.
> 
> So it has to be calculated on closer to a page by page basis (really a
> span by span basis) if flushing of that span is needed based on where
> the pages came from. Only pages that came from a hwpt that is
> non-coherent can skip the flushing.
Is an area-by-area basis also good?
Isn't an area either not mapped to any domain or mapped into all domains?

But, yes, considering the limited number of non-coherent domains, it appears
more robust and clean to always flush for a non-coherent domain in
iopt_area_fill_domain().
It eliminates the need to decide whether to retain the area flag during a split.
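
E.g. something like this (untested sketch; it assumes iopt_area_fill_domain()
gains access to the hwpt's enforce_cache_coherency flag, and the helper name
is made up):

/* Untested sketch: per-domain flush decision in iopt_area_fill_domain(). */
static bool iopt_fill_domain_needs_flush(struct iopt_pages *pages,
					 bool enforce_cache_coherency)
{
	if (enforce_cache_coherency)
		return false;

	/*
	 * Remember that these pages have been mapped non-coherently so that
	 * they are flushed again before unpin; never cleared back to false.
	 */
	pages->cache_flush_required = true;
	return true;
}

The fill loop would then flush each pfn batch via iopt_cache_flush_pfn_batch()
whenever this returns true, regardless of where the pages came from.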

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-15  7:06             ` Yan Zhao
@ 2024-05-15 20:43               ` Jason Gunthorpe
  2024-05-16  2:32                 ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-15 20:43 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:

> > So it has to be calculated on closer to a page by page basis (really a
> > span by span basis) if flushing of that span is needed based on where
> > the pages came from. Only pages that came from a hwpt that is
> > non-coherent can skip the flushing.
> Is area by area basis also good?
> Isn't an area either not mapped to any domain or mapped into all domains?

Yes, this is what the span iterator turns into in the background, it
goes area by area to cover things.

> But, yes, considering the limited number of non-coherent domains, it appears
> more robust and clean to always flush for non-coherent domain in
> iopt_area_fill_domain().
> It eliminates the need to decide whether to retain the area flag during a split.

And flush for pin user pages, so you basically always flush because
you can't tell where the pages came from.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-15 20:43               ` Jason Gunthorpe
@ 2024-05-16  2:32                 ` Yan Zhao
  2024-05-16  8:38                   ` Tian, Kevin
  2024-05-17 17:04                   ` Jason Gunthorpe
  0 siblings, 2 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-16  2:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> 
> > > So it has to be calculated on closer to a page by page basis (really a
> > > span by span basis) if flushing of that span is needed based on where
> > > the pages came from. Only pages that came from a hwpt that is
> > > non-coherent can skip the flushing.
> > Is area by area basis also good?
> > Isn't an area either not mapped to any domain or mapped into all domains?
> 
> Yes, this is what the span iterator turns into in the background, it
> goes area by area to cover things.
> 
> > But, yes, considering the limited number of non-coherent domains, it appears
> > more robust and clean to always flush for non-coherent domain in
> > iopt_area_fill_domain().
> > It eliminates the need to decide whether to retain the area flag during a split.
> 
> And flush for pin user pages, so you basically always flush because
> you can't tell where the pages came from.
As a summary, do you think it's good to flush in the way below?

1. In iopt_area_fill_domains(), flush before mapping a page into the domains
   when iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
   Record cache_flush_required in pages for unpin.
2. In iopt_area_fill_domain(), pass in the hwpt to check domain non-coherency.
   Flush before mapping a page into a non-coherent domain, no matter where the
   page is from.
   Record cache_flush_required in pages for unpin.
3. In batch_unpin(), flush if pages->cache_flush_required before
   unpin_user_pages() (rough sketch below).
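
For step 3, a minimal sketch (untested; the helper name is made up, and
exactly where batch_unpin() would call it is an assumption):

/* Untested sketch for step 3: flush a pfn batch before its pages are unpinned. */
static void batch_flush_before_unpin(struct pfn_batch *batch,
				     struct iopt_pages *pages)
{
	/* Set in steps 1/2 whenever the pages were mapped non-coherently. */
	if (pages->cache_flush_required)
		iopt_cache_flush_pfn_batch(batch);
}

batch_unpin() would call this right before unpinning the pages.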

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-07  9:12     ` Yan Zhao
  2024-05-08 22:14       ` Alex Williamson
@ 2024-05-16  7:42       ` Tian, Kevin
  2024-05-16 14:07         ` Sean Christopherson
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-16  7:42 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm, linux-kernel, x86, alex.williamson, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Tuesday, May 7, 2024 5:13 PM
> 
> On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > Sent: Tuesday, May 7, 2024 2:19 PM
> > >
> > > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > > lookup_memtype(u64 paddr)
> > >   */
> > >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> > >  {
> > > -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > > +	u64 paddr = PFN_PHYS(pfn);
> > > +	enum page_cache_mode cm;
> > > +
> > > +	/*
> > > +	 * Check MTRR type for untracked pat range since lookup_memtype()
> > > always
> > > +	 * returns WB for this range.
> > > +	 */
> > > +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> > > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > > _PAGE_CACHE_MODE_WB);
> >
> > doing so violates the name of this function. The PAT of the untracked
> > range is still WB and not immune to UC MTRR.
> Right.
> Do you think we can rename this function to something like
> pfn_of_uncachable_effective_memory_type() and make it work
> under !pat_enabled()
> too?

Let's hear the x86/KVM maintainers' opinions.

My gut feeling is that kvm_is_mmio_pfn() might be moved into the
x86 core, as the logic there has nothing specific to KVM itself. Also,
naming-wise it doesn't really matter whether the pfn is MMIO. The
real point is to find the uncacheable memtype in the primary MMU
and then follow it in KVM.

From that point of view, a pfn_memtype_uncacheable() probably reads clearer.
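
E.g. something like below (untested sketch; it just combines the existing
checks in pat_pfn_immune_to_uc_mtrr() with the untracked-PAT-range handling
proposed in patch 1):

bool pfn_memtype_uncacheable(unsigned long pfn)
{
	u64 paddr = PFN_PHYS(pfn);
	enum page_cache_mode cm;

	/*
	 * lookup_memtype() always returns WB for untracked PAT ranges, so
	 * fall back to the MTRR type there.
	 */
	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
				     _PAGE_CACHE_MODE_WB);
	else
		cm = lookup_memtype(paddr);

	return cm == _PAGE_CACHE_MODE_UC ||
	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
	       cm == _PAGE_CACHE_MODE_WC;
}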

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-13  7:11         ` Yan Zhao
@ 2024-05-16  7:53           ` Tian, Kevin
  2024-05-16  8:34           ` Tian, Kevin
  2024-05-16 20:50           ` Alex Williamson
  2 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2024-05-16  7:53 UTC (permalink / raw)
  To: Zhao, Yan Y, Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Monday, May 13, 2024 3:11 PM
> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > The dma->cache_flush_required is to mark whether pages in a vfio_dma
> requires
> > > cache flush in the subsequence mapping into the first non-coherent
> domain
> > > and page unpinning.
> >
> > How do we arrive at a sequence where we have dma-
> >cache_flush_required
> > that isn't the result of being mapped into a domain with
> > !domain->enforce_cache_coherency?
> Hmm, dma->cache_flush_required IS the result of being mapped into a
> domain with
> !domain->enforce_cache_coherency.
> My concern only arrives from the actual code sequence, i.e.
> dma->cache_flush_required is set to true before the actual mapping.
> 
> If we rename it to dma->mapped_noncoherent and only set it to true after
> the
> actual successful mapping, it would lead to more code to handle flushing for
> the
> unwind case.
> Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> by checking dma->cache_flush_required, which is true even before a full
> successful mapping, so we won't miss flush on any pages that are mapped
> into a
> non-coherent domain in a short window.
> 

What about storing a vfio_iommu pointer in vfio_dma? Or pass an extra
parameter to vfio_unpin_pages_remote()...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-13  7:11         ` Yan Zhao
  2024-05-16  7:53           ` Tian, Kevin
@ 2024-05-16  8:34           ` Tian, Kevin
  2024-05-16 20:31             ` Alex Williamson
  2024-05-16 20:50           ` Alex Williamson
  2 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-16  8:34 UTC (permalink / raw)
  To: Zhao, Yan Y, Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Monday, May 13, 2024 3:11 PM
> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:
> > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct
> vfio_domain *domain, struct list_head *
> > > > >  		break;
> > > > >  	}
> > > > >
> > > > > +	if (!domain->enforce_cache_coherency)
> > > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages),
> PAGE_SIZE * 2);
> > > > > +
> > > >
> > > > Seems like this use case isn't subject to the unmap aspect since these
> > > > are kernel allocated and freed pages rather than userspace pages.
> > > > There's not an "ongoing use of the page" concern.
> > > >
> > > > The window of opportunity for a device to discover and exploit the
> > > > mapping side issue appears almost impossibly small.
> > > >
> > > The concern is for a malicious device attempting DMAs automatically.
> > > Do you think this concern is valid?
> > > As there're only extra flushes for 4 pages, what about keeping it for safety?
> >
> > Userspace doesn't know anything about these mappings, so to exploit
> > them the device would somehow need to discover and interact with the
> > mapping in the split second that the mapping exists, without exposing
> > itself with mapping faults at the IOMMU.

Userspace could guess the attack ranges based on the code, e.g. currently
the code just tries to use the 1st available IOVA region, which likely starts
at address 0.

And mapping faults don't stop the attack; they're just an after-the-fact hint
revealing the possibility of being attacked. 😊

> >
> > I don't mind keeping the flush before map so that infinitesimal gap
> > where previous data in physical memory exposed to the device is closed,
> > but I have a much harder time seeing that the flush on unmap to
> > synchronize physical memory is required.
> >
> > For example, the potential KSM use case doesn't exist since the pages
> > are not owned by the user.  Any subsequent use of the pages would be
> > subject to the same condition we assumed after allocation, where the
> > physical data may be inconsistent with the cached data.  It's easy to

Physical data can be different from the cached copy at any time. In the normal
case the cache line is marked dirty and the CPU cache protocol guarantees
coherency between cache and memory.

Here we're talking about a situation in which a malicious user uses non-coherent
DMA to bypass the CPU and make memory/cache inconsistent while the CPU still
considers the memory copy up-to-date (e.g. the cacheline is in the exclusive or
shared state). In this case multiple reads by the next user may get different
values from the cache or from memory, depending on when the cacheline is
invalidated.

So it's really about a bad inconsistent state which can be recovered only by
invalidating the cacheline (so the memory data is up-to-date) or doing a
WB-type store (to mark the memory copy out-of-date) before the next use.
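
In code terms, the two recovery options look roughly like this (illustrative
sketch only, using x86 helpers):

/* Option 1: invalidate the lines so the memory copy becomes the truth. */
static void recover_by_invalidate(void *vaddr, unsigned int size)
{
	clflush_cache_range(vaddr, size);
}

/*
 * Option 2: any WB store (here: zeroing) dirties the line, so the cache
 * copy becomes the truth and is written back later.
 */
static void recover_by_store(unsigned long *vaddr)
{
	WRITE_ONCE(*vaddr, 0);
}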

> > flush 2 pages, but I think it obscures the function of the flush if we
> > can't articulate the value in this case.

Btw, KSM is one example. Jason mentioned in an earlier discussion that not all
free pages are zeroed before the next use, so it'd always be good to
conservatively prevent any potentially inconsistent state from being leaked back
to the kernel. Though I'm not sure what a real usage would be in which the next
user directly uses the uninitialized content w/o doing any meaningful writes
(which, once done, would close the attack window)...

> >
> I agree the second flush is not necessary if we are confident that functions in
> between the two flushes do not and will not touch the page in CPU side.
> However, can we guarantee this? For instance, is it possible for some
> IOMMU
> driver to read/write the page for some quirks? (Or is it just a totally
> paranoid?)
> If that's not impossible, then ensuring cache and memory coherency before
> page reclaiming is better?
> 

I don't think it's a valid argument.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16  2:32                 ` Yan Zhao
@ 2024-05-16  8:38                   ` Tian, Kevin
  2024-05-16  9:48                     ` Yan Zhao
  2024-05-17 17:04                   ` Jason Gunthorpe
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-16  8:38 UTC (permalink / raw)
  To: Zhao, Yan Y, Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, Liu, Yi L

> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Thursday, May 16, 2024 10:33 AM
> 
> On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> >
> > > > So it has to be calculated on closer to a page by page basis (really a
> > > > span by span basis) if flushing of that span is needed based on where
> > > > the pages came from. Only pages that came from a hwpt that is
> > > > non-coherent can skip the flushing.
> > > Is area by area basis also good?
> > > Isn't an area either not mapped to any domain or mapped into all
> domains?
> >
> > Yes, this is what the span iterator turns into in the background, it
> > goes area by area to cover things.
> >
> > > But, yes, considering the limited number of non-coherent domains, it
> appears
> > > more robust and clean to always flush for non-coherent domain in
> > > iopt_area_fill_domain().
> > > It eliminates the need to decide whether to retain the area flag during a
> split.
> >
> > And flush for pin user pages, so you basically always flush because
> > you can't tell where the pages came from.
> As a summary, do you think it's good to flush in below way?
> 
> 1. in iopt_area_fill_domains(), flush before mapping a page into domains
> when
>    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
>    Record cache_flush_required in pages for unpin.
> 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
>    flush before mapping a page into a non-coherent domain, no matter where
> the
>    page is from.
>    Record cache_flush_required in pages for unpin.
> 3. in batch_unpin(), flush if pages->cache_flush_required before
>    unpin_user_pages.

So the above suggests a sequence similar to what vfio_type1 does?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16  8:38                   ` Tian, Kevin
@ 2024-05-16  9:48                     ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-16  9:48 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, kvm, linux-kernel, x86, alex.williamson, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Thu, May 16, 2024 at 04:38:12PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Thursday, May 16, 2024 10:33 AM
> > 
> > On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > >
> > > > > So it has to be calculated on closer to a page by page basis (really a
> > > > > span by span basis) if flushing of that span is needed based on where
> > > > > the pages came from. Only pages that came from a hwpt that is
> > > > > non-coherent can skip the flushing.
> > > > Is area by area basis also good?
> > > > Isn't an area either not mapped to any domain or mapped into all
> > domains?
> > >
> > > Yes, this is what the span iterator turns into in the background, it
> > > goes area by area to cover things.
> > >
> > > > But, yes, considering the limited number of non-coherent domains, it
> > appears
> > > > more robust and clean to always flush for non-coherent domain in
> > > > iopt_area_fill_domain().
> > > > It eliminates the need to decide whether to retain the area flag during a
> > split.
> > >
> > > And flush for pin user pages, so you basically always flush because
> > > you can't tell where the pages came from.
> > As a summary, do you think it's good to flush in below way?
> > 
> > 1. in iopt_area_fill_domains(), flush before mapping a page into domains
> > when
> >    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
> >    Record cache_flush_required in pages for unpin.
> > 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
> >    flush before mapping a page into a non-coherent domain, no matter where
> > the
> >    page is from.
> >    Record cache_flush_required in pages for unpin.
> > 3. in batch_unpin(), flush if pages->cache_flush_required before
> >    unpin_user_pages.
> 
> so above suggests a sequence similar to vfio_type1 does?
Similar. Except that in iopt_area_fill_domain(), the flush is always performed
for non-coherent domains without checking pages->cache_flush_required, while in
vfio_iommu_replay(), the flush can be skipped if dma->cache_flush_required is
already true.

This is because in vfio_type1, pages are mapped into domains on a dma-by-dma
basis, but in iommufd, pages are mapped into domains on an area-by-area basis.
Two areas can be non-overlapping parts of the same iopt_pages.
It's not right to skip flushing the pages of the second area just because
pages->cache_flush_required was set to true while mapping the pages of the
first area.
It's also cumbersome to introduce and check another flag in the area, or to
check where the pages came from, before mapping them into a non-coherent domain.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-16  7:42       ` Tian, Kevin
@ 2024-05-16 14:07         ` Sean Christopherson
  2024-05-20  2:36           ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Sean Christopherson @ 2024-05-16 14:07 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Yan Y Zhao, kvm, linux-kernel, x86, alex.williamson, jgg, iommu,
	pbonzini, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	corbet, joro, will, robin.murphy, baolu.lu, Yi L Liu,
	Tom Lendacky

+Tom

On Thu, May 16, 2024, Kevin Tian wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Tuesday, May 7, 2024 5:13 PM
> > 
> > On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > > Sent: Tuesday, May 7, 2024 2:19 PM
> > > >
> > > > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > > > lookup_memtype(u64 paddr)
> > > >   */
> > > >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> > > >  {
> > > > -	enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn));
> > > > +	u64 paddr = PFN_PHYS(pfn);
> > > > +	enum page_cache_mode cm;
> > > > +
> > > > +	/*
> > > > +	 * Check MTRR type for untracked pat range since lookup_memtype()
> > > > always
> > > > +	 * returns WB for this range.
> > > > +	 */
> > > > +	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
> > > > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > > > _PAGE_CACHE_MODE_WB);
> > >
> > > doing so violates the name of this function. The PAT of the untracked
> > > range is still WB and not immune to UC MTRR.
> > Right.
> > Do you think we can rename this function to something like
> > pfn_of_uncachable_effective_memory_type() and make it work under
> > !pat_enabled() too?
> 
> let's hear from x86/kvm maintainers for their opinions.
> 
> My gut-feeling is that kvm_is_mmio_pfn() might be moved into the
> x86 core as the logic there has nothing specific to kvm itself. Also
> naming-wise it doesn't really matter whether the pfn is mmio. The
> real point is to find the uncacheble memtype in the primary mmu
> and then follow it in KVM.

Yeaaaah, we've got an existing problem there.  When AMD's SME is enabled, KVM
uses kvm_is_mmio_pfn() to determine whether or not to map memory into the guest
as encrypted or plain text.  I.e. KVM really does try to use this helper to
detect MMIO vs. RAM.  I highly doubt that actually works in all setups.

For SME, it seems like the best approach would be to grab the C-Bit from the host
page tables, similar to how KVM uses host_pfn_mapping_level().

SME aside, I don't have objection to moving kvm_is_mmio_pfn() out of KVM.

> from that point probably a pfn_memtype_uncacheable() reads clearer.

or even just pfn_is_memtype_uc()?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16  8:34           ` Tian, Kevin
@ 2024-05-16 20:31             ` Alex Williamson
  2024-05-17 17:11               ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-16 20:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Zhao, Yan Y, kvm, linux-kernel, x86, jgg, iommu, pbonzini,
	seanjc, dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet,
	joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Thu, 16 May 2024 08:34:20 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Monday, May 13, 2024 3:11 PM
> > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > On Fri, 10 May 2024 18:31:13 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >  
> > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:  
> > > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct  
> > vfio_domain *domain, struct list_head *  
> > > > > >  		break;
> > > > > >  	}
> > > > > >
> > > > > > +	if (!domain->enforce_cache_coherency)
> > > > > > +		arch_clean_nonsnoop_dma(page_to_phys(pages),  
> > PAGE_SIZE * 2);  
> > > > > > +  
> > > > >
> > > > > Seems like this use case isn't subject to the unmap aspect since these
> > > > > are kernel allocated and freed pages rather than userspace pages.
> > > > > There's not an "ongoing use of the page" concern.
> > > > >
> > > > > The window of opportunity for a device to discover and exploit the
> > > > > mapping side issue appears almost impossibly small.
> > > > >  
> > > > The concern is for a malicious device attempting DMAs automatically.
> > > > Do you think this concern is valid?
> > > > As there're only extra flushes for 4 pages, what about keeping it for safety?  
> > >
> > > Userspace doesn't know anything about these mappings, so to exploit
> > > them the device would somehow need to discover and interact with the
> > > mapping in the split second that the mapping exists, without exposing
> > > itself with mapping faults at the IOMMU.  
> 
> Userspace could guess the attacking ranges based on code, e.g. currently
> the code just tries to use the 1st available IOVA region which likely starts
> at address 0.
> 
> and mapping faults don't stop the attack. Just some after-the-fact hint
> revealing the possibility of being attacked. 😊

As below, the gap is infinitesimally small, but not zero, and I don't
mind closing it entirely.

> > >
> > > I don't mind keeping the flush before map so that infinitesimal gap
> > > where previous data in physical memory exposed to the device is closed,
> > > but I have a much harder time seeing that the flush on unmap to
> > > synchronize physical memory is required.
> > >
> > > For example, the potential KSM use case doesn't exist since the pages
> > > are not owned by the user.  Any subsequent use of the pages would be
> > > subject to the same condition we assumed after allocation, where the
> > > physical data may be inconsistent with the cached data.  It's easy to  
> 
> physical data can be different from the cached one at any time. In normal
> case the cache line is marked as dirty and the CPU cache protocol
> guarantees coherency between cache/memory.
> 
> here we talked about a situation which a malicious user uses non-coherent
> DMA to bypass CPU and makes memory/cache inconsistent when the
> CPU still considers the memory copy is up-to-date (e.g. cacheline is in
> exclusive or shared state). In this case multiple reads from the next-user
> may get different values from cache or memory depending on when the
> cacheline is invalidated.
> 
> So it's really about a bad inconsistency state which can be recovered only
> by invalidating the cacheline (so memory data is up-to-date) or doing
> a WB-type store (to mark memory copy out-of-date) before the next-use.

Ok, so the initial state may be that the page is zero'd in cache, but
the cacheline is dirty and is therefore the source of truth for all
coherent operations.  In the case where a device has non-coherently
modified physical memory, the coherent results are indeterminate; the
processor could see a value from either the cache or physical memory.  So
these are in fact different scenarios.

> > > flush 2 pages, but I think it obscures the function of the flush if we
> > > can't articulate the value in this case.  
> 
> btw KSM is one example. Jason mentioned in earlier discussion that not all
> free pages are zero-ed before the next use then it'd always good to
> conservatively prevent any potential inconsistent state leaked back to
> the kernel. Though I'm not sure what'd be a real usage in which the next
> user will directly use then uninitialized content w/o doing any meaningful
> writes (which once done then will stop the attacking window)...

Yes, exactly.  Zero'ing the page would obviously reestablish coherency, but
the page could be reallocated without being zero'd, and as you describe the
owner of that page could then get inconsistent results.  I can't think of
any use case where the next user only cares that the contents of the page
are consistent without writing a specific value, but sure, let's not be the
source of that obscure bug ;)  Thanks,

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-13  7:11         ` Yan Zhao
  2024-05-16  7:53           ` Tian, Kevin
  2024-05-16  8:34           ` Tian, Kevin
@ 2024-05-16 20:50           ` Alex Williamson
  2024-05-17  3:11             ` Yan Zhao
  2 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-16 20:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Mon, 13 May 2024 15:11:28 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > On Fri, 10 May 2024 18:31:13 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:    
> > > ...   
> > > > >  drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 51 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index b5c15fe8f9fc..ce873f4220bf 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -74,6 +74,7 @@ struct vfio_iommu {
> > > > >  	bool			v2;
> > > > >  	bool			nesting;
> > > > >  	bool			dirty_page_tracking;
> > > > > +	bool			has_noncoherent_domain;
> > > > >  	struct list_head	emulated_iommu_groups;
> > > > >  };
> > > > >  
> > > > > @@ -99,6 +100,7 @@ struct vfio_dma {
> > > > >  	unsigned long		*bitmap;
> > > > >  	struct mm_struct	*mm;
> > > > >  	size_t			locked_vm;
> > > > > +	bool			cache_flush_required; /* For noncoherent domain */    
> > > > 
> > > > Poor packing, minimally this should be grouped with the other bools in
> > > > the structure, longer term they should likely all be converted to
> > > > bit fields.    
> > > Yes. Will do!
> > >   
> > > >     
> > > > >  };
> > > > >  
> > > > >  struct vfio_batch {
> > > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > > > >  	long unlocked = 0, locked = 0;
> > > > >  	long i;
> > > > >  
> > > > > +	if (dma->cache_flush_required)
> > > > > +		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
> > > > > +
> > > > >  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > > >  		if (put_pfn(pfn++, dma->prot)) {
> > > > >  			unlocked++;
> > > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  					    &iotlb_gather);
> > > > >  	}
> > > > >  
> > > > > +	dma->cache_flush_required = false;
> > > > > +
> > > > >  	if (do_accounting) {
> > > > >  		vfio_lock_acct(dma, -unlocked, true);
> > > > >  		return 0;
> > > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > > > >  	iommu->dma_avail++;
> > > > >  }
> > > > >  
> > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
> > > > > +{
> > > > > +	struct vfio_domain *domain;
> > > > > +	bool has_noncoherent = false;
> > > > > +
> > > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > > +		if (domain->enforce_cache_coherency)
> > > > > +			continue;
> > > > > +
> > > > > +		has_noncoherent = true;
> > > > > +		break;
> > > > > +	}
> > > > > +	iommu->has_noncoherent_domain = has_noncoherent;
> > > > > +}    
> > > > 
> > > > This should be merged with vfio_domains_have_enforce_cache_coherency()
> > > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below).    
> > > Will convert it to a counter and do the merge.
> > > Thanks for pointing it out!
> > >   
> > > >     
> > > > > +
> > > > >  static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
> > > > >  {
> > > > >  	struct vfio_domain *domain;
> > > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  
> > > > >  	vfio_batch_init(&batch);
> > > > >  
> > > > > +	/*
> > > > > +	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
> > > > > +	 * for both pin & map and unmap & unpin (for unwind) paths.
> > > > > +	 */
> > > > > +	dma->cache_flush_required = iommu->has_noncoherent_domain;
> > > > > +
> > > > >  	while (size) {
> > > > >  		/* Pin a contiguous chunk of memory */
> > > > >  		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
> > > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > > > >  			break;
> > > > >  		}
> > > > >  
> > > > > +		if (dma->cache_flush_required)
> > > > > +			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
> > > > > +						npage << PAGE_SHIFT);
> > > > > +
> > > > >  		/* Map it! */
> > > > >  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> > > > >  				     dma->prot);
> > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > >  	for (; n; n = rb_next(n)) {
> > > > >  		struct vfio_dma *dma;
> > > > >  		dma_addr_t iova;
> > > > > +		bool cache_flush_required;
> > > > >  
> > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > >  		iova = dma->iova;
> > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > +				       !dma->cache_flush_required;
> > > > > +		if (cache_flush_required)
> > > > > +			dma->cache_flush_required = true;    
> > > > 
> > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > a cache flush.  So the variable state is something more akin to
> > > > set_cache_flush_required.  But all we're saving with this is a
> > > > redundant set if the mapping is already tagged as requiring a cache
> > > > flush, so it could really be simplified to:
> > > > 
> > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > Sorry about the confusion.
> > > 
> > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > enforcing cache coherency due to the lazily flushing design.  
> > 
> > Right, ok, the vfio_dma objects are shared between domains so we never
> > want to set 'dma->cache_flush_required = false' due to the addition of a
> > 'domain->enforce_cache_coherent == true'.  So this could be:
> > 
> > 	if (!dma->cache_flush_required)
> > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> 
> Though this code is easier for understanding, it leads to unnecessary setting of
> dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> true at the most time.

I don't really see that as an issue, but the variable name originally
chosen above, cache_flush_required, also doesn't convey that it's only
attempting to set the value if it wasn't previously set and is now
required by a noncoherent domain.

> > > > It might add more clarity to just name the mapping flag
> > > > dma->mapped_noncoherent.    
> > > 
> > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > cache flush in the subsequence mapping into the first non-coherent domain
> > > and page unpinning.  
> > 
> > How do we arrive at a sequence where we have dma->cache_flush_required
> > that isn't the result of being mapped into a domain with
> > !domain->enforce_cache_coherency?  
> Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> !domain->enforce_cache_coherency.
> My concern only arrives from the actual code sequence, i.e.
> dma->cache_flush_required is set to true before the actual mapping.
> 
> If we rename it to dma->mapped_noncoherent and only set it to true after the
> actual successful mapping, it would lead to more code to handle flushing for the
> unwind case.
> Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> by checking dma->cache_flush_required, which is true even before a full
> successful mapping, so we won't miss flush on any pages that are mapped into a
> non-coherent domain in a short window.

I don't think we need to be so literal that "mapped_noncoherent" can
only be set after the vfio_dma is fully mapped to a noncoherent domain,
but we can also come up with other names for the flag.  Perhaps
"is_noncoherent".  My suggestion was more from the perspective of what
the flag represents rather than what we intend to do as a result of
the flag being set.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16 20:50           ` Alex Williamson
@ 2024-05-17  3:11             ` Yan Zhao
  2024-05-17  4:44               ` Alex Williamson
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-17  3:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> On Mon, 13 May 2024 15:11:28 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:
> > > On Fri, 10 May 2024 18:31:13 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:  
> > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:    
...   
> > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > >  	for (; n; n = rb_next(n)) {
> > > > > >  		struct vfio_dma *dma;
> > > > > >  		dma_addr_t iova;
> > > > > > +		bool cache_flush_required;
> > > > > >  
> > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > >  		iova = dma->iova;
> > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > +				       !dma->cache_flush_required;
> > > > > > +		if (cache_flush_required)
> > > > > > +			dma->cache_flush_required = true;    
> > > > > 
> > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > a cache flush.  So the variable state is something more akin to
> > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > flush, so it could really be simplified to:
> > > > > 
> > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > > Sorry about the confusion.
> > > > 
> > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > enforcing cache coherency due to the lazily flushing design.  
> > > 
> > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > 
> > > 	if (!dma->cache_flush_required)
> > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;  
> > 
> > Though this code is easier for understanding, it leads to unnecessary setting of
> > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > true at the most time.
> 
> I don't really see that as an issue, but the variable name originally
> chosen above, cache_flush_required, also doesn't convey that it's only
> attempting to set the value if it wasn't previously set and is now
> required by a noncoherent domain.
Agreed, the old name is too vague.
What about update_to_noncoherent_required?
Then in vfio_iommu_replay(), it's like

update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
if (update_to_noncoherent_required)
         dma->is_noncoherent = true;

...
if (update_to_noncoherent_required)
	arch_flush_cache_phys(phys, size);
> 
> > > > > It might add more clarity to just name the mapping flag
> > > > > dma->mapped_noncoherent.    
> > > > 
> > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > and page unpinning.  
> > > 
> > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > that isn't the result of being mapped into a domain with
> > > !domain->enforce_cache_coherency?  
> > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > !domain->enforce_cache_coherency.
> > My concern only arrives from the actual code sequence, i.e.
> > dma->cache_flush_required is set to true before the actual mapping.
> > 
> > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > actual successful mapping, it would lead to more code to handle flushing for the
> > unwind case.
> > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > by checking dma->cache_flush_required, which is true even before a full
> > successful mapping, so we won't miss flush on any pages that are mapped into a
> > non-coherent domain in a short window.
> 
> I don't think we need to be so literal that "mapped_noncoherent" can
> only be set after the vfio_dma is fully mapped to a noncoherent domain,
> but also we can come up with other names for the flag.  Perhaps
> "is_noncoherent".  My suggestion was more from the perspective of what
> does the flag represent rather than what we intend to do as a result of
> the flag being set.  Thanks, 
Makes sense!
I like the name "is_noncoherent" :)
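
For reference, below is a minimal, self-contained sketch of the set-once
pattern being discussed, using stand-in types; the names dma->is_noncoherent
and arch_clean_nonsnoop_dma() follow this thread and are assumptions here,
not the final code.

/*
 * Minimal sketch of the set-once ("sticky") flag pattern discussed above,
 * with stand-in types; not the posted code.
 */
#include <linux/types.h>

struct dma_sketch {
	bool is_noncoherent;		/* set once, never cleared */
};

struct domain_sketch {
	bool enforce_cache_coherency;
};

/* Proposed arch helper from patch 3 of this series. */
void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t size);

static void replay_one_range_sketch(struct domain_sketch *domain,
				    struct dma_sketch *dma,
				    phys_addr_t phys, size_t size)
{
	/* True only on the first attach of this dma to a non-coherent domain. */
	bool set_noncoherent = !domain->enforce_cache_coherency &&
			       !dma->is_noncoherent;

	if (set_noncoherent)
		dma->is_noncoherent = true;

	/* Flush before the range becomes reachable by non-snooping DMA. */
	if (set_noncoherent)
		arch_clean_nonsnoop_dma(phys, size);

	/* ... then map phys at the target IOVA into the new domain ... */
}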

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-17  3:11             ` Yan Zhao
@ 2024-05-17  4:44               ` Alex Williamson
  2024-05-17  5:00                 ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-17  4:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Fri, 17 May 2024 11:11:48 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> > On Mon, 13 May 2024 15:11:28 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > > On Fri, 10 May 2024 18:31:13 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:    
> > > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:      
> ...   
> > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > > >  	for (; n; n = rb_next(n)) {
> > > > > > >  		struct vfio_dma *dma;
> > > > > > >  		dma_addr_t iova;
> > > > > > > +		bool cache_flush_required;
> > > > > > >  
> > > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > > >  		iova = dma->iova;
> > > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > > +				       !dma->cache_flush_required;
> > > > > > > +		if (cache_flush_required)
> > > > > > > +			dma->cache_flush_required = true;      
> > > > > > 
> > > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > > a cache flush.  So the variable state is something more akin to
> > > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > > flush, so it could really be simplified to:
> > > > > > 
> > > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;      
> > > > > Sorry about the confusion.
> > > > > 
> > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > > enforcing cache coherency due to the lazily flushing design.    
> > > > 
> > > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > > 
> > > > 	if (!dma->cache_flush_required)
> > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > 
> > > Though this code is easier for understanding, it leads to unnecessary setting of
> > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > > true at the most time.  
> > 
> > I don't really see that as an issue, but the variable name originally
> > chosen above, cache_flush_required, also doesn't convey that it's only
> > attempting to set the value if it wasn't previously set and is now
> > required by a noncoherent domain.  
> Agreed, the old name is too vague.
> What about update_to_noncoherent_required?

set_noncoherent?  Thanks,

Alex

> Then in vfio_iommu_replay(), it's like
> 
> update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
> if (update_to_noncoherent_required)
>          dma->is_noncoherent = true;
> 
> ...
> if (update_to_noncoherent_required)
> 	arch_flush_cache_phys(phys, size);
> >   
> > > > > > It might add more clarity to just name the mapping flag
> > > > > > dma->mapped_noncoherent.      
> > > > > 
> > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > > and page unpinning.    
> > > > 
> > > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > > that isn't the result of being mapped into a domain with
> > > > !domain->enforce_cache_coherency?    
> > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > > !domain->enforce_cache_coherency.
> > > My concern only arrives from the actual code sequence, i.e.
> > > dma->cache_flush_required is set to true before the actual mapping.
> > > 
> > > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > > actual successful mapping, it would lead to more code to handle flushing for the
> > > unwind case.
> > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > > by checking dma->cache_flush_required, which is true even before a full
> > > successful mapping, so we won't miss flush on any pages that are mapped into a
> > > non-coherent domain in a short window.  
> > 
> > I don't think we need to be so literal that "mapped_noncoherent" can
> > only be set after the vfio_dma is fully mapped to a noncoherent domain,
> > but also we can come up with other names for the flag.  Perhaps
> > "is_noncoherent".  My suggestion was more from the perspective of what
> > does the flag represent rather than what we intend to do as a result of
> > the flag being set.  Thanks,   
> Makes sense!
> I like the name "is_noncoherent" :)
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-17  4:44               ` Alex Williamson
@ 2024-05-17  5:00                 ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-17  5:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, x86, jgg, kevin.tian, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, yi.l.liu

On Thu, May 16, 2024 at 10:44:42PM -0600, Alex Williamson wrote:
> On Fri, 17 May 2024 11:11:48 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote:
> > > On Mon, 13 May 2024 15:11:28 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote:  
> > > > > On Fri, 10 May 2024 18:31:13 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote:    
> > > > > > > On Tue,  7 May 2024 14:21:38 +0800
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:      
> > ...   
> > > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > > > > > > >  	for (; n; n = rb_next(n)) {
> > > > > > > >  		struct vfio_dma *dma;
> > > > > > > >  		dma_addr_t iova;
> > > > > > > > +		bool cache_flush_required;
> > > > > > > >  
> > > > > > > >  		dma = rb_entry(n, struct vfio_dma, node);
> > > > > > > >  		iova = dma->iova;
> > > > > > > > +		cache_flush_required = !domain->enforce_cache_coherency &&
> > > > > > > > +				       !dma->cache_flush_required;
> > > > > > > > +		if (cache_flush_required)
> > > > > > > > +			dma->cache_flush_required = true;      
> > > > > > > 
> > > > > > > The variable name here isn't accurate and the logic is confusing.  If
> > > > > > > the domain does not enforce coherency and the mapping is not tagged as
> > > > > > > requiring a cache flush, then we need to mark the mapping as requiring
> > > > > > > a cache flush.  So the variable state is something more akin to
> > > > > > > set_cache_flush_required.  But all we're saving with this is a
> > > > > > > redundant set if the mapping is already tagged as requiring a cache
> > > > > > > flush, so it could really be simplified to:
> > > > > > > 
> > > > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;      
> > > > > > Sorry about the confusion.
> > > > > > 
> > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache
> > > > > > coherency, we hope it will not be reset to false by a later attaching to domain 
> > > > > > enforcing cache coherency due to the lazily flushing design.    
> > > > > 
> > > > > Right, ok, the vfio_dma objects are shared between domains so we never
> > > > > want to set 'dma->cache_flush_required = false' due to the addition of a
> > > > > 'domain->enforce_cache_coherent == true'.  So this could be:
> > > > > 
> > > > > 	if (!dma->cache_flush_required)
> > > > > 		dma->cache_flush_required = !domain->enforce_cache_coherency;    
> > > > 
> > > > Though this code is easier for understanding, it leads to unnecessary setting of
> > > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is
> > > > true at the most time.  
> > > 
> > > I don't really see that as an issue, but the variable name originally
> > > chosen above, cache_flush_required, also doesn't convey that it's only
> > > attempting to set the value if it wasn't previously set and is now
> > > required by a noncoherent domain.  
> > Agreed, the old name is too vague.
> > What about update_to_noncoherent_required?
> 
> set_noncoherent?  Thanks,
> 
Concise!

> 
> > Then in vfio_iommu_replay(), it's like
> > 
> > update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent;
> > if (update_to_noncoherent_required)
> >          dma->is_noncoherent = true;
> > 
> > ...
> > if (update_to_noncoherent_required)
> > 	arch_flush_cache_phys(phys, size);
> > >   
> > > > > > > It might add more clarity to just name the mapping flag
> > > > > > > dma->mapped_noncoherent.      
> > > > > > 
> > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires
> > > > > > cache flush in the subsequence mapping into the first non-coherent domain
> > > > > > and page unpinning.    
> > > > > 
> > > > > How do we arrive at a sequence where we have dma->cache_flush_required
> > > > > that isn't the result of being mapped into a domain with
> > > > > !domain->enforce_cache_coherency?    
> > > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with
> > > > !domain->enforce_cache_coherency.
> > > > My concern only arrives from the actual code sequence, i.e.
> > > > dma->cache_flush_required is set to true before the actual mapping.
> > > > 
> > > > If we rename it to dma->mapped_noncoherent and only set it to true after the
> > > > actual successful mapping, it would lead to more code to handle flushing for the
> > > > unwind case.
> > > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote()
> > > > by checking dma->cache_flush_required, which is true even before a full
> > > > successful mapping, so we won't miss flush on any pages that are mapped into a
> > > > non-coherent domain in a short window.  
> > > 
> > > I don't think we need to be so literal that "mapped_noncoherent" can
> > > only be set after the vfio_dma is fully mapped to a noncoherent domain,
> > > but also we can come up with other names for the flag.  Perhaps
> > > "is_noncoherent".  My suggestion was more from the perspective of what
> > > does the flag represent rather than what we intend to do as a result of
> > > the flag being set.  Thanks,   
> > Makes sense!
> > I like the name "is_noncoherent" :)
> > 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16  2:32                 ` Yan Zhao
  2024-05-16  8:38                   ` Tian, Kevin
@ 2024-05-17 17:04                   ` Jason Gunthorpe
  2024-05-20  2:45                     ` Yan Zhao
  1 sibling, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-17 17:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Thu, May 16, 2024 at 10:32:43AM +0800, Yan Zhao wrote:
> On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > 
> > > > So it has to be calculated on closer to a page by page basis (really a
> > > > span by span basis) if flushing of that span is needed based on where
> > > > the pages came from. Only pages that came from a hwpt that is
> > > > non-coherent can skip the flushing.
> > > Is area by area basis also good?
> > > Isn't an area either not mapped to any domain or mapped into all domains?
> > 
> > Yes, this is what the span iterator turns into in the background, it
> > goes area by area to cover things.
> > 
> > > But, yes, considering the limited number of non-coherent domains, it appears
> > > more robust and clean to always flush for non-coherent domain in
> > > iopt_area_fill_domain().
> > > It eliminates the need to decide whether to retain the area flag during a split.
> > 
> > And flush for pin user pages, so you basically always flush because
> > you can't tell where the pages came from.
> As a summary, do you think it's good to flush in below way?
> 
> 1. in iopt_area_fill_domains(), flush before mapping a page into domains when
>    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
>    Record cache_flush_required in pages for unpin.
> 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
>    flush before mapping a page into a non-coherent domain, no matter where the
>    page is from.
>    Record cache_flush_required in pages for unpin.
> 3. in batch_unpin(), flush if pages->cache_flush_required before
>    unpin_user_pages.

It does not quite sound right; there should be no tracking in the
pages of this stuff.

If pfn_reader_fill_span() does batch_from_domain() and
the source domain's storage_domain is non-coherent then you can skip
the flush. This is not pedantically perfect in skipping all flushes, but
in practice it is probably good enough.

__iopt_area_unfill_domain() (and children) must flush after
iopt_area_unmap_domain_range() if the area's domain is
non-coherent. This is also not perfect, but probably good enough.

Doing better in both cases would require inspecting the areas under
the used span to see what is there. This is not so easy.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-16 20:31             ` Alex Williamson
@ 2024-05-17 17:11               ` Jason Gunthorpe
  2024-05-20  2:52                 ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-17 17:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Zhao, Yan Y, kvm, linux-kernel, x86, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote:

> Yes, exactly.  Zero'ing the page would obviously reestablish the
> coherency, but the page could be reallocated without being zero'd and as
> you describe the owner of that page could then get inconsistent
> results.  

I think if we care about the performance of this stuff enough to try
and remove flushes we'd be better off figuring out how to disable no
snoop in PCI config space and trust the device not to use it and avoid
these flushes.

iommu enforcement is nice, but at least ARM has been assuming that the
PCI config space bit is sufficient.

Intel/AMD are probably fine here as they will only flush for weird GPU
cases, but I expect ARM is going to be unhappy.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range
  2024-05-16 14:07         ` Sean Christopherson
@ 2024-05-20  2:36           ` Tian, Kevin
  0 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2024-05-20  2:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Zhao, Yan Y, kvm, linux-kernel, x86, alex.williamson, jgg, iommu,
	pbonzini, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L,
	Tom Lendacky

> From: Sean Christopherson <seanjc@google.com>
> Sent: Thursday, May 16, 2024 10:07 PM
> 
> +Tom
> 
> On Thu, May 16, 2024, Kevin Tian wrote:
> > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > Sent: Tuesday, May 7, 2024 5:13 PM
> > >
> > > On Tue, May 07, 2024 at 04:26:37PM +0800, Tian, Kevin wrote:
> > > > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > > > Sent: Tuesday, May 7, 2024 2:19 PM
> > > > >
> > > > > @@ -705,7 +705,17 @@ static enum page_cache_mode
> > > > > lookup_memtype(u64 paddr)
> > > > >   */
> > > > >  bool pat_pfn_immune_to_uc_mtrr(unsigned long pfn)
> > > > >  {
> > > > > -	enum page_cache_mode cm =
> lookup_memtype(PFN_PHYS(pfn));
> > > > > +	u64 paddr = PFN_PHYS(pfn);
> > > > > +	enum page_cache_mode cm;
> > > > > +
> > > > > +	/*
> > > > > +	 * Check MTRR type for untracked pat range since
> lookup_memtype()
> > > > > always
> > > > > +	 * returns WB for this range.
> > > > > +	 */
> > > > > +	if (x86_platform.is_untracked_pat_range(paddr, paddr +
> PAGE_SIZE))
> > > > > +		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
> > > > > _PAGE_CACHE_MODE_WB);
> > > >
> > > > doing so violates the name of this function. The PAT of the untracked
> > > > range is still WB and not immune to UC MTRR.
> > > Right.
> > > Do you think we can rename this function to something like
> > > pfn_of_uncachable_effective_memory_type() and make it work under
> > > !pat_enabled() too?
> >
> > let's hear from x86/kvm maintainers for their opinions.
> >
> > My gut-feeling is that kvm_is_mmio_pfn() might be moved into the
> > x86 core as the logic there has nothing specific to kvm itself. Also
> > naming-wise it doesn't really matter whether the pfn is mmio. The
> > real point is to find the uncacheble memtype in the primary mmu
> > and then follow it in KVM.
> 
> Yeaaaah, we've got an existing problem there.  When AMD's SME is enabled,
> KVM
> uses kvm_is_mmio_pfn() to determine whether or not to map memory into
> the guest
> as encrypted or plain text.  I.e. KVM really does try to use this helper to
> detect MMIO vs. RAM.  I highly doubt that actually works in all setups.
> 
> For SME, it seems like the best approach would be grab the C-Bit from the
> host
> page tables, similar to how KVM uses host_pfn_mapping_level().

Yes, that sounds clearer. Checking MMIO vs. RAM is kind of an indirect hint.

> 
> SME aside, I don't have objection to moving kvm_is_mmio_pfn() out of KVM.
> 
> > from that point probably a pfn_memtype_uncacheable() reads clearer.
> 
> or even just pfn_is_memtype_uc()?

yes, better.
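
For illustration, a sketch of what the renamed helper could look like,
assuming it keeps the logic of the hunk quoted above plus the return
condition of the existing pat_pfn_immune_to_uc_mtrr(); this is not the
actual patch.

/*
 * Hypothetical sketch of the rename discussed above, in the context of
 * arch/x86/mm/pat/memtype.c (lookup_memtype() and pat_x_mtrr_type() are
 * local to that file); not the actual patch.
 */
bool pfn_is_memtype_uc(unsigned long pfn)
{
	u64 paddr = PFN_PHYS(pfn);
	enum page_cache_mode cm;

	/*
	 * lookup_memtype() always returns WB for untracked PAT ranges, so
	 * combine with the MTRR type there instead.
	 */
	if (x86_platform.is_untracked_pat_range(paddr, paddr + PAGE_SIZE))
		cm = pat_x_mtrr_type(paddr, paddr + PAGE_SIZE,
				     _PAGE_CACHE_MODE_WB);
	else
		cm = lookup_memtype(paddr);

	return cm == _PAGE_CACHE_MODE_UC ||
	       cm == _PAGE_CACHE_MODE_UC_MINUS ||
	       cm == _PAGE_CACHE_MODE_WC;
}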

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-17 17:04                   ` Jason Gunthorpe
@ 2024-05-20  2:45                     ` Yan Zhao
  2024-05-21 16:04                       ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-20  2:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Fri, May 17, 2024 at 02:04:18PM -0300, Jason Gunthorpe wrote:
> On Thu, May 16, 2024 at 10:32:43AM +0800, Yan Zhao wrote:
> > On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > > 
> > > > > So it has to be calculated on closer to a page by page basis (really a
> > > > > span by span basis) if flushing of that span is needed based on where
> > > > > the pages came from. Only pages that came from a hwpt that is
> > > > > non-coherent can skip the flushing.
> > > > Is area by area basis also good?
> > > > Isn't an area either not mapped to any domain or mapped into all domains?
> > > 
> > > Yes, this is what the span iterator turns into in the background, it
> > > goes area by area to cover things.
> > > 
> > > > But, yes, considering the limited number of non-coherent domains, it appears
> > > > more robust and clean to always flush for non-coherent domain in
> > > > iopt_area_fill_domain().
> > > > It eliminates the need to decide whether to retain the area flag during a split.
> > > 
> > > And flush for pin user pages, so you basically always flush because
> > > you can't tell where the pages came from.
> > As a summary, do you think it's good to flush in below way?
> > 
> > 1. in iopt_area_fill_domains(), flush before mapping a page into domains when
> >    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
> >    Record cache_flush_required in pages for unpin.
> > 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
> >    flush before mapping a page into a non-coherent domain, no matter where the
> >    page is from.
> >    Record cache_flush_required in pages for unpin.
> > 3. in batch_unpin(), flush if pages->cache_flush_required before
> >    unpin_user_pages.
> 
> It does not quite sound right, there should be no tracking in the
> pages of this stuff.
What's the downside of having tracking in the pages?

Lazily flushing pages right before unpinning them is not only to save flush
count for performance, but also to address a real problem we encountered; see
below.

> 
> If pfn_reader_fill_span() does batch_from_domain() and
> the source domain's storage_domain is non-coherent then you can skip
> the flush. This is not pedantically perfect in skipping all flushes, but
> in practice it is probably good enough.
We don't know whether the source storage_domain is non-coherent, since
area->storage_domain is just a "struct iommu_domain".

Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
and set this flag alongside setting storage_domain?
(But it looks like this is not easy in iopt_area_fill_domains(), as we don't
have the hwpt there.)

> __iopt_area_unfill_domain() (and children) must flush after
> iopt_area_unmap_domain_range() if the area's domain is
> non-coherent. This is also not perfect, but probably good enough.
Do you mean flush after each iopt_area_unmap_domain_range() if the domain is
non-coherent?
The problem is that iopt_area_unmap_domain_range() knows only the IOVA; the
IOVA->PFN relationship is not available without iommu_iova_to_phys(), and
iommu_domain contains no coherency info.
Besides, when the non-coherent domain is a storage domain, we still need to do
the flush in batch_unpin(), right?
Then, in a more complex case, if the non-coherent domain is a storage domain,
and if some pages are still held in pages->access_itree when unfilling the
domain, should we get PFNs from pages->pinned_pfns and do the flush in
__iopt_area_unfill_domain()?
> 
> Doing better in both cases would require inspecting the areas under
> the used span to see what is there. This is not so easy.
My feeling is that checking the non-coherency of the target domain and saving
it in the pages might be the easiest way with the least code change.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-17 17:11               ` Jason Gunthorpe
@ 2024-05-20  2:52                 ` Tian, Kevin
  2024-05-21 16:07                   ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-20  2:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Vetter, Daniel
  Cc: Zhao, Yan Y, kvm, linux-kernel, x86, iommu, pbonzini, seanjc,
	dave.hansen, luto, peterz, tglx, mingo, bp, hpa, corbet, joro,
	will, robin.murphy, baolu.lu, Liu, Yi L

+Daniel

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, May 18, 2024 1:11 AM
> 
> On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote:
> 
> > Yes, exactly.  Zero'ing the page would obviously reestablish the
> > coherency, but the page could be reallocated without being zero'd and as
> > you describe the owner of that page could then get inconsistent
> > results.
> 
> I think if we care about the performance of this stuff enough to try
> and remove flushes we'd be better off figuring out how to disable no
> snoop in PCI config space and trust the device not to use it and avoid
> these flushes.
> 
> iommu enforcement is nice, but at least ARM has been assuming that the
> PCI config space bit is sufficient.
> 
> Intel/AMD are probably fine here as they will only flush for weird GPU
> cases, but I expect ARM is going to be unhappy.
> 

My impression was that the Intel GPU is not usable w/o non-coherent DMA,
but I don't remember whether that unusability is a functional breakage
or a user-experience breakage. e.g. I vaguely recall that the display
engine cannot sustain high resolution/high refresh rate with snooped
transactions, so the IOMMU dedicated to the GPU doesn't implement the
force-snoop capability.

Daniel, can you help explain the behavior of Intel GPU in case nosnoop
is disabled in the PCI config space?

Overall it sounds like we are talking about different requirements. For
the Intel GPU nosnoop is a must but it is not currently done securely, so
we need to add proper flushing to fix it, while for ARM it looks like you
don't have a case which relies on nosnoop, so finding a way to disable it
is more straightforward?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-07  6:20 ` [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma() Yan Zhao
  2024-05-07  8:51   ` Tian, Kevin
@ 2024-05-20 14:07   ` Christoph Hellwig
  2024-05-21 15:49     ` Jason Gunthorpe
  1 sibling, 1 reply; 67+ messages in thread
From: Christoph Hellwig @ 2024-05-20 14:07 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, jgg, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Russell King

On Tue, May 07, 2024 at 02:20:44PM +0800, Yan Zhao wrote:
> Introduce and export interface arch_clean_nonsnoop_dma() to flush CPU
> caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
> snooping).

Err, no.  There should really be no exported cache manipulation macros,
as drivers are almost guaranteed to get this wrong.  I've added
Russell to the Cc list who has been extremely vocal about this at least
for arm.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-20 14:07   ` Christoph Hellwig
@ 2024-05-21 15:49     ` Jason Gunthorpe
  2024-05-21 16:00       ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 15:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yan Zhao, kvm, linux-kernel, x86, alex.williamson, kevin.tian,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Russell King

On Mon, May 20, 2024 at 07:07:10AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 02:20:44PM +0800, Yan Zhao wrote:
> > Introduce and export interface arch_clean_nonsnoop_dma() to flush CPU
> > caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
> > snooping).
> 
> Err, no.  There should really be no exported cache manipulation macros,
> as drivers are almost guaranteed to get this wrong.  I've added
> Russell to the Cc list who has been extremely vocal about this at least
> for arm.

We could possibly move this under some IOMMU core API (ie flush and
map, unmap and flush); the iommu APIs are non-modular, so this could
avoid the exported symbol.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-21 15:49     ` Jason Gunthorpe
@ 2024-05-21 16:00       ` Jason Gunthorpe
  2024-05-22  3:41         ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 16:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yan Zhao, kvm, linux-kernel, x86, alex.williamson, kevin.tian,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu,
	Russell King

On Tue, May 21, 2024 at 12:49:39PM -0300, Jason Gunthorpe wrote:
> On Mon, May 20, 2024 at 07:07:10AM -0700, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 02:20:44PM +0800, Yan Zhao wrote:
> > > Introduce and export interface arch_clean_nonsnoop_dma() to flush CPU
> > > caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
> > > snooping).
> > 
> > Err, no.  There should really be no exported cache manipulation macros,
> > as drivers are almost guaranteed to get this wrong.  I've added
> > Russell to the Cc list who has been extremely vocal about this at least
> > for arm.
> 
> We could possibly move this under some IOMMU core API (ie flush and
> map, unmap and flush), the iommu APIs are non-modular so this could
> avoid the exported symbol.

Though this would be pretty difficult for unmap, as we don't have the
PFNs in the core code to flush. I don't think we have a lot of good
options other than to make iommufd & VFIO handle this directly, as they
have the list of pages to flush on the unmap side. Use a namespace?

Jason
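
To make the flushing concrete, here is a minimal sketch of how a
physical-address cache clean could be open-coded on x86. It is illustrative
only (not the posted arch_clean_nonsnoop_dma()) and assumes the range is
ordinary RAM where CLFLUSH is safe.

/*
 * Illustrative sketch only, not the posted helper.  Assumes the range is
 * ordinary RAM where CLFLUSH is safe; MMIO ranges need separate handling.
 */
#include <linux/highmem.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>

static void clean_phys_range_sketch(phys_addr_t phys, size_t size)
{
	while (size) {
		unsigned int off = offset_in_page(phys);
		size_t len = min_t(size_t, size, PAGE_SIZE - off);
		void *va = kmap_local_page(pfn_to_page(PHYS_PFN(phys)));

		/* Clean the cachelines backing this physical sub-range. */
		clflush_cache_range(va + off, len);
		kunmap_local(va);

		phys += len;
		size -= len;
	}
}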

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-20  2:45                     ` Yan Zhao
@ 2024-05-21 16:04                       ` Jason Gunthorpe
  2024-05-22  3:17                         ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 16:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Mon, May 20, 2024 at 10:45:56AM +0800, Yan Zhao wrote:
> On Fri, May 17, 2024 at 02:04:18PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 16, 2024 at 10:32:43AM +0800, Yan Zhao wrote:
> > > On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > > > 
> > > > > > So it has to be calculated on closer to a page by page basis (really a
> > > > > > span by span basis) if flushing of that span is needed based on where
> > > > > > the pages came from. Only pages that came from a hwpt that is
> > > > > > non-coherent can skip the flushing.
> > > > > Is area by area basis also good?
> > > > > Isn't an area either not mapped to any domain or mapped into all domains?
> > > > 
> > > > Yes, this is what the span iterator turns into in the background, it
> > > > goes area by area to cover things.
> > > > 
> > > > > But, yes, considering the limited number of non-coherent domains, it appears
> > > > > more robust and clean to always flush for non-coherent domain in
> > > > > iopt_area_fill_domain().
> > > > > It eliminates the need to decide whether to retain the area flag during a split.
> > > > 
> > > > And flush for pin user pages, so you basically always flush because
> > > > you can't tell where the pages came from.
> > > As a summary, do you think it's good to flush in below way?
> > > 
> > > 1. in iopt_area_fill_domains(), flush before mapping a page into domains when
> > >    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
> > >    Record cache_flush_required in pages for unpin.
> > > 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
> > >    flush before mapping a page into a non-coherent domain, no matter where the
> > >    page is from.
> > >    Record cache_flush_required in pages for unpin.
> > > 3. in batch_unpin(), flush if pages->cache_flush_required before
> > >    unpin_user_pages.
> > 
> > It does not quite sound right, there should be no tracking in the
> > pages of this stuff.
> What's the downside of having tracking in the pages?

Well, a counter doesn't make sense. You could have a single sticky bit
that indicates that all PFNs are coherency dirty and overflush them on
every map and unmap operation.

This is certainly the simplest option, but gives the maximal flushes.

If you want to minimize flushes then you can't store flush
minimization information in the pages because it isn't global to the
pages and will not be accurate enough.

> > If pfn_reader_fill_span() does batch_from_domain() and
> > the source domain's storage_domain is non-coherent then you can skip
> > the flush. This is not pedantically perfect in skipping all flushes, but
> > in practice it is probably good enough.

> We don't know whether the source storage_domain is non-coherent since
> area->storage_domain is of "struct iommu_domain".
 
> Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
> and set this flag along side setting storage_domain?

Sure, that could work.

> > __iopt_area_unfill_domain() (and children) must flush after
> > iopt_area_unmap_domain_range() if the area's domain is
> > non-coherent. This is also not perfect, but probably good enough.
> Do you mean flush after each iopt_area_unmap_domain_range() if the domain is
> non-coherent?
> The problem is that iopt_area_unmap_domain_range() knows only IOVA, the
> IOVA->PFN relationship is not available without iommu_iova_to_phys() and
> iommu_domain contains no coherency info.

Yes, you'd have to read back the PFNs on this path, which it doesn't do
right now. Given this pain it would be simpler to have one bit in the
pages that marks them permanently non-coherent, so that all PFNs are
flushed before put_page is called.

The trouble with a counter is that the count going to zero doesn't
really mean we flushed the PFN if it is being held someplace else.

Jason
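
As a rough illustration of the sticky-bit idea, with a hypothetical container
standing in for iommufd's struct iopt_pages (the field and helper names follow
this thread and are not the real layout):

/*
 * Rough illustration of the sticky-bit idea.  The container and field are
 * hypothetical stand-ins, not iommufd's real struct iopt_pages layout.
 */
#include <linux/mm.h>
#include <linux/pfn.h>

struct pinned_pages_sketch {
	bool cache_flush_required;	/* sticky: set once any user is non-coherent */
};

/* Proposed arch helper from patch 3 of this series. */
void arch_clean_nonsnoop_dma(phys_addr_t phys, size_t size);

static void unpin_one_page_sketch(struct pinned_pages_sketch *pages,
				  struct page *page)
{
	/* Overflush: if any backing domain was ever non-coherent, flush. */
	if (pages->cache_flush_required)
		arch_clean_nonsnoop_dma(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);

	unpin_user_page(page);
}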

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-20  2:52                 ` Tian, Kevin
@ 2024-05-21 16:07                   ` Jason Gunthorpe
  2024-05-21 16:21                     ` Alex Williamson
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 16:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Mon, May 20, 2024 at 02:52:43AM +0000, Tian, Kevin wrote:
> +Daniel
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, May 18, 2024 1:11 AM
> > 
> > On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote:
> > 
> > > Yes, exactly.  Zero'ing the page would obviously reestablish the
> > > coherency, but the page could be reallocated without being zero'd and as
> > > you describe the owner of that page could then get inconsistent
> > > results.
> > 
> > I think if we care about the performance of this stuff enough to try
> > and remove flushes we'd be better off figuring out how to disable no
> > snoop in PCI config space and trust the device not to use it and avoid
> > these flushes.
> > 
> > iommu enforcement is nice, but at least ARM has been assuming that the
> > PCI config space bit is sufficient.
> > 
> > Intel/AMD are probably fine here as they will only flush for weird GPU
> > cases, but I expect ARM is going to be unhappy.
> > 
> 
> My impression was that Intel GPU is not usable w/o non-coherent DMA,
> but I don't remember whether it's unusable being a functional breakage
> or a user experience breakage. e.g. I vaguely recalled that the display
> engine cannot afford high resolution/high refresh rate using the snoop
> way so the IOMMU dedicated for the GPU doesn't implement the force
> snoop capability.
> 
> Daniel, can you help explain the behavior of Intel GPU in case nosnoop
> is disabled in the PCI config space?
> 
> Overall it sounds that we are talking about different requirements. For
> Intel GPU nosnoop is a must but it is not currently done securely so we
> need add proper flush to fix it, while for ARM looks you don't have a
> case which relies on nosnoop so finding a way to disable it is more
> straightforward?

Intel GPU weirdness should not leak into making other devices
insecure/slow. If necessary, only the Intel GPU should get some variant
override to keep no-snoop working.

It would make a lot of good sense if VFIO made the default to disable
no-snoop via the config space.

Jason
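
For reference, clearing No Snoop Enable from the kernel side is a one-liner;
whether VFIO should do this by default, and whether a given device honors it,
is exactly what is being debated here. A sketch:

/*
 * Illustrative only: clear No Snoop Enable in the PCIe Device Control
 * register.  Whether VFIO should do this by default, and whether a given
 * device honors it, is what is being debated in this thread.
 */
#include <linux/pci.h>

static void disable_no_snoop_sketch(struct pci_dev *pdev)
{
	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
				   PCI_EXP_DEVCTL_NOSNOOP_EN);
}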

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 16:07                   ` Jason Gunthorpe
@ 2024-05-21 16:21                     ` Alex Williamson
  2024-05-21 16:34                       ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-21 16:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, 21 May 2024 13:07:14 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 20, 2024 at 02:52:43AM +0000, Tian, Kevin wrote:
> > +Daniel
> >   
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, May 18, 2024 1:11 AM
> > > 
> > > On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote:
> > >   
> > > > Yes, exactly.  Zero'ing the page would obviously reestablish the
> > > > coherency, but the page could be reallocated without being zero'd and as
> > > > you describe the owner of that page could then get inconsistent
> > > > results.  
> > > 
> > > I think if we care about the performance of this stuff enough to try
> > > and remove flushes we'd be better off figuring out how to disable no
> > > snoop in PCI config space and trust the device not to use it and avoid
> > > these flushes.
> > > 
> > > iommu enforcement is nice, but at least ARM has been assuming that the
> > > PCI config space bit is sufficient.
> > > 
> > > Intel/AMD are probably fine here as they will only flush for weird GPU
> > > cases, but I expect ARM is going to be unhappy.
> > >   
> > 
> > My impression was that Intel GPU is not usable w/o non-coherent DMA,
> > but I don't remember whether it's unusable being a functional breakage
> > or a user experience breakage. e.g. I vaguely recalled that the display
> > engine cannot afford high resolution/high refresh rate using the snoop
> > way so the IOMMU dedicated for the GPU doesn't implement the force
> > snoop capability.
> > 
> > Daniel, can you help explain the behavior of Intel GPU in case nosnoop
> > is disabled in the PCI config space?
> > 
> > Overall it sounds that we are talking about different requirements. For
> > Intel GPU nosnoop is a must but it is not currently done securely so we
> > need add proper flush to fix it, while for ARM looks you don't have a
> > case which relies on nosnoop so finding a way to disable it is more
> > straightforward?  
> 
> Intel GPU weirdness should not leak into making other devices
> insecure/slow. If necessary Intel GPU only should get some variant
> override to keep no snoop working.
> 
> It would make a lot of good sense if VFIO made the default to disable
> no-snoop via the config space.

We can certainly virtualize the config space no-snoop enable bit, but
I'm not sure what it actually accomplishes.  We'd then be relying on
the device to honor the bit and not have any backdoors to twiddle the
bit otherwise (where we know that GPUs often have multiple paths to get
to config space).  We also then have the question of whether the device
functions correctly if we disable no-snoop.  The more secure approach
might be that we need to do these cache flushes for any IOMMU that
doesn't maintain coherency, even for no-snoop transactions.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 16:21                     ` Alex Williamson
@ 2024-05-21 16:34                       ` Jason Gunthorpe
  2024-05-21 18:19                         ` Alex Williamson
  2024-05-22  3:24                         ` Yan Zhao
  0 siblings, 2 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 16:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote:

> > Intel GPU weirdness should not leak into making other devices
> > insecure/slow. If necessary Intel GPU only should get some variant
> > override to keep no snoop working.
> > 
> > It would make a lot of good sense if VFIO made the default to disable
> > no-snoop via the config space.
> 
> We can certainly virtualize the config space no-snoop enable bit, but
> I'm not sure what it actually accomplishes.  We'd then be relying on
> the device to honor the bit and not have any backdoors to twiddle the
> bit otherwise (where we know that GPUs often have multiple paths to get
> to config space).

I'm OK with this. If devices are insecure then they need quirks in
vfio to disclose their problems; we shouldn't punish everyone who
followed the spec because of some bad actors.

But more broadly in a security engineered environment we can trust the
no-snoop bit to work properly.

> We also then have the question of does the device function
> correctly if we disable no-snoop.

Other than the GPU BW issue the no-snoop is not a functional behavior.

> The more secure approach might be that we need to do these cache
> flushes for any IOMMU that doesn't maintain coherency, even for
> no-snoop transactions.  Thanks,

Did you mean 'even for snoop transactions'?

That is where this series is: it assumes a no-snoop transaction took
place even if that is impossible, because of config space, and then
does pessimistic flushes.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 16:34                       ` Jason Gunthorpe
@ 2024-05-21 18:19                         ` Alex Williamson
  2024-05-21 18:37                           ` Jason Gunthorpe
  2024-05-22  3:33                           ` Yan Zhao
  2024-05-22  3:24                         ` Yan Zhao
  1 sibling, 2 replies; 67+ messages in thread
From: Alex Williamson @ 2024-05-21 18:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, 21 May 2024 13:34:00 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote:
> 
> > > Intel GPU weirdness should not leak into making other devices
> > > insecure/slow. If necessary Intel GPU only should get some variant
> > > override to keep no snoop working.
> > > 
> > > It would make alot of good sense if VFIO made the default to disable
> > > no-snoop via the config space.  
> > 
> > We can certainly virtualize the config space no-snoop enable bit, but
> > I'm not sure what it actually accomplishes.  We'd then be relying on
> > the device to honor the bit and not have any backdoors to twiddle the
> > bit otherwise (where we know that GPUs often have multiple paths to get
> > to config space).  
> 
> I'm OK with this. If devices are insecure then they need quirks in
> vfio to disclose their problems, we shouldn't punish everyone who
> followed the spec because of some bad actors.
> 
> But more broadly in a security engineered environment we can trust the
> no-snoop bit to work properly.

 The spec has an interesting requirement on devices sending no-snoop
 transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN):

 "Even when this bit is Set, a Function is only permitted to Set the No
  Snoop attribute on a transaction when it can guarantee that the
  address of the transaction is not stored in any cache in the system."

I wouldn't think the function itself has such visibility and it would
leave the problem of reestablishing coherency to the driver, but am I
overlooking something that implicitly makes this safe?  ie. if the
function isn't permitted to perform no-snoop to an address stored in
cache, there's nothing we need to do here.

> > We also then have the question of does the device function
> > correctly if we disable no-snoop.  
> 
> Other than the GPU BW issue the no-snoop is not a functional behavior.

As with some other config space bits though, I think we're kind of
hoping for sloppy driver behavior to virtualize this.  The spec does
allow the bit to be hardwired to zero:

 "This bit is permitted to be hardwired to 0b if a Function would never
  Set the No Snoop attribute in transactions it initiates."

But there's no capability bit that allows us to report whether the
device supports no-snoop; we're just hoping that a driver writing to
the bit doesn't generate a fault if the bit doesn't stick.  For example,
the no-snoop bit in the TLP itself may only be a bandwidth issue, but
if the driver thinks no-snoop support is enabled it may request the
device use the attribute for a specific transaction and the device
could fault if it cannot comply.

> > The more secure approach might be that we need to do these cache
> > flushes for any IOMMU that doesn't maintain coherency, even for
> > no-snoop transactions.  Thanks,  
> 
> Did you mean 'even for snoop transactions'?

I was referring to IOMMUs that maintain coherency regardless of
no-snoop transactions, ie domain->enforce_cache_coherency (ex. snoop
control/SNP on Intel), so I meant as typed, the IOMMU maintaining
coherency even for no-snoop transactions.

That's essentially the case we expect and we don't need to virtualize
no-snoop enable on the device.

> That is where this series is, it assumes a no-snoop transaction took
> place even if that is impossible, because of config space, and then
> does pessimistic flushes.

So are you proposing that we can trust devices to honor the
PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to zero
on IOMMUs that do not enforce coherency as the entire solution?

Or maybe we trap on setting the bit to make the flushing less
pessimistic?

Intel folks might be able to comment on the performance hit, for iGPU
assignment, of denying the device the ability to use no-snoop
transactions (assuming the device control bit is actually honored).
The latency of flushing caches on touching no-snoop enable might be
prohibitive in the latter case.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 18:19                         ` Alex Williamson
@ 2024-05-21 18:37                           ` Jason Gunthorpe
  2024-05-22  6:24                             ` Tian, Kevin
  2024-05-22  3:33                           ` Yan Zhao
  1 sibling, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-21 18:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote:
> > I'm OK with this. If devices are insecure then they need quirks in
> > vfio to disclose their problems, we shouldn't punish everyone who
> > followed the spec because of some bad actors.
> > 
> > But more broadly in a security engineered environment we can trust the
> > no-snoop bit to work properly.
> 
>  The spec has an interesting requirement on devices sending no-snoop
>  transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN):
> 
>  "Even when this bit is Set, a Function is only permitted to Set the No
>   Snoop attribute on a transaction when it can guarantee that the
>   address of the transaction is not stored in any cache in the system."
> 
> I wouldn't think the function itself has such visibility and it would
> leave the problem of reestablishing coherency to the driver, but am I
> overlooking something that implicitly makes this safe?  

I think it is just bad spec language! People are clearly using
no-snoop on cacheable memory today. The authors must have had some
other usage in mind than what the industry actually did.

> But there's no capability bit that allows us to report whether the
> device supports no-snoop, we're just hoping that a driver writing to
> the bit doesn't generate a fault if the bit doesn't stick.  For example
> the no-snoop bit in the TLP itself may only be a bandwidth issue, but
> if the driver thinks no-snoop support is enabled it may request the
> device use the attribute for a specific transaction and the device
> could fault if it cannot comply.

It could, but that is another weirdo quirk IMHO. We already see things
in config space under hypervisor control because VFs don't have the
bits :\

> > > The more secure approach might be that we need to do these cache
> > > flushes for any IOMMU that doesn't maintain coherency, even for
> > > no-snoop transactions.  Thanks,  
> > 
> > Did you mean 'even for snoop transactions'?
> 
> I was referring to IOMMUs that maintain coherency regardless of
> no-snoop transactions, ie domain->enforce_cache_coherency (ex. snoop
> control/SNP on Intel), so I meant as typed, the IOMMU maintaining
> coherency even for no-snoop transactions.
> 
> That's essentially the case we expect and we don't need to virtualize
> no-snoop enable on the device.

It is the most robust case to be sure, and then we don't need
flushing.

My point was we could extend the cases where we don't need to flush if
we pay attention to, or virtualize, the PCI_EXP_DEVCTL_NOSNOOP_EN bit.

> > That is where this series is, it assumes a no-snoop transaction took
> > place even if that is impossible, because of config space, and then
> > does pessimistic flushes.
> 
> So are you proposing that we can trust devices to honor the
> PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to zero
> on IOMMUs that do not enforce coherency as the entire solution?

Maybe not entire, but as an additional step to reduce the cost of
this. ARM would like this for instance.
 
> Or maybe we trap on setting the bit to make the flushing less
> pessimistic?

Also a good idea. The VMM could then decide on policy.
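
Purely as an illustration of that direction (not something this series does),
hard-wiring the bit off on the physical function would be roughly a one-liner
around the existing pcie_capability helper; the function name below is made up:

static int vfio_pci_block_nosnoop(struct pci_dev *pdev)
{
        /* Clear DEVCTL.NOSNOOP_EN so the device must issue snooped TLPs */
        return pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
                                          PCI_EXP_DEVCTL_NOSNOOP_EN);
}

The virtualized copy of DEVCTL could then present the bit as read-only zero to
the guest.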

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 16:04                       ` Jason Gunthorpe
@ 2024-05-22  3:17                         ` Yan Zhao
  2024-05-22  6:29                           ` Yan Zhao
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-22  3:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Tue, May 21, 2024 at 01:04:42PM -0300, Jason Gunthorpe wrote:
> On Mon, May 20, 2024 at 10:45:56AM +0800, Yan Zhao wrote:
> > On Fri, May 17, 2024 at 02:04:18PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 16, 2024 at 10:32:43AM +0800, Yan Zhao wrote:
> > > > On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > > > > 
> > > > > > > So it has to be calculated on closer to a page by page basis (really a
> > > > > > > span by span basis) if flushing of that span is needed based on where
> > > > > > > the pages came from. Only pages that came from a hwpt that is
> > > > > > > non-coherent can skip the flushing.
> > > > > > Is area by area basis also good?
> > > > > > Isn't an area either not mapped to any domain or mapped into all domains?
> > > > > 
> > > > > Yes, this is what the span iterator turns into in the background, it
> > > > > goes area by area to cover things.
> > > > > 
> > > > > > But, yes, considering the limited number of non-coherent domains, it appears
> > > > > > more robust and clean to always flush for non-coherent domain in
> > > > > > iopt_area_fill_domain().
> > > > > > It eliminates the need to decide whether to retain the area flag during a split.
> > > > > 
> > > > > And flush for pin user pages, so you basically always flush because
> > > > > you can't tell where the pages came from.
> > > > As a summary, do you think it's good to flush in below way?
> > > > 
> > > > 1. in iopt_area_fill_domains(), flush before mapping a page into domains when
> > > >    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
> > > >    Record cache_flush_required in pages for unpin.
> > > > 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
> > > >    flush before mapping a page into a non-coherent domain, no matter where the
> > > >    page is from.
> > > >    Record cache_flush_required in pages for unpin.
> > > > 3. in batch_unpin(), flush if pages->cache_flush_required before
> > > >    unpin_user_pages.
> > > 
> > > It does not quite sound right, there should be no tracking in the
> > > pages of this stuff.
> > What's the downside of having tracking in the pages?
> 
> Well, a counter doesn't make sense. You could have a single sticky bit
> that indicates that all PFNs are coherency dirty and overflush them on
> every map and unmap operation.
cache_flush_required is actually a sticky bit. It's set if any PFN in the
iopt_pages is mapped into a noncoherent domain, and batch_unpin() checks this
sticky bit to decide whether to flush.

@@ -198,6 +198,11 @@ struct iopt_pages {
        void __user *uptr;
        bool writable:1;
        u8 account_mode;
+       /*
+        * CPU cache flush is required before mapping the pages to or after
+        * unmapping it from a noncoherent domain
+        */
+       bool cache_flush_required:1;

(Please ignore the confusing comment).

iopt->noncoherent_domain_cnt is a counter. It's increased/decreased on
non-coherent hwpt attach/detach.

@@ -53,6 +53,7 @@ struct io_pagetable {
        struct rb_root_cached reserved_itree;
        u8 disable_large_pages;
        unsigned long iova_alignment;
+       unsigned int noncoherent_domain_cnt;
 };

Since iopt->domains contains no coherency info, this counter helps
iopt_area_fill_domains() decide whether to flush pages and set the sticky bit
cache_flush_required in iopt_pages.
It's not that useful to iopt_area_fill_domain(), though, after your suggestion
to pass in the hwpt.

> This is certainly the simplest option, but gives the maximal flushes.

Why does this give the maximal flushes?
Considering the flush after unmap:
- With a sticky bit in iopt_pages, once an iopt_pages has been mapped into a
  non-coherent domain, the PFNs in the iopt_pages are flushed only once,
  right before the pages are unpinned.

- But if we flush after each iopt_area_unmap_domain_range() for each
  non-coherent domain, then each PFN is flushed as many times as there are
  non-coherent domains.

> 
> If you want to minimize flushes then you can't store flush
> minimization information in the pages because it isn't global to the
> pages and will not be accurate enough.
> 
> > > If pfn_reader_fill_span() does batch_from_domain() and
> > > the source domain's storage_domain is non-coherent then you can skip
> > > the flush. This is not pedantically perfect in skipping all flushes, but
> > > in practice it is probably good enough.
> 
> > We don't know whether the source storage_domain is non-coherent since
> > area->storage_domain is of "struct iommu_domain".
>  
> > Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
> > and set this flag along side setting storage_domain?
> 
> Sure, that could work.
When the storage_domain is set in iopt_area_fill_domains(),
    "area->storage_domain = xa_load(&area->iopt->domains, 0);"
is there a convenient way to know the storage_domain is non-coherent?

> 
> > > __iopt_area_unfill_domain() (and children) must flush after
> > > iopt_area_unmap_domain_range() if the area's domain is
> > > non-coherent. This is also not perfect, but probably good enough.
> > Do you mean flush after each iopt_area_unmap_domain_range() if the domain is
> > non-coherent?
> > The problem is that iopt_area_unmap_domain_range() knows only IOVA, the
> > IOVA->PFN relationship is not available without iommu_iova_to_phys() and
> > iommu_domain contains no coherency info.
> 
> Yes, you'd have to read back the PFNs on this path which it doesn't do
> right now.. Given this pain it would be simpler to have one bit in the
> pages that marks it permanently non-coherent and all pfns will be
> flushed before put_page is called.
> 
> The trouble with a counter is that the count going to zero doesn't
> really mean we flushed the PFN if it is being held someplace else.
Not sure if you are conflating iopt->noncoherent_domain_cnt with
pages->cache_flush_required.

iopt->noncoherent_domain_cnt is increased/decreased on non-coherent hwpt
attach/detach.

Once iopt->noncoherent_domain_cnt is non-zero, the sticky bit
cache_flush_required in iopt_pages is set while filling the domain, and the
PFNs in the iopt_pages are flushed right before unpinning even though
iopt->noncoherent_domain_cnt might have gone back to 0 by that time.
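
To make the above concrete, here is a toy sketch of the two touch points; the
types and helpers are simplified stand-ins rather than the real iommufd
structures:

struct toy_iopt {
        unsigned int noncoherent_domain_cnt;    /* ++/-- on hwpt attach/detach */
};

struct toy_pages {
        bool cache_flush_required;              /* sticky: set once, never cleared */
};

static void toy_fill_domains(struct toy_iopt *iopt, struct toy_pages *pages)
{
        if (iopt->noncoherent_domain_cnt) {
                /* flush the PFNs here, before mapping them into any domain */
                pages->cache_flush_required = true;
        }
}

static void toy_unpin(struct toy_pages *pages)
{
        if (pages->cache_flush_required) {
                /* flush the PFNs here, right before unpin_user_pages() */
        }
}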

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 16:34                       ` Jason Gunthorpe
  2024-05-21 18:19                         ` Alex Williamson
@ 2024-05-22  3:24                         ` Yan Zhao
  2024-05-22 12:26                           ` Jason Gunthorpe
  1 sibling, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-22  3:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Vetter, Daniel, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Tue, May 21, 2024 at 01:34:00PM -0300, Jason Gunthorpe wrote:
> On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote:
> 
> > > Intel GPU weirdness should not leak into making other devices
> > > insecure/slow. If necessary Intel GPU only should get some variant
> > > override to keep no snoop working.
> > > 
> > > It would make alot of good sense if VFIO made the default to disable
> > > no-snoop via the config space.
> > 
> > We can certainly virtualize the config space no-snoop enable bit, but
> > I'm not sure what it actually accomplishes.  We'd then be relying on
> > the device to honor the bit and not have any backdoors to twiddle the
> > bit otherwise (where we know that GPUs often have multiple paths to get
> > to config space).
> 
> I'm OK with this. If devices are insecure then they need quirks in
> vfio to disclose their problems, we shouldn't punish everyone who
> followed the spec because of some bad actors.
Does that mean a malicious device that does not honor the bit could read
uninitialized host data?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 18:19                         ` Alex Williamson
  2024-05-21 18:37                           ` Jason Gunthorpe
@ 2024-05-22  3:33                           ` Yan Zhao
  1 sibling, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-22  3:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Tian, Kevin, Vetter, Daniel, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote:
> On Tue, 21 May 2024 13:34:00 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote:
 
> Intel folks might be able to comment on the performance hit relative to
> iGPU assignment of denying the device the ability to use no-snoop
> transactions (assuming the device control bit is actually honored).
I don't have direct data for iGPU assignment, but I have reference
data regarding virtio GPU.

When the backend GPU for a virtio GPU is an iGPU, following the non-coherent
path can increase performance by up to 20%+ on some platforms.

> The latency of flushing caches on touching no-snoop enable might be
> prohibitive in the latter case.  Thanks,

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  2024-05-21 16:00       ` Jason Gunthorpe
@ 2024-05-22  3:41         ` Yan Zhao
  0 siblings, 0 replies; 67+ messages in thread
From: Yan Zhao @ 2024-05-22  3:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, kvm, linux-kernel, x86, alex.williamson,
	kevin.tian, iommu, pbonzini, seanjc, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu,
	yi.l.liu, Russell King

On Tue, May 21, 2024 at 01:00:16PM -0300, Jason Gunthorpe wrote:
> On Tue, May 21, 2024 at 12:49:39PM -0300, Jason Gunthorpe wrote:
> > On Mon, May 20, 2024 at 07:07:10AM -0700, Christoph Hellwig wrote:
> > > On Tue, May 07, 2024 at 02:20:44PM +0800, Yan Zhao wrote:
> > > > Introduce and export interface arch_clean_nonsnoop_dma() to flush CPU
> > > > caches for memory involved in non-coherent DMAs (DMAs that lack CPU cache
> > > > snooping).
> > > 
> > > Err, no.  There should really be no exported cache manipulation macros,
> > > as drivers are almost guaranteed to get this wrong.  I've added
> > > Russell to the Cc list who has been extremely vocal about this at least
> > > for arm.
> > 
> > We could possibly move this under some IOMMU core API (ie flush and
> > map, unmap and flush), the iommu APIs are non-modular so this could
> > avoid the exported symbol.
> 
> Though this would be pretty difficult for unmap as we don't have the
> > pfns in the core code to flush. I don't think we have a lot of good
> options but to make iommufd & VFIO handle this directly as they have
> the list of pages to flush on the unmap side. Use a namespace?
Given we'll rename this function to arch_flush_cache_phys(), which takes a
physical address as input, and there are already clflush_cache_range() and
arch_invalidate_pmem() exported with a vaddr as input, is this export still
acceptable?
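
For context, a flush-by-physical-address helper could be layered on the
already exported vaddr helpers roughly as below. This is an illustrative
sketch only, not the actual patch, and the function name is made up:

/* Flush CPU caches for a physical range on x86 by temporarily mapping
 * each chunk and reusing the exported clflush_cache_range() helper. */
static void flush_cache_phys_range(phys_addr_t start, size_t size)
{
        phys_addr_t end = start + size;

        while (start < end) {
                unsigned int off = offset_in_page(start);
                size_t len = min_t(size_t, end - start, PAGE_SIZE - off);
                void *va = memremap(start - off, PAGE_SIZE, MEMREMAP_WB);

                if (va) {
                        clflush_cache_range(va + off, len);
                        memunmap(va);
                }
                start += len;
        }
}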

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-21 18:37                           ` Jason Gunthorpe
@ 2024-05-22  6:24                             ` Tian, Kevin
  2024-05-22 12:29                               ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-22  6:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 22, 2024 2:38 AM
> 
> On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote:
> > > I'm OK with this. If devices are insecure then they need quirks in
> > > vfio to disclose their problems, we shouldn't punish everyone who
> > > followed the spec because of some bad actors.
> > >
> > > But more broadly in a security engineered environment we can trust the
> > > no-snoop bit to work properly.
> >
> >  The spec has an interesting requirement on devices sending no-snoop
> >  transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN):
> >
> >  "Even when this bit is Set, a Function is only permitted to Set the No
> >   Snoop attribute on a transaction when it can guarantee that the
> >   address of the transaction is not stored in any cache in the system."
> >
> > I wouldn't think the function itself has such visibility and it would
> > leave the problem of reestablishing coherency to the driver, but am I
> > overlooking something that implicitly makes this safe?
> 
> I think it is just bad spec language! People are clearly using
> no-snoop on cachable memory today. The authors must have had some
> other usage in mind than what the industry actually did.

Sure, no-snoop can be used on cacheable memory, but then the driver
needs to flush the cache before triggering the no-snoop DMA so it
still meets the spec: "the address of the transaction is not stored
in any cache in the system".

But as Alex said, the function itself has no such visibility, so it's really
a guarantee made by the driver.
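
Concretely, "flush the cache before triggering the no-snoop DMA" on x86 would
look something like the sketch below; the device type, buffer names and
queue_nosnoop_dma() are made up, only clflush_cache_range() is the existing
exported helper:

static void start_nosnoop_read(struct my_dev *dev, void *buf, size_t size)
{
        /* Write back and invalidate every cache line covering the buffer */
        clflush_cache_range(buf, size);

        /* Only now let the device read the memory with no-snoop TLPs */
        queue_nosnoop_dma(dev, buf, size);
}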

> > > That is where this series is, it assumes a no-snoop transaction took
> > > place even if that is impossible, because of config space, and then
> > > does pessimistic flushes.
> >
> > So are you proposing that we can trust devices to honor the
> > PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to
> zero
> > on IOMMUs that do not enforce coherency as the entire solution?
> 
> Maybe not entire, but as an additional step to reduce the cost of
> this. ARM would like this for instance.

I searched for PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not
touched by the i915 driver, sort of suggesting that the Intel GPU doesn't
follow the spec to honor that bit...

> 
> > Or maybe we trap on setting the bit to make the flushing less
> > pessimistic?
> 
> Also a good idea. The VMM could then decide on policy.
> 

On Intel platforms there is no pessimistic flush. Only Intel GPUs are
exempted from IOMMU force snoop (either because the IOMMU dedicated to the
GPU lacks the capability, or because of a special flag bit
<REQ_WO_PASID_PGSNP_NOTALLOWED> in the ACPI structure for the IOMMU hosting
many devices), and so only they require the additional flushes in this series.

We just need to avoid such flushes on other platforms, e.g. ARM.

I'm fine to do a special check in the attach path to enable the flush
only for Intel GPU.

Or alternatively, could the ARM SMMU driver implement
@enforce_cache_coherency by disabling the PCI no-snoop cap when
the SMMU itself cannot force snoop? Then VFIO/IOMMUFD could
still check enforce_cache_coherency generally to apply the cache
flush trick... 😊



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22  3:17                         ` Yan Zhao
@ 2024-05-22  6:29                           ` Yan Zhao
  2024-05-22 17:01                             ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Yan Zhao @ 2024-05-22  6:29 UTC (permalink / raw)
  To: Jason Gunthorpe, kvm, linux-kernel, x86, alex.williamson,
	kevin.tian, iommu, pbonzini, seanjc, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu,
	yi.l.liu

> > If you want to minimize flushes then you can't store flush
> > minimization information in the pages because it isn't global to the
> > pages and will not be accurate enough.
> > 
> > > > If pfn_reader_fill_span() does batch_from_domain() and
> > > > the source domain's storage_domain is non-coherent then you can skip
> > > > the flush. This is not pedantically perfect in skipping all flushes, but
> > > > in practice it is probably good enough.
> > 
> > > We don't know whether the source storage_domain is non-coherent since
> > > area->storage_domain is of "struct iommu_domain".
> >  
> > > Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
> > > and set this flag along side setting storage_domain?
> > 
> > Sure, that could work.
> When the storage_domain is set in iopt_area_fill_domains(),
>     "area->storage_domain = xa_load(&area->iopt->domains, 0);"
> is there a convenient way to know the storage_domain is non-coherent?
Also asking about the case when the storage_domain is switched to an arbitrary
remaining domain in iopt_unfill_domain().

And in iopt_area_unfill_domains(), after iopt_area_unmap_domain_range()
of a non-coherent domain which is not the storage domain, how can we know that
the domain is non-coherent?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22  3:24                         ` Yan Zhao
@ 2024-05-22 12:26                           ` Jason Gunthorpe
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 12:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Alex Williamson, Tian, Kevin, Vetter, Daniel, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Wed, May 22, 2024 at 11:24:20AM +0800, Yan Zhao wrote:
> On Tue, May 21, 2024 at 01:34:00PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote:
> > 
> > > > Intel GPU weirdness should not leak into making other devices
> > > > insecure/slow. If necessary Intel GPU only should get some variant
> > > > override to keep no snoop working.
> > > > 
> > > > It would make alot of good sense if VFIO made the default to disable
> > > > no-snoop via the config space.
> > > 
> > > We can certainly virtualize the config space no-snoop enable bit, but
> > > I'm not sure what it actually accomplishes.  We'd then be relying on
> > > the device to honor the bit and not have any backdoors to twiddle the
> > > bit otherwise (where we know that GPUs often have multiple paths to get
> > > to config space).
> > 
> > I'm OK with this. If devices are insecure then they need quirks in
> > vfio to disclose their problems, we shouldn't punish everyone who
> > followed the spec because of some bad actors.
> Does that mean a malicious device that does not honor the bit could read
> uninitialized host data?

Yes, but a malicious device could also just do DMA with the PF RID and
break everything. VFIO substantially trusts the device already, I'm
not sure trusting it to do no-snoop blocking is a big reach.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22  6:24                             ` Tian, Kevin
@ 2024-05-22 12:29                               ` Jason Gunthorpe
  2024-05-22 14:43                                 ` Alex Williamson
  2024-05-22 23:26                                 ` Tian, Kevin
  0 siblings, 2 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 12:29 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 22, 2024 2:38 AM
> > 
> > On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote:
> > > > I'm OK with this. If devices are insecure then they need quirks in
> > > > vfio to disclose their problems, we shouldn't punish everyone who
> > > > followed the spec because of some bad actors.
> > > >
> > > > But more broadly in a security engineered environment we can trust the
> > > > no-snoop bit to work properly.
> > >
> > >  The spec has an interesting requirement on devices sending no-snoop
> > >  transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN):
> > >
> > >  "Even when this bit is Set, a Function is only permitted to Set the No
> > >   Snoop attribute on a transaction when it can guarantee that the
> > >   address of the transaction is not stored in any cache in the system."
> > >
> > > I wouldn't think the function itself has such visibility and it would
> > > leave the problem of reestablishing coherency to the driver, but am I
> > > overlooking something that implicitly makes this safe?
> > 
> > I think it is just bad spec language! People are clearly using
> > no-snoop on cachable memory today. The authors must have had some
> > other usage in mind than what the industry actually did.
> 
> sure no-snoop can be used on cacheable memory but then the driver
> needs to flush the cache before triggering the no-snoop DMA so it
> still meets the spec "the address of the transaction is not stored
> in any cache in the system".

Flush does not mean evict. The way I read the above, it is trying to
say the driver must map all the memory non-cacheable to ensure it never
gets pulled into a cache in the first place.
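
Under that reading, a driver would have to do something like the sketch below
to keep a buffer out of the cacheable domain entirely (x86, page-aligned and
linearly-mapped buffer assumed; 'buf' and 'nr_pages' are made-up names, and
the caller must pair this with set_memory_wb() before freeing the pages):

static int make_nosnoop_buffer_uncached(void *buf, int nr_pages)
{
        /* Change the kernel mapping to UC so lines are never cached */
        return set_memory_uc((unsigned long)buf, nr_pages);
}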

> > Maybe not entire, but as an additional step to reduce the cost of
> > this. ARM would like this for instance.
> 
> I searched PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not
> touched by i915 driver. sort of suggesting that Intel GPU doesn't follow
> the spec to honor that bit...

Or the BIOS turns it on and the OS just leaves it.

> I'm fine to do a special check in the attach path to enable the flush
> only for Intel GPU.

We effectively do this already by checking the domain
capabilities. Only the Intel GPU will have a non-coherent domain.

> or alternatively could ARM SMMU driver implement
> @enforce_cache_coherency by disabling PCI nosnoop cap when
> the SMMU itself cannot force snoop? Then VFIO/IOMMUFD could
> still check enforce_cache_coherency generally to apply the cache
> flush trick... 😊

I like this a lot less than having vfio understand it.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 12:29                               ` Jason Gunthorpe
@ 2024-05-22 14:43                                 ` Alex Williamson
  2024-05-22 16:52                                   ` Jason Gunthorpe
  2024-05-22 23:26                                 ` Tian, Kevin
  1 sibling, 1 reply; 67+ messages in thread
From: Alex Williamson @ 2024-05-22 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Wed, 22 May 2024 09:29:39 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, May 22, 2024 2:38 AM
> > > 
> > > On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote:  
> > > > > I'm OK with this. If devices are insecure then they need quirks in
> > > > > vfio to disclose their problems, we shouldn't punish everyone who
> > > > > followed the spec because of some bad actors.
> > > > >
> > > > > But more broadly in a security engineered environment we can trust the
> > > > > no-snoop bit to work properly.  
> > > >
> > > >  The spec has an interesting requirement on devices sending no-snoop
> > > >  transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN):
> > > >
> > > >  "Even when this bit is Set, a Function is only permitted to Set the No
> > > >   Snoop attribute on a transaction when it can guarantee that the
> > > >   address of the transaction is not stored in any cache in the system."
> > > >
> > > > I wouldn't think the function itself has such visibility and it would
> > > > leave the problem of reestablishing coherency to the driver, but am I
> > > > overlooking something that implicitly makes this safe?  
> > > 
> > > I think it is just bad spec language! People are clearly using
> > > no-snoop on cachable memory today. The authors must have had some
> > > other usage in mind than what the industry actually did.  
> > 
> > sure no-snoop can be used on cacheable memory but then the driver
> > needs to flush the cache before triggering the no-snoop DMA so it
> > still meets the spec "the address of the transaction is not stored
> > in any cache in the system".  
> 
> Flush does not mean evict.. The way I read the above it is trying to
> say the driver must map all the memory non-cachable to ensure it never
> gets pulled into a cache in the first place.

I think we should probably just fall back to your previous
interpretation: it's bad spec language.  It may not be possible to map
the memory uncacheable; it's a driver issue to sync the DMA as needed
for coherency.

> > > Maybe not entire, but as an additional step to reduce the cost of
> > > this. ARM would like this for instance.  
> > 
> > I searched PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not
> > touched by i915 driver. sort of suggesting that Intel GPU doesn't follow
> > the spec to honor that bit...  
> 
> Or the BIOS turns it on and the OS just leaves it..

This is kind of an unusual feature in that sense: the default value of
PCI_EXP_DEVCTL_NOSNOOP_EN is enabled.  It therefore might make sense
that the i915 driver assumes it can do no-snoop.  The interesting
case would be whether it still does no-snoop if that bit were cleared prior
to the driver binding or while the device is running.

But I think this also means that regardless of virtualizing
PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device
resets where a device could legitimately perform no-snoop transactions.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 14:43                                 ` Alex Williamson
@ 2024-05-22 16:52                                   ` Jason Gunthorpe
  2024-05-22 18:22                                     ` Alex Williamson
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 16:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Wed, May 22, 2024 at 08:43:18AM -0600, Alex Williamson wrote:

> But I think this also means that regardless of virtualizing
> PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device
> resets where a device could legitimately perform no-snoop
> transactions.

Isn't memory enable turned off after FLR? If not, do we have to turn it
off before doing FLR?

I'm not sure how a no-snoop could leak out around FLR?

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 5/5] iommufd: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22  6:29                           ` Yan Zhao
@ 2024-05-22 17:01                             ` Jason Gunthorpe
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 17:01 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, x86, alex.williamson, kevin.tian, iommu,
	pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo, bp,
	hpa, corbet, joro, will, robin.murphy, baolu.lu, yi.l.liu

On Wed, May 22, 2024 at 02:29:19PM +0800, Yan Zhao wrote:
> > > If you want to minimize flushes then you can't store flush
> > > minimization information in the pages because it isn't global to the
> > > pages and will not be accurate enough.
> > > 
> > > > > If pfn_reader_fill_span() does batch_from_domain() and
> > > > > the source domain's storage_domain is non-coherent then you can skip
> > > > > the flush. This is not pedantically perfect in skipping all flushes, but
> > > > > in practice it is probably good enough.
> > > 
> > > > We don't know whether the source storage_domain is non-coherent since
> > > > area->storage_domain is of "struct iommu_domain".
> > >  
> > > > Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
> > > > and set this flag along side setting storage_domain?
> > > 
> > > Sure, that could work.
> > When the storage_domain is set in iopt_area_fill_domains(),
> >     "area->storage_domain = xa_load(&area->iopt->domains, 0);"
> > is there a convenient way to know the storage_domain is non-coherent?
> Also asking for when storage_domain is switching to an arbitrary remaining domain
> in iopt_unfill_domain().
> 
> And in iopt_area_unfill_domains(), after iopt_area_unmap_domain_range()
> of a non-coherent domain which is not the storage domain, how can we know that
> the domain is non-coherent?

Yes, it would have to keep track of hwpts in more cases, unfortunately
:(

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 16:52                                   ` Jason Gunthorpe
@ 2024-05-22 18:22                                     ` Alex Williamson
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Williamson @ 2024-05-22 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel, x86,
	iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx, mingo,
	bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu, Yi L

On Wed, 22 May 2024 13:52:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, May 22, 2024 at 08:43:18AM -0600, Alex Williamson wrote:
> 
> > But I think this also means that regardless of virtualizing
> > PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device
> > resets where a device could legitimately perform no-snoop
> > transactions.  
> 
> Isn't memory enable turned off after FLR? If not do we have to make it
> off before doing FLR?
> 
> I'm not sure how a no-snoop could leak out around FLR?

Good point, modulo s/memory/bus master/.  Yes, we'd likely need to make
sure we enter pci_reset_function() with BM disabled so that we don't
have an ordering issue between restoring the PCIe capability and the
command register.  Likewise no-snoop handling would need to avoid gaps
around backdoor resets like we try to do when we're masking INTx
support on the device (vfio_bar_restore).  Thanks,

Alex
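
An illustrative (untested) ordering for the reset path described above, using
only existing kernel helpers:

static int reset_with_dma_quiesced(struct pci_dev *pdev)
{
        /* Clear PCI_COMMAND_MASTER so no (no-snoop) DMA can be issued
         * while config space is restored after the reset. */
        pci_clear_master(pdev);

        return pci_try_reset_function(pdev);    /* FLR or fallback reset */
}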


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 12:29                               ` Jason Gunthorpe
  2024-05-22 14:43                                 ` Alex Williamson
@ 2024-05-22 23:26                                 ` Tian, Kevin
  2024-05-22 23:32                                   ` Jason Gunthorpe
  1 sibling, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-22 23:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, May 22, 2024 8:30 PM
> 
> On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > I'm fine to do a special check in the attach path to enable the flush
> > only for Intel GPU.
> 
> We already effectively do this already by checking the domain
> capabilities. Only the Intel GPU will have a non-coherent domain.
> 

I'm confused. In earlier discussions you wanted to find a way to not
punish others due to the check of non-coherent domain, e.g. some
ARM SMMU cannot force snoop.

Then you and Alex discussed the possibility of reducing pessimistic
flushes by virtualizing the PCI NOSNOOP bit.

With that in mind I was thinking whether we explicitly enable this
flush only for Intel GPU instead of checking non-coherent domain
in the attach path, since it's the only device with such requirement.

Did I misunderstand the concern here?


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 23:26                                 ` Tian, Kevin
@ 2024-05-22 23:32                                   ` Jason Gunthorpe
  2024-05-22 23:40                                     ` Tian, Kevin
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 23:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, May 22, 2024 8:30 PM
> > 
> > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > > I'm fine to do a special check in the attach path to enable the flush
> > > only for Intel GPU.
> > 
> > We already effectively do this already by checking the domain
> > capabilities. Only the Intel GPU will have a non-coherent domain.
> > 
> 
> I'm confused. In earlier discussions you wanted to find a way to not
> punish others due to the check of non-coherent domain, e.g. some
> ARM SMMU cannot force snoop.
> 
> Then you and Alex discussed the possibility of reducing pessimistic
> flushes by virtualizing the PCI NOSNOOP bit.
> 
> With that in mind I was thinking whether we explicitly enable this
> flush only for Intel GPU instead of checking non-coherent domain
> in the attach path, since it's the only device with such requirement.

I am suggesting to do both checks:
 - If the iommu domain indicates it has force coherency then leave PCI
   no-snoop alone and no flush
 - If the PCI NOSNOOP bit is or can be 0 then no flush
 - Otherwise flush

I'm not sure there is a good reason to ignore the data we get from the
iommu domain that it enforces coherency?
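
A minimal sketch of that ordering; domain->enforce_cache_coherency stands for
the flag type1 already tracks per vfio_domain, while device_may_use_nosnoop()
is a hypothetical helper standing in for "the PCI NOSNOOP bit is or can be 0":

static bool vfio_needs_cache_flush(struct vfio_domain *domain,
                                   struct vfio_pci_core_device *vdev)
{
        /* 1) The IOMMU forces snooping: nothing can bypass the caches */
        if (domain->enforce_cache_coherency)
                return false;

        /* 2) The device cannot, or is not allowed to, issue no-snoop TLPs */
        if (!device_may_use_nosnoop(vdev))
                return false;

        /* 3) Otherwise be pessimistic and flush */
        return true;
}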

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 23:32                                   ` Jason Gunthorpe
@ 2024-05-22 23:40                                     ` Tian, Kevin
  2024-05-23 14:58                                       ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2024-05-22 23:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 23, 2024 7:32 AM
> 
> On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, May 22, 2024 8:30 PM
> > >
> > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > > > I'm fine to do a special check in the attach path to enable the flush
> > > > only for Intel GPU.
> > >
> > > We already effectively do this already by checking the domain
> > > capabilities. Only the Intel GPU will have a non-coherent domain.
> > >
> >
> > I'm confused. In earlier discussions you wanted to find a way to not
> > > punish others due to the check of non-coherent domain, e.g. some
> > ARM SMMU cannot force snoop.
> >
> > Then you and Alex discussed the possibility of reducing pessimistic
> > flushes by virtualizing the PCI NOSNOOP bit.
> >
> > With that in mind I was thinking whether we explicitly enable this
> > flush only for Intel GPU instead of checking non-coherent domain
> > in the attach path, since it's the only device with such requirement.
> 
> I am suggesting to do both checks:
>  - If the iommu domain indicates it has force coherency then leave PCI
>    no-snoop alone and no flush
>  - If the PCI NOSNOOP bit is or can be 0 then no flush
>  - Otherwise flush

How do we judge whether PCI NOSNOOP can be 0? Following the PCI spec
it can always be set to 0, but then we break the requirement for the Intel
GPU. If we explicitly exempt the Intel GPU in the 2nd check, then what's
the value of doing that generic check?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  2024-05-22 23:40                                     ` Tian, Kevin
@ 2024-05-23 14:58                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2024-05-23 14:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Vetter, Daniel, Zhao, Yan Y, kvm, linux-kernel,
	x86, iommu, pbonzini, seanjc, dave.hansen, luto, peterz, tglx,
	mingo, bp, hpa, corbet, joro, will, robin.murphy, baolu.lu, Liu,
	Yi L

On Wed, May 22, 2024 at 11:40:58PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, May 23, 2024 7:32 AM
> > 
> > On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, May 22, 2024 8:30 PM
> > > >
> > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote:
> > > > > I'm fine to do a special check in the attach path to enable the flush
> > > > > only for Intel GPU.
> > > >
> > > > We already effectively do this already by checking the domain
> > > > capabilities. Only the Intel GPU will have a non-coherent domain.
> > > >
> > >
> > > I'm confused. In earlier discussions you wanted to find a way to not
> > > publish others due to the check of non-coherent domain, e.g. some
> > > ARM SMMU cannot force snoop.
> > >
> > > Then you and Alex discussed the possibility of reducing pessimistic
> > > flushes by virtualizing the PCI NOSNOOP bit.
> > >
> > > With that in mind I was thinking whether we explicitly enable this
> > > flush only for Intel GPU instead of checking non-coherent domain
> > > in the attach path, since it's the only device with such requirement.
> > 
> > I am suggesting to do both checks:
> >  - If the iommu domain indicates it has force coherency then leave PCI
> >    no-snoop alone and no flush
> >  - If the PCI NOSNOOP bit is or can be 0 then no flush
> >  - Otherwise flush
> 
> How to judge whether PCI NOSNOOP can be 0? If following PCI spec
> it can always be set to 0 but then we break the requirement for Intel
> GPU. If we explicitly exempt Intel GPU in 2nd check  then what'd be
> the value of doing that generic check?

Non-PCI environments still have this problem, and the first check does
help them since we don't have PCI config space there.

PCI can supply more information (no-snoop impossible), and variant
drivers can add information too (want no-snoop).

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2024-05-23 14:58 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-07  6:18 [PATCH 0/5] Enforce CPU cache flush for non-coherent device assignment Yan Zhao
2024-05-07  6:19 ` [PATCH 1/5] x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT range Yan Zhao
2024-05-07  8:26   ` Tian, Kevin
2024-05-07  9:12     ` Yan Zhao
2024-05-08 22:14       ` Alex Williamson
2024-05-09  3:36         ` Yan Zhao
2024-05-16  7:42       ` Tian, Kevin
2024-05-16 14:07         ` Sean Christopherson
2024-05-20  2:36           ` Tian, Kevin
2024-05-07  6:20 ` [PATCH 2/5] KVM: x86/mmu: Fine-grained check of whether a invalid & RAM PFN is MMIO Yan Zhao
2024-05-07  8:39   ` Tian, Kevin
2024-05-07  9:19     ` Yan Zhao
2024-05-07  6:20 ` [PATCH 3/5] x86/mm: Introduce and export interface arch_clean_nonsnoop_dma() Yan Zhao
2024-05-07  8:51   ` Tian, Kevin
2024-05-07  9:40     ` Yan Zhao
2024-05-20 14:07   ` Christoph Hellwig
2024-05-21 15:49     ` Jason Gunthorpe
2024-05-21 16:00       ` Jason Gunthorpe
2024-05-22  3:41         ` Yan Zhao
2024-05-07  6:21 ` [PATCH 4/5] vfio/type1: Flush CPU caches on DMA pages in non-coherent domains Yan Zhao
2024-05-09 18:10   ` Alex Williamson
2024-05-10 10:31     ` Yan Zhao
2024-05-10 16:57       ` Alex Williamson
2024-05-13  7:11         ` Yan Zhao
2024-05-16  7:53           ` Tian, Kevin
2024-05-16  8:34           ` Tian, Kevin
2024-05-16 20:31             ` Alex Williamson
2024-05-17 17:11               ` Jason Gunthorpe
2024-05-20  2:52                 ` Tian, Kevin
2024-05-21 16:07                   ` Jason Gunthorpe
2024-05-21 16:21                     ` Alex Williamson
2024-05-21 16:34                       ` Jason Gunthorpe
2024-05-21 18:19                         ` Alex Williamson
2024-05-21 18:37                           ` Jason Gunthorpe
2024-05-22  6:24                             ` Tian, Kevin
2024-05-22 12:29                               ` Jason Gunthorpe
2024-05-22 14:43                                 ` Alex Williamson
2024-05-22 16:52                                   ` Jason Gunthorpe
2024-05-22 18:22                                     ` Alex Williamson
2024-05-22 23:26                                 ` Tian, Kevin
2024-05-22 23:32                                   ` Jason Gunthorpe
2024-05-22 23:40                                     ` Tian, Kevin
2024-05-23 14:58                                       ` Jason Gunthorpe
2024-05-22  3:33                           ` Yan Zhao
2024-05-22  3:24                         ` Yan Zhao
2024-05-22 12:26                           ` Jason Gunthorpe
2024-05-16 20:50           ` Alex Williamson
2024-05-17  3:11             ` Yan Zhao
2024-05-17  4:44               ` Alex Williamson
2024-05-17  5:00                 ` Yan Zhao
2024-05-07  6:22 ` [PATCH 5/5] iommufd: " Yan Zhao
2024-05-09 14:13   ` Jason Gunthorpe
2024-05-10  8:03     ` Yan Zhao
2024-05-10 13:29       ` Jason Gunthorpe
2024-05-13  7:43         ` Yan Zhao
2024-05-14 15:11           ` Jason Gunthorpe
2024-05-15  7:06             ` Yan Zhao
2024-05-15 20:43               ` Jason Gunthorpe
2024-05-16  2:32                 ` Yan Zhao
2024-05-16  8:38                   ` Tian, Kevin
2024-05-16  9:48                     ` Yan Zhao
2024-05-17 17:04                   ` Jason Gunthorpe
2024-05-20  2:45                     ` Yan Zhao
2024-05-21 16:04                       ` Jason Gunthorpe
2024-05-22  3:17                         ` Yan Zhao
2024-05-22  6:29                           ` Yan Zhao
2024-05-22 17:01                             ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).