From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Paul Durrant <paul@xen.org>, Wei Liu <wl@xen.org>
Subject: Re: [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables
Date: Fri, 6 May 2022 13:16:09 +0200	[thread overview]
Message-ID: <YnUDeR5feSsmbCVF@Air-de-Roger> (raw)
In-Reply-To: <9d073a05-0c7d-4989-7a38-93cd5b01d071@suse.com>

On Mon, Apr 25, 2022 at 10:40:55AM +0200, Jan Beulich wrote:
> Page tables are used for two purposes after allocation: They either
> start out all empty, or they get filled to replace a superpage.
> Subsequently, to replace all empty or fully contiguous page tables,
> contiguous sub-regions will be recorded within individual page tables.
> Install the initial set of markers immediately after allocation. Make
> sure to retain these markers when further populating a page table in
> preparation for it to replace a superpage.
> 
> The markers are simply 4-bit fields holding the order value of
> contiguous entries. To demonstrate this, if a page table had just 16
> entries, this would be the initial (fully contiguous) set of markers:
> 
> index  0 1 2 3 4 5 6 7 8 9 A B C D E F
> marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
> 
> "Contiguous" here means not only present entries with successively
> increasing MFNs, each one suitably aligned for its slot, but also a
> respective number of all non-present entries.
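
Just to make the scheme above concrete for myself: slot 0 carries the
order of the whole table, and every other slot carries the number of
trailing zero bits of its index.  A minimal, illustrative-only sketch
(initial_marker() is a made-up name, not the helper the patch adds in
pt-contig-markers.h; find_first_set_bit() is the existing Xen helper
also used further down):

    static unsigned int initial_marker(unsigned int idx, unsigned int order)
    {
        /* Slot 0 covers the whole table; any other slot covers 2^ctz(idx)
         * entries, up to (but excluding) the next better-aligned slot. */
        return idx ? find_first_set_bit(idx) : order;
    }

With order == 4 this reproduces the 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0
pattern shown above; real page tables would use order == PAGETABLE_ORDER.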
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> An alternative to the ASSERT()s added to set_iommu_ptes_present() would
> be to make the function less general-purpose; it's used in a single
> place only after all (i.e. it might as well be folded into its only
> caller).

I think it would be good to add a comment stating that the function
requires the PDE to be empty.  Also, given the current usage, we could
drop the nr_ptes parameter and just name the function fill_pde() or
similar.
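
Something along these lines is what I have in mind - a rough, untested
sketch only, with the constant and field names taken from the existing
code as far as I can tell:

    /* Hypothetical fill_pde(): fill an entire, previously empty page
     * table in preparation for it to replace a superpage. */
    static void fill_pde(union amd_iommu_pte *table, unsigned long first_mfn,
                         unsigned long page_sz, bool iw, bool ir)
    {
        unsigned int i;

        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i, first_mfn += page_sz )
        {
            union amd_iommu_pte *pde = &table[i];

            ASSERT(!pde->pr);            /* requires an empty PDE */

            pde->iw = iw;
            pde->ir = ir;
            pde->fc = true;              /* See set_iommu_pde_present(). */
            pde->mfn = first_mfn;
            pde->pr = true;
        }
    }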

> 
> While in VT-d's comment ahead of struct dma_pte I'm adjusting the
> description of the high bits, I'd like to note that the description of
> some of the lower bits isn't correct either. Yet I don't think adjusting
> that belongs here.
> ---
> v4: Add another comment referring to pt-contig-markers.h. Re-base.
> v3: Add comments. Re-base.
> v2: New.
> 
> --- a/xen/arch/x86/include/asm/iommu.h
> +++ b/xen/arch/x86/include/asm/iommu.h
> @@ -146,7 +146,8 @@ void iommu_free_domid(domid_t domid, uns
>  
>  int __must_check iommu_free_pgtables(struct domain *d);
>  struct domain_iommu;
> -struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
> +struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd,
> +                                                   uint64_t contig_mask);
>  void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
>  
>  #endif /* !__ARCH_X86_IOMMU_H__ */
> --- a/xen/drivers/passthrough/amd/iommu-defs.h
> +++ b/xen/drivers/passthrough/amd/iommu-defs.h
> @@ -446,11 +446,13 @@ union amd_iommu_x2apic_control {
>  #define IOMMU_PAGE_TABLE_U32_PER_ENTRY	(IOMMU_PAGE_TABLE_ENTRY_SIZE / 4)
>  #define IOMMU_PAGE_TABLE_ALIGNMENT	4096
>  
> +#define IOMMU_PTE_CONTIG_MASK           0x1e /* The ign0 field below. */
> +
>  union amd_iommu_pte {
>      uint64_t raw;
>      struct {
>          bool pr:1;
> -        unsigned int ign0:4;
> +        unsigned int ign0:4; /* Covered by IOMMU_PTE_CONTIG_MASK. */
>          bool a:1;
>          bool d:1;
>          unsigned int ign1:2;
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -115,7 +115,19 @@ static void set_iommu_ptes_present(unsig
>  
>      while ( nr_ptes-- )
>      {
> -        set_iommu_pde_present(pde, next_mfn, 0, iw, ir);
> +        ASSERT(!pde->next_level);
> +        ASSERT(!pde->u);
> +
> +        if ( pde > table )
> +            ASSERT(pde->ign0 == find_first_set_bit(pde - table));
> +        else
> +            ASSERT(pde->ign0 == PAGE_SHIFT - 3);

I think PAGETABLE_ORDER would be clearer here.

While here, could you also assert that next_mfn matches the contiguous
order currently set in the PTE?
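
I.e. roughly something like (untested, and I may well be getting the
exact alignment condition wrong):

            ASSERT(!(next_mfn & ((page_sz << pde->ign0) - 1)));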

> +
> +        pde->iw = iw;
> +        pde->ir = ir;
> +        pde->fc = true; /* See set_iommu_pde_present(). */
> +        pde->mfn = next_mfn;
> +        pde->pr = true;
>  
>          ++pde;
>          next_mfn += page_sz;
> @@ -295,7 +307,7 @@ static int iommu_pde_from_dfn(struct dom
>              mfn = next_table_mfn;
>  
>              /* allocate lower level page table */
> -            table = iommu_alloc_pgtable(hd);
> +            table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
>              if ( table == NULL )
>              {
>                  AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
> @@ -325,7 +337,7 @@ static int iommu_pde_from_dfn(struct dom
>  
>              if ( next_table_mfn == 0 )
>              {
> -                table = iommu_alloc_pgtable(hd);
> +                table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
>                  if ( table == NULL )
>                  {
>                      AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
> @@ -717,7 +729,7 @@ static int fill_qpt(union amd_iommu_pte
>                   * page table pages, and the resulting allocations are always
>                   * zeroed.
>                   */
> -                pgs[level] = iommu_alloc_pgtable(hd);
> +                pgs[level] = iommu_alloc_pgtable(hd, 0);

Is it worth not setting up the contiguous data for quarantine page
tables?

I think it's fine now given the current code, but the ASSERTs you've
added in set_iommu_ptes_present() checking that the contig data is
correct make me wonder whether we could trigger them in the future.

I understand that the contig data is not helpful for quarantine page
tables, but it still doesn't seem bad to have it, just for consistency.

>                  if ( !pgs[level] )
>                  {
>                      rc = -ENOMEM;
> @@ -775,7 +787,7 @@ int cf_check amd_iommu_quarantine_init(s
>          return 0;
>      }
>  
> -    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
> +    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
>      if ( !pdev->arch.amd.root_table )
>          return -ENOMEM;
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -342,7 +342,7 @@ int amd_iommu_alloc_root(struct domain *
>  
>      if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
>      {
> -        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
> +        hd->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
>          if ( !hd->arch.amd.root_table )
>              return -ENOMEM;
>      }
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -334,7 +334,7 @@ static uint64_t addr_to_dma_page_maddr(s
>              goto out;
>  
>          pte_maddr = level;
> -        if ( !(pg = iommu_alloc_pgtable(hd)) )
> +        if ( !(pg = iommu_alloc_pgtable(hd, 0)) )
>              goto out;
>  
>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
> @@ -376,7 +376,7 @@ static uint64_t addr_to_dma_page_maddr(s
>              }
>  
>              pte_maddr = level - 1;
> -            pg = iommu_alloc_pgtable(hd);
> +            pg = iommu_alloc_pgtable(hd, DMA_PTE_CONTIG_MASK);
>              if ( !pg )
>                  break;
>  
> @@ -388,12 +388,13 @@ static uint64_t addr_to_dma_page_maddr(s
>                  struct dma_pte *split = map_vtd_domain_page(pte_maddr);
>                  unsigned long inc = 1UL << level_to_offset_bits(level - 1);
>  
> -                split[0].val = pte->val;
> +                split[0].val |= pte->val & ~DMA_PTE_CONTIG_MASK;
>                  if ( inc == PAGE_SIZE )
>                      split[0].val &= ~DMA_PTE_SP;
>  
>                  for ( offset = 1; offset < PTE_NUM; ++offset )
> -                    split[offset].val = split[offset - 1].val + inc;
> +                    split[offset].val |=
> +                        (split[offset - 1].val & ~DMA_PTE_CONTIG_MASK) + inc;
>  
>                  iommu_sync_cache(split, PAGE_SIZE);
>                  unmap_vtd_domain_page(split);
> @@ -2173,7 +2174,7 @@ static int __must_check cf_check intel_i
>      if ( iommu_snoop )
>          dma_set_pte_snp(new);
>  
> -    if ( old.val == new.val )
> +    if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          unmap_vtd_domain_page(page);
> @@ -3052,7 +3053,7 @@ static int fill_qpt(struct dma_pte *this
>                   * page table pages, and the resulting allocations are always
>                   * zeroed.
>                   */
> -                pgs[level] = iommu_alloc_pgtable(hd);
> +                pgs[level] = iommu_alloc_pgtable(hd, 0);
>                  if ( !pgs[level] )
>                  {
>                      rc = -ENOMEM;
> @@ -3109,7 +3110,7 @@ static int cf_check intel_iommu_quaranti
>      if ( !drhd )
>          return -ENODEV;
>  
> -    pg = iommu_alloc_pgtable(hd);
> +    pg = iommu_alloc_pgtable(hd, 0);
>      if ( !pg )
>          return -ENOMEM;
>  
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -253,7 +253,10 @@ struct context_entry {
>   * 2-6: reserved
>   * 7: super page
>   * 8-11: available
> - * 12-63: Host physcial address
> + * 12-51: Host physcial address
> + * 52-61: available (52-55 used for DMA_PTE_CONTIG_MASK)
> + * 62: reserved
> + * 63: available
>   */
>  struct dma_pte {
>      u64 val;
> @@ -263,6 +266,7 @@ struct dma_pte {
>  #define DMA_PTE_PROT (DMA_PTE_READ | DMA_PTE_WRITE)
>  #define DMA_PTE_SP   (1 << 7)
>  #define DMA_PTE_SNP  (1 << 11)
> +#define DMA_PTE_CONTIG_MASK  (0xfull << PADDR_BITS)
>  #define dma_clear_pte(p)    do {(p).val = 0;} while(0)
>  #define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while(0)
>  #define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while(0)
> @@ -276,7 +280,7 @@ struct dma_pte {
>  #define dma_pte_write(p) (dma_pte_prot(p) & DMA_PTE_WRITE)
>  #define dma_pte_addr(p) ((p).val & PADDR_MASK & PAGE_MASK_4K)
>  #define dma_set_pte_addr(p, addr) do {\
> -            (p).val |= ((addr) & PAGE_MASK_4K); } while (0)
> +            (p).val |= ((addr) & PADDR_MASK & PAGE_MASK_4K); } while (0)

While I'm not opposed to this, I would assume that addr is not
expected to contain bits cleared by PADDR_MASK? (or by PAGE_MASK_4K,
FWIW)

Or else callers are really messed up.

>  #define dma_pte_present(p) (((p).val & DMA_PTE_PROT) != 0)
>  #define dma_pte_superpage(p) (((p).val & DMA_PTE_SP) != 0)
>  
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -522,11 +522,12 @@ int iommu_free_pgtables(struct domain *d
>      return 0;
>  }
>  
> -struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
> +struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
> +                                      uint64_t contig_mask)
>  {
>      unsigned int memflags = 0;
>      struct page_info *pg;
> -    void *p;
> +    uint64_t *p;
>  
>  #ifdef CONFIG_NUMA
>      if ( hd->node != NUMA_NO_NODE )
> @@ -538,7 +539,29 @@ struct page_info *iommu_alloc_pgtable(st
>          return NULL;
>  
>      p = __map_domain_page(pg);
> -    clear_page(p);
> +
> +    if ( contig_mask )
> +    {
> +        /* See pt-contig-markers.h for a description of the marker scheme. */
> +        unsigned int i, shift = find_first_set_bit(contig_mask);
> +
> +        ASSERT(((PAGE_SHIFT - 3) & (contig_mask >> shift)) == PAGE_SHIFT - 3);

I think it might be clearer to use PAGETABLE_ORDER rather than
PAGE_SHIFT - 3.
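
I.e. (assuming PAGETABLE_ORDER is indeed what PAGE_SHIFT - 3 stands
for here):

        ASSERT((PAGETABLE_ORDER & (contig_mask >> shift)) == PAGETABLE_ORDER);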

Thanks, Roger.


Thread overview: 106+ messages
2022-04-25  8:29 [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
2022-04-25  8:30 ` [PATCH v4 01/21] AMD/IOMMU: correct potentially-UB shifts Jan Beulich
2022-04-27 13:08   ` Andrew Cooper
2022-04-27 13:57     ` Jan Beulich
2022-05-03 10:10   ` Roger Pau Monné
2022-05-03 14:34     ` Jan Beulich
2022-04-25  8:32 ` [PATCH v4 02/21] IOMMU: simplify unmap-on-error in iommu_map() Jan Beulich
2022-04-27 13:16   ` Andrew Cooper
2022-04-27 14:05     ` Jan Beulich
2022-05-03 10:25   ` Roger Pau Monné
2022-05-03 14:37     ` Jan Beulich
2022-05-03 16:22       ` Roger Pau Monné
2022-04-25  8:32 ` [PATCH v4 03/21] IOMMU: add order parameter to ->{,un}map_page() hooks Jan Beulich
2022-04-25  8:33 ` [PATCH v4 04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks Jan Beulich
2022-05-03 12:37   ` Roger Pau Monné
2022-05-03 14:44     ` Jan Beulich
2022-05-04 10:20       ` Roger Pau Monné
2022-04-25  8:34 ` [PATCH v4 05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 Jan Beulich
2022-05-03 13:00   ` Roger Pau Monné
2022-05-03 14:50     ` Jan Beulich
2022-05-04  9:32       ` Jan Beulich
2022-05-04 10:30         ` Roger Pau Monné
2022-05-04 10:51           ` Jan Beulich
2022-05-04 12:01             ` Roger Pau Monné
2022-05-04 12:12               ` Jan Beulich
2022-05-04 13:00                 ` Roger Pau Monné
2022-05-04 13:19                   ` Jan Beulich
2022-05-04 13:46                     ` Roger Pau Monné
2022-05-04 13:55                       ` Jan Beulich
2022-05-04 15:22                         ` Roger Pau Monné
2022-04-25  8:34 ` [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches Jan Beulich
2022-05-03 14:49   ` Roger Pau Monné
2022-05-04  9:46     ` Jan Beulich
2022-05-04 11:20       ` Roger Pau Monné
2022-05-04 12:27         ` Jan Beulich
2022-05-04 13:55           ` Roger Pau Monné
2022-05-04 14:26             ` Jan Beulich
2022-04-25  8:35 ` [PATCH v4 07/21] IOMMU/x86: support freeing of pagetables Jan Beulich
2022-05-03 16:20   ` Roger Pau Monné
2022-05-04 13:07     ` Jan Beulich
2022-05-04 15:06       ` Roger Pau Monné
2022-05-05  8:20         ` Jan Beulich
2022-05-05  9:57           ` Roger Pau Monné
2022-04-25  8:36 ` [PATCH v4 08/21] AMD/IOMMU: walk trees upon page fault Jan Beulich
2022-05-04 15:57   ` Roger Pau Monné
2022-04-25  8:37 ` [PATCH v4 09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() Jan Beulich
2022-04-25  8:38 ` [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Jan Beulich
2022-05-05 13:19   ` Roger Pau Monné
2022-05-05 14:34     ` Jan Beulich
2022-05-05 15:26       ` Roger Pau Monné
2022-04-25  8:38 ` [PATCH v4 11/21] VT-d: " Jan Beulich
2022-05-05 16:20   ` Roger Pau Monné
2022-05-06  6:13     ` Jan Beulich
2022-04-25  8:40 ` [PATCH v4 12/21] IOMMU: fold flush-all hook into "flush one" Jan Beulich
2022-05-06  8:38   ` Roger Pau Monné
2022-05-06  9:59     ` Jan Beulich
2022-04-25  8:40 ` [PATCH v4 13/21] IOMMU/x86: prefill newly allocate page tables Jan Beulich
2022-05-06 11:16   ` Roger Pau Monné [this message]
2022-05-19 12:12     ` Jan Beulich
2022-05-20 10:47       ` Roger Pau Monné
2022-05-20 11:11         ` Jan Beulich
2022-05-20 11:13           ` Jan Beulich
2022-05-20 12:22             ` Roger Pau Monné
2022-05-20 12:36               ` Jan Beulich
2022-05-20 14:28                 ` Roger Pau Monné
2022-05-20 14:38                   ` Roger Pau Monné
2022-05-23  6:49                     ` Jan Beulich
2022-05-23  9:10                       ` Roger Pau Monné
2022-05-23 10:52                         ` Jan Beulich
2022-04-25  8:41 ` [PATCH v4 14/21] x86: introduce helper for recording degree of contiguity in " Jan Beulich
2022-05-06 13:25   ` Roger Pau Monné
2022-05-18 10:06     ` Jan Beulich
2022-05-20 10:22       ` Roger Pau Monné
2022-05-20 10:59         ` Jan Beulich
2022-05-20 11:27           ` Roger Pau Monné
2022-04-25  8:42 ` [PATCH v4 15/21] AMD/IOMMU: free all-empty " Jan Beulich
2022-05-10 13:30   ` Roger Pau Monné
2022-05-18 10:18     ` Jan Beulich
2022-04-25  8:42 ` [PATCH v4 16/21] VT-d: " Jan Beulich
2022-04-27  4:09   ` Tian, Kevin
2022-05-10 14:30   ` Roger Pau Monné
2022-05-18 10:26     ` Jan Beulich
2022-05-20  0:38       ` Tian, Kevin
2022-05-20 11:13       ` Roger Pau Monné
2022-05-27  7:40         ` Jan Beulich
2022-05-27  7:53           ` Jan Beulich
2022-05-27  9:21             ` Roger Pau Monné
2022-04-25  8:43 ` [PATCH v4 17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings Jan Beulich
2022-05-10 15:31   ` Roger Pau Monné
2022-05-18 10:40     ` Jan Beulich
2022-05-20 10:35       ` Roger Pau Monné
2022-04-25  8:43 ` [PATCH v4 18/21] VT-d: " Jan Beulich
2022-05-11 11:08   ` Roger Pau Monné
2022-05-18 10:44     ` Jan Beulich
2022-05-20 10:38       ` Roger Pau Monné
2022-04-25  8:44 ` [PATCH v4 19/21] IOMMU/x86: add perf counters for page table splitting / coalescing Jan Beulich
2022-05-11 13:48   ` Roger Pau Monné
2022-05-18 11:39     ` Jan Beulich
2022-05-20 10:41       ` Roger Pau Monné
2022-04-25  8:44 ` [PATCH v4 20/21] VT-d: fold iommu_flush_iotlb{,_pages}() Jan Beulich
2022-04-27  4:12   ` Tian, Kevin
2022-05-11 13:50   ` Roger Pau Monné
2022-04-25  8:45 ` [PATCH v4 21/21] VT-d: fold dma_pte_clear_one() into its only caller Jan Beulich
2022-04-27  4:13   ` Tian, Kevin
2022-05-11 13:57   ` Roger Pau Monné
2022-05-18 12:50 ` [PATCH v4 00/21] IOMMU: superpage support when not sharing pagetables Jan Beulich
