From: "Durrant, Paul" <pdurrant@amazon.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"Kevin Tian" <kevin.tian@intel.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	"Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH] x86 / iommu: set up a scratch page in the quarantine domain
Date: Wed, 27 Nov 2019 15:31:31 +0000	[thread overview]
Message-ID: <645bec84b69546089f2a018ac9254e2b@EX13D32EUC003.ant.amazon.com> (raw)
In-Reply-To: <daac5903-cc73-15e3-5d5a-2a1d9e147301@suse.com>

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
> Beulich
> Sent: 20 November 2019 13:52
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: xen-devel@lists.xenproject.org; Kevin Tian <kevin.tian@intel.com>;
> Roger Pau Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; Andrew
> Cooper <andrew.cooper3@citrix.com>
> Subject: Re: [Xen-devel] [PATCH] x86 / iommu: set up a scratch page in the
> quarantine domain
> 
> On 20.11.2019 13:08, Paul Durrant wrote:
> > This patch introduces a new iommu_op to facilitate a per-implementation
> > quarantine set up, and then further code for x86 implementations
> > (amd and vtd) to set up a read/wrote scratch page to serve as the
> > source/target for all DMA whilst a device is assigned to dom_io.
> 
> A single page in the system won't do, I'm afraid. If one guest's
> (prior) device is retrying reads with data containing secrets of that
> guest, another guest's (prior) device could end up writing this data
> to e.g. storage where after a guest restart it is then available to
> the wrong guest.
> 

True. I was unsure whether this was a concern in the scenarios we had to deal with, but I'm informed it is, and it is in the general case too.

> Also nit: s/wrote/write/ .
> 

Yep. Will fix.

> > The reason for doing this is that some hardware may continue to re-try
> > DMA, despite FLR, in the event of an error. Having a scratch page mapped
> > will allow pending DMA to drain and thus quiesce such buggy hardware.
> 
> Without a "sink" page mapped, this would result in IOMMU faults aiui.
> What's the problem with having these faults surface and get handled,
> eventually leading to the device getting bus-mastering disabled? Is
> it that devices continue DMAing even when bus-mastering is off? If
> so, is it even safe to pass through any such device? In any event
> the description needs to be extended here.
> 

The devices in question ignore both FLR and BME, and some IOMMU faults are fatal. I believe, however, that write faults are not, and so I think a single read-only 'source' page will be sufficient.
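For illustration, a minimal sketch of the leaf-level change I have in mind on the AMD side, assuming the final two parameters of set_iommu_pde_present() remain the iw/ir permission flags as in the patch below:

    /*
     * Sketch only (not the patch as posted): install the leaf entries
     * without write permission, so the scratch page can serve as a DMA
     * read source while writes fault non-fatally.
     */
    set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
                          false /* iw */, true /* ir */);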

> > Signed-off-by: Paul Durrant <pdurrant@amazon.com>
> 
> What about Arm? Can devices which Arm allows to assign to guests
> also "babble" like this after de-assignment? If not, this should be
> said in the description. If so, obviously that side would also want
> fixing.
> 
> > --- a/xen/drivers/passthrough/amd/iommu_map.c
> > +++ b/xen/drivers/passthrough/amd/iommu_map.c
> > @@ -560,6 +560,63 @@ int amd_iommu_reserve_domain_unity_map(struct domain *domain,
> >      return rt;
> >  }
> >
> > +int amd_iommu_quarantine_init(struct domain *d)
> 
> __init
> 

Ok.

> > +{
> > +    struct domain_iommu *hd = dom_iommu(d);
> > +    unsigned int level;
> > +    struct amd_iommu_pte *table;
> > +
> > +    if ( hd->arch.root_table )
> > +    {
> > +        ASSERT_UNREACHABLE();
> > +        return 0;
> > +    }
> > +
> > +    spin_lock(&hd->arch.mapping_lock);
> > +
> > +    level = hd->arch.paging_mode;
> 
> With DomIO being PV in principle, this is going to be the
> fixed value PV domains get set, merely depending on RAM size at
> boot time (i.e. not accounting for memory hotplug). This could
> be easily too little for HVM guests, which are free to extend
> their GFN (and hence DFN) space. Therefore I think you need to
> set the maximum possible level here.
>

Ok. I'd not considered memory hotplug. I'll use a static maximum value instead, as VT-d does in general.
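Something along these lines, say, assuming DEFAULT_DOMAIN_ADDRESS_WIDTH and amd_iommu_get_paging_mode() are the right tools here:

    /*
     * Size the tables for the maximum possible DFN, not the current
     * paging mode of the (PV) quarantine domain.
     */
    unsigned long max_gfn =
        PFN_DOWN((1ul << DEFAULT_DOMAIN_ADDRESS_WIDTH) - 1);
    unsigned int level = amd_iommu_get_paging_mode(max_gfn);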

> > +    hd->arch.root_table = alloc_amd_iommu_pgtable();
> > +    if ( !hd->arch.root_table )
> > +        goto out;
> > +
> > +    table = __map_domain_page(hd->arch.root_table);
> > +    while ( level )
> > +    {
> > +        struct page_info *pg;
> > +        unsigned int i;
> > +
> > +        /*
> > +         * The pgtable allocator is fine for the leaf page, as well as
> > +         * page table pages.
> > +         */
> > +        pg = alloc_amd_iommu_pgtable();
> > +        if ( !pg )
> > +            break;
> > +
> > +        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
> > +        {
> > +            struct amd_iommu_pte *pde = &table[i];
> > +
> > +            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
> > +                                  true, true);
> 
> This would also benefit from a comment indicating that it's fine
> for the leaf level as well, despite the "pde" in the name (and
> its sibling set_iommu_pte_present() actually existing). Of course
> you could as well extend the comment a few lines up.

The AMD IOMMU code conflates PDE and PTE all over the place, but I'll add a comment here to that effect.
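E.g. something like this (wording to be polished):

    /*
     * The PTE and PDE layouts match for the bits set here, so despite
     * its name set_iommu_pde_present() is also fine for the leaf
     * (level 1) entries.
     */
    set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
                          true, true);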

> 
> What you do need to do though is pre-fill the leaf page ...
> 
> > +        }
> > +
> > +        unmap_domain_page(table);
> > +        table = __map_domain_page(pg);
> > +        level--;
> > +    }
> 
> ... here, such that possible left over secrets can't end up
> getting written to e.g. persistent storage or over a network.
> 

That's actually one reason for using the pgtable allocator... It already does that.
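For reference, alloc_amd_iommu_pgtable() scrubs every page it hands out, roughly like this (paraphrasing the helper rather than quoting it verbatim):

    static inline struct page_info *alloc_amd_iommu_pgtable(void)
    {
        struct page_info *pg = alloc_domheap_page(NULL, 0);
        void *vaddr;

        if ( pg == NULL )
            return NULL;

        /* Newly allocated (page) tables are always zeroed. */
        vaddr = __map_domain_page(pg);
        memset(vaddr, 0, PAGE_SIZE);
        unmap_domain_page(vaddr);

        return pg;
    }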

> > @@ -2683,9 +2671,68 @@ static void vtd_dump_p2m_table(struct domain *d)
> >      vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
> >  }
> >
> > +static int intel_iommu_quarantine_init(struct domain *d)
> 
> __init again.
> 

Ok.

> > +{
> > +    struct domain_iommu *hd = dom_iommu(d);
> > +    struct dma_pte *parent;
> > +    unsigned int level = agaw_to_level(hd->arch.agaw);
> 
> Other than for AMD this is not a problem here, as all domains
> (currently) get the same AGAW. I wonder though whether precautions
> would be possible here against the "normal" domain setting getting
> adjusted without recalling the need to come back here.
> 

I could just go for a hardcoded width value here too.
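I.e. something like the following, assuming width_to_agaw() is happy with DEFAULT_DOMAIN_ADDRESS_WIDTH:

    /* Decouple the quarantine tables from the normal per-domain AGAW. */
    unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
    unsigned int level = agaw_to_level(agaw);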

> > +    int rc;
> > +
> > +    if ( hd->arch.pgd_maddr )
> > +    {
> > +        ASSERT_UNREACHABLE();
> > +        return 0;
> > +    }
> > +
> > +    spin_lock(&hd->arch.mapping_lock);
> > +
> > +    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
> > +    if ( !hd->arch.pgd_maddr )
> > +        goto out;
> > +
> > +    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.pgd_maddr);
> 
> Unnecessary cast; funnily enough you don't have one further down.
> 

Sorry. Cut'n'paste.
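Will drop it, i.e.:

    parent = map_vtd_domain_page(hd->arch.pgd_maddr);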

> > +    while ( level )
> > +    {
> > +        uint64_t maddr;
> > +        unsigned int offset;
> > +
> > +        /*
> > +         * The pgtable allocator is fine for the leaf page, as well as
> > +         * page table pages.
> > +         */
> > +        maddr = alloc_pgtable_maddr(1, hd->node);
> > +        if ( !maddr )
> > +            break;
> > +
> > +        for ( offset = 0; offset < PTE_NUM; offset++ )
> > +        {
> > +            struct dma_pte *pte = &parent[offset];
> > +
> > +            dma_set_pte_addr(*pte, maddr);
> > +            dma_set_pte_readable(*pte);
> > +            dma_set_pte_writable(*pte);
> > +        }
> > +        iommu_flush_cache_page(parent, 1);
> > +
> > +        unmap_vtd_domain_page(parent);
> > +        parent = map_vtd_domain_page(maddr);
> > +        level--;
> > +    }
> 
> The leaf page wants scrubbing here as well.
> 

Already done by alloc_pgtable_maddr(). I'll note in the comments in this hunk and the AMD one that the pages are zeroed.
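Roughly, the amended comment (in both hunks) would read:

    /*
     * The pgtable allocator is fine for the leaf page, as well as
     * page table pages, and the resulting allocations are always
     * zeroed.
     */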

  Paul

> Jan

