* Re: [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
@ 2015-09-16  9:33   ` Julien Grall
  2015-09-16 13:43     ` Xu, Quan
  2015-09-29  9:24   ` Jan Beulich
  1 sibling, 1 reply; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:33 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

The time of the mail is in the future. Can you configure your mail client
to report the correct time?

On 16/09/2015 14:23, Quan Xu wrote:
> diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> index 106e08f..28e7fc3 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -23,6 +23,21 @@
>   #include <xen/list.h>
>   #include <asm/hvm/iommu.h>
>
> +/*
> + * Status Address and Data: Status address and data is used by hardware to perform
> + * wait descriptor completion status write when the Status Write(SW) field is Set.
> + *
> + * Track the Device-TLB invalidation status in an invalidation table. Update
> + * the invalidation table's count of in-flight Device-TLB invalidation requests and
> + * assign the address of the global polling parameter per domain in the Status Address
> + * of each invalidation wait descriptor, when submitting Device-TLB invalidation
> + * requests.
> + */
> +struct qi_talbe {

Did you want to say table rather than talbe?

> +    u64 qi_table_poll_slot;
> +    u32 qi_table_status_data;
> +};
> +
>   struct hvm_iommu {
>       struct arch_hvm_iommu arch;
>
> @@ -34,6 +49,9 @@ struct hvm_iommu {
>       struct list_head dt_devices;
>   #endif
>
> +    /* IOMMU Queued Invalidation(QI) */
> +    struct qi_talbe talbe;
> +

This header should only contain code that is common between ARM and x86,
whereas this feature seems to be VT-d only (i.e. x86).

So this should be moved into arch_hvm_iommu, defined in asm-x86/hvm/iommu.h.

You would then be able to access the data using 
domain_hvm_iommu(d)->arch.field
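
Something along these lines (untested sketch, using the corrected
"qi_table" spelling; exact placement is up to you):

/* xen/include/asm-x86/hvm/iommu.h */
struct arch_hvm_iommu {
    /* ... existing fields ... */

    /* IOMMU Queued Invalidation(QI) */
    struct qi_table table;
};

#define qi_table_data(d) \
    (domain_hvm_iommu(d)->arch.table.qi_table_status_data)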


>       /* Features supported by the IOMMU */
>       DECLARE_BITMAP(features, IOMMU_FEAT_count);
>   };
> @@ -41,4 +59,9 @@ struct hvm_iommu {
>   #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
>   #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
>
> +#define qi_table_data(d) \
> +    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_status_data)
> +#define qi_table_pollslot(d) \
> +    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)

The ways to access the iommu data on ARM and x86 are different. Please
use domain_hvm_iommu(d)->field if you keep these fields in common code.
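
I.e. something like (sketch, keeping the patch's field name):

#define qi_table_data(d) \
    (domain_hvm_iommu(d)->talbe.qi_table_status_data)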

> +
>   #endif /* __XEN_HVM_IOMMU_H__ */
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag.
  2015-09-16 13:24 ` [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag Quan Xu
@ 2015-09-16  9:34   ` Julien Grall
  0 siblings, 0 replies; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:34 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

On 16/09/2015 14:24, Quan Xu wrote:
> diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> index 28e7fc3..e838905 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -51,6 +51,7 @@ struct hvm_iommu {
>
>       /* IOMMU Queued Invalidation(QI) */
>       struct qi_talbe talbe;
> +    bool_t qi_flag;
>
>       /* Features supported by the IOMMU */
>       DECLARE_BITMAP(features, IOMMU_FEAT_count);
> @@ -63,5 +64,7 @@ struct hvm_iommu {
>       (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_status_data)
>   #define qi_table_pollslot(d) \
>       (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
> +#define QI_FLUSHING(d) \
> +    (d->arch.hvm_domain.hvm_iommu.qi_flag)

I guess the new field and this new macro could be moved into
asm-x86/hvm/iommu.h too.

>
>   #endif /* __XEN_HVM_IOMMU_H__ */
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  2015-09-16 13:24 ` [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to Quan Xu
@ 2015-09-16  9:44   ` Julien Grall
  2015-09-16 14:03     ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:44 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

On 16/09/2015 14:24, Quan Xu wrote:
> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
> index 447c650..d26b026 100644
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -116,6 +116,7 @@ void __dummy__(void)
>       BLANK();
>
>       OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
> +    OFFSET(QI_flag, struct domain, arch.hvm_domain.hvm_iommu.qi_flag);
>       BLANK();
>
>       OFFSET(VMCB_rax, struct vmcb_struct, rax);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 1b9fcfc..1f62e3b 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1479,6 +1479,11 @@ int continue_hypercall_on_cpu(
>       return 0;
>   }
>
> +void do_qi_flushing(struct domain *d)
> +{
> +    do_sched_op(SCHEDOP_yield, guest_handle_from_ptr(NULL, void));

SCHEDOP_yield is a wrapper around vcpu_yield() that would be called by the
guest.

It would be simpler to use the latter. You may even be able to call it
directly from the assembly code rather than introducing a wrapper.

If not, this function should go in x86 specific code (maybe
arch/x86/domain.c ?)
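
Something like (untested sketch, assuming vcpu_yield() from
xen/sched.h keeps its no-argument form):

/* arch/x86/domain.c -- hypothetical placement */
void do_qi_flushing(struct domain *d)
{
    /* Yield the current vCPU rather than taking the full
     * SCHEDOP_yield hypercall path. */
    vcpu_yield();
}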


> +}
> +
>   /*
>    * Local variables:
>    * mode: C
> diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> index e838905..e40fc7b 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -57,6 +57,8 @@ struct hvm_iommu {
>       DECLARE_BITMAP(features, IOMMU_FEAT_count);
>   };
>
> +void do_qi_flushing(struct domain *d);
> +

If you define a function in file.c you should add the prototype in
file.h.

I.e. as you defined the function in common/domain.c, the prototype should
go in xen/domain.h.

>   #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
>   #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
>
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 08/13] vt-d: Held on the freed page until the Device-TLB flush is completed.
  2015-09-16 13:24 ` [Patch RFC 08/13] vt-d: Held on the freed page until the Device-TLB flush is completed Quan Xu
@ 2015-09-16  9:45   ` Julien Grall
  0 siblings, 0 replies; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:45 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

On 16/09/2015 14:24, Quan Xu wrote:
> diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> index e40fc7b..5dc0033 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h

Same remarks as the previous patches for the fields, prototype and macros.

> @@ -53,11 +53,15 @@ struct hvm_iommu {
>       struct qi_talbe talbe;
>       bool_t qi_flag;
>
> +    struct page_list_head qi_hold_page_list;
> +    spinlock_t qi_lock;
> +
>       /* Features supported by the IOMMU */
>       DECLARE_BITMAP(features, IOMMU_FEAT_count);
>   };
>
>   void do_qi_flushing(struct domain *d);
> +void qi_hold_page(struct domain *d, struct page_info *pg);
>
>   #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
>   #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
> @@ -68,5 +72,9 @@ void do_qi_flushing(struct domain *d);
>       (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
>   #define QI_FLUSHING(d) \
>       (d->arch.hvm_domain.hvm_iommu.qi_flag)
> +#define qi_hold_page_list(d) \
> +    (d->arch.hvm_domain.hvm_iommu.qi_hold_page_list)
> +#define qi_page_lock(d) \
> +    (d->arch.hvm_domain.hvm_iommu.qi_lock)
>
>   #endif /* __XEN_HVM_IOMMU_H__ */
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 10/13] vt-d: Held on the removed page until the Device-TLB flush is completed.
  2015-09-16 13:24 ` [Patch RFC 10/13] vt-d: Held on the removed page until the Device-TLB flush is completed Quan Xu
@ 2015-09-16  9:52   ` Julien Grall
  0 siblings, 0 replies; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:52 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

On 16/09/2015 14:24, Quan Xu wrote:
> Signed-off-by: Quan Xu <quan.xu@intel.com>
> ---
>   xen/common/memory.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 61bb94c..4b2def5 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -253,7 +253,21 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
>
>       guest_physmap_remove_page(d, gmfn, mfn, 0);
>
> -    put_page(page);
> +#ifdef HAS_PASSTHROUGH
> +    /*
> +     * The page freed from the domain should be on held, until the
> +     * Device-TLB flush is completed. The page previously associated
> +     * with the freed portion of GPA should not be reallocated for
> +     * another purpose until the appropriate invalidations have been
> +     * performed. Otherwise, the original page owner can still access
> +     * freed page though DMA.
> +     */
> +    if ( need_iommu(d) && QI_FLUSHING(d) && !d->is_dying )
> +        qi_hold_page(d, page);

qi_hold_page is defined in drivers/passthrough/vtd/iommu.c, which is only
compiled for x86.

Which means that this call will break compilation on ARM. Also, the AMD
IOMMU should never call this code.

IMHO this should be moved into x86 specific code. Although, if you plan to
keep it in common code, you need to at least add a new IOMMU op.
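
A sketch of what such a hook could look like (hypothetical callback
name, assuming it is wired up via the domain's platform_ops):

/* xen/include/xen/iommu.h */
struct iommu_ops {
    /* ... existing callbacks ... */
    void (*hold_page)(struct domain *d, struct page_info *pg);
};

/* common code then stays vendor-agnostic: */
const struct iommu_ops *ops = domain_hvm_iommu(d)->platform_ops;

if ( need_iommu(d) && QI_FLUSHING(d) && !d->is_dying && ops->hold_page )
    ops->hold_page(d, page);
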

> +    else
> +#endif
> +        put_page(page);
> +
>       put_gfn(d, gmfn);
>
>       return 1;
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when
  2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
@ 2015-09-16  9:56   ` Julien Grall
  2015-09-23 17:38   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 84+ messages in thread
From: Julien Grall @ 2015-09-16  9:56 UTC (permalink / raw)
  To: Quan Xu, andrew.cooper3, eddie.dong, ian.campbell, ian.jackson,
	jbeulich, jun.nakajima, keir, kevin.tian, tim, yang.z.zhang,
	george.dunlap
  Cc: xen-devel

Hi Quan,

On 16/09/2015 14:24, Quan Xu wrote:
> to destroy virtual machine, schedule and wait on a waitqueue
> until the Device-TLB flush is completed.
>
> Signed-off-by: Quan Xu <quan.xu@intel.com>
> ---
>   xen/common/domain.c                 | 10 ++++++++++
>   xen/drivers/passthrough/vtd/iommu.c |  9 +++++++++
>   xen/include/xen/hvm/iommu.h         |  6 ++++++
>   3 files changed, 25 insertions(+)

Same remarks as the previous patches.

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 1f62e3b..8ccc1a5 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -867,6 +867,16 @@ void domain_destroy(struct domain *d)
>       rcu_assign_pointer(*pd, d->next_in_hashbucket);
>       spin_unlock(&domlist_update_lock);
>
> +#ifdef HAS_PASSTHROUGH
> +    /*
> +     * If the Device-TLB flush is still not completed, schedule
> +     * and wait on a waitqueue until the Device-TLB flush is
> +     * completed.
> +     */
> +    if ( need_iommu(d) && QI_FLUSHING(d) )
> +        wait_for_qi_flushing(d);
> +#endif
> +
>       /* Schedule RCU asynchronous completion of domain destroy. */
>       call_rcu(&d->rcu, complete_domain_destroy);
>   }
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 1297dea..3d98fea 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1070,6 +1070,11 @@ static hw_irq_controller dma_msi_type = {
>   };
>
>   /* IOMMU Queued Invalidation(QI). */
> +void wait_for_qi_flushing(struct domain *d)
> +{
> +    wait_event(qi_wq(d), !QI_FLUSHING(d));
> +}
> +
>   static void qi_clear_iwc(struct iommu *iommu)
>   {
>       unsigned long flags;
> @@ -1188,6 +1193,7 @@ scan_again:
>                   }
>                   spin_unlock(&qi_page_lock(d));
>                   QI_FLUSHING(d) = 0;
> +                wake_up_all(&qi_wq(d));
>               }
>               rcu_unlock_domain(d);
>           }
> @@ -1494,6 +1500,7 @@ static int intel_iommu_domain_init(struct domain *d)
>       hd->arch.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
>       INIT_PAGE_LIST_HEAD(&qi_hold_page_list(d));
>       spin_lock_init(&qi_page_lock(d));
> +    init_waitqueue_head(&qi_wq(d));
>
>       return 0;
>   }
> @@ -1925,6 +1932,8 @@ static void iommu_domain_teardown(struct domain *d)
>       if ( list_empty(&acpi_drhd_units) )
>           return;
>
> +    destroy_waitqueue_head(&qi_wq(d));
> +
>       list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
>       {
>           list_del(&mrmrr->list);
> diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> index 5dc0033..f661c8c 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -20,6 +20,7 @@
>   #define __XEN_HVM_IOMMU_H__
>
>   #include <xen/iommu.h>
> +#include <xen/wait.h>
>   #include <xen/list.h>
>   #include <asm/hvm/iommu.h>
>
> @@ -56,12 +57,15 @@ struct hvm_iommu {
>       struct page_list_head qi_hold_page_list;
>       spinlock_t qi_lock;
>
> +    struct waitqueue_head qi_wq;
> +
>       /* Features supported by the IOMMU */
>       DECLARE_BITMAP(features, IOMMU_FEAT_count);
>   };
>
>   void do_qi_flushing(struct domain *d);
>   void qi_hold_page(struct domain *d, struct page_info *pg);
> +void wait_for_qi_flushing(struct domain *d);
>
>   #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
>   #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
> @@ -76,5 +80,7 @@ void qi_hold_page(struct domain *d, struct page_info *pg);
>       (d->arch.hvm_domain.hvm_iommu.qi_hold_page_list)
>   #define qi_page_lock(d) \
>       (d->arch.hvm_domain.hvm_iommu.qi_lock)
> +#define qi_wq(d) \
> +    (d->arch.hvm_domain.hvm_iommu.qi_wq)
>
>   #endif /* __XEN_HVM_IOMMU_H__ */
>

Regards,

-- 
Julien Grall


* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
@ 2015-09-16 10:46 ` Ian Jackson
  2015-09-16 11:22   ` Julien Grall
  2015-09-16 13:33   ` Xu, Quan
  2015-09-16 13:23 ` [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt Quan Xu
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 84+ messages in thread
From: Ian Jackson @ 2015-09-16 10:46 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, keir, eddie.dong, jun.nakajima, andrew.cooper3,
	ian.jackson, tim, george.dunlap, jbeulich, yang.z.zhang,
	xen-devel, ian.campbell

Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> Introduction
> ============

Thanks for your submission.

JOOI why did you CC me?  I did a quick scan of these patches and they
don't seem to have any tools impact.  I would prefer not to be CC'd
unless there is a reason why my attention would be valuable.

Regards,
Ian.


* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 10:46 ` Ian Jackson
@ 2015-09-16 11:22   ` Julien Grall
  2015-09-16 13:47     ` Ian Jackson
  2015-09-16 13:33   ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Julien Grall @ 2015-09-16 11:22 UTC (permalink / raw)
  To: Ian Jackson, Quan Xu
  Cc: kevin.tian, keir, jbeulich, george.dunlap, tim, eddie.dong,
	xen-devel, jun.nakajima, andrew.cooper3, yang.z.zhang,
	ian.campbell

On 16/09/15 11:46, Ian Jackson wrote:
> Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
>> Introduction
>> ============
> 
> Thanks for your submission.
> 
> JOOI why did you CC me?  I did a quick scan of these patches and they
> don't seem to have any tools impact.  I would prefer not to be CC'd
> unless there is a reason why my attention would be valuable.

The common directory is maintained by the "THE REST" group. From the
MAINTAINERS file, you are part of it.

Regards,

-- 
Julien Grall


* [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
@ 2015-09-16 13:23 Quan Xu
  2015-09-16 10:46 ` Ian Jackson
                   ` (16 more replies)
  0 siblings, 17 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Introduction
============

   VT-d code currently has a number of cases where completion of certain operations
is being waited for by way of spinning. The majority of instances use that timeout
indirectly through the IOMMU_WAIT_OP() macro, allowing for loops of up to 1 second
(DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the
invalidation case seems particularly problematic.

Currently the hypervisor polls the status address of the wait descriptor for up to 1 second to
get the invalidation flush result. When the invalidation queue includes a Device-TLB invalidation,
using 1 second is a mistake in the invalidation sync: that timeout reflects response times of the
IOMMU engine, not of Device-TLB invalidation with PCI-e Address Translation Services (ATS) in use;
the ATS specification mandates a timeout of 1 _minute_ for cache flush. The ATS case needs to be
taken into consideration when doing invalidations. Obviously we can't spin for a minute, so
invalidation absolutely needs to be converted to a non-spinning model.

   Also I should fix the new memory security issue:
the page freed from the domain should be held until the Device-TLB flush is completed (ATS timeout of 1 _minute_).
The page previously associated with the freed portion of GPA should not be reallocated for
another purpose until the appropriate invalidations have been performed. Otherwise, the original
page owner can still access the freed page through DMA.

Why RFC
=======
    Patch 0001--0005, 0013 are IOMMU related.
    Patch 0006 is about new flag (vCPU / MMU related).
    Patch 0007 is vCPU related.
    Patch 0008--0012 are MMU related.

    1. Xen MMU is very complicated. Could Xen MMU experts help me verify whether I
       have covered all of the cases?

    2. For gnttab_transfer, if the Device-TLB flush is still not completed when mapping
       the transferred page to a remote domain, schedule and wait on a waitqueue
       until the Device-TLB flush is completed. Is it correct?

       (I have tested the waitqueue in decrease_reservation() [do_memory_op() hypercall]:
        I wake up a domain (with only one vCPU) with the debug-key tool, and the domain
        is still working after waiting 60s in a waitqueue.)


Design Overview
===============

This design implements a non-spinning model for Device-TLB invalidation -- using an interrupt
based mechanism. Track the Device-TLB invalidation status in an invalidation table per domain. The
invalidation table keeps the count of in-flight Device-TLB invalidation requests, and also
provides a global polling parameter per domain for in-flight Device-TLB invalidation requests.
Update the invalidation table's count of in-flight Device-TLB invalidation requests and assign the
address of the global polling parameter per domain in the Status Address of each invalidation wait
descriptor, when submitting invalidation requests.

For example:
  .

|invl |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
|wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
|dsc  |
  .
  .

|invl |
|wait | Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
|dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .

|invl |
|wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
|dsc  |  Status Address =  virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .

For more information about the VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
Status Address and Data: Status address and data is used by hardware to perform the wait descriptor
                         completion status write when the Status Write(SW) field is Set. Hardware behavior
                         is undefined if the Status Address is in the interrupt address range (0xFEEX_XXXX etc.). The Status Address
                         and Data fields are ignored by hardware when the Status Write field is Clear.

The invalidation completion event interrupt is generated only after the invalidation wait descriptor
completes. The invalidation interrupt handler will schedule a softirq to do the following check:

  if invalidation table's count of in-flight Device-TLB invalidation requests == polling parameter:
    This domain has no in-flight Device-TLB invalidation requests.
  else
    This domain has in-flight Device-TLB invalidation requests.
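
In C, using the macros introduced in patch 03, the per-domain check
amounts to (sketch of what the QI tasklet does):

  if ( qi_table_pollslot(d) == qi_table_data(d) )
  {
      /* No in-flight requests: reset the invalidation table. */
      qi_table_data(d) = 0;
      qi_table_pollslot(d) = 0;
  }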

Track domain status:
   The vCPU is NOT allowed to enter guest mode and is put into the SCHEDOP_yield list if it has in-flight
Device-TLB invalidation requests.
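
A sketch of the gate (the series performs this check from entry.S via
the QI_flag offset exported in asm-offsets.c; a C equivalent would be):

  if ( QI_FLUSHING(v->domain) )
      do_qi_flushing(v->domain);  /* yield instead of entering guest mode */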

Memory security issue:
    In case PCI-e Address Translation Services (ATS) are in use, the ATS spec mandates a timeout of 1 minute
for cache flush.
    The page freed from the domain should be held until the Device-TLB flush is completed. The page
previously associated with the freed portion of GPA should not be reallocated for another purpose until
the appropriate invalidations have been performed. Otherwise, the original page owner can still access the
freed page through DMA.

   *Hold the page until the Device-TLB flush is completed.
      - Unlink the page from the original owner.
      - Remove the page from the page_list of the domain.
      - Decrease the total pages count of the domain.
      - Add the page to qi_hold_page_list.

    *Put the page in the Queued Invalidation(QI) interrupt handler once the Device-TLB flush is completed.
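
A minimal sketch of the hold operation, using the macros from patch 08
(the real qi_hold_page() also performs the unlinking and accounting
listed above):

  void qi_hold_page(struct domain *d, struct page_info *pg)
  {
      spin_lock(&qi_page_lock(d));
      page_list_add_tail(pg, &qi_hold_page_list(d));
      spin_unlock(&qi_page_lock(d));
  }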

Invalidation Fault:
A fault event will be generated if an invalidation fails. We can then disable the devices.

For Context Invalidation and IOTLB invalidation without Device-TLB invalidation, Queued Invalidation(QI) submits
invalidation requests as before (this is a tradeoff, as the cost of an interrupt is overhead; it will be modified
in a coming series of patches).

More details
============

1. invalidation table. We define qi_table structure per domain.
+struct qi_talbe {
+    u64 qi_table_poll_slot;
+    u32 qi_table_status_data;
+};

@ struct hvm_iommu {
+    /* IOMMU Queued Invalidation(QI) */
+    struct qi_talbe talbe;
}

2. Modification to Device IOTLB invalidation:
    - Enabled interrupt notification when hardware completes the invalidations:
      Set FN, IF and SW bits in Invalidation Wait Descriptor. The reason why also set SW bit is that
      the interrupt for notification is global not per domain.
      So we still need to poll the status address to know which Device-TLB invalidation request is
      completed in QI interrupt handler.
    - A new per-domain flag (*qi_flag) is used to track the status of Device-TLB invalidation requests.
      The *qi_flag will be set before submitting the Device-TLB invalidation requests. The vCPU is NOT
      allowed to enter guest mode and is put into the SCHEDOP_yield list if the *qi_flag is Set.
    - New logic to do the synchronization:
        if no Device-TLB invalidation:
            Back to current invalidation logic.
        else
            Set the IF, SW, FN bits in the wait descriptor and prepare the Status Data.
            Set *qi_flag.
            Put the domain in the pending flush list (the vCPU is NOT allowed to enter guest mode and is put into SCHEDOP_yield if the *qi_flag is Set.)
        Return
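
Expressed against the queue_invalidate_wait() signature from patch 03,
the two cases boil down to (sketch):

  /* Device-TLB flush: IF=1, SW=1, FN=1; device_id identifies the domain. */
  queue_invalidate_wait(iommu, 1, 1, 1, device_id);

  /* Any other flush: spin on the local poll_slot as before. */
  queue_invalidate_wait(iommu, 0, 1, 1, QINVAL_INVALID_DEVICE_ID);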

For more information about the VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
   SW: Indicate the invalidation wait descriptor completion by performing a coherent DWORD write of the value in the Status Data field
       to the address specified in the Status Address.
   FN: Indicate the descriptors following the invalidation wait descriptor must be processed by hardware only after the invalidation
       wait descriptor completes.
   IF: Indicate the invalidation wait descriptor completion by generating an invalidation completion event per the programming of the
       Invalidation Completion Event Registers.

3. Modification to domain running lifecycle:
    - When the *qi_flag is set (i.e. there are in-flight Device-TLB invalidation requests), the domain is not
      allowed to enter guest mode and its vCPUs are put into the SCHEDOP_yield list.

4. New interrupt handler for invalidation completion:
    - When hardware completes the Device-TLB invalidation requests, it generates an interrupt to notify the hypervisor.
    - In the interrupt handler, schedule a tasklet to handle it.
    - The tasklet does the following:
        *Clear IWC field in the Invalidation Completion Status register. If the IWC field in the Invalidation
         Completion Status register was already Set at the time of setting this field, it is not treated as a new
         interrupt condition.
        *Scan the domain list (domains with VT-d passthrough devices; scan 'iommu->domid_bitmap').
                for each domain:
                check the invalidation table values (qi_table_poll_slot and qi_table_status_data) of each domain.
                if equal:
                   Put the on-hold pages.
                   Clear the invalidation table.
                   Clear *qi_flag.

        *If the IP field of the Invalidation Event Control Register is Set, try to *Clear IWC and *Scan the domain list again, instead of
         generating another interrupt.
        *Clear the IM field of the Invalidation Event Control Register.

((
  Logic of IWC / IP / IM as below:

                          Interrupt condition (An invalidation wait descriptor with Interrupt Flag(IF) field Set completed.)
                                  ||
                                   v
           ----------------------(IWC) ----------------------
     (IWC is Set)                                (IWC is not Set)
          ||                                            ||
          V                                             ||
(Not treated as an new interrupt condition)             ||
                                                         V
                                                   (Set IWC / IP)
                                                        ||
                                                         V
                                  ---------------------(IM)---------------------
                              (IM is Set)                               (IM not Set)
                                  ||                                        ||
                                  ||                                        V
                                  ||                    (cause Interrupt message / then hardware clear IP)
                                   V
   (interrupt is held pending, clearing IM to cause interrupt message)

* If the IWC field is cleared, the IP field is also cleared.
))

5. Invalidation failure:
    - A fault event will be generated if an invalidation fails. We can disable the devices if we receive an
      invalidation fault event.

6. Memory security issue:

    The page freed from the domain should be held until the Device-TLB flush is completed. The page
previously associated with the freed portion of GPA should not be reallocated for another purpose until
the appropriate invalidations have been performed. Otherwise, the original page owner can still access the
freed page through DMA.

   *Hold the page until the Device-TLB flush is completed.
      - Unlink the page from the original owner.
      - Remove the page from the page_list of the domain.
      - Decrease the total pages count of the domain.
      - Add the page to qi_hold_page_list.

  *Put the page in the Queued Invalidation(QI) interrupt handler once the Device-TLB flush is completed.


----
There are 3 reasons to submit device-TLB invalidation requests:
    *VT-d initialization.
    *Reassign device ownership.
    *Memory modification.

6.1 *VT-d initialization
    When VT-d is initializing, there is no guest domain running. So no memory security issue.
iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
                                              |--init_vtd_hw()
6.2 *Reassign device ownership
    Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op() and arch_do_domctl().
As long as the *qi_flag is Set, the domain is not allowed to enter guest mode. If the appropriate invalidations
have not yet been performed, the *qi_flag is still Set, and the devices are not ready for guest domains to launch
DMA with them. So once the *qi_flag is introduced, there is no memory security issue.

iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_dsi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

|-iommu_flush_iotlb_psi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

6.3 *Memory modification.
While memory is modified, there are a lot of invoker flows for updating the EPT, but they update the IOMMU page
tables only when all of the following three conditions are met (see the sketch after this list):
  * The P2M is the hostp2m. ( p2m_is_hostp2m(p2m) )
  * The previous mfn is not equal to the new mfn. (prev_mfn != new_mfn)
  * This domain needs IOMMU. (need_iommu(d))
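
A sketch of the combined condition:

  if ( p2m_is_hostp2m(p2m) && (prev_mfn != new_mfn) && need_iommu(d) )
  {
      /* The IOMMU page table is updated here, so the Device-TLB
       * (with ATS in use) must be flushed as well. */
  }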

##
|--iommu_pte_flush()--ept_set_entry()

#PoD (populate on demand) is not supported while IOMMU passthrough is enabled. So ignore the PoD invoker flows below.
      |--p2m_pod_zero_check_superpage()  ...
      |--p2m_pod_zero_check()  ...
      |--p2m_pod_demand_populate()  ...
      |--p2m_pod_decrease_reservation()  ...
      |--guest_physmap_mark_populate_on_demand() ...

#Xen paging is not supported while IOMMU passthrough is enabled. So ignore Xen paging invoker flow below.
      |--p2m_mem_paging_evict() ...
      |--p2m_mem_paging_resume()...
      |--p2m_mem_paging_prep()...
      |--p2m_mem_paging_populate()...
      |--p2m_mem_paging_nominate()...
      |--p2m_alloc_table()--shadow_enable() --paging_enable()--shadow_domctl() --paging_domctl()--arch_do_domctl() --do_domctl()
                                                                                                                  |--paging_domctl_continuation()

#Xen sharing is not supported while IOMMU passthrough is enabled. So ignore the Xen sharing invoker flow below.
      |--set_shared_p2m_entry()...


#The domain is paused, so it can't launch DMA.
      |--relinquish_shared_pages()--domain_relinquish_resources( case RELMEM_shared: ) --domain_kill()--do_domctl()

#The below p2m is not hostp2m. It is L2 to L0. So ignore invoker flow below.
      |--nestedhap_fix_p2m() --nestedhvm_hap_nested_page_fault() --hvm_hap_nested_page_fault() --ept_handle_violation()--vmx_vmexit_handler()

#If prev_mfn == new_mfn, it will not update IOMMU page tables. So ignore invoker flow below.
      |--p2m_mem_access_check()-- hvm_hap_nested_page_fault() --ept_handle_violation()--vmx_vmexit_handler()(L1 --> L0 / but just only check p2m_type_t)
      |--p2m_set_mem_access() ...
      |--guest_physmap_mark_populate_on_demand() ...
      |--p2m_change_type_one() ...
# The previous page is not put and then allocated for Xen or other guest domains. So there is no memory security issue. Ignore the invoker flows below.
   |--p2m_remove_page()--guest_physmap_remove_page() ...

   |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
                           |--map_mmio_regions()--do_domctl()


# Hold the pages which are removed in guest_remove_page(), and put them in the QI interrupt handler when there are no in-flight Device-TLB invalidation requests.

|--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
                                               |--xenmem_add_to_physmap_one() --xenmem_add_to_physmap() /xenmem_add_to_physmap_batch()  .. --do_memory_op()
                                               |--p2m_add_foreign() -- xenmem_add_to_physmap_one() ..--do_memory_op()
                                                                   |--guest_physmap_add_entry()--create_grant_p2m_mapping()  ... --do_grant_table_op()

((
   Much more explanation:
   Actually, the previous pages may be mapped from the Xen heap for guest domains in decrease_reservation() / xenmem_add_to_physmap_one()
   / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4 functions will map Xen heap pages for guest domains:
          * share page for Xen Oprofile.
          * vLAPIC mapping.
          * grant table shared page.
          * domain shared_info page.
))

# For grant_unmap*, ignore it at this point, as we can hold the page when the domain frees the xenballooned page.

    |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref() --gnttab_unmap_grant_ref()--do_grant_table_op()
                                               |--__gnttab_unmap_and_replace() -- gnttab_unmap_and_replace() --do_grant_table_op()

# For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.

# For grant_transfer:
  |--p2m_remove_page()--guest_physmap_remove_page()
                                                 |--gnttab_transfer() ...  --do_grant_table_op()

    If the Device-TLB flush is still not completed when mapping the transferred page to a remote domain,
    schedule and wait on a waitqueue until the Device-TLB flush is completed.
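
    A sketch of that wait, reusing the helpers from patch 11:

    if ( need_iommu(d) && QI_FLUSHING(d) )
        wait_for_qi_flushing(d);  /* wait_event(qi_wq(d), !QI_FLUSHING(d)) */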

   Plan B:
   ((If the Device-TLB flush is still not completed before adding the transferred page to the target domain,
   allocate a new page for the target domain and hold the old transferred page, which will be put in the QI
   interrupt handler when there are no in-flight Device-TLB invalidation requests.))


Quan Xu (13):
  vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  vt-d: Register MSI for async invalidation completion interrupt.
  vt-d: Track the Device-TLB invalidation status in an invalidation table.
  vt-d: Clear invalidation table in invaidation interrupt handler
  vt-d: Clear the IWC field of Invalidation Event Control Register in
  vt-d: Introduce a new per-domain flag - qi_flag.
  vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  vt-d: Held on the freed page until the Device-TLB flush is completed.
  vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
  vt-d: Held on the removed page until the Device-TLB flush is completed.
  vt-d: If the Device-TLB flush is still not completed when
  vt-d: For gnttab_transfer, If the Device-TLB flush is still
  vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB

 xen/arch/x86/hvm/vmx/entry.S         |  10 ++
 xen/arch/x86/x86_64/asm-offsets.c    |   1 +
 xen/common/domain.c                  |  15 ++
 xen/common/grant_table.c             |  16 ++
 xen/common/memory.c                  |  16 +-
 xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h  |  18 +++
 xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
 xen/include/xen/hvm/iommu.h          |  42 +++++
 9 files changed, 443 insertions(+), 16 deletions(-)

-- 
1.8.3.2


* [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
  2015-09-16 10:46 ` Ian Jackson
@ 2015-09-16 13:23 ` Quan Xu
  2015-09-29  8:43   ` Jan Beulich
  2015-09-16 13:23 ` [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt Quan Xu
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1dffc40..17bfb76 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1068,7 +1068,9 @@ static hw_irq_controller dma_msi_type = {
     .set_affinity = dma_msi_set_affinity,
 };
 
-static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd)
+static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd,
+    hw_irq_controller *irq_ctrl, const char *devname, struct msi_desc *msi,
+    void (*irq_handler)(int, void *, struct cpu_user_regs *))
 {
     int irq, ret;
     struct acpi_rhsa_unit *rhsa = drhd_to_rhsa(drhd);
@@ -1084,8 +1086,8 @@ static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd)
     }
 
     desc = irq_to_desc(irq);
-    desc->handler = &dma_msi_type;
-    ret = request_irq(irq, 0, iommu_page_fault, "dmar", iommu);
+    desc->handler = irq_ctrl;
+    ret = request_irq(irq, 0, irq_handler, devname, iommu);
     if ( ret )
     {
         desc->handler = &no_irq_type;
@@ -1094,11 +1096,11 @@ static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd)
         return ret;
     }
 
-    iommu->msi.irq = irq;
-    iommu->msi.msi_attrib.pos = MSI_TYPE_IOMMU;
-    iommu->msi.msi_attrib.maskbit = 1;
-    iommu->msi.msi_attrib.is_64 = 1;
-    desc->msi_desc = &iommu->msi;
+    msi->irq = irq;
+    msi->msi_attrib.pos = MSI_TYPE_IOMMU;
+    msi->msi_attrib.maskbit = 1;
+    msi->msi_attrib.is_64 = 1;
+    desc->msi_desc = msi;
 
     return 0;
 }
@@ -2179,7 +2181,8 @@ int __init intel_vtd_setup(void)
         if ( !vtd_ept_page_compatible(iommu) )
             iommu_hap_pt_share = 0;
 
-        ret = iommu_set_interrupt(drhd);
+        ret = iommu_set_interrupt(drhd, &dma_msi_type, "dmar", &drhd->iommu->msi,
+                                  iommu_page_fault);
         if ( ret )
         {
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: interrupt setup failed\n");
-- 
1.8.3.2


* [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt.
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
  2015-09-16 10:46 ` Ian Jackson
  2015-09-16 13:23 ` [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt Quan Xu
@ 2015-09-16 13:23 ` Quan Xu
  2015-09-29  8:57   ` Jan Beulich
  2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 133 ++++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.h |  10 +++
 2 files changed, 143 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 17bfb76..db6e3a2 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -54,6 +54,7 @@ bool_t __read_mostly untrusted_msi;
 int nr_iommus;
 
 static struct tasklet vtd_fault_tasklet;
+static struct tasklet vtd_qi_tasklet;
 
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
@@ -1068,6 +1069,125 @@ static hw_irq_controller dma_msi_type = {
     .set_affinity = dma_msi_set_affinity,
 };
 
+/* IOMMU Queued Invalidation(QI). */
+static void _qi_msi_unmask(struct iommu *iommu)
+{
+    u32 sts;
+    unsigned long flags;
+
+    /* Clear IM bit of DMAR_IECTL_REG. */
+    spin_lock_irqsave(&iommu->register_lock, flags);
+    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
+    sts &= ~DMA_IECTL_IM;
+    dmar_writel(iommu->reg, DMAR_IECTL_REG, sts);
+    spin_unlock_irqrestore(&iommu->register_lock, flags);
+}
+
+static void _qi_msi_mask(struct iommu *iommu)
+{
+    u32 sts;
+    unsigned long flags;
+
+    /* Set IM bit of DMAR_IECTL_REG. */
+    spin_lock_irqsave(&iommu->register_lock, flags);
+    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
+    sts |= DMA_IECTL_IM;
+    dmar_writel(iommu->reg, DMAR_IECTL_REG, sts);
+    spin_unlock_irqrestore(&iommu->register_lock, flags);
+}
+
+static void _do_iommu_qi(struct iommu *iommu)
+{
+}
+
+static void do_iommu_qi_completion(unsigned long data)
+{
+    struct acpi_drhd_unit *drhd;
+
+    if ( list_empty(&acpi_drhd_units) )
+    {
+       dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no iommu devices.\n");
+       return;
+    }
+
+    for_each_drhd_unit( drhd )
+        _do_iommu_qi(drhd->iommu);
+}
+
+static void iommu_qi_completion(int irq, void *dev_id,
+                                struct cpu_user_regs *regs)
+{
+    tasklet_schedule(&vtd_qi_tasklet);
+}
+
+static void qi_msi_unmask(struct irq_desc *desc)
+{
+    _qi_msi_unmask(desc->action->dev_id);
+}
+
+static void qi_msi_mask(struct irq_desc *desc)
+{
+    _qi_msi_mask(desc->action->dev_id);
+}
+
+static unsigned int qi_msi_startup(struct irq_desc *desc)
+{
+    qi_msi_unmask(desc);
+    return 0;
+}
+
+static void qi_msi_ack(struct irq_desc *desc)
+{
+    irq_complete_move(desc);
+    qi_msi_mask(desc);
+    move_masked_irq(desc);
+}
+
+static void qi_msi_end(struct irq_desc *desc, u8 vector)
+{
+    ack_APIC_irq();
+}
+
+static void qi_msi_set_affinity(struct irq_desc *desc, const cpumask_t *mask)
+{
+    struct msi_msg msg;
+    unsigned int dest;
+    unsigned long flags;
+    struct iommu *iommu = desc->action->dev_id;
+
+    dest = set_desc_affinity(desc, mask);
+    if ( dest == BAD_APICID )
+    {
+        dprintk(XENLOG_ERR VTDPREFIX,
+                "IOMMU: Set invaldaiton interrupt affinity error!\n");
+        return;
+    }
+
+    msi_compose_msg(desc->arch.vector, desc->arch.cpu_mask, &msg);
+    if ( x2apic_enabled )
+        msg.address_hi = dest & 0xFFFFFF00;
+    msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
+    msg.address_lo |= MSI_ADDR_DEST_ID(dest);
+    iommu->qi_msi.msg = msg;
+
+    spin_lock_irqsave(&iommu->register_lock, flags);
+    dmar_writel(iommu->reg, DMAR_IEDATA_REG, msg.data);
+    dmar_writel(iommu->reg, DMAR_IEADDR_REG, msg.address_lo);
+    dmar_writel(iommu->reg, DMAR_IEUADDR_REG, msg.address_hi);
+    spin_unlock_irqrestore(&iommu->register_lock, flags);
+}
+
+static hw_irq_controller qi_msi_type = {
+    .typename = "QI_MSI",
+    .startup = qi_msi_startup,
+    .shutdown = qi_msi_mask,
+    .enable = qi_msi_unmask,
+    .disable = qi_msi_mask,
+    .ack = qi_msi_ack,
+    .end = qi_msi_end,
+    .set_affinity = qi_msi_set_affinity,
+};
+
 static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd,
     hw_irq_controller *irq_ctrl, const char *devname, struct msi_desc *msi,
     void (*irq_handler)(int, void *, struct cpu_user_regs *))
@@ -1123,6 +1243,7 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
         return -ENOMEM;
 
     iommu->msi.irq = -1; /* No irq assigned yet. */
+    iommu->qi_msi.irq = -1; /* No irq assigned yet. */
 
     iommu->intel = alloc_intel_iommu();
     if ( iommu->intel == NULL )
@@ -1228,6 +1349,9 @@ void __init iommu_free(struct acpi_drhd_unit *drhd)
     free_intel_iommu(iommu->intel);
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
+    if ( iommu->qi_msi.irq >= 0 )
+        destroy_irq(iommu->qi_msi.irq);
+
     xfree(iommu);
 }
 
@@ -1985,6 +2109,9 @@ static void adjust_irq_affinity(struct acpi_drhd_unit *drhd)
          cpumask_intersects(&node_to_cpumask(node), cpumask) )
         cpumask = &node_to_cpumask(node);
     dma_msi_set_affinity(irq_to_desc(drhd->iommu->msi.irq), cpumask);
+
+    if ( ats_enabled )
+        qi_msi_set_affinity(irq_to_desc(drhd->iommu->qi_msi.irq), cpumask);
 }
 
 int adjust_vtd_irq_affinities(void)
@@ -2183,6 +2310,11 @@ int __init intel_vtd_setup(void)
 
         ret = iommu_set_interrupt(drhd, &dma_msi_type, "dmar", &drhd->iommu->msi,
                                   iommu_page_fault);
+        if ( ats_enabled )
+            ret = iommu_set_interrupt(drhd, &qi_msi_type, "qi",
+                                      &drhd->iommu->qi_msi,
+                                      iommu_qi_completion);
+
         if ( ret )
         {
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: interrupt setup failed\n");
@@ -2191,6 +2323,7 @@ int __init intel_vtd_setup(void)
     }
 
     softirq_tasklet_init(&vtd_fault_tasklet, do_iommu_page_fault, 0);
+    softirq_tasklet_init(&vtd_qi_tasklet, do_iommu_qi_completion, 0);
 
     if ( !iommu_qinval && iommu_intremap )
     {
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index ac71ed1..52d328f 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -47,6 +47,11 @@
 #define    DMAR_IQH_REG    0x80    /* invalidation queue head */
 #define    DMAR_IQT_REG    0x88    /* invalidation queue tail */
 #define    DMAR_IQA_REG    0x90    /* invalidation queue addr */
+#define    DMAR_IECTL_REG  0xA0    /* invalidation event control register */
+#define    DMAR_IEDATA_REG 0xA4    /* invalidation event data register */
+#define    DMAR_IEADDR_REG 0xA8    /* invalidation event address register */
+#define    DMAR_IEUADDR_REG 0xAC   /* invalidation event upper address register */
+#define    DMAR_ICS_REG    0x9C    /* invalidation completion status register */
 #define    DMAR_IRTA_REG   0xB8    /* intr remap */
 
 #define OFFSET_STRIDE        (9)
@@ -165,6 +170,10 @@
 /* FECTL_REG */
 #define DMA_FECTL_IM (((u64)1) << 31)
 
+/* IECTL_REG */
+#define DMA_IECTL_IM (((u64)1) << 31)
+
+
 /* FSTS_REG */
 #define DMA_FSTS_PFO ((u64)1 << 0)
 #define DMA_FSTS_PPF ((u64)1 << 1)
@@ -515,6 +524,7 @@ struct iommu {
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     struct msi_desc msi;
+    struct msi_desc qi_msi;
     struct intel_iommu *intel;
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
-- 
1.8.3.2


* [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (2 preceding siblings ...)
  2015-09-16 13:23 ` [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt Quan Xu
@ 2015-09-16 13:23 ` Quan Xu
  2015-09-16  9:33   ` Julien Grall
  2015-09-29  9:24   ` Jan Beulich
  2015-09-16 13:23 ` [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler Quan Xu
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Update the invalidation table's count of in-flight Device-TLB invalidation
requests and assign the address of the global polling parameter per domain in
the Status Address of each invalidation wait descriptor, when submitting
Device-TLB invalidation requests.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.h  |  2 ++
 xen/drivers/passthrough/vtd/qinval.c | 24 ++++++++++++++++++++----
 xen/include/xen/hvm/iommu.h          | 23 +++++++++++++++++++++++
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 52d328f..f2ee56d 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -453,6 +453,8 @@ struct qinval_entry {
 /* Queue invalidation head/tail shift */
 #define QINVAL_INDEX_SHIFT 4
 
+#define QINVAL_INVALID_DEVICE_ID  ((u16)~0)
+
 #define qinval_present(v) ((v).lo & 1)
 #define qinval_fault_disable(v) (((v).lo >> 1) & 1)
 
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index b81b0bd..abe6e9c 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -130,8 +130,9 @@ static void queue_invalidate_iotlb(struct iommu *iommu,
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
+/* device_id parameter is invalid when iflag is not set. */
 static int queue_invalidate_wait(struct iommu *iommu,
-    u8 iflag, u8 sw, u8 fn)
+    u8 iflag, u8 sw, u8 fn, u16 device_id)
 {
     s_time_t start_time;
     volatile u32 poll_slot = QINVAL_STAT_INIT;
@@ -139,6 +140,7 @@ static int queue_invalidate_wait(struct iommu *iommu,
     unsigned long flags;
     u64 entry_base;
     struct qinval_entry *qinval_entry, *qinval_entries;
+    struct domain *d;
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     index = qinval_next_index(iommu);
@@ -152,9 +154,22 @@ static int queue_invalidate_wait(struct iommu *iommu,
     qinval_entry->q.inv_wait_dsc.lo.sw = sw;
     qinval_entry->q.inv_wait_dsc.lo.fn = fn;
     qinval_entry->q.inv_wait_dsc.lo.res_1 = 0;
-    qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
     qinval_entry->q.inv_wait_dsc.hi.res_1 = 0;
-    qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(&poll_slot) >> 2;
+
+    if ( iflag )
+    {
+        d = rcu_lock_domain_by_id(iommu->domid_map[device_id]);
+        if ( d == NULL )
+            return -ENODATA;
+
+        qinval_entry->q.inv_wait_dsc.lo.sdata = ++ qi_table_data(d);
+        qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(
+                                                &qi_table_pollslot(d)) >> 2;
+        rcu_unlock_domain(d);
+    } else {
+        qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
+        qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(&poll_slot) >> 2;
+    }
 
     unmap_vtd_domain_page(qinval_entries);
     qinval_update_qtail(iommu, index);
@@ -185,7 +200,8 @@ static int invalidate_sync(struct iommu *iommu)
     struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
 
     if ( qi_ctrl->qinval_maddr )
-        return queue_invalidate_wait(iommu, 0, 1, 1);
+        return queue_invalidate_wait(iommu, 0, 1, 1,
+                                     QINVAL_INVALID_DEVICE_ID);
     return 0;
 }
 
diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
index 106e08f..28e7fc3 100644
--- a/xen/include/xen/hvm/iommu.h
+++ b/xen/include/xen/hvm/iommu.h
@@ -23,6 +23,21 @@
 #include <xen/list.h>
 #include <asm/hvm/iommu.h>
 
+/*
+ * Status Address and Data: Status address and data is used by hardware to perform
+ * wait descriptor completion status write when the Status Write(SW) field is Set.
+ *
+ * Track the Device-TLB invalidation status in an invalidation table. Update
+ * the invalidation table's count of in-flight Device-TLB invalidation requests and
+ * assign the address of the global polling parameter per domain in the Status Address
+ * of each invalidation wait descriptor, when submitting Device-TLB invalidation
+ * requests.
+ */
+struct qi_talbe {
+    u64 qi_table_poll_slot;
+    u32 qi_table_status_data;
+};
+
 struct hvm_iommu {
     struct arch_hvm_iommu arch;
 
@@ -34,6 +49,9 @@ struct hvm_iommu {
     struct list_head dt_devices;
 #endif
 
+    /* IOMMU Queued Invalidation(QI) */
+    struct qi_talbe talbe;
+
     /* Features supported by the IOMMU */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 };
@@ -41,4 +59,9 @@ struct hvm_iommu {
 #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
 #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
 
+#define qi_table_data(d) \
+    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_status_data)
+#define qi_table_pollslot(d) \
+    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
+
 #endif /* __XEN_HVM_IOMMU_H__ */
-- 
1.8.3.2


* [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (3 preceding siblings ...)
  2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
@ 2015-09-16 13:23 ` Quan Xu
  2015-09-29  9:33   ` Jan Beulich
  2015-09-16 13:23 ` [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in Quan Xu
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

if it has no in-flight Device-TLB invalidation requests.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index db6e3a2..0e912fb 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1098,6 +1098,28 @@ static void _qi_msi_mask(struct iommu *iommu)
 
 static void _do_iommu_qi(struct iommu *iommu)
 {
+    unsigned long nr_dom, i;
+    struct domain *d = NULL;
+
+    nr_dom = cap_ndoms(iommu->cap);
+    i = find_first_bit(iommu->domid_bitmap, nr_dom);
+    while ( i < nr_dom )
+    {
+        if ( iommu->domid_map[i] > 0 )
+        {
+            d = rcu_lock_domain_by_id(iommu->domid_map[i]);
+            if ( d == NULL )
+                continue;
+
+            if ( qi_table_pollslot(d) == qi_table_data(d) )
+            {
+                qi_table_data(d) = 0;
+                qi_table_pollslot(d) = 0;
+            }
+            rcu_unlock_domain(d);
+        }
+        i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
+    }
 }
 
 static void do_iommu_qi_completion(unsigned long data)
-- 
1.8.3.2


* [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (4 preceding siblings ...)
  2015-09-16 13:23 ` [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler Quan Xu
@ 2015-09-16 13:23 ` Quan Xu
  2015-09-29  9:44   ` Jan Beulich
  2015-09-16 13:24 ` [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag Quan Xu
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

QI interrupt handler and QI startup. If the IWC field was already
Set at the time of setting this field, it is not treated as a new
interrupt condition.
In the QI interrupt handler, check the IP field of the Invalidation
Event Control Register after scanning the domain status. If the IP
field is Set, scan again instead of generating another interrupt.
Then clear the IM field of the Invalidation Event Control Register
so that the QI interrupt is not masked.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
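Note: the rescan protocol below can be modelled in isolation. A standalone
sketch, for illustration only (the IWC/IP edge semantics follow the VT-d
spec; all names and the simulated race are made up). Clearing IWC *before*
scanning is what guarantees that a completion racing with the scan
re-raises IP and causes a rescan instead of being lost:

    #include <stdbool.h>
    #include <stdio.h>

    static bool iwc, ip;           /* modelled ICS.IWC and IECTL.IP bits */
    static int pending;            /* completions the scan would observe */
    static bool race_once = true;

    static void hw_complete(void)  /* hardware: wait descriptor done */
    {
        pending++;
        if ( !iwc )
            iwc = ip = true;       /* only a 0 -> 1 IWC edge raises IP */
    }

    static void qi_handler(void)   /* software: mirrors _do_iommu_qi() */
    {
        do {
            iwc = ip = false;      /* qi_clear_iwc() first, then scan */
            while ( pending )
                pending--;         /* "scan domain status" */
            if ( race_once )       /* one completion races with the scan */
            {
                race_once = false;
                hw_complete();
            }
        } while ( ip );            /* _qi_msi_ip(): scan again */
    }

    int main(void)
    {
        hw_complete();
        qi_handler();
        printf("left unscanned: %d\n", pending);   /* prints 0 */
        return 0;
    }
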
 xen/drivers/passthrough/vtd/iommu.c | 59 +++++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.h |  6 ++++
 2 files changed, 65 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 0e912fb..e3acea5 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1070,6 +1070,27 @@ static hw_irq_controller dma_msi_type = {
 };
 
 /* IOMMU Queued Invalidation(QI). */
+static void qi_clear_iwc(struct iommu *iommu)
+{
+    unsigned long flags;
+
+    spin_lock_irqsave(&iommu->register_lock, flags);
+    dmar_writel(iommu->reg, DMAR_ICS_REG, RW1CS);
+    spin_unlock_irqrestore(&iommu->register_lock, flags);
+}
+
+static int _qi_msi_ip(struct iommu *iommu)
+{
+    u32 sts;
+    unsigned long flags;
+
+    /* Get IP bit of DMAR_IECTL_REG. */
+    spin_lock_irqsave(&iommu->register_lock, flags);
+    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
+    spin_unlock_irqrestore(&iommu->register_lock, flags);
+    return (sts & DMA_IECTL_IP);
+}
+
 static void _qi_msi_unmask(struct iommu *iommu)
 {
     u32 sts;
@@ -1101,6 +1122,14 @@ static void _do_iommu_qi(struct iommu *iommu)
     unsigned long nr_dom, i;
     struct domain *d = NULL;
 
+scan_again:
+    /*
+     * If the IWC field in the Invalidation Completion Status register was already
+     * Set at the time of setting this field, it is not treated as a new interrupt
+     * condition.
+     */
+    qi_clear_iwc(iommu);
+
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
@@ -1120,6 +1149,28 @@ static void _do_iommu_qi(struct iommu *iommu)
         }
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
+
+    /*
+     * IP (interrupt pending) is bit 30 of the Invalidation Event Control
+     * Register. The IP field is kept Set by hardware while the interrupt
+     * message is held pending. The IP field is cleared by hardware as soon
+     * as the interrupt message pending condition is serviced. IP can be
+     * cleared due to either:
+     *
+     * - Clearing the IM field in the Invalidation Event Control Register.
+     *   A QI interrupt is generated along with clearing the IP field.
+     * - Clearing the IWC field in the Invalidation Completion Status register.
+     *
+     * If IP is Set, scan again instead of generating another interrupt.
+     */
+    if ( _qi_msi_ip(iommu) )
+        goto scan_again;
+
+    /*
+     * Leave the QI interrupt unmasked: when a QI interrupt event condition
+     * is detected, hardware issues an interrupt message.
+     */
+    _qi_msi_unmask(iommu);
 }
 
 static void do_iommu_qi_completion(unsigned long data)
@@ -1154,6 +1205,14 @@ static void qi_msi_mask(struct irq_desc *desc)
 
 static unsigned int qi_msi_startup(struct irq_desc *desc)
 {
+    struct iommu *iommu = desc->action->dev_id;
+
+    /*
+     * If the IWC field in the Invalidation Completion Status register was already
+     * Set at the time of setting this field, it is not treated as a new interrupt
+     * condition.
+     */
+    qi_clear_iwc(iommu);
     qi_msi_unmask(desc);
     return 0;
 }
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index f2ee56d..e6278ee 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -54,6 +54,11 @@
 #define    DMAR_ICS_REG    0x9C    /* invalidation completion status register */
 #define    DMAR_IRTA_REG   0xB8    /* intr remap */
 
+/*
+ * Register Attributes.
+ */
+#define RW1CS  1  /* A status bit may be cleared by writing a 1. */
+
 #define OFFSET_STRIDE        (9)
 #define dmar_readl(dmar, reg) readl((dmar) + (reg))
 #define dmar_readq(dmar, reg) readq((dmar) + (reg))
@@ -172,6 +177,7 @@
 
 /* IECTL_REG */
 #define DMA_IECTL_IM (((u64)1) << 31)
+#define DMA_IECTL_IP (((u64)1) << 30)
 
 
 /* FSTS_REG */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag.
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (5 preceding siblings ...)
  2015-09-16 13:23 ` [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Completion Status Register in Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16  9:34   ` Julien Grall
  2015-09-16 13:24 ` [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to Quan Xu
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

The qi_flag is Set when submitting Device-TLB invalidation requests.
The qi_flag will be Cleared in the QI interrupt handler once there are
no in-flight Device-TLB invalidation requests.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
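Note: the intended lifecycle of the flag, condensed from this patch and its
context (a sketch only):

    /* Submit side, queue_invalidate_wait(): */
    qinval_entry->q.inv_wait_dsc.lo.sdata = ++qi_table_data(d);
    QI_FLUSHING(d) = 1;

    /* Completion side, QI interrupt handler: */
    if ( qi_table_pollslot(d) == qi_table_data(d) )
        QI_FLUSHING(d) = 0;
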
 xen/drivers/passthrough/vtd/iommu.c  | 1 +
 xen/drivers/passthrough/vtd/qinval.c | 1 +
 xen/include/xen/hvm/iommu.h          | 3 +++
 3 files changed, 5 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index e3acea5..fda9a84 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1144,6 +1144,7 @@ scan_again:
             {
                 qi_table_data(d) = 0;
                 qi_table_pollslot(d) = 0;
+                QI_FLUSHING(d) = 0;
             }
             rcu_unlock_domain(d);
         }
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index abe6e9c..0d85ce7 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -165,6 +165,7 @@ static int queue_invalidate_wait(struct iommu *iommu,
         qinval_entry->q.inv_wait_dsc.lo.sdata = ++ qi_table_data(d);
         qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(
                                                 &qi_table_pollslot(d)) >> 2;
+        QI_FLUSHING(d) = 1;
         rcu_unlock_domain(d);
     } else {
         qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
index 28e7fc3..e838905 100644
--- a/xen/include/xen/hvm/iommu.h
+++ b/xen/include/xen/hvm/iommu.h
@@ -51,6 +51,7 @@ struct hvm_iommu {
 
     /* IOMMU Queued Invalidation(QI) */
     struct qi_talbe talbe;
+    bool_t qi_flag;
 
     /* Features supported by the IOMMU */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
@@ -63,5 +64,7 @@ struct hvm_iommu {
     (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_status_data)
 #define qi_table_pollslot(d) \
     (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
+#define QI_FLUSHING(d) \
+    (d->arch.hvm_domain.hvm_iommu.qi_flag)
 
 #endif /* __XEN_HVM_IOMMU_H__ */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (6 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16  9:44   ` Julien Grall
  2015-09-16 13:24 ` [Patch RFC 08/13] vt-d: Hold the freed page until the Device-TLB flush is completed Quan Xu
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

enter guest mode and are put into the SCHEDOP_yield list.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
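Note: the entry.S hunk below is equivalent to the following C pseudo-gate
on the vmexit-to-vmentry path (a sketch; vmentry_gate is a made-up name):

    static void vmentry_gate(struct vcpu *v)
    {
        struct domain *d = v->domain;

        if ( QI_FLUSHING(d) )    /* cmp $0,QI_flag(%rax) */
            do_qi_flushing(d);   /* SCHEDOP_yield, then retry the vmentry */
    }
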
 xen/arch/x86/hvm/vmx/entry.S      | 10 ++++++++++
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/common/domain.c               |  5 +++++
 xen/include/xen/hvm/iommu.h       |  2 ++
 4 files changed, 18 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 2a4ed57..53a4c58 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -66,6 +66,10 @@ ENTRY(vmx_asm_vmexit_handler)
         cmp  %ecx,(%rdx,%rax,1)
         jnz  .Lvmx_process_softirqs
 
+        mov  VCPU_domain(%rbx),%rax
+        cmp  $0,QI_flag(%rax)
+        jne  .Lqi_flushing
+
         cmp  %cl,VCPU_vmx_emulate(%rbx)
         jne .Lvmx_goto_emulator
         cmp  %cl,VCPU_vmx_realmode(%rbx)
@@ -125,3 +129,9 @@ ENTRY(vmx_asm_do_vmentry)
         sti
         call do_softirq
         jmp  .Lvmx_do_vmentry
+
+.Lqi_flushing:
+        sti
+        mov %rax,%rdi
+        call do_qi_flushing
+        jmp  .Lvmx_do_vmentry
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 447c650..d26b026 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -116,6 +116,7 @@ void __dummy__(void)
     BLANK();
 
     OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
+    OFFSET(QI_flag, struct domain, arch.hvm_domain.hvm_iommu.qi_flag);
     BLANK();
 
     OFFSET(VMCB_rax, struct vmcb_struct, rax);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 1b9fcfc..1f62e3b 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1479,6 +1479,11 @@ int continue_hypercall_on_cpu(
     return 0;
 }
 
+void do_qi_flushing(struct domain *d)
+{
+    do_sched_op(SCHEDOP_yield, guest_handle_from_ptr(NULL, void));
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
index e838905..e40fc7b 100644
--- a/xen/include/xen/hvm/iommu.h
+++ b/xen/include/xen/hvm/iommu.h
@@ -57,6 +57,8 @@ struct hvm_iommu {
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 };
 
+void do_qi_flushing(struct domain *d);
+
 #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
 #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
 
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 08/13] vt-d: Hold the freed page until the Device-TLB flush is completed.
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (7 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16  9:45   ` Julien Grall
  2015-09-16 13:24 ` [Patch RFC 09/13] vt-d: Put the page in Queued Invalidation(QI) interrupt handler if Quan Xu
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

The page freed from the domain should be held until the
Device-TLB flush is completed. The page previously associated
with the freed portion of GPA should not be reallocated for
another purpose until the appropriate invalidations have been
performed. Otherwise, the original page owner can still access
the freed page through DMA.

Hold the page until the Device-TLB flush is completed:
  - Unlink the page from the original owner.
  - Remove the page from the page_list of the domain.
  - Decrease the total page count of the domain.
  - Add the page to qi_hold_page_list.

The page will be put in the Queued Invalidation(QI) interrupt
handler once the Device-TLB flush is completed.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
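Note: the caller pattern this converges on (it lands in patch 10/13), as a
sketch:

    if ( need_iommu(d) && QI_FLUSHING(d) && !d->is_dying )
        qi_hold_page(d, page);   /* parked; put_page() later from QI handler */
    else
        put_page(page);
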
 xen/drivers/passthrough/vtd/iommu.c | 35 +++++++++++++++++++++++++++++++++++
 xen/include/xen/hvm/iommu.h         |  8 ++++++++
 2 files changed, 43 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index fda9a84..5c03e41 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1117,6 +1117,39 @@ static void _qi_msi_mask(struct iommu *iommu)
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
+/*
+ * The page freed from the domain should be held until the
+ * Device-TLB flush is completed. The page previously associated
+ * with the freed portion of GPA should not be reallocated for
+ * another purpose until the appropriate invalidations have been
+ * performed. Otherwise, the original page owner can still access
+ * the freed page through DMA.
+ *
+ * Hold the page until the Device-TLB flush is completed:
+ *   - Unlink the page from the original owner.
+ *   - Remove the page from the page_list of the domain.
+ *   - Decrease the total page count of the domain.
+ *   - Add the page to qi_hold_page_list.
+ *
+ * The page will be put in the Queued Invalidation(QI) interrupt
+ * handler once the Device-TLB flush is completed.
+ */
+void qi_hold_page(struct domain *d, struct page_info *pg)
+{
+    spin_lock(&d->page_alloc_lock);
+    page_set_owner(pg, NULL);
+    page_list_del(pg, &d->page_list);
+    d->tot_pages--;
+    spin_unlock(&d->page_alloc_lock);
+
+    INTEL_IOMMU_DEBUG("IOMMU: Hold on page mfn : %"PRIx64"\n",
+                      page_to_mfn(pg));
+
+    spin_lock(&qi_page_lock(d));
+    page_list_add_tail(pg, &qi_hold_page_list(d));
+    spin_unlock(&qi_page_lock(d));
+}
+
 static void _do_iommu_qi(struct iommu *iommu)
 {
     unsigned long nr_dom, i;
@@ -1449,6 +1482,8 @@ static int intel_iommu_domain_init(struct domain *d)
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     hd->arch.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
+    INIT_PAGE_LIST_HEAD(&qi_hold_page_list(d));
+    spin_lock_init(&qi_page_lock(d));
 
     return 0;
 }
diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
index e40fc7b..5dc0033 100644
--- a/xen/include/xen/hvm/iommu.h
+++ b/xen/include/xen/hvm/iommu.h
@@ -53,11 +53,15 @@ struct hvm_iommu {
     struct qi_talbe talbe;
     bool_t qi_flag;
 
+    struct page_list_head qi_hold_page_list;
+    spinlock_t qi_lock;
+
     /* Features supported by the IOMMU */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 };
 
 void do_qi_flushing(struct domain *d);
+void qi_hold_page(struct domain *d, struct page_info *pg);
 
 #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
 #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
@@ -68,5 +72,9 @@ void do_qi_flushing(struct domain *d);
     (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
 #define QI_FLUSHING(d) \
     (d->arch.hvm_domain.hvm_iommu.qi_flag)
+#define qi_hold_page_list(d) \
+    (d->arch.hvm_domain.hvm_iommu.qi_hold_page_list)
+#define qi_page_lock(d) \
+    (d->arch.hvm_domain.hvm_iommu.qi_lock)
 
 #endif /* __XEN_HVM_IOMMU_H__ */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 09/13] vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (8 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 08/13] vt-d: Hold the freed page until the Device-TLB flush is completed Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16 13:24 ` [Patch RFC 10/13] vt-d: Hold the removed page until the Device-TLB flush is completed Quan Xu
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

the Device-TLB flush is completed.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
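Note: this is the release half of qi_hold_page() from patch 08/13. The
drain, condensed (a sketch):

    spin_lock(&qi_page_lock(d));
    while ( (page = page_list_remove_head(&qi_hold_page_list(d))) != NULL )
        put_page(page);
    spin_unlock(&qi_page_lock(d));
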
 xen/drivers/passthrough/vtd/iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 5c03e41..1297dea 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1154,6 +1154,7 @@ static void _do_iommu_qi(struct iommu *iommu)
 {
     unsigned long nr_dom, i;
     struct domain *d = NULL;
+    struct page_info *page = NULL;
 
 scan_again:
     /*
@@ -1177,6 +1178,15 @@ scan_again:
             {
                 qi_table_data(d) = 0;
                 qi_table_pollslot(d) = 0;
+                spin_lock(&qi_page_lock(d));
+                while ( (page = page_list_remove_head(
+                                &qi_hold_page_list(d))) )
+                {
+                    INTEL_IOMMU_DEBUG("IOMMU:  Put page mfn : %"PRIx64"\n",
+                                      page_to_mfn(page));
+                    put_page(page);
+                }
+                spin_unlock(&qi_page_lock(d));
                 QI_FLUSHING(d) = 0;
             }
             rcu_unlock_domain(d);
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 10/13] vt-d: Hold the removed page until the Device-TLB flush is completed.
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (9 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 09/13] vt-d: Put the page in Queued Invalidation(QI) interrupt handler if Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16  9:52   ` Julien Grall
  2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/common/memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 61bb94c..4b2def5 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -253,7 +253,21 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
 
     guest_physmap_remove_page(d, gmfn, mfn, 0);
 
-    put_page(page);
+#ifdef HAS_PASSTHROUGH
+    /*
+     * The page freed from the domain should be held until the
+     * Device-TLB flush is completed. The page previously associated
+     * with the freed portion of GPA should not be reallocated for
+     * another purpose until the appropriate invalidations have been
+     * performed. Otherwise, the original page owner can still access
+     * the freed page through DMA.
+     */
+    if ( need_iommu(d) && QI_FLUSHING(d) && !d->is_dying )
+        qi_hold_page(d, page);
+    else
+#endif
+        put_page(page);
+
     put_gfn(d, gmfn);
 
     return 1;
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (10 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 10/13] vt-d: Hold the removed page until the Device-TLB flush is completed Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16  9:56   ` Julien Grall
  2015-09-23 17:38   ` Konrad Rzeszutek Wilk
  2015-09-16 13:24 ` [Patch RFC 12/13] vt-d: For gnttab_transfer, If the Device-TLB flush is still Quan Xu
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

destroying a virtual machine, schedule and wait on a waitqueue
until the Device-TLB flush is completed.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
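Note: the two halves of the handshake added below pair up as follows (a
sketch):

    /* domain_destroy(): */
    wait_for_qi_flushing(d);    /* wait_event(qi_wq(d), !QI_FLUSHING(d)) */

    /* QI interrupt handler, once nothing is left in flight: */
    QI_FLUSHING(d) = 0;
    wake_up_all(&qi_wq(d));     /* the waiter re-checks !QI_FLUSHING(d) */
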
 xen/common/domain.c                 | 10 ++++++++++
 xen/drivers/passthrough/vtd/iommu.c |  9 +++++++++
 xen/include/xen/hvm/iommu.h         |  6 ++++++
 3 files changed, 25 insertions(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 1f62e3b..8ccc1a5 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -867,6 +867,16 @@ void domain_destroy(struct domain *d)
     rcu_assign_pointer(*pd, d->next_in_hashbucket);
     spin_unlock(&domlist_update_lock);
 
+#ifdef HAS_PASSTHROUGH
+    /*
+     * If the Device-TLB flush is still not completed, schedule
+     * and wait on a waitqueue until the Device-TLB flush is
+     * completed.
+     */
+    if ( need_iommu(d) && QI_FLUSHING(d) )
+        wait_for_qi_flushing(d);
+#endif
+
     /* Schedule RCU asynchronous completion of domain destroy. */
     call_rcu(&d->rcu, complete_domain_destroy);
 }
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1297dea..3d98fea 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1070,6 +1070,11 @@ static hw_irq_controller dma_msi_type = {
 };
 
 /* IOMMU Queued Invalidation(QI). */
+void wait_for_qi_flushing(struct domain *d)
+{
+    wait_event(qi_wq(d), !QI_FLUSHING(d));
+}
+
 static void qi_clear_iwc(struct iommu *iommu)
 {
     unsigned long flags;
@@ -1188,6 +1193,7 @@ scan_again:
                 }
                 spin_unlock(&qi_page_lock(d));
                 QI_FLUSHING(d) = 0;
+                wake_up_all(&qi_wq(d));
             }
             rcu_unlock_domain(d);
         }
@@ -1494,6 +1500,7 @@ static int intel_iommu_domain_init(struct domain *d)
     hd->arch.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     INIT_PAGE_LIST_HEAD(&qi_hold_page_list(d));
     spin_lock_init(&qi_page_lock(d));
+    init_waitqueue_head(&qi_wq(d));
 
     return 0;
 }
@@ -1925,6 +1932,8 @@ static void iommu_domain_teardown(struct domain *d)
     if ( list_empty(&acpi_drhd_units) )
         return;
 
+    destroy_waitqueue_head(&qi_wq(d));
+
     list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list )
     {
         list_del(&mrmrr->list);
diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
index 5dc0033..f661c8c 100644
--- a/xen/include/xen/hvm/iommu.h
+++ b/xen/include/xen/hvm/iommu.h
@@ -20,6 +20,7 @@
 #define __XEN_HVM_IOMMU_H__
 
 #include <xen/iommu.h>
+#include <xen/wait.h>
 #include <xen/list.h>
 #include <asm/hvm/iommu.h>
 
@@ -56,12 +57,15 @@ struct hvm_iommu {
     struct page_list_head qi_hold_page_list;
     spinlock_t qi_lock;
 
+    struct waitqueue_head qi_wq;
+
     /* Features supported by the IOMMU */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 };
 
 void do_qi_flushing(struct domain *d);
 void qi_hold_page(struct domain *d, struct page_info *pg);
+void wait_for_qi_flushing(struct domain *d);
 
 #define iommu_set_feature(d, f)   set_bit((f), domain_hvm_iommu(d)->features)
 #define iommu_clear_feature(d, f) clear_bit((f), domain_hvm_iommu(d)->features)
@@ -76,5 +80,7 @@ void qi_hold_page(struct domain *d, struct page_info *pg);
     (d->arch.hvm_domain.hvm_iommu.qi_hold_page_list)
 #define qi_page_lock(d) \
     (d->arch.hvm_domain.hvm_iommu.qi_lock)
+#define qi_wq(d) \
+    (d->arch.hvm_domain.hvm_iommu.qi_wq)
 
 #endif /* __XEN_HVM_IOMMU_H__ */
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 12/13] vt-d: For gnttab_transfer, If the Device-TLB flush is still
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (11 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-16 13:24 ` [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor when submitting Device-TLB Quan Xu
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

not completed when mapping the transferred page to a remote
domain, schedule and wait on a waitqueue until the Device-TLB
flush is completed.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
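Note: the ordering this enforces, condensed (a sketch): the page is handed
to the remote domain only once no stale Device-TLB entry can still
reference it:

    guest_physmap_remove_page(d, gop.mfn, mfn, 0);
    gnttab_flush_tlb(d);
    if ( QI_FLUSHING(d) )
        wait_for_qi_flushing(d);   /* no stale ATS entries remain */
    /* ... only now look up the destination domain and assign the page */
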
 xen/common/grant_table.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index f2ed64a..9bf2009 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1808,6 +1808,22 @@ gnttab_transfer(
         guest_physmap_remove_page(d, gop.mfn, mfn, 0);
         gnttab_flush_tlb(d);
 
+#ifdef HAS_PASSTHROUGH
+        /*
+         * The page freed from the domain should be held until the
+         * Device-TLB flush is completed. The page previously associated
+         * with the freed portion of GPA should not be reallocated for
+         * another purpose until the appropriate invalidations have been
+         * performed. Otherwise, the original page owner can still access
+         * the freed page through DMA.
+         *
+         * If the Device-TLB flush is still not completed, schedule and
+         * wait on a waitqueue until the Device-TLB flush is completed.
+         */
+        if ( QI_FLUSHING(d) )
+            wait_for_qi_flushing(d);
+#endif
+
         /* Find the target domain. */
         if ( unlikely((e = rcu_lock_domain_by_id(gop.domid)) == NULL) )
         {
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor when submitting Device-TLB
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (12 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 12/13] vt-d: For gnttab_transfer, If the Device-TLB flush is still Quan Xu
@ 2015-09-16 13:24 ` Quan Xu
  2015-09-29  9:46   ` Jan Beulich
  2015-09-17  3:26 ` [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Xu, Quan
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:24 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

invalidation requests. If the IF bit is Set, an interrupt-based mechanism
is used to track the Device-TLB invalidation requests, instead of polling
to detect whether hardware has completed the Device-TLB invalidation.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
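Note: the submit-side split this introduces, condensed (a sketch):

    if ( flush_dev_iotlb )                   /* ATS: may take up to a minute */
        rc = invalidate_async(iommu, did);   /* IF=1 wait descriptor, no spin */
    else
        rc = invalidate_sync(iommu);         /* polled, as before */
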
 xen/drivers/passthrough/vtd/qinval.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 0d85ce7..b330d02 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -176,7 +176,15 @@ static int queue_invalidate_wait(struct iommu *iommu,
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
-    /* Now we don't support interrupt method */
+    /*
+     * If iflag is Set, the interrupt based mechanism is used to track the
+     * Device-TLB invalidation status; do not poll to detect whether the
+     * hardware has completed the Device-TLB invalidation after submitting
+     * the invalidation requests.
+     */
+    if ( iflag )
+        return 0;
+
     if ( sw )
     {
         /* In case all wait descriptor writes to same addr with same data */
@@ -322,6 +330,15 @@ static int flush_context_qi(
     return ret;
 }
 
+static int invalidate_async(struct iommu *iommu, u16 device_id)
+{
+    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+
+    if ( qi_ctrl->qinval_maddr )
+        return queue_invalidate_wait(iommu, 1, 1, 1, device_id);
+    return 0;
+}
+
 static int flush_iotlb_qi(
     void *_iommu, u16 did,
     u64 addr, unsigned int size_order, u64 type,
@@ -360,8 +377,13 @@ static int flush_iotlb_qi(
                                type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
                                dw, did, size_order, 0, addr);
         if ( flush_dev_iotlb )
+        {
             ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
-        rc = invalidate_sync(iommu);
+            rc = invalidate_async(iommu, did);
+        } else {
+            rc = invalidate_sync(iommu);
+        }
+
         if ( !ret )
             ret = rc;
     }
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 10:46 ` Ian Jackson
  2015-09-16 11:22   ` Julien Grall
@ 2015-09-16 13:33   ` Xu, Quan
  1 sibling, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-09-16 13:33 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Dong, Eddie, tim, jbeulich, Nakajima, Jun, Zhang, Yang Z,
	xen-devel



> -----Original Message-----
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: Wednesday, September 16, 2015 6:47 PM
> To: Xu, Quan
> Cc: andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com;
> xen-devel@lists.xen.org
> Subject: Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS
> Device
> 
> Quan Xu writes ("[Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS
> Device"):
> > Introduction
> > ============
> 
> Thanks for your submission.
> 
> JOOI why did you CC me?  I did a quick scan of these patches and they don't
> seem to have any tools impact.  I would prefer not to be CC'd unless there is a
> reason why my attention would be valuable.

Ian,
Thanks for your quick response!
For patch 11 and patch 12, I got your email address from the scripts/get_maintainer.pl tool. 

-Quan

> 
> Regards,
> Ian.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-09-16  9:33   ` Julien Grall
@ 2015-09-16 13:43     ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-09-16 13:43 UTC (permalink / raw)
  To: Julien Grall
  Cc: tim, Tian, Kevin, keir, Dong, Eddie, Nakajima, Jun,
	andrew.cooper3, ian.jackson, xen-devel, george.dunlap, jbeulich,
	Zhang, Yang Z, ian.campbell



> -----Original Message-----
> From: Julien Grall [mailto:julien.grall@citrix.com]
> Sent: Wednesday, September 16, 2015 5:34 PM
> To: Xu, Quan; andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation
> status in an invalidation table.
> 
> Hi Quan,
> 
> The time of the mail is in a future. Can you configure your mail to report the
> correct time?


Yes, I should set the time. Thanks for your quick response.

> 
> On 16/09/2015 14:23, Quan Xu wrote:
> > diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> > index 106e08f..28e7fc3 100644
> > --- a/xen/include/xen/hvm/iommu.h
> > +++ b/xen/include/xen/hvm/iommu.h
> > @@ -23,6 +23,21 @@
> >   #include <xen/list.h>
> >   #include <asm/hvm/iommu.h>
> >
> > +/*
> > + * Status Address and Data: Status address and data is used by
> > +hardware to perform
> > + * wait descriptor completion status write when the Status Write(SW) field is
> Set.
> > + *
> > + * Track the Device-TLB invalidation status in an invalidation table.
> > +Update
> > + * invalidation table's count of in-flight Device-TLB invalidation
> > +request and
> > + * assign the address of global polling parameter per domain in the
> > +Status Address
> > + * of each invalidation wait descriptor, when submit Device-TLB
> > +invalidation
> > + * requests.
> > + */
> > +struct qi_talbe {
> 
> Did you want to say table rather than talbe?

Yes, it is 'table'. Thanks. I will correct it in the next version.

> 
> > +    u64 qi_table_poll_slot;
> > +    u32 qi_table_status_data;
> > +};
> > +
> >   struct hvm_iommu {
> >       struct arch_hvm_iommu arch;
> >
> > @@ -34,6 +49,9 @@ struct hvm_iommu {
> >       struct list_head dt_devices;
> >   #endif
> >
> > +    /* IOMMU Queued Invalidation(QI) */
> > +    struct qi_talbe talbe;
> > +
> 
> This header is should contain any common code between ARM and x86.
> Although, this feature seems to be vtd only (i.e x86).
> 
> So this should be moved in arch_hvm_iommu defined in asm-x86/hvm/iommu.h.
> 
> You would then be able to access the data using
> domain_hvm_iommu(d)->arch.field
> 

I think it doesn't look good. Let me redefine it.

> 
> >       /* Features supported by the IOMMU */
> >       DECLARE_BITMAP(features, IOMMU_FEAT_count);
> >   };
> > @@ -41,4 +59,9 @@ struct hvm_iommu {
> >   #define iommu_set_feature(d, f)   set_bit((f),
> domain_hvm_iommu(d)->features)
> >   #define iommu_clear_feature(d, f) clear_bit((f),
> domain_hvm_iommu(d)->features)
> >
> > +#define qi_table_data(d) \
> > +    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_status_data)
> > +#define qi_table_pollslot(d) \
> > +    (d->arch.hvm_domain.hvm_iommu.talbe.qi_table_poll_slot)
> 
> The way to access the iommu data on ARM and x86 are different. Please
> use domain_hvm_iommu(d)->field if you keep these fields in common code.
>

Ditto.

Julien, thanks for your review.


Quan
 
> > +
> >   #endif /* __XEN_HVM_IOMMU_H__ */
> >
> 
> Regards,
> 
> --
> Julien Grall

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 11:22   ` Julien Grall
@ 2015-09-16 13:47     ` Ian Jackson
  2015-09-17  9:06       ` Julien Grall
  0 siblings, 1 reply; 84+ messages in thread
From: Ian Jackson @ 2015-09-16 13:47 UTC (permalink / raw)
  To: Julien Grall
  Cc: kevin.tian, keir, Quan Xu, george.dunlap, andrew.cooper3,
	eddie.dong, tim, jbeulich, jun.nakajima, yang.z.zhang, xen-devel,
	ian.campbell

Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> On 16/09/15 11:46, Ian Jackson wrote:
> > JOOI why did you CC me?  I did a quick scan of these patches and they
> > don't seem to have any tools impact.  I would prefer not to be CC'd
> > unless there is a reason why my attention would be valuable.
> 
> The common directory is maintained by "THE REST" group. From the
> MAINTAINERS file you are part of it.

Ah.  Hmmm.  You mean xen/common/ ?

I don't consider myself qualified to review that.  I think the
MAINTAINERS file should have an entry for xen/ but it doesn't seem to.

Ian.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  2015-09-16  9:44   ` Julien Grall
@ 2015-09-16 14:03     ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-09-16 14:03 UTC (permalink / raw)
  To: Julien Grall
  Cc: tim, Tian, Kevin, keir, Dong, Eddie, Nakajima, Jun,
	andrew.cooper3, ian.jackson, xen-devel, george.dunlap, jbeulich,
	Zhang, Yang Z, ian.campbell



> -----Original Message-----
> From: Julien Grall [mailto:julien.grall@citrix.com]
> Sent: Wednesday, September 16, 2015 5:44 PM
> To: Xu, Quan; andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's
> vCPUs are not allowed to
> 
> Hi Quan,
> 
> On 16/09/2015 14:24, Quan Xu wrote:
> > diff --git a/xen/arch/x86/x86_64/asm-offsets.c
> > b/xen/arch/x86/x86_64/asm-offsets.c
> > index 447c650..d26b026 100644
> > --- a/xen/arch/x86/x86_64/asm-offsets.c
> > +++ b/xen/arch/x86/x86_64/asm-offsets.c
> > @@ -116,6 +116,7 @@ void __dummy__(void)
> >       BLANK();
> >
> >       OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
> > +    OFFSET(QI_flag, struct domain,
> > + arch.hvm_domain.hvm_iommu.qi_flag);
> >       BLANK();
> >
> >       OFFSET(VMCB_rax, struct vmcb_struct, rax); diff --git
> > a/xen/common/domain.c b/xen/common/domain.c index 1b9fcfc..1f62e3b
> > 100644
> > --- a/xen/common/domain.c
> > +++ b/xen/common/domain.c
> > @@ -1479,6 +1479,11 @@ int continue_hypercall_on_cpu(
> >       return 0;
> >   }
> >
> > +void do_qi_flushing(struct domain *d) {
> > +    do_sched_op(SCHEDOP_yield, guest_handle_from_ptr(NULL, void));
> 
> SCHEDOP_yield is a wrapper to vcpu_yield() that would be called by the guest.
> 
> It would be simpler to use the latter. You may even be able to call it directly from
> the assembly code rather than introducing a wrapper.
> 

Agreed.
I will test it. If it works, I will modify it in the next version. 


> If not, this function should go in x86 specific code (maybe arch/x86/domain.c ?)
> 
> 
> > +}
> > +
> >   /*
> >    * Local variables:
> >    * mode: C
> > diff --git a/xen/include/xen/hvm/iommu.h b/xen/include/xen/hvm/iommu.h
> > index e838905..e40fc7b 100644
> > --- a/xen/include/xen/hvm/iommu.h
> > +++ b/xen/include/xen/hvm/iommu.h
> > @@ -57,6 +57,8 @@ struct hvm_iommu {
> >       DECLARE_BITMAP(features, IOMMU_FEAT_count);
> >   };
> >
> > +void do_qi_flushing(struct domain *d);
> > +
> 
> If you declare the function in file.c you should add the prototype in
> file.h.
> 
> I.e. as you defined the function in common/domain.c, the prototype should
> go in xen/domain.h.

In general, I should define these functions/macros for x86 only. 
Thanks Julien.

Quan

> 
> >   #define iommu_set_feature(d, f)   set_bit((f),
> domain_hvm_iommu(d)->features)
> >   #define iommu_clear_feature(d, f) clear_bit((f),
> domain_hvm_iommu(d)->features)
> >
> >
> 
> Regards,
> 
> --
> Julien Grall

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (13 preceding siblings ...)
  2015-09-16 13:24 ` [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB Quan Xu
@ 2015-09-17  3:26 ` Xu, Quan
  2015-09-21  8:51   ` Jan Beulich
  2015-09-21 14:09 ` Xu, Quan
  2015-10-12  1:42 ` Zhang, Yang Z
  16 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-17  3:26 UTC (permalink / raw)
  To: andrew.cooper3, Dong, Eddie, ian.campbell, ian.jackson, jbeulich,
	Nakajima, Jun, keir, Tian, Kevin, tim, Zhang, Yang Z,
	george.dunlap
  Cc: xen-devel



> -----Original Message-----
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> 
> Introduction
> ============
> 
>    VT-d code currently has a number of cases where completion of certain
> operations is being waited for by way of spinning. The majority of instances use
> that variable indirectly through the IOMMU_WAIT_OP() macro, allowing for loops of
> up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this
> may be acceptable, the invalidation case seems particularly problematic.
> 
> Currently the hypervisor polls the status address of the wait descriptor for up
> to 1 second to get the invalidation flush result. When the invalidation queue
> includes a Device-TLB invalidation, using 1 second is a mistake in the
> invalidation sync: the 1 second timeout relates to response times of the IOMMU
> engine, not to Device-TLB invalidation with PCI-e Address Translation Services
> (ATS) in use. The ATS specification mandates a timeout of 1 _minute_ for cache
> flush. The ATS case needs to be taken into consideration when doing
> invalidations. Obviously we can't spin for a minute, so invalidation absolutely
> needs to be converted to a non-spinning model.
> 
>    Also I should fix the new memory security issue.
> The page freed from the domain should be held until the Device-TLB flush is
> completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not be
> reallocated for another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the freed page
> through DMA.
> 
> Why RFC
> =======
>     Patch 0001--0005, 0013 are IOMMU related.
>     Patch 0006 is about new flag (vCPU / MMU related).
>     Patch 0007 is vCPU related.
>     Patch 0008--0012 are MMU related.
> 
>     1. Xen MMU is very complicated. Could Xen MMU experts help me verify
>        whether I have covered all of the cases?
> 
>     2. For gnttab_transfer, if the Device-TLB flush is still not completed
>        when mapping the transferred page to a remote domain, schedule and
>        wait on a waitqueue until the Device-TLB flush is completed. Is it
>        correct?
> 
>        (I have tested the waitqueue in decrease_reservation() [the
>        do_memory_op() hypercall]: I woke up the domain (with only one vCPU)
>        with the debug-key tool, and the domain is still working after
>        waiting 60s in a waitqueue.)


Much more information:
   I ran a service in this domain and tested this waitqueue case. The domain is still working after 60s, but it prints out a Call Trace in dmesg:

[  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]
[  161.978621] Modules linked in: crct10dif_pclmul(F) crc32_pclmul(F) joydev(F) ghash_clmulni_intel(F) cryptd(F) xen_kbdfront(F) microcode(F) cirrus ttm drm_kms_helper drm psmouse(F) serio_raw(F) syscopyarea(F) sysfillrect(F) sysimgblt(F) i2c_piix4 ext2(F) mac_hid lp(F) parport(F) myri10ge dca floppy(F)
[  161.978626] CPU: 0 PID: 272 Comm: kworker/0:1 Tainted: GF            3.11.0-12-generic #19-Ubuntu
[  161.978628] Hardware name: Xen HVM domU, BIOS 4.6.0-rc 08/03/2015
[  161.978638] Workqueue: events balloon_process
[  161.978640] task: ffff88007a1b4650 ti: ffff88007a1f2000 task.ti: ffff88007a1f2000
[  161.978650] RIP: 0010:[<ffffffff81001185>]  [<ffffffff81001185>] xen_hypercall_memory_op+0x5/0x20
[  161.978652] RSP: 0018:ffff88007a1f3d60  EFLAGS: 00000246
[  161.978653] RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000001568
[  161.978654] RDX: 0000000000000000 RSI: ffff88007a1f3d70 RDI: 0000000000000041
[  161.978656] RBP: ffff88007a1f3db8 R08: ffffffff81d04888 R09: 000000000006a5dd
[  161.978657] R10: 0000000000003690 R11: ffff88007f7fa750 R12: ffff880036671000
[  161.978658] R13: ffffffff810c6176 R14: ffff88007a1f3d20 R15: 0000000000000000
[  161.978660] FS:  0000000000000000(0000) GS:ffff88007be00000(0000) knlGS:0000000000000000
[  161.978661] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.978662] CR2: 00007f8d3e97e000 CR3: 0000000001c0e000 CR4: 00000000001406f0
[  161.978669] Stack:
[  161.978673]  ffffffff8141a16a 00000000365b0048 ffffffff81fb1520 00000000000000ff
[  161.978676]  0000000000000000 0000000000007ff0 ffffffff81c97301 0000160000000000
[  161.978678]  ffff88007be13e00 ffff88007be17e00 0000000000000000 ffff88007a1f3e20
[  161.978679] Call Trace:
[  161.978684]  [<ffffffff8141a16a>] ? decrease_reservation+0x29a/0x2e0
[  161.978688]  [<ffffffff8141a513>] balloon_process+0x333/0x430
[  161.978695]  [<ffffffff8107d05c>] process_one_work+0x17c/0x430
[  161.978699]  [<ffffffff8107dcac>] worker_thread+0x11c/0x3c0
[  161.978702]  [<ffffffff8107db90>] ? manage_workers.isra.24+0x2a0/0x2a0
[  161.978710]  [<ffffffff810847b0>] kthread+0xc0/0xd0
[  161.978715]  [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120
[  161.978722]  [<ffffffff816f516c>] ret_from_fork+0x7c/0xb0
[  161.978727]  [<ffffffff810846f0>] ? kthread_create_on_node+0x120/0x120 


Thanks 
Quan

> 
> 
> Design Overview
> ===============
> 
> This design implements a non-spinning model for Device-TLB invalidation -- using
> an interrupt based mechanism. Track the Device-TLB invalidation status in an
> invalidation table per-domain. The invalidation table keeps the count of in-flight
> Device-TLB invalidation requests, and also provides a global polling parameter per
> domain for in-flight Device-TLB invalidation requests.
> Update the invalidation table's count of in-flight Device-TLB invalidation
> requests and assign the address of the global polling parameter per domain in
> the Status Address of each invalidation wait descriptor when submitting
> Device-TLB invalidation requests.
> 
> For example:
>   .
> 
> |invl |
> |wait |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> 
> |invl |
> |wait |  Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> 
> |invl |
> |wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> 
> More information about VT-d Invalidation Wait Descriptor, please refer to
> 
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
> Status Address and Data: Status address and data is used by hardware to
>                          perform the wait descriptor completion status write
>                          when the Status Write (SW) field is Set. Hardware
>                          behavior is undefined if the Status Address is in
>                          the interrupt address range (0xFEEX_XXXX etc.). The
>                          Status Address and Data fields are ignored by
>                          hardware when the Status Write field is Clear.
> 
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. In the invalidation interrupt handler,
> it will schedule a softirq to do the following check:
> 
>   if invalidation table's count of in-flight Device-TLB invalidation requests ==
> polling parameter:
>     This domain has no in-flight Device-TLB invalidation requests.
>   else
>     This domain has in-flight Device-TLB invalidation requests.
> 
> Track domain Status:
>    The vCPU is NOT allowed to enter guest mode and is put into the
> SCHEDOP_yield list if it has in-flight Device-TLB invalidation requests.
> 
> Memory security issue:
>     In case PCI-e Address Translation Services (ATS) is in use, the ATS spec
> mandates a timeout of 1 minute for cache flush.
>     The page freed from the domain should be held until the Device-TLB
> flush is completed. The page previously associated with the freed portion of
> GPA should not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner can still
> access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the page_list of domain.
>       - Decrease the total pages count of domain.
>       - Add the page to qi_hold_page_list.
> 
>     *Put the page in the Queued Invalidation(QI) interrupt handler once the
> Device-TLB flush is completed.
> 
> Invalidation Fault:
> A fault event will be generated if an invalidation failed. We can disable the
> devices.
> 
> For Context Invalidation and IOTLB invalidation without Device-TLB
> invalidation, Queued Invalidation(QI) submits invalidation requests as before
> (this is a tradeoff, as the cost of the interrupt is overhead; it will be
> modified in a coming patch series).
> 
> More details
> ============
> 
> 1. invalidation table. We define qi_table structure per domain.
> +struct qi_talbe {
> +    u64 qi_table_poll_slot;
> +    u32 qi_table_status_data;
> +};
> 
> @ struct hvm_iommu {
> +    /* IOMMU Queued Invalidation(QI) */
> +    struct qi_talbe talbe;
> }
> 
> 2. Modification to Device IOTLB invalidation:
>     - Enable interrupt notification when hardware completes the
>       invalidations: Set the FN, IF and SW bits in the Invalidation Wait
>       Descriptor. The reason why the SW bit is also set is that the
>       notification interrupt is global, not per domain, so we still need to
>       poll the status address in the QI interrupt handler to know which
>       Device-TLB invalidation request is completed.
>     - A new per-domain flag (*qi_flag) is used to track the status of
>       Device-TLB invalidation requests. The *qi_flag will be set before
>       submitting the Device-TLB invalidation requests. The vCPU is NOT
>       allowed to enter guest mode and is put into the SCHEDOP_yield list
>       if the *qi_flag is Set.
>     - New logic to synchronize:
>         if no Device-TLB invalidation:
>             Back to the current invalidation logic.
>         else
>             Set the IF, SW, FN bits in the wait descriptor and prepare the
>             Status Data.
>             Set *qi_flag.
>             Put the domain in the pending flush list (the vCPU is NOT
>             allowed to enter guest mode and is put into SCHEDOP_yield if
>             the *qi_flag is Set.)
>         Return
> 
> More information about VT-d Invalidation Wait Descriptor, please refer to
> 
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>    SW: Indicate the invalidation wait descriptor completion by performing a
>        coherent DWORD write of the value in the Status Data field to the
>        address specified in the Status Address.
>    FN: Indicate that the descriptors following the invalidation wait
>        descriptor must be processed by hardware only after the invalidation
>        wait descriptor completes.
>    IF: Indicate the invalidation wait descriptor completion by generating an
>        invalidation completion event per the programming of the Invalidation
>        Completion Event Registers.
> 
> 3. Modification to domain running lifecycle:
>     - When the *qi_flag is set, the domain is not allowed to enter guest
>       mode and is put into the SCHEDOP_yield list if there are in-flight
>       Device-TLB invalidation requests.
> 
> 4. New interrupt handler for invalidation completion:
>     - When hardware completes the Device-TLB invalidation requests, it
>       generates an interrupt to notify the hypervisor.
>     - In the interrupt handler, schedule a tasklet to handle it.
>     - The tasklet handles the following:
>         *Clear the IWC field in the Invalidation Completion Status register.
>          If the IWC field in the Invalidation Completion Status register was
>          already Set at the time of setting this field, it is not treated as
>          a new interrupt condition.
>         *Scan the domain list (the domains with VT-d passthrough devices;
>          scan 'iommu->domid_bitmap'):
>                 for each domain:
>                 check the invalidation table values (qi_table_poll_slot and
>                 qi_table_status_data) of each domain.
>                 if equal:
>                    Put the on-hold pages.
>                    Clear the invalidation table.
>                    Clear *qi_flag.
> 
>         *If the IP field of the Invalidation Event Control Register is Set,
>          try to *Clear IWC and *Scan the domain list again, instead of
>          generating another interrupt.
>         *Clear the IM field of the Invalidation Event Control Register.
> 
> ((
>   Logic of IWC / IP / IM as below:
> 
>         Interrupt condition (an invalidation wait descriptor
>         with the Interrupt Flag (IF) field Set completed)
>                            ||
>                            v
>          ----------------(IWC)----------------
>    (IWC is Set)                (IWC is not Set)
>         ||                           ||
>         V                            ||
>   (Not treated as a new              ||
>    interrupt condition)              V
>                                (Set IWC / IP)
>                                      ||
>                                      V
>          -----------------(IM)----------------
>    (IM is Set)                   (IM not Set)
>         ||                           ||
>         ||                           V
>         ||            (cause interrupt message /
>         ||             then hardware clears IP)
>         V
>   (interrupt is held pending; clearing IM causes
>    the interrupt message)
> 
> * If the IWC field is cleared, the IP field is cleared.
> ))
> 
> 5. Invalidation failure:
>     - A fault event will be generated if an invalidation failed. We can
>       disable the devices if we receive an invalidation fault event.
> 
> 6. Memory security issue:
> 
>     The page freed from the domain should be held until the Device-TLB
> flush is completed. The page previously associated with the freed portion of
> GPA should not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner can still
> access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the page_list of domain.
>       - Decrease the total pages count of domain.
>       - Add the page to qi_hold_page_list.
> 
>   *Put the page in the Queued Invalidation(QI) interrupt handler once the
> Device-TLB flush is completed.
> 
> 
> ----
> There are 3 reasons to submit device-TLB invalidation requests:
>     *VT-d initialization.
>     *Reassign device ownership.
>     *Memory modification.
> 
> 6.1 *VT-d initialization
>     When VT-d is initializing, there is no guest domain running. So no memory
> security issue.
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init(
> |)
>                                               |--init_vtd_hw()
> 6.2 *Reassign device ownership
>     Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op()
> and arch_do_domctl().
> While the *qi_flag is Set, the domain is not allowed to enter guest mode. If
> the appropriate invalidations have not been performed yet, the *qi_flag is
> still Set, and these devices are not ready for guest domains to launch DMA.
> So with the *qi_flag introduced, there is no memory security issue.
> 
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_dsi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> |-iommu_flush_iotlb_psi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> 6.3 *Memory modification.
> When memory is modified, there are a lot of invoker flows that update the
> EPT, but not all of them update the IOMMU page tables. The IOMMU page tables
> are updated only when all of the following three conditions are met:
>   * P2M is hostp2m. ( p2m_is_hostp2m(p2m) )
>   * Previous mfn is not equal to new mfn. (prev_mfn != new_mfn)
>   * This domain needs IOMMU. (need_iommu(d))
> 
> ##
> |--iommu_pte_flush()--ept_set_entry()
> 
> #PoD(populate on demand) is not supported while IOMMU passthrough is
> enabled. So ignore PoD invoker flow below.
>       |--p2m_pod_zero_check_superpage()  ...
>       |--p2m_pod_zero_check()  ...
>       |--p2m_pod_demand_populate()  ...
>       |--p2m_pod_decrease_reservation()  ...
>       |--guest_physmap_mark_populate_on_demand() ...
> 
> #Xen paging is not supported while IOMMU passthrough is enabled. So ignore
> Xen paging invoker flow below.
>       |--p2m_mem_paging_evict() ...
>       |--p2m_mem_paging_resume()...
>       |--p2m_mem_paging_prep()...
>       |--p2m_mem_paging_populate()...
>       |--p2m_mem_paging_nominate()...
>       |--p2m_alloc_table()--shadow_enable()
> --paging_enable()--shadow_domctl() --paging_domctl()--arch_do_domctl()
> --do_domctl()
> 
> |--paging_domctl_continuation()
> 
> #Xen sharing is not supported while IOMMU passthrough is enabled. So ignore
> the Xen sharing invoker flow below.
>       |--set_shared_p2m_entry()...
> 
> 
> #Domain is paused, the domain can't launch DMA.
>       |--relinquish_shared_pages()--domain_relinquish_resources( case
> RELMEM_shared: ) --domain_kill()--do_domctl()
> 
> #The below p2m is not hostp2m. It is L2 to L0. So ignore invoker flow below.
>       |--nestedhap_fix_p2m() --nestedhvm_hap_nested_page_fault()
> --hvm_hap_nested_page_fault()
> --ept_handle_violation()--vmx_vmexit_handler()
> 
> #If prev_mfn == new_mfn, it will not update IOMMU page tables. So ignore
> invoker flow below.
>       |--p2m_mem_access_check()-- hvm_hap_nested_page_fault()
> --ept_handle_violation()--vmx_vmexit_handler()(L1 --> L0 / but just only check
> p2m_type_t)
>       |--p2m_set_mem_access() ...
>       |--guest_physmap_mark_populate_on_demand() ...
>       |--p2m_change_type_one() ...
> # The previous page is not put and allocated for Xen or other guest domains. So
> there is no memory security issue. Ignore invoker flow below.
>    |--p2m_remove_page()--guest_physmap_remove_page() ...
> 
>    |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
>                            |--map_mmio_regions()--do_domctl()
> 
> 
> # Hold the pages which are removed in guest_remove_page(), and put them in the
> QI interrupt handler when there are no in-flight Device-TLB invalidation requests.
> 
> |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
> 
> |--xenmem_add_to_physmap_one() --xenmem_add_to_physmap()
> /xenmem_add_to_physmap_batch()  .. --do_memory_op()
>                                                |--p2m_add_foreign() --
> xenmem_add_to_physmap_one() ..--do_memory_op()
> 
> |--guest_physmap_add_entry()--create_grant_p2m_mapping()  ...
> --do_grant_table_op()
> 
> ((
>    Much more explanation:
>    Actually, these pages may be mapped from the Xen heap for guest
> domains in decrease_reservation() / xenmem_add_to_physmap_one()
>    / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4
> functions map Xen heap pages for guest domains:
>           * share page for xen Oprofile.
>           * vLAPIC mapping.
>           * grant table shared page.
>           * domain share_info page.
> ))
> 
> # For grant_unmap*: ignore it at this point, as we can hold the page when the
> domain frees the xenballooned page.
> 
> 
> |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
>                    --gnttab_unmap_grant_ref()--do_grant_table_op()
> 
> |--__gnttab_unmap_and_replace() -- gnttab_unmap_and_replace()
> --do_grant_table_op()
> 
> # For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.
> 
> # For grant_transfer:
>   |--p2m_remove_page()--guest_physmap_remove_page()
>                                                  |--gnttab_transfer() ...
> --do_grant_table_op()
> 
>     If the Device-TLB flush has still not completed when the transferred page
> is to be mapped into a remote domain,
>     schedule and wait on a waitqueue until the Device-TLB flush is completed.
> 
>    Plan B:
>    ((If the Device-TLB flush has still not completed before adding the
> transferred page to the target domain,
>    allocate a new page for the target domain and hold the old page,
> which will be put in the QI interrupt
>    handler when there are no in-flight Device-TLB invalidation requests.))
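For illustration, the waitqueue variant could be as small as the sketch below,
assuming the per-domain qi_wq waitqueue added later in this series and a
hypothetical qi_flag_is_set() accessor (this is not the actual patch):

    /* In gnttab_transfer(), before mapping the page into the remote
     * domain: sleep until the QI handler clears qi_flag and wakes us. */
    struct hvm_iommu *hd = domain_hvm_iommu(d);

    wait_event(hd->qi_wq, !qi_flag_is_set(d));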
> 
> 
> Quan Xu (13):
>   vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
>   vt-d: Register MSI for async invalidation completion interrupt.
>   vt-d: Track the Device-TLB invalidation status in an invalidation table.
>   vt-d: Clear invalidation table in invaidation interrupt handler
>   vt-d: Clear the IWC field of Invalidation Event Control Register in
>   vt-d: Introduce a new per-domain flag - qi_flag.
>   vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
>   vt-d: Held on the freed page until the Device-TLB flush is completed.
>   vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
>   vt-d: Held on the removed page until the Device-TLB flush is completed.
>   vt-d: If the Device-TLB flush is still not completed when
>   vt-d: For gnttab_transfer, If the Device-TLB flush is still
>   vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
> 
>  xen/arch/x86/hvm/vmx/entry.S         |  10 ++
>  xen/arch/x86/x86_64/asm-offsets.c    |   1 +
>  xen/common/domain.c                  |  15 ++
>  xen/common/grant_table.c             |  16 ++
>  xen/common/memory.c                  |  16 +-
>  xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h  |  18 +++
>  xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
>  xen/include/xen/hvm/iommu.h          |  42 +++++
>  9 files changed, 443 insertions(+), 16 deletions(-)
> 
> --
> 1.8.3.2

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 13:47     ` Ian Jackson
@ 2015-09-17  9:06       ` Julien Grall
  2015-09-17 10:16         ` Ian Jackson
  0 siblings, 1 reply; 84+ messages in thread
From: Julien Grall @ 2015-09-17  9:06 UTC (permalink / raw)
  To: Ian Jackson
  Cc: kevin.tian, keir, Quan Xu, george.dunlap, andrew.cooper3,
	eddie.dong, tim, jbeulich, jun.nakajima, yang.z.zhang, xen-devel,
	ian.campbell



On 16/09/2015 14:47, Ian Jackson wrote:
> Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
>> On 16/09/15 11:46, Ian Jackson wrote:
>>> JOOI why did you CC me ?  I did a quick scan of these patches and they
>>> don't seem to have any tools impact.  I would prefer not to be CC'd
>>> unless there is a reason why my attention would be valueable.
>>
>> The common directory is maintained by "THE REST" group. From the
>> MAINTAINERS file you are part of it.
>
> Ah.  Hmmm.  You mean xen/common/ ?

Right.


> I don't consider myself qualified to review that.  I think the
> MAINTAINERS file should have an entry for xen/ but it doesn't seem to.

I think such a patch should come from you.

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-17  9:06       ` Julien Grall
@ 2015-09-17 10:16         ` Ian Jackson
  0 siblings, 0 replies; 84+ messages in thread
From: Ian Jackson @ 2015-09-17 10:16 UTC (permalink / raw)
  To: Julien Grall
  Cc: kevin.tian, keir, Quan Xu, george.dunlap, andrew.cooper3,
	eddie.dong, tim, jbeulich, jun.nakajima, yang.z.zhang, xen-devel,
	ian.campbell

Julien Grall writes ("Re: [Xen-devel] [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device"):
> On 16/09/2015 14:47, Ian Jackson wrote:
> > I don't consider myself qualified to review that.  I think the
> > MAINTAINERS file should have an entry for xen/ but it doesn't seem to.
> 
> I think a such patch should come from you.

You're probably right.  Just sent.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-17  3:26 ` [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Xu, Quan
@ 2015-09-21  8:51   ` Jan Beulich
  2015-09-21  9:46     ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-09-21  8:51 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima,
	Yang Z Zhang

>>> On 17.09.15 at 05:26, <quan.xu@intel.com> wrote:
> Much more information:
>    If I run a service in this domain and tested this waitqueue case. The 
> domain is still working after 60s, but It prints out Call Trace with $dmesg:
> 
> [  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! [kworker/0:1:272]

Not sure what you meant to tell us with that (it clearly means
there's a bug somewhere in the series you're testing):
- Drop the current series and wait for an update?
- A request for help? If so, I don't think there's much to be said
  based on just the kernel soft lockup detector output.
- Anything else?

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-21  8:51   ` Jan Beulich
@ 2015-09-21  9:46     ` Xu, Quan
  2015-09-21 12:03       ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-21  9:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Zhang,
	Yang Z

Thanks Jan.

>> >>> On 21.09.15 at 16:51, < JBeulich@suse.com > wrote:
>>> On 17.09.15 at 05:26, <quan.xu@intel.com> wrote:
> Much more information:
>    If I run a service in this domain and tested this waitqueue case. 
> The domain is still working after 60s, but It prints out Call Trace with $dmesg:
> 
> [  161.978599] BUG: soft lockup - CPU#0 stuck for 57s! 
> [kworker/0:1:272]

>Not sure what you meant to tell us with that (it clearly means there's a bug somewhere in the series you're testing):
>- Drop the current series and wait for an update?

No.

>- A request for help? If so, I don't think there's much to be said
  based on just the kernel soft lockup detector output.
>- Anything else?


I just tested the extreme case. The ATS specification mandates a timeout of 1 _minute_ for cache flush, even though the flush does not normally take that long.
In my design, if the Device-TLB flush has not completed, the domain's vCPUs are not allowed to enter guest mode (patch #7); otherwise, the logic is not correct.



~Quan


>Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-21  9:46     ` Xu, Quan
@ 2015-09-21 12:03       ` Jan Beulich
  2015-09-21 14:03         ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-09-21 12:03 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima,
	YangZ Zhang

>>> On 21.09.15 at 11:46, <quan.xu@intel.com> wrote:
>>> >>> On 21.09.15 at 16:51, < JBeulich@suse.com > wrote:
>>- Anything else?
> 
> 
> Just test the extreme case. The ATS specification mandates a timeout of 1 
> _minute_ for cache flush, even though it doesn't take so much time for cache 
> flush.
> In my design, if the Device-TLB is not completed, the domain's vCPUs are not 
> allowed entry guest mode (patch #7), otherwise, the logic is not correct.

Hmm, that would be a serious limitation, and I can't immediately
see a reason: Taking this with virtualization removed, I don't
think such an invalidation would stall a physical CPU for a whole
minute. Sadly I also can't immediately think of a solution, but I
guess first of all I'll have to take a look at the series (which
unfortunately may take a few days to get to).

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-21 12:03       ` Jan Beulich
@ 2015-09-21 14:03         ` Xu, Quan
  2015-09-21 14:20           ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-21 14:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	tim, xen-devel, Nakajima, Jun, Zhang, Yang Z


>>> On 21.09.15 at 20:04, < JBeulich@suse.com > wrote:
> >>> On 21.09.15 at 11:46, <quan.xu@intel.com> wrote:
> >>> >>> On 21.09.15 at 16:51, < JBeulich@suse.com > wrote:
> >>- Anything else?
> >
> >
> > Just test the extreme case. The ATS specification mandates a timeout
> > of 1 _minute_ for cache flush, even though it doesn't take so much
> > time for cache flush.
> > In my design, if the Device-TLB is not completed, the domain's vCPUs
> > are not allowed entry guest mode (patch #7), otherwise, the logic is not
> correct.
> 
> Hmm, that would be a serious limitation, and I can't immediately see a reason:
> Taking this with virtualization removed, I don't think such an invalidation would
> stall a physical CPU for a whole minute. Sadly I also can't immediately think of a
> solution, but I guess first of all I'll have to take a look at the series (which
> unfortunately may take a few days to get to).
> 
> Jan

Jan, 

Thanks for your review. Whatever the comments are, I will reply to your emails ASAP.
 
-Quan 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (14 preceding siblings ...)
  2015-09-17  3:26 ` [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Xu, Quan
@ 2015-09-21 14:09 ` Xu, Quan
  2015-09-23 16:26   ` Tim Deegan
  2015-10-12  1:42 ` Zhang, Yang Z
  16 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-21 14:09 UTC (permalink / raw)
  To: george.dunlap, tim
  Cc: Tian, Kevin, keir, ian.campbell, andrew.cooper3, Dong, Eddie,
	xen-devel, jbeulich, Nakajima, Jun, Zhang, Yang Z

George / Tim,
Could you help me review these memory patches? Thanks!

-Quan



> -----Original Message-----
> From: Xu, Quan
> Sent: Wednesday, September 16, 2015 9:24 PM
> To: andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com
> Cc: xen-devel@lists.xen.org; Xu, Quan
> Subject: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
> 
> Introduction
> ============
> 
>    VT-d code currently has a number of cases where completion of certain
> operations is being waited for by way of spinning. The majority of instances use
> that timeout indirectly through the IOMMU_WAIT_OP() macro, allowing for loops of
> up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this
> may be acceptable, the invalidation case seems particularly problematic.
> 
> Currently the hypervisor polls the status address of the wait descriptor for up
> to 1 second to get the invalidation flush result. When the invalidation queue
> includes Device-TLB invalidation, using 1 second is a mistake in the
> invalidation sync: that 1 second timeout is sized for response times of the
> IOMMU engine itself, not for Device-TLB invalidation with PCI-e Address
> Translation Services (ATS) in use. The ATS specification mandates a timeout of
> 1 _minute_ for cache flush. The ATS case needs to be taken into consideration
> when doing invalidations. Obviously we can't spin for a minute, so invalidation
> absolutely needs to be converted to a non-spinning model.
> 
>    Also I should fix the new memory security issue.
> The page freed from the domain should be held until the Device-TLB flush is
> completed (ATS timeout of 1 _minute_).
> The page previously associated with the freed portion of GPA should not be
> reallocated for another purpose until the appropriate invalidations have been
> performed. Otherwise, the original page owner can still access the freed page
> through DMA.
> 
> Why RFC
> =======
>     Patch 0001--0005, 0013 are IOMMU related.
>     Patch 0006 is about new flag (vCPU / MMU related).
>     Patch 0007 is vCPU related.
>     Patch 0008--0012 are MMU related.
> 
>     1. Xen MMU is very complicated. Could Xen MMU experts help me verify
> whether I
>        have covered all of the cases?
> 
>     2. For gnttab_transfer: if the Device-TLB flush has still not completed
>        when the transferred page is to be mapped into a remote domain,
>        schedule and wait on a waitqueue
>        until the Device-TLB flush is completed. Is that correct?
> 
>        (I have tested the waitqueue in decrease_reservation() [do_memory_op()
> hypercall].
>         I wake the domain (with only one vCPU) up with the debug-key tool,
> and the domain
>         is still working after waiting 60s on the waitqueue.)
> 
> 
> Design Overview
> ===============
> 
> This design implements a non-spinning model for Device-TLB invalidation -- using
> an interrupt based mechanism. Track the Device-TLB invalidation status in a
> per-domain invalidation table. The invalidation table keeps the count of in-flight
> Device-TLB invalidation requests, and also provides a global polling parameter per
> domain for in-flight Device-TLB invalidation requests.
> Update the invalidation table's count of in-flight Device-TLB invalidation
> requests, and assign the address of the per-domain global polling parameter to
> the Status Address
> of each invalidation wait descriptor, when submitting invalidation requests.
> 
> For example:
>   .
>   .
> |invl |
> |wait |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> |invl |
> |wait |  Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
> |invl |
> |wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
> |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
>   .
>   .
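For illustration, filling in such a wait descriptor could look like the sketch
below; it follows the qinval_entry layout Xen's qinval.c already uses for
invalidation wait descriptors, combined with the per-domain table accessors
from patch 03 (an assumption-laden sketch, not the actual patch):

    /* Hypothetical: point the wait descriptor's status write at the
     * domain's global poll slot and record the in-flight count. */
    qinval_entry->q.inv_wait_dsc.lo.sw    = 1;            /* Status Write */
    qinval_entry->q.inv_wait_dsc.lo.iflag = 1;            /* Interrupt    */
    qinval_entry->q.inv_wait_dsc.lo.sdata = ++qi_table_data(d);
    qinval_entry->q.inv_wait_dsc.hi.saddr =
        virt_to_maddr(&qi_table_pollslot(d)) >> 2;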
> 
> More information about VT-d Invalidation Wait Descriptor, please refer to
> 
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
> Status Address and Data: Status address and data are used by hardware to
> perform the wait descriptor
>                          completion status write when the Status Write(SW)
> field is Set. Hardware behavior
>                          is undefined if the Status Address is in the
> interrupt address range (0xFEEX_XXXX etc.). The Status Address
>                          and Data fields are ignored by hardware when the
> Status Write field is Clear.
> 
> The invalidation completion event interrupt is generated only after the
> invalidation wait descriptor completes. In invalidation interrupt handler, it will
> schedule a soft-irq to do the following check:
> 
>   if invalidation table's count of in-flight Device-TLB invalidation requests ==
> polling parameter:
>     This domain has no in-flight Device-TLB invalidation requests.
>   else
>     This domain has in-flight Device-TLB invalidation requests.
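For illustration, that check maps onto the qi_table_data()/qi_table_pollslot()
accessors from patch 03 roughly as follows (a sketch; the put/clear helpers
are hypothetical):

    /* Hardware writes Status Data into the poll slot when the wait
     * descriptor completes, so equality means nothing is in flight. */
    if ( qi_table_pollslot(d) == qi_table_data(d) )
    {
        qi_put_held_pages(d);  /* release pages parked during the flush */
        /* ... then clear the invalidation table and the qi_flag ... */
    }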
> 
> Track domain status:
>    A vCPU is NOT allowed to enter guest mode, and is put into the
> SCHEDOP_yield list, while its domain has in-flight Device-TLB invalidation
> requests.
> 
> Memory security issue:
>     In the case with PCI-e Address Translation Services(ATS) in use, the ATS
> spec mandates a timeout of 1 minute for cache flush.
>     The page freed from the domain should be held until the Device-TLB
> flush is completed. The page previously associated with the freed portion of
> GPA should not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner can still
> access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the page_list of domain.
>       - Decrease the total pages count of domain.
>       - Add the page to qi_hold_page_list.
> 
>     *Put the page in the Queued Invalidation(QI) interrupt handler once the
> Device-TLB flush is completed.
> 
> Invalidation Fault:
> A fault event will be generated if an invalidation fails. We can disable the
> devices.
> 
> For Context Invalidation and IOTLB invalidation without Device-TLB invalidation,
> Queued Invalidation(QI) submits invalidation requests as before. (This is a
> tradeoff, as the cost of an interrupt is overhead. It will be revisited in a
> coming series of patches.)
> 
> More details
> ============
> 
> 1. invalidation table. We define qi_table structure per domain.
> +struct qi_talbe {
> +    u64 qi_table_poll_slot;
> +    u32 qi_table_status_data;
> +};
> 
> @ struct hvm_iommu {
> +    /* IOMMU Queued Invalidation(QI) */
> +    struct qi_talbe talbe;
> }
> 
> 2. Modification to Device IOTLB invalidation:
>     - Enable interrupt notification when hardware completes the invalidations:
>       Set the FN, IF and SW bits in the Invalidation Wait Descriptor. The
> reason the SW bit is also set is that
>       the interrupt for notification is global, not per domain.
>       So we still need to poll the status address to know which Device-TLB
> invalidation request is
>       completed in the QI interrupt handler.
>     - A new per-domain flag (*qi_flag) is used to track the status of Device-TLB
> invalidation requests.
>       The *qi_flag will be set before submitting the Device-TLB invalidation
> requests. A vCPU is NOT
>       allowed to enter guest mode, and is put into the SCHEDOP_yield list,
> if the
>       *qi_flag is Set.
>     - New synchronization logic:
>         if no Device-TLB invalidation:
>             Back to the current invalidation logic.
>         else
>             Set the IF, SW, FN bits in the wait descriptor and prepare the
> Status Data.
>             Set the *qi_flag.
>             Put the domain in the pending flush list (a vCPU is NOT allowed to
> enter guest mode, and is put into SCHEDOP_yield, while the *qi_flag is Set).
>         Return
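A rough sketch of that branch, reusing the queue_invalidate_wait() helper
already present in qinval.c (the *_async variant and set_qi_flag() are
hypothetical names for what patches 03/06 would provide):

    /* Hypothetical sync path after queueing the invalidation requests. */
    if ( !flush_dev_iotlb )
        return queue_invalidate_wait(iommu, 0, 1, 1);  /* spin as before */

    set_qi_flag(d);                  /* vCPUs must not enter guest mode */
    return queue_invalidate_wait_async(iommu, d);      /* IF/SW/FN set  */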
> 
> More information about VT-d Invalidation Wait Descriptor, please refer to
> 
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
>   6.5.2.8 Invalidation Wait Descriptor.
>    SW: Indicate the invalidation wait descriptor completion by performing a
> coherent DWORD write of the value in the Status Data field
>        to the address specified in the Status Address.
>    FN: Indicate the descriptors following the invalidation wait descriptor must be
> processed by hardware only after the invalidation
>        Wait descriptor completes.
>    IF: Indicate the invalidation wait descriptor completion by generating an
> invalidation completion event per the programming of the
>        Invalidation Completion Event Registers.
> 
> 3. Modification to the domain running lifecycle:
>     - While the *qi_flag is set, the domain is not allowed to enter guest
> mode, and is put into the SCHEDOP_yield list,
>       while there are in-flight Device-TLB invalidation requests.
> 
> 4. New interrupt handler for invalidation completion:
>     - when hardware completes the Device-TLB invalidation requests, it
> generates an interrupt to notify hypervisor.
>     - In interrupt handler, schedule a tasklet to handle it.
>     - tasklet to handle below:
>         *Clear IWC field in the Invalidation Completion Status register. If the
> IWC field in the Invalidation
>          Completion Status register was already Set at the time of setting this
> field, it is not treated as a new
>          interrupt condition.
>         *Scan the domain list. (the domain is with vt-d passthrough devices.
> scan 'iommu->domid_bitmap')
>                 for each domain:
>                 check the values invalidation table (qi_table_poll_slot and
> qi_table_status_data) of each domain.
>                 if equal:
>                    Put the on hold pages.
>                    Clear the invalidation table.
>                    Clear *qi_flag.
> 
>         *If IP field of Invalidation Event Control Register is Set, try to *Clear IWC
> and *Scan the domain list again, instead of
>          generating another interrupt.
>         *Clear IM field of Invalidation Event Control Register.
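For illustration, the tasklet could be shaped like the sketch below; the
DMAR_IECTL_REG/DMA_IECTL_IM names come from patch 02 of this series, while
DMAR_ICS_REG, DMA_ICS_IWC, DMA_IECTL_IP and the scan helper are assumed names
(this is not the actual patch):

    static void qi_completion_tasklet(unsigned long data)
    {
        struct iommu *iommu = (struct iommu *)data;
        u32 sts;

        do {
            /* Write-1-to-clear IWC; per the spec this also clears IP. */
            dmar_writel(iommu->reg, DMAR_ICS_REG, DMA_ICS_IWC);

            /* Compare each domain's poll slot against its in-flight
             * count; on a match, put held pages and clear qi_flag. */
            qi_scan_domains(iommu);

            /* If IP was raised again meanwhile, loop here rather than
             * taking another interrupt. */
            sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
        } while ( sts & DMA_IECTL_IP );

        /* Re-enable the invalidation completion interrupt (clear IM). */
        sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
        dmar_writel(iommu->reg, DMAR_IECTL_REG, sts & ~DMA_IECTL_IM);
    }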
> 
> ((
>   Logic of IWC / IP / IM as below:
> 
>                           Interrupt condition (an invalidation wait descriptor
>                           with the Interrupt Flag(IF) field Set completed)
>                                   ||
>                                    V
>            ----------------------(IWC)----------------------
>      (IWC is Set)                                (IWC is not Set)
>           ||                                            ||
>           V                                             V
> (Not treated as a new interrupt condition)        (Set IWC / IP)
>                                                         ||
>                                                          V
>                                   ---------------------(IM)---------------------
>                             (IM is Set)                            (IM not Set)
>                                  ||                                     ||
>                                  ||                                     V
>                                  ||                   (cause Interrupt message /
>                                   V                    then hardware clears IP)
>    (interrupt is held pending; clearing IM causes the interrupt message)
> 
> * If the IWC field is cleared, the IP field is also cleared.
> ))
> 
> 5. Invalidation failure.
>     - A fault event will be generated if an invalidation fails. We can disable
>       the devices if we receive an invalidation fault event.
> 
> 6. Memory security issue:
> 
>     The page freed from the domain should be held until the Device-TLB
> flush is completed. The page previously associated with the freed portion of
> GPA should not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner can still
> access the freed page through DMA.
> 
>    *Hold the page until the Device-TLB flush is completed:
>       - Unlink the page from the original owner.
>       - Remove the page from the page_list of domain.
>       - Decrease the total pages count of domain.
>       - Add the page to qi_hold_page_list.
> 
>   *Put the page in the Queued Invalidation(QI) interrupt handler once the
> Device-TLB flush is completed.
> 
> 
> ----
> There are 3 reasons to submit device-TLB invalidation requests:
>     *VT-d initialization.
>     *Reassign device ownership.
>     *Memory modification.
> 
> 6.1 *VT-d initialization
>     When VT-d is initializing, there is no guest domain running. So no memory
> security issue.
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
>                                               |--init_vtd_hw()
> 6.2 *Reassign device ownership
>     Reassign device ownership is invoked by 2 hypercalls: do_physdev_op() and
> arch_do_domctl().
> While the *qi_flag is Set, the domain is not allowed to enter guest mode. If
> the appropriate invalidations have not yet been performed, the *qi_flag is
> still Set, and these devices are not ready for guest domains to launch DMA
> with them. So if the *qi_flag is introduced, there is no memory security issue.
> 
> iotlb(iotlb/device-tlb)
> |-iommu_flush_iotlb_dsi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> |-iommu_flush_iotlb_psi()
>                        |--domain_context_mapping_one() ...
>                        |--domain_context_unmap_one() ...
> 
> 6.3 *Memory modification.
> When memory is modified, there are many invoker flows that update the EPT, but
> not all of them update the IOMMU page tables. The IOMMU page tables are only
> updated when all of the following three conditions are met:
>   * P2M is hostp2m. ( p2m_is_hostp2m(p2m) )
>   * Previous mfn is not equal to new mfn. (prev_mfn != new_mfn)
>   * This domain needs IOMMU. (need_iommu(d))
> 
> ##
> |--iommu_pte_flush()--ept_set_entry()
> 
> #PoD(populate on demand) is not supported while IOMMU passthrough is
> enabled. So ignore PoD invoker flow below.
>       |--p2m_pod_zero_check_superpage()  ...
>       |--p2m_pod_zero_check()  ...
>       |--p2m_pod_demand_populate()  ...
>       |--p2m_pod_decrease_reservation()  ...
>       |--guest_physmap_mark_populate_on_demand() ...
> 
> #Xen paging is not supported while IOMMU passthrough is enabled. So ignore
> Xen paging invoker flow below.
>       |--p2m_mem_paging_evict() ...
>       |--p2m_mem_paging_resume()...
>       |--p2m_mem_paging_prep()...
>       |--p2m_mem_paging_populate()...
>       |--p2m_mem_paging_nominate()...
>       |--p2m_alloc_table()--shadow_enable()
> --paging_enable()--shadow_domctl() --paging_domctl()--arch_do_domctl()
> --do_domctl()
> 
> |--paging_domctl_continuation()
> 
> #Xen sharing is not supported while IOMMU passthrough is enabled. So ignore
> the Xen sharing invoker flow below.
>       |--set_shared_p2m_entry()...
> 
> 
> #Domain is paused, the domain can't launch DMA.
>       |--relinquish_shared_pages()--domain_relinquish_resources( case
> RELMEM_shared: ) --domain_kill()--do_domctl()
> 
> #The below p2m is not hostp2m. It is L2 to L0. So ignore invoker flow below.
>       |--nestedhap_fix_p2m() --nestedhvm_hap_nested_page_fault()
> --hvm_hap_nested_page_fault()
> --ept_handle_violation()--vmx_vmexit_handler()
> 
> #If prev_mfn == new_mfn, it will not update IOMMU page tables. So ignore
> invoker flow below.
>       |--p2m_mem_access_check()-- hvm_hap_nested_page_fault()
> --ept_handle_violation()--vmx_vmexit_handler()(L1 --> L0 / but just only check
> p2m_type_t)
>       |--p2m_set_mem_access() ...
>       |--guest_physmap_mark_populate_on_demand() ...
>       |--p2m_change_type_one() ...
> # The previous page is not put and allocated for Xen or other guest domains. So
> there is no memory security issue. Ignore invoker flow below.
>    |--p2m_remove_page()--guest_physmap_remove_page() ...
> 
>    |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
>                            |--map_mmio_regions()--do_domctl()
> 
> 
> # Hold the pages which are removed in guest_remove_page(), and put them in the
> QI interrupt handler when there are no in-flight Device-TLB invalidation requests.
> 
> |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
> 
> |--xenmem_add_to_physmap_one() --xenmem_add_to_physmap()
> /xenmem_add_to_physmap_batch()  .. --do_memory_op()
>                                                |--p2m_add_foreign() --
> xenmem_add_to_physmap_one() ..--do_memory_op()
> 
> |--guest_physmap_add_entry()--create_grant_p2m_mapping()  ...
> --do_grant_table_op()
> 
> ((
>    Much more explanation:
>    Actually, these pages may be mapped from the Xen heap for guest
> domains in decrease_reservation() / xenmem_add_to_physmap_one()
>    / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4
> functions map Xen heap pages for guest domains:
>           * share page for xen Oprofile.
>           * vLAPIC mapping.
>           * grant table shared page.
>           * domain share_info page.
> ))
> 
> # For grant_unmap*: ignore it at this point, as we can hold the page when the
> domain frees the xenballooned page.
> 
> 
> |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
>                    --gnttab_unmap_grant_ref()--do_grant_table_op()
> 
> |--__gnttab_unmap_and_replace() -- gnttab_unmap_and_replace()
> --do_grant_table_op()
> 
> # For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.
> 
> # For grant_transfer:
>   |--p2m_remove_page()--guest_physmap_remove_page()
>                                                  |--gnttab_transfer() ...
> --do_grant_table_op()
> 
>     If the Device-TLB flush has still not completed when the transferred page
> is to be mapped into a remote domain,
>     schedule and wait on a waitqueue until the Device-TLB flush is completed.
> 
>    Plan B:
>    ((If the Device-TLB flush has still not completed before adding the
> transferred page to the target domain,
>    allocate a new page for the target domain and hold the old page,
> which will be put in the QI interrupt
>    handler when there are no in-flight Device-TLB invalidation requests.))
> 
> 
> Quan Xu (13):
>   vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
>   vt-d: Register MSI for async invalidation completion interrupt.
>   vt-d: Track the Device-TLB invalidation status in an invalidation table.
>   vt-d: Clear invalidation table in invaidation interrupt handler
>   vt-d: Clear the IWC field of Invalidation Event Control Register in
>   vt-d: Introduce a new per-domain flag - qi_flag.
>   vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
>   vt-d: Held on the freed page until the Device-TLB flush is completed.
>   vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
>   vt-d: Held on the removed page until the Device-TLB flush is completed.
>   vt-d: If the Device-TLB flush is still not completed when
>   vt-d: For gnttab_transfer, If the Device-TLB flush is still
>   vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
> 
>  xen/arch/x86/hvm/vmx/entry.S         |  10 ++
>  xen/arch/x86/x86_64/asm-offsets.c    |   1 +
>  xen/common/domain.c                  |  15 ++
>  xen/common/grant_table.c             |  16 ++
>  xen/common/memory.c                  |  16 +-
>  xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h  |  18 +++
>  xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
>  xen/include/xen/hvm/iommu.h          |  42 +++++
>  9 files changed, 443 insertions(+), 16 deletions(-)
> 
> --
> 1.8.3.2

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-21 14:03         ` Xu, Quan
@ 2015-09-21 14:20           ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-21 14:20 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	tim, xen-devel, Jun Nakajima, YangZ Zhang

>>> On 21.09.15 at 16:03, <quan.xu@intel.com> wrote:

>>>> On 21.09.15 at 20:04, < JBeulich@suse.com > wrote:
>> >>> On 21.09.15 at 11:46, <quan.xu@intel.com> wrote:
>> >>> >>> On 21.09.15 at 16:51, < JBeulich@suse.com > wrote:
>> >>- Anything else?
>> >
>> >
>> > Just test the extreme case. The ATS specification mandates a timeout
>> > of 1 _minute_ for cache flush, even though it doesn't take so much
>> > time for cache flush.
>> > In my design, if the Device-TLB is not completed, the domain's vCPUs
>> > are not allowed entry guest mode (patch #7), otherwise, the logic is not
>> correct.
>> 
>> Hmm, that would be a serious limitation, and I can't immediately see a 
> reason:
>> Taking this with virtualization removed, I don't think such an invalidation 
> would
>> stall a physical CPU for a whole minute. Sadly I also can't immediately 
> think of a
>> solution, but I guess first of all I'll have to take a look at the series 
> (which
>> unfortunately may take a few days to get to).
> 
> thanks for your review. Any comment, I will reply to your emails ASAP.

Thanks. It would however have been nice if you addressed the
implied question (regarding the reason) right away.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-21 14:09 ` Xu, Quan
@ 2015-09-23 16:26   ` Tim Deegan
  2015-09-28  3:08     ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Deegan @ 2015-09-23 16:26 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Hi,

At 14:09 +0000 on 21 Sep (1442844587), Xu, Quan wrote:
> George / Tim,
> Could you help me review these memory patches? Thanks!

The interrupt-mapping and chipset control parts of this are outside my
understanding. :)  And I'm not an x86/mm maintainer any more, but I'll
have a look:

7/13: I'm not convinced that making the vcpu spin calling
sched_yield() is a very good plan.  Better to explicitly pause the
domain if you need its vcpus not to run.  But first -- why does IOMMU
flushing mean that vcpus can't be run?

8/13: This doesn't seem like it's enough to make this safe. :(  Yes, you
can't allocate a page to another VM while there are IOTLB entries
pointing to it, but you also can't use it for other things inside the
same domain!

It might be enough, if you can argue that e.g. the IOMMU tables only
ever have mappings of pages owned by the domain, and that any other
feature that might rely on the domain's memory being made read-only
(e.g. sharing) is explicitly disallowed, but I'd want to see those
things mechanically enforced.

I think the safest answer is to make the IOMMU table take typed
refcounts to anything it points to, and only drop those refcounts when
the flush completes, but I can imagine that becoming complex.
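For illustration, the creation side of that suggestion could look like this
minimal sketch (a hypothetical wrapper; get_page_and_type() and
iommu_map_page() are existing Xen interfaces):

    /* Take a writable type ref when an IOMMU mapping is created; the
     * matching put_page_and_type() would be deferred until the flush
     * that removes the mapping completes. */
    static int iommu_map_page_with_ref(struct domain *d, unsigned long gfn,
                                       unsigned long mfn, unsigned int flags)
    {
        struct page_info *pg = mfn_to_page(mfn);

        if ( !get_page_and_type(pg, d, PGT_writable_page) )
            return -EINVAL;

        return iommu_map_page(d, gfn, mfn, flags);
    }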

You may also need to consider grant-mapped memory - you need to make
sure the grant isn't released until after the flush completes.

12/13: Ah, I see you are looking at grant table stuff, at least for
the transfer path. :)  Still the mapping path needs to be looked at.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when
  2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
  2015-09-16  9:56   ` Julien Grall
@ 2015-09-23 17:38   ` Konrad Rzeszutek Wilk
  2015-09-24  1:40     ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-23 17:38 UTC (permalink / raw)
  To: Quan Xu
  Cc: kevin.tian, keir, eddie.dong, jun.nakajima, andrew.cooper3,
	ian.jackson, tim, george.dunlap, jbeulich, yang.z.zhang,
	xen-devel, ian.campbell

> index 5dc0033..f661c8c 100644
> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -20,6 +20,7 @@
>  #define __XEN_HVM_IOMMU_H__
>  
>  #include <xen/iommu.h>
> +#include <xen/wait.h>
>  #include <xen/list.h>
>  #include <asm/hvm/iommu.h>
>  
> @@ -56,12 +57,15 @@ struct hvm_iommu {
>      struct page_list_head qi_hold_page_list;
>      spinlock_t qi_lock;
>  
> +    struct waitqueue_head qi_wq;
> +


Is there anything that mandates this must be an HVM guest? As in - can't you
have ATS with a PV guest (and do passthrough)?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when
  2015-09-23 17:38   ` Konrad Rzeszutek Wilk
@ 2015-09-24  1:40     ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-09-24  1:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tian, Kevin, keir, Dong, Eddie, Nakajima, Jun, andrew.cooper3,
	ian.jackson, tim, george.dunlap, jbeulich, Zhang, Yang Z,
	xen-devel, ian.campbell

Hi, Konrad

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Thursday, September 24, 2015 1:39 AM
> To: Xu, Quan
> Cc: andrew.cooper3@citrix.com; Dong, Eddie; ian.campbell@citrix.com;
> ian.jackson@eu.citrix.com; jbeulich@suse.com; Nakajima, Jun; keir@xen.org;
> Tian, Kevin; tim@xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not
> completed when
> 
> > index 5dc0033..f661c8c 100644
> > --- a/xen/include/xen/hvm/iommu.h
> > +++ b/xen/include/xen/hvm/iommu.h
> > @@ -20,6 +20,7 @@
> >  #define __XEN_HVM_IOMMU_H__
> >
> >  #include <xen/iommu.h>
> > +#include <xen/wait.h>
> >  #include <xen/list.h>
> >  #include <asm/hvm/iommu.h>
> >
> > @@ -56,12 +57,15 @@ struct hvm_iommu {
> >      struct page_list_head qi_hold_page_list;
> >      spinlock_t qi_lock;
> >
> > +    struct waitqueue_head qi_wq;
> > +
> 
> 
> Is there anything that mandates this must be an HVM guest? As in - can't you
> have ATS with a PV guest (and do passthrough)?

Yes, this patch set is just for HVM guests. It is complicated to support both PV guests and HVM guests.



Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-23 16:26   ` Tim Deegan
@ 2015-09-28  3:08     ` Xu, Quan
  2015-09-28  6:47       ` Jan Beulich
  2015-09-29  9:11       ` Tim Deegan
  0 siblings, 2 replies; 84+ messages in thread
From: Xu, Quan @ 2015-09-28  3:08 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Tim, thanks for your review.

>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> Hi,
> 
> At 14:09 +0000 on 21 Sep (1442844587), Xu, Quan wrote:
> > George / Tim,
> > Could you help me review these memory patches? Thanks!
> 
> The interrupt-mapping and chipset control parts of this are outside my
> understanding. :)  And I'm not an x86/mm maintainer any more, but I'll have a
> look:
> 

I've heard so much about you from my teammates. :):)


> 7/13: I'm not convinced that making the vcpu spin calling
> sched_yield() is a very good plan.  Better to explicitly pause the domain if you
> need its vcpus not to run.  But first -- why does IOMMU flushing mean that
> vcpus can't be run?
> 

We must ensure that the required Device-TLB flushes are applied before returning to guest mode via hypercall completion;
otherwise the domain could still DMA into these freed pages.
For example, suppose a do_memory_op HYPERCALL frees a pageX (gfn --- mfn) from the domain, and assume that there is
a mapping (gfn --- mfn) in the Device-TLB. Once the vCPU has returned to guest mode, the domain can still DMA to this freed pageX.
The domain kernel must not use this being-freed page; otherwise that is a domain kernel bug.

In fact, any time you want to reschedule, you need to raise SCHEDULE_SOFTIRQ, which is then checked and serviced
in do_softirq().
We could also pause the domain and unpause it in the QI interrupt handler, or block the vCPU and unblock all of the
domain's vCPUs in the QI interrupt handler. Using sched_yield() is a performance consideration.
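For illustration only, the "not allowed to enter guest mode" check could be
shaped like the sketch below (patch #7 actually hooks the assembly entry path
in entry.S; qi_flag_is_set() is a hypothetical accessor):

    /* Yield instead of resuming the vCPU while a Device-TLB flush is
     * still in flight for its domain. */
    static void qi_check_resume(struct vcpu *v)
    {
        while ( qi_flag_is_set(v->domain) )
        {
            raise_softirq(SCHEDULE_SOFTIRQ);
            do_softirq();    /* may schedule another, runnable vCPU */
        }
    }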



> 8/13: This doesn't seem like it's enough make this safe. :(  Yes, you can't allocate
> a page to another VM while there are IOTLB entries pointing to it, but you also
> can't use it for other things inside the same domain!
> 
> It might be enough, if you can argue that e.g. the IOMMU tables only ever have
> mappings of pages owned by the domain, and that any other feature that might
> rely on the daomin's memory being made read-only (e.g. sharing) is explicily
> disallowed, but I'd want to see those things mechanically enforced.
> 



As mentioned above, Device-TLB flushes are applied before returning to guest mode via hypercall completion.
If the hypercall has not completed, the freed page will not be released back to Xen. The domain kernel must not use this being-freed page; otherwise that is a domain kernel bug.
So I think it is enough.   :):)




> I think the safest answer is to make the IOMMU table take typed refcounts to
> anything it points to, and only drop those refcounts when the flush completes,
> but I can imaging that becoming complex.
> 

Yes, I also think so.
I had a similar design, taking a typed refcount and increasing the page refcount (as PGT_pinned / PGC_allocated), but I did it only in the page-free invoker flow.
I dropped this design, as I would have had to introduce another flag to be aware that the domain is assigned an ATS device (an ugly design).
I didn't make the IOMMU table take typed refcounts on everything it points to. That is really complex.

> You may also need to consider grant-mapped memory - you need to make sure
> the grant isn't released until after the flush completes.
> 
 I can introduce a new grant type for this case (such as GNTMAP_iommu_dma).
For ending foreign access, it should _only_ be used when the remote domain has unmapped the frame.
gnttab_query_foreign_access( gref ) will indicate the state of any mapping.

> 12/13: Ah, I see you are looking at grant table stuff, at least for the transfer path. :)
> Still the mapping path needs to be looked at.
> 

Yes, I think you can do it. :):)


> Cheers,
> 
> Tim.


Tim, thanks again. I really appreciate your review.


Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-28  3:08     ` Xu, Quan
@ 2015-09-28  6:47       ` Jan Beulich
  2015-09-29  2:53         ` Xu, Quan
  2015-09-29  9:11       ` Tim Deegan
  1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-09-28  6:47 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, Yang Z Zhang

>>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
>>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>> 7/13: I'm not convinced that making the vcpu spin calling
>> sched_yield() is a very good plan.  Better to explicitly pause the domain if you
>> need its vcpus not to run.  But first -- why does IOMMU flushing mean that
>> vcpus can't be run?
> 
> Ensure that the required Device-TLB flushes are applied before returning to
> guest mode via hypercall completion;
> otherwise the domain could also DMA into these freed pages.
> For example, call the do_memory_op HYPERCALL to free a pageX (gfn --- mfn) from
> the domain, and assume that there is
> a mapping (gfn --- mfn) in the Device-TLB; once the vCPU has returned to guest mode,
> the domain can still DMA to this freed pageX.
> The domain kernel must not use this being-freed page; otherwise this is a domain
> kernel bug.

It would be a guest kernel bug, but all _we_ care about is that such
a guest kernel bug won't affect the hypervisor or other guests. You
need to answer the question (perhaps just for yourself) taking into
account Tim's suggestion to hold references to all pages mapped by
the IOMMU page tables. Once you do that, I don't think there'll be
a reason to pause the guest for the duration of the flush. And really
(as pointed out before) pausing the guest would get us _far_ away
from how real hardware behaves.

The only possibly tricky thing will be how to know in the flush
completion handler which pages to drop references for, as it doesn't
look like you'd be able to put them on a list without allocating extra
memory for tracking (and allocation in turn would be bad as it can
fail).

> I didn't make the IOMMU table to take typed refcount to anything it points 
> to. This is really complex.

But unavoidable I think, and with that I'm not sure it makes a lot of
sense to do further (detailed) review of the initial version of the series.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-28  6:47       ` Jan Beulich
@ 2015-09-29  2:53         ` Xu, Quan
  2015-09-29  7:21           ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-29  2:53 UTC (permalink / raw)
  To: 'Jan Beulich', Tim Deegan
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	xen-devel, Nakajima, Jun, Zhang, Yang Z

>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:

> It would be a guest kernel bug, but all _we_ care about is that such a guest kernel
> bug won't affect the hypervisor or other guests.

It won't affect the hypervisor or other guest domains.
As long as the required Device-TLB flushes have not been applied, the hypercall does not complete. The page being freed is still owned by this buggy
guest; it is not released back to Xen or reallocated for other guests.


> You need to answer the
> question (perhaps just for yourself) taking into account Tim's suggestion to hold
> references to all pages mapped by the IOMMU page tables. 

That would be safe, but complex.
But if Tim can ack all of my memory analysis, does my solution work for upstream?


For Tim's suggestion --"to make the IOMMU table take typed refcounts to
anything it points to, and only drop those refcounts when the flush completes."

From the IOMMU point of view, we could walk the IOMMU table to get these pages and take typed refcounts.
These pages may be owned by hardware_domain, dummy, an HVM guest, etc. Could I narrow it down to HVM guests? --- i.e. not everything the table points to, but just the HVM-guest-related pages; this would simplify the design.

From the HVM guest point of view, once an ATS device is assigned, we can:
*pause the HVM guest domain.
*scan the domain's xenpage_list, page_list and arch.relmem_list to get these pages, on which typed refcounts (PGT_dev_tlb_page -- a new type) will be taken.
*unpause the HVM guest domain.

(We can ignore the domain's xenpage_list), as:
((
   Actually, these pages may be mapped from the Xen heap for guest domains in decrease_reservation() / xenmem_add_to_physmap_one()
   / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4 functions map Xen heap pages for guest domains:
          * share page for xen Oprofile.
          * vLAPIC mapping.
          * grant table shared page.
          * domain share_info page.
))


* Whenever a new page is assigned, if an ATS device is assigned, we should also take a typed refcount (PGT_dev_tlb_page).
* Whenever a page is freed and an ATS device is assigned, we should check the typed refcount (PGT_dev_tlb_page) in free_domheap_pages().
  If the type is PGT_dev_tlb_page, the page should be held on a per-domain page list and freed in the QI interrupt handler.
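For illustration, the free-path check could be as small as this sketch
(PGT_dev_tlb_page is the new type proposed just below; the hold helper named
earlier in the thread is hypothetical):

    /* Decide in free_domheap_pages() whether the page must be parked
     * until the Device-TLB flush completes. */
    static bool_t qi_page_needs_hold(const struct page_info *pg)
    {
        return (pg->u.inuse.type_info & PGT_type_mask) == PGT_dev_tlb_page;
    }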


 Just to check: do typed refcounts refer to the following?

--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -183,6 +183,7 @@ struct page_info
 #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?  */
 #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings?         */
 #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page              */
+#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
 #define PGT_type_mask     PG_mask(15, 4) /* Bits 28-31 or 60-63.           */




* I define a new type, PGT_dev_tlb_page, for these refcounts.



> Once you do that, I
> don't think there'll be a reason to pause the guest for the duration of the flush.
> And really (as pointed out before) pausing the guest would get us _far_ away
> from how real hardware behaves.
> 

Once I do that, I think the guest should still be paused if the Device-TLB flush has not completed.

As mentioned in a previous email, for example:
Call the do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is the freed portion of GPA.
Assume that there is a mapping (gfn1 <---> mfn1) in the Device-TLB. If the Device-TLB flush has not completed when we return to guest mode,
the guest may call the do_memory_op HYPERCALL to allocate a new pageY (mfn2) at gfn1.
Then:
the EPT mapping is (gfn1 <---> mfn2), while the Device-TLB mapping is still (gfn1 <---> mfn1).

If the Device-TLB flush has not completed, DMA associated with gfn1 may still write data to pageX (gfn1 <---> mfn1), but pageX will be
released to Xen when the Device-TLB flush completes. It may not be correct for the guest to read data from gfn1 after DMA (now the page associated with gfn1 is pageY).

Right?


> The only possibly tricky thing will be how to know in the flush completion handler
> which pages to drop references for, as it doesn't look like you'd be able to put
> them on a list without allocating extra memory fro tracking (and allocation in turn
> would be bad as it can fail).
> 

* Whenever a page is freed and an ATS device is assigned, we should check the typed refcount (PGT_dev_tlb_page) in free_domheap_pages().
  If the type is PGT_dev_tlb_page, the page should be held on a per-domain page list and freed in the QI interrupt handler.


> > I didn't make the IOMMU table to take typed refcount to anything it
> > points to. This is really complex.
> 
> But unavoidable I think, and with that I'm not sure it makes a lot of sense to do
> further (detailed) review of the initial version of the series.
> 

If it is unavoidable for upstream, I think patches 0001--0005 and 0013 (the IOMMU-related ones) are good. I should redesign and modify the other parts.
Jan, thanks for your help.


Quan 

> Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-29  2:53         ` Xu, Quan
@ 2015-09-29  7:21           ` Jan Beulich
  2015-09-30 13:55             ` Xu, Quan
  2015-10-13 14:29             ` Xu, Quan
  0 siblings, 2 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  7:21 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, YangZ Zhang

>>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
>>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
>> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
>> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> 
>> It would be a guest kernel bug, but all _we_ care about is that such a guest 
> kernel
>> bug won't affect the hypervisor or other guests.
> 
> It won't affect the hypervisor or other guest domains.
> As long as the required Device-TLB flushes have not been applied, the hypercall
> does not complete. The page being freed is still owned by this buggy
> guest; it is not released back to Xen or reallocated for other guests.

Seems like you misunderstood the purpose of my reply: I wasn't
claiming that what your patch set currently does would constitute an
issue. I was simply stating a general rule to consider when thinking
about which solutions are viable and which aren't.

> For Tim's suggestion --"to make the IOMMU table take typed refcounts to
> anything it points to, and only drop those refcounts when the flush 
> completes."
> 
> From the IOMMU point of view, we could walk the IOMMU table to get these
> pages and take typed refcounts.
> These pages may be owned by hardware_domain, dummy, an HVM guest, etc. Could
> I narrow it down to HVM guests? --- i.e. not everything the table points to, but just
> the HVM-guest-related pages; this would simplify the design.

I don't follow. Why would you want to walk page tables? And why
would a HVM guest have pages other than those owned by itself or
granted access to by another guest mapped in its IOMMU page
tables? In any event - the ref-counting would need to happen as
you _create_ the mappings, not at some later point.

> From the HVM guest point of view, once an ATS device is assigned, we can:
> *pause the HVM guest domain.
> *scan the domain's xenpage_list, page_list and arch.relmem_list to get these
> pages, on which typed refcounts (PGT_dev_tlb_page -- a new type) will be taken.
> *unpause the HVM guest domain.
> 
> (We can ignore the domain's xenpage_list), as:
> ((
>    Actually, these pages may be mapped from the Xen heap for guest
> domains in decrease_reservation() / xenmem_add_to_physmap_one()
>    / p2m_add_foreign(), but they are not mapped into the IOMMU table. The below 4
> functions map Xen heap pages for guest domains:
>           * share page for xen Oprofile.
>           * vLAPIC mapping.
>           * grant table shared page.
>           * domain share_info page.
> ))

Neither of which really has a need to be in the IOMMU page tables
afaics.

>  Just to check: do typed refcounts refer to the following?
> 
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -183,6 +183,7 @@ struct page_info
>  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?  */
>  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings?         */
>  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page              */
> +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB mapping?   */
>  #define PGT_type_mask     PG_mask(15, 4) /* Bits 28-31 or 60-63.           */
> 
> * I define a new type, PGT_dev_tlb_page, for these refcounts.

Why? I.e. why won't a base ref for r/o pages and a writable type-ref
for r/w ones suffice, just like we do everywhere else?

>> Once you do that, I
>> don't think there'll be a reason to pause the guest for the duration of the 
>> flush.
>> And really (as pointed out before) pausing the guest would get us _far_ away
>> from how real hardware behaves.
>> 
> 
> Once I do that, I think the guest should still be paused if the Device-TLB
> flush is not completed.
> 
> As mentioned in a previous email, for example:
> Call the do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The gfn1 is
> the freed portion of the GPA.
> Assume that there is a mapping (gfn1 <---> mfn1) in the Device-TLB. If the
> Device-TLB flush is not completed on the return to guest mode,
> the guest may call the do_memory_op HYPERCALL to allocate a new pageY (mfn2)
> at gfn1..
> then:
> the EPT mapping is (gfn1 <---> mfn2), while the Device-TLB mapping is still
> (gfn1 <---> mfn1).
> 
> If the Device-TLB flush is not completed, DMA associated with gfn1 may still
> write some data to pageX (gfn1 <---> mfn1), but pageX will be
> released to Xen when the Device-TLB flush is completed. It may not be correct
> for the guest to read data from gfn1 after the DMA (now the page associated
> with gfn1 is pageY).
> 
> Right?

No. The extra ref taken will prevent the page from getting freed. And
as long as the flush is in process, DMA to/from the page is going to
produce undefined results (affecting only the guest). But note that
there may be reasons for an entity external to the guest, invoking the
operation which ultimately led to the flush, to do this on a paused guest
only. But that's not of concern to the hypervisor side implementation.
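
To illustrate the intended lifecycle (a rough sketch only - the pending-
flush list and the structure behind "pending" are invented here):

    /* when the IOMMU mapping is created: */
    if ( !get_page(page, d) )        /* extra ref: page can't be freed */
        return -EINVAL;

    /* when the mapping is removed/overwritten: don't drop the ref yet */
    list_add_tail(&pending->list, &d->pending_flush_list);

    /* in the QI completion handler/tasklet, once the flush is done: */
    list_for_each_entry_safe ( pending, tmp, &d->pending_flush_list, list )
    {
        put_page(pending->page);     /* only now may the page be freed */
        list_del(&pending->list);
        xfree(pending);
    }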

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  2015-09-16 13:23 ` [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt Quan Xu
@ 2015-09-29  8:43   ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  8:43 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> Signed-off-by: Quan Xu <quan.xu@intel.com>

The title isn't really meaningful, and there's no description.

> @@ -1084,8 +1086,8 @@ static int __init iommu_set_interrupt(struct acpi_drhd_unit *drhd)
>      }
>  
>      desc = irq_to_desc(irq);
> -    desc->handler = &dma_msi_type;
> -    ret = request_irq(irq, 0, iommu_page_fault, "dmar", iommu);
> +    desc->handler = irq_ctrl;

I suppose (also taking the title into account) any planned second
user will also use dma_msi_type, so there's no real point in making
this a function parameter.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt.
  2015-09-16 13:23 ` [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt Quan Xu
@ 2015-09-29  8:57   ` Jan Beulich
  2015-10-10  8:22     ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  8:57 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> +/* IOMMU Queued Invalidation(QI). */
> +static void _qi_msi_unmask(struct iommu *iommu)
> +{
> +    u32 sts;
> +    unsigned long flags;
> +
> +    /* Clear IM bit of DMAR_IECTL_REG. */
> +    spin_lock_irqsave(&iommu->register_lock, flags);
> +    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
> +    sts &= ~DMA_IECTL_IM;
> +    dmar_writel(iommu->reg, DMAR_IECTL_REG, sts);
> +    spin_unlock_irqrestore(&iommu->register_lock, flags);
> +}

I think rather than duplicating all this code from the fault interrupt
you should instead re-factor to make the original usable for both
purposes. Afaics the differences really are just the register and
bit locations.
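
Something along these lines, perhaps (just a sketch - the helper name is
invented, the fault-path register/bit names are the pre-existing ones):

    static void vtd_irq_ctl_mask(struct iommu *iommu, unsigned int ctl_reg,
                                 u32 im_bit, bool_t mask)
    {
        u32 ctl;
        unsigned long flags;

        spin_lock_irqsave(&iommu->register_lock, flags);
        ctl = dmar_readl(iommu->reg, ctl_reg);
        dmar_writel(iommu->reg, ctl_reg, mask ? ctl | im_bit : ctl & ~im_bit);
        spin_unlock_irqrestore(&iommu->register_lock, flags);
    }

    /* fault path: vtd_irq_ctl_mask(iommu, DMAR_FECTL_REG, DMA_FECTL_IM, 0); */
    /* QI path:    vtd_irq_ctl_mask(iommu, DMAR_IECTL_REG, DMA_IECTL_IM, 0); */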

> +static void _do_iommu_qi(struct iommu *iommu)
> +{
> +}

???

> +static void qi_msi_unmask(struct irq_desc *desc)
> +{
> +    _qi_msi_unmask(desc->action->dev_id);
> +}
> +
> +static void qi_msi_mask(struct irq_desc *desc)
> +{
> +    _qi_msi_mask(desc->action->dev_id);
> +}

These wrappers look pretty pointless.

> +static void qi_msi_end(struct irq_desc *desc, u8 vector)
> +{
> +    ack_APIC_irq();
> +}

Why, unlike for its fault counterpart, is there no unmask
operation here?

> @@ -1123,6 +1243,7 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
>          return -ENOMEM;
>  
>      iommu->msi.irq = -1; /* No irq assigned yet. */
> +    iommu->qi_msi.irq = -1; /* No irq assigned yet. */

Which suggests that (perhaps in patch 1) the existing field should be
renamed to e.g. fe_msi.

> @@ -1985,6 +2109,9 @@ static void adjust_irq_affinity(struct acpi_drhd_unit *drhd)
>           cpumask_intersects(&node_to_cpumask(node), cpumask) )
>          cpumask = &node_to_cpumask(node);
>      dma_msi_set_affinity(irq_to_desc(drhd->iommu->msi.irq), cpumask);
> +
> +    if ( ats_enabled )
> +        qi_msi_set_affinity(irq_to_desc(drhd->iommu->qi_msi.irq), cpumask);
>  }
>  
>  int adjust_vtd_irq_affinities(void)
> @@ -2183,6 +2310,11 @@ int __init intel_vtd_setup(void)
>  
>          ret = iommu_set_interrupt(drhd, &dma_msi_type, "dmar", &drhd->iommu->msi,
>                                    iommu_page_fault);
> +        if ( ats_enabled )
> +            ret = iommu_set_interrupt(drhd, &qi_msi_type, "qi",
> +                                      &drhd->iommu->qi_msi,
> +                                      iommu_qi_completion);
> +

Would there be any harm from leaving out most/all of these
ats_enabled conditionals, despite that code right now only being intended
to be used for ATS invalidations? I.e. wouldn't it just be that the
interrupt never triggers?

> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -47,6 +47,11 @@
>  #define    DMAR_IQH_REG    0x80    /* invalidation queue head */
>  #define    DMAR_IQT_REG    0x88    /* invalidation queue tail */
>  #define    DMAR_IQA_REG    0x90    /* invalidation queue addr */
> +#define    DMAR_IECTL_REG  0xA0    /* invalidation event contrl register */
> +#define    DMAR_IEDATA_REG 0xA4    /* invalidation event data register */
> +#define    DMAR_IEADDR_REG 0xA8    /* invalidation event address register */
> +#define    DMAR_IEUADDR_REG 0xAC   /* invalidation event upper address register */
> +#define    DMAR_ICS_REG    0x9C    /* invalidation completion status register */

Numerically ordered please.
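
I.e. (also with the "contrl" typo in the comment fixed):

    #define    DMAR_ICS_REG     0x9C    /* invalidation completion status register */
    #define    DMAR_IECTL_REG   0xA0    /* invalidation event control register */
    #define    DMAR_IEDATA_REG  0xA4    /* invalidation event data register */
    #define    DMAR_IEADDR_REG  0xA8    /* invalidation event address register */
    #define    DMAR_IEUADDR_REG 0xAC    /* invalidation event upper address register */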

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-28  3:08     ` Xu, Quan
  2015-09-28  6:47       ` Jan Beulich
@ 2015-09-29  9:11       ` Tim Deegan
  2015-09-29  9:57         ` Jan Beulich
  2015-09-30 15:05         ` Xu, Quan
  1 sibling, 2 replies; 84+ messages in thread
From: Tim Deegan @ 2015-09-29  9:11 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Hi,

At 03:08 +0000 on 28 Sep (1443409723), Xu, Quan wrote:
> >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> > 7/13: I'm not convinced that making the vcpu spin calling
> > sched_yield() is a very good plan.  Better to explicitly pause the domain if you
> > need its vcpus not to run.  But first -- why does IOMMU flushing mean that
> > vcpus can't be run?
> 
> Ensure that the required Device-TLB flushes are applied before
> returning to guest mode via hypercall completion.  Otherwise the domain
> can also DMA to these freed pages.  For example, call the do_memory_op
> HYPERCALL to free a pageX (gfn --- mfn) from the domain, and assume that
> there is a mapping (gfn --- mfn) in the Device-TLB; once the vcpu has
> returned to guest mode, the domain can still DMA to this freed pageX.
> The domain kernel cannot use this page while it is being freed;
> otherwise it is a domain kernel bug.


OK - let's ignore guest kernel bugs.  IIUC you're worried about the
guest OS telling a device to issue DMA to an address that has changed
in the IOMMU tables (unmapped, remapped elsewhere, permissions
changed, &c) but not yet been flushed?

Unfortunately, pausing the guest's CPUs doesn't stop that.  A
malicious guest could enqueue network receive buffers pointing to
that address, and then arrange for a packet to arrive between the IOMMU
table change and the flush completion.

So you'll need to do something else to make the unmap safe.  The usual
method in Xen is to hold a reference to the page (for read-only
mappings) or a typed reference (for read-write), and not release that
reference until the flush has completed.  That's OK with in-line
synchronous flushes.

With the flush taking longer than Xen can wait for, you'll need to
do something more complex, e.g.:
 - keep a log of all relevant pending derefs (see the sketch after this
   list), to be processed when the flush completes; or
 - have some other method of preventing changes of ownership/type on
   the relevant pages.  E.g. for CPU TLBs, we keep a per-page counter
   (tlbflush-timestamp) that we can use to detect whether enough TLB
   flushes have happened since the page was freed.
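
For the log option, an entry might look something like this (a sketch
only - all the names are invented):

    struct pending_deref {
        struct list_head list;
        struct page_info *pg;
        bool_t writable;    /* drop the PGT_writable_page type ref too? */
    };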

The log is tricky - I'm not sure how to make sure that it has bounded
size if a flush can take seconds.

I'm not sure the counter works either -- when that detector triggers
we do a synchronous TLB-flush IPI to make the operation safe, and
that's exactly what we can't do here.

Any other ideas floating around?

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
  2015-09-16  9:33   ` Julien Grall
@ 2015-09-29  9:24   ` Jan Beulich
  2015-10-10 12:27     ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  9:24 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> @@ -139,6 +140,7 @@ static int queue_invalidate_wait(struct iommu *iommu,
>      unsigned long flags;
>      u64 entry_base;
>      struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct domain *d;
>  
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      index = qinval_next_index(iommu);
> @@ -152,9 +154,22 @@ static int queue_invalidate_wait(struct iommu *iommu,
>      qinval_entry->q.inv_wait_dsc.lo.sw = sw;
>      qinval_entry->q.inv_wait_dsc.lo.fn = fn;
>      qinval_entry->q.inv_wait_dsc.lo.res_1 = 0;
> -    qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
>      qinval_entry->q.inv_wait_dsc.hi.res_1 = 0;
> -    qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(&poll_slot) >> 2;
> +
> +    if ( iflag )
> +    {
> +        d = rcu_lock_domain_by_id(iommu->domid_map[device_id]);
> +        if ( d == NULL )
> +            return -ENODATA;
> +
> +        qinval_entry->q.inv_wait_dsc.lo.sdata = ++ qi_table_data(d);

Stray blank following the ++.

> +        qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(
> +                                                &qi_table_pollslot(d)) >> 2;
> +        rcu_unlock_domain(d);

If you don't hold a reference to the domain, what prevents it from
going away, struct domain getting freed, and the write to the poll
slot corrupting data if the memory gets re-used? Plus, if you obtain
a domain reference at the time you enter it into domid_map[], you
wouldn't have to be worried about failure above (i.e. you could
simply ASSERT() success of rcu_lock_domain_by_id()).

Considering the implementation of rcu_lock_domain_by_id() I
also wonder whether it wouldn't be more efficient to make
domid_map[] an array of struct domain pointers.
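
I.e. something like (sketch; this assumes get_knownalive_domain() is
suitable for holding the reference):

    /* when entering the domain into the map, hold a reference: */
    iommu->domid_map[did] = d;          /* struct domain *, not a domid_t */
    get_knownalive_domain(d);

    /* in queue_invalidate_wait(): */
    d = iommu->domid_map[device_id];
    ASSERT(d);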

> +    } else {

Coding style.

> --- a/xen/include/xen/hvm/iommu.h
> +++ b/xen/include/xen/hvm/iommu.h
> @@ -23,6 +23,21 @@
>  #include <xen/list.h>
>  #include <asm/hvm/iommu.h>
>  
> +/*
> + * Status Address and Data: Status address and data is used by hardware to perform
> + * wait descriptor completion status write when the Status Write(SW) field is Set.
> + *
> + * Track the Device-TLB invalidation status in an invalidation table. Update
> + * invalidation table's count of in-flight Device-TLB invalidation request and
> + * assign the address of global polling parameter per domain in the Status Address
> + * of each invalidation wait descriptor, when submit Device-TLB invalidation
> + * requests.
> + */
> +struct qi_talbe {
> +    u64 qi_table_poll_slot;

Why is this a 64-bit field when the respective write stores 32 bits
only?

Also the qi_table_ prefixes seem rather pointless.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler
  2015-09-16 13:23 ` [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler Quan Xu
@ 2015-09-29  9:33   ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  9:33 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1098,6 +1098,28 @@ static void _qi_msi_mask(struct iommu *iommu)
>  
>  static void _do_iommu_qi(struct iommu *iommu)
>  {
> +    unsigned long nr_dom, i;
> +    struct domain *d = NULL;
> +
> +    nr_dom = cap_ndoms(iommu->cap);
> +    i = find_first_bit(iommu->domid_bitmap, nr_dom);
> +    while ( i < nr_dom )
> +    {
> +        if ( iommu->domid_map[i] > 0 )

This is a pointless check when the bit was already found set. What
instead you need to consider are races with table entries getting
removed (unless following the suggestions made on the previous
patch already makes this impossible).

> +        {
> +            d = rcu_lock_domain_by_id(iommu->domid_map[i]);
> +            if ( d == NULL )
> +                continue;
> +
> +            if ( qi_table_pollslot(d) == qi_table_data(d) )

So qi_table_data() gets (non-atomically) incremented in the
previous patch when a new wait command gets issued. How are
this check (and the zapping below) safe against races, and
against out-of-order completion of invalidations?

Jan

> +            {
> +                qi_table_data(d) = 0;
> +                qi_table_pollslot(d) = 0;
> +            }
> +            rcu_unlock_domain(d);
> +        }
> +        i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
> +    }
>  }
>  
>  static void do_iommu_qi_completion(unsigned long data)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in
  2015-09-16 13:23 ` [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in Quan Xu
@ 2015-09-29  9:44   ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  9:44 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1070,6 +1070,27 @@ static hw_irq_controller dma_msi_type = {
>  };
>  
>  /* IOMMU Queued Invalidation(QI). */
> +static void qi_clear_iwc(struct iommu *iommu)
> +{
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&iommu->register_lock, flags);
> +    dmar_writel(iommu->reg, DMAR_ICS_REG, RW1CS);

RW1CS is definitely not a suitable name for this (or any other) bit.
The manual and the title call the bit IWC, i.e. maybe QI_ICS_IWC?
Also I don't think you can validly assume you can blindly write
zeroes to the other bits, but considering the RW1CS nature of this
bit it's also not clear whether blindly writing back what you read
would be a good idea.
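
I.e. perhaps just write the one bit (sketch - the name is the suggested
one, the bit position is from the spec):

    #define QI_ICS_IWC (1u << 0)    /* Invalidation Wait Descriptor Complete */

    dmar_writel(iommu->reg, DMAR_ICS_REG, QI_ICS_IWC);

Writing 1 only to the IWC position clears it, without a read-modify-write
of the remaining (reserved) bits.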

> +static int _qi_msi_ip(struct iommu *iommu)
> +{
> +    u32 sts;
> +    unsigned long flags;
> +
> +    /* Get IP bit of DMAR_IECTL_REG. */
> +    spin_lock_irqsave(&iommu->register_lock, flags);
> +    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
> +    spin_unlock_irqrestore(&iommu->register_lock, flags);
> +    return (sts & DMA_IECTL_IP);
> +}

The function appears to be meant to return a boolean, i.e. its
return type should be bool_t and the return statement should
use !!.
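
I.e.:

    static bool_t _qi_msi_ip(struct iommu *iommu)
    {
        u32 sts;
        unsigned long flags;

        spin_lock_irqsave(&iommu->register_lock, flags);
        sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
        spin_unlock_irqrestore(&iommu->register_lock, flags);

        return !!(sts & DMA_IECTL_IP);
    }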

> @@ -1101,6 +1122,14 @@ static void _do_iommu_qi(struct iommu *iommu)
>      unsigned long nr_dom, i;
>      struct domain *d = NULL;
>  
> +scan_again:

Labels indented by at least one space please. Even better would be if
this was written without goto.
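
E.g. (with the scan body factored out; the helper name is invented):

    do {
        _qi_scan_domains(iommu);    /* the domid_bitmap walk above */
    } while ( _qi_msi_ip(iommu) );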

> @@ -1120,6 +1149,28 @@ static void _do_iommu_qi(struct iommu *iommu)
>          }
>          i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
>      }
> +
> +    /*
> +     * IP is interrupt pending, bit 30 of the Invalidation Event Control
> +     * Register. The IP field is kept Set by hardware while the interrupt
> +     * message is held pending. The IP field is cleared by hardware as soon
> +     * as the interrupt message pending condition is serviced. IP could be
> +     * cleared due to either:
> +     *
> +     * - Clear IM field in the Invalidation Event Control Register. A QI
> +     *   interrupt is generated along with clearing the IP field.
> +     * - Clear IWC field in the Invalidation Completion Status register.
> +     *
> +     * If the IP is Set, scan again, instead of generating another interrupt.
> +     */
> +    if ( _qi_msi_ip(iommu) )
> +        goto scan_again;
> +
> +    /*
> +     * No masking of QI interrupt. when a QI interrupt event condition is
> +     * detected, hardware issues an interrupt message.
> +     */
> +    _qi_msi_unmask(iommu);

Isn't this what actually belongs in the flow end handler?

> @@ -1154,6 +1205,14 @@ static void qi_msi_mask(struct irq_desc *desc)
>  
>  static unsigned int qi_msi_startup(struct irq_desc *desc)
>  {
> +    struct iommu *iommu = desc->action->dev_id;
> +
> +    /*
> +     * If the IWC field in the Invalidation Completion Status register was already
> +     * Set at the time of setting this field, it is not treated as a new interrupt
> +     * condition.
> +     */
> +    qi_clear_iwc(iommu);

Things not related directly to interrupt control don't belong in flow
handlers.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB
  2015-09-16 13:24 ` [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB Quan Xu
@ 2015-09-29  9:46   ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  9:46 UTC (permalink / raw)
  To: Quan Xu
  Cc: tim, kevin.tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, jun.nakajima,
	yang.z.zhang

>>> On 16.09.15 at 15:24, <quan.xu@intel.com> wrote:
> @@ -322,6 +330,15 @@ static int flush_context_qi(
>      return ret;
>  }
>  
> +static int invalidate_async(struct iommu *iommu, u16 device_id)
> +{
> +    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +
> +    if ( qi_ctrl->qinval_maddr )
> +        return queue_invalidate_wait(iommu, 1, 1, 1, device_id);
> +    return 0;

Is this meant to be a success or an error indication (afaict it ought
to be the latter, but the function isn't returning bool_t)?

> @@ -360,8 +377,13 @@ static int flush_iotlb_qi(
>                                 type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>                                 dw, did, size_order, 0, addr);
>          if ( flush_dev_iotlb )
> +        {
>              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> -        rc = invalidate_sync(iommu);
> +            rc = invalidate_async(iommu, did);
> +        } else {

Coding style.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-29  9:11       ` Tim Deegan
@ 2015-09-29  9:57         ` Jan Beulich
  2015-09-30 15:05         ` Xu, Quan
  1 sibling, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-29  9:57 UTC (permalink / raw)
  To: Quan Xu, Tim Deegan
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Eddie Dong, xen-devel, Jun Nakajima, Yang Z Zhang

>>> On 29.09.15 at 11:11, <tim@xen.org> wrote:
> With the flush taking longer than Xen can wait for, you'll need to
> do something more complex, e.g.:
>  - keep a log of all relevant pending derefs, to be processed when the
>    flush completes; or
>  - have some other method of preventing changes of ownership/type on
>    the relevant pages.  E.g. for CPU TLBs, we keep a per-page counter
>    (tlbflush-timestamp) that we can use to detect whether enough TLB
>    flushes have happened since the page was freed.
> 
> The log is tricky - I'm not sure how to make sure that it has bounded
> size if a flush can take seconds.
> 
> I'm not sure the counter works either -- when that detector triggers
> we do a synchronous TLB-flush IPI to make the operation safe, and
> that's exactly what we can't do here.
> 
> Any other ideas floating around?

The variant of the log model might work if sufficient information is
available in the interrupt handler (or associated tasklet) to identify
a much smaller subset of pages to scan through. Since there is a
32-bit quantity written to a pre-determined location upon qi
completion, I wonder whether that couldn't be used for that purpose
- 32 bits disambiguate a page within 16Tb of RAM, so there wouldn't
need to be too many hashed together in a single chain. Otoh of
course we can't have 2^32 standalone list heads.
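
E.g. (sketch only - sizes picked arbitrarily):

    /* hash the 32-bit status data into a bounded set of chain heads */
    #define PENDING_HASH_ORDER 10    /* 1024 heads - arbitrary */
    static struct list_head pending_hash[1 << PENDING_HASH_ORDER];

    #define pending_hash_slot(data) \
        ((data) & ((1u << PENDING_HASH_ORDER) - 1))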

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-29  7:21           ` Jan Beulich
@ 2015-09-30 13:55             ` Xu, Quan
  2015-09-30 14:03               ` Jan Beulich
  2015-10-13 14:29             ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-30 13:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Nakajima, Jun, Zhang, Yang Z

> >> >> >>> On September 29, 2015 at 3:22 PM, <JBeulich@suse.com> wrote:
> >>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
> >>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
> >> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
> >> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> >
> > For Tim's suggestion --"to make the IOMMU table take typed refcounts
> > to anything it points to, and only drop those refcounts when the flush
> > completes."
> >
> > From the IOMMU point of view, it can walk through the IOMMU table to get
> > these pages and take typed refcounts.
> > These pages may be owned by the hardware domain, a dummy domain, an HVM
> > guest, etc. Could I narrow it down to HVM guests? --- i.e. not to anything
> > the table points to, but just to HVM guest pages. This would simplify the design.
> 
> I don't follow. Why would you want to walk page tables? And why would a HVM
> guest have pages other than those owned by itself or granted access to by
> another guest mapped in its IOMMU page tables?

It is tricky. Let's ignore it.

That was an analysis of making the IOMMU table take typed refcounts to anything it points to.
I know the IOMMU table and EPT table may share the same page table ('iommu_hap_pt_share = 1').
The idea was then to go through the IOMMU table and take typed refcounts.

> In any event - the ref-counting
> would need to happen as you _create_ the mappings, not at some later point.
> 
A general rule. Agreed.

When creating the mappings, under what conditions should typed refcounts be taken?



> >  Just for check, do typed refcounts refer to the following?
> >
> > --- a/xen/include/asm-x86/mm.h
> > +++ b/xen/include/asm-x86/mm.h
> > @@ -183,6 +183,7 @@ struct page_info
> >  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?
> */
> >  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings?
> */
> >  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page
> */
> > +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB
> mapping?   */
> >  #define PGT_type_mask     PG_mask(15, 4) /* Bits 28-31 or 60-63.
> */
> >
> > * I define a new typed refcounts PGT_dev_tlb_page.
> 
> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w ones
> suffice, just like we do everywhere else?
> 

I think it is different from r/o or writable.

The page has been freed from the domain, but the Device-TLB flush is not completed.
The page is _not_ r/o or writable, and can only be accessed through DMA..

Maybe it would mean modifying a lot of related code.
r/o or writable refs are acceptable to me.


> >> Once you do that, I
> >> don't think there'll be a reason to pause the guest for the duration
> >> of the
> > flush.
> >> And really (as pointed out before) pausing the guest would get us
> >> _far_ away from how real hardware behaves.
> >>
> >
> > Once I do that, I think the guest should still be paused if the
> > Device-TLB flush is not completed.
> >
> > As mentioned in a previous email, for example:
> > Call the do_memory_op HYPERCALL to free a pageX (gfn1 <---> mfn1). The
> > gfn1 is the freed portion of the GPA.
> > Assume that there is a mapping (gfn1 <---> mfn1) in the Device-TLB. If the
> > Device-TLB flush is not completed on the return to guest mode, the guest
> > may call the do_memory_op HYPERCALL to allocate a new pageY (mfn2) at
> > gfn1..
> > then:
> > the EPT mapping is (gfn1 <---> mfn2), while the Device-TLB mapping is
> > still (gfn1 <---> mfn1).
> >
> > If the Device-TLB flush is not completed, DMA associated with gfn1 may
> > still write some data to pageX (gfn1 <---> mfn1), but pageX will be
> > released to Xen when the Device-TLB flush is completed. It may not be
> > correct for the guest to read data from gfn1 after the DMA (now the page
> > associated with gfn1 is pageY).
> >
> > Right?
> 
> No. The extra ref taken will prevent the page from getting freed. And as long as
> the flush is in process, DMA to/from the page is going to produce undefined
> results (affecting only the guest). But note that there may be reasons for an
> entity external to the guest, invoking the operation which ultimately led to
> the flush, to do this on a paused guest only. But that's not of concern to the
> hypervisor side implementation.
> 

Reasonable.

Jan, thanks!

-Quan


> Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-30 13:55             ` Xu, Quan
@ 2015-09-30 14:03               ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-09-30 14:03 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, YangZ Zhang

>>> On 30.09.15 at 15:55, <quan.xu@intel.com> wrote:
>> >> >> >>> On September 29, 2015 at 3:22 PM, <JBeulich@suse.com> wrote:
>> >>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
>> >>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
>> >> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
>> >> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>> >
>> > For Tim's suggestion --"to make the IOMMU table take typed refcounts
>> > to anything it points to, and only drop those refcounts when the flush
>> > completes."
>> >
>> > From IOMMU point of view, if it can walk through IOMMU table to get
>> > these pages and take typed refcounts.
>> > These pages are maybe owned by hardware_domain, dummy, HVM guest .etc.
>> > could I narrow it down to HVM guest? --- It is not for anything it
>> > points to, but just for HVM guest related. this will simplify the design.
>> 
>> I don't follow. Why would you want to walk page tables? And why would a HVM
>> guest have pages other than those owned by itself or granted access to by
>> another guest mapped in its IOMMU page tables?
> 
> It is tricky. Let's ignore it.
> 
> That was an analysis of making the IOMMU table take typed refcounts to
> anything it points to.
> I know the IOMMU table and EPT table may share the same page
> table ('iommu_hap_pt_share = 1').
> The idea was then to go through the IOMMU table and take typed refcounts.
> 
>> In any event - the ref-counting
>> would need to happen as you _create_ the mappings, not at some later point.
>> 
> A general rule. Agreed.
> 
> When creating the mappings, under what conditions should typed refcounts be taken?

On any r/o mapping, take a general ref. On any r/w mapping take
a writable ref.
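
I.e. at map time (sketch):

    if ( write )
    {
        /* general ref plus writable type ref */
        if ( !get_page_and_type(page, d, PGT_writable_page) )
            return -EINVAL;
    }
    else if ( !get_page(page, d) )    /* general ref only for r/o */
        return -EINVAL;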

>> > --- a/xen/include/asm-x86/mm.h
>> > +++ b/xen/include/asm-x86/mm.h
>> > @@ -183,6 +183,7 @@ struct page_info
>> >  #define PGT_seg_desc_page PG_mask(5, 4)  /* using this page in a GDT/LDT?
>> */
>> >  #define PGT_writable_page PG_mask(7, 4)  /* has writable mappings?
>> */
>> >  #define PGT_shared_page   PG_mask(8, 4)  /* CoW sharable page
>> */
>> > +#define PGT_dev_tlb_page  PG_mask(9, 4)  /* Maybe in Device-TLB
>> mapping?   */
>> >  #define PGT_type_mask     PG_mask(15, 4) /* Bits 28-31 or 60-63.
>> */
>> >
>> > * I define a new typed refcounts PGT_dev_tlb_page.
>> 
>> Why? I.e. why won't a base ref for r/o pages and a writable type-ref for r/w 
>> ones
>> suffice, just like we do everywhere else?
>> 
> 
> I think it is different from r/o or writable.
> 
> The page has been freed from the domain, but the Device-TLB flush is not completed.
> The page is _not_ r/o or writable, and can only be accessed through DMA..

DMA or not, the page can either be accessed (read) (when mapped
r/o in the IOMMU) or written (when mapped r/w in the IOMMU). For
the purpose of refcounting it doesn't matter where the reads/writes
originate.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-29  9:11       ` Tim Deegan
  2015-09-29  9:57         ` Jan Beulich
@ 2015-09-30 15:05         ` Xu, Quan
  2015-10-01  9:09           ` Tim Deegan
  1 sibling, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-09-30 15:05 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

>> >>> On September 29, 2015, at 5:12 PM, <tim@xen.org> wrote:
> At 03:08 +0000 on 28 Sep (1443409723), Xu, Quan wrote:
> > >>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
> > > 7/13: I'm not convinced that making the vcpu spin calling
> > > sched_yield() is a very good plan.  Better to explicitly pause the
> > > domain if you need its vcpus not to run.  But first -- why does
> > > IOMMU flushing mean that vcpus can't be run?
> >
> > Ensure that the required Device-TLB flushes are applied before
> > returning to guest mode via hypercall completion.  Otherwise the domain
> > can also DMA to these freed pages.  For example, call the do_memory_op
> > HYPERCALL to free a pageX (gfn --- mfn) from the domain, and assume that
> > there is a mapping (gfn --- mfn) in the Device-TLB; once the vcpu has
> > returned to guest mode, the domain can still DMA to this freed pageX.
> > The domain kernel cannot use this page while it is being freed;
> > otherwise it is a domain kernel bug.
> 
> 
> OK - let's ignore guest kernel bugs.  IIUC you're worried about the guest OS
> telling a device to issue DMA to an address that has changed in the IOMMU
> tables (unmapped, remapped elsewhere, permisisons changedm &c) but not yet
> been flushed?


Yes: DMA issued to an address that has changed in the IOMMU table and EPT table, but has not yet been flushed.


> 
> Unfortunately, pausing the guest's CPUs doesn't stop that.  A malicious guest
> could enqueue network receive buffers pointing to that address, and then
> arrange for a packet to arrive between the IOMMU table change and the flush
> completion.

Cool !!

> So you'll need to do something else to make the unmap safe.
>The usual
> method in Xen is to hold a reference to the page (for read-only
> mappings)


Does a read-only mapping refer to 'PGT_pinned'?
Could I introduce a new typed reference which can only be dereferenced in the QI interrupt handler (or associated tasklet)? --(stop me, I always want to add some new flag or type ..)
And prevent changes of ownership/type on the relevant pages.


> or a typed reference (for read-write), and not release that reference
> until the flush has completed.  That's OK with in-line synchronous flushes.
> 
> With the flush taking longer than Xen can wait for, you'll need to do something
> more complex, e.g.:
>  - keep a log of all relevant pending derefs, to be processed when the
>    flush completes; 



One of the CCed people mentioned this solution in internal discussions, but it is tricky and over-engineered.
I would need more than half a year to implement it.


> or
>  - have some other method of preventing changes of ownership/type on
>    the relevant pages. 


I prefer this solution.


> E.g. for CPU TLBs, we keep a per-page counter
>    (tlbflush-timestamp) that we can use to detect whether enough TLB
>    flushes have happened since the page was freed.
> 
> The log is tricky - I'm not sure how to make sure that it has bounded size if a
> flush can take seconds.
> 
> I'm not sure the counter works either -- when that detector triggers we do a
> synchronous TLB-flush IPI to make the operation safe, and that's exactly what we
> can't do here.
> 
> Any other ideas floating around?
> 
> Cheers,
> 

Tim, thanks for your help.
Any idea I have, I will send out, even if it is not a complete solution.

Quan

> Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-30 15:05         ` Xu, Quan
@ 2015-10-01  9:09           ` Tim Deegan
  2015-10-07 17:02             ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Deegan @ 2015-10-01  9:09 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Hi,

At 15:05 +0000 on 30 Sep (1443625549), Xu, Quan wrote:
> >> >>> On September 29, 2015, at 5:12 PM, <tim@xen.org> wrote:
> > So you'll need to do something else to make the unmap safe.
> >The usual
> > method in Xen is to hold a reference to the page (for read-only
> > mappings)
> 
> 
> Does a read-only mapping refer to 'PGT_pinned'?

Read-only mappings usually take an 'untyped' reference, with
get_page().  Read-write mappings take one of those plus a typecount
reference for type PGT_writable_page (i.e. get_page_and_type()).  The
reference counts are described in a comment which is currently at the
top of xen/arch/x86/mm.c, but which is mostly talking about PV memory
management.

The PGT_pinned bit is not relevant here - it's used to indicate that a
PV _guest_ has asked for a page to have a particular type (useful for
PV pagetables).

> Could I introduce a new typed reference which can only be dereferenced in the QI interrupt handler (or associated tasklet)? --(stop me, I always want to add some new flag or type ..)
> And prevent changes of ownership/type on the relevant pages.

If you get_page_and_type(PGT_writable_page), then that guarantees that
the page won't change type or be freed until the matching
put_page_and_type(), which does what you want.  Likewise, for
read-only mappings, get_page() will guarantee that the page isn't
freed until you call put_page().  (We don't care about read-only DMA
mappings of specially-typed pages so you don't need a typecount there).

That leaves three questions:
 - when to take the references?
 - how do you know when to drop them?
 - what to do about mappings of other domains' memory (i.e. grant
   and foreign mappings).

IIUC your current scheme is to take the reference as the page is freed
and drop it after the flush.  That's not enough for pages where
permissions change for other reasons, so it will at least have to be
"take the reference when the IOMMU entry is removed/changed".  IMO
that's sufficiently invasive that you might as well:
 - take the reference when the IOMMU entry is _created_;
 - log (or something) when the IOMMU entry is removed/overwritten; and
 - drop the entry when the flush completes.

Like other schemes (I'm thinking about the p2m foreign mapping stuff
here) the reference counting is best done at the very bottom of the
stack, when the actual entries are being written and cleared.  IOW
I'd be adding more code to atomic_write_ept_entry() and friends.
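
Very roughly (all names other than atomic_write_ept_entry() and
write_atomic() are invented; error handling omitted):

    static void atomic_write_ept_entry(ept_entry_t *entryptr, ept_entry_t new)
    {
        ept_entry_t old = *entryptr;

        /* take the refs for the new entry first... */
        if ( epte_present(&new) )
            take_iommu_refs(&new);      /* get_page()/get_page_and_type() */

        write_atomic(&entryptr->epte, new.epte);

        /* ...and queue the old entry's refs to be dropped only when the
         * Device-TLB flush has completed. */
        if ( epte_present(&old) )
            log_pending_deref(&old);
    }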

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-01  9:09           ` Tim Deegan
@ 2015-10-07 17:02             ` Xu, Quan
  2015-10-08  8:51               ` Jan Beulich
  2015-10-10 18:24               ` Tim Deegan
  0 siblings, 2 replies; 84+ messages in thread
From: Xu, Quan @ 2015-10-07 17:02 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z,
	Xu, Quan


>>> >> On October 01, 2015, at 5:09 PM <tim@xen.org> wrote:
> At 15:05 +0000 on 30 Sep (1443625549), Xu, Quan wrote:
> > >> >>> On September 29, 2015, at 5:12 PM, <tim@xen.org> wrote:

> > Could I introduce a new typed reference which can only be dereferenced in
> > the QI interrupt handler (or associated tasklet)? --(stop me, I always want to add
> some new flag or type ..) And prevent changes of ownership/type on the
> relevant pages.
> 
> If you get_page_and_type(PGT_writable_page), then that guarantees that the
> page won't change type or be freed until the matching put_page_and_type(),
> which does what you want.  Likewise, for read-only mappings, get_page() will
> guarantee that the page isn't freed until you call put_page().  (We don't care
> about read-only DMA mappings of specially-typed pages so you don't need a
> typecount there).
> 
Thanks for your further clarification.

> That leaves three questions:
>  - when to take the references?
>  - how do you know when to drop them?
>  - what to do about mappings of other domains' memory (i.e. grant
>    and foreign mappings).
> 

I think these are the key points. See the answers below.

> IIUC your current scheme is to take the reference as the page is freed and drop it
> after the flush.

Yes,

> That's not enough for pages where permissions change for
> other reasons, so it will at least have to be "take the reference when the IOMMU
> entry is removed/changed".  IMO that's sufficiently invasive that you might as
> well:
>  - take the reference when the IOMMU entry is _created_;
>  - log (or something) when the IOMMU entry is removed/overwritten; and
>  - drop the entry when the flush completes.


__scheme A__
Q1: - when to take the references?
    take the reference when the IOMMU entry is _created_;
    in detail:
     --iommu_map_page(), or
     --ept_set_entry() [Once IOMMU shares EPT page table.]

That leaves one question:
    -- how to deal with hot-plug ATS device pass-through? The EPT code isn't aware that the IOMMU will share the EPT page table at the time the EPT page table is _created_.

Q2: how do you know when to drop them?
   - log (or something) when the IOMMU entry is removed/overwritten; and
   - drop the entry when the flush completes.
   
   -- We can add a new page_list_entry structure per page_info, add the page with the new page_list_entry structure to a per-domain page list when
    the IOMMU entry is removed/overwritten, and drop the entry when the flush completes.

Q3: what to do about mappings of other domains' memory (i.e. grant and foreign mappings).
   Between two domains, I currently have only one idea to fix this tricky issue -- a waitqueue.
   I.e. for grants:
    For gnttab_transfer / gnttab_unmap, wait on a waitqueue before updating the grant flag, until the Device-TLB flush is completed.
    For grant-mapped pages, this is safe given the modification of gnttab_unmap.



__scheme B__
Q1: - when to take the references?

    take the reference when the IOMMU entry is _removed/overwritten_;
    in detail:
     --iommu_unmap_page(), or
     --ept_set_entry() [Once IOMMU shares EPT page table.]

    * Make sure the page is not reallocated for
     another purpose until the appropriate invalidations have been
     performed.
    * In this case, it does not matter whether the ATS device is hot-plugged or assigned at domain initialization.

Q2 / Q3: 
    The same as above __scheme A__ Q2/Q3.

One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..



Tim, thanks very much!


Quan



> 
> Like other schemes (I'm thinking about the p2m foreign mapping stuff
> here) the reference counting is best done at the very bottom of the stack, when
> the actual entries are being written and cleared.  IOW I'd be adding more code
> to atomic_write_ept_entry() and friends.
> 
> Cheers,
> 
> Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-07 17:02             ` Xu, Quan
@ 2015-10-08  8:51               ` Jan Beulich
  2015-10-09  7:06                 ` Xu, Quan
  2015-10-10 18:24               ` Tim Deegan
  1 sibling, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-08  8:51 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, Yang Z Zhang

>>> On 07.10.15 at 19:02, <quan.xu@intel.com> wrote:
> __scheme A__
> Q1: - when to take the references?
>     take the reference when the IOMMU entry is _created_;
>     in detail:
>      --iommu_map_page(), or
>      --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> That leaves one question:
> >     -- how to deal with hot-plug ATS device pass-through? The EPT code isn't aware
> > that the IOMMU will share the EPT page table when the EPT page table is _created_.

How is it not? iommu_hap_pt_share is a global option, and hence
even for a domain which does not (yet) require an IOMMU these
references could be acquired proactively.

> Q2: how do you know when to drop them?
>    - log (or something) when the IOMMU entry is removed/overwritten; and
>    - drop the entry when the flush completes.
>    
>    -- We can add a new page_list_entry structure per page_info, and Add the 
> page with the new page_list_entry structure to per domain page list, when
>     the IOMMU entry is removed/overwritten; and drop the entry when the 
> flush completes. 

Please be very careful when considering to grow the size of struct
page_info - due to the amount of its instances this would have a
quite measurable effect on the memory overhead of memory
management. Since the number of pages currently having a flush
pending shouldn't be that high at any point in time, I don't think
such increased overhead would be justifiable.

> Q3: what to do about mappings of other domains' memory (i.e. grant and 
> foreign mappings).
>    Between two domains, now I have only one idea to fix this tricky issue -- 
> waitqueue.
>    I.e. grant.
>     For gnttab_transfer /gnttab_unmap , wait on a waitqueue before updating 
> grant flag, until the Device-TLB flush is completed.
>     For grant-mapped, it is safe as the modification of gnttab_unmap.

Hmm, wouldn't grant transfers already be taken care of by the
extra references? See steal_page(). Perhaps the ordering
between its invocation and guest_physmap_remove_page() would
need to be switched (with the latter getting undone if steal_page()
fails). The waiting for the flush to complete could - afaics - be done
by using the grant-ops' inherent batching (and hence easy availability
of continuations). But I admit there's some hand waiving here
without closer inspection...

> __scheme B__
> Q1: - when to take the references?
> 
>     take the reference when the IOMMU entry is _ removed/overwritten_;
>     in detail:
>      --iommu_unmap_page(), or
>      --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
>     * Make sure IOMMU page should not be reallocated for
>      another purpose until the appropriate invalidations have been
>      performed. 
>     * in this case, it does not matter hot-plug ATS device pass-through or ATS 
> device assigned in domain initialization.
> 
> Q2 / Q3: 
>     The same as above __scheme A__ Q2/Q3.
> 
> One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..

While at first glance this looks like a neat idea - what do you do
if obtaining the reference fails? Not touching the EPT entry may be
possible (but not nice), but not doing the IOMMU side update when
the EPT one already succeeded would seem very cumbersome (as
you'd need to roll back the EPT side update). Maybe a two-phase
operation (obtain references for IOMMU, update EPT, update IOMMU)
could solve this (and could then actually be independent of whether
page tables are shared).
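
I.e. roughly (sketch; the phase-1 helper is invented):

    /* phase 1: acquire references - may fail, nothing to roll back yet */
    rc = iommu_acquire_refs(d, gfn, mfn, flags);
    if ( rc )
        return rc;

    /* phase 2: update the EPT entry */
    atomic_write_ept_entry(entryptr, new_entry);

    /* phase 3: update the (non-shared) IOMMU tables - must not fail now */
    if ( need_iommu(d) && !iommu_hap_pt_share )
        iommu_map_page(d, gfn, mfn, flags);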

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-08  8:51               ` Jan Beulich
@ 2015-10-09  7:06                 ` Xu, Quan
  2015-10-09  7:18                   ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-10-09  7:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Nakajima, Jun, Zhang, Yang Z

>> >>>On 08.10.2015 at 16:52 <JBeulich@suse.com> wrote:
> >>> On 07.10.15 at 19:02, <quan.xu@intel.com> wrote:
> > __scheme A__
> > Q1: - when to take the references?
> >     take the reference when the IOMMU entry is _created_;
> >     in detail:
> >      --iommu_map_page(), or
> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
> >
> > That leaves one question:
> >     -- how to do with hot-plug ATS device pass-through? As the EPT
> > doesn't aware IOMMU will share EPT page table when EPT page table was
> _created_.
> 
> How is it not? iommu_hap_pt_share is a global option, and hence even for a
> domain which does not (yet) require an IOMMU these references could be
> acquired proactively.
> 

I mean that 'need_iommu(d)' is '0'. If I extend this to take an additional reference for a normal HVM guest, I think it is tricky and very challenging to manage this additional reference.
That's why I prefer __scheme B__..



> > Q2: how do you know when to drop them?
> >    - log (or something) when the IOMMU entry is removed/overwritten; and
> >    - drop the entry when the flush completes.
> >
> >    -- We can add a new page_list_entry structure per page_info, and
> > Add the page with the new page_list_entry structure to per domain page list,
> when
> >     the IOMMU entry is removed/overwritten; and drop the entry when
> > the flush completes.
> 
> Please be very careful when considering to grow the size of struct page_info -
> due to the amount of its instances this would have a quite measurable effect on
> the memory overhead of memory management. Since the number of pages
> currently having a flush pending shouldn't be that high at any point in time, I
> don't think such increased overhead would be justifiable.
> 

Makes sense. Agreed.
I can remove this page from the domain [like steal_page(): swizzle the owner -- page_set_owner(page, NULL); unlink from the original owner -- page_list_del(page, &d->page_list)].
Then take a reference, add this page to a per-domain page list, and put this page when the flush completes.
I was afraid that this solution may cause potential issues, i.e. make Xen crash in some corner case ..


> > Q3: what to do about mappings of other domains' memory (i.e. grant and
> > foreign mappings).
> >    Between two domains, now I have only one idea to fix this tricky
> > issue -- waitqueue.
> >    I.e. grant.
> >     For gnttab_transfer /gnttab_unmap , wait on a waitqueue before
> > updating grant flag, until the Device-TLB flush is completed.
> >     For grant-mapped, it is safe as the modification of gnttab_unmap.
> 
> Hmm, wouldn't grant transfers already be taken care of by the extra references?
> See steal_page(). Perhaps the ordering between its invocation and
> guest_physmap_remove_page() would need to be switched (with the latter
> getting undone if steal_page() fails). The waiting for the flush to complete could -
> afaics - be done by using the grant-ops' inherent batching (and hence easy
> availability of continuations). But I admit there's some hand waiving here without
> closer inspection...
> 

I think the extra references can NOT fix the security issue between two domains.
I.e. if domA transfers the ownership of a page to domB while domA still holds extra references to the page, I think that is not correct.

IMO the guide to fixing the security issue between two domains is a two-phase operation:
 --First-- make sure the one domain can NOT read/write/DMA to this page.
 --Second-- only then let the other domain do everything with this page.

A waitqueue is useful for this case.
I know this is still an ugly solution, but - afaics - it makes sure the hypervisor side implementation is correct.

> > __scheme B__
> > Q1: - when to take the references?
> >
> >     take the reference when the IOMMU entry is _ removed/overwritten_;
> >     in detail:
> >      --iommu_unmap_page(), or
> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
> >
> >     * Make sure IOMMU page should not be reallocated for
> >      another purpose until the appropriate invalidations have been
> >      performed.
> >     * in this case, it does not matter hot-plug ATS device
> > pass-through or ATS device assigned in domain initialization.
> >
> > Q2 / Q3:
> >     The same as above __scheme A__ Q2/Q3.
> >
> > One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..
> 
> While at the first glance this looks like a neat idea - 


I think this is safe and a good solution.
I hope you can look into __scheme B__. I need an _Acked-by_ from you and Tim Deegan.



> what do you do if obtaining
> the reference fails? Not touching the EPT entry may be possible (but not nice),
> but not doing the IOMMU side update when the EPT one already succeeded
> would seem very cumbersome (as you'd need to roll back the EPT side update).
> Maybe a two-phase operation (obtain references for IOMMU, update EPT,
> update IOMMU) could solve this (and could then actually be independent of
> whether page tables are shared).


Good. If __scheme B__ is correct, I will take it into consideration.

Jan, thanks very much!


-Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-09  7:06                 ` Xu, Quan
@ 2015-10-09  7:18                   ` Jan Beulich
  2015-10-09  7:51                     ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-09  7:18 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, YangZ Zhang

>>> On 09.10.15 at 09:06, <quan.xu@intel.com> wrote:
>> > >>>On 08.10.2015 at 16:52 <JBeulich@suse.com> wrote:
>> >>> On 07.10.15 at 19:02, <quan.xu@intel.com> wrote:
>> > Q3: what to do about mappings of other domains' memory (i.e. grant and
>> > foreign mappings).
>> >    Between two domains, now I have only one idea to fix this tricky
>> > issue -- waitqueue.
>> >    I.e. grant.
>> >     For gnttab_transfer /gnttab_unmap , wait on a waitqueue before
>> > updating grant flag, until the Device-TLB flush is completed.
>> >     For grant-mapped, it is safe as the modification of gnttab_unmap.
>> 
>> Hmm, wouldn't grant transfers already be taken care of by the extra references?
>> See steal_page(). Perhaps the ordering between its invocation and
>> guest_physmap_remove_page() would need to be switched (with the latter
>> getting undone if steal_page() fails). The waiting for the flush to complete could -
>> afaics - be done by using the grant-ops' inherent batching (and hence easy
>> availability of continuations). But I admit there's some hand waiving here without
>> closer inspection...
> 
> I think the extra references can NOT fix the security issue between two 
> domains.
> i.e. If domA transfers the ownership of a page to domB, but the domA still 
> take extra references of the page. I think it is not correct.

Again - see steal_page(): Pages with references beyond the single
allocation related one can't change ownership.

>> > __scheme B__
>> > Q1: - when to take the references?
>> >
>> >     take the reference when the IOMMU entry is _ removed/overwritten_;
>> >     in detail:
>> >      --iommu_unmap_page(), or
>> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
>> >
>> >     * Make sure IOMMU page should not be reallocated for
>> >      another purpose until the appropriate invalidations have been
>> >      performed.
>> >     * in this case, it does not matter hot-plug ATS device
>> > pass-through or ATS device assigned in domain initialization.
>> >
>> > Q2 / Q3:
>> >     The same as above __scheme A__ Q2/Q3.
>> >
>> > One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..
>> 
>> While at the first glance this looks like a neat idea - 
> 
> 
> I think this is safe and a good solution.
> I hope you can look into __scheme B__. I need an _Acked-by_ from you and Tim
> Deegan.

What do you mean here? I'm not going to ack a patch that hasn't
even been written, and while scheme B looks possible, I might still
overlook something, so I also can't up front ack that model (which
may then lead to you expecting that once implemented it gets
accepted).

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-09  7:18                   ` Jan Beulich
@ 2015-10-09  7:51                     ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-10-09  7:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Nakajima, Jun, Zhang, Yang Z

>> >>> On 09.10.2015 at 15:18 <JBeulich@suse.com> wrote:
> >>> On 09.10.15 at 09:06, <quan.xu@intel.com> wrote:
> >> > >>>On 08.10.2015 at 16:52 <JBeulich@suse.com> wrote:
> >> >>> On 07.10.15 at 19:02, <quan.xu@intel.com> wrote:
 
> >> > __scheme B__
> >> > Q1: - when to take the references?
> >> >
> >> >     take the reference when the IOMMU entry is _
> removed/overwritten_;
> >> >     in detail:
> >> >      --iommu_unmap_page(), or
> >> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
> >> >
> >> >     * Make sure IOMMU page should not be reallocated for
> >> >      another purpose until the appropriate invalidations have been
> >> >      performed.
> >> >     * in this case, it does not matter hot-plug ATS device
> >> > pass-through or ATS device assigned in domain initialization.
> >> >
> >> > Q2 / Q3:
> >> >     The same as above __scheme A__ Q2/Q3.
> >> >
> >> > One question: is __scheme B__ safe? If it is safe, I prefer __scheme B__..
> >>
> >> While at the first glance this looks like a neat idea -
> >
> >
> > I think this is safe and a good solution.
> > I hope you can review into the __scheme B__. I need _Acked-by_ you and
> > Tim Deegan.
> 
> What do you mean here?

Just to verify it.
If it works, I will continue writing patches based on it.
If it does not work, I will continue researching ..


> I'm not going to ack a patch that hasn't even got
> written, and while scheme B looks possible, I might still overlook something, so I
> also can't up front ack that model (which may then lead to you expecting that
> once implemented it gets accepted).


I am getting started writing patches based on __scheme B__ and will send them out ASAP.
Thanks 

Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt.
  2015-09-29  8:57   ` Jan Beulich
@ 2015-10-10  8:22     ` Xu, Quan
  2015-10-12  7:11       ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-10-10  8:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, tim, keir, xen-devel

>> >>> On 29.09.2015 at 16:57 <JBeulich@suse.com> wrote:
> >>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> > +/* IOMMU Queued Invalidation(QI). */
> > +static void _qi_msi_unmask(struct iommu *iommu) {
> > +    u32 sts;
> > +    unsigned long flags;
> > +
> > +    /* Clear IM bit of DMAR_IECTL_REG. */
> > +    spin_lock_irqsave(&iommu->register_lock, flags);
> > +    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
> > +    sts &= ~DMA_IECTL_IM;
> > +    dmar_writel(iommu->reg, DMAR_IECTL_REG, sts);
> > +    spin_unlock_irqrestore(&iommu->register_lock, flags); }
> 
> I think rather than duplicating all this code from the fault interrupt you should
> instead re-factor to make the original usable for both purposes. Afaics the
> differences really are just the register and bit locations.
> 

hw_irq_controller is a common data structure for ARM/AMD/x86.
To reuse these functions, I would have to redefine the function pointers .set_affinity / .startup etc.
It would take much effort to modify the other ARM/AMD/x86 code.


> > +static void _do_iommu_qi(struct iommu *iommu) { }
> 
> ???
> 

For now it has no knowledge about ATS.


> > +static void qi_msi_unmask(struct irq_desc *desc) {
> > +    _qi_msi_unmask(desc->action->dev_id);
> > +}
> > +
> > +static void qi_msi_mask(struct irq_desc *desc) {
> > +    _qi_msi_mask(desc->action->dev_id);
> > +}
> 
> These wrappers look pretty pointless.
> 

I will modify it in the next version.


> > +static void qi_msi_end(struct irq_desc *desc, u8 vector) {
> > +    ack_APIC_irq();
> > +}
> 
> Why is there, other than for its fault counterpart, no unmask operation here?
> 


 In my design I try to optimize this in the interrupt handler (or its associated tasklet):
check the IP field at the end of the handler; if it is set, scan the domain list again
instead of clearing IM to cause another interrupt.
The logic of IWC/IP/IM is as follows:

    Interrupt condition: an invalidation wait descriptor with the
    Interrupt Flag (IF) field set completes.

      - If IWC is already set: this is not treated as a new interrupt
        condition (nothing further happens).
      - If IWC is not set: hardware sets IWC and IP, then checks IM:
          - IM set:     the interrupt is held pending; clearing IM
                        later causes the interrupt message.
          - IM not set: hardware generates the interrupt message and
                        then clears IP.

And, if you clear IWC, IP is cleared as well.
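
A sketch of the handler tail I have in mind (register locking elided; DMA_ICS_IWC
and DMA_IECTL_IP are assumed bit names, and qi_scan_domains() is a made-up helper):

    static void _do_iommu_qi_completion(struct iommu *iommu)
    {
        u32 iectl;

        do {
            /* Clear IWC first (write-1-to-clear); this also clears IP. */
            dmar_writel(iommu->reg, DMAR_ICS_REG, DMA_ICS_IWC);

            /* Scan the domain list for completed Device-TLB flushes. */
            qi_scan_domains(iommu);

            /*
             * A wait descriptor may have completed while scanning; IWC/IP
             * would then be set again, so loop and scan once more instead
             * of unmasking IM and taking another interrupt.
             */
            iectl = dmar_readl(iommu->reg, DMAR_IECTL_REG);
        } while ( iectl & DMA_IECTL_IP );
    }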


> > @@ -1123,6 +1243,7 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
> >          return -ENOMEM;
> >
> >      iommu->msi.irq = -1; /* No irq assigned yet. */
> > +    iommu->qi_msi.irq = -1; /* No irq assigned yet. */
> 
> Which suggests that (perhaps in patch 1) the existing field should be renamed to
> e.g. fe_msi.
> 

I'd better rename the existing field in the next patch set; otherwise it is confusing.



> > @@ -1985,6 +2109,9 @@ static void adjust_irq_affinity(struct acpi_drhd_unit
> *drhd)
> >           cpumask_intersects(&node_to_cpumask(node), cpumask) )
> >          cpumask = &node_to_cpumask(node);
> >      dma_msi_set_affinity(irq_to_desc(drhd->iommu->msi.irq), cpumask);
> > +
> > +    if ( ats_enabled )
> > +        qi_msi_set_affinity(irq_to_desc(drhd->iommu->qi_msi.irq),
> > + cpumask);
> >  }
> >
> >  int adjust_vtd_irq_affinities(void)
> > @@ -2183,6 +2310,11 @@ int __init intel_vtd_setup(void)
> >
> >          ret = iommu_set_interrupt(drhd, &dma_msi_type, "dmar",
> &drhd->iommu->msi,
> >                                    iommu_page_fault);
> > +        if ( ats_enabled )
> > +            ret = iommu_set_interrupt(drhd, &qi_msi_type, "qi",
> > +                                      &drhd->iommu->qi_msi,
> > +                                      iommu_qi_completion);
> > +
> 
> Would there be any harm from leaving out most/all of these ats_enabled
> conditionals, despite right now that code only intended to be used for ATS
> invalidations? I.e. wouldn't it just be that the interrupt never triggers?
> 

No harm.
It is of no use if ATS is not enabled now.
It is only triggered when an invalidation wait descriptor with the Interrupt Flag (IF) field set completes.


> > --- a/xen/drivers/passthrough/vtd/iommu.h
> > +++ b/xen/drivers/passthrough/vtd/iommu.h
> > @@ -47,6 +47,11 @@
> >  #define    DMAR_IQH_REG    0x80    /* invalidation queue head */
> >  #define    DMAR_IQT_REG    0x88    /* invalidation queue tail */
> >  #define    DMAR_IQA_REG    0x90    /* invalidation queue addr */
> > +#define    DMAR_IECTL_REG  0xA0    /* invalidation event contrl register
> */
> > +#define    DMAR_IEDATA_REG 0xA4    /* invalidation event data register
> */
> > +#define    DMAR_IEADDR_REG 0xA8    /* invalidation event address
> register */
> > +#define    DMAR_IEUADDR_REG 0xAC   /* invalidation event upper
> address register */
> > +#define    DMAR_ICS_REG    0x9C    /* invalidation completion status
> register */
> 
> Numerically ordered please.
> 

Got it.


Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-09-29  9:24   ` Jan Beulich
@ 2015-10-10 12:27     ` Xu, Quan
  2015-10-12  7:15       ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-10-10 12:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, keir, tim, xen-devel

>> >>> On 29.09.2015 at 17:24 <JBeulich@suse.com> wrote:
> >>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
> > @@ -139,6 +140,7 @@ static int queue_invalidate_wait(struct iommu *iommu,
> >      unsigned long flags;
> >      u64 entry_base;
> >      struct qinval_entry *qinval_entry, *qinval_entries;
> > +    struct domain *d;
> >
> >      spin_lock_irqsave(&iommu->register_lock, flags);
> >      index = qinval_next_index(iommu); @@ -152,9 +154,22 @@ static int
> > queue_invalidate_wait(struct iommu *iommu,
> >      qinval_entry->q.inv_wait_dsc.lo.sw = sw;
> >      qinval_entry->q.inv_wait_dsc.lo.fn = fn;
> >      qinval_entry->q.inv_wait_dsc.lo.res_1 = 0;
> > -    qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
> >      qinval_entry->q.inv_wait_dsc.hi.res_1 = 0;
> > -    qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(&poll_slot) >> 2;
> > +
> > +    if ( iflag )
> > +    {
> > +        d = rcu_lock_domain_by_id(iommu->domid_map[device_id]);
> > +        if ( d == NULL )
> > +            return -ENODATA;
> > +
> > +        qinval_entry->q.inv_wait_dsc.lo.sdata = ++ qi_table_data(d);
> 
> Stray blank following the ++.
> 

I will modify it in the next version.

> > +        qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(
> > +
> &qi_table_pollslot(d)) >> 2;
> > +        rcu_unlock_domain(d);
> 
> If you don't hold a reference to the domain, what prevents it from going away,
> struct domain getting freed, and the write to the poll slot corrupting data if the
> memory gets re-used? Plus, if you obtain a domain reference at the time you
> enter it into domid_map[], you wouldn't have to be worried about failure above
> (i.e. you could simply ASSERT() success of rcu_lock_domain_by_id()).
> 
> Considering the implementation of rcu_lock_domain_by_id() I also wonder
> whether it wouldn't be more efficient to make domid_map[] an array of struct
> domain pointers.
> 
Good catch!!
Patch #11 prevents it from going away.
Patch #11:
##
If the Device-TLB flush is still not completed, schedule and wait on a waitqueue until the Device-TLB flush is
completed, before scheduling the RCU asynchronous completion of domain destroy.
##


And, once the domain has been obtained via rcu_lock_domain_by_id(), the domain structure is protected by the RCU lock from rcu_lock_domain_by_id() until rcu_unlock_domain().
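
For reference, my reading of Jan's suggestion, as a sketch (assuming domid_map[]
is changed into an array of struct domain pointers):

    /* Take a reference when the domain id is entered into the map ... */
    static int iommu_set_domid_map(struct iommu *iommu, u16 did,
                                   struct domain *d)
    {
        if ( !get_domain(d) )           /* fails only for a dying domain */
            return -EINVAL;
        iommu->domid_map[did] = d;      /* now a struct domain *, not an id */
        return 0;
    }

    /* ... and drop it when the mapping is torn down. */
    static void iommu_clear_domid_map(struct iommu *iommu, u16 did)
    {
        struct domain *d = iommu->domid_map[did];

        iommu->domid_map[did] = NULL;
        if ( d )
            put_domain(d);
    }

With this, queue_invalidate_wait() could ASSERT() that the lookup succeeds
instead of handling failure.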




> > +    } else {
> 
> Coding style.
> 

I will modify it in the next version.



> > --- a/xen/include/xen/hvm/iommu.h
> > +++ b/xen/include/xen/hvm/iommu.h
> > @@ -23,6 +23,21 @@
> >  #include <xen/list.h>
> >  #include <asm/hvm/iommu.h>
> >
> > +/*
> > + * Status Address and Data: Status address and data is used by
> > +hardware to perform
> > + * wait descriptor completion status write when the Status Write(SW) field is
> Set.
> > + *
> > + * Track the Device-TLB invalidation status in an invalidation table.
> > +Update
> > + * invalidation table's count of in-flight Device-TLB invalidation
> > +request and
> > + * assign the address of global polling parameter per domain in the
> > +Status Address
> > + * of each invalidation wait descriptor, when submit Device-TLB
> > +invalidation
> > + * requests.
> > + */
> > +struct qi_talbe {
> > +    u64 qi_table_poll_slot;
> 
> Why is this a 64-bit field when the respective write stores 32 bits only?
> 
Yes, it should be u32; it is a DWORD.
The Invalidation Wait Descriptor's Status Address[63:2] field is what misled me here.


> Also the qi_table_ prefixes seem rather pointless.
> 

I will modify it in the next version.
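
For reference, the corrected structure following these comments would be roughly
(a sketch):

    /*
     * Tag typo fixed, pointless qi_table_ prefixes dropped, and the poll
     * slot narrowed to the DWORD the hardware actually writes.
     */
    struct qi_table {
        u32 poll_slot;      /* Status Address target of the wait descriptor */
        u32 status_data;    /* count of in-flight Device-TLB invalidations */
    };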



Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-07 17:02             ` Xu, Quan
  2015-10-08  8:51               ` Jan Beulich
@ 2015-10-10 18:24               ` Tim Deegan
  2015-10-11 11:09                 ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Tim Deegan @ 2015-10-10 18:24 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Hi,

At 17:02 +0000 on 07 Oct (1444237344), Xu, Quan wrote:
> __scheme A__
> Q1: - when to take the references?
>     take the reference when the IOMMU entry is _created_;
>     in detail:
>      --iommu_map_page(), or
>      --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> That leaves one question:
>     -- how to do with hot-plug ATS device pass-through? As the EPT doesn't aware IOMMU will share EPT page table when EPT page table was _created_.

Either you have to take all the references even before the hot-plug,
or you have to do an (interruptible) pass over the EPT table when
needs_iommu becomes set.  Of those I prefer the first one.  EPT
entries already take references to foreign pages, and I think that
extending that to other kinds should be OK.

> Q2: how do you know when to drop them?
>    - log (or something) when the IOMMU entry is removed/overwritten; and
>    - drop the entry when the flush completes.
>    
>    -- We can add a new page_list_entry structure per page_info, and Add the page with the new page_list_entry structure to per domain page list, when
>     the IOMMU entry is removed/overwritten; and drop the entry when the flush completes. 

As Jan says, I think that might be too much overhead -- a new
page_list_entry would be a pretty big increase to page_info.
(And potentially might not be enough if a page ever needs to be on two
lists? Or can we make sure that never happens?)

Storing the list of to-be-flushed MFNs separately sounds better.  The
question is how to make sure we don't run out of memory to store that
list.  Maybe have a pool allocated that's big enough for sensible use,
and fail p2m updates with -EAGAIN when we run out?
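
Something like this is what I have in mind (a sketch only; every name in it
is invented):

    struct flush_entry {
        struct page_info *page;
        struct list_head list;
    };

    /* Allocate from a per-domain pool, sized at domain creation. */
    static int defer_until_flush(struct domain *d, struct page_info *pg)
    {
        struct flush_entry *e = flush_pool_alloc(d);

        if ( e == NULL )
            return -EAGAIN;           /* pool empty: fail the p2m update */

        e->page = pg;
        list_add_tail(&e->list, &d->pending_flush_list);
        return 0;
    }

    /* On flush completion: walk pending_flush_list, put_page() each
     * entry's page, and return the entries to the pool. */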

> Q3: what to do about mappings of other domains' memory (i.e. grant and foreign mappings).
>    Between two domains, now I have only one idea to fix this tricky issue -- waitqueue.
>    I.e. grant.
>     For gnttab_transfer /gnttab_unmap , wait on a waitqueue before updating grant flag, until the Device-TLB flush is completed.
>     For grant-mapped, it is safe as the modification of gnttab_unmap.

Waitqueues are probably more heavyweight than is needed here.  The
grant unmap operation can complete immediately, and we can queue the
reference drop (and maybe even some of the maptrack updates) to be
finished when the flush completes.  It could maybe use the same
queue/log as the normal p2m updates.
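
Roughly (again with invented names, reusing the pool sketched above):

    /* The guest-visible unmap completes now; only the final deref (and
     * possibly the maptrack release) waits for the flush to complete. */
    static int gnttab_defer_page_drop(struct domain *rd, struct page_info *pg)
    {
        return defer_until_flush(rd, pg);   /* -EAGAIN if the pool is empty */
    }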

> __scheme B__
> Q1: - when to take the references?
> 
>     take the reference when the IOMMU entry is _ removed/overwritten_;
>     in detail:
>      --iommu_unmap_page(), or
>      --ept_set_entry() [Once IOMMU shares EPT page table.]

Hmm.  That might be practical, taking a typecount of whatever type the
page happens to be at the time.  I would prefer scheme A though, if it
can be made to work.  It fits better with the usual way refcounts are
used, and it's more likely to be a good long-term fix.  This is the
second time we've had to add refcounting for an edge case, and I
suspect it won't be the last.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-10 18:24               ` Tim Deegan
@ 2015-10-11 11:09                 ` Xu, Quan
  2015-10-12 12:25                   ` Jan Beulich
  2015-10-13  9:34                   ` Tim Deegan
  0 siblings, 2 replies; 84+ messages in thread
From: Xu, Quan @ 2015-10-11 11:09 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

On 11.10.2015 at 2:25, <tim@xen.org> wrote:
> At 17:02 +0000 on 07 Oct (1444237344), Xu, Quan wrote:
> > Q2: how do you know when to drop them?
> >    - log (or something) when the IOMMU entry is removed/overwritten; and
> >    - drop the entry when the flush completes.
> >
> >    -- We can add a new page_list_entry structure per page_info, and Add the
> page with the new page_list_entry structure to per domain page list, when
> >     the IOMMU entry is removed/overwritten; and drop the entry when the
> flush completes.
> 
> As Jan says, I think that might be too much overhead -- a new page_list_entry
> would be a pretty big increase to page_info.
> (And potentially might not be enough if a page ever needs to be on two lists? Or
> can we make sure that never happens?)
> 
> Storing the list of to-be-flushed MFNs separately sounds better.  The question is
> how to make sure we don't run out of memory to store that list.  Maybe have a
> pool allocated that's big enough for sensible use, and fail p2m updates with
> -EAGAIN when we run out?
> 
Ignore this method. It is not a good idea.
One question: do the two lists refer to page_list and arch.relmem_list?
It might be a challenge to make sure we don't run out of memory if additional relevant information is stored.
An aggressive method:
-- Remove the page from the domain's page_list / arch.relmem_list,
   [like steal_page(): swizzle the owner -- page_set_owner(page, NULL); unlink from the original owner -- page_list_del(page, &d->page_list)]. Then (taking a reference for __scheme_B__) add the page to a per-domain page list, and put the page when the flush completes.
I am afraid this method may cause some potential issues, i.e. make Xen crash in some corner cases.


> > Q3: what to do about mappings of other domains' memory (i.e. grant and
> foreign mappings).
> >    Between two domains, now I have only one idea to fix this tricky issue --
> waitqueue.
> >    I.e. grant.
> >     For gnttab_transfer /gnttab_unmap , wait on a waitqueue before
> updating grant flag, until the Device-TLB flush is completed.
> >     For grant-mapped, it is safe as the modification of gnttab_unmap.
> 
> Waitqueues are probably more heavyweight than is needed here.  The grant
> unmap operation can complete immediately, and we can queue the reference
> drop (and maybe even some of the maptrack updates) to be finished when the
> flush completes.  It could maybe use the same queue/log as the normal p2m
> updates.
> 


Makes sense.
It could queue the reference drop and the grant flag update, to be finished when the flush completes.


> > __scheme B__
> > Q1: - when to take the references?
> >
> >     take the reference when the IOMMU entry is _ removed/overwritten_;
> >     in detail:
> >      --iommu_unmap_page(), or
> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
> 
> Hmm.  That might be practical, taking a typecount of whatever type the page
> happens to be at the time.  I would prefer scheme A though, if it can be made
> to work.  It fits better with the usual way refcounts are used, and it's more likely
> to be a good long-term fix.  This is the second time we've had to add
> refcounting for an edge case, and I suspect it won't be the last.
> 




I know you prefer __scheme_A__ (I think Jan prefers __scheme_A__ too -- Jan, correct me if I am wrong :) ),
which fits better with the usual way refcounts are used. But __scheme_A__ would be a hard sell to my team (obviously: why spend so much effort on such a small issue? why is __scheme_B__ not acceptable?). I think __scheme_A__ is also a tricky solution.


The page associated with the freed portion of GPA should not be reallocated for another purpose until the flush completes.
I think __scheme_A__ is complex in that it has to keep a log of all relevant pending derefs, to be processed when the flush completes.

Optimized __scheme_A__:
it could keep a log of the reference only when the IOMMU entry is _removed/overwritten_ (if the IOMMU entry is not _removed/overwritten_, it is safe).

Is this optimized __scheme_A__ correct and acceptable? If so, it would get buy-in from all parties.



I am very glad to discuss with you and Jan.
Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
                   ` (15 preceding siblings ...)
  2015-09-21 14:09 ` Xu, Quan
@ 2015-10-12  1:42 ` Zhang, Yang Z
  2015-10-12 12:34   ` Jan Beulich
  16 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-12  1:42 UTC (permalink / raw)
  To: Xu, Quan, andrew.cooper3, Dong, Eddie, ian.campbell, ian.jackson,
	jbeulich, Nakajima, Jun, keir, Tian, Kevin, tim, george.dunlap
  Cc: xen-devel

Xu, Quan wrote on 2015-09-16:
> Introduction
> ============
> 
>    VT-d code currently has a number of cases where completion of
> certain operations is being waited for by way of spinning. The
> majority of instances use that variable indirectly through
> IOMMU_WAIT_OP() macro , allowing for loops of up to 1 second
> (DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the invalidation case seems particularly problematic.
> 
> Currently hypervisor polls the status address of wait descriptor up to
> 1 second to get Invalidation flush result. When Invalidation queue
> includes Device-TLB invalidation, using 1 second is a mistake here in
> the validation sync. As the 1 second timeout here is related to
> response times by the IOMMU engine, Instead of Device-TLB invalidation
> with PCI-e Address Translation Services (ATS) in use. the ATS specification mandates a timeout of 1 _minute_ for cache flush.
> The ATS case needs to be taken into consideration when doing invalidations.
> Obviously we can't spin for a minute, so invalidation absolutely needs
> to be converted to a non-spinning model.
> 
>    Also i should fix the new memory security issue.
> The page freed from the domain should be on held, until the Device-TLB
> flush is completed (ATS timeout of 1 _minute_).
> The page previously associated  with the freed portion of GPA should
> not be reallocated for another purpose until the appropriate
> invalidations have been performed. Otherwise, the original page owner
> can still access freed page though DMA.
> 

Hi Maintainers,

According to the discussion and suggestions you made over the past several weeks, this is obviously not an easy task. So I am wondering whether it is worth doing, since:
1. ATS devices are not popular. I only know of one NIC, from Myricom, that has ATS capabilities.
2. The issue is only theoretical. Linux, Windows and VMware all use spinning now, as does Xen, but none of them has observed any problem so far.
3. I know there is a proposal to modify the timeout value (maybe to less than 1 ms) in the ATS spec to mitigate the problem. But the risk is how long that will take to achieve.
4. The most important concern is that it is too complicated to fix in Xen, since it needs to modify the core memory code. And I don't think Quan and I have enough knowledge to do it perfectly at present. It may take a long time -- half a year or one year? (We have spent three months so far.) Yes, if Tim would like to take it, it will be much faster. :)

So my suggestion is that we rely on the user not to assign an ATS device if the hypervisor says it cannot support such a device. For example, if the hypervisor finds the invalidation isn't completed within 1 second, it can crash itself and tell the user that this ATS device needs more than 1 second of invalidation time, which is not supported by Xen.

Any comments?

Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt.
  2015-10-10  8:22     ` Xu, Quan
@ 2015-10-12  7:11       ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-10-12  7:11 UTC (permalink / raw)
  To: Quan Xu; +Cc: Yang Z Zhang, tim, keir, xen-devel

>>> On 10.10.15 at 10:22, <quan.xu@intel.com> wrote:
>> > >>> On 29.09.2015 at 16:57 <JBeulich@suse.com> wrote:
>> >>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
>> > +/* IOMMU Queued Invalidation(QI). */
>> > +static void _qi_msi_unmask(struct iommu *iommu) {
>> > +    u32 sts;
>> > +    unsigned long flags;
>> > +
>> > +    /* Clear IM bit of DMAR_IECTL_REG. */
>> > +    spin_lock_irqsave(&iommu->register_lock, flags);
>> > +    sts = dmar_readl(iommu->reg, DMAR_IECTL_REG);
>> > +    sts &= ~DMA_IECTL_IM;
>> > +    dmar_writel(iommu->reg, DMAR_IECTL_REG, sts);
>> > +    spin_unlock_irqrestore(&iommu->register_lock, flags); }
>> 
>> I think rather than duplicating all this code from the fault interrupt you should
>> instead re-factor to make the original usable for both purposes. Afaics the
>> differences really are just the register and bit locations.
>> 
> 
> hw_irq_controller is a common data structure for arm/amd/x86.
> For reusing these function, I should redefine the function pointers 
> .set_affinity / . startup .etc.

Why?

> It takes much effort to modify the other arm/amd/x86 code.

Sure, and this certainly should be avoided.
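
To be concrete, all I'm thinking of is something along these lines (a sketch
only; the helper name is made up):

    /*
     * One helper can serve both the fault event and the QI event
     * interrupts - only the control register offset and mask bit differ.
     */
    static void vtd_msi_unmask(struct iommu *iommu, unsigned int ctl_reg,
                               u32 im_bit)
    {
        u32 sts;
        unsigned long flags;

        spin_lock_irqsave(&iommu->register_lock, flags);
        sts = dmar_readl(iommu->reg, ctl_reg);
        sts &= ~im_bit;
        dmar_writel(iommu->reg, ctl_reg, sts);
        spin_unlock_irqrestore(&iommu->register_lock, flags);
    }

called as vtd_msi_unmask(iommu, DMAR_FECTL_REG, DMA_FECTL_IM) for the fault
case and vtd_msi_unmask(iommu, DMAR_IECTL_REG, DMA_IECTL_IM) for the QI case -
no common arm/amd/x86 changes needed.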

>> > +static void _do_iommu_qi(struct iommu *iommu) { }
>> 
>> ???
>> 
> 
> Now it still has no knowledge about ATS.

But what's the function good for then? Can't it be added when it
needs to have a body?

>> > +static void qi_msi_end(struct irq_desc *desc, u8 vector) {
>> > +    ack_APIC_irq();
>> > +}
>> 
>> Why is there, other than for its fault counterpart, no unmask operation 
> here?
>> 
> 
> 
>  In my design I try to optimize this in the interrupt handler (or its
> associated tasklet): check the IP field at the end of the handler; if it
> is set, scan the domain list again instead of clearing IM to cause
> another interrupt. The logic of IWC/IP/IM is as follows:
> 
>     Interrupt condition: an invalidation wait descriptor with the
>     Interrupt Flag (IF) field set completes.
> 
>       - If IWC is already set: this is not treated as a new interrupt
>         condition (nothing further happens).
>       - If IWC is not set: hardware sets IWC and IP, then checks IM:
>           - IM set:     the interrupt is held pending; clearing IM
>                         later causes the interrupt message.
>           - IM not set: hardware generates the interrupt message and
>                         then clears IP.
> 
> And, if you clear IWC, IP is cleared as well.

I don't follow - in analogy to handling of the Fault Event Control
Register, unmasking would mean to clear IM, not IWC. The
question really is more about symmetry, i.e. if the answer is
"we don't need this because IM ought to be clear anyway", then
you'd also need to answer why the fault event case needs it
(which I think is pretty clear - it needs to undo what the
corresponding .ack method did).

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table.
  2015-10-10 12:27     ` Xu, Quan
@ 2015-10-12  7:15       ` Jan Beulich
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-10-12  7:15 UTC (permalink / raw)
  To: Quan Xu; +Cc: Yang Z Zhang, tim, keir, xen-devel

>>> On 10.10.15 at 14:27, <quan.xu@intel.com> wrote:
>> > >>> On 29.09.2015 at 17:24 <JBeulich@suse.com> wrote:
>> >>> On 16.09.15 at 15:23, <quan.xu@intel.com> wrote:
>> > +        qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(
>> > +
>> &qi_table_pollslot(d)) >> 2;
>> > +        rcu_unlock_domain(d);
>> 
>> If you don't hold a reference to the domain, what prevents it from going away,
>> struct domain getting freed, and the write to the poll slot corrupting data if the
>> memory gets re-used? Plus, if you obtain a domain reference at the time you
>> enter it into domid_map[], you wouldn't have to be worried about failure above
>> (i.e. you could simply ASSERT() success of rcu_lock_domain_by_id()).
>> 
>> Considering the implementation of rcu_lock_domain_by_id() I also wonder
>> whether it wouldn't be more efficient to make domid_map[] an array of struct
>> domain pointers.
>> 
> Good catch!!
> Patch #11 can prevents it from going away.
> patch 11: 
> ##
> If the Device-TLB flush is still not completed, schedule and wait on a 
> waitqueue until the Device-TLB flush is 
> Completed, before schedule RCU asynchronous completion of domain destroy.
> ##

Not sure I understand what you're trying to tell me here: Do you
mean things are safe due to patch 11? If so, please recall that
things need to be safe after every individual patch. If not, what
exactly are you trying to say?

> And,
>  If it has got domain by rcu_lock_domain_by_id(), the domain structure is 
> protected by RCU lock from rcu_lock_domain_by_id() to 
> rcu_lock_domain_by_id().

???

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-11 11:09                 ` Xu, Quan
@ 2015-10-12 12:25                   ` Jan Beulich
  2015-10-13  9:34                   ` Tim Deegan
  1 sibling, 0 replies; 84+ messages in thread
From: Jan Beulich @ 2015-10-12 12:25 UTC (permalink / raw)
  To: Quan Xu
  Cc: Tim Deegan, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Eddie Dong, xen-devel, Jun Nakajima,
	Yang Z Zhang

>>> On 11.10.15 at 13:09, <quan.xu@intel.com> wrote:
> On 11.10.2015 at 2:25, <tim@xen.org> wrote:
>> At 17:02 +0000 on 07 Oct (1444237344), Xu, Quan wrote:
>> > Q2: how do you know when to drop them?
>> >    - log (or something) when the IOMMU entry is removed/overwritten; and
>> >    - drop the entry when the flush completes.
>> >
>> >    -- We can add a new page_list_entry structure per page_info, and Add the
>> page with the new page_list_entry structure to per domain page list, when
>> >     the IOMMU entry is removed/overwritten; and drop the entry when the
>> flush completes.
>> 
>> As Jan says, I think that might be too much overhead -- a new page_list_entry
>> would be a pretty big increase to page_info.
>> (And potentially might not be enough if a page ever needs to be on two 
> lists? Or
>> can we make sure that never happens?)
>> 
>> Storing the list of to-be-flushed MFNs separately sounds better.  The question 
> is
>> how to make sure we don't run out of memory to store that list.  Maybe have 
> a
>> pool allocated that's big enough for sensible use, and fail p2m updates with
>> -EAGAIN when we run out?
>> 
> Ignore this method. It is not a good idea.
> One question: do two lists refer to page_list and arch.relmem_list?
> It is might a challenge to make sure we don't run out of memory, if it store 
> additional relevant information.
> An aggressive method:
> --Remove this page from domain page_list / arch.relmem_list.  
> ... [like steal_page(). Swizzle the owner -- page_set_owner(page, NULL);  
> Unlink from original owner -- page_list_del(page, &d->page_list)] Then,( take a 
> reference for _scheme_B__,) add this page to a per domain page list(), and 
> put this page when the flush completes.

I'm not following - are you suggesting to remove the page from the
guest temporarily? That's certainly not a valid approach.

> I am afraid that this method may cause some potential issues. i.e. make xen 
> crash in some corner cases ..

And such would preclude its use too.

>> > __scheme B__
>> > Q1: - when to take the references?
>> >
>> >     take the reference when the IOMMU entry is _ removed/overwritten_;
>> >     in detail:
>> >      --iommu_unmap_page(), or
>> >      --ept_set_entry() [Once IOMMU shares EPT page table.]
>> 
>> Hmm.  That might be practical, taking a typecount of whatever type the page
>> happens to be at the time.  I would prefer scheme A though, if it can be 
> made
>> to work.  It fits better with the usual way refcounts are used, and it's 
> more likely
>> to be a good long-term fix.  This is the second time we've had to add
>> refcounting for an edge case, and I suspect it won't be the last.
> 
> I know you prefer __scheme_A__(I think Jan prefers __scheme_A__ too.  Jan, 
> correct me, if I am wrong :) )
> which fits better with the usual way refcounts are used. But __scheme_A__ 
> would be difficult for buy-in by my team (Obviously, why spend so many effort 
> for such a small issue? why does __scheme_B__ not accept?) I think, 
> __scheme_A__ is also a tricky solution.
> 
> 
>  The IOMMU entry associated with the freed portion of GPA should not be 
> reallocated for another purpose until the flush completes.
>  I think __scheme_A__ is complex to keep a log of all relevant pending 
> derefs, and to be processed when the flush completes;
> 
> optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is _ 
> removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is 
> safe.).

And I'm afraid this is too vague for me to understand - what do you
want to track under what conditions?

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-12  1:42 ` Zhang, Yang Z
@ 2015-10-12 12:34   ` Jan Beulich
  2015-10-13  5:27     ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-12 12:34 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
> According the discussion and suggestion you made in past several weeks, 
> obviously, it is not an easy task. So I am wondering whether it is worth to 
> do it since: 
> 1. ATS device is not popular. I only know one NIC from Myricom has ATS 
> capabilities.
> 2. The issue is only in theory. Linux, Windows, VMware are all using spin 
> now as well as Xen, but none of them observed any problem so far.
> 3. I know there is propose to modify the timeout value(maybe less in 1 ms) 
> in ATS spec to mitigate the problem. But the risk is how long to achieve it.
> 4. The most important concern is it is too complicated to fix it in Xen 
> since it needs to modify the core memory part. And I don't think Quan and i 
> have the enough knowledge to do it perfectly currently. It may take long 
> time, half of year or one year?(We have spent three months so far). Yes, if 
> Tim likes to take it. It will be much fast. :)
> 
> So, my suggestion is that we can rely on user to not assign the ATS device 
> if hypervisor says it cannot support such device. For example, if hypervisor 
> find the invalidation isn't completed in 1 second, then hypervisor can crash 
> itself and tell the user this ATS device needs more than 1 second 
> invalidation time which is not support by Xen.

Crashing the hypervisor in such a case is a security issue, i.e. is not
an acceptable thing (and the fact that we panic() on timeout expiry
right now isn't really acceptable either). If crashing the offending
guest was sufficient to contain the issue, that might be an option.
Else ripping out ATS support (and limiting spin time below what there
is currently) may be the only alternative to fixing it.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-12 12:34   ` Jan Beulich
@ 2015-10-13  5:27     ` Zhang, Yang Z
  2015-10-13  9:15       ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-13  5:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-12:
>>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
>> According the discussion and suggestion you made in past several
>> weeks, obviously, it is not an easy task. So I am wondering whether
>> it is worth to do it since:
>> 1. ATS device is not popular. I only know one NIC from Myricom has
>> ATS capabilities.
>> 2. The issue is only in theory. Linux, Windows, VMware are all using
>> spin now as well as Xen, but none of them observed any problem so far.
>> 3. I know there is propose to modify the timeout value(maybe less in
>> 1
>> ms) in ATS spec to mitigate the problem. But the risk is how long to achieve it.
>> 4. The most important concern is it is too complicated to fix it in
>> Xen since it needs to modify the core memory part. And I don't think
>> Quan and i have the enough knowledge to do it perfectly currently.
>> It may take long time, half of year or one year?(We have spent three
>> months so far). Yes, if Tim likes to take it. It will be much fast.
>> :)
>> 
>> So, my suggestion is that we can rely on user to not assign the ATS
>> device if hypervisor says it cannot support such device. For
>> example, if hypervisor find the invalidation isn't completed in 1
>> second, then hypervisor can crash itself and tell the user this ATS
>> device needs more than 1 second invalidation time which is not support by Xen.
> 
> Crashing the hypervisor in such a case is a security issue, i.e. is not

Indeed. Crashing the guest is more reasonable. 

> an acceptable thing (and the fact that we panic() on timeout expiry
> right now isn't really acceptable either). If crashing the offending
> guest was sufficient to contain the issue, that might be an option. Else

I think it should be sufficient (any concerns from you?). The hypervisor can crash the guest with a hint that the device may need a long time to complete the invalidation, or that the device may be bad. And the user should add the device to a blacklist to disallow assignment again.

> ripping out ATS support (and limiting spin time below what there is
> currently) may be the only alternative to fixing it.

Yes, it is another solution, considering that ATS devices are rare currently. For the spin time, 10ms should be enough in both solutions.
But if solution 1 is acceptable, I prefer it, since most ATS devices would still be able to work with Xen.

Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-13  5:27     ` Zhang, Yang Z
@ 2015-10-13  9:15       ` Jan Beulich
  2015-10-14  5:12         ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-13  9:15 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 13.10.15 at 07:27, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-10-12:
>>>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
>>> So, my suggestion is that we can rely on user to not assign the ATS
>>> device if hypervisor says it cannot support such device. For
>>> example, if hypervisor find the invalidation isn't completed in 1
>>> second, then hypervisor can crash itself and tell the user this ATS
>>> device needs more than 1 second invalidation time which is not support by 
> Xen.
>> 
>> Crashing the hypervisor in such a case is a security issue, i.e. is not
> 
> Indeed. Crashing the guest is more reasonable. 
> 
>> an acceptable thing (and the fact that we panic() on timeout expiry
>> right now isn't really acceptable either). If crashing the offending
>> guest was sufficient to contain the issue, that might be an option. Else
> 
> I think it should be sufficient (any concern from you?).

Having looked at the code, it wasn't immediately clear whether
that would work. After all, one would think there would be a
reason for the code panic()ing right now instead.

> Hypervisor can 
> crash the guest with hint that the device may need long time to complete the 
> invalidation or device maybe bad. And user should add the device to a 
> blacklist to disallow assignment again. 
> 
>> ripping out ATS support (and limiting spin time below what there is
>> currently) may be the only alternative to fixing it.
> 
> Yes, it is another solution considering ATS device is rare currently. For 
> spin time, 10ms should be enough in both two solutions. 

But 10ms is awfully much already. Which is why I've been advocating
async flushing independent of ATS.

> But if solution 1 is acceptable, I prefer it since most of ATS devices are 
> still able to play with Xen.

With a multi-millisecond spin, solution 1 would imo be acceptable only
as a transitional measure.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-11 11:09                 ` Xu, Quan
  2015-10-12 12:25                   ` Jan Beulich
@ 2015-10-13  9:34                   ` Tim Deegan
  2015-10-14 14:44                     ` Xu, Quan
  1 sibling, 1 reply; 84+ messages in thread
From: Tim Deegan @ 2015-10-13  9:34 UTC (permalink / raw)
  To: Xu, Quan
  Cc: Nakajima, Jun, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, Dong, Eddie, xen-devel, jbeulich, Zhang, Yang Z

Hi,

At 11:09 +0000 on 11 Oct (1444561760), Xu, Quan wrote:
> One question: do two lists refer to page_list and arch.relmem_list?

No, I was wondering if a page ever needed to be queued waiting for two
different flushes -- e.g. if there are multiple IOMMUs.

> I know you prefer __scheme_A__(I think Jan prefers __scheme_A__ too.  Jan, correct me, if I am wrong :) )
> which fits better with the usual way refcounts are used. But __scheme_A__ would be difficult for buy-in by my team (Obviously, why spend so many effort for such a small issue? why does __scheme_B__ not accept?) I think, __scheme_A__ is also a tricky solution.
> 

What in particular is worrying you about scheme A?  AFAICS you need to
build the same refcount-taking mechanism for either scheme.

Is it the interactions with other p2m-based features in VMs that don't
have devices passed through?  In that case perhaps you could just
mandate that ATS support means no shared HAP/IOMMU tables, and do the
refcounting only in the IOMMU ones?

>  I think __scheme_A__ is complex to keep a log of all relevant pending derefs, and to be processed when the flush completes;
> 
> optimized __scheme_A__:
> It could keep a log of the reference only when the IOMMU entry is _ removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is safe.).

Yes, though I'd add any change from read-write to read-only.
Basically, you only need the log/deref whenever you need to flush the
IOMMU. :)

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-09-29  7:21           ` Jan Beulich
  2015-09-30 13:55             ` Xu, Quan
@ 2015-10-13 14:29             ` Xu, Quan
  2015-10-13 14:50               ` Jan Beulich
  1 sibling, 1 reply; 84+ messages in thread
From: Xu, Quan @ 2015-10-13 14:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Nakajima, Jun, Zhang, Yang Z

>> >>>On 29.09.2015 at 15:22 <JBeulich@suse.com> wrote:
> >>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
> >>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
> >> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
> >> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:


>The extra ref taken will prevent the page from getting freed. 

Jan, could you share more about it?

I want to check some cases of Xen memory, e.g.:

1. if ((page->count_info & PGC_count_mask) == 0) and (page->count_info != 0):
In this case, can the page be freed to the Xen domain heap?

2. if ((page->count_info & PGC_count_mask) == 0) and (page->u.inuse.type_info != 0):
In this case, can the page be freed to the Xen domain heap?

Thanks!

Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-13 14:29             ` Xu, Quan
@ 2015-10-13 14:50               ` Jan Beulich
  2015-10-14 14:54                 ` Xu, Quan
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-13 14:50 UTC (permalink / raw)
  To: Quan Xu
  Cc: Kevin Tian, keir, ian.campbell, george.dunlap, andrew.cooper3,
	Tim Deegan, xen-devel, Jun Nakajima, YangZ Zhang

>>> On 13.10.15 at 16:29, <quan.xu@intel.com> wrote:
>> > >>>On 29.09.2015 at 15:22 <JBeulich@suse.com> wrote:
>> >>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
>> >>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
>> >> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
>> >> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:
>>The extra ref taken will prevent the page from getting freed. 
> 
> Jan, could you share more about it?
> 
> I want to check some cases of Xen memory. i.e.
> 
> 1. if (page->count_info & PGC_count_mask == 0) and (page->count_info != 0)
> In this case, can the page be freed to xen domain heap?

Whether a page can get freed depends on changes to count_info, not
just its current state. For instance, PGC_allocated set implies
page->count_info & PGC_count_mask != 0, i.e. your question above
cannot be answered properly. Just look at put_page() - it frees the
page when the count _drops_ to zero.

> 2. if  (page->count_info & PGC_count_mask == 0) and  (page->u.inuse.type_info != 0) :
> In this case, can the page be freed to xen domain heap?

Generally type_info should be zero when the ref count is zero; there
are, I think, exceptional cases (like during domain death) where this
might get violated with no harm. But again - look at put_page() and
you'll see that type_info doesn't matter for whether a page gets
freed; all it governs is whether a page's type can change: only when
the type count is zero.
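
For reference, put_page() is roughly of this shape (paraphrased, not a
verbatim quote):

    void put_page(struct page_info *page)
    {
        unsigned long nx, x, y = page->count_info;

        do {
            ASSERT((y & PGC_count_mask) != 0);
            x  = y;
            nx = x - 1;
        } while ( unlikely((y = cmpxchg(&page->count_info, x, nx)) != x) );

        /* The page is freed only when the count _drops_ to zero here. */
        if ( unlikely((nx & PGC_count_mask) == 0) )
            free_domheap_page(page);
    }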

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-13  9:15       ` Jan Beulich
@ 2015-10-14  5:12         ` Zhang, Yang Z
  2015-10-14  9:30           ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-14  5:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-13:
>>>> On 13.10.15 at 07:27, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-10-12:
>>>>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
>>>> So, my suggestion is that we can rely on user to not assign the
>>>> ATS device if hypervisor says it cannot support such device. For
>>>> example, if hypervisor find the invalidation isn't completed in 1
>>>> second, then hypervisor can crash itself and tell the user this
>>>> ATS device needs more than 1 second invalidation time which is not
>>>> support by
>> Xen.
>>> 
>>> Crashing the hypervisor in such a case is a security issue, i.e. is
>>> not
>> 
>> Indeed. Crashing the guest is more reasonable.
>> 
>>> an acceptable thing (and the fact that we panic() on timeout expiry
>>> right now isn't really acceptable either). If crashing the
>>> offending guest was sufficient to contain the issue, that might be an option.
>>> Else
>> 
>> I think it should be sufficient (any concern from you?).
> 
> Having looked at the code, it wasn't immediately clear whether that
> would work. After all there one would think there would be a reason
> for the code panic()ing right now instead.

What does the panic()ing refer to here?

> 
>> Hypervisor can
>> crash the guest with hint that the device may need long time to
>> complete the invalidation or device maybe bad. And user should add
>> the device to a blacklist to disallow assignment again.
>> 
>>> ripping out ATS support (and limiting spin time below what there is
>>> currently) may be the only alternative to fixing it.
>> 
>> Yes, it is another solution considering ATS device is rare currently.
>> For spin time, 10ms should be enough in both two solutions.
> 
> But 10ms is awfully much already. Which is why I've been advocating
> async flushing independent of ATS.

Agree. Technically speaking, async flush is the best solution. But considering the complexity and the benefit it brings, a compromise solution may be better.

> 
>> But if solution 1 is acceptable, I prefer it since most of ATS
>> devices are still able to play with Xen.
> 
> With a multi-millisecond spin, solution 1 would imo be acceptable only
> as a transitional measure.

What does the transitional measure mean? Do you mean we still need the async flush for the ATS issue, or that we can adopt solution 1 until the ATS spec is changed?

> 
> Jan


Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-14  5:12         ` Zhang, Yang Z
@ 2015-10-14  9:30           ` Jan Beulich
  2015-10-15  1:03             ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-14  9:30 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 14.10.15 at 07:12, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-10-13:
>>>>> On 13.10.15 at 07:27, <yang.z.zhang@intel.com> wrote:
>>> Jan Beulich wrote on 2015-10-12:
>>>>>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
>>>>> So, my suggestion is that we can rely on user to not assign the
>>>>> ATS device if hypervisor says it cannot support such device. For
>>>>> example, if hypervisor find the invalidation isn't completed in 1
>>>>> second, then hypervisor can crash itself and tell the user this
>>>>> ATS device needs more than 1 second invalidation time which is not
>>>>> support by
>>> Xen.
>>>> 
>>>> Crashing the hypervisor in such a case is a security issue, i.e. is
>>>> not
>>> 
>>> Indeed. Crashing the guest is more reasonable.
>>> 
>>>> an acceptable thing (and the fact that we panic() on timeout expiry
>>>> right now isn't really acceptable either). If crashing the
>>>> offending guest was sufficient to contain the issue, that might be an option.
>>>> Else
>>> 
>>> I think it should be sufficient (any concern from you?).
>> 
>> Having looked at the code, it wasn't immediately clear whether that
>> would work. After all there one would think there would be a reason
>> for the code panic()ing right now instead.
> 
> What the panic()ing refer to here?

E.g. what we have in queue_invalidate_wait():

        while ( poll_slot != QINVAL_STAT_DONE )
        {
            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
            {
                print_qi_regs(iommu);
                panic("queue invalidate wait descriptor was not executed");
            }
            cpu_relax();
        }
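
For comparison, the "crash the offending guest" variant being discussed would
presumably replace the panic() along these lines (a sketch only; domain lookup
and error propagation simplified, with the panic path kept as a fallback):

        while ( poll_slot != QINVAL_STAT_DONE )
        {
            if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
            {
                struct domain *d =
                    rcu_lock_domain_by_id(iommu->domid_map[device_id]);

                print_qi_regs(iommu);
                if ( d != NULL && !is_hardware_domain(d) )
                {
                    domain_crash(d);            /* contain it to the guest */
                    rcu_unlock_domain(d);
                    return -ETIMEDOUT;
                }
                panic("queue invalidate wait descriptor was not executed");
            }
            cpu_relax();
        }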

>>> But if solution 1 is acceptable, I prefer it since most of ATS
>>> devices are still able to play with Xen.
>> 
>> With a multi-millisecond spin, solution 1 would imo be acceptable only
>> as a transitional measure.
> 
> What does the transitional measure mean? Do you mean we still need the async 
> flush for ATS issue or we can adapt solution 1 before ATS spec changed?

As long as the multi-millisecond spins aren't going to go away by
other means, I think conversion to async mode is ultimately
unavoidable.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-13  9:34                   ` Tim Deegan
@ 2015-10-14 14:44                     ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-10-14 14:44 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Zhang, Yang Z, Dong, Eddie, jbeulich, xen-devel

>> >>>On 13.10.2015 at 17:35, <tim@xen.org> wrote:
> At 11:09 +0000 on 11 Oct (1444561760), Xu, Quan wrote:


> What in particular is worrying you about scheme A?  AFAICS you need to build
> the same refcount-taking mechanism for either scheme.
> 
> Is it the interactions with other p2m-based features in VMs that don't have
> devices passed through?  In that case perhaps you could just mandate that ATS
> support means no shared HAP/IOMMU tables, and do the refcounting only in the
> IOMMU ones?
> 

I was worried that for __scheme_A__ I would have to keep a log of all relevant pending derefs, to be processed when the flush completes.
Now I know I only need the log/deref whenever the IOMMU needs to be flushed. :):)


> >  I think __scheme_A__ is complex to keep a log of all relevant pending
> > derefs, and to be processed when the flush completes;
> >
> > optimized __scheme_A__:
> > It could keep a log of the reference only when the IOMMU entry is _
> removed/overwritten_.(if the IOMMU entry is not _ removed/overwritten_, it is
> safe.).
> 
> Yes, though I'd add any change from read-write to read-only.
> Basically, you only need the log/deref whenever you need to flush the
> IOMMU. :)
> 

A summary of __scheme_A__:

 Q1: - when to take the references?
     take the reference when the IOMMU entry is _created_;
     in detail:
      --iommu_map_page(), or
      --ept_set_entry() [Once IOMMU shares EPT page table.]

 Q2: how do you know when to drop them?
    - Log (or something) when the IOMMU entry is removed/overwritten; and
    - Drop the entry when the flush completes.
    in detail:
       --iommu_unmap_page(); or
       --ept_set_entry() [Once IOMMU shares EPT page table.]


   **The challenge: how to log when an IOMMU entry is removed/overwritten?
     (See the sketch after this summary.)



 Q3: what to do about mappings of other domains' memory (i.e. grant and
 foreign mappings).

I.e. grant:
    - Take the reference when the IOMMU entry is _created_;
    then
    - Queue the reference drop; and
    - Queue the grant flag update (doing this only for grant_unmap and grant_transfer is enough -- we can hold off on this question).
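
A sketch of Q1/Q2 as I understand them (written as wrappers just for
illustration; iommu_lookup_page() and log_pending_deref() are assumed helpers,
not existing code):

    /* Q1: reference taken when the IOMMU entry is created. */
    static int ats_iommu_map(struct domain *d, unsigned long gfn,
                             unsigned long mfn, unsigned int flags)
    {
        if ( !get_page(mfn_to_page(mfn), d) )
            return -EINVAL;
        return iommu_map_page(d, gfn, mfn, flags);
    }

    /* Q2: on removal/overwrite, log the pending deref instead of dropping
     * it; the flush-completion path does the actual put_page(). */
    static int ats_iommu_unmap(struct domain *d, unsigned long gfn)
    {
        unsigned long mfn = iommu_lookup_page(d, gfn);
        int rc = iommu_unmap_page(d, gfn);

        if ( rc == 0 )
            rc = log_pending_deref(d, mfn_to_page(mfn));
        return rc;
    }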



 __afaics__:
       1. Are Q1/Q2/Q3 enough for this memory security issue?
       2. Are there any other potential memory issues once I finish __scheme_A__?
       3. Do you have any idea how to log when an IOMMU entry is removed/overwritten?
      
Tim, thanks for your help!


Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-13 14:50               ` Jan Beulich
@ 2015-10-14 14:54                 ` Xu, Quan
  0 siblings, 0 replies; 84+ messages in thread
From: Xu, Quan @ 2015-10-14 14:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Tim Deegan, xen-devel

>> >>> On 13.10.2015 at 22:50 <JBeulich@suse.com> wrote:
> >>> On 13.10.15 at 16:29, <quan.xu@intel.com> wrote:
> >> > >>>On 29.09.2015 at 15:22 <JBeulich@suse.com> wrote:
> >> >>> On 29.09.15 at 04:53, <quan.xu@intel.com> wrote:
> >> >>>> Monday, September 28, 2015 2:47 PM,<JBeulich@suse.com> wrote:
> >> >> >>> On 28.09.15 at 05:08, <quan.xu@intel.com> wrote:
> >> >> >>>> Thursday, September 24, 2015 12:27 AM, Tim Deegan wrote:

> > 1. if (page->count_info & PGC_count_mask == 0) and (page->count_info
> > != 0) In this case, can the page be freed to xen domain heap?
> 
> Whether a page can get freed depends on changes to count_info, not just its
> current state. For instance, PGC_allocated set implies
> page->count_info & PGC_count_mask != 0, i.e. your question above
> cannot be answered properly. Just look at put_page() - it frees the page when
> the count _drops_ to zero.
> 
> > 2. if  (page->count_info & PGC_count_mask == 0) and
> (page->u.inuse.type_info != 0) :
> > In this case, can the page be freed to xen domain heap?
> 
> Generally type_info should be zero when the ref count is zero; there are, I think,
> exceptional cases (like during domain death) where this might get violated with
> no harm. But again - look at put_page() and you'll see that type_info doesn't
> matter for whether a page gets freed; all it matter is whether a page's type can
> change: Only when type count is zero.
> 

Jan, thanks for the kind explanation.

-Quan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-14  9:30           ` Jan Beulich
@ 2015-10-15  1:03             ` Zhang, Yang Z
  2015-10-15  6:46               ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-15  1:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-14:
>>>> On 14.10.15 at 07:12, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-10-13:
>>>>>> On 13.10.15 at 07:27, <yang.z.zhang@intel.com> wrote:
>>>> Jan Beulich wrote on 2015-10-12:
>>>>>>>> On 12.10.15 at 03:42, <yang.z.zhang@intel.com> wrote:
>>>>>> So, my suggestion is that we can rely on user to not assign the
>>>>>> ATS device if hypervisor says it cannot support such device. For
>>>>>> example, if hypervisor find the invalidation isn't completed in
>>>>>> 1 second, then hypervisor can crash itself and tell the user
>>>>>> this ATS device needs more than 1 second invalidation time which
>>>>>> is not support by
>>>> Xen.
>>>>> 
>>>>> Crashing the hypervisor in such a case is a security issue, i.e.
>>>>> is not
>>>> 
>>>> Indeed. Crashing the guest is more reasonable.
>>>> 
>>>>> an acceptable thing (and the fact that we panic() on timeout expiry
>>>>> right now isn't really acceptable either). If crashing the offending
>>>>> guest was sufficient to contain the issue, that might be an option.
>>>>> Else
>>>> 
>>>> I think it should be sufficient (any concern from you?).
>>> 
>>> Having looked at the code, it wasn't immediately clear whether that
>>> would work. After all there one would think there would be a reason
>>> for the code panic()ing right now instead.
>> 
>> What the panic()ing refer to here?
> 
> E.g. what we have in queue_invalidate_wait():
> 
>         while ( poll_slot != QINVAL_STAT_DONE )
>         {
>             if ( NOW() > (start_time + DMAR_OPERATION_TIMEOUT) )
>             {
>                 print_qi_regs(iommu);
>                 panic("queue invalidate wait descriptor was not
> executed");
>             }
>             cpu_relax();
>         }
>>>> But if solution 1 is acceptable, I prefer it, since most ATS
>>>> devices would still be able to work with Xen.
>>> 
>>> With a multi-millisecond spin, solution 1 would imo be acceptable
>>> only as a transitional measure.
>> 
>> What does 'transitional measure' mean? Do you mean we still need
>> the async flush for the ATS issue, or that we can adopt solution 1 until
>> the ATS spec is changed?
> 
> As long as the multi-millisecond spins aren't going to go away by
> other means, I think conversion to async mode is ultimately unavoidable.

I don't fully agree. I think the time spent spinning is what matters. To me, less than 1ms is acceptable, and if the hardware can guarantee it, then sync mode is also OK.
As I remember, the original motivation for handling the ATS problem was twofold: 1. the ATS spec allows a 60s timeout to complete the flush while Xen only allows 1s, and 2. spinning for 1s is not reasonable since it hurts the scheduler. For the former, as we discussed before, either disabling ATS support or supporting only specific ATS devices (those that complete the flush in less than 10ms or 1ms) is acceptable. For the latter, if spinning for 1s is not acceptable, we can reduce the timeout to 10ms or 1ms to eliminate the performance impact.
Yes, I'd agree async mode would be the best solution if Xen had it. But spin loops are used widely in the IOMMU code: not only for invalidations, lots of DMAR operations spin to synchronize with the hardware's status. For those operations, it is hard to use async mode. And even where async mode is possible, I don't see the benefit considering the cost and complexity: we would need either a timer or a softirq to do the check.
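
To make the reduced-timeout option concrete, the poll quoted above might
become something like this sketch (illustrative only: the function name,
the 1ms bound, and the domain_crash() containment are assumptions from
this discussion, not existing Xen code):

    static int queue_invalidate_wait_bounded(struct iommu *iommu,
                                             struct domain *d,
                                             volatile u32 *poll_slot)
    {
        s_time_t start_time = NOW();

        while ( *poll_slot != QINVAL_STAT_DONE )
        {
            if ( NOW() > start_time + MILLISECS(1) )     /* 1ms, not 1s */
            {
                print_qi_regs(iommu);
                domain_crash(d);     /* contain the issue to this guest */
                return -ETIMEDOUT;
            }
            cpu_relax();
        }

        return 0;
    }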

Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  1:03             ` Zhang, Yang Z
@ 2015-10-15  6:46               ` Jan Beulich
  2015-10-15  7:28                 ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-15  6:46 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 15.10.15 at 03:03, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-10-14:
>> As long as the multi-millisecond spins aren't going to go away by
>> other means, I think conversion to async mode is ultimately unavoidable.
> 
> I don't fully agree. I think the time spent spinning is what matters. To
> me, less than 1ms is acceptable, and if the hardware can guarantee it,
> then sync mode is also OK.

Okay, let me put the condition slightly differently - any spin on the
order of what a WBINVD might take ought to be okay, provided
both are equally (in)accessible to guests. The whole discussion is
really about limiting the impact misbehaving guests can have on
the whole system. (Obviously any spin time reaching the order of
a scheduling time slice is a problem.)

> As I remember, the original motivation for handling the ATS problem was
> twofold: 1. the ATS spec allows a 60s timeout to complete the flush
> while Xen only allows 1s, and 2. spinning for 1s is not reasonable since
> it hurts the scheduler. For the former, as we discussed before, either
> disabling ATS support or supporting only specific ATS devices (those
> that complete the flush in less than 10ms or 1ms) is acceptable. For the
> latter, if spinning for 1s is not acceptable, we can reduce the timeout
> to 10ms or 1ms to eliminate the performance impact.

If we really can, why has it been chosen to be 1s in the first place?

> Yes, I'd agree async mode would be the best solution if Xen had it. But
> spin loops are used widely in the IOMMU code: not only for
> invalidations, lots of DMAR operations spin to synchronize with the
> hardware's status. For those operations, it is hard to use async mode.
> And even where async mode is possible, I don't see the benefit
> considering the cost and complexity: we would need either a timer or a
> softirq to do the check.

Even if the cost is high, avoiding the limit that undue spinning puts
on overall throughput is imo worth it, even outside of misbehaving-guest
considerations. I'm surprised you're not getting similar pressure on
this from the KVM folks (assuming the use of spinning is similar there).

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  6:46               ` Jan Beulich
@ 2015-10-15  7:28                 ` Zhang, Yang Z
  2015-10-15  8:25                   ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-15  7:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-15:
>>>> On 15.10.15 at 03:03, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-10-14:
>>> As long as the multi-millisecond spins aren't going to go away by
>>> other means, I think conversion to async mode is ultimately unavoidable.
>> 
>> I don't fully agree. I think the time spent spinning is what matters.
>> To me, less than 1ms is acceptable, and if the hardware can guarantee
>> it, then sync mode is also OK.
> 
> Okay, let me put the condition slightly differently - any spin on the
> order of what a WBINVD might take ought to be okay, provided both are

From the data we collected, the invalidation completes within several us. IMO, the time for WBINVD varies due to the size and different cache hierarchies, and it may take more than several us in the worst case.

> equally (in)accessible to guests. The whole discussion is really about
> limiting the impact misbehaving guests can have on the whole system.
> (Obviously any spin time reaching the order of a scheduling time slice
> is a problem.)

The premise for a misbehaving guest impacting the system is that the IOMMU is buggy and takes a long time to complete the invalidation. In other words, if all invalidations complete within several us, what does the spin time matter?

> 
>> As I remember, the original motivation for handling the ATS problem
>> was twofold: 1. the ATS spec allows a 60s timeout to complete the
>> flush while Xen only allows 1s, and 2. spinning for 1s is not
>> reasonable since it hurts the scheduler. For the former, as we
>> discussed before, either disabling ATS support or supporting only
>> specific ATS devices (those that complete the flush in less than 10ms
>> or 1ms) is acceptable. For the latter, if spinning for 1s is not
>> acceptable, we can reduce the timeout to 10ms or 1ms to eliminate the
>> performance impact.
> 
> If we really can, why has it been chosen to be 1s in the first place?

What I can tell is that 1s is just the value the original author chose. It has no special meaning. I have double-checked with our hardware expert and he suggests using as small a value as possible. According to his comment, 10ms is sufficiently large.

> 
>> Yes, I'd agree async mode would be the best solution if Xen had it.
>> But spin loops are used widely in the IOMMU code: not only for
>> invalidations, lots of DMAR operations spin to synchronize with the
>> hardware's status. For those operations, it is hard to use async mode.
>> And even where async mode is possible, I don't see the benefit
>> considering the cost and complexity: we would need either a timer or a
>> softirq to do the check.
> 
> Even if the cost is high, avoiding the limit that undue spinning puts
> on overall throughput is imo worth it, even outside of misbehaving-guest
> considerations. I'm surprised you're not getting similar pressure on
> this from the KVM folks (assuming the use of spinning is similar there).

Because no one has observed such an invalidation timeout issue so far. What we have discussed is only theoretical.

BTW, I have reported the issue to the Linux IOMMU maintainer but he didn't say anything about it.

Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  7:28                 ` Zhang, Yang Z
@ 2015-10-15  8:25                   ` Jan Beulich
  2015-10-15  8:52                     ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-15  8:25 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 15.10.15 at 09:28, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-10-15:
>>>>> On 15.10.15 at 03:03, <yang.z.zhang@intel.com> wrote:
>>> Jan Beulich wrote on 2015-10-14:
>>>> As long as the multi-millisecond spins aren't going to go away by
>>>> other means, I think conversion to async mode is ultimately unavoidable.
>>> 
>>> I don't fully agree. I think the time spent spinning is what matters.
>>> To me, less than 1ms is acceptable, and if the hardware can guarantee
>>> it, then sync mode is also OK.
>> 
>> Okay, let me put the condition slightly differently - any spin on the
>> order of what a WBINVD might take ought to be okay, provided both are
> 
> From the data we collected, the invalidation completes within several
> us. IMO, the time for WBINVD varies due to the size and different cache
> hierarchies, and it may take more than several us in the worst case.

Understood - hence the setting of the worst-case latency of WBINVD
as an upper bound for other (kind of similar) software operations.

>> equally (in)accessible to guests. The whole discussion is really about
>> limiting the impact misbehaving guests can have on the whole system.
>> (Obviously any spin time reaching the order of a scheduling time slice
>> is a problem.)
> 
> The premise for a misbehaving guest impacting the system is that the
> IOMMU is buggy and takes a long time to complete the invalidation. In
> other words, if all invalidations complete within several us, what does
> the spin time matter?

The risk of exploits of such poorly behaving IOMMUs. I.e. if properly
operating IOMMUs only require several us, why spin for several ms?

>>> As I remember, the original motivation for handling the ATS problem
>>> was twofold: 1. the ATS spec allows a 60s timeout to complete the
>>> flush while Xen only allows 1s, and 2. spinning for 1s is not
>>> reasonable since it hurts the scheduler. For the former, as we
>>> discussed before, either disabling ATS support or supporting only
>>> specific ATS devices (those that complete the flush in less than 10ms
>>> or 1ms) is acceptable. For the latter, if spinning for 1s is not
>>> acceptable, we can reduce the timeout to 10ms or 1ms to eliminate the
>>> performance impact.
>> 
>> If we really can, why has it been chosen to be 1s in the first place?
> 
> What I can tell is that 1s is just the value the original author chose.
> It has no special meaning. I have double-checked with our hardware
> expert and he suggests using as small a value as possible. According to
> his comment, 10ms is sufficiently large.

So here you talk about milliseconds again, while above you talked
about microseconds. Can we at least settle on an order of what is
required? 10ms is 10 times the minimum time slice credit1 allows, i.e.
awfully long.

>>> Yes, I'd agree async mode would be the best solution if Xen had it.
>>> But spin loops are used widely in the IOMMU code: not only for
>>> invalidations, lots of DMAR operations spin to synchronize with the
>>> hardware's status. For those operations, it is hard to use async
>>> mode. And even where async mode is possible, I don't see the benefit
>>> considering the cost and complexity: we would need either a timer or
>>> a softirq to do the check.
>> 
>> Even if the cost is high, avoiding the limit that undue spinning puts
>> on overall throughput is imo worth it, even outside of
>> misbehaving-guest considerations. I'm surprised you're not getting
>> similar pressure on this from the KVM folks (assuming the use of
>> spinning is similar there).
> 
> Because no one has observed such an invalidation timeout issue so far.
> What we have discussed is only theoretical.

As is the case with many security-related things. We shouldn't wait
until someone exploits them.

> BTW, I have reported the issue to the Linux IOMMU maintainer but he
> didn't say anything about it.

Interesting - that may speak for itself (depending on how long this
has been pending), but otoh is in line with experience I have with
many (but not all) Linux maintainers.

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  8:25                   ` Jan Beulich
@ 2015-10-15  8:52                     ` Zhang, Yang Z
  2015-10-15  9:24                       ` Jan Beulich
  0 siblings, 1 reply; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-15  8:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-15:
>>>> On 15.10.15 at 09:28, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-10-15:
>>>>>> On 15.10.15 at 03:03, <yang.z.zhang@intel.com> wrote:
>>>> Jan Beulich wrote on 2015-10-14:
>>>>> As long as the multi-millisecond spins aren't going to go away by
>>>>> other means, I think conversion to async mode is ultimately unavoidable.
>>>> 
>>>> I don't fully agree. I think the time spent spinning is what
>>>> matters. To me, less than 1ms is acceptable, and if the hardware can
>>>> guarantee it, then sync mode is also OK.
>>> 
>>> Okay, let me put the condition slightly differently - any spin on
>>> the order of what a WBINVD might take ought to be okay, provided
>>> both are
>> 
>> From the data we collected, the invalidation completes within several
>> us. IMO, the time for WBINVD varies due to the size and different
>> cache hierarchies, and it may take more than several us in the worst case.
> 
> Understood - hence the setting of the worst-case latency of WBINVD as
> an upper bound for other (kind of similar) software operations.
> 
>>> equally (in)accessible to guests. The whole discussion is really about
>>> limiting the impact misbehaving guests can have on the whole system.
>>> (Obviously any spin time reaching the order of a scheduling time slice
>>> is a problem.)
>> 
>> The premise for a misbehaving guest impacting the system is that the
>> IOMMU is buggy and takes a long time to complete the invalidation.
>> In other words, if all invalidations complete within
>> several us, what does the spin time matter?
> 
> The risk of exploits of such poorly behaving IOMMUs. I.e. if properly

But this is not a software flaw. A guest has no way to know that the underlying IOMMU is broken, so it cannot exploit it.

> operating IOMMUs only require several us, why spin for several ms?

10ms is just my suggestion. I don't know whether future hardware will need more time to complete the invalidation. So I think we need a timeout here that is large enough but, at the same time, doesn't impact the scheduling.

> 
>>>> As I remember, the original motivation for handling the ATS problem
>>>> was twofold: 1. the ATS spec allows a 60s timeout to complete the
>>>> flush while Xen only allows 1s, and 2. spinning for 1s is not
>>>> reasonable since it hurts the scheduler. For the former, as we
>>>> discussed before, either disabling ATS support or supporting only
>>>> specific ATS devices (those that complete the flush in less than
>>>> 10ms or 1ms) is acceptable. For the latter, if spinning for 1s is
>>>> not acceptable, we can reduce the timeout to 10ms or 1ms to
>>>> eliminate the performance impact.
>>> 
>>> If we really can, why has it been chosen to be 1s in the first place?
>> 
>> What I can tell is that 1s is just the value the original author
>> chose. It has no special meaning. I have double-checked with our
>> hardware expert and he suggests using as small a value as possible.
>> According to his comment, 10ms is sufficiently large.
> 
> So here you talk about milliseconds again, while above you talked
> about microseconds. Can we at least settle on an order of what is
> required? 10ms is 10 times the minimum time slice credit1 allows, i.e.
> awfully long.

We can use whatever value you think reasonable, one that covers most invalidation cases. For the remaining cases, the vCPU can yield the CPU to others until a timer fires. In the callback function, the hypervisor can check whether the invalidation has completed. If yes, it schedules the vCPU back in; otherwise, it kills the guest due to an unpredictable invalidation timeout.
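
In outline, the fallback could look like the sketch below
(device_tlb_flush_completed() and the per-domain timer bookkeeping are
hypothetical; init_timer()/set_timer() and the domain pause/crash
primitives are Xen's existing ones):

    static void qi_timeout_check(void *data)
    {
        struct domain *d = data;

        if ( device_tlb_flush_completed(d) )     /* hypothetical helper */
            domain_unpause(d);            /* schedule the vCPUs back in */
        else
            domain_crash(d);      /* unpredictable invalidation timeout */
    }

    static void qi_defer_to_timer(struct domain *d, struct timer *qi_timer)
    {
        domain_pause_nosync(d);              /* stop spinning, yield CPU */
        init_timer(qi_timer, qi_timeout_check, d, smp_processor_id());
        set_timer(qi_timer, NOW() + MILLISECS(10));     /* backup bound */
    }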

Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  8:52                     ` Zhang, Yang Z
@ 2015-10-15  9:24                       ` Jan Beulich
  2015-10-15  9:50                         ` Zhang, Yang Z
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Beulich @ 2015-10-15  9:24 UTC (permalink / raw)
  To: Yang Z Zhang
  Cc: tim, Kevin Tian, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Jun Nakajima, Quan Xu

>>> On 15.10.15 at 10:52, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-10-15:
>>>>> On 15.10.15 at 09:28, <yang.z.zhang@intel.com> wrote:
>>> The premise for a misbehaving guest impacting the system is that the
>>> IOMMU is buggy and takes a long time to complete the invalidation.
>>> In other words, if all invalidations complete within
>>> several us, what does the spin time matter?
>> 
>> The risk of exploits of such poorly behaving IOMMUs. I.e. if properly
> 
> But this is not a software flaw. A guest has no way to know that the
> underlying IOMMU is broken, so it cannot exploit it.

A guest doesn't need to know what IOMMU is there in order to try
some exploit. Plus - based on other information it may be able to make
an educated guess.

>> operating IOMMUs only require several us, why spin for several ms?
> 
> 10ms is just my suggestion. I don't know whether future hardware will
> need more time to complete the invalidation. So I think we need a
> timeout here that is large enough but, at the same time, doesn't impact
> the scheduling.

It does, as explained further down in my previous reply.

>>>>> As I remember, the original motivation for handling the ATS
>>>>> problem was twofold: 1. the ATS spec allows a 60s timeout to
>>>>> complete the flush while Xen only allows 1s, and 2. spinning for
>>>>> 1s is not reasonable since it hurts the scheduler. For the former,
>>>>> as we discussed before, either disabling ATS support or supporting
>>>>> only specific ATS devices (those that complete the flush in less
>>>>> than 10ms or 1ms) is acceptable. For the latter, if spinning for
>>>>> 1s is not acceptable, we can reduce the timeout to 10ms or 1ms
>>>>> to eliminate the performance impact.
>>>> 
>>>> If we really can, why has it been chosen to be 1s in the first place?
>>> 
>>> What I can tell is that 1s is just the value the original author
>>> chose. It has no special meaning. I have double-checked with our
>>> hardware expert and he suggests using as small a value as possible.
>>> According to his comment, 10ms is sufficiently large.
>> 
>> So here you talk about milliseconds again, while above you talked
>> about microseconds. Can we at least settle on an order of what is
>> required? 10ms is 10 times the minimum time slice credit1 allows, i.e.
>> awfully long.
> 
> We can use whatever value you think reasonable, one that covers most
> invalidation cases. For the remaining cases, the vCPU can yield the CPU
> to others until a timer fires. In the callback function, the hypervisor
> can check whether the invalidation has completed. If yes, it schedules
> the vCPU back in; otherwise, it kills the guest due to an unpredictable
> invalidation timeout.

Using a timer implies you are again thinking about pausing the vCPU
until the invalidation completes. Which, as discussed before, has its
own problems and, even worse, won't cover the domain's other vCPUs or
devices still possibly doing work involving the entries being
invalidated. Or did you have something else in mind?

IOW - as soon as spinning time reaches the order of the scheduler
time slice, I think the only sane model is async operation with
proper refcounting.
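
In outline (the flush helper below is hypothetical; get_page()/put_page()
are the real refcounting primitives), that model would bracket the flush
like so:

    /* Unmap path: keep the page alive across the asynchronous flush. */
    if ( !get_page(pg, d) )
        return -EINVAL;
    queue_device_tlb_flush_async(d);    /* hypothetical: wait descriptor
                                           submitted with IF/SW set */

    /* Later, in the invalidation-completion interrupt handler: */
    put_page(pg);    /* the ref may now drop to zero and free the page */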

Jan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
  2015-10-15  9:24                       ` Jan Beulich
@ 2015-10-15  9:50                         ` Zhang, Yang Z
  0 siblings, 0 replies; 84+ messages in thread
From: Zhang, Yang Z @ 2015-10-15  9:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, Tian, Kevin, keir, ian.campbell, george.dunlap,
	andrew.cooper3, ian.jackson, xen-devel, Nakajima, Jun, Xu, Quan

Jan Beulich wrote on 2015-10-15:
>>>> On 15.10.15 at 10:52, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-10-15:
>>>>>> On 15.10.15 at 09:28, <yang.z.zhang@intel.com> wrote:
>>>> The premise for a misbehaving guest impacting the system is that
>>>> the IOMMU is buggy and takes a long time to complete the invalidation.
>>>> In other words, if all invalidations complete within
>>>> several us, what does the spin time matter?
>>> 
>>> The risk of exploits of such poorly behaving IOMMUs. I.e. if
>>> properly
>> 
>> But this is not a software flaw. A guest has no way to know that the
>> underlying IOMMU is broken, so it cannot exploit it.
> 
> A guest doesn't need to know what IOMMU is there in order to try some
> exploit. Plus - based on other information it may be able to make an
> educated guess.

As I said before, the premise is that the IOMMU is buggy. 

> 
>>> operating IOMMUs only require several us, why spin for several ms?
>> 
>> 10ms is just my suggestion. I don't know whether future hardware
>> will need more time to complete the invalidation. So I think we need
>> a timeout here that is large enough but, at the same time, doesn't
>> impact the scheduling.
> 
> It does, as explained further down in my previous reply.
> 
>>>>>> As I remember, the original motivation for handling the ATS
>>>>>> problem was twofold: 1. the ATS spec allows a 60s timeout to
>>>>>> complete the flush while Xen only allows 1s, and 2. spinning
>>>>>> for 1s is not reasonable since it hurts the scheduler. For the
>>>>>> former, as we discussed before, either disabling ATS support or
>>>>>> supporting only specific ATS devices (those that complete the
>>>>>> flush in less than 10ms or 1ms) is acceptable. For the latter,
>>>>>> if spinning for 1s is not acceptable, we can reduce the timeout
>>>>>> to 10ms or 1ms to eliminate the performance impact.
>>>>> 
>>>>> If we really can, why has it been chosen to be 1s in the first place?
>>>> 
>>>> What I can tell is that 1s is just the value the original author
>>>> chose. It has no special meaning. I have double-checked with our
>>>> hardware expert and he suggests using as small a value as possible.
>>>> According to his comment, 10ms is sufficiently large.
>>> 
>>> So here you talk about milliseconds again, while above you talked
>>> about microseconds. Can we at least settle on an order of what is
>>> required? 10ms is 10 times the minimum time slice credit1 allows,
>>> i.e. awfully long.
>> 
>> We can use whatever value you think reasonable, one that covers most
>> invalidation cases. For the remaining cases, the vCPU can yield the
>> CPU to others until a timer fires. In the callback function, the
>> hypervisor can check whether the invalidation has completed. If yes,
>> it schedules the vCPU back in; otherwise, it kills the guest due to
>> an unpredictable invalidation timeout.
> 
> Using a timer implies you are again thinking about pausing the vCPU
> until the invalidation completes. Which, as discussed before, has its
> own problems and, even worse, won't cover the domain's other vCPUs or
> devices still possibly doing work involving the entries being
> invalidated. Or did you have something else in mind?

Why not pause the whole domain? Based on Quan's data, all the invalidations in his experiment completed within 3us. So perhaps 10us is enough to cover all invalidations on today's IOMMUs (I need to check with our hardware expert to get the exact data). The timer mechanism is a backup only for the extreme case, which exists only in theory. So the probability of a guest triggering the timer mechanism can be ignored, and even if it happens, it only affects the guest itself.
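
A sketch of that two-phase scheme (the bounds and helper variables are
illustrative only):

    /* Phase 1: short spin, covering the common (~3us) case. */
    s_time_t deadline = NOW() + MICROSECS(10);

    while ( *poll_slot != QINVAL_STAT_DONE && NOW() < deadline )
        cpu_relax();

    if ( *poll_slot != QINVAL_STAT_DONE )
    {
        /*
         * Phase 2: the in-theory-only slow path - pause the whole domain
         * and let a backup timer either resume it or kill it later.
         */
        domain_pause_nosync(d);
        set_timer(&backup_timer, NOW() + MILLISECS(10));
    }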

> 
> IOW - as soon as spinning time reaches the order of the scheduler time
> slice, I think the only sane model is async operation with proper refcounting.
> 
> Jan


Best regards,
Yang

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2015-10-15  9:50 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
2015-09-16 10:46 ` Ian Jackson
2015-09-16 11:22   ` Julien Grall
2015-09-16 13:47     ` Ian Jackson
2015-09-17  9:06       ` Julien Grall
2015-09-17 10:16         ` Ian Jackson
2015-09-16 13:33   ` Xu, Quan
2015-09-16 13:23 ` [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt Quan Xu
2015-09-29  8:43   ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt Quan Xu
2015-09-29  8:57   ` Jan Beulich
2015-10-10  8:22     ` Xu, Quan
2015-10-12  7:11       ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
2015-09-16  9:33   ` Julien Grall
2015-09-16 13:43     ` Xu, Quan
2015-09-29  9:24   ` Jan Beulich
2015-10-10 12:27     ` Xu, Quan
2015-10-12  7:15       ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler Quan Xu
2015-09-29  9:33   ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in Quan Xu
2015-09-29  9:44   ` Jan Beulich
2015-09-16 13:24 ` [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag Quan Xu
2015-09-16  9:34   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to Quan Xu
2015-09-16  9:44   ` Julien Grall
2015-09-16 14:03     ` Xu, Quan
2015-09-16 13:24 ` [Patch RFC 08/13] vt-d: Held on the freed page until the Device-TLB flush is completed Quan Xu
2015-09-16  9:45   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 09/13] vt-d: Put the page in Queued Invalidation(QI) interrupt handler if Quan Xu
2015-09-16 13:24 ` [Patch RFC 10/13] vt-d: Held on the removed page until the Device-TLB flush is completed Quan Xu
2015-09-16  9:52   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
2015-09-16  9:56   ` Julien Grall
2015-09-23 17:38   ` Konrad Rzeszutek Wilk
2015-09-24  1:40     ` Xu, Quan
2015-09-16 13:24 ` [Patch RFC 12/13] vt-d: For gnttab_transfer, If the Device-TLB flush is still Quan Xu
2015-09-16 13:24 ` [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB Quan Xu
2015-09-29  9:46   ` Jan Beulich
2015-09-17  3:26 ` [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Xu, Quan
2015-09-21  8:51   ` Jan Beulich
2015-09-21  9:46     ` Xu, Quan
2015-09-21 12:03       ` Jan Beulich
2015-09-21 14:03         ` Xu, Quan
2015-09-21 14:20           ` Jan Beulich
2015-09-21 14:09 ` Xu, Quan
2015-09-23 16:26   ` Tim Deegan
2015-09-28  3:08     ` Xu, Quan
2015-09-28  6:47       ` Jan Beulich
2015-09-29  2:53         ` Xu, Quan
2015-09-29  7:21           ` Jan Beulich
2015-09-30 13:55             ` Xu, Quan
2015-09-30 14:03               ` Jan Beulich
2015-10-13 14:29             ` Xu, Quan
2015-10-13 14:50               ` Jan Beulich
2015-10-14 14:54                 ` Xu, Quan
2015-09-29  9:11       ` Tim Deegan
2015-09-29  9:57         ` Jan Beulich
2015-09-30 15:05         ` Xu, Quan
2015-10-01  9:09           ` Tim Deegan
2015-10-07 17:02             ` Xu, Quan
2015-10-08  8:51               ` Jan Beulich
2015-10-09  7:06                 ` Xu, Quan
2015-10-09  7:18                   ` Jan Beulich
2015-10-09  7:51                     ` Xu, Quan
2015-10-10 18:24               ` Tim Deegan
2015-10-11 11:09                 ` Xu, Quan
2015-10-12 12:25                   ` Jan Beulich
2015-10-13  9:34                   ` Tim Deegan
2015-10-14 14:44                     ` Xu, Quan
2015-10-12  1:42 ` Zhang, Yang Z
2015-10-12 12:34   ` Jan Beulich
2015-10-13  5:27     ` Zhang, Yang Z
2015-10-13  9:15       ` Jan Beulich
2015-10-14  5:12         ` Zhang, Yang Z
2015-10-14  9:30           ` Jan Beulich
2015-10-15  1:03             ` Zhang, Yang Z
2015-10-15  6:46               ` Jan Beulich
2015-10-15  7:28                 ` Zhang, Yang Z
2015-10-15  8:25                   ` Jan Beulich
2015-10-15  8:52                     ` Zhang, Yang Z
2015-10-15  9:24                       ` Jan Beulich
2015-10-15  9:50                         ` Zhang, Yang Z
