* VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-03  7:49 UTC
  To: xen-devel
  Cc: Tian, Kevin, Xu, Quan, andrew.cooper3, Dugger, Donald D,
	Jan Beulich, Zhang, Yang Z

Hi all,
     This email is about VT-d async invalidation for Device-TLB.

Background
=========

As Jan Beulich mentioned (http://lists.xenproject.org/archives/html/xen-devel/2014-06/msg03351.html), the VT-d code currently has a number of cases where completion of certain operations is being waited for by way of spinning. The majority of instances go through the IOMMU_WAIT_OP() macro, allowing for loops of up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the invalidation case seems particularly problematic: the hypervisor currently polls the status address of the wait descriptor for up to 1 second to get the invalidation flush result. When the invalidation queue includes a Device-TLB invalidation, using 1 second in the invalidation sync is a mistake, as that timeout is sized for response times of the IOMMU engine, not for Device-TLB invalidation with PCIe Address Translation Services (ATS) in use: the ATS specification mandates a timeout of 1 _minute_ for cache flush. The ATS case needs to be taken into consideration when doing invalidations.
Obviously we can't spin for a minute, so invalidation absolutely needs to be converted to a non-spinning model.

Design Overview
=============
This design implements a non-spinning model for Device-TLB invalidation, using an interrupt based mechanism. Each domain maintains an invalidation table, and the hypervisor has an entry of invalidation tables. The invalidation table keeps the count of in-flight Device-TLB invalidation queues, and also provides the same polling parameter for multiple in-flight Device-TLB invalidation queues of each domain.
When a domain issues a request to the Device-TLB invalidation queue, we update the invalidation table's count of in-flight Device-TLB invalidation queues and assign the Status Data of the wait descriptor of the invalidation queue. An interrupt is sent to the hypervisor once a Device-TLB invalidation request is done. In the interrupt handler, we will schedule a soft-irq to do the following check:
    if invalidation table's count of in-flight Device-TLB invalidation queues == polling parameter:
        This domain has no in-flight invalidation requests.
    else:
        This domain has in-flight invalidation requests.
The domain is put into the "blocked" state if it has in-flight Device-TLB invalidation requests, and is awoken when all the requests are done. A fault event will be generated if an invalidation fails; we can then either crash the domain or crash Xen.
    For Context Invalidation and IOTLB invalidation without Device-TLB invalidation, the Invalidation Queue keeps flushing synchronously as before (this is a tradeoff: for these the interrupt cost would be pure overhead).

More details:

1. invalidation table. We define an iommu_invl structure per domain.

struct iommu_invl {
    volatile u64 iommu_invl_poll_slot:62;
    domid_t dom_id;
    u64 iommu_invl_status_data:32;
} __attribute__ ((aligned (64)));

   iommu_invl_poll_slot: Set it equal to the status address of the wait descriptor when the invalidation queue contains a Device-TLB invalidation.
   dom_id: Keep the id of the domain.
   iommu_invl_status_data: Keep the count of in-flight queues with Device-TLB invalidation.

2. Modification to Device IOTLB invalidation:
    - Enable interrupt notification when hardware completes the invalidations:
        Set the FN, IF and SW bits in the Invalidation Wait Descriptor. The reason for also setting the SW bit is that the notification interrupt is global, not per domain, so in the interrupt handler we still need to poll the status address to know which domain's flush request has completed.
    - A new per-domain flag (iommu_pending_flush) is used to track the flush status of IOTLB invalidation with Device-TLB invalidation:
        iommu_pending_flush will be set before flushing the Device-TLB invalidation.
    - New synchronization logic (see the C sketch below):
        if no Device-TLB invalidation:
            Back to the current invalidation logic.
        else:
            Set the IF, SW, FN bits in the wait descriptor and prepare the Status Data.
            Set iommu_pending_flush.
            Put the domain on the pending flush list.
            Return.
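
Roughly, in C (a sketch only - has_ats_device(), sync_invalidate(), queue_wait_descriptor(), the QINV_* flags and the pending flush list are illustrative names, not existing Xen interfaces):

    /* Submit path for the new logic in 2. above (illustrative sketch). */
    static void dev_tlb_invalidate_async(struct domain *d)
    {
        struct iommu_invl *tbl = &d->iommu_qinval_invl;

        if ( !has_ats_device(d) )
        {
            sync_invalidate(d);     /* back to the current invalidation logic */
            return;
        }

        /* One more in-flight Device-TLB invalidation queue for this domain. */
        tbl->iommu_invl_status_data++;

        /* IF: raise an interrupt on completion, SW: write Status Data to
         * Status Address, FN: fence the descriptors behind this one. */
        queue_wait_descriptor(d, QINV_IF | QINV_SW | QINV_FN,
                              tbl->iommu_invl_status_data,                /* Status Data */
                              virt_to_maddr(&tbl->iommu_invl_poll_slot)); /* Status Address */

        d->iommu_pending_flush = 1;
        list_add_tail(&d->pending_flush_link, &pending_flush_list);
    }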

3. Modification to the domain running lifecycle:
    - When iommu_pending_flush is set, the domain is not allowed to enter non-root mode: pause the domain before VM entry.

4. New interrupt handler for invalidation completion:
    - When hardware completes the invalidations with Device IOTLB, it generates an interrupt to notify the hypervisor.
    - In the interrupt handler, we will schedule a soft-irq to handle the finished invalidations.
    - Soft-irq to handle finished invalidations (see the sketch below):
        Scan the pending flush list;
        for each entry in the list:
            check the values of iommu_invl_poll_slot and iommu_invl_status_data in the domain's invalidation table;
            if they match, clear iommu_pending_flush and the invalidation table, then wake up the domain.
    (We can leverage the IM bit of the Invalidation Event Control Register to optimize the interrupt rate.)
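
In C, the soft-irq could look roughly like this (again only a sketch; the list and field names are illustrative):

    /* Soft-irq: walk the pending flush list and wake up finished domains. */
    static void dev_tlb_invl_softirq(void)
    {
        struct domain *d, *tmp;

        list_for_each_entry_safe ( d, tmp, &pending_flush_list, pending_flush_link )
        {
            struct iommu_invl *tbl = &d->iommu_qinval_invl;

            /* Hardware wrote the Status Data of the last completed wait
             * descriptor into the poll slot. */
            if ( tbl->iommu_invl_poll_slot != tbl->iommu_invl_status_data )
                continue;           /* still has in-flight requests */

            d->iommu_pending_flush = 0;
            tbl->iommu_invl_poll_slot = tbl->iommu_invl_status_data = 0;
            list_del(&d->pending_flush_link);
            domain_unpause(d);      /* wake up the domain */
        }
    }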

5. Invalidation failure:
    - A fault event will be generated if an invalidation fails. We can either crash the domain or crash Xen upon receiving an invalidation fault event.



Intel OTC
Quan Xu


* Re: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-10  8:04 UTC
  To: Quan Xu
  Cc: Yang Z Zhang, andrew.cooper3, Kevin Tian, Donald D Dugger, xen-devel

>>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:
> Design Overview
> =============
> This design implements a non-spinning model for Device-TLB invalidation - using 
> an interrupt based mechanism. Each domain maintains an invalidation table, and 
> the hypervisor has an entry of invalidation tables. The invalidation table 

entry? Do you mean array or table?

> keeps the count of in-flight Device-TLB invalidation queues, and also provides 
> the same polling parameter for multiple in-flight Device-TLB invalidation queues 
> of each domain.

Which "same polling parameter"? I.e. I'm not sure what this is about
in the first place.

> When a domain issues a request to Device-TLB invalidation queue, update 
> invalidation table's count of in-flight Device-TLB invalidation queue and 
> assign the Status Data of wait descriptor of the invalidation queue. An 
> interrupt is sent out to the hypervisor once a Device-TLB invalidation request 
> is done. In interrupt handler, we will schedule a soft-irq to do the following 
> check: 
>     if invalidation table's count of in-flight Device-TLB invalidation queues 
> == polling parameter:
> 	   This domain has no in-flight invalidation requests.
>     else
> 	   This domain has in-flight invalidation requests.
> The domain is put into the "blocked" status if it has in-flight Device-TLB 
> invalidation requests, and awoken when all the requests are done. A fault 
> event will be generated if an invalidation failed. We can either crash the 
> domain or crash Xen.

Crashing Xen can't really be considered an option except when you
can't contain the failed invalidation to a particular VM (which, from
what was written above, should never happen).

>     For Context Invalidation and IOTLB invalidation without Device-TLB 
> invalidation, Invalidation Queue flushes synchronous invalidation as 
> before(This is a tradeoff and the cost of interrupt is overhead).

DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're
not intending to replace the current spinning for the non-ATS case?
Considering that expiring these loops results in panic()s, I would
expect these to become asynchronous _and_ contained to the
affected VM alongside the ATS-induced changed behavior. You're
talking of overhead - can you quantify that?

> More details:
> 
> 1. invalidation table. We define an iommu_invl structure per domain.
> struct iommu_invl {
>     volatile u64 iommu_invl_poll_slot:62;
>     domid_t dom_id;
>     u64 iommu_invl_status_data:32;
> } __attribute__ ((aligned (64)));
> 
>    iommu_invl_poll_slot: Set it equal to the status address of the wait 
> descriptor when the invalidation queue contains a Device-TLB invalidation.
>    dom_id: Keep the id of the domain.
>    iommu_invl_status_data: Keep the count of in-flight queues with Device-TLB 
> invalidation.

Without further explanation above/below I don't think I really
understand the purpose of this structure, nor its organization: Is
this something imposed by the VT-d specification? If so, a reference
to the respective section in the spec would be useful. If not, I can't
see why the structure is laid out the (odd) way it is.

> 2. Modification to Device IOTLB invalidation:
>     - Enabled interrupt notification when hardware completes the 
> invalidations: 
>         Set FN, IF and SW bits in Invalidation Wait Descriptor. The reason 

A good design document would either give a (short) explanation of
these bits, or at the very least a precise reference to where in the
spec they're being defined. The way the VT-d spec is organized I
generally find it quite hard to locate the definition of specific fields
when I have only a vague reference in hand. Yet reading the doc
here shouldn't require the reader to spend meaningful extra amounts
of time hunting down the corresponding pieces of the spec.

> why also set SW bit is that the interrupt for notification is global not per 
> domain. So we still need to poll the status address to know which domain's 
> flush request is
>         completed in interrupt handler.

With the above taken care of, I would then hope to also be able to
understand this (kind of an) explanation.

>     - A new per-domain flag (iommu_pending_flush) is used to track the flush 
> status of IOTLB invalidation with Device-TLB invalidation:
>         iommu_pending_flush will be set before flushing the Device-TLB 
> invalidation.

What is "flushing an invalidation" supposed to mean? I think there's
some problem with the wording here...

> 4. New interrupt handler for invalidation completion:
>     - when hardware completes the invalidations with Device IOTLB, it 
> generates an interrupt to notify hypervisor.
>     - In interrupt handler, we will schedule a soft-irq to handle the finished 
> invalidations.
>     - soft-irq to handle finished invalidation:
>         Scan the pending flush list
> 	    for each entry in list
>             check the values of iommu_invl_poll_slot and 
> iommu_invl_status_data in each domain's invalidation table.
>             if yes, clear iommu_pending_flush and invalidation table, then 
> wakeup the domain.

Did you put some consideration into how long this list may get, and
hence how long it may take you to iterate through the entire list?

Jan


* Re: VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-12  2:40 UTC
  To: Jan Beulich
  Cc: Zhang, Yang Z, andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

>> >>> On 10.06.15 at 16:05, <JBeulich@suse.com> wrote:
> >>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:

Jan, thanks for your review!!

> > Design Overview
> > =============
> > This design implements a non-spinning model for Device-TLB
> > invalidation - using an interrupt based mechanism. Each domain
> > maintains an invalidation table, and the hypervisor has an entry of
> > invalidation tables. The invalidation table
> 
> entry? Do you mean array or table?
> 
It is a table or list to track the domains that have pending invalidation requests. In the invalidation completion event handler, the hypervisor will walk this table/list to find out which domain's invalidation has completed.
Now I have a new way to scan: we can scan iommu->domid_bitmap[] to get the domains which have assigned devices, and then check each domain's invalidation status.


> > keeps the count of in-flight Device-TLB invalidation queues, and also
> > provides the same polling parameter for multiple in-flight Device-TLB
> > invalidation queues of each domain.
> 
> Which "same polling parameter"? I.e. I'm not sure what this is about in the first
> place.
> 
It is similar to the poll_slot in current Xen; in the VT-d spec it is also called Status Data.
For details, we should look at the Invalidation Wait Descriptor.
For more information about the VT-d Invalidation Wait Descriptor, please refer to http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html, section 6.5.2.8, Invalidation Wait Descriptor.

When we set the SW bit of the Invalidation Wait Descriptor, hardware indicates invalidation wait descriptor completion by performing a coherent DWORD write of the value in the Status Data field to the address specified in the Status Address field.

For the sync way, Xen provides a local polling parameter, assigns its address to the Status Address of the Invalidation Wait Descriptor, and polls it for the invalidation result for up to 1 second.
For the async way, we provide a global polling parameter per domain - this is the "same polling parameter" - and assign the address of this per-domain parameter to the Status Address of each Invalidation Wait Descriptor the domain submits.

When a domain issues a request to the Device-TLB invalidation queue, we update the invalidation table's count of in-flight Device-TLB invalidation queues and assign the Status Data of the wait descriptor of the invalidation queue.

For example:

  ------
 | invl |  Status Data = 1 (the count of in-flight Device-TLB invalidation queues)
 | wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
 | desc |
  ------
    .
    .
  ------
 | invl |  Status Data = 2 (the count of in-flight Device-TLB invalidation queues)
 | wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
 | desc |
  ------
    .
    .
  ------
 | invl |  Status Data = 3 (the count of in-flight Device-TLB invalidation queues)
 | wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
 | desc |
  ------
    .
    .

In the interrupt handler:
    if the count of in-flight Device-TLB invalidation queues == the global polling parameter per domain:
        This domain has no in-flight invalidation requests.
    else:
        This domain has in-flight invalidation requests.

BTW, we set the FN bit of the Invalidation Wait Descriptor. The FN bit indicates that the descriptors following the invalidation wait descriptor must be processed by hardware only after the invalidation wait descriptor completes.
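
Expressed as a minimal C sketch (invl_count, invl_poll and submit_wait_descriptor() are illustrative names, not existing Xen interfaces):

    /* Submission: bump the per-domain count; every wait descriptor of this
     * domain reuses the same per-domain status address. */
    d->invl_count++;                                      /* 1, 2, 3, ... */
    submit_wait_descriptor(d,
                           d->invl_count,                 /* Status Data */
                           virt_to_maddr(&d->invl_poll)); /* Status Address */

    /* Completion check in the interrupt handler / soft-irq: hardware wrote
     * the Status Data of the last completed wait descriptor into invl_poll. */
    if ( d->invl_poll == d->invl_count )
        ; /* no in-flight invalidation requests for this domain */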


> > When a domain issues a request to Device-TLB invalidation queue,
> > update invalidation table's count of in-flight Device-TLB invalidation
> > queue and assign the Status Data of wait descriptor of the
> > invalidation queue. An interrupt is sent out to the hypervisor once a
> > Device-TLB invalidation request is done. In interrupt handler, we will
> > schedule a soft-irq to do the following
> > check:
> >     if invalidation table's count of in-flight Device-TLB invalidation
> > queues == polling parameter:
> > 	   This domain has no in-flight invalidation requests.
> >     else
> > 	   This domain has in-flight invalidation requests.
> > The domain is put into the "blocked" status if it has in-flight
> > Device-TLB invalidation requests, and awoken when all the requests are
> > done. A fault event will be generated if an invalidation failed. We
> > can either crash the domain or crash Xen.
> 
> Crashing Xen can't really be considered an option except when you can't contain
> the failed invalidation to a particular VM (which, from what was written above,
> should never happen).
> 

Makes sense.

> >     For Context Invalidation and IOTLB invalidation without Device-TLB
> > invalidation, Invalidation Queue flushes synchronous invalidation as
> > before(This is a tradeoff and the cost of interrupt is overhead).
> 
> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not intending
> to replace the current spinning for the non-ATS case?

Yes, we are not intending to replace the current spinning for the non-ATS case.


> Considering that expiring these loops results in panic()s, I would expect these to
> become asynchronous _and_ contained to the affected VM alongside the ATS
> induced changed behavior. You talking of overhead - can you quantify that?
>

I tested it with a Myri-10G Dual-Protocol NIC, which is an ATS device.
For an invalidation:
 By sync way, it takes about 1.4 ms.
 By async way, it takes about 4.3 ms.


> > More details:
> >
> > 1. invalidation table. We define iommu _invl structure in domain.
> > Struct iommu _invl {
> >     volatile u64 iommu _invl _poll_slot :62;
> >     domid_t dom_id;
> >     u64 iommu _invl _status_data :32;
> > }__attribute__ ((aligned (64)));
> >
> >    iommu _invl _poll_slot: Set it equal to the status address of wait
> > descriptor when the invalidation queue is with Device-TLB.
> >    dom_id: Keep the id of the domain.
> >    iommu _invl _status_data: Keep the count of in-flight queue with
> > Device-TLB invalidation.
> 
> Without further explanation above/below I don't think I really understand the
> purpose of this structure, nor its organization: Is this something imposed by the
> VT-d specification? If so, a reference to the respective section in the spec would
> be useful. If not, I can't see why the structure is laid out the (odd) way it is.
> 

Refer to the explanation above. If it is still not clear, I will continue to explain in the next email.

> > 2. Modification to Device IOTLB invalidation:
> >     - Enabled interrupt notification when hardware completes the
> > invalidations:
> >         Set FN, IF and SW bits in Invalidation Wait Descriptor. The
> > reason
> 
> A good design document would either give a (short) explanation of these bits, or
> at the very least a precise reference to where in the spec they're being defined.
> The way the VT-d spec is organized I generally find it quite hard to locate the
> definition of specific fields when I have only a vague reference in hand. Yet
> reading the doc here shouldn't require the reader to spend meaningful extra
> amounts of time hunting down the corresponding pieces of the spec.
> 

Agreed. I will enhance it when I send out the code.
For more information about the VT-d Invalidation Wait Descriptor, please refer to http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html, section 6.5.2.8, Invalidation Wait Descriptor.

SW: indicates invalidation wait descriptor completion by performing a coherent DWORD write of the value in the Status Data field to the address specified in the Status Address field.
FN: indicates that the descriptors following the invalidation wait descriptor must be processed by hardware only after the invalidation wait descriptor completes.
IF: indicates invalidation wait descriptor completion by generating an invalidation completion event per the programming of the Invalidation Completion Event Registers.
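
For reference, a sketch of how these bits sit in the 128-bit Invalidation Wait Descriptor per section 6.5.2.8 (the QINV_WAIT_* names are made up here; please double-check the bit positions against the spec revision in use):

    #define QINV_WAIT_TYPE  0x5ULL       /* bits 3:0: descriptor type */
    #define QINV_WAIT_IF    (1ULL << 4)  /* Interrupt Flag */
    #define QINV_WAIT_SW    (1ULL << 5)  /* Status Write */
    #define QINV_WAIT_FN    (1ULL << 6)  /* Fence */

    /* Low qword: type, flags and Status Data (bits 63:32);
     * high qword: Status Address (bits 63:2). */
    uint64_t lo = QINV_WAIT_TYPE | QINV_WAIT_IF | QINV_WAIT_SW | QINV_WAIT_FN |
                  ((uint64_t)status_data << 32);
    uint64_t hi = status_maddr & ~0x3ULL;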


> > why also set SW bit is that the interrupt for notification is global
> > not per domain. So we still need to poll the status address to know
> > which domain's flush request is
> >         completed in interrupt handler.
> 
> With the above taken care of, I would then hope to also be able to understand
> this (kind of an) explanation.
> 
> >     - A new per-domain flag (iommu_pending_flush) is used to track the
> > flush status of IOTLB invalidation with Device-TLB invalidation:
> >         iommu_pending_flush will be set before flushing the Device-TLB
> > invalidation.
> 
> What is "flushing an invalidation" supposed to mean? I think there's some
> problem with the wording here...
> 

Yes, it should be 'submit invalidation requests'.


> > 4. New interrupt handler for invalidation completion:
> >     - when hardware completes the invalidations with Device IOTLB, it
> > generates an interrupt to notify hypervisor.
> >     - In interrupt handler, we will schedule a soft-irq to handle the
> > finished invalidations.
> >     - soft-irq to handle finished invalidation:
> >         Scan the pending flush list
> > 	    for each entry in list
> >             check the values of iommu_invl_poll_slot and
> > iommu_invl_status_data in each domain's invalidation table.
> >             if yes, clear iommu_pending_flush and invalidation table,
> > then wakeup the domain.
> 
> Did you put some consideration into how long this list may get, and hence how
> long it may take you to iterate through the entire list?
> 

Only domains which have an ATS device assigned will be tracked in this list, so the list length shouldn't be very long. Besides, Device-IOTLB invalidation doesn't happen frequently, so the cost should be acceptable.

Thanks.

> Jan

Quan


* Re: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-12  6:46 UTC
  To: Quan Xu
  Cc: Yang Z Zhang, andrew.cooper3, Kevin Tian, DonaldD Dugger, xen-devel

>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>> > >>> On 10.06.15 at 16:05, <JBeulich@suse.com> wrote:
>> >>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:
>> >     For Context Invalidation and IOTLB invalidation without Device-TLB
>> > invalidation, Invalidation Queue flushes synchronous invalidation as
>> > before(This is a tradeoff and the cost of interrupt is overhead).
>> 
>> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not intending
>> to replace the current spinning for the non-ATS case?
> 
> Yes, we are not intending to replace the current spinning for the non-ATS 
> case.

I'm not really happy about that.

>> Considering that expiring these loops results in panic()s, I would expect 
> these to
>> become asynchronous _and_ contained to the affected VM alongside the ATS
>> induced changed behavior. You talking of overhead - can you quantify that?
>>
> 
> I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device. 
> for an invalidation:
>  By sync way, it takes about 1.4 ms.
>  By async way, it takes about 4.3 ms.

What's the theory on why this is? After all, it shouldn't matter how
the completion of the invalidation gets signaled.

Apart from that measuring the ATS case (in which case we're set to
use async mode anyway) is kind of pointless here - we'd need to
know the overhead of non-ATS async compared to non-ATS sync.

>> > More details:
>> >
>> > 1. invalidation table. We define an iommu_invl structure per domain.
>> > struct iommu_invl {
>> >     volatile u64 iommu_invl_poll_slot:62;
>> >     domid_t dom_id;
>> >     u64 iommu_invl_status_data:32;
>> > } __attribute__ ((aligned (64)));
>> >
>> >    iommu_invl_poll_slot: Set it equal to the status address of the wait
>> > descriptor when the invalidation queue contains a Device-TLB invalidation.
>> >    dom_id: Keep the id of the domain.
>> >    iommu_invl_status_data: Keep the count of in-flight queues with
>> > Device-TLB invalidation.
>> 
>> Without further explanation above/below I don't think I really understand 
> the
>> purpose of this structure, nor its organization: Is this something imposed 
> by the
>> VT-d specification? If so, a reference to the respective section in the spec 
> would
>> be useful. If not, I can't see why the structure is laid out the (odd) way 
> it is.
>> 
> 
> Refer to the explanation above. If it is still not clear, I will continue to 
> explain in next email.

The explanation above helped for what I asked above, but didn't
make clear to me what the structure here is, how it relates to hw
defined structures, and hence (as said) why it is laid out the way it
is.

>> > 4. New interrupt handler for invalidation completion:
>> >     - when hardware completes the invalidations with Device IOTLB, it
>> > generates an interrupt to notify hypervisor.
>> >     - In interrupt handler, we will schedule a soft-irq to handle the
>> > finished invalidations.
>> >     - soft-irq to handle finished invalidation:
>> >         Scan the pending flush list
>> > 	    for each entry in list
>> >             check the values of iommu_invl_poll_slot and
>> > iommu_invl_status_data in each domain's invalidation table.
>> >             if yes, clear iommu_pending_flush and invalidation table,
>> > then wakeup the domain.
>> 
>> Did you put some consideration into how long this list may get, and hence 
> how
>> long it may take you to iterate through the entire list?
>> 
> 
> Only the domain which has the ATS device assigned will be tracked in this 
> list. So the list length shouldn't be very long.

Okay, if this is a list of domains (or of devices), that would hopefully
be acceptable (albeit on a huge system this could still be dozens). If
this was a list of pending flush requests, it might be worse.

> Besides, the DEVICE-IOTLB 
> invalidation doesn't happened frequently so the cost should be acceptable.

That's not a valid consideration: At no time must any processing
inside the hypervisor take arbitrarily long. This requirement is
entirely independent of how frequently such cases may occur.

Jan


* FW: VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-13 14:44 UTC
  To: Jan Beulich
  Cc: Zhang, Yang Z, andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

>On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
> >>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
> >> > >>> On 10.06.15 at 16:05, <JBeulich@suse.com> wrote:
> >> >>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:
> >> >     For Context Invalidation and IOTLB invalidation without
> >> > Device-TLB invalidation, Invalidation Queue flushes synchronous
> >> > invalidation as before(This is a tradeoff and the cost of interrupt is
> overhead).
> >>
> >> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not
> >> intending to replace the current spinning for the non-ATS case?
> >
> > Yes, we are not intending to replace the current spinning for the
> > non-ATS case.
> 
> I'm not really happy about that.

Jan, could you share more about what you expect? We can enable it step by step.
Do you want to replace the current spinning for all queued invalidations?

> 
> >> Considering that expiring these loops results in panic()s, I would
> >> expect
> > these to
> >> become asynchronous _and_ contained to the affected VM alongside the
> >> ATS induced changed behavior. You talking of overhead - can you quantify
> that?
> >>
> >
> > I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
> > for an invalidation:
> >  By sync way, it takes about 1.4 ms.
> >  By async way, it takes about 4.3 ms.
> 
> What's the theory on why this is? After all, it shouldn't matter how the
> completion of the invalidation gets signaled.
>

Let me explain more about how I got these test data.

For the sync way, I take the time with the NOW() macro when I update the Queue Tail Register (qinval_update_qtail()).
Then I take the time with the NOW() macro in the current spinning loop once hardware has written the value in the Status Data field to the address specified in the Status Address.
The difference of these 2 values is the time of a sync invalidation.


For the async way, as in the previous email, I have introduced the IF bit of the Invalidation Wait Descriptor
(IF: indicates invalidation wait descriptor completion by generating an invalidation completion event per the programming of the Invalidation Completion Event Registers).
I have implemented an interrupt for the invalidation completion event.
I also take the time with the NOW() macro when I update the Queue Tail Register (by qinval_update_qtail()).
Then I take the time with the NOW() macro in the invalidation completion event handler.
The difference of these 2 values is the time of an async invalidation.
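
In code form, the sampling is essentially the following (invl_start and invl_cost are variables added only for the experiment):

    s_time_t invl_start, invl_cost;

    invl_start = NOW();
    qinval_update_qtail(iommu, index);  /* submit the request */

    /* sync: taken right after the spin loop observes the status write;
     * async: taken at the entry of the completion event handler. */
    invl_cost = NOW() - invl_start;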



> Apart from that measuring the ATS case (in which case we're set to use async
> mode anyway) is kind of pointless here - we'd need to know the overhead of
> non-ATS async compared to non-ATS sync.
> 

I will send out these data next Monday; right now I am out of the office and can't reach the test machine to modify the Xen source code and get the non-ATS async data.


> >> > More details:
> >> >
> >> > 1. invalidation table. We define an iommu_invl structure per domain.
> >> > struct iommu_invl {
> >> >     volatile u64 iommu_invl_poll_slot:62;
> >> >     domid_t dom_id;
> >> >     u64 iommu_invl_status_data:32;
> >> > } __attribute__ ((aligned (64)));
> >> >
> >> >    iommu_invl_poll_slot: Set it equal to the status address of the
> >> > wait descriptor when the invalidation queue contains a Device-TLB invalidation.
> >> >    dom_id: Keep the id of the domain.
> >> >    iommu_invl_status_data: Keep the count of in-flight queues with
> >> > Device-TLB invalidation.
> >>
> >> Without further explanation above/below I don't think I really
> >> understand
> > the
> >> purpose of this structure, nor its organization: Is this something
> >> imposed
> > by the
> >> VT-d specification? If so, a reference to the respective section in
> >> the spec
> > would
> >> be useful. If not, I can't see why the structure is laid out the
> >> (odd) way
> > it is.
> >>
> >
> > Refer to the explanation above. If it is still not clear, I will
> > continue to explain in next email.
> 
> The explanation above helped for what I asked above, but didn't make clear to
> me what the structure here is, how it relates to hw defined structures, and
> hence (as said) why it is laid out the way it is.
> 

Below are the structures.

--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h

+struct iommu_invl {
+    volatile u64 iommu_invl_poll_slot;
+    u32 iommu_invl_status_data;
+};
+
 
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
+#include <xen/iommu.h>
@@ -364,6 +365,10 @@ struct domain
+
+    struct iommu_invl iommu_qinval_invl;
+    /* The domain is under Device-TLB invalidation flush */
+    s8               iommu_invl_pending_flush;


> >> > 4. New interrupt handler for invalidation completion:
> >> >     - when hardware completes the invalidations with Device IOTLB,
> >> > it generates an interrupt to notify hypervisor.
> >> >     - In interrupt handler, we will schedule a soft-irq to handle
> >> > the finished invalidations.
> >> >     - soft-irq to handle finished invalidation:
> >> >         Scan the pending flush list
> >> > 	    for each entry in list
> >> >             check the values of iommu_invl_poll_slot and
> >> > iommu_invl_status_data in each domain's invalidation table.
> >> >             if yes, clear iommu_pending_flush and invalidation
> >> > table, then wakeup the domain.
> >>
> >> Did you put some consideration into how long this list may get, and
> >> hence
> > how
> >> long it may take you to iterate through the entire list?
> >>
> >
> > Only the domain which has the ATS device assigned will be tracked in
> > this list. So the list length shouldn't be very long.
> 
> Okay, if this is a list of domains (or of devices), that would hopefully be
> acceptable (albeit on a huge system this could still be dozens). If this was a list of
> pending flush requests, it might be worse.


Yes, this is a list of domains.

> 
> > Besides, the DEVICE-IOTLB
> > invalidation doesn't happened frequently so the cost should be acceptable.
> 
> That's not a valid consideration: At no time must any processing inside the
> hypervisor take arbitrarily long. This requirement is entirely independent of how
> frequently such cases may occur.
>

Agreed.


 
> Jan


Quan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-15  7:03 UTC
  To: Quan Xu
  Cc: Yang Z Zhang, andrew.cooper3, Kevin Tian, DonaldD Dugger, xen-devel

>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>> >>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>> >> > >>> On 10.06.15 at 16:05, <JBeulich@suse.com> wrote:
>> >> >>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:
>> >> >     For Context Invalidation and IOTLB invalidation without
>> >> > Device-TLB invalidation, Invalidation Queue flushes synchronous
>> >> > invalidation as before(This is a tradeoff and the cost of interrupt is
>> overhead).
>> >>
>> >> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not
>> >> intending to replace the current spinning for the non-ATS case?
>> >
>> > Yes, we are not intending to replace the current spinning for the
>> > non-ATS case.
>> 
>> I'm not really happy about that.
> 
> Jan, could you share more about what you expect? We can enable it step by 
> step.
> Do you want to replace the current spinning for all of queued invalidation? 

Yes.

>> >> Considering that expiring these loops results in panic()s, I would
>> >> expect
>> > these to
>> >> become asynchronous _and_ contained to the affected VM alongside the
>> >> ATS induced changed behavior. You talking of overhead - can you quantify
>> that?
>> >>
>> >
>> > I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
>> > for an invalidation:
>> >  By sync way, it takes about 1.4 ms.
>> >  By async way, it takes about 4.3 ms.
>> 
>> What's the theory on why this is? After all, it shouldn't matter how the
>> completion of the invalidation gets signaled.
>>
> 
> Let me introduce more about how I get these test data.
> 
> For sync way, I get the time by NOW() macro, when I update Queue Tail 
> Register  (qinval_update_qtail()).
> Then get time by NOW() macro in current spinning when it has wrote the value 
> in the Status Data field to the address specified in the Status Address.
> The difference of these 2 value is the time of an sync invalidation.
> 
> 
> For async way, as the previous email, I have introduced the IF bit of 
> Invalidation Wait Descriptor. 
> (IF: Indicate the invalidation wait descriptor completion by generating an 
> invalidation completion event per the programing of the Invalidation 
> Completion Event Registers.)
> I have implemented an interrupt for invalidation completion event. 
> Also I get the time by NOW() macro, when I update Queue Tail Register (by 
> qinval_update_qtail()).
> Then get time by NOW() macro in invalidation completion event handler.
> The difference of these 2 value is the time of an async invalidation.

Okay, thanks for the explanation. As this includes the time it takes to
deliver and (partly) handle the interrupt, the difference is of course
within what one would expect (and also of what would seem acceptable).

Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Zhang, Yang Z @ 2015-06-16  3:07 UTC
  To: Jan Beulich, Xu, Quan
  Cc: andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

Jan Beulich wrote on 2015-06-15:
>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>>>>>>>>> On 10.06.15 at 16:05, <JBeulich@suse.com> wrote:
>>>>>>>> On 03.06.15 at 09:49, <quan.xu@intel.com> wrote:
>>>>>>     For Context Invalidation and IOTLB invalidation without
>>>>>> Device-TLB invalidation, Invalidation Queue flushes
>>>>>> synchronous invalidation as before(This is a tradeoff and the
>>>>>> cost of interrupt is
>>> overhead).
>>>>> 
>>>>> DMAR_OPERATION_TIMEOUT being 1s, are you saying that you're not
>>>>> intending to replace the current spinning for the non-ATS case?
>>>> 
>>>> Yes, we are not intending to replace the current spinning for the
>>>> non-ATS case.
>>> 
>>> I'm not really happy about that.
>> 
>> Jan, could you share more about what you expect? We can enable it
>> step by step.
>> Do you want to replace the current spinning for all of queued invalidation?
> 
> Yes.
> 
>>>>> Considering that expiring these loops results in panic()s, I would
>>>>> expect these to become asynchronous _and_ contained to the affected
>>>>> VM alongside the ATS induced changed behavior. You talking of
>>>>> overhead - can you quantify that?
>>>>> 
>>>> 
>>>> I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
>>>> for an invalidation:
>>>>  By sync way, it takes about 1.4 ms.
>>>>  By async way, it takes about 4.3 ms.

It is a typo for the time: should be 1.4 us and 4.3 us.

>>> 
>>> What's the theory on why this is? After all, it shouldn't matter
>>> how the completion of the invalidation gets signaled.
>>> 
>> 
>> Let me introduce more about how I get these test data.
>> 
>> For sync way, I get the time by NOW() macro, when I update Queue
>> Tail Register  (qinval_update_qtail()).
>> Then get time by NOW() macro in current spinning when it has wrote
>> the value in the Status Data field to the address specified in the Status Address.
>> The difference of these 2 value is the time of an sync invalidation.
>> 
>> 
>> For async way, as the previous email, I have introduced the IF bit
>> of Invalidation Wait Descriptor.
>> (IF: Indicate the invalidation wait descriptor completion by
>> generating an invalidation completion event per the programing of
>> the Invalidation Completion Event Registers.) I have implemented an
>> interrupt for invalidation completion event.
>> Also I get the time by NOW() macro, when I update Queue Tail
>> Register (by qinval_update_qtail()).
>> Then get time by NOW() macro in invalidation completion event handler.
>> The difference of these 2 value is the time of an async invalidation.
> 
> Okay, thanks for the explanation. As this includes the time it takes
> to deliver and (partly) handle the interrupt, the difference is of
> course within what one would expect (and also of what would seem acceptable).

The time doesn't include the cost of handling the interrupt; we just record it at the entry of the interrupt handler, so the cost should be bigger than 4.3 us if the handling cost is taken into consideration. And the costs will be much bigger if more pass-through VMs run. We can start from the ATS case first, and apply it to the non-ATS case later if the async approach doesn't hurt performance.

> 
> Jan


Best regards,
Yang


* Re: FW: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-16  7:55 UTC
  To: Quan Xu, Yang Z Zhang
  Cc: andrew.cooper3, Kevin Tian, DonaldD Dugger, xen-devel

>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-06-15:
>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>>>>> I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
>>>>> for an invalidation:
>>>>>  By sync way, it takes about 1.4 ms.
>>>>>  By async way, it takes about 4.3 ms.
> 
> It is a typo for the time: should be 1.4 us and 4.3 us.
> 
>>>> 
>>>> What's the theory on why this is? After all, it shouldn't matter
>>>> how the completion of the invalidation gets signaled.
>>>> 
>>> 
>>> Let me introduce more about how I get these test data.
>>> 
>>> For sync way, I get the time by NOW() macro, when I update Queue
>>> Tail Register  (qinval_update_qtail()).
>>> Then get time by NOW() macro in current spinning when it has wrote
>>> the value in the Status Data field to the address specified in the Status Address.
>>> The difference of these 2 value is the time of an sync invalidation.
>>> 
>>> 
>>> For async way, as the previous email, I have introduced the IF bit
>>> of Invalidation Wait Descriptor.
>>> (IF: Indicate the invalidation wait descriptor completion by
>>> generating an invalidation completion event per the programing of
>>> the Invalidation Completion Event Registers.) I have implemented an
>>> interrupt for invalidation completion event.
>>> Also I get the time by NOW() macro, when I update Queue Tail
>>> Register (by qinval_update_qtail()).
>>> Then get time by NOW() macro in invalidation completion event handler.
>>> The difference of these 2 value is the time of an async invalidation.
>> 
>> Okay, thanks for the explanation. As this includes the time it takes
>> to deliver and (partly) handle the interrupt, the difference is of
>> course within what one would expect (and also of what would seem 
> acceptable).
> 
> The time doesn't include the cost of handling of interrupt. We just record 
> it at the entry of interrupt handler. So the cost should bigger than 4.3 us 
> if taking the handing cost into consideration. And the costs will much bigger 
> if there are more pass-through VMs runs. We can start from ATS case firstly. 
> And apply it to non-ATS case later if the async approach doesn't hurt the 
> performance.

In which case we're back to the question I raised originally: How do
you explain the time difference if the interrupt delivery overhead
isn't included?

Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Zhang, Yang Z @ 2015-06-16  7:59 UTC
  To: Jan Beulich, Xu, Quan
  Cc: andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

Jan Beulich wrote on 2015-06-16:
>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
>> Jan Beulich wrote on 2015-06-15:
>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>>>>>> I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
>>>>>> for an invalidation:
>>>>>>  By sync way, it takes about 1.4 ms.
>>>>>>  By async way, it takes about 4.3 ms.
>> 
>> It is a typo for the time: should be 1.4 us and 4.3 us.
>> 
>>>>> 
>>>>> What's the theory on why this is? After all, it shouldn't matter
>>>>> how the completion of the invalidation gets signaled.
>>>>> 
>>>> 
>>>> Let me introduce more about how I get these test data.
>>>> 
>>>> For sync way, I get the time by NOW() macro, when I update Queue Tail
>>>> Register  (qinval_update_qtail()). Then get time by NOW() macro in
>>>> current spinning when it has wrote the value in the Status Data field
>>>> to the address specified in the Status Address. The difference of
>>>> these 2 value is the time of an sync invalidation.
>>>> 
>>>> 
>>>> For async way, as the previous email, I have introduced the IF bit
>>>> of Invalidation Wait Descriptor.
>>>> (IF: Indicate the invalidation wait descriptor completion by
>>>> generating an invalidation completion event per the programing of
>>>> the Invalidation Completion Event Registers.) I have implemented
>>>> an interrupt for invalidation completion event.
>>>> Also I get the time by NOW() macro, when I update Queue Tail
>>>> Register (by qinval_update_qtail()).
>>>> Then get time by NOW() macro in invalidation completion event handler.
>>>> The difference of these 2 value is the time of an async invalidation.
>>> 
>>> Okay, thanks for the explanation. As this includes the time it
>>> takes to deliver and (partly) handle the interrupt, the difference
>>> is of course within what one would expect (and also of what would
>>> seem
>> acceptable).
>> 
>> The time doesn't include the cost of handling of interrupt. We just
>> record it at the entry of interrupt handler. So the cost should bigger
>> than 4.3 us if taking the handing cost into consideration. And the
>> costs will much bigger if there are more pass-through VMs runs. We can
>> start from ATS case firstly. And apply it to non-ATS case later if the
>> async approach doesn't hurt the performance.
> 
> In which case we're back to the question I raised originally: How do
> you explain the time difference if the interrupt delivery overhead isn't included?

Which one? The 1.4 us for the sync case or the 4.3 us for the async case?
> 
> Jan


Best regards,
Yang


* Re: FW: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-16  8:11 UTC
  To: Quan Xu, Yang Z Zhang
  Cc: andrew.cooper3, Kevin Tian, DonaldD Dugger, xen-devel

>>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
> Jan Beulich wrote on 2015-06-16:
>>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
>>> Jan Beulich wrote on 2015-06-15:
>>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>>>>>>> I tested it by a Myri-10G Dual-Protocol NIC, which is an ATS device.
>>>>>>> for an invalidation:
>>>>>>>  By sync way, it takes about 1.4 ms.
>>>>>>>  By async way, it takes about 4.3 ms.
>>> 
>>> It is a typo for the time: should be 1.4 us and 4.3 us.
>>> 
>>>>>> 
>>>>>> What's the theory on why this is? After all, it shouldn't matter
>>>>>> how the completion of the invalidation gets signaled.
>>>>>> 
>>>>> 
>>>>> Let me introduce more about how I get these test data.
>>>>> 
>>>>> For sync way, I get the time by NOW() macro, when I update Queue Tail
>>>>> Register  (qinval_update_qtail()). Then get time by NOW() macro in
>>>>> current spinning when it has wrote the value in the Status Data field
>>>>> to the address specified in the Status Address. The difference of
>>>>> these 2 value is the time of an sync invalidation.
>>>>> 
>>>>> 
>>>>> For async way, as the previous email, I have introduced the IF bit
>>>>> of Invalidation Wait Descriptor.
>>>>> (IF: Indicate the invalidation wait descriptor completion by
>>>>> generating an invalidation completion event per the programing of
>>>>> the Invalidation Completion Event Registers.) I have implemented
>>>>> an interrupt for invalidation completion event.
>>>>> Also I get the time by NOW() macro, when I update Queue Tail
>>>>> Register (by qinval_update_qtail()).
>>>>> Then get time by NOW() macro in invalidation completion event handler.
>>>>> The difference of these 2 value is the time of an async invalidation.
>>>> 
>>>> Okay, thanks for the explanation. As this includes the time it
>>>> takes to deliver and (partly) handle the interrupt, the difference
>>>> is of course within what one would expect (and also of what would
>>>> seem
>>> acceptable).
>>> 
>>> The time doesn't include the cost of handling of interrupt. We just
>>> record it at the entry of interrupt handler. So the cost should bigger
>>> than 4.3 us if taking the handing cost into consideration. And the
>>> costs will much bigger if there are more pass-through VMs runs. We can
>>> start from ATS case firstly. And apply it to non-ATS case later if the
>>> async approach doesn't hurt the performance.
>> 
>> In which case we're back to the question I raised originally: How do
>> you explain the time difference if the interrupt delivery overhead isn't 
> included?
> 
> which one? 1.4us for sync case and 4.3us for async case?

The difference between the two (i.e. why is the async variant
three times as long).

Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-18  8:09 UTC
  To: Jan Beulich
  Cc: Zhang, Yang Z, andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel


> On 16.06.15 at 09:59, <JBeulich@suse.com> wrote:
> >>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
> > Jan Beulich wrote on 2015-06-16:
> >>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
> >>> Jan Beulich wrote on 2015-06-15:
> >>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
> >>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
> >>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:

> >
> > which one? 1.4us for sync case and 4.3us for async case?
> 
> The difference between the two (i.e. why is the async variant three times as
> long).
> 

I have tested IOTLB async/sync invalidation to get more data, this time with 'P State' / 'C State' disabled in the BIOS.
For an async invalidation: about 2.67 us.
For a sync invalidation: about 1.28 us.

I also tested the VCPU_KICK_SOFTIRQ irq cost.
The hypervisor calls cpu_raise_softirq(.., VCPU_KICK_SOFTIRQ) to raise a VCPU_KICK_SOFTIRQ irq, and vcpu_kick_softirq() is the VCPU_KICK_SOFTIRQ handler.
I measured the cost between cpu_raise_softirq(.., VCPU_KICK_SOFTIRQ) and vcpu_kick_softirq(): it is about 1.21 us.
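
The sampling looks roughly like this (kick_stamp is a per-CPU variable added only for the experiment):

    static DEFINE_PER_CPU(s_time_t, kick_stamp);

    /* sender side: */
    per_cpu(kick_stamp, cpu) = NOW();
    cpu_raise_softirq(cpu, VCPU_KICK_SOFTIRQ);

    /* ... and at the entry of vcpu_kick_softirq(): */
    s_time_t softirq_cost = NOW() - this_cpu(kick_stamp);  /* ~1.21 us */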

I think the difference between async and sync invalidation is the interrupt cost.

2.67 us is almost the ideal async invalidation cost. There are 4 reasons it can take much more time:
   1.  'P State' / 'C State' enabled in the BIOS.
   2.  The hypervisor running in non-root mode.
   3.  The time not including the cost of handling the interrupt; I just record it at the entry of the interrupt handler.
   4.  More pass-through VMs running.

So there may be some performance issues when we replace the current spinning for the non-ATS case.
We can start from the ATS case first, and apply it to the non-ATS case later if the async approach's performance is acceptable.
Jan, do you agree with this?


Quan


> Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-18  9:18 UTC
  To: Quan Xu
  Cc: Yang Z Zhang, andrew.cooper3, Kevin Tian, DonaldD Dugger, xen-devel

>>> On 18.06.15 at 10:09, <quan.xu@intel.com> wrote:

>> On 16.06.15 at 09:59, <JBeulich@suse.com> wrote:
>> >>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
>> > Jan Beulich wrote on 2015-06-16:
>> >>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
>> >>> Jan Beulich wrote on 2015-06-15:
>> >>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>> >>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>> >>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
> 
>> >
>> > which one? 1.4us for sync case and 4.3us for async case?
>> 
>> The difference between the two (i.e. why is the async variant three times as
>> long).
>> 
> 
> I have tested iotlb async/sync invalidation to get another data. Also I 
> disabled 'P State' / 'C State' in bios.
> For async invalidation: about 2.67 us.
> For sync invalidation: about 1.28 us.
> 
> I also tested VCPU_KICK_SOFTIRQ irq cost.
>  When hypervisor calls cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) to raise an 
> VCPU_KICK_SOFTIRQ irq, and vcpu_kick_softirq() is the VCPU_KICK_SOFTIRQ 
> interrupt handler.
> I got the cost between cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) and 
> vcpu_kick_softirq(). It is about 1.21 us.
> 
> I think the difference is interrupt cost between async invalidation and sync 
> invalidation.

Which contradicts what I think Yang said in an earlier reply.

> 2.67us is almost ideal for async invalidation cost. There are 4 reasons to 
> cost much more time:
>    1.  If enable 'P State' / 'C State' in bios.
>    2.  Hypervisor is running in No-root mode.
>    3.  The time doesn't include the cost of handling of interrupt. I just 
> record it at the entry of interrupt handler.
>    4.  More pass-through VMs runs.
> 
> So there are maybe some performance issues when we replace the current 
> spinning for the non-ATS case.
> We can start from ATS case firstly, And apply it to non-ATS case later if the 
> async approach performance is acceptable.
> Jan, Do you agree with this?

No, I'm still not convinced that leaving the non-ATS case alone
initially is the right approach. But maybe I'm the only one?

Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-18 11:31 UTC
  To: Jan Beulich
  Cc: Zhang, Yang Z, andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

>On June 18, 2015 5:19 PM, <JBeulich@suse.com> wrote:
> >>> On 18.06.15 at 10:09, <quan.xu@intel.com> wrote:
> 
> >> >> On 16.06.15 at 09:59, <JBeulich@suse.com> wrote:
> >> >>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
> >> > Jan Beulich wrote on 2015-06-16:
> >> >>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
> >> >>> Jan Beulich wrote on 2015-06-15:
> >> >>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
> >> >>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
> >> >>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
> >
> >> >
> >> > which one? 1.4us for sync case and 4.3us for async case?
> >>
> >> The difference between the two (i.e. why is the async variant three
> >> times as long).
> >>
> >
> > I have tested iotlb async/sync invalidation to get another data. Also
> > I disabled 'P State' / 'C State' in bios.
> > For async invalidation: about 2.67 us.
> > For sync invalidation: about 1.28 us.
> >
> > I also tested VCPU_KICK_SOFTIRQ irq cost.
> >  When hypervisor calls cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) to raise
> > an VCPU_KICK_SOFTIRQ irq, and vcpu_kick_softirq() is the
> > VCPU_KICK_SOFTIRQ interrupt handler.
> > I got the cost between cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) and
> > vcpu_kick_softirq(). It is about 1.21 us.
> >
> > I think the difference is interrupt cost between async invalidation
> > and sync invalidation.
> 
> Which contradicts what I think Yang said in an earlier reply.
> 
I talked with Yang, and he is confused about what he said. :(
Could you share more?


> > 2.67us is almost ideal for async invalidation cost. There are 4
> > reasons to cost much more time:
> >    1.  If enable 'P State' / 'C State' in bios.
> >    2.  Hypervisor is running in No-root mode.
> >    3.  The time doesn't include the cost of handling of interrupt. I
> > just record it at the entry of interrupt handler.
> >    4.  More pass-through VMs runs.
> >
> > So there are maybe some performance issues when we replace the current
> > spinning for the non-ATS case.
> > We can start from ATS case firstly, And apply it to non-ATS case later
> > if the async approach performance is acceptable.
> > Jan, Do you agree with this?
> 
> No, I'm still not convinced that leaving the non-ATS case alone initially is the right
> approach. But maybe I'm the only one?
> 
I hope someone else will give some comments.

I tried to replace the current spinning for the non-ATS case, but Xen crashed.
Based on dmesg, it seems that VT-d is enabled before enabling IO-APIC IRQs. 

I can send out two series of patches:
1st: VT-d async invalidation for the ATS case.
2nd: VT-d async invalidation for the non-ATS case.


I think the 1st series is high priority, as it is not correct to spin for 1 second in the ATS case. I can implement the code and send it out ASAP.
The 2nd series is lower priority, as it is an optimization; I can provide it later.
Agree?

Quan


> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: FW: VT-d async invalidation for Device-TLB.
From: Jan Beulich @ 2015-06-18 11:39 UTC
  To: Quan Xu
  Cc: Yang Z Zhang, andrew.cooper3, Kevin Tian, Donald D Dugger, xen-devel

>>> On 18.06.15 at 13:31, <quan.xu@intel.com> wrote:
>> On June 18, 2015 5:19 PM, <JBeulich@suse.com> wrote:
>> >>> On 18.06.15 at 10:09, <quan.xu@intel.com> wrote:
>> 
>> >> On 16.06.15 at 09:59, <JBeulich@suse.com> wrote:
>> >> >>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
>> >> > Jan Beulich wrote on 2015-06-16:
>> >> >>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
>> >> >>> Jan Beulich wrote on 2015-06-15:
>> >> >>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
>> >> >>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
>> >> >>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
>> >
>> >> >
>> >> > which one? 1.4us for sync case and 4.3us for async case?
>> >>
>> >> The difference between the two (i.e. why is the async variant three
>> >> times as long).
>> >>
>> >
>> > I have tested iotlb async/sync invalidation to get another data. Also
>> > I disabled 'P State' / 'C State' in bios.
>> > For async invalidation: about 2.67 us.
>> > For sync invalidation: about 1.28 us.
>> >
>> > I also tested VCPU_KICK_SOFTIRQ irq cost.
>> >  When hypervisor calls cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) to raise
>> > an VCPU_KICK_SOFTIRQ irq, and vcpu_kick_softirq() is the
>> > VCPU_KICK_SOFTIRQ interrupt handler.
>> > I got the cost between cpu_raise_softirq(..VCPU_KICK_SOFTIRQ) and
>> > vcpu_kick_softirq(). It is about 1.21 us.
>> >
>> > I think the difference is interrupt cost between async invalidation
>> > and sync invalidation.
>> 
>> Which contradicts what I think Yang said in an earlier reply.
>> 
> Talked with Yang who is confused at what he said. :(
> Could you share more?

Upon me asking, he specifically indicated the numbers do _not_
include interrupt handling overhead.

>> > 2.67us is almost ideal for async invalidation cost. There are 4
>> > reasons to cost much more time:
>> >    1.  If enable 'P State' / 'C State' in bios.
>> >    2.  Hypervisor is running in No-root mode.
>> >    3.  The time doesn't include the cost of handling of interrupt. I
>> > just record it at the entry of interrupt handler.
>> >    4.  More pass-through VMs runs.
>> >
>> > So there are maybe some performance issues when we replace the current
>> > spinning for the non-ATS case.
>> > We can start from ATS case firstly, And apply it to non-ATS case later
>> > if the async approach performance is acceptable.
>> > Jan, Do you agree with this?
>> 
>> No, I'm still not convinced that leaving the non-ATS case alone initially is 
> the right
>> approach. But maybe I'm the only one?
>> 
> I hope for someone else to give some comments.
> 
> I tried to replace the current spinning for the non-ATS case, but Xen 
> crashed.
> Based on dmesg, it seems that VT-d is enabled before enabling IO-APIC IRQs. 
> 
> I can send out two series of patches:
> 1st: VT-d async invalidation for the ATS case.
> 2nd: VT-d async invalidation for the non-ATS case.
> 
> 
> I think the 1st series is high priority, as it is not correct to 
> spin for 1 second in the ATS case. I can implement the code and send it out 
> ASAP.
> The 2nd series is lower priority, as it is an optimization; I can 
> provide it later.

That's fine as long as the second series won't arrive only months
later.

Jan


* Re: FW: VT-d async invalidation for Device-TLB.
From: Xu, Quan @ 2015-06-18 12:02 UTC
  To: Jan Beulich
  Cc: Zhang, Yang Z, andrew.cooper3, Tian, Kevin, Dugger, Donald D, xen-devel

>June 18, 2015 7:39 PM, <JBeulich@suse.com> wrote:
> >>> On 18.06.15 at 13:31, <quan.xu@intel.com> wrote:
> >> On June 18, 2015 5:19 PM, <JBeulich@suse.com> wrote:
> >> >>> On 18.06.15 at 10:09, <quan.xu@intel.com> wrote:
> >>
> >> >> On 16.06.15 at 09:59, <JBeulich@suse.com> wrote:
> >> >> >>> On 16.06.15 at 09:59, <yang.z.zhang@intel.com> wrote:
> >> >> > Jan Beulich wrote on 2015-06-16:
> >> >> >>>>> On 16.06.15 at 05:07, <yang.z.zhang@intel.com> wrote:
> >> >> >>> Jan Beulich wrote on 2015-06-15:
> >> >> >>>>>>> On 13.06.15 at 16:44, <quan.xu@intel.com> wrote:
> >> >> >>>>>> On 12.06.15 at 14:47, <JBeulich@suse.com> wrote:
> >> >> >>>>>>>>> On 12.06.15 at 04:40, <quan.xu@intel.com> wrote:
> >> >
> >> >> >



> > I can send out two series of patches:
> > 1st: VT-d async invalidation for the ATS case.
> > 2nd: VT-d async invalidation for the non-ATS case.
> >
> >
> > I think the 1st series is high priority, as it is not
> > correct to spin for 1 second in the ATS case. I can implement the code
> > and send it out ASAP.
> > The 2nd series is lower priority, as it is an optimization; I can
> > provide it later.
> 
> That's fine as long as the second series won't arrive only months later.
> 

I am starting to implement the 1st series -- VT-d async invalidation for the ATS case.

Quan

