* swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
@ 2010-11-11  1:16 Dante Cinco
  2010-11-11 16:04 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-11  1:16 UTC (permalink / raw)
  To: Xen-devel

We have Fibre Channel HBA devices that we PCI passthrough to our pvops
domU kernel. Without swiotlb=force in the domU's kernel command line,
both domU and dom0 lock up after loading the kernel module drivers for
the HBA devices. With swiotlb=force, the domU and dom0 are stable
after loading the kernel module drivers but the I/O performance is at
least an order of magnitude worse than what we were seeing with the
HVM kernel. I see the following in /var/log/kern.log in the pvops
domU:

PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
software IO TLB at phys 0x5800000 - 0x9800000

Is swiotlb=force responsible for the I/O performance degradation? I
don't understand what swiotlb=force does so I would appreciate an
explanation or a pointer.
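
For reference, the relevant bits of our domU config look something like
this (kernel path and BDFs are illustrative, not our exact values):

kernel = "/boot/vmlinuz-2.6-xen-pcifront"
extra  = "swiotlb=force"
pci    = [ '08:00.0', '08:00.1' ]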

Thanks.

- Dante


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11  1:16 swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough Dante Cinco
@ 2010-11-11 16:04 ` Konrad Rzeszutek Wilk
  2010-11-11 18:31   ` Dante Cinco
  0 siblings, 1 reply; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-11 16:04 UTC (permalink / raw)
  To: Dante Cinco; +Cc: Xen-devel

On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
> domU kernel. Without swiotlb=force in the domU's kernel command line,
> both domU and dom0 lock up after loading the kernel module drivers for
> the HBA devices. With swiotlb=force, the domU and dom0 are stable

Whoa. That is not good - what happens if you just pass in iommu=soft?
Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
(I don't think it does, but just doing 'iommu=soft' should enable it).


> after loading the kernel module drivers but the I/O performance is at
> least an order of magnitude worse than what we were seeing with the
> HVM kernel. I see the following in /var/log/kern.log in the pvops
> domU:
> 
> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
> software IO TLB at phys 0x5800000 - 0x9800000
> 
> Is swiotlb=force responsible for the I/O performance degradation? I
> don't understand what swiotlb=force does so I would appreciate an
> explanation or a pointer.

So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU
to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
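
(Roughly, that translation looks like this - a simplified sketch modeled
on the swiotlb-xen code, not the exact implementation:

static dma_addr_t xen_virt_to_bus(void *address)
{
	phys_addr_t paddr = virt_to_phys(address);
	unsigned long mfn = pfn_to_mfn(paddr >> PAGE_SHIFT);

	/* pseudo-PFN -> machine frame number, keep the offset in the page */
	return ((dma_addr_t)mfn << PAGE_SHIFT) | (paddr & ~PAGE_MASK);
}
)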

If your card is 64-bit, then that is all it would do. If however your card is 32-bit
and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
to memory below 4GB, DMA that, and when done, copy it back to where the user-space
page is. This is called bounce-buffering, and this is why you would use a mix of
pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
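
(As a sketch of that pattern - 'pdev', 'page', 'offset' and 'len' are
placeholders, and the device programming is elided:

dma_addr_t bus;

bus = pci_map_page(pdev, page, offset, len, PCI_DMA_FROMDEVICE);
/* ... program the device to DMA into 'bus' and wait for completion ... */

/* with bouncing in effect, this copies the bounce buffer back to the page */
pci_dma_sync_single_for_cpu(pdev, bus, len, PCI_DMA_FROMDEVICE);
/* ... the CPU can now safely read the data ... */
pci_unmap_page(pdev, bus, len, PCI_DMA_FROMDEVICE);
)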

However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.

So, try just 'iommu=soft' and see what happens.


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 16:04 ` Konrad Rzeszutek Wilk
@ 2010-11-11 18:31   ` Dante Cinco
  2010-11-11 19:03     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-11 18:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

Konrad,

Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
buffering for IO" in /var/log/kern.log.

With iommu=soft and without swiotlb=force, I see the "software bounce
buffering" in /var/log/kern.log and an NMI (see below) when I load the
kernel module drivers. I made sure the NMI is reproducible and not a
one-time event.

/var/log/kern.log (iommu=soft):
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
software IO TLB at phys 0x5800000 - 0x9800000

(XEN)
(XEN)
(XEN) NMI - I/O ERROR
(XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
(XEN) RFLAGS: 0000000000000012   CONTEXT: hypervisor
(XEN) rax: 0000000000000080   rbx: ffff82c480287c48   rcx: 0000000000000000
(XEN) rdx: 0000000000000080   rsi: 0000000000000080   rdi: ffff82c480287c48
(XEN) rbp: ffff82c480287c78   rsp: ffff82c480287c38   r8:  0000000000000000
(XEN) r9:  0000000000000037   r10: 0000ffff0000ffff   r11: 00ff00ff00ff00ff
(XEN) r12: ffff82c48029f080   r13: 0000000000000001   r14: 0000000000000008
(XEN) r15: ffff82c4802b0c20   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000001250a9000   cr2: 00007f6165ae9428
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480287c38:
(XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
(XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
(XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
(XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
(XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
(XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
(XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
(XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
(XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
(XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
(XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
(XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
(XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
(XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
(XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
(XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
(XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
(XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
(XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
(XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
(XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
(XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
(XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
(XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
(XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
(XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
(XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
(XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
(XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
(XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
(XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL TRAP: vector = 2 (nmi)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

Dante


On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
>> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
>> domU kernel. Without swiotlb=force in the domU's kernel command line,
>> both domU and dom0 lock up after loading the kernel module drivers for
>> the HBA devices. With swiotlb=force, the domU and dom0 are stable
>
> Whoa. That is not good - what happens if you just pass in iommu=soft?
> Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
> (I don't think it does, but just doing 'iommu=soft' should enable it).
>
>
>> after loading the kernel module drivers but the I/O performance is at
>> least an order of magnitude worse than what we were seeing with the
>> HVM kernel. I see the following in /var/log/kern.log in the pvops
>> domU:
>>
>> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
>> software IO TLB at phys 0x5800000 - 0x9800000
>>
>> Is swiotlb=force responsible for the I/O performance degradation? I
>> don't understand what swiotlb=force does so I would appreciate an
>> explanation or a pointer.
>
> So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU
> to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
>
> If your card is 64-bit, then that is all it would do. If however your card is 32-bit
> and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
> to memory below 4GB, DMA that, and when done, copy it back to where the user-space
> page is. This is called bounce-buffering, and this is why you would use a mix of
> pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
>
> However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
> if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.
>
> So, try just 'iommu=soft' and see what happens.
>


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 18:31   ` Dante Cinco
@ 2010-11-11 19:03     ` Konrad Rzeszutek Wilk
  2010-11-11 19:42       ` Lin, Ray
  2010-11-11 22:32       ` Dante Cinco
  0 siblings, 2 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-11 19:03 UTC (permalink / raw)
  To: Dante Cinco; +Cc: Xen-devel

On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> Konrad,
> 
> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
> buffering for IO" in /var/log/kern.log.
> 
> With iommu=soft and without swiotlb=force, I see the "software bounce
> buffering" in /var/log/kern.log and an NMI (see below) when I load the
> kernel module drivers. I made sure the NMI is reproducible and not a

What is the kernel module doing to cause this? DMA?
> one-time event.

So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d
enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
> 
> /var/log/kern.log (iommu=soft):
> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
> software IO TLB at phys 0x5800000 - 0x9800000
> 
> (XEN)
> (XEN)
> (XEN) NMI - I/O ERROR
> (XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
> (XEN) RFLAGS: 0000000000000012   CONTEXT: hypervisor
> (XEN) rax: 0000000000000080   rbx: ffff82c480287c48   rcx: 0000000000000000
> (XEN) rdx: 0000000000000080   rsi: 0000000000000080   rdi: ffff82c480287c48
> (XEN) rbp: ffff82c480287c78   rsp: ffff82c480287c38   r8:  0000000000000000
> (XEN) r9:  0000000000000037   r10: 0000ffff0000ffff   r11: 00ff00ff00ff00ff
> (XEN) r12: ffff82c48029f080   r13: 0000000000000001   r14: 0000000000000008
> (XEN) r15: ffff82c4802b0c20   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 00000001250a9000   cr2: 00007f6165ae9428
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff82c480287c38:
> (XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
> (XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
> (XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
> (XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
> (XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
> (XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
> (XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
> (XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
> (XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
> (XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
> (XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
> (XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
> (XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
> (XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
> (XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
> (XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
> (XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
> (XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
> (XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
> (XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
> (XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
> (XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
> (XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
> (XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
> (XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
> (XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
> (XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
> (XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
> (XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
> (XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
> (XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL TRAP: vector = 2 (nmi)
> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
> 
> Dante
> 
> 
> On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
> >> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
> >> domU kernel. Without swiotlb=force in the domU's kernel command line,
> >> both domU and dom0 lock up after loading the kernel module drivers for
> >> the HBA devices. With swiotlb=force, the domU and dom0 are stable
> >
> > Whoa. That is not good - what happens if you just pass in iommu=soft?
> > Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
> > (I don't think it does, but just doing 'iommu=soft' should enable it).
> >
> >
> >> after loading the kernel module drivers but the I/O performance is at
> >> least an order of magnitude worse than what we were seeing with the
> >> HVM kernel. I see the following in /var/log/kern.log in the pvops
> >> domU:
> >>
> >> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> >> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
> >> software IO TLB at phys 0x5800000 - 0x9800000
> >>
> >> Is swiotlb=force responsible for the I/O performance degradation? I
> >> don't understand what swiotlb=force does so I would appreciate an
> >> explanation or a pointer.
> >
> > So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU
> > to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
> >
> > If your card is 64-bit, then that is all it would do. If however your card is 32-bit
> > and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
> > to memory below 4GB, DMA that, and when done, copy it back to where the user-space
> > page is. This is called bounce-buffering, and this is why you would use a mix of
> > pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
> >
> > However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
> > if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.
> >
> > So, try just 'iommu=soft' and see what happens.
> >


* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 19:03     ` Konrad Rzeszutek Wilk
@ 2010-11-11 19:42       ` Lin, Ray
  2010-11-12 15:56         ` Konrad Rzeszutek Wilk
  2010-11-11 22:32       ` Dante Cinco
  1 sibling, 1 reply; 36+ messages in thread
From: Lin, Ray @ 2010-11-11 19:42 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Dante Cinco; +Cc: Xen-devel


Konrad,

   See my response in red.


-Ray

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad Rzeszutek Wilk
Sent: Thursday, November 11, 2010 11:04 AM
To: Dante Cinco
Cc: Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> Konrad,
>
> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
> buffering for IO" in /var/log/kern.log.
>
> With iommu=soft and without swiotlb=force, I see the "software bounce
> buffering" in /var/log/kern.log and an NMI (see below) when I load the
> kernel module drivers. I made sure the NMI is reproducible and not a

What is the kernel module doing to cause this? DMA?
> one-time event.

So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?

We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The kernel eventually disables our device's interrupts because they go unserviced more than 100000 times. From /proc/interrupts:
124:      86538          0          0          0          0          0      13462          0          0          0          0          0          0          0  xen-pirq-pcifront-msi  HW_TACHYON
125:      88348          0          0          0      11652          0          0          0          0          0          0          0          0          0  xen-pirq-pcifront-msi  HW_TACHYON
126:      89335          0      10665          0          0          0          0          0          0          0          0          0          0          0  xen-pirq-pcifront-msi  HW_TACHYON
127:     100000          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-pirq-pcifront-msi  HW_TACHYON
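
(For reference, that 100000 threshold is the stock kernel behavior - a
simplified sketch of the note_interrupt() policy from
kernel/irq/spurious.c, not the exact code:

/* inside the kernel's note_interrupt(irq, desc, action_ret), roughly: */
if (action_ret == IRQ_NONE)
	desc->irqs_unhandled++;
desc->irq_count++;
if (desc->irq_count < 100000)
	return;
if (desc->irqs_unhandled > 99900) {
	/* nearly all unhandled: print "irq NN: nobody cared" and mask it */
	desc->status |= IRQ_DISABLED;
	desc->chip->disable(irq);
}
desc->irq_count = 0;
desc->irqs_unhandled = 0;
)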




>
> /var/log/kern.log (iommu=soft):
> PCI-DMA: Using software bounce buffering for IO (SWIOTLB) Placing 64MB
> software IO TLB between ffff880005800000 - ffff880009800000 software
> IO TLB at phys 0x5800000 - 0x9800000
>
> (XEN)
> (XEN)
> (XEN) NMI - I/O ERROR
> (XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
> (XEN) RFLAGS: 0000000000000012   CONTEXT: hypervisor
> (XEN) rax: 0000000000000080   rbx: ffff82c480287c48   rcx: 0000000000000000
> (XEN) rdx: 0000000000000080   rsi: 0000000000000080   rdi: ffff82c480287c48
> (XEN) rbp: ffff82c480287c78   rsp: ffff82c480287c38   r8:  0000000000000000
> (XEN) r9:  0000000000000037   r10: 0000ffff0000ffff   r11: 00ff00ff00ff00ff
> (XEN) r12: ffff82c48029f080   r13: 0000000000000001   r14: 0000000000000008
> (XEN) r15: ffff82c4802b0c20   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 00000001250a9000   cr2: 00007f6165ae9428
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff82c480287c38:
> (XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
> (XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
> (XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
> (XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
> (XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
> (XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
> (XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
> (XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
> (XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
> (XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
> (XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
> (XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
> (XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
> (XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
> (XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
> (XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
> (XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
> (XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
> (XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
> (XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
> (XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
> (XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
> (XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
> (XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
> (XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
> (XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
> (XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
> (XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
> (XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
> (XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
> (XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL TRAP: vector = 2 (nmi)
> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
>
> Dante
>
>
> On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
> >> We have Fibre Channel HBA devices that we PCI passthrough to our
> >> pvops domU kernel. Without swiotlb=force in the domU's kernel
> >> command line, both domU and dom0 lock up after loading the kernel
> >> module drivers for the HBA devices. With swiotlb=force, the domU
> >> and dom0 are stable
> >
> > Whoa. That is not good - what happens if you just pass in iommu=soft?
> > Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
> > (I don't think it does, but just doing 'iommu=soft' should enable it).
> >
> >
> >> after loading the kernel module drivers but the I/O performance is
> >> at least an order of magnitude worse than what we were seeing with
> >> the HVM kernel. I see the following in /var/log/kern.log in the
> >> pvops
> >> domU:
> >>
> >> PCI-DMA: Using software bounce buffering for IO (SWIOTLB) Placing
> >> 64MB software IO TLB between ffff880005800000 - ffff880009800000
> >> software IO TLB at phys 0x5800000 - 0x9800000
> >>
> >> Is swiotlb=force responsible for the I/O performance degradation? I
> >> don't understand what swiotlb=force does so I would appreciate an
> >> explanation or a pointer.
> >
> > So, you should only need to use 'iommu=soft'. It will enable the
> > Linux kernel IOMMU to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
> >
> > If your card is 64-bit, then that is all it would do. If however
> > your card is 32-bit and you are DMA-ing data from above the 32-bit
> > limit, it would copy the user-space page to memory below 4GB, DMA
> > that, and when done, copy it back to where the user-space page
> > is. This is called bounce-buffering, and this is why you would use a mix of pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
> >
> > However, I think your cards are 64-bit, so you don't need this
> > bounce-buffering. But if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.
> >
> > So, try just 'iommu=soft' and see what happens.
> >


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 19:03     ` Konrad Rzeszutek Wilk
  2010-11-11 19:42       ` Lin, Ray
@ 2010-11-11 22:32       ` Dante Cinco
  2010-11-12  1:02         ` Dante Cinco
  1 sibling, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-11 22:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

With iommu=off,verbose in the Xen command line, pvops domU works only
with swiotlb=force and with the same performance degradation. Without
swiotlb=force, there's no NMI but DMA does not work (see Ray Lin's
reply on Thu 11/11/2010 11:42 AM).

The XenPCIpassthrough wiki
(http://wiki.xensource.com/xenwiki/XenPCIpassthrough) talks about
setting iommu=pv in order to use the hardware IOMMU (VT-d) passthru
for PV guests but I didn't see any difference compared to my original
setting (iommu=1,passthrough,no-intremap). Is iommu=pv still required
for this particular pvops domU kernel (xen-pcifront-0.8.2) and if it
is, what should I be looking for in the Xen log (xm dmesg) to verify
its efficacy?

With my original setting (iommu=1,passthrough,no-intremap), here's what I see:

(XEN) [VT-D]dmar.c:702: Host address width 39
(XEN) [VT-D]dmar.c:717: found ACPI_DMAR_DRHD:
(XEN) [VT-D]dmar.c:413:   dmaru->address = e7ffe000
(XEN) [VT-D]iommu.c:1136: drhd->address = e7ffe000 iommu->reg = ffff82c3fff57000
(XEN) [VT-D]iommu.c:1138: cap = c90780106f0462 ecap = f0207e
(XEN) [VT-D]dmar.c:356:   IOAPIC: 0:1e.1
(XEN) [VT-D]dmar.c:356:   IOAPIC: 0:13.0
(XEN) [VT-D]dmar.c:427:   flags: INCLUDE_ALL
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.7
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7fc000 end_address df7fdfff
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.0
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.1
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.2
(XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.3
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.4
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7f5000 end_address df7fafff
(XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:341:   endpoint: 5:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
(XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
(XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df63e000 end_address df63ffff
(XEN) [VT-D]dmar.c:727: found ACPI_DMAR_ATSR:
(XEN) [VT-D]dmar.c:622:   atsru->all_ports: 0
(XEN) [VT-D]dmar.c:327:   bridge: 0:a.0  start = 0 sec = 7  sub = 7
(XEN) [VT-D]dmar.c:327:   bridge: 0:9.0  start = 0 sec = 8  sub = a
(XEN) [VT-D]dmar.c:327:   bridge: 0:8.0  start = 0 sec = b  sub = d
(XEN) [VT-D]dmar.c:327:   bridge: 0:7.0  start = 0 sec = e  sub = 10
(XEN) [VT-D]dmar.c:327:   bridge: 0:6.0  start = 0 sec = 18  sub = 1a
(XEN) [VT-D]dmar.c:327:   bridge: 0:5.0  start = 0 sec = 15  sub = 17
(XEN) [VT-D]dmar.c:327:   bridge: 0:4.0  start = 0 sec = 14  sub = 14
(XEN) [VT-D]dmar.c:327:   bridge: 0:3.0  start = 0 sec = 11  sub = 13
(XEN) [VT-D]dmar.c:327:   bridge: 0:2.0  start = 0 sec = 6  sub = 6
(XEN) [VT-D]dmar.c:327:   bridge: 0:1.0  start = 0 sec = 5  sub = 5
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping not enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) [VT-D]iommu.c:743: iommu_enable_translation: iommu->reg = ffff82c3fff57000

domU bringup:

(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.3
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.3
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.2
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.2
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.3
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.3
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.2
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.2
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.1
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.0
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.0
(XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.1
(XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.1
mapping kernel into physical memory
about to get started...

- Dante

On Thu, Nov 11, 2010 at 11:03 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
>> Konrad,
>>
>> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
>> buffering for IO" in /var/log/kern.log.
>>
>> With iommu=soft and without swiotlb=force, I see the "software bounce
>> buffering" in /var/log/kern.log and an NMI (see below) when I load the
>> kernel module drivers. I made sure the NMI is reproducible and not a
>
> What is the kernel module doing to cause this? DMA?
>> one-time event.
>
> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d
> enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
>>
>> /var/log/kern.log (iommu=soft):
>> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
>> software IO TLB at phys 0x5800000 - 0x9800000
>>
>> (XEN)
>> (XEN)
>> (XEN) NMI - I/O ERROR
>> (XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
>> (XEN) CPU:    0
>> (XEN) RIP:    e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
>> (XEN) RFLAGS: 0000000000000012   CONTEXT: hypervisor
>> (XEN) rax: 0000000000000080   rbx: ffff82c480287c48   rcx: 0000000000000000
>> (XEN) rdx: 0000000000000080   rsi: 0000000000000080   rdi: ffff82c480287c48
>> (XEN) rbp: ffff82c480287c78   rsp: ffff82c480287c38   r8:  0000000000000000
>> (XEN) r9:  0000000000000037   r10: 0000ffff0000ffff   r11: 00ff00ff00ff00ff
>> (XEN) r12: ffff82c48029f080   r13: 0000000000000001   r14: 0000000000000008
>> (XEN) r15: ffff82c4802b0c20   cr0: 000000008005003b   cr4: 00000000000026f0
>> (XEN) cr3: 00000001250a9000   cr2: 00007f6165ae9428
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff82c480287c38:
>> (XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
>> (XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
>> (XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
>> (XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
>> (XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
>> (XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
>> (XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
>> (XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
>> (XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
>> (XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
>> (XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
>> (XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
>> (XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
>> (XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
>> (XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
>> (XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
>> (XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
>> (XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
>> (XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
>> (XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
>> (XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
>> (XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
>> (XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
>> (XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
>> (XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
>> (XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
>> (XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
>> (XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
>> (XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
>> (XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
>> (XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) FATAL TRAP: vector = 2 (nmi)
>> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
>> (XEN) ****************************************
>> (XEN)
>> (XEN) Reboot in five seconds...
>>
>> Dante
>>
>>
>> On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> > On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
>> >> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
>> >> domU kernel. Without swiotlb=force in the domU's kernel command line,
>> >> both domU and dom0 lock up after loading the kernel module drivers for
>> >> the HBA devices. With swiotlb=force, the domU and dom0 are stable
>> >
>> > Whoa. That is not good - what happens if you just pass in iommu=soft?
>> > Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
>> > (I don't think it does, but just doing 'iommu=soft' should enable it).
>> >
>> >
>> >> after loading the kernel module drivers but the I/O performance is at
>> >> least an order of magnitude worse than what we were seeing with the
>> >> HVM kernel. I see the following in /var/log/kern.log in the pvops
>> >> domU:
>> >>
>> >> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>> >> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
>> >> software IO TLB at phys 0x5800000 - 0x9800000
>> >>
>> >> Is swiotlb=force responsible for the I/O performance degradation? I
>> >> don't understand what swiotlb=force does so I would appreciate an
>> >> explanation or a pointer.
>> >
>> > So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU
>> > to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
>> >
>> > If your card is 64-bit, then that is all it would do. If however your card is 32-bit
>> > and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
>> > to memory below 4GB, DMA that, and when done, copy it back to where the user-space
>> > page is. This is called bounce-buffering, and this is why you would use a mix of
>> > pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
>> >
>> > However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
>> > if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.
>> >
>> > So, try just 'iommu=soft' and see what happens.
>> >
>


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 22:32       ` Dante Cinco
@ 2010-11-12  1:02         ` Dante Cinco
  2010-11-12 16:58           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-12  1:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

Here's another data point: iommu=1,passthrough,no-intremap,verbose in
the Xen command line combined with iommu=soft in the pvops domU command
line also results in an NMI (see below). Replacing iommu=soft with
swiotlb=force in the pvops domU works reliably but with the same I/O
performance degradation. It seems that regardless of whether the iommu
is enabled or disabled in the hypervisor, swiotlb=force is necessary in
the pvops domU.

(XEN)
(XEN) NMI - I/O ERROR
(XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48015c006>] do_IRQ+0x375/0x59c
(XEN) RFLAGS: 0000000000000002   CONTEXT: hypervisor
(XEN) rax: ffff83011dae4460   rbx: ffff8301616a6990   rcx: 000000000000010c
(XEN) rdx: 000000000000010c   rsi: 0000000000000086   rdi: 0000000000000001
(XEN) rbp: ffff82c480287e28   rsp: ffff82c480287db8   r8:  000000000000007a
(XEN) r9:  ffff8300df4d4060   r10: ffff83019fffac88   r11: 000001958595f304
(XEN) r12: ffff83011dae2000   r13: 0000000000000000   r14: 000000000000007f
(XEN) r15: ffff83019fe02200   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000001261ff000   cr2: 0000000000783000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c480287db8:
(XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
(XEN)    000000000000010c ffff830000000000 ffff82c4802c2400 0000000000000002
(XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
(XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
(XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
(XEN)    ffff82c480287ee0 ffff82c480287f18 000001958595f304 ffff83019fffac88
(XEN)    ffff8300df4d4060 ffff83019fffa9f0 ffff82c4802c23a0 0000000000000000
(XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
(XEN)    ffff82c48014e3c3 000000000000e008 0000000000000246 ffff82c480287ee0
(XEN)    000000000000e010 ffff82c480287f10 ffff82c480150664 0000000000000000
(XEN)    ffff8300df2fc000 ffff8300df4d4000 00000000ffffffff ffff82c480287db8
(XEN)    0000000000000000 ffffffffffffffff ffffffff81787160 ffffffff81669fd8
(XEN)    ffffffff81669ed0 ffffffff81668000 0000000000000246 ffff8800067c0200
(XEN)    0000019575abe291 0000000000000000 0000000000000000 ffffffff810093aa
(XEN)    0000000400000000 00000000deadbeef 00000000deadbeef 0000010000000000
(XEN)    ffffffff810093aa 000000000000e033 0000000000000246 ffffffff81669eb8
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff8300df2fc000 0000000000000000
(XEN)    0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48015c006>] do_IRQ+0x375/0x59c
(XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
(XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
(XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL TRAP: vector = 2 (nmi)
(XEN) [error_code=0000] , IN INTERRUPT CONTEXT
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

- Dante


On Thu, Nov 11, 2010 at 2:32 PM, Dante Cinco <dantecinco@gmail.com> wrote:
> With iommu=off,verbose in the Xen command line, pvops domU works only
> with swiotlb=force and with the same performance degradation. Without
> swiotlb=force, there's no NMI but DMA does not work (see Ray Lin's
> reply on Thu 11/11/2010 11:42 AM).
>
> The XenPCIpassthrough wiki
> (http://wiki.xensource.com/xenwiki/XenPCIpassthrough) talks about
> setting iommu=pv in order to use the hardware IOMMU (VT-d) passthru
> for PV guests but I didn't see any difference compared to my original
> setting (iommu=1,passthrough,no-intremap). Is iommu=pv still required
> for this particular pvops domU kernel (xen-pcifront-0.8.2) and if it
> is, what should I be looking for in the Xen log (xm dmesg) to verify
> its efficacy?
>
> With my original setting (iommu=1,passthrough,no-intremap), here's what I see:
>
> (XEN) [VT-D]dmar.c:702: Host address width 39
> (XEN) [VT-D]dmar.c:717: found ACPI_DMAR_DRHD:
> (XEN) [VT-D]dmar.c:413:   dmaru->address = e7ffe000
> (XEN) [VT-D]iommu.c:1136: drhd->address = e7ffe000 iommu->reg = ffff82c3fff57000
> (XEN) [VT-D]iommu.c:1138: cap = c90780106f0462 ecap = f0207e
> (XEN) [VT-D]dmar.c:356:   IOAPIC: 0:1e.1
> (XEN) [VT-D]dmar.c:356:   IOAPIC: 0:13.0
> (XEN) [VT-D]dmar.c:427:   flags: INCLUDE_ALL
> (XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.7
> (XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7fc000 end_address df7fdfff
> (XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.0
> (XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.1
> (XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.2
> (XEN) [VT-D]dmar.c:341:   endpoint: 0:1d.3
> (XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
> (XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
> (XEN) [VT-D]dmar.c:341:   endpoint: 2:0.4
> (XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df7f5000 end_address df7fafff
> (XEN) [VT-D]dmar.c:722: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:341:   endpoint: 5:0.0
> (XEN) [VT-D]dmar.c:341:   endpoint: 2:0.0
> (XEN) [VT-D]dmar.c:341:   endpoint: 2:0.2
> (XEN) [VT-D]dmar.c:594:   RMRR region: base_addr df63e000 end_address df63ffff
> (XEN) [VT-D]dmar.c:727: found ACPI_DMAR_ATSR:
> (XEN) [VT-D]dmar.c:622:   atsru->all_ports: 0
> (XEN) [VT-D]dmar.c:327:   bridge: 0:a.0  start = 0 sec = 7  sub = 7
> (XEN) [VT-D]dmar.c:327:   bridge: 0:9.0  start = 0 sec = 8  sub = a
> (XEN) [VT-D]dmar.c:327:   bridge: 0:8.0  start = 0 sec = b  sub = d
> (XEN) [VT-D]dmar.c:327:   bridge: 0:7.0  start = 0 sec = e  sub = 10
> (XEN) [VT-D]dmar.c:327:   bridge: 0:6.0  start = 0 sec = 18  sub = 1a
> (XEN) [VT-D]dmar.c:327:   bridge: 0:5.0  start = 0 sec = 15  sub = 17
> (XEN) [VT-D]dmar.c:327:   bridge: 0:4.0  start = 0 sec = 14  sub = 14
> (XEN) [VT-D]dmar.c:327:   bridge: 0:3.0  start = 0 sec = 11  sub = 13
> (XEN) [VT-D]dmar.c:327:   bridge: 0:2.0  start = 0 sec = 6  sub = 6
> (XEN) [VT-D]dmar.c:327:   bridge: 0:1.0  start = 0 sec = 5  sub = 5
> (XEN) Intel VT-d Snoop Control not enabled.
> (XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
> (XEN) Intel VT-d Queued Invalidation enabled.
> (XEN) Intel VT-d Interrupt Remapping not enabled.
> (XEN) I/O virtualisation enabled
> (XEN)  - Dom0 mode: Relaxed
> (XEN) Enabled directed EOI with ioapic_ack_old on!
> (XEN) [VT-D]iommu.c:743: iommu_enable_translation: iommu->reg = ffff82c3fff57000
>
> domU bringup:
>
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.3
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.3
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.2
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.2
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.1
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 11:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 11:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.3
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.3
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.2
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.2
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.1
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 8:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 8:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 15:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 15:0.1
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = 18:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = 18:0.1
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = b:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = b:0.1
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.0
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.0
> (XEN) [VT-D]iommu.c:1514: d0:PCIe: unmap bdf = e:0.1
> (XEN) [VT-D]iommu.c:1387: d1:PCIe: map bdf = e:0.1
> mapping kernel into physical memory
> about to get started...
>
> - Dante
>
> On Thu, Nov 11, 2010 at 11:03 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
>> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
>>> Konrad,
>>>
>>> Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
>>> buffering for IO" in /var/log/kern.log.
>>>
>>> With iommu=soft and without swiotlb=force, I see the "software bounce
>>> buffering" in /var/log/kern.log and an NMI (see below) when I load the
>>> kernel module drivers. I made sure the NMI is reproducible and not a
>>
>> What is the kernel module doing to cause this? DMA?
>>> one-time event.
>>
>> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d
>> enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
>>>
>>> /var/log/kern.log (iommu=soft):
>>> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
>>> software IO TLB at phys 0x5800000 - 0x9800000
>>>
>>> (XEN)
>>> (XEN)
>>> (XEN) NMI - I/O ERROR
>>> (XEN) ----[ Xen-4.1-unstable  x86_64  debug=y  Not tainted ]----
>>> (XEN) CPU:    0
>>> (XEN) RIP:    e008:[<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
>>> (XEN) RFLAGS: 0000000000000012   CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000080   rbx: ffff82c480287c48   rcx: 0000000000000000
>>> (XEN) rdx: 0000000000000080   rsi: 0000000000000080   rdi: ffff82c480287c48
>>> (XEN) rbp: ffff82c480287c78   rsp: ffff82c480287c38   r8:  0000000000000000
>>> (XEN) r9:  0000000000000037   r10: 0000ffff0000ffff   r11: 00ff00ff00ff00ff
>>> (XEN) r12: ffff82c48029f080   r13: 0000000000000001   r14: 0000000000000008
>>> (XEN) r15: ffff82c4802b0c20   cr0: 000000008005003b   cr4: 00000000000026f0
>>> (XEN) cr3: 00000001250a9000   cr2: 00007f6165ae9428
>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>> (XEN) Xen stack trace from rsp=ffff82c480287c38:
>>> (XEN)    ffff82c480287c78 ffff82c48012001f 0000000000000100 0000000000000000
>>> (XEN)    ffff82c480287ca8 ffff83011dadd8b0 ffff83019fffa9d0 ffff82c4802c2300
>>> (XEN)    ffff82c480287cc8 ffff82c480117d0d ffff82c48029f080 0000000000000001
>>> (XEN)    0000000000000100 0000000000000000 0000000000000002 ffff8300df606000
>>> (XEN)    000000411de66867 ffff82c4802c2300 ffff82c480287d28 ffff82c48011f299
>>> (XEN)    0000000000000100 0000000000000086 ffff83019e3fa000 ffff83011dadd8b0
>>> (XEN)    ffff83019fffa9d0 ffff8300df606000 0000000000000000 0000000000000000
>>> (XEN)    000000000000007f ffff83019fe02200 ffff82c480287d38 ffff82c48011f6ea
>>> (XEN)    ffff82c480287d58 ffff82c48014e4c1 ffff83011dae2000 0000000000000066
>>> (XEN)    ffff82c480287d68 ffff82c48014e54d ffff82c480287d98 ffff82c480105d59
>>> (XEN)    ffff82c480287da8 ffff8301616a6990 ffff83011dae2000 0000000000000000
>>> (XEN)    ffff82c480287da8 ffff82c480105f81 ffff82c480287e28 ffff82c48015c043
>>> (XEN)    0000000000000043 0000000000000043 ffff83019fe02234 0000000000000000
>>> (XEN)    000000000000010c 0000000000000000 0000000000000000 0000000000000002
>>> (XEN)    ffff82c480287e10 ffff82c480287f18 ffff82c48024f6c0 ffff82c480287f18
>>> (XEN)    ffff82c4802c2300 0000000000000002 00007d3b7fd781a7 ffff82c480154ee6
>>> (XEN)    0000000000000002 ffff82c4802c2300 ffff82c480287f18 ffff82c48024f6c0
>>> (XEN)    ffff82c480287ee0 ffff82c480287f18 00ff00ff00ff00ff 0000ffff0000ffff
>>> (XEN)    0000000000000000 0000000000000000 ffff82c4802c23a0 0000000000000000
>>> (XEN)    0000000000000000 ffff82c4802c2e80 0000000000000000 0000007a00000000
>>> (XEN) Xen call trace:
>>> (XEN)    [<ffff82c4801701b2>] smp_send_event_check_mask+0x1/0x10
>>> (XEN)    [<ffff82c480117d0d>] csched_vcpu_wake+0x2e1/0x302
>>> (XEN)    [<ffff82c48011f299>] vcpu_wake+0x243/0x43e
>>> (XEN)    [<ffff82c48011f6ea>] vcpu_unblock+0x4a/0x4c
>>> (XEN)    [<ffff82c48014e4c1>] vcpu_kick+0x21/0x7f
>>> (XEN)    [<ffff82c48014e54d>] vcpu_mark_events_pending+0x2e/0x32
>>> (XEN)    [<ffff82c480105d59>] evtchn_set_pending+0xbf/0x190
>>> (XEN)    [<ffff82c480105f81>] send_guest_pirq+0x54/0x56
>>> (XEN)    [<ffff82c48015c043>] do_IRQ+0x3b2/0x59c
>>> (XEN)    [<ffff82c480154ee6>] common_interrupt+0x26/0x30
>>> (XEN)    [<ffff82c48014e3c3>] default_idle+0x82/0x87
>>> (XEN)    [<ffff82c480150664>] idle_loop+0x5a/0x68
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) FATAL TRAP: vector = 2 (nmi)
>>> (XEN) [error_code=0000] , IN INTERRUPT CONTEXT
>>> (XEN) ****************************************
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>>
>>> Dante
>>>
>>>
>>> On Thu, Nov 11, 2010 at 8:04 AM, Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>> > On Wed, Nov 10, 2010 at 05:16:14PM -0800, Dante Cinco wrote:
>>> >> We have Fibre Channel HBA devices that we PCI passthrough to our pvops
>>> >> domU kernel. Without swiotlb=force in the domU's kernel command line,
>>> >> both domU and dom0 lock up after loading the kernel module drivers for
>>> >> the HBA devices. With swiotlb=force, the domU and dom0 are stable
>>> >
>>> > Whoa. That is not good - what happens if you just pass in iommu=soft?
>>> > Does the PCI-DMA: Using.. show up if you don't pass in any of those parameters?
>>> > (I don't think it does, but just doing 'iommu=soft' should enable it).
>>> >
>>> >
>>> >> after loading the kernel module drivers but the I/O performance is at
>>> >> least an order of magnitude worse than what we were seeing with the
>>> >> HVM kernel. I see the following in /var/log/kern.log in the pvops
>>> >> domU:
>>> >>
>>> >> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
>>> >> Placing 64MB software IO TLB between ffff880005800000 - ffff880009800000
>>> >> software IO TLB at phys 0x5800000 - 0x9800000
>>> >>
>>> >> Is swiotlb=force responsible for the I/O performance degradation? I
>>> >> don't understand what swiotlb=force does so I would appreciate an
>>> >> explanation or a pointer.
>>> >
>>> > So, you should only need to use 'iommu=soft'. It will enable the Linux kernel IOMMU
>>> > to translate the pseudo-PFNs to the real machine frame numbers (bus addresses).
>>> >
>>> > If your card is 64-bit, then that is all it would do. If however your card is 32-bit
>>> > and you are DMA-ing data from above the 32-bit limit, it would copy the user-space page
>>> > to memory below 4GB, DMA that, and when done, copy it back to where the user-space
>>> > page is. This is called bounce-buffering, and this is why you would use a mix of
>>> > pci_map_page and pci_dma_sync_single_for_[cpu|device] calls around your driver.
>>> >
>>> > However, I think your cards are 64-bit, so you don't need this bounce-buffering. But
>>> > if you say 'swiotlb=force' it will force _all_ DMAs to go through the bounce-buffer.
>>> >
>>> > So, try just 'iommu=soft' and see what happens.
>>> >
>>
>


* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-11 19:42       ` Lin, Ray
@ 2010-11-12 15:56         ` Konrad Rzeszutek Wilk
  2010-11-12 16:20           ` Lin, Ray
  2010-11-12 18:29           ` Dante Cinco
  0 siblings, 2 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 15:56 UTC (permalink / raw)
  To: Lin, Ray; +Cc: Xen-devel, Dante Cinco

On Thu, Nov 11, 2010 at 12:42:03PM -0700, Lin, Ray wrote:
> 
> Konrad,
> 
>    See my response in red.

Please don't top post.
> 
> 
> -Ray
> 
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad Rzeszutek Wilk
> Sent: Thursday, November 11, 2010 11:04 AM
> To: Dante Cinco
> Cc: Xen-devel
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
> 
> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> > Konrad,
> >
> > Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
> > buffering for IO" in /var/log/kern.log.
> >
> > With iommu=soft and without swiotlb=force, I see the "software bounce
> > buffering" in /var/log/kern.log and an NMI (see below) when I load the
> > kernel module drivers. I made sure the NMI is reproducible and not a
> 
> What is the kernel module doing to cause this? DMA?

??? What did it do?

> > one-time event.
> 
> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
> 
> We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The kernel eventually disables our device's interrupts because they go unserviced more than 100000 times.

That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it
is an interrupt issue it sounds like you are using x2APIC and that is enabled without the IOMMU.
Have you tried disabling the IOMMU and x2APIC? (This is all on the hypervisor command line?)


* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 15:56         ` Konrad Rzeszutek Wilk
@ 2010-11-12 16:20           ` Lin, Ray
  2010-11-12 16:55             ` Konrad Rzeszutek Wilk
  2010-11-12 18:29           ` Dante Cinco
  1 sibling, 1 reply; 36+ messages in thread
From: Lin, Ray @ 2010-11-12 16:20 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel, Dante Cinco

 

-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] 
Sent: Friday, November 12, 2010 7:57 AM
To: Lin, Ray
Cc: Dante Cinco; Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On Thu, Nov 11, 2010 at 12:42:03PM -0700, Lin, Ray wrote:
> 
> Konrad,
> 
>    See my response in red.

Please don't top post.
> 
> 
> -Ray
> 
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad 
> Rzeszutek Wilk
> Sent: Thursday, November 11, 2010 11:04 AM
> To: Dante Cinco
> Cc: Xen-devel
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 
> pvops domU kernel with PCI passthrough
> 
> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
> > Konrad,
> >
> > Without swiotlb=force, I don't see "PCI-DMA: Using software bounce 
> > buffering for IO" in /var/log/kern.log.
> >
> > With iommu=soft and without swiotlb=force, I see the "software 
> > bounce buffering" in /var/log/kern.log and an NMI (see below) when I 
> > load the kernel module drivers. I made sure the NMI is reproducible 
> > and not a
> 
> What is the kernel module doing to cause this? DMA?

??? What did it do?

> > one-time event.
> 
> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
> 
> We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), the DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The kernel eventually disables our device's interrupts because nobody services them for more than 100000 occurrences.

That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue it sounds like you are using x2APIC and that it is enabled without the IOMMU.
Have you tried disabling both the IOMMU and x2APIC? (this is all on the hypervisor command line?)

Konrad,
It's unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence in which the tachyon device generates the DMA/interrupts (a sketch of the resulting ISR pattern follows the list):
- the tachyon device does a DMA write to update the memory word which indicates the source of the interrupt.
- After the DMA is done, the tachyon device triggers an interrupt.
- The driver's interrupt service routine is invoked in response to the interrupt.
- The interrupt service routine determines the source of the interrupt by examining the memory which is supposed to have been updated by the previous DMA.
- Even though the interrupt happens, the driver code can't find the source of the interrupt since the DMA doesn't work properly.
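
To make the failure mode concrete, the ISR pattern looks roughly like this (a minimal sketch with hypothetical tach_* names, not our actual driver code):

#include <linux/interrupt.h>
#include <linux/pci.h>

struct tach_dev {                      /* hypothetical container */
	struct pci_dev *pdev;
	u32 *done_word;                /* CPU view of the DMA'd status word */
	dma_addr_t done_dma;           /* bus address programmed into the chip */
};

static irqreturn_t tach_isr(int irq, void *data)
{
	struct tach_dev *td = data;

	/* Make sure the CPU sees the chip's DMA write before reading it. */
	pci_dma_sync_single_for_cpu(td->pdev, td->done_dma,
				    sizeof(u32), PCI_DMA_FROMDEVICE);
	if (*td->done_word == 0)
		/* The interrupt fired but the DMA never landed - the
		 * "can't find the source of the interrupt" case. */
		return IRQ_NONE;
	/* ... service the interrupt ... */
	pci_dma_sync_single_for_device(td->pdev, td->done_dma,
				       sizeof(u32), PCI_DMA_FROMDEVICE);
	return IRQ_HANDLED;
}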

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 16:20           ` Lin, Ray
@ 2010-11-12 16:55             ` Konrad Rzeszutek Wilk
  2010-11-12 19:38               ` Lin, Ray
  0 siblings, 1 reply; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 16:55 UTC (permalink / raw)
  To: Lin, Ray; +Cc: Xen-devel, Dante Cinco

> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue it sounds like you are using x2APIC and that it is enabled without the IOMMU.
> Have you tried disabling both the IOMMU and x2APIC? (this is all on the hypervisor command line?)
> 
> Konrad,
> It's unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence in which the tachyon device generates the DMA/interrupts:
> - the tachyon device does a DMA write to update the memory word which indicates the source of the interrupt.
> - After the DMA is done, the tachyon device triggers an interrupt.
> - The driver's interrupt service routine is invoked in response to the interrupt.
> - The interrupt service routine determines the source of the interrupt by examining the memory which is supposed to have been updated by the previous DMA.
> - Even though the interrupt happens, the driver code can't find the source of the interrupt since the DMA doesn't work properly.

That sounds like the tachyon device is updating the wrong memory location. How are you
programming the memory location that the tachyon device is supposed to touch? Are you using
the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should differ
from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent,
in which case the virt_to_phys() values for that pool should be the right MFNs.

One way you can figure this is doing something like this to make sure you got
the right MFN:

add these two:
#include <xen/page.h>
#include <asm/xen/page.h>

          phys_addr_t phys = page_to_phys(mem->pages[i]);
+               if (xen_pv_domain()) {
+                       phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
+                                       page_to_pfn(mem->pages[i])));
+                       if (phys != xen_phys) {
+                               printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
+                                       " CODE UNTESTED!\n",
+                                       (unsigned long)phys,
+                                       (unsigned long)xen_phys);
+                               WARN_ON_ONCE(phys != xen_phys);
+                               phys = xen_phys;
+                       }
+               }
	and use the 'phys' value from now on.

If this sounds like black magic, here is a short writeup
http://wiki.xensource.com/xenwiki/XenPVOPSDRM

look at "Why those patches" section.

Lastly, are you using unsigned long or the phys_addr_t typedef?

The more I think about your problem the more it sounds like a truncation issue. You
said that it works just right (albeit slow) if you use 'swiotlb=force'. The slowness
could be due to not using the pci_sync_* APIs to sync the DMA buffers.. But regardless,
using bounce buffers will slow the DMA operations down.

Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that
you are using some casting macro that casts a PFN to unsigned long or vice-versa and
we end up truncating it to 32-bit (see the sketch below)? (I've actually seen this issue
with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
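
By 'the sketch below' I mean something like this - an illustration of the truncation hazard only; the tach_* name and register offset are made up, not from your driver:

#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

static int tach_setup_dma(struct pci_dev *pdev, void __iomem *regs,
			  void *buf, size_t len)
{
	dma_addr_t bus;

	/* Declare what the chip can address _before_ mapping anything. */
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
		return -EIO;

	bus = pci_map_single(pdev, buf, len, PCI_DMA_BIDIRECTIONAL);
	if (pci_dma_mapping_error(pdev, bus))
		return -EIO;

	/* Keep dma_addr_t end to end. A cast like (u32)bus silently drops
	 * bits 32 and up if the mapping landed high - exactly the kind of
	 * truncation I am asking about. */
	writel(lower_32_bits(bus), regs + 0x10 /* made-up offset */);
	return 0;
}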

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12  1:02         ` Dante Cinco
@ 2010-11-12 16:58           ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 16:58 UTC (permalink / raw)
  To: Dante Cinco; +Cc: Xen-devel

On Thu, Nov 11, 2010 at 05:02:55PM -0800, Dante Cinco wrote:
> Here's another datapoint: with iommu=1,passthrough,no-intremap,verbose
> in the Xen command line and iommu=soft in the pvops domU command line
> also results in an NMI (see below). Replacing iommu=soft with

OK, so that enables VT-d and lets you do 64-bit DMA.

> swiotlb=force in pvops domU works reliably but with the I/O
> performance degradation. It seems that regardless of whether iommu is
> enabled or disabled in the hypervisor, swiotlb=force is necessary in
> the pvops domU.

That is bizarre. I am pretty sure it should work just fine with 'iommu=soft'.
My test scripts confirm this, but let me run once more just to make sure.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 15:56         ` Konrad Rzeszutek Wilk
  2010-11-12 16:20           ` Lin, Ray
@ 2010-11-12 18:29           ` Dante Cinco
  1 sibling, 0 replies; 36+ messages in thread
From: Dante Cinco @ 2010-11-12 18:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel, Lin, Ray

On Fri, Nov 12, 2010 at 7:56 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Thu, Nov 11, 2010 at 12:42:03PM -0700, Lin, Ray wrote:
>>
>> Konrad,
>>
>>    See my response in red.
>
> Please don't top post.
>>
>>
>> -Ray
>>
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Konrad Rzeszutek Wilk
>> Sent: Thursday, November 11, 2010 11:04 AM
>> To: Dante Cinco
>> Cc: Xen-devel
>> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
>>
>> On Thu, Nov 11, 2010 at 10:31:48AM -0800, Dante Cinco wrote:
>> > Konrad,
>> >
>> > Without swiotlb=force, I don't see "PCI-DMA: Using software bounce
>> > buffering for IO" in /var/log/kern.log.
>> >
>> > With iommu=soft and without swiotlb=force, I see the "software bounce
>> > buffering" in /var/log/kern.log and an NMI (see below) when I load the
>> > kernel module drivers. I made sure the NMI is reproducible and not a
>>
>> What is the kernel module doing to cause this? DMA?
>
> ??? What did it do?
>
>> > one-time event.
>>
>> So doing 64-bit DMA causes an NMI. Do you have the Hypervisor's IOMMU VT-d enabled or disabled? (iommu=off,verbose) If you turn it off does this work?
>>
>> We have IOMMU VT-d enabled. If we turn it off (iommu=off,verbose), the DMA doesn't work properly and the driver code is unable to detect the source of the interrupt. The kernel eventually disables our device's interrupts because nobody services them for more than 100000 occurrences.
>
> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it
> is an interrupt issue it sounds like you are using x2APIC and that it is enabled without the IOMMU.
> Have you tried disabling both the IOMMU and x2APIC? (this is all on the hypervisor command line?)
>
>

I set the hypervisor boot options to iommu=0 x2apic=0. I booted the
pvops domU with swiotlb=force initially since that's the option that
worked in the past. Not long after loading the kernel module drivers,
domU hung/froze but dom0 stayed up. I checked the Xen interrupt
bindings and I see the PCI-passthrough devices have either (PS--) or
(-S--).

(XEN)    IRQ:  66 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:72
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:127(-S--),
(XEN)    IRQ:  67 affinity:00000000,00000000,00000000,00000200 vec:3b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:126(-S--),
(XEN)    IRQ:  68 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:8a
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:125(-S--),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000800 vec:43
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:124(PS--),
(XEN)    IRQ:  70 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:9a
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:123(-S--),
(XEN)    IRQ:  71 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:a2
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:122(PS--),
(XEN)    IRQ:  72 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:aa
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:121(-S--),
(XEN)    IRQ:  73 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:b2
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:120(PS--),
(XEN)    IRQ:  74 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ba
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:119(-S--),
(XEN)    IRQ:  75 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:c2
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:118(PS--),
(XEN)    IRQ:  76 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:ca
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:117(-S--),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00080000 vec:4b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:116(PS--),
(XEN)    IRQ:  78 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:da
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:115(-S--),
(XEN)    IRQ:  79 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:23
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:114(PS--),
(XEN)    IRQ:  80 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:2b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:113(-S--),
(XEN)    IRQ:  81 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:33
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:112(PS--),

I will reboot the pvops domU with iommu=soft and without swiotlb=force next.

- Dante

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 16:55             ` Konrad Rzeszutek Wilk
@ 2010-11-12 19:38               ` Lin, Ray
  2010-11-12 22:33                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Lin, Ray @ 2010-11-12 19:38 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel, Dante Cinco

 

-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] 
Sent: Friday, November 12, 2010 8:56 AM
To: Lin, Ray
Cc: Dante Cinco; Xen-devel
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

> That does not sound right. You should be able to use PCI passthrough without the IOMMU. Since it is an interrupt issue it sounds like you are using x2APIC and that it is enabled without the IOMMU.
> Have you tried disabling both the IOMMU and x2APIC? (this is all on the
> hypervisor command line?)
> 
> Konrad,
> It's unlikely to be an interrupt issue; it looks like a DMA issue. Here is the sequence
> in which the tachyon device generates the DMA/interrupts:
> - the tachyon device does a DMA write to update the memory word which indicates the source of the interrupt.
> - After the DMA is done, the tachyon device triggers an interrupt.
> - The driver's interrupt service routine is invoked in response to
> the interrupt.
> - The interrupt service routine determines the source of the interrupt by examining the memory which is supposed to have been updated by the previous DMA.
> - Even though the interrupt happens, the driver code can't find the source of the interrupt since the DMA doesn't work properly.

That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should differ from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.

Our driver uses pci_map_single to get the physical addr to program the chip.
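
Concretely, the pattern is along these lines (a simplified sketch, not our exact code; error handling omitted):

	dma_addr_t bus = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);
	/* program 'bus' into the chip's register and kick off the transfer */
	pci_unmap_single(pdev, bus, len, PCI_DMA_TODEVICE);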


One way you can figure this is doing something like this to make sure you got the right MFN:

add these two:
#include <xen/page.h>
#include <asm/xen/page.h>

          phys_addr_t phys = page_to_phys(mem->pages[i]);
+               if (xen_pv_domain()) {
+                       phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
+                                       page_to_pfn(mem->pages[i])));
+                       if (phys != xen_phys) {
+                               printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
+                                       " CODE UNTESTED!\n",
+                                       (unsigned long)phys,
+                                       (unsigned long)xen_phys);
+                               WARN_ON_ONCE(phys != xen_phys);
+                               phys = xen_phys;
+                       }
+               }
	and use the 'phys' value from now on.



If this sounds like black magic, here is a short writeup http://wiki.xensource.com/xenwiki/XenPVOPSDRM

look at "Why those patches" section.

Lastly, are you using unsigned long or the phys_addr_t typedef?

The driver uses dma_addr_t for physical addresses.

The more I think about your problem the more it sounds like a truncation issue. You said that it works just right (albeit slow) if you use 'swiotlb=force'. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. But regardless, using bounce buffers will slow the DMA operations down.

The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.
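
The ordering is roughly this (a sketch; the direction and names are illustrative):

	/* hand the buffer to the device before it DMAs into it */
	pci_dma_sync_single_for_device(pdev, bus, len, PCI_DMA_FROMDEVICE);
	/* ... the chip DMAs into the buffer and raises its interrupt ... */
	/* claim the buffer back before the CPU reads the result */
	pci_dma_sync_single_for_cpu(pdev, bus, len, PCI_DMA_FROMDEVICE);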

Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?

The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.
The driver doesn't set the DMA mask on the device to 32-bit.

I'm looking through the driver code to see whether anything is wrong. We appreciate your help, Konrad.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 19:38               ` Lin, Ray
@ 2010-11-12 22:33                 ` Konrad Rzeszutek Wilk
  2010-11-12 22:57                   ` Lin, Ray
  2010-11-16 17:07                   ` Dante Cinco
  0 siblings, 2 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 22:33 UTC (permalink / raw)
  To: Lin, Ray; +Cc: Xen-devel, Dante Cinco

>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should differ from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
> 
> Our driver uses pci_map_single to get the physical addr to program the chip.

OK. Good.
> 
> 
> One way you can figure this is doing something like this to make sure you got the right MFN:
> 
> add these two:
> #include <xen/page.h>
> #include <asm/xen/page.h>
> 
>           phys_addr_t phys = page_to_phys(mem->pages[i]);
> +               if (xen_pv_domain()) {
> +                       phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
> +                                       page_to_pfn(mem->pages[i])));
> +                       if (phys != xen_phys) {
> +                               printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
> +                                       " CODE UNTESTED!\n",
> +                                       (unsigned long)phys,
> +                                       (unsigned long)xen_phys);
> +                               WARN_ON_ONCE(phys != xen_phys);
> +                               phys = xen_phys;
> +                       }
> +               }
> 	and use the 'phys' value from now on.
> 
> 
> 
> If this sounds like black magic, here is a short writeup http://wiki.xensource.com/xenwiki/XenPVOPSDRM
> 
> look at "Why those patches" section.
> 
> Lastly, are you using unsigned long or the phys_addr_t typedef?
> 
> The driver uses dma_addr_t for physical addresses.

Excellent.
> 
> The more I think about your problem the more it sounds like a truncation issue. You said that it works just right (albeit slow) if you use 'swiotlb=force'. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. But regardless, using bounce buffers will slow the DMA operations down.
> 
> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.

<nods> That makes sense.
> 
> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
> 
> The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.

Oh boy. That complicates it. 

> The driver doesn't set the DMA mask on the device to 32-bit.

Is it then set to 45-bit?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 22:33                 ` Konrad Rzeszutek Wilk
@ 2010-11-12 22:57                   ` Lin, Ray
  2010-11-16 17:07                   ` Dante Cinco
  1 sibling, 0 replies; 36+ messages in thread
From: Lin, Ray @ 2010-11-12 22:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel, Dante Cinco

 

-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] 
Sent: Friday, November 12, 2010 2:34 PM
To: Lin, Ray
Cc: Xen-devel; Dante Cinco
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should differ from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
> 
> Our driver uses pci_map_single to get the physical addr to program the chip.

OK. Good.
> 
> 
> One way you can figure this is doing something like this to make sure you got the right MFN:
> 
> add these two:
> #include <xen/page.h>
> #include <asm/xen/page.h>
> 
>           phys_addr_t phys = page_to_phys(mem->pages[i]);
> +               if (xen_pv_domain()) {
> +                       phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
> +                                       page_to_pfn(mem->pages[i])));
> +                       if (phys != xen_phys) {
> +                               printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
> +                                       " CODE UNTESTED!\n",
> +                                       (unsigned long)phys,
> +                                       (unsigned long)xen_phys);
> +                               WARN_ON_ONCE(phys != xen_phys);
> +                               phys = xen_phys;
> +                       }
> +               }
> 	and use the 'phys' value from now on.
> 
> 
> 
> If this sounds like black magic, here is a short writeup 
> http://wiki.xensource.com/xenwiki/XenPVOPSDRM
> 
> look at "Why those patches" section.
> 
> Lastly, are you using unsigned long or the phys_addr_t typedef?
> 
> The driver uses dma_addr_t for physical addresses.

Excellent.
> 
> The more I think about your problem the more it sounds like a truncation issue. You said that it works just right (albeit slow) if you use 'swiotlb=force'. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. But regardless, using bounce buffers will slow the DMA operations down.
> 
> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.

<nods> That makes sense.
> 
> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
> 
> The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.

Oh boy. That complicates it. 

> The driver doesn't set the DMA mask on the device to 32-bit.

Is it then set to 45-bit?
The driver doesn't use pci_set_dma_mask to set the hardware DMA mask.
The tachyon chip should support 64-bit DMA transfers even though the programmable DMA address is limited to 32-bit/45-bit; the chip should fill the upper address bits with 0. I'm confirming this with their FAE now. In the meantime, I'll try manipulating pci_set_dma_mask to see whether it makes a difference.
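
What I plan to try, roughly (inside our probe routine, with pdev being our struct pci_dev *; the 45-bit-first ordering is my assumption given the chip's two addressing modes):

	int err = pci_set_dma_mask(pdev, DMA_BIT_MASK(45));
	if (err)
		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
	if (err)
		dev_err(&pdev->dev, "no usable DMA mask\n");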

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-12 22:33                 ` Konrad Rzeszutek Wilk
  2010-11-12 22:57                   ` Lin, Ray
@ 2010-11-16 17:07                   ` Dante Cinco
  2010-11-16 18:57                     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-16 17:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

On Fri, Nov 12, 2010 at 2:33 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>>> That sounds like the tachyon device is updating the wrong memory location. How are you programming the memory location that the tachyon device is supposed to touch? Are you using the value from pci_map_page or are you using virt_to_phys? The virt_to_phys value should differ from the pci_map_page one.. unless you allocated a coherent DMA pool using pci_alloc_coherent, in which case the virt_to_phys() values for that pool should be the right MFNs.
>>
>> Our driver uses pci_map_single to get the physical addr to program the chip.
>
> OK. Good.
>>
>>
>> One way you can figure this is doing something like this to make sure you got the right MFN:
>>
>> add these two:
>> #include <xen/page.h>
>> #include <asm/xen/page.h>
>>
>>           phys_addr_t phys = page_to_phys(mem->pages[i]);
>> +               if (xen_pv_domain()) {
>> +                       phys_addr_t xen_phys = PFN_PHYS(pfn_to_mfn(
>> +                                       page_to_pfn(mem->pages[i])));
>> +                       if (phys != xen_phys) {
>> +                               printk(KERN_ERR "Fixing up: (0x%lx->0x%lx)." \
>> +                                       " CODE UNTESTED!\n",
>> +                                       (unsigned long)phys,
>> +                                       (unsigned long)xen_phys);
>> +                               WARN_ON_ONCE(phys != xen_phys);
>> +                               phys = xen_phys;
>> +                       }
>> +               }
>>       and use the 'phys' value from now on.
>>
>>
>>
>> If this sounds like black magic, here is a short writeup http://wiki.xensource.com/xenwiki/XenPVOPSDRM
>>
>> look at "Why those patches" section.
>>
>> Lastly, are you using unsigned long or the phys_addr_t typedef?
>>
>> The driver uses dma_addr_t for physical addresses.
>
> Excellent.
>>
>> The more I think about your problem the more it sounds like a truncation issue. You said that it works just right (albeit slow) if you use 'swiotlb=force'. The slowness could be due to not using the pci_sync_* APIs to sync the DMA buffers.. But regardless, using bounce buffers will slow the DMA operations down.
>>
>> The driver does use pci_dma_sync_single_for_cpu and pci_dma_sync_single_for_device to sync the DMA buffers. Without these syncs, the driver would not work at all.
>
> <nods> That makes sense.
>>
>> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
>>
>> The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.
>
> Oh boy. That complicates it.
>
>> The driver doesn't set the DMA mask on the device to 32-bit.
>
> Is it then set to 45-bit?
>

We were not explicitly setting the DMA mask. pci_alloc_coherent was
always returning 32 bits but pci_map_single was returning a 34-bit
address, which we truncated by casting it to a uint32_t, since the
Tachyon's HBA register is only 32 bits. With swiotlb=force, both
returned 32 bits without explicitly setting the DMA mask. Once we set
the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However
with iommu=soft (and no more swiotlb=force), we're still stuck with
the abysmal I/O performance (same as when we had swiotlb=force).

In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What's
the default if we don't specify it? Without it, we get no I/Os (it
seems the interrupts and/or DMA don't work).

Are there any profiling tools you can suggest for domU? I was able to
apply Dulloor's xenoprofile patch to our dom0 kernel (2.6.32.25-pvops)
but not to xen-pcifront-0.8.2.

- Dante

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-16 17:07                   ` Dante Cinco
@ 2010-11-16 18:57                     ` Konrad Rzeszutek Wilk
  2010-11-16 19:43                       ` Dante Cinco
  0 siblings, 1 reply; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-16 18:57 UTC (permalink / raw)
  To: Dante Cinco; +Cc: Xen-devel

> >> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
> >>
> >> The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.
> >
> > Oh boy. That complicates it.
> >
> >> The driver doesn't set the DMA mask on the device to 32-bit.
> >
> > Is it then set to 45-bit?
> >
> 
> We were not explicitly setting the DMA mask. pci_alloc_coherent was

You should. But only once (during startup).

> always returning 32 bits but pci_map_single was returning a 34-bit
> address, which we truncated by casting it to a uint32_t, since the

Truncating any bus (DMA) address is a big no-no.

> Tachyon's HBA register is only 32 bits. With swiotlb=force, both

Not knowing the driver I can't comment here much, but
 1). When you say 'HBA registers' I think PCI MMIO BARs. Those are
     usually found beneath the 4GB limit and you get the virtual
     address when doing ioremap (or the pci equivalent). And the
     bus address is definitely under 4GB.
 2). After you have done that, set your pci_dma_mask to 34-bit, and then
 3). For all other operations where you can do 34-bit, use pci_map_single.
     The swiotlb layer looks at the dma_mask (and if there
     is none set it assumes 32-bit), and if it finds the physical address
     to be within the DMA mask it will gladly translate physical
     to bus and nothing else. If however the physical address is
     beyond the DMA mask it will give you the bounce buffer, which
     you will later have to copy from (using pci_sync..). I've written
     a little blurb at the bottom of the email explaining this in more detail.

Or is the issue that when you write to your HBA register the DMA
address, the HBA register can _only_ deal with 32-bit values (4bytes)?
In which case the PCI device seems to be limited to addressing only up to 4GB, right?
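
To make 1)-3) concrete, a minimal sketch (the BAR number, buffer, and tach_* name are assumptions on my part, not from your driver):

#include <linux/pci.h>

static int tach_init(struct pci_dev *pdev, void *buf, size_t len)
{
	void __iomem *regs;
	dma_addr_t bus;

	regs = pci_ioremap_bar(pdev, 0);               /* 1) map the MMIO BAR */
	if (!regs)
		return -ENOMEM;
	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(34)))  /* 2) the 34-bit mask */
		return -EIO;
	/* 3) no bouncing as long as the buffer sits below 2^34 */
	bus = pci_map_single(pdev, buf, len, PCI_DMA_BIDIRECTIONAL);
	return pci_dma_mapping_error(pdev, bus) ? -EIO : 0;
}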

> returned 32 bits without explicitly setting the DMA mask. Once we set
> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However
> with iommu=soft (and no more swiotlb=force), we're still stuck with
> the abysmal I/O performance (same as when we had swiotlb=force).

Right, that is expected.

> In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What's
> the default if we don't specify it? Without it, we get no I/Os (it

If you don't specify it you can't do PCI passthrough in PV guests.
It is automatically enabled when you boot Linux as Dom0.

> seems the interrupts and/or DMA don't work).

It has two purposes:

 1). The predominant one, used for both DomU and Dom0, is to
     translate physical addresses to machine frame numbers (PFNs->MFNs).
     Xen PV guests have a P2M array that is consulted when setting
     virtual addresses (PTEs). For PCI BARs, they are equivalent
     (PFN == MFN), but for memory regions they can be discontiguous,
     and in decreasing order. If you were to traverse the P2M list you
     could see: p2m(0x1000)==0x5121, p2m(0x1001)==0x5120, p2m(0x1002)==0x5119.

     So obviously we need a lookup mechanism to find, for, say,
     virtual address 0xfffff8000010000, the DMA address (bus address).
     Naively, on bare-metal x86 you could use virt_to_phys, which would
     get you PFN 0x10000. On Xen, however, we need to consult the P2M array.
     For example, for p2m[0x10000], the real machine frame number might be 0x102323.

     So when you do 'pci_map_*', Xen-SWIOTLB looks up the P2M to find you the
     machine frame number and returns that (DMA address aka bus address). That
     is the value you tell the HBA to transfer from/to.

     If you don't enable Xen-SWIOTLB, and use the native one (or none at all),
     you end up programming the PCI device with bogus data, since the bus address you
     are giving the card does not correspond to the real bus address.

 2). Using our example before, p2m[0x10000] returned MFN 0x102323. That
     MFN is above 4GB (frame 0x100000), and if your device can _only_ do PCI Memory Write
     and PCI Memory Read below that b/c it only has 32 address bits, we need some way
     of still getting the contents of 0x102323 to the PCI card. This is where
     bounce buffers come into play. During bootup, Xen-SWIOTLB initializes a 64MB
     chunk of space that is underneath the 4GB boundary - it is also contiguous.
     When you do 'pci_map_*', Xen-SWIOTLB looks at your DMA mask and the MFN,
     and if the MFN lies above the DMA mask it copies the contents of 0x102323 to one of its
     buffers, gives you the MFN of its buffer (say 0x20000), and you program that
     into the PCI card.  When you get an interrupt from the PCI card, you call
     pci_sync_*, which copies from MFN 0x20000 to 0x102323 and sticks the MFN 0x20000
     back on the list of buffers to be used. And now you have the result in MFN 0x102323.
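
In code terms, the translation in 1) boils down to something like this (a conceptual sketch of what the pci_map_* path computes, not the literal xen-swiotlb source):

#include <linux/mm.h>
#include <linux/pfn.h>
#include <asm/io.h>
#include <asm/xen/page.h>

static dma_addr_t xen_virt_to_bus_sketch(void *vaddr)
{
	unsigned long pfn = PFN_DOWN(virt_to_phys(vaddr));
	unsigned long mfn = pfn_to_mfn(pfn);    /* the P2M lookup */

	return PFN_PHYS(mfn) + offset_in_page(vaddr);
}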
     
> 
> Are there any profiling tools you can suggest for domU? I was able to
> apply Dulloor's xenoprofile patch to our dom0 kernel (2.6.32.25-pvops)
> but not to xen-pcifront-0.8.2.

Oh boy. I don't, sorry.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-16 18:57                     ` Konrad Rzeszutek Wilk
@ 2010-11-16 19:43                       ` Dante Cinco
  2010-11-16 20:15                         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-16 19:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

On Tue, Nov 16, 2010 at 10:57 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> >> Using the bounce buffers limits the DMA operations to under 32-bit. So could it be that you are using some casting macro that casts a PFN to unsigned long or vice-versa and we end up truncating it to 32-bit? (I've actually seen this issue with InfiniBand drivers back in the RHEL5 days.) Lastly, do you set your DMA mask on the device to 32-bit?
>> >>
>> >> The tachyon chip supports both 32-bit & 45-bit DMA. Some features require programming a 32-bit physical addr into the chip; others require a 45-bit physical addr.
>> >
>> > Oh boy. That complicates it.
>> >
>> >> The driver doesn't set the DMA mask on the device to 32-bit.
>> >
>> > Is it then set to 45-bit?
>> >
>>
>> We were not explicitly setting the DMA mask. pci_alloc_coherent was
>
> You should. But only once (during startup).
>
>> always returning 32 bits but pci_map_single was returning a 34-bit
>> address, which we truncated by casting it to a uint32_t, since the
>
> Truncating any bus (DMA) address is a big no-no.
>
>> Tachyon's HBA register is only 32 bits. With swiotlb=force, both
>
> Not knowing the driver I can't comment here much, but
>  1). When you say 'HBA registers' I think PCI MMIO BARs. Those are
>     usually found beneath the 4GB limit and you get the virtual
>     address when doing ioremap (or the pci equivalent). And the
>     bus address is definitely under 4GB.
>  2). After you have done that, set your pci_dma_mask to 34-bit, and then
>  3). For all other operations where you can do 34-bit, use pci_map_single.
>     The swiotlb layer looks at the dma_mask (and if there
>     is none set it assumes 32-bit), and if it finds the physical address
>     to be within the DMA mask it will gladly translate physical
>     to bus and nothing else. If however the physical address is
>     beyond the DMA mask it will give you the bounce buffer, which
>     you will later have to copy from (using pci_sync..). I've written
>     a little blurb at the bottom of the email explaining this in more detail.
>
> Or is the issue that when you write to your HBA register the DMA
> address, the HBA register can _only_ deal with 32-bit values (4bytes)?

The HBA register which is using the address returned by pci_map_single
is limited to a 32-bit value.

> In which case the PCI device seems to be limited to addressing only up to 4GB, right?

The HBA has some 32-bit registers and some that are 45-bit.

>
>> returned 32 bits without explicitly setting the DMA mask. Once we set
>> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However
>> with iommu=soft (and no more swiotlb=force), we're still stuck with
>> the abysmal I/O performance (same as when we had swiotlb=force).
>
> Right, that is expected.

So with iommu=soft, all I/Os have to go through Xen-SWIOTLB which
explains why we're seeing the abysmal I/O performance, right?

Is it true then that with an HVM domU kernel and PCI passthrough, it
does not use Xen-SWIOTLB and therefore results in better performance?

>
>> In pvops domU (xen-pcifront-0.8.2), what does iommu=soft do? What's
>> the default if we don't specify it? Without it, we get no I/Os (it
>
> If you don't specify it you can't do PCI passthrough in PV guests.
> It is automatically enabled when you boot Linux as Dom0.
>
>> seems the interrupts and/or DMA don't work).
>
> It has two purposes:
>
>  1). The predominant one, used for both DomU and Dom0, is to
>     translate physical addresses to machine frame numbers (PFNs->MFNs).
>     Xen PV guests have a P2M array that is consulted when setting
>     virtual addresses (PTEs). For PCI BARs, they are equivalent
>     (PFN == MFN), but for memory regions they can be discontiguous,
>     and in decreasing order. If you were to traverse the P2M list you
>     could see: p2m(0x1000)==0x5121, p2m(0x1001)==0x5120, p2m(0x1002)==0x5119.
>
>     So obviously we need a lookup mechanism to find, for, say,
>     virtual address 0xfffff8000010000, the DMA address (bus address).
>     Naively, on bare-metal x86 you could use virt_to_phys, which would
>     get you PFN 0x10000. On Xen, however, we need to consult the P2M array.
>     For example, for p2m[0x10000], the real machine frame number might be 0x102323.
>
>     So when you do 'pci_map_*', Xen-SWIOTLB looks up the P2M to find you the
>     machine frame number and returns that (DMA address aka bus address). That
>     is the value you tell the HBA to transfer from/to.
>
>     If you don't enable Xen-SWIOTLB, and use the native one (or none at all),
>     you end up programming the PCI device with bogus data, since the bus address you
>     are giving the card does not correspond to the real bus address.
>
>  2). Using our example before, p2m[0x10000] returned MFN 0x102323. That
>     MFN is above 4GB (frame 0x100000), and if your device can _only_ do PCI Memory Write
>     and PCI Memory Read below that b/c it only has 32 address bits, we need some way
>     of still getting the contents of 0x102323 to the PCI card. This is where
>     bounce buffers come into play. During bootup, Xen-SWIOTLB initializes a 64MB
>     chunk of space that is underneath the 4GB boundary - it is also contiguous.
>     When you do 'pci_map_*', Xen-SWIOTLB looks at your DMA mask and the MFN,
>     and if the MFN lies above the DMA mask it copies the contents of 0x102323 to one of its
>     buffers, gives you the MFN of its buffer (say 0x20000), and you program that
>     into the PCI card.  When you get an interrupt from the PCI card, you call
>     pci_sync_*, which copies from MFN 0x20000 to 0x102323 and sticks the MFN 0x20000
>     back on the list of buffers to be used. And now you have the result in MFN 0x102323.
>
>>
>> Are there any profiling tools you can suggest for domU? I was able to
>> apply Dulloor's xenoprofile patch to our dom0 kernel (2.6.32.25-pvops)
>> but not to xen-pcifront-0.8.2.
>
> Oh boy. I don't, sorry.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-16 19:43                       ` Dante Cinco
@ 2010-11-16 20:15                         ` Konrad Rzeszutek Wilk
  2010-11-18  1:09                           ` Dante Cinco
  0 siblings, 1 reply; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-16 20:15 UTC (permalink / raw)
  To: Dante Cinco; +Cc: Xen-devel

> > Or is the issue that when you write to your HBA register the DMA
> > address, the HBA register can _only_ deal with 32-bit values (4bytes)?
> 
> The HBA register which is using the address returned by pci_map_single
> is limited to a 32-bit value.
> 
> > In which case the PCI device seems to be limited to addressing only up to 4GB, right?
> 
> The HBA has some 32-bit registers and some that are 45-bit.

Ugh, so can you set up pci coherent DMA pools at startup for the 32-bit
registers. Then set the pci_dma_mask to 45-bit and use pci_map_single for
all others.
> 
> >
> >> returned 32 bits without explicitly setting the DMA mask. Once we set
> >> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However
> >> with iommu=soft (and no more swiotlb=force), we're still stuck with
> >> the abysmal I/O performance (same as when we had swiotlb=force).
> >
> > Right, that is expected.
> 
> So with iommu=soft, all I/Os have to go through Xen-SWIOTLB which
> explains why we're seeing the abysmal I/O performance, right?

You are simplifying it. You are seeing abysmal I/O performance b/c you
are doing bounce buffering. You can fix this by having the driver
allocate a 32-bit pool at startup and use that just for the
HBA registers that can only take 32-bit addresses, and then for the rest use
pci_map_single with a 45-bit DMA mask (sketch below).
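
Roughly like this (a sketch of the split; the pool size and 'tach32' name are made up):

#include <linux/pci.h>

static int tach_dma_setup(struct pci_dev *pdev)
{
	struct pci_pool *pool32;
	dma_addr_t bus32;
	void *cpu;

	/* Coherent allocations stay below 4GB for the 32-bit-only
	 * registers; streaming mappings may use the full 45 bits. */
	if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)) ||
	    pci_set_dma_mask(pdev, DMA_BIT_MASK(45)))
		return -EIO;

	pool32 = pci_pool_create("tach32", pdev, 512 /* size */,
				 64 /* align */, 0);
	if (!pool32)
		return -ENOMEM;
	/* 'bus32' honours the consistent mask, so it fits a 32-bit
	 * register; everything else goes through pci_map_single. */
	cpu = pci_pool_alloc(pool32, GFP_KERNEL, &bus32);
	return cpu ? 0 : -ENOMEM;
}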

> 
> Is it true then that with an HVM domU kernel and PCI passthrough, it
> does not use Xen-SWIOTLB and therefore results in better performance?

Yes and no.

If you allocate to your HVM guests more than 4GB you are going to
hit the same issues with the bounce buffer.

If you give your guest less than 4GB, there is no SWIOTLB running in the guest,
and QEMU along with the hypervisor end up using the hardware one (currently
the Xen hypervisor supports AMD-Vi and Intel VT-d). In your case it is VT-d
- at which point VT-d will remap your GMFNs to MFNs, and VT-d will
be responsible for translating the DMA address that the PCI card
tries to access into the real MFN.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-16 20:15                         ` Konrad Rzeszutek Wilk
@ 2010-11-18  1:09                           ` Dante Cinco
  2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-18  1:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Xen-devel

On Tue, Nov 16, 2010 at 12:15 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> > Or is the issue that when you write to your HBA register the DMA
>> > address, the HBA register can _only_ deal with 32-bit values (4bytes)?
>>
>> The HBA register which is using the address returned by pci_map_single
>> is limited to a 32-bit value.
>>
>> > In which case the PCI device seems to be limited to addressing only up to 4GB, right?
>>
>> The HBA has some 32-bit registers and some that are 45-bit.
>
> Ugh, so can you set up pci coherent DMA pools at startup for the 32-bit
> registers. Then set the pci_dma_mask to 45-bit and use pci_map_single for
> all others.
>>
>> >
>> >> returned 32 bits without explicitly setting the DMA mask. Once we set
>> >> the mask to 32 bits using pci_set_dma_mask, the NMIs stopped. However
>> >> with iommu=soft (and no more swiotlb=force), we're still stuck with
>> >> the abysmal I/O performance (same as when we had swiotlb=force).
>> >
>> > Right, that is expected.
>>
>> So with iommu=soft, all I/Os have to go through Xen-SWIOTLB which
>> explains why we're seeing the abysmal I/O performance, right?
>
> You are simplifying it. You are seeing abysmal I/O performance b/c you
> are doing bounce buffering. You can fix this by having the driver
> allocate a 32-bit pool at startup and use that just for the
> HBA registers that can only take 32-bit addresses, and then for the rest use
> pci_map_single with a 45-bit DMA mask.

I wanted to confirm that bounce buffering was indeed occurring so I
modified swiotlb.c in the kernel and added printks in the following
functions:
swiotlb_bounce
swiotlb_tbl_map_single
swiotlb_tbl_unmap_single
Sure enough we were calling all 3 five times per I/O. We took your
suggestion and replaced pci_map_single with pci_pool_alloc. The
swiotlb calls were gone but the I/O performance only improved 6% (29k
IOPS to 31k IOPS) which is still abysmal.
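
The instrumentation was just one line at the top of each function, along these lines (a sketch, not the exact diff; the real signatures are as in our tree):

 static void swiotlb_bounce(...)
 {
+	printk(KERN_DEBUG "swiotlb_bounce: size=%zu\n", size);
 	...
 }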

Any suggestions on where to look next? I have one question about the
P2M array: Does the P2M lookup occur every DMA or just during the
allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
resource that could be a bottleneck?

>
>>
>> Is it true then that with an HVM domU kernel and PCI passthrough, it
>> does not use Xen-SWIOTLB and therefore results in better performance?
>
> Yes and no.
>
> If you allocate to your HVM guests more than 4GB you are going to
> hit the same issues with the bounce buffer.
>
> If you give your guest less than 4GB, there is no SWIOTLB running in the guest,
> and QEMU along with the hypervisor end up using the hardware one (currently
> the Xen hypervisor supports AMD-Vi and Intel VT-d). In your case it is VT-d
> - at which point VT-d will remap your GMFNs to MFNs, and VT-d will
> be responsible for translating the DMA address that the PCI card
> tries to access into the real MFN.
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18  1:09                           ` Dante Cinco
@ 2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
  2010-11-18 17:28                               ` Chris Mason
                                                 ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-18 17:19 UTC (permalink / raw)
  To: Dante Cinco, andrew.thomas, mukesh.rathor, keir.fraser,
	mathieu.desnoyers, chris.mason, Jeremy
  Cc: Xen-devel

Keir, Dan, Mathieu, Chris, Mukesh,

This fellow is passing in a PCI device to his Xen PV guest and trying
to get high IOPS. The kernel he is using is a 2.6.36 with tglx's
sparse_irq rework.

> I wanted to confirm that bounce buffering was indeed occurring so I
> modified swiotlb.c in the kernel and added printks in the following
> functions:
> swiotlb_bounce
> swiotlb_tbl_map_single
> swiotlb_tbl_unmap_single
> Sure enough we were calling all 3 five times per I/O. We took your
> suggestion and replaced pci_map_single with pci_pool_alloc. The
> swiotlb calls were gone but the I/O performance only improved 6% (29k
> IOPS to 31k IOPS) which is still abysmal.

Hey! 6% that is nothing to sneeze at.

> 
> Any suggestions on where to look next? I have one question about the

So since you are talking IOPS I figured you must be using fio to run those
numbers. And since you mentioned HVM at some point, you are not running
this PV domain as a back-end for another PV guest. You are probably going
to run some form of iSCSI target and stuff those down the PCI device.

A couple of things pop into my head.. but let's first address your question.

> P2M array: Does the P2M lookup occur every DMA or just during the
> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central

It only occurs during allocation. Also since you are bypassing the
bounce buffer those calls are done without any spinlock. The P2M
lookup is bit-shifting and division - constant-time, so O(1).

> resource that could be a bottleneck?

Doubt it. Your best bet to figure this out is to play with ftrace, or
perf trace. But I don't know how well they work with Xen nowadays - Jeremy
and Mathieu Desnoyers poked it a bit and I think I overheard that Mathieu got
it working?

So the next couple of possibilities are:
 1). you are hitting the spinlock issues on 'struct request' or any of
     the paths on the I/O. Oracle did a lot of work on those - and one
     way to find this out is to look at tracing and see where the contention is.
     I don't know where or if those patches have been posted upstream.. but as said,
     if you are seeing high spinlock usage - that might be it.
 1b). Spinlocks - make sure you have CONFIG_PARAVIRT_SPINLOCKS enabled. Otherwise
     you are going to hit dreadful conditions.
 2). You are hitting the 64-bit syscall wall. Basically your user-mode
     application (fio) is doing a write(), which used to be int 0x80 but now
     is a syscall. The syscall gets trapped in the hypervisor, which has to
     call into your PV kernel. You get hit with two context switches for each
     'write()' call. The solution is to use a 32-bit DomU, where the guest user
     application and guest kernel run in different rings.
 3). Xen CPU pools. You didn't say where the application that sends the IOs
     is located. But if it was in a separate domain then you will want to use
     Xen CPU pools. Basically this way you can get gang-scheduling where the
     guest that submits the I/O and the guest that picks up the I/O are running
     right after each other. I don't know much more details, but this is what
     I understand it does.
 4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
     your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
     destination. You can use 'xm debug-keys i' to see the MSI-X affinity - it
     is a mask; basically check whether it overlays the CPUs you are running your guest
     on. Not sure how to actually set the MSI-X affinity ... now that I think about it.
     Keir or some of the Intel folks might know better. (A sketch of the
     guest-pinning half follows after the list.)
 5). Andrew, Mukesh, Keir, Dan, any other ideas?
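
For the guest/VCPU half of 4), the usual incantations look like this (domain name, CPU and IRQ numbers are illustrative, not from this setup):

	# dom0: pin each guest VCPU to its own physical CPU
	xm vcpu-pin domU 0 0
	xm vcpu-pin domU 1 1
	# inside the guest: steer an IRQ at one VCPU (bitmask of VCPUs)
	echo 1 > /proc/irq/67/smp_affinity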

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
@ 2010-11-18 17:28                               ` Chris Mason
  2010-11-18 17:54                               ` Mathieu Desnoyers
  2010-11-18 18:43                               ` Dante Cinco
  2 siblings, 0 replies; 36+ messages in thread
From: Chris Mason @ 2010-11-18 17:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, andrew.thomas,
	keir.fraser, mukesh.rathor, Dante Cinco

Excerpts from Konrad Rzeszutek Wilk's message of 2010-11-18 12:19:36 -0500:
> Keir, Dan, Mathieu, Chris, Mukesh,
> 
> This fellow is passing in a PCI device to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's
> sparse_irq rework.
> 
> > I wanted to confirm that bounce buffering was indeed occurring so I
> > modified swiotlb.c in the kernel and added printks in the following
> > functions:
> > swiotlb_bounce
> > swiotlb_tbl_map_single
> > swiotlb_tbl_unmap_single
> > Sure enough we were calling all 3 five times per I/O. We took your
> > suggestion and replaced pci_map_single with pci_pool_alloc. The
> > swiotlb calls were gone but the I/O performance only improved 6% (29k
> > IOPS to 31k IOPS) which is still abysmal.
> 
> Hey! 6% that is nothing to sneeze at.

How fast does it go on bare metal?

I usually do four things:

1) perf record -g -a -f 'sleep 15'
(use perf report to look at the biggest CPU hogs)

2) mpstat -P ALL 1 to find the CPU doing all the softirq processing

3) perf record -g -C N -f 'sleep 15' where N was the CPU in mpstat -P
ALL that was doing all the softirq processing

4) Turn off the Intel IOMMU.  This isn't an option for the virtualized case,
but I'd try it on/off on bare metal.

-chris

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
  2010-11-18 17:28                               ` Chris Mason
@ 2010-11-18 17:54                               ` Mathieu Desnoyers
  2010-11-18 18:43                               ` Dante Cinco
  2 siblings, 0 replies; 36+ messages in thread
From: Mathieu Desnoyers @ 2010-11-18 17:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, andrew.thomas, keir.fraser,
	chris.mason, Dante Cinco

* Konrad Rzeszutek Wilk (konrad.wilk@oracle.com) wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
[...]
> Doubt it. Your best bet to figure this out is to play with ftrace, or
> perf trace. But I don't know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked it a bit and I think I overheard that Mathieu got
> it working?

I did port LTTng to the Xen hypervisor in a past life, but I did not
have time to maintain this port in parallel with the Linux kernel LTTng.
So I doubt these bits would be very useful today, as a new port would be
needed for compatibility with newer lttng tools.

If you can afford to use older Xen hypervisors with older Linux kernels
and old LTTng/LTTV versions, then you could gather a synchronized trace
across the hypervisor/Dom0/DomUs, but it would require some work for
recent Xen versions.

Currently, we've been focusing our efforts on tracing of KVM, which
works very well. We support analysis of traces taken from different
host/guest domains, as long as the TSCs are synchronized.

So an option here would be to deploy LTTng on both your dom0 and domU
kernels, gather traces of both in parallel while you run your workload,
and compare the resulting traces (load both dom0 and domU traces into
one trace set within lttv). Comparing the I/O behavior with a bare-metal
trace should give a good insight into what's different.

At least you'll be able to follow the path taken by each I/O request,
except for what's happening in Xen, which will be a black box.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
  2010-11-18 17:28                               ` Chris Mason
  2010-11-18 17:54                               ` Mathieu Desnoyers
@ 2010-11-18 18:43                               ` Dante Cinco
  2010-11-18 18:52                                 ` Lin, Ray
  2010-11-18 19:35                                 ` Dante Cinco
  2 siblings, 2 replies; 36+ messages in thread
From: Dante Cinco @ 2010-11-18 18:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, andrew.thomas,
	keir.fraser, chris.mason

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing in a PCI device to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring so I
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough we were calling all 3 five times per I/O. We took your
>> suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone but the I/O performance only improved 6% (29k
>> IOPS to 31k IOPS) which is still abysmal.
>
> Hey! 6% that is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at
least 20x (~700k IOPS).

>
>>
>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS I figured you must be using fio to run those
> numbers. And since you mentioned HVM at some point, you are not running
> this PV domain as a back-end for another PV guest. You are probably going
> to run some form of iSCSI target and stuff those down the PCI device.

Our setup is pure Fibre Channel. We're using a physically separate
system (Linux-based also) to initiate the SCSI I/Os.

>
> Couple of things that pop in my head.. but lets first address your question.
>
>> P2M array: Does the P2M lookup occur every DMA or just during the
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also since you are bypassing the
> bounce buffer those calls are done without any spinlock. The lookup
> of P2M is bitshifting, division - and are constant - so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or
> perf trace. But I don't know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked it a bit and I think I overheard that Mathieu got
> it working?
>
> So the next couple of possibilities are:
>  1). you are hitting the spinlock issues on 'struct request' or any of
>     the paths on the I/O. Oracle did a lot of work on those - and one
>     way to find this out is to look at tracing and see where the contention is.
>     I don't know where or if those patches have been posted upstream.. but as said,
>     if you are seeing the spinlock usage high  - that might be it.
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. Otherwise

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

>     you are going to hit dreadful conditions.
>  2). You are hitting the 64-bit syscall wall. Basically your user-mode
>     application (fio) is doing a write(), which used to be int 0x80 but now
>     is a syscall. The syscall gets trapped in the hypervisor which has to
>     call in your PV kernel. You get hit with two context switches for each
>     'write()' call. The solution is to use a 32-bit DomU as the guest user
>     application and guest kernel run in different rings.

There is no user-space application involved with the I/O; it's all
handled by kernel driver code.

>  3). Xen CPU pools. You didn't say where the application that sends the IOs
>     is located. But if it was in a separate domain then you will want to use
>     Xen CPU pools. Basically this way you can get gang-scheduling where the
>     guest that submits the I/O and the guest that picks up the I/O are running
>     right after each other. I don't know much more details, but this is what
>     I understand it does.
>  4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
>     your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
>     destination. You can use the 'xm debug-keys i' to see the MSI-X affinity - it
>     is a mask and basically see if it overlays the CPUs you are running your guest
>     at. Not sure how to actually set the MSI-X affinity ... now that I think.
>     Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to
domU. There are 16 VCPUs in domU and all are pinned to individual
PCPUs (24-CPU platform). Each IRQ in domU is affinitized to a CPU.
This strategy has worked well for us with the HVM kernel. Here's the
output of 'xm debug-keys i' (the pinning mechanism itself is sketched
after the dump):
(XEN)    IRQ:  67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:127(----),
(XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000200 vec:43
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:126(----),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000400 vec:83
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:125(----),
(XEN)    IRQ:  70 affinity:00000000,00000000,00000000,00000800 vec:4b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:124(----),
(XEN)    IRQ:  71 affinity:00000000,00000000,00000000,00001000 vec:8b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:123(----),
(XEN)    IRQ:  72 affinity:00000000,00000000,00000000,00002000 vec:53
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:122(----),
(XEN)    IRQ:  73 affinity:00000000,00000000,00000000,00004000 vec:93
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:121(----),
(XEN)    IRQ:  74 affinity:00000000,00000000,00000000,00008000 vec:5b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:120(----),
(XEN)    IRQ:  75 affinity:00000000,00000000,00000000,00010000 vec:9b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:119(----),
(XEN)    IRQ:  76 affinity:00000000,00000000,00000000,00020000 vec:63
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:118(----),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00040000 vec:a3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:117(----),
(XEN)    IRQ:  78 affinity:00000000,00000000,00000000,00080000 vec:6b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:116(----),
(XEN)    IRQ:  79 affinity:00000000,00000000,00000000,00100000 vec:ab
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:115(----),
(XEN)    IRQ:  80 affinity:00000000,00000000,00000000,00200000 vec:73
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:114(----),
(XEN)    IRQ:  81 affinity:00000000,00000000,00000000,00400000 vec:b3
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:113(----),
(XEN)    IRQ:  82 affinity:00000000,00000000,00000000,00800000 vec:7b
type=PCI-MSI         status=00000010 in-flight=0
domain-list=1:112(----),
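
(For what it's worth, since the question of how to set the affinity
came up: inside the guest, the pinning above is done by writing a hex
CPU mask to /proc/irq/<N>/smp_affinity - from a shell, simply
"echo 200 > /proc/irq/68/smp_affinity". A minimal C sketch of the same
thing follows; the IRQ number and mask mirror the 00000200 entry shown
for IRQ 68 above, but guest IRQ numbers need not match Xen's, so treat
68 as illustrative.)

#include <stdio.h>
#include <stdlib.h>

static int pin_irq(int irq, unsigned int mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", mask);	/* hex CPU bitmask */
	return fclose(f);
}

int main(void)
{
	/* pin (hypothetical) IRQ 68 to CPU 9, i.e. mask 0x200 */
	return pin_irq(68, 0x200) ? EXIT_FAILURE : EXIT_SUCCESS;
}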

>  5). Andrew, Mukesh, Keir, Dan, any other ideas?
>

We're also working through Chris's four things to try, and we'll
consider Mathieu's LTTng suggestion.

- Dante

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 18:43                               ` Dante Cinco
@ 2010-11-18 18:52                                 ` Lin, Ray
  2010-11-18 19:35                                 ` Dante Cinco
  1 sibling, 0 replies; 36+ messages in thread
From: Lin, Ray @ 2010-11-18 18:52 UTC (permalink / raw)
  To: Dante Cinco, Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, andrew.thomas,
	keir.fraser, chris.mason

 

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dante Cinco
Sent: Thursday, November 18, 2010 10:44 AM
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca; andrew.thomas@oracle.com; keir.fraser@eu.citrix.com; chris.mason@oracle.com
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
[...]
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. 
> Otherwise

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

The platform we're running has an Intel Xeon E5540 and the X58 chipset. Here is the kernel configuration associated with the processor. Is there anything we could tune to improve the performance?

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_SPARSE_IRQ=y
CONFIG_NUMA_IRQ_DESC=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=8
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=7
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=y
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_COMPACTION is not set
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y


[...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 18:43                               ` Dante Cinco
  2010-11-18 18:52                                 ` Lin, Ray
@ 2010-11-18 19:35                                 ` Dante Cinco
  2010-11-18 21:20                                   ` Dan Magenheimer
  2010-11-19 17:10                                   ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 36+ messages in thread
From: Dante Cinco @ 2010-11-18 19:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, andrew.thomas,
	keir.fraser, chris.mason

I mentioned in a previous post to this thread that I'm able to apply
Dulloor's xenoprofile patch to the dom0 kernel but not the domU
kernel, so I can't do active-domain profiling. I'm able to do
passive-domain profiling, but I don't know how reliable the results
are, since they show pvclock_clocksource_read as the top consumer of
CPU cycles at 28%.

CPU: Intel Architectural Perfmon, speed 2665.98 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
918089   27.9310
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           pvclock_clocksource_read
217811    6.6265  domain1-modules          domain1-modules
/domain1-modules
188327    5.7295  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
mutex_spin_on_owner
186684    5.6795
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __xen_spin_lock
149514    4.5487
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __write_lock_failed
123278    3.7505
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __kernel_text_address
122906    3.7392
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           xen_spin_unlock
90903     2.7655
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __spin_time_accum
85880     2.6127
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __module_address
75223     2.2885
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           print_context_stack
66778     2.0316
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           __module_text_address
57389     1.7459
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           is_module_text_address
47282     1.4385  xen-syms-4.1-unstable    domain1-xen
syscall_enter
47219     1.4365
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           prio_tree_insert
46495     1.4145  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
pvclock_clocksource_read
44501     1.3539
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           prio_tree_left
32482     0.9882
vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
domain1-kernel           native_read_tsc

I ran oprofile (0.9.5 with xenoprofile patch) for 20 seconds while the
I/Os were running. Here's the command I used:

opcontrol --start --xen=/boot/xen-syms-4.1-unstable
--vmlinux=/boot/vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
--passive-domains=1
--passive-images=/boot/vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug

I had to remove dom0_max_vcpus=1 (but kept dom0_vcpus_pin=true) in the
Xen command line. Otherwise, oprofile only gives the samples from
CPU0.

I'm going to try perf next.

- Dante

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 19:35                                 ` Dante Cinco
@ 2010-11-18 21:20                                   ` Dan Magenheimer
  2010-11-18 21:39                                     ` Lin, Ray
  2010-11-19 17:10                                   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2010-11-18 21:20 UTC (permalink / raw)
  To: Dante Cinco, Konrad Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, Andrew Thomas,
	keir.fraser, Chris Mason

In case it is related:
http://lists.xensource.com/archives/html/xen-devel/2010-07/msg01247.html 

Although I never went further on this investigation, it
appeared to me that pvclock_clocksource_read was getting
called at least an order-of-magnitude more frequently than
expected in some circumstances for some kernels.  And IIRC
it was scaled by the number of vcpus.

> -----Original Message-----
> From: Dante Cinco [mailto:dantecinco@gmail.com]
> Sent: Thursday, November 18, 2010 12:36 PM
> To: Konrad Rzeszutek Wilk
> Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca;
> Andrew Thomas; keir.fraser@eu.citrix.com; Chris Mason
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2
> pvops domU kernel with PCI passthrough
> 
> I mentioned in a previous post to this thread that I'm able to apply
> Dulloor's xenoprofile patch to the dom0 kernel but not the domU
> kernel, so I can't do active-domain profiling. I'm able to do
> passive-domain profiling, but I don't know how reliable the results
> are, since they show pvclock_clocksource_read as the top consumer of
> CPU cycles at 28%.
> 
> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 21:20                                   ` Dan Magenheimer
@ 2010-11-18 21:39                                     ` Lin, Ray
  2010-11-19  0:20                                       ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Lin, Ray @ 2010-11-18 21:39 UTC (permalink / raw)
  To: Dan Magenheimer, Dante Cinco, Konrad Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, Andrew Thomas,
	keir.fraser, Chris Mason

 

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dan Magenheimer
Sent: Thursday, November 18, 2010 1:21 PM
To: Dante Cinco; Konrad Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca; Andrew Thomas; keir.fraser@eu.citrix.com; Chris Mason
Subject: RE: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

In case it is related:
http://lists.xensource.com/archives/html/xen-devel/2010-07/msg01247.html 

Although I never went further on this investigation, it appeared to me that pvclock_clocksource_read was getting called at least an order-of-magnitude more frequently than expected in some circumstances for some kernels.  And IIRC it was scaled by the number of vcpus.

We did suspect it, since our old setting was HZ=1000 and we assigned more than 10 VCPUs to domU. But we don't see the performance difference with HZ=100.

> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 21:39                                     ` Lin, Ray
@ 2010-11-19  0:20                                       ` Dan Magenheimer
  2010-11-19  1:38                                         ` Dante Cinco
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2010-11-19  0:20 UTC (permalink / raw)
  To: Lin, Ray, Dante Cinco, Konrad Wilk
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, Andrew Thomas,
	keir.fraser, Chris Mason

> We did suspect it, since our old setting was HZ=1000 and we assigned
> more than 10 VCPUs to domU. But we don't see the performance difference
> with HZ=100.

FWIW, it didn't appear that the problems were proportional to HZ.
Seemed more that somehow the pvclock became incorrect and spent
a lot of time rereading the pvclock value.
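
(For context on the "rereading": the guest-side pvclock read is a
lockless retry loop over the per-vcpu time record that the hypervisor
updates, so if the version field keeps changing under the reader, the
guest burns cycles retrying. Below is a simplified userspace rendering
of that loop, not the kernel's exact code; the struct layout follows
the pvclock ABI, and the scaling rule is ((delta << shift) * mul) >> 32.
x86-64, gcc.)

#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;
typedef int8_t   s8;
typedef uint8_t  u8;

/* Layout follows the pvclock ABI (pvclock_vcpu_time_info). */
struct pvclock_vcpu_time_info {
	u32 version;		/* odd while the hypervisor updates it */
	u32 pad0;
	u64 tsc_timestamp;	/* TSC at the last update */
	u64 system_time;	/* ns of system time at that update */
	u32 tsc_to_system_mul;
	s8  tsc_shift;
	u8  flags;
	u8  pad[2];
};

#define rmb() asm volatile("lfence" ::: "memory")

static inline u64 rdtsc(void)
{
	u32 lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((u64)hi << 32) | lo;
}

/* pvclock scaling rule: ns = ((delta << shift) * mul) >> 32 */
static u64 scale_delta(u64 delta, u32 mul, s8 shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (u64)(((unsigned __int128)delta * mul) >> 32);
}

static u64 pvclock_read(volatile struct pvclock_vcpu_time_info *src)
{
	u32 version;
	u64 ns;

	do {
		version = src->version;
		rmb();
		ns = src->system_time +
		     scale_delta(rdtsc() - src->tsc_timestamp,
				 src->tsc_to_system_mul, src->tsc_shift);
		rmb();
		/* retry if the record changed while we read it */
	} while ((src->version & 1) || version != src->version);

	return ns;
}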

> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19  0:20                                       ` Dan Magenheimer
@ 2010-11-19  1:38                                         ` Dante Cinco
  0 siblings, 0 replies; 36+ messages in thread
From: Dante Cinco @ 2010-11-19  1:38 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, Andrew Thomas,
	Konrad Wilk, Lin, Ray, keir.fraser, Chris Mason

On Thu, Nov 18, 2010 at 4:20 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>> We did suspect it, since our old setting was HZ=1000 and we assigned
>> more than 10 VCPUs to domU. But we don't see the performance difference
>> with HZ=100.
>
> FWIW, it didn't appear that the problems were proportional to HZ.
> Seemed more that somehow the pvclock became incorrect and spent
> a lot of time rereading the pvclock value.

We decided to enable lock statistics (lock_stat) in the kernel to
track down all the lock activity in the profile report. The first
thing I noticed was that kmemleak was at the top of the list
(/proc/lock_stat), so we disabled kmemleak. This boosted our I/O
performance from 31k to 119k IOPS.

One of our developers (Bruce Edge) suggested killing ntpd, so I did.
That gave another significant bump in I/O performance, to 209k IOPS.
The question now is: why ntpd? Is it the source of all or most of
those pvclock_clocksource_read samples in the profile report?
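
(A guess at the mechanism, offered with no certainty: ntpd's clock
discipline is a steady stream of adjtimex(2) calls plus frequent time
reads, and on a PV kernel every one of those bottoms out in the
clocksource, i.e. pvclock_clocksource_read. A read-only query looks
like this; modes = 0 only reads state, it adjusts nothing:)

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
	struct timex tx = { .modes = 0 };	/* query only */
	int state = adjtimex(&tx);

	/* state is TIME_OK etc.; freq is the current frequency offset */
	printf("clock state %d, freq %ld, est error %ld us\n",
	       state, tx.freq, tx.esterror);
	return 0;
}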

> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-18 19:35                                 ` Dante Cinco
  2010-11-18 21:20                                   ` Dan Magenheimer
@ 2010-11-19 17:10                                   ` Jeremy Fitzhardinge
  2010-11-19 17:52                                     ` Dante Cinco
  2010-11-19 17:55                                     ` Lin, Ray
  1 sibling, 2 replies; 36+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-19 17:10 UTC (permalink / raw)
  To: Dante Cinco
  Cc: Xen-devel, mathieu.desnoyers, andrew.thomas,
	Konrad Rzeszutek Wilk, keir.fraser, chris.mason

On 11/18/2010 11:35 AM, Dante Cinco wrote:
> I mentioned in a previous post to this thread that I'm able to apply
> Dulloor's xenoprofile patch to the dom0 kernel but not the domU
> kernel, so I can't do active-domain profiling. I'm able to do
> passive-domain profiling, but I don't know how reliable the results
> are, since they show pvclock_clocksource_read as the top consumer of
> CPU cycles at 28%.

Is rdtsc emulation on?  (I forget what the incantation is for that now.)

    J

> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19 17:10                                   ` Jeremy Fitzhardinge
@ 2010-11-19 17:52                                     ` Dante Cinco
  2010-11-19 17:58                                       ` Keir Fraser
  2010-11-19 17:55                                     ` Lin, Ray
  1 sibling, 1 reply; 36+ messages in thread
From: Dante Cinco @ 2010-11-19 17:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Xen-devel, mathieu.desnoyers, andrew.thomas,
	Konrad Rzeszutek Wilk, keir.fraser, chris.mason

On Fri, Nov 19, 2010 at 9:10 AM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> On 11/18/2010 11:35 AM, Dante Cinco wrote:
>> I mentioned in a previous post to this thread that I'm able to apply
>> Dulloor's xenoprofile patch to the dom0 kernel but not the domU
>> kernel, so I can't do active-domain profiling. I'm able to do
>> passive-domain profiling, but I don't know how reliable the results
>> are, since they show pvclock_clocksource_read as the top consumer of
>> CPU cycles at 28%.
>
> Is rdtsc emulation on?  (I forget what the incantation is for that now.)

How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do it?

(XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
input to DOM0)
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
(XEN) No domains have emulated TSC

I'm using xen-unstable-4.1 (22388:87f248de5230).
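
(A guest-side cross-check, in case the hypervisor output is ambiguous:
time rdtsc with itself. Back-to-back native rdtsc costs tens of
cycles; a trapped-and-emulated rdtsc costs thousands. Minimal
userspace sketch, x86-64, gcc:)

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t d, min = UINT64_MAX;
	int i;

	for (i = 0; i < 100000; i++) {
		uint64_t t0 = rdtsc();
		d = rdtsc() - t0;
		if (d < min)
			min = d;
	}
	/* tens of cycles: native; thousands: likely emulated */
	printf("min back-to-back rdtsc delta: %llu cycles\n",
	       (unsigned long long)min);
	return 0;
}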

- Dante

> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19 17:10                                   ` Jeremy Fitzhardinge
  2010-11-19 17:52                                     ` Dante Cinco
@ 2010-11-19 17:55                                     ` Lin, Ray
  1 sibling, 0 replies; 36+ messages in thread
From: Lin, Ray @ 2010-11-19 17:55 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Dante Cinco
  Cc: Xen-devel, mathieu.desnoyers, andrew.thomas,
	Konrad Rzeszutek Wilk, keir.fraser, chris.mason

 

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Jeremy Fitzhardinge
Sent: Friday, November 19, 2010 9:10 AM
To: Dante Cinco
Cc: Xen-devel; mathieu.desnoyers@polymtl.ca; andrew.thomas@oracle.com; Konrad Rzeszutek Wilk; keir.fraser@eu.citrix.com; chris.mason@oracle.com
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On 11/18/2010 11:35 AM, Dante Cinco wrote:
> [...]

Is rdtsc emulation on?  (I forget what the incantation is for that now.)

    J

We don't specify it for the domU. The default should be tsc_mode==0.
Does a PV domain always enable the emulation if tsc_mode==0?

-Ray

> CPU: Intel Architectural Perfmon, speed 2665.98 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        image name                                                   app name         symbol name
> 918089   27.9310  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   pvclock_clocksource_read
> 217811    6.6265  domain1-modules                                              domain1-modules  /domain1-modules
> 188327    5.7295  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug  mutex_spin_on_owner
> 186684    5.6795  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __xen_spin_lock
> 149514    4.5487  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __write_lock_failed
> 123278    3.7505  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __kernel_text_address
> 122906    3.7392  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   xen_spin_unlock
> 90903     2.7655  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __spin_time_accum
> 85880     2.6127  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __module_address
> 75223     2.2885  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   print_context_stack
> 66778     2.0316  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   __module_text_address
> 57389     1.7459  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   is_module_text_address
> 47282     1.4385  xen-syms-4.1-unstable                                        domain1-xen      syscall_enter
> 47219     1.4365  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   prio_tree_insert
> 46495     1.4145  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug  pvclock_clocksource_read
> 44501     1.3539  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   prio_tree_left
> 32482     0.9882  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug   domain1-kernel   native_read_tsc
>
> I ran oprofile (0.9.5 with xenoprofile patch) for 20 seconds while the 
> I/Os were running. Here's the command I used:
>
> opcontrol --start --xen=/boot/xen-syms-4.1-unstable 
> --vmlinux=/boot/vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug
> --passive-domains=1
> --passive-images=/boot/vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.11.dcinco-debug
>
> I had to remove dom0_max_vcpus=1 (but kept dom0_vcpus_pin=true) in the 
> Xen command line. Otherwise, oprofile only gives the samples from 
> CPU0.
>
> I'm going to try perf next.
>
> - Dante
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19 17:52                                     ` Dante Cinco
@ 2010-11-19 17:58                                       ` Keir Fraser
  2010-11-19 22:36                                         ` Dan Magenheimer
  0 siblings, 1 reply; 36+ messages in thread
From: Keir Fraser @ 2010-11-19 17:58 UTC (permalink / raw)
  To: Dante Cinco, Jeremy Fitzhardinge
  Cc: Xen-devel, mathieu.desnoyers, chris.mason, andrew.thomas,
	Konrad Rzeszutek Wilk

On 19/11/2010 17:52, "Dante Cinco" <dantecinco@gmail.com> wrote:

> How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do it?
> 
> (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
> input to DOM0)
> (XEN) TSC marked as reliable, warp = 0 (count=2)
> (XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
> (XEN) No domains have emulated TSC

TSC emulation is not enabled.

 -- Keir

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19 17:58                                       ` Keir Fraser
@ 2010-11-19 22:36                                         ` Dan Magenheimer
  2010-11-20  0:13                                           ` Dante Cinco
  0 siblings, 1 reply; 36+ messages in thread
From: Dan Magenheimer @ 2010-11-19 22:36 UTC (permalink / raw)
  To: Keir Fraser, Dante Cinco, Jeremy Fitzhardinge
  Cc: Xen-devel, mathieu.desnoyers, Andrew Thomas, Chris Mason, Konrad Wilk

> From: Keir Fraser [mailto:keir@xen.org]
> Sent: Friday, November 19, 2010 10:58 AM
> To: Dante Cinco; Jeremy Fitzhardinge
> Cc: Xen-devel; mathieu.desnoyers@polymtl.ca; Chris Mason; Andrew
> Thomas; Konrad Rzeszutek Wilk
> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2
> pvops domU kernel with PCI passthrough
> 
> On 19/11/2010 17:52, "Dante Cinco" <dantecinco@gmail.com> wrote:
> 
> > How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do
> it?
> >
> > (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
> > input to DOM0)
> > (XEN) TSC marked as reliable, warp = 0 (count=2)
> > (XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
> > (XEN) No domains have emulated TSC
> 
> TSC emulation is not enabled.

I *think* "No domains have emulated TSC" will be printed
if there are no domains other than dom0 currently running,
so this may not be definitive.

Also note that tsc_mode=0 means "do the right thing for
this hardware platform" but, if the domain is saved/restored
or live-migrated, TSC will start being emulated. See
tscmode.txt in xen/Documentation for more detail.
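
For the record, the mode can also be pinned explicitly in the domain's
cfg file; a minimal sketch from memory (the authoritative value list is
in tscmode.txt, so double-check there):

  # domU cfg file (xm syntax); tsc_mode values per tscmode.txt:
  #   0 = default: native TSC where safe, emulated after save/restore/migrate
  #   1 = always emulate
  #   2 = never emulate
  tsc_mode = 0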

Lastly, I haven't tested this code in quite some time; the code for PV
and HVM is different, and I've only ever tested it with xm, not xl. So
bitrot is possible, though hopefully unlikely.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
  2010-11-19 22:36                                         ` Dan Magenheimer
@ 2010-11-20  0:13                                           ` Dante Cinco
  0 siblings, 0 replies; 36+ messages in thread
From: Dante Cinco @ 2010-11-20  0:13 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers, Andrew Thomas,
	Konrad Wilk, Keir Fraser, Chris Mason

On Fri, Nov 19, 2010 at 2:36 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>> From: Keir Fraser [mailto:keir@xen.org]
>> Sent: Friday, November 19, 2010 10:58 AM
>> To: Dante Cinco; Jeremy Fitzhardinge
>> Cc: Xen-devel; mathieu.desnoyers@polymtl.ca; Chris Mason; Andrew
>> Thomas; Konrad Rzeszutek Wilk
>> Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2
>> pvops domU kernel with PCI passthrough
>>
>> On 19/11/2010 17:52, "Dante Cinco" <dantecinco@gmail.com> wrote:
>>
>> > How do I check if rdtsc emulation is on? Does 'xm debug-keys s' do
>> it?
>> >
>> > (XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch
>> > input to DOM0)
>> > (XEN) TSC marked as reliable, warp = 0 (count=2)
>> > (XEN) dom1: mode=0,ofs=0xca6f68770,khz=2666017,inc=1
>> > (XEN) No domains have emulated TSC
>>
>> TSC emulation is not enabled.
>
> I *think* "No domains have emulated TSC" will be printed
> if there are no domains other than dom0 currently running,
> so this may not be definitive.

The pvops domU was running when I captured that Xen console output. I
also looked at /var/log/xen/xend.log and saw 'tsc_mode 0' even though I
had not explicitly set tsc_mode in the domain's cfg file.
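
For anyone following along, the same state can be dumped without a
serial console; a sketch (I happened to use the serial console, but
xm dmesg should show the same output from the console ring):

  xm debug-keys s                        # ask Xen to dump its time/TSC state
  xm dmesg | tail -n 20                  # read the dump from the console ring
  grep tsc_mode /var/log/xen/xend.log    # xend logs the domain's tsc_mode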

>
> Also note that tsc_mode=0 means "do the right thing for
> this hardware platform" but, if the domain is saved/restored
> or live-migrated, TSC will start being emulated. See
> tscmode.txt in xen/Documentation for more detail.

We have not done any save/restore on domU.

>
> Lastly, I haven't tested this code in quite some time,
> the code for PV and HVM is different, and I've never
> tested it with xl, only with xm.  So bitrot is possible,
> though hopefully unlikely.
>
> Thanks,
> Dan
>

pvclock_clocksource_read is no longer the top symbol (it was 28% of the
CPU samples) in the latest xenoprofile report. I had mistakenly
attributed the huge I/O performance gain (from 119k IOPS to 209k IOPS)
to killing ntpd, but that was not the case: the gain came from turning
off lock stat. I had enabled lock stat in the kernel to track down the
lock-associated symbols in the profile report and had forgotten that I
turned it off (echo 0 > /proc/sys/kernel/lock_stat) just before killing
ntpd. With lock stat disabled in the kernel, I get 209k IOPS without
killing ntpd.
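
For anyone reproducing this, these are the knobs involved (assuming the
kernel was built with CONFIG_LOCK_STAT=y):

  grep CONFIG_LOCK_STAT /boot/config-$(uname -r)   # is lock stat compiled in?
  cat /proc/sys/kernel/lock_stat                   # 1 = collecting, 0 = off
  echo 0 > /proc/sys/kernel/lock_stat              # stop collection at runtime
  cat /proc/lock_stat                              # view accumulated statistics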

The latest xenoprofile report doesn't even have
pvclock_clocksource_read in the top 10. All the I/O processing in domU
(domID=1) is done in our kernel driver modules, so domain1-modules is
expected to be at the top of the list.

CPU: Intel Architectural Perfmon, speed 2665.97 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name                                                   app name             symbol name
542839   17.2427  domain1-modules                                              domain1-modules      /domain1-modules
378968   12.0375  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_spin_unlock
250342    7.9518  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug  mutex_spin_on_owner
206585    6.5620  xen-syms-4.1-unstable                                        domain1-xen          syscall_enter
123021    3.9076  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       lock_release
103703    3.2940  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       __lock_acquire
100973    3.2073  domain1-xen-unknown                                          domain1-xen-unknown  /domain1-xen-unknown
94449     3.0001  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       hypercall_page
67145     2.1328  xen-syms-4.1-unstable                                        domain1-xen          restore_all_guest
64460     2.0475  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_spin_trylock
62415     1.9825  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       xen_restore_fl_direct
51822     1.6461  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       native_read_tsc
45901     1.4580  vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug         vmlinux-2.6.32.25-pvops-stable-dom0-5.7.dcinco-debug  pvclock_clocksource_read
44398     1.4103  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       debug_locks_off
42191     1.3402  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       find_next_bit
41913     1.3313  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       do_raw_spin_lock
41424     1.3158  vmlinux-2.6.36-rc7-pvops-kpcif-08-2-domu-5.14.dcinco-debug   domain1-kernel       lock_acquire
39275     1.2475  xen-syms-4.1-unstable                                        domain1-xen          do_xen_version

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2010-11-20  0:13 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-11  1:16 swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough Dante Cinco
2010-11-11 16:04 ` Konrad Rzeszutek Wilk
2010-11-11 18:31   ` Dante Cinco
2010-11-11 19:03     ` Konrad Rzeszutek Wilk
2010-11-11 19:42       ` Lin, Ray
2010-11-12 15:56         ` Konrad Rzeszutek Wilk
2010-11-12 16:20           ` Lin, Ray
2010-11-12 16:55             ` Konrad Rzeszutek Wilk
2010-11-12 19:38               ` Lin, Ray
2010-11-12 22:33                 ` Konrad Rzeszutek Wilk
2010-11-12 22:57                   ` Lin, Ray
2010-11-16 17:07                   ` Dante Cinco
2010-11-16 18:57                     ` Konrad Rzeszutek Wilk
2010-11-16 19:43                       ` Dante Cinco
2010-11-16 20:15                         ` Konrad Rzeszutek Wilk
2010-11-18  1:09                           ` Dante Cinco
2010-11-18 17:19                             ` Konrad Rzeszutek Wilk
2010-11-18 17:28                               ` Chris Mason
2010-11-18 17:54                               ` Mathieu Desnoyers
2010-11-18 18:43                               ` Dante Cinco
2010-11-18 18:52                                 ` Lin, Ray
2010-11-18 19:35                                 ` Dante Cinco
2010-11-18 21:20                                   ` Dan Magenheimer
2010-11-18 21:39                                     ` Lin, Ray
2010-11-19  0:20                                       ` Dan Magenheimer
2010-11-19  1:38                                         ` Dante Cinco
2010-11-19 17:10                                   ` Jeremy Fitzhardinge
2010-11-19 17:52                                     ` Dante Cinco
2010-11-19 17:58                                       ` Keir Fraser
2010-11-19 22:36                                         ` Dan Magenheimer
2010-11-20  0:13                                           ` Dante Cinco
2010-11-19 17:55                                     ` Lin, Ray
2010-11-12 18:29           ` Dante Cinco
2010-11-11 22:32       ` Dante Cinco
2010-11-12  1:02         ` Dante Cinco
2010-11-12 16:58           ` Konrad Rzeszutek Wilk
