From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dante Cinco <dantecinco@gmail.com>
Subject: Re: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops
	domU kernel with PCI passthrough
Date: Thu, 18 Nov 2010 10:43:57 -0800
Message-ID: <AANLkTimPJ4y+YOL2Ed78jmCeaKnxLZb93Kuowxutu_O1@mail.gmail.com>
References: <20101112165541.GA10339@dumpdata.com>
	<EB4C61A1A2501842A04B573FE42B14D601374FBFD2@cosmail02.lsi.com>
	<20101112223333.GD26189@dumpdata.com>
	<AANLkTi=H6r2=-zJE+6eCtP4VXacYhd_e47+KRW5vdwjS@mail.gmail.com>
	<20101116185748.GA11549@dumpdata.com>
	<AANLkTikw8reKXwd9CcXc3qqHuXKjbMEatAVfn19uwzs3@mail.gmail.com>
	<20101116201349.GA18315@dumpdata.com>
	<AANLkTin7SRKuT5qQQ_1NSis1asOG3eJ1SmmC3fppsGnv@mail.gmail.com>
	<20101118171936.GA29275@dumpdata.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <20101118171936.GA29275@dumpdata.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>, Xen-devel <xen-devel@lists.xensource.com>, mathieu.desnoyers@polymtl.ca, andrew.thomas@oracle.com, keir.fraser@eu.citrix.com, chris.mason@oracle.com
List-Id: xen-devel@lists.xenproject.org

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing in a PCI device to his Xen PV guest and trying
> to get high IOPS. The kernel he is using is a 2.6.36 with tglx's
> sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring so I
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough we were calling all 3 five times per I/O. We took your
>> suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone but the I/O performance only improved 6% (29k
>> IOPS to 31k IOPS) which is still abysmal.
>
> Hey! 6% that is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at
least 20x higher (~700k IOPS).
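
To put the numbers in this thread side by side, here is a quick sketch of
the arithmetic. The helper name `speedup` is only for illustration; the
IOPS figures are the approximate ones quoted above.

```python
def speedup(before_iops: float, after_iops: float) -> float:
    """Return the ratio of the new throughput to the old one."""
    return after_iops / before_iops


# Approximate figures from this thread.
pv_bounce = 29_000   # PV domU, pci_map_single (bounce buffering)
pv_pool = 31_000     # PV domU, pci_pool_alloc (no swiotlb calls)
hvm = 700_000        # HVM kernel, same hardware

gain_pct = (speedup(pv_bounce, pv_pool) - 1) * 100
print(f"pci_pool_alloc gain: {gain_pct:.1f}%")
print(f"HVM vs PV (pool):    {speedup(pv_pool, hvm):.1f}x")
```

So removing the bounce buffer recovers only a small part of the gap to HVM.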

>
>>
>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS I figured you must be using fio to run those
> numbers. And since you mentioned HVM at some point, you are not running
> this PV domain as a back-end for another PV guest. You are probably going
> to run some form of iSCSI target and stuff those down the PCI device.

Our setup is pure Fibre Channel. We're using a physically separate
system (Linux-based also) to initiate the SCSI I/Os.

>
> Couple of things that pop in my head.. but let's first address your question.
>
>> P2M array: Does the P2M lookup occur every DMA or just during the
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also since you are bypassing the
> bounce buffer those calls are done without any spinlock. The lookup
> of P2M is bitshifting, division - and are constant - so O(1).
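
As I understand the point about constant-time lookup, it amounts to
something like the toy sketch below. This is not the kernel's code: the
two-level layout, the name `p2m_lookup`, the constant `ENTRIES_PER_PAGE`
and the MFN values are all assumptions made up for illustration.

```python
ENTRIES_PER_PAGE = 512  # assumption: 4 KiB page holding 8-byte entries


def p2m_lookup(table, pfn):
    """Look up a PFN in a two-level table: one divmod, two indexings."""
    top, idx = divmod(pfn, ENTRIES_PER_PAGE)
    return table[top][idx]


# Toy table covering 2 * 512 PFNs; the "MFN" is just pfn + 0x1000.
table = [
    [pfn + 0x1000 for pfn in range(base, base + ENTRIES_PER_PAGE)]
    for base in (0, ENTRIES_PER_PAGE)
]

print(hex(p2m_lookup(table, 513)))
```

The cost does not depend on how many pages are mapped, which is why the
lookup itself should not be the bottleneck.
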
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or
> perf trace. But I don't know how well they work with Xen nowadays - Jeremy
> and Mathieu Desnoyers poked it a bit and I think I overheard that Mathieu got
> it working?
>
> So the next couple of possibilities are:
>  1). you are hitting the spinlock issues on 'struct request' or any of
>      the paths on the I/O. Oracle did a lot of work on those - and one
>      way to find this out is to look at tracing and see where the contention is.
>      I don't know where or if those patches have been posted upstream.. but as said,
>      if you are seeing the spinlock usage high - that might be it.
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled. Otherwise

I checked the config file and it is enabled: CONFIG_PARAVIRT_SPINLOCKS=y

>      you are going to hit dreadful conditions.
>  2). You are hitting the 64-bit syscall wall. Basically your user-mode
>      application (fio) is doing a write(), which used to be int 0x80 but now
>      is a syscall. The syscall gets trapped in the hypervisor which has to
>      call in your PV kernel. You get hit with two context switches for each
>      'write()' call. The solution is to use a 32-bit DomU as the guest user
>      application and guest kernel run in different rings.

There is no user space application that is involved with the I/O. It's
all kernel driver code that handles the I/O.

>  3). Xen CPU pools. You didn't say where the application that sends the IOs
>      is located. But if it was in a separate domain then you will want to use
>      Xen CPU pools. Basically this way you can get gang-scheduling where the
>      guest that submits the I/O and the guest that picks up the I/O are running
>      right after each other. I don't know much more details, but this is what
>      I understand it does.
>  4). CPU/MSI-X affinity. I think you already did this, but make sure you pin
>      your guest to specific CPUs and also pin the MSI-X (vectors) to the proper
>      destination. You can use the 'xm debug-keys i' to see the MSI-X affinity - it
>      is a mask and basically see if it overlays the CPUs you are running your guest
>      at. Not sure how to actually set the MSI-X affinity ... now that I think.
>      Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI-passed through to domU.
There are 16 VCPUs in domU and all are pinned to individual PCPUs
(24-CPU platform). Each IRQ in domU is affinitized to a CPU. This
strategy has worked well for us with the HVM kernel. Here's the output
of 'xm debug-keys i':
(XEN)    IRQ:  67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:127(----),
(XEN)    IRQ:  68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:126(----),
(XEN)    IRQ:  69 affinity:00000000,00000000,00000000,00000400 vec:83 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:125(----),
(XEN)    IRQ:  70 affinity:00000000,00000000,00000000,00000800 vec:4b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:124(----),
(XEN)    IRQ:  71 affinity:00000000,00000000,00000000,00001000 vec:8b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:123(----),
(XEN)    IRQ:  72 affinity:00000000,00000000,00000000,00002000 vec:53 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:122(----),
(XEN)    IRQ:  73 affinity:00000000,00000000,00000000,00004000 vec:93 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:121(----),
(XEN)    IRQ:  74 affinity:00000000,00000000,00000000,00008000 vec:5b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:120(----),
(XEN)    IRQ:  75 affinity:00000000,00000000,00000000,00010000 vec:9b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:119(----),
(XEN)    IRQ:  76 affinity:00000000,00000000,00000000,00020000 vec:63 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:118(----),
(XEN)    IRQ:  77 affinity:00000000,00000000,00000000,00040000 vec:a3 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:117(----),
(XEN)    IRQ:  78 affinity:00000000,00000000,00000000,00080000 vec:6b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:116(----),
(XEN)    IRQ:  79 affinity:00000000,00000000,00000000,00100000 vec:ab type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:115(----),
(XEN)    IRQ:  80 affinity:00000000,00000000,00000000,00200000 vec:73 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:114(----),
(XEN)    IRQ:  81 affinity:00000000,00000000,00000000,00400000 vec:b3 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:113(----),
(XEN)    IRQ:  82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI         status=00000010 in-flight=0 domain-list=1:112(----),
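
For anyone reading the masks by eye, a small sketch for turning an
affinity field into CPU numbers. It assumes the field is a list of
comma-separated 32-bit hex words with the most significant word first;
the function name `cpus_from_affinity` is only for illustration.

```python
def cpus_from_affinity(mask: str) -> list[int]:
    """Return the sorted CPU numbers whose bits are set in the mask."""
    # With the most significant word first, dropping the commas
    # yields a single hex number covering all CPUs.
    value = int(mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(cpus_from_affinity("00000000,00000000,00000000,00000200"))  # IRQ 68
print(cpus_from_affinity("00000000,00000000,00000000,00800000"))  # IRQ 82
```

Under that assumption IRQs 68 through 82 map to CPUs 9 through 23, one
CPU each, while IRQ 67 has every bit set and is not restricted to a
single CPU.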

>  5). Andrew, Mukesh, Keir, Dan, any other ideas?
>

We're also trying Chris' 4 things to try and will consider Mathieu's
LTT suggestion.

- Dante