From: "Lin, Ray"
Subject: RE: swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough
Date: Thu, 18 Nov 2010 11:52:53 -0700
To: Dante Cinco, Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge, Xen-devel, mathieu.desnoyers@polymtl.ca, andrew.thomas@oracle.com, keir.fraser@eu.citrix.com, chris.mason@oracle.com

-----Original Message-----
From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Dante Cinco
Sent: Thursday, November 18, 2010 10:44 AM
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge; Xen-devel; mathieu.desnoyers@polymtl.ca; andrew.thomas@oracle.com; keir.fraser@eu.citrix.com; chris.mason@oracle.com
Subject: Re: [Xen-devel] swiotlb=force in Konrad's xen-pcifront-0.8.2 pvops domU kernel with PCI passthrough

On Thu, Nov 18, 2010 at 9:19 AM, Konrad Rzeszutek Wilk wrote:
> Keir, Dan, Mathieu, Chris, Mukesh,
>
> This fellow is passing a PCI device through to his Xen PV guest and
> trying to get high IOPS. The kernel he is using is a 2.6.36 with
> tglx's sparse_irq rework.
>
>> I wanted to confirm that bounce buffering was indeed occurring, so I
>> modified swiotlb.c in the kernel and added printks in the following
>> functions:
>> swiotlb_bounce
>> swiotlb_tbl_map_single
>> swiotlb_tbl_unmap_single
>> Sure enough, we were calling all 3 five times per I/O. We took your
>> suggestion and replaced pci_map_single with pci_pool_alloc. The
>> swiotlb calls were gone, but the I/O performance only improved 6%
>> (29k IOPS to 31k IOPS), which is still abysmal.
>
> Hey! 6% is nothing to sneeze at.

When we were using an HVM kernel (2.6.32.15+drm33.5), our IOPS was at
least 20x that (~700k IOPS).

>> Any suggestions on where to look next? I have one question about the
>
> So since you are talking IOPS, I figured you must be using fio to run
> those numbers. And since you mentioned HVM at some point, you are not
> running this PV domain as a back-end for another PV guest. You are
> probably going to run some form of iSCSI target and stuff those down
> the PCI device.

Our setup is pure Fibre Channel. We're using a physically separate
system (also Linux-based) to initiate the SCSI I/Os.
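
For concreteness, the pci_map_single -> pci_pool_alloc change discussed
in the quoted text above has roughly the shape sketched below. This is
an illustrative fragment only - the device structure, pool name and
buffer size are made up and are not the actual driver code: per-I/O
streaming mappings, which swiotlb=force turns into bounce-buffer
copies, are replaced by allocations from a DMA-able pool created once
at probe time.

/*
 * Illustrative sketch only - not the actual driver code. Command
 * buffers come out of a pci_pool set up once at probe time, so the
 * data path no longer calls pci_map_single()/pci_unmap_single() (and
 * therefore no longer bounces through swiotlb) on every I/O.
 */
#include <linux/pci.h>
#include <linux/dmapool.h>

#define CMD_BUF_SIZE 512		/* hypothetical per-command buffer size */

struct my_fc_dev {			/* hypothetical driver context */
	struct pci_dev *pdev;
	struct pci_pool *cmd_pool;
};

static int my_fc_create_pool(struct my_fc_dev *d)
{
	/* DMA-able memory is set aside once, at probe time ... */
	d->cmd_pool = pci_pool_create("my_fc_cmds", d->pdev,
				      CMD_BUF_SIZE, 64 /* align */, 0);
	return d->cmd_pool ? 0 : -ENOMEM;
}

static void *my_fc_get_cmd(struct my_fc_dev *d, dma_addr_t *dma)
{
	/* ... so the per-I/O cost is just a pool allocation. */
	return pci_pool_alloc(d->cmd_pool, GFP_ATOMIC, dma);
}

static void my_fc_put_cmd(struct my_fc_dev *d, void *cmd, dma_addr_t dma)
{
	pci_pool_free(d->cmd_pool, cmd, dma);
}

The pool memory is allocated coherent (machine-contiguous and within
the device's DMA mask under Xen), which is presumably why the swiotlb
printks disappeared after the change.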
> A couple of things pop into my head.. but let's first address your
> question.
>
>> P2M array: Does the P2M lookup occur on every DMA or just during the
>> allocation? What I'm getting at is this: Is the Xen-SWIOTLB a central
>
> It only occurs during allocation. Also, since you are bypassing the
> bounce buffer, those calls are done without any spinlock. The lookup
> in the P2M is bit-shifting and division - constant - so O(1).
>
>> resource that could be a bottleneck?
>
> Doubt it. Your best bet to figure this out is to play with ftrace, or
> perf trace. But I don't know how well they work with Xen nowadays -
> Jeremy and Mathieu Desnoyers poked at it a bit and I think I overheard
> that Mathieu got it working?
>
> So the next couple of possibilities are:
>  1). You are hitting the spinlock issues on 'struct request' or any of
>      the paths on the I/O. Oracle did a lot of work on those - and one
>      way to find this out is to look at tracing and see where the
>      contention is. I don't know where or if those patches have been
>      posted upstream.. but as said, if you are seeing the spinlock
>      usage high - that might be it.
>  1b). Spinlocks - make sure you have CONFIG_PVOPS_SPINLOCK enabled.
>      Otherwise you are going to hit dreadful conditions.

I checked the config file and it is enabled:
CONFIG_PARAVIRT_SPINLOCKS=y

The platform we're running has an Intel Xeon E5540 and the X58 chipset.
Here is the kernel configuration associated with the processor. Is
there anything we could tune to improve the performance?

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_SPARSE_IRQ=y
CONFIG_NUMA_IRQ_DESC=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=8
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=7
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=y
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_COMPACTION is not set
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
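
Whether Konrad's point 1 (spinlock contention on the I/O paths) is
actually the bottleneck can be measured rather than guessed at. A
minimal sketch, assuming the domU kernel were rebuilt with
CONFIG_LOCK_STAT=y (it is not in the configuration above): reset the
lock statistics, run the workload, then dump the first part of
/proc/lock_stat.

/* Illustrative only: snapshot kernel lock-contention statistics
 * around an I/O run. Requires a kernel built with CONFIG_LOCK_STAT=y. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/proc/lock_stat", "w");
	char line[512];
	int i;

	if (!f) {
		perror("/proc/lock_stat (CONFIG_LOCK_STAT enabled?)");
		return 1;
	}
	fputs("0\n", f);	/* writing 0 clears the statistics */
	fclose(f);

	sleep(30);		/* run the fio/FC workload in this window */

	f = fopen("/proc/lock_stat", "r");
	if (!f) {
		perror("/proc/lock_stat");
		return 1;
	}
	for (i = 0; i < 40 && fgets(line, sizeof(line), f); i++)
		fputs(line, stdout);	/* print the first part of the report */
	fclose(f);
	return 0;
}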
>  2). You are hitting the 64-bit syscall wall. Basically your user-mode
>      application (fio) is doing a write(), which used to be int 0x80
>      but now is a syscall. The syscall gets trapped in the hypervisor,
>      which has to call into your PV kernel. You get hit with two
>      context switches for each 'write()' call. The solution is to use
>      a 32-bit DomU, as there the guest user application and guest
>      kernel run in different rings.

There is no user-space application involved with the I/O. It's all
kernel driver code that handles the I/O.

>  3). Xen CPU pools. You didn't say where the application that sends
>      the IOs is located. But if it was in a separate domain then you
>      will want to use Xen CPU pools. Basically this way you can get
>      gang-scheduling, where the guest that submits the I/O and the
>      guest that picks up the I/O run right after each other. I don't
>      know many more details, but this is what I understand it does.
>  4). CPU/MSI-X affinity. I think you already did this, but make sure
>      you pin your guest to specific CPUs and also pin the MSI-X
>      (vectors) to the proper destination. You can use 'xm debug-keys i'
>      to see the MSI-X affinity - it is a mask; basically see whether it
>      overlays the CPUs you are running your guest on. Not sure how to
>      actually set the MSI-X affinity ... now that I think about it,
>      Keir or some of the Intel folks might know better.

There are 16 devices (multi-function) that are PCI passed through to
domU. There are 16 VCPUs in domU and all are pinned to individual PCPUs
(24-CPU platform). Each IRQ in domU is affinitized to a CPU. This
strategy has worked well for us with the HVM kernel.
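
For reference, the per-IRQ affinitization described above is, inside
the domU, just a write of a hex CPU mask to /proc/irq/<irq>/smp_affinity.
A small illustrative helper (hypothetical, not the actual tooling we
use):

/* Illustrative only: pin a domU IRQ to one CPU via /proc. */
#include <stdio.h>
#include <stdlib.h>

static int pin_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	/* smp_affinity takes a hex bitmask of CPUs; this assumes cpu < 32,
	 * which holds for our 16-VCPU domU. */
	fprintf(f, "%x\n", 1u << cpu);
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <irq> <cpu>\n", argv[0]);
		return 1;
	}
	return pin_irq_to_cpu(atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
}

This only controls which domU VCPU services the interrupt; the
hypervisor-side MSI-X affinity is what the 'xm debug-keys i' output
below is showing.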
Here's the output of 'xm debug-keys i':

(XEN) IRQ: 67 affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:7a type=PCI-MSI status=00000010 in-flight=0 domain-list=1:127(----),
(XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000200 vec:43 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:126(----),
(XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000400 vec:83 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:125(----),
(XEN) IRQ: 70 affinity:00000000,00000000,00000000,00000800 vec:4b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:124(----),
(XEN) IRQ: 71 affinity:00000000,00000000,00000000,00001000 vec:8b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:123(----),
(XEN) IRQ: 72 affinity:00000000,00000000,00000000,00002000 vec:53 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:122(----),
(XEN) IRQ: 73 affinity:00000000,00000000,00000000,00004000 vec:93 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:121(----),
(XEN) IRQ: 74 affinity:00000000,00000000,00000000,00008000 vec:5b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:120(----),
(XEN) IRQ: 75 affinity:00000000,00000000,00000000,00010000 vec:9b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:119(----),
(XEN) IRQ: 76 affinity:00000000,00000000,00000000,00020000 vec:63 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:118(----),
(XEN) IRQ: 77 affinity:00000000,00000000,00000000,00040000 vec:a3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:117(----),
(XEN) IRQ: 78 affinity:00000000,00000000,00000000,00080000 vec:6b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:116(----),
(XEN) IRQ: 79 affinity:00000000,00000000,00000000,00100000 vec:ab type=PCI-MSI status=00000010 in-flight=0 domain-list=1:115(----),
(XEN) IRQ: 80 affinity:00000000,00000000,00000000,00200000 vec:73 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:114(----),
(XEN) IRQ: 81 affinity:00000000,00000000,00000000,00400000 vec:b3 type=PCI-MSI status=00000010 in-flight=0 domain-list=1:113(----),
(XEN) IRQ: 82 affinity:00000000,00000000,00000000,00800000 vec:7b type=PCI-MSI status=00000010 in-flight=0 domain-list=1:112(----),

>  5). Andrew, Mukesh, Keir, Dan, any other ideas?

We're also working through Chris' list of 4 things to try and will
consider Mathieu's LTT suggestion.

- Dante

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel