From: Julien Grall
Subject: Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
Date: Wed, 05 Jun 2013 17:12:04 +0100
Message-ID: <51AF6354.4090701@linaro.org>
References: <51AE6DFD.3060308@linaro.org> <51AF25A2.5010207@linaro.org>
To: Christoffer Dall
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 06/05/2013 03:30 PM, Christoffer Dall wrote:
> On 5 June 2013 04:48, Julien Grall wrote:
>> On 06/05/2013 02:38 AM, Christoffer Dall wrote:
>>
>>> On 4 June 2013 15:45, Julien Grall wrote:
>>>> Hi all,
>>>>
>>>> Since a couple of week, I'm tracking an issue with Xen on ARM with no luck.
>>>>
>>>> I'm run out of idea, so I send this email to have advice from the community.
>>>>
>>>> Most of the time bash will abort with random error in dom0:
>>>>   - page fault (data and prefetch abort)
>>>>   - memory corruption (malloc corruption and invalid pointer)
>>>>
>>>> It's easily to reproduce by doing ./configure on the xen tree.
>>>>
>>>> My environment is an arndale board:
>>>>   - linux linaro 13.05 (using arndale_xen_dom0_defconfig and exynos5250_arndale.dts)
>>>>   - opensuse 12.03 (http://en.opensuse.org/HCL:Arndale)
>>>>   - xen upstream
>>>>
>>>> The linux tree can be retrieved from git://xenbits.xen.org/people/julieng/linux-arm.git
>>>> using the branch linaro-3.10.
>>>> The previous branch is based on the linaro tree with some patches for the dts and xen.
>>>>
>>>> The issue also occurs on the versatile express. But it's harder to reproduce.
>>>> Here the environment is:
>>>>   - linux linaro 13.05 (using vexpress_xen_dom0_defconfig and vexpress_v2p_ca15_a7.dtb)
>>>>   - ubuntu linaro 13.05
>>>>   - xen upstream
>>>>
>>>> I have tried different distributions and linux version, the issue was the same.
>>>> I made some testing to narrow down the bug and I came to the following test case:
>>>>
>>>> Only dom0 is running and each VCPUs are pinned to a specific cpu
>>>> (vcpu0 -> cpu0 and vcpu1 -> cpu1).
>>>>
>>>> The patch below removes WFI trap and by consequence avoid a VCPU to move to
>>>> another physical CPU.
>>>> =========================================
>>>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>>>> index 6cfba1a..e89ca15 100644
>>>> --- a/xen/arch/arm/traps.c
>>>> +++ b/xen/arch/arm/traps.c
>>>> @@ -62,7 +62,7 @@ void __cpuinit init_traps(void)
>>>>      WRITE_SYSREG((vaddr_t)hyp_traps_vector, VBAR_EL2);
>>>>
>>>>      /* Setup hypervisor traps */
>>>> -    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TWI|HCR_TSC, HCR_EL2);
>>>> +    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TSC, HCR_EL2);
>>>>      isb();
>>>>  }
>>>>
>>>> =========================================
>>>>
>>>> If a bash process is assigned to a specific cpu with taskset, the process seems
>>>> to always run without any issue.
>>>>
>>>> taskset -c 0 ./configure
>>>>
>>>> I guess it's a caching issue, but each time I've tried to play with the caching
>>>> policy Linux was not booting.
>>>>
>>>> Thanks in advance for any advice.
>>>
>>> Some thoughts:
>>>
>>> - Does dom0 run with Stage-2 translation? If so, you should be able
>>> to disable caches in both Hyp mode and for dom0 by manipulating the
>>> hyp registers to try and exclude caches. If Linux doesn't boot under
>>> such configuration, something else is completely broken, as it must be
>>> transparent to your dom0.
>>>
>>> - Are you doing any swapping and/or page reclaiming?
>>> I wouldn't assume so for dom0, but if you are, you need to maintain the
>>> icache properly, since it can be aliasing, see
>>> http://lxr.linux.no/linux+v3.9.4/arch/arm/kvm/mmu.c#L495 (I doubt this
>>> is the case though)
>>>
>>> - All other cache accesses should be coherent across cores and are
>>> physically indexed/physically tagged so I don't see how this could be
>>> your issue.
>>
>> It was only an idea because I have noticed the memory was often corrupted.
>>
>>> - Do you always see the crash in user space or kernel space in dom0 or
>>> is it all over the map?
>>
>>
>> Only in user space in dom0.
>>
> Hmm, which kernel version is dom0 based on? Can you bisect the dom0
> source to make sure it's not something introduced during development.

I'm using Linaro's branch ll_20130528.0. On top of it I only carry a few
patches: some for the dts, and some that are not yet in the Linaro tree.

I see the same issue with Linux 3.9-rc4 with multiple CPUs, and I can't
really go back any further without carrying many Xen patches to try it.

I have tried different configurations of the number of CPUs in Xen (pCPU)
and Linux (vCPU):
    - 2 pCPU 2 vCPU : segfaulting
    - 2 pCPU 1 vCPU : working
    - 1 pCPU 1 vCPU : working
    - 1 pCPU 2 vCPU : very slow but working

> You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
> broken mmap..." ?

Yes.

-- 
Julien
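
For anyone who wants to script the taskset workaround quoted earlier in the thread, here is a minimal sketch of the same pinning done from Python inside dom0. It is not part of the original report. The helper names pin_to_cpu and run_pinned are hypothetical, and it assumes Linux with Python 3.3 or later, where os.sched_getaffinity and os.sched_setaffinity are available. It only restricts the process to one CPU as seen by dom0 (a vCPU) and says nothing about which physical CPU Xen schedules that vCPU on.

```python
import os


def pin_to_cpu(cpu=None):
    """Restrict the calling process to a single CPU, like `taskset -c <cpu>`.

    Hypothetical helper for illustration. If cpu is None, the lowest-numbered
    CPU the process is already allowed to run on is used, so the call also
    works under a restricted cpuset. Returns the CPU that was selected.
    """
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    if cpu is None:
        cpu = min(allowed)
    elif cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return cpu


def run_pinned(argv, cpu=None):
    """Pin this process, then replace it with argv.

    The affinity mask is preserved across exec, so the new program and the
    children it forks (for example the shells spawned by ./configure) start
    with the same single-CPU mask.
    """
    pin_to_cpu(cpu)
    os.execvp(argv[0], argv)
```

Under these assumptions, run_pinned(["./configure"], cpu=0) should behave roughly like `taskset -c 0 ./configure`, which makes it easier to loop the pinned and unpinned cases when comparing the pCPU/vCPU configurations listed above.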