From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tim Deegan
Subject: Re: frequently ballooning results in qemu exit
Date: Thu, 21 Mar 2013 12:15:47 +0000
Message-ID: <20130321121547.GB12338@ocelot.phlegethon.org>
References: <5141A8B0.4050305@citrix.com> <20130314143403.GB5174@ocelot.phlegethon.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Hanweidong
Cc: George Dunlap, Andrew Cooper, Yanqiangjun, "xen-devel@lists.xen.org", "Gonglei (Arei)", Anthony PERARD
List-Id: xen-devel@lists.xenproject.org

At 05:54 +0000 on 15 Mar (1363326854), Hanweidong wrote:
> > > I'm also curious about this. There is a window between memory balloon out
> > > and QEMU invalidate mapcache.
> >
> > That by itself is OK; I don't think we need to provide any meaningful
> > semantics if the guest is accessing memory that it's ballooned out.
> >
> > The question is where the SIGBUS comes from: either qemu has a mapping
> > of the old memory, in which case it can write to it safely, or it
> > doesn't, in which case it shouldn't try.
>
> The error always happened at memcpy in if (is_write) branch in
> address_space_rw.

Sure, but _why_?  Why does this access cause SIGBUS?  Presumably there's
some part of the mapcache code that thinks it has a mapping there when
it doesn't.

> We found that, after the last xen_invalidate_map_cache, the mapcache entry related to the failed address was mapped:
> ==xen_map_cache== phys_addr=7a3c1ec0 size=0 lock=0
> ==xen_remap_bucket== begin size=1048576 ,address_index=7a3
> ==xen_remap_bucket== end entry->paddr_index=7a3,entry->vaddr_base=2a2d9000,size=1048576,address_index=7a3

OK, so that's 0x2a2d9000 -- 0x2a3d8fff.
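To make that range easy to re-check against the log, here is a minimal sketch of the arithmetic. It assumes 1 MiB buckets (size=1048576 in the log) and that the bucket index is the guest physical address shifted right by 20 bits (7a3c1ec0 gives address_index=7a3). The helper names are invented for illustration and are not qemu's.

```c
#include <stdint.h>

/* Assumption: 1 MiB buckets, taken from size=1048576 in the log. */
#define BUCKET_SHIFT 20
#define BUCKET_SIZE  (UINT64_C(1) << BUCKET_SHIFT)

/* Index of the bucket that covers a guest physical address. */
static uint64_t bucket_index(uint64_t phys_addr)
{
    return phys_addr >> BUCKET_SHIFT;
}

/* Byte offset of a guest physical address inside its bucket. */
static uint64_t bucket_offset(uint64_t phys_addr)
{
    return phys_addr & (BUCKET_SIZE - 1);
}

/* Host address for phys_addr, given the bucket's vaddr_base. */
static uint64_t host_addr(uint64_t vaddr_base, uint64_t phys_addr)
{
    return vaddr_base + bucket_offset(phys_addr);
}

/* Last host address that belongs to the bucket starting at vaddr_base. */
static uint64_t bucket_last(uint64_t vaddr_base)
{
    return vaddr_base + BUCKET_SIZE - 1;
}
```

With the values from the log, bucket_last(0x2a2d9000) is 0x2a3d8fff and host_addr(0x2a2d9000, 0x7a3c1ec4) is 0x2a39aec4, matching the "first return" line for that access.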
> ==address_space_rw== ptr=2a39aec0
> ==xen_map_cache== phys_addr=7a3c1ec4 size=0 lock=0
> ==xen_map_cache==first return 2a2d9000+c1ec4=2a39aec4
> ==address_space_rw== ptr=2a39aec4
> ==xen_map_cache== phys_addr=7a3c1ec8 size=0 lock=0
> ==xen_map_cache==first return 2a2d9000+c1ec8=2a39aec8
> ==address_space_rw== ptr=2a39aec8
> ==xen_map_cache== phys_addr=7a3c1ecc size=0 lock=0
> ==xen_map_cache==first return 2a2d9000+c1ecc=2a39aecc
> ==address_space_rw== ptr=2a39aecc

These are all to page 0x2a39a___.

> ==xen_map_cache== phys_addr=7a16c108 size=0 lock=0
> ==xen_map_cache== return 2a407000+6c108=2a473108
> ==xen_map_cache== phys_addr=7a16c10c size=0 lock=0
> ==xen_map_cache==first return 2a407000+6c10c=2a47310c
> ==xen_map_cache== phys_addr=7a16c110 size=0 lock=0
> ==xen_map_cache==first return 2a407000+6c110=2a473110
> ==xen_map_cache== phys_addr=7a395000 size=0 lock=0
> ==xen_map_cache== return 2a2d9000+95000=2a36e000
> ==address_space_rw== ptr=2a36e000

And this is to page 0x2a36e___, a different page in the same bucket.

> here, the SIGBUS error occurred.

So that page isn't mapped.  Which means:
 - it was never mapped (and the mapcache code didn't handle the error
   correctly at map time); or
 - it was never mapped (and the mapcache hasn't checked its own records
   before using the map); or
 - it was mapped (and something unmapped it in the meantime).

Why not add some tests in xen_remap_bucket to check that all the pages
that qemu records as mapped are actually there?

Tim.
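The first two possibilities above both come down to bookkeeping, so here is a rough sketch of the kind of per-page record and check I have in mind. It is a standalone model, not the qemu code. It assumes 4 KiB pages, a 1 MiB bucket, and a map call that reports one error code per page with 0 meaning success. The struct and function names are hypothetical, and a size of 0 (as in the log) is treated as a one-byte access purely for the sake of the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT   12    /* assumption: 4 KiB pages */
#define BUCKET_PAGES 256   /* 1 MiB bucket / 4 KiB pages */

/* Hypothetical per-bucket record: one "valid" bit per page. */
struct bucket_record {
    uint8_t valid[BUCKET_PAGES / 8];
};

/*
 * Record which pages mapped successfully, given one error code per page
 * (0 = mapped).  Returns the number of pages that failed, so the caller
 * can log or assert at map time instead of finding out via SIGBUS.
 */
static size_t record_mapping(struct bucket_record *rec,
                             const int *err, size_t nb_pages)
{
    size_t i, failed = 0;

    memset(rec->valid, 0, sizeof(rec->valid));
    for (i = 0; i < nb_pages && i < BUCKET_PAGES; i++) {
        if (err[i] == 0) {
            rec->valid[i / 8] |= (uint8_t)(1u << (i % 8));
        } else {
            failed++;
        }
    }
    return failed;
}

/*
 * Return 1 only if every page touched by [offset, offset + size) is
 * recorded as mapped; 0 otherwise, including when the range runs past
 * the end of the bucket.
 */
static int range_is_mapped(const struct bucket_record *rec,
                           uint64_t offset, uint64_t size)
{
    uint64_t first = offset >> PAGE_SHIFT;
    uint64_t last = (offset + (size ? size - 1 : 0)) >> PAGE_SHIFT;
    uint64_t p;

    if (last >= BUCKET_PAGES) {
        return 0;
    }
    for (p = first; p <= last; p++) {
        if (!(rec->valid[p / 8] & (1u << (p % 8)))) {
            return 0;
        }
    }
    return 1;
}
```

In this model, if page 0x95 of the bucket failed to map, range_is_mapped() rejects offset 0x95000 (the access that hit SIGBUS) while still accepting offset 0xc1ec4. This only covers the first two cases; it cannot detect a page that was mapped and later unmapped behind the record's back.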