From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sander Eikelenboom
Subject: Re: dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set
Date: Wed, 5 Sep 2012 16:38:48 +0200
Message-ID: <1014998302.20120905163848@eikelenboom.it>
References: <1136369816.20120904183757@eikelenboom.it>
 <20120904163347.GH23361@phenom.dumpdata.com>
 <143844933.20120904191941@eikelenboom.it>
 <1813712325.20120904213459@eikelenboom.it>
 <048EAD622912254A9DEA24C1734613C18C864C3C5D@FTLPMAILBOX02.citrite.net>
 <20120905140600.GA5844@phenom.dumpdata.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20120905140600.GA5844@phenom.dumpdata.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk
Cc: Robert Phillips, Ben Guthro, "xen-devel@lists.xen.org"
List-Id: xen-devel@lists.xenproject.org

Wednesday, September 5, 2012, 4:06:01 PM, you wrote:

> On Tue, Sep 04, 2012 at 04:27:20PM -0400, Robert Phillips wrote:
>> Ben,
>>
>> You have asked me to provide the rationale behind the gnttab_old_mfn patch, which you emailed to Sander earlier today.
>> Here are my findings.
>>
>> I found that xen_blkbk_map() in drivers/block/xen-blkback/blkback.c has changed from our previous version. It now calls gnttab_map_refs() in drivers/xen/grant-table.c.
>>
>> That function first calls HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, ... ) and then calls m2p_add_override() in p2m.c

> And HYPERVISOR_grant_table_op .. would populate map_ops[i].bus_addr with the machine address..

>> which is where I made my change.
>>
>> The unpatched code was saving the pfn's old mfn in kmap_op->dev_bus_addr.
>>
>> kmap_op is of type struct gnttab_map_grant_ref. That data type is used to record grant table mappings so later they can be unmapped correctly.

> Right, but the blkback makes a distinction by passing NULL as kmap_op, which means it should
> use the old mechanism. Meaning that once the hypercall is done, the map_ops[i].bus_addr is not
> used anymore..

>>
>> The problem with saving the old mfn in kmap_op->dev_bus_addr is that it is later overwritten by __gnttab_map_grant_ref() in xen/common/grant_table.c

> Uh, so the problem of saving the old mfn in dev_bus_addr has been there for a long long time then?
> Even before this patch set?

>>
>> Since the storage holding the old mfn got overwritten, the unmapping was being done incorrectly. The balloon code detected that and bugged at drivers/xen/balloon.c:359
>>

> Hmm, I believe the storage for holding the old mfn was/is page->index.

>> My patch simply adds another member called old_mfn to struct gnttab_map_grant_ref rather than trying to overload dev_bus_addr.
>>
>> I don't know if Sander's bug is the same or related. The BUG_ON at drivers/xen/balloon.c:359 is quite general. It simply asserts that we are not trying to re-map a valid mapping.

> Right. Somehow he ends up with valid mappings where there should be none. And lots of them.

The regression is somewhere between kernel v3.4.1 and v3.5.3; I haven't had time to narrow it down yet.
Any suggestions for specific commits I could try, to bisect this one quickly?

>>
>> -- Robert Phillips
>>
>>
>> -----Original Message-----
>> From: Sander Eikelenboom [mailto:linux@eikelenboom.it]
>> Sent: Tuesday, September 04, 2012 3:35 PM
>> To: Ben Guthro
>> Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xen.org; Robert Phillips
>> Subject: Re: [Xen-devel] dom0 linux 3.6.0-rc4, crash due to ballooning althoug dom0_mem=X, max:X set
>>
>>
>> Tuesday, September 4, 2012, 8:07:11 PM, you wrote:
>>
>> > We ran into the same issue, in newer kernels - but had not yet
>> > submitted this fix.
>>
>> > One of the developers here came up with a fix (attached, and CC'ed
>> > here) that fixes an issue where the p2m code reuses a structure member
>> > where it shouldn't.
>> > The patch adds a new "old_mfn" member to the gnttab_map_grant_ref
>> > structure, instead of re-using dev_bus_addr.
>>
>>
>> > If this also works for you, I can re-submit it with a Signed-off-by
>> > line, if you prefer, Konrad.
>>
>> Hi Ben,
>>
>> This patch doesn't work for me:
>>
>> When starting the PV-guest i get:
>>
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (68b69070).
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).
>> (XEN) [2012-09-04 20:31:37] grant_table.c:499:d0 Bad flags in grant map op (0).
>>
>>
>> and from the dom0 kernel:
>>
>> [ 374.425727] BUG: unable to handle kernel paging request at ffff8800fffd9078
>> [ 374.428901] IP: [] gnttab_map_refs+0x14e/0x270
>> [ 374.428901] PGD 1e0c067 PUD 0
>> [ 374.428901] Oops: 0000 [#1] PREEMPT SMP
>> [ 374.428901] Modules linked in:
>> [ 374.428901] CPU 0
>> [ 374.428901] Pid: 4308, comm: qemu-system-i38 Not tainted 3.6.0-rc4-20120830+ #70 System manufacturer System Product Name/P5Q-EM DO
>> [ 374.428901] RIP: e030:[] [] gnttab_map_refs+0x14e/0x270
>> [ 374.428901] RSP: e02b:ffff88002f185ca8 EFLAGS: 00010206
>> [ 374.428901] RAX: ffff880000000000 RBX: ffff88001471cf00 RCX: 00000000fffd9078
>> [ 374.428901] RDX: 0000000000000050 RSI: 40000000000fffd9 RDI: 00003ffffffff000
>> [ 374.428901] RBP: ffff88002f185d08 R08: 0000000000000078 R09: 0000000000000000
>> [ 374.428901] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
>> [ 374.428901] R13: ffff88001471c480 R14: 0000000000000002 R15: 0000000000000002
>> [ 374.428901] FS: 00007f6def9f2740(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
>> [ 374.428901] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 374.428901] CR2: ffff8800fffd9078 CR3: 000000002d30e000 CR4: 0000000000042660
>> [ 374.428901] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 374.428901] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 374.428901] Process qemu-system-i38 (pid: 4308, threadinfo ffff88002f184000, task ffff8800376f1040)
>> [ 374.428901] Stack:
>> [ 374.428901] ffffffffffffffff 0000000000000050 00000000fffd9078 00000000000fffd9
>> [ 374.428901] 0000000001000000 ffff8800382135a0 ffff88002f185d08 ffff880038211960
>> [ 374.428901] ffff88002f11d2c0 0000000000000004 0000000000000003 0000000000000001
>> [ 374.428901] Call Trace:
>> [ 374.428901] [] gntdev_mmap+0x20e/0x520
>> [ 374.428901] [] ? mmap_region+0x312/0x5a0
>> [ 374.428901] [] ? lockdep_trace_alloc+0xa0/0x130
>> [ 374.428901] [] mmap_region+0x3ce/0x5a0
>> [ 374.428901] [] do_mmap_pgoff+0x250/0x350
>> [ 374.428901] [] vm_mmap_pgoff+0x68/0x90
>> [ 374.428901] [] sys_mmap_pgoff+0x152/0x170
>> [ 374.428901] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 374.428901] [] sys_mmap+0x29/0x30
>> [ 374.428901] [] system_call_fastpath+0x16/0x1b
>> [ 374.428901] Code: 0f 84 e7 00 00 00 48 89 f1 48 c1 e1 0c 41 81 e0 ff 0f 00 00 48 b8 00 00 00 00 00 88 ff ff 48 bf 00 f0 ff ff ff 3f 00 00 4c 01 c1 <48> 23 3c 01 48 c1 ef 0c 49 8d 54 15 00 4d 85 ed b8 00 00 00 00
>> [ 374.428901] RIP [] gnttab_map_refs+0x14e/0x270
>> [ 374.428901] RSP
>> [ 374.428901] CR2: ffff8800fffd9078
>> [ 374.428901] ---[ end trace 0e0a5a49f6503c0a ]---
>>
>>
>>
>> > Ben
>>
>>
>> > On Tue, Sep 4, 2012 at 1:19 PM, Sander Eikelenboom wrote:
>> >>
>> >> Tuesday, September 4, 2012, 6:33:47 PM, you wrote:
>> >>
>> >>> On Tue, Sep 04, 2012 at 06:37:57PM +0200, Sander Eikelenboom wrote:
>> >>>> Hi Konrad,
>> >>>>
>> >>>> This seems to happen only on a intel machine i'm trying to setup as a development machine (haven't seen it on my amd).
>> >>>> It boots fine, i have dom0_mem=1024M,max:1024M set, the machine has 2G of mem.
>> >>
>> >>> Is this only with Xen 4.2? As, does Xen 4.1 work?
>> >>>>
>> >>>> Dom0 and guest kernel are 3.6.0-rc4 with config:
>> >>
>> >>> If you back out:
>> >>
>> >>> f393387d160211f60398d58463a7e65
>> >>> Author: Konrad Rzeszutek Wilk
>> >>> Date: Fri Aug 17 16:43:28 2012 -0400
>> >>
>> >>> xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.
>> >>
>> >>> Do you see this bug? (Either with Xen 4.1 or Xen 4.2)?
>> >>
>> >> With c96aae1f7f393387d160211f60398d58463a7e65 reverted i still see this bug (with Xen 4.2).
>> >>
>> >> Will use the debug patch you mailed and send back the results ...
>> >>
>> >>
>> >>>> [*] Xen memory balloon driver
>> >>>> [*] Scrub pages before returning them to system
>> >>>>
>> >>>> From http://wiki.xen.org/wiki/Do%EF%BB%BFm0_Memory_%E2%80%94_Where_It_Has_Not_Gone , I thought this should be okay
>> >>>>
>> >>>> But when trying to start a PV guest with 512MB mem, the machine (dom0) crashes with the stacktrace below (complete serial-log.txt attached).
>> >>>>
>> >>>> From the:
>> >>>> "mapping kernel into physical memory
>> >>>> about to get started..."
>> >>>>
>> >>>> I would almost say it's trying to reload dom0 ?
>> >>>>
>> >>>>
>> >>>> [ 897.161119] device vif1.0 entered promiscuous mode
>> >>>> mapping kernel into physical memory
>> >>>> about to get started...
>> >>>> [ 897.696619] xen_bridge: port 1(vif1.0) entered forwarding state
>> >>>> [ 897.716219] xen_bridge: port 1(vif1.0) entered forwarding state
>> >>>> [ 898.129465] ------------[ cut here ]------------
>> >>>> [ 898.132209] kernel BUG at drivers/xen/balloon.c:359!
>> >>>> [ 898.132209] invalid opcode: 0000 [#1] PREEMPT SMP
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Xen-devel mailing list
>> >> Xen-devel@lists.xen.org
>> >> http://lists.xen.org/xen-devel
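
For readers following the dev_bus_addr discussion quoted above, here is a toy user-space sketch of the failure mode Robert describes: a value stashed in a field that a later map operation also writes to is lost, while a dedicated member survives. This is not kernel or Xen code. struct toy_map_op, toy_hypercall_map() and the two saved_mfn_*() helpers are hypothetical names invented for illustration, the struct layout is not the real struct gnttab_map_grant_ref, and the ordering of the save and the overwrite is simplified. It only illustrates Robert's claim; it does not settle whether that is what happens in the real code, which Konrad questions in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct gnttab_map_grant_ref (NOT the real layout). */
struct toy_map_op {
    uint64_t dev_bus_addr; /* output field, written by the simulated map operation */
    uint64_t old_mfn;      /* dedicated slot, in the spirit of the proposed patch */
};

/* Simulates a map operation filling in dev_bus_addr with a machine address. */
void toy_hypercall_map(struct toy_map_op *op, uint64_t machine_addr)
{
    assert(op != NULL);
    op->dev_bus_addr = machine_addr;
}

/* Overloaded variant: the old mfn is stashed in dev_bus_addr, then a map
 * operation runs and clobbers it. Returns the value left for the unmap path. */
uint64_t saved_mfn_overloaded(uint64_t old_mfn, uint64_t machine_addr)
{
    struct toy_map_op op = {0};

    op.dev_bus_addr = old_mfn;            /* save in the shared field */
    toy_hypercall_map(&op, machine_addr); /* overwrites the saved value */
    return op.dev_bus_addr;
}

/* Dedicated variant: the old mfn lives in its own member, so the map
 * operation cannot touch it. */
uint64_t saved_mfn_dedicated(uint64_t old_mfn, uint64_t machine_addr)
{
    struct toy_map_op op = {0};

    op.old_mfn = old_mfn;                 /* save in the dedicated field */
    toy_hypercall_map(&op, machine_addr); /* only dev_bus_addr changes */
    return op.old_mfn;
}
```

Calling saved_mfn_overloaded(0x1234, 0xabcd000) hands back the machine address rather than the saved mfn, while saved_mfn_dedicated() with the same arguments returns 0x1234. Restoring from the clobbered value is the kind of wrong unmapping that, per Robert's description, the BUG_ON in drivers/xen/balloon.c would then catch.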