From: Ian Campbell <Ian.Campbell@eu.citrix.com>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Keir Fraser <keir@xen.org>,
	Anthony Wright <anthony@overnetdata.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Todd Deshane <todd.deshane@xen.org>
Subject: Re: Kernel bug from 3.0 (was phy disks and vifs timing out in DomU)
Date: Sat, 3 Sep 2011 11:27:19 +0100
Message-ID: <1315045639.19389.1683.camel@dagon.hellion.org.uk>
In-Reply-To: <4E613BEF.30908@goop.org>

On Fri, 2011-09-02 at 21:26 +0100, Jeremy Fitzhardinge wrote:
> On 09/02/2011 12:17 AM, Ian Campbell wrote:
> > On Thu, 2011-09-01 at 21:34 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/01/2011 12:21 PM, Ian Campbell wrote:
> >>> On Thu, 2011-09-01 at 18:32 +0100, Jeremy Fitzhardinge wrote:
> >>>> On 09/01/2011 12:42 AM, Ian Campbell wrote:
> >>>>> On Wed, 2011-08-31 at 18:07 +0100, Konrad Rzeszutek Wilk wrote:
> >>>>>> On Wed, Aug 31, 2011 at 05:58:43PM +0100, David Vrabel wrote:
> >>>>>>> On 26/08/11 15:44, Konrad Rzeszutek Wilk wrote:
> >>>>>>>> So while I am still looking at the hypervisor code to figure out why
> >>>>>>>> it would give me [when trying to map a grant page]:
> >>>>>>>>
> >>>>>>>> (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
> >>>>>>> It is failing in guest_map_l1e() because the page for the vmalloc'd
> >>>>>>> virtual address PTEs is not present.
> >>>>>>>
> >>>>>>> The test that fails is:
> >>>>>>>
> >>>>>>> (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT
> >>>>>>>
> >>>>>>> I think this is because the GNTTABOP_map_grant_ref hypercall is done
> >>>>>>> when task->active_mm != &init_mm, and alloc_vm_area() only adds PTEs
> >>>>>>> into init_mm, so when Xen looks in the page tables it doesn't find
> >>>>>>> the entries because they're not there yet.
> >>>>>>>
> >>>>>>> Putting a call to vmalloc_sync_all() after alloc_vm_area() and before
> >>>>>>> the hypercall makes it work for me.  Classic Xen kernels used to have
> >>>>>>> such a call.
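
For reference, the fix being described amounts to roughly the following
-- a sketch against the 3.0-era APIs, not the literal patch, with
gnt_ref and remote_domid standing in for the caller's actual arguments:

    struct gnttab_map_grant_ref op;
    struct vm_struct *area = alloc_vm_area(PAGE_SIZE);

    if (!area)
            return -ENOMEM;

    /*
     * alloc_vm_area() instantiates the PTEs in init_mm only; sync
     * them into all pagetables so Xen can find the L1 entry even if
     * active_mm != &init_mm when the hypercall is made.
     */
    vmalloc_sync_all();

    gnttab_set_map_op(&op, (unsigned long)area->addr,
                      GNTMAP_host_map, gnt_ref, remote_domid);
    if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
            BUG();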
> >>>>>> That sounds quite reasonable.
> >>>>> I was wondering why upstream was missing the vmalloc_sync_all() in
> >>>>> alloc_vm_area() since the out-of-tree kernels did have it and the
> >>>>> function was added by us. I found this:
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ef691947d8a3d479e67652312783aedcf629320a
> >>>>>
> >>>>> commit ef691947d8a3d479e67652312783aedcf629320a
> >>>>> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> >>>>> Date:   Wed Dec 1 15:45:48 2010 -0800
> >>>>>
> >>>>>     vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
> >>>>>     
> >>>>>     There's no need for it: it will get faulted into the current pagetable
> >>>>>     as needed.
> >>>>>     
> >>>>>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> >>>>>
> >>>>> The flaw in the reasoning here is that you cannot take a kernel fault
> >>>>> while processing a hypercall, so hypercall arguments must have been
> >>>>> faulted in beforehand and that is what the sync_all was for.
> >>>> That's a good point.  (Maybe Xen should have generated pagefaults when
> >>>> hypercall arg pointers are bad...)
> >>> I think it would be a bit tricky to do in practice: you'd either have to
> >>> support recursive hypercalls in the middle of other hypercalls (because
> >>> the page fault handler is surely going to want to do some) or proper
> >>> hypercall restart (so you can fully return to guest context to handle
> >>> the fault and then retry), or something along those lines, complicating
> >>> the hypervisor one way or another. Probably not impossible if you were
> >>> building something from the ground up, but not trivial.
> >> Well, Xen already has the continuation machinery for dealing with
> >> hypercall restart, so that could be reused.
> > That requires special support beyond just calling the continuation in
> > each hypercall (often extending into the ABI) for pickling progress and
> > picking it up again; only a small number of (usually long-running)
> > hypercalls have that support today. It also uses the guest context to
> > store the state, which perhaps isn't helpful if you want to return to
> > the guest, although I suppose building a nested frame would work.
> 
> I guess it depends on how many hypercalls do work before touching guest
> memory, but any hypercall should be like that anyway, or at least be
> able to wind back work done if a later read EFAULTs.
> 
> I was vaguely speculating about a scheme on the lines of:
> 
>  1. In copy_to/from_user, if we touch a bad address, save it in a
>     per-vcpu "bad_guest_addr"
>  2. when returning to the guest, if the errno is EFAULT and
>     bad_guest_addr is set, then generate a memory fault frame with cr2 =
>     bad_guest_addr, and with the exception return restarting the hypercall
> 
> Perhaps there should be an EFAULT_RETRY error return to trigger this
> behaviour, rather than doing it for all EFAULTs, so the faulting
> behaviour can be added incrementally.
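
Something like this on the exit-to-guest path, I suppose (entirely
illustrative -- none of these names exist in Xen today):

    /* rc is what the hypercall body returned. */
    if (rc == -EFAULT_RETRY && v->arch.bad_guest_addr) {
            /* Inject #PF so the guest faults the page in... */
            propagate_page_fault(v->arch.bad_guest_addr, error_code);
            /* ...and rewind the IP so that returning from the
             * guest's fault handler re-issues the hypercall. */
            regs->rip -= HYPERCALL_INSN_SIZE;
            v->arch.bad_guest_addr = 0;
    }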

The kernel uses -ERESTARTSYS for something similar, doesn't it?
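In the kernel that pattern is just

    /* Linux driver pattern (not Xen): give up when a signal is
     * pending; signal delivery then either rewinds the user IP to
     * restart the syscall or turns this into -EINTR, depending on
     * SA_RESTART. */
    if (signal_pending(current))
            return -ERESTARTSYS;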

Does this scheme work if the hypercall causing the exception was itself
running in an exception handler? I guess it depends on the architecture
and OS's handling of nested faults.

> Maybe this is a lost cause for x86, but perhaps it's worth considering
> for new ports?

Certainly worth thinking about.

> > The guys doing paging and sharing etc looked into this and came to the
> > conclusion that it would be intractably difficult to do this fully --
> > hence we now have the ability to sleep in hypercalls, which works
> > because the pager/sharer is in a different domain/vcpu.
> 
> Hmm.  Were they looking at injecting faults back into the guest, or
> forwarding "missing page" events off to another domain?

Sharing and swapping are transparent to the domain; another domain runs
the swapper/unshare process (actually, unshare might be in the h/v
itself, not sure).

> >>   And accesses to guest
> >> memory are already special events which must be checked so that EFAULT
> >> can be returned.  If, rather than failing with EFAULT Xen set up a
> >> pagefault exception for the guest CPU with the return set up to retry
> >> the hypercall, it should all work...
> >>
> >> Of course, if the guest isn't expecting that - or it's buggy - then it
> >> could end up in an infinite loop.  But maybe a flag (set a high bit in
> >> the hypercall number?), or a feature, or something?  Might be worthwhile
> >> if it saves guests having to do something expensive (like a
> >> vmalloc_sync_all), even if they have to also deal with old hypervisors.
> > The vmalloc_sync_all is a pretty event even on Xen though, isn't it?
> 
> Looks like an important word is missing there.  But it's very expensive,
> if that's what you're saying.

Oops. "rare" was the missing word.

> 
>     J
