From: Ian Campbell <Ian.Campbell@eu.citrix.com>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Keir Fraser <keir@xen.org>,
	Anthony Wright <anthony@overnetdata.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Todd Deshane <todd.deshane@xen.org>
Subject: Re: Kernel bug from 3.0 (was phy disks and vifs timing out in DomU)
Date: Sat, 3 Sep 2011 11:27:19 +0100
Message-ID: <1315045639.19389.1683.camel@dagon.hellion.org.uk>
In-Reply-To: <4E613BEF.30908@goop.org>

On Fri, 2011-09-02 at 21:26 +0100, Jeremy Fitzhardinge wrote:
> On 09/02/2011 12:17 AM, Ian Campbell wrote:
> > On Thu, 2011-09-01 at 21:34 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/01/2011 12:21 PM, Ian Campbell wrote:
> >>> On Thu, 2011-09-01 at 18:32 +0100, Jeremy Fitzhardinge wrote:
> >>>> On 09/01/2011 12:42 AM, Ian Campbell wrote:
> >>>>> On Wed, 2011-08-31 at 18:07 +0100, Konrad Rzeszutek Wilk wrote:
> >>>>>> On Wed, Aug 31, 2011 at 05:58:43PM +0100, David Vrabel wrote:
> >>>>>>> On 26/08/11 15:44, Konrad Rzeszutek Wilk wrote:
> >>>>>>>> So while I am still looking at the hypervisor code to figure out why
> >>>>>>>> it would give me [when trying to map a grant page]:
> >>>>>>>>
> >>>>>>>> (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
> >>>>>>> It is failing in guest_map_l1e() because the page for the vmalloc'd
> >>>>>>> virtual address PTEs is not present.
> >>>>>>>
> >>>>>>> The test that fails is:
> >>>>>>>
> >>>>>>> (l2e_get_flags(l2e) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT
> >>>>>>>
> >>>>>>> I think this is because the GNTTABOP_map_grant_ref hypercall is done
> >>>>>>> when task->active_mm != &init_mm, and alloc_vm_area() only adds PTEs
> >>>>>>> into init_mm, so when Xen looks in the page tables it doesn't find
> >>>>>>> the entries because they're not there yet.
> >>>>>>>
> >>>>>>> Putting a call to vmalloc_sync_all() after alloc_vm_area() and before
> >>>>>>> the hypercall makes it work for me.  Classic Xen kernels used to have
> >>>>>>> such a call.
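
For reference, the fix being described amounts to roughly the following
-- a sketch against the 3.0-era APIs, not the literal patch, with
gnt_ref and remote_domid standing in for the caller's actual arguments:

    struct gnttab_map_grant_ref op;
    struct vm_struct *area = alloc_vm_area(PAGE_SIZE);

    if (!area)
            return -ENOMEM;

    /*
     * alloc_vm_area() instantiates the PTEs in init_mm only; sync
     * them into all pagetables so Xen can find the L1 entry even if
     * active_mm != &init_mm when the hypercall is made.
     */
    vmalloc_sync_all();

    gnttab_set_map_op(&op, (unsigned long)area->addr,
                      GNTMAP_host_map, gnt_ref, remote_domid);
    if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
            BUG();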
> >>>>>> That sounds quite reasonable.
> >>>>> I was wondering why upstream was missing the vmalloc_sync_all() in
> >>>>> alloc_vm_area() since the out-of-tree kernels did have it and the
> >>>>> function was added by us. I found this:
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=ef691947d8a3d479e67652312783aedcf629320a
> >>>>>
> >>>>> commit ef691947d8a3d479e67652312783aedcf629320a
> >>>>> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> >>>>> Date:   Wed Dec 1 15:45:48 2010 -0800
> >>>>>
> >>>>>     vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
> >>>>>     
> >>>>>     There's no need for it: it will get faulted into the current pagetable
> >>>>>     as needed.
> >>>>>     
> >>>>>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> >>>>>
> >>>>> The flaw in the reasoning here is that you cannot take a kernel fault
> >>>>> while processing a hypercall, so hypercall arguments must have been
> >>>>> faulted in beforehand and that is what the sync_all was for.
> >>>> That's a good point.  (Maybe Xen should have generated pagefaults when
> >>>> hypercall arg pointers are bad...)
> >>> I think it would be a bit tricky to do in practice: you'd either have to
> >>> support recursive hypercalls in the middle of other hypercalls (because
> >>> the page fault handler is surely going to want to do some) or proper
> >>> hypercall restart (so you can fully return to guest context to handle
> >>> the fault and then retry), or something along those lines, complicating
> >>> the hypervisor one way or another. Probably not impossible if you were
> >>> building something from the ground up, but not trivial.
> >> Well, Xen already has the continuation machinery for dealing with
> >> hypercall restart, so that could be reused.
> > That requires special support beyond just calling the continuation in
> > each hypercall (often extending into the ABI) for pickling progress and
> > picking it up again; only a small number of (usually long-running)
> > hypercalls have that support today. It also uses the guest context to
> > store the state, which perhaps isn't helpful if you want to return to
> > the guest, although I suppose building a nested frame would work.
> 
> I guess it depends on how many hypercalls do work before touching guest
> memory, but any hypercall should be like that anyway, or at least be
> able to wind back work done if a later read EFAULTs.
> 
> I was vaguely speculating about a scheme on the lines of:
> 
>  1. In copy_to/from_user, if we touch a bad address, save it in a
>     per-vcpu "bad_guest_addr"
>  2. when returning to the guest, if the errno is EFAULT and
>     bad_guest_addr is set, then generate a memory fault frame with cr2 =
>     bad_guest_addr, and with the exception return restarting the hypercall
> 
> Perhaps there should be an EFAULT_RETRY error return to trigger this
> behaviour, rather than doing it for all EFAULTs, so the faulting
> behaviour can be added incrementally.
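
Something like this on the exit-to-guest path, I suppose (entirely
illustrative -- none of these names exist in Xen today):

    /* rc is what the hypercall body returned. */
    if (rc == -EFAULT_RETRY && v->arch.bad_guest_addr) {
            /* Inject #PF so the guest faults the page in... */
            propagate_page_fault(v->arch.bad_guest_addr, error_code);
            /* ...and rewind the IP so that returning from the
             * guest's fault handler re-issues the hypercall. */
            regs->rip -= HYPERCALL_INSN_SIZE;
            v->arch.bad_guest_addr = 0;
    }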

The kernel uses -ERESTARTSYS for something similar, doesn't it?
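In the kernel that pattern is just

    /* Linux driver pattern (not Xen): give up when a signal is
     * pending; signal delivery then either rewinds the user IP to
     * restart the syscall or turns this into -EINTR, depending on
     * SA_RESTART. */
    if (signal_pending(current))
            return -ERESTARTSYS;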

Does this scheme work if the hypercall causing the exception was itself
running in an exception handler? I guess it depends on the architecture
and OS's handling of nested faults.

> Maybe this is a lost cause for x86, but perhaps it's worth considering
> for new ports?

Certainly worth thinking about.

> > The guys doing paging and sharing etc looked into this and came to the
> > conclusion that it would be intractably difficult to do this fully --
> > hence we now have the ability to sleep in hypercalls, which works
> > because the pager/sharer is in a different domain/vcpu.
> 
> Hmm.  Were they looking at injecting faults back into the guest, or
> forwarding "missing page" events off to another domain?

Sharing and swapping are transparent to the domain; another domain runs
the swapper/unshare process (actually, unshare might be in the h/v
itself, not sure).

> >>   And accesses to guest
> >> memory are already special events which must be checked so that EFAULT
> >> can be returned.  If, rather than failing with EFAULT Xen set up a
> >> pagefault exception for the guest CPU with the return set up to retry
> >> the hypercall, it should all work...
> >>
> >> Of course, if the guest isn't expecting that - or it's buggy - then it
> >> could end up in an infinite loop.  But maybe a flag (set a high bit in
> >> the hypercall number?), or a feature, or something?  Might be worthwhile
> >> if it saves guests having to do something expensive (like a
> >> vmalloc_sync_all), even if they have to also deal with old hypervisors.
> > The vmalloc_sync_all is a pretty event even on Xen though, isn't it?
> 
> Looks like an important word is missing there.  But it's very expensive,
> if that's what you're saying.

Oops. "rare" was the missing word.

> 
>     J
