Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

From: Vivek Goyal <vgoyal@redhat.com>
To: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtio-fs-list <virtio-fs@redhat.com>,
	vkuznets@redhat.com, pbonzini@redhat.com
Subject: Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Date: Tue, 6 Oct 2020 09:46:29 -0400	[thread overview]
Message-ID: <20201006134629.GB5306@redhat.com> (raw)
In-Reply-To: <20201005161620.GC11938@linux.intel.com>

On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote:
> On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> > Now I have few questions.
> > 
> > - If we exit to user space asynchronously (using kvm request), what debug
> >   information is in there which tells user which address is bad. I admit
> >   that even above trace does not seem to be telling me directly which
> >   address (HVA?) is bad.
> > 
> >   But if I take a crash dump of guest, using above information I should
> >   be able to get to GPA which is problematic. And looking at /proc/iomem
> >   it should also tell which device this memory region is in.
> > 
> >   Also using this crash dump one should be able to walk through virtiofs data
> >   structures and figure out which file and what offset with-in file does
> >   it belong to. Now one can look at filesystem on host and see file got
> >   truncated and it will become obvious it can't be faulted in. And then
> >   one can continue to debug that how did we arrive here.
> > 
> > But if we don't exit to user space synchronously, Only relevant
> > information we seem to have is -EFAULT. Apart from that, how does one
> > figure out what address is bad, or who tried to access it. Or which
> > file/offset does it belong to etc.
> >
> > I agree that problem is not necessarily in guest code. But by exiting
> > synchronously, it gives enough information that one can use crash
> > dump to get to bottom of the issue. If we exit to user space
> > asynchronously, all this information will be lost and it might make
> > it very hard to figure out (if not impossible), what's going on.
> 
> If we want userspace to be able to do something useful, KVM should explicitly
> inform userspace about the error, userspace shouldn't simply assume that
> -EFAULT means a HVA->PFN lookup failed.

I guess that's fine. But for this patch, user space is not doing anything.
Its just printing error -EFAULT and dumping guest state (Same as we do
in case of synchronous fault).

> Userspace also shouldn't have to
> query guest state to handle the error, as that won't work for protected guests
> guests like SEV-ES and TDX.

So qemu would not be able to dump vcpu register state when kvm returns
with -EFAULT for the case of SEV-ES and TDX?

> 
> I can think of two options:
> 
>   1. Send a signal, a la kvm_send_hwpoison_signal().

This works because -EHWPOISON is a special kind of error which is
different from -EFAULT. For truncation, even kvm gets -EFAULT.

        if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
        if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                return -EFAULT;

Anyway, if -EFAULT is too generic, and we need something finer grained,
that can be looked into when we actually have a method where kvm/qemu
injects error into guest.

> 
>   2. Add a userspace exit reason along with a new entry in the run struct,
>      e.g. that provides the bad GPA, HVA, possibly permissions, etc...

This sounds more reasonable to me. That is kvm gives additional
information to qemu about failing HVA and GPA with -EFAULT and that
can be helpful in debugging a problem. This seems like an extension
of KVM API.

Even with this, if we want to figure out which file got truncated, we
will need to take a dump of guest and try to figure out which file
this GPA is currently mapping(By looking at virtiofs data structures).
And that becomes little easier if vcpu is running the task which 
accessed that GPA. Anyway, if we have failing GPA, I think it should
be possible to figure out inode even without accessing task being
current on vcpu.

So we seem to have 3 options.

A. Just exit to user space with -EFAULT (using kvm request) and don't
   wait for the accessing task to run on vcpu again. 

B. Store error gfn in an hash and exit to user space when task accessing
   gfn runs again.

C. Extend KVM API and report failing HVA/GFN access by guest. And that
   should allow not having to exit to user space synchronously.

Thanks
Vivek