xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [VMI] Possible race-condition in altp2m APIs
@ 2019-05-06 16:18 Mathieu Tarral
  2019-05-06 16:18 ` [Xen-devel] " Mathieu Tarral
  2019-05-06 17:07 ` Andrew Cooper
  0 siblings, 2 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-06 16:18 UTC (permalink / raw)
  To: xen-devel

Hi,

I would like to submit a strange bug that i'm facing while using DRAKVUF to
monitor applications from the hypervisor.

I wanted to evaluate DRAKVUF's robustness, so I built a test suite, and began
by executing reg.exe via shellexec injection, having the execution tracked by the procmon plugin.

I quickly realized that sometimes applications were crashing in the guest, with different types of weird errors:
- memory cannot be written
- invalid opcode
- unknown software exception (its a Windows message, not sure what type of processor exception is behind this)

And more than that, I had lots of BSODs, in different places in the kernel.

So heavy monitoring with DRAKVUF tends to make the guest unstable.

It's important to emphasize that the more VCPUs you have, the more likely the bug will be triggered.

For example, injecting on Windows with 1 VCPU, i was able to go through 5000 successives injections.
Using 4 VCPUs on the other hand, it would crash around ~50th injection.

My first suspicion was on DRAKVUF's custom injector, which hijacks the process control flow,
and could have corrupted the guest memory.

This is the most invasive method to start a process in the guest, so it was a good candidate.

But last week, I replaced this injector by opening the WinRM service, and starting the remote process
via Ansible win_command module.

Unfortunately, the result was the same, the BSODs and appcrashes are still here.

Which means that DRAKVUF, simply by calling the altp2m APIs and injecting stealth breakpoints,
could somehow make the guest execute code in a page that would either be non present
(I had PAGE_FAULT_IN_NONPAGED_AERA BSODs), or corrupted, which would explain
the invalid opcode/access_violation errors.

You can find my extensive bug reports and comments on the following Github issues:
- [Injection BSOD on W7x64](https://github.com/tklengyel/drakvuf/issues/576)
- [BSOD when injecting on Windows 10 protected by KPTI ](https://github.com/tklengyel/drakvuf/issues/622)

The latest proof I have of this effect is the following analysis of a Win10 BSOD:
https://gist.github.com/mtarral/f593e50d1d68b5a1071d8bc42affd542

(Please note that KPTI was manually disabled, because it would crash guest quite quickly under monitoring, but that's another issue.)

I managed to get a page containing 2 successive `int 3` (previously injected by DRAKVUF), in a location that I just wasn't monitoring.

That's why I think that DRAKVUF is not responsible of this behavior.

I'm using only 3 plugins:
- procmon
  - NtCreateUserProcess
  - NtTerminateProcess
  - NtOpenProcess
  - NtProtectVirtualMemory
- bsodmon
  - KeBugCheck
- crashmon
  - CR3 load

As altp2m seems like a really complicated to implement (EPT manipulation, CoW, ...),
I suspect that there is a possible race condition that lies in there, which would trigger this bug.

I would like your opinions on the matter, how I can investigate this,
and ultimately debug it, with your help of course.

Thank you,
Mathieu Tarral

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-06 16:18 [VMI] Possible race-condition in altp2m APIs Mathieu Tarral
@ 2019-05-06 16:18 ` Mathieu Tarral
  2019-05-06 17:07 ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-06 16:18 UTC (permalink / raw)
  To: xen-devel

Hi,

I would like to submit a strange bug that i'm facing while using DRAKVUF to
monitor applications from the hypervisor.

I wanted to evaluate DRAKVUF's robustness, so I built a test suite, and began
by executing reg.exe via shellexec injection, having the execution tracked by the procmon plugin.

I quickly realized that sometimes applications were crashing in the guest, with different types of weird errors:
- memory cannot be written
- invalid opcode
- unknown software exception (its a Windows message, not sure what type of processor exception is behind this)

And more than that, I had lots of BSODs, in different places in the kernel.

So heavy monitoring with DRAKVUF tends to make the guest unstable.

It's important to emphasize that the more VCPUs you have, the more likely the bug will be triggered.

For example, injecting on Windows with 1 VCPU, i was able to go through 5000 successives injections.
Using 4 VCPUs on the other hand, it would crash around ~50th injection.

My first suspicion was on DRAKVUF's custom injector, which hijacks the process control flow,
and could have corrupted the guest memory.

This is the most invasive method to start a process in the guest, so it was a good candidate.

But last week, I replaced this injector by opening the WinRM service, and starting the remote process
via Ansible win_command module.

Unfortunately, the result was the same, the BSODs and appcrashes are still here.

Which means that DRAKVUF, simply by calling the altp2m APIs and injecting stealth breakpoints,
could somehow make the guest execute code in a page that would either be non present
(I had PAGE_FAULT_IN_NONPAGED_AERA BSODs), or corrupted, which would explain
the invalid opcode/access_violation errors.

You can find my extensive bug reports and comments on the following Github issues:
- [Injection BSOD on W7x64](https://github.com/tklengyel/drakvuf/issues/576)
- [BSOD when injecting on Windows 10 protected by KPTI ](https://github.com/tklengyel/drakvuf/issues/622)

The latest proof I have of this effect is the following analysis of a Win10 BSOD:
https://gist.github.com/mtarral/f593e50d1d68b5a1071d8bc42affd542

(Please note that KPTI was manually disabled, because it would crash guest quite quickly under monitoring, but that's another issue.)

I managed to get a page containing 2 successive `int 3` (previously injected by DRAKVUF), in a location that I just wasn't monitoring.

That's why I think that DRAKVUF is not responsible of this behavior.

I'm using only 3 plugins:
- procmon
  - NtCreateUserProcess
  - NtTerminateProcess
  - NtOpenProcess
  - NtProtectVirtualMemory
- bsodmon
  - KeBugCheck
- crashmon
  - CR3 load

As altp2m seems like a really complicated to implement (EPT manipulation, CoW, ...),
I suspect that there is a possible race condition that lies in there, which would trigger this bug.

I would like your opinions on the matter, how I can investigate this,
and ultimately debug it, with your help of course.

Thank you,
Mathieu Tarral

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-06 16:18 [VMI] Possible race-condition in altp2m APIs Mathieu Tarral
  2019-05-06 16:18 ` [Xen-devel] " Mathieu Tarral
@ 2019-05-06 17:07 ` Andrew Cooper
  2019-05-06 17:07   ` [Xen-devel] " Andrew Cooper
                     ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-06 17:07 UTC (permalink / raw)
  To: Mathieu Tarral, xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 5033 bytes --]

On 06/05/2019 17:18, Mathieu Tarral wrote:
> Hi,
>
> I would like to submit a strange bug that i'm facing while using DRAKVUF to
> monitor applications from the hypervisor.
>
> I wanted to evaluate DRAKVUF's robustness, so I built a test suite, and began
> by executing reg.exe via shellexec injection, having the execution tracked by the procmon plugin.
>
> I quickly realized that sometimes applications were crashing in the guest, with different types of weird errors:
> - memory cannot be written
> - invalid opcode
> - unknown software exception (its a Windows message, not sure what type of processor exception is behind this)
>
> And more than that, I had lots of BSODs, in different places in the kernel.
>
> So heavy monitoring with DRAKVUF tends to make the guest unstable.
>
> It's important to emphasize that the more VCPUs you have, the more likely the bug will be triggered.
>
> For example, injecting on Windows with 1 VCPU, i was able to go through 5000 successives injections.
> Using 4 VCPUs on the other hand, it would crash around ~50th injection.
>
> My first suspicion was on DRAKVUF's custom injector, which hijacks the process control flow,
> and could have corrupted the guest memory.
>
> This is the most invasive method to start a process in the guest, so it was a good candidate.
>
> But last week, I replaced this injector by opening the WinRM service, and starting the remote process
> via Ansible win_command module.
>
> Unfortunately, the result was the same, the BSODs and appcrashes are still here.
>
> Which means that DRAKVUF, simply by calling the altp2m APIs and injecting stealth breakpoints,
> could somehow make the guest execute code in a page that would either be non present
> (I had PAGE_FAULT_IN_NONPAGED_AERA BSODs), or corrupted, which would explain
> the invalid opcode/access_violation errors.
>
> You can find my extensive bug reports and comments on the following Github issues:
> - [Injection BSOD on W7x64](https://github.com/tklengyel/drakvuf/issues/576)
> - [BSOD when injecting on Windows 10 protected by KPTI ](https://github.com/tklengyel/drakvuf/issues/622)
>
> The latest proof I have of this effect is the following analysis of a Win10 BSOD:
> https://gist.github.com/mtarral/f593e50d1d68b5a1071d8bc42affd542
>
> (Please note that KPTI was manually disabled, because it would crash guest quite quickly under monitoring, but that's another issue.)
>
> I managed to get a page containing 2 successive `int 3` (previously injected by DRAKVUF), in a location that I just wasn't monitoring.
>
> That's why I think that DRAKVUF is not responsible of this behavior.
>
> I'm using only 3 plugins:
> - procmon
>   - NtCreateUserProcess
>   - NtTerminateProcess
>   - NtOpenProcess
>   - NtProtectVirtualMemory
> - bsodmon
>   - KeBugCheck
> - crashmon
>   - CR3 load
>
> As altp2m seems like a really complicated to implement (EPT manipulation, CoW, ...),
> I suspect that there is a possible race condition that lies in there, which would trigger this bug.
>
> I would like your opinions on the matter, how I can investigate this,
> and ultimately debug it, with your help of course.

There is a lot in here.

As for your BSOD analysis, the first thing to be aware of is that Double
Fault is not necessarily precise, which means you can't necessarily
trust any of the registers.  That said, most double faults are precise
in practice, so if you're seeing it reliably at the same place, then it
is likely to be a precise example.

Your faulting address isn't the immediately after the pagetable switch. 
It is one instruction further on, after the stack switch, which means at
the very minimum that reading the new rsp out of the per-processor
storage succeeded.

The stack switch, combined with `push $0x2b` faulting is a clear sign
that the stack is bad.  As the stack pointer looks plausible, it is
almost certainly the pagewalk from %rsp which is bad.  Judging by the
Windbg guide, you want to use !pte to dump the pagewalk (but I have
never used it in anger before).

How exactly does DRAKVUF go about injecting silent breakpoints?  It
obviously has to allocate a new gfn from somewhere to begin with.  Do
the bifurcated frames end up in two different altp2ms, or one in the
host p2m and one in an alternative?  Does #VE ever get used?

Given how many EPT flushing bugs I've already found in this area, I
wouldn't be surprised if there are further ones lurking.  If it is an
EPT flushing bug, this delta should make it go away, but it will come
with a hefty perf hit.

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 283eb7b..019333d 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
             }
         }
 
-        if ( inv )
-            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
-                     inv == 1 ? single->eptp          : 0);
+        __invept(INVEPT_ALL_CONTEXT, 0);
     }
 
  out:

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 5854 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-06 17:07 ` Andrew Cooper
@ 2019-05-06 17:07   ` Andrew Cooper
  2019-05-06 17:41   ` Tamas K Lengyel
  2019-05-07 12:01   ` Mathieu Tarral
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-06 17:07 UTC (permalink / raw)
  To: Mathieu Tarral, xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 5033 bytes --]

On 06/05/2019 17:18, Mathieu Tarral wrote:
> Hi,
>
> I would like to submit a strange bug that i'm facing while using DRAKVUF to
> monitor applications from the hypervisor.
>
> I wanted to evaluate DRAKVUF's robustness, so I built a test suite, and began
> by executing reg.exe via shellexec injection, having the execution tracked by the procmon plugin.
>
> I quickly realized that sometimes applications were crashing in the guest, with different types of weird errors:
> - memory cannot be written
> - invalid opcode
> - unknown software exception (its a Windows message, not sure what type of processor exception is behind this)
>
> And more than that, I had lots of BSODs, in different places in the kernel.
>
> So heavy monitoring with DRAKVUF tends to make the guest unstable.
>
> It's important to emphasize that the more VCPUs you have, the more likely the bug will be triggered.
>
> For example, injecting on Windows with 1 VCPU, i was able to go through 5000 successives injections.
> Using 4 VCPUs on the other hand, it would crash around ~50th injection.
>
> My first suspicion was on DRAKVUF's custom injector, which hijacks the process control flow,
> and could have corrupted the guest memory.
>
> This is the most invasive method to start a process in the guest, so it was a good candidate.
>
> But last week, I replaced this injector by opening the WinRM service, and starting the remote process
> via Ansible win_command module.
>
> Unfortunately, the result was the same, the BSODs and appcrashes are still here.
>
> Which means that DRAKVUF, simply by calling the altp2m APIs and injecting stealth breakpoints,
> could somehow make the guest execute code in a page that would either be non present
> (I had PAGE_FAULT_IN_NONPAGED_AERA BSODs), or corrupted, which would explain
> the invalid opcode/access_violation errors.
>
> You can find my extensive bug reports and comments on the following Github issues:
> - [Injection BSOD on W7x64](https://github.com/tklengyel/drakvuf/issues/576)
> - [BSOD when injecting on Windows 10 protected by KPTI ](https://github.com/tklengyel/drakvuf/issues/622)
>
> The latest proof I have of this effect is the following analysis of a Win10 BSOD:
> https://gist.github.com/mtarral/f593e50d1d68b5a1071d8bc42affd542
>
> (Please note that KPTI was manually disabled, because it would crash guest quite quickly under monitoring, but that's another issue.)
>
> I managed to get a page containing 2 successive `int 3` (previously injected by DRAKVUF), in a location that I just wasn't monitoring.
>
> That's why I think that DRAKVUF is not responsible of this behavior.
>
> I'm using only 3 plugins:
> - procmon
>   - NtCreateUserProcess
>   - NtTerminateProcess
>   - NtOpenProcess
>   - NtProtectVirtualMemory
> - bsodmon
>   - KeBugCheck
> - crashmon
>   - CR3 load
>
> As altp2m seems like a really complicated to implement (EPT manipulation, CoW, ...),
> I suspect that there is a possible race condition that lies in there, which would trigger this bug.
>
> I would like your opinions on the matter, how I can investigate this,
> and ultimately debug it, with your help of course.

There is a lot in here.

As for your BSOD analysis, the first thing to be aware of is that Double
Fault is not necessarily precise, which means you can't necessarily
trust any of the registers.  That said, most double faults are precise
in practice, so if you're seeing it reliably at the same place, then it
is likely to be a precise example.

Your faulting address isn't the immediately after the pagetable switch. 
It is one instruction further on, after the stack switch, which means at
the very minimum that reading the new rsp out of the per-processor
storage succeeded.

The stack switch, combined with `push $0x2b` faulting is a clear sign
that the stack is bad.  As the stack pointer looks plausible, it is
almost certainly the pagewalk from %rsp which is bad.  Judging by the
Windbg guide, you want to use !pte to dump the pagewalk (but I have
never used it in anger before).

How exactly does DRAKVUF go about injecting silent breakpoints?  It
obviously has to allocate a new gfn from somewhere to begin with.  Do
the bifurcated frames end up in two different altp2ms, or one in the
host p2m and one in an alternative?  Does #VE ever get used?

Given how many EPT flushing bugs I've already found in this area, I
wouldn't be surprised if there are further ones lurking.  If it is an
EPT flushing bug, this delta should make it go away, but it will come
with a hefty perf hit.

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 283eb7b..019333d 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
             }
         }
 
-        if ( inv )
-            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
-                     inv == 1 ? single->eptp          : 0);
+        __invept(INVEPT_ALL_CONTEXT, 0);
     }
 
  out:

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 5854 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-06 17:07 ` Andrew Cooper
  2019-05-06 17:07   ` [Xen-devel] " Andrew Cooper
@ 2019-05-06 17:41   ` Tamas K Lengyel
  2019-05-06 17:41     ` [Xen-devel] " Tamas K Lengyel
  2019-05-06 18:30     ` Andrew Cooper
  2019-05-07 12:01   ` Mathieu Tarral
  2 siblings, 2 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-06 17:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

Hi Andrew,
thanks for helping brainstorming on this.

> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?

I've posted a blog entry about it a while ago, it's still accurate:
https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.

You can't add new frames to only some of the altp2m's - at least not
with the current interfaces. All the shadow pages are added to the
hostp2m and then in the altp2m the GFN is remapped to the mfn of the
shadow page with an execute-only permissions. This way the breakpoint
can be written into the shadow-page and any attempt to read it can be
safely handled on a per-vCPU base by switching it back to the hostp2m
for the duration of a singlestep (with MTF). Setting up the shadow
pages is only safe to do during the initial setup while the altp2m
view is not used and the guest is paused. Once altp2m views are being
used adding new pages to the hostp2m results in losing all altp2m
settings. For the most part this limitation is not an issue because
all supported use-cases add the breakpoints once during the initial
setup and there are no breakpoints added later during runtime.

We've noticed that trapping MOV-TO-CR3 with the latest version of
Windows 10 has a lot of issues in terms of overhead when KPTI is used,
so as a band-aid solution it can be disabled to improve performance
(which Mathieu already did).

Also, this is all with external use of altp2m, #VE is not used.

> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.

My understanding is that the VPID implementation in Xen is such that
effectively all VMEXITs will trigger assignment of a new VPID to the
vCPU - which is likely a performance issue in itself - so flushing the
EPT is likely not going to make a difference. But it's worth a shot,
maybe it does :)

Thanks,
Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-06 17:41   ` Tamas K Lengyel
@ 2019-05-06 17:41     ` Tamas K Lengyel
  2019-05-06 18:30     ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-06 17:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

Hi Andrew,
thanks for helping brainstorming on this.

> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?

I've posted a blog entry about it a while ago, it's still accurate:
https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.

You can't add new frames to only some of the altp2m's - at least not
with the current interfaces. All the shadow pages are added to the
hostp2m and then in the altp2m the GFN is remapped to the mfn of the
shadow page with an execute-only permissions. This way the breakpoint
can be written into the shadow-page and any attempt to read it can be
safely handled on a per-vCPU base by switching it back to the hostp2m
for the duration of a singlestep (with MTF). Setting up the shadow
pages is only safe to do during the initial setup while the altp2m
view is not used and the guest is paused. Once altp2m views are being
used adding new pages to the hostp2m results in losing all altp2m
settings. For the most part this limitation is not an issue because
all supported use-cases add the breakpoints once during the initial
setup and there are no breakpoints added later during runtime.

We've noticed that trapping MOV-TO-CR3 with the latest version of
Windows 10 has a lot of issues in terms of overhead when KPTI is used,
so as a band-aid solution it can be disabled to improve performance
(which Mathieu already did).

Also, this is all with external use of altp2m, #VE is not used.

> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.

My understanding is that the VPID implementation in Xen is such that
effectively all VMEXITs will trigger assignment of a new VPID to the
vCPU - which is likely a performance issue in itself - so flushing the
EPT is likely not going to make a difference. But it's worth a shot,
maybe it does :)

Thanks,
Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-06 17:41   ` Tamas K Lengyel
  2019-05-06 17:41     ` [Xen-devel] " Tamas K Lengyel
@ 2019-05-06 18:30     ` Andrew Cooper
  2019-05-06 18:30       ` [Xen-devel] " Andrew Cooper
  2019-05-06 18:51       ` Tamas K Lengyel
  1 sibling, 2 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-06 18:30 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 06/05/2019 18:41, Tamas K Lengyel wrote:
> Hi Andrew,
> thanks for helping brainstorming on this.
>
>> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?
> I've posted a blog entry about it a while ago, it's still accurate:
> https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.

Talking of, have we fixed the emulation of `sti`?  I don't recall any
changes, but given our aim to get the emulator complete, we should fix it.

> You can't add new frames to only some of the altp2m's - at least not
> with the current interfaces. All the shadow pages are added to the
> hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> shadow page with an execute-only permissions.

Ah - of course.  gfns only make sense in the context of the hostp2m.

> This way the breakpoint
> can be written into the shadow-page and any attempt to read it can be
> safely handled on a per-vCPU base by switching it back to the hostp2m
> for the duration of a singlestep (with MTF). Setting up the shadow
> pages is only safe to do during the initial setup while the altp2m
> view is not used and the guest is paused. Once altp2m views are being
> used adding new pages to the hostp2m results in losing all altp2m
> settings. For the most part this limitation is not an issue because
> all supported use-cases add the breakpoints once during the initial
> setup and there are no breakpoints added later during runtime.

What do the host p2m permissions get set to?  How do you cope with
future reuse of the gfn for a different purpose later?

>
> We've noticed that trapping MOV-TO-CR3 with the latest version of
> Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> so as a band-aid solution it can be disabled to improve performance
> (which Mathieu already did).

Meltdown isn't subtle with its perf problems...  What purpose are you
trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
so, VT-x has (since forever, iirc) had the CR3 target list (of 4
entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
the VMExit.  If all you care about is that the vcpu stays on known-good
pagetables, this interface could be plumbed up to include the kernel and
user pagetables, which will avoid all the vmexits from syscalls due to
meltdown.

Alternatively, in some copious free time, once I've got the CPUID/MSR
interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
the guest doesn't turn on its meltdown mitigations in the first place.

>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> My understanding is that the VPID implementation in Xen is such that
> effectively all VMEXITs will trigger assignment of a new VPID to the
> vCPU - which is likely a performance issue in itself - so flushing the
> EPT is likely not going to make a difference. But it's worth a shot,
> maybe it does :)

Sadly, things are far more complicated than that.  For one, Intel still
owe me a comment/correction to that section of the SDM on INVLPG
emulation for guests.

Xen's use of ASIDs as a common concept started from the AMD side.  AMD
strictly only cache linear => host physical mappings, so after any
change to the p2m, an ASID tick will guarantee to get you a fully clean
TLB for future pagewalks to populate.

The same is not true for Intel.  VPID and EPT were introduced together,
and have several kinds of mappings which are cached.  The processor may
cache:
1) linear => gpa mappings (tagged with current VPID and PCID values, and
contain no information from EPT)
2) gpa => hpa mappings (tagged with the current EPTP, may contain other
data such as the SPP vector, doesn't contain any data from the guest
pagetables)
3) combined mappings which are linear => hpa mappings.

In particular, ticking the VPID after an EPT modification *does not*
invalidate the gpa=>hpa mappings, so the guest can continue to execute
using stale mappings.  This is why we've got the logic in
vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.

Hence my suggestion for identifying whether it is a real TLB flushing
issue, or a logical error elsewhere. :)

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-06 18:30     ` Andrew Cooper
@ 2019-05-06 18:30       ` Andrew Cooper
  2019-05-06 18:51       ` Tamas K Lengyel
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-06 18:30 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 06/05/2019 18:41, Tamas K Lengyel wrote:
> Hi Andrew,
> thanks for helping brainstorming on this.
>
>> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?
> I've posted a blog entry about it a while ago, it's still accurate:
> https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.

Talking of, have we fixed the emulation of `sti`?  I don't recall any
changes, but given our aim to get the emulator complete, we should fix it.

> You can't add new frames to only some of the altp2m's - at least not
> with the current interfaces. All the shadow pages are added to the
> hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> shadow page with an execute-only permissions.

Ah - of course.  gfns only make sense in the context of the hostp2m.

> This way the breakpoint
> can be written into the shadow-page and any attempt to read it can be
> safely handled on a per-vCPU base by switching it back to the hostp2m
> for the duration of a singlestep (with MTF). Setting up the shadow
> pages is only safe to do during the initial setup while the altp2m
> view is not used and the guest is paused. Once altp2m views are being
> used adding new pages to the hostp2m results in losing all altp2m
> settings. For the most part this limitation is not an issue because
> all supported use-cases add the breakpoints once during the initial
> setup and there are no breakpoints added later during runtime.

What do the host p2m permissions get set to?  How do you cope with
future reuse of the gfn for a different purpose later?

>
> We've noticed that trapping MOV-TO-CR3 with the latest version of
> Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> so as a band-aid solution it can be disabled to improve performance
> (which Mathieu already did).

Meltdown isn't subtle with its perf problems...  What purpose are you
trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
so, VT-x has (since forever, iirc) had the CR3 target list (of 4
entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
the VMExit.  If all you care about is that the vcpu stays on known-good
pagetables, this interface could be plumbed up to include the kernel and
user pagetables, which will avoid all the vmexits from syscalls due to
meltdown.

Alternatively, in some copious free time, once I've got the CPUID/MSR
interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
the guest doesn't turn on its meltdown mitigations in the first place.

>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> My understanding is that the VPID implementation in Xen is such that
> effectively all VMEXITs will trigger assignment of a new VPID to the
> vCPU - which is likely a performance issue in itself - so flushing the
> EPT is likely not going to make a difference. But it's worth a shot,
> maybe it does :)

Sadly, things are far more complicated than that.  For one, Intel still
owe me a comment/correction to that section of the SDM on INVLPG
emulation for guests.

Xen's use of ASIDs as a common concept started from the AMD side.  AMD
strictly only cache linear => host physical mappings, so after any
change to the p2m, an ASID tick will guarantee to get you a fully clean
TLB for future pagewalks to populate.

The same is not true for Intel.  VPID and EPT were introduced together,
and have several kinds of mappings which are cached.  The processor may
cache:
1) linear => gpa mappings (tagged with current VPID and PCID values, and
contain no information from EPT)
2) gpa => hpa mappings (tagged with the current EPTP, may contain other
data such as the SPP vector, doesn't contain any data from the guest
pagetables)
3) combined mappings which are linear => hpa mappings.

In particular, ticking the VPID after an EPT modification *does not*
invalidate the gpa=>hpa mappings, so the guest can continue to execute
using stale mappings.  This is why we've got the logic in
vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.

Hence my suggestion for identifying whether it is a real TLB flushing
issue, or a logical error elsewhere. :)

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-06 18:30     ` Andrew Cooper
  2019-05-06 18:30       ` [Xen-devel] " Andrew Cooper
@ 2019-05-06 18:51       ` Tamas K Lengyel
  2019-05-06 18:51         ` [Xen-devel] " Tamas K Lengyel
  1 sibling, 1 reply; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-06 18:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Mon, May 6, 2019 at 12:30 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 06/05/2019 18:41, Tamas K Lengyel wrote:
> > Hi Andrew,
> > thanks for helping brainstorming on this.
> >
> >> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?
> > I've posted a blog entry about it a while ago, it's still accurate:
> > https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.
>
> Talking of, have we fixed the emulation of `sti`?  I don't recall any
> changes, but given our aim to get the emulator complete, we should fix it.
>
> > You can't add new frames to only some of the altp2m's - at least not
> > with the current interfaces. All the shadow pages are added to the
> > hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> > shadow page with an execute-only permissions.
>
> Ah - of course.  gfns only make sense in the context of the hostp2m.
>
> > This way the breakpoint
> > can be written into the shadow-page and any attempt to read it can be
> > safely handled on a per-vCPU base by switching it back to the hostp2m
> > for the duration of a singlestep (with MTF). Setting up the shadow
> > pages is only safe to do during the initial setup while the altp2m
> > view is not used and the guest is paused. Once altp2m views are being
> > used adding new pages to the hostp2m results in losing all altp2m
> > settings. For the most part this limitation is not an issue because
> > all supported use-cases add the breakpoints once during the initial
> > setup and there are no breakpoints added later during runtime.
>
> What do the host p2m permissions get set to?  How do you cope with
> future reuse of the gfn for a different purpose later?

The hostp2m permissions aren't changed. The active view is always the
altp2m, the hostp2m is only ever being used while singlestepping. A
write-violation in the altp2m always triggers a full recopy of the
page to its shadow and redeployment of any breakpoints active on the
page after the singlestep is finished and we are switching back to the
altp2m. Since we are monitoring the guest kernel after it's already
booted up, the pages that are trapped are stable. Monitoring code that
may be paged in and loaded back to a different physical location is
not supported at the moment.

>
> >
> > We've noticed that trapping MOV-TO-CR3 with the latest version of
> > Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> > so as a band-aid solution it can be disabled to improve performance
> > (which Mathieu already did).
>
> Meltdown isn't subtle with its perf problems...  What purpose are you
> trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
> so, VT-x has (since forever, iirc) had the CR3 target list (of 4
> entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
> the VMExit.  If all you care about is that the vcpu stays on known-good
> pagetables, this interface could be plumbed up to include the kernel and
> user pagetables, which will avoid all the vmexits from syscalls due to
> meltdown.


CR3 trapping is primarily used to just keep track of where the KPCR is
for the active vCPU. This speeds up finding the thread/process base to
get standard info about the execution state (pid, process name, etc).
I'm aiming to get rid of this, hence the recent patches that fix the
shadow_gs and adds the gdtr_base to the vm_event structure so we can
gather these even if the process is in ring3. There are some plugins
that use it for other custom purposes but those aren't (as) critical.
So the CR3 whitelist isn't really something that applies for this
usecase.

> Alternatively, in some copious free time, once I've got the CPUID/MSR
> interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
> the guest doesn't turn on its meltdown mitigations in the first place.

That would be a nicer work-around, although I would still prefer not
having to trick the guest into a state that could be easily
fingerprinted - ie. I want that Windows install to be just like any
other Windows install under Xen. Otherwise it would be easy to spot
that "oh this is the sandbox version".

> >> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> > My understanding is that the VPID implementation in Xen is such that
> > effectively all VMEXITs will trigger assignment of a new VPID to the
> > vCPU - which is likely a performance issue in itself - so flushing the
> > EPT is likely not going to make a difference. But it's worth a shot,
> > maybe it does :)
>
> Sadly, things are far more complicated than that.  For one, Intel still
> owe me a comment/correction to that section of the SDM on INVLPG
> emulation for guests.
>
> Xen's use of ASIDs as a common concept started from the AMD side.  AMD
> strictly only cache linear => host physical mappings, so after any
> change to the p2m, an ASID tick will guarantee to get you a fully clean
> TLB for future pagewalks to populate.
>
> The same is not true for Intel.  VPID and EPT were introduced together,
> and have several kinds of mappings which are cached.  The processor may
> cache:
> 1) linear => gpa mappings (tagged with current VPID and PCID values, and
> contain no information from EPT)
> 2) gpa => hpa mappings (tagged with the current EPTP, may contain other
> data such as the SPP vector, doesn't contain any data from the guest
> pagetables)
> 3) combined mappings which are linear => hpa mappings.
>
> In particular, ticking the VPID after an EPT modification *does not*
> invalidate the gpa=>hpa mappings, so the guest can continue to execute
> using stale mappings.  This is why we've got the logic in
> vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.

Right, there is a problem if you do that while there are active vCPUs
- each has to trap to Xen to pick up a new VPID to see the EPT
modifications. But in our context we don't modify EPT after the
initial setup and during that setup the whole domain is paused. At
runtime we only perform switching of the EPT for a particular vCPU
which is trapped through vm_event, so it will pick up its new VPID
when it resumes.

> Hence my suggestion for identifying whether it is a real TLB flushing
> issue, or a logical error elsewhere. :)
>

Yea, it's certainly worth a shot. Not like any of this is trivial so I
could be wrong :)

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-06 18:51       ` Tamas K Lengyel
@ 2019-05-06 18:51         ` Tamas K Lengyel
  0 siblings, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-06 18:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Mon, May 6, 2019 at 12:30 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 06/05/2019 18:41, Tamas K Lengyel wrote:
> > Hi Andrew,
> > thanks for helping brainstorming on this.
> >
> >> How exactly does DRAKVUF go about injecting silent breakpoints?  It obviously has to allocate a new gfn from somewhere to begin with.  Do the bifurcated frames end up in two different altp2ms, or one in the host p2m and one in an alternative?  Does #VE ever get used?
> > I've posted a blog entry about it a while ago, it's still accurate:
> > https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.
>
> Talking of, have we fixed the emulation of `sti`?  I don't recall any
> changes, but given our aim to get the emulator complete, we should fix it.
>
> > You can't add new frames to only some of the altp2m's - at least not
> > with the current interfaces. All the shadow pages are added to the
> > hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> > shadow page with an execute-only permissions.
>
> Ah - of course.  gfns only make sense in the context of the hostp2m.
>
> > This way the breakpoint
> > can be written into the shadow-page and any attempt to read it can be
> > safely handled on a per-vCPU base by switching it back to the hostp2m
> > for the duration of a singlestep (with MTF). Setting up the shadow
> > pages is only safe to do during the initial setup while the altp2m
> > view is not used and the guest is paused. Once altp2m views are being
> > used adding new pages to the hostp2m results in losing all altp2m
> > settings. For the most part this limitation is not an issue because
> > all supported use-cases add the breakpoints once during the initial
> > setup and there are no breakpoints added later during runtime.
>
> What do the host p2m permissions get set to?  How do you cope with
> future reuse of the gfn for a different purpose later?

The hostp2m permissions aren't changed. The active view is always the
altp2m, the hostp2m is only ever being used while singlestepping. A
write-violation in the altp2m always triggers a full recopy of the
page to its shadow and redeployment of any breakpoints active on the
page after the singlestep is finished and we are switching back to the
altp2m. Since we are monitoring the guest kernel after it's already
booted up, the pages that are trapped are stable. Monitoring code that
may be paged in and loaded back to a different physical location is
not supported at the moment.

>
> >
> > We've noticed that trapping MOV-TO-CR3 with the latest version of
> > Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> > so as a band-aid solution it can be disabled to improve performance
> > (which Mathieu already did).
>
> Meltdown isn't subtle with its perf problems...  What purpose are you
> trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
> so, VT-x has (since forever, iirc) had the CR3 target list (of 4
> entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
> the VMExit.  If all you care about is that the vcpu stays on known-good
> pagetables, this interface could be plumbed up to include the kernel and
> user pagetables, which will avoid all the vmexits from syscalls due to
> meltdown.


CR3 trapping is primarily used to just keep track of where the KPCR is
for the active vCPU. This speeds up finding the thread/process base to
get standard info about the execution state (pid, process name, etc).
I'm aiming to get rid of this, hence the recent patches that fix the
shadow_gs and adds the gdtr_base to the vm_event structure so we can
gather these even if the process is in ring3. There are some plugins
that use it for other custom purposes but those aren't (as) critical.
So the CR3 whitelist isn't really something that applies for this
usecase.

> Alternatively, in some copious free time, once I've got the CPUID/MSR
> interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
> the guest doesn't turn on its meltdown mitigations in the first place.

That would be a nicer work-around, although I would still prefer not
having to trick the guest into a state that could be easily
fingerprinted - ie. I want that Windows install to be just like any
other Windows install under Xen. Otherwise it would be easy to spot
that "oh this is the sandbox version".

> >> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> > My understanding is that the VPID implementation in Xen is such that
> > effectively all VMEXITs will trigger assignment of a new VPID to the
> > vCPU - which is likely a performance issue in itself - so flushing the
> > EPT is likely not going to make a difference. But it's worth a shot,
> > maybe it does :)
>
> Sadly, things are far more complicated than that.  For one, Intel still
> owe me a comment/correction to that section of the SDM on INVLPG
> emulation for guests.
>
> Xen's use of ASIDs as a common concept started from the AMD side.  AMD
> strictly only cache linear => host physical mappings, so after any
> change to the p2m, an ASID tick will guarantee to get you a fully clean
> TLB for future pagewalks to populate.
>
> The same is not true for Intel.  VPID and EPT were introduced together,
> and have several kinds of mappings which are cached.  The processor may
> cache:
> 1) linear => gpa mappings (tagged with current VPID and PCID values, and
> contain no information from EPT)
> 2) gpa => hpa mappings (tagged with the current EPTP, may contain other
> data such as the SPP vector, doesn't contain any data from the guest
> pagetables)
> 3) combined mappings which are linear => hpa mappings.
>
> In particular, ticking the VPID after an EPT modification *does not*
> invalidate the gpa=>hpa mappings, so the guest can continue to execute
> using stale mappings.  This is why we've got the logic in
> vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.

Right, there is a problem if you do that while there are active vCPUs
- each has to trap to Xen to pick up a new VPID to see the EPT
modifications. But in our context we don't modify EPT after the
initial setup and during that setup the whole domain is paused. At
runtime we only perform switching of the EPT for a particular vCPU
which is trapped through vm_event, so it will pick up its new VPID
when it resumes.

> Hence my suggestion for identifying whether it is a real TLB flushing
> issue, or a logical error elsewhere. :)
>

Yea, it's certainly worth a shot. Not like any of this is trivial so I
could be wrong :)

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-06 17:07 ` Andrew Cooper
  2019-05-06 17:07   ` [Xen-devel] " Andrew Cooper
  2019-05-06 17:41   ` Tamas K Lengyel
@ 2019-05-07 12:01   ` Mathieu Tarral
  2019-05-07 12:01     ` [Xen-devel] " Mathieu Tarral
  2019-05-09 16:19     ` Mathieu Tarral
  2 siblings, 2 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-07 12:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 2976 bytes --]

Le lundi, mai 6, 2019 7:07 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :

> There is a lot in here.

I wanted to gather enough data before making a bug report on such a complicated issue.

> As for your BSOD analysis, the first thing to be aware of is that Double Fault is not necessarily precise, which means you can't necessarily trust any of the registers.  That said, most double faults are precise in practice, so if you're seeing it reliably at the same place, then it is likely to be a precise example.

I can reliably reproduce the Double Fault after ~10 tests on Windows 10 with KPTI.
And the stacktrace always show the beginning of KiSystemCall64ShadowCommon, which is executed after the CR3 switch to the kernel page tables.

> Your faulting address isn't the immediately after the pagetable switch.  It is one instruction further on, after the stack switch, which means at the very minimum that reading the new rsp out of the per-processor storage succeeded.
>
> The stack switch, combined with `push $0x2b` faulting is a clear sign that the stack is bad.  As the stack pointer looks plausible, it is almost certainly the pagewalk from %rsp which is bad.  Judging by the Windbg guide, you want to use !pte to dump the pagewalk (but I have never used it in anger before).

I checked RSP, and it's mapped in the kernel page tables:
# print kernel and userland page directory physical address
kd> dt _EPROCESS ffffdf8815e15340 ImageFileName Pcb.Directorytablebase Pcb.Userdirectorytablebase
ntdll!_EPROCESS
   +0x000 Pcb                        :
      +0x028 DirectoryTableBase         : 0xcbf10002
      +0x278 UserDirectoryTableBase     : 0xcbe00001
   +0x450 ImageFileName              : [15]  "ctfmon.exe"

# print RSP
kd> r rsp
rsp=fffff800b006cd08

# translate RSP to physical address
kd> !vtop cbf10000 fffff800b006cd08
Amd64VtoP: Virt fffff800b006cd08, pagedir 00000000cbf10000
Amd64VtoP: PML4E 00000000cbf10f80
Amd64VtoP: PDPE 0000000003708010
Amd64VtoP: PDE 0000000003709c00
Amd64VtoP: PTE 000000000371d360
Amd64VtoP: Mapped phys 000000000546cd08
Virtual address fffff800b006cd08 translates to physical address 546cd08.

> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 283eb7b..019333d 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>              }
>          }
>
> -        if ( inv )
> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> -                     inv == 1 ? single->eptp          : 0);
> +        __invept(INVEPT_ALL_CONTEXT, 0);
>      }
>
>   out:

I can give this a try, and see if it resolves the problem !

Thanks Andrew

[-- Attachment #1.2: Type: text/html, Size: 4228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-07 12:01   ` Mathieu Tarral
@ 2019-05-07 12:01     ` Mathieu Tarral
  2019-05-09 16:19     ` Mathieu Tarral
  1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-07 12:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 2976 bytes --]

Le lundi, mai 6, 2019 7:07 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :

> There is a lot in here.

I wanted to gather enough data before making a bug report on such a complicated issue.

> As for your BSOD analysis, the first thing to be aware of is that Double Fault is not necessarily precise, which means you can't necessarily trust any of the registers.  That said, most double faults are precise in practice, so if you're seeing it reliably at the same place, then it is likely to be a precise example.

I can reliably reproduce the Double Fault after ~10 tests on Windows 10 with KPTI.
And the stacktrace always show the beginning of KiSystemCall64ShadowCommon, which is executed after the CR3 switch to the kernel page tables.

> Your faulting address isn't the immediately after the pagetable switch.  It is one instruction further on, after the stack switch, which means at the very minimum that reading the new rsp out of the per-processor storage succeeded.
>
> The stack switch, combined with `push $0x2b` faulting is a clear sign that the stack is bad.  As the stack pointer looks plausible, it is almost certainly the pagewalk from %rsp which is bad.  Judging by the Windbg guide, you want to use !pte to dump the pagewalk (but I have never used it in anger before).

I checked RSP, and it's mapped in the kernel page tables:
# print kernel and userland page directory physical address
kd> dt _EPROCESS ffffdf8815e15340 ImageFileName Pcb.Directorytablebase Pcb.Userdirectorytablebase
ntdll!_EPROCESS
   +0x000 Pcb                        :
      +0x028 DirectoryTableBase         : 0xcbf10002
      +0x278 UserDirectoryTableBase     : 0xcbe00001
   +0x450 ImageFileName              : [15]  "ctfmon.exe"

# print RSP
kd> r rsp
rsp=fffff800b006cd08

# translate RSP to physical address
kd> !vtop cbf10000 fffff800b006cd08
Amd64VtoP: Virt fffff800b006cd08, pagedir 00000000cbf10000
Amd64VtoP: PML4E 00000000cbf10f80
Amd64VtoP: PDPE 0000000003708010
Amd64VtoP: PDE 0000000003709c00
Amd64VtoP: PTE 000000000371d360
Amd64VtoP: Mapped phys 000000000546cd08
Virtual address fffff800b006cd08 translates to physical address 546cd08.

> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 283eb7b..019333d 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>              }
>          }
>
> -        if ( inv )
> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> -                     inv == 1 ? single->eptp          : 0);
> +        __invept(INVEPT_ALL_CONTEXT, 0);
>      }
>
>   out:

I can give this a try, and see if it resolves the problem !

Thanks Andrew

[-- Attachment #1.2: Type: text/html, Size: 4228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-07 12:01   ` Mathieu Tarral
  2019-05-07 12:01     ` [Xen-devel] " Mathieu Tarral
@ 2019-05-09 16:19     ` Mathieu Tarral
  2019-05-09 16:19       ` [Xen-devel] " Mathieu Tarral
                         ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-09 16:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :

> > Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 283eb7b..019333d 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >              }
> >          }
> >
> > -        if ( inv )
> > -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> > -                     inv == 1 ? single->eptp          : 0);
> > +        __invept(INVEPT_ALL_CONTEXT, 0);
> >      }
> >
> >   out:
>
> I can give this a try, and see if it resolves the problem !

Just tested, on Xen 4.12.0, and the bug is still here.
Windows 7 is having BSODs with 4 VCPUs.
I didn't noticed a hefty performance impact though.

Do we have other caches to invalidate ?
Something else that i should test ?

I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
increased by the complexity of altp2m.
What i can do however, is test your ideas and patches and report the information I can gather on this issue.

Note: I tested with the latest commits on Drakvuf/master, especially:
"Add a VM pause for shadow copy refresh operation"
https://github.com/tklengyel/drakvuf/pull/626

@tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
Or was it totally unrelated ?

Thanks,
Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:19     ` Mathieu Tarral
@ 2019-05-09 16:19       ` Mathieu Tarral
  2019-05-09 16:42       ` Andrew Cooper
  2019-05-09 17:49       ` Tamas K Lengyel
  2 siblings, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-09 16:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :

> > Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 283eb7b..019333d 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >              }
> >          }
> >
> > -        if ( inv )
> > -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> > -                     inv == 1 ? single->eptp          : 0);
> > +        __invept(INVEPT_ALL_CONTEXT, 0);
> >      }
> >
> >   out:
>
> I can give this a try, and see if it resolves the problem !

Just tested, on Xen 4.12.0, and the bug is still here.
Windows 7 is having BSODs with 4 VCPUs.
I didn't noticed a hefty performance impact though.

Do we have other caches to invalidate ?
Something else that i should test ?

I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
increased by the complexity of altp2m.
What i can do however, is test your ideas and patches and report the information I can gather on this issue.

Note: I tested with the latest commits on Drakvuf/master, especially:
"Add a VM pause for shadow copy refresh operation"
https://github.com/tklengyel/drakvuf/pull/626

@tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
Or was it totally unrelated ?

Thanks,
Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:19     ` Mathieu Tarral
  2019-05-09 16:19       ` [Xen-devel] " Mathieu Tarral
@ 2019-05-09 16:42       ` Andrew Cooper
  2019-05-09 16:42         ` [Xen-devel] " Andrew Cooper
                           ` (2 more replies)
  2019-05-09 17:49       ` Tamas K Lengyel
  2 siblings, 3 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-09 16:42 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel

On 09/05/2019 17:19, Mathieu Tarral wrote:
> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
>
>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>>>
>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>> index 283eb7b..019333d 100644
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>>>              }
>>>          }
>>>
>>> -        if ( inv )
>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
>>> -                     inv == 1 ? single->eptp          : 0);
>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
>>>      }
>>>
>>>   out:
>> I can give this a try, and see if it resolves the problem !
> Just tested, on Xen 4.12.0, and the bug is still here.
> Windows 7 is having BSODs with 4 VCPUs.
> I didn't noticed a hefty performance impact though.
>
> Do we have other caches to invalidate ?
> Something else that i should test ?
>
> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> increased by the complexity of altp2m.
> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>
> Note: I tested with the latest commits on Drakvuf/master, especially:
> "Add a VM pause for shadow copy refresh operation"
> https://github.com/tklengyel/drakvuf/pull/626
>
> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> Or was it totally unrelated ?

With the above change in place and BSODs still happening, I'm fairly
convinced that it not a TLB flushing issue.

Therefore, the conclusion to draw is that it is a logical bug somewhere.

First of all - ensure you are using up-to-date microcode.  The number of
errata which have been discovered by people associated with the Xen
community is large.

The microcode is available from
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
is some documentation I prepared earlier.

Beyond that, I think it would help to know exactly how libvmi is
manipulating the guest.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:42       ` Andrew Cooper
@ 2019-05-09 16:42         ` Andrew Cooper
  2019-05-09 17:46         ` Tamas K Lengyel
  2019-05-10 15:17         ` Mathieu Tarral
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-09 16:42 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel

On 09/05/2019 17:19, Mathieu Tarral wrote:
> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
>
>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>>>
>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>> index 283eb7b..019333d 100644
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>>>              }
>>>          }
>>>
>>> -        if ( inv )
>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
>>> -                     inv == 1 ? single->eptp          : 0);
>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
>>>      }
>>>
>>>   out:
>> I can give this a try, and see if it resolves the problem !
> Just tested, on Xen 4.12.0, and the bug is still here.
> Windows 7 is having BSODs with 4 VCPUs.
> I didn't noticed a hefty performance impact though.
>
> Do we have other caches to invalidate ?
> Something else that i should test ?
>
> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> increased by the complexity of altp2m.
> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>
> Note: I tested with the latest commits on Drakvuf/master, especially:
> "Add a VM pause for shadow copy refresh operation"
> https://github.com/tklengyel/drakvuf/pull/626
>
> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> Or was it totally unrelated ?

With the above change in place and BSODs still happening, I'm fairly
convinced that it not a TLB flushing issue.

Therefore, the conclusion to draw is that it is a logical bug somewhere.

First of all - ensure you are using up-to-date microcode.  The number of
errata which have been discovered by people associated with the Xen
community is large.

The microcode is available from
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
is some documentation I prepared earlier.

Beyond that, I think it would help to know exactly how libvmi is
manipulating the guest.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:42       ` Andrew Cooper
  2019-05-09 16:42         ` [Xen-devel] " Andrew Cooper
@ 2019-05-09 17:46         ` Tamas K Lengyel
  2019-05-09 17:46           ` [Xen-devel] " Tamas K Lengyel
  2019-05-09 18:00           ` Andrew Cooper
  2019-05-10 15:17         ` Mathieu Tarral
  2 siblings, 2 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 17:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/05/2019 17:19, Mathieu Tarral wrote:
> > Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
> >
> >>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >>>
> >>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >>> index 283eb7b..019333d 100644
> >>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >>>              }
> >>>          }
> >>>
> >>> -        if ( inv )
> >>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> >>> -                     inv == 1 ? single->eptp          : 0);
> >>> +        __invept(INVEPT_ALL_CONTEXT, 0);
> >>>      }
> >>>
> >>>   out:
> >> I can give this a try, and see if it resolves the problem !
> > Just tested, on Xen 4.12.0, and the bug is still here.
> > Windows 7 is having BSODs with 4 VCPUs.
> > I didn't noticed a hefty performance impact though.
> >
> > Do we have other caches to invalidate ?
> > Something else that i should test ?
> >
> > I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> > increased by the complexity of altp2m.
> > What i can do however, is test your ideas and patches and report the information I can gather on this issue.
> >
> > Note: I tested with the latest commits on Drakvuf/master, especially:
> > "Add a VM pause for shadow copy refresh operation"
> > https://github.com/tklengyel/drakvuf/pull/626
> >
> > @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> > Or was it totally unrelated ?
>
> With the above change in place and BSODs still happening, I'm fairly
> convinced that it not a TLB flushing issue.
>
> Therefore, the conclusion to draw is that it is a logical bug somewhere.

I agree.

>
> First of all - ensure you are using up-to-date microcode.  The number of
> errata which have been discovered by people associated with the Xen
> community is large.
>
> The microcode is available from
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> is some documentation I prepared earlier.
>
> Beyond that, I think it would help to know exactly how libvmi is
> manipulating the guest.

I already suggested to Mathieu to try to reproduce the issue using the
xen-access test tool that's in the Xen tree to cut out all that
complexity. Without being able to limit the scope of the bug and being
able to reproducible trigger it I see little chance of finding the
root cause. Unfortunately I don't have the time to do that myself.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 17:46         ` Tamas K Lengyel
@ 2019-05-09 17:46           ` Tamas K Lengyel
  2019-05-09 18:00           ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 17:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/05/2019 17:19, Mathieu Tarral wrote:
> > Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
> >
> >>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >>>
> >>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >>> index 283eb7b..019333d 100644
> >>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >>>              }
> >>>          }
> >>>
> >>> -        if ( inv )
> >>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> >>> -                     inv == 1 ? single->eptp          : 0);
> >>> +        __invept(INVEPT_ALL_CONTEXT, 0);
> >>>      }
> >>>
> >>>   out:
> >> I can give this a try, and see if it resolves the problem !
> > Just tested, on Xen 4.12.0, and the bug is still here.
> > Windows 7 is having BSODs with 4 VCPUs.
> > I didn't noticed a hefty performance impact though.
> >
> > Do we have other caches to invalidate ?
> > Something else that i should test ?
> >
> > I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> > increased by the complexity of altp2m.
> > What i can do however, is test your ideas and patches and report the information I can gather on this issue.
> >
> > Note: I tested with the latest commits on Drakvuf/master, especially:
> > "Add a VM pause for shadow copy refresh operation"
> > https://github.com/tklengyel/drakvuf/pull/626
> >
> > @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> > Or was it totally unrelated ?
>
> With the above change in place and BSODs still happening, I'm fairly
> convinced that it not a TLB flushing issue.
>
> Therefore, the conclusion to draw is that it is a logical bug somewhere.

I agree.

>
> First of all - ensure you are using up-to-date microcode.  The number of
> errata which have been discovered by people associated with the Xen
> community is large.
>
> The microcode is available from
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> is some documentation I prepared earlier.
>
> Beyond that, I think it would help to know exactly how libvmi is
> manipulating the guest.

I already suggested to Mathieu to try to reproduce the issue using the
xen-access test tool that's in the Xen tree to cut out all that
complexity. Without being able to limit the scope of the bug and being
able to reproducible trigger it I see little chance of finding the
root cause. Unfortunately I don't have the time to do that myself.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:19     ` Mathieu Tarral
  2019-05-09 16:19       ` [Xen-devel] " Mathieu Tarral
  2019-05-09 16:42       ` Andrew Cooper
@ 2019-05-09 17:49       ` Tamas K Lengyel
  2019-05-09 17:49         ` [Xen-devel] " Tamas K Lengyel
  2 siblings, 1 reply; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 17:49 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> increased by the complexity of altp2m.

Unfortunately the reality is that you might very well have to become
familiar with these areas if you hope to debug some of these issues.

> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>
> Note: I tested with the latest commits on Drakvuf/master, especially:
> "Add a VM pause for shadow copy refresh operation"
> https://github.com/tklengyel/drakvuf/pull/626
>
> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> Or was it totally unrelated ?

It was an unrelated race-condition that I stumbled across while trying
to think of vectors that might be causing the issue you are seeing. As
you still have the issue, evidently it was an unrelated issue.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 17:49       ` Tamas K Lengyel
@ 2019-05-09 17:49         ` Tamas K Lengyel
  0 siblings, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 17:49 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> increased by the complexity of altp2m.

Unfortunately the reality is that you might very well have to become
familiar with these areas if you hope to debug some of these issues.

> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>
> Note: I tested with the latest commits on Drakvuf/master, especially:
> "Add a VM pause for shadow copy refresh operation"
> https://github.com/tklengyel/drakvuf/pull/626
>
> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> Or was it totally unrelated ?

It was an unrelated race-condition that I stumbled across while trying
to think of vectors that might be causing the issue you are seeing. As
you still have the issue, evidently it was an unrelated issue.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 17:46         ` Tamas K Lengyel
  2019-05-09 17:46           ` [Xen-devel] " Tamas K Lengyel
@ 2019-05-09 18:00           ` Andrew Cooper
  2019-05-09 18:00             ` [Xen-devel] " Andrew Cooper
                               ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-09 18:00 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 09/05/2019 18:46, Tamas K Lengyel wrote:
> On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 09/05/2019 17:19, Mathieu Tarral wrote:
>>> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
>>>
>>>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>>>>>
>>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>>>> index 283eb7b..019333d 100644
>>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>>>>>              }
>>>>>          }
>>>>>
>>>>> -        if ( inv )
>>>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
>>>>> -                     inv == 1 ? single->eptp          : 0);
>>>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
>>>>>      }
>>>>>
>>>>>   out:
>>>> I can give this a try, and see if it resolves the problem !
>>> Just tested, on Xen 4.12.0, and the bug is still here.
>>> Windows 7 is having BSODs with 4 VCPUs.
>>> I didn't noticed a hefty performance impact though.
>>>
>>> Do we have other caches to invalidate ?
>>> Something else that i should test ?
>>>
>>> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
>>> increased by the complexity of altp2m.
>>> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>>>
>>> Note: I tested with the latest commits on Drakvuf/master, especially:
>>> "Add a VM pause for shadow copy refresh operation"
>>> https://github.com/tklengyel/drakvuf/pull/626
>>>
>>> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
>>> Or was it totally unrelated ?
>> With the above change in place and BSODs still happening, I'm fairly
>> convinced that it not a TLB flushing issue.
>>
>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
> I agree.
>
>> First of all - ensure you are using up-to-date microcode.  The number of
>> errata which have been discovered by people associated with the Xen
>> community is large.
>>
>> The microcode is available from
>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
>> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
>> is some documentation I prepared earlier.
>>
>> Beyond that, I think it would help to know exactly how libvmi is
>> manipulating the guest.
> I already suggested to Mathieu to try to reproduce the issue using the
> xen-access test tool that's in the Xen tree to cut out all that
> complexity.

xen-access is ok, but I've never encountered a situation where I haven't
had to modify it first to get it usable.

I have some plans to replace it with something far more usable, as part
of tying together some XTF-based VMI testing, but none of that is
remotely ready yet.

> Without being able to limit the scope of the bug and being
> able to reproducible trigger it I see little chance of finding the
> root cause. Unfortunately I don't have the time to do that myself.

I can probably help out with some suggestions, but I agree that we are
going to have to cut out some of the complexity here to figure out
exactly what is going on.

Alternatively, if there are some sufficiently detailed instructions for
how to put together a repro of the problem using libvmi/etc, I might be
able to start debugging from that, but I definitely don't have time to
do that in the next week.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 18:00           ` Andrew Cooper
@ 2019-05-09 18:00             ` Andrew Cooper
  2019-05-09 18:08             ` Tamas K Lengyel
  2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-09 18:00 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 09/05/2019 18:46, Tamas K Lengyel wrote:
> On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 09/05/2019 17:19, Mathieu Tarral wrote:
>>> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
>>>
>>>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
>>>>>
>>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>>>> index 283eb7b..019333d 100644
>>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
>>>>>              }
>>>>>          }
>>>>>
>>>>> -        if ( inv )
>>>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
>>>>> -                     inv == 1 ? single->eptp          : 0);
>>>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
>>>>>      }
>>>>>
>>>>>   out:
>>>> I can give this a try, and see if it resolves the problem !
>>> Just tested, on Xen 4.12.0, and the bug is still here.
>>> Windows 7 is having BSODs with 4 VCPUs.
>>> I didn't noticed a hefty performance impact though.
>>>
>>> Do we have other caches to invalidate ?
>>> Something else that i should test ?
>>>
>>> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
>>> increased by the complexity of altp2m.
>>> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
>>>
>>> Note: I tested with the latest commits on Drakvuf/master, especially:
>>> "Add a VM pause for shadow copy refresh operation"
>>> https://github.com/tklengyel/drakvuf/pull/626
>>>
>>> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
>>> Or was it totally unrelated ?
>> With the above change in place and BSODs still happening, I'm fairly
>> convinced that it not a TLB flushing issue.
>>
>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
> I agree.
>
>> First of all - ensure you are using up-to-date microcode.  The number of
>> errata which have been discovered by people associated with the Xen
>> community is large.
>>
>> The microcode is available from
>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
>> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
>> is some documentation I prepared earlier.
>>
>> Beyond that, I think it would help to know exactly how libvmi is
>> manipulating the guest.
> I already suggested to Mathieu to try to reproduce the issue using the
> xen-access test tool that's in the Xen tree to cut out all that
> complexity.

xen-access is ok, but I've never encountered a situation where I haven't
had to modify it first to get it usable.

I have some plans to replace it with something far more usable, as part
of tying together some XTF-based VMI testing, but none of that is
remotely ready yet.

> Without being able to limit the scope of the bug and being
> able to reproducible trigger it I see little chance of finding the
> root cause. Unfortunately I don't have the time to do that myself.

I can probably help out with some suggestions, but I agree that we are
going to have to cut out some of the complexity here to figure out
exactly what is going on.

Alternatively, if there are some sufficiently detailed instructions for
how to put together a repro of the problem using libvmi/etc, I might be
able to start debugging from that, but I definitely don't have time to
do that in the next week.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 18:00           ` Andrew Cooper
  2019-05-09 18:00             ` [Xen-devel] " Andrew Cooper
@ 2019-05-09 18:08             ` Tamas K Lengyel
  2019-05-09 18:08               ` [Xen-devel] " Tamas K Lengyel
  2019-05-10 15:20               ` Mathieu Tarral
  2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
  2 siblings, 2 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 18:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Thu, May 9, 2019 at 12:00 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/05/2019 18:46, Tamas K Lengyel wrote:
> > On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> On 09/05/2019 17:19, Mathieu Tarral wrote:
> >>> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
> >>>
> >>>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >>>>>
> >>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> index 283eb7b..019333d 100644
> >>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >>>>>              }
> >>>>>          }
> >>>>>
> >>>>> -        if ( inv )
> >>>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> >>>>> -                     inv == 1 ? single->eptp          : 0);
> >>>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
> >>>>>      }
> >>>>>
> >>>>>   out:
> >>>> I can give this a try, and see if it resolves the problem !
> >>> Just tested, on Xen 4.12.0, and the bug is still here.
> >>> Windows 7 is having BSODs with 4 VCPUs.
> >>> I didn't noticed a hefty performance impact though.
> >>>
> >>> Do we have other caches to invalidate ?
> >>> Something else that i should test ?
> >>>
> >>> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> >>> increased by the complexity of altp2m.
> >>> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
> >>>
> >>> Note: I tested with the latest commits on Drakvuf/master, especially:
> >>> "Add a VM pause for shadow copy refresh operation"
> >>> https://github.com/tklengyel/drakvuf/pull/626
> >>>
> >>> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> >>> Or was it totally unrelated ?
> >> With the above change in place and BSODs still happening, I'm fairly
> >> convinced that it not a TLB flushing issue.
> >>
> >> Therefore, the conclusion to draw is that it is a logical bug somewhere.
> > I agree.
> >
> >> First of all - ensure you are using up-to-date microcode.  The number of
> >> errata which have been discovered by people associated with the Xen
> >> community is large.
> >>
> >> The microcode is available from
> >> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> >> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> >> is some documentation I prepared earlier.
> >>
> >> Beyond that, I think it would help to know exactly how libvmi is
> >> manipulating the guest.
> > I already suggested to Mathieu to try to reproduce the issue using the
> > xen-access test tool that's in the Xen tree to cut out all that
> > complexity.
>
> xen-access is ok, but I've never encountered a situation where I haven't
> had to modify it first to get it usable.

Right, it would likely have to be modified.

>
> I have some plans to replace it with something far more usable, as part
> of tying together some XTF-based VMI testing, but none of that is
> remotely ready yet.

Yes, that would be fantastic to have.

> > Without being able to limit the scope of the bug and being
> > able to reproducible trigger it I see little chance of finding the
> > root cause. Unfortunately I don't have the time to do that myself.
>
> I can probably help out with some suggestions, but I agree that we are
> going to have to cut out some of the complexity here to figure out
> exactly what is going on.
>
> Alternatively, if there are some sufficiently detailed instructions for
> how to put together a repro of the problem using libvmi/etc, I might be
> able to start debugging from that, but I definitely don't have time to
> do that in the next week.

The instructions are on https://drakvuf.com. AFAICT Mathieu is running
into the issue with simply running it on a up-to-date Windows 10 guest
but not in any way that I would call reproducible. Running it "for a
minute or two" is really not a reproducible bug description.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-09 18:08             ` Tamas K Lengyel
@ 2019-05-09 18:08               ` Tamas K Lengyel
  2019-05-10 15:20               ` Mathieu Tarral
  1 sibling, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-09 18:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

On Thu, May 9, 2019 at 12:00 PM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 09/05/2019 18:46, Tamas K Lengyel wrote:
> > On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >> On 09/05/2019 17:19, Mathieu Tarral wrote:
> >>> Le mardi, mai 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tarral@protonmail.com> a écrit :
> >>>
> >>>>> Given how many EPT flushing bugs I've already found in this area, I wouldn't be surprised if there are further ones lurking.  If it is an EPT flushing bug, this delta should make it go away, but it will come with a hefty perf hit.
> >>>>>
> >>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> index 283eb7b..019333d 100644
> >>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >>>>>              }
> >>>>>          }
> >>>>>
> >>>>> -        if ( inv )
> >>>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> >>>>> -                     inv == 1 ? single->eptp          : 0);
> >>>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
> >>>>>      }
> >>>>>
> >>>>>   out:
> >>>> I can give this a try, and see if it resolves the problem !
> >>> Just tested, on Xen 4.12.0, and the bug is still here.
> >>> Windows 7 is having BSODs with 4 VCPUs.
> >>> I didn't noticed a hefty performance impact though.
> >>>
> >>> Do we have other caches to invalidate ?
> >>> Something else that i should test ?
> >>>
> >>> I don't feel comfortable digging into Xen's code, especially for something as complicated as page table and memory management,
> >>> increased by the complexity of altp2m.
> >>> What i can do however, is test your ideas and patches and report the information I can gather on this issue.
> >>>
> >>> Note: I tested with the latest commits on Drakvuf/master, especially:
> >>> "Add a VM pause for shadow copy refresh operation"
> >>> https://github.com/tklengyel/drakvuf/pull/626
> >>>
> >>> @tamas, did you made this patch to fix these kind of race conditions issue that i'm reporting ?
> >>> Or was it totally unrelated ?
> >> With the above change in place and BSODs still happening, I'm fairly
> >> convinced that it not a TLB flushing issue.
> >>
> >> Therefore, the conclusion to draw is that it is a logical bug somewhere.
> > I agree.
> >
> >> First of all - ensure you are using up-to-date microcode.  The number of
> >> errata which have been discovered by people associated with the Xen
> >> community is large.
> >>
> >> The microcode is available from
> >> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> >> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> >> is some documentation I prepared earlier.
> >>
> >> Beyond that, I think it would help to know exactly how libvmi is
> >> manipulating the guest.
> > I already suggested to Mathieu to try to reproduce the issue using the
> > xen-access test tool that's in the Xen tree to cut out all that
> > complexity.
>
> xen-access is ok, but I've never encountered a situation where I haven't
> had to modify it first to get it usable.

Right, it would likely have to be modified.

>
> I have some plans to replace it with something far more usable, as part
> of tying together some XTF-based VMI testing, but none of that is
> remotely ready yet.

Yes, that would be fantastic to have.

> > Without being able to limit the scope of the bug and being
> > able to reproducible trigger it I see little chance of finding the
> > root cause. Unfortunately I don't have the time to do that myself.
>
> I can probably help out with some suggestions, but I agree that we are
> going to have to cut out some of the complexity here to figure out
> exactly what is going on.
>
> Alternatively, if there are some sufficiently detailed instructions for
> how to put together a repro of the problem using libvmi/etc, I might be
> able to start debugging from that, but I definitely don't have time to
> do that in the next week.

The instructions are on https://drakvuf.com. AFAICT Mathieu is running
into the issue with simply running it on a up-to-date Windows 10 guest
but not in any way that I would call reproducible. Running it "for a
minute or two" is really not a reproducible bug description.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 18:00           ` Andrew Cooper
  2019-05-09 18:00             ` [Xen-devel] " Andrew Cooper
  2019-05-09 18:08             ` Tamas K Lengyel
@ 2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
  2019-05-10 10:52               ` [Xen-devel] " Petre Ovidiu PIRCALABU
  2019-05-10 10:55               ` Andrew Cooper
  2 siblings, 2 replies; 54+ messages in thread
From: Petre Ovidiu PIRCALABU @ 2019-05-10 10:52 UTC (permalink / raw)
  To: Andrew Cooper, Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On Thu, 2019-05-09 at 19:00 +0100, Andrew Cooper wrote:
> > 
> On 09/05/2019 18:46, Tamas K Lengyel wrote:
> > 
> 
> I have some plans to replace it with something far more usable, as
> part
> of tying together some XTF-based VMI testing, but none of that is
> remotely ready yet.
Hi Andrew,

Did you get a chance to look at this series? 
https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg02421.html

If it doens't match your requirements, I can modify it accordingly.

Many thanks,
Petre

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
@ 2019-05-10 10:52               ` Petre Ovidiu PIRCALABU
  2019-05-10 10:55               ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Petre Ovidiu PIRCALABU @ 2019-05-10 10:52 UTC (permalink / raw)
  To: Andrew Cooper, Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On Thu, 2019-05-09 at 19:00 +0100, Andrew Cooper wrote:
> > 
> On 09/05/2019 18:46, Tamas K Lengyel wrote:
> > 
> 
> I have some plans to replace it with something far more usable, as
> part
> of tying together some XTF-based VMI testing, but none of that is
> remotely ready yet.
Hi Andrew,

Did you get a chance to look at this series? 
https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg02421.html

If it doens't match your requirements, I can modify it accordingly.

Many thanks,
Petre

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
  2019-05-10 10:52               ` [Xen-devel] " Petre Ovidiu PIRCALABU
@ 2019-05-10 10:55               ` Andrew Cooper
  2019-05-10 10:55                 ` [Xen-devel] " Andrew Cooper
  1 sibling, 1 reply; 54+ messages in thread
From: Andrew Cooper @ 2019-05-10 10:55 UTC (permalink / raw)
  To: Petre Ovidiu PIRCALABU, Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 10/05/2019 11:52, Petre Ovidiu PIRCALABU wrote:
> On Thu, 2019-05-09 at 19:00 +0100, Andrew Cooper wrote:
>> On 09/05/2019 18:46, Tamas K Lengyel wrote:
>> I have some plans to replace it with something far more usable, as
>> part
>> of tying together some XTF-based VMI testing, but none of that is
>> remotely ready yet.
> Hi Andrew,
>
> Did you get a chance to look at this series? 
> https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg02421.html
>
> If it doens't match your requirements, I can modify it accordingly.

I have, and while I haven't had time to act on any of it yet, I am
planning to use some of it, but I think I've got a clever idea to make
the dom0 side rather cleaner to use.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-10 10:55               ` Andrew Cooper
@ 2019-05-10 10:55                 ` Andrew Cooper
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-10 10:55 UTC (permalink / raw)
  To: Petre Ovidiu PIRCALABU, Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 10/05/2019 11:52, Petre Ovidiu PIRCALABU wrote:
> On Thu, 2019-05-09 at 19:00 +0100, Andrew Cooper wrote:
>> On 09/05/2019 18:46, Tamas K Lengyel wrote:
>> I have some plans to replace it with something far more usable, as
>> part
>> of tying together some XTF-based VMI testing, but none of that is
>> remotely ready yet.
> Hi Andrew,
>
> Did you get a chance to look at this series? 
> https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg02421.html
>
> If it doens't match your requirements, I can modify it accordingly.

I have, and while I haven't had time to act on any of it yet, I am
planning to use some of it, but I think I've got a clever idea to make
the dom0 side rather cleaner to use.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 16:42       ` Andrew Cooper
  2019-05-09 16:42         ` [Xen-devel] " Andrew Cooper
  2019-05-09 17:46         ` Tamas K Lengyel
@ 2019-05-10 15:17         ` Mathieu Tarral
  2019-05-10 15:17           ` [Xen-devel] " Mathieu Tarral
  2019-05-10 15:21           ` Andrew Cooper
  2 siblings, 2 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-10 15:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>
> First of all - ensure you are using up-to-date microcode.  The number of
> errata which have been discovered by people associated with the Xen
> community is large.
>
> The microcode is available from
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> is some documentation I prepared earlier.
I updated my microcode following your instructions.
I installed the intel-microcode package on Debian stretch, and added ucode=scan to my Xen boot command line.

Looking at xl dmesg before the update, I noticed I had a bug that should be fix by a microcode update:
"tsc_deadline disabled due to errata..."
This message disappeared with the microcode code update, which is applied at boot and can be
seen in xl dmesg.
The microcode version is now 0x25 (2018-04-02)

And if I use iucode-tool on the intel-ucode directory, from the repo your provided, the only
microcode that matches has the same release date:

$ iucode-tool -S Intel-Linux-Processor-Microcode-Data-Files/intel-ucode/ -l
selected microcodes:
  053/001: sig 0x000306c3, pf_mask 0x32, 2018-04-02, rev 0x0025, size 23552


The bug is still here, so we can exclude a microcode issue.

Mathieu


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:17         ` Mathieu Tarral
@ 2019-05-10 15:17           ` Mathieu Tarral
  2019-05-10 15:21           ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-10 15:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>
> First of all - ensure you are using up-to-date microcode.  The number of
> errata which have been discovered by people associated with the Xen
> community is large.
>
> The microcode is available from
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> is some documentation I prepared earlier.
I updated my microcode following your instructions.
I installed the intel-microcode package on Debian stretch, and added ucode=scan to my Xen boot command line.

Looking at xl dmesg before the update, I noticed I had a bug that should be fix by a microcode update:
"tsc_deadline disabled due to errata..."
This message disappeared with the microcode code update, which is applied at boot and can be
seen in xl dmesg.
The microcode version is now 0x25 (2018-04-02)

And if I use iucode-tool on the intel-ucode directory, from the repo your provided, the only
microcode that matches has the same release date:

$ iucode-tool -S Intel-Linux-Processor-Microcode-Data-Files/intel-ucode/ -l
selected microcodes:
  053/001: sig 0x000306c3, pf_mask 0x32, 2018-04-02, rev 0x0025, size 23552


The bug is still here, so we can exclude a microcode issue.

Mathieu


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-09 18:08             ` Tamas K Lengyel
  2019-05-09 18:08               ` [Xen-devel] " Tamas K Lengyel
@ 2019-05-10 15:20               ` Mathieu Tarral
  2019-05-10 15:20                 ` [Xen-devel] " Mathieu Tarral
  1 sibling, 1 reply; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-10 15:20 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: Andrew Cooper, xen-devel

Le jeudi, mai 9, 2019 8:08 PM, Tamas K Lengyel <tamas@tklengyel.com> a écrit :
> > > > I already suggested to Mathieu to try to reproduce the issue using the
> > > > xen-access test tool that's in the Xen tree to cut out all that
> > > > complexity.
> >
> > xen-access is ok, but I've never encountered a situation where I haven't
> > had to modify it first to get it usable.
>
> Right, it would likely have to be modified.
>
> > I have some plans to replace it with something far more usable, as part
> > of tying together some XTF-based VMI testing, but none of that is
> > remotely ready yet.
>
> Yes, that would be fantastic to have.
>
> > > Without being able to limit the scope of the bug and being
> > > able to reproducible trigger it I see little chance of finding the
> > > root cause. Unfortunately I don't have the time to do that myself.
> >
> > I can probably help out with some suggestions, but I agree that we are
> > going to have to cut out some of the complexity here to figure out
> > exactly what is going on.
> > Alternatively, if there are some sufficiently detailed instructions for
> > how to put together a repro of the problem using libvmi/etc, I might be
> > able to start debugging from that, but I definitely don't have time to
> > do that in the next week.
>
> The instructions are onhttps://drakvuf.com. AFAICT Mathieu is running
> into the issue with simply running it on a up-to-date Windows 10 guest
> but not in any way that I would call reproducible. Running it "for a
> minute or two" is really not a reproducible bug description.

I think there are 2 separate issues,
one is the race-condition i'm describing, impacting both Windows 7 and Windows 10 (which I have tested.).
second is a crash linked to KPTI (the crash happens really fast, and Windows 10 without kpti is quite stable under Drakvuf monitoring.

Regarding how reproductible it is, what I have for now is a PyTest based test suite,
that will inject the sample using either Drakvuf's method (createproc/shellexec) or Ansible/WinRM.

The execution is monitored and when I detect that the process terminated, I validate the test.

On Windows 7 x64, with 4 VCPUs, it crashes around ~10 tests.
@Andrew would you like to give this a try, and repro the issue on your side with the test suite ?
That's the best "reproducibility" I can offer you at the moment.

My next objective will be to look at xen-access tool, and modify it to inject steath breakpoints,
the same way drakvuf does, to build bug repro as small as possible.

Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:20               ` Mathieu Tarral
@ 2019-05-10 15:20                 ` Mathieu Tarral
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-10 15:20 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: Andrew Cooper, xen-devel

Le jeudi, mai 9, 2019 8:08 PM, Tamas K Lengyel <tamas@tklengyel.com> a écrit :
> > > > I already suggested to Mathieu to try to reproduce the issue using the
> > > > xen-access test tool that's in the Xen tree to cut out all that
> > > > complexity.
> >
> > xen-access is ok, but I've never encountered a situation where I haven't
> > had to modify it first to get it usable.
>
> Right, it would likely have to be modified.
>
> > I have some plans to replace it with something far more usable, as part
> > of tying together some XTF-based VMI testing, but none of that is
> > remotely ready yet.
>
> Yes, that would be fantastic to have.
>
> > > Without being able to limit the scope of the bug and being
> > > able to reproducible trigger it I see little chance of finding the
> > > root cause. Unfortunately I don't have the time to do that myself.
> >
> > I can probably help out with some suggestions, but I agree that we are
> > going to have to cut out some of the complexity here to figure out
> > exactly what is going on.
> > Alternatively, if there are some sufficiently detailed instructions for
> > how to put together a repro of the problem using libvmi/etc, I might be
> > able to start debugging from that, but I definitely don't have time to
> > do that in the next week.
>
> The instructions are onhttps://drakvuf.com. AFAICT Mathieu is running
> into the issue with simply running it on a up-to-date Windows 10 guest
> but not in any way that I would call reproducible. Running it "for a
> minute or two" is really not a reproducible bug description.

I think there are 2 separate issues,
one is the race-condition i'm describing, impacting both Windows 7 and Windows 10 (which I have tested.).
second is a crash linked to KPTI (the crash happens really fast, and Windows 10 without kpti is quite stable under Drakvuf monitoring.

Regarding how reproductible it is, what I have for now is a PyTest based test suite,
that will inject the sample using either Drakvuf's method (createproc/shellexec) or Ansible/WinRM.

The execution is monitored and when I detect that the process terminated, I validate the test.

On Windows 7 x64, with 4 VCPUs, it crashes around ~10 tests.
@Andrew would you like to give this a try, and repro the issue on your side with the test suite ?
That's the best "reproducibility" I can offer you at the moment.

My next objective will be to look at xen-access tool, and modify it to inject steath breakpoints,
the same way drakvuf does, to build bug repro as small as possible.

Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:17         ` Mathieu Tarral
  2019-05-10 15:17           ` [Xen-devel] " Mathieu Tarral
@ 2019-05-10 15:21           ` Andrew Cooper
  2019-05-10 15:21             ` [Xen-devel] " Andrew Cooper
                               ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-10 15:21 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel

On 10/05/2019 16:17, Mathieu Tarral wrote:
> Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>>
>> First of all - ensure you are using up-to-date microcode.  The number of
>> errata which have been discovered by people associated with the Xen
>> community is large.
>>
>> The microcode is available from
>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
>> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
>> is some documentation I prepared earlier.
> I updated my microcode following your instructions.
> I installed the intel-microcode package on Debian stretch, and added ucode=scan to my Xen boot command line.
>
> Looking at xl dmesg before the update, I noticed I had a bug that should be fix by a microcode update:
> "tsc_deadline disabled due to errata..."

That is actually a Xen bug where we don't scan the current microcode
version if we aren't about to load new microcode.  Therefore, it was
likely a red herring before.

I'll try and get a fix sorted properly and tagged for backport.

> This message disappeared with the microcode code update, which is applied at boot and can be
> seen in xl dmesg.
> The microcode version is now 0x25 (2018-04-02)
>
> And if I use iucode-tool on the intel-ucode directory, from the repo your provided, the only
> microcode that matches has the same release date:
>
> $ iucode-tool -S Intel-Linux-Processor-Microcode-Data-Files/intel-ucode/ -l
> selected microcodes:
>   053/001: sig 0x000306c3, pf_mask 0x32, 2018-04-02, rev 0x0025, size 23552
>
>
> The bug is still here, so we can exclude a microcode issue.

Good - that is one further angle excluded.  Always make sure you are
running with up-to-date microcode, but it looks like we back to
investigating a logical bug in libvmi or Xen.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:21           ` Andrew Cooper
@ 2019-05-10 15:21             ` Andrew Cooper
  2019-05-13 16:18             ` Mathieu Tarral
  2019-05-28 12:33             ` Mathieu Tarral
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-10 15:21 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel

On 10/05/2019 16:17, Mathieu Tarral wrote:
> Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>>
>> First of all - ensure you are using up-to-date microcode.  The number of
>> errata which have been discovered by people associated with the Xen
>> community is large.
>>
>> The microcode is available from
>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
>> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
>> is some documentation I prepared earlier.
> I updated my microcode following your instructions.
> I installed the intel-microcode package on Debian stretch, and added ucode=scan to my Xen boot command line.
>
> Looking at xl dmesg before the update, I noticed I had a bug that should be fix by a microcode update:
> "tsc_deadline disabled due to errata..."

That is actually a Xen bug where we don't scan the current microcode
version if we aren't about to load new microcode.  Therefore, it was
likely a red herring before.

I'll try and get a fix sorted properly and tagged for backport.

> This message disappeared with the microcode code update, which is applied at boot and can be
> seen in xl dmesg.
> The microcode version is now 0x25 (2018-04-02)
>
> And if I use iucode-tool on the intel-ucode directory, from the repo your provided, the only
> microcode that matches has the same release date:
>
> $ iucode-tool -S Intel-Linux-Processor-Microcode-Data-Files/intel-ucode/ -l
> selected microcodes:
>   053/001: sig 0x000306c3, pf_mask 0x32, 2018-04-02, rev 0x0025, size 23552
>
>
> The bug is still here, so we can exclude a microcode issue.

Good - that is one further angle excluded.  Always make sure you are
running with up-to-date microcode, but it looks like we back to
investigating a logical bug in libvmi or Xen.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:21           ` Andrew Cooper
  2019-05-10 15:21             ` [Xen-devel] " Andrew Cooper
@ 2019-05-13 16:18             ` Mathieu Tarral
  2019-05-13 16:18               ` [Xen-devel] " Mathieu Tarral
  2019-05-13 17:19               ` Razvan Cojocaru
  2019-05-28 12:33             ` Mathieu Tarral
  2 siblings, 2 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-13 16:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le vendredi, mai 10, 2019 5:21 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :

> On 10/05/2019 16:17, Mathieu Tarral wrote:
>
> > Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper andrew.cooper3@citrix.com a écrit :
> >
> > > Therefore, the conclusion to draw is that it is a logical bug somewhere.
> > The bug is still here, so we can exclude a microcode issue.
>
> Good - that is one further angle excluded.  Always make sure you are
> running with up-to-date microcode, but it looks like we back to
> investigating a logical bug in libvmi or Xen.

I played with tool/tests/xen-access this afternoon.

The tool is working, i could intercept breakpoints, cpuid, write and exec mem accesses, etc..

However, using altp2m related intercepts leads to a guest crash sometimes:

Windows 7 x64, 4 VCPUs
- altp2m_write: crash
- altp2m_exec: crash
- altp2m_write_no_gpt: frozen

Windows 7 x64, 1 VCPU
- altp2m_write: crash
- altp2m_exec: OK
- altp2m_write_no_gpt: frozen

"frozen" means that xen-access receives VMI events, bug the guest is frozen until I decide to stop xen-access.
I'm wondering what kind of exec events it received because they are not the same, so it's not looping
over the same RIP over and over. (?)

Here is an example output I have when I run sudo ./xen-access <dom_id> altp2m_write

...
Got event from Xen
Singlestep: rip=fffff800026e60dc, vcpu 1, altp2m 0
        Switching altp2m to view 1!
Error -1 getting mem_access event

Singlestep: rip=fffff800026e6054, vcpu 3, altp2m 0
        Switching altp2m to view 1!
Singlestep: rip=fffff800026d64c5, vcpu 0, altp2m 0
        Switching altp2m to view 1!
xenaccess shutting down on signal -1
Got event from Xen
PAGE ACCESS: rw- for GFN 21cef (offset 000fb8) gla fffff88002039fb8 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 1 [p], altp2m view 1)
        Switching back to default view!
PAGE ACCESS: rw- for GFN 1debc (offset 0004b0) gla fffff880022ed4b0 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 3 [p], altp2m view 1)
        Switching back to default view!
PAGE ACCESS: rw- for GFN b9a (offset 000ae8) gla fffff80000b9aae8 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 0 [p], altp2m view 1)
        Switching back to default view!
xenaccess shut down on signal -1
xenaccess exit code -1

@Tamas: if you added support for altp2m in xen-access, did you remember crashing your guest ?
Or was it working at the time you tested ?

Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-13 16:18             ` Mathieu Tarral
@ 2019-05-13 16:18               ` Mathieu Tarral
  2019-05-13 17:19               ` Razvan Cojocaru
  1 sibling, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-13 16:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Le vendredi, mai 10, 2019 5:21 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :

> On 10/05/2019 16:17, Mathieu Tarral wrote:
>
> > Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper andrew.cooper3@citrix.com a écrit :
> >
> > > Therefore, the conclusion to draw is that it is a logical bug somewhere.
> > The bug is still here, so we can exclude a microcode issue.
>
> Good - that is one further angle excluded.  Always make sure you are
> running with up-to-date microcode, but it looks like we back to
> investigating a logical bug in libvmi or Xen.

I played with tool/tests/xen-access this afternoon.

The tool is working, i could intercept breakpoints, cpuid, write and exec mem accesses, etc..

However, using altp2m related intercepts leads to a guest crash sometimes:

Windows 7 x64, 4 VCPUs
- altp2m_write: crash
- altp2m_exec: crash
- altp2m_write_no_gpt: frozen

Windows 7 x64, 1 VCPU
- altp2m_write: crash
- altp2m_exec: OK
- altp2m_write_no_gpt: frozen

"frozen" means that xen-access receives VMI events, bug the guest is frozen until I decide to stop xen-access.
I'm wondering what kind of exec events it received because they are not the same, so it's not looping
over the same RIP over and over. (?)

Here is an example output I have when I run sudo ./xen-access <dom_id> altp2m_write

...
Got event from Xen
Singlestep: rip=fffff800026e60dc, vcpu 1, altp2m 0
        Switching altp2m to view 1!
Error -1 getting mem_access event

Singlestep: rip=fffff800026e6054, vcpu 3, altp2m 0
        Switching altp2m to view 1!
Singlestep: rip=fffff800026d64c5, vcpu 0, altp2m 0
        Switching altp2m to view 1!
xenaccess shutting down on signal -1
Got event from Xen
PAGE ACCESS: rw- for GFN 21cef (offset 000fb8) gla fffff88002039fb8 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 1 [p], altp2m view 1)
        Switching back to default view!
PAGE ACCESS: rw- for GFN 1debc (offset 0004b0) gla fffff880022ed4b0 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 3 [p], altp2m view 1)
        Switching back to default view!
PAGE ACCESS: rw- for GFN b9a (offset 000ae8) gla fffff80000b9aae8 (valid: y; fault in gpt: n; fault with gla: y) (vcpu 0 [p], altp2m view 1)
        Switching back to default view!
xenaccess shut down on signal -1
xenaccess exit code -1

@Tamas: if you added support for altp2m in xen-access, did you remember crashing your guest ?
Or was it working at the time you tested ?

Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-13 16:18             ` Mathieu Tarral
  2019-05-13 16:18               ` [Xen-devel] " Mathieu Tarral
@ 2019-05-13 17:19               ` Razvan Cojocaru
  2019-05-13 17:19                 ` [Xen-devel] " Razvan Cojocaru
  1 sibling, 1 reply; 54+ messages in thread
From: Razvan Cojocaru @ 2019-05-13 17:19 UTC (permalink / raw)
  To: Mathieu Tarral, Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

On 5/13/19 7:18 PM, Mathieu Tarral wrote:
> Le vendredi, mai 10, 2019 5:21 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
> 
>> On 10/05/2019 16:17, Mathieu Tarral wrote:
>>
>>> Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper andrew.cooper3@citrix.com a écrit :
>>>
>>>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>>> The bug is still here, so we can exclude a microcode issue.
>>
>> Good - that is one further angle excluded.  Always make sure you are
>> running with up-to-date microcode, but it looks like we back to
>> investigating a logical bug in libvmi or Xen.
> 
> I played with tool/tests/xen-access this afternoon.
> 
> The tool is working, i could intercept breakpoints, cpuid, write and exec mem accesses, etc..
> 
> However, using altp2m related intercepts leads to a guest crash sometimes:
> 
> Windows 7 x64, 4 VCPUs
> - altp2m_write: crash
> - altp2m_exec: crash
> - altp2m_write_no_gpt: frozen
> 
> Windows 7 x64, 1 VCPU
> - altp2m_write: crash
> - altp2m_exec: OK
> - altp2m_write_no_gpt: frozen
> 
> "frozen" means that xen-access receives VMI events, bug the guest is frozen until I decide to stop xen-access.
> I'm wondering what kind of exec events it received because they are not the same, so it's not looping
> over the same RIP over and over. (?)
I think you're simply tripping some OS timer because you're slowing the
guest down in the crash case, and simply keep the guest too busy
handling events in the "freeze" case. Remember that there's quite a
delay running each offending instruction: one vm_event saying you've got
a violation, a reply saying "put this VCPU in single-step mode _and_
switch to the unrestricted EPT view", another vm_event saying
"instruction executed", followed by anoher reply saying "switch back to
the restricted EPT _and_ take the VCPU out of single-step mode".

Restricting the whole of the guest's memory (and so doing this dance for
_every_ instruction causing a fault) is practically guaranteed to upset
the OS. A little EPT restricting goes a long way.

Of course, if this could be improved so that even stress-tests (which is
basically what xen-access is) leave the guest running smoothly, that'd
be fantastic.


Razvan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-13 17:19               ` Razvan Cojocaru
@ 2019-05-13 17:19                 ` Razvan Cojocaru
  0 siblings, 0 replies; 54+ messages in thread
From: Razvan Cojocaru @ 2019-05-13 17:19 UTC (permalink / raw)
  To: Mathieu Tarral, Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

On 5/13/19 7:18 PM, Mathieu Tarral wrote:
> Le vendredi, mai 10, 2019 5:21 PM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :
> 
>> On 10/05/2019 16:17, Mathieu Tarral wrote:
>>
>>> Le jeudi, mai 9, 2019 6:42 PM, Andrew Cooper andrew.cooper3@citrix.com a écrit :
>>>
>>>> Therefore, the conclusion to draw is that it is a logical bug somewhere.
>>> The bug is still here, so we can exclude a microcode issue.
>>
>> Good - that is one further angle excluded.  Always make sure you are
>> running with up-to-date microcode, but it looks like we back to
>> investigating a logical bug in libvmi or Xen.
> 
> I played with tool/tests/xen-access this afternoon.
> 
> The tool is working, i could intercept breakpoints, cpuid, write and exec mem accesses, etc..
> 
> However, using altp2m related intercepts leads to a guest crash sometimes:
> 
> Windows 7 x64, 4 VCPUs
> - altp2m_write: crash
> - altp2m_exec: crash
> - altp2m_write_no_gpt: frozen
> 
> Windows 7 x64, 1 VCPU
> - altp2m_write: crash
> - altp2m_exec: OK
> - altp2m_write_no_gpt: frozen
> 
> "frozen" means that xen-access receives VMI events, bug the guest is frozen until I decide to stop xen-access.
> I'm wondering what kind of exec events it received because they are not the same, so it's not looping
> over the same RIP over and over. (?)
I think you're simply tripping some OS timer because you're slowing the
guest down in the crash case, and simply keep the guest too busy
handling events in the "freeze" case. Remember that there's quite a
delay running each offending instruction: one vm_event saying you've got
a violation, a reply saying "put this VCPU in single-step mode _and_
switch to the unrestricted EPT view", another vm_event saying
"instruction executed", followed by anoher reply saying "switch back to
the restricted EPT _and_ take the VCPU out of single-step mode".

Restricting the whole of the guest's memory (and so doing this dance for
_every_ instruction causing a fault) is practically guaranteed to upset
the OS. A little EPT restricting goes a long way.

Of course, if this could be improved so that even stress-tests (which is
basically what xen-access is) leave the guest running smoothly, that'd
be fantastic.


Razvan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-10 15:21           ` Andrew Cooper
  2019-05-10 15:21             ` [Xen-devel] " Andrew Cooper
  2019-05-13 16:18             ` Mathieu Tarral
@ 2019-05-28 12:33             ` Mathieu Tarral
  2019-05-28 12:33               ` [Xen-devel] " Mathieu Tarral
                                 ` (3 more replies)
  2 siblings, 4 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-28 12:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Hi Andrew,

> > The bug is still here, so we can exclude a microcode issue.
>
> Good - that is one further angle excluded.  Always make sure you are
> running with up-to-date microcode, but it looks like we back to
> investigating a logical bug in libvmi or Xen.


I reimplemented a small test, without the Drakvuf/Libvmi layers, that will inject traps on one API in Windows (NtCreateUserProcess),
in the same way that Drakvuf does.

I did some quick testing yesterday, with a Python script that was repeatedly
starting the binary to monitor the API, and at the same time starting Ansible
to run "c:\Windows\system32\reg.exe /?" via WinRM, to trigger some process creation.

The traps are working, I see the software breakpoint hit, switching to the default
view for singlestepping, and switching back to the execution view, so that's already good.

After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
- frozen: the mouse doesn't move: so I would guess the VCPU are blocked.

I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies,
but It's always synchronous, so I doubt that they interfered and "paused" the domain.

Also, the log output I have before I detect that Ansible failed to execute is that the resume succeded and
Xen is ready to process VMI events.

- BSOD: that's the second possibility, apparently I'm corrupting critical data structure in the operating system,
and the Windbg analysis is inconclusive, so I can't tell much.

Either way, I can't execute this test sequentially 10 000 times without a crash.

-> Could you look at the implementation, and tell me if I misused the APIs somewhere ?
https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7

I used the compat APIs, like Drakvuf does.

@Tamas, if you could check the traps implementation.

You also have stress-test.py, which is the small test suite that I used, and
the screenshot showing the stdout preceding a test failure,
when Ansible couldn't contact WinRM service because the domain was frozen.

Note: I stole some code from libvmi, to handle page read/write in Xen.

PS: in the case where the domain is frozen, and I destroy the domain, a (null) entry will remain
in xl list, despite that my stress-test.py process is already dead.

I have 4 of these entries in my xl list right now.
Might be worth looking into it also.

Best regards,
Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-28 12:33             ` Mathieu Tarral
@ 2019-05-28 12:33               ` Mathieu Tarral
  2019-05-29  0:49               ` Andrew Cooper
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-05-28 12:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel

Hi Andrew,

> > The bug is still here, so we can exclude a microcode issue.
>
> Good - that is one further angle excluded.  Always make sure you are
> running with up-to-date microcode, but it looks like we back to
> investigating a logical bug in libvmi or Xen.


I reimplemented a small test, without the Drakvuf/Libvmi layers, that will inject traps on one API in Windows (NtCreateUserProcess),
in the same way that Drakvuf does.

I did some quick testing yesterday, with a Python script that was repeatedly
starting the binary to monitor the API, and at the same time starting Ansible
to run "c:\Windows\system32\reg.exe /?" via WinRM, to trigger some process creation.

The traps are working, I see the software breakpoint hit, switching to the default
view for singlestepping, and switching back to the execution view, so that's already good.

After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
- frozen: the mouse doesn't move: so I would guess the VCPU are blocked.

I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies,
but It's always synchronous, so I doubt that they interfered and "paused" the domain.

Also, the log output I have before I detect that Ansible failed to execute is that the resume succeded and
Xen is ready to process VMI events.

- BSOD: that's the second possibility, apparently I'm corrupting critical data structure in the operating system,
and the Windbg analysis is inconclusive, so I can't tell much.

Either way, I can't execute this test sequentially 10 000 times without a crash.

-> Could you look at the implementation, and tell me if I misused the APIs somewhere ?
https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7

I used the compat APIs, like Drakvuf does.

@Tamas, if you could check the traps implementation.

You also have stress-test.py, which is the small test suite that I used, and
the screenshot showing the stdout preceding a test failure,
when Ansible couldn't contact WinRM service because the domain was frozen.

Note: I stole some code from libvmi, to handle page read/write in Xen.

PS: in the case where the domain is frozen, and I destroy the domain, a (null) entry will remain
in xl list, despite that my stress-test.py process is already dead.

I have 4 of these entries in my xl list right now.
Might be worth looking into it also.

Best regards,
Mathieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-28 12:33             ` Mathieu Tarral
  2019-05-28 12:33               ` [Xen-devel] " Mathieu Tarral
@ 2019-05-29  0:49               ` Andrew Cooper
  2019-05-29  0:49                 ` [Xen-devel] " Andrew Cooper
                                   ` (2 more replies)
  2019-05-29  1:58               ` Tamas K Lengyel
  2019-05-29 15:26               ` Tamas K Lengyel
  3 siblings, 3 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-29  0:49 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 5170 bytes --]

On 28/05/2019 13:33, Mathieu Tarral wrote:
> Hi Andrew,
>
>>> The bug is still here, so we can exclude a microcode issue.
>> Good - that is one further angle excluded.  Always make sure you are
>> running with up-to-date microcode, but it looks like we back to
>> investigating a logical bug in libvmi or Xen.
>
> I reimplemented a small test, without the Drakvuf/Libvmi layers, that will inject traps on one API in Windows (NtCreateUserProcess),
> in the same way that Drakvuf does.
>
> I did some quick testing yesterday, with a Python script that was repeatedly
> starting the binary to monitor the API, and at the same time starting Ansible
> to run "c:\Windows\system32\reg.exe /?" via WinRM, to trigger some process creation.
>
> The traps are working, I see the software breakpoint hit, switching to the default
> view for singlestepping, and switching back to the execution view, so that's already good.
>
> After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
> - frozen: the mouse doesn't move: so I would guess the VCPU are blocked.
>
> I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies,
> but It's always synchronous, so I doubt that they interfered and "paused" the domain.

xc_{,un}pause_domain() calls are reference counted.  Calling unpause too
many times should be safe (from a refcount point of view), and should
fail the hypercall with -EINVAL.

>
> Also, the log output I have before I detect that Ansible failed to execute is that the resume succeded and
> Xen is ready to process VMI events.
>
> - BSOD: that's the second possibility, apparently I'm corrupting critical data structure in the operating system,
> and the Windbg analysis is inconclusive, so I can't tell much.
>
> Either way, I can't execute this test sequentially 10 000 times without a crash.

Ok good - this is a far easier place to start debugging from.

> -> Could you look at the implementation, and tell me if I misused the APIs somewhere ?
> https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7

Some observations.

1)  In xen_pause_vm(), you do an xc_domain_getinfo().  First of all, the
API is crazy, so you also need to check "|| info.domid != domid" in your
error condition, but that shouldn't be an issue here as the domid isn't
changing. 

Furthermore, the results of xc_domain_getinfo() are stale before the
hypercall returns, so it is far less useful than it appears.

I'm afraid that the only safe way to issue pause/unpauses is to know
that you've reference counted your own correctly.  All entities in dom0
with privilege can fight over each others references, because there is
nothing Xen can use to distinguish the requests.

2) You allocate a libxl context but do nothing with it.  That can all
go, along with the linkage against libxl.  Also, you don't need to
create a logger like that.  Despite being utterly unacceptable behaviour
for a library, it is the default by passing NULL in xc_interface_open().

3) A malloc()/memset() pair is more commonly spelt calloc()


And some questions.

1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is
specific to the exact windows kernel in use?

2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You
add one extra gfn to the guest, called zero_page, and fill it with 1's
from fmask.

3) You create two altp2m's, but both have the same default access.  Is
this deliberate, or a bug?  If deliberate, why?

Finally, and probably the source of the memory corruption...

4) When injecting a trap, you allocate a new gfn, memcpy() the contents
and insert a 0xcc (so far so good).  You then remap the executable view
to point at the new gfn with a breakpoint in (fine), and remap the
readable view to point at the zero_page, which is full of 1's (uh-oh).

What is this final step trying to achieve?  It guarantees that
patch-guard will eventually notice and BSOD your VM for critical
structure corruption.  The read-only view needs to point to the original
gfn with only read permissions, so when Windows reads the gfn back, it
sees what it expects.  You also need to prohibit writes to either gfn so
you can spot writes (unlikely in this case but important for general
introspection) so you can propagate the change to both copies.

>
> I used the compat APIs, like Drakvuf does.
>
> @Tamas, if you could check the traps implementation.
>
> You also have stress-test.py, which is the small test suite that I used, and
> the screenshot showing the stdout preceding a test failure,
> when Ansible couldn't contact WinRM service because the domain was frozen.
>
> Note: I stole some code from libvmi, to handle page read/write in Xen.
>
> PS: in the case where the domain is frozen, and I destroy the domain, a (null) entry will remain
> in xl list, despite that my stress-test.py process is already dead.
>
> I have 4 of these entries in my xl list right now.

That's almost certainly a reference not being dropped on a page.  Can
you run `xl debug-keys q` and paste the resulting analysis which will be
visible in `xl dmesg`?

It is probably some missing cleanup in the altp2m code.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 7112 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29  0:49               ` Andrew Cooper
@ 2019-05-29  0:49                 ` Andrew Cooper
  2019-05-29  1:34                 ` Tamas K Lengyel
  2019-06-11 12:31                 ` Mathieu Tarral
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-29  0:49 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 5170 bytes --]

On 28/05/2019 13:33, Mathieu Tarral wrote:
> Hi Andrew,
>
>>> The bug is still here, so we can exclude a microcode issue.
>> Good - that is one further angle excluded.  Always make sure you are
>> running with up-to-date microcode, but it looks like we back to
>> investigating a logical bug in libvmi or Xen.
>
> I reimplemented a small test, without the Drakvuf/Libvmi layers, that will inject traps on one API in Windows (NtCreateUserProcess),
> in the same way that Drakvuf does.
>
> I did some quick testing yesterday, with a Python script that was repeatedly
> starting the binary to monitor the API, and at the same time starting Ansible
> to run "c:\Windows\system32\reg.exe /?" via WinRM, to trigger some process creation.
>
> The traps are working, I see the software breakpoint hit, switching to the default
> view for singlestepping, and switching back to the execution view, so that's already good.
>
> After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
> - frozen: the mouse doesn't move: so I would guess the VCPU are blocked.
>
> I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies,
> but It's always synchronous, so I doubt that they interfered and "paused" the domain.

xc_{,un}pause_domain() calls are reference counted.  Calling unpause too
many times should be safe (from a refcount point of view), and should
fail the hypercall with -EINVAL.

>
> Also, the log output I have before I detect that Ansible failed to execute is that the resume succeded and
> Xen is ready to process VMI events.
>
> - BSOD: that's the second possibility, apparently I'm corrupting critical data structure in the operating system,
> and the Windbg analysis is inconclusive, so I can't tell much.
>
> Either way, I can't execute this test sequentially 10 000 times without a crash.

Ok good - this is a far easier place to start debugging from.

> -> Could you look at the implementation, and tell me if I misused the APIs somewhere ?
> https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7

Some observations.

1)  In xen_pause_vm(), you do an xc_domain_getinfo().  First of all, the
API is crazy, so you also need to check "|| info.domid != domid" in your
error condition, but that shouldn't be an issue here as the domid isn't
changing. 

Furthermore, the results of xc_domain_getinfo() are stale before the
hypercall returns, so it is far less useful than it appears.

I'm afraid that the only safe way to issue pause/unpauses is to know
that you've reference counted your own correctly.  All entities in dom0
with privilege can fight over each others references, because there is
nothing Xen can use to distinguish the requests.

2) You allocate a libxl context but do nothing with it.  That can all
go, along with the linkage against libxl.  Also, you don't need to
create a logger like that.  Despite being utterly unacceptable behaviour
for a library, it is the default by passing NULL in xc_interface_open().

3) A malloc()/memset() pair is more commonly spelt calloc()


And some questions.

1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is
specific to the exact windows kernel in use?

2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You
add one extra gfn to the guest, called zero_page, and fill it with 1's
from fmask.

3) You create two altp2m's, but both have the same default access.  Is
this deliberate, or a bug?  If deliberate, why?

Finally, and probably the source of the memory corruption...

4) When injecting a trap, you allocate a new gfn, memcpy() the contents
and insert a 0xcc (so far so good).  You then remap the executable view
to point at the new gfn with a breakpoint in (fine), and remap the
readable view to point at the zero_page, which is full of 1's (uh-oh).

What is this final step trying to achieve?  It guarantees that
patch-guard will eventually notice and BSOD your VM for critical
structure corruption.  The read-only view needs to point to the original
gfn with only read permissions, so when Windows reads the gfn back, it
sees what it expects.  You also need to prohibit writes to either gfn so
you can spot writes (unlikely in this case but important for general
introspection) so you can propagate the change to both copies.

>
> I used the compat APIs, like Drakvuf does.
>
> @Tamas, if you could check the traps implementation.
>
> You also have stress-test.py, which is the small test suite that I used, and
> the screenshot showing the stdout preceding a test failure,
> when Ansible couldn't contact WinRM service because the domain was frozen.
>
> Note: I stole some code from libvmi, to handle page read/write in Xen.
>
> PS: in the case where the domain is frozen, and I destroy the domain, a (null) entry will remain
> in xl list, despite that my stress-test.py process is already dead.
>
> I have 4 of these entries in my xl list right now.

That's almost certainly a reference not being dropped on a page.  Can
you run `xl debug-keys q` and paste the resulting analysis which will be
visible in `xl dmesg`?

It is probably some missing cleanup in the altp2m code.

~Andrew

[-- Attachment #1.2: Type: text/html, Size: 7112 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-29  0:49               ` Andrew Cooper
  2019-05-29  0:49                 ` [Xen-devel] " Andrew Cooper
@ 2019-05-29  1:34                 ` Tamas K Lengyel
  2019-05-29  1:34                   ` [Xen-devel] " Tamas K Lengyel
  2019-05-29  3:58                   ` Andrew Cooper
  2019-06-11 12:31                 ` Mathieu Tarral
  2 siblings, 2 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29  1:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

> And some questions.
>
> 1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is specific to the exact windows kernel in use?
>
> 2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You add one extra gfn to the guest, called zero_page, and fill it with 1's from fmask.
>
> 3) You create two altp2m's, but both have the same default access.  Is this deliberate, or a bug?  If deliberate, why?
>
> Finally, and probably the source of the memory corruption...
>
> 4) When injecting a trap, you allocate a new gfn, memcpy() the contents and insert a 0xcc (so far so good).  You then remap the executable view to point at the new gfn with a breakpoint in (fine), and remap the readable view to point at the zero_page, which is full of 1's (uh-oh).
>
> What is this final step trying to achieve?  It guarantees that patch-guard will eventually notice and BSOD your VM for critical structure corruption.  The read-only view needs to point to the original gfn with only read permissions, so when Windows reads the gfn back, it sees what it expects.  You also need to prohibit writes to either gfn so you can spot writes (unlikely in this case but important for general introspection) so you can propagate the change to both copies.

I can answer these questions based on how I've set it up in DRAKVUF
(haven't checked Mathiue's implementation and I won't be able for a
while as I'm traveling at the moment). The reason we have multiple EPT
views is to make it all multi-vCPU safe. Swapping the views around can
be done per-vCPU and we don't have to remove/reapply breakpoints all
the time and pause the entire domain while doing so.

There are three views being used: the default (hostp2m); the
execute-view which is active by default and has the remapped
shadow-copies of the pages with breakpoints injected at the very end
of the guests' physmap; and the read-only view that is only used when
someone is trying to read the actual address of a shadow-copy at the
end of the physmap (ie. not via the remapped gfn).

The read-only view has all shadow-copy gfn's remapped to that one page
full of 1's, because if you read random large gfn's in the guests'
memory space that's what Xen's emulator returns. It is called
zero_page because I originally guessed that those pages should be all
0 but it turned out I was wrong. Just haven't change the name of it
yet. This page is there because we want to avoid someone being able to
find out that there are shadow pages present. It would be quite
obvious something is "odd" when you can find copies of the Windows
kernel memory pages at the end of the memory space. So the shadow
pages' real GFN mem_access is restricted in the execute view, which
allows us to switch to the read-only view with MTF enabled and then
back afterwards. That way the shadow pages are not visible to the
guest, if someone tries to read them they return the same value and
behave the same as all other unmapped gfn's in that memory region.

So since the read-only view with all the 1's is rarely used, let's
talk about why patchguard can't notice changes to the kernel: the
execute-view has all pages that were breakpointed remapped and marked
execute-only. When patchguard tries to read these pages, the view is
swapped back to the hostp2m with MTF enabled. Then in the MTF callback
the view is swapped back to the execute-view. This means that
patchguard only ever reads the original page from the hostp2m. If the
page is being written to, the same dance happens with the addition of
the whole page being re-copied to the shadow location and the
breakpoints being reapplied on the shadow copy. This copy happens
while the whole domain is paused to avoid race-condition.

I hope this makes sense.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29  1:34                 ` Tamas K Lengyel
@ 2019-05-29  1:34                   ` Tamas K Lengyel
  2019-05-29  3:58                   ` Andrew Cooper
  1 sibling, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29  1:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

> And some questions.
>
> 1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is specific to the exact windows kernel in use?
>
> 2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You add one extra gfn to the guest, called zero_page, and fill it with 1's from fmask.
>
> 3) You create two altp2m's, but both have the same default access.  Is this deliberate, or a bug?  If deliberate, why?
>
> Finally, and probably the source of the memory corruption...
>
> 4) When injecting a trap, you allocate a new gfn, memcpy() the contents and insert a 0xcc (so far so good).  You then remap the executable view to point at the new gfn with a breakpoint in (fine), and remap the readable view to point at the zero_page, which is full of 1's (uh-oh).
>
> What is this final step trying to achieve?  It guarantees that patch-guard will eventually notice and BSOD your VM for critical structure corruption.  The read-only view needs to point to the original gfn with only read permissions, so when Windows reads the gfn back, it sees what it expects.  You also need to prohibit writes to either gfn so you can spot writes (unlikely in this case but important for general introspection) so you can propagate the change to both copies.

I can answer these questions based on how I've set it up in DRAKVUF
(haven't checked Mathiue's implementation and I won't be able for a
while as I'm traveling at the moment). The reason we have multiple EPT
views is to make it all multi-vCPU safe. Swapping the views around can
be done per-vCPU and we don't have to remove/reapply breakpoints all
the time and pause the entire domain while doing so.

There are three views being used: the default (hostp2m); the
execute-view which is active by default and has the remapped
shadow-copies of the pages with breakpoints injected at the very end
of the guests' physmap; and the read-only view that is only used when
someone is trying to read the actual address of a shadow-copy at the
end of the physmap (ie. not via the remapped gfn).

The read-only view has all shadow-copy gfn's remapped to that one page
full of 1's, because if you read random large gfn's in the guests'
memory space that's what Xen's emulator returns. It is called
zero_page because I originally guessed that those pages should be all
0 but it turned out I was wrong. Just haven't change the name of it
yet. This page is there because we want to avoid someone being able to
find out that there are shadow pages present. It would be quite
obvious something is "odd" when you can find copies of the Windows
kernel memory pages at the end of the memory space. So the shadow
pages' real GFN mem_access is restricted in the execute view, which
allows us to switch to the read-only view with MTF enabled and then
back afterwards. That way the shadow pages are not visible to the
guest, if someone tries to read them they return the same value and
behave the same as all other unmapped gfn's in that memory region.

So since the read-only view with all the 1's is rarely used, let's
talk about why patchguard can't notice changes to the kernel: the
execute-view has all pages that were breakpointed remapped and marked
execute-only. When patchguard tries to read these pages, the view is
swapped back to the hostp2m with MTF enabled. Then in the MTF callback
the view is swapped back to the execute-view. This means that
patchguard only ever reads the original page from the hostp2m. If the
page is being written to, the same dance happens with the addition of
the whole page being re-copied to the shadow location and the
breakpoints being reapplied on the shadow copy. This copy happens
while the whole domain is paused to avoid race-condition.

I hope this makes sense.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-28 12:33             ` Mathieu Tarral
  2019-05-28 12:33               ` [Xen-devel] " Mathieu Tarral
  2019-05-29  0:49               ` Andrew Cooper
@ 2019-05-29  1:58               ` Tamas K Lengyel
  2019-05-29  1:58                 ` [Xen-devel] " Tamas K Lengyel
  2019-05-29 15:26               ` Tamas K Lengyel
  3 siblings, 1 reply; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29  1:58 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> @Tamas, if you could check the traps implementation.

I had a quick look and it seems like you forgot to set the mem_access
permissions on the pages. You need the remapped gfn's to be marked
execute-only in the altp2m_idx view, and their actual gfn completely
inaccessible in altp2m_idx. You need to swap the views when those
memory access permissions are violated accordingly to the hostp2m or
to the altp2m_idr view. Without that patchguard is going to bluescreen
you.

Also, if you copy code from the DRAKVUF code-base please make sure you
apply the license that comes with that code (it's not plain GPL).

Thanks,
Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29  1:58               ` Tamas K Lengyel
@ 2019-05-29  1:58                 ` Tamas K Lengyel
  0 siblings, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29  1:58 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> @Tamas, if you could check the traps implementation.

I had a quick look and it seems like you forgot to set the mem_access
permissions on the pages. You need the remapped gfn's to be marked
execute-only in the altp2m_idx view, and their actual gfn completely
inaccessible in altp2m_idx. You need to swap the views when those
memory access permissions are violated accordingly to the hostp2m or
to the altp2m_idr view. Without that patchguard is going to bluescreen
you.

Also, if you copy code from the DRAKVUF code-base please make sure you
apply the license that comes with that code (it's not plain GPL).

Thanks,
Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-29  1:34                 ` Tamas K Lengyel
  2019-05-29  1:34                   ` [Xen-devel] " Tamas K Lengyel
@ 2019-05-29  3:58                   ` Andrew Cooper
  2019-05-29  3:58                     ` [Xen-devel] " Andrew Cooper
  2019-05-29 15:21                     ` Tamas K Lengyel
  1 sibling, 2 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-29  3:58 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 29/05/2019 02:34, Tamas K Lengyel wrote:
>> And some questions.
>>
>> 1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is specific to the exact windows kernel in use?
>>
>> 2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You add one extra gfn to the guest, called zero_page, and fill it with 1's from fmask.
>>
>> 3) You create two altp2m's, but both have the same default access.  Is this deliberate, or a bug?  If deliberate, why?
>>
>> Finally, and probably the source of the memory corruption...
>>
>> 4) When injecting a trap, you allocate a new gfn, memcpy() the contents and insert a 0xcc (so far so good).  You then remap the executable view to point at the new gfn with a breakpoint in (fine), and remap the readable view to point at the zero_page, which is full of 1's (uh-oh).
>>
>> What is this final step trying to achieve?  It guarantees that patch-guard will eventually notice and BSOD your VM for critical structure corruption.  The read-only view needs to point to the original gfn with only read permissions, so when Windows reads the gfn back, it sees what it expects.  You also need to prohibit writes to either gfn so you can spot writes (unlikely in this case but important for general introspection) so you can propagate the change to both copies.
> I can answer these questions based on how I've set it up in DRAKVUF
> (haven't checked Mathiue's implementation and I won't be able for a
> while as I'm traveling at the moment). The reason we have multiple EPT
> views is to make it all multi-vCPU safe. Swapping the views around can
> be done per-vCPU and we don't have to remove/reapply breakpoints all
> the time and pause the entire domain while doing so.

Yup - makes sense.

> There are three views being used: the default (hostp2m); the
> execute-view which is active by default and has the remapped
> shadow-copies of the pages with breakpoints injected at the very end
> of the guests' physmap; and the read-only view that is only used when
> someone is trying to read the actual address of a shadow-copy at the
> end of the physmap (ie. not via the remapped gfn).

Perhaps the terminology could be improved then, seeing as the main view
isn't really "execute restricted".  It is only X-- for the few remapped
gfns, and XWR for the vast majority.

Inside the X-- gfns, there are one (or more) 0xcc breakpoints which can
be hit, which trap out to the VMI agent.

(Following Mathieu's code) on encountering this, we switch altp2m back
to the hostp2m, and enable MTF.  We let one instruction execute,
trapping out on the MTF signal, and switch back to the default view (1)
which is fractionally X-- restricted.

Now - this doesn't cope with the case where the 0xcc'd instruction was a
write into one of the X-- pages, because switching back to the hostp2m
gives full XWR permissions to everything, including the VMI-additional
pages. 

However, given the nature of the breakpoint position, it is almost
certainly intercepting a benign stack operation in this specific case. 
It would be useful to disassemble the head of NtCreateUserProcess and
confirm.

> The read-only view has all shadow-copy gfn's remapped to that one page
> full of 1's, because if you read random large gfn's in the guests'
> memory space that's what Xen's emulator returns. It is called
> zero_page because I originally guessed that those pages should be all
> 0 but it turned out I was wrong. Just haven't change the name of it
> yet. This page is there because we want to avoid someone being able to
> find out that there are shadow pages present. It would be quite
> obvious something is "odd" when you can find copies of the Windows
> kernel memory pages at the end of the memory space. So the shadow
> pages' real GFN mem_access is restricted in the execute view, which
> allows us to switch to the read-only view with MTF enabled and then
> back afterwards. That way the shadow pages are not visible to the
> guest, if someone tries to read them they return the same value and
> behave the same as all other unmapped gfn's in that memory region.

Ahh ok.  Yes - write-discard/read ~0 is a staple of "nothing present",
both in IO and MMIO space on x86.  In which case a better name would be
sink_page or similar.

Also, I see now that that is what Mathieu's code is doing (even though
this view isn't used at all, so far as I can tell), so consider the
question answered, and we're back to square 1 on the BSOD.

As identified before, it needs to be only ever mapped read-only, because
sinking real writes into it would be a BadThing(tm).  We do actually
have a p2m_type_ro which would hopefully cause emulated instructions to
DTRT as well, which should be faster than sending a write event all the
way to the VMI agent.

> So since the read-only view with all the 1's is rarely used, let's
> talk about why patchguard can't notice changes to the kernel:

Well... In this case it really is.  We can certainly talk about why
patchguard *shouldn’t* notice :)

> the execute-view has all pages that were breakpointed remapped and marked
> execute-only. When patchguard tries to read these pages, the view is
> swapped back to the hostp2m with MTF enabled. Then in the MTF callback
> the view is swapped back to the execute-view. This means that
> patchguard only ever reads the original page from the hostp2m. If the
> page is being written to, the same dance happens with the addition of
> the whole page being re-copied to the shadow location and the
> breakpoints being reapplied on the shadow copy. This copy happens
> while the whole domain is paused to avoid race-condition.
>
> I hope this makes sense.

The dance with the read-only view doesn't happen in the simplified case,
but as both you and I have noticed, there looks to be issues with the
page permissions which are probably confounding the problems.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29  3:58                   ` Andrew Cooper
@ 2019-05-29  3:58                     ` Andrew Cooper
  2019-05-29 15:21                     ` Tamas K Lengyel
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Cooper @ 2019-05-29  3:58 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel, Mathieu Tarral

On 29/05/2019 02:34, Tamas K Lengyel wrote:
>> And some questions.
>>
>> 1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is specific to the exact windows kernel in use?
>>
>> 2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You add one extra gfn to the guest, called zero_page, and fill it with 1's from fmask.
>>
>> 3) You create two altp2m's, but both have the same default access.  Is this deliberate, or a bug?  If deliberate, why?
>>
>> Finally, and probably the source of the memory corruption...
>>
>> 4) When injecting a trap, you allocate a new gfn, memcpy() the contents and insert a 0xcc (so far so good).  You then remap the executable view to point at the new gfn with a breakpoint in (fine), and remap the readable view to point at the zero_page, which is full of 1's (uh-oh).
>>
>> What is this final step trying to achieve?  It guarantees that patch-guard will eventually notice and BSOD your VM for critical structure corruption.  The read-only view needs to point to the original gfn with only read permissions, so when Windows reads the gfn back, it sees what it expects.  You also need to prohibit writes to either gfn so you can spot writes (unlikely in this case but important for general introspection) so you can propagate the change to both copies.
> I can answer these questions based on how I've set it up in DRAKVUF
> (haven't checked Mathiue's implementation and I won't be able for a
> while as I'm traveling at the moment). The reason we have multiple EPT
> views is to make it all multi-vCPU safe. Swapping the views around can
> be done per-vCPU and we don't have to remove/reapply breakpoints all
> the time and pause the entire domain while doing so.

Yup - makes sense.

> There are three views being used: the default (hostp2m); the
> execute-view which is active by default and has the remapped
> shadow-copies of the pages with breakpoints injected at the very end
> of the guests' physmap; and the read-only view that is only used when
> someone is trying to read the actual address of a shadow-copy at the
> end of the physmap (ie. not via the remapped gfn).

Perhaps the terminology could be improved then, seeing as the main view
isn't really "execute restricted".  It is only X-- for the few remapped
gfns, and XWR for the vast majority.

Inside the X-- gfns, there are one (or more) 0xcc breakpoints which can
be hit, which trap out to the VMI agent.

(Following Mathieu's code) on encountering this, we switch altp2m back
to the hostp2m, and enable MTF.  We let one instruction execute,
trapping out on the MTF signal, and switch back to the default view (1)
which is fractionally X-- restricted.

Now - this doesn't cope with the case where the 0xcc'd instruction was a
write into one of the X-- pages, because switching back to the hostp2m
gives full XWR permissions to everything, including the VMI-additional
pages. 

However, given the nature of the breakpoint position, it is almost
certainly intercepting a benign stack operation in this specific case. 
It would be useful to disassemble the head of NtCreateUserProcess and
confirm.

> The read-only view has all shadow-copy gfn's remapped to that one page
> full of 1's, because if you read random large gfn's in the guests'
> memory space that's what Xen's emulator returns. It is called
> zero_page because I originally guessed that those pages should be all
> 0 but it turned out I was wrong. Just haven't change the name of it
> yet. This page is there because we want to avoid someone being able to
> find out that there are shadow pages present. It would be quite
> obvious something is "odd" when you can find copies of the Windows
> kernel memory pages at the end of the memory space. So the shadow
> pages' real GFN mem_access is restricted in the execute view, which
> allows us to switch to the read-only view with MTF enabled and then
> back afterwards. That way the shadow pages are not visible to the
> guest, if someone tries to read them they return the same value and
> behave the same as all other unmapped gfn's in that memory region.

Ahh ok.  Yes - write-discard/read ~0 is a staple of "nothing present",
both in IO and MMIO space on x86.  In which case a better name would be
sink_page or similar.

Also, I see now that that is what Mathieu's code is doing (even though
this view isn't used at all, so far as I can tell), so consider the
question answered, and we're back to square 1 on the BSOD.

As identified before, it needs to be only ever mapped read-only, because
sinking real writes into it would be a BadThing(tm).  We do actually
have a p2m_type_ro which would hopefully cause emulated instructions to
DTRT as well, which should be faster than sending a write event all the
way to the VMI agent.

> So since the read-only view with all the 1's is rarely used, let's
> talk about why patchguard can't notice changes to the kernel:

Well... In this case it really is.  We can certainly talk about why
patchguard *shouldn’t* notice :)

> the execute-view has all pages that were breakpointed remapped and marked
> execute-only. When patchguard tries to read these pages, the view is
> swapped back to the hostp2m with MTF enabled. Then in the MTF callback
> the view is swapped back to the execute-view. This means that
> patchguard only ever reads the original page from the hostp2m. If the
> page is being written to, the same dance happens with the addition of
> the whole page being re-copied to the shadow location and the
> breakpoints being reapplied on the shadow copy. This copy happens
> while the whole domain is paused to avoid race-condition.
>
> I hope this makes sense.

The dance with the read-only view doesn't happen in the simplified case,
but as both you and I have noticed, there looks to be issues with the
page permissions which are probably confounding the problems.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-29  3:58                   ` Andrew Cooper
  2019-05-29  3:58                     ` [Xen-devel] " Andrew Cooper
@ 2019-05-29 15:21                     ` Tamas K Lengyel
  2019-05-29 15:21                       ` [Xen-devel] " Tamas K Lengyel
  1 sibling, 1 reply; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29 15:21 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

> > There are three views being used: the default (hostp2m); the
> > execute-view which is active by default and has the remapped
> > shadow-copies of the pages with breakpoints injected at the very end
> > of the guests' physmap; and the read-only view that is only used when
> > someone is trying to read the actual address of a shadow-copy at the
> > end of the physmap (ie. not via the remapped gfn).
>
> Perhaps the terminology could be improved then, seeing as the main view
> isn't really "execute restricted".  It is only X-- for the few remapped
> gfns, and XWR for the vast majority.

Correct and most certainly the terminology is not great :) The
altp2m_idx view actually starts out mainly empty with only the
remapped pages being populated in it - the rest gets lazily copied
from the hostp2m as the VM runs (and copies the default XWR permission
over).

> > The read-only view has all shadow-copy gfn's remapped to that one page
> > full of 1's, because if you read random large gfn's in the guests'
> > memory space that's what Xen's emulator returns. It is called
> > zero_page because I originally guessed that those pages should be all
> > 0 but it turned out I was wrong. Just haven't change the name of it
> > yet. This page is there because we want to avoid someone being able to
> > find out that there are shadow pages present. It would be quite
> > obvious something is "odd" when you can find copies of the Windows
> > kernel memory pages at the end of the memory space. So the shadow
> > pages' real GFN mem_access is restricted in the execute view, which
> > allows us to switch to the read-only view with MTF enabled and then
> > back afterwards. That way the shadow pages are not visible to the
> > guest, if someone tries to read them they return the same value and
> > behave the same as all other unmapped gfn's in that memory region.
>
> Ahh ok.  Yes - write-discard/read ~0 is a staple of "nothing present",
> both in IO and MMIO space on x86.  In which case a better name would be
> sink_page or similar.

Indeed.

> Also, I see now that that is what Mathieu's code is doing (even though
> this view isn't used at all, so far as I can tell), so consider the
> question answered, and we're back to square 1 on the BSOD.
>
> As identified before, it needs to be only ever mapped read-only, because
> sinking real writes into it would be a BadThing(tm).  We do actually
> have a p2m_type_ro which would hopefully cause emulated instructions to
> DTRT as well, which should be faster than sending a write event all the
> way to the VMI agent.

Any write-violation to the sink page gets emulated with
VM_EVENT_FLAG_EMULATE_NOWRITE. Speed isn't really an issue because
this is an extreme corner case that never happens unless someone is
doing something very peculiar, in which case it's good to be able to
log it ;)

>
> > So since the read-only view with all the 1's is rarely used, let's
> > talk about why patchguard can't notice changes to the kernel:
>
> Well... In this case it really is.  We can certainly talk about why
> patchguard *shouldn’t* notice :)
>
> > the execute-view has all pages that were breakpointed remapped and marked
> > execute-only. When patchguard tries to read these pages, the view is
> > swapped back to the hostp2m with MTF enabled. Then in the MTF callback
> > the view is swapped back to the execute-view. This means that
> > patchguard only ever reads the original page from the hostp2m. If the
> > page is being written to, the same dance happens with the addition of
> > the whole page being re-copied to the shadow location and the
> > breakpoints being reapplied on the shadow copy. This copy happens
> > while the whole domain is paused to avoid race-condition.
> >
> > I hope this makes sense.
>
> The dance with the read-only view doesn't happen in the simplified case,
> but as both you and I have noticed, there looks to be issues with the
> page permissions which are probably confounding the problems.
>

Correct, the read-only view is not something that would be used during
normal execution of Windows. Mathieu's implementation is also
incomplete as he is not applying the memory permissions. I
double-checked in DRAKVUF and the memory permissions are definitely
set so I really don't think its patchguard that catches modification
to the kernel pages is what's behind the BSOD. If that was the case
then simply letting it run for a while would trigger it. But that's
not the case, I can run DRAKVUF for several hours with no BSOD. To me
it seems to be the repeated stop/start that has something to do with
it.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29 15:21                     ` Tamas K Lengyel
@ 2019-05-29 15:21                       ` Tamas K Lengyel
  0 siblings, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29 15:21 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Mathieu Tarral

> > There are three views being used: the default (hostp2m); the
> > execute-view which is active by default and has the remapped
> > shadow-copies of the pages with breakpoints injected at the very end
> > of the guests' physmap; and the read-only view that is only used when
> > someone is trying to read the actual address of a shadow-copy at the
> > end of the physmap (ie. not via the remapped gfn).
>
> Perhaps the terminology could be improved then, seeing as the main view
> isn't really "execute restricted".  It is only X-- for the few remapped
> gfns, and XWR for the vast majority.

Correct and most certainly the terminology is not great :) The
altp2m_idx view actually starts out mainly empty with only the
remapped pages being populated in it - the rest gets lazily copied
from the hostp2m as the VM runs (and copies the default XWR permission
over).

> > The read-only view has all shadow-copy gfn's remapped to that one page
> > full of 1's, because if you read random large gfn's in the guests'
> > memory space that's what Xen's emulator returns. It is called
> > zero_page because I originally guessed that those pages should be all
> > 0 but it turned out I was wrong. Just haven't change the name of it
> > yet. This page is there because we want to avoid someone being able to
> > find out that there are shadow pages present. It would be quite
> > obvious something is "odd" when you can find copies of the Windows
> > kernel memory pages at the end of the memory space. So the shadow
> > pages' real GFN mem_access is restricted in the execute view, which
> > allows us to switch to the read-only view with MTF enabled and then
> > back afterwards. That way the shadow pages are not visible to the
> > guest, if someone tries to read them they return the same value and
> > behave the same as all other unmapped gfn's in that memory region.
>
> Ahh ok.  Yes - write-discard/read ~0 is a staple of "nothing present",
> both in IO and MMIO space on x86.  In which case a better name would be
> sink_page or similar.

Indeed.

> Also, I see now that that is what Mathieu's code is doing (even though
> this view isn't used at all, so far as I can tell), so consider the
> question answered, and we're back to square 1 on the BSOD.
>
> As identified before, it needs to be only ever mapped read-only, because
> sinking real writes into it would be a BadThing(tm).  We do actually
> have a p2m_type_ro which would hopefully cause emulated instructions to
> DTRT as well, which should be faster than sending a write event all the
> way to the VMI agent.

Any write-violation to the sink page gets emulated with
VM_EVENT_FLAG_EMULATE_NOWRITE. Speed isn't really an issue because
this is an extreme corner case that never happens unless someone is
doing something very peculiar, in which case it's good to be able to
log it ;)

>
> > So since the read-only view with all the 1's is rarely used, let's
> > talk about why patchguard can't notice changes to the kernel:
>
> Well... In this case it really is.  We can certainly talk about why
> patchguard *shouldn’t* notice :)
>
> > the execute-view has all pages that were breakpointed remapped and marked
> > execute-only. When patchguard tries to read these pages, the view is
> > swapped back to the hostp2m with MTF enabled. Then in the MTF callback
> > the view is swapped back to the execute-view. This means that
> > patchguard only ever reads the original page from the hostp2m. If the
> > page is being written to, the same dance happens with the addition of
> > the whole page being re-copied to the shadow location and the
> > breakpoints being reapplied on the shadow copy. This copy happens
> > while the whole domain is paused to avoid race-condition.
> >
> > I hope this makes sense.
>
> The dance with the read-only view doesn't happen in the simplified case,
> but as both you and I have noticed, there looks to be issues with the
> page permissions which are probably confounding the problems.
>

Correct, the read-only view is not something that would be used during
normal execution of Windows. Mathieu's implementation is also
incomplete as he is not applying the memory permissions. I
double-checked in DRAKVUF and the memory permissions are definitely
set so I really don't think its patchguard that catches modification
to the kernel pages is what's behind the BSOD. If that was the case
then simply letting it run for a while would trigger it. But that's
not the case, I can run DRAKVUF for several hours with no BSOD. To me
it seems to be the repeated stop/start that has something to do with
it.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [VMI] Possible race-condition in altp2m APIs
  2019-05-28 12:33             ` Mathieu Tarral
                                 ` (2 preceding siblings ...)
  2019-05-29  1:58               ` Tamas K Lengyel
@ 2019-05-29 15:26               ` Tamas K Lengyel
  2019-05-29 15:26                 ` [Xen-devel] " Tamas K Lengyel
  3 siblings, 1 reply; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29 15:26 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
> - frozen: the mouse doesn't move: so I would guess the VCPU are blocked.

You probably have events pending on the ring that you didn't process.
After you pause the domain and remove the registered evenst,
breakpoints, etc. you have to do a final loop on the ring to process
any events that may have been placed there between the end of your
last loop and calling pause on the domain. LibVMI/DRAKVUF does that
already.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29 15:26               ` Tamas K Lengyel
@ 2019-05-29 15:26                 ` Tamas K Lengyel
  0 siblings, 0 replies; 54+ messages in thread
From: Tamas K Lengyel @ 2019-05-29 15:26 UTC (permalink / raw)
  To: Mathieu Tarral; +Cc: Andrew Cooper, xen-devel

> After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states:
> - frozen: the mouse doesn't move: so I would guess the VCPU are blocked.

You probably have events pending on the ring that you didn't process.
After you pause the domain and remove the registered evenst,
breakpoints, etc. you have to do a final loop on the ring to process
any events that may have been placed there between the end of your
last loop and calling pause on the domain. LibVMI/DRAKVUF does that
already.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-05-29  0:49               ` Andrew Cooper
  2019-05-29  0:49                 ` [Xen-devel] " Andrew Cooper
  2019-05-29  1:34                 ` Tamas K Lengyel
@ 2019-06-11 12:31                 ` Mathieu Tarral
  2019-06-11 16:13                   ` Mathieu Tarral
  2 siblings, 1 reply; 54+ messages in thread
From: Mathieu Tarral @ 2019-06-11 12:31 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 6588 bytes --]

Hi Andrew, Tamas,

Le mercredi, mai 29, 2019 2:49 AM, Andrew Cooper <andrew.cooper3@citrix.com> a écrit :

> After a series of tests on 1 or 4 VCPUs, my domain end up in 2 possible states: - frozen: the mouse doesn't move: so I would guess the VCPU are blocked. I'm calling the xc_(un)pause_domain APIs multiple times when I write to the shadow copies, but It's always synchronous, so I doubt that they interfered and "paused" the domain.
>
> xc_{,un}pause_domain() calls are reference counted.  Calling unpause too many times should be safe (from a refcount point of view), and should fail the hypercall with -EINVAL.
>
>> Also, the log output I have before I detect that Ansible failed to execute is that the resume succeded and
>> Xen is ready to process VMI events.
>>
>> - BSOD: that's the second possibility, apparently I'm corrupting critical data structure in the operating system,
>> and the Windbg analysis is inconclusive, so I can't tell much.
>>
>> Either way, I can't execute this test sequentially 10 000 times without a crash.
>
> Ok good - this is a far easier place to start debugging from.
>
>> -> Could you look at the implementation, and tell me if I misused the APIs somewhere ?
>> https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7
>
> Some observations.
>
> 1)  In xen_pause_vm(), you do an xc_domain_getinfo().  First of all, the API is crazy, so you also need to check "|| info.domid != domid" in your error condition, but that shouldn't be an issue here as the domid isn't changing.

"The API is crazy"
Could you elaborate on that ?

> Furthermore, the results of xc_domain_getinfo() are stale before the hypercall returns, so it is far less useful than it appears.

I looked at libvmi's implementation:
https://github.com/libvmi/libvmi/blob/master/libvmi/driver/xen/xen.c#L2696

I guess I should have implemented the checks too.
They just didn't make sense for me as I was sure that my calls were synchronous, one after the other.

> I'm afraid that the only safe way to issue pause/unpauses is to know that you've reference counted your own correctly.  All entities in dom0 with privilege can fight over each others references, because there is nothing Xen can use to distinguish the requests.

-> I removed my calls to xc_pause/resume.

> 2) You allocate a libxl context but do nothing with it.  That can all go, along with the linkage against libxl.  Also, you don't need to create a logger like that.  Despite being utterly unacceptable behaviour for a library, it is the default by passing NULL in xc_interface_open().

I followed drakvuf's xen init function:
https://github.com/tklengyel/drakvuf/blob/master/src/xen_helper/xen_helper.c#L140
As I thought I was going to need this at some point.
Same for the xl_logger initialization.

> 3) A malloc()/memset() pair is more commonly spelt calloc()

True.

> And some questions.
>
> 1) I'm guessing the drakvuf_inject_trap(drakvuf, 0x293e6a0, 0) call is specific to the exact windows kernel in use?

Yes, I used a libvmi python script to tanslate the symbol -> virtual address -> physical address.
Then I replaced that value in my code and recompiled the binary before the test.

> 2) In vmi_init(), what is the purpose of fmask and zero_page_gfn?  You add one extra gfn to the guest, called zero_page, and fill it with 1's from fmask.
>
> 3) You create two altp2m's, but both have the same default access.  Is this deliberate, or a bug?  If deliberate, why?
>
> Finally, and probably the source of the memory corruption...
>
> 4) When injecting a trap, you allocate a new gfn, memcpy() the contents and insert a 0xcc (so far so good).  You then remap the executable view to point at the new gfn with a breakpoint in (fine), and remap the readable view to point at the zero_page, which is full of 1's (uh-oh).
>
> What is this final step trying to achieve?  It guarantees that patch-guard will eventually notice and BSOD your VM for critical structure corruption.  The read-only view needs to point to the original gfn with only read permissions, so when Windows reads the gfn back, it sees what it expects.  You also need to prohibit writes to either gfn so you can spot writes (unlikely in this case but important for general introspection) so you can propagate the change to both copies.

Yes I missed that PatchGuard would eventually check those shadow pages anyway.
I was already happy to see that my breakpoints were working, and I proceeded to the tests
hoping to have a quick reproduction of the bug.

I implemented a basic mem_access event on the restricting to --X only on the original GFN being remapped,
and switching to hostp2m and singlestepping to escape PatchGuard.

It works, but I end up in a situation where Xen fails at some point, because at ~90 tests, it cannot populate the ring anymore:
INFO:root:==== test 92 ====
INFO:root:starting drakvuf
INFO:root:starting Ansible
INIT
xen_init_interface
xc_interface_open
create logger
allocating libxc context
init ring page
xc: error: Failed to populate ring pfn
(16 = Device or resource busy): Internal error
fail to enable monitoring: Device or resource busy
fail to init xen interface
CLOSE
Fail to init vmi

(I updated the Gist: https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7#file-xen-drakvuf-c)
What do you think happened ?
I have a call to xc_domain_setmaxmem with ~0, so it shouldn't happen ?
https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7#file-xen-drakvuf-c-L598

>> I used the compat APIs, like Drakvuf does.
>>
>> @Tamas, if you could check the traps implementation.
>>
>> You also have stress-test.py, which is the small test suite that I used, and
>> the screenshot showing the stdout preceding a test failure,
>> when Ansible couldn't contact WinRM service because the domain was frozen.
>>
>> Note: I stole some code from libvmi, to handle page read/write in Xen.
>>
>> PS: in the case where the domain is frozen, and I destroy the domain, a (null) entry will remain
>> in xl list, despite that my stress-test.py process is already dead.
>>
>> I have 4 of these entries in my xl list right now.
>
> That's almost certainly a reference not being dropped on a page.  Can you run `xl debug-keys q` and paste the resulting analysis which will be visible in `xl dmesg`?
>
> It is probably some missing cleanup in the altp2m code.

I just checked, and I had a few "xen-drakvuf" processes still running in the background.
When the main Python process raised an exception and terminated, they became attached to systemd.
Killing them removed the (null) domain entries.

Thanks,
Mathieu

[-- Attachment #1.2: Type: text/html, Size: 9328 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs
  2019-06-11 12:31                 ` Mathieu Tarral
@ 2019-06-11 16:13                   ` Mathieu Tarral
  0 siblings, 0 replies; 54+ messages in thread
From: Mathieu Tarral @ 2019-06-11 16:13 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Tamas K Lengyel


[-- Attachment #1.1: Type: text/plain, Size: 1388 bytes --]

Hi,

> Yes I missed that PatchGuard would eventually check those shadow pages anyway.
> I was already happy to see that my breakpoints were working, and I proceeded to the tests
> hoping to have a quick reproduction of the bug.
>
> I implemented a basic mem_access event on the restricting to --X only on the original GFN being remapped,
> and switching to hostp2m and singlestepping to escape PatchGuard.
>
> It works, but I end up in a situation where Xen fails at some point, because at ~90 tests, it cannot populate the ring anymore:
> INFO:root:==== test 92 ====
> INFO:root:starting drakvuf
> INFO:root:starting Ansible
> INIT
> xen_init_interface
> xc_interface_open
> create logger
> allocating libxc context
> init ring page
> xc: error: Failed to populate ring pfn
> (16 = Device or resource busy): Internal error
> fail to enable monitoring: Device or resource busy
> fail to init xen interface
> CLOSE
> Fail to init vmi
>
> (I updated the Gist: https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7#file-xen-drakvuf-c)
> What do you think happened ?
> I have a call to xc_domain_setmaxmem with ~0, so it shouldn't happen ?
> https://gist.github.com/mtarral/d99ce5524cfcfb5290eaa05702c3e8e7#file-xen-drakvuf-c-L598

I moved the call to xc_domain_setmaxmem BEFORE xc_monitor_enable.
Which works.

I'm continuing my testing to see if I can reproduce the bug.

Mathieu

[-- Attachment #1.2: Type: text/html, Size: 2117 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2019-06-11 16:14 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-06 16:18 [VMI] Possible race-condition in altp2m APIs Mathieu Tarral
2019-05-06 16:18 ` [Xen-devel] " Mathieu Tarral
2019-05-06 17:07 ` Andrew Cooper
2019-05-06 17:07   ` [Xen-devel] " Andrew Cooper
2019-05-06 17:41   ` Tamas K Lengyel
2019-05-06 17:41     ` [Xen-devel] " Tamas K Lengyel
2019-05-06 18:30     ` Andrew Cooper
2019-05-06 18:30       ` [Xen-devel] " Andrew Cooper
2019-05-06 18:51       ` Tamas K Lengyel
2019-05-06 18:51         ` [Xen-devel] " Tamas K Lengyel
2019-05-07 12:01   ` Mathieu Tarral
2019-05-07 12:01     ` [Xen-devel] " Mathieu Tarral
2019-05-09 16:19     ` Mathieu Tarral
2019-05-09 16:19       ` [Xen-devel] " Mathieu Tarral
2019-05-09 16:42       ` Andrew Cooper
2019-05-09 16:42         ` [Xen-devel] " Andrew Cooper
2019-05-09 17:46         ` Tamas K Lengyel
2019-05-09 17:46           ` [Xen-devel] " Tamas K Lengyel
2019-05-09 18:00           ` Andrew Cooper
2019-05-09 18:00             ` [Xen-devel] " Andrew Cooper
2019-05-09 18:08             ` Tamas K Lengyel
2019-05-09 18:08               ` [Xen-devel] " Tamas K Lengyel
2019-05-10 15:20               ` Mathieu Tarral
2019-05-10 15:20                 ` [Xen-devel] " Mathieu Tarral
2019-05-10 10:52             ` Petre Ovidiu PIRCALABU
2019-05-10 10:52               ` [Xen-devel] " Petre Ovidiu PIRCALABU
2019-05-10 10:55               ` Andrew Cooper
2019-05-10 10:55                 ` [Xen-devel] " Andrew Cooper
2019-05-10 15:17         ` Mathieu Tarral
2019-05-10 15:17           ` [Xen-devel] " Mathieu Tarral
2019-05-10 15:21           ` Andrew Cooper
2019-05-10 15:21             ` [Xen-devel] " Andrew Cooper
2019-05-13 16:18             ` Mathieu Tarral
2019-05-13 16:18               ` [Xen-devel] " Mathieu Tarral
2019-05-13 17:19               ` Razvan Cojocaru
2019-05-13 17:19                 ` [Xen-devel] " Razvan Cojocaru
2019-05-28 12:33             ` Mathieu Tarral
2019-05-28 12:33               ` [Xen-devel] " Mathieu Tarral
2019-05-29  0:49               ` Andrew Cooper
2019-05-29  0:49                 ` [Xen-devel] " Andrew Cooper
2019-05-29  1:34                 ` Tamas K Lengyel
2019-05-29  1:34                   ` [Xen-devel] " Tamas K Lengyel
2019-05-29  3:58                   ` Andrew Cooper
2019-05-29  3:58                     ` [Xen-devel] " Andrew Cooper
2019-05-29 15:21                     ` Tamas K Lengyel
2019-05-29 15:21                       ` [Xen-devel] " Tamas K Lengyel
2019-06-11 12:31                 ` Mathieu Tarral
2019-06-11 16:13                   ` Mathieu Tarral
2019-05-29  1:58               ` Tamas K Lengyel
2019-05-29  1:58                 ` [Xen-devel] " Tamas K Lengyel
2019-05-29 15:26               ` Tamas K Lengyel
2019-05-29 15:26                 ` [Xen-devel] " Tamas K Lengyel
2019-05-09 17:49       ` Tamas K Lengyel
2019-05-09 17:49         ` [Xen-devel] " Tamas K Lengyel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).