linux-kernel.vger.kernel.org archive mirror
* x86 SGDT emulation for Wine
@ 2023-12-27 22:20 Elizabeth Figura
  2023-12-27 23:58 ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Elizabeth Figura @ 2023-12-27 22:20 UTC (permalink / raw)
  To: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Ricardo Neri, wine-devel

Hello all,

There is a Windows 98 program, a game called Nuclear Strike, which wants to do 
some amount of direct VGA access. Part of this is port I/O, which naturally 
throws SIGILL that we can trivially catch and emulate in Wine. The other part 
is direct access to the video memory at 0xa0000, which in general isn't a 
problem to catch and virtualize as well.

However, this program is a bit creative about how it accesses that memory; 
instead of just writing to 0xa0000 directly, it looks up a segment descriptor 
whose base is at 0xa0000 and then uses the %es override to write bytes. In 
pseudo-C, what it does is:

int get_vga_selector()
{
    sgdt(&gdt_size, &gdt_ptr);
    sldt(&ldt_segment);
    ++gdt_size;    /* sgdt reports the limit; +1 gives the table size in bytes */

    /* Scan the GDT for a descriptor whose base is the VGA window. */
    descriptor = gdt_ptr;
    while (descriptor->base != 0xa0000)
    {
        ++descriptor;
        gdt_size -= sizeof(*descriptor);
        if (!gdt_size)
            break;
    }

    if (gdt_size)
        return (descriptor - gdt_ptr) << 3;

    /* Not in the GDT; locate the LDT through the selector returned by sldt
       and scan it the same way. */
    ldt_ptr = gdt_ptr[ldt_segment >> 3].base;
    ldt_size = gdt_ptr[ldt_segment >> 3].limit + 1;
    descriptor = ldt_ptr;
    while (descriptor->base != 0xa0000)
    {
        ++descriptor;
        ldt_size -= sizeof(*descriptor);
        if (!ldt_size)
            break;
    }

    if (ldt_size)
        return (descriptor - ldt_ptr) << 3;

    return 0;
}


Currently we emulate IDT access. On a read fault, we execute sidt ourselves, 
check if the read address falls within the IDT, and return some dummy data 
from the exception handler if it does [1]. We can easily enough implement GDT 
access as well this way, and there is even an out-of-tree patch written some 
years ago that does this, and helps the game run.
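
For reference, the emulation amounts to something like the following. This is 
only an illustrative sketch of the approach, not the actual instr.c code, and 
the helper name is made up:

#include <stdbool.h>
#include <string.h>

/* Called from the fault handler: if the faulting read falls inside the IDT
 * (as reported by sidt), satisfy it with dummy bytes and let the caller skip
 * the faulting instruction; otherwise fall through to normal fault handling. */
static bool emulate_idt_read(const char *fault_addr, void *out, size_t size)
{
    struct { unsigned short limit; unsigned long base; } __attribute__((packed)) idtr;

    __asm__ volatile("sidt %0" : "=m"(idtr));

    if (fault_addr >= (const char *)idtr.base &&
        fault_addr + size <= (const char *)idtr.base + idtr.limit + 1)
    {
        memset(out, 0, size);   /* dummy descriptor data */
        return true;
    }
    return false;
}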

However, there are two problems that I have observed or anticipated:

(1) On systems with UMIP, the kernel emulates sgdt instructions and returns a 
consistent address which we can guarantee is invalid. However, it also returns 
a size of zero. The program doesn't expect this (cf. the way the loop is 
written above) and I believe will effectively loop forever in that case, or 
until it finds the VGA selector or hits invalid memory.

    I see two obvious ways to fix this: either adjust the size of the fake 
kernel GDT, or provide a switch to stop emulating and let Wine handle it. The 
latter may very well be a more sustainable option in the long term (although I'll 
admit I can't immediately come up with a reason why, other than "we might need 
to raise the size yet again".)

    Does anyone have opinions on this particular topic? I can look into 
writing a patch but I'm not sure what the best approach is.

(2) On 64-bit systems without UMIP, sgdt returns a truncated address when in 
32-bit mode. This truncated address in practice might point anywhere in the 
address space, including to valid memory.

    In order to fix this, we would need the kernel to guarantee that the GDT 
base points to an address whose bottom 32 bits we can guarantee are 
inaccessible. This is relatively easy to achieve ourselves by simply mapping 
those pages as noaccess, but it also means that those pages can't overlap 
something we need; we already go to pains to make sure that certain parts of 
the address space are free. Broadly anything above the 2G boundary *should* be 
okay though. Is this feasible?
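
    For illustration, the "noaccess" mapping itself is little more than a 
PROT_NONE reservation at the truncated address. The sketch below assumes we 
know the real 64-bit GDT base and that the truncated address is otherwise 
free:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

/* Sketch: reserve a couple of pages at the address a 32-bit-truncated GDT
 * base would point at, so that a program walking the "GDT" faults immediately
 * instead of reading unrelated memory. */
static int reserve_truncated_gdt_page(uint64_t gdt_base)
{
    uintptr_t truncated = (uintptr_t)(uint32_t)gdt_base;   /* what 32-bit sgdt reports */
    void *page = (void *)(truncated & ~(uintptr_t)0xfff);

    /* MAP_FIXED_NOREPLACE fails rather than clobbering an existing mapping. */
    if (mmap(page, 0x2000, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0) == MAP_FAILED)
        return -1;
    return 0;
}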

    We could also just decide we don't care about systems without UMIP, but 
that seems a bit unfortunate; it's not that old of a feature. But I also have 
no idea how hard it would be to make this kind of a guarantee on the kernel 
side.

    This is also, theoretically, a problem for the IDT, except that on the 
machines I've tested, the IDT is always at 0xfffffe0000000000. That's not 
great either (it's certainly caused some weirdness and confusion when 
debugging, when we unexpectedly catch an unrelated null pointer access) but it 
seems to work in practice.

--Zeb

[1] https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/krnl386.exe16/instr.c#l702




* Re: x86 SGDT emulation for Wine
  2023-12-27 22:20 x86 SGDT emulation for Wine Elizabeth Figura
@ 2023-12-27 23:58 ` H. Peter Anvin
  2024-01-02 17:12   ` Sean Christopherson
  2024-01-02 19:53   ` Elizabeth Figura
  0 siblings, 2 replies; 12+ messages in thread
From: H. Peter Anvin @ 2023-12-27 23:58 UTC (permalink / raw)
  To: Elizabeth Figura, x86, Linux Kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Ricardo Neri,
	wine-devel

On December 27, 2023 2:20:37 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
>Hello all,
>
>There is a Windows 98 program, a game called Nuclear Strike, which wants to do 
>some amount of direct VGA access. Part of this is port I/O, which naturally 
>throws SIGILL that we can trivially catch and emulate in Wine. The other part 
>is direct access to the video memory at 0xa0000, which in general isn't a 
>problem to catch and virtualize as well.
>
>However, this program is a bit creative about how it accesses that memory; 
>instead of just writing to 0xa0000 directly, it looks up a segment descriptor 
>whose base is at 0xa0000 and then uses the %es override to write bytes. In 
>pseudo-C, what it does is:
>
>int get_vga_selector()
>{
>    sgdt(&gdt_size, &gdt_ptr);
>    sldt(&ldt_segment);
>    ++gdt_size;
>    descriptor = gdt_ptr;
>    while (descriptor->base != 0xa0000)
>    {
>        ++descriptor;
>        gdt_size -= sizeof(*descriptor);
>        if (!gdt_size)
>            break;
>    }
>
>    if (gdt_size)
>        return (descriptor - gdt_ptr) << 3;
>
>    descriptor = gdt_ptr[ldt_segment >> 3]->base;
>    ldt_size = gdt_ptr[ldt_segment >> 3]->limit + 1;
>    while (descriptor->base != 0xa0000)
>    {
>        ++descriptor;
>        ldt_size -= sizeof(*descriptor);
>        if (!ldt_size)
>            break;
>    }
>
>    if (ldt_size)
>        return (descriptor - ldt_ptr) << 3;
>
>    return 0;
>}
>
>
>Currently we emulate IDT access. On a read fault, we execute sidt ourselves, 
>check if the read address falls within the IDT, and return some dummy data 
>from the exception handler if it does [1]. We can easily enough implement GDT 
>access as well this way, and there is even an out-of-tree patch written some 
>years ago that does this, and helps the game run.
>
>However, there are two problems that I have observed or anticipated:
>
>(1) On systems with UMIP, the kernel emulates sgdt instructions and returns a 
>consistent address which we can guarantee is invalid. However, it also returns 
>a size of zero. The program doesn't expect this (cf. the way the loop is 
>written above) and I believe will effectively loop forever in that case, or 
>until it finds the VGA selector or hits invalid memory.
>
>    I see two obvious ways to fix this: either adjust the size of the fake 
>kernel GDT, or provide a switch to stop emulating and let Wine handle it. The 
>latter may very well a more sustainable option in the long term (although I'll 
>admit I can't immediately come up with a reason why, other than "we might need 
>to raise the size yet again".)
>
>    Does anyone have opinions on this particular topic? I can look into 
>writing a patch but I'm not sure what the best approach is.
>
>(2) On 64-bit systems without UMIP, sgdt returns a truncated address when in 
>32-bit mode. This truncated address in practice might point anywhere in the 
>address space, including to valid memory.
>
>    In order to fix this, we would need the kernel to guarantee that the GDT 
>base points to an address whose bottom 32 bits we can guarantee are 
>inaccessible. This is relatively easy to achieve ourselves by simply mapping 
>those pages as noaccess, but it also means that those pages can't overlap 
>something we need; we already go to pains to make sure that certain parts of 
>the address space are free. Broadly anything above the 2G boundary *should* be 
>okay though. Is this feasible?
>
>    We could also just decide we don't care about systems without UMIP, but 
>that seems a bit unfortunate; it's not that old of a feature. But I also have 
>no idea how hard it would be to make this kind of a guarantee on the kernel 
>side.
>
>    This is also, theoretically, a problem for the IDT, except that on the 
>machines I've tested, the IDT is always at 0xfffffe0000000000. That's not 
>great either (it's certainly caused some weirdness and confusion when 
>debugging, when we unexpectedly catch an unrelated null pointer access) but it 
>seems to work in practice.
>
>--Zeb
>
>[1] https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/krnl386.exe16/
>instr.c#l702
>
>

A prctl() to set the UMIP-emulated return values or disable it (giving SIGILL) would be easy enough.

For the non-UMIP case, and probably for a lot of other corner cases like relying on certain magic selector values and what not, the best option really would be to wrap the code in a lightweight KVM container. I do *not* mean running the Qemu user space part of KVM; instead have Wine interface with /dev/kvm directly.

Non-KVM-capable hardware is basically historic at this point.
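
For concreteness, the bare /dev/kvm skeleton is small; the following is only a
sketch, with error handling and all guest memory/register setup omitted:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* One VM with one vCPU, talking to /dev/kvm directly.  The shared kvm_run
 * structure is where exit information shows up after each KVM_RUN. */
static struct kvm_run *create_vcpu(int *vm_out, int *vcpu_out)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);

    *vm_out = vm;
    *vcpu_out = vcpu;
    return mmap(NULL, run_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);
}

/* Per time slice: ioctl(vcpu, KVM_RUN, 0), then inspect run->exit_reason
 * (KVM_EXIT_IO, KVM_EXIT_MMIO, ...) to decide what to emulate or hand back
 * to the host side. */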


* Re: x86 SGDT emulation for Wine
  2023-12-27 23:58 ` H. Peter Anvin
@ 2024-01-02 17:12   ` Sean Christopherson
  2024-01-02 19:53   ` Elizabeth Figura
  1 sibling, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2024-01-02 17:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Elizabeth Figura, x86, Linux Kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Ricardo Neri,
	wine-devel

On Wed, Dec 27, 2023, H. Peter Anvin wrote:
> On December 27, 2023 2:20:37 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
> >Hello all,
> >
> >There is a Windows 98 program, a game called Nuclear Strike, which wants to do 
> >some amount of direct VGA access. Part of this is port I/O, which naturally 
> >throws SIGILL that we can trivially catch and emulate in Wine. The other part 
> >is direct access to the video memory at 0xa0000, which in general isn't a 
> >problem to catch and virtualize as well.
> >
> >However, this program is a bit creative about how it accesses that memory; 
> >instead of just writing to 0xa0000 directly, it looks up a segment descriptor 
> >whose base is at 0xa0000 and then uses the %es override to write bytes. In 
> >pseudo-C, what it does is:

...

> >Currently we emulate IDT access. On a read fault, we execute sidt ourselves, 
> >check if the read address falls within the IDT, and return some dummy data 
> >from the exception handler if it does [1]. We can easily enough implement GDT 
> >access as well this way, and there is even an out-of-tree patch written some 
> >years ago that does this, and helps the game run.
> >
> >However, there are two problems that I have observed or anticipated:
> >
> >(1) On systems with UMIP, the kernel emulates sgdt instructions and returns a 
> >consistent address which we can guarantee is invalid. However, it also returns 
> >a size of zero. The program doesn't expect this (cf. the way the loop is 
> >written above) and I believe will effectively loop forever in that case, or 
> >until it finds the VGA selector or hits invalid memory.
> >
> >    I see two obvious ways to fix this: either adjust the size of the fake 
> >kernel GDT, or provide a switch to stop emulating and let Wine handle it. The 
> >latter may very well a more sustainable option in the long term (although I'll 
> >admit I can't immediately come up with a reason why, other than "we might need 
> >to raise the size yet again".)
> >
> >    Does anyone have opinions on this particular topic? I can look into 
> >writing a patch but I'm not sure what the best approach is.
> >
> >(2) On 64-bit systems without UMIP, sgdt returns a truncated address when in 
> >32-bit mode. This truncated address in practice might point anywhere in the 
> >address space, including to valid memory.
> >
> >    In order to fix this, we would need the kernel to guarantee that the GDT 
> >base points to an address whose bottom 32 bits we can guarantee are 
> >inaccessible. This is relatively easy to achieve ourselves by simply mapping 
> >those pages as noaccess, but it also means that those pages can't overlap 
> >something we need; we already go to pains to make sure that certain parts of 
> >the address space are free. Broadly anything above the 2G boundary *should* be 
> >okay though. Is this feasible?
> >
> >    We could also just decide we don't care about systems without UMIP, but 
> >that seems a bit unfortunate; it's not that old of a feature. But I also have 
> >no idea how hard it would be to make this kind of a guarantee on the kernel 
> >side.
> >
> >    This is also, theoretically, a problem for the IDT, except that on the 
> >machines I've tested, the IDT is always at 0xfffffe0000000000. That's not 
> >great either (it's certainly caused some weirdness and confusion when 
> >debugging, when we unexpectedly catch an unrelated null pointer access) but it 
> >seems to work in practice.
> >
> >--Zeb
> >
> >[1] https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/krnl386.exe16/
> >instr.c#l702
> >
> >
> 
> A prctl() to set the UMIP-emulated return values or disable it (giving
> SIGILL) would be easy enough.
> 
> For the non-UMIP case, and probably for a lot of other corner cases like
> relying on certain magic selector values and what not, the best option really
> would be to wrap the code in a lightweight KVM container. I do *not* mean
> running the Qemu user space part of KVM; instead have Wine interface with
> /dev/kvm directly.

+1.  Pivoting to KVM would require quite a bit of work up front, but I suspect
the payoff would be worthwhile in the end.

See also https://github.com/dosemu2/dosemu2/tree/devel/src/base/emu-i386.


* Re: x86 SGDT emulation for Wine
  2023-12-27 23:58 ` H. Peter Anvin
  2024-01-02 17:12   ` Sean Christopherson
@ 2024-01-02 19:53   ` Elizabeth Figura
  2024-01-03  7:30     ` Stefan Dösinger
  2024-01-03 15:19     ` Sean Christopherson
  1 sibling, 2 replies; 12+ messages in thread
From: Elizabeth Figura @ 2024-01-02 19:53 UTC (permalink / raw)
  To: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel, H. Peter Anvin

On Wednesday, December 27, 2023 5:58:19 PM CST H. Peter Anvin wrote:
> On December 27, 2023 2:20:37 PM PST, Elizabeth Figura 
<zfigura@codeweavers.com> wrote:
> >Hello all,
> >
> >There is a Windows 98 program, a game called Nuclear Strike, which wants to
> >do some amount of direct VGA access. Part of this is port I/O, which
> >naturally throws SIGILL that we can trivially catch and emulate in Wine.
> >The other part is direct access to the video memory at 0xa0000, which in
> >general isn't a problem to catch and virtualize as well.
> >
> >However, this program is a bit creative about how it accesses that memory;
> >instead of just writing to 0xa0000 directly, it looks up a segment
> >descriptor whose base is at 0xa0000 and then uses the %es override to
> >write bytes. In pseudo-C, what it does is:
> >
> >int get_vga_selector()
> >{
> >
> >    sgdt(&gdt_size, &gdt_ptr);
> >    sldt(&ldt_segment);
> >    ++gdt_size;
> >    descriptor = gdt_ptr;
> >    while (descriptor->base != 0xa0000)
> >    {
> >    
> >        ++descriptor;
> >        gdt_size -= sizeof(*descriptor);
> >        if (!gdt_size)
> >        
> >            break;
> >    
> >    }
> >    
> >    if (gdt_size)
> >    
> >        return (descriptor - gdt_ptr) << 3;
> >    
> >    descriptor = gdt_ptr[ldt_segment >> 3]->base;
> >    ldt_size = gdt_ptr[ldt_segment >> 3]->limit + 1;
> >    while (descriptor->base != 0xa0000)
> >    {
> >    
> >        ++descriptor;
> >        ldt_size -= sizeof(*descriptor);
> >        if (!ldt_size)
> >        
> >            break;
> >    
> >    }
> >    
> >    if (ldt_size)
> >    
> >        return (descriptor - ldt_ptr) << 3;
> >    
> >    return 0;
> >
> >}
> >
> >
> >Currently we emulate IDT access. On a read fault, we execute sidt
> >ourselves, check if the read address falls within the IDT, and return some
> >dummy data from the exception handler if it does [1]. We can easily enough
> >implement GDT access as well this way, and there is even an out-of-tree
> >patch written some years ago that does this, and helps the game run.
> >
> >However, there are two problems that I have observed or anticipated:
> >
> >(1) On systems with UMIP, the kernel emulates sgdt instructions and returns
> >a consistent address which we can guarantee is invalid. However, it also
> >returns a size of zero. The program doesn't expect this (cf. the way the
> >loop is written above) and I believe will effectively loop forever in that
> >case, or until it finds the VGA selector or hits invalid memory.
> >
> >    I see two obvious ways to fix this: either adjust the size of the fake
> >
> >kernel GDT, or provide a switch to stop emulating and let Wine handle it.
> >The latter may very well a more sustainable option in the long term
> >(although I'll admit I can't immediately come up with a reason why, other
> >than "we might need to raise the size yet again".)
> >
> >    Does anyone have opinions on this particular topic? I can look into
> >
> >writing a patch but I'm not sure what the best approach is.
> >
> >(2) On 64-bit systems without UMIP, sgdt returns a truncated address when
> >in 32-bit mode. This truncated address in practice might point anywhere in
> >the address space, including to valid memory.
> >
> >    In order to fix this, we would need the kernel to guarantee that the
> >    GDT
> >
> >base points to an address whose bottom 32 bits we can guarantee are
> >inaccessible. This is relatively easy to achieve ourselves by simply
> >mapping those pages as noaccess, but it also means that those pages can't
> >overlap something we need; we already go to pains to make sure that
> >certain parts of the address space are free. Broadly anything above the 2G
> >boundary *should* be okay though. Is this feasible?
> >
> >    We could also just decide we don't care about systems without UMIP, but
> >
> >that seems a bit unfortunate; it's not that old of a feature. But I also
> >have no idea how hard it would be to make this kind of a guarantee on the
> >kernel side.
> >
> >    This is also, theoretically, a problem for the IDT, except that on the
> >
> >machines I've tested, the IDT is always at 0xfffffe0000000000. That's not
> >great either (it's certainly caused some weirdness and confusion when
> >debugging, when we unexpectedly catch an unrelated null pointer access) but
> >it seems to work in practice.
> >
> >--Zeb
> >
> >[1] https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/krnl386.exe16/
> >instr.c#l702
> 
> A prctl() to set the UMIP-emulated return values or disable it (giving
> SIGILL) would be easy enough.
> 
> For the non-UMIP case, and probably for a lot of other corner cases like
> relying on certain magic selector values and what not, the best option
> really would be to wrap the code in a lightweight KVM container. I do *not*
> mean running the Qemu user space part of KVM; instead have Wine interface
> with /dev/kvm directly.
> 
> Non-KVM-capable hardware is basically historic at this point.

Sorry for the late response—I've been trying to do research on what would be 
necessary to use KVM (plus I made the poor choice of sending this during the 
holiday season...)

I'm concerned that KVM is going to be difficult or even intractable. Here are 
some of the problems that I (perhaps incorrectly) understand:

* As I am led to understand, there can only be one hypervisor on the machine 
at a time, and KVM has a hard limit on the number of vCPUs.

  The obvious way to use KVM for Wine is to make each (guest) thread a vCPU. 
That will, at the very least, run into the thread limit. In order to avoid 
that we'd need to ship a whole scheduler, which is concerning. That's a huge 
component to ship and a huge burden to keep updated. It also means we need to 
hoist *all* of the ipc and sync code into the guest, which will take an 
enormous amount of work.

  Moreover, because there can only be one hypervisor, and Wine is a multi-
process beast, that means that we suddenly need to throw every process into 
the same VM. That has unfortunate implications regarding isolation (it's been 
a dream for years that we'd be able to share a single wine "VM" between 
multiple users), and it complicates memory management (though perhaps not 
terribly?). And it means you can only have one Wine VM at a time, and can't 
use Wine at the same time as a "real" VM, neither of which are restrictions 
that currently exist.

  And it's not even like we can refactor—we'd have to rewrite tons of code to 
work inside a VM, but also keep the old code around for the cases where we 
don't have a VM and want to delegate scheduling to the host OS.

* Besides scheduling, we need to exit the VM every time we would normally call 
into Unix code, which in practice is every time that the application does an 
NT syscall, or uses a library which we delegate to the host (including e.g. 
GPU, multimedia, audio...)

  I'm concerned that this will be very expensive. Most VM users don't need to 
exit on every syscall. While I haven't tested KVM, I think some other Wine 
developers actually did a similar experiment using a hypervisor to solve some 
other problem (related to 32-bit support on Mac OS), and exiting the 
hypervisor was prohibitively slow.

  Alternatively we ship *more* components to reimplement these things inside 
the VM (e.g. virgl/venus for GPU hardware, other virtio bits for interacting 
with e.g. multimedia hardware? enough of a cache to make block I/O reasonably 
fast, a few layers of networking code...), which looks more and more ugly.

If nothing else, it's a huge hammer to fix this one problem for an application 
which doesn't even currently work in Wine, *and* which isn't even a problem on 
sufficiently new hardware (and to fix other GDT problems which are only 
theoretical at this point.)

--Zeb




* Re: x86 SGDT emulation for Wine
  2024-01-02 19:53   ` Elizabeth Figura
@ 2024-01-03  7:30     ` Stefan Dösinger
  2024-01-03 15:19     ` Sean Christopherson
  1 sibling, 0 replies; 12+ messages in thread
From: Stefan Dösinger @ 2024-01-03  7:30 UTC (permalink / raw)
  To: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel, H. Peter Anvin
  Cc: Elizabeth Figura


On Tuesday, 2 January 2024 at 22:53:26 EAT, Elizabeth Figura wrote:
>   I'm concerned that this will be very expensive. Most VM users don't need
> to exit on every syscall. While I haven't tested KVM, I think some other
> Wine developers actually did a similar experiment using a hypervisor to
> solve some other problem (related to 32-bit support on Mac OS), and exiting
> the hypervisor was prohibitively slow.

Just to add to this point, Ken Thomases and I experimented with this on Mac 
OS, and as Zeb said, we found it to be unworkably slow. In the d3d games we 
tested, the performance of the hypervisor + lots of exits was approximately the 
same as running all 32-bit guest code inside qemu's software CPU emulation, or 
about 5% of the performance of using native 32-bit Mac processes (when they 
still existed). From what we could tell, the cost was imposed by the CPU and 
not by MacOS' very lightweight hypervisor API.

There are obviously differences between Mac and Linux, and with Wine's new 
syscalls, we probably don't need to exit as often as my Hangover wrapper DLLs 
did, but combined with the other reasons Zeb listed, I don't think running Wine 
inside KVM is ever going to be realistic.



* Re: x86 SGDT emulation for Wine
  2024-01-02 19:53   ` Elizabeth Figura
  2024-01-03  7:30     ` Stefan Dösinger
@ 2024-01-03 15:19     ` Sean Christopherson
  2024-01-03 15:33       ` H. Peter Anvin
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2024-01-03 15:19 UTC (permalink / raw)
  To: Elizabeth Figura
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel, H. Peter Anvin

On Tue, Jan 02, 2024, Elizabeth Figura wrote:
> On Wednesday, December 27, 2023 5:58:19 PM CST H. Peter Anvin wrote:
> > On December 27, 2023 2:20:37 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
> > >Hello all,
> > >
> > >There is a Windows 98 program, a game called Nuclear Strike, which wants to
> > >do some amount of direct VGA access. Part of this is port I/O, which
> > >naturally throws SIGILL that we can trivially catch and emulate in Wine.
> > >The other part is direct access to the video memory at 0xa0000, which in
> > >general isn't a problem to catch and virtualize as well.
> > >
> > >However, this program is a bit creative about how it accesses that memory;
> > >instead of just writing to 0xa0000 directly, it looks up a segment
> > >descriptor whose base is at 0xa0000 and then uses the %es override to
> > >write bytes. In pseudo-C, what it does is:

...

> > A prctl() to set the UMIP-emulated return values or disable it (giving
> > SIGILL) would be easy enough.
> > 
> > For the non-UMIP case, and probably for a lot of other corner cases like
> > relying on certain magic selector values and what not, the best option
> > really would be to wrap the code in a lightweight KVM container. I do *not*
> > mean running the Qemu user space part of KVM; instead have Wine interface
> > with /dev/kvm directly.
> > 
> > Non-KVM-capable hardware is basically historic at this point.
> 
> Sorry for the late response—I've been trying to do research on what would be 
> necessary to use KVM (plus I made the poor choice of sending this during the 
> holiday season...)
> 
> I'm concerned that KVM is going to be difficult or even intractable. Here are 
> some of the problems that I (perhaps incorrectly) understand:
> 
> * As I am led to understand, there can only be one hypervisor on the machine 
> at a time,

No.  Only one instance of KVM-the-module is allowed, but there is no arbitrary
limit on the number of VMs that userspace can create.  The only meaningful
limitation is memory, and while struct kvm isn't tiny, it's not _that_ big.

> and KVM has a hard limit on the number of vCPUs.
>
>   The obvious way to use KVM for Wine is to make each (guest) thread a vCPU. 
> That will, at the very least, run into the thread limit. In order to avoid 
> that we'd need to ship a whole scheduler, which is concerning. That's a huge 
> component to ship and a huge burden to keep updated. It also means we need to 
> hoist *all* of the ipc and sync code into the guest, which will take an 
> enormous amount of work.
> 
>   Moreover, because there can only be one hypervisor, and Wine is a multi-
> process beast, that means that we suddenly need to throw every process into 
> the same VM.

As above, this is wildly inaccurate.  The only KVM restriction with respect to
processes is that a VM is bound to the process (address space) that created the
VM.  There are no restrictions on the number of VMs that can be created, e.g. a
single process can create multiple VMs.

> That has unfortunate implications regarding isolation (it's been a dream for
> years that we'd be able to share a single wine "VM" between multiple users),
> it complicates memory management (though perhaps not terribly?). And it means
> you can only have one Wine VM at a time, and can't use Wine at the same time
> as a "real" VM, neither of which are restrictions that currently exist.
> 
>   And it's not even like we can refactor—we'd have to rewrite tons of code to 
> work inside a VM, but also keep the old code around for the cases where we 
> don't have a VM and want to delegate scheduling to the host OS.
> 
> * Besides scheduling, we need to exit the VM every time we would normally call 
> into Unix code, which in practice is every time that the application does an 
> NT syscall, or uses a library which we delegate to the host (including e.g. 
> GPU, multimedia, audio...)

Maybe I misinterpreted Peter's suggestion, but at least in my mind I wasn't thinking
that the entire Wine process would run in a VM, but rather Wine would run just
the "problematic" code in a VM.


* Re: x86 SGDT emulation for Wine
  2024-01-03 15:19     ` Sean Christopherson
@ 2024-01-03 15:33       ` H. Peter Anvin
  2024-01-04  6:35         ` Elizabeth Figura
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2024-01-03 15:33 UTC (permalink / raw)
  To: Sean Christopherson, Elizabeth Figura
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel

On January 3, 2024 7:19:02 AM PST, Sean Christopherson <seanjc@google.com> wrote:
>On Tue, Jan 02, 2024, Elizabeth Figura wrote:
>> On Wednesday, December 27, 2023 5:58:19 PM CST H. Peter Anvin wrote:
>> > On December 27, 2023 2:20:37 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
>> > >Hello all,
>> > >
>> > >There is a Windows 98 program, a game called Nuclear Strike, which wants to
>> > >do some amount of direct VGA access. Part of this is port I/O, which
>> > >naturally throws SIGILL that we can trivially catch and emulate in Wine.
>> > >The other part is direct access to the video memory at 0xa0000, which in
>> > >general isn't a problem to catch and virtualize as well.
>> > >
>> > >However, this program is a bit creative about how it accesses that memory;
>> > >instead of just writing to 0xa0000 directly, it looks up a segment
>> > >descriptor whose base is at 0xa0000 and then uses the %es override to
>> > >write bytes. In pseudo-C, what it does is:
>
>...
>
>> > A prctl() to set the UMIP-emulated return values or disable it (giving
>> > SIGILL) would be easy enough.
>> > 
>> > For the non-UMIP case, and probably for a lot of other corner cases like
>> > relying on certain magic selector values and what not, the best option
>> > really would be to wrap the code in a lightweight KVM container. I do *not*
>> > mean running the Qemu user space part of KVM; instead have Wine interface
>> > with /dev/kvm directly.
>> > 
>> > Non-KVM-capable hardware is basically historic at this point.
>> 
>> Sorry for the late response—I've been trying to do research on what would be 
>> necessary to use KVM (plus I made the poor choice of sending this during the 
>> holiday season...)
>> 
>> I'm concerned that KVM is going to be difficult or even intractable. Here are 
>> some of the problems that I (perhaps incorrectly) understand:
>> 
>> * As I am led to understand, there can only be one hypervisor on the machine 
>> at a time,
>
>No.  Only one instance of KVM-the-module is allowed, but there is no arbitrary
>limit on the number of VMs that userspace can create.  The only meaningful
>limitation is memory, and while struct kvm isn't tiny, it's not _that_ big.
>
>> and KVM has a hard limit on the number of vCPUs.
>>
>>   The obvious way to use KVM for Wine is to make each (guest) thread a vCPU. 
>> That will, at the very least, run into the thread limit. In order to avoid 
>> that we'd need to ship a whole scheduler, which is concerning. That's a huge 
>> component to ship and a huge burden to keep updated. It also means we need to 
>> hoist *all* of the ipc and sync code into the guest, which will take an 
>> enormous amount of work.
>> 
>>   Moreover, because there can only be one hypervisor, and Wine is a multi-
>> process beast, that means that we suddenly need to throw every process into 
>> the same VM.
>
>As above, this is wildly inaccurate.  The only KVM restriction with respect to
>processes is that a VM is bound to the process (address space) that created the
>VM.  There are no restrictions on the number of VMs that can be created, e.g. a
>single process can create multiple VMs.
>
>> That has unfortunate implications regarding isolation (it's been a dream for
>> years that we'd be able to share a single wine "VM" between multiple users),
>> it complicates memory management (though perhaps not terribly?). And it means
>> you can only have one Wine VM at a time, and can't use Wine at the same time
>> as a "real" VM, neither of which are restrictions that currently exist.
>> 
>>   And it's not even like we can refactor—we'd have to rewrite tons of code to 
>> work inside a VM, but also keep the old code around for the cases where we 
>> don't have a VM and want to delegate scheduling to the host OS.
>> 
>> * Besides scheduling, we need to exit the VM every time we would normally call 
>> into Unix code, which in practice is every time that the application does an 
>> NT syscall, or uses a library which we delegate to the host (including e.g. 
>> GPU, multimedia, audio...)
>
>Maybe I misinterpreted Peter's suggestion, but at least in my mind I wasn't thinking
>that the entire Wine process would run in a VM, but rather Wine would run just
>the "problematic" code in a VM.
>

Yes, the idea would be that you would run the "problematic" code inside a VM *mapped 1:1 with the external address space*, i.e. use KVM simply as a special execution mode that gives you more control of fine-grained machine state like the GDT. The code that you don't want executed in the VM context you simply leave unmapped in the VM page tables, and you set up #PF to always exit the VM context.
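
(Roughly, the setup that implies looks like the following; this is a sketch using the documented KVM ioctls, and the memory region and GDT values passed in are placeholders:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: expose a chunk of the host address space to the guest 1:1
 * (guest-physical address == host virtual address), then hand the vCPU
 * whatever GDT base/limit the guest code should see. */
static void map_identity_and_set_gdt(int vm_fd, int vcpu_fd,
                                     void *start, unsigned long long len,
                                     unsigned long long gdt_base,
                                     unsigned short gdt_limit)
{
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = (unsigned long long)(uintptr_t)start,
        .memory_size     = len,
        .userspace_addr  = (unsigned long long)(uintptr_t)start,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

    struct kvm_sregs sregs;
    ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
    sregs.gdt.base  = gdt_base;     /* entirely under Wine's control inside the VM */
    sregs.gdt.limit = gdt_limit;
    ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
}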


* Re: x86 SGDT emulation for Wine
  2024-01-03 15:33       ` H. Peter Anvin
@ 2024-01-04  6:35         ` Elizabeth Figura
  2024-01-05  1:02           ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: Elizabeth Figura @ 2024-01-04  6:35 UTC (permalink / raw)
  To: Sean Christopherson, H. Peter Anvin
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel

On Wednesday, January 3, 2024 9:33:10 AM CST H. Peter Anvin wrote:
> On January 3, 2024 7:19:02 AM PST, Sean Christopherson <seanjc@google.com> 
wrote:
> >On Tue, Jan 02, 2024, Elizabeth Figura wrote:
> >> On Wednesday, December 27, 2023 5:58:19 PM CST H. Peter Anvin wrote:
> >> > On December 27, 2023 2:20:37 PM PST, Elizabeth Figura 
<zfigura@codeweavers.com> wrote:
> >> > >Hello all,
> >> > >
> >> > >There is a Windows 98 program, a game called Nuclear Strike, which
> >> > >wants to
> >> > >do some amount of direct VGA access. Part of this is port I/O, which
> >> > >naturally throws SIGILL that we can trivially catch and emulate in
> >> > >Wine.
> >> > >The other part is direct access to the video memory at 0xa0000, which
> >> > >in
> >> > >general isn't a problem to catch and virtualize as well.
> >> > >
> >> > >However, this program is a bit creative about how it accesses that
> >> > >memory;
> >> > >instead of just writing to 0xa0000 directly, it looks up a segment
> >> > >descriptor whose base is at 0xa0000 and then uses the %es override to
> >
> >> > >write bytes. In pseudo-C, what it does is:
> >...
> >
> >> > A prctl() to set the UMIP-emulated return values or disable it (giving
> >> > SIGILL) would be easy enough.
> >> > 
> >> > For the non-UMIP case, and probably for a lot of other corner cases
> >> > like
> >> > relying on certain magic selector values and what not, the best option
> >> > really would be to wrap the code in a lightweight KVM container. I do
> >> > *not*
> >> > mean running the Qemu user space part of KVM; instead have Wine
> >> > interface
> >> > with /dev/kvm directly.
> >> > 
> >> > Non-KVM-capable hardware is basically historic at this point.
> >> 
> >> Sorry for the late response—I've been trying to do research on what would
> >> be necessary to use KVM (plus I made the poor choice of sending this
> >> during the holiday season...)
> >> 
> >> I'm concerned that KVM is going to be difficult or even intractable. Here
> >> are some of the problems that I (perhaps incorrectly) understand:
> >> 
> >> * As I am led to understand, there can only be one hypervisor on the
> >> machine at a time,
> >
> >No.  Only one instance of KVM-the-module is allowed, but there is no
> >arbitrary limit on the number of VMs that userspace can create.  The only
> >meaningful limitation is memory, and while struct kvm isn't tiny, it's not
> >_that_ big.>

Ah, thanks for the correction.

So if we're able to have one VM per thread, or one VM per process with one 
vcpu per thread (but that one is capped at 1024 at least right now?), and we 
don't risk running into any limits, that does make things a great deal easier.

Still, as Stefan said, I don't know if using a hypervisor is going to be 
plausible for speed reasons.

> >> and KVM has a hard limit on the number of vCPUs.
> >> 
> >>   The obvious way to use KVM for Wine is to make each (guest) thread a
> >>   vCPU.
> >> 
> >> That will, at the very least, run into the thread limit. In order to
> >> avoid
> >> that we'd need to ship a whole scheduler, which is concerning. That's a
> >> huge component to ship and a huge burden to keep updated. It also means
> >> we need to hoist *all* of the ipc and sync code into the guest, which
> >> will take an enormous amount of work.
> >> 
> >>   Moreover, because there can only be one hypervisor, and Wine is a
> >>   multi-
> >> 
> >> process beast, that means that we suddenly need to throw every process
> >> into
> >> the same VM.
> >
> >As above, this is wildly inaccurate.  The only KVM restriction with respect
> >to processes is that a VM is bound to the process (address space) that
> >created the VM.  There are no restrictions on the number of VMs that can
> >be created, e.g. a single process can create multiple VMs.
> >
> >> That has unfortunate implications regarding isolation (it's been a dream
> >> for years that we'd be able to share a single wine "VM" between multiple
> >> users), it complicates memory management (though perhaps not terribly?).
> >> And it means you can only have one Wine VM at a time, and can't use Wine
> >> at the same time as a "real" VM, neither of which are restrictions that
> >> currently exist.>> 
> >>   And it's not even like we can refactor—we'd have to rewrite tons of
> >>   code to
> >> 
> >> work inside a VM, but also keep the old code around for the cases where
> >> we
> >> don't have a VM and want to delegate scheduling to the host OS.
> >> 
> >> * Besides scheduling, we need to exit the VM every time we would normally
> >> call into Unix code, which in practice is every time that the
> >> application does an NT syscall, or uses a library which we delegate to
> >> the host (including e.g. GPU, multimedia, audio...)
> >
> >Maybe I misinterpreted Peter's suggestion, but at least in my mind I wasn't
> >thinking that the entire Wine process would run in a VM, but rather Wine
> >would run just the "problematic" code in a VM.
> 
> Yes, the idea would be that you would run the "problematic" code inside a VM
> *mapped 1:1 with the external address space*, i.e. use KVM simply as a
> special execution mode to give you more control of the fine grained machine
> state like the GDT. The code that you don't want executed in the VM context
> simply leave unmapped in the VM page tables and set up #PF to always exit
> the VM context.

So yes, as long as we *can* organize things such that we exit the hypervisor 
every time we want to call into Unix code, then that's feasible. We have a 
well-defined break between Windows and Unix code and it wouldn't be 
inordinately difficult to shove the VM exit into that break. My concern was 
that limitations on the number of VMs or vCPUs we can create would prevent us 
from doing that, and effectively require us to implement a lot more inside the 
VM, but as I understand it, that's not actually a problem.
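
(As one concrete mechanism for that break, and only as a sketch: the guest-side 
thunk can force an exit by writing to an otherwise unused I/O port, which KVM 
reflects to the host as KVM_EXIT_IO. The port number here is arbitrary, and this 
assumes the guest code runs with I/O access:)

#define UNIX_CALL_PORT 0xff77   /* arbitrary, otherwise unused port */

/* Guest side: request a Unix call by number; the out instruction exits the VM. */
static inline void call_unix(unsigned int request)
{
    __asm__ volatile("outl %0, %1" : : "a"(request), "Nd"((unsigned short)UNIX_CALL_PORT));
}

/* Host side, in the KVM_RUN loop (sketch):
 *
 *     if (run->exit_reason == KVM_EXIT_IO && run->io.port == UNIX_CALL_PORT)
 *         handle_unix_call(*(unsigned int *)((char *)run + run->io.data_offset));
 */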

That still leaves the question of performance, though. If exiting the VM 
that often isn't feasible for performance reasons, then that's still going to 
force us to implement from scratch an inordinate amount of kernel/library code 
inside the VM just to avoid the transition. Or, more likely, conclude that a 
hypervisor just isn't going to work for us.


I'm not at all familiar with the arch code, and I'm sure I'm not asking 
anything interesting, but is it really impossible to put CPU_ENTRY_AREA_RO_IDT 
somewhere that doesn't truncate to NULL, and to put the GDT at a fixed address 
as well?




* Re: x86 SGDT emulation for Wine
  2024-01-04  6:35         ` Elizabeth Figura
@ 2024-01-05  1:02           ` H. Peter Anvin
  2024-01-05  1:21             ` Sean Christopherson
  2024-01-05  2:47             ` Andrew Cooper
  0 siblings, 2 replies; 12+ messages in thread
From: H. Peter Anvin @ 2024-01-05  1:02 UTC (permalink / raw)
  To: Elizabeth Figura, Sean Christopherson
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel

On January 3, 2024 10:35:28 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
>On Wednesday, January 3, 2024 9:33:10 AM CST H. Peter Anvin wrote:
>> On January 3, 2024 7:19:02 AM PST, Sean Christopherson <seanjc@google.com> 
>wrote:
>> >On Tue, Jan 02, 2024, Elizabeth Figura wrote:
>> >> On Wednesday, December 27, 2023 5:58:19 PM CST H. Peter Anvin wrote:
>> >> > On December 27, 2023 2:20:37 PM PST, Elizabeth Figura 
><zfigura@codeweavers.com> wrote:
>> >> > >Hello all,
>> >> > >
>> >> > >There is a Windows 98 program, a game called Nuclear Strike, which
>> >> > >wants to
>> >> > >do some amount of direct VGA access. Part of this is port I/O, which
>> >> > >naturally throws SIGILL that we can trivially catch and emulate in
>> >> > >Wine.
>> >> > >The other part is direct access to the video memory at 0xa0000, which
>> >> > >in
>> >> > >general isn't a problem to catch and virtualize as well.
>> >> > >
>> >> > >However, this program is a bit creative about how it accesses that
>> >> > >memory;
>> >> > >instead of just writing to 0xa0000 directly, it looks up a segment
>> >> > >descriptor whose base is at 0xa0000 and then uses the %es override to
>> >
>> >> > >write bytes. In pseudo-C, what it does is:
>> >...
>> >
>> >> > A prctl() to set the UMIP-emulated return values or disable it (giving
>> >> > SIGILL) would be easy enough.
>> >> > 
>> >> > For the non-UMIP case, and probably for a lot of other corner cases
>> >> > like
>> >> > relying on certain magic selector values and what not, the best option
>> >> > really would be to wrap the code in a lightweight KVM container. I do
>> >> > *not*
>> >> > mean running the Qemu user space part of KVM; instead have Wine
>> >> > interface
>> >> > with /dev/kvm directly.
>> >> > 
>> >> > Non-KVM-capable hardware is basically historic at this point.
>> >> 
>> >> Sorry for the late response—I've been trying to do research on what would
>> >> be necessary to use KVM (plus I made the poor choice of sending this
>> >> during the holiday season...)
>> >> 
>> >> I'm concerned that KVM is going to be difficult or even intractable. Here
>> >> are some of the problems that I (perhaps incorrectly) understand:
>> >> 
>> >> * As I am led to understand, there can only be one hypervisor on the
>> >> machine at a time,
>> >
>> >No.  Only one instance of KVM-the-module is allowed, but there is no
>> >arbitrary limit on the number of VMs that userspace can create.  The only
>> >meaningful limitation is memory, and while struct kvm isn't tiny, it's not
>> >_that_ big.>
>
>Ah, thanks for the correction.
>
>So if we're able to have one VM per thread, or one VM per process with one 
>vcpu per thread (but that one is capped at 1024 at least right now?), and we 
>don't risk running into any limits, that does make things a great deal easier.
>
>Still, as Stefan said, I don't know if using a hypervisor is going to be 
>plausible for speed reasons.
>
>> >> and KVM has a hard limit on the number of vCPUs.
>> >> 
>> >>   The obvious way to use KVM for Wine is to make each (guest) thread a
>> >>   vCPU.
>> >> 
>> >> That will, at the very least, run into the thread limit. In order to
>> >> avoid
>> >> that we'd need to ship a whole scheduler, which is concerning. That's a
>> >> huge component to ship and a huge burden to keep updated. It also means
>> >> we need to hoist *all* of the ipc and sync code into the guest, which
>> >> will take an enormous amount of work.
>> >> 
>> >>   Moreover, because there can only be one hypervisor, and Wine is a
>> >>   multi-
>> >> 
>> >> process beast, that means that we suddenly need to throw every process
>> >> into
>> >> the same VM.
>> >
>> >As above, this is wildly inaccurate.  The only KVM restriction with respect
>> >to processes is that a VM is bound to the process (address space) that
>> >created the VM.  There are no restrictions on the number of VMs that can
>> >be created, e.g. a single process can create multiple VMs.
>> >
>> >> That has unfortunate implications regarding isolation (it's been a dream
>> >> for years that we'd be able to share a single wine "VM" between multiple
>> >> users), it complicates memory management (though perhaps not terribly?).
>> >> And it means you can only have one Wine VM at a time, and can't use Wine
>> >> at the same time as a "real" VM, neither of which are restrictions that
>> >> currently exist.>> 
>> >>   And it's not even like we can refactor—we'd have to rewrite tons of
>> >>   code to
>> >> 
>> >> work inside a VM, but also keep the old code around for the cases where
>> >> we
>> >> don't have a VM and want to delegate scheduling to the host OS.
>> >> 
>> >> * Besides scheduling, we need to exit the VM every time we would normally
>> >> call into Unix code, which in practice is every time that the
>> >> application does an NT syscall, or uses a library which we delegate to
>> >> the host (including e.g. GPU, multimedia, audio...)
>> >
>> >Maybe I misinterpreted Peter's suggestion, but at least in my mind I wasn't
>> >thinking that the entire Wine process would run in a VM, but rather Wine
>> >would run just the "problematic" code in a VM.
>> 
>> Yes, the idea would be that you would run the "problematic" code inside a VM
>> *mapped 1:1 with the external address space*, i.e. use KVM simply as a
>> special execution mode to give you more control of the fine grained machine
>> state like the GDT. The code that you don't want executed in the VM context
>> simply leave unmapped in the VM page tables and set up #PF to always exit
>> the VM context.
>
>So yes, as long as we *can* organize things such that we exit the hypervisor 
>every time we want to call into Unix code, then that's feasible. We have a 
>well-defined break between Windows and Unix code and it wouldn't be 
>inordinately difficult to shove the VM exit into that break. My concern was 
>that limitations on the number of VMs or vCPUs we can create would prevent us 
>from doing that, and effectively require us to implement a lot more inside the 
>VM, but as I understand that's not actually a problem.
>
>That still leaves the question of performance though. If having to exit the VM 
>that often for performance reasons isn't feasible, then that's still going to 
>force us to implement from scratch an inordinate amount of kernel/library code 
>inside the VM just to avoid the transition. Or, more likely, conclude that a 
>hypervisor just isn't going to work for us.
>
>
>I'm not at all familiar with the arch code, and I'm sure I'm not asking 
>anything interesting, but is it really impossible to put CPU_ENTRY_AREA_RO_IDT 
>somewhere that doesn't truncate to NULL, and to put the GDT at a fixed address 
>as well?
>
>
>

Putting the GDT at a fixed address is pretty much a no-go for a variety of reasons. As I said, a prctl() to specify the desired return information *on UMIP-capable hardware* is certainly doable. However, it does not address things like fixed selectors that have come up.

Note that there is no fundamental reason you cannot run the Unix user space code inside the VM container, too; you only need to vmexit on an actual system call. KVM might be able to assist there by providing a "short-circuit mode", allowing a system call vmexit to invoke the system call directly rather than having to bounce back to user space – twice.

I have to say I'm really impressed by how far Wine has come, that this level of compatibility is now on the radar screen.

Now, in the case that this is a single application that is no longer being updated, there is of course the BFI solution of detecting the app and patching the relevant code. For an application that *is* updated, it might be possible to reach out to the developer (and/or Steam, if applicable) and help them make their code compatible while fulfilling their direct objectives.




* Re: x86 SGDT emulation for Wine
  2024-01-05  1:02           ` H. Peter Anvin
@ 2024-01-05  1:21             ` Sean Christopherson
  2024-01-05  2:47             ` Andrew Cooper
  1 sibling, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2024-01-05  1:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Elizabeth Figura, x86, Linux Kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Ricardo Neri,
	wine-devel

On Thu, Jan 04, 2024, H. Peter Anvin wrote:
> On January 3, 2024 10:35:28 PM PST, Elizabeth Figura <zfigura@codeweavers.com> wrote:
> >That still leaves the question of performance though. If having to exit the VM 
> >that often for performance reasons isn't feasible, then that's still going to 
> >force us to implement from scratch an inordinate amount of kernel/library code 
> >inside the VM just to avoid the transition. Or, more likely, conclude that a 
> >hypervisor just isn't going to work for us.
> >
> >I'm not at all familiar with the arch code, and I'm sure I'm not asking 
> >anything interesting, but is it really impossible to put CPU_ENTRY_AREA_RO_IDT 
> >somewhere that doesn't truncate to NULL, and to put the GDT at a fixed address 
> >as well?
> 
> Putting the GDT at a fixed address is pretty much a no-go for a variety of
> reasons. As I said, a prctl() to specify the desired return information *on
> UMIP-capable hardware* is certainly doable. However, it does not address
> things like fixed selectors that have come up.
> 
> Note that there is no fundamental reason you cannot run the Unix user space
> code inside the VM container, too; you only need to vmexit on an actual
> system call. KVM might be able to assist there by providing a "short-circuit
> mode", allowing a system call vmexit to invoke the system call directly
> rather than having to bounce back to user space – twice.

Heh, I recommend not re-opening that can of worms[1], though some of the follow-up
work[2] from the gVisor folks might be useful/relevant?

[1] https://lore.kernel.org/all/20220722230241.1944655-1-avagin@google.com
[2] https://lore.kernel.org/all/20230308073201.3102738-1-avagin@google.com


* Re: x86 SGDT emulation for Wine
  2024-01-05  1:02           ` H. Peter Anvin
  2024-01-05  1:21             ` Sean Christopherson
@ 2024-01-05  2:47             ` Andrew Cooper
  2024-01-05  4:03               ` H. Peter Anvin
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Cooper @ 2024-01-05  2:47 UTC (permalink / raw)
  To: H. Peter Anvin, Elizabeth Figura, Sean Christopherson
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel

On 05/01/2024 1:02 am, H. Peter Anvin wrote:
> Note that there is no fundamental reason you cannot run the Unix user space code inside the VM container, too; you only need to vmexit on an actual system call.

I know this is going on a tangent, but getting a VMExit on the SYSCALL
instruction is surprisingly difficult.

The "easy" way is to hide EFER.SCE behind the guests back, intercept #UD
and emulate both the SYSCALL and SYSRET instructions.  It's slow, but it
works.

However, FRED completely prohibits tricks like this, because what you
cannot reasonably do is clear CR4.FRED behind the back of a guest
kernel.  You'd have to intercept and emulate all event sources in order
to catch SYSCALL.

I raised this as a concern during early review, but Intel has no
official feature to take a VMExit on privilege change, and FRED
(rightly) wasn't an appropriate vehicle to add such a feature, so it was
deemed not an issue that the FRED design would break the unofficial ways
that people were using to intercept/monitor/etc system calls.

~Andrew

P.S. Yes, there are more adventurous tricks like injecting a thunk into
the guest kernel and editing MSR_LSTAR behind the guest's back.  In
principle a similar trick works with FRED, but in order to do this to
Windows, you also need to hook PatchGuard to blind it to the thunk, and
this is horribly invasive.


* Re: x86 SGDT emulation for Wine
  2024-01-05  2:47             ` Andrew Cooper
@ 2024-01-05  4:03               ` H. Peter Anvin
  0 siblings, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2024-01-05  4:03 UTC (permalink / raw)
  To: Andrew Cooper, Elizabeth Figura, Sean Christopherson
  Cc: x86, Linux Kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Ricardo Neri, wine-devel

On January 4, 2024 6:47:04 PM PST, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 05/01/2024 1:02 am, H. Peter Anvin wrote:
>> Note that there is no fundamental reason you cannot run the Unix user space code inside the VM container, too; you only need to vmexit on an actual system call.
>
>I know this is going on a tangent, but getting a VMExit on the SYSCALL
>instruction is surprisingly difficult.
>
>The "easy" way is to hide EFER.SCE behind the guests back, intercept #UD
>and emulate both the SYSCALL and SYSRET instructions.  It's slow, but it
>works.
>
>However, FRED completely prohibits tricks like this, because what you
>cannot reasonably do is clear CR4.FRED behind the back of a guest
>kernel.  You'd have to intercept and emulate all event sources in order
>to catch SYSCALL.
>
>I raised this as a concern during early review, but Intel has no
>official feature to take a VMExit on privilege change, and FRED
>(rightly) wasn't an appropriate vehicle to add such a feature, so it was
>deemed not an issue that the FRED design would break the unofficial ways
>that people were using to intercept/monitor/etc system calls.
>
>~Andrew
>
>P.S. Yes, there are more adventurous tricks like injecting a thunk into
>the guest kernel and editing MSR_LSTAR behind the guest's back.  In
>principle a similar trick works with FRED, but in order to do this to
>Windows, you also need to hook checkpatch to blind it to the thunk, and
>this is horribly invasive.

*In this case* it shouldn't be a problem, since the "guest operating system" would be virtually nonexistent and entirely puppeted by Wine.



Thread overview: 12+ messages
2023-12-27 22:20 x86 SGDT emulation for Wine Elizabeth Figura
2023-12-27 23:58 ` H. Peter Anvin
2024-01-02 17:12   ` Sean Christopherson
2024-01-02 19:53   ` Elizabeth Figura
2024-01-03  7:30     ` Stefan Dösinger
2024-01-03 15:19     ` Sean Christopherson
2024-01-03 15:33       ` H. Peter Anvin
2024-01-04  6:35         ` Elizabeth Figura
2024-01-05  1:02           ` H. Peter Anvin
2024-01-05  1:21             ` Sean Christopherson
2024-01-05  2:47             ` Andrew Cooper
2024-01-05  4:03               ` H. Peter Anvin
