[PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
@ 2022-05-17 15:30 Kirill A. Shutemov
  2022-05-17 16:36 ` Dave Hansen
  2022-05-18  8:39 ` David Laight
  0 siblings, 2 replies; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-17 15:30 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	seanjc, thomas.lendacky, x86, linux-kernel, Kirill A. Shutemov

load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

In TDX guest the second page can be shared page and VMM may configure it
to trigger #VE.

Kernel assumes that #VE on a shared page is MMIO access and tries to
decode instruction to handle it. In case of load_unaligned_zeropad() it
may result in confusion as it is not MMIO access.

Check fixup table before trying to handle MMIO.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..5fbdda2f2b86 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -11,6 +11,8 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/trapnr.h>
+#include <asm/extable.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -296,6 +298,26 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	if (WARN_ON_ONCE(user_mode(regs)))
 		return false;
 
+	/*
+	 * load_unaligned_zeropad() relies on exception fixups in case of the
+	 * word being a page-crosser and the second page is not accessible.
+	 *
+	 * In TDX guest the second page can be shared page and VMM may
+	 * configure it to trigger #VE.
+	 *
+	 * Kernel assumes that #VE on a shared page is MMIO access and tries to
+	 * decode instruction to handle it. In case of load_unaligned_zeropad()
+	 * it may result in confusion as it is not MMIO access.
+	 *
+	 * Check fixup table before trying to handle MMIO.
+	 */
+	if (fixup_exception(regs, X86_TRAP_VE, 0, ve->gla)) {
+		/* regs->ip is adjusted by fixup_exception() */
+		ve->instr_len = 0;
+
+		return true;
+	}
+
 	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
 		return false;
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 15:30 [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page Kirill A. Shutemov
@ 2022-05-17 16:36 ` Dave Hansen
  2022-05-17 17:40   ` Kirill A. Shutemov
  2022-05-18  8:39 ` David Laight
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2022-05-17 16:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	seanjc, thomas.lendacky, x86, linux-kernel

On 5/17/22 08:30, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
> 
> In TDX guest the second page can be shared page and VMM may configure it
> to trigger #VE.
> 
> Kernel assumes that #VE on a shared page is MMIO access and tries to
> decode instruction to handle it. In case of load_unaligned_zeropad() it
> may result in confusion as it is not MMIO access.
> 
> Check fixup table before trying to handle MMIO.

Is this a theoretical problem or was it found in practice?

> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 03deb4d6920d..5fbdda2f2b86 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -11,6 +11,8 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/trapnr.h>
> +#include <asm/extable.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
> @@ -296,6 +298,26 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>  	if (WARN_ON_ONCE(user_mode(regs)))
>  		return false;
>  
> +	/*
> +	 * load_unaligned_zeropad() relies on exception fixups in case of the
> +	 * word being a page-crosser and the second page is not accessible.
> +	 *
> +	 * In TDX guest the second page can be shared page and VMM may

	In TDX guests,

> +	 * configure it to trigger #VE.
> +	 *
> +	 * Kernel assumes that #VE on a shared page is MMIO access and tries to
> +	 * decode instruction to handle it. In case of load_unaligned_zeropad()
> +	 * it may result in confusion as it is not MMIO access.
> +	 *
> +	 * Check fixup table before trying to handle MMIO.
> +	 */
> +	if (fixup_exception(regs, X86_TRAP_VE, 0, ve->gla)) {
> +		/* regs->ip is adjusted by fixup_exception() */
> +		ve->instr_len = 0;
> +
> +		return true;
> +	}

This 've->instr_len = ' stuff is just a hack.

ve_info is a software structure.  Why not just add a:

	bool ip_adjusted;

which defaults to false, then we have:

	/*
	 * Adjust RIP if the exception was handled
	 * but RIP was not adjusted.
	 */
	if (!ret && !ve_info->ip_adjusted)
		regs->ip += ve_info->instr_len;

One other oddity I just stumbled upon:

static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
{
...
        ve->instr_len = insn.length;

Why does that need to override 've->instr_len'?  What was wrong with the
gunk in r10 that came out of TDX_GET_VEINFO?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 16:36 ` Dave Hansen
@ 2022-05-17 17:40   ` Kirill A. Shutemov
  2022-05-17 18:14     ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-17 17:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy, ak,
	dan.j.williams, david, hpa, seanjc, thomas.lendacky, x86,
	linux-kernel

On Tue, May 17, 2022 at 09:36:03AM -0700, Dave Hansen wrote:
> On 5/17/22 08:30, Kirill A. Shutemov wrote:
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> > 
> > In TDX guest the second page can be shared page and VMM may configure it
> > to trigger #VE.
> > 
> > Kernel assumes that #VE on a shared page is MMIO access and tries to
> > decode instruction to handle it. In case of load_unaligned_zeropad() it
> > may result in confusion as it is not MMIO access.
> > 
> > Check fixup table before trying to handle MMIO.
> 
> Is this a theoretical problem or was it found in practice?

No, it was found based on analysis.

The problem was found in practice for private pages (see the patch in the
unaccepted memory support patchset), but not for shared.

For shared I had to do some tricks to get it triggered. Shared pages that
configured to trigger MMIO normally comes from ioremap() and they are not
mapped next to normally allocated pages. I had to force this situation.

But there are normally allocated pages that we make shared, like SWIOTLB
buffer. These pages usually do not trigger #VE, but malicious host can
configure them to trigged it at any point.

Even after I forced the situation, insn_decode_mmio() worked fine as
load_unaligned_zeropad() uses a flavour of MOV that insn_decode_mmio() can
decode. But it gets situation worse: we ask host to handle MMIO for the
address in private page and it allows host to override the part of the
word that comes from the private pages. :/

> > diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> > index 03deb4d6920d..5fbdda2f2b86 100644
> > --- a/arch/x86/coco/tdx/tdx.c
> > +++ b/arch/x86/coco/tdx/tdx.c
> > @@ -11,6 +11,8 @@
> >  #include <asm/insn.h>
> >  #include <asm/insn-eval.h>
> >  #include <asm/pgtable.h>
> > +#include <asm/trapnr.h>
> > +#include <asm/extable.h>
> >  
> >  /* TDX module Call Leaf IDs */
> >  #define TDX_GET_INFO			1
> > @@ -296,6 +298,26 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> >  	if (WARN_ON_ONCE(user_mode(regs)))
> >  		return false;
> >  
> > +	/*
> > +	 * load_unaligned_zeropad() relies on exception fixups in case of the
> > +	 * word being a page-crosser and the second page is not accessible.
> > +	 *
> > +	 * In TDX guest the second page can be shared page and VMM may
> 
> 	In TDX guests,
> 
> > +	 * configure it to trigger #VE.
> > +	 *
> > +	 * Kernel assumes that #VE on a shared page is MMIO access and tries to
> > +	 * decode instruction to handle it. In case of load_unaligned_zeropad()
> > +	 * it may result in confusion as it is not MMIO access.
> > +	 *
> > +	 * Check fixup table before trying to handle MMIO.
> > +	 */
> > +	if (fixup_exception(regs, X86_TRAP_VE, 0, ve->gla)) {
> > +		/* regs->ip is adjusted by fixup_exception() */
> > +		ve->instr_len = 0;
> > +
> > +		return true;
> > +	}
> 
> This 've->instr_len = ' stuff is just a hack.
> 
> ve_info is a software structure.  Why not just add a:
> 
> 	bool ip_adjusted;
> 
> which defaults to false, then we have:
> 
> 	/*
> 	 * Adjust RIP if the exception was handled
> 	 * but RIP was not adjusted.
> 	 */
> 	if (!ret && !ve_info->ip_adjusted)
> 		regs->ip += ve_info->instr_len;
> 
> One other oddity I just stumbled upon:
> 
> static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> {
> ...
>         ve->instr_len = insn.length;
> 
> Why does that need to override 've->instr_len'?  What was wrong with the
> gunk in r10 that came out of TDX_GET_VEINFO?

TDX module doesn't decode MMIO instruction and does not provide valid size
of it. We had to do it manually, based on decoding.

Given that we had to adjust IP in handle_mmio() anyway, do you still think
"ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 17:40   ` Kirill A. Shutemov
@ 2022-05-17 18:14     ` Dave Hansen
  2022-05-17 20:17       ` Kirill A. Shutemov
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2022-05-17 18:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy, ak,
	dan.j.williams, david, hpa, seanjc, thomas.lendacky, x86,
	linux-kernel

On 5/17/22 10:40, Kirill A. Shutemov wrote:
>>
>> ve_info is a software structure.  Why not just add a:
>>
>> 	bool ip_adjusted;
>>
>> which defaults to false, then we have:
>>
>> 	/*
>> 	 * Adjust RIP if the exception was handled
>> 	 * but RIP was not adjusted.
>> 	 */
>> 	if (!ret && !ve_info->ip_adjusted)
>> 		regs->ip += ve_info->instr_len;
>>
>> One other oddity I just stumbled upon:
>>
>> static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
>> {
>> ...
>>         ve->instr_len = insn.length;
>>
>> Why does that need to override 've->instr_len'?  What was wrong with the
>> gunk in r10 that came out of TDX_GET_VEINFO?
> TDX module doesn't decode MMIO instruction and does not provide valid size
> of it. We had to do it manually, based on decoding.

That's worth a comment, don't you think?  I'd add one both in where the
ve_info is filled and where ve->instr_len is adjusted.

> Given that we had to adjust IP in handle_mmio() anyway, do you still think
> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.

Something is wrong about it.

You could call it 've->instr_bytes_to_handle' or something.  Then it
makes actual logical sense when you handle it to zero it out.  I just
want it to be more explicit when the upper levels need to do something.

Does ve->instr_len==0 both when the TDX module isn't providing
instruction sizes *and* when no handling is necessary?  That seems like
an unfortunate logical multiplexing of 0.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 18:14     ` Dave Hansen
@ 2022-05-17 20:17       ` Kirill A. Shutemov
  2022-05-17 22:16         ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-17 20:17 UTC (permalink / raw)
  To: Dave Hansen, seanjc
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy, ak,
	dan.j.williams, david, hpa, thomas.lendacky, x86, linux-kernel

On Tue, May 17, 2022 at 11:14:13AM -0700, Dave Hansen wrote:
> On 5/17/22 10:40, Kirill A. Shutemov wrote:
> >>
> >> ve_info is a software structure.  Why not just add a:
> >>
> >> 	bool ip_adjusted;
> >>
> >> which defaults to false, then we have:
> >>
> >> 	/*
> >> 	 * Adjust RIP if the exception was handled
> >> 	 * but RIP was not adjusted.
> >> 	 */
> >> 	if (!ret && !ve_info->ip_adjusted)
> >> 		regs->ip += ve_info->instr_len;
> >>
> >> One other oddity I just stumbled upon:
> >>
> >> static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> >> {
> >> ...
> >>         ve->instr_len = insn.length;
> >>
> >> Why does that need to override 've->instr_len'?  What was wrong with the
> >> gunk in r10 that came out of TDX_GET_VEINFO?
> > TDX module doesn't decode MMIO instruction and does not provide valid size
> > of it. We had to do it manually, based on decoding.
> 
> That's worth a comment, don't you think?  I'd add one both in where the
> ve_info is filled and where ve->instr_len is adjusted.

Okay. Will do.
 
> > Given that we had to adjust IP in handle_mmio() anyway, do you still think
> > "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
> 
> Something is wrong about it.
> 
> You could call it 've->instr_bytes_to_handle' or something. Then it
> makes actual logical sense when you handle it to zero it out.  I just
> want it to be more explicit when the upper levels need to do something.
> 
> Does ve->instr_len==0 both when the TDX module isn't providing
> instruction sizes *and* when no handling is necessary?  That seems like
> an unfortunate logical multiplexing of 0.

For EPT violation, ve->instr_len has *something* (not zero) that doesn't
match the actual instruction size. I dig out that it is filled with data
from VMREAD(0x440C), but I don't know where is the ultimate origin of the
data.

I don't understand virtualization side of the thing well enough. 

Maybe someone who knows virtualtion could comment here. Sean?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 20:17       ` Kirill A. Shutemov
@ 2022-05-17 22:16         ` Dave Hansen
  2022-05-17 22:40           ` Sean Christopherson
  2022-05-19 18:07           ` Kirill A. Shutemov
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2022-05-17 22:16 UTC (permalink / raw)
  To: Kirill A. Shutemov, seanjc
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy, ak,
	dan.j.williams, david, hpa, thomas.lendacky, x86, linux-kernel

On 5/17/22 13:17, Kirill A. Shutemov wrote:
>>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
>>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
>> Something is wrong about it.
>>
>> You could call it 've->instr_bytes_to_handle' or something. Then it
>> makes actual logical sense when you handle it to zero it out.  I just
>> want it to be more explicit when the upper levels need to do something.
>>
>> Does ve->instr_len==0 both when the TDX module isn't providing
>> instruction sizes *and* when no handling is necessary?  That seems like
>> an unfortunate logical multiplexing of 0.
> For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> match the actual instruction size. I dig out that it is filled with data
> from VMREAD(0x440C), but I don't know where is the ultimate origin of the
> data.

The SDM has a breakdown:

	27.2.5 Information for VM Exits Due to Instruction Execution

I didn't realize it came from VMREAD.  I guess I assumed it came from
some TDX module magic.  Silly me.

The SDM makes it sound like we should be more judicious about using
've->instr_len' though.  "All VM exits other than those listed in the
above items leave this field undefined."  Looking over
virt_exception_kernel(), we've got five cases from CPU instructions that
cause unconditional VMEXITs:

        case EXIT_REASON_HLT:
        case EXIT_REASON_MSR_READ:
        case EXIT_REASON_MSR_WRITE:
        case EXIT_REASON_CPUID:
        case EXIT_REASON_IO_INSTRUCTION:

and should have that field filled out, plus one that doesn't:

        case EXIT_REASON_IO_INSTRUCTION:

It seems awfully fragile to me to have the hardware be providing the
'instr_len' in those cases, but not in one other one.  The data in there
is garbage for EXIT_REASON_IO_INSTRUCTION.  The reason we don't consume
garbage is that all the paths leading out of handle_mmio() that return
true also set 've->instr_len'.  But that logic is entirely opaque.

It's also borderline criminal to have six functions that look identical
(in that switch statement), but one of them has different behavior for
've->instr_len'.

I'd probably do it like this:

static int handle_halt(struct ve_info *ve)
{
        /*
         * Since non safe halt is mainly used in CPU offlining
         * and the guest will always stay in the halt state, don't
         * call the STI instruction (set do_sti as false).
         */
        const bool irq_disabled = irqs_disabled();
        const bool do_sti = false;

        if (__halt(irq_disabled, do_sti))
                return -EIO;

	/*
	 * VM-exit instruction length is defined for HLT.  See:
	 * "Information for VM Exits Due to Instruction Execution"
	 * in the SDM.
	 */
        return ve->insn_length;
}

Any >=0 return means the exception was handled and it tells the caller
hoe much to advance RIP.

Then handle_mmio() can say:

	/*
	 * VM-exit instruction length is not provided for the EPT
	 * violations that MMIO causes.  Use the insn_decode() length:
	 */
        return insn.length;

See?  Now everybody that goes and writes a new #VE exception helper has
a chance of actually getting this right.  As it stands, if someone adds
one more of these, they'll probably get random behavior.  This way, they
actually have to choose.  They _might_ even go looking at the SDM.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 22:16         ` Dave Hansen
@ 2022-05-17 22:40           ` Sean Christopherson
  2022-05-17 22:52             ` Dave Hansen
  2022-05-17 22:52             ` Sean Christopherson
  2022-05-19 18:07           ` Kirill A. Shutemov
  1 sibling, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2022-05-17 22:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On Tue, May 17, 2022, Dave Hansen wrote:
> On 5/17/22 13:17, Kirill A. Shutemov wrote:
> >>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
> >>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
> >> Something is wrong about it.
> >>
> >> You could call it 've->instr_bytes_to_handle' or something. Then it
> >> makes actual logical sense when you handle it to zero it out.  I just
> >> want it to be more explicit when the upper levels need to do something.
> >>
> >> Does ve->instr_len==0 both when the TDX module isn't providing
> >> instruction sizes *and* when no handling is necessary?  That seems like
> >> an unfortunate logical multiplexing of 0.
> > For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> > match the actual instruction size. I dig out that it is filled with data
> > from VMREAD(0x440C), but I don't know where is the ultimate origin of the
> > data.
> 
> The SDM has a breakdown:
> 
> 	27.2.5 Information for VM Exits Due to Instruction Execution
> 
> I didn't realize it came from VMREAD.  I guess I assumed it came from
> some TDX module magic.  Silly me.
> 
> The SDM makes it sound like we should be more judicious about using
> 've->instr_len' though.  "All VM exits other than those listed in the
> above items leave this field undefined."  Looking over
> virt_exception_kernel(), we've got five cases from CPU instructions that
> cause unconditional VMEXITs:

None of the below exit unconditionally.

>         case EXIT_REASON_HLT:
>         case EXIT_REASON_MSR_READ:
>         case EXIT_REASON_MSR_WRITE:
>         case EXIT_REASON_CPUID:
>         case EXIT_REASON_IO_INSTRUCTION:
> 
> and should have that field filled out, plus one that doesn't:
> 
>         case EXIT_REASON_IO_INSTRUCTION:

I/O fills the length.  IN, INS, OUT, and OUTS are all listed.  It's not just
unconditional exits that provide the instruction length.  The instruction length
is provided if the instruction is the direct cause of the exit, whether or not
the instruction exits unconditionally doesn't matter.

  For fault-like VM exits due to attempts to execute one of the following
  instructions that cause VM exits unconditionally or based on the settings of
  VM-execution controls.

> Then handle_mmio() can say:
> 
> 	/*
> 	 * VM-exit instruction length is not provided for the EPT
> 	 * violations that MMIO causes.  Use the insn_decode() length:

This is inaccurate.  The instruction length _is_ provided on EPT Violation VM-Exits
(it's also provided by all Intel CPUs on EPT Misconfigs even though the SDM doesn't
say so).

The instruction length is wrong in the TDX case because there is no EPT Violation
VM-Exit.  The EPT Violation is morphed to a #VE by the CPU, and the instruction
length isn't one of the fields that's saved into the #VE info struct by the CPU.
When the TDX Module gets control on the TDCALL, VMCS.INSTRUCTION_LENGTH will hold
the length of the TDCALL, not the instruction that caused the #VE, i.e. the TDX
Module can't provide the correct length.

For all other #VE cases in TDX, the #VE is injected by software (TDX module) after
the instruction-based VM-Exit.  Before injecting the #VE, the TDX module grabs the
length from the VMCS and manually records it in the #VE info struct.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 22:40           ` Sean Christopherson
@ 2022-05-17 22:52             ` Dave Hansen
  2022-05-17 22:52             ` Sean Christopherson
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2022-05-17 22:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On 5/17/22 15:40, Sean Christopherson wrote:
> On Tue, May 17, 2022, Dave Hansen wrote:
>> The SDM makes it sound like we should be more judicious about using
>> 've->instr_len' though.  "All VM exits other than those listed in the
>> above items leave this field undefined."  Looking over
>> virt_exception_kernel(), we've got five cases from CPU instructions that
>> cause unconditional VMEXITs:
> 
> None of the below exit unconditionally.
...
>   For fault-like VM exits due to attempts to execute one of the following
>   instructions that cause VM exits unconditionally or based on the settings of
>   VM-execution controls.

Ahh, got it, thanks.  I bailed on reading that sentence before I got to
the "VM-execution controls" bit.

>> Then handle_mmio() can say:
>>
>> 	/*
>> 	 * VM-exit instruction length is not provided for the EPT
>> 	 * violations that MMIO causes.  Use the insn_decode() length:
> 
> This is inaccurate.  The instruction length _is_ provided on EPT Violation VM-Exits
> (it's also provided by all Intel CPUs on EPT Misconfigs even though the SDM doesn't
> say so).
> 
> The instruction length is wrong in the TDX case because there is no EPT Violation
> VM-Exit.  The EPT Violation is morphed to a #VE by the CPU, and the instruction
> length isn't one of the fields that's saved into the #VE info struct by the CPU.
> When the TDX Module gets control on the TDCALL, VMCS.INSTRUCTION_LENGTH will hold
> the length of the TDCALL, not the instruction that caused the #VE, i.e. the TDX
> Module can't provide the correct length.
> 
> For all other #VE cases in TDX, the #VE is injected by software (TDX module) after
> the instruction-based VM-Exit.  Before injecting the #VE, the TDX module grabs the
> length from the VMCS and manually records it in the #VE info struct.

That's horribly entertaining background. :)

But, it doesn't get us much closer to deciding when we can consume the
instruction length out of the ve_info.  It seems like magic that jut
happens to work at the moment.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 22:40           ` Sean Christopherson
  2022-05-17 22:52             ` Dave Hansen
@ 2022-05-17 22:52             ` Sean Christopherson
  2022-05-19 18:19               ` Kirill A. Shutemov
  1 sibling, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2022-05-17 22:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On Tue, May 17, 2022, Sean Christopherson wrote:
> On Tue, May 17, 2022, Dave Hansen wrote:
> > On 5/17/22 13:17, Kirill A. Shutemov wrote:
> > >>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
> > >>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
> > >> Something is wrong about it.
> > >>
> > >> You could call it 've->instr_bytes_to_handle' or something. Then it
> > >> makes actual logical sense when you handle it to zero it out.  I just
> > >> want it to be more explicit when the upper levels need to do something.
> > >>
> > >> Does ve->instr_len==0 both when the TDX module isn't providing
> > >> instruction sizes *and* when no handling is necessary?  That seems like
> > >> an unfortunate logical multiplexing of 0.
> > > For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> > > match the actual instruction size. I dig out that it is filled with data
> > > from VMREAD(0x440C), but I don't know where is the ultimate origin of the
> > > data.
> > 
> > The SDM has a breakdown:
> > 
> > 	27.2.5 Information for VM Exits Due to Instruction Execution
> > 
> > I didn't realize it came from VMREAD.  I guess I assumed it came from
> > some TDX module magic.  Silly me.
> > 
> > The SDM makes it sound like we should be more judicious about using
> > 've->instr_len' though.  "All VM exits other than those listed in the
> > above items leave this field undefined."  Looking over
> > virt_exception_kernel(), we've got five cases from CPU instructions that
> > cause unconditional VMEXITs:

Ideally, what the SDM says wouldn't matter at all.  The TDX module spec really
should be the authorative source in this case, but it just punts to the SDM:

  The 32-bit value that would have been saved into the VMCS as VM-exit instruction
  length if a legacy VM exit had occurred instead of the virtualization exception.

Even if the TDX spec wants to punt to the SDM, it would save a lot of headache and
SDM reading if it also said something to the effect of:

  The INSTRUCTION_LENGTH and INSTRUCTION_INFORMATION fields are valid for all
  #VEs injected by the Intel TDX Module.  The fields are undefined for #VEs
  injected by the CPU due to EPT Violations.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 15:30 [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page Kirill A. Shutemov
  2022-05-17 16:36 ` Dave Hansen
@ 2022-05-18  8:39 ` David Laight
  2022-05-18 12:18   ` Kirill A. Shutemov
  1 sibling, 1 reply; 15+ messages in thread
From: David Laight @ 2022-05-18  8:39 UTC (permalink / raw)
  To: 'Kirill A. Shutemov', tglx, mingo, bp, dave.hansen, luto, peterz
  Cc: sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	seanjc, thomas.lendacky, x86, linux-kernel

From: Kirill A. Shutemov
> Sent: 17 May 2022 16:30
> 
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
> 
> In TDX guest the second page can be shared page and VMM may configure it
> to trigger #VE.
> 
> Kernel assumes that #VE on a shared page is MMIO access and tries to
> decode instruction to handle it. In case of load_unaligned_zeropad() it
> may result in confusion as it is not MMIO access.
> 
> Check fixup table before trying to handle MMIO.

Is it best to avoid that all happening by avoiding mapping
'normal memory' below anything that isn't normal memory.

Even on a normal system it is potentially possibly that the
second page might be MMIO and reference a target that doesn't
want to see non-word sized reads.
(Or the first location might be a fifo and the read consumes
some data.)

In that case the cpu won't fault the access, but the hardware
access might have rather unexpected side effects.

Now the way MMIO pages are allocated probably makes that
impossible - but load_unaligned_zeropad() relies on
it not happening or not breaking anything.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-18  8:39 ` David Laight
@ 2022-05-18 12:18   ` Kirill A. Shutemov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-18 12:18 UTC (permalink / raw)
  To: David Laight
  Cc: tglx, mingo, bp, dave.hansen, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	seanjc, thomas.lendacky, x86, linux-kernel

On Wed, May 18, 2022 at 08:39:45AM +0000, David Laight wrote:
> From: Kirill A. Shutemov
> > Sent: 17 May 2022 16:30
> > 
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> > 
> > In TDX guest the second page can be shared page and VMM may configure it
> > to trigger #VE.
> > 
> > Kernel assumes that #VE on a shared page is MMIO access and tries to
> > decode instruction to handle it. In case of load_unaligned_zeropad() it
> > may result in confusion as it is not MMIO access.
> > 
> > Check fixup table before trying to handle MMIO.
> 
> Is it best to avoid that all happening by avoiding mapping
> 'normal memory' below anything that isn't normal memory.
> 
> Even on a normal system it is potentially possibly that the
> second page might be MMIO and reference a target that doesn't
> want to see non-word sized reads.
> (Or the first location might be a fifo and the read consumes
> some data.)
> 
> In that case the cpu won't fault the access, but the hardware
> access might have rather unexpected side effects.
> 
> Now the way MMIO pages are allocated probably makes that
> impossible - but load_unaligned_zeropad() relies on
> it not happening or not breaking anything.

Normally MMIO mappings comes from ioremap() and it does not land next to
normal pages in virtual memory. So I don't think there's high risk of MMIO
being a problem on normal machines.

What makes TDX (and other confidential computing platforms) different is
security model: host and VMM considered hostile and we need protect
against it. In TDX case, VMM can make any shared memory (such as DMA
buffers) to trigger #VE that kernel interprets as MMIO access. We need to
make sure host cannot exploit it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 22:16         ` Dave Hansen
  2022-05-17 22:40           ` Sean Christopherson
@ 2022-05-19 18:07           ` Kirill A. Shutemov
  2022-05-19 18:33             ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-19 18:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: seanjc, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On Tue, May 17, 2022 at 03:16:42PM -0700, Dave Hansen wrote:
> See?  Now everybody that goes and writes a new #VE exception helper has
> a chance of actually getting this right.  As it stands, if someone adds
> one more of these, they'll probably get random behavior.  This way, they
> actually have to choose.  They _might_ even go looking at the SDM.

Okay. See below. Does it match what you had in mind?

If it is okay. I will do proper patches.

BTW, I found a bug in tdx_early_handle_ve(). It didn't update RIP.
I don't know how it happend. Maybe got lost on the way upstream.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 9955b5a89df8..d2635ac52d9b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -123,7 +123,7 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
 	return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
 }
 
-static bool handle_halt(void)
+static int handle_halt(struct ve_info *ve)
 {
 	/*
 	 * Since non safe halt is mainly used in CPU offlining
@@ -134,9 +134,9 @@ static bool handle_halt(void)
 	const bool do_sti = false;
 
 	if (__halt(irq_disabled, do_sti))
-		return false;
+		return -EIO;
 
-	return true;
+	return ve->instr_len;
 }
 
 void __cpuidle tdx_safe_halt(void)
@@ -156,7 +156,7 @@ void __cpuidle tdx_safe_halt(void)
 		WARN_ONCE(1, "HLT instruction emulation failed\n");
 }
 
-static bool read_msr(struct pt_regs *regs)
+static int read_msr(struct pt_regs *regs, struct ve_info *ve)
 {
 	struct tdx_hypercall_args args = {
 		.r10 = TDX_HYPERCALL_STANDARD,
@@ -170,14 +170,14 @@ static bool read_msr(struct pt_regs *regs)
 	 * (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
 	 */
 	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
-		return false;
+		return -EIO;
 
 	regs->ax = lower_32_bits(args.r11);
 	regs->dx = upper_32_bits(args.r11);
-	return true;
+	return ve->instr_len;
 }
 
-static bool write_msr(struct pt_regs *regs)
+static int write_msr(struct pt_regs *regs, struct ve_info *ve)
 {
 	struct tdx_hypercall_args args = {
 		.r10 = TDX_HYPERCALL_STANDARD,
@@ -191,10 +191,13 @@ static bool write_msr(struct pt_regs *regs)
 	 * can be found in TDX Guest-Host-Communication Interface
 	 * (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
 	 */
-	return !__tdx_hypercall(&args, 0);
+	if (__tdx_hypercall(&args, 0))
+		return -EIO;
+
+	return ve->instr_len;
 }
 
-static bool handle_cpuid(struct pt_regs *regs)
+static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
 {
 	struct tdx_hypercall_args args = {
 		.r10 = TDX_HYPERCALL_STANDARD,
@@ -212,7 +215,7 @@ static bool handle_cpuid(struct pt_regs *regs)
 	 */
 	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
 		regs->ax = regs->bx = regs->cx = regs->dx = 0;
-		return true;
+		return ve->instr_len;
 	}
 
 	/*
@@ -221,7 +224,7 @@ static bool handle_cpuid(struct pt_regs *regs)
 	 * (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
 	 */
 	if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
-		return false;
+		return -EIO;
 
 	/*
 	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
@@ -233,7 +236,7 @@ static bool handle_cpuid(struct pt_regs *regs)
 	regs->cx = args.r14;
 	regs->dx = args.r15;
 
-	return true;
+	return ve->instr_len;
 }
 
 static bool mmio_read(int size, unsigned long addr, unsigned long *val)
@@ -259,7 +262,7 @@ static bool mmio_write(int size, unsigned long addr, unsigned long val)
 			       EPT_WRITE, addr, val);
 }
 
-static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
+static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 {
 	char buffer[MAX_INSN_SIZE];
 	unsigned long *reg, val;
@@ -270,7 +273,7 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 
 	/* Only in-kernel MMIO is supported */
 	if (WARN_ON_ONCE(user_mode(regs)))
-		return false;
+		return -EFAULT;
 
 	/*
 	 * load_unaligned_zeropad() relies on exception fixups in case of the
@@ -287,37 +290,37 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	 */
 	if (fixup_exception(regs, X86_TRAP_VE, 0, ve->gla)) {
 		/* regs->ip is adjusted by fixup_exception() */
-		ve->instr_len = 0;
-
-		return true;
+		return 0;
 	}
 
 	if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
-		return false;
+		return -EFAULT;
 
 	if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
-		return false;
+		return -EINVAL;
 
 	mmio = insn_decode_mmio(&insn, &size);
 	if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
-		return false;
+		return -EINVAL;
 
 	if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
 		reg = insn_get_modrm_reg_ptr(&insn, regs);
 		if (!reg)
-			return false;
+			return -EINVAL;
 	}
 
-	ve->instr_len = insn.length;
-
 	/* Handle writes first */
 	switch (mmio) {
 	case MMIO_WRITE:
 		memcpy(&val, reg, size);
-		return mmio_write(size, ve->gpa, val);
+		if (!mmio_write(size, ve->gpa, val))
+			return -EIO;
+		return insn.length;
 	case MMIO_WRITE_IMM:
 		val = insn.immediate.value;
-		return mmio_write(size, ve->gpa, val);
+		if (!mmio_write(size, ve->gpa, val))
+			return -EIO;
+		return insn.length;
 	case MMIO_READ:
 	case MMIO_READ_ZERO_EXTEND:
 	case MMIO_READ_SIGN_EXTEND:
@@ -330,15 +333,15 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 		 * decoded or handled properly. It was likely not using io.h
 		 * helpers or accessed MMIO accidentally.
 		 */
-		return false;
+		return -EINVAL;
 	default:
 		WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
-		return false;
+		return -EINVAL;
 	}
 
 	/* Handle reads */
 	if (!mmio_read(size, ve->gpa, &val))
-		return false;
+		return -EIO;
 
 	switch (mmio) {
 	case MMIO_READ:
@@ -360,13 +363,13 @@ static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve)
 	default:
 		/* All other cases has to be covered with the first switch() */
 		WARN_ON_ONCE(1);
-		return false;
+		return -EINVAL;
 	}
 
 	if (extend_size)
 		memset(reg, extend_val, extend_size);
 	memcpy(reg, &val, size);
-	return true;
+	return insn.length;
 }
 
 static bool handle_in(struct pt_regs *regs, int size, int port)
@@ -417,13 +420,14 @@ static bool handle_out(struct pt_regs *regs, int size, int port)
  *
  * Return True on success or False on failure.
  */
-static bool handle_io(struct pt_regs *regs, u32 exit_qual)
+static int handle_io(struct pt_regs *regs, struct ve_info *ve)
 {
+	u32 exit_qual = ve->exit_qual;
 	int size, port;
-	bool in;
+	bool in, ret;
 
 	if (VE_IS_IO_STRING(exit_qual))
-		return false;
+		return -EIO;
 
 	in   = VE_IS_IO_IN(exit_qual);
 	size = VE_GET_IO_SIZE(exit_qual);
@@ -431,9 +435,13 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)
 
 
 	if (in)
-		return handle_in(regs, size, port);
+		ret = handle_in(regs, size, port);
 	else
-		return handle_out(regs, size, port);
+		ret = handle_out(regs, size, port);
+	if (!ret)
+		return -EIO;
+
+	return ve->instr_len;
 }
 
 /*
@@ -443,13 +451,19 @@ static bool handle_io(struct pt_regs *regs, u32 exit_qual)
 __init bool tdx_early_handle_ve(struct pt_regs *regs)
 {
 	struct ve_info ve;
+	int ret;
 
 	tdx_get_ve_info(&ve);
 
 	if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
 		return false;
 
-	return handle_io(regs, ve.exit_qual);
+	ret = handle_io(regs, &ve);
+	if (ret < 0)
+		return false;
+
+	regs->ip += ret;
+	return true;
 }
 
 void tdx_get_ve_info(struct ve_info *ve)
@@ -483,53 +497,55 @@ void tdx_get_ve_info(struct ve_info *ve)
 }
 
 /* Handle the user initiated #VE */
-static bool virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
+static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
 {
 	switch (ve->exit_reason) {
 	case EXIT_REASON_CPUID:
-		return handle_cpuid(regs);
+		return handle_cpuid(regs, ve);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-		return false;
+		return -EIO;
 	}
 }
 
 /* Handle the kernel #VE */
-static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
+static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
 {
 	switch (ve->exit_reason) {
 	case EXIT_REASON_HLT:
-		return handle_halt();
+		return handle_halt(ve);
 	case EXIT_REASON_MSR_READ:
-		return read_msr(regs);
+		return read_msr(regs, ve);
 	case EXIT_REASON_MSR_WRITE:
-		return write_msr(regs);
+		return write_msr(regs, ve);
 	case EXIT_REASON_CPUID:
-		return handle_cpuid(regs);
+		return handle_cpuid(regs, ve);
 	case EXIT_REASON_EPT_VIOLATION:
 		return handle_mmio(regs, ve);
 	case EXIT_REASON_IO_INSTRUCTION:
-		return handle_io(regs, ve->exit_qual);
+		return handle_io(regs, ve);
 	default:
 		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
-		return false;
+		return -EIO;
 	}
 }
 
 bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
 {
-	bool ret;
+	int ret;
 
 	if (user_mode(regs))
 		ret = virt_exception_user(regs, ve);
 	else
 		ret = virt_exception_kernel(regs, ve);
 
+	if (ret < 0)
+		return false;
+
 	/* After successful #VE handling, move the IP */
-	if (ret)
-		regs->ip += ve->instr_len;
+	regs->ip += ret;
 
-	return ret;
+	return true;
 }
 
 static bool tdx_tlb_flush_required(bool private)
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-17 22:52             ` Sean Christopherson
@ 2022-05-19 18:19               ` Kirill A. Shutemov
  2022-05-19 18:35                 ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2022-05-19 18:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On Tue, May 17, 2022 at 10:52:48PM +0000, Sean Christopherson wrote:
> On Tue, May 17, 2022, Sean Christopherson wrote:
> > On Tue, May 17, 2022, Dave Hansen wrote:
> > > On 5/17/22 13:17, Kirill A. Shutemov wrote:
> > > >>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
> > > >>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
> > > >> Something is wrong about it.
> > > >>
> > > >> You could call it 've->instr_bytes_to_handle' or something. Then it
> > > >> makes actual logical sense when you handle it to zero it out.  I just
> > > >> want it to be more explicit when the upper levels need to do something.
> > > >>
> > > >> Does ve->instr_len==0 both when the TDX module isn't providing
> > > >> instruction sizes *and* when no handling is necessary?  That seems like
> > > >> an unfortunate logical multiplexing of 0.
> > > > For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> > > > match the actual instruction size. I dig out that it is filled with data
> > > > from VMREAD(0x440C), but I don't know where is the ultimate origin of the
> > > > data.
> > > 
> > > The SDM has a breakdown:
> > > 
> > > 	27.2.5 Information for VM Exits Due to Instruction Execution
> > > 
> > > I didn't realize it came from VMREAD.  I guess I assumed it came from
> > > some TDX module magic.  Silly me.
> > > 
> > > The SDM makes it sound like we should be more judicious about using
> > > 've->instr_len' though.  "All VM exits other than those listed in the
> > > above items leave this field undefined."  Looking over
> > > virt_exception_kernel(), we've got five cases from CPU instructions that
> > > cause unconditional VMEXITs:
> 
> Ideally, what the SDM says wouldn't matter at all.  The TDX module spec really
> should be the authorative source in this case, but it just punts to the SDM:
> 
>   The 32-bit value that would have been saved into the VMCS as VM-exit instruction
>   length if a legacy VM exit had occurred instead of the virtualization exception.
> 
> Even if the TDX spec wants to punt to the SDM, it would save a lot of headache and
> SDM reading if it also said something to the effect of:
> 
>   The INSTRUCTION_LENGTH and INSTRUCTION_INFORMATION fields are valid for all
>   #VEs injected by the Intel TDX Module.  The fields are undefined for #VEs
>   injected by the CPU due to EPT Violations.

I initiated update to the spec, but it will take time.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-19 18:07           ` Kirill A. Shutemov
@ 2022-05-19 18:33             ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2022-05-19 18:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: seanjc, tglx, mingo, bp, luto, peterz,
	sathyanarayanan.kuppuswamy, ak, dan.j.williams, david, hpa,
	thomas.lendacky, x86, linux-kernel

On 5/19/22 11:07, Kirill A. Shutemov wrote:
> On Tue, May 17, 2022 at 03:16:42PM -0700, Dave Hansen wrote:
>> See?  Now everybody that goes and writes a new #VE exception helper has
>> a chance of actually getting this right.  As it stands, if someone adds
>> one more of these, they'll probably get random behavior.  This way, they
>> actually have to choose.  They _might_ even go looking at the SDM.
> 
> Okay. See below. Does it match what you had in mind?

Looks close.

> BTW, I found a bug in tdx_early_handle_ve(). It didn't update RIP.
> I don't know how it happend. Maybe got lost on the way upstream.

Huh, so refactoring things instead of depending on magic hidden behavior
helps find bugs?  Interesting.  ;)

> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 9955b5a89df8..d2635ac52d9b 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -123,7 +123,7 @@ static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
>  	return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
>  }
>  
> -static bool handle_halt(void)
> +static int handle_halt(struct ve_info *ve)
>  {
>  	/*
>  	 * Since non safe halt is mainly used in CPU offlining
> @@ -134,9 +134,9 @@ static bool handle_halt(void)
>  	const bool do_sti = false;
>  
>  	if (__halt(irq_disabled, do_sti))
> -		return false;
> +		return -EIO;
>  
> -	return true;
> +	return ve->instr_len;
>  }

Ideally each of these would include a comment about why we can get the
isntruction length from ve_info.  That "why" is currently a bit weak,
but it's something like:

	/*
	 * In TDX guests, HLT is configured to cause exits.  Assume that
	 * the TDX module has provided the "VM-exit instruction length".
	 */

It would be nice to have some central discussion of this too to explain
that the TDX documentation is currently lacking here, but we don't need
to repeat that part in a comment 6 different times.

...
>  /* Handle the kernel #VE */
> -static bool virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
> +static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
>  {

/*
 * Handle kernel #VEs.  On success, returns the number of
 * bytes RIP should be incremented (>=0) or -errno on error.
 */

>  	switch (ve->exit_reason) {
>  	case EXIT_REASON_HLT:
> -		return handle_halt();
> +		return handle_halt(ve);
>  	case EXIT_REASON_MSR_READ:
> -		return read_msr(regs);
> +		return read_msr(regs, ve);
>  	case EXIT_REASON_MSR_WRITE:
> -		return write_msr(regs);
> +		return write_msr(regs, ve);
>  	case EXIT_REASON_CPUID:
> -		return handle_cpuid(regs);
> +		return handle_cpuid(regs, ve);
>  	case EXIT_REASON_EPT_VIOLATION:
>  		return handle_mmio(regs, ve);
>  	case EXIT_REASON_IO_INSTRUCTION:
> -		return handle_io(regs, ve->exit_qual);
> +		return handle_io(regs, ve);
>  	default:
>  		pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> -		return false;
> +		return -EIO;
>  	}
>  }
>  
>  bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
>  {
> -	bool ret;
> +	int ret;

'ret' is usually used for return values of *this* function.

Let's give it a better name, please.

>  	if (user_mode(regs))
>  		ret = virt_exception_user(regs, ve);
>  	else
>  		ret = virt_exception_kernel(regs, ve);
>  
> +	if (ret < 0)
> +		return false;
> +
>  	/* After successful #VE handling, move the IP */
> -	if (ret)
> -		regs->ip += ve->instr_len;
> +	regs->ip += ret;
>  
> -	return ret;
> +	return true;
>  }
>  
>  static bool tdx_tlb_flush_required(bool private)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  2022-05-19 18:19               ` Kirill A. Shutemov
@ 2022-05-19 18:35                 ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2022-05-19 18:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sean Christopherson
  Cc: tglx, mingo, bp, luto, peterz, sathyanarayanan.kuppuswamy, ak,
	dan.j.williams, david, hpa, thomas.lendacky, x86, linux-kernel

On 5/19/22 11:19, Kirill A. Shutemov wrote:
>>>> The SDM has a breakdown:
>>>>
>>>> 	27.2.5 Information for VM Exits Due to Instruction Execution
>>>>
>>>> I didn't realize it came from VMREAD.  I guess I assumed it came from
>>>> some TDX module magic.  Silly me.
>>>>
>>>> The SDM makes it sound like we should be more judicious about using
>>>> 've->instr_len' though.  "All VM exits other than those listed in the
>>>> above items leave this field undefined."  Looking over
>>>> virt_exception_kernel(), we've got five cases from CPU instructions that
>>>> cause unconditional VMEXITs:
>> Ideally, what the SDM says wouldn't matter at all.  The TDX module spec really
>> should be the authorative source in this case, but it just punts to the SDM:
>>
>>   The 32-bit value that would have been saved into the VMCS as VM-exit instruction
>>   length if a legacy VM exit had occurred instead of the virtualization exception.
>>
>> Even if the TDX spec wants to punt to the SDM, it would save a lot of headache and
>> SDM reading if it also said something to the effect of:
>>
>>   The INSTRUCTION_LENGTH and INSTRUCTION_INFORMATION fields are valid for all
>>   #VEs injected by the Intel TDX Module.  The fields are undefined for #VEs
>>   injected by the CPU due to EPT Violations.
> I initiated update to the spec, but it will take time.

Understood, and thanks for doing that.

For now, let's just declare what we *expect* the spec will say and show
it to the folks doing the spec itself.  They will then have a chance to
balk at our interpretation if we got something wrong.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-05-19 18:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-17 15:30 [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page Kirill A. Shutemov
2022-05-17 16:36 ` Dave Hansen
2022-05-17 17:40   ` Kirill A. Shutemov
2022-05-17 18:14     ` Dave Hansen
2022-05-17 20:17       ` Kirill A. Shutemov
2022-05-17 22:16         ` Dave Hansen
2022-05-17 22:40           ` Sean Christopherson
2022-05-17 22:52             ` Dave Hansen
2022-05-17 22:52             ` Sean Christopherson
2022-05-19 18:19               ` Kirill A. Shutemov
2022-05-19 18:35                 ` Dave Hansen
2022-05-19 18:07           ` Kirill A. Shutemov
2022-05-19 18:33             ` Dave Hansen
2022-05-18  8:39 ` David Laight
2022-05-18 12:18   ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).