Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page

From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	seanjc@google.com
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	luto@kernel.org, peterz@infradead.org,
	sathyanarayanan.kuppuswamy@linux.intel.com, ak@linux.intel.com,
	dan.j.williams@intel.com, david@redhat.com, hpa@zytor.com,
	thomas.lendacky@amd.com, x86@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
Date: Tue, 17 May 2022 15:16:42 -0700	[thread overview]
Message-ID: <083519ab-752f-9815-7741-22b3fcc03322@intel.com> (raw)
In-Reply-To: <20220517201710.ixbpsaga5jzvokvy@black.fi.intel.com>

On 5/17/22 13:17, Kirill A. Shutemov wrote:
>>> Given that we had to adjust IP in handle_mmio() anyway, do you still think
>>> "ve->instr_len = 0;" is wrong? I dislike ip_adjusted more.
>> Something is wrong about it.
>>
>> You could call it 've->instr_bytes_to_handle' or something. Then it
>> makes actual logical sense when you handle it to zero it out.  I just
>> want it to be more explicit when the upper levels need to do something.
>>
>> Does ve->instr_len==0 both when the TDX module isn't providing
>> instruction sizes *and* when no handling is necessary?  That seems like
>> an unfortunate logical multiplexing of 0.
> For EPT violation, ve->instr_len has *something* (not zero) that doesn't
> match the actual instruction size. I dig out that it is filled with data
> from VMREAD(0x440C), but I don't know where is the ultimate origin of the
> data.

The SDM has a breakdown:

	27.2.5 Information for VM Exits Due to Instruction Execution

I didn't realize it came from VMREAD.  I guess I assumed it came from
some TDX module magic.  Silly me.

The SDM makes it sound like we should be more judicious about using
've->instr_len' though.  "All VM exits other than those listed in the
above items leave this field undefined."  Looking over
virt_exception_kernel(), we've got five cases from CPU instructions that
cause unconditional VMEXITs:

        case EXIT_REASON_HLT:
        case EXIT_REASON_MSR_READ:
        case EXIT_REASON_MSR_WRITE:
        case EXIT_REASON_CPUID:
        case EXIT_REASON_IO_INSTRUCTION:

and should have that field filled out, plus one that doesn't:

        case EXIT_REASON_IO_INSTRUCTION:

It seems awfully fragile to me to have the hardware be providing the
'instr_len' in those cases, but not in one other one.  The data in there
is garbage for EXIT_REASON_IO_INSTRUCTION.  The reason we don't consume
garbage is that all the paths leading out of handle_mmio() that return
true also set 've->instr_len'.  But that logic is entirely opaque.

It's also borderline criminal to have six functions that look identical
(in that switch statement), but one of them has different behavior for
've->instr_len'.

I'd probably do it like this:

static int handle_halt(struct ve_info *ve)
{
        /*
         * Since non safe halt is mainly used in CPU offlining
         * and the guest will always stay in the halt state, don't
         * call the STI instruction (set do_sti as false).
         */
        const bool irq_disabled = irqs_disabled();
        const bool do_sti = false;

        if (__halt(irq_disabled, do_sti))
                return -EIO;

	/*
	 * VM-exit instruction length is defined for HLT.  See:
	 * "Information for VM Exits Due to Instruction Execution"
	 * in the SDM.
	 */
        return ve->insn_length;
}

Any >=0 return means the exception was handled and it tells the caller
hoe much to advance RIP.

Then handle_mmio() can say:

	/*
	 * VM-exit instruction length is not provided for the EPT
	 * violations that MMIO causes.  Use the insn_decode() length:
	 */
        return insn.length;

See?  Now everybody that goes and writes a new #VE exception helper has
a chance of actually getting this right.  As it stands, if someone adds
one more of these, they'll probably get random behavior.  This way, they
actually have to choose.  They _might_ even go looking at the SDM.