Re: [PATCH] x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()

From: Borislav Petkov <bp@alien8.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Juergen Gross <jgross@suse.com>,
	xen-devel <xen-devel@lists.xen.org>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [PATCH] x86/mm: Split read_cr3() into read_cr3_pa() and __read_cr3()
Date: Tue, 13 Jun 2017 11:26:46 +0200	[thread overview]
Message-ID: <20170613092646.l5wgvfrgoeb3fksz@pd.tnic> (raw)
In-Reply-To: <883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.org>

On Mon, Jun 12, 2017 at 10:26:14AM -0700, Andy Lutomirski wrote:
> The kernel has several code paths that read CR3.  Most of them assume that
> CR3 contains the PGD's physical address, whereas some of them awkwardly
> use PHYSICAL_PAGE_MASK to mask off low bits.
> 
> Add explicit mask macros for CR3 and convert all of the CR3 readers.
> This will keep them from breaking when PCID is enabled.

...

> +/*
> + * CR3's layout varies depending on several things.
> + *
> + * If CR4.PCIDE is set (64-bit only), then CR3[11:0] is the address space ID.
> + * If PAE is enabled, then CR3[11:5] is part of the PDPT address
> + * (i.e. it's 32-byte aligned, not page-aligned) and CR3[4:0] is ignored.
> + * Otherwise (non-PAE, non-PCID), CR3[3] is PWT, CR3[4] is PCD, and
> + * CR3[2:0] and CR3[11:5] are ignored.
> + *
> + * In all cases, Linux puts zeros in the low ignored bits and in PWT and PCD.
> + *
> + * CR3[63] is always read as zero.  If CR4.PCIDE is set, then CR3[63] may be
> + * written as 1 to prevent the write to CR3 from flushing the TLB.
> + *
> + * On systems with SME, one bit (in a variable position!) is stolen to indicate
> + * that the top-level paging structure is encrypted.
> + *
> + * All of the remaining bits indicate the physical address of the top-level
> + * paging structure.
> + *
> + * CR3_ADDR_MASK is the mask used by read_cr3_pa().
> + */
> +#ifdef CONFIG_X86_64
> +/* Mask off the address space ID bits. */
> +#define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
> +#define CR3_PCID_MASK 0xFFFull
> +#else
> +/*
> + * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
> + * a tiny bit of code size by setting all the bits.
> + */
> +#define CR3_ADDR_MASK 0xFFFFFFFFull
> +#define CR3_PCID_MASK 0ull

All those can do GENMASK_ULL for better readability.

> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 3586996fc50d..bc0a849589bb 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -391,7 +391,7 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
>  
>  	.read_cr2 = native_read_cr2,
>  	.write_cr2 = native_write_cr2,
> -	.read_cr3 = native_read_cr3,
> +	.read_cr3 = __native_read_cr3,
>  	.write_cr3 = native_write_cr3,
>  
>  	.flush_tlb_user = native_flush_tlb,
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index ffeae818aa7a..c6d6dc5f8bb2 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -92,7 +92,7 @@ void __show_regs(struct pt_regs *regs, int all)
>  
>  	cr0 = read_cr0();
>  	cr2 = read_cr2();
> -	cr3 = read_cr3();
> +	cr3 = __read_cr3();
>  	cr4 = __read_cr4();

This is a good example for my confusion. So we have __read_cr4() - with
the underscores - but not read_cr4().

Now we get __read_cr3 *and* read_cr3_pa(). So __read_cr3() could just as
well lose the "__", right?

Or are the PCID series bringing a read_cr3() without "__" too?

Oh, and to confuse me even more, there's __native_read_cr3() which is
*finally* accessing %cr3 :-) But there's native_write_cr3() without
underscores.

So can we make those names a bit more balanced please?

Thanks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.