Re: [RFC PATCH v2 6/8] arm/mem_access: Add long-descriptor based gpt

From: Julien Grall <julien.grall@arm.com>
To: Sergej Proskurin <proskurin@sec.in.tum.de>,
	xen-devel@lists.xenproject.org
Cc: Stefano Stabellini <sstabellini@kernel.org>
Subject: Re: [RFC PATCH v2 6/8] arm/mem_access: Add long-descriptor based gpt
Date: Fri, 2 Jun 2017 13:55:05 +0100	[thread overview]
Message-ID: <2201fe61-d5d8-8568-9e11-2139b5547601@arm.com> (raw)
In-Reply-To: <20170601151906.10213-7-proskurin@sec.in.tum.de>

Hi Sergej,

On 06/01/2017 04:18 PM, Sergej Proskurin wrote:
> This commit adds functionality to walk the guest's page tables using the
> long-descriptor translation table format for both ARMv7 and ARMv8.
> Similar to the hardware architecture, the implementation supports
> different page granularities (4K, 16K, and 64K). The implementation is
> based on ARM DDI 0487A-g J11-5608, J11-5679, and ARM DDI 0406C-b
> B3-1510.

Please use the most recent ARM ARM.

> 
> Note that the current implementation lacks support for 32-bit EL0
> running on top of 64-bit EL1. The associated location in the code is
> marked appropriately.
> 
> Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de>
> ---
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Julien Grall <julien.grall@arm.com>
> ---
> v2: Use TCR_SZ_MASK instead of TTBCR_SZ_MASK for ARM 32-bit guests using
>      the long-descriptor translation table format.
> 
>      Cosmetic fixes.
> ---
>   xen/arch/arm/p2m.c | 271 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 269 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 0337d83581..ea3be6f050 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -1573,8 +1573,275 @@ static int __p2m_walk_gpt_ld(struct p2m_domain *p2m,
>                                vaddr_t gva, paddr_t *ipa,
>                                unsigned int *perm_ro)
>   {
> -    /* Not implemented yet. */
> -    return -EFAULT;
> +    int t0_sz, t1_sz, disabled = 1;

Looking at the way you use t0_sz and t1_sz, they will never be < 0 so I 
thin should be unsigned.

Also, I think disabled is always 0 or 1. If that is true, please use 
bool/true/false.

> +    unsigned int level, gran;
> +    unsigned int topbit = 0, input_size = 0, output_size;
> +    uint64_t ttbr = 0, ips;
> +    paddr_t mask;
> +    lpae_t pte, *table;
> +    struct page_info *page;
> +    register_t tcr = READ_SYSREG(TCR_EL1);
> +    struct domain *d = p2m->domain;
> +
> +    const vaddr_t offsets[4][3] = {
> +        {
> +#ifdef CONFIG_ARM_64
> +            zeroeth_guest_table_offset_4k(gva),
> +            zeroeth_guest_table_offset_16k(gva),
> +            0, /* There is no zeroeth lookup level with a 64K granule size. */
> +#endif
> +        },
> +        {
> +            first_guest_table_offset_4k(gva),
> +#ifdef CONFIG_ARM_64
> +            first_guest_table_offset_16k(gva),
> +            first_guest_table_offset_64k(gva),
> +#endif
> +        },
> +        {
> +            second_guest_table_offset_4k(gva),
> +#ifdef CONFIG_ARM_64
> +            second_guest_table_offset_16k(gva),
> +            second_guest_table_offset_64k(gva),
> +#endif
> +        },
> +        {
> +            third_guest_table_offset_4k(gva),
> +#ifdef CONFIG_ARM_64
> +            third_guest_table_offset_16k(gva),
> +            third_guest_table_offset_64k(gva),
> +#endif
> +        }
> +    };
> +
> +    const paddr_t masks[4][3] = {
> +        {
> +            ZEROETH_SIZE_4K - 1,
> +            ZEROETH_SIZE_16K - 1,
> +            0 /* There is no zeroeth lookup level with a 64K granule size. */
> +        },
> +        {
> +            FIRST_SIZE_4K - 1,
> +            FIRST_SIZE_16K - 1,
> +            FIRST_SIZE_64K - 1
> +        },
> +        {
> +            SECOND_SIZE_4K - 1,
> +            SECOND_SIZE_16K - 1,
> +            SECOND_SIZE_64K - 1
> +        },
> +        {
> +            THIRD_SIZE_4K - 1,
> +            THIRD_SIZE_16K - 1,
> +            THIRD_SIZE_64K - 1
> +        }
> +    };

Please define them as static. It is not necessary to have them on the 
stack everytime.

> +
> +    const unsigned int grainsizes[3] = {
> +        PAGE_SHIFT_4K,
> +        PAGE_SHIFT_16K,
> +        PAGE_SHIFT_64K > +    };

Ditto.

> +
> +    const unsigned int strides[3] = {
> +        LPAE_SHIFT_4K,
> +        LPAE_SHIFT_16K,
> +        LPAE_SHIFT_64K
> +    };

Ditto. Also, the stride can be found from the page shift. So I am not 
convinced you need that.

> +
> +    t0_sz = (tcr >> TCR_T0SZ_SHIFT) & TCR_SZ_MASK;
> +    t1_sz = (tcr >> TCR_T1SZ_SHIFT) & TCR_SZ_MASK;
> +
> +    /*
> +     * Get the MSB number of the gva, according to "AddrTop" pseudocode

That's a call to have a separate helper for AddrTop rather than 
open-coding in this function.

I think this function could be split in multiple part, making easier the 
review. For instance, you have a lot of "if is_*bit_domain".

> +     * implementation in ARM DDI 0487A-g J11-5739.
> +     *
> +     * XXX: We do not check whether the 64bit domain uses a 32-bit EL0. In this
> +     * case, we need to set topbit to 31, as well.
I think checking 32-bit EL0 is straigh-forward enough to get it done 
now. Have a look at psr_most_is_32bit.

> +     */
> +    if ( is_32bit_domain(d) )
> +        topbit = TCR_TB_31;
> +#ifdef CONFIG_ARM_64
> +    else if ( is_64bit_domain(d) )
> +    {
> +        if ( ((gva & (1UL << TCR_TB_55)) && (tcr & TCR_TBI1)) ||

Please use BIT(...) instead of (1UL << ...).

> +             (!(gva & (1UL << TCR_TB_55)) && (tcr & TCR_TBI0)) )

Ditto.

> +            topbit = TCR_TB_55;

This is really confusing. You define TCR_TB_* to * but it is not even 
part of the register TCR_. TBH, I don't think they hence the code and 
would just hardcoded the 55 here. Afterall it is in the name :).

> +        else
> +            topbit = TCR_TB_63

Ditto.

> +    }
> +#endif
> +
> +#ifdef CONFIG_ARM_64
> +    if ( is_64bit_domain(d) )
> +    {

Likely a comment is missing here to explain what you are doing below. My 
understand is you are selecting TTBR*_EL1. And looking at the code, this 
could be abstracted as both branches are nearly the same.

> +        if ( (gva & (1UL << topbit)) == 0 )

BIT(...)

> +        {
> +            input_size = REGISTER_WIDTH_64_BIT - t0_sz;
> +
> +            if ( input_size > IPS_MAX )
> +                /* We limit the input_size to be max 48 bit. */
> +                input_size = IPS_MAX;
> +            else if ( input_size < IPS_MIN )
> +                /* We limit the input_size to be max 25 bit. */
> +                input_size = IPS_MIN;

like this could be simplified by using min/max. But I think we should 
bail out here. Likely something in the page table is wrong and ignoring 
is the worst thing to do.

For instance ARMv8.2 has extended the input size to 52 bits. It would be 
difficult to catch what is missing because of that, not mentioning that 
the only caller today will be memaccess that is not enabled by default.

> +
> +            /* Normalize granule size. */

I think 0, 1, 2 is more confusing to read. It would be better to use 
directly TCR_TG0_*.

> +            switch ( tcr & TCR_TG0_MASK ) {

Coding style, the brace should be on a newline.

> +            case TCR_TG0_16K:
> +                gran = 1;
> +                break;
> +            case TCR_TG0_64K:
> +                gran = 2;
> +                break;
> +            default:
> +                gran = 0;
> +            } > +
> +            /* Use TTBR0 for GVA to IPA translation. */
> +            ttbr = READ_SYSREG64(TTBR0_EL1);
> +
> +            /* If TCR.EPD0 is set, translations using TTBR0 are disabled. */
> +            disabled = ( tcr & TCR_EPD0 ) ? 1 : 0;

disabled = !!(tcr & TCR_EPD0);

or if you are using bool as requested above:

disabled =  tcr & TCR_EPD0;

> +        }
> +        else
> +        {
> +            input_size = REGISTER_WIDTH_64_BIT - t1_sz;
> +
> +            if ( input_size > IPS_MAX )
> +                /* We limit the input_size to be max 48 bit. */
> +                input_size = IPS_MAX;
> +            else if ( input_size < IPS_MIN )
> +                /* We limit the input_size to be max 25 bit. */
> +                input_size = IPS_MIN;
> +
> +            /* Normalize granule size. */
> +            switch ( tcr & TCR_TG1_MASK ) {

Coding style.

> +            case TCR_TG1_16K:
If you shift your tcr by TCR_TG1_SHIFT then all this code can become 
generic. Avoiding duplication, reviewing twice similar code and 
potential bug.

> +                gran = 1;
> +                break;
> +            case TCR_TG1_64K:
> +                gran = 2;
> +                break;
> +            default:
> +                gran = 0;
> +            }
> +
> +            /* Use TTBR1 for GVA to IPA translation. */
> +            ttbr = READ_SYSREG64(TTBR1_EL1);
> +
> +            /* If TCR.EPD1 is set, translations using TTBR1 are disabled. */
> +            disabled = ( tcr & TCR_EPD1 ) ? 1 : 0;

Same as above.

> +        }
> +    }
> +    else
> +#endif
> +    {
> +        /* Granule size of Aarch32 or ARMv7 architectures is always 4K, indexed by 0. */

NIT: It is not necessarily to mention ARMv7. ARMv7 is always AArch32.

Also s/Aarch32/Aarch32/

> +        gran = 0;
> +
> +        mask = ((1ULL << REGISTER_WIDTH_32_BIT) - 1) &
> +               ~((1ULL << (REGISTER_WIDTH_32_BIT - t0_sz)) - 1);

I don't understand this mask. Why do you need it?

> +
> +        if ( t0_sz == 0 || !(gva & mask) )
> +        {
> +            input_size = REGISTER_WIDTH_32_BIT - t0_sz;
> +
> +            /* Use TTBR0 for GVA to IPA translation. */
> +            ttbr = READ_SYSREG64(TTBR0_EL1);
> +
> +            /* If TCR.EPD0 is set, translations using TTBR0 are disabled. */
> +            disabled = ( tcr & TCR_EPD0 ) ? 1 : 0;

As for the AArch64, there is a call to abstract here.

> +        }
> +
> +        mask = ((1ULL << REGISTER_WIDTH_32_BIT) - 1) &
> +               ~((1ULL << (REGISTER_WIDTH_32_BIT - t1_sz)) - 1);
> +
> +        if ( ((t1_sz == 0) && !ttbr) || (t1_sz && (gva & mask) == mask) )
> +        {
> +            input_size = REGISTER_WIDTH_32_BIT - t1_sz;
> +
> +            /* Use TTBR1 for GVA to IPA translation. */
> +            ttbr = READ_SYSREG64(TTBR1_EL1);
> +
> +            /* If TCR.EPD1 is set, translations using TTBR1 are disabled. */
> +            disabled = ( tcr & TCR_EPD1 ) ? 1 : 0;

Ditto.

> +        }
> +    }
> +
> +    if ( disabled )
> +        return -EFAULT;
> +
> +    level = 4 - DIV_ROUND_UP((input_size - grainsizes[gran]), strides[gran]);

It took me a bit to understand this. The comment in the ARM ARM 
pseudo-code would be useful to keep it here.

> +
> +    /* XXX: We do not consider 32bit EL0 running on Aarch64, yet. */

See my comment above about 32bit EL0 support.

> +    if ( is_64bit_domain(d) )
> +    {
> +        /* Get the intermediate physical address size. */
> +        ips = (tcr & TCR_IPS_MASK) >> TCR_IPS_SHIFT;
> +
> +        switch (ips) {
> +        case TCR_IPS_32_BIT:
> +            output_size = IPS_32_BIT;
> +            break;
> +        case TCR_IPS_36_BIT:
> +            output_size = IPS_36_BIT;
> +            break;
> +        case TCR_IPS_40_BIT:
> +            output_size = IPS_40_BIT;
> +            break;
> +        case TCR_IPS_42_BIT:
> +            output_size = IPS_42_BIT;
> +            break;
> +        case TCR_IPS_44_BIT:
> +            output_size = IPS_44_BIT;
> +            break;
> +        case TCR_IPS_48_BIT:
> +        default:
> +            output_size = IPS_48_BIT;
> +        }
> +    }
> +    else
> +        output_size = IPS_40_BIT;
> +
> +    /* Make sure the base address does not exceed its configured size. */
> +    mask = ((1ULL << IPS_48_BIT) - 1) & ~((1ULL << output_size) - 1);
> +    if ( output_size != IPS_48_BIT && (ttbr & mask) )
> +        return -EFAULT;
> +
> +    mask = ((1ULL << output_size) - 1);
> +    page = get_page_from_gfn(d, paddr_to_pfn(ttbr & mask), NULL, P2M_ALLOC);
> +    if ( !page )
> +        return -EFAULT;
> +
> +    table = __map_domain_page(page);
> +
> +    for ( ; ; level++ )
> +    {
> +        pte = table[offsets[level][gran]];
> +
> +        unmap_domain_page(table);
> +        put_page(page);
> +
> +        if ( level == 3 || !pte.walk.valid || !pte.walk.table )

Please comment this line.

> +            break;
> +
> +        page = get_page_from_gfn(d, pte.walk.base, NULL, P2M_ALLOC);

It is a bit confusing. You care about the output_size for the when 
getting the TBBR but not the rest of the tables. Is it normal?

> +        if ( !page )
> +            return -EFAULT;
> +
> +        table = __map_domain_page(page);
> +    }
> +
> +    if ( !pte.walk.valid || ((level == 3) & !pte.walk.table) )

There is a call to leverage the p2m_table/p2m_valid helpers here.

> +        return -EFAULT;
> +
> +    *ipa = pfn_to_paddr(pte.walk.base) | (gva & masks[level][gran]);
> +
> +    *perm_ro = pte.pt.ro;
> +
> +    /* Return the entire pte so that the caller can check flags by herself. */

Hmmm? You don't return the entire pte but the only whether the page is 
writable or readable-only.

In general we want to have more information back such as execute-never 
bit...

> +    return 0;
>   }
>   
>   int p2m_walk_gpt(struct p2m_domain *p2m, vaddr_t gva,
> 

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel