Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings

From: "wangyanan (Y)" <wangyanan55@huawei.com>
To: Marc Zyngier <maz@kernel.org>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>,
	Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Julien Thierry <julien.thierry.kdev@gmail.com>,
	James Morse <james.morse@arm.com>,
	"Suzuki K Poulose" <suzuki.poulose@arm.com>,
	Quentin Perret <qperret@google.com>,
	"Gavin Shan" <gshan@redhat.com>, <kvmarm@lists.cs.columbia.edu>,
	<linux-arm-kernel@lists.infradead.org>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 3/4] KVM: arm64: Install the block entry before unmapping the page mappings
Date: Thu, 4 Mar 2021 15:22:52 +0800	[thread overview]
Message-ID: <22f279ce-a655-7f1f-68c8-40bdaaf19102@huawei.com> (raw)
In-Reply-To: <b84f41b8-3555-9c8a-126e-34d97643fc95@huawei.com>


On 2021/3/4 15:07, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/3/4 1:27, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/3/21 11:04 AM, wangyanan (Y) wrote:
>>> Hi Alex,
>>>
>>> On 2021/3/3 1:13, Alexandru Elisei wrote:
>>>> Hello,
>>>>
>>>> On 2/8/21 11:22 AM, Yanan Wang wrote:
>>>>> When KVM needs to coalesce the normal page mappings into a block 
>>>>> mapping,
>>>>> we currently invalidate the old table entry first followed by 
>>>>> invalidation
>>>>> of TLB, then unmap the page mappings, and install the block entry 
>>>>> at last.
>>>>>
>>>>> It will cost a long time to unmap the numerous page mappings, 
>>>>> which means
>>>>> there will be a long period when the table entry can be found 
>>>>> invalid.
>>>>> If other vCPUs access any guest page within the block range and 
>>>>> find the
>>>>> table entry invalid, they will all exit from guest with a 
>>>>> translation fault
>>>>> which is not necessary. And KVM will make efforts to handle these 
>>>>> faults,
>>>>> especially when performing CMOs by block range.
>>>>>
>>>>> So let's quickly install the block entry at first to ensure 
>>>>> uninterrupted
>>>>> memory access of the other vCPUs, and then unmap the page mappings 
>>>>> after
>>>>> installation. This will reduce most of the time when the table 
>>>>> entry is
>>>>> invalid, and avoid most of the unnecessary translation faults.
>>>> I'm not convinced I've fully understood what is going on yet, but 
>>>> it seems to me
>>>> that the idea is sound. Some questions and comments below.
>>> What I am trying to do in this patch is to adjust the order of 
>>> rebuilding block
>>> mappings from page mappings.
>>> Take the rebuilding of 1G block mappings as an example.
>>> Before this patch, the order is like:
>>> 1) invalidate the table entry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) unmap the old PMD/PTE tables
>>> 4) install the new block entry to the 1st level(PUD)
>>>
>>> So entry in the 1st level can be found invalid by other vcpus in 1), 
>>> 2), and 3),
>>> and it's a long time in 3) to unmap
>>> the numerous old PMD/PTE tables, which means the total time of the 
>>> entry being
>>> invalid is long enough to
>>> affect the performance.
>>>
>>> After this patch, the order is like:
>>> 1) invalidate the table ebtry of the 1st level(PUD)
>>> 2) flush TLB by VMID
>>> 3) install the new block entry to the 1st level(PUD)
>>> 4) unmap the old PMD/PTE tables
>>>
>>> The change ensures that period of entry in the 1st level(PUD) being 
>>> invalid is
>>> only in 1) and 2),
>>> so if other vcpus access memory within 1G, there will be less chance 
>>> to find the
>>> entry invalid
>>> and as a result trigger an unnecessary translation fault.
>> Thank you for the explanation, that was my understand of it also, and 
>> I believe
>> your idea is correct. I was more concerned that I got some of the 
>> details wrong,
>> and you have kindly corrected me below.
>>
>>>>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>>>>> ---
>>>>>    arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
>>>>>    1 file changed, 12 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c 
>>>>> b/arch/arm64/kvm/hyp/pgtable.c
>>>>> index 78a560446f80..308c36b9cd21 100644
>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>> @@ -434,6 +434,7 @@ struct stage2_map_data {
>>>>>        kvm_pte_t            attr;
>>>>>          kvm_pte_t            *anchor;
>>>>> +    kvm_pte_t            *follow;
>>>>>          struct kvm_s2_mmu        *mmu;
>>>>>        struct kvm_mmu_memory_cache    *memcache;
>>>>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 
>>>>> addr, u64 end,
>>>>> u32 level,
>>>>>        if (!kvm_block_mapping_supported(addr, end, data->phys, 
>>>>> level))
>>>>>            return 0;
>>>>>    -    kvm_set_invalid_pte(ptep);
>>>>> -
>>>>>        /*
>>>>> -     * Invalidate the whole stage-2, as we may have numerous leaf
>>>>> -     * entries below us which would otherwise need invalidating
>>>>> -     * individually.
>>>>> +     * If we need to coalesce existing table entries into a block 
>>>>> here,
>>>>> +     * then install the block entry first and the sub-level page 
>>>>> mappings
>>>>> +     * will be unmapped later.
>>>>>         */
>>>>> -    kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>>        data->anchor = ptep;
>>>>> +    data->follow = kvm_pte_follow(*ptep);
>>>>> +    stage2_coalesce_tables_into_block(addr, level, ptep, data);
>>>> Here's how stage2_coalesce_tables_into_block() is implemented from 
>>>> the previous
>>>> patch (it might be worth merging it with this patch, I found it 
>>>> impossible to
>>>> judge if the function is correct without seeing how it is used and 
>>>> what is
>>>> replacing):
>>> Ok, will do this if v2 is going to be post.
>>>> static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
>>>>                             kvm_pte_t *ptep,
>>>>                             struct stage2_map_data *data)
>>>> {
>>>>       u64 granule = kvm_granule_size(level), phys = data->phys;
>>>>       kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, 
>>>> level);
>>>>
>>>>       kvm_set_invalid_pte(ptep);
>>>>
>>>>       /*
>>>>        * Invalidate the whole stage-2, as we may have numerous leaf 
>>>> entries
>>>>        * below us which would otherwise need invalidating 
>>>> individually.
>>>>        */
>>>>       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>>>       smp_store_release(ptep, new);
>>>>       data->phys += granule;
>>>> }
>>>>
>>>> This works because __kvm_pgtable_visit() saves the *ptep value 
>>>> before calling the
>>>> pre callback, and it visits the next level table based on the 
>>>> initial pte value,
>>>> not the new value written by stage2_coalesce_tables_into_block().
>>> Right. So before replacing the initial pte value with the new value, 
>>> we have to use
>>> *data->follow = kvm_pte_follow(*ptep)* in 
>>> stage2_map_walk_table_pre() to save
>>> the initial pte value in advance. And data->follow will be used 
>>> when  we start to
>>> unmap the old sub-level tables later.
>> Right, stage2_map_walk_table_post() will use data->follow to free the 
>> table page
>> which is no longer needed because we've replaced the entire next 
>> level table with
>> a block mapping.
>>
>>>> Assuming the first patch in the series is merged ("KVM: arm64: Move 
>>>> the clean of
>>>> dcache to the map handler"), this function is missing the CMOs from
>>>> stage2_map_walker_try_leaf().
>>> Yes, the CMOs are not performed in 
>>> stage2_coalesce_tables_into_block() currently,
>>> because I thought they were not needed when we rebuild the block 
>>> mappings from
>>> normal page mappings.
>> This assumes that the *only* situation when we replace a table entry 
>> with a block
>> mapping is when the next level table (or tables) is *fully* 
>> populated. Is there a
>> way to prove that this is true? I think it's important to prove it 
>> unequivocally,
>> because if there's a corner case where this doesn't happen and we 
>> remove the
>> dcache maintenance, we can end up with hard to reproduce and hard to 
>> diagnose
>> errors in a guest.
> So there is still one thing left about this patch to determine, and 
> that is whether we can straightly
> discard CMOs in stage2_coalesce_tables_into_block() or we should 
> distinguish different situations.
>
> Now we know that the situation you have described won't happen, then I 
> think we will only end up
> in stage2_coalesce_tables_into_block() in the following situation:
> 1) KVM create a new block mapping in stage2_map_walker_try_leaf() for 
> the first time, if guest accesses
>     memory backed by a THP/HUGETLB huge page. And CMOs will be 
> performed here.
> 2) KVM split this block mapping in dirty logging, and build only one 
> new page mapping.
> 3) KVM will build other new page mappings in dirty logging lazily, if 
> guest access any other pages
>     within the block. *In this stage, pages in this block may be fully 
> mapped, or may be not.*
> 4) After dirty logging is disabled, KVM decides to rebuild the block 
> mapping.
>
> Do we still have to perform CMOs when rebuilding the block mapping in 
> step 4, if pages in the block
> were not fully mapped in step 3 ? I'm not completely sure about this.
>
Hi Marc,
Could you please have an answer for above confusion :) ?

Thanks,

Yanan


> Thanks,
>
> Yanan
>>> At least, they are not needed if we rebuild the block mappings 
>>> backed by hugetlbfs
>>> pages, because we must have built the new block mappings for the 
>>> first time before
>>> and now need to rebuild them after they were split in dirty logging. 
>>> Can we
>>> agree on this?
>>> Then let's see the following situation.
>>>> I can think of the following situation where they
>>>> are needed:
>>>>
>>>> 1. The 2nd level (PMD) table that will be turned into a block is 
>>>> mapped at stage 2
>>>> because one of the pages in the 3rd level (PTE) table it points to 
>>>> is accessed by
>>>> the guest.
>>>>
>>>> 2. The kernel decides to turn the userspace mapping into a 
>>>> transparent huge page
>>>> and calls the mmu notifier to remove the mapping from stage 2. The 
>>>> 2nd level table
>>>> is still valid.
>>> I have a question here. Won't the PMD entry been invalidated too in 
>>> this case?
>>> If remove of the stage2 mapping by mmu notifier is an unmap 
>>> operation of a range,
>>> then it's correct and reasonable to both invalidate the PMD entry 
>>> and free the
>>> PTE table.
>>> As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range.
>>>
>>> And if I was right about this, we will not end up in
>>> stage2_coalesce_tables_into_block()
>>> like step 3 describes, but in stage2_map_walker_try_leaf() instead. 
>>> Because the
>>> PMD entry
>>> is invalid, so KVM will create the new 2M block mapping.
>> Looking at the code for stage2_unmap_walker(), I believe you are 
>> correct. After
>> the entire PTE table has been unmapped, the function will mark the 
>> PMD entry as
>> invalid. In the situation I described, at step 3 we would end up in 
>> the leaf
>> mapper function because the PMD entry is invalid. My example was wrong.
>>
>>> If I'm wrong about this, then I think this is a valid situation.
>>>> 3. Guest accesses a page which is not the page it accessed at step 
>>>> 1, which causes
>>>> a translation fault. KVM decides we can use a PMD block mapping to 
>>>> map the address
>>>> and we end up in stage2_coalesce_tables_into_block(). We need CMOs 
>>>> in this case
>>>> because the guest accesses memory it didn't access before.
>>>>
>>>> What do you think, is that a valid situation?
>>>>>        return 0;
>>>>>    }
>>>>>    @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 
>>>>> addr, u64
>>>>> end, u32 level,
>>>>>                          kvm_pte_t *ptep,
>>>>>                          struct stage2_map_data *data)
>>>>>    {
>>>>> -    int ret = 0;
>>>>> -
>>>>>        if (!data->anchor)
>>>>>            return 0;
>>>>>    -    free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> -    put_page(virt_to_page(ptep));
>>>>> -
>>>>> -    if (data->anchor == ptep) {
>>>>> +    if (data->anchor != ptep) {
>>>>> +        free_page((unsigned long)kvm_pte_follow(*ptep));
>>>>> +        put_page(virt_to_page(ptep));
>>>>> +    } else {
>>>>> +        free_page((unsigned long)data->follow);
>>>>>            data->anchor = NULL;
>>>>> -        ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls 
>>>> put_page() and
>>>> get_page() once in our case (valid old mapping). It looks to me 
>>>> like we're missing
>>>> a put_page() call when the function is called for the anchor. Have 
>>>> you found the
>>>> call to be unnecessary?
>>> Before this patch:
>>> When we find data->anchor == ptep, put_page() has been called once 
>>> in advance
>>> for the anchor
>>> in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() ->
>>> stage2_map_walker_try_leaf()
>>> to install the block entry, and only get_page() will be called once in
>>> stage2_map_walker_try_leaf().
>>> There is a put_page() followed by a get_page() for the anchor, and 
>>> there will
>>> not be a problem about
>>> page_counts.
>> This is how I'm reading the code before your patch:
>>
>> - stage2_map_walk_table_post() returns early if there is no anchor.
>>
>> - stage2_map_walk_table_pre() sets the anchor and marks the entry as 
>> invalid. The
>> entry was a table so the leaf visitor is not called in 
>> __kvm_pgtable_visit().
>>
>> - __kvm_pgtable_visit() visits the next level table.
>>
>> - stage2_map_walk_table_post() calls put_page(), calls 
>> stage2_map_walk_leaf() ->
>> stage2_map_walker_try_leaf(). The old entry was invalidated by the 
>> pre visitor, so
>> it only calls get_page() (and not put_page() + get_page().
>>
>> I agree with your conclusion, I didn't realize that because the pre 
>> visitor marks
>> the entry as invalid, stage2_map_walker_try_leaf() will not call 
>> put_page().
>>
>>> After this patch:
>>> Before we find data->anchor == ptep and after it, there is not a 
>>> put_page() call
>>> for the anchor.
>>> This is because that we didn't call get_page() either in
>>> stage2_coalesce_tables_into_block() when
>>> install the block entry. So I think there will not be a problem too.
>> I agree, the refcount will be identical.
>>
>>> Is above the right answer for your point?
>> Yes, thank you clearing that up for me.
>>
>> Thanks,
>>
>> Alex
>>
>>>>>        }
>>>>>    -    return ret;
>>>>> +    return 0;
>>>> I think it's correct for this function to succeed unconditionally. 
>>>> The error was
>>>> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). 
>>>> The function
>>>> can return an error code if block mapping is not supported, which 
>>>> we know is
>>>> supported because we have an anchor, and if only the permissions 
>>>> are different
>>>> between the old and the new entry, but in our case we've changed 
>>>> both the valid
>>>> and type bits.
>>> Agreed. Besides, we will definitely not end up updating an old valid 
>>> entry for
>>> the anchor
>>> in stage2_map_walker_try_leaf(), because *anchor has already been 
>>> invalidated in
>>> stage2_map_walk_table_pre() before set the anchor, so it will look 
>>> like a build
>>> of new mapping.
>>>
>>> Thanks,
>>>
>>> Yanan
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>>>    }
>>>>>      /*
>>>> .
>> .