linux-kernel.vger.kernel.org archive mirror
* [RFC] Question about TLB flush while set Stage-2 huge pages
@ 2019-03-11 16:31 Zheng Xiang
  2019-03-12 11:32 ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Xiang @ 2019-03-11 16:31 UTC (permalink / raw)
  To: christoffer.dall, marc.zyngier, catalin.marinas, will.deacon,
	suzuki.poulose, james.morse
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin, yuzenghui,
	lious.lilei, lishuo1

Hi all,

When a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
the base address of the huge page and the whole of Stage-1.
However, this only invalidates the TLB entry for the first page within the huge page;
the entries for the other pages are not invalidated, see below:

    +---------------+--------------+
    |abcde       2MB-Page          |
    +---------------+--------------+

    TLB before setting new pmd:
    +---------------+--------------+
    |      VA       |    PAGESIZE  |
    +---------------+--------------+
    |      a        |      4KB     |
    +---------------+--------------+
    |      b        |      4KB     |
    +---------------+--------------+
    |      c        |      4KB     |
    +---------------+--------------+
    |      d        |      4KB     |
    +---------------+--------------+

    TLB after setting new pmd:
    +---------------+--------------+
    |      VA       |    PAGESIZE  |
    +---------------+--------------+
    |      a        |      2MB     |
    +---------------+--------------+
    |      b        |      4KB     |
    +---------------+--------------+
    |      c        |      4KB     |
    +---------------+--------------+
    |      d        |      4KB     |
    +---------------+--------------+

When the VM accesses address *b*, it will hit the stale 4KB entry in the TLB, which can result in TLB conflict aborts or other potential exceptions.

For example, we need to keep track of the VM's dirty memory pages when the VM is in live migration.
KVM will set the memslot READONLY and split the huge pages.
After the live migration is canceled and aborted, the pages will be merged back into THPs.
Later accesses to these pages, whose stale TLB entries are still READONLY, will cause level-3 Permission Faults until those entries are invalidated.

So should we invalidate the TLB entries for all related pages (e.g. a, b, c, d), like __flush_tlb_range() does?
Or we could call __kvm_tlb_flush_vmid() to invalidate all TLB entries.
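
For concreteness, a minimal sketch of the first option (illustrative only and
untested; it assumes kvm_tlb_flush_vmid_ipa() invalidates a single IPA and that
the S2_PMD_MASK/S2_PMD_SIZE helpers are usable here) could look like:

/* Invalidate every 4KB subpage covered by one 2MB Stage-2 PMD. */
static void stage2_flush_pmd_subpages(struct kvm *kvm, phys_addr_t addr)
{
	phys_addr_t ipa = addr & S2_PMD_MASK;
	phys_addr_t end = ipa + S2_PMD_SIZE;

	for (; ipa < end; ipa += PAGE_SIZE)
		kvm_tlb_flush_vmid_ipa(kvm, ipa);
}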



-- 

Thanks,
Xiang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-11 16:31 [RFC] Question about TLB flush while set Stage-2 huge pages Zheng Xiang
@ 2019-03-12 11:32 ` Marc Zyngier
  2019-03-12 15:30   ` Zheng Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2019-03-12 11:32 UTC (permalink / raw)
  To: Zheng Xiang, christoffer.dall, catalin.marinas, will.deacon,
	suzuki.poulose, james.morse
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin, yuzenghui,
	lious.lilei, lishuo1

Hi Zheng,

On 11/03/2019 16:31, Zheng Xiang wrote:
> Hi all,
> 
> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
> the base address of the huge page and the whole of Stage-1.
> However, this just only invalidates the first page within the huge page and the other
> pages are not invalidated, see bellow:
> 
>     +---------------+--------------+
>     |abcde       2MB-Page          |
>     +---------------+--------------+
> 
>     TLB before setting new pmd:
>     +---------------+--------------+
>     |      VA       |    PAGESIZE  |
>     +---------------+--------------+
>     |      a        |      4KB     |
>     +---------------+--------------+
>     |      b        |      4KB     |
>     +---------------+--------------+
>     |      c        |      4KB     |
>     +---------------+--------------+
>     |      d        |      4KB     |
>     +---------------+--------------+
> 
>     TLB after setting new pmd:
>     +---------------+--------------+
>     |      VA       |    PAGESIZE  |
>     +---------------+--------------+
>     |      a        |      2MB     |
>     +---------------+--------------+
>     |      b        |      4KB     |
>     +---------------+--------------+
>     |      c        |      4KB     |
>     +---------------+--------------+
>     |      d        |      4KB     |
>     +---------------+--------------+
> 
> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.

That's really bad. I can only imagine two scenarios:

1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
the PTE table in the process, and place the PMD instead. I can't see
this happening.

2) We fail to invalidate on unmap, and that's slightly less bad (but still
quite bad).

Which of the two cases are you seeing?

> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
> KVM will set the memslot READONLY and split the huge pages.
> After live migration is canceled and abort, the pages will be merged into THP.
> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
> 
> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.

We should perform an invalidate on each unmap. unmap_stage2_range seems
to do the right thing. __flush_tlb_range only caters for Stage1
mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
TLBs for the whole VM.
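
For illustration, the per-entry invalidation that the unmap path is expected to
perform looks roughly like this (a simplified sketch, not a verbatim copy of
unmap_stage2_ptes()):

static void unmap_stage2_ptes_sketch(struct kvm *kvm, pmd_t *pmd,
				     phys_addr_t addr, phys_addr_t end)
{
	pte_t *pte = pte_offset_kernel(pmd, addr);

	do {
		if (!pte_none(*pte)) {
			kvm_set_pte(pte, __pte(0));		/* break */
			kvm_tlb_flush_vmid_ipa(kvm, addr);	/* per-IPA invalidate */
			put_page(virt_to_page(pte));		/* table-page refcount */
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
}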

I'd really like to understand what you're seeing, and how to reproduce
it. Do you have a minimal example I could run on my own HW?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-12 11:32 ` Marc Zyngier
@ 2019-03-12 15:30   ` Zheng Xiang
  2019-03-12 18:18     ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Xiang @ 2019-03-12 15:30 UTC (permalink / raw)
  To: Marc Zyngier, christoffer.dall, catalin.marinas, will.deacon,
	suzuki.poulose, james.morse
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin, yuzenghui,
	lious.lilei, lishuo1

Hi Marc,

On 2019/3/12 19:32, Marc Zyngier wrote:
> Hi Zheng,
> 
> On 11/03/2019 16:31, Zheng Xiang wrote:
>> Hi all,
>>
>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
>> the base address of the huge page and the whole of Stage-1.
>> However, this just only invalidates the first page within the huge page and the other
>> pages are not invalidated, see bellow:
>>
>>     +---------------+--------------+
>>     |abcde       2MB-Page          |
>>     +---------------+--------------+
>>
>>     TLB before setting new pmd:
>>     +---------------+--------------+
>>     |      VA       |    PAGESIZE  |
>>     +---------------+--------------+
>>     |      a        |      4KB     |
>>     +---------------+--------------+
>>     |      b        |      4KB     |
>>     +---------------+--------------+
>>     |      c        |      4KB     |
>>     +---------------+--------------+
>>     |      d        |      4KB     |
>>     +---------------+--------------+
>>
>>     TLB after setting new pmd:
>>     +---------------+--------------+
>>     |      VA       |    PAGESIZE  |
>>     +---------------+--------------+
>>     |      a        |      2MB     |
>>     +---------------+--------------+
>>     |      b        |      4KB     |
>>     +---------------+--------------+
>>     |      c        |      4KB     |
>>     +---------------+--------------+
>>     |      d        |      4KB     |
>>     +---------------+--------------+
>>
>> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
> 
> That's really bad. I can only imagine two scenarios:
> 
> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
> the PTE table in the process, and place the PMD instead. I can't see
> this happening.
> 
> 2) We fail to invalidate on unmap, and that slightly less bad (but still
> quite bad).
> 
> Which of the two cases are you seeing?
> 
>> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
>> KVM will set the memslot READONLY and split the huge pages.
>> After live migration is canceled and abort, the pages will be merged into THP.
>> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
>>
>> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
>> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.
> 
> We should perform an invalidate on each unmap. unmap_stage2_range seems
> to do the right thing. __flush_tlb_range only caters for Stage1
> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
> TLBs for the whole VM.
> 
> I'd really like to understand what you're seeing, and how to reproduce
> it. Do you have a minimal example I could run on my own HW?

When I start live migration for a VM, qemu begins to log and count dirty pages.
During the live migration, KVM sets the pages READONLY so that we can count how many pages
will be written afterwards.

Everything is OK until I cancel the live migration and qemu stops logging. Later the VM hangs.
The trace log repeatedly shows a level-3 permission fault caused by a write to the same IPA. After
analyzing the source code, I find that KVM always returns from the *if* statement below in
stage2_set_pmd_huge() even if we only have a single VCPU:

        /*
         * Multiple vcpus faulting on the same PMD entry, can
         * lead to them sequentially updating the PMD with the
         * same value. Following the break-before-make
         * (pmd_clear() followed by tlb_flush()) process can
         * hinder forward progress due to refaults generated
         * on missing translations.
         *
         * Skip updating the page table if the entry is
         * unchanged.
         */
        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
            return 0;

The PMD already has the PMD_S2_RDWR bit set. I suspect kvm_tlb_flush_vmid_ipa() does not invalidate
Stage-2 entries for the subpages of the PMD (except for the first PTE of this PMD). Finally I added some debug
code to flush the TLB for all subpages of the PMD, as shown below:

        /*
         * Mapping in huge pages should only happen through a
         * fault.  If a page is merged into a transparent huge
         * page, the individual subpages of that huge page
         * should be unmapped through MMU notifiers before we
         * get here.
         *
         * Merging of CompoundPages is not supported; they
         * should become splitting first, unmapped, merged,
         * and mapped back in on-demand.
         */
        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));

        pmd_clear(pmd);
        for (cnt = 0; cnt < 512; cnt++)
            kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);

Then the problem no longer reproduces.


-- 

Thanks,
Xiang



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-12 15:30   ` Zheng Xiang
@ 2019-03-12 18:18     ` Marc Zyngier
  2019-03-13  9:45       ` Zheng Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2019-03-12 18:18 UTC (permalink / raw)
  To: Zheng Xiang, christoffer.dall, catalin.marinas, will.deacon,
	suzuki.poulose, james.morse
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin, yuzenghui,
	lious.lilei, lishuo1

Hi Zheng,

On 12/03/2019 15:30, Zheng Xiang wrote:
> Hi Marc,
> 
> On 2019/3/12 19:32, Marc Zyngier wrote:
>> Hi Zheng,
>>
>> On 11/03/2019 16:31, Zheng Xiang wrote:
>>> Hi all,
>>>
>>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
>>> the base address of the huge page and the whole of Stage-1.
>>> However, this just only invalidates the first page within the huge page and the other
>>> pages are not invalidated, see bellow:
>>>
>>>     +---------------+--------------+
>>>     |abcde       2MB-Page          |
>>>     +---------------+--------------+
>>>
>>>     TLB before setting new pmd:
>>>     +---------------+--------------+
>>>     |      VA       |    PAGESIZE  |
>>>     +---------------+--------------+
>>>     |      a        |      4KB     |
>>>     +---------------+--------------+
>>>     |      b        |      4KB     |
>>>     +---------------+--------------+
>>>     |      c        |      4KB     |
>>>     +---------------+--------------+
>>>     |      d        |      4KB     |
>>>     +---------------+--------------+
>>>
>>>     TLB after setting new pmd:
>>>     +---------------+--------------+
>>>     |      VA       |    PAGESIZE  |
>>>     +---------------+--------------+
>>>     |      a        |      2MB     |
>>>     +---------------+--------------+
>>>     |      b        |      4KB     |
>>>     +---------------+--------------+
>>>     |      c        |      4KB     |
>>>     +---------------+--------------+
>>>     |      d        |      4KB     |
>>>     +---------------+--------------+
>>>
>>> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
>>
>> That's really bad. I can only imagine two scenarios:
>>
>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
>> the PTE table in the process, and place the PMD instead. I can't see
>> this happening.
>>
>> 2) We fail to invalidate on unmap, and that slightly less bad (but still
>> quite bad).
>>
>> Which of the two cases are you seeing?
>>
>>> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
>>> KVM will set the memslot READONLY and split the huge pages.
>>> After live migration is canceled and abort, the pages will be merged into THP.
>>> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
>>>
>>> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
>>> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.
>>
>> We should perform an invalidate on each unmap. unmap_stage2_range seems
>> to do the right thing. __flush_tlb_range only caters for Stage1
>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
>> TLBs for the whole VM.
>>
>> I'd really like to understand what you're seeing, and how to reproduce
>> it. Do you have a minimal example I could run on my own HW?
> 
> When I start the live migration for a VM, qemu then begins to log and count dirty pages.
> During the live migration, KVM set the pages READONLY so that we can count how many pages
> would be wrote afterwards.
> 
> Anything is OK until I cancel the live migration and qemu stop logging. Later the VM gets hang.
> The trace log shows repeatedly level-3 permission fault caused by a write on a same IPA. After
> analyzing the source code, I find KVM always return from the bellow *if* statement in
> stage2_set_pmd_huge() even if we only have a single VCPU:
> 
>         /*
>          * Multiple vcpus faulting on the same PMD entry, can
>          * lead to them sequentially updating the PMD with the
>          * same value. Following the break-before-make
>          * (pmd_clear() followed by tlb_flush()) process can
>          * hinder forward progress due to refaults generated
>          * on missing translations.
>          *
>          * Skip updating the page table if the entry is
>          * unchanged.
>          */
>         if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>             return 0;
> 
> The PMD has already set the PMD_S2_RDWR bit. I doubt kvm_tlb_flush_vmid_ipa() does not invalidate
> Stage-2 for the subpages of the PMD(except the first PTE of this PMD). Finally I add some debug
> code to flush tlb for all subpages of the PMD, as shown bellow:
> 
>         /*
>          * Mapping in huge pages should only happen through a
>          * fault.  If a page is merged into a transparent huge
>          * page, the individual subpages of that huge page
>          * should be unmapped through MMU notifiers before we
>          * get here.
>          *
>          * Merging of CompoundPages is not supported; they
>          * should become splitting first, unmapped, merged,
>          * and mapped back in on-demand.
>          */
>         VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> 
>         pmd_clear(pmd);
>         for (cnt = 0; cnt < 512; cnt++)
>             kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
> 
> Then the problem no longer reproduce.

This makes very little sense. We shouldn't be able to enter this path
for anything else but a permission update, otherwise the VM_BUG_ON
should fire.

Can you either turn this VM_BUG_ON into a simple BUG_ON, or enable
CONFIG_DEBUG_VM please? If what you're describing is indeed correct (and
I have no reason to doubt you), it should fire.
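
Something along these lines (illustrative only) would make the check fire
regardless of CONFIG_DEBUG_VM:

-		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
+		BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));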

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-12 18:18     ` Marc Zyngier
@ 2019-03-13  9:45       ` Zheng Xiang
  2019-03-14 10:55         ` Suzuki K Poulose
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Xiang @ 2019-03-13  9:45 UTC (permalink / raw)
  To: Marc Zyngier, christoffer.dall, catalin.marinas, will.deacon,
	suzuki.poulose, james.morse
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin, yuzenghui,
	lious.lilei, lishuo1



On 2019/3/13 2:18, Marc Zyngier wrote:
> Hi Zheng,
> 
> On 12/03/2019 15:30, Zheng Xiang wrote:
>> Hi Marc,
>>
>> On 2019/3/12 19:32, Marc Zyngier wrote:
>>> Hi Zheng,
>>>
>>> On 11/03/2019 16:31, Zheng Xiang wrote:
>>>> Hi all,
>>>>
>>>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
>>>> the base address of the huge page and the whole of Stage-1.
>>>> However, this just only invalidates the first page within the huge page and the other
>>>> pages are not invalidated, see bellow:
>>>>
>>>>     +---------------+--------------+
>>>>     |abcde       2MB-Page          |
>>>>     +---------------+--------------+
>>>>
>>>>     TLB before setting new pmd:
>>>>     +---------------+--------------+
>>>>     |      VA       |    PAGESIZE  |
>>>>     +---------------+--------------+
>>>>     |      a        |      4KB     |
>>>>     +---------------+--------------+
>>>>     |      b        |      4KB     |
>>>>     +---------------+--------------+
>>>>     |      c        |      4KB     |
>>>>     +---------------+--------------+
>>>>     |      d        |      4KB     |
>>>>     +---------------+--------------+
>>>>
>>>>     TLB after setting new pmd:
>>>>     +---------------+--------------+
>>>>     |      VA       |    PAGESIZE  |
>>>>     +---------------+--------------+
>>>>     |      a        |      2MB     |
>>>>     +---------------+--------------+
>>>>     |      b        |      4KB     |
>>>>     +---------------+--------------+
>>>>     |      c        |      4KB     |
>>>>     +---------------+--------------+
>>>>     |      d        |      4KB     |
>>>>     +---------------+--------------+
>>>>
>>>> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
>>>
>>> That's really bad. I can only imagine two scenarios:
>>>
>>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
>>> the PTE table in the process, and place the PMD instead. I can't see
>>> this happening.
>>>
>>> 2) We fail to invalidate on unmap, and that slightly less bad (but still
>>> quite bad).
>>>
>>> Which of the two cases are you seeing?
>>>
>>>> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
>>>> KVM will set the memslot READONLY and split the huge pages.
>>>> After live migration is canceled and abort, the pages will be merged into THP.
>>>> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
>>>>
>>>> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
>>>> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.
>>>
>>> We should perform an invalidate on each unmap. unmap_stage2_range seems
>>> to do the right thing. __flush_tlb_range only caters for Stage1
>>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
>>> TLBs for the whole VM.
>>>
>>> I'd really like to understand what you're seeing, and how to reproduce
>>> it. Do you have a minimal example I could run on my own HW?
>>
>> When I start the live migration for a VM, qemu then begins to log and count dirty pages.
>> During the live migration, KVM set the pages READONLY so that we can count how many pages
>> would be wrote afterwards.
>>
>> Anything is OK until I cancel the live migration and qemu stop logging. Later the VM gets hang.
>> The trace log shows repeatedly level-3 permission fault caused by a write on a same IPA. After
>> analyzing the source code, I find KVM always return from the bellow *if* statement in
>> stage2_set_pmd_huge() even if we only have a single VCPU:
>>
>>         /*
>>          * Multiple vcpus faulting on the same PMD entry, can
>>          * lead to them sequentially updating the PMD with the
>>          * same value. Following the break-before-make
>>          * (pmd_clear() followed by tlb_flush()) process can
>>          * hinder forward progress due to refaults generated
>>          * on missing translations.
>>          *
>>          * Skip updating the page table if the entry is
>>          * unchanged.
>>          */
>>         if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>             return 0;
>>
>> The PMD has already set the PMD_S2_RDWR bit. I doubt kvm_tlb_flush_vmid_ipa() does not invalidate
>> Stage-2 for the subpages of the PMD(except the first PTE of this PMD). Finally I add some debug
>> code to flush tlb for all subpages of the PMD, as shown bellow:
>>
>>         /*
>>          * Mapping in huge pages should only happen through a
>>          * fault.  If a page is merged into a transparent huge
>>          * page, the individual subpages of that huge page
>>          * should be unmapped through MMU notifiers before we
>>          * get here.
>>          *
>>          * Merging of CompoundPages is not supported; they
>>          * should become splitting first, unmapped, merged,
>>          * and mapped back in on-demand.
>>          */
>>         VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>
>>         pmd_clear(pmd);
>>         for (cnt = 0; cnt < 512; cnt++)
>>             kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
>>
>> Then the problem no longer reproduce.
> 
> This makes very little sense. We shouldn't be able to enter this path
> for anything else but a permission update, otherwise the VM_BUG_ON
> should fire.

Hmm, I think I didn't describe it very clearly.
Look at the following sequence:

1) A PMD is set READONLY and logging_active is in effect.

2) KVM handles a permission fault caused by a write to a subpage (say, *b*) within this huge PMD.

3) KVM dissolves the PMD and invalidates the TLB for this PMD. It then sets a writable PTE.

4) The other 511 subpages are read and a Stage-2 PTE table is set up for them.

5) Now logging_active is removed while the other 511 PTEs stay READONLY.

6) The VM goes on to write another subpage (say, *c*) and causes a permission fault.

7) KVM handles this new fault and builds a new writable PMD after transparent_hugepage_adjust().

8) KVM invalidates the TLB only for the first page (*a*) of the PMD.
   The other 511 RO PTE entries still stay in the TLB, in particular *c*, which will be written later.

9) KVM then sets this new writable PMD.
   Steps 8-9 are what stage2_set_pmd_huge() does (see the sketch at the end of this mail).

10) The VM continues writing *c*, but this time it hits the stale RO PTE entry in the TLB and causes a permission fault again.
   Sometimes it can also cause TLB conflict aborts.

11) KVM repeats step 6, reaches the following statement, and returns 0:

         * Skip updating the page table if the entry is
         * unchanged.
         */
        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
            return 0;

12) Then it will repeat steps 10-11 until the stale PTE entry is invalidated.

I think there is something abnormal in step 8.
Should I blame my hardware? Or is it a kernel bug?
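
For reference, here is a condensed, illustrative sketch of the relevant part of
stage2_set_pmd_huge() at steps 8-9 and 11, reconstructed from the fragments
quoted above (not the complete function):

	old_pmd = *pmd;
	if (pmd_present(old_pmd)) {
		/* Step 11: identical writable PMD again -> bail out, no flush */
		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
			return 0;

		/* Compiles to a no-op unless CONFIG_DEBUG_VM is set */
		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));

		pmd_clear(pmd);				/* Step 8: break...            */
		kvm_tlb_flush_vmid_ipa(kvm, addr);	/* ...but flush only *a*'s IPA */
	} else {
		get_page(virt_to_page(pmd));
	}
	kvm_set_pmd(pmd, *new_pmd);			/* Step 9: make the 2MB block  */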

-- 

Thanks,
Xiang



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-13  9:45       ` Zheng Xiang
@ 2019-03-14 10:55         ` Suzuki K Poulose
  2019-03-14 15:50           ` Zenghui Yu
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-14 10:55 UTC (permalink / raw)
  To: Zheng Xiang
  Cc: Marc Zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin,
	yuzenghui, lious.lilei, lishuo1, suzuki.poulose

Hi Zheng,

On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:
> 
> 
> On 2019/3/13 2:18, Marc Zyngier wrote:
> > Hi Zheng,
> > 
> > On 12/03/2019 15:30, Zheng Xiang wrote:
> >> Hi Marc,
> >>
> >> On 2019/3/12 19:32, Marc Zyngier wrote:
> >>> Hi Zheng,
> >>>
> >>> On 11/03/2019 16:31, Zheng Xiang wrote:
> >>>> Hi all,
> >>>>
> >>>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
> >>>> the base address of the huge page and the whole of Stage-1.
> >>>> However, this just only invalidates the first page within the huge page and the other
> >>>> pages are not invalidated, see bellow:
> >>>>
> >>>>     +---------------+--------------+
> >>>>     |abcde       2MB-Page          |
> >>>>     +---------------+--------------+
> >>>>
> >>>>     TLB before setting new pmd:
> >>>>     +---------------+--------------+
> >>>>     |      VA       |    PAGESIZE  |
> >>>>     +---------------+--------------+
> >>>>     |      a        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>     |      b        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>     |      c        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>     |      d        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>
> >>>>     TLB after setting new pmd:
> >>>>     +---------------+--------------+
> >>>>     |      VA       |    PAGESIZE  |
> >>>>     +---------------+--------------+
> >>>>     |      a        |      2MB     |
> >>>>     +---------------+--------------+
> >>>>     |      b        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>     |      c        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>     |      d        |      4KB     |
> >>>>     +---------------+--------------+
> >>>>
> >>>> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
> >>>
> >>> That's really bad. I can only imagine two scenarios:
> >>>
> >>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
> >>> the PTE table in the process, and place the PMD instead. I can't see
> >>> this happening.
> >>>
> >>> 2) We fail to invalidate on unmap, and that slightly less bad (but still
> >>> quite bad).
> >>>
> >>> Which of the two cases are you seeing?
> >>>
> >>>> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
> >>>> KVM will set the memslot READONLY and split the huge pages.
> >>>> After live migration is canceled and abort, the pages will be merged into THP.
> >>>> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
> >>>>
> >>>> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
> >>>> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.
> >>>
> >>> We should perform an invalidate on each unmap. unmap_stage2_range seems
> >>> to do the right thing. __flush_tlb_range only caters for Stage1
> >>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
> >>> TLBs for the whole VM.
> >>>
> >>> I'd really like to understand what you're seeing, and how to reproduce
> >>> it. Do you have a minimal example I could run on my own HW?
> >>
> >> When I start the live migration for a VM, qemu then begins to log and count dirty pages.
> >> During the live migration, KVM set the pages READONLY so that we can count how many pages
> >> would be wrote afterwards.
> >>
> >> Anything is OK until I cancel the live migration and qemu stop logging. Later the VM gets hang.
> >> The trace log shows repeatedly level-3 permission fault caused by a write on a same IPA. After
> >> analyzing the source code, I find KVM always return from the bellow *if* statement in
> >> stage2_set_pmd_huge() even if we only have a single VCPU:
> >>
> >>         /*
> >>          * Multiple vcpus faulting on the same PMD entry, can
> >>          * lead to them sequentially updating the PMD with the
> >>          * same value. Following the break-before-make
> >>          * (pmd_clear() followed by tlb_flush()) process can
> >>          * hinder forward progress due to refaults generated
> >>          * on missing translations.
> >>          *
> >>          * Skip updating the page table if the entry is
> >>          * unchanged.
> >>          */
> >>         if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >>             return 0;
> >>
> >> The PMD has already set the PMD_S2_RDWR bit. I doubt kvm_tlb_flush_vmid_ipa() does not invalidate
> >> Stage-2 for the subpages of the PMD(except the first PTE of this PMD). Finally I add some debug
> >> code to flush tlb for all subpages of the PMD, as shown bellow:
> >>
> >>         /*
> >>          * Mapping in huge pages should only happen through a
> >>          * fault.  If a page is merged into a transparent huge
> >>          * page, the individual subpages of that huge page
> >>          * should be unmapped through MMU notifiers before we
> >>          * get here.
> >>          *
> >>          * Merging of CompoundPages is not supported; they
> >>          * should become splitting first, unmapped, merged,
> >>          * and mapped back in on-demand.
> >>          */
> >>         VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> >>
> >>         pmd_clear(pmd);
> >>         for (cnt = 0; cnt < 512; cnt++)
> >>             kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
> >>
> >> Then the problem no longer reproduce.
> > 
> > This makes very little sense. We shouldn't be able to enter this path
> > for anything else but a permission update, otherwise the VM_BUG_ON
> > should fire.
> 
> Hmm, I think I didn't describe it very clearly.
> Look at the following sequence:
> 
> 1) Set a PMD READONLY and logging_active.
> 
> 2) KVM handles permission fault caused by writing a subpage(assumpt *b*) within this huge PMD.
> 
> 3) KVM dissolves PMD and invalidates TLB for this PMD. Then set a writable PTE.
> 
> 4) Read another 511 PTEs and setup Stage-2 PTE table.
> 
> 5) Now remove logging_active and keep another 511 PTEs READONLY.
> 
> 6) VM continues to write a subpage(assumpt *c*) and cause permission fault.
> 
> 7) KVM handles this new fault and makes a new writable PMD after transparent_hugepage_adjust().
> 
> 8) KVM invalidates TLB for the first page(*a*) of the PMD.
>    Here another 511 RO PTEs entries still stay in TLB, especially *c* which will be wrote later.
> 
> 9) KVM then set this new writable PMD.
>    Step 8-9 is what stage2_set_pmd_huge() does.
> 
> 10) VM continues to write *c*, but this time it hits the RO PTE entry in TLB and causes permission fault again.
>    Sometimes it can also cause TLB conflict aborts.
> 
> 11) KVM repeats step 6 and goes to the following statement and return 0:
> 
>          * Skip updating the page table if the entry is
>          * unchanged.
>          */
>         if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>             return 0;
> 
> 12) Then it will repeat step 10-11 until the PTE entry is invalidated.
> 
> I think there is something abnormal in step 8.
> Should I blame my hardware? Or is it a kernel bug?

Marc and I had a discussion about this and it looks like we may have an
issue here. So with the cancellation of logging, we do not trigger the
mmu_notifiers (as the userspace memory mapping hasn't changed) and thus
have memory leaks while trying to install a huge mapping. Would it be
possible for you to try the patch below? It will trigger a WARNING
to confirm our theory, but should not cause the hang, as we unmap
the PMD/PUD range of PTE mappings before reinstalling a block map.


---8>---

test: kvm: arm: Fix handling of stage2 huge mappings

We rely on the mmu_notifier callbacks to handle the split/merging
of huge pages and thus we are guaranteed that, while creating a
block mapping, the entire block is unmapped at stage2. However,
we miss a case where the block mapping is split for the dirty
logging case and could later be made a block mapping again, if we
cancel the dirty logging. This not only creates inconsistent TLB
entries for the pages in the block, but also leaks the table pages
at PMD level.

Handle these corner cases for the huge mappings at stage2.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 66e0fbb5..04b0f9b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 		 * Skip updating the page table if the entry is
 		 * unchanged.
 		 */
-		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+		if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
 			return 0;
-
+		} else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
 		/*
-		 * Mapping in huge pages should only happen through a
-		 * fault.  If a page is merged into a transparent huge
-		 * page, the individual subpages of that huge page
-		 * should be unmapped through MMU notifiers before we
-		 * get here.
-		 *
-		 * Merging of CompoundPages is not supported; they
-		 * should become splitting first, unmapped, merged,
-		 * and mapped back in on-demand.
+		 * If we have PTE level mapping for this block,
+		 * we must unmap it to avoid inconsistent TLB
+		 * state. We could end up in this situation if
+		 * the memory slot was marked for dirty logging
+		 * and was reverted, leaving PTE level mappings
+		 * for the pages accessed during the period.
+		 * Normal THP split/merge follows mmu_notifier
+		 * callbacks and do get handled accordingly.
 		 */
-		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
+			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
+		} else {
 
-		pmd_clear(pmd);
-		kvm_tlb_flush_vmid_ipa(kvm, addr);
+			/*
+			 * Mapping in huge pages should only happen through a
+			 * fault.  If a page is merged into a transparent huge
+			 * page, the individual subpages of that huge page
+			 * should be unmapped through MMU notifiers before we
+			 * get here.
+			 *
+			 * Merging of CompoundPages is not supported; they
+			 * should become splitting first, unmapped, merged,
+			 * and mapped back in on-demand.
+			 */
+			WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
+
+			pmd_clear(pmd);
+			kvm_tlb_flush_vmid_ipa(kvm, addr);
+		}
 	} else {
 		get_page(virt_to_page(pmd));
 	}
@@ -1122,8 +1136,13 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
 		return 0;
 
 	if (stage2_pud_present(kvm, old_pud)) {
-		stage2_pud_clear(kvm, pudp);
-		kvm_tlb_flush_vmid_ipa(kvm, addr);
+		/* If we have PTE level mapping, unmap the entire range */
+		if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
+			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
+		} else {
+			stage2_pud_clear(kvm, pudp);
+			kvm_tlb_flush_vmid_ipa(kvm, addr);
+		}
 	} else {
 		get_page(virt_to_page(pudp));
 	}
-- 
2.7.4



> 
> -- 
> 
> Thanks,
> Xiang
> 
> 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-14 10:55         ` Suzuki K Poulose
@ 2019-03-14 15:50           ` Zenghui Yu
  2019-03-15  8:21             ` Zheng Xiang
  0 siblings, 1 reply; 22+ messages in thread
From: Zenghui Yu @ 2019-03-14 15:50 UTC (permalink / raw)
  To: Suzuki K Poulose, Zheng Xiang
  Cc: Marc Zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin,
	lious.lilei, lishuo1

Hi Suzuki,

On 2019/3/14 18:55, Suzuki K Poulose wrote:
> Hi Zheng,
> 
> On Wed, Mar 13, 2019 at 05:45:31PM +0800, Zheng Xiang wrote:
>>
>>
>> On 2019/3/13 2:18, Marc Zyngier wrote:
>>> Hi Zheng,
>>>
>>> On 12/03/2019 15:30, Zheng Xiang wrote:
>>>> Hi Marc,
>>>>
>>>> On 2019/3/12 19:32, Marc Zyngier wrote:
>>>>> Hi Zheng,
>>>>>
>>>>> On 11/03/2019 16:31, Zheng Xiang wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
>>>>>> the base address of the huge page and the whole of Stage-1.
>>>>>> However, this just only invalidates the first page within the huge page and the other
>>>>>> pages are not invalidated, see bellow:
>>>>>>
>>>>>>      +---------------+--------------+
>>>>>>      |abcde       2MB-Page          |
>>>>>>      +---------------+--------------+
>>>>>>
>>>>>>      TLB before setting new pmd:
>>>>>>      +---------------+--------------+
>>>>>>      |      VA       |    PAGESIZE  |
>>>>>>      +---------------+--------------+
>>>>>>      |      a        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      b        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      c        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      d        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>
>>>>>>      TLB after setting new pmd:
>>>>>>      +---------------+--------------+
>>>>>>      |      VA       |    PAGESIZE  |
>>>>>>      +---------------+--------------+
>>>>>>      |      a        |      2MB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      b        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      c        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>      |      d        |      4KB     |
>>>>>>      +---------------+--------------+
>>>>>>
>>>>>> When VM access *b* address, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
>>>>>
>>>>> That's really bad. I can only imagine two scenarios:
>>>>>
>>>>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), loosing
>>>>> the PTE table in the process, and place the PMD instead. I can't see
>>>>> this happening.
>>>>>
>>>>> 2) We fail to invalidate on unmap, and that slightly less bad (but still
>>>>> quite bad).
>>>>>
>>>>> Which of the two cases are you seeing?
>>>>>
>>>>>> For example, we need to keep tracking of the VM memory dirty pages when VM is in live migration.
>>>>>> KVM will set the memslot READONLY and split the huge pages.
>>>>>> After live migration is canceled and abort, the pages will be merged into THP.
>>>>>> The later access to these pages which are READONLY will cause level-3 Permission Fault until they are invalidated.
>>>>>>
>>>>>> So should we invalidate the tlb entries for all relative pages(e.g a,b,c,d), like __flush_tlb_range()?
>>>>>> Or we can call __kvm_tlb_flush_vmid() to invalidate all tlb entries.
>>>>>
>>>>> We should perform an invalidate on each unmap. unmap_stage2_range seems
>>>>> to do the right thing. __flush_tlb_range only caters for Stage1
>>>>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
>>>>> TLBs for the whole VM.
>>>>>
>>>>> I'd really like to understand what you're seeing, and how to reproduce
>>>>> it. Do you have a minimal example I could run on my own HW?
>>>>
>>>> When I start the live migration for a VM, qemu then begins to log and count dirty pages.
>>>> During the live migration, KVM set the pages READONLY so that we can count how many pages
>>>> would be wrote afterwards.
>>>>
>>>> Anything is OK until I cancel the live migration and qemu stop logging. Later the VM gets hang.
>>>> The trace log shows repeatedly level-3 permission fault caused by a write on a same IPA. After
>>>> analyzing the source code, I find KVM always return from the bellow *if* statement in
>>>> stage2_set_pmd_huge() even if we only have a single VCPU:
>>>>
>>>>          /*
>>>>           * Multiple vcpus faulting on the same PMD entry, can
>>>>           * lead to them sequentially updating the PMD with the
>>>>           * same value. Following the break-before-make
>>>>           * (pmd_clear() followed by tlb_flush()) process can
>>>>           * hinder forward progress due to refaults generated
>>>>           * on missing translations.
>>>>           *
>>>>           * Skip updating the page table if the entry is
>>>>           * unchanged.
>>>>           */
>>>>          if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>>>              return 0;
>>>>
>>>> The PMD has already set the PMD_S2_RDWR bit. I doubt kvm_tlb_flush_vmid_ipa() does not invalidate
>>>> Stage-2 for the subpages of the PMD(except the first PTE of this PMD). Finally I add some debug
>>>> code to flush tlb for all subpages of the PMD, as shown bellow:
>>>>
>>>>          /*
>>>>           * Mapping in huge pages should only happen through a
>>>>           * fault.  If a page is merged into a transparent huge
>>>>           * page, the individual subpages of that huge page
>>>>           * should be unmapped through MMU notifiers before we
>>>>           * get here.
>>>>           *
>>>>           * Merging of CompoundPages is not supported; they
>>>>           * should become splitting first, unmapped, merged,
>>>>           * and mapped back in on-demand.
>>>>           */
>>>>          VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>>>
>>>>          pmd_clear(pmd);
>>>>          for (cnt = 0; cnt < 512; cnt++)
>>>>              kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);
>>>>
>>>> Then the problem no longer reproduce.
>>>
>>> This makes very little sense. We shouldn't be able to enter this path
>>> for anything else but a permission update, otherwise the VM_BUG_ON
>>> should fire.
>>
>> Hmm, I think I didn't describe it very clearly.
>> Look at the following sequence:
>>
>> 1) Set a PMD READONLY and logging_active.
>>
>> 2) KVM handles permission fault caused by writing a subpage(assumpt *b*) within this huge PMD.
>>
>> 3) KVM dissolves PMD and invalidates TLB for this PMD. Then set a writable PTE.
>>
>> 4) Read another 511 PTEs and setup Stage-2 PTE table.
>>
>> 5) Now remove logging_active and keep another 511 PTEs READONLY.
>>
>> 6) VM continues to write a subpage(assumpt *c*) and cause permission fault.
>>
>> 7) KVM handles this new fault and makes a new writable PMD after transparent_hugepage_adjust().
>>
>> 8) KVM invalidates TLB for the first page(*a*) of the PMD.
>>     Here another 511 RO PTEs entries still stay in TLB, especially *c* which will be wrote later.
>>
>> 9) KVM then set this new writable PMD.
>>     Step 8-9 is what stage2_set_pmd_huge() does.
>>
>> 10) VM continues to write *c*, but this time it hits the RO PTE entry in TLB and causes permission fault again.
>>     Sometimes it can also cause TLB conflict aborts.
>>
>> 11) KVM repeats step 6 and goes to the following statement and return 0:
>>
>>           * Skip updating the page table if the entry is
>>           * unchanged.
>>           */
>>          if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>              return 0;
>>
>> 12) Then it will repeat step 10-11 until the PTE entry is invalidated.
>>
>> I think there is something abnormal in step 8.
>> Should I blame my hardware? Or is it a kernel bug?
> 
> Marc and I had a discussion about this and it looks like we may have an
> issue here. So with the cancellation of logging, we do not trigger the
> mmu_notifiers (as the userspace memory mapping hasn't changed) and thus
> have memory leaks while trying to install a huge mapping. Would it be
> possible for you to try the patch below ? It will trigger a WARNING
> to confirm our theory, but should not cause the hang. As we unmap
> the PMD/PUD range of PTE mappings before reinstalling a block map.

Thanks for the reply. And I think this is almost what Zheng Xiang wanted
to say! We will test this patch tomorrow and give you some feedback.

BTW, we have noticed that x86 has also suffered from a similar issue.
You may want to look into commit 3ea3b7fa9af0 ("kvm: mmu: lazy collapse
small sptes into large sptes", 2015) :-)


thanks,

zenghui

> 
> 
> ---8>---
> 
> test: kvm: arm: Fix handling of stage2 huge mappings
> 
> We rely on the mmu_notifier call backs to handle the split/merging
> of huge pages and thus we are guaranteed that while creating a
> block mapping, the entire block is unmapped at stage2. However,
> we miss a case where the block mapping is split for dirty logging
> case and then could later be made block mapping, if we cancel the
> dirty logging. This not only creates inconsistent TLB entries for
> the pages in the the block, but also leakes the table pages for
> PMD level.
> 
> Handle these corner cases for the huge mappings at stage2.
> 
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>   virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
>   1 file changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 66e0fbb5..04b0f9b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>   		 * Skip updating the page table if the entry is
>   		 * unchanged.
>   		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>   			return 0;
> -
> +		} else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>   		/*
> -		 * Mapping in huge pages should only happen through a
> -		 * fault.  If a page is merged into a transparent huge
> -		 * page, the individual subpages of that huge page
> -		 * should be unmapped through MMU notifiers before we
> -		 * get here.
> -		 *
> -		 * Merging of CompoundPages is not supported; they
> -		 * should become splitting first, unmapped, merged,
> -		 * and mapped back in on-demand.
> +		 * If we have PTE level mapping for this block,
> +		 * we must unmap it to avoid inconsistent TLB
> +		 * state. We could end up in this situation if
> +		 * the memory slot was marked for dirty logging
> +		 * and was reverted, leaving PTE level mappings
> +		 * for the pages accessed during the period.
> +		 * Normal THP split/merge follows mmu_notifier
> +		 * callbacks and do get handled accordingly.
>   		 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> +			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
> +		} else {
>   
> -		pmd_clear(pmd);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +			/*
> +			 * Mapping in huge pages should only happen through a
> +			 * fault.  If a page is merged into a transparent huge
> +			 * page, the individual subpages of that huge page
> +			 * should be unmapped through MMU notifiers before we
> +			 * get here.
> +			 *
> +			 * Merging of CompoundPages is not supported; they
> +			 * should become splitting first, unmapped, merged,
> +			 * and mapped back in on-demand.
> +			 */
> +			WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> +
> +			pmd_clear(pmd);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}
>   	} else {
>   		get_page(virt_to_page(pmd));
>   	}
> @@ -1122,8 +1136,13 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>   		return 0;
>   
>   	if (stage2_pud_present(kvm, old_pud)) {
> -		stage2_pud_clear(kvm, pudp);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		/* If we have PTE level mapping, unmap the entire range */
> +		if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
> +		} else {
> +			stage2_pud_clear(kvm, pudp);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}
>   	} else {
>   		get_page(virt_to_page(pudp));
>   	}
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-14 15:50           ` Zenghui Yu
@ 2019-03-15  8:21             ` Zheng Xiang
  2019-03-15 14:56               ` Suzuki K Poulose
  0 siblings, 1 reply; 22+ messages in thread
From: Zheng Xiang @ 2019-03-15  8:21 UTC (permalink / raw)
  To: Zenghui Yu, Suzuki K Poulose
  Cc: Marc Zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel, Wang Haibin,
	lious.lilei, lishuo1

Hi Suzuki,

I have tested this patch; the VM doesn't hang and we get the expected WARNING log:

[  526.184452] pstate: 20400009 (nzCv daif +PAN -UAO)
[  526.184454] pc : user_mem_abort+0x484/0x9e0
[  526.184455] lr : user_mem_abort+0x478/0x9e0
[  526.184456] sp : ffff000084a038e0
[  526.184457] x29: ffff000084a038e0 x28: 000000012f600000
[  526.184458] x27: ffff8a2fa27ae918 x26: 0000000000200000
[  526.184460] x25: 0000000000000000 x24: 0000000000000000
[  526.184461] x23: 00400a269d0007fd x22: ffff0000849cd000
[  526.184462] x21: ffff00001181d000 x20: 00000a26eef72003
[  526.184463] x19: ffff8a2fb41d4bd8 x18: 00004fffb8b22000
[  526.184465] x17: 0000000000000000 x16: 0000000000000000
[  526.184466] x15: 0000000000000001 x14: ffff000008dd12a8
[  526.184467] x13: 0000000000000041 x12: ffff8a26eeca6e30
[  526.184468] x11: ffff8000fe4af800 x10: 0000000000000040
[  526.184469] x9 : ffff0000097c46c0 x8 : ffff8000ff400248
[  526.184471] x7 : 0000001000000000 x6 : 00000000000021f8
[  526.184472] x5 : 00000000a269d000 x4 : 0000000000000018
[  526.184473] x3 : 000000000000000a x2 : 0000000000000004
[  526.184474] x1 : 0000000000000000 x0 : 0000000000000000
[  526.184476] Call trace:
[  526.184477]  user_mem_abort+0x484/0x9e0
[  526.184479]  kvm_handle_guest_abort+0x11c/0x478
[  526.184480]  handle_exit+0x14c/0x1c8
[  526.184482]  kvm_arch_vcpu_ioctl_run+0x280/0x898
[  526.184483]  kvm_vcpu_ioctl+0x488/0x8a8
[  526.184485]  do_vfs_ioctl+0xc4/0x8c0
[  526.184486]  ksys_ioctl+0x8c/0xa0
[  526.184487]  __arm64_sys_ioctl+0x28/0x38
[  526.184489]  el0_svc_common+0xa0/0x180
[  526.184491]  el0_svc_handler+0x38/0x78
[  526.184492]  el0_svc+0x8/0xc

However, we also get the following unexpected log:

[  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
[  908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0 mapping:0000000000000000 index:0x0
[  908.339416] flags: 0x4ffffe0000000000()
[  908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200 0000000000000000
[  908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff 0000000000000000
[  908.339420] page dumped because: nonzero _refcount
[  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded Tainted: G    B  W        5.0.0+ #1
[  908.339438] Call trace:
[  908.339439]  dump_backtrace+0x0/0x188
[  908.339441]  show_stack+0x24/0x30
[  908.339442]  dump_stack+0xa8/0xcc
[  908.339443]  bad_page+0xf0/0x150
[  908.339445]  free_pages_check_bad+0x84/0xa0
[  908.339446]  free_pcppages_bulk+0x4b8/0x750
[  908.339448]  free_unref_page_commit+0x13c/0x198
[  908.339449]  free_unref_page+0x84/0xa0
[  908.339451]  __free_pages+0x58/0x68
[  908.339452]  zap_huge_pmd+0x290/0x2d8
[  908.339454]  unmap_page_range+0x2b4/0x470
[  908.339455]  unmap_single_vma+0x94/0xe8
[  908.339457]  unmap_vmas+0x8c/0x108
[  908.339458]  exit_mmap+0xd4/0x178
[  908.339459]  mmput+0x74/0x180
[  908.339460]  do_exit+0x2b4/0x5b0
[  908.339462]  do_group_exit+0x3c/0xe0
[  908.339463]  __arm64_sys_exit_group+0x24/0x28
[  908.339465]  el0_svc_common+0xa0/0x180
[  908.339466]  el0_svc_handler+0x38/0x78
[  908.339467]  el0_svc+0x8/0xc

>> Marc and I had a discussion about this and it looks like we may have an
>> issue here. So with the cancellation of logging, we do not trigger the
>> mmu_notifiers (as the userspace memory mapping hasn't changed) and thus
>> have memory leaks while trying to install a huge mapping. Would it be
>> possible for you to try the patch below ? It will trigger a WARNING
>> to confirm our theory, but should not cause the hang. As we unmap
>> the PMD/PUD range of PTE mappings before reinstalling a block map.
> 
> Thanks for the reply. And I think this is alomst what Zheng Xiang wanted to say! We will test this patch tomorrow and give you some feedback.
> 
> BTW, we have noticed that X86 had also suffered from the similar issue. You may want to look into commit 3ea3b7fa9af0 ("kvm: mmu: lazy collapse small sptes into large sptes" 2015) :-)
> 
> 
> thanks,
> 
> zenghui
> 
>>
>>
>> ---8>---
>>
>> test: kvm: arm: Fix handling of stage2 huge mappings
>>
>> We rely on the mmu_notifier call backs to handle the split/merging
>> of huge pages and thus we are guaranteed that while creating a
>> block mapping, the entire block is unmapped at stage2. However,
>> we miss a case where the block mapping is split for dirty logging
>> case and then could later be made block mapping, if we cancel the
>> dirty logging. This not only creates inconsistent TLB entries for
>> the pages in the the block, but also leakes the table pages for
>> PMD level.
>>
>> Handle these corner cases for the huge mappings at stage2.
>>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 35 insertions(+), 16 deletions(-)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 66e0fbb5..04b0f9b 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>>            * Skip updating the page table if the entry is
>>            * unchanged.
>>            */
>> -        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>> +        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>               return 0;
>> -
>> +        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>           /*
>> -         * Mapping in huge pages should only happen through a
>> -         * fault.  If a page is merged into a transparent huge
>> -         * page, the individual subpages of that huge page
>> -         * should be unmapped through MMU notifiers before we
>> -         * get here.
>> -         *
>> -         * Merging of CompoundPages is not supported; they
>> -         * should become splitting first, unmapped, merged,
>> -         * and mapped back in on-demand.
>> +         * If we have PTE level mapping for this block,
>> +         * we must unmap it to avoid inconsistent TLB
>> +         * state. We could end up in this situation if
>> +         * the memory slot was marked for dirty logging
>> +         * and was reverted, leaving PTE level mappings
>> +         * for the pages accessed during the period.
>> +         * Normal THP split/merge follows mmu_notifier
>> +         * callbacks and do get handled accordingly.
>>            */
>> -        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>> +            unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);

It seems that kvm decreases the _refcount of the page twice in transparent_hugepage_adjust()
and unmap_stage2_range().

>> +        } else {
>>   -        pmd_clear(pmd);
>> -        kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +            /*
>> +             * Mapping in huge pages should only happen through a
>> +             * fault.  If a page is merged into a transparent huge
>> +             * page, the individual subpages of that huge page
>> +             * should be unmapped through MMU notifiers before we
>> +             * get here.
>> +             *
>> +             * Merging of CompoundPages is not supported; they
>> +             * should become splitting first, unmapped, merged,
>> +             * and mapped back in on-demand.
>> +             */
>> +            WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>> +
>> +            pmd_clear(pmd);
>> +            kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        }
>>       } else {
>>           get_page(virt_to_page(pmd));
>>       }
>> @@ -1122,8 +1136,13 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>>           return 0;
>>         if (stage2_pud_present(kvm, old_pud)) {
>> -        stage2_pud_clear(kvm, pudp);
>> -        kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        /* If we have PTE level mapping, unmap the entire range */
>> +        if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
>> +            unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
>> +        } else {
>> +            stage2_pud_clear(kvm, pudp);
>> +            kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        }
>>       } else {
>>           get_page(virt_to_page(pudp));
>>       }
>>
> 
> 
> .
-- 

Thanks,
Xiang



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-15  8:21             ` Zheng Xiang
@ 2019-03-15 14:56               ` Suzuki K Poulose
  2019-03-17 13:34                 ` Zenghui Yu
  2019-03-17 13:55                 ` [RFC] Question about TLB flush while set Stage-2 huge pages Zenghui Yu
  0 siblings, 2 replies; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-15 14:56 UTC (permalink / raw)
  To: zhengxiang9, yuzenghui
  Cc: marc.zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel,
	wanghaibin.wang, lious.lilei, lishuo1

Hi Zhengui,

On 15/03/2019 08:21, Zheng Xiang wrote:
> Hi Suzuki,
> 
> I have tested this patch, VM doesn't hang and we get expected WARNING log:

Thanks for the quick testing !

> However, we also get the following unexpected log:
> 
> [  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
> [  908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0 mapping:0000000000000000 index:0x0
> [  908.339416] flags: 0x4ffffe0000000000()
> [  908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200 0000000000000000
> [  908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff 0000000000000000
> [  908.339420] page dumped because: nonzero _refcount
> [  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded Tainted: G    B  W        5.0.0+ #1
> [  908.339438] Call trace:
> [  908.339439]  dump_backtrace+0x0/0x188
> [  908.339441]  show_stack+0x24/0x30
> [  908.339442]  dump_stack+0xa8/0xcc
> [  908.339443]  bad_page+0xf0/0x150
> [  908.339445]  free_pages_check_bad+0x84/0xa0
> [  908.339446]  free_pcppages_bulk+0x4b8/0x750
> [  908.339448]  free_unref_page_commit+0x13c/0x198
> [  908.339449]  free_unref_page+0x84/0xa0
> [  908.339451]  __free_pages+0x58/0x68
> [  908.339452]  zap_huge_pmd+0x290/0x2d8
> [  908.339454]  unmap_page_range+0x2b4/0x470
> [  908.339455]  unmap_single_vma+0x94/0xe8
> [  908.339457]  unmap_vmas+0x8c/0x108
> [  908.339458]  exit_mmap+0xd4/0x178
> [  908.339459]  mmput+0x74/0x180
> [  908.339460]  do_exit+0x2b4/0x5b0
> [  908.339462]  do_group_exit+0x3c/0xe0
> [  908.339463]  __arm64_sys_exit_group+0x24/0x28
> [  908.339465]  el0_svc_common+0xa0/0x180
> [  908.339466]  el0_svc_handler+0x38/0x78
> [  908.339467]  el0_svc+0x8/0xc

That's bad, we seem to be making up to 4 unbalanced put_page().

>>> ---
>>>    virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
>>>    1 file changed, 35 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>> index 66e0fbb5..04b0f9b 100644
>>> --- a/virt/kvm/arm/mmu.c
>>> +++ b/virt/kvm/arm/mmu.c
>>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>>>             * Skip updating the page table if the entry is
>>>             * unchanged.
>>>             */
>>> -        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>> +        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>>                return 0;
>>> -
>>> +        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>>            /*
>>> -         * Mapping in huge pages should only happen through a
>>> -         * fault.  If a page is merged into a transparent huge
>>> -         * page, the individual subpages of that huge page
>>> -         * should be unmapped through MMU notifiers before we
>>> -         * get here.
>>> -         *
>>> -         * Merging of CompoundPages is not supported; they
>>> -         * should become splitting first, unmapped, merged,
>>> -         * and mapped back in on-demand.
>>> +         * If we have PTE level mapping for this block,
>>> +         * we must unmap it to avoid inconsistent TLB
>>> +         * state. We could end up in this situation if
>>> +         * the memory slot was marked for dirty logging
>>> +         * and was reverted, leaving PTE level mappings
>>> +         * for the pages accessed during the period.
>>> +         * Normal THP split/merge follows mmu_notifier
>>> +         * callbacks and do get handled accordingly.
>>>             */
>>> -        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>> +            unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
> 
> It seems that kvm decreases the _refcount of the page twice in transparent_hugepage_adjust()
> and unmap_stage2_range().

But I thought we should be doing that on the head_page already, as this is THP.
I will take a look and get back to you on this. Btw, is it possible for you
to turn on CONFIG_DEBUG_VM and re-run with the above patch ?

Kind regards
Suzuki



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-15 14:56               ` Suzuki K Poulose
@ 2019-03-17 13:34                 ` Zenghui Yu
  2019-03-18 17:34                   ` Suzuki K Poulose
  2019-03-17 13:55                 ` [RFC] Question about TLB flush while set Stage-2 huge pages Zenghui Yu
  1 sibling, 1 reply; 22+ messages in thread
From: Zenghui Yu @ 2019-03-17 13:34 UTC (permalink / raw)
  To: Suzuki K Poulose, zhengxiang9
  Cc: marc.zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel,
	wanghaibin.wang, lious.lilei, lishuo1

Hi Suzuki,

On 2019/3/15 22:56, Suzuki K Poulose wrote:
> Hi Zhengui,

s/Zhengui/Zheng/

(I think you must wanted to say "Hi" to Zheng :-) )


I have looked into your patch and the kernel log, and I believe that
your patch had already addressed this issue. But I think we can do it
a little better - two more points need to be handled with caution.

Take PMD hugepage (PMD_SIZE == 2M) for example:

> 
> On 15/03/2019 08:21, Zheng Xiang wrote:
>> Hi Suzuki,
>>
>> I have tested this patch, VM doesn't hang and we get expected WARNING 
>> log:
> 
> Thanks for the quick testing !
> 
>> However, we also get the following unexpected log:
>>
>> [  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
>> [  908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0 
>> mapping:0000000000000000 index:0x0
>> [  908.339416] flags: 0x4ffffe0000000000()
>> [  908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200 
>> 0000000000000000
>> [  908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff 
>> 0000000000000000
>> [  908.339420] page dumped because: nonzero _refcount
>> [  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded 
>> Tainted: G    B  W        5.0.0+ #1
>> [  908.339438] Call trace:
>> [  908.339439]  dump_backtrace+0x0/0x188
>> [  908.339441]  show_stack+0x24/0x30
>> [  908.339442]  dump_stack+0xa8/0xcc
>> [  908.339443]  bad_page+0xf0/0x150
>> [  908.339445]  free_pages_check_bad+0x84/0xa0
>> [  908.339446]  free_pcppages_bulk+0x4b8/0x750
>> [  908.339448]  free_unref_page_commit+0x13c/0x198
>> [  908.339449]  free_unref_page+0x84/0xa0
>> [  908.339451]  __free_pages+0x58/0x68
>> [  908.339452]  zap_huge_pmd+0x290/0x2d8
>> [  908.339454]  unmap_page_range+0x2b4/0x470
>> [  908.339455]  unmap_single_vma+0x94/0xe8
>> [  908.339457]  unmap_vmas+0x8c/0x108
>> [  908.339458]  exit_mmap+0xd4/0x178
>> [  908.339459]  mmput+0x74/0x180
>> [  908.339460]  do_exit+0x2b4/0x5b0
>> [  908.339462]  do_group_exit+0x3c/0xe0
>> [  908.339463]  __arm64_sys_exit_group+0x24/0x28
>> [  908.339465]  el0_svc_common+0xa0/0x180
>> [  908.339466]  el0_svc_handler+0x38/0x78
>> [  908.339467]  el0_svc+0x8/0xc
> 
> That's bad, we seem to be making up to 4 unbalanced put_page().
> 
>>>> ---
>>>>    virt/kvm/arm/mmu.c | 51 
>>>> +++++++++++++++++++++++++++++++++++----------------
>>>>    1 file changed, 35 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>> index 66e0fbb5..04b0f9b 100644
>>>> --- a/virt/kvm/arm/mmu.c
>>>> +++ b/virt/kvm/arm/mmu.c
>>>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm 
>>>> *kvm, struct kvm_mmu_memory_cache
>>>>             * Skip updating the page table if the entry is
>>>>             * unchanged.
>>>>             */
>>>> -        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>>> +        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>>>                return 0;
>>>> -
>>>> +        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>>>            /*
>>>> -         * Mapping in huge pages should only happen through a
>>>> -         * fault.  If a page is merged into a transparent huge
>>>> -         * page, the individual subpages of that huge page
>>>> -         * should be unmapped through MMU notifiers before we
>>>> -         * get here.
>>>> -         *
>>>> -         * Merging of CompoundPages is not supported; they
>>>> -         * should become splitting first, unmapped, merged,
>>>> -         * and mapped back in on-demand.
>>>> +         * If we have PTE level mapping for this block,
>>>> +         * we must unmap it to avoid inconsistent TLB
>>>> +         * state. We could end up in this situation if
>>>> +         * the memory slot was marked for dirty logging
>>>> +         * and was reverted, leaving PTE level mappings
>>>> +         * for the pages accessed during the period.
>>>> +         * Normal THP split/merge follows mmu_notifier
>>>> +         * callbacks and do get handled accordingly.
>>>>             */
>>>> -        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>>> +            unmap_stage2_range(kvm, (addr & S2_PMD_MASK), 
>>>> S2_PMD_SIZE);

First, using unmap_stage2_range() here is not quite appropriate. Suppose
we've only accessed one 2M page in HPA [x, x+1]Gib range, with other
pages unaccessed.  What will happen if unmap_stage2_range(this_2M_page)?
We'll unexpectedly reach clear_stage2_pud_entry(), and things are going
to get really bad.  So we'd better use unmap_stage2_ptes() here since we
only want to unmap a 2M range.
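
Roughly, the call path I'm describing (assuming that 2M block is the
only populated entry under its PUD; the helper names below are the
existing ones in mmu.c, the chain itself is just my reading of it):

   unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE)
     unmap_stage2_puds()
       unmap_stage2_pmds()
         unmap_stage2_ptes()               /* tears down the PTE table  */
         if (stage2_pmd_table_empty())     /* true in this scenario     */
             clear_stage2_pud_entry()      /* frees the PMD table page! */

stage2_set_pmd_huge() would then write the new block entry through a
pmd pointer into a table page that is no longer plugged into the
stage2 page tables.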


Second, consider below function stack:

   unmap_stage2_ptes()
     clear_stage2_pmd_entry()
       put_page(virt_to_page(pmd))

It seems that we have one "redundant" put_page() here (hence the bad
kernel log above), but actually we do not.  After stage2_set_pmd_huge(),
the PMD table entry will point to a 2M block (it originally pointed
to a PTE table), so the _refcount of this PMD-level table page should
_not_ change across unmap_stage2_ptes().  So what we really should do
is add a get_page() after unmapping to keep the _refcount balanced!


thoughts ? A simple patch below (based on yours) for details.


thanks,

zenghui


>>
>> It seems that kvm decreases the _refcount of the page twice in 
>> transparent_hugepage_adjust()
>> and unmap_stage2_range().
> 
> But I thought we should be doing that on the head_page already, as this 
> is THP.
> I will take a look and get back to you on this. Btw, is it possible for you
> to turn on CONFIG_DEBUG_VM and re-run with the above patch ?
> 
> Kind regards
> Suzuki
> 

---8<---

test: kvm: arm: Maybe two more fixes

Applied based on Suzuki's patch.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
---
  virt/kvm/arm/mmu.c | 8 ++++++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 05765df..ccd5d5d 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, 
struct kvm_mmu_memory_cache
  		 * Normal THP split/merge follows mmu_notifier
  		 * callbacks and do get handled accordingly.
  		 */
-			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
+			addr &= S2_PMD_MASK;
+			unmap_stage2_ptes(kvm, pmd, addr, addr + S2_PMD_SIZE);
+			get_page(virt_to_page(pmd));
  		} else {

  			/*
@@ -1138,7 +1140,9 @@ static int stage2_set_pud_huge(struct kvm *kvm, 
struct kvm_mmu_memory_cache *cac
  	if (stage2_pud_present(kvm, old_pud)) {
  		/* If we have PTE level mapping, unmap the entire range */
  		if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
-			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
+			addr &= S2_PUD_MASK;
+			unmap_stage2_pmds(kvm, pudp, addr, addr + S2_PUD_SIZE);
+			get_page(virt_to_page(pudp));
  		} else {
  			stage2_pud_clear(kvm, pudp);
  			kvm_tlb_flush_vmid_ipa(kvm, addr);
-- 
1.8.3.1






^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-15 14:56               ` Suzuki K Poulose
  2019-03-17 13:34                 ` Zenghui Yu
@ 2019-03-17 13:55                 ` Zenghui Yu
  1 sibling, 0 replies; 22+ messages in thread
From: Zenghui Yu @ 2019-03-17 13:55 UTC (permalink / raw)
  To: Suzuki K Poulose, zhengxiang9
  Cc: marc.zyngier, christoffer.dall, catalin.marinas, will.deacon,
	james.morse, linux-arm-kernel, kvmarm, linux-kernel,
	wanghaibin.wang, lious.lilei, lishuo1

Hi Suzuki,

On 2019/3/15 22:56, Suzuki K Poulose wrote:
> Hi Zhengui,
> 
> On 15/03/2019 08:21, Zheng Xiang wrote:
>> Hi Suzuki,
>>
>> I have tested this patch, VM doesn't hang and we get expected WARNING 
>> log:
> 
> Thanks for the quick testing !
> 
>> However, we also get the following unexpected log:
>>
>> [  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
>> [  908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0 
>> mapping:0000000000000000 index:0x0
>> [  908.339416] flags: 0x4ffffe0000000000()
>> [  908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200 
>> 0000000000000000
>> [  908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff 
>> 0000000000000000
>> [  908.339420] page dumped because: nonzero _refcount
>> [  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded 
>> Tainted: G    B  W        5.0.0+ #1
>> [  908.339438] Call trace:
>> [  908.339439]  dump_backtrace+0x0/0x188
>> [  908.339441]  show_stack+0x24/0x30
>> [  908.339442]  dump_stack+0xa8/0xcc
>> [  908.339443]  bad_page+0xf0/0x150
>> [  908.339445]  free_pages_check_bad+0x84/0xa0
>> [  908.339446]  free_pcppages_bulk+0x4b8/0x750
>> [  908.339448]  free_unref_page_commit+0x13c/0x198
>> [  908.339449]  free_unref_page+0x84/0xa0
>> [  908.339451]  __free_pages+0x58/0x68
>> [  908.339452]  zap_huge_pmd+0x290/0x2d8
>> [  908.339454]  unmap_page_range+0x2b4/0x470
>> [  908.339455]  unmap_single_vma+0x94/0xe8
>> [  908.339457]  unmap_vmas+0x8c/0x108
>> [  908.339458]  exit_mmap+0xd4/0x178
>> [  908.339459]  mmput+0x74/0x180
>> [  908.339460]  do_exit+0x2b4/0x5b0
>> [  908.339462]  do_group_exit+0x3c/0xe0
>> [  908.339463]  __arm64_sys_exit_group+0x24/0x28
>> [  908.339465]  el0_svc_common+0xa0/0x180
>> [  908.339466]  el0_svc_handler+0x38/0x78
>> [  908.339467]  el0_svc+0x8/0xc
> 
> That's bad, we seem to be making up to 4 unbalanced put_page().
> 
>>>> ---
>>>>    virt/kvm/arm/mmu.c | 51 
>>>> +++++++++++++++++++++++++++++++++++----------------
>>>>    1 file changed, 35 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>> index 66e0fbb5..04b0f9b 100644
>>>> --- a/virt/kvm/arm/mmu.c
>>>> +++ b/virt/kvm/arm/mmu.c
>>>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm 
>>>> *kvm, struct kvm_mmu_memory_cache
>>>>             * Skip updating the page table if the entry is
>>>>             * unchanged.
>>>>             */
>>>> -        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>>> +        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>>>                return 0;
>>>> -
>>>> +        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>>>            /*
>>>> -         * Mapping in huge pages should only happen through a
>>>> -         * fault.  If a page is merged into a transparent huge
>>>> -         * page, the individual subpages of that huge page
>>>> -         * should be unmapped through MMU notifiers before we
>>>> -         * get here.
>>>> -         *
>>>> -         * Merging of CompoundPages is not supported; they
>>>> -         * should become splitting first, unmapped, merged,
>>>> -         * and mapped back in on-demand.
>>>> +         * If we have PTE level mapping for this block,
>>>> +         * we must unmap it to avoid inconsistent TLB
>>>> +         * state. We could end up in this situation if
>>>> +         * the memory slot was marked for dirty logging
>>>> +         * and was reverted, leaving PTE level mappings
>>>> +         * for the pages accessed during the period.
>>>> +         * Normal THP split/merge follows mmu_notifier
>>>> +         * callbacks and do get handled accordingly.
>>>>             */
>>>> -        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>>> +            unmap_stage2_range(kvm, (addr & S2_PMD_MASK), 
>>>> S2_PMD_SIZE);
>>
>> It seems that kvm decreases the _refcount of the page twice in 
>> transparent_hugepage_adjust()
>> and unmap_stage2_range().
> 
> But I thought we should be doing that on the head_page already, as this 
> is THP.
> I will take a look and get back to you on this. Btw, is it possible for you
> to turn on CONFIG_DEBUG_VM and re-run with the above patch ?

And for detailed debugging info:

I've turned on CONFIG_DEBUG_VM and re-run with your patch -- Run a guest
with stage2 PUD hugepage, enable then disable the dirty logging, and
then shutdown this guest. The result is: Host hit a kernel BUG with the
below log (when shutdown-ing guest):


[  486.997640] kernel BUG at ./include/linux/mm.h:547!
[  487.005524] Internal error: Oops - BUG: 0 [#1] SMP
[  487.013455] Modules linked in: ...
[  487.104072] CPU: 14 PID: 60747 Comm: qemu-kvm Kdump: loaded Tainted: 
G        W         5.0.0+ #2
[  487.117150] ...
[  487.135433] pstate: 40400009 (nZcv daif +PAN -UAO)
[  487.144849] pc : unmap_stage2_puds+0x480/0x6e0
[  487.153756] lr : unmap_stage2_puds+0x480/0x6e0
[  487.162507] sp : ffff00002c72bb10
[  487.179630] x27: 0000000041a00000 x26: ffff8027bbb56060
[  487.183465] openvswitch: netlink: Tunnel attr 5 has unexpected len 1 
expected 0
[  487.189184] x25: ffff802769cbe008 x24: ffff7e0000000000
[  487.189185] x23: ffff802769cbe008 x22: ffff00004b0af000
[  487.189186] x21: ffff80279da06060 x20: 00400027332007fd
[  487.189188] x19: 0000000080000000 x18: 0000000000000010
[  487.189189] x17: 0000000000000000 x16: 0000000000000000
[  487.189190] x15: ffff00001182d708 x14: 3030303030303030
[  487.189191] x13: 3030303030302066 x12: ffff000011857000
[  487.189192] x11: 0000000000000000 x10: ffff000011a48000
[  487.189193] x9 : 0000000000000000 x8 : 0000000000000003
[  487.189194] x7 : 000000000000095b x6 : 0000000212557560
[  487.189196] x3 : ffff802fc0b08260 x2 : b2513adc3568f800
[  487.189197] x1 : 0000000000000000 x0 : 000000000000003e
[  487.189200] Process qemu-kvm (pid: 60747, stack limit = 
0x000000004342b298)
[  487.189201] Call trace:
[  487.189203]  unmap_stage2_puds+0x480/0x6e0
[  487.189205]  unmap_stage2_range+0xa4/0x190
[  487.189208]  kvm_free_stage2_pgd+0x64/0x100
[  487.363897]  kvm_arch_flush_shadow_all+0x20/0x30
[  487.372095]  kvm_mmu_notifier_release+0x3c/0x80
[  487.380092]  __mmu_notifier_release+0x50/0x100
[  487.387914]  exit_mmap+0x170/0x178
[  487.394567]  mmput+0x70/0x180
[  487.400653]  do_exit+0x2b4/0x5c8
[  487.406849]  do_group_exit+0x3c/0xe0
--[ end trace 55c414a329c80b63 ]---
[  487.454174] Kernel panic - not syncing: Fatal exception
[  487.461967] SMP: stopping secondary CPUs
[  487.468473] Kernel Offset: disabled
[  487.474520] CPU features: 0x002,22208a38
[  487.481008] Memory Limit: none
[  487.489095] Starting crashdump kernel...
[  487.495457] Bye!


> 
> Kind regards
> Suzuki
> 
> 
> 
> .


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-17 13:34                 ` Zenghui Yu
@ 2019-03-18 17:34                   ` Suzuki K Poulose
  2019-03-19  9:05                     ` Zenghui Yu
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-18 17:34 UTC (permalink / raw)
  To: Zenghui Yu
  Cc: zhengxiang9, marc.zyngier, christoffer.dall, catalin.marinas,
	will.deacon, james.morse, linux-arm-kernel, kvmarm, linux-kernel,
	wanghaibin.wang, lious.lilei, lishuo1, suzuki.poulose

Hi !
On Sun, Mar 17, 2019 at 09:34:11PM +0800, Zenghui Yu wrote:
> Hi Suzuki,
> 
> On 2019/3/15 22:56, Suzuki K Poulose wrote:
> >Hi Zhengui,
> 
> s/Zhengui/Zheng/
> 
> (I think you must wanted to say "Hi" to Zheng :-) )
> 

Sorry about that.

> 
> I have looked into your patch and the kernel log, and I believe that
> your patch had already addressed this issue. But I think we can do it
> a little better - two more points need to be handled with caution.
> 
> Take PMD hugepage (PMD_SIZE == 2M) for example:
>

...

> >That's bad, we seem to be making up to 4 unbalanced put_page().
> >
> >>>>---
> >>>>   virt/kvm/arm/mmu.c | 51
> >>>>+++++++++++++++++++++++++++++++++++----------------
> >>>>   1 file changed, 35 insertions(+), 16 deletions(-)
> >>>>
> >>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>index 66e0fbb5..04b0f9b 100644
> >>>>--- a/virt/kvm/arm/mmu.c
> >>>>+++ b/virt/kvm/arm/mmu.c
> >>>>@@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm
> >>>>*kvm, struct kvm_mmu_memory_cache
> >>>>            * Skip updating the page table if the entry is
> >>>>            * unchanged.
> >>>>            */
> >>>>-        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >>>>+        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
> >>>>               return 0;
> >>>>-
> >>>>+        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
> >>>>           /*
> >>>>-         * Mapping in huge pages should only happen through a
> >>>>-         * fault.  If a page is merged into a transparent huge
> >>>>-         * page, the individual subpages of that huge page
> >>>>-         * should be unmapped through MMU notifiers before we
> >>>>-         * get here.
> >>>>-         *
> >>>>-         * Merging of CompoundPages is not supported; they
> >>>>-         * should become splitting first, unmapped, merged,
> >>>>-         * and mapped back in on-demand.
> >>>>+         * If we have PTE level mapping for this block,
> >>>>+         * we must unmap it to avoid inconsistent TLB
> >>>>+         * state. We could end up in this situation if
> >>>>+         * the memory slot was marked for dirty logging
> >>>>+         * and was reverted, leaving PTE level mappings
> >>>>+         * for the pages accessed during the period.
> >>>>+         * Normal THP split/merge follows mmu_notifier
> >>>>+         * callbacks and do get handled accordingly.
> >>>>            */
> >>>>-        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> >>>>+            unmap_stage2_range(kvm, (addr & S2_PMD_MASK),
> >>>>S2_PMD_SIZE);
> 
> First, using unmap_stage2_range() here is not quite appropriate. Suppose
> we've only accessed one 2M page in HPA [x, x+1]Gib range, with other
> pages unaccessed.  What will happen if unmap_stage2_range(this_2M_page)?
> We'll unexpectedly reach clear_stage2_pud_entry(), and things are going
> to get really bad.  So we'd better use unmap_stage2_ptes() here since we
> only want to unmap a 2M range.

Yes, you're right. If this PMD entry is the only entry in the parent PUD table,
then the PMD table may get freed and we may end up installing the new entry in
a table page that is no longer plugged into the page tables.

> 
> 
> Second, consider below function stack:
> 
>   unmap_stage2_ptes()
>     clear_stage2_pmd_entry()
>       put_page(virt_to_page(pmd))
> 
> It seems that we have one "redundant" put_page() here (hence the bad
> kernel log above), but actually we do not.  After stage2_set_pmd_huge(),
> the PMD table entry will point to a 2M block (it originally pointed
> to a PTE table), so the _refcount of this PMD-level table page should
> _not_ change across unmap_stage2_ptes().  So what we really should do
> is add a get_page() after unmapping to keep the _refcount balanced!

Yes we need an additional refcount on the new huge pmd table, if we are
tearing down the PTE level table.

> 
> 
> thoughts ? A simple patch below (based on yours) for details.
> 
> 
> thanks,
> 
> zenghui
> 
> 
> >>
> >>It seems that kvm decreases the _refcount of the page twice in
> >>transparent_hugepage_adjust()
> >>and unmap_stage2_range().
> >
> >But I thought we should be doing that on the head_page already, as this is
> >THP.
> >I will take a look and get back to you on this. Btw, is it possible for you
> >to turn on CONFIG_DEBUG_VM and re-run with the above patch ?
> >
> >Kind regards
> >Suzuki
> >
> 
> ---8<---
> 
> test: kvm: arm: Maybe two more fixes
> 
> Applied based on Suzuki's patch.
> 
> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
> ---
>  virt/kvm/arm/mmu.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 05765df..ccd5d5d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct
> kvm_mmu_memory_cache
>  		 * Normal THP split/merge follows mmu_notifier
>  		 * callbacks and do get handled accordingly.
>  		 */
> -			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
> +			addr &= S2_PMD_MASK;
> +			unmap_stage2_ptes(kvm, pmd, addr, addr + S2_PMD_SIZE);
> +			get_page(virt_to_page(pmd));
>  		} else {
> 
>  			/*
> @@ -1138,7 +1140,9 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct
> kvm_mmu_memory_cache *cac
>  	if (stage2_pud_present(kvm, old_pud)) {
>  		/* If we have PTE level mapping, unmap the entire range */
>  		if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
> -			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
> +			addr &= S2_PUD_MASK;
> +			unmap_stage2_pmds(kvm, pudp, addr, addr + S2_PUD_SIZE);
> +			get_page(virt_to_page(pudp));
>  		} else {
>  			stage2_pud_clear(kvm, pudp);
>  			kvm_tlb_flush_vmid_ipa(kvm, addr);

This makes it a bit tricky to follow the code. The other option is to
do something like :


---8>---

kvm: arm: Fix handling of stage2 huge mappings

We rely on the mmu_notifier call backs to handle the split/merging
of huge pages and thus we are guaranteed that while creating a
block mapping, the entire block is unmapped at stage2. However,
we miss a case where the block mapping is split for dirty logging
case and then could later be made block mapping, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
> the pages in the block, but also leaks the table pages for
PMD level.

Handle these corner cases for the huge mappings at stage2 by
unmapping the PTE level mapping. This could potentially release
the upper level table. So we need to restart the table walk
once we unmap the range.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 virt/kvm/arm/mmu.c | 57 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index fce0983..a38a3f1 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1060,25 +1060,41 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 {
 	pmd_t *pmd, old_pmd;
 
+retry:
 	pmd = stage2_get_pmd(kvm, cache, addr);
 	VM_BUG_ON(!pmd);
 
 	old_pmd = *pmd;
+	/*
+	 * Multiple vcpus faulting on the same PMD entry, can
+	 * lead to them sequentially updating the PMD with the
+	 * same value. Following the break-before-make
+	 * (pmd_clear() followed by tlb_flush()) process can
+	 * hinder forward progress due to refaults generated
+	 * on missing translations.
+	 *
+	 * Skip updating the page table if the entry is
+	 * unchanged.
+	 */
+	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+		return 0;
+
 	if (pmd_present(old_pmd)) {
 		/*
-		 * Multiple vcpus faulting on the same PMD entry, can
-		 * lead to them sequentially updating the PMD with the
-		 * same value. Following the break-before-make
-		 * (pmd_clear() followed by tlb_flush()) process can
-		 * hinder forward progress due to refaults generated
-		 * on missing translations.
-		 *
-		 * Skip updating the page table if the entry is
-		 * unchanged.
+		 * If we already have PTE level mapping for this block,
+		 * we must unmap it to avoid inconsistent TLB
+		 * state. We could end up in this situation if
+		 * the memory slot was marked for dirty logging
+		 * and was reverted, leaving PTE level mappings
+		 * for the pages accessed during the period.
+		 * Normal THP split/merge follows mmu_notifier
+		 * callbacks and do get handled accordingly.
+		 * Unmap the PTE level mapping and retry.
 		 */
-		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
-			return 0;
-
+		if (!pmd_thp_or_huge(old_pmd)) {
+			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
+			goto retry;
+		}
 		/*
 		 * Mapping in huge pages should only happen through a
 		 * fault.  If a page is merged into a transparent huge
@@ -1090,8 +1106,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 		 * should become splitting first, unmapped, merged,
 		 * and mapped back in on-demand.
 		 */
-		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
-
+		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
 		pmd_clear(pmd);
 		kvm_tlb_flush_vmid_ipa(kvm, addr);
 	} else {
@@ -1107,6 +1122,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
 {
 	pud_t *pudp, old_pud;
 
+retry:
 	pudp = stage2_get_pud(kvm, cache, addr);
 	VM_BUG_ON(!pudp);
 
@@ -1122,8 +1138,17 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
 		return 0;
 
 	if (stage2_pud_present(kvm, old_pud)) {
-		stage2_pud_clear(kvm, pudp);
-		kvm_tlb_flush_vmid_ipa(kvm, addr);
+		/*
+		 * If we already have PTE level mapping, unmap the entire
+		 * range and retry.
+		 */
+		if (!stage2_pud_huge(kvm, old_pud)) {
+			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
+			goto retry;
+		} else {
+			stage2_pud_clear(kvm, pudp);
+			kvm_tlb_flush_vmid_ipa(kvm, addr);
+		}
 	} else {
 		get_page(virt_to_page(pudp));
 	}
-- 
2.7.4


> -- 
> 1.8.3.1
> 
> 
> 
> 
> 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC] Question about TLB flush while set Stage-2 huge pages
  2019-03-18 17:34                   ` Suzuki K Poulose
@ 2019-03-19  9:05                     ` Zenghui Yu
  2019-03-19 14:11                       ` [PATCH] kvm: arm: Fix handling of stage2 huge mappings Suzuki K Poulose
  0 siblings, 1 reply; 22+ messages in thread
From: Zenghui Yu @ 2019-03-19  9:05 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: zhengxiang9, marc.zyngier, christoffer.dall, catalin.marinas,
	will.deacon, james.morse, linux-arm-kernel, kvmarm, linux-kernel,
	wanghaibin.wang, lious.lilei, lishuo1

Hi Suzuki,

On 2019/3/19 1:34, Suzuki K Poulose wrote:
> Hi !
> On Sun, Mar 17, 2019 at 09:34:11PM +0800, Zenghui Yu wrote:
>> Hi Suzuki,
>>
>> ---8<---
>>
>> test: kvm: arm: Maybe two more fixes
>>
>> Applied based on Suzuki's patch.
>>
>> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
>> ---
>>   virt/kvm/arm/mmu.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 05765df..ccd5d5d 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct
>> kvm_mmu_memory_cache
>>   		 * Normal THP split/merge follows mmu_notifier
>>   		 * callbacks and do get handled accordingly.
>>   		 */
>> -			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
>> +			addr &= S2_PMD_MASK;
>> +			unmap_stage2_ptes(kvm, pmd, addr, addr + S2_PMD_SIZE);
>> +			get_page(virt_to_page(pmd));
>>   		} else {
>>
>>   			/*
>> @@ -1138,7 +1140,9 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct
>> kvm_mmu_memory_cache *cac
>>   	if (stage2_pud_present(kvm, old_pud)) {
>>   		/* If we have PTE level mapping, unmap the entire range */
>>   		if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
>> -			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
>> +			addr &= S2_PUD_MASK;
>> +			unmap_stage2_pmds(kvm, pudp, addr, addr + S2_PUD_SIZE);
>> +			get_page(virt_to_page(pudp));
>>   		} else {
>>   			stage2_pud_clear(kvm, pudp);
>>   			kvm_tlb_flush_vmid_ipa(kvm, addr);
> 
> This makes it a bit tricky to follow the code. The other option is to
> do something like :

Yes.

> 
> 
> ---8>---
> 
> kvm: arm: Fix handling of stage2 huge mappings
> 
> We rely on the mmu_notifier call backs to handle the split/merging
> of huge pages and thus we are guaranteed that while creating a
> block mapping, the entire block is unmapped at stage2. However,
> we miss a case where the block mapping is split for dirty logging
> case and then could later be made block mapping, if we cancel the
> dirty logging. This not only creates inconsistent TLB entries for
> the pages in the block, but also leaks the table pages for
> PMD level.
> 
> Handle these corner cases for the huge mappings at stage2 by
> unmapping the PTE level mapping. This could potentially release
> the upper level table. So we need to restart the table walk
> once we unmap the range.
> 
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>   virt/kvm/arm/mmu.c | 57 +++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index fce0983..a38a3f1 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1060,25 +1060,41 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>   {
>   	pmd_t *pmd, old_pmd;
>   
> +retry:
>   	pmd = stage2_get_pmd(kvm, cache, addr);
>   	VM_BUG_ON(!pmd);
>   
>   	old_pmd = *pmd;
> +	/*
> +	 * Multiple vcpus faulting on the same PMD entry, can
> +	 * lead to them sequentially updating the PMD with the
> +	 * same value. Following the break-before-make
> +	 * (pmd_clear() followed by tlb_flush()) process can
> +	 * hinder forward progress due to refaults generated
> +	 * on missing translations.
> +	 *
> +	 * Skip updating the page table if the entry is
> +	 * unchanged.
> +	 */
> +	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		return 0;
> +
>   	if (pmd_present(old_pmd)) {
>   		/*
> -		 * Multiple vcpus faulting on the same PMD entry, can
> -		 * lead to them sequentially updating the PMD with the
> -		 * same value. Following the break-before-make
> -		 * (pmd_clear() followed by tlb_flush()) process can
> -		 * hinder forward progress due to refaults generated
> -		 * on missing translations.
> -		 *
> -		 * Skip updating the page table if the entry is
> -		 * unchanged.
> +		 * If we already have PTE level mapping for this block,
> +		 * we must unmap it to avoid inconsistent TLB
> +		 * state. We could end up in this situation if
> +		 * the memory slot was marked for dirty logging
> +		 * and was reverted, leaving PTE level mappings
> +		 * for the pages accessed during the period.
> +		 * Normal THP split/merge follows mmu_notifier
> +		 * callbacks and do get handled accordingly.
> +		 * Unmap the PTE level mapping and retry.
>   		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> -			return 0;
> -
> +		if (!pmd_thp_or_huge(old_pmd)) {
> +			unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
Nit: we can get rid of the parentheses around "addr & S2_PMD_MASK" to
make it look the same as the PUD level (but it is not necessary).
> +			goto retry;
> +		}
>   		/*
>   		 * Mapping in huge pages should only happen through a
>   		 * fault.  If a page is merged into a transparent huge
> @@ -1090,8 +1106,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>   		 * should become splitting first, unmapped, merged,
>   		 * and mapped back in on-demand.
>   		 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> -
> +		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>   		pmd_clear(pmd);
>   		kvm_tlb_flush_vmid_ipa(kvm, addr);
>   	} else {
> @@ -1107,6 +1122,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>   {
>   	pud_t *pudp, old_pud;
>   
> +retry:
>   	pudp = stage2_get_pud(kvm, cache, addr);
>   	VM_BUG_ON(!pudp);
>   
> @@ -1122,8 +1138,17 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>   		return 0;
>   
>   	if (stage2_pud_present(kvm, old_pud)) {
> -		stage2_pud_clear(kvm, pudp);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		/*
> +		 * If we already have PTE level mapping, unmap the entire
> +		 * range and retry.
> +		 */
> +		if (!stage2_pud_huge(kvm, old_pud)) {
> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
> +			goto retry;
> +		} else {
> +			stage2_pud_clear(kvm, pudp);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}
>   	} else {
>   		get_page(virt_to_page(pudp));
>   	}
> 

It looks much better, and works fine now!


thanks,

zenghui



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-19  9:05                     ` Zenghui Yu
@ 2019-03-19 14:11                       ` Suzuki K Poulose
  2019-03-19 16:02                         ` Zenghui Yu
  2019-03-20  8:15                         ` Marc Zyngier
  0 siblings, 2 replies; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-19 14:11 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, kvm, kvmarm, will.deacon, catalin.marinas,
	james.morse, julien.thierry, wanghaibin.wang, lious.lilei,
	lishuo1, zhengxiang9, yuzenghui, Suzuki K Poulose, Marc Zyngier,
	Christoffer Dall

We rely on the mmu_notifier call backs to handle the split/merge
of huge pages and thus we are guaranteed that, while creating a
block mapping, either the entire block is unmapped at stage2 or it
is missing permission.

However, we miss a case where the block mapping is split for dirty
logging case and then could later be made block mapping, if we cancel the
dirty logging. This not only creates inconsistent TLB entries for
the pages in the block, but also leaks the table pages for
PMD level.

Handle this corner case for the huge mappings at stage2 by
unmapping the non-huge mapping for the block. This could potentially
release the upper level table. So we need to restart the table walk
once we unmap the range.

Fixes: ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
Cc: Zheng Xiang <zhengxiang9@huawei.com>
Cc: Zhengui Yu <yuzenghui@huawei.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 virt/kvm/arm/mmu.c | 63 ++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index fce0983..6ad6f19d 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 {
 	pmd_t *pmd, old_pmd;
 
+retry:
 	pmd = stage2_get_pmd(kvm, cache, addr);
 	VM_BUG_ON(!pmd);
 
 	old_pmd = *pmd;
+	/*
+	 * Multiple vcpus faulting on the same PMD entry, can
+	 * lead to them sequentially updating the PMD with the
+	 * same value. Following the break-before-make
+	 * (pmd_clear() followed by tlb_flush()) process can
+	 * hinder forward progress due to refaults generated
+	 * on missing translations.
+	 *
+	 * Skip updating the page table if the entry is
+	 * unchanged.
+	 */
+	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
+		return 0;
+
 	if (pmd_present(old_pmd)) {
 		/*
-		 * Multiple vcpus faulting on the same PMD entry, can
-		 * lead to them sequentially updating the PMD with the
-		 * same value. Following the break-before-make
-		 * (pmd_clear() followed by tlb_flush()) process can
-		 * hinder forward progress due to refaults generated
-		 * on missing translations.
+		 * If we already have PTE level mapping for this block,
+		 * we must unmap it to avoid inconsistent TLB state and
+		 * leaking the table page. We could end up in this situation
+		 * if the memory slot was marked for dirty logging and was
+		 * reverted, leaving PTE level mappings for the pages accessed
+		 * during the period. So, unmap the PTE level mapping for this
+		 * block and retry, as we could have released the upper level
+		 * table in the process.
 		 *
-		 * Skip updating the page table if the entry is
-		 * unchanged.
+		 * Normal THP split/merge follows mmu_notifier callbacks and do
+		 * get handled accordingly.
 		 */
-		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
-			return 0;
-
+		if (!pmd_thp_or_huge(old_pmd)) {
+			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
+			goto retry;
+		}
 		/*
 		 * Mapping in huge pages should only happen through a
 		 * fault.  If a page is merged into a transparent huge
@@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
 		 * should become splitting first, unmapped, merged,
 		 * and mapped back in on-demand.
 		 */
-		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
-
+		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
 		pmd_clear(pmd);
 		kvm_tlb_flush_vmid_ipa(kvm, addr);
 	} else {
@@ -1107,6 +1124,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
 {
 	pud_t *pudp, old_pud;
 
+retry:
 	pudp = stage2_get_pud(kvm, cache, addr);
 	VM_BUG_ON(!pudp);
 
@@ -1114,16 +1132,25 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
 
 	/*
 	 * A large number of vcpus faulting on the same stage 2 entry,
-	 * can lead to a refault due to the
-	 * stage2_pud_clear()/tlb_flush(). Skip updating the page
-	 * tables if there is no change.
+	 * can lead to a refault due to the stage2_pud_clear()/tlb_flush().
+	 * Skip updating the page tables if there is no change.
 	 */
 	if (pud_val(old_pud) == pud_val(*new_pudp))
 		return 0;
 
 	if (stage2_pud_present(kvm, old_pud)) {
-		stage2_pud_clear(kvm, pudp);
-		kvm_tlb_flush_vmid_ipa(kvm, addr);
+		/*
+		 * If we already have table level mapping for this block, unmap
+		 * the range for this block and retry.
+		 */
+		if (!stage2_pud_huge(kvm, old_pud)) {
+			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
+			goto retry;
+		} else {
+			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
+			stage2_pud_clear(kvm, pudp);
+			kvm_tlb_flush_vmid_ipa(kvm, addr);
+		}
 	} else {
 		get_page(virt_to_page(pudp));
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-19 14:11                       ` [PATCH] kvm: arm: Fix handling of stage2 huge mappings Suzuki K Poulose
@ 2019-03-19 16:02                         ` Zenghui Yu
  2019-03-20  8:15                         ` Marc Zyngier
  1 sibling, 0 replies; 22+ messages in thread
From: Zenghui Yu @ 2019-03-19 16:02 UTC (permalink / raw)
  To: Suzuki K Poulose, linux-arm-kernel
  Cc: linux-kernel, kvm, kvmarm, will.deacon, catalin.marinas,
	james.morse, julien.thierry, wanghaibin.wang, lious.lilei,
	lishuo1, zhengxiang9, Marc Zyngier, Christoffer Dall

Hi Suzuki,

On 2019/3/19 22:11, Suzuki K Poulose wrote:
> We rely on the mmu_notifier call backs to handle the split/merge
> of huge pages and thus we are guaranteed that, while creating a
> block mapping, either the entire block is unmapped at stage2 or it
> is missing permission.
> 
> However, we miss a case where the block mapping is split for dirty
> logging case and then could later be made block mapping, if we cancel the
> dirty logging. This not only creates inconsistent TLB entries for
> the pages in the the block, but also leakes the table pages for
> PMD level.
> 
> Handle this corner case for the huge mappings at stage2 by
> unmapping the non-huge mapping for the block. This could potentially
> release the upper level table. So we need to restart the table walk
> once we unmap the range.
> 
> Fixes: ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
> Cc: Zheng Xiang <zhengxiang9@huawei.com>
> Cc: Zhengui Yu <yuzenghui@huawei.com>

Sorry to bother you, but this should be "Zenghui Yu", thanks!


zenghui

> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <christoffer.dall@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>   virt/kvm/arm/mmu.c | 63 ++++++++++++++++++++++++++++++++++++++----------------
>   1 file changed, 45 insertions(+), 18 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index fce0983..6ad6f19d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>   {
>   	pmd_t *pmd, old_pmd;
>   
> +retry:
>   	pmd = stage2_get_pmd(kvm, cache, addr);
>   	VM_BUG_ON(!pmd);
>   
>   	old_pmd = *pmd;
> +	/*
> +	 * Multiple vcpus faulting on the same PMD entry, can
> +	 * lead to them sequentially updating the PMD with the
> +	 * same value. Following the break-before-make
> +	 * (pmd_clear() followed by tlb_flush()) process can
> +	 * hinder forward progress due to refaults generated
> +	 * on missing translations.
> +	 *
> +	 * Skip updating the page table if the entry is
> +	 * unchanged.
> +	 */
> +	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		return 0;
> +
>   	if (pmd_present(old_pmd)) {
>   		/*
> -		 * Multiple vcpus faulting on the same PMD entry, can
> -		 * lead to them sequentially updating the PMD with the
> -		 * same value. Following the break-before-make
> -		 * (pmd_clear() followed by tlb_flush()) process can
> -		 * hinder forward progress due to refaults generated
> -		 * on missing translations.
> +		 * If we already have PTE level mapping for this block,
> +		 * we must unmap it to avoid inconsistent TLB state and
> +		 * leaking the table page. We could end up in this situation
> +		 * if the memory slot was marked for dirty logging and was
> +		 * reverted, leaving PTE level mappings for the pages accessed
> +		 * during the period. So, unmap the PTE level mapping for this
> +		 * block and retry, as we could have released the upper level
> +		 * table in the process.
>   		 *
> -		 * Skip updating the page table if the entry is
> -		 * unchanged.
> +		 * Normal THP split/merge follows mmu_notifier callbacks and do
> +		 * get handled accordingly.
>   		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> -			return 0;
> -
> +		if (!pmd_thp_or_huge(old_pmd)) {
> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> +			goto retry;
> +		}
>   		/*
>   		 * Mapping in huge pages should only happen through a
>   		 * fault.  If a page is merged into a transparent huge
> @@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>   		 * should become splitting first, unmapped, merged,
>   		 * and mapped back in on-demand.
>   		 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> -
> +		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>   		pmd_clear(pmd);
>   		kvm_tlb_flush_vmid_ipa(kvm, addr);
>   	} else {
> @@ -1107,6 +1124,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>   {
>   	pud_t *pudp, old_pud;
>   
> +retry:
>   	pudp = stage2_get_pud(kvm, cache, addr);
>   	VM_BUG_ON(!pudp);
>   
> @@ -1114,16 +1132,25 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>   
>   	/*
>   	 * A large number of vcpus faulting on the same stage 2 entry,
> -	 * can lead to a refault due to the
> -	 * stage2_pud_clear()/tlb_flush(). Skip updating the page
> -	 * tables if there is no change.
> +	 * can lead to a refault due to the stage2_pud_clear()/tlb_flush().
> +	 * Skip updating the page tables if there is no change.
>   	 */
>   	if (pud_val(old_pud) == pud_val(*new_pudp))
>   		return 0;
>   
>   	if (stage2_pud_present(kvm, old_pud)) {
> -		stage2_pud_clear(kvm, pudp);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		/*
> +		 * If we already have table level mapping for this block, unmap
> +		 * the range for this block and retry.
> +		 */
> +		if (!stage2_pud_huge(kvm, old_pud)) {
> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
> +			goto retry;
> +		} else {
> +			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
> +			stage2_pud_clear(kvm, pudp);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}
>   	} else {
>   		get_page(virt_to_page(pudp));
>   	}
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-19 14:11                       ` [PATCH] kvm: arm: Fix handling of stage2 huge mappings Suzuki K Poulose
  2019-03-19 16:02                         ` Zenghui Yu
@ 2019-03-20  8:15                         ` Marc Zyngier
  2019-03-20  9:44                           ` Suzuki K Poulose
  1 sibling, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2019-03-20  8:15 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, Christoffer Dall

Hi Suzuki,

On Tue, 19 Mar 2019 14:11:08 +0000,
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> 
> We rely on the mmu_notifier call backs to handle the split/merge
> of huge pages and thus we are guaranteed that, while creating a
> block mapping, either the entire block is unmapped at stage2 or it
> is missing permission.
> 
> However, we miss a case where the block mapping is split for dirty
> logging case and then could later be made block mapping, if we cancel the
> dirty logging. This not only creates inconsistent TLB entries for
> the pages in the block, but also leaks the table pages for
> PMD level.
> 
> Handle this corner case for the huge mappings at stage2 by
> unmapping the non-huge mapping for the block. This could potentially
> release the upper level table. So we need to restart the table walk
> once we unmap the range.
> 
> Fixes: ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
> Cc: Zheng Xiang <zhengxiang9@huawei.com>
> Cc: Zhengui Yu <yuzenghui@huawei.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <christoffer.dall@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 63 ++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 18 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index fce0983..6ad6f19d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  {
>  	pmd_t *pmd, old_pmd;
>  
> +retry:
>  	pmd = stage2_get_pmd(kvm, cache, addr);
>  	VM_BUG_ON(!pmd);
>  
>  	old_pmd = *pmd;
> +	/*
> +	 * Multiple vcpus faulting on the same PMD entry, can
> +	 * lead to them sequentially updating the PMD with the
> +	 * same value. Following the break-before-make
> +	 * (pmd_clear() followed by tlb_flush()) process can
> +	 * hinder forward progress due to refaults generated
> +	 * on missing translations.
> +	 *
> +	 * Skip updating the page table if the entry is
> +	 * unchanged.
> +	 */
> +	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		return 0;
> +
>  	if (pmd_present(old_pmd)) {
>  		/*
> -		 * Multiple vcpus faulting on the same PMD entry, can
> -		 * lead to them sequentially updating the PMD with the
> -		 * same value. Following the break-before-make
> -		 * (pmd_clear() followed by tlb_flush()) process can
> -		 * hinder forward progress due to refaults generated
> -		 * on missing translations.
> +		 * If we already have PTE level mapping for this block,
> +		 * we must unmap it to avoid inconsistent TLB state and
> +		 * leaking the table page. We could end up in this situation
> +		 * if the memory slot was marked for dirty logging and was
> +		 * reverted, leaving PTE level mappings for the pages accessed
> +		 * during the period. So, unmap the PTE level mapping for this
> +		 * block and retry, as we could have released the upper level
> +		 * table in the process.
>  		 *
> -		 * Skip updating the page table if the entry is
> -		 * unchanged.
> +		 * Normal THP split/merge follows mmu_notifier callbacks and do
> +		 * get handled accordingly.
>  		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> -			return 0;
> -
> +		if (!pmd_thp_or_huge(old_pmd)) {
> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> +			goto retry;

This looks slightly dodgy. Doing this retry results in another call to
stage2_get_pmd(), which may or may not result in allocating a PUD. I
think this is safe as if we managed to get here, it means the whole
hierarchy was already present and nothing was allocated in the first
round.

Somehow, I would feel more comfortable with just not even trying.
Unmap, don't fix the fault, let the vcpu come again for additional
punishment. But this is probably more invasive, as none of the
stage2_set_p*() return values is ever evaluated. Oh well.
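
For the record, a completely untested sketch of what I mean, inside
stage2_set_pmd_huge() (the helpers are the existing ones; the bare
'return 0' is just an assumption about how the bail-out would look):

	if (pmd_present(old_pmd) && !pmd_thp_or_huge(old_pmd)) {
		/* Tear down the stale PTE-level table for this block... */
		unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
		/*
		 * ...and don't install anything: the vcpu takes the
		 * fault again and the block mapping is rebuilt from a
		 * clean slate.
		 */
		return 0;
	}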

> +		}
>  		/*
>  		 * Mapping in huge pages should only happen through a
>  		 * fault.  If a page is merged into a transparent huge
> @@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  		 * should become splitting first, unmapped, merged,
>  		 * and mapped back in on-demand.
>  		 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> -
> +		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>  		pmd_clear(pmd);
>  		kvm_tlb_flush_vmid_ipa(kvm, addr);
>  	} else {
> @@ -1107,6 +1124,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>  {
>  	pud_t *pudp, old_pud;
>  
> +retry:
>  	pudp = stage2_get_pud(kvm, cache, addr);
>  	VM_BUG_ON(!pudp);
>  
> @@ -1114,16 +1132,25 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>  
>  	/*
>  	 * A large number of vcpus faulting on the same stage 2 entry,
> -	 * can lead to a refault due to the
> -	 * stage2_pud_clear()/tlb_flush(). Skip updating the page
> -	 * tables if there is no change.
> +	 * can lead to a refault due to the stage2_pud_clear()/tlb_flush().
> +	 * Skip updating the page tables if there is no change.
>  	 */
>  	if (pud_val(old_pud) == pud_val(*new_pudp))
>  		return 0;
>  
>  	if (stage2_pud_present(kvm, old_pud)) {
> -		stage2_pud_clear(kvm, pudp);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		/*
> +		 * If we already have table level mapping for this block, unmap
> +		 * the range for this block and retry.
> +		 */
> +		if (!stage2_pud_huge(kvm, old_pud)) {
> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);

This broke 32bit. I've added the following hunk to fix it:

diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
index de2089501b8b..b8f21088a744 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -68,6 +68,9 @@ stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
 #define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
 #define stage2_pud_table_empty(kvm, pudp)	false
 
+#define S2_PUD_MASK				PGDIR_MASK
+#define S2_PUD_SIZE				PGDIR_SIZE
+
 static inline bool kvm_stage2_has_pud(struct kvm *kvm)
 {
 	return false;

> +			goto retry;
> +		} else {
> +			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
> +			stage2_pud_clear(kvm, pudp);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}

The 'else' line could go, and would make the code similar to the PMD path.

>  	} else {
>  		get_page(virt_to_page(pudp));
>  	}
> -- 
> 2.7.4
> 

If you're OK with the above nits, I'll squash them into the patch.

Thanks,

	M.

-- 
Jazz is not dead, it just smell funny.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20  8:15                         ` Marc Zyngier
@ 2019-03-20  9:44                           ` Suzuki K Poulose
  2019-03-20 10:11                             ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-20  9:44 UTC (permalink / raw)
  To: marc.zyngier
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

Hi Marc,

On 20/03/2019 08:15, Marc Zyngier wrote:
> Hi Suzuki,
> 
> On Tue, 19 Mar 2019 14:11:08 +0000,
> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>
>> We rely on the mmu_notifier callbacks to handle the split/merge
>> of huge pages and thus we are guaranteed that, while creating a
>> block mapping, either the entire block is unmapped at stage2 or it
>> is missing permission.
>>
>> However, we miss a case where the block mapping is split for the dirty
>> logging case and could later be made a block mapping again, if we cancel
>> the dirty logging. This not only creates inconsistent TLB entries for
>> the pages in the block, but also leaks the table pages at the
>> PMD level.
>>
>> Handle this corner case for the huge mappings at stage2 by
>> unmapping the non-huge mapping for the block. This could potentially
>> release the upper level table. So we need to restart the table walk
>> once we unmap the range.
>>
>> Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
>> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
>> Cc: Zheng Xiang <zhengxiang9@huawei.com>
>> Cc: Zhengui Yu <yuzenghui@huawei.com>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <christoffer.dall@arm.com>
>> Signed-off-by: Suzuki K Poulose 

...

>> +retry:
>>   	pmd = stage2_get_pmd(kvm, cache, addr);
>>   	VM_BUG_ON(!pmd);
>>   

...

>>   	if (pmd_present(old_pmd)) {
>>   		/*
>> -		 * Multiple vcpus faulting on the same PMD entry, can
>> -		 * lead to them sequentially updating the PMD with the
>> -		 * same value. Following the break-before-make
>> -		 * (pmd_clear() followed by tlb_flush()) process can
>> -		 * hinder forward progress due to refaults generated
>> -		 * on missing translations.
>> +		 * If we already have PTE level mapping for this block,
>> +		 * we must unmap it to avoid inconsistent TLB state and
>> +		 * leaking the table page. We could end up in this situation
>> +		 * if the memory slot was marked for dirty logging and was
>> +		 * reverted, leaving PTE level mappings for the pages accessed
>> +		 * during the period. So, unmap the PTE level mapping for this
>> +		 * block and retry, as we could have released the upper level
>> +		 * table in the process.
>>   		 *
>> -		 * Skip updating the page table if the entry is
>> -		 * unchanged.
>> +		 * Normal THP split/merge follows mmu_notifier callbacks and do
>> +		 * get handled accordingly.
>>   		 */
>> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>> -			return 0;
>> -
>> +		if (!pmd_thp_or_huge(old_pmd)) {
>> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
>> +			goto retry;
> 
> This looks slightly dodgy. Doing this retry results in another call to
> stage2_get_pmd(), which may or may not result in allocating a PUD. I
> think this is safe as if we managed to get here, it means the whole
> hierarchy was already present and nothing was allocated in the first
> round.
> 
> Somehow, I would feel more comfortable with just not even trying.
> Unmap, don't fix the fault, let the vcpu come again for additional
> punishment. But this is probably more invasive, as none of the
> stage2_set_p*() return values is ever evaluated. Oh well.
> 

Yes. The other option was to unmap_stage2_ptes() and get the page refcount
on the new pmd. But that would make the code a bit more difficult to
follow.

>>   	if (stage2_pud_present(kvm, old_pud)) {
>> -		stage2_pud_clear(kvm, pudp);
>> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +		/*
>> +		 * If we already have table level mapping for this block, unmap
>> +		 * the range for this block and retry.
>> +		 */
>> +		if (!stage2_pud_huge(kvm, old_pud)) {
>> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
> 
> This broke 32bit. I've added the following hunk to fix it:

Grrr! Sorry about that.

> 
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> index de2089501b8b..b8f21088a744 100644
> --- a/arch/arm/include/asm/stage2_pgtable.h
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -68,6 +68,9 @@ stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>   #define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
>   #define stage2_pud_table_empty(kvm, pudp)	false
>   
> +#define S2_PUD_MASK				PGDIR_MASK
> +#define S2_PUD_SIZE				PGDIR_SIZE
> +

We should really get rid of the S2_P{U/M}D_* definitions, as they are
always the same as the host. The only thing that changes is the PGD size
which varies according to the IPA and the concatenation.
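
(Just to illustrate the point -- a hypothetical cleanup, not part of this
fix -- the stage2 definitions below the PGD could then simply mirror the
host's:)

/* stage2 uses the same PMD/PUD geometry as the host */
#define S2_PMD_SIZE	PMD_SIZE
#define S2_PMD_MASK	PMD_MASK
#define S2_PUD_SIZE	PUD_SIZE
#define S2_PUD_MASK	PUD_MASK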

>   static inline bool kvm_stage2_has_pud(struct kvm *kvm)
>   {
>   	return false;
> 
>> +			goto retry;
>> +		} else {
>> +			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
>> +			stage2_pud_clear(kvm, pudp);
>> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +		}
> 
> The 'else' line could go, and would make the code similar to the PMD path.
> 

Yep. I think the pud_pfn() may not be defined for some configs, if the hugetlbfs
is not selected on arm32. So, we should move them to kvm_pud_pfn() instead.
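
Something like the below is what I have in mind (sketch only; the exact
32bit fallback is a guess):

/* arch/arm64/include/asm/kvm_mmu.h */
#define kvm_pud_pfn(pud)	pud_pfn(pud)

/* arch/arm/include/asm/kvm_mmu.h: we never create huge PUDs at stage2 */
#define kvm_pud_pfn(pud)	({ WARN_ON(1); 0; })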


>>   	} else {
>>   		get_page(virt_to_page(pudp));
>>   	}
>> -- 
>> 2.7.4
>>
> 
> If you're OK with the above nits, I'll squash them into the patch.

With the kvm_pud_pfn() changes, yes. Alternatively, I could resend the updated
patch, fixing the typo in Zenghui's name. Let me know.

Cheers
Suzuki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20  9:44                           ` Suzuki K Poulose
@ 2019-03-20 10:11                             ` Marc Zyngier
  2019-03-20 10:23                               ` Suzuki K Poulose
  0 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2019-03-20 10:11 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

On Wed, 20 Mar 2019 09:44:38 +0000
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:

> Hi Marc,
> 
> On 20/03/2019 08:15, Marc Zyngier wrote:
> > Hi Suzuki,
> > 
> > On Tue, 19 Mar 2019 14:11:08 +0000,
> > Suzuki K Poulose <suzuki.poulose@arm.com> wrote:  
> >>
> >> We rely on the mmu_notifier callbacks to handle the split/merge
> >> of huge pages and thus we are guaranteed that, while creating a
> >> block mapping, either the entire block is unmapped at stage2 or it
> >> is missing permission.
> >>
> >> However, we miss a case where the block mapping is split for the dirty
> >> logging case and could later be made a block mapping again, if we cancel
> >> the dirty logging. This not only creates inconsistent TLB entries for
> >> the pages in the block, but also leaks the table pages at the
> >> PMD level.
> >>
> >> Handle this corner case for the huge mappings at stage2 by
> >> unmapping the non-huge mapping for the block. This could potentially
> >> release the upper level table. So we need to restart the table walk
> >> once we unmap the range.
> >>
> >> Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
> >> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
> >> Cc: Zheng Xiang <zhengxiang9@huawei.com>
> >> Cc: Zhengui Yu <yuzenghui@huawei.com>
> >> Cc: Marc Zyngier <marc.zyngier@arm.com>
> >> Cc: Christoffer Dall <christoffer.dall@arm.com>
> >> Signed-off-by: Suzuki K Poulose ...  
> 
> >> +retry:
> >>   	pmd = stage2_get_pmd(kvm, cache, addr);
> >>   	VM_BUG_ON(!pmd);
> >>   ...  
> 
> >>   	if (pmd_present(old_pmd)) {
> >>   		/*
> >> -		 * Multiple vcpus faulting on the same PMD entry, can
> >> -		 * lead to them sequentially updating the PMD with the
> >> -		 * same value. Following the break-before-make
> >> -		 * (pmd_clear() followed by tlb_flush()) process can
> >> -		 * hinder forward progress due to refaults generated
> >> -		 * on missing translations.
> >> +		 * If we already have PTE level mapping for this block,
> >> +		 * we must unmap it to avoid inconsistent TLB state and
> >> +		 * leaking the table page. We could end up in this situation
> >> +		 * if the memory slot was marked for dirty logging and was
> >> +		 * reverted, leaving PTE level mappings for the pages accessed
> >> +		 * during the period. So, unmap the PTE level mapping for this
> >> +		 * block and retry, as we could have released the upper level
> >> +		 * table in the process.
> >>   		 *
> >> -		 * Skip updating the page table if the entry is
> >> -		 * unchanged.
> >> +		 * Normal THP split/merge follows mmu_notifier callbacks and do
> >> +		 * get handled accordingly.
> >>   		 */
> >> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> >> -			return 0;
> >> -
> >> +		if (!pmd_thp_or_huge(old_pmd)) {
> >> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> >> +			goto retry;  
> > 
> > This looks slightly dodgy. Doing this retry results in another call to
> > stage2_get_pmd(), which may or may not result in allocating a PUD. I
> > think this is safe as if we managed to get here, it means the whole
> > hierarchy was already present and nothing was allocated in the first
> > round.
> > 
> > Somehow, I would feel more comfortable with just not even trying.
> > Unmap, don't fix the fault, let the vcpu come again for additional
> > punishment. But this is probably more invasive, as none of the
> > stage2_set_p*() return values is ever evaluated. Oh well.
> >   
> 
> Yes. The other option was to unmap_stage2_ptes() and get the page refcount
> on the new pmd. But that would make the code a bit more difficult to
> follow.
> 
> >>   	if (stage2_pud_present(kvm, old_pud)) {
> >> -		stage2_pud_clear(kvm, pudp);
> >> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> >> +		/*
> >> +		 * If we already have table level mapping for this block, unmap
> >> +		 * the range for this block and retry.
> >> +		 */
> >> +		if (!stage2_pud_huge(kvm, old_pud)) {
> >> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);  
> > 
> > This broke 32bit. I've added the following hunk to fix it:  
> 
> Grrr! Sorry about that.
> 
> > 
> > diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> > index de2089501b8b..b8f21088a744 100644
> > --- a/arch/arm/include/asm/stage2_pgtable.h
> > +++ b/arch/arm/include/asm/stage2_pgtable.h
> > @@ -68,6 +68,9 @@ stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
> >   #define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
> >   #define stage2_pud_table_empty(kvm, pudp)	false  
> >   
> > +#define S2_PUD_MASK				PGDIR_MASK
> > +#define S2_PUD_SIZE				PGDIR_SIZE
> > +  
> 
> We should really get rid of the S2_P{U/M}D_* definitions, as they are
> always the same as the host. The only thing that changes is the PGD size
> which varies according to the IPA and the concatenation.
> 
> >   static inline bool kvm_stage2_has_pud(struct kvm *kvm)
> >   {
> >   	return false;
> >   
> >> +			goto retry;
> >> +		} else {
> >> +			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
> >> +			stage2_pud_clear(kvm, pudp);
> >> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> >> +		}  
> > 
> > The 'else' line could go, and would make the code similar to the PMD path.
> >   
> 
> Yep. I think the pud_pfn() may not be defined for some configs, if the hugetlbfs
> is not selected on arm32. So, we should move them to kvm_pud_pfn() instead.
> 
> 
> >>   	} else {
> >>   		get_page(virt_to_page(pudp));
> >>   	}  
> >> -- 
> >> 2.7.4
> >>  
> > 
> > If you're OK with the above nits, I'll squash them into the patch.  
> 
> With the kvm_pud_pfn() changes, yes. Alternatively, I could resend the updated
> patch, fixing the typo in Zenghui's name. Let me know.

Sure, feel free to send a fixed version. I'll drop the currently queued
patch.

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20 10:11                             ` Marc Zyngier
@ 2019-03-20 10:23                               ` Suzuki K Poulose
  2019-03-20 10:35                                 ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-20 10:23 UTC (permalink / raw)
  To: marc.zyngier
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

Marc,

On 20/03/2019 10:11, Marc Zyngier wrote:
> On Wed, 20 Mar 2019 09:44:38 +0000
> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> 
>> Hi Marc,
>>
>> On 20/03/2019 08:15, Marc Zyngier wrote:
>>> Hi Suzuki,
>>>
>>> On Tue, 19 Mar 2019 14:11:08 +0000,
>>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>>>
>>>> We rely on the mmu_notifier callbacks to handle the split/merge
>>>> of huge pages and thus we are guaranteed that, while creating a
>>>> block mapping, either the entire block is unmapped at stage2 or it
>>>> is missing permission.
>>>>
>>>> However, we miss a case where the block mapping is split for the dirty
>>>> logging case and could later be made a block mapping again, if we cancel
>>>> the dirty logging. This not only creates inconsistent TLB entries for
>>>> the pages in the block, but also leaks the table pages at the
>>>> PMD level.
>>>>
>>>> Handle this corner case for the huge mappings at stage2 by
>>>> unmapping the non-huge mapping for the block. This could potentially
>>>> release the upper level table. So we need to restart the table walk
>>>> once we unmap the range.
>>>>
>>>> Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
>>>> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
>>>> Cc: Zheng Xiang <zhengxiang9@huawei.com>
>>>> Cc: Zhengui Yu <yuzenghui@huawei.com>
>>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>>> Cc: Christoffer Dall <christoffer.dall@arm.com>
>>>> Signed-off-by: Suzuki K Poulose ...


>>>> +		if (!pmd_thp_or_huge(old_pmd)) {
>>>> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
>>>> +			goto retry;
>>>

>>>> +		if (!stage2_pud_huge(kvm, old_pud)) {
>>>> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
>>>

>> We should really get rid of the S2_P{U/M}D_* definitions, as they are
>> always the same as the host. The only thing that changes is the PGD size
>> which varies according to the IPA and the concatenation.
>>

Also, what do you think about using P{M,U}D_* instead of S2_P{M,U}D_*
above? I could make that change with the respin.
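
Concretely, the two hunks would then read something like this (sketch,
relying on the host definitions matching what stage2 uses anyway):

		if (!pmd_thp_or_huge(old_pmd)) {
			unmap_stage2_range(kvm, addr & PMD_MASK, PMD_SIZE);
			goto retry;
		}
	[...]
		if (!stage2_pud_huge(kvm, old_pud)) {
			unmap_stage2_range(kvm, addr & PUD_MASK, PUD_SIZE);
			goto retry;
		}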

> 
> Sure, feel free to send a fixed version. I'll drop the currently queued
> patch.
> 


Thanks. Sorry for the trouble.

Cheers
Suzuki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20 10:23                               ` Suzuki K Poulose
@ 2019-03-20 10:35                                 ` Marc Zyngier
  2019-03-20 11:12                                   ` Suzuki K Poulose
  0 siblings, 1 reply; 22+ messages in thread
From: Marc Zyngier @ 2019-03-20 10:35 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

On Wed, 20 Mar 2019 10:23:39 +0000
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:

Hi Suzuki,

> Marc,
> 
> On 20/03/2019 10:11, Marc Zyngier wrote:
> > On Wed, 20 Mar 2019 09:44:38 +0000
> > Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> >   
> >> Hi Marc,
> >>
> >> On 20/03/2019 08:15, Marc Zyngier wrote:  
> >>> Hi Suzuki,
> >>>
> >>> On Tue, 19 Mar 2019 14:11:08 +0000,
> >>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:  
> >>>>
> >>>> We rely on the mmu_notifier callbacks to handle the split/merge
> >>>> of huge pages and thus we are guaranteed that, while creating a
> >>>> block mapping, either the entire block is unmapped at stage2 or it
> >>>> is missing permission.
> >>>>
> >>>> However, we miss a case where the block mapping is split for the dirty
> >>>> logging case and could later be made a block mapping again, if we cancel
> >>>> the dirty logging. This not only creates inconsistent TLB entries for
> >>>> the pages in the block, but also leaks the table pages at the
> >>>> PMD level.
> >>>>
> >>>> Handle this corner case for the huge mappings at stage2 by
> >>>> unmapping the non-huge mapping for the block. This could potentially
> >>>> release the upper level table. So we need to restart the table walk
> >>>> once we unmap the range.
> >>>>
> >>>> Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
> >>>> Reported-by: Zheng Xiang <zhengxiang9@huawei.com>
> >>>> Cc: Zheng Xiang <zhengxiang9@huawei.com>
> >>>> Cc: Zhengui Yu <yuzenghui@huawei.com>
> >>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
> >>>> Cc: Christoffer Dall <christoffer.dall@arm.com>
> >>>> Signed-off-by: Suzuki K Poulose ...  
> 
> 
> >>>> +		if (!pmd_thp_or_huge(old_pmd)) {
> >>>> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> >>>> +			goto retry;  
> >>>  
> 
> >>>> +		if (!stage2_pud_huge(kvm, old_pud)) {
> >>>> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);  
> >>>  
> 
> >> We should really get rid of the S2_P{U/M}D_* definitions, as they are
> >> always the same as the host. The only thing that changes is the PGD size
> >> which varies according to the IPA and the concatenation.
> >>  
> 
> Also what do you think about using  P{M,U}D_* instead of S2_P{M,U}D_*
> above ? I could make that change with the respin.

Given that this is a fix, I'd like it to be as small and obvious as
possible, making it easier to backport.

I'm happy to take another patch for 5.2 that will drop the whole S2_P*
if we still think that this should be the case (though what I'd really
like is to have architectural levels instead of these arbitrary
definitions).

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20 10:35                                 ` Marc Zyngier
@ 2019-03-20 11:12                                   ` Suzuki K Poulose
  2019-03-20 17:24                                     ` Marc Zyngier
  0 siblings, 1 reply; 22+ messages in thread
From: Suzuki K Poulose @ 2019-03-20 11:12 UTC (permalink / raw)
  To: marc.zyngier
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

Marc,

On 20/03/2019 10:35, Marc Zyngier wrote:
> On Wed, 20 Mar 2019 10:23:39 +0000
> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> 
> Hi Suzuki,
> 
>> Marc,
>>
>> On 20/03/2019 10:11, Marc Zyngier wrote:
>>> On Wed, 20 Mar 2019 09:44:38 +0000
>>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
>>>    
>>>> Hi Marc,
>>>>
>>>> On 20/03/2019 08:15, Marc Zyngier wrote:
>>>>> Hi Suzuki,
>>>>>
>>>>> On Tue, 19 Mar 2019 14:11:08 +0000,
>>>>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:

...

>>>>>> +		if (!pmd_thp_or_huge(old_pmd)) {
>>>>>> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
>>>>>> +			goto retry;
>>>>>   
>>
>>>>>> +		if (!stage2_pud_huge(kvm, old_pud)) {
>>>>>> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
>>>>>   
>>
>>>> We should really get rid of the S2_P{U/M}D_* definitions, as they are
>>>> always the same as the host. The only thing that changes is the PGD size
>>>> which varies according to the IPA and the concatenation.
>>>>   
>>
>> Also what do you think about using  P{M,U}D_* instead of S2_P{M,U}D_*
>> above ? I could make that change with the respin.
> 
> > Given that this is a fix, I'd like it to be as small and obvious as
> possible, making it easier to backport.
> 
> I'm happy to take another patch for 5.2 that will drop the whole S2_P*
> if we still think that this should be the case (though what I'd really
> like is to have architectural levels instead of these arbitrary
> definitions).

I only meant the two new instances added above in the patch. Of course, I
could send something to fix the existing ones.

Cheers
Suzuki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
  2019-03-20 11:12                                   ` Suzuki K Poulose
@ 2019-03-20 17:24                                     ` Marc Zyngier
  0 siblings, 0 replies; 22+ messages in thread
From: Marc Zyngier @ 2019-03-20 17:24 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: linux-arm-kernel, linux-kernel, kvm, kvmarm, will.deacon,
	catalin.marinas, james.morse, julien.thierry, wanghaibin.wang,
	lious.lilei, lishuo1, zhengxiang9, yuzenghui, christoffer.dall

On Wed, 20 Mar 2019 11:12:47 +0000
Suzuki K Poulose <suzuki.poulose@arm.com> wrote:

> Marc,
> 
> On 20/03/2019 10:35, Marc Zyngier wrote:
> > On Wed, 20 Mar 2019 10:23:39 +0000
> > Suzuki K Poulose <suzuki.poulose@arm.com> wrote:
> > 
> > Hi Suzuki,
> >   
> >> Marc,
> >>
> >> On 20/03/2019 10:11, Marc Zyngier wrote:  
> >>> On Wed, 20 Mar 2019 09:44:38 +0000
> >>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:  
> >>>    >>>> Hi Marc,  
> >>>>
> >>>> On 20/03/2019 08:15, Marc Zyngier wrote:  
> >>>>> Hi Suzuki,
> >>>>>
> >>>>> On Tue, 19 Mar 2019 14:11:08 +0000,
> >>>>> Suzuki K Poulose <suzuki.poulose@arm.com> wrote:  
> 
> ...
> 
> >>>>>> +		if (!pmd_thp_or_huge(old_pmd)) {
> >>>>>> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> >>>>>> +			goto retry;  
> >>>>>   >>  
> >>>>>> +		if (!stage2_pud_huge(kvm, old_pud)) {
> >>>>>> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);  
> >>>>>   >>  
> >>>> We should really get rid of the S2_P{U/M}D_* definitions, as they are
> >>>> always the same as the host. The only thing that changes is the PGD size
> >>>> which varies according to the IPA and the concatenation.  
> >>>>   >>  
> >> Also what do you think about using  P{M,U}D_* instead of S2_P{M,U}D_*
> >> above ? I could make that change with the respin.  
> > 
> > > Given that this is a fix, I'd like it to be as small and obvious as
> > possible, making it easier to backport.
> > 
> > I'm happy to take another patch for 5.2 that will drop the whole S2_P*
> > if we still think that this should be the case (though what I'd really
> > like is to have architectural levels instead of these arbitrary
> > definitions).  
> 
> I only meant the two new instances added above in the patch. Of course, I
> could send something to fix the existing ones.

I'd rather be consistent, and use the same names all over the code.
Once we decide to change, we do it all in one go.

Thanks,

	M.
-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-03-20 17:24 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-11 16:31 [RFC] Question about TLB flush while set Stage-2 huge pages Zheng Xiang
2019-03-12 11:32 ` Marc Zyngier
2019-03-12 15:30   ` Zheng Xiang
2019-03-12 18:18     ` Marc Zyngier
2019-03-13  9:45       ` Zheng Xiang
2019-03-14 10:55         ` Suzuki K Poulose
2019-03-14 15:50           ` Zenghui Yu
2019-03-15  8:21             ` Zheng Xiang
2019-03-15 14:56               ` Suzuki K Poulose
2019-03-17 13:34                 ` Zenghui Yu
2019-03-18 17:34                   ` Suzuki K Poulose
2019-03-19  9:05                     ` Zenghui Yu
2019-03-19 14:11                       ` [PATCH] kvm: arm: Fix handling of stage2 huge mappings Suzuki K Poulose
2019-03-19 16:02                         ` Zenghui Yu
2019-03-20  8:15                         ` Marc Zyngier
2019-03-20  9:44                           ` Suzuki K Poulose
2019-03-20 10:11                             ` Marc Zyngier
2019-03-20 10:23                               ` Suzuki K Poulose
2019-03-20 10:35                                 ` Marc Zyngier
2019-03-20 11:12                                   ` Suzuki K Poulose
2019-03-20 17:24                                     ` Marc Zyngier
2019-03-17 13:55                 ` [RFC] Question about TLB flush while set Stage-2 huge pages Zenghui Yu
