From: Zheng Xiang <zhengxiang9@huawei.com>
To: Zenghui Yu <yuzenghui@huawei.com>,
	Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>, <christoffer.dall@arm.com>,
	<catalin.marinas@arm.com>, <will.deacon@arm.com>,
	<james.morse@arm.com>, <linux-arm-kernel@lists.infradead.org>,
	<kvmarm@lists.cs.columbia.edu>, <linux-kernel@vger.kernel.org>,
	Wang Haibin <wanghaibin.wang@huawei.com>,
	<lious.lilei@hisilicon.com>, <lishuo1@hisilicon.com>
Subject: Re: [RFC] Question about TLB flush while set Stage-2 huge pages
Date: Fri, 15 Mar 2019 16:21:03 +0800
Message-ID: <d322e126-4da2-6dfd-a86d-088dfb3bf0f4@huawei.com>
In-Reply-To: <368bd218-ac1d-19b2-6e92-960b91afee8b@huawei.com>

Hi Suzuki,

I have tested this patch. The VM doesn't hang, and we get the expected WARNING log:

[  526.184452] pstate: 20400009 (nzCv daif +PAN -UAO)
[  526.184454] pc : user_mem_abort+0x484/0x9e0
[  526.184455] lr : user_mem_abort+0x478/0x9e0
[  526.184456] sp : ffff000084a038e0
[  526.184457] x29: ffff000084a038e0 x28: 000000012f600000
[  526.184458] x27: ffff8a2fa27ae918 x26: 0000000000200000
[  526.184460] x25: 0000000000000000 x24: 0000000000000000
[  526.184461] x23: 00400a269d0007fd x22: ffff0000849cd000
[  526.184462] x21: ffff00001181d000 x20: 00000a26eef72003
[  526.184463] x19: ffff8a2fb41d4bd8 x18: 00004fffb8b22000
[  526.184465] x17: 0000000000000000 x16: 0000000000000000
[  526.184466] x15: 0000000000000001 x14: ffff000008dd12a8
[  526.184467] x13: 0000000000000041 x12: ffff8a26eeca6e30
[  526.184468] x11: ffff8000fe4af800 x10: 0000000000000040
[  526.184469] x9 : ffff0000097c46c0 x8 : ffff8000ff400248
[  526.184471] x7 : 0000001000000000 x6 : 00000000000021f8
[  526.184472] x5 : 00000000a269d000 x4 : 0000000000000018
[  526.184473] x3 : 000000000000000a x2 : 0000000000000004
[  526.184474] x1 : 0000000000000000 x0 : 0000000000000000
[  526.184476] Call trace:
[  526.184477]  user_mem_abort+0x484/0x9e0
[  526.184479]  kvm_handle_guest_abort+0x11c/0x478
[  526.184480]  handle_exit+0x14c/0x1c8
[  526.184482]  kvm_arch_vcpu_ioctl_run+0x280/0x898
[  526.184483]  kvm_vcpu_ioctl+0x488/0x8a8
[  526.184485]  do_vfs_ioctl+0xc4/0x8c0
[  526.184486]  ksys_ioctl+0x8c/0xa0
[  526.184487]  __arm64_sys_ioctl+0x28/0x38
[  526.184489]  el0_svc_common+0xa0/0x180
[  526.184491]  el0_svc_handler+0x38/0x78
[  526.184492]  el0_svc+0x8/0xc

However, we also get the following unexpected log:

[  908.329900] BUG: Bad page state in process qemu-kvm  pfn:a2fb41cf
[  908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0 mapping:0000000000000000 index:0x0
[  908.339416] flags: 0x4ffffe0000000000()
[  908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200 0000000000000000
[  908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff 0000000000000000
[  908.339420] page dumped because: nonzero _refcount
[  908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded Tainted: G    B  W        5.0.0+ #1
[  908.339438] Call trace:
[  908.339439]  dump_backtrace+0x0/0x188
[  908.339441]  show_stack+0x24/0x30
[  908.339442]  dump_stack+0xa8/0xcc
[  908.339443]  bad_page+0xf0/0x150
[  908.339445]  free_pages_check_bad+0x84/0xa0
[  908.339446]  free_pcppages_bulk+0x4b8/0x750
[  908.339448]  free_unref_page_commit+0x13c/0x198
[  908.339449]  free_unref_page+0x84/0xa0
[  908.339451]  __free_pages+0x58/0x68
[  908.339452]  zap_huge_pmd+0x290/0x2d8
[  908.339454]  unmap_page_range+0x2b4/0x470
[  908.339455]  unmap_single_vma+0x94/0xe8
[  908.339457]  unmap_vmas+0x8c/0x108
[  908.339458]  exit_mmap+0xd4/0x178
[  908.339459]  mmput+0x74/0x180
[  908.339460]  do_exit+0x2b4/0x5b0
[  908.339462]  do_group_exit+0x3c/0xe0
[  908.339463]  __arm64_sys_exit_group+0x24/0x28
[  908.339465]  el0_svc_common+0xa0/0x180
[  908.339466]  el0_svc_handler+0x38/0x78
[  908.339467]  el0_svc+0x8/0xc
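
As a cross-check on the dump above (my reading, not something the kernel prints): assuming a little-endian 64-bit kernel of this era, where the seventh raw word of struct page packs _mapcount in its low 32 bits and _refcount in its high 32 bits, the word fffffffcffffffff decodes to the same count:-4 and mapcount:0 shown in the summary line. A minimal sketch in plain userspace C; decode_counts(), struct page_counts and check_dump_word() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper type: the two 32-bit counters assumed to share
 * one 64-bit raw word of struct page (layout is an assumption here).
 */
struct page_counts {
    int32_t mapcount_raw; /* low half: _mapcount, stored as mapcount - 1 */
    int32_t refcount;     /* high half: _refcount */
};

/* Split a raw word into its two halves, reinterpreted as signed values. */
static struct page_counts decode_counts(uint64_t raw_word)
{
    struct page_counts c;

    /* The casts rely on the usual two's complement representation. */
    c.mapcount_raw = (int32_t)(uint32_t)(raw_word & 0xffffffffu);
    c.refcount = (int32_t)(uint32_t)(raw_word >> 32);
    return c;
}

/* Decode the word taken from the second "raw:" line of the dump. */
static void check_dump_word(void)
{
    struct page_counts c = decode_counts(0xfffffffcffffffffULL);

    assert(c.refcount == -4);         /* matches "count:-4" */
    assert(c.mapcount_raw + 1 == 0);  /* matches "mapcount:0" */
}
```

If that layout assumption holds, the page really was freed with a reference count of -4, which points at unbalanced put operations rather than a corrupted dump.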

>> Marc and I had a discussion about this and it looks like we may have an
>> issue here. With the cancellation of logging, we do not trigger the
>> mmu_notifiers (as the userspace memory mapping hasn't changed) and thus
>> leak memory while trying to install a huge mapping. Would it be
>> possible for you to try the patch below? It will trigger a WARNING
>> to confirm our theory, but should not cause the hang, as we unmap
>> the PMD/PUD range of PTE mappings before reinstalling a block map.
> 
> Thanks for the reply. I think this is almost what Zheng Xiang wanted to say! We will test this patch tomorrow and give you some feedback.
> 
> BTW, we have noticed that x86 also suffered from a similar issue. You may want to look into commit 3ea3b7fa9af0 ("kvm: mmu: lazy collapse small sptes into large sptes", 2015) :-)
> 
> 
> thanks,
> 
> zenghui
> 
>>
>>
>> ---8>---
>>
>> test: kvm: arm: Fix handling of stage2 huge mappings
>>
>> We rely on the mmu_notifier callbacks to handle the split/merging
>> of huge pages and thus we are guaranteed that while creating a
>> block mapping, the entire block is unmapped at stage2. However,
>> we miss a case where the block mapping is split for dirty logging
>> and could later be made a block mapping again if we cancel the
>> dirty logging. This not only creates inconsistent TLB entries for
>> the pages in the block, but also leaks the table pages for
>> PMD level.
>>
>> Handle these corner cases for the huge mappings at stage2.
>>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   virt/kvm/arm/mmu.c | 51 +++++++++++++++++++++++++++++++++++----------------
>>   1 file changed, 35 insertions(+), 16 deletions(-)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 66e0fbb5..04b0f9b 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>>            * Skip updating the page table if the entry is
>>            * unchanged.
>>            */
>> -        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>> +        if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>               return 0;
>> -
>> +        } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>           /*
>> -         * Mapping in huge pages should only happen through a
>> -         * fault.  If a page is merged into a transparent huge
>> -         * page, the individual subpages of that huge page
>> -         * should be unmapped through MMU notifiers before we
>> -         * get here.
>> -         *
>> -         * Merging of CompoundPages is not supported; they
>> -         * should become splitting first, unmapped, merged,
>> -         * and mapped back in on-demand.
>> +         * If we have PTE level mapping for this block,
>> +         * we must unmap it to avoid inconsistent TLB
>> +         * state. We could end up in this situation if
>> +         * the memory slot was marked for dirty logging
>> +         * and was reverted, leaving PTE level mappings
>> +         * for the pages accessed during the period.
>> +         * Normal THP split/merge follows mmu_notifier
>> +         * callbacks and do get handled accordingly.
>>            */
>> -        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>> +            unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);

It seems that KVM decrements the page's _refcount twice: once in
transparent_hugepage_adjust() and again in unmap_stage2_range().
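
To make the suspicion concrete, here is a toy model in plain userspace C. It is not kernel code and does not claim to reproduce the actual call paths; struct toy_page, toy_get(), toy_put(), toy_double_put() and toy_demo() are hypothetical names. It only shows how one reference that is dropped on two separate paths drives a counter below zero, which is the kind of state the "nonzero _refcount" check reports when the page is finally freed:

```c
#include <assert.h>

/* Toy stand-in for a page's reference count; not the kernel's struct page. */
struct toy_page {
    int refcount;
};

static void toy_get(struct toy_page *p) { p->refcount++; }
static void toy_put(struct toy_page *p) { p->refcount--; }

/*
 * One reference is taken, then dropped on two different paths.
 * Returns the final, unbalanced count.
 */
static int toy_double_put(void)
{
    struct toy_page page = { .refcount = 0 };

    toy_get(&page); /* reference taken once */
    toy_put(&page); /* first drop, on one path */
    toy_put(&page); /* second drop, on another path: unbalanced */
    return page.refcount;
}

/* The counter ends up one below its starting value. */
static void toy_demo(void)
{
    assert(toy_double_put() == -1);
}
```

Each repetition of such an imbalance would push the count one step further below zero, so a value like -4 would be consistent with the same mistake happening several times on one page. That is an inference from the toy model, not something I have confirmed in the code.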

>> +        } else {
>>   -        pmd_clear(pmd);
>> -        kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +            /*
>> +             * Mapping in huge pages should only happen through a
>> +             * fault.  If a page is merged into a transparent huge
>> +             * page, the individual subpages of that huge page
>> +             * should be unmapped through MMU notifiers before we
>> +             * get here.
>> +             *
>> +             * Merging of CompoundPages is not supported; they
>> +             * should become splitting first, unmapped, merged,
>> +             * and mapped back in on-demand.
>> +             */
>> +            WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>> +
>> +            pmd_clear(pmd);
>> +            kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        }
>>       } else {
>>           get_page(virt_to_page(pmd));
>>       }
>> @@ -1122,8 +1136,13 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>>           return 0;
>>         if (stage2_pud_present(kvm, old_pud)) {
>> -        stage2_pud_clear(kvm, pudp);
>> -        kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        /* If we have PTE level mapping, unmap the entire range */
>> +        if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
>> +            unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
>> +        } else {
>> +            stage2_pud_clear(kvm, pudp);
>> +            kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +        }
>>       } else {
>>           get_page(virt_to_page(pudp));
>>       }
>>
> 
> 
> .
-- 

Thanks,
Xiang






