linux-kernel.vger.kernel.org archive mirror
* kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
@ 2017-03-10 13:34 Andrey Konovalov
  2017-03-14 11:07 ` Suzuki K Poulose
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Konovalov @ 2017-03-10 13:34 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář,
	Christoffer Dall,
	Marc Zyngier, Catalin Marinas, Will Deacon, Suzuki K Poulose,
	kvm, linux-arm-kernel, kvmarm, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller

Hi,

I've got the following error report while fuzzing the kernel with syzkaller.

On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).

Unfortunately I can't reproduce it.

==================================================================
BUG: KASAN: use-after-free in put_page include/linux/compiler.h:243 [inline]
BUG: KASAN: use-after-free in unmap_stage2_pmds
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
BUG: KASAN: use-after-free in unmap_stage2_puds
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
BUG: KASAN: use-after-free in unmap_stage2_range+0x884/0x938
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
Read of size 8 at addr ffff80004585c000 by task syz-executor/5176

CPU: 1 PID: 5176 Comm: syz-executor Not tainted
4.11.0-rc1-next-20170308-xc2-dirty #3
Hardware name: Hardkernel ODROID-C2 (DT)
Call trace:
[<ffff20000808fbb0>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:69
[<ffff200008090010>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:219
[<ffff2000088e9578>] __write_once_size include/linux/compiler.h:271 [inline]
[<ffff2000088e9578>] dump_stack+0x110/0x168 lib/dump_stack.c:54
[<ffff200008414018>] print_address_description+0x60/0x248
[<ffff2000084142e8>] print_error_description mm/kasan/report.c:98 [inline]
[<ffff2000084142e8>] kasan_report_error+0xe8/0x250 mm/kasan/report.c:287
[<ffff200008414564>] kasan_report mm/kasan/report.c:308 [inline]
[<ffff200008414564>] __asan_report_load8_noabort+0x3c/0x48 mm/kasan/report.c:329
[<ffff2000080d9664>] put_page include/linux/compiler.h:243 [inline]
[<ffff2000080d9664>] unmap_stage2_pmds
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
[<ffff2000080d9664>] unmap_stage2_puds
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
[<ffff2000080d9664>] unmap_stage2_range+0x884/0x938
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
[<ffff2000080d9740>] kvm_unmap_hva_handler+0x28/0x38
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1556
[<ffff2000080d7578>] handle_hva_to_gpa+0x140/0x250
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1547
[<ffff2000080db858>] kvm_unmap_hva_range+0x60/0x80
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1579
[<ffff2000080c12dc>]
kvm_mmu_notifier_invalidate_range_start+0x194/0x278
arch/arm64/kvm/../../../virt/kvm/kvm_main.c:357
[<ffff2000083fea58>] __mmu_notifier_invalidate_range_start+0x1d0/0x2a0
mm/mmu_notifier.c:199
[<ffff200008395f44>] mmu_notifier_invalidate_range_start
include/linux/mmu_notifier.h:282 [inline]
[<ffff200008395f44>] unmap_vmas+0x12c/0x198 mm/memory.c:1372
[<ffff2000083a8e00>] unmap_region+0x128/0x230 mm/mmap.c:2460
[<ffff2000083ae6e0>] update_hiwater_vm include/linux/mm.h:1483 [inline]
[<ffff2000083ae6e0>] remove_vma_list mm/mmap.c:2432 [inline]
[<ffff2000083ae6e0>] do_munmap+0x598/0x9b0 mm/mmap.c:2662
[<ffff2000083b14a0>] find_vma_links mm/mmap.c:495 [inline]
[<ffff2000083b14a0>] mmap_region+0x138/0xc78 mm/mmap.c:1610
[<ffff2000083b23ac>] is_file_hugepages include/linux/hugetlb.h:269 [inline]
[<ffff2000083b23ac>] do_mmap+0x3cc/0x848 mm/mmap.c:1446
[<ffff200008369b0c>] do_mmap_pgoff include/linux/mm.h:2039 [inline]
[<ffff200008369b0c>] vm_mmap_pgoff+0xec/0x120 mm/util.c:305
[<ffff2000083ac558>] SYSC_mmap_pgoff mm/mmap.c:1475 [inline]
[<ffff2000083ac558>] SyS_mmap_pgoff+0x220/0x420 mm/mmap.c:1458
[<ffff20000808ecf0>] sys_mmap+0x58/0x80 arch/arm64/kernel/sys.c:37
[<ffff200008083f70>] el0_svc_naked+0x24/0x28

The buggy address belongs to the page:
page:ffff7e0001161700 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0xfffc00000000000()
raw: 0fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: ffff7e00018c9120 ffff7e0000ea8b60 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff80004585bf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff80004585bf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff80004585c000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff80004585c080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff80004585c100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-03-10 13:34 kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds Andrey Konovalov
@ 2017-03-14 11:07 ` Suzuki K Poulose
  2017-03-14 16:57   ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Suzuki K Poulose @ 2017-03-14 11:07 UTC (permalink / raw)
  To: Andrey Konovalov, Paolo Bonzini, Radim Krčmář,
	Christoffer Dall,
	Marc Zyngier, Catalin Marinas, Will Deacon, kvm,
	linux-arm-kernel, kvmarm, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller

On 10/03/17 13:34, Andrey Konovalov wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).
>
> Unfortunately I can't reproduce it.
>
> ==================================================================
> BUG: KASAN: use-after-free in put_page include/linux/compiler.h:243 [inline]
> BUG: KASAN: use-after-free in unmap_stage2_pmds
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
> BUG: KASAN: use-after-free in unmap_stage2_puds
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
> BUG: KASAN: use-after-free in unmap_stage2_range+0x884/0x938
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
> Read of size 8 at addr ffff80004585c000 by task syz-executor/5176


> [<ffff2000080d9664>] put_page include/linux/compiler.h:243 [inline]
> [<ffff2000080d9664>] unmap_stage2_pmds
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
> [<ffff2000080d9664>] unmap_stage2_puds
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
> [<ffff2000080d9664>] unmap_stage2_range+0x884/0x938
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
> [<ffff2000080d9740>] kvm_unmap_hva_handler+0x28/0x38
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1556
> [<ffff2000080d7578>] handle_hva_to_gpa+0x140/0x250
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1547
> [<ffff2000080db858>] kvm_unmap_hva_range+0x60/0x80
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1579
> [<ffff2000080c12dc>]
> kvm_mmu_notifier_invalidate_range_start+0x194/0x278
> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:357
> [<ffff2000083fea58>] __mmu_notifier_invalidate_range_start+0x1d0/0x2a0
> mm/mmu_notifier.c:199
> [<ffff200008395f44>] mmu_notifier_invalidate_range_start
> include/linux/mmu_notifier.h:282 [inline]
> [<ffff200008395f44>] unmap_vmas+0x12c/0x198 mm/memory.c:1372
> [<ffff2000083a8e00>] unmap_region+0x128/0x230 mm/mmap.c:2460
> [<ffff2000083ae6e0>] update_hiwater_vm include/linux/mm.h:1483 [inline]
> [<ffff2000083ae6e0>] remove_vma_list mm/mmap.c:2432 [inline]
> [<ffff2000083ae6e0>] do_munmap+0x598/0x9b0 mm/mmap.c:2662
> [<ffff2000083b14a0>] find_vma_links mm/mmap.c:495 [inline]
> [<ffff2000083b14a0>] mmap_region+0x138/0xc78 mm/mmap.c:1610
> [<ffff2000083b23ac>] is_file_hugepages include/linux/hugetlb.h:269 [inline]
> [<ffff2000083b23ac>] do_mmap+0x3cc/0x848 mm/mmap.c:1446
> [<ffff200008369b0c>] do_mmap_pgoff include/linux/mm.h:2039 [inline]
> [<ffff200008369b0c>] vm_mmap_pgoff+0xec/0x120 mm/util.c:305
> [<ffff2000083ac558>] SYSC_mmap_pgoff mm/mmap.c:1475 [inline]
> [<ffff2000083ac558>] SyS_mmap_pgoff+0x220/0x420 mm/mmap.c:1458
> [<ffff20000808ecf0>] sys_mmap+0x58/0x80 arch/arm64/kernel/sys.c:37
> [<ffff200008083f70>] el0_svc_naked+0x24/0x28
>

We hold kvm->mmu_lock while manipulating the stage2 ranges. However, I find that
we don't take the lock when we do it from kvm_free_stage2_pgd(), which could
potentially have caused a problem with an munmap of a memslot?

Lightly tested...


commit fa75684dbf0fe845cf8403517d6e0c2c3344a544
Author: Suzuki K Poulose <suzuki.poulose@arm.com>
Date:   Tue Mar 14 10:26:54 2017 +0000

     kvm: arm: Fix locking for kvm_free_stage2_pgd
     
     In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
     unmap_stage2_range() on the entire memory range for the guest. This could
     cause problems with other callers (e.g., munmap on a memslot) trying to
     unmap a range.
     
     Cc: Marc Zyngier <marc.zyngier@arm.com>
     Cc: Christoffer Dall <christoffer.dall@linaro.org>
     Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index a07ce3e..7f97063 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
         if (kvm->arch.pgd == NULL)
                 return;
  
+       spin_lock(&kvm->mmu_lock);
         unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+       spin_unlock(&kvm->mmu_lock);
+
         /* Free the HW pgd, one page at a time */
         free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
         kvm->arch.pgd = NULL;




> The buggy address belongs to the page:
> page:ffff7e0001161700 count:0 mapcount:0 mapping:          (null) index:0x0
> flags: 0xfffc00000000000()
> raw: 0fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
> raw: ffff7e00018c9120 ffff7e0000ea8b60 0000000000000000 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff80004585bf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  ffff80004585bf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ffff80004585c000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>                    ^
>  ffff80004585c080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  ffff80004585c100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ==================================================================
>


* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-03-14 11:07 ` Suzuki K Poulose
@ 2017-03-14 16:57   ` Paolo Bonzini
  2017-04-12 16:19     ` Andrey Konovalov
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2017-03-14 16:57 UTC (permalink / raw)
  To: Suzuki K Poulose, Andrey Konovalov, Radim Krčmář,
	Christoffer Dall,
	Marc Zyngier, Catalin Marinas, Will Deacon, kvm,
	linux-arm-kernel, kvmarm, LKML
  Cc: Dmitry Vyukov, Kostya Serebryany, syzkaller



On 14/03/2017 12:07, Suzuki K Poulose wrote:
> On 10/03/17 13:34, Andrey Konovalov wrote:
>> Hi,
>>
>> I've got the following error report while fuzzing the kernel with
>> syzkaller.
>>
>> On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).
>>
>> Unfortunately I can't reproduce it.
>>
>> ==================================================================
>> BUG: KASAN: use-after-free in put_page include/linux/compiler.h:243
>> [inline]
>> BUG: KASAN: use-after-free in unmap_stage2_pmds
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
>> BUG: KASAN: use-after-free in unmap_stage2_puds
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
>> BUG: KASAN: use-after-free in unmap_stage2_range+0x884/0x938
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
>> Read of size 8 at addr ffff80004585c000 by task syz-executor/5176
> 
> 
>> [<ffff2000080d9664>] put_page include/linux/compiler.h:243 [inline]
>> [<ffff2000080d9664>] unmap_stage2_pmds
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
>> [<ffff2000080d9664>] unmap_stage2_puds
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
>> [<ffff2000080d9664>] unmap_stage2_range+0x884/0x938
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
>> [<ffff2000080d9740>] kvm_unmap_hva_handler+0x28/0x38
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1556
>> [<ffff2000080d7578>] handle_hva_to_gpa+0x140/0x250
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1547
>> [<ffff2000080db858>] kvm_unmap_hva_range+0x60/0x80
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1579
>> [<ffff2000080c12dc>]
>> kvm_mmu_notifier_invalidate_range_start+0x194/0x278
>> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:357
>> [<ffff2000083fea58>] __mmu_notifier_invalidate_range_start+0x1d0/0x2a0
>> mm/mmu_notifier.c:199
>> [<ffff200008395f44>] mmu_notifier_invalidate_range_start
>> include/linux/mmu_notifier.h:282 [inline]
>> [<ffff200008395f44>] unmap_vmas+0x12c/0x198 mm/memory.c:1372
>> [<ffff2000083a8e00>] unmap_region+0x128/0x230 mm/mmap.c:2460
>> [<ffff2000083ae6e0>] update_hiwater_vm include/linux/mm.h:1483 [inline]
>> [<ffff2000083ae6e0>] remove_vma_list mm/mmap.c:2432 [inline]
>> [<ffff2000083ae6e0>] do_munmap+0x598/0x9b0 mm/mmap.c:2662
>> [<ffff2000083b14a0>] find_vma_links mm/mmap.c:495 [inline]
>> [<ffff2000083b14a0>] mmap_region+0x138/0xc78 mm/mmap.c:1610
>> [<ffff2000083b23ac>] is_file_hugepages include/linux/hugetlb.h:269
>> [inline]
>> [<ffff2000083b23ac>] do_mmap+0x3cc/0x848 mm/mmap.c:1446
>> [<ffff200008369b0c>] do_mmap_pgoff include/linux/mm.h:2039 [inline]
>> [<ffff200008369b0c>] vm_mmap_pgoff+0xec/0x120 mm/util.c:305
>> [<ffff2000083ac558>] SYSC_mmap_pgoff mm/mmap.c:1475 [inline]
>> [<ffff2000083ac558>] SyS_mmap_pgoff+0x220/0x420 mm/mmap.c:1458
>> [<ffff20000808ecf0>] sys_mmap+0x58/0x80 arch/arm64/kernel/sys.c:37
>> [<ffff200008083f70>] el0_svc_naked+0x24/0x28
>>
> 
> We hold kvm->mmu_lock while manipulating the stage2 ranges. However, I find
> that we don't take the lock when we do it from kvm_free_stage2_pgd(), which
> could potentially have caused a problem with an munmap of a memslot?
> 
> Lightly tested...
> 
> 
> commit fa75684dbf0fe845cf8403517d6e0c2c3344a544
> Author: Suzuki K Poulose <suzuki.poulose@arm.com>
> Date:   Tue Mar 14 10:26:54 2017 +0000
> 
>     kvm: arm: Fix locking for kvm_free_stage2_pgd
>
>     In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>     unmap_stage2_range() on the entire memory range for the guest. This
>     could cause problems with other callers (e.g., munmap on a memslot)
>     trying to unmap a range.
>
>     Cc: Marc Zyngier <marc.zyngier@arm.com>
>     Cc: Christoffer Dall <christoffer.dall@linaro.org>
>     Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index a07ce3e..7f97063 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>         if (kvm->arch.pgd == NULL)
>                 return;
>  
> +       spin_lock(&kvm->mmu_lock);
>         unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> +       spin_unlock(&kvm->mmu_lock);
> +
>         /* Free the HW pgd, one page at a time */
>         free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
>         kvm->arch.pgd = NULL;

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> 
> 
>> The buggy address belongs to the page:
>> page:ffff7e0001161700 count:0 mapcount:0 mapping:          (null)
>> index:0x0
>> flags: 0xfffc00000000000()
>> raw: 0fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
>> raw: ffff7e00018c9120 ffff7e0000ea8b60 0000000000000000 0000000000000000
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>>  ffff80004585bf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>  ffff80004585bf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>> ffff80004585c000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>                    ^
>>  ffff80004585c080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>  ffff80004585c100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>> ==================================================================
>>
> 


* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-03-14 16:57   ` Paolo Bonzini
@ 2017-04-12 16:19     ` Andrey Konovalov
  2017-04-12 18:43       ` Marc Zyngier
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Konovalov @ 2017-04-12 16:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Suzuki K Poulose, Radim Krčmář,
	Christoffer Dall,
	Marc Zyngier, Catalin Marinas, Will Deacon, kvm,
	linux-arm-kernel, kvmarm, LKML, Dmitry Vyukov, Kostya Serebryany,
	syzkaller

On Tue, Mar 14, 2017 at 5:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 14/03/2017 12:07, Suzuki K Poulose wrote:
>> On 10/03/17 13:34, Andrey Konovalov wrote:
>>> Hi,
>>>
>>> I've got the following error report while fuzzing the kernel with
>>> syzkaller.
>>>
>>> On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).
>>>
>>> Unfortunately I can't reproduce it.
>>>
>>> ==================================================================
>>> BUG: KASAN: use-after-free in put_page include/linux/compiler.h:243
>>> [inline]
>>> BUG: KASAN: use-after-free in unmap_stage2_pmds
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
>>> BUG: KASAN: use-after-free in unmap_stage2_puds
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
>>> BUG: KASAN: use-after-free in unmap_stage2_range+0x884/0x938
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
>>> Read of size 8 at addr ffff80004585c000 by task syz-executor/5176
>>
>>
>>> [<ffff2000080d9664>] put_page include/linux/compiler.h:243 [inline]
>>> [<ffff2000080d9664>] unmap_stage2_pmds
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:240 [inline]
>>> [<ffff2000080d9664>] unmap_stage2_puds
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:269 [inline]
>>> [<ffff2000080d9664>] unmap_stage2_range+0x884/0x938
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:299
>>> [<ffff2000080d9740>] kvm_unmap_hva_handler+0x28/0x38
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1556
>>> [<ffff2000080d7578>] handle_hva_to_gpa+0x140/0x250
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1547
>>> [<ffff2000080db858>] kvm_unmap_hva_range+0x60/0x80
>>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1579
>>> [<ffff2000080c12dc>]
>>> kvm_mmu_notifier_invalidate_range_start+0x194/0x278
>>> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:357
>>> [<ffff2000083fea58>] __mmu_notifier_invalidate_range_start+0x1d0/0x2a0
>>> mm/mmu_notifier.c:199
>>> [<ffff200008395f44>] mmu_notifier_invalidate_range_start
>>> include/linux/mmu_notifier.h:282 [inline]
>>> [<ffff200008395f44>] unmap_vmas+0x12c/0x198 mm/memory.c:1372
>>> [<ffff2000083a8e00>] unmap_region+0x128/0x230 mm/mmap.c:2460
>>> [<ffff2000083ae6e0>] update_hiwater_vm include/linux/mm.h:1483 [inline]
>>> [<ffff2000083ae6e0>] remove_vma_list mm/mmap.c:2432 [inline]
>>> [<ffff2000083ae6e0>] do_munmap+0x598/0x9b0 mm/mmap.c:2662
>>> [<ffff2000083b14a0>] find_vma_links mm/mmap.c:495 [inline]
>>> [<ffff2000083b14a0>] mmap_region+0x138/0xc78 mm/mmap.c:1610
>>> [<ffff2000083b23ac>] is_file_hugepages include/linux/hugetlb.h:269
>>> [inline]
>>> [<ffff2000083b23ac>] do_mmap+0x3cc/0x848 mm/mmap.c:1446
>>> [<ffff200008369b0c>] do_mmap_pgoff include/linux/mm.h:2039 [inline]
>>> [<ffff200008369b0c>] vm_mmap_pgoff+0xec/0x120 mm/util.c:305
>>> [<ffff2000083ac558>] SYSC_mmap_pgoff mm/mmap.c:1475 [inline]
>>> [<ffff2000083ac558>] SyS_mmap_pgoff+0x220/0x420 mm/mmap.c:1458
>>> [<ffff20000808ecf0>] sys_mmap+0x58/0x80 arch/arm64/kernel/sys.c:37
>>> [<ffff200008083f70>] el0_svc_naked+0x24/0x28
>>>
>>
>> We hold kvm->mmu_lock while manipulating the stage2 ranges. However, I find
>> that we don't take the lock when we do it from kvm_free_stage2_pgd(), which
>> could potentially have caused a problem with an munmap of a memslot?
>>
>> Lightly tested...
>>
>>
>> commit fa75684dbf0fe845cf8403517d6e0c2c3344a544
>> Author: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Date:   Tue Mar 14 10:26:54 2017 +0000
>>
>>     kvm: arm: Fix locking for kvm_free_stage2_pgd
>>
>>     In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>     unmap_stage2_range() on the entire memory range for the guest. This
>>     could cause problems with other callers (e.g., munmap on a memslot)
>>     trying to unmap a range.
>>
>>     Cc: Marc Zyngier <marc.zyngier@arm.com>
>>     Cc: Christoffer Dall <christoffer.dall@linaro.org>
>>     Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index a07ce3e..7f97063 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>         if (kvm->arch.pgd == NULL)
>>                 return;
>>
>> +       spin_lock(&kvm->mmu_lock);
>>         unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>> +       spin_unlock(&kvm->mmu_lock);
>> +
>>         /* Free the HW pgd, one page at a time */
>>         free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
>>         kvm->arch.pgd = NULL;
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>>
>>
>>> The buggy address belongs to the page:
>>> page:ffff7e0001161700 count:0 mapcount:0 mapping:          (null)
>>> index:0x0
>>> flags: 0xfffc00000000000()
>>> raw: 0fffc00000000000 0000000000000000 0000000000000000 00000000ffffffff
>>> raw: ffff7e00018c9120 ffff7e0000ea8b60 0000000000000000 0000000000000000
>>> page dumped because: kasan: bad access detected
>>>
>>> Memory state around the buggy address:
>>>  ffff80004585bf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>>  ffff80004585bf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>>> ffff80004585c000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>>                    ^
>>>  ffff80004585c080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>>  ffff80004585c100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>>> ==================================================================
>>>
>>

Hi,

Apparently this wasn't fixed, I've got this report again on
linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
arm/arm64: Fix locking for kvm_free_stage2_pgd".

I now have a way to reproduce it, so I can test proposed patches. I
don't have a simple C reproducer though.

The bug happens when the following syzkaller program is executed:

mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
unshare(0x400)
perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
0xffffffffffffffff, 0x0)
r0 = openat$kvm(0xffffffffffffff9c,
&(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
&(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
&(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)

==================================================================
BUG: KASAN: use-after-free in arch_spin_is_locked
include/linux/compiler.h:254 [inline]
BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
Read of size 8 at addr ffff800004476730 by task syz-executor/13106

CPU: 1 PID: 13106 Comm: syz-executor Not tainted
4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
Hardware name: Hardkernel ODROID-C2 (DT)
Call trace:
[<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
[<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
[<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
[<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
[<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
[<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
[<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
[<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
[<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
[<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
[<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
[<ffff2000080ddfb8>] kvm_free_stage2_pgd
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
[<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
[<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
[<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
[<ffff2000083a1fb4>] mmu_notifier_release
include/linux/mmu_notifier.h:235 [inline]
[<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
[<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
[<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
[<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
[<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
[<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
[<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
[<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
[<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
[<ffff200008083e68>] work_pending+0x8/0x14

The buggy address belongs to the page:
page:ffff7e0000111d80 count:0 mapcount:-127 mapping:          (null) index:0x0
flags: 0xfffc00000000000()
raw: 0fffc00000000000 0000000000000000 0000000000000000 00000000ffffff80
raw: ffff7e00009698a0 ffff7e00008eefa0 0000000000000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff800004476600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff800004476680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff800004476700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                     ^
 ffff800004476780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff800004476800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-12 16:19     ` Andrey Konovalov
@ 2017-04-12 18:43       ` Marc Zyngier
  2017-04-12 18:51         ` Andrey Konovalov
  2017-04-13  9:17         ` Suzuki K Poulose
  0 siblings, 2 replies; 17+ messages in thread
From: Marc Zyngier @ 2017-04-12 18:43 UTC (permalink / raw)
  To: Andrey Konovalov, Paolo Bonzini
  Cc: Suzuki K Poulose, Radim Krčmář,
	Christoffer Dall,
	Catalin Marinas, Will Deacon, kvm, linux-arm-kernel, kvmarm,
	LKML, Dmitry Vyukov, Kostya Serebryany, syzkaller

On 12/04/17 17:19, Andrey Konovalov wrote:

Hi Andrey,

> Apparently this wasn't fixed, I've got this report again on
> linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
> arm/arm64: Fix locking for kvm_free_stage2_pgd".

This looks like a different bug.

> 
> I now have a way to reproduce it, so I can test proposed patches. I
> don't have a simple C reproducer though.
> 
> The bug happens when the following syzkaller program is executed:
> 
> mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> unshare(0x400)
> perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
> 0xffffffffffffffff, 0x0)
> r0 = openat$kvm(0xffffffffffffff9c,
> &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
> ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
> &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
> &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
> 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)

Is that the only thing the program does? Or is there anything running in
parallel?

> ==================================================================
> BUG: KASAN: use-after-free in arch_spin_is_locked
> include/linux/compiler.h:254 [inline]
> BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> Read of size 8 at addr ffff800004476730 by task syz-executor/13106
> 
> CPU: 1 PID: 13106 Comm: syz-executor Not tainted
> 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
> Hardware name: Hardkernel ODROID-C2 (DT)
> Call trace:
> [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
> [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
> [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
> [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
> [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
> [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
> [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
> [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
> [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]

This is the assert on the spinlock, and the memory is gone.

> [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
> [<ffff2000080ddfb8>] kvm_free_stage2_pgd
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]

But we've taken that lock here. There's only a handful of instructions
in between, and the memory can only go away if there is something
messing with us in parallel.

> [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
> [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
> [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
> [<ffff2000083a1fb4>] mmu_notifier_release
> include/linux/mmu_notifier.h:235 [inline]
> [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
> [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
> [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
> [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
> [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
> [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
> [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
> [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
> [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
> [<ffff200008083e68>] work_pending+0x8/0x14

So we're being serviced with a signal. Do you know if this signal is
generated by your syzkaller program? We could be racing between do_exit
triggered by a fatal signal (this trace) and the closing of the two file
descriptors (vcpu and vm).

Paolo: does this look possible to you? I can't see what locking we have
that could prevent this race.

Thanks,
	
M.
-- 
Jazz is not dead. It just smells funny...


* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-12 18:43       ` Marc Zyngier
@ 2017-04-12 18:51         ` Andrey Konovalov
  2017-04-13  9:34           ` Mark Rutland
  2017-04-13  9:17         ` Suzuki K Poulose
  1 sibling, 1 reply; 17+ messages in thread
From: Andrey Konovalov @ 2017-04-12 18:51 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Suzuki K Poulose, Radim Krčmář,
	Christoffer Dall,
	Catalin Marinas, Will Deacon, kvm, linux-arm-kernel, kvmarm,
	LKML, Dmitry Vyukov, Kostya Serebryany, syzkaller

On Wed, Apr 12, 2017 at 8:43 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 12/04/17 17:19, Andrey Konovalov wrote:
>
> Hi Andrey,
>
>> Apparently this wasn't fixed, I've got this report again on
>> linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
>> arm/arm64: Fix locking for kvm_free_stage2_pgd".
>
> This looks like a different bug.

Oh, OK.

>
>>
>> I now have a way to reproduce it, so I can test proposed patches. I
>> don't have a simple C reproducer though.
>>
>> The bug happens when the following syzkaller program is executed:
>>
>> mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
>> unshare(0x400)
>> perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
>> 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
>> 0xffffffffffffffff, 0x0)
>> r0 = openat$kvm(0xffffffffffffff9c,
>> &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
>> ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
>> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>> syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
>> &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
>> &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
>> 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
>
> Is that the only thing the program does? Or is there anything running in
> parallel?

These calls are executed repeatedly and in random order. That's all.

Except that I'm running the reproducer on a real arm board, so there's
probably a bunch of stuff going on besides these calls.

>
>> ==================================================================
>> BUG: KASAN: use-after-free in arch_spin_is_locked
>> include/linux/compiler.h:254 [inline]
>> BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> Read of size 8 at addr ffff800004476730 by task syz-executor/13106
>>
>> CPU: 1 PID: 13106 Comm: syz-executor Not tainted
>> 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
>> Hardware name: Hardkernel ODROID-C2 (DT)
>> Call trace:
>> [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
>> [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
>> [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
>> [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
>> [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
>> [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
>> [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
>> [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
>> [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
>
> This is the assert on the spinlock, and the memory is gone.
>
>> [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
>> [<ffff2000080ddfb8>] kvm_free_stage2_pgd
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
>
>> But we've taken that lock here. There's only a handful of instructions
> in between, and the memory can only go away if there is something
> messing with us in parallel.
>
>> [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
>> [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
>> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
>> [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
>> [<ffff2000083a1fb4>] mmu_notifier_release
>> include/linux/mmu_notifier.h:235 [inline]
>> [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
>> [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
>> [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
>> [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
>> [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
>> [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
>> [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
>> [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
>> [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
>> [<ffff200008083e68>] work_pending+0x8/0x14
>
> So we're being serviced with a signal. Do you know if this signal is
> generated by your syzkaller program? We could be racing between do_exit
> triggered by a fatal signal (this trace) and the closing of the two file
> descriptors (vcpu and vm).

I'm not sure.

>
> Paolo: does this look possible to you? I can't see what locking we have
> that could prevent this race.
>
> Thanks,
>
> M.
> --
> Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-12 18:43       ` Marc Zyngier
  2017-04-12 18:51         ` Andrey Konovalov
@ 2017-04-13  9:17         ` Suzuki K Poulose
  2017-04-13 15:50           ` Suzuki K. Poulose
  1 sibling, 1 reply; 17+ messages in thread
From: Suzuki K Poulose @ 2017-04-13  9:17 UTC (permalink / raw)
  To: Marc Zyngier, Andrey Konovalov, Paolo Bonzini
  Cc: Radim Krčmář,
	Christoffer Dall,
	Catalin Marinas, Will Deacon, kvm, linux-arm-kernel, kvmarm,
	LKML, Dmitry Vyukov, Kostya Serebryany, syzkaller

On 12/04/17 19:43, Marc Zyngier wrote:
> On 12/04/17 17:19, Andrey Konovalov wrote:
>
> Hi Andrey,
>
>> Apparently this wasn't fixed, I've got this report again on
>> linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
>> arm/arm64: Fix locking for kvm_free_stage2_pgd".
>
> This looks like a different bug.
>
>>
>> I now have a way to reproduce it, so I can test proposed patches. I
>> don't have a simple C reproducer though.
>>
>> The bug happens when the following syzkaller program is executed:
>>
>> mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
>> unshare(0x400)
>> perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
>> 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
>> 0xffffffffffffffff, 0x0)
>> r0 = openat$kvm(0xffffffffffffff9c,
>> &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
>> ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
>> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>> syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
>> &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
>> &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
>> 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
>
> Is that the only thing the program does? Or is there anything running in
> parallel?
>
>> ==================================================================
>> BUG: KASAN: use-after-free in arch_spin_is_locked
>> include/linux/compiler.h:254 [inline]
>> BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> Read of size 8 at addr ffff800004476730 by task syz-executor/13106
>>
>> CPU: 1 PID: 13106 Comm: syz-executor Not tainted
>> 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
>> Hardware name: Hardkernel ODROID-C2 (DT)
>> Call trace:
>> [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
>> [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
>> [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
>> [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
>> [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
>> [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
>> [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
>> [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
>> [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
>
> This is the assert on the spinlock, and the memory is gone.
>
>> [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
>> [<ffff2000080ddfb8>] kvm_free_stage2_pgd
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
>
>> But we've taken that lock here. There's only a handful of instructions
> in between, and the memory can only go away if there is something
> messing with us in parallel.
>
>> [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
>> arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
>> [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
>> arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
>> [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
>> [<ffff2000083a1fb4>] mmu_notifier_release
>> include/linux/mmu_notifier.h:235 [inline]
>> [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
>> [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
>> [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
>> [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
>> [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
>> [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
>> [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
>> [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
>> [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
>> [<ffff200008083e68>] work_pending+0x8/0x14
>
> So we're being serviced with a signal. Do you know if this signal is
> generated by your syzkaller program? We could be racing between do_exit
> triggered by a fatal signal (this trace) and the closing of the two file
> descriptors (vcpu and vm).
>
> Paolo: does this look possible to you? I can't see what locking we have
> that could prevent this race.

On a quick look, I see two issues:

1) It looks like the mmu_notifier->ops.release could be called twice for a notifier,
from mmu_notifier_unregister() and exit_mmap()->mmu_notifier_release(), which is
causing the problem as above.

This could possibly be avoided by swapping the order of the following operations
in the mmu_notifier_unregister():

  a) Invoke ops->release under srcu_read_lock()
  b) Delete the notifier from the list.

which can prevent mmu_notifier_release() calling the ops->release() again, before
we reach (b).


2) The core KVM code does an mmgrab()/mmdrop() on current->mm to pin the mm_struct. But
this doesn't prevent the "real address space" from being destroyed. Since KVM
actually depends on the user pages and page tables, it should really/also(?) use
mmget()/mmput() (see Documentation/vm/active_mm.txt). I understand that mmget() shouldn't
be used for pinning for an unbounded amount of time. But since we do it from within the same
process context (like, say, threads), we should be safe to do so.

Thanks
Suzuki

>
> Thanks,
> 	
> M.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-12 18:51         ` Andrey Konovalov
@ 2017-04-13  9:34           ` Mark Rutland
  2017-04-13 11:53             ` Andrey Konovalov
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2017-04-13  9:34 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Marc Zyngier, kvm, Catalin Marinas, Will Deacon, LKML,
	Kostya Serebryany, syzkaller, linux-arm-kernel, Dmitry Vyukov,
	Paolo Bonzini, kvmarm

On Wed, Apr 12, 2017 at 08:51:31PM +0200, Andrey Konovalov wrote:
> On Wed, Apr 12, 2017 at 8:43 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> > On 12/04/17 17:19, Andrey Konovalov wrote:

> >> I now have a way to reproduce it, so I can test proposed patches. I
> >> don't have a simple C reproducer though.
> >>
> >> The bug happens when the following syzkaller program is executed:
> >>
> >> mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> >> unshare(0x400)
> >> perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
> >> 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> >> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
> >> 0xffffffffffffffff, 0x0)
> >> r0 = openat$kvm(0xffffffffffffff9c,
> >> &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
> >> ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
> >> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> >> syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
> >> &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
> >> &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
> >> 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
> >
> > Is that the only thing the program does? Or is there anything running in
> > parallel?
> 
> These calls are executed repeatedly and in random order. That's all.
> 
> Except that I'm running the reproducer on a real arm board, so there's
> probably a bunch of stuff going on besides these calls.

I had a go at reproducing this on an arm64 board following [1], but so
far I've had no luck. I've dumped the above into syz-kvm-bug, and I'm
trying to reproduce the issue with:

	PATH=$PATH:$(pwd)/bin syz-execprog \
		-executor $(pwd)/bin/syz-executor \
		-cover=0 -repeat=0 -procs 16 \
		syz-kvm-bug

Just to check, is that the correct way to reproduce the problem with the
above log?

How quickly does that reproduce the problem for you?

Thanks,
Mark.

[1] https://github.com/google/syzkaller/wiki/How-to-execute-syzkaller-programs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13  9:34           ` Mark Rutland
@ 2017-04-13 11:53             ` Andrey Konovalov
  2017-04-13 13:45               ` Mark Rutland
  0 siblings, 1 reply; 17+ messages in thread
From: Andrey Konovalov @ 2017-04-13 11:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Marc Zyngier, kvm, Catalin Marinas, Will Deacon, LKML,
	Kostya Serebryany, syzkaller, linux-arm-kernel, Dmitry Vyukov,
	Paolo Bonzini, kvmarm

On Thu, Apr 13, 2017 at 11:34 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Apr 12, 2017 at 08:51:31PM +0200, Andrey Konovalov wrote:
>> On Wed, Apr 12, 2017 at 8:43 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> > On 12/04/17 17:19, Andrey Konovalov wrote:
>
>> >> I now have a way to reproduce it, so I can test proposed patches. I
>> >> don't have a simple C reproducer though.
>> >>
>> >> The bug happens when the following syzkaller program is executed:
>> >>
>> >> mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
>> >> unshare(0x400)
>> >> perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
>> >> 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>> >> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
>> >> 0xffffffffffffffff, 0x0)
>> >> r0 = openat$kvm(0xffffffffffffff9c,
>> >> &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
>> >> ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
>> >> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>> >> syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
>> >> &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
>> >> &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
>> >> 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
>> >
>> > Is that the only thing the program does? Or is there anything running in
>> > parallel?
>>
>> These calls are executed repeatedly and in random order. That's all.
>>
>> Except that I'm running the reproducer on a real arm board, so there's
>> probably a bunch of stuff going on besides these calls.
>
> I had a go at reproducing this on an arm64 board following [1], but so
> far I've had no luck. I've dumped the above into syz-kvm-bug, and I'm
> trying to reproduce the issue with:
>
>         PATH=$PATH:$(pwd)/bin syz-execprog \
>                 -executor $(pwd)/bin/syz-executor \
>                 -cover=0 -repeat=0 -procs 16 \
>                 syz-kvm-bug
>
> Just to check, is that the correct way to reproduce the problem with the
> above log?

Hi Mark,

That assumes you have KASAN enabled.

Since a few unintended line breaks were added in the email, here's the
program in plaintext:
https://gist.githubusercontent.com/xairy/69864355b5a64f74e7cb445b7325a7df/raw/bdbbbf177dbea13eac0a5ddfa9c3e6c32695b13b/kvm-arm-uaf-log

I run it as:
# ./syz-execprog -repeat=0 -collide=false -sandbox=namespace ./kvm-arm-uaf-log

And it takes less than a second to trigger the bug.

Thanks!

>
> How quickly does that reproduce the problem for you?
>
> Thanks,
> Mark.
>
> [1] https://github.com/google/syzkaller/wiki/How-to-execute-syzkaller-programs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13 11:53             ` Andrey Konovalov
@ 2017-04-13 13:45               ` Mark Rutland
  0 siblings, 0 replies; 17+ messages in thread
From: Mark Rutland @ 2017-04-13 13:45 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Marc Zyngier, kvm, Catalin Marinas, Will Deacon, LKML,
	Kostya Serebryany, syzkaller, linux-arm-kernel, Dmitry Vyukov,
	Paolo Bonzini, kvmarm

On Thu, Apr 13, 2017 at 01:53:13PM +0200, Andrey Konovalov wrote:
> On Thu, Apr 13, 2017 at 11:34 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> > I had a go at reproducing this on an arm64 board following [1], but so
> > far I've had no luck. 

[...]

> Hi Mark,
> 
> That assumes you have KASAN enabled.

Yes; I have KASAN_OUTLINE enabled.

> Since a few unintended line breaks were added in the email, here's the
> program in plaintext:
> https://gist.githubusercontent.com/xairy/69864355b5a64f74e7cb445b7325a7df/raw/bdbbbf177dbea13eac0a5ddfa9c3e6c32695b13b/kvm-arm-uaf-log

Thanks!

> I run it as:
> # ./syz-execprog -repeat=0 -collide=false -sandbox=namespace ./kvm-arm-uaf-log

Thanks again. Having the exact command was very helpful for
sanity-checking my configuration.

In the end, it turned out that in my filesystem, /dev/kvm was not
accessible by the nobody user, so the executor couldn't open it, and
none of the KVM paths were being triggered.

> And it takes less than a second to trigger the bug.

After having chmod'd /dev/kvm appropriately, I can also reproduce the
issue within a second or two, running the same command.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13  9:17         ` Suzuki K Poulose
@ 2017-04-13 15:50           ` Suzuki K. Poulose
  2017-04-13 15:53             ` Suzuki K Poulose
                               ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Suzuki K. Poulose @ 2017-04-13 15:50 UTC (permalink / raw)
  To: Marc Zyngier, Andrey Konovalov, Paolo Bonzini
  Cc: rkrcmar, christoffer.dall, linux-arm-kernel, linux-kernel, kvm,
	dvyukov, mark.rutland, kvmarm, catalin.marinas, will.deacon, kcc,
	syzkaller

On Thu, Apr 13, 2017 at 10:17:54AM +0100, Suzuki K Poulose wrote:
> On 12/04/17 19:43, Marc Zyngier wrote:
> > On 12/04/17 17:19, Andrey Konovalov wrote:
> >
> > Hi Andrey,
> >
> > > Apparently this wasn't fixed, I've got this report again on
> > > linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
> > > arm/arm64: Fix locking for kvm_free_stage2_pgd".
> >
> > This looks like a different bug.
> >
> > >
> > > I now have a way to reproduce it, so I can test proposed patches. I
> > > don't have a simple C reproducer though.
> > >
> > > The bug happens when the following syzkaller program is executed:
> > >
> > > mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> > > unshare(0x400)
> > > perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
> > > 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> > > 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
> > > 0xffffffffffffffff, 0x0)
> > > r0 = openat$kvm(0xffffffffffffff9c,
> > > &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
> > > ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
> > > r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> > > syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
> > > &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
> > > &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
> > > 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
> >
> > Is that the only thing the program does? Or is there anything running in
> > parallel?
> >
> > > ==================================================================
> > > BUG: KASAN: use-after-free in arch_spin_is_locked
> > > include/linux/compiler.h:254 [inline]
> > > BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> > > Read of size 8 at addr ffff800004476730 by task syz-executor/13106
> > >
> > > CPU: 1 PID: 13106 Comm: syz-executor Not tainted
> > > 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
> > > Hardware name: Hardkernel ODROID-C2 (DT)
> > > Call trace:
> > > [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
> > > [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
> > > [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
> > > [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
> > > [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
> > > [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
> > > [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
> > > [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
> > > [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
> >
> > This is the assert on the spinlock, and the memory is gone.
> >
> > > [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> > > [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
> > > [<ffff2000080ddfb8>] kvm_free_stage2_pgd
> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
> >
> > But we've taken that lock here. There's only a handful of instructions
> > in between, and the memory can only go away if there is something
> > messing with us in parallel.
> >
> > > [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
> > > [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
> > > arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
> > > [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
> > > [<ffff2000083a1fb4>] mmu_notifier_release
> > > include/linux/mmu_notifier.h:235 [inline]
> > > [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
> > > [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
> > > [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
> > > [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
> > > [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
> > > [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
> > > [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
> > > [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
> > > [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
> > > [<ffff200008083e68>] work_pending+0x8/0x14
> >
> > So we're being serviced with a signal. Do you know if this signal is
> > generated by your syzkaller program? We could be racing between do_exit
> > triggered by a fatal signal (this trace) and the closing of the two file
> > descriptors (vcpu and vm).
> >
> > Paolo: does this look possible to you? I can't see what locking we have
> > that could prevent this race.
>
> On a quick look, I see two issues:
>
> 1) It looks like the mmu_notifier->ops.release could be called twice for a notifier,
> from mmu_notifier_unregister() and exit_mmap()->mmu_notifier_release(), which is
> causing the problem as above.
>
> This could possibly be avoided by swapping the order of the following operations
> in the mmu_notifier_unregister():
>
>  a) Invoke ops->release under srcu_read_lock()
>  b) Delete the notifier from the list.
>
> which can prevent mmu_notifier_release() calling the ops->release() again, before
> we reach (b).
>
>
> 2) The core KVM code does an mmgrab()/mmdrop() on current->mm to pin the mm_struct. But
> this doesn't prevent the "real address space" from being destroyed. Since KVM
> actually depends on the user pages and page tables, it should really/also(?) use
> mmget()/mmput() (see Documentation/vm/active_mm.txt). I understand that mmget() shouldn't
> be used for pinning for an unbounded amount of time. But since we do it from within the same
> process context (like, say, threads), we should be safe to do so.

Here is a patch which implements (2) above.

----8>-----

kvm: Hold reference to the user address space

The core KVM code uses mmgrab/mmdrop to pin the mm struct of the user
application. mmgrab only guarantees that the mm struct is available,
while the "real address space" (see Documentation/vm/active_mm.txt) may
be destroyed. Since KVM depends on the user space page tables for
the guest pages, we should instead do an mmget/mmput. Even though
mmget/mmput is not encouraged for uses with unbounded time, KVM
is fine to do so, as we are doing it from the context of the same process.

This also prevents the race condition where mmu_notifier_release() could
be called in parallel and one instance could end up using a free'd kvm
instance.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: andreyknvl@google.com
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 virt/kvm/kvm_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 88257b3..555712e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
                return ERR_PTR(-ENOMEM);

        spin_lock_init(&kvm->mmu_lock);
-       mmgrab(current->mm);
+       mmget(current->mm);
        kvm->mm = current->mm;
        kvm_eventfd_init(kvm);
        mutex_init(&kvm->lock);
@@ -685,7 +685,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
        for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
                kvm_free_memslots(kvm, kvm->memslots[i]);
        kvm_arch_free_vm(kvm);
-       mmdrop(current->mm);
+       mmput(current->mm);
        return ERR_PTR(r);
 }

@@ -747,7 +747,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
        kvm_arch_free_vm(kvm);
        preempt_notifier_dec();
        hardware_disable_all();
-       mmdrop(mm);
+       mmput(mm);
 }

 void kvm_get_kvm(struct kvm *kvm)
--
2.7.4
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13 15:50           ` Suzuki K. Poulose
@ 2017-04-13 15:53             ` Suzuki K Poulose
  2017-04-13 17:06             ` Andrey Konovalov
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Suzuki K Poulose @ 2017-04-13 15:53 UTC (permalink / raw)
  To: Marc Zyngier, Andrey Konovalov, Paolo Bonzini
  Cc: rkrcmar, christoffer.dall, linux-arm-kernel, linux-kernel, kvm,
	dvyukov, mark.rutland, kvmarm, catalin.marinas, will.deacon, kcc,
	syzkaller

On 13/04/17 16:50, Suzuki K. Poulose wrote:
> On Thu, Apr 13, 2017 at 10:17:54AM +0100, Suzuki K Poulose wrote:
>> On 12/04/17 19:43, Marc Zyngier wrote:
>>> On 12/04/17 17:19, Andrey Konovalov wrote:

Please ignore the footer below, that was mistake from my side.

> --
> 2.7.4
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13 15:50           ` Suzuki K. Poulose
  2017-04-13 15:53             ` Suzuki K Poulose
@ 2017-04-13 17:06             ` Andrey Konovalov
  2017-04-18  8:32             ` Mark Rutland
  2017-04-20 16:40             ` Suzuki K Poulose
  3 siblings, 0 replies; 17+ messages in thread
From: Andrey Konovalov @ 2017-04-13 17:06 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Marc Zyngier, Paolo Bonzini, Radim Krčmář,
	Christoffer Dall, linux-arm-kernel, LKML, kvm, Dmitry Vyukov,
	Mark Rutland, kvmarm, Catalin Marinas, Will Deacon,
	Kostya Serebryany, syzkaller

On Thu, Apr 13, 2017 at 5:50 PM, Suzuki K. Poulose
<Suzuki.Poulose@arm.com> wrote:
> On Thu, Apr 13, 2017 at 10:17:54AM +0100, Suzuki K Poulose wrote:
>> On 12/04/17 19:43, Marc Zyngier wrote:
>> > On 12/04/17 17:19, Andrey Konovalov wrote:
>> >
>> > Hi Andrey,
>> >
>> > > Apparently this wasn't fixed, I've got this report again on
>> > > linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
>> > > arm/arm64: Fix locking for kvm_free_stage2_pgd".
>> >
>> > This looks like a different bug.
>> >
>> > >
>> > > I now have a way to reproduce it, so I can test proposed patches. I
>> > > don't have a simple C reproducer though.
>> > >
>> > > The bug happens when the following syzkaller program is executed:
>> > >
>> > > mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
>> > > unshare(0x400)
>> > > perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
>> > > 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
>> > > 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
>> > > 0xffffffffffffffff, 0x0)
>> > > r0 = openat$kvm(0xffffffffffffff9c,
>> > > &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
>> > > ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
>> > > r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>> > > syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
>> > > &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
>> > > &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
>> > > 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
>> >
>> > Is that the only thing the program does? Or is there anything running in
>> > parallel?
>> >
>> > > ==================================================================
>> > > BUG: KASAN: use-after-free in arch_spin_is_locked
>> > > include/linux/compiler.h:254 [inline]
>> > > BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
>> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> > > Read of size 8 at addr ffff800004476730 by task syz-executor/13106
>> > >
>> > > CPU: 1 PID: 13106 Comm: syz-executor Not tainted
>> > > 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
>> > > Hardware name: Hardkernel ODROID-C2 (DT)
>> > > Call trace:
>> > > [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
>> > > [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
>> > > [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
>> > > [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
>> > > [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
>> > > [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
>> > > [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
>> > > [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
>> > > [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
>> >
>> > This is the assert on the spinlock, and the memory is gone.
>> >
>> > > [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
>> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
>> > > [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
>> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
>> > > [<ffff2000080ddfb8>] kvm_free_stage2_pgd
>> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
>> >
>> > But we've taken that lock here. There's only a handful of instructions
>> > in between, and the memory can only go away if there is something
>> > messing with us in parallel.
>> >
>> > > [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
>> > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
>> > > [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
>> > > arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
>> > > [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
>> > > [<ffff2000083a1fb4>] mmu_notifier_release
>> > > include/linux/mmu_notifier.h:235 [inline]
>> > > [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
>> > > [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
>> > > [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
>> > > [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
>> > > [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
>> > > [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
>> > > [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
>> > > [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
>> > > [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
>> > > [<ffff200008083e68>] work_pending+0x8/0x14
>> >
>> > So we're being serviced with a signal. Do you know if this signal is
>> > generated by your syzkaller program? We could be racing between do_exit
>> > triggered by a fatal signal (this trace) and the closing of the two file
>> > descriptors (vcpu and vm).
>> >
>> > Paolo: does this look possible to you? I can't see what locking we have
>> > that could prevent this race.
>>
>> On a quick look, I see two issues:
>>
>> 1) It looks like the mmu_notifier->ops.release could be called twice for a notifier,
>> from mmu_notifier_unregister() and exit_mmap()->mmu_notifier_release(), which is
>> causing the problem as above.
>>
>> This could possibly be avoided by swapping the order of the following operations
>> in mmu_notifier_unregister():
>>
>>  a) Invoke ops->release under srcu_read_lock()
>>  b) Delete the notifier from the list.
>>
>> which can prevent mmu_notifier_release() from calling ops->release() again
>> before we reach (b).
>>
>>
>> 2) The core KVM code does an mmgrab()/mmdrop() on current->mm to pin the mm_struct. But
>> this doesn't prevent the "real address space" from being destroyed. Since KVM
>> actually depends on the user pages and page tables, it should really/also(?) use
>> mmget()/mmput() (see Documentation/vm/active_mm.txt). I understand that mmget() shouldn't
>> be used for pinning for an unbounded amount of time. But since we do it from within the same
>> process context (like, say, threads), we should be safe to do so.
>
> Here is a patch which implements (2) above.

Hi Suzuki,

Your patch fixes KASAN reports for me.

Thanks!

>
> ----8>-----
>
> kvm: Hold reference to the user address space
>
> The core KVM code uses mmgrab/mmdrop to pin the mm struct of the user
> application. mmgrab only guarantees that the mm struct is available,
> while the "real address space" (see Documentation/vm/active_mm.txt) may
> be destroyed. Since KVM depends on the user space page tables for
> the guest pages, we should instead do an mmget/mmput. Even though
> mmget/mmput is discouraged for unbounded-time use, KVM is fine to do
> so, as we are doing it from the context of the same process.
>
> This also prevents the race condition where mmu_notifier_release() could
> be called in parallel and one instance could end up using a freed kvm
> instance.
>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: andreyknvl@google.com
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/kvm_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 88257b3..555712e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>                 return ERR_PTR(-ENOMEM);
>
>         spin_lock_init(&kvm->mmu_lock);
> -       mmgrab(current->mm);
> +       mmget(current->mm);
>         kvm->mm = current->mm;
>         kvm_eventfd_init(kvm);
>         mutex_init(&kvm->lock);
> @@ -685,7 +685,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>                 kvm_free_memslots(kvm, kvm->memslots[i]);
>         kvm_arch_free_vm(kvm);
> -       mmdrop(current->mm);
> +       mmput(current->mm);
>         return ERR_PTR(r);
>  }
>
> @@ -747,7 +747,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>         kvm_arch_free_vm(kvm);
>         preempt_notifier_dec();
>         hardware_disable_all();
> -       mmdrop(mm);
> +       mmput(mm);
>  }
>
>  void kvm_get_kvm(struct kvm *kvm)
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13 15:50           ` Suzuki K. Poulose
  2017-04-13 15:53             ` Suzuki K Poulose
  2017-04-13 17:06             ` Andrey Konovalov
@ 2017-04-18  8:32             ` Mark Rutland
  2017-04-18  9:08               ` Mark Rutland
  2017-04-20 16:40             ` Suzuki K Poulose
  3 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2017-04-18  8:32 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: Marc Zyngier, Andrey Konovalov, Paolo Bonzini, rkrcmar,
	christoffer.dall, linux-arm-kernel, linux-kernel, kvm, dvyukov,
	kvmarm, catalin.marinas, will.deacon, kcc, syzkaller

Hi Suzuki,

On Thu, Apr 13, 2017 at 04:50:46PM +0100, Suzuki K. Poulose wrote:
> kvm: Hold reference to the user address space
> 
> The core KVM code uses mmgrab/mmdrop to pin the mm struct of the user
> application. mmgrab only guarantees that the mm struct is available,
> while the "real address space" (see Documentation/vm/active_mm.txt) may
> be destroyed. Since KVM depends on the user space page tables for
> the guest pages, we should instead do an mmget/mmput. Even though
> mmget/mmput is discouraged for unbounded-time use, KVM is fine to do
> so, as we are doing it from the context of the same process.
> 
> This also prevents the race condition where mmu_notifier_release() could
> be called in parallel and one instance could end up using a freed kvm
> instance.
> 
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: andreyknvl@google.com
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/kvm_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 88257b3..555712e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  		return ERR_PTR(-ENOMEM);
>  
>  	spin_lock_init(&kvm->mmu_lock);
> -	mmgrab(current->mm);
> +	mmget(current->mm);
>  	kvm->mm = current->mm;
>  	kvm_eventfd_init(kvm);
>  	mutex_init(&kvm->lock);
> @@ -685,7 +685,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>  		kvm_free_memslots(kvm, kvm->memslots[i]);
>  	kvm_arch_free_vm(kvm);
> -	mmdrop(current->mm);
> +	mmput(current->mm);
>  	return ERR_PTR(r);
>  }
>  
> @@ -747,7 +747,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_free_vm(kvm);
>  	preempt_notifier_dec();
>  	hardware_disable_all();
> -	mmdrop(mm);
> +	mmput(mm);
>  }


As a heads-up, I'm seeing what looks to be a KVM memory leak with this
patch applied atop of next-20170411.

I don't yet know if this is a problem with next-20170411 or this patch
in particular -- I will try to track that down. In the meantime, info
dump below.

I left syzkaller running over the weekend using this kernel on the host,
and OOM kicked in after it had been running for a short while. Almost
all of my memory is in use, but judging by top, almost none of this is
associated with processes.

It looks like this is almost all AnonPages allocations:

nanook@medister:~$ cat /proc/meminfo 
MemTotal:       14258176 kB
MemFree:          106192 kB
MemAvailable:      38196 kB
Buffers:           27160 kB
Cached:            42508 kB
SwapCached:            0 kB
Active:         13442912 kB
Inactive:           7388 kB
Active(anon):   13380876 kB
Inactive(anon):      400 kB
Active(file):      62036 kB
Inactive(file):     6988 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:      13380688 kB
Mapped:             7352 kB
Shmem:               620 kB
Slab:             568196 kB
SReclaimable:      21756 kB
SUnreclaim:       546440 kB
KernelStack:        2832 kB
PageTables:        49168 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7129088 kB
Committed_AS:   41554652 kB
VmallocTotal:   100930551744 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePages:  12728320 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:          16384 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Looking at slabtop, there are large number of vm_area_structs around:

 Active / Total Objects (% used)    : 531511 / 587214 (90.5%)
 Active / Total Slabs (% used)      : 29443 / 29443 (100.0%)
 Active / Total Caches (% used)     : 108 / 156 (69.2%)
 Active / Total Size (% used)       : 514052.23K / 536839.57K (95.8%)
 Minimum / Average / Maximum Object : 0.03K / 0.91K / 8.28K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 94924  89757  94%    0.24K   2877       33     23016K vm_area_struct
 72400  60687  83%    0.31K   2896       25     23168K filp
 70553  70484  99%    4.25K  10079        7    322528K names_cache
 70112  64605  92%    0.25K   2191       32     17528K kmalloc-128
 52458  50837  96%    0.09K   1249       42      4996K anon_vma_chain
 23492  22949  97%    4.25K   3356        7    107392K kmalloc-4096
 20631  20631 100%    0.10K    529       39      2116K anon_vma

... so it looks like we could be leaking the mm and associated mappings.

Full OOM splat:

[395953.231838] htop invoked oom-killer: gfp_mask=0x16040d0(GFP_TEMPORARY|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
[395953.244523] htop cpuset=/ mems_allowed=0
[395953.248556] CPU: 4 PID: 2301 Comm: htop Not tainted 4.11.0-rc6-next-20170411-dirty #7044
[395953.256727] Hardware name: AMD Seattle (Rev.B0) Development Board (Overdrive) (DT)
[395953.264374] Call trace:
[395953.266911] [<ffff20000808c358>] dump_backtrace+0x0/0x3a8
[395953.272394] [<ffff20000808c860>] show_stack+0x20/0x30
[395953.277530] [<ffff2000085a86f0>] dump_stack+0xbc/0xec
[395953.282666] [<ffff2000082d66f8>] dump_header+0xd8/0x328
[395953.287977] [<ffff200008215078>] oom_kill_process+0x400/0x6b0
[395953.293807] [<ffff200008215864>] out_of_memory+0x1ec/0x7c0
[395953.299377] [<ffff20000821d918>] __alloc_pages_nodemask+0xd88/0xe68
[395953.305728] [<ffff20000829bd8c>] alloc_pages_current+0xcc/0x218
[395953.311732] [<ffff2000082a9028>] new_slab+0x420/0x658
[395953.316868] [<ffff2000082ab360>] ___slab_alloc+0x370/0x5d8
[395953.322436] [<ffff2000082ab5ec>] __slab_alloc.isra.22+0x24/0x38
[395953.328438] [<ffff2000082abe5c>] kmem_cache_alloc+0x1bc/0x1e8
[395953.334268] [<ffff200008387eec>] proc_alloc_inode+0x24/0xa8
[395953.339924] [<ffff20000830af14>] alloc_inode+0x3c/0xf0
[395953.345146] [<ffff20000830df90>] new_inode_pseudo+0x20/0x80
[395953.350800] [<ffff20000830e014>] new_inode+0x24/0x50
[395953.355850] [<ffff20000838e860>] proc_pid_make_inode+0x28/0x118
[395953.361853] [<ffff20000838ea78>] proc_pident_instantiate+0x48/0x140
[395953.368204] [<ffff20000838ec6c>] proc_pident_lookup+0xfc/0x168
[395953.374121] [<ffff20000838ed8c>] proc_tgid_base_lookup+0x34/0x40
[395953.380210] [<ffff2000082f77ec>] path_openat+0x194c/0x1b68
[395953.385779] [<ffff2000082f96e0>] do_filp_open+0xe0/0x178
[395953.391178] [<ffff2000082d9f70>] do_sys_open+0x1e8/0x300
[395953.396575] [<ffff2000082da108>] SyS_openat+0x38/0x48
[395953.401710] [<ffff200008083730>] el0_svc_naked+0x24/0x28
[395953.408051] Mem-Info:
[395953.410423] active_anon:3354643 inactive_anon:100 isolated_anon:0
[395953.410423]  active_file:16 inactive_file:0 isolated_file:0
[395953.410423]  unevictable:0 dirty:0 writeback:0 unstable:0
[395953.410423]  slab_reclaimable:15505 slab_unreclaimable:143437
[395953.410423]  mapped:0 shmem:155 pagetables:10329 bounce:0
[395953.410423]  free:21060 free_pcp:403 free_cma:0
[395953.443636] Node 0 active_anon:13418572kB inactive_anon:400kB active_file:540kB inactive_file:104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:380kB dirty:0kB writeback:0kB shmem:620kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12926976kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[395953.471351] Node 0 DMA free:50620kB min:12828kB low:16884kB high:20940kB active_anon:3989600kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:4194304kB managed:4060788kB mlocked:0kB slab_reclaimable:2928kB slab_unreclaimable:10648kB kernel_stack:0kB pagetables:3600kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[395953.503543] lowmem_reserve[]: 0 9958 9958
[395953.507654] Node 0 Normal free:33004kB min:32224kB low:42420kB high:52616kB active_anon:9428972kB inactive_anon:400kB active_file:132kB inactive_file:80kB unevictable:0kB writepending:0kB present:12582912kB managed:10197388kB mlocked:0kB slab_reclaimable:59092kB slab_unreclaimable:563100kB kernel_stack:4032kB pagetables:37716kB bounce:0kB free_pcp:560kB local_pcp:0kB free_cma:0kB
[395953.541392] lowmem_reserve[]: 0 0 0
[395953.544979] Node 0 DMA: 531*4kB (UME) 210*8kB (UME) 114*16kB (UME) 34*32kB (ME) 18*64kB (UME) 34*128kB (UME) 46*256kB (UM) 14*512kB (UM) 7*1024kB (UM) 0*2048kB 3*4096kB (M) = 50620kB
[395953.561390] Node 0 Normal: 3041*4kB (UMEH) 1694*8kB (UMEH) 447*16kB (UMEH) 10*32kB (U) 2*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33316kB
[395953.575702] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[395953.584229] 521 total pagecache pages
[395953.587984] 0 pages in swap cache
[395953.591392] Swap cache stats: add 0, delete 0, find 0/0
[395953.596706] Free swap  = 0kB
[395953.599677] Total swap = 0kB
[395953.602638] 4194304 pages RAM
[395953.605692] 0 pages HighMem/MovableOnly
[395953.609617] 629760 pages reserved
[395953.613021] 4096 pages cma reserved
[395953.616599] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[395953.625244] [ 1447]     0  1447      714       74       5       3        0             0 upstart-udev-br
[395953.634818] [ 1450]     0  1450     2758      187       7       3        0         -1000 systemd-udevd
[395953.644218] [ 1833]     0  1833      632       46       5       3        0             0 upstart-socket-
[395953.653790] [ 1847]     0  1847      708       63       5       3        0             0 rpcbind
[395953.662668] [ 1879]   106  1879      737      114       5       3        0             0 rpc.statd
[395953.671734] [ 1984]     0  1984      636       54       5       4        0             0 upstart-file-br
[395953.681307] [ 2000]   103  2000     1152      120       6       3        0             0 dbus-daemon
[395953.690534] [ 2006]     0  2006      720       49       6       3        0             0 rpc.idmapd
[395953.699676] [ 2008]   101  2008    56308      201      12       3        0             0 rsyslogd
[395953.708641] [ 2014]     0  2014    58414      289      16       3        0             0 ModemManager
[395953.717952] [ 2032]     0  2032     1222       87       6       3        0             0 systemd-logind
[395953.727440] [ 2050]     0  2050    61456      371      18       3        0             0 NetworkManager
[395953.736927] [ 2068]     0  2068      587       39       5       3        0             0 getty
[395953.745632] [ 2071]     0  2071    57242      173      14       3        0             0 polkitd
[395953.754510] [ 2075]     0  2075      587       40       5       3        0             0 getty
[395953.763216] [ 2078]     0  2078      587       39       5       3        0             0 getty
[395953.771922] [ 2079]     0  2079      587       38       5       3        0             0 getty
[395953.780628] [ 2081]     0  2081      587       40       5       3        0             0 getty
[395953.789334] [ 2101]     0  2101     2061      163       8       4        0         -1000 sshd
[395953.797952] [ 2102]     0  2102      793       57       6       3        0             0 cron
[395953.806583] [ 2159]     0  2159      542       38       5       3        0             0 getty
[395953.815288] [ 2161]     0  2161      587       40       5       3        0             0 getty
[395953.823992] [ 2171]     0  2171     1356      575       6       4        0             0 dhclient
[395953.832956] [ 2175] 65534  2175      845       58       5       3        0             0 dnsmasq
[395953.841834] [ 2265]     0  2265     3249      261      10       3        0             0 sshd
[395953.850451] [ 2278]  1000  2278     3249      262       9       3        0             0 sshd
[395953.859067] [ 2279]  1000  2279      920      176       5       3        0             0 bash
[395953.867686] [ 2289]  1000  2289      862       63       5       3        0             0 screen
[395953.876479] [ 2290]  1000  2290     1063      286       5       3        0             0 screen
[395953.885272] [ 2291]  1000  2291      930      186       5       3        0             0 bash
[395953.893890] [ 2301]  1000  2301     1190      550       6       3        0             0 htop
[395953.902508] [ 2302]  1000  2302      940      197       5       3        0             0 bash
[395953.911126] [ 2358]  1000  2358   447461    46148     163       5        0             0 qemu-system-aar
[395953.920699] [ 2359]  1000  2359   449502    45509     166       4        0             0 qemu-system-aar
[395953.930271] [ 2360]  1000  2360   447461    43753     160       5        0             0 qemu-system-aar
[395953.939854] [ 2361]  1000  2361   447461    46144     161       4        0             0 qemu-system-aar
[395953.949429] [ 2362]  1000  2362   447461    44522     160       5        0             0 qemu-system-aar
[395953.959001] [ 2363]  1000  2363   447461    44311     161       4        0             0 qemu-system-aar
[395953.968574] [ 4600]  1000  4600    19468    12828      42       5        0             0 syz-manager
[395953.977820] [ 4915]  1000  4915    16364     1127      28       3        0             0 qemu-system-aar
[395953.987397] [ 4917]  1000  4917    16364     1127      27       3        0             0 qemu-system-aar
[395953.996972] [ 4918]  1000  4918    16364     1127      28       3        0             0 qemu-system-aar
[395954.006546] [ 4919]  1000  4919    16364     1128      28       3        0             0 qemu-system-aar
[395954.016119] [ 4920]  1000  4920    16364      617      30       3        0             0 qemu-system-aar
[395954.025692] [ 4922]  1000  4922    14028      344      21       3        0             0 qemu-system-aar
[395954.035273] Out of memory: Kill process 2358 (qemu-system-aar) score 12 or sacrifice child
[395954.043659] Killed process 2358 (qemu-system-aar) total-vm:1789844kB, anon-rss:184592kB, file-rss:0kB, shmem-rss:0kB
[395954.055211] qemu-system-aar: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[395954.066817] qemu-system-aar cpuset=/ mems_allowed=0
[395954.067606] oom_reaper: reaped process 2358 (qemu-system-aar), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[395954.081761] CPU: 5 PID: 2358 Comm: qemu-system-aar Not tainted 4.11.0-rc6-next-20170411-dirty #7044
[395954.090886] Hardware name: AMD Seattle (Rev.B0) Development Board (Overdrive) (DT)
[395954.098533] Call trace:
[395954.101072] [<ffff20000808c358>] dump_backtrace+0x0/0x3a8
[395954.106555] [<ffff20000808c860>] show_stack+0x20/0x30
[395954.111692] [<ffff2000085a86f0>] dump_stack+0xbc/0xec
[395954.116830] [<ffff20000821ca4c>] warn_alloc+0x144/0x1d8
[395954.122140] [<ffff20000821d9e8>] __alloc_pages_nodemask+0xe58/0xe68
[395954.128491] [<ffff20000829bd8c>] alloc_pages_current+0xcc/0x218
[395954.134494] [<ffff20000820e770>] __page_cache_alloc+0x128/0x150
[395954.140498] [<ffff200008212648>] filemap_fault+0x768/0x940
[395954.146069] [<ffff2000083caf8c>] ext4_filemap_fault+0x4c/0x68
[395954.151898] [<ffff20000825bac4>] __do_fault+0x44/0xd0
[395954.157033] [<ffff200008264c5c>] __handle_mm_fault+0x12c4/0x1978
[395954.163122] [<ffff200008265514>] handle_mm_fault+0x204/0x388
[395954.168865] [<ffff2000080a3994>] do_page_fault+0x3fc/0x4b0
[395954.174434] [<ffff200008081444>] do_mem_abort+0xa4/0x138
[395954.179827] Exception stack(0xffff80034db07dc0 to 0xffff80034db07ef0)
[395954.186352] 7dc0: 0000000000000000 00006003f67fc000 ffffffffffffffff 00000000004109b0
[395954.194266] 7de0: 0000000060000000 0000000000000020 0000000082000007 00000000004109b0
[395954.202179] 7e00: 0000000041b58ab3 ffff20000955d370 ffff2000080813a0 0000000000000124
[395954.210093] 7e20: 0000000000000049 ffff200008f44000 ffff80034db07e40 ffff200008085f60
[395954.218006] 7e40: ffff80034db07e80 ffff20000808b5a0 0000000000000008 ffff80035dde5e80
[395954.225920] 7e60: ffff80035dde5e80 ffff80035dde64f0 ffff80034db07e80 ffff20000808b580
[395954.233833] 7e80: 0000000000000000 ffff200008083618 0000000000000000 00006003f67fc000
[395954.241746] 7ea0: ffffffffffffffff 000000000078d790 0000000060000000 00006003f6813000
[395954.249659] 7ec0: 0000ffffa685f708 0000000000000001 0000000000000001 0000000000000000
[395954.257569] 7ee0: 0000000000000002 0000000000000000
[395954.262530] [<ffff200008083388>] el0_ia+0x18/0x1c
[395954.267433] Mem-Info:
[395954.269806] active_anon:3308476 inactive_anon:100 isolated_anon:0
[395954.269806]  active_file:98 inactive_file:570 isolated_file:0
[395954.269806]  unevictable:0 dirty:0 writeback:0 unstable:0
[395954.269806]  slab_reclaimable:15503 slab_unreclaimable:143557
[395954.269806]  mapped:264 shmem:155 pagetables:10329 bounce:0
[395954.269806]  free:66173 free_pcp:470 free_cma:0
[395954.303371] Node 0 active_anon:13233904kB inactive_anon:400kB active_file:392kB inactive_file:3320kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1836kB dirty:0kB writeback:0kB shmem:620kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12728320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[395954.331169] Node 0 DMA free:50620kB min:12828kB low:16884kB high:20940kB active_anon:3989600kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:4194304kB managed:4060788kB mlocked:0kB slab_reclaimable:2928kB slab_unreclaimable:10648kB kernel_stack:0kB pagetables:3600kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[395954.363335] lowmem_reserve[]: 0 9958 9958
[395954.367625] Node 0 Normal free:212516kB min:32224kB low:42420kB high:52616kB active_anon:9244644kB inactive_anon:400kB active_file:548kB inactive_file:3828kB unevictable:0kB writepending:0kB present:12582912kB managed:10197388kB mlocked:0kB slab_reclaimable:59084kB slab_unreclaimable:563912kB kernel_stack:4032kB pagetables:37716kB bounce:0kB free_pcp:1840kB local_pcp:0kB free_cma:0kB
[395954.401710] lowmem_reserve[]: 0 0 0
[395954.405298] Node 0 DMA: 531*4kB (UME) 210*8kB (UME) 114*16kB (UME) 34*32kB (ME) 18*64kB (UME) 34*128kB (UME) 46*256kB (UM) 14*512kB (UM) 7*1024kB (UM) 0*2048kB 3*4096kB (M) = 50620kB
[395954.421698] Node 0 Normal: 1840*4kB (UMEH) 1740*8kB (MEH) 496*16kB (ME) 47*32kB (UME) 25*64kB (MEH) 3*128kB (UME) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 61*2048kB (UME) 12*4096kB (M) = 209856kB
[395954.439058] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[395954.447582] 2104 total pagecache pages
[395954.451421] 0 pages in swap cache
[395954.454817] Swap cache stats: add 0, delete 0, find 0/0
[395954.460130] Free swap  = 0kB
[395954.463090] Total swap = 0kB
[395954.466057] 4194304 pages RAM
[395954.469111] 0 pages HighMem/MovableOnly
[395954.473035] 629760 pages reserved
[395954.476436] 4096 pages cma reserved
[395954.480151] qemu-system-aar invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
[395954.489898] qemu-system-aar cpuset=/ mems_allowed=0
[395954.494879] CPU: 5 PID: 2358 Comm: qemu-system-aar Not tainted 4.11.0-rc6-next-20170411-dirty #7044
[395954.504003] Hardware name: AMD Seattle (Rev.B0) Development Board (Overdrive) (DT)
[395954.511651] Call trace:
[395954.514184] [<ffff20000808c358>] dump_backtrace+0x0/0x3a8
[395954.519668] [<ffff20000808c860>] show_stack+0x20/0x30
[395954.524802] [<ffff2000085a86f0>] dump_stack+0xbc/0xec
[395954.529939] [<ffff2000082d66f8>] dump_header+0xd8/0x328
[395954.535248] [<ffff200008215078>] oom_kill_process+0x400/0x6b0
[395954.541078] [<ffff200008215864>] out_of_memory+0x1ec/0x7c0
[395954.546648] [<ffff200008215efc>] pagefault_out_of_memory+0xc4/0xd0
[395954.552911] [<ffff2000080a3a40>] do_page_fault+0x4a8/0x4b0
[395954.558478] [<ffff200008081444>] do_mem_abort+0xa4/0x138
[395954.563872] Exception stack(0xffff80034db07dc0 to 0xffff80034db07ef0)
[395954.570397] 7dc0: 0000000000000000 00006003f67fc000 ffffffffffffffff 00000000004109b0
[395954.578310] 7de0: 0000000060000000 0000000000000020 0000000082000007 00000000004109b0
[395954.586224] 7e00: 0000000041b58ab3 ffff20000955d370 ffff2000080813a0 0000000000000124
[395954.594137] 7e20: 0000000000000049 ffff200008f44000 ffff80034db07e40 ffff200008085f60
[395954.602051] 7e40: ffff80034db07e80 ffff20000808b5a0 0000000000000008 ffff80035dde5e80
[395954.609965] 7e60: ffff80035dde5e80 ffff80035dde64f0 ffff80034db07e80 ffff20000808b580
[395954.617878] 7e80: 0000000000000000 ffff200008083618 0000000000000000 00006003f67fc000
[395954.625791] 7ea0: ffffffffffffffff 000000000078d790 0000000060000000 00006003f6813000
[395954.633704] 7ec0: 0000ffffa685f708 0000000000000001 0000000000000001 0000000000000000
[395954.641614] 7ee0: 0000000000000002 0000000000000000
[395954.646575] [<ffff200008083388>] el0_ia+0x18/0x1c
[395954.651396] Mem-Info:
[395954.653772] active_anon:3308476 inactive_anon:100 isolated_anon:0
[395954.653772]  active_file:98 inactive_file:2390 isolated_file:0
[395954.653772]  unevictable:0 dirty:0 writeback:0 unstable:0
[395954.653772]  slab_reclaimable:15503 slab_unreclaimable:143634
[395954.653772]  mapped:1694 shmem:155 pagetables:10329 bounce:0
[395954.653772]  free:64244 free_pcp:379 free_cma:0
[395954.687511] Node 0 active_anon:13233904kB inactive_anon:400kB active_file:392kB inactive_file:9820kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7036kB dirty:0kB writeback:0kB shmem:620kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12728320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[395954.715375] Node 0 DMA free:50620kB min:12828kB low:16884kB high:20940kB active_anon:3989600kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:4194304kB managed:4060788kB mlocked:0kB slab_reclaimable:2928kB slab_unreclaimable:10648kB kernel_stack:0kB pagetables:3600kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[395954.747565] lowmem_reserve[]: 0 9958 9958
[395954.751679] Node 0 Normal free:204900kB min:32224kB low:42420kB high:52616kB active_anon:9244220kB inactive_anon:400kB active_file:548kB inactive_file:10328kB unevictable:0kB writepending:0kB present:12582912kB managed:10197388kB mlocked:0kB slab_reclaimable:59620kB slab_unreclaimable:564176kB kernel_stack:4032kB pagetables:37716kB bounce:0kB free_pcp:1548kB local_pcp:244kB free_cma:0kB
[395954.786024] lowmem_reserve[]: 0 0 0
[395954.789615] Node 0 DMA: 531*4kB (UME) 210*8kB (UME) 114*16kB (UME) 34*32kB (ME) 18*64kB (UME) 34*128kB (UME) 46*256kB (UM) 14*512kB (UM) 7*1024kB (UM) 0*2048kB 3*4096kB (M) = 50620kB
[395954.806097] Node 0 Normal: 600*4kB (UMEH) 1772*8kB (UMEH) 496*16kB (UME) 53*32kB (UME) 25*64kB (UMH) 3*128kB (UME) 1*256kB (U) 1*512kB (U) 1*1024kB (E) 61*2048kB (UME) 12*4096kB (M) = 204064kB
[395954.823477] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[395954.832055] 3171 total pagecache pages
[395954.835933] 0 pages in swap cache
[395954.839343] Swap cache stats: add 0, delete 0, find 0/0
[395954.844670] Free swap  = 0kB
[395954.847642] Total swap = 0kB
[395954.850614] 4194304 pages RAM
[395954.853671] 0 pages HighMem/MovableOnly
[395954.857603] 629760 pages reserved
[395954.861023] 4096 pages cma reserved
[395954.864611] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[395954.873281] [ 1447]     0  1447      714       74       5       3        0             0 upstart-udev-br
[395954.882868] [ 1450]     0  1450     2758      187       7       3        0         -1000 systemd-udevd
[395954.892294] [ 1833]     0  1833      632       46       5       3        0             0 upstart-socket-
[395954.901882] [ 1847]     0  1847      708       63       5       3        0             0 rpcbind
[395954.910766] [ 1879]   106  1879      737      114       5       3        0             0 rpc.statd
[395954.919856] [ 1984]     0  1984      636       54       5       4        0             0 upstart-file-br
[395954.929462] [ 2000]   103  2000     1152      120       6       3        0             0 dbus-daemon
[395954.938701] [ 2006]     0  2006      720       49       6       3        0             0 rpc.idmapd
[395954.947858] [ 2008]   101  2008    56308      201      12       3        0             0 rsyslogd
[395954.957164] [ 2014]     0  2014    58414      289      16       3        0             0 ModemManager
[395954.966503] [ 2032]     0  2032     1222       87       6       3        0             0 systemd-logind
[395954.976004] [ 2050]     0  2050    61456      371      18       3        0             0 NetworkManager
[395954.985531] [ 2068]     0  2068      587       39       5       3        0             0 getty
[395954.994255] [ 2071]     0  2071    57242      173      14       3        0             0 polkitd
[395955.003154] [ 2075]     0  2075      587       40       5       3        0             0 getty
[395955.011878] [ 2078]     0  2078      587       39       5       3        0             0 getty
[395955.020595] [ 2079]     0  2079      587       38       5       3        0             0 getty
[395955.029322] [ 2081]     0  2081      587       40       5       3        0             0 getty
[395955.038135] [ 2101]     0  2101     2061      163       8       4        0         -1000 sshd
[395955.046800] [ 2102]     0  2102      793       57       6       3        0             0 cron
[395955.055432] [ 2159]     0  2159      542       38       5       3        0             0 getty
[395955.064149] [ 2161]     0  2161      587       40       5       3        0             0 getty
[395955.072884] [ 2171]     0  2171     1356      575       6       4        0             0 dhclient
[395955.081874] [ 2175] 65534  2175      845       58       5       3        0             0 dnsmasq
[395955.090981] [ 2265]     0  2265     3249      261      10       3        0             0 sshd
[395955.099760] [ 2278]  1000  2278     3249      262       9       3        0             0 sshd
[395955.108420] [ 2279]  1000  2279      920      176       5       3        0             0 bash
[395955.117050] [ 2289]  1000  2289      862       63       5       3        0             0 screen
[395955.125870] [ 2290]  1000  2290     1063      286       5       3        0             0 screen
[395955.134674] [ 2291]  1000  2291      930      186       5       3        0             0 bash
[395955.143321] [ 2301]  1000  2301     1190      864       6       3        0             0 htop
[395955.151951] [ 2302]  1000  2302      940      197       5       3        0             0 bash
[395955.160595] [ 2358]  1000  2358   447461        0      76       5        0             0 qemu-system-aar
[395955.170175] [ 2359]  1000  2359   449502    45509     166       4        0             0 qemu-system-aar
[395955.180310] [ 2360]  1000  2360   447461    43753     160       5        0             0 qemu-system-aar
[395955.190467] [ 2361]  1000  2361   447461    46180     161       4        0             0 qemu-system-aar
[395955.200204] [ 2362]  1000  2362   447461    44522     160       5        0             0 qemu-system-aar
[395955.209834] [ 2363]  1000  2363   447461    44311     161       4        0             0 qemu-system-aar
[395955.219818] [ 4600]  1000  4600    19468    13943      42       5        0             0 syz-manager
[395955.229412] [ 4915]  1000  4915    16364     1278      28       3        0             0 qemu-system-aar
[395955.239707] [ 4917]  1000  4917    16364     1196      27       3        0             0 qemu-system-aar
[395955.249837] [ 4918]  1000  4918    16364     1473      28       3        0             0 qemu-system-aar
[395955.260569] [ 4919]  1000  4919    16364     1692      28       3        0             0 qemu-system-aar
[395955.270871] [ 4920]  1000  4920    16364      942      30       3        0             0 qemu-system-aar
[395955.280762] [ 4922]  1000  4922    14028      751      21       3        0             0 qemu-system-aar
[395955.290372] Out of memory: Kill process 2361 (qemu-system-aar) score 13 or sacrifice child
[395955.298858] Killed process 2361 (qemu-system-aar) total-vm:1789844kB, anon-rss:184576kB, file-rss:144kB, shmem-rss:0kB
[395955.324751] oom_reaper: reaped process 2361 (qemu-system-aar), now anon-rss:0kB, file-rss:20kB, shmem-rss:0kB

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-18  8:32             ` Mark Rutland
@ 2017-04-18  9:08               ` Mark Rutland
  2017-04-18 10:30                 ` Suzuki K Poulose
  0 siblings, 1 reply; 17+ messages in thread
From: Mark Rutland @ 2017-04-18  9:08 UTC (permalink / raw)
  To: Suzuki K. Poulose
  Cc: kvm, Marc Zyngier, Andrey Konovalov, will.deacon, linux-kernel,
	kcc, syzkaller, catalin.marinas, dvyukov, Paolo Bonzini, kvmarm,
	linux-arm-kernel

On Tue, Apr 18, 2017 at 09:32:31AM +0100, Mark Rutland wrote:
> Hi Suzuki,
> 
> On Thu, Apr 13, 2017 at 04:50:46PM +0100, Suzuki K. Poulose wrote:
> > kvm: Hold reference to the user address space
> > 
> > The core KVM code uses mmgrab/mmdrop to pin the mm struct of the user
> > application. mmgrab only guarantees that the mm struct stays available,
> > while the "real address space" (see Documentation/vm/active_mm.txt) may
> > be destroyed. Since KVM depends on the user space page tables for the
> > guest pages, we should instead do an mmget/mmput. Even though holding an
> > mmget for an unbounded time is discouraged, KVM is fine to do so, as we
> > are doing it from the context of the same process.
> > 
> > This also prevents the race condition where mmu_notifier_release() could
> > be called in parallel and one instance could end up using a free'd kvm
> > instance.
> > 
> > Cc: Mark Rutland <mark.rutland@arm.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Marc Zyngier <marc.zyngier@arm.com>
> > Cc: Christoffer Dall <christoffer.dall@linaro.org>
> > Cc: andreyknvl@google.com
> > Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> > ---
> >  virt/kvm/kvm_main.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 88257b3..555712e 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  		return ERR_PTR(-ENOMEM);
> >  
> >  	spin_lock_init(&kvm->mmu_lock);
> > -	mmgrab(current->mm);
> > +	mmget(current->mm);
> >  	kvm->mm = current->mm;
> >  	kvm_eventfd_init(kvm);
> >  	mutex_init(&kvm->lock);
> > @@ -685,7 +685,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> >  		kvm_free_memslots(kvm, kvm->memslots[i]);
> >  	kvm_arch_free_vm(kvm);
> > -	mmdrop(current->mm);
> > +	mmput(current->mm);
> >  	return ERR_PTR(r);
> >  }
> >  
> > @@ -747,7 +747,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >  	kvm_arch_free_vm(kvm);
> >  	preempt_notifier_dec();
> >  	hardware_disable_all();
> > -	mmdrop(mm);
> > +	mmput(mm);
> >  }
> 
> 
> As a heads-up, I'm seeing what looks to be a KVM memory leak with this
> patch applied atop of next-20170411.
> 
> I don't yet know if this is a problem with next-20170411 or this patch
> in particular -- I will try to track that down. In the mean time, info
> dump below.
> 
> I left syzkaller running over the weekend using this kernel on the host,
> and OOM kicked in after it had been running for a short while. Almost
> all of my memory is in use, but judging by top, almost none of this is
> associated with processes.

FWIW, it seems easy enough to trigger this leak with kvmtool. Start a
guest that'll allocate a tonne of memory:

$ lkvm run --console virtio -m 4096 --kernel Image \
  -p "memtest=1 console=hvc0,38400 earlycon=uart,mmio,0x3f8"

... then kill this with a SIGKILL from your favourite process management
tool.

Also, if the guest exits normally, everything appears to be cleaned up
correctly. So it looks like this is in some way related to our fatal
signal handling.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-18  9:08               ` Mark Rutland
@ 2017-04-18 10:30                 ` Suzuki K Poulose
  0 siblings, 0 replies; 17+ messages in thread
From: Suzuki K Poulose @ 2017-04-18 10:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kvm, Marc Zyngier, Andrey Konovalov, will.deacon, linux-kernel,
	kcc, syzkaller, catalin.marinas, dvyukov, Paolo Bonzini, kvmarm,
	linux-arm-kernel

On 18/04/17 10:08, Mark Rutland wrote:
> On Tue, Apr 18, 2017 at 09:32:31AM +0100, Mark Rutland wrote:
>> Hi Suzuki,
>>
>> On Thu, Apr 13, 2017 at 04:50:46PM +0100, Suzuki K. Poulose wrote:
>>> kvm: Hold reference to the user address space
>>>
>>> The core KVM code uses mmgrab/mmdrop to pin the mm struct of the user
>>> application. mmgrab only guarantees that the mm struct stays available,
>>> while the "real address space" (see Documentation/vm/active_mm.txt) may
>>> be destroyed. Since KVM depends on the user space page tables for the
>>> guest pages, we should instead do an mmget/mmput. Even though holding an
>>> mmget for an unbounded time is discouraged, KVM is fine to do so, as we
>>> are doing it from the context of the same process.
>>>
>>> This also prevents the race condition where mmu_notifier_release() could
>>> be called in parallel and one instance could end up using a free'd kvm
>>> instance.
>>>
>>> Cc: Mark Rutland <mark.rutland@arm.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Christoffer Dall <christoffer.dall@linaro.org>
>>> Cc: andreyknvl@google.com
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>>  virt/kvm/kvm_main.c | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 88257b3..555712e 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -613,7 +613,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>>>  		return ERR_PTR(-ENOMEM);
>>>
>>>  	spin_lock_init(&kvm->mmu_lock);
>>> -	mmgrab(current->mm);
>>> +	mmget(current->mm);
>>>  	kvm->mm = current->mm;
>>>  	kvm_eventfd_init(kvm);
>>>  	mutex_init(&kvm->lock);
>>> @@ -685,7 +685,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>>>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>>>  		kvm_free_memslots(kvm, kvm->memslots[i]);
>>>  	kvm_arch_free_vm(kvm);
>>> -	mmdrop(current->mm);
>>> +	mmput(current->mm);
>>>  	return ERR_PTR(r);
>>>  }
>>>
>>> @@ -747,7 +747,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>>  	kvm_arch_free_vm(kvm);
>>>  	preempt_notifier_dec();
>>>  	hardware_disable_all();
>>> -	mmdrop(mm);
>>> +	mmput(mm);
>>>  }
>>
>>
>> As a heads-up, I'm seeing what looks to be a KVM memory leak with this
>> patch applied atop of next-20170411.
>>
>> I don't yet know if this is a problem with next-20170411 or this patch
>> in particular -- I will try to track that down. In the mean time, info
>> dump below.

This is indeed a side effect of the new patch. The VCPU doesn't get released
completely even when we close the VCPU fd, because its mmap still holds a
count on it. That keeps a refcount on the KVM instance, which (with the new
patch) in turn holds an mm reference via mmget(). Since the VCPU mmap is only
torn down when the mm goes away, nothing in this cycle can ever be
released. :-(

Suzuki

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds
  2017-04-13 15:50           ` Suzuki K. Poulose
                               ` (2 preceding siblings ...)
  2017-04-18  8:32             ` Mark Rutland
@ 2017-04-20 16:40             ` Suzuki K Poulose
  3 siblings, 0 replies; 17+ messages in thread
From: Suzuki K Poulose @ 2017-04-20 16:40 UTC (permalink / raw)
  To: Marc Zyngier, Andrey Konovalov, Paolo Bonzini
  Cc: rkrcmar, christoffer.dall, linux-arm-kernel, linux-kernel, kvm,
	dvyukov, mark.rutland, kvmarm, catalin.marinas, will.deacon, kcc,
	syzkaller

On Thu, Apr 13, 2017 at 04:50:46PM +0100, Suzuki K. Poulose wrote:
> On Thu, Apr 13, 2017 at 10:17:54AM +0100, Suzuki K Poulose wrote:
> > On 12/04/17 19:43, Marc Zyngier wrote:
> > > On 12/04/17 17:19, Andrey Konovalov wrote:
> > >
> > > Hi Andrey,
> > >
> > > > Apparently this wasn't fixed, I've got this report again on
> > > > linux-next-c4e7b35a3 (Apr 11), which includes 8b3405e34 "kvm:
> > > > arm/arm64: Fix locking for kvm_free_stage2_pgd".
> > >
> > > This looks like a different bug.
> > >
> > > >
> > > > I now have a way to reproduce it, so I can test proposed patches. I
> > > > don't have a simple C reproducer though.
> > > >
> > > > The bug happens when the following syzkaller program is executed:
> > > >
> > > > mmap(&(0x7f0000000000/0xc000)=nil, (0xc000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
> > > > unshare(0x400)
> > > > perf_event_open(&(0x7f000002f000-0x78)={0x1, 0x78, 0x0, 0x0, 0x0, 0x0,
> > > > 0x0, 0x6, 0x0, 0x0, 0xd34, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
> > > > 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffff,
> > > > 0xffffffffffffffff, 0x0)
> > > > r0 = openat$kvm(0xffffffffffffff9c,
> > > > &(0x7f000000c000-0x9)="2f6465762f6b766d00", 0x0, 0x0)
> > > > ioctl$TIOCSBRK(0xffffffffffffffff, 0x5427)
> > > > r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> > > > syz_kvm_setup_cpu$arm64(r1, 0xffffffffffffffff,
> > > > &(0x7f0000dc6000/0x18000)=nil, &(0x7f000000c000)=[{0x0,
> > > > &(0x7f000000c000)="5ba3c16f533efbed09f8221253c73763327fadce2371813b45dd7f7982f84a873e4ae89a6c2bd1af83a6024c36a1ff518318",
> > > > 0x32}], 0x1, 0x0, &(0x7f000000d000-0x10)=[@featur2={0x1, 0x3}], 0x1)
> > >
> > > Is that the only thing the program does? Or is there anything running in
> > > parallel?
> > >
> > > > ==================================================================
> > > > BUG: KASAN: use-after-free in arch_spin_is_locked
> > > > include/linux/compiler.h:254 [inline]
> > > > BUG: KASAN: use-after-free in unmap_stage2_range+0x990/0x9a8
> > > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> > > > Read of size 8 at addr ffff800004476730 by task syz-executor/13106
> > > >
> > > > CPU: 1 PID: 13106 Comm: syz-executor Not tainted
> > > > 4.11.0-rc6-next-20170411-xc2-11025-gc4e7b35a33d4-dirty #5
> > > > Hardware name: Hardkernel ODROID-C2 (DT)
> > > > Call trace:
> > > > [<ffff20000808fd08>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:505
> > > > [<ffff2000080903c0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
> > > > [<ffff2000088df030>] __dump_stack lib/dump_stack.c:16 [inline]
> > > > [<ffff2000088df030>] dump_stack+0x110/0x168 lib/dump_stack.c:52
> > > > [<ffff200008406db8>] print_address_description+0x60/0x248 mm/kasan/report.c:252
> > > > [<ffff2000084072c8>] kasan_report_error mm/kasan/report.c:351 [inline]
> > > > [<ffff2000084072c8>] kasan_report+0x218/0x300 mm/kasan/report.c:408
> > > > [<ffff200008407428>] __asan_report_load8_noabort+0x18/0x20 mm/kasan/report.c:429
> > > > [<ffff2000080db1b8>] arch_spin_is_locked include/linux/compiler.h:254 [inline]
> > >
> > > This is the assert on the spinlock, and the memory is gone.
> > >
> > > > [<ffff2000080db1b8>] unmap_stage2_range+0x990/0x9a8
> > > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:295
> > > > [<ffff2000080db248>] kvm_free_stage2_pgd.part.16+0x30/0x98
> > > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:842
> > > > [<ffff2000080ddfb8>] kvm_free_stage2_pgd
> > > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:838 [inline]
> > >
> > > But we've taken than lock here. There's only a handful of instructions
> > > in between, and the memory can only go away if there is something
> > > messing with us in parallel.
> > >
> > > > [<ffff2000080ddfb8>] kvm_arch_flush_shadow_all+0x40/0x58
> > > > arch/arm64/kvm/../../../arch/arm/kvm/mmu.c:1895
> > > > [<ffff2000080c379c>] kvm_mmu_notifier_release+0x154/0x1d0
> > > > arch/arm64/kvm/../../../virt/kvm/kvm_main.c:472
> > > > [<ffff2000083f2b60>] __mmu_notifier_release+0x1c0/0x3e0 mm/mmu_notifier.c:75
> > > > [<ffff2000083a1fb4>] mmu_notifier_release
> > > > include/linux/mmu_notifier.h:235 [inline]
> > > > [<ffff2000083a1fb4>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
> > > > [<ffff20000810ecd4>] __mmput kernel/fork.c:888 [inline]
> > > > [<ffff20000810ecd4>] mmput+0xdc/0x2e0 kernel/fork.c:910
> > > > [<ffff20000811fda8>] exit_mm kernel/exit.c:557 [inline]
> > > > [<ffff20000811fda8>] do_exit+0x648/0x2020 kernel/exit.c:865
> > > > [<ffff2000081218b4>] do_group_exit+0xdc/0x260 kernel/exit.c:982
> > > > [<ffff20000813adf0>] get_signal+0x358/0xf58 kernel/signal.c:2318
> > > > [<ffff20000808de98>] do_signal+0x170/0xc10 arch/arm64/kernel/signal.c:370
> > > > [<ffff20000808edb4>] do_notify_resume+0xe4/0x120 arch/arm64/kernel/signal.c:421
> > > > [<ffff200008083e68>] work_pending+0x8/0x14
> > >
> > > So we're being serviced with a signal. Do you know if this signal is
> > > generated by your syzkaller program? We could be racing between do_exit
> > > triggered by a fatal signal (this trace) and the closing of the two file
> > > descriptors (vcpu and vm).
> > >
> > > Paolo: does this look possible to you? I can't see what locking we have
> > > that could prevent this race.
> >
> > On a quick look, I see two issues:
> >
> > 1) It looks like the mmu_notifier->ops.release could be called twice for a notifier,
> > from mmu_notifier_unregister() and exit_mmap()->mmu_notifier_release(), which is
> > causing the problem as above.
> >
> > This could possibly be avoided by swapping the order of the following operations
> > in mmu_notifier_unregister():
> >
> >  a) Invoke ops->release under srcu_read_lock()
> >  b) Delete the notifier from the list.
> >
> > which can prevent mmu_notifier_release() from calling ops->release() again
> > before we reach (b).
> >
> >
> > 2) The core KVM code does an mmgrab()/mmdrop() on current->mm to pin the mm_struct. But
> > this doesn't prevent the "real address space" from being destroyed. Since KVM
> > actually depends on the user pages and page tables, it should really/also(?) use
> > mmget()/mmput() (see Documentation/vm/active_mm.txt). I understand that mmget() shouldn't
> > be used for pinning for an unbounded amount of time. But since we do it from within the
> > same process context (like, say, threads), we should be safe to do so.
>

Option 2 doesn't work, as it creates a circular dependency between exit_mmap and
kvm_destroy_vm due to the mmap on the VCPU (which prevents the KVM reference from
being dropped). After a couple of attempts at resolving this issue, we have
something better, below.

----8>----

kvm: Fix mmu_notifier release race

KVM uses an mmu_notifier (wherever available) to keep track of changes
to the userspace mm backing the guest. The guest shadow page tables are
released when the VM exits, via mmu_notifier->ops.release(). There is a
rare chance that mmu_notifier->release could be called more than once,
via two different paths, which could end up in a use-after-free of the
kvm instance.

e.g:

thread A                                        thread B
-------                                         --------------

 get_signal->                                   kvm_destroy_vm()->
 do_exit->                                        mmu_notifier_unregister->
 exit_mm->                                        kvm_arch_flush_shadow_all()->
 exit_mmap->                                      spin_lock(&kvm->mmu_lock)
 mmu_notifier_release->                           ....
  kvm_arch_flush_shadow_all()->                   .....
  ... spin_lock(&kvm->mmu_lock)                   .....
                                                  spin_unlock(&kvm->mmu_lock)
                                                kvm_arch_free_vm()
   *** use after free of kvm ***

This patch attempts to solve the problem by holding a reference to the kvm
for the mmu_notifier, which is dropped only from notifier->ops.release().
This ensures that the kvm struct stays available until we reach
kvm_mmu_notifier_release(), and that kvm_destroy_vm() is called only
from/after it. So we can unregister the notifier with the no_release
option, avoiding the race above. However, we need to make sure that the
kvm is freed only after the mmu_notifier core has finished processing the
notifier, due to the following possible path of execution:

mmu_notifier_release -> kvm_mmu_notifier_release -> kvm_put_kvm ->
  kvm_destroy_vm -> kvm_arch_free_vm

Reported-by: andreyknvl@google.com
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: andreyknvl@google.com
Cc: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 59 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d025074..561e968 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -424,6 +424,7 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
+	struct rcu_head mmu_notifier_rcu;
 #endif
 	long tlbs_dirty;
 	struct list_head devices;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 88257b3..2c3fdd4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,6 +471,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	idx = srcu_read_lock(&kvm->srcu);
 	kvm_arch_flush_shadow_all(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
+	kvm_put_kvm(kvm);
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
@@ -486,8 +487,46 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
 {
+	int rc;
 	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
-	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+	rc = mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+	/*
+	 * We hold a reference to KVM here to make sure that the KVM
+	 * doesn't get free'd before ops->release() completes.
+	 */
+	if (!rc)
+		kvm_get_kvm(kvm);
+	return rc;
+}
+
+static void kvm_free_vm_rcu(struct rcu_head *rcu)
+{
+	struct kvm *kvm = container_of(rcu, struct kvm, mmu_notifier_rcu);
+	kvm_arch_free_vm(kvm);
+}
+
+static void kvm_flush_shadow_mmu(struct kvm *kvm)
+{
+	/*
+	 * We hold a reference to the kvm instance for the mmu_notifier; it is
+	 * only released when ops->release() is called via the exit_mmap path.
+	 * So, when we reach here ops->release() has been called already, which
+	 * flushes the shadow page tables. Hence there is no need to call the
+	 * release() again when we unregister the notifier. However, we need
+	 * to delay freeing up the kvm until the release() completes, since
+	 * we could reach here via :
+	 *  kvm_mmu_notifier_release() -> kvm_put_kvm() -> kvm_destroy_vm()
+	 */
+	mmu_notifier_unregister_no_release(&kvm->mmu_notifier, kvm->mm);
+}
+
+static void kvm_free_vm(struct kvm *kvm)
+{
+	/*
+	 * Wait until the mmu_notifier has finished the release().
+	 * See comments above in kvm_flush_shadow_mmu.
+	 */
+	mmu_notifier_call_srcu(&kvm->mmu_notifier_rcu, kvm_free_vm_rcu);
 }
 
 #else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
@@ -497,6 +536,16 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return 0;
 }
 
+static void kvm_flush_shadow_mmu(struct kvm *kvm)
+{
+	kvm_arch_flush_shadow_all(kvm);
+}
+
+static void kvm_free_vm(struct kvm *kvm)
+{
+	kvm_arch_free_vm(kvm);
+}
+
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 static struct kvm_memslots *kvm_alloc_memslots(void)
@@ -733,18 +782,14 @@ static void kvm_destroy_vm(struct kvm *kvm)
 		kvm->buses[i] = NULL;
 	}
 	kvm_coalesced_mmio_free(kvm);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
-#else
-	kvm_arch_flush_shadow_all(kvm);
-#endif
+	kvm_flush_shadow_mmu(kvm);
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 		kvm_free_memslots(kvm, kvm->memslots[i]);
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
-	kvm_arch_free_vm(kvm);
+	kvm_free_vm(kvm);
 	preempt_notifier_dec();
 	hardware_disable_all();
 	mmdrop(mm);
-- 
2.7.4

 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-04-20 16:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10 13:34 kvm/arm64: use-after-free in kvm_unmap_hva_handler/unmap_stage2_pmds Andrey Konovalov
2017-03-14 11:07 ` Suzuki K Poulose
2017-03-14 16:57   ` Paolo Bonzini
2017-04-12 16:19     ` Andrey Konovalov
2017-04-12 18:43       ` Marc Zyngier
2017-04-12 18:51         ` Andrey Konovalov
2017-04-13  9:34           ` Mark Rutland
2017-04-13 11:53             ` Andrey Konovalov
2017-04-13 13:45               ` Mark Rutland
2017-04-13  9:17         ` Suzuki K Poulose
2017-04-13 15:50           ` Suzuki K. Poulose
2017-04-13 15:53             ` Suzuki K Poulose
2017-04-13 17:06             ` Andrey Konovalov
2017-04-18  8:32             ` Mark Rutland
2017-04-18  9:08               ` Mark Rutland
2017-04-18 10:30                 ` Suzuki K Poulose
2017-04-20 16:40             ` Suzuki K Poulose

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).