From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=J4ZE=RP=vger.kernel.org=linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C1D60C43381
	for <linux-kernel@archiver.kernel.org>; Tue, 12 Mar 2019 15:31:24 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 9760D2083D
	for <linux-kernel@archiver.kernel.org>; Tue, 12 Mar 2019 15:31:24 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1726451AbfCLPbX (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Tue, 12 Mar 2019 11:31:23 -0400
Received: from szxga04-in.huawei.com ([45.249.212.190]:4676 "EHLO huawei.com"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1725892AbfCLPbX (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 12 Mar 2019 11:31:23 -0400
Received: from DGGEMS407-HUB.china.huawei.com (unknown [172.30.72.58])
        by Forcepoint Email with ESMTP id 5D1F6C7195CD7E649E36;
        Tue, 12 Mar 2019 23:31:09 +0800 (CST)
Received: from [127.0.0.1] (10.177.29.32) by DGGEMS407-HUB.china.huawei.com
 (10.3.19.207) with Microsoft SMTP Server id 14.3.408.0; Tue, 12 Mar 2019
 23:31:03 +0800
Subject: Re: [RFC] Question about TLB flush while set Stage-2 huge pages
To:     Marc Zyngier <marc.zyngier@arm.com>, <christoffer.dall@arm.com>,
        <catalin.marinas@arm.com>, <will.deacon@arm.com>,
        <suzuki.poulose@arm.com>, <james.morse@arm.com>
CC:     <linux-arm-kernel@lists.infradead.org>,
        <kvmarm@lists.cs.columbia.edu>, <linux-kernel@vger.kernel.org>,
        Wang Haibin <wanghaibin.wang@huawei.com>,
        "yuzenghui@huawei.com" <yuzenghui@huawei.com>,
        <lious.lilei@hisilicon.com>, <lishuo1@hisilicon.com>
References: <5f712cc6-0874-adbe-add6-46f5de24f36f@huawei.com>
 <e2a94937-c324-e2d6-7e61-3f998e6e6e22@arm.com>
From:   Zheng Xiang <zhengxiang9@huawei.com>
Message-ID: <1c0e07b9-73f0-efa4-c1b7-ad81789b42c5@huawei.com>
Date:   Tue, 12 Mar 2019 23:30:23 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101
 Thunderbird/64.0
MIME-Version: 1.0
In-Reply-To: <e2a94937-c324-e2d6-7e61-3f998e6e6e22@arm.com>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.177.29.32]
X-CFilter-Loop: Reflected
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Marc,

On 2019/3/12 19:32, Marc Zyngier wrote:
> Hi Zheng,
> 
> On 11/03/2019 16:31, Zheng Xiang wrote:
>> Hi all,
>>
>> While a page is merged into a transparent huge page, KVM will invalidate Stage-2 for
>> the base address of the huge page and the whole of Stage-1.
>> However, this only invalidates the first page within the huge page; the other
>> pages are not invalidated, see below:
>>
>>     +---------------+--------------+
>>     |abcde       2MB-Page          |
>>     +---------------+--------------+
>>
>>     TLB before setting new pmd:
>>     +---------------+--------------+
>>     |      VA       |    PAGESIZE  |
>>     +---------------+--------------+
>>     |      a        |      4KB     |
>>     +---------------+--------------+
>>     |      b        |      4KB     |
>>     +---------------+--------------+
>>     |      c        |      4KB     |
>>     +---------------+--------------+
>>     |      d        |      4KB     |
>>     +---------------+--------------+
>>
>>     TLB after setting new pmd:
>>     +---------------+--------------+
>>     |      VA       |    PAGESIZE  |
>>     +---------------+--------------+
>>     |      a        |      2MB     |
>>     +---------------+--------------+
>>     |      b        |      4KB     |
>>     +---------------+--------------+
>>     |      c        |      4KB     |
>>     +---------------+--------------+
>>     |      d        |      4KB     |
>>     +---------------+--------------+
>>
>> When the VM accesses address *b*, it will hit the TLB and result in TLB conflict aborts or other potential exceptions.
> 
> That's really bad. I can only imagine two scenarios:
> 
> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
> the PTE table in the process, and place the PMD instead. I can't see
> this happening.
> 
> 2) We fail to invalidate on unmap, and that is slightly less bad (but still
> quite bad).
> 
> Which of the two cases are you seeing?
> 
>> For example, we need to keep track of the VM's dirty memory pages while the VM is in live migration.
>> KVM will set the memslot READONLY and split the huge pages.
>> After live migration is canceled and aborted, the pages will be merged into a THP.
>> Later accesses to these pages, which are READONLY, will cause level-3 Permission Faults until they are invalidated.
>>
>> So should we invalidate the TLB entries for all related pages (e.g. a, b, c, d), like __flush_tlb_range()?
>> Or we can call __kvm_tlb_flush_vmid() to invalidate all TLB entries.
> 
> We should perform an invalidate on each unmap. unmap_stage2_range seems
> to do the right thing. __flush_tlb_range only caters for Stage1
> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
> TLBs for the whole VM.
> 
> I'd really like to understand what you're seeing, and how to reproduce
> it. Do you have a minimal example I could run on my own HW?

When I start live migration for a VM, QEMU begins to log and count dirty pages.
During the live migration, KVM sets the pages READONLY so that we can count how many pages
are written afterwards.
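
As an illustration only, dirty logging by write protection can be modelled in
user space roughly as follows. This is a minimal sketch, not KVM code: the
struct memslot_model type, the NPAGES size and all function names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 8 /* hypothetical memslot size, in 4KB pages */

/* Toy model of one memslot; none of these names are kernel APIs. */
struct memslot_model {
	bool writable[NPAGES];           /* models the Stage-2 write permission */
	uint8_t dirty[(NPAGES + 7) / 8]; /* dirty bitmap, one bit per page */
};

/* Start logging: write-protect every page and clear the bitmap. */
static void start_dirty_logging(struct memslot_model *s)
{
	memset(s->writable, 0, sizeof(s->writable));
	memset(s->dirty, 0, sizeof(s->dirty));
}

/*
 * Model a guest write. A write to a protected page "faults": the page is
 * recorded as dirty and made writable again. Returns true if a fault
 * was taken.
 */
static bool guest_write(struct memslot_model *s, size_t page)
{
	assert(page < NPAGES);
	if (s->writable[page])
		return false;
	s->dirty[page / 8] |= (uint8_t)(1u << (page % 8));
	s->writable[page] = true;
	return true;
}

/* Count the pages written since logging started. */
static size_t count_dirty(const struct memslot_model *s)
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		if (s->dirty[i / 8] & (1u << (i % 8)))
			n++;
	return n;
}
```

In this model, calling start_dirty_logging() and then guest_write() twice on the
same page takes a fault only on the first write, and count_dirty() reports
that page once.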

Everything is OK until I cancel the live migration and QEMU stops logging. Later the VM hangs.
The trace log shows repeated level-3 permission faults caused by a write to the same IPA. After
analyzing the source code, I found that KVM always returns from the *if* statement below in
stage2_set_pmd_huge(), even if we only have a single VCPU:

        /*
         * Multiple vcpus faulting on the same PMD entry, can
         * lead to them sequentially updating the PMD with the
         * same value. Following the break-before-make
         * (pmd_clear() followed by tlb_flush()) process can
         * hinder forward progress due to refaults generated
         * on missing translations.
         *
         * Skip updating the page table if the entry is
         * unchanged.
         */
        if (pmd_val(old_pmd) == pmd_val(*new_pmd))
            return 0;

The PMD already has the PMD_S2_RDWR bit set. I suspect that kvm_tlb_flush_vmid_ipa() does not invalidate
Stage-2 for the subpages of the PMD (except the first PTE of this PMD). Finally, I added some debug
code to flush the TLB for all subpages of the PMD, as shown below:

        /*
         * Mapping in huge pages should only happen through a
         * fault.  If a page is merged into a transparent huge
         * page, the individual subpages of that huge page
         * should be unmapped through MMU notifiers before we
         * get here.
         *
         * Merging of CompoundPages is not supported; they
         * should become splitting first, unmapped, merged,
         * and mapped back in on-demand.
         */
        VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));

        pmd_clear(pmd);
        for (cnt = 0; cnt < 512; cnt++)
            kvm_tlb_flush_vmid_ipa(kvm, addr + cnt*PAGE_SIZE);

With this change, the problem no longer reproduces.
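
To make the suspected behaviour concrete, here is a minimal user-space sketch
of it. It is a toy model under stated assumptions, not kernel code: it assumes
a 4KB granule (512 subpages per 2MB block) and assumes that a by-IPA
invalidate drops only the cached entry covering that one address. The
struct tlb_model type and the model_*() helpers are hypothetical names
invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumes a 4KB granule */
#define SUBPAGES        512u  /* 2MB block / 4KB pages */

/* Toy TLB: one valid bit per cached 4KB Stage-2 entry of one 2MB block. */
struct tlb_model {
	bool valid[SUBPAGES];
};

/* Cache 4KB entries for the first `count` subpages (a, b, c, d, ...). */
static void model_fill(struct tlb_model *t, unsigned int count)
{
	for (unsigned int i = 0; i < SUBPAGES; i++)
		t->valid[i] = i < count;
}

/*
 * Hypothetical stand-in for a by-IPA invalidate. ASSUMPTION: it drops
 * only the entry that covers `ipa`, nothing else in the block.
 */
static void model_flush_ipa(struct tlb_model *t, uint64_t base, uint64_t ipa)
{
	assert(ipa >= base);
	assert(ipa - base < (uint64_t)SUBPAGES * MODEL_PAGE_SIZE);
	t->valid[(ipa - base) / MODEL_PAGE_SIZE] = false;
}

/* Same shape as the debug loop above: flush every subpage of the block. */
static void model_flush_block(struct tlb_model *t, uint64_t base)
{
	for (unsigned int cnt = 0; cnt < SUBPAGES; cnt++)
		model_flush_ipa(t, base, base + (uint64_t)cnt * MODEL_PAGE_SIZE);
}

/* Number of 4KB entries still cached for the block. */
static unsigned int stale_entries(const struct tlb_model *t)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < SUBPAGES; i++)
		n += t->valid[i];
	return n;
}
```

In this model, after model_fill(&t, 4), a single model_flush_ipa() on the base
address leaves the entries for b, c and d cached, while model_flush_block()
leaves none. Whether real hardware behaves like the assumption above is
exactly the open question.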


-- 

Thanks,
Xiang

