Subject: Re: [v4 PATCH] mm: thp: handle page cache THP correctly in PageTransCompoundMap
To: Matthew Wilcox
Cc: hughd@google.com, aarcange@redhat.com, kirill.shutemov@linux.intel.com,
 gavin.dg@linux.alibaba.com, akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, stable@vger.kernel.org
References: <1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com>
 <20191024135547.GH2963@bombadil.infradead.org>
From: Yang Shi
Date: Thu, 24 Oct 2019 09:33:11 -0700
In-Reply-To: <20191024135547.GH2963@bombadil.infradead.org>

On 10/24/19 6:55 AM, Matthew Wilcox wrote:
> On Thu, Oct 24, 2019 at 05:19:35AM +0800, Yang Shi wrote:
>> We have usecase to use tmpfs as QEMU memory backend and we would like to
>> take the advantage of THP as well. But, our test shows the EPT is not
>> PMD mapped even though the underlying THP are PMD mapped on host.
>> The number showed by /sys/kernel/debug/kvm/largepage is much less than
>> the number of PMD mapped shmem pages as the below:
>>
>> 7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
>> Size: 4194304 kB
>> [snip]
>> AnonHugePages: 0 kB
>> ShmemPmdMapped: 579584 kB
>> [snip]
>> Locked: 0 kB
>>
>> cat /sys/kernel/debug/kvm/largepages
>> 12
>>
>> And some benchmarks do worse than with anonymous THPs.
>>
>> By digging into the code we figured out that commit 127393fbe597 ("mm:
>> thp: kvm: fix memory corruption in KVM with THP enabled") checks if
>> there is a single PTE mapping on the page for anonymous THP when
>> setting up EPT map. But, the _mapcount < 0 check doesn't fit to page
>> cache THP since every subpage of page cache THP would get _mapcount
>> inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns
>> false for page cache THP. This would prevent KVM from setting up PMD
>> mapped EPT entry.
>>
>> So we need handle page cache THP correctly. However, when page cache
>> THP's PMD gets split, kernel just remove the map instead of setting up
>> PTE map like what anonymous THP does. Before KVM calls get_user_pages()
>> the subpages may get PTE mapped even though it is still a THP since the
>> page cache THP may be mapped by other processes at the mean time.
>>
>> Checking its _mapcount and whether the THP has PTE mapped or not.
>> Although this may report some false negative cases (PTE mapped by other
>> processes), it looks not trivial to make this accurate.
> I don't understand why you care how it's mapped into userspace. If there
> is a PMD-sized page in the page cache, then you can use a PMD mapping
> in the EPT tables to map it. Why would another process having a PTE
> mapping on the page cause you to not use a PMD mapping?

We don't care whether the THP is PTE mapped by another process, but neither
the PageDoubleMap flag nor _mapcount/compound_mapcount can tell us whether
the PTE mapping comes from the current process or from another process,
unless gup could return the pmd's status.

I think commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in
KVM with THP enabled") explains the trade-off clearly (this is not the full
commit log, just the most relevant part):

    Ideally instead of the page->_mapcount < 1 check, get_user_pages()
    should return the granularity of the "page" mapping in the "mm" passed
    to get_user_pages().  However it's non trivial change to pass the "pmd"
    status belonging to the "mm" walked by get_user_pages up the stack (up
    to the caller of get_user_pages).  So the fix just checks if there is
    not a single pte mapping on the page returned by get_user_pages, and in
    turn if the caller can assume that the whole compound page is mapped in
    the current "mm" (in a pmd_trans_huge()).  In such case the entire
    compound page is safe to map into the secondary MMU without additional
    get_user_pages() calls on the surrounding tail/head pages.  In addition
    of being faster, not having to run other get_user_pages() calls also
    reduces the memory footprint of the secondary MMU fault in case the pmd
    split happened as result of memory pressure.
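
To make the counting argument concrete, here is a small userspace toy model (not kernel code). It only mimics the behaviour described in this thread: a PMD mapping bumps compound_mapcount, and for page cache THP it also bumps every subpage's _mapcount. All toy_* names, the struct layout, and the 512-subpage size are assumptions made for illustration. In particular, toy_refined_check() is one possible reading of "checking its _mapcount and whether the THP has PTE mapped or not"; it is not claimed to be the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_SUBPAGES 512 /* assumption: 2 MiB THP made of 4 KiB pages */

/* Hypothetical stand-in for a THP; counters start at -1, like the kernel's. */
struct toy_thp {
	bool anon;                     /* anonymous THP or page cache THP */
	int compound_mapcount;         /* number of PMD mappings, minus 1 */
	int mapcount[TOY_NR_SUBPAGES]; /* per-subpage _mapcount, minus 1 */
};

static void toy_init(struct toy_thp *t, bool anon)
{
	t->anon = anon;
	t->compound_mapcount = -1;
	for (int i = 0; i < TOY_NR_SUBPAGES; i++)
		t->mapcount[i] = -1;
}

/* One PMD mapping: page cache THP also increments every subpage. */
static void toy_map_pmd(struct toy_thp *t)
{
	t->compound_mapcount++;
	if (!t->anon)
		for (int i = 0; i < TOY_NR_SUBPAGES; i++)
			t->mapcount[i]++;
}

/* One PTE mapping of a single subpage, from any process. */
static void toy_map_pte(struct toy_thp *t, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_SUBPAGES);
	t->mapcount[idx]++;
}

/* The "_mapcount < 0" check: true only if the subpage has no mapping counted. */
static bool toy_old_check(const struct toy_thp *t, int idx)
{
	return t->mapcount[idx] < 0;
}

/*
 * Assumed refinement: for page cache THP, treat the subpage as PMD mapped
 * only when all of its mappings are accounted for by PMD mappings.
 */
static bool toy_refined_check(const struct toy_thp *t, int idx)
{
	if (t->anon)
		return t->mapcount[idx] < 0;
	return t->mapcount[idx] == t->compound_mapcount;
}
```

In this model a PMD-mapped page cache THP always fails toy_old_check(), which matches the behaviour reported above. With the assumed refinement it passes, until any process adds a PTE mapping to the subpage. The counters cannot say which process that was, which is the false negative case discussed in the thread.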