Subject: Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory
From: Anthony Yznaga
To: Andrea Arcangeli, Mel Gorman
Cc: "Kirill A.
Shutemov" , linux-mm@kvack.org, linux-kernel@vger.kernel.org, aneesh.kumar@linux.ibm.com, akpm@linux-foundation.org, jglisse@redhat.com, khandual@linux.vnet.ibm.com, kirill.shutemov@linux.intel.com, mhocko@kernel.org, minchan@kernel.org, peterz@infradead.org, rientjes@google.com, vbabka@suse.cz, willy@infradead.org, ying.huang@intel.com, nitingupta910@gmail.com References: <1541746138-6706-1-git-send-email-anthony.yznaga@oracle.com> <20181109121318.3f3ou56ceegrqhcp@kshutemo-mobl1> <20181109195150.GA24747@redhat.com> <20181110132249.GH23260@techsingularity.net> <20181110164412.GB22642@redhat.com> <3310b7c3-4bcf-3378-e567-1c9200061c25@oracle.com> Message-ID: <9a06f36b-9068-c86b-bc5d-d9d7972f6540@oracle.com> Date: Thu, 24 Jan 2019 18:28:38 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 MIME-Version: 1.0 In-Reply-To: <3310b7c3-4bcf-3378-e567-1c9200061c25@oracle.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9146 signatures=668682 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1901250017 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/14/18 3:15 PM, anthony.yznaga@oracle.com wrote: > > > On 11/10/2018 08:44 AM, Andrea Arcangeli wrote: >> On Sat, Nov 10, 2018 at 01:22:49PM +0000, Mel Gorman wrote: >>> On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote: >>>> And if you're in the camp that is concerned about the use of more RAM >>>> or/and about the higher latency of COW faults, I'm afraid the >>>> intermediate solution will be still slower than the already available >>>> MADV_NOHUGEPAGE or enabled=madvise. >>>> >>> Does that not prevent huge page usage? Maybe you can spell it out a bit >> Yes it prevents huge page usage, but preventing the huge page usage is >> also what is achieved with the reservation. >> >>> better. What is the set of system calls an application should make to >>> not use huge pages either for the address space or on a per-VMA basis >>> and defer to kcompactd? I know that can be tuned globally but that's not >>> quite the same thing given that multiple applications or containers can >>> be running with different requirements. >> Yes, in terms of inheritance that could be used to tune a container >> we've only PR_SET_THP_DISABLE, and that will render MADV_HUGEPAGE >> useless too, but then for microservices that should not be a >> concern. How to make those sysfs tunables reentrant in namespaces is a >> separate issue I think. >> >> The difference is that with the reservation over time they can be >> promoted, with MADV_NOHUGEPAGE they cannot become hugepages later, not >> even khugepaged will scan that vma anymore. >> >> The benefit of the reservation will showup in those regions that will >> not become hugepages, so if you can predict beforehand that those >> ranges don't benefit from THP, it's better if userland calls >> madvise(MADV_NOHUGEPAGE) on the range and then there's no need to undo >> the reservation later during memory pressure. >> >> The reservation and promotion is a bit like auto-detecting when >> MADV_NOHUGEPAGE should be set, so it boils down of how much of a >> corner case that is. 
>> >> I'm not so concerned about the RAM wasted because I don't think it's >> very significant, after all the application can just do a smaller >> malloc if it wants to reduce memory usage. >> >> A massive amount of huge RAM waste is fairly rare and to the extreme >> it could still be wasted even with 4k if the app uses only 1 bit from >> every 4k page it allocates with malloc. >> >> I'm more concerned about cases where THP is wasting CPU: like in redis >> that is hurted by the 2M COWs. redis will map all pages and they will >> be all promoted to THP also with the reservation logic applied, but >> when the parent writes to the memory (after fork) it must trigger 4k >> cows (not 2M cows) and in turn split the THP before the COW, or it >> won't work as fast as with THP disabled. In addition we should try to >> reuse the same IPI for the transhuge pmd split to cover the COW too. >> >> If we add the reservation and that work makes zero difference for the >> redis corner case, and redis must still use MADV_NOHUGEPAGE, it's not >> great in my view. It looks like we're trying to optimize issues that >> are less critical. >> >> The redis+THP case should be possible to optimize later with uffd WP >> model (once completed, Peter Xu is working on it), and uffd WP will >> also remove fork() and it'll convert it to a clone(). The granularity >> of the fault is decided by the userland that way so when uffd >> wrprotects a 4k fragment of a THP, the THP will be split during the >> uffd mprotect ioctl. >> >>>> Now about the implementation: the whole point of the reservation >>>> complexity is to skip the khugepaged copy, so it can collapse in >>>> place. Is skipping the copy worth it? Isn't the big cost the IPI >>>> anyway to avoid leaving two simultaneous TLB mappings of different >>>> granularity? >>>> >>> Not necessarily. With THP anon in the simple case, it might be just a >>> single thread and kcompact so that's one IPI (kcompactd flushes local and >>> one IPI to the CPU the thread was running on assuming it's not migrating >>> excessively). It would scale up with the number of threads but I suspect >>> the main cost is the actual copying, page table manipulation and the >>> locking required. >> Agreed, the IPI wouldn't be a concern for a single threaded app. I was >> looking more at the worst case scenario. For a single threaded app the >> locking should not be too bad either. >> >>> As an aside, a universal benefit would be looking at reducing the time >>> to allocate the necessary huge page as we know that can be excessive. It >>> would be ortogonal to this series. >> With what I suggested the allocation would happen as usual in >> khugepaged at slow peace, without holding locks. So I don't see >> obvious disadvantages in terms of THP allocation latency. >> >>> Could you and Kirill outline what sort of workloads you would consider >>> acceptable for evaluating this series? One would assume it covers at >>> least the following, potentially with a number of workloads. >> I would prefer to add intelligence to detect when COWs after fork >> should be done at 2m or 4k granularity (in the latter case by >> splitting the pmd before the actual COW while leaving the transhuge >> pmd intact in the other mm), because that would save CPU (and it'd >> automatically optimize redis). The snapshot process especially would >> run faster as it will read with THP performance. 
> And presumably to maintain the performance benefit in subsequent > snapshots the original split PMD would need to be re-promoted > prior to forking or promoted in the child during fork? > >> >> I'm more worried to ensure THP doesn't cause more CPU usage like it >> happens to the above case in COWs, than to just try to save RAM when >> the virtual ranges are only partially utilized by the app. >> >>> 1. Evaluate the collapse and copying costs (probing the entire time >>> spent in collapse_huge_page might do it) >>> 2. Evaluate mmap_sem hold time during hugepage collapse >>> 3. Estimate excessive RAM use due to unnecessary THP usage >>> 4. Estimate the slowdown due to delayed THP usage >>> >>> 1 and 2 would indicate how much time is lost due to not using >>> reservations. That potentially goes in the direction of simply making >>> this faster -- fragmentation reduction (posted but unreviewed), faster >>> compaction searches, better page isolation during compaction to >>> avoid free pages being reused before an order-9 is free. >>> >>> 3 should be straight-forward but 4 would be the hardest to evaluate >>> because it would have to be determimed if 4 is offset by improvements to >>> 1-3. If 1-3 is improved enough, it might remove the motivation for the >>> series entirely. >>> >>> In other words, if we agree on a workload in advance, it might bring >>> this the right direction and not accidentally throw Anthony down a hole >>> working on a series that never gets ack'd. >>> >>> I'm not necessarily the best person to answer because my natural inclination >>> after the fragmentation series would be to keep using thpfiosacle >>> (from the fragmentation avoidance series) and work on improving the THP >>> allocation success rates and reduce latencies. I've tunnel vision on that >>> for the moment. >> Deciding the workloads is a good question indeed, but I would also be >> curious to how many of those pages would not end up to be promoted >> with this logic. >> >> What's the number of pte_none that you require in each pmd to avoid >> promotion? If it's just 1 then apps will run slower, if there's >> partial utilization THP already helps. I've an hard time to think at >> an ideal ratio, this is why max_ptes_none is 511 after all. >> >> Can we start by counting the total number of pte_none() in all pmds >> that can fit a THP according to vma->vm_start/end? The pagetable >> dumper in debugfs may already provide the info we need by scanning all >> mm and by printing the number of "none" pte that would generate >> "wasted" memory (and marginally wasted CPU during copy/clear). >> >> Then you can exactly tell how many pmds won't be promoted to transhuge >> pmds with the patch applied in the real life workloads, even before >> running any benchmark. It'd be good to be sure we're talking about a >> significant number in real life workloads or there's not much to >> optimize to begin with. >> >> If the amount of RAM saved is significant in real life workloads and >> in turn there's a chance of having a worthwhile tradeoff from the >> reservation logic, then we can do the benchmarks because the behavior >> will be different for the page fault, and it'll end up running slower >> with the reservation logic. > > Thank you, Andrea and Mel, for the feedback.  I really appreciate it. > I'm going to proceed as suggested and evaluate the huge page > collapse and copy costs and perform more analysis on the potential > RAM savings. Thanks again to everyone for the feedback. 
To follow up on this, I was unable to find a workload that could justify
these changes. If I had, I suspect that Andrea's suggestion of a THP mode
that simply avoided allocating a hugepage on first fault would have
sufficed.

I did find that khugepaged often spends the most time copying from base
pages to a huge page. Separate from the original intent of mitigating
bloat, I explored using reservations to reduce the time in khugepaged by
allocating them for partially-mmap'd PMD-aligned regions of anon memory
in anticipation of the unmapped portion eventually being mapped (think
the tail portion of a heap). The number of copies avoided was highly
dependent on workload and generally not very high, though, because
either a process was too short-lived for the reservation to be converted
by khugepaged or the process forked and a parent COW forced the
reservation to be released before conversion. Too much overhead for too
little gain. An application is better off using a THP-aware allocator.

Anthony

>
> Anthony
>
>>
>> Thanks,
>> Andrea
>
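
On the closing point about THP-aware allocators, the sketch below (not
part of the original mail) shows what "THP-aware" typically means in
practice: carve arenas out of 2M-aligned mappings, request huge pages
only for the portion expected to be densely used, and keep the sparse
tail as base pages so nothing needs to be split or reclaimed later. The
name arena_alloc and the split between dense and sparse portions are
illustrative assumptions, not how any particular allocator does it.

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE (2UL << 20)

/* Allocate a 2M-aligned arena of 'hpages' huge-page-sized chunks. */
static void *arena_alloc(size_t hpages, size_t dense_hpages)
{
	size_t len = hpages * HPAGE;

	/* Over-allocate, then trim the mapping to a 2M-aligned window. */
	char *raw = mmap(NULL, len + HPAGE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;
	char *arena = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));
	size_t front = (size_t)(arena - raw);
	if (front)
		munmap(raw, front);
	munmap(arena + len, HPAGE - front);

	/* Dense head: eligible for THP at fault time and for khugepaged. */
	madvise(arena, dense_hpages * HPAGE, MADV_HUGEPAGE);
	/* Sparse tail: keep it as base pages, as suggested in the thread. */
	madvise(arena + dense_hpages * HPAGE,
		(hpages - dense_hpages) * HPAGE, MADV_NOHUGEPAGE);
	return arena;
}

int main(void)
{
	void *a = arena_alloc(16, 12);	/* 32M arena, 24M THP-eligible */
	printf("arena at %p\n", a);
	return 0;
}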