Date: Sun, 14 Mar 2021 21:16:35 -0600
From: Yu Zhao
To: Dave Hansen
Cc: linux-mm@kvack.org, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
    Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko,
    Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
    linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
References: <20210313075747.3781593-1-yuzhao@google.com> <20210313075747.3781593-7-yuzhao@google.com>

On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
>
> I'd like to hear a *LOT* more about how this is going to be used.
>
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches. The CPU is pretty
> aggressive about setting no-leaf accessed bits when TLB entries are
> created.
> This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
1) Some applications create very sparse address spaces, for various
   reasons. A notable example is those using Scudo memory allocations:
   they usually have double-digit numbers of PTE entries for each PMD
   entry (and thousands of VMAs for just a few hundred MBs of memory
   usage, sigh...).
2) Scans of an address space (from the reclaim path) are much less
   frequent than context switches of it. Under our heaviest memory
   pressure (30%+ overcommitted; guess how much we've profited from it
   :) ), the two frequencies still differ by orders of magnitude.
   Specifically, on our smallest system (2GB, with PCID), we observed
   no difference between flushing and not flushing the TLB in terms of
   page selections. We actually observed more TLB misses under heavier
   memory pressure, and our theory is that this is due to the increased
   memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
hot tracking and cold tracking. I'll focus on cold tracking, which is
what this series is about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries, as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation above.

Now let's consider the worst case scenario: what happens when we hit a
false negative on a non-leaf PMD entry? We think the pages mapped by
the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
as "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independent of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.
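
To make the cold tracking argument concrete, here is a toy userspace
model of the walk. It is only a sketch and not code from this series:
the names Pmd, touch(), clear_and_collect() and the pmd_cached flag are
made up for illustration, and the only hardware behavior it assumes is
what is described above (the parent's accessed bit is set when the
entry is used for translation, and is not set again while the entry
stays cached).

```python
from dataclasses import dataclass, field

PTRS_PER_PTE = 512  # PTE entries per PMD entry, as in the text above


@dataclass
class Pmd:
    """Toy non-leaf PMD entry: one accessed bit plus 512 leaf accessed bits."""
    accessed: bool = False
    pte_accessed: list = field(default_factory=lambda: [False] * PTRS_PER_PTE)

    def touch(self, idx, pmd_cached=False):
        """Model a CPU access through PTE idx.

        With pmd_cached=True the PMD entry is served from a cache, so its
        accessed bit is not set again (the false negative case).
        """
        self.pte_accessed[idx] = True
        if not pmd_cached:
            self.accessed = True


def clear_and_collect(pmds, use_parent_bit):
    """Clear accessed bits and return (young PTE count, PTEs examined)."""
    young = examined = 0
    for pmd in pmds:
        if use_parent_bit and not pmd.accessed:
            continue  # parent not accessed: skip all 512 children
        pmd.accessed = False  # clear the parent as well as its children
        for i in range(PTRS_PER_PTE):
            examined += 1
            if pmd.pte_accessed[i]:
                young += 1
                pmd.pte_accessed[i] = False
    return young, examined
```

With 100 toy PMD entries of which only two have been touched, the
filtered walk examines 2 * 512 = 1024 PTE entries instead of
100 * 512 = 51200. A touch() with pmd_cached=True after a walk leaves
the parent bit clear, so the next filtered walk skips that PMD entry
while the accessed bit stays set on its PTE entry; that is the false
negative discussed above.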