Date: Tue, 15 Dec 2020 14:21:31 -0800
From: Andrew Morton
To: aarcange@redhat.com, akpm@linux-foundation.org,
 alex.shi@linux.alibaba.com, alexander.duyck@gmail.com,
 aryabinin@virtuozzo.com, daniel.m.jordan@oracle.com, hannes@cmpxchg.org,
 hughd@google.com, iamjoonsoo.kim@lge.com, jannh@google.com,
 khlebnikov@yandex-team.ru, kirill.shutemov@linux.intel.com,
 kirill@shutemov.name, linux-mm@kvack.org, mgorman@techsingularity.net,
 mhocko@kernel.org, mhocko@suse.com, mika.penttila@nextfour.com,
 minchan@kernel.org, mm-commits@vger.kernel.org, richard.weiyang@gmail.com,
 rong.a.chen@intel.com, shakeelb@google.com, tglx@linutronix.de,
 tj@kernel.org, torvalds@linux-foundation.org, vbabka@suse.cz,
 vdavydov.dev@gmail.com, willy@infradead.org, yang.shi@linux.alibaba.com,
 ying.huang@intel.com
Subject: [patch 19/19] mm/lru: revise the comments of lru_lock
Message-ID: <20201215222131.Un95p7p-j%akpm@linux-foundation.org>
In-Reply-To: <20201215123253.954eca9a5ef4c0d52fd381fa@linux-foundation.org>

From: Hugh Dickins
Subject: mm/lru: revise the comments of lru_lock

Since we changed the pgdat->lru_lock to
lruvec->lru_lock, it's time to fix the incorrect comments in code. Also
fixed some zone->lru_lock comment error from ancient time. etc.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of it
had got separated from shrink_active_list(): move that comment back.

Link: https://lkml.kernel.org/r/1604566549-62481-20-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Hugh Dickins
Signed-off-by: Alex Shi
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Tejun Heo
Cc: Andrey Ryabinin
Cc: Jann Horn
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: Alexander Duyck
Cc: Andrea Arcangeli
Cc: "Chen, Rong A"
Cc: Daniel Jordan
Cc: "Huang, Ying"
Cc: Joonsoo Kim
Cc: Kirill A. Shutemov
Cc: Kirill A. Shutemov
Cc: Konstantin Khlebnikov
Cc: Michal Hocko
Cc: Michal Hocko
Cc: Mika Penttilä
Cc: Minchan Kim
Cc: Shakeel Butt
Cc: Thomas Gleixner
Cc: Vladimir Davydov
Cc: Wei Yang
Cc: Yang Shi
Signed-off-by: Andrew Morton
---

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |   15 ----
 Documentation/admin-guide/cgroup-v1/memory.rst     |   23 ++----
 Documentation/trace/events-kmem.rst                |    2
 Documentation/vm/unevictable-lru.rst               |   22 ++---
 include/linux/mm_types.h                           |    2
 include/linux/mmzone.h                             |    3
 mm/filemap.c                                       |    4 -
 mm/rmap.c                                          |    4 -
 mm/vmscan.c                                        |   41 ++++++-----
 9 files changed, 51 insertions(+), 65 deletions(-)

--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst~mm-lru-revise-the-comments-of-lru_lock
+++ a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFI
 
 8. LRU
 ======
-	Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
--- a/Documentation/admin-guide/cgroup-v1/memory.rst~mm-lru-revise-the-comments-of-lru_lock
+++ a/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -287,20 +287,17 @@ When oom event notifier is registered, e
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
-
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
+
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
--- a/Documentation/trace/events-kmem.rst~mm-lru-revise-the-comments-of-lru_lock
+++ a/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
--- a/Documentation/vm/unevictable-lru.rst~mm-lru-revise-the-comments-of-lru_lock
+++ a/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have bee
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differenti
 swap-backed pages. This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts w
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element). The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a u
 active/inactive LRU lists for vmscan to deal with. vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlo
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages
 unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
--- a/include/linux/mm_types.h~mm-lru-revise-the-comments-of-lru_lock
+++ a/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock. Sometimes used as a generic list
+			 * lruvec->lru_lock. Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
--- a/include/linux/mmzone.h~mm-lru-revise-the-comments-of-lru_lock
+++ a/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struc
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines. There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
--- a/mm/filemap.c~mm-lru-revise-the-comments-of-lru_lock
+++ a/mm/filemap.c
@@ -102,8 +102,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
--- a/mm/rmap.c~mm-lru-revise-the-comments-of-lru_lock
+++ a/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
--- a/mm/vmscan.c~mm-lru-revise-the-comments-of-lru_lock
+++ a/mm/vmscan.c
@@ -1613,14 +1613,16 @@ static __always_inline void update_lru_s
 }
 
 /**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1814,25 +1816,11 @@ static int too_many_isolated(struct pgli
 }
 
 /*
- * This moves pages from @list to corresponding LRU list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_refcount against each page.
- * But we had to alter page->flags anyway.
+ * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
+ * On return, @list is reused as a list of pages to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
-
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
@@ -2010,6 +1998,23 @@ shrink_inactive_list(unsigned long nr_to
 	return nr_reclaimed;
 }
 
+/*
+ * shrink_active_list() moves pages from the active LRU to the inactive LRU.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold lru_lock across the whole operation. But if
+ * the pages are mapped, the processing is slow (page_referenced()), so
+ * we should drop lru_lock around each page. It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_refcount against each page.
+ * But we had to alter page->flags anyway.
+ */
 static void shrink_active_list(unsigned long nr_to_scan,
 			       struct lruvec *lruvec,
 			       struct scan_control *sc,
_