From: Mel Gorman <email@example.com>
To: Andrew Morton <firstname.lastname@example.org>
Cc: Chuck Lever <email@example.com>,
	Jesper Dangaard Brouer <firstname.lastname@example.org>,
	Thomas Gleixner <email@example.com>,
	Peter Zijlstra <firstname.lastname@example.org>,
	Ingo Molnar <email@example.com>,
	Michal Hocko <firstname.lastname@example.org>,
	Vlastimil Babka <email@example.com>,
	Linux-MM <firstname.lastname@example.org>,
	Linux-RT-Users <email@example.com>,
	LKML <firstname.lastname@example.org>,
	Mel Gorman <email@example.com>
Subject: [PATCH 0/9 v5] Use local_lock for pcp protection and reduce stat overhead
Date: Thu, 22 Apr 2021 12:14:32 +0100
Message-ID: <firstname.lastname@example.org>

Changelog since v4
o Dropped local_lock embed patch due to complexity
o Fix !NUMA build
o Avoid adding pages with mt >= MIGRATE_PCPTYPES to non-existent per-cpu list

Changelog since v3
o Preserve NUMA_* counters after CPU hotplug
o Drop "mm/page_alloc: Remove duplicate checks if migratetype should be isolated"
o Add micro-optimisation tracking PFN during free_unref_page_list
o Add Acks

Changelog since v2
o Fix zonestats initialisation
o Merged memory hotplug fix separately
o Embed local_lock within per_cpu_pages

This series requires patches in Andrew's tree so for convenience, it's
also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v5r2

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.
This series splits the locking requirements and uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

	local_irq_disable();
	spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and disable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar to, but worse than, local_lock. Furthermore,
on PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections
for PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking.
As a bonus, local_lock also means that PROVE_LOCKING does something
useful.

After that, it's obvious that zone_statistics incurs too much overhead
and leaves IRQs disabled for longer than necessary on !PREEMPT_RT
kernels. zone_statistics uses perfectly accurate counters requiring IRQs
be disabled for parallel RMW sequences when inaccurate ones like
vm_events would do. The series makes the NUMA statistics (NUMA_HIT and
friends) inaccurate counters that then require no special protection on
!PREEMPT_RT.

The bulk page allocator can then do stat updates in bulk with IRQs
enabled which should improve the efficiency. Technically, this could
have been done without the local_lock and vmstat conversion work and the
order simply reflects the timing of when different series were
implemented.
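To make the "inaccurate counter" idea concrete, here is a minimal userspace
sketch. It is not code from the series and not kernel API: NR_CPUS_SKETCH,
the SK_* event names, numa_event_add() and numa_event_fold() are all
hypothetical names chosen for illustration. It only assumes the general
scheme described above: each CPU adds to its own slot without any special
protection, and a reader sums the slots when a total is wanted.

```c
/*
 * Userspace sketch of per-CPU "inaccurate" event counters.
 * Illustration only: every name here is hypothetical, none is a kernel API.
 */
#include <assert.h>

#define NR_CPUS_SKETCH 4

enum numa_event_sketch { SK_NUMA_HIT, SK_NUMA_MISS, SK_NR_EVENTS };

/*
 * One block of counters per CPU. The 64-byte alignment is an assumed
 * cache line size so that two CPUs never dirty the same line.
 */
struct cpu_events {
	_Alignas(64) unsigned long count[SK_NR_EVENTS];
};

static struct cpu_events events[NR_CPUS_SKETCH];

/*
 * Writer: a plain add on the given CPU's own slot. No lock is taken and
 * nothing is disabled; the caller passes nr so that a batch of pages can
 * be accounted with a single update.
 */
void numa_event_add(unsigned int cpu, enum numa_event_sketch ev,
		    unsigned long nr)
{
	assert(cpu < NR_CPUS_SKETCH);
	events[cpu].count[ev] += nr;
}

/*
 * Reader: fold all per-CPU slots into one total. The result is only a
 * snapshot because writers are not stopped while the sum is taken.
 */
unsigned long numa_event_fold(enum numa_event_sketch ev)
{
	unsigned long sum = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += events[cpu].count[ev];
	return sum;
}
```

In this sketch a bulk allocation of eight pages would be accounted with one
call such as numa_event_add(cpu, SK_NUMA_HIT, 8) rather than eight separate
increments, which is the kind of batching the bulk allocator paragraph
refers to.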
Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so
the locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.

No performance data is included because despite the overhead of the
stats, it's within the noise for most workloads on !PREEMPT_RT. However,
Jesper Dangaard Brouer ran a page allocation microbenchmark on an
E5-1650 v4 @ 3.60GHz CPU on the first version of this series. Focusing
on the array variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

         Baseline        Patched
 1       56.383          54.225 (+3.83%)
 2       40.047          35.492 (+11.38%)
 3       37.339          32.643 (+12.58%)
 4       35.578          30.992 (+12.89%)
 8       33.592          29.606 (+11.87%)
 16      32.362          28.532 (+11.85%)
 32      31.476          27.728 (+11.91%)
 64      30.633          27.252 (+11.04%)
 128     30.596          27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

 drivers/base/node.c    |  18 +--
 include/linux/mmzone.h |  31 +++--
 include/linux/vmstat.h |  65 ++++++-----
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        | 255 ++++++++++++++++++++++++-----------------
 mm/vmstat.c            | 255 ++++++++++++++++-------------------------
 6 files changed, 323 insertions(+), 303 deletions(-)

-- 
2.26.2
Thread overview: 14+ messages
2021-04-22 11:14 Mel Gorman [this message]
2021-04-22 11:14 ` [PATCH 1/9] mm/page_alloc: Split per cpu page lists and zone stats Mel Gorman
2021-04-22 12:19   ` Vlastimil Babka
2021-04-22 11:14 ` [PATCH 2/9] mm/page_alloc: Convert per-cpu list protection to local_lock Mel Gorman
2021-04-22 12:34   ` Vlastimil Babka
2021-04-22 11:14 ` [PATCH 3/9] mm/vmstat: Convert NUMA statistics to basic NUMA counters Mel Gorman
2021-04-22 15:18   ` Vlastimil Babka
2021-04-23  8:32     ` Mel Gorman
2021-04-22 11:14 ` [PATCH 4/9] mm/vmstat: Inline NUMA event counter updates Mel Gorman
2021-04-22 11:14 ` [PATCH 5/9] mm/page_alloc: Batch the accounting updates in the bulk allocator Mel Gorman
2021-04-22 11:14 ` [PATCH 6/9] mm/page_alloc: Reduce duration that IRQs are disabled for VM counters Mel Gorman
2021-04-22 11:14 ` [PATCH 7/9] mm/page_alloc: Explicitly acquire the zone lock in __free_pages_ok Mel Gorman
2021-04-22 11:14 ` [PATCH 8/9] mm/page_alloc: Avoid conflating IRQs disabled with zone->lock Mel Gorman
2021-04-22 11:14 ` [PATCH 9/9] mm/page_alloc: Update PGFREE outside the zone lock in __free_pages_ok Mel Gorman