From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <20221216194904.155675758@redhat.com>
User-Agent: quilt/0.66
Date: Fri, 16 Dec 2022 16:45:43 -0300
From: Marcelo Tosatti
To: atomlin@redhat.com, frederic@kernel.org
Cc: cl@linux.com, tglx@linutronix.de, mingo@kernel.org, peterz@infradead.org, pauld@redhat.com, neelx@redhat.com, oleksandr@natalenko.name, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Marcelo Tosatti
Subject: [PATCH v10 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
References: <20221216194540.202752779@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop:
owner-majordomo@kvack.org
List-ID:

For nohz full CPUs, we'd like the per-CPU vm statistics to be
synchronized when userspace is executing. Otherwise, the vmstat_shepherd
might queue a work item to synchronize them, which is undesired
interference for isolated CPUs.

This means that it's necessary to check for, and possibly sync, the
statistics when returning to userspace. This means that there are now
two execution contexts, on different CPUs, which require awareness
about each other: context switch and the vmstat shepherd kernel thread.

To avoid shared variables between these two contexts (which would
require atomic accesses), delegate the responsibility of statistics
synchronization from vmstat_shepherd to local CPU context, for
nohz_full CPUs.

Do that by queueing a delayed work when marking per-CPU vmstat dirty.

When returning to userspace, fold the stats and cancel the delayed work.

When entering idle, only fold the stats.

Signed-off-by: Marcelo Tosatti

---
 include/linux/vmstat.h   |    4 ++--
 kernel/time/tick-sched.c |    2 +-
 mm/vmstat.c              |   41 ++++++++++++++++++++++++++++++++---------
 3 files changed, 35 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include

 #include "internal.h"

@@ -195,9 +196,26 @@ void fold_vm_numa_events(void)

 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
+int sysctl_stat_interval __read_mostly = HZ;

 static inline void vmstat_mark_dirty(void)
 {
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+	int cpu = smp_processor_id();
+
+	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
+		struct delayed_work *dw;
+
+		dw = this_cpu_ptr(&vmstat_work);
+		if (!delayed_work_pending(dw)) {
+			unsigned long delay;
+
+			delay = round_jiffies_relative(sysctl_stat_interval);
+			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
+		}
+	}
+#endif
 	this_cpu_write(vmstat_dirty, true);
 }

@@ -1886,9 +1904,6 @@ static const struct seq_operations vmsta
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
-int sysctl_stat_interval __read_mostly = HZ;
-
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1973,7 +1988,7 @@ static void vmstat_update(struct work_st
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
  */
-void quiet_vmstat(void)
+void quiet_vmstat(bool user)
 {
 	if (system_state != SYSTEM_RUNNING)
 		return;
@@ -1981,13 +1996,18 @@ void quiet_vmstat(void)
 	if (!is_vmstat_dirty())
 		return;

+	refresh_cpu_vm_stats(false);
+
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+	if (!user)
+		return;
 	/*
-	 * Just refresh counters and do not care about the pending delayed
-	 * vmstat_update. It doesn't fire that often to matter and canceling
-	 * it would be too expensive from this path.
-	 * vmstat_shepherd will take care about that for us.
+	 * If the tick is stopped, cancel any delayed work to avoid
+	 * interruptions to this CPU in the future.
 	 */
-	refresh_cpu_vm_stats(false);
+	if (delayed_work_pending(this_cpu_ptr(&vmstat_work)))
+		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+#endif
 }

 /*
@@ -2009,6 +2029,12 @@ static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+		/* NOHZ full CPUs manage their own vmstat flushing */
+		if (tick_nohz_full_cpu(cpu))
+			continue;
+#endif
+
 		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -290,7 +290,7 @@ extern void dec_zone_state(struct zone *
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);

-void quiet_vmstat(void);
+void quiet_vmstat(bool user);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -403,7 +403,7 @@ static inline void __dec_node_page_state

 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
-static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat(bool user) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_zonestat *pzstats) { }
Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -911,7 +911,7 @@ static void tick_nohz_stop_tick(struct t
 		 */
 		if (!ts->tick_stopped) {
 			calc_load_nohz_start();
-			quiet_vmstat();
+			quiet_vmstat(false);

 			ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 			ts->tick_stopped = 1;
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -1124,6 +1124,19 @@ config PTE_MARKER_UFFD_WP
 	  purposes. It is required to enable userfaultfd write protection on
 	  file-backed memory types like shmem and hugetlbfs.

+config FLUSH_WORK_ON_RESUME_USER
+	bool "Flush per-CPU vmstats on user return (for nohz full CPUs)"
+	depends on NO_HZ_FULL
+	default y
+
+	help
+	  By default, nohz full CPUs flush per-CPU vm statistics on return
+	  to userspace (to avoid additional interference when executing
+	  userspace code). This has a small but measurable impact on
+	  system call performance. You can disable this to improve system call
+	  performance, at the expense of potential interference with userspace
+	  execution.
+
 # multi-gen LRU {
 config LRU_GEN
 	bool "Multi-Gen LRU"
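
The division of labour described in the changelog can be illustrated with a
small userspace model. This is only a sketch, not kernel code: struct
cpu_model and the model_*() helpers are hypothetical names introduced here,
the model assumes CONFIG_FLUSH_WORK_ON_RESUME_USER is enabled, and it assumes
that folding the stats also clears the per-CPU dirty flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one CPU's vmstat bookkeeping. */
struct cpu_model {
	bool nohz_full;     /* CPU belongs to the nohz_full set */
	bool dirty;         /* stands in for the per-CPU vmstat_dirty flag */
	bool work_pending;  /* stands in for delayed_work_pending(vmstat_work) */
	int pending_diffs;  /* per-CPU deltas not yet folded */
	int global;         /* stands in for the global counter */
};

/*
 * Models a counter update followed by vmstat_mark_dirty(): a nohz_full
 * CPU arms its own delayed work on the clean -> dirty transition.
 */
static void model_mark_dirty(struct cpu_model *c, int delta)
{
	c->pending_diffs += delta;
	if (c->nohz_full && !c->dirty && !c->work_pending)
		c->work_pending = true;  /* queue_delayed_work_on() in the patch */
	c->dirty = true;
}

/*
 * Models quiet_vmstat(user): always fold the stats; cancel the delayed
 * work only on return to userspace, not when entering idle.
 */
static void model_quiet(struct cpu_model *c, bool user)
{
	if (!c->dirty)
		return;
	c->global += c->pending_diffs;   /* refresh_cpu_vm_stats() in the patch */
	c->pending_diffs = 0;
	c->dirty = false;                /* assumption: folding clears the flag */
	if (user)
		c->work_pending = false; /* cancel_delayed_work() in the patch */
}

/* Models the shepherd's per-CPU decision: nohz_full CPUs are skipped. */
static bool model_shepherd_queues(const struct cpu_model *c)
{
	if (c->nohz_full)
		return false;
	return !c->work_pending && c->dirty;
}
```

In this model a nohz_full CPU that dirties its counters arms its own work and
is never touched by the shepherd, an idle entry (user == false) folds the
stats but leaves the work armed, and a return to userspace (user == true)
folds the stats and cancels the work. A housekeeping CPU keeps relying on the
shepherd.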