From: Johannes Weiner
To: Peter Zijlstra
Cc: Suren Baghdasaryan, Jared Pochtar, linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH] psi: fix sampling artifact from pressure file read frequency
Date: Tue, 8 Jun 2021 15:03:36 -0400
Message-Id: <20210608190336.77380-1-hannes@cmpxchg.org>

Currently, every time a psi pressure file is read, the per-cpu states
are collected and aggregated according to the CPU's non-idle time
weight. This dynamically changes the sampling period, which means read
frequency can introduce variance into the observed results.
This is somewhat unexpected for users and can be confusing, e.g. when
two nested cgroups with the same workload report different pressure
levels just because they have different consumers reading the pressure
files.

Consider the following two CPU timelines:

	CPU0: [ STALL ] [ SLEEP ]
	CPU1: [  RUN  ] [  RUN  ]

If we sample and aggregate once for the whole period, we get the
following total stall time for CPU0:

	CPU0 = stall(1) * nonidle(1) / nonidle_total(3) = 0.3

But if we sample twice, the total for the period is higher:

	CPU0 = stall(1) * nonidle(1) / nonidle_total(2) = 0.5
	CPU0 = stall(0) * nonidle(0) / nonidle_total(1) = 0
	                                                  ---
	                                                  0.5

Neither answer is inherently wrong: if the user asks for pressure
after half the period, we can't know yet that the CPU will go to sleep
right after and its weight would be lower over the combined period.

We could normalize the weight averaging to a fixed window regardless
of how often stall times themselves are sampled. But that would make
reporting less adaptive to sudden changes when the user intentionally
uses poll() with short intervals in order to get a higher resolution.

For now, simply limit sampling of the pressure file contents to the
fixed two-second period already used by the aggregation worker.

psi_show() still needs to force a catch-up run in case the workload
went idle and periodic aggregation shut off. Since that can race with
periodic aggregation, worker and psi_show() need to fully serialize.
And with sampling becoming exclusive, whoever wins the race must also
requeue the worker if necessary. Move the locking, the aggregation
work, and the worker scheduling logic into update_averages(); use that
from the worker and psi_show().

poll() continues to use the proportional weight sampling window.
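
For illustration only, the arithmetic in the two-CPU example can be
reproduced with a small userspace sketch. This is not kernel code: the
names cpu_sample, aggregate_window(), sample_once() and sample_twice()
are hypothetical, times are in arbitrary units, and floating point is
used purely for readability, where the kernel operates on integer
time deltas.

```c
#include <assert.h>
#include <stddef.h>

/* One CPU's activity within a single sampling window (arbitrary units). */
struct cpu_sample {
	double stall;   /* time spent stalled in this window */
	double nonidle; /* time the CPU was non-idle in this window */
};

/*
 * Aggregate one window: each CPU's stall time is weighted by that
 * CPU's share of the total non-idle time in the same window.
 */
static double aggregate_window(const struct cpu_sample *cpus, size_t ncpus)
{
	double weighted = 0.0, nonidle_total = 0.0;
	size_t i;

	assert(cpus != NULL && ncpus > 0);

	for (i = 0; i < ncpus; i++) {
		weighted += cpus[i].stall * cpus[i].nonidle;
		nonidle_total += cpus[i].nonidle;
	}

	/* Avoid dividing by zero when every CPU was idle. */
	return nonidle_total > 0.0 ? weighted / nonidle_total : 0.0;
}

/* One sample over the whole period: CPU0 stalls then sleeps, CPU1 runs. */
static double sample_once(void)
{
	const struct cpu_sample whole[] = { { 1.0, 1.0 }, { 0.0, 2.0 } };

	return aggregate_window(whole, 2);
}

/* Two samples, one per half of the period, summed afterwards. */
static double sample_twice(void)
{
	const struct cpu_sample first[]  = { { 1.0, 1.0 }, { 0.0, 1.0 } };
	const struct cpu_sample second[] = { { 0.0, 0.0 }, { 0.0, 1.0 } };

	return aggregate_window(first, 2) + aggregate_window(second, 2);
}
```

With the timeline from the example, sample_once() gives 1/3 and
sample_twice() gives 1/2. The same workload reports different totals
depending only on how often it is sampled.
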
Reported-by: Jared Pochtar
Signed-off-by: Johannes Weiner
---
 kernel/sched/psi.c | 83 ++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3cff41f..9d647d974f55 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -358,17 +358,36 @@ static void collect_percpu_times(struct psi_group *group,
 		*pchanged_states = changed_states;
 }
 
-static u64 update_averages(struct psi_group *group, u64 now)
+static void update_averages(struct psi_group *group)
 {
 	unsigned long missed_periods = 0;
-	u64 expires, period;
-	u64 avg_next_update;
+	u64 now, expires, period;
+	u32 changed_states;
 	int s;
 
 	/* avgX= */
+	mutex_lock(&group->avgs_lock);
+
+	now = sched_clock();
 	expires = group->avg_next_update;
-	if (now - expires >= psi_period)
-		missed_periods = div_u64(now - expires, psi_period);
+
+	/*
+	 * Periodic aggregation.
+	 *
+	 * When tasks in the group are active, we make sure to
+	 * aggregate per-cpu samples and calculate the running
+	 * averages at exactly once per PSI_FREQ period.
+	 *
+	 * When tasks go idle, there is no point in keeping the
+	 * workers running, so we shut them down too. Once restarted,
+	 * we backfill zeroes for the missed periods in calc_avgs().
+	 *
+	 * We can get here from inside the aggregation worker, but
+	 * also from psi_show() as userspace may query pressure files
+	 * of an idle group whose aggregation worker shut down.
+	 */
+	if (now < expires)
+		goto unlock;
 
 	/*
 	 * The periodic clock tick can get delayed for various
@@ -377,10 +396,13 @@ static u64 update_averages(struct psi_group *group, u64 now)
 	 * But the deltas we sample out of the per-cpu buckets above
 	 * are based on the actual time elapsing between clock ticks.
 	 */
-	avg_next_update = expires + ((1 + missed_periods) * psi_period);
+	if (now - expires >= psi_period)
+		missed_periods = div_u64(now - expires, psi_period);
 	period = now - (group->avg_last_update + (missed_periods * psi_period));
 	group->avg_last_update = now;
+	group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
 
+	collect_percpu_times(group, PSI_AVGS, &changed_states);
 	for (s = 0; s < NR_PSI_STATES - 1; s++) {
 		u32 sample;
 
@@ -408,42 +430,25 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
-	return avg_next_update;
+	if (changed_states & (1 << PSI_NONIDLE)) {
+		unsigned long delay;
+
+		delay = nsecs_to_jiffies(group->avg_next_update - now) + 1;
+		schedule_delayed_work(&group->avgs_work, delay);
+	}
+unlock:
+	mutex_unlock(&group->avgs_lock);
 }
 
 static void psi_avgs_work(struct work_struct *work)
 {
 	struct delayed_work *dwork;
 	struct psi_group *group;
-	u32 changed_states;
-	bool nonidle;
-	u64 now;
 
 	dwork = to_delayed_work(work);
 	group = container_of(dwork, struct psi_group, avgs_work);
 
-	mutex_lock(&group->avgs_lock);
-
-	now = sched_clock();
-
-	collect_percpu_times(group, PSI_AVGS, &changed_states);
-	nonidle = changed_states & (1 << PSI_NONIDLE);
-	/*
-	 * If there is task activity, periodically fold the per-cpu
-	 * times and feed samples into the running averages. If things
-	 * are idle and there is no data to process, stop the clock.
-	 * Once restarted, we'll catch up the running averages in one
-	 * go - see calc_avgs() and missed_periods.
-	 */
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-
-	if (nonidle) {
-		schedule_delayed_work(dwork, nsecs_to_jiffies(
-				group->avg_next_update - now) + 1);
-	}
-
-	mutex_unlock(&group->avgs_lock);
+	update_averages(group);
 }
 
 /* Trigger tracking window manipulations */
@@ -1029,18 +1034,16 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	/*
+	 * Periodic aggregation should do all the sampling for us, but
+	 * if the workload goes idle, the worker goes to sleep before
+	 * old stalls may have averaged out. Backfill any idle zeroes.
+	 */
+	update_averages(group);
 
 	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
-- 
2.32.0