From: Johannes Weiner
To: Domenico Cerasuolo
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, surenb@google.com,
	brauner@kernel.org, chris@chrisdown.name
Subject: Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period
Date: Wed, 29 Mar 2023 15:05:32 -0400
References: <20230329153327.140215-1-cerasuolodomenico@gmail.com>
 <20230329153327.140215-5-cerasuolodomenico@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

On Wed, Mar 29, 2023 at 08:32:12PM +0200, Domenico Cerasuolo wrote:
> On Wed, Mar 29, 2023 at 6:13 PM Johannes Weiner wrote:
>
> > On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> > > PSI offers 2 mechanisms to get information about a specific resource
> > > pressure. One is reading from /proc/pressure/<resource>, which gives
> > > average pressures aggregated every 2s. The other is creating a pollable
> > > fd for a specific resource and cgroup.
> > >
> > > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > > possibility to pick specific time window and threshold, spawning an RT
> > > thread to aggregate the data.
> > >
> > > Systemd would like to provide containers the option to monitor pressure
> > > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > > container that itself then launches services, the container should have
> > > the ability to poll() for pressure in individual services. But neither
> > > the container nor the services are privileged.
> > >
> > > This patch implements a mechanism to allow unprivileged users to create
> > > pressure triggers. The difference with privileged triggers creation is
> > > that unprivileged ones must have a time window that's a multiple of 2s.
> > > This is so that we can avoid unrestricted spawning of rt threads, and
> > > use instead the same aggregation mechanism done for the averages, which
> > > runs independently of any triggers.
> > >
> > > Suggested-by: Johannes Weiner
> > > Signed-off-by: Domenico Cerasuolo
> >
> > Overall it looks good to me. Thanks for adding the comment on the
> > privilege check, it's much easier to understand now.
> >
> > A few nitpicks below:
> >
> > > @@ -151,6 +151,9 @@ struct psi_trigger {
> > >
> > >  	/* Deferred event(s) from previous ratelimit window */
> > >  	bool pending_event;
> > > +
> > > +	/* Used to differentiate destruction action*/
> > > +	enum psi_aggregators aggregator;
> > >  };
> >
> > The comment is a bit mysterious. How about
> >
> > 	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
> >
> > > @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> > >  		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > >  	group->avg_last_update = sched_clock();
> > >  	group->avg_next_update = group->avg_last_update + psi_period;
> > > -	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >  	mutex_init(&group->avgs_lock);
> > > -	/* Init trigger-related members */
> > > +
> > > +	/* Init rtpoll trigger-related members */
> > >  	atomic_set(&group->rtpoll_scheduled, 0);
> > >  	mutex_init(&group->rtpoll_trigger_lock);
> > >  	INIT_LIST_HEAD(&group->rtpoll_triggers);
> > > @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> > >  	init_waitqueue_head(&group->rtpoll_wait);
> > >  	timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> > >  	rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +
> > > +	/* Init avg trigger-related members */
> > > +	INIT_LIST_HEAD(&group->avg_triggers);
> > > +	memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> > > +	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >  }
> >
> > Can you move those above the rtpoll inits?
> >
> > It helps with navigating the code and spotting missing inits when the
> > init sequence follows the order of the struct members.
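(As an aside, the userspace side of this doesn't change: it's still the
trigger interface from Documentation/accounting/psi.rst, roughly like the
sketch below. The cgroup path and the thresholds are made-up examples;
what this patch adds is that a caller without CAP_SYS_RESOURCE must use a
window that's a multiple of 2s, e.g. the 2000000us here.)

	/*
	 * Minimal sketch of an unprivileged PSI monitor. The cgroup path
	 * is made up; "some 150000 2000000" asks for an event when memory
	 * stalls exceed 150ms within a 2s window.
	 */
	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char trig[] = "some 150000 2000000";
		struct pollfd fds;
		int fd;

		fd = open("/sys/fs/cgroup/mycontainer/memory.pressure",
			  O_RDWR | O_NONBLOCK);
		if (fd < 0)
			return 1;
		if (write(fd, trig, strlen(trig) + 1) < 0) {
			close(fd);
			return 1;
		}

		fds.fd = fd;
		fds.events = POLLPRI;
		while (poll(&fds, 1, -1) > 0) {
			if (fds.revents & POLLERR)
				break;	/* monitored cgroup went away */
			if (fds.revents & POLLPRI)
				printf("memory pressure event\n");
		}
		close(fd);
		return 0;
	}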
> >
> > > @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> > >  	return growth;
> > >  }
> > >
> > > -static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total)
> > > +static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total,
> > > +			   enum psi_aggregators aggregator)
> > >  {
> > >  	struct psi_trigger *t;
> > > -	u64 *total = group->total[PSI_POLL];
> > > +	u64 *total = group->total[aggregator];
> > > +	struct list_head *triggers = aggregator == PSI_AVGS ? &group->avg_triggers
> > > +		: &group->rtpoll_triggers;
> > > +	u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;
> > >  	*update_total = false;
> >
> > These lines are a bit too long. When the init part gets that long,
> > it's preferable to move it outside of the decl block:
> >
> > 	if (aggregator == PSI_AVGS) {
> > 		triggers = &group->avg_triggers;
> > 		aggregator_total = group->avg_total;
> > 	} else {
> > 		triggers = &group->rtpoll_triggers;
> > 		aggregator_total = group->rtpoll_total;
> > 	}
> >
> > > @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >  		u64 period = ULLONG_MAX;
> > >
> > >  		list_del(&t->node);
> > > -		group->rtpoll_nr_triggers[t->state]--;
> > > -		if (!group->rtpoll_nr_triggers[t->state])
> > > -			group->rtpoll_states &= ~(1 << t->state);
> > > -		/* reset min update period for the remaining triggers */
> > > -		list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > -			period = min(period, div_u64(tmp->win.size,
> > > -					UPDATES_PER_WINDOW));
> > > -		group->rtpoll_min_period = period;
> > > -		/* Destroy rtpoll_task when the last trigger is destroyed */
> > > -		if (group->rtpoll_states == 0) {
> > > -			group->rtpoll_until = 0;
> > > -			task_to_destroy = rcu_dereference_protected(
> > > -					group->rtpoll_task,
> > > -					lockdep_is_held(&group->rtpoll_trigger_lock));
> > > -			rcu_assign_pointer(group->rtpoll_task, NULL);
> > > -			del_timer(&group->rtpoll_timer);
> > > +		if (t->aggregator == PSI_AVGS) {
> > > +			group->avg_nr_triggers[t->state]--;
> > > +		} else {
> > > +			group->rtpoll_nr_triggers[t->state]--;
> > > +			if (!group->rtpoll_nr_triggers[t->state])
> > > +				group->rtpoll_states &= ~(1 << t->state);
> > > +			/* reset min update period for the remaining triggers */
> > > +			list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > +				period = min(period, div_u64(tmp->win.size,
> > > +						UPDATES_PER_WINDOW));
> > > +			group->rtpoll_min_period = period;
> > > +			/* Destroy rtpoll_task when the last trigger is destroyed */
> > > +			if (group->rtpoll_states == 0) {
> > > +				group->rtpoll_until = 0;
> > > +				task_to_destroy = rcu_dereference_protected(
> > > +						group->rtpoll_task,
> > > +						lockdep_is_held(&group->rtpoll_trigger_lock));
> > > +				rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +				del_timer(&group->rtpoll_timer);
> >
> > These lines are quite long too, we usually shoot for a line length of
> > 80 characters. Can you do
> >
> > 	if (t->aggregator == PSI_AVGS) {
> > 		group->avg_nr_triggers[t->state]--;
> > 		return;
> > 	}
> >
> > 	/* Else, it's an rtpoll trigger */
> > 	group->rtpoll_nr_triggers[t->state]--;
> > 	...
>
> Can't return there I think, the function doesn't end after the else branch,
> should I put a `goto out` instead to jump the rtpoll code?

You're right, I missed the bottom part beyond the diff.

Looking closer, I think trigger_create and trigger_destroy are actually buggy.
They have to protect against update_triggers(), so both creation and
destruction must take the right lock - avgs_lock or rtpoll_trigger_lock -
before modifying the list. They're both taking only the rtpoll_trigger_lock
right now.

IOW the trigger type distinction needs to be higher up in general.

	if (t->aggregator == PSI_AVGS) {
		mutex_lock(&group->avgs_lock);
		...
	} else {
		mutex_lock(&group->rtpoll_trigger_lock);
		...
	}
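Spelled out a bit more for both the create and destroy paths - this is only
a sketch, with the bookkeeping elided and a hypothetical `privileged` flag
standing in for however psi_trigger_create() learns about the caller, not
the final code:

	/* creation: add to the list that matches the trigger type,
	   under that list's lock */
	if (privileged) {
		t->aggregator = PSI_POLL;
		mutex_lock(&group->rtpoll_trigger_lock);
		list_add(&t->node, &group->rtpoll_triggers);
		group->rtpoll_nr_triggers[t->state]++;
		/* ... recompute rtpoll_min_period, start the RT task ... */
		mutex_unlock(&group->rtpoll_trigger_lock);
	} else {
		t->aggregator = PSI_AVGS;
		mutex_lock(&group->avgs_lock);
		list_add(&t->node, &group->avg_triggers);
		group->avg_nr_triggers[t->state]++;
		mutex_unlock(&group->avgs_lock);
	}

	/* destruction mirrors it, so update_triggers() never walks a
	   list that's being modified under the wrong lock */
	if (t->aggregator == PSI_AVGS) {
		mutex_lock(&group->avgs_lock);
		list_del(&t->node);
		group->avg_nr_triggers[t->state]--;
		mutex_unlock(&group->avgs_lock);
	} else {
		mutex_lock(&group->rtpoll_trigger_lock);
		list_del(&t->node);
		group->rtpoll_nr_triggers[t->state]--;
		/* ... recompute min period, tear down the RT task on the
		   last trigger ... */
		mutex_unlock(&group->rtpoll_trigger_lock);
	}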