From: Shakeel Butt
Date: Wed, 16 Aug 2023 10:11:20 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Yosry Ahmed
Cc: Tejun Heo, Michal Hocko, Johannes Weiner, Roman Gushchin,
	Andrew Morton, Muchun Song, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ivan Babrou

On Tue, Aug 15, 2023 at 7:20 PM Yosry Ahmed wrote:
>
> [...]
>
> The problem in (1) is that, first of all, it's a behavioral change: we
> start having explicit staleness in the stats, and userspace needs to
> adapt by explicitly requesting a flush. A node controller can be
> enlightened to do so, but on a system with a lot of cgroups, if you
> flush once explicitly and then iterate through all cgroups, the flush
> will be stale by the time you reach the last cgroup. Keep in mind that
> there are also users that read their own stats; figuring out which
> users need to flush explicitly vs. read cached stats is a problem.

I thought we covered the invalidity of the staleness argument.
Similar staleness can happen today, so this is not strictly a
behavioral change, and we can change the time window and condition of
the periodic flush to reduce the chance of staleness. Option 2 can
face staleness as well.

>
> Taking a step back, the total work that needs to be done does not
> change with (2). A node controller iterating cgroups and reading
> their stats will do the same amount of flushing; it will just be
> distributed across multiple read syscalls, over shorter intervals in
> kernel space.

You seem to be worried about very fine-grained staleness of the stats.
For scenarios where the stats of multi-level cgroups need to be read
while the workload is continuously updating them, the total work can
be much greater. For example, if we are reading the stats of the root
and the top-level memcgs, option 2 can potentially flush the stats for
the whole tree twice.

>
> There are also in-kernel flushers (e.g. reclaim and dirty throttling)
> that will benefit from (2) by reading more accurate stats without
> having to flush the entire tree. The behavior is currently
> nondeterministic: you may get fresh or stale stats, and you may flush
> one cgroup or 100 cgroups.
>
> I think with (2) we make fewer compromises in terms of accuracy and
> determinism, and it's a less disruptive change for userspace.

These options are not black and white, and there can be something in
between, but let me be very clear on what I don't want and would NACK:
a global sleepable lock which can be taken by potentially any
application running on the system. We have seen similar global locks
cause isolation and priority-inversion issues in production. So, no
other lock which needs to be taken under extreme conditions (reading
stats under OOM) by a high-priority task (the node controller) while
it might be held by a low-priority task.