From: Dave Chinner <david@fromorbit.com>
To: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org
Cc: linux-mm@vger.kernel.org, linux-fsdevel@vger.kernel.org, yebin10@huawei.com
Subject: [PATCH 0/4] pcpctr: fix percpu_counter_sum vs cpu offline race
Date: Wed, 15 Mar 2023 19:49:34 +1100
Message-Id: <20230315084938.2544737-1-david@fromorbit.com>

Ye Bin reported an XFS assert failure when testing CPU hotplug recently.
During unmount, XFS was asserting that a percpu counter value should be
zero, because at that point in time a non-zero value indicates a space
accounting leak, which is a bug. The details of that failure can be
found here:

https://lore.kernel.org/linux-kernel/20230314090649.326642-1-yebin@huaweicloud.com/

Ye Bin then proposed changing the XFS code to use
percpu_counter_sum_all(), which surprised me because I didn't know that
function existed at all. Indeed, it was only merged in the recent
6.3-rc1 merge window because someone else had noticed a pcpctr sum race
with hotplug. Commit f689054aace2 ("percpu_counter: add
percpu_counter_sum_all interface") was introduced via the mm tree.

Nobody outside that scope knew about this, because who bothers to even
try to read LKML these days? There was little list discussion, and I
don't see anything other than a cursory review done on the patch. At
minimum, linux-fsdevel should have been cc'd, because multiple
filesystems use percpu counters for both threshold and ENOSPC
accounting. Hence if there is a problem with percpu_counter_sum()
leaking, filesystem developers kinda need to know about it, because
leaks like this (as per the XFS bug report) can actually result in
on-disk corruption.

So, now that I know there is an accuracy problem with
percpu_counter_sum(), I will assert that we need to fix it properly
rather than hack around it by adding a new variant. Leaving people who
know nothing about cpu hotplug to work out whether they have a hotplug
related issue with their use of percpu_counter_sum() is just bad API
design; percpu_counter_sum() should just Do The Right Thing.

Use of the cpu_dying_mask should effectively close this race condition.
That is, when we take a CPU offline we effectively do:

	mark cpu dying
	clear cpu from cpu_online_mask
	run cpu dead callbacks
	....
	    fold pcp count into fbc->count
	    clear pcp count
	...
	mark CPU dead
	clear cpu dying

The race condition occurs because a _sum operation can run between the
"clear cpu online" mask update and the "pcpctr cpu dead" notifier
folding the pcp counter values back into the global count. The sum sees
that the CPU is not online, so it skips that CPU even though its count
is not zero and hasn't been folded by the CPU dead notifier. Hence it
skips something that it shouldn't.
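To make the window concrete, here is a condensed sketch of the two
racing paths (based on lib/percpu_counter.c as it stands today; the
locking of the global counter list is elided, so treat this as an
illustration rather than the literal source):

/* Summation only visits CPUs still present in cpu_online_mask. */
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	/*
	 * Race window: a CPU that has already been cleared from
	 * cpu_online_mask but whose CPU dead callback has not yet run
	 * is skipped here, so its unfolded pcp count goes missing from
	 * the result.
	 */
	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}

/* CPU dead callback: runs some time after the CPU leaves cpu_online_mask. */
static int percpu_counter_cpu_dead(unsigned int cpu)
{
	struct percpu_counter *fbc;

	/* Walks every registered counter; list locking elided. */
	list_for_each_entry(fbc, &percpu_counters, list) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);

		raw_spin_lock(&fbc->lock);
		fbc->count += *pcount;	/* fold pcp count into fbc->count */
		*pcount = 0;		/* clear pcp count */
		raw_spin_unlock(&fbc->lock);
	}
	return 0;
}

Note that fbc->lock doesn't help here: it serialises the fold against
the sum, but if the fold hasn't happened yet the sum simply never looks
at the dying CPU's counter.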
However, that race condition doesn't exist if we take cpu_dying_mask
into account during the sum. That is, percpu_counter_sum() should
iterate every CPU set in either the cpu_online_mask or the
cpu_dying_mask so that it captures CPUs that are being taken offline.
If the CPU is not set in the dying mask, then its online or offline
state is accurate, there is no notifier pending or running, and we will
skip or sum it correctly. If the CPU is set in the dying mask, then we
need to sum it regardless of the online mask state or whether the cpu
dead notifier has run.

If the sum wins the race to the pcp counter on the dying CPU, that
count is included in the local sum from the pcp variable. If the
notifier wins the race, the count gets folded back into the global
count and zeroed before the sum runs, so the sum picks it up from the
global counter instead of the percpu variable. Either way, we get the
same correct value from percpu_counter_sum() regardless of how it races
with a CPU being removed from the system, and there is no need for
percpu_counter_sum_all() anymore.

This series introduces bitmap operations for finding bits set in either
of two bitmasks and adds the for_each_cpu_or() wrapper to iterate over
CPUs set in either of the two supplied cpu masks (a sketch of the
resulting summation loop is appended after the sign-off). It then
converts __percpu_counter_sum_mask() to use this and has
__percpu_counter_sum() pass the cpu_dying_mask as the second mask,
which fixes the race condition with CPUs dying. It then converts the
only user of percpu_counter_sum_all() to use percpu_counter_sum(), as
percpu_counter_sum_all() is now redundant, then removes
percpu_counter_sum_all() and recombines __percpu_counter_sum_mask()
and __percpu_counter_sum().

This effectively undoes all the changes in commit f689054aace2 except
for the small change made in this patch set to use for_each_cpu_or() to
fold in the cpu_dying_mask and so avoid the problematic race condition.
Hence the cpu unplug race condition is now correctly handled by
percpu_counter_sum(), and people can go back to being blissfully
ignorant of how pcpctrs interact with CPU hotplug (which is how it
should be!).

This has spent the last six hours running generic/650 on XFS on a
couple of VMs (one 4p, the other 16p), which stresses the filesystem by
running a multi-process fsstress invocation whilst randomly onlining
and offlining CPUs. Hence it exercises all the percpu counter cpu dead
paths whilst the filesystem is randomly modifying, reading and summing
the critical counters that XFS needs for accurate accounting of
resource consumption within the filesystem.

Thoughts, comments and testing welcome!

-Dave.
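For anyone who wants to see the shape of the change without digging
into the patches, the summation loop ends up looking roughly like the
sketch below. This is an illustration of the approach, not the literal
patch; for_each_cpu_or() is the iterator wrapper this series adds.

/*
 * Sketch of the fixed summation: visit every CPU set in either
 * cpu_online_mask or cpu_dying_mask. If the dead notifier for a dying
 * CPU hasn't run yet, we pick its count up from the pcp variable; if
 * it has run, that count has already been folded into fbc->count and
 * the pcp variable reads zero. Either way the total is correct.
 */
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}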