[PATCH 0/4] pcpctr: fix percpu_counter_sum vs cpu offline race

* [PATCH 0/4] pcpctr: fix percpu_counter_sum vs cpu offline race
@ 2023-03-15  8:49 Dave Chinner
  2023-03-15  8:49 ` [PATCH 1/4] cpumask: introduce for_each_cpu_or Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Dave Chinner @ 2023-03-15  8:49 UTC (permalink / raw)
  To: linux-kernel, linux-xfs; +Cc: linux-mm, linux-fsdevel, yebin10

Ye Bin reported an XFS assert failure when testing CPU hotplug
recently. During unmount, XFs was asserting that a percpu counter
value should be zero because at that point in time a non-zero value
indicates a space accounting leak which is a bug. The details of
that failure can be found here:

https://lore.kernel.org/linux-kernel/20230314090649.326642-1-yebin@huaweicloud.com/

Ye Bin then proposed changing the XFS code to use
percpu_counter_sum_all(), which surprised me because I didn't know
that function existed at all. Indeed, it was only merged in the
recent 6.3-rc1 merge window because someone else had noticed a
pcpctr sum race with hotplug.

commit f689054aace2 ("percpu_counter: add percpu_counter_sum_all
interface") was introduced via the mm tree. Nobody outside that
scope knew about this, because who bothers to even try to read LKML
these days? There was little list discussion, and I don't see
anything other than a cursory review done on the patch.

At minimum, linux-fsdevel should have been cc'd because multiple
filesystems use percpu counters for both threshold and ENOSPC
accounting in filesystems.  Hence if there is a problem with
percpu_counter_sum() leaking, filesystem developers kinda need to
know about it because leaks like this (as per the XFS bug report)
can actually result in on-disk corruption occurring.

So, now I know that there is an accuracy problem with
percpu_counter_sum(), I will assert that we need to fix it properly
rathern than hack around it by adding a new variant. Leaving people
who know nothing about cpu hotplug to try to work out if they have a
hotplug related issue with their use of percpu_counter_sum() is just
bad code; percpu_counter_sum() should just Do The Right Thing.

Use of the cpu_dying_mask should effectively close this race
condition.  That is, when we take a CPU offline we effectively do:

	mark cpu dying
	clear cpu from cpu_online_mask
	run cpu dead callbacks
	  ....
	  <lock counter>
	  fold pcp count into fbc->count
	  clear pcp count
	  <unlock counter>
	  ...
	mark CPU dead
	clear cpu dying

The race condition occurs because we can run a _sum operation
between the "clear cpu online" mask update and the "pcpctr cpu dead"
notification runs and fold the pcp counter values back into the
global count.  The sum sees that the CPU is not online, so it skips
that CPU even though the count is not zero and hasn't been folded by
the CPU dead notifier. Hence it skips something that it shouldn't.

However, that race condition doesn't exist if we take cpu_dying_mask
into account during the sum.  i.e. percpu_counter_sum() should
iterate every CPU set in either the cpu_online_mask and the
cpu_dying_mask to capture CPUs that are being taken offline.

If the cpu is not set in the dying mask, then the online or offline
state of the CPU is correct an there is no notifier pending over
running and we will skip/sum it correctly.

If the CPU is set in the dying mask, then we need to sum it
regardless of the online mask state or even whether the cpu dead
notifier has run.  If the sum wins the race to the pcp counter on
the dying CPU, it is included in the local sum from the pcp
variable. If the notifier wins the race, it gets folded back into
the global count and zeroed before the sum runs. Then the sum
includes the count in the local sum from the global counter sum
rather than the percpu counter.

Either way, we get the same correct sum value from
percpu_counter_sum() regardless of how it races with a CPU being
removed from the system. And there is no need for
percpu_count_sum_all() anymore.

This series introduces bitmap operations for finding bits set in
either of two bitmasks and adds the for_each_cpu_or() wrapper to
iterate CPUs set in either of the two supplied cpu masks. It then
converts __percpu_counter_sum_mask() to use this, and have
__percpu_counter_sum() pass the cpu_dying_mask as the second mask.
This fixes the race condition with CPUs dying.

It then converts the only user of percpu_counter_sum_all() to use
percpu_counter_sum() as percpu_counter_sum_all() is now redundant,
then it removes percpu_counter_sum_all() and recombines
__percpu_counter_sum_mask() and __percpu_counter_sum().

This effectively undoes all the changes in commit f689054aace2
except for the small change to use for_each_cpu_or() to fold in the
cpu_dying_mask made in this patch set to avoid the problematic race
condition. Hence the cpu unplug race condition is now correctly
handled by percpu_counter_sum(), and people can go back to being
blissfully ignorant of how pcpctrs interact with CPU hotplug (which
is how it should be!).

This has spent the last siz hours running generic/650 on XFS on a
couple of VMs (on 4p, the other 16p) which stresses the filesystem
by running a multi-process fsstress invocation whilst randomly
onlining and offlining CPUs. Hence it's exercising all the percpu
counter cpu dead paths whilst the filesystem is randomly modifying,
reading and summing summing the critical counters that XFS needs for
accurate accounting of resource consumption within the filesystem.

Thoughts, comments and testing welcome!

-Dave.

^ permalink raw reply	[flat|nested] 12+ messages in thread