Re: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline

From: Reinette Chatre <reinette.chatre@intel.com>
To: Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>,
	Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
	Peter Newman <peternewman@google.com>,
	James Morse <james.morse@arm.com>,
	Babu Moger <babu.moger@amd.com>,
	"Drew Fustini" <dfustini@baylibre.com>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline
Date: Wed, 27 Mar 2024 15:37:19 -0700
Message-ID: <c53d6abd-5dbe-4edf-89c0-abbea0df1f1c@intel.com>
In-Reply-To: <20240327184619.236057-1-tony.luck@intel.com>

Hi Tony,

Thank you very much for taking a closer look at this.

On 3/27/2024 11:46 AM, Tony Luck wrote:
> Don't bother looking for another CPU to take over MBM overflow duties
> when the last CPU in a domain goes offline. Doing so results in this
> Oops:
> 
> [   97.166136] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [   97.173118] #PF: supervisor read access in kernel mode
> [   97.178263] #PF: error_code(0x0000) - not-present page
> [   97.183410] PGD 0
> [   97.185438] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [   97.189805] CPU: 36 PID: 235 Comm: cpuhp/36 Tainted: G                T  6.9.0-rc1 #356
> [   97.208322] RIP: 0010:__find_nth_andnot_bit+0x66/0x110
> 
> Fixes: 978fcca954cb ("x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU")
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 757d475158a3..4d9987acffd6 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -929,6 +929,10 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
>  	unsigned long delay = msecs_to_jiffies(delay_ms);
>  	int cpu;
>  
> +	/* Nothing to do if this is the last CPU in a domain going offline */
> +	if (!delay_ms && bitmap_weight(cpumask_bits(&dom->cpu_mask), nr_cpu_ids) == 1)
> +		return;
> +
>  	/*
>  	 * When a domain comes online there is no guarantee the filesystem is
>  	 * mounted. If not, there is no need to catch counter overflow.

While this addresses the scenario you tested, I do not think that it solves the
underlying problem, and thus I believe that there remain other scenarios in which
this same OOPS can be encountered.

For example, I think you will encounter the same OOPS a few lines later within
cqm_setup_limbo_handler() if the system happens to have some busy RMIDs. Another
example would be if tick_nohz_full_mask contains all of the domain's CPUs except
the excluded CPU. In this scenario a bitmap_weight() test will not be sufficient
since it does not give insight into how many CPUs remain after taking
tick_nohz_full_mask into account.
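
To illustrate the second example, here is a minimal user-space sketch (not
kernel code). It assumes a single 64-bit word in place of a struct cpumask, and
the toy_* names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 64-CPU mask: bit n set means CPU n is in the mask. */
typedef uint64_t toy_mask_t;

/* Number of CPUs in the mask, analogous to what bitmap_weight() reports. */
static unsigned int toy_weight(toy_mask_t mask)
{
	unsigned int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * CPUs that could actually take over: in the domain, not nohz_full,
 * and not the CPU that is going offline.
 */
static toy_mask_t toy_candidates(toy_mask_t domain, toy_mask_t nohz_full,
				 unsigned int exclude_cpu)
{
	assert(exclude_cpu < 64);
	return domain & ~nohz_full & ~((toy_mask_t)1 << exclude_cpu);
}
```

With a domain of CPUs 0-3 (0xF), CPUs 1-3 marked nohz_full (0xE) and CPU 0
going offline, toy_weight() of the domain is 4 while toy_candidates() is empty,
so a weight test on the domain mask alone would not catch this case.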

There seem to be two issues here (although I am not familiar with these flows). First,
it seems that tick_nohz_full_mask is not actually allocated unless the user boots
with a "nohz_full=" parameter. This means that any attempt to access bits within
tick_nohz_full_mask will cause this OOPS. Second, even if the mask is allocated, the
buried __ffs() call must not be called with 0, and this check is not done.
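
A minimal user-space sketch of the two guards (not kernel code) could look as
follows. It assumes a 64-bit word per mask, a NULL pointer to model the
unallocated mask, and the GCC/Clang builtin __builtin_ctzll() standing in for
__ffs(). The toy_* names and TOY_NO_CPU are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NO_CPU 64u	/* hypothetical "no CPU found" value, like nr_cpu_ids */

/* Index of the lowest set bit. As with __ffs(), a zero argument is invalid. */
static unsigned int toy_ffs(uint64_t word)
{
	assert(word != 0);
	return (unsigned int)__builtin_ctzll(word);
}

/*
 * First CPU in @mask that is not in @nohz_full. @nohz_full may be NULL,
 * modelling a mask that was never allocated.
 */
static unsigned int toy_first_andnot(uint64_t mask, const uint64_t *nohz_full)
{
	uint64_t remain;

	/* Issue 1: never dereference a mask that does not exist. */
	if (!nohz_full)
		return mask ? toy_ffs(mask) : TOY_NO_CPU;

	/* Issue 2: never hand an empty word to the bit search. */
	remain = mask & ~*nohz_full;
	if (!remain)
		return TOY_NO_CPU;

	return toy_ffs(remain);
}
```

Both early returns are needed: the first avoids touching the missing mask and
the second keeps the bit search away from an empty word.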

To me it seems most appropriate to fix this at the central place to ensure all scenarios
are handled instead of scattering checks.

To that end, what do you think of something like below? It uses a tick_nohz_full_enabled()
check to ensure that tick_nohz_full_mask is actually allocated, while the other changes
aim to avoid calling __ffs() on 0.

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8f40fb35db78..61337f32830c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -72,6 +72,7 @@ static inline unsigned int
 cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
 {
 	unsigned int cpu, hk_cpu;
+	cpumask_var_t cpu_remain;
 
 	if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
 		cpu = cpumask_any(mask);
@@ -85,14 +86,26 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
 	if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu))
 		return cpu;
 
-	/* Try to find a CPU that isn't nohz_full to use in preference */
-	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
-	if (hk_cpu == exclude_cpu)
-		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);
+	/* Do not try to access tick_nohz_full_mask if it has not been allocated. */
+	if (!tick_nohz_full_enabled())
+		return cpu;
+
+	if (!zalloc_cpumask_var(&cpu_remain, GFP_KERNEL))
+		return cpu;
 
+	if (!cpumask_andnot(cpu_remain, mask, tick_nohz_full_mask)) {
+		free_cpumask_var(cpu_remain);
+		return cpu;
+	}
+
+	cpumask_clear_cpu(exclude_cpu, cpu_remain);
+
+	hk_cpu = cpumask_any(cpu_remain);
 	if (hk_cpu < nr_cpu_ids)
 		cpu = hk_cpu;
 
+	free_cpumask_var(cpu_remain);
+
 	return cpu;
 }
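
The intended selection order can be modelled in user space as below. This is
only a sketch under stated assumptions and not the kernel code: masks are single
64-bit words, "any CPU" is taken to be the lowest-numbered one, the model_* names
and MODEL_* constants are made up, __builtin_ctzll() is a GCC/Clang builtin, and
the excluded bit is only cleared when a CPU is actually being excluded (that
guard is part of the model, not of the diff above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS		64u	/* stand-in for nr_cpu_ids */
#define MODEL_PICK_ANY_CPU	(-1)	/* stand-in for RESCTRL_PICK_ANY_CPU */

/* Lowest CPU in the mask, or MODEL_NR_CPUS if the mask is empty. */
static unsigned int model_first(uint64_t mask)
{
	return mask ? (unsigned int)__builtin_ctzll(mask) : MODEL_NR_CPUS;
}

static unsigned int model_any_housekeeping(uint64_t mask, int exclude_cpu,
					   bool nohz_enabled, uint64_t nohz_full)
{
	uint64_t remain = mask;
	unsigned int cpu, hk_cpu;

	assert(exclude_cpu >= MODEL_PICK_ANY_CPU &&
	       exclude_cpu < (int)MODEL_NR_CPUS);

	if (exclude_cpu != MODEL_PICK_ANY_CPU)
		remain &= ~((uint64_t)1 << exclude_cpu);
	cpu = model_first(remain);

	/* The first pick is already a housekeeping CPU. */
	if (cpu < MODEL_NR_CPUS &&
	    !(nohz_enabled && ((nohz_full >> cpu) & 1)))
		return cpu;

	/* No nohz_full mask exists, so there is nothing more to consult. */
	if (!nohz_enabled)
		return cpu;

	/* Prefer a housekeeping CPU, else fall back to the first pick. */
	hk_cpu = model_first(remain & ~nohz_full);
	if (hk_cpu < MODEL_NR_CPUS)
		cpu = hk_cpu;

	return cpu;
}
```

In this model, the last CPU of a domain going offline (mask 0x1, exclude_cpu 0)
returns MODEL_NR_CPUS whether or not nohz_full is enabled, without ever
searching an empty mask.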
 

Reinette