* x86 smp_call_function recent breakage
From: Andrea Arcangeli @ 2001-10-23 16:29 UTC
  To: linux-kernel; +Cc: Alan Cox, Linus Torvalds, Robert Macaulay

This isn't the right fix:

diff -urN 2.4.12ac5/arch/i386/kernel/smp.c 2.4.12ac6/arch/i386/kernel/smp.c
--- 2.4.12ac5/arch/i386/kernel/smp.c	Mon Oct 22 02:05:38 2001
+++ 2.4.12ac6/arch/i386/kernel/smp.c	Tue Oct 23 18:06:34 2001
@@ -549,6 +549,7 @@
 	 * code doesn't get confused if it gets an unexpected repeat
 	 * trigger of an old IPI while we're still setting up the new
 	 * one. */
+	wmb();
 	atomic_set(&data->started, 0);
 
 	local_bh_disable();
@@ -563,22 +564,20 @@
 		cpu_relax();
 	}
 
-	/* It is now safe to reuse the "call_data" global, but we need
-	 * to keep local bottom-halves disabled until after waiters have
-	 * been acknowledged to prevent reuse of the per-cpu call data
-	 * entry. */
+	/* We _must_ wait in all cases here, because we cannot drop the
+	 * call lock (and hence allow the call_data pointer to be
+	 * reused) until all CPUs have read the current value. */
+	while (atomic_read(&data->finished) != cpus)
+		barrier();
+
 	spin_unlock(&call_lock);
 
-	if (wait)
-	{
-		while (atomic_read(&data->finished) != cpus)
-		{
-			barrier();
-			cpu_relax();
-		}
-	}
-	local_bh_enable();
+	/* If any of the smp target functions are sufficiently expensive
+	 * that non-waiting IPI is important, we need to add a third
+	 * level to the handshake here to wait for completion of the
+	 * function. */
 
+	local_bh_enable();
 	return 0;
 }
 
@@ -621,9 +620,9 @@
 
 asmlinkage void smp_call_function_interrupt(void)
 {
-	void (*func) (void *info) = call_data->func;
-	void *info = call_data->info;
-	int wait = call_data->wait;
+	void (*func) (void *info);
+	void *info;
+	int wait;
 
 	ack_APIC_irq();
 	/*
@@ -633,11 +632,15 @@
 	 */
 	if (test_and_set_bit(smp_processor_id(), &call_data->started))
 		return;
-	/*
-	 * At this point the info structure may be out of scope unless wait==1
-	 */
+
+	/* We now know that the call_data is valid: */
+	func = call_data->func;
+	info = call_data->info;
+	wait = call_data->wait;
+
 	(*func)(info);
-	if (wait)
-		set_bit(smp_processor_id(), &call_data->finished);
+	/* Once we set this, all of the call_data information may go out
+	 * of scope. */
+	set_bit(smp_processor_id(), &call_data->finished);
 }
 


The right fix is to back out the broken changes that went into
2.4.10pre12, because it's totally pointless to have a per-cpu,
cacheline-aligned array unless you also pass a parameter to the IPI
routine so that it knows which entry to pick up; that way
smp_call_function could scale better and other calls could execute
while we wait for completion. But there is no such parameter, so it's
just wasted memory, and such a scalability optimization would be a very
minor one anyway: if you want a chance to scale, no IPI should run in
the first place.
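
To make the point concrete, the scalable variant would have to look
something like the sketch below. All the names and the "sender"
argument are invented; the real CALL_FUNCTION_VECTOR carries no payload
at all, which is exactly why the per-cpu array buys nothing today:

struct call_slot {
	void (*func) (void *info);
	void *info;
	int wait;
	unsigned long started;	/* bitmask of CPUs that picked up the call */
	unsigned long finished;	/* bitmask of CPUs that completed it */
};
static struct call_slot call_slots[NR_CPUS];

/* Hypothetical: only works if the IPI can tell us who sent it. */
void smp_call_function_irq_with_arg(int sender)
{
	struct call_slot *slot = &call_slots[sender];

	if (test_and_set_bit(smp_processor_id(), &slot->started))
		return;	/* replayed IPI for a call we already serviced */
	slot->func(slot->info);
	if (slot->wait)
		set_bit(smp_processor_id(), &slot->finished);
}

With a real "sender" argument two CPUs could have calls in flight at
the same time; without it the array is just wasted memory.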

So I'm backing out the below one instead of applying the above one:

diff -urN 2.4.10pre11/arch/i386/kernel/smp.c 2.4.10pre12/arch/i386/kernel/smp.c
--- 2.4.10pre11/arch/i386/kernel/smp.c	Tue Sep 18 02:41:56 2001
+++ 2.4.10pre12/arch/i386/kernel/smp.c	Thu Sep 20 01:43:27 2001
@@ -429,9 +429,10 @@
 	atomic_t started;
 	atomic_t finished;
 	int wait;
-};
+} __attribute__ ((__aligned__(SMP_CACHE_BYTES)));
 
 static struct call_data_struct * call_data;
+static struct call_data_struct call_data_array[NR_CPUS];
 
 /*
  * this function sends a 'generic call function' IPI to all other CPUs
@@ -453,33 +454,45 @@
  * hardware interrupt handler, you may call it from a bottom half handler.
  */
 {
-	struct call_data_struct data;
-	int cpus = smp_num_cpus-1;
+	struct call_data_struct *data;
+	int cpus = (cpu_online_map & ~(1 << smp_processor_id()));
 
 	if (!cpus)
 		return 0;
 
-	data.func = func;
-	data.info = info;
-	atomic_set(&data.started, 0);
-	data.wait = wait;
+	data = &call_data_array[smp_processor_id()];
+	
+	data->func = func;
+	data->info = info;
+	data->wait = wait;
 	if (wait)
-		atomic_set(&data.finished, 0);
-
-	spin_lock_bh(&call_lock);
-	call_data = &data;
-	wmb();
+		atomic_set(&data->finished, 0);
+	/* We have do to this one last to make sure that the IPI service
+	 * code doesn't get confused if it gets an unexpected repeat
+	 * trigger of an old IPI while we're still setting up the new
+	 * one. */
+	atomic_set(&data->started, 0);
+
+	local_bh_disable();
+	spin_lock(&call_lock);
+	call_data = data;
 	/* Send a message to all other CPUs and wait for them to respond */
 	send_IPI_allbutself(CALL_FUNCTION_VECTOR);
 
 	/* Wait for response */
-	while (atomic_read(&data.started) != cpus)
+	while (atomic_read(&data->started) != cpus)
 		barrier();
 
+	/* It is now safe to reuse the "call_data" global, but we need
+	 * to keep local bottom-halves disabled until after waiters have
+	 * been acknowledged to prevent reuse of the per-cpu call data
+	 * entry. */
+	spin_unlock(&call_lock);
+
 	if (wait)
-		while (atomic_read(&data.finished) != cpus)
+		while (atomic_read(&data->finished) != cpus)
 			barrier();
-	spin_unlock_bh(&call_lock);
+	local_bh_enable();
 
 	return 0;
 }
@@ -529,18 +542,17 @@
 
 	ack_APIC_irq();
 	/*
-	 * Notify initiating CPU that I've grabbed the data and am
-	 * about to execute the function
+	 * Notify initiating CPU that I've grabbed the data and am about
+	 * to execute the function (and avoid servicing any single IPI
+	 * twice)
 	 */
-	mb();
-	atomic_inc(&call_data->started);
+	if (test_and_set_bit(smp_processor_id(), &call_data->started))
+		return;
 	/*
 	 * At this point the info structure may be out of scope unless wait==1
 	 */
 	(*func)(info);
-	if (wait) {
-		mb();
-		atomic_inc(&call_data->finished);
-	}
+	if (wait)
+		set_bit(smp_processor_id(), &call_data->finished);
 }
 


Robert, this explains the missed IPI during drain_cpu_caches: it isn't
a RAM fault or an IPI missed by the hardware, so I suggest just backing
out the second diff above and trying again. It will be fixed in the
next -aa as well, of course.

Andrea

* Re: x86 smp_call_function recent breakage
From: Alan Cox @ 2001-10-23 16:49 UTC
  To: Andrea Arcangeli; +Cc: linux-kernel, Alan Cox, Linus Torvalds, Robert Macaulay

> This isn't the right fix:

The cache thing you may be right on. The core fix is not about caching,
it's about fixing races on IPI replay. We have to handle an IPI
recurring, potentially with a small time delay, due to a retransmit of
a message lost by one party on the bus. Furthermore, it seems other
messages can pass the message that then gets retransmitted.
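
The guard in the -ac handler boils down to the sketch below (same
primitives as in the diff you quoted, with the surrounding reads
elided):

asmlinkage void smp_call_function_interrupt(void)
{
	ack_APIC_irq();
	/* First delivery on this CPU finds the bit clear and proceeds.
	 * A replayed (retransmitted) delivery finds it already set and
	 * bails out, so no IPI can be serviced twice. */
	if (test_and_set_bit(smp_processor_id(), &call_data->started))
		return;
	/* ... read func/info/wait from call_data and run the call ... */
}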

> Robert, this explains the missed IPI during drain_cpu_caches: it isn't
> a RAM fault or an IPI missed by the hardware, so I suggest just backing
> out the second diff above and trying again. It will be fixed in the
> next -aa as well, of course.

I'm not sure I follow why it explains a missed IPI. Please explain in
more detail.

Alan

* Re: x86 smp_call_function recent breakage
From: Andrea Arcangeli @ 2001-10-23 17:23 UTC
  To: Alan Cox; +Cc: linux-kernel, Linus Torvalds, Robert Macaulay

On Tue, Oct 23, 2001 at 05:49:35PM +0100, Alan Cox wrote:
> > This isn't the right fix:
> 
> The cache thing you may be right on. The core fix is not about caching,
> it's about fixing races on IPI replay. We have to handle an IPI
> recurring, potentially with a small time delay, due to a retransmit of
> a message lost by one party on the bus. Furthermore, it seems other
> messages can pass the message that then gets retransmitted.
> 
> > Robert, this explains the missed IPI during drain_cpu_caches: it isn't
> > a RAM fault or an IPI missed by the hardware, so I suggest just backing
> > out the second diff above and trying again. It will be fixed in the
> > next -aa as well, of course.
> 
> I'm not sure I follow why it explains a missed IPI. Please explain in
> more detail.

The bug is very simple and has nothing to do with hardware details; it
is a plain software bug:

-	if (wait) {
-		mb();
-		atomic_inc(&call_data->finished);
-	}
+	if (wait)
+		set_bit(smp_processor_id(), &call_data->finished);

Since 2.4.10pre12 (and in all previous -ac too) you still use the above
"call_data" for notifying "finished", but you also allowed "call_data"
to be changed under the IPI handler by another incoming IPI (the
scalability optimization), so the first IPI handler will notify the
completion of the second IPI instead of the first one, producing the
"missed IPI" and hence the kernel instability (a deadlock for Robert in
drain_cpu_caches, which waits for all CPUs to "finish").
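
Spelled out as an interleaving (reconstructed from the pre12 code
above):

/*
 * CPU A (first sender)        CPU B (handler)          CPU C (second sender)
 *
 * call_data = &slot_A
 * send IPI, spin on ->started
 *                             sets its bit in
 *                             slot_A->started
 * ->started complete
 * spin_unlock(&call_lock)
 *                             (*func)(info) still
 *                             running
 *                                                      spin_lock(&call_lock)
 *                                                      call_data = &slot_C
 *                             set_bit(...,
 *                                 &call_data->finished)
 *                             ^^^ hits slot_C, not slot_A
 * spins forever on
 * slot_A->finished:
 * the "missed IPI" deadlock
 */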

So you wonder about replayed IPIs or whatever, and you add further
hackery; plus you left in the NR_CPUS array, which is now useless since
there's nothing to scale and we hold the spinlock for the whole
duration of the IPI cycle. Meanwhile the only real fix in your
ac5<->ac6 diff is that you now spin_unlock only after ->finished has
been touched, like the old code correctly did, so you notify completion
of the right IPI and not of the wrong one, and it won't deadlock
anymore.

I don't think it's necessary to delay reading ->func and friends until
after checking ->started, and there's no need to test ->started at all:
we have the guarantee that only one IPI cycle can run at a time because
the spinlock is held for the whole duration of the cycle, so if we are
running in smp_call_function_interrupt we know that ->func and the
->started == 0 fields are coherent and freely usable in memory. We only
need to take care to mb() between reading them and increasing
->started, like the old code correctly did. So I think the old code was
just perfect.
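
For reference, the old handler I'm restoring, reconstructed from the
'-' lines of the two diffs above (the mb()s are the whole handshake):

asmlinkage void smp_call_function_interrupt(void)
{
	void (*func) (void *info) = call_data->func;
	void *info = call_data->info;
	int wait = call_data->wait;

	ack_APIC_irq();
	/*
	 * Notify initiating CPU that I've grabbed the data and am
	 * about to execute the function: the mb() orders the three
	 * reads above against the increment the initiator spins on.
	 */
	mb();
	atomic_inc(&call_data->started);
	/*
	 * At this point the info structure may be out of scope unless wait==1
	 */
	(*func)(info);
	if (wait) {
		mb();
		atomic_inc(&call_data->finished);
	}
}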

Andrea
