From: Andrea Parri <parri.andrea@gmail.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: paulmck@kernel.org, palmer@dabbelt.com, paul.walmsley@sifive.com,
	aou@eecs.berkeley.edu, mmaas@google.com, hboehm@google.com,
	striker@us.ibm.com, charlie@rivosinc.com, rehn@rivosinc.com,
	linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] membarrier: riscv: Provide core serializing command
Date: Tue, 28 Nov 2023 16:13:56 +0100
Message-ID: <ZWYDtB/otYvTMZWd@andrea>
In-Reply-To: <91ab0210-07f9-42c4-af7f-a98799250cf7@efficios.com>

> I am concerned about the possibility that this change lacks two barriers in the
> following scenario:
> 
> On a transition from uthread -> uthread on [CPU 0], from a thread belonging to
> another mm to a thread belonging to the mm [!mm -> mm] for which a concurrent
> membarrier sync-core is done on [CPU 1]:
> 
> - [CPU 1] sets all bits in the mm icache_stale_mask [A]. There are no barriers
>   associated with these stores.
> 
> - [CPU 0] store to rq->curr [B] (by the scheduler) vs [CPU 1] loads rq->curr [C]
>   within membarrier to decide if the IPI should be skipped. Let's say CPU 1 observes
>   cpu_rq(0)->curr->mm != mm, so it skips the IPI.
> 
> - This means membarrier relies on switch_mm() to issue the sync-core.
> 
> - [CPU 0] switch_mm() loads [D] the icache_stale_mask. If the bit is zero, switch_mm()
>   may incorrectly skip the sync-core.
> 
> AFAIU, [C] can be reordered before [A] because there is no barrier between those
> operations within membarrier. I suspect it can cause the switch_mm() code to skip
> a needed sync-core.
> 
> AFAIU, [D] can be reordered before [B] because there is no documented barrier
> between those operations within the scheduler, which can also cause switch_mm()
> to skip a needed sync-core.
> 
> We possibly have a similar scenario for uthread->uthread when the scheduler
> switches between mm -> !mm.
> 
> One way to fix this would be to add the following barriers:
> 
> - A smp_mb() between [A] and [C], possibly just after cpumask_setall() in
>   prepare_sync_core_cmd(), with comments detailing the ordering it guarantees,
> - A smp_mb() between [B] and [D], possibly just before cpumask_test_cpu() in
>   flush_icache_deferred(), with appropriate comments.
> 
> Am I missing something?

Thanks for the detailed analysis.
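
To restate the pattern compactly (an illustrative sketch, not verbatim
kernel code): this is the usual store-buffering shape, so a full barrier
between the store and the load on each side guarantees that at least one
of the two loads observes the other CPU's store, i.e. either the IPI is
sent or the deferred sync-core fires:

	CPU 0 (scheduler, switching to a thread of mm):
		rq->curr = next;				/* [B] */
		smp_mb();					/* to be provided, see below */
		stale = cpumask_test_cpu(cpu,
				&mm->context.icache_stale_mask);	/* [D] */

	CPU 1 (membarrier PRIVATE_EXPEDITED_SYNC_CORE):
		cpumask_setall(&mm->context.icache_stale_mask);	/* [A] */
		smp_mb();					/* existing, see below */
		curr = rcu_dereference(cpu_rq(0)->curr);	/* [C] */

	Forbidden outcome: [C] misses [B] (so the IPI is skipped) and
	[D] misses [A] (so the deferred sync-core is skipped too).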

AFAIU, the following barrier (in membarrier_private_expedited())

	/*
	 * Matches memory barriers around rq->curr modification in
	 * scheduler.
	 */
	smp_mb();	/* system call entry is not a mb. */

can serve the purpose of ordering [A] before [C] (to be documented in v2).
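
That is, assuming prepare_sync_core_cmd() from 1/2 is invoked before
that barrier (as in 1/2, AFAIU), the membarrier side would read (a
sketch, details elided):

	membarrier_private_expedited():
		prepare_sync_core_cmd(mm);	/* [A]: cpumask_setall() on
						   mm's icache_stale_mask */
		smp_mb();			/* orders [A] before [C] */
		...
		p = rcu_dereference(cpu_rq(cpu)->curr);	/* [C] */
		if (!p || p->mm != mm)
			continue;	/* skip the IPI, rely on switch_mm() */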

But I agree that [B] and [D] are unordered, i.e. missing suitable
synchronization.  Worse, RISC-V currently has no full barrier between
[B] and the return to user-space.  I'm thinking of something along the
following lines (inspired by the PowerPC implementation):

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 217fd4de61342..f63222513076d 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -323,6 +323,23 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	if (unlikely(prev == next))
 		return;
 
+#if defined(CONFIG_MEMBARRIER) && defined(CONFIG_SMP)
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 *
+	 * Only need the full barrier when switching between processes:
+	 * barrier when switching from kernel to userspace is not
+	 * required here, given that it is implied by mmdrop(); barrier
+	 * when switching from userspace to kernel is not needed after
+	 * store to rq->curr.
+	 */
+	if (unlikely(atomic_read(&next->membarrier_state) &
+		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
+		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) && prev)
+		smp_mb();
+#endif
+
 	/*
 	 * Mark the current MM context as inactive, and the next as
 	 * active.  This is at least used by the icache flushing
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a708d225c28e8..a1c749fddd095 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6670,8 +6670,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		 *
 		 * Here are the schemes providing that barrier on the
 		 * various architectures:
-		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
-		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC,
+		 *   RISC-V.  switch_mm() relies on membarrier_arch_switch_mm()
+		 *   on PowerPC.
 		 * - finish_lock_switch() for weakly-ordered
 		 *   architectures where spin_unlock is a full barrier,
 		 * - switch_to() for arm64 (weakly-ordered, spin_unlock

The silver lining is that, AFAIU, similar changes (probably as a
separate/preliminary patch) also restore the desired order between [B]
and [D]; with them in place, 2/2 would only need additional comments
documenting the SYNC_CORE scenario above.
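
Concretely, with the context.c hunk above applied, the ordering on the
switch path would be (a sketch of the call chain, details elided):

	__schedule():
		rq->curr = next;		/* [B] */
		...
		context_switch()
		  -> switch_mm():
			smp_mb();		/* new: orders [B] before [D] */
			...
			flush_icache_deferred(next, cpu):
				if (cpumask_test_cpu(cpu,
				    &mm->context.icache_stale_mask))	/* [D] */
					/* deferred sync-core (FENCE.I) */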

Thoughts?

  Andrea
