From: Thomas Gleixner <tglx@linutronix.de>
To: Andy Lutomirski <luto@kernel.org>
Cc: X86 ML <x86@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Borislav Petkov <bp@alien8.de>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mel Gorman <mgorman@suse.de>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Nadav Amit <nadav.amit@gmail.com>, Rik van Riel <riel@redhat.com>,
Dave Hansen <dave.hansen@intel.com>,
Arjan van de Ven <arjan@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID
Date: Thu, 22 Jun 2017 14:21:49 +0200 (CEST) [thread overview]
Message-ID: <alpine.DEB.2.20.1706221037320.1885@nanos> (raw)
In-Reply-To: <CALCETrUrwyMt+k4a-Tyh85Xiidr3zgEW7LKLnGDz90Z6jL9XtA@mail.gmail.com>
On Wed, 21 Jun 2017, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > That requires a conditional branch
> >
> > if (asid >= NR_DYNAMIC_ASIDS) {
> > asid = 0;
> > ....
> > }
> >
> > The question is whether 4 IDs would be sufficient which trades the branch
> > for a mask operation. Or you go for 8 and spend another cache line.
>
> Interesting. I'm inclined to either leave it at 6 or reduce it to 4
> for now and to optimize later.
:)
> > Hmm. So this loop needs to be taken unconditionally even if the task stays
> > on the same CPU. And of course the number of dynamic IDs has to be short in
> > order to makes this loop suck performance wise.
> >
> > Something like the completely disfunctional below might be worthwhile to
> > explore. At least arch/x86/mm/ compiles :)
> >
> > It gets rid of the loop search and lifts the limit of dynamic ids by
> > trading it with a percpu variable in mm_context_t.
>
> That would work, but it would take a lot more memory on large systems
> with lots of processes, and I'd also be concerned that we might run
> out of dynamic percpu space.
Yeah, did not think about the dynamic percpu space.
> How about a different idea: make the percpu data structure look like a
> 4-way set associative cache. The ctxs array could be, say, 1024
> entries long without using crazy amounts of memory. We'd divide it
> into 256 buckets, so you'd index it like ctxs[4*bucket + slot]. For
> each mm, we choose a random bucket (from 0 through 256), and then we'd
> just loop over the four slots in the bucket in choose_asid(). This
> would require very slightly more arithmetic (I'd guess only one or two
> cycles, though) but, critically, wouldn't touch any more cachelines.
>
> The downside of both of these approaches over the one in this patch is
> that the change that the percpu cacheline we need is not in the cache
> is quite a bit higher since it's potentially a different cacheline for
> each mm. It would probably still be a win because avoiding the flush
> is really quite valuable.
>
> What do you think? The added code would be tiny.
That might be worth a try.
Now one other optimization which should be trivial to add is to keep the 4
asid context entries in cpu_tlbstate and cache the last asid in thread
info. If that's still valid then use it otherwise unconditionally get a new
one. That avoids the whole loop machinery and thread info is cache hot in
the context switch anyway. Delta patch on top of your version below.
> (P.S. Why doesn't random_p32() try arch_random_int()?)
Could you please ask questions which do not require crystalballs for
answering?
Thanks,
tglx
8<-------------------
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -159,8 +159,16 @@ static inline void destroy_context(struc
extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk);
-extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk);
+extern void __switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next, u32 *last_asid);
+
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+ __switch_mm_irqs_off(prev, next, &tsk->thread_info.asid);
+}
+
#define switch_mm_irqs_off switch_mm_irqs_off
#define activate_mm(prev, next) \
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -54,6 +54,7 @@ struct task_struct;
struct thread_info {
unsigned long flags; /* low level flags */
+ u32 asid;
};
#define INIT_THREAD_INFO(tsk) \
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,10 +83,13 @@ static inline u64 bump_mm_tlb_gen(struct
#endif
/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
+ * NR_DYNAMIC_ASIDS must be a power of 2. 4 makes tlb_state fit into two
+ * cache lines.
*/
-#define NR_DYNAMIC_ASIDS 6
+#define NR_DYNAMIC_ASIDS_BITS 2
+#define NR_DYNAMIC_ASIDS (1U << NR_DYNAMIC_ASIDS_BITS)
+#define DYNAMIC_ASIDS_MASK (NR_DYNAMIC_ASIDS - 1)
+#define ASID_NEEDS_FLUSH (1U << 16)
struct tlb_context {
u64 ctx_id;
@@ -102,7 +105,8 @@ struct tlb_state {
*/
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
- u16 next_asid;
+ u16 curr_asid;
+ u32 notask_asid;
/*
* Access to this CR4 shadow and to H/W CR4 is protected by
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,7 @@ void __init zone_sizes_init(void)
DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
- .next_asid = 1,
+ .curr_asid = 0,
.cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */
};
EXPORT_SYMBOL_GPL(cpu_tlbstate);
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,43 +30,32 @@
atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
-static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
- u16 *new_asid, bool *need_flush)
+static u32 choose_new_asid(mm_context_t *nctx, u32 *last_asid, u64 next_tlb_gen)
{
- u16 asid;
+ struct tlb_context *tctx;
+ u32 asid;
- if (!static_cpu_has(X86_FEATURE_PCID)) {
- *new_asid = 0;
- *need_flush = true;
- return;
- }
+ if (!static_cpu_has(X86_FEATURE_PCID))
+ return ASID_NEEDS_FLUSH;
- for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
- if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
- next->context.ctx_id)
- continue;
-
- *new_asid = asid;
- *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
- next_tlb_gen);
- return;
+ asid = *last_asid;
+ tctx = this_cpu_ptr(cpu_tlbstate.ctxs + asid);
+ if (likely(tctx->ctx_id == nctx->ctx_id)) {
+ if (tctx->tlb_gen != next_tlb_gen)
+ asid |= ASID_NEEDS_FLUSH;
+ return asid;
}
- /*
- * We don't currently own an ASID slot on this CPU.
- * Allocate a slot.
- */
- *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
- if (*new_asid >= NR_DYNAMIC_ASIDS) {
- *new_asid = 0;
- this_cpu_write(cpu_tlbstate.next_asid, 1);
- }
- *need_flush = true;
+ asid = this_cpu_inc_return(cpu_tlbstate.curr_asid);
+ asid &= DYNAMIC_ASIDS_MASK;
+ *last_asid = asid;
+ return asid | ASID_NEEDS_FLUSH;
}
void leave_mm(int cpu)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned long flags;
/*
* It's plausible that we're in lazy TLB mode while our mm is init_mm.
@@ -82,21 +71,27 @@ void leave_mm(int cpu)
/* Warn if we're not lazy. */
WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
- switch_mm(NULL, &init_mm, NULL);
+ local_irq_save(flags);
+ switch_mm_irqs_off(NULL, &init_mm, NULL);
+ local_irq_restore(flags);
}
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
unsigned long flags;
+ u32 *last_asid;
+
+ last_asid = tsk ? &tsk->thread_info.asid :
+ this_cpu_ptr(&cpu_tlbstate.notask_asid);
local_irq_save(flags);
- switch_mm_irqs_off(prev, next, tsk);
+ __switch_mm_irqs_off(prev, next, last_asid);
local_irq_restore(flags);
}
-void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk)
+void __switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
+ u32 *last_asid)
{
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -157,8 +152,7 @@ void switch_mm_irqs_off(struct mm_struct
* are not reflected in tlb_gen.)
*/
} else {
- u16 new_asid;
- bool need_flush;
+ u32 new_asid;
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
/*
@@ -187,9 +181,11 @@ void switch_mm_irqs_off(struct mm_struct
cpumask_set_cpu(cpu, mm_cpumask(next));
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+ new_asid = choose_new_asid(&next->context, last_asid,
+ next_tlb_gen);
- if (need_flush) {
+ if (new_asid & ASID_NEEDS_FLUSH) {
+ new_asid &= DYNAMIC_ASIDS_MASK;
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id,
next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen,
next prev parent reply other threads:[~2017-06-22 12:22 UTC|newest]
Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-21 5:22 [PATCH v3 00/11] PCID and improved laziness Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common() Andy Lutomirski
2017-06-21 8:01 ` Thomas Gleixner
2017-06-21 8:49 ` Borislav Petkov
2017-06-21 15:15 ` Andy Lutomirski
2017-06-21 23:26 ` Nadav Amit
2017-06-22 2:27 ` Andy Lutomirski
2017-06-22 7:32 ` Ingo Molnar
2017-06-21 5:22 ` [PATCH v3 02/11] x86/ldt: Simplify LDT switching logic Andy Lutomirski
2017-06-21 8:03 ` Thomas Gleixner
2017-06-21 9:40 ` Borislav Petkov
2017-06-22 11:08 ` [tip:x86/mm] x86/ldt: Simplify the " tip-bot for Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate() Andy Lutomirski
2017-06-21 8:03 ` Thomas Gleixner
2017-06-21 9:50 ` Borislav Petkov
2017-06-22 11:08 ` [tip:x86/mm] " tip-bot for Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID Andy Lutomirski
2017-06-21 8:05 ` Thomas Gleixner
2017-06-21 10:33 ` Borislav Petkov
2017-06-21 15:23 ` Andy Lutomirski
2017-06-21 17:06 ` Borislav Petkov
2017-06-21 17:43 ` Borislav Petkov
2017-06-22 2:34 ` Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm Andy Lutomirski
2017-06-21 8:32 ` Thomas Gleixner
2017-06-21 15:11 ` Andy Lutomirski
2017-06-21 18:44 ` Borislav Petkov
2017-06-22 2:46 ` Andy Lutomirski
2017-06-22 7:24 ` Borislav Petkov
2017-06-22 14:48 ` Andy Lutomirski
2017-06-22 14:59 ` Borislav Petkov
2017-06-22 15:55 ` Andy Lutomirski
2017-06-22 17:22 ` Borislav Petkov
2017-06-22 18:08 ` Andy Lutomirski
2017-06-23 8:42 ` Borislav Petkov
2017-06-23 15:46 ` Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking Andy Lutomirski
2017-06-21 9:01 ` Thomas Gleixner
2017-06-21 16:04 ` Andy Lutomirski
2017-06-21 17:29 ` Borislav Petkov
2017-06-22 14:50 ` Borislav Petkov
2017-06-22 17:47 ` Andy Lutomirski
2017-06-22 19:05 ` Borislav Petkov
2017-07-27 19:53 ` Andrew Banman
2017-07-28 2:05 ` Andy Lutomirski
2017-06-23 13:34 ` Boris Ostrovsky
2017-06-23 15:22 ` Andy Lutomirski
2017-06-21 5:22 ` [PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code Andy Lutomirski
2017-06-21 9:22 ` Thomas Gleixner
2017-06-21 15:16 ` Andy Lutomirski
2017-06-23 9:07 ` Borislav Petkov
2017-06-21 5:22 ` [PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels Andy Lutomirski
2017-06-21 9:26 ` Thomas Gleixner
2017-06-23 9:24 ` Borislav Petkov
2017-06-21 5:22 ` [PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID Andy Lutomirski
2017-06-21 9:27 ` Thomas Gleixner
2017-06-23 9:34 ` Borislav Petkov
2017-06-21 5:22 ` [PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems Andy Lutomirski
2017-06-21 9:39 ` Thomas Gleixner
2017-06-21 13:40 ` Thomas Gleixner
2017-06-21 20:34 ` Andy Lutomirski
2017-06-23 11:50 ` Borislav Petkov
2017-06-23 15:28 ` Andy Lutomirski
2017-06-23 13:35 ` Boris Ostrovsky
2017-06-21 5:22 ` [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID Andy Lutomirski
2017-06-21 13:38 ` Thomas Gleixner
2017-06-21 13:40 ` Thomas Gleixner
2017-06-22 2:57 ` Andy Lutomirski
2017-06-22 12:21 ` Thomas Gleixner [this message]
2017-06-22 18:12 ` Andy Lutomirski
2017-06-22 21:22 ` Thomas Gleixner
2017-06-23 3:09 ` Andy Lutomirski
2017-06-23 7:29 ` Thomas Gleixner
2017-06-22 16:09 ` Nadav Amit
2017-06-22 18:10 ` Andy Lutomirski
2017-06-26 15:58 ` Borislav Petkov
2017-06-21 18:23 ` [PATCH v3 00/11] PCID and improved laziness Linus Torvalds
2017-06-22 5:19 ` Andy Lutomirski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.DEB.2.20.1706221037320.1885@nanos \
--to=tglx@linutronix.de \
--cc=akpm@linux-foundation.org \
--cc=arjan@linux.intel.com \
--cc=bp@alien8.de \
--cc=dave.hansen@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mgorman@suse.de \
--cc=nadav.amit@gmail.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=torvalds@linux-foundation.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).