From: Andy Lutomirski <luto@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>, X86 ML <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Borislav Petkov <bp@alien8.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Nadav Amit <nadav.amit@gmail.com>, Rik van Riel <riel@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Arjan van de Ven <arjan@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID
Date: Thu, 22 Jun 2017 11:12:45 -0700
Message-ID: <CALCETrVm9oQCpovr0aZcDXoG-8hOoYPMDyhYZJPSBNFGemXQNg@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.20.1706221037320.1885@nanos>

On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 21 Jun 2017, Andy Lutomirski wrote:
>> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > That requires a conditional branch
>> >
>> >         if (asid >= NR_DYNAMIC_ASIDS) {
>> >                 asid = 0;
>> >                 ....
>> >         }
>> >
>> > The question is whether 4 IDs would be sufficient, which trades the branch
>> > for a mask operation. Or you go for 8 and spend another cache line.
>>
>> Interesting.  I'm inclined to either leave it at 6 or reduce it to 4
>> for now and to optimize later.
>
> :)
>
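(FWIW, if NR_DYNAMIC_ASIDS stays a power of two, the wraparound check
above can be traded for a mask, as you say -- something along the lines
of

        asid = (asid + 1) & (NR_DYNAMIC_ASIDS - 1);

instead of the branch.)
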
>> > Hmm. So this loop needs to be taken unconditionally even if the task stays
>> > on the same CPU. And of course the number of dynamic IDs has to be kept
>> > short so that this loop doesn't suck performance wise.
>> >
>> > Something like the completely dysfunctional patch below might be worthwhile to
>> > explore. At least arch/x86/mm/ compiles :)
>> >
>> > It gets rid of the loop search and lifts the limit on the number of dynamic
>> > IDs by trading it for a percpu variable in mm_context_t.
>>
>> That would work, but it would take a lot more memory on large systems
>> with lots of processes, and I'd also be concerned that we might run
>> out of dynamic percpu space.
>
> Yeah, did not think about the dynamic percpu space.
>
>> How about a different idea: make the percpu data structure look like a
>> 4-way set associative cache.  The ctxs array could be, say, 1024
>> entries long without using crazy amounts of memory.  We'd divide it
>> into 256 buckets, so you'd index it like ctxs[4*bucket + slot].  For
>> each mm, we'd choose a random bucket (from 0 through 255), and then we'd
>> just loop over the four slots in the bucket in choose_asid().  This
>> would require very slightly more arithmetic (I'd guess only one or two
>> cycles, though) but, critically, wouldn't touch any more cachelines.
>>
>> The downside of both of these approaches compared to the one in this patch
>> is that the chance that the percpu cacheline we need is not in the cache
>> is quite a bit higher, since it's potentially a different cacheline for
>> each mm.  It would probably still be a win because avoiding the flush
>> is really quite valuable.
>>
>> What do you think?  The added code would be tiny.
>
> That might be worth a try.
>
> Now one other optimization which should be trivial to add is to keep the 4
> asid context entries in cpu_tlbstate and cache the last asid in thread
> info. If that's still valid, use it; otherwise unconditionally get a new
> one. That avoids the whole loop machinery, and thread info is cache hot in
> the context switch anyway. Delta patch on top of your version below.

I'm not sure I understand.  If an mm has ASID 0 on CPU 0 and ASID 1 on
CPU 1 and a thread in that mm bounces back and forth between those
CPUs, won't your patch cause it to flush every time?
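
In the meantime, to make the set-associative idea above a bit more
concrete, here's a rough, completely untested sketch of what I'm
picturing for choose_asid().  All of the names here (cpu_tlb_ctxs,
tlb_bucket, tlb_slot_rr, TLB_BUCKET_SLOTS, ...) are made up for
illustration, not taken from the series:

/*
 * Rough sketch only -- untested, and all names are placeholders.
 */
#define TLB_NR_BUCKETS          256
#define TLB_BUCKET_SLOTS        4       /* 4-way set associative */

struct tlb_ctx {
        u64 ctx_id;             /* which mm last owned this slot */
        u64 tlb_gen;            /* TLB generation we last flushed to */
};

/* 256 buckets * 4 slots * 16 bytes = 16K per CPU */
static DEFINE_PER_CPU(struct tlb_ctx,
                      cpu_tlb_ctxs[TLB_NR_BUCKETS * TLB_BUCKET_SLOTS]);
static DEFINE_PER_CPU(unsigned int, tlb_slot_rr);

/*
 * Return the dynamic ASID slot to use for 'next' on this CPU and tell
 * the caller whether a flush is needed.  The caller updates the chosen
 * slot's ctx_id/tlb_gen after the switch, as the current patch does.
 */
static unsigned int choose_asid(struct mm_struct *next, bool *need_flush)
{
        /* context.tlb_bucket: random bucket picked once per mm (made up) */
        unsigned int base = next->context.tlb_bucket * TLB_BUCKET_SLOTS;
        unsigned int i;

        for (i = 0; i < TLB_BUCKET_SLOTS; i++) {
                struct tlb_ctx *ctx = this_cpu_ptr(&cpu_tlb_ctxs[base + i]);

                if (ctx->ctx_id == next->context.ctx_id) {
                        /* Hit: only flush if this mm's TLB is stale */
                        *need_flush = ctx->tlb_gen !=
                                atomic64_read(&next->context.tlb_gen);
                        return base + i;
                }
        }

        /* Miss: evict some slot in the bucket and flush */
        *need_flush = true;
        return base + (this_cpu_inc_return(tlb_slot_rr) % TLB_BUCKET_SLOTS);
}

The only extra arithmetic vs. the fixed little array is computing
'base', and with 16-byte entries a 4-slot bucket is exactly one
cacheline if the array is suitably aligned, which is the "no extra
cachelines touched" property I was after.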

--Andy
