From: Linus Torvalds <torvalds@linux-foundation.org>
To: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Anton Blanchard <anton@ozlabs.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@ozlabs.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	linux-arch <linux-arch@vger.kernel.org>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Nadav Amit <nadav.amit@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms
Date: Sun, 9 Jan 2022 19:51:26 -0800
Message-ID: <CAHk-=wgaQkeCW13TnvOxvNhGo4sz3ihLt+iNyQkkR_DpQZ=W7Q@mail.gmail.com>
In-Reply-To: <b1d963a8adf4618a53f996283c1bfae37323bbb6.camel@surriel.com>

On Sun, Jan 9, 2022 at 6:40 PM Rik van Riel <riel@surriel.com> wrote:
>
> Also, while 800 loads is kinda expensive, it is a heck of
> a lot less expensive than 800 IPIs.

Rik, the IPI's you have to do *anyway*. So there are exactly zero extra IPI's.

Go take a look. It's part of the whole "flush TLB's" thing in __mmput().

So let me explain one more time what I think we should have done, at
least on x86:

 (1) stop refcounting active_mm entries entirely on x86

Why can we do that? Because instead of worrying about doing those
mm_count games for the active_mm reference, we realize that any
active_mm has to have a _regular_ mm associated with it, and it has a
'mm_users' count.
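
Roughly (this is a sketch of the idea, not the actual scheduler code),
the lazy-mm adoption at context switch time looks something like this,
and the point of (1) is to make the refcount part of it go away on x86:

        if (!next->mm) {                /* switching to a kernel thread */
                next->active_mm = prev->active_mm;
                /*
                 * Today this does mmgrab(prev->active_mm) to pin the
                 * lazy mm.  The idea in (1) is: don't.  The real user's
                 * 'mm_users' reference keeps the page tables alive, and
                 * the exit-time TLB flush IPI (see (2) and (3) below)
                 * kicks every lazy user off the mm before it goes away.
                 */
        }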

And when that mm_users count goes to zero, we have:

 (2) mmput -> __mmput -> exit_mmap(), which already has to flush all
TLB's because it's tearing down the page tables

And since it has to flush those TLB's as part of tearing down the page
tables, we on x86 then have:

 (3) that TLB flush will have to do the IPI's to anybody who has that
mm active already

and that IPI has to be done *regardless*. And the TLB flushing done by
that IPI? That code already clears the lazy status (and not doing so
would be pointless and in fact wrong).
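
Spelled out (a rough sketch of the shape of that path, not the literal
code), the teardown side is:

        mmput(mm)
            -> __mmput(mm)
                -> exit_mmap(mm)   /* unmaps everything, frees page tables */
                    -> mmu_gather teardown, which ends up doing the
                       equivalent of flush_tlb_mm() and IPIs every CPU
                       that still has this mm loaded

and the flush IPI handler on a CPU that is merely lazy in that mm
doesn't even bother to flush, it just stops using the mm (this is
roughly what arch/x86/mm/tlb.c does):

        if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
                /* lazy: switch to init_mm instead of flushing */
                switch_mm_irqs_off(NULL, &init_mm, NULL);
                return;
        }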

Notice? There isn't some "800 loads". There isn't some "800 IPI's".
And there isn't any refcounting cost of the lazy TLB.

Well, right now there *is* that refcounting cost, but my point is that
I don't think it should exist.

It shouldn't exist as an atomic access to mm_count (with those cache
ping-pongs when you have a lot of threads across a lot of CPUs), but
it *also* shouldn't exist as a "lightweight hazard pointer".

See my point? I think the lazy-tlb refcounting we do is pointless if
you have to do IPI's for TLB flushes.

Note: the above is for x86, which has to do the IPI's anyway (and it's
very possible that if you don't have to do IPI's because you have HW
TLB coherency, maybe lazy TLB's aren't what you should be using, but I
think that should be a separate discussion).

And yes, right now we do that pointless reference counting, because it
was simple and straightforward, and people historically didn't see it
as a problem.

Plus we had to have that whole secondary 'mm_count' anyway for other
reasons, since we use it for things that need to keep a ref to 'struct
mm_struct' around regardless of page table counts (eg things like a
lot of /proc references to 'struct mm_struct' do not want to keep
forced references to user page tables alive).
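
The usual shape of such a reference is something like this (a hedged
sketch, not any particular /proc file, and walk_address_space() is just
a made-up placeholder):

        mmgrab(mm);                     /* pin 'struct mm_struct' itself */

        /* much later, when actually looking at the address space: */
        if (mmget_not_zero(mm)) {       /* page tables still around? */
                walk_address_space(mm); /* hypothetical placeholder */
                mmput(mm);
        }

        /* when the long-lived reference is finally dropped: */
        mmdrop(mm);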

But I think conceptually mm_count (ie mmgrab/mmdrop) was always really
dodgy for lazy TLB. Lazy TLB really cares about the page tables still
being there, and that's not what mm_count is ostensibly about. That's
really what mm_users is about.

Yet mmgrab/mmdrop is exactly what the lazy TLB code uses, even if it's
technically odd (ie mmgrab really only keeps the 'struct mm' itself
around, and says nothing about the vma's and page tables).
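
To spell out the two counters (simplified from include/linux/sched/mm.h
and kernel/fork.c):

        /* mm_users: "the address space is in use": vma's + page tables */
        mmget(mm);      /* atomic_inc(&mm->mm_users) */
        mmput(mm);      /* dec; last one -> __mmput() -> exit_mmap() */

        /* mm_count: "the 'struct mm_struct' itself must stay allocated" */
        mmgrab(mm);     /* atomic_inc(&mm->mm_count) */
        mmdrop(mm);     /* dec; last one -> __mmdrop() -> free_mm() */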

Side note: you can see the effects of this mis-use of mmgrab/mmdrop in
how we tear down _almost_ all the page table state in __mmput(). But
look at what we keep until the final __mmdrop, even though there are
no page tables left:

        mm_free_pgd(mm);
        destroy_context(mm);

exactly because even though we've torn down all the page tables
earlier, we had to keep the page table *root* around for the lazy
case.

It's kind of a layering violation, but it comes from that lazy-tlb
mm_count use, and so we have that odd situation where the page table
directory lifetime is very different from the rest of the page table
lifetimes.

(You can easily make excuses for it by just saying that "mm_users" is
the user-space page table user count, and that the page directory has
a different lifetime because it's also about the kernel page tables,
so it's all a bit of a gray area, but I do think it's also a bit of a
sign of how our refcounting for lazy-tlb is a bit dodgy).

                Linus
