From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicholas Piggin
Date: Tue, 09 Feb 2021 08:44:53 +0000
Subject: Re: [PATCH 1/2] KVM: PPC: Book3S HV: Remove shared-TLB optimisation from vCPU TLB coherency logic
Message-Id: <1612857591.l3d2b98uvh.astroid@bobo.none>
References: <20200528015331.GD307798@thinks.paulus.ozlabs.org>
In-Reply-To: <20200528015331.GD307798@thinks.paulus.ozlabs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: kvm-ppc@vger.kernel.org

Excerpts from Paul Mackerras's message of February 9, 2021 5:19 pm:
> On Mon, Jan 18, 2021 at 10:26:08PM +1000, Nicholas Piggin wrote:
>> Processors that implement ISA v3.0 or later don't necessarily have
>> threads in a core sharing all translations, and/or TLBIEL does not
>> necessarily invalidate translations on all other threads (the
>> architecture talks only about the effect on translations for "the
>> thread executing the tlbiel instruction").
>
> It seems to me that to have an implementation where TLB entries
> created on one thread (say T0) are visible to and usable by another
> thread (T1), but a tlbiel on thread T0 does not result in the entry
> being removed from visibility/usability on T1, is a pretty insane
> implementation. I'm not sure that the architecture envisaged allowing
> this kind of implementation, though perhaps the language doesn't
> absolutely prohibit it.
>
> This kind of implementation is what you are allowing for in this
> patch, isn't it?

Not intentionally, and patch 2 removes the possibility.

The main thing it allows is an implementation where TLB entries created
by T1 which are visible only to T1 do not get removed by TLBIEL on T0.

I also have some concern with the ordering of in-flight operations
(ptesync, memory ordering, etc.), which are mostly avoided with this.

> The sane implementations would be ones where either (a) TLB entries
> are private to each thread and tlbiel only works on the local thread,
> or (b) TLB entries can be shared and tlbiel works across all threads.
> I think this is the conclusion we collectively came to when working on
> that bug we worked on towards the end of last year.

I think an implementation could have both types, so it's hard to get
away from flushing all threads.

If the vCPU ran on T0 and then migrated to another core, then before
running a vCPU from the same LPAR on this core again, we could flush
_just_ T0 and that should be fine. But if the vCPU was scheduled onto
T1 first, we would have to flush there to catch shared xlates, and
then we would still have to flush T0 when it next ran the LPID, to
catch the private xlates (though probably not T2 or T3).

We could catch that and optimise it with a shared mask and a private
mask (a rough sketch of that bookkeeping is at the end of this mail),
but it seems like a lot of complexity and memory ordering issues for
unclear gain.

>> While this worked for POWER9, it may not for future implementations, so
>> remove it. A POWER9 specific optimisation would have to have a specific
>> CPU feature to check, if it were to be re-added.
>
> Did you do any measurements of how much performance impact this has on
> POWER9?

I don't think I've really been able to generate enough CPU migrations
to cause a blip. That's not to say it doesn't have an impact, just that
I don't know how to create a worst-case load (short of spamming vCPUs
with sched_setaffinity calls).

> I don't believe this patch will actually be necessary on
> POWER10, so it seems like this patch is just to allow for some
> undefined possible future CPU. It may still be worth putting in for
> the sake of strict architecture compliance if the performance impact
> is minimal.

AFAIK we still have an issue upstream, even with the other known fixes
applied, that was fixed by this patch.

Thanks,
Nick
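
For illustration, a very rough sketch of the kind of per-core, per-LPID
bookkeeping the shared mask / private mask idea would need. Everything
below is made up for the example (names, structure, the conservative
assumptions in the comments); it is not the actual KVM code.

/*
 * Hypothetical example only -- names are invented for illustration.
 *
 * Per-core, per-LPID state: which threads might hold stale private
 * translations for this LPID, and which might see stale shared ones.
 */
#include <stdbool.h>

struct lpid_core_flush_state {
	unsigned long private_mask;	/* threads that ran this LPID since their last flush */
	unsigned long shared_mask;	/* threads that might see shared translations */
};

/* Record that a vCPU of this LPID ran on @thread of this core. */
static void lpid_ran_on_thread(struct lpid_core_flush_state *s, int thread)
{
	s->private_mask |= 1UL << thread;
	/* Shared translations created here may be visible to any thread. */
	s->shared_mask = ~0UL;
}

/* Before entering a guest of this LPID on @thread: is a TLBIEL flush needed? */
static bool thread_needs_flush(struct lpid_core_flush_state *s, int thread)
{
	return (s->private_mask | s->shared_mask) & (1UL << thread);
}

/* After flushing on @thread, clear what that flush is known to cover. */
static void thread_flushed(struct lpid_core_flush_state *s, int thread)
{
	s->private_mask &= ~(1UL << thread);	/* its private entries are gone */
	s->shared_mask &= ~(1UL << thread);	/* and anything shared it could see */
}

Even in this toy form the memory ordering question shows up: the masks
would have to be published and observed with the right barriers around
guest entry and exit, which is the complexity I'd rather avoid for an
unclear gain.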