Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption

From: Peter Zijlstra <peterz@infradead.org>
To: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@kernel.org>, Jiri Olsa <jolsa@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	tonyj@suse.com, nelson.dsouza@intel.com
Subject: Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
Date: Tue, 19 Mar 2019 19:20:41 +0100	[thread overview]
Message-ID: <20190319182041.GO5996@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <CABPqkBQapZFCyJNRjP-GUjfOy62=rJU=Fv=cYPxPTnVFBkqtVA@mail.gmail.com>

On Tue, Mar 19, 2019 at 10:52:01AM -0700, Stephane Eranian wrote:
> > Not quite; the control on its own doesn't directly write the MSR. And
> > even when the work-around is allowed, we'll not set the MSR unless there
> > is also demand for PMC3.
> >
> Trying to understand this better here. When the workaround is enabled
> (tfa=0), you lose PMC3 and transactions operate normally.

> When it is disabled (tfa=1), transactions are all aborted and PMC3 is
> available.

Right, but we don't expose tfa.

> If you are saying that when there is a PMU event requesting PMC3, then
> you need PMC3 avail, so you set the MSR so that tfa=1 forcing all
> transactions to abort.

Right, so when allow_tfa=1 (default), we only set tfa=1 when PMC3 is
requested.

This has the advantage that,

  TSX only workload -> works
  perf 4 counteres -> works

Only when you need both of them, do you get 'trouble'.

> But in that case, you are modifying the execution of the workload when
> you are monitoring it, assuming it uses TSX.

We assume you are not in fact using TSX, not a lot of code does. If you
do use TSX a lot, and you don't want to interfere, you have to set
allow_tfa=0 and live with one counter less.

Any which way around you turn this stone, it sucks.

> You want lowest overhead and no modifications to how the workload
> operates, otherwise how representative is the data you are collecting?

Sure; but there are no good choices here. This 'fix' will break
something. We figured TSX+4-counter-perf was the least common scenario.

We konw of people that rely on 4 counter being present; you want to
explain to them how when doing an update their program suddently doesn't
work anymore?

Or you want to default to tfa=1; but then you have to explain to those
people relying on TSX why their workload stopped working.

> I understand that there is no impact on apps not using TSX, well,
> except on context switch where you have to toggle that MSR.

There is no additional code in the context switch; only the perf event
scheduling code prods at the MSR.

> But for workloads using TSX, there is potentially an impact.

Yes, well, if you're a TSX _and_ perf user, you now have an extra knob
to play with; that's not something I can do anything about. We're forced
to make a choice here.

> > Yeah, meh. You're admin, you can 'fix' it. In practise I don't expect
> > most people to care about the knob, and the few people that do, should
> > be able to make it work.
> 
> I don't understand how this can work reliably.

> You have a knob to toggle that MSR.

No, we don't have this knob.

> Then, you have another one inside perf_events

Only this knob exists allow_tfa.

> and then the sysadmin has to make sure nobody (incl. NMI watchdog) is
> using the PMU when this all happens.

You're very unlucky if the watchdog runs on PMC3, normally it runs on
Fixed1 or something. Esp early after boot. (Remember, we schedule fixed
counters first, and then general purpose counters, with a preference for
lower counters).

Anyway, you can trivially switch it off if you want.

> How can this be a practical solution? Am I missing something here?

It works just fine; it is unfortunate that we have this interaction but
that's not something we can do anything about. We're forced to deal with
this.

But if you're a TSX+perf user, have your boot scripts do:

  echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort

and you'll not use PMC3 and TSX will be 'awesome'. If you don't give a
crap about TSX (most people), just boot and be happy.

If you do care about TSX+perf and want to dynamically toggle for some
reason, you just have to be a little careful.