Re: [PATCH bpf-next 1/2] bpf: fix a verifier failure with xor

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: Yonghong Song <yhs@fb.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>
Subject: Re: [PATCH bpf-next 1/2] bpf: fix a verifier failure with xor
Date: Wed, 2 Sep 2020 14:40:02 -0700
Message-ID: <20200902214002.ciczljw7wrbznper@ast-mbp.dhcp.thefacebook.com>
In-Reply-To: <871rjki5nw.fsf@toke.dk>

On Wed, Sep 02, 2020 at 05:01:39PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Wed, Sep 02, 2020 at 11:33:09AM +0200, Toke Høiland-Jørgensen wrote:
> >> Yonghong Song <yhs@fb.com> writes:
> >> 
> >> > On 9/1/20 1:07 PM, Andrii Nakryiko wrote:
> >> >> On Mon, Aug 24, 2020 at 11:47 PM Yonghong Song <yhs@fb.com> wrote:
> >> >>>
> >> >>> bpf selftest test_progs/test_sk_assign failed with llvm 11 and llvm 12.
> >> >>> Compared to llvm 10, llvm 11 and 12 generate an xor instruction which
> >> >> 
> >> >> Does this mean that some perfectly working BPF programs will now fail
> >> >> to verify on older kernels, if compiled with llvm 11 or llvm 12? If
> >> >
> >> > Right.
> >> >
> >> >> yes, is there something that one can do to prevent Clang from using
> >> >> xor in such situations?
> >> >
> >> > The xor is generated by the combination of the llvm SimplifyCFG and
> >> > InstCombine passes.
> >> >
> >> > The following is a hack to prevent the compiler from generating xors.
> >> 
> >> Wait, so this means that we can no longer tell people to just use the
> >> newest LLVM version - now we have to keep track of a minimum *and*
> >> maximum LLVM version for each kernel version?
> >
> > No. The only way is forward. Everyone has to upgrade their llvm periodically.
> 
> Right, great! But surely that implies that a regression such as the one
> described here, where a new LLVM version turns a previously-valid
> program into one that no longer verifies, is a bug, no?

It's not a regression. Previously valid _compiled_ programs will load.
Nothing guarantees that a recompiled program will keep loading.
Even if you keep the compiler and source code constant, the environment could change.
That risk always existed in libbcc and in anything that compiles on the fly.
A new version of bpftrace may suddenly start failing existing bpftrace scripts.
No one wants this, of course, but we cannot guarantee 100%.

> 
> >> Could we maybe try to not *keep* making it harder for people to use BPF? :/
> >
> > Whom do you mean by "we" ?
> 
> I mean "we as a community who would like BPF to be as useful as possible
> to as many people as possible". Usability is a big part of this.

Of course. I completely agree, but your previous statement said
that somebody "is making it harder for people to use BPF"...
and I asked whom you were pointing the finger at.
Sounds like you're saying that you are not a compiler person,
so it's not your fault and some compiler person must be responsible?
Well, we are all in the same boat and all are responsible for the outcome.

> 
> >> As for the patch, sure, make the verifier smarter, but I also feel like
> >> LLVM should be fixed to not suddenly emit such xor instructions...
> >
> > I don't think there is anything to be "fixed". It's not a bug from
> > the llvm developers' point of view. At least I suspect that's the
> > response you will get if you post the same sentence on the llvm-dev
> > mailing list. If you care to help, please bisect which llvm commit
> > introduced this change. Maybe the author (whoever that was) will have
> > ideas how to pessimize it specifically for the bpf backend. But I
> > suspect they will refuse to do so. The discussion about partially
> > disabling optimizations was brought up several times. tldr:
> > optimizations cannot be disabled effectively. Pretty much all of them
> > may cause trouble for the verifier and all of them are often
> > necessary for the verifier as well.
> > Please read this thread:
> > http://clang-developers.42468.n3.nabble.com/Disable-certain-llvm-optimizations-at-clang-frontend-tp4068601.html
> 
> I am not enough of a compiler person to get the nuances of that
> discussion, but the last message[0] by Y Song seems to
> imply that you guys do want to fix such issues in LLVM, just not by
> disabling the optimisation, but at a later stage in the processing
> pipeline?

Not really. The "fix such issues in LLVM" statement is missing the point.
There is no _issue_ in LLVM and there is no _issue_ in the verifier.
The word "fix" assigns the blame and implies a bug.
The verifier is getting smarter. LLVM is getting smarter, but they
follow different religions, so to speak. Reconciling the differences
is what should happen.
Inserting inline asm barriers at different stages of the compilation
is a fragile hack. Both the verifier and LLVM need to work
towards each other. BPF programs are a pain to write. People keep
fighting the verifier and fighting LLVM. Large progs are full of
inline asm hacks (mostly written by humans) to please the verifier
and force LLVM to do something that is against LLVM objectives.
Yonghong is trying to come up with a set of heuristics to do this
asm insertion automatically. It will help, for sure, but won't
close every corner case. The verifier needs to get smarter too.
Recognizing XORs in the verifier is the right thing to do.
Missing XORs in older kernels is not a bug, but we might consider it
a bug and backport this verifier feature to older kernels.
The LLVM vs verifier contest is outside the typical kernel bug vs feature
classification of patches. I think we need to be creative here.
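
For readers wondering what "recognizing XORs" could look like in practice,
here is a minimal, self-contained sketch of known-bits tracking across an
XOR. It is only an illustration of the general idea, not the code from the
patch under discussion: the struct and function names (tnum_sketch and
friends) are hypothetical, and the assumption is that a register is modelled
as a value plus a mask of unknown bits. A result bit is treated as known only
when the corresponding bit is known in both operands.

```c
#include <stdint.h>

/*
 * Hypothetical model of a partially known 64-bit register (names are
 * illustrative, not taken from the patch):
 *   mask  - bits set here are unknown
 *   value - the known bits (always zero where mask is set)
 */
struct tnum_sketch {
	uint64_t value;
	uint64_t mask;
};

/* A fully known constant has no unknown bits. */
static struct tnum_sketch tnum_sketch_const(uint64_t v)
{
	struct tnum_sketch t = { v, 0 };
	return t;
}

/*
 * XOR of two partially known values. A bit of the result is unknown if it
 * is unknown in either input; the remaining bits are the XOR of the known
 * bits.
 */
static struct tnum_sketch tnum_sketch_xor(struct tnum_sketch a,
					  struct tnum_sketch b)
{
	uint64_t v = a.value ^ b.value;
	uint64_t mu = a.mask | b.mask;
	struct tnum_sketch t = { v & ~mu, mu };
	return t;
}

/* Smallest unsigned value consistent with t: all unknown bits cleared. */
static uint64_t tnum_sketch_umin(struct tnum_sketch t)
{
	return t.value;
}

/* Largest unsigned value consistent with t: all unknown bits set. */
static uint64_t tnum_sketch_umax(struct tnum_sketch t)
{
	return t.value | t.mask;
}
```

Under these assumptions, XORing a register whose low three bits are unknown
with the constant 0x8 yields a value whose unsigned range is still bounded
(8 to 15), which is the kind of fact a verifier needs in order to accept a
later bounds-checked access instead of treating the result as unknown.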