RE: another testmgr question

From: Pascal Van Leeuwen <pvanleeuwen@insidesecure.com>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>
Subject: RE: another testmgr question
Date: Mon, 27 May 2019 20:15:42 +0000	[thread overview]
Message-ID: <AM6PR09MB35230C8E9EC3B122BC6DB23FD21D0@AM6PR09MB3523.eurprd09.prod.outlook.com> (raw)
In-Reply-To: <CAKv+Gu-LuwaXsbU=cp24513p+t=SPtvkSbSs9bTE5=ds-=wmbA@mail.gmail.com>

> >
> > How relevant is the fact that there may be (other) situations where latency
> > is not relevant for someone being in a situation where it is relevant?
> >
> > I was just talking about that situation where it actually is relevant and
> > therefore you do *not* wish the hardware driver to be used.
> >
> > Ok, let's phrase it such that can be no further misunderstandings:
> >
> > If you want performance from a single-threaded single application that does
> > synchronous, blocking, crypto API calls, then you do not want those to end
> > up at our hardware. Or, very likely, pretty much any other hardware.
> > The cra_priority mechanism does not allow the driver to convey such a
> thing.
> >
>
> Agreed.
>
> But we came to this point in response to your assertion that a
> userland application can only make meaningful use of the hardware
> accelerator if it uses some kind of asynchronous API like AIO, and I
> tried to explain that this is not the case.
>
Ah OK, now I get your point. I did not intend to say that this is never
useful, just that in a fair number (trying to be careful with the wording
here :-) of cases it really is not what you want.

> > These workarounds are on a hot path by definition, as they have to filter
> > out specific requests coming in, i.e. it affects every single request done
> > to the driver. As for memory footprint: that is still relevant for embedded
> > systems even today. Besides, memory footprint affects instruction cache hit
> > ratio and therefore, indirectly, performance as well.
> >
>
> Of course. But what appears on the actual hot path is a single 'cbz'
> instruction that is always predicted correctly, and the actual code
> lives somewhere else in the binary. That is why I said *memory*
> footprint not *cache* footprint, since it only affects the former and
> not the latter.
>
Well:
a) That would be the case for single corner case, but they all add up.
b) For zero-length HMAC it really is not that simple (for skcipher, it is).
Well, you could theoretically implement it like that, I suppose, by making
it completely independent paths, but then you'd have to duplicate a lot code
and  complexity. Not maintainable. In reality, it's quite a few compares and
condition  checks all over the hot path ... I suppose it could be optimized
further, but that would make it even harder to follow.

> > >
> > Where did I ever say that we omitted that to save silicon area?
> > You're putting words in my mouth (or fingers) now. I never said that,
> > that is not the reason at all.
> >
>
> Fair enough. But I did understand correctly that this was a deliberate
> decision, no?
>
Absolutely. It's a huge amount of extra complexity for the hardware.
You have no idea. It would take many manmonths to just add that the
existing design and properly verify it out, covering all the tricky
arbitration corner cases and pipeline issues it introduces.

I guess that's hard to understand for a software engineer, but the
problem really is NOT computing the zero length hash or HMAC. The
algorithm core can actually already do that. The problem is making the
whole host interface, ring management, DMA infrastructure and data
buffers and control logic deal with input packets of length zero ...

> > NO. We never decided any such thing. We decided that it was not a relevant
> > use case that we needed to support at all. Neither in the hardware nor in
> > the driver. Our own generic OS driver does not contain any such
> workarounds.
> > In fact, based on this Linux driver thing we had a new internal discussion
> > on it and the outcome did not change: not a relevant use case for us.
> >
>
> RIght. So how does this this relate to your remark above that working
> workarounds prevent issues from being known to the h/w guys?
>
I guess your point being: you know about the issue now and you're not
going to fix it. Obviously, we have to weigh every issue seperately.
If it was something we could easily fix, we would do it. I already
implemented HMAC continue - which we couldn't do either - for the next
release of this hardware. Because a) we considered it pretty useful and
b) it was something that wasn't very risky or very much effort to do.

Zero-length hash/HMAC is simpy not very useful - doing shasum's on files is
not really not a use case we're interested in - and very difficult to
implement, risking, potentially, breaking some critical proven-on-silicon
infrastructure.

> > Comply with what exactly? You can't "comply" with algorithms ... you just
> > implement whatever subset makes sense and specify the constraints. You can
> > comply with protocol specifications, and that's what we do. None of those
> > requires zero length hashing, HMAC, cipher or AEAD operations.
> > Many algorithms are unbounded anyway and hardware is bounded by definition.
> >
>
> I'll ignore the remark about boundedness since it has no bearing
> whatsoever on this discussion.
>
It has some bearing. On one end you have the length zero, on the other
end you have a 2^64-1 bit limitation. Which we also cannot do, by the way.
So those are the bounds and we "comply" with neither ...

By the way, the algorithm is specified for *bits*, not *bytes*, so to
be "compliant" you'd have to be able to process arbitrary numbers of *bits*.
So in that respect, I think the whole hash API is non-compliant :-P

> As for compliance, many of the zero length test vectors were sourced
> from FIPS or NIST documents, so i don't care what you call it, but it
> is a perfectly reasonable requirement that new implementations work as
> expected for test vectors that have been published along with the
> algorithm.
>
That's interesting though, as I don't hit these zero-length vectors until
I enable the fuzzing tests. So they're not part of the standard compliance
vectors. And I know for fact - as we deal with compliance testing - that,
for FIPS, zero length is *optional*.

> Again, I am not saying your hardware should do this. I am only saying
> that, from a software engineering perspective, your driver is where we
> fix up the differences, not anywhere else.
>
And I continue my argument that there is no real need to fix up such
differences. At least, not all of them, and not for every driver. As they
may not be relevant for any cases where you *want* to use that driver.
What I think is a mistake is the idea that you should be able to use
every driver with every possible use case. That simply makes no sense.
If you don't automatically select it, then that no longer needs to be a
requirement.

Example:
If I document that my driver should only be used with the IPsec stack,
then why should I need to worry about zero length operations, HMAC continue,
IV output values, etc. That's all not relevant there.

> > > This is all pretty well understood. We all have different interests to
> > > balance against each other, which is why we are perfectly fine with
> > > handling some corner cases in the driver. What we are not prepared to
> > > do is let those corner cases leak into the core crypto layer as cases
> > > that require special handling.
> > >
> > Which can be avoided by not selecting a driver for an application it
> > does not support ... if the corner case is not exercised, then no harm is
> > done. No need for any "leaking" through the crypto layers.
> >
>
> True. So again, if you choose to support your hardware as part of a
> subsystem that does not have these requirements, I am perfectly fine
> with that.
>
But what if that subsystem (IPsec) builds on top of the crypto API?
Because that is the exact situation at hand here.

> Negligible
>
First, that depends entirely on the complexity and amount of the workarounds.
It's often not as simple as a single conditional branch somewhere ...

We're doing performance stuff and often have to meet certain specific
hard performance targets (example: *must* keep up with 10/100/400 Gbps line
rate). Where it's really all or nothing: either you  achieve it or you don't.
So losing a single cycle somewhere can sometimes make a HUGE difference.

> > Driver complexity vs maintenance.
>
> Yes, but again, this complexity has to live *somewhere*, and we don't
> want it in the generic code.
>
Of course it shouldn't be put in the generic code. It just shouldn't be
put anywhere, that's the whole point.

> > Well, for one thing it is not "one particular h/w implementation".
> > As for the zero length thing, I believe it's almost all of them, based
> > on comments from other driver maintainers. I've only seen 1 comment saying
> > that the HW *did* support it. And that was a "maybe".
> >
> > And my main suggestion does not require any core code changes at all.
> >
>
> So what exactly are you proposing? Mind you, we cannot optimize this
> away, so we will have to add checks /somewhere/ that we are not
> calling into the crypto code with length values it doesn't support.
>
Why do you have to add checks /somewhere/?
If you try to use a driver with an application it cannot support, things just
fail. As long as using driver X with application Y is a concious choice and
not something automatic, I don't see the huge problem with that. Just read
the documentation and/or follow the instructions from you h/w vendor.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines
www.insidesecure.com