RE: another testmgr question

From: Pascal Van Leeuwen <pvanleeuwen@insidesecure.com>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>
Subject: RE: another testmgr question
Date: Mon, 27 May 2019 15:56:15 +0000	[thread overview]
Message-ID: <AM6PR09MB35235BD86DCE54760EB14C49D21D0@AM6PR09MB3523.eurprd09.prod.outlook.com> (raw)
In-Reply-To: <CAKv+Gu8W67CDJp3ifWF-wfa47aD4Aim_RnrY9sRxyifnD_KO2g@mail.gmail.com>

> > I understand that as well. But that doesn't change the fact that the
> > application may be waiting for a loooooong (relatively speaking) time
> > for it's results. As latency through hardware may be several orders of
> > a magnitude larger than the time it actually takes to *process* the
> > request.  So when used synchronously the HW may appear to work at a mere
> > fraction of its true performance.
> >
>
> Of course. Sometimes you care about that, and sometimes you don't.
>
> > And if your main interest is in that application, you may not care so
> > much about what the rest of the system does, even if it can use the
> > remaining bandwidth of the accelerator.
> >
>
> 100s of instances of that application, thread, etc could be running at
> the same time, and throughput may be more important than latency.
>
> > In which case it may be desirable *not* to use the accelerator for that
> > application at all due to *very* poor performance (for that application).
> >
> > Which would make even more cycles on the accelerator available to the
> > other applications in the system, so that knife cuts both ways ...
> >
>
> Single thread perfomance is only one metric, and it may not be the one
> you care about most.
>

How relevant is the fact that there may be (other) situations where latency
is not relevant for someone being in a situation where it is relevant?

I was just talking about that situation where it actually is relevant and
therefore you do *not* wish the hardware driver to be used.

Ok, let's phrase it such that can be no further misunderstandings:

If you want performance from a single-threaded single application that does
synchronous, blocking, crypto API calls, then you do not want those to end
up at our hardware. Or, very likely, pretty much any other hardware.
The cra_priority mechanism does not allow the driver to convey such a thing.

> > Adding tons of workarounds to drivers, for example, slows them down, makes
> them
> > use more CPU cycles and more power, and ultimately defeats the purpose of
> having
> > a hardware accelerator at all. That is actually my concern.
>
> If the workaround is in a driver and not on a hot path, we don't
> really care about the memory footprint.
>
These workarounds are on a hot path by definition, as they have to filter
out specific requests coming in, i.e. it affects every single request done
to the driver. As for memory footprint: that is still relevant for embedded
systems even today. Besides, memory footprint affects instruction cache hit
ratio and therefore, indirectly, performance as well.

> > And as an aside, once workarounds have been implemented and proven to
> "work", the
> > underlying issue rarely makes it to the HW guys so we're stuck with it
> forever.
> >
>
> Well, the starting point of the argument was that you deliberately
> omitted handling of zero length inputs to save silicon area.
>
Where did I ever say that we omitted that to save silicon area?
You're putting words in my mouth (or fingers) now. I never said that,
that is not the reason at all.

> So the
> issue would already be known to the h/w guys, and they decided it was
> something they'd punt to the software instead.
>
NO. We never decided any such thing. We decided that it was not a relevant
use case that we needed to support at all. Neither in the hardware nor in
the driver. Our own generic OS driver does not contain any such workarounds.
In fact, based on this Linux driver thing we had a new internal discussion
on it and the outcome did not change: not a relevant use case for us.

> > NO. Hardware is broken if it doesn't comply to its own specifications -
> > which *may* include references to industry standards it must comply with.
> > If I intentionally specify that zero length hashes are not supported, and
> > I don't pretend to comply with any industry standard that requires them,
> > then that's just a *limitation* of the hardware, most certainly not a bug.
>
> Fair enough. But if you want to integrate that h/w in a system that
> does aim to comply, it is up to the software to fix the impedance
> mismatch.
>
Comply with what exactly? You can't "comply" with algorithms ... you just
implement whatever subset makes sense and specify the constraints. You can
comply with protocol specifications, and that's what we do. None of those
requires zero length hashing, HMAC, cipher or AEAD operations.
Many algorithms are unbounded anyway and hardware is bounded by definition.

> > Hardware necessarily *always* has limitations because of all kinds of
> > constraints: area, power, complexity. And even something as mundane as a
> > schedule constraint where you simply can't fit all desired features in the
> > desired schedule. Which is usually very solid due to timeslots being
> > planned in a fab etc. We don't have the luxury of extending our schedule
> > forever like SW guys tend to do ... we're very proud of our track record
> > of always meeting our promised schedules. Plus - silicon can't be patched,
> > so what's done is done and you have to live with it. For many years to
> > come, usually.
> >
>
> This is all pretty well understood. We all have different interests to
> balance against each other, which is why we are perfectly fine with
> handling some corner cases in the driver. What we are not prepared to
> do is let those corner cases leak into the core crypto layer as cases
> that require special handling.
>
Which can be avoided by not selecting a driver for an application it
does not support ... if the corner case is not exercised, then no harm is
done. No need for any "leaking" through the crypto layers.

> > > I know there is a gradient here going
> > > from hashes, AEADs to symmetric ciphers, but I think this applies to
> > > all of them.
> > >
> > > > Please keep in mind that existing hardware cannot be changed. So why
> > > > wasn't the API designed around the limitations of *existing* hardware?
> > >
> > > From a software point of view, adding special cases for zero length
> > > inputs amounts to what you are trying to avoid: using more 'silicon
> > > area'.
> > >
> > No, that's actually not the reason at all in this case. We're trying to
> > avoid significant extra complexity and effort on both the hardware itself
> > and the verification thereof. Silicon area is not even in the picture as
> > a concern for something as "small" as this.
> >
> > Adding zero length support to our hardware architecture is not a trivial
> > exercise. And then you have to weigh added complexity - =added risk, when
> > you talk about hardware with multi-million dollar mask sets in play -
> > against usefulness. Zero-length support was - and still is! - simply not
> > worth the added risk and effort.
> >
>
> Of course. That is why it is perfectly fine to handle this in your driver.
>
Perfectly fine for you, maybe, but not so much for me.

Why:
Performance loss. Driver complexity vs maintenance.
Wasted effort implementing totally irrelevant cases.

> > And if you go that naive route, just fix everything in the driver, then
> > you simply end up with something terribly inefficient because all those
> > corner case checks end up in the fast path and eating up code space.
> >
>
> This is *exactly* the reason why we want this workaround in your
> driver, because if it is not in your driver, we will have to put it in
> generic code where it affects everybody.
>
> > For a someone claiming to "meet in the middle to compromise" you're
> > surely not compromising anything at all ... No offense.
> >
>
> None taken. I am really trying to work with you here, but changing
> core code to address the limitations of one particular h/w
> implementation is not something we do lightly.
>
Well, for one thing it is not "one particular h/w implementation".
As for the zero length thing, I believe it's almost all of them, based
on comments from other driver maintainers. I've only seen 1 comment saying
that the HW *did* support it. And that was a "maybe".

And my main suggestion does not require any core code changes at all.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines
www.insidesecure.com