On Tue, Sep 17, 2019 at 12:33 AM Martin Steigerwald <martin@lichtvoll.de> wrote:
>
> So yes, that would make it harder to abuse the API, but not
> impossible. Which may still be good, I don't know.

So the real problem is not people abusing the ABI per se. Yes, I was a
bit worried about that too, but it's not the cause of the immediate
issue.

The real problem is that "getrandom(0)" is really _convenient_ for
people who just want random numbers - and not at all the "secure"
kind.

And it's convenient, and during development and testing, it always
"just works", because it doesn't ever block in any normal situation.

And then you deploy it, and on some poor user's machine it *does*
block, because the program now encounters the "oops, no entropy"
situation that it never ever encountered on the development machine.
Not only was the testing there mostly not done during booting, the
developer also probably had a much more modern machine that had
rdrand, and that quite possibly also had more services enabled at
bootup etc, so even without rdrand it got tons of entropy.

That's why

 (a) killing the process is _completely_ silly.  It misses the whole
point of the problem in the first place and only makes things much
worse.

 (b) we should just change getrandom() and add that GRND_SECURE flag
instead. Because the current API is fundamentally confusing. If you
want secure random numbers, you should really deeply _know_ about it,
and think about it, rather than have it be the "oh, don't even bother
passing any flags, it's secure by default".

 (c) the timeout approach isn't wonderful, but it at least helps with
the "this was never tested under those circumstances" kind of problem.

Note that the people who actually *thought* about getrandom() and use
it correctly should already handle error returns (even for the
blocking version), because getrandom() can already return EINTR. So
the argument that we should cater primarily to the secure key people
is not all that strong. We should be able to return EINTR, and the
people who *thought* about blocking and about entropy should be fine.
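
As a rough sketch of what "handling error returns" looks like for a
careful caller: the helper name fill_random() is made up for
illustration, and this assumes the glibc getrandom() wrapper from
<sys/random.h>. It is one reasonable way to do it, not the only one.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/*
 * Hypothetical helper: fill buf with len random bytes using the
 * blocking variant (flags == 0). Retries when the call is interrupted
 * by a signal (EINTR) or returns fewer bytes than requested.
 * Returns 0 on success, -1 with errno set on any other error.
 */
static int fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len > 0) {
		ssize_t n = getrandom(p, len, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted: just try again */
			return -1;		/* anything else is a real error */
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A caller that already has a loop like this would not be surprised by
an extra EINTR.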

And gdm and other silly random users that never wanted entropy in the
first place, just "random" random numbers, wouldn't be in the
situation they are now.

That said - looking at some of the problematic traces that Ahmed
posted for his bootup problem, I actually think we can use *another*
heuristic to solve the problem. Namely just looking at how much
randomness the caller wants.

The processes that ask for randomness for an actual secure key have a
very fundamental constraint: they need enough randomness for the key
to be secure in the first place.

But look at what gnome-shell and gnome-session-b do:

    https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/

and most of them already set GRND_NONBLOCK, but look at the
problematic one that actually causes the boot problem:

    gnome-session-b-327   4.400620: getrandom(16 bytes, flags = 0)

and here the big clue is: "Hey, it only asks for 128 bits of randomness".

Does anybody believe that 128 bits of randomness is a good basis for a
long-term secure key? Even if the key itself contains less than that, if
you are generating a long-term secure key in this day and age, you had
better be asking for more than 128 bits of actual unpredictable base
data. So just based on the size of the request we can determine that
this is not hugely important.

Compare that to the case later on for something that seems to ask for
actual interesting randomness, and - just judging by the name -
probably even has a reason for it:

      gsd-smartcard-388   51.433924: getrandom(110 bytes, flags = 0)
      gsd-smartcard-388   51.433936: getrandom(256 bytes, flags = 0)

Big difference.
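
To make the size heuristic concrete, here is a minimal sketch of the
kind of check I mean. It is an illustration only: the function name
and the 16-byte cutoff are assumptions, picked so that the traces
above land on the expected sides of the line.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical cutoff: requests of this size or smaller are treated
 * as "not key material". The right value is a policy decision.
 */
#define SMALL_REQUEST_BYTES 16

/* True if the request is too small to plausibly seed a long-term key. */
static bool request_looks_casual(size_t nbytes)
{
	return nbytes <= SMALL_REQUEST_BYTES;
}
```

With that cutoff the 16-byte gnome-session-b request counts as
casual, while the 110- and 256-byte gsd-smartcard requests do not.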

End result: I would propose the attached patch.

Ahmed, can you just verify that it works for you (obviously with the
ext4 plugging reinstated)? It looks like it should "obviously" fix
things, but still...

                    Linus