On Tue, Sep 17, 2019 at 12:33 AM Martin Steigerwald wrote: > > So yes, that would it make it harder to abuse the API, but not > impossible. Which may still be good, I don't know. So the real problem is not people abusing the ABI per se. Yes, I was a bit worried about that too, but it's not the cause of the immediate issue. The real problem is that "getrandom(0)" is really _convenient_ for people who just want random numbers - and not at all the "secure" kind. And it's convenient, and during development and testing, it always "just works", because it doesn't ever block in any normal situation. And then you deploy it, and on some poor users machine it *does* block, because the program now encounters the "oops, no entropy" situation that it never ever encountered on the development machine, because the testing there was mainly done not during booting, but the developer also probably had a much more modern machine that had rdrand, and that quite possibly also had more services enabled at bootup etc so even without rdrand it got tons of entropy. That's why (a) killing the process is _completely_ silly. It misses the whole point of the problem in the first place and only makes things much worse. (b) we should just change getrandom() and add that GRND_SECURE flag instead. Because the current API is fundamentally confusing. If you want secure random numbers, you should really deeply _know_ about it, and think about it, rather than have it be the "oh, don't even bother passing any flags, it's secure by default". (c) the timeout approach isn't wonderful, but it at least helps with the "this was never tested under those circumstances" kind of problem. Note that the people who actually *thought* about getrandom() and use it correctly should already handle error returns (even for the blocking version), because getrandom() can already return EINTR. So the argument that we should cater primarily to the secure key people is not all that strong. We should be able to return EINTR, and the people who *thought* about blocking and about entropy should be fine. And gdm and other silly random users that never wanted entropy in the first place, just "random" random numbers, wouldn't be in the situation they are now. That said - looking at some of the problematic traces that Ahmed posted for his bootup problem, I actually think we can use *another* heuristic to solve the problem. Namely just looking at how much randomness the caller wants. The processes that ask for randomness for an actual secure key have a very fundamental constraint: they need enough randomness for the key to be secure in the first place. But look at what gnome-shell and gnome-session-b does: https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/ and most of them already set GRND_NONBLOCK, but look at the problematic one that actually causes the boot problem: gnome-session-b-327 4.400620: getrandom(16 bytes, flags = 0) and here the big clue is: "Hey, it only asks for 128 bits of randomness". Does anybody believe that 128 bits of randomness is a good basis for a long-term secure key? Even if the key itself contains than that, if you are generating a long-term secure key in this day and age, you had better be asking for more than 128 bits of actual unpredictable base data. So just based on the size of the request we can determine that this is not hugely important. Compare that to the case later on for something that seems to ask for actual interesting randomness. and - just judging by the name - probably even has a reason for it: gsd-smartcard-388 51.433924: getrandom(110 bytes, flags = 0) gsd-smartcard-388 51.433936: getrandom(256 bytes, flags = 0) big difference. End result: I would propose the attached patch. Ahmed, can you just verify that it works for you (obviously with the ext4 plugging reinstated)? It looks like it should "obviously" fix things, but still... Linus