Date: Fri, 20 Sep 2019 21:37:40 +0200
From: Willy Tarreau
To: Andy Lutomirski
Cc: Linus Torvalds, "Ahmed S. Darwish", Lennart Poettering,
    "Theodore Y. Ts'o", "Eric W. Biederman", "Alexander E. Patrakov",
    Michael Kerrisk, Matthew Garrett, lkml, Ext4 Developers List,
    Linux API, linux-man
Subject: Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()
Message-ID: <20190920193740.GD1889@1wt.eu>
References: <20190918211503.GA1808@darwi-home-pc> <20190918211713.GA2225@darwi-home-pc> <20190920134609.GA2113@pc> <20190920181216.GA1889@1wt.eu>

On Fri, Sep 20, 2019 at 12:22:17PM -0700, Andy Lutomirski wrote:
> Perhaps userland could register a helper that takes over and does
> something better?

If userland sees the failure it can do whatever the developer/distro
packager thought suitable for the system facing this condition.

> But I think the kernel really should do something
> vaguely reasonable all by itself.

Definitely, that's what Linus' proposal was doing. Sleeping for some time
is what I call "vaguely reasonable".

> If nothing else, we want the ext4
> patch that provoked this whole discussion to be applied,

Oh absolutely!

> which means
> that we need to unbreak userspace somehow, and returning garbage to it
> is not a good choice.

It depends how it's used. I'd claim that we certainly use randoms for
other things (such as ASLR/hashtables) *before* using them to generate
long lived keys, thus we can have a bit more time to get some more entropy
before reaching the point of producing these keys.

> Here are some possible approaches that come to mind:
>
> int count;
> while (crng isn't inited) {
>     msleep(1);
> }
>
> and modify add_timer_randomness() to at least credit a tiny bit to
> crng_init_cnt.

Without a timeout, we will certainly still face some situations where it
blocks forever, which is the current problem.

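To make the timeout point concrete, here is a minimal userspace sketch of a
bounded wait. It assumes Linux with glibc 2.25 or later for getrandom() and
GRND_NONBLOCK; the helper name getrandom_with_timeout(), the 10 ms polling
interval and the use of ETIMEDOUT are illustrative assumptions, not anything
proposed in this thread.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper, not an existing API: fill buf with len random bytes,
 * giving up after timeout_ms milliseconds if the CRNG is still not ready.
 * Returns 0 on success, -1 on timeout or error (errno is set). */
int getrandom_with_timeout(void *buf, size_t len, long timeout_ms)
{
    unsigned char *p = buf;
    struct timespec start, now;
    const struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms, arbitrary */

    clock_gettime(CLOCK_MONOTONIC, &start);

    while (len > 0) {
        /* GRND_NONBLOCK makes the call fail with EAGAIN instead of
         * sleeping while the CRNG is not initialized. */
        ssize_t n = getrandom(p, len, GRND_NONBLOCK);

        if (n > 0) {            /* got some bytes, possibly fewer than asked */
            p += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno != EAGAIN && errno != EINTR)
            return -1;          /* real error, e.g. ENOSYS on old kernels */

        /* Not ready yet: check the deadline, then sleep briefly and retry. */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (long)(now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms) {
            errno = ETIMEDOUT;
            return -1;
        }
        nanosleep(&pause, NULL);
    }
    return 0;
}
```

A caller that gets -1 with errno set to ETIMEDOUT can then apply whatever
policy suits the system, for example postponing the generation of long
lived keys rather than hanging the boot.
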
> Or we do something like intentionally triggering readahead on some
> offset on the root block device.

You don't necessarily have such a device, especially when you're in an
initramfs. It's precisely where userland can be smarter. When the caller
is sfdisk for example, it has more opportunity to perform I/O than when
it's a tiny http server starting to present a configuration page.

> We should definitely not trigger *blocking* IO.

I think I agree.

> Also, I wonder if the real problem preventing the RNG from starting up
> is that the crng_init_cnt threshold is too high. We have a rather
> baroque accounting system, and it seems like we can accumulate and
> credit entropy for a very long time indeed without actually
> considering ourselves done.

I have no opinion on this, lacking the skills to evaluate the situation.
What I can say for sure is that I've faced the non-booting issue quite a
number of times on headless systems, and conversely in the 2.4 era, my
front reverse-proxy at the time had the same SSH key as 89 other machines
on the net. So there's surely a sweet spot to find between those two
extremes. I tend to think that waiting *a little bit* for the *first*
random is acceptable, even 10-15s: by the time the user starts to think
about pressing the reset button, the system might have finished booting.
Hashing some RAM locations and the RTC when present can also help a little
bit. If my machine back then had at least combined the RTC's date and time
with the hash, chances for a key collision would have gone down to one
over many thousands.

Willy
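
The last idea, combining a hash of some RAM with the RTC's date and time,
can be sketched as follows. This is an illustration only: FNV-1a is a simple
non-cryptographic hash swapped in for brevity, the names fnv1a64() and
early_seed() are hypothetical, and nothing here reflects how the kernel
actually mixes its entropy pool.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV64_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV64_PRIME        0x100000001b3ULL

/* FNV-1a, 64-bit: fold len bytes of data into the running hash h.
 * Non-cryptographic; used here only to illustrate the mixing step. */
uint64_t fnv1a64(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV64_PRIME;
    }
    return h;
}

/* Hypothetical helper: hash a memory region, then fold in the RTC time
 * (seconds since the epoch). Two machines booted from the same image
 * with identical memory contents still diverge if their clocks differ. */
uint64_t early_seed(const void *ram, size_t ram_len, uint64_t rtc_seconds)
{
    uint64_t h = fnv1a64(FNV64_OFFSET_BASIS, ram, ram_len);

    return fnv1a64(h, &rtc_seconds, sizeof rtc_seconds);
}
```

With identical memory contents, early_seed(ram, len, t) and
early_seed(ram, len, t + 1) differ, which is the effect described above.
The RTC alone adds only a small amount of unpredictability, so this lowers
the odds of collisions between cloned systems without making the result
suitable for key generation.
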