Subject: Re: [PATCH v3] kernel/module: Reschedule while waiting for modules to finish loading
From: Prarit Bhargava
To: Jessica Yu
Cc: linux-kernel@vger.kernel.org, Heiko Carstens, David Arcari
Date: Thu, 2 May 2019 08:41:48 -0400
In-Reply-To: <20190502094813.GA6690@linux-8ccs>

On 5/2/19 5:48 AM, Jessica Yu wrote:
> +++ Prarit Bhargava [01/05/19 17:26 -0400]:
>>
>> On 4/30/19 6:22 PM, Prarit Bhargava wrote:
>>> On an s390 z14 LPAR with 2 cpus, stalls occur about 3% of the time
>>> while loading the s390_trng.ko module.
>>>
>>> Add a reschedule point to the loop that waits for modules to complete
>>> loading.
>>>
>>> v3: cleanup Fixes line.
>>
>> Jessica, even with this additional patch there appear to be some other
>> issues in the module code that are causing significant delays in boot-up
>> on large systems.
>
> Is this limited to only s390? Or are you seeing this on other arches
> as well? And is it limited to specific modules (like s390_trng)?

Other arches. We're seeing a hang on a new 192-CPU x86_64 box with the
acpi_cpufreq driver. The system is MUCH faster than any other x86_64 box
I've seen, and that's likely why I'm seeing the problem.

>
>> FWIW, the logic in the original patch is correct. It's just that there's,
>> as Heiko discovered, some poor scheduling, etc., that is impacting the
>> module loading code after these changes.
>
> I am really curious to see what these performance regressions look
> like :/  Please update us when you find out more.
>

I sent Heiko a private v4 RFC last night with this patch (sorry for the
cut-and-paste):

diff --git a/kernel/module.c b/kernel/module.c
index 1c429d8d2d74..a4ef8628f26f 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
 	mutex_lock(&module_mutex);
 	old = find_module_all(mod->name, strlen(mod->name), true);
 	if (old != NULL) {
-		if (old->state == MODULE_STATE_COMING
-		    || old->state == MODULE_STATE_UNFORMED) {
+		if (old->state != MODULE_STATE_LIVE) {
 			/* Wait in case it fails to load. */
 			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
+			err = wait_event_interruptible_timeout(module_wq,
+					       finished_loading(mod->name),
+					       HZ / 10000);
 			if (err)
 				goto out_unlocked;
 			goto again;

The original module dependency race issue is fixed simply by changing the
conditional to check for !MODULE_STATE_LIVE. This, unfortunately, exposed
some other problems within the code.

The module_wq is only woken when a module fails to load. It's possible
that the time between the module's failed init() call and the wake-up of
module_wq (kernel/module.c:3455) takes a while. Any thread entering the
add_unformed_module() code while the old module is unloading is put to
sleep waiting for module_wq to be woken.

On the 192-thread box I have noticed that the acpi_cpufreq module attempts
to load 392 times (that is not a typo, and I am going to try to figure that
problem out after this one). This means 191 cpus are put to sleep, while
one cpu is executing the acpi_cpufreq module unload, which is executing
do_init_module() and is now at

fail_free_freeinit:
	kfree(freeinit);
fail:
	/* Try to protect us from buggy refcounters. */
	mod->state = MODULE_STATE_GOING;
	synchronize_rcu();
	module_put(mod);
	blocking_notifier_call_chain(&module_notify_list,
				     MODULE_STATE_GOING, mod);
	klp_module_going(mod);
	ftrace_release_mod(mod);
	free_module(mod);
	wake_up_all(&module_wq);
	return ret;
}

The 191 threads cannot schedule and the system is effectively stuck. It
*does* eventually free itself, but in some cases it takes minutes to do so.

A simple fix for this is, as I've done above, to add a timeout so that the
waiting threads can be scheduled, which allows other processes to run.

After thinking about it a bit more, however, I wonder if a better approach
is to change mod->state to MODULE_FAILED and wake up module_wq immediately,
so that the waiting threads can return an error. I'm experimenting with
that now.

P.
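
P.S. To make that second idea a bit more concrete, here is a rough,
untested sketch based only on the fail path quoted above. Note that
MODULE_FAILED does not exist today; this version just wakes the waiters as
soon as the module is marked MODULE_STATE_GOING, before the slow teardown,
so the sleepers in add_unformed_module() get a chance to re-check
finished_loading() instead of waiting for free_module() to finish:

fail_free_freeinit:
	kfree(freeinit);
fail:
	/* Try to protect us from buggy refcounters. */
	mod->state = MODULE_STATE_GOING;
	/*
	 * Hypothetical change: wake the waiters in add_unformed_module()
	 * here, before synchronize_rcu()/free_module(), rather than at
	 * the very end of the failure path.  finished_loading() and the
	 * check in add_unformed_module() would also have to treat this
	 * state (or a new MODULE_FAILED state) as a failed load so the
	 * sleepers can return an error right away.
	 */
	wake_up_all(&module_wq);
	synchronize_rcu();
	module_put(mod);
	blocking_notifier_call_chain(&module_notify_list,
				     MODULE_STATE_GOING, mod);
	klp_module_going(mod);
	ftrace_release_mod(mod);
	free_module(mod);
	return ret;
}

The waiter side still needs a matching check, so the patch I actually end
up testing may look quite different.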