[v3] kernel/module: Reschedule while waiting for modules to finish loading

Message ID 20190430222207.3002-1-prarit@redhat.com
State In Next
Commit 7e470ea99bcd471038f00fe458fc4714886b96d6
Series
  • [v3] kernel/module: Reschedule while waiting for modules to finish loading

Commit Message

Prarit Bhargava April 30, 2019, 10:22 p.m. UTC
On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
about 3% of the time.

Add a reschedule point to the loop that waits for modules to complete
loading.

v3: clean up the Fixes line.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Fixes: f9a75c1d717f ("modules: Only return -EEXIST for modules that have finished loading")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 kernel/module.c | 1 +
 1 file changed, 1 insertion(+)
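
For reference, with this patch applied the wait loop looks roughly like the
sketch below (reconstructed from the diff hunks in this thread, not verbatim
kernel source; the rest of add_unformed_module() is abbreviated):

static int add_unformed_module(struct module *mod)
{
	int err;
	struct module *old;

	mod->state = MODULE_STATE_UNFORMED;

again:
	mutex_lock(&module_mutex);
	old = find_module_all(mod->name, strlen(mod->name), true);
	if (old != NULL) {
		if (old->state == MODULE_STATE_COMING
		    || old->state == MODULE_STATE_UNFORMED) {
			/* Wait in case it fails to load. */
			mutex_unlock(&module_mutex);
			err = wait_event_interruptible(module_wq,
					       finished_loading(mod->name));
			if (err)
				goto out_unlocked;
			/* The fix: yield the CPU before retaking
			 * module_mutex and rechecking the state. */
			cond_resched();
			goto again;
		}
		err = -EEXIST;
		goto out;
	}
	/* ... otherwise add mod to the modules list ... */
	err = 0;

out:
	mutex_unlock(&module_mutex);
out_unlocked:
	return err;
}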

Comments

Jessica Yu May 1, 2019, 7:49 a.m. UTC | #1
+++ Prarit Bhargava [30/04/19 18:22 -0400]:
>On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
>about 3% of the time.
>
>Add a reschedule point to the loop that waits for modules to complete
>loading.
>
>v3: clean up the Fixes line.

Thanks, this patch has been re-applied to modules-next.

Jessica

>Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
>Fixes: f9a75c1d717f ("modules: Only return -EEXIST for modules that have finished loading")
>Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>Cc: Jessica Yu <jeyu@kernel.org>
>Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
>---
> kernel/module.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/kernel/module.c b/kernel/module.c
>index 410eeb7e4f1d..48748cfec991 100644
>--- a/kernel/module.c
>+++ b/kernel/module.c
>@@ -3585,6 +3585,7 @@ static int add_unformed_module(struct module *mod)
> 					       finished_loading(mod->name));
> 			if (err)
> 				goto out_unlocked;
>+			cond_resched();
> 			goto again;
> 		}
> 		err = -EEXIST;
>-- 
>2.18.1
>
Prarit Bhargava May 1, 2019, 9:26 p.m. UTC | #2
On 4/30/19 6:22 PM, Prarit Bhargava wrote:
> On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
> about 3% of the time.
>
> Add a reschedule point to the loop that waits for modules to complete
> loading.
>
> v3: clean up the Fixes line.

Jessica, even with this additional patch there appear to be some other issues
in the module code that are causing significant boot delays on large
systems.

Please revert these fixes from linux-next & modules-next.  I apologize for the
extra work but I think it is for the best until I come up with a more complete
and better-tested patch.

FWIW, the logic in the original patch is correct.  It's just that, as Heiko
discovered, there is some poor scheduling behavior impacting the module
loading code after these changes.

Again, my apologies,

P.

> 
> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> Fixes: f9a75c1d717f ("modules: Only return -EEXIST for modules that have finished loading")
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Cc: Jessica Yu <jeyu@kernel.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
>  kernel/module.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/module.c b/kernel/module.c
> index 410eeb7e4f1d..48748cfec991 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3585,6 +3585,7 @@ static int add_unformed_module(struct module *mod)
>  					       finished_loading(mod->name));
>  			if (err)
>  				goto out_unlocked;
> +			cond_resched();
>  			goto again;
>  		}
>  		err = -EEXIST;
>
Jessica Yu May 2, 2019, 9:48 a.m. UTC | #3
+++ Prarit Bhargava [01/05/19 17:26 -0400]:
>
>
>On 4/30/19 6:22 PM, Prarit Bhargava wrote:
>> On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
>> about 3% of the time.
>>
>> Add a reschedule point to the loop that waits for modules to complete
>> loading.
>>
>> v3: clean up the Fixes line.
>
>Jessica, even with this additional patch there appear to be some other issues
>in the module code that are causing significant boot delays on large
>systems.

Is this limited to only s390? Or are you seeing this on other arches
as well? And is it limited to specific modules (like s390_trng)?

>Please revert these fixes from linux-next & modules-next.  I apologize for the
>extra work but I think it is for the best until I come up with a more complete
>and better-tested patch.

Sure, then I will revert this patch and the other one as well
("modules: Only return -EEXIST for modules that have finished
loading").

>FWIW, the logic in the original patch is correct.  It's just that, as Heiko
>discovered, there is some poor scheduling behavior impacting the module
>loading code after these changes.

I am really curious to see what these performance regressions look
like :/ Please update us when you find out more.

>Again, my apologies,

No problem! Thanks again.

Jessica
Prarit Bhargava May 2, 2019, 12:41 p.m. UTC | #4
On 5/2/19 5:48 AM, Jessica Yu wrote:
> +++ Prarit Bhargava [01/05/19 17:26 -0400]:
>>
>>
>> On 4/30/19 6:22 PM, Prarit Bhargava wrote:
>>> On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
>>> about 3% of the time.
>>>
>>> Add a reschedule point to the loop that waits for modules to complete
>>> loading.
>>>
>>> v3: clean up the Fixes line.
>>
>> Jessica, even with this additional patch there appear to be some other issues
>> in the module code that are causing significant boot delays on large
>> systems.
> 
> Is this limited to only s390? Or are you seeing this on other arches
> as well? And is it limited to specific modules (like s390_trng)?

Other arches.  We're seeing a hang with the acpi_cpufreq driver on a new
192-CPU x86_64 box.  The system is MUCH faster than any other x86_64 box I've
seen, and that's likely why I'm seeing a problem.

> 
>> FWIW, the logic in the original patch is correct.  It's just that, as Heiko
>> discovered, there is some poor scheduling behavior impacting the module
>> loading code after these changes.
> 
> I am really curious to see what these performance regressions look
> like :/ Please update us when you find out more.
> 

I sent Heiko a private v4 RFC last night with this patch (sorry for the
cut-and-paste)

diff --git a/kernel/module.c b/kernel/module.c
index 1c429d8d2d74..a4ef8628f26f 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
	mutex_lock(&module_mutex);
	old = find_module_all(mod->name, strlen(mod->name), true);
	if (old != NULL) {
-		if (old->state == MODULE_STATE_COMING
-		    || old->state == MODULE_STATE_UNFORMED) {
+		if (old->state != MODULE_STATE_LIVE) {
			/* Wait in case it fails to load. */
			mutex_unlock(&module_mutex);
-			err = wait_event_interruptible(module_wq,
-					       finished_loading(mod->name));
+			err = wait_event_interruptible_timeout(module_wq,
+					       finished_loading(mod->name),
+					       HZ / 10000);
			if (err)
				goto out_unlocked;
			goto again;

The original module dependency race issue is fixed simply by changing the
conditional to check for !MODULE_STATE_LIVE.  This, unfortunately, exposed some
other problems within the code.

The module_wq is only woken when a module fails to load.  It's possible that
the time between the module's failed init() call and the module_wq wakeup
(kernel/module.c:3455) takes a while.  Any thread entering the
add_unformed_module() code while the old module is unloading is put to sleep
waiting for the module_wq wakeup.

On the 192-thread box I have noticed that the acpi_cpufreq module attempts
to load 392 times (that is not a typo, and I am going to try to figure that
problem out after this one).  This means 191 CPUs are put to sleep, and one
CPU is executing the acpi_cpufreq module unload, which is executing
do_init_module() and is now at

fail_free_freeinit:
        kfree(freeinit);
fail:
        /* Try to protect us from buggy refcounters. */
        mod->state = MODULE_STATE_GOING;
        synchronize_rcu();
        module_put(mod);
        blocking_notifier_call_chain(&module_notify_list,
                                     MODULE_STATE_GOING, mod);
        klp_module_going(mod);
        ftrace_release_mod(mod);
        free_module(mod);
        wake_up_all(&module_wq);
        return ret;
}

The 191 threads cannot schedule and the system is effectively stuck.  It *does*
eventually free itself but in some cases it takes minutes to do so.

A simple fix for this is, as I've done above, to add a timeout so that
the threads can be scheduled, which allows other processes to run.  After
thinking about it a bit more, however, I wonder if a better approach is to
change mod->state to MODULE_FAILED and wake the module_wq immediately so
that the threads can return an error.
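
Roughly, that alternative would change the tail of the failure path shown
above to something like this (MODULE_FAILED is hypothetical; no such state
exists in the kernel, it is named here only to illustrate the idea):

fail:
	/* Try to protect us from buggy refcounters. */
	mod->state = MODULE_FAILED;	/* hypothetical "load failed" state */
	synchronize_rcu();
	module_put(mod);
	blocking_notifier_call_chain(&module_notify_list,
				     MODULE_STATE_GOING, mod);
	klp_module_going(mod);
	ftrace_release_mod(mod);
	/* Wake waiters as soon as the failure is visible, rather than only
	 * after free_module() has finished the slow teardown. */
	wake_up_all(&module_wq);
	free_module(mod);
	return ret;

add_unformed_module() would then treat MODULE_FAILED like a finished load
and return an error instead of going back to sleep.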

I'm experimenting with that now.

P.
Prarit Bhargava May 2, 2019, 5:46 p.m. UTC | #5
On 5/2/19 8:41 AM, Prarit Bhargava wrote:
> 
> 
> On 5/2/19 5:48 AM, Jessica Yu wrote:
>> +++ Prarit Bhargava [01/05/19 17:26 -0400]:
>>>
>>>
>>> On 4/30/19 6:22 PM, Prarit Bhargava wrote:
>>>> On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
>>>> about 3% of the time.
>>>>
>>>> Add a reschedule point to the loop that waits for modules to complete
>>>> loading.
>>>>
>>>> v3: clean up the Fixes line.
>>>
>>> Jessica, even with this additional patch there appear to be some other issues
>>> in the module code that are causing significant boot delays on large
>>> systems.
>>
>> Is this limited to only s390? Or are you seeing this on other arches
>> as well? And is it limited to specific modules (like s390_trng)?
> 
> Other arches.  We're seeing a hang with the acpi_cpufreq driver on a new
> 192-CPU x86_64 box.  The system is MUCH faster than any other x86_64 box I've
> seen, and that's likely why I'm seeing a problem.
> 
>>
>>> FWIW, the logic in the original patch is correct.  It's just that, as Heiko
>>> discovered, there is some poor scheduling behavior impacting the module
>>> loading code after these changes.
>>
>> I am really curious to see what these performance regressions look
>> like :/ Please update us when you find out more.
>>
> 
> I sent Heiko a private v4 RFC last night with this patch (sorry for the
> cut-and-paste)
> 
> diff --git a/kernel/module.c b/kernel/module.c
> index 1c429d8d2d74..a4ef8628f26f 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
> 	mutex_lock(&module_mutex);
> 	old = find_module_all(mod->name, strlen(mod->name), true);
> 	if (old != NULL) {
> -		if (old->state == MODULE_STATE_COMING
> -		    || old->state == MODULE_STATE_UNFORMED) {
> +		if (old->state != MODULE_STATE_LIVE) {
> 			/* Wait in case it fails to load. */
> 			mutex_unlock(&module_mutex);
> -			err = wait_event_interruptible(module_wq,
> -					       finished_loading(mod->name));
> +			err = wait_event_interruptible_timeout(module_wq,
> +					       finished_loading(mod->name),
> +					       HZ / 10000);
> 			if (err)
> 				goto out_unlocked;
> 			goto again;
> 
> The original module dependency race issue is fixed simply by changing the
> conditional to check for !MODULE_STATE_LIVE.  This, unfortunately, exposed some
> other problems within the code.
> 
> The module_wq is only woken when a module fails to load.  It's possible that
> the time between the module's failed init() call and the module_wq wakeup
> (kernel/module.c:3455) takes a while.  Any thread entering the
> add_unformed_module() code while the old module is unloading is put to sleep
> waiting for the module_wq wakeup.
> 
> On the 192-thread box I have noticed that the acpi_cpufreq module attempts
> to load 392 times (that is not a typo, and I am going to try to figure that
> problem out after this one).  This means 191 CPUs are put to sleep, and one
> CPU is executing the acpi_cpufreq module unload, which is executing
> do_init_module() and is now at
> 
> fail_free_freeinit:
>         kfree(freeinit);
> fail:
>         /* Try to protect us from buggy refcounters. */
>         mod->state = MODULE_STATE_GOING;
>         synchronize_rcu();
>         module_put(mod);
>         blocking_notifier_call_chain(&module_notify_list,
>                                      MODULE_STATE_GOING, mod);
>         klp_module_going(mod);
>         ftrace_release_mod(mod);
>         free_module(mod);
>         wake_up_all(&module_wq);
>         return ret;
> }
> 
> The 191 threads cannot schedule and the system is effectively stuck.  It *does*
> eventually free itself but in some cases it takes minutes to do so.
> 
> A simple fix for this is, as I've done above, to add a timeout so that
> the threads can be scheduled, which allows other processes to run.

After taking a much closer look, the above patch appears to be correct.  I am
not seeing any boot failures associated with it anywhere.  I would like to
hear from Heiko as to whether or not this works for him, though.

P.
Barret Rhoden May 10, 2019, 6:40 p.m. UTC | #6
Hi -

On 5/2/19 1:46 PM, Prarit Bhargava wrote:
> On 5/2/19 8:41 AM, Prarit Bhargava wrote:
>> On 5/2/19 5:48 AM, Jessica Yu wrote:
>>> +++ Prarit Bhargava [01/05/19 17:26 -0400]:
>>>> On 4/30/19 6:22 PM, Prarit Bhargava wrote:
>>>>> On an s390 z14 LPAR with 2 CPUs, loading the s390_trng.ko module stalls
>>>>> about 3% of the time.
>>>>>
>>>>> Add a reschedule point to the loop that waits for modules to complete
>>>>> loading.
>>>>>
>>>>> v3: clean up the Fixes line.
>>>>
>>>> Jessica, even with this additional patch there appear to be some other issues
>>>> in the module code that are causing significant boot delays on large
>>>> systems.
>>>
>>> Is this limited to only s390? Or are you seeing this on other arches
>>> as well? And is it limited to specific modules (like s390_trng)?
>>
>> Other arches.  We're seeing a hang with the acpi_cpufreq driver on a new
>> 192-CPU x86_64 box.  The system is MUCH faster than any other x86_64 box I've
>> seen, and that's likely why I'm seeing a problem.
>>
>>>
>>>> FWIW, the logic in the original patch is correct.  It's just that, as Heiko
>>>> discovered, there is some poor scheduling behavior impacting the module
>>>> loading code after these changes.
>>>
>>> I am really curious to see what these performance regressions look
>>> like :/ Please update us when you find out more.
>>>
>>
>> I sent Heiko a private v4 RFC last night with this patch (sorry for the
>> cut-and-paste)
>>
>> diff --git a/kernel/module.c b/kernel/module.c
>> index 1c429d8d2d74..a4ef8628f26f 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>> @@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
>> 	mutex_lock(&module_mutex);
>> 	old = find_module_all(mod->name, strlen(mod->name), true);
>> 	if (old != NULL) {
>> -		if (old->state == MODULE_STATE_COMING
>> -		    || old->state == MODULE_STATE_UNFORMED) {
>> +		if (old->state != MODULE_STATE_LIVE) {
>> 			/* Wait in case it fails to load. */
>> 			mutex_unlock(&module_mutex);
>> -			err = wait_event_interruptible(module_wq,
>> -					       finished_loading(mod->name));
>> +			err = wait_event_interruptible_timeout(module_wq,
>> +					       finished_loading(mod->name),
>> +					       HZ / 10000);
>> 			if (err)
>> 				goto out_unlocked;
>> 			goto again;
>>
>> The original module dependency race issue is fixed simply by changing the
>> conditional to check for !MODULE_STATE_LIVE.  This, unfortunately, exposed some
>> other problems within the code.
>>
>> The module_wq is only woken when a module fails to load.  It's possible that
>> the time between the module's failed init() call and the module_wq wakeup
>> (kernel/module.c:3455) takes a while.  Any thread entering the
>> add_unformed_module() code while the old module is unloading is put to sleep
>> waiting for the module_wq wakeup.
>>
>> On the 192-thread box I have noticed that the acpi_cpufreq module attempts
>> to load 392 times (that is not a typo, and I am going to try to figure that
>> problem out after this one).  This means 191 CPUs are put to sleep, and one
>> CPU is executing the acpi_cpufreq module unload, which is executing
>> do_init_module() and is now at
>>
>> fail_free_freeinit:
>>          kfree(freeinit);
>> fail:
>>          /* Try to protect us from buggy refcounters. */
>>          mod->state = MODULE_STATE_GOING;
>>          synchronize_rcu();
>>          module_put(mod);
>>          blocking_notifier_call_chain(&module_notify_list,
>>                                       MODULE_STATE_GOING, mod);
>>          klp_module_going(mod);
>>          ftrace_release_mod(mod);
>>          free_module(mod);
>>          wake_up_all(&module_wq);
>>          return ret;
>> }
>>
>> The 191 threads cannot schedule and the system is effectively stuck.  It *does*
>> eventually free itself but in some cases it takes minutes to do so.
>>
>> A simple fix for this is, as I've done above, to add a timeout so that
>> the threads can be scheduled, which allows other processes to run.
> 
> After taking a much closer look, the above patch appears to be correct.  I am
> not seeing any boot failures associated with it anywhere.  I would like to
> hear from Heiko as to whether or not this works for him, though.

I think I found the issue here.  The original patch changed the state 
check to "not LIVE", which made it include GOING.  However, the wake 
condition was not changed.  That could lead to a livelock, which I 
experienced.
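
Concretely, the two conditions that now disagree look roughly like this (a
sketch, assuming finished_loading() at this point only counts a gone or LIVE
module as finished):

/* Waiter in add_unformed_module(), after the original patch: */
if (old->state != MODULE_STATE_LIVE) {	/* now also true for GOING */
	/* Wait in case it fails to load. */
	mutex_unlock(&module_mutex);
	err = wait_event_interruptible(module_wq,
			       finished_loading(mod->name));
	if (err)
		goto out_unlocked;
	goto again;
}

/* Wake condition the waiter sleeps on: */
static bool finished_loading(const char *name)
{
	struct module *mod;
	bool ret;

	mutex_lock(&module_mutex);
	mod = find_module_all(name, strlen(name), true);
	ret = !mod || mod->state == MODULE_STATE_LIVE;	/* GOING not counted */
	mutex_unlock(&module_mutex);

	return ret;
}

A module sitting in MODULE_STATE_GOING satisfies the waiter's retry
condition, but finished_loading() never reports it as finished, so the
waiter and the waker disagree about when loading is over.  Making the two
conditions agree (e.g. also counting GOING as finished) would presumably
remove the need for the timeout and cond_resched() workarounds.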

I have a patch that fixes it, which I'll send out shortly.  With my
change, I think you won't need any of the scheduler functions
(cond_resched(), wait timeouts, etc.).  Those were probably just papering
over the issue.

Barret

Patch

diff --git a/kernel/module.c b/kernel/module.c
index 410eeb7e4f1d..48748cfec991 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3585,6 +3585,7 @@ static int add_unformed_module(struct module *mod)
 					       finished_loading(mod->name));
 			if (err)
 				goto out_unlocked;
+			cond_resched();
 			goto again;
 		}
 		err = -EEXIST;