linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* PROBLEM: Long Workqueue delays.
@ 2020-08-17 11:40 Jim Baxter
  2020-08-17 11:57 ` Greg KH
  2020-08-17 15:25 ` Alan Stern
  0 siblings, 2 replies; 7+ messages in thread
From: Jim Baxter @ 2020-08-17 11:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linux-usb
  Cc: Resch Carsten (CM/ESO6), Rosca, Eugeniu (ADITG/ESB)

We have issues with the workqueue of the kernel overloading the CPU 0 
when we we disconnect a USB stick.

This results in other items on the shared workqueue being delayed by
around 6.5 seconds with a default kernel configuration and 2.3 seconds
on a config tailored for our RCar embedded platform.

I am aware there will be delays on the shared workqueue, are the delays
we are seeing considered normal?



We first noticed this issue on custom hardware and we have recreated it
on an RCar Starter Kit using a test module [1] to replicate the
behaviour, the test module outputs any delays of greater then 9ms.

To run the test we have a 4GB random file on a USB stick and perform
the following test:


- Load the Module:
# taskset -c 0 modprobe latency-mon

- Copy large amount of data from the stick:
# dd if=/run/media/sda1/sample.txt of=/dev/zero
[ 1437.517603] DELAY: 10
8388607+1 records in
8388607+1 records out


- Disconnect the USB stick:
[ 1551.796792] usb 2-1: USB disconnect, device number 2
[ 1558.625517] DELAY: 6782


The Delay output 6782 is in milliseconds.

Thank you for you help.
Jim Baxter

[1] Test Module:
// SPDX-License-Identifier: GPL-2.0
/*
 * Simple WQ latency monitoring
 *
 * Copyright (C) 2020 Advanced Driver Information Technology.
 */

#include <linux/init.h>
#include <linux/ktime.h>
#include <linux/module.h>

#define PERIOD_MS 100

static struct delayed_work wq;
static u64 us_save;

static void wq_cb(struct work_struct *work)
{
	u64 us = ktime_to_us(ktime_get());
	u64 us_diff = us - us_save;
	u64 us_print = 0;

	if (!us_save)
		goto skip_print;


	us_print = us_diff / 1000 - PERIOD_MS;
	if (us_print > 9)
		pr_crit("DELAY: %lld\n", us_print);

skip_print:
	us_save = us;
	schedule_delayed_work(&wq, msecs_to_jiffies(PERIOD_MS));
}

static int latency_mon_init(void)
{
	us_save = 0;
	INIT_DELAYED_WORK(&wq, wq_cb);
	schedule_delayed_work(&wq, msecs_to_jiffies(PERIOD_MS));

	return 0;
}

static void latency_mon_exit(void)
{
	cancel_delayed_work_sync(&wq);
	pr_info("%s\n", __func__);
}

module_init(latency_mon_init);
module_exit(latency_mon_exit);
MODULE_AUTHOR("Eugeniu Rosca <erosca@de.adit-jv.com>");
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-17 11:40 PROBLEM: Long Workqueue delays Jim Baxter
@ 2020-08-17 11:57 ` Greg KH
  2020-08-17 18:24   ` Jim Baxter
  2020-08-17 15:25 ` Alan Stern
  1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2020-08-17 11:57 UTC (permalink / raw)
  To: Jim Baxter
  Cc: linux-kernel, linux-mm, linux-usb, Resch Carsten (CM/ESO6),
	Rosca, Eugeniu (ADITG/ESB)

On Mon, Aug 17, 2020 at 12:40:03PM +0100, Jim Baxter wrote:
> We have issues with the workqueue of the kernel overloading the CPU 0 
> when we we disconnect a USB stick.
> 
> This results in other items on the shared workqueue being delayed by
> around 6.5 seconds with a default kernel configuration and 2.3 seconds
> on a config tailored for our RCar embedded platform.
> 
> I am aware there will be delays on the shared workqueue, are the delays
> we are seeing considered normal?
> 
> 
> 
> We first noticed this issue on custom hardware and we have recreated it
> on an RCar Starter Kit using a test module [1] to replicate the
> behaviour, the test module outputs any delays of greater then 9ms.
> 
> To run the test we have a 4GB random file on a USB stick and perform
> the following test:
> 
> 
> - Load the Module:
> # taskset -c 0 modprobe latency-mon
> 
> - Copy large amount of data from the stick:
> # dd if=/run/media/sda1/sample.txt of=/dev/zero
> [ 1437.517603] DELAY: 10
> 8388607+1 records in
> 8388607+1 records out
> 

Is this data really flushed out to the device?

> - Disconnect the USB stick:
> [ 1551.796792] usb 2-1: USB disconnect, device number 2
> [ 1558.625517] DELAY: 6782
> 
> 
> The Delay output 6782 is in milliseconds.

What USB workqueue is taking so long?

The one trying to deal with the filesystem flushing out the data that it
can't do now that the device is removed?  :)

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-17 11:40 PROBLEM: Long Workqueue delays Jim Baxter
  2020-08-17 11:57 ` Greg KH
@ 2020-08-17 15:25 ` Alan Stern
  1 sibling, 0 replies; 7+ messages in thread
From: Alan Stern @ 2020-08-17 15:25 UTC (permalink / raw)
  To: Jim Baxter
  Cc: linux-kernel, linux-mm, linux-usb, Resch Carsten (CM/ESO6),
	Rosca, Eugeniu (ADITG/ESB)

On Mon, Aug 17, 2020 at 12:40:03PM +0100, Jim Baxter wrote:
> We have issues with the workqueue of the kernel overloading the CPU 0 
> when we we disconnect a USB stick.

Do you unmount the stick before disconnecting it?

> This results in other items on the shared workqueue being delayed by
> around 6.5 seconds with a default kernel configuration and 2.3 seconds
> on a config tailored for our RCar embedded platform.
> 
> I am aware there will be delays on the shared workqueue, are the delays
> we are seeing considered normal?

Assuming the workqueue delay is caused by the USB subsystem, you may be 
able to get more information by turning on USB dynamic debugging:

	echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control

and seeing what shows up in the dmesg log.

One long-running job should not normally be enough to delay an entire 
workqueues.  Workqueues are allowed to spawn multiple threads; they are 
supposed to resize themselves dynamically as required.

Alan Stern


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-17 11:57 ` Greg KH
@ 2020-08-17 18:24   ` Jim Baxter
  2020-08-17 18:47     ` Alan Stern
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Baxter @ 2020-08-17 18:24 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, linux-mm, linux-usb, Resch Carsten (CM/ESO6),
	Rosca, Eugeniu (ADITG/ESB)

On 17/08/2020 12:57, Greg KH wrote:
> On Mon, Aug 17, 2020 at 12:40:03PM +0100, Jim Baxter wrote:
>> We have issues with the workqueue of the kernel overloading the CPU 0 
>> when we we disconnect a USB stick.
>>
>> This results in other items on the shared workqueue being delayed by
>> around 6.5 seconds with a default kernel configuration and 2.3 seconds
>> on a config tailored for our RCar embedded platform.
>>
> 
> Is this data really flushed out to the device?
> 
I am unsure if the delay is due to a single system or a combination of memory,
usb or filesystem operations, the delay also occurs if the device is mounted
as ro and using the sync option.

Using umount stops the issue occurring but is unfortunately not guaranteed in
our particular system.

>> - Disconnect the USB stick:
>> [ 1551.796792] usb 2-1: USB disconnect, device number 2
>> [ 1558.625517] DELAY: 6782
>>
>>
>> The Delay output 6782 is in milliseconds.
> 
> What USB workqueue is taking so long?> 
> The one trying to deal with the filesystem flushing out the data that it
> can't do now that the device is removed?  :)
> 
From my analysis it is the hub_event workqueue shown to be using most of the CPU,
the kworker/0:1+usb thread uses around 98% of the CPU.

I have traced the workqueue:workqueue_queue_work function while unplugging the USB
but not found a particular workqueue function being called a lot.

Using perf Iidentified the hub_events workqueue was spending a lot of time in
invalidate_partition(), I have included a cut down the captured data from perf in
[2] which shows the additional functions where the kworker spends most of its time.


I realise that not unmounting the USB stick is not ideal, though I wonder what 
additional work is done when unplugging the USB stick compared to unmounting it.
I guess it may be waiting for a time-out during the operation without the unmount.


Thanks,
Jim Baxter

> thanks,
> 
> greg k-h
> 

[2] perf trace:
    95.22%     0.00%  kworker/0:2-eve  [kernel.kallsyms]
    |
    ---ret_from_fork
       kthread
       worker_thread
       |          
        --95.15%--process_one_work
		  |          
		   --94.99%--hub_event
			 |          
			  --94.99%--usb_disconnect
			  <snip>
				|  
				--94.90%--invalidate_partition
				   __invalidate_device
				   |          
				   |--64.55%--invalidate_bdev
				   |  |          
				   |   --64.13%--invalidate_mapping_pages
				   |     |          
				   |     |--24.09%--invalidate_inode_page
				   |     |   |          
				   |     |   --23.44%--remove_mapping
				   |     |     |          
				   |     |      --23.20%--__remove_mapping
				   |     |        |          
				   |     |         --21.90%--arch_local_irq_restore
				   |     |          
				   |     |--22.44%--arch_local_irq_enable
				   |          
					--30.35%--shrink_dcache_sb 
					<snip>
					  |      
					  --30.17%--truncate_inode_pages_range


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-17 18:24   ` Jim Baxter
@ 2020-08-17 18:47     ` Alan Stern
  2020-08-18 10:54       ` Jim Baxter
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Stern @ 2020-08-17 18:47 UTC (permalink / raw)
  To: Jim Baxter
  Cc: Greg KH, linux-kernel, linux-mm, linux-usb,
	Resch Carsten (CM/ESO6), Rosca, Eugeniu (ADITG/ESB)

On Mon, Aug 17, 2020 at 07:24:01PM +0100, Jim Baxter wrote:
> On 17/08/2020 12:57, Greg KH wrote:
> > On Mon, Aug 17, 2020 at 12:40:03PM +0100, Jim Baxter wrote:
> >> We have issues with the workqueue of the kernel overloading the CPU 0 
> >> when we we disconnect a USB stick.
> >>
> >> This results in other items on the shared workqueue being delayed by
> >> around 6.5 seconds with a default kernel configuration and 2.3 seconds
> >> on a config tailored for our RCar embedded platform.
> >>
> > 
> > Is this data really flushed out to the device?

In the example, data was copied _from_ the device, not to it.

> I am unsure if the delay is due to a single system or a combination of memory,
> usb or filesystem operations, the delay also occurs if the device is mounted
> as ro and using the sync option.
> 
> Using umount stops the issue occurring but is unfortunately not guaranteed in
> our particular system.

Unplugging a R/W USB drive without unmounting it first is a great way to 
corrupt the data.

> >> - Disconnect the USB stick:
> >> [ 1551.796792] usb 2-1: USB disconnect, device number 2
> >> [ 1558.625517] DELAY: 6782
> >>
> >>
> >> The Delay output 6782 is in milliseconds.
> > 
> > What USB workqueue is taking so long?> 
> > The one trying to deal with the filesystem flushing out the data that it
> > can't do now that the device is removed?  :)
> > 
> From my analysis it is the hub_event workqueue shown to be using most of the CPU,
> the kworker/0:1+usb thread uses around 98% of the CPU.
> 
> I have traced the workqueue:workqueue_queue_work function while unplugging the USB
> but not found a particular workqueue function being called a lot.
> 
> Using perf Iidentified the hub_events workqueue was spending a lot of time in
> invalidate_partition(), I have included a cut down the captured data from perf in
> [2] which shows the additional functions where the kworker spends most of its time.

invalidate_partition() is part of the block layer, not part of USB.  It 
gets called whenever a drive is removed from the system, no matter what 
type of drive it is.  You should ask the people involved in that 
subsystem why it takes so long.

> I realise that not unmounting the USB stick is not ideal, though I wonder what 
> additional work is done when unplugging the USB stick compared to unmounting it.

Unmounting a drive flushes all the dirty buffers from memory back to the 
drive.  Obviously that can't be done if the drive is unplugged first.

As far as the USB subsystem is concerned, exactly the same amount of 
work is done during disconnect regardless of whether or not the drive is 
mounted.  (In fact, the USB subsystem doesn't even know whether a drive 
is mounted; that concept is part of the block and filesystem layers.)

> I guess it may be waiting for a time-out during the operation without the unmount.

That seems very unlikely.  When a USB device gets unplugged the system 
realizes it.  Any I/O meant for that device is immediately cancelled; 
there are no timeouts.

(Okay, not strictly true; there is a fraction-of-a-second timeout during 
which the system waits to see whether the disconnect was permanent or 
just a temporary glitch.  But you're talking about 6-second long 
delays.)

Alan Stern


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-17 18:47     ` Alan Stern
@ 2020-08-18 10:54       ` Jim Baxter
  2020-08-18 14:48         ` Alan Stern
  0 siblings, 1 reply; 7+ messages in thread
From: Jim Baxter @ 2020-08-18 10:54 UTC (permalink / raw)
  To: Alan Stern
  Cc: Greg KH, linux-kernel, linux-mm, linux-usb,
	Resch Carsten (CM/ESO6), Rosca, Eugeniu (ADITG/ESB)

On 17/08/2020 19:47, Alan Stern wrote:
> 
> Unplugging a R/W USB drive without unmounting it first is a great way to 
> corrupt the data.
> 
Thank you, post development we will only mount the USB stick as R/O.

>> Using perf Iidentified the hub_events workqueue was spending a lot of time in
>> invalidate_partition(), I have included a cut down the captured data from perf in
>> [2] which shows the additional functions where the kworker spends most of its time.
> 
> invalidate_partition() is part of the block layer, not part of USB.  It 
> gets called whenever a drive is removed from the system, no matter what 
> type of drive it is.  You should ask the people involved in that 
> subsystem why it takes so long.
> 

I included the linux-mm list but missed the filesystem, I will ask the question
to the linux-fsdevel too.

>> I realise that not unmounting the USB stick is not ideal, though I wonder what 
>> additional work is done when unplugging the USB stick compared to unmounting it.
> 
> Unmounting a drive flushes all the dirty buffers from memory back to the 
> drive.  Obviously that can't be done if the drive is unplugged first.
> 
> As far as the USB subsystem is concerned, exactly the same amount of 
> work is done during disconnect regardless of whether or not the drive is 
> mounted.  (In fact, the USB subsystem doesn't even know whether a drive 
> is mounted; that concept is part of the block and filesystem layers.)
>>> I guess it may be waiting for a time-out during the operation without the unmount.
> 
> That seems very unlikely.  When a USB device gets unplugged the system 
> realizes it.  Any I/O meant for that device is immediately cancelled; 
> there are no timeouts.
> 
> (Okay, not strictly true; there is a fraction-of-a-second timeout during 
> which the system waits to see whether the disconnect was permanent or 
> just a temporary glitch.  But you're talking about 6-second long 
> delays.)
> 

Thank you, no I don't expect that to cause the issue and it is very likely the delay
is in another subsystem.

Regards,
Jim Baxter


> Alan Stern
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Long Workqueue delays.
  2020-08-18 10:54       ` Jim Baxter
@ 2020-08-18 14:48         ` Alan Stern
  0 siblings, 0 replies; 7+ messages in thread
From: Alan Stern @ 2020-08-18 14:48 UTC (permalink / raw)
  To: Jim Baxter
  Cc: Greg KH, linux-kernel, linux-mm, linux-usb,
	Resch Carsten (CM/ESO6), Rosca, Eugeniu (ADITG/ESB)

On Tue, Aug 18, 2020 at 11:54:51AM +0100, Jim Baxter wrote:
> On 17/08/2020 19:47, Alan Stern wrote:
> > 
> > Unplugging a R/W USB drive without unmounting it first is a great way to 
> > corrupt the data.
> > 
> Thank you, post development we will only mount the USB stick as R/O.
> 
> >> Using perf Iidentified the hub_events workqueue was spending a lot of time in
> >> invalidate_partition(), I have included a cut down the captured data from perf in
> >> [2] which shows the additional functions where the kworker spends most of its time.
> > 
> > invalidate_partition() is part of the block layer, not part of USB.  It 
> > gets called whenever a drive is removed from the system, no matter what 
> > type of drive it is.  You should ask the people involved in that 
> > subsystem why it takes so long.
> > 
> 
> I included the linux-mm list but missed the filesystem, I will ask the question
> to the linux-fsdevel too.

What about linux-block?  The block layer is different from the 
memory-management (mm) layer.

Alan Stern


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-08-18 14:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 11:40 PROBLEM: Long Workqueue delays Jim Baxter
2020-08-17 11:57 ` Greg KH
2020-08-17 18:24   ` Jim Baxter
2020-08-17 18:47     ` Alan Stern
2020-08-18 10:54       ` Jim Baxter
2020-08-18 14:48         ` Alan Stern
2020-08-17 15:25 ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).