linux-media.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
@ 2020-08-09 18:51 Lukas Middendorf
  2020-08-13 16:37 ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2020-08-09 18:51 UTC (permalink / raw)
  To: Luis Chamberlain, linux-btrfs
  Cc: Antti Palosaari, Mauro Carvalho Chehab, linux-media

Hi,

I'm seeing system freezes during resuming from suspend to RAM if all of 
the following is true:
1. /usr/lib/firmware is on btrfs
2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected
3. the firmware file /usr/lib/firmware/dvb-demod-si2168-b40-01.fw (or 
knowledge about its non-existstence) is NOT in a cache in RAM

The exact outcome depends on whether or not I use nvidia or nouveau 
driver an whether or not I have X running. With nvidia I only get some 
kernel messages from suspend on screen, but the system does not seem to 
react to anything (emergency FS sync magic sysrq sometimes works judging 
from HDD-LED). With nouveau and X11 running I just get a black screen 
with mouse cursor and no apparent reaction to key presses (again except 
for emergency FS sync magic sysrq).
With nouveau but without X11 running the system even reacts to 
ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and 
login or any other input is not possible.
Unplugging the USB device after resume does not return the system back 
to normal.

Details on the above conditions:
1. If /usr/lib/firmware is on ext4 everything works fine. I originally 
noticed this with the complete / on btrfs vs. ext4, but putting 
/usr/lib/firmware on a separate partition is sufficient to make it work 
or fail, depending on the fs type.
2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its 
si2168_init() called during resume. In this function a call to 
request_firmware() happens. The firmware is not actually loaded to the 
device on boot or initial usb connection but just during resume and when 
the DVB-Tuner is used.
3. a) if the device has been used and the firmware has already been 
loaded, the load succeeds during resume
    b) if the firmware file is not present and the directory content of 
/usr/lib/firmware has been listed (e.g. through loading of a different 
firmware from there; by the nouveau driver in my case) the firmware load 
fails but does not freeze the system
    c) manually reading the firmware files before suspend can make the 
resume succeed, but this does not seem to be reliable
    d) modifying the si2168 by adding calls to firmware_request_cache() 
in si2168_probe() makes the firmware loading during resume succeed.
    e) having the firmware files not installed freezes the system during 
resume if the content of /usr/lib/firmware has not been listed before 
suspend (e.g. installing the nvidia driver, so the nouveau driver does 
not access this directory)
    f) the system freezes during resume if nothing has caused the 
firmware file to be read from the drive before suspend

I have observed this with all kernel versions I have tested so far: 
fedora kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11
vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c)

Unfortunately I have not yet succeeded in getting useful debug output 
(nothing is in the log files after a reset).  When I disable printk 
console_suspend I can get some messages of the resume process but 
nothing related to si2168 is shown on screen.

This certainly looks like the documented promise of request_firmware()
"The function can be called safely inside device’s suspend and resume 
callback."
does not hold if the requested firmware files are residing on a btrfs.
The documentation of firmware_request_cache() states that this function 
should be called before suspend if the firmware has not been loaded 
earlier. From the combined documentation I would expect that 
request_firmware() might fail during resume if the firmware is not yet 
in cache, but it should not freeze the system.

I'm not sure who really is to blame here:
- BTRFS (ext4 is fine after all)
- the firmware loader because the implementation or the documentation 
are wrong
- si2168 because of not loading the firmware or calling 
firmware_request_cache() before suspend. Also I don't think it is even 
necessary to load the firmware during resume, it should be sufficient to 
load the firmware when the tuner is used. I'm not sure though whether 
the dvb-frontend driver structure allows to properly distinguish between 
resuming and initialization before device usage. The problem can 
definitely be worked around here until the root cause is fixed. I can 
provide a patch if this solution is seen as appropriate.

I'm putting all the maintainers and/or lists of the possible culprits on CC.

Best regards,

Lukas Middendorf


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-09 18:51 Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? Lukas Middendorf
@ 2020-08-13 16:37 ` Luis Chamberlain
  2020-08-13 21:53   ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2020-08-13 16:37 UTC (permalink / raw)
  To: Lukas Middendorf, Anand Jain, mcgrof
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote:
> Hi,

Hey Lukas, thanks for the report!

> I'm seeing system freezes during resuming from suspend to RAM if all of the
> following is true:
> 1. /usr/lib/firmware is on btrfs
> 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected
> 3. the firmware file /usr/lib/firmware/dvb-demod-si2168-b40-01.fw (or
> knowledge about its non-existstence)

OK that is SI2168_B40_FIRMWARE and the driver in question is part of
drivers/media/dvb-frontends/si2168.c, its a 1 file driver so si2168
is the respective driver.

> is NOT in a cache in RAM

But the firmware is supposed to go be cached in RAM prior to suspend,
given this driver uses the request_firmware() call, only the async call
request_firmware_nowait() lets you opt-out of the cache, and likewise
with request_firmware_into_buf() you don't use the cache.

Note that some drivers implement their own caching mechanism as well,
like iwlwifi, and so don't need this cache.

Please refer to:

https://www.kernel.org/doc/html/latest/driver-api/firmware/firmware_cache.html

> The exact outcome depends on whether or not I use nvidia or nouveau driver
> an whether or not I have X running. With nvidia I only get some kernel
> messages from suspend on screen, but the system does not seem to react to
> anything (emergency FS sync magic sysrq sometimes works judging from
> HDD-LED). With nouveau and X11 running I just get a black screen with mouse
> cursor and no apparent reaction to key presses (again except for emergency
> FS sync magic sysrq).

If the nouveau driver doesn't yet have proper suspend working, my guess
is that the issue is the dependency on resources by nvidia ont he dvb
driver, and if suspend may be broken there, ordering on suspend would
not be correct. This is only true is suspend on nouveau is broken and I
don't know the anwer to this.

If suspend is not implemented correctly through the entire pipeline then
your only option is to blacklist the driver. On debian this is what I
use for hibernation blacklisting, this will remove the driver prior to
hibernation:

/etc/hibernate/blacklisted-modules

> With nouveau but without X11 running the system even reacts to
> ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and login
> or any other input is not possible.
> Unplugging the USB device after resume does not return the system back to
> normal.

Please debug further with nouveau as that is the only driver we have
control over.

> Details on the above conditions:
> 1. If /usr/lib/firmware is on ext4 everything works fine. I originally
> noticed this with the complete / on btrfs vs. ext4, but putting
> /usr/lib/firmware on a separate partition is sufficient to make it work or
> fail, depending on the fs type.

The firmware loader just reads the file directly from the filesystem,
that should complete fine but we can find out more details by you
enabling debugging:

echo "file drivers/base/* +p" > /sys/kernel/debug/dynamic_debug/control  

> 2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its
> si2168_init() called during resume. In this function a call to
> request_firmware() happens. The firmware is not actually loaded to the
> device on boot or initial usb connection but just during resume and when the
> DVB-Tuner is used.

That's the thing, the cache is built out of prior requests. We build the
actual firmware cache through a notifier on fw_pm_notify(), and we build
the cache prior to suspend then with device_cache_fw_images().

So whatever issues you might have during suspend/resume because of
a missing firmware would still be present on this driver because the
driver never called for the firmware. If upon resume the firmware is
requested, we race against the filesystem looking for it and that today
can be racy with any filesystem. That fact that you only see it with
one filesystem can be a one-off, as explained in more detail below.

> 3. a) if the device has been used and the firmware has already been loaded,
> the load succeeds during resume

If the device has been used, so long as the file is still present prior
to suspend, then the cache for it will be created. That explains why
this works.

>    b) if the firmware file is not present and the directory content of
> /usr/lib/firmware has been listed (e.g. through loading of a different
> firmware from there; by the nouveau driver in my case) the firmware load
> fails but does not freeze the system

Can you clarify if you mean then that in this case the dvb device is not
used?

Which driver firmware load fails?

>    c) manually reading the firmware files before suspend can make the resume
> succeed, but this does not seem to be reliable

That may because of the dentry cache, and so helping races, doing
something similar as to what the firmware cache does, but in hacky way.

>    d) modifying the si2168 by adding calls to firmware_request_cache() in
> si2168_probe() makes the firmware loading during resume succeed.

That indeed seems like a good idea in general for dvb that should
probably be generalized, if dvb had a way to get a dvb's driver
firmware name and indeed all dvb drivers are not skipping the firmware
cache today.

>    e) having the firmware files not installed freezes the system during
> resume if the content of /usr/lib/firmware has not been listed before
> suspend (e.g. installing the nvidia driver, so the nouveau driver does not
> access this directory)

That may be the same issue as in b) assuming that you meant you didn't
use the dvb device, and that the firmware load issue is from nouveau.

Are all firmware files used by nouveau upon resume the samea as during
probe? If not using firmware_request_cache() for each of them would be
a good idea to resolve possible races issues.

>    f) the system freezes during resume if nothing has caused the firmware
> file to be read from the drive before suspend

Which driver? And yes, this is exactly why the firmware cache thing was
done, and its why that for driver that don't use the firmware cache,
they implement their own caching mechanism.

The firmware must be cached prior to suspend.

> I have observed this with all kernel versions I have tested so far: fedora
> kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11
> vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c)
> 
> Unfortunately I have not yet succeeded in getting useful debug output
> (nothing is in the log files after a reset).  When I disable printk
> console_suspend I can get some messages of the resume process but nothing
> related to si2168 is shown on screen.

Try dynamic printk stuff mentioned.

Even though some of the things you say are not clear yet.

> This certainly looks like the documented promise of request_firmware()
> "The function can be called safely inside device’s suspend and resume
> callback."
> does not hold if 

Inded, this is just false outdated information and needs to be updated.

For now, this should just be re-written to mention in order to help with
current races against the filesystem (detailed way at the bottom of this
email) being brought up on resume, the firmware API relies on a firmware
cache mechanism, and when that is used it is safe to call the
request_firmware() call on resume. If a driver has not yet requested for
the firmware prior to suspend then the firmware cache will not have the
firmware cached in memory prior to suspend, and so to help with this
firmware_request_cache() can be used.  The call to
firmware_request_cache() can be made on device's probe call.

Some drivers may not want to use the firmware cache though, such as
drivers with large firmware which don't want large firmwares to be
taking so much memory prior to suspend/hibernation, and so certain APIs
allow to skip the firwmare cache. Those drivers must implement their own
solutions to ensure that if the firmware is needed on resume it will
be available or not needed.

> the requested firmware files are residing on a btrfs.

There are two issues here. One is that you are using a driver which
may have resume issues because the driver's firmware is not cached
prior to suspend.

The second issue may indeed be a freeze / issues of reading a filesystem
*while racing to suspend / resume*. This issue can happen with any
journaling filesystem today.

More on this below.

In so far as nouveau is concerned, remove the dvb driver and let's debug
that alone separately, and see if the issue is not related to your dvb
driver first.

> The documentation of firmware_request_cache() states that this function
> should be called before suspend if the firmware has not been loaded earlier.
> From the combined documentation I would expect that request_firmware() might
> fail during resume if the firmware is not yet in cache, but it should not
> freeze the system.

Indeed, but then again doing a a race against a filesystem in between
suspend/resume can cause all sorts of issues today, so btrfs is not the
only one with issue there. I had found issue with xfs as well [0]:

[0] https://lwn.net/Articles/735382/

I addressed a topic on this particular issue at LSFFMM in 2018, and here
are the notes from that session:

https://lore.kernel.org/linux-fsdevel/20180426212243.GA27853@wotan.suse.de/

I will pick up this work again now, but it will take a bit of time.

> I'm not sure who really is to blame here:
> - BTRFS (ext4 is fine after all)
> - the firmware loader because the implementation or the documentation are
> wrong
> - si2168 because of not loading the firmware or calling
> firmware_request_cache() before suspend. Also I don't think it is even
> necessary to load the firmware during resume, it should be sufficient to
> load the firmware when the tuner is used. I'm not sure though whether the
> dvb-frontend driver structure allows to properly distinguish between
> resuming and initialization before device usage. The problem can definitely
> be worked around here until the root cause is fixed. I can provide a patch
> if this solution is seen as appropriate.
> 
> I'm putting all the maintainers and/or lists of the possible culprits on CC.

Indeed send the patch to use firmware_request_cache() for si2168.

If you still have issues with nouveau, remove the si2168 driver and see
if you can recreate the issue. If not then something between the two
on suspend is likely the issue. But if the issue still persists then
it would seem to be an issue with suspend on nouveau.

The generic filesystem races on suspend/resume are known, and I will
tackle that once I am done with some other task I am completing.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-13 16:37 ` Luis Chamberlain
@ 2020-08-13 21:53   ` Lukas Middendorf
  2020-08-13 22:13     ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2020-08-13 21:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

Hey Luis,

On 13/08/2020 18:37, Luis Chamberlain wrote:
> On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote:
>> The exact outcome depends on whether or not I use nvidia or nouveau driver
>> an whether or not I have X running. With nvidia I only get some kernel
>> messages from suspend on screen, but the system does not seem to react to
>> anything (emergency FS sync magic sysrq sometimes works judging from
>> HDD-LED). With nouveau and X11 running I just get a black screen with mouse
>> cursor and no apparent reaction to key presses (again except for emergency
>> FS sync magic sysrq).
> 
> If the nouveau driver doesn't yet have proper suspend working, my guess
> is that the issue is the dependency on resources by nvidia ont he dvb
> driver, and if suspend may be broken there, ordering on suspend would
> not be correct. This is only true is suspend on nouveau is broken and I
> don't know the anwer to this.

Sorry for being a bit unclear here. nouveau seems to be working 
perfectly fine for me when it comes to this issue. Not using nouveau 
actually adds one more case in which the system freezes (when no 
firmware files are installed; case e) below). Of course I can not 
completely rule out a different problem with the nouveau driver, but my 
problem seems to be caused by si2168 only.

> If suspend is not implemented correctly through the entire pipeline then
> your only option is to blacklist the driver. On debian this is what I
> use for hibernation blacklisting, this will remove the driver prior to
> hibernation:
> 
> /etc/hibernate/blacklisted-modules

I will likely not have to do that now with the proper solution I have 
for si2168, but this can come in handy in the future should I hit some 
other suspend problem.

>> With nouveau but without X11 running the system even reacts to
>> ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and login
>> or any other input is not possible.
>> Unplugging the USB device after resume does not return the system back to
>> normal.
> 
> Please debug further with nouveau as that is the only driver we have
> control over.

Sure. I only mentioned the nvidia driver because the case "no firmware 
files installed" only seems to freeze the system on resume with the 
nvidia driver but not with nouveau.

>> Details on the above conditions:
>> 1. If /usr/lib/firmware is on ext4 everything works fine. I originally
>> noticed this with the complete / on btrfs vs. ext4, but putting
>> /usr/lib/firmware on a separate partition is sufficient to make it work or
>> fail, depending on the fs type.
> 
> The firmware loader just reads the file directly from the filesystem,
> that should complete fine but we can find out more details by you
> enabling debugging:
> 
> echo "file drivers/base/* +p" > /sys/kernel/debug/dynamic_debug/control

I already figured that out (I used "module firmware_class" though).


>> 2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its
>> si2168_init() called during resume. In this function a call to
>> request_firmware() happens. The firmware is not actually loaded to the
>> device on boot or initial usb connection but just during resume and when the
>> DVB-Tuner is used.
> 
> That's the thing, the cache is built out of prior requests. We build the
> actual firmware cache through a notifier on fw_pm_notify(), and we build
> the cache prior to suspend then with device_cache_fw_images().
> 
> So whatever issues you might have during suspend/resume because of
> a missing firmware would still be present on this driver because the
> driver never called for the firmware. If upon resume the firmware is
> requested, we race against the filesystem looking for it and that today
> can be racy with any filesystem. That fact that you only see it with
> one filesystem can be a one-off, as explained in more detail below.
> 
>> 3. a) if the device has been used and the firmware has already been loaded,
>> the load succeeds during resume
> 
> If the device has been used, so long as the file is still present prior
> to suspend, then the cache for it will be created. That explains why
> this works.

Yes, indeed. I had confirmed that with the firmware_class debug messages 
enabled.

> 
>>     b) if the firmware file is not present and the directory content of
>> /usr/lib/firmware has been listed (e.g. through loading of a different
>> firmware from there; by the nouveau driver in my case) the firmware load
>> fails but does not freeze the system
> 
> Can you clarify if you mean then that in this case the dvb device is not
> used?
> 
> Which driver firmware load fails?

With "not used" I mean that the device has been recognized, the si2168 
module is loaded and si2168_probe() has been called on the device, but I 
have not started a media player which actually plays a DVB-C stream. 
Therefore si2168_init() has never been called and the firmware has never 
been requested or loaded to the device.

On resume si2168_init() will be called (although I don't think this 
actually is really necessary) and the firmware load of si2168 fails if 
the firmware files are not installed. If the kernel does not already 
know that the files are not present without access to the file system, 
the system just freezes.


>>     c) manually reading the firmware files before suspend can make the resume
>> succeed, but this does not seem to be reliable
> 
> That may because of the dentry cache, and so helping races, doing
> something similar as to what the firmware cache does, but in hacky way.
> 
>>     d) modifying the si2168 by adding calls to firmware_request_cache() in
>> si2168_probe() makes the firmware loading during resume succeed.
> 
> That indeed seems like a good idea in general for dvb that should
> probably be generalized, if dvb had a way to get a dvb's driver
> firmware name and indeed all dvb drivers are not skipping the firmware
> cache today.
> 
>>     e) having the firmware files not installed freezes the system during
>> resume if the content of /usr/lib/firmware has not been listed before
>> suspend (e.g. installing the nvidia driver, so the nouveau driver does not
>> access this directory)
> 
> That may be the same issue as in b) assuming that you meant you didn't
> use the dvb device, and that the firmware load issue is from nouveau.

This is actually just the inverted case of b).
The only real relevance of the nouveau driver here is that its 
(perfectly working) firmware caching on suspend actually seems to be 
equivalent to manually running "ls" on the firmware directory and 
afterwards the kernel also knows whether or not the si2168 firmware 
files are present without file system access.

> 
> Are all firmware files used by nouveau upon resume the samea as during
> probe? If not using firmware_request_cache() for each of them would be
> a good idea to resolve possible races issues.

I have not compared them one-by-one, but as I have already written, 
nouveau seems fine here.

> 
>>     f) the system freezes during resume if nothing has caused the firmware
>> file to be read from the drive before suspend
> 
> Which driver? And yes, this is exactly why the firmware cache thing was
> done, and its why that for driver that don't use the firmware cache,
> they implement their own caching mechanism.
> 
> The firmware must be cached prior to suspend.
> 
>> I have observed this with all kernel versions I have tested so far: fedora
>> kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11
>> vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c)
>>
>> Unfortunately I have not yet succeeded in getting useful debug output
>> (nothing is in the log files after a reset).  When I disable printk
>> console_suspend I can get some messages of the resume process but nothing
>> related to si2168 is shown on screen.
> 
> Try dynamic printk stuff mentioned.

I have tried that, but in case the system freezes, I have not managed to 
get any useful debug info on the screen. But

> Even though some of the things you say are not clear yet.

I'm very sorry for that. But it always is a compromise between giving 
all possible information and not producing a huge pile of text.


[snip: lots of interesting information]

>> I'm not sure who really is to blame here:
>> - BTRFS (ext4 is fine after all)
>> - the firmware loader because the implementation or the documentation are
>> wrong
>> - si2168 because of not loading the firmware or calling
>> firmware_request_cache() before suspend. Also I don't think it is even
>> necessary to load the firmware during resume, it should be sufficient to
>> load the firmware when the tuner is used. I'm not sure though whether the
>> dvb-frontend driver structure allows to properly distinguish between
>> resuming and initialization before device usage. The problem can definitely
>> be worked around here until the root cause is fixed. I can provide a patch
>> if this solution is seen as appropriate.
>>
>> I'm putting all the maintainers and/or lists of the possible culprits on CC.
> 
> Indeed send the patch to use firmware_request_cache() for si2168.

I sent it to the media mailing list (split into two patches).

> If you still have issues with nouveau, remove the si2168 driver and see
> if you can recreate the issue. If not then something between the two
> on suspend is likely the issue. But if the issue still persists then
> it would seem to be an issue with suspend on nouveau.

Nouveau does seem to work (including suspend), except if you actually 
need some decent 3D performance.

> The generic filesystem races on suspend/resume are known, and I will
> tackle that once I am done with some other task I am completing.

Great to know that the underlying cause for the freeze is known and a 
solution is being worked on (or at least planned to be worked on).

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-13 21:53   ` Lukas Middendorf
@ 2020-08-13 22:13     ` Luis Chamberlain
  2020-08-14 11:38       ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2020-08-13 22:13 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote:
> Hey Luis,
> 
> On 13/08/2020 18:37, Luis Chamberlain wrote:
> > On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote:
> > 
> > >     b) if the firmware file is not present and the directory content of
> > > /usr/lib/firmware has been listed (e.g. through loading of a different
> > > firmware from there; by the nouveau driver in my case) the firmware load
> > > fails but does not freeze the system
> > 
> > Can you clarify if you mean then that in this case the dvb device is not
> > used?
> > 
> > Which driver firmware load fails?
> 
> With "not used" I mean that the device has been recognized, the si2168
> module is loaded and si2168_probe() has been called on the device, but I
> have not started a media player which actually plays a DVB-C stream.
> Therefore si2168_init() has never been called and the firmware has never
> been requested or loaded to the device.
> 
> On resume si2168_init() will be called (although I don't think this actually
> is really necessary)

Indeed, that seems odd given its not on probe. So yet another possible
si2168 bug. Or another way to put it: your cache calls are not needed
if you remove that si2168_init() if init was not called yet. So simply
extending the data structure for the driver and seting a bool flag to
true if init was called should do the trick.

Then the two cache calls would not be needed.

> and the firmware load of si2168 fails if the firmware
> files are not installed.

OK well that is expected.

> If the kernel does not already know that the files
> are not present without access to the file system, the system just freezes.

It is not clear to me what this means. Can you clarify?

> > >     c) manually reading the firmware files before suspend can make the resume
> > > succeed, but this does not seem to be reliable
> > 
> > That may because of the dentry cache, and so helping races, doing
> > something similar as to what the firmware cache does, but in hacky way.
> > 
> > >     d) modifying the si2168 by adding calls to firmware_request_cache() in
> > > si2168_probe() makes the firmware loading during resume succeed.
> > 
> > That indeed seems like a good idea in general for dvb that should
> > probably be generalized, if dvb had a way to get a dvb's driver
> > firmware name and indeed all dvb drivers are not skipping the firmware
> > cache today.
> > 
> > >     e) having the firmware files not installed freezes the system during
> > > resume if the content of /usr/lib/firmware has not been listed before
> > > suspend (e.g. installing the nvidia driver, so the nouveau driver does not
> > > access this directory)
> > 
> > That may be the same issue as in b) assuming that you meant you didn't
> > use the dvb device, and that the firmware load issue is from nouveau.
> 
> This is actually just the inverted case of b).
> The only real relevance of the nouveau driver here is that its (perfectly
> working) firmware caching on suspend actually seems to be equivalent to
> manually running "ls" on the firmware directory and afterwards the kernel
> also knows whether or not the si2168 firmware files are present without file
> system access.

OK.. I still don't get it, so let me see if we can decipher what you
mean here.

If the firmware is *not* present for the si2168 driver and the device
has *not* been used yet you get a system freeze which you cannot recover
from, but only if you are *not* using a driver which also caches its
firmware already?

In this case if the nouveau is used this freeze does not happen, and you
believe this is because of at least one fetch for the directory is done
already. If so, then this speedup of a fugure negative dentry lookup on
btrfs seems like an interesting test case for btrfs developers too look
at.

But we must first undersand the case well.

> > Are all firmware files used by nouveau upon resume the samea as during
> > probe? If not using firmware_request_cache() for each of them would be
> > a good idea to resolve possible races issues.
> 
> I have not compared them one-by-one, but as I have already written, nouveau
> seems fine here.

OK cool. Thanks for ruling this out!

> > > I'm not sure who really is to blame here:
> > > - BTRFS (ext4 is fine after all)
> > > - the firmware loader because the implementation or the documentation are
> > > wrong
> > > - si2168 because of not loading the firmware or calling
> > > firmware_request_cache() before suspend. Also I don't think it is even
> > > necessary to load the firmware during resume, it should be sufficient to
> > > load the firmware when the tuner is used. I'm not sure though whether the
> > > dvb-frontend driver structure allows to properly distinguish between
> > > resuming and initialization before device usage. The problem can definitely
> > > be worked around here until the root cause is fixed. I can provide a patch
> > > if this solution is seen as appropriate.
> > > 
> > > I'm putting all the maintainers and/or lists of the possible culprits on CC.
> > 
> > Indeed send the patch to use firmware_request_cache() for si2168.
> 
> I sent it to the media mailing list (split into two patches).

Looks good to me, except now that you metion the resuem thing, I think
you can drop those patches if you do the check I mentioned on resume,
ie, to verify that init hasn't been called yet.

Also, might as well rename that function to make it clearer what it does
as init prefixed or postfixed calls tend to be used for things during
initialization.

> > The generic filesystem races on suspend/resume are known, and I will
> > tackle that once I am done with some other task I am completing.
> 
> Great to know that the underlying cause for the freeze is known and a
> solution is being worked on (or at least planned to be worked on).

There is a sliver possibility here that a negative dentry slowdown may
cause a btrfs freeze on suspend easily. If so that can likely be fixed
before that big general filesystem freeze issue I noted with URLs and
talks. But I think we need to confirm that is what is happening here.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-13 22:13     ` Luis Chamberlain
@ 2020-08-14 11:38       ` Lukas Middendorf
  2020-08-14 16:37         ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2020-08-14 11:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

Hey Louis,

On 14/08/2020 00:13, Luis Chamberlain wrote:
> On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote:
>> With "not used" I mean that the device has been recognized, the si2168
>> module is loaded and si2168_probe() has been called on the device, but I
>> have not started a media player which actually plays a DVB-C stream.
>> Therefore si2168_init() has never been called and the firmware has never
>> been requested or loaded to the device.
>>
>> On resume si2168_init() will be called (although I don't think this actually
>> is really necessary)
> 
> Indeed, that seems odd given its not on probe. So yet another possible
> si2168 bug. Or another way to put it: your cache calls are not needed
> if you remove that si2168_init() if init was not called yet. So simply
> extending the data structure for the driver and seting a bool flag to
> true if init was called should do the trick.
> 
> Then the two cache calls would not be needed.

I already thought about a way to remove the firmware load on suspend if 
it has not happened earlier. But I saw no way to distinguish between 
calls to si2168_init() during resume and calls to si2168_init() at the 
start of device usage.

In contrast to struct dvb_tuner_ops the relevant struct dvb_frontend_ops 
has no separate members "suspend" and "resume" to store callbacks to be 
called instead of sleep and init during suspend and resume. Changing the 
behavior of si2168 would likely include some bigger architectural work 
on the dvb_frontend system to extend dvb_frontend_ops.
It might be possible to just put nothing into dvb_frontend_ops.init but 
instead populate the dvb_tuner_ops part of the struct, but those 
pointers might already be populated externally for some other use.

If I try to do this on my very first kernel contribution without 
in-depth knowledge of how the dvb drivers really work, this will likely 
just explode, if not for me then for somebody else with different hardware.

>> If the kernel does not already know that the files
>> are not present without access to the file system, the system just freezes.
> 
> It is not clear to me what this means. Can you clarify?
> 
>>>>      e) having the firmware files not installed freezes the system during
>>>> resume if the content of /usr/lib/firmware has not been listed before
>>>> suspend (e.g. installing the nvidia driver, so the nouveau driver does not
>>>> access this directory)
>>>
>>> That may be the same issue as in b) assuming that you meant you didn't
>>> use the dvb device, and that the firmware load issue is from nouveau.
>>
>> This is actually just the inverted case of b).
>> The only real relevance of the nouveau driver here is that its (perfectly
>> working) firmware caching on suspend actually seems to be equivalent to
>> manually running "ls" on the firmware directory and afterwards the kernel
>> also knows whether or not the si2168 firmware files are present without file
>> system access.
> 
> OK.. I still don't get it, so let me see if we can decipher what you
> mean here.
> 
> If the firmware is *not* present for the si2168 driver and the device
> has *not* been used yet you get a system freeze which you cannot recover
> from, but only if you are *not* using a driver which also caches its
> firmware already?

Yes, this is exactly what I wanted to say.

A new installation of Fedora 32 without firmware files and with nouveau 
did not show my freeze problem. Installing either the firmware files or 
the nvidia driver started the freeze during resume.

> In this case if the nouveau is used this freeze does not happen, and you
> believe this is because of at least one fetch for the directory is done
> already. If so, then this speedup of a fugure negative dentry lookup on
> btrfs seems like an interesting test case for btrfs developers too look
> at.

According to my previous observations, an alternative to having the 
nouveau driver fetch files to prevent the freeze when no files are 
installed is to manually run "ls" on the firmware directory.

> But we must first undersand the case well.

I will check this corner case again in depth tonight or tomorrow and 
will report my findings, at least as far as it is externally observable 
without in-depth understanding of the internals.


>>>> I'm not sure who really is to blame here:
>>>> - BTRFS (ext4 is fine after all)
>>>> - the firmware loader because the implementation or the documentation are
>>>> wrong
>>>> - si2168 because of not loading the firmware or calling
>>>> firmware_request_cache() before suspend. Also I don't think it is even
>>>> necessary to load the firmware during resume, it should be sufficient to
>>>> load the firmware when the tuner is used. I'm not sure though whether the
>>>> dvb-frontend driver structure allows to properly distinguish between
>>>> resuming and initialization before device usage. The problem can definitely
>>>> be worked around here until the root cause is fixed. I can provide a patch
>>>> if this solution is seen as appropriate.
>>>>
>>>> I'm putting all the maintainers and/or lists of the possible culprits on CC.
>>>
>>> Indeed send the patch to use firmware_request_cache() for si2168.
>>
>> I sent it to the media mailing list (split into two patches).
> 
> Looks good to me, except now that you metion the resuem thing, I think
> you can drop those patches if you do the check I mentioned on resume,
> ie, to verify that init hasn't been called yet.

As I have written above, I likely would have tried that, had I found a 
way to distinguish between calls to init during resume and other calls 
to that function.
Do you know of any safe way to detect that a resume operation is going on?

> Also, might as well rename that function to make it clearer what it does
> as init prefixed or postfixed calls tend to be used for things during
> initialization.

As a pointer to this function goes into the "init" member of the 
dvb_frontend_ops struct, the name seems to be appropriate, at least at 
the moment.

>>> The generic filesystem races on suspend/resume are known, and I will
>>> tackle that once I am done with some other task I am completing.
>>
>> Great to know that the underlying cause for the freeze is known and a
>> solution is being worked on (or at least planned to be worked on).
> 
> There is a sliver possibility here that a negative dentry slowdown may
> cause a btrfs freeze on suspend easily. If so that can likely be fixed
> before that big general filesystem freeze issue I noted with URLs and
> talks. But I think we need to confirm that is what is happening here.

This sounds even better, although I don't really understand your theory.

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-14 11:38       ` Lukas Middendorf
@ 2020-08-14 16:37         ` Luis Chamberlain
  2020-08-14 21:59           ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2020-08-14 16:37 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote:
> Hey Louis,
> 
> On 14/08/2020 00:13, Luis Chamberlain wrote:
> > On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote:
> > > With "not used" I mean that the device has been recognized, the si2168
> > > module is loaded and si2168_probe() has been called on the device, but I
> > > have not started a media player which actually plays a DVB-C stream.
> > > Therefore si2168_init() has never been called and the firmware has never
> > > been requested or loaded to the device.
> > > 
> > > On resume si2168_init() will be called (although I don't think this actually
> > > is really necessary)
> > 
> > Indeed, that seems odd given its not on probe. So yet another possible
> > si2168 bug. Or another way to put it: your cache calls are not needed
> > if you remove that si2168_init() if init was not called yet. So simply
> > extending the data structure for the driver and seting a bool flag to
> > true if init was called should do the trick.
> > 
> > Then the two cache calls would not be needed.
> 
> I already thought about a way to remove the firmware load on suspend if it
> has not happened earlier. But I saw no way to distinguish between calls to
> si2168_init() during resume and calls to si2168_init() at the start of
> device usage.
> 
> In contrast to struct dvb_tuner_ops the relevant struct dvb_frontend_ops has
> no separate members "suspend" and "resume" to store callbacks to be called
> instead of sleep and init during suspend and resume. Changing the behavior
> of si2168 would likely include some bigger architectural work on the
> dvb_frontend system to extend dvb_frontend_ops.
> It might be possible to just put nothing into dvb_frontend_ops.init but
> instead populate the dvb_tuner_ops part of the struct, but those pointers
> might already be populated externally for some other use.
> 
> If I try to do this on my very first kernel contribution without in-depth
> knowledge of how the dvb drivers really work, this will likely just explode,
> if not for me then for somebody else with different hardware.

Indeed, the dvb init is called from dvb, so your two patches are good
for now I think.

> > > If the kernel does not already know that the files
> > > are not present without access to the file system, the system just freezes.
> > 
> > It is not clear to me what this means. Can you clarify?
> > 
> > > > >      e) having the firmware files not installed freezes the system during
> > > > > resume if the content of /usr/lib/firmware has not been listed before
> > > > > suspend (e.g. installing the nvidia driver, so the nouveau driver does not
> > > > > access this directory)
> > > > 
> > > > That may be the same issue as in b) assuming that you meant you didn't
> > > > use the dvb device, and that the firmware load issue is from nouveau.
> > > 
> > > This is actually just the inverted case of b).
> > > The only real relevance of the nouveau driver here is that its (perfectly
> > > working) firmware caching on suspend actually seems to be equivalent to
> > > manually running "ls" on the firmware directory and afterwards the kernel
> > > also knows whether or not the si2168 firmware files are present without file
> > > system access.
> > 
> > OK.. I still don't get it, so let me see if we can decipher what you
> > mean here.
> > 
> > If the firmware is *not* present for the si2168 driver and the device
> > has *not* been used yet you get a system freeze which you cannot recover
> > from, but only if you are *not* using a driver which also caches its
> > firmware already?
> 
> Yes, this is exactly what I wanted to say.
> 

OK great.. but..

> A new installation of Fedora 32 without firmware files 

Fedora 32 comes with no firmware at all? Are you sure? How about your
wifi?

> and with nouveau did
> not show my freeze problem. Installing either the firmware files or the
> nvidia driver started the freeze during resume.

Here now you say that if you install either the firmware files for
either si2168 or nouveau can cause a freeze, meanwhile what I wrote
above and you agreed is what you meant, says that the freeze happens
only if you *don't* have the firmware for nouveau present *and* you
also don't have the any other firmware present.

You also clarify here your freeze happens on resume only. Is that right?
Never on suspend, but if the freeze happens, it happens only on resume?

The actual case where you reach a freeze is still not clear yet. Let's
try to clarify this.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-14 16:37         ` Luis Chamberlain
@ 2020-08-14 21:59           ` Lukas Middendorf
  2020-08-17 15:20             ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2020-08-14 21:59 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On 14/08/2020 18:37, Luis Chamberlain wrote:
> On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote:
>>> If the firmware is *not* present for the si2168 driver and the device
>>> has *not* been used yet you get a system freeze which you cannot recover
>>> from, but only if you are *not* using a driver which also caches its
>>> firmware already?
>>
>> Yes, this is exactly what I wanted to say.
>>
> 
> OK great.. but..
> 
>> A new installation of Fedora 32 without firmware files
> 
> Fedora 32 comes with no firmware at all? Are you sure? How about your
> wifi?

Fedora does come with firmware files for many devices (wifi, nouveau, …) 
but not for dvb devices. Firmware for the si2168 has to be installed as 
an extra package dvb-firmware from rpmfusion. When I talk about "no 
firmware files" or "install the firmware files" I mean the si2168 (and 
other dvb hardware) firmware files only. The nouveau firmware files are 
always present.

>> and with nouveau did
>> not show my freeze problem. Installing either the firmware files or the
>> nvidia driver started the freeze during resume.
> 
> Here now you say that if you install either the firmware files for
> either si2168 or nouveau can cause a freeze, 

no, I'm talking just about the firmware files for si2168

> meanwhile what I wrote
> above and you agreed is what you meant, says that the freeze happens
> only if you *don't* have the firmware for nouveau present *and* you
> also don't have the any other firmware present.

This statement for me seems different compared to the statement in your 
last mail which I agreed to.
In the case that I *don't* have the firmware files for si2168 (!) 
present it happens only if no other firmware is cached on suspend (in my 
case by the nouveau driver). No statement made about cases where the 
si2168 firmware file is present.


> You also clarify here your freeze happens on resume only. Is that right?
> Never on suspend, but if the freeze happens, it happens only on resume?

Correct. I have not seen a freeze on suspend. It only happens on resume.


> The actual case where you reach a freeze is still not clear yet. Let's
> try to clarify this.

OK, let's try that again. To freeze during resume all of 1-4 has to be true:
1. /usr/lib/firmware is on btrfs
2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected
3. have not actively used the tuner
4. any of the following cases:
4a) si2168 firmware not installed + nouveau driver not used + have not 
run "ls -R /usr/lib/firmware" before suspend
4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*"
4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not 
nouveau driver


Not leading to a freeze is:

A: si2168 firmware not installed + nouveau driver used
B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before 
suspend
C: used the tuner before suspend (or tried to use, in case that the 
si2168 firmware is not installed)
D: using my patches with firmware_request_cache()
E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat 
/usr/lib/firmware/dvb*"
F: si2168 firmware installed + nouveau driver used + "cat 
/usr/lib/firmware/dvb*"

I verified all cases again to make sure I was not remembering anything 
wrong.

The nouveau driver in use seems to be equivalent to running "ls -R 
/usr/lib/firmware" before suspend.

All the cases seem to boil down to:
It freezes if the file system has to be accessed to list the content of 
/usr/lib/firmware or to read the si2168 firmware file


Lukas



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-14 21:59           ` Lukas Middendorf
@ 2020-08-17 15:20             ` Luis Chamberlain
  2020-08-17 22:04               ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2020-08-17 15:20 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On Fri, Aug 14, 2020 at 11:59:36PM +0200, Lukas Middendorf wrote:
> On 14/08/2020 18:37, Luis Chamberlain wrote:
> > On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote:
> > > > If the firmware is *not* present for the si2168 driver and the device
> > > > has *not* been used yet you get a system freeze which you cannot recover
> > > > from, but only if you are *not* using a driver which also caches its
> > > > firmware already?
> > > 
> > > Yes, this is exactly what I wanted to say.
> > > 
> > 
> > OK great.. but..
> > 
> > > A new installation of Fedora 32 without firmware files
> > 
> > Fedora 32 comes with no firmware at all? Are you sure? How about your
> > wifi?
> 
> Fedora does come with firmware files for many devices (wifi, nouveau, …) but
> not for dvb devices. Firmware for the si2168 has to be installed as an extra
> package dvb-firmware from rpmfusion. When I talk about "no firmware files"
> or "install the firmware files" I mean the si2168 (and other dvb hardware)
> firmware files only. The nouveau firmware files are always present.

OK so how do you know that other firmware is not getting loaded or cached?

Other than checking kernel logs you can rm -rf /lib/firmware/ and then
only place the files you want to test.

> > > and with nouveau did
> > > not show my freeze problem. Installing either the firmware files or the
> > > nvidia driver started the freeze during resume.
> > 
> > Here now you say that if you install either the firmware files for
> > either si2168 or nouveau can cause a freeze,
> 
> no, I'm talking just about the firmware files for si2168

OK so if you install the firmware files of si2168 you do run into a
freeze, and this freeze happens on resume from suspend?

Without the si2168 firmware the freeze does not happen.

Is that right?

> In the case that I *don't* have the firmware files for si2168 (!) present it
> happens only if no other firmware is cached on suspend (in my case by the
> nouveau driver). No statement made about cases where the si2168 firmware
> file is present.

OK this statement is clear and is very different from the one I made
above.

But note, that this is true, how are you *sure* that no other firmware
other than nouveau is being used? What about wifi? or bluetooth?

> > You also clarify here your freeze happens on resume only. Is that right?
> > Never on suspend, but if the freeze happens, it happens only on resume?
> 
> Correct. I have not seen a freeze on suspend. It only happens on resume.

OK this is only if and only if you haven't used the si2168 device,
right? And since this is related to si2168 we know that even if you
don't use the si2168 device its function which calls to load firmware
*does* get called on resume, even though that same function was not
called on probe, as the device is not used.

If true, then the race to freeze here happens on resume against btrfs.

And is the firmware present or not in this case, in which the freeze
happens?

> > The actual case where you reach a freeze is still not clear yet. Let's
> > try to clarify this.
> 
> OK, let's try that again. To freeze during resume all of 1-4 has to be true:
> 1. /usr/lib/firmware is on btrfs
> 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected
> 3. have not actively used the tuner
> 4. any of the following cases:
> 4a) si2168 firmware not installed + nouveau driver not used + have not run
> "ls -R /usr/lib/firmware" before suspend
> 4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*"
> 4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not
> nouveau driver

OK perfect, now the next question to clarify is *are you sure* that no
other firmware is used, other than si2168 and nouveau?

> Not leading to a freeze is:
> 
> A: si2168 firmware not installed + nouveau driver used
> B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before
> suspend
> C: used the tuner before suspend (or tried to use, in case that the si2168
> firmware is not installed)
> D: using my patches with firmware_request_cache()
> E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat
> /usr/lib/firmware/dvb*"
> F: si2168 firmware installed + nouveau driver used + "cat
> /usr/lib/firmware/dvb*"
> 
> I verified all cases again to make sure I was not remembering anything
> wrong.

This helps, thanks so much, now we'll have to write a reproducer, thanks
for the report!!

> The nouveau driver in use seems to be equivalent to running "ls -R
> /usr/lib/firmware" before suspend.
> 
> All the cases seem to boil down to:
> It freezes if the file system has to be accessed to list the content of
> /usr/lib/firmware or to read the si2168 firmware file

Let's confirm first whether or not your system is using other firmware
files too or not.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-17 15:20             ` Luis Chamberlain
@ 2020-08-17 22:04               ` Lukas Middendorf
  2020-08-18 14:37                 ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2020-08-17 22:04 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On 17/08/2020 17:20, Luis Chamberlain wrote:
> On Fri, Aug 14, 2020 at 11:59:36PM +0200, Lukas Middendorf wrote:
>> On 14/08/2020 18:37, Luis Chamberlain wrote:
>>> On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote:
>>>>> If the firmware is *not* present for the si2168 driver and the device
>>>>> has *not* been used yet you get a system freeze which you cannot recover
>>>>> from, but only if you are *not* using a driver which also caches its
>>>>> firmware already?
>>>>
>>>> Yes, this is exactly what I wanted to say.
>>>>
>>>
>>> OK great.. but..
>>>
>>>> A new installation of Fedora 32 without firmware files
>>>
>>> Fedora 32 comes with no firmware at all? Are you sure? How about your
>>> wifi?
>>
>> Fedora does come with firmware files for many devices (wifi, nouveau, …) but
>> not for dvb devices. Firmware for the si2168 has to be installed as an extra
>> package dvb-firmware from rpmfusion. When I talk about "no firmware files"
>> or "install the firmware files" I mean the si2168 (and other dvb hardware)
>> firmware files only. The nouveau firmware files are always present.
> 
> OK so how do you know that other firmware is not getting loaded or cached?

dmesg with firmware_class debug output enabled

> 
> Other than checking kernel logs you can rm -rf /lib/firmware/ and then
> only place the files you want to test.
> 
>>>> and with nouveau did
>>>> not show my freeze problem. Installing either the firmware files or the
>>>> nvidia driver started the freeze during resume.
>>>
>>> Here now you say that if you install either the firmware files for
>>> either si2168 or nouveau can cause a freeze,
>>
>> no, I'm talking just about the firmware files for si2168
> 
> OK so if you install the firmware files of si2168 you do run into a
> freeze, and this freeze happens on resume from suspend?
> 
> Without the si2168 firmware the freeze does not happen.
> 
> Is that right?
> 
>> In the case that I *don't* have the firmware files for si2168 (!) present it
>> happens only if no other firmware is cached on suspend (in my case by the
>> nouveau driver). No statement made about cases where the si2168 firmware
>> file is present.
> 
> OK this statement is clear and is very different from the one I made
> above.
> 
> But note, that this is true, how are you *sure* that no other firmware
> other than nouveau is being used? What about wifi? or bluetooth?

I don't have wifi or bluetooth enabled in the BIOS. If I don't use the 
nouveau driver, "dmesg | grep firmware" is completely empty after boot 
except for the kernel command line part "ddebug_query=module 
firmware_class +pmf". After suspend + resume I just get two lines for 
the calls to cache and uncache firmwares. If any other firmware is used, 
it is apparently not using the standard firmware loader.
If I rename the directory /usr/lib/firmware everything seems to work as 
before (nouveau and si2168 not loaded) and I don't see any complaints 
about missing firmware in dmesg.

If I enable my secondary rtl8125 Ethernet controller in the BIOS (I only 
tested this now, I always had it disabled previously), I can see its 
firmware being loaded on boot and cached on suspend. This then has an 
effect on my freeze problem identical to using the nouveau driver.

>>> You also clarify here your freeze happens on resume only. Is that right?
>>> Never on suspend, but if the freeze happens, it happens only on resume?
>>
>> Correct. I have not seen a freeze on suspend. It only happens on resume.
> 
> OK this is only if and only if you haven't used the si2168 device,
> right? And since this is related to si2168 we know that even if you
> don't use the si2168 device its function which calls to load firmware
> *does* get called on resume, even though that same function was not
> called on probe, as the device is not used.
> 
> If true, then the race to freeze here happens on resume against btrfs.
> 
> And is the firmware present or not in this case, in which the freeze
> happens?

A freeze can happen on resume with and without the si2168 firmware files 
installed. It however is easier to hit the freeze with the firmware 
files installed. Without the firmware files present the freeze happens 
only if no other driver uses the firmware loader.

>>> The actual case where you reach a freeze is still not clear yet. Let's
>>> try to clarify this.
>>
>> OK, let's try that again. To freeze during resume all of 1-4 has to be true:
>> 1. /usr/lib/firmware is on btrfs
>> 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected
>> 3. have not actively used the tuner
>> 4. any of the following cases:
>> 4a) si2168 firmware not installed + nouveau driver not used + have not run
>> "ls -R /usr/lib/firmware" before suspend
>> 4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*"
>> 4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not
>> nouveau driver
> 
> OK perfect, now the next question to clarify is *are you sure* that no
> other firmware is used, other than si2168 and nouveau?

I'm totally sure that no other firmware is used through the standard 
firmware loader. I'm also almost sure that no firmware from 
/usr/lib/firmware is used though any other means.

>> Not leading to a freeze is:
>>
>> A: si2168 firmware not installed + nouveau driver used
>> B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before
>> suspend
>> C: used the tuner before suspend (or tried to use, in case that the si2168
>> firmware is not installed)
>> D: using my patches with firmware_request_cache()
>> E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat
>> /usr/lib/firmware/dvb*"
>> F: si2168 firmware installed + nouveau driver used + "cat
>> /usr/lib/firmware/dvb*"
>>
>> I verified all cases again to make sure I was not remembering anything
>> wrong.
> 
> This helps, thanks so much, now we'll have to write a reproducer, thanks
> for the report!!

Will you do it yourself or do you expect me to do anything for this?

>> The nouveau driver in use seems to be equivalent to running "ls -R
>> /usr/lib/firmware" before suspend.
>>
>> All the cases seem to boil down to:
>> It freezes if the file system has to be accessed to list the content of
>> /usr/lib/firmware or to read the si2168 firmware file
> 
> Let's confirm first whether or not your system is using other firmware
> files too or not.

I confirmed that above. Why is this so important, anyway?

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-17 22:04               ` Lukas Middendorf
@ 2020-08-18 14:37                 ` Luis Chamberlain
  2021-04-01 14:59                   ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2020-08-18 14:37 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab,
	linux-media

On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote:
> On 17/08/2020 17:20, Luis Chamberlain wrote:
> A freeze can happen on resume with and without the si2168 firmware files
> installed. It however is easier to hit the freeze with the firmware files
> installed. Without the firmware files present the freeze happens only if no
> other driver uses the firmware loader.
> 
> > 
> > This helps, thanks so much, now we'll have to write a reproducer, thanks
> > for the report!!
> 
> Will you do it yourself or do you expect me to do anything for this?

I meant to imply that we'd do this, now that we understand the problem. Thanks
for your report!

> > > The nouveau driver in use seems to be equivalent to running "ls -R
> > > /usr/lib/firmware" before suspend.
> > > 
> > > All the cases seem to boil down to:
> > > It freezes if the file system has to be accessed to list the content of
> > > /usr/lib/firmware or to read the si2168 firmware file
> > 
> > Let's confirm first whether or not your system is using other firmware
> > files too or not.
> 
> I confirmed that above. Why is this so important, anyway?

A reproducer is easier to write if the actual file neeed is not needed.
That's all.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2020-08-18 14:37                 ` Luis Chamberlain
@ 2021-04-01 14:59                   ` Lukas Middendorf
  2021-04-02 18:02                     ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Lukas Middendorf @ 2021-04-01 14:59 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media,
	Lukas Middendorf

Hello Luis,


On 18/08/2020 16:37, Luis Chamberlain wrote:
> On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote:
>> On 17/08/2020 17:20, Luis Chamberlain wrote:
>>> This helps, thanks so much, now we'll have to write a reproducer, thanks
>>> for the report!!
>>
>> Will you do it yourself or do you expect me to do anything for this?
> 
> I meant to imply that we'd do this, now that we understand the problem. Thanks
> for your report!

any news on this issue? Did you succeed with reproducing this at your end?

I retested this with 5.12-rc5 and everything still seems unchanged. I 
still get freezes on resume under the same circumstances described 
previously if /usr/lib/firmware is on btrfs. My patches to si2168.c were 
not merged.
The documentation of request_firmware() seems unchanged and does not 
reflect the fact that prerequisites exist for making it safe to call 
during resume callbacks.

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-01 14:59                   ` Lukas Middendorf
@ 2021-04-02 18:02                     ` Luis Chamberlain
  2021-04-02 22:19                       ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-02 18:02 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Thu, Apr 01, 2021 at 04:59:47PM +0200, Lukas Middendorf wrote:
> Hello Luis,
> 
> 
> On 18/08/2020 16:37, Luis Chamberlain wrote:
> > On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote:
> > > On 17/08/2020 17:20, Luis Chamberlain wrote:
> > > > This helps, thanks so much, now we'll have to write a reproducer, thanks
> > > > for the report!!
> > > 
> > > Will you do it yourself or do you expect me to do anything for this?
> > 
> > I meant to imply that we'd do this, now that we understand the problem. Thanks
> > for your report!
> 
> any news on this issue? Did you succeed with reproducing this at your end?

No sorry, I dropped the ball on this but I managed to now spawn up the
virtual guests where I was doing development to reproduce this. Give me
some time and I will zero in on this now.

For now what I have is the following to test this, I next will work
on the userspace part.

diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 9fee2b93a8d1..f9e67fc4145a 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -18,6 +18,7 @@
 #include <linux/device.h>
 #include <linux/fs.h>
 #include <linux/miscdevice.h>
+#include <linux/platform_device.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
@@ -33,6 +34,8 @@
 static DEFINE_MUTEX(test_fw_mutex);
 static const struct firmware *test_firmware;
 
+static struct platform_device *pdev;
+
 struct test_batched_req {
 	u8 idx;
 	int rc;
@@ -53,6 +56,9 @@ struct test_batched_req {
  * @sync_direct: when the sync trigger is used if this is true
  *	request_firmware_direct() will be used instead.
  * @send_uevent: whether or not to send a uevent for async requests
+ * @enable_resume_test: if @senable_resume is true this will enable a test to
+ *	issue a request_firmware() upon resume. This is useful to test resume
+ *	after suspend filesystem races.
  * @num_requests: number of requests to try per test case. This is trigger
  *	specific.
  * @reqs: stores all requests information
@@ -91,6 +97,7 @@ struct test_config {
 	bool into_buf;
 	bool sync_direct;
 	bool send_uevent;
+	bool enable_resume_test;
 	u8 num_requests;
 	u8 read_fw_idx;
 
@@ -184,6 +191,7 @@ static int __test_firmware_config_init(void)
 	test_fw_config->send_uevent = true;
 	test_fw_config->into_buf = false;
 	test_fw_config->sync_direct = false;
+	test_fw_config->enable_resume_test = false;
 	test_fw_config->req_firmware = request_firmware;
 	test_fw_config->test_result = 0;
 	test_fw_config->reqs = NULL;
@@ -257,6 +265,9 @@ static ssize_t config_show(struct device *dev,
 	len += scnprintf(buf+len, PAGE_SIZE - len,
 			"sync_direct:\t\t%s\n",
 			test_fw_config->sync_direct ? "true" : "false");
+	len += scnprintf(buf+len, PAGE_SIZE - len,
+			"enable_resume_test:\t\t%s\n",
+			test_fw_config->enable_resume_test ? "true" : "false");
 	len += scnprintf(buf+len, PAGE_SIZE - len,
 			"read_fw_idx:\t%u\n", test_fw_config->read_fw_idx);
 
@@ -422,6 +433,22 @@ static ssize_t config_sync_direct_show(struct device *dev,
 }
 static DEVICE_ATTR_RW(config_sync_direct);
 
+static ssize_t config_enable_resume_test_store(struct device *dev,
+					       struct device_attribute *attr,
+					       const char *buf, size_t count)
+{
+	return test_dev_config_update_bool(buf, count,
+					   &test_fw_config->enable_resume_test);
+}
+
+static ssize_t config_enable_resume_test_show(struct device *dev,
+					      struct device_attribute *attr,
+					      char *buf)
+{
+	return test_dev_config_show_bool(buf, test_fw_config->enable_resume_test);
+}
+static DEVICE_ATTR_RW(config_enable_resume_test);
+
 static ssize_t config_send_uevent_store(struct device *dev,
 					struct device_attribute *attr,
 					const char *buf, size_t count)
@@ -929,6 +956,7 @@ static struct attribute *test_dev_attrs[] = {
 	TEST_FW_DEV_ATTR(config_into_buf),
 	TEST_FW_DEV_ATTR(config_sync_direct),
 	TEST_FW_DEV_ATTR(config_send_uevent),
+	TEST_FW_DEV_ATTR(config_enable_resume_test),
 	TEST_FW_DEV_ATTR(config_read_fw_idx),
 
 	/* These don't use the config at all - they could be ported! */
@@ -958,6 +986,81 @@ static struct miscdevice test_fw_misc_device = {
 	.groups 	= test_dev_groups,
 };
 
+static int __maybe_unused test_firmware_suspend(struct device *dev)
+{
+	return 0;
+}
+
+
+static int __maybe_unused test_firmware_resume(struct device *dev)
+{
+	int rc;
+
+	if (!test_fw_config->enable_resume_test)
+		return 0;
+
+	pr_info("resume test, loading '%s'\n", test_fw_config->name);
+
+	mutex_lock(&test_fw_mutex);
+	release_firmware(test_firmware);
+	test_firmware = NULL;
+	rc = request_firmware(&test_firmware, test_fw_config->name, dev);
+	if (rc) {
+		mutex_unlock(&test_fw_mutex);
+		pr_info("load of '%s' failed: %d\n", test_fw_config->name, rc);
+		goto out;
+	}
+
+	pr_info("loaded: %zu\n", test_firmware->size);
+	mutex_unlock(&test_fw_mutex);
+	pr_info("resume test, completed successfully\n");
+out:
+	return rc;
+}
+
+static SIMPLE_DEV_PM_OPS(test_dev_pm_ops, test_firmware_suspend, test_firmware_resume);
+
+static int test_firmware_probe(struct platform_device *dev)
+{
+	int rc;
+
+	rc = misc_register(&test_fw_misc_device);
+	if (rc) {
+		kfree(test_fw_config);
+		pr_err("could not register misc device: %d\n", rc);
+		return rc;
+	}
+
+	pr_info("interface ready\n");
+
+	return 0;
+}
+
+static int test_firmware_remove(struct platform_device *dev)
+{
+	mutex_lock(&test_fw_mutex);
+	release_firmware(test_firmware);
+	misc_deregister(&test_fw_misc_device);
+	mutex_unlock(&test_fw_mutex);
+
+	return 0;
+}
+
+static void test_firmware_shutdown(struct platform_device *dev)
+{
+}
+
+static struct platform_driver test_firmware_driver = {
+	.driver		= {
+		.name	= "test_firmware",
+		.pm	= &test_dev_pm_ops,
+	},
+	.probe		= test_firmware_probe,
+	.remove		= test_firmware_remove,
+	.shutdown	= test_firmware_shutdown,
+};
+
+
 static int __init test_firmware_init(void)
 {
 	int rc;
@@ -973,28 +1076,39 @@ static int __init test_firmware_init(void)
 		return rc;
 	}
 
-	rc = misc_register(&test_fw_misc_device);
-	if (rc) {
-		kfree(test_fw_config);
-		pr_err("could not register misc device: %d\n", rc);
-		return rc;
-	}
+	rc = platform_driver_register(&test_firmware_driver);
+	if (rc)
+		goto err_alloc;
 
-	pr_warn("interface ready\n");
+	pdev = platform_device_alloc("test_firmware", -1);
+	if (!pdev)
+		goto err_driver_unregister;
+
+	rc = platform_device_add(pdev);
+	if (rc)
+		goto err_free_device;
 
 	return 0;
+
+ err_free_device:
+	platform_device_put(pdev);
+ err_driver_unregister:
+	platform_driver_unregister(&test_firmware_driver);
+ err_alloc:
+	__test_firmware_config_free();
+	kfree(test_fw_config);
+	return rc;
 }
 
 module_init(test_firmware_init);
 
 static void __exit test_firmware_exit(void)
 {
-	mutex_lock(&test_fw_mutex);
-	release_firmware(test_firmware);
-	misc_deregister(&test_fw_misc_device);
+	platform_device_unregister(pdev);
+	platform_driver_unregister(&test_firmware_driver);
+
 	__test_firmware_config_free();
 	kfree(test_fw_config);
-	mutex_unlock(&test_fw_mutex);
 
 	pr_warn("removed interface\n");
 }

  Luis

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-02 18:02                     ` Luis Chamberlain
@ 2021-04-02 22:19                       ` Luis Chamberlain
  2021-04-02 22:58                         ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-02 22:19 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

Lukas, can you share your /etc/fstab ? Also, how long do you stay in
the boot before you try to suspend?

 Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-02 22:19                       ` Luis Chamberlain
@ 2021-04-02 22:58                         ` Luis Chamberlain
  2021-04-03 10:24                           ` Lukas Middendorf
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-02 22:58 UTC (permalink / raw)
  To: Lukas Middendorf
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> Lukas, can you share your /etc/fstab ? Also, how long do you stay in
> the boot before you try to suspend?

OK I cannot reproduce the issue with the modified patch I sent to
test_firmware, which if you enable config_enable_resume_test will
trigger a request_firmware() on resume, thus trying to mimic the race
you note. To test this you can simply use a loopback filesystem for
your /lib/firmware and create a btrfs filesystem for it, and then run:

echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test

systemctl suspend

Then resume. You should see "resume test" print on dmesg. I keep my
/lib/firmware/ empty and still, nothing. Can you provide kernel logs
for where you are seeing things get stuck at? Note that I had
mentioned the races on suspend/resume do exist for any journaling
filesystem, but this typically happens if you are doing ongoing
writes. I suppose you are *not* doing writes and your filesystem is
idle.

As such without kernel logs I cannot be sure what the issue is, but at
this point after the initial testing I've done I don't suspect this is
a firmware API issue. You might be better off just reposting your
patches with the respective Reviewed-by tags and pestering your
maintainer.

PS. If you want to test this on a guest you bring the guest back up
with virsh dompmwakeup.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-02 22:58                         ` Luis Chamberlain
@ 2021-04-03 10:24                           ` Lukas Middendorf
  2021-04-03 16:07                             ` Lukas Middendorf
  2021-04-03 20:25                             ` Luis Chamberlain
  0 siblings, 2 replies; 22+ messages in thread
From: Lukas Middendorf @ 2021-04-03 10:24 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media,
	Lukas Middendorf


On 02/04/2021 20:02, Luis Chamberlain wrote:
> No sorry, I dropped the ball on this but I managed to now spawn up the
> virtual guests where I was doing development to reproduce this. Give
> me some time and I will zero in on this now.
>
> For now what I have is the following to test this, I next will work
> on the userspace part.

I can report that your patch for test_firmware works (applied to current 
master from linus d93a0d4; I get some offsets and had to adjust some 
whitespace for it to apply). With that module (and resume test enabled) 
I get freezes at resume in the same cases that would also cause problems 
with si2168.

I'm testing this on bare metal F34 beta with / on a btrfs. I'm using 
nvidia driver again to make sure the system does not otherwise use any 
firmware from /usr/lib/firmware (confirmed with kernel debug messages 
for firmware_loader). My si2168 is not plugged in.

I tested it with a normal population of files in /usr/lib/firmware , 
without test-firmware.bin and also with a random 1MiB file in place. I 
tested after a reboot so it does not do caching.
With /usr/lib/firmware on a separate ext4 partition I can confirm with 
dmesg that the test_firmware suspend test actually works (does not freeze).
With /usr/lib/firmware on btrfs it fails in both cases (with and without 
the firmware files).
With caching (first suspend with the ext4 partition mounted, then a 
second suspend without) it does not freeze even with the firmware on btrfs.


One further thing I noticed which might be problematic in rare cases:
According to the kernel debug messages, the firmware-loader does not 
attempt to cache the firmware during suspend, if the previous call to 
request_firmware() has failed (file not present; call made during 
previous resume). In my opinion it should attempt to cache the firmware 
on suspend even in this case (If I remember correctly, 
firmware_request_cache also works without the file being present). In 
case some low-memory condition has caused the file system cache to lose 
the information about the file being non-present (or the file has been 
written after the initial attempt and is no longer in the file system 
cache), this might lead to freezes even for well-behaved drivers in case 
they reattempt to do request_firmware() on resume.
If the firmware is found during resume, it is cached on further suspends.
Given how long it took me to narrow down this problem in this (for me) 
reliably reproducible case, something like this happening at random 
would be almost impossible to debug/locate and might actually happen 
frequently in the wild.


On 03/04/2021 00:19, Luis Chamberlain wrote:
> Lukas, can you share your /etc/fstab ?

This is the core part (everything else is unmounted), I shortened the 
UUIDs. The ext4 mount of course is also unmounted when I want it to 
fail, with it in place it reliably not-fails.

UUID=<1> /                       btrfs   subvol=linux1 0 0
UUID=<2> /usr/lib/firmware       ext4    defaults 1 1
UUID=<1> /home/lukas             btrfs   subvol=linux1-f34-lukas 0 0
tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0

> Also, how long do you stay in the boot before you try to suspend?

During my reproduction sessions usually only shortly 1-5min, but I think 
I have seen this also after a slightly longer time. I can try to let it 
sit for longer if you think that is important.

On 03/04/2021 00:58, Luis Chamberlain wrote:
> On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>>
>> Lukas, can you share your /etc/fstab ? Also, how long do you stay in
>> the boot before you try to suspend?
> 
> OK I cannot reproduce the issue with the modified patch I sent to
> test_firmware, which if you enable config_enable_resume_test will
> trigger a request_firmware() on resume, thus trying to mimic the race
> you note. To test this you can simply use a loopback filesystem for
> your /lib/firmware and create a btrfs filesystem for it, and then run:
> 
> echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> 
> systemctl suspend
> 
> Then resume. You should see "resume test" print on dmesg. I keep my
> /lib/firmware/ empty and still, nothing.

Did you also try to create a random test-firmware.bin (I used 1M from 
/dev/urandom) instead of an empty /lib/firmware ?
If the directory is completely empty, it also does not freeze for me. If 
the directory is empty, any attempt to access its content can likely be 
directly served from cache, even if the actual directory has never been 
accessed before, as long as /lib (which is a symlink to /usr/lib on 
fedora) has been accessed (which will likely always be true). So I have 
to further add to my previous findings that "firmware directory is not 
completely empty" is a further prerequisite for it to fail.

> Can you provide kernel logs for where you are seeing things get stuck at? 

The log does not have any entries from resume. For the attempts where it 
freezes the last entry in journalctl is
systemd-sleep[5050]: Suspending system...

> Note that I had mentioned the races on suspend/resume do exist for any journaling
> filesystem, but this typically happens if you are doing ongoing
> writes. I suppose you are *not* doing writes and your filesystem is
> idle.

I can of course not completely rule out some random write (log files or 
similar), but there is definitely no heavy writing going on. I think 
only writes caused by the act of suspending and resuming could cause it 
this reliably. I have seen it also with a completely isolated btrfs file 
system for /usr/lib/firmware, where there should not have been any 
writes. For ext4 (which is also journaling) it works properly.

> As such without kernel logs I cannot be sure what the issue is, but at
> this point after the initial testing I've done I don't suspect this is
> a firmware API issue. You might be better off just reposting your
> patches with the respective Reviewed-by tags and pestering your
> maintainer.

I will try to be a little bit more insistent this time. Is "just repost" 
the usual way to handle if patches are ignored?

Lukas

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 10:24                           ` Lukas Middendorf
@ 2021-04-03 16:07                             ` Lukas Middendorf
  2021-04-03 20:25                             ` Luis Chamberlain
  1 sibling, 0 replies; 22+ messages in thread
From: Lukas Middendorf @ 2021-04-03 16:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media,
	Lukas Middendorf

Hi Luis,

I now succeeded in reproducing this in VirtualBox running a F34 minimal 
installation with / on btrfs (with default firmware files).
I can send you the virtual disk file if you want it. With the kernel 
sourcecode and the compiled and installed kernel it currently is 21.8GiB.


On 03/04/2021 12:24, Lukas Middendorf wrote:
> On 03/04/2021 00:58, Luis Chamberlain wrote:
>> Can you provide kernel logs for where you are seeing things get stuck at? 

I have dumped the output of the serial console with all outputs cranked 
to the max.
In [1] is the output with test_firmware loaded and suspend test running. 
This leads to a freeze. However it is not completely dead. While it does 
not visually react, at least from time to time there still are some 
messages from the kernel and the cursor is still blinking.


In [2] is a (successful) suspend and resume cycle with test_firmware not 
loaded.

Lukas

[1] https://gist.github.com/midluk/05716f714f778d26dc02771245df0ac9
[2] https://gist.github.com/midluk/a03eb953b6cf097688b8be2e0cd387fd

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 10:24                           ` Lukas Middendorf
  2021-04-03 16:07                             ` Lukas Middendorf
@ 2021-04-03 20:25                             ` Luis Chamberlain
  2021-04-03 21:04                               ` Luis Chamberlain
                                                 ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-03 20:25 UTC (permalink / raw)
  To: Lukas Middendorf, Greg KH, dsterba
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote:
> 
> On 02/04/2021 20:02, Luis Chamberlain wrote:
> > No sorry, I dropped the ball on this but I managed to now spawn up the
> > virtual guests where I was doing development to reproduce this. Give
> > me some time and I will zero in on this now.
> > 
> > For now what I have is the following to test this, I next will work
> > on the userspace part.
> 
> I can report that your patch for test_firmware works (applied to current
> master from linus d93a0d4; I get some offsets and had to adjust some
> whitespace for it to apply). With that module (and resume test enabled) I
> get freezes at resume in the same cases that would also cause problems with
> si2168.

Beautiful!

> I'm testing this on bare metal F34 beta with / on a btrfs. I'm using nvidia
> driver again to make sure the system does not otherwise use any firmware
> from /usr/lib/firmware (confirmed with kernel debug messages for
> firmware_loader). My si2168 is not plugged in.

OK I was not able to reproduce but the details below also clarify why,
so I now was able to reproduce! So I don't need to specifically test
with this distro, but others may do so.

> I tested it with a normal population of files in /usr/lib/firmware , without
> test-firmware.bin and also with a random 1MiB file in place. I tested after
> a reboot so it does not do caching.
> With /usr/lib/firmware on a separate ext4 partition I can confirm with dmesg
> that the test_firmware suspend test actually works (does not freeze).
> With /usr/lib/firmware on btrfs it fails in both cases (with and without the
> firmware files).
> With caching (first suspend with the ext4 partition mounted, then a second
> suspend without) it does not freeze even with the firmware on btrfs.

OK thanks for confirming!

> One further thing I noticed which might be problematic in rare cases:
> According to the kernel debug messages, the firmware-loader does not attempt
> to cache the firmware during suspend, 

Correct, the goal with this test was to purposely *not* call for the
firmware prior to suspend, and instead *race* (do the wrong thing) at
resume. It shouldn't really stall... fail yes, but stall, that seems
fishy.

> if the previous call to
> request_firmware() has failed (file not present; call made during previous
> resume). In my opinion it should attempt to cache the firmware on suspend
> even in this case 

Yes, that is the *proper* way to do use the firmware API, but we want to
reproduce the stall, and so we have to recreate the issue you reported
by doing something bad.

> (If I remember correctly, firmware_request_cache also
> works without the file being present). In case some low-memory condition has
> caused the file system cache to lose the information about the file being
> non-present (or the file has been written after the initial attempt and is
> no longer in the file system cache), this might lead to freezes even for
> well-behaved drivers in case they reattempt to do request_firmware() on
> resume.
> If the firmware is found during resume, it is cached on further suspends.

Right.

> Given how long it took me to narrow down this problem in this (for me)
> reliably reproducible case, something like this happening at random would be
> almost impossible to debug/locate and might actually happen frequently in
> the wild.

OK thanks for the report!

> On 03/04/2021 00:19, Luis Chamberlain wrote:
> > Lukas, can you share your /etc/fstab ?
> 
> This is the core part (everything else is unmounted), I shortened the UUIDs.
> The ext4 mount of course is also unmounted when I want it to fail, with it
> in place it reliably not-fails.
> 
> UUID=<1> /                       btrfs   subvol=linux1 0 0
> UUID=<2> /usr/lib/firmware       ext4    defaults 1 1
> UUID=<1> /home/lukas             btrfs   subvol=linux1-f34-lukas 0 0
> tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0
> 
> > Also, how long do you stay in the boot before you try to suspend?
> 
> During my reproduction sessions usually only shortly 1-5min, but I think I
> have seen this also after a slightly longer time. I can try to let it sit
> for longer if you think that is important.

No, I was looking to see how easy it is to reproduce right after a
fresh boot, now that I can reproduce I confirm it happens right away.

> On 03/04/2021 00:58, Luis Chamberlain wrote:
> > On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > 
> > > Lukas, can you share your /etc/fstab ? Also, how long do you stay in
> > > the boot before you try to suspend?
> > 
> > OK I cannot reproduce the issue with the modified patch I sent to
> > test_firmware, which if you enable config_enable_resume_test will
> > trigger a request_firmware() on resume, thus trying to mimic the race
> > you note. To test this you can simply use a loopback filesystem for
> > your /lib/firmware and create a btrfs filesystem for it, and then run:
> > 
> > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> > 
> > systemctl suspend
> > 
> > Then resume. You should see "resume test" print on dmesg. I keep my
> > /lib/firmware/ empty and still, nothing.
> 
> Did you also try to create a random test-firmware.bin (I used 1M from
> /dev/urandom) instead of an empty /lib/firmware ?

No, right now I want to to just focus on fixing the stall you saw.

> If the directory is completely empty, it also does not freeze for me.

This is a very important piece of information, thanks!

> If the
> directory is empty, any attempt to access its content can likely be directly
> served from cache, even if the actual directory has never been accessed
> before, as long as /lib (which is a symlink to /usr/lib on fedora) has been
> accessed (which will likely always be true). So I have to further add to my
> previous findings that "firmware directory is not completely empty" is a
> further prerequisite for it to fail.

Indeed, that helps!

So creating say 1000 random files in /lib/firmware on a freshly created
btrfs partition helps reproduce:

mkfs.btrfs /dev/whatever
mount /dev/wahtever /lib/firmware
# Put it on /etc/fstab too

Generate 1000 random files on it:

```
for n in {1..1000}; do                                                          
    dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 ))
done  
```

Then reboot, upon reboot do:

modprobe test_firmware
echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
systemctl suspend

If its a guest wake it up:

virsh dompmwakeup domidofguest

> > Can you provide kernel logs for where you are seeing things get stuck
> > at?
> 
> The log does not have any entries from resume. For the attempts where it
> freezes the last entry in journalctl is
> systemd-sleep[5050]: Suspending system...

I confirm.

> > Note that I had mentioned the races on suspend/resume do exist for any journaling
> > filesystem, but this typically happens if you are doing ongoing
> > writes. I suppose you are *not* doing writes and your filesystem is
> > idle.
> 
> I can of course not completely rule out some random write (log files or
> similar), but there is definitely no heavy writing going on. I think only
> writes caused by the act of suspending and resuming could cause it this
> reliably. I have seen it also with a completely isolated btrfs file system
> for /usr/lib/firmware, where there should not have been any writes. For ext4
> (which is also journaling) it works properly.

Nope, no writes are going on.

> > As such without kernel logs I cannot be sure what the issue is, but at
> > this point after the initial testing I've done I don't suspect this is
> > a firmware API issue. You might be better off just reposting your
> > patches with the respective Reviewed-by tags and pestering your
> > maintainer.
> 
> I will try to be a little bit more insistent this time. Is "just repost" the
> usual way to handle if patches are ignored?

You can repost, v2 just add my reviewed-by. Those patches are indeed
correct as they are calling for the firmware prior to resume. Maybe
clarify that.

But indeed there is also another issue which you reported which needs to
be fixed, and for that thanks so much for you patience! I'll be looking
into this!

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 20:25                             ` Luis Chamberlain
@ 2021-04-03 21:04                               ` Luis Chamberlain
  2021-04-05  9:52                                 ` Lukas Middendorf
  2021-04-04  0:50                               ` Lukas Middendorf
  2021-04-08 18:02                               ` Luis Chamberlain
  2 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-03 21:04 UTC (permalink / raw)
  To: Lukas Middendorf, Greg KH, dsterba
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote:
> So creating say 1000 random files in /lib/firmware on a freshly created
> btrfs partition helps reproduce:
> 
> mkfs.btrfs /dev/whatever
> mount /dev/wahtever /lib/firmware
> # Put it on /etc/fstab too
> 
> Generate 1000 random files on it:
> 
> ```
> for n in {1..1000}; do                                                          
>     dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 ))
> done  
> ```
> 
> Then reboot, upon reboot do:
> 
> modprobe test_firmware
> echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> systemctl suspend

OK this fixes it but this just shows that likely the thaw'ing allows
a race to take place which we didn't expect. I'll do some more digging
for a proper fix.

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 90d255fbdd9b..b9c37eefab35 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -4,6 +4,7 @@
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
+#include <linux/umh.h>
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -156,17 +157,25 @@ int kernel_read_file_from_path_initns(const char *path, loff_t offset,
 	if (!path || !*path)
 		return -EINVAL;
 
+	ret = usermodehelper_read_trylock();
+	if (WARN_ON(ret)) {
+		pr_warn_once("Trying to do direct read when not available");
+		return ret;
+	}
 	task_lock(&init_task);
 	get_fs_root(init_task.fs, &root);
 	task_unlock(&init_task);
 
 	file = file_open_root(root.dentry, root.mnt, path, O_RDONLY, 0);
 	path_put(&root);
-	if (IS_ERR(file))
+	if (IS_ERR(file)) {
+		usermodehelper_read_unlock();
 		return PTR_ERR(file);
+	}
 
 	ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
 	fput(file);
+	usermodehelper_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 20:25                             ` Luis Chamberlain
  2021-04-03 21:04                               ` Luis Chamberlain
@ 2021-04-04  0:50                               ` Lukas Middendorf
  2021-04-08 18:02                               ` Luis Chamberlain
  2 siblings, 0 replies; 22+ messages in thread
From: Lukas Middendorf @ 2021-04-04  0:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media,
	Greg KH, dsterba, Lukas Middendorf


Great to hear that you now succeeded in reproducing the problem.

On 03/04/2021 22:25, Luis Chamberlain wrote:
> On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote:
>> One further thing I noticed which might be problematic in rare cases:
>> According to the kernel debug messages, the firmware-loader does not attempt
>> to cache the firmware during suspend,
> 
> Correct, the goal with this test was to purposely *not* call for the
> firmware prior to suspend, and instead *race* (do the wrong thing) at
> resume. It shouldn't really stall... fail yes, but stall, that seems
> fishy.

I understood that.
>> if the previous call to
>> request_firmware() has failed (file not present; call made during previous
>> resume). In my opinion it should attempt to cache the firmware on suspend
>> even in this case
> 
> Yes, that is the *proper* way to do use the firmware API, but we want to
> reproduce the stall, and so we have to recreate the issue you reported
> by doing something bad.

Should firmware_request_cache() always be called, or only instead of 
request_firmware() ? In case request_firmware is called on its own, and 
the firmware file is not present, it might still race on the next resume.
request_firmware seems to be meant to include the cache request for the 
next suspend, but it apparently *does not* if firmware loading fails due 
to a missing file. I think this is something that should either be 
changed or properly documented in the API documentation.


>> Did you also try to create a random test-firmware.bin (I used 1M from
>> /dev/urandom) instead of an empty /lib/firmware ?
> 
> No, right now I want to to just focus on fixing the stall you saw.

This is a second nuance of the stall I saw: The firmware file (in this 
case test-firmware.bin) is present but not in cache.

If you run

ls -lR /lib/firmware > /dev/null

before

systemctl suspend

your reproduction steps will very likely not stall.
If you in addition run

dd if=/dev/urandom of=/lib/firmware/test-firmware.bin bs=1K count=64

during the preparation of /lib/firmware (in addition to or instead of 
your for loop), then it will always stall (as long as the file content 
is not read before suspend).

One stall is happening while the directory content is being listed (to 
find that the file is not present), the second stall is happening while 
actually trying to read the file. Those two stalls likely have the same 
root cause, but it might be different. I just want to make sure you have 
covered all cases.


>>> You might be better off just reposting your
>>> patches with the respective Reviewed-by tags and pestering your
>>> maintainer.
>>
>> I will try to be a little bit more insistent this time. Is "just repost" the
>> usual way to handle if patches are ignored?
> 
> You can repost, v2 just add my reviewed-by. Those patches are indeed
> correct as they are calling for the firmware prior to resume. Maybe
> clarify that.

OK, will do.

> But indeed there is also another issue which you reported which needs to
> be fixed, and for that thanks so much for you patience! I'll be looking
> into this!

Also thank you for your effort.

Lukas




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 21:04                               ` Luis Chamberlain
@ 2021-04-05  9:52                                 ` Lukas Middendorf
  0 siblings, 0 replies; 22+ messages in thread
From: Lukas Middendorf @ 2021-04-05  9:52 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media,
	Lukas Middendorf, Greg KH, dsterba



On 03/04/2021 23:04, Luis Chamberlain wrote:
> OK this fixes it but this just shows that likely the thaw'ing allows
> a race to take place which we didn't expect. I'll do some more digging
> for a proper fix.

I can indeed confirm that this fixes the stall. This however does not
seem to be the (complete) solution. Instead I now get a kernel crash
message (see below) for every firmware location tried to read during
resume. This might be intentional for debugging purposes during testing.
This is identical for ext4 and btrfs.

If the firmware file can not be found during caching on suspend, the
reads are still attempted again during resume. This leads to multiple of
those crash messages (for different firmware locations) during resume if
the firmware file is not present, even for drivers properly requesting
caching.

So if this patch is to go in (those crashes would really help with
getting the si2168 patches in…), I think you have to make sure that even for
non-existent firmware files, no read is ever attempted on resume. Which
means to set up the caching even if the initial request_firmware()
failed and to store knowledge about failed caching attempts to not retry
these reads on resume.

Lukas

------------[ cut here ]------------
WARNING: CPU: 0 PID: 662 at fs/kernel_read_file.c:161 kernel_read_file_from_path_initns+0x11c/0x140
Modules linked in: test_firmware nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common kvm_amd vmwgfx ccp kvm ttm drm_kms_helper snd_pcm joydev snd_timer snd e1000 irqbypass cec soundcore vboxguest i2c_piix4 pcspkr drm fuse zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw ata_generic pata_acpi video
CPU: 0 PID: 662 Comm: systemd-sleep Not tainted 5.12.0-rc5+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:kernel_read_file_from_path_initns+0x11c/0x140
Code: ff ff 4c 89 e7 89 44 24 10 e8 50 07 fc ff e8 fb 1c d8 ff 44 8b 44 24 10 48 83 c4 28 44 89 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 80 3d b4 10 7a 01 00 75 e3 e9 5b 5e 83 00 e8 cf 1c d8 ff 45
RSP: 0018:ffffa096c0b9fb90 EFLAGS: 00010286
RAX: 00000000fffffff5 RBX: 0000000000000000 RCX: ffffffffbd85d688
RDX: ffffffffbd85d688 RSI: 0000000000000297 RDI: ffffffffbd85d680
RBP: ffffa096c0b9fbe0 R08: 00000000fffffff5 R09: ffffffffbd85d688
R10: ffffffffffffffff R11: 0000000000000000 R12: ffff8976ca811000
R13: ffffa096c0b9fc00 R14: 000000007fffffff R15: 0000000000000001
FS:  00007f4718de3b40(0000) GS:ffff8979cfc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005594afeca4f8 CR3: 0000000106a62000 CR4: 00000000000506f0
Call Trace:
  ? snprintf+0x39/0x40
  fw_get_filesystem_firmware+0xe2/0x270
  _request_firmware+0x21e/0x500
  request_firmware+0x32/0x50
  test_firmware_resume.cold+0x4e/0xb2 [test_firmware]
  ? platform_pm_suspend+0x40/0x40
  dpm_run_callback+0x4c/0x120
  device_resume+0xa7/0x200
  dpm_resume+0xce/0x2c0
  dpm_resume_end+0xd/0x20
  suspend_devices_and_enter+0x195/0x750
  pm_suspend.cold+0x329/0x374
  state_store+0x71/0xd0
  kernfs_fop_write_iter+0x11c/0x1b0
  new_sync_write+0x108/0x180
  vfs_write+0x1b8/0x270
  ksys_write+0x4f/0xc0
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4719a527a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffee4d18e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4719a527a7
RDX: 0000000000000004 RSI: 00007ffee4d18f00 RDI: 0000000000000004
RBP: 00007ffee4d18f00 R08: 00005626b555c710 R09: 00007f4719ae84e0
R10: 00007f4719ae83e0 R11: 0000000000000246 R12: 0000000000000004
R13: 00005626b5558650 R14: 0000000000000004 R15: 00007f4719b25700
---[ end trace 7f7ef0dc067dd714 ]---
Trying to do direct read when not available
test_firmware test_firmware: loading /lib/firmware/updates/5.12.0-rc5+/test-firmware.bin failed with error -11
------------[ cut here ]------------

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-03 20:25                             ` Luis Chamberlain
  2021-04-03 21:04                               ` Luis Chamberlain
  2021-04-04  0:50                               ` Lukas Middendorf
@ 2021-04-08 18:02                               ` Luis Chamberlain
  2021-04-16 23:17                                 ` Luis Chamberlain
  2 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-08 18:02 UTC (permalink / raw)
  To: Lukas Middendorf, Greg KH, dsterba
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote:
> So creating say 1000 random files in /lib/firmware on a freshly created
> btrfs partition helps reproduce:
> 
> mkfs.btrfs /dev/whatever
> mount /dev/wahtever /lib/firmware
> # Put it on /etc/fstab too
> 
> Generate 1000 random files on it:
> 
> ```
> for n in {1..1000}; do                                                          
>     dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 ))
> done  
> ```
> 
> Then reboot, upon reboot do:
> 
> modprobe test_firmware
> echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> systemctl suspend
> 
> If its a guest wake it up:
> 
> virsh dompmwakeup domidofguest

This happens because:

btrfs_lookup() --> ... -->                                                      
btrfs_search_slot() --> read_block_for_search() -->                             
	--> read_tree_block() --> btree_read_extent_buffer_pages() -->                
	--> submit_one_bio() --> btrfs_submit_metadata_bio() -->                      
	--> btrfsic_submit_bio() --> submit_bio()
		--> this completes and then
	--> wait_on_page_locked() on the first page
	--> never returns                                                             

I also managed to reproduce this easily with XFS as well, so this is not
a btrfs thing as I suspected. It does not happen with ext4 though.
However I think that's just by chance, it should still be prone to the
same issue.

Either way, I'm dusting off my patches for fs freeze as I believe that
should fix this problem. I am not sure if we want a stop gap hack like
the one I posted in the meantime... I don't think so. I rather fix this
well with the series I'll post for fs freeze. Give me a bit of time and
I'll CC you on the patches.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?
  2021-04-08 18:02                               ` Luis Chamberlain
@ 2021-04-16 23:17                                 ` Luis Chamberlain
  0 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2021-04-16 23:17 UTC (permalink / raw)
  To: Lukas Middendorf, Greg KH, dsterba, Jan Kara, Bart Van Assche, fsdevel
  Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media

On Thu, Apr 08, 2021 at 06:02:24PM +0000, Luis Chamberlain wrote:
> On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote:
> > So creating say 1000 random files in /lib/firmware on a freshly created
> > btrfs partition helps reproduce:
> > 
> > mkfs.btrfs /dev/whatever
> > mount /dev/wahtever /lib/firmware
> > # Put it on /etc/fstab too
> > 
> > Generate 1000 random files on it:
> > 
> > ```
> > for n in {1..1000}; do                                                          
> >     dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 ))
> > done  
> > ```
> > 
> > Then reboot, upon reboot do:
> > 
> > modprobe test_firmware
> > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> > systemctl suspend
> > 
> > If its a guest wake it up:
> > 
> > virsh dompmwakeup domidofguest
> 
> This happens because:
> 
> btrfs_lookup() --> ... -->                                                      
> btrfs_search_slot() --> read_block_for_search() -->                             
> 	--> read_tree_block() --> btree_read_extent_buffer_pages() -->                
> 	--> submit_one_bio() --> btrfs_submit_metadata_bio() -->                      
> 	--> btrfsic_submit_bio() --> submit_bio()
> 		--> this completes and then
> 	--> wait_on_page_locked() on the first page
> 	--> never returns                                                             
> 
> I also managed to reproduce this easily with XFS as well, so this is not
> a btrfs thing as I suspected. It does not happen with ext4 though.
> However I think that's just by chance, it should still be prone to the
> same issue.
> 
> Either way, I'm dusting off my patches for fs freeze as I believe that
> should fix this problem. I am not sure if we want a stop gap hack like
> the one I posted in the meantime... I don't think so. I rather fix this
> well with the series I'll post for fs freeze. Give me a bit of time and
> I'll CC you on the patches.

Low and behold, as I suspectd, my old VFS fsfreeze / thaw patch series
this. However I should note that I needed to add remove also the
WQ_FREEZABLE from fs as well, which was missing in my patch series, and
which Jan Kara had pointed out.

However, the VFS freeze work is quite a bit of work which we are still
discussing, so in the meantime, I think we have no other option but
to put the stop-gap patch I suggested with the usermode helper lock.
I will just modify it a bit more.

I'll also post my fs freeze work now again, but I'll note that it
still requires some more work to address everything which we have
discussed in the community. I'll post the patches as I think others
may be interested in the progress of that.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-04-16 23:18 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-09 18:51 Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? Lukas Middendorf
2020-08-13 16:37 ` Luis Chamberlain
2020-08-13 21:53   ` Lukas Middendorf
2020-08-13 22:13     ` Luis Chamberlain
2020-08-14 11:38       ` Lukas Middendorf
2020-08-14 16:37         ` Luis Chamberlain
2020-08-14 21:59           ` Lukas Middendorf
2020-08-17 15:20             ` Luis Chamberlain
2020-08-17 22:04               ` Lukas Middendorf
2020-08-18 14:37                 ` Luis Chamberlain
2021-04-01 14:59                   ` Lukas Middendorf
2021-04-02 18:02                     ` Luis Chamberlain
2021-04-02 22:19                       ` Luis Chamberlain
2021-04-02 22:58                         ` Luis Chamberlain
2021-04-03 10:24                           ` Lukas Middendorf
2021-04-03 16:07                             ` Lukas Middendorf
2021-04-03 20:25                             ` Luis Chamberlain
2021-04-03 21:04                               ` Luis Chamberlain
2021-04-05  9:52                                 ` Lukas Middendorf
2021-04-04  0:50                               ` Lukas Middendorf
2021-04-08 18:02                               ` Luis Chamberlain
2021-04-16 23:17                                 ` Luis Chamberlain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).