* Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? @ 2020-08-09 18:51 Lukas Middendorf 2020-08-13 16:37 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2020-08-09 18:51 UTC (permalink / raw) To: Luis Chamberlain, linux-btrfs Cc: Antti Palosaari, Mauro Carvalho Chehab, linux-media Hi, I'm seeing system freezes during resuming from suspend to RAM if all of the following is true: 1. /usr/lib/firmware is on btrfs 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected 3. the firmware file /usr/lib/firmware/dvb-demod-si2168-b40-01.fw (or knowledge about its non-existstence) is NOT in a cache in RAM The exact outcome depends on whether or not I use nvidia or nouveau driver an whether or not I have X running. With nvidia I only get some kernel messages from suspend on screen, but the system does not seem to react to anything (emergency FS sync magic sysrq sometimes works judging from HDD-LED). With nouveau and X11 running I just get a black screen with mouse cursor and no apparent reaction to key presses (again except for emergency FS sync magic sysrq). With nouveau but without X11 running the system even reacts to ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and login or any other input is not possible. Unplugging the USB device after resume does not return the system back to normal. Details on the above conditions: 1. If /usr/lib/firmware is on ext4 everything works fine. I originally noticed this with the complete / on btrfs vs. ext4, but putting /usr/lib/firmware on a separate partition is sufficient to make it work or fail, depending on the fs type. 2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its si2168_init() called during resume. In this function a call to request_firmware() happens. The firmware is not actually loaded to the device on boot or initial usb connection but just during resume and when the DVB-Tuner is used. 3. a) if the device has been used and the firmware has already been loaded, the load succeeds during resume b) if the firmware file is not present and the directory content of /usr/lib/firmware has been listed (e.g. through loading of a different firmware from there; by the nouveau driver in my case) the firmware load fails but does not freeze the system c) manually reading the firmware files before suspend can make the resume succeed, but this does not seem to be reliable d) modifying the si2168 by adding calls to firmware_request_cache() in si2168_probe() makes the firmware loading during resume succeed. e) having the firmware files not installed freezes the system during resume if the content of /usr/lib/firmware has not been listed before suspend (e.g. installing the nvidia driver, so the nouveau driver does not access this directory) f) the system freezes during resume if nothing has caused the firmware file to be read from the drive before suspend I have observed this with all kernel versions I have tested so far: fedora kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11 vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c) Unfortunately I have not yet succeeded in getting useful debug output (nothing is in the log files after a reset). When I disable printk console_suspend I can get some messages of the resume process but nothing related to si2168 is shown on screen. This certainly looks like the documented promise of request_firmware() "The function can be called safely inside device’s suspend and resume callback." does not hold if the requested firmware files are residing on a btrfs. The documentation of firmware_request_cache() states that this function should be called before suspend if the firmware has not been loaded earlier. From the combined documentation I would expect that request_firmware() might fail during resume if the firmware is not yet in cache, but it should not freeze the system. I'm not sure who really is to blame here: - BTRFS (ext4 is fine after all) - the firmware loader because the implementation or the documentation are wrong - si2168 because of not loading the firmware or calling firmware_request_cache() before suspend. Also I don't think it is even necessary to load the firmware during resume, it should be sufficient to load the firmware when the tuner is used. I'm not sure though whether the dvb-frontend driver structure allows to properly distinguish between resuming and initialization before device usage. The problem can definitely be worked around here until the root cause is fixed. I can provide a patch if this solution is seen as appropriate. I'm putting all the maintainers and/or lists of the possible culprits on CC. Best regards, Lukas Middendorf ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-09 18:51 Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? Lukas Middendorf @ 2020-08-13 16:37 ` Luis Chamberlain 2020-08-13 21:53 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2020-08-13 16:37 UTC (permalink / raw) To: Lukas Middendorf, Anand Jain, mcgrof Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote: > Hi, Hey Lukas, thanks for the report! > I'm seeing system freezes during resuming from suspend to RAM if all of the > following is true: > 1. /usr/lib/firmware is on btrfs > 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected > 3. the firmware file /usr/lib/firmware/dvb-demod-si2168-b40-01.fw (or > knowledge about its non-existstence) OK that is SI2168_B40_FIRMWARE and the driver in question is part of drivers/media/dvb-frontends/si2168.c, its a 1 file driver so si2168 is the respective driver. > is NOT in a cache in RAM But the firmware is supposed to go be cached in RAM prior to suspend, given this driver uses the request_firmware() call, only the async call request_firmware_nowait() lets you opt-out of the cache, and likewise with request_firmware_into_buf() you don't use the cache. Note that some drivers implement their own caching mechanism as well, like iwlwifi, and so don't need this cache. Please refer to: https://www.kernel.org/doc/html/latest/driver-api/firmware/firmware_cache.html > The exact outcome depends on whether or not I use nvidia or nouveau driver > an whether or not I have X running. With nvidia I only get some kernel > messages from suspend on screen, but the system does not seem to react to > anything (emergency FS sync magic sysrq sometimes works judging from > HDD-LED). With nouveau and X11 running I just get a black screen with mouse > cursor and no apparent reaction to key presses (again except for emergency > FS sync magic sysrq). If the nouveau driver doesn't yet have proper suspend working, my guess is that the issue is the dependency on resources by nvidia ont he dvb driver, and if suspend may be broken there, ordering on suspend would not be correct. This is only true is suspend on nouveau is broken and I don't know the anwer to this. If suspend is not implemented correctly through the entire pipeline then your only option is to blacklist the driver. On debian this is what I use for hibernation blacklisting, this will remove the driver prior to hibernation: /etc/hibernate/blacklisted-modules > With nouveau but without X11 running the system even reacts to > ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and login > or any other input is not possible. > Unplugging the USB device after resume does not return the system back to > normal. Please debug further with nouveau as that is the only driver we have control over. > Details on the above conditions: > 1. If /usr/lib/firmware is on ext4 everything works fine. I originally > noticed this with the complete / on btrfs vs. ext4, but putting > /usr/lib/firmware on a separate partition is sufficient to make it work or > fail, depending on the fs type. The firmware loader just reads the file directly from the filesystem, that should complete fine but we can find out more details by you enabling debugging: echo "file drivers/base/* +p" > /sys/kernel/debug/dynamic_debug/control > 2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its > si2168_init() called during resume. In this function a call to > request_firmware() happens. The firmware is not actually loaded to the > device on boot or initial usb connection but just during resume and when the > DVB-Tuner is used. That's the thing, the cache is built out of prior requests. We build the actual firmware cache through a notifier on fw_pm_notify(), and we build the cache prior to suspend then with device_cache_fw_images(). So whatever issues you might have during suspend/resume because of a missing firmware would still be present on this driver because the driver never called for the firmware. If upon resume the firmware is requested, we race against the filesystem looking for it and that today can be racy with any filesystem. That fact that you only see it with one filesystem can be a one-off, as explained in more detail below. > 3. a) if the device has been used and the firmware has already been loaded, > the load succeeds during resume If the device has been used, so long as the file is still present prior to suspend, then the cache for it will be created. That explains why this works. > b) if the firmware file is not present and the directory content of > /usr/lib/firmware has been listed (e.g. through loading of a different > firmware from there; by the nouveau driver in my case) the firmware load > fails but does not freeze the system Can you clarify if you mean then that in this case the dvb device is not used? Which driver firmware load fails? > c) manually reading the firmware files before suspend can make the resume > succeed, but this does not seem to be reliable That may because of the dentry cache, and so helping races, doing something similar as to what the firmware cache does, but in hacky way. > d) modifying the si2168 by adding calls to firmware_request_cache() in > si2168_probe() makes the firmware loading during resume succeed. That indeed seems like a good idea in general for dvb that should probably be generalized, if dvb had a way to get a dvb's driver firmware name and indeed all dvb drivers are not skipping the firmware cache today. > e) having the firmware files not installed freezes the system during > resume if the content of /usr/lib/firmware has not been listed before > suspend (e.g. installing the nvidia driver, so the nouveau driver does not > access this directory) That may be the same issue as in b) assuming that you meant you didn't use the dvb device, and that the firmware load issue is from nouveau. Are all firmware files used by nouveau upon resume the samea as during probe? If not using firmware_request_cache() for each of them would be a good idea to resolve possible races issues. > f) the system freezes during resume if nothing has caused the firmware > file to be read from the drive before suspend Which driver? And yes, this is exactly why the firmware cache thing was done, and its why that for driver that don't use the firmware cache, they implement their own caching mechanism. The firmware must be cached prior to suspend. > I have observed this with all kernel versions I have tested so far: fedora > kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11 > vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c) > > Unfortunately I have not yet succeeded in getting useful debug output > (nothing is in the log files after a reset). When I disable printk > console_suspend I can get some messages of the resume process but nothing > related to si2168 is shown on screen. Try dynamic printk stuff mentioned. Even though some of the things you say are not clear yet. > This certainly looks like the documented promise of request_firmware() > "The function can be called safely inside device’s suspend and resume > callback." > does not hold if Inded, this is just false outdated information and needs to be updated. For now, this should just be re-written to mention in order to help with current races against the filesystem (detailed way at the bottom of this email) being brought up on resume, the firmware API relies on a firmware cache mechanism, and when that is used it is safe to call the request_firmware() call on resume. If a driver has not yet requested for the firmware prior to suspend then the firmware cache will not have the firmware cached in memory prior to suspend, and so to help with this firmware_request_cache() can be used. The call to firmware_request_cache() can be made on device's probe call. Some drivers may not want to use the firmware cache though, such as drivers with large firmware which don't want large firmwares to be taking so much memory prior to suspend/hibernation, and so certain APIs allow to skip the firwmare cache. Those drivers must implement their own solutions to ensure that if the firmware is needed on resume it will be available or not needed. > the requested firmware files are residing on a btrfs. There are two issues here. One is that you are using a driver which may have resume issues because the driver's firmware is not cached prior to suspend. The second issue may indeed be a freeze / issues of reading a filesystem *while racing to suspend / resume*. This issue can happen with any journaling filesystem today. More on this below. In so far as nouveau is concerned, remove the dvb driver and let's debug that alone separately, and see if the issue is not related to your dvb driver first. > The documentation of firmware_request_cache() states that this function > should be called before suspend if the firmware has not been loaded earlier. > From the combined documentation I would expect that request_firmware() might > fail during resume if the firmware is not yet in cache, but it should not > freeze the system. Indeed, but then again doing a a race against a filesystem in between suspend/resume can cause all sorts of issues today, so btrfs is not the only one with issue there. I had found issue with xfs as well [0]: [0] https://lwn.net/Articles/735382/ I addressed a topic on this particular issue at LSFFMM in 2018, and here are the notes from that session: https://lore.kernel.org/linux-fsdevel/20180426212243.GA27853@wotan.suse.de/ I will pick up this work again now, but it will take a bit of time. > I'm not sure who really is to blame here: > - BTRFS (ext4 is fine after all) > - the firmware loader because the implementation or the documentation are > wrong > - si2168 because of not loading the firmware or calling > firmware_request_cache() before suspend. Also I don't think it is even > necessary to load the firmware during resume, it should be sufficient to > load the firmware when the tuner is used. I'm not sure though whether the > dvb-frontend driver structure allows to properly distinguish between > resuming and initialization before device usage. The problem can definitely > be worked around here until the root cause is fixed. I can provide a patch > if this solution is seen as appropriate. > > I'm putting all the maintainers and/or lists of the possible culprits on CC. Indeed send the patch to use firmware_request_cache() for si2168. If you still have issues with nouveau, remove the si2168 driver and see if you can recreate the issue. If not then something between the two on suspend is likely the issue. But if the issue still persists then it would seem to be an issue with suspend on nouveau. The generic filesystem races on suspend/resume are known, and I will tackle that once I am done with some other task I am completing. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-13 16:37 ` Luis Chamberlain @ 2020-08-13 21:53 ` Lukas Middendorf 2020-08-13 22:13 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2020-08-13 21:53 UTC (permalink / raw) To: Luis Chamberlain Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media Hey Luis, On 13/08/2020 18:37, Luis Chamberlain wrote: > On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote: >> The exact outcome depends on whether or not I use nvidia or nouveau driver >> an whether or not I have X running. With nvidia I only get some kernel >> messages from suspend on screen, but the system does not seem to react to >> anything (emergency FS sync magic sysrq sometimes works judging from >> HDD-LED). With nouveau and X11 running I just get a black screen with mouse >> cursor and no apparent reaction to key presses (again except for emergency >> FS sync magic sysrq). > > If the nouveau driver doesn't yet have proper suspend working, my guess > is that the issue is the dependency on resources by nvidia ont he dvb > driver, and if suspend may be broken there, ordering on suspend would > not be correct. This is only true is suspend on nouveau is broken and I > don't know the anwer to this. Sorry for being a bit unclear here. nouveau seems to be working perfectly fine for me when it comes to this issue. Not using nouveau actually adds one more case in which the system freezes (when no firmware files are installed; case e) below). Of course I can not completely rule out a different problem with the nouveau driver, but my problem seems to be caused by si2168 only. > If suspend is not implemented correctly through the entire pipeline then > your only option is to blacklist the driver. On debian this is what I > use for hibernation blacklisting, this will remove the driver prior to > hibernation: > > /etc/hibernate/blacklisted-modules I will likely not have to do that now with the proper solution I have for si2168, but this can come in handy in the future should I hit some other suspend problem. >> With nouveau but without X11 running the system even reacts to >> ctrl+alt+F1/F2/F3, but only the first ttys are displaying anything and login >> or any other input is not possible. >> Unplugging the USB device after resume does not return the system back to >> normal. > > Please debug further with nouveau as that is the only driver we have > control over. Sure. I only mentioned the nvidia driver because the case "no firmware files installed" only seems to freeze the system on resume with the nvidia driver but not with nouveau. >> Details on the above conditions: >> 1. If /usr/lib/firmware is on ext4 everything works fine. I originally >> noticed this with the complete / on btrfs vs. ext4, but putting >> /usr/lib/firmware on a separate partition is sufficient to make it work or >> fail, depending on the fs type. > > The firmware loader just reads the file directly from the filesystem, > that should complete fine but we can find out more details by you > enabling debugging: > > echo "file drivers/base/* +p" > /sys/kernel/debug/dynamic_debug/control I already figured that out (I used "module firmware_class" though). >> 2. The si2168 driver drivers/media/dvb-frontends/si2168.c has its >> si2168_init() called during resume. In this function a call to >> request_firmware() happens. The firmware is not actually loaded to the >> device on boot or initial usb connection but just during resume and when the >> DVB-Tuner is used. > > That's the thing, the cache is built out of prior requests. We build the > actual firmware cache through a notifier on fw_pm_notify(), and we build > the cache prior to suspend then with device_cache_fw_images(). > > So whatever issues you might have during suspend/resume because of > a missing firmware would still be present on this driver because the > driver never called for the firmware. If upon resume the firmware is > requested, we race against the filesystem looking for it and that today > can be racy with any filesystem. That fact that you only see it with > one filesystem can be a one-off, as explained in more detail below. > >> 3. a) if the device has been used and the firmware has already been loaded, >> the load succeeds during resume > > If the device has been used, so long as the file is still present prior > to suspend, then the cache for it will be created. That explains why > this works. Yes, indeed. I had confirmed that with the firmware_class debug messages enabled. > >> b) if the firmware file is not present and the directory content of >> /usr/lib/firmware has been listed (e.g. through loading of a different >> firmware from there; by the nouveau driver in my case) the firmware load >> fails but does not freeze the system > > Can you clarify if you mean then that in this case the dvb device is not > used? > > Which driver firmware load fails? With "not used" I mean that the device has been recognized, the si2168 module is loaded and si2168_probe() has been called on the device, but I have not started a media player which actually plays a DVB-C stream. Therefore si2168_init() has never been called and the firmware has never been requested or loaded to the device. On resume si2168_init() will be called (although I don't think this actually is really necessary) and the firmware load of si2168 fails if the firmware files are not installed. If the kernel does not already know that the files are not present without access to the file system, the system just freezes. >> c) manually reading the firmware files before suspend can make the resume >> succeed, but this does not seem to be reliable > > That may because of the dentry cache, and so helping races, doing > something similar as to what the firmware cache does, but in hacky way. > >> d) modifying the si2168 by adding calls to firmware_request_cache() in >> si2168_probe() makes the firmware loading during resume succeed. > > That indeed seems like a good idea in general for dvb that should > probably be generalized, if dvb had a way to get a dvb's driver > firmware name and indeed all dvb drivers are not skipping the firmware > cache today. > >> e) having the firmware files not installed freezes the system during >> resume if the content of /usr/lib/firmware has not been listed before >> suspend (e.g. installing the nvidia driver, so the nouveau driver does not >> access this directory) > > That may be the same issue as in b) assuming that you meant you didn't > use the dvb device, and that the firmware load issue is from nouveau. This is actually just the inverted case of b). The only real relevance of the nouveau driver here is that its (perfectly working) firmware caching on suspend actually seems to be equivalent to manually running "ls" on the firmware directory and afterwards the kernel also knows whether or not the si2168 firmware files are present without file system access. > > Are all firmware files used by nouveau upon resume the samea as during > probe? If not using firmware_request_cache() for each of them would be > a good idea to resolve possible races issues. I have not compared them one-by-one, but as I have already written, nouveau seems fine here. > >> f) the system freezes during resume if nothing has caused the firmware >> file to be read from the drive before suspend > > Which driver? And yes, this is exactly why the firmware cache thing was > done, and its why that for driver that don't use the firmware cache, > they implement their own caching mechanism. > > The firmware must be cached prior to suspend. > >> I have observed this with all kernel versions I have tested so far: fedora >> kernels 5.5.18, 5.6.19, 5.7.9, 5.7.11 >> vanilla kernels 5.7.12, 5.8.0 and current master (git 06a81c1c) >> >> Unfortunately I have not yet succeeded in getting useful debug output >> (nothing is in the log files after a reset). When I disable printk >> console_suspend I can get some messages of the resume process but nothing >> related to si2168 is shown on screen. > > Try dynamic printk stuff mentioned. I have tried that, but in case the system freezes, I have not managed to get any useful debug info on the screen. But > Even though some of the things you say are not clear yet. I'm very sorry for that. But it always is a compromise between giving all possible information and not producing a huge pile of text. [snip: lots of interesting information] >> I'm not sure who really is to blame here: >> - BTRFS (ext4 is fine after all) >> - the firmware loader because the implementation or the documentation are >> wrong >> - si2168 because of not loading the firmware or calling >> firmware_request_cache() before suspend. Also I don't think it is even >> necessary to load the firmware during resume, it should be sufficient to >> load the firmware when the tuner is used. I'm not sure though whether the >> dvb-frontend driver structure allows to properly distinguish between >> resuming and initialization before device usage. The problem can definitely >> be worked around here until the root cause is fixed. I can provide a patch >> if this solution is seen as appropriate. >> >> I'm putting all the maintainers and/or lists of the possible culprits on CC. > > Indeed send the patch to use firmware_request_cache() for si2168. I sent it to the media mailing list (split into two patches). > If you still have issues with nouveau, remove the si2168 driver and see > if you can recreate the issue. If not then something between the two > on suspend is likely the issue. But if the issue still persists then > it would seem to be an issue with suspend on nouveau. Nouveau does seem to work (including suspend), except if you actually need some decent 3D performance. > The generic filesystem races on suspend/resume are known, and I will > tackle that once I am done with some other task I am completing. Great to know that the underlying cause for the freeze is known and a solution is being worked on (or at least planned to be worked on). Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-13 21:53 ` Lukas Middendorf @ 2020-08-13 22:13 ` Luis Chamberlain 2020-08-14 11:38 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2020-08-13 22:13 UTC (permalink / raw) To: Lukas Middendorf Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote: > Hey Luis, > > On 13/08/2020 18:37, Luis Chamberlain wrote: > > On Sun, Aug 09, 2020 at 08:51:35PM +0200, Lukas Middendorf wrote: > > > > > b) if the firmware file is not present and the directory content of > > > /usr/lib/firmware has been listed (e.g. through loading of a different > > > firmware from there; by the nouveau driver in my case) the firmware load > > > fails but does not freeze the system > > > > Can you clarify if you mean then that in this case the dvb device is not > > used? > > > > Which driver firmware load fails? > > With "not used" I mean that the device has been recognized, the si2168 > module is loaded and si2168_probe() has been called on the device, but I > have not started a media player which actually plays a DVB-C stream. > Therefore si2168_init() has never been called and the firmware has never > been requested or loaded to the device. > > On resume si2168_init() will be called (although I don't think this actually > is really necessary) Indeed, that seems odd given its not on probe. So yet another possible si2168 bug. Or another way to put it: your cache calls are not needed if you remove that si2168_init() if init was not called yet. So simply extending the data structure for the driver and seting a bool flag to true if init was called should do the trick. Then the two cache calls would not be needed. > and the firmware load of si2168 fails if the firmware > files are not installed. OK well that is expected. > If the kernel does not already know that the files > are not present without access to the file system, the system just freezes. It is not clear to me what this means. Can you clarify? > > > c) manually reading the firmware files before suspend can make the resume > > > succeed, but this does not seem to be reliable > > > > That may because of the dentry cache, and so helping races, doing > > something similar as to what the firmware cache does, but in hacky way. > > > > > d) modifying the si2168 by adding calls to firmware_request_cache() in > > > si2168_probe() makes the firmware loading during resume succeed. > > > > That indeed seems like a good idea in general for dvb that should > > probably be generalized, if dvb had a way to get a dvb's driver > > firmware name and indeed all dvb drivers are not skipping the firmware > > cache today. > > > > > e) having the firmware files not installed freezes the system during > > > resume if the content of /usr/lib/firmware has not been listed before > > > suspend (e.g. installing the nvidia driver, so the nouveau driver does not > > > access this directory) > > > > That may be the same issue as in b) assuming that you meant you didn't > > use the dvb device, and that the firmware load issue is from nouveau. > > This is actually just the inverted case of b). > The only real relevance of the nouveau driver here is that its (perfectly > working) firmware caching on suspend actually seems to be equivalent to > manually running "ls" on the firmware directory and afterwards the kernel > also knows whether or not the si2168 firmware files are present without file > system access. OK.. I still don't get it, so let me see if we can decipher what you mean here. If the firmware is *not* present for the si2168 driver and the device has *not* been used yet you get a system freeze which you cannot recover from, but only if you are *not* using a driver which also caches its firmware already? In this case if the nouveau is used this freeze does not happen, and you believe this is because of at least one fetch for the directory is done already. If so, then this speedup of a fugure negative dentry lookup on btrfs seems like an interesting test case for btrfs developers too look at. But we must first undersand the case well. > > Are all firmware files used by nouveau upon resume the samea as during > > probe? If not using firmware_request_cache() for each of them would be > > a good idea to resolve possible races issues. > > I have not compared them one-by-one, but as I have already written, nouveau > seems fine here. OK cool. Thanks for ruling this out! > > > I'm not sure who really is to blame here: > > > - BTRFS (ext4 is fine after all) > > > - the firmware loader because the implementation or the documentation are > > > wrong > > > - si2168 because of not loading the firmware or calling > > > firmware_request_cache() before suspend. Also I don't think it is even > > > necessary to load the firmware during resume, it should be sufficient to > > > load the firmware when the tuner is used. I'm not sure though whether the > > > dvb-frontend driver structure allows to properly distinguish between > > > resuming and initialization before device usage. The problem can definitely > > > be worked around here until the root cause is fixed. I can provide a patch > > > if this solution is seen as appropriate. > > > > > > I'm putting all the maintainers and/or lists of the possible culprits on CC. > > > > Indeed send the patch to use firmware_request_cache() for si2168. > > I sent it to the media mailing list (split into two patches). Looks good to me, except now that you metion the resuem thing, I think you can drop those patches if you do the check I mentioned on resume, ie, to verify that init hasn't been called yet. Also, might as well rename that function to make it clearer what it does as init prefixed or postfixed calls tend to be used for things during initialization. > > The generic filesystem races on suspend/resume are known, and I will > > tackle that once I am done with some other task I am completing. > > Great to know that the underlying cause for the freeze is known and a > solution is being worked on (or at least planned to be worked on). There is a sliver possibility here that a negative dentry slowdown may cause a btrfs freeze on suspend easily. If so that can likely be fixed before that big general filesystem freeze issue I noted with URLs and talks. But I think we need to confirm that is what is happening here. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-13 22:13 ` Luis Chamberlain @ 2020-08-14 11:38 ` Lukas Middendorf 2020-08-14 16:37 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2020-08-14 11:38 UTC (permalink / raw) To: Luis Chamberlain Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media Hey Louis, On 14/08/2020 00:13, Luis Chamberlain wrote: > On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote: >> With "not used" I mean that the device has been recognized, the si2168 >> module is loaded and si2168_probe() has been called on the device, but I >> have not started a media player which actually plays a DVB-C stream. >> Therefore si2168_init() has never been called and the firmware has never >> been requested or loaded to the device. >> >> On resume si2168_init() will be called (although I don't think this actually >> is really necessary) > > Indeed, that seems odd given its not on probe. So yet another possible > si2168 bug. Or another way to put it: your cache calls are not needed > if you remove that si2168_init() if init was not called yet. So simply > extending the data structure for the driver and seting a bool flag to > true if init was called should do the trick. > > Then the two cache calls would not be needed. I already thought about a way to remove the firmware load on suspend if it has not happened earlier. But I saw no way to distinguish between calls to si2168_init() during resume and calls to si2168_init() at the start of device usage. In contrast to struct dvb_tuner_ops the relevant struct dvb_frontend_ops has no separate members "suspend" and "resume" to store callbacks to be called instead of sleep and init during suspend and resume. Changing the behavior of si2168 would likely include some bigger architectural work on the dvb_frontend system to extend dvb_frontend_ops. It might be possible to just put nothing into dvb_frontend_ops.init but instead populate the dvb_tuner_ops part of the struct, but those pointers might already be populated externally for some other use. If I try to do this on my very first kernel contribution without in-depth knowledge of how the dvb drivers really work, this will likely just explode, if not for me then for somebody else with different hardware. >> If the kernel does not already know that the files >> are not present without access to the file system, the system just freezes. > > It is not clear to me what this means. Can you clarify? > >>>> e) having the firmware files not installed freezes the system during >>>> resume if the content of /usr/lib/firmware has not been listed before >>>> suspend (e.g. installing the nvidia driver, so the nouveau driver does not >>>> access this directory) >>> >>> That may be the same issue as in b) assuming that you meant you didn't >>> use the dvb device, and that the firmware load issue is from nouveau. >> >> This is actually just the inverted case of b). >> The only real relevance of the nouveau driver here is that its (perfectly >> working) firmware caching on suspend actually seems to be equivalent to >> manually running "ls" on the firmware directory and afterwards the kernel >> also knows whether or not the si2168 firmware files are present without file >> system access. > > OK.. I still don't get it, so let me see if we can decipher what you > mean here. > > If the firmware is *not* present for the si2168 driver and the device > has *not* been used yet you get a system freeze which you cannot recover > from, but only if you are *not* using a driver which also caches its > firmware already? Yes, this is exactly what I wanted to say. A new installation of Fedora 32 without firmware files and with nouveau did not show my freeze problem. Installing either the firmware files or the nvidia driver started the freeze during resume. > In this case if the nouveau is used this freeze does not happen, and you > believe this is because of at least one fetch for the directory is done > already. If so, then this speedup of a fugure negative dentry lookup on > btrfs seems like an interesting test case for btrfs developers too look > at. According to my previous observations, an alternative to having the nouveau driver fetch files to prevent the freeze when no files are installed is to manually run "ls" on the firmware directory. > But we must first undersand the case well. I will check this corner case again in depth tonight or tomorrow and will report my findings, at least as far as it is externally observable without in-depth understanding of the internals. >>>> I'm not sure who really is to blame here: >>>> - BTRFS (ext4 is fine after all) >>>> - the firmware loader because the implementation or the documentation are >>>> wrong >>>> - si2168 because of not loading the firmware or calling >>>> firmware_request_cache() before suspend. Also I don't think it is even >>>> necessary to load the firmware during resume, it should be sufficient to >>>> load the firmware when the tuner is used. I'm not sure though whether the >>>> dvb-frontend driver structure allows to properly distinguish between >>>> resuming and initialization before device usage. The problem can definitely >>>> be worked around here until the root cause is fixed. I can provide a patch >>>> if this solution is seen as appropriate. >>>> >>>> I'm putting all the maintainers and/or lists of the possible culprits on CC. >>> >>> Indeed send the patch to use firmware_request_cache() for si2168. >> >> I sent it to the media mailing list (split into two patches). > > Looks good to me, except now that you metion the resuem thing, I think > you can drop those patches if you do the check I mentioned on resume, > ie, to verify that init hasn't been called yet. As I have written above, I likely would have tried that, had I found a way to distinguish between calls to init during resume and other calls to that function. Do you know of any safe way to detect that a resume operation is going on? > Also, might as well rename that function to make it clearer what it does > as init prefixed or postfixed calls tend to be used for things during > initialization. As a pointer to this function goes into the "init" member of the dvb_frontend_ops struct, the name seems to be appropriate, at least at the moment. >>> The generic filesystem races on suspend/resume are known, and I will >>> tackle that once I am done with some other task I am completing. >> >> Great to know that the underlying cause for the freeze is known and a >> solution is being worked on (or at least planned to be worked on). > > There is a sliver possibility here that a negative dentry slowdown may > cause a btrfs freeze on suspend easily. If so that can likely be fixed > before that big general filesystem freeze issue I noted with URLs and > talks. But I think we need to confirm that is what is happening here. This sounds even better, although I don't really understand your theory. Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-14 11:38 ` Lukas Middendorf @ 2020-08-14 16:37 ` Luis Chamberlain 2020-08-14 21:59 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2020-08-14 16:37 UTC (permalink / raw) To: Lukas Middendorf Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote: > Hey Louis, > > On 14/08/2020 00:13, Luis Chamberlain wrote: > > On Thu, Aug 13, 2020 at 11:53:36PM +0200, Lukas Middendorf wrote: > > > With "not used" I mean that the device has been recognized, the si2168 > > > module is loaded and si2168_probe() has been called on the device, but I > > > have not started a media player which actually plays a DVB-C stream. > > > Therefore si2168_init() has never been called and the firmware has never > > > been requested or loaded to the device. > > > > > > On resume si2168_init() will be called (although I don't think this actually > > > is really necessary) > > > > Indeed, that seems odd given its not on probe. So yet another possible > > si2168 bug. Or another way to put it: your cache calls are not needed > > if you remove that si2168_init() if init was not called yet. So simply > > extending the data structure for the driver and seting a bool flag to > > true if init was called should do the trick. > > > > Then the two cache calls would not be needed. > > I already thought about a way to remove the firmware load on suspend if it > has not happened earlier. But I saw no way to distinguish between calls to > si2168_init() during resume and calls to si2168_init() at the start of > device usage. > > In contrast to struct dvb_tuner_ops the relevant struct dvb_frontend_ops has > no separate members "suspend" and "resume" to store callbacks to be called > instead of sleep and init during suspend and resume. Changing the behavior > of si2168 would likely include some bigger architectural work on the > dvb_frontend system to extend dvb_frontend_ops. > It might be possible to just put nothing into dvb_frontend_ops.init but > instead populate the dvb_tuner_ops part of the struct, but those pointers > might already be populated externally for some other use. > > If I try to do this on my very first kernel contribution without in-depth > knowledge of how the dvb drivers really work, this will likely just explode, > if not for me then for somebody else with different hardware. Indeed, the dvb init is called from dvb, so your two patches are good for now I think. > > > If the kernel does not already know that the files > > > are not present without access to the file system, the system just freezes. > > > > It is not clear to me what this means. Can you clarify? > > > > > > > e) having the firmware files not installed freezes the system during > > > > > resume if the content of /usr/lib/firmware has not been listed before > > > > > suspend (e.g. installing the nvidia driver, so the nouveau driver does not > > > > > access this directory) > > > > > > > > That may be the same issue as in b) assuming that you meant you didn't > > > > use the dvb device, and that the firmware load issue is from nouveau. > > > > > > This is actually just the inverted case of b). > > > The only real relevance of the nouveau driver here is that its (perfectly > > > working) firmware caching on suspend actually seems to be equivalent to > > > manually running "ls" on the firmware directory and afterwards the kernel > > > also knows whether or not the si2168 firmware files are present without file > > > system access. > > > > OK.. I still don't get it, so let me see if we can decipher what you > > mean here. > > > > If the firmware is *not* present for the si2168 driver and the device > > has *not* been used yet you get a system freeze which you cannot recover > > from, but only if you are *not* using a driver which also caches its > > firmware already? > > Yes, this is exactly what I wanted to say. > OK great.. but.. > A new installation of Fedora 32 without firmware files Fedora 32 comes with no firmware at all? Are you sure? How about your wifi? > and with nouveau did > not show my freeze problem. Installing either the firmware files or the > nvidia driver started the freeze during resume. Here now you say that if you install either the firmware files for either si2168 or nouveau can cause a freeze, meanwhile what I wrote above and you agreed is what you meant, says that the freeze happens only if you *don't* have the firmware for nouveau present *and* you also don't have the any other firmware present. You also clarify here your freeze happens on resume only. Is that right? Never on suspend, but if the freeze happens, it happens only on resume? The actual case where you reach a freeze is still not clear yet. Let's try to clarify this. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-14 16:37 ` Luis Chamberlain @ 2020-08-14 21:59 ` Lukas Middendorf 2020-08-17 15:20 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2020-08-14 21:59 UTC (permalink / raw) To: Luis Chamberlain Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On 14/08/2020 18:37, Luis Chamberlain wrote: > On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote: >>> If the firmware is *not* present for the si2168 driver and the device >>> has *not* been used yet you get a system freeze which you cannot recover >>> from, but only if you are *not* using a driver which also caches its >>> firmware already? >> >> Yes, this is exactly what I wanted to say. >> > > OK great.. but.. > >> A new installation of Fedora 32 without firmware files > > Fedora 32 comes with no firmware at all? Are you sure? How about your > wifi? Fedora does come with firmware files for many devices (wifi, nouveau, …) but not for dvb devices. Firmware for the si2168 has to be installed as an extra package dvb-firmware from rpmfusion. When I talk about "no firmware files" or "install the firmware files" I mean the si2168 (and other dvb hardware) firmware files only. The nouveau firmware files are always present. >> and with nouveau did >> not show my freeze problem. Installing either the firmware files or the >> nvidia driver started the freeze during resume. > > Here now you say that if you install either the firmware files for > either si2168 or nouveau can cause a freeze, no, I'm talking just about the firmware files for si2168 > meanwhile what I wrote > above and you agreed is what you meant, says that the freeze happens > only if you *don't* have the firmware for nouveau present *and* you > also don't have the any other firmware present. This statement for me seems different compared to the statement in your last mail which I agreed to. In the case that I *don't* have the firmware files for si2168 (!) present it happens only if no other firmware is cached on suspend (in my case by the nouveau driver). No statement made about cases where the si2168 firmware file is present. > You also clarify here your freeze happens on resume only. Is that right? > Never on suspend, but if the freeze happens, it happens only on resume? Correct. I have not seen a freeze on suspend. It only happens on resume. > The actual case where you reach a freeze is still not clear yet. Let's > try to clarify this. OK, let's try that again. To freeze during resume all of 1-4 has to be true: 1. /usr/lib/firmware is on btrfs 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected 3. have not actively used the tuner 4. any of the following cases: 4a) si2168 firmware not installed + nouveau driver not used + have not run "ls -R /usr/lib/firmware" before suspend 4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*" 4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not nouveau driver Not leading to a freeze is: A: si2168 firmware not installed + nouveau driver used B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before suspend C: used the tuner before suspend (or tried to use, in case that the si2168 firmware is not installed) D: using my patches with firmware_request_cache() E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat /usr/lib/firmware/dvb*" F: si2168 firmware installed + nouveau driver used + "cat /usr/lib/firmware/dvb*" I verified all cases again to make sure I was not remembering anything wrong. The nouveau driver in use seems to be equivalent to running "ls -R /usr/lib/firmware" before suspend. All the cases seem to boil down to: It freezes if the file system has to be accessed to list the content of /usr/lib/firmware or to read the si2168 firmware file Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-14 21:59 ` Lukas Middendorf @ 2020-08-17 15:20 ` Luis Chamberlain 2020-08-17 22:04 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2020-08-17 15:20 UTC (permalink / raw) To: Lukas Middendorf Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Fri, Aug 14, 2020 at 11:59:36PM +0200, Lukas Middendorf wrote: > On 14/08/2020 18:37, Luis Chamberlain wrote: > > On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote: > > > > If the firmware is *not* present for the si2168 driver and the device > > > > has *not* been used yet you get a system freeze which you cannot recover > > > > from, but only if you are *not* using a driver which also caches its > > > > firmware already? > > > > > > Yes, this is exactly what I wanted to say. > > > > > > > OK great.. but.. > > > > > A new installation of Fedora 32 without firmware files > > > > Fedora 32 comes with no firmware at all? Are you sure? How about your > > wifi? > > Fedora does come with firmware files for many devices (wifi, nouveau, …) but > not for dvb devices. Firmware for the si2168 has to be installed as an extra > package dvb-firmware from rpmfusion. When I talk about "no firmware files" > or "install the firmware files" I mean the si2168 (and other dvb hardware) > firmware files only. The nouveau firmware files are always present. OK so how do you know that other firmware is not getting loaded or cached? Other than checking kernel logs you can rm -rf /lib/firmware/ and then only place the files you want to test. > > > and with nouveau did > > > not show my freeze problem. Installing either the firmware files or the > > > nvidia driver started the freeze during resume. > > > > Here now you say that if you install either the firmware files for > > either si2168 or nouveau can cause a freeze, > > no, I'm talking just about the firmware files for si2168 OK so if you install the firmware files of si2168 you do run into a freeze, and this freeze happens on resume from suspend? Without the si2168 firmware the freeze does not happen. Is that right? > In the case that I *don't* have the firmware files for si2168 (!) present it > happens only if no other firmware is cached on suspend (in my case by the > nouveau driver). No statement made about cases where the si2168 firmware > file is present. OK this statement is clear and is very different from the one I made above. But note, that this is true, how are you *sure* that no other firmware other than nouveau is being used? What about wifi? or bluetooth? > > You also clarify here your freeze happens on resume only. Is that right? > > Never on suspend, but if the freeze happens, it happens only on resume? > > Correct. I have not seen a freeze on suspend. It only happens on resume. OK this is only if and only if you haven't used the si2168 device, right? And since this is related to si2168 we know that even if you don't use the si2168 device its function which calls to load firmware *does* get called on resume, even though that same function was not called on probe, as the device is not used. If true, then the race to freeze here happens on resume against btrfs. And is the firmware present or not in this case, in which the freeze happens? > > The actual case where you reach a freeze is still not clear yet. Let's > > try to clarify this. > > OK, let's try that again. To freeze during resume all of 1-4 has to be true: > 1. /usr/lib/firmware is on btrfs > 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected > 3. have not actively used the tuner > 4. any of the following cases: > 4a) si2168 firmware not installed + nouveau driver not used + have not run > "ls -R /usr/lib/firmware" before suspend > 4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*" > 4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not > nouveau driver OK perfect, now the next question to clarify is *are you sure* that no other firmware is used, other than si2168 and nouveau? > Not leading to a freeze is: > > A: si2168 firmware not installed + nouveau driver used > B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before > suspend > C: used the tuner before suspend (or tried to use, in case that the si2168 > firmware is not installed) > D: using my patches with firmware_request_cache() > E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat > /usr/lib/firmware/dvb*" > F: si2168 firmware installed + nouveau driver used + "cat > /usr/lib/firmware/dvb*" > > I verified all cases again to make sure I was not remembering anything > wrong. This helps, thanks so much, now we'll have to write a reproducer, thanks for the report!! > The nouveau driver in use seems to be equivalent to running "ls -R > /usr/lib/firmware" before suspend. > > All the cases seem to boil down to: > It freezes if the file system has to be accessed to list the content of > /usr/lib/firmware or to read the si2168 firmware file Let's confirm first whether or not your system is using other firmware files too or not. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-17 15:20 ` Luis Chamberlain @ 2020-08-17 22:04 ` Lukas Middendorf 2020-08-18 14:37 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2020-08-17 22:04 UTC (permalink / raw) To: Luis Chamberlain Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On 17/08/2020 17:20, Luis Chamberlain wrote: > On Fri, Aug 14, 2020 at 11:59:36PM +0200, Lukas Middendorf wrote: >> On 14/08/2020 18:37, Luis Chamberlain wrote: >>> On Fri, Aug 14, 2020 at 01:38:40PM +0200, Lukas Middendorf wrote: >>>>> If the firmware is *not* present for the si2168 driver and the device >>>>> has *not* been used yet you get a system freeze which you cannot recover >>>>> from, but only if you are *not* using a driver which also caches its >>>>> firmware already? >>>> >>>> Yes, this is exactly what I wanted to say. >>>> >>> >>> OK great.. but.. >>> >>>> A new installation of Fedora 32 without firmware files >>> >>> Fedora 32 comes with no firmware at all? Are you sure? How about your >>> wifi? >> >> Fedora does come with firmware files for many devices (wifi, nouveau, …) but >> not for dvb devices. Firmware for the si2168 has to be installed as an extra >> package dvb-firmware from rpmfusion. When I talk about "no firmware files" >> or "install the firmware files" I mean the si2168 (and other dvb hardware) >> firmware files only. The nouveau firmware files are always present. > > OK so how do you know that other firmware is not getting loaded or cached? dmesg with firmware_class debug output enabled > > Other than checking kernel logs you can rm -rf /lib/firmware/ and then > only place the files you want to test. > >>>> and with nouveau did >>>> not show my freeze problem. Installing either the firmware files or the >>>> nvidia driver started the freeze during resume. >>> >>> Here now you say that if you install either the firmware files for >>> either si2168 or nouveau can cause a freeze, >> >> no, I'm talking just about the firmware files for si2168 > > OK so if you install the firmware files of si2168 you do run into a > freeze, and this freeze happens on resume from suspend? > > Without the si2168 firmware the freeze does not happen. > > Is that right? > >> In the case that I *don't* have the firmware files for si2168 (!) present it >> happens only if no other firmware is cached on suspend (in my case by the >> nouveau driver). No statement made about cases where the si2168 firmware >> file is present. > > OK this statement is clear and is very different from the one I made > above. > > But note, that this is true, how are you *sure* that no other firmware > other than nouveau is being used? What about wifi? or bluetooth? I don't have wifi or bluetooth enabled in the BIOS. If I don't use the nouveau driver, "dmesg | grep firmware" is completely empty after boot except for the kernel command line part "ddebug_query=module firmware_class +pmf". After suspend + resume I just get two lines for the calls to cache and uncache firmwares. If any other firmware is used, it is apparently not using the standard firmware loader. If I rename the directory /usr/lib/firmware everything seems to work as before (nouveau and si2168 not loaded) and I don't see any complaints about missing firmware in dmesg. If I enable my secondary rtl8125 Ethernet controller in the BIOS (I only tested this now, I always had it disabled previously), I can see its firmware being loaded on boot and cached on suspend. This then has an effect on my freeze problem identical to using the nouveau driver. >>> You also clarify here your freeze happens on resume only. Is that right? >>> Never on suspend, but if the freeze happens, it happens only on resume? >> >> Correct. I have not seen a freeze on suspend. It only happens on resume. > > OK this is only if and only if you haven't used the si2168 device, > right? And since this is related to si2168 we know that even if you > don't use the si2168 device its function which calls to load firmware > *does* get called on resume, even though that same function was not > called on probe, as the device is not used. > > If true, then the race to freeze here happens on resume against btrfs. > > And is the firmware present or not in this case, in which the freeze > happens? A freeze can happen on resume with and without the si2168 firmware files installed. It however is easier to hit the freeze with the firmware files installed. Without the firmware files present the freeze happens only if no other driver uses the firmware loader. >>> The actual case where you reach a freeze is still not clear yet. Let's >>> try to clarify this. >> >> OK, let's try that again. To freeze during resume all of 1-4 has to be true: >> 1. /usr/lib/firmware is on btrfs >> 2. my Hauppauge WinTV-dualHD USB DVB tuner (contains si2168) is connected >> 3. have not actively used the tuner >> 4. any of the following cases: >> 4a) si2168 firmware not installed + nouveau driver not used + have not run >> "ls -R /usr/lib/firmware" before suspend >> 4b) firmware file installed + not run "cat /usr/lib/firmware/dvb*" >> 4c) firmware file installed + not run "ls -R /usr/lib/firmware" + not >> nouveau driver > > OK perfect, now the next question to clarify is *are you sure* that no > other firmware is used, other than si2168 and nouveau? I'm totally sure that no other firmware is used through the standard firmware loader. I'm also almost sure that no firmware from /usr/lib/firmware is used though any other means. >> Not leading to a freeze is: >> >> A: si2168 firmware not installed + nouveau driver used >> B: si2168 firmware not installed + run "ls -R /usr/lib/firmware" before >> suspend >> C: used the tuner before suspend (or tried to use, in case that the si2168 >> firmware is not installed) >> D: using my patches with firmware_request_cache() >> E: si2168 firmware installed + "ls -R /usr/lib/firmware" + "cat >> /usr/lib/firmware/dvb*" >> F: si2168 firmware installed + nouveau driver used + "cat >> /usr/lib/firmware/dvb*" >> >> I verified all cases again to make sure I was not remembering anything >> wrong. > > This helps, thanks so much, now we'll have to write a reproducer, thanks > for the report!! Will you do it yourself or do you expect me to do anything for this? >> The nouveau driver in use seems to be equivalent to running "ls -R >> /usr/lib/firmware" before suspend. >> >> All the cases seem to boil down to: >> It freezes if the file system has to be accessed to list the content of >> /usr/lib/firmware or to read the si2168 firmware file > > Let's confirm first whether or not your system is using other firmware > files too or not. I confirmed that above. Why is this so important, anyway? Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-17 22:04 ` Lukas Middendorf @ 2020-08-18 14:37 ` Luis Chamberlain 2021-04-01 14:59 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2020-08-18 14:37 UTC (permalink / raw) To: Lukas Middendorf Cc: Anand Jain, linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote: > On 17/08/2020 17:20, Luis Chamberlain wrote: > A freeze can happen on resume with and without the si2168 firmware files > installed. It however is easier to hit the freeze with the firmware files > installed. Without the firmware files present the freeze happens only if no > other driver uses the firmware loader. > > > > > This helps, thanks so much, now we'll have to write a reproducer, thanks > > for the report!! > > Will you do it yourself or do you expect me to do anything for this? I meant to imply that we'd do this, now that we understand the problem. Thanks for your report! > > > The nouveau driver in use seems to be equivalent to running "ls -R > > > /usr/lib/firmware" before suspend. > > > > > > All the cases seem to boil down to: > > > It freezes if the file system has to be accessed to list the content of > > > /usr/lib/firmware or to read the si2168 firmware file > > > > Let's confirm first whether or not your system is using other firmware > > files too or not. > > I confirmed that above. Why is this so important, anyway? A reproducer is easier to write if the actual file neeed is not needed. That's all. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2020-08-18 14:37 ` Luis Chamberlain @ 2021-04-01 14:59 ` Lukas Middendorf 2021-04-02 18:02 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Lukas Middendorf @ 2021-04-01 14:59 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media, Lukas Middendorf Hello Luis, On 18/08/2020 16:37, Luis Chamberlain wrote: > On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote: >> On 17/08/2020 17:20, Luis Chamberlain wrote: >>> This helps, thanks so much, now we'll have to write a reproducer, thanks >>> for the report!! >> >> Will you do it yourself or do you expect me to do anything for this? > > I meant to imply that we'd do this, now that we understand the problem. Thanks > for your report! any news on this issue? Did you succeed with reproducing this at your end? I retested this with 5.12-rc5 and everything still seems unchanged. I still get freezes on resume under the same circumstances described previously if /usr/lib/firmware is on btrfs. My patches to si2168.c were not merged. The documentation of request_firmware() seems unchanged and does not reflect the fact that prerequisites exist for making it safe to call during resume callbacks. Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-01 14:59 ` Lukas Middendorf @ 2021-04-02 18:02 ` Luis Chamberlain 2021-04-02 22:19 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2021-04-02 18:02 UTC (permalink / raw) To: Lukas Middendorf Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Thu, Apr 01, 2021 at 04:59:47PM +0200, Lukas Middendorf wrote: > Hello Luis, > > > On 18/08/2020 16:37, Luis Chamberlain wrote: > > On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote: > > > On 17/08/2020 17:20, Luis Chamberlain wrote: > > > > This helps, thanks so much, now we'll have to write a reproducer, thanks > > > > for the report!! > > > > > > Will you do it yourself or do you expect me to do anything for this? > > > > I meant to imply that we'd do this, now that we understand the problem. Thanks > > for your report! > > any news on this issue? Did you succeed with reproducing this at your end? No sorry, I dropped the ball on this but I managed to now spawn up the virtual guests where I was doing development to reproduce this. Give me some time and I will zero in on this now. For now what I have is the following to test this, I next will work on the userspace part. diff --git a/lib/test_firmware.c b/lib/test_firmware.c index 9fee2b93a8d1..f9e67fc4145a 100644 --- a/lib/test_firmware.c +++ b/lib/test_firmware.c @@ -18,6 +18,7 @@ #include <linux/device.h> #include <linux/fs.h> #include <linux/miscdevice.h> +#include <linux/platform_device.h> #include <linux/sizes.h> #include <linux/slab.h> #include <linux/uaccess.h> @@ -33,6 +34,8 @@ static DEFINE_MUTEX(test_fw_mutex); static const struct firmware *test_firmware; +static struct platform_device *pdev; + struct test_batched_req { u8 idx; int rc; @@ -53,6 +56,9 @@ struct test_batched_req { * @sync_direct: when the sync trigger is used if this is true * request_firmware_direct() will be used instead. * @send_uevent: whether or not to send a uevent for async requests + * @enable_resume_test: if @senable_resume is true this will enable a test to + * issue a request_firmware() upon resume. This is useful to test resume + * after suspend filesystem races. * @num_requests: number of requests to try per test case. This is trigger * specific. * @reqs: stores all requests information @@ -91,6 +97,7 @@ struct test_config { bool into_buf; bool sync_direct; bool send_uevent; + bool enable_resume_test; u8 num_requests; u8 read_fw_idx; @@ -184,6 +191,7 @@ static int __test_firmware_config_init(void) test_fw_config->send_uevent = true; test_fw_config->into_buf = false; test_fw_config->sync_direct = false; + test_fw_config->enable_resume_test = false; test_fw_config->req_firmware = request_firmware; test_fw_config->test_result = 0; test_fw_config->reqs = NULL; @@ -257,6 +265,9 @@ static ssize_t config_show(struct device *dev, len += scnprintf(buf+len, PAGE_SIZE - len, "sync_direct:\t\t%s\n", test_fw_config->sync_direct ? "true" : "false"); + len += scnprintf(buf+len, PAGE_SIZE - len, + "enable_resume_test:\t\t%s\n", + test_fw_config->enable_resume_test ? "true" : "false"); len += scnprintf(buf+len, PAGE_SIZE - len, "read_fw_idx:\t%u\n", test_fw_config->read_fw_idx); @@ -422,6 +433,22 @@ static ssize_t config_sync_direct_show(struct device *dev, } static DEVICE_ATTR_RW(config_sync_direct); +static ssize_t config_enable_resume_test_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + return test_dev_config_update_bool(buf, count, + &test_fw_config->enable_resume_test); +} + +static ssize_t config_enable_resume_test_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + return test_dev_config_show_bool(buf, test_fw_config->enable_resume_test); +} +static DEVICE_ATTR_RW(config_enable_resume_test); + static ssize_t config_send_uevent_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) @@ -929,6 +956,7 @@ static struct attribute *test_dev_attrs[] = { TEST_FW_DEV_ATTR(config_into_buf), TEST_FW_DEV_ATTR(config_sync_direct), TEST_FW_DEV_ATTR(config_send_uevent), + TEST_FW_DEV_ATTR(config_enable_resume_test), TEST_FW_DEV_ATTR(config_read_fw_idx), /* These don't use the config at all - they could be ported! */ @@ -958,6 +986,81 @@ static struct miscdevice test_fw_misc_device = { .groups = test_dev_groups, }; +static int __maybe_unused test_firmware_suspend(struct device *dev) +{ + return 0; +} + + +static int __maybe_unused test_firmware_resume(struct device *dev) +{ + int rc; + + if (!test_fw_config->enable_resume_test) + return 0; + + pr_info("resume test, loading '%s'\n", test_fw_config->name); + + mutex_lock(&test_fw_mutex); + release_firmware(test_firmware); + test_firmware = NULL; + rc = request_firmware(&test_firmware, test_fw_config->name, dev); + if (rc) { + mutex_unlock(&test_fw_mutex); + pr_info("load of '%s' failed: %d\n", test_fw_config->name, rc); + goto out; + } + + pr_info("loaded: %zu\n", test_firmware->size); + mutex_unlock(&test_fw_mutex); + pr_info("resume test, completed successfully\n"); +out: + return rc; +} + +static SIMPLE_DEV_PM_OPS(test_dev_pm_ops, test_firmware_suspend, test_firmware_resume); + +static int test_firmware_probe(struct platform_device *dev) +{ + int rc; + + rc = misc_register(&test_fw_misc_device); + if (rc) { + kfree(test_fw_config); + pr_err("could not register misc device: %d\n", rc); + return rc; + } + + pr_info("interface ready\n"); + + return 0; +} + +static int test_firmware_remove(struct platform_device *dev) +{ + mutex_lock(&test_fw_mutex); + release_firmware(test_firmware); + misc_deregister(&test_fw_misc_device); + mutex_unlock(&test_fw_mutex); + + return 0; +} + +static void test_firmware_shutdown(struct platform_device *dev) +{ +} + +static struct platform_driver test_firmware_driver = { + .driver = { + .name = "test_firmware", + .pm = &test_dev_pm_ops, + }, + .probe = test_firmware_probe, + .remove = test_firmware_remove, + .shutdown = test_firmware_shutdown, +}; + + static int __init test_firmware_init(void) { int rc; @@ -973,28 +1076,39 @@ static int __init test_firmware_init(void) return rc; } - rc = misc_register(&test_fw_misc_device); - if (rc) { - kfree(test_fw_config); - pr_err("could not register misc device: %d\n", rc); - return rc; - } + rc = platform_driver_register(&test_firmware_driver); + if (rc) + goto err_alloc; - pr_warn("interface ready\n"); + pdev = platform_device_alloc("test_firmware", -1); + if (!pdev) + goto err_driver_unregister; + + rc = platform_device_add(pdev); + if (rc) + goto err_free_device; return 0; + + err_free_device: + platform_device_put(pdev); + err_driver_unregister: + platform_driver_unregister(&test_firmware_driver); + err_alloc: + __test_firmware_config_free(); + kfree(test_fw_config); + return rc; } module_init(test_firmware_init); static void __exit test_firmware_exit(void) { - mutex_lock(&test_fw_mutex); - release_firmware(test_firmware); - misc_deregister(&test_fw_misc_device); + platform_device_unregister(pdev); + platform_driver_unregister(&test_firmware_driver); + __test_firmware_config_free(); kfree(test_fw_config); - mutex_unlock(&test_fw_mutex); pr_warn("removed interface\n"); } Luis ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-02 18:02 ` Luis Chamberlain @ 2021-04-02 22:19 ` Luis Chamberlain 2021-04-02 22:58 ` Luis Chamberlain 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2021-04-02 22:19 UTC (permalink / raw) To: Lukas Middendorf Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media Lukas, can you share your /etc/fstab ? Also, how long do you stay in the boot before you try to suspend? Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-02 22:19 ` Luis Chamberlain @ 2021-04-02 22:58 ` Luis Chamberlain 2021-04-03 10:24 ` Lukas Middendorf 0 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2021-04-02 22:58 UTC (permalink / raw) To: Lukas Middendorf Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > Lukas, can you share your /etc/fstab ? Also, how long do you stay in > the boot before you try to suspend? OK I cannot reproduce the issue with the modified patch I sent to test_firmware, which if you enable config_enable_resume_test will trigger a request_firmware() on resume, thus trying to mimic the race you note. To test this you can simply use a loopback filesystem for your /lib/firmware and create a btrfs filesystem for it, and then run: echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test systemctl suspend Then resume. You should see "resume test" print on dmesg. I keep my /lib/firmware/ empty and still, nothing. Can you provide kernel logs for where you are seeing things get stuck at? Note that I had mentioned the races on suspend/resume do exist for any journaling filesystem, but this typically happens if you are doing ongoing writes. I suppose you are *not* doing writes and your filesystem is idle. As such without kernel logs I cannot be sure what the issue is, but at this point after the initial testing I've done I don't suspect this is a firmware API issue. You might be better off just reposting your patches with the respective Reviewed-by tags and pestering your maintainer. PS. If you want to test this on a guest you bring the guest back up with virsh dompmwakeup. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-02 22:58 ` Luis Chamberlain @ 2021-04-03 10:24 ` Lukas Middendorf 2021-04-03 16:07 ` Lukas Middendorf 2021-04-03 20:25 ` Luis Chamberlain 0 siblings, 2 replies; 22+ messages in thread From: Lukas Middendorf @ 2021-04-03 10:24 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media, Lukas Middendorf On 02/04/2021 20:02, Luis Chamberlain wrote: > No sorry, I dropped the ball on this but I managed to now spawn up the > virtual guests where I was doing development to reproduce this. Give > me some time and I will zero in on this now. > > For now what I have is the following to test this, I next will work > on the userspace part. I can report that your patch for test_firmware works (applied to current master from linus d93a0d4; I get some offsets and had to adjust some whitespace for it to apply). With that module (and resume test enabled) I get freezes at resume in the same cases that would also cause problems with si2168. I'm testing this on bare metal F34 beta with / on a btrfs. I'm using nvidia driver again to make sure the system does not otherwise use any firmware from /usr/lib/firmware (confirmed with kernel debug messages for firmware_loader). My si2168 is not plugged in. I tested it with a normal population of files in /usr/lib/firmware , without test-firmware.bin and also with a random 1MiB file in place. I tested after a reboot so it does not do caching. With /usr/lib/firmware on a separate ext4 partition I can confirm with dmesg that the test_firmware suspend test actually works (does not freeze). With /usr/lib/firmware on btrfs it fails in both cases (with and without the firmware files). With caching (first suspend with the ext4 partition mounted, then a second suspend without) it does not freeze even with the firmware on btrfs. One further thing I noticed which might be problematic in rare cases: According to the kernel debug messages, the firmware-loader does not attempt to cache the firmware during suspend, if the previous call to request_firmware() has failed (file not present; call made during previous resume). In my opinion it should attempt to cache the firmware on suspend even in this case (If I remember correctly, firmware_request_cache also works without the file being present). In case some low-memory condition has caused the file system cache to lose the information about the file being non-present (or the file has been written after the initial attempt and is no longer in the file system cache), this might lead to freezes even for well-behaved drivers in case they reattempt to do request_firmware() on resume. If the firmware is found during resume, it is cached on further suspends. Given how long it took me to narrow down this problem in this (for me) reliably reproducible case, something like this happening at random would be almost impossible to debug/locate and might actually happen frequently in the wild. On 03/04/2021 00:19, Luis Chamberlain wrote: > Lukas, can you share your /etc/fstab ? This is the core part (everything else is unmounted), I shortened the UUIDs. The ext4 mount of course is also unmounted when I want it to fail, with it in place it reliably not-fails. UUID=<1> / btrfs subvol=linux1 0 0 UUID=<2> /usr/lib/firmware ext4 defaults 1 1 UUID=<1> /home/lukas btrfs subvol=linux1-f34-lukas 0 0 tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0 > Also, how long do you stay in the boot before you try to suspend? During my reproduction sessions usually only shortly 1-5min, but I think I have seen this also after a slightly longer time. I can try to let it sit for longer if you think that is important. On 03/04/2021 00:58, Luis Chamberlain wrote: > On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote: >> >> Lukas, can you share your /etc/fstab ? Also, how long do you stay in >> the boot before you try to suspend? > > OK I cannot reproduce the issue with the modified patch I sent to > test_firmware, which if you enable config_enable_resume_test will > trigger a request_firmware() on resume, thus trying to mimic the race > you note. To test this you can simply use a loopback filesystem for > your /lib/firmware and create a btrfs filesystem for it, and then run: > > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test > > systemctl suspend > > Then resume. You should see "resume test" print on dmesg. I keep my > /lib/firmware/ empty and still, nothing. Did you also try to create a random test-firmware.bin (I used 1M from /dev/urandom) instead of an empty /lib/firmware ? If the directory is completely empty, it also does not freeze for me. If the directory is empty, any attempt to access its content can likely be directly served from cache, even if the actual directory has never been accessed before, as long as /lib (which is a symlink to /usr/lib on fedora) has been accessed (which will likely always be true). So I have to further add to my previous findings that "firmware directory is not completely empty" is a further prerequisite for it to fail. > Can you provide kernel logs for where you are seeing things get stuck at? The log does not have any entries from resume. For the attempts where it freezes the last entry in journalctl is systemd-sleep[5050]: Suspending system... > Note that I had mentioned the races on suspend/resume do exist for any journaling > filesystem, but this typically happens if you are doing ongoing > writes. I suppose you are *not* doing writes and your filesystem is > idle. I can of course not completely rule out some random write (log files or similar), but there is definitely no heavy writing going on. I think only writes caused by the act of suspending and resuming could cause it this reliably. I have seen it also with a completely isolated btrfs file system for /usr/lib/firmware, where there should not have been any writes. For ext4 (which is also journaling) it works properly. > As such without kernel logs I cannot be sure what the issue is, but at > this point after the initial testing I've done I don't suspect this is > a firmware API issue. You might be better off just reposting your > patches with the respective Reviewed-by tags and pestering your > maintainer. I will try to be a little bit more insistent this time. Is "just repost" the usual way to handle if patches are ignored? Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 10:24 ` Lukas Middendorf @ 2021-04-03 16:07 ` Lukas Middendorf 2021-04-03 20:25 ` Luis Chamberlain 1 sibling, 0 replies; 22+ messages in thread From: Lukas Middendorf @ 2021-04-03 16:07 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media, Lukas Middendorf Hi Luis, I now succeeded in reproducing this in VirtualBox running a F34 minimal installation with / on btrfs (with default firmware files). I can send you the virtual disk file if you want it. With the kernel sourcecode and the compiled and installed kernel it currently is 21.8GiB. On 03/04/2021 12:24, Lukas Middendorf wrote: > On 03/04/2021 00:58, Luis Chamberlain wrote: >> Can you provide kernel logs for where you are seeing things get stuck at? I have dumped the output of the serial console with all outputs cranked to the max. In [1] is the output with test_firmware loaded and suspend test running. This leads to a freeze. However it is not completely dead. While it does not visually react, at least from time to time there still are some messages from the kernel and the cursor is still blinking. In [2] is a (successful) suspend and resume cycle with test_firmware not loaded. Lukas [1] https://gist.github.com/midluk/05716f714f778d26dc02771245df0ac9 [2] https://gist.github.com/midluk/a03eb953b6cf097688b8be2e0cd387fd ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 10:24 ` Lukas Middendorf 2021-04-03 16:07 ` Lukas Middendorf @ 2021-04-03 20:25 ` Luis Chamberlain 2021-04-03 21:04 ` Luis Chamberlain ` (2 more replies) 1 sibling, 3 replies; 22+ messages in thread From: Luis Chamberlain @ 2021-04-03 20:25 UTC (permalink / raw) To: Lukas Middendorf, Greg KH, dsterba Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote: > > On 02/04/2021 20:02, Luis Chamberlain wrote: > > No sorry, I dropped the ball on this but I managed to now spawn up the > > virtual guests where I was doing development to reproduce this. Give > > me some time and I will zero in on this now. > > > > For now what I have is the following to test this, I next will work > > on the userspace part. > > I can report that your patch for test_firmware works (applied to current > master from linus d93a0d4; I get some offsets and had to adjust some > whitespace for it to apply). With that module (and resume test enabled) I > get freezes at resume in the same cases that would also cause problems with > si2168. Beautiful! > I'm testing this on bare metal F34 beta with / on a btrfs. I'm using nvidia > driver again to make sure the system does not otherwise use any firmware > from /usr/lib/firmware (confirmed with kernel debug messages for > firmware_loader). My si2168 is not plugged in. OK I was not able to reproduce but the details below also clarify why, so I now was able to reproduce! So I don't need to specifically test with this distro, but others may do so. > I tested it with a normal population of files in /usr/lib/firmware , without > test-firmware.bin and also with a random 1MiB file in place. I tested after > a reboot so it does not do caching. > With /usr/lib/firmware on a separate ext4 partition I can confirm with dmesg > that the test_firmware suspend test actually works (does not freeze). > With /usr/lib/firmware on btrfs it fails in both cases (with and without the > firmware files). > With caching (first suspend with the ext4 partition mounted, then a second > suspend without) it does not freeze even with the firmware on btrfs. OK thanks for confirming! > One further thing I noticed which might be problematic in rare cases: > According to the kernel debug messages, the firmware-loader does not attempt > to cache the firmware during suspend, Correct, the goal with this test was to purposely *not* call for the firmware prior to suspend, and instead *race* (do the wrong thing) at resume. It shouldn't really stall... fail yes, but stall, that seems fishy. > if the previous call to > request_firmware() has failed (file not present; call made during previous > resume). In my opinion it should attempt to cache the firmware on suspend > even in this case Yes, that is the *proper* way to do use the firmware API, but we want to reproduce the stall, and so we have to recreate the issue you reported by doing something bad. > (If I remember correctly, firmware_request_cache also > works without the file being present). In case some low-memory condition has > caused the file system cache to lose the information about the file being > non-present (or the file has been written after the initial attempt and is > no longer in the file system cache), this might lead to freezes even for > well-behaved drivers in case they reattempt to do request_firmware() on > resume. > If the firmware is found during resume, it is cached on further suspends. Right. > Given how long it took me to narrow down this problem in this (for me) > reliably reproducible case, something like this happening at random would be > almost impossible to debug/locate and might actually happen frequently in > the wild. OK thanks for the report! > On 03/04/2021 00:19, Luis Chamberlain wrote: > > Lukas, can you share your /etc/fstab ? > > This is the core part (everything else is unmounted), I shortened the UUIDs. > The ext4 mount of course is also unmounted when I want it to fail, with it > in place it reliably not-fails. > > UUID=<1> / btrfs subvol=linux1 0 0 > UUID=<2> /usr/lib/firmware ext4 defaults 1 1 > UUID=<1> /home/lukas btrfs subvol=linux1-f34-lukas 0 0 > tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0 > > > Also, how long do you stay in the boot before you try to suspend? > > During my reproduction sessions usually only shortly 1-5min, but I think I > have seen this also after a slightly longer time. I can try to let it sit > for longer if you think that is important. No, I was looking to see how easy it is to reproduce right after a fresh boot, now that I can reproduce I confirm it happens right away. > On 03/04/2021 00:58, Luis Chamberlain wrote: > > On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > > > Lukas, can you share your /etc/fstab ? Also, how long do you stay in > > > the boot before you try to suspend? > > > > OK I cannot reproduce the issue with the modified patch I sent to > > test_firmware, which if you enable config_enable_resume_test will > > trigger a request_firmware() on resume, thus trying to mimic the race > > you note. To test this you can simply use a loopback filesystem for > > your /lib/firmware and create a btrfs filesystem for it, and then run: > > > > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test > > > > systemctl suspend > > > > Then resume. You should see "resume test" print on dmesg. I keep my > > /lib/firmware/ empty and still, nothing. > > Did you also try to create a random test-firmware.bin (I used 1M from > /dev/urandom) instead of an empty /lib/firmware ? No, right now I want to to just focus on fixing the stall you saw. > If the directory is completely empty, it also does not freeze for me. This is a very important piece of information, thanks! > If the > directory is empty, any attempt to access its content can likely be directly > served from cache, even if the actual directory has never been accessed > before, as long as /lib (which is a symlink to /usr/lib on fedora) has been > accessed (which will likely always be true). So I have to further add to my > previous findings that "firmware directory is not completely empty" is a > further prerequisite for it to fail. Indeed, that helps! So creating say 1000 random files in /lib/firmware on a freshly created btrfs partition helps reproduce: mkfs.btrfs /dev/whatever mount /dev/wahtever /lib/firmware # Put it on /etc/fstab too Generate 1000 random files on it: ``` for n in {1..1000}; do dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 )) done ``` Then reboot, upon reboot do: modprobe test_firmware echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test systemctl suspend If its a guest wake it up: virsh dompmwakeup domidofguest > > Can you provide kernel logs for where you are seeing things get stuck > > at? > > The log does not have any entries from resume. For the attempts where it > freezes the last entry in journalctl is > systemd-sleep[5050]: Suspending system... I confirm. > > Note that I had mentioned the races on suspend/resume do exist for any journaling > > filesystem, but this typically happens if you are doing ongoing > > writes. I suppose you are *not* doing writes and your filesystem is > > idle. > > I can of course not completely rule out some random write (log files or > similar), but there is definitely no heavy writing going on. I think only > writes caused by the act of suspending and resuming could cause it this > reliably. I have seen it also with a completely isolated btrfs file system > for /usr/lib/firmware, where there should not have been any writes. For ext4 > (which is also journaling) it works properly. Nope, no writes are going on. > > As such without kernel logs I cannot be sure what the issue is, but at > > this point after the initial testing I've done I don't suspect this is > > a firmware API issue. You might be better off just reposting your > > patches with the respective Reviewed-by tags and pestering your > > maintainer. > > I will try to be a little bit more insistent this time. Is "just repost" the > usual way to handle if patches are ignored? You can repost, v2 just add my reviewed-by. Those patches are indeed correct as they are calling for the firmware prior to resume. Maybe clarify that. But indeed there is also another issue which you reported which needs to be fixed, and for that thanks so much for you patience! I'll be looking into this! Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 20:25 ` Luis Chamberlain @ 2021-04-03 21:04 ` Luis Chamberlain 2021-04-05 9:52 ` Lukas Middendorf 2021-04-04 0:50 ` Lukas Middendorf 2021-04-08 18:02 ` Luis Chamberlain 2 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2021-04-03 21:04 UTC (permalink / raw) To: Lukas Middendorf, Greg KH, dsterba Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote: > So creating say 1000 random files in /lib/firmware on a freshly created > btrfs partition helps reproduce: > > mkfs.btrfs /dev/whatever > mount /dev/wahtever /lib/firmware > # Put it on /etc/fstab too > > Generate 1000 random files on it: > > ``` > for n in {1..1000}; do > dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 )) > done > ``` > > Then reboot, upon reboot do: > > modprobe test_firmware > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test > systemctl suspend OK this fixes it but this just shows that likely the thaw'ing allows a race to take place which we didn't expect. I'll do some more digging for a proper fix. diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c index 90d255fbdd9b..b9c37eefab35 100644 --- a/fs/kernel_read_file.c +++ b/fs/kernel_read_file.c @@ -4,6 +4,7 @@ #include <linux/kernel_read_file.h> #include <linux/security.h> #include <linux/vmalloc.h> +#include <linux/umh.h> /** * kernel_read_file() - read file contents into a kernel buffer @@ -156,17 +157,25 @@ int kernel_read_file_from_path_initns(const char *path, loff_t offset, if (!path || !*path) return -EINVAL; + ret = usermodehelper_read_trylock(); + if (WARN_ON(ret)) { + pr_warn_once("Trying to do direct read when not available"); + return ret; + } task_lock(&init_task); get_fs_root(init_task.fs, &root); task_unlock(&init_task); file = file_open_root(root.dentry, root.mnt, path, O_RDONLY, 0); path_put(&root); - if (IS_ERR(file)) + if (IS_ERR(file)) { + usermodehelper_read_unlock(); return PTR_ERR(file); + } ret = kernel_read_file(file, offset, buf, buf_size, file_size, id); fput(file); + usermodehelper_read_unlock(); return ret; } EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 21:04 ` Luis Chamberlain @ 2021-04-05 9:52 ` Lukas Middendorf 0 siblings, 0 replies; 22+ messages in thread From: Lukas Middendorf @ 2021-04-05 9:52 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media, Lukas Middendorf, Greg KH, dsterba On 03/04/2021 23:04, Luis Chamberlain wrote: > OK this fixes it but this just shows that likely the thaw'ing allows > a race to take place which we didn't expect. I'll do some more digging > for a proper fix. I can indeed confirm that this fixes the stall. This however does not seem to be the (complete) solution. Instead I now get a kernel crash message (see below) for every firmware location tried to read during resume. This might be intentional for debugging purposes during testing. This is identical for ext4 and btrfs. If the firmware file can not be found during caching on suspend, the reads are still attempted again during resume. This leads to multiple of those crash messages (for different firmware locations) during resume if the firmware file is not present, even for drivers properly requesting caching. So if this patch is to go in (those crashes would really help with getting the si2168 patches in…), I think you have to make sure that even for non-existent firmware files, no read is ever attempted on resume. Which means to set up the caching even if the initial request_firmware() failed and to store knowledge about failed caching attempts to not retry these reads on resume. Lukas ------------[ cut here ]------------ WARNING: CPU: 0 PID: 662 at fs/kernel_read_file.c:161 kernel_read_file_from_path_initns+0x11c/0x140 Modules linked in: test_firmware nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common kvm_amd vmwgfx ccp kvm ttm drm_kms_helper snd_pcm joydev snd_timer snd e1000 irqbypass cec soundcore vboxguest i2c_piix4 pcspkr drm fuse zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw ata_generic pata_acpi video CPU: 0 PID: 662 Comm: systemd-sleep Not tainted 5.12.0-rc5+ #2 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:kernel_read_file_from_path_initns+0x11c/0x140 Code: ff ff 4c 89 e7 89 44 24 10 e8 50 07 fc ff e8 fb 1c d8 ff 44 8b 44 24 10 48 83 c4 28 44 89 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 80 3d b4 10 7a 01 00 75 e3 e9 5b 5e 83 00 e8 cf 1c d8 ff 45 RSP: 0018:ffffa096c0b9fb90 EFLAGS: 00010286 RAX: 00000000fffffff5 RBX: 0000000000000000 RCX: ffffffffbd85d688 RDX: ffffffffbd85d688 RSI: 0000000000000297 RDI: ffffffffbd85d680 RBP: ffffa096c0b9fbe0 R08: 00000000fffffff5 R09: ffffffffbd85d688 R10: ffffffffffffffff R11: 0000000000000000 R12: ffff8976ca811000 R13: ffffa096c0b9fc00 R14: 000000007fffffff R15: 0000000000000001 FS: 00007f4718de3b40(0000) GS:ffff8979cfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005594afeca4f8 CR3: 0000000106a62000 CR4: 00000000000506f0 Call Trace: ? snprintf+0x39/0x40 fw_get_filesystem_firmware+0xe2/0x270 _request_firmware+0x21e/0x500 request_firmware+0x32/0x50 test_firmware_resume.cold+0x4e/0xb2 [test_firmware] ? platform_pm_suspend+0x40/0x40 dpm_run_callback+0x4c/0x120 device_resume+0xa7/0x200 dpm_resume+0xce/0x2c0 dpm_resume_end+0xd/0x20 suspend_devices_and_enter+0x195/0x750 pm_suspend.cold+0x329/0x374 state_store+0x71/0xd0 kernfs_fop_write_iter+0x11c/0x1b0 new_sync_write+0x108/0x180 vfs_write+0x1b8/0x270 ksys_write+0x4f/0xc0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4719a527a7 Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffee4d18e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f4719a527a7 RDX: 0000000000000004 RSI: 00007ffee4d18f00 RDI: 0000000000000004 RBP: 00007ffee4d18f00 R08: 00005626b555c710 R09: 00007f4719ae84e0 R10: 00007f4719ae83e0 R11: 0000000000000246 R12: 0000000000000004 R13: 00005626b5558650 R14: 0000000000000004 R15: 00007f4719b25700 ---[ end trace 7f7ef0dc067dd714 ]--- Trying to do direct read when not available test_firmware test_firmware: loading /lib/firmware/updates/5.12.0-rc5+/test-firmware.bin failed with error -11 ------------[ cut here ]------------ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 20:25 ` Luis Chamberlain 2021-04-03 21:04 ` Luis Chamberlain @ 2021-04-04 0:50 ` Lukas Middendorf 2021-04-08 18:02 ` Luis Chamberlain 2 siblings, 0 replies; 22+ messages in thread From: Lukas Middendorf @ 2021-04-04 0:50 UTC (permalink / raw) To: Luis Chamberlain Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media, Greg KH, dsterba, Lukas Middendorf Great to hear that you now succeeded in reproducing the problem. On 03/04/2021 22:25, Luis Chamberlain wrote: > On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote: >> One further thing I noticed which might be problematic in rare cases: >> According to the kernel debug messages, the firmware-loader does not attempt >> to cache the firmware during suspend, > > Correct, the goal with this test was to purposely *not* call for the > firmware prior to suspend, and instead *race* (do the wrong thing) at > resume. It shouldn't really stall... fail yes, but stall, that seems > fishy. I understood that. >> if the previous call to >> request_firmware() has failed (file not present; call made during previous >> resume). In my opinion it should attempt to cache the firmware on suspend >> even in this case > > Yes, that is the *proper* way to do use the firmware API, but we want to > reproduce the stall, and so we have to recreate the issue you reported > by doing something bad. Should firmware_request_cache() always be called, or only instead of request_firmware() ? In case request_firmware is called on its own, and the firmware file is not present, it might still race on the next resume. request_firmware seems to be meant to include the cache request for the next suspend, but it apparently *does not* if firmware loading fails due to a missing file. I think this is something that should either be changed or properly documented in the API documentation. >> Did you also try to create a random test-firmware.bin (I used 1M from >> /dev/urandom) instead of an empty /lib/firmware ? > > No, right now I want to to just focus on fixing the stall you saw. This is a second nuance of the stall I saw: The firmware file (in this case test-firmware.bin) is present but not in cache. If you run ls -lR /lib/firmware > /dev/null before systemctl suspend your reproduction steps will very likely not stall. If you in addition run dd if=/dev/urandom of=/lib/firmware/test-firmware.bin bs=1K count=64 during the preparation of /lib/firmware (in addition to or instead of your for loop), then it will always stall (as long as the file content is not read before suspend). One stall is happening while the directory content is being listed (to find that the file is not present), the second stall is happening while actually trying to read the file. Those two stalls likely have the same root cause, but it might be different. I just want to make sure you have covered all cases. >>> You might be better off just reposting your >>> patches with the respective Reviewed-by tags and pestering your >>> maintainer. >> >> I will try to be a little bit more insistent this time. Is "just repost" the >> usual way to handle if patches are ignored? > > You can repost, v2 just add my reviewed-by. Those patches are indeed > correct as they are calling for the firmware prior to resume. Maybe > clarify that. OK, will do. > But indeed there is also another issue which you reported which needs to > be fixed, and for that thanks so much for you patience! I'll be looking > into this! Also thank you for your effort. Lukas ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-03 20:25 ` Luis Chamberlain 2021-04-03 21:04 ` Luis Chamberlain 2021-04-04 0:50 ` Lukas Middendorf @ 2021-04-08 18:02 ` Luis Chamberlain 2021-04-16 23:17 ` Luis Chamberlain 2 siblings, 1 reply; 22+ messages in thread From: Luis Chamberlain @ 2021-04-08 18:02 UTC (permalink / raw) To: Lukas Middendorf, Greg KH, dsterba Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote: > So creating say 1000 random files in /lib/firmware on a freshly created > btrfs partition helps reproduce: > > mkfs.btrfs /dev/whatever > mount /dev/wahtever /lib/firmware > # Put it on /etc/fstab too > > Generate 1000 random files on it: > > ``` > for n in {1..1000}; do > dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 )) > done > ``` > > Then reboot, upon reboot do: > > modprobe test_firmware > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test > systemctl suspend > > If its a guest wake it up: > > virsh dompmwakeup domidofguest This happens because: btrfs_lookup() --> ... --> btrfs_search_slot() --> read_block_for_search() --> --> read_tree_block() --> btree_read_extent_buffer_pages() --> --> submit_one_bio() --> btrfs_submit_metadata_bio() --> --> btrfsic_submit_bio() --> submit_bio() --> this completes and then --> wait_on_page_locked() on the first page --> never returns I also managed to reproduce this easily with XFS as well, so this is not a btrfs thing as I suspected. It does not happen with ext4 though. However I think that's just by chance, it should still be prone to the same issue. Either way, I'm dusting off my patches for fs freeze as I believe that should fix this problem. I am not sure if we want a stop gap hack like the one I posted in the meantime... I don't think so. I rather fix this well with the series I'll post for fs freeze. Give me a bit of time and I'll CC you on the patches. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? 2021-04-08 18:02 ` Luis Chamberlain @ 2021-04-16 23:17 ` Luis Chamberlain 0 siblings, 0 replies; 22+ messages in thread From: Luis Chamberlain @ 2021-04-16 23:17 UTC (permalink / raw) To: Lukas Middendorf, Greg KH, dsterba, Jan Kara, Bart Van Assche, fsdevel Cc: linux-btrfs, Antti Palosaari, Mauro Carvalho Chehab, linux-media On Thu, Apr 08, 2021 at 06:02:24PM +0000, Luis Chamberlain wrote: > On Sat, Apr 03, 2021 at 08:25:38PM +0000, Luis Chamberlain wrote: > > So creating say 1000 random files in /lib/firmware on a freshly created > > btrfs partition helps reproduce: > > > > mkfs.btrfs /dev/whatever > > mount /dev/wahtever /lib/firmware > > # Put it on /etc/fstab too > > > > Generate 1000 random files on it: > > > > ``` > > for n in {1..1000}; do > > dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 count=$((RANDOM + 1024 )) > > done > > ``` > > > > Then reboot, upon reboot do: > > > > modprobe test_firmware > > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test > > systemctl suspend > > > > If its a guest wake it up: > > > > virsh dompmwakeup domidofguest > > This happens because: > > btrfs_lookup() --> ... --> > btrfs_search_slot() --> read_block_for_search() --> > --> read_tree_block() --> btree_read_extent_buffer_pages() --> > --> submit_one_bio() --> btrfs_submit_metadata_bio() --> > --> btrfsic_submit_bio() --> submit_bio() > --> this completes and then > --> wait_on_page_locked() on the first page > --> never returns > > I also managed to reproduce this easily with XFS as well, so this is not > a btrfs thing as I suspected. It does not happen with ext4 though. > However I think that's just by chance, it should still be prone to the > same issue. > > Either way, I'm dusting off my patches for fs freeze as I believe that > should fix this problem. I am not sure if we want a stop gap hack like > the one I posted in the meantime... I don't think so. I rather fix this > well with the series I'll post for fs freeze. Give me a bit of time and > I'll CC you on the patches. Low and behold, as I suspectd, my old VFS fsfreeze / thaw patch series this. However I should note that I needed to add remove also the WQ_FREEZABLE from fs as well, which was missing in my patch series, and which Jan Kara had pointed out. However, the VFS freeze work is quite a bit of work which we are still discussing, so in the meantime, I think we have no other option but to put the stop-gap patch I suggested with the usermode helper lock. I will just modify it a bit more. I'll also post my fs freeze work now again, but I'll note that it still requires some more work to address everything which we have discussed in the community. I'll post the patches as I think others may be interested in the progress of that. Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2021-04-16 23:18 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-08-09 18:51 Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs? Lukas Middendorf 2020-08-13 16:37 ` Luis Chamberlain 2020-08-13 21:53 ` Lukas Middendorf 2020-08-13 22:13 ` Luis Chamberlain 2020-08-14 11:38 ` Lukas Middendorf 2020-08-14 16:37 ` Luis Chamberlain 2020-08-14 21:59 ` Lukas Middendorf 2020-08-17 15:20 ` Luis Chamberlain 2020-08-17 22:04 ` Lukas Middendorf 2020-08-18 14:37 ` Luis Chamberlain 2021-04-01 14:59 ` Lukas Middendorf 2021-04-02 18:02 ` Luis Chamberlain 2021-04-02 22:19 ` Luis Chamberlain 2021-04-02 22:58 ` Luis Chamberlain 2021-04-03 10:24 ` Lukas Middendorf 2021-04-03 16:07 ` Lukas Middendorf 2021-04-03 20:25 ` Luis Chamberlain 2021-04-03 21:04 ` Luis Chamberlain 2021-04-05 9:52 ` Lukas Middendorf 2021-04-04 0:50 ` Lukas Middendorf 2021-04-08 18:02 ` Luis Chamberlain 2021-04-16 23:17 ` Luis Chamberlain
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).