On Jul 26, 2019, at 8:59 PM, Damien Le Moal wrote:
>
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>
>>>> This looks like something that could hit every file system, so
>>>> shouldn't we fix this in common code? We could also look into
>>>> just using memalloc_nofs_save for the page cache allocation path
>>>> instead of the per-mapping gfp_mask.
>>>
>>> I think it has to be the entire IO path - any allocation from the
>>> underlying filesystem could recurse into the top level filesystem
>>> and then deadlock if the memory reclaim submits IO or blocks on
>>> IO completion from the upper filesystem. That's a bloody big hammer
>>> for something that is only necessary when there are stacked
>>> filesystems like this....
>>
>> Yeah.... that's why using memalloc_nofs_save() probably makes the most
>> sense, and dm_zoned should use that before it calls into ext4.
>
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up
> in that HBA's command ring. The commands themselves are read from the ring
> and executed by the tcmu-runner user process, which carries them out by
> doing pread()/pwrite() on the ext4 file. Since the tcmu-runner process is a
> different context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

One way to handle this is to pass on the PF_MEMALLOC/memalloc_nofs_save state
in the BIO, so that the worker thread will also get the correct behaviour when
it is processing this IO submission (see the rough sketch in the P.S. below).

> We tried a simpler setup using a loopback mount (XFS used directly on an
> ext4 file) and running the same workload. We failed to recreate a similar
> deadlock in this case, but I strongly suspect that it can happen too. It is
> simply much harder to hit because the IO path from XFS to ext4 is all
> in-kernel and asynchronous, whereas the tcmu-runner ZBC handler is a
> synchronous QD=1 path for IOs, which makes it relatively easy to get
> inter-dependent writes or a read+write queued back-to-back and create the
> deadlock.
>
> So back to Dave's point, we may need the big-hammer solution in the case of
> stacked file systems, while non-stacked setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell whether it is at the top of
> the stack (big hammer not needed) or lower than the top level (big hammer
> needed)?
>
> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would then depend on the
> applications or the system setup being correct. So not so safe.

Cheers, Andreas
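
P.S. To make the BIO idea above a bit more concrete, here is a minimal
sketch. The BIO_NOFS flag and both helper functions are hypothetical (no
such flag exists today); memalloc_nofs_save/restore(), bio_set_flag() and
bio_flagged() are the existing kernel APIs:

	/* Submission side: record that the submitter runs in NOFS scope. */
	static void submit_bio_record_nofs(struct bio *bio)
	{
		if (current->flags & PF_MEMALLOC_NOFS)
			bio_set_flag(bio, BIO_NOFS);	/* hypothetical flag */
		submit_bio(bio);
	}

	/*
	 * Worker side: re-apply the submitter's NOFS scope while the BIO
	 * is processed, so reclaim triggered here cannot recurse into a
	 * filesystem and deadlock against the original submitter.
	 */
	static void worker_process_bio(struct bio *bio)
	{
		unsigned int nofs_flags = 0;
		bool nofs = bio_flagged(bio, BIO_NOFS);

		if (nofs)
			nofs_flags = memalloc_nofs_save();

		do_process_bio(bio);	/* hypothetical; allocations now act as GFP_NOFS */

		if (nofs)
			memalloc_nofs_restore(nofs_flags);
	}

This only helps where the BIO is processed in the kernel, of course; the
tcmu-runner userspace side would still need its own treatment.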
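
P.P.S. For reference, the unconditional "big hammer" keyed off a mount
option could reuse the existing per-mapping mask mentioned at the top of the
thread, along these lines (the NOFS_ALLOC option name is made up;
mapping_set_gfp_mask() and mapping_gfp_constraint() are the existing
helpers):

	/* In the filesystem's inode initialization path: */
	if (test_opt(sb, NOFS_ALLOC))	/* hypothetical mount option */
		mapping_set_gfp_mask(inode->i_mapping,
			mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS));

That clears __GFP_FS for all page cache allocations against the file, which
is the behaviour the fcntl()/mount option hack above would turn on, with all
the caveats already noted about trusting the setup to be correct.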