> On 20 May 2019, at 11:15, Jan Kara wrote:
>
> On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
>> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for
>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>> still see the problem, then I'll be glad to reproduce it, and
>>> hopefully fix it for you.
>>
>> Hi Paolo, I'm curious if you could give a quick summary about what
>> you changed in BFQ?
>>
>> I was considering adding support so that if userspace calls fsync(2)
>> or fdatasync(2), we attach the process's CSS to the transaction, and
>> then charge all of the journal metadata writes to that CSS. If there
>> are multiple fsyncs batched into the transaction, the first process
>> which forced the early transaction commit would get charged the
>> entire journal write. OTOH, journal writes are sequential I/O, so
>> the amount of disk time for writing the journal is going to be
>> relatively small, and in particular the work from other cgroups in
>> that write is going to be minimal, especially if they hadn't issued
>> an fsync().
>
> But this makes priority-inversion problems with the ext4 journal
> worse, doesn't it? If we submit the journal commit in the blkio
> cgroup of some random process, it may get throttled, which then
> effectively blocks the whole filesystem. Or do you want to implement
> a more complex back-pressure mechanism where you'd just account to a
> different blkio cgroup during the journal commit and then throttle at
> a different point, where you are not blocking other tasks from making
> progress?
>
>> In the case where you have three cgroups all issuing fsync(2) and
>> they all landed in the same jbd2 transaction thanks to commit
>> batching, in the ideal world we would split up the disk time usage
>> equally across those three cgroups. But it's probably not worth
>> doing that...
>>
>> That being said, we probably do need some BFQ support, since in the
>> case where we have multiple processes doing buffered writes w/o
>> fsync, we do charge the data=ordered writeback to each block cgroup.
>> Worse, the commit can't complete until all of the data integrity
>> writebacks have completed. And if there are N cgroups with dirty
>> inodes, and slice_idle is set to 8ms, there is going to be 8*N ms
>> worth of idle time tacked onto the commit time.
>
> Yeah. At least in some cases, we know there won't be any more IO from
> a particular cgroup in the near future (e.g. a transaction commit
> completing, or when the layers above the IO scheduler already know
> which IO they are going to submit next), and in that case idling is
> just a waste of time.

Yep. Issues like this are targeted exactly by the improvement I
mentioned in my previous reply.

> But so far I haven't decided how a reasonably clean interface for
> this should look, one that isn't specific to a particular IO
> scheduler implementation.

That's an interesting point. So far, I've assumed that nobody would
tell BFQ anything. But if you guys think that such communication may
be acceptable to some degree, then I'd be glad to try to come up with
some solution. For instance: some hook that any I/O scheduler may
export, if meaningful.

Thanks,
Paolo

> Honza
> --
> Jan Kara
> SUSE Labs, CR
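
P.S. Just to make the "hook" idea above a bit more concrete, here is a
rough toy sketch, in plain user-space C, of the shape such an opt-in
callback could take. Every name in it (io_sched_ops, no_more_io_hint,
and so on) is invented for illustration; nothing like this exists in
the kernel today.

#include <stdio.h>

/* A scheduler exports an optional hint callback; schedulers that
 * never idle simply leave it NULL and are unaffected. */
struct io_sched_ops {
	const char *name;
	/* "No more I/O expected soon from this group, stop idling." */
	void (*no_more_io_hint)(int group_id);
};

static void bfq_like_no_more_io_hint(int group_id)
{
	printf("scheduler: stop idling for group %d\n", group_id);
}

static struct io_sched_ops bfq_like_sched = {
	.name = "bfq-like",
	.no_more_io_hint = bfq_like_no_more_io_hint,
};

static struct io_sched_ops noop_like_sched = {
	.name = "noop-like",
	/* hook left unset: this scheduler never idles */
};

/* What an upper layer (say, jbd2 right after a commit) would call. */
static void hint_no_more_io(struct io_sched_ops *ops, int group_id)
{
	if (ops->no_more_io_hint)
		ops->no_more_io_hint(group_id);
}

int main(void)
{
	hint_no_more_io(&bfq_like_sched, 42);   /* hook fires */
	hint_no_more_io(&noop_like_sched, 42);  /* silently ignored */
	return 0;
}

The point is only that the hook would be optional and generic: upper
layers call one helper, and an I/O scheduler opts in only if idling is
a concept it actually has.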