* Writeback bug causing writeback stalls
@ 2020-05-22  9:57 Martijn Coenen
From: Martijn Coenen
To: Al Viro, Jan Kara, Jens Axboe, miklos, tj
Cc: linux-fsdevel, LKML, android-storage-core, kernel-team

Hi,

We've been working on implementing a FUSE filesystem in Android, and have run
into what appears to be a bug in the kernel writeback code. The problem we
observed is that an inode in the filesystem is on the b_dirty_time list, but
its i_state field has I_DIRTY_PAGES set, which means the inode is on the wrong
list. This condition doesn't fix itself even as new pages are dirtied, because
__mark_inode_dirty() has this check:

	if ((inode->i_state & flags) != flags) {

before considering moving the inode to another list. Since the inode already
has I_DIRTY_PAGES set, we're not going to move it to the dirty list. I *think*
the only way the inode gets out of this condition is when we handle the
b_dirty_time list, which can take a while.

The reason we even noticed this bug in the first place is that FUSE has a very
low wb max_ratio by default (1), so at some point processes get stuck in
balance_dirty_pages_ratelimited(), waiting for pages to be written. They hold
the inode's write lock, and when other processes try to acquire it, they get
stuck as well. We have a watchdog that reboots the device after ~10 minutes of
a task being stuck in D state, and this triggered regularly in some tests.

After careful study of the kernel code, I found a reliable repro scenario for
this condition, which is described in more detail below.
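[Editor's note: a toy Python model (illustration only, not kernel code) of why
that check keeps the inode stuck: once I_DIRTY_PAGES is in i_state, redirtying
pages never reaches the list-moving code below the check.]

```python
# Toy model of the __mark_inode_dirty() early bail-out. Flag value and
# list names mirror the kernel's but are simplified assumptions.
I_DIRTY_PAGES = 1 << 2

def mark_inode_dirty(i_state, flags, wb_list):
    """Return (new i_state, new wb list) -- heavily simplified model."""
    if (i_state & flags) == flags:
        return i_state, wb_list          # early bail: nothing new to record
    return i_state | flags, "b_dirty"    # otherwise the inode moves to b_dirty

# Inode stranded on b_dirty_time but already marked I_DIRTY_PAGES:
state, lst = mark_inode_dirty(I_DIRTY_PAGES, I_DIRTY_PAGES, "b_dirty_time")
assert lst == "b_dirty_time"             # new dirty pages don't unstick it
```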
But essentially what I think is happening is this: __mark_inode_dirty() has an
early return for when a sync is in progress, where it updates the inode's
i_state but not the writeback list:

	inode->i_state |= flags;

	/*
	 * If the inode is being synced, just update its dirty state.
	 * The unlocker will place the inode on the appropriate
	 * superblock list, based upon its state.
	 */
	if (inode->i_state & I_SYNC)
		goto out_unlock_inode;

Now, this comment holds for the generic flusher threads, which run
writeback_sb_inodes(); that calls requeue_inode() to move the inode back to
the correct wb list when the sync is done. However, there is another function
that sets I_SYNC: writeback_single_inode(). This function has a comment saying
it prefers not to touch writeback lists, and in fact it only removes the inode
if it is clean:

	/*
	 * Skip inode if it is clean and we have no outstanding writeback in
	 * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this
	 * function since flusher thread may be doing for example sync in
	 * parallel and if we move the inode, it could get skipped. So here we
	 * make sure inode is on some writeback list and leave it there unless
	 * we have completely cleaned the inode.
	 */

writeback_single_inode() is called from a few functions, in particular
write_inode_now(), which in turn is called by FUSE's flush() f_op.

So, the sequence of events is something like this. Let's assume the inode is
already on b_dirty_time for valid reasons.
Then:

CPU1                                          CPU2
fuse_flush()
  write_inode_now()
    writeback_single_inode()
      sets I_SYNC
      __writeback_single_inode()
        writes back data
        clears inode dirty flags
        unlocks inode
        calls mark_inode_dirty_sync()
          sets I_DIRTY_SYNC, but doesn't
          update wb list because I_SYNC is
          still set
                                              write() // somebody else writes
                                              mark_inode_dirty(I_DIRTY_PAGES)
                                                sets I_DIRTY_PAGES on i_state
                                                doesn't update wb list,
                                                because I_SYNC set
      locks inode again
      sees inode is still dirty,
      doesn't touch WB list
      clears I_SYNC

So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
I_DIRTY_SYNC will do nothing to change that. The flusher won't touch the inode
either, because it's not on a b_dirty or b_io list.

The easiest way to fix this, I think, is to call requeue_inode() at the end of
writeback_single_inode(), much like it is called from writeback_sb_inodes().
However, requeue_inode() has the following ominous warning:

	/*
	 * Find proper writeback list for the inode depending on its current state and
	 * possibly also change of its state while we were doing writeback. Here we
	 * handle things such as livelock prevention or fairness of writeback among
	 * inodes. This function can be called only by flusher thread - noone else
	 * processes all inodes in writeback lists and requeueing inodes behind flusher
	 * thread's back can have unexpected consequences.
	 */

Obviously this is very critical code from both a correctness and a performance
point of view, so I wanted to run this by the maintainers and folks who have
contributed to this code first.
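[Editor's note: the interleaving above can be replayed deterministically in a
small Python model (illustration only; flag names mirror the kernel's but this
is not kernel code, and the real __mark_inode_dirty() handles more cases). It
shows how the two early returns strand the inode on b_dirty_time.]

```python
# Toy model: simulate the CPU1/CPU2 race and verify the stuck end state.
I_DIRTY_SYNC, I_DIRTY_PAGES, I_DIRTY_TIME, I_SYNC = 1, 2, 4, 8

class Inode:
    def __init__(self):
        self.i_state = I_DIRTY_TIME
        self.wb_list = "b_dirty_time"   # starting point in the scenario

def mark_inode_dirty(inode, flags):
    # Early return 1: all requested flags already set -> nothing to do.
    if (inode.i_state & flags) == flags:
        return
    inode.i_state |= flags
    # Early return 2: sync in progress -> don't touch writeback lists,
    # trusting the I_SYNC holder to requeue the inode afterwards.
    if inode.i_state & I_SYNC:
        return
    inode.wb_list = "b_dirty"           # normal path: move to dirty list

def writeback_single_inode(inode, racing_write):
    inode.i_state |= I_SYNC
    # __writeback_single_inode(): data written, dirty flags cleared ...
    inode.i_state &= ~(I_DIRTY_SYNC | I_DIRTY_PAGES)
    mark_inode_dirty(inode, I_DIRTY_SYNC)   # timestamps still dirty
    racing_write(inode)                     # CPU2 dirties pages meanwhile
    # Inode is still dirty, so the list is deliberately left untouched.
    inode.i_state &= ~I_SYNC

inode = Inode()
writeback_single_inode(inode, lambda i: mark_inode_dirty(i, I_DIRTY_PAGES))

# Stuck: dirty pages, but still parked on b_dirty_time.
assert inode.i_state & I_DIRTY_PAGES and inode.wb_list == "b_dirty_time"
# Later markings are no-ops because the flags are already set.
mark_inode_dirty(inode, I_DIRTY_PAGES)
assert inode.wb_list == "b_dirty_time"
```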
The way I got to reproduce this reliably was by using what is pretty much a
pass-through FUSE filesystem, and the following two commands run in parallel:

[1] fio --rw=write --size=5G --blocksize=80000 --name=test --directory=/sdcard/
[2] while true; do echo flushme >> /sdcard/test.0.0; sleep 0.1; done

I doubt the blocksize matters; it's just the blocksize I observed being used
in one of our test runs that hit this. [2] essentially calls fuse_flush()
every 100ms, which is often enough to reproduce this problem within seconds:
fio will stall and enter balance_dirty_pages_ratelimited(), and [2] will hang
because it needs the inode write lock.

Other filesystems may hit the same problem, though write_inode_now() is
usually only used when no more dirty pages are expected (eg in the final
iput()). There are some other functions that call writeback_single_inode()
that are more widely used, like sync_inode() and sync_inode_metadata().

Curious to hear your thoughts on this. I'm happy to provide more info or
traces if needed.

Thanks,
Martijn
* Re: Writeback bug causing writeback stalls
From: Jan Kara @ 2020-05-22 14:41 UTC
To: Martijn Coenen
Cc: Al Viro, Jan Kara, Jens Axboe, miklos, tj, linux-fsdevel, LKML, android-storage-core, kernel-team

Hi!

On Fri 22-05-20 11:57:42, Martijn Coenen wrote:

<snip>

> So, the sequence of events is something like this. Let's assume the inode is
> already on b_dirty_time for valid reasons. Then:
>
> CPU1                                          CPU2
> fuse_flush()
>   write_inode_now()
>     writeback_single_inode()
>       sets I_SYNC
>       __writeback_single_inode()
>         writes back data
>         clears inode dirty flags
>         unlocks inode
>         calls mark_inode_dirty_sync()
>           sets I_DIRTY_SYNC, but doesn't
>           update wb list because I_SYNC is
>           still set
>                                               write() // somebody else writes
>                                               mark_inode_dirty(I_DIRTY_PAGES)
>                                                 sets I_DIRTY_PAGES on i_state
>                                                 doesn't update wb list,
>                                                 because I_SYNC set
>       locks inode again
>       sees inode is still dirty,
>       doesn't touch WB list
>       clears I_SYNC
>
> So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
> and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
> I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
> the inode either, because it's not on a b_dirty or b_io list.

Thanks for the detailed analysis and explanation! I agree that what you
describe is a bug in the writeback code.

> The easiest way to fix this, I think, is to call requeue_inode() at the end of
> writeback_single_inode(), much like it is called from writeback_sb_inodes().
> However, requeue_inode() has the following ominous warning:
>
> /*
>  * Find proper writeback list for the inode depending on its current state and
>  * possibly also change of its state while we were doing writeback. Here we
>  * handle things such as livelock prevention or fairness of writeback among
>  * inodes. This function can be called only by flusher thread - noone else
>  * processes all inodes in writeback lists and requeueing inodes behind flusher
>  * thread's back can have unexpected consequences.
>  */
>
> Obviously this is very critical code both from a correctness and a performance
> point of view, so I wanted to run this by the maintainers and folks who have
> contributed to this code first.

Sadly, the fix won't be so easy. The main problem with calling
requeue_inode() from writeback_single_inode() is that if there's a parallel
sync(2) call, inode->i_io_list is used to track all inodes that need writing
before sync(2) can complete. So requeueing inodes in parallel while sync(2)
runs can break its data integrity guarantees. But I agree we need to find
some mechanism to safely move the inode to the appropriate dirty list
reasonably quickly.

Probably I'd add an inode state flag telling that the inode is queued for
writeback by the flush worker, and we won't touch the dirty lists in that
case; otherwise we are safe to update the current writeback list as needed.
I'll work on fixing this, as when I was reading the code I noticed there are
other quirks in it as well. Thanks for the report!

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
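[Editor's note: the flag-based idea Jan sketches here can be modelled in a few
lines of Python (a hypothetical illustration of the concept, not the actual
kernel patch): __mark_inode_dirty() skips list updates only while the flush
worker has the inode queued (call it I_SYNC_QUEUED), not merely while I_SYNC
is held, so a writeback_single_inode() caller no longer strands the inode.]

```python
# Toy model of the proposed fix: key the list-update skip on a
# "queued by flush worker" flag instead of I_SYNC.
I_DIRTY_PAGES, I_DIRTY_TIME, I_SYNC, I_SYNC_QUEUED = 1, 2, 4, 8

class Inode:
    def __init__(self):
        self.i_state = I_DIRTY_TIME
        self.wb_list = "b_dirty_time"

def mark_inode_dirty(inode, flags):
    if (inode.i_state & flags) == flags:
        return
    inode.i_state |= flags
    # Only the flush worker requeues inodes it has moved to b_io, so
    # only then must we keep our hands off the writeback lists.
    if inode.i_state & I_SYNC_QUEUED:
        return
    inode.wb_list = "b_dirty"

def writeback_single_inode(inode, racing_write):
    inode.i_state |= I_SYNC            # but NOT I_SYNC_QUEUED
    inode.i_state &= ~I_DIRTY_PAGES    # data written back
    racing_write(inode)                # concurrent write() on another CPU
    inode.i_state &= ~I_SYNC

inode = Inode()
writeback_single_inode(inode, lambda i: mark_inode_dirty(i, I_DIRTY_PAGES))

# With the check keyed on I_SYNC_QUEUED, the racing write moved the
# inode to b_dirty, so the flusher will find it.
assert inode.wb_list == "b_dirty"
```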
* Re: Writeback bug causing writeback stalls
From: Martijn Coenen @ 2020-05-22 15:23 UTC
To: Jan Kara
Cc: Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

[ dropped android-storage-core@google.com from Cc: since that list can't
receive emails from outside google.com - sorry about that ]

Hi Jan,

On Fri, May 22, 2020 at 4:41 PM Jan Kara <jack@suse.cz> wrote:

<snip>

> Sadly, the fix won't be so easy. The main problem with calling
> requeue_inode() from writeback_single_inode() is that if there's parallel
> sync(2) call, inode->i_io_list is used to track all inodes that need writing
> before sync(2) can complete. So requeueing inodes in parallel while sync(2)
> runs can result in breaking data integrity guarantees of it.

Ah, makes sense.

> But I agree we need to find some mechanism to safely move inode to
> appropriate dirty list reasonably quickly.
>
> Probably I'd add an inode state flag telling that inode is queued for
> writeback by flush worker and we won't touch dirty lists in that case,
> otherwise we are safe to update current writeback list as needed. I'll work
> on fixing this as when I was reading the code I've noticed there are other
> quirks in the code as well. Thanks for the report!

Thanks! While looking at the code I also saw some other paths that appeared
to be racy, though I haven't worked them out in detail to confirm that - the
locking around the inode and writeback lists is tricky. What's the best way
to follow up on those? Happy to post them to this same thread after I spend
a bit more time looking at the code.

Thanks,
Martijn
* Re: Writeback bug causing writeback stalls
From: Jan Kara @ 2020-05-22 15:36 UTC
To: Martijn Coenen
Cc: Jan Kara, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

On Fri 22-05-20 17:23:30, Martijn Coenen wrote:

<snip>

> Thanks! While looking at the code I also saw some other paths that
> appeared to be racy, though I haven't worked them out in detail to
> confirm that - the locking around the inode and writeback lists is
> tricky. What's the best way to follow up on those? Happy to post them
> to this same thread after I spend a bit more time looking at the code.

Sure, if you are aware of some other problems, just write them to this
thread. FWIW, stuff that I've found so far:

1) The __I_DIRTY_TIME_EXPIRED setting in move_expired_inodes() can get lost,
as there are other places doing RMW modifications of inode->i_state.

2) sync(2) is prone to livelocks: when we queue inodes from the b_dirty_time
list, we don't take dirtied_when into account (and that's the only thing
that makes sure an aggressive dirtier cannot livelock sync).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
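[Editor's note: quirk 1 above is a classic lost-update race. A deterministic
Python sketch (illustration only, not kernel code; the interleaving is made
explicit rather than relying on real threads) shows how a plain
read-modify-write of i_state can swallow a concurrently set bit.]

```python
# Toy model of the lost set_bit(): CPU A does a non-atomic RMW of
# i_state while CPU B sets __I_DIRTY_TIME_EXPIRED in between.
I_DIRTY_SYNC, I_DIRTY_TIME_EXPIRED = 1, 2

i_state = 0

# CPU A begins a non-atomic read-modify-write: it reads the old value ...
snapshot = i_state

# ... CPU B sets I_DIRTY_TIME_EXPIRED in between ...
i_state |= I_DIRTY_TIME_EXPIRED

# ... and CPU A writes back its stale snapshot plus its own flag.
i_state = snapshot | I_DIRTY_SYNC

# The expired bit set by CPU B has been lost.
assert not (i_state & I_DIRTY_TIME_EXPIRED)
```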
* Re: Writeback bug causing writeback stalls
From: Martijn Coenen @ 2020-05-23 8:15 UTC
To: Jan Kara, Jaegeuk Kim
Cc: Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

Jaegeuk wondered whether callers of write_inode_now() should hold i_rwsem,
and whether that would also prevent this problem. Some existing callers of
write_inode_now() do, eg ntfs and hfs:

	hfs_file_fsync()
		inode_lock(inode);

		/* sync the inode to buffers */
		ret = write_inode_now(inode, 0);

but there are also some that don't (eg fat, fuse, orangefs).

Thanks,
Martijn

On Fri, May 22, 2020 at 5:36 PM Jan Kara <jack@suse.cz> wrote:

<snip>
* Re: Writeback bug causing writeback stalls
From: Jan Kara @ 2020-05-25 7:31 UTC
To: Martijn Coenen
Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

On Sat 23-05-20 10:15:20, Martijn Coenen wrote:
> Jaegeuk wondered whether callers of write_inode_now() should hold
> i_rwsem, and whether that would also prevent this problem. Some
> existing callers of write_inode_now() do, eg ntfs and hfs:
>
>	hfs_file_fsync()
>		inode_lock(inode);
>
>		/* sync the inode to buffers */
>		ret = write_inode_now(inode, 0);
>
> but there are also some that don't (eg fat, fuse, orangefs).

Well, most importantly, filesystems like ext4, xfs, and btrfs don't hold
i_rwsem when writing back an inode, and that's deliberate, for performance:
we don't want to block writes (or even reads, in the case of XFS) to the
inode during writeback.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Writeback bug causing writeback stalls
From: Martijn Coenen @ 2020-05-27 8:14 UTC
To: Jan Kara
Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

Hi Jan,

On Mon, May 25, 2020 at 9:31 AM Jan Kara <jack@suse.cz> wrote:
> Well, most importantly filesystems like ext4, xfs, btrfs don't hold i_rwsem
> when writing back inode and that's deliberate because of performance. We
> don't want to block writes (or even reads in case of XFS) for the inode
> during writeback.

Thanks for clarifying, that makes sense. By the way, do you have an ETA for
your fix? We are under some time pressure to get this fixed in our
downstream kernels, but I'd much rather take a fix from upstream from
somebody who knows this code well. Alternatively, I can take a stab at the
idea you proposed and send a patch to LKML for review this week.

Thanks,
Martijn
* Re: Writeback bug causing writeback stalls
From: Jan Kara @ 2020-05-29 15:20 UTC
To: Martijn Coenen
Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

Hello Martijn!

On Wed 27-05-20 10:14:09, Martijn Coenen wrote:
> On Mon, May 25, 2020 at 9:31 AM Jan Kara <jack@suse.cz> wrote:
> > Well, most importantly filesystems like ext4, xfs, btrfs don't hold i_rwsem
> > when writing back inode and that's deliberate because of performance. We
> > don't want to block writes (or even reads in case of XFS) for the inode
> > during writeback.
>
> Thanks for clarifying, that makes sense. By the way, do you have an
> ETA for your fix? We are under some time pressure to get this fixed in
> our downstream kernels, but I'd much rather take a fix from upstream
> from somebody who knows this code well. Alternatively, I can take a
> stab at the idea you proposed and send a patch to LKML for review this
> week.

I understand. I have written a fix (attached). Currently it's under testing
together with other cleanups. If everything works fine, I plan to submit
the patches on Monday.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Avoid-skipping-inode-writeback.patch --]
[-- Type: text/x-patch, Size: 7847 bytes --]

From 6d0d46bfddccff7aa4ed114ce15c23dfb68ad2b2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 29 May 2020 15:05:22 +0200
Subject: [PATCH] writeback: Avoid skipping inode writeback

Inode's i_io_list list head is used to attach an inode to several
different lists: wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When the
flush worker prepares a list of inodes to write back, e.g. for sync(2),
it moves inodes to the b_io list. Thus it is critical for sync(2) data
integrity guarantees that the inode is not requeued to any other
writeback list while it is queued for processing by the flush worker.
That's the reason why writeback_single_inode() does not touch i_io_list
(unless the inode is completely clean) and why __mark_inode_dirty() does
not touch i_io_list if the I_SYNC flag is set.

However, there are two flaws in the current logic:

1) When an inode has only I_DIRTY_TIME set but is already queued in the
b_io list due to sync(2), a concurrent __mark_inode_dirty(inode,
I_DIRTY_SYNC) can still move the inode back to b_dirty, resulting in
skipping writeback of the inode's timestamps during sync(2).

2) When an inode is on the b_dirty_time list and writeback_single_inode()
races with __mark_inode_dirty() like:

writeback_single_inode()		__mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
					  inode->i_state |= I_DIRTY_PAGES;
					  if (inode->i_state & I_SYNC)
					    bail
  if (!(inode->i_state & I_DIRTY_ALL))
  - not true so nothing done

we end up with an I_DIRTY_PAGES inode on the b_dirty_time list, and thus
standard background writeback will not write back this inode, leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).

Fix these problems by tracking whether an inode is queued in the b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know the flush worker has queued the inode and we should not touch
i_io_list. On the other hand, we also know that once the flush worker is
done with the inode, it will requeue the inode to the appropriate dirty
list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must)
move the inode to the appropriate dirty list.
Reported-by: Martijn Coenen <maco@android.com>
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 39 +++++++++++++++++++++++++++++----------
 include/linux/fs.h |  8 ++++++--
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..855c6611710a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -144,7 +144,9 @@ static void inode_io_list_del_locked(struct inode *inode, struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
+	assert_spin_locked(&inode->i_lock);
 
+	inode->i_state &= ~I_SYNC_QUEUED;
 	list_del_init(&inode->i_io_list);
 	wb_io_lists_depopulated(wb);
 }
@@ -1123,7 +1125,9 @@ void inode_io_list_del(struct inode *inode)
 	struct bdi_writeback *wb;
 
 	wb = inode_to_wb_and_lock_list(inode);
+	spin_lock(&inode->i_lock);
 	inode_io_list_del_locked(inode, wb);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 }
@@ -1172,8 +1176,9 @@ void sb_clear_inode_writeback(struct inode *inode)
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+static void __redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
+	assert_spin_locked(&inode->i_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
@@ -1182,6 +1187,14 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 			inode->dirtied_when = jiffies;
 	}
 	inode_io_list_move_locked(inode, wb, &wb->b_dirty);
+	inode->i_state &= ~I_SYNC_QUEUED;
+}
+
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+{
+	spin_lock(&inode->i_lock);
+	__redirty_tail(inode, wb);
+	spin_unlock(&inode->i_lock);
 }
 
 /*
@@ -1250,8 +1263,11 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 			break;
 		list_move(&inode->i_io_list, &tmp);
 		moved++;
+		spin_lock(&inode->i_lock);
 		if (flags & EXPIRE_DIRTY_ATIME)
-			set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
+			inode->i_state |= I_DIRTY_TIME_EXPIRED;
+		inode->i_state |= I_SYNC_QUEUED;
+		spin_unlock(&inode->i_lock);
 		if (sb_is_blkdev_sb(inode->i_sb))
 			continue;
 		if (sb && sb != inode->i_sb)
@@ -1394,7 +1410,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * writeback is not making progress due to locked
 		 * buffers. Skip this inode for now.
 		 */
-		redirty_tail(inode, wb);
+		__redirty_tail(inode, wb);
 		return;
 	}
@@ -1414,7 +1430,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * retrying writeback of the dirty page/inode
 			 * that cannot be performed immediately.
 			 */
-			redirty_tail(inode, wb);
+			__redirty_tail(inode, wb);
 		}
 	} else if (inode->i_state & I_DIRTY) {
 		/*
@@ -1422,10 +1438,11 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * such as delayed allocation during submission or metadata
 		 * updates after data IO completion.
 		 */
-		redirty_tail(inode, wb);
+		__redirty_tail(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
 		inode->dirtied_when = jiffies;
 		inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+		inode->i_state &= ~I_SYNC_QUEUED;
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
 		inode_io_list_del_locked(inode, wb);
 	}
@@ -1669,8 +1686,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 */
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			inode->i_state &= ~I_SYNC_QUEUED;
+			__redirty_tail(inode, wb);
 			spin_unlock(&inode->i_lock);
-			redirty_tail(inode, wb);
 			continue;
 		}
 		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
@@ -2289,11 +2307,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		inode->i_state |= flags;
 
 		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
+		 * If the inode is queued for writeback by flush worker, just
+		 * update its dirty state. Once the flush worker is done with
+		 * the inode it will place it on the appropriate superblock
+		 * list, based upon its state.
 		 */
-		if (inode->i_state & I_SYNC)
+		if (inode->i_state & I_SYNC_QUEUED)
 			goto out_unlock_inode;
 
 		/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45cc10cdf6dd..b02290d19edd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2156,6 +2156,10 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_SYNC_QUEUED	Inode is queued in b_io or b_more_io writeback lists.
+ *			Used to detect that mark_inode_dirty() should not move
+ *			inode between dirty lists.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2173,11 +2177,11 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
 #define I_DIRTY_TIME		(1 << 11)
-#define __I_DIRTY_TIME_EXPIRED	12
-#define I_DIRTY_TIME_EXPIRED	(1 << __I_DIRTY_TIME_EXPIRED)
+#define I_DIRTY_TIME_EXPIRED	(1 << 12)
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_SYNC_QUEUED		(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
-- 
2.16.4
* Re: Writeback bug causing writeback stalls
From: Martijn Coenen @ 2020-05-29 19:37 UTC
To: Jan Kara
Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

Hi Jan,

On Fri, May 29, 2020 at 5:20 PM Jan Kara <jack@suse.cz> wrote:
> I understand. I have written a fix (attached). Currently its under testing
> together with other cleanups. If everything works fine, I plan to submit
> the patches on Monday.

Thanks a lot for the quick fix! I ran my usual way to reproduce the problem
and did not see it, so that's good! I do observe write-speed dips - eg we
usually sustain 180 MB/s on this device, but now it regularly dips down to
10 MB/s, then jumps back up again. That might be unrelated to your patch,
though; I will run more tests over the weekend and report back!

Best,
Martijn
* Re: Writeback bug causing writeback stalls
From: Jan Kara @ 2020-06-01 9:09 UTC
To: Martijn Coenen
Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

On Fri 29-05-20 21:37:50, Martijn Coenen wrote:
> Thanks a lot for the quick fix! I ran my usual way to reproduce the
> problem, and did not see it, so that's good! I do observe write speed
> dips - eg we usually sustain 180 MB/s on this device, but now it
> regularly dips down to 10 MB/s, then jumps back up again. That might
> be unrelated to your patch though, I will run more tests over the
> weekend and report back!

Thanks for testing! My test run has completed fine, so I'll submit the
patches for review. But I'm curious what's causing the dips in throughput
in your test...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Writeback bug causing writeback stalls
  2020-06-01  9:09               ` Jan Kara
@ 2020-06-02 12:16                 ` Martijn Coenen
  0 siblings, 0 replies; 12+ messages in thread
From: Martijn Coenen @ 2020-06-02 12:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML,
	kernel-team

On Mon, Jun 1, 2020 at 11:09 AM Jan Kara <jack@suse.cz> wrote:
> Thanks for testing! My test run has completed fine so I'll submit patches
> for review. But I'm curious what's causing the dips in throughput in your
> test...

It turned out to be unrelated to your patch. Sorry for the noise! We
have the patch in dogfood on some of our devices, and I will let you
know if we run into any issues. I'll also spend some more time
reviewing your patches and will respond to them later.

Thanks,
Martijn

>
>								Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread
[parent not found: <20200524140522.14196-1-hdanton@sina.com>]
* Re: Writeback bug causing writeback stalls
       [not found] ` <20200524140522.14196-1-hdanton@sina.com>
@ 2020-05-25  7:38   ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2020-05-25 7:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Martijn Coenen, Al Viro, Jan Kara, Jens Axboe, miklos, tj,
	linux-fsdevel, LKML, android-storage-core, kernel-team

On Sun 24-05-20 22:05:22, Hillf Danton wrote:
>
> On Fri, 22 May 2020 11:57:42 +0200 Martijn Coenen wrote:
> >
> > So, the sequence of events is something like this. Let's assume the inode is
> > already on b_dirty_time for valid reasons. Then:
> >
> > CPU1                                          CPU2
> > fuse_flush()
> >   write_inode_now()
> >     writeback_single_inode()
> >       sets I_SYNC
> >       __writeback_single_inode()
> >         writes back data
> >         clears inode dirty flags
> >         unlocks inode
> >         calls mark_inode_dirty_sync()
> >           sets I_DIRTY_SYNC, but doesn't
> >           update wb list because I_SYNC is
> >           still set
> >                                               write() // somebody else writes
> >                                               mark_inode_dirty(I_DIRTY_PAGES)
> >                                                 sets I_DIRTY_PAGES on i_state
> >                                                 doesn't update wb list,
> >                                                 because I_SYNC set
> >       locks inode again
> >       sees inode is still dirty,
> >       doesn't touch WB list
> >       clears I_SYNC
> >
> > So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
> > and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
> > I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
> > the inode either, because it's not on a b_dirty or b_io list.

Hi Hillf,

> Based on the above analysis, check of I_DIRTY_TIME is added before and
> after calling __writeback_single_inode() to detect the case you reported.
>
> If a dirty inode is not on the right io list after writeback, we can
> move it to a new one; and we can do that as we are the I_SYNC owner.
>
> While changing its io list, the inode's dirty timestamp is also updated
> to the current tick as is done in __mark_inode_dirty().
Apparently you didn't read my reply to Martijn, because what you did in
this patch is exactly what I described we cannot do: it can cause sync(2)
to miss inodes and thus break its data integrity guarantees. So we have
to come up with a different solution.

								Honza

> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1528,6 +1528,7 @@ static int writeback_single_inode(struct
>  	struct writeback_control *wbc)
>  {
>  	struct bdi_writeback *wb;
> +	bool dt;
>  	int ret = 0;
>
>  	spin_lock(&inode->i_lock);
> @@ -1560,6 +1561,7 @@ static int writeback_single_inode(struct
>  	    !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
>  		goto out;
>  	inode->i_state |= I_SYNC;
> +	dt = inode->i_state & I_DIRTY_TIME;
>  	wbc_attach_and_unlock_inode(wbc, inode);
>
>  	ret = __writeback_single_inode(inode, wbc);
> @@ -1574,6 +1576,14 @@ static int writeback_single_inode(struct
>  	 */
>  	if (!(inode->i_state & I_DIRTY_ALL))
>  		inode_io_list_del_locked(inode, wb);
> +	else if (!(inode->i_state & I_DIRTY_TIME) && dt) {
> +		/*
> +		 * We can correct inode's io list, however, by moving it to
> +		 * b_dirty from b_dirty_time as we are the I_SYNC owner
> +		 */
> +		inode->dirtied_when = jiffies;
> +		inode_io_list_move_locked(inode, wb, &wb->b_dirty);
> +	}
>  	spin_unlock(&wb->list_lock);
>  	inode_sync_complete(inode);
>  out:
> --
>

--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread
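[Editor's note] The stuck-inode condition quoted above can be replayed in a
few lines of user-space C. This is a toy model, not the real
fs/fs-writeback.c code: the flag names mirror the kernel's, but the list
handling is reduced to the two early returns in `__mark_inode_dirty()`
that the thread discusses (flags already set; I_SYNC held).

```c
#include <assert.h>

/* Toy model of the relevant i_state bits and writeback lists. */
#define I_DIRTY_SYNC   (1u << 0)
#define I_DIRTY_PAGES  (1u << 1)
#define I_SYNC         (1u << 2)

enum wb_list { B_DIRTY, B_DIRTY_TIME };

struct inode_model {
    unsigned int i_state;
    enum wb_list list;
};

/* Mirrors the two early returns in __mark_inode_dirty(): if the flags
 * are already set, or a sync holds I_SYNC, the writeback list stays
 * untouched; only the normal path requeues the inode onto b_dirty. */
void mark_inode_dirty(struct inode_model *inode, unsigned int flags)
{
    if ((inode->i_state & flags) == flags)
        return;                      /* nothing new: bail out early */
    inode->i_state |= flags;
    if (inode->i_state & I_SYNC)
        return;                      /* sync in flight: list not updated */
    inode->list = B_DIRTY;           /* normal path requeues the inode */
}

/* Replays the CPU1/CPU2 interleaving from the quoted report and returns
 * the resulting inode state. */
struct inode_model reproduce_stuck_inode(void)
{
    struct inode_model inode = { .i_state = 0, .list = B_DIRTY_TIME };

    inode.i_state |= I_SYNC;                 /* writeback_single_inode() */
    mark_inode_dirty(&inode, I_DIRTY_SYNC);  /* fuse_flush() redirties   */
    mark_inode_dirty(&inode, I_DIRTY_PAGES); /* concurrent write()       */
    inode.i_state &= ~I_SYNC;                /* sync done, list untouched */
    return inode;
}
```

After `reproduce_stuck_inode()` the inode is dirty but still on
b_dirty_time, and any further `mark_inode_dirty()` call with the same
flags hits the `(i_state & flags) == flags` check and returns without
moving it — the self-sustaining condition described in the report.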
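[Editor's note] Jan's objection to the patch — that re-stamping
`dirtied_when` while moving the inode can make sync(2) skip it — can also
be sketched in miniature. The model below only captures the cutoff idea:
when sync queues inodes it takes those dirtied no later than its start
stamp. The function and struct names are illustrative, not the kernel API.

```c
#include <stdbool.h>

/* Toy inode carrying only the jiffies-style dirty timestamp. */
struct timed_inode {
    unsigned long dirtied_when;
};

/* Simplified version of the dirtied_before cutoff applied when queueing
 * inodes for a sync pass: only inodes dirtied at or before the sync's
 * start stamp are picked up. */
bool sync_would_write(const struct timed_inode *inode,
                      unsigned long dirtied_before)
{
    return inode->dirtied_when <= dirtied_before;
}
```

With this cutoff, an inode dirtied before the sync started would be
written; if a concurrent path re-stamps `dirtied_when` to the current
time while requeueing it (as the rejected patch does), the same sync pass
no longer selects it, which is the data-integrity hole Jan describes.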
end of thread, other threads:[~2020-06-02 12:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-22  9:57 Writeback bug causing writeback stalls Martijn Coenen
2020-05-22 14:41 ` Jan Kara
2020-05-22 15:23   ` Martijn Coenen
2020-05-22 15:36     ` Jan Kara
2020-05-23  8:15       ` Martijn Coenen
2020-05-25  7:31         ` Jan Kara
2020-05-27  8:14           ` Martijn Coenen
2020-05-29 15:20             ` Jan Kara
2020-05-29 19:37               ` Martijn Coenen
2020-06-01  9:09                 ` Jan Kara
2020-06-02 12:16                   ` Martijn Coenen
     [not found] ` <20200524140522.14196-1-hdanton@sina.com>
2020-05-25  7:38 ` Jan Kara