From: Tejun Heo <tj@kernel.org> To: axboe@kernel.dk Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org, hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com, lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com, david@fromorbit.com, gthelen@google.com, Tejun Heo <tj@kernel.org> Subject: [PATCH 8/8] writeback: implement foreign cgroup inode bdi_writeback switching Date: Mon, 23 Mar 2015 01:25:44 -0400 [thread overview] Message-ID: <1427088344-17542-9-git-send-email-tj@kernel.org> (raw) In-Reply-To: <1427088344-17542-1-git-send-email-tj@kernel.org> As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching. With the previously implemented [unlocked_]inode_to_wb_and_list_lock() and wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wb_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> --- fs/fs-writeback.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 7a1ab24..5fc7828 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -308,30 +308,112 @@ static void inode_switch_wb_work_fn(struct work_struct *work) struct inode_switch_wb_context *isw = container_of(work, struct inode_switch_wb_context, work); struct inode *inode = isw->inode; + struct address_space *mapping = inode->i_mapping; + struct bdi_writeback *old_wb = inode->i_wb; struct bdi_writeback *new_wb = isw->new_wb; + struct radix_tree_iter iter; + bool switched = false; + void **slot; /* * By the time control reaches here, RCU grace period has passed * since I_WB_SWITCH assertion and all wb stat update transactions * between inode_wb_stat_unlocked_begin/end() are guaranteed to be * synchronizing against mapping->tree_lock. + * + * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock + * gives us exclusion against all wb related operations on @inode + * including IO list manipulations and stat updates. */ + if (old_wb < new_wb) { + spin_lock(&old_wb->list_lock); + spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock(&new_wb->list_lock); + spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); + } spin_lock(&inode->i_lock); + spin_lock_irq(&mapping->tree_lock); + + /* + * Once I_FREEING is visible under i_lock, the eviction path owns + * the inode and we shouldn't modify ->i_wb_list. + */ + if (unlikely(inode->i_state & I_FREEING)) + goto skip_switch; + /* + * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points + * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to + * pages actually under underwriteback. + */ + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_DIRTY) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page) && PageDirty(page)) { + __dec_wb_stat(old_wb, WB_RECLAIMABLE); + __inc_wb_stat(new_wb, WB_RECLAIMABLE); + } + } + + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_WRITEBACK) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page)) { + WARN_ON_ONCE(!PageWriteback(page)); + __dec_wb_stat(old_wb, WB_WRITEBACK); + __inc_wb_stat(new_wb, WB_WRITEBACK); + } + } + + wb_get(new_wb); + + /* + * Transfer to @new_wb's IO list if necessary. The specific list + * @inode was on is ignored and the inode is put on ->b_dirty which + * is always correct including from ->b_dirty_time. The transfer + * preserves @inode->dirtied_when ordering. + */ + if (!list_empty(&inode->i_wb_list)) { + struct inode *pos; + + inode_wb_list_del_locked(inode, old_wb); + inode->i_wb = new_wb; + list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list) + if (time_after_eq(inode->dirtied_when, + pos->dirtied_when)) + break; + inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev); + } else { + inode->i_wb = new_wb; + } + + /* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */ inode->i_wb_frn_winner = 0; inode->i_wb_frn_avg_time = 0; inode->i_wb_frn_history = 0; - + switched = true; +skip_switch: /* * Paired with load_acquire in inode_wb_stat_unlocked_begin() and * ensures that the new wb is visible if they see !I_WB_SWITCH. */ smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH); + spin_unlock_irq(&mapping->tree_lock); spin_unlock(&inode->i_lock); + spin_unlock(&new_wb->list_lock); + spin_unlock(&old_wb->list_lock); - iput(inode); + if (switched) { + wb_wakeup(new_wb); + wb_put(old_wb); + } wb_put(new_wb); + + iput(inode); kfree(isw); } -- 2.1.0
WARNING: multiple messages have this Message-ID (diff)
From: Tejun Heo <tj@kernel.org> To: axboe@kernel.dk Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org, hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com, lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com, david@fromorbit.com, gthelen@google.com, Tejun Heo <tj@kernel.org> Subject: [PATCH 8/8] writeback: implement foreign cgroup inode bdi_writeback switching Date: Mon, 23 Mar 2015 01:25:44 -0400 [thread overview] Message-ID: <1427088344-17542-9-git-send-email-tj@kernel.org> (raw) In-Reply-To: <1427088344-17542-1-git-send-email-tj@kernel.org> As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching. With the previously implemented [unlocked_]inode_to_wb_and_list_lock() and wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wb_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> --- fs/fs-writeback.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 7a1ab24..5fc7828 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -308,30 +308,112 @@ static void inode_switch_wb_work_fn(struct work_struct *work) struct inode_switch_wb_context *isw = container_of(work, struct inode_switch_wb_context, work); struct inode *inode = isw->inode; + struct address_space *mapping = inode->i_mapping; + struct bdi_writeback *old_wb = inode->i_wb; struct bdi_writeback *new_wb = isw->new_wb; + struct radix_tree_iter iter; + bool switched = false; + void **slot; /* * By the time control reaches here, RCU grace period has passed * since I_WB_SWITCH assertion and all wb stat update transactions * between inode_wb_stat_unlocked_begin/end() are guaranteed to be * synchronizing against mapping->tree_lock. + * + * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock + * gives us exclusion against all wb related operations on @inode + * including IO list manipulations and stat updates. */ + if (old_wb < new_wb) { + spin_lock(&old_wb->list_lock); + spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock(&new_wb->list_lock); + spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); + } spin_lock(&inode->i_lock); + spin_lock_irq(&mapping->tree_lock); + + /* + * Once I_FREEING is visible under i_lock, the eviction path owns + * the inode and we shouldn't modify ->i_wb_list. + */ + if (unlikely(inode->i_state & I_FREEING)) + goto skip_switch; + /* + * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points + * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to + * pages actually under underwriteback. + */ + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_DIRTY) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page) && PageDirty(page)) { + __dec_wb_stat(old_wb, WB_RECLAIMABLE); + __inc_wb_stat(new_wb, WB_RECLAIMABLE); + } + } + + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_WRITEBACK) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page)) { + WARN_ON_ONCE(!PageWriteback(page)); + __dec_wb_stat(old_wb, WB_WRITEBACK); + __inc_wb_stat(new_wb, WB_WRITEBACK); + } + } + + wb_get(new_wb); + + /* + * Transfer to @new_wb's IO list if necessary. The specific list + * @inode was on is ignored and the inode is put on ->b_dirty which + * is always correct including from ->b_dirty_time. The transfer + * preserves @inode->dirtied_when ordering. + */ + if (!list_empty(&inode->i_wb_list)) { + struct inode *pos; + + inode_wb_list_del_locked(inode, old_wb); + inode->i_wb = new_wb; + list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list) + if (time_after_eq(inode->dirtied_when, + pos->dirtied_when)) + break; + inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev); + } else { + inode->i_wb = new_wb; + } + + /* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */ inode->i_wb_frn_winner = 0; inode->i_wb_frn_avg_time = 0; inode->i_wb_frn_history = 0; - + switched = true; +skip_switch: /* * Paired with load_acquire in inode_wb_stat_unlocked_begin() and * ensures that the new wb is visible if they see !I_WB_SWITCH. */ smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH); + spin_unlock_irq(&mapping->tree_lock); spin_unlock(&inode->i_lock); + spin_unlock(&new_wb->list_lock); + spin_unlock(&old_wb->list_lock); - iput(inode); + if (switched) { + wb_wakeup(new_wb); + wb_put(old_wb); + } wb_put(new_wb); + + iput(inode); kfree(isw); } -- 2.1.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-03-23 5:26 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2015-03-23 5:25 [PATCHSET 3/3 block/for-4.1/core] writeback: implement foreign cgroup inode bdi_writeback switching Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 1/8] writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 2/8] writeback: make writeback_control track the inode being written back Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 3/8] writeback: implement foreign cgroup inode detection Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 4/8] truncate: swap the order of conditionals in cancel_dirty_page() Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 5/8] writeback: implement [locked_]inode_to_wb_and_lock_list() Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 6/8] writeback: implement I_WB_SWITCH and bdi_writeback stat update transaction Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` [PATCH 7/8] writeback: add lockdep annotation to inode_to_wb() Tejun Heo 2015-03-23 5:25 ` Tejun Heo 2015-03-23 5:25 ` Tejun Heo [this message] 2015-03-23 5:25 ` [PATCH 8/8] writeback: implement foreign cgroup inode bdi_writeback switching Tejun Heo 2015-03-25 22:44 ` [PATCH 9/8] writeback: disassociate inodes from dying bdi_writebacks Tejun Heo 2015-03-25 22:44 ` Tejun Heo 2015-03-25 22:44 ` Tejun Heo 2015-03-25 22:44 ` Tejun Heo
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=1427088344-17542-9-git-send-email-tj@kernel.org \ --to=tj@kernel.org \ --cc=axboe@kernel.dk \ --cc=cgroups@vger.kernel.org \ --cc=clm@fb.com \ --cc=david@fromorbit.com \ --cc=fengguang.wu@intel.com \ --cc=gthelen@google.com \ --cc=hannes@cmpxchg.org \ --cc=hch@infradead.org \ --cc=jack@suse.cz \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=lizefan@huawei.com \ --cc=mhocko@suse.cz \ --cc=vgoyal@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.