* [PATCH] fs/writeback: Attach inode's wb to root if needed
@ 2019-05-09  8:03 zhangliguang
  2019-05-09 16:48 ` Tejun Heo
  0 siblings, 1 reply; 5+ messages in thread

From: zhangliguang @ 2019-05-09  8:03 UTC
To: tj, akpm; +Cc: cgroups, linux-mm

There might be tons of files queued in the writeback, waiting to be
written back, while the cgroup that owns the writeback has already
died. In this case we try to reassociate the inode with another
writeback cgroup, but we may not be able to, because the writeback
associated with the dead cgroup is the only valid one. A new writeback
is then allocated, initialized and associated with the inode, which
causes unnecessarily high system load and latency.

This fixes the issue by forcing the inode over to the root cgroup's
writeback when the cgroup it was bound to dies. With it, no more
unnecessary writebacks are created and populated, and the system load
decreased by about 6x in the online service where we encountered the
problem:

	Without the patch: about 30% system load
	With the patch:    about  5% system load

The patch also changes the perf graph significantly:
========================================================================
We recorded the trace with 'perf record cycles:k -g -a'.

The trace without the patch:

+  44.68%  [kernel]  [k] native_queued_spin_lock_slowpath
+   3.38%  [kernel]  [k] memset_erms
+   3.04%  [kernel]  [k] pcpu_alloc_area
+   1.46%  [kernel]  [k] __memmove
+   1.37%  [kernel]  [k] pcpu_alloc

Detailed information about native_queued_spin_lock_slowpath:

   44.68%  [kernel]  [k] native_queued_spin_lock_slowpath
   - native_queued_spin_lock_slowpath
      - _raw_spin_lock_irqsave
         - 68.80% pcpu_alloc
              __alloc_percpu_gfp
            - 85.01% __percpu_counter_init
               - 66.75% wb_init
                    wb_get_create
                    inode_switch_wbs
                    wbc_attach_and_unlock_inode
                    __filemap_fdatawrite_range
                  + filemap_write_and_wait_range
               + 33.25% fprop_local_init_percpu
               + 14.99% percpu_ref_init
         + 30.77% free_percpu

System load (by top): %Cpu(s): 31.9% sy

With the patch:

+   4.45%  [kernel]  [k] native_queued_spin_lock_slowpath
+   3.32%  [kernel]  [k] put_compound_page
+   2.20%  [kernel]  [k] gup_pte_range
+   1.91%  [kernel]  [k] kstat_irqs

System load (by top): %Cpu(s): 5.4% sy

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
---
 fs/fs-writeback.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1..e7e19d8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -696,6 +696,13 @@ void wbc_detach_inode(struct writeback_control *wbc)
 	inode->i_wb_frn_avg_time = min(avg_time, (unsigned long)U16_MAX);
 	inode->i_wb_frn_history = history;
 
+	/*
+	 * The wb is switched to the root memcg unconditionally. We expect
+	 * the correct wb (best candidate) to be picked up in the next round.
+	 */
+	if (wb == inode->i_wb && wb_dying(wb) && !(inode->i_state & I_DIRTY_ALL))
+		inode_switch_wbs(inode, root_mem_cgroup->css.id);
+
 	wb_put(wbc->wb);
 	wbc->wb = NULL;
 }
-- 
1.8.3.1
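The check the hunk above adds can be modeled in a few lines of user-space C. This is a sketch only: the struct layouts, the I_DIRTY_ALL bit value and the helper name below are illustrative stand-ins for the kernel's, not the real definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define I_DIRTY_ALL 0x1U  /* stand-in bit; the real kernel mask differs */

struct wb_model {
	bool dying;                 /* models wb_dying(wb) */
};

struct inode_model {
	struct wb_model *i_wb;      /* currently attached writeback */
	unsigned int i_state;
};

/*
 * Models the condition added to wbc_detach_inode(): punt a clean inode
 * whose wb is dying over to the root wb, so the next writer does not
 * allocate a fresh wb for a dead cgroup.
 */
static bool should_punt_to_root(const struct inode_model *inode,
				const struct wb_model *wbc_wb)
{
	return wbc_wb == inode->i_wb &&
	       wbc_wb->dying &&
	       !(inode->i_state & I_DIRTY_ALL);
}
```

The I_DIRTY_ALL guard matters: a dirty inode still needs its current wb to finish writing it out, so only clean inodes are moved eagerly.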
* Re: [PATCH] fs/writeback: Attach inode's wb to root if needed

From: Tejun Heo @ 2019-05-09 16:48 UTC
To: zhangliguang; +Cc: akpm, cgroups, linux-mm

Hello,

On Thu, May 09, 2019 at 04:03:53PM +0800, zhangliguang wrote:
> There might be tons of files queued in the writeback, waiting to be
> written back, while the cgroup that owns the writeback has already
> died. In this case we try to reassociate the inode with another
> writeback cgroup, but we may not be able to, because the writeback
> associated with the dead cgroup is the only valid one. A new writeback
> is then allocated, initialized and associated with the inode, which
> causes unnecessarily high system load and latency.
>
> This fixes the issue by forcing the inode over to the root cgroup's
> writeback when the cgroup it was bound to dies. With it, no more
> unnecessary writebacks are created and populated, and the system load
> decreased by about 6x in the online service where we encountered the
> problem:
> 	Without the patch: about 30% system load
> 	With the patch:    about  5% system load

Can you please describe the scenario in more detail? I'm having a bit
of a hard time understanding where the cpu cycles are being consumed.

Thanks.

-- 
tejun
* Re: [PATCH] fs/writeback: Attach inode's wb to root if needed

From: 乱石 @ 2019-05-10  1:54 UTC
To: Tejun Heo; +Cc: akpm, cgroups, linux-mm

Hi Tejun,

On 2019/5/10 0:48, Tejun Heo wrote:
> Hello,
>
> On Thu, May 09, 2019 at 04:03:53PM +0800, zhangliguang wrote:
>> [...]
>>
>> This fixes the issue by forcing the inode over to the root cgroup's
>> writeback when the cgroup it was bound to dies. With it, no more
>> unnecessary writebacks are created and populated, and the system load
>> decreased by about 6x in the online service where we encountered the
>> problem:
>> 	Without the patch: about 30% system load
>> 	With the patch:    about  5% system load
>
> Can you please describe the scenario in more detail? I'm having a bit
> of a hard time understanding where the cpu cycles are being consumed.
>
> Thanks.

Our search product line reported a problem: when containerA was
removed, containerB's and containerC's system load rose to 30%. We
recorded a trace with 'perf record cycles:k -g -a' and found that
wb_init was the hotspot function.

Function call chain:

generic_file_direct_write
  filemap_write_and_wait_range
    __filemap_fdatawrite_range
      wbc_attach_fdatawrite_inode
        inode_attach_wb
          __inode_attach_wb
            wb_get_create
      wbc_attach_and_unlock_inode
        if (unlikely(wb_dying(wbc->wb)))
          inode_switch_wbs
            wb_get_create
              ; search bdi->cgwb_tree by memcg_css->id
              ; OR cgwb_create
                  kmalloc
                  wb_init        // hot spot
                  ; insert into bdi->cgwb_tree, memcg_css->id as key

We discussed this and, based on the analysis, here is our
understanding: when we run into the issue, cgroups are being deleted,
and the inodes (files) that were associated with those cgroups have to
switch to another, newly created writeback. There is a huge number of
inodes on the writeback list at that point, so we don't think anything
abnormal is happening as such. However, one thing we can do is force
these inodes onto the BDI-embedded writeback, so that we don't need to
create a huge number of writebacks and can avoid the high system load.
We expect the correct wb (best candidate) to be picked up in the next
round.

Thanks,
Liguang
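The lookup-or-create behavior in the call chain above, and why a dying cgroup turns wb_init into a hot spot, can be sketched with a toy user-space model. Everything here is an illustrative assumption, not kernel code: the array stands in for bdi->cgwb_tree, and the counter stands in for the kmalloc + wb_init work that perf flagged.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_IDS 16

static bool cgwb_cached[NR_IDS];   /* toy bdi->cgwb_tree, keyed by css id */
static int wb_init_calls;          /* counts the hot-spot initializations */

/*
 * Lookup-or-create, as in the call chain above. For a dying css the
 * newly created wb is torn down again shortly (its cgroup is going
 * away), so it never stays in the tree and every subsequent write
 * repeats the kmalloc + wb_init work.
 */
static void wb_get_create_model(int memcg_id, bool css_dying)
{
	if (cgwb_cached[memcg_id])
		return;                       /* fast path: found in cgwb_tree */
	wb_init_calls++;                      /* cgwb_create -> kmalloc + wb_init */
	if (!css_dying)
		cgwb_cached[memcg_id] = true; /* live css: wb stays cached */
}
```

For a live cgroup the expensive path runs once and every later call hits the cache; for a dying cgroup it runs on every call, which is the load pattern the patch is attacking.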
* Re: [PATCH] fs/writeback: Attach inode's wb to root if needed

From: Dennis Zhou @ 2019-05-13 18:30 UTC
To: 乱石; +Cc: Tejun Heo, akpm, cgroups, linux-mm

Hi Liguang,

On Fri, May 10, 2019 at 09:54:27AM +0800, 乱石 wrote:
> Hi Tejun,
>
> [...]
>
> Our search product line reported a problem: when containerA was
> removed, containerB's and containerC's system load rose to 30%. We
> recorded a trace with 'perf record cycles:k -g -a' and found that
> wb_init was the hotspot function.
>
> Function call chain:
>
> generic_file_direct_write
>   filemap_write_and_wait_range
>     __filemap_fdatawrite_range
>       wbc_attach_fdatawrite_inode
>         inode_attach_wb
>           __inode_attach_wb
>             wb_get_create
>       wbc_attach_and_unlock_inode
>         if (unlikely(wb_dying(wbc->wb)))
>           inode_switch_wbs
>             wb_get_create
>               ; search bdi->cgwb_tree by memcg_css->id
>               ; OR cgwb_create
>                   kmalloc
>                   wb_init        // hot spot
>                   ; insert into bdi->cgwb_tree, memcg_css->id as key
>
> We discussed this and, based on the analysis, here is our
> understanding: when we run into the issue, cgroups are being deleted,
> and the inodes (files) that were associated with those cgroups have to
> switch to another, newly created writeback. There is a huge number of
> inodes on the writeback list at that point, so we don't think anything
> abnormal is happening as such. However, one thing we can do is force
> these inodes onto the BDI-embedded writeback, so that we don't need to
> create a huge number of writebacks and can avoid the high system load.
> We expect the correct wb (best candidate) to be picked up in the next
> round.

If I understand correctly, this is most likely caused by a file shared
by cgroup A and cgroup B: cgroup B is doing direct io against a file
owned by the dying cgroup A. In this case the code tries to do a wb
switch, but the switch keeps reallocating the wb, because the old one
has been deleted and the lookup is for the original cgroup A's memcg
id.

I think the below may be a better solution. Could you please test it?
If it works, I'll spin a patch with a more involved description.

Thanks,
Dennis

---
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..fb331ea2a626 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,7 +577,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg. Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
+	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
 		inode_switch_wbs(inode, wbc->wb_id);
 }
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..685563ed9788 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -659,7 +659,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 
 	might_sleep_if(gfpflags_allow_blocking(gfp));
 
-	if (!memcg_css->parent)
+	if (!memcg_css->parent || css_is_dying(memcg_css))
 		return &bdi->wb;
 
 	do {
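The effect of the two hunks above can be sketched as a pair of predicates in user-space C. This is a hedged model only: the struct and helper names are made up for illustration, and the booleans stand in for wb_dying(), css_is_dying() and the css parent check.

```c
#include <assert.h>
#include <stdbool.h>

struct wb_state {
	bool wb_dying;          /* models wb_dying(wbc->wb) */
	bool memcg_css_dying;   /* models css_is_dying(wbc->wb->memcg_css) */
};

/*
 * First hunk (wbc_attach_and_unlock_inode): only attempt the switch
 * when the memcg itself is still alive, i.e. a replacement wb for it
 * can actually exist.
 */
static bool should_switch(const struct wb_state *s)
{
	return s->wb_dying && !s->memcg_css_dying;
}

/*
 * Second hunk (wb_get_create): hand out the bdi-embedded root wb for
 * the root css or a dying css, instead of allocating a throwaway wb.
 */
static bool uses_root_wb(bool has_parent, bool css_dying)
{
	return !has_parent || css_dying;
}
```

Together the two conditions close the loop from the reported trace: a dying css no longer triggers repeated inode_switch_wbs/cgwb_create cycles, and any lookup against it falls back to &bdi->wb.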
* Re: [PATCH] fs/writeback: Attach inode's wb to root if needed

From: 乱石 @ 2019-05-16  5:54 UTC
To: Dennis Zhou; +Cc: Tejun Heo, akpm, cgroups, linux-mm

Hi Dennis,

Sorry for the late reply. Because I can't reproduce this problem in a
local test, and we are not allowed to operate on the online
environment, I have been constructing a scenario to reproduce it over
the last few days.

On 2019/5/14 2:30, Dennis Zhou wrote:
> Hi Liguang,
>
> On Fri, May 10, 2019 at 09:54:27AM +0800, 乱石 wrote:
>> [...]
>
> If I understand correctly, this is most likely caused by a file shared
> by cgroup A and cgroup B: cgroup B is doing direct io against a file
> owned by the dying cgroup A. In this case the code tries to do a wb
> switch, but the switch keeps reallocating the wb, because the old one
> has been deleted and the lookup is for the original cgroup A's memcg
> id.
>
> I think the below may be a better solution. Could you please test it?
> If it works, I'll spin a patch with a more involved description.
>
> Thanks,
> Dennis
>
> ---
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 36855c1f8daf..fb331ea2a626 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -577,7 +577,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>  	 * A dying wb indicates that the memcg-blkcg mapping has changed
>  	 * and a new wb is already serving the memcg. Switch immediately.
>  	 */
> -	if (unlikely(wb_dying(wbc->wb)))
> +	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
>  		inode_switch_wbs(inode, wbc->wb_id);
>  }
>  
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 72e6d0c55cfa..685563ed9788 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -659,7 +659,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
>  
>  	might_sleep_if(gfpflags_allow_blocking(gfp));
>  
> -	if (!memcg_css->parent)
> +	if (!memcg_css->parent || css_is_dying(memcg_css))
>  		return &bdi->wb;
>  
>  	do {
Thread overview: 5 messages

2019-05-09  8:03 [PATCH] fs/writeback: Attach inode's wb to root if needed  zhangliguang
2019-05-09 16:48 ` Tejun Heo
2019-05-10  1:54   ` 乱石
2019-05-13 18:30     ` Dennis Zhou
2019-05-16  5:54       ` 乱石