Linux-mm Archive on lore.kernel.org
From: Dennis Zhou <dennis@kernel.org>
To: 乱石 <zhangliguang@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>,
	akpm@linux-foundation.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH] fs/writeback: Attach inode's wb to root if needed
Date: Mon, 13 May 2019 14:30:53 -0400
Message-ID: <20190513183053.GA73423@dennisz-mbp> (raw)
In-Reply-To: <a5bb3773-fef5-ce2b-33b9-18e0d49c33c4@linux.alibaba.com>

Hi Liguang,

On Fri, May 10, 2019 at 09:54:27AM +0800, 乱石 wrote:
> Hi Tejun,
> 
> On 2019/5/10 0:48, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, May 09, 2019 at 04:03:53PM +0800, zhangliguang wrote:
> > > There might be tons of files queued in the writeback, waiting to be
> > > written back. Unfortunately, the cgroup that owns the writeback may
> > > have died. In that case we try to reassociate the inode with another
> > > writeback cgroup, but we possibly can't, because the writeback
> > > associated with the dead cgroup is the only valid one. The new
> > > writeback is then allocated, initialized and associated with the
> > > inode, which causes unnecessarily high system load and latency.
> > > 
> > > This fixes the issue by forcing the inode to move to the root cgroup
> > > when the previously bound cgroup becomes dead. With it, no more
> > > unnecessary writebacks are created and populated, and the system load
> > > decreased by about 6x in the online service where we encountered it:
> > >      Without the patch: about 30% system load
> > >      With the patch:    about  5% system load
> > Can you please describe the scenario in more detail?  I'm having a
> > bit of a hard time understanding the amount of CPU cycles being
> > consumed.
> > 
> > Thanks.
> 
> Our search business line reported a problem: when containerA was removed,
> containerB's and containerC's system load went up to about 30%.
> 
> We recorded a trace with 'perf record cycles:k -g -a' and found that
> wb_init was the hotspot function.
> 
> Function call:
> 
> generic_file_direct_write
>    filemap_write_and_wait_range
>       __filemap_fdatawrite_range
>          wbc_attach_fdatawrite_inode
>             inode_attach_wb
>                __inode_attach_wb
>                   wb_get_create
>             wbc_attach_and_unlock_inode
>                if (unlikely(wb_dying(wbc->wb)))
>                   inode_switch_wbs
>                      wb_get_create
>                         ; Search bdi->cgwb_tree by memcg_css->id
>                         ; OR cgwb_create
>                            kmalloc
>                            wb_init       // hot spot
>                            ; Insert into bdi->cgwb_tree, memcg_css->id as key
> 
> We discussed it and, based on the analysis: when we run into the issue,
> cgroups are being deleted. The inodes (files) that were associated with
> these cgroups have to switch to other, newly created writebacks. We think
> there is a huge number of inodes in the writeback list at that time, so we
> don't think anything abnormal is going on. However, one thing we can
> possibly do: force these inodes onto the BDI-embedded writeback, so that we
> needn't create a huge number of writebacks and can avoid the high system
> load. We expect the correct wb (best candidate) to be picked up in the
> next round.
> 
> Thanks,
> Liguang
> 
> > 
> 

If I understand correctly, this is most likely caused by a file shared by
cgroup A and cgroup B. This means cgroup B is doing direct IO against a
file owned by the dying cgroup A. In this case, the code tries to do a wb
switch. However, the switch cannot find an existing wb, because the old one
has been deleted and the lookup is keyed to the original cgroup A's memcg
id, so a new wb gets allocated on every attempt.
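
To make that path concrete, the lookup-or-create loop looks roughly like
this (a simplified sketch of wb_get_create() in mm/backing-dev.c, with
locking and error handling omitted -- not the exact source):

struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
				    struct cgroup_subsys_state *memcg_css,
				    gfp_t gfp)
{
	struct bdi_writeback *wb;

	if (!memcg_css->parent)		/* root memcg uses the embedded wb */
		return &bdi->wb;

	do {
		/* look up an existing wb keyed by memcg_css->id */
		wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
		/*
		 * Once cgroup A's wb is gone from the tree, the lookup
		 * misses and cgwb_create() allocates and wb_init()s a new
		 * wb on every call -- the wb_init() hot spot in your trace.
		 */
	} while (!wb && !cgwb_create(bdi, memcg_css, gfp));

	return wb;
}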

I think the below may be a better solution. Could you please test it? If
it works, I'll spin a patch with a more involved description.

Thanks,
Dennis

---
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..fb331ea2a626 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,7 +577,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	 * A dying wb indicates that the memcg-blkcg mapping has changed
 	 * and a new wb is already serving the memcg.  Switch immediately.
 	 */
-	if (unlikely(wb_dying(wbc->wb)))
+	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
 		inode_switch_wbs(inode, wbc->wb_id);
 }
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..685563ed9788 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -659,7 +659,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 
 	might_sleep_if(gfpflags_allow_blocking(gfp));
 
-	if (!memcg_css->parent)
+	if (!memcg_css->parent || css_is_dying(memcg_css))
 		return &bdi->wb;
 
 	do {
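
For what it's worth, with both hunks applied the attach-side check would
read roughly as below -- the same code as the first hunk, just with the
comment expanded to spell out the reasoning (a restatement, not an extra
change):

	/*
	 * A dying wb normally indicates that the memcg-blkcg mapping has
	 * changed and a new wb is already serving the memcg, so switch
	 * immediately.  But if the memcg css itself is dying there is no
	 * replacement wb, and attempting the switch only makes
	 * wb_get_create() allocate a fresh wb each time (the wb_init()
	 * hot spot), so skip it.
	 */
	if (unlikely(wb_dying(wbc->wb)) && !css_is_dying(wbc->wb->memcg_css))
		inode_switch_wbs(inode, wbc->wb_id);

The second hunk covers the other entry points: wb_get_create() against a
dying memcg css now falls back to the bdi-embedded root wb instead of
allocating anything new.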


