All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Roman Gushchin <guro@fb.com>
Cc: Jan Kara <jack@suse.cz>, Tejun Heo <tj@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Dennis Zhou <dennis@kernel.org>,
	Dave Chinner <dchinner@redhat.com>,
	cgroups@vger.kernel.org
Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes
Date: Fri, 28 May 2021 15:05:43 +0200	[thread overview]
Message-ID: <20210528130543.GB28653@quack2.suse.cz> (raw)
In-Reply-To: <YK/bi1OU7bNgPBab@carbon.DHCP.thefacebook.com>

On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > > 
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-glight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > >  include/linux/backing-dev-defs.h |  1 +
> > >  include/linux/writeback.h        |  1 +
> > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > >  	kfree(isw);
> > >  }
> > >  
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> > 
> > I think the comment doesn't match the function anymore.
> > 
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > +	struct inode *inode, *tmp;
> > > +
> > > +	spin_lock(&wb->list_lock);
> > > +restart:
> > > +	list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > +		if (!spin_trylock(&inode->i_lock))
> > > +			continue;
> > > +		xa_lock_irq(&inode->i_mapping->i_pages);
> > > +		if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> > 
> > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > hard time seeing how it would relate to whether inode should switch wbs...
> 
> What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> flag here. So there must be
> 	if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
> 
> Does this look good or I am wrong and there are other flags acceptable here?

Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
and I_OVL_INUSE but that's not that important.

> > > +			struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > +
> > > +			WARN_ON_ONCE(inode->i_wb != wb);
> > > +
> > > +			inode->i_wb = bdi_wb;
> > > +			list_del_init(&inode->i_io_list);
> > > +			wb_put(wb);
> > 
> > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
> 
> My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> while in the cleanup case we can deal only with clean inodes and clean wb's.
> Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> of simultaneous switches is limited and I don't think cleanups should share
> this limit.
> However I agree that it would be nice to share at least some code.

I agree limits on parallel switches should not apply. Otherwise I agree
some bits of inode_switch_wbs_work_fn() should not be strictly necessary
but they should be pretty cheap anyway.

> > That way we have single function handling all the subtleties of switching
> > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > you detach only from b_attached list and move to bdi_wb so things are
> > indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> > and I'd also like to have a comment here explaining why this cannot race
> > with other writeback handling or wb switching in a harmful way.
> 
> If we'll check under wb->list_lock that wb has no inodes on any writeback
> lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be
> 0?

No, pages under writeback are not reflected in inode->i_state in any way.
You would need to check mapping_tagged(inode->i_mapping,
PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
switching inodes as it can handle alive inodes just fine...

> Re racing: my logic here was that we're taking all possible locks before doing
> anything and then we check that the inode is entirely clean, so this must be
> safe:
> 	spin_lock(&wb->list_lock);
> 	spin_trylock(&inode->i_lock);
> 	xa_lock_irq(&inode->i_mapping->i_pages);
> 	...
> 
> But now I see that the unlocked inode's wb access mechanism
> (unlocked_inode_to_wb_begin()/end()) probably requires additional care.

Yeah, exactly corner case like this were not quite clear to me whether you
have them correct or not.

> Repeating the mechanism with scheduling the switching of each inode separately
> after an rcu grace period looks too slow. Maybe we can mark all inodes at once
> and then switch them all at once, all in two steps. I need to think more.
> Do you have any ideas/suggestions here?

Nothing really bright. As you say I'd do this in batches - i.e., tag all
inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
inode_switch_wbs_work_fn() for each inode (or probably some helper function
that has guts of inode_switch_wbs_work_fn() as we probably don't want to
acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
To: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Alexander Viro
	<viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>,
	Dennis Zhou <dennis-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes
Date: Fri, 28 May 2021 15:05:43 +0200	[thread overview]
Message-ID: <20210528130543.GB28653@quack2.suse.cz> (raw)
In-Reply-To: <YK/bi1OU7bNgPBab-lLJQVQxiE4uLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>

On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > > 
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-glight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > > 
> > > Signed-off-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> > > ---
> > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > >  include/linux/backing-dev-defs.h |  1 +
> > >  include/linux/writeback.h        |  1 +
> > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > >  	kfree(isw);
> > >  }
> > >  
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> > 
> > I think the comment doesn't match the function anymore.
> > 
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > +	struct inode *inode, *tmp;
> > > +
> > > +	spin_lock(&wb->list_lock);
> > > +restart:
> > > +	list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > +		if (!spin_trylock(&inode->i_lock))
> > > +			continue;
> > > +		xa_lock_irq(&inode->i_mapping->i_pages);
> > > +		if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> > 
> > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > hard time seeing how it would relate to whether inode should switch wbs...
> 
> What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> flag here. So there must be
> 	if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
> 
> Does this look good or I am wrong and there are other flags acceptable here?

Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
and I_OVL_INUSE but that's not that important.

> > > +			struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > +
> > > +			WARN_ON_ONCE(inode->i_wb != wb);
> > > +
> > > +			inode->i_wb = bdi_wb;
> > > +			list_del_init(&inode->i_io_list);
> > > +			wb_put(wb);
> > 
> > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
> 
> My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> while in the cleanup case we can deal only with clean inodes and clean wb's.
> Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> of simultaneous switches is limited and I don't think cleanups should share
> this limit.
> However I agree that it would be nice to share at least some code.

I agree limits on parallel switches should not apply. Otherwise I agree
some bits of inode_switch_wbs_work_fn() should not be strictly necessary
but they should be pretty cheap anyway.

> > That way we have single function handling all the subtleties of switching
> > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > you detach only from b_attached list and move to bdi_wb so things are
> > indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> > and I'd also like to have a comment here explaining why this cannot race
> > with other writeback handling or wb switching in a harmful way.
> 
> If we'll check under wb->list_lock that wb has no inodes on any writeback
> lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be
> 0?

No, pages under writeback are not reflected in inode->i_state in any way.
You would need to check mapping_tagged(inode->i_mapping,
PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
switching inodes as it can handle alive inodes just fine...

> Re racing: my logic here was that we're taking all possible locks before doing
> anything and then we check that the inode is entirely clean, so this must be
> safe:
> 	spin_lock(&wb->list_lock);
> 	spin_trylock(&inode->i_lock);
> 	xa_lock_irq(&inode->i_mapping->i_pages);
> 	...
> 
> But now I see that the unlocked inode's wb access mechanism
> (unlocked_inode_to_wb_begin()/end()) probably requires additional care.

Yeah, exactly corner case like this were not quite clear to me whether you
have them correct or not.

> Repeating the mechanism with scheduling the switching of each inode separately
> after an rcu grace period looks too slow. Maybe we can mark all inodes at once
> and then switch them all at once, all in two steps. I need to think more.
> Do you have any ideas/suggestions here?

Nothing really bright. As you say I'd do this in batches - i.e., tag all
inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
inode_switch_wbs_work_fn() for each inode (or probably some helper function
that has guts of inode_switch_wbs_work_fn() as we probably don't want to
acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

  parent reply	other threads:[~2021-05-28 13:06 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-26 22:25 [PATCH v5 0/2] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups Roman Gushchin
2021-05-26 22:25 ` Roman Gushchin
2021-05-26 22:25 ` [PATCH v5 1/2] writeback, cgroup: keep list of inodes attached to bdi_writeback Roman Gushchin
2021-05-26 22:25   ` Roman Gushchin
2021-05-27 10:35   ` Jan Kara
2021-05-27 10:35     ` Jan Kara
2021-05-27 16:32     ` Roman Gushchin
2021-05-27 16:32       ` Roman Gushchin
2021-05-26 22:25 ` [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes Roman Gushchin
2021-05-26 22:25   ` Roman Gushchin
2021-05-27  3:24   ` Hillf Danton
2021-05-27 16:29     ` Roman Gushchin
2021-05-27 16:29       ` Roman Gushchin
2021-05-27 11:24   ` Jan Kara
2021-05-27 11:24     ` Jan Kara
2021-05-27 17:48     ` Roman Gushchin
2021-05-27 17:48       ` Roman Gushchin
2021-05-27 19:45       ` Roman Gushchin
2021-05-27 19:45         ` Roman Gushchin
2021-05-28 13:05       ` Jan Kara [this message]
2021-05-28 13:05         ` Jan Kara
2021-05-28 16:25         ` Roman Gushchin
2021-05-28 16:25           ` Roman Gushchin
2021-05-28  2:58   ` Ming Lei
2021-05-28 16:22     ` Roman Gushchin
2021-05-28 16:22       ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210528130543.GB28653@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=cgroups@vger.kernel.org \
    --cc=dchinner@redhat.com \
    --cc=dennis@kernel.org \
    --cc=guro@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.