linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
       [not found]   ` <CAAeU0aNAd1Ra6LXmWwq8row4MD_BpVHiSXOwHx07m86UWREvHw@mail.gmail.com>
@ 2016-02-16 18:24     ` Tejun Heo
  2016-02-16 18:34       ` Jens Axboe
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-16 18:24 UTC (permalink / raw)
  To: Jens Axboe, Tahsin Erdogan
  Cc: cgroups, Theodore Ts'o, Nauman Rafique, linux-kernel, Jan Kara

>From 586afaa034bec88934bad4eb6ab38ba07031ec5a Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Tue, 16 Feb 2016 13:14:35 -0500

If cgroup writeback is in use, an inode is associated with a cgroup
for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
superblock while such switches are in progress and superblock could go
away while async switching is pending or in progress leading to
crashes like the following.

 kernel BUG at fs/jbd2/transaction.c:319!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
 Hardware name: Google Google, BIOS Google 01/01/2011
 Workqueue: events inode_switch_wbs_work_fn
 task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
 RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
 RSP: 0018:ffff880209267c30  EFLAGS: 00010202
 ...
 Call Trace:
  [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
  [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
  [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
  [<ffffffff8035338b>] evict+0xbb/0x190
  [<ffffffff80354190>] iput+0x130/0x190
  [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
  [<ffffffff80279819>] process_one_work+0x129/0x300
  [<ffffffff80279b16>] worker_thread+0x126/0x480
  [<ffffffff8027ed14>] kthread+0xc4/0xe0
  [<ffffffff809771df>] ret_from_fork+0x3f/0x70

Fix it by bumping s_active while cgroup association switching is in
flight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Cc: stable@vger.kernel.org #v4.5+
---
 fs/fs-writeback.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c95..1f76d89 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -317,6 +317,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	struct inode_switch_wbs_context *isw =
 		container_of(work, struct inode_switch_wbs_context, work);
 	struct inode *inode = isw->inode;
+	struct super_block *sb = inode->i_sb;
 	struct address_space *mapping = inode->i_mapping;
 	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
@@ -423,6 +424,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	wb_put(new_wb);
 
 	iput(inode);
+	deactivate_super(sb);
 	kfree(isw);
 }
 
@@ -469,11 +471,14 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 
 	/* while holding I_WB_SWITCH, no one else can update the association */
 	spin_lock(&inode->i_lock);
+
 	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
-	    inode_to_wb(inode) == isw->new_wb) {
-		spin_unlock(&inode->i_lock);
-		goto out_free;
-	}
+	    inode_to_wb(inode) == isw->new_wb)
+		goto out_unlock;
+
+	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
+		goto out_unlock;
+
 	inode->i_state |= I_WB_SWITCH;
 	spin_unlock(&inode->i_lock);
 
@@ -489,6 +494,8 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
 	return;
 
+out_unlock:
+	spin_unlock(&inode->i_lock);
 out_free:
 	if (isw->new_wb)
 		wb_put(isw->new_wb);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-16 18:24     ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Tejun Heo
@ 2016-02-16 18:34       ` Jens Axboe
  2016-02-17 20:57       ` Jan Kara
  2016-02-18 10:12       ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Nikolay Borisov
  2 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2016-02-16 18:34 UTC (permalink / raw)
  To: Tejun Heo, Tahsin Erdogan
  Cc: cgroups, Theodore Ts'o, Nauman Rafique, linux-kernel, Jan Kara

On 02/16/2016 11:24 AM, Tejun Heo wrote:
>  From 586afaa034bec88934bad4eb6ab38ba07031ec5a Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Tue, 16 Feb 2016 13:14:35 -0500
>
> If cgroup writeback is in use, an inode is associated with a cgroup
> for writeback.  If the inode's main dirtier changes to another cgroup,
> the association gets updated asynchronously.  Nothing was pinning the
> superblock while such switches are in progress and superblock could go
> away while async switching is pending or in progress leading to
> crashes like the following.
>
>   kernel BUG at fs/jbd2/transaction.c:319!
>   invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>   CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
>   Hardware name: Google Google, BIOS Google 01/01/2011
>   Workqueue: events inode_switch_wbs_work_fn
>   task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
>   RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
>   RSP: 0018:ffff880209267c30  EFLAGS: 00010202
>   ...
>   Call Trace:
>    [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
>    [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
>    [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
>    [<ffffffff8035338b>] evict+0xbb/0x190
>    [<ffffffff80354190>] iput+0x130/0x190
>    [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
>    [<ffffffff80279819>] process_one_work+0x129/0x300
>    [<ffffffff80279b16>] worker_thread+0x126/0x480
>    [<ffffffff8027ed14>] kthread+0xc4/0xe0
>    [<ffffffff809771df>] ret_from_fork+0x3f/0x70
>
> Fix it by bumping s_active while cgroup association switching is in
> flight.

Added for 4.5.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-16 18:24     ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Tejun Heo
  2016-02-16 18:34       ` Jens Axboe
@ 2016-02-17 20:57       ` Jan Kara
  2016-02-17 21:07         ` Tejun Heo
  2016-02-18 10:12       ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Nikolay Borisov
  2 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-02-17 20:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Tahsin Erdogan, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Tue 16-02-16 13:24:57, Tejun Heo wrote:
> From 586afaa034bec88934bad4eb6ab38ba07031ec5a Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Tue, 16 Feb 2016 13:14:35 -0500
> 
> If cgroup writeback is in use, an inode is associated with a cgroup
> for writeback.  If the inode's main dirtier changes to another cgroup,
> the association gets updated asynchronously.  Nothing was pinning the
> superblock while such switches are in progress and superblock could go
> away while async switching is pending or in progress leading to
> crashes like the following.
> 
>  kernel BUG at fs/jbd2/transaction.c:319!
>  invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>  CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
>  Hardware name: Google Google, BIOS Google 01/01/2011
>  Workqueue: events inode_switch_wbs_work_fn
>  task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
>  RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
>  RSP: 0018:ffff880209267c30  EFLAGS: 00010202
>  ...
>  Call Trace:
>   [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
>   [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
>   [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
>   [<ffffffff8035338b>] evict+0xbb/0x190
>   [<ffffffff80354190>] iput+0x130/0x190
>   [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
>   [<ffffffff80279819>] process_one_work+0x129/0x300
>   [<ffffffff80279b16>] worker_thread+0x126/0x480
>   [<ffffffff8027ed14>] kthread+0xc4/0xe0
>   [<ffffffff809771df>] ret_from_fork+0x3f/0x70
> 
> Fix it by bumping s_active while cgroup association switching is in
> flight.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
> Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
> Cc: stable@vger.kernel.org #v4.5+

Well, but this has the side-effect that trying to umount a filesystem while
migrations are happening will result in EBUSY error. Without obvious reason
why that happens. As an admin I would be rather upset when umount sometimes
returns EBUSY without apparent reason and you have to basically implement a
loop around umount to make it reliable. So a nack from me for this patch.

Traditionally, we have used sb->s_count and sb->s_umount semaphore to pin
superblock while writeback code was working on it. That makes umount block
until we can safely unmount the filesystem and thus doesn't result in these
spurious EBUSY errors. But from a quick look this can be problematic for the
cgroup setting.

Alternatively, you could either cancel all the switching work when
unmounting filesystem or maybe just handle I_WB_SWITCH similarly to I_SYNC
- don't grab inode reference when switching is going on, just make
I_WB_SWITCH pin the inode and wait in evict() for it to be clear (similarly
as we call inode_wait_for_writeback() there).

								Honza
> ---
>  fs/fs-writeback.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6915c95..1f76d89 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -317,6 +317,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
>  	struct inode_switch_wbs_context *isw =
>  		container_of(work, struct inode_switch_wbs_context, work);
>  	struct inode *inode = isw->inode;
> +	struct super_block *sb = inode->i_sb;
>  	struct address_space *mapping = inode->i_mapping;
>  	struct bdi_writeback *old_wb = inode->i_wb;
>  	struct bdi_writeback *new_wb = isw->new_wb;
> @@ -423,6 +424,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
>  	wb_put(new_wb);
>  
>  	iput(inode);
> +	deactivate_super(sb);
>  	kfree(isw);
>  }
>  
> @@ -469,11 +471,14 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
>  
>  	/* while holding I_WB_SWITCH, no one else can update the association */
>  	spin_lock(&inode->i_lock);
> +
>  	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
> -	    inode_to_wb(inode) == isw->new_wb) {
> -		spin_unlock(&inode->i_lock);
> -		goto out_free;
> -	}
> +	    inode_to_wb(inode) == isw->new_wb)
> +		goto out_unlock;
> +
> +	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
> +		goto out_unlock;
> +
>  	inode->i_state |= I_WB_SWITCH;
>  	spin_unlock(&inode->i_lock);
>  
> @@ -489,6 +494,8 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
>  	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
>  	return;
>  
> +out_unlock:
> +	spin_unlock(&inode->i_lock);
>  out_free:
>  	if (isw->new_wb)
>  		wb_put(isw->new_wb);
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-17 20:57       ` Jan Kara
@ 2016-02-17 21:07         ` Tejun Heo
  2016-02-17 22:30           ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-02-17 21:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Tahsin Erdogan, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

Hello, Jan.

On Wed, Feb 17, 2016 at 09:57:21PM +0100, Jan Kara wrote:
> Well, but this has the side-effect that trying to umount a filesystem while
> migrations are happening will result in EBUSY error. Without obvious reason
> why that happens. As an admin I would be rather upset when umount sometimes
> returns EBUSY without apparent reason and you have to basically implement a
> loop around umount to make it reliable. So a nack from me for this patch.

I see.  Can you please point me to the s_active check during umount?
I first tried s_umount but couldn't transfer its ownership to the
worker so ended up doing s_active.  I looked at how s_active is used
and couldn't find where it'd block umount.  may_umount() checks
mnt_count, not s_active, so it looked like holding s_active may delay
destruction of the superblock but not prevent umount.

> Traditionally, we have used sb->s_count and sb->s_umount semaphore to pin
> superblock while writeback code was working on it. That makes umount block
> until we can safely unmount the filesystem and thus doesn't result in these
> spurious EBUSY errors. But from a quick look this can be problematic for the
> cgroup setting.
> 
> Alternatively, you could either cancel all the switching work when
> unmounting filesystem or maybe just handle I_WB_SWITCH similarly to I_SYNC
> - don't grab inode reference when switching is going on, just make
> I_WB_SWITCH pin the inode and wait in evict() for it to be clear (similarly
> as we call inode_wait_for_writeback() there).

Yeah, this is an alternative but likely more involved.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-17 21:07         ` Tejun Heo
@ 2016-02-17 22:30           ` Jan Kara
  2016-02-17 22:41             ` Tahsin Erdogan
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-02-17 22:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, Tahsin Erdogan, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Wed 17-02-16 16:07:44, Tejun Heo wrote:
> Hello, Jan.
> 
> On Wed, Feb 17, 2016 at 09:57:21PM +0100, Jan Kara wrote:
> > Well, but this has the side-effect that trying to umount a filesystem while
> > migrations are happening will result in EBUSY error. Without obvious reason
> > why that happens. As an admin I would be rather upset when umount sometimes
> > returns EBUSY without apparent reason and you have to basically implement a
> > loop around umount to make it reliable. So a nack from me for this patch.
> 
> I see.  Can you please point me to the s_active check during umount?
> I first tried s_umount but couldn't transfer its ownership to the
> worker so ended up doing s_active.  I looked at how s_active is used
> and couldn't find where it'd block umount.  may_umount() checks
> mnt_count, not s_active, so it looked like holding s_active may delay
> destruction of the superblock but not prevent umount.

Bah, sorry. It's too late here. You are right that s_active will just delay
destruction of the superblock until the reference is dropped. So I don't
see obvious issues with what you do and I retract my nack. I still feel
somewhat uneasy about postponing fs shutdown to a workqueue like this but
hopefully there's no hidden catch.

								Honza


> > Traditionally, we have used sb->s_count and sb->s_umount semaphore to pin
> > superblock while writeback code was working on it. That makes umount block
> > until we can safely unmount the filesystem and thus doesn't result in these
> > spurious EBUSY errors. But from a quick look this can be problematic for the
> > cgroup setting.
> > 
> > Alternatively, you could either cancel all the switching work when
> > unmounting filesystem or maybe just handle I_WB_SWITCH similarly to I_SYNC
> > - don't grab inode reference when switching is going on, just make
> > I_WB_SWITCH pin the inode and wait in evict() for it to be clear (similarly
> > as we call inode_wait_for_writeback() there).
> 
> Yeah, this is an alternative but likely more involved.
> 
> Thanks.
> 
> -- 
> tejun
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-17 22:30           ` Jan Kara
@ 2016-02-17 22:41             ` Tahsin Erdogan
  2016-02-17 23:02               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Tahsin Erdogan @ 2016-02-17 22:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

With this patch, I am starting to have issues running fsck immediately
after umount.

*** fsck.ext4 output ***
fsck from util-linux-ng 2.17.2
e2fsck 1.42.12-gg3 (9-Sep-2014)
Warning!  /dev/sdb3 is in use.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 62346243 has zero dtime.  Fix? no


On Wed, Feb 17, 2016 at 2:30 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 17-02-16 16:07:44, Tejun Heo wrote:
>> Hello, Jan.
>>
>> On Wed, Feb 17, 2016 at 09:57:21PM +0100, Jan Kara wrote:
>> > Well, but this has the side-effect that trying to umount a filesystem while
>> > migrations are happening will result in EBUSY error. Without obvious reason
>> > why that happens. As an admin I would be rather upset when umount sometimes
>> > returns EBUSY without apparent reason and you have to basically implement a
>> > loop around umount to make it reliable. So a nack from me for this patch.
>>
>> I see.  Can you please point me to the s_active check during umount?
>> I first tried s_umount but couldn't transfer its ownership to the
>> worker so ended up doing s_active.  I looked at how s_active is used
>> and couldn't find where it'd block umount.  may_umount() checks
>> mnt_count, not s_active, so it looked like holding s_active may delay
>> destruction of the superblock but not prevent umount.
>
> Bah, sorry. It's too late here. You are right that s_active will just delay
> destruction of the superblock until the reference is dropped. So I don't
> see obvious issues with what you do and I retract my nack. I still feel
> somewhat uneasy about postponing fs shutdown to a workqueue like this but
> hopefully there's no hidden catch.
>
>                                                                 Honza
>
>
>> > Traditionally, we have used sb->s_count and sb->s_umount semaphore to pin
>> > superblock while writeback code was working on it. That makes umount block
>> > until we can safely unmount the filesystem and thus doesn't result in these
>> > spurious EBUSY errors. But from a quick look this can be problematic for the
>> > cgroup setting.
>> >
>> > Alternatively, you could either cancel all the switching work when
>> > unmounting filesystem or maybe just handle I_WB_SWITCH similarly to I_SYNC
>> > - don't grab inode reference when switching is going on, just make
>> > I_WB_SWITCH pin the inode and wait in evict() for it to be clear (similarly
>> > as we call inode_wait_for_writeback() there).
>>
>> Yeah, this is an alternative but likely more involved.
>>
>> Thanks.
>>
>> --
>> tejun
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-17 22:41             ` Tahsin Erdogan
@ 2016-02-17 23:02               ` Tejun Heo
  2016-02-18  9:55                 ` Jan Kara
  2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
  0 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-17 23:02 UTC (permalink / raw)
  To: Tahsin Erdogan
  Cc: Jan Kara, Jens Axboe, cgroups, Theodore Ts'o, Nauman Rafique,
	linux-kernel, Jan Kara

On Wed, Feb 17, 2016 at 02:41:25PM -0800, Tahsin Erdogan wrote:
> With this patch, I am starting to have issues running fsck immediately
> after umount.
> 
> *** fsck.ext4 output ***
> fsck from util-linux-ng 2.17.2
> e2fsck 1.42.12-gg3 (9-Sep-2014)
> Warning!  /dev/sdb3 is in use.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 62346243 has zero dtime.  Fix? no

Yeah, that'd be one of the side effects.  Will think more about it.
Jan, do you know why we have both s_active and the rwsem at this
point?  I can't tell why both are needed.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-17 23:02               ` Tejun Heo
@ 2016-02-18  9:55                 ` Jan Kara
  2016-02-18 13:00                   ` Tejun Heo
  2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2016-02-18  9:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Wed 17-02-16 18:02:31, Tejun Heo wrote:
> On Wed, Feb 17, 2016 at 02:41:25PM -0800, Tahsin Erdogan wrote:
> > With this patch, I am starting to have issues running fsck immediately
> > after umount.
> > 
> > *** fsck.ext4 output ***
> > fsck from util-linux-ng 2.17.2
> > e2fsck 1.42.12-gg3 (9-Sep-2014)
> > Warning!  /dev/sdb3 is in use.
> > Pass 1: Checking inodes, blocks, and sizes
> > Deleted inode 62346243 has zero dtime.  Fix? no
> 
> Yeah, that'd be one of the side effects.  Will think more about it.
> Jan, do you know why we have both s_active and the rwsem at this
> point?  I can't tell why both are needed.

I'm not sure I understand the question. Do you mean why both s_active and
s_umount rwsem exist? s_active is a reference count keeping superblock
alive - e.g. if the filesystem is mounted in more places, we need a
reference for each mountpoint. s_umount is used when we want to block any
umount operation until we are done. For example sync(2) is using it to make
sure superblock doesn't disappear and so that we don't keep superblock
alive after admin called umount(2).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-16 18:24     ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Tejun Heo
  2016-02-16 18:34       ` Jens Axboe
  2016-02-17 20:57       ` Jan Kara
@ 2016-02-18 10:12       ` Nikolay Borisov
  2016-02-18 12:57         ` Tejun Heo
  2 siblings, 1 reply; 30+ messages in thread
From: Nikolay Borisov @ 2016-02-18 10:12 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe, Tahsin Erdogan
  Cc: cgroups, Theodore Ts'o, Nauman Rafique, linux-kernel, Jan Kara



On 02/16/2016 08:24 PM, Tejun Heo wrote:
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
> Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
> Cc: stable@vger.kernel.org #v4.5+


Why is this tagged for 4.5+ since the commit in question was merged in 4.2?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-18 10:12       ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Nikolay Borisov
@ 2016-02-18 12:57         ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-18 12:57 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Jens Axboe, Tahsin Erdogan, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Thu, Feb 18, 2016 at 12:12:40PM +0200, Nikolay Borisov wrote:
> 
> 
> On 02/16/2016 08:24 PM, Tejun Heo wrote:
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
> > Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
> > Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
> > Cc: stable@vger.kernel.org #v4.5+
> 
> Why is this tagged for 4.5+ since the commit in question was merged in 4.2?

Up until 4.5, the whole cgroup v2 interface is hidden behind a devel
flag and cgroup writeback can only be used through cgroup v2.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-18  9:55                 ` Jan Kara
@ 2016-02-18 13:00                   ` Tejun Heo
  2016-02-18 13:20                     ` Jan Kara
  2016-02-19 20:18                     ` Al Viro
  0 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-18 13:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

Hello, Jan.

On Thu, Feb 18, 2016 at 10:55:38AM +0100, Jan Kara wrote:
> I'm not sure I understand the question. Do you mean why both s_active and
> s_umount rwsem exist? s_active is a reference count keeping superblock

Yes.

> alive - e.g. if the filesystem is mounted in more places, we need a
> reference for each mountpoint. s_umount is used when we want to block any

I could be mistaken but I *think* we used to reject umounts based on
s_active and s_umount is the mechanism to delay umounts rather than
failing them and probably with bind mounts the behavior changed.

> umount operation until we are done. For example sync(2) is using it to make
> sure superblock doesn't disappear and so that we don't keep superblock
> alive after admin called umount(2).

So, the question is why aren't we just using s_active and draining it
on umount of the last mountpoint.  Because, right now, the behavior is
weird in that we allow umounts to proceed but then let the superblock
hang onto the block device till s_active is drained.  This really
should be synchronous.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-18 13:00                   ` Tejun Heo
@ 2016-02-18 13:20                     ` Jan Kara
  2016-02-19 20:18                     ` Al Viro
  1 sibling, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-02-18 13:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara, Al Viro

Hi Tejun,

On Thu 18-02-16 08:00:33, Tejun Heo wrote:
> On Thu, Feb 18, 2016 at 10:55:38AM +0100, Jan Kara wrote:
> > I'm not sure I understand the question. Do you mean why both s_active and
> > s_umount rwsem exist? s_active is a reference count keeping superblock
> 
> Yes.
> 
> > alive - e.g. if the filesystem is mounted in more places, we need a
> > reference for each mountpoint. s_umount is used when we want to block any
> 
> I could be mistaken but I *think* we used to reject umounts based on
> s_active and s_umount is the mechanism to delay umounts rather than
> failing them and probably with bind mounts the behavior changed.
> 
> > umount operation until we are done. For example sync(2) is using it to make
> > sure superblock doesn't disappear and so that we don't keep superblock
> > alive after admin called umount(2).
> 
> So, the question is why aren't we just using s_active and draining it
> on umount of the last mountpoint.  Because, right now, the behavior is
> weird in that we allow umounts to proceed but then let the superblock
> hang onto the block device till s_active is drained.  This really
> should be synchronous.

Hum, I'm not sure. I guess Al can give you more qualified answer than me.
Added to CC...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-18 13:00                   ` Tejun Heo
  2016-02-18 13:20                     ` Jan Kara
@ 2016-02-19 20:18                     ` Al Viro
  2016-02-19 20:51                       ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-02-19 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Thu, Feb 18, 2016 at 08:00:33AM -0500, Tejun Heo wrote:

> So, the question is why aren't we just using s_active and draining it
> on umount of the last mountpoint.  Because, right now, the behavior is
> weird in that we allow umounts to proceed but then let the superblock
> hang onto the block device till s_active is drained.  This really
> should be synchronous.

This really should not.  First of all, umount -l (or exit of the last
namespace user, for that matter) can leave you with actual fs shutdown
postponed until some opened files get closed.  Nothing synchronous about
that.

If you need details on s_active/s_umount/etc., I can give you a braindump,
but I suspect your real question is a lot more specific.  Details, please...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-19 20:18                     ` Al Viro
@ 2016-02-19 20:51                       ` Tejun Heo
  2016-02-19 21:58                         ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-02-19 20:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

Hello, Al.

On Fri, Feb 19, 2016 at 08:18:06PM +0000, Al Viro wrote:
> On Thu, Feb 18, 2016 at 08:00:33AM -0500, Tejun Heo wrote:
> > So, the question is why aren't we just using s_active and draining it
> > on umount of the last mountpoint.  Because, right now, the behavior is
> > weird in that we allow umounts to proceed but then let the superblock
> > hang onto the block device till s_active is drained.  This really
> > should be synchronous.
> 
> This really should not.  First of all, umount -l (or exit of the last
> namespace user, for that matter) can leave you with actual fs shutdown
> postponed until some opened files get closed.  Nothing synchronous about
> that.

I see, I suppose that's what distinguishes s_active and s_umount
usages - whether pinning should block umounting?

> If you need details on s_active/s_umount/etc., I can give you a braindump,
> but I suspect your real question is a lot more specific.  Details, please...

So, the problem is that cgroup writeback path sometimes schedules a
work item to change the cgroup an inode is associated.  Currently,
only the inode was pinned and the underlying sb may go away while the
work item is still pending.  The work item performs iput() at the end
and that explodes if the underlying sb is already gone.

As writeback path relies on s_umount for synchronization anyway, I
think that'd be the most natural way to hold onto the sb but
unfortunately there's no way to pass on the down_read to the async
execution context, so I made it grap s_active, which worked fine but
it made the sb hang around until such work items are finished.  It's
an unlikely race to hit but still broken.

The last option would be canceling / flushing these work items from sb
shutdown path which is likely more involved.

What should it be doing?

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-19 20:51                       ` Tejun Heo
@ 2016-02-19 21:58                         ` Al Viro
  2016-02-19 22:15                           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-02-19 21:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Fri, Feb 19, 2016 at 03:51:47PM -0500, Tejun Heo wrote:

> I see, I suppose that's what distinguishes s_active and s_umount
> usages - whether pinning should block umounting?

???

->s_active is plain and simple count of "I hold a long-term reference to
this superblock, don't you shut it down until I drop that".

->s_umount is held across some of the transitions in struct super_block
life cycle, including the actual process of shutdown.

> > If you need details on s_active/s_umount/etc., I can give you a braindump,
> > but I suspect your real question is a lot more specific.  Details, please...
> 
> So, the problem is that cgroup writeback path sometimes schedules a
> work item to change the cgroup an inode is associated.  Currently,
> only the inode was pinned and the underlying sb may go away while the
> work item is still pending.  The work item performs iput() at the end
> and that explodes if the underlying sb is already gone.
> 
> As writeback path relies on s_umount for synchronization anyway, I
> think that'd be the most natural way to hold onto the sb but
> unfortunately there's no way to pass on the down_read to the async
> execution context, so I made it grap s_active, which worked fine but
> it made the sb hang around until such work items are finished.  It's
> an unlikely race to hit but still broken.
> 
> The last option would be canceling / flushing these work items from sb
> shutdown path which is likely more involved.
> 
> What should it be doing?

Um...  What ordering requirements do you have?  You obviously shouldn't
let it continue past the shutdown - as the matter of fact, you *can't* let
it continue past generic_shutdown_super(), since any inode references
held at evict_inodes() time will make it very unhappy.  Attempts to do
any IO after that will make things a lot worse than unhappy - data structures
needed to do it might be gone (and if you hold a bit longer, filesystem
driver itself might very well be gone, along with the functions you were
going to call).

Grabbing ->s_active is a seriously bad idea for another reason - in
a situation when there's only one mount of given fs, plain umount() should
_not_ return 0 before fs shutdown is over.  Sure, it is possible that there's
a binding somewhere, or that it's a lazy umount, etc., but those are "you've
asked for it" situations; having plain umount of e.g. ext3 on a USB stick
return success before it is safe to pull that stick out is a Bloody Bad Idea,
for obvious usability reasons.

IOW, while fs shutdown may be async, making it *always* async would be a bad
bug.  And bumping ->s_active does just that.

I'd go for trylock inside that work + making generic_shutdown_super()
kill all such works.  I assume that it *can* be abandoned in situation
when we know that sync_filesystem() is about to be called and that
said sync_filesystem() won't, in turn, schedule any such works, of course...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-19 21:58                         ` Al Viro
@ 2016-02-19 22:15                           ` Tejun Heo
  2016-02-19 22:26                             ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-02-19 22:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

Hello,

On Fri, Feb 19, 2016 at 09:58:11PM +0000, Al Viro wrote:
> Um...  What ordering requirements do you have?  You obviously shouldn't
> let it continue past the shutdown - as the matter of fact, you *can't* let
> it continue past generic_shutdown_super(), since any inode references
> held at evict_inodes() time will make it very unhappy.  Attempts to do
> any IO after that will make things a lot worse than unhappy - data structures
> needed to do it might be gone (and if you hold a bit longer, filesystem
> driver itself might very well be gone, along with the functions you were
> going to call).

It can be thought of as an extension of fs writeback operation and
it'd be ideal if it can hold off sb shutdown as on-going writeback
does through holding s_umount.  Unfortunately, that doesn't seem
possible because there's no way to transfer rwsem ownership.

It doesn't generate any IO.  The reason it's done asynchronously is
because the operation requires an RCU grace period.  After the grace
period, it accesses only the generic inode and address_space and the
only time it ends up accessing sb is through the iput call.
Everything else AFAICS doesn't really care whether the underlying sb
is shut down or not.

> Grabbing ->s_active is a seriously bad idea for another reason - in
> a situation when there's only one mount of given fs, plain umount() should
> _not_ return 0 before fs shutdown is over.  Sure, it is possible that there's
> a binding somewhere, or that it's a lazy umount, etc., but those are "you've
> asked for it" situations; having plain umount of e.g. ext3 on a USB stick
> return success before it is safe to pull that stick out is a Bloody Bad Idea,
> for obvious usability reasons.

I see.

> IOW, while fs shutdown may be async, making it *always* async would be a bad
> bug.  And bumping ->s_active does just that.
> 
> I'd go for trylock inside that work + making generic_shutdown_super()
> kill all such works.  I assume that it *can* be abandoned in situation
> when we know that sync_filesystem() is about to be called and that
> said sync_filesystem() won't, in turn, schedule any such works, of course...

I'll make generic_shutdown_super() to kill all such work items.  I
don't think the work item itself would need further locking tho.  Can
you please elaborate why you thought adding trylock to the work would
be necessary?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-19 22:15                           ` Tejun Heo
@ 2016-02-19 22:26                             ` Al Viro
  2016-02-28 21:53                               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-02-19 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Fri, Feb 19, 2016 at 05:15:12PM -0500, Tejun Heo wrote:

> > IOW, while fs shutdown may be async, making it *always* async would be a bad
> > bug.  And bumping ->s_active does just that.
> > 
> > I'd go for trylock inside that work + making generic_shutdown_super()
> > kill all such works.  I assume that it *can* be abandoned in situation
> > when we know that sync_filesystem() is about to be called and that
> > said sync_filesystem() won't, in turn, schedule any such works, of course...
> 
> I'll make generic_shutdown_super() to kill all such work items.  I
> don't think the work item itself would need further locking tho.  Can
> you please elaborate why you thought adding trylock to the work would
> be necessary?

Umm...  Not much, except that it would make the life cycle rules a bit
more regular.

Is that code OK with e.g. running in parallel with remounting filesystem r/o?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches
  2016-02-19 22:26                             ` Al Viro
@ 2016-02-28 21:53                               ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-28 21:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Tahsin Erdogan, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

Hello, Al.

On Fri, Feb 19, 2016 at 10:26:09PM +0000, Al Viro wrote:
> Is that code OK with e.g. running in parallel with remounting filesystem r/o?

AFAICS, yeah.  It doesn't care about the sb or filesystem at all.  All
it touches is generic inodes and their pages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-17 23:02               ` Tejun Heo
  2016-02-18  9:55                 ` Jan Kara
@ 2016-02-29 20:47                 ` Tejun Heo
  2016-02-29 20:54                   ` Al Viro
                                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-29 20:47 UTC (permalink / raw)
  To: Tahsin Erdogan
  Cc: Jan Kara, Jens Axboe, cgroups, Theodore Ts'o, Nauman Rafique,
	linux-kernel, Jan Kara, Al Viro

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
---
Hello,

Tahsin, can you please verify that this removes the asynchronous
behavior while still avoiding the original issue?

Thanks.

 fs/fs-writeback.c         |   51 +++++++++++++++++++++++++++++++++++-----------
 fs/super.c                |    1 
 include/linux/writeback.h |    5 ++++
 3 files changed, 45 insertions(+), 12 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -223,6 +223,9 @@ static void wb_wait_for_completion(struc
 #define WB_FRN_HIST_MAX_SLOTS	(WB_FRN_HIST_THR_SLOTS / 2 + 1)
 					/* one round can affect upto 5 slots */
 
+static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);
+static struct workqueue_struct *isw_wq;
+
 void __inode_attach_wb(struct inode *inode, struct page *page)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -317,7 +320,6 @@ static void inode_switch_wbs_work_fn(str
 	struct inode_switch_wbs_context *isw =
 		container_of(work, struct inode_switch_wbs_context, work);
 	struct inode *inode = isw->inode;
-	struct super_block *sb = inode->i_sb;
 	struct address_space *mapping = inode->i_mapping;
 	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
@@ -424,8 +426,9 @@ skip_switch:
 	wb_put(new_wb);
 
 	iput(inode);
-	deactivate_super(sb);
 	kfree(isw);
+
+	atomic_dec(&isw_nr_in_flight);
 }
 
 static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head)
@@ -435,7 +438,7 @@ static void inode_switch_wbs_rcu_fn(stru
 
 	/* needs to grab bh-unsafe locks, bounce to work item */
 	INIT_WORK(&isw->work, inode_switch_wbs_work_fn);
-	schedule_work(&isw->work);
+	queue_work(isw_wq, &isw->work);
 }
 
 /**
@@ -471,20 +474,19 @@ static void inode_switch_wbs(struct inod
 
 	/* while holding I_WB_SWITCH, no one else can update the association */
 	spin_lock(&inode->i_lock);
-
 	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
-	    inode_to_wb(inode) == isw->new_wb)
-		goto out_unlock;
-
-	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
-		goto out_unlock;
-
+	    inode_to_wb(inode) == isw->new_wb) {
+		spin_unlock(&inode->i_lock);
+		goto out_free;
+	}
 	inode->i_state |= I_WB_SWITCH;
 	spin_unlock(&inode->i_lock);
 
 	ihold(inode);
 	isw->inode = inode;
 
+	atomic_inc(&isw_nr_in_flight);
+
 	/*
 	 * In addition to synchronizing among switchers, I_WB_SWITCH tells
 	 * the RCU protected stat update paths to grab the mapping's
@@ -494,8 +496,6 @@ static void inode_switch_wbs(struct inod
 	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
 	return;
 
-out_unlock:
-	spin_unlock(&inode->i_lock);
 out_free:
 	if (isw->new_wb)
 		wb_put(isw->new_wb);
@@ -847,6 +847,33 @@ restart:
 		wb_put(last_wb);
 }
 
+/**
+ * cgroup_writeback_umount - flush inode wb switches for umount
+ *
+ * This function is called when a super_block is about to be destroyed and
+ * flushes in-flight inode wb switches.  An inode wb switch goes through
+ * RCU and then workqueue, so the two need to be flushed in order to ensure
+ * that all previously scheduled switches are finished.  As wb switches are
+ * rare occurrences and synchronize_rcu() can take a while, perform
+ * flushing iff wb switches are in flight.
+ */
+void cgroup_writeback_umount(void)
+{
+	if (atomic_read(&isw_nr_in_flight)) {
+		synchronize_rcu();
+		flush_workqueue(isw_wq);
+	}
+}
+
+static int __init cgroup_writeback_init(void)
+{
+	isw_wq = alloc_workqueue("inode_switch_wbs", 0, 0);
+	if (!isw_wq)
+		return -ENOMEM;
+	return 0;
+}
+fs_initcall(cgroup_writeback_init);
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static struct bdi_writeback *
--- a/fs/super.c
+++ b/fs/super.c
@@ -414,6 +414,7 @@ void generic_shutdown_super(struct super
 		sync_filesystem(sb);
 		sb->s_flags &= ~MS_ACTIVE;
 
+		cgroup_writeback_umount();
 		fsnotify_unmount_inodes(sb);
 
 		evict_inodes(sb);
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -198,6 +198,7 @@ void wbc_attach_and_unlock_inode(struct
 void wbc_detach_inode(struct writeback_control *wbc);
 void wbc_account_io(struct writeback_control *wbc, struct page *page,
 		    size_t bytes);
+void cgroup_writeback_umount(void);
 
 /**
  * inode_attach_wb - associate an inode with its wb
@@ -301,6 +302,10 @@ static inline void wbc_account_io(struct
 {
 }
 
+static inline void cgroup_writeback_umount(void)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
@ 2016-02-29 20:54                   ` Al Viro
  2016-02-29 20:58                     ` Tejun Heo
  2016-02-29 23:28                   ` [PATCH v2 " Tejun Heo
  2016-03-01 13:39                   ` [PATCH " Tahsin Erdogan
  2 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-02-29 20:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Mon, Feb 29, 2016 at 03:47:24PM -0500, Tejun Heo wrote:
> If cgroup writeback is in use, inodes can be scheduled for
> asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
> superblock pinned during cgroup writeback association switches"), this
> could race with umount leading to super_block being destroyed while
> inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
> s_active while wb switches are in flight; however, this allowed
> in-flight wb switches to make umounts asynchronous when the userland
> expected synchronosity - e.g. fsck immediately following umount may
> fail because the device is still busy.
> 
> This patch removes the problematic super_block pinning and instead
> makes generic_shutdown_super() flush in-flight wb switches.  wb
> switches are now executed on a dedicated isw_wq so that they can be
> flushed and isw_nr_in_flight keeps track of the number of in-flight wb
> switches so that flushing can be avoided in most cases.

Wait a bloody minute.  What's to prevent shrink_dcache_for_umount() from
dirtying more inodes, triggering more of the same?

> -	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
> -		goto out_unlock;

This would've failed for inodes on superblock in the middle of shutdown;
what's to do the same for the new variant?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 20:54                   ` Al Viro
@ 2016-02-29 20:58                     ` Tejun Heo
  2016-02-29 21:06                       ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-02-29 20:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Mon, Feb 29, 2016 at 08:54:28PM +0000, Al Viro wrote:
> > This patch removes the problematic super_block pinning and instead
> > makes generic_shutdown_super() flush in-flight wb switches.  wb
> > switches are now executed on a dedicated isw_wq so that they can be
> > flushed and isw_nr_in_flight keeps track of the number of in-flight wb
> > switches so that flushing can be avoided in most cases.
> 
> Wait a bloody minute.  What's to prevent shrink_dcache_for_umount() from
> dirtying more inodes, triggering more of the same?

Hmmm?  The flushing is done after shrink_dcache_for_umount() and
sync_filesystems().  Aren't inodes supposed to stay clean after that?

> > -	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
> > -		goto out_unlock;
> 
> This would've failed for inodes on superblock in the middle of shutdown;
> what's to do the same for the new variant?

I don't follow.  As long as no new writeback operations are initiated
after flushing, none can be in flight for the super_block.  Isn't that
enough?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 20:58                     ` Tejun Heo
@ 2016-02-29 21:06                       ` Al Viro
  2016-02-29 21:08                         ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-02-29 21:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Mon, Feb 29, 2016 at 03:58:37PM -0500, Tejun Heo wrote:
> On Mon, Feb 29, 2016 at 08:54:28PM +0000, Al Viro wrote:
> > > This patch removes the problematic super_block pinning and instead
> > > makes generic_shutdown_super() flush in-flight wb switches.  wb
> > > switches are now executed on a dedicated isw_wq so that they can be
> > > flushed and isw_nr_in_flight keeps track of the number of in-flight wb
> > > switches so that flushing can be avoided in most cases.
> > 
> > Wait a bloody minute.  What's to prevent shrink_dcache_for_umount() from
> > dirtying more inodes, triggering more of the same?
> 
> Hmmm?  The flushing is done after shrink_dcache_for_umount() and
> sync_filesystems().  Aren't inodes supposed to stay clean after that?

s/shrink_dcache_for_umount/fsnotify_unmount_inodes/ - sorry.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 21:06                       ` Al Viro
@ 2016-02-29 21:08                         ` Tejun Heo
  2016-02-29 21:21                           ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-02-29 21:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara

On Mon, Feb 29, 2016 at 09:06:15PM +0000, Al Viro wrote:
> > Hmmm?  The flushing is done after shrink_dcache_for_umount() and
> > sync_filesystems().  Aren't inodes supposed to stay clean after that?
> 
> s/shrink_dcache_for_umount/fsnotify_unmount_inodes/ - sorry.

Is that allowed to dirty indoes and initiate writebacks again, after
sync_filesystems() is done?  That sounds weird but it's trivial to
move cgroup_writeback_umount() below that if so.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 21:08                         ` Tejun Heo
@ 2016-02-29 21:21                           ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-02-29 21:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Al Viro, Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups,
	Theodore Ts'o, Nauman Rafique, linux-kernel, Jan Kara

On Mon 29-02-16 16:08:00, Tejun Heo wrote:
> On Mon, Feb 29, 2016 at 09:06:15PM +0000, Al Viro wrote:
> > > Hmmm?  The flushing is done after shrink_dcache_for_umount() and
> > > sync_filesystems().  Aren't inodes supposed to stay clean after that?
> > 
> > s/shrink_dcache_for_umount/fsnotify_unmount_inodes/ - sorry.
> 
> Is that allowed to dirty indoes and initiate writebacks again, after
> sync_filesystems() is done?  That sounds weird but it's trivial to
> move cgroup_writeback_umount() below that if so.

Hardly, but generally it is true that filesystem may still dirty something
(e.g. from outstanding workqueue work, but most likely some special "system
inodes" may still become dirty) until ->put_super() is finished.

Anyway, to make this foolproof, I'd just avoid queueing any new switching
work after S_ACTIVE is cleared on the superblock and flush the workqueue just
before evict_inodes() call.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
  2016-02-29 20:54                   ` Al Viro
@ 2016-02-29 23:28                   ` Tejun Heo
  2016-03-01  9:20                     ` Jan Kara
  2016-03-01 17:46                     ` Jens Axboe
  2016-03-01 13:39                   ` [PATCH " Tahsin Erdogan
  2 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2016-02-29 23:28 UTC (permalink / raw)
  To: Tahsin Erdogan
  Cc: Jan Kara, Jens Axboe, cgroups, Theodore Ts'o, Nauman Rafique,
	linux-kernel, Jan Kara, Al Viro

If cgroup writeback is in use, inodes can be scheduled for
asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
superblock pinned during cgroup writeback association switches"), this
could race with umount leading to super_block being destroyed while
inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
expected synchronosity - e.g. fsck immediately following umount may
fail because the device is still busy.

This patch removes the problematic super_block pinning and instead
makes generic_shutdown_super() flush in-flight wb switches.  wb
switches are now executed on a dedicated isw_wq so that they can be
flushed and isw_nr_in_flight keeps track of the number of in-flight wb
switches so that flushing can be avoided in most cases.

v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
    check in inode_switch_wbs() as Jan an Al suggested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
Cc: stable@vger.kernel.org #v4.5
---
 fs/fs-writeback.c         |   54 ++++++++++++++++++++++++++++++++++------------
 fs/super.c                |    1 
 include/linux/writeback.h |    5 ++++
 3 files changed, 47 insertions(+), 13 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -223,6 +223,9 @@ static void wb_wait_for_completion(struc
 #define WB_FRN_HIST_MAX_SLOTS	(WB_FRN_HIST_THR_SLOTS / 2 + 1)
 					/* one round can affect upto 5 slots */
 
+static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);
+static struct workqueue_struct *isw_wq;
+
 void __inode_attach_wb(struct inode *inode, struct page *page)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -317,7 +320,6 @@ static void inode_switch_wbs_work_fn(str
 	struct inode_switch_wbs_context *isw =
 		container_of(work, struct inode_switch_wbs_context, work);
 	struct inode *inode = isw->inode;
-	struct super_block *sb = inode->i_sb;
 	struct address_space *mapping = inode->i_mapping;
 	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
@@ -424,8 +426,9 @@ skip_switch:
 	wb_put(new_wb);
 
 	iput(inode);
-	deactivate_super(sb);
 	kfree(isw);
+
+	atomic_dec(&isw_nr_in_flight);
 }
 
 static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head)
@@ -435,7 +438,7 @@ static void inode_switch_wbs_rcu_fn(stru
 
 	/* needs to grab bh-unsafe locks, bounce to work item */
 	INIT_WORK(&isw->work, inode_switch_wbs_work_fn);
-	schedule_work(&isw->work);
+	queue_work(isw_wq, &isw->work);
 }
 
 /**
@@ -471,20 +474,20 @@ static void inode_switch_wbs(struct inod
 
 	/* while holding I_WB_SWITCH, no one else can update the association */
 	spin_lock(&inode->i_lock);
-
-	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
-	    inode_to_wb(inode) == isw->new_wb)
-		goto out_unlock;
-
-	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
-		goto out_unlock;
-
+	if (!(inode->i_sb->s_flags & MS_ACTIVE) ||
+	    inode->i_state & (I_WB_SWITCH | I_FREEING) ||
+	    inode_to_wb(inode) == isw->new_wb) {
+		spin_unlock(&inode->i_lock);
+		goto out_free;
+	}
 	inode->i_state |= I_WB_SWITCH;
 	spin_unlock(&inode->i_lock);
 
 	ihold(inode);
 	isw->inode = inode;
 
+	atomic_inc(&isw_nr_in_flight);
+
 	/*
 	 * In addition to synchronizing among switchers, I_WB_SWITCH tells
 	 * the RCU protected stat update paths to grab the mapping's
@@ -494,8 +497,6 @@ static void inode_switch_wbs(struct inod
 	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
 	return;
 
-out_unlock:
-	spin_unlock(&inode->i_lock);
 out_free:
 	if (isw->new_wb)
 		wb_put(isw->new_wb);
@@ -847,6 +848,33 @@ restart:
 		wb_put(last_wb);
 }
 
+/**
+ * cgroup_writeback_umount - flush inode wb switches for umount
+ *
+ * This function is called when a super_block is about to be destroyed and
+ * flushes in-flight inode wb switches.  An inode wb switch goes through
+ * RCU and then workqueue, so the two need to be flushed in order to ensure
+ * that all previously scheduled switches are finished.  As wb switches are
+ * rare occurrences and synchronize_rcu() can take a while, perform
+ * flushing iff wb switches are in flight.
+ */
+void cgroup_writeback_umount(void)
+{
+	if (atomic_read(&isw_nr_in_flight)) {
+		synchronize_rcu();
+		flush_workqueue(isw_wq);
+	}
+}
+
+static int __init cgroup_writeback_init(void)
+{
+	isw_wq = alloc_workqueue("inode_switch_wbs", 0, 0);
+	if (!isw_wq)
+		return -ENOMEM;
+	return 0;
+}
+fs_initcall(cgroup_writeback_init);
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static struct bdi_writeback *
--- a/fs/super.c
+++ b/fs/super.c
@@ -415,6 +415,7 @@ void generic_shutdown_super(struct super
 		sb->s_flags &= ~MS_ACTIVE;
 
 		fsnotify_unmount_inodes(sb);
+		cgroup_writeback_umount();
 
 		evict_inodes(sb);
 
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -198,6 +198,7 @@ void wbc_attach_and_unlock_inode(struct
 void wbc_detach_inode(struct writeback_control *wbc);
 void wbc_account_io(struct writeback_control *wbc, struct page *page,
 		    size_t bytes);
+void cgroup_writeback_umount(void);
 
 /**
  * inode_attach_wb - associate an inode with its wb
@@ -301,6 +302,10 @@ static inline void wbc_account_io(struct
 {
 }
 
+static inline void cgroup_writeback_umount(void)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /*

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 23:28                   ` [PATCH v2 " Tejun Heo
@ 2016-03-01  9:20                     ` Jan Kara
  2016-03-01 17:46                     ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-01  9:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tahsin Erdogan, Jan Kara, Jens Axboe, cgroups, Theodore Ts'o,
	Nauman Rafique, linux-kernel, Jan Kara, Al Viro

On Mon 29-02-16 18:28:53, Tejun Heo wrote:
> If cgroup writeback is in use, inodes can be scheduled for
> asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
> superblock pinned during cgroup writeback association switches"), this
> could race with umount leading to super_block being destroyed while
> inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
> s_active while wb switches are in flight; however, this allowed
> in-flight wb switches to make umounts asynchronous when the userland
> expected synchronosity - e.g. fsck immediately following umount may
> fail because the device is still busy.
> 
> This patch removes the problematic super_block pinning and instead
> makes generic_shutdown_super() flush in-flight wb switches.  wb
> switches are now executed on a dedicated isw_wq so that they can be
> flushed and isw_nr_in_flight keeps track of the number of in-flight wb
> switches so that flushing can be avoided in most cases.
> 
> v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
>     check in inode_switch_wbs() as Jan an Al suggested.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Tahsin Erdogan <tahsin@google.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Al Viro <viro@ZenIV.linux.org.uk>
> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
> Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
> Cc: stable@vger.kernel.org #v4.5

The patch looks good to me now. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

									Honza

> ---
>  fs/fs-writeback.c         |   54 ++++++++++++++++++++++++++++++++++------------
>  fs/super.c                |    1 
>  include/linux/writeback.h |    5 ++++
>  3 files changed, 47 insertions(+), 13 deletions(-)
> 
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -223,6 +223,9 @@ static void wb_wait_for_completion(struc
>  #define WB_FRN_HIST_MAX_SLOTS	(WB_FRN_HIST_THR_SLOTS / 2 + 1)
>  					/* one round can affect upto 5 slots */
>  
> +static atomic_t isw_nr_in_flight = ATOMIC_INIT(0);
> +static struct workqueue_struct *isw_wq;
> +
>  void __inode_attach_wb(struct inode *inode, struct page *page)
>  {
>  	struct backing_dev_info *bdi = inode_to_bdi(inode);
> @@ -317,7 +320,6 @@ static void inode_switch_wbs_work_fn(str
>  	struct inode_switch_wbs_context *isw =
>  		container_of(work, struct inode_switch_wbs_context, work);
>  	struct inode *inode = isw->inode;
> -	struct super_block *sb = inode->i_sb;
>  	struct address_space *mapping = inode->i_mapping;
>  	struct bdi_writeback *old_wb = inode->i_wb;
>  	struct bdi_writeback *new_wb = isw->new_wb;
> @@ -424,8 +426,9 @@ skip_switch:
>  	wb_put(new_wb);
>  
>  	iput(inode);
> -	deactivate_super(sb);
>  	kfree(isw);
> +
> +	atomic_dec(&isw_nr_in_flight);
>  }
>  
>  static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head)
> @@ -435,7 +438,7 @@ static void inode_switch_wbs_rcu_fn(stru
>  
>  	/* needs to grab bh-unsafe locks, bounce to work item */
>  	INIT_WORK(&isw->work, inode_switch_wbs_work_fn);
> -	schedule_work(&isw->work);
> +	queue_work(isw_wq, &isw->work);
>  }
>  
>  /**
> @@ -471,20 +474,20 @@ static void inode_switch_wbs(struct inod
>  
>  	/* while holding I_WB_SWITCH, no one else can update the association */
>  	spin_lock(&inode->i_lock);
> -
> -	if (inode->i_state & (I_WB_SWITCH | I_FREEING) ||
> -	    inode_to_wb(inode) == isw->new_wb)
> -		goto out_unlock;
> -
> -	if (!atomic_inc_not_zero(&inode->i_sb->s_active))
> -		goto out_unlock;
> -
> +	if (!(inode->i_sb->s_flags & MS_ACTIVE) ||
> +	    inode->i_state & (I_WB_SWITCH | I_FREEING) ||
> +	    inode_to_wb(inode) == isw->new_wb) {
> +		spin_unlock(&inode->i_lock);
> +		goto out_free;
> +	}
>  	inode->i_state |= I_WB_SWITCH;
>  	spin_unlock(&inode->i_lock);
>  
>  	ihold(inode);
>  	isw->inode = inode;
>  
> +	atomic_inc(&isw_nr_in_flight);
> +
>  	/*
>  	 * In addition to synchronizing among switchers, I_WB_SWITCH tells
>  	 * the RCU protected stat update paths to grab the mapping's
> @@ -494,8 +497,6 @@ static void inode_switch_wbs(struct inod
>  	call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
>  	return;
>  
> -out_unlock:
> -	spin_unlock(&inode->i_lock);
>  out_free:
>  	if (isw->new_wb)
>  		wb_put(isw->new_wb);
> @@ -847,6 +848,33 @@ restart:
>  		wb_put(last_wb);
>  }
>  
> +/**
> + * cgroup_writeback_umount - flush inode wb switches for umount
> + *
> + * This function is called when a super_block is about to be destroyed and
> + * flushes in-flight inode wb switches.  An inode wb switch goes through
> + * RCU and then workqueue, so the two need to be flushed in order to ensure
> + * that all previously scheduled switches are finished.  As wb switches are
> + * rare occurrences and synchronize_rcu() can take a while, perform
> + * flushing iff wb switches are in flight.
> + */
> +void cgroup_writeback_umount(void)
> +{
> +	if (atomic_read(&isw_nr_in_flight)) {
> +		synchronize_rcu();
> +		flush_workqueue(isw_wq);
> +	}
> +}
> +
> +static int __init cgroup_writeback_init(void)
> +{
> +	isw_wq = alloc_workqueue("inode_switch_wbs", 0, 0);
> +	if (!isw_wq)
> +		return -ENOMEM;
> +	return 0;
> +}
> +fs_initcall(cgroup_writeback_init);
> +
>  #else	/* CONFIG_CGROUP_WRITEBACK */
>  
>  static struct bdi_writeback *
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -415,6 +415,7 @@ void generic_shutdown_super(struct super
>  		sb->s_flags &= ~MS_ACTIVE;
>  
>  		fsnotify_unmount_inodes(sb);
> +		cgroup_writeback_umount();
>  
>  		evict_inodes(sb);
>  
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -198,6 +198,7 @@ void wbc_attach_and_unlock_inode(struct
>  void wbc_detach_inode(struct writeback_control *wbc);
>  void wbc_account_io(struct writeback_control *wbc, struct page *page,
>  		    size_t bytes);
> +void cgroup_writeback_umount(void);
>  
>  /**
>   * inode_attach_wb - associate an inode with its wb
> @@ -301,6 +302,10 @@ static inline void wbc_account_io(struct
>  {
>  }
>  
> +static inline void cgroup_writeback_umount(void)
> +{
> +}
> +
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
>  /*
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
  2016-02-29 20:54                   ` Al Viro
  2016-02-29 23:28                   ` [PATCH v2 " Tejun Heo
@ 2016-03-01 13:39                   ` Tahsin Erdogan
  2 siblings, 0 replies; 30+ messages in thread
From: Tahsin Erdogan @ 2016-03-01 13:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, cgroups, Theodore Ts'o, Nauman Rafique,
	linux-kernel, Jan Kara, Al Viro

On Mon, Feb 29, 2016 at 12:47 PM, Tejun Heo <tj@kernel.org> wrote:
> Tahsin, can you please verify that this removes the asynchronous
> behavior while still avoiding the original issue?
>

I verified that this fixes the original issue and also removes the
asynchronous umount behavior.

Thanks a lot for the fix.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-02-29 23:28                   ` [PATCH v2 " Tejun Heo
  2016-03-01  9:20                     ` Jan Kara
@ 2016-03-01 17:46                     ` Jens Axboe
  2016-03-01 17:50                       ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2016-03-01 17:46 UTC (permalink / raw)
  To: Tejun Heo, Tahsin Erdogan
  Cc: Jan Kara, cgroups, Theodore Ts'o, Nauman Rafique,
	linux-kernel, Jan Kara, Al Viro

On 02/29/2016 04:28 PM, Tejun Heo wrote:
> If cgroup writeback is in use, inodes can be scheduled for
> asynchronous wb switching.  Before 5ff8eaac1636 ("writeback: keep
> superblock pinned during cgroup writeback association switches"), this
> could race with umount leading to super_block being destroyed while
> inodes are pinned for wb switching.  5ff8eaac1636 fixed it by bumping
> s_active while wb switches are in flight; however, this allowed
> in-flight wb switches to make umounts asynchronous when the userland
> expected synchronosity - e.g. fsck immediately following umount may
> fail because the device is still busy.
>
> This patch removes the problematic super_block pinning and instead
> makes generic_shutdown_super() flush in-flight wb switches.  wb
> switches are now executed on a dedicated isw_wq so that they can be
> flushed and isw_nr_in_flight keeps track of the number of in-flight wb
> switches so that flushing can be avoided in most cases.
>
> v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
>      check in inode_switch_wbs() as Jan an Al suggested.

I queued this up for 4.5, but I'm feeling a bit uneasy about it. But 
it's either that, or revert 5ff8eaac1636 and fix it for real in 4.6. 
Jan/Tejun, what do you think?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-03-01 17:46                     ` Jens Axboe
@ 2016-03-01 17:50                       ` Tejun Heo
  2016-03-02 10:29                         ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-03-01 17:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tahsin Erdogan, Jan Kara, Cgroups, Theodore Ts'o,
	Nauman Rafique, lkml, Jan Kara, Al Viro

Hello, Jens.

On Tue, Mar 1, 2016 at 12:46 PM, Jens Axboe <axboe@kernel.dk> wrote:
> I queued this up for 4.5, but I'm feeling a bit uneasy about it. But it's
> either that, or revert 5ff8eaac1636 and fix it for real in 4.6. Jan/Tejun,
> what do you think?

Given that this only matters for cgroup writeback cases, this should
still be fairly low impact, so I don't think it'd matter too much
whether we fix this in this cycle or for 4.6. However, that also means
that we're not risking much by doing it in this cycle, so I'd vote for
doing it now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block
  2016-03-01 17:50                       ` Tejun Heo
@ 2016-03-02 10:29                         ` Jan Kara
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Kara @ 2016-03-02 10:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Tahsin Erdogan, Jan Kara, Cgroups, Theodore Ts'o,
	Nauman Rafique, lkml, Jan Kara, Al Viro

On Tue 01-03-16 12:50:19, Tejun Heo wrote:
> Hello, Jens.
> 
> On Tue, Mar 1, 2016 at 12:46 PM, Jens Axboe <axboe@kernel.dk> wrote:
> > I queued this up for 4.5, but I'm feeling a bit uneasy about it. But it's
> > either that, or revert 5ff8eaac1636 and fix it for real in 4.6. Jan/Tejun,
> > what do you think?
> 
> Given that this only matters for cgroup writeback cases, this should
> still be fairly low impact, so I don't think it'd matter too much
> whether we fix this in this cycle or for 4.6. However, that also means
> that we're not risking much by doing it in this cycle, so I'd vote for
> doing it now.

Yeah, without CGROUP_WRITEBACK enabled this patch is NOP so I don't care
much when this gets merged.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-03-02 10:29 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com>
     [not found] ` <20160215210047.GN3965@htj.duckdns.org>
     [not found]   ` <CAAeU0aNAd1Ra6LXmWwq8row4MD_BpVHiSXOwHx07m86UWREvHw@mail.gmail.com>
2016-02-16 18:24     ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Tejun Heo
2016-02-16 18:34       ` Jens Axboe
2016-02-17 20:57       ` Jan Kara
2016-02-17 21:07         ` Tejun Heo
2016-02-17 22:30           ` Jan Kara
2016-02-17 22:41             ` Tahsin Erdogan
2016-02-17 23:02               ` Tejun Heo
2016-02-18  9:55                 ` Jan Kara
2016-02-18 13:00                   ` Tejun Heo
2016-02-18 13:20                     ` Jan Kara
2016-02-19 20:18                     ` Al Viro
2016-02-19 20:51                       ` Tejun Heo
2016-02-19 21:58                         ` Al Viro
2016-02-19 22:15                           ` Tejun Heo
2016-02-19 22:26                             ` Al Viro
2016-02-28 21:53                               ` Tejun Heo
2016-02-29 20:47                 ` [PATCH block/for-linus] writeback: flush inode cgroup wb switches instead of pinning super_block Tejun Heo
2016-02-29 20:54                   ` Al Viro
2016-02-29 20:58                     ` Tejun Heo
2016-02-29 21:06                       ` Al Viro
2016-02-29 21:08                         ` Tejun Heo
2016-02-29 21:21                           ` Jan Kara
2016-02-29 23:28                   ` [PATCH v2 " Tejun Heo
2016-03-01  9:20                     ` Jan Kara
2016-03-01 17:46                     ` Jens Axboe
2016-03-01 17:50                       ` Tejun Heo
2016-03-02 10:29                         ` Jan Kara
2016-03-01 13:39                   ` [PATCH " Tahsin Erdogan
2016-02-18 10:12       ` [PATCH block/for-4.5-fixes] writeback: keep superblock pinned during cgroup writeback association switches Nikolay Borisov
2016-02-18 12:57         ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).