* [RFC] Is the usage of spin_{lock|unlock}_irq in clear_page_dirty_for_io correct?
       [not found] <157ed606-4a61-508b-d26a-2f5d638f39bb@meituan.com>
@ 2018-04-02 11:50 ` Wang Long
  2018-04-03 12:03   ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Wang Long @ 2018-04-02 11:50 UTC (permalink / raw)
  To: tj, hannes; +Cc: gthelen, npiggin, akpm, linux-kernel, wanglong19


Hi Johannes Weiner and Tejun Heo,

I am using linux-4.4.y to test the new cgroup io controller, and the
current stable kernel linux-4.4.y has the following logic:


int clear_page_dirty_for_io(struct page *page){
...
...
                 memcg = mem_cgroup_begin_page_stat(page); ----------(a)
                 wb = unlocked_inode_to_wb_begin(inode, &locked); ---------(b)
                 if (TestClearPageDirty(page)) {
                         mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
                         dec_zone_page_state(page, NR_FILE_DIRTY);
                         dec_wb_stat(wb, WB_RECLAIMABLE);
                         ret = 1;
                 }
                 unlocked_inode_to_wb_end(inode, locked); -----------(c)
                 mem_cgroup_end_page_stat(memcg); -------------(d)
                 return ret;
...
...
}


When the memcg is moving and the inode's I_WB_SWITCH flag is set, the
locking sequence is the following:


spin_lock_irqsave(&memcg->move_lock, flags); -------------(a)
         spin_lock_irq(&inode->i_mapping->tree_lock); ------------(b)
         spin_unlock_irq(&inode->i_mapping->tree_lock); -----------(c)
spin_unlock_irqrestore(&memcg->move_lock, flags); -----------(d)


After (c), local interrupts are enabled. I think that is not correct.

We got a deadlock backtrace after (c): the CPU took a softirq, and the
interrupt handler also called mem_cgroup_begin_page_stat() to take the
same memcg->move_lock.

Since the required conditions are rare, this scenario is difficult to
reproduce, but it really exists.

So how about changing (b) and (c) to spin_lock_irqsave()/spin_unlock_irqrestore()?
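
Something like the following sketch of the two helpers (illustrative
only, to show the shape of the change; the new flags parameter would
have to be threaded through all callers):

static inline struct bdi_writeback *
unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp,
			   unsigned long *flags)
{
	rcu_read_lock();
	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
	if (unlikely(*lockedp))
		/* was: spin_lock_irq(&inode->i_mapping->tree_lock); */
		spin_lock_irqsave(&inode->i_mapping->tree_lock, *flags);
	return inode_to_wb(inode);
}

static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked,
					    unsigned long flags)
{
	if (unlikely(locked))
		/* was: spin_unlock_irq(), which enables irqs unconditionally */
		spin_unlock_irqrestore(&inode->i_mapping->tree_lock, flags);
	rcu_read_unlock();
}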

Thanks:-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Is the usage of spin_{lock|unlock}_irq in clear_page_dirty_for_io correct?
  2018-04-02 11:50 ` [RFC] Is the usage of spin_{lock|unlock}_irq in clear_page_dirty_for_io correct? Wang Long
@ 2018-04-03 12:03   ` Michal Hocko
  2018-04-03 23:12     ` Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2018-04-03 12:03 UTC (permalink / raw)
  To: Wang Long; +Cc: tj, hannes, gthelen, akpm, linux-kernel, Nicholas Piggin

On Mon 02-04-18 19:50:50, Wang Long wrote:
> 
> Hi Johannes Weiner and Tejun Heo,
> 
> I am using linux-4.4.y to test the new cgroup io controller, and the
> current stable kernel linux-4.4.y has the following logic:
> 
> 
> int clear_page_dirty_for_io(struct page *page){
> ...
> ...
>                 memcg = mem_cgroup_begin_page_stat(page); ----------(a)
>                 wb = unlocked_inode_to_wb_begin(inode, &locked); ---------(b)
>                 if (TestClearPageDirty(page)) {
>                         mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
>                         dec_zone_page_state(page, NR_FILE_DIRTY);
>                         dec_wb_stat(wb, WB_RECLAIMABLE);
>                         ret = 1;
>                 }
>                 unlocked_inode_to_wb_end(inode, locked); -----------(c)
>                 mem_cgroup_end_page_stat(memcg); -------------(d)
>                 return ret;
> ...
> ...
> }
> 
> 
> When the memcg is moving and the inode's I_WB_SWITCH flag is set, the
> locking sequence is the following:
> 
> 
> spin_lock_irqsave(&memcg->move_lock, flags); -------------(a)
>         spin_lock_irq(&inode->i_mapping->tree_lock); ------------(b)
>         spin_unlock_irq(&inode->i_mapping->tree_lock); -----------(c)
> spin_unlock_irqrestore(&memcg->move_lock, flags); -----------(d)
> 
> 
> After (c), local interrupts are enabled. I think that is not correct.
> 
> We got a deadlock backtrace after (c): the CPU took a softirq, and the
> interrupt handler also called mem_cgroup_begin_page_stat() to take the
> same memcg->move_lock.
> 
> Since the required conditions are rare, this scenario is difficult to
> reproduce, but it really exists.
> 
> So how about changing (b) and (c) to spin_lock_irqsave()/spin_unlock_irqrestore()?

Yes, it seems we really need this even for the current tree. Please note
that at least clear_page_dirty_for_io doesn't lock memcg anymore.
__cancel_dirty_page still uses lock_page_memcg though (former
mem_cgroup_begin_page_stat).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Is the usage of spin_{lock|unlock}_irq in clear_page_dirty_for_io correct?
  2018-04-03 12:03   ` Michal Hocko
@ 2018-04-03 23:12     ` Greg Thelen
  2018-04-04  6:31       ` Wang Long
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-03 23:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: wanglong19, Tejun Heo, Johannes Weiner, akpm, LKML, npiggin

On Tue, Apr 3, 2018 at 5:03 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 02-04-18 19:50:50, Wang Long wrote:
> >
> > Hi Johannes Weiner and Tejun Heo,
> >
> > I am using linux-4.4.y to test the new cgroup io controller, and the
> > current stable kernel linux-4.4.y has the following logic:
> >
> >
> > int clear_page_dirty_for_io(struct page *page){
> > ...
> > ...
> >                 memcg = mem_cgroup_begin_page_stat(page); ----------(a)
> >                 wb = unlocked_inode_to_wb_begin(inode, &locked); ---------(b)
> >                 if (TestClearPageDirty(page)) {
> >                         mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
> >                         dec_zone_page_state(page, NR_FILE_DIRTY);
> >                         dec_wb_stat(wb, WB_RECLAIMABLE);
> >                         ret = 1;
> >                 }
> >                 unlocked_inode_to_wb_end(inode, locked); -----------(c)
> >                 mem_cgroup_end_page_stat(memcg); -------------(d)
> >                 return ret;
> > ...
> > ...
> > }
> >
> >
> > When the memcg is moving and the inode's I_WB_SWITCH flag is set, the
> > locking sequence is the following:
> >
> >
> > spin_lock_irqsave(&memcg->move_lock, flags); -------------(a)
> >         spin_lock_irq(&inode->i_mapping->tree_lock); ------------(b)
> >         spin_unlock_irq(&inode->i_mapping->tree_lock); -----------(c)
> > spin_unlock_irqrestore(&memcg->move_lock, flags); -----------(d)
> >
> >
> > After (c), local interrupts are enabled. I think that is not correct.
> >
> > We got a deadlock backtrace after (c): the CPU took a softirq, and the
> > interrupt handler also called mem_cgroup_begin_page_stat() to take the
> > same memcg->move_lock.
> >
> > Since the required conditions are rare, this scenario is difficult to
> > reproduce, but it really exists.
> >
> > So how about changing (b) and (c) to spin_lock_irqsave()/spin_unlock_irqrestore()?

> Yes, it seems we really need this even for the current tree. Please note
> that at least clear_page_dirty_for_io doesn't lock memcg anymore.
> __cancel_dirty_page still uses lock_page_memcg though (former
> mem_cgroup_begin_page_stat).
> --
> Michal Hocko
> SUSE Labs

I agree the issue looks real in 4.4 stable and upstream.  It seems like
unlocked_inode_to_wb_begin/_end should use spin_lock_irqsave/restore.

I'm testing a little patch now.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Is the usage of spin_{lock|unlock}_irq in clear_page_dirty_for_io correct?
  2018-04-03 23:12     ` Greg Thelen
@ 2018-04-04  6:31       ` Wang Long
  2018-04-06  8:03         ` [PATCH] writeback: safer lock nesting Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Wang Long @ 2018-04-04  6:31 UTC (permalink / raw)
  To: Greg Thelen, Michal Hocko; +Cc: Tejun Heo, Johannes Weiner, akpm, LKML, npiggin



On 4/4/2018 7:12 AM, Greg Thelen wrote:
> On Tue, Apr 3, 2018 at 5:03 AM Michal Hocko <mhocko@kernel.org> wrote:
>
>> On Mon 02-04-18 19:50:50, Wang Long wrote:
>>> Hi Johannes Weiner and Tejun Heo,
>>>
>>> I am using linux-4.4.y to test the new cgroup io controller, and the
>>> current stable kernel linux-4.4.y has the following logic:
>>>
>>>
>>> int clear_page_dirty_for_io(struct page *page){
>>> ...
>>> ...
>>>                  memcg = mem_cgroup_begin_page_stat(page); ----------(a)
>>>                  wb = unlocked_inode_to_wb_begin(inode, &locked); ---------(b)
>>>                  if (TestClearPageDirty(page)) {
>>>                          mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
>>>                          dec_zone_page_state(page, NR_FILE_DIRTY);
>>>                          dec_wb_stat(wb, WB_RECLAIMABLE);
>>>                          ret = 1;
>>>                  }
>>>                  unlocked_inode_to_wb_end(inode, locked); -----------(c)
>>>                  mem_cgroup_end_page_stat(memcg); -------------(d)
>>>                  return ret;
>>> ...
>>> ...
>>> }
>>>
>>>
>>> When the memcg is moving and the inode's I_WB_SWITCH flag is set, the
>>> locking sequence is the following:
>>>
>>>
>>> spin_lock_irqsave(&memcg->move_lock, flags); -------------(a)
>>>          spin_lock_irq(&inode->i_mapping->tree_lock); ------------(b)
>>>          spin_unlock_irq(&inode->i_mapping->tree_lock); -----------(c)
>>> spin_unlock_irqrestore(&memcg->move_lock, flags); -----------(d)
>>>
>>>
>>> After (c), local interrupts are enabled. I think that is not correct.
>>>
>>> We got a deadlock backtrace after (c): the CPU took a softirq, and the
>>> interrupt handler also called mem_cgroup_begin_page_stat() to take the
>>> same memcg->move_lock.
>>>
>>> Since the required conditions are rare, this scenario is difficult to
>>> reproduce, but it really exists.
>>>
>>> So how about changing (b) and (c) to spin_lock_irqsave()/spin_unlock_irqrestore()?
>> Yes, it seems we really need this even for the current tree. Please note
>> that at least clear_page_dirty_for_io doesn't lock memcg anymore.
>> __cancel_dirty_page still uses lock_page_memcg though (former
>> mem_cgroup_begin_page_stat).
>> --
>> Michal Hocko
>> SUSE Labs
> I agree the issue looks real in 4.4 stable and upstream.  It seems like
> unlocked_inode_to_wb_begin/_end should use spin_lock_irqsave/restore.
>
> I'm testing a little patch now.
Thanks.

When fixing it upstream, the longterm kernels 4.9 and 4.14 also need the fix.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] writeback: safer lock nesting
  2018-04-04  6:31       ` Wang Long
@ 2018-04-06  8:03         ` Greg Thelen
  2018-04-06  8:07           ` Michal Hocko
  2018-04-10 13:50           ` [PATCH] " Sasha Levin
  0 siblings, 2 replies; 24+ messages in thread
From: Greg Thelen @ 2018-04-06  8:03 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo
  Cc: npiggin, linux-kernel, linux-mm, Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If an inode switch and a process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().
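
Reduced to the raw locking primitives (a sketch mirroring the reported
sequence), the problem is that spin_unlock_irq() enables interrupts
regardless of the state saved by the enclosing spin_lock_irqsave():

    unsigned long flags;

    spin_lock_irqsave(&memcg->move_lock, flags);  /* irqs off, state saved */
    spin_lock_irq(&mapping->tree_lock);           /* irqs stay off */
    spin_unlock_irq(&mapping->tree_lock);         /* irqs ON - too early */
    /* an interrupt here can retake memcg->move_lock => deadlock */
    spin_unlock_irqrestore(&memcg->move_lock, flags);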

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we do so to prevent future
surprises.

Reported-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 fs/fs-writeback.c           |  5 +++--
 include/linux/backing-dev.h | 18 ++++++++++++------
 mm/page-writeback.c         | 15 +++++++++------
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..d51bae5a53e2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -746,10 +746,11 @@ int inode_congested(struct inode *inode, int cong_bits)
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
 		bool locked, congested;
+		unsigned long flags;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, locked, flags);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..6c74b64d6f56 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -347,6 +347,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
  * @lockedp: temp bool output param, to be passed to the end function
+ * @flags: saved irq flags, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -359,7 +360,8 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * disabled on return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp,
+			   unsigned long *flags)
 {
 	rcu_read_lock();
 
@@ -370,7 +372,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
 	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, *flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -383,11 +385,13 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
  * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @flags: *@flags from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked,
+					    unsigned long flags)
 {
 	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock, flags);
 
 	rcu_read_unlock();
 }
@@ -434,12 +438,14 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp,
+			   unsigned long *flags)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked,
+					    unsigned long flags)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..ca786528c74d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2501,13 +2501,14 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
+		unsigned long flags;
 		bool locked;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, locked, flags);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2613,15 +2614,16 @@ void __cancel_dirty_page(struct page *page)
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
+		unsigned long flags;
 		bool locked;
 
 		lock_page_memcg(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, locked, flags);
 		unlock_page_memcg(page);
 	} else {
 		ClearPageDirty(page);
@@ -2653,6 +2655,7 @@ int clear_page_dirty_for_io(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
+		unsigned long flags;
 		bool locked;
 
 		/*
@@ -2690,14 +2693,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);
 		if (TestClearPageDirty(page)) {
 			dec_lruvec_page_state(page, NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, locked, flags);
 		return ret;
 	}
 	return TestClearPageDirty(page);
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] writeback: safer lock nesting
  2018-04-06  8:03         ` [PATCH] writeback: safer lock nesting Greg Thelen
@ 2018-04-06  8:07           ` Michal Hocko
  2018-04-06 18:49             ` Greg Thelen
  2018-04-10 13:50           ` [PATCH] " Sasha Levin
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2018-04-06  8:07 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Wang Long, Andrew Morton, Johannes Weiner, Tejun Heo, npiggin,
	linux-kernel, linux-mm

On Fri 06-04-18 01:03:24, Greg Thelen wrote:
[...]
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d4d04fee568a..d51bae5a53e2 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -746,10 +746,11 @@ int inode_congested(struct inode *inode, int cong_bits)
>  	if (inode && inode_to_wb_is_valid(inode)) {
>  		struct bdi_writeback *wb;
>  		bool locked, congested;
> +		unsigned long flags;
>  
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);

Wouldn't it be better to have a cookie (struct) rather than 2 parameters
and let unlocked_inode_to_wb_end DTRT?
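
Something like this sketch, perhaps (field and parameter names are only
illustrative):

	struct wb_lock_cookie {
		bool locked;
		unsigned long flags;
	};

	struct bdi_writeback *
	unlocked_inode_to_wb_begin(struct inode *inode,
				   struct wb_lock_cookie *cookie);
	void unlocked_inode_to_wb_end(struct inode *inode,
				      struct wb_lock_cookie *cookie);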

>  		congested = wb_congested(wb, cong_bits);
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, locked, flags);
>  		return congested;
>  	}
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] writeback: safer lock nesting
  2018-04-06  8:07           ` Michal Hocko
@ 2018-04-06 18:49             ` Greg Thelen
  2018-04-06 18:55               ` [PATCH v2] " Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-06 18:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wang Long, Andrew Morton, Johannes Weiner, Tejun Heo, npiggin,
	LKML, Linux MM

On Fri, Apr 6, 2018 at 1:07 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 06-04-18 01:03:24, Greg Thelen wrote:
> [...]
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index d4d04fee568a..d51bae5a53e2 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -746,10 +746,11 @@ int inode_congested(struct inode *inode, int cong_bits)
> >       if (inode && inode_to_wb_is_valid(inode)) {
> >               struct bdi_writeback *wb;
> >               bool locked, congested;
> > +             unsigned long flags;
> >
> > -             wb = unlocked_inode_to_wb_begin(inode, &locked);
> > +             wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);

> Wouldn't it be better to have a cookie (struct) rather than 2 parameters
> and let unlocked_inode_to_wb_end DTRT?

Nod.  I'll post a V2 patch with that change.

> >               congested = wb_congested(wb, cong_bits);
> > -             unlocked_inode_to_wb_end(inode, locked);
> > +             unlocked_inode_to_wb_end(inode, locked, flags);
> >               return congested;
> >       }
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2] writeback: safer lock nesting
  2018-04-06 18:49             ` Greg Thelen
@ 2018-04-06 18:55               ` Greg Thelen
  2018-04-07 18:56                   ` kbuild test robot
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-06 18:55 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo
  Cc: npiggin, linux-kernel, linux-mm, Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If an inode switch and a process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we do so to prevent future
surprises.

Reported-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 30 ++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..518f72754a77 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie;
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..1d744c61d996 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during transaction.  IRQ may or may not be disabled on return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	rcu_read_lock();
 
@@ -367,10 +366,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	 * Paired with store_release in inode_switch_wb_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
-	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -382,12 +381,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 /**
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
+				       cookie->flags);
 
 	rcu_read_unlock();
 }
@@ -434,12 +435,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..0e82c2fa78df 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie;
 
 		lock_page_memcg(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		unlock_page_memcg(page);
 	} else {
 		ClearPageDirty(page);
@@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie;
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		if (TestClearPageDirty(page)) {
 			dec_lruvec_page_state(page, NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		return ret;
 	}
 	return TestClearPageDirty(page);
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] writeback: safer lock nesting
  2018-04-06 18:55               ` [PATCH v2] " Greg Thelen
@ 2018-04-07 18:56                   ` kbuild test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2018-04-07 18:56 UTC (permalink / raw)
  To: Greg Thelen
  Cc: kbuild-all, Wang Long, Michal Hocko, Andrew Morton,
	Johannes Weiner, Tejun Heo, npiggin, linux-kernel, linux-mm,
	Greg Thelen

[-- Attachment #1: Type: text/plain, Size: 6265 bytes --]

Hi Greg,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.16]
[cannot apply to next-20180406]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Greg-Thelen/writeback-safer-lock-nesting/20180407-122200
config: mips-db1xxx_defconfig (attached as .config)
compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=mips 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from arch/mips/include/asm/atomic.h:17:0,
                    from include/linux/atomic.h:5,
                    from arch/mips/include/asm/processor.h:14,
                    from arch/mips/include/asm/thread_info.h:16,
                    from include/linux/thread_info.h:38,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/mips/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from mm/page-writeback.c:16:
   mm/page-writeback.c: In function 'account_page_redirty':
>> include/linux/irqflags.h:80:3: warning: 'cookie.flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
      arch_local_irq_restore(flags);  \
      ^~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:2504:25: note: 'cookie.flags' was declared here
      struct wb_lock_cookie cookie;
                            ^~~~~~
   In file included from arch/mips/include/asm/atomic.h:17:0,
                    from include/linux/atomic.h:5,
                    from arch/mips/include/asm/processor.h:14,
                    from arch/mips/include/asm/thread_info.h:16,
                    from include/linux/thread_info.h:38,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/mips/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from mm/page-writeback.c:16:
   mm/page-writeback.c: In function '__cancel_dirty_page':
>> include/linux/irqflags.h:80:3: warning: 'cookie.flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
      arch_local_irq_restore(flags);  \
      ^~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:2616:25: note: 'cookie.flags' was declared here
      struct wb_lock_cookie cookie;
                            ^~~~~~
   In file included from arch/mips/include/asm/atomic.h:17:0,
                    from include/linux/atomic.h:5,
                    from arch/mips/include/asm/processor.h:14,
                    from arch/mips/include/asm/thread_info.h:16,
                    from include/linux/thread_info.h:38,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/mips/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from mm/page-writeback.c:16:
   mm/page-writeback.c: In function 'clear_page_dirty_for_io':
>> include/linux/irqflags.h:80:3: warning: 'cookie.flags' may be used uninitialized in this function [-Wmaybe-uninitialized]
      arch_local_irq_restore(flags);  \
      ^~~~~~~~~~~~~~~~~~~~~~
   mm/page-writeback.c:2656:25: note: 'cookie.flags' was declared here
      struct wb_lock_cookie cookie;
                            ^~~~~~

vim +80 include/linux/irqflags.h

81d68a96 Steven Rostedt 2008-05-12  66  
df9ee292 David Howells  2010-10-07  67  /*
df9ee292 David Howells  2010-10-07  68   * Wrap the arch provided IRQ routines to provide appropriate checks.
df9ee292 David Howells  2010-10-07  69   */
df9ee292 David Howells  2010-10-07  70  #define raw_local_irq_disable()		arch_local_irq_disable()
df9ee292 David Howells  2010-10-07  71  #define raw_local_irq_enable()		arch_local_irq_enable()
df9ee292 David Howells  2010-10-07  72  #define raw_local_irq_save(flags)			\
df9ee292 David Howells  2010-10-07  73  	do {						\
df9ee292 David Howells  2010-10-07  74  		typecheck(unsigned long, flags);	\
df9ee292 David Howells  2010-10-07  75  		flags = arch_local_irq_save();		\
df9ee292 David Howells  2010-10-07  76  	} while (0)
df9ee292 David Howells  2010-10-07  77  #define raw_local_irq_restore(flags)			\
df9ee292 David Howells  2010-10-07  78  	do {						\
df9ee292 David Howells  2010-10-07  79  		typecheck(unsigned long, flags);	\
df9ee292 David Howells  2010-10-07 @80  		arch_local_irq_restore(flags);		\
df9ee292 David Howells  2010-10-07  81  	} while (0)
df9ee292 David Howells  2010-10-07  82  #define raw_local_save_flags(flags)			\
df9ee292 David Howells  2010-10-07  83  	do {						\
df9ee292 David Howells  2010-10-07  84  		typecheck(unsigned long, flags);	\
df9ee292 David Howells  2010-10-07  85  		flags = arch_local_save_flags();	\
df9ee292 David Howells  2010-10-07  86  	} while (0)
df9ee292 David Howells  2010-10-07  87  #define raw_irqs_disabled_flags(flags)			\
df9ee292 David Howells  2010-10-07  88  	({						\
df9ee292 David Howells  2010-10-07  89  		typecheck(unsigned long, flags);	\
df9ee292 David Howells  2010-10-07  90  		arch_irqs_disabled_flags(flags);	\
df9ee292 David Howells  2010-10-07  91  	})
df9ee292 David Howells  2010-10-07  92  #define raw_irqs_disabled()		(arch_irqs_disabled())
df9ee292 David Howells  2010-10-07  93  #define raw_safe_halt()			arch_safe_halt()
de30a2b3 Ingo Molnar    2006-07-03  94  

:::::: The code at line 80 was first introduced by commit
:::::: df9ee29270c11dba7d0fe0b83ce47a4d8e8d2101 Fix IRQ flag handling naming

:::::: TO: David Howells <dhowells@redhat.com>
:::::: CC: David Howells <dhowells@redhat.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 22089 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
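
The warning follows from the cookie helpers: cookie.flags is written
only on the unlikely cookie.locked path, so gcc cannot prove that the
spin_unlock_irqrestore() path reads initialized data. As the v3
changelog below notes, zero-initializing the cookie at its declaration
is enough to silence the warning; the callers end up looking like this
sketch:

	struct wb_lock_cookie cookie = {0};	/* init for -Wmaybe-uninitialized */

	wb = unlocked_inode_to_wb_begin(inode, &cookie);
	/* ... update wb stats ... */
	unlocked_inode_to_wb_end(inode, &cookie);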


* [PATCH v3] writeback: safer lock nesting
  2018-04-07 18:56                   ` kbuild test robot
@ 2018-04-10  0:59                   ` Greg Thelen
  2018-04-10  6:33                     ` Michal Hocko
                                       ` (2 more replies)
  -1 siblings, 3 replies; 24+ messages in thread
From: Greg Thelen @ 2018-04-10  0:59 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo
  Cc: npiggin, linux-kernel, linux-mm, Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If an inode switch and a process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we do so to prevent future
surprises.

Reported-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
---
Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 30 ++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..f4b2f6625913 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie;
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..1d744c61d996 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during transaction.  IRQ may or may not be disabled on return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	rcu_read_lock();
 
@@ -367,10 +366,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	 * Paired with store_release in inode_switch_wb_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
-	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -382,12 +381,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 /**
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
+				       cookie->flags);
 
 	rcu_read_unlock();
 }
@@ -434,12 +435,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..bc38a2a7a597 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {0};
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {0};
 
 		lock_page_memcg(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		unlock_page_memcg(page);
 	} else {
 		ClearPageDirty(page);
@@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {0};
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		if (TestClearPageDirty(page)) {
 			dec_lruvec_page_state(page, NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		return ret;
 	}
 	return TestClearPageDirty(page);
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10  0:59                   ` [PATCH v3] " Greg Thelen
@ 2018-04-10  6:33                     ` Michal Hocko
  2018-04-10 20:48                       ` Andrew Morton
  2018-04-10  8:14                     ` Wang Long
  2018-04-10 20:37                     ` Andrew Morton
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2018-04-10  6:33 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Wang Long, Andrew Morton, Johannes Weiner, Tejun Heo, npiggin,
	linux-kernel, linux-mm

On Mon 09-04-18 17:59:08, Greg Thelen wrote:
> lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> the page's memcg is undergoing move accounting, which occurs when a
> process leaves its memcg for a new one that has
> memory.move_charge_at_immigrate set.
> 
> unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> given inode is switching writeback domains.  Switches occur when enough
> writes are issued from a new domain.
> 
> This existing pattern is thus suspicious:
>     lock_page_memcg(page);
>     unlocked_inode_to_wb_begin(inode, &locked);
>     ...
>     unlocked_inode_to_wb_end(inode, locked);
>     unlock_page_memcg(page);
> 
> If an inode switch and a process memcg migration are both in-flight then
> unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> still holding the lock_page_memcg() irq spinlock.  This suggests the
> possibility of deadlock if an interrupt occurs before
> unlock_page_memcg().
> 
>     truncate
>     __cancel_dirty_page
>     lock_page_memcg
>     unlocked_inode_to_wb_begin
>     unlocked_inode_to_wb_end
>     <interrupts mistakenly enabled>
>                                     <interrupt>
>                                     end_page_writeback
>                                     test_clear_page_writeback
>                                     lock_page_memcg
>                                     <deadlock>
>     unlock_page_memcg
> 
> Due to configuration limitations this deadlock is not currently possible
> because we don't mix cgroup writeback (a cgroupv2 feature) and
> memory.move_charge_at_immigrate (a cgroupv1 feature).
> 
> If the kernel is hacked to always claim inode switching and memcg
> moving_account, then this script triggers lockup in less than a minute:
>   cd /mnt/cgroup/memory
>   mkdir a b
>   echo 1 > a/memory.move_charge_at_immigrate
>   echo 1 > b/memory.move_charge_at_immigrate
>   (
>     echo $BASHPID > a/cgroup.procs
>     while true; do
>       dd if=/dev/zero of=/mnt/big bs=1M count=256
>     done
>   ) &
>   while true; do
>     sync
>   done &
>   sleep 1h &
>   SLEEP=$!
>   while true; do
>     echo $SLEEP > a/cgroup.procs
>     echo $SLEEP > b/cgroup.procs
>   done
> 

Very nice changelog!

> Given the deadlock is not currently possible, it's debatable if there's
> any reason to modify the kernel.  I suggest we should, to prevent future
> surprises.

Agreed!

> Reported-by: Wang Long <wanglong19@meituan.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613

Not stable material IMHO, but
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
AFAIU

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> Changelog since v2:
> - explicitly initialize wb_lock_cookie to silence compiler warnings.
> 
> Changelog since v1:
> - add wb_lock_cookie to record lock context.
> 
>  fs/fs-writeback.c                |  7 ++++---
>  include/linux/backing-dev-defs.h |  5 +++++
>  include/linux/backing-dev.h      | 30 ++++++++++++++++--------------
>  mm/page-writeback.c              | 18 +++++++++---------
>  4 files changed, 34 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1280f915079b..f4b2f6625913 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
>  	 */
>  	if (inode && inode_to_wb_is_valid(inode)) {
>  		struct bdi_writeback *wb;
> -		bool locked, congested;
> +		struct wb_lock_cookie lock_cookie;
> +		bool congested;
>  
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
>  		congested = wb_congested(wb, cong_bits);
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &lock_cookie);
>  		return congested;
>  	}
>  
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index bfe86b54f6c1..0bd432a4d7bd 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	set_wb_congested(bdi->wb.congested, sync);
>  }
>  
> +struct wb_lock_cookie {
> +	bool locked;
> +	unsigned long flags;
> +};
> +
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  /**
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 3e4ce54d84ab..1d744c61d996 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>  /**
>   * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
>   * @inode: target inode
> - * @lockedp: temp bool output param, to be passed to the end function
> + * @cookie: output param, to be passed to the end function
>   *
>   * The caller wants to access the wb associated with @inode but isn't
>   * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>   * association doesn't change until the transaction is finished with
>   * unlocked_inode_to_wb_end().
>   *
> - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> - * afterwards and can't sleep during transaction.  IRQ may or may not be
> - * disabled on return.
> + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> + * can't sleep during transaction.  IRQ may or may not be disabled on return.
>   */
>  static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>  {
>  	rcu_read_lock();
>  
> @@ -367,10 +366,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>  	 * Paired with store_release in inode_switch_wb_work_fn() and
>  	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
>  	 */
> -	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
> +	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
>  
> -	if (unlikely(*lockedp))
> -		spin_lock_irq(&inode->i_mapping->tree_lock);
> +	if (unlikely(cookie->locked))
> +		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
>  
>  	/*
>  	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
> @@ -382,12 +381,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>  /**
>   * unlocked_inode_to_wb_end - end inode wb access transaction
>   * @inode: target inode
> - * @locked: *@lockedp from unlocked_inode_to_wb_begin()
> + * @cookie: @cookie from unlocked_inode_to_wb_begin()
>   */
> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +					    struct wb_lock_cookie *cookie)
>  {
> -	if (unlikely(locked))
> -		spin_unlock_irq(&inode->i_mapping->tree_lock);
> +	if (unlikely(cookie->locked))
> +		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
> +				       cookie->flags);
>  
>  	rcu_read_unlock();
>  }
> @@ -434,12 +435,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
>  }
>  
>  static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>  {
>  	return inode_to_wb(inode);
>  }
>  
> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +					    struct wb_lock_cookie *cookie)
>  {
>  }
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 586f31261c83..bc38a2a7a597 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>  
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>  		current->nr_dirtied--;
>  		dec_node_page_state(page, NR_DIRTIED);
>  		dec_wb_stat(wb, WB_DIRTIED);
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>  	}
>  }
>  EXPORT_SYMBOL(account_page_redirty);
> @@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
>  	if (mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>  
>  		lock_page_memcg(page);
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>  
>  		if (TestClearPageDirty(page))
>  			account_page_cleaned(page, mapping, wb);
>  
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>  		unlock_page_memcg(page);
>  	} else {
>  		ClearPageDirty(page);
> @@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>  
>  		/*
>  		 * Yes, Virginia, this is indeed insane.
> @@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * always locked coming in here, so we get the desired
>  		 * exclusion.
>  		 */
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>  		if (TestClearPageDirty(page)) {
>  			dec_lruvec_page_state(page, NR_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  			dec_wb_stat(wb, WB_RECLAIMABLE);
>  			ret = 1;
>  		}
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>  		return ret;
>  	}
>  	return TestClearPageDirty(page);
> -- 
> 2.17.0.484.g0c8726318c-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10  0:59                   ` [PATCH v3] " Greg Thelen
  2018-04-10  6:33                     ` Michal Hocko
@ 2018-04-10  8:14                     ` Wang Long
  2018-04-11  0:40                       ` Greg Thelen
  2018-04-10 20:37                     ` Andrew Morton
  2 siblings, 1 reply; 24+ messages in thread
From: Wang Long @ 2018-04-10  8:14 UTC (permalink / raw)
  To: Greg Thelen, Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo
  Cc: npiggin, linux-kernel, linux-mm

> lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> the page's memcg is undergoing move accounting, which occurs when a
> process leaves its memcg for a new one that has
> memory.move_charge_at_immigrate set.
>
> unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> given inode is switching writeback domains.  Switches occur when enough
> writes are issued from a new domain.
>
> This existing pattern is thus suspicious:
>      lock_page_memcg(page);
>      unlocked_inode_to_wb_begin(inode, &locked);
>      ...
>      unlocked_inode_to_wb_end(inode, locked);
>      unlock_page_memcg(page);
>
> If inode switch and process memcg migration are both in-flight then
> unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> still holding the lock_page_memcg() irq spinlock.  This suggests the
> possibility of deadlock if an interrupt occurs before
> unlock_page_memcg().
>
>      truncate
>      __cancel_dirty_page
>      lock_page_memcg
>      unlocked_inode_to_wb_begin
>      unlocked_inode_to_wb_end
>      <interrupts mistakenly enabled>
>                                      <interrupt>
>                                      end_page_writeback
>                                      test_clear_page_writeback
>                                      lock_page_memcg
>                                      <deadlock>
>      unlock_page_memcg
>
> Due to configuration limitations this deadlock is not currently possible
> because we don't mix cgroup writeback (a cgroupv2 feature) and
> memory.move_charge_at_immigrate (a cgroupv1 feature).
>
> If the kernel is hacked to always claim inode switching and memcg
> moving_account, then this script triggers lockup in less than a minute:
>    cd /mnt/cgroup/memory
>    mkdir a b
>    echo 1 > a/memory.move_charge_at_immigrate
>    echo 1 > b/memory.move_charge_at_immigrate
>    (
>      echo $BASHPID > a/cgroup.procs
>      while true; do
>        dd if=/dev/zero of=/mnt/big bs=1M count=256
>      done
>    ) &
>    while true; do
>      sync
>    done &
>    sleep 1h &
>    SLEEP=$!
>    while true; do
>      echo $SLEEP > a/cgroup.procs
>      echo $SLEEP > b/cgroup.procs
>    done
>
> Given the deadlock is not currently possible, it's debatable if there's
> any reason to modify the kernel.  I suggest we should, to prevent future
> surprises.
This deadlock has occurred three times in our environment. It would be better to
cc stable and backport it.

Acked-by: Wang Long <wanglong19@meituan.com>

thanks

> Reported-by: Wang Long <wanglong19@meituan.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
> ---
> Changelog since v2:
> - explicitly initialize wb_lock_cookie to silence compiler warnings.
>
> Changelog since v1:
> - add wb_lock_cookie to record lock context.
>
>   fs/fs-writeback.c                |  7 ++++---
>   include/linux/backing-dev-defs.h |  5 +++++
>   include/linux/backing-dev.h      | 30 ++++++++++++++++--------------
>   mm/page-writeback.c              | 18 +++++++++---------
>   4 files changed, 34 insertions(+), 26 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1280f915079b..f4b2f6625913 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
>   	 */
>   	if (inode && inode_to_wb_is_valid(inode)) {
>   		struct bdi_writeback *wb;
> -		bool locked, congested;
> +		struct wb_lock_cookie lock_cookie;
> +		bool congested;
>   
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
>   		congested = wb_congested(wb, cong_bits);
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &lock_cookie);
>   		return congested;
>   	}
>   
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index bfe86b54f6c1..0bd432a4d7bd 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>   	set_wb_congested(bdi->wb.congested, sync);
>   }
>   
> +struct wb_lock_cookie {
> +	bool locked;
> +	unsigned long flags;
> +};
> +
>   #ifdef CONFIG_CGROUP_WRITEBACK
>   
>   /**
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 3e4ce54d84ab..1d744c61d996 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>   /**
>    * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
>    * @inode: target inode
> - * @lockedp: temp bool output param, to be passed to the end function
> + * @cookie: output param, to be passed to the end function
>    *
>    * The caller wants to access the wb associated with @inode but isn't
>    * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>    * association doesn't change until the transaction is finished with
>    * unlocked_inode_to_wb_end().
>    *
> - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> - * afterwards and can't sleep during transaction.  IRQ may or may not be
> - * disabled on return.
> + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> + * can't sleep during transaction.  IRQ may or may not be disabled on return.
>    */
>   static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>   {
>   	rcu_read_lock();
>   
> @@ -367,10 +366,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>   	 * Paired with store_release in inode_switch_wb_work_fn() and
>   	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
>   	 */
> -	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
> +	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
>   
> -	if (unlikely(*lockedp))
> -		spin_lock_irq(&inode->i_mapping->tree_lock);
> +	if (unlikely(cookie->locked))
> +		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
>   
>   	/*
>   	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
> @@ -382,12 +381,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>   /**
>    * unlocked_inode_to_wb_end - end inode wb access transaction
>    * @inode: target inode
> - * @locked: *@lockedp from unlocked_inode_to_wb_begin()
> + * @cookie: @cookie from unlocked_inode_to_wb_begin()
>    */
> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +					    struct wb_lock_cookie *cookie)
>   {
> -	if (unlikely(locked))
> -		spin_unlock_irq(&inode->i_mapping->tree_lock);
> +	if (unlikely(cookie->locked))
> +		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
> +				       cookie->flags);
>   
>   	rcu_read_unlock();
>   }
> @@ -434,12 +435,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
>   }
>   
>   static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>   {
>   	return inode_to_wb(inode);
>   }
>   
> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +					    struct wb_lock_cookie *cookie)
>   {
>   }
>   
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 586f31261c83..bc38a2a7a597 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
>   	if (mapping && mapping_cap_account_dirty(mapping)) {
>   		struct inode *inode = mapping->host;
>   		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>   
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>   		current->nr_dirtied--;
>   		dec_node_page_state(page, NR_DIRTIED);
>   		dec_wb_stat(wb, WB_DIRTIED);
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>   	}
>   }
>   EXPORT_SYMBOL(account_page_redirty);
> @@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
>   	if (mapping_cap_account_dirty(mapping)) {
>   		struct inode *inode = mapping->host;
>   		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>   
>   		lock_page_memcg(page);
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>   
>   		if (TestClearPageDirty(page))
>   			account_page_cleaned(page, mapping, wb);
>   
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>   		unlock_page_memcg(page);
>   	} else {
>   		ClearPageDirty(page);
> @@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
>   	if (mapping && mapping_cap_account_dirty(mapping)) {
>   		struct inode *inode = mapping->host;
>   		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};
>   
>   		/*
>   		 * Yes, Virginia, this is indeed insane.
> @@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
>   		 * always locked coming in here, so we get the desired
>   		 * exclusion.
>   		 */
> -		wb = unlocked_inode_to_wb_begin(inode, &locked);
> +		wb = unlocked_inode_to_wb_begin(inode, &cookie);
>   		if (TestClearPageDirty(page)) {
>   			dec_lruvec_page_state(page, NR_FILE_DIRTY);
>   			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>   			dec_wb_stat(wb, WB_RECLAIMABLE);
>   			ret = 1;
>   		}
> -		unlocked_inode_to_wb_end(inode, locked);
> +		unlocked_inode_to_wb_end(inode, &cookie);
>   		return ret;
>   	}
>   	return TestClearPageDirty(page);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] writeback: safer lock nesting
  2018-04-06  8:03         ` [PATCH] writeback: safer lock nesting Greg Thelen
  2018-04-06  8:07           ` Michal Hocko
@ 2018-04-10 13:50           ` Sasha Levin
  2018-04-11  2:44             ` Wang Long
  1 sibling, 1 reply; 24+ messages in thread
From: Sasha Levin @ 2018-04-10 13:50 UTC (permalink / raw)
  To: Sasha Levin, Greg Thelen, Wang Long; +Cc: npiggin, linux-kernel, stable

Hi,

[This is an automated email]

This commit has been processed by the -stable helper bot and determined
to be a high probability candidate for -stable trees. (score: 44.5575)

The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127.

v4.16.1: Build OK!
v4.15.16: Build OK!
v4.14.33: Build OK!
v4.9.93: Build OK!
v4.4.127: Failed to apply! Possible dependencies:
    62cccb8c8e7a ("mm: simplify lock_page_memcg()")


Please let us know if you'd like to have this patch included in a stable tree.

--
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10  0:59                   ` [PATCH v3] " Greg Thelen
  2018-04-10  6:33                     ` Michal Hocko
  2018-04-10  8:14                     ` Wang Long
@ 2018-04-10 20:37                     ` Andrew Morton
  2018-04-11  1:03                       ` Greg Thelen
  2 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2018-04-10 20:37 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Wang Long, Michal Hocko, Johannes Weiner, Tejun Heo, npiggin,
	linux-kernel, linux-mm

On Mon,  9 Apr 2018 17:59:08 -0700 Greg Thelen <gthelen@google.com> wrote:

> lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> the page's memcg is undergoing move accounting, which occurs when a
> process leaves its memcg for a new one that has
> memory.move_charge_at_immigrate set.
> 
> unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> given inode is switching writeback domains.  Switches occur when enough
> writes are issued from a new domain.
> 
> This existing pattern is thus suspicious:
>     lock_page_memcg(page);
>     unlocked_inode_to_wb_begin(inode, &locked);
>     ...
>     unlocked_inode_to_wb_end(inode, locked);
>     unlock_page_memcg(page);
> 
> If inode switch and process memcg migration are both in-flight then
> unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> still holding the lock_page_memcg() irq spinlock.  This suggests the
> possibility of deadlock if an interrupt occurs before
> unlock_page_memcg().
> 
>     truncate
>     __cancel_dirty_page
>     lock_page_memcg
>     unlocked_inode_to_wb_begin
>     unlocked_inode_to_wb_end
>     <interrupts mistakenly enabled>
>                                     <interrupt>
>                                     end_page_writeback
>                                     test_clear_page_writeback
>                                     lock_page_memcg
>                                     <deadlock>
>     unlock_page_memcg
> 
> Due to configuration limitations this deadlock is not currently possible
> because we don't mix cgroup writeback (a cgroupv2 feature) and
> memory.move_charge_at_immigrate (a cgroupv1 feature).
> 
> If the kernel is hacked to always claim inode switching and memcg
> moving_account, then this script triggers lockup in less than a minute:
>   cd /mnt/cgroup/memory
>   mkdir a b
>   echo 1 > a/memory.move_charge_at_immigrate
>   echo 1 > b/memory.move_charge_at_immigrate
>   (
>     echo $BASHPID > a/cgroup.procs
>     while true; do
>       dd if=/dev/zero of=/mnt/big bs=1M count=256
>     done
>   ) &
>   while true; do
>     sync
>   done &
>   sleep 1h &
>   SLEEP=$!
>   while true; do
>     echo $SLEEP > a/cgroup.procs
>     echo $SLEEP > b/cgroup.procs
>   done
> 
> Given the deadlock is not currently possible, it's debatable if there's
> any reason to modify the kernel.  I suggest we should, to prevent future
> surprises.
> 
> ...
>
> Changelog since v2:
> - explicitly initialize wb_lock_cookie to silence compiler warnings.

But only in some places.  What's up with that?

>
> ...
>
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>  /**
>   * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
>   * @inode: target inode
> - * @lockedp: temp bool output param, to be passed to the end function
> + * @cookie: output param, to be passed to the end function
>   *
>   * The caller wants to access the wb associated with @inode but isn't
>   * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
>   * association doesn't change until the transaction is finished with
>   * unlocked_inode_to_wb_end().
>   *
> - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> - * afterwards and can't sleep during transaction.  IRQ may or may not be
> - * disabled on return.
> + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> + * can't sleep during transaction.  IRQ may or may not be disabled on return.
>   */

Grammar is a bit awkward here,

>
> ...
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
>  	if (mapping && mapping_cap_account_dirty(mapping)) {
>  		struct inode *inode = mapping->host;
>  		struct bdi_writeback *wb;
> -		bool locked;
> +		struct wb_lock_cookie cookie = {0};

Trivia: it's better to use "= {}" here.  That has the same effect and
it doesn't assume that the first field is a scalar.  And indeed, the
first field is a bool so it should be {false}!
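
As a standalone illustration of the "= {}" vs "= {0}" point, here is a
minimal userspace sketch (struct cookie is a stand-in for wb_lock_cookie;
nothing below is kernel code):

  #include <stdbool.h>
  #include <stdio.h>

  struct cookie {
          bool locked;            /* first member is a bool, not a scalar 0 */
          unsigned long flags;
  };

  int main(void)
  {
          struct cookie a = {0};  /* "0" initializes the first member; the
                                     remaining members are implicitly zeroed */
          struct cookie b = {};   /* GNU (and C23) empty braces: every member
                                     zeroed, with no assumption about layout */

          printf("%d %lu / %d %lu\n", a.locked, a.flags, b.locked, b.flags);
          return 0;
  }

Both spellings zero-initialize everything here; "{}" just avoids pretending
the first member is an integer.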


So...


--- a/include/linux/backing-dev.h~writeback-safer-lock-nesting-fix
+++ a/include/linux/backing-dev.h
@@ -355,7 +355,8 @@ static inline struct bdi_writeback *inod
  * unlocked_inode_to_wb_end().
  *
  * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
- * can't sleep during transaction.  IRQ may or may not be disabled on return.
+ * can't sleep during the transaction.  IRQs may or may not be disabled on
+ * return.
  */
 static inline struct bdi_writeback *
 unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
--- a/mm/page-writeback.c~writeback-safer-lock-nesting-fix
+++ a/mm/page-writeback.c
@@ -2501,7 +2501,7 @@ void account_page_redirty(struct page *p
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		struct wb_lock_cookie cookie = {0};
+		struct wb_lock_cookie cookie = {};
 
 		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
@@ -2613,7 +2613,7 @@ void __cancel_dirty_page(struct page *pa
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		struct wb_lock_cookie cookie = {0};
+		struct wb_lock_cookie cookie = {};
 
 		lock_page_memcg(page);
 		wb = unlocked_inode_to_wb_begin(inode, &cookie);
@@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		struct wb_lock_cookie cookie = {0};
+		struct wb_lock_cookie cookie = {};
 
 		/*
 		 * Yes, Virginia, this is indeed insane.

But I wonder about the remaining uninitialized wb_lock_cookies?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10  6:33                     ` Michal Hocko
@ 2018-04-10 20:48                       ` Andrew Morton
  2018-04-11  5:50                         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2018-04-10 20:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, Wang Long, Johannes Weiner, Tejun Heo, npiggin,
	linux-kernel, linux-mm

On Tue, 10 Apr 2018 08:33:57 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > Reported-by: Wang Long <wanglong19@meituan.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
> 
> Not stable material IMHO

Why's that?  Wang Long said he's observed the deadlock three times?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10  8:14                     ` Wang Long
@ 2018-04-11  0:40                       ` Greg Thelen
  0 siblings, 0 replies; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  0:40 UTC (permalink / raw)
  To: Wang Long
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Tejun Heo, npiggin,
	LKML, Linux MM

On Tue, Apr 10, 2018 at 1:15 AM Wang Long <wanglong19@meituan.com> wrote:

> > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> > the page's memcg is undergoing move accounting, which occurs when a
> > process leaves its memcg for a new one that has
> > memory.move_charge_at_immigrate set.
> >
> > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> > given inode is switching writeback domains.  Switches occur when enough
> > writes are issued from a new domain.
> >
> > This existing pattern is thus suspicious:
> >      lock_page_memcg(page);
> >      unlocked_inode_to_wb_begin(inode, &locked);
> >      ...
> >      unlocked_inode_to_wb_end(inode, locked);
> >      unlock_page_memcg(page);
> >
> > If inode switch and process memcg migration are both in-flight then
> > unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> > still holding the lock_page_memcg() irq spinlock.  This suggests the
> > possibility of deadlock if an interrupt occurs before
> > unlock_page_memcg().
> >
> >      truncate
> >      __cancel_dirty_page
> >      lock_page_memcg
> >      unlocked_inode_to_wb_begin
> >      unlocked_inode_to_wb_end
> >      <interrupts mistakenly enabled>
> >                                      <interrupt>
> >                                      end_page_writeback
> >                                      test_clear_page_writeback
> >                                      lock_page_memcg
> >                                      <deadlock>
> >      unlock_page_memcg
> >
> > Due to configuration limitations this deadlock is not currently possible
> > because we don't mix cgroup writeback (a cgroupv2 feature) and
> > memory.move_charge_at_immigrate (a cgroupv1 feature).
> >
> > If the kernel is hacked to always claim inode switching and memcg
> > moving_account, then this script triggers lockup in less than a minute:
> >    cd /mnt/cgroup/memory
> >    mkdir a b
> >    echo 1 > a/memory.move_charge_at_immigrate
> >    echo 1 > b/memory.move_charge_at_immigrate
> >    (
> >      echo $BASHPID > a/cgroup.procs
> >      while true; do
> >        dd if=/dev/zero of=/mnt/big bs=1M count=256
> >      done
> >    ) &
> >    while true; do
> >      sync
> >    done &
> >    sleep 1h &
> >    SLEEP=$!
> >    while true; do
> >      echo $SLEEP > a/cgroup.procs
> >      echo $SLEEP > b/cgroup.procs
> >    done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel.  I suggest we should, to prevent future
> > surprises.
> This deadlock has occurred three times in our environment. It would be better
> to cc stable and backport it.

That's interesting.  Are you using cgroup v1 or v2?  Do you enable
memory.move_charge_at_immigrate?
I assume you've been using 4.4 stable.  I'll take a closer look at a 4.4
stable backport.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10 20:37                     ` Andrew Morton
@ 2018-04-11  1:03                       ` Greg Thelen
  2018-04-11  8:46                         ` [PATCH v4] " Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  1:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wang Long, Michal Hocko, Johannes Weiner, Tejun Heo, npiggin,
	LKML, Linux MM

On Tue, Apr 10, 2018 at 1:38 PM Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Mon,  9 Apr 2018 17:59:08 -0700 Greg Thelen <gthelen@google.com> wrote:

> > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> > the page's memcg is undergoing move accounting, which occurs when a
> > process leaves its memcg for a new one that has
> > memory.move_charge_at_immigrate set.
> >
> > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> > given inode is switching writeback domains.  Switches occur when enough
> > writes are issued from a new domain.
> >
> > This existing pattern is thus suspicious:
> >     lock_page_memcg(page);
> >     unlocked_inode_to_wb_begin(inode, &locked);
> >     ...
> >     unlocked_inode_to_wb_end(inode, locked);
> >     unlock_page_memcg(page);
> >
> > If inode switch and process memcg migration are both in-flight then
> > unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> > still holding the lock_page_memcg() irq spinlock.  This suggests the
> > possibility of deadlock if an interrupt occurs before
> > unlock_page_memcg().
> >
> >     truncate
> >     __cancel_dirty_page
> >     lock_page_memcg
> >     unlocked_inode_to_wb_begin
> >     unlocked_inode_to_wb_end
> >     <interrupts mistakenly enabled>
> >                                     <interrupt>
> >                                     end_page_writeback
> >                                     test_clear_page_writeback
> >                                     lock_page_memcg
> >                                     <deadlock>
> >     unlock_page_memcg
> >
> > Due to configuration limitations this deadlock is not currently possible
> > because we don't mix cgroup writeback (a cgroupv2 feature) and
> > memory.move_charge_at_immigrate (a cgroupv1 feature).
> >
> > If the kernel is hacked to always claim inode switching and memcg
> > moving_account, then this script triggers lockup in less than a minute:
> >   cd /mnt/cgroup/memory
> >   mkdir a b
> >   echo 1 > a/memory.move_charge_at_immigrate
> >   echo 1 > b/memory.move_charge_at_immigrate
> >   (
> >     echo $BASHPID > a/cgroup.procs
> >     while true; do
> >       dd if=/dev/zero of=/mnt/big bs=1M count=256
> >     done
> >   ) &
> >   while true; do
> >     sync
> >   done &
> >   sleep 1h &
> >   SLEEP=$!
> >   while true; do
> >     echo $SLEEP > a/cgroup.procs
> >     echo $SLEEP > b/cgroup.procs
> >   done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel.  I suggest we should, to prevent future
> > surprises.
> >
> > ...
> >
> > Changelog since v2:
> > - explicitly initialize wb_lock_cookie to silence compiler warnings.

> But only in some places.  What's up with that?

I annotated it in the places my compiler was complaining about.  But
you're right.  It's better to initialize all 4.

> >
> > ...
> >
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
> >  /**
> >   * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
> >   * @inode: target inode
> > - * @lockedp: temp bool output param, to be passed to the end function
> > + * @cookie: output param, to be passed to the end function
> >   *
> >   * The caller wants to access the wb associated with @inode but isn't
> >   * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> > @@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
> >   * association doesn't change until the transaction is finished with
> >   * unlocked_inode_to_wb_end().
> >   *
> > - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> > - * afterwards and can't sleep during transaction.  IRQ may or may not be
> > - * disabled on return.
> > + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> > + * can't sleep during transaction.  IRQ may or may not be disabled on return.
> >   */

> Grammar is a bit awkward here,

> >
> > ...
> >
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
> >       if (mapping && mapping_cap_account_dirty(mapping)) {
> >               struct inode *inode = mapping->host;
> >               struct bdi_writeback *wb;
> > -             bool locked;
> > +             struct wb_lock_cookie cookie = {0};

> Trivia: it's better to use "= {}" here.  That has the same effect and
> it doesn't assume that the first field is a scalar.  And indeed, the
> first field is a bool so it should be {false}!

Nod.  Thanks for the tip.

> So...


> --- a/include/linux/backing-dev.h~writeback-safer-lock-nesting-fix
> +++ a/include/linux/backing-dev.h
> @@ -355,7 +355,8 @@ static inline struct bdi_writeback *inod
>     * unlocked_inode_to_wb_end().
>     *
>     * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> - * can't sleep during transaction.  IRQ may or may not be disabled on return.
> + * can't sleep during the transaction.  IRQs may or may not be disabled on
> + * return.
>     */
>    static inline struct bdi_writeback *
>    unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
> --- a/mm/page-writeback.c~writeback-safer-lock-nesting-fix
> +++ a/mm/page-writeback.c
> @@ -2501,7 +2501,7 @@ void account_page_redirty(struct page *p
>           if (mapping && mapping_cap_account_dirty(mapping)) {
>                   struct inode *inode = mapping->host;
>                   struct bdi_writeback *wb;
> -               struct wb_lock_cookie cookie = {0};
> +               struct wb_lock_cookie cookie = {};

>                   wb = unlocked_inode_to_wb_begin(inode, &cookie);
>                   current->nr_dirtied--;
> @@ -2613,7 +2613,7 @@ void __cancel_dirty_page(struct page *pa
>           if (mapping_cap_account_dirty(mapping)) {
>                   struct inode *inode = mapping->host;
>                   struct bdi_writeback *wb;
> -               struct wb_lock_cookie cookie = {0};
> +               struct wb_lock_cookie cookie = {};

>                   lock_page_memcg(page);
>                   wb = unlocked_inode_to_wb_begin(inode, &cookie);
> @@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page
>           if (mapping && mapping_cap_account_dirty(mapping)) {
>                   struct inode *inode = mapping->host;
>                   struct bdi_writeback *wb;
> -               struct wb_lock_cookie cookie = {0};
> +               struct wb_lock_cookie cookie = {};

>                   /*
>                    * Yes, Virginia, this is indeed insane.

> But I wonder about the remaining uninitialized wb_lock_cookies?

Yeah, I'll post a v4 with everything folded in.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] writeback: safer lock nesting
  2018-04-10 13:50           ` [PATCH] " Sasha Levin
@ 2018-04-11  2:44             ` Wang Long
  2018-04-11  3:13               ` Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Wang Long @ 2018-04-11  2:44 UTC (permalink / raw)
  To: Sasha Levin, Greg Thelen; +Cc: npiggin, linux-kernel, stable, heguanjun


> Hi,
>
> [This is an automated email]
>
> This commit has been processed by the -stable helper bot and determined
> to be a high probability candidate for -stable trees. (score: 44.5575)
>
> The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127.
>
> v4.16.1: Build OK!
> v4.15.16: Build OK!
> v4.14.33: Build OK!
> v4.9.93: Build OK!
> v4.4.127: Failed to apply! Possible dependencies:
>      62cccb8c8e7a ("mm: simplify lock_page_memcg()")
Hi Sasha,

I tested the memory cgroup in LTS v4.4. For this issue, 62cccb8c8e7a
("mm: simplify lock_page_memcg()") needs adjusting, and there are several
other places that need to be fixed.

I will make the patch for LTS v4.4 if no one else does.

Thanks.
>
>
> Please let us know if you'd like to have this patch included in a stable tree.
>
> --
> Thanks,
> Sasha

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] writeback: safer lock nesting
  2018-04-11  2:44             ` Wang Long
@ 2018-04-11  3:13               ` Greg Thelen
  2018-04-11  8:45                 ` [PATCH for-4.4] " Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  3:13 UTC (permalink / raw)
  To: Wang Long; +Cc: Alexander.Levin, npiggin, LKML, stable, heguanjun

On Tue, Apr 10, 2018 at 7:44 PM Wang Long <wanglong19@meituan.com> wrote:

> > Hi,
> >
> > [This is an automated email]
> >
> > This commit has been processed by the -stable helper bot and determined
> > to be a high probability candidate for -stable trees. (score: 44.5575)
> >
> > The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33, v4.9.93, v4.4.127.
> >
> > v4.16.1: Build OK!
> > v4.15.16: Build OK!
> > v4.14.33: Build OK!
> > v4.9.93: Build OK!
> > v4.4.127: Failed to apply! Possible dependencies:
> >      62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> Hi Sasha,

> I tested the memory cgroup in LTS v4.4. For this issue, 62cccb8c8e7a
> ("mm: simplify lock_page_memcg()") needs adjusting, and there are several
> other places that need to be fixed.

> I will make the patch for LTS v4.4 if no one else does.

I'm testing a 4.4-stable patch right now.  ETA is a few hours.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] writeback: safer lock nesting
  2018-04-10 20:48                       ` Andrew Morton
@ 2018-04-11  5:50                         ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2018-04-11  5:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Wang Long, Johannes Weiner, Tejun Heo, npiggin,
	linux-kernel, linux-mm

On Tue 10-04-18 13:48:37, Andrew Morton wrote:
> On Tue, 10 Apr 2018 08:33:57 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > Reported-by: Wang Long <wanglong19@meituan.com>
> > > Signed-off-by: Greg Thelen <gthelen@google.com>
> > > Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
> > 
> > Not stable material IMHO
> 
> Why's that?  Wang Long said he's observed the deadlock three times?

I thought it was just too unlikely to hit all the conditions. My fault,
I completely missed/forgot the fact that Wang Long is seeing the issue
happening.

No real objection for the stable backport from me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH for-4.4] writeback: safer lock nesting
  2018-04-11  3:13               ` Greg Thelen
@ 2018-04-11  8:45                 ` Greg Thelen
  2018-04-11  8:50                   ` Greg Thelen
  0 siblings, 1 reply; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  8:45 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner,
	Tejun Heo, Sasha Levin
  Cc: npiggin, linux-kernel, linux-mm, Greg Thelen, stable

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg
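
In miniature, the bug and its repair look like this (a toy sketch with
made-up locks standing in for memcg->move_lock and mapping->tree_lock;
not the actual call chain):

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(outer_lock);
  static DEFINE_SPINLOCK(inner_lock);

  static void buggy_nesting(void)
  {
          unsigned long flags;

          spin_lock_irqsave(&outer_lock, flags); /* irqs off, old state saved */
          spin_lock_irq(&inner_lock);            /* irqs were already off */
          spin_unlock_irq(&inner_lock);          /* unconditionally re-enables */
          /* irqs are now on while outer_lock is still held; an interrupt
             that takes outer_lock here deadlocks */
          spin_unlock_irqrestore(&outer_lock, flags);
  }

  static void safe_nesting(void)
  {
          unsigned long outer_flags, inner_flags;

          spin_lock_irqsave(&outer_lock, outer_flags);
          spin_lock_irqsave(&inner_lock, inner_flags);      /* irqs stay off */
          spin_unlock_irqrestore(&inner_lock, inner_flags); /* restores "off" */
          spin_unlock_irqrestore(&outer_lock, outer_flags);
  }

This is the conversion the patch makes inside unlocked_inode_to_wb_begin,end(),
with the saved flags carried in the new wb_lock_cookie.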

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's
any reason to modify the kernel.  I suggest we should, to prevent future
surprises.  And Wang Long said "this deadlock has occurred three times in our
environment", so there's more reason to apply this, even to stable.

[ This patch is only for 4.4 stable.  Newer stable kernels should be able to
  cherry pick the upstream "writeback: safer lock nesting" patch. ]

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Wang Long <wanglong19@meituan.com>
---
 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 31 +++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 22b30249fbcb..0fe667875852 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -747,11 +747,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie = {};
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 140c29635069..a307c37c2e6c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 89d3de3e096b..361274ce5815 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -366,7 +366,7 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -374,12 +374,12 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during the transaction.  IRQs may or may not be disabled on
+ * return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	rcu_read_lock();
 
@@ -387,10 +387,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	 * Paired with store_release in inode_switch_wb_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
-	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -402,12 +402,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 /**
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
+				       cookie->flags);
 
 	rcu_read_unlock();
 }
@@ -454,12 +456,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6d0dbde4503b..3309dbda7ffa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2510,13 +2510,13 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2622,15 +2622,15 @@ void cancel_dirty_page(struct page *page)
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
 		struct mem_cgroup *memcg;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		memcg = mem_cgroup_begin_page_stat(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, memcg, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		mem_cgroup_end_page_stat(memcg);
 	} else {
 		ClearPageDirty(page);
@@ -2663,7 +2663,7 @@ int clear_page_dirty_for_io(struct page *page)
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
 		struct mem_cgroup *memcg;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2701,14 +2701,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * exclusion.
 		 */
 		memcg = mem_cgroup_begin_page_stat(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		mem_cgroup_end_page_stat(memcg);
 		return ret;
 	}
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4] writeback: safer lock nesting
  2018-04-11  1:03                       ` Greg Thelen
@ 2018-04-11  8:46                         ` Greg Thelen
  0 siblings, 0 replies; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  8:46 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner,
	Tejun Heo, Sasha Levin
  Cc: npiggin, linux-kernel, linux-mm, Greg Thelen, stable

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's
any reason to modify the kernel.  I suggest we should, to prevent future
surprises.  And Wang Long said "this deadlock has occurred three times in our
environment", so there's more reason to apply this, even to stable.
This patch has minor conflicts when applied to stable 4.4.
For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
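
For reference, the caller-side pattern after this change (a minimal sketch
modeled on the hunks below; the surrounding function is illustrative):

  static void wb_access_example(struct inode *inode)
  {
          struct wb_lock_cookie cookie = {};
          struct bdi_writeback *wb;

          wb = unlocked_inode_to_wb_begin(inode, &cookie);
          dec_wb_stat(wb, WB_RECLAIMABLE);    /* any wb access goes here */
          unlocked_inode_to_wb_end(inode, &cookie);
  }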

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Cc: stable@vger.kernel.org # v4.2+
Reported-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Wang Long <wanglong19@meituan.com>
---

Changelog since v3:
- initialize wb_lock_cookie with {} rather than {0}.
- comment grammar fix
- commit log footer cleanup (-Change-Id, +Fixes, +Acks, +stable), though patch
  does not cleanly apply to 4.4.  I'll post a 4.4-stable specific patch.

Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c                |  7 ++++---
 include/linux/backing-dev-defs.h |  5 +++++
 include/linux/backing-dev.h      | 31 +++++++++++++++++--------------
 mm/page-writeback.c              | 18 +++++++++---------
 4 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..b1178acfcb08 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 	 */
 	if (inode && inode_to_wb_is_valid(inode)) {
 		struct bdi_writeback *wb;
-		bool locked, congested;
+		struct wb_lock_cookie lock_cookie = {};
+		bool congested;
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
 		congested = wb_congested(wb, cong_bits);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &lock_cookie);
 		return congested;
 	}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+	bool locked;
+	unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..96f4a3ddfb81 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -354,12 +354,12 @@ static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during the transaction.  IRQs may or may not be disabled on
+ * return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	rcu_read_lock();
 
@@ -367,10 +367,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 	 * Paired with store_release in inode_switch_wb_work_fn() and
 	 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 	 */
-	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);
 
 	/*
 	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -382,12 +382,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 /**
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
  */
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
+	if (unlikely(cookie->locked))
+		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
+				       cookie->flags);
 
 	rcu_read_unlock();
 }
@@ -434,12 +436,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
 }
 
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
 {
 	return inode_to_wb(inode);
 }
 
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+					    struct wb_lock_cookie *cookie)
 {
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..8369572e1f7d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		current->nr_dirtied--;
 		dec_node_page_state(page, NR_DIRTIED);
 		dec_wb_stat(wb, WB_DIRTIED);
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2613,15 +2613,15 @@ void __cancel_dirty_page(struct page *page)
 	if (mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		lock_page_memcg(page);
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 
 		if (TestClearPageDirty(page))
 			account_page_cleaned(page, mapping, wb);
 
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		unlock_page_memcg(page);
 	} else {
 		ClearPageDirty(page);
@@ -2653,7 +2653,7 @@ int clear_page_dirty_for_io(struct page *page)
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		struct inode *inode = mapping->host;
 		struct bdi_writeback *wb;
-		bool locked;
+		struct wb_lock_cookie cookie = {};
 
 		/*
 		 * Yes, Virginia, this is indeed insane.
@@ -2690,14 +2690,14 @@ int clear_page_dirty_for_io(struct page *page)
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
-		wb = unlocked_inode_to_wb_begin(inode, &locked);
+		wb = unlocked_inode_to_wb_begin(inode, &cookie);
 		if (TestClearPageDirty(page)) {
 			dec_lruvec_page_state(page, NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;
 		}
-		unlocked_inode_to_wb_end(inode, locked);
+		unlocked_inode_to_wb_end(inode, &cookie);
 		return ret;
 	}
 	return TestClearPageDirty(page);
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH for-4.4] writeback: safer lock nesting
  2018-04-11  8:45                 ` [PATCH for-4.4] " Greg Thelen
@ 2018-04-11  8:50                   ` Greg Thelen
  0 siblings, 0 replies; 24+ messages in thread
From: Greg Thelen @ 2018-04-11  8:50 UTC (permalink / raw)
  To: Wang Long, Michal Hocko, Andrew Morton, Johannes Weiner,
	Tejun Heo, Sasha Levin
  Cc: npiggin, LKML, Linux MM, stable

On Wed, Apr 11, 2018 at 1:45 AM Greg Thelen <gthelen@google.com> wrote:

> lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> the page's memcg is undergoing move accounting, which occurs when a
> process leaves its memcg for a new one that has
> memory.move_charge_at_immigrate set.

> unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
> given inode is switching writeback domains.  Switches occur when enough
> writes are issued from a new domain.

> This existing pattern is thus suspicious:
>      lock_page_memcg(page);
>      unlocked_inode_to_wb_begin(inode, &locked);
>      ...
>      unlocked_inode_to_wb_end(inode, locked);
>      unlock_page_memcg(page);

> If an inode switch and a process memcg migration are both in-flight then
> unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> still holding the lock_page_memcg() irq spinlock.  This suggests the
> possibility of deadlock if an interrupt occurs before
> unlock_page_memcg().

>      truncate
>      __cancel_dirty_page
>      lock_page_memcg
>      unlocked_inode_to_wb_begin
>      unlocked_inode_to_wb_end
>      <interrupts mistakenly enabled>
>                                      <interrupt>
>                                      end_page_writeback
>                                      test_clear_page_writeback
>                                      lock_page_memcg
>                                      <deadlock>
>      unlock_page_memcg

> Due to configuration limitations this deadlock is not currently possible
> because we don't mix cgroup writeback (a cgroupv2 feature) and
> memory.move_charge_at_immigrate (a cgroupv1 feature).

> If the kernel is hacked to always claim inode switching and memcg
> moving_account, then this script triggers lockup in less than a minute:
>    cd /mnt/cgroup/memory
>    mkdir a b
>    echo 1 > a/memory.move_charge_at_immigrate
>    echo 1 > b/memory.move_charge_at_immigrate
>    (
>      echo $BASHPID > a/cgroup.procs
>      while true; do
>        dd if=/dev/zero of=/mnt/big bs=1M count=256
>      done
>    ) &
>    while true; do
>      sync
>    done &
>    sleep 1h &
>    SLEEP=$!
>    while true; do
>      echo $SLEEP > a/cgroup.procs
>      echo $SLEEP > b/cgroup.procs
>    done

> The deadlock does not seem possible under current configurations, so it's
> debatable whether there's any reason to modify the kernel.  I suggest we
> should, to prevent future surprises.  And Wang Long said "this deadlock
> occurs three times in our environment", so there's more reason to apply
> this, even to stable.

Wang Long: I wasn't able to reproduce the 4.4 problem.  But tracing
suggests this 4.4 patch is effective.  If you can reproduce the problem in
your 4.4 environment, then it'd be nice to confirm this fixes it.  Thanks!

> [ This patch is only for 4.4 stable.  Newer stable kernels should be able
>   to cherry pick the upstream "writeback: safer lock nesting" patch. ]

> Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
> Cc: stable@vger.kernel.org # v4.2+
> Reported-by: Wang Long <wanglong19@meituan.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Wang Long <wanglong19@meituan.com>
> ---
>   fs/fs-writeback.c                |  7 ++++---
>   include/linux/backing-dev-defs.h |  5 +++++
>   include/linux/backing-dev.h      | 31 +++++++++++++++++--------------
>   mm/page-writeback.c              | 18 +++++++++---------
>   4 files changed, 35 insertions(+), 26 deletions(-)

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 22b30249fbcb..0fe667875852 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -747,11 +747,12 @@ int inode_congested(struct inode *inode, int cong_bits)
>           */
>          if (inode && inode_to_wb_is_valid(inode)) {
>                  struct bdi_writeback *wb;
> -               bool locked, congested;
> +               struct wb_lock_cookie lock_cookie = {};
> +               bool congested;

> -               wb = unlocked_inode_to_wb_begin(inode, &locked);
> +               wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
>                  congested = wb_congested(wb, cong_bits);
> -               unlocked_inode_to_wb_end(inode, locked);
> +               unlocked_inode_to_wb_end(inode, &lock_cookie);
>                  return congested;
>          }

> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 140c29635069..a307c37c2e6c 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -191,6 +191,11 @@ static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>          set_wb_congested(bdi->wb.congested, sync);
>   }

> +struct wb_lock_cookie {
> +       bool locked;
> +       unsigned long flags;
> +};
> +
>   #ifdef CONFIG_CGROUP_WRITEBACK

>   /**
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 89d3de3e096b..361274ce5815 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -366,7 +366,7 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
>   /**
>    * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
>    * @inode: target inode
> - * @lockedp: temp bool output param, to be passed to the end function
> + * @cookie: output param, to be passed to the end function
>    *
>    * The caller wants to access the wb associated with @inode but isn't
>    * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> @@ -374,12 +374,12 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
>    * association doesn't change until the transaction is finished with
>    * unlocked_inode_to_wb_end().
>    *
> - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> - * afterwards and can't sleep during transaction.  IRQ may or may not be
> - * disabled on return.
> + * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
> + * can't sleep during the transaction.  IRQs may or may not be disabled on
> + * return.
>    */
>   static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>   {
>          rcu_read_lock();

> @@ -387,10 +387,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>           * Paired with store_release in inode_switch_wb_work_fn() and
>           * ensures that we see the new wb if we see cleared I_WB_SWITCH.
>           */
> -       *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
> +       cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;

> -       if (unlikely(*lockedp))
> -               spin_lock_irq(&inode->i_mapping->tree_lock);
> +       if (unlikely(cookie->locked))
> +               spin_lock_irqsave(&inode->i_mapping->tree_lock, cookie->flags);

>          /*
>           * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
> @@ -402,12 +402,14 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
>   /**
>    * unlocked_inode_to_wb_end - end inode wb access transaction
>    * @inode: target inode
> - * @locked: *@lockedp from unlocked_inode_to_wb_begin()
> + * @cookie: @cookie from unlocked_inode_to_wb_begin()
>    */
> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +                                           struct wb_lock_cookie *cookie)
>   {
> -       if (unlikely(locked))
> -               spin_unlock_irq(&inode->i_mapping->tree_lock);
> +       if (unlikely(cookie->locked))
> +               spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
> +                                      cookie->flags);

>          rcu_read_unlock();
>   }
> @@ -454,12 +456,13 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
>   }

>   static inline struct bdi_writeback *
> -unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
> +unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
>   {
>          return inode_to_wb(inode);
>   }

> -static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
> +static inline void unlocked_inode_to_wb_end(struct inode *inode,
> +                                           struct wb_lock_cookie *cookie)
>   {
>   }

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 6d0dbde4503b..3309dbda7ffa 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2510,13 +2510,13 @@ void account_page_redirty(struct page *page)
>          if (mapping && mapping_cap_account_dirty(mapping)) {
>                  struct inode *inode = mapping->host;
>                  struct bdi_writeback *wb;
> -               bool locked;
> +               struct wb_lock_cookie cookie = {};

> -               wb = unlocked_inode_to_wb_begin(inode, &locked);
> +               wb = unlocked_inode_to_wb_begin(inode, &cookie);
>                  current->nr_dirtied--;
>                  dec_zone_page_state(page, NR_DIRTIED);
>                  dec_wb_stat(wb, WB_DIRTIED);
> -               unlocked_inode_to_wb_end(inode, locked);
> +               unlocked_inode_to_wb_end(inode, &cookie);
>          }
>   }
>   EXPORT_SYMBOL(account_page_redirty);
> @@ -2622,15 +2622,15 @@ void cancel_dirty_page(struct page *page)
>                  struct inode *inode = mapping->host;
>                  struct bdi_writeback *wb;
>                  struct mem_cgroup *memcg;
> -               bool locked;
> +               struct wb_lock_cookie cookie = {};

>                  memcg = mem_cgroup_begin_page_stat(page);
> -               wb = unlocked_inode_to_wb_begin(inode, &locked);
> +               wb = unlocked_inode_to_wb_begin(inode, &cookie);

>                  if (TestClearPageDirty(page))
>                          account_page_cleaned(page, mapping, memcg, wb);

> -               unlocked_inode_to_wb_end(inode, locked);
> +               unlocked_inode_to_wb_end(inode, &cookie);
>                  mem_cgroup_end_page_stat(memcg);
>          } else {
>                  ClearPageDirty(page);
> @@ -2663,7 +2663,7 @@ int clear_page_dirty_for_io(struct page *page)
>                  struct inode *inode = mapping->host;
>                  struct bdi_writeback *wb;
>                  struct mem_cgroup *memcg;
> -               bool locked;
> +               struct wb_lock_cookie cookie = {};

>                  /*
>                   * Yes, Virginia, this is indeed insane.
> @@ -2701,14 +2701,14 @@ int clear_page_dirty_for_io(struct page *page)
>                   * exclusion.
>                   */
>                  memcg = mem_cgroup_begin_page_stat(page);
> -               wb = unlocked_inode_to_wb_begin(inode, &locked);
> +               wb = unlocked_inode_to_wb_begin(inode, &cookie);
>                  if (TestClearPageDirty(page)) {
>                          mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
>                          dec_zone_page_state(page, NR_FILE_DIRTY);
>                          dec_wb_stat(wb, WB_RECLAIMABLE);
>                          ret = 1;
>                  }
> -               unlocked_inode_to_wb_end(inode, locked);
> +               unlocked_inode_to_wb_end(inode, &cookie);
>                  mem_cgroup_end_page_stat(memcg);
>                  return ret;
>          }
> --
> 2.17.0.484.g0c8726318c-goog

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-04-11  8:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <157ed606-4a61-508b-d26a-2f5d638f39bb@meituan.com>
2018-04-02 11:50 ` [RFC] Is it correctly that the usage for spin_{lock|unlock}_irq in clear_page_dirty_for_io Wang Long
2018-04-03 12:03   ` Michal Hocko
2018-04-03 23:12     ` Greg Thelen
2018-04-04  6:31       ` Wang Long
2018-04-06  8:03         ` [PATCH] writeback: safer lock nesting Greg Thelen
2018-04-06  8:07           ` Michal Hocko
2018-04-06 18:49             ` Greg Thelen
2018-04-06 18:55               ` [PATCH v2] " Greg Thelen
2018-04-07 18:56                 ` kbuild test robot
2018-04-07 18:56                   ` kbuild test robot
2018-04-10  0:59                   ` [PATCH v3] " Greg Thelen
2018-04-10  6:33                     ` Michal Hocko
2018-04-10 20:48                       ` Andrew Morton
2018-04-11  5:50                         ` Michal Hocko
2018-04-10  8:14                     ` Wang Long
2018-04-11  0:40                       ` Greg Thelen
2018-04-10 20:37                     ` Andrew Morton
2018-04-11  1:03                       ` Greg Thelen
2018-04-11  8:46                         ` [PATCH v4] " Greg Thelen
2018-04-10 13:50           ` [PATCH] " Sasha Levin
2018-04-11  2:44             ` Wang Long
2018-04-11  3:13               ` Greg Thelen
2018-04-11  8:45                 ` [PATCH for-4.4] " Greg Thelen
2018-04-11  8:50                   ` Greg Thelen
