Re: fs/buffer.c: WARNING: alloc_page_buffers while mke2fs

From: Shakeel Butt <shakeelb@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yang Shi <shy828301@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	 Naresh Kamboju <naresh.kamboju@linaro.org>,
	linux-mm <linux-mm@kvack.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	 Michal Hocko <mhocko@kernel.org>,
	Dan Schatzberg <schatzberg.dan@gmail.com>
Subject: Re: fs/buffer.c: WARNING: alloc_page_buffers while mke2fs
Date: Tue, 3 Mar 2020 12:40:33 -0800	[thread overview]
Message-ID: <CALvZod61X33TyXiV27cS8+UebBhN6xgMoYpaRPRdYZwo+G9g_g@mail.gmail.com> (raw)
In-Reply-To: <20200303202623.GA68565@cmpxchg.org>

On Tue, Mar 3, 2020 at 12:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 03, 2020 at 10:14:49AM -0800, Shakeel Butt wrote:
> > On Tue, Mar 3, 2020 at 9:47 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Mar 3, 2020 at 2:53 AM Tetsuo Handa
> > > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > > >
> > > > Hello, Naresh.
> > > >
> > > > > [   98.003346] WARNING: CPU: 2 PID: 340 at
> > > > > include/linux/sched/mm.h:323 alloc_page_buffers+0x210/0x288
> > > >
> > > > This is
> > > >
> > > > /**
> > > >  * memalloc_use_memcg - Starts the remote memcg charging scope.
> > > >  * @memcg: memcg to charge.
> > > >  *
> > > >  * This function marks the beginning of the remote memcg charging scope. All the
> > > >  * __GFP_ACCOUNT allocations till the end of the scope will be charged to the
> > > >  * given memcg.
> > > >  *
> > > >  * NOTE: This function is not nesting safe.
> > > >  */
> > > > static inline void memalloc_use_memcg(struct mem_cgroup *memcg)
> > > > {
> > > >         WARN_ON_ONCE(current->active_memcg);
> > > >         current->active_memcg = memcg;
> > > > }
> > > >
> > > > which is about memcg. Redirecting to linux-mm.
> > >
> > > Isn't this triggered by ("loop: use worker per cgroup instead of
> > > kworker") in linux-next, which converted loop driver to use worker per
> > > cgroup, so it may have multiple workers work at the mean time?
> > >
> > > So they may share the same "current", then it may cause kind of nested
> > > call to memalloc_use_memcg().
> > >
> > > Could you please try the below debug patch? This is not the proper
> > > fix, but it may help us narrow down the problem.
> > >
> > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > index c49257a..1cc1cdc 100644
> > > --- a/include/linux/sched/mm.h
> > > +++ b/include/linux/sched/mm.h
> > > @@ -320,6 +320,10 @@ static inline void
> > > memalloc_nocma_restore(unsigned int flags)
> > >   */
> > >  static inline void memalloc_use_memcg(struct mem_cgroup *memcg)
> > >  {
> > > +       if ((current->flags & PF_KTHREAD) &&
> > > +            current->active_memcg)
> > > +               return;
> > > +
> > >         WARN_ON_ONCE(current->active_memcg);
> > >         current->active_memcg = memcg;
> > >  }
> > >
> >
> > Maybe it's time to make memalloc_use_memcg() nesting safe.
>
> Yes, I think so. The stack trace:
>
> [   98.137605]  alloc_page_buffers+0x210/0x288
> [   98.141799]  __getblk_gfp+0x1d4/0x400
> [   98.145475]  ext4_read_block_bitmap_nowait+0x148/0xbc8
> [   98.150628]  ext4_mb_init_cache+0x25c/0x9b0
> [   98.154821]  ext4_mb_init_group+0x270/0x390
> [   98.159014]  ext4_mb_good_group+0x264/0x270
> [   98.163208]  ext4_mb_regular_allocator+0x480/0x798
> [   98.168011]  ext4_mb_new_blocks+0x958/0x10f8
> [   98.172294]  ext4_ext_map_blocks+0xec8/0x1618
> [   98.176660]  ext4_map_blocks+0x1b8/0x8a0
> [   98.180592]  ext4_writepages+0x830/0xf10
> [   98.184523]  do_writepages+0xb4/0x198
> [   98.188195]  __filemap_fdatawrite_range+0x170/0x1c8
> [   98.193086]  filemap_write_and_wait_range+0x40/0xb0
> [   98.197974]  ext4_punch_hole+0x4a4/0x660
> [   98.201907]  ext4_fallocate+0x294/0x1190
> [   98.205839]  loop_process_work+0x690/0x1100
> [   98.210032]  loop_workfn+0x2c/0x110
> [   98.213529]  process_one_work+0x3e0/0x648
> [   98.217546]  worker_thread+0x70/0x670
> [   98.221217]  kthread+0x1b8/0x1c0
> [   98.224452]  ret_from_fork+0x10/0x18
>
> The loop kworker is instantiating cache pages on behalf of who queued
> the io request, but if the page already exists, the buffers should be
> allocated on behalf of who already owns the page. Nesting makes sense.
>
> Since the only difference between the use and unuse function is the
> warn when we nest, we can remove the unuse and do something like:
>
> old = memalloc_use_memcg(memcg);
> memalloc_use_memcg(old);
>
> What do you think?

SGTM.

> Patch below. It should go in before Dan's patches,
> and they in turn need a small update to save and restore active_memcg.
> (Since loop's use is from a kworker, it's unlikely that there is an
> outer scope. But it's probably best to keep this simple and robust.)
>
> ---
>
> From e0e5ace069af5a36e41eafe3bf21a67966127c04 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 3 Mar 2020 15:15:39 -0500
> Subject: [PATCH] mm: support nesting memalloc_use_memcg()
>
> The memalloc_use_memcg() function to override the default memcg
> accounting context currently doesn't nest. But the patches to make the
> loop driver cgroup-aware will end up nesting:
>
> [   98.137605]  alloc_page_buffers+0x210/0x288
> [   98.141799]  __getblk_gfp+0x1d4/0x400
> [   98.145475]  ext4_read_block_bitmap_nowait+0x148/0xbc8
> [   98.150628]  ext4_mb_init_cache+0x25c/0x9b0
> [   98.154821]  ext4_mb_init_group+0x270/0x390
> [   98.159014]  ext4_mb_good_group+0x264/0x270
> [   98.163208]  ext4_mb_regular_allocator+0x480/0x798
> [   98.168011]  ext4_mb_new_blocks+0x958/0x10f8
> [   98.172294]  ext4_ext_map_blocks+0xec8/0x1618
> [   98.176660]  ext4_map_blocks+0x1b8/0x8a0
> [   98.180592]  ext4_writepages+0x830/0xf10
> [   98.184523]  do_writepages+0xb4/0x198
> [   98.188195]  __filemap_fdatawrite_range+0x170/0x1c8
> [   98.193086]  filemap_write_and_wait_range+0x40/0xb0
> [   98.197974]  ext4_punch_hole+0x4a4/0x660
> [   98.201907]  ext4_fallocate+0x294/0x1190
> [   98.205839]  loop_process_work+0x690/0x1100
> [   98.210032]  loop_workfn+0x2c/0x110
> [   98.213529]  process_one_work+0x3e0/0x648
> [   98.217546]  worker_thread+0x70/0x670
> [   98.221217]  kthread+0x1b8/0x1c0
> [   98.224452]  ret_from_fork+0x10/0x18
>
> where loop_process_work() sets the memcg override to the memcg that
> submitted the IO request, and alloc_page_buffers() sets the override
> to the memcg that instantiated the cache page, which may differ.
>
> Make memalloc_use_memcg() return the old memcg and convert existing
> users to a stacking model. Delete the unused memalloc_unuse_memcg().
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  fs/buffer.c                          |  6 ++---
>  fs/notify/fanotify/fanotify.c        |  5 +++--
>  fs/notify/inotify/inotify_fsnotify.c |  5 +++--
>  include/linux/sched/mm.h             | 33 ++++++++++++----------------
>  4 files changed, 23 insertions(+), 26 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d8c7242426bb..54d5df14bd36 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -857,13 +857,13 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
>         struct buffer_head *bh, *head;
>         gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
>         long offset;
> -       struct mem_cgroup *memcg;
> +       struct mem_cgroup *memcg, *oldmemcg;
>
>         if (retry)
>                 gfp |= __GFP_NOFAIL;
>
>         memcg = get_mem_cgroup_from_page(page);
> -       memalloc_use_memcg(memcg);
> +       oldmemcg = memalloc_use_memcg(memcg);
>
>         head = NULL;
>         offset = PAGE_SIZE;
> @@ -882,7 +882,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
>                 set_bh_page(bh, page, offset);
>         }
>  out:
> -       memalloc_unuse_memcg();
> +       memalloc_use_memcg(oldmemcg);
>         mem_cgroup_put(memcg);
>         return head;
>  /*
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 5778d1347b35..cb596ad002d8 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -284,6 +284,7 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>         struct fanotify_event *event = NULL;
>         gfp_t gfp = GFP_KERNEL_ACCOUNT;
>         struct inode *id = fanotify_fid_inode(inode, mask, data, data_type);
> +       struct mem_cgroup *oldmemcg;
>
>         /*
>          * For queues with unlimited length lost events are not expected and
> @@ -297,7 +298,7 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>                 gfp |= __GFP_RETRY_MAYFAIL;
>
>         /* Whoever is interested in the event, pays for the allocation. */
> -       memalloc_use_memcg(group->memcg);
> +       oldmemcg = memalloc_use_memcg(group->memcg);
>
>         if (fanotify_is_perm_event(mask)) {
>                 struct fanotify_perm_event *pevent;
> @@ -334,7 +335,7 @@ init: __maybe_unused
>                 event->path.dentry = NULL;
>         }
>  out:
> -       memalloc_unuse_memcg();
> +       memalloc_use_memcg(oldmemcg);
>         return event;
>  }
>
> diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
> index d510223d302c..776ce66aaa47 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -68,6 +68,7 @@ int inotify_handle_event(struct fsnotify_group *group,
>         int ret;
>         int len = 0;
>         int alloc_len = sizeof(struct inotify_event_info);
> +       struct mem_cgroup *oldmemcg;
>
>         if (WARN_ON(fsnotify_iter_vfsmount_mark(iter_info)))
>                 return 0;
> @@ -95,9 +96,9 @@ int inotify_handle_event(struct fsnotify_group *group,
>          * trigger OOM killer in the target monitoring memcg as it may have
>          * security repercussion.
>          */
> -       memalloc_use_memcg(group->memcg);
> +       oldmemcg = memalloc_use_memcg(group->memcg);
>         event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
> -       memalloc_unuse_memcg();
> +       memalloc_use_memcg(oldmemcg);
>
>         if (unlikely(!event)) {
>                 /*
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index c49257a3b510..ced06c12daf7 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -316,31 +316,26 @@ static inline void memalloc_nocma_restore(unsigned int flags)
>   * __GFP_ACCOUNT allocations till the end of the scope will be charged to the
>   * given memcg.
>   *
> - * NOTE: This function is not nesting safe.
> - */
> -static inline void memalloc_use_memcg(struct mem_cgroup *memcg)
> -{
> -       WARN_ON_ONCE(current->active_memcg);
> -       current->active_memcg = memcg;
> -}
> -
> -/**
> - * memalloc_unuse_memcg - Ends the remote memcg charging scope.
> + * NOTE: This function can nest. Users must save the return value and
> + * reset the previous value after their own charging scope is over:

Should we mention that this is still not irq safe? At the moment other
than skmem we don't do memcg charging in the interrupt and skmem has a
memcg already associated with it. Maybe in future there might be a
case where we want a specific memcg be charged for kmem in the
interrupt context. Until then we can mark that this function should
not be used in interrupt.

>   *
> - * This function marks the end of the remote memcg charging scope started by
> - * memalloc_use_memcg().
> + *      old = memalloc_use_memcg(memcg);
> + *      // ... allocations ...
> + *      memalloc_use_memcg(old);
>   */
> -static inline void memalloc_unuse_memcg(void)
> +static inline struct mem_cgroup *__must_check
> +memalloc_use_memcg(struct mem_cgroup *memcg)
>  {
> -       current->active_memcg = NULL;
> +       struct mem_cgroup *old = current->active_memcg;
> +
> +       current->active_memcg = memcg;
> +       return old;
>  }
>  #else
> -static inline void memalloc_use_memcg(struct mem_cgroup *memcg)
> -{
> -}
> -
> -static inline void memalloc_unuse_memcg(void)
> +static inline struct mem_cgroup *__must_check
> +memalloc_use_memcg(struct mem_cgroup *memcg)
>  {
> +       return NULL;
>  }
>  #endif
>
> --
> 2.24.1
>