linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
@ 2019-11-20 16:58 Johannes Weiner
  2019-11-20 20:31 ` Shakeel Butt
  2019-11-21  3:15 ` Hugh Dickins
  0 siblings, 2 replies; 7+ messages in thread
From: Johannes Weiner @ 2019-11-20 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Michal Hocko, Alex Shi,
	Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

While reviewing the "per lruvec lru_lock for memcg" series, Hugh and I
noticed two places in the existing code where the page -> memcg ->
lruvec lookup can result in a use-after-free bug. This affects cgroup1
setups that have charge migration enabled.

To pin page->mem_cgroup, callers need to either have the page locked,
an exclusive refcount (0), or hold the lru_lock and "own" PageLRU
(either ensure it's set, or be the one to hold the page in isolation)
to make cgroup migration fail the isolation step. Failure to follow
this can result in the page moving out of the memcg and freeing it,
along with its lruvecs, while the observer is dereferencing them.
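
For illustration, the safe ordering described above looks roughly like
the sketch below (not part of the patch; the comments carry the
argument, not the code):

	spin_lock_irq(&pgdat->lru_lock);
	if (PageLRU(page)) {
		struct lruvec *lruvec;

		/*
		 * PageLRU is set and we hold the lru_lock, so charge
		 * migration cannot complete its isolation step and
		 * page->mem_cgroup stays stable until the unlock.
		 */
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		/* ... use lruvec ... */
	}
	spin_unlock_irq(&pgdat->lru_lock);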

1. isolate_lru_page() calls mem_cgroup_page_lruvec() with the lru_lock
held but before testing PageLRU. It doesn't dereference the returned
lruvec before testing PageLRU, giving the impression that it might
just be safe ordering after all - but mem_cgroup_page_lruvec() itself
touches the lruvec to lazily initialize the pgdat back pointer. This
one is easy to fix, just move the lookup into the PageLRU branch.

2. pagevec_lru_move_fn() conveniently looks up the lruvec for all the
callbacks it might get invoked on. Unfortunately, it's the callbacks
that first check PageLRU under the lru_lock, which makes this order
equally unsafe as isolate_lru_page(). Remove the lruvec argument from
the move callbacks and let them do it inside their PageLRU branches.
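
In both cases the subtlety is that mem_cgroup_page_lruvec() is not a
pure lookup; paraphrased (not verbatim), it does something like this:

	memcg = page->mem_cgroup;	/* needs the rules above to be stable */
	mz = mem_cgroup_page_nodeinfo(memcg, page);
	lruvec = &mz->lruvec;
	/* lazy init of the back pointer - a write into the lruvec */
	if (unlikely(lruvec->pgdat != pgdat))
		lruvec->pgdat = pgdat;
	return lruvec;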

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swap.c   | 48 +++++++++++++++++++++++++++++-------------------
 mm/vmscan.c |  8 ++++----
 2 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..6b015e9532fb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -188,12 +188,11 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
+	void (*move_fn)(struct page *page, void *arg),
 	void *arg)
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
@@ -207,8 +206,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, arg);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -216,12 +214,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, void *arg)
 {
 	int *pgmoved = arg;
 
 	if (PageLRU(page) && !PageUnevictable(page)) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -272,12 +272,14 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
 		reclaim_stat->recent_rotated[file]++;
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, void *arg)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
 		del_page_from_lru_list(page, lruvec, lru);
 		SetPageActive(page);
@@ -328,7 +330,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, NULL);
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -498,9 +500,9 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, void *arg)
 {
+	struct lruvec *lruvec;
 	int lru, file;
 	bool active;
 
@@ -518,6 +520,8 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
 
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
 	del_page_from_lru_list(page, lruvec, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
@@ -544,12 +548,14 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, void *arg)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = page_lru_base_type(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
 		ClearPageActive(page);
@@ -561,12 +567,14 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, void *arg)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
 		del_page_from_lru_list(page, lruvec,
 				       LRU_INACTIVE_ANON + active);
@@ -921,15 +929,17 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, void *arg)
 {
-	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
+	struct lruvec *lruvec;
+	enum lru_list lru;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-
 	SetPageLRU(page);
+
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
 	/*
 	 * Page becomes evictable in two ways:
 	 * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df859b1d583c..3c8b81990146 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1767,15 +1767,15 @@ int isolate_lru_page(struct page *page)
 
 	if (PageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
-		struct lruvec *lruvec;
 
 		spin_lock_irq(&pgdat->lru_lock);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		if (PageLRU(page)) {
-			int lru = page_lru(page);
+			struct lruvec *lruvec;
+
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
 			get_page(page);
 			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
+			del_page_from_lru_list(page, lruvec, page_lru(page));
 			ret = 0;
 		}
 		spin_unlock_irq(&pgdat->lru_lock);
-- 
2.24.0



* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-20 16:58 [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration Johannes Weiner
@ 2019-11-20 20:31 ` Shakeel Butt
  2019-11-20 21:39   ` Johannes Weiner
  2019-11-21  3:15 ` Hugh Dickins
  1 sibling, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2019-11-20 20:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Michal Hocko, Alex Shi,
	Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team

On Wed, Nov 20, 2019 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> While reviewing the "per lruvec lru_lock for memcg" series, Hugh and I
> noticed two places in the existing code where the page -> memcg ->
> lruvec lookup can result in a use-after-free bug. This affects cgroup1
> setups that have charge migration enabled.
>
> To pin page->mem_cgroup, callers need to either have the page locked,
> an exclusive refcount (0), or hold the lru_lock and "own" PageLRU
> (either ensure it's set, or be the one to hold the page in isolation)
> to make cgroup migration fail the isolation step.

I think we should add the above para in the comments for better visibility.

> Failure to follow
> this can result in the page moving out of the memcg and freeing it,
> along with its lruvecs, while the observer is dereferencing them.
>
> 1. isolate_lru_page() calls mem_cgroup_page_lruvec() with the lru_lock
> held but before testing PageLRU. It doesn't dereference the returned
> lruvec before testing PageLRU, giving the impression that it might
> just be safe ordering after all - but mem_cgroup_page_lruvec() itself
> touches the lruvec to lazily initialize the pgdat back pointer. This
> one is easy to fix, just move the lookup into the PageLRU branch.
>
> 2. pagevec_lru_move_fn() conveniently looks up the lruvec for all the
> callbacks it might get invoked on. Unfortunately, it's the callbacks
> that first check PageLRU under the lru_lock, which makes this order
> equally unsafe as isolate_lru_page(). Remove the lruvec argument from
> the move callbacks and let them do it inside their PageLRU branches.
>
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  mm/swap.c   | 48 +++++++++++++++++++++++++++++-------------------
>  mm/vmscan.c |  8 ++++----
>  2 files changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5341ae93861f..6b015e9532fb 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -188,12 +188,11 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
>  EXPORT_SYMBOL_GPL(get_kernel_page);
>
>  static void pagevec_lru_move_fn(struct pagevec *pvec,
> -       void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
> +       void (*move_fn)(struct page *page, void *arg),
>         void *arg)
>  {
>         int i;
>         struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
>         unsigned long flags = 0;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
> @@ -207,8 +206,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>                         spin_lock_irqsave(&pgdat->lru_lock, flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               (*move_fn)(page, lruvec, arg);
> +               (*move_fn)(page, arg);
>         }
>         if (pgdat)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -216,12 +214,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>         pagevec_reinit(pvec);
>  }
>
> -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
> -                                void *arg)
> +static void pagevec_move_tail_fn(struct page *page, void *arg)
>  {
>         int *pgmoved = arg;
>
>         if (PageLRU(page) && !PageUnevictable(page)) {
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 ClearPageActive(page);
>                 add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -272,12 +272,14 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
>                 reclaim_stat->recent_rotated[file]++;
>  }
>
> -static void __activate_page(struct page *page, struct lruvec *lruvec,
> -                           void *arg)
> +static void __activate_page(struct page *page, void *arg)
>  {
>         if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>                 int file = page_is_file_cache(page);
>                 int lru = page_lru_base_type(page);
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>
>                 del_page_from_lru_list(page, lruvec, lru);
>                 SetPageActive(page);
> @@ -328,7 +330,7 @@ void activate_page(struct page *page)
>
>         page = compound_head(page);
>         spin_lock_irq(&pgdat->lru_lock);
> -       __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
> +       __activate_page(page, NULL);
>         spin_unlock_irq(&pgdat->lru_lock);
>  }
>  #endif
> @@ -498,9 +500,9 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>   * be write it out by flusher threads as this is much more effective
>   * than the single-page writeout from reclaim.
>   */
> -static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> -                             void *arg)
> +static void lru_deactivate_file_fn(struct page *page, void *arg)
>  {
> +       struct lruvec *lruvec;
>         int lru, file;
>         bool active;
>
> @@ -518,6 +520,8 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>         file = page_is_file_cache(page);
>         lru = page_lru_base_type(page);
>
> +       lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
>         del_page_from_lru_list(page, lruvec, lru + active);
>         ClearPageActive(page);
>         ClearPageReferenced(page);
> @@ -544,12 +548,14 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>         update_page_reclaim_stat(lruvec, file, 0);
>  }
>
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> -                           void *arg)
> +static void lru_deactivate_fn(struct page *page, void *arg)
>  {
>         if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
>                 int file = page_is_file_cache(page);
>                 int lru = page_lru_base_type(page);
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>
>                 del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
>                 ClearPageActive(page);
> @@ -561,12 +567,14 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>         }
>  }
>
> -static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
> -                           void *arg)
> +static void lru_lazyfree_fn(struct page *page, void *arg)
>  {
>         if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
>             !PageSwapCache(page) && !PageUnevictable(page)) {
>                 bool active = PageActive(page);
> +               struct lruvec *lruvec;
> +
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>
>                 del_page_from_lru_list(page, lruvec,
>                                        LRU_INACTIVE_ANON + active);
> @@ -921,15 +929,17 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> -                                void *arg)
> +static void __pagevec_lru_add_fn(struct page *page, void *arg)
>  {
> -       enum lru_list lru;
>         int was_unevictable = TestClearPageUnevictable(page);
> +       struct lruvec *lruvec;
> +       enum lru_list lru;
>
>         VM_BUG_ON_PAGE(PageLRU(page), page);
> -
>         SetPageLRU(page);
> +
> +       lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
>         /*
>          * Page becomes evictable in two ways:
>          * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df859b1d583c..3c8b81990146 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1767,15 +1767,15 @@ int isolate_lru_page(struct page *page)
>
>         if (PageLRU(page)) {
>                 pg_data_t *pgdat = page_pgdat(page);
> -               struct lruvec *lruvec;
>
>                 spin_lock_irq(&pgdat->lru_lock);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 if (PageLRU(page)) {
> -                       int lru = page_lru(page);
> +                       struct lruvec *lruvec;
> +
> +                       lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                         get_page(page);
>                         ClearPageLRU(page);
> -                       del_page_from_lru_list(page, lruvec, lru);
> +                       del_page_from_lru_list(page, lruvec, page_lru(page));
>                         ret = 0;
>                 }
>                 spin_unlock_irq(&pgdat->lru_lock);
> --
> 2.24.0
>


* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-20 20:31 ` Shakeel Butt
@ 2019-11-20 21:39   ` Johannes Weiner
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2019-11-20 21:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Hugh Dickins, Michal Hocko, Alex Shi,
	Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team

On Wed, Nov 20, 2019 at 12:31:06PM -0800, Shakeel Butt wrote:
> On Wed, Nov 20, 2019 at 8:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > While reviewing the "per lruvec lru_lock for memcg" series, Hugh and I
> > noticed two places in the existing code where the page -> memcg ->
> > lruvec lookup can result in a use-after-free bug. This affects cgroup1
> > setups that have charge migration enabled.
> >
> > To pin page->mem_cgroup, callers need to either have the page locked,
> > an exclusive refcount (0), or hold the lru_lock and "own" PageLRU
> > (either ensure it's set, or be the one to hold the page in isolation)
> > to make cgroup migration fail the isolation step.
> 
> I think we should add the above para in the comments for better visibility.

Good idea. I'm attaching a delta patch below.

> > Reported-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks!

---
From 73b58ce09009cce668ea97d9e047611c60e95bd6 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 20 Nov 2019 16:36:03 -0500
Subject: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge
 migration fix

Better document the mem_cgroup_page_lruvec() caller requirements.

Suggested-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50f5bc55fcec..2d700fa0d7f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1202,9 +1202,18 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  * @page: the page
  * @pgdat: pgdat of the page
  *
- * This function is only safe when following the LRU page isolation
- * and putback protocol: the LRU lock must be held, and the page must
- * either be PageLRU() or the caller must have isolated/allocated it.
+ * NOTE: The returned lruvec is only stable if the calling context has
+ * the page->mem_cgroup pinned! This is accomplished by satisfying one
+ * of the following criteria:
+ *
+ *    a) have the @page locked
+ *    b) have an exclusive reference to @page (e.g. refcount 0)
+ *    c) hold the lru_lock and "own" the PageLRU (meaning either ensure
+ *       it's set, or be the one to hold the page in isolation)
+ *
+ * Otherwise, the page could be freed or moved out of the memcg,
+ * thereby releasing its reference on the memcg and potentially
+ * freeing it and its lruvecs in the process.
  */
 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
 {
-- 
2.24.0



* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-20 16:58 [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration Johannes Weiner
  2019-11-20 20:31 ` Shakeel Butt
@ 2019-11-21  3:15 ` Hugh Dickins
  2019-11-21 13:03   ` Alex Shi
  2019-11-21 20:56   ` Johannes Weiner
  1 sibling, 2 replies; 7+ messages in thread
From: Hugh Dickins @ 2019-11-21  3:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Shakeel Butt, Michal Hocko,
	Alex Shi, Roman Gushchin, linux-mm, cgroups, linux-kernel,
	kernel-team

On Wed, 20 Nov 2019, Johannes Weiner wrote:

> While reviewing the "per lruvec lru_lock for memcg" series, Hugh and I
> noticed two places in the existing code where the page -> memcg ->
> lruvec lookup can result in a use-after-free bug. This affects cgroup1
> setups that have charge migration enabled.

I don't see how it can result in a use-after-free, or other bug.

> 
> To pin page->mem_cgroup, callers need to either have the page locked,
> an exclusive refcount (0), or hold the lru_lock and "own" PageLRU

"exclusive refcount (0)", okay, that's quite a good term for it -
though "frozen refcount (0)" is what we would usually say.

You don't suggest holding move_lock, but I guess move_lock is never
used in that way (nor should be).

Okay.

> (either ensure it's set, or be the one to hold the page in isolation)
> to make cgroup migration fail the isolation step. Failure to follow
> this can result in the page moving out of the memcg and freeing it,
> along with its lruvecs, while the observer is dereferencing them.
> 
> 1. isolate_lru_page() calls mem_cgroup_page_lruvec() with the lru_lock
> held but before testing PageLRU. It doesn't dereference the returned
> lruvec before testing PageLRU, giving the impression that it might
> just be safe ordering after all - but mem_cgroup_page_lruvec() itself
> touches the lruvec to lazily initialize the pgdat back pointer. This
> one is easy to fix, just move the lookup into the PageLRU branch.
> 
> 2. pagevec_lru_move_fn() conveniently looks up the lruvec for all the
> callbacks it might get invoked on. Unfortunately, it's the callbacks
> that first check PageLRU under the lru_lock, which makes this order
> equally unsafe as isolate_lru_page(). Remove the lruvec argument from
> the move callbacks and let them do it inside their PageLRU branches.
> 
> Reported-by: Hugh Dickins <hughd@google.com>

You are overly generous to name me there! I have to demur: it's true
that yesterday I said "even isolate_lru_page() looks unsafe to me now",
but I was talking (in private mail) about my own mods, not current
upstream.  And you have then gone on to observe something suspicious
in the current code, but I don't think it amounts to a bug at all.
Appreciated, but please delete me!

Further explanation below, at the smaller isolate_lru_page() hunk,
where it's easier to see what goes on without the indirections of
mm/swap.c.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/swap.c   | 48 +++++++++++++++++++++++++++++-------------------
>  mm/vmscan.c |  8 ++++----
>  2 files changed, 33 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 5341ae93861f..6b015e9532fb 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -188,12 +188,11 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
>  EXPORT_SYMBOL_GPL(get_kernel_page);
>  
>  static void pagevec_lru_move_fn(struct pagevec *pvec,
> -	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
> +	void (*move_fn)(struct page *page, void *arg),
>  	void *arg)
>  {
>  	int i;
>  	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
> @@ -207,8 +206,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  			spin_lock_irqsave(&pgdat->lru_lock, flags);
>  		}
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		(*move_fn)(page, lruvec, arg);
> +		(*move_fn)(page, arg);
>  	}
>  	if (pgdat)
>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -216,12 +214,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  	pagevec_reinit(pvec);
>  }
>  
> -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
> -				 void *arg)
> +static void pagevec_move_tail_fn(struct page *page, void *arg)
>  {
>  	int *pgmoved = arg;
>  
>  	if (PageLRU(page) && !PageUnevictable(page)) {
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		ClearPageActive(page);
>  		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -272,12 +272,14 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
>  		reclaim_stat->recent_rotated[file]++;
>  }
>  
> -static void __activate_page(struct page *page, struct lruvec *lruvec,
> -			    void *arg)
> +static void __activate_page(struct page *page, void *arg)
>  {
>  	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>  		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  
>  		del_page_from_lru_list(page, lruvec, lru);
>  		SetPageActive(page);
> @@ -328,7 +330,7 @@ void activate_page(struct page *page)
>  
>  	page = compound_head(page);
>  	spin_lock_irq(&pgdat->lru_lock);
> -	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
> +	__activate_page(page, NULL);
>  	spin_unlock_irq(&pgdat->lru_lock);
>  }
>  #endif
> @@ -498,9 +500,9 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>   * be write it out by flusher threads as this is much more effective
>   * than the single-page writeout from reclaim.
>   */
> -static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
> -			      void *arg)
> +static void lru_deactivate_file_fn(struct page *page, void *arg)
>  {
> +	struct lruvec *lruvec;
>  	int lru, file;
>  	bool active;
>  
> @@ -518,6 +520,8 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	file = page_is_file_cache(page);
>  	lru = page_lru_base_type(page);
>  
> +	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
>  	del_page_from_lru_list(page, lruvec, lru + active);
>  	ClearPageActive(page);
>  	ClearPageReferenced(page);
> @@ -544,12 +548,14 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> -			    void *arg)
> +static void lru_deactivate_fn(struct page *page, void *arg)
>  {
>  	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
>  		int file = page_is_file_cache(page);
>  		int lru = page_lru_base_type(page);
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  
>  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
>  		ClearPageActive(page);
> @@ -561,12 +567,14 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  	}
>  }
>  
> -static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
> -			    void *arg)
> +static void lru_lazyfree_fn(struct page *page, void *arg)
>  {
>  	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
>  	    !PageSwapCache(page) && !PageUnevictable(page)) {
>  		bool active = PageActive(page);
> +		struct lruvec *lruvec;
> +
> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  
>  		del_page_from_lru_list(page, lruvec,
>  				       LRU_INACTIVE_ANON + active);
> @@ -921,15 +929,17 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> -				 void *arg)
> +static void __pagevec_lru_add_fn(struct page *page, void *arg)
>  {
> -	enum lru_list lru;
>  	int was_unevictable = TestClearPageUnevictable(page);
> +	struct lruvec *lruvec;
> +	enum lru_list lru;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
> -
>  	SetPageLRU(page);
> +
> +	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
>  	/*
>  	 * Page becomes evictable in two ways:
>  	 * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df859b1d583c..3c8b81990146 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1767,15 +1767,15 @@ int isolate_lru_page(struct page *page)
>  
>  	if (PageLRU(page)) {
>  		pg_data_t *pgdat = page_pgdat(page);
> -		struct lruvec *lruvec;
>  
>  		spin_lock_irq(&pgdat->lru_lock);
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		if (PageLRU(page)) {
> -			int lru = page_lru(page);
> +			struct lruvec *lruvec;
> +
> +			lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  			get_page(page);
>  			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, lru);
> +			del_page_from_lru_list(page, lruvec, page_lru(page));
>  			ret = 0;
>  		}
>  		spin_unlock_irq(&pgdat->lru_lock);
> -- 
> 2.24.0

I like the way you've rearranged isolate_lru_page() there, but I
don't think it amounts to more than a cleanup.  Very good thinking
about the odd "lruvec->pgdat = pgdat" case tucked away inside
mem_cgroup_page_lruvec(), but actually, what harm does it do, if
mem_cgroup_move_account() changes page->mem_cgroup concurrently?

You say use-after-free, but we have spin_lock_irq here, and the
struct mem_cgroup (and its lruvecs) cannot be freed until an RCU
grace period expires, which we rely upon in many places, and which
cannot happen until after the spin_unlock_irq.

And the same applies in the pagevec_lru_move functions, doesn't it?
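
To spell that out as a sketch (the comments carry the reasoning; this
is not the actual code):

	spin_lock_irq(&pgdat->lru_lock);	/* irqs off: implies an RCU read side */
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	/*
	 * Even if page->mem_cgroup is moved concurrently, the old memcg
	 * (and the lruvec embedded in it) is only freed after an RCU
	 * grace period, and no grace period can complete while this CPU
	 * has interrupts disabled - i.e. not before the unlock below.
	 */
	spin_unlock_irq(&pgdat->lru_lock);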

I think now is not the time for such cleanups. If this fits well
with Alex's per-lruvec locking (or represents an initial direction
that you think he should follow), fine, but better to let him take it
into his patchset in that case, than change the base unnecessarily
underneath him.

(It happens to go against my own direction, since it separates the
locking from the determination of lruvec, which I insist must be
kept together; but perhaps that won't be quite the same for Alex.)

Hugh


* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-21  3:15 ` Hugh Dickins
@ 2019-11-21 13:03   ` Alex Shi
  2019-11-21 20:56   ` Johannes Weiner
  1 sibling, 0 replies; 7+ messages in thread
From: Alex Shi @ 2019-11-21 13:03 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner
  Cc: Andrew Morton, Shakeel Butt, Michal Hocko, Roman Gushchin,
	linux-mm, cgroups, linux-kernel, kernel-team


> I like the way you've rearranged isolate_lru_page() there, but I
> don't think it amounts to more than a cleanup.  Very good thinking
> about the odd "lruvec->pgdat = pgdat" case tucked away inside
> mem_cgroup_page_lruvec(), but actually, what harm does it do, if
> mem_cgroup_move_account() changes page->mem_cgroup concurrently?

Maybe the page could be added to root_mem_cgroup?

> 
> You say use-after-free, but we have spin_lock_irq here, and the
> struct mem_cgroup (and its lruvecs) cannot be freed until an RCU
> grace period expires, which we rely upon in many places, and which
> cannot happen until after the spin_unlock_irq.
> 
> And the same applies in the pagevec_lru_move functions, doesn't it?
> 
> I think now is not the time for such cleanups. If this fits well
> with Alex's per-lruvec locking (or represents an initial direction
> that you think he should follow), fine, but better to let him take it
> into his patchset in that case, than change the base unnecessarily
> underneath him.
> 
> (It happens to go against my own direction, since it separates the
> locking from the determination of lruvec, which I insist must be
> kept together; but perhaps that won't be quite the same for Alex.)
> 

It looks like we share the same base.

Before this patch, the root memcg's lruvec lock could guard !PageLRU and what follows it. But now, there are many holes in the wall. :)

Thanks
Alex


* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-21  3:15 ` Hugh Dickins
  2019-11-21 13:03   ` Alex Shi
@ 2019-11-21 20:56   ` Johannes Weiner
  2019-11-21 21:30     ` Shakeel Butt
  1 sibling, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2019-11-21 20:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Michal Hocko, Alex Shi,
	Roman Gushchin, linux-mm, cgroups, linux-kernel, kernel-team

On Wed, Nov 20, 2019 at 07:15:27PM -0800, Hugh Dickins wrote:
> I like the way you've rearranged isolate_lru_page() there, but I
> don't think it amounts to more than a cleanup.  Very good thinking
> about the odd "lruvec->pgdat = pgdat" case tucked away inside
> mem_cgroup_page_lruvec(), but actually, what harm does it do, if
> mem_cgroup_move_account() changes page->mem_cgroup concurrently?
> 
> You say use-after-free, but we have spin_lock_irq here, and the
> struct mem_cgroup (and its lruvecs) cannot be freed until an RCU
> grace period expires, which we rely upon in many places, and which
> cannot happen until after the spin_unlock_irq.

You are correct, I missed the rcu locking implied by the
spinlock. With this, the justification for this patch is wrong.

But all of this is way too fragile and error-prone for my taste. We're
looking up a page's lruvec in a scope that does not promise at all
that the lruvec will be the page's. Luckily we currently don't touch
the lruvec outside of the PageLRU branch, but this subtlety is
entirely non-obvious from the code.
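
That is, the shape I'm uneasy about is essentially (sketch):

	lruvec = mem_cgroup_page_lruvec(page, pgdat);	/* nothing pins page->mem_cgroup here */
	(*move_fn)(page, lruvec, arg);			/* safe today only because the callbacks
							   re-check PageLRU before using lruvec */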

I will put more thought into this. Let's scrap this patch for now.


* Re: [PATCH] mm: fix unsafe page -> lruvec lookups with cgroup charge migration
  2019-11-21 20:56   ` Johannes Weiner
@ 2019-11-21 21:30     ` Shakeel Butt
  0 siblings, 0 replies; 7+ messages in thread
From: Shakeel Butt @ 2019-11-21 21:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, Alex Shi,
	Roman Gushchin, Linux MM, Cgroups, LKML, Kernel Team

On Thu, Nov 21, 2019 at 12:56 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Nov 20, 2019 at 07:15:27PM -0800, Hugh Dickins wrote:
> > I like the way you've rearranged isolate_lru_page() there, but I
> > don't think it amounts to more than a cleanup.  Very good thinking
> > about the odd "lruvec->pgdat = pgdat" case tucked away inside
> > mem_cgroup_page_lruvec(), but actually, what harm does it do, if
> > mem_cgroup_move_account() changes page->mem_cgroup concurrently?
> >
> > You say use-after-free, but we have spin_lock_irq here, and the
> > struct mem_cgroup (and its lruvecs) cannot be freed until an RCU
> > grace period expires, which we rely upon in many places, and which
> > cannot happen until after the spin_unlock_irq.
>
> You are correct, I missed the rcu locking implied by the
> spinlock. With this, the justification for this patch is wrong.
>
> But all of this is way too fragile and error-prone for my taste. We're
> looking up a page's lruvec in a scope that does not promise at all
> that the lruvec will be the page's. Luckily we currently don't touch
> the lruvec outside of the PageLRU branch, but this subtlety is
> entirely non-obvious from the code.
>
> I will put more thought into this. Let's scrap this patch for now.

What about the comment on mem_cgroup_page_lruvec()? I feel that
comment is good documentation, independent of the original patch.

