* incoming
@ 2020-09-19  4:19 Andrew Morton
  2020-09-19  4:20 ` [patch 01/15] mailmap: add older email addresses for Kees Cook Andrew Morton
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

15 patches, based on 92ab97adeefccf375de7ebaad9d5b75d4125fe8b.

Subsystems affected by this patch series:

  mailmap
  mm/hotfixes
  mm/thp
  mm/memory-hotplug
  misc
  kcsan

Subsystem: mailmap

    Kees Cook <keescook@chromium.org>:
      mailmap: add older email addresses for Kees Cook

Subsystem: mm/hotfixes

    Hugh Dickins <hughd@google.com>:
    Patch series "mm: fixes to past from future testing":
      ksm: reinstate memcg charge on copied pages
      mm: migration of hugetlbfs page skip memcg
      shmem: shmem_writepage() split unlikely i915 THP
      mm: fix check_move_unevictable_pages() on THP
      mlock: fix unevictable_pgs event counts on THP

    Byron Stanoszek <gandalf@winds.org>:
      tmpfs: restore functionality of nr_inodes=0

    Muchun Song <songmuchun@bytedance.com>:
      kprobes: fix kill kprobe which has been marked as gone

Subsystem: mm/thp

    Ralph Campbell <rcampbell@nvidia.com>:
      mm/thp: fix __split_huge_pmd_locked() for migration PMD

    Christophe Leroy <christophe.leroy@csgroup.eu>:
      selftests/vm: fix display of page size in map_hugetlb

Subsystem: mm/memory-hotplug

    Pavel Tatashin <pasha.tatashin@soleen.com>:
      mm/memory_hotplug: drain per-cpu pages again during memory offline

Subsystem: misc

    Tobias Klauser <tklauser@distanz.ch>:
      ftrace: let ftrace_enable_sysctl take a kernel pointer buffer
      stackleak: let stack_erasing_sysctl take a kernel pointer buffer
      fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype

Subsystem: kcsan

    Changbin Du <changbin.du@gmail.com>:
      kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments'

 .mailmap                                 |    4 ++
 fs/fs-writeback.c                        |    2 -
 include/linux/ftrace.h                   |    3 --
 include/linux/stackleak.h                |    2 -
 kernel/kprobes.c                         |    9 +++++-
 kernel/stackleak.c                       |    2 -
 kernel/trace/ftrace.c                    |    3 --
 lib/Kconfig.debug                        |    4 --
 mm/huge_memory.c                         |   42 ++++++++++++++++---------------
 mm/ksm.c                                 |    4 ++
 mm/memory_hotplug.c                      |   14 ++++++++++
 mm/migrate.c                             |    3 +-
 mm/mlock.c                               |   24 +++++++++++------
 mm/page_isolation.c                      |    8 +++++
 mm/shmem.c                               |   20 +++++++++++---
 mm/swap.c                                |    6 ++--
 mm/vmscan.c                              |   10 +++++--
 tools/testing/selftests/vm/map_hugetlb.c |    2 -
 18 files changed, 111 insertions(+), 51 deletions(-)



* [patch 01/15] mailmap: add older email addresses for Kees Cook
  2020-09-19  4:19 incoming Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, corbet, joe, keescook, linux-mm, mm-commits, torvalds

From: Kees Cook <keescook@chromium.org>
Subject: mailmap: add older email addresses for Kees Cook

This adds explicit mailmap entries for my older/other email addresses.
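
Once these entries are in place, tools which honour .mailmap (git shortlog,
git log --use-mailmap, git check-mailmap) fold the old addresses into the
canonical one, e.g.:

  $ git check-mailmap "Kees Cook <kees@outflux.net>"
  Kees Cook <keescook@chromium.org>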

Link: https://lkml.kernel.org/r/20200910193939.3798377-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Joe Perches <joe@perches.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 .mailmap |    4 ++++
 1 file changed, 4 insertions(+)

--- a/.mailmap~mailmap-add-older-email-addresses-for-kees-cook
+++ a/.mailmap
@@ -169,6 +169,10 @@ Juha Yrjola <juha.yrjola@solidboot.com>
 Julien Thierry <julien.thierry.kdev@gmail.com> <julien.thierry@arm.com>
 Kamil Konieczny <k.konieczny@samsung.com> <k.konieczny@partner.samsung.com>
 Kay Sievers <kay.sievers@vrfy.org>
+Kees Cook <keescook@chromium.org> <kees.cook@canonical.com>
+Kees Cook <keescook@chromium.org> <keescook@google.com>
+Kees Cook <keescook@chromium.org> <kees@outflux.net>
+Kees Cook <keescook@chromium.org> <kees@ubuntu.com>
 Kenneth W Chen <kenneth.w.chen@intel.com>
 Konstantin Khlebnikov <koct9i@gmail.com> <khlebnikov@yandex-team.ru>
 Konstantin Khlebnikov <koct9i@gmail.com> <k.khlebnikov@samsung.com>
_


* [patch 02/15] ksm: reinstate memcg charge on copied pages
  2020-09-19  4:19 incoming Andrew Morton
  2020-09-19  4:20 ` [patch 01/15] mailmap: add older email addresses for Kees Cook Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 03/15] mm: migration of hugetlbfs page skip memcg Andrew Morton
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, stable, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: ksm: reinstate memcg charge on copied pages

Patch series "mm: fixes to past from future testing".

Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.


This patch (of 5):

In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().

This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".

But there is another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated. 
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on.  Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find?  A
question not easily answered: put in the fix for missed memcg charges.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Fixes: 4c6355b25e8b ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.8]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/ksm.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/ksm.c~ksm-reinstate-memcg-charge-on-copied-pages
+++ a/mm/ksm.c
@@ -2586,6 +2586,10 @@ struct page *ksm_might_need_to_copy(stru
 		return page;		/* let do_swap_page report the error */
 
 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (new_page && mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL)) {
+		put_page(new_page);
+		new_page = NULL;
+	}
 	if (new_page) {
 		copy_user_highpage(new_page, page, address, vma);
 
_


* [patch 03/15] mm: migration of hugetlbfs page skip memcg
  2020-09-19  4:19 incoming Andrew Morton
  2020-09-19  4:20 ` [patch 01/15] mailmap: add older email addresses for Kees Cook Andrew Morton
  2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: mm: migration of hugetlbfs page skip memcg

hugetlbfs pages do not participate in memcg: so although they do find most
of migrate_page_states() useful, it would be better if they did not call
into mem_cgroup_migrate() - where Qian Cai reported that LTP's
move_pages12 triggers the warning in Alex Shi's prospective commit
"mm/memcg: warning on !memcg after readahead page charged".

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/migrate.c~mm-migration-of-hugetlbfs-page-skip-memcg
+++ a/mm/migrate.c
@@ -668,7 +668,8 @@ void migrate_page_states(struct page *ne
 
 	copy_page_owner(page, newpage);
 
-	mem_cgroup_migrate(page, newpage);
+	if (!PageHuge(page))
+		mem_cgroup_migrate(page, newpage);
 }
 EXPORT_SYMBOL(migrate_page_states);
 
_


* [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  4:19 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2020-09-19  4:20 ` [patch 03/15] mm: migration of hugetlbfs page skip memcg Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:44   ` Matthew Wilcox
  2020-09-19  4:20 ` [patch 05/15] mm: fix check_move_unevictable_pages() on THP Andrew Morton
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: shmem: shmem_writepage() split unlikely i915 THP

drivers/gpu/drm/i915/gem/i915_gem_shmem.c contains a shmem_writeback()
which calls shmem_writepage() from a shrinker: that usually works well
enough; but if /sys/kernel/mm/transparent_hugepage/shmem_enabled has been
set to "force" (documented as "Force the huge option on for all - very
useful for testing"), shmem_writepage() is surprised to be called with a
huge page, and crashes on the VM_BUG_ON_PAGE(PageCompound) (I did not find
out where the crash happens when CONFIG_DEBUG_VM is off).

LRU page reclaim always splits the shmem huge page first: I'd prefer not
to demand that of i915, so check and split compound in shmem_writepage().
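
For testing, the configuration in question is reached by forcing huge pages
for all shmem mappings before putting the i915 shrinker under memory
pressure:

  # echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled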

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301401390.5954@eggly.anvils
Fixes: 2d6692e642e7 ("drm/i915: Start writeback from the shrinker")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/mm/shmem.c~shmem-shmem_writepage-split-unlikely-i915-thp
+++ a/mm/shmem.c
@@ -1362,7 +1362,15 @@ static int shmem_writepage(struct page *
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page))
+		if (split_huge_page(page) < 0)
+			goto redirty;
+
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;
_


* [patch 05/15] mm: fix check_move_unevictable_pages() on THP
  2020-09-19  4:19 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 06/15] mlock: fix unevictable_pgs event counts " Andrew Morton
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, cai, hannes, hughd, linux-mm, mhocko, mike.kravetz,
	mm-commits, shakeelb, shy828301, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: mm: fix check_move_unevictable_pages() on THP

check_move_unevictable_pages() is used in making unevictable shmem pages
evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and
i915/gem check_release_pagevec().  Those may pass down subpages of a huge
page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force".

That does not crash or warn at present, but the accounting of vmstats
unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent:
scanned being incremented on each subpage, rescued only on the head (since
tails already appear evictable once the head has been updated).

5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has
established that vm_events in general (and unevictable_pgs_rescued in
particular) should count every subpage: so follow that precedent here.

Do this in such a way that if mem_cgroup_page_lruvec() is made stricter
(to check page->mem_cgroup is always set), no problem: skip the tails
before calling it, and add thp_nr_pages() to vmstats on the head.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/mm/vmscan.c~mm-fix-check_move_unevictable_pages-on-thp
+++ a/mm/vmscan.c
@@ -4268,8 +4268,14 @@ void check_move_unevictable_pages(struct
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
 		struct pglist_data *pagepgdat = page_pgdat(page);
+		int nr_pages;
+
+		if (PageTransTail(page))
+			continue;
+
+		nr_pages = thp_nr_pages(page);
+		pgscanned += nr_pages;
 
-		pgscanned++;
 		if (pagepgdat != pgdat) {
 			if (pgdat)
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -4288,7 +4294,7 @@ void check_move_unevictable_pages(struct
 			ClearPageUnevictable(page);
 			del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE);
 			add_page_to_lru_list(page, lruvec, lru);
-			pgrescued++;
+			pgrescued += nr_pages;
 		}
 	}
 
_


* [patch 06/15] mlock: fix unevictable_pgs event counts on THP
  2020-09-19  4:19 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2020-09-19  4:20 ` [patch 05/15] mm: fix check_move_unevictable_pages() on THP Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 07/15] tmpfs: restore functionality of nr_inodes=0 Andrew Morton
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, torvalds

From: Hugh Dickins <hughd@google.com>
Subject: mlock: fix unevictable_pgs event counts on THP

5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has
established that vm_events should count every subpage of a THP, including
unevictable_pgs_culled and unevictable_pgs_rescued; but
lru_cache_add_inactive_or_unevictable() was not doing so for
unevictable_pgs_mlocked, and mm/mlock.c was not doing so for
unevictable_pgs mlocked, munlocked, cleared and stranded.

Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes needed
on that path.
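
The counters being fixed are all visible in /proc/vmstat, so the effect can
be checked directly while testing, e.g.:

  # grep unevictable_pgs /proc/vmstat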

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils
Fixes: 5d91f31faf8e ("mm: swap: fix vmstats for huge page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mlock.c |   24 +++++++++++++++---------
 mm/swap.c  |    6 +++---
 2 files changed, 18 insertions(+), 12 deletions(-)

--- a/mm/mlock.c~mlock-fix-unevictable_pgs-event-counts-on-thp
+++ a/mm/mlock.c
@@ -58,11 +58,14 @@ EXPORT_SYMBOL(can_do_mlock);
  */
 void clear_page_mlock(struct page *page)
 {
+	int nr_pages;
+
 	if (!TestClearPageMlocked(page))
 		return;
 
-	mod_zone_page_state(page_zone(page), NR_MLOCK, -thp_nr_pages(page));
-	count_vm_event(UNEVICTABLE_PGCLEARED);
+	nr_pages = thp_nr_pages(page);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+	count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
 	/*
 	 * The previous TestClearPageMlocked() corresponds to the smp_mb()
 	 * in __pagevec_lru_add_fn().
@@ -76,7 +79,7 @@ void clear_page_mlock(struct page *page)
 		 * We lost the race. the page already moved to evictable list.
 		 */
 		if (PageUnevictable(page))
-			count_vm_event(UNEVICTABLE_PGSTRANDED);
+			count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 	}
 }
 
@@ -93,9 +96,10 @@ void mlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
 
 	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    thp_nr_pages(page));
-		count_vm_event(UNEVICTABLE_PGMLOCKED);
+		int nr_pages = thp_nr_pages(page);
+
+		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 		if (!isolate_lru_page(page))
 			putback_lru_page(page);
 	}
@@ -138,7 +142,7 @@ static void __munlock_isolated_page(stru
 
 	/* Did try_to_unlock() succeed or punt? */
 	if (!PageMlocked(page))
-		count_vm_event(UNEVICTABLE_PGMUNLOCKED);
+		count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page));
 
 	putback_lru_page(page);
 }
@@ -154,10 +158,12 @@ static void __munlock_isolated_page(stru
  */
 static void __munlock_isolation_failed(struct page *page)
 {
+	int nr_pages = thp_nr_pages(page);
+
 	if (PageUnevictable(page))
-		__count_vm_event(UNEVICTABLE_PGSTRANDED);
+		__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 	else
-		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
+		__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
 }
 
 /**
--- a/mm/swap.c~mlock-fix-unevictable_pgs-event-counts-on-thp
+++ a/mm/swap.c
@@ -494,14 +494,14 @@ void lru_cache_add_inactive_or_unevictab
 
 	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
 	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
 		/*
 		 * We use the irq-unsafe __mod_zone_page_stat because this
 		 * counter is not modified from interrupt context, and the pte
 		 * lock is held(spinlock), which implies preemption disabled.
 		 */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    thp_nr_pages(page));
-		count_vm_event(UNEVICTABLE_PGMLOCKED);
+		__mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
 	}
 	lru_cache_add(page);
 }
_


* [patch 07/15] tmpfs: restore functionality of nr_inodes=0
  2020-09-19  4:19 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2020-09-19  4:20 ` [patch 06/15] mlock: fix unevictable_pgs event counts " Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, chris, gandalf, hughd, linux-mm, mm-commits, torvalds

From: Byron Stanoszek <gandalf@winds.org>
Subject: tmpfs: restore functionality of nr_inodes=0

Commit e809d5f0b5c9 ("tmpfs: per-superblock i_ino support") made changes
to shmem_reserve_inode() in mm/shmem.c, however the original test for
(sbinfo->max_inodes) got dropped.  This causes mounting tmpfs with option
nr_inodes=0 to fail:

  # mount -ttmpfs -onr_inodes=0 none /ext0
  mount: /ext0: mount(2) system call failed: Cannot allocate memory.

This patch restores the nr_inodes=0 functionality.
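
With the fix applied the same mount succeeds again, and the absence of an
inode limit can then be confirmed with, e.g.:

  # mount -t tmpfs -o nr_inodes=0 none /ext0
  # df -i /ext0        (an inode total of 0 here means no limit)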

Link: https://lkml.kernel.org/r/20200902035715.16414-1-gandalf@winds.org
Fixes: e809d5f0b5c9 ("tmpfs: per-superblock i_ino support")
Signed-off-by: Byron Stanoszek <gandalf@winds.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Chris Down <chris@chrisdown.name>	
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/mm/shmem.c~tmpfs-restore-functionality-of-nr_inodes=0
+++ a/mm/shmem.c
@@ -279,11 +279,13 @@ static int shmem_reserve_inode(struct su
 
 	if (!(sb->s_flags & SB_KERNMOUNT)) {
 		spin_lock(&sbinfo->stat_lock);
-		if (!sbinfo->free_inodes) {
-			spin_unlock(&sbinfo->stat_lock);
-			return -ENOSPC;
+		if (sbinfo->max_inodes) {
+			if (!sbinfo->free_inodes) {
+				spin_unlock(&sbinfo->stat_lock);
+				return -ENOSPC;
+			}
+			sbinfo->free_inodes--;
 		}
-		sbinfo->free_inodes--;
 		if (inop) {
 			ino = sbinfo->next_ino++;
 			if (unlikely(is_zero_ino(ino)))
_


* [patch 08/15] kprobes: fix kill kprobe which has been marked as gone
  2020-09-19  4:19 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2020-09-19  4:20 ` [patch 07/15] tmpfs: restore functionality of nr_inodes=0 Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, anil.s.keshavamurthy, davem, linux-mm, mhiramat,
	mm-commits, naveen.n.rao, rostedt, songliubraving, songmuchun,
	stable, torvalds, zhouchengming

From: Muchun Song <songmuchun@bytedance.com>
Subject: kprobes: fix kill kprobe which has been marked as gone

If a kprobe is marked as gone, we should not kill it again.  Otherwise, we
can disarm the kprobe more than once.  In that case, the kprobe_ftrace_enabled
count can become unbalanced, which can leave the kprobe not working.

Link: https://lkml.kernel.org/r/20200822030055.32383-1-songmuchun@bytedance.com
Fixes: e8386a0cb22f ("kprobes: support probing module __exit function")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Song Liu <songliubraving@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kprobes.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/kernel/kprobes.c~kprobes-fix-kill-kprobe-which-has-been-marked-as-gone
+++ a/kernel/kprobes.c
@@ -2140,6 +2140,9 @@ static void kill_kprobe(struct kprobe *p
 
 	lockdep_assert_held(&kprobe_mutex);
 
+	if (WARN_ON_ONCE(kprobe_gone(p)))
+		return;
+
 	p->flags |= KPROBE_FLAG_GONE;
 	if (kprobe_aggrprobe(p)) {
 		/*
@@ -2419,7 +2422,10 @@ static int kprobes_module_callback(struc
 	mutex_lock(&kprobe_mutex);
 	for (i = 0; i < KPROBE_TABLE_SIZE; i++) {
 		head = &kprobe_table[i];
-		hlist_for_each_entry(p, head, hlist)
+		hlist_for_each_entry(p, head, hlist) {
+			if (kprobe_gone(p))
+				continue;
+
 			if (within_module_init((unsigned long)p->addr, mod) ||
 			    (checkcore &&
 			     within_module_core((unsigned long)p->addr, mod))) {
@@ -2436,6 +2442,7 @@ static int kprobes_module_callback(struc
 				 */
 				kill_kprobe(p);
 			}
+		}
 	}
 	if (val == MODULE_STATE_GOING)
 		remove_module_kprobe_blacklist(mod);
_


* [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD
  2020-09-19  4:19 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, apopple, bharata, bskeggs, hch, jgg, jglisse, jhubbard,
	linux-mm, mm-commits, rcampbell, shuah, shy828301, stable,
	torvalds, ziy

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/thp: fix __split_huge_pmd_locked() for migration PMD

A migrating transparent huge page has to already be unmapped.  Otherwise,
the page could be modified while it is being copied to a new page and data
could be lost.  The function __split_huge_pmd() checks for a PMD migration
entry before calling __split_huge_pmd_locked() leading one to think that
__split_huge_pmd_locked() can handle splitting a migrating PMD.

However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.

Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).  Fix these problems by
checking for a PMD migration entry.
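
Purely as an illustration of the problem (not code from this patch): for a
migration entry the pfn has to be decoded from the swap entry rather than
read with pmd_pfn(), along the lines of

	pmd_t pmdval = *pmd;

	if (is_pmd_migration_entry(pmdval))
		pfn = migration_entry_to_pfn(pmd_to_swp_entry(pmdval));
	else
		pfn = pmd_pfn(pmdval);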

Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

--- a/mm/huge_memory.c~mm-thp-fix-__split_huge_pmd_locked-for-migration-pmd
+++ a/mm/huge_memory.c
@@ -2022,7 +2022,7 @@ static void __split_huge_pmd_locked(stru
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
-	} else if (is_huge_zero_pmd(*pmd)) {
+	} else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
 		 * mmu_notifier_invalidate_range() see comments below inside
@@ -2116,30 +2116,34 @@ static void __split_huge_pmd_locked(stru
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
-		atomic_inc(&page[i]._mapcount);
-		pte_unmap(pte);
-	}
-
-	/*
-	 * Set PG_double_map before dropping compound_mapcount to avoid
-	 * false-negative page_mapped().
-	 */
-	if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
-		for (i = 0; i < HPAGE_PMD_NR; i++)
+		if (!pmd_migration)
 			atomic_inc(&page[i]._mapcount);
+		pte_unmap(pte);
 	}
 
-	lock_page_memcg(page);
-	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-		/* Last compound_mapcount is gone. */
-		__dec_lruvec_page_state(page, NR_ANON_THPS);
-		if (TestClearPageDoubleMap(page)) {
-			/* No need in mapcount reference anymore */
+	if (!pmd_migration) {
+		/*
+		 * Set PG_double_map before dropping compound_mapcount to avoid
+		 * false-negative page_mapped().
+		 */
+		if (compound_mapcount(page) > 1 &&
+		    !TestSetPageDoubleMap(page)) {
 			for (i = 0; i < HPAGE_PMD_NR; i++)
-				atomic_dec(&page[i]._mapcount);
+				atomic_inc(&page[i]._mapcount);
+		}
+
+		lock_page_memcg(page);
+		if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+			/* Last compound_mapcount is gone. */
+			__dec_lruvec_page_state(page, NR_ANON_THPS);
+			if (TestClearPageDoubleMap(page)) {
+				/* No need in mapcount reference anymore */
+				for (i = 0; i < HPAGE_PMD_NR; i++)
+					atomic_dec(&page[i]._mapcount);
+			}
 		}
+		unlock_page_memcg(page);
 	}
-	unlock_page_memcg(page);
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
_


* [patch 10/15] selftests/vm: fix display of page size in map_hugetlb
  2020-09-19  4:19 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, christophe.leroy, linux-mm, mm-commits, stable, torvalds

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: selftests/vm: fix display of page size in map_hugetlb

The displayed size is in bytes while the text says it is in kB.

Shift it by 10 to really display kBytes.
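
For example, asking map_hugetlb for 2 MB huge pages means shift = 21: the old
code printed "2097152 kB hugepages" (1 << 21), whereas 1 << (21 - 10) = 2048
gives the intended "2048 kB hugepages".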

Link: https://lkml.kernel.org/r/e27481224564a93d14106e750de31189deaa8bc8.1598861977.git.christophe.leroy@csgroup.eu
Fixes: fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/map_hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/map_hugetlb.c~selftests-vm-fix-display-of-page-size-in-map_hugetlb
+++ a/tools/testing/selftests/vm/map_hugetlb.c
@@ -83,7 +83,7 @@ int main(int argc, char **argv)
 	}
 
 	if (shift)
-		printf("%u kB hugepages\n", 1 << shift);
+		printf("%u kB hugepages\n", 1 << (shift - 10));
 	else
 		printf("Default size hugepages\n");
 	printf("Mapping %lu Mbytes\n", (unsigned long)length >> 20);
_


* [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline
  2020-09-19  4:19 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 12/15] ftrace: let ftrace_enable_sysctl take a kernel pointer buffer Andrew Morton
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, david, linux-mm, mhocko, mm-commits, osalvador,
	pasha.tatashin, richard.weiyang, rientjes, stable, torvalds,
	vbabka

From: Pavel Tatashin <pasha.tatashin@soleen.com>
Subject: mm/memory_hotplug: drain per-cpu pages again during memory offline

There is a race during page offline that can lead to an infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.

Thread#1 - a new process:

load_elf_binary
 begin_new_exec
  exec_mmap
   mmput
    exit_mmap
     tlb_finish_mmu
      tlb_flush_mmu
       release_pages
        free_unref_page_list
         free_unref_page_prepare
          set_pcppage_migratetype(page, migratetype);
             // Set page->index migration type below  MIGRATE_PCPTYPES

Thread#2 - hot-removes memory
__offline_pages
  start_isolate_page_range
    set_migratetype_isolate
      set_pageblock_migratetype(page, MIGRATE_ISOLATE);
        // Set migration type to MIGRATE_ISOLATE
        drain_all_pages(zone);
             // drain per-cpu page lists to buddy allocator.

Thread#1 - continue
         free_unref_page_commit
           migratetype = get_pcppage_migratetype(page);
              // get old migration type
           list_add(&page->lru, &pcp->lists[migratetype]);
              // add new page to already drained pcp list

Thread#2
Never drains pcp again, and therefore gets stuck in the loop.

The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.

Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   14 ++++++++++++++
 mm/page_isolation.c |    8 ++++++++
 2 files changed, 22 insertions(+)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/memory_hotplug.c
@@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigne
 		/* check again */
 		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 					    NULL, check_pages_isolated_cb);
+		/*
+		 * per-cpu pages are drained in start_isolate_page_range, but if
+		 * there are still pages that are not free, make sure that we
+		 * drain again, because when we isolated range we might
+		 * have raced with another thread that was adding pages to pcp
+		 * list.
+		 *
+		 * Forward progress should be still guaranteed because
+		 * pages on the pcp list can only belong to MOVABLE_ZONE
+		 * because has_unmovable_pages explicitly checks for
+		 * PageBuddy on freed pages on other zones.
+		 */
+		if (ret)
+			drain_all_pages(zone);
 	} while (ret);
 
 	/* Ok, all of our target is isolated.
--- a/mm/page_isolation.c~mm-memory_hotplug-drain-per-cpu-pages-again-during-memory-offline
+++ a/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, un
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even when they are in fact isolated already. Depending
+ * on how strong of a guarantee the caller needs drain_all_pages might be needed
+ * (e.g. __offline_pages will need to call it after check for isolated range for
+ * a next retry).
+ *
  * Return: the number of isolated pageblocks on success and -EBUSY if any part
  * of range cannot be isolated.
  */
_


* [patch 12/15] ftrace: let ftrace_enable_sysctl take a kernel pointer buffer
  2020-09-19  4:19 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 13/15] stackleak: let stack_erasing_sysctl " Andrew Morton
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, tklauser, torvalds, viro

From: Tobias Klauser <tklauser@distanz.ch>
Subject: ftrace: let ftrace_enable_sysctl take a kernel pointer buffer

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer.  Adjust the
signature of ftrace_enable_sysctl to match ctl_table.proc_handler which
fixes the following sparse warning:

kernel/trace/ftrace.c:7544:43: warning: incorrect type in argument 3 (different address spaces)
kernel/trace/ftrace.c:7544:43:    expected void *
kernel/trace/ftrace.c:7544:43:    got void [noderef] __user *buffer
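
For reference (not part of this patch, and example_handler is a made-up
name), a handler matching the updated prototype passes the kernel buffer
straight through to the proc_do* helpers:

	static int example_handler(struct ctl_table *table, int write,
				   void *buffer, size_t *lenp, loff_t *ppos)
	{
		/* buffer is a kernel pointer now, no __user annotation */
		return proc_dointvec(table, write, buffer, lenp, ppos);
	}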

Link: https://lkml.kernel.org/r/20200907093207.13540-1-tklauser@distanz.ch
Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/ftrace.h |    3 +--
 kernel/trace/ftrace.c  |    3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

--- a/include/linux/ftrace.h~ftrace-let-ftrace_enable_sysctl-take-a-kernel-pointer-buffer
+++ a/include/linux/ftrace.h
@@ -85,8 +85,7 @@ static inline int ftrace_mod_get_kallsym
 extern int ftrace_enabled;
 extern int
 ftrace_enable_sysctl(struct ctl_table *table, int write,
-		     void __user *buffer, size_t *lenp,
-		     loff_t *ppos);
+		     void *buffer, size_t *lenp, loff_t *ppos);
 
 struct ftrace_ops;
 
--- a/kernel/trace/ftrace.c~ftrace-let-ftrace_enable_sysctl-take-a-kernel-pointer-buffer
+++ a/kernel/trace/ftrace.c
@@ -7531,8 +7531,7 @@ static bool is_permanent_ops_registered(
 
 int
 ftrace_enable_sysctl(struct ctl_table *table, int write,
-		     void __user *buffer, size_t *lenp,
-		     loff_t *ppos)
+		     void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int ret = -ENODEV;
 
_


* [patch 13/15] stackleak: let stack_erasing_sysctl take a kernel pointer buffer
  2020-09-19  4:19 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-09-19  4:20 ` [patch 12/15] ftrace: let ftrace_enable_sysctl take a kernel pointer buffer Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 14/15] fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype Andrew Morton
  2020-09-19  4:20 ` [patch 15/15] kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments' Andrew Morton
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, tklauser, torvalds, viro

From: Tobias Klauser <tklauser@distanz.ch>
Subject: stackleak: let stack_erasing_sysctl take a kernel pointer buffer

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer.  Adjust the
signature of stack_erasing_sysctl to match ctl_table.proc_handler which
fixes the following sparse warning:

kernel/stackleak.c:31:50: warning: incorrect type in argument 3 (different address spaces)
kernel/stackleak.c:31:50:    expected void *
kernel/stackleak.c:31:50:    got void [noderef] __user *buffer

Link: https://lkml.kernel.org/r/20200907093253.13656-1-tklauser@distanz.ch
Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/stackleak.h |    2 +-
 kernel/stackleak.c        |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/stackleak.h~stackleak-let-stack_erasing_sysctl-take-a-kernel-pointer-buffer
+++ a/include/linux/stackleak.h
@@ -25,7 +25,7 @@ static inline void stackleak_task_init(s
 
 #ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE
 int stack_erasing_sysctl(struct ctl_table *table, int write,
-			void __user *buffer, size_t *lenp, loff_t *ppos);
+			void *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
 #else /* !CONFIG_GCC_PLUGIN_STACKLEAK */
--- a/kernel/stackleak.c~stackleak-let-stack_erasing_sysctl-take-a-kernel-pointer-buffer
+++ a/kernel/stackleak.c
@@ -20,7 +20,7 @@
 static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass);
 
 int stack_erasing_sysctl(struct ctl_table *table, int write,
-			void __user *buffer, size_t *lenp, loff_t *ppos)
+			void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int ret = 0;
 	int state = !static_branch_unlikely(&stack_erasing_bypass);
_


* [patch 14/15] fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype
  2020-09-19  4:19 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-09-19  4:20 ` [patch 13/15] stackleak: let stack_erasing_sysctl " Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  2020-09-19  4:20 ` [patch 15/15] kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments' Andrew Morton
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, mm-commits, tklauser, torvalds, viro

From: Tobias Klauser <tklauser@distanz.ch>
Subject: fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer.  Adjust the
definition of dirtytime_interval_handler to match its prototype in
linux/writeback.h which fixes the following sparse error/warning:

fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces)
fs/fs-writeback.c:2189:50:    expected void *
fs/fs-writeback.c:2189:50:    got void [noderef] __user *buffer
fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)):
fs/fs-writeback.c:2184:5:    int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )
fs/fs-writeback.c: note: in included file:
./include/linux/writeback.h:374:5: note: previously declared as:
./include/linux/writeback.h:374:5:    int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... )

Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.ch
Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/fs-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/fs-writeback.c~fs-adjust-dirtytime_interval_handler-definition-to-match-prototype
+++ a/fs/fs-writeback.c
@@ -2184,7 +2184,7 @@ static int __init start_dirtytime_writeb
 __initcall(start_dirtytime_writeback);
 
 int dirtytime_interval_handler(struct ctl_table *table, int write,
-			       void __user *buffer, size_t *lenp, loff_t *ppos)
+			       void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int ret;
 
_


* [patch 15/15] kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments'
  2020-09-19  4:19 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2020-09-19  4:20 ` [patch 14/15] fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype Andrew Morton
@ 2020-09-19  4:20 ` Andrew Morton
  14 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2020-09-19  4:20 UTC (permalink / raw)
  To: akpm, changbin.du, elver, gregkh, linux-mm, mm-commits, rdunlap,
	torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments'

This moves the KCSAN kconfig items under menu 'Generic Kernel Debugging
Instruments' where UBSAN resides.

Link: https://lkml.kernel.org/r/20200904152224.5570-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/lib/Kconfig.debug~kcsan-kconfig-move-to-menu-generic-kernel-debugging-instruments
+++ a/lib/Kconfig.debug
@@ -520,8 +520,8 @@ config DEBUG_FS_ALLOW_NONE
 endchoice
 
 source "lib/Kconfig.kgdb"
-
 source "lib/Kconfig.ubsan"
+source "lib/Kconfig.kcsan"
 
 endmenu
 
@@ -1620,8 +1620,6 @@ config PROVIDE_OHCI1394_DMA_INIT
 
 source "samples/Kconfig"
 
-source "lib/Kconfig.kcsan"
-
 config ARCH_HAS_DEVMEM_IS_ALLOWED
 	bool
 
_


* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
@ 2020-09-19  4:44   ` Matthew Wilcox
  2020-09-19  5:44       ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2020-09-19  4:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: alex.shi, cai, hannes, hughd, linux-mm, mhocko, mike.kravetz,
	mm-commits, shakeelb, shy828301, stable, torvalds

On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> LRU page reclaim always splits the shmem huge page first: I'd prefer not
> to demand that of i915, so check and split compound in shmem_writepage().

Sorry for not checking this earlier, but I don't think this is right.

        for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
...
                if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
...
                        ret = mapping->a_ops->writepage(page, &wbc);

so we cleared the dirty bit on the entire hugepage, but then split
it after clearing the dirty bit, so the subpages are now not dirty.
I think we'll lose writes as a result?  At least we won't swap pages
out that deserve to be paged out.

>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page))
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +
>  	BUG_ON(!PageLocked(page));
>  	mapping = page->mapping;
>  	index = page->index;
> _
> 


* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  4:44   ` Matthew Wilcox
@ 2020-09-19  5:44       ` Hugh Dickins
  0 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2020-09-19  5:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, alex.shi, cai, hannes, hughd, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 09:20:09PM -0700, Andrew Morton wrote:
> > LRU page reclaim always splits the shmem huge page first: I'd prefer not
> > to demand that of i915, so check and split compound in shmem_writepage().
> 
> Sorry for not checking this earlier, but I don't think this is right.
> 
>         for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
> ...
>                 if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
> ...
>                         ret = mapping->a_ops->writepage(page, &wbc);
> 
> so we cleared the dirty bit on the entire hugepage, but then split
> it after clearing the dirty bit, so the subpages are now not dirty.
> I think we'll lose writes as a result?  At least we won't swap pages
> out that deserve to be paged out.

Very good observation, thank you.

It behaves a lot better with this patch in than without it; but you're
right, only the head will get written to swap, and the tails left in
memory; with dirty cleared, so they may be left indefinitely (I've
not yet looked to see when if ever PageDirty might get set later).

Hmm. It may just be a matter of restyling the i915 code with

		if (!page_mapped(page)) {
			clear_page_dirty_for_io(page);

but I don't want to rush to that conclusion - there might turn
out to be a good way of doing it at the shmem_writepage() end, but
probably only hacks available.  I'll mull it over: it deserves some
thought about what would suit, if a THP arrived here some other way.
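
(Purely as an untested sketch, that restyling of the loop Matthew quoted
would look something like

		if (!page_mapped(page)) {
			clear_page_dirty_for_io(page);
			...
			ret = mapping->a_ops->writepage(page, &wbc);
		}

i.e. write the page out whether or not clear_page_dirty_for_io() saw it as
dirty.)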

We did have some i915 guys Cc'ed on the original posting, but no
need to Cc them again until I'm closer to deciding what's right.

Linus, Andrew, probably best to drop this patch for now, since
no one other than me has reported the problem here, testing with
/sys/kernel/mm/transparent_hugepage/shmem_enabled set to "force";
and what it fixes is not a new regression in 5.9.

Though no harm done if the patch goes in: it is an improvement,
but seriously incomplete, as Matthew has observed.

Hugh

> 
> >  
> > -	VM_BUG_ON_PAGE(PageCompound(page), page);
> > +	/*
> > +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> > +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> > +	 * and its shmem_writeback() needs them to be split when swapping.
> > +	 */
> > +	if (PageTransCompound(page))
> > +		if (split_huge_page(page) < 0)
> > +			goto redirty;
> > +
> >  	BUG_ON(!PageLocked(page));
> >  	mapping = page->mapping;
> >  	index = page->index;
> > _
> > 


* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19  5:44       ` Hugh Dickins
  (?)
@ 2020-09-19 16:18       ` Matthew Wilcox
  2020-09-20  0:16           ` Hugh Dickins
  2020-10-02 18:37         ` Matthew Wilcox
  -1 siblings, 2 replies; 29+ messages in thread
From: Matthew Wilcox @ 2020-09-19 16:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, alex.shi, cai, hannes, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> It behaves a lot better with this patch in than without it; but you're
> right, only the head will get written to swap, and the tails left in
> memory; with dirty cleared, so they may be left indefinitely (I've
> not yet looked to see when if ever PageDirty might get set later).
> 
> Hmm. It may just be a matter of restyling the i915 code with
> 
> 		if (!page_mapped(page)) {
> 			clear_page_dirty_for_io(page);
> 
> but I don't want to rush to that conclusion - there might turn
> out to be a good way of doing it at the shmem_writepage() end, but
> probably only hacks available.  I'll mull it over: it deserves some
> thought about what would suit, if a THP arrived here some other way.

I think the ultimate solution is to do as I have done for iomap and make
->writepage handle arbitrary sized pages.  However, I don't know the
swap I/O path particularly well, and I would rather not learn it just yet.

How about this for a band-aid until we sort that out properly?  Just mark
the page as dirty before splitting it so subsequent iterations see the
subpages as dirty.  Arguably, we should use set_page_dirty() instead of
SetPageDirty, but I don't think i915 cares.  In particular, it uses
an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..6231207ab1eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
-	VM_BUG_ON_PAGE(PageCompound(page), page);
 	BUG_ON(!PageLocked(page));
+
+	/*
+	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
+	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
+	 * and its shmem_writeback() needs them to be split when swapping.
+	 */
+	if (PageTransCompound(page)) {
+		/* Ensure the subpages are still dirty */
+		SetPageDirty(page);
+		if (split_huge_page(page) < 0)
+			goto redirty;
+		ClearPageDirty(page);
+	}
+
 	mapping = page->mapping;
 	index = page->index;
 	inode = mapping->host;



* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19 16:18       ` Matthew Wilcox
@ 2020-09-20  0:16           ` Hugh Dickins
  2020-10-02 18:37         ` Matthew Wilcox
  1 sibling, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2020-09-20  0:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.

Understood ;)

> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.

This is certainly much better than my original, and I've had no problem
in running with it (briefly); and it's what I'll now keep in my testing
tree - thanks.  But it is more obviously a hack, or as you put it, a
band-aid: fit for Linus's tree?

This issue has only come up with i915 and shmem_enabled "force", nobody
has been bleeding except me: but if it's something that impedes or may
impede your own testing, or that you feel should go in anyway, say so and
I'll push your new improved version to akpm (adding your signoff to mine).

> Arguably, we should use set_page_dirty() instead of SetPageDirty,

My position on that has changed down the years: I went through a
phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
the idea that its "if (!PageDirty)" is good to avoid setting cacheline
dirty.  Then Spectre changed the balance, so now I'd rather avoid the
indirect function call, and go with your SetPageDirty.  But the bdi
flags are such that they do the same thing, and if they ever diverge,
then we make the necessary changes at that time.
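
To make the trade-off concrete, a minimal illustration (not a patch; the
shmem detail is simply that shmem_aops uses __set_page_dirty_no_writeback()):

	/*
	 * SetPageDirty() just sets the PG_dirty flag bit directly;
	 * set_page_dirty() dispatches through the address_space
	 * operations, i.e. a retpolined indirect call post-Spectre.
	 * For shmem the two end up equivalent, since shmem_aops uses
	 * __set_page_dirty_no_writeback().
	 */
	SetPageDirty(page);	/* direct flag set */
	set_page_dirty(page);	/* mapping->a_ops->set_page_dirty(page) */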

> but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.

PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
with one exception, every shmem page is always dirty (since its swap
is freed as soon as the page is moved from swap cache to file cache);
unless I'm forgetting something (like the tails in my patch!), the
only exception is a page allocated to a hole on fault (after which
a write fault will soon set pte dirty).

So I didn't quite get what i915 was doing with its set_page_dirty()s:
but very probably being a good citizen, marking when it exposed the
page to changes itself, making no assumption of what shmem.c does.

It would be good to have a conversation with i915 guys about huge pages
sometime (they forked off their own mount point in i915_gemfs_init(),
precisely in order to make use of huge pages, but couldn't get them
working, so removed the option to ask for them, but kept the separate
mount point.  But not a conversation I'll be responsive on at present.)

Hugh

> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 271548ca20f3..6231207ab1eb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	BUG_ON(!PageLocked(page));
> +
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page)) {
> +		/* Ensure the subpages are still dirty */
> +		SetPageDirty(page);
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +		ClearPageDirty(page);
> +	}
> +
>  	mapping = page->mapping;
>  	index = page->index;
>  	inode = mapping->host;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-20  0:16           ` Hugh Dickins
  (?)
@ 2020-09-20  3:32           ` Matthew Wilcox
  -1 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2020-09-20  3:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, alex.shi, cai, hannes, linux-mm, mhocko,
	mike.kravetz, mm-commits, shakeelb, shy828301, stable, torvalds

On Sat, Sep 19, 2020 at 05:16:57PM -0700, Hugh Dickins wrote:
> On Sat, 19 Sep 2020, Matthew Wilcox wrote:
> > How about this for a band-aid until we sort that out properly?  Just mark
> > the page as dirty before splitting it so subsequent iterations see the
> > subpages as dirty.
> 
> This is certainly much better than my original, and I've had no problem
> in running with it (briefly); and it's what I'll now keep in my testing
> tree - thanks.  But it is more obviously a hack, or as you put it, a
> band-aid: fit for Linus's tree?
> 
> This issue has only come up with i915 and shmem_enabled "force", nobody
> has been bleeding except me: but if it's something that impedes or may
> impede your own testing, or that you feel should go in anyway, say so and
> I'll push your new improved version to akpm (adding your signoff to mine).

No, it's not impeding my testing; I don't even have i915 compiled into
my kernel.

> > Arguably, we should use set_page_dirty() instead of SetPageDirty,
> 
> My position on that has changed down the years: I went through a
> phase of replacing shmem.c's SetPageDirtys by set_page_dirtys, with
> the idea that its "if (!PageDirty)" is good to avoid setting cacheline
> dirty.  Then Spectre changed the balance, so now I'd rather avoid the
> indirect function call, and go with your SetPageDirty.  But the bdi
> flags are such that they do the same thing, and if they ever diverge,
> then we make the necessary changes at that time.

That makes sense.

> > but I don't think i915 cares.  In particular, it uses
> > an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> PAGECACHE_TAG_DIRTY comes with bdi flags that shmem does not use:
> with one exception, every shmem page is always dirty (since its swap
> is freed as soon as the page is moved from swap cache to file cache);
> unless I'm forgetting something (like the tails in my patch!), the
> only exception is a page allocated to a hole on fault (after which
> a write fault will soon set pte dirty).

Ah, of course.  I keep forgetting that shmem doesn't use the
dirty/writeback bits.

> So I didn't quite get what i915 was doing with its set_page_dirty()s:
> but very probably being a good citizen, marking when it exposed the
> page to changes itself, making no assumption of what shmem.c does.
> 
> It would be good to have a conversation with i915 guys about huge pages
> sometime (they forked off their own mount point in i915_gemfs_init(),
> precisely in order to make use of huge pages, but couldn't get them
> working, so removed the option to ask for them, but kept the separate
> mount point.  But not a conversation I'll be responsive on at present.)

Yes, I have a strong feeling the i915 shmem code is cargo-culted.  There
are some very questionable constructs in there.  It's a little unfair to
expect graphics developers to suddenly become experts on the page cache,
so I've taken this as impetus to clean up our APIs a little (e.g. moving
find_lock_entry() to mm/internal.h).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-09-19 16:18       ` Matthew Wilcox
  2020-09-20  0:16           ` Hugh Dickins
@ 2020-10-02 18:37         ` Matthew Wilcox
  2020-10-09  8:14             ` Huang, Ying
  1 sibling, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2020-10-02 18:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Huang Ying, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> > It behaves a lot better with this patch in than without it; but you're
> > right, only the head will get written to swap, and the tails left in
> > memory; with dirty cleared, so they may be left indefinitely (I've
> > not yet looked to see when if ever PageDirty might get set later).
> > 
> > Hmm. It may just be a matter of restyling the i915 code with
> > 
> > 		if (!page_mapped(page)) {
> > 			clear_page_dirty_for_io(page);
> > 
> > but I don't want to rush to that conclusion - there might turn
> > out to be a good way of doing it at the shmem_writepage() end, but
> > probably only hacks available.  I'll mull it over: it deserves some
> > thought about what would suit, if a THP arrived here some other way.
> 
> I think the ultimate solution is to do as I have done for iomap and make
> ->writepage handle arbitrary sized pages.  However, I don't know the
> swap I/O path particularly well, and I would rather not learn it just yet.
> 
> How about this for a band-aid until we sort that out properly?  Just mark
> the page as dirty before splitting it so subsequent iterations see the
> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
> SetPageDirty, but I don't think i915 cares.  In particular, it uses
> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 271548ca20f3..6231207ab1eb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	swp_entry_t swap;
>  	pgoff_t index;
>  
> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	BUG_ON(!PageLocked(page));
> +
> +	/*
> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> +	 * and its shmem_writeback() needs them to be split when swapping.
> +	 */
> +	if (PageTransCompound(page)) {
> +		/* Ensure the subpages are still dirty */
> +		SetPageDirty(page);
> +		if (split_huge_page(page) < 0)
> +			goto redirty;
> +		ClearPageDirty(page);
> +	}
> +
>  	mapping = page->mapping;
>  	index = page->index;
>  	inode = mapping->host;

It turns out that I have an entirely different reason for wanting
->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
we currently try to split file-backed THPs.  This always fails for XFS
file-backed THPs because they have page_private set which increments
the refcount by 1.  And so we OOM when the page cache is full of XFS
THPs.  I've been running successfully for a few days with this patch:

@@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
                                /* Adding to swap updated mapping */
                                mapping = page_mapping(page);
                        }
-               } else if (unlikely(PageTransHuge(page))) {
-                       /* Split file THP */
-                       if (split_huge_page_to_list(page, page_list))
-                               goto keep_locked;
                }
 
                /*


Kirill points out that this will probably make shmem unhappy (it's
possible that said pages will get split anyway if they're mapped
because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
->writepage().

The patch above is probably not exactly the right solution for this
case, since pageout() calls writepage only once, not once for each
sub-page.  This is hard to write a cute patch for because the
pages get unlocked by split_huge_page().  I think I'm going to have
to learn about the swap path, unless someone can save me from that.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-02 18:37         ` Matthew Wilcox
@ 2020-10-09  8:14             ` Huang, Ying
  0 siblings, 0 replies; 29+ messages in thread
From: Huang, Ying @ 2020-10-09  8:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

Hi, Matthew,

Sorry for the late reply.  I just came back from a long holiday.

Matthew Wilcox <willy@infradead.org> writes:

> On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
>> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
>> > It behaves a lot better with this patch in than without it; but you're
>> > right, only the head will get written to swap, and the tails left in
>> > memory; with dirty cleared, so they may be left indefinitely (I've
>> > not yet looked to see when if ever PageDirty might get set later).
>> > 
>> > Hmm. It may just be a matter of restyling the i915 code with
>> > 
>> > 		if (!page_mapped(page)) {
>> > 			clear_page_dirty_for_io(page);
>> > 
>> > but I don't want to rush to that conclusion - there might turn
>> > out to be a good way of doing it at the shmem_writepage() end, but
>> > probably only hacks available.  I'll mull it over: it deserves some
>> > thought about what would suit, if a THP arrived here some other way.
>> 
>> I think the ultimate solution is to do as I have done for iomap and make
>> ->writepage handle arbitrary sized pages.  However, I don't know the
>> swap I/O path particularly well, and I would rather not learn it just yet.
>> 
>> How about this for a band-aid until we sort that out properly?  Just mark
>> the page as dirty before splitting it so subsequent iterations see the
>> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
>> SetPageDirty, but I don't think i915 cares.  In particular, it uses
>> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 271548ca20f3..6231207ab1eb 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>>  	swp_entry_t swap;
>>  	pgoff_t index;
>>  
>> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>>  	BUG_ON(!PageLocked(page));
>> +
>> +	/*
>> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
>> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
>> +	 * and its shmem_writeback() needs them to be split when swapping.
>> +	 */
>> +	if (PageTransCompound(page)) {
>> +		/* Ensure the subpages are still dirty */
>> +		SetPageDirty(page);
>> +		if (split_huge_page(page) < 0)
>> +			goto redirty;
>> +		ClearPageDirty(page);
>> +	}
>> +
>>  	mapping = page->mapping;
>>  	index = page->index;
>>  	inode = mapping->host;
>
> It turns out that I have an entirely different reason for wanting
> ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
> we currently try to split file-backed THPs.  This always fails for XFS
> file-backed THPs because they have page_private set which increments
> the refcount by 1.  And so we OOM when the page cache is full of XFS
> THPs.  I've been running successfully for a few days with this patch:
>
> @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                                 /* Adding to swap updated mapping */
>                                 mapping = page_mapping(page);
>                         }
> -               } else if (unlikely(PageTransHuge(page))) {
> -                       /* Split file THP */
> -                       if (split_huge_page_to_list(page, page_list))
> -                               goto keep_locked;
>                 }
>  
>                 /*
>
>
> Kirill points out that this will probably make shmem unhappy (it's
> possible that said pages will get split anyway if they're mapped
> because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> ->writepage().

We may distinguish the shmem THPs from the XFS file cache THPs via
PageSwapBacked()?
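
To make that concrete, something like this hypothetical helper (the name
and placement are only illustrative, not from any posted patch) could
gate the split in shrink_page_list():

	/*
	 * Only split THPs headed for swap (anon/shmem, i.e.
	 * PageSwapBacked()); leave file cache THPs (e.g. XFS) whole
	 * for ->writepage() to deal with.
	 */
	static bool reclaim_should_split_thp(struct page *page)
	{
		if (!PageTransHuge(page))
			return false;
		return PageSwapBacked(page);
	}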

Best Regards,
Huang, Ying

> The patch above is probably not exactly the right solution for this
> case, since pageout() calls writepage only once, not once for each
> sub-page.  This is hard to write a cute patch for because the
> pages get unlocked by split_huge_page().  I think I'm going to have
> to learn about the swap path, unless someone can save me from that.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-09  8:14             ` Huang, Ying
  (?)
@ 2020-10-10 15:32             ` Matthew Wilcox
  2020-10-12  2:01                 ` Huang, Ying
  -1 siblings, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2020-10-10 15:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
> >> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
> >> > It behaves a lot better with this patch in than without it; but you're
> >> > right, only the head will get written to swap, and the tails left in
> >> > memory; with dirty cleared, so they may be left indefinitely (I've
> >> > not yet looked to see when if ever PageDirty might get set later).
> >> > 
> >> > Hmm. It may just be a matter of restyling the i915 code with
> >> > 
> >> > 		if (!page_mapped(page)) {
> >> > 			clear_page_dirty_for_io(page);
> >> > 
> >> > but I don't want to rush to that conclusion - there might turn
> >> > out to be a good way of doing it at the shmem_writepage() end, but
> >> > probably only hacks available.  I'll mull it over: it deserves some
> >> > thought about what would suit, if a THP arrived here some other way.
> >> 
> >> I think the ultimate solution is to do as I have done for iomap and make
> >> ->writepage handle arbitrary sized pages.  However, I don't know the
> >> swap I/O path particularly well, and I would rather not learn it just yet.
> >> 
> >> How about this for a band-aid until we sort that out properly?  Just mark
> >> the page as dirty before splitting it so subsequent iterations see the
> >> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
> >> SetPageDirty, but I don't think i915 cares.  In particular, it uses
> >> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
> >> 
> >> diff --git a/mm/shmem.c b/mm/shmem.c
> >> index 271548ca20f3..6231207ab1eb 100644
> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >>  	swp_entry_t swap;
> >>  	pgoff_t index;
> >>  
> >> -	VM_BUG_ON_PAGE(PageCompound(page), page);
> >>  	BUG_ON(!PageLocked(page));
> >> +
> >> +	/*
> >> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
> >> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
> >> +	 * and its shmem_writeback() needs them to be split when swapping.
> >> +	 */
> >> +	if (PageTransCompound(page)) {
> >> +		/* Ensure the subpages are still dirty */
> >> +		SetPageDirty(page);
> >> +		if (split_huge_page(page) < 0)
> >> +			goto redirty;
> >> +		ClearPageDirty(page);
> >> +	}
> >> +
> >>  	mapping = page->mapping;
> >>  	index = page->index;
> >>  	inode = mapping->host;
> >
> > It turns out that I have an entirely different reason for wanting
> > ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
> > we currently try to split file-backed THPs.  This always fails for XFS
> > file-backed THPs because they have page_private set which increments
> > the refcount by 1.  And so we OOM when the page cache is full of XFS
> > THPs.  I've been running successfully for a few days with this patch:
> >
> > @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> >                                 /* Adding to swap updated mapping */
> >                                 mapping = page_mapping(page);
> >                         }
> > -               } else if (unlikely(PageTransHuge(page))) {
> > -                       /* Split file THP */
> > -                       if (split_huge_page_to_list(page, page_list))
> > -                               goto keep_locked;
> >                 }
> >  
> >                 /*
> >
> >
> > Kirill points out that this will probably make shmem unhappy (it's
> > possible that said pages will get split anyway if they're mapped
> > because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
> > they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
> > ->writepage().
> 
> We may distinguish the shmem THPs from the XFS file cache THPs via
> PageSwapBacked()?

Yes, we _can_, but we now have two reasons for wanting to be able to write
THPs to swap without splitting them.  Another thing this solves is that,
in my tree, we don't allocate the bottom layer of the XArray for THPs.
So when we split, we have to allocate memory to store the split pages,
and it seems like a bad idea to allocate memory in order to free memory,
particularly when we don't have to.
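
A rough sketch of what the split costs in that case (xas_split_alloc()
and xas_split() are from the page cache THP series and not in mainline
here, so take the names and signatures as assumptions):

	/*
	 * Splitting an order-N entry means populating the XArray's
	 * bottom level first - an allocation on the reclaim path,
	 * i.e. allocating memory in order to free memory.
	 */
	XA_STATE_ORDER(xas, &mapping->i_pages, page->index, thp_order(page));

	xas_split_alloc(&xas, page, thp_order(page), GFP_NOWAIT);
	if (xas_error(&xas))
		return -ENOMEM;		/* could not allocate new nodes */
	xas_lock_irq(&xas);
	xas_split(&xas, page, thp_order(page));
	xas_unlock_irq(&xas);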

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP
  2020-10-10 15:32             ` Matthew Wilcox
@ 2020-10-12  2:01                 ` Huang, Ying
  0 siblings, 0 replies; 29+ messages in thread
From: Huang, Ying @ 2020-10-12  2:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, alex.shi, cai, hannes, linux-mm,
	mhocko, mike.kravetz, mm-commits, shakeelb, shy828301, stable,
	torvalds

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Oct 09, 2020 at 04:14:25PM +0800, Huang, Ying wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > On Sat, Sep 19, 2020 at 05:18:47PM +0100, Matthew Wilcox wrote:
>> >> On Fri, Sep 18, 2020 at 10:44:32PM -0700, Hugh Dickins wrote:
>> >> > It behaves a lot better with this patch in than without it; but you're
>> >> > right, only the head will get written to swap, and the tails left in
>> >> > memory; with dirty cleared, so they may be left indefinitely (I've
>> >> > not yet looked to see when if ever PageDirty might get set later).
>> >> > 
>> >> > Hmm. It may just be a matter of restyling the i915 code with
>> >> > 
>> >> > 		if (!page_mapped(page)) {
>> >> > 			clear_page_dirty_for_io(page);
>> >> > 
>> >> > but I don't want to rush to that conclusion - there might turn
>> >> > out to be a good way of doing it at the shmem_writepage() end, but
>> >> > probably only hacks available.  I'll mull it over: it deserves some
>> >> > thought about what would suit, if a THP arrived here some other way.
>> >> 
>> >> I think the ultimate solution is to do as I have done for iomap and make
>> >> ->writepage handle arbitrary sized pages.  However, I don't know the
>> >> swap I/O path particularly well, and I would rather not learn it just yet.
>> >> 
>> >> How about this for a band-aid until we sort that out properly?  Just mark
>> >> the page as dirty before splitting it so subsequent iterations see the
>> >> subpages as dirty.  Arguably, we should use set_page_dirty() instead of
>> >> SetPageDirty, but I don't think i915 cares.  In particular, it uses
>> >> an untagged iteration instead of searching for PAGECACHE_TAG_DIRTY.
>> >> 
>> >> diff --git a/mm/shmem.c b/mm/shmem.c
>> >> index 271548ca20f3..6231207ab1eb 100644
>> >> --- a/mm/shmem.c
>> >> +++ b/mm/shmem.c
>> >> @@ -1362,8 +1362,21 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>> >>  	swp_entry_t swap;
>> >>  	pgoff_t index;
>> >>  
>> >> -	VM_BUG_ON_PAGE(PageCompound(page), page);
>> >>  	BUG_ON(!PageLocked(page));
>> >> +
>> >> +	/*
>> >> +	 * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force",
>> >> +	 * then drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages,
>> >> +	 * and its shmem_writeback() needs them to be split when swapping.
>> >> +	 */
>> >> +	if (PageTransCompound(page)) {
>> >> +		/* Ensure the subpages are still dirty */
>> >> +		SetPageDirty(page);
>> >> +		if (split_huge_page(page) < 0)
>> >> +			goto redirty;
>> >> +		ClearPageDirty(page);
>> >> +	}
>> >> +
>> >>  	mapping = page->mapping;
>> >>  	index = page->index;
>> >>  	inode = mapping->host;
>> >
>> > It turns out that I have an entirely different reason for wanting
>> > ->writepage to handle an unsplit page.  In vmscan.c:shrink_page_list(),
>> > we currently try to split file-backed THPs.  This always fails for XFS
>> > file-backed THPs because they have page_private set which increments
>> > the refcount by 1.  And so we OOM when the page cache is full of XFS
>> > THPs.  I've been running successfully for a few days with this patch:
>> >
>> > @@ -1271,10 +1271,6 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>> >                                 /* Adding to swap updated mapping */
>> >                                 mapping = page_mapping(page);
>> >                         }
>> > -               } else if (unlikely(PageTransHuge(page))) {
>> > -                       /* Split file THP */
>> > -                       if (split_huge_page_to_list(page, page_list))
>> > -                               goto keep_locked;
>> >                 }
>> >  
>> >                 /*
>> >
>> >
>> > Kirill points out that this will probably make shmem unhappy (it's
>> > possible that said pages will get split anyway if they're mapped
>> > because we pass TTU_SPLIT_HUGE_PMD into try_to_unmap()), but if
>> > they're (a) Dirty, (b) !mapped, we'll call pageout() which calls
>> > ->writepage().
>> 
>> We may distinguish the shmem THPs from the XFS file cache THPs via
>> PageSwapBacked()?
>
> Yes, we _can_, but we now have two reasons for wanting to be able to write
> THPs to swap without splitting them.  Another thing this solves is that,
> in my tree, we don't allocate the bottom layer of the XArray for THPs.
> So when we split, we have to allocate memory to store the split pages,
> and it seems like a bad idea to allocate memory in order to free memory,
> particularly when we don't have to.

I am afraid we cannot avoid allocating memory during swapping, because
an anonymous page (strictly, a PageSwapBacked() page) is put in the swap
cache (a special page cache) during page reclaim.  This means we need to
allocate XArray nodes for such pages.

And we cannot guarantee that a large contiguous free area is always
available in the swap device to accommodate the THP as a whole, so we
need to be prepared to split the THP anyway.
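
For reference, the fallback that already exists in shrink_page_list()
for the swap case looks roughly like this (reconstructed from memory,
so treat the details as approximate rather than a quote of the source):

	if (!add_to_swap(page)) {
		if (!PageTransHuge(page))
			goto activate_locked_split;
		/* No contiguous swap cluster: split and swap base pages */
		if (split_huge_page_to_list(page, page_list))
			goto activate_locked;
		/* Retry with the now order-0 head page */
		if (!add_to_swap(page))
			goto activate_locked_split;
	}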

Things may be different for the page cache.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-10-12  2:02 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-19  4:19 incoming Andrew Morton
2020-09-19  4:20 ` [patch 01/15] mailmap: add older email addresses for Kees Cook Andrew Morton
2020-09-19  4:20 ` [patch 02/15] ksm: reinstate memcg charge on copied pages Andrew Morton
2020-09-19  4:20 ` [patch 03/15] mm: migration of hugetlbfs page skip memcg Andrew Morton
2020-09-19  4:20 ` [patch 04/15] shmem: shmem_writepage() split unlikely i915 THP Andrew Morton
2020-09-19  4:44   ` Matthew Wilcox
2020-09-19  5:44     ` Hugh Dickins
2020-09-19  5:44       ` Hugh Dickins
2020-09-19 16:18       ` Matthew Wilcox
2020-09-20  0:16         ` Hugh Dickins
2020-09-20  0:16           ` Hugh Dickins
2020-09-20  3:32           ` Matthew Wilcox
2020-10-02 18:37         ` Matthew Wilcox
2020-10-09  8:14           ` Huang, Ying
2020-10-09  8:14             ` Huang, Ying
2020-10-10 15:32             ` Matthew Wilcox
2020-10-12  2:01               ` Huang, Ying
2020-10-12  2:01                 ` Huang, Ying
2020-09-19  4:20 ` [patch 05/15] mm: fix check_move_unevictable_pages() on THP Andrew Morton
2020-09-19  4:20 ` [patch 06/15] mlock: fix unevictable_pgs event counts " Andrew Morton
2020-09-19  4:20 ` [patch 07/15] tmpfs: restore functionality of nr_inodes=0 Andrew Morton
2020-09-19  4:20 ` [patch 08/15] kprobes: fix kill kprobe which has been marked as gone Andrew Morton
2020-09-19  4:20 ` [patch 09/15] mm/thp: fix __split_huge_pmd_locked() for migration PMD Andrew Morton
2020-09-19  4:20 ` [patch 10/15] selftests/vm: fix display of page size in map_hugetlb Andrew Morton
2020-09-19  4:20 ` [patch 11/15] mm/memory_hotplug: drain per-cpu pages again during memory offline Andrew Morton
2020-09-19  4:20 ` [patch 12/15] ftrace: let ftrace_enable_sysctl take a kernel pointer buffer Andrew Morton
2020-09-19  4:20 ` [patch 13/15] stackleak: let stack_erasing_sysctl " Andrew Morton
2020-09-19  4:20 ` [patch 14/15] fs/fs-writeback.c: adjust dirtytime_interval_handler definition to match prototype Andrew Morton
2020-09-19  4:20 ` [patch 15/15] kcsan: kconfig: move to menu 'Generic Kernel Debugging Instruments' Andrew Morton
