linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* incoming
@ 2021-07-29 21:52 Andrew Morton
  2021-07-29 21:53 ` [patch 1/7] lib/test_string.c: move string selftest in the Runtime Testing menu Andrew Morton
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

7 patches, based on 7e96bf476270aecea66740a083e51b38c1371cd2.

Subsystems affected by this patch series:

  lib
  ocfs2
  mm/memcg
  mm/migration
  mm/slub
  mm/memcg

Subsystem: lib

    Matteo Croce <mcroce@microsoft.com>:
      lib/test_string.c: move string selftest in the Runtime Testing menu

Subsystem: ocfs2

    Junxiao Bi <junxiao.bi@oracle.com>:
      ocfs2: fix zero out valid data
      ocfs2: issue zeroout to EOF blocks

Subsystem: mm/memcg

    Johannes Weiner <hannes@cmpxchg.org>:
      mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code

Subsystem: mm/migration

    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
      mm/migrate: fix NR_ISOLATED corruption on 64-bit

Subsystem: mm/slub

    Shakeel Butt <shakeelb@google.com>:
      slub: fix unreclaimable slab stat for bulk free

Subsystem: mm/memcg

    Wang Hai <wanghai38@huawei.com>:
      mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()

 fs/ocfs2/file.c   |  103 ++++++++++++++++++++++++++++++++----------------------
 lib/Kconfig       |    3 -
 lib/Kconfig.debug |    3 +
 mm/memcontrol.c   |    3 +
 mm/migrate.c      |    2 -
 mm/slab.h         |    2 -
 mm/slub.c         |   22 ++++++-----
 7 files changed, 81 insertions(+), 57 deletions(-)



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 1/7] lib/test_string.c: move string selftest in the Runtime Testing menu
  2021-07-29 21:52 incoming Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-29 21:53 ` [patch 2/7] ocfs2: fix zero out valid data Andrew Morton
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, geert, linux-mm, mcroce, mm-commits, peda, rdunlap, torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: lib/test_string.c: move string selftest in the Runtime Testing menu

STRING_SELFTEST is presented in the "Library routines" menu.  Move it in
Kernel hacking > Kernel Testing and Coverage > Runtime Testing together
with other similar tests found in lib/

	--- Runtime Testing
	<*>   Test functions located in the hexdump module at runtime
	<*>   Test string functions (NEW)
	<*>   Test functions located in the string_helpers module at runtime
	<*>   Test strscpy*() family of functions at runtime
	<*>   Test kstrto*() family of functions at runtime
	<*>   Test printf() family of functions at runtime
	<*>   Test scanf() family of functions at runtime

Link: https://lkml.kernel.org/r/20210719185158.190371-1-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Peter Rosin <peda@axentia.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig       |    3 ---
 lib/Kconfig.debug |    3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/lib/Kconfig~lib-test_stringc-move-string-selftest-in-the-runtime-testing-menu
+++ a/lib/Kconfig
@@ -683,9 +683,6 @@ config PARMAN
 config OBJAGG
 	tristate "objagg" if COMPILE_TEST
 
-config STRING_SELFTEST
-	tristate "Test string functions"
-
 endmenu
 
 config GENERIC_IOREMAP
--- a/lib/Kconfig.debug~lib-test_stringc-move-string-selftest-in-the-runtime-testing-menu
+++ a/lib/Kconfig.debug
@@ -2180,6 +2180,9 @@ config ASYNC_RAID6_TEST
 config TEST_HEXDUMP
 	tristate "Test functions located in the hexdump module at runtime"
 
+config STRING_SELFTEST
+	tristate "Test string functions at runtime"
+
 config TEST_STRING_HELPERS
 	tristate "Test functions located in the string_helpers module at runtime"
 
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 2/7] ocfs2: fix zero out valid data
  2021-07-29 21:52 incoming Andrew Morton
  2021-07-29 21:53 ` [patch 1/7] lib/test_string.c: move string selftest in the Runtime Testing menu Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-29 21:53 ` [patch 3/7] ocfs2: issue zeroout to EOF blocks Andrew Morton
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: fix zero out valid data

If append-dio feature is enabled, direct-io write and fallocate could run
in parallel to extend file size, fallocate used "orig_isize" to record
i_size before taking "ip_alloc_sem", when ocfs2_zeroout_partial_cluster()
zeroout EOF blocks, i_size maybe already extended by
ocfs2_dio_end_io_write(), that will cause valid data zeroed out.

Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
Fixes: 6bba4471f0cc ("ocfs2: fix data corruption by fallocate")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/file.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/ocfs2/file.c~ocfs2-fix-zero-out-valid-data
+++ a/fs/ocfs2/file.c
@@ -1935,7 +1935,6 @@ static int __ocfs2_change_file_space(str
 		goto out_inode_unlock;
 	}
 
-	orig_isize = i_size_read(inode);
 	switch (sr->l_whence) {
 	case 0: /*SEEK_SET*/
 		break;
@@ -1943,7 +1942,7 @@ static int __ocfs2_change_file_space(str
 		sr->l_start += f_pos;
 		break;
 	case 2: /*SEEK_END*/
-		sr->l_start += orig_isize;
+		sr->l_start += i_size_read(inode);
 		break;
 	default:
 		ret = -EINVAL;
@@ -1998,6 +1997,7 @@ static int __ocfs2_change_file_space(str
 		ret = -EINVAL;
 	}
 
+	orig_isize = i_size_read(inode);
 	/* zeroout eof blocks in the cluster. */
 	if (!ret && change_size && orig_isize < size) {
 		ret = ocfs2_zeroout_partial_cluster(inode, orig_isize,
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 3/7] ocfs2: issue zeroout to EOF blocks
  2021-07-29 21:52 incoming Andrew Morton
  2021-07-29 21:53 ` [patch 1/7] lib/test_string.c: move string selftest in the Runtime Testing menu Andrew Morton
  2021-07-29 21:53 ` [patch 2/7] ocfs2: fix zero out valid data Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-29 21:53 ` [patch 4/7] mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code Andrew Morton
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, stable, torvalds

From: Junxiao Bi <junxiao.bi@oracle.com>
Subject: ocfs2: issue zeroout to EOF blocks

For punch holes in EOF blocks, fallocate used buffer write to zero the EOF
blocks in last cluster.  But since ->writepage will ignore EOF pages,
those zeros will not be flushed.

This "looks" ok as commit 6bba4471f0cc ("ocfs2: fix data corruption by
fallocate") will zero the EOF blocks when extend the file size, but it
isn't.  The problem happened on those EOF pages, before writeback, those
pages had DIRTY flag set and all buffer_head in them also had DIRTY flag
set, when writeback run by write_cache_pages(), DIRTY flag on the page was
cleared, but DIRTY flag on the buffer_head not.

When next write happened to those EOF pages, since buffer_head already had
DIRTY flag set, it would not mark page DIRTY again.  That made writeback
ignore them forever.  That will cause data corruption.  Even directio
write can't work because it will fail when trying to drop pages caches
before direct io, as it found the buffer_head for those pages still had
DIRTY flag set, then it will fall back to buffer io mode.

To make a summary of the issue, as writeback ingores EOF pages, once any
EOF page is generated, any write to it will only go to the page cache, it
will never be flushed to disk even file size extends and that page is not
EOF page any more.  The fix is to avoid zero EOF blocks with buffer write.

The following code snippet from qemu-img could trigger the corruption.

656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
...
660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
660   fallocate(11, 0, 2275868672, 327680) = 0
658   pwrite64(11, "

Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/file.c |   99 +++++++++++++++++++++++++++-------------------
 1 file changed, 60 insertions(+), 39 deletions(-)

--- a/fs/ocfs2/file.c~ocfs2-issue-zeroout-to-eof-blocks
+++ a/fs/ocfs2/file.c
@@ -1529,6 +1529,45 @@ static void ocfs2_truncate_cluster_pages
 	}
 }
 
+/*
+ * zero out partial blocks of one cluster.
+ *
+ * start: file offset where zero starts, will be made upper block aligned.
+ * len: it will be trimmed to the end of current cluster if "start + len"
+ *      is bigger than it.
+ */
+static int ocfs2_zeroout_partial_cluster(struct inode *inode,
+					u64 start, u64 len)
+{
+	int ret;
+	u64 start_block, end_block, nr_blocks;
+	u64 p_block, offset;
+	u32 cluster, p_cluster, nr_clusters;
+	struct super_block *sb = inode->i_sb;
+	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
+
+	if (start + len < end)
+		end = start + len;
+
+	start_block = ocfs2_blocks_for_bytes(sb, start);
+	end_block = ocfs2_blocks_for_bytes(sb, end);
+	nr_blocks = end_block - start_block;
+	if (!nr_blocks)
+		return 0;
+
+	cluster = ocfs2_bytes_to_clusters(sb, start);
+	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
+				&nr_clusters, NULL);
+	if (ret)
+		return ret;
+	if (!p_cluster)
+		return 0;
+
+	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
+	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
+	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
+}
+
 static int ocfs2_zero_partial_clusters(struct inode *inode,
 				       u64 start, u64 len)
 {
@@ -1538,6 +1577,7 @@ static int ocfs2_zero_partial_clusters(s
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 	unsigned int csize = osb->s_clustersize;
 	handle_t *handle;
+	loff_t isize = i_size_read(inode);
 
 	/*
 	 * The "start" and "end" values are NOT necessarily part of
@@ -1558,6 +1598,26 @@ static int ocfs2_zero_partial_clusters(s
 	if ((start & (csize - 1)) == 0 && (end & (csize - 1)) == 0)
 		goto out;
 
+	/* No page cache for EOF blocks, issue zero out to disk. */
+	if (end > isize) {
+		/*
+		 * zeroout eof blocks in last cluster starting from
+		 * "isize" even "start" > "isize" because it is
+		 * complicated to zeroout just at "start" as "start"
+		 * may be not aligned with block size, buffer write
+		 * would be required to do that, but out of eof buffer
+		 * write is not supported.
+		 */
+		ret = ocfs2_zeroout_partial_cluster(inode, isize,
+					end - isize);
+		if (ret) {
+			mlog_errno(ret);
+			goto out;
+		}
+		if (start >= isize)
+			goto out;
+		end = isize;
+	}
 	handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1856,45 +1916,6 @@ out:
 }
 
 /*
- * zero out partial blocks of one cluster.
- *
- * start: file offset where zero starts, will be made upper block aligned.
- * len: it will be trimmed to the end of current cluster if "start + len"
- *      is bigger than it.
- */
-static int ocfs2_zeroout_partial_cluster(struct inode *inode,
-					u64 start, u64 len)
-{
-	int ret;
-	u64 start_block, end_block, nr_blocks;
-	u64 p_block, offset;
-	u32 cluster, p_cluster, nr_clusters;
-	struct super_block *sb = inode->i_sb;
-	u64 end = ocfs2_align_bytes_to_clusters(sb, start);
-
-	if (start + len < end)
-		end = start + len;
-
-	start_block = ocfs2_blocks_for_bytes(sb, start);
-	end_block = ocfs2_blocks_for_bytes(sb, end);
-	nr_blocks = end_block - start_block;
-	if (!nr_blocks)
-		return 0;
-
-	cluster = ocfs2_bytes_to_clusters(sb, start);
-	ret = ocfs2_get_clusters(inode, cluster, &p_cluster,
-				&nr_clusters, NULL);
-	if (ret)
-		return ret;
-	if (!p_cluster)
-		return 0;
-
-	offset = start_block - ocfs2_clusters_to_blocks(sb, cluster);
-	p_block = ocfs2_clusters_to_blocks(sb, p_cluster) + offset;
-	return sb_issue_zeroout(sb, p_block, nr_blocks, GFP_NOFS);
-}
-
-/*
  * Parts of this function taken from xfs_change_file_space()
  */
 static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 4/7] mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code
  2021-07-29 21:52 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-07-29 21:53 ` [patch 3/7] ocfs2: issue zeroout to EOF blocks Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-29 21:53 ` [patch 5/7] mm/migrate: fix NR_ISOLATED corruption on 64-bit Andrew Morton
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, chris, dan.carpenter, hannes, linux-mm, mhocko, mm-commits,
	riel, shakeelb, stable, torvalds

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code

Dan Carpenter reports:

    The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr
    29, 2021, leads to the following static checker warning:

	    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
	    warn: sleeping in atomic context

    mm/memcontrol.c
      3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
      3573  {
      3574          unsigned long val;
      3575
      3576          if (mem_cgroup_is_root(memcg)) {
      3577                  cgroup_rstat_flush(memcg->css.cgroup);
			    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    This is from static analysis and potentially a false positive.  The
    problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
    which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
    can sleep.

      3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
      3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
      3580                  if (swap)
      3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
      3582          } else {
      3583                  if (!swap)
      3584                          val = page_counter_read(&memcg->memory);
      3585                  else
      3586                          val = page_counter_read(&memcg->memsw);
      3587          }
      3588          return val;
      3589  }

__mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
thresholding code is invoked during stat changes, and those contexts have
irqs disabled as well.  If the lock breaking occurs inside the flush
function, it will result in a sleep from an atomic context.

Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.

Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcontrol-fix-blocking-rstat-function-called-from-atomic-cgroup1-thresholding-code
+++ a/mm/memcontrol.c
@@ -3574,7 +3574,8 @@ static unsigned long mem_cgroup_usage(st
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
-		cgroup_rstat_flush(memcg->css.cgroup);
+		/* mem_cgroup_threshold() calls here from irqsafe context */
+		cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 5/7] mm/migrate: fix NR_ISOLATED corruption on 64-bit
  2021-07-29 21:52 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-07-29 21:53 ` [patch 4/7] mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-29 21:53 ` [patch 6/7] slub: fix unreclaimable slab stat for bulk free Andrew Morton
  2021-07-29 21:53 ` [patch 7/7] mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook() Andrew Morton
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: aik, akpm, aneesh.kumar, david, linux-mm, mgorman, mm-commits,
	mpe, npiggin, shy828301, torvalds

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Subject: mm/migrate: fix NR_ISOLATED corruption on 64-bit

Similar to commit 2da9f6305f30 ("mm/vmscan: fix NR_ISOLATED_FILE
corruption on 64-bit") avoid using unsigned int for nr_pages.  With
unsigned int type the large unsigned int converts to a large positive
signed long.

Symptoms include CMA allocations hanging forever due to
alloc_contig_range->...->isolate_migratepages_block waiting forever in
"while (unlikely(too_many_isolated(pgdat)))".

Link: https://lkml.kernel.org/r/20210728042531.359409-1-aneesh.kumar@linux.ibm.com
Fixes: c5fc5c3ae0c8 ("mm: migrate: account THP NUMA migration counters correctly")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/migrate.c~mm-migrate-fix-nr_isolated-corruption-on-64-bit
+++ a/mm/migrate.c
@@ -2068,7 +2068,7 @@ int migrate_misplaced_page(struct page *
 	LIST_HEAD(migratepages);
 	new_page_t *new;
 	bool compound;
-	unsigned int nr_pages = thp_nr_pages(page);
+	int nr_pages = thp_nr_pages(page);
 
 	/*
 	 * PTE mapped THP or HugeTLB page can't reach here so the page could
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 6/7] slub: fix unreclaimable slab stat for bulk free
  2021-07-29 21:52 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-07-29 21:53 ` [patch 5/7] mm/migrate: fix NR_ISOLATED corruption on 64-bit Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  2021-07-31 22:18   ` Nathan Chancellor
  2021-07-29 21:53 ` [patch 7/7] mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook() Andrew Morton
  6 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, cl, guro, iamjoonsoo.kim, linux-mm, mhocko, mm-commits,
	penberg, rientjes, shakeelb, songmuchun, torvalds, vbabka

From: Shakeel Butt <shakeelb@google.com>
Subject: slub: fix unreclaimable slab stat for bulk free

SLUB uses page allocator for higher order allocations and update
unreclaimable slab stat for such allocations.  At the moment, the bulk
free for SLUB does not share code with normal free code path for these
type of allocations and have missed the stat update.  So, fix the stat
update by common code.  The user visible impact of the bug is the
potential of inconsistent unreclaimable slab stat visible through meminfo
and vmstat.

Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

--- a/mm/slub.c~slub-fix-unreclaimable-slab-stat-for-bulk-free
+++ a/mm/slub.c
@@ -3236,6 +3236,16 @@ struct detached_freelist {
 	struct kmem_cache *s;
 };
 
+static inline void free_nonslab_page(struct page *page)
+{
+	unsigned int order = compound_order(page);
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	kfree_hook(page_address(page));
+	mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, -(PAGE_SIZE << order));
+	__free_pages(page, order);
+}
+
 /*
  * This function progressively scans the array with free objects (with
  * a limited look ahead) and extract objects belonging to the same
@@ -3272,9 +3282,7 @@ int build_detached_freelist(struct kmem_
 	if (!s) {
 		/* Handle kalloc'ed objects */
 		if (unlikely(!PageSlab(page))) {
-			BUG_ON(!PageCompound(page));
-			kfree_hook(object);
-			__free_pages(page, compound_order(page));
+			free_nonslab_page(page);
 			p[size] = NULL; /* mark object processed */
 			return size;
 		}
@@ -4250,13 +4258,7 @@ void kfree(const void *x)
 
 	page = virt_to_head_page(x);
 	if (unlikely(!PageSlab(page))) {
-		unsigned int order = compound_order(page);
-
-		BUG_ON(!PageCompound(page));
-		kfree_hook(object);
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
-				      -(PAGE_SIZE << order));
-		__free_pages(page, order);
+		free_nonslab_page(page);
 		return;
 	}
 	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [patch 7/7] mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()
  2021-07-29 21:52 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-07-29 21:53 ` [patch 6/7] slub: fix unreclaimable slab stat for bulk free Andrew Morton
@ 2021-07-29 21:53 ` Andrew Morton
  6 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2021-07-29 21:53 UTC (permalink / raw)
  To: akpm, ast, cl, guro, hannes, iamjoonsoo.kim, linux-mm, mhocko,
	mm-commits, penberg, rientjes, shakeelb, songmuchun, stable,
	torvalds, vbabka, wanghai38, wangkefeng.wang

From: Wang Hai <wanghai38@huawei.com>
Subject: mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook()

When I use kfree_rcu() to free a large memory allocated by kmalloc_node(),
the following dump occurs.

BUG: kernel NULL pointer dereference, address: 0000000000000020
[...]
Oops: 0000 [#1] SMP
[...]
Workqueue: events kfree_rcu_work
RIP: 0010:__obj_to_index include/linux/slub_def.h:182 [inline]
RIP: 0010:obj_to_index include/linux/slub_def.h:191 [inline]
RIP: 0010:memcg_slab_free_hook+0x120/0x260 mm/slab.h:363
[...]
Call Trace:
 kmem_cache_free_bulk+0x58/0x630 mm/slub.c:3293
 kfree_bulk include/linux/slab.h:413 [inline]
 kfree_rcu_work+0x1ab/0x200 kernel/rcu/tree.c:3300
 process_one_work+0x207/0x530 kernel/workqueue.c:2276
 worker_thread+0x320/0x610 kernel/workqueue.c:2422
 kthread+0x13d/0x160 kernel/kthread.c:313
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

When kmalloc_node() a large memory, page is allocated, not slab, so when
freeing memory via kfree_rcu(), this large memory should not be used by
memcg_slab_free_hook(), because memcg_slab_free_hook() is is used for
slab.

Using page_objcgs_check() instead of page_objcgs() in
memcg_slab_free_hook() to fix this bug.

Link: https://lkml.kernel.org/r/20210728145655.274476-1-wanghai38@huawei.com
Fixes: 270c6a71460e ("mm: memcontrol/slab: Use helpers to access slab page's memcg_data")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slab.h~mm-memcg-fix-null-pointer-dereference-in-memcg_slab_free_hook
+++ a/mm/slab.h
@@ -346,7 +346,7 @@ static inline void memcg_slab_free_hook(
 			continue;
 
 		page = virt_to_head_page(p[i]);
-		objcgs = page_objcgs(page);
+		objcgs = page_objcgs_check(page);
 		if (!objcgs)
 			continue;
 
_


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch 6/7] slub: fix unreclaimable slab stat for bulk free
  2021-07-29 21:53 ` [patch 6/7] slub: fix unreclaimable slab stat for bulk free Andrew Morton
@ 2021-07-31 22:18   ` Nathan Chancellor
  2021-08-01  5:32     ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Nathan Chancellor @ 2021-07-31 22:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, cl, guro, iamjoonsoo.kim, linux-mm, mhocko, mm-commits,
	penberg, rientjes, shakeelb, songmuchun, torvalds, vbabka

On Thu, Jul 29, 2021 at 02:53:50PM -0700, Andrew Morton wrote:
> From: Shakeel Butt <shakeelb@google.com>
> Subject: slub: fix unreclaimable slab stat for bulk free
> 
> SLUB uses page allocator for higher order allocations and update
> unreclaimable slab stat for such allocations.  At the moment, the bulk
> free for SLUB does not share code with normal free code path for these
> type of allocations and have missed the stat update.  So, fix the stat
> update by common code.  The user visible impact of the bug is the
> potential of inconsistent unreclaimable slab stat visible through meminfo
> and vmstat.
> 
> Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
> Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/slub.c |   22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> --- a/mm/slub.c~slub-fix-unreclaimable-slab-stat-for-bulk-free
> +++ a/mm/slub.c
> @@ -3236,6 +3236,16 @@ struct detached_freelist {
>  	struct kmem_cache *s;
>  };
>  
> +static inline void free_nonslab_page(struct page *page)
> +{
> +	unsigned int order = compound_order(page);
> +
> +	VM_BUG_ON_PAGE(!PageCompound(page), page);
> +	kfree_hook(page_address(page));
> +	mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, -(PAGE_SIZE << order));
> +	__free_pages(page, order);
> +}
> +
>  /*
>   * This function progressively scans the array with free objects (with
>   * a limited look ahead) and extract objects belonging to the same
> @@ -3272,9 +3282,7 @@ int build_detached_freelist(struct kmem_
>  	if (!s) {
>  		/* Handle kalloc'ed objects */
>  		if (unlikely(!PageSlab(page))) {
> -			BUG_ON(!PageCompound(page));
> -			kfree_hook(object);
> -			__free_pages(page, compound_order(page));
> +			free_nonslab_page(page);
>  			p[size] = NULL; /* mark object processed */
>  			return size;
>  		}
> @@ -4250,13 +4258,7 @@ void kfree(const void *x)
>  
>  	page = virt_to_head_page(x);
>  	if (unlikely(!PageSlab(page))) {
> -		unsigned int order = compound_order(page);
> -
> -		BUG_ON(!PageCompound(page));
> -		kfree_hook(object);
> -		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
> -				      -(PAGE_SIZE << order));
> -		__free_pages(page, order);
> +		free_nonslab_page(page);
>  		return;
>  	}
>  	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
> _

This patch, now in mainline as commit f227f0faf63b ("slub: fix
unreclaimable slab stat for bulk free") causes the KASAN KUnit test
kmalloc_pagealloc_invalid_free to no longer fail:

[    0.000000] Linux version 5.14.0-rc3-00066-gf227f0faf63b (nathan@archlinux-ax161) (x86_64-linux-gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Sat Jul 31 15:08:11 MST 2021
...
[    5.717678]     # kmalloc_pagealloc_invalid_free: EXPECTATION FAILED at lib/test_kasan.c:203
[    5.717678]     KASAN failure expected in "kfree(ptr + 1)", but none occurred
[    5.718909]     not ok 6 - kmalloc_pagealloc_invalid_free
...
[    9.481520] not ok 1 - kasan

The previous commit is fine:

[    0.000000] Linux version 5.14.0-rc3-00065-gb5916c025432 (nathan@archlinux-ax161) (x86_64-linux-gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Sat Jul 31 15:05:09 MST 2021
...
[    9.347598] ok 1 - kasan

I am by no means a KASAN or mm/ expert, I noticed this when trying to
test KASAN with clang for ClangBuiltLinux's CI, so it does not appear to
be compiler dependent. It is reproducible for me in QEMU with
x86_64_defconfig + CONFIG_KASAN=y + CONFIG_KUNIT=y +
CONFIG_KASAN_KUNIT_TEST=y.

Please let me know if there is any other information I can provide or
testing I can do.

Cheers,
Nathan


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [patch 6/7] slub: fix unreclaimable slab stat for bulk free
  2021-07-31 22:18   ` Nathan Chancellor
@ 2021-08-01  5:32     ` Shakeel Butt
  0 siblings, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2021-08-01  5:32 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: LKML, Andrew Morton, Christoph Lameter, Roman Gushchin,
	Joonsoo Kim, Linux MM, Michal Hocko, mm-commits, Pekka Enberg,
	David Rientjes, Muchun Song, Linus Torvalds, Vlastimil Babka

Hi Nathan,

On Sat, Jul 31, 2021 at 3:18 PM Nathan Chancellor <nathan@kernel.org> wrote:
>
> On Thu, Jul 29, 2021 at 02:53:50PM -0700, Andrew Morton wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> > Subject: slub: fix unreclaimable slab stat for bulk free
> >
> > SLUB uses page allocator for higher order allocations and update
> > unreclaimable slab stat for such allocations.  At the moment, the bulk
> > free for SLUB does not share code with normal free code path for these
> > type of allocations and have missed the stat update.  So, fix the stat
> > update by common code.  The user visible impact of the bug is the
> > potential of inconsistent unreclaimable slab stat visible through meminfo
> > and vmstat.
> >
> > Link: https://lkml.kernel.org/r/20210728155354.3440560-1-shakeelb@google.com
> > Fixes: 6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting")
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Acked-by: Roman Gushchin <guro@fb.com>
> > Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >
> >  mm/slub.c |   22 ++++++++++++----------
> >  1 file changed, 12 insertions(+), 10 deletions(-)
> >
> > --- a/mm/slub.c~slub-fix-unreclaimable-slab-stat-for-bulk-free
> > +++ a/mm/slub.c
> > @@ -3236,6 +3236,16 @@ struct detached_freelist {
> >       struct kmem_cache *s;
> >  };
> >
> > +static inline void free_nonslab_page(struct page *page)
> > +{
> > +     unsigned int order = compound_order(page);
> > +
> > +     VM_BUG_ON_PAGE(!PageCompound(page), page);
> > +     kfree_hook(page_address(page));
> > +     mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, -(PAGE_SIZE << order));
> > +     __free_pages(page, order);
> > +}
> > +
> >  /*
> >   * This function progressively scans the array with free objects (with
> >   * a limited look ahead) and extract objects belonging to the same
> > @@ -3272,9 +3282,7 @@ int build_detached_freelist(struct kmem_
> >       if (!s) {
> >               /* Handle kalloc'ed objects */
> >               if (unlikely(!PageSlab(page))) {
> > -                     BUG_ON(!PageCompound(page));
> > -                     kfree_hook(object);
> > -                     __free_pages(page, compound_order(page));
> > +                     free_nonslab_page(page);
> >                       p[size] = NULL; /* mark object processed */
> >                       return size;
> >               }
> > @@ -4250,13 +4258,7 @@ void kfree(const void *x)
> >
> >       page = virt_to_head_page(x);
> >       if (unlikely(!PageSlab(page))) {
> > -             unsigned int order = compound_order(page);
> > -
> > -             BUG_ON(!PageCompound(page));
> > -             kfree_hook(object);
> > -             mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
> > -                                   -(PAGE_SIZE << order));
> > -             __free_pages(page, order);
> > +             free_nonslab_page(page);
> >               return;
> >       }
> >       slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
> > _
>
> This patch, now in mainline as commit f227f0faf63b ("slub: fix
> unreclaimable slab stat for bulk free") causes the KASAN KUnit test
> kmalloc_pagealloc_invalid_free to no longer fail:
>
> [    0.000000] Linux version 5.14.0-rc3-00066-gf227f0faf63b (nathan@archlinux-ax161) (x86_64-linux-gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Sat Jul 31 15:08:11 MST 2021
> ...
> [    5.717678]     # kmalloc_pagealloc_invalid_free: EXPECTATION FAILED at lib/test_kasan.c:203
> [    5.717678]     KASAN failure expected in "kfree(ptr + 1)", but none occurred
> [    5.718909]     not ok 6 - kmalloc_pagealloc_invalid_free
> ...
> [    9.481520] not ok 1 - kasan
>
> The previous commit is fine:
>
> [    0.000000] Linux version 5.14.0-rc3-00065-gb5916c025432 (nathan@archlinux-ax161) (x86_64-linux-gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Sat Jul 31 15:05:09 MST 2021
> ...
> [    9.347598] ok 1 - kasan
>
> I am by no means a KASAN or mm/ expert, I noticed this when trying to
> test KASAN with clang for ClangBuiltLinux's CI, so it does not appear to
> be compiler dependent. It is reproducible for me in QEMU with
> x86_64_defconfig + CONFIG_KASAN=y + CONFIG_KUNIT=y +
> CONFIG_KASAN_KUNIT_TEST=y.
>
> Please let me know if there is any other information I can provide or
> testing I can do.
>

Thanks for the report. This is actually due to changing
kfree_hook(object) to kfree_hook(page_address(page)). The test forces
slub to go to the page allocator and then freeing with the next byte
address instead of the returned address. Since both are addresses on
the same page, the code is fine but the kasan test is not happy.

The test is making sure that programmers use the address returned by
kmalloc in the kfree. I don't think this is urgent but I will send the
patch to fix this during the week.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-08-01  5:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-29 21:52 incoming Andrew Morton
2021-07-29 21:53 ` [patch 1/7] lib/test_string.c: move string selftest in the Runtime Testing menu Andrew Morton
2021-07-29 21:53 ` [patch 2/7] ocfs2: fix zero out valid data Andrew Morton
2021-07-29 21:53 ` [patch 3/7] ocfs2: issue zeroout to EOF blocks Andrew Morton
2021-07-29 21:53 ` [patch 4/7] mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code Andrew Morton
2021-07-29 21:53 ` [patch 5/7] mm/migrate: fix NR_ISOLATED corruption on 64-bit Andrew Morton
2021-07-29 21:53 ` [patch 6/7] slub: fix unreclaimable slab stat for bulk free Andrew Morton
2021-07-31 22:18   ` Nathan Chancellor
2021-08-01  5:32     ` Shakeel Butt
2021-07-29 21:53 ` [patch 7/7] mm/memcg: fix NULL pointer dereference in memcg_slab_free_hook() Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).