From: Jianfeng Wang <jianfeng.w.wang@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, cl@linux.com, vbabka@suse.cz,
	penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
	junxiao.bi@oracle.com
Subject: [PATCH v2 1/1] slub: limit number of slabs to scan in count_partial()
Date: Wed, 17 Apr 2024 11:59:38 -0700
Message-ID: <20240417185938.5237-2-jianfeng.w.wang@oracle.com>
In-Reply-To: <20240417185938.5237-1-jianfeng.w.wang@oracle.com>

When reading "/proc/slabinfo", the kernel needs to report the number
of free objects for each kmem_cache. The current implementation uses
count_partial() to obtain this number: it scans each kmem_cache_node's
list of partial slabs and sums the free objects of every slab on the
list. The scan runs with the per-node list_lock held and IRQs
disabled, and can take a long time when the partial list is long.
Consequently, it can block slab allocations on other CPU cores and
cause timeouts for network devices and the like. In production, even
the NMI watchdog has been triggered this way: e.g., for "buffer_head",
one kmem_cache_node was observed to hold ~1M partial slabs. The
problem has also been reported by several others [1-3].

Iterating a long partial list to get the exact count can cause soft
lockups whether or not the lock is held (e.g., if preemption is
disabled), and is of limited use anyway: the object count can change
right after the lock is released. The alternative of maintaining
free-object counters would require atomic operations on the fast
path [3].

So, the fix is to limit the number of slabs scanned in count_partial().
Suppose the limit is N. If the list holds no more than N slabs, report
the exact count by traversing the whole list; otherwise, report an
approximate count derived from a subset of the list. The proposed
method scans N/2 slabs from the list's head and N/2 slabs from its
tail, then scales the partial sum by the ratio of the list length to
the number of slabs scanned. For a partial list with ~280K slabs,
benchmarks show that this performs better than counting from the head
alone once the slabs have been sorted by kmem_cache_shrink(), which
moves the slabs with the fewest free objects toward the list's head
and thereby biases a head-only scan low. The limit defaults to 10000,
which keeps the head+tail approximation within 1% of the exact count
in both scenarios.
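
For illustration, here is a minimal userspace sketch of the estimator
(estimate_free(), the free_objs array, and the synthetic distribution
below are stand-ins invented for this example; the kernel walks the
partial slab_list with get_count() and uses mult_frac() for the
scaling step):

#include <stdio.h>

static unsigned long estimate_free(const unsigned int *free_objs,
				   unsigned long nr_partial,
				   unsigned long limit)
{
	unsigned long x = 0, scanned = 0, i;

	if (nr_partial <= limit) {
		/* Short list: walk it all and return the exact count. */
		for (i = 0; i < nr_partial; i++)
			x += free_objs[i];
		return x;
	}

	/* Sample limit/2 slabs from the head... */
	for (i = 0; scanned < limit / 2; i++, scanned++)
		x += free_objs[i];
	/* ...and the remaining limit/2 from the tail... */
	for (i = nr_partial - 1; scanned < limit; i--, scanned++)
		x += free_objs[i];
	/* ...then scale linearly (the kernel uses mult_frac() here). */
	return x * nr_partial / scanned;
}

int main(void)
{
	unsigned int free_objs[1000];
	unsigned long i;

	/* Skewed list: fullest slabs (0 free) first, freest (9) last. */
	for (i = 0; i < 1000; i++)
		free_objs[i] = i / 100;

	printf("exact=%lu approx=%lu\n",
	       estimate_free(free_objs, 1000, 1000),	/* 4500 */
	       estimate_free(free_objs, 1000, 100));	/* also 4500 */
	return 0;
}

On this skewed input a head-only sample of 100 slabs would report 0,
while the symmetric head+tail sample recovers the exact total.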

Benchmarks: Diff = (exact - approximated) / exact
* Normal case (w/o kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  0.43  %              |  1.09  %              |
| 5000        |  0.06  %              |  0.37  %              |
| 10000       |  0.02  %              |  0.16  %              |
| 20000       |  0.009 %              | -0.003 %              |

* Skewed case (w/ kmem_cache_shrink()):
| MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)|
| 1000        |  12.46 %              |  6.75  %              |
| 5000        |  5.38  %              |  1.27  %              |
| 10000       |  4.99  %              |  0.22  %              |
| 20000       |  4.86  %              | -0.06  %              |

[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/
[2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/
[3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/

Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com>
---
 mm/slub.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)
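
A note on the scaling step below: mult_frac(x, n->nr_partial, scanned)
evaluates x * nr_partial / scanned without letting the intermediate
product overflow, and the following min() clamps the estimate so it
never exceeds the node's total object count. A userspace sketch of the
same idea, modeled on the kernel's mult_frac() macro in
include/linux/math.h:

static unsigned long mult_frac_ul(unsigned long x, unsigned long n,
				  unsigned long d)
{
	unsigned long q = x / d;	/* whole multiples of d in x */
	unsigned long r = x % d;	/* remainder, always < d */

	/* x = q * d + r, so x * n / d == q * n + r * n / d exactly. */
	return q * n + r * n / d;
}

This stays overflow-free as long as n * d fits in unsigned long,
whereas the naive x * n / d can overflow for very long partial lists.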

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7b6..7e34f2f0ba85 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3213,6 +3213,8 @@ static inline bool free_debug_processing(struct kmem_cache *s,
 #endif /* CONFIG_SLUB_DEBUG */
 
 #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
+#define MAX_PARTIAL_TO_SCAN 10000
+
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct slab *))
 {
@@ -3221,8 +3223,30 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 	struct slab *slab;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(slab, &n->partial, slab_list)
-		x += get_count(slab);
+	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
+		list_for_each_entry(slab, &n->partial, slab_list)
+			x += get_count(slab);
+	} else {
+		/*
+		 * For a long list, approximate the total count of objects in
+		 * it to meet the limit on the number of slabs to scan.
+		 * Scan from both the list's head and tail for better accuracy.
+		 */
+		unsigned long scanned = 0;
+
+		list_for_each_entry(slab, &n->partial, slab_list) {
+			x += get_count(slab);
+			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
+				break;
+		}
+		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
+			x += get_count(slab);
+			if (++scanned == MAX_PARTIAL_TO_SCAN)
+				break;
+		}
+		x = mult_frac(x, n->nr_partial, scanned);
+		x = min(x, node_nr_objs(n));
+	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
 }
-- 
2.42.1

