linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ
@ 2023-01-30  0:57 Shanker Donthineni
  2023-01-30  0:57 ` [PATCH 1/5] genirq: Use hlist for managing resend handlers Shanker Donthineni
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

The ARM64 architecture uses SPARSEIRQ with a default NR_IRQS value of
64, so only 64+8192 IRQ descriptors are allowed. This may not be
sufficient for modern ARM64 servers that have a large number of IO
devices and GIC hardware that supports direct vSGI and vLPI injection
features.

This limitation has caused issues when attempting to launch multiple
virtual machines with GICv4.1 features, resulting in the error message
'kvm_err("VPE IRQ allocation failure\n")'. The root cause of this issue
is the ~8K IRQ descriptor limit.

To address this issue, an initial proposal was made to define NR_IRQS
as 2^19 for ARM64. However, Marc Zyngier suggested implementing a
generic solution instead of hard-coded values, and Thomas Gleixner
advised using the maple tree data structure and provided most of the
necessary functions.

For more information, refer to the discussion thread at:
https://lore.kernel.org/linux-arm-kernel/20230104023738.1258925-1-sdonthineni@nvidia.com/

This patch series converts the static memory allocation to dynamic using
the maple tree, and increases the maximum number of IRQ descriptors to
INT_MAX from NR_IRQS+8192. This change has been tested on an ARM64 server
with CONFIG_SPARSE_IRQ=y, where 256 virtual machines were launched,
creating a total of 128K+ IRQ descriptors, and IRQ injection was verified.
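
For reference, the basic maple tree store/lookup pattern that the
conversion relies on looks roughly like this (illustrative sketch only,
not part of the series; 'demo_irqs' and the helper names are made up):

#include <linux/maple_tree.h>

static DEFINE_MTREE(demo_irqs);

/* Store a descriptor pointer at index 'irq'; the tree grows on demand. */
static int demo_insert(unsigned int irq, struct irq_desc *desc)
{
	return mtree_store(&demo_irqs, irq, desc, GFP_KERNEL);
}

/* Look up the descriptor for 'irq', analogous to irq_to_desc(). */
static struct irq_desc *demo_lookup(unsigned int irq)
{
	return mtree_load(&demo_irqs, irq);
}

Since the tree only allocates nodes for populated ranges, raising the
upper bound to INT_MAX does not cost extra memory on systems with few
interrupts.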

Tested with v6.2-rc5 along with the maple tree RCU-mode bug fixes
available at:
 https://lore.kernel.org/all/20230109205336.3665937-1-surenb@google.com/

 [PATCH 1/41]  maple_tree: Be more cautious about dead nodes
 [PATCH 2/41]  maple_tree: Detect dead nodes in mas_start()
 [PATCH 3/41]  maple_tree: Fix freeing of nodes in rcu mode
 [PATCH 4/41]  maple_tree: remove extra smp_wmb() from mas_dead_leaves()
 [PATCH 5/41]  maple_tree: Fix write memory barrier of nodes once dead for RCU mode
 [PATCH 6/41]  maple_tree: Add smp_rmb() to dead node detection

Shanker Donthineni (5):
  genirq: Use hlist for managing resend handlers
  genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ
  genirq: Introduce two helper functions
  genirq: Use the common function irq_expand_nr_irqs()
  genirq: Use the maple tree for IRQ descriptors management

 include/linux/irqdesc.h |   3 ++
 kernel/irq/chip.c       |   1 +
 kernel/irq/internals.h  |   5 +-
 kernel/irq/irqdesc.c    | 117 ++++++++++++++++++++++------------------
 kernel/irq/resend.c     |  36 +++++++------
 5 files changed, 94 insertions(+), 68 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/5] genirq: Use hlist for managing resend handlers
  2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
@ 2023-01-30  0:57 ` Shanker Donthineni
  2023-01-31  8:59   ` Thomas Gleixner
  2023-01-30  0:57 ` [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ Shanker Donthineni
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

The current implementation uses a bitmap for managing IRQ resend
handlers, which is sized at build time based on the SPARSE_IRQ/NR_IRQS
macros. This wastes memory at runtime, particularly when
IRQ_BITMAP_BITS is large.

Address this by using an hlist to manage IRQ resend handlers instead of
relying on static memory allocation. Additionally, introduce a new
function, clear_irq_resend(), and call it from irq_shutdown() to ensure
a graceful teardown of the interrupt descriptor.
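
For context, the hlist pattern used below works roughly like this
(illustrative sketch only; 'pending' and 'struct item' are made-up
names):

#include <linux/list.h>

static HLIST_HEAD(pending);

struct item {
	struct hlist_node node;
};

static void queue_item(struct item *it)
{
	/* O(1) insertion, no build-time sizing needed. */
	hlist_add_head(&it->node, &pending);
}

static void process_all(void)
{
	struct item *it;
	struct hlist_node *n;

	/* The _safe variant permits deleting the current entry. */
	hlist_for_each_entry_safe(it, n, &pending, node) {
		hlist_del_init(&it->node);
		/* handle 'it' here */
	}
}

hlist_del_init() re-initializes the node after removal, so it is safe
to call it again on a node that is no longer queued.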

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 include/linux/irqdesc.h |  3 +++
 kernel/irq/chip.c       |  1 +
 kernel/irq/internals.h  |  1 +
 kernel/irq/irqdesc.c    |  6 ++++++
 kernel/irq/resend.c     | 36 +++++++++++++++++++++---------------
 5 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 844a8e30e6de..d9451d456a73 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -102,6 +102,9 @@ struct irq_desc {
 	int			parent_irq;
 	struct module		*owner;
 	const char		*name;
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	struct hlist_node	resend_node;
+#endif
 } ____cacheline_internodealigned_in_smp;
 
 #ifdef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..2eac5532c3c8 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -306,6 +306,7 @@ static void __irq_disable(struct irq_desc *desc, bool mask);
 void irq_shutdown(struct irq_desc *desc)
 {
 	if (irqd_is_started(&desc->irq_data)) {
+		clear_irq_resend(desc);
 		desc->depth = 1;
 		if (desc->irq_data.chip->irq_shutdown) {
 			desc->irq_data.chip->irq_shutdown(&desc->irq_data);
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 5fdc0b557579..2fd17057ed4b 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -113,6 +113,7 @@ irqreturn_t handle_irq_event(struct irq_desc *desc);
 
 /* Resending of interrupts :*/
 int check_irq_resend(struct irq_desc *desc, bool inject);
+void clear_irq_resend(struct irq_desc *desc);
 bool irq_wait_for_poll(struct irq_desc *desc);
 void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action);
 
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index fd0996274401..21a968bc468b 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -415,6 +415,9 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
 	desc_set_defaults(irq, desc, node, affinity, owner);
 	irqd_set(&desc->irq_data, flags);
 	kobject_init(&desc->kobj, &irq_kobj_type);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	INIT_HLIST_NODE(&desc->resend_node);
+#endif
 
 	return desc;
 
@@ -581,6 +584,9 @@ int __init early_irq_init(void)
 		mutex_init(&desc[i].request_mutex);
 		init_waitqueue_head(&desc[i].wait_for_threads);
 		desc_set_defaults(i, &desc[i], node, NULL, NULL);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+		INIT_HLIST_NODE(&desc[i].resend_node);
+#endif
 	}
 	return arch_early_irq_init();
 }
diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
index 0c46e9fe3a89..f2b23871cbef 100644
--- a/kernel/irq/resend.c
+++ b/kernel/irq/resend.c
@@ -21,8 +21,9 @@
 
 #ifdef CONFIG_HARDIRQS_SW_RESEND
 
-/* Bitmap to handle software resend of interrupts: */
-static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
+/* hlist_head to handle software resend of interrupts: */
+static HLIST_HEAD(irq_resend_list);
+static DEFINE_RAW_SPINLOCK(irq_resend_lock);
 
 /*
  * Run software resends of IRQ's
@@ -30,18 +31,16 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
 static void resend_irqs(struct tasklet_struct *unused)
 {
 	struct irq_desc *desc;
-	int irq;
-
-	while (!bitmap_empty(irqs_resend, nr_irqs)) {
-		irq = find_first_bit(irqs_resend, nr_irqs);
-		clear_bit(irq, irqs_resend);
-		desc = irq_to_desc(irq);
-		if (!desc)
-			continue;
+	struct hlist_node *n;
+
+	raw_spin_lock_irq(&irq_resend_lock);
+	hlist_for_each_entry_safe(desc, n, &irq_resend_list, resend_node) {
+		hlist_del_init(&desc->resend_node);
 		local_irq_disable();
 		desc->handle_irq(desc);
 		local_irq_enable();
 	}
+	raw_spin_unlock_irq(&irq_resend_lock);
 }
 
 /* Tasklet to handle resend: */
@@ -49,8 +48,6 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs);
 
 static int irq_sw_resend(struct irq_desc *desc)
 {
-	unsigned int irq = irq_desc_get_irq(desc);
-
 	/*
 	 * Validate whether this interrupt can be safely injected from
 	 * non interrupt context
@@ -70,16 +67,25 @@ static int irq_sw_resend(struct irq_desc *desc)
 		 */
 		if (!desc->parent_irq)
 			return -EINVAL;
-		irq = desc->parent_irq;
 	}
 
-	/* Set it pending and activate the softirq: */
-	set_bit(irq, irqs_resend);
+	/* Add to resend_list and activate the softirq: */
+	raw_spin_lock(&irq_resend_lock);
+	hlist_add_head(&desc->resend_node, &irq_resend_list);
+	raw_spin_unlock(&irq_resend_lock);
 	tasklet_schedule(&resend_tasklet);
 	return 0;
 }
 
+void clear_irq_resend(struct irq_desc *desc)
+{
+	raw_spin_lock(&irq_resend_lock);
+	hlist_del_init(&desc->resend_node);
+	raw_spin_unlock(&irq_resend_lock);
+}
 #else
+void clear_irq_resend(struct irq_desc *desc) {}
+
 static int irq_sw_resend(struct irq_desc *desc)
 {
 	return -EINVAL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ
  2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
  2023-01-30  0:57 ` [PATCH 1/5] genirq: Use hlist for managing resend handlers Shanker Donthineni
@ 2023-01-30  0:57 ` Shanker Donthineni
  2023-01-31  9:16   ` Thomas Gleixner
  2023-01-30  0:57 ` [PATCH 3/5] genirq: Introduce two helper functions Shanker Donthineni
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

Remove the statically allocated array of IRQ descriptors. Instead,
allocate the necessary NR_IRQS descriptors at boot time in
early_irq_init().
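
Note that kmalloc_array(NR_IRQS, sizeof(*desc), GFP_KERNEL | __GFP_ZERO)
as used below is equivalent to the shorter kcalloc() form (one-line
sketch, not part of the patch):

	desc = kcalloc(NR_IRQS, sizeof(*desc), GFP_KERNEL);

Both perform an overflow-checked, zeroed allocation of the descriptor
array.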

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 kernel/irq/irqdesc.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 21a968bc468b..a4911f7ebb07 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -556,27 +556,24 @@ int __init early_irq_init(void)
 
 #else /* !CONFIG_SPARSE_IRQ */
 
-struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
-	[0 ... NR_IRQS-1] = {
-		.handle_irq	= handle_bad_irq,
-		.depth		= 1,
-		.lock		= __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
-	}
-};
+static struct irq_desc *irq_descs;
 
 int __init early_irq_init(void)
 {
-	int count, i, node = first_online_node;
+	int i, node = first_online_node;
 	struct irq_desc *desc;
 
 	init_irq_default_affinity();
 
 	printk(KERN_INFO "NR_IRQS: %d\n", NR_IRQS);
 
-	desc = irq_desc;
-	count = ARRAY_SIZE(irq_desc);
+	desc = kmalloc_array(NR_IRQS, sizeof(*desc), GFP_KERNEL | __GFP_ZERO);
+	if (desc == NULL)
+		return -ENOMEM;
+
+	irq_descs = desc;
 
-	for (i = 0; i < count; i++) {
+	for (i = 0; i < NR_IRQS; i++) {
 		desc[i].kstat_irqs = alloc_percpu(unsigned int);
 		alloc_masks(&desc[i], node);
 		raw_spin_lock_init(&desc[i].lock);
@@ -593,7 +590,7 @@ int __init early_irq_init(void)
 
 struct irq_desc *irq_to_desc(unsigned int irq)
 {
-	return (irq < NR_IRQS) ? irq_desc + irq : NULL;
+	return (irq < NR_IRQS) ? irq_descs + irq : NULL;
 }
 EXPORT_SYMBOL(irq_to_desc);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/5] genirq: Introduce two helper functions
  2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
  2023-01-30  0:57 ` [PATCH 1/5] genirq: Use hlist for managing resend handlers Shanker Donthineni
  2023-01-30  0:57 ` [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ Shanker Donthineni
@ 2023-01-30  0:57 ` Shanker Donthineni
  2023-01-31  9:20   ` Thomas Gleixner
  2023-01-30  0:57 ` [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs() Shanker Donthineni
  2023-01-30  0:57 ` [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management Shanker Donthineni
  4 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

Introduce helper functions irq_find_free_area() and irq_find_next_irq().
The first function is used to locate available free space for a new IRQ,
and the second one is used to find the next allocated IRQ.

These helper functions will be modified in a later patch to utilize the
maple tree data structure. Additionally, to prepare for the move to the
new data structure, IRQ_BITMAP_BITS is renamed to MAX_SPARSE_IRQS.
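
For background, the two bitmap primitives being wrapped behave as
follows (illustrative sketch only; 'demo_map' and the helper names are
made up):

#include <linux/bitmap.h>

#define DEMO_BITS 1024
static DECLARE_BITMAP(demo_map, DEMO_BITS);

/*
 * Returns the start of a zero area of length 'cnt' at or after 'from',
 * or a value beyond DEMO_BITS if no such area exists. The last
 * argument is an alignment mask (0 = no alignment constraint).
 */
static unsigned long demo_find_free(unsigned long from, unsigned int cnt)
{
	return bitmap_find_next_zero_area(demo_map, DEMO_BITS, from, cnt, 0);
}

/* Returns the next set bit at or after 'offset', or DEMO_BITS if none. */
static unsigned long demo_find_next(unsigned long offset)
{
	return find_next_bit(demo_map, DEMO_BITS, offset);
}

Wrapping these keeps the maple tree conversion in a later patch
confined to the two helper functions.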

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 kernel/irq/internals.h |  4 ++--
 kernel/irq/irqdesc.c   | 28 +++++++++++++++++++---------
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 2fd17057ed4b..5d741b0e7d5e 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,9 +12,9 @@
 #include <linux/sched/clock.h>
 
 #ifdef CONFIG_SPARSE_IRQ
-# define IRQ_BITMAP_BITS	(NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS	(NR_IRQS + 8196)
 #else
-# define IRQ_BITMAP_BITS	NR_IRQS
+# define MAX_SPARSE_IRQS	NR_IRQS
 #endif
 
 #define istate core_internal_state__do_not_mess_with_it
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index a4911f7ebb07..aacfb4826c5e 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -131,7 +131,18 @@ int nr_irqs = NR_IRQS;
 EXPORT_SYMBOL_GPL(nr_irqs);
 
 static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
+static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+
+static int irq_find_free_area(unsigned int from, unsigned int cnt)
+{
+	return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
+					  from, cnt, 0);
+}
+
+static unsigned int irq_find_next_irq(unsigned int offset)
+{
+	return find_next_bit(allocated_irqs, nr_irqs, offset);
+}
 
 #ifdef CONFIG_SPARSE_IRQ
 
@@ -519,7 +530,7 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 
 static int irq_expand_nr_irqs(unsigned int nr)
 {
-	if (nr > IRQ_BITMAP_BITS)
+	if (nr > MAX_SPARSE_IRQS)
 		return -ENOMEM;
 	nr_irqs = nr;
 	return 0;
@@ -537,11 +548,11 @@ int __init early_irq_init(void)
 	printk(KERN_INFO "NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
 	       NR_IRQS, nr_irqs, initcnt);
 
-	if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
-		nr_irqs = IRQ_BITMAP_BITS;
+	if (WARN_ON(nr_irqs > MAX_SPARSE_IRQS))
+		nr_irqs = MAX_SPARSE_IRQS;
 
-	if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
-		initcnt = IRQ_BITMAP_BITS;
+	if (WARN_ON(initcnt > MAX_SPARSE_IRQS))
+		initcnt = MAX_SPARSE_IRQS;
 
 	if (initcnt > nr_irqs)
 		nr_irqs = initcnt;
@@ -813,8 +824,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 
 	mutex_lock(&sparse_irq_lock);
 
-	start = bitmap_find_next_zero_area(allocated_irqs, IRQ_BITMAP_BITS,
-					   from, cnt, 0);
+	start = irq_find_free_area(from, cnt);
 	ret = -EEXIST;
 	if (irq >=0 && start != irq)
 		goto unlock;
@@ -839,7 +849,7 @@ EXPORT_SYMBOL_GPL(__irq_alloc_descs);
  */
 unsigned int irq_get_next_irq(unsigned int offset)
 {
-	return find_next_bit(allocated_irqs, nr_irqs, offset);
+	return irq_find_next_irq(offset);
 }
 
 struct irq_desc *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs()
  2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
                   ` (2 preceding siblings ...)
  2023-01-30  0:57 ` [PATCH 3/5] genirq: Introduce two helper functions Shanker Donthineni
@ 2023-01-30  0:57 ` Shanker Donthineni
  2023-01-31  9:35   ` Thomas Gleixner
  2023-01-30  0:57 ` [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management Shanker Donthineni
  4 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

In the !SPARSEIRQ configuration, irq_expand_nr_irqs() is a stub that
always returns -ENOMEM. The SPARSEIRQ version of the function can be
safely used in both cases: for !SPARSEIRQ, nr_irqs = MAX_SPARSE_IRQS =
NR_IRQS, and the function is only called when a request would exceed
nr_irqs, so it still returns -ENOMEM.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 kernel/irq/irqdesc.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index aacfb4826c5e..247a0718d028 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -144,6 +144,14 @@ static unsigned int irq_find_next_irq(unsigned int offset)
 	return find_next_bit(allocated_irqs, nr_irqs, offset);
 }
 
+static int irq_expand_nr_irqs(unsigned int nr)
+{
+	if (nr > MAX_SPARSE_IRQS)
+		return -ENOMEM;
+	nr_irqs = nr;
+	return 0;
+}
+
 #ifdef CONFIG_SPARSE_IRQ
 
 static void irq_kobj_release(struct kobject *kobj);
@@ -528,14 +536,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 	return -ENOMEM;
 }
 
-static int irq_expand_nr_irqs(unsigned int nr)
-{
-	if (nr > MAX_SPARSE_IRQS)
-		return -ENOMEM;
-	nr_irqs = nr;
-	return 0;
-}
-
 int __init early_irq_init(void)
 {
 	int i, initcnt, node = first_online_node;
@@ -630,11 +630,6 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
 	return start;
 }
 
-static int irq_expand_nr_irqs(unsigned int nr)
-{
-	return -ENOMEM;
-}
-
 void irq_mark_irq(unsigned int irq)
 {
 	mutex_lock(&sparse_irq_lock);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
                   ` (3 preceding siblings ...)
  2023-01-30  0:57 ` [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs() Shanker Donthineni
@ 2023-01-30  0:57 ` Shanker Donthineni
  2023-01-31  9:52   ` Thomas Gleixner
  2023-02-01  6:02   ` kernel test robot
  4 siblings, 2 replies; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-30  0:57 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

The current implementation uses a static bitmap and a radix tree
to manage IRQ allocation and irq_desc pointer store respectively.
However, the size of the bitmap is constrained by the build time
macro MAX_SPARSE_IRQS, which may not be sufficient to support
high-end servers, particularly those with GICv4.1 hardware, which
require a large interrupt space to cover LPIs and vSGIs.

The maple tree is a highly efficient data structure for storing
non-overlapping ranges and can handle a large number of entries,
up to ULONG_MAX. It can be utilized for both storing IRQD and
identifying available free spaces.

The IRQD management can be simplified by switching to a maple tree
data structure, which offers greater flexibility and scalability.
To support modern servers, the maximum number of IRQs has been
increased to INT_MAX, which provides a more adequate value than
the previous limit of NR_IRQS+8192.
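
For background, the gap-search call used below works roughly like this
(illustrative sketch only; 'demo_tree' and the helper name are made
up):

#include <linux/maple_tree.h>

static struct maple_tree demo_tree =
	MTREE_INIT(demo_tree, MT_FLAGS_ALLOC_RANGE);

/*
 * Find 'cnt' consecutive unused indices in [from, max]. On success
 * mas_empty_area() returns 0 and mas.index holds the first index of
 * the gap; otherwise it returns a negative error code.
 */
static int demo_find_gap(unsigned long from, unsigned long max,
			 unsigned long cnt)
{
	MA_STATE(mas, &demo_tree, 0, 0);

	if (mas_empty_area(&mas, from, max, cnt))
		return -ENOSPC;
	return mas.index;
}

MT_FLAGS_ALLOC_RANGE enables the per-node gap tracking this search
relies on, which is why sparse_irqs below is created with that flag.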

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 kernel/irq/internals.h |  2 +-
 kernel/irq/irqdesc.c   | 51 ++++++++++++++++++++++++------------------
 2 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 5d741b0e7d5e..e35de737802c 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -12,7 +12,7 @@
 #include <linux/sched/clock.h>
 
 #ifdef CONFIG_SPARSE_IRQ
-# define MAX_SPARSE_IRQS	(NR_IRQS + 8196)
+# define MAX_SPARSE_IRQS	INT_MAX
 #else
 # define MAX_SPARSE_IRQS	NR_IRQS
 #endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 247a0718d028..06be5f924a7c 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -15,6 +15,7 @@
 #include <linux/radix-tree.h>
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
+#include <linux/maple_tree.h>
 #include <linux/sysfs.h>
 
 #include "internals.h"
@@ -131,17 +132,37 @@ int nr_irqs = NR_IRQS;
 EXPORT_SYMBOL_GPL(nr_irqs);
 
 static DEFINE_MUTEX(sparse_irq_lock);
-static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
+static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
+					MT_FLAGS_ALLOC_RANGE |
+					MT_FLAGS_LOCK_EXTERN |
+					MT_FLAGS_USE_RCU, sparse_irq_lock);
 
 static int irq_find_free_area(unsigned int from, unsigned int cnt)
 {
-	return bitmap_find_next_zero_area(allocated_irqs, MAX_SPARSE_IRQS,
-					  from, cnt, 0);
+	MA_STATE(mas, &sparse_irqs, 0, 0);
+
+	if (mas_empty_area(&mas, from, MAX_SPARSE_IRQS, cnt))
+		return -ENOSPC;
+	return mas.index;
 }
 
 static unsigned int irq_find_next_irq(unsigned int offset)
 {
-	return find_next_bit(allocated_irqs, nr_irqs, offset);
+	struct irq_desc *desc = mt_next(&sparse_irqs, offset, nr_irqs);
+
+	return desc ? irq_desc_get_irq(desc) : nr_irqs;
+}
+
+static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
+{
+	MA_STATE(mas, &sparse_irqs, irq, irq);
+	WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
+}
+
+static void delete_irq_desc(unsigned int irq)
+{
+	MA_STATE(mas, &sparse_irqs, irq, irq);
+	mas_erase(&mas);
 }
 
 static int irq_expand_nr_irqs(unsigned int nr)
@@ -363,26 +384,14 @@ static void irq_sysfs_del(struct irq_desc *desc) {}
 
 #endif /* CONFIG_SYSFS */
 
-static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
-
-static void irq_insert_desc(unsigned int irq, struct irq_desc *desc)
-{
-	radix_tree_insert(&irq_desc_tree, irq, desc);
-}
-
 struct irq_desc *irq_to_desc(unsigned int irq)
 {
-	return radix_tree_lookup(&irq_desc_tree, irq);
+	return mtree_load(&sparse_irqs, irq);
 }
 #ifdef CONFIG_KVM_BOOK3S_64_HV_MODULE
 EXPORT_SYMBOL_GPL(irq_to_desc);
 #endif
 
-static void delete_irq_desc(unsigned int irq)
-{
-	radix_tree_delete(&irq_desc_tree, irq);
-}
-
 #ifdef CONFIG_SMP
 static void free_masks(struct irq_desc *desc)
 {
@@ -527,7 +536,6 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		irq_sysfs_add(start + i, desc);
 		irq_add_debugfs_entry(start + i, desc);
 	}
-	bitmap_set(allocated_irqs, start, cnt);
 	return start;
 
 err:
@@ -559,7 +567,6 @@ int __init early_irq_init(void)
 
 	for (i = 0; i < initcnt; i++) {
 		desc = alloc_desc(i, node, 0, NULL, NULL);
-		set_bit(i, allocated_irqs);
 		irq_insert_desc(i, desc);
 	}
 	return arch_early_irq_init();
@@ -613,6 +620,7 @@ static void free_desc(unsigned int irq)
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	delete_irq_desc(irq);
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
@@ -625,15 +633,15 @@ static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
 		struct irq_desc *desc = irq_to_desc(start + i);
 
 		desc->owner = owner;
+		irq_insert_desc(start + i, desc);
 	}
-	bitmap_set(allocated_irqs, start, cnt);
 	return start;
 }
 
 void irq_mark_irq(unsigned int irq)
 {
 	mutex_lock(&sparse_irq_lock);
-	bitmap_set(allocated_irqs, irq, 1);
+	irq_insert_desc(irq, irq_descs + irq);
 	mutex_unlock(&sparse_irq_lock);
 }
 
@@ -777,7 +785,6 @@ void irq_free_descs(unsigned int from, unsigned int cnt)
 	for (i = 0; i < cnt; i++)
 		free_desc(from + i);
 
-	bitmap_clear(allocated_irqs, from, cnt);
 	mutex_unlock(&sparse_irq_lock);
 }
 EXPORT_SYMBOL_GPL(irq_free_descs);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] genirq: Use hlist for managing resend handlers
  2023-01-30  0:57 ` [PATCH 1/5] genirq: Use hlist for managing resend handlers Shanker Donthineni
@ 2023-01-31  8:59   ` Thomas Gleixner
  2023-01-31 16:17     ` Shanker Donthineni
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-01-31  8:59 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
> +/* hlist_head to handle software resend of interrupts: */
> +static HLIST_HEAD(irq_resend_list);
> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>  
>  /*
>   * Run software resends of IRQ's
> @@ -30,18 +31,16 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>  static void resend_irqs(struct tasklet_struct *unused)
>  {
>  	struct irq_desc *desc;
> -	int irq;
> -
> -	while (!bitmap_empty(irqs_resend, nr_irqs)) {
> -		irq = find_first_bit(irqs_resend, nr_irqs);
> -		clear_bit(irq, irqs_resend);
> -		desc = irq_to_desc(irq);
> -		if (!desc)
> -			continue;
> +	struct hlist_node *n;
> +
> +	raw_spin_lock_irq(&irq_resend_lock);
> +	hlist_for_each_entry_safe(desc, n, &irq_resend_list, resend_node) {
> +		hlist_del_init(&desc->resend_node);
>  		local_irq_disable();
>  		desc->handle_irq(desc);
>  		local_irq_enable();
>  	}
> +	raw_spin_unlock_irq(&irq_resend_lock);

The lock ordering is broken here. irq_sw_resend() is invoked with
desc->lock held and takes irq_resend_lock.

Lockdep clearly would have told you...

	raw_spin_lock_irq(&irq_resend_lock);
        while (!hlist_empty(...)) {
        	desc = hlist_entry(irq_resend_list.first, ...);
		hlist_del_init(&desc->resend_node);
                raw_spin_unlock(&...);
                desc->handle_irq();
                raw_spin_lock(&...);
        }        

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ
  2023-01-30  0:57 ` [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ Shanker Donthineni
@ 2023-01-31  9:16   ` Thomas Gleixner
  2023-01-31 16:41     ` Shanker Donthineni
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-01-31  9:16 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:

> Remove the statically allocated array of IRQ descriptors. Instead,
> allocate the necessary NR_IRQS descriptors at boot time in
> early_irq_init().

That's not what I meant. I was pondering to remove !SPARSEIRQ
completely.

That allocation does not buy us much.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] genirq: Introduce two helper functions
  2023-01-30  0:57 ` [PATCH 3/5] genirq: Introduce two helper functions Shanker Donthineni
@ 2023-01-31  9:20   ` Thomas Gleixner
  2023-01-31 16:42     ` Shanker Donthineni
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-01-31  9:20 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
> Introduce helper functions irq_find_free_area() and irq_find_next_irq().
> The first function is used to locate available free space for a new IRQ,
> and the second one is used to find the next valid IRQ.
>
> These helper functions will be later modified to utilize the Maple data

So you first introduce helpers just to modify them in the next patch?

What you want to say here in this changelog is:

     genirq: Encapsulate sparse bitmap handling

     Move the open coded sparse bitmap handling into helper functions as
     a preparatory step for converting the sparse interrupt management
     to a maple tree.

     No functional change.

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs()
  2023-01-30  0:57 ` [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs() Shanker Donthineni
@ 2023-01-31  9:35   ` Thomas Gleixner
  2023-01-31 16:43     ` Shanker Donthineni
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-01-31  9:35 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:

> Subject: genirq: Use the common function ...

  genirq: Unify irq_expand_nr_irqs()

  irq_expand_nr_irqs() is implemented as a stub function for !SPARSEIRQ
  builds. That's not necessary as the SPARSEIRQ version returns -ENOMEM
  correctly even for the !SPARSEIRQ case as the ....


But this common function is non-obvious for the !SPARSEIRQ case. It at
least needs a comment

> +static int irq_expand_nr_irqs(unsigned int nr)
> +{
> +	if (nr > MAX_SPARSE_IRQS)
> +		return -ENOMEM;
> +	nr_irqs = nr;
> +	return 0;
> +}

or preferrably something like this:

	if (!IS_ENABLED(CONFIG_SPARSE_IRQ) || nr > MAX_SPARSE_IRQS)
		return -ENOMEM;

which makes it entirely clear and also allows the compiler to optimize
it down to a 'return -ENOMEM'.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-01-30  0:57 ` [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management Shanker Donthineni
@ 2023-01-31  9:52   ` Thomas Gleixner
  2023-01-31 16:45     ` Shanker Donthineni
  2023-02-01  6:02   ` kernel test robot
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-01-31  9:52 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, linux-kernel

On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
> The current implementation uses a static bitmap and a radix tree
> to manage IRQ allocation and irq_desc pointer store respectively.
> However, the size of the bitmap is constrained by the build time
> macro MAX_SPARSE_IRQS, which may not be sufficient to support
> high-end servers, particularly those with GICv4.1 hardware, which
> require a large interrupt space to cover LPIs and vSGIs.
>
> The maple tree is a highly efficient data structure for storing
> non-overlapping ranges and can handle a large number of entries,
> up to ULONG_MAX. It can be utilized for both storing IRQD and

IRQD ??. Please write it out: interrupt descriptors

Changelogs have no space constraints and there is zero value to
introduce unreadable acronyms.

>  static DEFINE_MUTEX(sparse_irq_lock);
> -static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
> +static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
> +					MT_FLAGS_ALLOC_RANGE |
> +					MT_FLAGS_LOCK_EXTERN |
> +					MT_FLAGS_USE_RCU, sparse_irq_lock);

Nit. Can we please format this properly:

static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
						      MT_FLAGS_ALLOC_RANGE |
					              MT_FLAGS_LOCK_EXTERN |
					              MT_FLAGS_USE_RCU,
                                                      sparse_irq_lock);

Other than that this looks really good.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] genirq: Use hlist for managing resend handlers
  2023-01-31  8:59   ` Thomas Gleixner
@ 2023-01-31 16:17     ` Shanker Donthineni
  2023-01-31 17:06       ` Shanker Donthineni
  0 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 16:17 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 02:59, Thomas Gleixner wrote:
> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
>> +/* hlist_head to handle software resend of interrupts: */
>> +static HLIST_HEAD(irq_resend_list);
>> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>>
>>   /*
>>    * Run software resends of IRQ's
>> @@ -30,18 +31,16 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>>   static void resend_irqs(struct tasklet_struct *unused)
>>   {
>>        struct irq_desc *desc;
>> -     int irq;
>> -
>> -     while (!bitmap_empty(irqs_resend, nr_irqs)) {
>> -             irq = find_first_bit(irqs_resend, nr_irqs);
>> -             clear_bit(irq, irqs_resend);
>> -             desc = irq_to_desc(irq);
>> -             if (!desc)
>> -                     continue;
>> +     struct hlist_node *n;
>> +
>> +     raw_spin_lock_irq(&irq_resend_lock);
>> +     hlist_for_each_entry_safe(desc, n, &irq_resend_list, resend_node) {
>> +             hlist_del_init(&desc->resend_node);
>>                local_irq_disable();
>>                desc->handle_irq(desc);
>>                local_irq_enable();
>>        }
>> +     raw_spin_unlock_irq(&irq_resend_lock);
> The lock ordering is broken here. irq_sw_resend() is invoked with
> desc->lock held and takes irq_resend_lock.
> 
> Lockdep clearly would have told you...
> 
>          raw_spin_lock_irq(&irq_resend_lock);
>          while (!hlist_empty(...)) {
>                  desc = hlist_entry(irq_resend_list.first, ...);
>                  hlist_del_init(&desc->resend_node);
>                  raw_spin_unlock(&...);
>                  desc->handle_irq();
>                  raw_spin_lock(&...);
>          }

Thanks for catching mistakes, I'll change it to this, please correct me if
I'm doing another mistake.

static void resend_irqs(struct tasklet_struct *unused)
{
         struct irq_desc *desc;

         raw_spin_lock_irq(&irq_resend_lock);
         while (hlist_empty(&irq_resend_list)) {
                 desc = hlist_entry(irq_resend_list.first, struct irq_desc,
                                    resend_node);
                 hlist_del_init(&desc->resend_node);
                 raw_spin_unlock(&irq_resend_lock);
                 desc->handle_irq(desc);
                 raw_spin_lock(&irq_resend_lock);
         }
         raw_spin_unlock_irq(&irq_resend_lock);
}

Thanks,
Shanker

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ
  2023-01-31  9:16   ` Thomas Gleixner
@ 2023-01-31 16:41     ` Shanker Donthineni
  2023-02-07 10:28       ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 16:41 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 03:16, Thomas Gleixner wrote:
> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
> 
>> Remove the statically allocated array of IRQ descriptors. Instead,
>> allocate the necessary NR_IRQS descriptors at boot time in
>> early_irq_init().
> 
> That's not what I meant. I was pondering to remove !SPARSEIRQ
> completely.
> 

It's touching many files to drop CONFIG_SPARSE_IRQ=n support. Worried
about removing !SPARSEIRQ without testing on legacy platforms.

Can work on this after moving it to maple tree?

  arch/arm/Kconfig           |  1 -
  arch/arm/include/asm/irq.h |  4 --
  arch/arm/kernel/irq.c      |  2 -
  arch/arm64/Kconfig         |  1 -
  arch/csky/Kconfig          |  1 -
  arch/loongarch/Kconfig     |  1 -
  arch/microblaze/Kconfig    |  1 -
  arch/nios2/Kconfig         |  1 -
  arch/openrisc/Kconfig      |  1 -
  arch/powerpc/Kconfig       |  1 -
  arch/riscv/Kconfig         |  1 -
  arch/s390/Kconfig          |  1 -
  arch/sh/Kconfig            |  1 -
  arch/sparc/Kconfig         |  1 -
  arch/x86/Kconfig           |  1 -
  drivers/irqchip/Kconfig    |  5 ---
  include/linux/irqdesc.h    |  8 ----
  kernel/irq/Kconfig         | 17 --------
  kernel/irq/chip.c          |  5 ---
  kernel/irq/internals.h     | 10 -----
  kernel/irq/irqdesc.c       | 88 --------------------------------------
  kernel/irq/irqdomain.c     | 14 +++---
  22 files changed, 6 insertions(+), 160 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 43c7773b89ae..d4fc6c549660 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -134,7 +134,6 @@ config ARM
         select PCI_SYSCALL if PCI
         select PERF_USE_VMALLOC
         select RTC_LIB
-       select SPARSE_IRQ if !(ARCH_FOOTBRIDGE || ARCH_RPC)
         select SYS_SUPPORTS_APM_EMULATION
         select THREAD_INFO_IN_TASK
         select TIMER_OF if OF
diff --git a/arch/arm/include/asm/irq.h b/arch/arm/include/asm/irq.h
index a7c2337b0c7d..6d4ff82c40cf 100644
--- a/arch/arm/include/asm/irq.h
+++ b/arch/arm/include/asm/irq.h
@@ -4,11 +4,7 @@

...

-#ifndef CONFIG_SPARSE_IRQ
-#include <mach/irqs.h>
-#else
  #define NR_IRQS NR_IRQS_LEGACY
-#endif

...
# Make sparse irq Kconfig switch below available
-config MAY_HAVE_SPARSE_IRQ
-       bool
-
  # Legacy support, required for itanic
  config GENERIC_IRQ_LEGACY
         bool
@@ -107,19 +103,6 @@ config GENERIC_IRQ_RESERVATION_MODE
  config IRQ_FORCED_THREADING
         bool


> That allocation does not buy us much.
> 

Agree, not much value other than reducing kernel binary size. I will drop
this patch in v2.

Thanks,
Shanker

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/5] genirq: Introduce two helper functions
  2023-01-31  9:20   ` Thomas Gleixner
@ 2023-01-31 16:42     ` Shanker Donthineni
  0 siblings, 0 replies; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 16:42 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 03:20, Thomas Gleixner wrote:
> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
>> Introduce helper functions irq_find_free_area() and irq_find_next_irq().
>> The first function is used to locate available free space for a new IRQ,
>> and the second one is used to find the next valid IRQ.
>>
>> These helper functions will be later modified to utilize the Maple data
> 
> So you first introduce helpers just to modify them in the next patch?
> 
> What you want to say here in this changelog is:
> 
>       genirq: Encapsulate sparse bitmap handling
> 
>       Move the open coded sparse bitmap handling into helper functions as
>       a preparatory step for converting the sparse interrupt management
>       to a maple tree.
> 
>       No functional change.
> 

Thanks, I'll update in v2 patch.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs()
  2023-01-31  9:35   ` Thomas Gleixner
@ 2023-01-31 16:43     ` Shanker Donthineni
  2023-02-07 10:29       ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 16:43 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 03:35, Thomas Gleixner wrote:
> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
> 
>> Subject: genirq: Use the common function ...
> 
>    genirq: Unify irq_expand_nr_irqs()
> 
>    irq_expand_nr_irqs() is implemented as a stub function for !SPARSEIRQ
>    builds. That's not necessary as the SPARSEIRQ version returns -ENOMEM
>    correctly even for the !SPARSEIRQ case as the ....
> 
> 
> But this common function is non-obvious for the !SPARSEIRQ case. It at
> least needs a comment
> 
>> +static int irq_expand_nr_irqs(unsigned int nr)
>> +{
>> +     if (nr > MAX_SPARSE_IRQS)
>> +             return -ENOMEM;
>> +     nr_irqs = nr;
>> +     return 0;
>> +}
> 
> or preferrably something like this:
> 
>          if (!IS_ENABLED(CONFIG_SPARSE_IRQ) || nr > MAX_SPARSE_IRQS)
>                  return -ENOMEM;
> 
> which makes it entirely clear and also allows the compiler to optimize
> it down to a 'return -ENOMEM'.
> 

I'll drop this patch since you're suggesting to remove !SPARSEIRQ support.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-01-31  9:52   ` Thomas Gleixner
@ 2023-01-31 16:45     ` Shanker Donthineni
  0 siblings, 0 replies; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 16:45 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 03:52, Thomas Gleixner wrote:
> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
>> The current implementation uses a static bitmap and a radix tree
>> to manage IRQ allocation and irq_desc pointer store respectively.
>> However, the size of the bitmap is constrained by the build time
>> macro MAX_SPARSE_IRQS, which may not be sufficient to support
>> high-end servers, particularly those with GICv4.1 hardware, which
>> require a large interrupt space to cover LPIs and vSGIs.
>>
>> The maple tree is a highly efficient data structure for storing
>> non-overlapping ranges and can handle a large number of entries,
>> up to ULONG_MAX. It can be utilized for both storing IRQD and
> 
> IRQD ??. Please write it out: interrupt descriptors
> 
> Changelogs have no space constraints and there is zero value to
> introduce unreadable acronyms.
> 
>>   static DEFINE_MUTEX(sparse_irq_lock);
>> -static DECLARE_BITMAP(allocated_irqs, MAX_SPARSE_IRQS);
>> +static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
>> +                                     MT_FLAGS_ALLOC_RANGE |
>> +                                     MT_FLAGS_LOCK_EXTERN |
>> +                                     MT_FLAGS_USE_RCU, sparse_irq_lock);
> 
> Nit. Can we please format this properly:
> 
> static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
>                                                        MT_FLAGS_ALLOC_RANGE |
>                                                        MT_FLAGS_LOCK_EXTERN |
>                                                        MT_FLAGS_USE_RCU,
>                                                        sparse_irq_lock);
> 
> Other than that this looks really good.
> 

I'll update in v2 patch.

Thanks,
Shanker

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] genirq: Use hlist for managing resend handlers
  2023-01-31 16:17     ` Shanker Donthineni
@ 2023-01-31 17:06       ` Shanker Donthineni
  0 siblings, 0 replies; 33+ messages in thread
From: Shanker Donthineni @ 2023-01-31 17:06 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel



On 1/31/23 10:17, Shanker Donthineni wrote:
> 
> 
> On 1/31/23 02:59, Thomas Gleixner wrote:
>> On Sun, Jan 29 2023 at 18:57, Shanker Donthineni wrote:
>>> +/* hlist_head to handle software resend of interrupts: */
>>> +static HLIST_HEAD(irq_resend_list);
>>> +static DEFINE_RAW_SPINLOCK(irq_resend_lock);
>>>
>>>   /*
>>>    * Run software resends of IRQ's
>>> @@ -30,18 +31,16 @@ static DECLARE_BITMAP(irqs_resend, IRQ_BITMAP_BITS);
>>>   static void resend_irqs(struct tasklet_struct *unused)
>>>   {
>>>        struct irq_desc *desc;
>>> -     int irq;
>>> -
>>> -     while (!bitmap_empty(irqs_resend, nr_irqs)) {
>>> -             irq = find_first_bit(irqs_resend, nr_irqs);
>>> -             clear_bit(irq, irqs_resend);
>>> -             desc = irq_to_desc(irq);
>>> -             if (!desc)
>>> -                     continue;
>>> +     struct hlist_node *n;
>>> +
>>> +     raw_spin_lock_irq(&irq_resend_lock);
>>> +     hlist_for_each_entry_safe(desc, n, &irq_resend_list, resend_node) {
>>> +             hlist_del_init(&desc->resend_node);
>>>                local_irq_disable();
>>>                desc->handle_irq(desc);
>>>                local_irq_enable();
>>>        }
>>> +     raw_spin_unlock_irq(&irq_resend_lock);
>> The lock ordering is broken here. irq_sw_resend() is invoked with
>> desc->lock held and takes irq_resend_lock.
>>
>> Lockdep clearly would have told you...
>>
>>          raw_spin_lock_irq(&irq_resend_lock);
>>          while (!hlist_empty(...)) {
>>                  desc = hlist_entry(irq_resend_list.first, ...);
>>                  hlist_del_init(&desc->resend_node);
>>                  raw_spin_unlock(&...);
>>                  desc->handle_irq();
>>                  raw_spin_lock(&...);
>>          }
> 
> Thanks for catching mistakes, I'll change it to this, please correct me if
> I'm doing another mistake.
> 
> static void resend_irqs(struct tasklet_struct *unused)
> {
>          struct irq_desc *desc;
> 
>          raw_spin_lock_irq(&irq_resend_lock);
>          while (hlist_empty(&irq_resend_list)) {

Sorry typo "while (!hlist_empty(&irq_resend_list)) {"


>                  desc = hlist_entry(irq_resend_list.first, struct irq_desc,
>                                     resend_node);
>                  hlist_del_init(&desc->resend_node);
>                  raw_spin_unlock(&irq_resend_lock);
>                  desc->handle_irq(desc);
>                  raw_spin_lock(&irq_resend_lock);
>          }
>          raw_spin_unlock_irq(&irq_resend_lock);
> }
> 
> Thanks,
> Shanker

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-01-30  0:57 ` [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management Shanker Donthineni
  2023-01-31  9:52   ` Thomas Gleixner
@ 2023-02-01  6:02   ` kernel test robot
  2023-02-01 13:27     ` Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: kernel test robot @ 2023-02-01  6:02 UTC (permalink / raw)
  To: Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Thomas Gleixner, Marc Zyngier,
	Michael Walle, Sebastian Andrzej Siewior, Hans de Goede,
	Wolfram Sang, Shanker Donthineni

[-- Attachment #1: Type: text/plain, Size: 7343 bytes --]


Greeting,

FYI, we noticed WARNING:at_kernel/locking/lockdep.c:#lockdep_hardirqs_on_prepare due to commit (built with gcc-11):

commit: 02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c ("[PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management")
url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230130-085956
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 188a569658584e93930ab60334c5a1079c0330d8
patch link: https://lore.kernel.org/all/20230130005725.3517597-6-sdonthineni@nvidia.com/
patch subject: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202302011308.f53123d2-oliver.sang@intel.com


[    2.214554][    T0] ------------[ cut here ]------------
[    2.215401][    T0] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
[    2.215446][    T0] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4308 lockdep_hardirqs_on_prepare+0x2d4/0x350
[    2.217975][    T0] Modules linked in:
[    2.218526][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc3-00015-g02fb8013ee5f #1
[    2.219803][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[    2.221228][    T0] RIP: 0010:lockdep_hardirqs_on_prepare+0x2d4/0x350
[    2.222207][    T0] Code: 11 38 d0 7c 04 84 d2 75 5e 8b 0d bf 8b f7 03 85 c9 0f 85 c9 fe ff ff 48 c7 c6 40 7d a9 83 48 c7 c7 60 4e a9 83 e8 60 7c 35 02 <0f> 0b e9 af fe ff ff e8 50 8d 62 00 e9 0c fe ff ff e8 e6 8d 62 00
[    2.224923][    T0] RSP: 0000:ffffffff844075a0 EFLAGS: 00010082
[    2.225792][    T0] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
[    2.226889][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffbfff0880ea6
[    2.227955][    T0] RBP: ffff8883af23fac0 R08: 0000000000000000 R09: ffffffff844072df
[    2.229068][    T0] R10: fffffbfff0880e5b R11: 0000000000000001 R12: 0000000000000002
[    2.230147][    T0] R13: 0000000000000002 R14: ffff88810022b018 R15: ffff88810022b010
[    2.231269][    T0] FS:  0000000000000000(0000) GS:ffff8883af200000(0000) knlGS:0000000000000000
[    2.232522][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.233395][    T0] CR2: ffff88843ffff000 CR3: 000000000442a000 CR4: 00000000000406b0
[    2.234504][    T0] Call Trace:
[    2.234941][    T0]  <TASK>
[    2.235345][    T0]  trace_hardirqs_on+0x40/0x140
[    2.236029][    T0]  __kmem_cache_alloc_bulk+0x22e/0x490
[    2.236795][    T0]  ? kasan_set_track+0x25/0x30
[    2.237470][    T0]  kmem_cache_alloc_bulk+0x159/0x2e0
[    2.238225][    T0]  mas_alloc_nodes+0x253/0x690
[    2.238886][    T0]  mas_split+0x30d/0x1580
[    2.239561][    T0]  ? mas_push_data+0x1a40/0x1a40
[    2.240219][    T0]  ? memset+0x24/0x50
[    2.240782][    T0]  ? blake2s_final+0x110/0x140
[    2.241426][    T0]  ? blake2s+0x115/0x150
[    2.242143][    T0]  ? wait_for_random_bytes+0xd0/0xd0
[    2.242859][    T0]  ? mas_mab_cp+0x2f6/0x890
[    2.243527][    T0]  ? memset+0x24/0x50
[    2.244122][    T0]  ? find_held_lock+0x2c/0x110
[    2.244803][    T0]  ? mas_store_b_node+0x54c/0x1180
[    2.245510][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
[    2.246282][    T0]  mas_wr_bnode+0x14f/0x1d0
[    2.246902][    T0]  ? mas_commit_b_node+0x600/0x600
[    2.247677][    T0]  ? secondary_startup_64_no_verify+0xe0/0xeb
[    2.248567][    T0]  ? ___slab_alloc+0x70b/0xe00
[    2.249251][    T0]  ? mas_wr_store_entry+0x2e9/0xe30
[    2.250088][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
[    2.250864][    T0]  mas_store_gfp+0xc2/0x190
[    2.251516][    T0]  ? mtree_erase+0x100/0x100
[    2.252190][    T0]  ? lockdep_init_map_type+0x2c7/0x780
[    2.252924][    T0]  irq_insert_desc+0xac/0xf0
[    2.253562][    T0]  ? irq_kobj_release+0x100/0x100
[    2.254243][    T0]  early_irq_init+0x81/0x8c
[    2.254866][    T0]  start_kernel+0x1c7/0x3a4
[    2.255479][    T0]  secondary_startup_64_no_verify+0xe0/0xeb
[    2.256408][    T0]  </TASK>
[    2.256802][    T0] irq event stamp: 0
[    2.257268][    T0] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[    2.258177][    T0] hardirqs last disabled at (0): [<0000000000000000>] 0x0
[    2.259116][    T0] softirqs last  enabled at (0): [<0000000000000000>] 0x0
[    2.260044][    T0] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    2.260979][    T0] ---[ end trace 0000000000000000 ]---
[    2.262190][    T0] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    2.263441][    T0] ------------[ cut here ]------------
[    2.264180][    T0] Interrupts were enabled early
[    2.264809][    T0] WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x239/0x3a4
[    2.265872][    T0] Modules linked in:
[    2.266391][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.2.0-rc3-00015-g02fb8013ee5f #1
[    2.267721][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[    2.270166][    T0] RIP: 0010:start_kernel+0x239/0x3a4
[    2.270938][    T0] Code: 48 89 05 f6 11 58 7a e8 b9 04 06 00 e8 f4 d2 d1 fd e8 40 75 05 00 9c 58 0f ba e0 09 73 0e 48 c7 c7 60 0e a0 83 e8 af bf bf fd <0f> 0b c6 05 2a 12 81 ff 00 e8 ad 96 ad fb fb e8 58 07 07 00 e8 49
[    2.273782][    T0] RSP: 0000:ffffffff84407f38 EFLAGS: 00010286
[    2.274637][    T0] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[    2.275771][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffbfff0880fd9
[    2.276858][    T0] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff84407c77
[    2.277994][    T0] R10: fffffbfff0880f8e R11: 0000000000000001 R12: 0000000000000000
[    2.279079][    T0] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.280185][    T0] FS:  0000000000000000(0000) GS:ffff8883af200000(0000) knlGS:0000000000000000
[    2.281474][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.282441][    T0] CR2: ffff88843ffff000 CR3: 000000000442a000 CR4: 00000000000406b0
[    2.283519][    T0] Call Trace:
[    2.283930][    T0]  <TASK>
[    2.284328][    T0]  secondary_startup_64_no_verify+0xe0/0xeb
[    2.285143][    T0]  </TASK>
[    2.285517][    T0] irq event stamp: 0
[    2.286011][    T0] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[    2.286946][    T0] hardirqs last disabled at (0): [<0000000000000000>] 0x0
[    2.287873][    T0] softirqs last  enabled at (0): [<0000000000000000>] 0x0
[    2.288797][    T0] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    2.289618][    T0] ---[ end trace 0000000000000000 ]---


To reproduce:

        

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



[-- Attachment #2: job-script --]
[-- Type: text/plain, Size: 5011 bytes --]

#!/bin/sh

export_top_env()
{
	export suite='boot'
	export testcase='boot'
	export category='functional'
	export timeout='10m'
	export job_origin='boot.yaml'
	export queue_cmdline_keys='branch
commit
kbuild_queue_analysis'
	export queue='validate'
	export testbox='vm-snb'
	export tbox_group='vm-snb'
	export branch='linux-review/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230130-085956'
	export commit='02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c'
	export kconfig='x86_64-rhel-8.3-kselftests'
	export repeat_to=6
	export nr_vm=300
	export submit_id='63d9bd1880dd2ecb4e91ed89'
	export job_file='/lkp/jobs/scheduled/vm-meta-29/boot-1-yocto-x86_64-minimal-20190520.cgz-02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c-20230201-52046-1foxqlm-4.yaml'
	export id='44d34cde8a1698b43937f2dbf1af484adf8126ad'
	export queuer_version='/zday/lkp'
	export model='qemu-system-x86_64 -enable-kvm -cpu SandyBridge'
	export nr_cpu=2
	export memory='16G'
	export need_kconfig=\{\"KVM_GUEST\"\=\>\"y\"\}
	export ssh_base_port=23032
	export kernel_cmdline='vmalloc=256M initramfs_async=0 page_owner=on'
	export rootfs='yocto-x86_64-minimal-20190520.cgz'
	export compiler='gcc-11'
	export enqueue_time='2023-02-01 09:15:04 +0800'
	export _id='63d9bd2f80dd2ecb4e91ed8a'
	export _rt='/result/boot/1/vm-snb/yocto-x86_64-minimal-20190520.cgz/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c'
	export user='lkp'
	export LKP_SERVER='internal-lkp-server'
	export result_root='/result/boot/1/vm-snb/yocto-x86_64-minimal-20190520.cgz/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c/3'
	export scheduler_version='/lkp/lkp/.src-20230120-005037'
	export arch='x86_64'
	export max_uptime=600
	export initrd='/osimage/yocto/yocto-x86_64-minimal-20190520.cgz'
	export bootloader_append='root=/dev/ram0
RESULT_ROOT=/result/boot/1/vm-snb/yocto-x86_64-minimal-20190520.cgz/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c/3
BOOT_IMAGE=/pkg/linux/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c/vmlinuz-6.2.0-rc3-00015-g02fb8013ee5f
branch=linux-review/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230130-085956
job=/lkp/jobs/scheduled/vm-meta-29/boot-1-yocto-x86_64-minimal-20190520.cgz-02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c-20230201-52046-1foxqlm-4.yaml
user=lkp
ARCH=x86_64
kconfig=x86_64-rhel-8.3-kselftests
commit=02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c
vmalloc=256M initramfs_async=0 page_owner=on
initcall_debug
nmi_watchdog=0
max_uptime=600
LKP_SERVER=internal-lkp-server
selinux=0
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw'
	export modules_initrd='/pkg/linux/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c/modules.cgz'
	export lkp_initrd='/osimage/user/lkp/lkp-x86_64.cgz'
	export site='lkp-wsx01'
	export LKP_CGI_PORT=80
	export LKP_CIFS_PORT=139
	export schedule_notify_address=
	export stop_repeat_if_found='dmesg.WARNING:at_kernel/locking/lockdep.c:#lockdep_hardirqs_on_prepare'
	export kbuild_queue_analysis=1
	export meta_host='vm-meta-29'
	export kernel='/pkg/linux/x86_64-rhel-8.3-kselftests/gcc-11/02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c/vmlinuz-6.2.0-rc3-00015-g02fb8013ee5f'
	export dequeue_time='2023-02-01 09:15:39 +0800'
	export job_initrd='/lkp/jobs/scheduled/vm-meta-29/boot-1-yocto-x86_64-minimal-20190520.cgz-02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c-20230201-52046-1foxqlm-4.cgz'

	[ -n "$LKP_SRC" ] ||
	export LKP_SRC=/lkp/${user:-lkp}/src
}

run_job()
{
	echo $$ > $TMP/run-job.pid

	. $LKP_SRC/lib/http.sh
	. $LKP_SRC/lib/job.sh
	. $LKP_SRC/lib/env.sh

	export_top_env

	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-slabinfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-meminfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper memmap
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper boot-time
	run_monitor $LKP_SRC/monitors/wrapper kmsg
	run_monitor $LKP_SRC/monitors/wrapper heartbeat
	run_monitor $LKP_SRC/monitors/wrapper meminfo
	run_monitor $LKP_SRC/monitors/wrapper oom-killer
	run_monitor $LKP_SRC/monitors/plain/watchdog

	run_test $LKP_SRC/tests/wrapper sleep 1
}

extract_stats()
{
	export stats_part_begin=
	export stats_part_end=

	$LKP_SRC/stats/wrapper boot-slabinfo
	$LKP_SRC/stats/wrapper boot-meminfo
	$LKP_SRC/stats/wrapper memmap
	$LKP_SRC/stats/wrapper boot-memory
	$LKP_SRC/stats/wrapper boot-time
	$LKP_SRC/stats/wrapper kernel-size
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper sleep
	$LKP_SRC/stats/wrapper meminfo

	$LKP_SRC/stats/wrapper time sleep.time
	$LKP_SRC/stats/wrapper dmesg
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper last_state
	$LKP_SRC/stats/wrapper stderr
	$LKP_SRC/stats/wrapper time
}

"$@"

[-- Attachment #3: dmesg.xz --]
[-- Type: application/x-xz, Size: 30300 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-02-01  6:02   ` kernel test robot
@ 2023-02-01 13:27     ` Thomas Gleixner
  2023-02-06 14:24       ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-01 13:27 UTC (permalink / raw)
  To: kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang,
	Shanker Donthineni, Vlastimil Babka, linux-mm

On Wed, Feb 01 2023 at 14:02, kernel test robot wrote:
> FYI, we noticed WARNING:at_kernel/locking/lockdep.c:#lockdep_hardirqs_on_prepare due to commit (built with gcc-11):
>
> commit: 02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c ("[PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management")
> url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230130-085956
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 188a569658584e93930ab60334c5a1079c0330d8
> patch link: https://lore.kernel.org/all/20230130005725.3517597-6-sdonthineni@nvidia.com/
> patch subject: [PATCH 5/5] genirq: Use the maple tree for IRQ
> descriptors management

> [    2.214554][    T0] ------------[ cut here ]------------
> [    2.215401][    T0] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
> [    2.215446][    T0] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4308 lockdep_hardirqs_on_prepare+0x2d4/0x350
> [    2.217975][    T0] Modules linked in:
> [    2.218526][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc3-00015-g02fb8013ee5f #1
> [    2.219803][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [    2.221228][    T0] RIP: 0010:lockdep_hardirqs_on_prepare+0x2d4/0x350
> [    2.222207][    T0] Code: 11 38 d0 7c 04 84 d2 75 5e 8b 0d bf 8b f7 03 85 c9 0f 85 c9 fe ff ff 48 c7 c6 40 7d a9 83 48 c7 c7 60 4e a9 83 e8 60 7c 35 02 <0f> 0b e9 af fe ff ff e8 50 8d 62 00 e9 0c fe ff ff e8 e6 8d 62 00
> [    2.224923][    T0] RSP: 0000:ffffffff844075a0 EFLAGS: 00010082
> [    2.225792][    T0] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
> [    2.226889][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffbfff0880ea6
> [    2.227955][    T0] RBP: ffff8883af23fac0 R08: 0000000000000000 R09: ffffffff844072df
> [    2.229068][    T0] R10: fffffbfff0880e5b R11: 0000000000000001 R12: 0000000000000002
> [    2.230147][    T0] R13: 0000000000000002 R14: ffff88810022b018 R15: ffff88810022b010
> [    2.231269][    T0] FS:  0000000000000000(0000) GS:ffff8883af200000(0000) knlGS:0000000000000000
> [    2.232522][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.233395][    T0] CR2: ffff88843ffff000 CR3: 000000000442a000 CR4: 00000000000406b0
> [    2.234504][    T0] Call Trace:
> [    2.234941][    T0]  <TASK>
> [    2.235345][    T0]  trace_hardirqs_on+0x40/0x140
> [    2.236029][    T0]  __kmem_cache_alloc_bulk+0x22e/0x490
> [    2.236795][    T0]  ? kasan_set_track+0x25/0x30
> [    2.237470][    T0]  kmem_cache_alloc_bulk+0x159/0x2e0
> [    2.238225][    T0]  mas_alloc_nodes+0x253/0x690
> [    2.238886][    T0]  mas_split+0x30d/0x1580
> [    2.239561][    T0]  ? mas_push_data+0x1a40/0x1a40
> [    2.240219][    T0]  ? memset+0x24/0x50
> [    2.240782][    T0]  ? blake2s_final+0x110/0x140
> [    2.241426][    T0]  ? blake2s+0x115/0x150
> [    2.242143][    T0]  ? wait_for_random_bytes+0xd0/0xd0
> [    2.242859][    T0]  ? mas_mab_cp+0x2f6/0x890
> [    2.243527][    T0]  ? memset+0x24/0x50
> [    2.244122][    T0]  ? find_held_lock+0x2c/0x110
> [    2.244803][    T0]  ? mas_store_b_node+0x54c/0x1180
> [    2.245510][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
> [    2.246282][    T0]  mas_wr_bnode+0x14f/0x1d0
> [    2.246902][    T0]  ? mas_commit_b_node+0x600/0x600
> [    2.247677][    T0]  ? secondary_startup_64_no_verify+0xe0/0xeb
> [    2.248567][    T0]  ? ___slab_alloc+0x70b/0xe00
> [    2.249251][    T0]  ? mas_wr_store_entry+0x2e9/0xe30
> [    2.250088][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
> [    2.250864][    T0]  mas_store_gfp+0xc2/0x190
> [    2.251516][    T0]  ? mtree_erase+0x100/0x100
> [    2.252190][    T0]  ? lockdep_init_map_type+0x2c7/0x780
> [    2.252924][    T0]  irq_insert_desc+0xac/0xf0
> [    2.253562][    T0]  ? irq_kobj_release+0x100/0x100
> [    2.254243][    T0]  early_irq_init+0x81/0x8c
> [    2.254866][    T0]  start_kernel+0x1c7/0x3a4
> [    2.255479][    T0]  secondary_startup_64_no_verify+0xe0/0xeb

This triggers because __kmem_cache_alloc_bulk() uses
local_lock_irq()/local_unlock_irq(). It seems nobody has used it during
the early boot stage yet, though the maple tree conversion of the
interrupt descriptor storage, which is the purpose of the patch in
question, makes that happen.
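
As a sketch (not actual kernel code) of the difference that matters:
during early boot interrupts are still disabled and the lockdep flag
early_boot_irqs_disabled is still set, so:

        unsigned long flags;

        local_irq_disable();       /* what the bulk path effectively does */
        /* ... */
        local_irq_enable();        /* unconditionally on: lockdep splat */

        local_irq_save(flags);     /* what the fix switches to */
        /* ... */
        local_irq_restore(flags);  /* interrupts stay off: no splat */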

Fix below.

Thanks,

        tglx
---
Subject: mm, slub: Take slab lock with irqsave()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 01 Feb 2023 14:14:00 +0100

<Add blurb>

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 mm/slub.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3913,6 +3913,7 @@ static inline int __kmem_cache_alloc_bul
 			size_t size, void **p, struct obj_cgroup *objcg)
 {
 	struct kmem_cache_cpu *c;
+	unsigned long irqflags;
 	int i;
 
 	/*
@@ -3921,7 +3922,7 @@ static inline int __kmem_cache_alloc_bul
 	 * handlers invoking normal fastpath.
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
-	local_lock_irq(&s->cpu_slab->lock);
+	local_lock_irqsave(&s->cpu_slab->lock, irqflags);
 
 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3942,7 +3943,7 @@ static inline int __kmem_cache_alloc_bul
 			 */
 			c->tid = next_tid(c->tid);
 
-			local_unlock_irq(&s->cpu_slab->lock);
+			local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
 
 			/*
 			 * Invoking slow path likely have side-effect
@@ -3956,7 +3957,7 @@ static inline int __kmem_cache_alloc_bul
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);
 
-			local_lock_irq(&s->cpu_slab->lock);
+			local_lock_irqsave(&s->cpu_slab->lock, irqflags);
 
 			continue; /* goto for-loop */
 		}
@@ -3965,7 +3966,7 @@ static inline int __kmem_cache_alloc_bul
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_unlock_irq(&s->cpu_slab->lock);
+	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
 	slub_put_cpu_ptr(s->cpu_slab);
 
 	return i;



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-02-01 13:27     ` Thomas Gleixner
@ 2023-02-06 14:24       ` Vlastimil Babka
  2023-02-06 18:10         ` Thomas Gleixner
  2023-02-07 10:30         ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Vlastimil Babka @ 2023-02-06 14:24 UTC (permalink / raw)
  To: Thomas Gleixner, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox

On 2/1/23 14:27, Thomas Gleixner wrote:
> On Wed, Feb 01 2023 at 14:02, kernel test robot wrote:
>> FYI, we noticed WARNING:at_kernel/locking/lockdep.c:#lockdep_hardirqs_on_prepare due to commit (built with gcc-11):
>>
>> commit: 02fb8013ee5f9b7d7bc35d54bf8bc5fe1179970c ("[PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management")
>> url: https://github.com/intel-lab-lkp/linux/commits/Shanker-Donthineni/genirq-Use-hlist-for-managing-resend-handlers/20230130-085956
>> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 188a569658584e93930ab60334c5a1079c0330d8
>> patch link: https://lore.kernel.org/all/20230130005725.3517597-6-sdonthineni@nvidia.com/
>> patch subject: [PATCH 5/5] genirq: Use the maple tree for IRQ
>> descriptors management
> 
>> [    2.214554][    T0] ------------[ cut here ]------------
>> [    2.215401][    T0] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
>> [    2.215446][    T0] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4308 lockdep_hardirqs_on_prepare+0x2d4/0x350
>> [    2.217975][    T0] Modules linked in:
>> [    2.218526][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc3-00015-g02fb8013ee5f #1
>> [    2.219803][    T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [    2.221228][    T0] RIP: 0010:lockdep_hardirqs_on_prepare+0x2d4/0x350
>> [    2.222207][    T0] Code: 11 38 d0 7c 04 84 d2 75 5e 8b 0d bf 8b f7 03 85 c9 0f 85 c9 fe ff ff 48 c7 c6 40 7d a9 83 48 c7 c7 60 4e a9 83 e8 60 7c 35 02 <0f> 0b e9 af fe ff ff e8 50 8d 62 00 e9 0c fe ff ff e8 e6 8d 62 00
>> [    2.224923][    T0] RSP: 0000:ffffffff844075a0 EFLAGS: 00010082
>> [    2.225792][    T0] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
>> [    2.226889][    T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffffbfff0880ea6
>> [    2.227955][    T0] RBP: ffff8883af23fac0 R08: 0000000000000000 R09: ffffffff844072df
>> [    2.229068][    T0] R10: fffffbfff0880e5b R11: 0000000000000001 R12: 0000000000000002
>> [    2.230147][    T0] R13: 0000000000000002 R14: ffff88810022b018 R15: ffff88810022b010
>> [    2.231269][    T0] FS:  0000000000000000(0000) GS:ffff8883af200000(0000) knlGS:0000000000000000
>> [    2.232522][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    2.233395][    T0] CR2: ffff88843ffff000 CR3: 000000000442a000 CR4: 00000000000406b0
>> [    2.234504][    T0] Call Trace:
>> [    2.234941][    T0]  <TASK>
>> [    2.235345][    T0]  trace_hardirqs_on+0x40/0x140
>> [    2.236029][    T0]  __kmem_cache_alloc_bulk+0x22e/0x490
>> [    2.236795][    T0]  ? kasan_set_track+0x25/0x30
>> [    2.237470][    T0]  kmem_cache_alloc_bulk+0x159/0x2e0
>> [    2.238225][    T0]  mas_alloc_nodes+0x253/0x690
>> [    2.238886][    T0]  mas_split+0x30d/0x1580
>> [    2.239561][    T0]  ? mas_push_data+0x1a40/0x1a40
>> [    2.240219][    T0]  ? memset+0x24/0x50
>> [    2.240782][    T0]  ? blake2s_final+0x110/0x140
>> [    2.241426][    T0]  ? blake2s+0x115/0x150
>> [    2.242143][    T0]  ? wait_for_random_bytes+0xd0/0xd0
>> [    2.242859][    T0]  ? mas_mab_cp+0x2f6/0x890
>> [    2.243527][    T0]  ? memset+0x24/0x50
>> [    2.244122][    T0]  ? find_held_lock+0x2c/0x110
>> [    2.244803][    T0]  ? mas_store_b_node+0x54c/0x1180
>> [    2.245510][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
>> [    2.246282][    T0]  mas_wr_bnode+0x14f/0x1d0
>> [    2.246902][    T0]  ? mas_commit_b_node+0x600/0x600
>> [    2.247677][    T0]  ? secondary_startup_64_no_verify+0xe0/0xeb
>> [    2.248567][    T0]  ? ___slab_alloc+0x70b/0xe00
>> [    2.249251][    T0]  ? mas_wr_store_entry+0x2e9/0xe30
>> [    2.250088][    T0]  ? rcu_read_lock_sched_held+0x16/0x80
>> [    2.250864][    T0]  mas_store_gfp+0xc2/0x190
>> [    2.251516][    T0]  ? mtree_erase+0x100/0x100
>> [    2.252190][    T0]  ? lockdep_init_map_type+0x2c7/0x780
>> [    2.252924][    T0]  irq_insert_desc+0xac/0xf0
>> [    2.253562][    T0]  ? irq_kobj_release+0x100/0x100
>> [    2.254243][    T0]  early_irq_init+0x81/0x8c
>> [    2.254866][    T0]  start_kernel+0x1c7/0x3a4
>> [    2.255479][    T0]  secondary_startup_64_no_verify+0xe0/0xeb
> 
> This triggers because __kmem_cache_alloc_bulk() uses
> local_lock_irq()/local_unlock_irq(). It seems nobody has used it during
> the early boot stage yet, though the maple tree conversion of the
> interrupt descriptor storage, which is the purpose of the patch in
> question, makes that happen.
> 
> Fix below.

Looks like it should work. But I think we also need to adjust SLAB's
mm/slab.c kmem_cache_alloc_bulk() which does local_irq_disable(); /
local_irq_enable(); right?

Also if we enter this with IRQs disabled, then we should take care of
the proper gfp flags. Looking at the patch [1] I see

WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);

so GFP_KERNEL would be wrong with irqs disabled, looks like a case for
GFP_ATOMIC.
OTOH I can see the thing it replaces was:

static RADIX_TREE(irq_desc_tree, GFP_KERNEL);

so that's also a GFP_KERNEL and we haven't seen debug splats from
might_alloc() checks before in this code? That's weird, or maybe the case
of "we didn't enable irqs yet on this cpu being bootstrapped" is handled
differently than "we have temporarily disabled irqs"? Sure, during early
boot we should have all the memory and no need to reclaim...


[1]
https://lore.kernel.org/all/20230130005725.3517597-6-sdonthineni@nvidia.com/#t

> Thanks,
> 
>         tglx
> ---
> Subject: mm, slub: Take slab lock with irqsave()
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 01 Feb 2023 14:14:00 +0100
> 
> <Add blurb>

Will you add the blurb, and the SLAB part, or should I? And once done,
should I put it in the slab tree for 6.3, or do you want to make it part
of the series so it's not blocked?

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  mm/slub.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3913,6 +3913,7 @@ static inline int __kmem_cache_alloc_bul
>  			size_t size, void **p, struct obj_cgroup *objcg)
>  {
>  	struct kmem_cache_cpu *c;
> +	unsigned long irqflags;
>  	int i;
>  
>  	/*
> @@ -3921,7 +3922,7 @@ static inline int __kmem_cache_alloc_bul
>  	 * handlers invoking normal fastpath.
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
> -	local_lock_irq(&s->cpu_slab->lock);
> +	local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  	for (i = 0; i < size; i++) {
>  		void *object = kfence_alloc(s, s->object_size, flags);
> @@ -3942,7 +3943,7 @@ static inline int __kmem_cache_alloc_bul
>  			 */
>  			c->tid = next_tid(c->tid);
>  
> -			local_unlock_irq(&s->cpu_slab->lock);
> +			local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
>  
>  			/*
>  			 * Invoking slow path likely have side-effect
> @@ -3956,7 +3957,7 @@ static inline int __kmem_cache_alloc_bul
>  			c = this_cpu_ptr(s->cpu_slab);
>  			maybe_wipe_obj_freeptr(s, p[i]);
>  
> -			local_lock_irq(&s->cpu_slab->lock);
> +			local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  			continue; /* goto for-loop */
>  		}
> @@ -3965,7 +3966,7 @@ static inline int __kmem_cache_alloc_bul
>  		maybe_wipe_obj_freeptr(s, p[i]);
>  	}
>  	c->tid = next_tid(c->tid);
> -	local_unlock_irq(&s->cpu_slab->lock);
> +	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
>  	slub_put_cpu_ptr(s->cpu_slab);
>  
>  	return i;
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-02-06 14:24       ` Vlastimil Babka
@ 2023-02-06 18:10         ` Thomas Gleixner
  2023-02-07 10:30         ` Thomas Gleixner
  1 sibling, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-06 18:10 UTC (permalink / raw)
  To: Vlastimil Babka, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox

On Mon, Feb 06 2023 at 15:24, Vlastimil Babka wrote:
> On 2/1/23 14:27, Thomas Gleixner wrote:
>> This triggers because __kmem_cache_alloc_bulk() uses
>> local_lock_irq()/local_unlock_irq(). It seems nobody has used it during
>> the early boot stage yet, though the maple tree conversion of the
>> interrupt descriptor storage, which is the purpose of the patch in
>> question, makes that happen.
>> 
>> Fix below.
>
> Looks like it should work. But I think we also need to adjust SLAB's
> mm/slab.c kmem_cache_alloc_bulk() which does local_irq_disable(); /
> local_irq_enable(); right?

Yup.

> Also if we enter this with IRQs disabled, then we should take care of
> the proper gfp flags. Looking at the patch [1] I see
>
> WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);
>
> so GFP_KERNEL would be wrong with irqs disabled, looks like a case for
> GFP_ATOMIC.
> OTOH I can see the thing it replaces was:
>
> static RADIX_TREE(irq_desc_tree, GFP_KERNEL);
>
> so that's also a GFP_KERNEL and we haven't seen debug splats from
> might_alloc() checks before in this code? That's weird, or maybe the
> case

might_alloc()
  might_sleep_if()
    __might_sleep()
      WARN_ON(task->state != RUNNING);  <- Does not trigger
      __might_resched()
        if (.... || system_state == SYSTEM_BOOTING || ...)
           return;

As system_state is SYSTEM_BOOTING at this point the splats are not
happening.

> of "we didn't enable irqs yet on this cpu being bootstrapped" is handled
> differently than "we have temporarily disabled irqs"? Sure, during early
> boot we should have all the memory and no need to reclaim...

The point is that interrupts are fully disabled during early boot and
there is no scheduler yet, so no scheduling is possible.

Quite some code in the kernel relies on GFP_KERNEL being functional
during that early boot stage. If the kernel runs out of memory that
early, then the chance of recovery is exactly 0.
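
The case at hand is exactly that. Simplified from the series quoted
earlier (the tree and variable names here are illustrative):

        /* early_irq_init() -> irq_insert_desc(irq, desc):
           interrupts still disabled, yet GFP_KERNEL works */
        MA_STATE(mas, &sparse_irqs, irq, irq);

        WARN_ON(mas_store_gfp(&mas, desc, GFP_KERNEL) != 0);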

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ
  2023-01-31 16:41     ` Shanker Donthineni
@ 2023-02-07 10:28       ` Thomas Gleixner
  0 siblings, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-07 10:28 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel

On Tue, Jan 31 2023 at 10:41, Shanker Donthineni wrote:
> On 1/31/23 03:16, Thomas Gleixner wrote:
>> That's not what I meant. I was pondering to remove !SPARSEIRQ
>> completely.
>> 
> Dropping CONFIG_SPARSE_IRQ=n support touches many files. I'm worried
> about removing !SPARSEIRQ without testing on legacy platforms.

This needs some thought.

> Can work on this after moving it to maple tree?

Yup. Just drop this patch.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs()
  2023-01-31 16:43     ` Shanker Donthineni
@ 2023-02-07 10:29       ` Thomas Gleixner
  0 siblings, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-07 10:29 UTC (permalink / raw)
  To: Shanker Donthineni, Marc Zyngier, Michael Walle
  Cc: Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-kernel

On Tue, Jan 31 2023 at 10:43, Shanker Donthineni wrote:
> On 1/31/23 03:35, Thomas Gleixner wrote:
>>> +static int irq_expand_nr_irqs(unsigned int nr)
>>> +{
>>> +     if (nr > MAX_SPARSE_IRQS)
>>> +             return -ENOMEM;
>>> +     nr_irqs = nr;
>>> +     return 0;
>>> +}
>> 
>> or preferrably something like this:
>> 
>>          if (!IS_ENABLED(CONFIG_SPARSEIRQ) || nr > MAX_SPARSE_IRQS)
>>                  return -ENOMEM;
>> 
>> which makes it entirely clear and also allows the compiler to optimize
>> it down to a 'return -ENOMEM'.
>> 
> I'll drop this patch since you're suggesting to remove !SPARSEIRQ support.

Sometime in the future, once I've analyzed what the implications are. So
just keep it and make it readable.
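
A sketch of the helper with the above folded in (using the Kconfig
symbol CONFIG_SPARSE_IRQ; MAX_SPARSE_IRQS is the limit introduced by
the series):

        static int irq_expand_nr_irqs(unsigned int nr)
        {
                if (!IS_ENABLED(CONFIG_SPARSE_IRQ) || nr > MAX_SPARSE_IRQS)
                        return -ENOMEM;
                nr_irqs = nr;
                return 0;
        }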

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management
  2023-02-06 14:24       ` Vlastimil Babka
  2023-02-06 18:10         ` Thomas Gleixner
@ 2023-02-07 10:30         ` Thomas Gleixner
  2023-02-07 14:16           ` mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-07 10:30 UTC (permalink / raw)
  To: Vlastimil Babka, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox

On Mon, Feb 06 2023 at 15:24, Vlastimil Babka wrote:
> On 2/1/23 14:27, Thomas Gleixner wrote:
>> Subject: mm, slub: Take slab lock with irqsave()
>> From: Thomas Gleixner <tglx@linutronix.de>
>> Date: Wed, 01 Feb 2023 14:14:00 +0100
>> 
>> <Add blurb>
>
> Will you add the blurb, and the SLAB part, or should I? And once done,
> should I put it in the slab tree for 6.3, or do you want to make it part
> of the series so it's not blocked?

Ooops. I missed that part. Let me add slab and blurb and send it as a
proper patch. Just take it into the slab tree. The maple tree conversion
still has some issues, so I don't expect it to be 6.3 material.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 10:30         ` Thomas Gleixner
@ 2023-02-07 14:16           ` Thomas Gleixner
  2023-02-07 14:45             ` Vlastimil Babka
  2023-02-08 13:20             ` Hyeonggon Yoo
  0 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-07 14:16 UTC (permalink / raw)
  To: Vlastimil Babka, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox

The memory allocators are available during early boot even in the phase
where interrupts are disabled and scheduling is not yet possible.

The setup is such that GFP_KERNEL allocations work in this phase without
causing might_alloc() splats to be emitted, because the system state is
SYSTEM_BOOTING at that point, which prevents the warnings from triggering.

Most allocation/free functions use local_irq_save()/restore() or a lock
variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use
local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when
interrupts are enabled during the early boot phase.

This went unnoticed so far because there are no early users of these
interfaces. The upcoming conversion of the interrupt descriptor store from
radix_tree to maple_tree triggered this warning as maple_tree uses the bulk
interface.

Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and
SLAB to local[_lock]_irq_save()/restore().

There is obviously no reclaim possible or required at this point, so there
is no need to expand this coverage further.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
Initial version: https://lore.kernel.org/r/87o7qdzfay.ffs@tglx
Changes: Update SLAB as well and add changelog
---
 mm/slab.c |   18 ++++++++++--------
 mm/slub.c |    9 +++++----
 2 files changed, 15 insertions(+), 12 deletions(-)

--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3479,14 +3479,15 @@ cache_alloc_debugcheck_after_bulk(struct
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
-	size_t i;
 	struct obj_cgroup *objcg = NULL;
+	unsigned long irqflags;
+	size_t i;
 
 	s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags);
 	if (!s)
 		return 0;
 
-	local_irq_disable();
+	local_irq_save(irqflags);
 	for (i = 0; i < size; i++) {
 		void *objp = kfence_alloc(s, s->object_size, flags) ?:
 			     __do_cache_alloc(s, flags, NUMA_NO_NODE);
@@ -3495,7 +3496,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 			goto error;
 		p[i] = objp;
 	}
-	local_irq_enable();
+	local_irq_restore(irqflags);
 
 	cache_alloc_debugcheck_after_bulk(s, flags, size, p, _RET_IP_);
 
@@ -3508,7 +3509,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	/* FIXME: Trace call missing. Christoph would like a bulk variant */
 	return size;
 error:
-	local_irq_enable();
+	local_irq_restore(irqflags);
 	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
 	kmem_cache_free_bulk(s, i, p);
@@ -3610,8 +3611,9 @@ EXPORT_SYMBOL(kmem_cache_free);
 
 void kmem_cache_free_bulk(struct kmem_cache *orig_s, size_t size, void **p)
 {
+	unsigned long flags;
 
-	local_irq_disable();
+	local_irq_save(flags);
 	for (int i = 0; i < size; i++) {
 		void *objp = p[i];
 		struct kmem_cache *s;
@@ -3621,9 +3623,9 @@ void kmem_cache_free_bulk(struct kmem_ca
 
 			/* called via kfree_bulk */
 			if (!folio_test_slab(folio)) {
-				local_irq_enable();
+				local_irq_restore(flags);
 				free_large_kmalloc(folio, objp);
-				local_irq_disable();
+				local_irq_save(flags);
 				continue;
 			}
 			s = folio_slab(folio)->slab_cache;
@@ -3640,7 +3642,7 @@ void kmem_cache_free_bulk(struct kmem_ca
 
 		__cache_free(s, objp, _RET_IP_);
 	}
-	local_irq_enable();
+	local_irq_restore(flags);
 
 	/* FIXME: add tracing */
 }
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3913,6 +3913,7 @@ static inline int __kmem_cache_alloc_bul
 			size_t size, void **p, struct obj_cgroup *objcg)
 {
 	struct kmem_cache_cpu *c;
+	unsigned long irqflags;
 	int i;
 
 	/*
@@ -3921,7 +3922,7 @@ static inline int __kmem_cache_alloc_bul
 	 * handlers invoking normal fastpath.
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
-	local_lock_irq(&s->cpu_slab->lock);
+	local_lock_irqsave(&s->cpu_slab->lock, irqflags);
 
 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3942,7 +3943,7 @@ static inline int __kmem_cache_alloc_bul
 			 */
 			c->tid = next_tid(c->tid);
 
-			local_unlock_irq(&s->cpu_slab->lock);
+			local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
 
 			/*
 			 * Invoking slow path likely have side-effect
@@ -3956,7 +3957,7 @@ static inline int __kmem_cache_alloc_bul
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);
 
-			local_lock_irq(&s->cpu_slab->lock);
+			local_lock_irqsave(&s->cpu_slab->lock, irqflags);
 
 			continue; /* goto for-loop */
 		}
@@ -3965,7 +3966,7 @@ static inline int __kmem_cache_alloc_bul
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_unlock_irq(&s->cpu_slab->lock);
+	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
 	slub_put_cpu_ptr(s->cpu_slab);
 
 	return i;

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 14:16           ` mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early Thomas Gleixner
@ 2023-02-07 14:45             ` Vlastimil Babka
  2023-02-07 14:47               ` Vlastimil Babka
  2023-02-08 13:20             ` Hyeonggon Yoo
  1 sibling, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2023-02-07 14:45 UTC (permalink / raw)
  To: Thomas Gleixner, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox, David Rientjes,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin

On 2/7/23 15:16, Thomas Gleixner wrote:
> The memory allocators are available during early boot even in the phase
> where interrupts are disabled and scheduling is not yet possible.
> 
> The setup is such that GFP_KERNEL allocations work in this phase without
> causing might_alloc() splats to be emitted, because the system state is
> SYSTEM_BOOTING at that point, which prevents the warnings from triggering.
> 
> Most allocation/free functions use local_irq_save()/restore() or a lock
> variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use
> local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when
> interrupts are enabled during the early boot phase.
> 
> This went unnoticed so far because there are no early users of these
> interfaces. The upcoming conversion of the interrupt descriptor store from
> radix_tree to maple_tree triggered this warning as maple_tree uses the bulk
> interface.
> 
> Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and
> SLAB to local[_lock]_irq_save()/restore().
> 
> There is obviously no reclaim possible or required at this point, so there
> is no need to expand this coverage further.
> 
> No functional change.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

+Cc rest of slab folks

Thanks, added to slab/for-6.3/fixes

> ---
> Initial version: https://lore.kernel.org/r/87o7qdzfay.ffs@tglx
> Changes: Update SLAB as well and add changelog
> ---
>  mm/slab.c |   18 ++++++++++--------
>  mm/slub.c |    9 +++++----
>  2 files changed, 15 insertions(+), 12 deletions(-)
> 
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3479,14 +3479,15 @@ cache_alloc_debugcheck_after_bulk(struct
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  			  void **p)
>  {
> -	size_t i;
>  	struct obj_cgroup *objcg = NULL;
> +	unsigned long irqflags;
> +	size_t i;
>  
>  	s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags);
>  	if (!s)
>  		return 0;
>  
> -	local_irq_disable();
> +	local_irq_save(irqflags);
>  	for (i = 0; i < size; i++) {
>  		void *objp = kfence_alloc(s, s->object_size, flags) ?:
>  			     __do_cache_alloc(s, flags, NUMA_NO_NODE);
> @@ -3495,7 +3496,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
>  			goto error;
>  		p[i] = objp;
>  	}
> -	local_irq_enable();
> +	local_irq_restore(irqflags);
>  
>  	cache_alloc_debugcheck_after_bulk(s, flags, size, p, _RET_IP_);
>  
> @@ -3508,7 +3509,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
>  	/* FIXME: Trace call missing. Christoph would like a bulk variant */
>  	return size;
>  error:
> -	local_irq_enable();
> +	local_irq_restore(irqflags);
>  	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
>  	slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
>  	kmem_cache_free_bulk(s, i, p);
> @@ -3610,8 +3611,9 @@ EXPORT_SYMBOL(kmem_cache_free);
>  
>  void kmem_cache_free_bulk(struct kmem_cache *orig_s, size_t size, void **p)
>  {
> +	unsigned long flags;
>  
> -	local_irq_disable();
> +	local_irq_save(flags);
>  	for (int i = 0; i < size; i++) {
>  		void *objp = p[i];
>  		struct kmem_cache *s;
> @@ -3621,9 +3623,9 @@ void kmem_cache_free_bulk(struct kmem_ca
>  
>  			/* called via kfree_bulk */
>  			if (!folio_test_slab(folio)) {
> -				local_irq_enable();
> +				local_irq_restore(flags);
>  				free_large_kmalloc(folio, objp);
> -				local_irq_disable();
> +				local_irq_save(flags);
>  				continue;
>  			}
>  			s = folio_slab(folio)->slab_cache;
> @@ -3640,7 +3642,7 @@ void kmem_cache_free_bulk(struct kmem_ca
>  
>  		__cache_free(s, objp, _RET_IP_);
>  	}
> -	local_irq_enable();
> +	local_irq_restore(flags);
>  
>  	/* FIXME: add tracing */
>  }
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3913,6 +3913,7 @@ static inline int __kmem_cache_alloc_bul
>  			size_t size, void **p, struct obj_cgroup *objcg)
>  {
>  	struct kmem_cache_cpu *c;
> +	unsigned long irqflags;
>  	int i;
>  
>  	/*
> @@ -3921,7 +3922,7 @@ static inline int __kmem_cache_alloc_bul
>  	 * handlers invoking normal fastpath.
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
> -	local_lock_irq(&s->cpu_slab->lock);
> +	local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  	for (i = 0; i < size; i++) {
>  		void *object = kfence_alloc(s, s->object_size, flags);
> @@ -3942,7 +3943,7 @@ static inline int __kmem_cache_alloc_bul
>  			 */
>  			c->tid = next_tid(c->tid);
>  
> -			local_unlock_irq(&s->cpu_slab->lock);
> +			local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
>  
>  			/*
>  			 * Invoking slow path likely have side-effect
> @@ -3956,7 +3957,7 @@ static inline int __kmem_cache_alloc_bul
>  			c = this_cpu_ptr(s->cpu_slab);
>  			maybe_wipe_obj_freeptr(s, p[i]);
>  
> -			local_lock_irq(&s->cpu_slab->lock);
> +			local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  			continue; /* goto for-loop */
>  		}
> @@ -3965,7 +3966,7 @@ static inline int __kmem_cache_alloc_bul
>  		maybe_wipe_obj_freeptr(s, p[i]);
>  	}
>  	c->tid = next_tid(c->tid);
> -	local_unlock_irq(&s->cpu_slab->lock);
> +	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
>  	slub_put_cpu_ptr(s->cpu_slab);
>  
>  	return i;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 14:45             ` Vlastimil Babka
@ 2023-02-07 14:47               ` Vlastimil Babka
  2023-02-07 18:20                 ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2023-02-07 14:47 UTC (permalink / raw)
  To: Thomas Gleixner, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox, David Rientjes,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin

On 2/7/23 15:45, Vlastimil Babka wrote:
> On 2/7/23 15:16, Thomas Gleixner wrote:
>> The memory allocators are available during early boot even in the phase
>> where interrupts are disabled and scheduling is not yet possible.
>> 
>> The setup is such that GFP_KERNEL allocations work in this phase without
>> causing might_alloc() splats to be emitted, because the system state is
>> SYSTEM_BOOTING at that point, which prevents the warnings from triggering.
>> 
>> Most allocation/free functions use local_irq_save()/restore() or a lock
>> variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use
>> local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when
>> interrupts are enabled during the early boot phase.
>> 
>> This went unnoticed so far because there are no early users of these
>> interfaces. The upcoming conversion of the interrupt descriptor store from
>> radix_tree to maple_tree triggered this warning as maple_tree uses the bulk
>> interface.
>> 
>> Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and
>> SLAB to local[_lock]_irq_save()/restore().
>> 
>> There is obviously no reclaim possible or required at this point, so there
>> is no need to expand this coverage further.
>> 
>> No functional change.
>> 
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> +Cc rest of slab folks
> 
> Thanks, added to slab/for-6.3/fixes

After your patch, I think it also makes sense to do the following:
----8<----
From 340d7c7b99f3e67780f6dec480ed1d27e6f325eb Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 7 Feb 2023 15:34:53 +0100
Subject: [PATCH] mm, slab/slub: remove notes that bulk alloc/free needs
 interrupts enabled

The slab functions kmem_cache_[alloc|free]_bulk() have been documented
as requiring interrupts to be enabled, since their addition in 2015.
It's unclear whether that was a fundamental restriction, or an attempt
to save some cpu cycles by not having to save and restore the irq flags.

However, it appears that most of the code involved was/became safe to be
called with interrupts disabled, and the remaining bits were fixed by
commit f244b0182b8e ("mm, slab/slub: Ensure kmem_cache_alloc_bulk() is
available early"). While the commit was aimed at the early boot scenario,
we can now also remove the documented restrictions for any
interrupt-disabled scenario.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h | 2 --
 mm/slub.c            | 2 --
 2 files changed, 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..ea439b4e2b34 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -481,8 +481,6 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
  * metadata structures unnecessarily.
- *
- * Note that interrupts must be enabled when calling these functions.
  */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
diff --git a/mm/slub.c b/mm/slub.c
index c16d78698e3f..23b3fb86045d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3889,7 +3889,6 @@ int build_detached_freelist(struct kmem_cache *s, size_t size,
 	return same;
 }
 
-/* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	if (!size)
@@ -4009,7 +4008,6 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 }
 #endif /* CONFIG_SLUB_TINY */
 
-/* Note that interrupts must be enabled when calling this function. */
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)
 {
-- 
2.39.1




^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 14:47               ` Vlastimil Babka
@ 2023-02-07 18:20                 ` Thomas Gleixner
  2023-02-08  9:15                   ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-07 18:20 UTC (permalink / raw)
  To: Vlastimil Babka, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox, David Rientjes,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin

On Tue, Feb 07 2023 at 15:47, Vlastimil Babka wrote:
> From 340d7c7b99f3e67780f6dec480ed1d27e6f325eb Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Tue, 7 Feb 2023 15:34:53 +0100
> Subject: [PATCH] mm, slab/slub: remove notes that bulk alloc/free needs
>  interrupts enabled
>
> The slab functions kmem_cache_[alloc|free]_bulk() have been documented
> as requiring interrupts to be enabled, since their addition in 2015.
> It's unclear whether that was a fundamental restriction, or an attempt
> to save some cpu cycles by not having to save and restore the irq
> flags.

I don't think so. The restriction is rather meant to avoid huge
allocations in atomic context, which cause latencies and also might
deplete the atomic reserves.

So I'd rather avoid that and enforce !ATOMIC mode despite the
local_irq_save()/restore() change, which is really only there to
accommodate early boot.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 18:20                 ` Thomas Gleixner
@ 2023-02-08  9:15                   ` Vlastimil Babka
  2023-02-08 20:46                     ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2023-02-08  9:15 UTC (permalink / raw)
  To: Thomas Gleixner, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox, David Rientjes,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin

On 2/7/23 19:20, Thomas Gleixner wrote:
> On Tue, Feb 07 2023 at 15:47, Vlastimil Babka wrote:
>> From 340d7c7b99f3e67780f6dec480ed1d27e6f325eb Mon Sep 17 00:00:00 2001
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Tue, 7 Feb 2023 15:34:53 +0100
>> Subject: [PATCH] mm, slab/slub: remove notes that bulk alloc/free needs
>>  interrupts enabled
>>
>> The slab functions kmem_cache_[alloc|free]_bulk() have been documented
>> as requiring interrupts to be enabled, since their addition in 2015.
>> It's unclear whether that was a fundamental restriction, or an attempt
>> to save some cpu cycles by not having to save and restore the irq
>> flags.
> 
> I don't think so. The restriction is rather meant to avoid huge
> allocations in atomic context, which cause latencies and also might
> deplete the atomic reserves.

Fair enough.

> So I'd rather avoid that and enforce !ATOMIC mode despite the
> local_irq_save()/restore() change, which is really only there to
> accommodate early boot.

We could add some warning then? People might use the bulk alloc unknowingly
again e.g. via maple tree. GFP_KERNEL would warn through the existing
warning, but e.g. GFP_ATOMIC currently would not.
Some maple tree users could use its preallocation instead outside of the
atomic context, when possible.
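
Something like this, perhaps (an untested sketch; the exact condition
would need more thought):

        /* hypothetical guard at the top of kmem_cache_alloc_bulk() */
        if (system_state >= SYSTEM_SCHEDULING)
                WARN_ON_ONCE(irqs_disabled());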

> Thanks,
> 
>         tglx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-07 14:16           ` mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early Thomas Gleixner
  2023-02-07 14:45             ` Vlastimil Babka
@ 2023-02-08 13:20             ` Hyeonggon Yoo
  1 sibling, 0 replies; 33+ messages in thread
From: Hyeonggon Yoo @ 2023-02-08 13:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vlastimil Babka, kernel test robot, Shanker Donthineni, oe-lkp,
	lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox

On Tue, Feb 07, 2023 at 03:16:53PM +0100, Thomas Gleixner wrote:
> The memory allocators are available during early boot even in the phase
> where interrupts are disabled and scheduling is not yet possible.
> 
> The setup is such that GFP_KERNEL allocations work in this phase without
> causing might_alloc() splats to be emitted, because the system state is
> SYSTEM_BOOTING at that point, which prevents the warnings from triggering.
> 
> Most allocation/free functions use local_irq_save()/restore() or a lock
> variant of that. But kmem_cache_alloc_bulk() and kmem_cache_free_bulk() use
> local_[lock]_irq_disable()/enable(), which leads to a lockdep warning when
> interrupts are enabled during the early boot phase.
> 
> This went unnoticed so far because there are no early users of these
> interfaces. The upcoming conversion of the interrupt descriptor store from
> radix_tree to maple_tree triggered this warning as maple_tree uses the bulk
> interface.
> 
> Cure this by moving the kmem_cache_alloc/free() bulk variants of SLUB and
> SLAB to local[_lock]_irq_save()/restore().
> 
> There is obviously no reclaim possible or required at this point, so there
> is no need to expand this coverage further.
> 
> No functional change.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> Initial version: https://lore.kernel.org/r/87o7qdzfay.ffs@tglx
> Changes: Update SLAB as well and add changelog
> ---
>  mm/slab.c |   18 ++++++++++--------
>  mm/slub.c |    9 +++++----
>  2 files changed, 15 insertions(+), 12 deletions(-)
> 
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3479,14 +3479,15 @@ cache_alloc_debugcheck_after_bulk(struct
>  int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  			  void **p)
>  {
> -	size_t i;
>  	struct obj_cgroup *objcg = NULL;
> +	unsigned long irqflags;
> +	size_t i;
>  
>  	s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags);
>  	if (!s)
>  		return 0;
>  
> -	local_irq_disable();
> +	local_irq_save(irqflags);
>  	for (i = 0; i < size; i++) {
>  		void *objp = kfence_alloc(s, s->object_size, flags) ?:
>  			     __do_cache_alloc(s, flags, NUMA_NO_NODE);
> @@ -3495,7 +3496,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
>  			goto error;
>  		p[i] = objp;
>  	}
> -	local_irq_enable();
> +	local_irq_restore(irqflags);
>  
>  	cache_alloc_debugcheck_after_bulk(s, flags, size, p, _RET_IP_);
>  
> @@ -3508,7 +3509,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
>  	/* FIXME: Trace call missing. Christoph would like a bulk variant */
>  	return size;
>  error:
> -	local_irq_enable();
> +	local_irq_restore(irqflags);
>  	cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_);
>  	slab_post_alloc_hook(s, objcg, flags, i, p, false, s->object_size);
>  	kmem_cache_free_bulk(s, i, p);
> @@ -3610,8 +3611,9 @@ EXPORT_SYMBOL(kmem_cache_free);
>  
>  void kmem_cache_free_bulk(struct kmem_cache *orig_s, size_t size, void **p)
>  {
> +	unsigned long flags;
>  
> -	local_irq_disable();
> +	local_irq_save(flags);
>  	for (int i = 0; i < size; i++) {
>  		void *objp = p[i];
>  		struct kmem_cache *s;
> @@ -3621,9 +3623,9 @@ void kmem_cache_free_bulk(struct kmem_ca
>  
>  			/* called via kfree_bulk */
>  			if (!folio_test_slab(folio)) {
> -				local_irq_enable();
> +				local_irq_restore(flags);
>  				free_large_kmalloc(folio, objp);
> -				local_irq_disable();
> +				local_irq_save(flags);
>  				continue;
>  			}
>  			s = folio_slab(folio)->slab_cache;
> @@ -3640,7 +3642,7 @@ void kmem_cache_free_bulk(struct kmem_ca
>  
>  		__cache_free(s, objp, _RET_IP_);
>  	}
> -	local_irq_enable();
> +	local_irq_restore(flags);
>  
>  	/* FIXME: add tracing */
>  }
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3913,6 +3913,7 @@ static inline int __kmem_cache_alloc_bul
>  			size_t size, void **p, struct obj_cgroup *objcg)
>  {
>  	struct kmem_cache_cpu *c;
> +	unsigned long irqflags;
>  	int i;
>  
>  	/*
> @@ -3921,7 +3922,7 @@ static inline int __kmem_cache_alloc_bul
>  	 * handlers invoking normal fastpath.
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
> -	local_lock_irq(&s->cpu_slab->lock);
> +	local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  	for (i = 0; i < size; i++) {
>  		void *object = kfence_alloc(s, s->object_size, flags);
> @@ -3942,7 +3943,7 @@ static inline int __kmem_cache_alloc_bul
>  			 */
>  			c->tid = next_tid(c->tid);
>  
> -			local_unlock_irq(&s->cpu_slab->lock);
> +			local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);
>  
>  			/*
>  			 * Invoking slow path likely have side-effect
> @@ -3956,7 +3957,7 @@ static inline int __kmem_cache_alloc_bul
>  			c = this_cpu_ptr(s->cpu_slab);
>  			maybe_wipe_obj_freeptr(s, p[i]);
>  
> -			local_lock_irq(&s->cpu_slab->lock);
> +			local_lock_irqsave(&s->cpu_slab->lock, irqflags);
>  
>  			continue; /* goto for-loop */
>  		}
> @@ -3965,7 +3966,7 @@ static inline int __kmem_cache_alloc_bul
>  		maybe_wipe_obj_freeptr(s, p[i]);
>  	}
>  	c->tid = next_tid(c->tid);
> -	local_unlock_irq(&s->cpu_slab->lock);
> +	local_unlock_irqrestore(&s->cpu_slab->lock, irqflags);

Looks good to me.

Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

>  	slub_put_cpu_ptr(s->cpu_slab);
>  
>  	return i;
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-08  9:15                   ` Vlastimil Babka
@ 2023-02-08 20:46                     ` Thomas Gleixner
  2023-02-09 20:28                       ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-08 20:46 UTC (permalink / raw)
  To: Vlastimil Babka, kernel test robot, Shanker Donthineni
  Cc: oe-lkp, lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, Matthew Wilcox, David Rientjes,
	Christoph Lameter, Pekka Enberg, Joonsoo Kim, Hyeonggon Yoo,
	Roman Gushchin, Matthew Wilcox

On Wed, Feb 08 2023 at 10:15, Vlastimil Babka wrote:

Cc+ Willy

> On 2/7/23 19:20, Thomas Gleixner wrote:
>> On Tue, Feb 07 2023 at 15:47, Vlastimil Babka wrote:
>>> From 340d7c7b99f3e67780f6dec480ed1d27e6f325eb Mon Sep 17 00:00:00 2001
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Date: Tue, 7 Feb 2023 15:34:53 +0100
>>> Subject: [PATCH] mm, slab/slub: remove notes that bulk alloc/free needs
>>>  interrupts enabled
>>>
>>> The slab functions kmem_cache_[alloc|free]_bulk() have been documented
>>> as requiring interrupts to be enabled, since their addition in 2015.
>>> It's unclear whether that was a fundamental restriction, or an attempt
>>> to save some cpu cycles by not having to save and restore the irq
>>> flags.
>> 
>> I don't think so. The restriction is rather meant to avoid huge
>> allocations in atomic context, which cause latencies and also might
>> deplete the atomic reserves.
>
> Fair enough.
>
>> So I'd rather avoid that and enforce !ATOMIC mode despite the
>> local_irq_save()/restore() change, which is really only there to
>> accommodate early boot.
>
> We could add some warning then? People might use the bulk alloc unknowingly
> again e.g. via maple tree. GFP_KERNEL would warn through the existing
> warning, but e.g. GFP_ATOMIC currently would not.

Correct.

> Some maple tree users could use its preallocation instead outside of the
> atomic context, when possible.

Right.

The issue is that there might be maple_tree users which depend on
GFP_ATOMIC, but call in from an interrupt-enabled context, which is
legitimate today.

Willy might have some insight on that.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-08 20:46                     ` Thomas Gleixner
@ 2023-02-09 20:28                       ` Matthew Wilcox
  2023-02-09 23:19                         ` Thomas Gleixner
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-02-09 20:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vlastimil Babka, kernel test robot, Shanker Donthineni, oe-lkp,
	lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, David Rientjes, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, Hyeonggon Yoo, Roman Gushchin

On Wed, Feb 08, 2023 at 09:46:30PM +0100, Thomas Gleixner wrote:
> On Wed, Feb 08 2023 at 10:15, Vlastimil Babka wrote:
> 
> Cc+ Willy
> 
> > On 2/7/23 19:20, Thomas Gleixner wrote:
> >> On Tue, Feb 07 2023 at 15:47, Vlastimil Babka wrote:
> >>> From 340d7c7b99f3e67780f6dec480ed1d27e6f325eb Mon Sep 17 00:00:00 2001
> >>> From: Vlastimil Babka <vbabka@suse.cz>
> >>> Date: Tue, 7 Feb 2023 15:34:53 +0100
> >>> Subject: [PATCH] mm, slab/slub: remove notes that bulk alloc/free needs
> >>>  interrupts enabled
> >>>
> >>> The slab functions kmem_cache_[alloc|free]_bulk() have been documented
> >>> as requiring interrupts to be enabled, since their addition in 2015.
> >>> It's unclear whether that was a fundamental restriction, or an attempt
> >>> to save some cpu cycles by not having to save and restore the irq
> >>> flags.
> >> 
> >> I don't think so. The restriction is rather meant to avoid huge
> >> allocations in atomic context, which cause latencies and also might
> >> deplete the atomic reserves.
> >
> > Fair enough.
> >
> >> So I'd rather avoid that and enforce !ATOMIC mode despite the
> >> local_irq_save()/restore() change, which is really only there to
> >> accommodate early boot.
> >
> > We could add some warning then? People might use the bulk alloc unknowingly
> > again e.g. via maple tree. GFP_KERNEL would warn through the existing
> > warning, but e.g. GFP_ATOMIC currently would not.
> 
> Correct.
> 
> > Some maple tree users could use its preallocation instead outside of the
> > atomic context, when possible.
> 
> Right.
> 
> The issue is that there might be maple_tree users which depend on
> GFP_ATOMIC, but call in from an interrupt-enabled context, which is
> legitimate today.
> 
> Willy might have some insight on that.

Not today, but eventually.  There are XArray users which modify the tree
in interrupt context or under some other spinlock that we can't drop
for them in order to do an allocation.  And I want to replace the radix
tree underpinnings of the XArray with the maple tree.  In my copious
spare time.
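
For illustration, a sketch of such a pattern (hypothetical user; the
xa_lock_irqsave()/__xa_store() pairing is the advanced XArray API for
contexts like these):

        DEFINE_XARRAY(busy_entries);    /* hypothetical */
        unsigned long flags;
        void *old;

        /* from an interrupt handler or under another spinlock */
        xa_lock_irqsave(&busy_entries, flags);
        old = __xa_store(&busy_entries, index, entry, GFP_ATOMIC);
        xa_unlock_irqrestore(&busy_entries, flags);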

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early
  2023-02-09 20:28                       ` Matthew Wilcox
@ 2023-02-09 23:19                         ` Thomas Gleixner
  0 siblings, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2023-02-09 23:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, kernel test robot, Shanker Donthineni, oe-lkp,
	lkp, linux-kernel, Marc Zyngier, Michael Walle,
	Sebastian Andrzej Siewior, Hans de Goede, Wolfram Sang, linux-mm,
	Liam R. Howlett, David Rientjes, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, Hyeonggon Yoo, Roman Gushchin

On Thu, Feb 09 2023 at 20:28, Matthew Wilcox wrote:
> On Wed, Feb 08, 2023 at 09:46:30PM +0100, Thomas Gleixner wrote:
>> The issue is that there might be maple_tree users which depend on
>> GFP_ATOMIC, but call in from an interrupt-enabled context, which is
>> legitimate today.
>> 
>> Willy might have some insight on that.
>
> Not today, but eventually.  There are XArray users which modify the tree
> in interrupt context or under some other spinlock that we can't drop
> for them in order to do an allocation.  And I want to replace the radix
> tree underpinnings of the XArray with the maple tree.  In my copious
> spare time.

If any usage of the kind you described, i.e. interrupt context or a
spinlock held where interrupts were disabled on acquisition of the lock,
ends up calling into kmem_cache_alloc_bulk() today, then that's broken,
because kmem_cache_alloc_bulk() reenables interrupts unconditionally.

So either such code does not exist as of today, or it just gets lucky
and never runs into the code path leading up to kmem_cache_alloc_bulk().

We have to clarify what the valid calling convention of
kmem_cache_alloc_bulk() is in the regular kernel context, i.e. outside
of early boot.
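
One possible wording, as a draft only (nothing like this is merged
anywhere):

        /*
         * kmem_cache_alloc_bulk() and kmem_cache_free_bulk() may be
         * called with interrupts disabled only during early boot.  In
         * regular operation, interrupts must be enabled; bulk
         * allocations from atomic context risk latencies and might
         * deplete the atomic reserves.
         */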

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-02-09 23:19 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-30  0:57 [PATCH 0/5] Increase the number of IRQ descriptors for SPARSEIRQ Shanker Donthineni
2023-01-30  0:57 ` [PATCH 1/5] genirq: Use hlist for managing resend handlers Shanker Donthineni
2023-01-31  8:59   ` Thomas Gleixner
2023-01-31 16:17     ` Shanker Donthineni
2023-01-31 17:06       ` Shanker Donthineni
2023-01-30  0:57 ` [PATCH 2/5] genirq: Allocate IRQ descriptors at boot time for !SPARSEIRQ Shanker Donthineni
2023-01-31  9:16   ` Thomas Gleixner
2023-01-31 16:41     ` Shanker Donthineni
2023-02-07 10:28       ` Thomas Gleixner
2023-01-30  0:57 ` [PATCH 3/5] genirq: Introduce two helper functions Shanker Donthineni
2023-01-31  9:20   ` Thomas Gleixner
2023-01-31 16:42     ` Shanker Donthineni
2023-01-30  0:57 ` [PATCH 4/5] genirq: Use the common function irq_expand_nr_irqs() Shanker Donthineni
2023-01-31  9:35   ` Thomas Gleixner
2023-01-31 16:43     ` Shanker Donthineni
2023-02-07 10:29       ` Thomas Gleixner
2023-01-30  0:57 ` [PATCH 5/5] genirq: Use the maple tree for IRQ descriptors management Shanker Donthineni
2023-01-31  9:52   ` Thomas Gleixner
2023-01-31 16:45     ` Shanker Donthineni
2023-02-01  6:02   ` kernel test robot
2023-02-01 13:27     ` Thomas Gleixner
2023-02-06 14:24       ` Vlastimil Babka
2023-02-06 18:10         ` Thomas Gleixner
2023-02-07 10:30         ` Thomas Gleixner
2023-02-07 14:16           ` mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early Thomas Gleixner
2023-02-07 14:45             ` Vlastimil Babka
2023-02-07 14:47               ` Vlastimil Babka
2023-02-07 18:20                 ` Thomas Gleixner
2023-02-08  9:15                   ` Vlastimil Babka
2023-02-08 20:46                     ` Thomas Gleixner
2023-02-09 20:28                       ` Matthew Wilcox
2023-02-09 23:19                         ` Thomas Gleixner
2023-02-08 13:20             ` Hyeonggon Yoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).