linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly
@ 2020-04-10  3:51 Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 2/9] sched: Avoid scale real weight down to zero Sasha Levin
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Sungbo Eo, Marc Zyngier, Sasha Levin, linux-arm-kernel

From: Sungbo Eo <mans0n@gorani.run>

[ Upstream commit 486562da598c59e9f835b551d7cf19507de2d681 ]

Enclose the chained handler with chained_irq_{enter,exit}(), so that the
muxed interrupts get properly acked.

This patch also fixes a reboot bug on OX820 SoC, where the jiffies timer
interrupt is never acked. The kernel waits a clock tick forever in
calibrate_delay_converge(), which leads to a boot hang.

Fixes: c41b16f8c9d9 ("ARM: integrator/versatile: consolidate FPGA IRQ handling code")
Signed-off-by: Sungbo Eo <mans0n@gorani.run>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200319023448.1479701-1-mans0n@gorani.run
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/irqchip/irq-versatile-fpga.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-versatile-fpga.c b/drivers/irqchip/irq-versatile-fpga.c
index 37dd4645bf188..66502bdefdf78 100644
--- a/drivers/irqchip/irq-versatile-fpga.c
+++ b/drivers/irqchip/irq-versatile-fpga.c
@@ -5,6 +5,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/irqchip.h>
+#include <linux/irqchip/chained_irq.h>
 #include <linux/irqchip/versatile-fpga.h>
 #include <linux/irqdomain.h>
 #include <linux/module.h>
@@ -67,12 +68,16 @@ static void fpga_irq_unmask(struct irq_data *d)
 
 static void fpga_irq_handle(struct irq_desc *desc)
 {
+	struct irq_chip *chip = irq_desc_get_chip(desc);
 	struct fpga_irq_data *f = irq_desc_get_handler_data(desc);
-	u32 status = readl(f->base + IRQ_STATUS);
+	u32 status;
+
+	chained_irq_enter(chip, desc);
 
+	status = readl(f->base + IRQ_STATUS);
 	if (status == 0) {
 		do_bad_IRQ(desc);
-		return;
+		goto out;
 	}
 
 	do {
@@ -81,6 +86,9 @@ static void fpga_irq_handle(struct irq_desc *desc)
 		status &= ~(1 << irq);
 		generic_handle_irq(irq_find_mapping(f->domain, irq));
 	} while (status);
+
+out:
+	chained_irq_exit(chip, desc);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 2/9] sched: Avoid scale real weight down to zero
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 3/9] selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault Sasha Levin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Michael Wang, Peter Zijlstra, Vincent Guittot, Sasha Levin

From: Michael Wang <yun.wang@linux.alibaba.com>

[ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ]

During our testing, we found a case that shares no longer
working correctly, the cgroup topology is like:

  /sys/fs/cgroup/cpu/A		(shares=102400)
  /sys/fs/cgroup/cpu/A/B	(shares=2)
  /sys/fs/cgroup/cpu/A/B/C	(shares=1024)

  /sys/fs/cgroup/cpu/D		(shares=1024)
  /sys/fs/cgroup/cpu/D/E	(shares=1024)
  /sys/fs/cgroup/cpu/D/E/F	(shares=1024)

The same benchmark is running in group C & F, no other tasks are
running, the benchmark is capable to consumed all the CPUs.

We suppose the group C will win more CPU resources since it could
enjoy all the shares of group A, but it's F who wins much more.

The reason is because we have group B with shares as 2, since
A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
so A->cfs_rq.load.weight become very small.

And in calc_group_shares() we calculate shares as:

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
  shares = (tg_shares * load) / tg_weight;

Since the 'cfs_rq->load.weight' is too small, the load become 0
after scale down, although 'tg_shares' is 102400, shares of the se
which stand for group A on root cfs_rq become 2.

While the se of D on root cfs_rq is far more bigger than 2, so it
wins the battle.

Thus when scale_load_down() scale real weight down to 0, it's no
longer telling the real story, the caller will have the wrong
information and the calculation will be buggy.

This patch add check in scale_load_down(), so the real weight will
be >= MIN_SHARES after scale, after applied the group C wins as
expected.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/sched.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15c08752926b0..819bd5fb02647 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -73,7 +73,13 @@ static inline void update_idle_core(struct rq *rq) { }
 #ifdef CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w) \
+({ \
+	unsigned long __w = (w); \
+	if (__w) \
+		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
+	__w; \
+})
 #else
 # define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)		(w)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 3/9] selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 2/9] sched: Avoid scale real weight down to zero Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 4/9] libata: Remove extra scsi_host_put() in ata_scsi_add_hosts() Sasha Levin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Andy Lutomirski, kbuild test robot, Borislav Petkov, Sasha Levin,
	linux-kselftest

From: Andy Lutomirski <luto@kernel.org>

[ Upstream commit 630b99ab60aa972052a4202a1ff96c7e45eb0054 ]

If AT_SYSINFO is not present, don't try to call a NULL pointer.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/faaf688265a7e1a5b944d6f8bc0f6368158306d3.1584052409.git.luto@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 tools/testing/selftests/x86/ptrace_syscall.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/x86/ptrace_syscall.c b/tools/testing/selftests/x86/ptrace_syscall.c
index 1e3da137a8bb9..5390c827a359e 100644
--- a/tools/testing/selftests/x86/ptrace_syscall.c
+++ b/tools/testing/selftests/x86/ptrace_syscall.c
@@ -413,8 +413,12 @@ int main()
 
 #if defined(__i386__) && (!defined(__GLIBC__) || __GLIBC__ > 2 || __GLIBC_MINOR__ >= 16)
 	vsyscall32 = (void *)getauxval(AT_SYSINFO);
-	printf("[RUN]\tCheck AT_SYSINFO return regs\n");
-	test_sys32_regs(do_full_vsyscall32);
+	if (vsyscall32) {
+		printf("[RUN]\tCheck AT_SYSINFO return regs\n");
+		test_sys32_regs(do_full_vsyscall32);
+	} else {
+		printf("[SKIP]\tAT_SYSINFO is not available\n");
+	}
 #endif
 
 	test_ptrace_syscall_restart();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 4/9] libata: Remove extra scsi_host_put() in ata_scsi_add_hosts()
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 2/9] sched: Avoid scale real weight down to zero Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 3/9] selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 5/9] gfs2: Don't demote a glock until its revokes are written Sasha Levin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: John Garry, Jens Axboe, Sasha Levin, linux-ide

From: John Garry <john.garry@huawei.com>

[ Upstream commit 1d72f7aec3595249dbb83291ccac041a2d676c57 ]

If the call to scsi_add_host_with_dma() in ata_scsi_add_hosts() fails,
then we may get use-after-free KASAN warns:

==================================================================
BUG: KASAN: use-after-free in kobject_put+0x24/0x180
Read of size 1 at addr ffff0026b8c80364 by task swapper/0/1
CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W         5.6.0-rc3-00004-g5a71b206ea82-dirty #1765
Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS V3.B160.01 02/24/2020
Call trace:
dump_backtrace+0x0/0x298
show_stack+0x14/0x20
dump_stack+0x118/0x190
print_address_description.isra.9+0x6c/0x3b8
__kasan_report+0x134/0x23c
kasan_report+0xc/0x18
__asan_load1+0x5c/0x68
kobject_put+0x24/0x180
put_device+0x10/0x20
scsi_host_put+0x10/0x18
ata_devres_release+0x74/0xb0
release_nodes+0x2d0/0x470
devres_release_all+0x50/0x78
really_probe+0x2d4/0x560
driver_probe_device+0x7c/0x148
device_driver_attach+0x94/0xa0
__driver_attach+0xa8/0x110
bus_for_each_dev+0xe8/0x158
driver_attach+0x30/0x40
bus_add_driver+0x220/0x2e0
driver_register+0xbc/0x1d0
__pci_register_driver+0xbc/0xd0
ahci_pci_driver_init+0x20/0x28
do_one_initcall+0xf0/0x608
kernel_init_freeable+0x31c/0x384
kernel_init+0x10/0x118
ret_from_fork+0x10/0x18

Allocated by task 5:
save_stack+0x28/0xc8
__kasan_kmalloc.isra.8+0xbc/0xd8
kasan_kmalloc+0xc/0x18
__kmalloc+0x1a8/0x280
scsi_host_alloc+0x44/0x678
ata_scsi_add_hosts+0x74/0x268
ata_host_register+0x228/0x488
ahci_host_activate+0x1c4/0x2a8
ahci_init_one+0xd18/0x1298
local_pci_probe+0x74/0xf0
work_for_cpu_fn+0x2c/0x48
process_one_work+0x488/0xc08
worker_thread+0x330/0x5d0
kthread+0x1c8/0x1d0
ret_from_fork+0x10/0x18

Freed by task 5:
save_stack+0x28/0xc8
__kasan_slab_free+0x118/0x180
kasan_slab_free+0x10/0x18
slab_free_freelist_hook+0xa4/0x1a0
kfree+0xd4/0x3a0
scsi_host_dev_release+0x100/0x148
device_release+0x7c/0xe0
kobject_put+0xb0/0x180
put_device+0x10/0x20
scsi_host_put+0x10/0x18
ata_scsi_add_hosts+0x210/0x268
ata_host_register+0x228/0x488
ahci_host_activate+0x1c4/0x2a8
ahci_init_one+0xd18/0x1298
local_pci_probe+0x74/0xf0
work_for_cpu_fn+0x2c/0x48
process_one_work+0x488/0xc08
worker_thread+0x330/0x5d0
kthread+0x1c8/0x1d0
ret_from_fork+0x10/0x18

There is also refcount issue, as well:
WARNING: CPU: 1 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xf8/0x170

The issue is that we make an erroneous extra call to scsi_host_put()
for that host:

So in ahci_init_one()->ata_host_alloc_pinfo()->ata_host_alloc(), we setup
a device release method - ata_devres_release() - which intends to release
the SCSI hosts:

static void ata_devres_release(struct device *gendev, void *res)
{
	...
	for (i = 0; i < host->n_ports; i++) {
		struct ata_port *ap = host->ports[i];

		if (!ap)
			continue;

		if (ap->scsi_host)
			scsi_host_put(ap->scsi_host);

	}
	...
}

However in the ata_scsi_add_hosts() error path, we also call
scsi_host_put() for the SCSI hosts.

Fix by removing the the scsi_host_put() calls in ata_scsi_add_hosts() and
leave this to ata_devres_release().

Fixes: f31871951b38 ("libata: separate out ata_host_alloc() and ata_host_register()")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/ata/libata-scsi.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index f5eb102a2cf76..c4f2b563c9f03 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4444,22 +4444,19 @@ int ata_scsi_add_hosts(struct ata_host *host, struct scsi_host_template *sht)
 		 */
 		shost->max_host_blocked = 1;
 
-		rc = scsi_add_host_with_dma(ap->scsi_host,
-						&ap->tdev, ap->host->dev);
+		rc = scsi_add_host_with_dma(shost, &ap->tdev, ap->host->dev);
 		if (rc)
-			goto err_add;
+			goto err_alloc;
 	}
 
 	return 0;
 
- err_add:
-	scsi_host_put(host->ports[i]->scsi_host);
  err_alloc:
 	while (--i >= 0) {
 		struct Scsi_Host *shost = host->ports[i]->scsi_host;
 
+		/* scsi_host_put() is in ata_devres_release() */
 		scsi_remove_host(shost);
-		scsi_host_put(shost);
 	}
 	return rc;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 5/9] gfs2: Don't demote a glock until its revokes are written
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
                   ` (2 preceding siblings ...)
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 4/9] libata: Remove extra scsi_host_put() in ata_scsi_add_hosts() Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 6/9] x86/boot: Use unsigned comparison for addresses Sasha Levin
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Bob Peterson, Andreas Gruenbacher, Sasha Levin, cluster-devel

From: Bob Peterson <rpeterso@redhat.com>

[ Upstream commit df5db5f9ee112e76b5202fbc331f990a0fc316d6 ]

Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.

This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/gfs2/glock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index efd44d5645d83..adc1a97cfe96d 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -548,6 +548,9 @@ __acquires(&gl->gl_lockref.lock)
 			goto out_unlock;
 		if (nonblock)
 			goto out_sched;
+		smp_mb();
+		if (atomic_read(&gl->gl_revokes) != 0)
+			goto out_sched;
 		set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
 		GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
 		gl->gl_target = gl->gl_demote_state;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 6/9] x86/boot: Use unsigned comparison for addresses
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
                   ` (3 preceding siblings ...)
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 5/9] gfs2: Don't demote a glock until its revokes are written Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 7/9] locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps() Sasha Levin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Arvind Sankar, Ard Biesheuvel, Ingo Molnar, Sasha Levin

From: Arvind Sankar <nivedita@alum.mit.edu>

[ Upstream commit 81a34892c2c7c809f9c4e22c5ac936ae673fb9a2 ]

The load address is compared with LOAD_PHYSICAL_ADDR using a signed
comparison currently (using jge instruction).

When loading a 64-bit kernel using the new efi32_pe_entry() point added by:

  97aa276579b2 ("efi/x86: Add true mixed mode entry point into .compat section")

using Qemu with -m 3072, the firmware actually loads us above 2Gb,
resulting in a very early crash.

Use the JAE instruction to perform a unsigned comparison instead, as physical
addresses should be considered unsigned.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200301230436.2246909-6-nivedita@alum.mit.edu
Link: https://lore.kernel.org/r/20200308080859.21568-14-ardb@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/boot/compressed/head_32.S | 2 +-
 arch/x86/boot/compressed/head_64.S | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index fd0b6a272dd5b..7532f6f536774 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -170,7 +170,7 @@ preferred_addr:
 	notl	%eax
 	andl    %eax, %ebx
 	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
-	jge	1f
+	jae	1f
 #endif
 	movl	$LOAD_PHYSICAL_ADDR, %ebx
 1:
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 9e3a183561a9a..3fac2d133e4ec 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -104,7 +104,7 @@ ENTRY(startup_32)
 	notl	%eax
 	andl	%eax, %ebx
 	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
-	jge	1f
+	jae	1f
 #endif
 	movl	$LOAD_PHYSICAL_ADDR, %ebx
 1:
@@ -339,7 +339,7 @@ preferred_addr:
 	notq	%rax
 	andq	%rax, %rbp
 	cmpq	$LOAD_PHYSICAL_ADDR, %rbp
-	jge	1f
+	jae	1f
 #endif
 	movq	$LOAD_PHYSICAL_ADDR, %rbp
 1:
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 7/9] locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps()
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
                   ` (4 preceding siblings ...)
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 6/9] x86/boot: Use unsigned comparison for addresses Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 8/9] btrfs: remove a BUG_ON() from merge_reloc_roots() Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 9/9] btrfs: track reloc roots based on their commit root bytenr Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Boqun Feng, Qian Cai, Peter Zijlstra, Sasha Levin

From: Boqun Feng <boqun.feng@gmail.com>

[ Upstream commit 25016bd7f4caf5fc983bbab7403d08e64cba3004 ]

Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep
triggered a warning:

  [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
  ...
  [ ] Call Trace:
  [ ]  lock_is_held_type+0x5d/0x150
  [ ]  ? rcu_lockdep_current_cpu_online+0x64/0x80
  [ ]  rcu_read_lock_any_held+0xac/0x100
  [ ]  ? rcu_read_lock_held+0xc0/0xc0
  [ ]  ? __slab_free+0x421/0x540
  [ ]  ? kasan_kmalloc+0x9/0x10
  [ ]  ? __kmalloc_node+0x1d7/0x320
  [ ]  ? kvmalloc_node+0x6f/0x80
  [ ]  __bfs+0x28a/0x3c0
  [ ]  ? class_equal+0x30/0x30
  [ ]  lockdep_count_forward_deps+0x11a/0x1a0

The warning got triggered because lockdep_count_forward_deps() call
__bfs() without current->lockdep_recursion being set, as a result
a lockdep internal function (__bfs()) is checked by lockdep, which is
unexpected, and the inconsistency between the irq-off state and the
state traced by lockdep caused the warning.

Apart from this warning, lockdep internal functions like __bfs() should
always be protected by current->lockdep_recursion to avoid potential
deadlocks and data inconsistency, therefore add the
current->lockdep_recursion on-and-off section to protect __bfs() in both
lockdep_count_forward_deps() and lockdep_count_backward_deps()

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/locking/lockdep.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d7f425698a4a1..9f56e3fac795a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1241,9 +1241,11 @@ unsigned long lockdep_count_forward_deps(struct lock_class *class)
 	this.class = class;
 
 	raw_local_irq_save(flags);
+	current->lockdep_recursion = 1;
 	arch_spin_lock(&lockdep_lock);
 	ret = __lockdep_count_forward_deps(&this);
 	arch_spin_unlock(&lockdep_lock);
+	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
 
 	return ret;
@@ -1268,9 +1270,11 @@ unsigned long lockdep_count_backward_deps(struct lock_class *class)
 	this.class = class;
 
 	raw_local_irq_save(flags);
+	current->lockdep_recursion = 1;
 	arch_spin_lock(&lockdep_lock);
 	ret = __lockdep_count_backward_deps(&this);
 	arch_spin_unlock(&lockdep_lock);
+	current->lockdep_recursion = 0;
 	raw_local_irq_restore(flags);
 
 	return ret;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 8/9] btrfs: remove a BUG_ON() from merge_reloc_roots()
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
                   ` (5 preceding siblings ...)
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 7/9] locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps() Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 9/9] btrfs: track reloc roots based on their commit root bytenr Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Josef Bacik, Qu Wenruo, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit 7b7b74315b24dc064bc1c683659061c3d48f8668 ]

This was pretty subtle, we default to reloc roots having 0 root refs, so
if we crash in the middle of the relocation they can just be deleted.
If we successfully complete the relocation operations we'll set our root
refs to 1 in prepare_to_merge() and then go on to merge_reloc_roots().

At prepare_to_merge() time if any of the reloc roots have a 0 reference
still, we will remove that reloc root from our reloc root rb tree, and
then clean it up later.

However this only happens if we successfully start a transaction.  If
we've aborted previously we will skip this step completely, and only
have reloc roots with a reference count of 0, but were never properly
removed from the reloc control's rb tree.

This isn't a problem per-se, our references are held by the list the
reloc roots are on, and by the original root the reloc root belongs to.
If we end up in this situation all the reloc roots will be added to the
dirty_reloc_list, and then properly dropped at that point.  The reloc
control will be free'd and the rb tree is no longer used.

There were two options when fixing this, one was to remove the BUG_ON(),
the other was to make prepare_to_merge() handle the case where we
couldn't start a trans handle.

IMO this is the cleaner solution.  I started with handling the error in
prepare_to_merge(), but it turned out super ugly.  And in the end this
BUG_ON() simply doesn't matter, the cleanup was happening properly, we
were just panicing because this BUG_ON() only matters in the success
case.  So I've opted to just remove it and add a comment where it was.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/relocation.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b106d365257d3..97bc85ca508c2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2457,7 +2457,21 @@ void merge_reloc_roots(struct reloc_control *rc)
 			free_reloc_roots(&reloc_roots);
 	}
 
-	BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root));
+	/*
+	 * We used to have
+	 *
+	 * BUG_ON(!RB_EMPTY_ROOT(&rc->reloc_root_tree.rb_root));
+	 *
+	 * here, but it's wrong.  If we fail to start the transaction in
+	 * prepare_to_merge() we will have only 0 ref reloc roots, none of which
+	 * have actually been removed from the reloc_root_tree rb tree.  This is
+	 * fine because we're bailing here, and we hold a reference on the root
+	 * for the list that holds it, so these roots will be cleaned up when we
+	 * do the reloc_dirty_list afterwards.  Meanwhile the root->reloc_root
+	 * will be cleaned up on unmount.
+	 *
+	 * The remaining nodes will be cleaned up by free_reloc_control.
+	 */
 }
 
 static void free_block_list(struct rb_root *blocks)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH AUTOSEL 4.9 9/9] btrfs: track reloc roots based on their commit root bytenr
  2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
                   ` (6 preceding siblings ...)
  2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 8/9] btrfs: remove a BUG_ON() from merge_reloc_roots() Sasha Levin
@ 2020-04-10  3:51 ` Sasha Levin
  7 siblings, 0 replies; 9+ messages in thread
From: Sasha Levin @ 2020-04-10  3:51 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: Josef Bacik, David Sterba, Sasha Levin, linux-btrfs

From: Josef Bacik <josef@toxicpanda.com>

[ Upstream commit ea287ab157c2816bf12aad4cece41372f9d146b4 ]

We always search the commit root of the extent tree for looking up back
references, however we track the reloc roots based on their current
bytenr.

This is wrong, if we commit the transaction between relocating tree
blocks we could end up in this code in build_backref_tree

  if (key.objectid == key.offset) {
	  /*
	   * Only root blocks of reloc trees use backref
	   * pointing to itself.
	   */
	  root = find_reloc_root(rc, cur->bytenr);
	  ASSERT(root);
	  cur->root = root;
	  break;
  }

find_reloc_root() is looking based on the bytenr we had in the commit
root, but if we've COWed this reloc root we will not find that bytenr,
and we will trip over the ASSERT(root).

Fix this by using the commit_root->start bytenr for indexing the commit
root.  Then we change the __update_reloc_root() caller to be used when
we switch the commit root for the reloc root during commit.

This fixes the panic I was seeing when we started throttling relocation
for delayed refs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/relocation.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 97bc85ca508c2..8d98e8e248b7a 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1296,7 +1296,7 @@ static int __must_check __add_reloc_root(struct btrfs_root *root)
 	if (!node)
 		return -ENOMEM;
 
-	node->bytenr = root->node->start;
+	node->bytenr = root->commit_root->start;
 	node->data = root;
 
 	spin_lock(&rc->reloc_root_tree.lock);
@@ -1328,10 +1328,11 @@ static void __del_reloc_root(struct btrfs_root *root)
 	if (rc && root->node) {
 		spin_lock(&rc->reloc_root_tree.lock);
 		rb_node = tree_search(&rc->reloc_root_tree.rb_root,
-				      root->node->start);
+				      root->commit_root->start);
 		if (rb_node) {
 			node = rb_entry(rb_node, struct mapping_node, rb_node);
 			rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root);
+			RB_CLEAR_NODE(&node->rb_node);
 		}
 		spin_unlock(&rc->reloc_root_tree.lock);
 		if (!node)
@@ -1349,7 +1350,7 @@ static void __del_reloc_root(struct btrfs_root *root)
  * helper to update the 'address of tree root -> reloc tree'
  * mapping
  */
-static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr)
+static int __update_reloc_root(struct btrfs_root *root)
 {
 	struct rb_node *rb_node;
 	struct mapping_node *node = NULL;
@@ -1357,7 +1358,7 @@ static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr)
 
 	spin_lock(&rc->reloc_root_tree.lock);
 	rb_node = tree_search(&rc->reloc_root_tree.rb_root,
-			      root->node->start);
+			      root->commit_root->start);
 	if (rb_node) {
 		node = rb_entry(rb_node, struct mapping_node, rb_node);
 		rb_erase(&node->rb_node, &rc->reloc_root_tree.rb_root);
@@ -1369,7 +1370,7 @@ static int __update_reloc_root(struct btrfs_root *root, u64 new_bytenr)
 	BUG_ON((struct btrfs_root *)node->data != root);
 
 	spin_lock(&rc->reloc_root_tree.lock);
-	node->bytenr = new_bytenr;
+	node->bytenr = root->node->start;
 	rb_node = tree_insert(&rc->reloc_root_tree.rb_root,
 			      node->bytenr, &node->rb_node);
 	spin_unlock(&rc->reloc_root_tree.lock);
@@ -1519,6 +1520,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	}
 
 	if (reloc_root->commit_root != reloc_root->node) {
+		__update_reloc_root(reloc_root);
 		btrfs_set_root_node(root_item, reloc_root->node);
 		free_extent_buffer(reloc_root->commit_root);
 		reloc_root->commit_root = btrfs_root_node(reloc_root);
@@ -4717,11 +4719,6 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
 	BUG_ON(rc->stage == UPDATE_DATA_PTRS &&
 	       root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID);
 
-	if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
-		if (buf == root->node)
-			__update_reloc_root(root, cow->start);
-	}
-
 	level = btrfs_header_level(buf);
 	if (btrfs_header_generation(buf) <=
 	    btrfs_root_last_snapshot(&root->root_item))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-04-10  3:52 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-10  3:51 [PATCH AUTOSEL 4.9 1/9] irqchip/versatile-fpga: Handle chained IRQs properly Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 2/9] sched: Avoid scale real weight down to zero Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 3/9] selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 4/9] libata: Remove extra scsi_host_put() in ata_scsi_add_hosts() Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 5/9] gfs2: Don't demote a glock until its revokes are written Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 6/9] x86/boot: Use unsigned comparison for addresses Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 7/9] locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps() Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 8/9] btrfs: remove a BUG_ON() from merge_reloc_roots() Sasha Levin
2020-04-10  3:51 ` [PATCH AUTOSEL 4.9 9/9] btrfs: track reloc roots based on their commit root bytenr Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).