linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 1/2] of: add J-Core timer bindings
  2016-09-24  5:07 [PATCH v7 0/2] J-Core timer support Rich Felker
  2016-09-24  5:07 ` [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver Rich Felker
@ 2016-09-24  5:07 ` Rich Felker
  1 sibling, 0 replies; 30+ messages in thread
From: Rich Felker @ 2016-09-24  5:07 UTC (permalink / raw)
  To: devicetree, linux-kernel
  Cc: Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Paul E. McKenney

Signed-off-by: Rich Felker <dalias@libc.org>
Acked-by: Rob Herring <robh@kernel.org>
---
 .../devicetree/bindings/timer/jcore,pit.txt        | 24 ++++++++++++++++++++++
 1 file changed, 24 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/timer/jcore,pit.txt

diff --git a/Documentation/devicetree/bindings/timer/jcore,pit.txt b/Documentation/devicetree/bindings/timer/jcore,pit.txt
new file mode 100644
index 0000000..af5dd35
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/jcore,pit.txt
@@ -0,0 +1,24 @@
+J-Core Programmable Interval Timer and Clocksource
+
+Required properties:
+
+- compatible: Must be "jcore,pit".
+
+- reg: Memory region(s) for timer/clocksource registers. For SMP,
+  there should be one region per cpu, indexed by the sequential,
+  zero-based hardware cpu number.
+
+- interrupts: An interrupt to assign for the timer. The actual pit
+  core is integrated with the aic and allows the timer interrupt
+  assignment to be programmed by software, but this property is
+  required in order to reserve an interrupt number that doesn't
+  conflict with other devices.
+
+
+Example:
+
+timer@200 {
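+	/* two-cpu example: cpu0 registers at 0x200, cpu1 at 0x500 */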
+	compatible = "jcore,pit";
+	reg = < 0x200 0x30 0x500 0x30 >;
+	interrupts = < 0x48 >;
+};
-- 
2.10.0


* [PATCH v7 0/2] J-Core timer support
@ 2016-09-24  5:07 Rich Felker
  2016-09-24  5:07 ` [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver Rich Felker
  2016-09-24  5:07 ` [PATCH v7 1/2] of: add J-Core timer bindings Rich Felker
  0 siblings, 2 replies; 30+ messages in thread
From: Rich Felker @ 2016-09-24  5:07 UTC (permalink / raw)
  To: devicetree, linux-kernel
  Cc: Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Paul E. McKenney

This version has been updated to include specific changes requested in
the last round of review, and to work around a problem discussed with
Paul McKenney (See thread "rcu_sched stalls in idle task introduced in
pre-4.8?" for details).

Rich Felker (2):
  of: add J-Core timer bindings
  clocksource: add J-Core timer/clocksource driver

 .../devicetree/bindings/timer/jcore,pit.txt        |  24 +++
 drivers/clocksource/Kconfig                        |  10 +
 drivers/clocksource/Makefile                       |   1 +
 drivers/clocksource/jcore-pit.c                    | 240 +++++++++++++++++++++
 include/linux/cpuhotplug.h                         |   1 +
 5 files changed, 276 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/timer/jcore,pit.txt
 create mode 100644 drivers/clocksource/jcore-pit.c

-- 
2.10.0


* [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-24  5:07 [PATCH v7 0/2] J-Core timer support Rich Felker
@ 2016-09-24  5:07 ` Rich Felker
  2016-09-26 21:07   ` Rich Felker
  2016-09-24  5:07 ` [PATCH v7 1/2] of: add J-Core timer bindings Rich Felker
  1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-09-24  5:07 UTC (permalink / raw)
  To: devicetree, linux-kernel
  Cc: Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Paul E. McKenney

At the hardware level, the J-Core PIT is integrated with the interrupt
controller, but it is represented as its own device and has an
independent programming interface. It provides a 12-bit countdown
timer, which is not presently used, and a periodic timer. The interval
length for the latter is programmable via a 32-bit throttle register
whose units are determined by a bus-period register. The periodic
timer is used to implement both periodic and oneshot clock event
modes; in oneshot mode the interrupt handler simply disables the timer
as soon as it fires.
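
To make the units concrete, here is a rough sketch of how an interval
gets programmed (the 50 MHz bus clock and the local variable names are
just illustrative; the register offsets match the driver):

    buspd = readl(base + REG_BUSPD);       /* e.g. 20 ns on a 50 MHz bus */
    ticks = DIV_ROUND_CLOSEST(interval_ns, buspd); /* 10 ms -> 500000 */
    writel(ticks, base + REG_THROT);       /* interval, in bus-period counts */
    writel(enable, base + REG_PITEN);      /* (re)start the countdown */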

Although the PIT's device tree node specifies an interrupt, the actual
irq generated is programmable rather than hard-wired. The driver is
responsible for programming the PIT to generate the hardware irq number
that the DT assigns to it.

On SMP configurations, J-Core provides cpu-local instances of the PIT;
no broadcast timer is needed. This driver supports the creation of the
necessary per-cpu clock_event_device instances.

A nanosecond-resolution clocksource is provided using the J-Core "RTC"
registers, which give a 64-bit seconds count and 32-bit nanoseconds
that wrap every second. The driver converts these to a full-range
32-bit nanoseconds count.

Signed-off-by: Rich Felker <dalias@libc.org>
---
 drivers/clocksource/Kconfig     |  10 ++
 drivers/clocksource/Makefile    |   1 +
 drivers/clocksource/jcore-pit.c | 240 ++++++++++++++++++++++++++++++++++++++++
 include/linux/cpuhotplug.h      |   1 +
 4 files changed, 252 insertions(+)
 create mode 100644 drivers/clocksource/jcore-pit.c

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index 5677886..95dd78b 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
 config SYS_SUPPORTS_EM_STI
         bool
 
+config CLKSRC_JCORE_PIT
+	bool "J-Core PIT timer driver"
+	depends on OF && (SUPERH || COMPILE_TEST)
+	depends on GENERIC_CLOCKEVENTS
+	depends on HAS_IOMEM
+	select CLKSRC_MMIO
+	help
+	  This enables build of clocksource and clockevent driver for
+	  the integrated PIT in the J-Core synthesizable, open source SoC.
+
 config SH_TIMER_CMT
 	bool "Renesas CMT timer driver" if COMPILE_TEST
 	depends on GENERIC_CLOCKEVENTS
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index fd9d6df..cf87f40 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)	+= tcb_clksrc.o
 obj-$(CONFIG_X86_PM_TIMER)	+= acpi_pm.o
 obj-$(CONFIG_SCx200HR_TIMER)	+= scx200_hrt.o
 obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC)	+= cs5535-clockevt.o
+obj-$(CONFIG_CLKSRC_JCORE_PIT)		+= jcore-pit.o
 obj-$(CONFIG_SH_TIMER_CMT)	+= sh_cmt.o
 obj-$(CONFIG_SH_TIMER_MTU2)	+= sh_mtu2.o
 obj-$(CONFIG_SH_TIMER_TMU)	+= sh_tmu.o
diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
new file mode 100644
index 0000000..4c512d7
--- /dev/null
+++ b/drivers/clocksource/jcore-pit.c
@@ -0,0 +1,240 @@
+/*
+ * J-Core SoC PIT/clocksource driver
+ *
+ * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/clockchips.h>
+#include <linux/clocksource.h>
+#include <linux/sched_clock.h>
+#include <linux/cpu.h>
+#include <linux/cpuhotplug.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+
+#define PIT_IRQ_SHIFT		12
+#define PIT_PRIO_SHIFT		20
+#define PIT_ENABLE_SHIFT	26
+#define PIT_IRQ_MASK		0x3f
+#define PIT_PRIO_MASK		0xf
+
+#define REG_PITEN		0x00
+#define REG_THROT		0x10
+#define REG_COUNT		0x14
+#define REG_BUSPD		0x18
+#define REG_SECHI		0x20
+#define REG_SECLO		0x24
+#define REG_NSEC		0x28
+
+struct jcore_pit {
+	struct clock_event_device	ced;
+	__iomem void			*base;
+	unsigned long			periodic_delta;
+	unsigned			cpu;
+	u32				enable_val;
+};
+
+static __iomem void		*jcore_pit_base;
+struct jcore_pit __percpu	*jcore_pit_percpu;
+
+static notrace u64 jcore_sched_clock_read(void)
+{
+	u32 seclo, nsec, seclo0;
+	__iomem void *base = jcore_pit_base;
+
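+	/*
+	 * Re-read SECLO until it is stable across the NSEC read, so a
+	 * seconds rollover cannot pair mismatched seconds/nanoseconds
+	 * values.
+	 */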
+	seclo = readl(base + REG_SECLO);
+	do {
+		seclo0 = seclo;
+		nsec  = readl(base + REG_NSEC);
+		seclo = readl(base + REG_SECLO);
+	} while (seclo0 != seclo);
+
+	return seclo * NSEC_PER_SEC + nsec;
+}
+
+static cycle_t jcore_clocksource_read(struct clocksource *cs)
+{
+	return jcore_sched_clock_read();
+}
+
+static int jcore_pit_disable(struct jcore_pit *pit)
+{
+	writel(0, pit->base + REG_PITEN);
+	return 0;
+}
+
+static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
+{
+	jcore_pit_disable(pit);
+	writel(delta, pit->base + REG_THROT);
+	writel(pit->enable_val, pit->base + REG_PITEN);
+	return 0;
+}
+
+static int jcore_pit_set_state_shutdown(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_disable(pit);
+}
+
+static int jcore_pit_set_state_oneshot(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_disable(pit);
+}
+
+static int jcore_pit_set_state_periodic(struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_set(pit->periodic_delta, pit);
+}
+
+static int jcore_pit_set_next_event(unsigned long delta,
+				    struct clock_event_device *ced)
+{
+	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
+
+	return jcore_pit_set(delta, pit);
+}
+
+static int jcore_pit_local_init(unsigned cpu)
+{
+	struct jcore_pit *pit = this_cpu_ptr(jcore_pit_percpu);
+	unsigned buspd, freq, min_delta;
+
+	pr_info("Local J-Core PIT init on cpu %u\n", pit->cpu);
+
+	buspd = readl(pit->base + REG_BUSPD);
+	freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd);
+	pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ * buspd);
+
+	/*
+	 * This is a workaround. When the timer system schedules oneshot
+	 * events so soon that the interrupt occurs before the softirq
+	 * part of the timer handler runs, a feedback loop ensues and
+	 * results in long stalls that are broken up only by other
+	 * interrupt activity.
+	 */
+	min_delta = DIV_ROUND_CLOSEST(5000000, buspd);
+
+	clockevents_config_and_register(&pit->ced, freq, min_delta, ULONG_MAX);
+
+	return 0;
+}
+
+static irqreturn_t jcore_timer_interrupt(int irq, void *dev_id)
+{
+	struct jcore_pit *pit = this_cpu_ptr(dev_id);
+
+	if (clockevent_state_oneshot(&pit->ced))
+		jcore_pit_disable(pit);
+
+	pit->ced.event_handler(&pit->ced);
+
+	return IRQ_HANDLED;
+}
+
+static int __init jcore_pit_init(struct device_node *node)
+{
+	int err;
+	unsigned pit_irq, cpu;
+	unsigned long hwirq;
+	u32 irqprio, enable_val;
+
+	jcore_pit_base = of_iomap(node, 0);
+	if (!jcore_pit_base) {
+		pr_err("Error: Cannot map base address for J-Core PIT\n");
+		return -ENXIO;
+	}
+
+	pit_irq = irq_of_parse_and_map(node, 0);
+	if (!pit_irq) {
+		pr_err("Error: J-Core PIT has no IRQ\n");
+		return -ENXIO;
+	}
+
+	pr_info("Initializing J-Core PIT at %p IRQ %d\n",
+		jcore_pit_base, pit_irq);
+
+	err = clocksource_mmio_init(jcore_pit_base, "jcore_pit_cs",
+				    NSEC_PER_SEC, 400, 32,
+				    jcore_clocksource_read);
+	if (err) {
+		pr_err("Error registering clocksource device: %d\n", err);
+		return err;
+	}
+
+	sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC);
+
+	jcore_pit_percpu = alloc_percpu(struct jcore_pit);
+	if (!jcore_pit_percpu) {
+		pr_err("Failed to allocate memory for clock event device\n");
+		return -ENOMEM;
+	}
+
+	err = request_irq(pit_irq, jcore_timer_interrupt,
+			  IRQF_TIMER | IRQF_PERCPU,
+			  "jcore_pit", jcore_pit_percpu);
+	if (err) {
+		pr_err("pit irq request failed: %d\n", err);
+		free_percpu(jcore_pit_percpu);
+		return err;
+	}
+
+	/*
+	 * The J-Core PIT is not hard-wired to a particular IRQ, but
+	 * integrated with the interrupt controller such that the IRQ it
+	 * generates is programmable. The programming interface has a
+	 * legacy field which was an interrupt priority for AIC1, but
+	 * which is OR'd onto bits 2-5 of the generated IRQ number when
+	 * used with J-Core AIC2, so set it to match these bits.
+	 */
+	hwirq = irq_get_irq_data(pit_irq)->hwirq;
+	irqprio = (hwirq >> 2) & PIT_PRIO_MASK;
+	enable_val = (1U << PIT_ENABLE_SHIFT)
+		   | (hwirq << PIT_IRQ_SHIFT)
+		   | (irqprio << PIT_PRIO_SHIFT);
+
+	for_each_present_cpu(cpu) {
+		struct jcore_pit *pit = per_cpu_ptr(jcore_pit_percpu, cpu);
+
+		pit->base = of_iomap(node, cpu);
+		if (!pit->base) {
+			pr_err("Unable to map PIT for cpu %u\n", cpu);
+			continue;
+		}
+
+		pit->ced.name = "jcore_pit";
+		pit->ced.features = CLOCK_EVT_FEAT_PERIODIC
+				  | CLOCK_EVT_FEAT_ONESHOT
+				  | CLOCK_EVT_FEAT_PERCPU;
+		pit->ced.cpumask = cpumask_of(cpu);
+		pit->ced.rating = 400;
+		pit->ced.irq = pit_irq;
+		pit->ced.set_state_shutdown = jcore_pit_set_state_shutdown;
+		pit->ced.set_state_periodic = jcore_pit_set_state_periodic;
+		pit->ced.set_state_oneshot = jcore_pit_set_state_oneshot;
+		pit->ced.set_next_event = jcore_pit_set_next_event;
+
+		pit->cpu = cpu;
+		pit->enable_val = enable_val;
+	}
+
+	cpuhp_setup_state(CPUHP_AP_JCORE_TIMER_STARTING,
+			  "AP_JCORE_TIMER_STARTING",
+			  jcore_pit_local_init, NULL);
+
+	return 0;
+}
+
+CLOCKSOURCE_OF_DECLARE(jcore_pit, "jcore,pit", jcore_pit_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 34bd805..e2e6c8c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -52,6 +52,7 @@ enum cpuhp_state {
 	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
 	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
 	CPUHP_AP_DUMMY_TIMER_STARTING,
+	CPUHP_AP_JCORE_TIMER_STARTING,
 	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
 	CPUHP_AP_ARM_TWD_STARTING,
 	CPUHP_AP_METAG_TIMER_STARTING,
-- 
2.10.0


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-24  5:07 ` [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver Rich Felker
@ 2016-09-26 21:07   ` Rich Felker
  2016-09-26 21:27     ` Daniel Lezcano
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-09-26 21:07 UTC (permalink / raw)
  To: devicetree, linux-kernel, linux-sh
  Cc: Rob Herring, Mark Rutland, Daniel Lezcano, Thomas Gleixner,
	Paul E. McKenney

Ping. Is there anything that still needs to be changed for this driver
to be acceptable?

On Sat, Sep 24, 2016 at 05:07 AM +0000, Rich Felker wrote:
> At the hardware level, the J-Core PIT is integrated with the interrupt
> controller, but it is represented as its own device and has an
> independent programming interface. It provides a 12-bit countdown
> timer, which is not presently used, and a periodic timer. The interval
> length for the latter is programmable via a 32-bit throttle register
> whose units are determined by a bus-period register. The periodic
> timer is used to implement both periodic and oneshot clock event
> modes; in oneshot mode the interrupt handler simply disables the timer
> as soon as it fires.
> 
> Although the PIT's device tree node specifies an interrupt, the actual
> irq generated is programmable rather than hard-wired. The driver is
> responsible for programming the PIT to generate the hardware irq number
> that the DT assigns to it.
> 
> On SMP configurations, J-Core provides cpu-local instances of the PIT;
> no broadcast timer is needed. This driver supports the creation of the
> necessary per-cpu clock_event_device instances.
> 
> A nanosecond-resolution clocksource is provided using the J-Core "RTC"
> registers, which give a 64-bit seconds count and 32-bit nanoseconds
> that wrap every second. The driver converts these to a full-range
> 32-bit nanoseconds count.
> 
> Signed-off-by: Rich Felker <dalias@libc.org>
> ---
>  drivers/clocksource/Kconfig     |  10 ++
>  drivers/clocksource/Makefile    |   1 +
>  drivers/clocksource/jcore-pit.c | 240 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/cpuhotplug.h      |   1 +
>  4 files changed, 252 insertions(+)
>  create mode 100644 drivers/clocksource/jcore-pit.c
> 
> diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
> index 5677886..95dd78b 100644
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig
> @@ -407,6 +407,16 @@ config SYS_SUPPORTS_SH_TMU
>  config SYS_SUPPORTS_EM_STI
>          bool
>  
> +config CLKSRC_JCORE_PIT
> +	bool "J-Core PIT timer driver"
> +	depends on OF && (SUPERH || COMPILE_TEST)
> +	depends on GENERIC_CLOCKEVENTS
> +	depends on HAS_IOMEM
> +	select CLKSRC_MMIO
> +	help
> +	  This enables build of clocksource and clockevent driver for
> +	  the integrated PIT in the J-Core synthesizable, open source SoC.
> +
>  config SH_TIMER_CMT
>  	bool "Renesas CMT timer driver" if COMPILE_TEST
>  	depends on GENERIC_CLOCKEVENTS
> diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
> index fd9d6df..cf87f40 100644
> --- a/drivers/clocksource/Makefile
> +++ b/drivers/clocksource/Makefile
> @@ -5,6 +5,7 @@ obj-$(CONFIG_ATMEL_TCB_CLKSRC)	+= tcb_clksrc.o
>  obj-$(CONFIG_X86_PM_TIMER)	+= acpi_pm.o
>  obj-$(CONFIG_SCx200HR_TIMER)	+= scx200_hrt.o
>  obj-$(CONFIG_CS5535_CLOCK_EVENT_SRC)	+= cs5535-clockevt.o
> +obj-$(CONFIG_CLKSRC_JCORE_PIT)		+= jcore-pit.o
>  obj-$(CONFIG_SH_TIMER_CMT)	+= sh_cmt.o
>  obj-$(CONFIG_SH_TIMER_MTU2)	+= sh_mtu2.o
>  obj-$(CONFIG_SH_TIMER_TMU)	+= sh_tmu.o
> diff --git a/drivers/clocksource/jcore-pit.c b/drivers/clocksource/jcore-pit.c
> new file mode 100644
> index 0000000..4c512d7
> --- /dev/null
> +++ b/drivers/clocksource/jcore-pit.c
> @@ -0,0 +1,240 @@
> +/*
> + * J-Core SoC PIT/clocksource driver
> + *
> + * Copyright (C) 2015-2016 Smart Energy Instruments, Inc.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License.  See the file "COPYING" in the main directory of this archive
> + * for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/clockchips.h>
> +#include <linux/clocksource.h>
> +#include <linux/sched_clock.h>
> +#include <linux/cpu.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/of_address.h>
> +#include <linux/of_irq.h>
> +
> +#define PIT_IRQ_SHIFT		12
> +#define PIT_PRIO_SHIFT		20
> +#define PIT_ENABLE_SHIFT	26
> +#define PIT_IRQ_MASK		0x3f
> +#define PIT_PRIO_MASK		0xf
> +
> +#define REG_PITEN		0x00
> +#define REG_THROT		0x10
> +#define REG_COUNT		0x14
> +#define REG_BUSPD		0x18
> +#define REG_SECHI		0x20
> +#define REG_SECLO		0x24
> +#define REG_NSEC		0x28
> +
> +struct jcore_pit {
> +	struct clock_event_device	ced;
> +	__iomem void			*base;
> +	unsigned long			periodic_delta;
> +	unsigned			cpu;
> +	u32				enable_val;
> +};
> +
> +static __iomem void		*jcore_pit_base;
> +struct jcore_pit __percpu	*jcore_pit_percpu;
> +
> +static notrace u64 jcore_sched_clock_read(void)
> +{
> +	u32 seclo, nsec, seclo0;
> +	__iomem void *base = jcore_pit_base;
> +
> +	seclo = readl(base + REG_SECLO);
> +	do {
> +		seclo0 = seclo;
> +		nsec  = readl(base + REG_NSEC);
> +		seclo = readl(base + REG_SECLO);
> +	} while (seclo0 != seclo);
> +
> +	return seclo * NSEC_PER_SEC + nsec;
> +}
> +
> +static cycle_t jcore_clocksource_read(struct clocksource *cs)
> +{
> +	return jcore_sched_clock_read();
> +}
> +
> +static int jcore_pit_disable(struct jcore_pit *pit)
> +{
> +	writel(0, pit->base + REG_PITEN);
> +	return 0;
> +}
> +
> +static int jcore_pit_set(unsigned long delta, struct jcore_pit *pit)
> +{
> +	jcore_pit_disable(pit);
> +	writel(delta, pit->base + REG_THROT);
> +	writel(pit->enable_val, pit->base + REG_PITEN);
> +	return 0;
> +}
> +
> +static int jcore_pit_set_state_shutdown(struct clock_event_device *ced)
> +{
> +	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
> +
> +	return jcore_pit_disable(pit);
> +}
> +
> +static int jcore_pit_set_state_oneshot(struct clock_event_device *ced)
> +{
> +	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
> +
> +	return jcore_pit_disable(pit);
> +}
> +
> +static int jcore_pit_set_state_periodic(struct clock_event_device *ced)
> +{
> +	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
> +
> +	return jcore_pit_set(pit->periodic_delta, pit);
> +}
> +
> +static int jcore_pit_set_next_event(unsigned long delta,
> +				    struct clock_event_device *ced)
> +{
> +	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);
> +
> +	return jcore_pit_set(delta, pit);
> +}
> +
> +static int jcore_pit_local_init(unsigned cpu)
> +{
> +	struct jcore_pit *pit = this_cpu_ptr(jcore_pit_percpu);
> +	unsigned buspd, freq, min_delta;
> +
> +	pr_info("Local J-Core PIT init on cpu %u\n", pit->cpu);
> +
> +	buspd = readl(pit->base + REG_BUSPD);
> +	freq = DIV_ROUND_CLOSEST(NSEC_PER_SEC, buspd);
> +	pit->periodic_delta = DIV_ROUND_CLOSEST(NSEC_PER_SEC, HZ * buspd);
> +
> +	/*
> +	 * This is a workaround. When the timer system schedules oneshot
> +	 * events so soon that the interrupt occurs before the softirq
> +	 * part of the timer handler runs, a feedback loop ensues and
> +	 * results in long stalls that are broken up only by other
> +	 * interrupt activity.
> +	 */
> +	min_delta = DIV_ROUND_CLOSEST(5000000, buspd);
> +
> +	clockevents_config_and_register(&pit->ced, freq, min_delta, ULONG_MAX);
> +
> +	return 0;
> +}
> +
> +static irqreturn_t jcore_timer_interrupt(int irq, void *dev_id)
> +{
> +	struct jcore_pit *pit = this_cpu_ptr(dev_id);
> +
> +	if (clockevent_state_oneshot(&pit->ced))
> +		jcore_pit_disable(pit);
> +
> +	pit->ced.event_handler(&pit->ced);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int __init jcore_pit_init(struct device_node *node)
> +{
> +	int err;
> +	unsigned pit_irq, cpu;
> +	unsigned long hwirq;
> +	u32 irqprio, enable_val;
> +
> +	jcore_pit_base = of_iomap(node, 0);
> +	if (!jcore_pit_base) {
> +		pr_err("Error: Cannot map base address for J-Core PIT\n");
> +		return -ENXIO;
> +	}
> +
> +	pit_irq = irq_of_parse_and_map(node, 0);
> +	if (!pit_irq) {
> +		pr_err("Error: J-Core PIT has no IRQ\n");
> +		return -ENXIO;
> +	}
> +
> +	pr_info("Initializing J-Core PIT at %p IRQ %d\n",
> +		jcore_pit_base, pit_irq);
> +
> +	err = clocksource_mmio_init(jcore_pit_base, "jcore_pit_cs",
> +				    NSEC_PER_SEC, 400, 32,
> +				    jcore_clocksource_read);
> +	if (err) {
> +		pr_err("Error registering clocksource device: %d\n", err);
> +		return err;
> +	}
> +
> +	sched_clock_register(jcore_sched_clock_read, 32, NSEC_PER_SEC);
> +
> +	jcore_pit_percpu = alloc_percpu(struct jcore_pit);
> +	if (!jcore_pit_percpu) {
> +		pr_err("Failed to allocate memory for clock event device\n");
> +		return -ENOMEM;
> +	}
> +
> +	err = request_irq(pit_irq, jcore_timer_interrupt,
> +			  IRQF_TIMER | IRQF_PERCPU,
> +			  "jcore_pit", jcore_pit_percpu);
> +	if (err) {
> +		pr_err("pit irq request failed: %d\n", err);
> +		free_percpu(jcore_pit_percpu);
> +		return err;
> +	}
> +
> +	/*
> +	 * The J-Core PIT is not hard-wired to a particular IRQ, but
> +	 * integrated with the interrupt controller such that the IRQ it
> +	 * generates is programmable. The programming interface has a
> +	 * legacy field which was an interrupt priority for AIC1, but
> +	 * which is OR'd onto bits 2-5 of the generated IRQ number when
> +	 * used with J-Core AIC2, so set it to match these bits.
> +	 */
> +	hwirq = irq_get_irq_data(pit_irq)->hwirq;
> +	irqprio = (hwirq >> 2) & PIT_PRIO_MASK;
> +	enable_val = (1U << PIT_ENABLE_SHIFT)
> +		   | (hwirq << PIT_IRQ_SHIFT)
> +		   | (irqprio << PIT_PRIO_SHIFT);
> +
> +	for_each_present_cpu(cpu) {
> +		struct jcore_pit *pit = per_cpu_ptr(jcore_pit_percpu, cpu);
> +
> +		pit->base = of_iomap(node, cpu);
> +		if (!pit->base) {
> +			pr_err("Unable to map PIT for cpu %u\n", cpu);
> +			continue;
> +		}
> +
> +		pit->ced.name = "jcore_pit";
> +		pit->ced.features = CLOCK_EVT_FEAT_PERIODIC
> +				  | CLOCK_EVT_FEAT_ONESHOT
> +				  | CLOCK_EVT_FEAT_PERCPU;
> +		pit->ced.cpumask = cpumask_of(cpu);
> +		pit->ced.rating = 400;
> +		pit->ced.irq = pit_irq;
> +		pit->ced.set_state_shutdown = jcore_pit_set_state_shutdown;
> +		pit->ced.set_state_periodic = jcore_pit_set_state_periodic;
> +		pit->ced.set_state_oneshot = jcore_pit_set_state_oneshot;
> +		pit->ced.set_next_event = jcore_pit_set_next_event;
> +
> +		pit->cpu = cpu;
> +		pit->enable_val = enable_val;
> +	}
> +
> +	cpuhp_setup_state(CPUHP_AP_JCORE_TIMER_STARTING,
> +			  "AP_JCORE_TIMER_STARTING",
> +			  jcore_pit_local_init, NULL);
> +
> +	return 0;
> +}
> +
> +CLOCKSOURCE_OF_DECLARE(jcore_pit, "jcore,pit", jcore_pit_init);
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 34bd805..e2e6c8c 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -52,6 +52,7 @@ enum cpuhp_state {
>  	CPUHP_AP_ARM_ARCH_TIMER_STARTING,
>  	CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
>  	CPUHP_AP_DUMMY_TIMER_STARTING,
> +	CPUHP_AP_JCORE_TIMER_STARTING,
>  	CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
>  	CPUHP_AP_ARM_TWD_STARTING,
>  	CPUHP_AP_METAG_TIMER_STARTING,
> -- 
> 2.10.0


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-26 21:07   ` Rich Felker
@ 2016-09-26 21:27     ` Daniel Lezcano
  2016-09-26 22:23       ` Rich Felker
  2016-10-03 22:10       ` Rich Felker
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Lezcano @ 2016-09-26 21:27 UTC (permalink / raw)
  To: Rich Felker, devicetree, linux-kernel, linux-sh
  Cc: Rob Herring, Mark Rutland, Thomas Gleixner, Paul E. McKenney,
	Mark Rutland

On 26/09/2016 23:07, Rich Felker wrote:
> Ping. Is there anything that still needs to be changed for this driver
> to be acceptable?
> 

It is on my radar. I'm reviewing it.

Can you elaborate on the workaround mentioned in the changelog? I have
been digging into the lkml thread, but it is not clear whether the issue
is related to the time framework, the driver itself, or something else.
Can you clarify that?

Regarding the previous version, did you reach a consensus with Mark
Rutland on the per_cpu irq?

Thanks.

  -- Daniel

-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-26 21:27     ` Daniel Lezcano
@ 2016-09-26 22:23       ` Rich Felker
  2016-09-26 23:55         ` Thomas Gleixner
  2016-10-03 22:10       ` Rich Felker
  1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-09-26 22:23 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: devicetree, linux-kernel, linux-sh, Rob Herring, Mark Rutland,
	Thomas Gleixner, Paul E. McKenney

On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate on the workaround mentioned in the changelog? I have
> been digging into the lkml thread, but it is not clear whether the issue
> is related to the time framework, the driver itself, or something else.
> Can you clarify that?

It does not seem to be related to the driver. I'd be happy to have a
workaround (or even better a fix) at a higher level outside the
driver. I'll try to summarize what I think is happening.

The symptom is heavy irq load (much higher than the 1-2 irqs/sec I
observed on 4.6, something like 30+ on average) and frequent rcu_sched
stall messages. Under these conditions the system "feels" responsive
still, but that seems to be a consequence of other interrupts breaking
the stall; by inserting and removing the SD card, which is probed
periodically (1s period) by a kernel thread when there's no irq hooked
up for the slot, I was able to observe that the card-removed message
did not appear for several seconds, so apparently the kernel really is
stalling.

Based on use of ftrace, I was able to see situations where a second
timer hardirq happened immediately after one occurred, before the
timer softirq could run. My theory is that this is causing some kind
of feedback loop where new timer expirations keep getting scheduled
with a very short interval such that the softirq never gets to run
(until other interrupt activity disrupts the feedback loop). I tried
reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
relevant and it didn't help, but on further review (right now) there
seem to be a few related commits just before it that might be
responsible for the regression. I'll see if I can dig up anything else
useful.

> Regarding the previous version, did you reach a consensus with Mark
> Rutland on the per_cpu irq?

I'm not sure. I think I can reasonably say that the existing percpu
framework is not suitable or necessary for modeling what the jcore
hardware is doing. The interrupt controller driver seems to have been
accepted already without use of percpu stuff. I know there was some
concern that not letting the kernel know whether an irq is percpu or
not could lead to wrong behavior, but I believe that's only possible
in the other direction (wrongly registering an irq as percpu when it's
actually a normal one that could happen on either cpu, in which case
locking might be wrongly omitted, etc.).

Rich


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-26 22:23       ` Rich Felker
@ 2016-09-26 23:55         ` Thomas Gleixner
  2016-09-27  0:42           ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2016-09-26 23:55 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Mon, 26 Sep 2016, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> Based on use of ftrace, I was able to see situations where a second
> timer hardirq happened immediately after one occurred, before the
> timer softirq could run. My theory is that this is causing some kind
> of feedback loop where new timer expirations keep getting scheduled
> with a very short interval such that the softirq never gets to run
> (until other interrupt activity disrupts the feedback loop). I tried
> reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> relevant and it didn't help, but on further review (right now) there
> seem to be a few related commits just before it that might be
> responsible for the regression. I'll see if I can dig up anything else
> useful.

Interesting theory, but lets look at the data from the other thread:

         <idle>-0     [001] d.h.  2646.702790: irq_handler_entry: irq=72 name=jcore_pit
          <idle>-0     [001] d.h.  2646.703398: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d.h.  2646.703607: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d.h.  2646.703884: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.  2646.704191: irq_handler_exit: irq=72 ret=handled

So it takes 1.4ms to finish the timer irq handler

          <idle>-0     [001] d.H.  2646.704509: irq_handler_entry: irq=72 name=jcore_pit
          <idle>-0     [001] d.H.  2646.704990: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d.H.  2646.705182: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d.H.  2646.705465: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.H.  2646.705761: irq_handler_exit: irq=72 ret=handled

And 1.2ms in this case
	  

          <idle>-0     [001] d.H.  2646.706071: irq_handler_entry: irq=72 name=jcore_pit
          <idle>-0     [001] d.H.  2646.706328: irq_handler_exit: irq=72 ret=handled

So this one is actually short enough that the soft interrupt can execute and
obviously no new timer is scheduled close enough.

          <idle>-0     [001] ..s.  2646.706530: softirq_entry: vec=1 [action=TIMER]

Now the obvious question arises: Why is that handler taking that long? And
sure the next event being programmed shortly after the other has to do
with various factors (HZ setting, High-resolution timers enabled?). W/o
having the corresponding hrtimer tracepoints available (start and expire)
it's hard to tell what's going on.

But I can tell for sure that there is no feedback loop and it's not
disrupted by some other interrupt. The above trace is merely seeing the PIT
irq and nothing else.

Can you please provide a trace with hrtimer tracepoints enabled and tell us
the settings of CONFIG_HZ and whether high resolution timers are enabled or
not.

That workaround of setting min_delta to some absurd big number looks fishy
at best.
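For reference, if I read jcore_pit_local_init() correctly, that
min_delta is DIV_ROUND_CLOSEST(5000000, buspd), i.e. 5,000,000 ns
expressed in bus-period counts, which amounts to a fixed ~5ms floor on
how soon the next event can be programmed regardless of the bus clock.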

Thanks,

	tglx


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-26 23:55         ` Thomas Gleixner
@ 2016-09-27  0:42           ` Rich Felker
  2016-09-27 22:08             ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-09-27  0:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> On Mon, 26 Sep 2016, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > Based on use of ftrace, I was able to see situations where a second
> > timer hardirq happened immediately after one occurred, before the
> > timer softirq could run. My theory is that this is causing some kind
> > of feedback loop where new timer expirations keep getting scheduled
> > with a very short interval such that the softirq never gets to run
> > (until other interrupt activity disrupts the feedback loop). I tried
> > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > relevant and it didn't help, but on further review (right now) there
> > seem to be a few related commits just before it that might be
> > responsible for the regression. I'll see if I can dig up anything else
> > useful.
> 
> Interesting theory, but lets look at the data from the other thread:
> 
>          <idle>-0     [001] d.h.  2646.702790: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.h.  2646.703398: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.h.  2646.703607: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.h.  2646.703884: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [001] d.h.  2646.704191: irq_handler_exit: irq=72 ret=handled
> 
> So it takes 1.4ms to finish the timer irq handler
> 
>           <idle>-0     [001] d.H.  2646.704509: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.H.  2646.704990: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.H.  2646.705182: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.H.  2646.705465: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [001] d.H.  2646.705761: irq_handler_exit: irq=72 ret=handled
> 
> And 1.2ms in this case
> 	  
> 
>           <idle>-0     [001] d.H.  2646.706071: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.H.  2646.706328: irq_handler_exit: irq=72 ret=handled
> 
> So this one is actually short enough that the soft interrupt can execute and
> obviously no new timer is scheduled close enough.
> 
>           <idle>-0     [001] ..s.  2646.706530: softirq_entry: vec=1 [action=TIMER]
> 
> Now the obvious question arises: Why is that handler taking that long? And

I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 60000 cycles.
It's possible much of that overhead is coming from ftrace. I can play
around again with just logging the minimum intervals that are
programmed, and not using ftrace.

> sure the next event being programmed shortly after the other has to do
> with various factors (HZ setting, High-resolution timers enabled?). W/o
> having the corresponding hrtimer tracepoints available (start and expire)
> it's hard to tell what's going on.

I do have hrtimers enabled. If it would be helpful I could try again
with traces for them enabled too.

> But I can tell for sure that there is no feedback loop and it's not
> disrupted by some other interrupt.

FYI no rcu_sched messages or noticable stalls were observed with
tracing active. The cpu load from tracing is so high that I would not
expect to see the same behavior. I probably failed to make that clear
before; sorry.

> The above trace is merely seeing the PIT
> irq and nothing else.

It is showing another one happening before the softirq side of the
first could run, right? Of course this is probably due to overall
ftrace slowness. CONFIG_DYNAMIC_FTRACE does not work on big-endian sh
(the objdump needed at build time is hard-coded to assume sh is
little-endian and I haven't yet worked out what all needs to be fixed)
so I suspect that may be the cause of ftrace being so slow when it's
turned on.

Oddly, the min_delta cutoff needed to prevent the stalls was on the
same order of magnitude as these numbers (2.5ms almost entirely made
them go away; 5ms completely).

> Can you please provide a trace with hrtimer tracepoints enabled and tell us
> the settings of CONFIG_HZ and whether high resolution timers are enabled or
> not.

Yes. I think these are the relevant config items:

CONFIG_HZ=100
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

Let me know if I missed any. I'll try getting a trace and follow up
again.

> That workaround of setting min_delta to some absurd big number looks fishy
> at best.

I agree. I'd like to find the right fix, but usability badly suffers
without either a proper fix or a workaround.

Rich


* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-27  0:42           ` Rich Felker
@ 2016-09-27 22:08             ` Rich Felker
  2016-09-30 13:15               ` Thomas Gleixner
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-09-27 22:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Mon, Sep 26, 2016 at 08:42:58PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 07:55:13PM -0400, Thomas Gleixner wrote:
> > On Mon, 26 Sep 2016, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > Based on use of ftrace, I was able to see situations where a second
> > > timer hardirq happened immediately after one occurred, before the
> > > timer softirq could run. My theory is that this is causing some kind
> > > of feedback loop where new timer expirations keep getting scheduled
> > > with a very short interval such that the softirq never gets to run
> > > (until other interrupt activity disrupts the feedback loop). I tried
> > > reverting 4e85876a9d2a977b4a07389da8c07edf76d10825 which seemed
> > > relevant and it didn't help, but on further review (right now) there
> > > seem to be a few related commits just before it that might be
> > > responsible for the regression. I'll see if I can dig up anything else
> > > useful.
> > 
> > Interesting theory, but lets look at the data from the other thread:
> > 
> >          <idle>-0     [001] d.h.  2646.702790: irq_handler_entry: irq=72 name=jcore_pit
> >           <idle>-0     [001] d.h.  2646.703398: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [001] d.h.  2646.703607: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [001] d.h.  2646.703884: softirq_raise: vec=7 [action=SCHED]
> >           <idle>-0     [001] d.h.  2646.704191: irq_handler_exit: irq=72 ret=handled
> > 
> > So it takes 1.4ms to finish the timer irq handler
> > 
> >           <idle>-0     [001] d.H.  2646.704509: irq_handler_entry: irq=72 name=jcore_pit
> >           <idle>-0     [001] d.H.  2646.704990: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [001] d.H.  2646.705182: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [001] d.H.  2646.705465: softirq_raise: vec=7 [action=SCHED]
> >           <idle>-0     [001] d.H.  2646.705761: irq_handler_exit: irq=72 ret=handled
> > 
> > And 1.2ms in this case
> > 	  
> > 
> >           <idle>-0     [001] d.H.  2646.706071: irq_handler_entry: irq=72 name=jcore_pit
> >           <idle>-0     [001] d.H.  2646.706328: irq_handler_exit: irq=72 ret=handled
> > 
> > So this one is actually short enough that the soft interrupt can execute and
> > obviously no new timer is scheduled close enough.
> > 
> >           <idle>-0     [001] ..s.  2646.706530: softirq_entry: vec=1 [action=TIMER]
> > 
> > Now the obvious question arises: Why is that handler taking that long? And
> 
> I'm not sure. Even at our clock rate of 50 MHz, 1.2ms is 60000 cycles.
> It's possible much of that overhead is coming from ftrace. I can play
> around again with just logging the minimum intervals that are
> programmed, and not using ftrace.
> 
> > sure the next event being programmed shortly after the other has to do
> > with various factors (HZ setting, High-resolution timers enabled?). W/o
> > having the corresponding hrtimer tracepoints available (start and expire)
> > it's hard to tell what's going on.
> 
> I do have hrtimers enabled. If it would be helpful I could try again
> with traces for them enabled too.
> 
> > But I can tell for sure that there is no feedback loop and it's not
> > disrupted by some other interrupt.
> 
> FYI no rcu_sched messages or noticable stalls were observed with
> tracing active. The cpu load from tracing is so high that I would not
> expect to see the same behavior. I probably failed to make that clear
> before; sorry.

I've managed to get a trace with a stall. I'm not sure what the best
way to share the full thing is, since it's large, but here are the
potentially interesting parts.

The first is a big time gap with no events, from 82.446093 to
109.852709:

          <idle>-0     [001] .ns.    82.431940: timer_expire_exit: timer=160a9eb0
          <idle>-0     [001] .ns.    82.432088: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] .ns.    82.432180: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] .ns.    82.432380: softirq_exit: vec=7 [action=SCHED]
     ksoftirqd/1-14    [001] ..s.    82.433042: softirq_entry: vec=1 [action=TIMER]
     ksoftirqd/1-14    [001] ..s.    82.433166: softirq_exit: vec=1 [action=TIMER]
     ksoftirqd/1-14    [001] ..s.    82.433258: softirq_entry: vec=7 [action=SCHED]
     ksoftirqd/1-14    [001] ..s.    82.433428: softirq_exit: vec=7 [action=SCHED]
       rcu_sched-7     [001] ....    82.434269: timer_init: timer=160a9eb0
       rcu_sched-7     [001] d...    82.434410: timer_start: timer=160a9eb0 function=process_timeout expires=4294945532 [timeout=1] flags=0x00000001
             cat-394   [000] d.h.    82.437621: irq_handler_entry: irq=16 name=jcore_pit
             cat-394   [000] d.h.    82.437851: hrtimer_cancel: hrtimer=109e949c
             cat-394   [000] d.h.    82.438006: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=82350361520
             cat-394   [000] d.h.    82.438555: softirq_raise: vec=1 [action=TIMER]
             cat-394   [000] d.h.    82.438780: softirq_raise: vec=9 [action=RCU]
             cat-394   [000] d.h.    82.439245: hrtimer_expire_exit: hrtimer=109e949c
             cat-394   [000] d.h.    82.439402: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=82360000000 softexpires=82360000000
             cat-394   [000] d.h.    82.439633: irq_handler_exit: irq=16 ret=handled
             cat-394   [000] ..s.    82.439834: softirq_entry: vec=1 [action=TIMER]
             cat-394   [000] ..s.    82.440011: softirq_exit: vec=1 [action=TIMER]
             cat-394   [000] ..s.    82.440125: softirq_entry: vec=9 [action=RCU]
             cat-394   [000] ..s.    82.440416: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [001] d.h.    82.443398: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [001] d.h.    82.443573: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.    82.443689: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=82356112320
          <idle>-0     [001] d.h.    82.443911: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d.h.    82.444089: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d.h.    82.444306: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.    82.444448: hrtimer_expire_exit: hrtimer=109f449c
          <idle>-0     [001] d.h.    82.444592: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=82360000000 softexpires=82360000000
          <idle>-0     [001] d.h.    82.444821: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] ..s.    82.445034: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.    82.445222: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.    82.445341: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.    82.445745: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.    82.445880: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [001] ..s.    82.446093: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [000] dn..   109.852709: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] dn..   109.853043: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109770000000 softexpires=109770000000
              sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 name=uartlite
              sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 ret=handled
              sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 name=uartlite
              sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 ret=unhandled
          <idle>-0     [001] d.h.   109.857684: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [001] d.h.   109.857934: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.   109.858087: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=109770452820
          <idle>-0     [001] d.h.   109.858639: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d.h.   109.858839: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d.h.   109.859097: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.   109.859282: hrtimer_expire_exit: hrtimer=109f449c
          <idle>-0     [001] d.h.   109.859428: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109780000000 softexpires=109780000000
          <idle>-0     [001] d.h.   109.859637: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] ..s.   109.859806: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   109.859984: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   109.860098: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   109.860533: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   109.860663: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [001] .ns.   109.861424: softirq_exit: vec=9 [action=RCU]
              sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 name=jcore_pit
              sh-388   [000] d.h.   109.862071: hrtimer_cancel: hrtimer=109e949c
       rcu_sched-7     [001] ....   109.862141: timer_init: timer=160a9eb0
              sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109774586740
       rcu_sched-7     [001] d...   109.862317: timer_start: timer=160a9eb0 function=process_timeout expires=4294948275 [timeout=1] flags=0x00000001

However no stall message appeared on the console here, and I suspect
it might have just been a gap in event recording because it doesn't
look terribly suspicious. At first I thought (and discussed with
Daniel off-list) that the hrtimer_cancel at the end of the long gap
might have been spinning for the whole time, but this does not match
the behavior of the stalls I'm observing with tracing off; by hooking
up LED output in the hrtimer_cancel code paths, I see rapid bursts of
successful hrtimer_cancel (no repeated hrtimer_try_to_cancel looping)
leading up to the rcu_sched stall message. That matches with what I'm
seeing later in the event log, between 109s and 132s where the stall
occurred:

          <idle>-0     [000] d.h.   109.946127: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.946339: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.946486: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109858865160
          <idle>-0     [000] d.h.   109.946714: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d.h.   109.946918: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [000] d.h.   109.947193: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.947364: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.947498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109860000000 softexpires=109860000000
          <idle>-0     [000] d.h.   109.947736: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.947887: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109860266000
          <idle>-0     [000] d.h.   109.948394: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d.h.   109.948600: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [000] d.h.   109.948870: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.949036: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.949170: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109870000000 softexpires=109870000000
          <idle>-0     [000] d.h.   109.949389: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.949585: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.949785: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.949902: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.950347: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.950472: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [000] ..s.   109.951089: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [001] dn..   109.951213: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] dn..   109.951476: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109870000000 softexpires=109870000000
       rcu_sched-7     [001] d...   109.952030: timer_cancel: timer=160a9eb0
       rcu_sched-7     [001] ....   109.952478: timer_init: timer=160a9eb0
       rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
          <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
          <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
          <idle>-0     [000] d.h.   109.957708: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.957918: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.958061: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109870441200
          <idle>-0     [000] d.h.   109.958571: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d.h.   109.958791: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [000] d.h.   109.959061: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.959235: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.959369: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109880000000 softexpires=109880000000
          <idle>-0     [000] d.h.   109.959591: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.959795: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.959970: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.960088: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.960529: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.960656: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [000] ..s.   109.960890: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [000] d.h.   109.967689: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.967925: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.968070: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109880451380
          <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   109.968307: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d.h.   109.968522: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
          <idle>-0     [000] d.h.   109.968803: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.968969: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
          <idle>-0     [000] d.h.   109.969324: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.969527: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.969704: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.969815: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.970266: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.970392: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [000] ..s.   109.970635: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
          <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   109.978288: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
          <idle>-0     [000] d.h.   109.978635: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.978812: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.978945: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109900000000 softexpires=109900000000
          <idle>-0     [000] d.h.   109.979167: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.979365: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.979542: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.979655: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.980097: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
          <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   109.988339: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109910000000 softexpires=109910000000
          <idle>-0     [000] d.h.   109.988692: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.988877: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.989012: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109910000000 softexpires=109910000000
          <idle>-0     [000] d.h.   109.989232: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.989429: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.989602: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.989717: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   109.990160: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.997726: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   109.997954: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.998094: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109910476420
          <idle>-0     [001] d...   109.998243: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   109.998340: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   109.998541: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109920000000 softexpires=109920000000
          <idle>-0     [000] d.h.   109.998692: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   109.998879: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   109.999013: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109920000000 softexpires=109920000000
          <idle>-0     [000] d.h.   109.999234: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   109.999440: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.999623: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   109.999735: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   110.000178: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   110.007712: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   110.007955: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   110.008100: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109920477560
          <idle>-0     [001] d...   110.008253: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   110.008334: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   110.008548: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109930000000 softexpires=109930000000
          <idle>-0     [000] d.h.   110.008691: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   110.008875: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   110.009011: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109930000000 softexpires=109930000000
          <idle>-0     [000] d.h.   110.009232: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   110.009435: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   110.009610: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   110.009726: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   110.010172: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   110.017731: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   110.017960: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   110.018106: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109930482300
          <idle>-0     [001] d...   110.018252: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   110.018347: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   110.018552: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109940000000 softexpires=109940000000
          <idle>-0     [000] d.h.   110.018701: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   110.018881: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   110.019021: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109940000000 softexpires=109940000000
          <idle>-0     [000] d.h.   110.019239: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit

The rcu_sched process does not run again after the tick_stop until
132s, and only a few RCU softirqs happen (all shown above). During
this time, cpu1 has no interrupt activity and nothing in the trace
except the above hrtimer_cancel/hrtimer_start pairs (not sure how
they're happening without any interrupts).

This pattern repeats until almost 131s, where cpu1 goes into a frenzy
of hrtimer_cancel/start:

          <idle>-0     [000] d.h.   130.967671: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.h.   130.967901: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d.h.   130.968044: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=130880420820
          <idle>-0     [001] d...   130.968189: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   130.968281: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d...   130.968497: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130890000000 softexpires=130890000000
          <idle>-0     [001] d...   130.978145: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   130.978421: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130900000000 softexpires=130900000000
          <idle>-0     [001] d...   130.988149: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   130.988432: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130910000000 softexpires=130910000000
          <idle>-0     [001] d...   130.998157: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   130.998439: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130920000000 softexpires=130920000000
          <idle>-0     [001] d...   131.008183: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   131.008463: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130930000000 softexpires=130930000000
          <idle>-0     [001] d...   131.018130: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   131.018407: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=130940000000 softexpires=130940000000
...
          <idle>-0     [001] d...   132.188156: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   132.188442: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132110000000 softexpires=132110000000
          <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
          <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
          <idle>-0     [000] d.h.   132.205999: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handled
          <idle>-0     [000] d.h.   132.206208: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   132.206350: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=130890000000 softexpires=130890000000
          <idle>-0     [000] d.h.   132.206590: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [000] d.h.   132.206745: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=132119115200
          <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
          <idle>-0     [000] d.h.   132.206985: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d.h.   132.207315: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [000] d.h.   132.207508: hrtimer_expire_exit: hrtimer=109e949c
          <idle>-0     [000] d.h.   132.207643: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=132120000000 softexpires=132120000000
       rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0
          <idle>-0     [000] d.h.   132.207795: hrtimer_cancel: hrtimer=108d0098
          <idle>-0     [000] d.h.   132.207933: hrtimer_expire_entry: hrtimer=108d0098 function=sched_clock_poll now=132119115200
          <idle>-0     [000] d.h.   132.208140: hrtimer_expire_exit: hrtimer=108d0098
          <idle>-0     [000] d.h.   132.208275: hrtimer_start: hrtimer=108d0098 function=sched_clock_poll expires=133143986114 softexpires=133143986114
       rcu_sched-7     [001] ....   132.208303: timer_init: timer=160a9eb0
          <idle>-0     [000] d.h.   132.208524: hrtimer_cancel: hrtimer=109e949c
       rcu_sched-7     [001] d...   132.208598: timer_start: timer=160a9eb0 function=process_timeout expires=4294950509 [timeout=1] flags=0x00000001

Ending with an IPI interrupt and rcu_sched finally getting to run.

A couple oddities I noticed:

- During the whole sequence, hrtimer expiration times are being set to
  exact jiffies (@ 100 Hz), whereas before this sequence they're quite arbitrary.

- The CLOCK_MONOTONIC hrtimer times do not match up with the
  timestamps; they're off by about 0.087s. I assume this is just
  sched_clock vs clocksource time and not a big deal.

- The rcu_sched process is sleeping with timeout=1. This seems
  odd/excessive.

Any ideas what could be going on?

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-27 22:08             ` Rich Felker
@ 2016-09-30 13:15               ` Thomas Gleixner
  2016-09-30 13:48                 ` Paul E. McKenney
  2016-10-08  1:32                 ` Rich Felker
  0 siblings, 2 replies; 30+ messages in thread
From: Thomas Gleixner @ 2016-09-30 13:15 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Tue, 27 Sep 2016, Rich Felker wrote:
> I've managed to get a trace with a stall. I'm not sure what the best
> way to share the full thing is, since it's large, but here are the
> potentially interesting parts.

Upload it somewhere.
 
> The first is a big time gap with no events, from 82.446093 to
> 109.852709:
> 

>           <idle>-0     [001] d.h.    82.443398: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [001] d.h.    82.443573: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d.h.    82.443689: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=82356112320

Tick timer expires

>           <idle>-0     [001] d.h.    82.443911: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.h.    82.444089: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.h.    82.444306: softirq_raise: vec=7 [action=SCHED]

Raises softirqs

>           <idle>-0     [001] d.h.    82.444448: hrtimer_expire_exit: hrtimer=109f449c
>           <idle>-0     [001] d.h.    82.444592: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=82360000000 softexpires=82360000000

Rearms the next period

>           <idle>-0     [001] d.h.    82.444821: irq_handler_exit: irq=16 ret=handled
>           <idle>-0     [001] ..s.    82.445034: softirq_entry: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.    82.445222: softirq_exit: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.    82.445341: softirq_entry: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.    82.445745: softirq_exit: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.    82.445880: softirq_entry: vec=9 [action=RCU]
>           <idle>-0     [001] ..s.    82.446093: softirq_exit: vec=9 [action=RCU]

Softirqs done and the machine vanishes into lala land ....

And now after that gap we have a cancel/start on CPU 0

>           <idle>-0     [000] dn..   109.852709: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] dn..   109.853043: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109770000000 softexpires=109770000000

>               sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 name=uartlite
>               sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 ret=handled
>               sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 name=uartlite
>               sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 ret=unhandled
>           <idle>-0     [001] d.h.   109.857684: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [001] d.h.   109.857934: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d.h.   109.858087: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=109770452820

And this expiry happens on CPU 1, but the expiry time looks like the one
which got armed on CPU 0 above !?!?!

>           <idle>-0     [001] d.h.   109.858639: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.h.   109.858839: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.h.   109.859097: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [001] d.h.   109.859282: hrtimer_expire_exit: hrtimer=109f449c
>           <idle>-0     [001] d.h.   109.859428: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109780000000 softexpires=109780000000
>           <idle>-0     [001] d.h.   109.859637: irq_handler_exit: irq=16 ret=handled
>           <idle>-0     [001] ..s.   109.859806: softirq_entry: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.   109.859984: softirq_exit: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.   109.860098: softirq_entry: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.   109.860533: softirq_exit: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.   109.860663: softirq_entry: vec=9 [action=RCU]
>           <idle>-0     [001] .ns.   109.861424: softirq_exit: vec=9 [action=RCU]
>               sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 name=jcore_pit
>               sh-388   [000] d.h.   109.862071: hrtimer_cancel: hrtimer=109e949c
>        rcu_sched-7     [001] ....   109.862141: timer_init: timer=160a9eb0
>               sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109774586740

And CPU0 has its timer firing 4.5ms late .....

>        rcu_sched-7     [001] d...   109.862317: timer_start: timer=160a9eb0 function=process_timeout expires=4294948275 [timeout=1] flags=0x00000001
> 
> However no stall message appeared on the console here, and I suspect
> it might have just been a gap in event recording because it doesn't
> look terribly suspicious. At first I thought (and discussed with

Well. If there are gaps in the traces then decoding this is going to be a
nightmare.

> Daniel off-list) that the hrtimer_cancel at the end of the long gap
> might have been spinning for the whole time, but this does not match
> the behavior of the stalls I'm observing with tracing off; by hooking

Why would it be spinning? There is no way that the callback of the tick timer
runs when cancel is called. It's cpu local.....

> up LED output in the hrtimer_cancel code paths, I see rapid bursts of
> successful hrtimer_cancel (no repeated hrtimer_try_to_cancel looping)
> leading up to the rcu_sched stall message. That matches with what I'm
> seeing later in the event log, between 109s and 132s where the stall
> occurred:
> 
>           <idle>-0     [000] d.h.   109.946127: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [000] d.h.   109.946339: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.946486: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109858865160
>           <idle>-0     [000] d.h.   109.946714: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [000] d.h.   109.946918: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [000] d.h.   109.947193: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [000] d.h.   109.947364: hrtimer_expire_exit: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.947498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109860000000 softexpires=109860000000
>           <idle>-0     [000] d.h.   109.947736: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.947887: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109860266000
>           <idle>-0     [000] d.h.   109.948394: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [000] d.h.   109.948600: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [000] d.h.   109.948870: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [000] d.h.   109.949036: hrtimer_expire_exit: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.949170: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109870000000 softexpires=109870000000
>           <idle>-0     [000] d.h.   109.949389: irq_handler_exit: irq=16 ret=handled
>           <idle>-0     [000] ..s.   109.949585: softirq_entry: vec=1 [action=TIMER]
>           <idle>-0     [000] ..s.   109.949785: softirq_exit: vec=1 [action=TIMER]
>           <idle>-0     [000] ..s.   109.949902: softirq_entry: vec=7 [action=SCHED]
>           <idle>-0     [000] ..s.   109.950347: softirq_exit: vec=7 [action=SCHED]
>           <idle>-0     [000] ..s.   109.950472: softirq_entry: vec=9 [action=RCU]
>           <idle>-0     [000] ..s.   109.951089: softirq_exit: vec=9 [action=RCU]
>           <idle>-0     [001] dn..   109.951213: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] dn..   109.951476: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109870000000 softexpires=109870000000
>        rcu_sched-7     [001] d...   109.952030: timer_cancel: timer=160a9eb0
>        rcu_sched-7     [001] ....   109.952478: timer_init: timer=160a9eb0
>        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001

So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
into a NOHZ idle sleep.

>           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
>           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000

which is (using the 0.087621s delta between the trace clock and clock
MONO) at: 109.880 + 0.087621 ~= 109.968

Which is about correct as we expect the RCU timer to fire at:
      
      109.952633 + 0.01 = 109.963633

or latest at

      109.952633 + 0.02 = 109.983633
   
There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
it did not because:

>        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001

The CPU nr encoded in flags is: 1
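
For reference, the cpu number lives in the low bits of that flags field.
A sketch based on the 4.8-era <linux/timer.h> layout (illustration only,
not a claim about other kernel versions):

    /* flags=0x00000001 from the timer_start event above -> cpu 1 */
    static inline unsigned int timer_flags_to_cpu(unsigned int flags)
    {
            return flags & TIMER_CPUMASK;   /* low bits hold the cpu nr */
    }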

Now we cancel and restart the timer w/o seeing the interrupt expiring
it. And that expiry should have happened at 109.968000 !?!

>           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000

So this advances the next tick even further out. And CPU 0 sets the timer to
the exact same value:

>           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000


>           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160

Which expires here. And CPU1 instead of getting an interrupt and expiring
the timer does the cancel/restart to the next jiffie again:

>           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000

And this repeats:

>           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620

>           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109910000000 softexpires=109910000000

There is something badly wrong here.

>           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
>           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
>           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
>           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
>           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> 
> The rcu_sched process does not run again after the tick_stop until
> 132s, and only a few RCU softirqs happen (all shown above). During
> this time, cpu1 has no interrupt activity and nothing in the trace
> except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> they're happening without any interrupts).

If you drop out of the arch idle into the core idle loop then you might end
up with this. You want to add a few trace points or trace_printks() to the
involved functions. tick_nohz_restart(), tick_nohz_stop_sched_tick(),
tick_nohz_restart_sched_tick() and the idle code should be a good starting
point.
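
Something like the following one-liner at the top of each of those functions
(and the same line works in the driver's set_next_event callback) is usually
enough. Just a sketch -- adjust the arguments to whatever is in scope:

    /* illustration only: cheap breadcrumb for the functions named above */
    trace_printk("%s: cpu=%d now=%lld\n", __func__,
                 smp_processor_id(), (long long)ktime_to_ns(ktime_get()));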
 
> This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> of hrtimer_cancel/start:

It's not a frenzy. It's the same pattern as above. It arms the timer to the
next tick, but that timer never ever fires. And it does that every tick ....

Please put a tracepoint into your set_next_event() callback as well. So
this changes here:

>           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000

>           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
>           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handled

So CPU1 gets an IPI

>           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [000] d.h.   132.206745: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=132119115200
>           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000

And rcu_sched-7 starts running magically, but we don't know what woke it
up. Definitely not the timer, because that did not fire.

>        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0


> - During the whole sequence, hrtimer expiration times are being set to
>   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.

When a CPU goes into NOHZ idle and the next (timer/hrtimer) expiry is farther
out than the next tick, then tick_sched_timer is set to that next event,
which can be far out. So that's expected.
 
> - The CLOCK_MONOTONIC hrtimer times do not match up with the
>   timestamps; they're off by about 0.087s. I assume this is just
>   sched_clock vs clocksource time and not a big deal.

Yes. You can tell the tracer to use clock monotonic so then they should match.
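
E.g. (assuming tracefs is mounted in the usual place under debugfs):

    echo mono > /sys/kernel/debug/tracing/trace_clock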
 
> - The rcu_sched process is sleeping with timeout=1. This seems
>   odd/excessive.

Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
the problem at all. The problem is your timer not firing, while the cpu is
obviously getting out of idle and then moving the tick ahead for some
unknown reason.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-30 13:15               ` Thomas Gleixner
@ 2016-09-30 13:48                 ` Paul E. McKenney
  2016-10-01 17:05                   ` Rich Felker
  2016-10-08  1:32                 ` Rich Felker
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2016-09-30 13:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rich Felker, Daniel Lezcano, devicetree, linux-kernel, linux-sh,
	Rob Herring, Mark Rutland

On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.

[ . . . ]

Some RCU commentary, on the off-chance that it helps...

> So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> into a NOHZ idle sleep.
> 
> >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> 
> which is (using the 0.087621us delta between the trace clock and clock
> MONO) at: 109.880 + 0.087621 = 109.968
> 
> Which is about correct as we expect the RCU timer to fire at:
>       
>       109.952633 + 0.01 = 109.963633
> 
> or latest at
> 
>       109.952633 + 0.02 = 109.983633
>    
> There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> it did not because:

Just for annoying completeness, the location of the timer depends on how
the rcuo callback-offload kthreads are constrained.  And yes, in the most
constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
by default all run on CPU 0.

> >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> 
> The CPU nr encoded in flags is: 1
> 
> Now we cancel and restart the timer w/o seing the interrupt expiring
> it. And that expiry should have happened at 109.968000 !?!
> 
> >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> 
> So this advances the next tick even further out. And CPU 0 sets the timer to
> the exact smae value:
> 
> >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> 
> 
> >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> 
> Which expires here. And CPU1 instead of getting an interrupt and expiring
> the timer does the cancel/restart to the next jiffie again:
> 
> >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> 
> And this repeats;
> 
> >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> 
> >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> 
> There is something badly wrong here.
> 
> >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > 
> > The rcu_sched process does not run again after the tick_stop until
> > 132s, and only a few RCU softirqs happen (all shown above). During
> > this time, cpu1 has no interrupt activity and nothing in the trace
> > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > they're happening without any interrupts).
> 
> If you drop out of the arch idle into the core idle loop then you might end
> up with this. You want to add a few trace points or trace_printks() to the
> involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> tick_nohz_restart_sched_tick() and the idle code should be a good starting
> point.
> 
> > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > of hrtimer_cancel/start:
> 
> It's not a frenzy. It's the same pattern as above. It arms the timer to the
> next tick, but that timer never ever fires. And it does that every tick ....
> 
> Please put a tracepoint into your set_next_event() callback as well. SO
> this changes here:
> 
> >           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> 
> >           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
> >           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handle
> 
> So CPU1 gets an IPI
> 
> >           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
> 49c function=tick_sched_timer now=132119115200
> >           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> 
> And rcu-sched-7 gets running magically, but we don't know what woke it
> up. Definitely not the timer, because that did not fire.
> 
> >        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0

It could have been an explicit wakeup at the end of a grace period.  That
would explain its cancelling the timer, anyway.

> > - During the whole sequence, hrtimer expiration times are being set to
> >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> 
> When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> than the next tick, then tick_sched_timer is set to this next event which
> can be far out. So that's expected.
> 
> > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> >   timestamps; they're off by about 0.087s. I assume this is just
> >   sched_clock vs clocksource time and not a big deal.
> 
> Yes. You can tell the tracer to use clock monotonic so then they should match.
> 
> > - The rcu_sched process is sleeping with timeout=1. This seems
> >   odd/excessive.
> 
> Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> the problem at all. The problem is your timer not firing, but the cpu is
> obviously either getting out of idle and then moves the tick ahead for some
> unknown reason.

And a one-jiffy timeout is in fact expected behavior when HZ=100.
You have to be running HZ=250 or better to have two-jiffy timeouts,
and HZ=500 or better for three-jiffy timeouts.
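
(That comes from the default force-quiescent-state delay, which in 4.8-era
kernel/rcu/tree.h is computed as below -- quoting from memory, so take it
as a sketch:

    #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))

With HZ=100 that evaluates to one jiffy, which matches the timeout=1 seen
in the trace.)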

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-30 13:48                 ` Paul E. McKenney
@ 2016-10-01 17:05                   ` Rich Felker
  2016-10-01 17:58                     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-01 17:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > I've managed to get a trace with a stall. I'm not sure what the best
> > > way to share the full thing is, since it's large, but here are the
> > > potentially interesting parts.
> 
> [ . . . ]
> 
> Some RCU commentary, on the off-chance that it helps...
> 
> > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > into a NOHZ idle sleep.
> > 
> > >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> > >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> > >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> > 
> > which is (using the 0.087621us delta between the trace clock and clock
> > MONO) at: 109.880 + 0.087621 = 109.968
> > 
> > Which is about correct as we expect the RCU timer to fire at:
> >       
> >       109.952633 + 0.01 = 109.963633
> > 
> > or latest at
> > 
> >       109.952633 + 0.02 = 109.983633
> >    
> > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> > it did not because:
> 
> Just for annoying completeness, the location of the timer depends on how
> the rcuo callback-offload kthreads are constrained.  And yes, in the most
> constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> by default all run on CPU 0.

In default full nohz configuration, am I correct that all cpus except
cpu0 will be nohz and that the rcu callbacks then have to run on
cpu0?

> > >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> > 
> > The CPU nr encoded in flags is: 1
> > 
> > Now we cancel and restart the timer w/o seing the interrupt expiring
> > it. And that expiry should have happened at 109.968000 !?!
> > 
> > >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> > >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > 
> > So this advances the next tick even further out. And CPU 0 sets the timer to
> > the exact smae value:
> > 
> > >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > 
> > 
> > >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> > >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> > >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> > 
> > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > the timer does the cancel/restart to the next jiffie again:
> > 
> > >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> > >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> > 
> > And this repeats;
> > 
> > >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> > >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> > >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> > 
> > >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> > >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> > 
> > There is something badly wrong here.
> > 
> > >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> > >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> > >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> > >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> > >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > > 
> > > The rcu_sched process does not run again after the tick_stop until
> > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > they're happening without any interrupts).
> > 
> > If you drop out of the arch idle into the core idle loop then you might end
> > up with this. You want to add a few trace points or trace_printks() to the
> > involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> > tick_nohz_restart_sched_tick() and the idle code should be a good starting
> > point.
> > 
> > > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > > of hrtimer_cancel/start:
> > 
> > It's not a frenzy. It's the same pattern as above. It arms the timer to the
> > next tick, but that timer never ever fires. And it does that every tick ....
> > 
> > Please put a tracepoint into your set_next_event() callback as well. SO
> > this changes here:
> > 
> > >           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
> > >           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > 
> > >           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
> > >           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handle
> > 
> > So CPU1 gets an IPI
> > 
> > >           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
> > 49c function=tick_sched_timer now=132119115200
> > >           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > 
> > And rcu-sched-7 gets running magically, but we don't know what woke it
> > up. Definitely not the timer, because that did not fire.
> > 
> > >        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0
> 
> It could have been an explicit wakeup at the end of a grace period.  That
> would explain its cancelling the timer, anyway.

I think the rcu stall handler kicked it, no? Looking at the code
again, maybe that behavior needs to be explicitly turned on, so maybe
it's just the uart interrupt activity/load from the stall message that
breaks the stall condition.

> > > - During the whole sequence, hrtimer expiration times are being set to
> > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > 
> > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > than the next tick, then tick_sched_timer is set to this next event which
> > can be far out. So that's expected.
> > 
> > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > >   timestamps; they're off by about 0.087s. I assume this is just
> > >   sched_clock vs clocksource time and not a big deal.
> > 
> > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > 
> > > - The rcu_sched process is sleeping with timeout=1. This seems
> > >   odd/excessive.
> > 
> > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > the problem at all. The problem is your timer not firing, but the cpu is
> > obviously either getting out of idle and then moves the tick ahead for some
> > unknown reason.
> 
> And a one-jiffy timeout is in fact expected behavior when HZ=100.
> You have to be running HZ=250 or better to have two-jiffy timeouts,
> and HZ=500 or better for three-jiffy timeouts.

One possible theory I'm looking at is that the two cpus are both
waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
with sufficient consistency that every time the rcu_gp_fqs_check_wake
loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
outside the rcu_idle_enter/rcu_idle_exit range. Would this block
forward progress? I added an LED indicator in rcu_gp_fqs_check_wake
that shows the low 2 bits of rnp->qsmask every time it's called, and
under normal operation the LEDs just flash on momentarily or just one
stays on for a few seconds then goes off. During a stall both are
stuck on. I'm still trying to make sense of the code but my impression
so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
correspond directly to cpus; is that right? If so I'm a bit confused,
because I don't see how forward progress could ever happen if the cpu
on which rcu_gp_kthread is running is blocking forward progress of
rcu_gp_kthread.
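
For reference, the LED hack is nothing more than a line like this at the
top of rcu_gp_fqs_check_wake() (led_set() standing in for whatever
board-specific GPIO poking is available; debug only, not for merging):

    led_set(READ_ONCE(rnp->qsmask) & 0x3);  /* mirror low 2 qsmask bits */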

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-01 17:05                   ` Rich Felker
@ 2016-10-01 17:58                     ` Paul E. McKenney
  2016-10-02  0:00                       ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-01 17:58 UTC (permalink / raw)
  To: Rich Felker
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > way to share the full thing is, since it's large, but here are the
> > > > potentially interesting parts.
> > 
> > [ . . . ]
> > 
> > Some RCU commentary, on the off-chance that it helps...
> > 
> > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > > into a NOHZ idle sleep.
> > > 
> > > >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> > > >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> > > >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> > > 
> > > which is (using the 0.087621us delta between the trace clock and clock
> > > MONO) at: 109.880 + 0.087621 = 109.968
> > > 
> > > Which is about correct as we expect the RCU timer to fire at:
> > >       
> > >       109.952633 + 0.01 = 109.963633
> > > 
> > > or latest at
> > > 
> > >       109.952633 + 0.02 = 109.983633
> > >    
> > > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> > > it did not because:
> > 
> > Just for annoying completeness, the location of the timer depends on how
> > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > by default all run on CPU 0.
> 
> In default full nohz configuration, am I correct that all cpus except
> cpu0 willd be nohz and that the rcu callbacks then have to run on
> cpu0?

In recent kernels, it looks to me that none of the CPUs will be nohz by
default.  You have to build with both NO_HZ_FULL
and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.
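
I.e. something like this in the .config (option names as of 4.8):

    CONFIG_NO_HZ_FULL=y
    CONFIG_NO_HZ_FULL_ALL=y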

But are there rcuo kthreads running on your system?

> > > >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> > > 
> > > The CPU nr encoded in flags is: 1
> > > 
> > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > it. And that expiry should have happened at 109.968000 !?!
> > > 
> > > >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> > > >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > 
> > > So this advances the next tick even further out. And CPU 0 sets the timer to
> > > the exact smae value:
> > > 
> > > >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > 
> > > 
> > > >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> > > >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> > > >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > 
> > > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > > the timer does the cancel/restart to the next jiffie again:
> > > 
> > > >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> > > >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> > > 
> > > And this repeats;
> > > 
> > > >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> > > >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> > > >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > 
> > > >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> > > >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> > > 
> > > There is something badly wrong here.
> > > 
> > > >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> > > >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> > > >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> > > >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> > > >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > > > 
> > > > The rcu_sched process does not run again after the tick_stop until
> > > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > > they're happening without any interrupts).
> > > 
> > > If you drop out of the arch idle into the core idle loop then you might end
> > > up with this. You want to add a few trace points or trace_printks() to the
> > > involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> > > tick_nohz_restart_sched_tick() and the idle code should be a good starting
> > > point.
> > > 
> > > > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > > > of hrtimer_cancel/start:
> > > 
> > > It's not a frenzy. It's the same pattern as above. It arms the timer to the
> > > next tick, but that timer never ever fires. And it does that every tick ....
> > > 
> > > Please put a tracepoint into your set_next_event() callback as well. SO
> > > this changes here:
> > > 
> > > >           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
> > > >           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > 
> > > >           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
> > > >           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handle
> > > 
> > > So CPU1 gets an IPI
> > > 
> > > >           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
> > > 49c function=tick_sched_timer now=132119115200
> > > >           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > 
> > > And rcu-sched-7 gets running magically, but we don't know what woke it
> > > up. Definitely not the timer, because that did not fire.
> > > 
> > > >        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0
> > 
> > It could have been an explicit wakeup at the end of a grace period.  That
> > would explain its cancelling the timer, anyway.
> 
> I think the rcu stall handler kicked it, no? Looking at the code
> again, maybe that behavior needs to be explicitly turned on, so maybe
> it's just the uart interrupt activity/load from the stall message that
> breaks the stall condition.

That is indeed another alternative.  But either way, if your timers aren't
waking rcu_sched up when it asks to be awakened, you will very likely get
RCU CPU stall warning messages.  And on small systems, rcu_sched asks to
be awakened every few jiffies by default.

> > > > - During the whole sequence, hrtimer expiration times are being set to
> > > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > > 
> > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > > than the next tick, then tick_sched_timer is set to this next event which
> > > can be far out. So that's expected.
> > > 
> > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > >   sched_clock vs clocksource time and not a big deal.
> > > 
> > > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > > 
> > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > >   odd/excessive.
> > > 
> > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > > the problem at all. The problem is your timer not firing, but the cpu is
> > > obviously either getting out of idle and then moves the tick ahead for some
> > > unknown reason.
> > 
> > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > and HZ=500 or better for three-jiffy timeouts.
> 
> One possible theory I'm looking at is that the two cpus are both
> waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> with sufficient consistency that every time the rcu_gp_fqs_check_wake
> loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> forward process? I added an LED indicator in rcu_gp_fqs_check_wake
> that shows the low 2 bits of rnp->qsmask every time it's called, and
> under normal operation the LEDs just flash on momentarily or just one
> stays on for a few seconds then goes off. During a stall both are
> stuck on. I'm still trying to make sense of the code but my impression
> so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> correspond directly to cpus; is that right? If so I'm a bit confused
> because I don't see how forward progress could ever happen if the cpu
> on which rcu_gp_kthread is blocking forward progress of
> rcu_gp_kthread.

No.  If the CPUs are entering and leaving idle, and if your timers
were waking up rcu_sched every few jiffies like it asks, then the
repeated idle entry/exit events would be noticed, courtesy of the atomic
increments of ->dynticks and the rcu_sched kthread's snapshotting and
checking of this value.  Even if the CPUs were always non-idle every
time force_quiescent_state() is invoked, the counter changes would still
be noticed, give or take the possibility of counter wrap -- but even on
a 32-bit system, that takes a while.

But if your timers don't wake up rcu_sched, then force_quiescent_state()
will never be called, and the ->dynticks counter will be neither
snapshotted nor checked, which will mean that the idle CPUs will never
be noticed, which might mean that the grace period never completes.
Which will definitely result in an RCU CPU stall warning like the one
that you are seeing.
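
Roughly, the check that force_quiescent_state() drives looks like this --
a much-simplified sketch of the idea, not the exact 4.8 code, with names
shortened for illustration:

    /* first FQS pass: snapshot each CPU's dynticks counter */
    snap = atomic_read(&rdp->dynticks);

    /* later FQS pass: idle now (even counter), or any movement since the
     * snapshot, counts as a quiescent state for that CPU */
    curr = atomic_read(&rdp->dynticks);
    if ((curr & 0x1) == 0 || curr != snap)
            report_cpu_quiescent(rdp);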

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-01 17:58                     ` Paul E. McKenney
@ 2016-10-02  0:00                       ` Rich Felker
  2016-10-02  3:59                         ` Rich Felker
  2016-10-02  6:30                         ` Paul E. McKenney
  0 siblings, 2 replies; 30+ messages in thread
From: Rich Felker @ 2016-10-02  0:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > way to share the full thing is, since it's large, but here are the
> > > > > potentially interesting parts.
> > > 
> > > [ . . . ]
> > > 
> > > Some RCU commentary, on the off-chance that it helps...
> > > 
> > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > > > into a NOHZ idle sleep.
> > > > 
> > > > >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> > > > >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> > > > >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> > > > 
> > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > 
> > > > Which is about correct as we expect the RCU timer to fire at:
> > > >       
> > > >       109.952633 + 0.01 = 109.963633
> > > > 
> > > > or latest at
> > > > 
> > > >       109.952633 + 0.02 = 109.983633
> > > >    
> > > > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> > > > it did not because:
> > > 
> > > Just for annoying completeness, the location of the timer depends on how
> > > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > > by default all run on CPU 0.
> > 
> > In default full nohz configuration, am I correct that all cpus except
> > cpu0 willd be nohz and that the rcu callbacks then have to run on
> > cpu0?
> 
> In recent kernels, it looks to me that the default is that none of the
> CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.

Actually NO_HZ_FULL is not supported because
HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
it's just normal NO_HZ. I listed the relevant config items somewhere
earlier in this thread I think.

> But are there rcu0 kthreads running on your system?

Apparently not:

# ps aux | grep rcu
    7 root       0:00 [rcu_sched]
    8 root       0:00 [rcu_bh]
  395 root       0:00 grep rcu
          
> > > > >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> > > > 
> > > > The CPU nr encoded in flags is: 1
> > > > 
> > > > Now we cancel and restart the timer w/o seing the interrupt expiring
> > > > it. And that expiry should have happened at 109.968000 !?!
> > > > 
> > > > >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> > > > >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > > 
> > > > So this advances the next tick even further out. And CPU 0 sets the timer to
> > > > the exact same value:
> > > > 
> > > > >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > > 
> > > > 
> > > > >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> > > > >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> > > > >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > > 
> > > > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > > > the timer does the cancel/restart to the next jiffie again:
> > > > 
> > > > >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> > > > >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> > > > 
> > > > And this repeats;
> > > > 
> > > > >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> > > > >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> > > > >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > > 
> > > > >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> > > > >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> > > > 
> > > > There is something badly wrong here.
> > > > 
> > > > >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> > > > >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> > > > >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> > > > >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> > > > >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > > > > 
> > > > > The rcu_sched process does not run again after the tick_stop until
> > > > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > > > they're happening without any interrupts).
> > > > 
> > > > If you drop out of the arch idle into the core idle loop then you might end
> > > > up with this. You want to add a few trace points or trace_printks() to the
> > > > involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> > > > tick_nohz_restart_sched_tick() and the idle code should be a good starting
> > > > point.
> > > > 
> > > > > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > > > > of hrtimer_cancel/start:
> > > > 
> > > > It's not a frenzy. It's the same pattern as above. It arms the timer to the
> > > > next tick, but that timer never ever fires. And it does that every tick ....
> > > > 
> > > > Please put a tracepoint into your set_next_event() callback as well. So
> > > > this changes here:
> > > > 
> > > > >           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
> > > > >           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > > 
> > > > >           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
> > > > >           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handle
> > > > 
> > > > So CPU1 gets an IPI
> > > > 
> > > > >           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
> > > > 49c function=tick_sched_timer now=132119115200
> > > > >           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > > 
> > > > And rcu-sched-7 gets running magically, but we don't know what woke it
> > > > up. Definitely not the timer, because that did not fire.
> > > > 
> > > > >        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0
> > > 
> > > It could have been an explicit wakeup at the end of a grace period.  That
> > > would explain its cancelling the timer, anyway.
> > 
> > I think the rcu stall handler kicked it, no? Looking at the code
> > again, maybe that behavior needs to be explicitly turned on, so maybe
> > it's just the uart interrupt activity/load from the stall message that
> > breaks the stall condition.
> 
> That is indeed another alternative.  But either way, if your timers aren't
> waking rcu_sched up when it asks to be awakened, you will very likely get
> RCU CPU stall warning messages.  And on small systems, rcu_sched asks to
> be awakened every few jiffies by default.

Can you clarify which timer you think should be waking it up and
isn't? Maybe I'm mistaken but it looks to me like rcu_gp_kthread() is
waking up as expected, only to go back to sleep due to failure of
rcu_gp_fqs_check_wake to end the loop.

To demonstrate this to myself, I hacked rcu_gp_fqs_check_wake to show
2 low bits of rnp->qsmask on LEDs. I also put a hook to update them in
cpu_idle_loop just now, just in case. At least one, and usually two,
LEDs are on when the stalls happen.
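
For reference, the hack looks roughly like this; jcore_led_set() is a
stand-in for the actual board LED write, and the function body is only
loosely based on the 4.8 kernel/rcu/tree.c code, so details may differ:

static void jcore_led_set(int bit, int on)
{
	/* poke the board's LED register here */
}

static bool rcu_gp_fqs_check_wake(struct rcu_state *rsp, int *gfp)
{
	struct rcu_node *rnp = rcu_get_root(rsp);
	unsigned long qs = READ_ONCE(rnp->qsmask);

	/* Debug hack: mirror which CPUs still owe a quiescent state. */
	jcore_led_set(0, qs & 1);
	jcore_led_set(1, (qs >> 1) & 1);

	/* Someone like call_rcu() requested a force-quiescent-state scan. */
	*gfp = READ_ONCE(rsp->gp_flags);
	if (*gfp & RCU_GP_FLAG_FQS)
		return true;

	/* The current grace period has completed. */
	if (!READ_ONCE(rnp->qsmask) && !rcu_preempt_blocked_readers_cgp(rnp))
		return true;

	return false;
}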

> > > > > - During the whole sequence, hrtimer expiration times are being set to
> > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > > > 
> > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > > > than the next tick, then tick_sched_timer is set to this next event which
> > > > can be far out. So that's expected.
> > > > 
> > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > >   sched_clock vs clocksource time and not a big deal.
> > > > 
> > > > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > > > 
> > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > >   odd/excessive.
> > > > 
> > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > > > the problem at all. The problem is your timer not firing, but the cpu is
> > > > obviously either getting out of idle and then moves the tick ahead for some
> > > > unknown reason.
> > > 
> > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > and HZ=500 or better for three-jiffy timeouts.
> > 
> > One possible theory I'm looking at is that the two cpus are both
> > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > forward progress? I added an LED indicator in rcu_gp_fqs_check_wake
> > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > under normal operation the LEDs just flash on momentarily or just one
> > stays on for a few seconds then goes off. During a stall both are
> > stuck on. I'm still trying to make sense of the code but my impression
> > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > correspond directly to cpus; is that right? If so I'm a bit confused
> > because I don't see how forward progress could ever happen if the cpu
> > on which rcu_gp_kthread is running is the one blocking forward progress
> > of rcu_gp_kthread.
> 
> No.  If the CPUs are entering and leaving idle, and if your timers
> were waking up rcu_sched every few jiffies like it asks, then the
> repeated idle entry/exit events would be noticed, courtesy of the atomic
> increments of ->dynticks and the rcu_sched kthread's snapshotting and
> checking of this value.

I don't see how rcu_sched could notice the changes if it's stuck in
the wait loop I think it's stuck in. There is no check of ->dynticks
in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
accomplishes this, I updated my LED hacks to clear the LEDs in that
exit path (and killed the other place that could turn them back on
from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
stall message.

> Even if the CPUs were always non-idle on every
> time force_quiescent_state() is invoked, give or take the possibility
> of counter wrap -- but even on a 32-bit system, that takes awhile.

Perhaps force_quiescent_state is not getting invoked? Does that sound
plausible, and if so, how should I go about debugging it? I see it is
called from the stall reporting code, so that's presumably what's
breaking the stalls.

> But if your timers don't wake up rcu_sched, then force_quiescent_state()
> will never be called, and the ->dynticks counter will be neither
> snapshotted nor checked, which will mean that the idle CPUs will never
> be noticed, which might mean that the grace period never completes.
> Which will definitely result in an RCU CPU stall warning like the one
> that you are seeing.

I don't follow. I thought force_quiescent_state was called by other
code to break rcu_sched out of the loop I'm finding it stuck in, not
by rcu_sched itself.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-02  0:00                       ` Rich Felker
@ 2016-10-02  3:59                         ` Rich Felker
  2016-10-02  5:59                           ` Paul E. McKenney
  2016-10-02  6:30                         ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-02  3:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > - During the whole sequence, hrtimer expiration times are being set to
> > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > > > > 
> > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > > > > than the next tick, then tick_sched_timer is set to this next event which
> > > > > can be far out. So that's expected.
> > > > > 
> > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > 
> > > > > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > > > > 
> > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > >   odd/excessive.
> > > > > 
> > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > > > > the problem at all. The problem is your timer not firing, but the cpu is
> > > > > obviously either getting out of idle and then moves the tick ahead for some
> > > > > unknown reason.
> > > > 
> > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > and HZ=500 or better for three-jiffy timeouts.
> > > 
> > > One possible theory I'm looking at is that the two cpus are both
> > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > forward progress? I added an LED indicator in rcu_gp_fqs_check_wake
> > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > under normal operation the LEDs just flash on momentarily or just one
> > > stays on for a few seconds then goes off. During a stall both are
> > > stuck on. I'm still trying to make sense of the code but my impression
> > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > because I don't see how forward progress could ever happen if the cpu
> > > on which rcu_gp_kthread is running is the one blocking forward progress
> > > of rcu_gp_kthread.
> > 
> > No.  If the CPUs are entering and leaving idle, and if your timers
> > were waking up rcu_sched every few jiffies like it asks, then the
> > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > checking of this value.
> 
> I don't see how rcu_sched could notice the changes if it's stuck in
> the wait loop I think it's stuck in. There is no check of ->dynticks
> in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> accomplishes this, I updated my LED hacks to clear the LEDs in that
> exit path (and killed the other place that could turn them back on
> from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> stall message.
> 
> > Even if the CPUs were always non-idle on every
> > time force_quiescent_state() is invoked, give or take the possibility
> > of counter wrap -- but even on a 32-bit system, that takes awhile.
> 
> Perhaps force_quiescent_state is not getting invoked? Does that sound
> plausible, and if so, how should I go about debugging it? I see it is
> called from the stall reporting code, so that's presumably what's
> breaking the stalls.

I can confirm that force_quiescent_state is not being called at all
except from the stall handler. Where is it supposed to be called from?
I can't find any code paths to it except the stall handler and
__call_rcu_core, but the latter seems to only be called when adding
new rcu callbacks, not as a response to a stalled rcu_sched thread.
Maybe I'm missing something but this seems like incorrect logic in the
rcu subsystem.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-02  3:59                         ` Rich Felker
@ 2016-10-02  5:59                           ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-02  5:59 UTC (permalink / raw)
  To: Rich Felker
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Sat, Oct 01, 2016 at 11:59:25PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> > > > > > > - During the whole sequence, hrtimer expiration times are being set to
> > > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > > > > > 
> > > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > > > > > than the next tick, then tick_sched_timer is set to this next event which
> > > > > > can be far out. So that's expected.
> > > > > > 
> > > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > > 
> > > > > > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > > > > > 
> > > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > > >   odd/excessive.
> > > > > > 
> > > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > > > > > the problem at all. The problem is your timer not firing, but the cpu is
> > > > > > obviously either getting out of idle and then moves the tick ahead for some
> > > > > > unknown reason.
> > > > > 
> > > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > > and HZ=500 or better for three-jiffy timeouts.
> > > > 
> > > > One possible theory I'm looking at is that the two cpus are both
> > > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > > forward progress? I added an LED indicator in rcu_gp_fqs_check_wake
> > > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > > under normal operation the LEDs just flash on momentarily or just one
> > > > stays on for a few seconds then goes off. During a stall both are
> > > > stuck on. I'm still trying to make sense of the code but my impression
> > > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > > because I don't see how forward progress could ever happen if the cpu
> > > > on which rcu_gp_kthread is running is the one blocking forward progress
> > > > of rcu_gp_kthread.
> > > 
> > > No.  If the CPUs are entering and leaving idle, and if your timers
> > > were waking up rcu_sched every few jiffies like it asks, then the
> > > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > > checking of this value.
> > 
> > I don't see how rcu_sched could notice the changes if it's stuck in
> > the wait loop I think it's stuck in. There is no check of ->dynticks
> > in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> > accomplishes this, I updated my LED hacks to clear the LEDs in that
> > exit path (and killed the other place that could turn them back on
> > from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> > stall message.
> > 
> > > Even if the CPUs were always non-idle on every
> > > time force_quiescent_state() is invoked, give or take the possibility
> > > of counter wrap -- but even on a 32-bit system, that takes awhile.
> > 
> > Perhaps force_quiescent_state is not getting invoked? Does that sound
> > plausible, and if so, how should I go about debugging it? I see it is
> > called from the stall reporting code, so that's presumably what's
> > breaking the stalls.
> 
> I can confirm that force_quiescent_state is not being called at all
> except from the stall handler. Where is it supposed to be called from?
> I can't find any code paths to it except the stall handler and
> __call_rcu_core, but the latter seems to only be called when adding
> new rcu callbacks, not as a response to a stalled rcu_sched thread.
> Maybe I'm missing something but this seems like incorrect logic in the
> rcu subsystem.

This is the RCU logic you are missing within the RCU grace-period kthread:

			ret = swait_event_interruptible_timeout(rsp->gp_wq,
					rcu_gp_fqs_check_wake(rsp, &gf), j);

On your system, j==1, which means that when the RCU grace-period kthread
sleeps during a grace period, it is supposed to be awakened after one
jiffy regardless of anything else.  On your system (and apparently -only-
your system), this wakeup is not happening.
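
As a point of reference, the one-jiffy value of j is the default FQS
wait interval, which in the 4.8-era kernel/rcu/tree.h is computed from
HZ (the jiffies_till_first_fqs and jiffies_till_next_fqs module
parameters default to it unless overridden on the boot command line):

/* With HZ=100 this evaluates to 1 jiffy (10ms); HZ=250 gives 2 jiffies
 * and HZ=500 gives 3, matching the timeouts mentioned earlier. */
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))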

RCU reacts to your system's failure to awaken it by giving you an RCU
CPU stall warning.

So please do two things:

1.	As Thomas asked in his September 30th email, upload your full
	trace file somewhere so that he can fully analyze it.  We
	cannot help you if you ignore our requests!
	
	If you did upload it and privately informed Thomas, apologies
	for the false alarm.

2.	Look carefully at the portions of the trace that relate to
	Thomas's sentence in his September 30 email to you with Message-ID
	<alpine.DEB.2.20.1609301319410.4799@nanos> that reads as follows:

		"And this expiry happens on CPU 1, but the expiry time
		looks like the one which got armed on CPU 0 above !?!?!"

	This is a big flaming red-flag whack-over-the-head clue.  Please
	give it the time and attention that it deserves.  Please find
	out why on earth the request for a 10-ms-later timer event on
	CPU 0 appears to have happened way later on CPU 1.  One way
	to do this might be by adding event traces in the lower-level
	hardware-specific timer code.

Again, the reason that force_quiescent_state() is never being called is
because your system's timers are failing to wake up the rcu_sched kthread.
As can be seen from the trace you sent to Thomas, the rcu_sched kthread
is asking to be awakened after 10ms (1 jiffy).  If it had been correctly
awakened, it would have invoked force_quiescent_state().

So please focus some time and attention on learning why your system refused
to respond properly to rcu_sched's timed-wakeup request.

After all, if RCU asks to be awakened in 10 milliseconds, and your system
refuses to do that, you cannot expect RCU to magically wake itself up,
now can you?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-02  0:00                       ` Rich Felker
  2016-10-02  3:59                         ` Rich Felker
@ 2016-10-02  6:30                         ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-02  6:30 UTC (permalink / raw)
  To: Rich Felker
  Cc: Thomas Gleixner, Daniel Lezcano, devicetree, linux-kernel,
	linux-sh, Rob Herring, Mark Rutland

On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 10:58:37AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 01, 2016 at 01:05:08PM -0400, Rich Felker wrote:
> > > On Fri, Sep 30, 2016 at 06:48:35AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> > > > > On Tue, 27 Sep 2016, Rich Felker wrote:
> > > > > > I've managed to get a trace with a stall. I'm not sure what the best
> > > > > > way to share the full thing is, since it's large, but here are the
> > > > > > potentially interesting parts.
> > > > 
> > > > [ . . . ]
> > > > 
> > > > Some RCU commentary, on the off-chance that it helps...
> > > > 
> > > > > So that should kick rcu_sched-7 in 10ms, latest 20ms from now and CPU1 goes
> > > > > into a NOHZ idle sleep.
> > > > > 
> > > > > >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> > > > > >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> > > > > >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> > > > > 
> > > > > which is (using the 0.087621us delta between the trace clock and clock
> > > > > MONO) at: 109.880 + 0.087621 = 109.968
> > > > > 
> > > > > Which is about correct as we expect the RCU timer to fire at:
> > > > >       
> > > > >       109.952633 + 0.01 = 109.963633
> > > > > 
> > > > > or latest at
> > > > > 
> > > > >       109.952633 + 0.02 = 109.983633
> > > > >    
> > > > > There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> > > > > it did not because:
> > > > 
> > > > Just for annoying completeness, the location of the timer depends on how
> > > > the rcuo callback-offload kthreads are constrained.  And yes, in the most
> > > > constrained case where all CPUs except for CPU 0 are nohz CPUs, they will
> > > > by default all run on CPU 0.
> > > 
> > > In default full nohz configuration, am I correct that all cpus except
> > > cpu0 will be nohz and that the rcu callbacks then have to run on
> > > cpu0?
> > 
> > In recent kernels, it looks to me that the default is that none of the
> > CPUs will be nohz by default.  You have to build with both NO_HZ_FULL
> > and NO_HZ_FULL_ALL to get the everything-on-CPU-0 behavior.
> 
> Actually NO_HZ_FULL is not supported because
> HAVE_VIRT_CPU_ACCOUNTING_GEN and HAVE_CONTEXT_TRACKING are missing, so
> it's just normal NO_HZ. I listed the relevant config items somewhere
> earlier in this thread I think.

Very good.  Please feel free to go back and double-check, if you wish.

> > But are there rcuo kthreads running on your system?
> 
> Apparently not:
> 
> # ps aux | grep rcu
>     7 root       0:00 [rcu_sched]
>     8 root       0:00 [rcu_bh]
>   395 root       0:00 grep rcu

OK, you do not have callback-offloading enabled.  We can therefore eliminate
the possibility that you are being hit by callback-offloading logic.

> > > > > >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> > > > > 
> > > > > The CPU nr encoded in flags is: 1
> > > > > 
> > > > > Now we cancel and restart the timer w/o seeing the interrupt expiring
> > > > > it. And that expiry should have happened at 109.968000 !?!
> > > > > 
> > > > > >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> > > > > >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > > > 
> > > > > So this advances the next tick even further out. And CPU 0 sets the timer to
> > > > > the exact same value:
> > > > > 
> > > > > >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> > > > > 
> > > > > 
> > > > > >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> > > > > >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> > > > > >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> > > > > 
> > > > > Which expires here. And CPU1 instead of getting an interrupt and expiring
> > > > > the timer does the cancel/restart to the next jiffie again:
> > > > > 
> > > > > >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> > > > > >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> > > > > 
> > > > > And this repeats;
> > > > > 
> > > > > >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> > > > > >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> > > > > >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> > > > > 
> > > > > >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> > > > > >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> > > > > 
> > > > > There is something badly wrong here.
> > > > > 
> > > > > >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> > > > > >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> > > > > >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> > > > > >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> > > > > >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > > > > > 
> > > > > > The rcu_sched process does not run again after the tick_stop until
> > > > > > 132s, and only a few RCU softirqs happen (all shown above). During
> > > > > > this time, cpu1 has no interrupt activity and nothing in the trace
> > > > > > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > > > > > they're happening without any interrupts).
> > > > > 
> > > > > If you drop out of the arch idle into the core idle loop then you might end
> > > > > up with this. You want to add a few trace points or trace_printks() to the
> > > > > involved functions. tick_nohz_restart() tick_nohz_stop_sched_tick()
> > > > > tick_nohz_restart_sched_tick() and the idle code should be a good starting
> > > > > point.
> > > > > 
> > > > > > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > > > > > of hrtimer_cancel/start:
> > > > > 
> > > > > It's not a frenzy. It's the same pattern as above. It arms the timer to the
> > > > > next tick, but that timer never ever fires. And it does that every tick ....
> > > > > 
> > > > > Please put a tracepoint into your set_next_event() callback as well. So
> > > > > this changes here:
> > > > > 
> > > > > >           <idle>-0     [001] d...   132.198170: hrtimer_cancel: hrtimer=109f449c
> > > > > >           <idle>-0     [001] d...   132.198451: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > > > 
> > > > > >           <idle>-0     [001] dnh.   132.205860: irq_handler_entry: irq=20 name=ipi
> > > > > >           <idle>-0     [001] dnh.   132.206041: irq_handler_exit: irq=20 ret=handle
> > > > > 
> > > > > So CPU1 gets an IPI
> > > > > 
> > > > > >           <idle>-0     [001] dn..   132.206650: hrtimer_cancel: hrtimer=109f449c
> > > > > 49c function=tick_sched_timer now=132119115200
> > > > > >           <idle>-0     [001] dn..   132.206936: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=132120000000 softexpires=132120000000
> > > > > 
> > > > > And rcu-sched-7 gets running magically, but we don't know what woke it
> > > > > up. Definitely not the timer, because that did not fire.
> > > > > 
> > > > > >        rcu_sched-7     [001] d...   132.207710: timer_cancel: timer=160a9eb0
> > > > 
> > > > It could have been an explicit wakeup at the end of a grace period.  That
> > > > would explain its cancelling the timer, anyway.
> > > 
> > > I think the rcu stall handler kicked it, no? Looking at the code
> > > again, maybe that behavior needs to be explicitly turned on, so maybe
> > > it's just the uart interrupt activity/load from the stall message that
> > > breaks the stall condition.
> > 
> > That is indeed another alternative.  But either way, if your timers aren't
> > waking rcu_sched up when it asks to be awakened, you will very likely get
> > RCU CPU stall warning messages.  And on small systems, rcu_sched asks to
> > be awakened every few jiffies by default.
> 
> Can you clarify which timer you think should be waking it up and
> isn't? Maybe I'm mistaken but it looks to me like rcu_gp_kthread() is
> waking up as expected, only to go back to sleep due to failure of
> rcu_gp_fqs_check_wake to end the loop.

The timeout in the call to swait_event_interruptible_timeout() below from
rcu_gp_kthread():

			ret = swait_event_interruptible_timeout(rsp->gp_wq,
					rcu_gp_fqs_check_wake(rsp, &gf), j);

On your system, j==1, so rcu_sched should be awakened each and every
jiffy.  The ftrace you sent to Thomas Gleixner indicates that this is
not happening.  Until this problem is fixed, you will continue getting
RCU CPU stall warnings.

> To demonstrate this to myself, I hacked rcu_gp_fqs_check_wake to show
> 2 low bits of rnp->qsmask on LEDs. I also put a hook to update them in
> cpu_idle_loop just now, just in case. At least one, and usually two,
> LEDs are on when the stalls happen.

If one of your CPUs is idle through the grace period, that is the expected
state.  The system needs to invoke force_quiescent_state() to detect the
idle CPUs.

Can you make the LED light up while the grace-period kthread is waiting
in the swait_event_interruptible_timeout() call?  This will need to be
conditioned on the flavor of RCU that is giving you the stall warnings,
for example something like the following, given that you had at least
one rcu_sched RCU CPU stall warning:

			if (rsp == &rcu_sched_state)
				turn_on_led();
			ret = swait_event_interruptible_timeout(rsp->gp_wq,
					rcu_gp_fqs_check_wake(rsp, &gf), j);
			if (rsp == &rcu_sched_state)
				turn_off_led();

Or maybe set different LED bits based on the RCU flavor, which given that
you have preemption disabled are rcu_sched_state and rcu_bh_state.
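
A minimal sketch along those lines, again using a hypothetical
jcore_led_set(bit, on) helper for the board LEDs rather than any real
kernel API:

			/* Sketch: bit 0 tracks rcu_sched, bit 1 tracks rcu_bh,
			 * so the LEDs show which flavor's grace-period kthread
			 * is currently inside the timed wait. */
			if (rsp == &rcu_sched_state)
				jcore_led_set(0, 1);
			else if (rsp == &rcu_bh_state)
				jcore_led_set(1, 1);
			ret = swait_event_interruptible_timeout(rsp->gp_wq,
					rcu_gp_fqs_check_wake(rsp, &gf), j);
			if (rsp == &rcu_sched_state)
				jcore_led_set(0, 0);
			else if (rsp == &rcu_bh_state)
				jcore_led_set(1, 0);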

> > > > > > - During the whole sequence, hrtimer expiration times are being set to
> > > > > >   exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary.
> > > > > 
> > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out
> > > > > than the next tick, then tick_sched_timer is set to this next event which
> > > > > can be far out. So that's expected.
> > > > > 
> > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the
> > > > > >   timestamps; they're off by about 0.087s. I assume this is just
> > > > > >   sched_clock vs clocksource time and not a big deal.
> > > > > 
> > > > > Yes. You can tell the tracer to use clock monotonic so then they should match.
> > > > > 
> > > > > > - The rcu_sched process is sleeping with timeout=1. This seems
> > > > > >   odd/excessive.
> > > > > 
> > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not
> > > > > the problem at all. The problem is your timer not firing, but the cpu is
> > > > > obviously either getting out of idle and then moves the tick ahead for some
> > > > > unknown reason.
> > > > 
> > > > And a one-jiffy timeout is in fact expected behavior when HZ=100.
> > > > You have to be running HZ=250 or better to have two-jiffy timeouts,
> > > > and HZ=500 or better for three-jiffy timeouts.
> > > 
> > > One possible theory I'm looking at is that the two cpus are both
> > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy
> > > with sufficient consistency that every time the rcu_gp_fqs_check_wake
> > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but
> > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block
> > > forward progress? I added an LED indicator in rcu_gp_fqs_check_wake
> > > that shows the low 2 bits of rnp->qsmask every time it's called, and
> > > under normal operation the LEDs just flash on momentarily or just one
> > > stays on for a few seconds then goes off. During a stall both are
> > > stuck on. I'm still trying to make sense of the code but my impression
> > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits
> > > correspond directly to cpus; is that right? If so I'm a bit confused
> > > because I don't see how forward progress could ever happen if the cpu
> > > on which rcu_gp_kthread is running is the one blocking forward progress
> > > of rcu_gp_kthread.
> > 
> > No.  If the CPUs are entering and leaving idle, and if your timers
> > were waking up rcu_sched every few jiffies like it asks, then the
> > repeated idle entry/exit events would be noticed, courtesy of the atomic
> > increments of ->dynticks and the rcu_sched kthread's snapshotting and
> > checking of this value.
> 
> I don't see how rcu_sched could notice the changes if it's stuck in
> the wait loop I think it's stuck in. There is no check of ->dynticks
> in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS
> accomplishes this, I updated my LED hacks to clear the LEDs in that
> exit path (and killed the other place that could turn them back on
> from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a
> stall message.

Again, from what I can see, the reason that rcu_sched is stuck in that
wait loop is because the timers are not waking it up.  And again, given
the trace you sent Thomas Gleixner, the timers are not firing properly.
And again, given that they are not firing properly, the RCU CPU stall
warnings are expected behavior.

> > Even if the CPUs were always non-idle on every
> > time force_quiescent_state() is invoked, give or take the possibility
> > of counter wrap -- but even on a 32-bit system, that takes awhile.
> 
> Perhaps force_quiescent_state is not getting invoked? Does that sound
> plausible, and if so, how should I go about debugging it? I see it is
> called from the stall reporting code, so that's presumably what's
> breaking the stalls.

Old memories, apologies.  The force_quiescent_state() name dates back many
years, and the function means something different now.  :-/

It is rcu_gp_fqs() that should be getting called every jiffy during the
grace period, and for each flavor of RCU.

> > But if your timers don't wake up rcu_sched, then force_quiescent_state()
> > will never be called, and the ->dynticks counter will be neither
> > snapshotted nor checked, which will mean that the idle CPUs will never
> > be noticed, which might mean that the grace period never completes.
> > Which will definitely result in an RCU CPU stall warning like the one
> > that you are seeing.
> 
> I don't follow. I thought force_quiescent_state was called by other
> code to break rcu_sched out of the loop I'm finding it stuck in, not
> by rcu_sched itself.

The timers are supposed to wake rcu_sched up each and every jiffy during
the grace period.  They appear to be failing to do so.  Please look
into why.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-26 21:27     ` Daniel Lezcano
  2016-09-26 22:23       ` Rich Felker
@ 2016-10-03 22:10       ` Rich Felker
  2016-10-04  7:06         ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-03 22:10 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: devicetree, linux-kernel, linux-sh, Rob Herring, Mark Rutland,
	Thomas Gleixner, Paul E. McKenney

On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
> 
> It is on my radar. I'm reviewing it.
> 
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

Do you have comments on any remaining issues other than this
workaround? If not I don't mind removing the workaround and trying to
solve the issue separately later. Let me know and either way I'll
submit a v8.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-03 22:10       ` Rich Felker
@ 2016-10-04  7:06         ` Paul E. McKenney
  2016-10-04 20:58           ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-04  7:06 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Thomas Gleixner

On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > On 26/09/2016 23:07, Rich Felker wrote:
> > > Ping. Is there anything that still needs to be changed for this driver
> > > to be acceptable?
> > 
> > It is on my radar. I'm reviewing it.
> > 
> > Can you elaborate the workaround mentioned in the changelog. I have been
> > digging into the lkml@ thread but it is not clear if the issue is
> > related to the time framework, the driver itself or whatever else. Can
> > you clarify that ?
> 
> Do you have comments on any remaining issues other than this
> workaround? If not I don't mind removing the workaround and trying to
> solve the issue separately later. Let me know and either way I'll
> submit a v8.

One question of interest to me is whether this patchset prevents the
RCU CPU stall warnings that you are seeing.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-04  7:06         ` Paul E. McKenney
@ 2016-10-04 20:58           ` Rich Felker
  2016-10-04 21:14             ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-04 20:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Thomas Gleixner

On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > Ping. Is there anything that still needs to be changed for this driver
> > > > to be acceptable?
> > > 
> > > It is on my radar. I'm reviewing it.
> > > 
> > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > digging into the lkml@ thread but it is not clear if the issue is
> > > related to the time framework, the driver itself or whatever else. Can
> > > you clarify that ?
> > 
> > Do you have comments on any remaining issues other than this
> > workaround? If not I don't mind removing the workaround and trying to
> > solve the issue separately later. Let me know and either way I'll
> > submit a v8.
> 
> One question of interest to me is whether this patchset prevents the
> RCU CPU stall warnings that you are seeing.

With the 5ms minimum delta, I didn't observe any rcu_sched stall
warnings. At 2.5ms I thought it was gone but eventually saw one. With
the previous mindelta = 1, i.e. 1 hardware bus period, I get the
stalls regularly.
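
For reference, the floor is just the min_delta argument passed when the
clock_event_device is registered, roughly as in the sketch below; the
structure and field names are illustrative rather than the actual
driver code:

#include <linux/clockchips.h>
#include <linux/kernel.h>

struct jcore_pit {
	struct clock_event_device ced;
	/* ... */
};

static void jcore_pit_register(struct jcore_pit *pit, u32 freq)
{
	/*
	 * ~5ms floor expressed in timer ticks, with freq being the PIT
	 * input clock in Hz.  The old floor of 1 allowed a delta of a
	 * single hardware bus period.
	 */
	unsigned long min_delta = DIV_ROUND_UP(freq, 200);

	clockevents_config_and_register(&pit->ced, freq, min_delta,
					0xffffffff);
}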

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-04 20:58           ` Rich Felker
@ 2016-10-04 21:14             ` Paul E. McKenney
  2016-10-04 21:48               ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-04 21:14 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Thomas Gleixner

On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > to be acceptable?
> > > > 
> > > > It is on my radar. I'm reviewing it.
> > > > 
> > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > related to the time framework, the driver itself or whatever else. Can
> > > > you clarify that ?
> > > 
> > > Do you have comments on any remaining issues other than this
> > > workaround? If not I don't mind removing the workaround and trying to
> > > solve the issue separately later. Let me know and either way I'll
> > > submit a v8.
> > 
> > One question of interest to me is whether this patchset prevents the
> > RCU CPU stall warnings that you are seeing.
> 
> With the 5ms minimum delta, I didn't observe any rcu_sched stall
> warnings. At 2.5ms I thought it was gone but eventually saw one. With
> the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> stalls regularly.

Sounds to me like your low-level clock drivers or your clock hardware is
having trouble dealing with short timeouts.  I suggest writing a focused
test.  It is of course quite possible that the failure could occur for
any time period, but simply becomes less probable with longer time
periods.
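
Such a focused test could be as simple as a throwaway module that arms
a short relative hrtimer and reports how late the callback actually
ran; a sketch (not an existing selftest) follows:

#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer test_timer;
static ktime_t armed_at;

static enum hrtimer_restart test_timer_fn(struct hrtimer *t)
{
	s64 late_ns = ktime_to_ns(ktime_sub(ktime_get(), armed_at));

	pr_info("short-delta hrtimer fired %lld ns after arming\n", late_ns);
	return HRTIMER_NORESTART;
}

static int __init timer_test_init(void)
{
	hrtimer_init(&test_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	test_timer.function = test_timer_fn;
	armed_at = ktime_get();
	/* 100us delta, well below the 5ms floor under discussion. */
	hrtimer_start(&test_timer, ns_to_ktime(100 * 1000), HRTIMER_MODE_REL);
	return 0;
}

static void __exit timer_test_exit(void)
{
	hrtimer_cancel(&test_timer);
}

module_init(timer_test_init);
module_exit(timer_test_exit);
MODULE_LICENSE("GPL");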

Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
in order to get the most information in the shortest period of time.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-04 21:14             ` Paul E. McKenney
@ 2016-10-04 21:48               ` Rich Felker
  2016-10-05 12:41                 ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-04 21:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Thomas Gleixner

On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > > to be acceptable?
> > > > > 
> > > > > It is on my radar. I'm reviewing it.
> > > > > 
> > > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > you clarify that ?
> > > > 
> > > > Do you have comments on any remaining issues other than this
> > > > workaround? If not I don't mind removing the workaround and trying to
> > > > solve the issue separately later. Let me know and either way I'll
> > > > submit a v8.
> > > 
> > > One question of interest to me is whether this patchset prevents the
> > > RCU CPU stall warnings that you are seeing.
> > 
> > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > stalls regularly.
> 
> Sounds to me like your low-level clock drivers or your clock hardware is
> having trouble dealing with short timeouts.  I suggest writing a focused
> test.  It is of course quite possible that the failure could occur for
> any time period, but simply becomes less probable with longer time
> periods.

I don't see any indication of such a problem on the hardware side, and
it didn't happen in earlier kernels, though of course it's possible
that they were programming different intervals. And all of the
intervals logged when I added trace logging were way above the scale
where it would be plausible for the hardware to be an issue.
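
For reference, the interval logging was just a trace_printk() in the
set_next_event() path, along the lines of this sketch (the helper and
field names may not match the actual driver exactly):

static int jcore_pit_set_next_event(unsigned long delta,
				    struct clock_event_device *ced)
{
	struct jcore_pit *pit = container_of(ced, struct jcore_pit, ced);

	/* Log every programmed interval so that implausibly small
	 * deltas stand out in the trace. */
	trace_printk("pit cpu=%d delta=%lu\n", smp_processor_id(), delta);

	return jcore_pit_set(delta, pit);
}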

> Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> in order to get the most information in the shortest period of time.

I'm starting with new traces with "nohlt" on the kernel command line,
since many of the stalls I've observed up to now may have a different
mechanism -- since the arch_cpu_idle code doesn't work (returns
immediately), the rapid rcu idle enter/exit in the cpu idle loop might
be part of what was interfering with rcu_sched.

With my new traces and LED debugging (which I still need to tune with
more logging, then I'll post some more results) what I'm seeing when
the stall happens is the timer interrupt firing as expected and making
it into __hrtimer_run_queues, but tick_sched_timer not getting called
and thus rcu_sched never waking up from its 1s schedule_timeout (via
swait).
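
If the tick hrtimer is on the queue but its callback is being skipped,
one candidate for that extra logging is the expiry comparison in the
__hrtimer_run_queues() loop, roughly as in this sketch (loosely based
on the 4.8-era kernel/time/hrtimer.c, so names and details may differ):

		/* Inside the per-base loop of __hrtimer_run_queues():
		 * timers whose soft expiry is still in the future are
		 * skipped, so log the values making that decision. */
		if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
			trace_printk("skip %pf now=%lld softexp=%lld\n",
				     timer->function, basenow.tv64,
				     hrtimer_get_softexpires_tv64(timer));
			break;
		}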

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-04 21:48               ` Rich Felker
@ 2016-10-05 12:41                 ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2016-10-05 12:41 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Thomas Gleixner

On Tue, Oct 04, 2016 at 05:48:18PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > > > to be acceptable?
> > > > > > 
> > > > > > It is on my radar. I'm reviewing it.
> > > > > > 
> > > > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > > you clarify that ?
> > > > > 
> > > > > Do you have comments on any remaining issues other than this
> > > > > workaround? If not I don't mind removing the workaround and trying to
> > > > > solve the issue separately later. Let me know and either way I'll
> > > > > submit a v8.
> > > > 
> > > > One question of interest to me is whether this patchset prevents the
> > > > RCU CPU stall warnings that you are seeing.
> > > 
> > > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > > stalls regularly.
> > 
> > Sounds to me like your low-level clock drivers or your clock hardware is
> > having trouble dealing with short timeouts.  I suggest writing a focused
> > test.  It is of course quite possible that the failure could occur for
> > any time period, but simply becomes less probable with longer time
> > periods.
> 
> I don't see any indication of such a problem on the hardware side, and
> it didn't happen in earlier kernels, though of course it's possible
> that they were programming different intervals. And all of the
> intervals logged when I added trace logging were way above the scale
> where it would be plausible for the hardware to be an issue.

Is it possible that the problem occurs when the process setting up the
timeout is delayed for longer than the timeout duration, so that the
time has already expired at the time that the hardware is programmed?
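
For reference, the 4.8-era clockevents core does guard against
programming an expiry that is already in the past; roughly, from
memory (so details may differ):

	/* kernel/time/clockevents.c, clockevents_program_event() */
	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
	if (delta <= 0)
		return force ? clockevents_program_min_delta(dev) : -ETIME;

	delta = min(delta, (int64_t) dev->max_delta_ns);
	delta = max(delta, (int64_t) dev->min_delta_ns);

	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
	rc = dev->set_next_event((unsigned long) clc, dev);

So the window to worry about would be a delay between that check and
the actual hardware write, or a min_delta too small for the device to
honor reliably.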

> > Or perhaps better, do the tracing that Thomas Gleixner suggested.  Or both,
> > in order to get the most information in the shortest period of time.
> 
> I'm starting with new traces with "nohlt" on the kernel command line,
> since many of the stalls I've observed up to now may have a different
> mechanism -- since the arch_cpu_idle code doesn't work (returns
> immediately), the rapid rcu idle enter/exit in the cpu idle loop might
> be part of what was interfering with rcu_sched.

That doesn't sound good, though I cannot immediately think of a reason
why that would cause RCU much trouble.

> With my new traces and LED debugging (which I still need to tune with
> more logging, then I'll post some more results) what I'm seeing when
> the stall happens is the timer interrupt firing as expected and making
> it into __hrtimer_run_queues, but tick_sched_timer not getting called
> and thus rcu_sched never waking up from its 1s schedule_timeout (via
> swait).

It would be very interesting to track down exactly the point at which
the system decides not to fire the timeout.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-09-30 13:15               ` Thomas Gleixner
  2016-09-30 13:48                 ` Paul E. McKenney
@ 2016-10-08  1:32                 ` Rich Felker
  2016-10-08 11:32                   ` Thomas Gleixner
  1 sibling, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-08  1:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Fri, Sep 30, 2016 at 03:15:11PM +0200, Thomas Gleixner wrote:
> On Tue, 27 Sep 2016, Rich Felker wrote:
> > I've managed to get a trace with a stall. I'm not sure what the best
> > way to share the full thing is, since it's large, but here are the
> > potentially interesting parts.
> 
> Upload it somewhere.
>  
> > The first is a big time gap with no events, from 82.446093 to
> > 109.852709:
> > 
> 
> >           <idle>-0     [001] d.h.    82.443398: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [001] d.h.    82.443573: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d.h.    82.443689: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=82356112320
> 
> Tick timer expires
> 
> >           <idle>-0     [001] d.h.    82.443911: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [001] d.h.    82.444089: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [001] d.h.    82.444306: softirq_raise: vec=7 [action=SCHED]
> 
> Raises softirqs
> 
> >           <idle>-0     [001] d.h.    82.444448: hrtimer_expire_exit: hrtimer=109f449c
> >           <idle>-0     [001] d.h.    82.444592: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=82360000000 softexpires=82360000000
> 
> Rearms the next period
> 
> >           <idle>-0     [001] d.h.    82.444821: irq_handler_exit: irq=16 ret=handled
> >           <idle>-0     [001] ..s.    82.445034: softirq_entry: vec=1 [action=TIMER]
> >           <idle>-0     [001] ..s.    82.445222: softirq_exit: vec=1 [action=TIMER]
> >           <idle>-0     [001] ..s.    82.445341: softirq_entry: vec=7 [action=SCHED]
> >           <idle>-0     [001] ..s.    82.445745: softirq_exit: vec=7 [action=SCHED]
> >           <idle>-0     [001] ..s.    82.445880: softirq_entry: vec=9 [action=RCU]
> >           <idle>-0     [001] ..s.    82.446093: softirq_exit: vec=9 [action=RCU]
> 
> Softirqs done and the machine vanishes into lala land ....
> 
> And now after that gap we have a cancel/start on CPU 0
> 
> >           <idle>-0     [000] dn..   109.852709: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] dn..   109.853043: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109770000000 softexpires=109770000000
> 
> >               sh-388   [000] d.h.   109.855851: irq_handler_entry: irq=18 name=uartlite
> >               sh-388   [000] d.h.   109.856751: irq_handler_exit: irq=18 ret=handled
> >               sh-388   [000] d.h.   109.857264: irq_handler_entry: irq=18 name=uartlite
> >               sh-388   [000] d.h.   109.857458: irq_handler_exit: irq=18 ret=unhandled
> >           <idle>-0     [001] d.h.   109.857684: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [001] d.h.   109.857934: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d.h.   109.858087: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=109770452820
> 
> And this expiry happens on CPU 1, but the expiry time looks like the one
> which got armed on CPU 0 above !?!?!
> 
> >           <idle>-0     [001] d.h.   109.858639: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [001] d.h.   109.858839: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [001] d.h.   109.859097: softirq_raise: vec=7 [action=SCHED]
> >           <idle>-0     [001] d.h.   109.859282: hrtimer_expire_exit: hrtimer=109f449c
> >           <idle>-0     [001] d.h.   109.859428: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109780000000 softexpires=109780000000
> >           <idle>-0     [001] d.h.   109.859637: irq_handler_exit: irq=16 ret=handled
> >           <idle>-0     [001] ..s.   109.859806: softirq_entry: vec=1 [action=TIMER]
> >           <idle>-0     [001] ..s.   109.859984: softirq_exit: vec=1 [action=TIMER]
> >           <idle>-0     [001] ..s.   109.860098: softirq_entry: vec=7 [action=SCHED]
> >           <idle>-0     [001] ..s.   109.860533: softirq_exit: vec=7 [action=SCHED]
> >           <idle>-0     [001] ..s.   109.860663: softirq_entry: vec=9 [action=RCU]
> >           <idle>-0     [001] .ns.   109.861424: softirq_exit: vec=9 [action=RCU]
> >               sh-388   [000] d.h.   109.861831: irq_handler_entry: irq=16 name=jcore_pit
> >               sh-388   [000] d.h.   109.862071: hrtimer_cancel: hrtimer=109e949c
> >        rcu_sched-7     [001] ....   109.862141: timer_init: timer=160a9eb0
> >               sh-388   [000] d.h.   109.862233: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109774586740
> 
> And CPU0 has its timer firing 4.5ms late .....
> 
> >        rcu_sched-7     [001] d...   109.862317: timer_start: timer=160a9eb0 function=process_timeout expires=4294948275 [timeout=1] flags=0x00000001
> > 
> > However no stall message appeared on the console here, and I suspect
> > it might have just been a gap in event recording because it doesn't
> > look terribly suspicious. At first I thought (and discussed with
> 
> Well. If there are gaps in the traces then decoding this is going to be a
> nightmare.
> 
> > Daniel off-list) that the hrtimer_cancel at the end of the long gap
> > might have been spinning for the whole time, but this does not match
> > the behavior of the stalls I'm observing with tracing off; by hooking
> 
> Why would it be spinning? There is no way that the callback of the tick timer
> runs when cancel is called. It's cpu local.....
> 
> > up LED output in the hrtimer_cancel code paths, I see rapid bursts of
> > successful hrtimer_cancel (no repeated hrtimer_try_to_cancel looping)
> > leading up to the rcu_sched stall message. That matches with what I'm
> > seeing later in the event log, between 109s and 132s where the stall
> > occurred:
> > 
> >           <idle>-0     [000] d.h.   109.946127: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [000] d.h.   109.946339: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.946486: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109858865160
> >           <idle>-0     [000] d.h.   109.946714: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [000] d.h.   109.946918: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [000] d.h.   109.947193: softirq_raise: vec=7 [action=SCHED]
> >           <idle>-0     [000] d.h.   109.947364: hrtimer_expire_exit: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.947498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109860000000 softexpires=109860000000
> >           <idle>-0     [000] d.h.   109.947736: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.947887: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109860266000
> >           <idle>-0     [000] d.h.   109.948394: softirq_raise: vec=1 [action=TIMER]
> >           <idle>-0     [000] d.h.   109.948600: softirq_raise: vec=9 [action=RCU]
> >           <idle>-0     [000] d.h.   109.948870: softirq_raise: vec=7 [action=SCHED]
> >           <idle>-0     [000] d.h.   109.949036: hrtimer_expire_exit: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.949170: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109870000000 softexpires=109870000000
> >           <idle>-0     [000] d.h.   109.949389: irq_handler_exit: irq=16 ret=handled
> >           <idle>-0     [000] ..s.   109.949585: softirq_entry: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   109.949785: softirq_exit: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   109.949902: softirq_entry: vec=7 [action=SCHED]
> >           <idle>-0     [000] ..s.   109.950347: softirq_exit: vec=7 [action=SCHED]
> >           <idle>-0     [000] ..s.   109.950472: softirq_entry: vec=9 [action=RCU]
> >           <idle>-0     [000] ..s.   109.951089: softirq_exit: vec=9 [action=RCU]
> >           <idle>-0     [001] dn..   109.951213: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] dn..   109.951476: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109870000000 softexpires=109870000000
> >        rcu_sched-7     [001] d...   109.952030: timer_cancel: timer=160a9eb0
> >        rcu_sched-7     [001] ....   109.952478: timer_init: timer=160a9eb0
> >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> 
> So that should kick rcu_sched-7 in 10ms, 20ms at the latest, and CPU1 goes
> into a NOHZ idle sleep.
> 
> >           <idle>-0     [001] d...   109.953436: tick_stop: success=1 dependency=NONE
> >           <idle>-0     [001] d...   109.953617: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.953818: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109880000000 softexpires=109880000000
> 
> which is (using the 0.087621s delta between the trace clock and clock
> MONO) at: 109.880 + 0.087621 = 109.967621, i.e. about 109.968
> 
> Which is about correct as we expect the RCU timer to fire at:
>       
>       109.952633 + 0.01 = 109.962633
> 
> or latest at
> 
>       109.952633 + 0.02 = 109.972633
>    
> There is another caveat. That nohz stuff can queue the rcu timer on CPU0, which
> it did not do here because:
> 
> >        rcu_sched-7     [001] d...   109.952633: timer_start: timer=160a9eb0 function=process_timeout expires=4294948284 [timeout=1] flags=0x00000001
> 
> The CPU nr encoded in flags is: 1
> 
> Now we cancel and restart the timer w/o seeing the interrupt expiring
> it. And that expiry should have happened at 109.968000 !?!
> 
> >           <idle>-0     [001] d...   109.968225: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.968526: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> 
> So this advances the next tick even further out. And CPU 0 sets the timer to
> the exact same value:
> 
> >           <idle>-0     [000] d.h.   109.969104: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=109890000000 softexpires=109890000000
> 
> 
> >           <idle>-0     [000] d.h.   109.977690: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [000] d.h.   109.977911: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.978053: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109890434160
> 
> Which expires here. And CPU1 instead of getting an interrupt and expiring
> the timer does the cancel/restart to the next jiffie again:
> 
> >           <idle>-0     [001] d...   109.978206: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.978495: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=109900000000 softexpires=109900000000
> 
> And this repeats:
> 
> >           <idle>-0     [000] d.h.   109.987726: irq_handler_entry: irq=16 name=jcore_pit
> >           <idle>-0     [000] d.h.   109.987954: hrtimer_cancel: hrtimer=109e949c
> >           <idle>-0     [000] d.h.   109.988095: hrtimer_expire_entry: hrtimer=109e949c function=tick_sched_timer now=109900474620
> 
> >           <idle>-0     [001] d...   109.988243: hrtimer_cancel: hrtimer=109f449c
> >           <idle>-0     [001] d...   109.988537: hrtimer_start: hrtimer=109f449c fun9c
> 
> There is something badly wrong here.
> 
> >           <idle>-0     [000] ..s.   110.019443: softirq_entry: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   110.019617: softirq_exit: vec=1 [action=TIMER]
> >           <idle>-0     [000] ..s.   110.019730: softirq_entry: vec=7 [action=SCHED]
> >           <idle>-0     [000] ..s.   110.020174: softirq_exit: vec=7 [action=SCHED]
> >           <idle>-0     [000] d.h.   110.027674: irq_handler_entry: irq=16 name=jcore_pit
> > 
> > The rcu_sched process does not run again after the tick_stop until
> > 132s, and only a few RCU softirqs happen (all shown above). During
> > this time, cpu1 has no interrupt activity and nothing in the trace
> > except the above hrtimer_cancel/hrtimer_start pairs (not sure how
> > they're happening without any interrupts).
> 
> If you drop out of the arch idle into the core idle loop then you might end
> up with this. You want to add a few trace points or trace_printks() to the
> involved functions: tick_nohz_restart(), tick_nohz_stop_sched_tick(),
> tick_nohz_restart_sched_tick(), and the idle code should be a good starting
> point.
>  
> > This pattern repeats until almost 131s, where cpu1 goes into a frenzy
> > of hrtimer_cancel/start:
> 
> It's not a frenzy. It's the same pattern as above. It arms the timer to the
> next tick, but that timer never ever fires. And it does that every tick ....
> 
> Please put a tracepoint into your set_next_event() callback as well. So
> this changes here:

Here is a full log with added printks of a stall:

http://sprunge.us/PgOZ

The relevant excerpt where the stall begins is:

       rcu_sched-7     [000] d...   150.828744: timer_start: timer=1636feac function=process_timeout expires=4294952380 [timeout=1] flags=0x00000000
       rcu_sched-7     [000] d...   150.828892: rcu_utilization: Start context switch
       rcu_sched-7     [000] d...   150.828988: rcu_grace_period: rcu_sched 51 cpuqs
       rcu_sched-7     [000] d...   150.829080: rcu_utilization: End context switch
       rcu_sched-7     [000] d...   150.829230: sched_stat_runtime: comm=rcu_sched pid=7 runtime=2813820 [ns] vruntime=10018192393 [ns]
       rcu_sched-7     [000] d...   150.829496: sched_switch: prev_comm=rcu_sched prev_pid=7 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
          <idle>-0     [000] d...   150.829698: __tick_nohz_idle_enter: can stop idle tick
          <idle>-0     [000] d...   150.829774: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [000] d...   150.830001: tick_stop: success=1 dependency=NONE
          <idle>-0     [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d...   150.830380: jcore_pit_set_next_event: pit 109eccb0 0 set delta 82124660 en=4410000 cntr=75 throt=82124660
          <idle>-0     [000] d...   150.830498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150840000000 softexpires=150840000000
          <idle>-0     [000] d...   150.830720: jcore_pit_set_next_event: pit 109eccb0 0 set delta 539823 en=4410000 cntr=34 throt=539823
          <idle>-0     [000] d...   150.830822: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [000] d...   150.830940: rcu_dyntick: Start 140000000000000 0
          <idle>-0     [000] d...   150.831072: rcu_dyntick: End 0 1
          <idle>-0     [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
          <idle>-0     [000] d...   150.831269: rcu_dyntick: Start 1 0
          <idle>-0     [000] ....   150.831368: cpu_startup_entry: enabled irqs, spinning
          <idle>-0     [001] d...   150.831491: rcu_dyntick: End 0 1
          <idle>-0     [001] d.h.   150.831704: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [001] d.h.   150.831855: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.   150.831965: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150830402668
          <idle>-0     [001] d.h.   150.832075: tick_sched_do_timer: update jiffies via tick
          <idle>-0     [001] d.h.   150.832217: tick_do_update_jiffies64: finished do_timer(1)
          <idle>-0     [001] d.h.   150.832335: tick_do_update_jiffies64: finished updating jiffies
          <idle>-0     [001] d.h.   150.832462: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [001] d.h.   150.832566: rcu_utilization: Start scheduler-tick
          <idle>-0     [001] d.h.   150.832719: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [001] d.h.   150.832811: rcu_utilization: End scheduler-tick
          <idle>-0     [001] d.h.   150.833012: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.   150.833146: hrtimer_expire_exit: hrtimer=109f449c
          <idle>-0     [001] d.h.   150.833244: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=150840000000 softexpires=150840000000
          <idle>-0     [001] d.h.   150.833481: jcore_pit_set_next_event: pit 109f7cb0 1 set delta 402319 en=4410000 cntr=54 throt=402319
          <idle>-0     [001] d.h.   150.833583: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [001] ..s.   150.833735: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   150.833869: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   150.833959: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   150.834298: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   150.834398: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [001] ..s.   150.834494: rcu_utilization: Start RCU core
          <idle>-0     [001] d.s.   150.834628: rcu_future_grace_period: rcu_sched 51 50 52 0 0 1 Startleaf
          <idle>-0     [001] d.s.   150.834738: rcu_future_grace_period: rcu_sched 51 50 52 0 0 1 Startedleaf
          <idle>-0     [001] d.s.   150.834839: rcu_grace_period: rcu_sched 50 AccWaitCB
          <idle>-0     [001] d.s.   150.834931: rcu_grace_period: rcu_sched 51 cpustart
          <idle>-0     [001] ..s.   150.835062: rcu_utilization: End RCU core
          <idle>-0     [001] ..s.   150.835156: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [001] d...   150.835304: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [001] d...   150.835399: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [001] d...   150.835497: rcu_dyntick: Start 1 0
          <idle>-0     [001] d...   150.841525: rcu_dyntick: End 0 1
          <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1
          <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq
          <idle>-0     [001] d.h.   150.841829: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
          <idle>-0     [001] d.h.   150.842054: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.   150.842218: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150840579940
          <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies
          <idle>-0     [001] d.h.   150.842382: tick_sched_do_timer: update jiffies via tick
          <idle>-0     [001] d.h.   150.842576: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick
          <idle>-0     [000] d...   150.842700: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [001] d.h.   150.842718: rcu_utilization: Start scheduler-tick
          <idle>-0     [001] d.h.   150.842869: rcu_grace_period: rcu_sched 51 cpuqs
          <idle>-0     [000] d...   150.842937: tick_nohz_restart: restart
          <idle>-0     [001] d.h.   150.843051: softirq_raise: vec=9 [action=RCU]
          <idle>-0     [000] d...   150.843115: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [001] d.h.   150.843179: rcu_utilization: End scheduler-tick
          <idle>-0     [000] d...   150.843455: jcore_pit_set_next_event: pit 109eccb0 0 set delta 81472695 en=4410000 cntr=74 throt=81472695
          <idle>-0     [001] d.h.   150.843467: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.   150.843651: hrtimer_expire_exit: hrtimer=109f449c
          <idle>-0     [000] d...   150.843698: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150850000000 softexpires=150850000000
          <idle>-0     [001] d.h.   150.843789: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=150850000000 softexpires=150850000000
          <idle>-0     [000] d...   150.844005: jcore_pit_set_next_event: pit 109eccb0 0 set delta 377377 en=4410000 cntr=42 throt=377377
          <idle>-0     [001] d.h.   150.844115: jcore_pit_set_next_event: pit 109f7cb0 1 set delta 372809 en=4410000 cntr=66 throt=372809
          <idle>-0     [000] d...   150.844132: tick_nohz_restart: return from restart
          <idle>-0     [000] d...   150.844240: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [001] d.h.   150.844260: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] d...   150.844377: rcu_dyntick: Start 1 0
          <idle>-0     [001] ..s.   150.844473: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   150.844612: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   150.844701: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   150.845036: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   150.845136: softirq_entry: vec=9 [action=RCU]
          <idle>-0     [001] ..s.   150.845231: rcu_utilization: Start RCU core
          <idle>-0     [001] d.s.   150.845413: rcu_future_grace_period: rcu_sched 51 50 52 0 0 1 Startleaf
          <idle>-0     [001] d.s.   150.845518: rcu_future_grace_period: rcu_sched 51 50 52 0 0 1 Prestartleaf
          <idle>-0     [001] d.s.   150.845619: rcu_grace_period: rcu_sched 51 AccWaitCB
          <idle>-0     [001] d.s.   150.845723: rcu_quiescent_state_report: rcu_sched 51 2>1 0 0 1 0
          <idle>-0     [001] ..s.   150.845857: rcu_utilization: End RCU core
          <idle>-0     [001] ..s.   150.845949: softirq_exit: vec=9 [action=RCU]
          <idle>-0     [001] d...   150.846098: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [001] d...   150.846192: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [001] d...   150.846289: rcu_dyntick: Start 1 0
          <idle>-0     [000] d...   150.851519: rcu_dyntick: End 0 1
          <idle>-0     [001] d...   150.851522: rcu_dyntick: End 0 1
          <idle>-0     [000] d.s.   150.851709: tick_irq_enter: update jiffies via irq
          <idle>-0     [001] d.h.   150.851824: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [000] d.s.   150.851900: tick_do_update_jiffies64: finished do_timer(1)
          <idle>-0     [001] d.h.   150.852047: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.   150.852213: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150850574115
          <idle>-0     [000] d.s.   150.852307: tick_do_update_jiffies64: finished updating jiffies
          <idle>-0     [001] d.h.   150.852375: tick_sched_do_timer: update jiffies via tick
          <idle>-0     [000] d...   150.852566: __tick_nohz_idle_enter: can stop idle tick
          <idle>-0     [001] d.h.   150.852578: softirq_raise: vec=1 [action=TIMER]
          <idle>-0     [000] d...   150.852660: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [001] d.h.   150.852723: rcu_utilization: Start scheduler-tick
          <idle>-0     [000] d...   150.852907: tick_nohz_restart: restart
          <idle>-0     [001] d.h.   150.852934: rcu_utilization: End scheduler-tick
          <idle>-0     [000] d...   150.853076: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [001] d.h.   150.853229: softirq_raise: vec=7 [action=SCHED]
          <idle>-0     [001] d.h.   150.853412: hrtimer_expire_exit: hrtimer=109f449c
          <idle>-0     [000] d...   150.853412: jcore_pit_set_next_event: pit 109eccb0 0 set delta 80974397 en=4410000 cntr=34 throt=80974397
          <idle>-0     [001] d.h.   150.853549: hrtimer_start: hrtimer=109f449c function=tick_sched_timer expires=150860000000 softexpires=150860000000
          <idle>-0     [000] d...   150.853645: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150860000000 softexpires=150860000000
          <idle>-0     [001] d.h.   150.853879: jcore_pit_set_next_event: pit 109f7cb0 1 set delta 384360 en=4410000 cntr=66 throt=384360
          <idle>-0     [000] d...   150.853941: jcore_pit_set_next_event: pit 109eccb0 0 set delta 380089 en=4410000 cntr=42 throt=380089
          <idle>-0     [001] d.h.   150.854024: irq_handler_exit: irq=16 ret=handled
          <idle>-0     [000] d...   150.854082: tick_nohz_restart: return from restart
          <idle>-0     [000] d...   150.854186: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [001] ..s.   150.854235: softirq_entry: vec=1 [action=TIMER]
          <idle>-0     [000] d...   150.854325: rcu_dyntick: Start 1 0
          <idle>-0     [001] ..s.   150.854426: softirq_exit: vec=1 [action=TIMER]
          <idle>-0     [001] ..s.   150.854520: softirq_entry: vec=7 [action=SCHED]
          <idle>-0     [001] ..s.   150.854861: softirq_exit: vec=7 [action=SCHED]
          <idle>-0     [001] d...   150.855017: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [001] d...   150.855111: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [001] d...   150.855208: rcu_dyntick: Start 1 0
          <idle>-0     [000] d...   150.861516: rcu_dyntick: End 0 1
          <idle>-0     [001] d...   150.861523: rcu_dyntick: End 0 1
          <idle>-0     [000] d.s.   150.861703: tick_irq_enter: update jiffies via irq

If I'm not mistaken, the bug is in tick_nohz_restart. According to the
comment, it forwards the timer to take place one jiffy in the future,
whereas one of its callers, tick_nohz_stop_sched_tick, expects it to
"prod the hardware into firing an interrupt". That's not going to
happen right away (as needed to get proper timing), and not going to
happen at all if the idle loop keeps breaking out and forwarding the
timer over and over again.
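
For reference, the shape of the code I'm looking at is roughly the
following -- paraphrased from my reading of kernel/time/tick-sched.c,
highres case only, so treat the details as approximate:

static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
{
        hrtimer_cancel(&ts->sched_timer);
        hrtimer_set_expires(&ts->sched_timer, ts->last_tick);

        /* Forward the expiry past 'now', i.e. to the next jiffy... */
        hrtimer_forward(&ts->sched_timer, now, tick_period);

        /* ...and re-arm. Nothing here forces an immediate expiry, so if the
         * idle loop keeps re-entering this path before the pit interrupt is
         * handled, the expiry just keeps moving one jiffy further out. */
        hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED);
}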

Before commit 0ff53d09642204c648424def0caa9117e7a3caaf ("tick: sched:
Force tick interrupt and get rid of softirq magic"), a timer softirq
was always raised, which seems to have made it impossible to get into
this situation. Adding back the raise_softirq_irqoff(TIMER_SOFTIRQ)
seems to make the problem go away -- with that change, my "rcu_sched
is stalled in swait" LED only lights briefly. But it's clearly not the
right fix; the interrupt load with this change is comparable to
periodic-tick.

FWIW I also spent a good deal of time exploring the possibility of a
hardware problem, but couldn't find any. The timer interrupt is firing
when it's programmed to; the kernel is just repeatedly reprogramming
it to fire in the future.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-08  1:32                 ` Rich Felker
@ 2016-10-08 11:32                   ` Thomas Gleixner
  2016-10-08 16:26                     ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2016-10-08 11:32 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Fri, 7 Oct 2016, Rich Felker wrote:
> 
> If I'm not mistaken, the bug is in tick_nohz_restart. According to the

I think you are mistaken. Let's look at CPU0 only:

          <idle>-0     [000] d...   150.829698: __tick_nohz_idle_enter: can stop idle tick
          <idle>-0     [000] d...   150.829774: __tick_nohz_idle_enter: entering stop_sched_tick
          <idle>-0     [000] d...   150.830001: tick_stop: success=1 dependency=NONE
          <idle>-0     [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
          <idle>-0     [000] d...   150.830380: jcore_pit_set_next_event: pit 109eccb0 0 set delta 82124660 en=4410000 cntr=75 throt=82124660
          <idle>-0     [000] d...   150.830498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150840000000 softexpires=150840000000
          <idle>-0     [000] d...   150.830720: jcore_pit_set_next_event: pit 109eccb0 0 set delta 539823 en=4410000 cntr=34 throt=539823

Tick is programmed to fire at 150.84

          <idle>-0     [000] d...   150.830822: __tick_nohz_idle_enter: returning from stop_sched_tick
          <idle>-0     [000] d...   150.830940: rcu_dyntick: Start 140000000000000 0
          <idle>-0     [000] d...   150.831072: rcu_dyntick: End 0 1
          <idle>-0     [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
          <idle>-0     [000] d...   150.831269: rcu_dyntick: Start 1 0
          <idle>-0     [000] ....   150.831368: cpu_startup_entry: enabled irqs, spinning

CPU spins and waits for an interrupt to happen


          <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1

Dropping out of the spin about the time we expect the PIT interrupt to
fire. An interrupt must be the reason why we drop out, because the 'need
resched' flag is not set, and we end up in:

          <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq

which is called from irq_enter()

          <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
          <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies


So here we would expect:

    	 irq_handler_entry: irq=16 name=jcore_pit
	 ...
	 irq_handler_exit .....
    	 

or any other interrupt being invoked. But we see this here:

          <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick

And that's just wrong. 

Now looking at CPU1 at the same time:

          <idle>-0     [001] d.h.   150.841829: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [001] d.h.   150.842054: hrtimer_cancel: hrtimer=109f449c
          <idle>-0     [001] d.h.   150.842218: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150840579940

So there the PIT interrupt enters and runs the interrupt handler.

Now looking deeper at the PIT interrupts:

     kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 name=jcore_pit
     kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 ret=handled


          <idle>-0     [001] dnh.   150.645370: irq_handler_entry: irq=16 name=jcore_pit
          <idle>-0     [001] dnh.   150.647444: irq_handler_exit: irq=16 ret=handled

Both CPUs have the same interrupt number for their per cpu PIT interrupt.

IIRC you said earlier when the percpu interrupt discussion happened, that
your per cpu PITs have distinct interrupt numbers, but that does not seem
the case here.

That's where you have to dig.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-08 11:32                   ` Thomas Gleixner
@ 2016-10-08 16:26                     ` Rich Felker
  2016-10-08 17:03                       ` Thomas Gleixner
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-08 16:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Rich Felker wrote:
> > 
> > If I'm not mistaken, the bug is in tick_nohz_restart. According to the
> 
> I think you are mistaken. Let's look at CPU0 only:

OK.

>           <idle>-0     [000] d...   150.829698: __tick_nohz_idle_enter: can stop idle tick
>           <idle>-0     [000] d...   150.829774: __tick_nohz_idle_enter: entering stop_sched_tick
>           <idle>-0     [000] d...   150.830001: tick_stop: success=1 dependency=NONE
>           <idle>-0     [000] d...   150.830127: hrtimer_cancel: hrtimer=109e949c
>           <idle>-0     [000] d...   150.830380: jcore_pit_set_next_event: pit 109eccb0 0 set delta 82124660 en=4410000 cntr=75 throt=82124660
>           <idle>-0     [000] d...   150.830498: hrtimer_start: hrtimer=109e949c function=tick_sched_timer expires=150840000000 softexpires=150840000000
>           <idle>-0     [000] d...   150.830720: jcore_pit_set_next_event: pit 109eccb0 0 set delta 539823 en=4410000 cntr=34 throt=539823
> 
> Tick is programmed to fire at 150.84
> 
>           <idle>-0     [000] d...   150.830822: __tick_nohz_idle_enter: returning from stop_sched_tick
>           <idle>-0     [000] d...   150.830940: rcu_dyntick: Start 140000000000000 0
>           <idle>-0     [000] d...   150.831072: rcu_dyntick: End 0 1
>           <idle>-0     [000] d...   150.831165: cpu_idle: state=0 cpu_id=0
>           <idle>-0     [000] d...   150.831269: rcu_dyntick: Start 1 0
>           <idle>-0     [000] ....   150.831368: cpu_startup_entry: enabled irqs, spinning
> 
> CPU spins and waits for an interrupt to happen
> 
> 
>           <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1
> 
> Dropping out of the spin about the time we expect the PIT interrupt to
> fire. And an interrupt is the reason why we drop out because the 'need
> resched' flag is not set and we end up in:

OK, I missed that.

>           <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq
> 
> which is called from irq_enter()
> 
>           <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
>           <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies
> 
> 
> So here we would expect:
> 
>     	 irq_handler_entry: irq=16 name=jcore_pit
> 	 ...
> 	 irq_handler_exit .....
>     	 
> 
> or any other interrupt being invoked. But we see this here:

According to the trace the 'irq' is soft ('s'). I couldn't find the
code path from the idle loop to softirq, so maybe this is a bug. Is
it possible that in some cases the arch irq entry does not get
identified as a hard irq or traced and then the subsequent tick code
thinks it's running in a softirq and behaves differently? I could add
more tracing around irq entry.

>           <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick
> 
> And that's just wrong. 

Can you explain?

> Now looking at CPU1 at the same time:
> 
>           <idle>-0     [001] d.h.   150.841829: irq_handler_entry: irq=16 name=jcore_
>           <idle>-0     [001] d.h.   150.842054: hrtimer_cancel: hrtimer=109f449c
>           <idle>-0     [001] d.h.   150.842218: hrtimer_expire_entry: hrtimer=109f449c function=tick_sched_timer now=150840579940
> 
> So there the PIT interrupt enters and runs the interrupt handler.
> 
> Now looking deeper at the PIT interrupts:
> 
>      kworker/0:1-208   [000] d.h.   150.641822: irq_handler_entry: irq=16 name=jcore_pit
>      kworker/0:1-208   [000] d.h.   150.644928: irq_handler_exit: irq=16 ret=handled
> 
> 
>           <idle>-0     [001] dnh.   150.645370: irq_handler_entry: irq=16 name=jcore_pit
>           <idle>-0     [001] dnh.   150.647444: irq_handler_exit: irq=16 ret=handled
> 
> Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> 
> IIRC you said earlier when the percpu interrupt discussion happened, that
> your per cpu PITs have distinct interrupt numbers, but that does not seem
> the case here.

No, I said the opposite. They use the same irq number but they're each
wired to their own cpu, and there's no way for them to fire on the
wrong cpu.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-08 16:26                     ` Rich Felker
@ 2016-10-08 17:03                       ` Thomas Gleixner
  2016-10-09  1:28                         ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2016-10-08 17:03 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > CPU spins and waits for an interrupt to happen
> > 
> > 
> >           <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1
> > 
> > Dropping out of the spin about the time we expect the PIT interrupt to
> > fire. And an interrupt is the reason why we drop out because the 'need
> > resched' flag is not set and we end up in:
> 
> OK, I missed that.
> 
> >           <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq
> > 
> > which is called from irq_enter()
> > 
> >           <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
> >           <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies
> > 
> > 
> > So here we would expect:
> > 
> >     	 irq_handler_entry: irq=16 name=jcore_pit
> > 	 ...
> > 	 irq_handler_exit .....
> >     	 
> > 
> > or any other interrupt being invoked. But we see this here:
> 
> According to the trace the 'irq' is soft ('s'). I couldn't find the
> code path from the idle loop to softirq but so maybe this is a bug. Is
> it possible that in some cases the arch irq entry does not get
> identified as a hard irq or traced and then the subsequent tick code
> thinks it's running in a softirq and behaves differently? I could add
> more tracing around irq entry.

No.

irq_enter()
	if (is_idle_task(current) && !in_interrupt()) {
	   	local_bh_disable();
		tick_irq_enter();
		local_bh_enable();
	}
	__irq_enter()
		preempt_count_add(HARDIRQ_OFFSET);

So the 's' comes from local_bh_disable() and the code does not behave
special because of that. It's just always that way. Look at all the other
instances of tick_irq_enter() which do not show that issue.

> >           <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick
> > 
> > And that's just wrong. 
> 
> Can you explain?

Because you drop out of the idle spin due to an interrupt, but no interrupt is
handled according to the trace. You just go back to sleep and the trace is
full of this behaviour.

> > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > 
> > IIRC you said earlier when the percpu interrupt discussion happened, that
> > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > the case here.
> 
> No, I said the opposite. They use the same irq number but they're each
> wired to their own cpu, and there's no way for them to fire on the
> wrong cpu.

Your patch does:

> +       err = request_irq(pit_irq, jcore_timer_interrupt,
> +                         IRQF_TIMER | IRQF_PERCPU,
> +                         "jcore_pit", jcore_pit_percpu);

which is the wrong thing to do. You need request_percpu_irq() here, but of
course with the irq chip you are using, which has every callback as a noop,
it does not matter at all. So that's not an issue and not the reason for
this wreckage.

But what you need to figure out is why this happens:

100.00x	 hrtimer_start(expires = 100.01)
100.00x	 idle spin
100.01x	 tick_irq_enter()
100.01x	 tick_stop()

i.e. why do you drop out of your idle spin, take the low level interrupt
entry path which calls irq_enter() -> tick_irq_enter() and then you do not
handle any interrupt at all and drop back into the idle loop. That's the
real question and that's a problem in your architecture low level interrupt
entry code or some hardware related issue. But certainly not a bug in the
core tick code.

You can come up with another boat load of weird theories about bugs in
that code, but I won't even have a look _before_ you can explain why the
above sequence happens and the PIT timer interrupt is not handled.

Thanks,

	tglx




    

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-08 17:03                       ` Thomas Gleixner
@ 2016-10-09  1:28                         ` Rich Felker
  2016-10-09  9:14                           ` Thomas Gleixner
  0 siblings, 1 reply; 30+ messages in thread
From: Rich Felker @ 2016-10-09  1:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 01:32:06PM +0200, Thomas Gleixner wrote:
> > > CPU spins and waits for an interrupt to happen
> > > 
> > > 
> > >           <idle>-0     [000] d...   150.841530: rcu_dyntick: End 0 1
> > > 
> > > Dropping out of the spin about the time we expect the PIT interrupt to
> > > fire. And an interrupt is the reason why we drop out because the 'need
> > > resched' flag is not set and we end up in:
> > 
> > OK, I missed that.
> > 
> > >           <idle>-0     [000] d.s.   150.841724: tick_irq_enter: update jiffies via irq
> > > 
> > > which is called from irq_enter()
> > > 
> > >           <idle>-0     [000] d.s.   150.841918: tick_do_update_jiffies64: finished do_timer(1)
> > >           <idle>-0     [000] d.s.   150.842348: tick_do_update_jiffies64: finished updating jiffies
> > > 
> > > 
> > > So here we would expect:
> > > 
> > >     	 irq_handler_entry: irq=16 name=jcore_pit
> > > 	 ...
> > > 	 irq_handler_exit .....
> > >     	 
> > > 
> > > or any other interrupt being invoked. But we see this here:
> > 
> > According to the trace the 'irq' is soft ('s'). I couldn't find the
> > code path from the idle loop to softirq but so maybe this is a bug. Is
> > it possible that in some cases the arch irq entry does not get
> > identified as a hard irq or traced and then the subsequent tick code
> > thinks it's running in a softirq and behaves differently? I could add
> > more tracing around irq entry.
> 
> No.
> 
> irq_enter()
> 	if (is_idle_task(current) && !in_interrupt()) {
> 	   	local_bh_disable();
> 		tick_irq_enter();
> 		local_bh_enable();
> 	}
> 	__irq_enter()
> 		preempt_count_add(HARDIRQ_OFFSET);
> 
> So the 's' comes from local_bh_disable() and the code does not behave
> special because of that. It's just always that way. Look at all the other
> instances of tick_irq_enter() which do not show that issue.

Thank you -- that was really confusing me. Now that I know there's
actually hardirq handling going on, I know where to look for the problem.

> > >           <idle>-0     [000] d...   150.842603: __tick_nohz_idle_enter: can stop idle tick
> > > 
> > > And that's just wrong. 
> > 
> > Can you explain?
> 
> Because you drop out the idle spin due to an interrupt, but no interrupt is
> handled according to the trace. You just go back to sleep and the trace is
> full of this behaviour.

Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
ignored if the same interrupt is being handled on the other cpu.
Modifying the jcore aic driver to call handle_percpu_irq rather than
handle_simple_irq when the irq was registered as percpu solves the
problem, but I'm actually concerned that handle_simple_irq would also
be wrong in the case of non-percpu irqs if they could be delivered on
either core -- if another irq arrives after the driver is finished
checking for new device status, but before the IRQD_IRQ_INPROGRESS
flag is removed, it seems handle_simple_irq loses the irq. This is not
a problem for the current hardware since non-percpu irqs always arrive
on cpu0, but I'd rather improve the driver to properly handle possible
future hardware that allows irq delivery on any core.
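
For context, the relevant part of handle_simple_irq() as I read it is
roughly this (condensed from kernel/irq/chip.c, so approximate):

void handle_simple_irq(struct irq_desc *desc)
{
        raw_spin_lock(&desc->lock);

        /* irq_may_run() fails while IRQD_IRQ_INPROGRESS is set, i.e. while
         * the same irq is still being handled on the other cpu, and nothing
         * replays the irq afterwards -- it is simply dropped. */
        if (!irq_may_run(desc))
                goto out_unlock;

        kstat_incr_irqs_this_cpu(desc);
        handle_irq_event(desc);

out_unlock:
        raw_spin_unlock(&desc->lock);
}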

> > > Both CPUs have the same interrupt number for their per cpu PIT interrupt.
> > > 
> > > IIRC you said earlier when the percpu interrupt discussion happened, that
> > > your per cpu PITs have distinct interrupt numbers, but that does not seem
> > > the case here.
> > 
> > No, I said the opposite. They use the same irq number but they're each
> > wired to their own cpu, and there's no way for them to fire on the
> > wrong cpu.
> 
> Your patch does:
> 
> > +       err = request_irq(pit_irq, jcore_timer_interrupt,
> > +                         IRQF_TIMER | IRQF_PERCPU,
> > +                         "jcore_pit", jcore_pit_percpu);
> 
> which is the wrong thing to do. You need request_percpu_irq() here, but of
> course with the irq chip you are using, which has every callback as a noop,
> it does not matter at all. So that's not an issue and not the reason for
> this wreckage.

Mentioning this was helpful to get me looking at the right things
tracking from the irq entry point to where the irq was lost/dropped,
so thanks for bringing it up. request_irq ends up working fine still
because IRQF_PERCPU causes __setup_irq to mark the irqdesc/irq_data as
percpu, so that my handle function can treat it differently. Here's
the patch that has it working:

diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
index 5e5e3bb..b53a8a5 100644
--- a/drivers/irqchip/irq-jcore-aic.c
+++ b/drivers/irqchip/irq-jcore-aic.c
@@ -25,12 +25,20 @@
 
 static struct irq_chip jcore_aic;
 
+static void handle_jcore_irq(struct irq_desc *desc)
+{
+	if (irq_is_percpu(irq_desc_get_irq(desc)))
+		handle_percpu_irq(desc);
+	else
+		handle_simple_irq(desc);
+}
+
 static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
 				   irq_hw_number_t hwirq)
 {
 	struct irq_chip *aic = d->host_data;
 
-	irq_set_chip_and_handler(irq, aic, handle_simple_irq);
+	irq_set_chip_and_handler(irq, aic, handle_jcore_irq);
 
 	return 0;
 }

After some more testing I'll send this patch or something similar.

> But what you need to figure out is why this happens:
> 
> 100.00x	 hrtimer_start(expires = 100.01)
> 100.00x	 idle spin
> 100.01x	 tick_irq_enter()
> 100.01x	 tick_stop()
> 
> i.e. why do you drop out of your idle spin, take the low level interrupt
> entry path which calls irq_enter() -> tick_irq_enter() and then you do not
> handle any interrupt at all and drop back into the idle loop. That's the
> real question and that's a problem in your architecture low level interrupt
> entry code or some hardware related issue. But certainly not a bug in the
> core tick code.
> 
> You can come up with another boat load of weird theories about bugs in
> that code, but I won't even have a look _before_ you can explain why the
> above sequence happens and the PIT timer interrupt is not handled.

Apologies for this big distraction for what was ultimately a driver
bug on my side. I was misled by the way it seemed to be a regression,
since the symptoms did not appear in earlier kernel versions for
whatever reason (likely just chance).

Rich

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-09  1:28                         ` Rich Felker
@ 2016-10-09  9:14                           ` Thomas Gleixner
  2016-10-09 14:35                             ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2016-10-09  9:14 UTC (permalink / raw)
  To: Rich Felker
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

Rich,

On Sat, 8 Oct 2016, Rich Felker wrote:
> On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > Because you drop out the idle spin due to an interrupt, but no interrupt is
> > handled according to the trace. You just go back to sleep and the trace is
> > full of this behaviour.
> 
> Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> ignored if the same interrupt is being handled on the other cpu.
> Modifying the jcore aic driver to call handle_percpu_irq rather than
> handle_simple_irq when the irq was registered as percpu solves the
> problem, but I'm actually concerned that handle_simple_irq would also
> be wrong in the case of non-percpu irqs if they could be delivered on
> either core -- if another irq arrives after the driver is finished
> checking for new device status, but before the IRQD_IRQ_INPROGRESS
> flag is removed, it seems handle_simple_irq loses the irq. This is not
> a problem for the current hardware since non-percpu irqs always arrive
> on cpu0, but I'd rather improve the driver to properly handle possible
> future hardware that allows irq delivery on any core.

We guarantee non-reentrancy for the handlers. That's why we have special
treatment for per cpu interrupts and edge type interrupts, where we mark
the interrupt pending when it arrives before the in progress flag is
cleared and then call the handler again when it returns. For level type
interrupts this is not required because level is going to stay raised in
the hardware.
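
For comparison, the per cpu fast path is more or less this (trimmed from
memory, so approximate):

void handle_percpu_irq(struct irq_desc *desc)
{
        struct irq_chip *chip = irq_desc_get_chip(desc);

        kstat_incr_irqs_this_cpu(desc);

        /* No desc->lock and no IRQD_IRQ_INPROGRESS check: the interrupt is
         * assumed to be strictly cpu local. */
        if (chip->irq_ack)
                chip->irq_ack(&desc->irq_data);

        handle_irq_event_percpu(desc);

        if (chip->irq_eoi)
                chip->irq_eoi(&desc->irq_data);
}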

> > which is the wrong thing to do. You need request_percpu_irq() here, but of
> > course with the irq chip you are using, which has every callback as a noop,
> > it does not matter at all. So that's not an issue and not the reason for
> > this wreckage.
> 
> Mentioning this was helpful to get me looking at the right things
> tracking from the irq entry point to where the irq was lost/dropped,
> so thanks for bringing it up.

Duh, forgot about reentrancy. I should have thought about it when staring
at the traces. I noticed that the irq on the other cpu was handled exactly
at the same point, but I did not make the connection. In all the
problematic cases there is this sequence:

          <idle>-0     [000] d.s.   150.861703: tick_irq_enter: update jiffies via irq
          <idle>-0     [001] d.h.   150.861827: irq_handler_entry: irq=16 name=jcore_pit

i.e. the handler on cpu1 is invoked before cpu0 has reached
handle_simple_irq(). 

> request_irq ends up working fine still because IRQF_PERCPU causes
> __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> function can treat it differently. Here's the patch that has it working:
 
> diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
> index 5e5e3bb..b53a8a5 100644
> --- a/drivers/irqchip/irq-jcore-aic.c
> +++ b/drivers/irqchip/irq-jcore-aic.c
> @@ -25,12 +25,20 @@
>  
>  static struct irq_chip jcore_aic;
>  
> +static void handle_jcore_irq(struct irq_desc *desc)
> +{
> +	if (irq_is_percpu(irq_desc_get_irq(desc)))
> +		handle_percpu_irq(desc);
> +	else
> +		handle_simple_irq(desc);
> +}
> +
>  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
>  				   irq_hw_number_t hwirq)
>  {
>  	struct irq_chip *aic = d->host_data;
>  
> -	irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> +	irq_set_chip_and_handler(irq, aic, handle_jcore_irq);

What you should do is:

     	if (jcore_irq_per_cpu(hwirq))
		irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
	else
		irq_set_chip_and_handler(irq, aic, handle_simple_irq);
	
That avoids the conditional in the irq delivery path, which is overly
expensive as you end up looking up the irq descriptor which you already
have. You can avoid the lookup by using:

      if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))

but that's still a conditional in the hotpath which you can avoid entirely
by setting up the proper handler when you map it.

> Apologies for this big distraction for what was ultimately a driver
> bug on my side. I was misled by the way it seemed to be a regression,
> since the symptoms did not appear in earlier kernel versions for
> whatever reason (likely just chance).

No problem. Glad that I was able to help.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver
  2016-10-09  9:14                           ` Thomas Gleixner
@ 2016-10-09 14:35                             ` Rich Felker
  0 siblings, 0 replies; 30+ messages in thread
From: Rich Felker @ 2016-10-09 14:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Daniel Lezcano, devicetree, linux-kernel, linux-sh, Rob Herring,
	Mark Rutland, Paul E. McKenney

On Sun, Oct 09, 2016 at 11:14:20AM +0200, Thomas Gleixner wrote:
> Rich,
> 
> On Sat, 8 Oct 2016, Rich Felker wrote:
> > On Sat, Oct 08, 2016 at 07:03:30PM +0200, Thomas Gleixner wrote:
> > > Because you drop out the idle spin due to an interrupt, but no interrupt is
> > > handled according to the trace. You just go back to sleep and the trace is
> > > full of this behaviour.
> > 
> > Found the problem. IRQD_IRQ_INPROGRESS is causing the interrupt to be
> > ignored if the same interrupt is being handled on the other cpu.
> > Modifying the jcore aic driver to call handle_percpu_irq rather than
> > handle_simple_irq when the irq was registered as percpu solves the
> > problem, but I'm actually concerned that handle_simple_irq would also
> > be wrong in the case of non-percpu irqs if they could be delivered on
> > either core -- if another irq arrives after the driver is finished
> > checking for new device status, but before the IRQD_IRQ_INPROGRESS
> > flag is removed, it seems handle_simple_irq loses the irq. This is not
> > a problem for the current hardware since non-percpu irqs always arrive
> > on cpu0, but I'd rather improve the driver to properly handle possible
> > future hardware that allows irq delivery on any core.
> 
> We guarantee no-rentrancy for the handlers. That's why we have special
> treatment for per cpu interrupts and edge type interrupts, where we mark
> the interrupt pending when it arrives before the in progress flag is
> cleared and then call the handler again when it returns. For level type
> interrupts this is not required because level is going to stay raised in
> the hardware.

I understand the no-reentrancy guarantee, but it seems that, for a
"simple" irq with no ack/etc. mechanisms, if another interrupt has
arrives during handling, a flag for that has to be kept and the
interrupt handler re-invoked after the first return. Otherwise
interrupts are lost. Perhaps handle_simple_irq is written with the
intent that individual drivers have to do something ack-like with
their hardware. If so then it's not the right handler (although it
works at present as long as non-percpu interrupts are bound to a
single cpu at the hardware level) and I should probably write an
appropriate handle function.
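
Something along these lines is what I have in mind -- purely an untested
sketch, and since it pokes at core-internal state (desc->istate,
IRQS_PENDING, kstat_incr_irqs_this_cpu) it would have to live next to the
other flow handlers in kernel/irq/chip.c rather than in the driver:

static void handle_replay_simple_irq(struct irq_desc *desc)
{
        raw_spin_lock(&desc->lock);

        if (irqd_irq_inprogress(&desc->irq_data)) {
                /* Handler still running on another cpu: remember the irq
                 * instead of dropping it. */
                desc->istate |= IRQS_PENDING;
                goto out_unlock;
        }

        kstat_incr_irqs_this_cpu(desc);

        /* Replay until no new instance arrived while we were handling. */
        do {
                desc->istate &= ~IRQS_PENDING;
                handle_irq_event(desc); /* drops and retakes desc->lock */
        } while (desc->istate & IRQS_PENDING);

out_unlock:
        raw_spin_unlock(&desc->lock);
}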

Or, if we want to consider the current behavior a guarantee so that
changing it would require a new compatible string, then
handle_percpu_irq or an equivalently simple function could be used
even for non-percpu irqs and avoid all the locking overhead. This
would save a lot more time in the hot path than eliminating the one
conditional (as discussed below).

> > request_irq ends up working fine still because IRQF_PERCPU causes
> > __setup_irq to mark the irqdesc/irq_data as percpu, so that my handle
> > function can treat it differently. Here's the patch that has it working:
>  
> > diff --git a/drivers/irqchip/irq-jcore-aic.c b/drivers/irqchip/irq-jcore-aic.c
> > index 5e5e3bb..b53a8a5 100644
> > --- a/drivers/irqchip/irq-jcore-aic.c
> > +++ b/drivers/irqchip/irq-jcore-aic.c
> > @@ -25,12 +25,20 @@
> >  
> >  static struct irq_chip jcore_aic;
> >  
> > +static void handle_jcore_irq(struct irq_desc *desc)
> > +{
> > +	if (irq_is_percpu(irq_desc_get_irq(desc)))
> > +		handle_percpu_irq(desc);
> > +	else
> > +		handle_simple_irq(desc);
> > +}
> > +
> >  static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
> >  				   irq_hw_number_t hwirq)
> >  {
> >  	struct irq_chip *aic = d->host_data;
> >  
> > -	irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> > +	irq_set_chip_and_handler(irq, aic, handle_jcore_irq);
> 
> What you should do is:
> 
>      	if (jcore_irq_per_cpu(hwirq))
> 		irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
> 	else
> 		irq_set_chip_and_handler(irq, aic, handle_simple_irq);
> 	
> That avoids the conditional in the irq delivery path,

I'll follow up on this in the thread on the patch.
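
For reference, the shape I have in mind for the mapping function is
something like the following, where jcore_hwirq_is_percpu() is a
hypothetical helper -- which hwirqs are wired per-cpu is a property of the
SoC, so the real test depends on how that ends up being described:

static int jcore_aic_irqdomain_map(struct irq_domain *d, unsigned int irq,
                                   irq_hw_number_t hwirq)
{
        struct irq_chip *aic = d->host_data;

        if (jcore_hwirq_is_percpu(hwirq))
                irq_set_chip_and_handler(irq, aic, handle_percpu_irq);
        else
                irq_set_chip_and_handler(irq, aic, handle_simple_irq);

        return 0;
}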

> which is overly
> expensive as you end up looking up the irq descriptor which you already
> have. You can avoid the lookup by using:
> 
>       if (irqd_is_per_cpu(irq_desc_get_irq_data(desc)))
> 
> but that's still a conditional in the hotpath which you can avoid entirely
> by setting up the proper handler when you map it.

Indeed that helps. Having CONFIG_FRAME_POINTER on was making both
versions huge (because of the implicit -fno-optimize-sibling-calls)
but with that off, it's much more reasonable.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-10-09 14:37 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-24  5:07 [PATCH v7 0/2] J-Core timer support Rich Felker
2016-09-24  5:07 ` [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver Rich Felker
2016-09-26 21:07   ` Rich Felker
2016-09-26 21:27     ` Daniel Lezcano
2016-09-26 22:23       ` Rich Felker
2016-09-26 23:55         ` Thomas Gleixner
2016-09-27  0:42           ` Rich Felker
2016-09-27 22:08             ` Rich Felker
2016-09-30 13:15               ` Thomas Gleixner
2016-09-30 13:48                 ` Paul E. McKenney
2016-10-01 17:05                   ` Rich Felker
2016-10-01 17:58                     ` Paul E. McKenney
2016-10-02  0:00                       ` Rich Felker
2016-10-02  3:59                         ` Rich Felker
2016-10-02  5:59                           ` Paul E. McKenney
2016-10-02  6:30                         ` Paul E. McKenney
2016-10-08  1:32                 ` Rich Felker
2016-10-08 11:32                   ` Thomas Gleixner
2016-10-08 16:26                     ` Rich Felker
2016-10-08 17:03                       ` Thomas Gleixner
2016-10-09  1:28                         ` Rich Felker
2016-10-09  9:14                           ` Thomas Gleixner
2016-10-09 14:35                             ` Rich Felker
2016-10-03 22:10       ` Rich Felker
2016-10-04  7:06         ` Paul E. McKenney
2016-10-04 20:58           ` Rich Felker
2016-10-04 21:14             ` Paul E. McKenney
2016-10-04 21:48               ` Rich Felker
2016-10-05 12:41                 ` Paul E. McKenney
2016-09-24  5:07 ` [PATCH v7 1/2] of: add J-Core timer bindings Rich Felker

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).