linux-kernel.vger.kernel.org archive mirror
* [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
@ 2018-02-15 19:17 Dmitry Safonov
  2018-03-05 15:00 ` Dmitry Safonov
  2018-03-15 13:46 ` Joerg Roedel
  0 siblings, 2 replies; 13+ messages in thread
From: Dmitry Safonov @ 2018-02-15 19:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: 0x7f454c46, Dmitry Safonov, Alex Williamson, David Woodhouse,
	Ingo Molnar, Joerg Roedel, Lu Baolu, iommu

There is a ratelimit for printing, but it is incremented each time the
cpu receives a dmar fault interrupt, while one interrupt may signal
*many* faults.
Measuring the impact shows that reading/clearing one fault takes
< 1 usec, while printing info about the fault takes ~170 msec.

Keeping in mind that the maximum number of fault recording registers
per remapping hardware unit is 256, the IRQ handler may run for
(170*256) msec, i.e. ~43.5 seconds. And as the fault-serving loop runs
without a time limit, new faults may occur while it is being serviced.

Ratelimit each fault printing rather than each irq printing.

Fixes: c43fce4eebae ("iommu/vt-d: Ratelimit fault handler")

BUG: spinlock lockup suspected on CPU#0, CliShell/9903
 lock: 0xffffffff81a47440, .magic: dead4ead, .owner: kworker/u16:2/8915, .owner_cpu: 6
CPU: 0 PID: 9903 Comm: CliShell
Call Trace:
[..] dump_stack+0x65/0x83
[..] spin_dump+0x8f/0x94
[..] do_raw_spin_lock+0x123/0x170
[..] _raw_spin_lock_irqsave+0x32/0x3a
[..] uart_chars_in_buffer+0x20/0x4d
[..] tty_chars_in_buffer+0x18/0x1d
[..] n_tty_poll+0x1cb/0x1f2
[..] tty_poll+0x5e/0x76
[..] do_select+0x363/0x629
[..] compat_core_sys_select+0x19e/0x239
[..] compat_SyS_select+0x98/0xc0
[..] sysenter_dispatch+0x7/0x25
[..]
NMI backtrace for cpu 6
CPU: 6 PID: 8915 Comm: kworker/u16:2
Workqueue: dmar_fault dmar_fault_work
Call Trace:
[..] wait_for_xmitr+0x26/0x8f
[..] serial8250_console_putchar+0x1c/0x2c
[..] uart_console_write+0x40/0x4b
[..] serial8250_console_write+0xe6/0x13f
[..] call_console_drivers.constprop.13+0xce/0x103
[..] console_unlock+0x1f8/0x39b
[..] vprintk_emit+0x39e/0x3e6
[..] printk+0x4d/0x4f
[..] dmar_fault+0x1a8/0x1fc
[..] dmar_fault_work+0x15/0x17
[..] process_one_work+0x1e8/0x3a9
[..] worker_thread+0x25d/0x345
[..] kthread+0xea/0xf2
[..] ret_from_fork+0x58/0x90

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
Maybe it's worth limiting the while(1) loop:
if the IOMMU generates faults at the same speed as the irq handler
clears them, this may again turn into a long irq-disabled region.
Not sure whether that can happen in practice.

 drivers/iommu/dmar.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index accf58388bdb..6c4ea32ee6a9 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
 	int reg, fault_index;
 	u32 fault_status;
 	unsigned long flag;
-	bool ratelimited;
 	static DEFINE_RATELIMIT_STATE(rs,
 				      DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
-	/* Disable printing, simply clear the fault when ratelimited */
-	ratelimited = !__ratelimit(&rs);
-
 	raw_spin_lock_irqsave(&iommu->register_lock, flag);
 	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
-	if (fault_status && !ratelimited)
+	if (fault_status && __ratelimit(&rs))
 		pr_err("DRHD: handling fault status reg %x\n", fault_status);
 
 	/* TBD: ignore advanced fault log currently */
@@ -1638,6 +1634,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
 	fault_index = dma_fsts_fault_record_index(fault_status);
 	reg = cap_fault_reg_offset(iommu->cap);
 	while (1) {
+		/* Disable printing, simply clear the fault when ratelimited */
+		bool ratelimited = !__ratelimit(&rs);
 		u8 fault_reason;
 		u16 source_id;
 		u64 guest_addr;
-- 
2.13.6
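
For readers skimming the diff, this is roughly the shape of the handler
after the change -- a condensed sketch based only on the hunks above,
with the fault-record decoding and the function's tail elided (assumed
to remain as in mainline):

	irqreturn_t dmar_fault(int irq, void *dev_id)
	{
		struct intel_iommu *iommu = dev_id;
		int reg, fault_index;
		u32 fault_status;
		unsigned long flag;
		static DEFINE_RATELIMIT_STATE(rs,
					      DEFAULT_RATELIMIT_INTERVAL,
					      DEFAULT_RATELIMIT_BURST);

		raw_spin_lock_irqsave(&iommu->register_lock, flag);
		fault_status = readl(iommu->reg + DMAR_FSTS_REG);
		if (fault_status && __ratelimit(&rs))
			pr_err("DRHD: handling fault status reg %x\n",
			       fault_status);

		fault_index = dma_fsts_fault_record_index(fault_status);
		reg = cap_fault_reg_offset(iommu->cap);
		while (1) {
			/* the ratelimit is now charged once per fault
			 * record, not once per interrupt */
			bool ratelimited = !__ratelimit(&rs);

			/* read one fault record, clear it, and print it
			 * only when !ratelimited; exit the loop once no
			 * fault is pending (unchanged mainline code,
			 * elided here) */
		}
	}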


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-02-15 19:17 [PATCHv3] iommu/intel: Ratelimit each dmar fault printing Dmitry Safonov
@ 2018-03-05 15:00 ` Dmitry Safonov
  2018-03-13 16:21   ` Dmitry Safonov
  2018-03-15 13:46 ` Joerg Roedel
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-05 15:00 UTC (permalink / raw)
  To: linux-kernel, Joerg Roedel
  Cc: 0x7f454c46, Alex Williamson, David Woodhouse, Ingo Molnar,
	Lu Baolu, iommu

Hi Joerg,

What do you think about v3?
It looks like I can solve my soft-lockups with just a bit more precise
ratelimiting.

On Thu, 2018-02-15 at 19:17 +0000, Dmitry Safonov wrote:
> There is a ratelimit for printing, but it is incremented each time the
> cpu receives a dmar fault interrupt, while one interrupt may signal
> *many* faults.
> Measuring the impact shows that reading/clearing one fault takes
> < 1 usec, while printing info about the fault takes ~170 msec.
> 
> Keeping in mind that the maximum number of fault recording registers
> per remapping hardware unit is 256, the IRQ handler may run for
> (170*256) msec, i.e. ~43.5 seconds. And as the fault-serving loop runs
> without a time limit, new faults may occur while it is being serviced.
> 
> Ratelimit each fault printing rather than each irq printing.
> 
> Fixes: c43fce4eebae ("iommu/vt-d: Ratelimit fault handler")
> 
> BUG: spinlock lockup suspected on CPU#0, CliShell/9903
>  lock: 0xffffffff81a47440, .magic: dead4ead, .owner: kworker/u16:2/8915, .owner_cpu: 6
> CPU: 0 PID: 9903 Comm: CliShell
> Call Trace:
> [..] dump_stack+0x65/0x83
> [..] spin_dump+0x8f/0x94
> [..] do_raw_spin_lock+0x123/0x170
> [..] _raw_spin_lock_irqsave+0x32/0x3a
> [..] uart_chars_in_buffer+0x20/0x4d
> [..] tty_chars_in_buffer+0x18/0x1d
> [..] n_tty_poll+0x1cb/0x1f2
> [..] tty_poll+0x5e/0x76
> [..] do_select+0x363/0x629
> [..] compat_core_sys_select+0x19e/0x239
> [..] compat_SyS_select+0x98/0xc0
> [..] sysenter_dispatch+0x7/0x25
> [..]
> NMI backtrace for cpu 6
> CPU: 6 PID: 8915 Comm: kworker/u16:2
> Workqueue: dmar_fault dmar_fault_work
> Call Trace:
> [..] wait_for_xmitr+0x26/0x8f
> [..] serial8250_console_putchar+0x1c/0x2c
> [..] uart_console_write+0x40/0x4b
> [..] serial8250_console_write+0xe6/0x13f
> [..] call_console_drivers.constprop.13+0xce/0x103
> [..] console_unlock+0x1f8/0x39b
> [..] vprintk_emit+0x39e/0x3e6
> [..] printk+0x4d/0x4f
> [..] dmar_fault+0x1a8/0x1fc
> [..] dmar_fault_work+0x15/0x17
> [..] process_one_work+0x1e8/0x3a9
> [..] worker_thread+0x25d/0x345
> [..] kthread+0xea/0xf2
> [..] ret_from_fork+0x58/0x90
> 
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Cc: iommu@lists.linux-foundation.org
> Signed-off-by: Dmitry Safonov <dima@arista.com>
> ---
> Maybe it's worth limiting the while(1) loop:
> if the IOMMU generates faults at the same speed as the irq handler
> clears them, this may again turn into a long irq-disabled region.
> Not sure whether that can happen in practice.
> 
>  drivers/iommu/dmar.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index accf58388bdb..6c4ea32ee6a9 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>  	int reg, fault_index;
>  	u32 fault_status;
>  	unsigned long flag;
> -	bool ratelimited;
>  	static DEFINE_RATELIMIT_STATE(rs,
>  				      DEFAULT_RATELIMIT_INTERVAL,
>  				      DEFAULT_RATELIMIT_BURST);
>  
> -	/* Disable printing, simply clear the fault when ratelimited */
> -	ratelimited = !__ratelimit(&rs);
> -
>  	raw_spin_lock_irqsave(&iommu->register_lock, flag);
>  	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> -	if (fault_status && !ratelimited)
> +	if (fault_status && __ratelimit(&rs))
>  		pr_err("DRHD: handling fault status reg %x\n", fault_status);
>  
>  	/* TBD: ignore advanced fault log currently */
> @@ -1638,6 +1634,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>  	fault_index = dma_fsts_fault_record_index(fault_status);
>  	reg = cap_fault_reg_offset(iommu->cap);
>  	while (1) {
> +		/* Disable printing, simply clear the fault when ratelimited */
> +		bool ratelimited = !__ratelimit(&rs);
>  		u8 fault_reason;
>  		u16 source_id;
>  		u64 guest_addr;


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-05 15:00 ` Dmitry Safonov
@ 2018-03-13 16:21   ` Dmitry Safonov
  0 siblings, 0 replies; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-13 16:21 UTC (permalink / raw)
  To: linux-kernel, Joerg Roedel
  Cc: 0x7f454c46, Alex Williamson, David Woodhouse, Ingo Molnar,
	Lu Baolu, iommu

Gentle ping?

On Mon, 2018-03-05 at 15:00 +0000, Dmitry Safonov wrote:
> Hi Joerg,
> 
> What do you think about v3?
> It looks like I can solve my soft-lockups with just a bit more precise
> ratelimiting.
> 
> On Thu, 2018-02-15 at 19:17 +0000, Dmitry Safonov wrote:
> > [...]


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-02-15 19:17 [PATCHv3] iommu/intel: Ratelimit each dmar fault printing Dmitry Safonov
  2018-03-05 15:00 ` Dmitry Safonov
@ 2018-03-15 13:46 ` Joerg Roedel
  2018-03-15 14:13   ` Dmitry Safonov
  1 sibling, 1 reply; 13+ messages in thread
From: Joerg Roedel @ 2018-03-15 13:46 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, Feb 15, 2018 at 07:17:29PM +0000, Dmitry Safonov wrote:
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> index accf58388bdb..6c4ea32ee6a9 100644
> --- a/drivers/iommu/dmar.c
> +++ b/drivers/iommu/dmar.c
> @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
>  	int reg, fault_index;
>  	u32 fault_status;
>  	unsigned long flag;
> -	bool ratelimited;
>  	static DEFINE_RATELIMIT_STATE(rs,
>  				      DEFAULT_RATELIMIT_INTERVAL,
>  				      DEFAULT_RATELIMIT_BURST);
>  
> -	/* Disable printing, simply clear the fault when ratelimited */
> -	ratelimited = !__ratelimit(&rs);
> -
>  	raw_spin_lock_irqsave(&iommu->register_lock, flag);
>  	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> -	if (fault_status && !ratelimited)
> +	if (fault_status && __ratelimit(&rs))
>  		pr_err("DRHD: handling fault status reg %x\n", fault_status);

This looks awful. Have you tried limiting the number of loops in this
function and then returning? The next faults can be handled by the next
interrupt. That ensures the cpu visits a scheduling point from time to
time, so you don't see soft-lockups.



	Joerg
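
For concreteness, a bounded-loop variant along the lines suggested
above could look like the sketch below. This is a hypothetical
illustration, not code from the thread; the constant name and value
are made up:

	/* DMAR_FAULTS_PER_IRQ is an illustrative name and value */
	#define DMAR_FAULTS_PER_IRQ	16

	int handled = 0;

	while (handled++ < DMAR_FAULTS_PER_IRQ) {
		/* read and clear one fault record, as today */
	}
	/*
	 * Fault records beyond the cap stay pending in hardware and
	 * are picked up by the next interrupt, so each run of the
	 * handler is bounded in time and the cpu can reach a
	 * scheduling point in between.
	 */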


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 13:46 ` Joerg Roedel
@ 2018-03-15 14:13   ` Dmitry Safonov
  2018-03-15 14:22     ` Joerg Roedel
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-15 14:13 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, 2018-03-15 at 14:46 +0100, Joerg Roedel wrote:
> On Thu, Feb 15, 2018 at 07:17:29PM +0000, Dmitry Safonov wrote:
> > diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
> > index accf58388bdb..6c4ea32ee6a9 100644
> > --- a/drivers/iommu/dmar.c
> > +++ b/drivers/iommu/dmar.c
> > @@ -1618,17 +1618,13 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
> >  	int reg, fault_index;
> >  	u32 fault_status;
> >  	unsigned long flag;
> > -	bool ratelimited;
> >  	static DEFINE_RATELIMIT_STATE(rs,
> >  				      DEFAULT_RATELIMIT_INTERVAL,
> >  				      DEFAULT_RATELIMIT_BURST);
> >  
> > -	/* Disable printing, simply clear the fault when ratelimited */
> > -
> >  	raw_spin_lock_irqsave(&iommu->register_lock, flag);
> >  	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
> > -	if (fault_status && !ratelimited)
> > +	if (fault_status && __ratelimit(&rs))
> >  		pr_err("DRHD: handling fault status reg %x\n", fault_status);
> 
> This looks awful. Have you tried limiting the number of loops in this
> function and then returning? The next faults can be handled by the
> next interrupt. That ensures the cpu visits a scheduling point from
> time to time, so you don't see soft-lockups.

So, you suggest removing the ratelimit altogether?
Do we really need a printk flood for every fault that happens?
Imagine you have hundreds of mappings and then the PCI link flaps:
wouldn't it be better to keep the ratelimiting?
I don't mind, it just looks a bit strange to me.

-- 
Thanks,
             Dmitry


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 14:13   ` Dmitry Safonov
@ 2018-03-15 14:22     ` Joerg Roedel
  2018-03-15 14:34       ` Dmitry Safonov
  0 siblings, 1 reply; 13+ messages in thread
From: Joerg Roedel @ 2018-03-15 14:22 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, Mar 15, 2018 at 02:13:03PM +0000, Dmitry Safonov wrote:
> So, you suggest removing the ratelimit altogether?
> Do we really need a printk flood for every fault that happens?
> Imagine you have hundreds of mappings and then the PCI link flaps:
> wouldn't it be better to keep the ratelimiting?
> I don't mind, it just looks a bit strange to me.

I never said you should remove the ratelimiting, after all you are
trying to fix a soft-lockup, no?

And that should not be fixed by changes to the ratelimiting, but with
proper irq handling.


	Joerg


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 14:22     ` Joerg Roedel
@ 2018-03-15 14:34       ` Dmitry Safonov
  2018-03-15 14:42         ` Dmitry Safonov
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-15 14:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, 2018-03-15 at 15:22 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:13:03PM +0000, Dmitry Safonov wrote:
> > So, you suggest removing the ratelimit altogether?
> > Do we really need a printk flood for every fault that happens?
> > Imagine you have hundreds of mappings and then the PCI link flaps:
> > wouldn't it be better to keep the ratelimiting?
> > I don't mind, it just looks a bit strange to me.
> 
> I never said you should remove the ratelimiting, after all you are
> trying to fix a soft-lockup, no?
> 
> And that should not be fixed by changes to the ratelimiting, but with
> proper irq handling.

Uh, I'm a bit confused then.
- Isn't it better to ratelimit each printk() instead of the whole bunch
of printks inside the irq handler?
- I can limit the number of loops, but most of the time is spent in the
loop on printk() (on my machine ~170 msec per iteration), while
everything else takes far less time (on my machine < 1 usec per
iteration). So if I limit the number of loops per irq, that cycle limit
will really be about bounding the time spent in printk (i.e., how many
printks can be done in atomic context without locking up the node). It
smells like ratelimiting, no?

I must be misunderstanding something, but why introduce another limit
on the number of printk() calls when there is already a ratelimit that
can be tuned?

-- 
Thanks,
             Dmitry
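
For reference, the tunable being discussed: DEFINE_RATELIMIT_STATE()
takes an interval and a burst, and __ratelimit() returns true while
printing is still allowed, up to 'burst' passes per 'interval'. With
the kernel defaults that is 10 messages per 5 seconds. A minimal
sketch of the usage in this handler:

	static DEFINE_RATELIMIT_STATE(rs,
				      DEFAULT_RATELIMIT_INTERVAL, /* 5 * HZ */
				      DEFAULT_RATELIMIT_BURST);   /* 10 */

	/*
	 * True while under the limit: with the defaults, at most 10
	 * times per 5-second interval. Suppressed calls are counted
	 * and reported once the interval expires. Picking a longer
	 * interval or a smaller burst trades fault visibility for
	 * less console time.
	 */
	if (__ratelimit(&rs))
		pr_err("DRHD: handling fault status reg %x\n", fault_status);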


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 14:34       ` Dmitry Safonov
@ 2018-03-15 14:42         ` Dmitry Safonov
  2018-03-15 15:28           ` Joerg Roedel
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-15 14:42 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, 2018-03-15 at 14:34 +0000, Dmitry Safonov wrote:
> On Thu, 2018-03-15 at 15:22 +0100, Joerg Roedel wrote:
> > On Thu, Mar 15, 2018 at 02:13:03PM +0000, Dmitry Safonov wrote:
> > > So, you suggest removing the ratelimit altogether?
> > > Do we really need a printk flood for every fault that happens?
> > > Imagine you have hundreds of mappings and then the PCI link flaps:
> > > wouldn't it be better to keep the ratelimiting?
> > > I don't mind, it just looks a bit strange to me.
> > 
> > I never said you should remove the ratelimiting, after all you are
> > trying to fix a soft-lockup, no?
> > 
> > And that should not be fixed by changes to the ratelimiting, but
> > with proper irq handling.
> 
> Uh, I'm a bit confused then.
> - Isn't it better to ratelimit each printk() instead of the whole
> bunch of printks inside the irq handler?
> - I can limit the number of loops, but most of the time is spent in
> the loop on printk() (on my machine ~170 msec per iteration), while
> everything else takes far less time (on my machine < 1 usec per
> iteration). So if I limit the number of loops per irq, that cycle
> limit will really be about bounding the time spent in printk (i.e.,
> how many printks can be done in atomic context without locking up the
> node). It smells like ratelimiting, no?
> 
> I must be misunderstanding something, but why introduce another limit
> on the number of printk() calls when there is already a ratelimit
> that can be tuned?
> 

So I agree that it may be better to have a limit on the loop *also*,
because if we clear faults at the same speed as the hw generates them,
we may get stuck in the loop.
But by my measurements clearing a fault is so fast (< 1 usec) that I'm
not sure this can happen with real hw. For that reason I didn't
introduce a loop limit.

But even with a loop limit we will need to ratelimit each printk()
*also*. Otherwise the loop limit will be based on time spent printing,
not on anything else.
The patch makes sense even with a loop limit, in my opinion.

-- 
Thanks,
             Dmitry


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 14:42         ` Dmitry Safonov
@ 2018-03-15 15:28           ` Joerg Roedel
  2018-03-15 15:54             ` Dmitry Safonov
  2018-03-20 20:50             ` Dmitry Safonov
  0 siblings, 2 replies; 13+ messages in thread
From: Joerg Roedel @ 2018-03-15 15:28 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, Mar 15, 2018 at 02:42:00PM +0000, Dmitry Safonov wrote:
> But even with a loop limit we will need to ratelimit each printk()
> *also*. Otherwise the loop limit will be based on time spent printing,
> not on anything else.
> The patch makes sense even with a loop limit, in my opinion.

Looks like I misread your patch; somehow it looked to me as if you
replaced all 'ratelimited' usages with a call to __ratelimit(), but you
just move 'ratelimited' into the loop, which actually makes sense.

But still, this alone is no proper fix for the soft-lockups you are
seeing.


	Joerg


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 15:28           ` Joerg Roedel
@ 2018-03-15 15:54             ` Dmitry Safonov
  2018-03-20 20:50             ` Dmitry Safonov
  1 sibling, 0 replies; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-15 15:54 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, 2018-03-15 at 16:28 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:42:00PM +0000, Dmitry Safonov wrote:
> > But even with a loop limit we will need to ratelimit each printk()
> > *also*. Otherwise the loop limit will be based on time spent
> > printing, not on anything else.
> > The patch makes sense even with a loop limit, in my opinion.
> 
> Looks like I misread your patch; somehow it looked to me as if you
> replaced all 'ratelimited' usages with a call to __ratelimit(), but
> you just move 'ratelimited' into the loop, which actually makes sense.

Oh, ok

> But still, this alone is no proper fix for the soft-lockups you are
> seeing.

Well, I can also limit the number of loops with, say,
cap_num_fault_regs(). I didn't do that because by my measurements the
time spent clearing a fault is so small that I'm not sure it's
possible to get stuck in this loop.

-- 
Thanks,
             Dmitry
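
A loop limit along the lines mentioned above might look like this
sketch: cap_num_fault_regs() is the existing capability macro (the
number of fault recording registers the unit implements); the rest is
an illustrative guess at what such a patch could do, not code from
the thread:

	/* at most one full pass over the fault recording registers
	 * per interrupt; anything raised meanwhile waits for the
	 * next interrupt */
	int fault_regs = cap_num_fault_regs(iommu->cap);

	while (fault_regs--) {
		bool ratelimited = !__ratelimit(&rs);

		/* read/clear one fault record; print it only when
		 * !ratelimited; break early once none is pending */
	}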


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-15 15:28           ` Joerg Roedel
  2018-03-15 15:54             ` Dmitry Safonov
@ 2018-03-20 20:50             ` Dmitry Safonov
  2018-03-29  8:50               ` Joerg Roedel
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-20 20:50 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Thu, 2018-03-15 at 16:28 +0100, Joerg Roedel wrote:
> On Thu, Mar 15, 2018 at 02:42:00PM +0000, Dmitry Safonov wrote:
> > But even with a loop limit we will need to ratelimit each printk()
> > *also*. Otherwise the loop limit will be based on time spent
> > printing, not on anything else.
> > The patch makes sense even with a loop limit, in my opinion.
> 
> Looks like I misread your patch; somehow it looked to me as if you
> replaced all 'ratelimited' usages with a call to __ratelimit(), but
> you just move 'ratelimited' into the loop, which actually makes sense.

So, is it worth applying the patch?

> But still, this alone is no proper fix for the soft-lockups you are
> seeing.

Hmm, but this fixes my soft-lockup issue, because it's about the time
spent in printk() inside an irq-disabled section, rather than about
exiting the dmar-clearing loop.
And on my hw it makes no difference whether the loop is limited or not,
because clearing a fault is much faster than the hw can generate a new
one.
IOW, it fixes the soft-lockup for me, and the loop-related lockup can't
happen on the hw I have (so that's a different issue, [possibly?] on
other hw).

-- 
Thanks,
             Dmitry


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-20 20:50             ` Dmitry Safonov
@ 2018-03-29  8:50               ` Joerg Roedel
  2018-03-29 13:52                 ` Dmitry Safonov
  0 siblings, 1 reply; 13+ messages in thread
From: Joerg Roedel @ 2018-03-29  8:50 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, 0x7f454c46, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

On Tue, Mar 20, 2018 at 08:50:13PM +0000, Dmitry Safonov wrote:
> Hmm, but this fixes my soft-lockup issue, because it's about the time
> spent in printk() inside an irq-disabled section, rather than about
> exiting the dmar-clearing loop.
> And on my hw it makes no difference whether the loop is limited or
> not, because clearing a fault is much faster than the hw can generate
> a new one.
> IOW, it fixes the soft-lockup for me, and the loop-related lockup
> can't happen on the hw I have (so that's a different issue,
> [possibly?] on other hw).

It might solve your issue, but someone else might still run into it with
a different setup. An upstream fix needs to solve it for everyone.

Thanks,

	Joerg


* Re: [PATCHv3] iommu/intel: Ratelimit each dmar fault printing
  2018-03-29  8:50               ` Joerg Roedel
@ 2018-03-29 13:52                 ` Dmitry Safonov
  0 siblings, 0 replies; 13+ messages in thread
From: Dmitry Safonov @ 2018-03-29 13:52 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Dmitry Safonov, open list, Alex Williamson, David Woodhouse,
	Ingo Molnar, Lu Baolu, iommu

2018-03-29 9:50 GMT+01:00 Joerg Roedel <joro@8bytes.org>:
> On Tue, Mar 20, 2018 at 08:50:13PM +0000, Dmitry Safonov wrote:
>> Hmm, but this fixes my soft-lockup issue, because it's about the time
>> spent in printk() inside an irq-disabled section, rather than about
>> exiting the dmar-clearing loop.
>> And on my hw it makes no difference whether the loop is limited or
>> not, because clearing a fault is much faster than the hw can generate
>> a new one.
>> IOW, it fixes the soft-lockup for me, and the loop-related lockup
>> can't happen on the hw I have (so that's a different issue,
>> [possibly?] on other hw).
>
> It might solve your issue, but someone else might still run into it with
> a different setup. An upstream fix needs to solve it for everyone.

Ok, I'll resend v4 with an additional patch to limit the loop.

Thanks,
             Dmitry


end of thread

Thread overview: 13+ messages
2018-02-15 19:17 [PATCHv3] iommu/intel: Ratelimit each dmar fault printing Dmitry Safonov
2018-03-05 15:00 ` Dmitry Safonov
2018-03-13 16:21   ` Dmitry Safonov
2018-03-15 13:46 ` Joerg Roedel
2018-03-15 14:13   ` Dmitry Safonov
2018-03-15 14:22     ` Joerg Roedel
2018-03-15 14:34       ` Dmitry Safonov
2018-03-15 14:42         ` Dmitry Safonov
2018-03-15 15:28           ` Joerg Roedel
2018-03-15 15:54             ` Dmitry Safonov
2018-03-20 20:50             ` Dmitry Safonov
2018-03-29  8:50               ` Joerg Roedel
2018-03-29 13:52                 ` Dmitry Safonov
