All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-15 16:35 ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2016-03-15 16:35 UTC (permalink / raw)
  To: iommu, dwmw2; +Cc: joro, linux-kernel

Fault rates can easily overwhelm the console and make the system
unresponsive.  Ratelimit to allow an opportunity for maintenance.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/iommu/dmar.c |   28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 8ffd756..628987e 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1579,19 +1579,20 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 	reason = dmar_get_fault_reason(fault_reason, &fault_type);
 
 	if (fault_type == INTR_REMAP)
-		pr_err("INTR-REMAP: Request device [[%02x:%02x.%d] "
-		       "fault index %llx\n"
-			"INTR-REMAP:[fault reason %02d] %s\n",
-			(source_id >> 8), PCI_SLOT(source_id & 0xFF),
-			PCI_FUNC(source_id & 0xFF), addr >> 48,
-			fault_reason, reason);
+		pr_err_ratelimited("INTR-REMAP: Request device [[%02x:%02x.%d] "
+				   "fault index %llx\n"
+				   "INTR-REMAP:[fault reason %02d] %s\n",
+				   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+				   PCI_FUNC(source_id & 0xFF), addr >> 48,
+				   fault_reason, reason);
 	else
-		pr_err("DMAR:[%s] Request device [%02x:%02x.%d] "
-		       "fault addr %llx \n"
-		       "DMAR:[fault reason %02d] %s\n",
-		       (type ? "DMA Read" : "DMA Write"),
-		       (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+		pr_err_ratelimited("DMAR:[%s] Request device [%02x:%02x.%d] "
+				   "fault addr %llx \n"
+				   "DMAR:[fault reason %02d] %s\n",
+				   (type ? "DMA Read" : "DMA Write"),
+				   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+				   PCI_FUNC(source_id & 0xFF), addr,
+				   fault_reason, reason);
 	return 0;
 }
 
@@ -1606,7 +1607,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
 	raw_spin_lock_irqsave(&iommu->register_lock, flag);
 	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
 	if (fault_status)
-		pr_err("DRHD: handling fault status reg %x\n", fault_status);
+		pr_err_ratelimited("DRHD: handling fault status reg %x\n",
+				   fault_status);
 
 	/* TBD: ignore advanced fault log currently */
 	if (!(fault_status & DMA_FSTS_PPF))

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-15 16:35 ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2016-03-15 16:35 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA

Fault rates can easily overwhelm the console and make the system
unresponsive.  Ratelimit to allow an opportunity for maintenance.

Signed-off-by: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/iommu/dmar.c |   28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 8ffd756..628987e 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1579,19 +1579,20 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 	reason = dmar_get_fault_reason(fault_reason, &fault_type);
 
 	if (fault_type == INTR_REMAP)
-		pr_err("INTR-REMAP: Request device [[%02x:%02x.%d] "
-		       "fault index %llx\n"
-			"INTR-REMAP:[fault reason %02d] %s\n",
-			(source_id >> 8), PCI_SLOT(source_id & 0xFF),
-			PCI_FUNC(source_id & 0xFF), addr >> 48,
-			fault_reason, reason);
+		pr_err_ratelimited("INTR-REMAP: Request device [[%02x:%02x.%d] "
+				   "fault index %llx\n"
+				   "INTR-REMAP:[fault reason %02d] %s\n",
+				   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+				   PCI_FUNC(source_id & 0xFF), addr >> 48,
+				   fault_reason, reason);
 	else
-		pr_err("DMAR:[%s] Request device [%02x:%02x.%d] "
-		       "fault addr %llx \n"
-		       "DMAR:[fault reason %02d] %s\n",
-		       (type ? "DMA Read" : "DMA Write"),
-		       (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+		pr_err_ratelimited("DMAR:[%s] Request device [%02x:%02x.%d] "
+				   "fault addr %llx \n"
+				   "DMAR:[fault reason %02d] %s\n",
+				   (type ? "DMA Read" : "DMA Write"),
+				   (source_id >> 8), PCI_SLOT(source_id & 0xFF),
+				   PCI_FUNC(source_id & 0xFF), addr,
+				   fault_reason, reason);
 	return 0;
 }
 
@@ -1606,7 +1607,8 @@ irqreturn_t dmar_fault(int irq, void *dev_id)
 	raw_spin_lock_irqsave(&iommu->register_lock, flag);
 	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
 	if (fault_status)
-		pr_err("DRHD: handling fault status reg %x\n", fault_status);
+		pr_err_ratelimited("DRHD: handling fault status reg %x\n",
+				   fault_status);
 
 	/* TBD: ignore advanced fault log currently */
 	if (!(fault_status & DMA_FSTS_PPF))

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
  2016-03-15 16:35 ` Alex Williamson
  (?)
@ 2016-03-15 17:10 ` Joe Perches
  2016-03-15 18:01   ` Joe Perches
  -1 siblings, 1 reply; 8+ messages in thread
From: Joe Perches @ 2016-03-15 17:10 UTC (permalink / raw)
  To: Alex Williamson, iommu, dwmw2; +Cc: joro, linux-kernel

On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> Fault rates can easily overwhelm the console and make the system
> unresponsive.  Ratelimit to allow an opportunity for maintenance.

A few suggestions:

o Use a single ratelimit state.
o The multiple lines output are unnecessary and hard to grep
  in the dmesg output because of inconsistent prefixing as
  second and subsequent output lines are not prefixed by pr_fmt.
o The DMAR prefix on the second block is also unnecessary as
  it's already prefixed by pr_fmt
o Coalesce the formats for easier grep.

so maybe:
---
 drivers/iommu/dmar.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 8ffd756..59dcaaa 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1575,23 +1575,27 @@ static int dmar_fault_do_one(struct intel_iommu *iommu, int type,
 {
 	const char *reason;
 	int fault_type;
+	static DEFINE_RATELIMIT_STATE(rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	if (__ratelimit(&rs))
+		return 0;
 
 	reason = dmar_get_fault_reason(fault_reason, &fault_type);
 
 	if (fault_type == INTR_REMAP)
-		pr_err("INTR-REMAP: Request device [[%02x:%02x.%d] "
-		       "fault index %llx\n"
-			"INTR-REMAP:[fault reason %02d] %s\n",
-			(source_id >> 8), PCI_SLOT(source_id & 0xFF),
-			PCI_FUNC(source_id & 0xFF), addr >> 48,
-			fault_reason, reason);
+		pr_err("[INTR-REMAP] Request device [%02x:%02x.%d] fault index %llx [fault reason %02d] %s\n",
+		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
+		       PCI_FUNC(source_id & 0xFF), addr >> 48,
+		       fault_reason, reason);
 	else
-		pr_err("DMAR:[%s] Request device [%02x:%02x.%d] "
-		       "fault addr %llx \n"
-		       "DMAR:[fault reason %02d] %s\n",
-		       (type ? "DMA Read" : "DMA Write"),
-		       (source_id >> 8), PCI_SLOT(source_id & 0xFF),
-		       PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+		pr_err("[%s] Request device [%02x:%02x.%d] fault addr %llx [fault reason %02d] %s\n",
+		       type ? "DMA Read" : "DMA Write",
+		       source_id >> 8, PCI_SLOT(source_id & 0xFF),
+		       PCI_FUNC(source_id & 0xFF), addr,
+		       fault_reason, reason);
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
  2016-03-15 17:10 ` Joe Perches
@ 2016-03-15 18:01   ` Joe Perches
  0 siblings, 0 replies; 8+ messages in thread
From: Joe Perches @ 2016-03-15 18:01 UTC (permalink / raw)
  To: Alex Williamson, iommu, dwmw2; +Cc: joro, linux-kernel

On Tue, 2016-03-15 at 10:10 -0700, Joe Perches wrote:
> o Use a single ratelimit state.
[]
> diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
[]
> +	if (__ratelimit(&rs))
> +		return 0;

That of course should be:

	if (!__ratelimit(&rs))
		return 0;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-15 19:47   ` David Woodhouse
  0 siblings, 0 replies; 8+ messages in thread
From: David Woodhouse @ 2016-03-15 19:47 UTC (permalink / raw)
  To: Alex Williamson, iommu; +Cc: joro, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> Fault rates can easily overwhelm the console and make the system
> unresponsive.  Ratelimit to allow an opportunity for maintenance.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Rather than just rate-limiting the printk, I'd prefer to handle this
explicitly. There's a bit in the context-entry which can tell the IOMMU
not to bother raising an interrupt at all. And then we can re-enable it
if/when the driver recovers the device. (Or perhaps just when it next
does a mapping).

We really ought to be reporting faults to drivers too, FWIW. I keep
meaning to take a look at that.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-15 19:47   ` David Woodhouse
  0 siblings, 0 replies; 8+ messages in thread
From: David Woodhouse @ 2016-03-15 19:47 UTC (permalink / raw)
  To: Alex Williamson, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA


[-- Attachment #1.1: Type: text/plain, Size: 891 bytes --]

On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> Fault rates can easily overwhelm the console and make the system
> unresponsive.  Ratelimit to allow an opportunity for maintenance.
> 
> Signed-off-by: Alex Williamson <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Rather than just rate-limiting the printk, I'd prefer to handle this
explicitly. There's a bit in the context-entry which can tell the IOMMU
not to bother raising an interrupt at all. And then we can re-enable it
if/when the driver recovers the device. (Or perhaps just when it next
does a mapping).

We really ought to be reporting faults to drivers too, FWIW. I keep
meaning to take a look at that.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org                              Intel Corporation


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-17 16:53     ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2016-03-17 16:53 UTC (permalink / raw)
  To: David Woodhouse; +Cc: iommu, joro, linux-kernel

On Tue, 15 Mar 2016 19:47:56 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> > Fault rates can easily overwhelm the console and make the system
> > unresponsive.  Ratelimit to allow an opportunity for maintenance.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> Rather than just rate-limiting the printk, I'd prefer to handle this
> explicitly. There's a bit in the context-entry which can tell the IOMMU
> not to bother raising an interrupt at all. And then we can re-enable it
> if/when the driver recovers the device. (Or perhaps just when it next
> does a mapping).

Seems like we need to keep statistics per context entry for that, are
you prepared for that sort of overhead?  IME, a device that's spewing
faults at this rate is broken to the point where it needs to be removed
from the system or is actively being tested and debugged for driver or
assignment work.  In those case, I think we want to keep reminding the
user that something is very wrong and it probably explains why the
device isn't working properly.  If the device is using the DMA API,
maybe clearing FPD on each mapping event is a way to do that, but an
IOMMU API managed device might have very long lived mapping entries.
It seems impractical to setup statistics per context entry and timers
to check back on them for things that really ought to be rare events.
My goal was only to reduce the overall impact on the system so that
it's usable when this occurs.

> We really ought to be reporting faults to drivers too, FWIW. I keep
> meaning to take a look at that.

Yes, that path has been absent for far too long.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] iommu/vt-d: Ratelimit fault handler
@ 2016-03-17 16:53     ` Alex Williamson
  0 siblings, 0 replies; 8+ messages in thread
From: Alex Williamson @ 2016-03-17 16:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, 15 Mar 2016 19:47:56 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> > Fault rates can easily overwhelm the console and make the system
> > unresponsive.  Ratelimit to allow an opportunity for maintenance.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>  
> 
> Rather than just rate-limiting the printk, I'd prefer to handle this
> explicitly. There's a bit in the context-entry which can tell the IOMMU
> not to bother raising an interrupt at all. And then we can re-enable it
> if/when the driver recovers the device. (Or perhaps just when it next
> does a mapping).

Seems like we need to keep statistics per context entry for that, are
you prepared for that sort of overhead?  IME, a device that's spewing
faults at this rate is broken to the point where it needs to be removed
from the system or is actively being tested and debugged for driver or
assignment work.  In those case, I think we want to keep reminding the
user that something is very wrong and it probably explains why the
device isn't working properly.  If the device is using the DMA API,
maybe clearing FPD on each mapping event is a way to do that, but an
IOMMU API managed device might have very long lived mapping entries.
It seems impractical to setup statistics per context entry and timers
to check back on them for things that really ought to be rare events.
My goal was only to reduce the overall impact on the system so that
it's usable when this occurs.

> We really ought to be reporting faults to drivers too, FWIW. I keep
> meaning to take a look at that.

Yes, that path has been absent for far too long.  Thanks,

Alex

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-03-17 16:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-15 16:35 [PATCH] iommu/vt-d: Ratelimit fault handler Alex Williamson
2016-03-15 16:35 ` Alex Williamson
2016-03-15 17:10 ` Joe Perches
2016-03-15 18:01   ` Joe Perches
2016-03-15 19:47 ` David Woodhouse
2016-03-15 19:47   ` David Woodhouse
2016-03-17 16:53   ` Alex Williamson
2016-03-17 16:53     ` Alex Williamson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.