All of lore.kernel.org
 help / color / mirror / Atom feed
* igb_poll - device driver failed to check map error
@ 2013-03-12  9:31 Christoph Paasch
  2013-03-13 15:30 ` Wyborny, Carolyn
  2013-03-15  2:18 ` Alexander Duyck
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Paasch @ 2013-03-12  9:31 UTC (permalink / raw)
  To: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Alex Duyck, Eric Dumazet
  Cc: netdev

Hello,

I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:

[   36.402824] ------------[ cut here ]------------
[   36.458070] WARNING: at /home/cpaasch/builder/net-next/lib/dma-debug.c:934 
check_unmap+0x648/0x702()
[   36.567377] Hardware name: ProLiant DL165 G7
[   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check map 
error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as single]
[   36.776640] Modules linked in:
[   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
[   36.892515] Call Trace:
[   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
[   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
[   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
[   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
[   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
[   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
[   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
[   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
[   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
[   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
[   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
[   37.651086]  <EOI>  [<ffffffff8105d4be>] ? 
__tick_nohz_idle_enter+0x116/0x31f
[   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
[   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
[   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
[   37.930364] ---[ end trace 01b5bb0fd75a464c ]---


It happens shortly after mounting the NFS-root filesystem.

I tried to understand what is going on, but I am now at my wit's end.

By adding some print-statements, here is what I found out (not sure if this is 
anyhow helpful):

The difference between tx_buffer->time_stamp and the current 'jiffies' is up 
to 2000 jiffies (HZ==1000) at the first time the above warning happens (this 
seems too much for me). From then on, I see my print 3-4 times appear but 
without such a big difference between the timestamps (difference around 1 and 
2 jiffies).

Some other stuff, I printed:
tx_buffer->skb: ffff880235054c80
tx_buffer->bytecount: 154 
tx_buffer->gso_segs: 1
tx_buffer->protocol: 8 
tx_buffer->tx_flags 0x20


One last thing:
Am I right that after each call to dma_map_single/page a call to 
dma_mapping_error is needed? If that's the case, I have some patches that add 
this statement at missing places in the e1000, e1000e and ixgb driver. But 
these patches do not fix my above problem.


Thanks for your help,
Christoph

-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://multipath-tcp.org
UCLouvain
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: igb_poll - device driver failed to check map error
  2013-03-12  9:31 igb_poll - device driver failed to check map error Christoph Paasch
@ 2013-03-13 15:30 ` Wyborny, Carolyn
  2013-03-15  2:18 ` Alexander Duyck
  1 sibling, 0 replies; 15+ messages in thread
From: Wyborny, Carolyn @ 2013-03-13 15:30 UTC (permalink / raw)
  To: christoph.paasch, Kirsher, Jeffrey T, Brandeburg, Jesse, Allan,
	Bruce W, Duyck, Alexander H, Eric Dumazet
  Cc: netdev

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Christoph Paasch
> Sent: Tuesday, March 12, 2013 2:31 AM
> To: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Duyck, Alexander H;
> Eric Dumazet
> Cc: netdev@vger.kernel.org
> Subject: igb_poll - device driver failed to check map error
> 
> Hello,
> 
> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
> 
> [   36.402824] ------------[ cut here ]------------
> [   36.458070] WARNING: at /home/cpaasch/builder/net-next/lib/dma-
> debug.c:934
> check_unmap+0x648/0x702()
> [   36.567377] Hardware name: ProLiant DL165 G7
> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check map
> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
> single]
> [   36.776640] Modules linked in:
> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
> [   36.892515] Call Trace:
> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
> __tick_nohz_idle_enter+0x116/0x31f
> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
> 
> 
> It happens shortly after mounting the NFS-root filesystem.
> 
> I tried to understand what is going on, but I am now at my wit's end.
> 
> By adding some print-statements, here is what I found out (not sure if this is
> anyhow helpful):
> 
> The difference between tx_buffer->time_stamp and the current 'jiffies' is up to
> 2000 jiffies (HZ==1000) at the first time the above warning happens (this seems
> too much for me). From then on, I see my print 3-4 times appear but without
> such a big difference between the timestamps (difference around 1 and
> 2 jiffies).
> 
> Some other stuff, I printed:
> tx_buffer->skb: ffff880235054c80
> tx_buffer->bytecount: 154
> tx_buffer->gso_segs: 1
> tx_buffer->protocol: 8
> tx_buffer->tx_flags 0x20
> 
> 
> One last thing:
> Am I right that after each call to dma_map_single/page a call to
> dma_mapping_error is needed? If that's the case, I have some patches that add
> this statement at missing places in the e1000, e1000e and ixgb driver. But these
> patches do not fix my above problem.
> 
> 
> Thanks for your help,
> Christoph
> 
> --
> IP Networking Lab --- http://inl.info.ucl.ac.be MultiPath TCP in the Linux Kernel -
> -- http://multipath-tcp.org UCLouvain
> --
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a
> message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

Thanks Christoph,

We'll take a look at this also.

Carolyn


Carolyn Wyborny 
Linux Development 
Networking Division 
Intel Corporation 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-12  9:31 igb_poll - device driver failed to check map error Christoph Paasch
  2013-03-13 15:30 ` Wyborny, Carolyn
@ 2013-03-15  2:18 ` Alexander Duyck
  2013-03-15  7:52   ` Christoph Paasch
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2013-03-15  2:18 UTC (permalink / raw)
  To: christoph.paasch
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Alex Duyck,
	Eric Dumazet, netdev

On 03/12/2013 02:31 AM, Christoph Paasch wrote:
> Hello,
> 
> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
> 
> [   36.402824] ------------[ cut here ]------------
> [   36.458070] WARNING: at /home/cpaasch/builder/net-next/lib/dma-debug.c:934 
> check_unmap+0x648/0x702()
> [   36.567377] Hardware name: ProLiant DL165 G7
> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check map 
> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as single]
> [   36.776640] Modules linked in:
> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
> [   36.892515] Call Trace:
> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ? 
> __tick_nohz_idle_enter+0x116/0x31f
> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
> 
> 
> It happens shortly after mounting the NFS-root filesystem.
> 
> I tried to understand what is going on, but I am now at my wit's end.
> 
> By adding some print-statements, here is what I found out (not sure if this is 
> anyhow helpful):
> 
> The difference between tx_buffer->time_stamp and the current 'jiffies' is up 
> to 2000 jiffies (HZ==1000) at the first time the above warning happens (this 
> seems too much for me). From then on, I see my print 3-4 times appear but 
> without such a big difference between the timestamps (difference around 1 and 
> 2 jiffies).
> 
> Some other stuff, I printed:
> tx_buffer->skb: ffff880235054c80
> tx_buffer->bytecount: 154 
> tx_buffer->gso_segs: 1
> tx_buffer->protocol: 8 
> tx_buffer->tx_flags 0x20
> 
> 
> One last thing:
> Am I right that after each call to dma_map_single/page a call to 
> dma_mapping_error is needed? If that's the case, I have some patches that add 
> this statement at missing places in the e1000, e1000e and ixgb driver. But 
> these patches do not fix my above problem.
> 
> 
> Thanks for your help,
> Christoph
> 

Christoph,

One thing that might be useful would be to reproduce this with a
standard 3.9-rc kernel instead of one using the multipath TCP patches.
This will help us to verify that the issue is reproducible with a stock
kernel and is not related to any ongoing work you may have only in your
tree.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-15  2:18 ` Alexander Duyck
@ 2013-03-15  7:52   ` Christoph Paasch
  2013-03-15 16:03     ` Allan, Bruce W
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Christoph Paasch @ 2013-03-15  7:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jeff Kirsher, Jesse Brandeburg, Bruce Allan, Alex Duyck,
	Eric Dumazet, netdev

On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
> On 03/12/2013 02:31 AM, Christoph Paasch wrote:
> > Hello,
> > 
> > I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
> > 
> > [   36.402824] ------------[ cut here ]------------
> > [   36.458070] WARNING: at
> > /home/cpaasch/builder/net-next/lib/dma-debug.c:934
> > check_unmap+0x648/0x702()
> > [   36.567377] Hardware name: ProLiant DL165 G7
> > [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
> > map
> > error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
> > single] [   36.776640] Modules linked in:
> > [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
> > [   36.892515] Call Trace:
> > [   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
> > [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
> > [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
> > [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
> > [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
> > [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
> > [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
> > [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
> > [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
> > [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
> > [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
> > [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
> > __tick_nohz_idle_enter+0x116/0x31f
> > [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
> > [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
> > [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
> > [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
> > 
> > 
> > It happens shortly after mounting the NFS-root filesystem.
> > 
> > I tried to understand what is going on, but I am now at my wit's end.
> > 
> > By adding some print-statements, here is what I found out (not sure if
> > this is anyhow helpful):
> > 
> > The difference between tx_buffer->time_stamp and the current 'jiffies' is
> > up to 2000 jiffies (HZ==1000) at the first time the above warning happens
> > (this seems too much for me). From then on, I see my print 3-4 times
> > appear but without such a big difference between the timestamps
> > (difference around 1 and 2 jiffies).
> > 
> > Some other stuff, I printed:
> > tx_buffer->skb: ffff880235054c80
> > tx_buffer->bytecount: 154
> > tx_buffer->gso_segs: 1
> > tx_buffer->protocol: 8
> > tx_buffer->tx_flags 0x20
> > 
> > 
> > One last thing:
> > Am I right that after each call to dma_map_single/page a call to
> > dma_mapping_error is needed? If that's the case, I have some patches that
> > add this statement at missing places in the e1000, e1000e and ixgb
> > driver. But these patches do not fix my above problem.
> > 
> > 
> > Thanks for your help,
> > Christoph
> 
> Christoph,
> 
> One thing that might be useful would be to reproduce this with a
> standard 3.9-rc kernel instead of one using the multipath TCP patches.
> This will help us to verify that the issue is reproducible with a stock
> kernel and is not related to any ongoing work you may have only in your
> tree.

Hello,

this is on a clean net-next kernel without any MPTCP-code.

I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of 
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply 
introduces the debug_dma_mapping_error-checks.

Am I right with the missing calls to dma_mapping_error in e1000, e1000e and 
ixgb?

Cheers,
Christoph



-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://multipath-tcp.org
UCLouvain
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: igb_poll - device driver failed to check map error
  2013-03-15  7:52   ` Christoph Paasch
@ 2013-03-15 16:03     ` Allan, Bruce W
  2013-03-16  9:27       ` Christoph Paasch
  2013-03-15 16:07     ` Alexander Duyck
  2013-03-15 23:08     ` Alexander Duyck
  2 siblings, 1 reply; 15+ messages in thread
From: Allan, Bruce W @ 2013-03-15 16:03 UTC (permalink / raw)
  To: christoph.paasch, Alexander Duyck
  Cc: Kirsher, Jeffrey T, Brandeburg, Jesse, Duyck, Alexander H,
	Eric Dumazet, netdev

> -----Original Message-----
> From: Christoph Paasch [mailto:christoph.paasch@gmail.com] On Behalf Of
> Christoph Paasch
> Sent: Friday, March 15, 2013 12:52 AM
> To: Alexander Duyck
> Cc: Kirsher, Jeffrey T; Brandeburg, Jesse; Allan, Bruce W; Duyck, Alexander
> H; Eric Dumazet; netdev@vger.kernel.org
> Subject: Re: igb_poll - device driver failed to check map error
> 
> On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
> > On 03/12/2013 02:31 AM, Christoph Paasch wrote:
> > > Hello,
> > >
> > > I'm seeing a warning while booting my machine when DMA_API_DEBUG
> is set:
> > >
> > > [   36.402824] ------------[ cut here ]------------
> > > [   36.458070] WARNING: at
> > > /home/cpaasch/builder/net-next/lib/dma-debug.c:934
> > > check_unmap+0x648/0x702()
> > > [   36.567377] Hardware name: ProLiant DL165 G7
> > > [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
> > > map
> > > error[device address=0x0000000233d9b232] [size=154 bytes] [mapped
> as
> > > single] [   36.776640] Modules linked in:
> > > [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+
> #101
> > > [   36.892515] Call Trace:
> > > [   36.921745]  <IRQ>  [<ffffffff8102ad7f>]
> warn_slowpath_common+0x80/0x9a
> > > [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
> > > [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
> > > [   37.134363]  [<ffffffff811db3e9>]
> debug_dma_unmap_page+0x50/0x52
> > > [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
> > > [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
> > > [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
> > > [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
> > > [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
> > > [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
> > > [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
> > > [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
> > > __tick_nohz_idle_enter+0x116/0x31f
> > > [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
> > > [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
> > > [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
> > > [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
> > >
> > >
> > > It happens shortly after mounting the NFS-root filesystem.
> > >
> > > I tried to understand what is going on, but I am now at my wit's end.
> > >
> > > By adding some print-statements, here is what I found out (not sure if
> > > this is anyhow helpful):
> > >
> > > The difference between tx_buffer->time_stamp and the current 'jiffies'
> is
> > > up to 2000 jiffies (HZ==1000) at the first time the above warning
> happens
> > > (this seems too much for me). From then on, I see my print 3-4 times
> > > appear but without such a big difference between the timestamps
> > > (difference around 1 and 2 jiffies).
> > >
> > > Some other stuff, I printed:
> > > tx_buffer->skb: ffff880235054c80
> > > tx_buffer->bytecount: 154
> > > tx_buffer->gso_segs: 1
> > > tx_buffer->protocol: 8
> > > tx_buffer->tx_flags 0x20
> > >
> > >
> > > One last thing:
> > > Am I right that after each call to dma_map_single/page a call to
> > > dma_mapping_error is needed? If that's the case, I have some patches
> that
> > > add this statement at missing places in the e1000, e1000e and ixgb
> > > driver. But these patches do not fix my above problem.
> > >
> > >
> > > Thanks for your help,
> > > Christoph
> >
> > Christoph,
> >
> > One thing that might be useful would be to reproduce this with a
> > standard 3.9-rc kernel instead of one using the multipath TCP patches.
> > This will help us to verify that the issue is reproducible with a stock
> > kernel and is not related to any ongoing work you may have only in your
> > tree.
> 
> Hello,
> 
> this is on a clean net-next kernel without any MPTCP-code.
> 
> I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of
> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply
> introduces the debug_dma_mapping_error-checks.
> 
> Am I right with the missing calls to dma_mapping_error in e1000, e1000e
> and
> ixgb?
> 
> Cheers,
> Christoph
> 
> 
> 
> --
> IP Networking Lab --- http://inl.info.ucl.ac.be
> MultiPath TCP in the Linux Kernel --- http://multipath-tcp.org
> UCLouvain
> --

Hi Christoph,

You are correct re. the missing calls to dma_mapping_error and I have that on my
to-do list for e1000e, but if you have patches already feel free to send them along
(please cc the Intel wired ethernet list e1000-devel@lists.sourceforge.net).

Thanks,
Bruce.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-15  7:52   ` Christoph Paasch
  2013-03-15 16:03     ` Allan, Bruce W
@ 2013-03-15 16:07     ` Alexander Duyck
  2013-03-15 23:08     ` Alexander Duyck
  2 siblings, 0 replies; 15+ messages in thread
From: Alexander Duyck @ 2013-03-15 16:07 UTC (permalink / raw)
  To: christoph.paasch
  Cc: Alexander Duyck, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Eric Dumazet, netdev

On 03/15/2013 12:52 AM, Christoph Paasch wrote:
> On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
>> On 03/12/2013 02:31 AM, Christoph Paasch wrote:
>>> Hello,
>>>
>>> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
>>>
>>> [   36.402824] ------------[ cut here ]------------
>>> [   36.458070] WARNING: at
>>> /home/cpaasch/builder/net-next/lib/dma-debug.c:934
>>> check_unmap+0x648/0x702()
>>> [   36.567377] Hardware name: ProLiant DL165 G7
>>> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
>>> map
>>> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
>>> single] [   36.776640] Modules linked in:
>>> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
>>> [   36.892515] Call Trace:
>>> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
>>> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
>>> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
>>> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
>>> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
>>> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
>>> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
>>> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
>>> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
>>> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
>>> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
>>> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
>>> __tick_nohz_idle_enter+0x116/0x31f
>>> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
>>> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
>>> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
>>> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
>>>
>>>
>>> It happens shortly after mounting the NFS-root filesystem.
>>>
>>> I tried to understand what is going on, but I am now at my wit's end.
>>>
>>> By adding some print-statements, here is what I found out (not sure if
>>> this is anyhow helpful):
>>>
>>> The difference between tx_buffer->time_stamp and the current 'jiffies' is
>>> up to 2000 jiffies (HZ==1000) at the first time the above warning happens
>>> (this seems too much for me). From then on, I see my print 3-4 times
>>> appear but without such a big difference between the timestamps
>>> (difference around 1 and 2 jiffies).
>>>
>>> Some other stuff, I printed:
>>> tx_buffer->skb: ffff880235054c80
>>> tx_buffer->bytecount: 154
>>> tx_buffer->gso_segs: 1
>>> tx_buffer->protocol: 8
>>> tx_buffer->tx_flags 0x20
>>>
>>>
>>> One last thing:
>>> Am I right that after each call to dma_map_single/page a call to
>>> dma_mapping_error is needed? If that's the case, I have some patches that
>>> add this statement at missing places in the e1000, e1000e and ixgb
>>> driver. But these patches do not fix my above problem.
>>>
>>>
>>> Thanks for your help,
>>> Christoph
>> Christoph,
>>
>> One thing that might be useful would be to reproduce this with a
>> standard 3.9-rc kernel instead of one using the multipath TCP patches.
>> This will help us to verify that the issue is reproducible with a stock
>> kernel and is not related to any ongoing work you may have only in your
>> tree.
> Hello,
>
> this is on a clean net-next kernel without any MPTCP-code.
>
> I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply 
> introduces the debug_dma_mapping_error-checks.
>
> Am I right with the missing calls to dma_mapping_error in e1000, e1000e and 
> ixgb?
>
> Cheers,
> Christoph
>

Christoph,

I wasn't able to see any missing calls in e1000, however I did see one
in e1000e, and possibly another in the ixgb paths.  If you already have
patches please submit them.  It is much easier to verify the issues you
found if we know what exactly it is you found.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-15  7:52   ` Christoph Paasch
  2013-03-15 16:03     ` Allan, Bruce W
  2013-03-15 16:07     ` Alexander Duyck
@ 2013-03-15 23:08     ` Alexander Duyck
  2013-03-16 11:07       ` Christoph Paasch
  2 siblings, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2013-03-15 23:08 UTC (permalink / raw)
  To: christoph.paasch
  Cc: Alexander Duyck, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Eric Dumazet, netdev

On 03/15/2013 12:52 AM, Christoph Paasch wrote:
> On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
>> On 03/12/2013 02:31 AM, Christoph Paasch wrote:
>>> Hello,
>>>
>>> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
>>>
>>> [   36.402824] ------------[ cut here ]------------
>>> [   36.458070] WARNING: at
>>> /home/cpaasch/builder/net-next/lib/dma-debug.c:934
>>> check_unmap+0x648/0x702()
>>> [   36.567377] Hardware name: ProLiant DL165 G7
>>> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
>>> map
>>> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
>>> single] [   36.776640] Modules linked in:
>>> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
>>> [   36.892515] Call Trace:
>>> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>] warn_slowpath_common+0x80/0x9a
>>> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
>>> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
>>> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
>>> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
>>> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
>>> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
>>> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
>>> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
>>> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
>>> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
>>> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
>>> __tick_nohz_idle_enter+0x116/0x31f
>>> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
>>> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
>>> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
>>> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
>>>
>>>
>>> It happens shortly after mounting the NFS-root filesystem.
>>>
>>> I tried to understand what is going on, but I am now at my wit's end.
>>>
>>> By adding some print-statements, here is what I found out (not sure if
>>> this is anyhow helpful):
>>>
>>> The difference between tx_buffer->time_stamp and the current 'jiffies' is
>>> up to 2000 jiffies (HZ==1000) at the first time the above warning happens
>>> (this seems too much for me). From then on, I see my print 3-4 times
>>> appear but without such a big difference between the timestamps
>>> (difference around 1 and 2 jiffies).
>>>
>>> Some other stuff, I printed:
>>> tx_buffer->skb: ffff880235054c80
>>> tx_buffer->bytecount: 154
>>> tx_buffer->gso_segs: 1
>>> tx_buffer->protocol: 8
>>> tx_buffer->tx_flags 0x20
>>>
>>>
>>> One last thing:
>>> Am I right that after each call to dma_map_single/page a call to
>>> dma_mapping_error is needed? If that's the case, I have some patches that
>>> add this statement at missing places in the e1000, e1000e and ixgb
>>> driver. But these patches do not fix my above problem.
>>>
>>>
>>> Thanks for your help,
>>> Christoph
>>
>> Christoph,
>>
>> One thing that might be useful would be to reproduce this with a
>> standard 3.9-rc kernel instead of one using the multipath TCP patches.
>> This will help us to verify that the issue is reproducible with a stock
>> kernel and is not related to any ongoing work you may have only in your
>> tree.
> 
> Hello,
> 
> this is on a clean net-next kernel without any MPTCP-code.
> 
> I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply 
> introduces the debug_dma_mapping_error-checks.
> 
> Am I right with the missing calls to dma_mapping_error in e1000, e1000e and 
> ixgb?
> 
> Cheers,
> Christoph

Christoph,

The cause of this issues you are seeing may be due to the fact that the
buffer triggering the error is being reused.  I was able to reproduce
this issue occasionally with pktgen if I cloned the skb.  What may be
happening is that the buffer is being mapped in the transmit path on one
CPU while on another CPU the buffer is being cleaned.  Since the output
of each mapping is the physical address there is nothing to make each
mapping unique and I suspect this is resulting in false hits.

You should be able to verify this if you were to check the skb->users
count as well as the dataref value in the skb_shared_info.  I suspect
either the users count of the dataref will be greater than 1.

You might also try testing the patch below to see if it has any effect.
 All it does is reorder the free and the unmap so that the buffer is not
freed for reuse until after we have checked it in the unmap path.

Thanks,

Alex

---
 drivers/net/ethernet/intel/igb/igb_main.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
b/drivers/net/ethernet/intel/igb/igb_main.c
index 4dbd629..0f9c324 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5959,15 +5959,15 @@ static bool igb_clean_tx_irq(struct igb_q_vector
*q_vector)
 		total_bytes += tx_buffer->bytecount;
 		total_packets += tx_buffer->gso_segs;

-		/* free the skb */
-		dev_kfree_skb_any(tx_buffer->skb);
-
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
 				 dma_unmap_addr(tx_buffer, dma),
 				 dma_unmap_len(tx_buffer, len),
 				 DMA_TO_DEVICE);

+		/* free the skb */
+		dev_kfree_skb_any(tx_buffer->skb);
+
 		/* clear tx_buffer data */
 		tx_buffer->skb = NULL;
 		dma_unmap_len_set(tx_buffer, len, 0);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-15 16:03     ` Allan, Bruce W
@ 2013-03-16  9:27       ` Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2013-03-16  9:27 UTC (permalink / raw)
  To: Allan, Bruce W
  Cc: Alexander Duyck, Kirsher, Jeffrey T, Brandeburg, Jesse, Duyck,
	Alexander H, Eric Dumazet, netdev

On Friday 15 March 2013 16:03:47 Allan, Bruce W wrote:
> You are correct re. the missing calls to dma_mapping_error and I have that
> on my to-do list for e1000e, but if you have patches already feel free to
> send them along (please cc the Intel wired ethernet list
> e1000-devel@lists.sourceforge.net).

I will send them soon, after some basic tests.

Cheers,
Christoph

-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://multipath-tcp.org
UCLouvain
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-15 23:08     ` Alexander Duyck
@ 2013-03-16 11:07       ` Christoph Paasch
  2013-03-18 17:29         ` Alexander Duyck
  2013-03-18 22:12         ` [PATCH 0/2] Address issues in dma-debug API Alexander Duyck
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Paasch @ 2013-03-16 11:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Eric Dumazet, netdev

On Friday 15 March 2013 16:08:31 Alexander Duyck wrote:
> On 03/15/2013 12:52 AM, Christoph Paasch wrote:
> > On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
> >> On 03/12/2013 02:31 AM, Christoph Paasch wrote:
> >>> Hello,
> >>> 
> >>> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
> >>> 
> >>> [   36.402824] ------------[ cut here ]------------
> >>> [   36.458070] WARNING: at
> >>> /home/cpaasch/builder/net-next/lib/dma-debug.c:934
> >>> check_unmap+0x648/0x702()
> >>> [   36.567377] Hardware name: ProLiant DL165 G7
> >>> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
> >>> map
> >>> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
> >>> single] [   36.776640] Modules linked in:
> >>> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
> >>> [   36.892515] Call Trace:
> >>> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>]
> >>> warn_slowpath_common+0x80/0x9a
> >>> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
> >>> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
> >>> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
> >>> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
> >>> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
> >>> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
> >>> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
> >>> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
> >>> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
> >>> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
> >>> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
> >>> __tick_nohz_idle_enter+0x116/0x31f
> >>> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
> >>> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
> >>> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
> >>> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
> >>> 
> >>> 
> >>> It happens shortly after mounting the NFS-root filesystem.
> >>> 
> >>> I tried to understand what is going on, but I am now at my wit's end.
> >>> 
> >>> By adding some print-statements, here is what I found out (not sure if
> >>> this is anyhow helpful):
> >>> 
> >>> The difference between tx_buffer->time_stamp and the current 'jiffies'
> >>> is
> >>> up to 2000 jiffies (HZ==1000) at the first time the above warning
> >>> happens
> >>> (this seems too much for me). From then on, I see my print 3-4 times
> >>> appear but without such a big difference between the timestamps
> >>> (difference around 1 and 2 jiffies).
> >>> 
> >>> Some other stuff, I printed:
> >>> tx_buffer->skb: ffff880235054c80
> >>> tx_buffer->bytecount: 154
> >>> tx_buffer->gso_segs: 1
> >>> tx_buffer->protocol: 8
> >>> tx_buffer->tx_flags 0x20
> >>> 
> >>> 
> >>> One last thing:
> >>> Am I right that after each call to dma_map_single/page a call to
> >>> dma_mapping_error is needed? If that's the case, I have some patches
> >>> that
> >>> add this statement at missing places in the e1000, e1000e and ixgb
> >>> driver. But these patches do not fix my above problem.
> >>> 
> >>> 
> >>> Thanks for your help,
> >>> Christoph
> >> 
> >> Christoph,
> >> 
> >> One thing that might be useful would be to reproduce this with a
> >> standard 3.9-rc kernel instead of one using the multipath TCP patches.
> >> This will help us to verify that the issue is reproducible with a stock
> >> kernel and is not related to any ongoing work you may have only in your
> >> tree.
> > 
> > Hello,
> > 
> > this is on a clean net-next kernel without any MPTCP-code.
> > 
> > I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply
> > introduces the debug_dma_mapping_error-checks.
> > 
> > Am I right with the missing calls to dma_mapping_error in e1000, e1000e
> > and
> > ixgb?
> > 
> > Cheers,
> > Christoph
> 
> Christoph,
> 
> The cause of this issues you are seeing may be due to the fact that the
> buffer triggering the error is being reused.  I was able to reproduce
> this issue occasionally with pktgen if I cloned the skb.  What may be
> happening is that the buffer is being mapped in the transmit path on one
> CPU while on another CPU the buffer is being cleaned.  Since the output
> of each mapping is the physical address there is nothing to make each
> mapping unique and I suspect this is resulting in false hits.
> 
> You should be able to verify this if you were to check the skb->users
> count as well as the dataref value in the skb_shared_info.  I suspect
> either the users count of the dataref will be greater than 1.

Both, users and dataref, are equal to 1. Before the call to dev_kfree_skb_any 
and after dma_unmap_single fails.

> You might also try testing the patch below to see if it has any effect.
>  All it does is reorder the free and the unmap so that the buffer is not
> freed for reuse until after we have checked it in the unmap path.

I tested your patch, and it fixes my issue. Feel free to add a "Tested-by" to 
the official patch.


Cheers,
Christoph

> ---
>  drivers/net/ethernet/intel/igb/igb_main.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index 4dbd629..0f9c324 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -5959,15 +5959,15 @@ static bool igb_clean_tx_irq(struct igb_q_vector
> *q_vector)
>  		total_bytes += tx_buffer->bytecount;
>  		total_packets += tx_buffer->gso_segs;
> 
> -		/* free the skb */
> -		dev_kfree_skb_any(tx_buffer->skb);
> -
>  		/* unmap skb header data */
>  		dma_unmap_single(tx_ring->dev,
>  				 dma_unmap_addr(tx_buffer, dma),
>  				 dma_unmap_len(tx_buffer, len),
>  				 DMA_TO_DEVICE);
> 
> +		/* free the skb */
> +		dev_kfree_skb_any(tx_buffer->skb);
> +
>  		/* clear tx_buffer data */
>  		tx_buffer->skb = NULL;
>  		dma_unmap_len_set(tx_buffer, len, 0);
-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://multipath-tcp.org
UCLouvain
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: igb_poll - device driver failed to check map error
  2013-03-16 11:07       ` Christoph Paasch
@ 2013-03-18 17:29         ` Alexander Duyck
  2013-03-18 22:12         ` [PATCH 0/2] Address issues in dma-debug API Alexander Duyck
  1 sibling, 0 replies; 15+ messages in thread
From: Alexander Duyck @ 2013-03-18 17:29 UTC (permalink / raw)
  To: christoph.paasch
  Cc: Alexander Duyck, Jeff Kirsher, Jesse Brandeburg, Bruce Allan,
	Eric Dumazet, netdev

On 03/16/2013 04:07 AM, Christoph Paasch wrote:
> On Friday 15 March 2013 16:08:31 Alexander Duyck wrote:
>> On 03/15/2013 12:52 AM, Christoph Paasch wrote:
>>> On Thursday 14 March 2013 19:18:18 Alexander Duyck wrote:
>>>> On 03/12/2013 02:31 AM, Christoph Paasch wrote:
>>>>> Hello,
>>>>>
>>>>> I'm seeing a warning while booting my machine when DMA_API_DEBUG is set:
>>>>>
>>>>> [   36.402824] ------------[ cut here ]------------
>>>>> [   36.458070] WARNING: at
>>>>> /home/cpaasch/builder/net-next/lib/dma-debug.c:934
>>>>> check_unmap+0x648/0x702()
>>>>> [   36.567377] Hardware name: ProLiant DL165 G7
>>>>> [   36.618452] igb 0000:04:00.0: DMA-API: device driver failed to check
>>>>> map
>>>>> error[device address=0x0000000233d9b232] [size=154 bytes] [mapped as
>>>>> single] [   36.776640] Modules linked in:
>>>>> [   36.815446] Pid: 0, comm: swapper/7 Not tainted 3.9.0-rc1-mptcp+ #101
>>>>> [   36.892515] Call Trace:
>>>>> [   36.921745]  <IRQ>  [<ffffffff8102ad7f>]
>>>>> warn_slowpath_common+0x80/0x9a
>>>>> [   37.001023]  [<ffffffff8102ae2d>] warn_slowpath_fmt+0x41/0x43
>>>>> [   37.069771]  [<ffffffff811db17f>] check_unmap+0x648/0x702
>>>>> [   37.134363]  [<ffffffff811db3e9>] debug_dma_unmap_page+0x50/0x52
>>>>> [   37.206234]  [<ffffffff8136676a>] igb_poll+0x144/0xf7c
>>>>> [   37.267706]  [<ffffffff8104dd19>] ? sched_clock_cpu+0x46/0xd1
>>>>> [   37.336456]  [<ffffffff814458ce>] net_rx_action+0xa7/0x1d0
>>>>> [   37.402085]  [<ffffffff81030b65>] __do_softirq+0xb4/0x16f
>>>>> [   37.466673]  [<ffffffff81030c90>] irq_exit+0x40/0x87
>>>>> [   37.526067]  [<ffffffff81002db1>] do_IRQ+0x98/0xaf
>>>>> [   37.583378]  [<ffffffff815210aa>] common_interrupt+0x6a/0x6a
>>>>> [   37.651086]  <EOI>  [<ffffffff8105d4be>] ?
>>>>> __tick_nohz_idle_enter+0x116/0x31f
>>>>> [   37.736595]  [<ffffffff81008a04>] ? default_idle+0x24/0x39
>>>>> [   37.802224]  [<ffffffff81008c62>] cpu_idle+0x68/0xa4
>>>>> [   37.861616]  [<ffffffff81519f78>] start_secondary+0x1a9/0x1ad
>>>>> [   37.930364] ---[ end trace 01b5bb0fd75a464c ]---
>>>>>
>>>>>
>>>>> It happens shortly after mounting the NFS-root filesystem.
>>>>>
>>>>> I tried to understand what is going on, but I am now at my wit's end.
>>>>>
>>>>> By adding some print-statements, here is what I found out (not sure if
>>>>> this is anyhow helpful):
>>>>>
>>>>> The difference between tx_buffer->time_stamp and the current 'jiffies'
>>>>> is
>>>>> up to 2000 jiffies (HZ==1000) at the first time the above warning
>>>>> happens
>>>>> (this seems too much for me). From then on, I see my print 3-4 times
>>>>> appear but without such a big difference between the timestamps
>>>>> (difference around 1 and 2 jiffies).
>>>>>
>>>>> Some other stuff, I printed:
>>>>> tx_buffer->skb: ffff880235054c80
>>>>> tx_buffer->bytecount: 154
>>>>> tx_buffer->gso_segs: 1
>>>>> tx_buffer->protocol: 8
>>>>> tx_buffer->tx_flags 0x20
>>>>>
>>>>>
>>>>> One last thing:
>>>>> Am I right that after each call to dma_map_single/page a call to
>>>>> dma_mapping_error is needed? If that's the case, I have some patches
>>>>> that
>>>>> add this statement at missing places in the e1000, e1000e and ixgb
>>>>> driver. But these patches do not fix my above problem.
>>>>>
>>>>>
>>>>> Thanks for your help,
>>>>> Christoph
>>>> Christoph,
>>>>
>>>> One thing that might be useful would be to reproduce this with a
>>>> standard 3.9-rc kernel instead of one using the multipath TCP patches.
>>>> This will help us to verify that the issue is reproducible with a stock
>>>> kernel and is not related to any ongoing work you may have only in your
>>>> tree.
>>> Hello,
>>>
>>> this is on a clean net-next kernel without any MPTCP-code.
>>>
>>> I bisected it down to  787314c35fbb (Merge tag 'iommu-updates-v3.8' of
>>> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu), which simply
>>> introduces the debug_dma_mapping_error-checks.
>>>
>>> Am I right with the missing calls to dma_mapping_error in e1000, e1000e
>>> and
>>> ixgb?
>>>
>>> Cheers,
>>> Christoph
>> Christoph,
>>
>> The cause of this issues you are seeing may be due to the fact that the
>> buffer triggering the error is being reused.  I was able to reproduce
>> this issue occasionally with pktgen if I cloned the skb.  What may be
>> happening is that the buffer is being mapped in the transmit path on one
>> CPU while on another CPU the buffer is being cleaned.  Since the output
>> of each mapping is the physical address there is nothing to make each
>> mapping unique and I suspect this is resulting in false hits.
>>
>> You should be able to verify this if you were to check the skb->users
>> count as well as the dataref value in the skb_shared_info.  I suspect
>> either the users count of the dataref will be greater than 1.
> Both, users and dataref, are equal to 1. Before the call to dev_kfree_skb_any 
> and after dma_unmap_single fails.
>
>> You might also try testing the patch below to see if it has any effect.
>>  All it does is reorder the free and the unmap so that the buffer is not
>> freed for reuse until after we have checked it in the unmap path.
> I tested your patch, and it fixes my issue. Feel free to add a "Tested-by" to 
> the official patch.
>
>
> Cheers,
> Christoph

I'm not going to submit that as a fix since it doesn't resolve the
underlying issue and I still see problems when I do the pktgen test.

I'll try to submit a patch for the DMA debug API later today to resolve
the issue.  Basically what needs to happen is that we need to step
through and make it so that we tag each instance of the mapping
correctly instead of only tagging the first instance in the bucket.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 0/2] Address issues in dma-debug API
  2013-03-16 11:07       ` Christoph Paasch
  2013-03-18 17:29         ` Alexander Duyck
@ 2013-03-18 22:12         ` Alexander Duyck
  2013-03-18 22:12           ` [PATCH 1/2] dma-debug: Fix locking bug in check_unmap Alexander Duyck
  2013-03-18 22:12           ` [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer Alexander Duyck
  1 sibling, 2 replies; 15+ messages in thread
From: Alexander Duyck @ 2013-03-18 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: konrad.wilk, joerg.roedel, konrad, christoph.paasch, mingo,
	shuahkhan, hpa, akpm, shuah.khan, netdev, jeffrey.t.kirsher

Christoph Paasch recently reported a "device driver failed to check map 
error" on igb.  However after reviewing the code there was no possibility of
that.  On closer inspection there was a bug in the DMA debug API that was
causing the issue.  These two patches address the issues I found.

The first issue I found while trying to implement a workaround.  Specifically
the problem is a locking bug which is triggered if a multiple mapped buffer
exists and there is not an exact match for the unmap.  This results in the CPU
becoming deadlocked.

The second issue, which was the original problem, is resolved by guaranteeing
that if we call dma_mapping_error we set a matching entry to MAP_ERR_CHECKED
that was not previously set. 

I'm not sure if these are critical enough to go into one of the upcoming RC
kernels or if these can wait until the merge since this is in a debugging API.
I'm leaving that for the sub-maintainers to decide.

---

Alexander Duyck (2):
      dma-debug: Fix locking bug in check_unmap
      dma-debug: Update DMA debug API to better handle multiple mappings of a buffer


 lib/dma-debug.c |   42 ++++++++++++++++++++++++++++--------------
 1 files changed, 28 insertions(+), 14 deletions(-)

-- 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] dma-debug: Fix locking bug in check_unmap
  2013-03-18 22:12         ` [PATCH 0/2] Address issues in dma-debug API Alexander Duyck
@ 2013-03-18 22:12           ` Alexander Duyck
  2013-03-19 20:29             ` Shuah Khan
  2013-03-18 22:12           ` [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer Alexander Duyck
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2013-03-18 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: konrad.wilk, joerg.roedel, konrad, christoph.paasch, mingo,
	shuahkhan, hpa, akpm, shuah.khan, netdev, jeffrey.t.kirsher

In check_unmap it is possible to get into a dead-locked state if
dma_mapping_error is called.  The problem is that the bucket is locked in
check_unmap, and locked again by debug_dma_mapping_error which is called by
dma_mapping_error.  To resolve that we must release the lock on the bucket
before making the call to dma_mapping_error.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 lib/dma-debug.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 5e396ac..724bd4d 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -862,17 +862,18 @@ static void check_unmap(struct dma_debug_entry *ref)
 	entry = bucket_find_exact(bucket, ref);
 
 	if (!entry) {
+		/* must drop lock before calling dma_mapping_error */
+		put_hash_bucket(bucket, &flags);
+
 		if (dma_mapping_error(ref->dev, ref->dev_addr)) {
 			err_printk(ref->dev, NULL,
-				   "DMA-API: device driver tries "
-				   "to free an invalid DMA memory address\n");
-			return;
+				   "DMA-API: device driver tries to free an invalid DMA memory address\n");
+		} else {
+			err_printk(ref->dev, NULL,
+				   "DMA-API: device driver tries to free DMA memory it has not allocated [device address=0x%016llx] [size=%llu bytes]\n",
+				   ref->dev_addr, ref->size);
 		}
-		err_printk(ref->dev, NULL, "DMA-API: device driver tries "
-			   "to free DMA memory it has not allocated "
-			   "[device address=0x%016llx] [size=%llu bytes]\n",
-			   ref->dev_addr, ref->size);
-		goto out;
+		return;
 	}
 
 	if (ref->size != entry->size) {
@@ -936,7 +937,6 @@ static void check_unmap(struct dma_debug_entry *ref)
 	hash_bucket_del(entry);
 	dma_entry_free(entry);
 
-out:
 	put_hash_bucket(bucket, &flags);
 }
 


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer
  2013-03-18 22:12         ` [PATCH 0/2] Address issues in dma-debug API Alexander Duyck
  2013-03-18 22:12           ` [PATCH 1/2] dma-debug: Fix locking bug in check_unmap Alexander Duyck
@ 2013-03-18 22:12           ` Alexander Duyck
  2013-03-19 20:30             ` Shuah Khan
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Duyck @ 2013-03-18 22:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: konrad.wilk, joerg.roedel, konrad, christoph.paasch, mingo,
	shuahkhan, hpa, akpm, shuah.khan, netdev, jeffrey.t.kirsher

There were reports of the igb driver unmapping buffers without calling
dma_mapping_error.  On closer inspection issues were found in the DMA debug
API and how it handled multiple mappings of the same buffer.

The issue I found is the fact that the debug_dma_mapping_error would only set
the map_err_type to MAP_ERR_CHECKED in the case that the was only one match
for device and device address.  However in the case of non-IOMMU, multiple
addresses existed and as a result it was not setting this field once a
second mapping was instantiated.  I have resolved this by changing the search
so that it instead will now set MAP_ERR_CHECKED on the first buffer that
matches the device and DMA address that is currently in the state
MAP_ERR_NOT_CHECKED.

A secondary side effect of this patch is that in the case of multiple buffers
using the same address only the last mapping will have a valid map_err_type.
The previous mappings will all end up with map_err_type set to
MAP_ERR_CHECKED because of the dma_mapping_error call in debug_dma_map_page.
However this behavior may be preferable as it means you will likely only see
one real error per multi-mapped buffer, versus the current behavior of
multiple false errors mer multi-mapped buffer.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 lib/dma-debug.c |   24 +++++++++++++++++++-----
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 724bd4d..aa465d9 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -1082,13 +1082,27 @@ void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 	ref.dev = dev;
 	ref.dev_addr = dma_addr;
 	bucket = get_hash_bucket(&ref, &flags);
-	entry = bucket_find_exact(bucket, &ref);
 
-	if (!entry)
-		goto out;
+	list_for_each_entry(entry, &bucket->list, list) {
+		if (!exact_match(&ref, entry))
+			continue;
+
+		/*
+		 * The same physical address can be mapped multiple
+		 * times. Without a hardware IOMMU this results in the
+		 * same device addresses being put into the dma-debug
+		 * hash multiple times too. This can result in false
+		 * positives being reported. Therefore we implement a
+		 * best-fit algorithm here which updates the first entry
+		 * from the hash which fits the reference value and is
+		 * not currently listed as being checked.
+		 */
+		if (entry->map_err_type == MAP_ERR_NOT_CHECKED) {
+			entry->map_err_type = MAP_ERR_CHECKED;
+			break;
+		}
+	}
 
-	entry->map_err_type = MAP_ERR_CHECKED;
-out:
 	put_hash_bucket(bucket, &flags);
 }
 EXPORT_SYMBOL(debug_dma_mapping_error);


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] dma-debug: Fix locking bug in check_unmap
  2013-03-18 22:12           ` [PATCH 1/2] dma-debug: Fix locking bug in check_unmap Alexander Duyck
@ 2013-03-19 20:29             ` Shuah Khan
  0 siblings, 0 replies; 15+ messages in thread
From: Shuah Khan @ 2013-03-19 20:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-kernel, konrad.wilk, joerg.roedel, konrad,
	christoph.paasch, mingo, hpa, akpm, netdev, jeffrey.t.kirsher,
	shuah.khan

On Mon, 2013-03-18 at 15:12 -0700, Alexander Duyck wrote:
> In check_unmap it is possible to get into a dead-locked state if
> dma_mapping_error is called.  The problem is that the bucket is locked in
> check_unmap, and locked again by debug_dma_mapping_error which is called by
> dma_mapping_error.  To resolve that we must release the lock on the bucket
> before making the call to dma_mapping_error.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good.

Reviewed-by: Shuah Khan
Tested-by Shuah Khan

Thanks for finding and fixing the problem.
-- Shuah



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer
  2013-03-18 22:12           ` [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer Alexander Duyck
@ 2013-03-19 20:30             ` Shuah Khan
  0 siblings, 0 replies; 15+ messages in thread
From: Shuah Khan @ 2013-03-19 20:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-kernel, konrad.wilk, joerg.roedel, konrad,
	christoph.paasch, mingo, hpa, akpm, netdev, jeffrey.t.kirsher,
	shuah.khan

On Mon, 2013-03-18 at 15:12 -0700, Alexander Duyck wrote:
> There were reports of the igb driver unmapping buffers without calling
> dma_mapping_error.  On closer inspection issues were found in the DMA debug
> API and how it handled multiple mappings of the same buffer.
> 
> The issue I found is the fact that the debug_dma_mapping_error would only set
> the map_err_type to MAP_ERR_CHECKED in the case that the was only one match
> for device and device address.  However in the case of non-IOMMU, multiple
> addresses existed and as a result it was not setting this field once a
> second mapping was instantiated.  I have resolved this by changing the search
> so that it instead will now set MAP_ERR_CHECKED on the first buffer that
> matches the device and DMA address that is currently in the state
> MAP_ERR_NOT_CHECKED.
> 
> A secondary side effect of this patch is that in the case of multiple buffers
> using the same address only the last mapping will have a valid map_err_type.
> The previous mappings will all end up with map_err_type set to
> MAP_ERR_CHECKED because of the dma_mapping_error call in debug_dma_map_page.
> However this behavior may be preferable as it means you will likely only see
> one real error per multi-mapped buffer, versus the current behavior of
> multiple false errors mer multi-mapped buffer.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---

Looks good. Tested it as well.

Reviewed-by: shuah.khan@hp.com
Tested-by: shuah.khan@hp.com

Thanks for finding and fixing the problem.

-- Shuah


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-03-19 20:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-12  9:31 igb_poll - device driver failed to check map error Christoph Paasch
2013-03-13 15:30 ` Wyborny, Carolyn
2013-03-15  2:18 ` Alexander Duyck
2013-03-15  7:52   ` Christoph Paasch
2013-03-15 16:03     ` Allan, Bruce W
2013-03-16  9:27       ` Christoph Paasch
2013-03-15 16:07     ` Alexander Duyck
2013-03-15 23:08     ` Alexander Duyck
2013-03-16 11:07       ` Christoph Paasch
2013-03-18 17:29         ` Alexander Duyck
2013-03-18 22:12         ` [PATCH 0/2] Address issues in dma-debug API Alexander Duyck
2013-03-18 22:12           ` [PATCH 1/2] dma-debug: Fix locking bug in check_unmap Alexander Duyck
2013-03-19 20:29             ` Shuah Khan
2013-03-18 22:12           ` [PATCH 2/2] dma-debug: Update DMA debug API to better handle multiple mappings of a buffer Alexander Duyck
2013-03-19 20:30             ` Shuah Khan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.