[PATCH] e1000e: Work around hardware unit hang by disabling TSO

* [PATCH] e1000e: Work around hardware unit hang by disabling TSO
@ 2019-05-09 10:34 ` Juliana Rodrigueiro
  0 siblings, 0 replies; 8+ messages in thread
From: Juliana Rodrigueiro @ 2019-05-09 10:34 UTC (permalink / raw)
  To: intel-wired-lan; +Cc: netdev, jeffrey.t.kirsher, thomas.jarosch

When forwarding traffic to a client behind NAT, some e1000e devices
become unstable, hanging and then being reset by the watchdog.

Output from syslog:

kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
kernel:  TDH                  <5f>
kernel:  TDT                  <8d>
kernel:  next_to_use          <8d>
kernel:  next_to_clean        <5c>
kernel: buffer_info[next_to_clean]:
kernel:  time_stamp           <6bd7b>
kernel:  next_to_watch        <5f>
kernel:  jiffies              <6c180>
kernel:  next_to_watch.status <0>
kernel: MAC Status             <40080083>
kernel: PHY Status             <796d>
kernel: PHY 1000BASE-T Status  <7800>
kernel: PHY Extended Status    <3000>
kernel: PCI Status             <10>
kernel: e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly

This repeats several times and never recovers.

Disabling TCP segmentation offload (TSO) seems to be the only way to
work around this problem on the affected devices.

This issue was first reported on 14.01.2015:
https://marc.info/?l=linux-netdev&m=142124954120315

Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8b11682ebba2..4781a45c1047 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -6936,6 +6936,12 @@ static netdev_features_t e1000_fix_features(struct net_device *netdev,
 	if ((hw->mac.type >= e1000_pch2lan) && (netdev->mtu > ETH_DATA_LEN))
 		features &= ~NETIF_F_RXFCS;
 
+	if (adapter->pdev->device == E1000_DEV_ID_PCH2_LV_V) {
+		e_info("Disabling TSO on problematic device to avoid hardware unit hang.\n");
+		features &= ~NETIF_F_TSO;
+		features &= ~NETIF_F_TSO6;
+	}
+
 	/* Since there is no support for separate Rx/Tx vlan accel
 	 * enable/disable make sure Tx flag is always in same state as Rx.
 	 */
-- 
2.20.1
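
For readers unfamiliar with the fix_features hook, the masking done in the hunk above can be sketched in plain userspace C. This is an illustration only, not driver code: features_t, the F_* bits, AFFECTED_DEVICE_ID and fix_features() are hypothetical stand-ins for netdev_features_t, the NETIF_F_* flags, E1000_DEV_ID_PCH2_LV_V and e1000_fix_features(), and the bit positions and ID value are placeholders.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's netdev_features_t and
 * NETIF_F_* bits. The real definitions live in the kernel headers
 * and use different bit positions.
 */
typedef uint64_t features_t;

#define F_TSO    (1ULL << 0)	/* stand-in for NETIF_F_TSO  (TCP over IPv4) */
#define F_TSO6   (1ULL << 1)	/* stand-in for NETIF_F_TSO6 (TCP over IPv6) */
#define F_RXCSUM (1ULL << 2)	/* unrelated feature that must be left alone */

/* Placeholder PCI device ID for this sketch, not taken from the driver. */
#define AFFECTED_DEVICE_ID 0x1503

/* Same shape as a fix_features callback: take the requested feature set
 * and return the set the device is allowed to run with.
 */
features_t fix_features(uint16_t device_id, features_t requested)
{
	/* Clear both TSO bits for the affected device, keep everything else. */
	if (device_id == AFFECTED_DEVICE_ID)
		requested &= ~(F_TSO | F_TSO6);

	return requested;
}
```

Calling fix_features(AFFECTED_DEVICE_ID, F_TSO | F_TSO6 | F_RXCSUM) leaves only F_RXCSUM set, while any other device ID gets the requested set back unchanged. In the kernel, the callback is consulted whenever the feature set is recomputed, so with the patch applied a later request to re-enable TSO on the affected ID would be masked off again.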

* Re: [Intel-wired-lan] [PATCH] e1000e: Work around hardware unit hang by disabling TSO
  2019-05-09 10:34 ` [Intel-wired-lan] " Juliana Rodrigueiro
@ 2019-05-15  5:39   ` Neftin, Sasha
  -1 siblings, 0 replies; 8+ messages in thread
From: Neftin, Sasha @ 2019-05-15  5:39 UTC (permalink / raw)
  To: Juliana Rodrigueiro, intel-wired-lan; +Cc: thomas.jarosch, netdev

On 5/9/2019 13:34, Juliana Rodrigueiro wrote:
> When forwarding traffic to a client behind NAT, some e1000e devices
> become unstable, hanging and then being reset by the watchdog.
> 
> Output from syslog:
> 
> kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
> kernel:  TDH                  <5f>
> kernel:  TDT                  <8d>
> kernel:  next_to_use          <8d>
> kernel:  next_to_clean        <5c>
> kernel: buffer_info[next_to_clean]:
> kernel:  time_stamp           <6bd7b>
> kernel:  next_to_watch        <5f>
> kernel:  jiffies              <6c180>
> kernel:  next_to_watch.status <0>
> kernel: MAC Status             <40080083>
> kernel: PHY Status             <796d>
> kernel: PHY 1000BASE-T Status  <7800>
> kernel: PHY Extended Status    <3000>
> kernel: PCI Status             <10>
> kernel: e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
> 
> This repeats several times and never recovers.
> 
> Disabling TCP segmentation offload (TSO) seems to be the only way to
> work around this problem on the affected devices.
> 
> This issue was first reported on 14.01.2015:
> https://marc.info/?l=linux-netdev&m=142124954120315
> 
> Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
> ---
>   drivers/net/ethernet/intel/e1000e/netdev.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 8b11682ebba2..4781a45c1047 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6936,6 +6936,12 @@ static netdev_features_t e1000_fix_features(struct net_device *netdev,
>   	if ((hw->mac.type >= e1000_pch2lan) && (netdev->mtu > ETH_DATA_LEN))
>   		features &= ~NETIF_F_RXFCS;
>   
> +	if (adapter->pdev->device == E1000_DEV_ID_PCH2_LV_V) {
> +		e_info("Disabling TSO on problematic device to avoid hardware unit hang.\n");
> +		features &= ~NETIF_F_TSO;
> +		features &= ~NETIF_F_TSO6;
> +	}
> +
>   	/* Since there is no support for separate Rx/Tx vlan accel
>   	 * enable/disable make sure Tx flag is always in same state as Rx.
>   	 */
> 
You are right: in some particular configurations, e1000e devices get
stuck in a Tx hang while TCP segmentation offload is on. But for all
other users we should keep TCP segmentation offload enabled by default.
I suggest using the 'ethtool' command, ethtool -K <adapter> tso on/off,
to work around the Tx hang in your situation.
Thanks,
Sasha

* Re: [Intel-wired-lan] [PATCH] e1000e: Work around hardware unit hang by disabling TSO
  2019-05-15  5:39   ` Neftin, Sasha
@ 2019-05-21 15:42     ` Juliana Rodrigueiro
  -1 siblings, 0 replies; 8+ messages in thread
From: Juliana Rodrigueiro @ 2019-05-21 15:42 UTC (permalink / raw)
  To: Neftin, Sasha; +Cc: intel-wired-lan, thomas.jarosch, netdev

Hello Sasha,

On Wednesday, 15 May 2019 07:39:46 CEST Neftin, Sasha wrote:
> You are right: in some particular configurations, e1000e devices get
> stuck in a Tx hang while TCP segmentation offload is on. But for all
> other users we should keep TCP segmentation offload enabled by default.
> I suggest using the 'ethtool' command, ethtool -K <adapter> tso on/off,
> to work around the Tx hang in your situation.
> Thanks,
> Sasha

thank you for your reply.

I did consider using "ethtool" to disable TSO for my use cases. However, I 
have no guarantees that a machine with the PCH2 device will not hang and 
render my system inaccessible before anything in userspace runs. No amount of 
connection outage is acceptable.

The problem escalates when we take into consideration that the exact 
circumstances that bring the device into an unrecoverable state don't seem to 
be known even by the Intel developers themselves.

This patch keeps the problematic device stable for all configurations.

So I ask myself: how feasible is it, really, to gamble on using "ethtool"
to turn TSO on or off every time the network configuration changes?

Why should we let users run into an open knife instead of preemptively
fixing a known hardware bug in the kernel? Otherwise all Linux distributions
would need to apply the magic ethtool fix for this specific PCI ID.

Best regards,
Juliana

* Re: [Intel-wired-lan] [PATCH] e1000e: Work around hardware unit hang by disabling TSO
  2019-05-21 15:42     ` Juliana Rodrigueiro
@ 2019-05-22 10:58       ` Neftin, Sasha
  -1 siblings, 0 replies; 8+ messages in thread
From: Neftin, Sasha @ 2019-05-22 10:58 UTC (permalink / raw)
  To: Juliana Rodrigueiro; +Cc: intel-wired-lan, thomas.jarosch, netdev

On 5/21/2019 18:42, Juliana Rodrigueiro wrote:
> So I ask myself: how feasible is it, really, to gamble on using "ethtool"
> to turn TSO on or off every time the network configuration changes?
Hello Juliana,
There are many PCH2 devices with different SKUs. Not all devices have
this problem (Tx hang). We do not want to make disabling TSO the
default. Let's keep this option available for all other users.
Also, this is a very old, known HW bug; unfortunately, we did not fix it.
Our newer devices do not have this problem.
