* bug: tpacket_snd can cause data corruption
@ 2019-07-03 11:07 Frank de Brabander
  2019-07-03 16:07 ` Willem de Bruijn
  0 siblings, 1 reply; 5+ messages in thread
From: Frank de Brabander @ 2019-07-03 11:07 UTC
  To: David S . Miller, Willem de Bruijn; +Cc: netdev, Frank de Brabander

In commit 5cd8d46e a fix was applied for data corruption in
tpacket_snd. A selftest was added in commit 358be656 which
validates this fix.

Unfortunately this bug still persists, although since this fix it is
less likely to trigger. The bug was initially observed using a
PACKET_MMAP application, but can also be seen by tweaking the kernel
selftest.

When the selftest txring_overwrite.c is tweaked to run as an
infinite loop, the data corruption still triggers. It seems to
occur faster when interrupts are generated (e.g. by plugging
in USB devices). Tested with kernel version 5.2-rc7.

The cause of this bug is still unclear.

Signed-off-by: Frank de Brabander <debrabander@gmail.com>
---
 tools/testing/selftests/net/txring_overwrite.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/txring_overwrite.c b/tools/testing/selftests/net/txring_overwrite.c
index fd8b1c6..3ee23e5 100644
--- a/tools/testing/selftests/net/txring_overwrite.c
+++ b/tools/testing/selftests/net/txring_overwrite.c
@@ -143,19 +143,22 @@ static int read_verify_pkt(int fdr, char payload_char)
 	int ret;
 
 	ret = read(fdr, buf, sizeof(buf));
-	if (ret != sizeof(buf))
-		error(1, errno, "read");
+	if (ret != sizeof(buf)) {
+		//error(1, errno, "read");
+		printf("read error\n");
+		return 1;
+	}
 
 	if (buf[60] != payload_char) {
 		printf("wrong pattern: 0x%x != 0x%x\n", buf[60], payload_char);
 		return 1;
 	}
 
-	printf("read: %c (0x%x)\n", buf[60], buf[60]);
+	//printf("read: %c (0x%x)\n", buf[60], buf[60]);
 	return 0;
 }
 
-int main(int argc, char **argv)
+static int run_test(void)
 {
 	const char payload_patterns[] = "ab";
 	char *ring;
@@ -177,3 +180,10 @@ int main(int argc, char **argv)
 
 	return ret;
 }
+
+int main(int argc, char **argv)
+{
+	while (true) {
+		run_test();
+	}
+}
-- 
2.7.4



* Re: bug: tpacket_snd can cause data corruption
  2019-07-03 11:07 bug: tpacket_snd can cause data corruption Frank de Brabander
@ 2019-07-03 16:07 ` Willem de Bruijn
  2019-07-04 10:43   ` Frank de Brabander
  0 siblings, 1 reply; 5+ messages in thread
From: Willem de Bruijn @ 2019-07-03 16:07 UTC
  To: Frank de Brabander
  Cc: David S . Miller, Willem de Bruijn, Network Development

On Wed, Jul 3, 2019 at 7:08 AM Frank de Brabander <debrabander@gmail.com> wrote:
>
> In commit 5cd8d46e a fix was applied for data corruption in
> tpacket_snd. A selftest was added in commit 358be656 which
> validates this fix.
>
> Unfortunately this bug still persists, although since this fix it is
> less likely to trigger. The bug was initially observed using a
> PACKET_MMAP application, but can also be seen by tweaking the kernel
> selftest.
>
> When the selftest txring_overwrite.c is tweaked to run as an
> infinite loop, the data corruption still triggers. It seems to
> occur faster when interrupts are generated (e.g. by plugging
> in USB devices). Tested with kernel version 5.2-rc7.
>
> The cause of this bug is still unclear.

The cause of the original bug is well understood.

I expect the issue you report is due to background traffic, and is more
about the test than the kernel implementation.

Can you reproduce the issue when running the modified test in a
network namespace (./in_netns.sh ./txring_overwrite)?

I observe the reported issue outside a network namespace, but not
inside one. That implies that what we're observing is random background
traffic. The modified test drops the unexpected packet because it
mismatches on length. As a result, the next read (the test always sends
two packets, then reads both) reports a data mismatch, because it is
reading the first test packet but expecting the second. Output with a
bit more data:

count: 200
count: 300
count: 400
count: 500
 read: 90B != 100B
wrong pattern: 0x61 != 0x62
count: 600
count: 700
count: 800
 read: 90B != 100B
wrong pattern: 0x61 != 0x62
count: 900
 read: 90B != 100B
wrong pattern: 0x61 != 0x62

Notice the clear pattern.

This does not trigger inside a network namespace, which is how
kselftest invokes txring_overwrite (from run_afpackettests).
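
The sequence described above can be sketched as a small simulation.
This is an illustration only, not code from the selftest: struct
sim_pkt, sim_round() and the skip_strays switch are hypothetical names,
and the 100-byte test packet length is taken from the output above.
With skip_strays set to 0, a stray packet uses up one of the two reads,
so the following read sees the first payload while expecting the
second. With skip_strays set to 1, stray packets are discarded without
consuming a read, which is one possible way for a looping test to
tolerate background traffic outside a namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TEST_PKT_LEN 100	/* length of the test's own packets (from the log) */

/* Hypothetical stand-in for one packet queued on the receive socket. */
struct sim_pkt {
	size_t len;
	char payload;
};

/*
 * Replay one test round. The test expects the payloads in 'expected'
 * (e.g. "ab") in order and performs one read per expected packet.
 * skip_strays == 0: a packet of the wrong length consumes a read slot.
 * skip_strays != 0: such a packet is dropped and the read is retried.
 * Returns the number of "wrong pattern" mismatches.
 */
static int sim_round(const struct sim_pkt *queue, size_t n,
		     const char *expected, size_t n_expected,
		     int skip_strays)
{
	size_t qi = 0, ei = 0;
	int mismatches = 0;

	assert(queue != NULL && expected != NULL);

	while (ei < n_expected && qi < n) {
		const struct sim_pkt *p = &queue[qi++];

		if (p->len != TEST_PKT_LEN) {
			printf(" read: %zuB != %dB\n", p->len, TEST_PKT_LEN);
			if (!skip_strays)
				ei++;	/* the failed read used up this slot */
			continue;
		}

		if (p->payload != expected[ei]) {
			printf("wrong pattern: 0x%x != 0x%x\n",
			       p->payload, expected[ei]);
			mismatches++;
		}
		ei++;
	}

	return mismatches;
}
```

For a queue holding a 90-byte stray followed by the 'a' and 'b' test
packets, sim_round(queue, 3, "ab", 2, 0) reports one mismatch, and the
same call with skip_strays set to 1 reports none.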


* Re: bug: tpacket_snd can cause data corruption
  2019-07-03 16:07 ` Willem de Bruijn
@ 2019-07-04 10:43   ` Frank de Brabander
  2019-07-04 22:59     ` Willem de Bruijn
  0 siblings, 1 reply; 5+ messages in thread
From: Frank de Brabander @ 2019-07-04 10:43 UTC
  To: Willem de Bruijn; +Cc: David S . Miller, Willem de Bruijn, Network Development

On 03-07-19 18:07, Willem de Bruijn wrote:

> On Wed, Jul 3, 2019 at 7:08 AM Frank de Brabander <debrabander@gmail.com> wrote:
>> In commit 5cd8d46e a fix was applied for data corruption in
>> tpacket_snd. A selftest was added in commit 358be656 which
>> validates this fix.
>>
>> Unfortunately this bug still persists, although since this fix it is
>> less likely to trigger. The bug was initially observed using a
>> PACKET_MMAP application, but can also be seen by tweaking the kernel
>> selftest.
>>
>> When the selftest txring_overwrite.c is tweaked to run as an
>> infinite loop, the data corruption still triggers. It seems to
>> occur faster when interrupts are generated (e.g. by plugging
>> in USB devices). Tested with kernel version 5.2-rc7.
>>
>> The cause of this bug is still unclear.
> The cause of the original bug is well understood.
>
> I expect the issue you report is due to background traffic, and is more
> about the test than the kernel implementation.
>
> Can you reproduce the issue when running the modified test in a
> network namespace (./in_netns.sh ./txring_overwrite)?
>
> I observe the reported issue outside a network namespace, but not
> inside one. That implies that what we're observing is random background
> traffic. The modified test drops the unexpected packet because it
> mismatches on length. As a result, the next read (the test always sends
> two packets, then reads both) reports a data mismatch, because it is
> reading the first test packet but expecting the second. Output with a
> bit more data:
>
> count: 200
> count: 300
> count: 400
> count: 500
>   read: 90B != 100B
> wrong pattern: 0x61 != 0x62
> count: 600
> count: 700
> count: 800
>   read: 90B != 100B
> wrong pattern: 0x61 != 0x62
> count: 900
>   read: 90B != 100B
> wrong pattern: 0x61 != 0x62
>
> Notice the clear pattern.
>
> This does not trigger inside a network namespace, which is how
> kselftest invokes txring_overwrite (from run_afpackettests).
I'm also able to reproduce the issue inside a network namespace.

I've added the extra logging for length mismatches, as seen in your
output. Running the test without ./in_netns.sh indeed behaves
as you describe:

read error: 66 != 100
wrong pattern: 0x61 != 0x62
read error: 66 != 100
wrong pattern: 0x61 != 0x62
read error: 74 != 100
read error: 66 != 100
wrong pattern: 0x53 != 0x61
wrong pattern: 0x53 != 0x62
read error: 66 != 100
read error: 66 != 100
read error: 66 != 100
wrong pattern: 0x61 != 0x62
read error: 95 != 100
read error: 95 != 100
wrong pattern: 0xffffffbe != 0x61
wrong pattern: 0x61 != 0x62
read error: 66 != 100

But even when running the test with ./in_netns.sh it shows
"wrong pattern", this time without length mismatches:

wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61
wrong pattern: 0x62 != 0x61

As already mentioned, it seems to trigger mainly (only?) when
a USB device is connected. The PC I'm testing this on has a
USB hub with many ports and connected devices. When connecting
this USB hub, the number of "wrong pattern" errors that are
shown seems to correlate with the number of new devices
that the kernel has to detect. Connecting a single USB device
also triggers the error, but not on every attempt.

Unfortunately I have not found any other way to force the
error to trigger. E.g. running stress-ng to generate CPU load or
timer interrupts does not seem to have any impact.
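
In the in_netns.sh output above, each mismatch has the first read
returning 0x62 where 0x61 was expected, i.e. the first packet arrives
carrying the second payload. What follows is a minimal user-space
sketch of one way such a signature can arise. It rests on two
assumptions that are not confirmed in this thread: that both packets
are sent through the same ring slot, and that a queued packet may
still reference that slot when it is rewritten. struct sim_skb and the
sim_* helpers are hypothetical and do not model the kernel's actual
code paths.

```c
#include <assert.h>
#include <string.h>

#define SLOT_LEN 8	/* arbitrary slot size for the illustration */

/* Hypothetical model of a queued packet: owns a copy or aliases the slot. */
struct sim_skb {
	const char *data;	/* what the receiver will eventually read */
	char copy[SLOT_LEN];	/* private storage, used only when copying */
};

/* Fill the shared slot with a single payload character. */
static void sim_fill_slot(char *slot, char payload)
{
	memset(slot, payload, SLOT_LEN);
}

/* Queue the slot contents, either as a private copy or by reference. */
static void sim_queue(struct sim_skb *skb, const char *slot, int copy_data)
{
	assert(skb != NULL && slot != NULL);

	if (copy_data) {
		memcpy(skb->copy, slot, SLOT_LEN);
		skb->data = skb->copy;
	} else {
		skb->data = slot;	/* still points into the shared slot */
	}
}

/*
 * Send 'a' and then 'b' through the same slot and return the payload
 * byte the receiver would see for the first packet.
 */
static char sim_first_payload(int copy_data)
{
	char slot[SLOT_LEN];
	struct sim_skb first, second;

	sim_fill_slot(slot, 'a');
	sim_queue(&first, slot, copy_data);

	sim_fill_slot(slot, 'b');	/* slot reused before 'first' is read */
	sim_queue(&second, slot, copy_data);

	return first.data[0];
}
```

With copy_data set to 1 the first packet still reads back as 'a'
(0x61). With copy_data set to 0 it reads back as 'b' (0x62), which is
the same shape as the "wrong pattern: 0x62 != 0x61" lines.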


* Re: bug: tpacket_snd can cause data corruption
  2019-07-04 10:43   ` Frank de Brabander
@ 2019-07-04 22:59     ` Willem de Bruijn
  2019-07-05  7:49       ` Frank de Brabander
  0 siblings, 1 reply; 5+ messages in thread
From: Willem de Bruijn @ 2019-07-04 22:59 UTC
  To: Frank de Brabander
  Cc: David S . Miller, Willem de Bruijn, Network Development

> > Can you reproduce the issue when running the modified test in a
> > network namespace (./in_netns.sh ./txring_overwrite)?

> But even when running the test with ./in_netns.sh it shows
> "wrong pattern", this time without length mismatches:
>
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
> wrong pattern: 0x62 != 0x61
>
> As already mentioned, it seems to trigger mainly (only?) when
> a USB device is connected. The PC I'm testing this on has a
> USB hub with many ports and connected devices. When connecting
> this USB hub, the number of "wrong pattern" errors that are
> shown seems to correlate with the number of new devices
> that the kernel has to detect. Connecting a single USB device
> also triggers the error, but not on every attempt.
>
> Unfortunately I have not found any other way to force the
> error to trigger. E.g. running stress-ng to generate CPU load or
> timer interrupts does not seem to have any impact.

Interesting, thanks for testing. No exact idea so far. The USB devices
are not necessarily network devices, I suppose? I don't immediately
have a setup to test USB hotplug, so I cannot yet reproduce the bug.


* Re: bug: tpacket_snd can cause data corruption
  2019-07-04 22:59     ` Willem de Bruijn
@ 2019-07-05  7:49       ` Frank de Brabander
  0 siblings, 0 replies; 5+ messages in thread
From: Frank de Brabander @ 2019-07-05  7:49 UTC
  To: Willem de Bruijn; +Cc: David S . Miller, Willem de Bruijn, Network Development

On 05-07-19 00:59, Willem de Bruijn wrote:

>>> Can you reproduce the issue when running the modified test in a
>>> network namespace (./in_netns.sh ./txring_overwrite)?
>> But even when running the test with ./in_netns.sh it shows
>> "wrong pattern", this time without length mismatches:
>>
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>> wrong pattern: 0x62 != 0x61
>>
>> As already mentioned, it seems to trigger mainly (only?) when
>> a USB device is connected. The PC I'm testing this on has a
>> USB hub with many ports and connected devices. When connecting
>> this USB hub, the number of "wrong pattern" errors that are
>> shown seems to correlate with the number of new devices
>> that the kernel has to detect. Connecting a single USB device
>> also triggers the error, but not on every attempt.
>>
>> Unfortunately I have not found any other way to force the
>> error to trigger. E.g. running stress-ng to generate CPU load or
>> timer interrupts does not seem to have any impact.
> Interesting, thanks for testing. No exact idea so far. The USB devices
> are not necessarily network devices, I suppose? I don't immediately
> have a setup to test USB hotplug, so I cannot yet reproduce the bug.
It triggers with different types of USB devices. I verified that the
bug can trigger with a USB flash drive, a mouse, a USB-serial
adapter and a USB hub (even one with no devices attached).

It can trigger when the USB device is connected as well as when
it's disconnected. But a bit of luck is needed; it can take
a number of attempts before it happens. Using a large USB hub with
many connected devices triggers it much more easily.
