[PATCH net-next 0/2] skbuff: Fix applications not being woken for errors

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
@ 2018-03-13 20:35 Vinicius Costa Gomes
  2018-03-13 20:35 ` [PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters Vinicius Costa Gomes
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Vinicius Costa Gomes @ 2018-03-13 20:35 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, randy.e.witt, davem, eric.dumazet

Hi,

Changes from the RFC:
 - tweaked commit messages;

Original cover letter:

This is actually a "bug report"-RFC instead of the more usual "new
feature"-RFC.

We are developing an application that uses TX hardware timestamping to
make some measurements, and during development Randy Witt initially
reported that the application poll() never unblocked when TX hardware
timestamping was enabled.

After some investigation, it turned out the problem wasn't only
exclusive to hardware timestamping, and could be reproduced with
software timestamping.

Applying patch (1), and running txtimestamp like this, for example:

$ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F

('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
packets for each test, '-D' to remove the delay between packets, '-l
1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
wait forever)

will cause the application to become stuck in the poll() call in most
of the times. (Note: I couldn't reproduce the issue running against an
address that is routed through loopback.)

Another interesting fact is that if the POLLIN event is added to the
poll() .events, poll() no longer becomes stuck, and more interestingly
the returned event in .revents is only POLLERR.

After a few debugging sessions, we got to 'sock_queue_err_skb()' and
how it notifies applications of the error just enqueued. Changing it
to use 'sk->sk_error_report()', fixes the issue for hardware and
software timestamping. That is patch (2).

The "solution" proposed in patch (2) looks like too big a hammer, if
it's not, then it seems that this problem existed since a long time
ago (pre git) and was uncommon for folks to reach the necessary
conditions to trigger it (my hypothesis is that only triggers when the
error is reported from a different task context than the application).

Am I missing something here?

Cheers,
--

Vinicius Costa Gomes (2):
  selftests/txtimestamp: Add more configurable parameters
  skbuff: Fix not waking applications when errors are enqueued

 net/core/skbuff.c                                   |  2 +-
 .../selftests/networking/timestamping/txtimestamp.c | 21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

--
2.16.2

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters
  2018-03-13 20:35 [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Vinicius Costa Gomes
@ 2018-03-13 20:35 ` Vinicius Costa Gomes
  2018-03-13 20:35 ` [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued Vinicius Costa Gomes
  2018-03-14 16:32 ` [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Willem de Bruijn
  2 siblings, 0 replies; 7+ messages in thread
From: Vinicius Costa Gomes @ 2018-03-13 20:35 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, randy.e.witt, davem, eric.dumazet

Add a way to configure if poll() should wait forever for an event, the
number of packets that should be sent for each and if there should be
any delay between packets.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 .../selftests/networking/timestamping/txtimestamp.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/networking/timestamping/txtimestamp.c b/tools/testing/selftests/networking/timestamping/txtimestamp.c
index 5df07047ca86..5190b1dd78b1 100644
--- a/tools/testing/selftests/networking/timestamping/txtimestamp.c
+++ b/tools/testing/selftests/networking/timestamping/txtimestamp.c
@@ -68,9 +68,11 @@ static int cfg_num_pkts = 4;
 static int do_ipv4 = 1;
 static int do_ipv6 = 1;
 static int cfg_payload_len = 10;
+static int cfg_poll_timeout = 100;
 static bool cfg_show_payload;
 static bool cfg_do_pktinfo;
 static bool cfg_loop_nodata;
+static bool cfg_no_delay;
 static uint16_t dest_port = 9000;
 
 static struct sockaddr_in daddr;
@@ -171,7 +173,7 @@ static void __poll(int fd)
 
 	memset(&pollfd, 0, sizeof(pollfd));
 	pollfd.fd = fd;
-	ret = poll(&pollfd, 1, 100);
+	ret = poll(&pollfd, 1, cfg_poll_timeout);
 	if (ret != 1)
 		error(1, errno, "poll");
 }
@@ -371,7 +373,8 @@ static void do_test(int family, unsigned int opt)
 			error(1, errno, "send");
 
 		/* wait for all errors to be queued, else ACKs arrive OOO */
-		usleep(50 * 1000);
+		if (!cfg_no_delay)
+			usleep(50 * 1000);
 
 		__poll(fd);
 
@@ -397,6 +400,9 @@ static void __attribute__((noreturn)) usage(const char *filepath)
 			"  -n:   set no-payload option\n"
 			"  -r:   use raw\n"
 			"  -R:   use raw (IP_HDRINCL)\n"
+			"  -D:   no delay between packets\n"
+			"  -F:   poll() waits forever for an event\n"
+			"  -c N: number of packets for each test\n"
 			"  -p N: connect to port N\n"
 			"  -u:   use udp\n"
 			"  -x:   show payload (up to 70 bytes)\n",
@@ -409,7 +415,7 @@ static void parse_opt(int argc, char **argv)
 	int proto_count = 0;
 	char c;
 
-	while ((c = getopt(argc, argv, "46hIl:np:rRux")) != -1) {
+	while ((c = getopt(argc, argv, "46hIl:np:rRuxc:DF")) != -1) {
 		switch (c) {
 		case '4':
 			do_ipv6 = 0;
@@ -447,6 +453,15 @@ static void parse_opt(int argc, char **argv)
 		case 'x':
 			cfg_show_payload = true;
 			break;
+		case 'c':
+			cfg_num_pkts = strtoul(optarg, NULL, 10);
+			break;
+		case 'D':
+			cfg_no_delay = true;
+			break;
+		case 'F':
+			cfg_poll_timeout = -1;
+			break;
 		case 'h':
 		default:
 			usage(argv[0]);
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued
  2018-03-13 20:35 [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Vinicius Costa Gomes
  2018-03-13 20:35 ` [PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters Vinicius Costa Gomes
@ 2018-03-13 20:35 ` Vinicius Costa Gomes
  2018-03-14 16:40   ` Eric Dumazet
  2018-03-14 16:32 ` [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Willem de Bruijn
  2 siblings, 1 reply; 7+ messages in thread
From: Vinicius Costa Gomes @ 2018-03-13 20:35 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, randy.e.witt, davem, eric.dumazet

When errors are enqueued to the error queue via sock_queue_err_skb()
function, it is possible that the waiting application is not notified.

Calling 'sk->sk_data_ready()' would not notify applications that
selected only POLLERR events in poll() (for example).

Reported-by: Randy E. Witt <randy.e.witt@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..6def3534f509 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4181,7 +4181,7 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 
 	skb_queue_tail(&sk->sk_error_queue, skb);
 	if (!sock_flag(sk, SOCK_DEAD))
-		sk->sk_data_ready(sk);
+		sk->sk_error_report(sk);
 	return 0;
 }
 EXPORT_SYMBOL(sock_queue_err_skb);
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
  2018-03-13 20:35 [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Vinicius Costa Gomes
  2018-03-13 20:35 ` [PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters Vinicius Costa Gomes
  2018-03-13 20:35 ` [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued Vinicius Costa Gomes
@ 2018-03-14 16:32 ` Willem de Bruijn
  2018-03-14 17:39   ` Soheil Hassas Yeganeh
  2018-03-14 19:37   ` Vinicius Costa Gomes
  2 siblings, 2 replies; 7+ messages in thread
From: Willem de Bruijn @ 2018-03-14 16:32 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: Network Development, randy.e.witt, David Miller, Eric Dumazet

On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
> Hi,
>
> Changes from the RFC:
>  - tweaked commit messages;
>
> Original cover letter:
>
> This is actually a "bug report"-RFC instead of the more usual "new
> feature"-RFC.
>
> We are developing an application that uses TX hardware timestamping to
> make some measurements, and during development Randy Witt initially
> reported that the application poll() never unblocked when TX hardware
> timestamping was enabled.
>
> After some investigation, it turned out the problem wasn't only
> exclusive to hardware timestamping, and could be reproduced with
> software timestamping.
>
> Applying patch (1), and running txtimestamp like this, for example:
>
> $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
>
> ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> packets for each test, '-D' to remove the delay between packets, '-l
> 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> wait forever)
>
> will cause the application to become stuck in the poll() call in most
> of the times. (Note: I couldn't reproduce the issue running against an
> address that is routed through loopback.)
>
> Another interesting fact is that if the POLLIN event is added to the
> poll() .events, poll() no longer becomes stuck,

The process has registered interest only in POLLIN, which the call to
sk_data_read (sock_def_readable) will trigger.

> and more interestingly
> the returned event in .revents is only POLLERR.

datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> how it notifies applications of the error just enqueued. Changing it
> to use 'sk->sk_error_report()', fixes the issue for hardware and
> software timestamping. That is patch (2).
>
> The "solution" proposed in patch (2) looks like too big a hammer,

It looks fine to me. POLLERR is returned regardless of the mask a
process sets up in pollfd.events. So waking with sk_error_report
will fix this while still waking callers waiting on POLLIN.

Note that on sock_dequeue_err_skb, if another notification (of the
right kind) is waiting, sk_error_report is already called instead of
sk_data_ready.

This should perhaps go to net, instead of net-next (though not the test).

If resending, a small nit in the test: please keep the alphabetical
order in getopt. The filepath also looks a bit fishy, but git am applied
the mbox from patchwork without issue.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued
  2018-03-13 20:35 ` [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued Vinicius Costa Gomes
@ 2018-03-14 16:40   ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2018-03-14 16:40 UTC (permalink / raw)
  To: Vinicius Costa Gomes, netdev; +Cc: randy.e.witt, davem



On 03/13/2018 01:35 PM, Vinicius Costa Gomes wrote:
> When errors are enqueued to the error queue via sock_queue_err_skb()
> function, it is possible that the waiting application is not notified.
> 
> Calling 'sk->sk_data_ready()' would not notify applications that
> selected only POLLERR events in poll() (for example).
> 
> Reported-by: Randy E. Witt <randy.e.witt@intel.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
>   net/core/skbuff.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c13495ba6..6def3534f509 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4181,7 +4181,7 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
>   
>   	skb_queue_tail(&sk->sk_error_queue, skb);
>   	if (!sock_flag(sk, SOCK_DEAD))
> -		sk->sk_data_ready(sk);
> +		sk->sk_error_report(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL(sock_queue_err_skb);
> 

Reviewed-by: Eric Dumazet <edumazet@google.com>

Given this bug has been there forever, there is probably no rush to get 
the fix in 4.16, but this looks safe enough IMO.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
  2018-03-14 16:32 ` [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Willem de Bruijn
@ 2018-03-14 17:39   ` Soheil Hassas Yeganeh
  2018-03-14 19:37   ` Vinicius Costa Gomes
  1 sibling, 0 replies; 7+ messages in thread
From: Soheil Hassas Yeganeh @ 2018-03-14 17:39 UTC (permalink / raw)
  To: vinicius.gomes
  Cc: netdev, randy.e.witt, David Miller, Eric Dumazet, Willem de Bruijn

On Wed, Mar 14, 2018 at 12:32 PM Willem de Bruijn <
willemdebruijn.kernel@gmail.com> wrote:

> On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
> > Hi,
> >
> > Changes from the RFC:
> >  - tweaked commit messages;
> >
> > Original cover letter:
> >
> > This is actually a "bug report"-RFC instead of the more usual "new
> > feature"-RFC.
> >
> > We are developing an application that uses TX hardware timestamping to
> > make some measurements, and during development Randy Witt initially
> > reported that the application poll() never unblocked when TX hardware
> > timestamping was enabled.
> >
> > After some investigation, it turned out the problem wasn't only
> > exclusive to hardware timestamping, and could be reproduced with
> > software timestamping.
> >
> > Applying patch (1), and running txtimestamp like this, for example:
> >
> > $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
> >
> > ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> > packets for each test, '-D' to remove the delay between packets, '-l
> > 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> > wait forever)
> >
> > will cause the application to become stuck in the poll() call in most
> > of the times. (Note: I couldn't reproduce the issue running against an
> > address that is routed through loopback.)
> >
> > Another interesting fact is that if the POLLIN event is added to the
> > poll() .events, poll() no longer becomes stuck,

> The process has registered interest only in POLLIN, which the call to
> sk_data_read (sock_def_readable) will trigger.

> > and more interestingly
> > the returned event in .revents is only POLLERR.

> datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> > After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> > how it notifies applications of the error just enqueued. Changing it
> > to use 'sk->sk_error_report()', fixes the issue for hardware and
> > software timestamping. That is patch (2).
> >
> > The "solution" proposed in patch (2) looks like too big a hammer,

> It looks fine to me. POLLERR is returned regardless of the mask a
> process sets up in pollfd.events. So waking with sk_error_report
> will fix this while still waking callers waiting on POLLIN.

> Note that on sock_dequeue_err_skb, if another notification (of the
> right kind) is waiting, sk_error_report is already called instead of
> sk_data_ready.

Thank you for the fix. It looks fine to me too, because we already only arm
POLLERR for sock_dequeue_err_skb, and POLLERR is always returned when error
queue is not empty:
   if (...skb_queue_empty(&sk->sk_error_queue))
      mask |= POLLERR ...

> This should perhaps go to net, instead of net-next (though not the test).

+1 I think the fix should go to net.

> If resending, a small nit in the test: please keep the alphabetical
> order in getopt. The filepath also looks a bit fishy, but git am applied
> the mbox from patchwork without issue.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
  2018-03-14 16:32 ` [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Willem de Bruijn
  2018-03-14 17:39   ` Soheil Hassas Yeganeh
@ 2018-03-14 19:37   ` Vinicius Costa Gomes
  1 sibling, 0 replies; 7+ messages in thread
From: Vinicius Costa Gomes @ 2018-03-14 19:37 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, randy.e.witt, David Miller, Eric Dumazet

Hi,

Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:

>> Another interesting fact is that if the POLLIN event is added to the
>> poll() .events, poll() no longer becomes stuck,
>
> The process has registered interest only in POLLIN, which the call to
> sk_data_read (sock_def_readable) will trigger.
>
>> and more interestingly
>> the returned event in .revents is only POLLERR.
>
> datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.
>
>> After a few debugging sessions, we got to 'sock_queue_err_skb()' and
>> how it notifies applications of the error just enqueued. Changing it
>> to use 'sk->sk_error_report()', fixes the issue for hardware and
>> software timestamping. That is patch (2).
>>
>> The "solution" proposed in patch (2) looks like too big a hammer,
>
> It looks fine to me. POLLERR is returned regardless of the mask a
> process sets up in pollfd.events. So waking with sk_error_report
> will fix this while still waking callers waiting on POLLIN.
>
> Note that on sock_dequeue_err_skb, if another notification (of the
> right kind) is waiting, sk_error_report is already called instead of
> sk_data_ready.

Thank you all who did confirm that this was the right fix.

>
> This should perhaps go to net, instead of net-next (though not the
> test).

Will propose it to net, what I am thinking is that now that a few TSN
features are upstream, applications that use hardware timestamping (the
easiest way to trigger this bug) could be more common.

>
> If resending, a small nit in the test: please keep the alphabetical
> order in getopt. The filepath also looks a bit fishy, but git am applied
> the mbox from patchwork without issue.

Will send a second version.


Thanks,
--
Vinicius

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-03-14 19:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13 20:35 [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Vinicius Costa Gomes
2018-03-13 20:35 ` [PATCH net-next 1/2] selftests/txtimestamp: Add more configurable parameters Vinicius Costa Gomes
2018-03-13 20:35 ` [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued Vinicius Costa Gomes
2018-03-14 16:40   ` Eric Dumazet
2018-03-14 16:32 ` [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors Willem de Bruijn
2018-03-14 17:39   ` Soheil Hassas Yeganeh
2018-03-14 19:37   ` Vinicius Costa Gomes

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.