All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
@ 2017-08-14  5:52 Matthew Dawson
  2017-08-14  9:27 ` Paolo Abeni
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Dawson @ 2017-08-14  5:52 UTC (permalink / raw)
  To: netdev; +Cc: Matthew Dawson, Macieira, Thiago, willemdebruijn.kernel

Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
headers from UDP packets before queueing"), when udp packets are being
peeked the requested extra offset is always 0 as there is no need to skip
the udp header.  However, when the offset is 0 and the next skb is
of length 0, it is only returned once.  The behaviour can be seen with
the following python script:

#!/usr/bin/env python3
from socket import *;
f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
f.bind(('::', 0));
addr=('::1', f.getsockname()[1]);
g.sendto(b'', addr)
g.sendto(b'b', addr)
print(f.recvfrom(10, MSG_PEEK));
print(f.recvfrom(10, MSG_PEEK));

Where the expected output should be the empty string twice.

Instead, make sk_peek_offset return negative values, and pass those values
to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
After the call, the offset is set to 0 if negative to ensure all further
calculations are correct.

Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
greater then 0, and off is greater then or equal to skb->len, then (_off ||
skb->len) must always be true assuming skb->len >= 0 is always true.

Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
as it double checked if MSG_PEEK was set in the flags.

Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
---
 include/net/sock.h  | 4 +---
 net/core/datagram.c | 4 ++--
 net/ipv4/udp.c      | 4 ++++
 net/ipv6/udp.c      | 4 ++++
 net/unix/af_unix.c  | 8 +++++---
 5 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..aeeec62992ca 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val);
 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
 	if (unlikely(flags & MSG_PEEK)) {
-		s32 off = READ_ONCE(sk->sk_peek_off);
-		if (off >= 0)
-			return off;
+		return READ_ONCE(sk->sk_peek_off);
 	}
 
 	return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..a91cd8aa244b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
 	*last = queue->prev;
 	skb_queue_walk(queue, skb) {
 		if (flags & MSG_PEEK) {
-			if (_off >= skb->len && (skb->len || _off ||
-						 skb->peeked)) {
+			if (_off >= 0 && _off >= skb->len &&
+			    (_off || skb->peeked)) {
 				_off -= skb->len;
 				continue;
 			}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a7c804f73990..a0d8d6d285f4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1576,6 +1576,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 try_again:
 	peeking = off = sk_peek_offset(sk, flags);
 	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
+	if (off < 0) {
+		off = 0;
+		peeking = 0;
+	}
 	if (!skb)
 		return err;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 578142b7ca3e..34d8f9f8d87d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -364,6 +364,10 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 try_again:
 	peeking = off = sk_peek_offset(sk, flags);
 	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
+	if (peeking < 0) {
+		off = 0;
+		peeking = 0;
+	}
 	if (!skb)
 		return err;
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7b52a380d710..390e0627dc05 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2127,6 +2127,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		skip = sk_peek_offset(sk, flags);
 		skb = __skb_try_recv_datagram(sk, flags, NULL, &peeked, &skip,
 					      &err, &last);
+		if (skip < 0)
+			skip = 0;
 		if (skb)
 			break;
 
@@ -2304,10 +2306,10 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state,
 	 */
 	mutex_lock(&u->iolock);
 
-	if (flags & MSG_PEEK)
-		skip = sk_peek_offset(sk, flags);
-	else
+	skip = sk_peek_offset(sk, flags);
+	if (skip < 0) {
 		skip = 0;
+	}
 
 	do {
 		int chunk;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14  5:52 [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs Matthew Dawson
@ 2017-08-14  9:27 ` Paolo Abeni
  2017-08-14 14:05   ` Matthew Dawson
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Abeni @ 2017-08-14  9:27 UTC (permalink / raw)
  To: Matthew Dawson, netdev; +Cc: Macieira, Thiago, willemdebruijn.kernel

On Mon, 2017-08-14 at 01:52 -0400, Matthew Dawson wrote:
> Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
> headers from UDP packets before queueing"), when udp packets are being
> peeked the requested extra offset is always 0 as there is no need to skip
> the udp header.  However, when the offset is 0 and the next skb is
> of length 0, it is only returned once.  The behaviour can be seen with
> the following python script:
> 
> #!/usr/bin/env python3
> from socket import *;
> f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> f.bind(('::', 0));
> addr=('::1', f.getsockname()[1]);
> g.sendto(b'', addr)
> g.sendto(b'b', addr)
> print(f.recvfrom(10, MSG_PEEK));
> print(f.recvfrom(10, MSG_PEEK));
> 
> Where the expected output should be the empty string twice.
> 
> Instead, make sk_peek_offset return negative values, and pass those values
> to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
> to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
> After the call, the offset is set to 0 if negative to ensure all further
> calculations are correct.
> 
> Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
> greater then 0, and off is greater then or equal to skb->len, then (_off ||
> skb->len) must always be true assuming skb->len >= 0 is always true.
> 
> Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
> as it double checked if MSG_PEEK was set in the flags.
> 
> Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
> ---
>  include/net/sock.h  | 4 +---
>  net/core/datagram.c | 4 ++--
>  net/ipv4/udp.c      | 4 ++++
>  net/ipv6/udp.c      | 4 ++++
>  net/unix/af_unix.c  | 8 +++++---
>  5 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7c0632c7e870..aeeec62992ca 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val);
>  static inline int sk_peek_offset(struct sock *sk, int flags)
>  {
>  	if (unlikely(flags & MSG_PEEK)) {
> -		s32 off = READ_ONCE(sk->sk_peek_off);
> -		if (off >= 0)
> -			return off;
> +		return READ_ONCE(sk->sk_peek_off);
>  	}
>  
>  	return 0;

You probably want/must also update sk_set_peek_off() to allow negative
values, elsewhere this will break as soon as the user will do
SOL_SOCKET/SO_PEEK_OFF.

I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
would help simplyifing the code: no need for negative offset; set such
flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called
(and clear it when a negative value is used), forward such flag to
__skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select
the proper peek behaviour.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14  9:27 ` Paolo Abeni
@ 2017-08-14 14:05   ` Matthew Dawson
  2017-08-14 15:03     ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Matthew Dawson @ 2017-08-14 14:05 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, Macieira, Thiago, willemdebruijn.kernel

[-- Attachment #1: Type: text/plain, Size: 4132 bytes --]

On Monday, August 14, 2017 5:27:26 AM EDT Paolo Abeni wrote:
> On Mon, 2017-08-14 at 01:52 -0400, Matthew Dawson wrote:
> > Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove
> > headers from UDP packets before queueing"), when udp packets are being
> > peeked the requested extra offset is always 0 as there is no need to skip
> > the udp header.  However, when the offset is 0 and the next skb is
> > of length 0, it is only returned once.  The behaviour can be seen with
> > the following python script:
> > 
> > #!/usr/bin/env python3
> > from socket import *;
> > f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> > g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
> > f.bind(('::', 0));
> > addr=('::1', f.getsockname()[1]);
> > g.sendto(b'', addr)
> > g.sendto(b'b', addr)
> > print(f.recvfrom(10, MSG_PEEK));
> > print(f.recvfrom(10, MSG_PEEK));
> > 
> > Where the expected output should be the empty string twice.
> > 
> > Instead, make sk_peek_offset return negative values, and pass those values
> > to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed
> > offset
> > to __skb_try_recv_from_queue is negative, the checked skb is never
> > skipped.
> > After the call, the offset is set to 0 if negative to ensure all further
> > calculations are correct.
> > 
> > Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
> > greater then 0, and off is greater then or equal to skb->len, then (_off
> > ||
> > skb->len) must always be true assuming skb->len >= 0 is always true.
> > 
> > Also remove a redundant check around a call to sk_peek_offset in
> > af_unix.c,
> > as it double checked if MSG_PEEK was set in the flags.
> > 
> > Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
> > ---
> > 
> >  include/net/sock.h  | 4 +---
> >  net/core/datagram.c | 4 ++--
> >  net/ipv4/udp.c      | 4 ++++
> >  net/ipv6/udp.c      | 4 ++++
> >  net/unix/af_unix.c  | 8 +++++---
> >  5 files changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 7c0632c7e870..aeeec62992ca 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -507,9 +507,7 @@ int sk_set_peek_off(struct sock *sk, int val);
> > 
> >  static inline int sk_peek_offset(struct sock *sk, int flags)
> >  {
> >  
> >  	if (unlikely(flags & MSG_PEEK)) {
> > 
> > -		s32 off = READ_ONCE(sk->sk_peek_off);
> > -		if (off >= 0)
> > -			return off;
> > +		return READ_ONCE(sk->sk_peek_off);
> > 
> >  	}
> >  	
> >  	return 0;
> 
> You probably want/must also update sk_set_peek_off() to allow negative
> values, elsewhere this will break as soon as the user will do
> SOL_SOCKET/SO_PEEK_OFF.
I'm happy to do this either in this patch or as a second patch and make this a 
series, whatever the networking community prefers.

I'm actually surprised that only unix sockets can have negative values.  Is 
there a reason for that?  I had assumed that sk_set_peek_off would allow 
negative values as the code already has to support negative values due to what 
the initial value is.

> I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
> would help simplyifing the code: no need for negative offset; set such
> flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called
> (and clear it when a negative value is used), forward such flag to
> __skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select
> the proper peek behaviour.
The negative value is documented and supported for unix sockets, so I don't 
think we can just reject all negative values.  However, I do see a way to 
implement that while supporting user space sending negative values.  If that 
is preferred let me know and I'll see what I can make.

I assume that would make this patch target net-next?  Would it be possible to 
pull this into net so it can fix this regression for the next kernel release, 
while I work on getting the better solution finished?  I'm happy to to make 
__skb_try_recv_datagram/__skb_try_recv_from_queue accept the extra parameter 
so they don't see an offset less then 0 in this patch.

-- 
Matthew

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 14:05   ` Matthew Dawson
@ 2017-08-14 15:03     ` Willem de Bruijn
  2017-08-14 15:31       ` Paolo Abeni
  2017-08-14 16:06       ` Thiago Macieira
  0 siblings, 2 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 15:03 UTC (permalink / raw)
  To: Matthew Dawson; +Cc: Paolo Abeni, Network Development, Macieira, Thiago

> I'm actually surprised that only unix sockets can have negative values.  Is
> there a reason for that?  I had assumed that sk_set_peek_off would allow
> negative values as the code already has to support negative values due to what
> the initial value is.

A negative initial value indicates that PEEK_OFF is disabled. It only
makes sense to peek from a positive offset from the start of the data.

>> I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
>> would help simplyifing the code:

The behavior needs to be bifurcated between peeking with
offset and without offset.

When peeking with offset is enabled by setting SO_PEEK_OFF,
subsequent reads do move the offset, so the observed behavior
is correct.

When sk->sk_peek_offset is negative, offset mode is disabled
and the same packet must be read twice.

An explicit boolean flag to discern between the two may make
the code simpler to understand, not sure whether that is logically
required.

>> no need for negative offset; set such
>> flag when SOL_SOCKET/SO_PEEK_OFF with a non negative value is called
>> (and clear it when a negative value is used), forward such flag to
>> __skb_try_recv_datagram/__skb_try_recv_from_queue and use it to select
>> the proper peek behaviour.
> The negative value is documented and supported for unix sockets, so I don't
> think we can just reject all negative values.

What is the documented behavior on those sockets? The original
patch that introduced it explicitly mentions peek mode as disabled
when the value is negative:

  https://www.spinics.net/lists/netdev/msg189589.html

> However, I do see a way to
> implement that while supporting user space sending negative values.  If that
> is preferred let me know and I'll see what I can make.
>
> I assume that would make this patch target net-next?

I suspect that we can fix this with a small enough patch to be able
to send to net.

> Would it be possible to
> pull this into net so it can fix this regression for the next kernel release,
> while I work on getting the better solution finished?  I'm happy to to make
> __skb_try_recv_datagram/__skb_try_recv_from_queue accept the extra parameter
> so they don't see an offset less then 0 in this patch.
>
> --
> Matthew

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 15:03     ` Willem de Bruijn
@ 2017-08-14 15:31       ` Paolo Abeni
  2017-08-15  1:35         ` Willem de Bruijn
  2017-08-14 16:06       ` Thiago Macieira
  1 sibling, 1 reply; 31+ messages in thread
From: Paolo Abeni @ 2017-08-14 15:31 UTC (permalink / raw)
  To: Willem de Bruijn, Matthew Dawson; +Cc: Network Development, Macieira, Thiago

On Mon, 2017-08-14 at 11:03 -0400, Willem de Bruijn wrote:
> > I'm actually surprised that only unix sockets can have negative values.  Is
> > there a reason for that?  I had assumed that sk_set_peek_off would allow
> > negative values as the code already has to support negative values due to what
> > the initial value is.
> 
> A negative initial value indicates that PEEK_OFF is disabled. It only
> makes sense to peek from a positive offset from the start of the data.

With the current code, the user space can disable peeking with offset
setting a negative offset value, after that peeking with offset has
been enabled. But only for UNIX sockets. I think the same should be
allowed for UDP sockets.

> > > I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
> > > would help simplyifing the code:
> 
> The behavior needs to be bifurcated between peeking with
> offset and without offset.
> 
> When peeking with offset is enabled by setting SO_PEEK_OFF,
> subsequent reads do move the offset, so the observed behavior
> is correct.
> 
> When sk->sk_peek_offset is negative, offset mode is disabled
> and the same packet must be read twice.
> 
> An explicit boolean flag to discern between the two may make
> the code simpler to understand, not sure whether that is logically
> required.

Yes, an explicit PEEK_OFF flag is just to keep the code simpler, so
that there is no need to add checks at every sk_peek_offset() call site
and the relevant logic can be fully encapsulated under the MSG_PEEK
branch in __skb_try_recv_from_queue(), I think/hope.
It's not a functional requirement.

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 15:03     ` Willem de Bruijn
  2017-08-14 15:31       ` Paolo Abeni
@ 2017-08-14 16:06       ` Thiago Macieira
  2017-08-14 16:33         ` Willem de Bruijn
  1 sibling, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-14 16:06 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Monday, 14 August 2017 08:03:50 PDT Willem de Bruijn wrote:
> > I'm actually surprised that only unix sockets can have negative values. 
> > Is
> > there a reason for that?  I had assumed that sk_set_peek_off would allow
> > negative values as the code already has to support negative values due to
> > what the initial value is.
> 
> A negative initial value indicates that PEEK_OFF is disabled. It only
> makes sense to peek from a positive offset from the start of the data.

But here's a question: if the peek offset is equal to the length, should the 
reading return an empty datagram? This would indicate to the caller that there 
was a datagram there, which was skipped over.

That's how we deal with empty datagrams anyway.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 16:06       ` Thiago Macieira
@ 2017-08-14 16:33         ` Willem de Bruijn
  2017-08-14 17:02           ` Thiago Macieira
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 16:33 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Mon, Aug 14, 2017 at 12:06 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday, 14 August 2017 08:03:50 PDT Willem de Bruijn wrote:
>> > I'm actually surprised that only unix sockets can have negative values.
>> > Is
>> > there a reason for that?  I had assumed that sk_set_peek_off would allow
>> > negative values as the code already has to support negative values due to
>> > what the initial value is.
>>
>> A negative initial value indicates that PEEK_OFF is disabled. It only
>> makes sense to peek from a positive offset from the start of the data.
>
> But here's a question: if the peek offset is equal to the length, should the
> reading return an empty datagram? This would indicate to the caller that there
> was a datagram there, which was skipped over.

In the general case, no, it should read at the offset, which is the next skb.

Zero length packets are a special case. This did come up before and
we chose to signal their existence in the queue by returning 0 for each
once, even in the offset-enabled mode.

Since we only need to change no-offset semantics to fix this bug,
I would not change this behavior, which is also expected by some
applications by now.

>
> That's how we deal with empty datagrams anyway.

What is? With no-offset and a zero payload skb at the head, peek
or recv returns 0, right?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 16:33         ` Willem de Bruijn
@ 2017-08-14 17:02           ` Thiago Macieira
  2017-08-14 18:25             ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-14 17:02 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Monday, 14 August 2017 09:33:48 PDT Willem de Bruijn wrote:
> > But here's a question: if the peek offset is equal to the length, should
> > the reading return an empty datagram? This would indicate to the caller
> > that there was a datagram there, which was skipped over.
> 
> In the general case, no, it should read at the offset, which is the next
> skb.

I beg to differ. In this particular case, we are talking about datagrams. If 
it were stream sockets, I would agree with you: just skip to the next. But in 
datagrams, the same way you do return zero-sized ones, I would return an empty 
one if you peeked at or past the end.

> Since we only need to change no-offset semantics to fix this bug,
> I would not change this behavior, which is also expected by some
> applications by now.

Do applications using SOCK_DGRAM rely on the behaviour of skipping over 
datagrams that are too short?

> > That's how we deal with empty datagrams anyway.
> 
> What is? With no-offset and a zero payload skb at the head, peek
> or recv returns 0, right?

Right.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 17:02           ` Thiago Macieira
@ 2017-08-14 18:25             ` Willem de Bruijn
  2017-08-14 18:33               ` Thiago Macieira
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 18:25 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Mon, Aug 14, 2017 at 1:02 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday, 14 August 2017 09:33:48 PDT Willem de Bruijn wrote:
>> > But here's a question: if the peek offset is equal to the length, should
>> > the reading return an empty datagram? This would indicate to the caller
>> > that there was a datagram there, which was skipped over.
>>
>> In the general case, no, it should read at the offset, which is the next
>> skb.
>
> I beg to differ. In this particular case, we are talking about datagrams. If
> it were stream sockets, I would agree with you: just skip to the next. But in
> datagrams, the same way you do return zero-sized ones, I would return an empty
> one if you peeked at or past the end.

It can be argued either way. I would not change it in the scope of
this bug.

>
>> Since we only need to change no-offset semantics to fix this bug,
>> I would not change this behavior, which is also expected by some
>> applications by now.
>
> Do applications using SOCK_DGRAM rely on the behaviour of skipping over
> datagrams that are too short?

It is established behavior. It cannot be ruled out that an application
somewhere depends on it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 18:25             ` Willem de Bruijn
@ 2017-08-14 18:33               ` Thiago Macieira
  2017-08-14 18:46                 ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-14 18:33 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote:
> > Do applications using SOCK_DGRAM rely on the behaviour of skipping over
> > datagrams that are too short?
> 
> It is established behavior. It cannot be ruled out that an application
> somewhere depends on it.

Understood.

By the way, what were the usecases for the peek offset feature?

Also, do they apply to non-peeking recv?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 18:33               ` Thiago Macieira
@ 2017-08-14 18:46                 ` Willem de Bruijn
  2017-08-14 18:58                   ` Thiago Macieira
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 18:46 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Mon, Aug 14, 2017 at 2:33 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote:
>> > Do applications using SOCK_DGRAM rely on the behaviour of skipping over
>> > datagrams that are too short?
>>
>> It is established behavior. It cannot be ruled out that an application
>> somewhere depends on it.
>
> Understood.
>
> By the way, what were the usecases for the peek offset feature?

The idea was to be able to peek at application headers of upper
layer protocols and multiplex messages among threads. It proved
so complex even for UDP that we did not attempt the same feature
for TCP. Also, KCM implements demultiplexing using eBPF today.

> Also, do they apply to non-peeking recv?

Reading at an offset is not implemented. Especially for RPC over
TCP, reading at an offset could make sense. Say, if a message
is received completely, but it is head-of-line blocked behind
another that has a hole. Here, too, KCM is an alternative.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 18:46                 ` Willem de Bruijn
@ 2017-08-14 18:58                   ` Thiago Macieira
  2017-08-14 19:03                     ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-14 18:58 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
> > By the way, what were the usecases for the peek offset feature?
> 
> The idea was to be able to peek at application headers of upper
> layer protocols and multiplex messages among threads. It proved
> so complex even for UDP that we did not attempt the same feature
> for TCP. Also, KCM implements demultiplexing using eBPF today.

Interesting, but how would userspace coordinate like that? Suppose multiple 
threads are woken up by a datagram being received, they peek at a certain 
offset shared among them all to see which one reads. Suppose that thread is 
slow or blocked and, while it's getting its act together, another datagram 
arrives.

Because of that, the other threads can't disable their polling. They will 
continually be woken up by the kernel if they go back to poll/select. Even 
with epoll, there's no new edge trigger since event is already at level.

How will they avoid busy-waiting? And won't this secondary coordination 
obviate the need for offset peeking?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 18:58                   ` Thiago Macieira
@ 2017-08-14 19:03                     ` Willem de Bruijn
  2017-08-14 19:15                       ` Thiago Macieira
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 19:03 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
>> > By the way, what were the usecases for the peek offset feature?
>>
>> The idea was to be able to peek at application headers of upper
>> layer protocols and multiplex messages among threads. It proved
>> so complex even for UDP that we did not attempt the same feature
>> for TCP. Also, KCM implements demultiplexing using eBPF today.
>
> Interesting, but how would userspace coordinate like that? Suppose multiple
> threads are woken up by a datagram being received

This assumes a separate listener thread and worker threadpool.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 19:03                     ` Willem de Bruijn
@ 2017-08-14 19:15                       ` Thiago Macieira
  2017-08-14 19:39                         ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-14 19:15 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote:
> On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
> 
> <thiago.macieira@intel.com> wrote:
> > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
> >> > By the way, what were the usecases for the peek offset feature?
> >> 
> >> The idea was to be able to peek at application headers of upper
> >> layer protocols and multiplex messages among threads. It proved
> >> so complex even for UDP that we did not attempt the same feature
> >> for TCP. Also, KCM implements demultiplexing using eBPF today.
> > 
> > Interesting, but how would userspace coordinate like that? Suppose
> > multiple
> > threads are woken up by a datagram being received
> 
> This assumes a separate listener thread and worker threadpool.

The listener thread still needs to synchronise with the worker that got 
activated and wait for it to recv from the socket before the listener thread 
can go back to poll().

If we are really talking about threads in the same process, it might be easier 
for the listener to just read the datagram anyway and pass it on to the 
worker. That way, it can proceed immediately to the next datagram and not have 
to wait for the possibly slow worker.

If it is a separate process, then I don't see another way and this might be 
necessary.

By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full 
datagram's size or is it the size minus the peek offset?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 19:15                       ` Thiago Macieira
@ 2017-08-14 19:39                         ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-14 19:39 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Matthew Dawson, Paolo Abeni, Network Development

On Mon, Aug 14, 2017 at 3:15 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote:
>> On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
>>
>> <thiago.macieira@intel.com> wrote:
>> > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
>> >> > By the way, what were the usecases for the peek offset feature?
>> >>
>> >> The idea was to be able to peek at application headers of upper
>> >> layer protocols and multiplex messages among threads. It proved
>> >> so complex even for UDP that we did not attempt the same feature
>> >> for TCP. Also, KCM implements demultiplexing using eBPF today.
>> >
>> > Interesting, but how would userspace coordinate like that? Suppose
>> > multiple
>> > threads are woken up by a datagram being received
>>
>> This assumes a separate listener thread and worker threadpool.
>
> The listener thread still needs to synchronise with the worker that got
> activated and wait for it to recv from the socket before the listener thread
> can go back to poll().
>
> If we are really talking about threads in the same process, it might be easier
> for the listener to just read the datagram anyway and pass it on to the
> worker. That way, it can proceed immediately to the next datagram and not have
> to wait for the possibly slow worker.
>
> If it is a separate process, then I don't see another way and this might be
> necessary.
>
> By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full
> datagram's size or is it the size minus the peek offset?

udp_recvmsg returns ulen if the flag is passed.

        if (flags & MSG_TRUNC)
                err = ulen;

This is computed earlier in the function as udp payload length and not
modified after.

        ulen = udp_skb_len(skb);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-14 15:31       ` Paolo Abeni
@ 2017-08-15  1:35         ` Willem de Bruijn
  2017-08-15 15:40           ` Paolo Abeni
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-15  1:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Mon, Aug 14, 2017 at 11:31 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Mon, 2017-08-14 at 11:03 -0400, Willem de Bruijn wrote:
>> > I'm actually surprised that only unix sockets can have negative values.  Is
>> > there a reason for that?  I had assumed that sk_set_peek_off would allow
>> > negative values as the code already has to support negative values due to what
>> > the initial value is.
>>
>> A negative initial value indicates that PEEK_OFF is disabled. It only
>> makes sense to peek from a positive offset from the start of the data.
>
> With the current code, the user space can disable peeking with offset
> setting a negative offset value, after that peeking with offset has
> been enabled. But only for UNIX sockets. I think the same should be
> allowed for UDP sockets.

Agreed. Reverting to no-offset should be allowed.

>> > > I'm wondering adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
>> > > would help simplyifing the code:
>>
>> The behavior needs to be bifurcated between peeking with
>> offset and without offset.
>>
>> When peeking with offset is enabled by setting SO_PEEK_OFF,
>> subsequent reads do move the offset, so the observed behavior
>> is correct.
>>
>> When sk->sk_peek_offset is negative, offset mode is disabled
>> and the same packet must be read twice.
>>
>> An explicit boolean flag to discern between the two may make
>> the code simpler to understand, not sure whether that is logically
>> required.
>
> Yes, an explicit PEEK_OFF flag is just to keep the code simpler, so
> that there is no need to add checks at every sk_peek_offset() call site
> and the relevant logic can be fully encapsulated under the MSG_PEEK
> branch in __skb_try_recv_from_queue(), I think/hope.
> It's not a functional requirement.

It is a problematic that sk_peek_offset returns zero both on zero offset
and when peeking at offset is disabled.

It is not infeasible to fix that and fix up all callers, as Matthew's
patch does. But perhaps this patch is simpler to reason about. Thoughts?

+static inline bool sk_peek_at_offset(struct sock *sk)
+{
+       return READ_ONCE(sk->sk_peek_off) >= 0;
+}
+
 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
        if (unlikely(flags & MSG_PEEK)) {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..30b53932af73 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -175,12 +175,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (_off >= skb->len && sk_peek_at_offset(sk) &&
+                           (skb->len || _off || skb->peeked)) {

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-15  1:35         ` Willem de Bruijn
@ 2017-08-15 15:40           ` Paolo Abeni
  2017-08-15 16:45             ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Abeni @ 2017-08-15 15:40 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Mon, 2017-08-14 at 21:35 -0400, Willem de Bruijn wrote:
> It is not infeasible to fix that and fix up all callers, as Matthew's
> patch does. But perhaps this patch is simpler to reason about. Thoughts?
> 
> +static inline bool sk_peek_at_offset(struct sock *sk)
> +{
> +       return READ_ONCE(sk->sk_peek_off) >= 0;
> +}
> +
>  static inline int sk_peek_offset(struct sock *sk, int flags)
>  {
>         if (unlikely(flags & MSG_PEEK)) {
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index ee5647bd91b3..30b53932af73 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -175,12 +175,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
>         *last = queue->prev;
>         skb_queue_walk(queue, skb) {
>                 if (flags & MSG_PEEK) {
> -                       if (_off >= skb->len && (skb->len || _off ||
> -                                                skb->peeked)) {
> +                       if (_off >= skb->len && sk_peek_at_offset(sk) &&
> +                           (skb->len || _off || skb->peeked)) {

I like the idea a lot, but I think/fear that bad things could happen
since sk_peek_off is read twice at different times: one in
sk_peek_offset() and one in the above chunk.

If the above is not an issue (I am not sure) I'm very fine with this
approach.

For the record, I thought something like the following (uncomplete,
does not even compile):
---
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8b13db5163cc..5085cf003b88 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -286,6 +286,7 @@ struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
+#define MSG_PEEK_OFF	0x80000
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..e75e024b55d2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -504,12 +504,14 @@ enum sk_pacing {
 
 int sk_set_peek_off(struct sock *sk, int val);
 
-static inline int sk_peek_offset(struct sock *sk, int flags)
+static inline int sk_peek_offset(struct sock *sk, int *flags)
 {
-	if (unlikely(flags & MSG_PEEK)) {
+	if (unlikely(*flags & MSG_PEEK)) {
 		s32 off = READ_ONCE(sk->sk_peek_off);
-		if (off >= 0)
+		if (off >= 0) {
+			*flags |= MSG_PEEK_OFF;
 			return off;
+		}
 	}
 
 	return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..91e1d014d64c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
 	*last = queue->prev;
 	skb_queue_walk(queue, skb) {
 		if (flags & MSG_PEEK) {
-			if (_off >= skb->len && (skb->len || _off ||
-						 skb->peeked)) {
+			if (flags & MSG_PEEK_OFF && _off >= skb->len &&
+			    (skb->len || _off || skb->peeked)) {
 				_off -= skb->len;
 				continue;
 			}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a7c804f73990..7e1bcd3500f4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1574,7 +1574,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 		return ip_recv_error(sk, msg, len, addr_len);
 
 try_again:
-	peeking = off = sk_peek_offset(sk, flags);
+	peeking = off = sk_peek_offset(sk, &flags);
 	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
 	if (!skb)
 		return err;
---
Paolo

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-15 15:40           ` Paolo Abeni
@ 2017-08-15 16:45             ` Willem de Bruijn
  2017-08-15 17:00               ` Willem de Bruijn
  2017-08-15 18:17               ` Paolo Abeni
  0 siblings, 2 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-15 16:45 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Tue, Aug 15, 2017 at 11:40 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Mon, 2017-08-14 at 21:35 -0400, Willem de Bruijn wrote:
>> It is not infeasible to fix that and fix up all callers, as Matthew's
>> patch does. But perhaps this patch is simpler to reason about. Thoughts?
>>
>> +static inline bool sk_peek_at_offset(struct sock *sk)
>> +{
>> +       return READ_ONCE(sk->sk_peek_off) >= 0;
>> +}
>> +
>>  static inline int sk_peek_offset(struct sock *sk, int flags)
>>  {
>>         if (unlikely(flags & MSG_PEEK)) {
>> diff --git a/net/core/datagram.c b/net/core/datagram.c
>> index ee5647bd91b3..30b53932af73 100644
>> --- a/net/core/datagram.c
>> +++ b/net/core/datagram.c
>> @@ -175,12 +175,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
>>         *last = queue->prev;
>>         skb_queue_walk(queue, skb) {
>>                 if (flags & MSG_PEEK) {
>> -                       if (_off >= skb->len && (skb->len || _off ||
>> -                                                skb->peeked)) {
>> +                       if (_off >= skb->len && sk_peek_at_offset(sk) &&
>> +                           (skb->len || _off || skb->peeked)) {
>
> I like the idea a lot, but I think/fear that bad things could happen
> since sk_peek_off is read twice at different times: one in
> sk_peek_offset() and one in the above chunk.
>
> If the above is not an issue (I am not sure) I'm very fine with this
> approach.

Good point. I don't think we need to read it in the callers at all.
Then int *off is only used to return to offset within the skb back
to the callers.

Though it should be read only once here, outside the loop.

Something like (also untested):

diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..06bad8726612 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -170,13 +170,15 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
                                          struct sk_buff **last)
 {
        struct sk_buff *skb;
-       int _off = *off;
+       int _off;
+       bool peek_at_off;

+       _off = sk_peek_offset(sk, flags);
+       peek_at_off = _off >= 0;
        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (peek_at_off && off >= skb->len &&
+                           (skb->len || _off || skb->peeked)) {
                                _off -= skb->len;
                                continue;
                        }

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a7c804f73990..4b51b9853406 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1574,7 +1574,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len, int noblock,
                return ip_recv_error(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)
                return err;

The explicit peek_at_off is a bit verbose, but we need to be careful
about signedness between _off and skb->len.

> For the record, I thought something like the following (uncomplete,
> does not even compile):
> ---
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 8b13db5163cc..5085cf003b88 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -286,6 +286,7 @@ struct ucred {
>  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
>  #define MSG_BATCH      0x40000 /* sendmmsg(): more messages coming */
>  #define MSG_EOF         MSG_FIN
> +#define MSG_PEEK_OFF   0x80000

Yes, that also works well.

I'm afraid about exhausting the MSG_* flag space here for a
feature that is not exposed to userspace. We don't have many flags
left. We could shadow an existing flag that is unused in this context.

There is another difference between reading sk_peek_offset in the
caller or in __skb_try_recv_from_queue. The latter is called repeatedly
when it returns NULL. Each call can modify *off. I believe that it needs
to restart with _off at sk->sk_peek_off each time, as it restarts from the
head of the queue each time.

>
>  #define MSG_FASTOPEN   0x20000000      /* Send data in TCP SYN */
>  #define MSG_CMSG_CLOEXEC 0x40000000    /* Set close_on_exec for file
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7c0632c7e870..e75e024b55d2 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -504,12 +504,14 @@ enum sk_pacing {
>
>  int sk_set_peek_off(struct sock *sk, int val);
>
> -static inline int sk_peek_offset(struct sock *sk, int flags)
> +static inline int sk_peek_offset(struct sock *sk, int *flags)
>  {
> -       if (unlikely(flags & MSG_PEEK)) {
> +       if (unlikely(*flags & MSG_PEEK)) {
>                 s32 off = READ_ONCE(sk->sk_peek_off);
> -               if (off >= 0)
> +               if (off >= 0) {
> +                       *flags |= MSG_PEEK_OFF;
>                         return off;
> +               }
>         }
>
>         return 0;
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index ee5647bd91b3..91e1d014d64c 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
>         *last = queue->prev;
>         skb_queue_walk(queue, skb) {
>                 if (flags & MSG_PEEK) {
> -                       if (_off >= skb->len && (skb->len || _off ||
> -                                                skb->peeked)) {
> +                       if (flags & MSG_PEEK_OFF && _off >= skb->len &&
> +                           (skb->len || _off || skb->peeked)) {
>                                 _off -= skb->len;
>                                 continue;
>                         }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a7c804f73990..7e1bcd3500f4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1574,7 +1574,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
>                 return ip_recv_error(sk, msg, len, addr_len);
>
>  try_again:
> -       peeking = off = sk_peek_offset(sk, flags);
> +       peeking = off = sk_peek_offset(sk, &flags);
>         skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
>         if (!skb)
>                 return err;
> ---
> Paolo

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-15 16:45             ` Willem de Bruijn
@ 2017-08-15 17:00               ` Willem de Bruijn
  2017-08-16  9:28                 ` Paolo Abeni
  2017-08-15 18:17               ` Paolo Abeni
  1 sibling, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-15 17:00 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

> There is another difference between reading sk_peek_offset in the
> caller or in __skb_try_recv_from_queue. The latter is called repeatedly
> when it returns NULL. Each call can modify *off. I believe that it needs
> to restart with _off at sk->sk_peek_off each time, as it restarts from the
> head of the queue each time.

I made a mistake here. *off is not updated when returning NULL.

In that case, it is better to read sk_peek_offset once, than to read
it each time __skb_try_recv_from_queue is entered.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-15 16:45             ` Willem de Bruijn
  2017-08-15 17:00               ` Willem de Bruijn
@ 2017-08-15 18:17               ` Paolo Abeni
  1 sibling, 0 replies; 31+ messages in thread
From: Paolo Abeni @ 2017-08-15 18:17 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Tue, 2017-08-15 at 12:45 -0400, Willem de Bruijn wrote:
> On Tue, Aug 15, 2017 at 11:40 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > For the record, I thought something like the following (uncomplete,
> > does not even compile):
> > ---
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index 8b13db5163cc..5085cf003b88 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -286,6 +286,7 @@ struct ucred {
> >  #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
> >  #define MSG_BATCH      0x40000 /* sendmmsg(): more messages coming */
> >  #define MSG_EOF         MSG_FIN
> > +#define MSG_PEEK_OFF   0x80000
> 
> Yes, that also works well.
> 
> I'm afraid about exhausting the MSG_* flag space here for a
> feature that is not exposed to userspace. We don't have many flags
> left. We could shadow an existing flag that is unused in this context.

That was my concern, too. Fortunately there are a bunch of flags
defined but apparently unused (MSG_FIN, MSG_SYN, MSG_RST) since long
time (if I'm not too low on coffee).

We can shadow one of them (and ev. drop the above define, if really
unused).

I think that the MSG_PEEK_OFF should be explicitly cleared in
sk_peek_offset() when the 'sk_peek_off' is negative, to avoid beeing
fooled by stray bits into the 'flags' argument.

I'll try to scatch-up something tomorrow.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-15 17:00               ` Willem de Bruijn
@ 2017-08-16  9:28                 ` Paolo Abeni
  2017-08-16 15:18                   ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Abeni @ 2017-08-16  9:28 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Tue, 2017-08-15 at 13:00 -0400, Willem de Bruijn wrote:
> > There is another difference between reading sk_peek_offset in the
> > caller or in __skb_try_recv_from_queue. The latter is called repeatedly
> > when it returns NULL. Each call can modify *off. I believe that it needs
> > to restart with _off at sk->sk_peek_off each time, as it restarts from the
> > head of the queue each time.
> 
> I made a mistake here. *off is not updated when returning NULL.
> 
> In that case, it is better to read sk_peek_offset once, than to read
> it each time __skb_try_recv_from_queue is entered.

If I read the above correctly, you are arguining in favor of the
addittional flag version, right?

Regarding the MSG flag exaustion, there are a bunch of flags defined
but apparently unused (MSG_FIN, MSG_SYN, MSG_RST) since long time (if
I'm not too low on coffee).

We can shadow one of them (or ev. drop the above defines, if really
unused).

I think that the MSG_PEEK_OFF should be explicitly cleared in
sk_peek_offset() when the sk_peek_off is negative, to avoid beeing
fooled by stray bits passed by the us, like in the following:
---
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8b13db5163cc..3b7f53b9cc08 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -286,6 +286,7 @@ struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
+#define MSG_PEEK_OFF	MSG_FIN /* Peeking with offset for datagram sockets */
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
diff --git a/include/net/sock.h b/include/net/sock.h
index 7c0632c7e870..452f9aac2e6a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -504,12 +504,17 @@ enum sk_pacing {
 
 int sk_set_peek_off(struct sock *sk, int val);
 
-static inline int sk_peek_offset(struct sock *sk, int flags)
+static inline int sk_peek_offset(struct sock *sk, int *flags)
 {
-	if (unlikely(flags & MSG_PEEK)) {
+	if (unlikely(*flags & MSG_PEEK)) {
 		s32 off = READ_ONCE(sk->sk_peek_off);
-		if (off >= 0)
+		if (off >= 0) {
+			*flags |= MSG_PEEK_OFF;
 			return off;
+		}
+
+		/* clear evental stray bits in the user-provided bitmask */
+		*flags &= ~MSG_PEEK_OFF;
 	}
 
 	return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..91e1d014d64c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -175,8 +175,8 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
 	*last = queue->prev;
 	skb_queue_walk(queue, skb) {
 		if (flags & MSG_PEEK) {
-			if (_off >= skb->len && (skb->len || _off ||
-						 skb->peeked)) {
+			if (flags & MSG_PEEK_OFF && _off >= skb->len &&
+			    (skb->len || _off || skb->peeked)) {
 				_off -= skb->len;
 				continue;
 			}
diff --git a/net/core/sock.c b/net/core/sock.c
index ac2a404c73eb..0c123540b02f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2408,9 +2408,7 @@ EXPORT_SYMBOL(__sk_mem_reclaim);
 
 int sk_set_peek_off(struct sock *sk, int val)
 {
-	if (val < 0)
-		return -EINVAL;
-
+	/* a negative value will disable peeking with offset */
 	sk->sk_peek_off = val;
 	return 0;
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a7c804f73990..7e1bcd3500f4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1574,7 +1574,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 		return ip_recv_error(sk, msg, len, addr_len);
 
 try_again:
-	peeking = off = sk_peek_offset(sk, flags);
+	peeking = off = sk_peek_offset(sk, &flags);
 	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
 	if (!skb)
 		return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 578142b7ca3e..86fb4ff8934c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -362,7 +362,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
 
 try_again:
-	peeking = off = sk_peek_offset(sk, flags);
+	peeking = off = sk_peek_offset(sk, &flags);
 	skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
 	if (!skb)
 		return err;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7b52a380d710..06c740939d9d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2124,7 +2124,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 	do {
 		mutex_lock(&u->iolock);
 
-		skip = sk_peek_offset(sk, flags);
+		skip = sk_peek_offset(sk, &flags);
 		skb = __skb_try_recv_datagram(sk, flags, NULL, &peeked, &skip,
 					      &err, &last);
 		if (skb)
@@ -2304,10 +2304,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state,
         */
        mutex_lock(&u->iolock);
 
-       if (flags & MSG_PEEK)
-               skip = sk_peek_offset(sk, flags);
-       else
-               skip = 0;
+       skip = sk_peek_offset(sk, &flags);
 
        do {
                int chunk;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16  9:28                 ` Paolo Abeni
@ 2017-08-16 15:18                   ` Willem de Bruijn
  2017-08-16 20:20                     ` Paolo Abeni
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-16 15:18 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

> If I read the above correctly, you are arguining in favor of the
> addittional flag version, right?

I was. Though if we are going to thread the argument from the caller
to __skb_try_recv_from_queue to avoid rereading sk->sk_peek_off,
on second thought it might be simpler to do it through off:

@@ -511,7 +511,9 @@ static inline int sk_peek_offset(struct sock *sk, int flags)
        if (unlikely(flags & MSG_PEEK)) {
                s32 off = READ_ONCE(sk->sk_peek_off);
                if (off >= 0)
-                       return off;
+                       return off + 1;
+               else
+                       return 0;
        }

        return 0;

In __skb_try_recv_from_queue we can then disambiguate the two as follows:

@@ -170,13 +170,19 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
                                          struct sk_buff **last)
 {
        struct sk_buff *skb;
-       int _off = *off;
+       bool peek_at_off = false;
+       int _off = 0;
+
+       if (flags & MSG_PEEK && *off) {
+               peek_at_off = true;
+               _off = (*off) - 1;
+       }

        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (peek_at_off && _off >= skb->len &&
+                           (skb->len || _off || skb->peeked)) {


This, of course, requires restricting sk_peek_off to protect against overflow.

If I'm not mistaken, the test in udp_recvmsg currently incorrectly sets
peeking to false when peeking at offset zero:

        peeking = off = sk_peek_offset(sk, flags);


> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2408,9 +2408,7 @@ EXPORT_SYMBOL(__sk_mem_reclaim);
>
>  int sk_set_peek_off(struct sock *sk, int val)
>  {
> -       if (val < 0)
> -               return -EINVAL;
> -
> +       /* a negative value will disable peeking with offset */
>         sk->sk_peek_off = val;
>         return 0;
>  }

Separate patch to net-next?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 15:18                   ` Willem de Bruijn
@ 2017-08-16 20:20                     ` Paolo Abeni
  2017-08-16 23:27                       ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Paolo Abeni @ 2017-08-16 20:20 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Wed, 2017-08-16 at 11:18 -0400, Willem de Bruijn wrote:
> > If I read the above correctly, you are arguining in favor of the
> > addittional flag version, right?
> 
> I was. Though if we are going to thread the argument from the caller
> to __skb_try_recv_from_queue to avoid rereading sk->sk_peek_off,
> on second thought it might be simpler to do it through off:
[...]
> This, of course, requires restricting sk_peek_off to protect against overflow.

Ok, even if I'm not 100% sure overall this will be simpler when adding
also the overflow check.

> If I'm not mistaken, the test in udp_recvmsg currently incorrectly sets
> peeking to false when peeking at offset zero:
> 
>         peeking = off = sk_peek_offset(sk, flags);

I think you are right, does not look correct.

> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -2408,9 +2408,7 @@ EXPORT_SYMBOL(__sk_mem_reclaim);
> > 
> >  int sk_set_peek_off(struct sock *sk, int val)
> >  {
> > -       if (val < 0)
> > -               return -EINVAL;
> > -
> > +       /* a negative value will disable peeking with offset */
> >         sk->sk_peek_off = val;
> >         return 0;
> >  }
> 
> Separate patch to net-next?

Agreed.

Paolo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 20:20                     ` Paolo Abeni
@ 2017-08-16 23:27                       ` Willem de Bruijn
  2017-08-16 23:40                         ` Willem de Bruijn
                                           ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-16 23:27 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

On Wed, Aug 16, 2017 at 4:20 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> On Wed, 2017-08-16 at 11:18 -0400, Willem de Bruijn wrote:
>> > If I read the above correctly, you are arguining in favor of the
>> > addittional flag version, right?
>>
>> I was. Though if we are going to thread the argument from the caller
>> to __skb_try_recv_from_queue to avoid rereading sk->sk_peek_off,
>> on second thought it might be simpler to do it through off:
> [...]
>> This, of course, requires restricting sk_peek_off to protect against overflow.
>
> Ok, even if I'm not 100% sure overall this will be simpler when adding
> also the overflow check.

Actually, it is safe even without the check. Overflow of the signed integer
is benign here.

>> If I'm not mistaken, the test in udp_recvmsg currently incorrectly sets
>> peeking to false when peeking at offset zero:
>>
>>         peeking = off = sk_peek_offset(sk, flags);
>
> I think you are right, does not look correct.

By shifting the offset by two, we could even make both assignments
become correct. Return 0 without peek, 1 on peek without SO_PEEK_OFF,
2+ otherwise, including overflow up to INT_MIN + 1.

But the end result is more readable if we just separate those two
assignments.

@@ -1574,7 +1574,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len, int noblock,
                return ip_recv_error(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
+       off = sk_peek_offset(sk, flags);
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)
                return err;

@@ -362,7 +362,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr
*msg, size_t len,
                return ipv6_recv_rxpmtu(sk, msg, len, addr_len);

 try_again:
-       peeking = off = sk_peek_offset(sk, flags);
+       peeking = flags & MSG_PEEK;
+       off = sk_peek_offset(sk, flags);
        skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
        if (!skb)
                return err;

At which point there is also no longer a need for the variable shift
at sk_peek_offset. Just pass the raw value down to
__skb_try_recv_from_queue and disambiguate there:

@@ -506,11 +506,8 @@ int sk_set_peek_off(struct sock *sk, int val);

 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
-       if (unlikely(flags & MSG_PEEK)) {
-               s32 off = READ_ONCE(sk->sk_peek_off);
-               if (off >= 0)
-                       return off;
-       }
+       if (unlikely(flags & MSG_PEEK))
+               return READ_ONCE(sk->sk_peek_off);

        return 0;
 }

@@ -169,14 +169,20 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
                                          int *peeked, int *off, int *err,
                                          struct sk_buff **last)
 {
+       bool peek_at_off = false;
        struct sk_buff *skb;
-       int _off = *off;
+       int _off = 0;
+
+       if (flags & MSG_PEEK && (*off) >= 0) {
+               peek_at_off = true;
+               _off = *off;
+       }

        *last = queue->prev;
        skb_queue_walk(queue, skb) {
                if (flags & MSG_PEEK) {
-                       if (_off >= skb->len && (skb->len || _off ||
-                                                skb->peeked)) {
+                       if (peek_at_off && _off >= skb->len &&
+                           (skb->len || _off || skb->peeked)) {
                                _off -= skb->len;
                                continue;
                        }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 23:27                       ` Willem de Bruijn
@ 2017-08-16 23:40                         ` Willem de Bruijn
  2017-08-16 23:55                         ` Thiago Macieira
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-16 23:40 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: Matthew Dawson, Network Development, Macieira, Thiago

>  static inline int sk_peek_offset(struct sock *sk, int flags)
>  {
> -       if (unlikely(flags & MSG_PEEK)) {
> -               s32 off = READ_ONCE(sk->sk_peek_off);
> -               if (off >= 0)
> -                       return off;
> -       }
> +       if (unlikely(flags & MSG_PEEK))
> +               return READ_ONCE(sk->sk_peek_off);
>
>         return 0;
>  }

I noticed too late that this function is used in one case where the
result is not blindly passed to __skb_try_recv_from_queue.

The AF_UNIX stream case will need a floor on its offset.

-               skip = sk_peek_offset(sk, flags);
+               skip = max(sk_peek_offset(sk, flags), 0);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 23:27                       ` Willem de Bruijn
  2017-08-16 23:40                         ` Willem de Bruijn
@ 2017-08-16 23:55                         ` Thiago Macieira
  2017-08-17  0:10                           ` Willem de Bruijn
  2017-08-17  9:15                         ` David Laight
  2017-08-17 15:47                         ` Matthew Dawson
  3 siblings, 1 reply; 31+ messages in thread
From: Thiago Macieira @ 2017-08-16 23:55 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Paolo Abeni, Matthew Dawson, Network Development

On Wednesday, 16 August 2017 16:27:17 PDT Willem de Bruijn wrote:
> Actually, it is safe even without the check. Overflow of the signed integer
> is benign here.

Usually, it's a bad idea to allow UB to happen.

Where is the overflow? I didn't see it in the patches so far.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 23:55                         ` Thiago Macieira
@ 2017-08-17  0:10                           ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-17  0:10 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: Paolo Abeni, Matthew Dawson, Network Development

On Wed, Aug 16, 2017 at 7:55 PM, Thiago Macieira
<thiago.macieira@intel.com> wrote:
> On Wednesday, 16 August 2017 16:27:17 PDT Willem de Bruijn wrote:
>> Actually, it is safe even without the check. Overflow of the signed integer
>> is benign here.
>
> Usually, it's a bad idea to allow UB to happen.
>
> Where is the overflow? I didn't see it in the patches so far.

There isn't. Instead of using a shift, as I proposed earlier, the latest
patch just passes the raw value of sk_peek_off.

The shift was an attempt to disambiguate between the cases that
return zero in sk_peek_offset:
(a) peek without SO_PEEK_OFF:
      each consecutive peek must look at the first packet
(b) peek with offset zero:
      peek at the first packet, move forward with each call

After coding up a version that shifts by one to differentiate the two
cases of zero it was obvious that there is no need to shift at all:
the case without SO_PEEK_OFF is identified by all negative values
if sk_peek_offset just does not explicitly return 0 for this case.

Of course, then all callers need to be verified to correctly handle
negative values. __sk_try_recv_from_queue uses this information.
The only other cases, unix stream, does not care about the difference,
as the bytestream has no record separators.

What is UB in this context?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 23:27                       ` Willem de Bruijn
  2017-08-16 23:40                         ` Willem de Bruijn
  2017-08-16 23:55                         ` Thiago Macieira
@ 2017-08-17  9:15                         ` David Laight
  2017-08-17 14:37                           ` Willem de Bruijn
  2017-08-17 15:47                         ` Matthew Dawson
  3 siblings, 1 reply; 31+ messages in thread
From: David Laight @ 2017-08-17  9:15 UTC (permalink / raw)
  To: 'Willem de Bruijn', Paolo Abeni
  Cc: Matthew Dawson, Network Development, Macieira, Thiago

From: Willem de Bruijn
> Sent: 17 August 2017 00:27
> Actually, it is safe even without the check. Overflow of the signed integer
> is benign here.

IIRC the C language states that 'signed integer overflow' is undefined.
So 'MAXINT + 1' doesn't have to equal '-MAXINT - 1' (as one would
expect on a 2's compliment system).

While the linux kernel probably won't run on systems where this isn't true
(eg where signed arithmetic saturates) gcc will assume it can't happen
and optimise code with that assumption.

This may not matter here ...

	David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-17  9:15                         ` David Laight
@ 2017-08-17 14:37                           ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-17 14:37 UTC (permalink / raw)
  To: David Laight
  Cc: Paolo Abeni, Matthew Dawson, Network Development, Macieira, Thiago

On Thu, Aug 17, 2017 at 5:15 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Willem de Bruijn
>> Sent: 17 August 2017 00:27
>> Actually, it is safe even without the check. Overflow of the signed integer
>> is benign here.
>
> IIRC the C language states that 'signed integer overflow' is undefined.
> So 'MAXINT + 1' doesn't have to equal '-MAXINT - 1' (as one would
> expect on a 2's compliment system).
>
> While the linux kernel probably won't run on systems where this isn't true
> (eg where signed arithmetic saturates) gcc will assume it can't happen
> and optimise code with that assumption.

Ah, of course. Thanks. The last patch does not rely on such tricks, indeed.

On rereading, it is actually very similar to Matthew's original. The
main difference is handling the negative offset case inside
__skb_try_recv_from_queue instead of in each caller.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-16 23:27                       ` Willem de Bruijn
                                           ` (2 preceding siblings ...)
  2017-08-17  9:15                         ` David Laight
@ 2017-08-17 15:47                         ` Matthew Dawson
  2017-08-17 16:45                           ` Willem de Bruijn
  3 siblings, 1 reply; 31+ messages in thread
From: Matthew Dawson @ 2017-08-17 15:47 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Paolo Abeni, Network Development, Macieira, Thiago

[-- Attachment #1: Type: text/plain, Size: 4304 bytes --]

On Wednesday, August 16, 2017 7:27:17 PM EDT Willem de Bruijn wrote:
> On Wed, Aug 16, 2017 at 4:20 PM, Paolo Abeni <pabeni@redhat.com> wrote:
> > On Wed, 2017-08-16 at 11:18 -0400, Willem de Bruijn wrote:
> >> > If I read the above correctly, you are arguining in favor of the
> >> > addittional flag version, right?
> >> 
> >> I was. Though if we are going to thread the argument from the caller
> >> to __skb_try_recv_from_queue to avoid rereading sk->sk_peek_off,
> > 
> >> on second thought it might be simpler to do it through off:
> > [...]
> > 
> >> This, of course, requires restricting sk_peek_off to protect against
> >> overflow.> 
> > Ok, even if I'm not 100% sure overall this will be simpler when adding
> > also the overflow check.
> 
> Actually, it is safe even without the check. Overflow of the signed integer
> is benign here.
> 
> >> If I'm not mistaken, the test in udp_recvmsg currently incorrectly sets
> >> 
> >> peeking to false when peeking at offset zero:
> >>         peeking = off = sk_peek_offset(sk, flags);
> > 
> > I think you are right, does not look correct.
> 
> By shifting the offset by two, we could even make both assignments
> become correct. Return 0 without peek, 1 on peek without SO_PEEK_OFF,
> 2+ otherwise, including overflow up to INT_MIN + 1.
> 
> But the end result is more readable if we just separate those two
> assignments.
> 
> @@ -1574,7 +1574,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr
> *msg, size_t len, int noblock,
>                 return ip_recv_error(sk, msg, len, addr_len);
> 
>  try_again:
> -       peeking = off = sk_peek_offset(sk, flags);
> +       peeking = flags & MSG_PEEK;
> +       off = sk_peek_offset(sk, flags);
>         skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
>         if (!skb)
>                 return err;
> 
> @@ -362,7 +362,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr
> *msg, size_t len,
>                 return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
> 
>  try_again:
> -       peeking = off = sk_peek_offset(sk, flags);
> +       peeking = flags & MSG_PEEK;
> +       off = sk_peek_offset(sk, flags);
>         skb = __skb_recv_udp(sk, flags, noblock, &peeked, &off, &err);
>         if (!skb)
>                 return err;
> 
> At which point there is also no longer a need for the variable shift
> at sk_peek_offset. Just pass the raw value down to
> __skb_try_recv_from_queue and disambiguate there:
> 
> @@ -506,11 +506,8 @@ int sk_set_peek_off(struct sock *sk, int val);
> 
>  static inline int sk_peek_offset(struct sock *sk, int flags)
>  {
> -       if (unlikely(flags & MSG_PEEK)) {
> -               s32 off = READ_ONCE(sk->sk_peek_off);
> -               if (off >= 0)
> -                       return off;
> -       }
> +       if (unlikely(flags & MSG_PEEK))
> +               return READ_ONCE(sk->sk_peek_off);
> 
>         return 0;
>  }
> 
> @@ -169,14 +169,20 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock
> *sk, int *peeked, int *off, int *err, struct sk_buff **last)
>  {
> +       bool peek_at_off = false;
>         struct sk_buff *skb;
> -       int _off = *off;
> +       int _off = 0;
> +
> +       if (flags & MSG_PEEK && (*off) >= 0) {
> +               peek_at_off = true;
> +               _off = *off;
> +       }
> 
>         *last = queue->prev;
>         skb_queue_walk(queue, skb) {
>                 if (flags & MSG_PEEK) {
> -                       if (_off >= skb->len && (skb->len || _off ||
> -                                                skb->peeked)) {
> +                       if (peek_at_off && _off >= skb->len &&
> +                           (skb->len || _off || skb->peeked)) {
                                ^ I'm pretty sure we can remove this check 
(that skb->len is not zero) in this if statement.  If _off is zero, then skb-
>len must also be zero (since _off >= skb->len, if _off is 0, skb->len <= 0.  
If skb->len can't be negative, then skb->len <= 0 => skb->len == 0).  If _off 
is not zero, then checking skb->len is redundant.
>                                 _off -= skb->len;
>                                 continue;
>                         }
Is this queued to go in already?  Or can I help by updating my patch with what 
was discussed here?  I can do that today if wanted.

-- 
Matthew

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs
  2017-08-17 15:47                         ` Matthew Dawson
@ 2017-08-17 16:45                           ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2017-08-17 16:45 UTC (permalink / raw)
  To: Matthew Dawson; +Cc: Paolo Abeni, Network Development, Macieira, Thiago

>> @@ -169,14 +169,20 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock
>> *sk, int *peeked, int *off, int *err, struct sk_buff **last)
>>  {
>> +       bool peek_at_off = false;
>>         struct sk_buff *skb;
>> -       int _off = *off;
>> +       int _off = 0;
>> +
>> +       if (flags & MSG_PEEK && (*off) >= 0) {
>> +               peek_at_off = true;
>> +               _off = *off;
>> +       }
>>
>>         *last = queue->prev;
>>         skb_queue_walk(queue, skb) {
>>                 if (flags & MSG_PEEK) {
>> -                       if (_off >= skb->len && (skb->len || _off ||
>> -                                                skb->peeked)) {
>> +                       if (peek_at_off && _off >= skb->len &&
>> +                           (skb->len || _off || skb->peeked)) {
>                                 ^ I'm pretty sure we can remove this check
> (that skb->len is not zero) in this if statement.  If _off is zero, then skb-
>>len must also be zero (since _off >= skb->len, if _off is 0, skb->len <= 0.
> If skb->len can't be negative, then skb->len <= 0 => skb->len == 0).  If _off
> is not zero, then checking skb->len is redundant.

Good point.

>>                                 _off -= skb->len;
>>                                 continue;
>>                         }
> Is this queued to go in already? Or can I help by updating my patch with what
> was discussed here?  I can do that today if wanted.

It isn't. Please do if no one else has additional comments.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-08-17 16:46 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-14  5:52 [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs Matthew Dawson
2017-08-14  9:27 ` Paolo Abeni
2017-08-14 14:05   ` Matthew Dawson
2017-08-14 15:03     ` Willem de Bruijn
2017-08-14 15:31       ` Paolo Abeni
2017-08-15  1:35         ` Willem de Bruijn
2017-08-15 15:40           ` Paolo Abeni
2017-08-15 16:45             ` Willem de Bruijn
2017-08-15 17:00               ` Willem de Bruijn
2017-08-16  9:28                 ` Paolo Abeni
2017-08-16 15:18                   ` Willem de Bruijn
2017-08-16 20:20                     ` Paolo Abeni
2017-08-16 23:27                       ` Willem de Bruijn
2017-08-16 23:40                         ` Willem de Bruijn
2017-08-16 23:55                         ` Thiago Macieira
2017-08-17  0:10                           ` Willem de Bruijn
2017-08-17  9:15                         ` David Laight
2017-08-17 14:37                           ` Willem de Bruijn
2017-08-17 15:47                         ` Matthew Dawson
2017-08-17 16:45                           ` Willem de Bruijn
2017-08-15 18:17               ` Paolo Abeni
2017-08-14 16:06       ` Thiago Macieira
2017-08-14 16:33         ` Willem de Bruijn
2017-08-14 17:02           ` Thiago Macieira
2017-08-14 18:25             ` Willem de Bruijn
2017-08-14 18:33               ` Thiago Macieira
2017-08-14 18:46                 ` Willem de Bruijn
2017-08-14 18:58                   ` Thiago Macieira
2017-08-14 19:03                     ` Willem de Bruijn
2017-08-14 19:15                       ` Thiago Macieira
2017-08-14 19:39                         ` Willem de Bruijn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.