All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
@ 2012-02-10 13:54 Pavel Emelyanov
  2012-02-10 14:41 ` Eric Dumazet
  2012-02-15 20:52 ` David Miller
  0 siblings, 2 replies; 6+ messages in thread
From: Pavel Emelyanov @ 2012-02-10 13:54 UTC (permalink / raw)
  To: David Miller, Tejun Heo, Linux Netdev List

We're working on the checkpoint-restore project. To checkpoint a unix socket
we need to read its skb queue. Analogous task for TCP sockets Tejun proposed
to solve with parasite + recvmsg + MSG_PEEK. That's nice, but doesn't work
for unix sockets, because for them MSG_PEEK always peeks a single skb from the 
head of the queue.

I propose to extend the MSG_PEEK functionality with two more flags that peek
either next not picked skb in queue or pick the last picked one. The latter
ability is required to make it possible to re-read a message if MSG_TRUNC
was reported on it.

These two flags can be used for unix stream sockets, since making the MSG_PEEK
just report all data that fits the buffer length is bad -- we may have scms
in queue thus turning stream socket into dgram one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 include/linux/socket.h |    3 ++
 net/core/datagram.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index d0e77f6..ab3aa19 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -266,6 +266,9 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 
+#define MSG_PEEK_MORE	0x20000
+#define MSG_PEEK_AGAIN	0x40000
+
 #define MSG_EOF         MSG_FIN
 
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 68bbf9f..c330e40 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -157,6 +157,40 @@ out_noerr:
  *	quite explicitly by POSIX 1003.1g, don't change them without having
  *	the standard around please.
  */
+
+/*
+ * Peek the last skb marked as peeked
+ */
+
+static struct sk_buff *skb_peek_again(struct sk_buff_head *queue)
+{
+	struct sk_buff *skb, *prev = NULL;
+
+	skb_queue_walk(queue, skb) {
+		if (skb->peeked)
+			prev = skb;
+		else
+			break;
+	}
+
+	return prev;
+}
+
+/*
+ * Peek the first not peeked skb
+ */
+
+static struct sk_buff *skb_peek_more(struct sk_buff_head *queue)
+{
+	struct sk_buff *skb;
+
+	skb_queue_walk(queue, skb)
+		if (!skb->peeked)
+			return skb;
+
+	return NULL;
+}
+
 struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
 				    int *peeked, int *err)
 {
@@ -180,18 +214,28 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
 		 * However, this function was correct in any case. 8)
 		 */
 		unsigned long cpu_flags;
-
-		spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
-		skb = skb_peek(&sk->sk_receive_queue);
+		struct sk_buff_head *queue = &sk->sk_receive_queue;
+
+		spin_lock_irqsave(&queue->lock, cpu_flags);
+		if (flags & MSG_PEEK) {
+			if (flags & MSG_PEEK_MORE)
+				skb = skb_peek_more(queue);
+			else if (flags & MSG_PEEK_AGAIN)
+				skb = skb_peek_again(queue);
+			else
+				skb = skb_peek(queue);
+		} else
+			skb = skb_peek(queue);
+	
 		if (skb) {
 			*peeked = skb->peeked;
 			if (flags & MSG_PEEK) {
 				skb->peeked = 1;
 				atomic_inc(&skb->users);
 			} else
-				__skb_unlink(skb, &sk->sk_receive_queue);
+				__skb_unlink(skb, queue);
 		}
-		spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
+		spin_unlock_irqrestore(&queue->lock, cpu_flags);
 
 		if (skb)
 			return skb;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
  2012-02-10 13:54 [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing Pavel Emelyanov
@ 2012-02-10 14:41 ` Eric Dumazet
  2012-02-10 14:52   ` Pavel Emelyanov
  2012-02-15 20:52 ` David Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2012-02-10 14:41 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: David Miller, Tejun Heo, Linux Netdev List

Le vendredi 10 février 2012 à 17:54 +0400, Pavel Emelyanov a écrit :
> We're working on the checkpoint-restore project. To checkpoint a unix socket
> we need to read its skb queue. Analogous task for TCP sockets Tejun proposed
> to solve with parasite + recvmsg + MSG_PEEK. That's nice, but doesn't work
> for unix sockets, because for them MSG_PEEK always peeks a single skb from the 
> head of the queue.
> 
> I propose to extend the MSG_PEEK functionality with two more flags that peek
> either next not picked skb in queue or pick the last picked one. The latter
> ability is required to make it possible to re-read a message if MSG_TRUNC
> was reported on it.
> 
> These two flags can be used for unix stream sockets, since making the MSG_PEEK
> just report all data that fits the buffer length is bad -- we may have scms
> in queue thus turning stream socket into dgram one.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> 
> ---

Nice !

So this CR stuff assumes an application wont use itself MSG_PEEK /
MORE / AGAIN ?

(skb->peeked can only be set, never unset)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
  2012-02-10 14:41 ` Eric Dumazet
@ 2012-02-10 14:52   ` Pavel Emelyanov
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Emelyanov @ 2012-02-10 14:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Tejun Heo, Linux Netdev List

On 02/10/2012 06:41 PM, Eric Dumazet wrote:
> Le vendredi 10 février 2012 à 17:54 +0400, Pavel Emelyanov a écrit :
>> We're working on the checkpoint-restore project. To checkpoint a unix socket
>> we need to read its skb queue. Analogous task for TCP sockets Tejun proposed
>> to solve with parasite + recvmsg + MSG_PEEK. That's nice, but doesn't work
>> for unix sockets, because for them MSG_PEEK always peeks a single skb from the 
>> head of the queue.
>>
>> I propose to extend the MSG_PEEK functionality with two more flags that peek
>> either next not picked skb in queue or pick the last picked one. The latter
>> ability is required to make it possible to re-read a message if MSG_TRUNC
>> was reported on it.
>>
>> These two flags can be used for unix stream sockets, since making the MSG_PEEK
>> just report all data that fits the buffer length is bad -- we may have scms
>> in queue thus turning stream socket into dgram one.
>>
>> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>>
>> ---
> 
> Nice !
> 
> So this CR stuff assumes an application wont use itself MSG_PEEK /
> MORE / AGAIN ?

:( Not exactly. MSG_PEEK will still work, but PEEK_MORE/PEEK_AGAIN will not.

Hm... This means that the state of "what was peek-ed already" should be on
the user side. Can we pass some "offset" (in bytes for stream and in packets
for datagram) to the recvmsg?

> (skb->peeked can only be set, never unset)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
  2012-02-10 13:54 [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing Pavel Emelyanov
  2012-02-10 14:41 ` Eric Dumazet
@ 2012-02-15 20:52 ` David Miller
  2012-02-20 12:17   ` Pavel Emelyanov
  1 sibling, 1 reply; 6+ messages in thread
From: David Miller @ 2012-02-15 20:52 UTC (permalink / raw)
  To: xemul; +Cc: tj, netdev

From: Pavel Emelyanov <xemul@parallels.com>
Date: Fri, 10 Feb 2012 17:54:10 +0400

> We're working on the checkpoint-restore project. To checkpoint a unix socket
> we need to read its skb queue. Analogous task for TCP sockets Tejun proposed
> to solve with parasite + recvmsg + MSG_PEEK. That's nice, but doesn't work
> for unix sockets, because for them MSG_PEEK always peeks a single skb from the 
> head of the queue.
> 
> I propose to extend the MSG_PEEK functionality with two more flags that peek
> either next not picked skb in queue or pick the last picked one. The latter
> ability is required to make it possible to re-read a message if MSG_TRUNC
> was reported on it.
> 
> These two flags can be used for unix stream sockets, since making the MSG_PEEK
> just report all data that fits the buffer length is bad -- we may have scms
> in queue thus turning stream socket into dgram one.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

I'm still thinking about how I feel about this.

But meanwhile you can fix some thing up, for example I really
feel that you should make sure that multiple bits aren't set
at the same time.

You either mean MSG_PEEK_MORE or MSG_PEEK_AGAIN, so setting both
makes no sense and should be flagged as an error.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
  2012-02-15 20:52 ` David Miller
@ 2012-02-20 12:17   ` Pavel Emelyanov
  2012-02-21  5:01     ` David Miller
  0 siblings, 1 reply; 6+ messages in thread
From: Pavel Emelyanov @ 2012-02-20 12:17 UTC (permalink / raw)
  To: David Miller, Eric Dumazet; +Cc: tj, netdev

> I'm still thinking about how I feel about this.
> 
> But meanwhile you can fix some thing up, for example I really
> feel that you should make sure that multiple bits aren't set
> at the same time.
> 
> You either mean MSG_PEEK_MORE or MSG_PEEK_AGAIN, so setting both
> makes no sense and should be flagged as an error.

Actually Eric has pointed out the critical issue with this -- if a task
(whose socket's queue we're about to peek) is currently peek-ing this
queue himself, we will peek only the queue's tail and will potentially 
break this task's peeking in case he uses the _MORE/_AGAIN macros :(


I was thinking about another option of doing the same, how about introducing
the peek offset member on sock (get/set via sockopt) which works like

* if == -1, then peek works as before
* if >= 0, then each peek/recvmsg will increase/decrease the value so that
  the next peek peeks next data

It's questionable what to do if the peek_off points into the middle of a
datagram however. Here's an example of how this looks for datagram sockets
(tested on pf_unix), for stream this requires more patching.

What do you think? Does it make sense to go on with this making other 
->recvmsg handlers support peeking offset?

---

diff --git a/include/asm-generic/socket.h b/include/asm-generic/socket.h
index 49c1704..832c270 100644
--- a/include/asm-generic/socket.h
+++ b/include/asm-generic/socket.h
@@ -66,5 +66,6 @@
 #define SO_RXQ_OVFL             40
 
 #define SO_WIFI_STATUS		41
+#define SO_PEEK_OFF		42
 #define SCM_WIFI_STATUS	SO_WIFI_STATUS
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 91c1c8b..94b0372 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -357,6 +357,7 @@ struct sock {
 	struct page		*sk_sndmsg_page;
 	struct sk_buff		*sk_send_head;
 	__u32			sk_sndmsg_off;
+	__s32			sk_peek_off;
 	int			sk_write_pending;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 68bbf9f..78e5147 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -180,21 +180,37 @@ struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned flags,
 		 * However, this function was correct in any case. 8)
 		 */
 		unsigned long cpu_flags;
+		long skip;
 
 		spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
-		skb = skb_peek(&sk->sk_receive_queue);
-		if (skb) {
+		skip = sk->sk_peek_off;
+		skb_queue_walk(&sk->sk_receive_queue, skb) {
 			*peeked = skb->peeked;
 			if (flags & MSG_PEEK) {
 				skb->peeked = 1;
 				atomic_inc(&skb->users);
-			} else
+				if (skip >= skb->len) {
+					skip -= skb->len;
+					continue;
+				}
+
+				if (sk->sk_peek_off >= 0)
+					sk->sk_peek_off += (skb->len - skip);
+			} else {
 				__skb_unlink(skb, &sk->sk_receive_queue);
-		}
-		spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
 
-		if (skb)
+				if (sk->sk_peek_off >= 0) {
+					if (sk->sk_peek_off >= skb->len)
+						sk->sk_peek_off -= skb->len;
+					else
+						sk->sk_peek_off = 0;
+				}
+			}
+
+			spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
 			return skb;
+		}
+		spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
 
 		/* User doesn't want to wait */
 		error = -EAGAIN;
diff --git a/net/core/sock.c b/net/core/sock.c
index 02f8dfe..5a5d581 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -792,7 +792,9 @@ set_rcvbuf:
 	case SO_WIFI_STATUS:
 		sock_valbool_flag(sk, SOCK_WIFI_STATUS, valbool);
 		break;
-
+	case SO_PEEK_OFF:
+		sk->sk_peek_off = val;
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1017,6 +1019,9 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 	case SO_WIFI_STATUS:
 		v.val = !!sock_flag(sk, SOCK_WIFI_STATUS);
 		break;
+	case SO_PEEK_OFF:
+		v.val = sk->sk_peek_off;
+		break;
 
 	default:
 		return -ENOPROTOOPT;
@@ -2092,6 +2097,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_sndmsg_page	=	NULL;
 	sk->sk_sndmsg_off	=	0;
+	sk->sk_peek_off		=	-1;
 
 	sk->sk_peer_pid 	=	NULL;
 	sk->sk_peer_cred	=	NULL;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing
  2012-02-20 12:17   ` Pavel Emelyanov
@ 2012-02-21  5:01     ` David Miller
  0 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2012-02-21  5:01 UTC (permalink / raw)
  To: xemul; +Cc: eric.dumazet, tj, netdev

From: Pavel Emelyanov <xemul@parallels.com>
Date: Mon, 20 Feb 2012 16:17:11 +0400

> I was thinking about another option of doing the same, how about introducing
> the peek offset member on sock (get/set via sockopt) which works like
> 
> * if == -1, then peek works as before
> * if >= 0, then each peek/recvmsg will increase/decrease the value so that
>   the next peek peeks next data
> 
> It's questionable what to do if the peek_off points into the middle of a
> datagram however. Here's an example of how this looks for datagram sockets
> (tested on pf_unix), for stream this requires more patching.
> 
> What do you think? Does it make sense to go on with this making other 
> ->recvmsg handlers support peeking offset?

I think you need to add some kind of locking to this for sanity, for
example right now you're doing test/decide/store decisions over
sk_peek_off inside of __skb_recv_datagram() without the socket lock,
yet another thread can set the SO_PEEK_OFF in parallel.

The locking decision will probably be worse for stream sockets.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-02-21  5:01 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-10 13:54 [PATCH] datagram: Extend the datagram queue MSG_PEEK-ing Pavel Emelyanov
2012-02-10 14:41 ` Eric Dumazet
2012-02-10 14:52   ` Pavel Emelyanov
2012-02-15 20:52 ` David Miller
2012-02-20 12:17   ` Pavel Emelyanov
2012-02-21  5:01     ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.