netdev.vger.kernel.org archive mirror
* [PATCH RFC net-next 0/6] rds: zerocopy support
@ 2018-01-17 12:19 Sowmini Varadhan
  2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:19 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

This patch series provides support for MSG_ZEROCOPY
on a PF_RDS socket, based on the APIs and infrastructure added
by f214f915e7db ("tcp: enable MSG_ZEROCOPY").

For single-threaded rds-stress testing over rds-tcp with the
ixgbe driver and 1M message sizes (-a 1M -q 1M), preliminary
results show a significant reduction in latency: about
90 usec with zerocopy, compared with 200 usec without zerocopy.

Additional testing/debugging is ongoing, but I am sharing
the current patchset to get feedback on API design choices,
especially for the send-completion notification for multi-threaded
datagram socket applications.

Brief RDS architectural overview: PF_RDS sockets implement
message-bounded datagram semantics over a reliable transport.
The RDS socket layer tracks message boundaries and uses
an underlying transport such as TCP to segment/reassemble the
message into MTU-sized frames. In addition to the reliable,
ordered delivery semantics provided by the transport, the
RDS layer also retains the datagram on its retransmit queue,
to be resent in case of transport failure/restart events.

This patchset modifies the above for zerocopy in the following manner.
- if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
- if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned. The pinning
uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit
on MSG_ZEROCOPY pages").

The message is unpinned after we get back an ACK (TCP ACK, in the
case of rds-tcp) indicating that the RDS module at the receiver
has received the datagram, and it is safe for the sender to free
the message from its (RDS) retransmit queue. 

The payload bytes in the message may not be modified for the
duration that the message has been pinned. A multi-threaded
application using this infrastructure thus needs to be notified
about send-completion, and that notification must uniquely
identify the message to the application so that the application
buffers may be freed/reused. 

Unique identification of the message in the completion notification
is done in the following manner:
- application passes down a 32 bit cookie as ancillary data with
  rds_sendmsg. The ancillary data in this case has cmsg_level == SOL_RDS
  and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
- upon send-completion, the rds module passes up a batch of cookies
  on the sk_error_queue associated with the PF_RDS socket. The message
  thus received will have a batch of N cookies in the data, with the
  number of cookies (N) specified in the ancillary data passed with
  recvmsg(). The current patchset sets up the ancillary data as a
  sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY, and 
  ee_data == N based on 52267790ef52 ("sock: add MSG_ZEROCOPY"), and
  alternate suggestions for designing this API are invited. The
  important point here is that the notification needs to be able
  to carry an arbitrary number of cookies, where each cookie
  allows the application to uniquely identify a buffer used with
  sendmsg().

  Note that cookie-batching on send-completion notification means
  that the application may not know the buffering requirements
  a priori, and the buffer passed to recvmsg() on the MSG_ERRQUEUE
  may be smaller than the size required for the notifications.
  To accommodate this case, sk_error_queue has been enhanced
  to support MSG_PEEK semantics (so that the application
  can retry with a larger buffer).
  
Work in progress
- additional testing: when we test this with rds-stress with 8 sockets
  and a send depth of 64 (i.e., each socket can have at most 64
  outstanding requests), some data corruption is reported by rds-stress.
  We are drilling down to the root cause.
- optimizing the send-completion notification API: our use-cases are
  multi-threaded, and we want to be able to reuse buffers as soon
  as possible (instead of waiting for the req-resp transaction to
  complete). Sub-optimal design of the completion notification can
  actually cause a performance regression (system-call overhead to
  reap notifications; throughput can drop because the application does
  not send "fast enough", even though latency is small), so this area
  needs to be optimized carefully.
- additional test results beyond the rds-stress micro-benchmarks.


Sowmini Varadhan (6):
  sock: MSG_PEEK support for sk_error_queue
  skbuff: export mm_[un]account_pinned_pages for other modules
  rds: hold a sock ref from rds_message to the rds_sock
  sock: permit SO_ZEROCOPY on PF_RDS socket
  rds: support for zcopy completion notification
  rds: zerocopy Tx support.

 drivers/net/tun.c        |    2 +-
 include/linux/skbuff.h   |    3 +
 include/net/sock.h       |    2 +-
 include/uapi/linux/rds.h |    1 +
 net/core/skbuff.c        |    6 ++-
 net/core/sock.c          |   14 +++++-
 net/packet/af_packet.c   |    3 +-
 net/rds/af_rds.c         |    3 +
 net/rds/message.c        |  119 ++++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h            |   16 +++++-
 net/rds/recv.c           |    3 +
 net/rds/send.c           |   41 ++++++++++++----
 12 files changed, 192 insertions(+), 21 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
@ 2018-01-17 12:19 ` Sowmini Varadhan
  2018-01-17 23:50   ` Willem de Bruijn
  2018-01-18 15:51   ` Eric Dumazet
  2018-01-17 12:20 ` [PATCH RFC net-next 2/6] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:19 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Allow the application to use MSG_PEEK with sk_error_queue
so that it can peek and re-read a message in cases where MSG_TRUNC
may be encountered.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 drivers/net/tun.c      |    2 +-
 include/net/sock.h     |    2 +-
 net/core/sock.c        |    7 +++++--
 net/packet/af_packet.c |    3 ++-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2fba3be..cfd0e0f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2313,7 +2313,7 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 	}
 	if (flags & MSG_ERRQUEUE) {
 		ret = sock_recv_errqueue(sock->sk, m, total_len,
-					 SOL_PACKET, TUN_TX_TIMESTAMP);
+					 SOL_PACKET, TUN_TX_TIMESTAMP, flags);
 		goto out;
 	}
 	ret = tun_do_read(tun, tfile, &m->msg_iter, flags & MSG_DONTWAIT, ptr);
diff --git a/include/net/sock.h b/include/net/sock.h
index 73b7830..f0b6990 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2343,7 +2343,7 @@ static inline bool sk_listener(const struct sock *sk)
 int sock_get_timestamp(struct sock *, struct timeval __user *);
 int sock_get_timestampns(struct sock *, struct timespec __user *);
 int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
-		       int type);
+		       int type, int flags);
 
 bool sk_ns_capable(const struct sock *sk,
 		   struct user_namespace *user_ns, int cap);
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..4f52677 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2887,7 +2887,7 @@ void sock_enable_timestamp(struct sock *sk, int flag)
 }
 
 int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
-		       int level, int type)
+		       int level, int type, int flags)
 {
 	struct sock_exterr_skb *serr;
 	struct sk_buff *skb;
@@ -2916,7 +2916,10 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
 	err = copied;
 
 out_free_skb:
-	kfree_skb(skb);
+	if (likely(!(flags & MSG_PEEK)))
+		kfree_skb(skb);
+	else
+		skb_queue_head(&sk->sk_error_queue, skb);
 out:
 	return err;
 }
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ee7aa0b..4314f31 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3294,7 +3294,8 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 	if (flags & MSG_ERRQUEUE) {
 		err = sock_recv_errqueue(sk, msg, len,
-					 SOL_PACKET, PACKET_TX_TIMESTAMP);
+					 SOL_PACKET, PACKET_TX_TIMESTAMP,
+					 flags);
 		goto out;
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 2/6] skbuff: export mm_[un]account_pinned_pages for other modules
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
  2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
@ 2018-01-17 12:20 ` Sowmini Varadhan
  2018-01-17 12:20 ` [PATCH RFC net-next 3/6] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:20 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

RDS would like to use the helper functions for managing pinned pages
added by Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/skbuff.h |    3 +++
 net/core/skbuff.c      |    6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6..8e2730a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -466,6 +466,9 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
+void mm_unaccount_pinned_pages(struct mmpin *mmp);
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285..272a513 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -890,7 +890,7 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 {
 	unsigned long max_pg, num_pg, new_pg, old_pg;
 	struct user_struct *user;
@@ -919,14 +919,16 @@ static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
 
-static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct mmpin *mmp)
 {
 	if (mmp->user) {
 		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
 		free_uid(mmp->user);
 	}
 }
+EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 3/6] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
  2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
  2018-01-17 12:20 ` [PATCH RFC net-next 2/6] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
@ 2018-01-17 12:20 ` Sowmini Varadhan
  2018-01-17 12:20 ` [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:20 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

The existing model holds a reference from the rds_sock to the
rds_message, but the rds_message does not itself hold a reference
on the rds_sock. Instead, the m_rs field in the rds_message is
assigned when the message is queued on the sock, and nulled when
the message is dequeued from the sock.

We want to be able to notify userspace when the rds_message
is actually freed (from rds_message_purge(), after the refcounts
on the rds_message drop to 0). These notifications, sent up on the
PF_RDS socket, signal that it is safe for userspace to free/reuse
any pages that may have been pinned down for zerocopy.

In order to be able to send the notifications on the rds_sock,
we need to retain the m_rs assignment in the rds_message with
the necessary refcount book-keeping.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/message.c |    8 +++++++-
 net/rds/send.c    |    7 +------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/rds/message.c b/net/rds/message.c
index 4318cc9..ef3daaf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -58,7 +58,7 @@ void rds_message_addref(struct rds_message *rm)
  */
 static void rds_message_purge(struct rds_message *rm)
 {
-	unsigned long i;
+	unsigned long i, flags;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
@@ -69,6 +69,12 @@ static void rds_message_purge(struct rds_message *rm)
 		__free_page(sg_page(&rm->data.op_sg[i]));
 	}
 	rm->data.op_nents = 0;
+	spin_lock_irqsave(&rm->m_rs_lock, flags);
+	if (rm->m_rs) {
+		sock_put(rds_rs_to_sk(rm->m_rs));
+		rm->m_rs = NULL;
+	}
+	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
diff --git a/net/rds/send.c b/net/rds/send.c
index d3e32d1..5ac0925 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -649,7 +649,6 @@ static void rds_send_remove_from_sock(struct list_head *messages, int status)
 				rm->rdma.op_notifier = NULL;
 			}
 			was_on_sock = 1;
-			rm->m_rs = NULL;
 		}
 		spin_unlock(&rs->rs_lock);
 
@@ -756,9 +755,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		 */
 		if (!test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
 			spin_unlock_irqrestore(&cp->cp_lock, flags);
-			spin_lock_irqsave(&rm->m_rs_lock, flags);
-			rm->m_rs = NULL;
-			spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 			continue;
 		}
 		list_del_init(&rm->m_conn_item);
@@ -774,7 +770,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -798,7 +793,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -849,6 +843,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
 		set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
 		rds_message_addref(rm);
+		sock_hold(rds_rs_to_sk(rs));
 		rm->m_rs = rs;
 
 		/* The code ordering is a little weird, but we're
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
                   ` (2 preceding siblings ...)
  2018-01-17 12:20 ` [PATCH RFC net-next 3/6] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
@ 2018-01-17 12:20 ` Sowmini Varadhan
  2018-01-18  0:03   ` Willem de Bruijn
  2018-01-17 12:20 ` [PATCH RFC net-next 5/6] rds: support for zcopy completion notification Sowmini Varadhan
  2018-01-17 12:20 ` [PATCH RFC net-next 6/6] rds: zerocopy Tx support Sowmini Varadhan
  5 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:20 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Allow the application to set SO_ZEROCOPY on the underlying sk
of a PF_RDS socket.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/sock.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 4f52677..f0f44b0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1049,6 +1049,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		break;
 
 	case SO_ZEROCOPY:
+		if (sk->sk_family == PF_RDS) {
+			if (val < 0 || val > 1)
+				ret = -EINVAL;
+			else
+				sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
+			break;
+		}
 		if (sk->sk_family != PF_INET && sk->sk_family != PF_INET6)
 			ret = -ENOTSUPP;
 		else if (sk->sk_protocol != IPPROTO_TCP)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
                   ` (3 preceding siblings ...)
  2018-01-17 12:20 ` [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
@ 2018-01-17 12:20 ` Sowmini Varadhan
  2018-01-18  0:23   ` Willem de Bruijn
  2018-01-17 12:20 ` [PATCH RFC net-next 6/6] rds: zerocopy Tx support Sowmini Varadhan
  5 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:20 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

RDS removes a datagram from the retransmit queue when an ACK is
received. The ACK indicates that the receiver has queued the
RDS datagram, so that the sender can safely forget the datagram.

If the datagram being removed had pinned pages set up, add
an entry to the rs->rs_znotify_queue so that the notification
will be sent up via rds_rm_zerocopy_callback() when the
rds_message is eventually freed by rds_message_purge().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/af_rds.c  |    3 ++
 net/rds/message.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h     |   13 +++++++++-
 net/rds/recv.c    |    3 ++
 net/rds/send.c    |    7 +++++
 5 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b405f77..23126db 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -183,6 +183,8 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 		mask |= (POLLIN | POLLRDNORM);
 	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
 		mask |= (POLLOUT | POLLWRNORM);
+	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
+		mask |= POLLERR;
 	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
 
 	/* clear state any time we wake a seen-congested socket */
@@ -511,6 +513,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
 	INIT_LIST_HEAD(&rs->rs_send_queue);
 	INIT_LIST_HEAD(&rs->rs_recv_queue);
 	INIT_LIST_HEAD(&rs->rs_notify_queue);
+	INIT_LIST_HEAD(&rs->rs_znotify_queue);
 	INIT_LIST_HEAD(&rs->rs_cong_list);
 	spin_lock_init(&rs->rs_rdma_lock);
 	rs->rs_rdma_keys = RB_ROOT;
diff --git a/net/rds/message.c b/net/rds/message.c
index ef3daaf..25c74b3 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -33,6 +33,9 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/skbuff.h>
+#include <linux/list.h>
+#include <linux/errqueue.h>
 
 #include "rds.h"
 
@@ -53,6 +56,64 @@ void rds_message_addref(struct rds_message *rm)
 }
 EXPORT_SYMBOL_GPL(rds_message_addref);
 
+static void rds_rm_zerocopy_callback(struct rds_sock *rs)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct sk_buff *skb;
+	struct sock_exterr_skb *serr;
+	struct sk_buff_head *q;
+	unsigned long flags;
+	struct sk_buff *tail;
+	u32 *ptr;
+	int ncookies = 0, i;
+	struct rds_znotifier *znotif, *ztmp;
+	LIST_HEAD(tmp_list);
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	list_splice(&rs->rs_znotify_queue, &tmp_list);
+	INIT_LIST_HEAD(&rs->rs_znotify_queue);
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list)
+		ncookies++;
+	if (ncookies == 0)
+		return;
+	skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
+	if (!skb) {
+		spin_lock_irqsave(&rs->rs_lock, flags);
+		list_splice(&tmp_list, &rs->rs_znotify_queue);
+		spin_unlock_irqrestore(&rs->rs_lock, flags);
+		return;
+	}
+	serr = SKB_EXT_ERR(skb);
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
+	serr->ee.ee_data = ncookies;
+	serr->ee.ee_info = 0;
+	serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
+	ptr = skb_put(skb, ncookies * sizeof(u32));
+
+	i = 0;
+	list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) {
+		list_del(&znotif->z_list);
+		ptr[i++] = znotif->z_cookie;
+		mm_unaccount_pinned_pages(&znotif->z_mmp);
+		kfree(znotif);
+	}
+	WARN_ON(!list_empty(&tmp_list));
+	q = &sk->sk_error_queue;
+	spin_lock_irqsave(&q->lock, flags);
+	tail = skb_peek_tail(q);
+	if (!tail ||
+	    SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY)  {
+		__skb_queue_tail(q, skb);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+	sk->sk_error_report(sk);
+	consume_skb(skb);
+}
+
 /*
  * This relies on dma_map_sg() not touching sg[].page during merging.
  */
@@ -66,11 +127,15 @@ static void rds_message_purge(struct rds_message *rm)
 	for (i = 0; i < rm->data.op_nents; i++) {
 		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
 		/* XXX will have to put_page for page refs */
-		__free_page(sg_page(&rm->data.op_sg[i]));
+		if (!rm->data.op_zcopy)
+			__free_page(sg_page(&rm->data.op_sg[i]));
+		else
+			put_page(sg_page(&rm->data.op_sg[i]));
 	}
 	rm->data.op_nents = 0;
 	spin_lock_irqsave(&rm->m_rs_lock, flags);
 	if (rm->m_rs) {
+		rds_rm_zerocopy_callback(rm->m_rs);
 		sock_put(rds_rs_to_sk(rm->m_rs));
 		rm->m_rs = NULL;
 	}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 374ae83..de5015a 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -356,6 +356,12 @@ static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
 #define RDS_MSG_PAGEVEC		7
 #define RDS_MSG_FLUSH		8
 
+struct rds_znotifier {
+	struct list_head	z_list;
+	u32			z_cookie;
+	struct mmpin		z_mmp;
+};
+
 struct rds_message {
 	refcount_t		m_refcount;
 	struct list_head	m_sock_item;
@@ -431,11 +437,14 @@ struct rds_message {
 		} rdma;
 		struct rm_data_op {
 			unsigned int		op_active:1;
-			unsigned int		op_notify:1;
+			unsigned int		op_notify:1,
+						op_zcopy:1,
+						op_pad_to_32:30;
 			unsigned int		op_nents;
 			unsigned int		op_count;
 			unsigned int		op_dmasg;
 			unsigned int		op_dmaoff;
+			struct rds_znotifier	*op_mmp_znotifier;
 			struct scatterlist	*op_sg;
 		} data;
 	};
@@ -588,6 +597,8 @@ struct rds_sock {
 	/* Socket receive path trace points*/
 	u8			rs_rx_traces;
 	u8			rs_rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+
+	struct list_head	rs_znotify_queue; /* zerocopy completion */
 };
 
 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
diff --git a/net/rds/recv.c b/net/rds/recv.c
index b25bcfe..043f667 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -594,6 +594,9 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 
 	if (msg_flags & MSG_OOB)
 		goto out;
+	if (msg_flags & MSG_ERRQUEUE)
+		return sock_recv_errqueue(sk, msg, size, SOL_IP, IP_RECVERR,
+					  msg_flags);
 
 	while (1) {
 		/* If there are pending notifications, do those - and nothing else */
diff --git a/net/rds/send.c b/net/rds/send.c
index 5ac0925..5c38ce3 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -635,7 +635,14 @@ static void rds_send_remove_from_sock(struct list_head *messages, int status)
 		if (test_and_clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) {
 			struct rm_rdma_op *ro = &rm->rdma;
 			struct rds_notifier *notifier;
+			struct rds_znotifier *znotifier;
 
+			if (rm->data.op_zcopy) {
+				znotifier = rm->data.op_mmp_znotifier;
+				list_add_tail(&znotifier->z_list,
+					      &rs->rs_znotify_queue);
+				rm->data.op_mmp_znotifier = NULL;
+			}
 			list_del_init(&rm->m_sock_item);
 			rds_send_sndbuf_remove(rs, rm);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH RFC net-next 6/6] rds: zerocopy Tx support.
  2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
                   ` (4 preceding siblings ...)
  2018-01-17 12:20 ` [PATCH RFC net-next 5/6] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-01-17 12:20 ` Sowmini Varadhan
  2018-01-18  0:32   ` Willem de Bruijn
  5 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-17 12:20 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned.

The pinning uses the accounting infrastructure added by
Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")

The payload bytes in the message may not be modified for the
duration that the message has been pinned. A multi-threaded
application using this infrastructure may thus need to be notified
about send-completion so that it can free/reuse the buffers
passed to rds_sendmsg(). Notification of send-completion will
identify each message-buffer by a cookie that the application
must specify as ancillary data to rds_sendmsg().
The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/uapi/linux/rds.h |    1 +
 net/rds/message.c        |   44 +++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h            |    3 ++-
 net/rds/send.c           |   27 ++++++++++++++++++++++++---
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index e71d449..09b8cc6 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -102,6 +102,7 @@
 #define RDS_CMSG_ATOMIC_CSWP		7
 #define RDS_CMSG_MASKED_ATOMIC_FADD	8
 #define RDS_CMSG_MASKED_ATOMIC_CSWP	9
+#define	RDS_CMSG_ZCOPY_COOKIE		10
 #define RDS_CMSG_RXPATH_LATENCY		11
 
 #define RDS_INFO_FIRST			10000
diff --git a/net/rds/message.c b/net/rds/message.c
index 25c74b3..111d5a3 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -337,12 +337,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
 	return rm;
 }
 
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       struct rds_sock *rs, bool zcopy)
 {
 	unsigned long to_copy, nbytes;
 	unsigned long sg_off;
 	struct scatterlist *sg;
 	int ret = 0;
+	int length = iov_iter_count(from);
 
 	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -352,6 +354,46 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
 	sg = rm->data.op_sg;
 	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
 
+	if (zcopy) {
+		int total_copied = 0;
+		size_t zsize = sizeof(struct rds_znotifier);
+
+		rm->data.op_zcopy = 1;
+		rm->data.op_mmp_znotifier = kzalloc(zsize, GFP_KERNEL);
+		if (!rm->data.op_mmp_znotifier)
+			return -ENOMEM;
+		if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+					    length)) {
+			kfree(rm->data.op_mmp_znotifier);
+			rm->data.op_mmp_znotifier = NULL;
+			return -ENOMEM;
+		}
+		while (iov_iter_count(from)) {
+			struct page *pages;
+			size_t start;
+			ssize_t copied;
+
+			copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
+						    1, &start);
+			if (copied < 0) {
+				/* XXX revert pinning */
+				kfree(rm->data.op_mmp_znotifier);
+				rm->data.op_mmp_znotifier = NULL;
+				return -EFAULT;
+			}
+			total_copied += copied;
+			iov_iter_advance(from, copied);
+			length -= copied;
+			sg_set_page(sg, pages, copied, start);
+			rm->data.op_nents++;
+			sg++;
+		}
+		WARN_ON_ONCE(length != 0);
+		return ret;
+	} /* zcopy */
+
+	rm->data.op_zcopy = 0;
+
 	while (iov_iter_count(from)) {
 		if (!sg_page(sg)) {
 			ret = rds_page_remainder_alloc(sg, iov_iter_count(from),
diff --git a/net/rds/rds.h b/net/rds/rds.h
index de5015a..884d61c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -781,7 +781,8 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 /* message.c */
 struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
 struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents);
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from);
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       struct rds_sock *rs, bool zcopy);
 struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
 void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
 				 __be16 dport, u64 seq);
diff --git a/net/rds/send.c b/net/rds/send.c
index 5c38ce3..e085730 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -908,6 +908,7 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 
 		case RDS_CMSG_RDMA_DEST:
 		case RDS_CMSG_RDMA_MAP:
+		case RDS_CMSG_ZCOPY_COOKIE:
 			cmsg_groups |= 2;
 			/* these are valid but do no add any size */
 			break;
@@ -935,6 +936,18 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 	return size;
 }
 
+static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	unsigned int *cookie;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
+		return -EINVAL;
+	cookie = CMSG_DATA(cmsg);
+	rm->data.op_mmp_znotifier->z_cookie = *cookie;
+	return 0;
+}
+
 static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			 struct msghdr *msg, int *allocated_mr)
 {
@@ -977,6 +990,10 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			ret = rds_cmsg_atomic(rs, rm, cmsg);
 			break;
 
+		case RDS_CMSG_ZCOPY_COOKIE:
+			ret = rds_cmsg_zcopy(rs, rm, cmsg);
+			break;
+
 		default:
 			return -EINVAL;
 		}
@@ -1047,10 +1064,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 	long timeo = sock_sndtimeo(sk, nonblock);
 	struct rds_conn_path *cpath;
 	size_t total_payload_len = payload_len, rdma_payload_len = 0;
+	bool zcopy = ((msg->msg_flags & MSG_ZEROCOPY) &&
+		      sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY));
 
 	/* Mirror Linux UDP mirror of BSD error message compatibility */
 	/* XXX: Perhaps MSG_MORE someday */
-	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT | MSG_ZEROCOPY)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -1107,12 +1126,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* Attach data to the rm */
 	if (payload_len) {
-		rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, PAGE_SIZE));
+		int num_sgs = ceil(payload_len, PAGE_SIZE);
+
+		rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);
 		if (!rm->data.op_sg) {
 			ret = -ENOMEM;
 			goto out;
 		}
-		ret = rds_message_copy_from_user(rm, &msg->msg_iter);
+		ret = rds_message_copy_from_user(rm, &msg->msg_iter, rs, zcopy);
 		if (ret)
 			goto out;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread
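
As context for the RDS_CMSG_ZCOPY_COOKIE API in the patch above, here is a userspace sketch (not part of the patch series) of how a sender could attach a cookie as ancillary data before calling sendmsg() with MSG_ZEROCOPY. SOL_RDS is the existing level from linux/socket.h; the RDS_CMSG_ZCOPY_COOKIE value is an assumption taken from the RFC's uapi addition.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values for illustration: SOL_RDS is 276 in linux/socket.h;
 * RDS_CMSG_ZCOPY_COOKIE is added by this RFC's uapi patch and its
 * numeric value is a guess here. */
#define SOL_RDS               276
#define RDS_CMSG_ZCOPY_COOKIE 12

/* Fill in the control portion of a msghdr with a single 32-bit
 * zerocopy cookie, ready for sendmsg(fd, &msg, MSG_ZEROCOPY). */
static void rds_attach_zcopy_cookie(struct msghdr *msg, void *cbuf,
				    size_t cbuf_len, uint32_t cookie)
{
	struct cmsghdr *cmsg;

	msg->msg_control = cbuf;
	msg->msg_controllen = cbuf_len;

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_RDS;
	cmsg->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
	cmsg->cmsg_len = CMSG_LEN(sizeof(cookie));
	memcpy(CMSG_DATA(cmsg), &cookie, sizeof(cookie));

	/* trim controllen to the space actually used */
	msg->msg_controllen = CMSG_SPACE(sizeof(cookie));
}
```

The completion notification for this cookie would later arrive on the socket's error queue, as discussed in the replies below.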

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
@ 2018-01-17 23:50   ` Willem de Bruijn
  2018-01-18 11:02     ` Sowmini Varadhan
  2018-01-18 15:51   ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-17 23:50 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 17, 2018 at 7:19 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Allow the application to use MSG_PEEK with sk_error_queue
> so that it can peek and re-read a message in cases where MSG_TRUNC
> may be encountered.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>

>  int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
> -                      int level, int type)
> +                      int level, int type, int flags)
>  {
>         struct sock_exterr_skb *serr;
>         struct sk_buff *skb;
> @@ -2916,7 +2916,10 @@ int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len,
>         err = copied;
>
>  out_free_skb:
> -       kfree_skb(skb);
> +       if (likely(!(flags & MSG_PEEK)))
> +               kfree_skb(skb);
> +       else
> +               skb_queue_head(&sk->sk_error_queue, skb);

This can cause reordering with parallel readers. Can we avoid the need
for peeking? It also caused a slew of subtle bugs previously.

How about just define a max number of cookies and require the caller
to always read with sufficient room to hold them?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket
  2018-01-17 12:20 ` [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
@ 2018-01-18  0:03   ` Willem de Bruijn
  0 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18  0:03 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Allow the application to set SO_ZEROCOPY on the underlying sk
> of a PF_RDS socket.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  net/core/sock.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4f52677..f0f44b0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1049,6 +1049,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>                 break;
>
>         case SO_ZEROCOPY:
> +               if (sk->sk_family == PF_RDS) {
> +                       if (val < 0 || val > 1)
> +                               ret = -EINVAL;
> +                       else
> +                               sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
> +                       break;
> +               }

Let's integrate this in the existing logic. Perhaps something like

  if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
    if (sk->sk_protocol != IPPROTO_TCP)
      ret = -ENOTSUPP;
    else if (sk->sk_state != TCP_CLOSE)
      ret = -EBUSY;
  } else if (sk->sk_family != PF_RDS) {
    ret = -ENOTSUPP;
  }

  if (!ret) {
    if (val < 0 || val > 1)
      ret = -EINVAL;
    else
      sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
  }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
  2018-01-17 12:20 ` [PATCH RFC net-next 5/6] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-01-18  0:23   ` Willem de Bruijn
  2018-01-18 11:40     ` Sowmini Varadhan
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18  0:23 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram from the retransmit queue when an ACK is
> received. The ACK indicates that the receiver has queued the
> RDS datagram, so that the sender can safely forget the datagram.
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notification
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb;
> +       struct sock_exterr_skb *serr;
> +       struct sk_buff_head *q;
> +       unsigned long flags;
> +       struct sk_buff *tail;
> +       u32 *ptr;
> +       int ncookies = 0, i;
> +       struct rds_znotifier *znotif, *ztmp;
> +       LIST_HEAD(tmp_list);
> +
> +       spin_lock_irqsave(&rs->rs_lock, flags);
> +       list_splice(&rs->rs_znotify_queue, &tmp_list);
> +       INIT_LIST_HEAD(&rs->rs_znotify_queue);
> +       spin_unlock_irqrestore(&rs->rs_lock, flags);
> +
> +       list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list)
> +               ncookies++;
> +       if (ncookies == 0)
> +               return;
> +       skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
> +       if (!skb) {
> +               spin_lock_irqsave(&rs->rs_lock, flags);
> +               list_splice(&tmp_list, &rs->rs_znotify_queue);
> +               spin_unlock_irqrestore(&rs->rs_lock, flags);
> +               return;
> +       }

TCP zerocopy avoids this issue by allocating the notification skb when the
zerocopy packet is created, which would be rds_message_copy_from_user.

This does not add an allocation, if using the same trick of stashing
the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
alloc_skb is probably more expensive than that kzalloc. If nothing else,
because of more zeroing.

> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> +       serr->ee.ee_data = ncookies;

This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
even if the semantics can be derived from the packet family (for now).

Even better would be if we can avoid the cookies completely. I understand
the issue with concurrent send threads racing on obtaining a zckey value. If
the sender could learn which zckey was chosen for a call, would that suffice?

I suspect that even with concurrent senders, notifications arrive largely
in order, in which case we could just maintain the existing semantics and
even reuse that implementation.

> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

Why set the copied bit here?

> +       ptr = skb_put(skb, ncookies * sizeof(u32));
> +
> +       i = 0;
> +       list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) {
> +               list_del(&znotif->z_list);
> +               ptr[i++] = znotif->z_cookie;
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               kfree(znotif);
> +       }
> +       WARN_ON(!list_empty(&tmp_list));
> +       q = &sk->sk_error_queue;
> +       spin_lock_irqsave(&q->lock, flags);
> +       tail = skb_peek_tail(q);
> +       if (!tail ||
> +           SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY)  {
> +               __skb_queue_tail(q, skb);
> +               skb = NULL;
> +       }

This drops the packet if the branch is not taken. In the TCP case this condition
means that we can try to coalesce packets, but that is not the case here.

> +       spin_unlock_irqrestore(&q->lock, flags);
> +       sk->sk_error_report(sk);
> +       consume_skb(skb);
> +}
> +
>  /*
>   * This relies on dma_map_sg() not touching sg[].page during merging.
>   */
> @@ -66,11 +127,15 @@ static void rds_message_purge(struct rds_message *rm)
>         for (i = 0; i < rm->data.op_nents; i++) {
>                 rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
>                 /* XXX will have to put_page for page refs */
> -               __free_page(sg_page(&rm->data.op_sg[i]));
> +               if (!rm->data.op_zcopy)
> +                       __free_page(sg_page(&rm->data.op_sg[i]));
> +               else
> +                       put_page(sg_page(&rm->data.op_sg[i]));
>         }
>         rm->data.op_nents = 0;
>         spin_lock_irqsave(&rm->m_rs_lock, flags);
>         if (rm->m_rs) {
> +               rds_rm_zerocopy_callback(rm->m_rs);
>                 sock_put(rds_rs_to_sk(rm->m_rs));
>                 rm->m_rs = NULL;
>         }
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 374ae83..de5015a 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -356,6 +356,12 @@ static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
>  #define RDS_MSG_PAGEVEC                7
>  #define RDS_MSG_FLUSH          8
>
> +struct rds_znotifier {
> +       struct list_head        z_list;
> +       u32                     z_cookie;
> +       struct mmpin            z_mmp;
> +};
> +
>  struct rds_message {
>         refcount_t              m_refcount;
>         struct list_head        m_sock_item;
> @@ -431,11 +437,14 @@ struct rds_message {
>                 } rdma;
>                 struct rm_data_op {
>                         unsigned int            op_active:1;
> -                       unsigned int            op_notify:1;
> +                       unsigned int            op_notify:1,
> +                                               op_zcopy:1,

not necessary if op_mmp_znotifier is NULL unless set in
rds_message_copy_from_user

> +                                               op_pad_to_32:30;
>                         unsigned int            op_nents;
>                         unsigned int            op_count;
>                         unsigned int            op_dmasg;
>                         unsigned int            op_dmaoff;
> +                       struct rds_znotifier    *op_mmp_znotifier;
>                         struct scatterlist      *op_sg;
>                 } data;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 6/6] rds: zerocopy Tx support.
  2018-01-17 12:20 ` [PATCH RFC net-next 6/6] rds: zerocopy Tx support Sowmini Varadhan
@ 2018-01-18  0:32   ` Willem de Bruijn
  0 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18  0:32 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 17, 2018 at 7:20 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg() are pinned.
>
> The pinning uses the accounting infrastructure added by
> Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
>
> The payload bytes in the message may not be modified for the
> duration that the message has been pinned. A multi-threaded
> application using this infrastructure may thus need to be notified
> about send-completion so that it can free/reuse the buffers
> passed to rds_sendmsg(). Notification of send-completion will
> identify each message-buffer by a cookie that the application
> must specify as ancillary data to rds_sendmsg().
> The ancillary data in this case has cmsg_level == SOL_RDS
> and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/uapi/linux/rds.h |    1 +
>  net/rds/message.c        |   44 +++++++++++++++++++++++++++++++++++++++++++-
>  net/rds/rds.h            |    3 ++-
>  net/rds/send.c           |   27 ++++++++++++++++++++++++---
>  4 files changed, 70 insertions(+), 5 deletions(-)

> +static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
> +                         struct cmsghdr *cmsg)
> +{
> +       unsigned int *cookie;

Use fixed-width types across the ABI.

> +
> +       if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
> +               return -EINVAL;
> +       cookie = CMSG_DATA(cmsg);
> +       rm->data.op_mmp_znotifier->z_cookie = *cookie;
> +       return 0;
> +}
> +

> @@ -1107,12 +1126,14 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
>
>         /* Attach data to the rm */
>         if (payload_len) {
> -               rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, PAGE_SIZE));
> +               int num_sgs = ceil(payload_len, PAGE_SIZE);
> +
> +               rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);

Unrelated change. Leftover from revising the patch?

Also, with user pages, data is no longer guaranteed to be page
aligned, so this may
need more sgs.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-17 23:50   ` Willem de Bruijn
@ 2018-01-18 11:02     ` Sowmini Varadhan
  2018-01-18 15:54       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 11:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On (01/17/18 18:50), Willem de Bruijn wrote:
> 
> This can cause reordering with parallel readers. Can we avoid the need
> for peeking? It also caused a slew of subtle bugs previously.

Yes, I did notice the potential for re-ordering when writing the patch..
but these are not actually messages from the wire, so is re-ordering
fatal?

In general, I'm not particularly attached to this solution- in my
testing, I'm seeing that it's possible to reduce the latency and still
take a hit on the throughput if the application does not reap the
completion notification (and send out new data) efficiently.

Some (radically different) alternatives that were suggested to me:

- send up all the cookies as ancillary data with recvmsg (i.e., send
  them as cmsg data along with actual data from the wire). In most
  cases, the application has data to read anyway. If it doesn't (pure
  sender), we could wake up recvmsg with 0 bytes of data, but with
  the cookie info in the ancillary data. This feels not-so-elegant
  to me, but I suppose it would have the benefit of optimizing on
  the syscall overhead.. (and you could use MSG_CTRUNC to handle
  the case of an insufficient buffer for cookies, sending the rest
  on the next call)..
- allow the application to use a setsockopt on the rds socket, with
  some shmem region into which the kernel could write the cookies,
  letting the application reap cookies without syscall overhead from
  that shmem region..

> How about just define a max number of cookies and require the caller
> to always read with sufficient room to hold them?

This may be "good enough" as well, maybe allow a max of (say) 16 cookies,
and set up  the skb's in the error queue to send up batches of 16 cookies
at a time?

--Sowmini

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
  2018-01-18  0:23   ` Willem de Bruijn
@ 2018-01-18 11:40     ` Sowmini Varadhan
  2018-01-18 22:46       ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 11:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar,
	sowmini.varadhan

On (01/17/18 19:23), Willem de Bruijn wrote:
> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
     :
> > +       skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
> 
> TCP zerocopy avoids this issue by allocating the notification skb when the
> zerocopy packet is created, which would be rds_message_copy_from_user.

right, I could allocate the skb when we set up the zcopy data structures.

> This does not add an allocation, if using the same trick of stashing
> the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
> alloc_skb is probably more expensive than that kzalloc. If nothing else,
> because of more zeroing.

I would have liked to reuse skb->cb, but had to fall back to alloc_skb
because of the attempt to support flexibly numbered (and sized) cookie
notifications.

If we choose the "max number of cookies" option discussed in the previous
thread, I think I should be able to do this with the existing 48-byte
cb and pack in eight 32-bit cookies after the sock_extended_err.. maybe
that's sufficient.

> > +       serr = SKB_EXT_ERR(skb);
> > +       serr->ee.ee_errno = 0;
> > +       serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> > +       serr->ee.ee_data = ncookies;
> 
> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
> even if the semantics can be derived from the packet family (for now).

sounds good, I'll take care of this (and other review comments)
for the next rev.

> Even better would be if we can avoid the cookies completely. I understand
> the issue with concurrent send threads racing on obtaining a zckey value. If
> the sender could learn which zckey was chosen for a call, would that suffice?

I'm not sure I understand the proposal- you want sendmsg to return
the cookie ("zckey") for you? How?

even if we mangled sendmsg() to return something other than
the existing POSIX semantics, how will the application be asynchronously
notified that send has completed (on a per-message/per-datagram) basis?

> I suspect that even with concurrent senders, notifications arrive
> largely in
> order, in which case we could just maintain the existing semantics and even
> reuse that implementation.

Not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe
connection.

When you have multiple threads writing to a socket, you cannot know
what was the "order of send", unless you bottleneck all the threads
to go through a single-point-of-send.  rds-stress is an example of this
sort of usage (fairly typical in our applications)

[1] http://public-yum.oracle.com/repo/OracleLinux/OL6/ofed_UEK/x86_64//getPackageSource/rds-tools-2.0.7-1.12.el6.src.rpm

Consider 2 RDS sockets fd1 and fd2, each one sending to the
same pair of IP addresses: if fd1 gets bound to tcp sock1, fd2 to tcp sock2,
it's very possible that the send completion is not in the same order
as the order of send. The application cannot know which socket gets
bound to which TCP connection (or even whether/how-many tcp connections
are involved: mprds strives to make this transparent to the application)

Same problem exists for pure-datagram sockets like PF_PACKET, UDP etc.
application may send buf1 (datagram1) and buf2 (datagram2), and buf2
may make it to the destination before buf1.  The "notifications 
arrive largely in order" may be mostly true about a single-stream TCP
connection but not for the datagram models (or even threaded tcp apps
like iperf?)..

> > +       tail = skb_peek_tail(q);
> > +       if (!tail ||
> > +           SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY)  {
> > +               __skb_queue_tail(q, skb);
> > +               skb = NULL;
> > +       }
> 
> This drops the packet if the branch is not taken. In the TCP case
> this condition
> means that we can try to coalesce packets, but that is not the case here.

good point, I'll check into this and fix (same applies for the comments
to patches 4/6 and 6/6)

> >                 } rdma;
> >                 struct rm_data_op {
> >                         unsigned int            op_active:1;
> > -                       unsigned int            op_notify:1;
> > +                       unsigned int            op_notify:1,
> > +                                               op_zcopy:1,
                                                     :
> > +                       struct rds_znotifier    *op_mmp_znotifier;
> 
> not necessary if op_mmp_znotifier is NULL unless set in
> rds_message_copy_from_user

To make sure I don't misunderstand, you are suggesting that we don't
need op_zcopy, but can just check for the null-ness of
op_mmp_znotifier (yes, true, I agree)? Or something else?

--Sowmini

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
  2018-01-17 23:50   ` Willem de Bruijn
@ 2018-01-18 15:51   ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2018-01-18 15:51 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, santosh.shilimkar

On Wed, 2018-01-17 at 04:19 -0800, Sowmini Varadhan wrote:
> Allow the application to use MSG_PEEK with sk_error_queue
> so that it can peek and re-read a message in cases where MSG_TRUNC
> may be encountered.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>


Let's not add a buggy feature that only fuzzers love to use to trigger
awful bugs.

No serious (performance sensitive) application use MSG_PEEK, because of
extra syscall overhead.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 11:02     ` Sowmini Varadhan
@ 2018-01-18 15:54       ` Eric Dumazet
  2018-01-18 16:10         ` Sowmini Varadhan
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2018-01-18 15:54 UTC (permalink / raw)
  To: Sowmini Varadhan, Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Thu, 2018-01-18 at 06:02 -0500, Sowmini Varadhan wrote:
> On (01/17/18 18:50), Willem de Bruijn wrote:
> > 
> > This can cause reordering with parallel readers. Can we avoid the need
> > for peeking? It also caused a slew of subtle bugs previously.
> 
> Yes, I did notice the potential for re-ordering when writing the patch..
> but these are not actually messages from the wire, so is re-ordering
> fatal?

Some applications out there would break horribly, trust me.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 15:54       ` Eric Dumazet
@ 2018-01-18 16:10         ` Sowmini Varadhan
  2018-01-18 16:53           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 16:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On (01/18/18 07:54), Eric Dumazet wrote:
> 
> Some applications out there would break horribly, trust me.
> 

so I'm not particularly attached to that solution, and I appreciate
the wisdom (and the NACK), but let's try to find a useful alternative.

The current zcopy completion notification mechanism involves syscall
overhead already, and is also inadequate for threaded applications sharing
an fd. Plus it won't work for datagram sockets.

I'm fine with Willem's suggestion of passing a fixed number of cookies
as ancillary data (with MSG_CTRUNC to denote an inadequate buffer), but
if we are really so thrifty about syscall overhead, we should not be
using sk_error_queue in the first place- perhaps we can pass up the
completion notification as ancillary data with recvmsg() on the POLLIN
channel itself (which is weird if there is no data to recv, and only
ancillary info to pass up, but hey, we are "performant"!).

--Sowmini
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 16:10         ` Sowmini Varadhan
@ 2018-01-18 16:53           ` Eric Dumazet
  2018-01-18 17:12             ` Sowmini Varadhan
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2018-01-18 16:53 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Willem de Bruijn, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On Thu, 2018-01-18 at 11:10 -0500, Sowmini Varadhan wrote:
> On (01/18/18 07:54), Eric Dumazet wrote:
> > 
> > Some applications out there would break horribly, trust me.
> > 
> 
> so I'm not particularly attached to that solution, and I appreciate
> the wisdom (and the NACK), but let's try to find a useful alternative
> 
> The current zcopy completion notification mechanism involves syscall
> overhead already, and is also inadequate for threaded applications sharing
> an fd.  Plus it won't work for datagram sockets.
> 
> I'm fine with Willem's suggestion of passing a fixed number of cookies
> as ancillary data (with CTRUNC to denote inadequate buffer) but if we
> are really so thrifty about syscall overhead, we should not be using
> sk_error_queue in the first place- perhaps we can pass up the completion
> notification as ancillary data with recvmsg() on the POLLIN channel
> itself (which is weird if there is no data to recv, and only ancillary
> info to pass up, but hey, we are "performant"!).

The thing is: MSG_PEEK 'support' will also need SO_PEEK_OFF support.

And so begins the crazy stuff.

So let's properly design things, and not re-use legacy stuff that has
proven to be not multi-thread ready and too complex.

If you want to design a new channel of communication, do it, and
maintain it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 16:53           ` Eric Dumazet
@ 2018-01-18 17:12             ` Sowmini Varadhan
  2018-01-18 22:54               ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 17:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On (01/18/18 08:53), Eric Dumazet wrote:
> 
> The thing is: MSG_PEEK 'support' will also need SO_PEEK_OFF support.

sure, I'll drop the MSG_PEEK idea (which I wasn't very thrilled
about anyway)

> So let's properly design things, and not re-use legacy stuff that is
> proven to be not multi-thread ready and too complex.
> 
> If you want to design a new channel of communication, do it, and
> maintain it.

My instinct is to go with the fixed-size ancillary data, which itself
allows 2 options:

1. cmsg_data has a sock_extended_err preamble
   with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar),
   and the ee_data is an array of 32 bit cookies (can pack at most 8 
   32-bit cookies, if we want to pack this into an skb->cb)

   Using the sock_extended_err as preamble will allow this to be usable by
   existing tcp zcopy applications (they can use the ee_origin to find
   out if this is a batch of cookies or the existing hi/lo values).

2. If we have the option of passing completion-notification up as ancillary
   data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE),
   we don't have to try to retain "backward compat" with the
   SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data
   struct for the notification and potentially pack more cookies into
   48 bytes (RDS could be the first guinea pig for this- it doesn't even
   have to be done across all protocol families on day-1).

I think the shmem channel suggestion would be an optional optimization
that can be added later- it may not even be necessary, since most
applications will likely be sending *and* receiving data, so passing up
cookies with recvmsg should be "good enough" to save syscall overhead
for the common case.

I can work on #2, if there are no objections to it.

--Sowmini

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion notification
  2018-01-18 11:40     ` Sowmini Varadhan
@ 2018-01-18 22:46       ` Willem de Bruijn
  0 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18 22:46 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

>> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
>> even if the semantics can be derived from the packet family (for now).
>
> sounds good, I'll take care of this (and other review comments)
> for the next rev.
>
>> Even better would be if we can avoid the cookies completely. I understand
>> the issue with concurrent send threads racing on obtaining a zckey value. If
>> the sender could learn which zckey was chosen for a call, would that suffice?
>
> I'm not sure I understand the proposal- you want sendmsg to return
> the cookie ("zckey") for you? How?
>
> even if we mangled sendmsg() to return something other than
> the existing POSIX semantics, how will the application be asynchronously
> notified that send has completed (on a per-message/per-datagram) basis?

I'm purposely glossing over how the kernel returns this item for now.
I was just wondering whether we can then assume mostly in-order delivery
and reuse the existing notification interface from tcp zerocopy.

From your experiments, it sounds like this is not the case. In which case
there is little benefit to trying to force linear IDs derived from sk->sk_zckey.

>> I suspect that even with concurrent senders, notifications arrive
>> largely in
>> order, in which case we could just maintain the existing semantics and even
>> reuse that implementation.
>
> not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe
> connection.

Okay. In that case, the cmsg cookie approach sounds fine. I had the same
in an early tcp zerocopy test version, as a matter of fact.

>> >                 } rdma;
>> >                 struct rm_data_op {
>> >                         unsigned int            op_active:1;
>> > -                       unsigned int            op_notify:1;
>> > +                       unsigned int            op_notify:1,
>> > +                                               op_zcopy:1,
>                                                      :
>> > +                       struct rds_znotifier    *op_mmp_znotifier;
>>
>> not necessary if op_mmp_znotifier is NULL unless set in
>> rds_message_copy_from_user
>
> To make sure I don't misunderstand, you are suggesting that we don't
> need op_zcopy, but can just check for the null-ness of
> op_mmp_znotifier (yes, true, I agree)? or something else?

Yes, that's what I meant.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 17:12             ` Sowmini Varadhan
@ 2018-01-18 22:54               ` Willem de Bruijn
  2018-01-18 23:03                 ` Sowmini Varadhan
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18 22:54 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Eric Dumazet, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On Thu, Jan 18, 2018 at 12:12 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 08:53), Eric Dumazet wrote:
>>
>> The thing is : MSG_PEEK 'support' will also need SO_PEEK_OFF support.
>
> sure, I'll drop the MSG_PEEK idea (which I wasn't very thrilled
> about anyway)
>
>> So let's properly design things, and not re-use legacy stuff that is
>> proven to be not multi-thread ready and too complex.
>>
>> If you want to design a new channel of communication, do it, and
>> maintain it.
>
> My instinct is to go with the fixed-size ancillary data, which itself
> allows 2 options:
>
> 1. cmsg_data has a sock_extended_err preamble
>    with ee_origin = SO_EE_ORIGIN_ZEROCOPY_COOKIE (or similar),
>    and the ee_data is an array of 32 bit cookies (can pack at most 8
>    32-bit cookies, if we want to pack this into an skb->cb)
>
>    Using sock_extended_err as the preamble will allow this to be usable by
>    existing tcp zerocopy applications (they can use ee_origin to find
>    out whether this is a batch of cookies or the existing hi/lo values).
>
> 2. If we have the option of passing completion-notification up as ancillary
>    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)

This assumes a somewhat symmetric workload, where there are enough recv
calls to reap the notification associated with the send calls.

>    we don't have to try to retain "backward compat" to the
>    SO_EE_ORIGIN_ZEROCOPY API: we can just use a completely new data
>    struct for the notification and potentially pack more cookies into
>    48 bytes (RDS could be the first guinea pig for this; it doesn't even
>    have to be done across all protocol families on day 1).
>
> I think the shmem channel suggestion would be an optional optimization
> that can be added later; it may not even be necessary, since most
> applications will likely be sending *and* receiving data, so passing up
> cookies with recvmsg should be "good enough" to save syscall overhead
> for the common case.
>
> I can work on #2, if there are no objections to it.

I would stay with MSG_ERRQUEUE processing. One option is to pass data
up to userspace in the data portion of the notification skb instead of
encoding it in ancillary data, like tcp_get_timestamping_opt_stats.

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 22:54               ` Willem de Bruijn
@ 2018-01-18 23:03                 ` Sowmini Varadhan
  2018-01-18 23:09                   ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 23:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On (01/18/18 17:54), Willem de Bruijn wrote:
> > 2. If we have the option of passing completion-notification up as ancillary
> >    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
> 
> This assumes a somewhat symmetric workload, where there are enough recv
> calls to reap the notification associated with the send calls.

Your comment about the assumption is true, but at least for the database
use-cases, we have a request-response model, so the assumption works out.
I don't know whether many other workloads that send large buffers have
this pattern.

> I would stay with MSG_ERRQUEUE processing. One option is to pass data
> up to userspace in the data portion of the notification skb instead of
> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.

That's similar to what I have, except that it does not have the
MSG_PEEK part (you'd need to enforce that the data portion
is upper-bounded, and that the application has the responsibility
of sending down "enough" buffer with recvmsg).

Note that any one of these choices is OK with me; I have no
special attachment to any of them.

--Sowmini

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 23:03                 ` Sowmini Varadhan
@ 2018-01-18 23:09                   ` Willem de Bruijn
  2018-01-18 23:20                     ` Sowmini Varadhan
  0 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18 23:09 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Eric Dumazet, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On Thu, Jan 18, 2018 at 6:03 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 17:54), Willem de Bruijn wrote:
>> > 2. If we have the option of passing completion-notification up as ancillary
>> >    data on the pollin/recvmsg channel itself (instead of MSG_ERRQUEUE)
>>
>> This assumes a somewhat symmetric workload, where there are enough recv
>> calls to reap the notification associated with the send calls.
>
> Your comment about the assumption is true, but at least for the database
> use-cases, we have a request-response model, so the assumption works out.
> I don't know whether many other workloads that send large buffers have
> this pattern.

If that is true in general for PF_RDS, then it is a reasonable approach.
How about treating it as a (follow-on) optimization path? Opportunistic
piggybacking of notifications on data reads is more widely applicable.

>
>> I would stay with MSG_ERRQUEUE processing. One option is to pass data
>> up to userspace in the data portion of the notification skb instead of
>> encoding it in ancillary data, like tcp_get_timestamping_opt_stats.
>
> that's similar to what I have, except that it does not have the
> MSG_PEEK part (you'd need to enforce that the data portion
> is upper-bounded, and that the application has the responsibility
> of sending down "enough" buffer with recvmsg).

Right. I think that an upper bound is the simplest solution here.

By the way, if you allocate an skb immediately on page pinning, then
there are always sufficient skbs to store all notifications. On errqueue
enqueue just drop the new skb and copy its notification to the body of
the skb already on the queue, if one exists and it has room. That is
essentially what the tcp zerocopy code does with the [data, info] range.

> Note that any one of these choices are ok with me- I have no
> special attachments to any of them.

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 23:09                   ` Willem de Bruijn
@ 2018-01-18 23:20                     ` Sowmini Varadhan
  2018-01-18 23:24                       ` Willem de Bruijn
  0 siblings, 1 reply; 24+ messages in thread
From: Sowmini Varadhan @ 2018-01-18 23:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On (01/18/18 18:09), Willem de Bruijn wrote:
> If that is true in general for PF_RDS, then it is a reasonable approach.
> How about treating it as a (follow-on) optimization path. Opportunistic
> piggybacking of notifications on data reads is more widely applicable.

sounds good.

> > that's similar to what I have, except that it does not have the
> > MSG_PEEK part (you'd need to enforce that the data portion
> > is upper-bounded, and that the application has the responsibility
> > of sending down "enough" buffer with recvmsg).
> 
> Right. I think that an upper bound is the simplest solution here.
> 
> By the way, if you allocate an skb immediately on page pinning, then
> there are always sufficient skbs to store all notifications. On errqueue
> enqueue just drop the new skb and copy its notification to the body of
> the skb already on the queue, if one exists and it has room. That is
> essentially what the tcp zerocopy code does with the [data, info] range.

OK, I'll give that a shot (I'm working through the other review comments
as well).

FWIW, the data-corruption issue I mentioned turned out to be a day-one
bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/).
The buffer reaping with zcopy (and the aggressiveness of rds-stress)
brought this one out.

--Sowmini

* Re: [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue
  2018-01-18 23:20                     ` Sowmini Varadhan
@ 2018-01-18 23:24                       ` Willem de Bruijn
  0 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2018-01-18 23:24 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Eric Dumazet, Network Development, David Miller, rds-devel,
	santosh.shilimkar

On Thu, Jan 18, 2018 at 6:20 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/18/18 18:09), Willem de Bruijn wrote:
>> If that is true in general for PF_RDS, then it is a reasonable approach.
>> How about treating it as a (follow-on) optimization path. Opportunistic
>> piggybacking of notifications on data reads is more widely applicable.
>
> sounds good.
>
>> > that's similar to what I have, except that it does not have the
>> > MSG_PEEK part (you'd need to enforce that the data portion
>> > is upper-bounded, and that the application has the responsibility
>> > of sending down "enough" buffer with recvmsg).
>>
>> Right. I think that an upper bound is the simplest solution here.
>>
>> By the way, if you allocate an skb immediately on page pinning, then
>> there are always sufficient skbs to store all notifications. On errqueue
>> enqueue just drop the new skb and copy its notification to the body of
>> the skb already on the queue, if one exists and it has room. That is
>> essentially what the tcp zerocopy code does with the [data, info] range.
>
> ok, I'll give that a shot (I'm working through the other review comments
> as well)
>
> fwiw, the data-corruption issue I mentioned turned out to be a day-one
> bug in rds-tcp (patched in http://patchwork.ozlabs.org/patch/863183/).
> The buffer reaping with zcopy (and the aggressiveness of rds-stress)
> brought this one out.

Thanks. Good to hear that it's not in zerocopy itself.

end of thread, other threads:[~2018-01-18 23:24 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-17 12:19 [PATCH RFC net-next 0/6] rds: zerocopy support Sowmini Varadhan
2018-01-17 12:19 ` [PATCH RFC net-next 1/6] sock: MSG_PEEK support for sk_error_queue Sowmini Varadhan
2018-01-17 23:50   ` Willem de Bruijn
2018-01-18 11:02     ` Sowmini Varadhan
2018-01-18 15:54       ` Eric Dumazet
2018-01-18 16:10         ` Sowmini Varadhan
2018-01-18 16:53           ` Eric Dumazet
2018-01-18 17:12             ` Sowmini Varadhan
2018-01-18 22:54               ` Willem de Bruijn
2018-01-18 23:03                 ` Sowmini Varadhan
2018-01-18 23:09                   ` Willem de Bruijn
2018-01-18 23:20                     ` Sowmini Varadhan
2018-01-18 23:24                       ` Willem de Bruijn
2018-01-18 15:51   ` Eric Dumazet
2018-01-17 12:20 ` [PATCH RFC net-next 2/6] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
2018-01-17 12:20 ` [PATCH RFC net-next 3/6] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
2018-01-17 12:20 ` [PATCH RFC net-next 4/6] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
2018-01-18  0:03   ` Willem de Bruijn
2018-01-17 12:20 ` [PATCH RFC net-next 5/6] rds: support for zcopy completion notification Sowmini Varadhan
2018-01-18  0:23   ` Willem de Bruijn
2018-01-18 11:40     ` Sowmini Varadhan
2018-01-18 22:46       ` Willem de Bruijn
2018-01-17 12:20 ` [PATCH RFC net-next 6/6] rds: zerocopy Tx support Sowmini Varadhan
2018-01-18  0:32   ` Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).