All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 net-next 0/7] RDS: zerocopy support
@ 2018-02-14 10:28 Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

This is version 2 of the series at 
  https://www.mail-archive.com/netdev@vger.kernel.org/msg213829.html

Review comments addressed
Patch 4: 
  - make sure to always sock_put m_rs even if there is no znotifier.
  - major rewrite of notification, resulting in much simplification.

Patch 5:
  - remove unused data_len argument to rds_rm_size; 
  - unmap as necessary if we fail in the middle of zerocopy setup

Patch 7:
  - restructured do_recv_completion to avoid excessive code re-indent
  - on-stack allocation of cmsghdr for cookie in do_sendmsg
  - Additional verification: Verify ncookies <= MAX_.., verify ret ==
    ncookies * sizeof(uint32_t)

A brief overview of this feature follows.

This patch series provides support for MSG_ZERCOCOPY
on a PF_RDS socket based on the APIs and infrastructure added
by f214f915e7db ("tcp: enable MSG_ZEROCOPY")

For single threaded rds-stress testing using rds-tcp with the
ixgbe driver using 1M message sizes (-a 1M -q 1M) preliminary
results show that  there is a significant reduction in latency: about
90 usec with zerocopy, compared with 200 usec without zerocopy.

This patchset modifies the above for zerocopy in the following manner.
- if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
- if the SO_ZEROCOPY  socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg are pinned. The pinning
uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit 
on MSG_ZEROCOPY pages"). The message is unpinned when all references
to the message go down to 0, and the message is freed by rds_message_purge.

A multithreaded application using this infrastructure must send down 
a unique 32 bit cookie as ancillary data with each sendmsg invocation.
The format of this ancillary data is described in Patch 5 of the series.
The cookie is passed up to the application on the sk_error_queue when 
the message is unpinned, indicating to the application that it is now
safe to free/reuse the message buffer. The details of the completion
notifiction are provided in Patch 4 of this series.


Sowmini Varadhan (7):
  skbuff: export mm_[un]account_pinned_pages for other modules
  rds: hold a sock ref from rds_message to the rds_sock
  sock: permit SO_ZEROCOPY on PF_RDS socket
  rds: support for zcopy completion notification
  rds: zerocopy Tx support.
  selftests/net: add support for PF_RDS sockets
  selftests/net: add zerocopy support for PF_RDS test case

 include/linux/skbuff.h                     |    3 +
 include/uapi/linux/errqueue.h              |    2 +
 include/uapi/linux/rds.h                   |    1 +
 net/core/skbuff.c                          |    6 +-
 net/core/sock.c                            |   25 +++---
 net/rds/af_rds.c                           |    2 +
 net/rds/message.c                          |  132 ++++++++++++++++++++++++++-
 net/rds/rds.h                              |   17 ++++-
 net/rds/recv.c                             |    2 +
 net/rds/send.c                             |   51 ++++++++---
 tools/testing/selftests/net/msg_zerocopy.c |  133 ++++++++++++++++++++++++++-
 11 files changed, 339 insertions(+), 35 deletions(-)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

RDS would like to use the helper functions for managing pinned pages
added by Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/skbuff.h |    3 +++
 net/core/skbuff.c      |    6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6..8e2730a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -466,6 +466,9 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
+void mm_unaccount_pinned_pages(struct mmpin *mmp);
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285..272a513 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -890,7 +890,7 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 {
 	unsigned long max_pg, num_pg, new_pg, old_pg;
 	struct user_struct *user;
@@ -919,14 +919,16 @@ static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
 
-static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct mmpin *mmp)
 {
 	if (mmp->user) {
 		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
 		free_uid(mmp->user);
 	}
 }
+EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next  2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 18:38   ` Santosh Shilimkar
  2018-02-14 10:28 ` [PATCH V2 net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

The existing model holds a reference from the rds_sock to the
rds_message, but the rds_message does not itself hold a sock_put()
on the rds_sock. Instead the m_rs field in the rds_message is
assigned when the message is queued on the sock, and nulled when
the message is dequeued from the sock.

We want to be able to notify userspace when the rds_message
is actually freed (from rds_message_purge(), after the refcounts
to the rds_message go to 0). At the time that rds_message_purge()
is called, the message is no longer on the rds_sock retransmit
queue. Thus the explicit reference for the m_rs is needed to
send a notification that will signal to userspace that
it is now safe to free/reuse any pages that may have
been pinned down for zerocopy.

This patch manages the m_rs assignment in the rds_message with
the necessary refcount book-keeping.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/message.c |    8 +++++++-
 net/rds/send.c    |    7 +------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/rds/message.c b/net/rds/message.c
index 4318cc9..ef3daaf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -58,7 +58,7 @@ void rds_message_addref(struct rds_message *rm)
  */
 static void rds_message_purge(struct rds_message *rm)
 {
-	unsigned long i;
+	unsigned long i, flags;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
@@ -69,6 +69,12 @@ static void rds_message_purge(struct rds_message *rm)
 		__free_page(sg_page(&rm->data.op_sg[i]));
 	}
 	rm->data.op_nents = 0;
+	spin_lock_irqsave(&rm->m_rs_lock, flags);
+	if (rm->m_rs) {
+		sock_put(rds_rs_to_sk(rm->m_rs));
+		rm->m_rs = NULL;
+	}
+	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
diff --git a/net/rds/send.c b/net/rds/send.c
index d3e32d1..5ac0925 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -649,7 +649,6 @@ static void rds_send_remove_from_sock(struct list_head *messages, int status)
 				rm->rdma.op_notifier = NULL;
 			}
 			was_on_sock = 1;
-			rm->m_rs = NULL;
 		}
 		spin_unlock(&rs->rs_lock);
 
@@ -756,9 +755,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		 */
 		if (!test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
 			spin_unlock_irqrestore(&cp->cp_lock, flags);
-			spin_lock_irqsave(&rm->m_rs_lock, flags);
-			rm->m_rs = NULL;
-			spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 			continue;
 		}
 		list_del_init(&rm->m_conn_item);
@@ -774,7 +770,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -798,7 +793,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -849,6 +843,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
 		set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
 		rds_message_addref(rm);
+		sock_hold(rds_rs_to_sk(rs));
 		rm->m_rs = rs;
 
 		/* The code ordering is a little weird, but we're
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

allow the application to set SO_ZEROCOPY on the underlying sk
of a PF_RDS socket

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/sock.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..3c75adc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1049,18 +1049,21 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		break;
 
 	case SO_ZEROCOPY:
-		if (sk->sk_family != PF_INET && sk->sk_family != PF_INET6)
+		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
+			if (sk->sk_protocol != IPPROTO_TCP)
+				ret = -ENOTSUPP;
+			else if (sk->sk_state != TCP_CLOSE)
+				ret = -EBUSY;
+		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
-		else if (sk->sk_protocol != IPPROTO_TCP)
-			ret = -ENOTSUPP;
-		else if (sk->sk_state != TCP_CLOSE)
-			ret = -EBUSY;
-		else if (val < 0 || val > 1)
-			ret = -EINVAL;
-		else
-			sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
-		break;
-
+		}
+		if (!ret) {
+			if (val < 0 || val > 1)
+				ret = -EINVAL;
+			else
+				sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
+			break;
+		}
 	default:
 		ret = -ENOPROTOOPT;
 		break;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
                   ` (2 preceding siblings ...)
  2018-02-14 10:28 ` [PATCH V2 net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 18:50   ` Santosh Shilimkar
                     ` (2 more replies)
  2018-02-14 10:28 ` [PATCH V2 net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

RDS removes a datagram (rds_message) from the retransmit queue when
an ACK is received. The ACK indicates that the receiver has queued
the RDS datagram, so that the sender can safely forget the datagram.
When all references to the rds_message are quiesced, rds_message_purge
is called to release resources used by the rds_message

If the datagram to be removed had pinned pages set up, add
an entry to the rs->rs_znotify_queue so that the notifcation
will be sent up via rds_rm_zerocopy_callback() when the
rds_message is eventually freed by rds_message_purge.

rds_rm_zerocopy_callback() attempts to batch the number of cookies
sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
This is achieved by checking the tail skb in the sk_error_queue:
if this has room for one more cookie, the cookie from the
current notification is added; else a new skb is added to the
sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
trigger a ->sk_error_report to notify the application.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: 
  - make sure to always sock_put m_rs even if there is no znotifier.
  - major rewrite of notification, resulting in much simplification.

 include/uapi/linux/errqueue.h |    2 +
 net/rds/af_rds.c              |    2 +
 net/rds/message.c             |   83 +++++++++++++++++++++++++++++++++++++---
 net/rds/rds.h                 |   14 +++++++
 net/rds/recv.c                |    2 +
 5 files changed, 96 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfa..28812ed 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,11 +20,13 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
 #define SO_EE_ORIGIN_ZEROCOPY	5
+#define SO_EE_ORIGIN_ZCOOKIE	6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED	1
+#define	SO_EE_ORIGIN_MAX_ZCOOKIES	8
 
 /**
  *	struct scm_timestamping - timestamps exposed through cmsg
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b405f77..74a4bf8 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -183,6 +183,8 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 		mask |= (POLLIN | POLLRDNORM);
 	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
 		mask |= (POLLOUT | POLLWRNORM);
+	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
+		mask |= POLLERR;
 	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
 
 	/* clear state any time we wake a seen-congested socket */
diff --git a/net/rds/message.c b/net/rds/message.c
index ef3daaf..d874b74 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -33,6 +33,9 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/skbuff.h>
+#include <linux/list.h>
+#include <linux/errqueue.h>
 
 #include "rds.h"
 
@@ -53,29 +56,95 @@ void rds_message_addref(struct rds_message *rm)
 }
 EXPORT_SYMBOL_GPL(rds_message_addref);
 
+static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	int ncookies;
+	u32 *ptr;
+
+	if (serr->ee.ee_origin != SO_EE_ORIGIN_ZCOOKIE)
+		return false;
+	ncookies = serr->ee.ee_data;
+	if (ncookies == SO_EE_ORIGIN_MAX_ZCOOKIES)
+		return false;
+	ptr = skb_put(skb, sizeof(u32));
+	*ptr = cookie;
+	serr->ee.ee_data = ++ncookies;
+	return true;
+}
+
+static void rds_rm_zerocopy_callback(struct rds_sock *rs,
+				     struct rds_znotifier *znotif)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct sk_buff *skb, *tail;
+	struct sock_exterr_skb *serr;
+	unsigned long flags;
+	struct sk_buff_head *q;
+	u32 cookie = znotif->z_cookie;
+
+	q = &sk->sk_error_queue;
+	spin_lock_irqsave(&q->lock, flags);
+	tail = skb_peek_tail(q);
+
+	if (tail && skb_zcookie_add(tail, cookie)) {
+		spin_unlock_irqrestore(&q->lock, flags);
+		mm_unaccount_pinned_pages(&znotif->z_mmp);
+		consume_skb(rds_skb_from_znotifier(znotif));
+		sk->sk_error_report(sk);
+		return;
+	}
+
+	skb = rds_skb_from_znotifier(znotif);
+	serr = SKB_EXT_ERR(skb);
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
+	serr->ee.ee_info = 0;
+	serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
+	WARN_ON(!skb_zcookie_add(skb, cookie));
+
+	__skb_queue_tail(q, skb);
+
+	spin_unlock_irqrestore(&q->lock, flags);
+	sk->sk_error_report(sk);
+
+	mm_unaccount_pinned_pages(&znotif->z_mmp);
+}
+
 /*
  * This relies on dma_map_sg() not touching sg[].page during merging.
  */
 static void rds_message_purge(struct rds_message *rm)
 {
 	unsigned long i, flags;
+	bool zcopy = false;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
 
-	for (i = 0; i < rm->data.op_nents; i++) {
-		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
-		/* XXX will have to put_page for page refs */
-		__free_page(sg_page(&rm->data.op_sg[i]));
-	}
-	rm->data.op_nents = 0;
 	spin_lock_irqsave(&rm->m_rs_lock, flags);
 	if (rm->m_rs) {
-		sock_put(rds_rs_to_sk(rm->m_rs));
+		struct rds_sock *rs = rm->m_rs;
+
+		if (rm->data.op_mmp_znotifier) {
+			zcopy = true;
+			rds_rm_zerocopy_callback(rs, rm->data.op_mmp_znotifier);
+			rm->data.op_mmp_znotifier = NULL;
+		}
+		sock_put(rds_rs_to_sk(rs));
 		rm->m_rs = NULL;
 	}
 	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
+	for (i = 0; i < rm->data.op_nents; i++) {
+		/* XXX will have to put_page for page refs */
+		if (!zcopy)
+			__free_page(sg_page(&rm->data.op_sg[i]));
+		else
+			put_page(sg_page(&rm->data.op_sg[i]));
+	}
+	rm->data.op_nents = 0;
+
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
 	if (rm->rdma.op_rdma_mr)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 374ae83..6e8fc4c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -356,6 +356,19 @@ static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
 #define RDS_MSG_PAGEVEC		7
 #define RDS_MSG_FLUSH		8
 
+struct rds_znotifier {
+	struct list_head	z_list;
+	struct mmpin		z_mmp;
+	u32			z_cookie;
+};
+
+#define	RDS_ZCOPY_SKB(__skb)	((struct rds_znotifier *)&((__skb)->cb[0]))
+
+static inline struct sk_buff *rds_skb_from_znotifier(struct rds_znotifier *z)
+{
+	return container_of((void *)z, struct sk_buff, cb);
+}
+
 struct rds_message {
 	refcount_t		m_refcount;
 	struct list_head	m_sock_item;
@@ -436,6 +449,7 @@ struct rds_message {
 			unsigned int		op_count;
 			unsigned int		op_dmasg;
 			unsigned int		op_dmaoff;
+			struct rds_znotifier	*op_mmp_znotifier;
 			struct scatterlist	*op_sg;
 		} data;
 	};
diff --git a/net/rds/recv.c b/net/rds/recv.c
index b25bcfe..b080961 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -594,6 +594,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 
 	if (msg_flags & MSG_OOB)
 		goto out;
+	if (msg_flags & MSG_ERRQUEUE)
+		return sock_recv_errqueue(sk, msg, size, SOL_IP, IP_RECVERR);
 
 	while (1) {
 		/* If there are pending notifications, do those - and nothing else */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
                   ` (3 preceding siblings ...)
  2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 19:10   ` Santosh Shilimkar
  2018-02-14 23:48   ` Willem de Bruijn
  2018-02-14 10:28 ` [PATCH V2 net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned.

The pinning uses the accounting infrastructure added by
Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")

The payload bytes in the message may not be modified for the
duration that the message has been pinned. A multi-threaded
application using this infrastructure may thus need to be notified
about send-completion so that it can free/reuse the buffers
passed to rds_sendmsg(). Notification of send-completion will
identify each message-buffer by a cookie that the application
must specify as ancillary data to rds_sendmsg().
The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: 
  - remove unused data_len argument to rds_rm_size;
  - unmap as necessary if we fail in the middle of zerocopy setup

 include/uapi/linux/rds.h |    1 +
 net/rds/message.c        |   51 +++++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h            |    3 +-
 net/rds/send.c           |   44 ++++++++++++++++++++++++++++++++++-----
 4 files changed, 91 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index e71d449..12e3bca 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -103,6 +103,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_FADD	8
 #define RDS_CMSG_MASKED_ATOMIC_CSWP	9
 #define RDS_CMSG_RXPATH_LATENCY		11
+#define	RDS_CMSG_ZCOPY_COOKIE		12
 
 #define RDS_INFO_FIRST			10000
 #define RDS_INFO_COUNTERS		10000
diff --git a/net/rds/message.c b/net/rds/message.c
index d874b74..e499566 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -341,12 +341,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
 	return rm;
 }
 
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       bool zcopy)
 {
 	unsigned long to_copy, nbytes;
 	unsigned long sg_off;
 	struct scatterlist *sg;
 	int ret = 0;
+	int length = iov_iter_count(from);
 
 	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
 	sg = rm->data.op_sg;
 	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
 
+	if (zcopy) {
+		int total_copied = 0;
+		struct sk_buff *skb;
+
+		skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
+				GFP_KERNEL);
+		if (!skb)
+			return -ENOMEM;
+		rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
+		memset(rm->data.op_mmp_znotifier, 0,
+		       sizeof(*rm->data.op_mmp_znotifier));
+		if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+					    length)) {
+			consume_skb(skb);
+			rm->data.op_mmp_znotifier = NULL;
+			return -ENOMEM;
+		}
+		while (iov_iter_count(from)) {
+			struct page *pages;
+			size_t start;
+			ssize_t copied;
+
+			copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
+						    1, &start);
+			if (copied < 0) {
+				struct mmpin *mmp;
+				int i;
+
+				for (i = 0; i < rm->data.op_nents; i++)
+					put_page(sg_page(&rm->data.op_sg[i]));
+				mmp = &rm->data.op_mmp_znotifier->z_mmp;
+				mm_unaccount_pinned_pages(mmp);
+				consume_skb(skb);
+				rm->data.op_mmp_znotifier = NULL;
+				return -EFAULT;
+			}
+			total_copied += copied;
+			iov_iter_advance(from, copied);
+			length -= copied;
+			sg_set_page(sg, pages, copied, start);
+			rm->data.op_nents++;
+			sg++;
+		}
+		WARN_ON_ONCE(length != 0);
+		return ret;
+	} /* zcopy */
+
 	while (iov_iter_count(from)) {
 		if (!sg_page(sg)) {
 			ret = rds_page_remainder_alloc(sg, iov_iter_count(from),
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 6e8fc4c..dfdc9b3 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -784,7 +784,8 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 /* message.c */
 struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
 struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents);
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from);
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       bool zcopy);
 struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
 void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
 				 __be16 dport, u64 seq);
diff --git a/net/rds/send.c b/net/rds/send.c
index 5ac0925..80171cf 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -875,12 +875,13 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
  * rds_message is getting to be quite complicated, and we'd like to allocate
  * it all in one go. This figures out how big it needs to be up front.
  */
-static int rds_rm_size(struct msghdr *msg, int data_len)
+static int rds_rm_size(struct msghdr *msg, int num_sgs)
 {
 	struct cmsghdr *cmsg;
 	int size = 0;
 	int cmsg_groups = 0;
 	int retval;
+	bool zcopy_cookie = false;
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
@@ -899,6 +900,8 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 
 			break;
 
+		case RDS_CMSG_ZCOPY_COOKIE:
+			zcopy_cookie = true;
 		case RDS_CMSG_RDMA_DEST:
 		case RDS_CMSG_RDMA_MAP:
 			cmsg_groups |= 2;
@@ -919,7 +922,10 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 
 	}
 
-	size += ceil(data_len, PAGE_SIZE) * sizeof(struct scatterlist);
+	if ((msg->msg_flags & MSG_ZEROCOPY) && !zcopy_cookie)
+		return -EINVAL;
+
+	size += num_sgs * sizeof(struct scatterlist);
 
 	/* Ensure (DEST, MAP) are never used with (ARGS, ATOMIC) */
 	if (cmsg_groups == 3)
@@ -928,6 +934,18 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 	return size;
 }
 
+static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	u32 *cookie;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
+		return -EINVAL;
+	cookie = CMSG_DATA(cmsg);
+	rm->data.op_mmp_znotifier->z_cookie = *cookie;
+	return 0;
+}
+
 static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			 struct msghdr *msg, int *allocated_mr)
 {
@@ -970,6 +988,10 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			ret = rds_cmsg_atomic(rs, rm, cmsg);
 			break;
 
+		case RDS_CMSG_ZCOPY_COOKIE:
+			ret = rds_cmsg_zcopy(rs, rm, cmsg);
+			break;
+
 		default:
 			return -EINVAL;
 		}
@@ -1040,10 +1062,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 	long timeo = sock_sndtimeo(sk, nonblock);
 	struct rds_conn_path *cpath;
 	size_t total_payload_len = payload_len, rdma_payload_len = 0;
+	bool zcopy = ((msg->msg_flags & MSG_ZEROCOPY) &&
+		      sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY));
+	int num_sgs = ceil(payload_len, PAGE_SIZE);
 
 	/* Mirror Linux UDP mirror of BSD error message compatibility */
 	/* XXX: Perhaps MSG_MORE someday */
-	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT | MSG_ZEROCOPY)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -1087,8 +1112,15 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		goto out;
 	}
 
+	if (zcopy) {
+		if (rs->rs_transport->t_type != RDS_TRANS_TCP) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+		num_sgs = iov_iter_npages(&msg->msg_iter, INT_MAX);
+	}
 	/* size of rm including all sgs */
-	ret = rds_rm_size(msg, payload_len);
+	ret = rds_rm_size(msg, num_sgs);
 	if (ret < 0)
 		goto out;
 
@@ -1100,12 +1132,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* Attach data to the rm */
 	if (payload_len) {
-		rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, PAGE_SIZE));
+		rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);
 		if (!rm->data.op_sg) {
 			ret = -ENOMEM;
 			goto out;
 		}
-		ret = rds_message_copy_from_user(rm, &msg->msg_iter);
+		ret = rds_message_copy_from_user(rm, &msg->msg_iter, zcopy);
 		if (ret)
 			goto out;
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 6/7] selftests/net: add support for PF_RDS sockets
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
                   ` (4 preceding siblings ...)
  2018-02-14 10:28 ` [PATCH V2 net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-14 10:28 ` [PATCH V2 net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
  2018-02-15  0:19 ` [PATCH V2 net-next 0/7] RDS: zerocopy support Willem de Bruijn
  7 siblings, 0 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Add support for basic PF_RDS client-server testing in msg_zerocopy

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 tools/testing/selftests/net/msg_zerocopy.c |   65 +++++++++++++++++++++++++++-
 1 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index e11fe84..7a5b353 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -14,6 +14,9 @@
  * - SOCK_DGRAM
  * - SOCK_RAW
  *
+ * PF_RDS
+ * - SOCK_SEQPACKET
+ *
  * Start this program on two connected hosts, one in send mode and
  * the other with option '-r' to put it in receiver mode.
  *
@@ -53,6 +56,7 @@
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
+#include <linux/rds.h>
 
 #ifndef SO_EE_ORIGIN_ZEROCOPY
 #define SO_EE_ORIGIN_ZEROCOPY		5
@@ -300,10 +304,15 @@ static int do_setup_tx(int domain, int type, int protocol)
 	if (cfg_zerocopy)
 		do_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, 1);
 
-	if (domain != PF_PACKET)
+	if (domain != PF_PACKET && domain != PF_RDS)
 		if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
 			error(1, errno, "connect");
 
+	if (domain == PF_RDS) {
+		if (bind(fd, (void *) &cfg_src_addr, cfg_alen))
+			error(1, errno, "bind");
+	}
+
 	return fd;
 }
 
@@ -444,6 +453,13 @@ static void do_tx(int domain, int type, int protocol)
 		msg.msg_iovlen++;
 	}
 
+	if (domain == PF_RDS) {
+		msg.msg_name = &cfg_dst_addr;
+		msg.msg_namelen =  (cfg_dst_addr.ss_family == AF_INET ?
+				    sizeof(struct sockaddr_in) :
+				    sizeof(struct sockaddr_in6));
+	}
+
 	iov[2].iov_base = payload;
 	iov[2].iov_len = cfg_payload_len;
 	msg.msg_iovlen++;
@@ -555,6 +571,40 @@ static void do_flush_datagram(int fd, int type)
 	bytes += cfg_payload_len;
 }
 
+
+static void do_recvmsg(int fd)
+{
+	int ret, off = 0;
+	char *buf;
+	struct iovec iov;
+	struct msghdr msg;
+	struct sockaddr_storage din;
+
+	buf = calloc(cfg_payload_len, sizeof(char));
+	iov.iov_base = buf;
+	iov.iov_len = cfg_payload_len;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	ret = recvmsg(fd, &msg, MSG_TRUNC);
+
+	if (ret == -1)
+		error(1, errno, "recv");
+	if (ret != cfg_payload_len)
+		error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
+
+	if (memcmp(buf + off, payload, ret))
+		error(1, 0, "recv: data mismatch");
+
+	free(buf);
+	packets++;
+	bytes += cfg_payload_len;
+}
+
 static void do_rx(int domain, int type, int protocol)
 {
 	uint64_t tstop;
@@ -566,6 +616,8 @@ static void do_rx(int domain, int type, int protocol)
 	do {
 		if (type == SOCK_STREAM)
 			do_flush_tcp(fd);
+		else if (domain == PF_RDS)
+			do_recvmsg(fd);
 		else
 			do_flush_datagram(fd, type);
 
@@ -610,6 +662,7 @@ static void parse_opts(int argc, char **argv)
 				    40 /* max tcp options */;
 	int c;
 	char *daddr = NULL, *saddr = NULL;
+	char *cfg_test;
 
 	cfg_payload_len = max_payload_len;
 
@@ -667,6 +720,14 @@ static void parse_opts(int argc, char **argv)
 			break;
 		}
 	}
+
+	cfg_test = argv[argc - 1];
+	if (strcmp(cfg_test, "rds") == 0) {
+		if (!daddr)
+			error(1, 0, "-D <server addr> required for PF_RDS\n");
+		if (!cfg_rx && !saddr)
+			error(1, 0, "-S <client addr> required for PF_RDS\n");
+	}
 	setup_sockaddr(cfg_family, daddr, &cfg_dst_addr);
 	setup_sockaddr(cfg_family, saddr, &cfg_src_addr);
 
@@ -699,6 +760,8 @@ int main(int argc, char **argv)
 		do_test(cfg_family, SOCK_STREAM, 0);
 	else if (!strcmp(cfg_test, "udp"))
 		do_test(cfg_family, SOCK_DGRAM, 0);
+	else if (!strcmp(cfg_test, "rds"))
+		do_test(PF_RDS, SOCK_SEQPACKET, 0);
 	else
 		error(1, 0, "unknown cfg_test %s", cfg_test);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V2 net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
                   ` (5 preceding siblings ...)
  2018-02-14 10:28 ` [PATCH V2 net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
@ 2018-02-14 10:28 ` Sowmini Varadhan
  2018-02-15  0:19 ` [PATCH V2 net-next 0/7] RDS: zerocopy support Willem de Bruijn
  7 siblings, 0 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 10:28 UTC (permalink / raw)
  To: netdev, willemdebruijn.kernel
  Cc: davem, rds-devel, sowmini.varadhan, santosh.shilimkar

Send a cookie with sendmsg() on PF_RDS sockets, and process the
returned batched cookies in do_recv_completion()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
v2: 
  - restructured do_recv_completion to avoid excessive code re-indent
  - on-stack allocation of cmsghdr for cookie in do_sendmsg
  - Additional verification: Verify ncookies <= MAX_.., verify ret ==
    ncookies * sizeof(uint32_t)

 tools/testing/selftests/net/msg_zerocopy.c |   68 ++++++++++++++++++++++++++--
 1 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 7a5b353..5cc2a53 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -168,17 +168,39 @@ static int do_accept(int fd)
 	return fd;
 }
 
-static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
+static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie)
+{
+	struct cmsghdr *cm;
+
+	if (!msg->msg_control)
+		error(1, errno, "NULL cookie");
+	cm = (void *)msg->msg_control;
+	cm->cmsg_len = CMSG_LEN(sizeof(cookie));
+	cm->cmsg_level = SOL_RDS;
+	cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
+	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
+}
+
+static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 {
 	int ret, len, i, flags;
+	static uint32_t cookie;
+	char ckbuf[CMSG_SPACE(sizeof(cookie))];
 
 	len = 0;
 	for (i = 0; i < msg->msg_iovlen; i++)
 		len += msg->msg_iov[i].iov_len;
 
 	flags = MSG_DONTWAIT;
-	if (do_zerocopy)
+	if (do_zerocopy) {
 		flags |= MSG_ZEROCOPY;
+		if (domain == PF_RDS) {
+			memset(&msg->msg_control, 0, sizeof(msg->msg_control));
+			msg->msg_controllen = CMSG_SPACE(sizeof(cookie));
+			msg->msg_control = (struct cmsghdr *)ckbuf;
+			add_zcopy_cookie(msg, ++cookie);
+		}
+	}
 
 	ret = sendmsg(fd, msg, flags);
 	if (ret == -1 && errno == EAGAIN)
@@ -194,6 +216,10 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
 		if (do_zerocopy && ret)
 			expected_completions++;
 	}
+	if (do_zerocopy && domain == PF_RDS) {
+		msg->msg_control = NULL;
+		msg->msg_controllen = 0;
+	}
 
 	return true;
 }
@@ -220,7 +246,9 @@ static void do_sendmsg_corked(int fd, struct msghdr *msg)
 		msg->msg_iov[0].iov_len = payload_len + extra_len;
 		extra_len = 0;
 
-		do_sendmsg(fd, msg, do_zerocopy);
+		do_sendmsg(fd, msg, do_zerocopy,
+			   (cfg_dst_addr.ss_family == AF_INET ?
+			    PF_INET : PF_INET6));
 	}
 
 	do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
@@ -316,6 +344,26 @@ static int do_setup_tx(int domain, int type, int protocol)
 	return fd;
 }
 
+static int do_process_zerocopy_cookies(struct sock_extended_err *serr,
+				       uint32_t *ckbuf, size_t nbytes)
+{
+	int ncookies, i;
+
+	if (serr->ee_errno != 0)
+		error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
+	ncookies = serr->ee_data;
+	if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
+		error(1, 0, "Returned %d cookies, max expected %d\n",
+		      ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES);
+	if (nbytes != ncookies * sizeof(uint32_t))
+		error(1, 0, "Expected %d cookies, got %ld\n",
+		      ncookies, nbytes/sizeof(uint32_t));
+	for (i = 0; i < ncookies; i++)
+		if (cfg_verbose >= 2)
+			fprintf(stderr, "%d\n", ckbuf[i]);
+	return ncookies;
+}
+
 static bool do_recv_completion(int fd)
 {
 	struct sock_extended_err *serr;
@@ -324,10 +372,17 @@ static bool do_recv_completion(int fd)
 	uint32_t hi, lo, range;
 	int ret, zerocopy;
 	char control[100];
+	uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
+	struct iovec iov;
 
 	msg.msg_control = control;
 	msg.msg_controllen = sizeof(control);
 
+	iov.iov_base = ckbuf;
+	iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
 	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
 	if (ret == -1 && errno == EAGAIN)
 		return false;
@@ -346,6 +401,11 @@ static bool do_recv_completion(int fd)
 		      cm->cmsg_level, cm->cmsg_type);
 
 	serr = (void *) CMSG_DATA(cm);
+
+	if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
+		completions += do_process_zerocopy_cookies(serr, ckbuf, ret);
+		return true;
+	}
 	if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
 		error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
 	if (serr->ee_errno != 0)
@@ -470,7 +530,7 @@ static void do_tx(int domain, int type, int protocol)
 		if (cfg_cork)
 			do_sendmsg_corked(fd, &msg);
 		else
-			do_sendmsg(fd, &msg, cfg_zerocopy);
+			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
 
 		while (!do_poll(fd, POLLOUT)) {
 			if (cfg_zerocopy)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-02-14 10:28 ` [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
@ 2018-02-14 18:38   ` Santosh Shilimkar
  0 siblings, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 18:38 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, willemdebruijn.kernel; +Cc: davem, rds-devel

On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
> The existing model holds a reference from the rds_sock to the
> rds_message, but the rds_message does not itself hold a sock_put()
> on the rds_sock. Instead the m_rs field in the rds_message is
> assigned when the message is queued on the sock, and nulled when
> the message is dequeued from the sock.
> 
> We want to be able to notify userspace when the rds_message
> is actually freed (from rds_message_purge(), after the refcounts
> to the rds_message go to 0). At the time that rds_message_purge()
> is called, the message is no longer on the rds_sock retransmit
> queue. Thus the explicit reference for the m_rs is needed to
> send a notification that will signal to userspace that
> it is now safe to free/reuse any pages that may have
> been pinned down for zerocopy.
> 
> This patch manages the m_rs assignment in the rds_message with
> the necessary refcount book-keeping.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

[...]


> @@ -756,9 +755,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
>   		 */
>   		if (!test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
>   			spin_unlock_irqrestore(&cp->cp_lock, flags);
> -			spin_lock_irqsave(&rm->m_rs_lock, flags);
> -			rm->m_rs = NULL;
> -			spin_unlock_irqrestore(&rm->m_rs_lock, flags);
>   			continue;
>   		}
>   		list_del_init(&rm->m_conn_item);
This hunk was clearly wrong so good that you got rid of it as well.
Patch looks fine to me.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-02-14 18:50   ` Santosh Shilimkar
  2018-02-14 19:01     ` Sowmini Varadhan
  2018-02-14 22:26     ` Willem de Bruijn
  2018-02-15  0:12   ` Willem de Bruijn
  2018-02-15  0:41   ` Willem de Bruijn
  2 siblings, 2 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 18:50 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, willemdebruijn.kernel; +Cc: davem, rds-devel

On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
> 
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
> 
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
> v2:
>    - make sure to always sock_put m_rs even if there is no znotifier.
>    - major rewrite of notification, resulting in much simplification.
>
>   include/uapi/linux/errqueue.h |    2 +
>   net/rds/af_rds.c              |    2 +
>   net/rds/message.c             |   83 +++++++++++++++++++++++++++++++++++++---
>   net/rds/rds.h                 |   14 +++++++
>   net/rds/recv.c                |    2 +
>   5 files changed, 96 insertions(+), 7 deletions(-)
>
generic comment and please update it where it is applicable
in terms of variable names, notifiers etc.

RDS support true zero copy already with RDMA transport so some of
this code can easily get confused.

So I suggest something like below.
s/zerocopy/zeromsgcopy
s/zcopy/zmsgcopy
s/zcookie/zmsgcpycookie
s/znotifier/zmsgcpynotifier	


> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index dc64cfa..28812ed 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -20,11 +20,13 @@ struct sock_extended_err {
>   #define SO_EE_ORIGIN_ICMP6	3
>   #define SO_EE_ORIGIN_TXSTATUS	4
>   #define SO_EE_ORIGIN_ZEROCOPY	5
> +#define SO_EE_ORIGIN_ZCOOKIE	6
>   #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
>   
>   #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
>   
>   #define SO_EE_CODE_ZEROCOPY_COPIED	1
> +#define	SO_EE_ORIGIN_MAX_ZCOOKIES	8

This error change might need to go though other subsystem tree. May
be you can seperate it and also copy "linux-api@vger.kernel.org"

Otherwise changes looks fine to me.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 18:50   ` Santosh Shilimkar
@ 2018-02-14 19:01     ` Sowmini Varadhan
  2018-02-14 19:02       ` David Miller
  2018-02-14 19:17       ` Santosh Shilimkar
  2018-02-14 22:26     ` Willem de Bruijn
  1 sibling, 2 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 19:01 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: netdev, willemdebruijn.kernel, davem, rds-devel

On (02/14/18 10:50), Santosh Shilimkar wrote:
> generic comment and please update it where it is applicable
> in terms of variable names, notifiers etc.
> 
> RDS support true zero copy already with RDMA transport so some of
> this code can easily get confused.
> 
> So I suggest something like below.
> s/zerocopy/zeromsgcopy
> s/zcopy/zmsgcopy
> s/zcookie/zmsgcpycookie
> s/znotifier/zmsgcpynotifier	

I'd like to hear some additional opinions from the list on this: 
the existing socket API for TCP etc.  already uses ZEROCOPY, and other
than extending variable names (and putting me at risk of violating the
"fit within 80 chars per line" requirement, leading to not-so-pretty
line wraps), I'm not seeing much value in this.

> This error change might need to go though other subsystem tree. May
> be you can seperate it and also copy "linux-api@vger.kernel.org"

sure I can do that, either  when I put out patch v3 or later today,
after others have had a chance to review this.

--Sowmini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 19:01     ` Sowmini Varadhan
@ 2018-02-14 19:02       ` David Miller
  2018-02-14 21:10         ` Santosh Shilimkar
  2018-02-14 19:17       ` Santosh Shilimkar
  1 sibling, 1 reply; 31+ messages in thread
From: David Miller @ 2018-02-14 19:02 UTC (permalink / raw)
  To: sowmini.varadhan
  Cc: santosh.shilimkar, netdev, willemdebruijn.kernel, rds-devel

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 14 Feb 2018 14:01:10 -0500

> On (02/14/18 10:50), Santosh Shilimkar wrote:
>> generic comment and please update it where it is applicable
>> in terms of variable names, notifiers etc.
>> 
>> RDS support true zero copy already with RDMA transport so some of
>> this code can easily get confused.
>> 
>> So I suggest something like below.
>> s/zerocopy/zeromsgcopy
>> s/zcopy/zmsgcopy
>> s/zcookie/zmsgcpycookie
>> s/znotifier/zmsgcpynotifier	
> 
> I'd like to hear some additional opinions from the list on this: 
> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
> than extending variable names (and putting me at risk of violating the
> "fit within 80 chars per line" requirement, leading to not-so-pretty
> line wraps), I'm not seeing much value in this.

I agree, this name change requires seems pointless.  Just keep the names
the way they are, thank you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 10:28 ` [PATCH V2 net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
@ 2018-02-14 19:10   ` Santosh Shilimkar
  2018-02-14 19:49     ` Sowmini Varadhan
  2018-02-14 23:48   ` Willem de Bruijn
  1 sibling, 1 reply; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 19:10 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, willemdebruijn.kernel; +Cc: davem, rds-devel

On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
> If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg() are pinned.
> 
> The pinning uses the accounting infrastructure added by
> Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
> 
> The payload bytes in the message may not be modified for the
> duration that the message has been pinned. A multi-threaded
> application using this infrastructure may thus need to be notified
> about send-completion so that it can free/reuse the buffers
> passed to rds_sendmsg(). Notification of send-completion will
> identify each message-buffer by a cookie that the application
> must specify as ancillary data to rds_sendmsg().
> The ancillary data in this case has cmsg_level == SOL_RDS
> and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
> v2:
>    - remove unused data_len argument to rds_rm_size;
>    - unmap as necessary if we fail in the middle of zerocopy setup
> 
>   include/uapi/linux/rds.h |    1 +
>   net/rds/message.c        |   51 +++++++++++++++++++++++++++++++++++++++++++++-
>   net/rds/rds.h            |    3 +-
>   net/rds/send.c           |   44 ++++++++++++++++++++++++++++++++++-----
>   4 files changed, 91 insertions(+), 8 deletions(-)
> 
> diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
> index e71d449..12e3bca 100644
> --- a/include/uapi/linux/rds.h
> +++ b/include/uapi/linux/rds.h
> @@ -103,6 +103,7 @@
>   #define RDS_CMSG_MASKED_ATOMIC_FADD	8
>   #define RDS_CMSG_MASKED_ATOMIC_CSWP	9
>   #define RDS_CMSG_RXPATH_LATENCY		11
> +#define	RDS_CMSG_ZCOPY_COOKIE		12
>
s/RDS_CMSG_ZCOPY_COOKIE/RDS_CMSG_ZMSGCOPY_COOKIE	

>   #define RDS_INFO_FIRST			10000
>   #define RDS_INFO_COUNTERS		10000
> diff --git a/net/rds/message.c b/net/rds/message.c
> index d874b74..e499566 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -341,12 +341,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
>   	return rm;
>   }
>   
> -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
> +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
> +			       bool zcopy)
>   {
>   	unsigned long to_copy, nbytes;
>   	unsigned long sg_off;
>   	struct scatterlist *sg;
>   	int ret = 0;
> +	int length = iov_iter_count(from);
>   
>   	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>   
> @@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
>   	sg = rm->data.op_sg;
>   	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
>   
> +	if (zcopy) {
> +		int total_copied = 0;
> +		struct sk_buff *skb;
> +
> +		skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
> +				GFP_KERNEL);
This can sleep so you might want to check if you want to use ATOMIC 
version here.

> +		if (!skb)
> +			return -ENOMEM;
> +		rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
> +		memset(rm->data.op_mmp_znotifier, 0,
> +		       sizeof(*rm->data.op_mmp_znotifier));
> +		if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> +					    length)) {
> +			consume_skb(skb);
> +			rm->data.op_mmp_znotifier = NULL;
> +			return -ENOMEM;
> +		}
NOMEM new application visible change but probably the right one for this
particular case. Just need to make sure application can handle this
error.


> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 6e8fc4c..dfdc9b3 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -784,7 +784,8 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
>   /* message.c */
>   struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
>   struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents);
> -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from);
> +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
> +			       bool zcopy);
>   struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
>   void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
>   				 __be16 dport, u64 seq);
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 5ac0925..80171cf 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c

[...]

> @@ -1087,8 +1112,15 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
>   		goto out;
>   	}
>   
> +	if (zcopy) {
> +		if (rs->rs_transport->t_type != RDS_TRANS_TCP) {
> +			ret = -EOPNOTSUPP;
> +			goto out;
> +		}
> +		num_sgs = iov_iter_npages(&msg->msg_iter, INT_MAX);
> +	}

Instead of this transport check, lets move this under transport function
which then can be populated by TCP transport.

Rest of the changes looks good.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 19:01     ` Sowmini Varadhan
  2018-02-14 19:02       ` David Miller
@ 2018-02-14 19:17       ` Santosh Shilimkar
  1 sibling, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 19:17 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: netdev, willemdebruijn.kernel, davem, rds-devel

On 2/14/2018 11:01 AM, Sowmini Varadhan wrote:
> On (02/14/18 10:50), Santosh Shilimkar wrote:
>> generic comment and please update it where it is applicable
>> in terms of variable names, notifiers etc.
>>
>> RDS support true zero copy already with RDMA transport so some of
>> this code can easily get confused.
>>
>> So I suggest something like below.
>> s/zerocopy/zeromsgcopy
>> s/zcopy/zmsgcopy
>> s/zcookie/zmsgcpycookie
>> s/znotifier/zmsgcpynotifier	
> 
> I'd like to hear some additional opinions from the list on this:
> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
> than extending variable names (and putting me at risk of violating the
> "fit within 80 chars per line" requirement, leading to not-so-pretty
> line wraps), I'm not seeing much value in this.
> 
What you mean by not seeing value ?
It has a value for RDS code which already support ZERO copy via
RDMA transport. For TCP sockets in isolation, ZCOPY may be
alright. Patch under discussion is for RDS and hence the
comment.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 19:10   ` Santosh Shilimkar
@ 2018-02-14 19:49     ` Sowmini Varadhan
  2018-02-14 21:14       ` Santosh Shilimkar
  0 siblings, 1 reply; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 19:49 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: netdev, willemdebruijn.kernel, davem, rds-devel

On (02/14/18 11:10), Santosh Shilimkar wrote:
> s/RDS_CMSG_ZCOPY_COOKIE/RDS_CMSG_ZMSGCOPY_COOKIE	
> 

Please see https://www.spinics.net/lists/netdev/msg483627.html

> >@@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
> >  	sg = rm->data.op_sg;
> >  	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
> >+	if (zcopy) {
> >+		int total_copied = 0;
> >+		struct sk_buff *skb;
> >+
> >+		skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
> >+				GFP_KERNEL);
> This can sleep so you might want to check if you want to use ATOMIC version
> here.

I think it should be fine: rds_message_copy_from_user() is called
in process context, and if you notice, the calling function rds_sendmsg()
also has this
   1100         rm = rds_message_alloc(ret, GFP_KERNEL);
   1101         if (!rm) {
   1102                 ret = -ENOMEM;
   1103                 goto out;
   1104         }

    :
   1106         /* Attach data to the rm */
    :
   1113                 ret = rds_message_copy_from_user(rm, &msg->msg_iter);

So using GFP_KERNEL is as safe as the call at line 1100.


> >+			return -ENOMEM;
> >+		}
> NOMEM new application visible change but probably the right one for this
> particular case. Just need to make sure application can handle this
> error.

I think the application already handles this correctly (see line 1102 above)

Thanks for taking a look.

--Sowmini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 19:02       ` David Miller
@ 2018-02-14 21:10         ` Santosh Shilimkar
  2018-02-14 21:25           ` Sowmini Varadhan
  0 siblings, 1 reply; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 21:10 UTC (permalink / raw)
  To: David Miller; +Cc: sowmini.varadhan, netdev, willemdebruijn.kernel, rds-devel

On 2/14/2018 11:02 AM, David Miller wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Wed, 14 Feb 2018 14:01:10 -0500
> 
>> On (02/14/18 10:50), Santosh Shilimkar wrote:
>>> generic comment and please update it where it is applicable
>>> in terms of variable names, notifiers etc.
>>>
>>> RDS support true zero copy already with RDMA transport so some of
>>> this code can easily get confused.
>>>
>>> So I suggest something like below.
>>> s/zerocopy/zeromsgcopy
>>> s/zcopy/zmsgcopy
>>> s/zcookie/zmsgcpycookie
>>> s/znotifier/zmsgcpynotifier	
>>
>> I'd like to hear some additional opinions from the list on this:
>> the existing socket API for TCP etc.  already uses ZEROCOPY, and other
>> than extending variable names (and putting me at risk of violating the
>> "fit within 80 chars per line" requirement, leading to not-so-pretty
>> line wraps), I'm not seeing much value in this.
> 
> I agree, this name change requires seems pointless.  Just keep the names
> the way they are, thank you.
> 
Dave I understand your point for generic code and I was not all asking 
to rename anything in generic code.

My point was new code getting added to RDS to support this zero copy
message option. Within RDS, core code is common and all the 
infrastructure with it and it will be hard to follow the code if
that distinction is not maintained.

Ofcourse you have a last say on this list so if you want to
over rule the comment, I will step back :-)

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 19:49     ` Sowmini Varadhan
@ 2018-02-14 21:14       ` Santosh Shilimkar
  0 siblings, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 21:14 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: netdev, willemdebruijn.kernel, davem, rds-devel

On 2/14/2018 11:49 AM, Sowmini Varadhan wrote:
> On (02/14/18 11:10), Santosh Shilimkar wrote:
>> s/RDS_CMSG_ZCOPY_COOKIE/RDS_CMSG_ZMSGCOPY_COOKIE	
>>
> 
> Please see https://www.spinics.net/lists/netdev/msg483627.html
>
Just saw it and responded to Dave.


>>> @@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
>>>   	sg = rm->data.op_sg;
>>>   	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
>>> +	if (zcopy) {
>>> +		int total_copied = 0;
>>> +		struct sk_buff *skb;
>>> +
>>> +		skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
>>> +				GFP_KERNEL);
>> This can sleep so you might want to check if you want to use ATOMIC version
>> here.
> 
> I think it should be fine: rds_message_copy_from_user() is called
> in process context, and if you notice, the calling function rds_sendmsg()
> also has this
>     1100         rm = rds_message_alloc(ret, GFP_KERNEL);
>     1101         if (!rm) {
>     1102                 ret = -ENOMEM;
>     1103                 goto out;
>     1104         }
> 
>      :
>     1106         /* Attach data to the rm */
>      :
>     1113                 ret = rds_message_copy_from_user(rm, &msg->msg_iter);
> 
> So using GFP_KERNEL is as safe as the call at line 1100.
>
Was just asking you to check if it is safe. The path already
does that so we are good.

> 
>>> +			return -ENOMEM;
>>> +		}
>> NOMEM new application visible change but probably the right one for this
>> particular case. Just need to make sure application can handle this
>> error.
> 
> I think the application already handles this correctly (see line 1102 above)
> 
Indeed. Thanks for checking.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 21:10         ` Santosh Shilimkar
@ 2018-02-14 21:25           ` Sowmini Varadhan
  2018-02-14 21:39             ` Santosh Shilimkar
  2018-02-14 22:08             ` Santosh Shilimkar
  0 siblings, 2 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-14 21:25 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: David Miller, netdev, willemdebruijn.kernel, rds-devel

On (02/14/18 13:10), Santosh Shilimkar wrote:
> >>>RDS support true zero copy already with RDMA transport so some of
> >>>this code can easily get confused.

btw, another way to solve this is to have the RDMA code use the 
suffix "rdma" (which is what it really is) as needed.

--Sowmini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 21:25           ` Sowmini Varadhan
@ 2018-02-14 21:39             ` Santosh Shilimkar
  2018-02-14 22:08             ` Santosh Shilimkar
  1 sibling, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 21:39 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: David Miller, netdev, willemdebruijn.kernel, rds-devel

On 2/14/2018 1:25 PM, Sowmini Varadhan wrote:
> On (02/14/18 13:10), Santosh Shilimkar wrote:
>>>>> RDS support true zero copy already with RDMA transport so some of
>>>>> this code can easily get confused.
> 
> btw, another way to solve this is to have the RDMA code use the
> suffix "rdma" (which is what it really is) as needed.
> 
And same breath,here zcopy is No message from user ;-)
ZCOPY is otherwise often tied with RDMA directly.

Renaming churns are not that useful and I definitely agree
with what Dave said if it was renaming change to the
existing code like what you are suggesting with 'rdma'

Anyways I don't want to contest this too much since I can
follow that code and know what each does and means :-)

The comment was long term readability perspective for
some one completely new reading the code and being able to
distinguish the different modes.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 21:25           ` Sowmini Varadhan
  2018-02-14 21:39             ` Santosh Shilimkar
@ 2018-02-14 22:08             ` Santosh Shilimkar
  1 sibling, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-14 22:08 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: David Miller, netdev, willemdebruijn.kernel, rds-devel

On 2/14/2018 1:25 PM, Sowmini Varadhan wrote:
> On (02/14/18 13:10), Santosh Shilimkar wrote:
>>>>> RDS support true zero copy already with RDMA transport so some of
>>>>> this code can easily get confused.
> 
> btw, another way to solve this is to have the RDMA code use the
> suffix "rdma" (which is what it really is) as needed.
> 
And same breath,here zcopy is No message from user ;-)
ZCOPY is otherwise often tied with RDMA directly.

Renaming churns are not that useful and I definitely agree
with what Dave said if it was renaming change to the
existing code like what you are suggesting with 'rdma'

Anyways I don't want to contest this too much since I can
follow that code and know what each does and means :-)

The comment was long term readability perspective for
some one completely new reading the code and being able to
distinguish the different modes.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 18:50   ` Santosh Shilimkar
  2018-02-14 19:01     ` Sowmini Varadhan
@ 2018-02-14 22:26     ` Willem de Bruijn
  2018-02-15  0:19       ` Santosh Shilimkar
  1 sibling, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-14 22:26 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: Sowmini Varadhan, Network Development, David Miller, rds-devel

On Wed, Feb 14, 2018 at 1:50 PM, Santosh Shilimkar
<santosh.shilimkar@oracle.com> wrote:
> On 2/14/2018 2:28 AM, Sowmini Varadhan wrote:
>>
>> RDS removes a datagram (rds_message) from the retransmit queue when
>> an ACK is received. The ACK indicates that the receiver has queued
>> the RDS datagram, so that the sender can safely forget the datagram.
>> When all references to the rds_message are quiesced, rds_message_purge
>> is called to release resources used by the rds_message
>>
>> If the datagram to be removed had pinned pages set up, add
>> an entry to the rs->rs_znotify_queue so that the notifcation
>> will be sent up via rds_rm_zerocopy_callback() when the
>> rds_message is eventually freed by rds_message_purge.
>>
>> rds_rm_zerocopy_callback() attempts to batch the number of cookies
>> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
>> This is achieved by checking the tail skb in the sk_error_queue:
>> if this has room for one more cookie, the cookie from the
>> current notification is added; else a new skb is added to the
>> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
>> trigger a ->sk_error_report to notify the application.
>>
>> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>> v2:
>>    - make sure to always sock_put m_rs even if there is no znotifier.
>>    - major rewrite of notification, resulting in much simplification.
>>
>>   include/uapi/linux/errqueue.h |    2 +
>>   net/rds/af_rds.c              |    2 +
>>   net/rds/message.c             |   83
>> +++++++++++++++++++++++++++++++++++++---
>>   net/rds/rds.h                 |   14 +++++++
>>   net/rds/recv.c                |    2 +
>>   5 files changed, 96 insertions(+), 7 deletions(-)
>>
> generic comment and please update it where it is applicable
> in terms of variable names, notifiers etc.
>
> RDS support true zero copy already with RDMA transport so some of
> this code can easily get confused.
>
> So I suggest something like below.
> s/zerocopy/zeromsgcopy
> s/zcopy/zmsgcopy
> s/zcookie/zmsgcpycookie
> s/znotifier/zmsgcpynotifier
>
>
>> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
>> index dc64cfa..28812ed 100644
>> --- a/include/uapi/linux/errqueue.h
>> +++ b/include/uapi/linux/errqueue.h
>> @@ -20,11 +20,13 @@ struct sock_extended_err {
>>   #define SO_EE_ORIGIN_ICMP6    3
>>   #define SO_EE_ORIGIN_TXSTATUS 4
>>   #define SO_EE_ORIGIN_ZEROCOPY 5
>> +#define SO_EE_ORIGIN_ZCOOKIE   6
>>   #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
>>     #define SO_EE_OFFENDER(ee)  ((struct sockaddr*)((ee)+1))
>>     #define SO_EE_CODE_ZEROCOPY_COPIED  1
>> +#define        SO_EE_ORIGIN_MAX_ZCOOKIES       8
>
>
> This error change might need to go though other subsystem tree. May
> be you can seperate it and also copy "linux-api@vger.kernel.org"

Previous changes to this file also went in through net-next, so this is
fine. The error queue is a socket, so network, datastructure.

As for naming, zerocopy is unfortunately an overused term. I did not
help matters when adding msg_zerocopy. That said, let's try to keep
consistent naming across socket families, so no zmsg prefix only in
RDS.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 10:28 ` [PATCH V2 net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
  2018-02-14 19:10   ` Santosh Shilimkar
@ 2018-02-14 23:48   ` Willem de Bruijn
  2018-02-15  0:09     ` Sowmini Varadhan
  1 sibling, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-14 23:48 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg() are pinned.
>
> The pinning uses the accounting infrastructure added by
> Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
>
> The payload bytes in the message may not be modified for the
> duration that the message has been pinned. A multi-threaded
> application using this infrastructure may thus need to be notified
> about send-completion so that it can free/reuse the buffers
> passed to rds_sendmsg(). Notification of send-completion will
> identify each message-buffer by a cookie that the application
> must specify as ancillary data to rds_sendmsg().
> The ancillary data in this case has cmsg_level == SOL_RDS
> and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

If the missing break is intentional, no need to respin just for the other
minor comments.

> @@ -341,12 +341,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
>         return rm;
>  }
>
> -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
> +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
> +                              bool zcopy)
>  {
>         unsigned long to_copy, nbytes;
>         unsigned long sg_off;
>         struct scatterlist *sg;
>         int ret = 0;
> +       int length = iov_iter_count(from);
>
>         rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>
> @@ -356,6 +358,53 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
>         sg = rm->data.op_sg;
>         sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
>
> +       if (zcopy) {
> +               int total_copied = 0;
> +               struct sk_buff *skb;
> +
> +               skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
> +                               GFP_KERNEL);
> +               if (!skb)
> +                       return -ENOMEM;
> +               rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
> +               memset(rm->data.op_mmp_znotifier, 0,
> +                      sizeof(*rm->data.op_mmp_znotifier));

Not strictly needed, as alloc_skb clears skb->cb[]

> +               if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> +                                           length)) {
> +                       consume_skb(skb);
> +                       rm->data.op_mmp_znotifier = NULL;
> +                       return -ENOMEM;
> +               }

One less action to revert if moving the mm_account_pinned_pages check
before assigning op_mmp_znotifier.

Conversely, move to an err: label at the end to be able to deduplicate
with the error branch introduced below.

> @@ -875,12 +875,13 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
>   * rds_message is getting to be quite complicated, and we'd like to allocate
>   * it all in one go. This figures out how big it needs to be up front.
>   */
> -static int rds_rm_size(struct msghdr *msg, int data_len)
> +static int rds_rm_size(struct msghdr *msg, int num_sgs)
>  {
>         struct cmsghdr *cmsg;
>         int size = 0;
>         int cmsg_groups = 0;
>         int retval;
> +       bool zcopy_cookie = false;
>
>         for_each_cmsghdr(cmsg, msg) {
>                 if (!CMSG_OK(msg, cmsg))
> @@ -899,6 +900,8 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
>
>                         break;
>
> +               case RDS_CMSG_ZCOPY_COOKIE:
> +                       zcopy_cookie = true;

break, or if intended to fall through, please label as such.

>                 case RDS_CMSG_RDMA_DEST:
>                 case RDS_CMSG_RDMA_MAP:
>                         cmsg_groups |= 2;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-14 23:48   ` Willem de Bruijn
@ 2018-02-15  0:09     ` Sowmini Varadhan
  2018-02-15  0:15       ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-15  0:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On (02/14/18 18:48), Willem de Bruijn wrote:
> 
> If the missing break is intentional, no need to respin just for the other
> minor comments.

yes the missing break is intentional-  the function returns the
size of the scatterlist needed for RDMA, and RDS_CMSG_ZCOPY_COOKIE
(like RDMA_DEST and RDMA_MAP) is meta-data that does not change
that size.

I expect to be in the neighborhood of this  code pretty soon, to get
the additional opimization of passing up the zcopy completion
as part of recvmsg (see the discussion in
https://www.mail-archive.com/netdev@vger.kernel.org/msg212788.html)
 
I can take care of the other code-cleanup comment suggestions in
here at that time..

--Sowmini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
  2018-02-14 18:50   ` Santosh Shilimkar
@ 2018-02-15  0:12   ` Willem de Bruijn
  2018-02-15  0:41   ` Willem de Bruijn
  2 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-15  0:12 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> +                                    struct rds_znotifier *znotif)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb, *tail;
> +       struct sock_exterr_skb *serr;
> +       unsigned long flags;
> +       struct sk_buff_head *q;
> +       u32 cookie = znotif->z_cookie;
> +
> +       q = &sk->sk_error_queue;
> +       spin_lock_irqsave(&q->lock, flags);
> +       tail = skb_peek_tail(q);
> +
> +       if (tail && skb_zcookie_add(tail, cookie)) {
> +               spin_unlock_irqrestore(&q->lock, flags);
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               consume_skb(rds_skb_from_znotifier(znotif));
> +               sk->sk_error_report(sk);
> +               return;
> +       }
> +
> +       skb = rds_skb_from_znotifier(znotif);
> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

This resets selective fields in skb->cb[]. That array was previously used
as struct rds_znotifier, so may have non-zero state left over. Due to length
of serr->header it is likely that notifier and serr->ee do not overlap. But if
they do kernel data is leaked to userspace. And size of rds_znotifier may
change, so relying on that is fragile. Safer to clear serr->ee completely.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 5/7] rds: zerocopy Tx support.
  2018-02-15  0:09     ` Sowmini Varadhan
@ 2018-02-15  0:15       ` Willem de Bruijn
  0 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-15  0:15 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Wed, Feb 14, 2018 at 7:09 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (02/14/18 18:48), Willem de Bruijn wrote:
>>
>> If the missing break is intentional, no need to respin just for the other
>> minor comments.
>
> yes the missing break is intentional-  the function returns the
> size of the scatterlist needed for RDMA, and RDS_CMSG_ZCOPY_COOKIE
> (like RDMA_DEST and RDMA_MAP) is meta-data that does not change
> that size.
>
> I expect to be in the neighborhood of this  code pretty soon, to get
> the additional opimization of passing up the zcopy completion
> as part of recvmsg (see the discussion in
> https://www.mail-archive.com/netdev@vger.kernel.org/msg212788.html)
>
> I can take care of the other code-cleanup comment suggestions in
> here at that time..

Sounds good.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 22:26     ` Willem de Bruijn
@ 2018-02-15  0:19       ` Santosh Shilimkar
  0 siblings, 0 replies; 31+ messages in thread
From: Santosh Shilimkar @ 2018-02-15  0:19 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Sowmini Varadhan, Network Development, David Miller, rds-devel

On 2/14/2018 2:26 PM, Willem de Bruijn wrote:
> On Wed, Feb 14, 2018 at 1:50 PM, Santosh Shilimkar

[...]

>> This error change might need to go though other subsystem tree. May
>> be you can seperate it and also copy "linux-api@vger.kernel.org"
> 
> Previous changes to this file also went in through net-next, so this is
> fine. The error queue is a socket, so network, datastructure.
>
OK if that acceptable then there is no need to split the change.

> As for naming, zerocopy is unfortunately an overused term. I did not
> help matters when adding msg_zerocopy. 
:-)

> That said, let's try to keep
> consistent naming across socket families, so no zmsg prefix only in
> RDS.
> Fair enough.

Regards,
Santosh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 0/7] RDS: zerocopy support
  2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
                   ` (6 preceding siblings ...)
  2018-02-14 10:28 ` [PATCH V2 net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
@ 2018-02-15  0:19 ` Willem de Bruijn
  7 siblings, 0 replies; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-15  0:19 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> This is version 2 of the series at
>   https://www.mail-archive.com/netdev@vger.kernel.org/msg213829.html
>
> Review comments addressed
> Patch 4:
>   - make sure to always sock_put m_rs even if there is no znotifier.
>   - major rewrite of notification, resulting in much simplification.
>
> Patch 5:
>   - remove unused data_len argument to rds_rm_size;
>   - unmap as necessary if we fail in the middle of zerocopy setup
>
> Patch 7:
>   - restructured do_recv_completion to avoid excessive code re-indent
>   - on-stack allocation of cmsghdr for cookie in do_sendmsg
>   - Additional verification: Verify ncookies <= MAX_.., verify ret ==
>     ncookies * sizeof(uint32_t)
>
> A brief overview of this feature follows.
>
> This patch series provides support for MSG_ZERCOCOPY
> on a PF_RDS socket based on the APIs and infrastructure added
> by f214f915e7db ("tcp: enable MSG_ZEROCOPY")
>
> For single threaded rds-stress testing using rds-tcp with the
> ixgbe driver using 1M message sizes (-a 1M -q 1M) preliminary
> results show that  there is a significant reduction in latency: about
> 90 usec with zerocopy, compared with 200 usec without zerocopy.
>
> This patchset modifies the above for zerocopy in the following manner.
> - if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> - if the SO_ZEROCOPY  socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg are pinned. The pinning
> uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit
> on MSG_ZEROCOPY pages"). The message is unpinned when all references
> to the message go down to 0, and the message is freed by rds_message_purge.
>
> A multithreaded application using this infrastructure must send down
> a unique 32 bit cookie as ancillary data with each sendmsg invocation.
> The format of this ancillary data is described in Patch 5 of the series.
> The cookie is passed up to the application on the sk_error_queue when
> the message is unpinned, indicating to the application that it is now
> safe to free/reuse the message buffer. The details of the completion
> notifiction are provided in Patch 4 of this series.
>
>
> Sowmini Varadhan (7):
>   skbuff: export mm_[un]account_pinned_pages for other modules
>   rds: hold a sock ref from rds_message to the rds_sock
>   sock: permit SO_ZEROCOPY on PF_RDS socket
>   rds: support for zcopy completion notification
>   rds: zerocopy Tx support.
>   selftests/net: add support for PF_RDS sockets
>   selftests/net: add zerocopy support for PF_RDS test case
>
>  include/linux/skbuff.h                     |    3 +
>  include/uapi/linux/errqueue.h              |    2 +
>  include/uapi/linux/rds.h                   |    1 +
>  net/core/skbuff.c                          |    6 +-
>  net/core/sock.c                            |   25 +++---
>  net/rds/af_rds.c                           |    2 +
>  net/rds/message.c                          |  132 ++++++++++++++++++++++++++-
>  net/rds/rds.h                              |   17 ++++-
>  net/rds/recv.c                             |    2 +
>  net/rds/send.c                             |   51 ++++++++---
>  tools/testing/selftests/net/msg_zerocopy.c |  133 ++++++++++++++++++++++++++-
>  11 files changed, 339 insertions(+), 35 deletions(-)
>

A few inline suggestions in patch 5/7 were already discussed in that thread.
These will be cleaned up in a follow-on patch.

The fragile use of skb->cb[] in patch 4/7 should also be addressed there.

With that caveat, for the series:

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
  2018-02-14 18:50   ` Santosh Shilimkar
  2018-02-15  0:12   ` Willem de Bruijn
@ 2018-02-15  0:41   ` Willem de Bruijn
  2018-02-15 12:03     ` Sowmini Varadhan
  2 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-15  0:41 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Wed, Feb 14, 2018 at 5:28 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> This is achieved by checking the tail skb in the sk_error_queue:
> if this has room for one more cookie, the cookie from the
> current notification is added; else a new skb is added to the
> sk_error_queue. Every invocation of rds_rm_zerocopy_callback() will
> trigger a ->sk_error_report to notify the application.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> +                                    struct rds_znotifier *znotif)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb, *tail;
> +       struct sock_exterr_skb *serr;
> +       unsigned long flags;
> +       struct sk_buff_head *q;
> +       u32 cookie = znotif->z_cookie;
> +
> +       q = &sk->sk_error_queue;
> +       spin_lock_irqsave(&q->lock, flags);
> +       tail = skb_peek_tail(q);
> +
> +       if (tail && skb_zcookie_add(tail, cookie)) {
> +               spin_unlock_irqrestore(&q->lock, flags);
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               consume_skb(rds_skb_from_znotifier(znotif));
> +               sk->sk_error_report(sk);
> +               return;
> +       }
> +
> +       skb = rds_skb_from_znotifier(znotif);
> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;

One more thing: this code notifies that the operation succeeded, but
the data was copied in the process. It does not have to be set otherwise.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-15  0:41   ` Willem de Bruijn
@ 2018-02-15 12:03     ` Sowmini Varadhan
  2018-02-15 16:35       ` Willem de Bruijn
  0 siblings, 1 reply; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-15 12:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On (02/14/18 19:41), Willem de Bruijn wrote:
> 
> One more thing: this code notifies that the operation succeeded, but
> the data was copied in the process. It does not have to be set otherwise.

I see. this one was a bit confusing for me (hence the copy/paste) -
maybe because the TCP/UDP/PACKET case is a bit different from RDS,
and sendmsg may find that it has to switch from zcopy to bcopy 
after the sendmsg() itself has returned, but this cannot happen
for RDS (thus I should never set this code?).

Is it ok if I fix this in the next round or do you think
this warrants a V3?

--Sowmini 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-15 12:03     ` Sowmini Varadhan
@ 2018-02-15 16:35       ` Willem de Bruijn
  2018-02-15 16:43         ` Sowmini Varadhan
  0 siblings, 1 reply; 31+ messages in thread
From: Willem de Bruijn @ 2018-02-15 16:35 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On Thu, Feb 15, 2018 at 7:03 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (02/14/18 19:41), Willem de Bruijn wrote:
>>
>> One more thing: this code notifies that the operation succeeded, but
>> the data was copied in the process. It does not have to be set otherwise.
>
> I see. this one was a bit confusing for me (hence the copy/paste) -
> maybe because the TCP/UDP/PACKET case is a bit different from RDS,
> and sendmsg may find that it has to switch from zcopy to bcopy
> after the sendmsg() itself has returned, but this cannot happen
> for RDS (thus I should never set this code?).

Agreed.

> Is it ok if I fix this in the next round or do you think
> this warrants a V3?

The flag is a hint, so on its own it does not require a respin.

There are by now four small items to patch up. I would respin so that the
patchset that is recorded can be read later as a single correct solution.

But either way works, I don't feel strongly.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V2 net-next 4/7] rds: support for zcopy completion notification
  2018-02-15 16:35       ` Willem de Bruijn
@ 2018-02-15 16:43         ` Sowmini Varadhan
  0 siblings, 0 replies; 31+ messages in thread
From: Sowmini Varadhan @ 2018-02-15 16:43 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, Santosh Shilimkar

On (02/15/18 11:35), Willem de Bruijn wrote:
> 
> The flag is a hint, so on its own it does not require a respin.
> 
> There are by now four small items to patch up. I would respin so that the
> patchset that is recorded can be read later as a single correct solution.

Agreed, I'll genenrate V3 later today

--Sowmini

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-02-15 16:44 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-14 10:28 [PATCH V2 net-next 0/7] RDS: zerocopy support Sowmini Varadhan
2018-02-14 10:28 ` [PATCH V2 net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
2018-02-14 10:28 ` [PATCH V2 net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
2018-02-14 18:38   ` Santosh Shilimkar
2018-02-14 10:28 ` [PATCH V2 net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
2018-02-14 10:28 ` [PATCH V2 net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
2018-02-14 18:50   ` Santosh Shilimkar
2018-02-14 19:01     ` Sowmini Varadhan
2018-02-14 19:02       ` David Miller
2018-02-14 21:10         ` Santosh Shilimkar
2018-02-14 21:25           ` Sowmini Varadhan
2018-02-14 21:39             ` Santosh Shilimkar
2018-02-14 22:08             ` Santosh Shilimkar
2018-02-14 19:17       ` Santosh Shilimkar
2018-02-14 22:26     ` Willem de Bruijn
2018-02-15  0:19       ` Santosh Shilimkar
2018-02-15  0:12   ` Willem de Bruijn
2018-02-15  0:41   ` Willem de Bruijn
2018-02-15 12:03     ` Sowmini Varadhan
2018-02-15 16:35       ` Willem de Bruijn
2018-02-15 16:43         ` Sowmini Varadhan
2018-02-14 10:28 ` [PATCH V2 net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
2018-02-14 19:10   ` Santosh Shilimkar
2018-02-14 19:49     ` Sowmini Varadhan
2018-02-14 21:14       ` Santosh Shilimkar
2018-02-14 23:48   ` Willem de Bruijn
2018-02-15  0:09     ` Sowmini Varadhan
2018-02-15  0:15       ` Willem de Bruijn
2018-02-14 10:28 ` [PATCH V2 net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
2018-02-14 10:28 ` [PATCH V2 net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
2018-02-15  0:19 ` [PATCH V2 net-next 0/7] RDS: zerocopy support Willem de Bruijn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.