netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/7] RDS zerocopy support
@ 2018-01-24 11:45 Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:45 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

This patch series follows up on the RFC and subsequent review comments
at https://patchwork.ozlabs.org/cover/862248/

Review comments addressed are
- drop MSG_PEEK change for sk_error_queue
- (patch4) a batch of up to SO_EE_ORIGIN_MAX_ZCOOKIES (#defined to 8) cookies
  is sent up as part of the data in the error notification. The ancillary
  data with this notification specifies the number of cookies in ee_data,
  with ee_origin set to SO_EE_ORIGIN_ZCOOKIE
- (patch4, patch5) allocate the skb to be used for error notification
  up-front (in rds_sendmsg()) so that we never have to fail due to skb
  allocation failure in the callback routine.
- other minor review fixes, such as refactoring the code for the SO_ZEROCOPY
  setsockopt and using iov_iter_npages()

This patch series also updates selftests/net/msg_zerocopy.c to support
PF_RDS sockets (both with and without zerocopy).

Thanks to Willem de Bruijn and Eric Dumazet for review comments.

Sowmini Varadhan (7):
  skbuff: export mm_[un]account_pinned_pages for other modules
  rds: hold a sock ref from rds_message to the rds_sock
  sock: permit SO_ZEROCOPY on PF_RDS socket
  rds: support for zcopy completion notification
  rds: zerocopy Tx support.
  selftests/net: add support for PF_RDS sockets
  selftests/net: add zerocopy support for PF_RDS test case

 include/linux/skbuff.h                     |    3 +
 include/uapi/linux/errqueue.h              |    2 +
 include/uapi/linux/rds.h                   |    1 +
 net/core/skbuff.c                          |    6 +-
 net/core/sock.c                            |   25 ++--
 net/rds/af_rds.c                           |    7 +
 net/rds/message.c                          |  148 ++++++++++++++++++++++-
 net/rds/rds.h                              |   23 ++++-
 net/rds/recv.c                             |    2 +
 net/rds/send.c                             |   51 ++++++--
 tools/testing/selftests/net/msg_zerocopy.c |  182 +++++++++++++++++++++++-----
 11 files changed, 390 insertions(+), 60 deletions(-)


* [PATCH net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
@ 2018-01-24 11:45 ` Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:45 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

RDS would like to use the helper functions for managing pinned pages
added by commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages").
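
For reference, the intended usage pattern of the exported pair is sketched
below (illustrative only; the actual RDS caller is added later in this
series, and the error value is simply propagated rather than assumed):

	/* Sketch: charge 'size' bytes of pinned memory against the caller's
	 * locked-memory accounting before pinning pages, and undo it when
	 * the pinned pages are released.
	 */
	static int example_pin_account(struct mmpin *mmp, size_t size)
	{
		int err = mm_account_pinned_pages(mmp, size);

		if (err)
			return err;
		/* ... pin the user pages, e.g. via iov_iter_get_pages() ... */
		return 0;
	}

	static void example_pin_release(struct mmpin *mmp)
	{
		/* ... drop the page references ... */
		mm_unaccount_pinned_pages(mmp);
	}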

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/skbuff.h |    3 +++
 net/core/skbuff.c      |    6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b8e0da6..8e2730a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -466,6 +466,9 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
+void mm_unaccount_pinned_pages(struct mmpin *mmp);
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 01e8285..272a513 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -890,7 +890,7 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
-static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 {
 	unsigned long max_pg, num_pg, new_pg, old_pg;
 	struct user_struct *user;
@@ -919,14 +919,16 @@ static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(mm_account_pinned_pages);
 
-static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct mmpin *mmp)
 {
 	if (mmp->user) {
 		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
 		free_uid(mmp->user);
 	}
 }
+EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
-- 
1.7.1


* [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
@ 2018-01-24 11:45 ` Sowmini Varadhan
  2018-01-25 14:44   ` Willem de Bruijn
  2018-01-24 11:45 ` [PATCH net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:45 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

The existing model holds a reference from the rds_sock to the
rds_message, but the rds_message does not itself hold a reference
on the rds_sock. Instead, the m_rs field in the rds_message is
assigned when the message is queued on the sock, and nulled when
the message is dequeued from the sock.

We want to be able to notify userspace when the rds_message
is actually freed (from rds_message_purge(), after the refcounts
to the rds_message go to 0). At the time that rds_message_purge()
is called, the message is no longer on the rds_sock retransmit
queue. Thus an explicit sock reference via m_rs is needed to
send a notification that signals to userspace that
it is now safe to free/reuse any pages that may have
been pinned down for zerocopy.

This patch manages the m_rs assignment in the rds_message with
the necessary refcount book-keeping.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/message.c |    8 +++++++-
 net/rds/send.c    |    7 +------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/rds/message.c b/net/rds/message.c
index 4318cc9..ef3daaf 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -58,7 +58,7 @@ void rds_message_addref(struct rds_message *rm)
  */
 static void rds_message_purge(struct rds_message *rm)
 {
-	unsigned long i;
+	unsigned long i, flags;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
@@ -69,6 +69,12 @@ static void rds_message_purge(struct rds_message *rm)
 		__free_page(sg_page(&rm->data.op_sg[i]));
 	}
 	rm->data.op_nents = 0;
+	spin_lock_irqsave(&rm->m_rs_lock, flags);
+	if (rm->m_rs) {
+		sock_put(rds_rs_to_sk(rm->m_rs));
+		rm->m_rs = NULL;
+	}
+	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
diff --git a/net/rds/send.c b/net/rds/send.c
index d3e32d1..5ac0925 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -649,7 +649,6 @@ static void rds_send_remove_from_sock(struct list_head *messages, int status)
 				rm->rdma.op_notifier = NULL;
 			}
 			was_on_sock = 1;
-			rm->m_rs = NULL;
 		}
 		spin_unlock(&rs->rs_lock);
 
@@ -756,9 +755,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		 */
 		if (!test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) {
 			spin_unlock_irqrestore(&cp->cp_lock, flags);
-			spin_lock_irqsave(&rm->m_rs_lock, flags);
-			rm->m_rs = NULL;
-			spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 			continue;
 		}
 		list_del_init(&rm->m_conn_item);
@@ -774,7 +770,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -798,7 +793,6 @@ void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest)
 		__rds_send_complete(rs, rm, RDS_RDMA_CANCELED);
 		spin_unlock(&rs->rs_lock);
 
-		rm->m_rs = NULL;
 		spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 		rds_message_put(rm);
@@ -849,6 +843,7 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
 		list_add_tail(&rm->m_sock_item, &rs->rs_send_queue);
 		set_bit(RDS_MSG_ON_SOCK, &rm->m_flags);
 		rds_message_addref(rm);
+		sock_hold(rds_rs_to_sk(rs));
 		rm->m_rs = rs;
 
 		/* The code ordering is a little weird, but we're
-- 
1.7.1


* [PATCH net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
@ 2018-01-24 11:45 ` Sowmini Varadhan
  2018-01-24 11:45 ` [PATCH net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:45 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

Allow the application to set SO_ZEROCOPY on the underlying sk
of a PF_RDS socket.
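
A minimal userspace sketch of what this permits (illustrative only; it
assumes the installed headers provide PF_RDS and SO_ZEROCOPY):

	#include <errno.h>
	#include <error.h>
	#include <sys/socket.h>

	static int rds_zerocopy_socket(void)
	{
		int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
		int one = 1;

		if (fd == -1)
			error(1, errno, "socket");
		/* without this patch the kernel rejects SO_ZEROCOPY on PF_RDS */
		if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
			error(1, errno, "setsockopt SO_ZEROCOPY");
		return fd;
	}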

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/sock.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b2..3c75adc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1049,18 +1049,21 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		break;
 
 	case SO_ZEROCOPY:
-		if (sk->sk_family != PF_INET && sk->sk_family != PF_INET6)
+		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
+			if (sk->sk_protocol != IPPROTO_TCP)
+				ret = -ENOTSUPP;
+			else if (sk->sk_state != TCP_CLOSE)
+				ret = -EBUSY;
+		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
-		else if (sk->sk_protocol != IPPROTO_TCP)
-			ret = -ENOTSUPP;
-		else if (sk->sk_state != TCP_CLOSE)
-			ret = -EBUSY;
-		else if (val < 0 || val > 1)
-			ret = -EINVAL;
-		else
-			sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
-		break;
-
+		}
+		if (!ret) {
+			if (val < 0 || val > 1)
+				ret = -EINVAL;
+			else
+				sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
+			break;
+		}
 	default:
 		ret = -ENOPROTOOPT;
 		break;
-- 
1.7.1


* [PATCH net-next 4/7] rds: support for zcopy completion notification
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
                   ` (2 preceding siblings ...)
  2018-01-24 11:45 ` [PATCH net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
@ 2018-01-24 11:45 ` Sowmini Varadhan
  2018-01-28 13:56   ` Willem de Bruijn
  2018-01-24 11:46 ` [PATCH net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:45 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

RDS removes a datagram (rds_message) from the retransmit queue when
an ACK is received. The ACK indicates that the receiver has queued
the RDS datagram, so that the sender can safely forget the datagram.
When all references to the rds_message are quiesced, rds_message_purge()
is called to release the resources used by the rds_message.

If the datagram to be removed had pinned pages set up, add
an entry to the rs->rs_znotify_queue so that the notification
will be sent up via rds_rm_zerocopy_callback() when the
rds_message is eventually freed by rds_message_purge().

rds_rm_zerocopy_callback() attempts to batch the number of cookies
sent with each notification to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
Each time a cookie is released by rds_message_purge(), the
rs_znotify_queue is checked to see if the MAX_ZCOOKIES batch limit
has been exceeded (in which case a notification is sent up). If the
limit has not been exceeded, the cookie is added to the rs_znotify_queue
and a timer is set up to make sure the cookie notification will
be sent within an upper bound of RDS_REAP_TIMEOUT (should the
traffic rate slow down).
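
For reference, a sketch of how a sender might drain these notifications
(modeled loosely on the selftest changes later in this series; it assumes
the updated include/uapi/linux/errqueue.h from this patch, and buffer
sizes and error handling are illustrative):

	#include <errno.h>
	#include <error.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <linux/errqueue.h>

	static void rds_read_zcookies(int fd)
	{
		uint32_t cookies[SO_EE_ORIGIN_MAX_ZCOOKIES];
		char control[100];
		struct sock_extended_err *serr;
		struct cmsghdr *cm;
		struct iovec iov = { .iov_base = cookies, .iov_len = sizeof(cookies) };
		struct msghdr msg = { 0 };
		uint32_t i;
		int ret;

		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
		if (ret == -1)
			error(1, errno, "errqueue recvmsg");

		cm = CMSG_FIRSTHDR(&msg);
		if (!cm)
			error(1, 0, "no cmsg on error queue");
		serr = (void *)CMSG_DATA(cm);
		if (serr->ee_origin != SO_EE_ORIGIN_ZCOOKIE || serr->ee_errno)
			error(1, 0, "unexpected notification");

		/* ee_data holds the cookie count; the cookies themselves are
		 * returned as the data portion of the notification skb.
		 */
		for (i = 0; i < serr->ee_data; i++)
			fprintf(stderr, "completed cookie %u\n", cookies[i]);
	}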

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/uapi/linux/errqueue.h |    2 +
 net/rds/af_rds.c              |    7 +++
 net/rds/message.c             |  104 ++++++++++++++++++++++++++++++++++++++---
 net/rds/rds.h                 |   20 ++++++++
 net/rds/recv.c                |    2 +
 5 files changed, 128 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfa..28812ed 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,11 +20,13 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
 #define SO_EE_ORIGIN_ZEROCOPY	5
+#define SO_EE_ORIGIN_ZCOOKIE	6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED	1
+#define	SO_EE_ORIGIN_MAX_ZCOOKIES	8
 
 /**
  *	struct scm_timestamping - timestamps exposed through cmsg
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b405f77..49a81e8 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -66,6 +66,8 @@ static int rds_release(struct socket *sock)
 	rs = rds_sk_to_rs(sk);
 
 	sock_orphan(sk);
+
+	del_timer_sync(&rs->rs_cookie_timer);
 	/* Note - rds_clear_recv_queue grabs rs_recv_lock, so
 	 * that ensures the recv path has completed messing
 	 * with the socket. */
@@ -183,6 +185,8 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 		mask |= (POLLIN | POLLRDNORM);
 	if (rs->rs_snd_bytes < rds_sk_sndbuf(rs))
 		mask |= (POLLOUT | POLLWRNORM);
+	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
+		mask |= POLLERR;
 	read_unlock_irqrestore(&rs->rs_recv_lock, flags);
 
 	/* clear state any time we wake a seen-congested socket */
@@ -511,6 +515,9 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
 	INIT_LIST_HEAD(&rs->rs_send_queue);
 	INIT_LIST_HEAD(&rs->rs_recv_queue);
 	INIT_LIST_HEAD(&rs->rs_notify_queue);
+	INIT_LIST_HEAD(&rs->rs_znotify_queue);
+	rs->rs_ncookies = 0;
+	timer_setup(&rs->rs_cookie_timer, rs_zcopy_notify, 0);
 	INIT_LIST_HEAD(&rs->rs_cong_list);
 	spin_lock_init(&rs->rs_rdma_lock);
 	rs->rs_rdma_keys = RB_ROOT;
diff --git a/net/rds/message.c b/net/rds/message.c
index ef3daaf..7ca968a 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -33,6 +33,9 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/skbuff.h>
+#include <linux/list.h>
+#include <linux/errqueue.h>
 
 #include "rds.h"
 
@@ -53,28 +56,115 @@ void rds_message_addref(struct rds_message *rm)
 }
 EXPORT_SYMBOL_GPL(rds_message_addref);
 
+static void rds_rm_zerocopy_callback(struct rds_sock *rs,
+				     struct rds_znotifier *znotifier,
+				     bool force)
+{
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct sk_buff *skb;
+	struct sock_exterr_skb *serr;
+	unsigned long flags;
+	u32 *ptr;
+	int ncookies = 0, i;
+	struct rds_znotifier *znotif, *ztmp, *first;
+	LIST_HEAD(tmp_list);
+
+	spin_lock_irqsave(&rs->rs_lock, flags);
+	ncookies = rs->rs_ncookies;
+	if (ncookies < SO_EE_ORIGIN_MAX_ZCOOKIES && !force) {
+		if (znotifier) { /* add this cookie to the list and return */
+			list_add_tail(&znotifier->z_list,
+				      &rs->rs_znotify_queue);
+			rs->rs_ncookies++;
+		}
+		spin_unlock_irqrestore(&rs->rs_lock, flags);
+		return;
+	}
+	if (!ncookies) { /* timer finds a reaped list */
+		spin_unlock_irqrestore(&rs->rs_lock, flags);
+		return;
+	}
+	/* reap existing cookie list if we have hit the max, then add
+	 * new cookie to the list for next round of reaping.
+	 */
+	list_splice(&rs->rs_znotify_queue, &tmp_list); /* reap now */
+	INIT_LIST_HEAD(&rs->rs_znotify_queue);
+	rs->rs_ncookies = 0;
+	if (znotifier) { /* for next round */
+		list_add_tail(&znotifier->z_list, &rs->rs_znotify_queue);
+		rs->rs_ncookies++;
+	}
+	spin_unlock_irqrestore(&rs->rs_lock, flags);
+
+	first = list_first_entry(&tmp_list, struct rds_znotifier, z_list);
+	znotif = list_next_entry(first, z_list);
+	list_del(&first->z_list);
+
+	skb = rds_skb_from_znotifier(first);
+	ptr = skb_put(skb, ncookies * sizeof(u32));
+	i = 0;
+	ptr[i++] = first->z_cookie;
+
+	list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) {
+		list_del(&znotif->z_list);
+		ptr[i++] = znotif->z_cookie;
+		mm_unaccount_pinned_pages(&znotif->z_mmp);
+		consume_skb(rds_skb_from_znotifier(znotif));
+	}
+	WARN_ON(!list_empty(&tmp_list));
+
+	serr = SKB_EXT_ERR(skb);
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
+	serr->ee.ee_data = ncookies;
+	serr->ee.ee_info = 0;
+	serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
+
+	if (sock_queue_err_skb(sk, skb))
+		consume_skb(skb);
+}
+
+void rs_zcopy_notify(struct timer_list *t)
+{
+	struct rds_sock *rs = from_timer(rs, t, rs_cookie_timer);
+
+	rds_rm_zerocopy_callback(rs, NULL, true);
+}
+
 /*
  * This relies on dma_map_sg() not touching sg[].page during merging.
  */
 static void rds_message_purge(struct rds_message *rm)
 {
 	unsigned long i, flags;
+	bool zcopy = false;
 
 	if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
 		return;
 
+	spin_lock_irqsave(&rm->m_rs_lock, flags);
+	if (rm->data.op_mmp_znotifier && rm->m_rs) {
+		struct rds_sock *rs = rm->m_rs;
+
+		zcopy = true;
+		rds_rm_zerocopy_callback(rs, rm->data.op_mmp_znotifier, false);
+		rm->data.op_mmp_znotifier = NULL;
+		(void)mod_timer(&rs->rs_cookie_timer, RDS_REAP_TIMEOUT);
+
+		sock_put(rds_rs_to_sk(rs));
+		rm->m_rs = NULL;
+	}
+	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
+
 	for (i = 0; i < rm->data.op_nents; i++) {
 		rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
 		/* XXX will have to put_page for page refs */
-		__free_page(sg_page(&rm->data.op_sg[i]));
+		if (!zcopy)
+			__free_page(sg_page(&rm->data.op_sg[i]));
+		else
+			put_page(sg_page(&rm->data.op_sg[i]));
 	}
 	rm->data.op_nents = 0;
-	spin_lock_irqsave(&rm->m_rs_lock, flags);
-	if (rm->m_rs) {
-		sock_put(rds_rs_to_sk(rm->m_rs));
-		rm->m_rs = NULL;
-	}
-	spin_unlock_irqrestore(&rm->m_rs_lock, flags);
 
 	if (rm->rdma.op_active)
 		rds_rdma_free_op(&rm->rdma);
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 374ae83..c375dd8 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -356,6 +356,19 @@ static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie)
 #define RDS_MSG_PAGEVEC		7
 #define RDS_MSG_FLUSH		8
 
+struct rds_znotifier {
+	struct list_head	z_list;
+	struct mmpin		z_mmp;
+	u32			z_cookie;
+};
+
+#define	RDS_ZCOPY_SKB(__skb)	((struct rds_znotifier *)&((__skb)->cb[0]))
+
+static inline struct sk_buff *rds_skb_from_znotifier(struct rds_znotifier *z)
+{
+	return container_of((void *)z, struct sk_buff, cb);
+}
+
 struct rds_message {
 	refcount_t		m_refcount;
 	struct list_head	m_sock_item;
@@ -436,6 +449,7 @@ struct rds_message {
 			unsigned int		op_count;
 			unsigned int		op_dmasg;
 			unsigned int		op_dmaoff;
+			struct rds_znotifier	*op_mmp_znotifier;
 			struct scatterlist	*op_sg;
 		} data;
 	};
@@ -588,6 +602,11 @@ struct rds_sock {
 	/* Socket receive path trace points*/
 	u8			rs_rx_traces;
 	u8			rs_rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+
+	struct list_head	rs_znotify_queue; /* zerocopy completion */
+	int			rs_ncookies;
+	struct timer_list	rs_cookie_timer;
+#define	RDS_REAP_TIMEOUT	((HZ / 100) + 1)
 };
 
 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
@@ -785,6 +804,7 @@ int rds_message_next_extension(struct rds_header *hdr,
 void rds_message_put(struct rds_message *rm);
 void rds_message_wait(struct rds_message *rm);
 void rds_message_unmapped(struct rds_message *rm);
+void rs_zcopy_notify(struct timer_list *t);
 
 static inline void rds_message_make_checksum(struct rds_header *hdr)
 {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index b25bcfe..b080961 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -594,6 +594,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 
 	if (msg_flags & MSG_OOB)
 		goto out;
+	if (msg_flags & MSG_ERRQUEUE)
+		return sock_recv_errqueue(sk, msg, size, SOL_IP, IP_RECVERR);
 
 	while (1) {
 		/* If there are pending notifications, do those - and nothing else */
-- 
1.7.1


* [PATCH net-next 5/7] rds: zerocopy Tx support.
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
                   ` (3 preceding siblings ...)
  2018-01-24 11:45 ` [PATCH net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-01-24 11:46 ` Sowmini Varadhan
  2018-01-28 13:57   ` Willem de Bruijn
  2018-01-24 11:46 ` [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:46 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and
if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned.

The pinning uses the accounting infrastructure added by
commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages").

The payload bytes in the message may not be modified for the
duration that the message has been pinned. A multi-threaded
application using this infrastructure may thus need to be notified
about send-completion so that it can free/reuse the buffers
passed to rds_sendmsg(). Notification of send-completion will
identify each message-buffer by a cookie that the application
must specify as ancillary data to rds_sendmsg().
The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
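
A sketch of the corresponding sender-side usage (illustrative only; it
assumes the updated linux/rds.h from this patch, and that MSG_ZEROCOPY is
visible via the installed socket headers):

	#include <errno.h>
	#include <error.h>
	#include <netinet/in.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <linux/rds.h>

	static void rds_send_zcopy(int fd, struct sockaddr_in *dst,
				   void *buf, size_t len, uint32_t cookie)
	{
		char control[CMSG_SPACE(sizeof(cookie))] = { 0 };
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		struct msghdr msg = { 0 };
		struct cmsghdr *cm;

		msg.msg_name = dst;
		msg.msg_namelen = sizeof(*dst);
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		cm = CMSG_FIRSTHDR(&msg);
		cm->cmsg_level = SOL_RDS;
		cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
		cm->cmsg_len = CMSG_LEN(sizeof(cookie));
		memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));

		/* buf must not be modified or reused until 'cookie' comes
		 * back on the socket error queue.
		 */
		if (sendmsg(fd, &msg, MSG_ZEROCOPY) == -1)
			error(1, errno, "sendmsg MSG_ZEROCOPY");
	}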

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/uapi/linux/rds.h |    1 +
 net/rds/message.c        |   48 +++++++++++++++++++++++++++++++++++++++++++++-
 net/rds/rds.h            |    3 +-
 net/rds/send.c           |   44 ++++++++++++++++++++++++++++++++++++-----
 4 files changed, 88 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index e71d449..12e3bca 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -103,6 +103,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_FADD	8
 #define RDS_CMSG_MASKED_ATOMIC_CSWP	9
 #define RDS_CMSG_RXPATH_LATENCY		11
+#define	RDS_CMSG_ZCOPY_COOKIE		12
 
 #define RDS_INFO_FIRST			10000
 #define RDS_INFO_COUNTERS		10000
diff --git a/net/rds/message.c b/net/rds/message.c
index 7ca968a..79b24db 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -362,12 +362,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
 	return rm;
 }
 
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       bool zcopy)
 {
 	unsigned long to_copy, nbytes;
 	unsigned long sg_off;
 	struct scatterlist *sg;
 	int ret = 0;
+	int length = iov_iter_count(from);
 
 	rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
 
@@ -377,6 +379,50 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
 	sg = rm->data.op_sg;
 	sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
 
+	if (zcopy) {
+		int total_copied = 0;
+		struct sk_buff *skb;
+
+		skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
+				GFP_KERNEL);
+		if (!skb)
+			return -ENOMEM;
+		rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
+		memset(rm->data.op_mmp_znotifier, 0,
+		       sizeof(*rm->data.op_mmp_znotifier));
+		if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+					    length)) {
+			consume_skb(skb);
+			rm->data.op_mmp_znotifier = NULL;
+			return -ENOMEM;
+		}
+		while (iov_iter_count(from)) {
+			struct page *pages;
+			size_t start;
+			ssize_t copied;
+
+			copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
+						    1, &start);
+			if (copied < 0) {
+				struct mmpin *mmp;
+
+				mmp = &rm->data.op_mmp_znotifier->z_mmp;
+				mm_unaccount_pinned_pages(mmp);
+				consume_skb(skb);
+				rm->data.op_mmp_znotifier = NULL;
+				return -EFAULT;
+			}
+			total_copied += copied;
+			iov_iter_advance(from, copied);
+			length -= copied;
+			sg_set_page(sg, pages, copied, start);
+			rm->data.op_nents++;
+			sg++;
+		}
+		WARN_ON_ONCE(length != 0);
+		return ret;
+	} /* zcopy */
+
 	while (iov_iter_count(from)) {
 		if (!sg_page(sg)) {
 			ret = rds_page_remainder_alloc(sg, iov_iter_count(from),
diff --git a/net/rds/rds.h b/net/rds/rds.h
index c375dd8..9dfc23c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -789,7 +789,8 @@ void rds_for_each_conn_info(struct socket *sock, unsigned int len,
 /* message.c */
 struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp);
 struct scatterlist *rds_message_alloc_sgs(struct rds_message *rm, int nents);
-int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from);
+int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
+			       bool zcopy);
 struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len);
 void rds_message_populate_header(struct rds_header *hdr, __be16 sport,
 				 __be16 dport, u64 seq);
diff --git a/net/rds/send.c b/net/rds/send.c
index 5ac0925..1f72c8a 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -875,12 +875,13 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
  * rds_message is getting to be quite complicated, and we'd like to allocate
  * it all in one go. This figures out how big it needs to be up front.
  */
-static int rds_rm_size(struct msghdr *msg, int data_len)
+static int rds_rm_size(struct msghdr *msg, int data_len, int num_sgs)
 {
 	struct cmsghdr *cmsg;
 	int size = 0;
 	int cmsg_groups = 0;
 	int retval;
+	bool zcopy_cookie = false;
 
 	for_each_cmsghdr(cmsg, msg) {
 		if (!CMSG_OK(msg, cmsg))
@@ -899,6 +900,8 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 
 			break;
 
+		case RDS_CMSG_ZCOPY_COOKIE:
+			zcopy_cookie = true;
 		case RDS_CMSG_RDMA_DEST:
 		case RDS_CMSG_RDMA_MAP:
 			cmsg_groups |= 2;
@@ -919,7 +922,10 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 
 	}
 
-	size += ceil(data_len, PAGE_SIZE) * sizeof(struct scatterlist);
+	if ((msg->msg_flags & MSG_ZEROCOPY) && !zcopy_cookie)
+		return -EINVAL;
+
+	size += num_sgs * sizeof(struct scatterlist);
 
 	/* Ensure (DEST, MAP) are never used with (ARGS, ATOMIC) */
 	if (cmsg_groups == 3)
@@ -928,6 +934,18 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
 	return size;
 }
 
+static int rds_cmsg_zcopy(struct rds_sock *rs, struct rds_message *rm,
+			  struct cmsghdr *cmsg)
+{
+	u32 *cookie;
+
+	if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
+		return -EINVAL;
+	cookie = CMSG_DATA(cmsg);
+	rm->data.op_mmp_znotifier->z_cookie = *cookie;
+	return 0;
+}
+
 static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			 struct msghdr *msg, int *allocated_mr)
 {
@@ -970,6 +988,10 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 			ret = rds_cmsg_atomic(rs, rm, cmsg);
 			break;
 
+		case RDS_CMSG_ZCOPY_COOKIE:
+			ret = rds_cmsg_zcopy(rs, rm, cmsg);
+			break;
+
 		default:
 			return -EINVAL;
 		}
@@ -1040,10 +1062,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 	long timeo = sock_sndtimeo(sk, nonblock);
 	struct rds_conn_path *cpath;
 	size_t total_payload_len = payload_len, rdma_payload_len = 0;
+	bool zcopy = ((msg->msg_flags & MSG_ZEROCOPY) &&
+		      sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY));
+	int num_sgs = ceil(payload_len, PAGE_SIZE);
 
 	/* Mirror Linux UDP mirror of BSD error message compatibility */
 	/* XXX: Perhaps MSG_MORE someday */
-	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) {
+	if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT | MSG_ZEROCOPY)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -1087,8 +1112,15 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		goto out;
 	}
 
+	if (zcopy) {
+		if (rs->rs_transport->t_type != RDS_TRANS_TCP) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+		num_sgs = iov_iter_npages(&msg->msg_iter, INT_MAX);
+	}
 	/* size of rm including all sgs */
-	ret = rds_rm_size(msg, payload_len);
+	ret = rds_rm_size(msg, payload_len, num_sgs);
 	if (ret < 0)
 		goto out;
 
@@ -1100,12 +1132,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* Attach data to the rm */
 	if (payload_len) {
-		rm->data.op_sg = rds_message_alloc_sgs(rm, ceil(payload_len, PAGE_SIZE));
+		rm->data.op_sg = rds_message_alloc_sgs(rm, num_sgs);
 		if (!rm->data.op_sg) {
 			ret = -ENOMEM;
 			goto out;
 		}
-		ret = rds_message_copy_from_user(rm, &msg->msg_iter);
+		ret = rds_message_copy_from_user(rm, &msg->msg_iter, zcopy);
 		if (ret)
 			goto out;
 	}
-- 
1.7.1


* [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
                   ` (4 preceding siblings ...)
  2018-01-24 11:46 ` [PATCH net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
@ 2018-01-24 11:46 ` Sowmini Varadhan
  2018-01-28 13:58   ` Willem de Bruijn
  2018-01-24 11:46 ` [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
  2018-01-25 16:41 ` [PATCH net-next 0/7] RDS zerocopy support Santosh Shilimkar
  7 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:46 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

Add support for basic PF_RDS client-server testing in msg_zerocopy.
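
An example invocation, assuming the existing msg_zerocopy flags (-4/-6 for
the address family, -r for receive mode, -z for zerocopy) and placeholder
addresses:

	# receiver
	./msg_zerocopy -4 -r -D 192.168.1.2 rds

	# sender, with zerocopy enabled
	./msg_zerocopy -4 -z -D 192.168.1.2 -S 192.168.1.1 rds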

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 tools/testing/selftests/net/msg_zerocopy.c |   65 +++++++++++++++++++++++++++-
 1 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index e11fe84..7a5b353 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -14,6 +14,9 @@
  * - SOCK_DGRAM
  * - SOCK_RAW
  *
+ * PF_RDS
+ * - SOCK_SEQPACKET
+ *
  * Start this program on two connected hosts, one in send mode and
  * the other with option '-r' to put it in receiver mode.
  *
@@ -53,6 +56,7 @@
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
+#include <linux/rds.h>
 
 #ifndef SO_EE_ORIGIN_ZEROCOPY
 #define SO_EE_ORIGIN_ZEROCOPY		5
@@ -300,10 +304,15 @@ static int do_setup_tx(int domain, int type, int protocol)
 	if (cfg_zerocopy)
 		do_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, 1);
 
-	if (domain != PF_PACKET)
+	if (domain != PF_PACKET && domain != PF_RDS)
 		if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
 			error(1, errno, "connect");
 
+	if (domain == PF_RDS) {
+		if (bind(fd, (void *) &cfg_src_addr, cfg_alen))
+			error(1, errno, "bind");
+	}
+
 	return fd;
 }
 
@@ -444,6 +453,13 @@ static void do_tx(int domain, int type, int protocol)
 		msg.msg_iovlen++;
 	}
 
+	if (domain == PF_RDS) {
+		msg.msg_name = &cfg_dst_addr;
+		msg.msg_namelen =  (cfg_dst_addr.ss_family == AF_INET ?
+				    sizeof(struct sockaddr_in) :
+				    sizeof(struct sockaddr_in6));
+	}
+
 	iov[2].iov_base = payload;
 	iov[2].iov_len = cfg_payload_len;
 	msg.msg_iovlen++;
@@ -555,6 +571,40 @@ static void do_flush_datagram(int fd, int type)
 	bytes += cfg_payload_len;
 }
 
+
+static void do_recvmsg(int fd)
+{
+	int ret, off = 0;
+	char *buf;
+	struct iovec iov;
+	struct msghdr msg;
+	struct sockaddr_storage din;
+
+	buf = calloc(cfg_payload_len, sizeof(char));
+	iov.iov_base = buf;
+	iov.iov_len = cfg_payload_len;
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	ret = recvmsg(fd, &msg, MSG_TRUNC);
+
+	if (ret == -1)
+		error(1, errno, "recv");
+	if (ret != cfg_payload_len)
+		error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
+
+	if (memcmp(buf + off, payload, ret))
+		error(1, 0, "recv: data mismatch");
+
+	free(buf);
+	packets++;
+	bytes += cfg_payload_len;
+}
+
 static void do_rx(int domain, int type, int protocol)
 {
 	uint64_t tstop;
@@ -566,6 +616,8 @@ static void do_rx(int domain, int type, int protocol)
 	do {
 		if (type == SOCK_STREAM)
 			do_flush_tcp(fd);
+		else if (domain == PF_RDS)
+			do_recvmsg(fd);
 		else
 			do_flush_datagram(fd, type);
 
@@ -610,6 +662,7 @@ static void parse_opts(int argc, char **argv)
 				    40 /* max tcp options */;
 	int c;
 	char *daddr = NULL, *saddr = NULL;
+	char *cfg_test;
 
 	cfg_payload_len = max_payload_len;
 
@@ -667,6 +720,14 @@ static void parse_opts(int argc, char **argv)
 			break;
 		}
 	}
+
+	cfg_test = argv[argc - 1];
+	if (strcmp(cfg_test, "rds") == 0) {
+		if (!daddr)
+			error(1, 0, "-D <server addr> required for PF_RDS\n");
+		if (!cfg_rx && !saddr)
+			error(1, 0, "-S <client addr> required for PF_RDS\n");
+	}
 	setup_sockaddr(cfg_family, daddr, &cfg_dst_addr);
 	setup_sockaddr(cfg_family, saddr, &cfg_src_addr);
 
@@ -699,6 +760,8 @@ int main(int argc, char **argv)
 		do_test(cfg_family, SOCK_STREAM, 0);
 	else if (!strcmp(cfg_test, "udp"))
 		do_test(cfg_family, SOCK_DGRAM, 0);
+	else if (!strcmp(cfg_test, "rds"))
+		do_test(PF_RDS, SOCK_SEQPACKET, 0);
 	else
 		error(1, 0, "unknown cfg_test %s", cfg_test);
 
-- 
1.7.1


* [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
                   ` (5 preceding siblings ...)
  2018-01-24 11:46 ` [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
@ 2018-01-24 11:46 ` Sowmini Varadhan
  2018-01-28 14:00   ` Willem de Bruijn
  2018-01-25 16:41 ` [PATCH net-next 0/7] RDS zerocopy support Santosh Shilimkar
  7 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-24 11:46 UTC (permalink / raw)
  To: sowmini.varadhan, willemdebruijn.kernel, netdev
  Cc: davem, rds-devel, santosh.shilimkar

Send a cookie with sendmsg() on PF_RDS sockets, and process the
returned batched cookies in do_recv_completion().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 tools/testing/selftests/net/msg_zerocopy.c |  119 ++++++++++++++++++++-------
 1 files changed, 88 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 7a5b353..19d2b1a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -168,7 +168,26 @@ static int do_accept(int fd)
 	return fd;
 }
 
-static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
+static void add_zcopy_cookie(struct msghdr *msg)
+{
+	int olen = msg->msg_controllen;
+	struct cmsghdr *cm;
+	static uint32_t cookie;
+
+	msg->msg_controllen += CMSG_SPACE(sizeof(cookie));
+	msg->msg_control = (struct cmsghdr *)realloc(msg->msg_control,
+						     msg->msg_controllen);
+	if (!msg->msg_control)
+		error(1, errno, "cannot allocate cmsghdr for cookie");
+	cm = (void *)msg->msg_control + olen;
+	cm->cmsg_len = CMSG_SPACE(sizeof(cookie));
+	cm->cmsg_level = SOL_RDS;
+	cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
+	++cookie;
+	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
+}
+
+static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 {
 	int ret, len, i, flags;
 
@@ -177,8 +196,11 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
 		len += msg->msg_iov[i].iov_len;
 
 	flags = MSG_DONTWAIT;
-	if (do_zerocopy)
+	if (do_zerocopy) {
 		flags |= MSG_ZEROCOPY;
+		if (domain == PF_RDS)
+			add_zcopy_cookie(msg);
+	}
 
 	ret = sendmsg(fd, msg, flags);
 	if (ret == -1 && errno == EAGAIN)
@@ -194,6 +216,11 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
 		if (do_zerocopy && ret)
 			expected_completions++;
 	}
+	if (do_zerocopy && domain == PF_RDS) {
+		free(msg->msg_control);
+		msg->msg_control = NULL;
+		msg->msg_controllen = 0;
+	}
 
 	return true;
 }
@@ -220,7 +247,9 @@ static void do_sendmsg_corked(int fd, struct msghdr *msg)
 		msg->msg_iov[0].iov_len = payload_len + extra_len;
 		extra_len = 0;
 
-		do_sendmsg(fd, msg, do_zerocopy);
+		do_sendmsg(fd, msg, do_zerocopy,
+			   (cfg_dst_addr.ss_family == AF_INET ?
+			    PF_INET : PF_INET6));
 	}
 
 	do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
@@ -324,10 +353,17 @@ static bool do_recv_completion(int fd)
 	uint32_t hi, lo, range;
 	int ret, zerocopy;
 	char control[100];
+	uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES];
+	struct iovec iov;
 
 	msg.msg_control = control;
 	msg.msg_controllen = sizeof(control);
 
+	iov.iov_base = ckbuf;
+	iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0]));
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
 	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
 	if (ret == -1 && errno == EAGAIN)
 		return false;
@@ -346,36 +382,57 @@ static bool do_recv_completion(int fd)
 		      cm->cmsg_level, cm->cmsg_type);
 
 	serr = (void *) CMSG_DATA(cm);
-	if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
-		error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
-	if (serr->ee_errno != 0)
-		error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
 
-	hi = serr->ee_data;
-	lo = serr->ee_info;
-	range = hi - lo + 1;
+	switch (serr->ee_origin) {
+	case SO_EE_ORIGIN_ZEROCOPY: {
+		if (serr->ee_errno != 0)
+			error(1, 0, "serr: wrong error code: %u",
+			      serr->ee_errno);
+		hi = serr->ee_data;
+		lo = serr->ee_info;
+		range = hi - lo + 1;
+
+		/* Detect notification gaps. These should not happen often,
+		 * if at all.  Gaps can occur due to drops, reordering and
+		 * retransmissions.
+		 */
+		if (lo != next_completion)
+			fprintf(stderr, "gap: %u..%u does not append to %u\n",
+				lo, hi, next_completion);
+		next_completion = hi + 1;
+
+		zerocopy = !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
+		if (zerocopied == -1)
+			zerocopied = zerocopy;
+		else if (zerocopied != zerocopy) {
+			fprintf(stderr, "serr: inconsistent\n");
+			zerocopied = zerocopy;
+		}
+		if (cfg_verbose >= 2)
+			fprintf(stderr, "completed: %u (h=%u l=%u)\n",
+				range, hi, lo);
 
-	/* Detect notification gaps. These should not happen often, if at all.
-	 * Gaps can occur due to drops, reordering and retransmissions.
-	 */
-	if (lo != next_completion)
-		fprintf(stderr, "gap: %u..%u does not append to %u\n",
-			lo, hi, next_completion);
-	next_completion = hi + 1;
-
-	zerocopy = !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
-	if (zerocopied == -1)
-		zerocopied = zerocopy;
-	else if (zerocopied != zerocopy) {
-		fprintf(stderr, "serr: inconsistent\n");
-		zerocopied = zerocopy;
+		completions += range;
+		break;
+	}
+	case SO_EE_ORIGIN_ZCOOKIE: {
+		int ncookies, i;
+
+		if (serr->ee_errno != 0)
+			error(1, 0, "serr: wrong error code: %u",
+			      serr->ee_errno);
+		ncookies = serr->ee_data;
+		for (i = 0; i < ncookies; i++)
+			if (cfg_verbose >= 2)
+				fprintf(stderr, "%d\n", ckbuf[i]);
+		completions += ncookies;
+		zerocopied = 1;
+		break;
+	}
+	default:
+		error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
 	}
 
-	if (cfg_verbose >= 2)
-		fprintf(stderr, "completed: %u (h=%u l=%u)\n",
-			range, hi, lo);
-
-	completions += range;
 	return true;
 }
 
@@ -470,7 +527,7 @@ static void do_tx(int domain, int type, int protocol)
 		if (cfg_cork)
 			do_sendmsg_corked(fd, &msg);
 		else
-			do_sendmsg(fd, &msg, cfg_zerocopy);
+			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
 
 		while (!do_poll(fd, POLLOUT)) {
 			if (cfg_zerocopy)
-- 
1.7.1


* Re: [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-24 11:45 ` [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
@ 2018-01-25 14:44   ` Willem de Bruijn
  2018-01-25 15:35     ` Sowmini Varadhan
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-25 14:44 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 24, 2018 at 12:45 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> The existing model holds a reference from the rds_sock to the
> rds_message, but the rds_message does not itself hold a sock_put()
> on the rds_sock. Instead the m_rs field in the rds_message is
> assigned when the message is queued on the sock, and nulled when
> the message is dequeued from the sock.
>
> We want to be able to notify userspace when the rds_message
> is actually freed (from rds_message_purge(), after the refcounts
> to the rds_message go to 0). At the time that rds_message_purge()
> is called, the message is no longer on the rds_sock retransmit
> queue. Thus the explicit reference for the m_rs is needed to
> send a notification that will signal to userspace that
> it is now safe to free/reuse any pages that may have
> been pinned down for zerocopy.
>
> This patch manages the m_rs assignment in the rds_message with
> the necessary refcount book-keeping.

You may also be able to do the same as tcp zerocopy and
hold an sk reference on the notification skb.


* Re: [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-25 14:44   ` Willem de Bruijn
@ 2018-01-25 15:35     ` Sowmini Varadhan
  2018-01-28 13:51       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-25 15:35 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On (01/25/18 15:44), Willem de Bruijn wrote:
> > This patch manages the m_rs assignment in the rds_message with
> > the necessary refcount book-keeping.
> 
> You may alos be able to do the same as tcp zerocopy and
> hold an sk reference on the notification skb.

We tether the notification skb to the rds socket after the
refcount on the rds_message goes to zero, so we already have a
ref from the sk to the notification skb.

If we kept a refcount of all notification skbs (even the ones that
are not ready to be unpinned yet) on the sk, then we would have additional
complexity trying to figure out which skbs are ready for notification
at any point, so I am not sure it would make things simpler.

--Sowmini


* Re: [PATCH net-next 0/7] RDS zerocopy support
  2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
                   ` (6 preceding siblings ...)
  2018-01-24 11:46 ` [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
@ 2018-01-25 16:41 ` Santosh Shilimkar
  7 siblings, 0 replies; 23+ messages in thread
From: Santosh Shilimkar @ 2018-01-25 16:41 UTC (permalink / raw)
  To: Sowmini Varadhan, willemdebruijn.kernel, netdev; +Cc: davem, rds-devel

Hi Sowmini,

On 1/24/2018 3:45 AM, Sowmini Varadhan wrote:
> This patch series follows up on the RFC and subsequent review comments
> at https://patchwork.ozlabs.org/cover/862248/
> 
> Review comments addressed are
> - drop MSG_PEEK change for sk_error_queue
> - (patch4) batch of SO_EE_ORIGIN_MAX_ZCOOKIES (#defined to 8) is sent up
>    as part of the data in the error notification. The ancillary data in
>    with this notification specifies the number of cookies in ee_data,
>    with the ee_origin is set to SO_EE_ORIGIN_ZCOOKIE
> - (patch4, patch5) allocate the skb to be used for error notification
>    up-front (in rds_sendmsg()) so that we never have to fail due to skb
>    allocation failure in the callback routine.
> - other minor review fixes around refactoring code for the setsockopt
>    of ZEROCOPY, use iov_iter_npages()  etc.
> 
> This patch series also updates the selftests/net/msg_zerocopy.c to support
> PF_RDS sockets (both with and without zerocopy)
> 
The RDS changes look largely good, but I need some time to look at the
completion notification and send-side changes. Will try to provide
feedback in the next few days.

regards,
Santosh


* Re: [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-25 15:35     ` Sowmini Varadhan
@ 2018-01-28 13:51       ` Willem de Bruijn
  2018-01-28 16:18         ` Sowmini Varadhan
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 13:51 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Thu, Jan 25, 2018 at 4:35 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/25/18 15:44), Willem de Bruijn wrote:
>> > This patch manages the m_rs assignment in the rds_message with
>> > the necessary refcount book-keeping.
>>
>> You may alos be able to do the same as tcp zerocopy and
>> hold an sk reference on the notification skb.
>
> We tether the notification skb to the rds socket after the
> refcount on the rds_message goes to zero, so we already have a
> ref from the sk to the notification skb,
>
> If we kept a refcount of all notification skb's (even the ones that
> are not ready to be unpinned yet) on the sk, then we have additional
> complexity trying to figure out which skb's are ready for notification
> at any point, so not sure it would make things simpler..

I don't quite follow. Every notification skb is created when the page refcount
is increased. It persists until at least rds_rm_zerocopy_callback(), after the
data skb has been freed and the page refcount has been decreased.

In this callback, the skb is consumed if another skb is already queued on
the error queue; otherwise it is queued itself. It needs to hold a sock ref
until it can be queued.


* Re: [PATCH net-next 4/7] rds: support for zcopy completion notification
  2018-01-24 11:45 ` [PATCH net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
@ 2018-01-28 13:56   ` Willem de Bruijn
  2018-01-28 16:15     ` Sowmini Varadhan
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 13:56 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 24, 2018 at 12:45 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> RDS removes a datagram (rds_message) from the retransmit queue when
> an ACK is received. The ACK indicates that the receiver has queued
> the RDS datagram, so that the sender can safely forget the datagram.
> When all references to the rds_message are quiesced, rds_message_purge
> is called to release resources used by the rds_message
>
> If the datagram to be removed had pinned pages set up, add
> an entry to the rs->rs_znotify_queue so that the notifcation
> will be sent up via rds_rm_zerocopy_callback() when the
> rds_message is eventually freed by rds_message_purge.
>
> rds_rm_zerocopy_callback() attempts to batch the number of cookies
> sent with each notification  to a max of SO_EE_ORIGIN_MAX_ZCOOKIES.
> Each time a cookie is released by rds_message_purge(), the
> rs_znotify_queue is checked to see if the MAX_ZCOOKIES batch limit
> has been exceeded (in which case we send up a notification). If the
> limit has not been exceeded, the cookie is added to the rs_znotify_queue
> and a timer is set up

An alternative that does not require a timer is to batch on the sk
error queue itself, like tcp zerocopy. That queues the first notification
skb on the error queue without any notification latency.

Then, if a subsequent notification comes in while another is pending
with < MAX zcookies, it coalesces the new notification onto the pending
skb and consumes the other. For RDS notifications, the implementation
is an extra skb_put + uint32_t assignment.

Optionally, the socket can trigger another sk_error_report on each
new notification.
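
A rough sketch of the coalescing step described above (hypothetical helper,
not part of this series; locking details and tailroom limits would need care):

	/* Append a cookie to a pending zerocopy notification skb already on
	 * the error queue, if there is one with room; the caller queues a
	 * fresh notification skb when this returns false.
	 */
	static bool rds_zcookie_coalesce(struct sock *sk, u32 cookie)
	{
		struct sk_buff_head *q = &sk->sk_error_queue;
		struct sock_exterr_skb *serr;
		struct sk_buff *tail;
		unsigned long flags;
		bool done = false;

		spin_lock_irqsave(&q->lock, flags);
		tail = skb_peek_tail(q);
		if (tail) {
			serr = SKB_EXT_ERR(tail);
			if (serr->ee.ee_origin == SO_EE_ORIGIN_ZCOOKIE &&
			    serr->ee.ee_data < SO_EE_ORIGIN_MAX_ZCOOKIES &&
			    skb_tailroom(tail) >= sizeof(u32)) {
				/* the extra skb_put + u32 assignment */
				*(u32 *)skb_put(tail, sizeof(u32)) = cookie;
				serr->ee.ee_data++;
				done = true;
			}
		}
		spin_unlock_irqrestore(&q->lock, flags);
		return done;
	}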

> +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> +                                    struct rds_znotifier *znotifier,
> +                                    bool force)
> +{
> +       struct sock *sk = rds_rs_to_sk(rs);
> +       struct sk_buff *skb;
> +       struct sock_exterr_skb *serr;
> +       unsigned long flags;
> +       u32 *ptr;
> +       int ncookies = 0, i;
> +       struct rds_znotifier *znotif, *ztmp, *first;
> +       LIST_HEAD(tmp_list);
> +
> +       spin_lock_irqsave(&rs->rs_lock, flags);
> +       ncookies = rs->rs_ncookies;
> +       if (ncookies < SO_EE_ORIGIN_MAX_ZCOOKIES && !force) {
> +               if (znotifier) { /* add this cookie to the list and return */

can be checked before taking the lock.

More importantly, when is this ever NULL? This function is a callback
for a zerocopy struct of type znotifier. Is it doing double duty to flush
any outstanding if znotifier == NULL && force == true? If so, the first
condition probably never occurs unless force == true and thus the
second is redundant.

> +                       list_add_tail(&znotifier->z_list,
> +                                     &rs->rs_znotify_queue);
> +                       rs->rs_ncookies++;
> +               }
> +               spin_unlock_irqrestore(&rs->rs_lock, flags);
> +               return;
> +       }
> +       if (!ncookies) { /* timer finds a reaped list */
> +               spin_unlock_irqrestore(&rs->rs_lock, flags);
> +               return;
> +       }
> +       /* reap existing cookie list if we have hit the max, then add
> +        * new cookie to the list for next round of reaping.
> +        */
> +       list_splice(&rs->rs_znotify_queue, &tmp_list); /* reap now */
> +       INIT_LIST_HEAD(&rs->rs_znotify_queue);
> +       rs->rs_ncookies = 0;
> +       if (znotifier) { /* for next round */

This adds unnecessary notification latency to the delivery of the current
notification. The latest notification can be appended to tmp_list and
sent up immediately.

> +               list_add_tail(&znotifier->z_list, &rs->rs_znotify_queue);
> +               rs->rs_ncookies++;
> +       }
> +       spin_unlock_irqrestore(&rs->rs_lock, flags);
> +
> +       first = list_first_entry(&tmp_list, struct rds_znotifier, z_list);
> +       znotif = list_next_entry(first, z_list);
> +       list_del(&first->z_list);
> +
> +       skb = rds_skb_from_znotifier(first);
> +       ptr = skb_put(skb, ncookies * sizeof(u32));
> +       i = 0;
> +       ptr[i++] = first->z_cookie;
> +
> +       list_for_each_entry_safe(znotif, ztmp, &tmp_list, z_list) {
> +               list_del(&znotif->z_list);
> +               ptr[i++] = znotif->z_cookie;
> +               mm_unaccount_pinned_pages(&znotif->z_mmp);
> +               consume_skb(rds_skb_from_znotifier(znotif));
> +       }
> +       WARN_ON(!list_empty(&tmp_list));
> +
> +       serr = SKB_EXT_ERR(skb);
> +       serr->ee.ee_errno = 0;
> +       serr->ee.ee_origin = SO_EE_ORIGIN_ZCOOKIE;
> +       serr->ee.ee_data = ncookies;
> +       serr->ee.ee_info = 0;
> +       serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
> +
> +       if (sock_queue_err_skb(sk, skb))
> +               consume_skb(skb);
> +}
> +
> +void rs_zcopy_notify(struct timer_list *t)
> +{
> +       struct rds_sock *rs = from_timer(rs, t, rs_cookie_timer);
> +
> +       rds_rm_zerocopy_callback(rs, NULL, true);
> +}
> +
>  /*
>   * This relies on dma_map_sg() not touching sg[].page during merging.
>   */
>  static void rds_message_purge(struct rds_message *rm)
>  {
>         unsigned long i, flags;
> +       bool zcopy = false;
>
>         if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags)))
>                 return;
>
> +       spin_lock_irqsave(&rm->m_rs_lock, flags);
> +       if (rm->data.op_mmp_znotifier && rm->m_rs) {
> +               struct rds_sock *rs = rm->m_rs;
> +
> +               zcopy = true;
> +               rds_rm_zerocopy_callback(rs, rm->data.op_mmp_znotifier, false);
> +               rm->data.op_mmp_znotifier = NULL;
> +               (void)mod_timer(&rs->rs_cookie_timer, RDS_REAP_TIMEOUT);

Resetting the timeout on each queued notification causes unbounded
notification latency for previous notifications on the queue.

> +
> +               sock_put(rds_rs_to_sk(rs));
> +               rm->m_rs = NULL;

These two lines are now called only if znotifier is true, but
used to be called whenever rm->m_rs != NULL. Intentional?

> +       }
> +       spin_unlock_irqrestore(&rm->m_rs_lock, flags);
> +
>         for (i = 0; i < rm->data.op_nents; i++) {
>                 rdsdebug("putting data page %p\n", (void *)sg_page(&rm->data.op_sg[i]));
>                 /* XXX will have to put_page for page refs */
> -               __free_page(sg_page(&rm->data.op_sg[i]));
> +               if (!zcopy)
> +                       __free_page(sg_page(&rm->data.op_sg[i]));
> +               else
> +                       put_page(sg_page(&rm->data.op_sg[i]));
>         }
>         rm->data.op_nents = 0;
> -       spin_lock_irqsave(&rm->m_rs_lock, flags);
> -       if (rm->m_rs) {
> -               sock_put(rds_rs_to_sk(rm->m_rs));
> -               rm->m_rs = NULL;
> -       }
> -       spin_unlock_irqrestore(&rm->m_rs_lock, flags);
>
>         if (rm->rdma.op_active)
>                 rds_rdma_free_op(&rm->rdma);


* Re: [PATCH net-next 5/7] rds: zerocopy Tx support.
  2018-01-24 11:46 ` [PATCH net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
@ 2018-01-28 13:57   ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 13:57 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 24, 2018 at 12:46 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> If the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
> if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
> application pages sent down with rds_sendmsg() are pinned.
>
> The pinning uses the accounting infrastructure added by
> Commit a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
>
> The payload bytes in the message may not be modified for the
> duration that the message has been pinned. A multi-threaded
> application using this infrastructure may thus need to be notified
> about send-completion so that it can free/reuse the buffers
> passed to rds_sendmsg(). Notification of send-completion will
> identify each message-buffer by a cookie that the application
> must specify as ancillary data to rds_sendmsg().
> The ancillary data in this case has cmsg_level == SOL_RDS
> and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/uapi/linux/rds.h |    1 +
>  net/rds/message.c        |   48 +++++++++++++++++++++++++++++++++++++++++++++-
>  net/rds/rds.h            |    3 +-
>  net/rds/send.c           |   44 ++++++++++++++++++++++++++++++++++++-----
>  4 files changed, 88 insertions(+), 8 deletions(-)
>
> diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
> index e71d449..12e3bca 100644
> --- a/include/uapi/linux/rds.h
> +++ b/include/uapi/linux/rds.h
> @@ -103,6 +103,7 @@
>  #define RDS_CMSG_MASKED_ATOMIC_FADD    8
>  #define RDS_CMSG_MASKED_ATOMIC_CSWP    9
>  #define RDS_CMSG_RXPATH_LATENCY                11
> +#define        RDS_CMSG_ZCOPY_COOKIE           12
>
>  #define RDS_INFO_FIRST                 10000
>  #define RDS_INFO_COUNTERS              10000
> diff --git a/net/rds/message.c b/net/rds/message.c
> index 7ca968a..79b24db 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -362,12 +362,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in
>         return rm;
>  }
>
> -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
> +int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from,
> +                              bool zcopy)
>  {
>         unsigned long to_copy, nbytes;
>         unsigned long sg_off;
>         struct scatterlist *sg;
>         int ret = 0;
> +       int length = iov_iter_count(from);
>
>         rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));
>
> @@ -377,6 +379,50 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from)
>         sg = rm->data.op_sg;
>         sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. */
>
> +       if (zcopy) {
> +               int total_copied = 0;
> +               struct sk_buff *skb;
> +
> +               skb = alloc_skb(SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(u32),
> +                               GFP_KERNEL);
> +               if (!skb)
> +                       return -ENOMEM;
> +               rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb);
> +               memset(rm->data.op_mmp_znotifier, 0,
> +                      sizeof(*rm->data.op_mmp_znotifier));
> +               if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
> +                                           length)) {
> +                       consume_skb(skb);
> +                       rm->data.op_mmp_znotifier = NULL;
> +                       return -ENOMEM;
> +               }
> +               while (iov_iter_count(from)) {
> +                       struct page *pages;
> +                       size_t start;
> +                       ssize_t copied;
> +
> +                       copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
> +                                                   1, &start);
> +                       if (copied < 0) {
> +                               struct mmpin *mmp;
> +
> +                               mmp = &rm->data.op_mmp_znotifier->z_mmp;
> +                               mm_unaccount_pinned_pages(mmp);
> +                               consume_skb(skb);
> +                               rm->data.op_mmp_znotifier = NULL;
> +                               return -EFAULT;

also need to unmap pages pinned during previous iterations.
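Something along these lines, perhaps (untested sketch against the quoted
patch; the sg entries filled so far hold the page references to drop):

			if (copied < 0) {
				struct mmpin *mmp;
				int j;

				/* release pages pinned by earlier iterations */
				for (j = 0; j < rm->data.op_nents; j++)
					put_page(sg_page(&rm->data.op_sg[j]));
				mmp = &rm->data.op_mmp_znotifier->z_mmp;
				mm_unaccount_pinned_pages(mmp);
				consume_skb(skb);
				rm->data.op_mmp_znotifier = NULL;
				return -EFAULT;
			}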

> +                       }
> +                       total_copied += copied;
> +                       iov_iter_advance(from, copied);
> +                       length -= copied;
> +                       sg_set_page(sg, pages, copied, start);
> +                       rm->data.op_nents++;
> +                       sg++;
> +               }
> +               WARN_ON_ONCE(length != 0);
> +               return ret;
> +       } /* zcopy */
> +
>         while (iov_iter_count(from)) {
>                 if (!sg_page(sg)) {
>                         ret = rds_page_remainder_alloc(sg, iov_iter_count(from),
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index c375dd8..9dfc23c 100644

> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -875,12 +875,13 @@ static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn,
>   * rds_message is getting to be quite complicated, and we'd like to allocate
>   * it all in one go. This figures out how big it needs to be up front.
>   */
> -static int rds_rm_size(struct msghdr *msg, int data_len)
> +static int rds_rm_size(struct msghdr *msg, int data_len, int num_sgs)
>  {
>         struct cmsghdr *cmsg;
>         int size = 0;
>         int cmsg_groups = 0;
>         int retval;
> +       bool zcopy_cookie = false;
>
>         for_each_cmsghdr(cmsg, msg) {
>                 if (!CMSG_OK(msg, cmsg))
> @@ -899,6 +900,8 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
>
>                         break;
>
> +               case RDS_CMSG_ZCOPY_COOKIE:
> +                       zcopy_cookie = true;
>                 case RDS_CMSG_RDMA_DEST:
>                 case RDS_CMSG_RDMA_MAP:
>                         cmsg_groups |= 2;
> @@ -919,7 +922,10 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
>
>         }
>
> -       size += ceil(data_len, PAGE_SIZE) * sizeof(struct scatterlist);
> +       if ((msg->msg_flags & MSG_ZEROCOPY) && !zcopy_cookie)
> +               return -EINVAL;
> +
> +       size += num_sgs * sizeof(struct scatterlist);

argument data_len is no longer used

>
>         /* Ensure (DEST, MAP) are never used with (ARGS, ATOMIC) */
>         if (cmsg_groups == 3)
> @@ -928,6 +934,18 @@ static int rds_rm_size(struct msghdr *msg, int data_len)
>         return size;
>  }
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets
  2018-01-24 11:46 ` [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
@ 2018-01-28 13:58   ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 13:58 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 24, 2018 at 12:46 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Add support for basic PF_RDS client-server testing in msg_zerocopy
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  tools/testing/selftests/net/msg_zerocopy.c |   65 +++++++++++++++++++++++++++-
>  1 files changed, 64 insertions(+), 1 deletions(-)
>
> diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
> index e11fe84..7a5b353 100644
> --- a/tools/testing/selftests/net/msg_zerocopy.c
> +++ b/tools/testing/selftests/net/msg_zerocopy.c

> @@ -555,6 +571,40 @@ static void do_flush_datagram(int fd, int type)
>         bytes += cfg_payload_len;
>  }
>
> +
> +static void do_recvmsg(int fd)
> +{

Can PF_RDS/SOCK_SEQPACKET use one of the existing bytestream
or datagram flush methods?

> +       int ret, off = 0;
> +       char *buf;
> +       struct iovec iov;
> +       struct msghdr msg;
> +       struct sockaddr_storage din;
> +
> +       buf = calloc(cfg_payload_len, sizeof(char));
> +       iov.iov_base = buf;
> +       iov.iov_len = cfg_payload_len;
> +
> +       memset(&msg, 0, sizeof(msg));
> +       msg.msg_name = &din;
> +       msg.msg_namelen = sizeof(din);

Not read, so no need to configure. In that case a simpler
recv will do and is more concise than setting up recvmsg.
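e.g. (sketch, using the buffer already set up above):

	ret = recv(fd, buf, cfg_payload_len, MSG_TRUNC);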

> +       msg.msg_iov = &iov;
> +       msg.msg_iovlen = 1;
> +
> +       ret = recvmsg(fd, &msg, MSG_TRUNC);
> +
> +       if (ret == -1)
> +               error(1, errno, "recv");
> +       if (ret != cfg_payload_len)
> +               error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
> +
> +       if (memcmp(buf + off, payload, ret))
> +               error(1, 0, "recv: data mismatch");
> +
> +       free(buf);
> +       packets++;
> +       bytes += cfg_payload_len;
> +}
> +

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-01-24 11:46 ` [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
@ 2018-01-28 14:00   ` Willem de Bruijn
  2018-01-28 16:18     ` Sowmini Varadhan
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 14:00 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Wed, Jan 24, 2018 at 12:46 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> Send a cookie with sendmsg() on PF_RDS sockets, and process the
> returned batched cookies in do_recv_completion()
>
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  tools/testing/selftests/net/msg_zerocopy.c |  119 ++++++++++++++++++++-------
>  1 files changed, 88 insertions(+), 31 deletions(-)
>
> diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
> index 7a5b353..19d2b1a 100644
> --- a/tools/testing/selftests/net/msg_zerocopy.c
> +++ b/tools/testing/selftests/net/msg_zerocopy.c
> @@ -168,7 +168,26 @@ static int do_accept(int fd)
>         return fd;
>  }
>
> -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
> +static void add_zcopy_cookie(struct msghdr *msg)
> +{
> +       int olen = msg->msg_controllen;
> +       struct cmsghdr *cm;
> +       static uint32_t cookie;
> +
> +       msg->msg_controllen += CMSG_SPACE(sizeof(cookie));
> +       msg->msg_control = (struct cmsghdr *)realloc(msg->msg_control,
> +                                                    msg->msg_controllen);

Please just allocate ahead of time. And since cookie size is fixed
and small just define a local variable on the stack in do_sendmsg.

  char control[CMSG_SPACE(sizeof(uint32_t))];
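
i.e., roughly, in do_sendmsg (sketch; assumes the cookie counter is
visible there):

	char control[CMSG_SPACE(sizeof(uint32_t))] = {0};
	struct cmsghdr *cm;

	msg->msg_control = control;
	msg->msg_controllen = sizeof(control);
	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_len = CMSG_LEN(sizeof(cookie));
	cm->cmsg_level = SOL_RDS;
	cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));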

> +       if (!msg->msg_control)
> +               error(1, errno, "cannot allocate cmsghdr for cookie");
> +       cm = (void *)msg->msg_control + olen;
> +       cm->cmsg_len = CMSG_SPACE(sizeof(cookie));

cmsg_len should use CMSG_LEN(sizeof(cookie)), not CMSG_SPACE().

> +       cm->cmsg_level = SOL_RDS;
> +       cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
> +       ++cookie;
> +       memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));

cookie is not initialized

> @@ -346,36 +382,57 @@ static bool do_recv_completion(int fd)
>                       cm->cmsg_level, cm->cmsg_type);
>
>         serr = (void *) CMSG_DATA(cm);
> -       if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
> -               error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
> -       if (serr->ee_errno != 0)
> -               error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
>
> -       hi = serr->ee_data;
> -       lo = serr->ee_info;
> -       range = hi - lo + 1;
> +       switch (serr->ee_origin) {
> +       case SO_EE_ORIGIN_ZEROCOPY: {
> +               if (serr->ee_errno != 0)
> +                       error(1, 0, "serr: wrong error code: %u",
> +                             serr->ee_errno);
> +               hi = serr->ee_data;
> +               lo = serr->ee_info;
> +               range = hi - lo + 1;
> +
> +               /* Detect notification gaps. These should not happen often,
> +                * if at all.  Gaps can occur due to drops, reordering and
> +                * retransmissions.
> +                */
> +               if (lo != next_completion)
> +                       fprintf(stderr, "gap: %u..%u does not append to %u\n",
> +                               lo, hi, next_completion);
> +               next_completion = hi + 1;
> +
> +               zerocopy = !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
> +               if (zerocopied == -1)
> +                       zerocopied = zerocopy;
> +               else if (zerocopied != zerocopy) {
> +                       fprintf(stderr, "serr: inconsistent\n");
> +                       zerocopied = zerocopy;
> +               }
> +               if (cfg_verbose >= 2)
> +                       fprintf(stderr, "completed: %u (h=%u l=%u)\n",
> +                               range, hi, lo);
>
> -       /* Detect notification gaps. These should not happen often, if at all.
> -        * Gaps can occur due to drops, reordering and retransmissions.
> -        */
> -       if (lo != next_completion)
> -               fprintf(stderr, "gap: %u..%u does not append to %u\n",
> -                       lo, hi, next_completion);
> -       next_completion = hi + 1;
> -
> -       zerocopy = !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
> -       if (zerocopied == -1)
> -               zerocopied = zerocopy;
> -       else if (zerocopied != zerocopy) {
> -               fprintf(stderr, "serr: inconsistent\n");
> -               zerocopied = zerocopy;

Instead of indenting all this existing code please add a helper
do_recv_completion_zcookie, call that if SO_EE_ORIGIN_ZCOOKIE
and fall through to existing code otherwise.
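For example (sketch; ckbuf/ret as in the patch's recvmsg path):

	switch (serr->ee_origin) {
	case SO_EE_ORIGIN_ZCOOKIE:
		return do_recv_completion_zcookie(serr, ckbuf, ret);
	case SO_EE_ORIGIN_ZEROCOPY:
		break;	/* existing handling below */
	default:
		error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
	}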

> +               completions += range;
> +               break;
> +       }
> +       case SO_EE_ORIGIN_ZCOOKIE: {
> +               int ncookies, i;
> +
> +               if (serr->ee_errno != 0)
> +                       error(1, 0, "serr: wrong error code: %u",
> +                             serr->ee_errno);
> +               ncookies = serr->ee_data;

Verify ncookies <= MAX_..
Verify ret == ncookies * sizeof(uint32_t)
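e.g. (sketch):

	if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES)
		error(1, 0, "serr: too many zcookies: %d", ncookies);
	if (ret != ncookies * (int)sizeof(uint32_t))
		error(1, 0, "serr: unexpected zcookie length: %d", ret);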

> +               for (i = 0; i < ncookies; i++)
> +                       if (cfg_verbose >= 2)
> +                               fprintf(stderr, "%d\n", ckbuf[i]);
> +               completions += ncookies;
> +               zerocopied = 1;

Unused in this path

> +               break;
> +       }
> +       default:
> +               error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
>         }
>
> -       if (cfg_verbose >= 2)
> -               fprintf(stderr, "completed: %u (h=%u l=%u)\n",
> -                       range, hi, lo);
> -
> -       completions += range;
>         return true;
>  }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 4/7] rds: support for zcopy completion notification
  2018-01-28 13:56   ` Willem de Bruijn
@ 2018-01-28 16:15     ` Sowmini Varadhan
  2018-01-28 18:46       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-28 16:15 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar


thanks for taking the time to go through the code!

> 
> An alternative that does not require a timer is to batch on the sk
> error queue itself, like tcp zerocopy. That queues the first notification
> skb on the error queue without any notification latency.
> 
> Then, if a subsequent notification comes in while another is pending
> with < MAX zcookies, it coalesces the new notification onto the pending
> skb and consumes the other. For RDS notifications, the implementation
> is an extra skb_put + uint32_t assignment.

This is an interesting idea, let me give it a try.
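If I follow, the coalescing step would be roughly the below (untested
sketch; here "skb" is the new single-cookie notification, "cookie" is
the cookie it carries, and the pending skb's data is treated as an
array of u32 cookies):

	struct sk_buff_head *q = &sk->sk_error_queue;
	struct sk_buff *tail;
	unsigned long flags;

	spin_lock_irqsave(&q->lock, flags);
	tail = skb_peek_tail(q);
	if (tail && SKB_EXT_ERR(tail)->ee.ee_origin == SO_EE_ORIGIN_ZCOOKIE &&
	    SKB_EXT_ERR(tail)->ee.ee_data < SO_EE_ORIGIN_MAX_ZCOOKIES &&
	    skb_tailroom(tail) >= sizeof(u32)) {
		u32 *ptr = skb_put(tail, sizeof(u32));

		*ptr = cookie;				/* append to pending skb */
		SKB_EXT_ERR(tail)->ee.ee_data++;	/* bump cookie count */
		consume_skb(skb);			/* new skb no longer needed */
	} else {
		__skb_queue_tail(q, skb);		/* first pending notification */
	}
	spin_unlock_irqrestore(&q->lock, flags);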

> Optionally, the socket can trigger another sk_error_report on each
> new notification.

I was trying to avoid that- an upcall for each message is not a good
idea when you have high traffic and are able to free multiple rds_messages
at a  time.

> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
> > +                                    struct rds_znotifier *znotifier,
> > +                                    bool force)
> > +{
> > +       struct sock *sk = rds_rs_to_sk(rs);
> > +       struct sk_buff *skb;
> > +       struct sock_exterr_skb *serr;
> > +       unsigned long flags;
> > +       u32 *ptr;
> > +       int ncookies = 0, i;
> > +       struct rds_znotifier *znotif, *ztmp, *first;
> > +       LIST_HEAD(tmp_list);
> > +
> > +       spin_lock_irqsave(&rs->rs_lock, flags);
> > +       ncookies = rs->rs_ncookies;
> > +       if (ncookies < SO_EE_ORIGIN_MAX_ZCOOKIES && !force) {
> > +               if (znotifier) { /* add this cookie to the list and return */
> 
> can be checked before taking lock.
> 
> More importantly, when is this ever NULL? 

It is null when invoked from the timer callback (rs_zcopy_notify()
going off because we haven't had any traffic for the expiration interval,
so we want to send out pending notifications, but don't have any znotifier
in this case). But you are right in that:

> This function is a callback
> for a zerocopy struct of type znotifier. Is it doing double duty to flush
> any outstanding if znotifier == NULL && force == true? If so, the first
> condition probably never occurs unless force == true and thus the
> second is redundant.

yes, force can simply be !znotifier. 

I don't quite follow the "can be checked before taking the lock"
comment, though - the lock is needed to make sure we atomically do
the list's "add new entry and potentially flush" operation.
The check is for the "potentially" part of that operation, so
I'm not seeing how it would help to move it out of the lock.

having said  all that, I like the earlier suggestion of just 
batching on the error_queue itself. If that works out without any
issues, all of this stuff may not be needed, so let me give that
a shot first.

> This adds unnecessary notification latency to delivery of current
> notification. The latest notification can be appended to tmp_list and
> sent up immediately.

true, if it fits within the MAX cookies limit.

> > +               (void)mod_timer(&rs->rs_cookie_timer, RDS_REAP_TIMEOUT);
> 
> Resetting timeout on each queued notification causes unbounded
> notification latency for previous notifications on the queue.

I'm not sure I get that comment. RDS_REAP_TIMEOUT is the upper bound
for sending notification. If, in the interim, we get back a TCP
ack that lets us reap a bunch of messages, we'd go and flush the
queue anyway, so the RDS_REAP_TIMEOUT will not matter. Can you 
elaborate on your concern?


> > +
> > +               sock_put(rds_rs_to_sk(rs));
> > +               rm->m_rs = NULL;
> 
> These two lines are now called only if znotifier is true, but
> used to be called whenever rm->m_rs != NULL. Intentional?

good catch, no that's a bug. Thanks for flagging.

Also, ACK to all the comments for patch 5/7. Will fix for next round.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-28 13:51       ` Willem de Bruijn
@ 2018-01-28 16:18         ` Sowmini Varadhan
  2018-01-28 18:54           ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-28 16:18 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On (01/28/18 14:51), Willem de Bruijn wrote:
> > On (01/25/18 15:44), Willem de Bruijn wrote:
     ;
> >> You may also be able to do the same as tcp zerocopy and
> >> hold an sk reference on the notification skb.
     ;
> I don't quite follow. Every notification skb is created when pages refcount
> is increased. It persists until at least rds_rm_zerocopy_callback, after data
> skb has been freed and pages refcount has been decreased.
> 
> In this callback, skb is consumed if another skb is already queued on
> the error queue, otherwise it is queued itself. It needs to hold a sock ref
> until it can be queued.

maybe I did not follow the original suggestion- were you 
suggesting that I hold a pointer to the sk from e.g., the skb->cb
itself? I don't know that it would make things simpler,
whereas having the pointer and refcount in the rds_message itself,
and tracking this independently of whether/not zcopy was used, seems
like a more consistent data-structure model, so I'd like to leave
this as is.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-01-28 14:00   ` Willem de Bruijn
@ 2018-01-28 16:18     ` Sowmini Varadhan
  2018-01-28 18:39       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-28 16:18 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar


Re-ordering review comments for selftests a bit:

> > +       cm->cmsg_level = SOL_RDS;
> > +       cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
> > +       ++cookie;
> > +       memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
> cookie is not initialized

it's a static uint32_t in the function. It will get initialized to 0.
But the whole test program is rather simplistic, since it doesn't
actually verify the value of the cookies (hopefully my pending
updates to rds-stress will provide better testing/examples in this
space, because I actually have something multi-threaded, that must
necessarily use a truly random value for the cookie, and must really
make sure that the returned value from the kernel matches some
cookie that is pending)

> > +static void add_zcopy_cookie(struct msghdr *msg)
> Please just allocate ahead of time. And since cookie size is fixed
> and small just define a local variable on the stack in do_sendmsg.

Ok! I was just trying to make this as future-proof as possible, and 
provide useful example code for the next cut/paste developer to use.

> > +       cm->cmsg_len = CMSG_SPACE(sizeof(cookie));
> CMSG_LEN
  :
> Instead of indenting all this existing code please add a helper
> do_recv_completion_zcookie, call that if SO_EE_ORIGIN_ZCOOKIE
> and fall through to existing code otherwise.
  :
> Verify ncookies <= MAX_..
> Verify ret == ncookies * sizeof(uint32_t)

> > +               zerocopied = 1;
> Unused in this path

Ok, will fix all these.

Also, coalescing some related comments for patch 6/7:

> +       memset(&msg, 0, sizeof(msg));
> +       msg.msg_name = &din;
> +       msg.msg_namelen = sizeof(din);

Not read, so no need to configure. In that case a simpler
recv will do and is more concise than setting up recvmsg.

Real RDS applications actually have to use recvmsg, since they 
can get cmsg info about other things like congestion notification 
so let's leave this as recvmsg - it does no harm, and provides
test coverage for the recvmsg case as well. 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-01-28 16:18     ` Sowmini Varadhan
@ 2018-01-28 18:39       ` Willem de Bruijn
  2018-01-28 19:57         ` Sowmini Varadhan
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 18:39 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Sun, Jan 28, 2018 at 5:18 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
>
> Re-ordering review comments for selftests a bit..
>
>> > +       cm->cmsg_level = SOL_RDS;
>> > +       cm->cmsg_type = RDS_CMSG_ZCOPY_COOKIE;
>> > +       ++cookie;
>> > +       memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
>> cookie is not initialized
>
> it's a static uint32_t in the function. It will get initialized to 0.

Oh right. I missed that.

> But the whole test program is rather simplistic, since it doesn't
> actually verify the value of the cookies (hopefully my pending
> updates to rds-stress will provide better testing/examples in this
> space, because I actually have something multi-threaded, that must
> necessarily use a truly random value for the cookie, and must really
> make sure that the returned value from the kernel matches some
> cookie that is pending)

It might be nice to at least increment the variable on each
successful send. The test is single threaded anyway. And
then we can test that the returned values are in the defined
range.

> Also, coalescing some related comments for patch 6/7:
>
>> +       memset(&msg, 0, sizeof(msg));
>> +       msg.msg_name = &din;
>> +       msg.msg_namelen = sizeof(din);
>
> Not read, so no need to configure. In that case a simpler
> recv will do and is more concise than setting up recvmsg.
>
> Real RDS applications actually have to use recvmsg, since they
> can get cmsg info about other things like congestion notification
> so let's leave this as recvmsg - it does no harm, and provides
> test coverage for the recvmsg case as well.

Ah, okay. Yes, let's keep it as is, then.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 4/7] rds: support for zcopy completion notification
  2018-01-28 16:15     ` Sowmini Varadhan
@ 2018-01-28 18:46       ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 18:46 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Sun, Jan 28, 2018 at 5:15 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
>
> thanks for taking the time to go through the code!
>
>>
>> An alternative that does not require a timer is to batch on the sk
>> error queue itself, like tcp zerocopy. That queues the first notification
>> skb on the error queue without any notification latency.
>>
>> Then, if a subsequent notification comes in while another is pending
>> with < MAX zcookies, it coalesces the new notification onto the pending
>> skb and consumes the other. For RDS notifications, the implementation
>> is an extra skb_put + uint32_t assignment.
>
> This is an interesting idea, let me give it a try.
>
>> Optionally, the socket can trigger another sk_error_report on each
>> new notification.
>
> I was trying to avoid that- an upcall for each message is not a good
> idea when you have high traffic and are able to free multiple rds_messages
> at a  time.

Agreed. It was only a suggestion if that would be a reason
for you to not try the above idea.

>> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs,
>> > +                                    struct rds_znotifier *znotifier,
>> > +                                    bool force)
>> > +{
>> > +       struct sock *sk = rds_rs_to_sk(rs);
>> > +       struct sk_buff *skb;
>> > +       struct sock_exterr_skb *serr;
>> > +       unsigned long flags;
>> > +       u32 *ptr;
>> > +       int ncookies = 0, i;
>> > +       struct rds_znotifier *znotif, *ztmp, *first;
>> > +       LIST_HEAD(tmp_list);
>> > +
>> > +       spin_lock_irqsave(&rs->rs_lock, flags);
>> > +       ncookies = rs->rs_ncookies;
>> > +       if (ncookies < SO_EE_ORIGIN_MAX_ZCOOKIES && !force) {
>> > +               if (znotifier) { /* add this cookie to the list and return */
>>
>> can be checked before taking lock.
>>
>> More importantly, when is this ever NULL?
>
> It is null when invoked from the timer callback (rs_zcopy_notify()
> going off because we haven't had any traffic for the expiration interval,
> so we want to send out pending notifications, but don't have any znotifier
> in this case). But you are right in that:
>
>> This function is a callback
>> for a zerocopy struct of type znotifier. Is it doing double duty to flush
>> any outstanding if znotifier == NULL && force == true? If so, the first
>> condition probably never occurs unless force == true and thus the
>> second is redundant.
>
> yes, force can simply be !znotifier.
>
> I don't quite follow the "can be checked before taking the lock"
> comment, though - the lock is needed to make sure we atomically do
> the list's "add new entry and potentially flush" operation.
> The check is for the "potentially" part of that operation, so
> I'm not seeing how it would help to move it out of the lock.
>
> having said  all that, I like the earlier suggestion of just
> batching on the error_queue itself. If that works out without any
> issues, all of this stuff may not be needed, so let me give that
> a shot first.

Sounds great!

>
>> This adds unnecessary notification latency to delivery of current
>> notification. The latest notification can be appended to tmp_list and
>> sent up immediately.
>
> true, if it fits within the MAX cookies limit.
>
>> > +               (void)mod_timer(&rs->rs_cookie_timer, RDS_REAP_TIMEOUT);
>>
>> Resetting timeout on each queued notification causes unbounded
>> notification latency for previous notifications on the queue.
>
> I'm not sure I get that comment. RDS_REAP_TIMEOUT is the upper bound
> for sending notification. If, in the interim, we get back a TCP
> ack that lets us reap a bunch of messages, we'd go and flush the
> queue anyway, so the RDS_REAP_TIMEOUT will not matter. Can you
> elaborate on your concern?

I meant that if packet rate is low enough to require the timer
to fire, then by resetting the timer on each notification, the maximum
timeout for the first enqueued notification is the max cookie limit
* timeout, as opposed to timeout if set once and never modified.

Some physical device drivers do the same for their interrupt
moderation. You really want the configured timeout to be the
upper bound.
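e.g., arming the timer only when one is not already pending keeps the
configured value as the bound (sketch; assumes RDS_REAP_TIMEOUT is a
relative jiffies interval):

	if (!timer_pending(&rs->rs_cookie_timer))
		mod_timer(&rs->rs_cookie_timer, jiffies + RDS_REAP_TIMEOUT);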

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock
  2018-01-28 16:18         ` Sowmini Varadhan
@ 2018-01-28 18:54           ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2018-01-28 18:54 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On Sun, Jan 28, 2018 at 5:18 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (01/28/18 14:51), Willem de Bruijn wrote:
>> > On (01/25/18 15:44), Willem de Bruijn wrote:
>      ;
>> >> You may also be able to do the same as tcp zerocopy and
>> >> hold an sk reference on the notification skb.
>      ;
>> I don't quite follow. Every notification skb is created when pages refcount
>> is increased. It persists until at least rds_rm_zerocopy_callback, after data
>> skb has been freed and pages refcount has been decreased.
>>
>> In this callback, skb is consumed if another skb is already queued on
>> the error queue, otherwise it is queued itself. It needs to hold a sock ref
>> until it can be queued.
>
> maybe I did not follow the original suggestion- were you
> suggesting that I hold a pointer to the sk from e.g., the skb->cb
> itself?

Yes, I mean associating the notification skb that is eventually
queued onto the error queue with the socket. For tcp zerocopy,
this happens implicitly in sock_omalloc.
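Roughly (sketch):

	/* charge the notification skb to the socket's option memory;
	 * skb->sk is set so the skb stays tied to sk until it is freed
	 */
	skb = sock_omalloc(sk, 0, GFP_KERNEL);
	if (!skb)
		return NULL;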

> I don't know that it would make things simpler,
> whereas having the pointer and refcount in the rds_message itself,
> and tracking this independently of whether/not zcopy was used, seems
> like a more consistent data-structure model, so I'd like to leave
> this as is.

Sounds good. You know the rds internals a lot better than I do,
so are the better judge on whether to choose that or sock_omalloc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case
  2018-01-28 18:39       ` Willem de Bruijn
@ 2018-01-28 19:57         ` Sowmini Varadhan
  0 siblings, 0 replies; 23+ messages in thread
From: Sowmini Varadhan @ 2018-01-28 19:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, rds-devel, santosh.shilimkar

On (01/28/18 19:39), Willem de Bruijn wrote:
> > But the whole test program is rather simplistic, since it doesn't
> > actually verify the value of the cookies (hopefully my pending
    :
> It might be nice to at least increment the variable on each
> successful send. The test is single threaded anyway. And
> then we can test that the returned values are in the defined
> range.

Yeah, one thought that went through my head was that since
I now pick the cookies as consecutive numbers (1..N) anyway,
we could do a very simple-minded checksum on recv_completion
notification to check that the sum of all the cookie values
returned is actually N * (N+1)/2 - and flag errors/warnings
if it is not.
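Something like (sketch; names invented):

	/* cookies were sent as 1..N, so on completion their sum must be
	 * N * (N + 1) / 2
	 */
	if (cookie_sum != (uint64_t)nr_sent * (nr_sent + 1) / 2)
		fprintf(stderr, "zcookie checksum mismatch\n");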

This would be a test enhancement I could do later; first let
me try to avoid having all this complex timer-triggered code.

--Sowmini

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-01-28 19:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2018-01-24 11:45 [PATCH net-next 0/7] RDS zerocopy support Sowmini Varadhan
2018-01-24 11:45 ` [PATCH net-next 1/7] skbuff: export mm_[un]account_pinned_pages for other modules Sowmini Varadhan
2018-01-24 11:45 ` [PATCH net-next 2/7] rds: hold a sock ref from rds_message to the rds_sock Sowmini Varadhan
2018-01-25 14:44   ` Willem de Bruijn
2018-01-25 15:35     ` Sowmini Varadhan
2018-01-28 13:51       ` Willem de Bruijn
2018-01-28 16:18         ` Sowmini Varadhan
2018-01-28 18:54           ` Willem de Bruijn
2018-01-24 11:45 ` [PATCH net-next 3/7] sock: permit SO_ZEROCOPY on PF_RDS socket Sowmini Varadhan
2018-01-24 11:45 ` [PATCH net-next 4/7] rds: support for zcopy completion notification Sowmini Varadhan
2018-01-28 13:56   ` Willem de Bruijn
2018-01-28 16:15     ` Sowmini Varadhan
2018-01-28 18:46       ` Willem de Bruijn
2018-01-24 11:46 ` [PATCH net-next 5/7] rds: zerocopy Tx support Sowmini Varadhan
2018-01-28 13:57   ` Willem de Bruijn
2018-01-24 11:46 ` [PATCH net-next 6/7] selftests/net: add support for PF_RDS sockets Sowmini Varadhan
2018-01-28 13:58   ` Willem de Bruijn
2018-01-24 11:46 ` [PATCH net-next 7/7] selftests/net: add zerocopy support for PF_RDS test case Sowmini Varadhan
2018-01-28 14:00   ` Willem de Bruijn
2018-01-28 16:18     ` Sowmini Varadhan
2018-01-28 18:39       ` Willem de Bruijn
2018-01-28 19:57         ` Sowmini Varadhan
2018-01-25 16:41 ` [PATCH net-next 0/7] RDS zerocopy support Santosh Shilimkar
